SF signalFrameUpdate HWASan UAF analysis

A fairly typical case of memory corruption caused by a concurrency bug.

BUGOS2-355198

Concurrency analysis

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'Xiaomi/chenfeng/chenfeng:15/AQ3A.240912.001/OS2.0.250226.1.VNJCNXM.STABLE-TEST.hwasan:user/test-keys'​
Revision: '0'​
ABI: 'arm64'​
Timestamp: 2025-02-28 09:02:39.709798311+0800​
Process uptime: 2240s​
ZygotePid: 1420​
Cmdline: /system/bin/surfaceflinger​
pid: 1887, tid: 2103, name: binder:1887_1 >>> /system/bin/surfaceflinger <<<
uid: 1000​
tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)​
pac_enabled_keys: 000000000000000f (PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY, PR_PAC_APDBKEY)​
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------​
Abort message: '==1887==ERROR: HWAddressSanitizer: tag-mismatch on address 0x003935d761f0 at pc 0x006f0c56ce54​
WRITE of size 8 at 0x003935d761f0 tags: f7/3f (ptr/mem) in thread T1​
 #0 0x6f0c56ce54 (/system_ext/lib64/libmisurfaceflinger.so+0xabe54) (BuildId: a9a032276f1bbfb0e786249c8bf2f26b)​
 android::FrameWaiter::signalFrameUpdate(android::sp<android::Layer>) at vendor/xiaomi/frameworks/native/services/surfaceflinger/FrameWaiter.cpp:241​

 #1 0x6f0c52a198 (/system_ext/lib64/libmisurfaceflinger.so+0x69198) (BuildId: a9a032276f1bbfb0e786249c8bf2f26b)​
 android::MiSurfaceFlingerImpl::signalFrameUpdate(android::sp<android::Layer>) at vendor/xiaomi/frameworks/native/services/surfaceflinger/MiSurfaceFlingerImpl.cpp:7050 (discriminator 2)​

 #2 0x6ff1674e74 (/system_ext/lib64/libsurfaceflinger.so+0x5aae74) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 #3 0x6ff15e3474 (/system_ext/lib64/libsurfaceflinger.so+0x519474) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 #4 0x6fc895f7e0 (/system/lib64/libgui.so+0x13f7e0) (BuildId: fbeda03ae8857a44299576d917e8ed65)​
 #5 0x6ff15f95a4 (/system_ext/lib64/libsurfaceflinger.so+0x52f5a4) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 #6 0x6fbe2dfa24 (/system/lib64/libbinder.so+0x76a24) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #7 0x6fbe2c103c (/system/lib64/libbinder.so+0x5803c) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #8 0x6fbe2c08c4 (/system/lib64/libbinder.so+0x578c4) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #9 0x6fbe2c174c (/system/lib64/libbinder.so+0x5874c) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #10 0x6fbe2d0438 (/system/lib64/libbinder.so+0x67438) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #11 0x6fdfd283c4 (/system/lib64/libutils.so+0x133c4) (BuildId: cd3c0f3d02af113ec0485a7fb8d7ce83)​
 #12 0x6fd9c89ecc (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x81ecc) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)​
 #13 0x6fd9c758ac (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x6d8ac) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)​
[0x003935d761e0,0x003935d76200) is a small unallocated heap chunk; size: 32 offset: 16​

Cause: heap-buffer-overflow​
0x003935d761f0 is located 3628 bytes after a 4-byte region [0x003935d753c0,0x003935d753c4)​

The allocation stack below has nothing to do with the access stack above.
allocated by thread T42 here:​
 #0 0x6fc79a3dec (/apex/com.android.runtime/lib64/bionic/libclang_rt.hwasan-aarch64-android.so+0x28dec) (BuildId: bd2b4326ea0cac4ac0ec1712874405a96a9f4930)​
 #1 0x6fd9c5a470 (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x52470) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)​
 #2 0x6fc896d538 (/system/lib64/libgui.so+0x14d538) (BuildId: fbeda03ae8857a44299576d917e8ed65)​
 (inlined by) std::__1::unordered_map<unsigned int, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>>>::operator=[abi:nn180000](std::__1::unordered_map<unsigned int, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>>> const&) at prebuilts/clang/host/linux-x86/clang-r522817/include/c++/v1/unordered_map:1184​
 (inlined by) android::gui::LayerMetadata::operator=(android::gui::LayerMetadata const&) at frameworks/native/libs/gui/LayerMetadata.cpp:87​

 #3 0x6ff14d21f4 (/system_ext/lib64/libsurfaceflinger.so+0x4081f4) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 android::Layer::Layer(android::surfaceflinger::LayerCreationArgs const&) at frameworks/native/services/surfaceflinger/Layer.cpp:237​

 #4 0x6ff164e150 (/system_ext/lib64/libsurfaceflinger.so+0x584150) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 #5 0x6ff15eb04c (/system_ext/lib64/libsurfaceflinger.so+0x52104c) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 #6 0x6ff15ea1e4 (/system_ext/lib64/libsurfaceflinger.so+0x5201e4) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 #7 0x6ff13e041c (/system_ext/lib64/libsurfaceflinger.so+0x31641c) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)​
 #8 0x5835d3bd88 (/system/bin/surfaceflinger+0x97d88) (BuildId: d944549c90f39dd839d6921639f7e976)​
 #9 0x6fbe2dfa24 (/system/lib64/libbinder.so+0x76a24) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #10 0x6fbe2c103c (/system/lib64/libbinder.so+0x5803c) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #11 0x6fbe2c08c4 (/system/lib64/libbinder.so+0x578c4) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #12 0x6fbe2c174c (/system/lib64/libbinder.so+0x5874c) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #13 0x6fbe2d0438 (/system/lib64/libbinder.so+0x67438) (BuildId: 32e47bbbcbedcf105b457ea018604686)​
 #14 0x6fdfd283c4 (/system/lib64/libutils.so+0x133c4) (BuildId: cd3c0f3d02af113ec0485a7fb8d7ce83)​
 #15 0x6fd9c89ecc (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x81ecc) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)​
 #16 0x6fd9c758ac (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x6d8ac) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)​
Memory tags around the buggy address (one tag corresponds to 16 bytes):​
 0x003935d75900: 20 20 8e 08 f0 f0 92 92 01 08 aa aa c7 c7 04 be​
 0x003935d75a00: 32 32 c5 db b1 39 04 4d 93 93 05 05 45 45 f8 f8​
 0x003935d75b00: cb cb 96 96 f2 f2 85 44 c3 e7 b1 b1 04 db 90 e5​
 0x003935d75c00: 12 12 d7 d7 f3 7d c0 c0 90 e7 fe fe 84 e8 5d 2d​
 0x003935d75d00: c7 c7 21 21 04 2e 55 46 a2 a2 35 82 52 52 3d 3d​
 0x003935d75e00: 47 47 9f 9f f1 49 50 d8 2e 2e c8 08 ab 98 c4 08​
 0x003935d75f00: b0 33 6c 4e c8 c8 32 32 06 08 cc e4 da da de 63​
 0x003935d76000: bb bb 37 37 ca 20 50 50 47 2f 49 49 5e 08 8f 68​
=>0x003935d76100: 4d 4d 83 83 2e 2e 58 90 9a 9a a6 a6 31 51 3f [3f]​
 0x003935d76200: f1 f1 15 15 74 74 4d 4d c2 08 8d 83 84 84 58 58​
 0x003935d76300: f3 08 1f 1f e5 e5 7f 2a 03 20 e8 e8 d9 97 c9 c9​
 0x003935d76400: 2d 2d 67 08 65 65 f0 2d 43 43 c5 c5 f3 08 a6 70​
 0x003935d76500: 3f 08 26 26 ba ba 2a 2a e2 e2 9d 08 04 e2 04 de​
 0x003935d76600: b5 fc 9e 9e 5f 5f 36 36 91 08 6b 6b dd dd 2a 2a​
 0x003935d76700: 08 68 ff ff 54 fa 3e 08 80 2b 2b 2b e4 e4 e9 e9​
 0x003935d76800: a9 6c 04 e1 da da d2 d2 fb fb d7 08 1a 1a fa 08​
 0x003935d76900: ae 08 5f 5f 9b 8b bd 74 04 db d1 d1 d6 84 91 d8​

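HWASan then prints the real tags of the short granules. In this case the accessed granule has no partial tail, so it is shown as two dots: its shadow memory holds an ordinary random tag, not a short-granule size.
Take address 0x003935d760d0, shown as 5e below. That granule is a short granule whose real tag is 5e, and its shadow memory holds 0x08, meaning only the first 8 bytes of the granule may be accessed and the last 8 may not.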
Tags for short granules around the buggy address (one tag corresponds to 16 bytes):​
 0x003935d76000: .. .. .. .. .. .. .. .. .. .. .. .. .. 5e .. ..​
=>0x003935d76100: .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. [..]​
 0x003935d76200: .. .. .. .. .. .. .. .. .. c2 .. .. .. .. .. ..​
See https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html#short-granules for a description of short granule tags​
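
The two tag dumps can be read with a little arithmetic. The sketch below is only a reading aid, not HWASan code: the helper names are invented for this note, and the constants (16 bytes per tag, 16 tags per row) are taken from the dump headers above. It assumes C++17.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for reading the dump; the names are illustrative only.
constexpr uint64_t kGranule = 16;     // one tag covers 16 bytes (from the dump header)
constexpr uint64_t kTagsPerRow = 16;  // each dump row prints 16 tags, i.e. 256 bytes

// Start address of the dump row that contains addr.
constexpr uint64_t rowBase(uint64_t addr) {
    return addr & ~(kGranule * kTagsPerRow - 1);
}

// Column (0..15) of addr's granule within its row.
constexpr uint64_t rowColumn(uint64_t addr) {
    return (addr - rowBase(addr)) / kGranule;
}

// A shadow value in 1..15 marks a short granule: only that many leading
// bytes are valid, and the real tag is printed in the second dump.
constexpr bool isShortGranule(uint8_t shadow) {
    return shadow >= 1 && shadow < kGranule;
}

// Number of accessible bytes in a short granule.
inline unsigned validBytes(uint8_t shadow) {
    assert(isShortGranule(shadow));
    return shadow;
}

// Faulting address: row 0x003935d76100, column 15, the bracketed [3f].
static_assert(rowBase(0x003935d761f0) == 0x003935d76100);
static_assert(rowColumn(0x003935d761f0) == 15);
// Short-granule example: row 0x003935d76000, column 13 (shadow 08, real tag 5e).
static_assert(rowColumn(0x003935d760d0) == 13);
static_assert(!isShortGranule(0x3f));
```

For the short-granule example, validBytes(0x08) gives 8, matching the "only the first 8 bytes" reading above.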
 

What is odd about this report is that it was classified as heap-buffer-overflow. It looks much more like a UAF, because the allocation stack does not belong to the memory that was actually accessed.

The memory actually accessed is a FrameWaitStat instance:

void FrameWaiter::signalFrameUpdate(sp<Layer> layer) {
    CLOSEABLE_ATRACE_CALL(!mEnableTrace);
    if (!FrameWaitStat::sFeatureEnable) {
        return;
    }
    // mpendingToSignalLayers.push_back(layer);
    if ((strstr(layer->getDebugName(), sVipLayerName) != NULL) && (strstr(layer->getDebugName(), "SurfaceView[") != NULL) && (strstr(layer->getDebugName(), "BLAST") != NULL)) {
        if (mLayerFE == nullptr) {
            return;
        }
        void *ptr = mLayerFE->getFrameWaitStat();
        if (ptr != nullptr) {
            FrameWaitStat *stat = (FrameWaitStat *)ptr;
            stat->mSignalFrameTime = systemTime(); <<<<<<<<<< faulting line
            // MITRACE_FORMAT("lastTime:%ld", stat->getLastFrameTime())
            if (stat->getEnableWaitFrame()) {
                AutoMutex _l(mUpdateSignalMutex);
                mLayerFrameUpdated = true;
            }
        }
    }
}

Definition of FrameWaitStat:

//services/surfaceflinger/FrameWaitStat.h
class FrameWaitStat {
private:
    bool mWaitFrame = false;
    nsecs_t mLastFrameTime = 0;

public:
    static bool sFeatureEnable;
    nsecs_t mSignalFrameTime = 0;
...
};

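Leaving out the static member: nsecs_t is a 64-bit integer, i.e. 8 bytes. With 8-byte alignment the bool also occupies 8 bytes (1 byte plus 7 bytes of padding), so the three members shown add up to 24 bytes. That puts mSignalFrameTime at offset 16, which is consistent with the "size: 32 offset: 16" chunk description in the report.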
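
The layout arithmetic can be checked at compile time. This is a minimal sketch under assumptions: FrameWaitStatLayout is a stand-in containing only the three non-static members shown above (the real class elides further members with "..."), a 64-bit ABI is assumed, and the rounding to 16-byte granules reflects how the 32-byte chunk size in the report is interpreted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using nsecs_t = int64_t;  // same underlying type as Android's nsecs_t

// Stand-in for illustration only; not the real FrameWaitStat.
struct FrameWaitStatLayout {
    bool mWaitFrame = false;       // offset 0, followed by 7 bytes of padding
    nsecs_t mLastFrameTime = 0;    // offset 8
    nsecs_t mSignalFrameTime = 0;  // offset 16
};

static_assert(sizeof(FrameWaitStatLayout) == 24, "1 + 7 (padding) + 8 + 8");
static_assert(offsetof(FrameWaitStatLayout, mSignalFrameTime) == 16,
              "the written field sits 16 bytes into the object");

// Round a request up to whole 16-byte granules.
constexpr size_t roundUpToGranule(size_t n) {
    return (n + 15) & ~static_cast<size_t>(15);
}
static_assert(roundUpToGranule(sizeof(FrameWaitStatLayout)) == 32,
              "a 24-byte object occupies a 32-byte chunk");

// Offset of the faulting address inside the chunk named in the report.
inline uint64_t offsetInChunk(uint64_t addr, uint64_t chunkStart) {
    assert(addr >= chunkStart);
    return addr - chunkStart;
}
```

With the addresses from the report, offsetInChunk(0x003935d761f0, 0x003935d761e0) is 16, the same as the offset of mSignalFrameTime in the stand-in.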

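Now look at the allocation stack.

What the allocation stack allocates is a LayerMetadata:

//services/surfaceflinger/Layer.cpp
Layer::Layer(const surfaceflinger::LayerCreationArgs& args)... {
    ...
    mDrawingState.metadata = args.metadata;

Assigning to mDrawingState.metadata invokes the copy assignment operator:

//libs/gui/LayerMetadata.cpp
LayerMetadata& LayerMetadata::operator=(const LayerMetadata& other) {
    mMap = other.mMap;
    return *this;
}

LayerMetadata has a single member, an unordered_map whose key is uint32_t and whose value is vector<uint8_t>:

struct LayerMetadata : public Parcelable {
    std::unordered_map<uint32_t, std::vector<uint8_t>> mMap;
...
}

So the allocation site has nothing to do with the access site.

This suggests the most likely sequence: the memory that held the FrameWaitStat was freed and then handed to LayerMetadata. The faulting thread, unaware of this, accessed the memory again, and HWASan reported the tag mismatch.

Looking at where the FrameWaitStat object is accessed, the sites that create, modify and delete it all hold mLayerMutex: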

//services/surfaceflinger/FrameWaiter.cpp
void FrameWaiter::ensureOutputLayer(sp<compositionengine::LayerFE> &layerFE) {
    // CLOSEABLE_ATRACE_CALL(!mEnableTrace);("FrameWaiter::ensureOutputLayer");
    if (!FrameWaitStat::sFeatureEnable) {
        return;
    }
    AutoMutex _l(mLayerMutex);
    if ((strstr(layerFE->getDebugName(), sVipLayerName) != NULL) && (strstr(layerFE->getDebugName(), "SurfaceView[") != NULL) && (strstr(layerFE->getDebugName(), "BLAST") != NULL)) {
        CLOSEABLE_ATRACE_CALL(!mEnableTrace); //("Camera_Layer");
        FrameWaitStat *stat = nullptr;
        void *ptr = layerFE->getFrameWaitStat();
        if (ptr != nullptr) {
            stat = (FrameWaitStat *)ptr;
            stat->setWaitFrameEnable(true);
        } else {
            stat = new FrameWaitStat();
            stat->setWaitFrameEnable(true);
            layerFE->setFrameWaitStat(stat);
        }
    }
}
void FrameWaiter::onLayerDestroyed(Layer *layer) {
    if (!FrameWaitStat::sFeatureEnable) {
        return;
    }
    if (layer == nullptr) {
        return;
    }
    if (layer == mWaitFrameLayer) {
        AutoMutex _l(mLayerMutex);
        if (mLayerFE != nullptr) {
            void *ptr = mLayerFE->getFrameWaitStat();
            if (ptr != nullptr) {
                FrameWaitStat *stat = (FrameWaitStat *)ptr;
                delete stat;
                mLayerFE->setFrameWaitStat(nullptr);
            }
        }
        mWaitFrameLayer = nullptr;
        mLayerFE = nullptr;
    }
}

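But the faulting site takes no lock: FrameWaiter::signalFrameUpdate, shown earlier, fetches the pointer and writes mSignalFrameTime without holding mLayerMutex.

Moreover, signalFrameUpdate is called from SF::setTransactionState, which runs on a binder thread, so concurrent execution really does happen.

Fix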

gerrit.pt.mioffice.cn

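Taking the lock is enough: signalFrameUpdate should hold mLayerMutex while it uses the FrameWaitStat.

💡 Note that this class has a second lock, mUpdateSignalMutex, so the two locks must always be acquired in the same order to avoid deadlock.

From the code, FrameWaiter::signalFrameUpdate is currently the only function that holds both locks at once, in the order mLayerMutex, then mUpdateSignalMutex. So this should not cause a deadlock.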
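
The locking pattern can be sketched in standard C++. This is not the actual patch (which is not reproduced in this note): Waiter and Stat are simplified stand-ins invented for illustration, std::mutex replaces AutoMutex, and a std::unique_ptr replaces the raw pointer stored on the LayerFE. The point is only that the pointer is read and dereferenced under mLayerMutex, and that mUpdateSignalMutex is always taken second.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>

// Simplified stand-ins; not the real SurfaceFlinger classes.
struct Stat {
    int64_t signalFrameTime = 0;
    bool waitFrame = false;
};

class Waiter {
public:
    // Creates the stat if needed (role of ensureOutputLayer).
    void ensureStat() {
        std::lock_guard<std::mutex> lock(mLayerMutex);
        if (!mStat) mStat = std::make_unique<Stat>();
        mStat->waitFrame = true;
    }

    // Frees the stat (role of onLayerDestroyed).
    void onDestroyed() {
        std::lock_guard<std::mutex> lock(mLayerMutex);
        mStat.reset();
    }

    // Returns true if the stat was still alive and was updated.
    bool signalFrameUpdate(int64_t now) {
        std::lock_guard<std::mutex> layerLock(mLayerMutex);  // lock 1
        if (!mStat) return false;  // freed concurrently: nothing to touch
        mStat->signalFrameTime = now;
        if (mStat->waitFrame) {
            // lock 2, always acquired after lock 1
            std::lock_guard<std::mutex> signalLock(mUpdateSignalMutex);
            mFrameUpdated = true;
        }
        return true;
    }

    bool frameUpdated() {
        std::lock_guard<std::mutex> lock(mUpdateSignalMutex);
        return mFrameUpdated;
    }

private:
    std::mutex mLayerMutex;
    std::mutex mUpdateSignalMutex;
    std::unique_ptr<Stat> mStat;  // guarded by mLayerMutex
    bool mFrameUpdated = false;   // guarded by mUpdateSignalMutex
};

// Small single-threaded walk-through of the three calls.
inline void demo() {
    Waiter w;
    w.ensureStat();
    assert(w.signalFrameUpdate(100));
    w.onDestroyed();
    assert(!w.signalFrameUpdate(200));  // no access after destruction
}
```

Because destruction and signalling now serialize on the same mutex, a signal that arrives after destruction sees a null stat instead of a dangling pointer.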

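Could it deadlock with SF's big lock, mStateLock?

As far as I can tell, signalFrameUpdate never acquires mStateLock, directly or indirectly. So at least this change introduces no lock-order inversion and will not cause a deadlock.

Could the work be moved to the main thread to avoid locking altogether?

According to the code owner, this is by design and it cannot be moved to the main thread: the whole point of the feature is to make the main thread wait.

Because the feature is only enabled in video playback scenarios, the risk is manageable.

Details: FrameWaiter source code analysis

Why heap-buffer-overflow is reported instead of use-after-free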

💡 This is the most interesting part of this issue.

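As described above, by the time HWASan reported the error, the memory being accessed had already been freed and given to another object. Wouldn't UAF be the more accurate report?

This presumably comes down to how HWASan works. My guess is as follows:

HWASan's basic principle is to compare the tag of the pointer (ptr) with the tag of the memory (mem) and report an error on a mismatch. It cannot tell whether mem is still the same allocation the pointer originally came from.

For a given block of memory it only records the most recent allocation and, if the block was freed, the most recent free. When it detects a UAF it prints both the latest allocation stack and the latest free stack; otherwise it prints only the latest allocation.

So how does HWASan recognize a UAF? By marking freed memory as 0xFD. In other words, if the mem tag is not 0xFD or another special value, it treats the memory as allocated. Which allocation it was, it does not record.

On reflection, this design is quite reasonable. To know exactly which allocation a ptr belongs to, every allocation would need an id plus a recorded stack trace. In a real program the same memory is allocated over and over. How much data would that be? The cost is too high.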
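
The tag comparison described above can be modelled in a few lines. This is a toy model under stated assumptions, not HWASan's implementation: ToyHeap, TaggedPtr and the sequential tag generator are invented for illustration, whereas real tags are random and live in shadow memory and in the pointer's top byte.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model only; all names here are hypothetical.
struct TaggedPtr {
    uint8_t tag;          // stands in for the tag carried in the pointer
    std::size_t granule;  // index of the 16-byte granule it points to
};

class ToyHeap {
public:
    // Hand out a granule with a fresh tag; pointer and memory get the same tag.
    // Calling this again for the same granule models "freed and reused".
    TaggedPtr allocate(std::size_t granule) {
        assert(granule < mMemTags.size());
        mMemTags[granule] = nextTag();
        return {mMemTags[granule], granule};
    }

    // The only question the check answers: do the two tags match right now?
    // It carries no information about which allocation the pointer came from.
    bool accessOk(TaggedPtr p) const {
        assert(p.granule < mMemTags.size());
        return p.tag == mMemTags[p.granule];
    }

private:
    uint8_t nextTag() {
        mNext = static_cast<uint8_t>(mNext + 1);
        // Skip 0..15 so toy tags cannot be mistaken for short-granule sizes.
        if (mNext < 16) mNext = 16;
        return mNext;
    }

    std::array<uint8_t, 8> mMemTags{};  // one tag per granule
    uint8_t mNext = 15;                 // deterministic for the sketch
};
```

In this model, a pointer obtained from the first allocate(3) stops passing accessOk once allocate(3) is called again, which mirrors the f7/3f (ptr/mem) mismatch in the report: the check sees two different tags and nothing more.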

ASan

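In the older ASan scheme, freed memory is marked 0xFD in shadow memory, which is easy to verify with a simple experiment. Write a test program containing a use-after-free, compile it with clang -fsanitize=address, and run it: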

SUMMARY: AddressSanitizer: heap-use-after-free ./main.cpp:7:8 in main​
Shadow bytes around the buggy address:​
 0x72221481fd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00​
 0x72221481fe00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00​
 0x72221481fe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00​
 0x72221481ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00​
 0x72221481ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00​
=>0x722214820000: fa fa[fd]fa fa fa fa fa fa fa fa fa fa fa fa fa​
 0x722214820080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa​
 0x722214820100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa​
 0x722214820180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa​
 0x722214820200: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa​
 0x722214820280: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa​
Shadow byte legend (one shadow byte represents 8 application bytes):​
 Addressable: 00​
 Partially addressable: 01 02 03 04 05 06 07​
 Heap left redzone: fa​
 Freed heap region: fd​
 Stack left redzone: f1​
 Stack mid redzone: f2​
 Stack right redzone: f3​
 Stack after return: f5​
 Stack use after scope: f8​
 Global redzone: f9​
 Global init order: f6​
 Poisoned by user: f7​
 Container overflow: fc​
 Array cookie: ac​
 Intra object redzone: bb​
 ASan internal: fe​
 Left alloca redzone: ca​
 Right alloca redzone: cb

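In HWASan, however, freed memory is not marked with a fixed 0xFD but retagged with another random value. Writing any UAF demo on Android and running it a few times shows this.

Revised guess: HWASan may decide whether an error is a UAF by checking whether the last trace recorded for the accessed block is an allocation or a free. That avoids filling memory with 0xFD and is more efficient.

Verifying this guess requires reading the HWASan source. I'll look at it when I have time.

Digging into the HWASan source

I went through the relevant source, and it is fairly close to my guess. This is a deliberate trade-off in HWASan's design: stricter detection is possible, but it costs too much. The goal is to fix memory errors, and the tool should not lose sight of that. The existing report was already enough for us to solve this problem.

For details, see the article "Digging into the HWASan source", specifically the implementation of FindBufferOverflowCandidate and the logic that distinguishes UAF from overflow.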