Analysis of a HWASan UAF in SurfaceFlinger (SF) signalFrameUpdate
A fairly typical case of memory corruption caused by a concurrency bug
Concurrency analysis
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'Xiaomi/chenfeng/chenfeng:15/AQ3A.240912.001/OS2.0.250226.1.VNJCNXM.STABLE-TEST.hwasan:user/test-keys'
Revision: '0'
ABI: 'arm64'
Timestamp: 2025-02-28 09:02:39.709798311+0800
Process uptime: 2240s
ZygotePid: 1420
Cmdline: /system/bin/surfaceflinger
pid: 1887, tid: 2103, name: binder:1887_1 >>> /system/bin/surfaceflinger <<<
uid: 1000
tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)
pac_enabled_keys: 000000000000000f (PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY, PR_PAC_APDBKEY)
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
Abort message: '==1887==ERROR: HWAddressSanitizer: tag-mismatch on address 0x003935d761f0 at pc 0x006f0c56ce54
WRITE of size 8 at 0x003935d761f0 tags: f7/3f (ptr/mem) in thread T1
#0 0x6f0c56ce54 (/system_ext/lib64/libmisurfaceflinger.so+0xabe54) (BuildId: a9a032276f1bbfb0e786249c8bf2f26b)
android::FrameWaiter::signalFrameUpdate(android::sp<android::Layer>) at vendor/xiaomi/frameworks/native/services/surfaceflinger/FrameWaiter.cpp:241
#1 0x6f0c52a198 (/system_ext/lib64/libmisurfaceflinger.so+0x69198) (BuildId: a9a032276f1bbfb0e786249c8bf2f26b)
android::MiSurfaceFlingerImpl::signalFrameUpdate(android::sp<android::Layer>) at vendor/xiaomi/frameworks/native/services/surfaceflinger/MiSurfaceFlingerImpl.cpp:7050 (discriminator 2)
#2 0x6ff1674e74 (/system_ext/lib64/libsurfaceflinger.so+0x5aae74) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
#3 0x6ff15e3474 (/system_ext/lib64/libsurfaceflinger.so+0x519474) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
#4 0x6fc895f7e0 (/system/lib64/libgui.so+0x13f7e0) (BuildId: fbeda03ae8857a44299576d917e8ed65)
#5 0x6ff15f95a4 (/system_ext/lib64/libsurfaceflinger.so+0x52f5a4) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
#6 0x6fbe2dfa24 (/system/lib64/libbinder.so+0x76a24) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#7 0x6fbe2c103c (/system/lib64/libbinder.so+0x5803c) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#8 0x6fbe2c08c4 (/system/lib64/libbinder.so+0x578c4) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#9 0x6fbe2c174c (/system/lib64/libbinder.so+0x5874c) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#10 0x6fbe2d0438 (/system/lib64/libbinder.so+0x67438) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#11 0x6fdfd283c4 (/system/lib64/libutils.so+0x133c4) (BuildId: cd3c0f3d02af113ec0485a7fb8d7ce83)
#12 0x6fd9c89ecc (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x81ecc) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)
#13 0x6fd9c758ac (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x6d8ac) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)
[0x003935d761e0,0x003935d76200) is a small unallocated heap chunk; size: 32 offset: 16
Cause: heap-buffer-overflow
0x003935d761f0 is located 3628 bytes after a 4-byte region [0x003935d753c0,0x003935d753c4)
The allocation stack is completely unrelated to the access stack:
allocated by thread T42 here:
#0 0x6fc79a3dec (/apex/com.android.runtime/lib64/bionic/libclang_rt.hwasan-aarch64-android.so+0x28dec) (BuildId: bd2b4326ea0cac4ac0ec1712874405a96a9f4930)
#1 0x6fd9c5a470 (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x52470) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)
#2 0x6fc896d538 (/system/lib64/libgui.so+0x14d538) (BuildId: fbeda03ae8857a44299576d917e8ed65)
(inlined by) std::__1::unordered_map<unsigned int, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>>>::operator=[abi:nn180000](std::__1::unordered_map<unsigned int, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>>>> const&) at prebuilts/clang/host/linux-x86/clang-r522817/include/c++/v1/unordered_map:1184
(inlined by) android::gui::LayerMetadata::operator=(android::gui::LayerMetadata const&) at frameworks/native/libs/gui/LayerMetadata.cpp:87
#3 0x6ff14d21f4 (/system_ext/lib64/libsurfaceflinger.so+0x4081f4) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
android::Layer::Layer(android::surfaceflinger::LayerCreationArgs const&) at frameworks/native/services/surfaceflinger/Layer.cpp:237
#4 0x6ff164e150 (/system_ext/lib64/libsurfaceflinger.so+0x584150) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
#5 0x6ff15eb04c (/system_ext/lib64/libsurfaceflinger.so+0x52104c) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
#6 0x6ff15ea1e4 (/system_ext/lib64/libsurfaceflinger.so+0x5201e4) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
#7 0x6ff13e041c (/system_ext/lib64/libsurfaceflinger.so+0x31641c) (BuildId: 7f824c81163d2ccbd31645cfe6bee402)
#8 0x5835d3bd88 (/system/bin/surfaceflinger+0x97d88) (BuildId: d944549c90f39dd839d6921639f7e976)
#9 0x6fbe2dfa24 (/system/lib64/libbinder.so+0x76a24) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#10 0x6fbe2c103c (/system/lib64/libbinder.so+0x5803c) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#11 0x6fbe2c08c4 (/system/lib64/libbinder.so+0x578c4) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#12 0x6fbe2c174c (/system/lib64/libbinder.so+0x5874c) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#13 0x6fbe2d0438 (/system/lib64/libbinder.so+0x67438) (BuildId: 32e47bbbcbedcf105b457ea018604686)
#14 0x6fdfd283c4 (/system/lib64/libutils.so+0x133c4) (BuildId: cd3c0f3d02af113ec0485a7fb8d7ce83)
#15 0x6fd9c89ecc (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x81ecc) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)
#16 0x6fd9c758ac (/apex/com.android.runtime/lib64/bionic/hwasan/libc.so+0x6d8ac) (BuildId: 68e69cd2a12bebc7070200c2d9ada377)
Memory tags around the buggy address (one tag corresponds to 16 bytes):
0x003935d75900: 20 20 8e 08 f0 f0 92 92 01 08 aa aa c7 c7 04 be
0x003935d75a00: 32 32 c5 db b1 39 04 4d 93 93 05 05 45 45 f8 f8
0x003935d75b00: cb cb 96 96 f2 f2 85 44 c3 e7 b1 b1 04 db 90 e5
0x003935d75c00: 12 12 d7 d7 f3 7d c0 c0 90 e7 fe fe 84 e8 5d 2d
0x003935d75d00: c7 c7 21 21 04 2e 55 46 a2 a2 35 82 52 52 3d 3d
0x003935d75e00: 47 47 9f 9f f1 49 50 d8 2e 2e c8 08 ab 98 c4 08
0x003935d75f00: b0 33 6c 4e c8 c8 32 32 06 08 cc e4 da da de 63
0x003935d76000: bb bb 37 37 ca 20 50 50 47 2f 49 49 5e 08 8f 68
=>0x003935d76100: 4d 4d 83 83 2e 2e 58 90 9a 9a a6 a6 31 51 3f [3f]
0x003935d76200: f1 f1 15 15 74 74 4d 4d c2 08 8d 83 84 84 58 58
0x003935d76300: f3 08 1f 1f e5 e5 7f 2a 03 20 e8 e8 d9 97 c9 c9
0x003935d76400: 2d 2d 67 08 65 65 f0 2d 43 43 c5 c5 f3 08 a6 70
0x003935d76500: 3f 08 26 26 ba ba 2a 2a e2 e2 9d 08 04 e2 04 de
0x003935d76600: b5 fc 9e 9e 5f 5f 36 36 91 08 6b 6b dd dd 2a 2a
0x003935d76700: 08 68 ff ff 54 fa 3e 08 80 2b 2b 2b e4 e4 e9 e9
0x003935d76800: a9 6c 04 e1 da da d2 d2 fb fb d7 08 1a 1a fa 08
0x003935d76900: ae 08 5f 5f 9b 8b bd 74 04 db d1 d1 d6 84 91 d8
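The next section of the report prints the real (random) tags of short granules. In this case the accessed block has no partial tail granule, so its entry is shown as two dots, meaning the shadow memory holds the random tag itself rather than a short-granule size.
For comparison, the granule at address 0x003935d760d0 is a short granule: its real tag is 5e, while its shadow byte holds 0x08. That means only the first 8 bytes of that granule may be accessed; the last 8 bytes may not.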
Tags for short granules around the buggy address (one tag corresponds to 16 bytes):
0x003935d76000: .. .. .. .. .. .. .. .. .. .. .. .. .. 5e .. ..
=>0x003935d76100: .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. [..]
0x003935d76200: .. .. .. .. .. .. .. .. .. c2 .. .. .. .. .. ..
See https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html#short-granules for a description of short granule tags
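The short-granule rule described above can be sketched as a small check function. This is a toy model written for this note, following the short-granule description in the LLVM design document linked above; the function name `accessAllowed` and its parameters are made up for illustration and are not HWASan's internal API.

```cpp
#include <cstdint>

constexpr unsigned kGranule = 16;  // HWASan tags memory in 16-byte granules

// Toy model of the check for one access that stays inside a single granule.
//   ptrTag   : tag carried in the pointer's top byte
//   shadow   : shadow byte of the granule being accessed
//   lastByte : value stored in the last byte of the granule
//              (for a short granule this is where the real tag lives)
//   offset   : offset of the access inside the granule
//   size     : access size in bytes
bool accessAllowed(uint8_t ptrTag, uint8_t shadow, uint8_t lastByte,
                   unsigned offset, unsigned size) {
    if (ptrTag == shadow) {
        return true;                       // full granule, tags match
    }
    if (shadow == 0 || shadow >= kGranule) {
        return false;                      // ordinary tag mismatch
    }
    // Short granule: shadow holds the number of valid bytes.
    if (offset + size > shadow) {
        return false;                      // access runs past the valid prefix
    }
    return ptrTag == lastByte;             // compare against the real tag
}
```

With the values from the report, `accessAllowed(0x5e, 0x08, 0x5e, 0, 8)` is accepted, an access to byte 8 of the same granule is rejected, and the faulting access (pointer tag f7, memory tag 3f) is rejected as a plain mismatch.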
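The odd thing about this issue is why it was classified as heap-buffer-overflow. It looks much more like a UAF, because the allocation stack does not belong to the memory that was actually accessed.
The memory actually accessed is a FrameWaitStat instance: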
void FrameWaiter::signalFrameUpdate(sp<Layer> layer) {
    CLOSEABLE_ATRACE_CALL(!mEnableTrace);
    if (!FrameWaitStat::sFeatureEnable) {
        return;
    }
    // mpendingToSignalLayers.push_back(layer);
    if ((strstr(layer->getDebugName(), sVipLayerName) != NULL) && (strstr(layer->getDebugName(), "SurfaceView[") != NULL) && (strstr(layer->getDebugName(), "BLAST") != NULL)) {
        if (mLayerFE == nullptr) {
            return;
        }
        void *ptr = mLayerFE->getFrameWaitStat();
        if (ptr != nullptr) {
            FrameWaitStat *stat = (FrameWaitStat *)ptr;
            stat->mSignalFrameTime = systemTime(); // <<<<<<<<<< faulting line
            // MITRACE_FORMAT("lastTime:%ld", stat->getLastFrameTime())
            if (stat->getEnableWaitFrame()) {
                AutoMutex _l(mUpdateSignalMutex);
                mLayerFrameUpdated = true;
            }
        }
    }
}
FrameWaitStat definition:
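//services/surfaceflinger/FrameWaitStat.h
class FrameWaitStat {
private:
    bool mWaitFrame = false;
    nsecs_t mLastFrameTime = 0;
public:
    static bool sFeatureEnable;
    nsecs_t mSignalFrameTime = 0;
    ...
};
Leaving the static member aside: nsecs_t is a 64-bit integer, i.e. 8 bytes. With 8-byte alignment the bool also occupies 8 bytes once padding is counted, so the whole structure is 24 bytes (assuming the elided part adds no data members). This agrees with the report above, which says "size: 32 offset: 16": 24 bytes round up to a 32-byte chunk of 16-byte granules, and mSignalFrameTime sits at offset 16.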
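The size arithmetic can be checked at compile time. The sketch below uses a hypothetical mirror struct, `FrameWaitStatLayout`, that copies only the three non-static members shown above in their declared order; it assumes a typical 64-bit (LP64) ABI and that the elided part of the real class adds no data members.

```cpp
#include <cstddef>
#include <cstdint>

using nsecs_t = int64_t;  // Android defines nsecs_t as int64_t

// Hypothetical mirror of FrameWaitStat's non-static data members.
// The real class may contain more members in the elided "..." part.
struct FrameWaitStatLayout {
    bool mWaitFrame = false;       // 1 byte + 7 bytes of padding
    nsecs_t mLastFrameTime = 0;    // offset 8
    nsecs_t mSignalFrameTime = 0;  // offset 16
};

constexpr std::size_t kGranule = 16;  // HWASan granule size in bytes

// Round an allocation size up to a whole number of granules.
constexpr std::size_t roundUpToGranule(std::size_t n) {
    return (n + kGranule - 1) / kGranule * kGranule;
}

static_assert(sizeof(FrameWaitStatLayout) == 24, "expected 24 bytes on LP64");
static_assert(offsetof(FrameWaitStatLayout, mSignalFrameTime) == 16,
              "mSignalFrameTime expected at offset 16");
static_assert(roundUpToGranule(sizeof(FrameWaitStatLayout)) == 32,
              "24 bytes occupy two 16-byte granules");
```

If all three assertions hold, the faulting 8-byte write at offset 16 of a 32-byte chunk is exactly where a write to mSignalFrameTime would land.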
Now for the allocation stack.
It allocates memory for a LayerMetadata:
//services/surfaceflinger/Layer.cpp
Layer::Layer(const surfaceflinger::LayerCreationArgs& args)... {
    ...
    mDrawingState.metadata = args.metadata;
Assigning to mDrawingState.metadata invokes the copy assignment operator:
//libs/gui/LayerMetadata.cpp
LayerMetadata& LayerMetadata::operator=(const LayerMetadata& other) {
    mMap = other.mMap;
    return *this;
}
LayerMetadata has a single member, an unordered map whose key is uint32_t and whose value is vector<uint8_t>:
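struct LayerMetadata : public Parcelable {
    std::unordered_map<uint32_t, std::vector<uint8_t>> mMap;
    ...
};
So the allocation site has nothing to do with the access site.
The most likely story is therefore this: the memory that had been allocated to the FrameWaitStat was freed and then handed to LayerMetadata. The faulting thread did not know this and accessed the memory again, and HWASan reported the tag mismatch.
Looking at the code that manages the FrameWaitStat object holding mSignalFrameTime, the paths that create and destroy it all take the lock: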
//services/surfaceflinger/FrameWaiter.cpp
void FrameWaiter::ensureOutputLayer(sp<compositionengine::LayerFE> &layerFE) {
    // CLOSEABLE_ATRACE_CALL(!mEnableTrace);("FrameWaiter::ensureOutputLayer");
    if (!FrameWaitStat::sFeatureEnable) {
        return;
    }
    AutoMutex _l(mLayerMutex);
    if ((strstr(layerFE->getDebugName(), sVipLayerName) != NULL) && (strstr(layerFE->getDebugName(), "SurfaceView[") != NULL) && (strstr(layerFE->getDebugName(), "BLAST") != NULL)) {
        CLOSEABLE_ATRACE_CALL(!mEnableTrace); //("Camera_Layer");
        FrameWaitStat *stat = nullptr;
        void *ptr = layerFE->getFrameWaitStat();
        if (ptr != nullptr) {
            stat = (FrameWaitStat *)ptr;
            stat->setWaitFrameEnable(true);
        } else {
            stat = new FrameWaitStat();
            stat->setWaitFrameEnable(true);
            layerFE->setFrameWaitStat(stat);
        }
    }
}
void FrameWaiter::onLayerDestroyed(Layer *layer) {
    if (!FrameWaitStat::sFeatureEnable) {
        return;
    }
    if (layer == nullptr) {
        return;
    }
    if (layer == mWaitFrameLayer) {
        AutoMutex _l(mLayerMutex);
        if (mLayerFE != nullptr) {
            void *ptr = mLayerFE->getFrameWaitStat();
            if (ptr != nullptr) {
                FrameWaitStat *stat = (FrameWaitStat *)ptr;
                delete stat;
                mLayerFE->setFrameWaitStat(nullptr);
            }
        }
        mWaitFrameLayer = nullptr;
        mLayerFE = nullptr;
    }
}
But the faulting site, which fetches the pointer and writes through it, takes no lock at all. See the FrameWaiter::signalFrameUpdate code shown earlier.
Moreover, signalFrameUpdate is called from SF::setTransactionState, which runs on a binder thread, so concurrent execution is indeed possible.
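Fix

Taking the lock is enough: signalFrameUpdate should hold mLayerMutex while it fetches and uses the FrameWaitStat pointer.
💡 Note that this class has a second lock, mUpdateSignalMutex, so the acquisition order must stay consistent to avoid deadlock.
From reading the code, after this change FrameWaiter::signalFrameUpdate is the only function that holds both locks at once, in the order mLayerMutex→mUpdateSignalMutex, so it should not cause a deadlock.
Could it deadlock with mStateLock, SF's big lock?
As far as I can tell, signalFrameUpdate does not acquire mStateLock either directly or indirectly, so at least this change introduces no lock-order inversion and will not cause a deadlock.
Could the operation be moved to the main thread to avoid locking altogether?
After talking to the code owner: this is by design and it cannot be moved to the main thread. The whole point of the feature is to make the main thread wait.
Because the feature is only enabled in video playback scenarios, the risk is manageable.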
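To make the locking discipline concrete, here is a minimal standalone sketch. It is not the actual patch: `ToyStat` and `ToyFrameWaiter` are hypothetical stand-ins for FrameWaitStat and FrameWaiter, and std::mutex replaces Android's Mutex/AutoMutex so that the example compiles on its own.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>

// Hypothetical stand-in for FrameWaitStat.
struct ToyStat {
    bool waitFrame = false;
    int64_t signalFrameTime = 0;
};

// Hypothetical stand-in for FrameWaiter, reduced to the locking pattern.
class ToyFrameWaiter {
public:
    // Plays the role of ensureOutputLayer: create the stat under the lock.
    void ensure() {
        std::lock_guard<std::mutex> lock(mLayerMutex);
        if (!mStat) mStat = std::make_unique<ToyStat>();
        mStat->waitFrame = true;
    }

    // Plays the role of onLayerDestroyed: free the stat under the lock.
    void destroy() {
        std::lock_guard<std::mutex> lock(mLayerMutex);
        mStat.reset();
    }

    // Plays the role of signalFrameUpdate after the fix.
    // Returns true if the stat was alive and was updated.
    bool signal(int64_t now) {
        std::lock_guard<std::mutex> lock(mLayerMutex);  // the added lock
        if (!mStat) return false;                       // already destroyed
        mStat->signalFrameTime = now;
        if (mStat->waitFrame) {
            // Always taken second: mLayerMutex -> mUpdateSignalMutex.
            std::lock_guard<std::mutex> update(mUpdateSignalMutex);
            mFrameUpdated = true;
        }
        return true;
    }

    bool frameUpdated() {
        std::lock_guard<std::mutex> update(mUpdateSignalMutex);
        return mFrameUpdated;
    }

private:
    std::mutex mLayerMutex;         // guards mStat
    std::mutex mUpdateSignalMutex;  // guards mFrameUpdated
    std::unique_ptr<ToyStat> mStat;
    bool mFrameUpdated = false;
};
```

Because the pointer is fetched, checked and dereferenced under the same lock that destroy() holds, a signal() call that races with destruction either sees a live object or sees nullptr; it can never write through a stale pointer.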
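Why heap-buffer-overflow is reported instead of use-after-free
💡 This is the most interesting part of the issue.
As discussed above, by the time HWASan fired, the memory being accessed had already been freed and handed to a different object. Wouldn't it be more accurate for HWASan to report a UAF?
This comes down to how HWASan works. My guess is as follows:
HWASan's basic principle is to compare the tag in the pointer (ptr) with the tag of the memory (mem) and report an error when they differ. From that comparison alone it cannot tell whether mem belongs to an earlier allocation.
For a given block it records only the most recent allocation and, if the block has been freed, the most recent free. When it detects a UAF it prints both the latest allocation stack and the latest free stack; otherwise it prints only the latest allocation.
So how does HWASan recognize a UAF? My guess was that it marks freed memory with 0xFD. In other words, if the mem tag is not 0xFD or some other special value, the block is treated as allocated; which allocation it belongs to is not recorded.
On reflection this design is reasonable. To know exactly which allocation ptr corresponds to, every allocation would need an ID plus its recorded stack, and in a real program the same memory is allocated over and over. How much data would that be? The cost is too high.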
ASan
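In the older ASan scheme, freed memory is marked 0xFD in shadow memory, which is easy to confirm with a small experiment: write a test program containing a use-after-free, compile it with clang -fsanitize=address, and run it: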
SUMMARY: AddressSanitizer: heap-use-after-free ./main.cpp:7:8 in main
Shadow bytes around the buggy address:
0x72221481fd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x72221481fe00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x72221481fe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x72221481ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x72221481ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x722214820000: fa fa[fd]fa fa fa fa fa fa fa fa fa fa fa fa fa
0x722214820080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x722214820100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x722214820180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x722214820200: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x722214820280: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
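In HWASan, however, freed memory is not marked with a fixed 0xFD but retagged with another random value. Write any small UAF demo on Android and run it a few times to see this.
Second guess: HWASan may decide whether an access is a UAF by checking whether the last trace recorded for the accessed block is an allocation or a free. That would avoid filling in 0xFD and improve efficiency.
Verifying this guess means reading the HWASan source. I'll look when I have time.
Digging into the HWASan source
I have since read through this part of the source, and it is fairly close to my guess. It is a deliberate trade-off in HWASan's design: stricter detection is possible, but the cost is too high. The goal is to fix memory errors, not to chase precision for its own sake, and the existing report is already enough to solve the problem.
For details, see the article "hwasan源码探究" (a walkthrough of the HWASan source), specifically the implementation of FindBufferOverflowCandidate and the logic that decides between UAF and overflow.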
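The retag-on-free behaviour can be illustrated with a toy model. Everything below is a simplified sketch written for this note, not HWASan code: `ToyTaggedHeap` is a hypothetical class, and it hands out tags from a counter instead of a random generator so that the example is deterministic (with real random tags a stale pointer could occasionally collide with the new tag).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kGranuleSize = 16;            // bytes per shadow entry
constexpr std::size_t kGranules = 8;                // size of the toy heap
constexpr uint64_t kAddrMask = (1ULL << 56) - 1;    // low 56 bits = address

class ToyTaggedHeap {
public:
    // Tag carried in the top byte of a pointer.
    static uint8_t tagOf(uint64_t ptr) {
        return static_cast<uint8_t>(ptr >> 56);
    }

    // "Allocate" one granule: give it a fresh tag, return a tagged pointer.
    uint64_t allocate(std::size_t granule) {
        mShadow.at(granule) = nextTag();
        return (static_cast<uint64_t>(mShadow.at(granule)) << 56) |
               (granule * kGranuleSize);
    }

    // "Free": the memory is retagged with yet another value, not 0xFD.
    void release(uint64_t ptr) { mShadow.at(granuleOf(ptr)) = nextTag(); }

    // The whole check: does the pointer tag equal the memory tag?
    bool accessOk(uint64_t ptr) const {
        return tagOf(ptr) == mShadow.at(granuleOf(ptr));
    }

private:
    static std::size_t granuleOf(uint64_t ptr) {
        return static_cast<std::size_t>((ptr & kAddrMask) / kGranuleSize);
    }

    // Assumption for determinism: tags come from a counter that skips
    // 0x00..0x0f, the values the shadow uses for short-granule sizes.
    uint8_t nextTag() {
        uint8_t tag = mNextTag;
        mNextTag = static_cast<uint8_t>(mNextTag == 0xff ? 0x10 : mNextTag + 1);
        return tag;
    }

    std::array<uint8_t, kGranules> mShadow{};
    uint8_t mNextTag = 0x10;
};
```

In this model a stale pointer fails the check both right after release() and after the granule has been reused by a later allocate(), and in both cases all the checker sees is "tags differ". Telling the two cases apart requires extra bookkeeping beyond the tags themselves.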