Paper Title
XDA: Accurate, Robust Disassembly with Transfer Learning
Authors
Abstract
Accurate and robust disassembly of stripped binaries is challenging. The root of the difficulty is that high-level structures, such as instruction and function boundaries, are absent in stripped binaries and must be recovered based on incomplete information. Current disassembly approaches rely on heuristics or simple pattern matching to approximate the recovery, but these methods are often inaccurate and brittle, especially across different compiler optimizations. We present XDA, a transfer-learning-based disassembly framework that learns different contextual dependencies present in machine code and transfers this knowledge for accurate and robust disassembly. We design a self-supervised learning task motivated by masked language modeling to learn interactions among byte sequences in binaries. The outputs from this task are byte embeddings that encode sophisticated contextual dependencies between input binaries' byte tokens, which can then be fine-tuned for downstream disassembly tasks. We evaluate XDA's performance on two disassembly tasks, recovering function boundaries and assembly instructions, on a collection of 3,121 binaries taken from SPEC CPU2017, SPEC CPU2006, and the BAP corpus. The binaries are compiled by GCC, ICC, and MSVC on x86/x64 Windows and Linux platforms over 4 optimization levels. XDA achieves F1 scores of 99.0% and 99.7% at recovering function boundaries and instructions, respectively, surpassing the previous state-of-the-art on both tasks. It also maintains speed on par with the fastest ML-based approach and is up to 38x faster than hand-written disassemblers like IDA Pro. We release the code of XDA at https://github.com/CUMLSec/XDA.
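To make the masked-language-modeling idea concrete, the following is a minimal sketch of how training inputs for such a task could be prepared from raw bytes. It is not taken from the XDA code base: the function name `mask_bytes`, the token id `MASK_ID`, the `IGNORE` label, and the masking rate are all hypothetical choices made for illustration, and the abstract does not specify the actual values.

```python
import random
from typing import List, Optional, Tuple

MASK_ID = 256   # hypothetical id for the mask token; real byte values occupy 0..255
IGNORE = -1     # hypothetical label meaning "no prediction required at this position"


def mask_bytes(
    data: bytes, mask_prob: float = 0.2, seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Hide a random subset of bytes and record the originals as targets.

    Returns (inputs, labels). At a masked position the input holds MASK_ID
    and the label holds the original byte; elsewhere the input holds the
    original byte and the label holds IGNORE.
    The default mask_prob is an illustrative value, not one from the paper.
    """
    rng = random.Random(seed)
    inputs: List[int] = []
    labels: List[int] = []
    for b in data:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # the model must reconstruct this byte
            labels.append(b)
        else:
            inputs.append(b)         # visible context for the model
            labels.append(IGNORE)
    return inputs, labels


# x86-64 bytes for: push rbp; mov rbp, rsp; pop rbp; ret
code = bytes([0x55, 0x48, 0x89, 0xE5, 0x5D, 0xC3])
inputs, labels = mask_bytes(code, mask_prob=0.5, seed=0)
print(inputs)
print(labels)
```

A model trained to predict the hidden bytes from their neighbors needs no disassembly labels at this stage, which is what makes the task self-supervised; labeled function and instruction boundaries are only required later, during fine-tuning.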
