TYGR: Type Inference on Stripped Binaries using Graph Neural Networks

By Aravind Machiry , Adam Doupe , Chang Zhu , Yibo Liu , Ruoyu Wang , Tiffany Bao , Yan Shoshitaishvili , Ati Bajaj , Wil Gibbs , Ziyang Li , Anton Xue , Rajeev Alur , Hanjun Dai , Mayur Naik on 15 Aug 2024 @ Usenix
πŸ“Š Presentation πŸ“„ Whitepaper πŸ“Ή Video πŸ”— Link
#binary-analysis #reverse-engineering #deep-learning #static-analysis
Focus Areas: πŸ€– AI & ML Security , πŸ” Application Security , 🦠 Malware Analysis , πŸ”¬ Reverse Engineering

Presentation Material

Abstract

Binary type inference is a core research challenge in binary program analysis and reverse engineering. It concerns identifying the data types of registers and memory values in a stripped executable (or object file), whose type information is discarded during compilation. Current methods rely on either manually crafted inference rules, which are brittle and demand significant effort to update, or machine learning-based approaches that suffer from low accuracy.

In this paper we propose TYGR, a graph neural network based solution that encodes data-flow information for inferring both basic and struct variable types in stripped binary programs. To support different architectures and compiler optimizations, TYGR was implemented on top of the ANGR binary analysis platform and uses an architecture-agnostic data-flow analysis to extract a graph-based intra-procedural representation of data-flow information.

We noticed a severe lack of diversity in existing binary executables datasets and created TyDa, a large dataset of diverse binary executables. The sole publicly available dataset, provided by STATEFORMER, contains only 1% of the total number of functions in TyDa. TYGR is trained and evaluated on a subset of TyDa and generalizes to the rest of the dataset. TYGR demonstrates an overall accuracy of 76.6% and struct type accuracy of 45.2% on the x64 dataset across four optimization levels (O0-O3). TYGR outperforms existing works by a minimum of 26.1% in overall accuracy and 10.2% in struct accuracy.

AI Generated Summary

This research addresses the problem of variable type recovery from compiled binaries, a critical task for reverse engineering, malware analysis, and software maintenance. The presented system, TIGER, introduces a graph-based method to explicitly encode data flow information, such as variable access patterns and usage, for type inference.

TIGER’s pipeline processes a binary through symbolic execution to construct a graph where nodes represent memory locations, registers, and concrete values, while edges encode operations like stores, adds, and moves. These graphs are embedded using graph neural networks, with features capturing value, address, and operation types. A key innovation is the two-step recovery for structure types: first classifying a memory location as a structure, then predicting the concrete types of its individual members by masking other primitive types. This approach avoids the limitations of size-based heuristics and dictionary-dependent machine learning methods.

The authors constructed a large dataset spanning five architectures and four optimization levels, containing over 300,000 binaries and 100 million functions. Evaluation against state-of-the-art methods shows TIGER achieves higher overall accuracy and specifically improved structure type recovery. Analysis indicates that distinct access patterns, such as those for boolean flag variables, are effectively captured by the explicit graph encoding, contributing to performance gains.

The work provides a publicly available dataset, models, and source code. The primary implication is a more accurate and architecture-agnostic technique for recovering high-level type semantics from binaries, enhancing automated program analysis capabilities.

Disclaimer: This summary was auto-generated from the video transcript using AI and may contain inaccuracies. It is intended as a quick overview β€” always refer to the original talk for authoritative content. Learn more about our AI experiments.