Abstract
Binary authorship characterization is the process of identifying stylistic characteristics related to the author of anonymous binary code. The aim is to automate the laborious and error-prone reverse engineering task of discovering information about the author(s) of binary code. This paper presents CPA, a novel approach for characterizing the authors of program binaries. Instead of using generic features such as n-grams, CPA introduces a set of new features that capture various aspects of author style, including author code traits, code structure characteristics, and author expertise in solving coding tasks. It employs the Latent Dirichlet Allocation (LDA) algorithm to generate author style signatures, which help identify similar author style characteristics in other binaries. We evaluated CPA on large datasets extracted from selected open-source C/C++ projects on GitHub and from Google Code Jam events. It successfully attributed a large number of authors with a high F1 score: around 91% when the number of authors was 1,500. In addition, the false positive rate was low, around 1.5%. When the code was subjected to refactoring or code transformation techniques, or was compiled with different compilers or compilation settings, there was no significant drop in accuracy, demonstrating the robustness of our tool. Finally, for code written by multiple authors, CPA identified the authors with a high F1 score, around 89%.
Original language | English |
---|---|
Article number | 9032128 |
Pages (from-to) | 3051-3066 |
Number of pages | 16 |
Journal | IEEE Transactions on Information Forensics and Security |
Volume | 15 |
Publication status | Published - 2020 |
Keywords
- Binary code analysis
- LDA
- authorship characterization
- reverse engineering
ASJC Scopus subject areas
- Safety, Risk, Reliability and Quality
- Computer Networks and Communications