TY - JOUR
T1 - CPA
T2 - Accurate Cross-Platform Binary Authorship Characterization Using LDA
AU - Alrabaee, Saed
AU - Debbabi, Mourad
AU - Wang, Lingyu
N1 - Funding Information:
Manuscript received June 16, 2019; revised November 19, 2019 and February 5, 2020; accepted February 27, 2020. Date of publication March 11, 2020; date of current version April 9, 2020. The work of Saed Alrabaee was supported by the United Arab Emirates University Start-up Grant G00003261. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mauro Conti. (Corresponding author: Saed Alrabaee.) Saed Alrabaee is with the Information Systems and Security Department, United Arab Emirates University, Al Ain, United Arab Emirates (e-mail: salrabaee@uaeu.ac.ae).
Publisher Copyright:
© 2005-2012 IEEE.
PY - 2020
Y1 - 2020
AB - Binary authorship characterization refers to the process of identifying stylistic characteristics that are related to the author of an anonymous binary code. The aim is to automate the laborious and error-prone reverse engineering task of discovering information related to the author(s) of binary code. This paper presents CPA, a novel approach for characterizing the authors of program binaries. Instead of using generic features such as n-grams, CPA proposes a set of new features based on collections of various aspects of author style, including author code traits, code structure characteristics, and author expertise in solving coding tasks. It employs the Latent Dirichlet Allocation (LDA) algorithm to generate author style signatures to help identify similar author style characteristics in other binaries. We evaluated CPA on large datasets extracted from selected open-source C/C++ projects in GitHub and Google Code Jam events, and it successfully attributed a large number of authors with a significantly higher F1 score: around 91% when the number of authors was 1,500. In addition, the false positive rate was low, around 1.5%. When the code was subjected to refactoring techniques or code transformation or was processed using different compilers/compilation settings, there was no significant drop in accuracy, demonstrating the robustness of our tool. Finally, in the case of code written by multiple authors, CPA was able to identify the authors with a high F1 score, around 89%.
KW - Binary code analysis
KW - LDA
KW - authorship characterization
KW - reverse engineering
UR - http://www.scopus.com/inward/record.url?scp=85083556825&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083556825&partnerID=8YFLogxK
U2 - 10.1109/TIFS.2020.2980190
DO - 10.1109/TIFS.2020.2980190
M3 - Article
AN - SCOPUS:85083556825
SN - 1556-6013
VL - 15
SP - 3051
EP - 3066
JO - IEEE Transactions on Information Forensics and Security
JF - IEEE Transactions on Information Forensics and Security
M1 - 9032128
ER -