TY - GEN
T1 - Revisiting Binary Code Authorship Analysis
AU - Alrabaee, Saed
AU - Al-kfairy, Mousa
AU - Taha, Mohammad Bany
AU - Alfandi, Omar
AU - Taher, Fatma
AU - Tang, Jie
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Binary authorship analysis is a crucial step in malware reverse engineering, but the volume and complexity of the code exacerbate the challenge of this manually intensive task. Consequently, efforts have been made to develop reliable automated tools to facilitate malware authorship analysis; however, many challenges are associated with automated approaches. For instance, the compilation process may remove stylistic features present in the source code. This paper evaluates the features used in existing approaches by utilizing various datasets, including programs written for the Google Code Jam programming competition, student projects from programming courses at multiple universities, and content from GitHub repositories. Additionally, we examined the impact of statistical features on precision, recall, and the false positive rate of these methodologies. The evaluation results reveal that the accuracy of these approaches varies across different application domains and datasets, and some of the selected features appear unrelated to the author’s style, indicating that careful consideration is needed when applying this approach. Finally, using statistical features enhanced the precision and recall of existing approaches while reducing the false positive rate by 10–15%.
KW - Binary Code Analysis
KW - Reverse Engineering
UR - https://www.scopus.com/pages/publications/105001379790
U2 - 10.1007/978-981-96-3531-3_21
DO - 10.1007/978-981-96-3531-3_21
M3 - Conference contribution
AN - SCOPUS:105001379790
SN - 9789819635306
T3 - Lecture Notes in Computer Science
SP - 428
EP - 449
BT - Network and System Security - 18th International Conference, NSS 2024, Proceedings
A2 - Song, Houbing Herbert
A2 - Di Pietro, Roberto
A2 - Alrabaee, Saed
A2 - Tubishat, Mohammad
A2 - Al-kfairy, Mousa
A2 - Alfandi, Omar
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th International Conference on Network and System Security, NSS 2024
Y2 - 20 November 2024 through 22 November 2024
ER -