TY - GEN
T1 - A Comparative Study on Source Code Attribution Using AI
T2 - 20th EAI International Conference on Security and Privacy in Communication Networks, SecureComm 2024
AU - Alalawi, Shamma
AU - Alrabaee, Saed
AU - Khan, Wasif
AU - Al-Azzoni, Issam
AU - Parambil, Medha Mohan Ambali
N1 - Publisher Copyright: © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2026.
PY - 2026
Y1 - 2026
AB - In recent years, applying artificial intelligence (AI) techniques to source code authorship attribution has attracted significant attention from academia and industry. Accurately attributing source code to its original author is crucial for intellectual property protection, cybersecurity, and software forensics. AI tools that can generate code, such as ChatGPT, present new challenges and opportunities in distinguishing human-written from machine-generated code. This article reviews existing research on source code authorship attribution and presents a series of experiments on a dataset of 600 source code samples. The study extracts lexical and layout features, applies ranking methods, and trains several machine learning models (SVM, LR, MLP, XGBoost, and RF) and deep learning models (LSTM, RNN, and CNN). The objectives are to identify the best model for determining whether a source code sample was written by a human or by ChatGPT-4, and to provide insights into two characteristics of human authors: gender and region. The best results reach 94.7% accuracy with RF using TF-IDF features and 95% accuracy with the CNN model. Finally, we identify emerging trends and potential future research directions in AI for authorship attribution.
KW - ChatGPT-generated Code
KW - Code Authorship Attribution
KW - Deep Learning
KW - Machine Learning
KW - Software Forensics
UR - https://www.scopus.com/pages/publications/105023282686
U2 - 10.1007/978-3-031-94445-1_18
DO - 10.1007/978-3-031-94445-1_18
M3 - Conference contribution
AN - SCOPUS:105023282686
SN - 9783031944444
T3 - Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST
SP - 332
EP - 353
BT - Security and Privacy in Communication Networks - 20th EAI International Conference, SecureComm 2024, Proceedings
A2 - Alrabaee, Saed
A2 - Choo, Kim-Kwang Raymond
A2 - Damiani, Ernesto
A2 - Deng, Robert H.
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 28 October 2024 through 30 October 2024
ER -