
A Comparative Study on Source Code Attribution Using AI: Datasets, Features, and Techniques

  • Shamma Alalawi
  • Saed Alrabaee
  • Wasif Khan
  • Issam Al-Azzoni
  • Medha Mohan Ambali Parambil

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In recent years, the application of artificial intelligence (AI) techniques to source code authorship attribution has gained significant attention from academia and industry. Accurately attributing source code to its original author is crucial for purposes such as intellectual property protection, cybersecurity, and software forensics. Code-generating AI tools such as ChatGPT present new challenges and opportunities in distinguishing between human- and machine-generated code. This article comprehensively reviews existing research on source code authorship attribution and presents a series of experiments on a dataset of 600 source code samples. The study involves extracting lexical and layout features, applying ranking methods, and employing several machine learning models (SVM, LR, MLP, XGBoost, and RF) and deep learning models (LSTM, RNN, and CNN). The objectives are to identify the best model for determining whether a piece of source code was written by a human or by ChatGPT-4, and to provide insights into two characteristics of human authors: gender and region. Our results show up to 94.7% accuracy with RF using TF-IDF and 95% accuracy with the CNN model. Finally, we identify emerging trends and potential future research directions in AI for authorship attribution.
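
The abstract does not list the exact lexical and layout features used in the study. As a rough illustration only, the sketch below computes a few features of each kind that are common in stylometry work: whitespace and comment ratios for layout, and identifier statistics for the lexical side. The function names, the tokenizer regex, and the chosen features are all assumptions made for this example, not the authors' method.

```python
import re
from collections import Counter

# Hypothetical tokenizer: identifiers/keywords, integer literals,
# or any single non-whitespace character. Comment text is tokenized too.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def layout_features(source: str) -> dict:
    """Formatting habits: blank lines, indentation style, comment lines."""
    lines = source.splitlines()
    total = len(lines) or 1  # avoid division by zero on empty input
    return {
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / total,
        "tab_indent_ratio": sum(1 for ln in lines if ln.startswith("\t")) / total,
        "space_indent_ratio": sum(1 for ln in lines if ln.startswith(" ")) / total,
        # Assumes '#' or '//' line comments; block comments are ignored.
        "comment_line_ratio": sum(
            1 for ln in lines if ln.strip().startswith(("#", "//"))
        ) / total,
        "mean_line_length": sum(len(ln) for ln in lines) / total,
    }


def lexical_features(source: str) -> dict:
    """Vocabulary habits: token volume and identifier naming statistics."""
    tokens = TOKEN_RE.findall(source)
    identifiers = [t for t in tokens if re.match(r"[A-Za-z_]", t)]
    n = len(identifiers) or 1
    return {
        "token_count": len(tokens),
        "unique_identifier_ratio": len(Counter(identifiers)) / n,
        "mean_identifier_length": sum(len(t) for t in identifiers) / n,
    }


# Illustrative input; any source string works.
sample = "def add(a, b):\n    # sum two values\n    return a + b\n\nprint(add(1, 2))\n"

if __name__ == "__main__":
    print(layout_features(sample))
    print(lexical_features(sample))
```

Feature dictionaries like these could be turned into vectors and passed to any of the classifiers named in the abstract. The TF-IDF representation the abstract mentions would replace or complement the hand-picked lexical counts.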

Original language: English
Title of host publication: Security and Privacy in Communication Networks - 20th EAI International Conference, SecureComm 2024, Proceedings
Editors: Saed Alrabaee, Kim-Kwang Raymond Choo, Ernesto Damiani, Robert H. Deng
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 332-353
Number of pages: 22
ISBN (Print): 9783031944444
Publication status: Published - 2026
Event: 20th EAI International Conference on Security and Privacy in Communication Networks, SecureComm 2024 - Dubai, United Arab Emirates
Duration: Oct 28 2024 → Oct 30 2024

Publication series

Name: Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST
Volume: 627 LNICST
ISSN (Print): 1867-8211
ISSN (Electronic): 1867-822X

Conference

Conference: 20th EAI International Conference on Security and Privacy in Communication Networks, SecureComm 2024
Country/Territory: United Arab Emirates
City: Dubai
Period: 10/28/24 → 10/30/24

Keywords

  • ChatGPT-generated Code
  • Code Authorship Attribution
  • Deep Learning
  • Machine Learning
  • Software Forensics

ASJC Scopus subject areas

  • Computer Networks and Communications
