TY - JOUR
T1 - A scalable multi-level feature extraction technique to detect malicious executables
AU - Masud, Mohammad M.
AU - Khan, Latifur
AU - Thuraisingham, Bhavani
N1 - Funding Information:
The work reported in this paper is supported by AFOSR under contract FA9550-06-1-0045 and by the Texas Enterprise Funds. We thank Dr. Robert Herklotz of AFOSR and Prof. Robert Helms, Dean of the School of Engineering at the University of Texas at Dallas for funding this research.
PY - 2008/3
Y1 - 2008/3
N2 - We present a scalable, multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply it to a large corpus of real benign and malicious executables. These features are extracted from the corpus data and used to train a classifier, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge-based for several reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. Our model is compared against other feature-based approaches for malicious code detection and is found to perform better in terms of detection accuracy and false alarm rate.
KW - Disassembly
KW - Feature extraction
KW - Malicious executable
KW - n-gram analysis
UR - http://www.scopus.com/inward/record.url?scp=39749143915&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=39749143915&partnerID=8YFLogxK
U2 - 10.1007/s10796-007-9054-3
DO - 10.1007/s10796-007-9054-3
M3 - Article
AN - SCOPUS:39749143915
SN - 1387-3326
VL - 10
SP - 33
EP - 45
JO - Information Systems Frontiers
JF - Information Systems Frontiers
IS - 1
ER -