Cloud-based malware detection for evolving data streams

Mohammad M. Masud, Tahseen M. Al-Khateeb, Kevin W. Hamlen, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham

Research output: Contribution to journal › Article › peer-review

55 Citations (Scopus)


Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite in length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail. To address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
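
The partitioning scheme described in the abstract can be illustrated with a minimal Python sketch. It assumes labeled chunks given as lists of (feature_vector, label) pairs and substitutes a toy nearest-centroid base learner, which is not the article's classifier; the names train_mpc and predict_ensemble and the seed parameter are hypothetical. It covers only the training of v classifiers from r merged chunks and majority voting, not how the ensemble is updated as new chunks arrive.

```python
import random
from collections import Counter


def train_centroid(rows):
    """Toy base learner (an assumption): one mean vector per class."""
    sums, counts = {}, Counter()
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: dist(model[label]))


def train_mpc(chunks, v, seed=0):
    """Train v classifiers from r consecutive chunks via v-fold partitioning.

    chunks: the r most recent labeled chunks (r = len(chunks)).
    Returns a list of (model, held_out_error) pairs, one per fold.
    """
    data = [row for chunk in chunks for row in chunk]  # merge the r chunks
    random.Random(seed).shuffle(data)                  # random partitioning
    folds = [data[i::v] for i in range(v)]             # v disjoint partitions
    models = []
    for held_out in range(v):
        # Train on every partition except the held-out one.
        train = [row for j, fold in enumerate(folds)
                 if j != held_out for row in fold]
        model = train_centroid(train)
        # Estimate the error on the held-out partition.
        test = folds[held_out]
        wrong = sum(predict_centroid(model, x) != y for x, y in test)
        models.append((model, wrong / max(len(test), 1)))
    return models


def predict_ensemble(models, x):
    """Majority vote over the classifiers in the ensemble."""
    votes = Counter(predict_centroid(model, x) for model, _ in models)
    return votes.most_common(1)[0][0]
```

Calling train_mpc on the r most recent labeled chunks returns v (model, held-out error) pairs. The held-out error is one plausible signal for deciding which classifiers to retain, though the selection rule is left out here because the abstract does not state it.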

Original language: English
Article number: 16
Journal: ACM Transactions on Management Information Systems
Issue number: 3
Publication status: Published - Oct 2011
Externally published: Yes


Keywords

  • Data mining
  • Data streams
  • Malicious executable
  • Malware detection
  • N-gram analysis

ASJC Scopus subject areas

  • Management Information Systems
  • General Computer Science

