TY - JOUR
T1 - Feature Selection for Binary Classification Within Functional Genomics Experiments via Interquartile Range and Clustering
AU - Khan, Zardad
AU - Naeem, Muhammad
AU - Khalil, Umair
AU - Khan, Dost Muhammad
AU - Aldahmani, Saeed
AU - Hamraz, Muhammad
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2019
Y1 - 2019
N2 - Datasets produced in modern research, such as biomedical science, pose a number of challenges for machine learning techniques used in binary classification due to high dimensionality. Feature selection is one of the most important statistical techniques used for dimensionality reduction of such datasets. Therefore, techniques are needed to find an optimal number of features to obtain more desirable learning performance. In the machine learning context, gene selection is treated as a feature selection problem, the objective of which is to find a small subset of the most discriminative features for the target class. In this paper, a gene selection method is proposed that identifies the most discriminative genes in two stages. In the first stage, a greedy approach selects the genes that unambiguously assign the maximum number of samples to their respective classes. The remaining genes are divided into a certain number of clusters. From each cluster, the most informative genes are selected via the lasso method and combined with the genes selected in the first stage. The performance of the proposed method is assessed through comparison with other state-of-the-art feature selection methods using gene expression datasets. This is done by applying two classifiers, random forest and support vector machine, to the training portion of the datasets with the selected genes and calculating their classification accuracy, sensitivity, and Brier score on the testing portion. Boxplots of the results and correlation matrices of the selected genes are then constructed. The results show that the proposed method outperforms the other methods.
AB - Datasets produced in modern research, such as biomedical science, pose a number of challenges for machine learning techniques used in binary classification due to high dimensionality. Feature selection is one of the most important statistical techniques used for dimensionality reduction of such datasets. Therefore, techniques are needed to find an optimal number of features to obtain more desirable learning performance. In the machine learning context, gene selection is treated as a feature selection problem, the objective of which is to find a small subset of the most discriminative features for the target class. In this paper, a gene selection method is proposed that identifies the most discriminative genes in two stages. In the first stage, a greedy approach selects the genes that unambiguously assign the maximum number of samples to their respective classes. The remaining genes are divided into a certain number of clusters. From each cluster, the most informative genes are selected via the lasso method and combined with the genes selected in the first stage. The performance of the proposed method is assessed through comparison with other state-of-the-art feature selection methods using gene expression datasets. This is done by applying two classifiers, random forest and support vector machine, to the training portion of the datasets with the selected genes and calculating their classification accuracy, sensitivity, and Brier score on the testing portion. Boxplots of the results and correlation matrices of the selected genes are then constructed. The results show that the proposed method outperforms the other methods.
KW - Clustering
KW - classification
KW - feature selection
KW - high dimensional data
KW - microarray gene expression data
UR - http://www.scopus.com/inward/record.url?scp=85068328361&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068328361&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2019.2922432
DO - 10.1109/ACCESS.2019.2922432
M3 - Article
AN - SCOPUS:85068328361
SN - 2169-3536
VL - 7
SP - 78159
EP - 78169
JO - IEEE Access
JF - IEEE Access
M1 - 8735703
ER -