TY - GEN
T1 - Utilizing cost-sensitive machine learning classifiers to identify compounds that inhibit Alzheimer's APP translation
AU - Alashwal, Hany
AU - Lucman, Juwayni
N1 - Funding Information:
This work received financial support from the United Arab Emirates University (grant no. CIT 31T129).
Publisher Copyright:
© 2020 ACM.
PY - 2020/8/26
Y1 - 2020/8/26
AB - Virtual screening of bioassay data can help identify compounds that restrict the production of amyloid beta peptides (Aβ), observed in Alzheimer's patients, by inhibiting the translation of amyloid precursor protein (APP). Machine learning classifiers can be applied to the dataset to find those compounds. However, the proportion of active molecules that inhibit APP translation is very small compared to their inactive counterparts. The imbalance between the two classes is handled by introducing cost-sensitivity, which reweights the training instances according to the misclassification cost allotted to each class. The paper reports the performance of cost-sensitive classifiers (Random Forest, Naive Bayes, and Logistic Regression) in distinguishing the minority (active) molecules from the majority (inactive) class. Sensitivity, specificity, False Negative rate, ROC area, and accuracy are evaluated while keeping the False Positive rate at 20.6%. The aim of the study is to identify the most reliable classifier for the bioassay data and to explore the ideal misclassification cost. The Random Forest classifier was the most robust model compared to the Naive Bayes and Logistic Regression classifiers. Moreover, each classifier had a different optimal misclassification cost.
KW - Alzheimer's Disease
KW - Classification
KW - Cost Sensitivity
KW - Logistic Regression
KW - Naive Bayes
KW - Primary Screen Bioassay
KW - Random Forest
UR - http://www.scopus.com/inward/record.url?scp=85092701807&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092701807&partnerID=8YFLogxK
U2 - 10.1145/3416921.3416931
DO - 10.1145/3416921.3416931
M3 - Conference contribution
AN - SCOPUS:85092701807
T3 - ACM International Conference Proceeding Series
SP - 113
EP - 117
BT - Proceedings of the 2020 4th International Conference on Cloud and Big Data Computing, ICCBDC 2020
PB - Association for Computing Machinery
T2 - 4th International Conference on Cloud and Big Data Computing, ICCBDC 2020
Y2 - 26 August 2020 through 28 August 2020
ER -