TY - GEN
T1 - Incremental SMOTE with Control Coefficient for Classifiers in Data Starved Medical Applications
AU - Bae, Wan D.
AU - Alkobaisi, Shayma
AU - Bankar, Siddheshwari
AU - Bhuvaji, Sartaj
AU - Singhvi, Jay
AU - Irukulla, Madhuroopa
AU - McDonnell, William
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
AB - Prediction models for data-starved medical applications lag behind general machine learning solutions, despite their potential to improve early interventions. This is largely due to the assumption that optimization approaches are applied on a balanced distribution of events, yet medical data often has an imbalanced distribution within classes. The curse of dimensionality is further exacerbated by small samples and a high number of features in individual-based risk prediction models. In this paper, we propose a data augmentation system to gradually create synthetic minority samples with a control coefficient, which improves the quality of generated data over time and consequently boosts prediction model performance. This system incrementally adjusts to the data distribution, avoiding overfitting. We evaluate our approach using four synthetic oversampling techniques on real asthma patient data. Our results show that this system enhances classifiers’ overall performance across all four techniques. Specifically, applying the incremental data augmentation approach to three oversampling methods led to an increase in sensitivity of 4.01% to 7.79% in deep transfer learning-based classifiers.
KW - class imbalance problem
KW - control coefficient
KW - data starved contexts
KW - rare event prediction
KW - synthetic minority oversampling technique
UR - http://www.scopus.com/inward/record.url?scp=85202155451&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85202155451&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-68323-7_9
DO - 10.1007/978-3-031-68323-7_9
M3 - Conference contribution
AN - SCOPUS:85202155451
SN - 9783031683220
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 112
EP - 119
BT - Big Data Analytics and Knowledge Discovery - 26th International Conference, DaWaK 2024, Proceedings
A2 - Wrembel, Robert
A2 - Chiusano, Silvia
A2 - Kotsis, Gabriele
A2 - Khalil, Ismail
A2 - Tjoa, A Min
PB - Springer Science and Business Media Deutschland GmbH
T2 - 26th International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2024
Y2 - 26 August 2024 through 28 August 2024
ER -