TY - JOUR
T1 - Synthetic Data Generation and Evaluation Techniques for Classifiers in Data Starved Medical Applications
AU - Bae, Wan D.
AU - Alkobaisi, Shayma
AU - Horak, Matthew
AU - Bankar, Siddheshwari
AU - Bhuvaji, Sartaj
AU - Kim, Sungroul
AU - Park, Choon Sik
N1 - Publisher Copyright:
© 2025 The Authors.
PY - 2025
Y1 - 2025
N2 - With their ability to find solutions among complex relationships of variables, machine learning (ML) techniques are becoming more applicable to various fields, including health risk prediction. However, prediction models are sensitive to the size and distribution of the data they are trained on. ML algorithms rely heavily on vast quantities of training data to make accurate predictions. Ideally, the dataset should have an equal number of samples for each label to encourage the model to make predictions based on the input data rather than the distribution of the training data. In medical applications, class imbalance is a common issue because the occurrence of a disease or risk episode is often rare. This leads to a training dataset where healthy cases outnumber unhealthy ones, resulting in biased prediction models that struggle to detect the minority, unhealthy cases effectively. This paper addresses the problem of class imbalance given the scarcity of training datasets by improving the quality of generated data. We propose an incremental synthetic data generation system that improves data quality over iterations by gradually adjusting to the data distribution, thus avoiding overfitting in classifiers. Through extensive experimental assessments on real asthma patients' datasets, we demonstrate the efficiency and applicability of our proposed system for individual-based health risk prediction models. Incremental SMOTE methods were compared to the original SMOTE variants as well as various architectures of autoencoders. Our incremental data generation system enhances selected state-of-the-art SMOTE methods, resulting in sensitivity improvements for deep transfer learning (TL) classifiers ranging from 4.01% to 7.79%. Compared with the performance of TL without oversampling, the improvement achieved by the incremental SMOTE methods ranged from 27.18% to 40.97%. These results highlight the effectiveness of our technique in predicting asthma risk and its applicability to imbalanced, data-starved medical contexts.
AB - With their ability to find solutions among complex relationships of variables, machine learning (ML) techniques are becoming more applicable to various fields, including health risk prediction. However, prediction models are sensitive to the size and distribution of the data they are trained on. ML algorithms rely heavily on vast quantities of training data to make accurate predictions. Ideally, the dataset should have an equal number of samples for each label to encourage the model to make predictions based on the input data rather than the distribution of the training data. In medical applications, class imbalance is a common issue because the occurrence of a disease or risk episode is often rare. This leads to a training dataset where healthy cases outnumber unhealthy ones, resulting in biased prediction models that struggle to detect the minority, unhealthy cases effectively. This paper addresses the problem of class imbalance given the scarcity of training datasets by improving the quality of generated data. We propose an incremental synthetic data generation system that improves data quality over iterations by gradually adjusting to the data distribution, thus avoiding overfitting in classifiers. Through extensive experimental assessments on real asthma patients' datasets, we demonstrate the efficiency and applicability of our proposed system for individual-based health risk prediction models. Incremental SMOTE methods were compared to the original SMOTE variants as well as various architectures of autoencoders. Our incremental data generation system enhances selected state-of-the-art SMOTE methods, resulting in sensitivity improvements for deep transfer learning (TL) classifiers ranging from 4.01% to 7.79%. Compared with the performance of TL without oversampling, the improvement achieved by the incremental SMOTE methods ranged from 27.18% to 40.97%. These results highlight the effectiveness of our technique in predicting asthma risk and its applicability to imbalanced, data-starved medical contexts.
KW - Autoencoders
KW - class imbalance problem
KW - control coefficient
KW - data starved contexts
KW - rare event prediction
KW - synthetic minority oversampling technique
KW - transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85216278673&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216278673&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2025.3532222
DO - 10.1109/ACCESS.2025.3532222
M3 - Article
AN - SCOPUS:85216278673
SN - 2169-3536
VL - 13
SP - 16584
EP - 16602
JO - IEEE Access
JF - IEEE Access
ER -