TY - JOUR
T1 - Mixed Data Imputation Using Generative Adversarial Networks
AU - Khan, Wasif
AU - Zaki, Nazar
AU - Ahmad, Amir
AU - Masud, Mohammad Mehedy
AU - Ali, Luqman
AU - Ali, Nasloon
AU - Ahmed, Luai A.
N1 - Funding Information:
This work was supported by a grant from Zayed Center for Health Sciences, United Arab Emirates University (31R239).
Publisher Copyright:
© 2013 IEEE.
PY - 2022
Y1 - 2022
AB - Missing values are common in real-world datasets and pose a significant challenge to the performance of statistical and machine learning models. Missing values are generally imputed using statistical methods, such as the mean, median, or mode, or using machine learning approaches. These approaches are limited to either numerical or categorical data. Imputation in mixed datasets that contain both numerical and categorical attributes is challenging and has received little attention. Machine learning-based imputation algorithms usually require a large amount of training data. However, obtaining such data is difficult. Furthermore, no considerable work in the literature has focused on the effects of training and testing size with increasing amounts of missing data. To address this gap, we proposed that increasing the amount of training data will improve imputation performance. We first used generative adversarial network (GAN) methods to increase the amount of training data. We considered two state-of-the-art GANs (tabular and conditional tabular) to add synthetic samples generated from observed data at different synthetic sample ratios. We then used three state-of-the-art imputation models that can handle mixed data: MissForest, multivariate imputation by chained equations (MICE), and denoising autoencoder (DAE). We designed robust experimental setups on four publicly available datasets, using different training-testing splits with increasing missingness ratios. Extensive experimental results show that incorporating synthetic samples into the training data achieves better performance than the baseline methods for mixed data imputation in both categorical and numerical variables, especially for large missingness ratios.
KW - GANs
KW - MICE
KW - Mixed data imputation
KW - denoising auto encoders
KW - miss forest
KW - missing data
UR - http://www.scopus.com/inward/record.url?scp=85141525791&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141525791&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2022.3218067
DO - 10.1109/ACCESS.2022.3218067
M3 - Article
AN - SCOPUS:85141525791
SN - 2169-3536
VL - 10
SP - 124475
EP - 124490
JO - IEEE Access
JF - IEEE Access
ER -