NEW APPROACH TO ADDRESSING CLASS IMBALANCE IN MEDICAL DATASETS CONSIDERING SPECIFICS

DOI:

https://doi.org/10.37943/21VWQH9068

Keywords:

imbalance, oversampling, medical data analysis, data analysis, classification, noise filtering, nominal data

Abstract

Machine learning is increasingly being integrated into medicine for data processing and analysis, but difficulties such as class imbalance and noisy datasets arise. Because the problem is widespread, solutions already exist, but all of them abstract away from the medical domain: gender, racial, and other differences are not taken into account. Our resampling algorithm addresses this side of the problem. It splits the dataset by an important feature, selected through the p-value of the Spearman correlation, which makes it possible to consider subgroups of observations without losing their unique characteristics. Noisy data are removed separately for the minority and majority classes, using LOF and Z-score, respectively. Synthetic data are generated flexibly, adapting to the dataset through the algorithm's parameters. Both quantitative and nominal features are supported. The algorithm was tested on datasets for heart attack, chronic kidney disease, and liver disease, with the Random Forest ensemble method used to train the model. After applying this class balancing method, average improvements of 36% in Accuracy, 15-25% in AUC, 39-42% in Precision, and 21-37% in Recall were recorded compared with the SMOTE and ADASYN algorithms and with the dataset before balancing. Applying the algorithm to medical data can improve classification accuracy and reduce the loss of reliability compared with other resampling methods.
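The two preprocessing ideas named in the abstract, choosing a split feature by Spearman p-value and filtering noise per class, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the rule "pick the feature with the smallest p-value", the binary label encoding, and the defaults `n_neighbors=5` and `z_threshold=3.0` are all assumptions made for illustration. The synthetic-data generation step is left out because the abstract does not specify it.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.neighbors import LocalOutlierFactor


def pick_split_feature(X, y):
    """Return the column index whose Spearman correlation with y has the
    smallest p-value (assumed selection rule, not taken from the paper)."""
    p_values = [spearmanr(X[:, j], y).pvalue for j in range(X.shape[1])]
    return int(np.nanargmin(p_values))


def filter_noise(X, y, minority_label=1, n_neighbors=5, z_threshold=3.0):
    """Drop minority rows flagged by LOF and majority rows where any
    feature has |z| above z_threshold (hypothetical defaults)."""
    keep = np.ones(len(y), dtype=bool)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)

    # LOF marks inliers as 1 and outliers as -1; skip tiny minority groups.
    if len(minority) > n_neighbors:
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        keep[minority] = lof.fit_predict(X[minority]) == 1

    # Z-scores are computed per feature within the majority class only.
    if len(majority) > 1:
        mu = X[majority].mean(axis=0)
        sigma = X[majority].std(axis=0)
        sigma[sigma == 0] = 1.0  # constant columns never trigger the filter
        z = np.abs((X[majority] - mu) / sigma)
        keep[majority] = (z <= z_threshold).all(axis=1)

    return X[keep], y[keep]
```

Under these assumptions, a dataset would first be split into subgroups by the values of the column returned by `pick_split_feature` (for example, one subgroup per category of a nominal feature), and `filter_noise` would then be applied to each subgroup before any oversampling.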

Author Biographies

Zholdas Buribayev, Al-Farabi Kazakh National University, Kazakhstan

PhD and Acting Associate Professor, Department of Computer Science

Ainur Yerkos, Al-Farabi Kazakh National University, Kazakhstan

PhD candidate, Department of Computer Science

Zhibek Zhetpisbay, Al-Farabi Kazakh National University, Kazakhstan

Bachelor’s student, Department of Computer Science

Published

2025-03-30

How to Cite

Buribayev, Z., Yerkos, A., & Zhetpisbay, Z. (2025). NEW APPROACH TO ADDRESSING CLASS IMBALANCE IN MEDICAL DATASETS CONSIDERING SPECIFICS. Scientific Journal of Astana IT University, 21. https://doi.org/10.37943/21VWQH9068

Section

Information Technologies