NEW APPROACH TO ADDRESSING CLASS IMBALANCE IN MEDICAL DATASETS CONSIDERING SPECIFICS
DOI:
https://doi.org/10.37943/21VWQH9068
Keywords:
imbalance, oversampling, medical data analysis, data analysis, classification, noise filtering, nominal data
Abstract
Machine learning is increasingly being integrated into medicine for data processing and analysis, but this brings difficulties such as class imbalance and noisy datasets. Because the problem is widespread, solutions already exist, but all of them abstract away from the medical domain: gender, racial, and other differences are not taken into account. Our resampling algorithm addresses this side of the problem. It splits the dataset by an important feature, selected through the p-value of the Spearman correlation, which makes it possible to consider subgroups of observations without losing their unique characteristics. It then removes noisy data using LOF and Z-score separately for the minority and majority classes, respectively. Synthetic data is generated flexibly, adapting to the dataset through the algorithm's parameters. The algorithm works with both quantitative and nominal features. It was tested on datasets for heart attack, chronic kidney disease, and liver disease, with the Random Forest ensemble method used to train the model. After applying this class balancing method, average improvements were recorded in Accuracy of 36%, in AUC of 15-25%, in Precision of 39-42%, and in Recall of 21-37% compared with the SMOTE and ADASYN algorithms and with the dataset before balancing. Applying the algorithm to medical data can improve classification accuracy and reduce the loss of reliability compared with other resampling methods.
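The splitting and noise-filtering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pick_split_feature`, `filter_noise`, `clean_by_subgroup`), the thresholds (`z_max=3.0`, `n_neighbors=5`), the rule "smallest Spearman p-value against the class label", and the toy data are all assumptions made for illustration. The synthetic-data generation step is omitted because the abstract does not specify it.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.neighbors import LocalOutlierFactor


def pick_split_feature(X, y, candidates):
    # Assumption: the "important" feature is the candidate column whose
    # Spearman correlation with the class label has the smallest p-value.
    pvalues = {j: spearmanr(X[:, j], y).pvalue for j in candidates}
    return min(pvalues, key=pvalues.get)


def filter_noise(X, y, minority_label=1, z_max=3.0, n_neighbors=5):
    keep = np.ones(len(y), dtype=bool)

    # Majority class: drop rows where any feature has |z-score| > z_max.
    maj = np.flatnonzero(y != minority_label)
    if len(maj) > 1:
        std = X[maj].std(axis=0)
        std[std == 0] = 1.0  # constant columns cannot flag outliers
        z = np.abs((X[maj] - X[maj].mean(axis=0)) / std)
        keep[maj[(z > z_max).any(axis=1)]] = False

    # Minority class: drop rows that LOF labels as outliers (-1).
    mino = np.flatnonzero(y == minority_label)
    if len(mino) > n_neighbors:
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        keep[mino[lof.fit_predict(X[mino]) == -1]] = False

    return X[keep], y[keep]


def clean_by_subgroup(X, y, candidates, **kwargs):
    # Split on the selected feature, then filter each subgroup on its own
    # so that subgroup-specific value ranges are not treated as noise.
    split = pick_split_feature(X, y, candidates)
    X_parts, y_parts = [], []
    for value in np.unique(X[:, split]):
        mask = X[:, split] == value
        X_sub, y_sub = filter_noise(X[mask], y[mask], **kwargs)
        X_parts.append(X_sub)
        y_parts.append(y_sub)
    return split, np.vstack(X_parts), np.concatenate(y_parts)


# Toy data (hypothetical): columns 0-1 are binary attributes,
# columns 2-3 are continuous measurements.
rng = np.random.default_rng(0)
n = 200
idx = np.arange(n)
X = np.column_stack([idx % 2, (idx // 2) % 2,
                     rng.normal(size=n), rng.normal(size=n)])
y = (idx % 10 == 1).astype(int)  # 20 minority rows, all with column 0 == 1
X[0, 2] = 50.0  # planted majority-class outlier
X[1, 3] = 40.0  # planted minority-class outlier

split, X_clean, y_clean = clean_by_subgroup(X, y, candidates=[0, 1])
print(split, X.shape, X_clean.shape)
```

In this toy setup column 0 is selected as the split feature because every minority row shares its value, and both planted outliers are removed. Other rows may also be dropped when they exceed the Z-score threshold by chance.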
License
Copyright (c) 2025 Articles are open access under the Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish a manuscript in this journal agree to the following terms:
- The authors reserve the right to authorship of their work and transfer to the journal the right of first publication under the terms of the Creative Commons Attribution License, which allows others to freely distribute the published work with a mandatory link to the original work and the first publication of the work in this journal.
- Authors have the right to conclude independent additional agreements that relate to the non-exclusive distribution of the work in the form in which it was published by this journal (for example, to post the work in the electronic repository of the institution or publish as part of a monograph), providing the link to the first publication of the work in this journal.
- Other terms stated in the Copyright Agreement.