AN INFORMATION TECHNOLOGY APPROACH TO PREDICT BREAST CANCER USING MACHINE LEARNING

Zamart Ramazanova; Yeldar Baiken; Bakhyt Matkarimov; Arshat Urazbayev; Askhat Myngbay; Bauyrzhan Aituov

doi:10.37943/24UTRW4400

Authors

Zamart Ramazanova Nazarbayev University, Kazakhstan https://orcid.org/0000-0002-2623-8901
Yeldar Baiken Nazarbayev University, Kazakhstan https://orcid.org/0000-0003-1742-2536
Bakhyt Matkarimov Nazarbayev University, Kazakhstan https://orcid.org/0000-0003-0775-7324
Arshat Urazbayev Nazarbayev University, Kazakhstan https://orcid.org/0000-0002-4763-1438
Askhat Myngbay K.Zhubanov Aktobe Regional University, Kazakhstan https://orcid.org/0000-0002-3867-847X
Bauyrzhan Aituov Center for BioEnergy Research LLP, Kazakhstan https://orcid.org/0009-0008-3001-8144

DOI:

https://doi.org/10.37943/24UTRW4400

Keywords:

information technology, breast cancer, machine learning, model and feature selection, 5-fold cross-validation

Abstract

Breast cancer remains the most commonly diagnosed malignancy in women worldwide and a leading cause of cancer-related mortality. This study describes an information technology approach for evaluating interpretable machine learning methods for breast cancer prediction from routine clinical data and situates their performance against prior literature. All calculations use the Breast Cancer Wisconsin Diagnostic dataset (569 instances) hosted by the UCI Machine Learning Repository, in which each sample corresponds to a breast mass labeled malignant or benign. Four supervised machine learning models were applied: Logistic Regression with an L1 penalty, Random Forest, Decision Tree, and Naïve Bayes. The models were compared on area under the ROC curve (AUC), accuracy, sensitivity, and specificity using DeLong’s test with Holm correction. The reproducible pipeline consisted of preprocessing, recursive feature elimination for feature selection, and 5-fold cross-validation for hyperparameter tuning. Among the four models, the L1-penalized Logistic Regression performed best, with an AUC of 99.6% and accuracy, sensitivity, and specificity of 97.3%, 95.2%, and 98.6%, respectively, on the test sets. The results illustrate how supervised machine learning methods can be integrated into diagnostic systems to produce early, accurate, and interpretable diagnoses, and they support the proposed information technology approach for breast cancer prognosis. Limitations of the study are a moderately sized, homogeneous cohort and a restricted focus on structured variables, which may enhance internal validity while restricting generalizability. Our findings contribute to an emerging body of literature suggesting that well-tuned, regularized logistic regression provides a reasonable baseline against which breast cancer risk and other study outcomes can be compared, and a pragmatic route toward trustworthy AI in oncology.
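The pipeline summarized in the abstract (preprocessing, recursive feature elimination, 5-fold cross-validation for tuning, and evaluation by AUC, accuracy, sensitivity, and specificity) could be sketched with scikit-learn roughly as follows. This is only an illustration, not the authors' code: the 80/20 split, the choice of 10 retained features, the grid of C values, the random seed, and the function name fit_and_evaluate are all assumptions, since the abstract does not state them. DeLong's test and the Holm correction are not included, and only the L1-penalized Logistic Regression is shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_evaluate(seed=0):
    """Tune an L1-penalized logistic regression and score it on a held-out split."""
    # scikit-learn ships a copy of the Wisconsin Diagnostic data (569 rows, 30 features).
    X, y = load_breast_cancer(return_X_y=True)
    # In this copy 0 = malignant and 1 = benign; flip so malignant is the positive class.
    y = 1 - y

    # Assumed 80/20 stratified split; the abstract does not give the ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )

    def l1_logreg():
        # liblinear is one of the solvers that supports the L1 penalty.
        return LogisticRegression(penalty="l1", solver="liblinear")

    pipe = Pipeline([
        ("scale", StandardScaler()),                        # preprocessing
        ("rfe", RFE(l1_logreg(), n_features_to_select=10)),  # assumed feature count
        ("clf", l1_logreg()),
    ])

    # 5-fold cross-validation over an assumed grid of regularization strengths.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]},
                        scoring="roc_auc", cv=cv)
    grid.fit(X_train, y_train)

    # Evaluate the refit best pipeline on the held-out test split.
    proba = grid.predict_proba(X_test)[:, 1]
    pred = grid.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_test, proba),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate for malignant cases
        "specificity": tn / (tn + fp),  # true negative rate for benign cases
    }


if __name__ == "__main__":
    for name, value in fit_and_evaluate().items():
        print(f"{name}: {value:.3f}")
```

Placing the scaler and the feature selector inside the Pipeline means both are refit on each training fold, so the cross-validated scores are not contaminated by information from the validation fold. The numbers this sketch prints depend on the assumed split and grid and should not be expected to match the figures reported in the abstract.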

Author Biographies

Zamart Ramazanova, Nazarbayev University, Kazakhstan

MS in Physics, Researcher, Department of Electrical and Computer Engineering and National Laboratory Astana

Yeldar Baiken, Nazarbayev University, Kazakhstan

Ph.D., Researcher, National Laboratory Astana
Senior Researcher, Center for BioEnergy Research LLP

Bakhyt Matkarimov, Nazarbayev University, Kazakhstan

Dr.Sci., Leading Researcher, National Laboratory Astana

Arshat Urazbayev, Nazarbayev University, Kazakhstan

Ph.D., Senior Researcher, National Laboratory Astana

Askhat Myngbay, K.Zhubanov Aktobe Regional University, Kazakhstan

Ph.D., Senior Researcher, National Laboratory Astana

Bauyrzhan Aituov, Center for BioEnergy Research LLP, Kazakhstan

General Director

