AN INFORMATION TECHNOLOGY APPROACH TO PREDICT BREAST CANCER USING MACHINE LEARNING
DOI:
https://doi.org/10.37943/24UTRW4400
Keywords:
information technology, breast cancer, machine learning, model and feature selection, 5-fold cross-validation
Abstract
Breast cancer remains the most commonly diagnosed malignancy in women worldwide and a leading cause of cancer-related mortality. This study describes an information technology approach for evaluating interpretable machine-learning methods for breast cancer prediction from routine clinical data and situates their performance against prior literature. All calculations are based on the Breast Cancer Wisconsin (Diagnostic) dataset hosted by the UCI Machine Learning Repository, which contains 569 instances, each corresponding to a breast mass labeled malignant or benign. Four supervised machine learning models were applied: Logistic Regression with an L1 penalty, Random Forest, Decision Tree, and Naïve Bayes. The models were compared on area under the ROC curve (AUC), accuracy, sensitivity, and specificity, using DeLong’s test with Holm correction. The reproducible pipeline consisted of preprocessing, recursive feature elimination for feature selection, and 5-fold cross-validation for hyperparameter tuning. Among the four models, the L1-penalized Logistic Regression yielded the best results, with an AUC of 99.6% and accuracy, sensitivity, and specificity of 97.3%, 95.2%, and 98.6%, respectively, on the test sets. This study illustrates how supervised machine learning methods can be integrated into diagnostic systems to produce early, accurate, and interpretable diagnoses. These results support the proposed information technology approach to breast cancer prediction. Limitations include a moderately sized, homogeneous cohort and a restricted focus on structured variables, which may enhance internal validity while limiting generalizability. Our findings contribute to an emerging body of literature suggesting that well-tuned, regularized logistic regression provides a reasonable baseline for breast cancer risk prediction and related outcomes, and a pragmatic route toward trustworthy AI in oncology.
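The four metrics named in the abstract can be made concrete with a minimal sketch. This is not the study's code: the function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical, chosen only to show how accuracy, sensitivity, and specificity follow from confusion-matrix counts and how AUC can be read as the probability that a malignant case receives a higher score than a benign one.

```python
from __future__ import annotations

from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating label 1 as malignant (the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Accuracy, sensitivity, and specificity; assumes both classes are present."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of malignant cases detected
        "specificity": tn / (tn + fp),  # share of benign cases correctly cleared
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = malignant, 0 = benign; scores are predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = summary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Because AUC depends only on how the scores rank the cases, it does not change with the decision threshold, whereas accuracy, sensitivity, and specificity do. That is one reason the abstract's AUC can sit above its threshold-based figures.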
Copyright (c) 2025 Articles are open access under the Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.