SOLVING THE PROBLEM OF DETECTING PHISHING WEBSITES USING ENSEMBLE LEARNING MODELS

Dinara Kaibassova; Margulan Nurtay; Ardak Tau; Mira Kissina

doi:10.37943/12OYRS4391

Authors

Dinara Kaibassova, Abylkas Saginov Karagandy Technical University, https://orcid.org/0000-0002-8410-7758
Margulan Nurtay, Abylkas Saginov Karagandy Technical University, https://orcid.org/0000-0002-0786-6195
Ardak Tau, Abylkas Saginov Karagandy Technical University, https://orcid.org/0000-0003-4883-6328
Mira Kissina, Abylkas Saginov Karagandy Technical University, https://orcid.org/0000-0003-2232-1203

DOI:

https://doi.org/10.37943/12OYRS4391

Keywords:

phishing detection, ensemble learning, imbalanced classification, gradient boosting

Abstract

Phishing is one of the easiest ways for attackers to obtain personal information, so phishing detection has become an active area of research aimed at countering such attacks. Detecting malicious websites is essential to prevent the spread of malware and to protect end users from becoming victims. However, malicious URL detection remains insufficiently understood because of a lack of informative features and inaccurate classification. Drawing on a review of previous studies, this work addresses the problem of detecting phishing websites using Ensemble Learning. The aim is to select the best-performing algorithm for classifying phishing websites from among gradient boosting algorithms. AdaBoost, CatBoost, and Gradient Boosting Classifier were chosen as the Ensemble Learning algorithms and used to improve classifier performance. The parameters of each algorithm were studied experimentally to find the optimal classification model. The experiments were carried out on a dataset containing information extracted from the components of a URL: main URL, domain, directory, and file. A thorough Exploratory Data Analysis (EDA) was performed, in which correlation analysis identified the main dependencies and patterns that characterize phishing resources. The ROC AUC score was chosen as the evaluation metric for the algorithms. The best result in predicting phishing websites was achieved by the AdaBoost Classifier, with an average ROC AUC score of 99%. The experimental results are presented in graphs and tables.
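The abstract names ROC AUC as the evaluation metric. As a minimal sketch of what that score measures (not the authors' implementation), the function below computes it through the rank-sum formulation: the probability that a randomly chosen phishing sample receives a higher score than a randomly chosen legitimate one. The function name `roc_auc`, the label encoding (1 = phishing, 0 = legitimate), and the sample values are assumptions made for illustration only and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: sequence of 0/1 (assumed encoding: 1 = phishing, 0 = legitimate)
    scores: sequence of classifier scores, higher = more likely phishing
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks; tied scores share the average of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one sample of each class")

    # U statistic of the positive class, normalised by the number of
    # (positive, negative) pairs.
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Illustrative values only (not data from the study).
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(example_labels, example_scores))  # 0.75
```

Because the score depends only on how positives rank relative to negatives, it does not change with the ratio of phishing to legitimate samples, which is one reason it is commonly used for imbalanced classification.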

Author Biographies

Dinara Kaibassova, Abylkas Saginov Karagandy Technical University

PhD, Acting Associate Professor of the Department of Information and Computing Systems

Margulan Nurtay, Abylkas Saginov Karagandy Technical University

Master's student in Computer Science

Ardak Tau, Abylkas Saginov Karagandy Technical University

Senior lecturer at the Information and Computing Systems Department

Mira Kissina, Abylkas Saginov Karagandy Technical University

Senior lecturer at the Information and Computing Systems Department

