SOLVING THE PROBLEM OF DETECTING PHISHING WEBSITES USING ENSEMBLE LEARNING MODELS
DOI:
https://doi.org/10.37943/12OYRS4391
Keywords:
phishing detection, Ensemble Learning, imbalanced classification, gradient boosting
Abstract
Phishing is one of the easiest ways for attackers to obtain personal information, so phishing detection has become an active research area aimed at countering such attacks. Detecting malicious websites is essential to prevent the spread of malware and to protect end users from becoming victims. However, malicious URL detection is still not well understood, owing to a lack of features and inaccurate classification. Drawing on a review of previous studies, this work addresses the problem of detecting phishing websites using Ensemble Learning. The aim is to select the best-performing gradient boosting algorithm for classifying phishing websites. AdaBoost, CatBoost, and Gradient Boosting Classifier were chosen as the Ensemble Learning algorithms and used to improve classifier performance. The parameters of each algorithm were studied experimentally to find the optimal classification model. Experiments were carried out on a dataset of information extracted from the contents of a URL: the main URL, domain, directory, and file. A thorough Exploratory Data Analysis (EDA) was performed, and correlation analysis identified the main dependencies and patterns that characterize phishing resources. The ROC AUC score was chosen as the evaluation metric for the algorithms. The AdaBoost Classifier gave the best result for predicting phishing websites, with an average ROC AUC score of 99%. The experimental results are presented in graphs and tables.
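The abstract names ROC AUC as the evaluation metric. As a minimal sketch of what that score measures (not the authors' code), the block below computes it from ranks: the probability that a randomly chosen phishing sample receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the paper or its dataset.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based ROC AUC for binary labels (1 = phishing, 0 = legitimate)."""
    pairs = sorted(zip(scores, labels))  # ascending by score
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    # Sum the ranks of the positive samples, giving tied scores their average rank.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    # Mann-Whitney U statistic for the positives, normalised by the number of pairs.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Hypothetical toy data: one phishing URL among nine legitimate ones.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
constant_scores = [0.0] * 10                # a model that scores every URL the same
ranking_scores = [0.1 * k for k in range(10)]  # the phishing URL gets the top score

auc_constant = roc_auc(toy_labels, constant_scores)
auc_ranking = roc_auc(toy_labels, ranking_scores)
```

On this toy data a model that assigns every URL the same score reaches an AUC of 0.5 even though labelling everything "legitimate" would be 90% accurate, which is one reason a ranking metric is a common choice when the classes are imbalanced.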
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.