BUILDING A DEEP SEARCH CRAWLER FOR THE KAZAKH LANGUAGE: A REPRODUCIBLE WEB-SCALE PIPELINE

Madina Mansurova; Assel Ospan; Fakhriddin Nuraliev; Rustam Khamdamov; Talshyn Sarsembayeva

doi:10.37943/25VBID7988

Authors

Madina Mansurova Al-Farabi Kazakh National University https://orcid.org/0000-0002-9680-2758
Assel Ospan Al-Farabi Kazakh National University https://orcid.org/0000-0002-1860-6997
Fakhriddin Nuraliev Tashkent University of Information Technologies named after Muhammad al-Khwarizmi https://orcid.org/0000-0002-0574-9278
Rustam Khamdamov Tashkent University of Information Technologies named after Muhammad al-Khwarizmi https://orcid.org/0000-0003-3796-4631
Talshyn Sarsembayeva Al-Farabi Kazakh National University https://orcid.org/0000-0001-7668-2640

DOI:

https://doi.org/10.37943/25VBID7988

Keywords:

deep search crawler, Kazakh, e-government, breadth-first search (BFS), Django, BeautifulSoup, Playwright, Selenium, JSON schema

Abstract

We present a reproducible, web-scale pipeline for building a Kazakh-language corpus from the national e-government portal. The system treats the website as a directed graph and performs breadth-first traversal to preserve section hierarchy. Static acquisition relies on robust HTTP requests and HTML parsing; for pages with dynamic widgets, we selectively enable a headless layer to render the final DOM prior to extraction. We define a minimal JSON schema aligned with downstream NLP needs (URL, category, titles, cleaned descriptions) and implement normalization (Unicode NFC/NFKC, transliteration repair for Kazakh, boilerplate removal) and fragment-level deduplication. We formalize the crawling–extraction process as an optimization under resource constraints and propose three groups of evaluation measures: field-level quality metrics (precision, recall, F1), category coverage, and completeness gains attributable to headless rendering. Our experimental protocol compares static parsing against a hybrid static+headless setup on multiple portal categories, reports field-wise effectiveness with confidence intervals, and analyzes dominant error sources (DOM drift, client-side rendering, code-switching). Ablation studies quantify the impact of normalization and deduplication. We also outline ethical access (robots.txt compliance, throttling, conditional requests) and provide artifacts to ensure reproducibility (versioned scripts, schema validators, logging). We release open-source scripts, detailed runbooks, and a small, labeled benchmark to facilitate fair comparisons and independent replication across institutions. The resulting corpus targets low-resource Kazakh NLP and e-government analytics, supporting tasks such as classification, terminology normalization, named-entity recognition, and LLM adaptation.
Overall, the proposed pipeline demonstrates that selective headless rendering combined with rigorous normalization is a practical and effective strategy for high-quality data acquisition in dynamically rendered public portals.
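
The traversal, normalization, deduplication, and field-level scoring steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `SITE` dictionary stands in for the portal (no network access), and the record keys, function names, and the choice of SHA-256 as the fragment fingerprint are assumptions made only for this example. Transliteration repair, boilerplate removal, and headless rendering are not shown.

```python
import hashlib
import unicodedata
from collections import deque

# Hypothetical stand-in for the portal: URL -> (text fragments, outgoing links).
SITE = {
    "/": (["Басты бет"], ["/a", "/b"]),
    "/a": (["Қызмет  сипаттамасы", "Ортақ мәтін"], ["/a/1"]),
    "/b": (["Ортақ мәтін"], ["/a/1", "/"]),
    "/a/1": (["Құжат"], []),
}


def normalize(text):
    """Apply Unicode NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def bfs_crawl(site, root, max_pages=100):
    """Visit pages level by level; return (url, depth) pairs in visit order."""
    seen = {root}
    queue = deque([(root, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        for link in site[url][1]:
            if link in site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


def dedup_fragments(site, order):
    """Keep the first occurrence of each normalized fragment across pages."""
    seen_hashes = set()
    records = []
    for url, depth in order:
        kept = []
        for fragment in site[url][0]:
            clean = normalize(fragment)
            digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                kept.append(clean)
        records.append({"url": url, "depth": depth, "descriptions": kept})
    return records


def field_prf(predicted, gold):
    """Precision, recall, and F1 for one field, comparing sets of values."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    order = bfs_crawl(SITE, "/")
    for record in dedup_fragments(SITE, order):
        print(record)
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, which is what lets the depth value reflect the section hierarchy. In this toy graph the fragment shared by `/a` and `/b` is kept only for `/a`, the page visited first.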

Author Biographies

Madina Mansurova, Al-Farabi Kazakh National University

Professor, Head of the Department of Artificial Intelligence and Big Data

Assel Ospan, Al-Farabi Kazakh National University

Master's degree, Senior Lecturer, Department of Artificial Intelligence and Big Data

Fakhriddin Nuraliev, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi

Doctor of Technical Sciences, Professor

Rustam Khamdamov, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi

Doctor of Technical Sciences, Professor, Head of the Lab. "Smart Systems. Internet of Things"

Talshyn Sarsembayeva, Al-Farabi Kazakh National University

Master's degree, Senior Lecturer, Department of Artificial Intelligence and Big Data
