BUILDING A DEEP SEARCH CRAWLER FOR THE KAZAKH LANGUAGE: A REPRODUCIBLE WEB-SCALE PIPELINE
DOI:
https://doi.org/10.37943/25VBID7988
Keywords:
deep search crawler, Kazakh, e-government, breadth-first search (BFS), Django, BeautifulSoup, Playwright, Selenium, JSON schema
Abstract
We present a reproducible, web-scale pipeline for building a Kazakh-language corpus from the national e-government portal. The system treats the website as a directed graph and performs breadth-first traversal to preserve section hierarchy. Static acquisition relies on robust HTTP requests and HTML parsing; for pages with dynamic widgets, we selectively enable a headless layer to render the final DOM prior to extraction. We define a minimal JSON schema aligned with downstream NLP needs (URL, category, titles, cleaned descriptions) and implement normalization (Unicode NFC/NFKC, transliteration repair for Kazakh, boilerplate removal) and fragment-level deduplication. To strengthen the scientific contribution, we formalize the crawling–extraction process as an optimization under resource constraints and propose field-level quality metrics (precision, recall, F1), coverage of categories, and completeness gains attributable to headless rendering. Our experimental protocol compares static parsing against a hybrid static+headless setup on multiple portal categories, reports field-wise effectiveness with confidence intervals, and analyzes dominant error sources (DOM drift, client-side rendering, code-switching). Ablation studies quantify the impact of normalization and deduplication. We also outline ethical access (robots.txt compliance, throttling, conditional requests) and provide artifacts to ensure reproducibility (versioned scripts, schema validators, logging). We release open-source scripts, detailed runbooks, and a small, labeled benchmark to facilitate fair comparisons and independent replication across institutions. The resulting corpus targets low-resource Kazakh NLP and e-government analytics, supporting tasks such as classification, terminology normalization, named-entity recognition, and LLM adaptation.
Overall, the proposed pipeline demonstrates that selective headless rendering combined with rigorous normalization is a practical and effective strategy for high-quality data acquisition in dynamically rendered public portals.
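The breadth-first traversal with normalization and fragment-level deduplication described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `SITE` graph, the functions `normalize` and `bfs_crawl`, the record fields `depth` and `fragments`, and the choice of SHA-256 hashing for deduplication are all assumptions made for illustration, and a real crawler would fetch and parse pages instead of reading a dictionary.

```python
import hashlib
import unicodedata
from collections import deque

# Hypothetical in-memory site: URL -> (outgoing links, text fragments).
# A real crawler would fetch and parse pages instead of reading this dict.
SITE = {
    "/": (["/services", "/news"], ["Басты бет"]),
    "/services": (["/services/tax", "/"], ["Қызметтер", "Басты бет"]),
    "/news": (["/services"], ["Жаңалықтар"]),
    "/services/tax": ([], ["Салық", "Қызметтер"]),
}


def normalize(text):
    """Apply Unicode NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def bfs_crawl(site, start, max_pages=100):
    """Visit pages breadth-first; keep each normalized fragment only once."""
    queue = deque([(start, 0)])   # (url, depth) pairs in FIFO order
    visited = {start}             # URLs already enqueued
    seen_hashes = set()           # digests of fragments already emitted
    records = []
    while queue and len(records) < max_pages:
        url, depth = queue.popleft()
        links, fragments = site.get(url, ([], []))
        kept = []
        for fragment in fragments:
            clean = normalize(fragment)
            digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
            if clean and digest not in seen_hashes:
                seen_hashes.add(digest)
                kept.append(clean)
        records.append({"url": url, "depth": depth, "fragments": kept})
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return records


records = bfs_crawl(SITE, "/")
```

Because the queue is first-in, first-out, every page at depth d is recorded before any page at depth d+1, which is how the traversal keeps the section hierarchy visible in the output order. Repeated fragments such as navigation labels are kept only on the first page where they appear.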
License
Copyright (c) 2026 Articles are open access under the Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish a manuscript in this journal agree to the following terms:
- The authors retain authorship of their work and transfer to the journal the right of first publication under the terms of the Creative Commons Attribution License, which allows others to freely distribute the published work with a mandatory link to the original work and its first publication in this journal.
- Authors have the right to conclude independent additional agreements that relate to the non-exclusive distribution of the work in the form in which it was published by this journal (for example, to post the work in the electronic repository of the institution or publish as part of a monograph), providing the link to the first publication of the work in this journal.
- Other terms stated in the Copyright Agreement.