SCALABLE NEAR-DUPLICATE DETECTION IN KAZAKH SCIENTIFIC TEXTS VIA SEMANTIC EMBEDDINGS AND OPTIMIZED CANDIDATE FILTERING

DOI:

https://doi.org/10.37943/25NVVS5297

Keywords:

near-duplicate detection, semantic similarity, Kazakh language, agglutinative languages, text canonicalization, indexing, candidate filtering, optimization, transformer-based language models

Abstract

This work addresses the efficient detection of near-duplicate documents in Kazakh scientific texts, a task made difficult by the agglutinative nature of the language and the high computational cost of pairwise document comparison. Traditional approaches based on lexical similarity are ineffective under these conditions, while semantic models, although more accurate, are computationally expensive and scale poorly. To overcome these limitations, the study proposes a scalable framework that combines semantic similarity modeling with three optimization techniques: text canonicalization, efficient indexing, and multi-stage candidate filtering. Canonicalization reduces morphological variability, which makes similarity estimation for Kazakh texts more stable. The indexing mechanism, built on dense vector representations, selects candidate pairs efficiently using approximate nearest neighbor search. A hierarchical filtering strategy further reduces the number of comparisons, while a transformer-based model provides accurate semantic matching. The proposed approach is evaluated on a large-scale dataset of Kazakh scientific abstracts and near-duplicate pairs. The results show that the framework achieves high detection accuracy while significantly reducing computational cost compared with exhaustive pairwise comparison. Dynamic threshold adjustment handles the overlap between the similarity distributions of the duplicate and non-duplicate classes. These results confirm that combining linguistic preprocessing with computational optimization is crucial for scalable near-duplicate detection in low-resource agglutinative languages such as Kazakh. The proposed framework can be applied in plagiarism detection, document deduplication, and large-scale text analysis systems.
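The pipeline outlined in the abstract (canonicalize, index, filter candidates, then match) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the canonicalization rules, the embedding model, the index, or the threshold procedure. All function names here are hypothetical. Character n-gram counts stand in for dense transformer embeddings, a brute-force top-k search stands in for approximate nearest neighbor search, and a fixed threshold stands in for dynamic threshold adjustment.

```python
import math
import re
import unicodedata
from collections import Counter


def canonicalize(text):
    """Normalize Unicode, lowercase, and replace punctuation with spaces.
    Assumption: a crude stand-in for the canonicalization step, whose actual
    rules (e.g. morphological normalization) are not given in the abstract."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def embed(text, n=3):
    """Character n-gram counts as a toy vector.
    Assumption: a real system would use dense transformer embeddings here."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(vectors, top_k=1):
    """Candidate filtering: keep only each document's top_k nearest neighbours.
    Brute force for clarity; an ANN index would replace this loop at scale."""
    pairs = set()
    for i, vi in enumerate(vectors):
        scored = sorted(
            ((cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in scored[:top_k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def near_duplicates(docs, top_k=1, threshold=0.8):
    """Return (i, j, score) for candidate pairs whose similarity meets the
    threshold. Assumption: a fixed threshold, not a dynamically adjusted one."""
    vectors = [embed(canonicalize(d)) for d in docs]
    result = []
    for i, j in sorted(candidate_pairs(vectors, top_k)):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            result.append((i, j, round(score, 3)))
    return result


docs = [
    "Near-duplicate detection in scientific texts.",
    "near duplicate detection in scientific texts",
    "Approximate nearest neighbour search with dense vectors",
]
print(near_duplicates(docs))
```

In this toy run the first two strings become identical after canonicalization, so they are the only pair reported. The point of the candidate stage is that the expensive scorer runs only on the retained pairs rather than on every pair in the collection.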

Author Biographies

Valeriya Kazagasheva, Astana IT University

Master student, School of Artificial Intelligence and Data Science

Oleksandr Kuchanskyi, Astana IT University

Professor, School of Artificial Intelligence and Data Science

Professor, Department of Biomedical Cybernetics, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine

Svitlana Biloshchytska, Astana IT University

Professor, School of Artificial Intelligence and Data Science

Dina Kantayeva, Astana IT University

PhD student, School of Software Engineering

Published

2026-03-30

How to Cite

Kazagasheva, V., Kuchanskyi, O., Biloshchytska, S., & Kantayeva, D. (2026). SCALABLE NEAR-DUPLICATE DETECTION IN KAZAKH SCIENTIFIC TEXTS VIA SEMANTIC EMBEDDINGS AND OPTIMIZED CANDIDATE FILTERING. Scientific Journal of Astana IT University, 25. https://doi.org/10.37943/25NVVS5297

Section

Information Technologies