SCALABLE NEAR-DUPLICATE DETECTION IN KAZAKH SCIENTIFIC TEXTS VIA SEMANTIC EMBEDDINGS AND OPTIMIZED CANDIDATE FILTERING
DOI:
https://doi.org/10.37943/25NVVS5297
Keywords:
near-duplicate detection, semantic similarity, Kazakh language, agglutinative languages, text canonicalization, indexing, candidate filtering, optimization, transformer-based language models
Abstract
This work considers the problem of efficient detection of near-duplicate documents in Kazakh scientific texts, which is particularly challenging due to the agglutinative nature of the language and the high computational cost of pairwise document comparison. Traditional approaches based on lexical similarity are ineffective under such conditions, while semantic models, although more accurate, are computationally expensive and scale poorly. To overcome these limitations, the study proposes a scalable framework that combines semantic similarity modeling with optimization techniques, including text canonicalization, efficient indexing, and multi-stage candidate filtering. The canonicalization process reduces morphological variability, increasing the stability of similarity estimation for Kazakh texts. The indexing mechanism, based on dense vector representations, enables efficient selection of candidate pairs using approximate nearest neighbor search. The hierarchical filtering strategy further reduces the number of comparisons, while a transformer-based model provides accurate semantic matching. The proposed approach is evaluated on a large-scale dataset of Kazakh scientific abstracts and near-duplicate pairs. The results demonstrate that the framework achieves high detection accuracy while significantly reducing computational costs compared to exhaustive pairwise comparison. The use of dynamic threshold adjustment allows effective handling of overlapping similarity distributions between duplicate and non-duplicate classes. The obtained results confirm that the combination of linguistic preprocessing and computational optimization is crucial for scalable near-duplicate detection in low-resource agglutinative languages such as Kazakh. The proposed framework can be applied in plagiarism detection, document deduplication, and large-scale text analysis systems.
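The candidate-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for transformer embeddings, the function names `candidate_pairs` and `near_duplicates` are hypothetical, the neighbor search is exact rather than approximate, and the fixed `threshold` replaces the dynamic threshold adjustment, whose details the abstract does not give.

```python
import numpy as np


def normalize(embeddings):
    """L2-normalize rows so that a dot product equals cosine similarity."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)


def candidate_pairs(embeddings, k=1):
    """Stage 1: keep only each document's k most similar neighbors.

    Exact brute-force search is used here for clarity; an approximate
    nearest neighbor index would take its place at scale (assumption).
    """
    unit = normalize(embeddings)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # a document is not its own candidate
    top_k = np.argsort(-sims, axis=1)[:, :k]
    pairs = set()
    for i, neighbors in enumerate(top_k):
        for j in neighbors:
            j = int(j)
            pairs.add((min(i, j), max(i, j)))  # store each pair once
    return pairs


def near_duplicates(embeddings, k=1, threshold=0.9):
    """Stage 2: score the surviving pairs and keep those above the threshold."""
    unit = normalize(embeddings)
    result = []
    for i, j in sorted(candidate_pairs(embeddings, k)):
        score = float(unit[i] @ unit[j])
        if score >= threshold:
            result.append((i, j, score))
    return result


# Toy embeddings (hypothetical): documents 0 and 1 point in almost the same direction.
docs = np.array([
    [1.00, 0.00, 0.00],
    [0.99, 0.10, 0.00],
    [0.00, 1.00, 0.00],
    [0.00, 0.00, 1.00],
])
found = near_duplicates(docs, k=1, threshold=0.9)
```

With `k=1`, at most one candidate pair per document reaches the scoring stage instead of all six possible pairs, which is the source of the savings over exhaustive pairwise comparison.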
License
Copyright (c) 2026 Articles are open access under the Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.