DEVELOPMENT OF A METHOD FOR AUTOMATIC DOCUMENT RECOVERY FOLLOWED BY ANALYSIS OF INTEGRITY AND ABSENCE OF ENCRYPTION FOR FORENSIC PURPOSES

DOI:

https://doi.org/10.37943/25MLBP3346

Keywords:

digital forensics, document recovery, entropy analysis, encryption detection, machine learning, XML validation, BiLSTM, memory dump

Abstract

As digital infrastructures grow increasingly complex, the need for robust forensic tools that can recover and interpret Office documents, particularly Microsoft Word (.docx) files, has become paramount. Traditional recovery tools often struggle with file integrity verification and fail to determine whether a document is encrypted, leading to limited courtroom admissibility and investigative delays. To address this, this work presents ForenDOC, a systematic approach for the automated recovery and forensic examination of fragmented Office Open XML documents obtained from volatile memory sources. The methodology begins with byte-level capture using raw image formats to preserve unallocated and slack space data. It proceeds with signature-based scanning to detect probable document file offsets, followed by automated Extensible Markup Language (XML) schema validation to guarantee structural integrity and filter out corrupted data. To ensure data uniqueness, Secure Hash Algorithm 1 (SHA-1) hashing and textual deduplication are implemented. Furthermore, the framework utilizes an entropy-based analysis using a Shannon entropy threshold of 5.0 to distinguish readable material from encrypted or obfuscated segments, facilitating the prompt triage of suspicious files. The system functions strictly offline via a read-only interface, enforcing stringent security protocols in accordance with ISO/IEC 27001 and National Institute of Standards and Technology (NIST) Special Publication 800-101 standards. The retrieved documents undergo processing via a custom machine learning pipeline. This includes a Random Forest model for encryption detection, achieving 94.7% precision, and a Bidirectional Long Short-Term Memory (BiLSTM) network for semantic classification spanning legal, fraud, medical, darknet, religious, and economic sectors. 
Experimental validation on 7,680 memory fragments yielded 970 signature matches, from which ForenDOC isolated 12 structurally viable files. The system thus filtered out approximately 98.7% of the signature matches as corrupted data, or false positives, that traditional carving tools would otherwise present to investigators. The results validate the practicality of integrating low-level recovery methods with sophisticated classification models within a cohesive forensic framework. The suggested approach improves evidential reliability and investigation efficiency, providing a scalable tool for digital forensics that adheres to international compliance requirements.
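
The signature scanning, SHA-1 deduplication, and entropy triage steps named in the abstract can be illustrated with a minimal Python sketch. It is not the ForenDOC implementation: the function names are hypothetical, the ZIP local file header (PK\x03\x04) is assumed as the signature because .docx files are ZIP containers, and the 5.0 threshold is assumed to be in bits per byte and applied to already extracted (decompressed) content.

```python
import hashlib
import math
from collections import Counter

# Assumption: .docx files are ZIP containers, so the ZIP local file
# header is used here as the carving signature.
ZIP_LOCAL_HEADER = b"PK\x03\x04"

# Threshold value taken from the abstract; the unit (bits per byte)
# is an assumption of this sketch.
ENTROPY_THRESHOLD = 5.0


def find_signature_offsets(image: bytes, signature: bytes = ZIP_LOCAL_HEADER) -> list[int]:
    """Return every byte offset at which the signature occurs in the image."""
    offsets = []
    pos = image.find(signature)
    while pos != -1:
        offsets.append(pos)
        pos = image.find(signature, pos + 1)
    return offsets


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_encrypted(data: bytes, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag content whose entropy exceeds the threshold for closer review."""
    return shannon_entropy(data) > threshold


def deduplicate(fragments: list[bytes]) -> list[bytes]:
    """Keep the first occurrence of each fragment, keyed by its SHA-1 digest."""
    seen = set()
    unique = []
    for fragment in fragments:
        digest = hashlib.sha1(fragment).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(fragment)
    return unique
```

In such a sketch, the offsets returned by find_signature_offsets would be the candidates passed on to structural (XML) validation, and only fragments that survive validation and deduplication would be scored for entropy. Compressed ZIP payloads have high entropy by nature, which is why the sketch assumes the threshold is checked against extracted content rather than raw container bytes.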

Author Biographies

Leila Rzayeva, Astana IT University

PhD, Research and Innovation Center “CyberTech”

Tomiris Zhumakan, Astana IT University

Junior Researcher, Research and Innovation Center “CyberTech”

Aizada Kapatayeva, Astana IT University

Junior Researcher, Research and Innovation Center “CyberTech”

Tabigat Serik, Astana IT University

Junior Researcher, Research and Innovation Center “CyberTech”

Alisher Batkuldin, Astana IT University

Junior Researcher, Research and Innovation Center “CyberTech”

Published

2026-03-30

How to Cite

Rzayeva, L., Zhumakan, T., Kapatayeva, A., Serik, T., & Batkuldin, A. (2026). DEVELOPMENT OF A METHOD FOR AUTOMATIC DOCUMENT RECOVERY FOLLOWED BY ANALYSIS OF INTEGRITY AND ABSENCE OF ENCRYPTION FOR FORENSIC PURPOSES. Scientific Journal of Astana IT University, 25. https://doi.org/10.37943/25MLBP3346

Section

Information Technologies