DEVELOPMENT OF A METHOD FOR AUTOMATIC DOCUMENT RECOVERY FOLLOWED BY ANALYSIS OF INTEGRITY AND ABSENCE OF ENCRYPTION FOR FORENSIC PURPOSES
DOI:
https://doi.org/10.37943/25MLBP3346
Keywords:
digital forensics, document recovery, entropy analysis, encryption detection, machine learning, XML validation, BiLSTM, memory dump
Abstract
As digital infrastructures grow increasingly complex, the need for robust forensic tools that can recover and interpret Office documents, particularly Microsoft Word (.docx) files, has become paramount. Traditional recovery tools often struggle with file integrity verification and fail to determine whether a document is encrypted, leading to limited courtroom admissibility and investigative delays. To address this, this work presents ForenDOC, a systematic approach for the automated recovery and forensic examination of fragmented Office Open XML documents obtained from volatile memory sources. The methodology begins with byte-level capture using raw image formats to preserve unallocated and slack space data. It proceeds with signature-based scanning to detect probable document file offsets, followed by automated Extensible Markup Language (XML) schema validation to guarantee structural integrity and filter out corrupted data. To ensure data uniqueness, Secure Hash Algorithm 1 (SHA-1) hashing and textual deduplication are implemented. Furthermore, the framework utilizes an entropy-based analysis using a Shannon entropy threshold of 5.0 to distinguish readable material from encrypted or obfuscated segments, facilitating the prompt triage of suspicious files. The system functions strictly offline via a read-only interface, enforcing stringent security protocols in accordance with ISO/IEC 27001 and National Institute of Standards and Technology (NIST) Special Publication 800-101 standards. The retrieved documents undergo processing via a custom machine learning pipeline. This includes a Random Forest model for encryption detection, achieving 94.7% precision, and a Bidirectional Long Short-Term Memory (BiLSTM) network for semantic classification spanning legal, fraud, medical, darknet, religious, and economic sectors. 
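The low-level steps named above (signature scanning, SHA-1 deduplication, and entropy triage) can be illustrated with a minimal Python sketch. It is not the ForenDOC implementation. It assumes that the ZIP local file header `PK\x03\x04` serves as the document signature (a .docx file is a ZIP container), that entropy is measured in bits per byte, and that values above the 5.0 threshold are the ones flagged as possibly encrypted. All function names are hypothetical.

```python
import hashlib
import math
from collections import Counter

# Assumption: .docx files are ZIP containers, so the ZIP local file
# header is used here as the carving signature.
ZIP_LOCAL_HEADER = b"PK\x03\x04"
# Threshold quoted in the abstract; the unit (bits per byte) is assumed.
ENTROPY_THRESHOLD = 5.0


def find_signature_offsets(dump: bytes, signature: bytes = ZIP_LOCAL_HEADER) -> list:
    """Return every offset at which the signature occurs in the dump."""
    offsets = []
    pos = dump.find(signature)
    while pos != -1:
        offsets.append(pos)
        pos = dump.find(signature, pos + 1)
    return offsets


def deduplicate(fragments: list) -> list:
    """Keep the first fragment seen for each distinct SHA-1 digest."""
    seen = set()
    unique = []
    for fragment in fragments:
        digest = hashlib.sha1(fragment).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(fragment)
    return unique


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_encrypted(data: bytes, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag content whose entropy exceeds the threshold (assumed direction)."""
    return shannon_entropy(data) > threshold
```

Compressed ZIP members also have high entropy, so a check of this kind is presumably meaningful only on decompressed document parts such as the extracted XML or text; that placement is an assumption, not something the abstract states.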
Experimental validation on 7,680 memory fragments yielded 970 signature matches, from which ForenDOC isolated 12 structurally viable files. The system thus filtered out approximately 98.7% of the signature matches as corrupted data or false positives that traditional carving tools would otherwise present to investigators. The results validate the practicality of integrating low-level recovery methods with sophisticated classification models within a cohesive forensic framework. The suggested approach improves evidential reliability and investigation efficiency, providing a scalable tool for digital forensics that adheres to international compliance requirements.
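The filtering figure can be checked directly from the reported counts. The short calculation below assumes that the percentage is taken relative to the 970 signature matches rather than the 7,680 fragments; the variable names are illustrative.

```python
# Counts reported in the abstract.
fragments = 7680
signature_matches = 970
viable_files = 12

# Signature matches that did not survive validation.
rejected = signature_matches - viable_files
rejection_rate = rejected / signature_matches

# Share of all fragments that produced a signature match.
match_rate = signature_matches / fragments

print(f"rejected {rejected} of {signature_matches} matches ({rejection_rate:.2%})")
print(f"signature match rate: {match_rate:.2%}")
```

Under that assumption, 958 of the 970 matches are rejected, which agrees with the figure of approximately 98.7% quoted above.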
License
Copyright (c) 2026. Articles are open access under the Creative Commons License.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish a manuscript in this journal agree to the following terms:
- The authors reserve the right to authorship of their work and transfer to the journal the right of first publication under the terms of the Creative Commons Attribution License, which allows others to freely distribute the published work with a mandatory link to the original work and the first publication of the work in this journal.
- Authors have the right to conclude independent additional agreements that relate to the non-exclusive distribution of the work in the form in which it was published by this journal (for example, to post the work in the electronic repository of the institution or publish as part of a monograph), providing the link to the first publication of the work in this journal.
- Other terms stated in the Copyright Agreement.