AUDIO-TO-TEXT TRANSLATION FOR THE HARD OF HEARING: A WHISPER MODEL-BASED STUDY

Authors

DOI:

https://doi.org/10.37943/22SNOK5872

Keywords:

Whisper model, Audio-to-text transcription, Hearing impairments, Machine learning

Abstract

This study investigates the effectiveness of the Whisper model for audio-to-text transcription, specifically targeting the enhancement of accessibility for individuals with hearing impairments. The research focuses on the processing of audio recordings obtained from WhatsApp messenger, which often contain significant background noise that complicates speech recognition. To address this issue, advanced audio processing techniques were employed, including the use of the Librosa library and the Noisereduce package for noise reduction. The spectral gating methods applied in this study effectively diminished wind noise and other ambient sounds, allowing for clearer recognition of spoken content. To ensure the quality of the processed audio, we assessed its clarity using a SimpleRNN model. The training results demonstrated a progressive reduction in loss values across epochs, confirming the successful enhancement of audio quality. Once the audio files were adequately cleaned, we utilized the Whisper model, a sophisticated machine learning tool for speech recognition developed by OpenAI, to transcribe the audio into text. The transcription process yielded accurate Kazakh language output, despite the initial challenges posed by background noise. These findings underscore the critical role of high-quality audio input in achieving reliable transcription results and highlight the potential of machine learning technologies in improving communication access for hearing-impaired individuals. This study concludes with recommendations for future research, including the exploration of additional noise reduction techniques and the application of the Whisper model across various languages and dialects. Such advancements could significantly contribute to creating more inclusive digital environments and enhancing the overall user experience for individuals with hearing impairments.
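The pipeline the abstract describes, spectral gating to suppress stationary background noise, followed by transcription, can be sketched in a few lines. The following is a minimal NumPy illustration of the spectral-gating idea only, not the authors' implementation: frame size, hop, threshold factor, and the assumption that the opening frames contain nothing but noise are all illustrative choices.

```python
import numpy as np

def spectral_gate(signal, sr, frame=512, hop=256, n_noise_frames=20, factor=2.0):
    """Suppress stationary noise by zeroing STFT bins below a gated threshold.

    The noise profile is estimated from the first `n_noise_frames` frames,
    which are assumed to contain noise only.
    """
    window = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop : i * hop + frame] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)            # per-frame spectra
    mag = np.abs(spec)
    noise_profile = mag[:n_noise_frames].mean(axis=0)
    spec *= mag > factor * noise_profile          # the "gate": keep loud bins only
    # Overlap-add resynthesis, normalised by the accumulated window energy.
    cleaned = np.fft.irfft(spec, n=frame, axis=1)
    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))
    for i in range(n_frames):
        out[i * hop : i * hop + frame] += cleaned[i] * window
        norm[i * hop : i * hop + frame] += window ** 2
    norm[norm == 0] = 1.0
    return out / norm
```

In practice the study relies on librosa for loading audio and on the Noisereduce package for the gating itself (roughly `nr.reduce_noise(y=audio, sr=sr)`), after which the cleaned file can be passed to Whisper, e.g. `whisper.load_model("small").transcribe("cleaned.wav", language="kk")` for Kazakh output; the model size and file names here are illustrative, not taken from the paper.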

Author Biographies

Arypzhan Aben, Khoja Akhmet Yassawi International Kazakh-Turkish University, Kazakhstan

Master of Science, Department of Computer Engineering

Gulnur Kazbekova, Khoja Akhmet Yassawi International Kazakh-Turkish University

Candidate of Technical Sciences, Associate Professor, Head of the Department of Computer Engineering

Zhuldyz Ismagulova, ALT University, Kazakhstan

Candidate of Technical Sciences, Associate Professor, Department of Information and Communication Technologies

Gulmira Ibrayeva, Natural Sciences Forces Named After Twice Hero of the Soviet Union T. Ya. Bigeldinov, Kazakhstan

Candidate of Physical and Mathematical Sciences, Head of the Department

References

World Health Organization. (2024, September 26). Deafness and hearing loss. https://www.who.int/ru/news-room/fact-sheets/detail/deafness-and-hearing-loss

Martini, A., Cozza, A., & Di Pasquale Fiasca, V. M. (2024). The inheritance of hearing loss and deafness: A historical perspective. Audiology Research, 14(1), 116-128. https://doi.org/10.3390/audiolres14010012

Lan, S., Ye, L., & Zhang, K. (2023). Applying mmWave Radar Sensors to Vocabulary-Level Dynamic Chinese Sign Language Recognition for the Community With Deafness and Hearing Loss. IEEE Sensors Journal. https://doi.org/10.1109/JSEN.2023.3241237

Kral, A., & Sharma, A. (2023). Crossmodal plasticity in hearing loss. Trends in Neurosciences, 46(5), 377-393. https://doi.org/10.1016/j.tins.2023.02.004

Baballe, M. A., Garba, A., & Dahiru, M. (2023). Reasons for Deafness and Hearing Loss. Available at SSRN 4629219. https://ssrn.com/abstract=4629219

Podury, A., Jiam, N. T., Kim, M., Donnenfield, J. I., & Dhand, A. (2023). Hearing and sociality: the implications of hearing loss on social life. Frontiers in Neuroscience, 17, 1245434. https://doi.org/10.3389/fnins.2023.1245434

de Guimaraes, T. A. C., Arram, E., Shakarchi, A. F., Georgiou, M., & Michaelides, M. (2023). Inherited causes of combined vision and hearing loss: clinical features and molecular genetics. British Journal of Ophthalmology, 107(10), 1403-1414. https://doi.org/10.1136/bjo-2022-322062

Jiang, L., Wang, D., He, Y., & Shu, Y. (2023). Advances in gene therapy hold promise for treating hereditary hearing loss. Molecular Therapy, 31(4), 934-950. https://doi.org/10.1016/j.ymthe.2023.01.022

Mohammed, H. B., & Cavus, N. (2024). Utilization of Detection of Non-Speech Sound for Sustainable Quality of Life for Deaf and Hearing-Impaired People: A Systematic Literature Review. Sustainability, 16(20), 8976. https://doi.org/10.3390/su16208976

Wang, C., Tang, Y., Ma, X., Wu, A., Popuri, S., Okhonko, D., & Pino, J. (2020). Fairseq S2T: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171. https://arxiv.org/abs/2010.05171

Xu, C., Ye, R., Dong, Q., Zhao, C., Ko, T., Wang, M., ... & Zhu, J. (2023). Recent advances in direct speech-to-text translation. arXiv preprint arXiv:2306.11646. https://arxiv.org/abs/2306.11646

Guo, Z., Leng, Y., Wu, Y., Zhao, S., & Tan, X. (2023, June). PromptTTS: Controllable text-to-speech with text descriptions. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE. https://doi.org/10.1109/ICASSP49357.2023.10094829

Bapna, A., Cherry, C., Zhang, Y., Jia, Y., Johnson, M., Cheng, Y., ... & Conneau, A. (2022). mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374. https://arxiv.org/abs/2202.01374

Liu, Y., Zhu, J., Zhang, J., & Zong, C. (2020). Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920. https://arxiv.org/abs/2010.14920

Tang, Y., Pino, J., Wang, C., Ma, X., & Genzel, D. (2021, June). A general multi-task learning framework to leverage text data for speech to text tasks. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6209-6213). IEEE. https://doi.org/10.1109/ICASSP39728.2021.9413686

Matre, M. E., & Cameron, D. L. (2024). A scoping review on the use of speech-to-text technology for adolescents with learning difficulties in secondary education. Disability and Rehabilitation: Assistive Technology, 19(3), 1103-1116. https://doi.org/10.1080/17483107.2023.2243206

Wu, J., Gaur, Y., Chen, Z., Zhou, L., Zhu, Y., Wang, T., ... & Wu, Y. (2023, December). On decoder-only architecture for speech-to-text and large language model integration. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 1-8). IEEE. https://doi.org/10.1109/ASRU59171.2023.10383163

Higuchi, Y., Chen, N., Fujita, Y., Inaguma, H., Komatsu, T., Lee, J., ... & Watanabe, S. (2021, December). A comparative study on non-autoregressive modelings for speech-to-text generation. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 47-54). IEEE. https://doi.org/10.1109/ASRU51503.2021.9688364

Wang, M., Han, W., Shafran, I., Wu, Z., Chiu, C. C., Cao, Y., ... & Wu, Y. (2023, December). SLM: Bridge the thin gap between speech and text foundation models. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 1-8). IEEE. https://doi.org/10.1109/ASRU59171.2023.10383153

Bhandari, A., Shah, S. B., Thapa, S., Naseem, U., & Nasim, M. (2023). CrisisHateMM: Multimodal analysis of directed and undirected hate speech in text-embedded images from the Russia-Ukraine conflict. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1994-2003). https://doi.org/10.1109/CVPR52729.2023.00209

Bano, S., Jithendra, P., Niharika, G. L., & Sikhi, Y. (2020, November). Speech to text translation enabling multilingualism. In 2020 IEEE International Conference for Innovation in Technology (INOCON) (pp. 1-4). IEEE. https://doi.org/10.1109/INOCON50539.2020.9298288

Golla, R. G. (2024, September 26). Here are six practical use cases for the new Whisper API. Slator. Retrieved September 26, 2024, from https://slator.com/here-are-six-practical-use-cases-for-the-new-whisper-api

Albahra, S., Gorbett, T., Robertson, S., D'Aleo, G., Kumar, S. V. S., Ockunzzi, S., ... & Rashidi, H. H. (2023, March). Artificial intelligence and machine learning overview in pathology & laboratory medicine: A general review of data preprocessing and basic supervised concepts. In Seminars in Diagnostic Pathology (Vol. 40, No. 2, pp. 71-87). WB Saunders. https://doi.org/10.1053/j.semdp.2023.02.002

Samadi, M. E., Mirzaieazar, H., Mitsos, A., & Schuppert, A. (2024). NoiseCut: A Python package for noise-tolerant classification of binary data using prior knowledge integration and max-cut solutions. BMC Bioinformatics, 25(1), 155. https://doi.org/10.1186/s12859-024-05693-1

Gholami, H., Mohammadifar, A., Golzari, S., Song, Y., & Pradhan, B. (2023). Interpretability of simple RNN and GRU deep learning models used to map land susceptibility to gully erosion. Science of the Total Environment, 904, 166960. https://doi.org/10.1016/j.scitotenv.2023.166960

Zezario, R. E., Chen, Y. W., Fu, S. W., Tsao, Y., Wang, H. M., & Fuh, C. S. (2024, July). A study on incorporating Whisper for robust speech assessment. In 2024 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE. https://doi.org/10.1109/ICME52920.2024.10462847

Published

2025-06-30

How to Cite

Aben, A., Kazbekova, G., Ismagulova, Z., & Ibrayeva, G. (2025). AUDIO-TO-TEXT TRANSLATION FOR THE HARD OF HEARING: A WHISPER MODEL-BASED STUDY. Scientific Journal of Astana IT University, 22, 24–36. https://doi.org/10.37943/22SNOK5872

Issue

Section

Information Technologies