AUDIO-TO-TEXT TRANSLATION FOR THE HARD OF HEARING: A WHISPER MODEL-BASED STUDY
DOI: https://doi.org/10.37943/22SNOK5872
Keywords: Whisper model, Audio-to-text transcription, Hearing impairments, Machine learning
Abstract
This study investigates the effectiveness of the Whisper model for audio-to-text transcription, with the aim of improving accessibility for individuals with hearing impairments. The research focuses on processing audio recordings obtained from the WhatsApp messenger, which often contain significant background noise that complicates speech recognition. To address this issue, we employed advanced audio processing techniques, including the Librosa library and the Noisereduce package for noise reduction. The spectral gating methods applied in this study effectively attenuated wind noise and other ambient sounds, allowing clearer recognition of spoken content. To verify the quality of the processed audio, we assessed its clarity using a SimpleRNN model; the training results demonstrated a progressive reduction in loss across epochs, confirming the successful enhancement of audio quality. Once the audio files were adequately cleaned, we used the Whisper model, a machine learning system for speech recognition developed by OpenAI, to transcribe the audio into text. Despite the initial background noise, the transcription process yielded accurate Kazakh-language output. These findings underscore the critical role of high-quality audio input in achieving reliable transcription and highlight the potential of machine learning technologies to improve communication access for hearing-impaired individuals. The study concludes with recommendations for future research, including the exploration of additional noise reduction techniques and the application of the Whisper model to other languages and dialects. Such advances could contribute significantly to more inclusive digital environments and a better overall user experience for individuals with hearing impairments.
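
The noise-reduction step described above can be illustrated with a short sketch. This is a minimal example rather than the authors' exact code: it assumes a WhatsApp voice note already converted to WAV (the filename "voice_note.wav" is hypothetical) and uses Librosa for loading together with the Noisereduce package's spectral gating entry point.

import librosa
import noisereduce as nr
import soundfile as sf

# Load the recording as 16 kHz mono, the rate Whisper expects downstream.
audio, sr = librosa.load("voice_note.wav", sr=16000, mono=True)

# Spectral gating: noisereduce estimates a per-band noise threshold and
# attenuates time-frequency cells that fall below the gate, suppressing
# steady ambient sounds such as wind noise.
cleaned = nr.reduce_noise(y=audio, sr=sr)

# Save the denoised signal for the quality check and transcription steps.
sf.write("voice_note_clean.wav", cleaned, sr)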
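The clarity check could likewise be prototyped with a small recurrent network. The abstract does not specify the features or labels used with the SimpleRNN, so the sketch below assumes fixed-length MFCC frame sequences labelled clean (1) or noisy (0); all names and hyperparameters are illustrative.

import numpy as np
import librosa
import tensorflow as tf

def mfcc_sequence(path, n_mfcc=13, max_frames=200):
    # Extract an MFCC frame sequence and pad/truncate to a fixed length.
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    out = np.zeros((max_frames, n_mfcc), dtype=np.float32)
    n = min(len(mfcc), max_frames)
    out[:n] = mfcc[:n]
    return out

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200, 13)),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # clarity score in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10)  # loss falling epoch by epoch would
# mirror the training behaviour reported in the abstract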
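Finally, the denoised file can be passed to Whisper. The model size ("small") and the explicit Kazakh language hint are illustrative choices for this sketch; the study's exact configuration may differ.

import whisper

# Load a pretrained Whisper checkpoint (sizes range from "tiny" to "large").
model = whisper.load_model("small")

# Transcribe the cleaned recording; language="kk" requests Kazakh output
# instead of leaving language identification to the model.
result = model.transcribe("voice_note_clean.wav", language="kk")
print(result["text"])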