END-TO-END SPEECH RECOGNITION SYSTEMS FOR AGGLUTINATIVE LANGUAGES
DOI:
https://doi.org/10.37943/13IMII7575

Keywords:
agglutinative languages, integral approach, CTC, LSTM, neural network, speech recognition

Abstract
With the improvement of intelligent systems, speech recognition technologies are being widely integrated into various aspects of human life. Speech recognition is applied in smart assistants, smart home infrastructure, bank call-center applications, information system components for impaired people, and more. However, these facilities are available only for widely spoken languages such as English, Chinese, or Russian. For low-resource languages, these opportunities have not yet been implemented. Most modern speech recognition approaches have still not been tested on agglutinative languages, especially the languages of the Turkic group such as Kazakh, Tatar, and Turkish. The HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) approach was the most popular in the field of Automatic Speech Recognition (ASR) for a long time. Currently, neural networks are widely used in many fields of NLP, and especially in automatic speech recognition; in a large number of works, applying neural networks at different stages of the recognition pipeline has substantially improved system quality. This article investigates integral (end-to-end) speech recognition systems based on neural networks. The paper shows that the Connectionist Temporal Classification (CTC) model works precisely for agglutinative languages. The author conducted an experiment with an LSTM neural network using an encoder-decoder architecture based on attention models. The experiment yielded a Character Error Rate (CER) of 8.01% and a Word Error Rate (WER) of 17.91%. This result demonstrates the possibility of obtaining a good ASR model without the use of a Language Model (LM).
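The CER and WER figures reported above are standard edit-distance metrics. As a minimal sketch (not code from the paper), both can be computed as the Levenshtein distance between the reference and the hypothesis, divided by the reference length, over characters for CER and over words for WER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions,
    deletions, and substitutions all cost 1)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # row 0: distance from empty prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                              # deletion
                dp[j - 1] + 1,                          # insertion
                prev + (ref[i - 1] != hyp[j - 1]),      # substitution / match
            )
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A CER below the WER, as in the reported 8.01% vs. 17.91%, is typical for character-level end-to-end models: a single wrong character makes the whole word count as an error at the word level.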
Copyright (c) 2023. Articles are open access under the Creative Commons License.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.