END-TO-END SPEECH RECOGNITION SYSTEMS FOR AGGLUTINATIVE LANGUAGES
DOI:
https://doi.org/10.37943/13IMII7575

Keywords:
agglutinative languages, integral approach, CTC, LSTM, neural network, speech recognition

Abstract
With the improvement of intelligent systems, speech recognition technologies are being widely integrated into various aspects of human life. Speech recognition is applied in smart assistants, smart home infrastructure, bank call-center applications, information system components for impaired people, and more. However, these facilities are available only for widely spoken languages such as English, Chinese, or Russian. For low-resource languages, these opportunities have not yet been implemented. Most modern speech recognition approaches have still not been tested on agglutinative languages, especially the languages of the Turkic group such as Kazakh, Tatar, and Turkish. The HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) approach was the most popular in the field of Automatic Speech Recognition (ASR) for a long time. Currently, neural networks are widely used in many fields of NLP, and especially in automatic speech recognition; in a large number of works, applying neural networks at different stages of the recognition pipeline has substantially improved system quality. This article investigates integral (end-to-end) speech recognition systems based on neural networks. The paper shows that the Connectionist Temporal Classification (CTC) model works precisely for agglutinative languages. The author conducted an experiment with an LSTM neural network using an encoder-decoder architecture based on attention models. The experiment yielded a Character Error Rate (CER) of 8.01% and a Word Error Rate (WER) of 17.91%. This result demonstrates the possibility of obtaining a good ASR model without the use of a Language Model (LM).
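The CER and WER figures reported above are standard edit-distance metrics. As a minimal sketch (not code from the paper), both can be computed as the Levenshtein distance between the reference and the hypothesis, divided by the reference length, over characters for CER and over words for WER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions,
    deletions, and substitutions all cost 1)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # row 0: distance from empty prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                              # deletion
                dp[j - 1] + 1,                          # insertion
                prev + (ref[i - 1] != hyp[j - 1]),      # substitution / match
            )
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A CER below the WER, as in the reported 8.01% vs. 17.91%, is typical for character-level end-to-end models: a single wrong character makes the whole word count as an error at the word level.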
Copyright (c) 2023. Articles are open access under the Creative Commons License.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.