DEVELOPMENT OF IMAGE CAPTION GENERATION HYBRID MODEL
DOI: https://doi.org/10.37943/22UGEU1808

Keywords: image captioning, deep learning, CNN-LSTM, VGG16, multimodal learning, BLEU metrics, METEOR metrics, natural language processing, neural networks, assistive technologies

Abstract
This study presents a hybrid model for image captioning that uses a VGG16 convolutional neural network (CNN) for feature extraction and a long short-term memory (LSTM) network for sequential text generation. The proposed architecture addresses the challenges of producing semantically rich and syntactically accurate captions, especially in languages with limited training data. By relying on pre-trained weights and a robust encoder-decoder design, the model effectively bridges the semantic gap between the visual and textual modalities. Experimental results on a dataset of road signs in Kazakhstan show a significant improvement in caption quality as measured by the BLEU and METEOR metrics. The model achieved a maximum METEOR score of 0.9985, indicating high semantic accuracy, and BLEU-1 and BLEU-2 scores of 0.67 and 0.64, respectively, highlighting its ability to generate relevant and coherent captions. These findings underscore the model's potential applications in multimodal systems and assistive technologies. Using a pre-trained CNN (VGG16), visual information is encoded efficiently by extracting high-level features from images; this is particularly useful for tasks that depend on image semantics, such as road sign recognition. The LSTM component, as a sequence-oriented architecture, is well suited to text generation because it takes into account the context and the preceding words of the sequence. Together, these components can be integrated into systems that require the analysis and description of visual information, such as autonomous vehicles or driver assistance systems. In conclusion, the proposed model demonstrates strong potential for image caption generation, especially in resource-constrained environments and on specialized datasets.
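To illustrate the encoder-decoder design described in the abstract, the sketch below builds a merge-style VGG16-LSTM captioner in Keras. It is a minimal illustration based only on the abstract, not the authors' implementation; the vocabulary size, maximum caption length, and 256-unit layer widths are assumed values.

# Minimal sketch of a merge-style VGG16 + LSTM captioning model in Keras.
# VOCAB_SIZE, MAX_LEN and the 256-unit layer widths are assumed values,
# not the configuration reported in the paper.
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 5000   # assumed vocabulary size
MAX_LEN = 20        # assumed maximum caption length in tokens

# Encoder: pre-trained VGG16; the 4096-d fc2 activations serve as image
# features and are typically precomputed once per image.
vgg = VGG16(weights="imagenet")
feature_extractor = Model(vgg.input, vgg.layers[-2].output)

# Image branch: project the 4096-d feature vector into the decoder width.
img_in = Input(shape=(4096,), name="image_features")
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and encode it with an LSTM.
txt_in = Input(shape=(MAX_LEN,), name="partial_caption")
txt_emb = Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Decoder: merge both modalities and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
next_word = Dense(VOCAB_SIZE, activation="softmax")(merged)

caption_model = Model(inputs=[img_in, txt_in], outputs=next_word)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
caption_model.summary()

In this merge design the pre-trained VGG16 encoder stays frozen and its features are fused with the language branch only at the output stage, so most of the trainable parameters sit in the text decoder, which suits small, specialized datasets such as road signs. The BLEU-1, BLEU-2, and METEOR scores of the kind reported above can be computed with standard NLTK scorers, as in the short sketch below; the captions used here are illustrative examples, not items from the dataset.

# Computing BLEU-1, BLEU-2, and METEOR with NLTK (illustrative captions only).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # required by the METEOR scorer
nltk.download("omw-1.4", quiet=True)

reference = "speed limit 60 ahead".split()      # ground-truth caption (tokenized)
candidate = "speed limit 60 sign ahead".split() # generated caption (tokenized)

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], candidate, weights=(1.0,), smoothing_function=smooth)
bleu2 = sentence_bleu([reference], candidate, weights=(0.5, 0.5), smoothing_function=smooth)
meteor = meteor_score([reference], candidate)

print(f"BLEU-1: {bleu1:.2f}  BLEU-2: {bleu2:.2f}  METEOR: {meteor:.2f}")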