DEVELOPMENT OF IMAGE CAPTION GENERATION HYBRID MODEL
DOI: https://doi.org/10.37943/22UGEU1808

Keywords: image captioning, deep learning, CNN-LSTM, VGG16, multimodal learning, BLEU metrics, METEOR metrics, natural language processing, neural networks, assistive technologies

Abstract
This study presents a hybrid model for image captioning that uses a VGG16 convolutional neural network (CNN) for feature extraction and a long short-term memory (LSTM) network for sequential text generation. The proposed architecture addresses the challenges of producing semantically rich and syntactically accurate captions, especially in languages with limited training data. By relying on pre-trained weights and a robust encoder-decoder design, the model effectively bridges the semantic gap between the visual and textual modalities. Experimental results on a dataset of road signs in Kazakhstan show a significant improvement in caption quality as measured by the BLEU and METEOR metrics. The model achieved a maximum METEOR score of 0.9985, indicating high semantic accuracy, and BLEU-1 and BLEU-2 scores of 0.67 and 0.64, respectively, highlighting its ability to generate relevant and coherent captions. These findings underscore the model's potential applications in multimodal systems and assistive technologies. Using a pre-trained CNN (VGG16), visual information is encoded efficiently by extracting high-level features from images; this is particularly useful for tasks that depend on image semantics, such as road sign recognition. The LSTM component, as a sequence-oriented architecture, is well suited to text generation because it takes into account the context and the preceding words of the sequence. Together, these components can be integrated into systems that require the analysis and description of visual information, such as autonomous vehicles or driver assistance systems. In conclusion, the proposed model demonstrates strong potential for image caption generation, especially in resource-constrained environments and on specialized datasets.
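To illustrate the encoder-decoder design described in the abstract, the sketch below builds a merge-style VGG16-LSTM captioner in Keras. It is a minimal illustration based only on the abstract, not the authors' implementation; the vocabulary size, maximum caption length, and 256-unit layer widths are assumed values.

# Minimal sketch of a merge-style VGG16 + LSTM captioning model in Keras.
# VOCAB_SIZE, MAX_LEN and the 256-unit layer widths are assumed values,
# not the configuration reported in the paper.
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 5000   # assumed vocabulary size
MAX_LEN = 20        # assumed maximum caption length in tokens

# Encoder: pre-trained VGG16; the 4096-d fc2 activations serve as image
# features and are typically precomputed once per image.
vgg = VGG16(weights="imagenet")
feature_extractor = Model(vgg.input, vgg.layers[-2].output)

# Image branch: project the 4096-d feature vector into the decoder width.
img_in = Input(shape=(4096,), name="image_features")
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and encode it with an LSTM.
txt_in = Input(shape=(MAX_LEN,), name="partial_caption")
txt_emb = Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Decoder: merge both modalities and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
next_word = Dense(VOCAB_SIZE, activation="softmax")(merged)

caption_model = Model(inputs=[img_in, txt_in], outputs=next_word)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
caption_model.summary()

In this merge design the pre-trained VGG16 encoder stays frozen and its features are fused with the language branch only at the output stage, so most of the trainable parameters sit in the text decoder, which suits small, specialized datasets such as road signs. The BLEU-1, BLEU-2, and METEOR scores of the kind reported above can be computed with standard NLTK scorers, as in the short sketch below; the captions used here are illustrative examples, not items from the dataset.

# Computing BLEU-1, BLEU-2, and METEOR with NLTK (illustrative captions only).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # required by the METEOR scorer
nltk.download("omw-1.4", quiet=True)

reference = "speed limit 60 ahead".split()      # ground-truth caption (tokenized)
candidate = "speed limit 60 sign ahead".split() # generated caption (tokenized)

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], candidate, weights=(1.0,), smoothing_function=smooth)
bleu2 = sentence_bleu([reference], candidate, weights=(0.5, 0.5), smoothing_function=smooth)
meteor = meteor_score([reference], candidate)

print(f"BLEU-1: {bleu1:.2f}  BLEU-2: {bleu2:.2f}  METEOR: {meteor:.2f}")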