OVERVIEW OF TRANSFORMER-BASED MODELS FOR MEDICAL IMAGE SEGMENTATION

Authors

DOI:

https://doi.org/10.37943/13BKBF2003

Keywords:

Computer vision, Transformers, Image processing, Premedical diagnostics, Segmentation

Abstract

Premedical diagnostics is the process of examining examination results before a clinician's assessment. Correct premedical diagnostics can improve patient management and reduce the burden on the medical sector. Interpretation of medical images such as computed tomography and X-ray scans is an obligatory step before further treatment; however, the shortage of clinicians causes delays at this step. We reviewed two state-of-the-art algorithms proposed for medical image segmentation: TransUNet and Swin-Unet. We conducted a theoretical comparison of the algorithms in terms of their applicability to premedical diagnostics with respect to segmentation quality and training speed. The comparison is based on the original source code provided by the authors of the respective articles. We chose these two algorithms because they share a similar U-shaped architecture, are highly cited, and show competitive DICE scores on images of various human organs. Architectural features were also important: both models inherit key elements of U-Net. TransUNet is a hybrid Transformer-CNN model, consisting of a Transformer encoder and a convolutional decoder, with additional computation required in the bottleneck. Swin-Unet is a fully Transformer-based model. These architectural differences lead to a difference in the number of trainable parameters. Deeper architectures with more parameters usually perform better; however, according to our review, Swin-Unet has fewer parameters yet achieves better DICE and Hausdorff Distance scores. It should also be noted that the distribution between false positive and false negative predictions is important in medical image processing: it is crucial to avoid overloading the medical sector while also not missing any sick patients. Precision and recall can be used to evaluate the ratio of incorrect predictions. We therefore also examined results on caries segmentation, where DICE and precision were reported. In this specific case, TransUNet shows better DICE and recall values but worse precision.
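The evaluation metrics compared in the review (DICE, precision, recall) can be illustrated with a minimal sketch on synthetic binary masks; the masks below are toy data, not results from the reviewed papers:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    # DICE = 2|A ∩ B| / (|A| + |B|) on binary masks
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def precision_recall(pred, target, eps=1e-7):
    # precision penalizes false positives (overloading the clinic),
    # recall penalizes false negatives (missed sick patients)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return tp / (tp + fp + eps), tp / (tp + fn + eps)

# toy 3x3 segmentation masks: 3 predicted pixels, 3 ground-truth pixels, 2 overlap
pred = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
target = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)

print(round(dice_score(pred, target), 3))  # 2*2 / (3+3) ≈ 0.667
p, r = precision_recall(pred, target)
print(round(p, 3), round(r, 3))            # 2/3, 2/3
```

A model with high recall but low precision flags too many healthy regions, while the opposite imbalance risks missing pathology; the Hausdorff Distance used alongside DICE additionally measures the worst-case boundary error between the two masks.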

Author Biographies

Nam Diana, Kazakh British Technical University

Master of Tech. Sci., PhD Student, School of Information Technology and Engineering

Pak Alexandr Alexandrovich, Kazakh British Technical University

Candidate of Tech. Sciences, Professor, School of Information Technology and Engineering

References

LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., & Jackel, L.D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541-551. https://doi.org/10.1162/neco.1989.1.4.541

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90. https://doi.org/10.1145/3065386

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009, June). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248-255). IEEE. https://doi.org/10.1109/CVPR.2009.5206848

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). IEEE.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 (pp. 234-241). Springer International Publishing. https://doi.org/10.1007/978-3-319-24574-4_28

Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., & Wang, M. (2023, February). Swin-Unet: Unet-like pure transformer for medical image segmentation. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III (pp. 205-218). Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-25066-8_9

Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., ... & Zhou, Y. (2021). Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306. https://doi.org/10.48550/arXiv.2102.04306

Jia, Q., & Shu, H. (2022, July). BiTr-Unet: A CNN-Transformer combined network for MRI brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 7th International Workshop, BrainLes 2021, Held in Conjunction with MICCAI 2021, Virtual Event, September 27, 2021, Revised Selected Papers, Part II (pp. 3-14). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-09002-8_1

Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., ... & Jodoin, P.M. (2018). Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved?. IEEE transactions on medical imaging, 37(11), 2514-2525. https://doi.org/10.1109/TMI.2018.2837502

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. https://doi.org/10.48550/arXiv.1409.1556

Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117-2125).

Multi-Atlas Labeling Beyond the Cranial Vault - Workshop and Challenge. Synapse multi-organ computed tomography dataset. [Data set]. Synapse.org. https://repo-prod.prod.sagebase.org/repo/v1/doi/locate?id=syn3193805&type=ENTITY. https://doi.org/10.7303/SYN3193805

Chen, J., et al. (2021). TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. GitHub. https://github.com/Beckschen/TransUNet

Cao, H., et al. (2022). Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. GitHub. https://github.com/HuCaoFighting/Swin-Unet

Ying, S., Wang, B., Zhu, H., Liu, W., & Huang, F. (2022). Caries segmentation on tooth X-ray images with a deep network. Journal of Dentistry, 119, 104076. https://doi.org/10.1016/j.jdent.2022.104076

Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155. https://doi.org/10.48550/arXiv.1803.02155

Published

2023-03-30

How to Cite

Nam, D., & Pak, A. (2023). OVERVIEW OF TRANSFORMER-BASED MODELS FOR MEDICAL IMAGE SEGMENTATION. Scientific Journal of Astana IT University, 13(13), 64–75. https://doi.org/10.37943/13BKBF2003

Issue

Section

Articles