BALANCING SPEED AND PERFORMANCE WITH LAYER FREEZING STRATEGIES FOR TRANSFORMER MODELS

Authors

Bauyrzhan Kairatuly, Aday Shomanov

DOI:

https://doi.org/10.37943/22OXKY5402

Keywords:

layer freezing, BERT, NER, English, Kazakh

Abstract

In this paper, we evaluated different approaches to freezing BERT-base layers and analyzed their impact on the quality and speed of training for named entity recognition in two languages. Layer freezing is an optimization technique for training deep neural networks in which selected layers of a model are kept fixed, meaning their weights are not updated during backpropagation. Because these layers are excluded from the update step, the number of trainable parameters decreases, which lowers computational demands and shortens training time. Partial freezing of layers proved to be an effective way to preserve the model's key representations while still allowing it to adapt to new tasks. Experimental results showed that freezing three to six layers achieves stable model performance regardless of the training language. Unlike standard approaches, our method highlights cross-linguistic applicability and promotes energy-efficient training. We designed the experimental setup, implemented the freezing scenarios, and carried out all performance evaluations ourselves. The study evaluates the effectiveness of layer freezing in a pre-trained BERT model on the named entity recognition task. Two freezing strategies are considered: in the first, the upper layers of the model are fixed; in the second, the lower layers remain unchanged. The analysis is based on two corpora, the English CoNLL-2003 and the Kazakh KazNERD. Our experiments showed that freezing three to six layers provides the best balance between training speed and model quality. On CoNLL-2003, training time decreased from 266 to 167 seconds while the Macro F1 score remained at 87%. On KazNERD, training accelerated from 1609 to 958 seconds with an accuracy of 94-95% and a Macro F1 of 71-72%. Freezing all 12 layers caused a dramatic drop in quality, with Macro F1 falling to 50% on CoNLL-2003 and to 7% on KazNERD. This underscores the importance of limited freezing and careful fine-tuning of the model architecture.
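To illustrate the technique described above, the following is a minimal sketch (not the authors' implementation) of how a block of BERT-base encoder layers can be frozen before NER fine-tuning, assuming the Hugging Face Transformers library with PyTorch; the checkpoint name and label count (nine CoNLL-2003 IOB2 tags) are illustrative choices, not details taken from the paper.

```python
# Minimal sketch: freeze a block of bert-base encoder layers before fine-tuning
# for token classification (NER). Assumes Hugging Face `transformers` + PyTorch.
from transformers import AutoModelForTokenClassification

# Checkpoint and label count are illustrative, not from the paper.
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

def freeze_layers(model, n_frozen: int, upper: bool = False) -> None:
    """Fix `n_frozen` of the 12 encoder layers so backpropagation skips them.

    upper=False freezes the lower (input-side) layers plus the embeddings;
    upper=True freezes the top (output-side) layers instead.
    """
    layers = model.bert.encoder.layer  # nn.ModuleList of 12 layers in bert-base
    chosen = layers[-n_frozen:] if upper else layers[:n_frozen]
    params = [p for layer in chosen for p in layer.parameters()]
    if not upper:
        params += list(model.bert.embeddings.parameters())
    for p in params:
        p.requires_grad = False  # excluded from gradient computation and optimizer updates

# The paper reports that freezing three to six of the 12 layers keeps Macro F1
# stable while cutting training time; freezing all 12 collapses quality.
freeze_layers(model, n_frozen=4, upper=False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
```

Only the parameters left with requires_grad=True are passed gradients and updated by the optimizer, which is what reduces per-step compute and memory in the scenarios compared in the paper.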

The study further examines how the choice of layers to freeze influences the model’s ability to adapt to new linguistic patterns and domain-specific terminology. These findings offer useful insights for researchers and practitioners aiming to enhance the efficiency of fine-tuning large language models while ensuring robust performance across different languages and datasets. The results also highlight the potential for optimizing resource usage in various NER applications without compromising critical language understanding.
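Since the results above are reported as accuracy and Macro F1, the following sketch shows one common way such NER metrics are computed, assuming the seqeval library; the paper does not specify its evaluation code, and the tag sequences here are toy examples, so this is an illustration rather than the authors' procedure.

```python
# Illustrative metric computation for NER with seqeval (entity-level F1,
# token-level accuracy); label sequences are toy examples, not paper data.
from seqeval.metrics import accuracy_score, classification_report, f1_score

# Gold and predicted IOB2 tag sequences for two sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-PER", "O"]]

print("token accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))
```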

Author Biographies

Bauyrzhan Kairatuly, Farabi University, Kazakhstan

PhD Student, Faculty of Information Technology

Aday Shomanov, Nazarbayev University, Kazakhstan

PhD, Instructor, School of Engineering and Digital Sciences

References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, 33, 1877–1901. https://doi.org/10.48550/arXiv.2005.14165

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805

Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842–866. https://doi.org/10.1162/tacl_a_00349

Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243. https://arxiv.org/abs/1906.02243

Shavrina, T., Shevelev, M., Artemova, E., Malykh, V., Fenogenova, A., Voronov, A., & Kozlova, O. (2020). RussianGPT: Towards an open-source Russian GPT-3-like model. arXiv preprint arXiv:2010.15923. https://arxiv.org/abs/2010.15923

Lee, J., Tang, R., & Lin, J. (2019). What would Elsa do? Freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090. https://arxiv.org/abs/1911.03090

Wang, Y., Sun, D., Chen, K., Lai, F., & Chowdhury, M. (2023). Egeria: Efficient DNN training with knowledge-guided layer freezing. Proceedings of the Eighteenth European Conference on Computer Systems, 851–866. https://doi.org/10.1145/3552326.3587572

Liu, Y., Agarwal, S., & Venkataraman, S. (2021). AutoFreeze: Automatically freezing model blocks to accelerate fine-tuning. arXiv preprint arXiv:2102.01386. https://arxiv.org/abs/2102.01386

Li, S., Yuan, G., Dai, Y., Zhang, Y., Wang, Y., & Tang, X. (2024). SmartFRZ: An efficient training framework using attention-based layer freezing. arXiv preprint arXiv:2401.16720. https://arxiv.org/abs/2401.16720

Ingle, D., Tripathi, R., Kumar, A., Patel, K., & Vepa, J. (2022). Investigating the characteristics of a transformer in a few-shot setup: Does freezing layers in RoBERTa help? In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (pp. 238–248). Association for Computational Linguistics.

Kim, Y., Ahn, J., Kim, M., Choi, C., Kim, H., Tuvshinjargal, N., Lee, S., Zhang, Y., Pei, Y., Linghu, X., Ma, J., Chen, L., Dai, Y., & Yoo, S. (2024). Breaking MLPerf Training: A Case Study on Optimizing BERT. arXiv preprint arXiv:2402.02447. https://arxiv.org/abs/2402.02447

Reguero, Á. D., Martínez-Fernández, S., & Verdecchia, R. (2025). Energy-efficient neural network training through runtime layer freezing, model quantization, and early stopping. Computer Standards & Interfaces, 92, 103906. https://doi.org/10.1016/j.csi.2024.103906

Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (pp. 142–147). Association for Computational Linguistics. https://doi.org/10.3115/1119176.1119195

Yeshpanov, R., Khassanov, Y., & Varol, H. A. (2022). KazNERD: Kazakh named entity recognition dataset. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. European Language Resources Association. https://arxiv.org/abs/2111.13419

Goutam, K., Balasubramanian, S., Gera, D., & Sarma, R. R. (2024). LayerOut: Freezing layers in deep neural networks. SN Computer Science, 5(1), 123. https://doi.org/10.1007/s42979-023-01678-9

Hugging Face. (n.d.). BERT. Hugging Face Transformers Documentation. https://huggingface.co/docs/transformers/en/model_doc/bert

Fukuhata, S., & Kano, Y. (2025). Few dimensions are enough: Fine-tuning BERT with selected dimensions revealed its redundant nature. arXiv preprint arXiv:2504.04966. https://doi.org/10.48550/arXiv.2504.04966

Miao, Z., & Zhao, M. (2023). Weight freezing: A regularization approach for fully connected layers with an application in EEG classification. arXiv preprint arXiv:2306.05775. https://arxiv.org/abs/2306.05775

Sorrenti, A., Bellitto, G., Proietto Salanitri, F., Pennisi, M., Spampinato, C., & Palazzo, S. (2023). Selective freezing for efficient continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (pp. 1234–1243). https://doi.org/10.1109/ICCVW58398.2023.00156

Shen, Z., Liu, Z., Qin, J., Savvides, M., & Cheng, K.-T. (2021). Partial is better than all: Revisiting fine-tuning strategy for few-shot learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10), 9546–9553. https://doi.org/10.1609/aaai.v35i10.17055

Isikdogan, L. F., Nayak, B. V., Wu, C.-T., Moreira, J. P., Rao, S., & Michael, G. (2020). SemifreddoNets: Partially frozen neural networks for efficient computer vision systems. arXiv preprint arXiv:2006.06888. https://arxiv.org/abs/2006.06888

Published

2025-06-30

How to Cite

Kairatuly, B., & Shomanov, A. (2025). BALANCING SPEED AND PERFORMANCE WITH LAYER FREEZING STRATEGIES FOR TRANSFORMER MODELS. Scientific Journal of Astana IT University, 22, 153–162. https://doi.org/10.37943/22OXKY5402

Section

Information Technologies