BALANCING SPEED AND PERFORMANCE WITH LAYER FREEZING STRATEGIES FOR TRANSFORMER MODELS
DOI: https://doi.org/10.37943/22OXKY5402
Keywords: layer freezing, BERT, NER, English, Kazakh
Abstract
In this paper, we evaluated different approaches to freezing BERT-base layers and analyzed their impact on the quality and speed of training for named entity recognition in two languages. Layer freezing is an optimization technique for deep neural network training in which selected layers of a model remain fixed, so their weights are not updated during backpropagation. Skipping updates for these layers reduces the number of parameters that must be adjusted, which lowers computational demands and shortens training time. Partial freezing proved to be an effective way to preserve the model's key representations while still allowing it to adapt to new tasks. Experimental results showed that freezing three to six layers yields stable model performance regardless of the training language. Unlike standard approaches, our method emphasizes cross-linguistic applicability and promotes energy-efficient training. We designed the experimental setup, implemented the freezing scenarios, and carried out all performance evaluations. The study evaluates the effectiveness of layer freezing in a pre-trained BERT model on the named entity recognition task. Two variants of the freezing strategy are considered: in the first, the upper layers of the model are fixed; in the second, the lower layers remain unchanged. The analysis is based on two corpora: the English-language CoNLL-2003 and the Kazakh-language KazNERD. Our experiments showed that freezing three to six layers provides the best balance between training speed and model quality. On CoNLL-2003, training time decreased from 266 to 167 seconds while the Macro F1 score remained at 87%. On KazNERD, training time decreased from 1609 to 958 seconds, with accuracy of 94-95% and Macro F1 in the range of 71-72%. Freezing all 12 layers caused a sharp decline in quality, with Macro F1 falling to 50% on CoNLL-2003 and 7% on KazNERD. This underscores the importance of limited freezing and careful fine-tuning of the model architecture.
The study further examines how the choice of layers to freeze influences the model’s ability to adapt to new linguistic patterns and domain-specific terminology. These findings offer useful insights for researchers and practitioners aiming to enhance the efficiency of fine-tuning large language models while ensuring robust performance across different languages and datasets. The results also highlight the potential for optimizing resource usage in various NER applications without compromising critical language understanding.
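For readers who want to reproduce the general setup, the following is a minimal sketch of the two freezing variants using the Hugging Face Transformers library. The checkpoint name, the number of labels, and the choice of six frozen layers are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch (assumptions: bert-base-cased checkpoint, 9 CoNLL-2003-style
# labels, 6 frozen layers); not the authors' exact training code.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=9
)
N_FROZEN = 6  # the paper reports 3-6 frozen layers as a good trade-off

# Variant 1: keep the lower (first) N encoder layers unchanged.
for layer in model.bert.encoder.layer[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False

# Variant 2 (alternative): fix the upper (last) N encoder layers instead.
# for layer in model.bert.encoder.layer[-N_FROZEN:]:
#     for param in layer.parameters():
#         param.requires_grad = False

# Frozen parameters receive no gradient updates, so fewer weights are adjusted
# per step and fine-tuning for NER proceeds faster.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")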