Project

Uncertainty-Driven Knowledge Distillation for Language Model Compression

  • Tianyu Huang, Weisheng Dong, Fangfang Wu, Xin Li, and Guangming Shi
  • Fig. 1. Workflow of our model compression scheme. (a) N-to-1 compression merges N stacked consecutive transformer layers into a single transformer module. (b) The uncertainty-driven distillation architecture (shown for N = 2) copies and freezes the parameters of the odd-numbered layer's multi-head attention (MHA) and imposes an MSE constraint on the feed-forward network (FFN) of the even-numbered layer (a code sketch follows this list). We propose an FFN loss function and an uncertainty estimation module (UEM) to improve knowledge distillation for pretrained language model compression.
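  A minimal sketch of the N = 2 step in Fig. 1(b) is given below: the student block reuses the odd-numbered teacher layer's MHA (copied and frozen), while its FFN is trained under an MSE constraint against the even-numbered teacher layer's FFN output. The module and attribute names (.mha, .ffn, build_student_block) are illustrative assumptions, not the paper's released code.

    # Hypothetical sketch of the N = 2 compression step from Fig. 1(b).
    import copy
    import torch
    import torch.nn as nn

    def build_student_block(teacher_odd_layer: nn.Module) -> nn.Module:
        """Create one student block from a pair of teacher layers (N = 2)."""
        student_block = copy.deepcopy(teacher_odd_layer)
        # Copy and freeze the odd-numbered layer's MHA parameters.
        for p in student_block.mha.parameters():
            p.requires_grad = False
        # The FFN stays trainable; it must absorb the work of two teacher FFNs.
        return student_block

    def ffn_distillation_loss(student_block: nn.Module,
                              teacher_even_layer: nn.Module,
                              hidden_states: torch.Tensor) -> torch.Tensor:
        """MSE constraint between the student FFN and the even-numbered
        teacher layer's FFN, evaluated on the same intermediate features."""
        with torch.no_grad():
            target = teacher_even_layer.ffn(hidden_states)
        prediction = student_block.ffn(hidden_states)
        return nn.functional.mse_loss(prediction, target)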

    Abstract

      Despite their remarkable performance on various Natural Language Processing (NLP) tasks, pretrained language models remain difficult to deploy in many practical applications because their parametric complexity exceeds the available computational resources. Techniques such as knowledge distillation, network pruning, and quantization have been developed for language model compression. However, it remains challenging to achieve an optimal tradeoff between model size and inference accuracy. To address this issue, we propose a novel and efficient uncertainty-driven knowledge distillation method for compressing transformer-based pretrained language models. Specifically, we design a scheme of parameter retention and feed-forward network parameter distillation that compresses N stacked transformer modules into one module during the fine-tuning stage. A key innovation of our approach is to add an uncertainty estimation module (UEM) to the student network so that it guides the student's feature reconstruction in the latent space toward the teacher's. Across multiple natural language inference datasets in GLUE, our compressed model retains more than 95% of the original BERT's accuracy while using only about 50% of its parameters.
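      As a rough illustration of the UEM described above, the sketch below assumes the module predicts a per-feature log-variance from the student's hidden states and uses it to reweight the feature-reconstruction error against the teacher, in the spirit of heteroscedastic uncertainty weighting. The exact UEM architecture and loss in the paper may differ; all names here are hypothetical.

    # Hypothetical UEM sketch: uncertainty-weighted feature reconstruction.
    import torch
    import torch.nn as nn

    class UncertaintyEstimationModule(nn.Module):
        """Small head that predicts a per-feature log-variance (log sigma^2)
        from the student's hidden states."""
        def __init__(self, hidden_dim: int):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )

        def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
            # Log-variance is predicted for numerical stability.
            return self.head(student_feat)

    def uncertainty_weighted_feature_loss(student_feat: torch.Tensor,
                                          teacher_feat: torch.Tensor,
                                          log_var: torch.Tensor) -> torch.Tensor:
        """Reconstruction error down-weighted where predicted uncertainty is
        high, plus a log-variance penalty that keeps sigma from growing
        without bound."""
        sq_err = (student_feat - teacher_feat.detach()) ** 2
        return (0.5 * torch.exp(-log_var) * sq_err + 0.5 * log_var).mean()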

    Paper & Code & Demo

    Experimental Results

      Fig. 1. Distribution of sigma values.

      Fig. 2. Parameter analysis.

      Table 1. PERFORMANCE ON DIFFERENT DATASETS

      Table 2. EXPERIMENTAL RESULTS OF DIFFERENT COMPRESSION RATES N

    Citation

    @article{huang2023uncertainty,
     title={Uncertainty-Driven Knowledge Distillation for Language Model Compression},
     author={Huang, Tianyu and Dong, Weisheng and Wu, Fangfang and Li, Xin and Shi, Guangming},
     journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
     year={2023},
     publisher={IEEE}
    }

    Contact

    Tianyu Huang, Email: 19171213910@stu.xidian.edu.cn
    Weisheng Dong, Email: wsdong@mail.xidian.edu.cn
    Xin Li, Email: xin.li@mail.wvu.edu
    Fangfang Wu, Email: wufangfang@xidian.edu.cn
    Guangming Shi, Email: gmshi@xidian.edu.cn