Language model optimization using pruning, distillation and quantization techniques for NLP tasks

Date
2024
Authors
Petrenko, Mykhailo
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The dominant approaches to quantizing neural networks with billions of parameters focus primarily on weight quantization due to accuracy considerations; activation quantization, however, remains a significant bottleneck for inference speed. Building upon the foundational research of GPTQ and Qualcomm, we propose GPTAQ, a novel framework that introduces activation quantization for large language models (LLMs) while attempting to compensate for activation-induced error through the following enhancements: eigenvalues of the Hessian sensitivity matrix (although our experiments reveal that this approach yields mixed results); Cross-Layer Equalization (CLE), which balances weight scales across layers to prevent channel suppression; and Bias Correction, which corrects the bias introduced by CLE. We demonstrate the effects of our approach through experiments on the Facebook OPT model, using the C4 dataset for calibration. Our results show that RTN and token-wise activation quantization, combined with CLE, achieve the best trade-off between model efficiency and accuracy. GPTAQ introduces activation quantization while maintaining low perplexity scores, indicating minimal performance degradation given the limited experimental setup. Our framework offers a comprehensive solution for effective activation quantization, improving the deployment efficiency of large language models and providing valuable insights for future research, such as further tuning of the Hessian eigenvalues to reduce the introduced error, expanding and varying the calibration dataset, and completing the remaining ablation studies.
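
For readers unfamiliar with the token-wise RTN scheme mentioned in the abstract, the following minimal sketch illustrates the general idea: each token (row) of an activation tensor is rounded to a low-bit grid with its own scale. The bit width, symmetric quantization, the (tokens x hidden) layout, and the function names are illustrative assumptions, not the thesis implementation.

    # Minimal sketch of token-wise round-to-nearest (RTN) activation quantization.
    # Bit width, symmetric scheme, and tensor layout are assumptions for illustration.
    import torch

    def rtn_tokenwise_quantize(x: torch.Tensor, n_bits: int = 8):
        """Symmetric RTN quantization with one scale per token (row of x)."""
        qmax = 2 ** (n_bits - 1) - 1                       # e.g. 127 for int8
        scale = x.abs().amax(dim=-1, keepdim=True) / qmax  # per-token scale
        scale = scale.clamp(min=1e-8)                      # avoid division by zero
        q = torch.round(x / scale).clamp(-qmax - 1, qmax)  # round-to-nearest, then clip
        return q.to(torch.int8), scale

    def rtn_tokenwise_dequantize(q: torch.Tensor, scale: torch.Tensor):
        """Recover an approximation of the original activations."""
        return q.float() * scale

    if __name__ == "__main__":
        acts = torch.randn(4, 768)                         # toy (tokens, hidden) activations
        q, s = rtn_tokenwise_quantize(acts)
        err = (acts - rtn_tokenwise_dequantize(q, s)).abs().mean()
        print(f"mean abs quantization error: {err:.5f}")

In this sketch the per-token scale keeps outlier tokens from inflating the quantization error of the others, which is the motivation for token-wise granularity over a single per-tensor scale.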
Description
Keywords
nlp, llm, gpt, quantization, eigenvalues, masters thesis
Citation