Language model optimization using pruning, distillation and quantization techniques for NLP tasks

Date
2024
Authors
Petrenko, Mykhailo
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The dominant approaches to quantizing neural networks with billions of parameters focus primarily on weight quantization due to accuracy considerations; activation quantization, however, remains a significant bottleneck for inference speed. Building upon the foundational research of GPTQ and Qualcomm, we propose GPTAQ, a novel framework that introduces activation quantization for large language models (LLMs) while attempting to compensate for activation-induced error through the following enhancements: eigenvalues of the Hessian sensitivity matrix (although our experiments reveal that this approach yields mixed results); Cross-Layer Equalization (CLE), which balances weight scales across layers to prevent channel suppression; and Bias Correction, which corrects the bias introduced by CLE. We demonstrate the effects of our approach through experiments on the Facebook OPT model, using the C4 dataset for calibration. Our results show that RTN and token-wise activation quantization, combined with CLE, achieve the best trade-off between model efficiency and accuracy. GPTAQ introduces activation quantization while maintaining low perplexity scores, indicating minimal performance degradation given the limited experimental setup. Our framework offers a comprehensive solution for effective activation quantization, improving the deployment efficiency of large language models and providing valuable insights for future research, such as further tuning of the Hessian eigenvalues to reduce the introduced error, expanding and varying the calibration dataset, and completing the remaining ablation studies.
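
For readers unfamiliar with the token-wise RTN scheme mentioned in the abstract, the following minimal sketch illustrates the general idea: each token (row) of an activation tensor is rounded to a low-bit grid with its own scale. The bit width, symmetric quantization, the (tokens x hidden) layout, and the function names are illustrative assumptions, not the thesis implementation.

    # Minimal sketch of token-wise round-to-nearest (RTN) activation quantization.
    # Bit width, symmetric scheme, and tensor layout are assumptions for illustration.
    import torch

    def rtn_tokenwise_quantize(x: torch.Tensor, n_bits: int = 8):
        """Symmetric RTN quantization with one scale per token (row of x)."""
        qmax = 2 ** (n_bits - 1) - 1                       # e.g. 127 for int8
        scale = x.abs().amax(dim=-1, keepdim=True) / qmax  # per-token scale
        scale = scale.clamp(min=1e-8)                      # avoid division by zero
        q = torch.round(x / scale).clamp(-qmax - 1, qmax)  # round-to-nearest, then clip
        return q.to(torch.int8), scale

    def rtn_tokenwise_dequantize(q: torch.Tensor, scale: torch.Tensor):
        """Recover an approximation of the original activations."""
        return q.float() * scale

    if __name__ == "__main__":
        acts = torch.randn(4, 768)                         # toy (tokens, hidden) activations
        q, s = rtn_tokenwise_quantize(acts)
        err = (acts - rtn_tokenwise_dequantize(q, s)).abs().mean()
        print(f"mean abs quantization error: {err:.5f}")

In this sketch the per-token scale keeps outlier tokens from inflating the quantization error of the others, which is the motivation for token-wise granularity over a single per-tensor scale.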
Description
Keywords
nlp, llm, gpt, quantization, eigenvalues, masters thesis
Citation