Language model optimization using pruning, distillation and quantization techniques for NLP tasks

dc.contributor.advisor: Marchenko, Oleksandr (en_US)
dc.contributor.author: Petrenko, Mykhailo (en_US)
dc.date.accessioned: 2024-10-05T14:26:58Z
dc.date.available: 2024-10-05T14:26:58Z
dc.date.issued: 2024
dc.description.abstract: The dominant approaches to quantizing neural networks with billions of parameters focus primarily on weight quantization due to accuracy considerations; activation quantization, however, remains a significant bottleneck for inference speed. Building upon the foundational research of GPTQ and Qualcomm, we propose GPTAQ, a framework that introduces activation quantization for large language models (LLMs) while attempting to balance out the activation-induced error with the following enhancements: the use of eigenvalues of the Hessian sensitivity matrix, although our experiments reveal that this approach yields mixed results; Cross-Layer Equalization (CLE), which balances weight scales across layers to prevent channel suppression; and Bias Correction, which compensates for the effects of CLE. We demonstrate the effects of our approach through experiments on the Facebook OPT model, using the C4 dataset for calibration. Our results show that RTN and token-wise activation quantization combined with CLE achieve the best trade-off between model efficiency and accuracy. GPTAQ introduces activation quantization while maintaining low perplexity scores, indicating minimal performance degradation within the limits of our experimental setup. Our framework offers a comprehensive approach to activation quantization, enhancing the deployment efficiency of large language models, and points to directions for future research: further tuning of the Hessian eigenvalues to reduce the introduced error, expanding or switching the calibration dataset, and completing the remaining ablation studies. (en_US)
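To make the quantization scheme named in the abstract concrete, below is a minimal sketch of round-to-nearest (RTN) token-wise activation quantization in Python with NumPy. The function names and the 8-bit setting are illustrative assumptions for this example only; this is not code from the thesis or from the GPTQ/GPTAQ implementations.

    # Minimal sketch of round-to-nearest (RTN) token-wise activation quantization.
    # Names and the 8-bit setting are illustrative assumptions, not thesis code.
    import numpy as np

    def rtn_quantize_tokenwise(x: np.ndarray, n_bits: int = 8):
        """Quantize activations x of shape (num_tokens, hidden_dim) with one
        symmetric round-to-nearest scale per token (row)."""
        qmax = 2 ** (n_bits - 1) - 1                 # 127 for 8-bit
        scales = np.abs(x).max(axis=1, keepdims=True) / qmax
        scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero rows
        q = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
        return q, scales

    def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        """Map integer codes back to an approximation of the original activations."""
        return q.astype(np.float32) * scales

    # Example: measure the quantization error on random "activations".
    x = np.random.randn(4, 16).astype(np.float32)
    q, s = rtn_quantize_tokenwise(x)
    print("max abs error:", np.abs(x - dequantize(q, s)).max())

Token-wise scaling is used here because activation ranges vary strongly from token to token. Cross-Layer Equalization, as described in the abstract, instead rescales the output channels of one layer and the matching input channels of the next (which leaves the composed function unchanged for piecewise-linear activations such as ReLU), so that no single channel dominates the quantization range.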
dc.identifier.uri: https://ekmair.ukma.edu.ua/handle/123456789/31756
dc.language.iso: en (en_US)
dc.relation.organisation: НаУКМА (en_US)
dc.status: first published (en_US)
dc.subject: nlp (en_US)
dc.subject: llm (en_US)
dc.subject: gpt (en_US)
dc.subject: quantization (en_US)
dc.subject: eigenvalues (en_US)
dc.subject: masters thesis (en_US)
dc.title: Language model optimization using pruning, distillation and quantization techniques for NLP tasks (en_US)
dc.type: Other (en_US)
Files
Original bundle
Petrenko_master_thesis.pdf (2.36 MB, Adobe Portable Document Format)
Petrenko_Mahisterska_robota I.pdf (1.57 MB, Adobe Portable Document Format)