Language model optimization using pruning, distillation and quantization techniques for NLP tasks

dc.contributor.advisor: Marchenko, Oleksandr (en_US)
dc.contributor.author: Petrenko, Mykhailo (en_US)
dc.date.accessioned: 2024-10-05T14:26:58Z
dc.date.available: 2024-10-05T14:26:58Z
dc.date.issued: 2024
dc.description.abstract: The dominant approaches to quantizing neural networks with billions of parameters focus primarily on weight quantization due to accuracy considerations; activation quantization, however, remains a significant bottleneck for inference speed. Building upon the foundational research of GPTQ and Qualcomm, we propose GPTAQ, a framework that introduces activation quantization for large language models (LLMs) while attempting to balance out the activation-induced error with the following enhancements: the use of eigenvalues of the Hessian sensitivity matrix, although our experiments reveal that this approach yields mixed results; Cross-Layer Equalization (CLE), which balances weight scales across layers to prevent channel suppression; and Bias Correction, which compensates for the effects of CLE. We demonstrate the effects of our approach through experiments on the Facebook OPT model, using the C4 dataset for calibration. Our results show that RTN and token-wise activation quantization combined with CLE achieve the best trade-off between model efficiency and accuracy. GPTAQ introduces activation quantization while maintaining low perplexity scores, indicating minimal performance degradation within the limits of our experimental setup. Our framework offers a comprehensive approach to activation quantization, enhancing the deployment efficiency of large language models, and points to directions for future research: further tuning of the Hessian eigenvalues to reduce the introduced error, expanding or switching the calibration dataset, and completing the remaining ablation studies. (en_US)
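To make the quantization scheme named in the abstract concrete, below is a minimal sketch of round-to-nearest (RTN) token-wise activation quantization in Python with NumPy. The function names and the 8-bit setting are illustrative assumptions for this example only; this is not code from the thesis or from the GPTQ/GPTAQ implementations.

    # Minimal sketch of round-to-nearest (RTN) token-wise activation quantization.
    # Names and the 8-bit setting are illustrative assumptions, not thesis code.
    import numpy as np

    def rtn_quantize_tokenwise(x: np.ndarray, n_bits: int = 8):
        """Quantize activations x of shape (num_tokens, hidden_dim) with one
        symmetric round-to-nearest scale per token (row)."""
        qmax = 2 ** (n_bits - 1) - 1                 # 127 for 8-bit
        scales = np.abs(x).max(axis=1, keepdims=True) / qmax
        scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero rows
        q = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
        return q, scales

    def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        """Map integer codes back to an approximation of the original activations."""
        return q.astype(np.float32) * scales

    # Example: measure the quantization error on random "activations".
    x = np.random.randn(4, 16).astype(np.float32)
    q, s = rtn_quantize_tokenwise(x)
    print("max abs error:", np.abs(x - dequantize(q, s)).max())

Token-wise scaling is used here because activation ranges vary strongly from token to token. Cross-Layer Equalization, as described in the abstract, instead rescales the output channels of one layer and the matching input channels of the next (which leaves the composed function unchanged for piecewise-linear activations such as ReLU), so that no single channel dominates the quantization range.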
dc.identifier.uri: https://ekmair.ukma.edu.ua/handle/123456789/31756
dc.language.iso: en (en_US)
dc.relation.organisation: НаУКМА (en_US)
dc.status: first published (en_US)
dc.subject: nlp (en_US)
dc.subject: llm (en_US)
dc.subject: gpt (en_US)
dc.subject: quantization (en_US)
dc.subject: eigenvalues (en_US)
dc.subject: masters thesis (en_US)
dc.title: Language model optimization using pruning, distillation and quantization techniques for NLP tasks (en_US)
dc.type: Other (en_US)
Files
Original bundle
Petrenko_master_thesis.pdf (2.36 MB, Adobe Portable Document Format)
Petrenko_Mahisterska_robota I.pdf (1.57 MB, Adobe Portable Document Format)