
Unlocking Local LLMs with Quantization – Marc Sun, Hugging Face

This talk will share the story of quantization, its rise in popularity, and its current status in the open-source community. We’ll begin by reviewing key quantization papers, such as QLoRA by Tim Dettmers and GPTQ by Elias Frantar. Next, we’ll demonstrate how quantization can be applied at various stages of model development, including pre-training, fine-tuning, and inference. Specifically, we’ll share our experience in pre-training a 1.58-bit model, show how fine-tuning is achievable using PEFT + QLoRA, and discuss optimizing inference performance with torch.compile or custom kernels. Finally, we’ll highlight efforts within the community to make quantized models more accessible, including how transformers incorporate state-of-the-art quantization schemes and how to run GGUF models from llama.cpp.
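As a rough illustration of the fine-tuning workflow mentioned in the abstract, the sketch below loads a base model in 4-bit with transformers and bitsandbytes and attaches LoRA adapters with PEFT. The model id, LoRA rank, and target modules are placeholder choices for illustration, not details taken from the talk.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # 4-bit NF4 quantization config (the QLoRA recipe), computing in bfloat16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # The base model's weights are loaded already quantized to 4 bits
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",      # placeholder model id
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Small trainable LoRA adapters sit on top of the frozen quantized weights
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # illustrative module choice
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapter weights are trainable

From here the adapted model can be trained with a standard training loop; the quantized base weights stay frozen, which is what makes fine-tuning large models feasible on a single consumer GPU.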

Date: October 31, 2024