Large Language Models are unlocking incredible capabilities, but deploying them comes with hidden costs. Discover the real-world economics of running LLMs and why model compression is no longer optional: it's a requirement for sustainable AI.
In this video, Red Hat expert Mark Kurtz breaks down the staggering infrastructure costs of deploying modern LLMs for common use cases like Retrieval-Augmented Generation (RAG) and summarization. Learn why over half of today’s deployments waste compute and money by running uncompressed models, and see how you can avoid the same mistake.
You will learn:
• Why model sizes are growing faster than the hardware they run on.
• How techniques like quantization, pruning, and knowledge distillation work (see the sketch after this list).
• The dramatic cost savings, 8x or more in some deployments, from serving optimized models.
• How Red Hat’s contributions to open source tools like vLLM, LLM Compressor, and InstructLab make these savings accessible.
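Curious what quantization looks like in practice? Below is a minimal, self-contained sketch (not the exact scheme covered in the video, and not LLM Compressor’s API): it maps FP32 weights to INT8 with a single per-tensor scale, cutting storage roughly 4x at the cost of a small reconstruction error.

```python
# Minimal illustration of symmetric per-tensor INT8 quantization.
# This is a simplified assumption, not the method used in the video.
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map FP32 weights to INT8 plus one scale factor (about 4x less storage)."""
    scale = np.abs(weights).max() / 127.0                      # largest value maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original FP32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)                   # stand-in for a weight matrix
q, s = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```

Production tools such as LLM Compressor refine this idea with per-channel or per-group scales and calibration data, which is how they keep accuracy loss small while shrinking the model.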
Timestamps:
00:00 – The hidden cost of Large Language Models (LLMs)
00:31 – Why LLM deployments are expensive
02:49 – The Solution: Model compression explained
03:07 – Technique 1: Quantization
03:47 – Technique 2: Pruning
04:15 – Technique 3: Knowledge & data distillation
04:56 – Technique 4: Speculative decoding
05:50 – The cost of an online RAG use case
07:20 – How compression cuts RAG deployment costs
09:20 – How Red Hat enables optimized AI
Ready to reduce your AI deployment costs? Explore Red Hat’s tools and pre-compressed models.
🧠 Learn more about Red Hat’s approach to AI → https://www.redhat.com/en/topics/ai
🛠️ Find pre-compressed models on Red Hat’s Hugging Face Hub → https://huggingface.co/RedHatAI
📖 Read the blog on faster, more efficient LLM inference → https://www.redhat.com/en/blog/meet-vllm-faster-more-efficient-llm-inference-and-serving
#AI #RedHat #vLLM