Random Samples: The State of LLM Compression — From Research to Production

Welcome to Random Samples — a weekly AI seminar series that bridges the gap between cutting-edge research and real-world application. Designed for AI developers, data scientists, and researchers, each episode explores the latest advancements in AI and how they’re being used in production today.

This week’s topic: The State of LLM Compression — From Research to Production

LLMs have owned the stage, but with size comes complexity. This talk explores the evolving landscape of LLM Compression, from the latest SOTA research to real-world deployments. We’ll break down the high-level effects of techniques such as quantization and sparsity and their tradeoffs between accuracy and performance. Additionally, we’ll walk through the differences between academic and real-world benchmarks, what’s ready for production today, what’s sitting in the research lab, and what it will take to close the gap. Whether you’re optimizing latency on a single GPU or scaling across clusters, we’re diving into a pragmatic look at how compression is changing the way we build and ship generative AI systems.

Session slides: https://docs.google.com/presentation/d/1w961h0QGqRl0Wd3ATdu2B98AeoaloJyaRE2mxZ8W2AU/

Subscribe to stay ahead of the curve: a new deep dive into AI drops every week.

Date: April 18, 2025