Massive language models are here, but getting them to run efficiently is a major challenge. In this episode, Red Hat CTO Chris Wright sits down with Nick Hill, Senior Principal Software Engineer at Red Hat, to explore vLLM, an open-source project revolutionizing AI inference. They discuss how innovations born from systems-level thinking are making AI more practical and accessible.
00:00 – The challenge of running massive language models
00:59 – Nick Hill’s journey from IBM Watson to generative AI
03:03 – What is vLLM and why is it different?
05:41 – Optimizing the KV Cache and GPU utilization
07:35 – PagedAttention: Virtual memory for your GPU
09:51 – Speculative decoding and its CPU parallels
11:50 – The future of distributed and heterogeneous hardware in AI
16:38 – How open source and community are accelerating AI innovation
Learn More:
vLLM Project: https://vllm.ai/
Sky Computing Lab at UC Berkeley: https://sky.cs.berkeley.edu/
Follow us:
Chris Wright on LinkedIn: https://www.linkedin.com/in/chris-wright-b733851/
Chris Wright on X (Twitter): https://twitter.com/kernelcdub
What is Technically Speaking?
Hosted by Red Hat CTO Chris Wright, Technically Speaking taps into emerging technology trends with insights from leading experts across the globe. The series blends deep-dive discussions, tech updates, and creative short-form content, solidifying Red Hat's role as a pioneer in technology innovation and open source thought leadership.
Want to participate? Leave us a comment if there’s a topic or a guest you’d like to see featured.
Watch More Technically Speaking:
YouTube Playlist: https://www.youtube.com/playlist?list=PLbMP1JcGBmSGfI0Rl4s6PpycLF4rZcfW8
Show Page: https://www.redhat.com/en/technically-speaking
Subscribe to Red Hat’s YouTube channel: https://www.youtube.com/redhat/?sub_confirmation=1
#RedHat #vLLM #AIInference #TechnicallySpeaking #OpenSource