
Serving AI models at scale with vLLM

Unlock the full potential of your AI models by serving them at scale with vLLM. This video addresses common challenges like memory inefficiency, high latency under load, and large model sizes, showing how vLLM maximizes throughput from your existing hardware. Discover vLLM’s key features such as PagedAttention, Prefix Caching, multi-host serving, and disaggregated serving, and learn how it integrates with Google Cloud GPUs and TPUs for flexible, high-performance AI inference.
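
For context, here is a minimal sketch of offline inference with vLLM's Python API. The model id and prompt are illustrative placeholders, not recommendations from the video; PagedAttention is handled automatically by vLLM, and prefix caching is enabled explicitly to mirror the feature discussed in the video.

# Minimal vLLM offline-inference sketch (placeholder model id and prompt).
from vllm import LLM, SamplingParams

# PagedAttention is built in; enable_prefix_caching lets requests that share
# a prompt prefix (e.g. the same system prompt) reuse cached KV blocks.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model id
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Summarize what PagedAttention does in one sentence."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)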

Chapters:
0:00 – Introduction: The Challenge of Scaling AI
0:25 – 3 Common Issues
1:01 – Solution: vLLM for Performant Serving
1:13 – vLLM Feature: PagedAttention
1:30 – vLLM Feature: Prefix Caching
1:46 – vLLM Feature: Multi-Host and Disaggregated Serving
2:07 – vLLM Support on Google Cloud (GPUs & TPUs)
2:29 – vLLM Tunable Parameters
2:46 – Conclusion
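
As a rough illustration of the tunable parameters covered in the 2:29 chapter, the same knobs can be set through vLLM's Python API. Every value below is a placeholder for illustration, not a tuning recommendation.

# Sketch of common vLLM tunables (all values are illustrative placeholders).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model id
    tensor_parallel_size=4,        # shard the model across 4 accelerator chips
    gpu_memory_utilization=0.90,   # fraction of accelerator memory vLLM may claim
    max_model_len=8192,            # maximum context length to support
    max_num_seqs=256,              # upper bound on concurrently batched requests
)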

Resources:
Welcome to vLLM → https://goo.gle/49zlRZN
TPU Inference GitHub → https://goo.gle/3JUkBpn

Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#GoogleCloud #vLLM #AIInfrastructure

Speaker: Don McCasland
Products Mentioned: AI Infrastructure, Tensor Processing Units, Cloud GPUs

Date: November 13, 2025