
Learn why your powerful new AI model might be running slowly during inference. This video surveys the landscape of modern AI models, including Large Language Models (LLMs), Diffusion Models, Visual Language Models (VLMs), and Mixture of Experts (MoE). We break down the four common performance bottlenecks (compute, memory capacity, memory bandwidth, and networking) and share practical strategies engineers can use to identify and address each one, helping you get optimal performance from your AI applications.
Chapters:
0:00 – Introduction: Why is my AI model slow?
0:47 – The 4 types of modern AI models
2:03 – The four common bottlenecks
4:12 – Practical strategies for LLMs (Quantization)
4:52 – Practical strategies for Diffusion Models
5:24 – Practical strategies for Mixture of Experts (MoE)
5:53 – Conclusion: A playbook for performance
Resources:
AI Hypercomputer overview → https://goo.gle/3JMXNb2
Introduction to Cloud TPU → https://goo.gle/4nMA0WE
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #LLM #VLM #AIModel
Speaker: Duncan Campbell
Products Mentioned: AI Infrastructure
