
Deploying scalable and reliable AI inference on Google Cloud

Learn how to deploy scalable and reliable AI inference workloads on Google Cloud for millions of users. This video outlines a comprehensive architecture built on multi-region deployments, treating services as disposable, and robust observability. Discover how to identify and overcome performance bottlenecks, leverage frameworks like vLLM for efficiency, and use Google Cloud storage solutions such as Cloud Storage FUSE with Anywhere Cache and Managed Lustre. We also explore the GKE Inference Reference Architecture and the model-aware GKE Inference Gateway for intelligent routing.

Chapters:
0:00 – Introduction to AI inference challenges
0:16 – Building reliable AI deployments
1:13 – Optimizing AI inference performance
2:23 – Strategies for scalable AI storage
3:18 – Introducing the GKE Inference Reference Architecture
3:35 – GKE Inference Gateway capabilities
4:00 – Deploying AI workloads with confidence

Resources:
High-performance parallel file system with Managed Lustre → https://goo.gle/ra-managed-lustre
Optimize AI and ML workloads with Cloud Storage FUSE → https://goo.gle/ra-gcs-fuse

Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#GoogleCloud #GCSFUSE #CloudStorage #Lustre

Speakers: Don McCasland
Products Mentioned: AI Infrastructure, Cloud Storage

Date: November 13, 2025