
Running simple AI jobs is easy, but scaling them across massive, distributed clusters for thousands of customers is a different challenge. This video explores two robust options for AI orchestration and cluster management on Google Cloud: a cloud-native approach using Google Kubernetes Engine (GKE) and a high-performance approach using Slurm with Cluster Director. Learn how these solutions help you manage hardware, software, data distribution, and fault tolerance for your training and inference workloads so you can scale AI effectively.
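To make the Kubernetes-native path concrete, here is a minimal sketch of submitting a single-GPU training Job to a GKE cluster with the official Kubernetes Python client. The image path, job name, and training command are placeholders for illustration, not anything shown in the video.

```python
from kubernetes import client, config

# Assumes kubectl credentials for your GKE cluster are already configured
# (e.g. via `gcloud container clusters get-credentials`).
config.load_kube_config()

# Hypothetical container image and entrypoint; substitute your own.
container = client.V1Container(
    name="trainer",
    image="us-docker.pkg.dev/my-project/ml/train:latest",
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}  # schedule onto a GPU node pool
    ),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-job"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry the pod up to twice on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container])
        ),
    ),
)

# Create the Job; Kubernetes handles scheduling, retries, and cleanup.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
print("Submitted Job train-job")
```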
Chapters:
0:00 – Introduction
0:55 – Cloud-Native Path: Google Kubernetes Engine (GKE)
1:30 – GKE: Job Orchestration Options (Kubernetes-Native, Ray)
2:10 – High-Performance Path: Slurm with Cluster Director
2:53 – Choosing the Right Path for Your Team
3:18 – Conclusion
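As a taste of the high-performance path covered at 2:10, here is a minimal sketch of submitting a multi-node training job to a Slurm cluster (such as one provisioned with Cluster Director) by piping a batch script to sbatch. The node count, GPU count, and training script are assumptions, not values from the video.

```python
import subprocess

# Hypothetical resource shape and training entrypoint; adjust for your cluster.
batch_script = """#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00

srun python train.py --epochs 10
"""

# With no filename argument, sbatch reads the job script from stdin.
result = subprocess.run(
    ["sbatch"],
    input=batch_script,
    text=True,
    capture_output=True,
    check=True,
)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```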
Resources:
AI/ML orchestration on GKE documentation → https://goo.gle/3X9er7T
New Cluster Director features → https://goo.gle/4nxJdC3
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #GKE #Slurm #ClusterDirector
Speaker: Drew Brown
Products Mentioned: AI Infrastructure, Google Kubernetes Engine (GKE), Cluster Director