
In this Part 1 video, we explore the key challenges of training and deploying large-scale generative AI models — from hardware failures and underutilization to scalability and observability — and introduce Amazon SageMaker HyperPod as the purpose-built solution. We walk through HyperPod's architecture, its EKS-based orchestration layer, the full ML software stack, and key HyperPod features that help teams improve ML performance and resource utilization.
AI on SageMaker HyperPod – our new website and GitHub repo containing Slurm and EKS reference architectures, training and inference code samples, tips and tricks, and setup guides based on 2+ years and way too many cluster deployments to count. Linked here: https://go.aws/4aARwdd
Learn more:
Amazon SageMaker HyperPod Documentation: https://go.aws/4qWpii8
GitHub code of presentation demo: https://go.aws/4s8z9lN
ML Frameworks team repository: https://go.aws/4rHd3Hi
Subscribe to AWS: https://go.aws/subscribe
Create a free AWS account: https://go.aws/signup
Try AWS for free: https://go.aws/free
Connect with an expert: https://go.aws/contact
Explore more: https://go.aws/more
Next steps:
Explore AWS in Analyst Research: https://go.aws/reports
Discover, deploy, and manage software that runs on AWS: https://go.aws/marketplace
Join the AWS Partner Network: https://go.aws/partners
Learn more on how Amazon builds and operates software: https://go.aws/library
Do you have technical AWS questions?
Ask the community of experts on AWS re:Post: https://go.aws/3lPaoPb
Why AWS?
Amazon Web Services is the world’s most comprehensive and broadly adopted cloud, enabling customers to build anything they can imagine. We offer the greatest choice of innovative cloud capabilities and expertise, on the most extensive global infrastructure with industry-leading security, reliability, and performance.
#AWS #AmazonWebServices #CloudComputing #SageMakerHyperPod #GenAI
