Optimize Your AI Cloud Infrastructure: A Hardware Perspective – Liang Yan, CoreWeave
The GPU cloud has become a ubiquitous component of contemporary AI infrastructure, especially for distributed machine learning. While conversations around AI infrastructure optimization typically revolve around the application layer, such as machine learning tasks and distributed job schedulers, looking under the hood of the GPU cloud is just as essential. Numerous factors, including the pod scheduler, device plugins, GPU/NUMA topology, and the RoCE/NCCL stack, can significantly impact performance.
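As a small illustration of the kind of node-level tuning this involves, the sketch below shows a topology check and some NCCL environment settings often adjusted on RoCE fabrics. The specific values are illustrative assumptions, not recommendations for any particular cluster; correct settings depend on the NIC, fabric, and NCCL version.

```shell
# Inspect GPU/NIC/NUMA affinity on a node (requires NVIDIA driver tools)
nvidia-smi topo -m

# Example NCCL settings for a RoCE fabric; values are illustrative only
export NCCL_IB_HCA=mlx5            # restrict NCCL to the RoCE-capable NICs
export NCCL_SOCKET_IFNAME=eth0     # interface used for bootstrap traffic
export NCCL_IB_GID_INDEX=3         # RoCEv2 GID index (commonly 3)
export NCCL_DEBUG=INFO             # surface transport selection in the logs
```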
This session will thoroughly explore the tuning of various model architectures (CNN, RNN, Transformer) from MLPerf, using an H100 cluster as a reference. We will analyze the correlation between model performance and device and operator configuration within nodes, presenting first-hand experimental results to unveil the hidden potential within an AI cloud.