
Exploring Distributed Caching for Faster GPU Training with NVMe, GDS, and RDMA – Hope Wang & Bin Fan, Alluxio

As GPUs become increasingly powerful, the separation between compute and storage often results in underutilized GPUs waiting for data. Meanwhile, modern high-performance hardware such as NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it's crucial to build a balanced architecture that keeps GPUs fed with data rather than idle.

In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
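The core idea behind such a caching layer is read-through caching: the first access to a dataset file pulls it from slow remote storage onto local NVMe, and every later access is served from the local copy. The sketch below is purely illustrative of that pattern under stated assumptions; the class and function names are hypothetical and do not reflect Alluxio's actual API.

```python
import hashlib
import os
import tempfile

class ReadThroughCache:
    """Illustrative read-through cache: fetch on miss, serve locally on hit."""

    def __init__(self, cache_dir, fetch_fn):
        self.cache_dir = cache_dir
        self.fetch_fn = fetch_fn  # callable standing in for the slow remote read
        self.hits = 0
        self.misses = 0

    def _local_path(self, key):
        # Hash the key so any dataset path maps to a flat local filename.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def read(self, key):
        path = self._local_path(key)
        if os.path.exists(path):
            self.hits += 1            # fast path: local NVMe copy
            with open(path, "rb") as f:
                return f.read()
        self.misses += 1
        data = self.fetch_fn(key)     # slow path: remote storage
        with open(path, "wb") as f:   # populate the local cache
            f.write(data)
        return data

# Usage: two reads of the same sample -- only the first goes to remote storage.
with tempfile.TemporaryDirectory() as nvme_dir:
    cache = ReadThroughCache(nvme_dir, fetch_fn=lambda k: f"payload:{k}".encode())
    cache.read("sample-0")  # miss: fetched remotely, written to cache
    cache.read("sample-0")  # hit: served from the local copy
```

In a real deployment, repeated epochs over the same training data make this hit/miss asymmetry pay off, since each worker rereads the dataset many times while the remote store is touched only once per file.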

Date: October 31, 2024