Monday, October 14 • 3:30pm - 3:45pm
Sharing Resources in the Age of Deep Learning



The communication patterns of Data Science workloads have become increasingly relevant to Cloud Service Providers, High-Performance Computing (HPC) centers, and database hosting infrastructure. For users, contention for shared resources manifests as slow filesystem access and run-to-run performance variability. Distributed training of Deep Neural Networks (DNNs) requires high-bandwidth, low-latency networks, high-performance filesystems, and GPU resources. Distributed training is known to impact other users' applications, but this impact has yet to be quantified. We use the Global Performance and Congestion Network Tests (GPCNeT) to model common Data Science applications running alongside proxied distributed-training instances and quantify the potential impact. We compare state-of-the-art interconnect features against previous generations and demonstrate that Congestion Management and Adaptive Routing can mitigate common performance issues experienced in multi-user environments, reducing the impact by up to 5x in some cases.


Jacob Balma

Presenter, Cray, Inc.

Richard Walsh

Cray, Inc.

Nick Hill

Cray, Inc.

Monday October 14, 2019 3:30pm - 3:45pm CDT
BRC 280
