AI/ML training remains one of the most demanding workloads in modern data infrastructure. Maximizing throughput to drive optimal GPU utilization is critical for accelerating model training, improving accuracy, and reducing operational costs. In the MLPerf Storage 1.0 Open Division benchmark, which focused on the storage-intensive 3D-UNet model, Volumez DIaaS for AI/ML demonstrated extraordinary linear scaling. The solution achieved 1.079 TB/sec throughput with 92.21% GPU utilization and 9.9M IOPS [1] on AWS, establishing a new industry standard for AI/ML training performance.
Benchmark Overview and Industry Impact
Volumez deployed 137 application nodes (c5n.18xlarge), each simulating 3 H100 GPUs, streaming data from 128 media nodes (i3en.24xlarge), each equipped with 60 TB of storage. Unlike traditional architectures, Volumez' DIaaS solution introduces no additional layers to the Linux data path and leverages cloud-aware intelligence to optimize infrastructure for the 3D-UNet workload. This approach delivered a level of speed and efficiency previously unseen in the benchmark, transforming both the economics and scalability of AI/ML training environments.
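For readers who want to put the configuration in per-device terms, the figures above can be broken down with simple arithmetic. The following is a minimal sketch, not part of the benchmark submission: the variable names are illustrative, decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) are an assumption, and it assumes the 1.079 TB/sec result applies to this 137-node, 128-node configuration.

```python
# Back-of-the-envelope figures derived from the published benchmark numbers.
# Assumption: decimal units (1 TB = 1000 GB, 1 PB = 1000 TB).

APP_NODES = 137          # c5n.18xlarge application nodes
GPUS_PER_APP_NODE = 3    # simulated H100 GPUs per application node
MEDIA_NODES = 128        # i3en.24xlarge media nodes
TB_PER_MEDIA_NODE = 60   # storage per media node, in TB
THROUGHPUT_TB_S = 1.079  # reported aggregate throughput, in TB/sec

# Total number of simulated GPUs across all application nodes.
simulated_gpus = APP_NODES * GPUS_PER_APP_NODE

# Aggregate storage across all media nodes, in PB.
total_capacity_pb = MEDIA_NODES * TB_PER_MEDIA_NODE / 1000

# Average throughput per simulated GPU and per media node, in GB/sec.
gb_s_per_gpu = THROUGHPUT_TB_S * 1000 / simulated_gpus
gb_s_per_media_node = THROUGHPUT_TB_S * 1000 / MEDIA_NODES

print(f"Simulated GPUs:            {simulated_gpus}")
print(f"Total media capacity (PB): {total_capacity_pb:.2f}")
print(f"Throughput per GPU (GB/s): {gb_s_per_gpu:.2f}")
print(f"Throughput per media node (GB/s): {gb_s_per_media_node:.2f}")
```

These are averages implied by the published totals, not measured per-node results.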
"These results mark a significant achievement for Volumez," said John Blumenthal, chief product and business officer at Volumez. "The performance and scalability achieved during testing are unprecedented and highlight the critical role Volumez plays in the AI/ML ecosystem, providing solutions that meet the growing demands of AI/ML workloads on cloud infrastructure – to maximize the yield on our industry's scarcest resource, GPUs."
MLPerf Benchmark Achievements include:
- 1.079 TB/sec peak throughput, setting a new benchmark for AI/ML storage performance.
- 92.21% GPU utilization, driving exceptional efficiency in AI model training.
- 9.9M IOPS, highlighting unparalleled data handling capabilities for large-scale workloads.
- Proven scalability for massive datasets, empowering businesses to tackle increasingly complex AI/ML models with ease.
"We are excited to have Volumez participate in their first MLPerf Storage benchmark. The importance that storage plays in the AI technology stack and the innovations happening in this space are critical to the successful deployment of ML training systems. MLPerf benchmark results are important measures for storage consumers to analyze as they seek to procure and tune ML systems to maximize their utilization – and ultimately their return on investment," said David Kanter, head of MLPerf at MLCommons.
Industry Perspective
According to Gartner®, "From a feature and functionality perspective, storage for GenAI is not too different from storage for any other analytics applications. The exception is that the performance capabilities required to feed the compute farm become even more relevant for GenAI and can be amplified at a larger scale. The training stage of GenAI workflow can be very demanding from a performance point of view, depending on the model size. Not only must the storage layer support high throughput to feed the CPU or GPU farm, but it also must have the right performance to support model checkpoint and recovery fast enough to keep the computer farm running." [2] Solutions like Volumez DIaaS are essential for enabling the next generation of AI infrastructure that balances performance, scalability, and cost.
Innovative Results in Real-World Environments
As an active member of the MLCommons community, Volumez took an additional step by submitting a second benchmark run in the Open Division. This submission focused on a real-world trade-off faced by ML engineers and MLOps teams: optimizing throughput and utilization without sacrificing model accuracy. Specifically, we modified the benchmark's weight exchange frequency, a common practice in high-scale environments. This adjustment reduces network overhead, which increases throughput and GPU utilization. The second run delivered 1.140 TB/sec throughput and 97.82% GPU utilization [1], a 5.43% improvement over our first submission.
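The release does not state how the 5.43% figure was calculated, so the sketch below does not attempt to reproduce it. It only illustrates, under that caveat, how relative changes between the two submissions' published numbers could be computed; the helper name `relative_change_pct` is hypothetical.

```python
# Compare the two Open Division submissions using the published figures.
# This is illustrative arithmetic only; it is not the vendor's own calculation.

FIRST = {"throughput_tb_s": 1.079, "gpu_util_pct": 92.21}
SECOND = {"throughput_tb_s": 1.140, "gpu_util_pct": 97.82}


def relative_change_pct(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100


# Relative change in aggregate throughput.
throughput_gain = relative_change_pct(
    FIRST["throughput_tb_s"], SECOND["throughput_tb_s"]
)

# Relative change in GPU utilization.
util_gain_relative = relative_change_pct(
    FIRST["gpu_util_pct"], SECOND["gpu_util_pct"]
)

# Absolute change in GPU utilization, in percentage points.
util_gain_points = SECOND["gpu_util_pct"] - FIRST["gpu_util_pct"]

print(f"Throughput change:           {throughput_gain:.2f}%")
print(f"GPU utilization change:      {util_gain_relative:.2f}%")
print(f"GPU utilization difference:  {util_gain_points:.2f} points")
```

Distinguishing relative change from percentage-point difference matters here, because utilization is itself expressed as a percentage.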
For a deeper dive into the architecture that powers Volumez' DIaaS for AI/ML, see the Volumez white paper. The document explains how the cloud-aware control plane drives these results for AI/ML workloads at scale.
Revolutionizing AI/ML Infrastructure
Volumez has fundamentally redefined the standards for AI/ML training infrastructure. By eliminating traditional bottlenecks and delivering industry-leading performance, the Volumez DIaaS platform empowers organizations to accelerate their AI/ML initiatives and gain a competitive edge in a rapidly evolving market.
[1] Results verified by MLCommons Association. Retrieved from https://mlcommons.org/benchmarks/storage/. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
[2] Gartner, 2024 Strategic Roadmap for Storage, By Jeff Vogel, Julia Palmer, Michael Hoeck, Chandra Mukhyala, 23 February 2024. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
About Volumez
Volumez develops next-generation cloud-native storage as a data-infrastructure-as-a-service company that helps organizations realize the true potential of their data. With its patented controller-less architecture, Volumez tackles latency and scalability challenges by establishing direct Linux data paths, ensuring exceptional performance and resiliency with transformative economics. Through innovative technology and a customer-centric approach, Volumez offers comprehensive solutions that streamline data workflows, enhance data quality, and drive informed decision-making. Discover more at Volumez.com.
Photo - https://mma.prnewswire.com/media/2515277/Volumez.jpg
SOURCE Volumez