Bringing a magnifying glass to data center operations
When MIT’s Lincoln Laboratory Supercomputing Center (LLSC) unveiled its TX-GAIA supercomputer in 2019, it provided the MIT community with a powerful new resource for applying artificial intelligence to their research. Anyone at MIT can submit a job to the system, which performs billions of operations per second to train models for various applications, such as detecting tumors in medical images, discovering new drugs, or modeling climate effects. But with this great power comes great responsibility to manage and operate it sustainably – and the team is looking for ways to improve.
“We have these powerful computational tools that allow researchers to create complex models to solve problems, but they can essentially be used as black boxes. What gets lost in this is whether we are actually using the hardware as efficiently as possible,” says LLSC research scientist Siddharth Samsi.
To better understand this challenge, the LLSC has collected detailed data on TX-GAIA usage over the past year. Over a million user jobs later, the team has released the open-source dataset to the computing community.
Their goal is to enable IT and data center operators to better understand avenues for data center optimization – an important task as processing needs continue to grow. They also see potential for leveraging AI in the data center itself, using the data to develop models that predict points of failure, optimize job scheduling, and improve energy efficiency. While cloud providers are actively working to optimize their data centers, they often don’t make their data or models available to the wider high-performance computing (HPC) community. The release of this dataset and associated code aims to fill that gap.
“Data centers are changing. We have an explosion of hardware platforms, the types of workloads are changing, and the types of people using data centers are changing,” says Vijay Gadepally, Principal Investigator at LLSC. “Until now, there was no effective way to analyze the impact on data centers. We view this research and dataset as a big step toward a principled approach to understanding how these variables interact with each other, and then applying AI to gain insights and improvements.”
Papers describing the dataset and potential applications have been accepted at a number of venues, including the IEEE International Symposium on High-Performance Computer Architecture, the IEEE International Parallel and Distributed Processing Symposium, the Annual Conference of the North American Chapter of the Association for Computational Linguistics, the IEEE Conference on High-Performance and Embedded Computing, and the International Conference for High Performance Computing, Networking, Storage, and Analysis.
Ranked among the world’s TOP500 supercomputers, TX-GAIA combines traditional computing hardware (central processing units, or CPUs) with nearly 900 graphics processing unit (GPU) accelerators. These NVIDIA GPUs are specialized for deep learning, the class of AI that has given rise to speech recognition and computer vision.
The dataset covers CPU, GPU, and memory usage per job; scheduling logs; and physical monitoring data. Compared to similar datasets, such as those from Google and Microsoft, the LLSC dataset offers “labeled data, a variety of known AI workloads, and more detailed time series data compared to previous datasets. To our knowledge, this is one of the most comprehensive and accurate datasets available,” says Gadepally.
In particular, the team collected time-series data at an unprecedented level of detail: 100-millisecond intervals on each GPU and 10-second intervals on each CPU, as the machines processed over 3,000 known deep learning jobs. One of the first goals is to use this labeled dataset to characterize the workloads that different types of deep learning jobs place on the system. This process would extract features that reveal differences in how the hardware processes natural language models versus image classification or materials design models, for example.
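As a rough illustration of what extracting features from such a time series could look like, the sketch below summarizes a GPU utilization trace sampled every 100 milliseconds. It is not the LLSC team's method: the function name `extract_features`, the choice of summary statistics, the 50 percent "busy" threshold, and the two sample traces are all assumptions made up for this example.

```python
from statistics import mean, pstdev

# 100-millisecond GPU sampling interval, as reported in the article.
GPU_SAMPLE_INTERVAL_S = 0.1

# Hypothetical threshold (percent utilization) above which a sample counts as "busy".
BUSY_THRESHOLD = 50.0


def extract_features(utilization):
    """Summarize one job's GPU utilization trace (values in percent)."""
    if not utilization:
        raise ValueError("trace must contain at least one sample")
    busy_samples = sum(1 for u in utilization if u >= BUSY_THRESHOLD)
    return {
        "duration_s": len(utilization) * GPU_SAMPLE_INTERVAL_S,
        "mean_util": mean(utilization),
        "std_util": pstdev(utilization),
        "peak_util": max(utilization),
        "busy_fraction": busy_samples / len(utilization),
    }


# Made-up traces: one job keeps the GPU steadily busy, the other alternates
# between idle and busy phases.
steady_trace = [90.0, 92.0, 91.0, 89.0, 93.0, 90.0]
bursty_trace = [10.0, 95.0, 5.0, 98.0, 8.0, 96.0]

steady = extract_features(steady_trace)
bursty = extract_features(bursty_trace)
```

In this toy case the two jobs differ mostly in `std_util` and `busy_fraction`, the kind of signal that could help tell one class of workload from another.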
The team has now launched the MIT Datacenter Challenge to mobilize this research. The challenge calls on researchers to use AI techniques to identify with 95 percent accuracy the type of job that was run, using the team’s labeled time-series data as ground truth.
Such information could allow data centers to better match a user’s job request with the most suitable hardware, potentially saving energy and improving system performance. Classifying workloads could also allow operators to quickly notice deviations resulting from hardware failures, inefficient data access patterns, or unauthorized use.
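To make the workload classification task described above more concrete, here is a minimal sketch that trains a nearest-centroid classifier on labeled feature vectors and checks its accuracy against the 95 percent target. The classifier choice, the two-dimensional features (mean and standard deviation of GPU utilization), the class names, and every data point are assumptions for illustration; the challenge does not prescribe this approach.

```python
from collections import defaultdict
from math import dist

TARGET_ACCURACY = 0.95  # accuracy goal stated for the challenge


def fit_centroids(features, labels):
    """Average the feature vectors of each class into one centroid per label."""
    grouped = defaultdict(list)
    for vec, label in zip(features, labels):
        grouped[label].append(vec)
    return {
        label: tuple(sum(column) / len(vecs) for column in zip(*vecs))
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


def accuracy(centroids, features, labels):
    """Fraction of feature vectors assigned their ground-truth label."""
    correct = sum(predict(centroids, v) == y for v, y in zip(features, labels))
    return correct / len(labels)


# Made-up (mean_util, std_util) pairs with hypothetical job-type labels.
train_x = [(90.0, 2.0), (88.0, 3.0), (40.0, 30.0), (45.0, 28.0)]
train_y = ["vision", "vision", "language", "language"]
test_x = [(87.0, 4.0), (42.0, 31.0)]
test_y = ["vision", "language"]

centroids = fit_centroids(train_x, train_y)
test_accuracy = accuracy(centroids, test_x, test_y)
meets_target = test_accuracy >= TARGET_ACCURACY
```

Real traces would need far richer features and a held-out test set large enough for the accuracy figure to mean something; the toy data here only shows the shape of the evaluation.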
Too many choices
Today, LLSC offers tools that allow users to submit their jobs and select the processors they want to use, “but that’s a lot of guesswork on the part of users,” says Samsi. “Someone might want to use the latest GPU, but maybe their computation doesn’t actually need it and they could get equally impressive results on CPUs or lower-powered machines.”
Professor Devesh Tiwari of Northeastern University is working with the LLSC team to develop techniques that can help users match their workloads to the appropriate hardware. Tiwari explains that the emergence of different types of AI accelerators, GPUs, and CPUs has left users with too many choices. Without the right tools to take advantage of this heterogeneity, they miss out on the benefits: better performance, lower costs, and increased productivity.
“We are closing this capability gap by making users more productive and helping users do science better and faster without worrying about managing heterogeneous hardware,” says Tiwari. “My PhD student, Baolin Li, is developing new capabilities and tools to help HPC users take advantage of heterogeneity in a near-optimal way without user intervention, using techniques based on Bayesian optimization and other learning-based optimization methods. But this is only the beginning. We are looking for ways to introduce heterogeneity into our data centers as part of a principled approach to help our users make the most of heterogeneity in an autonomous and cost-effective way.”
Workload classification is the first of many problems to be posed through the Datacenter Challenge. Others include developing AI techniques to predict job failures, conserve energy, or create job scheduling approaches that improve data center cooling efficiency.
To mobilize research on greener computing, the team also plans to release an environmental dataset of TX-GAIA operations, containing rack temperature, power consumption, and other relevant data.
According to the researchers, huge opportunities exist to improve the energy efficiency of HPC systems used for AI processing. As an example, recent work from LLSC has determined that a simple hardware tweak, such as limiting the amount of power an individual GPU can consume, could reduce the energy cost of training an AI model by 20 percent, with only modest increases in computation time. “This reduction translates to about an entire week of household energy for just a three-hour increase in time,” says Gadepally.
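The arithmetic behind a power cap is simple: energy is average power multiplied by time, so a job that draws less power can run somewhat longer and still use less energy overall. The sketch below works through one hypothetical case. The wattages and run times are invented for illustration and are not figures from the LLSC study; only the roughly 20 percent saving and the three-hour time increase echo the article.

```python
def energy_kwh(avg_power_w, hours):
    """Energy used at a given average power draw, in kilowatt-hours."""
    return avg_power_w * hours / 1000.0


def savings_fraction(base_power_w, base_hours, capped_power_w, capped_hours):
    """Fraction of energy saved by the capped run relative to the uncapped run."""
    base = energy_kwh(base_power_w, base_hours)
    capped = energy_kwh(capped_power_w, capped_hours)
    return (base - capped) / base


# Hypothetical training job: uncapped GPU averaging 250 W for 80 hours,
# versus a capped GPU averaging 193 W that needs 3 extra hours.
baseline_kwh = energy_kwh(250.0, 80.0)
capped_kwh = energy_kwh(193.0, 83.0)
saved = savings_fraction(250.0, 80.0, 193.0, 83.0)
```

With these assumed numbers the capped run takes about 4 percent longer but uses roughly a fifth less energy, because the drop in power draw outweighs the added time.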
They also developed techniques to predict model accuracy, so users can quickly end experiments that are unlikely to yield meaningful results, thereby saving energy. The Datacenter Challenge will share relevant data to allow researchers to explore other opportunities to save energy.
The team expects that lessons learned from this research can be applied to the thousands of data centers operated by the US Department of Defense. The US Air Force is a sponsor of this work, which is being conducted as part of the USAF-MIT AI Accelerator.
Other collaborators include researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). Professor Charles Leiserson’s Supertech research group is studying performance-enhancing techniques for parallel computing, and research scientist Neil Thompson is designing studies on how to incentivize data center users to adopt climate-friendly behavior.
Samsi presented this work at the inaugural AI for Datacenter Optimization (ADOPT’22) workshop last spring as part of the IEEE International Parallel and Distributed Processing Symposium. The workshop officially introduced their Datacenter Challenge to the HPC community.
“We hope this research will enable us and others who run supercomputing centers to be more responsive to user needs while reducing power consumption at the center level,” Samsi said.