University of Michigan Researchers Develop Zeus: A Machine Learning-Based Framework for Optimizing GPU Power Consumption of Deep Neural Network (DNN) Training

Deep neural networks (DNNs) have become widely used in recent years across a variety of data-driven application areas, including speech recognition, natural language processing, computer vision, and personalized recommendations. To handle this growth efficiently, DNN models are usually trained on clusters of highly parallel and increasingly powerful GPUs.

But as this computation scales up, its energy demand grows accordingly. For example, the 1,287 megawatt-hours (MWh) needed to train the GPT-3 model is equivalent to 120 years of a typical American household's electricity use. According to Meta, electricity demand for AI is still increasing despite a 28.5% reduction in its operational energy footprint. Yet the majority of the existing DNN training literature ignores energy efficiency.

Common techniques for improving DNN training performance can be energy-inefficient. For example, many recent papers recommend larger batch sizes for faster training, but maximizing raw throughput can lower energy efficiency. Similarly, although modern GPUs allow configuring a power limit, existing solutions frequently ignore it. The researchers analyzed four generations of NVIDIA GPUs and found that none of them are power-proportional: running at the highest power limit yields diminishing returns in throughput.
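The diminishing returns of high power limits can be illustrated with a back-of-the-envelope calculation. The throughput figures below are hypothetical, invented for illustration (they are not measurements from the paper): throughput grows sublinearly with the power limit, so energy per epoch keeps rising.

```python
# Hypothetical measured throughput (samples/sec) at several GPU power
# limits (W). These numbers are illustrative, not from the Zeus paper.
measurements = {100: 520.0, 150: 700.0, 200: 810.0, 250: 860.0, 300: 880.0}

def energy_per_epoch(power_w, throughput, samples=50_000):
    """Energy (J) for one epoch: power (W) times epoch time (s)."""
    epoch_time_s = samples / throughput
    return power_w * epoch_time_s

for power, tput in sorted(measurements.items()):
    print(f"{power} W -> {energy_per_epoch(power, tput):,.0f} J/epoch")
```

With these (assumed) numbers, the 300 W setting is the fastest but consumes far more energy per epoch than the lower limits, because throughput does not scale proportionally with power.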

Unfortunately, saving energy isn't entirely free. For a given target accuracy, there is a trade-off between energy consumption and training time: improving one comes at the expense of the other. Characterizing the energy-time Pareto frontier highlights two notable findings. First, relative to naively using the maximum batch size and GPU power limit, all Pareto-optimal configurations for a given training job provide some degree of energy reduction. Second, the relationship is nonlinear: allowing a modest increase in training time can often yield a disproportionately large energy saving.
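A configuration is Pareto-optimal when no other configuration is at least as good on both energy and time and strictly better on one. A minimal sketch of computing such a frontier from measured (energy, time) pairs; the configurations and numbers below are invented for illustration.

```python
def pareto_frontier(configs):
    """Return the configs not dominated in (energy, time).

    `configs` maps a config label to an (energy_joules, time_seconds)
    pair. A config is dominated if some other config is no worse on
    both axes and strictly better on at least one.
    """
    frontier = []
    for c, (e, t) in configs.items():
        dominated = any(
            (e2 <= e and t2 <= t) and (e2 < e or t2 < t)
            for c2, (e2, t2) in configs.items() if c2 != c
        )
        if not dominated:
            frontier.append(c)
    return frontier

# Illustrative (energy, time) measurements, not from the paper:
configs = {
    ("bs=32", "200W"):  (1.2e6, 3600),
    ("bs=64", "250W"):  (1.5e6, 2700),
    ("bs=64", "300W"):  (1.9e6, 2800),  # dominated by (bs=64, 250W)
    ("bs=128", "300W"): (2.4e6, 2500),
}
print(pareto_frontier(configs))
```

Note that the maximum batch size and power limit (bs=128, 300W) sits on the frontier only as the fastest point; every other frontier point trades some time for less energy.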

In a recent publication, researchers from the University of Michigan proposed Zeus as a solution to this problem. Zeus is a plug-in optimization framework that automatically configures the batch size and GPU power limit to reduce the overall energy consumption and training time of DNN training jobs. Unlike several recent studies that only consider GPU-level configurations, Zeus tunes both job-level (batch size) and GPU-level (power limit) knobs.
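The paper expresses the user's preferred energy-time trade-off as a single cost metric: a weighted sum of energy and time, with the time term scaled by the GPU's maximum power limit so that both terms are in joules. A sketch of that metric (the parameter names are ours):

```python
def zeus_cost(energy_j, time_s, eta, max_power_w):
    """Energy-time cost in the style of the Zeus paper.

    eta in [0, 1] weights energy; (1 - eta) weights training time,
    scaled by the GPU's maximum power limit so both terms are joules.
    eta = 1 optimizes purely for energy, eta = 0 purely for time.
    """
    return eta * energy_j + (1 - eta) * max_power_w * time_s

# A job using 1.2 MJ over one hour on a 300 W GPU, weighted 50/50:
print(zeus_cost(1.2e6, 3600, eta=0.5, max_power_w=300))
```

Minimizing this cost over (batch size, power limit) pairs picks out a point on the energy-time Pareto frontier matching the user's preference.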

Zeus requires no offline per-job profiling and no trained prediction models, both of which can be prohibitively expensive in large clusters with heterogeneous hardware and varying workloads. Instead, it adopts an online exploration-exploitation strategy tailored to how DNN training actually runs in production: models must be periodically retrained as new data enters the pipeline, which manifests as recurring jobs on production clusters. Zeus exploits this recurrence to automatically try different configurations across runs, measure the resulting gains or losses, and adjust its choices accordingly.

Designing such a solution is difficult because of the sources of uncertainty in DNN training. First, even when the same job runs on the same GPU with the same configuration, its energy consumption varies: the randomness introduced by model initialization and data loading causes the end-to-end training time needed to reach a given model quality to fluctuate. Second, DNN models and GPUs alike have varied architectures and distinct energy characteristics.

Therefore, data collected from offline power profiling of particular models and GPUs does not generalize. Instead, the researchers built a just-in-time (JIT) energy profiler that attaches to a live training job and quickly and efficiently captures its energy characteristics online. Zeus also uses a multi-armed bandit with Thompson sampling, which lets it capture the stochastic nature of DNN training and optimize under uncertainty.
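The bandit idea can be sketched as follows: each candidate batch size is an arm, each recurrence of the training job yields one noisy cost observation, and Thompson sampling balances exploring uncertain arms against exploiting the current best. This is a simplified Gaussian version; the priors, known observation variance, and cost signal are our assumptions, not the exact formulation in Zeus.

```python
import random

class GaussianThompsonBandit:
    """Thompson sampling over candidate batch sizes (illustrative sketch).

    Each arm keeps a Gaussian posterior over its mean cost; on each job
    recurrence we sample from every posterior and run the arm with the
    lowest sampled cost, then update that arm with the observed cost.
    """
    def __init__(self, arms, prior_mean=0.0, prior_var=1e6, obs_var=1.0):
        self.posterior = {a: (prior_mean, prior_var) for a in arms}
        self.obs_var = obs_var  # assumed-known observation noise variance

    def select(self):
        # Draw one sample per arm from its posterior; lowest cost wins.
        samples = {a: random.gauss(m, v ** 0.5)
                   for a, (m, v) in self.posterior.items()}
        return min(samples, key=samples.get)

    def update(self, arm, observed_cost):
        # Conjugate Gaussian update with known observation variance.
        m, v = self.posterior[arm]
        new_v = 1.0 / (1.0 / v + 1.0 / self.obs_var)
        new_m = new_v * (m / v + observed_cost / self.obs_var)
        self.posterior[arm] = (new_m, new_v)

# Usage on a recurring job: pick a batch size, observe its cost, learn.
bandit = GaussianThompsonBandit([32, 64, 128])
batch_size = bandit.select()
# cost = run_training_job(batch_size)   # hypothetical helper
# bandit.update(batch_size, cost)
```

Because exploration happens across recurrences of the same job, no extra profiling runs are needed: every production run doubles as a measurement.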

Test results on a variety of workloads, including speech recognition, image classification, NLP, and recommendation tasks, showed that Zeus reduced training time by up to 60.6% and energy consumption by 15.3% to 75.8% compared to simply choosing the maximum batch size and maximum GPU power limit. Zeus also withstands data drift, converging quickly to the new optimal parameters, and its advantages carry over to multi-GPU configurations.


In this study, the University of Michigan researchers set out to understand and improve the energy consumption of DNN training on GPUs. They characterized the trade-off between training time and energy consumption and showed how common practices can waste energy. Zeus is an online system that determines the energy-time Pareto frontier for recurring DNN training jobs and lets users navigate it by automatically tuning the batch size and GPU power limit of their jobs. Zeus continuously adapts to dynamic workload changes such as data drift, outperforming the state of the art in energy consumption across a variety of workloads and real-world cluster traces. The researchers hope Zeus will encourage the community to treat energy as a first-class resource when optimizing DNN training.

This article is a research summary written by Marktechpost staff based on the research paper 'Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub link.


Asif Razzaq is an AI journalist and co-founder of Marktechpost, LLC. He is a visionary, entrepreneur and engineer who aspires to use the power of artificial intelligence for good.

Asif’s latest endeavor is the development of an artificial intelligence media platform (Marktechpost) that aims to revolutionize the way people find relevant news related to artificial intelligence, data science, and machine learning.

Asif was featured by Onalytica in its ‘Who’s Who in AI? (Influential Voices & Brands)’ as one of the ‘Influential Journalists in AI’. His interview was also featured by Onalytica.

