Cutting-edge AI engineering and research is increasingly dependent on computing power to drive the digital exploration of our physical world. From genomics to aerodynamics, engineers and scientists are using sophisticated simulations, generative design, machine learning, and digital twins to develop their designs.
Harnessing this computing power today requires navigating an increasingly rich and complex world of specialized and domain-specific computing architectures, where choosing the right infrastructure configuration can mean solving problems three times faster, or at one third of the cost or energy.
To address this challenge, Rescale has developed a new technology we call the Compute Recommendation Engine (CRE). Rescale’s CRE applies AI to find the optimal computing stack for any workload to best utilize the multi-cloud infrastructure we have at our disposal to power the next era in engineering and scientific computing.
Following the recommendations of Rescale’s CRE, our customers have accelerated simulations by over 200 percent and reduced costs by over 60 percent on key benchmarks.
The Rise of Specialized and Domain-Specific Computing Architectures
For decades, compute power has tracked Moore’s Law, the observation that the number of transistors on an integrated circuit doubles approximately every two years, as chip manufacturers pack ever more transistors into each chip. But this trend is running into physical limits, as processors simply can’t get much more densely packed than they are now.
This end of Moore’s Law means that compute-intensive software like simulations can no longer benefit from performance gains at the same rate without a disproportionate increase in cost and energy consumption.
This bottleneck has created an opportunity for domain-specific architectures, or accelerated computing, optimized for running specific workloads. Companies like NVIDIA have been able to demonstrate performance gains that are outpacing Moore’s Law for GPU-accelerated workloads like graphics processing and machine learning.
Additionally, cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud are building their own chips, while Arm-based IP is democratizing chip design for startups and new entrants in the space.
We are now in the midst of a Cambrian explosion of computing architectures. The number of specialized processors and hardware architecture configurations has grown dramatically. In 2022 alone, almost 400 new hardware options were introduced by cloud service providers. Currently, organizations can choose from more than 1,500 unique architectures.
These choices in specialized computing architecture have delivered some incredible performance improvements for workloads that embraced these capabilities, providing 1,000-fold performance gains in just the past 10 years.
The Need for a Compute Recommendation Engine
To capture the gains made possible by domain-specific architectures, each workload needs to be matched against the optimal computing architecture configuration. Rescale has spent over a decade building expertise and helping organizations make the right choices regarding the hardware architectures, operating systems, software, storage, and networking technologies that power their R&D efforts.
Across the several hundred computing architectures, plus the additional possible configurations for storage, networking, middleware, applications, and models, organizations must choose among more than 50 million possible combinations to assemble an optimal compute stack.
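To make the scale of this search space concrete, here is a minimal sketch of how independent choices at each layer of the stack multiply. The layer names and option counts below are hypothetical, chosen only to show how a total in the tens of millions can arise; they are not figures published by Rescale.

```python
from math import prod

# Hypothetical option counts per layer of the stack (illustrative only,
# not Rescale's actual numbers).
stack_options = {
    "architecture": 400,
    "storage": 10,
    "networking": 5,
    "middleware": 10,
    "application_version": 25,
    "core_count_setting": 10,
}


def count_combinations(options: dict[str, int]) -> int:
    """Number of distinct stacks if every layer can be chosen independently."""
    return prod(options.values())


total = count_combinations(stack_options)
print(f"{total:,} possible stacks")
```

In practice many combinations are incompatible with one another, so the real count depends on which pairings are valid. The point of the sketch is only that exhaustive benchmarking becomes impractical very quickly.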
Through several evolutions, we have now developed what we believe to be one of the most important breakthroughs in full-stack computing since the advent of cloud computing: an AI-based Compute Recommendation Engine.
Rescale’s approach has been to develop data-driven automation by tapping into the wealth of proprietary telemetry from usage patterns and hardware performance on real customer workloads, as well as a formidable collection of intelligence, tooling, and automation for engineering and scientific computing hardware, software, networks, and storage.
How the Rescale Compute Recommendation Engine Works
At its core, the Compute Recommendation Engine identifies the best possible compute architecture and scale for running a given workload. The recommendation can be delivered as a user is about to submit a batch job, or applied across a series of benchmarks in Performance Profiles, for example.
To develop this intelligence, Rescale has taken advantage of our extensive database of historical benchmarks, enhanced metadata metrics, and detailed infrastructure performance time-series telemetry data. Then, using recurrent neural networks, we were able to output a series of tags that help infer the type of hardware that would perform best for the workload.
Rescale’s CRE uses this rich and current dataset of many millions of metadata points to provide AI-driven recommendations on the best hardware for a given type of workload, balancing performance against cost according to the user’s needs. Through a process of AI-driven filtering and scoring, the CRE assesses which chips and architectures are most effective for a given application and computing task. This provides a near-instantaneous predictive benchmark and AI-powered recommendation that makes it easy for organizations to pick the right HPC architecture for their particular needs.
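The filter-then-score idea can be illustrated with a small sketch. This is not Rescale's implementation: the candidate names, the memory constraint, the predicted runtimes and prices, and the weighted-ratio score are all assumptions made for illustration. Candidates that cannot run the workload are filtered out, and the rest are ranked by a score that blends relative cost and relative runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                  # hypothetical architecture label
    predicted_hours: float     # assumed predicted runtime for the workload
    price_per_hour: float      # assumed price of the whole cluster per hour
    memory_gb_per_core: float

    @property
    def predicted_cost(self) -> float:
        return self.predicted_hours * self.price_per_hour


def recommend(candidates, min_memory_gb_per_core, cost_weight=0.5):
    """Drop infeasible candidates, then rank the rest (best first)."""
    feasible = [c for c in candidates
                if c.memory_gb_per_core >= min_memory_gb_per_core]
    if not feasible:
        return []
    best_time = min(c.predicted_hours for c in feasible)
    best_cost = min(c.predicted_cost for c in feasible)

    def score(c):
        # Each term equals 1.0 for the best candidate and grows as it gets worse.
        time_term = c.predicted_hours / best_time
        cost_term = c.predicted_cost / best_cost
        return cost_weight * cost_term + (1.0 - cost_weight) * time_term

    return sorted(feasible, key=score)


candidates = [
    Candidate("cpu-general", 9.0, 3.0, 4.0),
    Candidate("cpu-hpc", 3.0, 10.0, 8.0),
    Candidate("cpu-lowmem", 2.5, 6.0, 1.0),  # filtered out: too little memory
]
balanced = recommend(candidates, min_memory_gb_per_core=4.0, cost_weight=0.5)
cheapest = recommend(candidates, min_memory_gb_per_core=4.0, cost_weight=1.0)
print([c.name for c in balanced], [c.name for c in cheapest])
```

With an even weighting the faster "cpu-hpc" option ranks first, while a cost-only weighting prefers "cpu-general". This mirrors the trade-off between performance and cost that a user would tune.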
Rescale Compute Recommendation Engine Performance
In internal testing and validation, the CRE has proven more than 95 percent accurate in identifying the optimal chips and full-stack configurations for a given computational task, ensuring the most efficient scaling and costs. The recommender system can also be used to identify key bottlenecks (including network message passing, storage, and memory performance) in the overall HPC architecture in a data-driven and automated way. As an added benefit, its output can guide application developers toward changes in their code architecture that improve performance.
Following the recommendations of the CRE, our customers have accelerated simulations by over 200 percent and reduced costs by over 60 percent on key benchmarks. These benefits are immediate, and the CRE also keeps customers current with the optimal recommendations as new technologies and optimizations in the stack become available.
The CRE provides an initial chip architecture recommendation based on the application the customer wants to run and other metadata from the workflow. Then, based on the performance of the job during runtime, the CRE will train on this metadata set and fine-tune its recommendations. Critically, the CRE will oversee jobs continuously and provide additional recommendations, monitoring the market on a continuous basis to know when new chip architectures become available.
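One simple way to picture this refine-as-you-go loop is an estimate that starts from a default and moves toward each newly observed runtime. The sketch below uses an exponential moving average purely as a stand-in; the class name, the learning rate, and the application and architecture labels are hypothetical, and the source does not describe the CRE's actual training procedure.

```python
class RuntimeEstimator:
    """Toy online estimator of runtime per (application, architecture) pair."""

    def __init__(self, learning_rate: float = 0.5):
        self.learning_rate = learning_rate
        self.estimates: dict[tuple[str, str], float] = {}

    def predict(self, app: str, arch: str, default: float) -> float:
        # Fall back to a default when no job has been observed yet.
        return self.estimates.get((app, arch), default)

    def observe(self, app: str, arch: str, observed_hours: float) -> float:
        key = (app, arch)
        if key not in self.estimates:
            # First observation replaces the default outright.
            self.estimates[key] = observed_hours
        else:
            # Move the estimate part of the way toward the new observation.
            old = self.estimates[key]
            self.estimates[key] = old + self.learning_rate * (observed_hours - old)
        return self.estimates[key]


estimator = RuntimeEstimator(learning_rate=0.5)
initial = estimator.predict("cfd-solver", "arch-a", default=6.0)
after_first = estimator.observe("cfd-solver", "arch-a", 4.0)
after_second = estimator.observe("cfd-solver", "arch-a", 2.0)
print(initial, after_first, after_second)
```

Each completed job nudges the estimate, so later recommendations for the same application reflect measured behavior in place of the initial guess.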
The recommendations provide guidance in numerous ways for how to optimize workload performance. For example, a recommendation may be to reduce the number of compute cores used when job metadata metrics indicate that the job is scaling inefficiently. Alternative hardware architectures with lower interconnect latency may be recommended when message-passing metadata indicates a networking bottleneck.
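The two examples above can be expressed as simple rules over job metrics. The following sketch assumes a hypothetical metrics record and hypothetical thresholds (70 percent parallel efficiency, 30 percent of wall time in message passing); the source does not state how the CRE derives these recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobMetrics:
    cores: int
    runtime_hours: float
    baseline_cores: int            # smaller reference run of the same job
    baseline_runtime_hours: float
    mpi_time_fraction: float       # share of wall time in message passing (0 to 1)


def parallel_efficiency(m: JobMetrics) -> float:
    """Achieved speedup divided by the ideal (linear) speedup."""
    speedup = m.baseline_runtime_hours / m.runtime_hours
    ideal_speedup = m.cores / m.baseline_cores
    return speedup / ideal_speedup


def suggest(m: JobMetrics, min_efficiency: float = 0.7,
            max_mpi_fraction: float = 0.3) -> list[str]:
    suggestions = []
    if parallel_efficiency(m) < min_efficiency:
        suggestions.append("reduce core count")
    if m.mpi_time_fraction > max_mpi_fraction:
        suggestions.append("try a lower-latency interconnect")
    return suggestions


# 4x the cores for only a 2x speedup, with heavy message-passing overhead.
job = JobMetrics(cores=256, runtime_hours=2.0, baseline_cores=64,
                 baseline_runtime_hours=4.0, mpi_time_fraction=0.45)
print(parallel_efficiency(job), suggest(job))
```

Here the job uses four times the cores for only twice the speed, so its parallel efficiency is 0.5 and both rules fire.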
Using the Rescale Compute Recommendation Engine
The intelligence of the CRE technology is already available to all Rescale customers today through AI-Recommended hardware architectures as part of the Compute Optimizer feature, and embedded recommendations are coming to Performance Profiles soon.
Compute Optimizer eliminates the need for benchmarking and automatically keeps the recommendations for workloads dynamically updated, providing significant productivity, cost, and time benefits. These automated recommendations are always available to every user on the Rescale platform.
With just a couple of clicks in the Performance Profiles dashboard, users can benchmark the performance of any chip architecture with their own specific computing workload. Performance Profiles makes it easy for customers to assess the detailed cost, performance, and energy consumption of any HPC architecture customized to a specific workload.
Performance Profiles and Compute Optimizer are just two examples of how elements of the CRE are making it much easier for organizations to manage their engineering and scientific computing resources.
Over time, we will continuously improve the CRE functionality with richer data analysis and greater automation of the recommendation process. For example, the CRE could be used to instantly conduct an audit across months of historical data to measure improvement opportunities. Or it could be used to forecast technology curve improvements (based on historical trends) to guide future infrastructure planning choices.
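As a rough illustration of what such an audit might compute, the sketch below totals what past jobs actually cost against what they would have cost on a recommended configuration. The record fields and the figures are hypothetical; a real audit would draw on the platform's own job history.

```python
def audit(history):
    """Compare actual spend with spend under recommended configurations."""
    actual = sum(job["actual_cost"] for job in history)
    # A job only counts as an opportunity if the recommendation is cheaper.
    recommended = sum(min(job["actual_cost"], job["recommended_cost"])
                      for job in history)
    savings = actual - recommended
    savings_pct = 100.0 * savings / actual if actual else 0.0
    return {
        "actual": actual,
        "recommended": recommended,
        "savings": savings,
        "savings_pct": savings_pct,
    }


# Hypothetical job records (costs in arbitrary currency units).
history = [
    {"job": "run-001", "actual_cost": 100.0, "recommended_cost": 40.0},
    {"job": "run-002", "actual_cost": 50.0, "recommended_cost": 50.0},
    {"job": "run-003", "actual_cost": 200.0, "recommended_cost": 120.0},
]
report = audit(history)
print(report)
```

The same aggregation could be grouped by team, application, or month to show where the largest improvement opportunities lie.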
The metadata that informs the CRE updates dynamically as new hardware architectures, software applications, middleware configurations, and workload use cases are added and run on the platform. New forms of data, such as metadata related to energy use, carbon footprint, and other sustainability metrics, can be added to help organizations make more holistic compute architecture decisions.
Accelerated Computing is Fueling Engineering and Scientific Breakthroughs
Rescale’s Compute Recommendation Engine was born from our recognition that optimizing the computing stack has profound implications for digital R&D. With the exponential growth of specialized semiconductors and accelerated computing choices, organizations can reap unprecedented and continued benefits from cloud-based engineering and scientific computing.
Rescale is committed to evolving the intelligence of our platform to provide the best computing capabilities for our customers and empower engineering and science teams to help them invent the future.