Rescale’s Compute Recommendation Engine: AI-Driven Intelligence for Full-Stack Computing

Cutting-edge AI engineering and research is increasingly dependent on computing power to drive the digital exploration of our physical world. From genomics to aerodynamics, engineers and scientists are using sophisticated simulations, generative design, machine learning, and digital twins to develop their designs.

Harnessing this computing power today requires navigating the increasingly rich and complex world of specialized and domain-specific computing architectures, where choosing the right infrastructure configuration can mean solving problems three times faster, or at one third of the cost or energy.

To address this challenge, Rescale has developed a new technology we call the Compute Recommendation Engine (CRE). Rescale’s CRE applies AI to find the optimal computing stack for any workload, making the best use of the multi-cloud infrastructure at our disposal to power the next era in engineering and scientific computing.

Following the recommendations of Rescale’s CRE, our customers have accelerated simulations by over 200 percent and reduced costs by over 60 percent on key benchmarks.

The Rise of Specialized and Domain-Specific Computing Architectures

For decades, compute power has tracked Moore’s Law, the observation that the number of transistors on an integrated circuit doubles approximately every two years, as chip manufacturers packed ever more transistors onto each chip. But this trend is running into physical limits, as transistors simply can’t get much more densely packed than they are now.

This end of Moore’s Law means that compute-intensive software like simulations can no longer benefit from performance gains at the same rate without a disproportionate increase in cost and energy consumption.

This bottleneck has created an opportunity for domain-specific architectures, or accelerated computing, optimized for running specific workloads. Companies like NVIDIA have been able to demonstrate performance gains that are outpacing Moore’s Law for GPU-accelerated workloads like graphics processing and machine learning.

Additionally, cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud are building their own chips, while Arm-based IP is democratizing chip design for startups and new entrants in the space.

As Moore’s Law has slowed down, specialized chip architectures have exploded, creating greater performance gains but also making HPC choices more complex.

We are now in the midst of a Cambrian explosion of computing architectures. The number of specialized processors and hardware architecture configurations has grown dramatically. In 2022 alone, almost 400 new hardware options were introduced by cloud service providers. Currently, organizations can choose from more than 1,500 unique architectures.

These choices in specialized computing architecture have delivered some incredible performance improvements for workloads that embraced these capabilities, providing 1,000-fold performance gains in just the past 10 years.
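To put that figure in perspective, a quick back-of-the-envelope calculation helps. The sketch below only does arithmetic on the numbers quoted above (a doubling every two years versus a 1,000-fold gain in 10 years); the variable names are ours, for illustration.

```python
years = 10

# Gain from a steady doubling every two years over the same period.
moore_gain = 2 ** (years / 2)

# Yearly improvement factor implied by a 1,000-fold gain in 10 years.
annual_factor = 1000 ** (1 / years)

print(f"Doubling every two years for {years} years: {moore_gain:.0f}x")
print(f"1,000x in {years} years implies about {annual_factor:.2f}x per year")
```

In other words, a 1,000-fold gain over a decade corresponds to roughly doubling every year, compared with about 32-fold from doubling every two years.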

The Need for a Compute Recommendation Engine

To capture the gains made possible by domain-specific architectures, each workload needs to be matched against the optimal computing architecture configuration. Rescale has spent over a decade building expertise and helping organizations make the right choices regarding the hardware architectures, operating systems, software, storage, and networking technologies that power their R&D efforts.

With several hundred computing architectures, plus the additional possible configurations for storage, networking, middleware, applications, and models, organizations must choose among over 50 million possible combinations to deliver an optimal compute stack.
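The size of that search space comes from multiplication: each independent choice in the stack multiplies the number of possible configurations. The counts below are invented purely to illustrate the arithmetic and are not Rescale's actual breakdown.

```python
import math

# Hypothetical option counts per layer of the stack (illustrative only).
options_per_layer = {
    "architectures": 500,
    "storage": 10,
    "networking": 10,
    "middleware": 10,
    "applications_and_models": 100,
}

# Independent choices multiply, so the total is the product of the counts.
total_combinations = math.prod(options_per_layer.values())
print(f"{total_combinations:,} possible stacks")  # 50,000,000 possible stacks
```

Even modest counts per layer reach tens of millions of combinations, which is why exhaustive benchmarking is impractical.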

Through several evolutions, we have now developed what we believe to be one of the most important breakthroughs in full-stack computing since the advent of cloud computing: an AI-based Compute Recommendation Engine.

Rescale’s approach has been to develop data-driven automation by tapping into the wealth of proprietary telemetry from usage patterns and hardware performance on real customer workloads, as well as a formidable collection of intelligence, tooling, and automation for engineering and scientific computing hardware, software, networks, and storage.

How the Rescale Compute Recommendation Engine Works

At its core, the Compute Recommendation Engine identifies the best compute architecture and scale for running a given workload. The recommendation can be surfaced as a user is about to run a batch job, for example, or explored as a series of benchmarks in Performance Profiles.

To develop this intelligence, Rescale has taken advantage of our extensive database of historical benchmarks, enhanced metadata metrics, and detailed infrastructure performance time-series telemetry data. Then, using recurrent neural networks, we output a series of tags that help infer the type of hardware that would perform best for the workload.

Rescale’s CRE uses this rich and current dataset of many millions of metadata points to provide AI-driven recommendations on the best hardware to meet the user’s need to balance performance and costs to run a given type of workload. Through a process of AI-driven filtering and scoring, the CRE assesses which chips and architectures are most effective for a given application and computing task. This provides a near-instantaneous predictive benchmark and AI-powered recommendation that makes it easy for organizations to pick the right HPC architecture for their particular needs.
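The CRE's actual models and scoring are not described in this post. As a rough illustration of what a generic filter-then-score ranking can look like, here is a minimal sketch. The `Candidate` fields, the memory constraint, the `cost_weight` parameter, and all numbers are hypothetical; in practice the predicted runtimes would come from a trained model rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str               # hypothetical hardware label
    predicted_hours: float  # predicted runtime for the workload (assumed given)
    price_per_hour: float   # hypothetical price for the whole cluster
    memory_gb: int


def recommend(candidates, min_memory_gb, cost_weight=0.5):
    """Filter out infeasible options, then rank the rest (best first)."""
    feasible = [c for c in candidates if c.memory_gb >= min_memory_gb]
    if not feasible:
        return []

    # Normalize against the best feasible value so both terms are unitless.
    best_hours = min(c.predicted_hours for c in feasible)
    best_cost = min(c.predicted_hours * c.price_per_hour for c in feasible)

    def score(c):
        cost = c.predicted_hours * c.price_per_hour
        time_term = c.predicted_hours / best_hours
        cost_term = cost / best_cost
        # Lower is better; cost_weight trades runtime against total cost.
        return (1 - cost_weight) * time_term + cost_weight * cost_term

    return sorted(feasible, key=score)


candidates = [
    Candidate("A", predicted_hours=10, price_per_hour=2, memory_gb=256),
    Candidate("B", predicted_hours=4, price_per_hour=8, memory_gb=512),
    Candidate("C", predicted_hours=3, price_per_hour=20, memory_gb=128),
]

balanced = recommend(candidates, min_memory_gb=200)
cheapest = recommend(candidates, min_memory_gb=200, cost_weight=1.0)
print([c.name for c in balanced])  # ['B', 'A']
print([c.name for c in cheapest])  # ['A', 'B']
```

Option C is filtered out for lacking memory, and shifting `cost_weight` changes which of the remaining options ranks first, mirroring the balance between performance and cost described above.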

Rescale’s Compute Recommendation Engine uses the intelligence of full-stack cloud platform data to power automated guidance and insights to help organizations control costs, boost engineering productivity, and accelerate innovation.

Rescale Compute Recommendation Engine Performance

The CRE has proven through internal testing and validation to be more than 95 percent accurate in identifying the optimal chips and full-stack configurations for a given computational task, ensuring the most efficient scaling and costs. The recommender system can also be leveraged to identify key bottlenecks (including network message passing, storage, and memory performance) in the overall HPC architecture in a data-driven and automated way. As an added benefit, output from the recommender system can give application developers feedback on potential improvements to their code architecture.

Following the recommendations of the CRE, our customers have accelerated simulations by over 200 percent and reduced costs by over 60 percent on key benchmarks. Not only are these benefits experienced immediately, but leveraging the CRE will also keep customers up to date with the optimal recommendations as new technologies and optimizations in the stack become possible.

The CRE provides an initial chip architecture recommendation based on the application the customer wants to run and other metadata from the workflow. Then, based on the performance of the job during runtime, the CRE will train on this metadata set and fine-tune its recommendations. Critically, the CRE will oversee jobs continuously and provide additional recommendations, monitoring the market on a continuous basis to know when new chip architectures become available.

The recommendations provide guidance in numerous ways for how to optimize workload performance. For example, a recommendation may be to reduce the number of computer cores used, as job metadata metrics indicate an inefficiently scaling job. Alternative hardware architectures with lower interconnect latency may be recommended when message passing metadata indicates a networking bottleneck.
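The two examples above can be sketched as simple rules over job metrics. This is only an illustration of the idea, not the CRE's logic: it uses the standard strong-scaling efficiency formula, and the thresholds, the `mpi_time_fraction` input, and the sample timings are all assumptions made for the example.

```python
def parallel_efficiency(base_cores, base_seconds, cores, seconds):
    """Strong-scaling efficiency relative to a baseline run (1.0 = ideal)."""
    speedup = base_seconds / seconds
    return speedup / (cores / base_cores)


def advise(runs, mpi_time_fraction, min_efficiency=0.7, max_mpi_fraction=0.3):
    """Return advice strings for a job.

    runs: dict mapping core count -> wall-clock seconds for the same job.
    mpi_time_fraction: share of runtime spent in message passing (0 to 1).
    """
    advice = []
    base = min(runs)     # smallest run is the efficiency baseline
    current = max(runs)  # treat the largest run as the current setting

    def eff(n):
        return parallel_efficiency(base, runs[base], n, runs[n])

    if eff(current) < min_efficiency:
        # Largest measured core count that still scales acceptably.
        acceptable = [n for n in runs if eff(n) >= min_efficiency]
        advice.append(f"reduce cores from {current} to {max(acceptable)}")

    if mpi_time_fraction > max_mpi_fraction:
        advice.append("consider hardware with a lower-latency interconnect")

    return advice


# Hypothetical timings: doubling from 64 to 128 cores barely helps.
runs = {32: 1000.0, 64: 540.0, 128: 400.0}
result = advise(runs, mpi_time_fraction=0.45)
print(result)
```

With these made-up timings, efficiency at 128 cores is 0.625, below the assumed 0.7 threshold, so the sketch suggests dropping to 64 cores and also flags the interconnect.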

Using the Rescale Compute Recommendation Engine

The intelligence of the CRE technology is already available to all Rescale customers today through AI-Recommended hardware architectures as part of the Compute Optimizer feature, and embedded recommendations are coming to Performance Profiles soon.

Compute Optimizer eliminates the need for benchmarking and automatically keeps the recommendations for workloads dynamically updated, providing significant productivity, cost, and time benefits. These automated recommendations are always available to every user on the Rescale platform.

With just a couple of clicks in the Performance Profiles dashboard, users can benchmark the performance of any chip architecture with their own specific computing workload. Performance Profiles makes it easy for customers to assess the detailed cost, performance, and energy consumption of any HPC architecture customized to a specific workload.

Performance Profiles and Compute Optimizer are just two examples of how elements of the CRE are making it much easier for organizations to manage their engineering and scientific computing resources.

Over time, we will continuously improve the CRE functionality with richer data analysis and greater automation of the recommendation process. For example, the CRE could be used to instantly conduct an audit across months of historical data to measure improvement opportunities. Or it could be used to forecast technology curve improvements (based on historical trends) to guide future infrastructure planning choices.

The metadata that informs the CRE updates dynamically as new hardware architectures, software applications, middleware configurations, and workload use cases are added and run on the platform. New forms of data, such as metadata related to energy use, carbon footprint, and other sustainability metrics, can be added to help organizations make more holistic compute architecture decisions.

Accelerated Computing is Fueling Engineering and Scientific Breakthroughs

Rescale’s Compute Recommendation Engine was born from our recognition that optimizing the computing stack has profound implications for digital R&D. With the exponential growth of specialized semiconductors and accelerated computing choices, organizations can reap unprecedented and continued benefits from cloud-based engineering and scientific computing.

Rescale is committed to evolving the intelligence of our platform to provide the best computing capabilities for our customers and empower engineering and science teams to help them invent the future.

Learn more about how the Rescale platform’s unique intelligence can help your research and engineering teams power their innovation efforts.

Joris Poort

Joris is CEO and is responsible for leading the management team at Rescale. Prior to founding Rescale, Joris worked for McKinsey & Company on product development engagements in the high-tech sector. Joris began his career at Boeing, where he worked for four years as a structural and software engineer on the 787 program, optimizing the design of the tail and wings. Joris holds an M.B.A. with distinction from Harvard Business School, an M.S. in Aeronautics and Astronautics magna cum laude from the University of Washington, and a B.S. in Mechanical Engineering and minor in Applied Mathematics magna cum laude from the University of Michigan.

Adam McKenzie

As CTO, Adam is responsible for managing the HPC and customer success teams. Adam began his career at Boeing, where he spent seven years working on the 787, managing structural and software engineering projects designing, analyzing, and optimizing the wing. Adam holds a B.S. in Mechanical Engineering cum laude from Oregon State University.

Chris Langel

Chris Langel is a senior HPC engineering manager at Rescale focusing on CAE application performance across various cloud architectures. He previously worked at Siemens Gamesa Renewable Energy in the aerodynamics group on CFD analysis and optimization of wind turbine blades. He holds a Ph.D. in mechanical and aerospace engineering from the University of California, Davis and a B.S. in mechanical engineering from the University of California, Berkeley.
