
How to Increase HPC Performance While Reducing Costs and Lowering Energy Consumption

High performance computing is now a cornerstone of modern research and engineering. Organizations across industries are turning to digital modeling and simulation to shorten product development cycles. Particularly in engineering, electronic design automation (EDA) and the rapid expansion of the Industrial Internet of Things (IIoT) are driving HPC demand.

As companies turn to HPC for increasingly complex simulations and other tasks, they must also continue to control costs and reduce energy consumption.

What Is High Performance Computing (HPC)?

Compared to general-purpose computing, HPC offers greater throughput to process complex computational problems at extremely high speeds. HPC systems include three primary components: compute, network, and storage. They aggregate computing power through massively parallel processing. 

HPC clusters consist of a large number of servers connected in a network. Each component computer is considered a “node.” HPC systems often contain 16 to 64 nodes with two CPUs per node. 
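To make that scale concrete, here is a minimal back-of-the-envelope sketch in Python of a cluster's theoretical peak performance. The node count, core counts, clock speed, and FLOPs-per-cycle figures are illustrative assumptions, not measurements of any specific system.

```python
# Back-of-the-envelope estimate of a cluster's theoretical peak performance.
# All hardware figures below are illustrative assumptions, not real benchmarks.

nodes = 64                  # servers ("nodes") in the cluster
cpus_per_node = 2           # sockets per node
cores_per_cpu = 32          # cores per socket
clock_ghz = 2.5             # base clock in GHz
flops_per_cycle = 16        # e.g., wide vector fused multiply-add, double precision

peak_flops = nodes * cpus_per_node * cores_per_cpu * clock_ghz * 1e9 * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e12:.1f} TFLOPS")
# ~163.8 TFLOPS for these assumed figures; sustained application performance
# is typically only a fraction of the theoretical peak.
```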

The need for high performance computing is driven by today’s increasingly sophisticated software and the massive data sets used in simulations and analysis. This software is used to improve product performance in diverse disciplines, including aircraft aerodynamics, autonomous driving, drug discovery, and weather modeling. For example, simulation software applications from Ansys, Siemens, Dassault, and Convergent Science rely on specialized HPC architectures to perform computational fluid dynamics for commercial planes, military aircraft, and spacecraft development.

R&D organizations also typically maintain broad portfolios of applications, including commercial, open-source, and home-grown codes. Ensuring they all run efficiently on HPC infrastructure is a challenge because each has different needs. At the same time, the independent software vendor (ISV) landscape continues to expand, further complicating how organizations support the use of advanced R&D software.

Specialized HPC Clusters

Given the growing variety of specialized HPC semiconductor chips, organizations have to consider many trade-offs between performance and cost for running their complex R&D applications.

Organizations take advantage of specialized HPC clusters to optimize workflows for specific kinds of applications and workloads.

Some tasks require more communication between nodes, as well as specialized hardware and software. A given workload’s computational requirements determine the number of nodes needed in the cluster. Some software and computational tasks also perform better with certain kinds of semiconductor chips. An automated benchmarking assessment tool like Rescale Performance Profiles can be incredibly helpful in matching the best chip architecture to a given computing task.
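The selection logic behind such a tool can be illustrated with a simple sketch. The architecture names, runtimes, and hourly prices below are hypothetical placeholders, not Rescale Performance Profiles data; the point is only that the fastest option is not always the cheapest, and vice versa.

```python
# Hypothetical benchmark results for one workload on several chip architectures.
# Runtimes and prices are made-up placeholders for illustration only.
benchmarks = {
    "arch_a": {"runtime_hours": 4.0, "price_per_hour": 3.00},
    "arch_b": {"runtime_hours": 2.5, "price_per_hour": 5.50},
    "arch_c": {"runtime_hours": 6.0, "price_per_hour": 1.75},
}

def job_cost(result):
    """Total cost of the job: runtime multiplied by hourly price."""
    return result["runtime_hours"] * result["price_per_hour"]

fastest = min(benchmarks, key=lambda a: benchmarks[a]["runtime_hours"])
cheapest = min(benchmarks, key=lambda a: job_cost(benchmarks[a]))

print(f"Fastest architecture:  {fastest}")   # arch_b
print(f"Cheapest architecture: {cheapest}")  # arch_c
# A real benchmarking tool weighs these trade-offs (and many more metrics)
# automatically for each application and workload.
```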

A high-performance interconnect for clusters addresses the need for low latency and high bandwidth. It tracks the workload and re-routes it as necessary. One way to deal with large data sets is to package HPC applications and run them across multiple clusters. A cluster manager runs capacity and health checks to find and use available resources.

Containerization

Some organizations are also addressing HPC workload management with GPU-optimized containers, which have become increasingly prevalent with AI deployments. The open-source Apptainer (formerly Singularity) is the most widely used container system for HPC; Shifter and Docker are other options. Containers allow for the seamless integration of leading AI applications, and containerized applications are more portable, making it possible to run in-house and commercial applications from anywhere.
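To illustrate that portability, here is a minimal Python sketch that launches a hypothetical solver packaged in an Apptainer image. The image name, solver command, and input file are placeholders; the idea is that the same image file can be copied to a workstation, an on-premises cluster, or a cloud node and run without reinstalling the application or its dependencies.

```python
import subprocess

# Run a hypothetical solver packaged in an Apptainer image.
# "solver.sif", "solver", and "case.json" are illustrative placeholders.
subprocess.run(
    ["apptainer", "exec", "solver.sif", "solver", "--input", "case.json"],
    check=True,  # raise an error if the containerized job fails
)
```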

Virtualization is an alternative to containerization. It creates a virtual environment on top of the host operating system. Virtual machines (VMs) run their own operating systems, allowing for complete isolation from one another. Hyper-V, vSphere, and OpenStack are some examples.

Why Is HPC Important?

HPC delivers critical information and analysis in far less time than traditional computing. The speed of HPC provides benefits to many roles, from engineers and data scientists to product designers and researchers. 

It also takes modeling and simulation (M&S) to an entirely new level. For example, higher-resolution models deliver more granular information about a new product, which then reduces or eliminates the need for prototypes and real-world testing. Think automotive crash simulations rather than actual crash tests, or training pilots on flight simulators rather than in real aircraft.

With cloud HPC, a wide array of enterprises can rapidly scale their computing needs on demand.

Some examples are:

  • Engineering firms
  • Research labs 
  • Financial technology (fintech)
  • Product development
  • Government and defense

Even startups and small businesses can take advantage of highly scalable cloud HPC.

Understanding HPC Performance

The pace of digital research and engineering is accelerating, making it critical for organizations to automate as much of their HPC provisioning cycle as possible, including selection of the right chip architecture for a specific application and compute task.

HPC optimization addresses the complexities of providing the right computing architecture for a given workload. It is also essential in making systems more energy efficient. An HPC workload is a data-intensive task spread across system resources located on-premises or in the cloud. 

Today’s HPC systems can handle incredible workloads, including AI, machine learning, and deep learning. They run millions of scenarios simultaneously while processing massive amounts of data.

Key Performance Metrics

Analysts measure the power of HPC systems in floating-point operations per second (FLOPS). The Frontier machine at Oak Ridge National Laboratory currently sits atop the TOP500 list of the most powerful supercomputers, delivering 1.102 Eflop/s (one exaflop is one quintillion floating-point operations per second).

Another key metric is power usage effectiveness (PUE), which measures the energy efficiency of the entire data center. You can calculate PUE by dividing the total power entering a data center by the power used to operate all the IT equipment. The closer the number is to 1.0, the better the overall efficiency. Another benchmarking standard is data center infrastructure efficiency (DCiE), the reciprocal of PUE: it is calculated by dividing IT equipment power by total facility power and is usually expressed as a percentage.
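As a quick worked example, the two calculations look like this in Python; the power figures are hypothetical.

```python
# Hypothetical facility power figures (in kilowatts) for illustration only.
total_facility_power_kw = 1200.0   # everything entering the data center
it_equipment_power_kw = 1000.0     # servers, storage, and network gear only

pue = total_facility_power_kw / it_equipment_power_kw    # lower is better; 1.0 is ideal
dcie = it_equipment_power_kw / total_facility_power_kw   # higher is better; shown as a %

print(f"PUE:  {pue:.2f}")    # 1.20
print(f"DCiE: {dcie:.0%}")   # 83%
```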

Finally, metrics are important, but only up to a point. Ultimately, users care most about the real-world performance that helps their computational jobs run faster. It can be difficult to fully assess HPC performance for all types of software and workloads. Some types of semiconductor chips work better on certain types of software than on others.

Computational Bottlenecks 

For some companies, on-premises infrastructure is itself a bottleneck. This type of infrastructure investment is typically sized for 100% utilization, so the instant demand exceeds supply, there’s a bottleneck because there is no spare capacity. By comparison, cloud HPC is elastic, scaling up and down as needs change. Organizations can simply subscribe for more compute power. As a result, HPC in the cloud delivers full utilization without running into the constraints of capped capacity.

Many other potential bottlenecks exist within HPC systems, including memory capacity, I/O throughput, and storage speed and capacity. CPU core counts, clock speeds, and cache sizes may also limit performance, as can network switch bandwidth.

Memory capacity is another issue, as higher data transfer rates mean more memory required for buffering and storage. Traditional DDR3, DDR4, and even DDR5 memory may become a bottleneck. However, high-bandwidth memory (HBM) is a possible solution, as it delivers eight times the bandwidth of DDR5 memory. 

To avoid bottlenecks, it is also important to align software specifications with HPC configurations that optimize performance.

HPC Energy Efficiency

Sustainability is an increasingly pressing need in HPC management. Being able to assess the most energy-efficient hardware is essential to controlling the carbon footprint of HPC operations.

The energy efficiency of HPC systems, measured in FLOPS per watt, continues to improve. One example is the Henri system at the Flatiron Institute in New York City, with an efficiency of 65.09 gigaflops per watt.
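To put that figure in perspective, a quick back-of-the-envelope calculation shows what such an efficiency would imply at exascale. The linear scaling below is an illustrative simplification that ignores real-world overheads like cooling and interconnect power.

```python
# Efficiency figure cited above: 65.09 gigaflops per watt.
gflops_per_watt = 65.09

# How much power would a 1-exaflop system draw at that efficiency?
exaflop = 1e18  # floating-point operations per second
power_watts = exaflop / (gflops_per_watt * 1e9)
print(f"Power at 1 EFLOPS: {power_watts / 1e6:.1f} MW")   # ~15.4 MW
```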

Data center operators continue to improve energy efficiency in diverse ways. For example, next-gen, low-power chipsets reduce energy consumption and are better at dissipating heat. Power-optimized IP cores also reduce energy use, as does moving data via high-bandwidth memory. Some operators have turned to alternative sustainability methods, like liquid cooling and heat recycling.

Data centers increasingly look to renewable energy sources like hydro, wind, solar, biomass, and green hydrogen. Significant progress is being made, as evidenced by the fact that data center power consumption leveled off at 191 terawatt-hours between 2015 and 2021. However, one-time migrations from on-premises data centers somewhat masked overall growth in HPC demand.

Responding to HPC Computing Demand

To meet demand, the industry is responding with more powerful machines than ever. Systems are moving from petaflop to exaflop capacity and beyond. A supercomputer with exaflop capability requires one second to perform one quintillion calculations. It would take more than 31 billion years to complete that number of calculations at just one calculation per second. 
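The arithmetic behind that comparison is straightforward:

```python
# An exaflop machine performs one quintillion (1e18) calculations per second.
calculations = 1e18
seconds_per_year = 365.25 * 24 * 3600   # ~3.156e7 seconds

# At one calculation per second, the same work would take:
years = calculations / seconds_per_year
print(f"{years / 1e9:.1f} billion years")   # ~31.7 billion years
```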

Innovations to expand HPC efficiency include new architectures and hardware. For example, 3D ICs and die-to-die connectivity address the latest performance requirements. And more flexible switching is possible when FPGA, GPU, CPU, and other processing architectures are integrated into a single node.

New hardware often favors cloud-based HPC. Accordingly, legacy on-premises data centers cannot always take advantage of more energy-efficient chipsets. Migration to the cloud is one way to effectively address the increasing demand for speed, scalability, and sustainability. 

However, a simple “lift and shift” migration from on-premises to the cloud does not always address a company’s emerging HPC needs. Legacy infrastructure sometimes can’t cope with changing business needs because it is typically on a four- to five-year refresh cycle, and such a lengthy cycle cannot keep up with rapid changes in the HPC ecosystem. Cloud adoption also offers a relatively high degree of financial flexibility: corporate HPC cost models shift from longer-term CapEx to shorter-term OpEx, so organizations don’t tie up as much capital and can better match cloud HPC cost models to current needs.

Key Takeaways

For those with on-premises data centers, migration to the cloud is an important way to increase energy efficiency while reducing costs. Cloud HPC offers businesses of all sizes a way to benefit from the latest technological advances. 

Optimizing HPC performance requires alignment between software specs and available hardware. Specialized HPC clusters and containerization can also increase HPC performance and energy efficiency.

As the use of AI proliferates, HPC systems will become even more energy efficient.

Learn More From Rescale

Discover how Rescale can help your organization control costs while powering greater innovation. With Performance Profiles, it is easy to identify the best cloud HPC architecture for your needs.

Learn more in our on-demand webinar “Optimize Workload Cost and Performance in the Cloud.”

Author

  • Garrett VanLee

Garrett VanLee leads Product Marketing at Rescale, where he works closely with customers on the cutting edge of innovation across industries. He enjoys sharing customer success stories, research breakthroughs, and best practices from Rescale engineers, scientists, and IT professionals to help other organizations. Garrett is currently focused on the convergence of supercomputing, HPC, and AI simulation models and how these trends are driving discoveries in science and industry.
