
Key Tips for Managing High Performance Computing Systems

To successfully run HPC batch jobs, teams need to plan for scheduling, security, troubleshooting, and new cloud requirements

Rescale’s engineering team is dedicated to solving the complexities of managing high performance computing (HPC) systems in this era of hybrid and multi-cloud computing.

Fundamental to HPC for R&D is creating and managing a compute job to carry out a digital simulation or other kind of analysis. So a key area of focus for the Rescale engineering team is in automating many of the tasks required to successfully set up and run a simulation job or other big compute task.

In this second of my two-part blog post series (read part one: “Best Practices for Running HPC Batch Jobs”), I will discuss some of the broader management considerations for HPC batch jobs, including scheduling, security, troubleshooting, and the growing need to understand the specific requirements for cloud HPC.

Juggling HPC Batch Jobs

So to run batch processing, you need to set up your hardware, configure your network, and install and configure your software, whether in the cloud or on-premises. The processes are different, but both require HPC expertise to do properly.

If you’re relatively familiar with the requirements of a particular batch job, you can typically get all that done in a few hours. But that’s rarely the case, because you’re generally not setting up the same kind of batch job every time. You’ll need to configure your compute environment for the specific application: some tasks need high throughput, others need more parallelization, and so on.

Each HPC batch job is its own journey and requires addressing all hardware and software components to ensure the system is optimized for a given workload. This can mean starting from scratch to build a new type of HPC job, and it can require some lessons learned as you go along.

And then the next task is to set up a way to schedule all these batch jobs so people with higher-priority jobs can meet their deadlines. And when dealing with an ongoing flow of HPC batch jobs, there’s a whole other level of configuration and provisioning you have to do.
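As a rough illustration of what that can look like in practice, many teams lean on a workload manager such as Slurm and map priority tiers onto QOS levels. The sketch below assumes a Slurm cluster where an administrator has already defined “standard” and “urgent” QOS levels; the partition name, node counts, and script path are placeholders.

```python
# Minimal sketch: submit a batch job under a priority tier via Slurm's sbatch.
# Assumes a Slurm cluster where an admin has defined "standard" and "urgent"
# QOS levels; partition name, node counts, and script paths are placeholders.
import subprocess

def submit_job(script_path, qos="standard", partition="compute", nodes=4, walltime="04:00:00"):
    cmd = [
        "sbatch",
        f"--qos={qos}",             # priority tier, as configured by the admin
        f"--partition={partition}",
        f"--nodes={nodes}",
        f"--time={walltime}",
        script_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout.strip())    # e.g. "Submitted batch job 12345"

# A deadline-critical run gets the higher tier; routine jobs stay on "standard".
submit_job("cfd_run.sbatch", qos="urgent", nodes=16, walltime="12:00:00")
```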

You also have to set up and support new versions of software, depending on how many different applications your R&D team uses. There’s significant maintenance and tuning for new applications to make sure those work like they should on your clustered hardware.

Ideally, you should also be refreshing your hardware periodically, whether in the cloud or on-premises. Every month there are new chips entering the market—there’s always a faster race car coming out. You might best benefit from a new Arm-based CPU to capture greater energy efficiency, or you might need pure parallelization power with GPUs. So that’s another thing you have to maintain and manage.

And then there’s just the ongoing maintenance of the system. Your scheduler’s going to get into a bad state sometimes and you have to go in and fix it. There’s a lot of maintenance pieces for both cloud-based and on-premises HPC even after you get everything set up and running the way you want.
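For example, with Slurm a common symptom is nodes left in a drained state after a failure. The sketch below is a minimal illustration, assuming Slurm’s sinfo is available on the login node: it just lists drained or down nodes and the reason the scheduler recorded, so an operator can decide what to fix before resuming them.

```python
# Minimal sketch: list Slurm nodes stuck in a drained/down state and the reason
# the scheduler recorded, so an operator can investigate before resuming them.
# Assumes Slurm's sinfo is installed on the node where this runs.
import subprocess

def drained_nodes():
    # "%n %T %E" -> node hostname, extended state, reason; --noheader drops the header row
    out = subprocess.run(
        ["sinfo", "--noheader", "--format=%n %T %E"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        name, state, *reason = line.split(maxsplit=2)
        if state.startswith(("drain", "down")):
            print(f"{name}: {state} ({reason[0] if reason else 'no reason recorded'})")

drained_nodes()
# After fixing the underlying issue, an admin would typically run:
#   scontrol update nodename=<node> state=resume
```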

The Costs of Faulty HPC Batch Jobs

If you don’t have your computing environment set up correctly, and you have faults in the system, this could result in your job not completing or the incorrect completion of your simulations or other analysis. 

That has dramatic implications for whatever product you’re designing. If you’re getting bad data out of the system, especially bad data that isn’t flagged as an error, that could be a major issue for product development or regulatory compliance. You might build a product with a flaw you never realized was there.

Also, if there’s a failure and the software does call it out, you still lose all that work. Then you have to fix it and run your simulation again. And if it was a hardware fault, that’s even more frustrating, because those errors can be very tricky to pin down. The same goes for inter-node communications. Often a team can’t find the fault, reruns the simulation, and watches it break again, cycling through an expensive and time-consuming troubleshooting process.

So, overall, if your HPC batch jobs are not set up correctly, you could end up with failed or inaccurate jobs that can cost time, money, or put your company at risk.

One example of this came from one of our cloud provider partners. They didn’t have a consistent firmware version on some of their InfiniBand switches, so the networking library would randomly fail 48 hours into a batch job on that particular compute cluster.

That kind of switch firmware sits pretty far down in the stack. You can’t just look at the output of your application and discern that the fault is the network firmware. If you see those faults, you just know the job failed. But a lot of the time you have no idea where in the stack the problem is happening, because of all the layers in an HPC system.

So there’s the problem of debugging, which can be very time-intensive. And maybe it isn’t a big deal if it just happens once, but if you have faults happening a few times a day for different workloads, that’s potentially a lot of simulation data that gets thrown out, along with a lot of time lost.
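One cheap sanity check is to compare what the InfiniBand adapters (HCAs) on each node report for their firmware and flag any mismatch. The sketch below is a rough illustration, assuming passwordless SSH to a set of placeholder hostnames and that ibstat is installed on each node; switch firmware itself, as in the example above, generally needs the fabric vendor’s management tools, so a check like this only covers part of the stack.

```python
# Rough sketch: flag InfiniBand adapter (HCA) firmware mismatches across nodes.
# Assumes passwordless SSH to each node and that `ibstat` is installed there.
# Switch firmware usually needs the fabric vendor's management tools instead.
import subprocess

NODES = ["node001", "node002", "node003"]  # placeholder hostnames

def hca_firmware(node):
    out = subprocess.run(
        ["ssh", node, "ibstat"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if "Firmware version:" in line:
            return line.split(":", 1)[1].strip()
    return "unknown"

versions = {node: hca_firmware(node) for node in NODES}
if len(set(versions.values())) > 1:
    print("WARNING: inconsistent HCA firmware across nodes:", versions)
else:
    print("HCA firmware consistent:", set(versions.values()).pop())
```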

To ensure your HPC jobs run dependably (on-premises and in the cloud), you really need a team of HPC experts covering networking, systems administration, storage, data center management, and complex application maintenance. That’s a lot of technical resources to keep your HPC systems reliable and efficient, but that’s what best-practice HPC management requires if you do it yourself.

Security for Your R&D Data

Security, of course, is paramount for HPC. HPC systems typically house an organization’s most sensitive design and product information.

How you manage security depends on how open you want your compute environment to be within your organization. You have to consider different types of user access, making sure that simulations and data are easily available to authorized users while ensuring they don’t migrate or “leak” to other parts of the organization, or even outside it.

And that all has to be set up and maintained properly at the file system level or wherever you store your simulation data. So you need to know how to set up and administer a shared file system in a multi-user environment securely. That’s yet another skill set you need on your HPC team. 
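As one small illustration of that skill set, the sketch below creates a shared project directory owned by a single POSIX group, sets the setgid bit so new files inherit that group, and blocks everyone else. The path and group name are placeholders, and a real deployment would typically layer ACLs, quotas, and encryption on top.

```python
# Minimal sketch: lock a shared simulation directory down to one project group.
# Group name and path are placeholders; real setups typically add ACLs, quotas,
# and encryption on top of plain POSIX permissions.
import grp
import os
import stat

PROJECT_DIR = "/shared/projects/wing_design"   # hypothetical shared FS path
PROJECT_GROUP = "aero_team"                    # hypothetical POSIX group

os.makedirs(PROJECT_DIR, exist_ok=True)
gid = grp.getgrnam(PROJECT_GROUP).gr_gid
os.chown(PROJECT_DIR, -1, gid)                 # keep the owner, set the project group

# rwx for owner and group, nothing for others; setgid so new files inherit the group
os.chmod(PROJECT_DIR, stat.S_IRWXU | stat.S_IRWXG | stat.S_ISGID)
```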

Multi-Cloud Management

The cloud, of course, addresses the biggest issue of traditional on-premises HPC data centers by providing nearly limitless on-demand high performance compute capacity. But multi-cloud HPC computing brings new and equally challenging technical complications.

Your HPC team really needs to know how to manage infrastructure-as-code on different cloud providers. The cloud providers vary quite a bit in how you need to interact with their interfaces and configurations.

Each cloud provider has a different way you orchestrate compute to build out a cluster to support a batch job and ensure low latency in the network fabric connecting nodes. There’s a whole different set of configurations for each CSP, in most cases.

A lot of this is because low-latency networking is still a fairly niche area. And because cloud HPC is so new, there are no strong standards across providers. Each cloud provider does things differently, because they’re still figuring out the complex world of HPC computing. So you have to know how to ask each cloud provider for the right configurations through its APIs or SDKs to get optimal performance from the hardware.

For example, AWS provides something called EFA (Elastic Fabric Adapter). That’s AWS’s in-house solution for low-latency networking on its computing infrastructure. Azure supports InfiniBand, which is an HPC industry-standard technology, but it’s also virtualized.
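As a rough illustration of the AWS side, the sketch below uses the boto3 SDK to create a cluster placement group (to keep nodes physically close) and launch instances with an EFA network interface attached. The AMI, subnet, security group, and instance type are placeholders, and the equivalent on Azure would be a completely different call sequence against its own SDK.

```python
# Rough sketch (AWS only): launch HPC nodes close together with an EFA interface.
# AMI, subnet, security group, and instance type are placeholders; other
# providers (e.g. Azure with InfiniBand-enabled VM sizes) use different SDKs.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Cluster placement groups pack instances onto nearby hardware for low latency.
ec2.create_placement_group(GroupName="hpc-cluster-pg", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",          # placeholder HPC-ready AMI
    InstanceType="c5n.18xlarge",              # an EFA-capable instance type
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "hpc-cluster-pg"},
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",               # request the Elastic Fabric Adapter
        "SubnetId": "subnet-0123456789abcdef0",
        "Groups": ["sg-0123456789abcdef0"],
    }],
)
```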

So if you want to run HPC workloads on both AWS and Azure, you have to figure out how to provision the nodes to take best advantage of those different networking technologies. And once you have connected your cluster of nodes over each fabric, you must know how to configure your MPI libraries to take advantage of each type of network.
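As a rough sketch of what that MPI configuration step can look like, assuming an MPI stack built with libfabric for AWS EFA and with UCX for Azure InfiniBand, the launcher below sets commonly used fabric-selection environment variables before calling mpirun. The exact variables depend heavily on the MPI build in use, so treat these as illustrative rather than a definitive recipe.

```python
# Rough sketch: point the MPI stack at the right fabric before launching a job.
# Assumes an MPI build using libfabric (for AWS EFA) and UCX (for Azure
# InfiniBand); the exact variables depend heavily on the MPI build in use.
import os
import subprocess

def launch_mpi(provider, ranks, binary):
    env = os.environ.copy()
    if provider == "aws":
        env["FI_PROVIDER"] = "efa"           # tell libfabric to use the EFA provider
    elif provider == "azure":
        env["UCX_NET_DEVICES"] = "mlx5_0:1"  # bind UCX to the InfiniBand HCA (placeholder device)
    subprocess.run(["mpirun", "-np", str(ranks), binary], env=env, check=True)

launch_mpi("aws", ranks=72, binary="./solver")
```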

Beyond the network fabric, you also need to configure other parts of the stack sitting above the hardware itself.

Then there’s the challenge of understanding which HPC cloud service provider in a region offers the best cost-to-performance trade-off. Even the time of day can make a big difference in the cost to run an HPC job. And for HPC supercomputing clusters, availability is not a given with cloud providers, especially for specialized infrastructure.
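On AWS, for example, you can at least look at how spot prices for an HPC instance type have moved over the past day before deciding when and where to launch. The sketch below is a minimal illustration using boto3; the instance type is a placeholder, and you would still need to compare on-demand and reserved pricing separately.

```python
# Minimal sketch: inspect recent spot prices for an HPC instance type on AWS,
# since the price (and availability) can swing with time of day and zone.
# Instance type is a placeholder; on-demand and reserved pricing need separate checks.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
history = ec2.describe_spot_price_history(
    InstanceTypes=["c5n.18xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
)
for point in history["SpotPriceHistory"]:
    print(point["AvailabilityZone"], point["Timestamp"], point["SpotPrice"])
```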

You also need to carefully monitor all your cloud service accounts. It is surprisingly easy to lose track of cloud resources and only find them at the end of the month when you get the bill for stuff you forgot to shut down. 
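A simple guardrail, sketched below for AWS with boto3, is a periodic script that flags instances running longer than some threshold so someone can confirm they are still needed. The 24-hour cutoff is an arbitrary placeholder to adapt to your own workloads.

```python
# Minimal sketch: flag EC2 instances that have been running longer than a
# threshold so forgotten clusters show up before the monthly bill does.
# The 24-hour threshold is an arbitrary placeholder to adapt per team.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
for page in pages:
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            if inst["LaunchTime"] < cutoff:
                name = next((t["Value"] for t in inst.get("Tags", []) if t["Key"] == "Name"), "unnamed")
                print(f"{inst['InstanceId']} ({name}) running since {inst['LaunchTime']}")
```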

Having visibility and insights across the entire marketplace of HPC cloud services as well as your own infrastructure ecosystem is essential for being a savvy shopper and controlling cloud costs to get the most out of your HPC investments.

While setting up and managing HPC batch jobs is far from simple, getting it right is critical. Supercomputing is now essential for powering a growing array of powerful digital modeling and simulation software that is virtualizing scientific research and engineering. Such digital R&D is now becoming the foundation for the future of innovation. The companies that master high performance computing will have a growing advantage in their product development efforts.

Learn more about Rescale’s Intelligent Batch capabilities
Ensure all your high performance computing jobs are set up the right way to run fast, efficiently, and dependably.

Author

  • Mark Whitney

    Mark Whitney is a director of engineering at Rescale. His areas of expertise include high performance computing architectures, quantum information research, and cloud computing. He holds a PhD in computer science from the University of California, Berkeley.
