Quick Tip: Compress Your Output Files

One of the key challenges with cloud HPC is minimizing the amount of data that needs to be transferred between on-premise machines and machines in the cloud. Unlike traditional on-premise systems, this transfer occurs over a much slower and less reliable Wide Area Network. As we’ve touched on previously, the best thing to do is perform post-processing remotely and avoid transferring data unnecessarily.
That said, a common scenario for many users is to run a simulation and then transfer all of the output files from the job back to their workstation.
After a job has completed, each file in the working directory is encrypted and uploaded to cloud storage. This provides flexibility for users that only need to download a small subset of the output files to their machine. However the tradeoff is that each file introduces additional overhead in the transfer. When transferring data over a network, the more data that can be packed into a single file, the better. Further, many engineering codes will emit files that are highly compressible. Although compressing a file takes extra time, this can still be a net win if the time spent compressing plus transferring a smaller file is less than the time spent uploading the larger, uncompressed archive. Even if the compression and ensuing transfer takes longer overall, the real bottleneck in the overall transfer process is going to be the last hop between cloud storage and the user’s workstation. Having a smaller compressed file to transfer here can make an enormous difference depending on the user’s Internet connection speed.
If you know beforehand that you will need to download all of the output files for a job, then in general it is best to generate a single compressed archive file first instead of transferring each file individually. The linux tar command provides an easy way to create a compressed archive however it does not utilize the extra computing power available on the MPI cluster to generate the archive.
Jeff Gilchrist has developed an easy-to-use bz2 compressor that runs on MPI clusters (https://compression.ca/mpibzip2/). We compiled a Linux binary with a static bzip2 library reference and have made it available here for download to make it easier to incorporate it into your own jobs. The binary was built with the OpenMPI 1.6.4 mpic++ wrapper compiler. Please note that it may need to recompiled depending on the MPI flavor that you are using.
To use it, upload the mpibzip2 executable as an additional input file on your job. Then, the following commands should be appended to the end of the analysis command on the job settings page.
tar cf files.tar –exclude=mpibzip2 *
mpirun -np 16 mpibzip2 -v files.tar
find ! -name ‘files.tar.bz2’ -type f -exec rm -f {} +
First, a tar file is created called files.tar that contains everything except the parallel bzip utility. Then, we launch the mpibzip2 executable and generate a compressed archive called files.tar.bz2. Finally, all files except files.tar.bz2 are deleted. This prevents both individual files AND the compressed archive from being uploaded to cloud storage.
Note that the -np argument on the mpirun call should reflect the number of cores in the cluster. Here, the commands are being run on a 16 Nickel core cluster.
One additional thing to be aware of is that Windows does not support bz2 or tar files by default. 7-Zip can be installed to add support for this format along with many others.
As a quick test we built compressed archives from an OpenFOAM job that contained 2.1 GB worth of output data spread over 369 files and uploaded the resulting file to cloud storage.

As a baseline, we built an uncompressed tar file. We also tried creating a gzip compressed tar file using the -z flag with the tar command. Finally, we tried building a bz2 compressed archive with 8, 16, and 32 Nickel cores.
Not surprisingly, in the baseline case, building the archive takes a negligible amount of time and the majority of the overall time is spent uploading the larger file. When compressing the file, the overall time breakdown is flipped: The majority of the time is spent compressing the file instead. Also unsurprisingly, leveraging multiple cores provides a nice speedup over using the single-core gzip support that comes with the tar command. At around 16 cores, the overall time is roughly the same as the baseline case.
The real payoff for the compression step however will become evident when a user attempts to download the output to his or her local workstation as the compressed bz2 file is almost 5 times smaller than the uncompressed tar (439 MB vs 2.1 GB).
To reiterate, we believe that pushing as much of your post-processing and visualization into the cloud is the best way to minimize data transfer. However, for those cases where a large number of output files are needed, you can dramatically reduce your transfer times in many cases by spending a little bit of time preparing a compressed archive in advance. We plan on automating many of the manual steps described in this post and making this a more seamless process in the future. Stay tuned!

Ryan Kaneshiro

View all posts

Understanding How an Engineer Thinks

Adam McKenzie September 19, 2017January 25, 2023

When software engineers look at a piece of code, the first question they ask themselves is “what is it doing?” The next question is “why…

English

Budget Your Simulation Costs on Rescale

Irwen Song December 14, 2015January 25, 2023

The Budget featureWhen you run simulations on Rescale, you may want to control your budgets and spending. For example, an individual user might want to…

English

Data Management in Engineering Simulations

Robert Combier September 30, 2013March 22, 2023

As we noted in an earlier blog post about our Cloning feature (https://rescale.com/blog/cloning-on-rescale/), engineers and scientists spend significant time managing their simulation data. This is…

English | Thought Leadership

Big Compute Podcast – Rethinking HPC in Academia

Robert Combier March 4, 2019January 25, 2023

In this Big Compute Podcast episode, host Gabriel Broner interviews Marek Michalewicz, Director of ICM, the HPC center at the University of Warsaw. With the…

English

COMSOL Multiphysics® Now Available on Rescale’s Cloud

Robert Combier January 6, 2016January 25, 2023

San Francisco, CA – Rescale is excited to announce the availability of COMSOL Multiphysics® software on Rescale simulation platforms through a collaboration with COMSOL, creators…

English

Addressing the Cloud Cynic

Robert Combier October 1, 2018October 25, 2023

The Greek philosopher Diogenes kept dogs that served as emblems of his cynic philosophy. San Francisco, Calif., October 1, 2018 – I believe balance should be a…

Cookie	Duration	Description
AWSALBCORS	7 days	This cookie is managed by Amazon Web Services and is used for load balancing.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
player	1 year	Vimeo uses this cookie to save the user's preferences when playing embedded videos from Vimeo.

Cookie	Duration	Description
AWSALB	7 days	AWSALB is an application load balancer cookie set by Amazon Web Services to map the session to the target.
sync_active	never	This cookie is set by Vimeo and contains data on the visitor's video-content preferences, so that the website remembers parameters such as preferred volume or video quality.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_UA-32985745-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
utm_campaign	past	Google Ad Services sets this cookie to store session campaign value if present.
utm_content	past	This cookie is used for storing the session content value if present.
utm_source	past	This cookie is used to record from where the visitor came to the website orginally. This information is used by the website operator to know the efficiency of their marketing.
utm_term	past	This cookie is used to record from where the visitor came to the website orginally. This information is used by the website operator to know the efficiency of their marketing.
vuid	2 years	Vimeo installs this cookie to collect tracking information by setting a unique ID to embed videos to the website.

Cookie	Duration	Description
_fbp	3 months	This cookie is set by Facebook to display advertisements when either on Facebook or on a digital platform powered by Facebook advertising, after visiting the website.
_mkto_trk	2 years	This cookie, provided by Marketo, has information (such as a unique user ID) that is used to track the user's site usage. The cookies set by Marketo are readable only by Marketo.
fr	3 months	Facebook sets this cookie to show relevant advertisements to users by tracking user behaviour across the web, on sites that have Facebook pixel or Facebook social plugin.
IDE	1 year 24 days	Google DoubleClick IDE cookies are used to store information about how the user uses the website to present them with relevant ads and according to the user profile.
personalization_id	2 years	Twitter sets this cookie to integrate and share features for social media and also store information about how the user uses the website, for tracking and targeting.
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
utm_medium	past	This cookie is used to record from where the visitor came to the website orginally. This information is used by the website operator to know the efficiency of their marketing.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
_chtbl	session	No description available.
_dtses	30 minutes	No description available.
_dtuid	10 years	No description available.
BIGipServersj30web-nginx-app_https	session	No description
email	past	No description available.
gclid	past	No description
handl_ip	1 month	No description available.
handl_landing_page	1 month	No description available.
handl_original_ref	past	No description available.
handl_ref	past	No description available.
handl_url	1 month	No description available.
li_gc	2 years	No description
muc_ads	2 years	No description
username	past	No description available.

Rescale Platform

Overview

HPC & AI Software

HPC & AI Architectures

Security & Compliance

Ecosystem Integrations

Pricing

HPC as a Service

Intelligent Batch

Elastic Cloud Workstation

Storage Fabric

Enterprise Management

Multi-Team Management

Performance Management

Software Publisher

Digital Engineering

AI Physics

Knowledge Management

Computational Pipelines

Author

Similar Posts

Newsletter Sign Up

Rescale Platform

Overview

HPC & AI Software

HPC & AI Architectures

Security & Compliance

Ecosystem Integrations

Pricing

HPC as a Service

Intelligent Batch

Elastic Cloud Workstation

Storage Fabric

Enterprise Management

Multi-Team Management

Performance Management

Software Publisher

Digital Engineering

AI Physics

Knowledge Management

Computational Pipelines