In this guide we will show you some common sources of errors encountered by users running jobs in batch on the Rescale platform. We will try to show you how to diagnose some of these errors. We will also discuss ways on how to avoid and correct them.
Job Status page
Review the output on the Job Status page
Examine the job history on the Gantt chart
Does the command properly pass the validation step (Validating Input) with a green check?
Review the output on the Job Status page
Examine the job history on the Gantt chart
Does the command properly pass the validation step (Validating Input) with a green check? Are there error messages present in the Job Logs section?
Are there error messages present in the Job Logs section?
Results page
process_output.log
The first step we would always suggest for every job, successful or otherwise is to review the process_output.log file. This file logs the standard output from the software analysis method you are running. It will also log any potential error messages.
To find the process_output.log file go to the Job Results page
Query for the file in the search bar
Usually log or process is a sufficient search term
View the process_output.log file using the screen icon in the Actions column
If the file is too large, you may first need to download the file before viewing it in a text editor
Carefully review this log file and look for warning or error messages
Most of the time, the error can be identified here
Exit Codes
One key field to look for is the “exit code” at the end of the process_output.log file. If an analysis method runs smoothly and exits cleanly without posting any error messages it should produce:
Exit with code 0
While a job may complete with code 0, this simply means that the process ran without producing any errors. This of course does not guarantee that the job ran as you intended. When the program does encounter an explicit system error (i.e. out of memory, core dump, out of disk space, etc), the process will produce a non-zero exit code. Common exit codes you may encounter:
Exit Code
Meaning
1
Catchall for general errors
2
Misuse of shell built-ins
126
Command invoked cannot execute
127
Command not found
128
Invalid argument to exit
128+n
Fatal error signal “n”
130
Script terminated by Ctrl-C
137
Undetermined exit mode, including explicit termination of the process or out of memory
255*
Exit status out of range
These error codes, of course, are not the most instructive, however, they provide a starting point for debugging.
Basic Debugging Steps
While there are a wide variety of failure mechanisms, some of the most common issues as well as steps to diagnose and avoid them are listed below.
Missing input files
Ensure that all of the required files have been included in the job either individually or in a compressed (zip, tarball, etc) input file deck.
Incorrect file paths
Ensure that the compressed input file deck will expand to the proper directory paths
Use relative file paths in scripts, input files and other job definitions
Rescale will execute the software Command(s) indicated on the Software Settings page in the same working directory where the compressed files will be unpacked
The Rescale platform will assume that the prepared input files are packaged at the top directory level
zip/tar/etc the input files at the directory level where the command will be executed
Or, if a case subdirectory is used (i.e. run01_configB), ensure that you preface the analysis software command with the proper change in directory such as: cd run01_configB && run_analysis
Note that this is not the preferred workflow on the Rescale platform
Multi-node jobs that require access to a common file system
For most analysis methods where the head process handles the file I/O and communication with the worker process, Rescale by default places user specified input files into ~/work. Some methods, however, launch worker processes on nodes that also require access to a shared file system.
On the Rescale platform, the ~/work/shared directory is NFS mounted to all of the compute nodes in your job
Rescale has identified most of these analysis methods and by default start jobs in the ~/work/shared directory
However, occasionally, due to a runtime customization or option, the worker processes running on the nodes will require access to input files, load runtime libraries, or writing output files
The Command on the Software Settings page should be prepended with move and change directory commands:
mv * shared
cd shared
<run_analysis>
Error reading input files
Ensure that the input files are properly constructed as expected by your analysis software
Ensure that the proper software Version is selected on the “Software Settings” dropdown page
Ensure that text input files are in the proper format
Batch compute nodes are generally Linux machines. Depending on the type of text editor you use, sometimes end of line/end of file characters are encoded differently
Often times Windows text editors will result in files with the ^M newline character that Linux does not use
To replace this using a text editor such as VI/VIM you can replace these characters with the following command: :%s/^M$//
Note: ^M is entered as ctrl-V and ctrl-M
Examine other log files from your analysis method
While the Rescale Platform will output standard output messages to process_output.log, some analysis methods will print critical information to other log files
These output files will usually have file extensions of ‘log’, ‘out’, ‘live’, or ‘dat’, but may vary depending on the analysis method. Please refer to the software vendor’s documentation
These log files will also generally be ASCII text files, so you can view them using the small screen icon in the right hand column next to the filename
As with the process_output.log file, if it is too large, download them to your local workstation and view them using a text editor
Missing library files
Ensure that the process has proper access and path definitions to any custom library files used in the job
Rescale Support may have to install additional libraries for your application
Please notify Rescale Support if you encounter a message like this
Out of system resources
Check that your simulation process has sufficient physical memory and storage during runtime
Some codes will change their memory footprint size during runtime depending on the analysis, so sufficient memory at start-up may not persist
Some codes will also generate a large amount of scratch data files that could bloat the storage footprint beyond that of the final output files
Refer to the “Cluster Status” on the bottom of the Job Status page for monitors of free memory and disk space
Reduce the size of your mesh/simulation to see if the job can run successfully
Run on more cores and/or nodes
Select specialty core-types with more physical memory or storage
Proper license access
Ensure that your license settings on the Software Settings page are properly defined. Generally, these are in the format of port@hostname.
Ensure that you are checking out features on your license file, as seen from the process_output.log
Check if the right options to check out the features are used with command you are trying to execute.
If you are the person responsible for your license server:
Check that your license has not expired
Check that your license server is up and has network access
Before diving in with a production sized run, set up a small test case that exercises your workflow
Check that your pre- and post-processing steps are properly integrated into your Analysis Options > Command
Run a test job interactively
Replace the existing Command with sleep 3600
ssh into the compute node once it starts
Change into the appropriate directory for your analysis method (~/work or ~/work/shared)
Attempt to launch the job interactively
Record all of the commands that produce a successful outcome
Modify your Command(s) accordingly
The Command input window will accept line breaks, ; or && marks to separate commands
Note: commands separated by && will only execute if the previous completes with an code 0, while those following ; will always execute after the previous
If these common debugging steps still do not resolve your problems, please contact and share your job with Rescale Support.
We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept”, you consent to the use of ALL the cookies. Do not sell my personal information.
This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.