TE Connectivity – Customer Success (FAQs and Practical Information)
What happens when a job is running but a simulation is hanging?
In some cases, an application will fail without crashing. In other words, the simulation solver binary remains active, holding up compute resources, but not actually running the simulation. This is called an “Idle Job” or “Idle Cluster”.
An idle job event occurs when all of the following conditions are met:
- Average CPU utilization is less than 5% for 1200 seconds (20 min)
- Total inbound network traffic is less than 1 MB for 20 min
- Total outbound network traffic is less than 1 MB for 20 min
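The three conditions above can be read as a single rule evaluated over a 20-minute window. The following is a minimal sketch of that rule, not the platform's actual implementation. The names `WindowSample` and `is_idle` are hypothetical, the network figure is assumed to be an upper bound (less than 1 MB), and 1 MB is assumed to mean 1,000,000 bytes.

```python
from dataclasses import dataclass
from typing import Sequence

# Thresholds taken from the conditions listed above.
WINDOW_SECONDS = 1200                 # 20-minute observation window
CPU_IDLE_THRESHOLD_PCT = 5.0          # average CPU utilization must be below this
NETWORK_IDLE_THRESHOLD_BYTES = 1_000_000  # assumption: 1 MB = 10^6 bytes, read as an upper bound


@dataclass
class WindowSample:
    """One hypothetical metrics sample taken inside the 20-minute window."""
    cpu_percent: float  # CPU utilization at sample time, 0-100
    bytes_in: int       # inbound network bytes since the previous sample
    bytes_out: int      # outbound network bytes since the previous sample


def is_idle(samples: Sequence[WindowSample]) -> bool:
    """Return True only if ALL three idle conditions hold for the window."""
    if not samples:
        # No data means no evidence of idleness.
        return False
    avg_cpu = sum(s.cpu_percent for s in samples) / len(samples)
    total_in = sum(s.bytes_in for s in samples)
    total_out = sum(s.bytes_out for s in samples)
    return (
        avg_cpu < CPU_IDLE_THRESHOLD_PCT
        and total_in < NETWORK_IDLE_THRESHOLD_BYTES
        and total_out < NETWORK_IDLE_THRESHOLD_BYTES
    )
```

Because the conditions are combined with "and", a job that uses little CPU but still moves data over the network (for example, while transferring files) would not be flagged as idle under this reading.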
In this situation, the user must end the job manually. A typical sequence of events is:
- The user receives a notification by email that the cluster is idle. The notification provides the job name and a link to the job status page.
- The user verifies that the application has hung. Signs of a hang can be found on the job Status page: the job runtime may be longer than expected, or application errors may appear in the job “process_output.log”, in the application logs, or in the standard output files. Use the “Live Tailing” feature or the in-browser SSH terminal to access these files.
- The user manually terminates the job from the platform.
Further troubleshooting of the application failure is the user's responsibility. If required, Rescale support can provide guidance or help collect the information needed to get support from the application software vendor.
Things to try:
- Resubmit the job.
- Run the simulation in an interactive session (see the documentation for End-to-End desktops or Workstations).
- Retry the job with a different, newer version of the application.
What are Capacity errors and what do they look like?
Capacity refers to the amount of compute and storage resources available across multiple cloud service provider (CSP) regions. It is determined from resource availability and region health data, and it is shared by all users of the Rescale platform.
Each CSP region’s available capacity and current health state strongly affect job waiting time and success rate. If capacity runs low, a job may have to wait a long time before resources become available, or could even lose hardware after execution starts.
When capacity is low, the Job Logs show the error message “There is insufficient capacity of your requested core-type from the cloud provider. Retrying, resubmitting request to provider.”
The job will be re-attempted up to 5 times over the next few minutes. If capacity opens up, the job will continue to run normally.
If the capacity problem persists, then the job will be kept in the queue until capacity becomes available or the user cancels the job submission.
If you cancel the job, consider resubmitting it with a different core type, with On-Demand Priority, or with fewer cores.