TE Connectivity – Customer Success (FAQs and Practical Information)

What happens when a job is running but a simulation is hanging?

In some cases, an application fails without crashing. The simulation solver binary remains active and holds on to compute resources, but the simulation itself is no longer progressing. This is called an “Idle Job” or “Idle Cluster”.

An idle job event occurs when all of these conditions are met:

  • Average CPU utilization is less than 5% over 1200 seconds (20 min)
  • Total inbound network traffic is 1 MB or less over 20 min
  • Total outbound network traffic is 1 MB or less over 20 min
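
The three conditions can be read as a single check over one 20-minute window. The following is a minimal sketch, not the platform's actual implementation: the `WindowMetrics` type and `is_idle` function are hypothetical names, and treating 1 MB as an upper bound (and as 1,000,000 bytes) is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the conditions listed above.
WINDOW_SECONDS = 1200            # 20 min observation window
CPU_IDLE_PERCENT = 5.0           # average CPU utilization threshold
NETWORK_IDLE_BYTES = 1_000_000   # 1 MB, assumed here to be an upper bound


@dataclass
class WindowMetrics:
    """Hypothetical summary of one observation window for a cluster."""
    duration_seconds: int
    avg_cpu_percent: float
    inbound_bytes: int
    outbound_bytes: int


def is_idle(m: WindowMetrics) -> bool:
    """Return True only when every idle condition holds for a full window."""
    return (
        m.duration_seconds >= WINDOW_SECONDS
        and m.avg_cpu_percent < CPU_IDLE_PERCENT
        and m.inbound_bytes <= NETWORK_IDLE_BYTES
        and m.outbound_bytes <= NETWORK_IDLE_BYTES
    )


hung = WindowMetrics(1200, 1.2, 40_000, 35_000)   # quiet CPU and network
busy = WindowMetrics(1200, 87.0, 40_000, 35_000)  # solver still computing
print(is_idle(hung), is_idle(busy))
```

Because the conditions are combined with `and`, a job that is quiet on the network but still using CPU is not reported as idle.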

In this situation the user must end the job manually. A typical sequence of events is:

  1. The user receives a notification by email that the cluster is idle. The notification provides the job name and a link to the job status page.

  2. The user verifies that the application has failed by hanging. This can be confirmed on the job Status page, where the job runtime may appear longer than expected, or by looking for application errors in the job “process_output.log”, the application logs, or the standard output files. Use the “Live Tailing” feature or the in-browser SSH terminal to access these files.

  3. The user manually terminates the job from the platform.
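
The log check in the steps above can also be done with a small script once the log files are available, for example from the in-browser SSH terminal. This is a rough sketch only: the `find_error_lines` helper and the keyword list are assumptions, since each solver words its failures differently.

```python
import re
from typing import Iterable, List, Tuple

# Assumed keywords; real solvers use their own wording for failures.
ERROR_PATTERN = re.compile(
    r"\b(error|fatal|aborted|segmentation fault)\b", re.IGNORECASE
)


def find_error_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) pairs for lines that look like errors."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ERROR_PATTERN.search(line)
    ]


# Made-up log excerpt for illustration only.
sample_log = [
    "Iteration 120 residual 1.2e-04\n",
    "FATAL: could not check out license feature\n",
    "Waiting for solver process...\n",
]
for number, text in find_error_lines(sample_log):
    print(f"{number}: {text}")
```

To scan a real file, pass an open file object instead of the sample list, for example `find_error_lines(open("process_output.log"))`.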

Further troubleshooting of the application failure is the user's responsibility. If required, Rescale support can provide guidance or help collect the information needed to get support from the application software vendor.

Things to try:

  • Resubmit the job
  • Run the simulation in an interactive session (see the documentation for End-to-End desktops or Workstations)
  • Retry the job with a different, newer version of the application

What are Capacity errors and what do they look like?

Capacity is the amount of compute and storage resources available across multiple cloud service provider (CSP) regions. It is determined from resource availability and region health data, and it is shared by all users of the Rescale platform.

Each CSP region’s available capacity and current health state strongly affect job waiting time and success rate. If capacity runs low, a job may have to wait a long time before resources become available, or could even lose hardware after execution starts.

When capacity is low, the Job Logs show the error message “There is insufficient capacity of your requested core-type from the cloud provider. Retrying, resubmitting request to provider.”

The job will be re-attempted up to 5 times over the next few minutes. If capacity opens up, the job will continue to run normally. 

If the capacity problem persists, then the job will be kept in the queue until capacity becomes available or the user cancels the job submission.
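
The retry-then-queue behavior described above can be illustrated with a short simulation. This is a sketch under stated assumptions, not the platform's scheduler: `submit` and `try_provision` are hypothetical names, the wait between attempts is omitted because the text does not specify it, and reading "re-attempted up to 5 times" as one initial request plus five retries is an interpretation.

```python
from typing import Callable

MAX_RETRIES = 5  # from the text: the request is re-attempted up to 5 times


def submit(try_provision: Callable[[], bool]) -> str:
    """Return "running" if hardware is obtained, otherwise "queued"."""
    if try_provision():              # initial request
        return "running"
    for _ in range(MAX_RETRIES):     # re-attempts after a capacity error
        if try_provision():
            return "running"
    return "queued"                  # waits for capacity or user cancellation


# Simulated provider: capacity opens up on the third request.
responses = iter([False, False, True])
print(submit(lambda: next(responses)))  # running

# Simulated provider that never has capacity.
print(submit(lambda: False))            # queued
```

The "queued" result corresponds to the job staying in the queue until capacity becomes available or the user cancels the submission.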

If you cancel the job, consider resubmitting it with a different core type, with On-Demand Priority, or with fewer cores.