Rescale Service Level Agreement
*Effective from October 2nd, 2020
- Platform: The Rescale control plane through which the user can submit and monitor jobs, view results, and administer their resources.
- Core Type: The tuned hardware specification selected by the user to run a job on the Rescale ScaleX platform. A core type is a combination of CPU architecture, networking, RAM, disk, and system environment configuration. Different core types provide different capabilities, e.g., high interconnect, high memory, GPU.
- Wall time: Maximum duration set by the user, after which the job is forcefully terminated.
- Provider/CSP: Cloud Service Provider e.g., Amazon Web Services, Microsoft Azure, Google Cloud Platform.
- Runtime: Total compute time taken to execute the job submitted by the user, measured from the point after job validation when servers start running until successful completion. See Section 5 for a detailed definition.
- Covered Job: A job that complies with both the size and duration constraints set forth in the Service Level Definitions in Section 3, subject to the clarification and explanations in Section 5.
1. Platform Availability
- 99% availability
2. Job Performance
Running a job on the Rescale ScaleX platform requires four steps: uploading data, selecting software, choosing hardware parameters (core type, number of cores, duration), and executing the job. This document covers the service level considerations for running a job.
Rescale provides its customers with two hardware service levels for running jobs:
- On Demand (OD): For smaller jobs, including checkpoint restart-enabled jobs, to be completed at a lower cost, running on a core type of the customer’s choice.
- On Demand Pro (ODP): For larger jobs, selected to run on a particular named architecture, to be completed with less chance of interruption, at a higher price.
3. Service Level Definitions for On Demand and On Demand Pro
|Features||On Demand (OD)||On Demand PRO (ODP)|
|Overview||Economical, smaller scale||Robust, larger scale|
|Job Properties covered by SLA|
|Covered Job size (concurrent cores)||Up to 200 concurrent cores||Up to 2000 concurrent cores|
|Covered Job duration (wall time)||Up to 24 hrs without checkpoint restart||Up to 7 days or 168 hours|
|Service Level for the above job properties|
|Job Success Rate**||95% start within 30 minutes and complete without failure||99% start within 30 minutes and complete without failure|
- The Rescale SLA applies only to Covered Jobs satisfying the above job properties. Jobs not satisfying the Covered Job requirements will be covered by the CSP SLA.
- CSP SLA typically covers platform/service uptime. In addition, AWS covers compute with an “Hourly Uptime Percentage commitment of at least 90%” for individual instances.
- If a CSP makes a change to spot instance reclamation that further limits the maximum runtime, the aforementioned maximum job duration covered by the Rescale SLA will be adjusted to reflect the CSP limit.
- Rescale will attempt to restart Covered Jobs that fail (whether or not checkpoint-enabled), subject to the clarification and explanations in Section 5.
- Rescale will provide remediation subject to the calculations of Job Success Rate and Platform SLA set forth in Section 4, and subject to the clarification and explanations in Section 5.
- For Covered Jobs, if the job is terminated because the CSP reclaims a spot instance ("spot killed"), the job will be restarted and the customer will pay only for the final successful run. Rescale will cover the cost of the failed runs.
- For any job not in Covered Jobs, the customer will bear the full financial responsibility for the job irrespective of success or failure.
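As a minimal sketch of the size and duration limits in the Section 3 table, the check below classifies a job as a Covered Job or not. The `Job` record and `is_covered_job` function are hypothetical names used for illustration only and are not part of any Rescale API. The sketch applies the 24-hour OD limit to every OD job, which is a simplifying assumption (the table states it for jobs without checkpoint restart), and it does not model the exclusions listed in Section 5.

```python
from dataclasses import dataclass

# Limits copied from the Section 3 table; the dictionary layout is illustrative.
COVERED_LIMITS = {
    "OD": {"max_cores": 200, "max_wall_hours": 24},
    "ODP": {"max_cores": 2000, "max_wall_hours": 168},
}


@dataclass
class Job:
    """Hypothetical job record (not a Rescale API object)."""
    service_level: str  # "OD" or "ODP"
    cores: int          # concurrent cores requested
    wall_hours: float   # wall time entered by the user, in hours


def is_covered_job(job: Job) -> bool:
    """Return True if the job is within both the size and duration limits."""
    limits = COVERED_LIMITS[job.service_level]
    return (
        job.cores <= limits["max_cores"]
        and job.wall_hours <= limits["max_wall_hours"]
    )
```

A job that fails either limit falls outside the Rescale SLA and, per the bullets above, is covered by the CSP SLA instead.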
4. SLA Calculation and remediation
- Job Success Rate: To determine the Job Success Rate (OD or ODP) for a customer during a calendar quarter, the denominator is the number of Covered Jobs completed during the calendar quarter that the customer correctly initiated, while the numerator is the number of such jobs that started within 30 minutes after job validation (licensing, budget, and cloud provider resource checks) and completed without failure (excluding failures caused by the customer, for example improper job configuration or software bugs). (**)
- Platform SLA: To determine the Platform SLA for a customer during a calendar quarter, the denominator is the total amount of time during a calendar quarter, while the numerator is the amount of such time during which the customer can access the platform and start jobs (excluding lack of access caused by the customer). (*)
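The two ratios above can be sketched as follows. This is an illustrative calculation only: the function names and input shapes are assumptions, and the inputs are assumed to already be restricted to Covered Jobs that the customer correctly initiated, with customer-caused failures and customer-caused access loss filtered out beforehand.

```python
def job_success_rate(covered_jobs):
    """Percentage of Covered Jobs that started within 30 minutes of
    validation and completed without failure.

    covered_jobs: iterable of (start_delay_minutes, completed_without_failure).
    Returns None when there are no Covered Jobs in the quarter.
    """
    jobs = list(covered_jobs)
    if not jobs:
        return None  # rate is undefined with an empty denominator
    successes = sum(1 for delay, ok in jobs if delay <= 30 and ok)
    return 100.0 * successes / len(jobs)


def platform_availability(total_minutes, unavailable_minutes):
    """Percentage of the quarter during which the customer could access
    the platform and start jobs."""
    return 100.0 * (total_minutes - unavailable_minutes) / total_minutes
```

For instance, four Covered Jobs of which one started late and one failed give a Job Success Rate of 50%, and 10 minutes of unavailability out of 1000 give a platform availability of 99%.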
4.2 What happens when Rescale does not meet the Job Success Rate?
If the Job Success Rate for Covered Jobs (i.e., jobs within the SLA) has not been met in a calendar quarter, the customer is entitled to compensation in the form of Rescale hardware credit equal to:
HW Credit = (Committed Success Rate % - Actual Success Rate %) * Net Quarterly Spend (in that category)
For example, for OD jobs, the committed success rate is 95%. The below table shows example credit as a percentage:
|Actual Success Rate %||Credit %|
A customer can request an HW credit based on the actual success rate, capped at 10% of the total quarterly net spend in the category (i.e., OD or ODP).
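The credit formula and its 10% cap can be worked through with a short sketch. The function name `hw_credit` and the spend figure in the usage lines are hypothetical and chosen only to illustrate the arithmetic; treating a met or exceeded commitment as zero credit is an assumption consistent with the formula rather than a stated rule.

```python
def hw_credit(committed_pct, actual_pct, net_quarterly_spend, cap_pct=10.0):
    """Hardware credit for one category (OD or ODP) in one quarter.

    Credit % = committed % - actual %, floored at 0 and capped at cap_pct.
    """
    shortfall_pct = max(committed_pct - actual_pct, 0.0)
    credit_pct = min(shortfall_pct, cap_pct)
    return credit_pct * net_quarterly_spend / 100.0


# OD commitment of 95%, actual 92%, illustrative net spend of 10,000:
print(hw_credit(95.0, 92.0, 10_000.0))  # 300.0 (3% of spend)

# A 15-point shortfall is limited by the 10% cap:
print(hw_credit(95.0, 80.0, 10_000.0))  # 1000.0 (10% of spend)
```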
5. Clarification and explanations
A. Job Performance
- The following cases are not covered under the Rescale SLA and do not qualify for SLA remediation calculation; they are subject to the pass-through CSP SLA.
- For jobs launched through a customer's “bring your own cloud” agreement, the SLA offered by the cloud provider replaces the Rescale SLAs
- For customers who bring their own software license, the Rescale SLAs do not cover user managed connections to on-premises license servers or issues arising from connectivity with on-premises resources
- Core types whose maturity, capacity and robustness are yet to be confirmed (i.e. GA-4 and Beta) or ongoing service issues for either OD or ODP
- Time to result (run time) is the duration a job consumes compute resources.
- Does not include data transfer time; defined as the time required for a user to upload or download their inputs or outputs to the Rescale ScaleX platform
- Does not include job validation time; defined as the amount of time a job spends queued due to administrator-imposed, software licensing, or cloud service provider resource constraints
- Does not include the queue time for a job when the customer reaches the concurrent core quota
- On Demand Pro 30 minute job startup time applies to the Linux OS. On Demand Pro startup time for Windows is 30 minutes.
- Job success rate definition:
- Failures covered include
- Unhandled Rescale platform errors
- CSP hardware failure
- ISV software installation error
- Job interruptions due to provider hardware reclamation and provider downtime
- Failures exclude
- Large jobs outside Covered job properties defined in section 3 unless explicitly agreed with customer
- Budget exhaustion and configuration errors including hardware sizing
- Wall time misconfiguration
- Software licensing issues, including the expiration of customer licenses hosted on Rescale
- License file renewals and updates must be provided to Rescale at least 5 business days in advance for installation
- Custom software installations (any functionality not explicitly defined by customer provided acceptance tests), user-provided binaries, user-built functions and routines (including shell scripts, Java macros, and UDFs)
- Application failures due to command errors, solution divergence, running out of physical memory, segmentation faults, kernel panics/errors, and/or application-based defects such as memory leaks
- Application usage issues due to model configuration, solver parameters, convergence criteria, model decomposition, or over parallelization.
- Access delay and load due to the customer environment, including customer hosted licenses, VPN, SSO, Squid, and other networking/authentication issues managed by the customer
- Incorrect specification of maximum job wall-time (execution time) or hardware requirements
- Suspension of services due to natural disasters or other force majeure
- Planned maintenance notified ahead of time
- Queuing of jobs due to unavailability of CSP spot capacity
- The 95% On Demand job completion commitment applies to all jobs, while the commitment that 95% of On Demand jobs start within 30 minutes applies only to AWS core types in the US region
- Rescale guarantees performance up to 1000 concurrent cores for non-beta core types
- A job that is restarted and completes is a successful job. For Covered Jobs, customers will be charged for the runtime only once, i.e.:
- for jobs without checkpoint/restarting, it will be the runtime of one contiguous job and
- for jobs with checkpoint/restarting, it will be the sum of runtime till checkpoint and remaining runtime post restart
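The charging rule for restarted Covered Jobs can be sketched as below. The function and parameter names are hypothetical; the sketch assumes the runtime of failed attempts has already been discarded, since Rescale covers the cost of the failed runs.

```python
def billable_runtime_hours(successful_run_hours, checkpointed_hours=0.0):
    """Runtime charged to the customer for a restarted Covered Job.

    Without checkpoint/restart: leave checkpointed_hours at 0 and pass the
    runtime of the one contiguous successful run.
    With checkpoint/restart: pass the runtime up to the checkpoint as
    checkpointed_hours and the remaining runtime after the restart as
    successful_run_hours.
    """
    return checkpointed_hours + successful_run_hours
```

Under these assumptions, a 10-hour job that is interrupted and rerun from scratch is charged 10 hours, and a checkpointed job that ran 6 hours to its checkpoint and 4 hours after the restart is also charged 10 hours.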
Platform availability refers to the ability to access the platform and submit jobs. If users are unable to submit jobs, this counts against the Platform Availability SLA and not the Job Success Rate.