Serverless Robots is experiencing downtime
Incident Report for UiPath
Postmortem

Root Cause Analysis – UiPath Serverless Outage on June 30, 2022

Customer impact

On June 30, 2022, from 15:37 UTC until 17:17 UTC, all Serverless jobs failed to execute. The problem started in Australia and Canada but quickly became a global outage.

Root Cause

The outage was caused by an integration bug between two UiPath infrastructure services. In early June, both services were updated to simplify their interaction. The update introduced a breaking change to the API between the two services, so the updated services had to be rolled out in a specific order. The rollout was done in the correct order in our testing, dogfood, and community environments. In our enterprise environment, however, the deployment of one of the services was delayed by one day because of an investigation into an unrelated issue. The engineers who decided on the delay were not aware of the rollout dependency, so the other service’s deployment went forward as scheduled and caused the outage.
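
To illustrate the kind of ordering dependency described above, the following is a minimal sketch of a pre-deployment check that refuses to roll out a service until the versions it depends on are present in the target environment. It is not UiPath's release tooling: the service names, version numbers, and the `Rollout` and `unmet_dependencies` names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """A planned deployment (hypothetical structure for illustration)."""
    service: str
    version: str
    # Maps a dependency's service name to the minimum version it must run.
    requires: dict = field(default_factory=dict)


def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '2.0.1' into (2, 0, 1) for comparison."""
    return tuple(int(part) for part in version.split("."))


def unmet_dependencies(rollout: Rollout, deployed: dict) -> list:
    """Return a description of every dependency the environment does not satisfy.

    `deployed` maps service name to the version currently running in the
    target environment. An empty result means the rollout may proceed.
    """
    unmet = []
    for dependency, minimum in rollout.requires.items():
        current = deployed.get(dependency)
        if current is None or parse_version(current) < parse_version(minimum):
            unmet.append(f"{dependency} >= {minimum} (found {current})")
    return unmet


# Hypothetical scenario: service_b 2.0.0 needs service_a 2.0.0, but the
# service_a deployment to this environment was delayed.
enterprise = {"service_a": "1.4.0"}
plan = Rollout("service_b", "2.0.0", requires={"service_a": "2.0.0"})
print(unmet_dependencies(plan, enterprise))
```

A check like this only helps if the dependency is declared alongside the change, so that delaying one deployment automatically blocks the other instead of relying on the engineers involved knowing about the ordering.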

Detection & Response

Our automated alerting system immediately detected the issue and notified the on-call engineers. They quickly determined that the issue was caused by a downstream dependency, triggered our ‘major incident response’ process, and pulled in representatives from each of the impacted teams. They quickly identified the root cause and manually rolled back the problematic change.

Follow Up

UiPath has a policy of version-tolerant deployments to avoid exactly this sort of issue. In a globally distributed system, version mismatches commonly occur during slow rollouts and during rollbacks, so all servers should tolerate such mismatches. In this case, that policy was not correctly applied. We are investigating why it was overlooked and how we can improve our release automation system to better enforce this policy and catch issues before they make it into production.
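
As a rough illustration of what version tolerance can look like in practice, the sketch below shows a receiver that accepts both an older and a newer shape of the same request, so that either side of an API can be deployed or rolled back first. The report does not describe the actual API involved; the payload fields and the `parse_job_request` function are assumptions made up for this example.

```python
def parse_job_request(payload: dict) -> dict:
    """Normalize a job request from either API version (hypothetical shapes).

    Old shape (assumed): {"job_id": "...", "region": "..."}
    New shape (assumed): {"job": {"id": "...", "region": "..."}}
    """
    if "job" in payload:
        # Newer callers nest the fields under "job".
        job = payload["job"]
        return {"job_id": job["id"], "region": job.get("region", "unknown")}
    if "job_id" in payload:
        # Older callers send flat fields; keep accepting them until every
        # environment has finished rolling out the new version.
        return {"job_id": payload["job_id"], "region": payload.get("region", "unknown")}
    raise ValueError("unrecognized job request shape")


# Both shapes normalize to the same internal representation.
old_style = parse_job_request({"job_id": "j-1", "region": "au"})
new_style = parse_job_request({"job": {"id": "j-1", "region": "au"}})
print(old_style == new_style)
```

With a receiver like this deployed first, the sender can switch to the new shape at any later time, and a rollback of either service does not break the other.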

Posted Jul 05, 2022 - 21:57 UTC

Resolved
This incident has been resolved.
Posted Jun 30, 2022 - 17:24 UTC
Update
Serverless Robots are failing in all regions. We believe we have found the root cause and are currently working on a fix.
Posted Jun 30, 2022 - 17:07 UTC
Update
We are continuing to investigate this issue.
Posted Jun 30, 2022 - 16:42 UTC
Investigating
We are currently investigating this issue.
Posted Jun 30, 2022 - 16:38 UTC
This incident affected: Serverless Robots.