UiPath Automation Cloud services are deployed globally across multiple regions. UiPath has its own identity service, which issues authentication tokens. Like the other cloud services, the identity service is available in multiple regions.
Each time there is a code release for our cloud services, we follow safe deployment practices (SDP). We introduce changes gradually, balancing the release's exposure against its proven performance. Once a release has proven itself in production, it is rolled out to progressively broader tiers of audiences until everyone is using it. SDP also helps protect against retry storms by limiting the rate of requests.
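As a rough illustration of what a tiered rollout can look like, here is a minimal sketch in Python. The ring names, bake times, and health-check logic are hypothetical and do not reflect UiPath's actual configuration or tooling.

```python
# Minimal sketch of a ring-based (tiered) rollout in the spirit of safe
# deployment practices. All names, tiers, and timings are hypothetical.
from dataclasses import dataclass


@dataclass
class Ring:
    name: str        # audience tier, e.g. internal dev, community, enterprise regions
    bake_hours: int  # how long the release must run cleanly before promotion


# Hypothetical rollout order: smallest blast radius first.
RINGS = [
    Ring("development", 24),
    Ring("community", 48),
    Ring("enterprise-region-1", 72),
    Ring("enterprise-remaining-regions", 72),
]


def roll_out(release: str, is_healthy) -> bool:
    """Promote `release` ring by ring; halt on the first unhealthy tier."""
    for ring in RINGS:
        print(f"Deploying {release} to {ring.name}, baking for {ring.bake_hours}h")
        if not is_healthy(release, ring):
            print(f"Health check failed in {ring.name}; halting rollout")
            return False
    return True


# Example: a release that fails in the community ring never reaches enterprise tiers.
roll_out("2024.3.1", lambda rel, ring: ring.name != "community")
```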
On March 13, 2024, starting at 13:04 UTC, enterprise customers in the European region had issues connecting to the UiPath Automation Cloud. The incident had three phases:
1. All new token refreshes started working successfully from 14:07 UTC onwards.
2. All new token refreshes, including those from IPs that had previously received errors, started succeeding from 14:50 UTC onwards.
3. Some customers saw HTTP 400 errors from 14:07 UTC until they signed in to their robots again.
This happened because auth tokens expired during the outage and could not be refreshed automatically, so users had to sign in again to get new tokens.
Customers who still see errors may need to log back in to attended robots and portal pages, because their previous login sessions may have expired.
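The behavior described above, an automatic refresh failing and the client falling back to an interactive sign-in, can be sketched as follows. This is a generic illustration of refresh-token handling, not UiPath client code; the function names and error shapes are assumptions.

```python
# Generic sketch of refresh-token handling with a fallback to interactive
# sign-in. Function names and error handling are illustrative assumptions.
import time


class TokenRefreshError(Exception):
    """Raised when the identity service cannot refresh the access token."""


def get_access_token(session) -> str:
    """Return a valid access token, refreshing or re-authenticating as needed."""
    if session["expires_at"] > time.time():
        return session["access_token"]          # token still valid, use it
    try:
        return refresh_token(session)           # normal path: silent refresh
    except TokenRefreshError:
        # During the incident, refreshes failed and tokens eventually expired,
        # so clients in this state had to fall back to an interactive sign-in.
        return interactive_sign_in(session)


def refresh_token(session) -> str:
    # Placeholder: a real client would call the identity service's token
    # endpoint with the stored refresh token.
    raise TokenRefreshError("identity service unavailable")


def interactive_sign_in(session) -> str:
    # Placeholder: a real client would prompt the user to sign in again.
    session["access_token"] = "new-token"
    session["expires_at"] = time.time() + 3600
    return session["access_token"]


print(get_access_token({"access_token": "old", "expires_at": 0}))
```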
The identity service had a new dependency, which caused this problem.
Following SDP, we introduced the changes gradually: starting with our development environments, then community, and then onward through the different regions. This approach lets us find as many issues as possible in a lower environment before they affect our critical customers.
However, the resource backing this new dependency did not have the right scaling configuration. The change worked fine across many regions for several days until March 13, when an increase in traffic caused the quota threshold to be reached, which led to this incident.
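To make the failure mode concrete, the toy simulation below shows how a dependency with a fixed capacity quota behaves as traffic grows: requests succeed at lower volumes, then start failing as soon as demand crosses the un-scaled threshold. The numbers are invented and are not the actual quota or traffic figures from this incident.

```python
# Toy simulation of a dependency whose capacity quota was never configured to
# scale. All numbers are invented and do not reflect the real quota or traffic.
QUOTA_PER_MINUTE = 1_000          # fixed ceiling on the new dependency

def handle_traffic(requests_per_minute: int) -> tuple[int, int]:
    """Return (succeeded, rejected) for one minute of traffic against the quota."""
    succeeded = min(requests_per_minute, QUOTA_PER_MINUTE)
    rejected = requests_per_minute - succeeded
    return succeeded, rejected

# Traffic grows day by day; everything looks healthy until the quota is crossed.
for day, traffic in enumerate([600, 750, 900, 1_200], start=1):
    ok, failed = handle_traffic(traffic)
    status = "healthy" if failed == 0 else f"{failed} requests failing"
    print(f"Day {day}: {traffic} req/min -> {status}")
```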
This failure also had a cascading effect: the affected clients all started trying to connect again, which triggered our rate-limiting policies for those specific IPs. Those requests received HTTP 429 errors, which limited any impact on other services and customers.
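A per-IP rate limit of the kind described here can be sketched with a simple fixed-window counter: once a client IP exceeds its budget, further requests get HTTP 429 instead of being passed on, which keeps a retry storm from one set of clients away from everyone else. The limit and window values below are illustrative, not the actual policy values.

```python
# Sketch of a per-IP fixed-window rate limiter that answers HTTP 429 once an
# IP exceeds its budget. Limit and window values are illustrative only.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100

_counters: dict[str, tuple[float, int]] = defaultdict(lambda: (0.0, 0))

def check_rate_limit(client_ip: str, now: float | None = None) -> int:
    """Return 200 if the request may proceed, or 429 if this IP is throttled."""
    now = time.time() if now is None else now
    window_start, count = _counters[client_ip]
    if now - window_start >= WINDOW_SECONDS:
        window_start, count = now, 0          # start a fresh window for this IP
    count += 1
    _counters[client_ip] = (window_start, count)
    return 200 if count <= MAX_REQUESTS_PER_WINDOW else 429

# A retry storm from one IP gets throttled without affecting other clients.
storm_ip, quiet_ip = "203.0.113.7", "198.51.100.9"
results = [check_rate_limit(storm_ip, now=0) for _ in range(150)]
print(results.count(429), "of 150 storm requests throttled")
print("quiet client:", check_rate_limit(quiet_ip, now=0))
```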
Our monitors automatically detected the problem and notified our disaster recovery team within a few minutes.
To mitigate the problem, we turned off the extra rights and temporarily raised the throttling limits, which let the retries drain.
We followed our standard procedures to fix the problem and clear the backlog of retries that had caused us to hit the throttling limits.
We're continuing to investigate this issue in more detail, and we'll share more about how we plan to fix it soon. But first, we're going to: