Automation Cloud in Europe is unavailable
Incident Report for UiPath
Postmortem

Background

UiPath Automation Cloud services are deployed globally across multiple regions. UiPath operates its own identity service, which issues authentication tokens; like other cloud services, the identity service is available in multiple regions.

Each time we release code for cloud services, we follow safe deployment practices (SDP): we introduce changes gradually, balancing a release's exposure against its proven performance. Once a release has proven itself in production, it becomes available to progressively broader audiences until everyone is using it. SDP also helps protect against retry storms by limiting the rate of requests.
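
To make the ring progression concrete, here is a minimal sketch in Python. The ring names and bake times are hypothetical and do not reflect UiPath's actual SDP configuration.

    # Hypothetical sketch of ring-based gradual rollout; the ring names
    # and bake times are illustrative, not UiPath's actual SDP settings.
    from dataclasses import dataclass

    @dataclass
    class Ring:
        name: str
        bake_hours: int  # how long a release must run cleanly here

    # A release flows through progressively broader audiences.
    RINGS = [
        Ring("development", 24),
        Ring("community", 48),
        Ring("first-production-region", 72),
        Ring("remaining-regions", 72),
    ]

    def next_ring(current: int, healthy: bool) -> int:
        """Advance to the next ring only if the release proved healthy;
        otherwise roll back and restart from the first ring after a fix."""
        if not healthy:
            return 0
        return min(current + 1, len(RINGS) - 1)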

Customer impact

Enterprise customers in the European region had issues connecting to UiPath Automation Cloud on March 13, 2024, starting at 13:04 UTC. The incident had three phases:

  • Between 13:04 UTC and 14:07 UTC, all requests failed with an HTTP 500 error. This happened because one of our essential background services was not working.

    All new token refreshes started succeeding from 14:07 UTC onwards.

  • Between 14:07 UTC and 14:50 UTC, some customers received HTTP 429 errors. This happened because there were too many retries, so our network layer started throttling the offending sources to manage the load.

    All new token refreshes, even from IPs that may previously have received errors, started succeeding from 14:50 UTC onwards.

  • Some customers saw HTTP 400 errors from 14:07 UTC until they signed their robots in again.

    This happened because auth tokens expired during the outage and could not be refreshed automatically, so users had to sign in again to obtain new tokens; a short sketch of this fallback follows below.

Customers who saw these errors may have needed to sign back in to attended robots and portal pages, because their previous sign-in sessions had expired.
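
The third phase can be made concrete with a minimal client-side sketch: try to refresh the token, and fall back to interactive sign-in when the server rejects the expired refresh token with HTTP 400. The endpoint URL and helper names below are placeholders, not UiPath's actual identity API.

    # Minimal sketch of the client-side behavior described above. The
    # token endpoint URL and helper names are hypothetical placeholders,
    # not UiPath's actual identity API.
    import requests

    TOKEN_URL = "https://identity.example.com/connect/token"  # placeholder

    def interactive_sign_in() -> str:
        # Stand-in for a real browser-based sign-in flow.
        raise RuntimeError("session expired: user must sign in again")

    def get_access_token(refresh_token: str, client_id: str) -> str:
        resp = requests.post(TOKEN_URL, data={
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            "client_id": client_id,
        })
        if resp.status_code == 400:
            # The refresh token expired during the outage, so automatic
            # renewal fails and the user must authenticate interactively.
            return interactive_sign_in()
        resp.raise_for_status()
        return resp.json()["access_token"]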

Root cause

The root cause was a new dependency that had recently been introduced into the identity service.

We introduced the change gradually, following SDP: first in our development environments, then in community, and then onward through the different regions. This approach lets us find as many issues as possible in lower environments before they affect our critical customers.

However, the new dependency did not have the right scaling configuration. The change worked fine across many regions for several days, until an increase in traffic on March 13 pushed the resource past its quota threshold, which led to this incident.
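
A simplified illustration of this failure mode, with entirely made-up numbers: a capped scaling configuration leaves enough headroom for typical traffic, so nothing fails until a spike exceeds the quota.

    # Illustrative only: why a mis-scaled dependency can work for days
    # and then fail under a traffic spike. All numbers here are made up;
    # the report does not disclose the actual quotas involved.
    MAX_INSTANCES = 2         # scaling configuration capped too low
    REQS_PER_INSTANCE = 500   # sustainable requests per second, per instance

    def within_capacity(requests_per_second: float) -> bool:
        return requests_per_second <= MAX_INSTANCES * REQS_PER_INSTANCE

    assert within_capacity(800)         # typical traffic: no symptoms
    assert not within_capacity(1200)    # traffic spike: quota exceeded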

The failure also had a cascading effect: all clients started retrying their connections, which triggered our rate-limiting policies for those specific IPs. Those requests were answered with HTTP 429, which limited the impact on other services and customers.
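
For illustration, per-IP throttling of this kind is commonly implemented as a token bucket; the sketch below uses hypothetical rate and burst values, not our actual policy.

    # Sketch of per-IP throttling using a token bucket. The rate and
    # burst values are hypothetical, not UiPath's actual policy.
    import time
    from collections import defaultdict

    RATE = 10.0   # tokens refilled per second, per IP
    BURST = 20.0  # maximum bucket size

    # ip -> (tokens remaining, time of last refill)
    _buckets = defaultdict(lambda: (BURST, time.monotonic()))

    def allow(ip: str) -> bool:
        """Return True to serve the request, False to answer HTTP 429."""
        tokens, last = _buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill
        if tokens < 1.0:
            _buckets[ip] = (tokens, now)
            return False  # this IP is retrying too aggressively
        _buckets[ip] = (tokens - 1.0, now)
        return True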

Detection

Our automated monitors detected the problem and alerted our disaster recovery team within a few minutes.
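
As a rough sketch of how such a monitor can work (the window and threshold below are illustrative, not our production values), an alert fires when the share of server errors in a sliding window crosses a threshold:

    # Hypothetical sketch of an automated error-rate monitor. The
    # window and threshold values are illustrative.
    from collections import deque

    WINDOW_SECONDS = 300
    ERROR_THRESHOLD = 0.05  # alert above 5% server errors

    class ErrorRateMonitor:
        def __init__(self) -> None:
            self._samples: deque[tuple[float, bool]] = deque()

        def record(self, timestamp: float, status: int) -> None:
            self._samples.append((timestamp, status >= 500))
            # Drop samples that have fallen out of the sliding window.
            while self._samples and self._samples[0][0] < timestamp - WINDOW_SECONDS:
                self._samples.popleft()

        def should_alert(self) -> bool:
            if not self._samples:
                return False
            errors = sum(1 for _, is_error in self._samples if is_error)
            return errors / len(self._samples) > ERROR_THRESHOLD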

Response

To mitigate the problem, we turned off the new dependency and temporarily raised the throttling limits, which let the queued retries drain.

We then followed our standard procedures to fix the underlying problem and clear the backlog that had caused us to hit the throttling limits.
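
For context, retries drain rather than amplify when clients back off politely. The sketch below shows a generic retry policy with exponential backoff, jitter, and Retry-After handling; it is illustrative, not a description of UiPath client behavior.

    # Sketch of a well-behaved retry policy: with exponential backoff,
    # jitter, and respect for Retry-After, queued retries drain instead
    # of amplifying into a retry storm. Names and limits are illustrative.
    import random
    import time
    import requests

    def get_with_backoff(url: str, max_attempts: int = 6) -> requests.Response:
        delay = 1.0
        for _ in range(max_attempts):
            resp = requests.get(url)
            if resp.status_code not in (429, 500, 502, 503):
                return resp
            # Honor the server's Retry-After header when throttled;
            # otherwise back off exponentially with random jitter.
            retry_after = resp.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else delay * random.uniform(0.5, 1.5))
            delay = min(delay * 2.0, 60.0)
        return resp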

Follow-ups

We're continuing to investigate this issue in more detail, and we'll share more about how we plan to fix it soon. In the meantime, we're going to:

  • Improve our failure drill process for new components in test environments to catch these kinds of issues earlier.
  • Reduce the time it takes to find the root cause of these types of failures.
  • Review our network throttling policies to make sure they're set correctly.
  • Look at ways to make our authentication and refresh token processes more resilient.
Posted Mar 14, 2024 - 01:08 UTC

Resolved
The issue has been resolved. If any clients are still experiencing issues, they may need to sign in again or restart their browser.
This issue was caused by a failure of one of our critical resources to auto-scale. As our traffic increased, it became overloaded, which prevented users and robots from authenticating. The issue was isolated to customers in the European region.

We will publish a postmortem with more details as soon as possible.
Posted Mar 13, 2024 - 15:18 UTC
Monitoring
A fix has been implemented to mitigate the issue. We are currently monitoring the system.
Posted Mar 13, 2024 - 14:21 UTC
Update
Only customers in the European region are impacted by this incident.
Posted Mar 13, 2024 - 13:58 UTC
Investigating
Automation Cloud is down due to a rate-limiting issue. We are currently investigating.
Posted Mar 13, 2024 - 13:30 UTC
This incident affected: Automation Cloud.