Multiple UiPath services impacted in North Europe region

Incident Report for UiPath

Postmortem

Customer impact

Between 02:58 UTC and 03:38 UTC on January 30, 2025, community customers (and a few enterprise organizations that started as community organizations) may have encountered errors when connecting to UiPath Automation Cloud.

Root cause

UiPath uses a static configuration to register internal service endpoints for service discovery. We rolled out a new version of this configuration, and an error in the configuration for the community ring caused our east-west API gateway to fail when routing internal traffic between services.

All UiPath changes are deployed according to our “Safe Deployment Practices.” We start with our development environments, then dogfood, then community, then onward through each region, and finally the FedRAMP and delayed regions. This ensures that we catch as many issues as possible in a lower environment before they become a problem for our customers. However, because this configuration registers internal services per ring, the issue did not surface in lower environments and was first detected in the first production ring.
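
The effect of per-ring configuration on a staged rollout can be illustrated with a minimal sketch. It is not UiPath's deployment tooling: the ring identifiers, the config shape (`services`, `routes`), and the functions `staged_rollout` and `routes_resolve` are all hypothetical, chosen only to show why an error that exists solely in one ring's configuration passes the earlier rings untouched.

```python
from typing import Callable, Dict, List, Optional

# Ring order as described in the paragraph above; identifiers are hypothetical.
RING_ORDER: List[str] = [
    "development",
    "dogfood",
    "community",
    "production_regions",
    "fedramp_and_delayed",
]


def staged_rollout(
    configs: Dict[str, dict],
    health_check: Callable[[str, dict], bool],
) -> Optional[str]:
    """Walk the rings in order and stop at the first ring whose check fails.

    Returns the name of the failing ring, or None if every ring passes.
    """
    for ring in RING_ORDER:
        if not health_check(ring, configs[ring]):
            return ring
    return None


def routes_resolve(ring: str, config: dict) -> bool:
    """Hypothetical check: every route must target a registered service."""
    return all(target in config["services"] for target in config["routes"].values())


# Each ring carries its own copy of the configuration (assumed shape).
configs = {
    ring: {"services": {"orchestrator"}, "routes": {"/orchestrator": "orchestrator"}}
    for ring in RING_ORDER
}
# An error present only in the community ring's configuration.
configs["community"] = {
    "services": {"orchestrator"},
    "routes": {"/orchestrator": "orchestratr"},
}

first_failure = staged_rollout(configs, routes_resolve)
print(first_failure)  # community
```

In this sketch the development and dogfood rings pass because their own configurations are valid, so the staged order alone cannot expose a mistake that lives only in a later ring's configuration.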

Detection

Our automation detected the issues within a few minutes. Our engineers started looking into it right away.

Response

Our engineers found the root cause within 10 minutes and began the rollback process. Since this configuration change was part of many releases, the whole rollback took about 20 minutes. Once the rollback was completed, the issue was fully mitigated.

Follow-up

To prevent similar issues in the future and enhance our service reliability, we are implementing several key improvements:

  • Update the pre-merge validation: We will change our pre-merge validation to catch similar errors.
  • Improve the canary test cases: Our canary test didn’t find this regression. We will improve the test to capture such errors before the rollout.
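
One possible shape for a pre-merge validation of this kind is sketched below. It is an illustration under stated assumptions, not the actual check: the config schema (a `services` map of name to endpoint URL and a `routes` map of path to service name) and the functions `validate_ring` and `validate_all_rings` are hypothetical. The idea is to validate every ring's configuration before merge, rather than relying on each ring to validate only itself during rollout.

```python
from typing import Dict, List
from urllib.parse import urlparse


def validate_ring(ring: str, config: dict) -> List[str]:
    """Return a list of human-readable errors for one ring's configuration."""
    errors: List[str] = []
    services = config.get("services", {})

    # Every registered endpoint should be a well-formed http(s) URL.
    for name, endpoint in services.items():
        parsed = urlparse(endpoint)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append(f"{ring}: service '{name}' has an invalid endpoint '{endpoint}'")

    # Every route should point at a service registered in the same ring.
    for path, target in config.get("routes", {}).items():
        if target not in services:
            errors.append(f"{ring}: route '{path}' targets unregistered service '{target}'")

    return errors


def validate_all_rings(configs: Dict[str, dict]) -> List[str]:
    """Validate every ring's configuration, including rings not yet deployed."""
    errors: List[str] = []
    for ring, config in configs.items():
        errors.extend(validate_ring(ring, config))
    return errors
```

A merge gate built on this sketch would fail the change whenever `validate_all_rings` returns a non-empty list, so a mistake in a later ring's configuration is reported before any ring is deployed.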
Posted Jan 31, 2025 - 22:30 UTC

Resolved

The issue has been fully resolved. Thank you for your patience.
Posted Jan 30, 2025 - 04:10 UTC

Monitoring

Multiple services experienced degradation due to a routing configuration issue. The change has been reverted, and all services have returned to a healthy state. Our teams are actively monitoring the system.
Posted Jan 30, 2025 - 04:02 UTC

Update

We are continuing to investigate this issue.
Posted Jan 30, 2025 - 03:51 UTC

Investigating

Teams are currently assessing the scope of the impact and we will come back with more details.
Posted Jan 30, 2025 - 03:39 UTC
This incident affected: Orchestrator and Document Understanding.