Customers are experiencing sporadic high latencies accessing cloud.uipath.com

Incident Report for UiPath

Postmortem

Customer impact

Some customers in Asia (especially Japan and Singapore) saw degraded performance when attempting to connect to UiPath Automation Cloud from 3:56 am to 5:11 am UTC on 2025/1/31. Customer requests timed out during this time.

Root cause

A misconfigured cache expiry setting in Notification Service led to excessive API calls to Orchestrator, overloading the API gateway:

  • Trigger: We saw a spike in publication requests coming to Notification Service from Orchestrator. This was unusual, but the system should have been able to handle it.
  • Cause: Normally, the result of a feature flag check is cached for 10 minutes. The cache expiry setting was incorrectly set to "0" seconds instead of "600" seconds, so the cached result expired immediately. As a result, every publication event triggered a direct call to Orchestrator, significantly increasing load.
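
To illustrate the effect of the expiry value, here is a minimal Python sketch of a time-based cache. It is not the Notification Service implementation: the `TtlCache` class, the `simulate` helper, and the rate of one publication event per second are all assumptions made for the example. It only shows why an expiry of 0 seconds turns every event into an upstream call, while 600 seconds does not.

```python
import time
from typing import Callable, Optional


class TtlCache:
    """Hypothetical single-value cache that refetches after ttl_seconds."""

    def __init__(self, fetch: Callable[[], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[bool] = None
        self._expires_at = float("-inf")  # forces a fetch on first use
        self.upstream_calls = 0

    def get(self) -> bool:
        now = self._clock()
        if now >= self._expires_at:
            # Entry is missing or expired: call upstream and restart the timer.
            self._value = self._fetch()
            self.upstream_calls += 1
            self._expires_at = now + self._ttl
        return self._value


def simulate(ttl_seconds: float, events: int) -> int:
    """Count upstream calls for `events` publication events, one per second."""
    fake_now = [0.0]  # simulated clock, so the example runs instantly
    cache = TtlCache(lambda: True, ttl_seconds, clock=lambda: fake_now[0])
    for _ in range(events):
        cache.get()
        fake_now[0] += 1.0
    return cache.upstream_calls


if __name__ == "__main__":
    print("ttl=600:", simulate(600, 1000), "upstream calls")
    print("ttl=0:  ", simulate(0, 1000), "upstream calls")
```

Under these assumptions, 1,000 events produce 2 upstream calls with a 600-second expiry and 1,000 upstream calls with a 0-second expiry.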

Detection

  • Synthetic monitoring alerts fired for multiple services because requests were timing out.
  • Our logs showed the API gateway was closing connections due to excessive traffic.
  • The investigation pointed to an unusual surge in calls to the NotificationServiceIntegrationEnabled endpoint.

Response

  • The API gateway was scaled up in the affected regions to handle the increased load.
  • We found the underlying misconfiguration and corrected it.

Follow-up

  • Make sure the cache expiry setting is set correctly in all environments and checked by at least two teammates.
  • Improve autoscaling settings for the API gateway so it can absorb sudden load.
  • Improve rate limiting on the Notification Service side.
  • Improve documentation and validation of feature flag/caching changes before rollout.
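
As one possible shape for the validation item above, the following Python sketch rejects an out-of-range cache expiry before it is rolled out. The function name `validate_cache_ttl` and the allowed bounds are hypothetical and chosen only for illustration; they do not describe UiPath's actual configuration or tooling.

```python
# Assumed bounds for illustration only; real limits would be a team decision.
MIN_TTL_SECONDS = 60
MAX_TTL_SECONDS = 3600


def validate_cache_ttl(raw_value: str) -> int:
    """Parse a cache expiry setting and reject values outside the allowed range."""
    try:
        ttl = int(raw_value)
    except (TypeError, ValueError):
        raise ValueError(
            f"cache expiry must be a whole number of seconds, got {raw_value!r}"
        ) from None
    if not MIN_TTL_SECONDS <= ttl <= MAX_TTL_SECONDS:
        raise ValueError(
            f"cache expiry {ttl}s is outside the allowed range "
            f"{MIN_TTL_SECONDS}-{MAX_TTL_SECONDS}s"
        )
    return ttl


if __name__ == "__main__":
    print(validate_cache_ttl("600"))  # accepted
    try:
        validate_cache_ttl("0")       # the misconfigured value from this incident
    except ValueError as err:
        print("rejected:", err)
```

A check like this could run in a deployment pipeline or at service startup, so that a value such as "0" fails fast instead of silently disabling the cache.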
Posted Feb 05, 2025 - 06:09 UTC

Resolved

The issue has been resolved.
Posted Jan 31, 2025 - 06:34 UTC

Monitoring

The issue has been resolved for all impacted regions, and teams are actively monitoring the service.
Posted Jan 31, 2025 - 06:22 UTC

Update

Impact from this event was limited to customers physically located in the India, Singapore, and Japan regions. Services for customers in Singapore and Japan have fully recovered, while recovery efforts are ongoing for customers in the India region. Our teams are actively working to restore full service as soon as possible.
Posted Jan 31, 2025 - 05:52 UTC

Identified

The issue has been identified, and the team is actively working on a fix.
Posted Jan 31, 2025 - 05:44 UTC

Investigating

We are currently investigating this issue.
Posted Jan 31, 2025 - 05:12 UTC
This incident affected: Automation Cloud, Orchestrator, Automation Hub, AI Center, Action Center, Apps, Automation Ops, Computer Vision, Cloud Robots - VM, Customer Portal, Data Service, Documentation Portal, Document Understanding, Insights, Integration Service, Marketplace, Process Mining, Task Mining, Test Manager, Communications Mining, Serverless Robots, Studio Web, Solutions Management, Context Grounding, and Autopilot for Everyone.