Some customers in Asia (especially Japan and Singapore) saw degraded performance when attempting to connect to UiPath Automation Cloud from 3:56 am to 5:11 am UTC on 2025/1/31. Customer requests timed out during this window.
Root cause
A misconfigured cache expiry setting in Notification Service led to excessive API calls to Orchestrator, overloading the API gateway:
Trigger: We saw a spike in publication requests coming to Notification Service from Orchestrator. This was unusual, but the system should have been able to handle it.
Cause: Normally, the result of a feature flag check is cached for 10 minutes. The cache expiry setting was incorrectly set to "0" instead of "600" seconds. With an expiry of 0, no result was ever reused, so every publication event triggered a direct call to Orchestrator, significantly increasing load.
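To illustrate the effect of the expiry value, here is a minimal sketch of a time-based cache in Python. It is not the Notification Service code: the TTLCache class, the count_upstream_calls helper, and the rate of 10 events per second are assumptions made only for this example. The flag name is the one mentioned in this report.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Hypothetical time-based cache; not the actual Notification Service implementation."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, bool]] = {}

    def get(self, key: str, loader: Callable[[], bool]) -> bool:
        now = self._clock()
        entry = self._store.get(key)
        # An entry is fresh only while its age is strictly below the TTL.
        # With ttl_seconds == 0 this condition is never true.
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        value = loader()  # upstream call (stands in for the call to Orchestrator)
        self._store[key] = (now, value)
        return value


def count_upstream_calls(ttl_seconds: float, events: int = 1000) -> int:
    """Simulate `events` publication events and count the upstream flag lookups."""
    calls = 0

    def fetch_flag() -> bool:
        nonlocal calls
        calls += 1
        return True

    fake_now = [0.0]  # simulated clock, so the example runs instantly
    cache = TTLCache(ttl_seconds, clock=lambda: fake_now[0])
    for _ in range(events):
        cache.get("NotificationServiceIntegrationEnabled", fetch_flag)
        fake_now[0] += 0.1  # assumed rate: 10 events per second
    return calls


if __name__ == "__main__":
    print("ttl=0   ->", count_upstream_calls(0), "upstream calls")
    print("ttl=600 ->", count_upstream_calls(600), "upstream calls")
```

In this simulation, 1000 events spread over roughly 100 seconds produce 1000 upstream calls with an expiry of 0 and a single call with an expiry of 600, which is the kind of amplification described above.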
Detection
Synthetic monitoring raised alerts for multiple services because requests were timing out.
Our logs showed the API gateway was closing connections due to excessive traffic.
The investigation pointed to an unusual surge in calls to the NotificationServiceIntegrationEnabled endpoint.
Response
The API gateway was scaled up in affected regions to handle the increased load.
We found the underlying misconfiguration and corrected it.
Follow-up
Make sure the cache expiry setting is set correctly in all environments and checked by at least two teammates.
Improve autoscaling settings for the API gateway so it can absorb sudden load.
Improve rate limiting on the Notification Service side.
Improve documentation and validation of feature flag/caching changes before rollout.
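As one possible shape for the validation follow-up, the sketch below rejects a cache expiry value that is missing, non-numeric, or outside an allowed range before it is rolled out. The function name validate_cache_ttl and the bounds of 60 and 3600 seconds are hypothetical and chosen only for illustration; they are not taken from the actual service configuration.

```python
MIN_TTL_SECONDS = 60    # hypothetical lower bound; 0 would disable caching entirely
MAX_TTL_SECONDS = 3600  # hypothetical upper bound to keep flag changes reasonably fresh


def validate_cache_ttl(raw_value: object) -> int:
    """Parse a cache expiry setting and fail loudly if it is unusable."""
    try:
        ttl = int(raw_value)
    except (TypeError, ValueError):
        raise ValueError(
            f"cache expiry must be a whole number of seconds, got {raw_value!r}"
        ) from None
    if not MIN_TTL_SECONDS <= ttl <= MAX_TTL_SECONDS:
        raise ValueError(
            f"cache expiry {ttl}s is outside the allowed range "
            f"{MIN_TTL_SECONDS}-{MAX_TTL_SECONDS}s"
        )
    return ttl


if __name__ == "__main__":
    print(validate_cache_ttl("600"))  # accepted
    try:
        validate_cache_ttl("0")       # the value involved in this incident
    except ValueError as err:
        print("rejected:", err)
```

A check like this could run at service startup or in a deployment pipeline, so that a value such as "0" fails the rollout instead of reaching production.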
Posted Feb 05, 2025 - 06:09 UTC
Resolved
The issue has been resolved.
Posted Jan 31, 2025 - 06:34 UTC
Monitoring
The issue has been resolved for all impacted regions, and teams are actively monitoring the service.
Posted Jan 31, 2025 - 06:22 UTC
Update
Impact from this event was limited to customers physically located in the India, Singapore, and Japan regions. Services for customers in Singapore and Japan have fully recovered, while recovery efforts are ongoing for customers in the India region. Our teams are actively working to restore full service as soon as possible.
Posted Jan 31, 2025 - 05:52 UTC
Identified
The issue has been identified, and the team is actively working on a fix.
Posted Jan 31, 2025 - 05:44 UTC
Investigating
We are currently investigating this issue.
Posted Jan 31, 2025 - 05:12 UTC
This incident affected: Automation Cloud, Orchestrator, Automation Hub, AI Center, Action Center, Apps, Automation Ops, Computer Vision, Cloud Robots - VM, Customer Portal, Data Service, Documentation Portal, Document Understanding, Insights, Integration Service, Marketplace, Process Mining, Task Mining, Test Manager, Communications Mining, Serverless Robots, Studio Web, Solutions Management, Context Grounding, and Autopilot for Everyone.