Customers will see issues accessing Automation Cloud from the East US region

Incident Report for UiPath

Postmortem

Customer impact

From 2024-04-03 23:45 UTC to 2024-04-04 02:35 UTC our customers experienced errors when accessing some of the services located in the US region of Automation Cloud. Impacted products include Automation Cloud, Orchestrator, Automation Hub, Automation Ops, Document Understanding, Serverless Robots, Cloud Robots - VM, Solutions Management, and Insights.

Root cause

UiPath makes extensive use of Azure SQL. At the beginning of the outage, Microsoft performed routine SQL maintenance in the East US region. Typically, this is done without any visible impact to our customers. In this case, however, the maintenance caused the SQL databases in the region to become unavailable, for reasons that have not yet been identified. We are still waiting for a root cause from Microsoft and will update this document once we receive it.

Detection

Automated alerts immediately detected the issue and notified UiPath on-call engineers. They confirmed the scope of the outage and updated status.uipath.com.

Response

After a brief investigation, we determined that the problem was with Azure SQL. We reached out to Microsoft Support to request assistance.

For the US region of all UiPath products, we place the primary database in Azure’s East US region and a failover database in Azure’s West US region. By default, Azure will fail over from the primary to the secondary after the primary has been unavailable for 60 minutes.

During this incident, most databases automatically failed over to the secondary region. Unfortunately, the Orchestrator, Automation Hub and Insights databases did not. UiPath engineers investigated these databases and began to trigger a manual failover, but by that time Microsoft had resolved the underlying issue in the East US region.
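
The timing described above can be worked through as a short calculation. This is only a sketch: it assumes the databases became unavailable at the start of the reported customer impact (23:45 UTC), which the report does not state exactly, and the function and constant names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Impact window taken from the "Customer impact" section (UTC).
OUTAGE_START = datetime(2024, 4, 3, 23, 45, tzinfo=timezone.utc)
OUTAGE_END = datetime(2024, 4, 4, 2, 35, tzinfo=timezone.utc)

# Default wait before automatic failover, as stated in this report.
FAILOVER_GRACE_PERIOD = timedelta(minutes=60)


def earliest_automatic_failover(unavailable_since, grace_period=FAILOVER_GRACE_PERIOD):
    """Return the earliest time an automatic failover would be expected."""
    return unavailable_since + grace_period


# Assumption: the primary became unavailable when customer impact began.
outage_duration = OUTAGE_END - OUTAGE_START
expected_failover = earliest_automatic_failover(OUTAGE_START)
remaining_after_grace = OUTAGE_END - expected_failover

print("Outage duration:", outage_duration)
print("Earliest expected automatic failover:", expected_failover.isoformat())
print("Outage remaining after the grace period:", remaining_after_grace)
```

Under that assumption, a database that did not fail over automatically could have stayed on the impaired primary for well over an hour after the grace period ended.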

Follow up

Work with Microsoft to get a root cause for the underlying Azure SQL outage.
Determine why the Orchestrator, Automation Hub and Insights databases did not fail over to the secondary region. Perform a failover drill to confirm the problem has been fixed.
Investigate whether the automatic failover period can be reduced from 60 minutes.
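
As a rough illustration of what a failover drill check could verify, the sketch below lists the databases still served from the primary region after a drill. The function name, the input mapping, the region strings, and the `ExampleService` entry are all hypothetical and do not describe UiPath's actual tooling. In practice the observed regions would come from whatever monitoring is in place.

```python
def databases_still_on_primary(observed_regions, primary_region):
    """Return the names of databases whose observed region is still the primary."""
    return sorted(
        name for name, region in observed_regions.items()
        if region == primary_region
    )


# Hypothetical snapshot shaped like this incident: three databases stayed
# on the primary while another (placeholder name) moved to the secondary.
snapshot = {
    "Orchestrator": "East US",
    "Automation Hub": "East US",
    "Insights": "East US",
    "ExampleService": "West US",
}

stuck = databases_still_on_primary(snapshot, primary_region="East US")
print("Did not fail over:", stuck)
```

A drill would be considered successful only when this list is empty.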

Posted Apr 04, 2024 - 16:58 UTC

Resolved

The issue is resolved and we are continuing to monitor our services. Marking the status as resolved.

Posted Apr 04, 2024 - 02:36 UTC

Monitoring

We are seeing improvements in the health of the databases. We are continuously monitoring the status.

Posted Apr 04, 2024 - 02:04 UTC

Update

We are continuing to investigate this issue.

Posted Apr 04, 2024 - 02:02 UTC

Update

We are engaging with Microsoft and continuing to investigate, but at this time we are waiting for the issue to be resolved on the Microsoft Azure SQL backend.

Posted Apr 04, 2024 - 01:19 UTC

Update

We are engaging with Microsoft and continuing to investigate, but at this time we are waiting for the issue to be resolved on the Microsoft Azure SQL backend.

Posted Apr 04, 2024 - 01:18 UTC

Update

We are continuing to investigate this issue.

Posted Apr 04, 2024 - 00:54 UTC

Update

We are continuing to investigate this issue.

Posted Apr 04, 2024 - 00:53 UTC

Update

We are continuing the investigation and are also seeing some issues with the backend Azure SQL service from our cloud provider.

Posted Apr 04, 2024 - 00:40 UTC

Update

We are continuing to investigate this issue.

Posted Apr 04, 2024 - 00:09 UTC

Update

We are continuing to investigate this issue.

Posted Apr 04, 2024 - 00:06 UTC

Investigating

We are seeing an increasing number of 503 errors from Cloudflare. We are investigating the issue further in our backend applications and continuing to monitor.

Posted Apr 04, 2024 - 00:02 UTC

This incident affected: Orchestrator and Insights.