Customer Impact
On July 21st, 2022, starting at 04:18 UTC, Enterprise and Community users in Europe had trouble accessing UiPath cloud products, including Document Understanding, Data Service, AICenter and Automation Hub.
The impact window for each of these services was as follows.
Document Understanding - 21st July, 04:33 UTC to 21st July, 04:59 UTC
AICenter - 21st July, 04:44 UTC to 21st July, 05:22 UTC
Data Service - 21st July, 04:54 UTC to 21st July, 05:40 UTC
Automation Hub - 21st July, 04:18 UTC to 22nd July, 03:30 UTC
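For a quick sense of how long each service was affected, the windows above can be turned into durations. This is a minimal sketch: the timestamps are copied from the list above, while the names `utc`, `IMPACT_WINDOWS` and `impact_minutes` are illustrative and not part of any UiPath tooling.

```python
from datetime import datetime, timezone


def utc(day: int, hour: int, minute: int) -> datetime:
    """Build a UTC timestamp in July 2022 (the month of the incident)."""
    return datetime(2022, 7, day, hour, minute, tzinfo=timezone.utc)


# Impact windows copied from the timeline above: (start, end), both in UTC.
IMPACT_WINDOWS = {
    "Document Understanding": (utc(21, 4, 33), utc(21, 4, 59)),
    "AICenter": (utc(21, 4, 44), utc(21, 5, 22)),
    "Data Service": (utc(21, 4, 54), utc(21, 5, 40)),
    "Automation Hub": (utc(21, 4, 18), utc(22, 3, 30)),
}


def impact_minutes(windows: dict) -> dict:
    """Return the length of each impact window in whole minutes."""
    return {
        name: int((end - start).total_seconds() // 60)
        for name, (start, end) in windows.items()
    }


for name, minutes in impact_minutes(IMPACT_WINDOWS).items():
    hours, rest = divmod(minutes, 60)
    print(f"{name}: {hours}h {rest:02d}m")
```

Under these inputs, the first three services were affected for under an hour each, while Automation Hub was affected for a little over 23 hours.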
Root Cause
These UiPath cloud services use Azure SQL Database as their backend database. During this time, new connections to Azure SQL databases in the WestEurope region were failing with errors or timeouts because of a region-wide outage on the Azure side. More details on that incident can be found here.
Detection & Response
Our automated alerting system notified the on-call engineers within minutes of the outage affecting our applications. As soon as we identified the issue with the databases in the WestEurope region, we initiated a failover of the databases for the impacted applications to their NorthEurope instances. For Data Service, it took additional time before the service could accept read/write requests again because of a problem on the Azure side.
For Automation Hub, we took the same remediation steps. However, because of a configuration issue in its secondary (NorthEurope) database instance, database connectivity could not be established there either. After the incident on the Azure side was partially mitigated, we were able to fall back to the WestEurope databases, but during this process one of the databases went into a fault state. We worked with Azure support on this issue and, to mitigate the incident as quickly as possible, started recovering the data for the faulty database. Because of the large volume of data, this process took some time, and we restored service for all Automation Hub customers (in WestEurope) by 22nd July, 03:30 UTC.
As part of our response to the outage, we have identified improvements to the way we use SQL infrastructure (and all other Azure services) in Automation Hub, to minimize downtime if we encounter a similar issue.
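The failover in this incident was initiated on the database side, and the report does not describe how it was implemented. Purely as an illustration of falling back from a primary region to a secondary one, here is a minimal client-side sketch. The function `connect_with_failover`, the injected `connect` callable and the endpoint names are all hypothetical; they do not reflect UiPath's or Azure's actual mechanism.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(
    endpoints: Sequence[str],
    connect: Callable[[str], T],
    attempts_per_endpoint: int = 2,
) -> tuple[str, T]:
    """Try each endpoint in order and return the first working connection.

    `connect` is a caller-supplied (hypothetical) function that opens a
    connection to one endpoint and raises ConnectionError or TimeoutError
    on failure.
    """
    errors: list[str] = []
    for endpoint in endpoints:
        for attempt in range(1, attempts_per_endpoint + 1):
            try:
                return endpoint, connect(endpoint)
            except (ConnectionError, TimeoutError) as exc:
                # Record the failure, then retry or move to the next endpoint.
                errors.append(f"{endpoint} attempt {attempt}: {exc}")
    # Every endpoint failed, e.g. primary down and secondary misconfigured.
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))
```

A caller would list the primary endpoint first and the secondary second. If neither accepts connections, which is comparable to what Automation Hub experienced, the function raises an error that lists every failed attempt rather than failing silently.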