On May 13, 2024, at 12:04 UTC, enterprise customers with the Orchestrator service deployed in the European Union experienced errors, timeouts, or degraded performance. The incident also affected services that depend on Orchestrator, such as Automation Solutions, Studio Web, Insights, Integration Service, Action Center, and Apps, as well as robot job execution. The issue persisted for 1 hour and 45 minutes.
The first time a query is executed in SQL Server, it is compiled and a query plan is generated. This plan is stored in the SQL Server query plan cache. Occasionally, the database server decides to create a new query plan for a query. Plans are usually close to optimal, but sometimes, due to factors such as parameter sniffing or outdated statistics, the database server may choose a plan that works well for one set of parameters but is slow for others.
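As an illustration of how this caching works, cached plans and their reuse counts can be inspected through SQL Server's dynamic management views. The query below is a generic sketch of that technique, not the specific query involved in this incident.

```sql
-- List cached query plans together with how often each has been reused.
-- Generic diagnostic query; the TOP (20) limit is an arbitrary choice.
SELECT TOP (20)
    cp.usecounts,          -- how many times this cached plan has been reused
    cp.objtype,            -- e.g. 'Prepared', 'Proc', or 'Adhoc'
    st.text AS query_text  -- the SQL text the plan was compiled for
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;
```

A frequently executed query like the one in this incident would typically appear near the top of this list, reusing a single cached plan until the server decides to recompile it.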
On May 13, 2024, our database server generated a new query plan for a commonly executed SQL query. The new plan used unnecessary parallelism, which caused the query to consume far more resources than before and overloaded our database server.
We resolved the problem by identifying the regressed query plan, forcing the database server to use the previous, optimal plan, and then restarting the Orchestrator application.
Our automated monitoring system detected the issue within minutes and our engineers immediately started investigating.
As soon as the automated alerts came in, we checked which SQL queries were consuming the most database resources. The database was at 100% CPU utilization, which prevented us from connecting to it. To work around this, we restarted the Orchestrator service, which freed up enough resources to connect and proceed with debugging.
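The check described above can be approximated with a standard diagnostic query against SQL Server's execution statistics. This is a generic sketch of finding the most CPU-intensive statements, not the exact query our engineers ran.

```sql
-- Top statements by total CPU time since their plans were cached.
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
    -- Extract the individual statement from the containing batch text.
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

A query whose plan has regressed typically stands out here through a sudden jump in average CPU time per execution.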
After identifying the regressed query plan as the likely cause, we forced the old query plan. Database CPU usage dropped by 50%, but other metrics remained high. We found that some database sessions were still using the regressed plan, so we restarted the application a second time. This closed those sessions and brought database resource usage back to normal.
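On SQL Server versions with Query Store enabled, forcing a previously good plan looks roughly like the sketch below. The query and plan IDs are placeholders, not values from this incident.

```sql
-- Inspect the plans Query Store has recorded for a given query, newest first.
-- 1234 is a placeholder query_id, not a value from this incident.
SELECT p.plan_id, p.query_id, p.is_forced_plan, p.last_execution_time
FROM sys.query_store_plan AS p
WHERE p.query_id = 1234
ORDER BY p.last_execution_time DESC;

-- Force the known-good plan (5678 is a placeholder plan_id).
EXEC sp_query_store_force_plan @query_id = 1234, @plan_id = 5678;

-- Existing sessions may keep running with the regressed plan they already
-- compiled; restarting the application (as we did) closes those sessions.
```

Forcing a plan is a mitigation rather than a root-cause fix: the underlying reason the optimizer chose the regressed plan still needs to be investigated.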
As we continue to analyze this incident and our response, we have taken immediate action and started working on the following: