On May 13, 2024, at 12:04 UTC, enterprise customers with the Orchestrator service deployed in the European Union experienced errors, timeouts, or degraded performance. The incident also affected services that depend on Orchestrator, such as Automation Solutions, Studio Web, Insights, Integration Service, Action Center, and Apps, as well as robot job execution. The issue persisted for 1 hour and 45 minutes.
The first time a query is executed in SQL Server, it is compiled and a query plan is generated. This plan is stored in the SQL Server query plan cache. Occasionally, the database server decides to create a new query plan for a query. Plans are usually close to optimal, but sometimes, due to factors such as parameter sniffing or outdated statistics, the database server may choose a plan that works well for one set of parameters but is slow for others.
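As an illustration of how this caching works, cached plans and their reuse counts can be inspected through SQL Server's dynamic management views. The query below is a generic sketch of that technique, not the specific query involved in this incident.

```sql
-- List cached query plans together with how often each has been reused.
-- Generic diagnostic query; the TOP (20) limit is an arbitrary choice.
SELECT TOP (20)
    cp.usecounts,          -- how many times this cached plan has been reused
    cp.objtype,            -- e.g. 'Prepared', 'Proc', or 'Adhoc'
    st.text AS query_text  -- the SQL text the plan was compiled for
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;
```

A frequently executed query like the one in this incident would typically appear near the top of this list, reusing a single cached plan until the server decides to recompile it.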
On May 13, 2024, our database server generated a new query plan for a commonly executed SQL query. The new plan used unnecessary parallelism, which caused the query to consume far more resources than before and overloaded our database server.
We resolved the problem by identifying the regressed query plan, forcing the database server to use the previous, optimal plan, and then restarting the Orchestrator application.
Our automated monitoring system detected the issue within minutes and our engineers immediately started investigating.
As soon as the automated alerts came in, we checked which SQL queries were consuming the most database resources. The database was at 100% CPU utilization, which prevented us from connecting to it. To work around this, we restarted the Orchestrator service, which freed up enough resources to connect and proceed with debugging.
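The check described above can be approximated with a standard diagnostic query against SQL Server's execution statistics. This is a generic sketch of finding the most CPU-intensive statements, not the exact query our engineers ran.

```sql
-- Top statements by total CPU time since their plans were cached.
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
    -- Extract the individual statement from the containing batch text.
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

A query whose plan has regressed typically stands out here through a sudden jump in average CPU time per execution.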
After identifying the regressed query plan as the likely cause, we forced the old query plan. Database CPU usage dropped by 50%, but other metrics remained high. We found that some database sessions were still using the regressed plan, so we restarted the application a second time. This closed those sessions and brought database resource usage back to normal.
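On SQL Server versions with Query Store enabled, forcing a previously good plan looks roughly like the sketch below. The query and plan IDs are placeholders, not values from this incident.

```sql
-- Inspect the plans Query Store has recorded for a given query, newest first.
-- 1234 is a placeholder query_id, not a value from this incident.
SELECT p.plan_id, p.query_id, p.is_forced_plan, p.last_execution_time
FROM sys.query_store_plan AS p
WHERE p.query_id = 1234
ORDER BY p.last_execution_time DESC;

-- Force the known-good plan (5678 is a placeholder plan_id).
EXEC sp_query_store_force_plan @query_id = 1234, @plan_id = 5678;

-- Existing sessions may keep running with the regressed plan they already
-- compiled; restarting the application (as we did) closes those sessions.
```

Forcing a plan is a mitigation rather than a root-cause fix: the underlying reason the optimizer chose the regressed plan still needs to be investigated.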
As we continue to analyze this incident and our response, we have taken immediate action and started working on the following: