Between October 17, 2024, at 21:00 UTC and October 18, 2024, at 13:00 UTC, some customers with tenants hosted in the US region may have experienced errors or increased latency while using the Generative AI capabilities in Document Understanding.
Document Understanding uses Azure OpenAI GPT to power features that require large language models (LLMs). UiPath partners with Azure to secure a specific capacity for these advanced AI services. However, this Azure capacity is limited, and acquiring additional resources on short notice is not always feasible. The allocated capacity is shared among several UiPath products, with Document Understanding receiving a significant portion of this quota.
To ensure the fair and efficient use of these limited resources, UiPath has implemented a quota system that allocates capacity to each customer. This system prevents any single customer from consuming excessive resources, thereby safeguarding the service's performance and accessibility for all users. It ensures that high usage by one customer does not negatively impact the experience of others.
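The idea behind a per-customer quota can be sketched as follows. This is a minimal illustration and not UiPath's actual implementation: the class name `PerCustomerQuota`, the `max_share` parameter, and the notion of abstract capacity "units" per window are all assumptions made for the example.

```python
from collections import defaultdict


class PerCustomerQuota:
    """Hypothetical per-customer cap on a shared capacity pool (illustrative only)."""

    def __init__(self, total_capacity: int, max_share: float):
        # total_capacity: units available to the whole product per window (assumed).
        # max_share: largest fraction of the pool one customer may use (assumed).
        self.total_capacity = total_capacity
        self.per_customer_limit = int(total_capacity * max_share)
        self.used = defaultdict(int)

    def try_consume(self, customer_id: str, units: int) -> bool:
        """Reserve units for a customer; return False if a limit would be exceeded."""
        if units <= 0:
            raise ValueError("units must be positive")
        pool_used = sum(self.used.values())
        if pool_used + units > self.total_capacity:
            return False  # the shared pool is exhausted
        if self.used[customer_id] + units > self.per_customer_limit:
            return False  # this customer has reached its own cap
        self.used[customer_id] += units
        return True

    def reset_window(self) -> None:
        """Start a new accounting window (for example, once per minute)."""
        self.used.clear()
```

With `PerCustomerQuota(total_capacity=100, max_share=0.25)`, a single customer is refused once it has used 25 units, while other customers can still draw on the remaining pool. If the per-customer limit is not lowered when the pool itself shrinks, a few customers can exhaust the pool before any individual cap is reached.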
In addition to the quota system, Document Understanding incorporates an internal retry mechanism to shield customers from intermittent errors, such as brief periods of quota exhaustion. This mechanism automatically retries failed requests, enhancing the service's reliability and robustness. It helps maintain a seamless user experience even during temporary resource constraints.
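A retry mechanism of this kind is commonly built with exponential backoff. The sketch below is an assumption about the general technique, not a description of the Document Understanding internals: `QuotaExhaustedError`, `call_with_retry`, and the default attempt count and delays are hypothetical names and values chosen for illustration.

```python
import random
import time


class QuotaExhaustedError(Exception):
    """Hypothetical error raised when the LLM backend reports exhausted quota."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on QuotaExhaustedError with exponential backoff.

    The delay doubles after each failed attempt, and random jitter is added so
    that many clients do not retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except QuotaExhaustedError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

A brief quota shortage is absorbed by the retries and the caller only sees a slower response. A sustained shortage exhausts the attempts and surfaces as an error, which matches the mix of increased latency and failures described in this report.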
Despite these measures, customers may experience errors or increased latency on occasion due to inherent limitations in resource capacity.
Due to a misconfiguration in our quota management system, the existing quota allocated for Document Understanding was too low to support the increased traffic during the incident. Consequently, a small subset of customers could consume all the LLM capacity allocated for Document Understanding.
This unintended usage caused other customers to experience significantly increased latency and to need multiple attempts to complete operations that relied on this resource. The issue arose when we resized and reallocated resources but did not update the quota configuration.
Although we received alerts about the increased error rate, the internal retry mechanism, which is designed to handle intermittent errors, caused these alerts to be incorrectly categorized as low severity. This misclassification delayed our awareness of and response to the situation.
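To illustrate how retries can hide the scale of a problem from alerting, the sketch below compares the failure rate of first attempts with the failure rate after retries. It is a simplified model under stated assumptions: the `RequestOutcome` record and the `error_rates` function are hypothetical and do not reflect the actual alerting pipeline.

```python
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    attempts: int    # total attempts made, including the first one
    succeeded: bool  # whether any attempt eventually succeeded


def error_rates(outcomes):
    """Return (first_attempt_failure_rate, final_failure_rate) for the requests."""
    total = len(outcomes)
    if total == 0:
        return 0.0, 0.0
    # A request failed on its first attempt if it was retried or never succeeded.
    first_attempt_failures = sum(
        1 for o in outcomes if o.attempts > 1 or not o.succeeded
    )
    final_failures = sum(1 for o in outcomes if not o.succeeded)
    return first_attempt_failures / total, final_failures / total
```

For ten requests where six succeed immediately, three succeed after retries, and one fails outright, the first-attempt failure rate is 40% while the final failure rate is only 10%. An alert based only on the final rate would understate how close the service is to exhausting its capacity.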
A few customers reported that Document Understanding was not functioning correctly. Their reports enabled us to correlate these incidents with the low-severity alerts we had received, which started the investigation.
We then worked to identify the root cause of the incident. Once we fully understood the problem, we increased the quota allocated for Document Understanding, which resolved the incident.
To prevent similar issues in the future and enhance our service reliability, we are implementing several key improvements: