Document Understanding generative extraction & classification capabilities unavailable

Incident Report for UiPath

Postmortem

Customer impact

Background context

Document Understanding leverages Azure OpenAI GPT to power features that require large language models (LLMs). UiPath partners with Azure to secure a specific capacity for these advanced AI services. However, this Azure capacity is limited, and acquiring additional resources on short notice is not always feasible. The allocated capacity is shared among several UiPath products, with Document Understanding receiving a relatively large portion of this quota.

To ensure the fair and efficient use of these limited resources, UiPath has implemented a quota system that allocates capacity to each customer. This system prevents any single customer from consuming excessive resources, thereby safeguarding the service's performance and accessibility for all users. It ensures that high usage by one customer does not negatively impact the experience of others.
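As an illustration of the idea, a per-customer quota over a shared capacity pool can be sketched as follows. This is a minimal sketch and not UiPath's implementation: the class name `QuotaManager`, its fields, and the notion of abstract capacity "units" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaManager:
    """Hypothetical per-customer quota over a shared capacity pool."""

    total_capacity: int      # units available to the product (assumed unit: e.g. tokens per minute)
    per_customer_limit: int  # most units any single customer may hold at once
    used: dict[str, int] = field(default_factory=dict)

    def try_acquire(self, customer_id: str, units: int) -> bool:
        """Reserve units for a customer; refuse if either limit would be exceeded."""
        customer_used = self.used.get(customer_id, 0)
        if customer_used + units > self.per_customer_limit:
            return False  # this customer is over their own quota
        if sum(self.used.values()) + units > self.total_capacity:
            return False  # the shared pool is exhausted
        self.used[customer_id] = customer_used + units
        return True

    def release(self, customer_id: str, units: int) -> None:
        """Return units to the pool when a request finishes."""
        self.used[customer_id] = max(0, self.used.get(customer_id, 0) - units)
```

With a per-customer limit below the total capacity, one customer hitting their limit still leaves room for others. If the limits are set inconsistently with the real capacity, that protection no longer holds.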

In addition to the quota system, Document Understanding incorporates an internal retry mechanism to shield customers from intermittent errors, such as brief periods of quota exhaustion. This mechanism automatically retries failed requests, enhancing the service's reliability and robustness. It helps maintain a seamless user experience even during temporary resource constraints.
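A retry mechanism of this kind is commonly built as a bounded retry loop with exponential backoff. The sketch below is illustrative only: `QuotaExhaustedError`, `call_with_retry`, and the default attempt count and delays are hypothetical and not taken from Document Understanding.

```python
import random
import time


class QuotaExhaustedError(Exception):
    """Hypothetical error raised when the LLM backend rejects a call for lack of quota."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation, retrying on QuotaExhaustedError with exponential backoff and jitter.

    Returns (result, attempts_used) so callers can record how many attempts were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except QuotaExhaustedError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries so clients do not all retry at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
```

Returning the attempt count alongside the result keeps retried errors visible to monitoring even when the request eventually succeeds.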

Despite these measures, customers may experience errors or increased latency on occasion due to inherent limitations in resource capacity.

Root cause

Due to a misconfiguration in our quota management system, the quota allocated for Document Understanding was too low to support the increased traffic during the incident. The misconfiguration arose when we resized and reallocated resources but did not update the quota configuration. Consequently, a small subset of customers could consume all the LLM capacity allocated for Document Understanding. As a result, other customers experienced significantly increased latency and needed multiple attempts to complete operations that relied on this resource.

Detection

Although we received alerts about the increased error rate, the internal retry mechanism, which is designed to handle intermittent errors, caused these alerts to be incorrectly categorized as low severity. This misclassification delayed our awareness of and response to the situation.

A few customers reported experiencing issues with Document Understanding not functioning correctly. Their reports enabled us to correlate these incidents with the lower severity alerts we had received, leading to the start of the investigation.
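The masking effect described in this section can be illustrated with a small severity rule that looks at errors absorbed by retries as well as errors that reached the customer. This is a sketch under stated assumptions: the function name, the inputs, and both thresholds are hypothetical and do not describe UiPath's alerting configuration.

```python
def classify_severity(first_attempt_failures, final_failures, total_requests,
                      retried_threshold=0.20, final_threshold=0.01):
    """Classify an alert using both post-retry and pre-retry failure rates.

    first_attempt_failures: requests whose first attempt failed (including later successes)
    final_failures: requests that still failed after all retries
    Thresholds are illustrative assumptions, not real production values.
    """
    if total_requests == 0:
        return "none"
    final_rate = final_failures / total_requests
    retried_rate = first_attempt_failures / total_requests
    # A high share of retried requests is treated as serious even when few
    # requests ultimately fail, because it signals sustained quota pressure.
    if final_rate >= final_threshold or retried_rate >= retried_threshold:
        return "high"
    if retried_rate > 0:
        return "low"
    return "none"
```

A rule based on `final_rate` alone would rate an interval in which 30% of requests needed retries but almost all eventually succeeded as low severity, which is the kind of misclassification described above.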

Response

We investigated the alerts and customer reports to identify the root cause of the incident. Once we understood the problem, we increased the quota allocated for Document Understanding, which resolved the incident.

Follow-up

To prevent similar issues in the future and enhance our service reliability, we are implementing several key improvements:

Enhance Alerting Mechanisms: We are improving our alert systems to provide immediate notifications for both quota issues and retried errors. This enhancement will enable us to respond more swiftly to potential problems, minimizing any impact on our customers.
Transition to a Dynamic Quota System: We will replace our fixed-rate quota with a dynamic allocation system that adjusts in real time based on the current load, available vendor capacity, customer eligibility, and other pertinent factors. This approach will ensure a more equitable and efficient distribution of resources across all customers.
Offer "Bring Your Own LLM Subscription" Options: We are investigating the possibility of offering a "Bring Your Own LLM Subscription" feature. This option would allow customers to use their own language model subscriptions within our platform, providing greater flexibility and potentially reducing dependency on shared resources.
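To make the dynamic quota idea above concrete, one possible approach is a max-min fair split of the currently available capacity across customers according to their current demand. This is a sketch of a well-known general technique, not the system UiPath plans to build. The function name and inputs are assumptions, and factors such as customer eligibility are left out.

```python
def allocate_fair_share(capacity: float, demand: dict[str, float]) -> dict[str, float]:
    """Max-min fair split: nobody gets more than they ask for, leftover is shared equally."""
    allocation = {customer: 0.0 for customer in demand}
    remaining = float(capacity)
    unsatisfied = {c for c, d in demand.items() if d > 0}
    while unsatisfied and remaining > 1e-9:
        # Offer every still-unsatisfied customer an equal share of what is left.
        share = remaining / len(unsatisfied)
        for customer in list(unsatisfied):
            grant = min(share, demand[customer] - allocation[customer])
            allocation[customer] += grant
            remaining -= grant
            if demand[customer] - allocation[customer] <= 1e-9:
                unsatisfied.discard(customer)
    return allocation
```

For example, with a capacity of 100 and demands of 10, 50, and 200, the smallest demand is met in full and the other two customers receive 45 each, so a single heavy user cannot take the whole pool.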

Posted Oct 22, 2024 - 16:19 UTC

Resolved

Between October 17, 2024, at 21:00 UTC and October 18, 2024, at 13:00 UTC, some customers with tenants hosted in the US region may have experienced errors or increased latency while using the Generative AI capabilities in Document Understanding. The cause was a misconfiguration in our quota management system that prevented Document Understanding from supporting the increased traffic during the incident.

We increased the quota allocated for Document Understanding, which resolved the incident. We sincerely apologize for the inconvenience this caused our customers, and we will continue working on better monitoring and guardrails to prevent such issues in the future.

Posted Oct 17, 2024 - 21:00 UTC