Orchestrator slowness for customers in US Region

Incident Report for UiPath

Postmortem

The Orchestrator service experienced performance issues on September 7, 2023, between 5:04 AM and 6:08 PM UTC in the United States, affecting one of the two clusters in the region. The issue manifested as a gradual increase in latency and timeouts for several key APIs, leading to general slowness of the web interface.

The cause was identified as a robot trying to send a large payload to Orchestrator, which was not trimmed by either the robot or Orchestrator. This payload hit a part of our code that was not optimized to handle data of that size, triggering behavior that consumed a large amount of resources and increased the latency of all requests processed on that particular pod. Since the request was not successful, the robot retried submitting the payload twice a minute, affecting different pods each time.

Upon detecting the issue, we first moved to block the robot sending the large payload. We then deployed a hotfix that implements a mechanism for trimming payloads above a reasonable size.
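As an illustration only, a payload-trimming step of the kind mentioned here could look like the sketch below. The 1 MiB limit and the name trim_payload are assumptions made for the example; this report does not state the actual threshold or how the hotfix is implemented.

```python
from typing import Tuple

# Hypothetical limit chosen for illustration; the real threshold is not stated.
MAX_PAYLOAD_BYTES = 1024 * 1024


def trim_payload(payload: str, max_bytes: int = MAX_PAYLOAD_BYTES) -> Tuple[str, bool]:
    """Return (payload, was_trimmed), keeping at most max_bytes of UTF-8 data."""
    encoded = payload.encode("utf-8")
    if len(encoded) <= max_bytes:
        return payload, False
    # Cutting on a byte boundary can split a multi-byte character;
    # errors="ignore" drops the incomplete trailing character.
    trimmed = encoded[:max_bytes].decode("utf-8", errors="ignore")
    return trimmed, True
```

Returning a flag alongside the trimmed text lets the caller log or count oversized submissions instead of silently shortening them.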

As a follow-up, we are investing in better detection mechanisms. Our alerting systems failed to detect the issue in time mainly because the latency of only a small subset of pods was affected at any one time. Aggregated metrics were only slightly affected, and the failure rate remained fairly constant. We are adding alerting that is better able to detect degradation in individual pods. We are also auditing Orchestrator for similar cases of unrestricted payload sizes and either limiting them or investing in optimized handling.
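To show why per-pod checks can catch what aggregated metrics miss, here is a minimal sketch that compares each pod's median latency with a fleet-wide baseline. The function name degraded_pods, the factor of 3 and the sample numbers are all hypothetical; this is not a description of the alerting actually being added.

```python
from statistics import median
from typing import Dict, List


def degraded_pods(latencies_ms: Dict[str, List[float]], factor: float = 3.0) -> List[str]:
    """Return pods whose median latency exceeds factor times the fleet baseline."""
    # Median latency per pod, skipping pods with no samples.
    per_pod = {pod: median(samples) for pod, samples in latencies_ms.items() if samples}
    if not per_pod:
        return []
    # Baseline is the median of the per-pod medians, so one slow pod barely moves it.
    fleet_baseline = median(per_pod.values())
    return sorted(pod for pod, value in per_pod.items() if value > factor * fleet_baseline)


# Made-up latency samples in milliseconds; only pod-c is slow.
samples = {
    "pod-a": [100, 110, 105],
    "pod-b": [95, 100, 102],
    "pod-c": [900, 1200, 1000],
    "pod-d": [98, 101, 99],
}
flagged = degraded_pods(samples)
```

With these made-up samples the baseline stays close to the healthy pods, so only pod-c is flagged, even though a fleet-wide median would look almost normal.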

Posted Sep 19, 2023 - 19:25 UTC

Resolved

The performance issue has been fixed and response times have returned to normal.

Posted Sep 07, 2023 - 18:24 UTC

Monitoring

A fix has been implemented and we are monitoring the issue.

Posted Sep 07, 2023 - 18:08 UTC

Investigating

We are currently investigating this issue.

Posted Sep 07, 2023 - 15:40 UTC

This incident affected: Orchestrator.