Intermittent 503 errors on Test Manager

Incident Report for UiPath

Resolved

Between 2025-07-10 07:07 AM UTC and 2025-07-16 08:43 AM UTC, some API calls in the EU and US regions failed with a 503 error. Customers may have experienced this issue intermittently during that time, but refreshing the page or trying again resolved it.
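
For API clients, the "trying again" workaround can be automated. The following is a minimal sketch, not an official client: the function name `get_with_retry`, its parameters, and the backoff values are all assumptions chosen for illustration. It retries only on a 503 response and waits a little longer before each attempt.

```python
import time
import urllib.error
import urllib.request


def get_with_retry(url, attempts=4, base_delay=0.5,
                   opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying with exponential backoff when the server returns 503.

    All names and defaults here are hypothetical. `opener` and `sleep`
    are injectable so the behaviour can be exercised without a network.
    """
    for attempt in range(attempts):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only a 503 is treated as transient; give up on the last attempt.
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retrying is reasonable here because the 503s came from pods that were restarting, so a later attempt was likely to reach a healthy pod.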

The incident was caused by a defective code change in a recent deployment. The issue originated during the initialization of background jobs. Intermittently, a database call was made after the required service context had already been disposed, resulting in a "connection is closed" error. The resulting unhandled exception propagated and caused the pods to crash and restart. Requests routed to these restarting pods returned a 503 status.
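
The failure pattern described above can be illustrated with a small sketch. This is not the Test Manager code: `ServiceContext`, `init_job_unsafe`, and `init_job_safe` are hypothetical names, and the example fails deterministically, whereas the real issue depended on timing. It shows a job that holds a context its creator has already disposed, next to a variant where the job owns its context and handles the error instead of letting it propagate.

```python
class ConnectionClosedError(RuntimeError):
    """Raised when a query is attempted on a disposed context."""


class ServiceContext:
    """Hypothetical stand-in for a scoped service context that owns a DB connection."""

    def __init__(self):
        self.closed = False

    def query(self, sql):
        if self.closed:
            raise ConnectionClosedError("connection is closed")
        return f"ok: {sql}"

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Leaving the scope disposes the context and its connection.
        self.closed = True
        return False


def init_job_unsafe():
    """Failure pattern: the job captures a context that is disposed before it runs."""
    with ServiceContext() as ctx:
        job = lambda: ctx.query("SELECT 1")
    return job  # ctx has already been disposed at this point


def init_job_safe():
    """Safer pattern: the job opens its own context and handles failures itself."""
    def job():
        try:
            with ServiceContext() as ctx:
                return ctx.query("SELECT 1")
        except ConnectionClosedError as err:
            # Report the failure rather than crashing the whole process.
            return f"job failed: {err}"
    return job
```

Calling the job returned by `init_job_unsafe()` raises `ConnectionClosedError`, which in a long-running service would take the process down if nothing catches it. The job from `init_job_safe()` completes because the context lives exactly as long as the job needs it.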

Once the issue was identified, we isolated and fixed it, then deployed a hotfix across all regions.

To prevent similar incidents in the future, and to detect such issues earlier, we are implementing the following measures:
- Enhancing alerting mechanisms, including alerts for pod health and restarts across all environments
- Introducing a more rigorous code review process to catch such issues early in the development cycle
Posted Jul 30, 2025 - 07:52 UTC

Investigating

Posted Jul 30, 2025 - 07:51 UTC
This incident affected: Test Manager.