From January 26, 2026, 2:00 AM to 4:48 PM UTC (approximately 15 hours), some customers in the US region experienced intermittent errors and degraded performance when using UiPath services.
The issue originated in the Identity platform, a backend service that supports sign-in and access control for multiple UiPath services. This caused degraded performance across multiple dependent services, including Orchestrator, Cloud Robots (VM and Serverless), Maestro, Studio Web, IXP, and Communications Mining.
A faulty node image version from Microsoft prevented cluster autoscaling. Combined with this, a configuration setting that forces even distribution of nodes across availability zones in the Identity service (widely considered a best practice) made the service more susceptible to this failure.
As traffic grew, the service was unable to add enough capacity to keep up, which led to delays and failures when verifying user identity. This, in turn, affected other services that rely on identity verification.
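For illustration only, the kind of zone-balanced, autoscaled node pool setup described above typically looks like the following on AKS. This is a sketch with placeholder resource group, cluster, and node pool names, not the actual Identity service configuration, and the balance-similar-node-groups profile setting shown is one common way to force even distribution across zone-based node groups rather than the confirmed setting involved here.

    # Sketch only: placeholder names, not the actual UiPath configuration.
    # A node pool spread across all three availability zones, with autoscaling enabled.
    az aks nodepool add \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name identitypool \
      --zones 1 2 3 \
      --enable-cluster-autoscaler \
      --min-count 3 \
      --max-count 12

    # One common way to require even distribution across similar (zone-based)
    # node groups via the cluster autoscaler profile.
    az aks update \
      --resource-group my-rg \
      --name my-aks-cluster \
      --cluster-autoscaler-profile balance-similar-node-groups=true

When the autoscaler cannot bring up nodes in one zone (for example, because of a bad node image), a strict balancing requirement of this kind can also hold back scale-out in the healthy zones, which is one way such a setting can amplify a node image failure.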
The issue was identified through automated monitoring and customer reports indicating service disruption.
While the customer impact was detected quickly, the underlying scaling limitation took longer to identify because of its intermittent nature and the lack of a specific alert indicating that the service was unable to scale as expected. To the DRI (directly responsible individual) observing the incident, the symptoms looked like transient failures caused by traffic spikes.
To help prevent similar incidents in the future, we are taking the following actions:
These improvements are underway as part of our ongoing commitment to reliability.
Microsoft RCA:
Your support case 2601260050004104 is related to an ongoing outage in your region.
STATUS:
Mitigated 1/26/2026 9:00:36 PM UTC
What happened?
Between 08:00 UTC on 22 January 2026 and 00:00 UTC on 23 January 2026, a platform issue impacted Azure Kubernetes Service (AKS). Impacted customers experienced failures when attempting to start, stop, scale, or update their AKS clusters when using AKS VHD images 202510.19.1, 202510.29.0, 202511.07.0, 202511.12.0, 202511.20.0, 202512.06.0, 202512.18.0, 202601.07.0, and 202601.13.0. These failures stemmed from issues within the underlying system image used by certain AKS node pools, which could prevent normal cluster operations from completing successfully.
What do we know so far?
We determined that the issue occurred because some AKS clusters were using a specific node image that stopped working correctly after a period of time. When customers tried to perform actions such as scaling, starting, stopping, or updating their clusters, those operations failed because new or updated nodes could not complete their setup. Symptoms typically appear as repeated VMSS extension failures (for example, exit status 178). Existing workloads were often unaffected until such an operation was attempted, which is why the issue was not immediately visible. Customer action is required to apply the fix to the node pools.
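As a sketch (with placeholder resource group and cluster names), the node image version currently in use by each node pool can be listed with the Azure CLI and compared against the affected versions above:

    # List node pools with their current node image versions.
    az aks nodepool list \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --query "[].{name:name, nodeImageVersion:nodeImageVersion}" \
      --output table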
Actions Required:
To ensure mitigation, we strongly recommend that customers run a Node Image Upgrade on each affected node pool. If the upgrade succeeds, the node pool and cluster will recover normally. If the upgrade fails, replace the node pool entirely by adding a new node pool, migrating the workload to it, and then deleting the old node pool.
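A sketch of both mitigation paths with the Azure CLI, assuming placeholder resource group, cluster, and node pool names:

    # Path 1: upgrade the node image in place on the affected node pool.
    az aks nodepool upgrade \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name affectedpool \
      --node-image-only

    # Path 2 (if the upgrade fails): add a replacement pool, drain workloads
    # off the old pool, then delete it. The "agentpool" node label is set by AKS.
    az aks nodepool add \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name replacementpool \
      --node-count 3
    kubectl drain -l agentpool=affectedpool --ignore-daemonsets --delete-emptydir-data
    az aks nodepool delete \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name affectedpool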
If your snapshot references a node image >= 202510.19.1 and < 202512.18.0, you will need to create a new snapshot.
If your snapshot references a node image >= 202601.07.0 and was created before 00:00 UTC 23 January 2026, you will need to create a new snapshot.
Recreate each affected node pool using the newly created snapshot (delete + recreate).
This replacement path is the recommended and supported mitigation for snapshot-based node pools.
While a Node Image Upgrade may succeed when specifying the same snapshot ID, this path is not reliable for snapshot scenarios and should not be used as the primary mitigation.
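For snapshot-based node pools, a sketch of the delete-and-recreate path is shown below. All names and resource IDs are placeholders, and the source node pool used for the new snapshot is assumed to already be running a fixed node image.

    # Create a fresh snapshot from a node pool that already runs a fixed image.
    az aks nodepool snapshot create \
      --resource-group my-rg \
      --name fixed-snapshot \
      --nodepool-id /subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-aks-cluster/agentPools/healthypool

    # Delete the affected pool, then recreate it from the new snapshot.
    az aks nodepool delete \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name affectedpool
    az aks nodepool add \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name affectedpool \
      --snapshot-id /subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.ContainerService/snapshots/fixed-snapshot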
What happens next?