Communications Mining outage in EU region - available only in read-only mode
Incident Report for UiPath
Postmortem

Background

UiPath Communications Mining is deployed globally across multiple regions. Each region is independent of all others with independent deployments of databases and stateless services.

Multiple, different distributed database solutions are deployed in each region for different purposes. Historically, we used a strongly consistent, horizontally scalable document store for most ground-truth data storage, but for a variety of reasons, including operational concerns relevant to this outage, over the last year we have been migrating from this store to a distributed SQL database instead. Today, however, much of our data (~1B rows, ~5 TiB) is still stored in this legacy document store.

Customer Impact

  • Performance degradation and elevated error rates (HTTP 500 responses) for tenants in the EU region over the weekend, from Saturday, Mar 16, 10:02 UTC to Monday, Mar 18.
  • From Monday, Mar 18, 11:37 UTC, analytics and the UI were fully restored, but training, ingestion and streams continued to experience issues.
  • All functionality was fully restored on Wednesday, Mar 20 at 10:20 UTC.
  • 35 tenants in the EU were affected, and no tenants in other regions were impacted.

Root Cause

The outage was caused by an interaction of multiple issues. At its core, however, the incident was triggered by a manual scaling operation, started on Saturday, Mar 16, that exposed fundamental problems in our legacy document store:

  1. Explicit table re-sharding causes a temporary reduction in fault tolerance.
  2. Unexpected exhaustion of the memory-mapped region count caused multiple DB nodes to crash simultaneously.
  3. Kubernetes security controls (read-only filesystems and unprivileged containers) prevented in-place sysctl updates, requiring further DB restarts to raise the memory-mapped region limit (vm.max_map_count); a monitoring sketch follows this list.
  4. The crashes exposed flaws in our document store's failover mechanism, causing nodes to enter a "viral" state where the failover nodes themselves also entered a backfilling state.
  5. The eventual solution was to manually re-create a subset of the database tables and repopulate them with data from the old, now read-only tables.
  6. The new tables suffered from very slow secondary index reconstruction in our document store.
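
For context, vm.max_map_count is a per-process Linux limit on the number of memory-mapped regions; the database nodes crashed when they hit it. The following is a minimal sketch of the kind of check that can warn before the limit is reached, assuming direct access to /proc on the database nodes; the 80% threshold and the command-line interface are illustrative, not our production tooling.

    #!/usr/bin/env python3
    """Warn when a process approaches the kernel's memory-mapped region
    limit (vm.max_map_count). Illustrative sketch only."""
    import sys
    from pathlib import Path

    WARN_RATIO = 0.8  # alert once 80% of the limit is used (illustrative)

    def max_map_count() -> int:
        # Per-process limit on the number of memory-mapped regions (VMAs).
        return int(Path("/proc/sys/vm/max_map_count").read_text())

    def map_count(pid: int) -> int:
        # Each line of /proc/<pid>/maps describes one mapped region.
        with open(f"/proc/{pid}/maps") as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        pid = int(sys.argv[1])
        limit, used = max_map_count(), map_count(pid)
        print(f"pid {pid}: {used}/{limit} mapped regions ({used / limit:.0%})")
        if used > WARN_RATIO * limit:
            sys.exit(1)  # non-zero exit so an alerting wrapper can page on-call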

Detection

Due to increased usage in the EU region, we started scaling up our document store cluster on Jan 30. We added two new nodes and, over the next month and a half, re-sharded tables and moved them to the new nodes, working during weekends to avoid customer impact. Until the weekend of Mar 16, these operations all completed without a hitch.

As soon as we started re-sharding one of the only two remaining tables, at 10:02 UTC on Mar 16, two database nodes crashed simultaneously due to exhaustion of the memory-mapped region limit (vm.max_map_count). There was an on-call engineer actively monitoring the process at the time, but the issue was also picked up within minutes by our automated alerts.

Response

Since all our workloads run in read-only, unprivileged containers, this limit cannot be raised in place; the database nodes have to be restarted. The immediate focus was therefore bringing the cluster into a fully replicated state, so that we could then run a controlled restart to increase max_map_count on all the nodes.

Because of the hard crash during a re-sharding operation, the database started exhibiting unexpected behaviour and entered a degraded state: it would sporadically switch to read-only mode and refuse writes until a full integrity check had completed. Furthermore, the recovery process never seemed to fully complete. By Sunday evening, a sufficient number of replicas had become available.

Our automatic nightly backup process started at 23:00 UTC on Sunday, Mar 17, which added enough load to the database that it experienced another four node crashes between 01:00 and 06:00 UTC on Monday, Mar 18, again due to max_map_count exhaustion. The DB reverted to the same degraded state as above, with very lengthy automated "backfilling" processes that never completed and during which the DB entered read-only mode.

Due to the risk of further crashes before recovery completed, at 08:42 UTC on Monday, Mar 18 we made the decision to go ahead with the controlled restart to increase max_map_count, even though the database was not in a fully recovered state. This resulted in many additional hours of downtime, but in exchange gave us confidence that the recovery would complete without additional unexpected crashes.

By 11:37 UTC on Monday, Mar 18, all but two tables were fully available, allowing us to restore most functionality. The remaining two (very large) tables failed to recover through the automated process multiple times. We rapidly built and, after significant testing and iteration, deployed an emergency batch job at 04:20 UTC on Tuesday, Mar 19. This created fresh tables and copied all rows from the old tables into them, while maintaining availability of the rest of the product; a sketch of the approach follows below.
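
The batch job itself is internal; the sketch below only illustrates the general shape of the approach with a hypothetical document-store client, whose create_table, scan and bulk_insert methods are placeholders rather than our real API, and an illustrative batch size.

    """Illustrative sketch of the emergency copy job, not the code we ran.
    The document-store client and its create_table / scan / bulk_insert
    methods are hypothetical placeholders; the batch size is made up."""

    BATCH_SIZE = 1_000  # illustrative; the real value was tuned to limit DB load

    def copy_table(client, src: str, dst: str) -> int:
        """Copy every row from the old, read-only table into a fresh table."""
        client.create_table(dst)  # the new table starts empty, with fresh shards
        copied, batch = 0, []
        # Scan the read-only source table; reads stay available to the product.
        for row in client.scan(src):
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                client.bulk_insert(dst, batch)
                copied += len(batch)
                batch = []
        if batch:  # flush the final partial batch
            client.bulk_insert(dst, batch)
            copied += len(batch)
        return copied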

This copy process eventually completed at 07:30 UTC on Tuesday, Mar 19, after which we could start rebuilding the secondary indexes in the new tables. Reindexing ~200M rows in these tables took over 24h, finally completing at 10:20 UTC on Wednesday, Mar 20, and restoring all functionality.

Follow-ups

This is the most significant outage UiPath Communications Mining has ever experienced, and it was caused by one of our core data stores. We had been aware of issues with this document store and have been slowly migrating away from it over the last year.

The next steps are:

  1. Halt further scaling of the document store. The current number of replicas can handle current and forecasted load for at least another year, and we know the database is resilient in its steady state.
  2. Reduce the amount of data stored in the database by more aggressively garbage collecting old data and by moving larger objects into blob storage, keeping only references to them in the database (as sketched below).
  3. Reprioritise the migration away from this legacy store as critical, aiming to complete it in the next six months and starting with the database tables that caused the most problems during this incident.
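
As an illustration of the second follow-up, the sketch below shows the planned blob-offload pattern: large payloads go to blob storage under a content-addressed key, while the database row keeps only a small reference. The db and blob_store clients (put/get) and the 256 KiB cut-off are hypothetical placeholders, not our actual storage interfaces.

    """Sketch of the planned blob-offload pattern. The db and blob_store
    clients (put/get) are hypothetical placeholders, as is the size cut-off."""
    import hashlib
    import json

    LARGE_OBJECT_THRESHOLD = 256 * 1024  # 256 KiB; illustrative cut-off

    def write_document(db, blob_store, doc_id: str, payload: dict) -> None:
        """Keep small payloads inline; push large ones to blob storage."""
        raw = json.dumps(payload).encode()
        if len(raw) < LARGE_OBJECT_THRESHOLD:
            db.put(doc_id, {"inline": payload})  # small: stays in the DB row
            return
        blob_key = hashlib.sha256(raw).hexdigest()  # content-addressed blob key
        blob_store.put(blob_key, raw)               # large body lives in blob storage
        db.put(doc_id, {"blob_ref": blob_key, "size": len(raw)})  # DB keeps a reference

    def read_document(db, blob_store, doc_id: str) -> dict:
        """Resolve a document, fetching from blob storage when referenced."""
        row = db.get(doc_id)
        if "inline" in row:
            return row["inline"]
        return json.loads(blob_store.get(row["blob_ref"]))
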
Posted Apr 09, 2024 - 16:06 UTC

Resolved
We've validated all functionality and the incident is now fully resolved.
Posted Mar 20, 2024 - 10:33 UTC
Monitoring
The index restoration is complete. We are conducting tests to ensure that all functionality has been successfully restored.
Posted Mar 20, 2024 - 10:22 UTC
Update
The database index reconstruction process is taking longer than anticipated. We now expect it to require approximately two additional hours to complete. We will provide updates as we gain more insight into the situation.
Posted Mar 20, 2024 - 10:14 UTC
Update
Database index reconstruction is significantly slower than previously expected. Now expecting full availability at midnight UTC tonight.
Posted Mar 19, 2024 - 17:33 UTC
Update
The issue is partially mitigated; we expect a complete recovery to take around 3 more hours.
Posted Mar 19, 2024 - 07:47 UTC
Update
Email ingestion functionality restored, new data can be added. Streams functionality is still down.
Posted Mar 18, 2024 - 18:23 UTC
Update
The database is currently running through its backfill process. We estimate it may take multiple hours to fully catch up.
Posted Mar 18, 2024 - 16:12 UTC
Update
The issue is partially mitigated:

1. Analytics functionality restored
2. Most of the UI now operates correctly

Ingestion and streams functionality are still down

Mitigation is in progress
Posted Mar 18, 2024 - 11:37 UTC
Identified
The issue has been identified and mitigation is in progress
Posted Mar 18, 2024 - 10:22 UTC
Investigating
We are experiencing an issue with Communications Mining and an investigation is in progress

Currently, the service is only available in read-only mode
Posted Mar 18, 2024 - 09:18 UTC
This incident affected: Communications Mining.