UiPath Communications Mining is deployed globally across multiple regions. Each region is independent of the others, with its own deployments of databases and stateless services.
Several different distributed database solutions are deployed in each region for different purposes. Historically, we used a strongly consistent, horizontally scalable document store for most ground-truth data storage. For a variety of reasons, including operational concerns relevant to this outage, over the last year we have been migrating away from this store to a distributed SQL database. Today, however, much of our data (~1B rows, ~5 TiB) is still stored in this legacy document store.
The outage was caused by an interaction of multiple issues. At its core, however, the incident was triggered by a manual scaling operation on Saturday, Mar 16 that exposed fundamental problems in our legacy document store:
Due to increased usage in the EU region, we started scaling up our document store cluster on Jan 30. We added two new nodes and, over the following month and a half, re-sharded and moved tables to the new nodes, working during weekends to avoid customer impact. Until the weekend of Mar 16, these operations all completed without a hitch; a sketch of the approach follows below.
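For illustration, this is roughly the shape of the incremental rebalancing approach, written against a hypothetical admin API (the `admin` client, `reshard`, and `wait_until_replicated` are not our actual tooling). Moving one table at a time keeps each step small and observable, rather than rebalancing the whole cluster at once.

```python
def rebalance(admin, tables: list[str], shards: int, replicas: int) -> None:
    """Move tables onto the expanded cluster one at a time."""
    for table in tables:
        admin.reshard(table, shards=shards, replicas=replicas)  # hypothetical API
        admin.wait_until_replicated(table)  # block until data is fully copied
```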
As soon as we started re-sharding one of the only two remaining tables at 10:02 UTC on Mar 16, two database nodes crashed simultaneously due to exhaustion of memory map areas (the `vm.max_map_count` limit). An on-call engineer was actively monitoring the process at the time, and the issue was also picked up within minutes by our automated alerts.
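On Linux, `vm.max_map_count` caps the number of memory-mapped regions a single process may hold, and a process's current mappings can be counted from procfs. A minimal sketch of the kind of check our alerting performs (the threshold and PID are illustrative assumptions, not our actual monitoring code):

```python
def max_map_count() -> int:
    # The per-process limit, as configured on the host.
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())

def map_count(pid: int) -> int:
    # Each line in /proc/<pid>/maps is one memory mapping.
    with open(f"/proc/{pid}/maps") as f:
        return sum(1 for _ in f)

def check(pid: int, threshold: float = 0.9) -> None:
    limit = max_map_count()
    used = map_count(pid)
    if used > threshold * limit:
        print(f"WARNING: pid {pid} uses {used}/{limit} memory mappings")

check(1234)  # hypothetical database process PID
```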
Since all our workloads run in read-only, unprivileged containers, increasing this limit was impossible without restarting all the nodes. The focus was therefore on bringing the cluster into a fully replicated state, so we could then run a controlled restart to increase `max_map_count` on all the nodes.
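A minimal sketch of what such a controlled restart looks like, assuming passwordless SSH to the database hosts and a hypothetical health endpoint (the hostnames, limit value, service name, and port are all illustrative). The key point is that `vm.max_map_count` is a host-level sysctl, so it must be raised on the node itself, not from inside the unprivileged container:

```python
import subprocess
import time
import urllib.request

NODES = ["db-node-1", "db-node-2", "db-node-3"]  # placeholder hostnames
NEW_LIMIT = 1048576                              # illustrative value

def run_on(node: str, cmd: str) -> None:
    # Assumes passwordless SSH and sudo on the database hosts.
    subprocess.run(["ssh", node, cmd], check=True)

def node_healthy(node: str) -> bool:
    # Hypothetical health endpoint; substitute the store's real check.
    try:
        with urllib.request.urlopen(f"http://{node}:8080/health", timeout=5) as r:
            return r.status == 200
    except OSError:
        return False

for node in NODES:
    # Raise the sysctl on the host, then restart the database process.
    run_on(node, f"sudo sysctl -w vm.max_map_count={NEW_LIMIT}")
    run_on(node, "sudo systemctl restart document-store")  # hypothetical unit name
    while not node_healthy(node):
        time.sleep(10)  # wait for the node to rejoin before moving on
```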
Because of the hard crash during a re-sharding operation, the database entered a degraded state and exhibited unexpected behaviour: it would sporadically switch into read-only mode and refuse writes until a full integrity check had run. Furthermore, the recovery process never seemed to fully complete. By Sunday evening, a sufficient number of replicas had become available.
Our automatic nightly backup process started at 23:00 UTC on Sunday, Mar 17, adding enough load to the database that it experienced four more node crashes between 01:00 and 06:00 UTC on Monday, Mar 18, again due to `max_map_count` exhaustion. The database reverted to the same degraded state as above, with very lengthy automated "backfilling" processes that never completed and during which the database remained in read-only mode.
Due to the risk of further crashes before recovery was complete, at 08:42 UTC on Monday, Mar 18 we made the decision to go ahead with the controlled restart to increase `max_map_count`, even though the database was not in a fully recovered state. This resulted in many additional hours of downtime, but in exchange gave us confidence that the restart would complete successfully without further unexpected crashes.
By 11:37 UTC on Monday, Mar 18, all but two tables were fully available, allowing us to restore most functionality. The remaining two (very large) tables repeatedly failed to recover through the automated process. We rapidly built and, after significant testing and iteration, deployed an emergency batch job at 04:20 UTC on Tuesday, Mar 19. This job created fresh tables and copied all rows into them, while maintaining availability of the rest of the product.
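The shape of that copy job, sketched against a hypothetical client API (`create_table`, `read_range`, and `insert` are stand-ins, and the batch size is illustrative). Copying in primary-key order with idempotent upserts means the job can be safely resumed if interrupted:

```python
BATCH = 1000  # illustrative batch size

def copy_table(client, src: str, dst: str) -> None:
    """Copy all rows from a damaged table into a fresh one, in batches."""
    client.create_table(dst, like=src)           # hypothetical API
    last_key = None
    while True:
        rows = client.read_range(src, after=last_key, limit=BATCH)
        if not rows:
            break                                # reached the end of the table
        client.insert(dst, rows, upsert=True)    # idempotent on retry
        last_key = rows[-1]["id"]                # resume point for the next batch
```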
This process eventually completed at 07:30 UTC on Tuesday, after which we could start rebuilding the secondary indexes in the new tables. Reindexing the ~200M rows in these tables took over 24 hours, finally completing at 10:20 UTC on Wednesday, Mar 20, and restoring all functionality.
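Rebuilding indexes on tables of this size is long-running, so the final step amounts to kicking off the builds and polling until they complete, roughly like this (again with a hypothetical client API):

```python
import time

def rebuild_indexes(client, table: str, indexes: list[str]) -> None:
    """Recreate secondary indexes on a fresh table and wait for the builds."""
    for name in indexes:
        client.create_index(table, name)         # hypothetical API
    while not all(client.index_ready(table, i) for i in indexes):
        time.sleep(60)                           # poll until every build completes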
This is the most significant outage UiPath Communications Mining has ever experienced, and it was caused by one of our core data stores. We had been aware of issues with this document store, and had been gradually migrating away from it over the last year.
The next steps are: