On Monday, 18 September, RIPE Atlas and RIPEstat suffered significantly increased delays in processing incoming data, caused by problems in our storage environment. We believe several factors contributed to these problems, but no precise root cause has been identified. By taking several steps, we have reduced the delays to normal levels. The situation is now stable, but we continue to monitor it and are working on an alternative solution to prevent this from happening again.
Around 09:14 (UTC), RIPE Atlas result processing stopped for about ten minutes. Our monitoring detected this at 09:20 as a failing RegionServer, which recovered at 09:24. While we were still recovering from this, the incident recurred around 09:31 and lasted one hour. At 11:19, the problem returned and persisted for almost a full day (with brief, intermittent runs in which some results were processed), causing the minimum delay for data in the queue to reach 24.5 hours.
In detail, for RIPE Atlas and RIS/RIPEstat, we run a Hadoop cluster with HBase. HBase is a “NoSQL” database, a key-value store that holds all the data. To scale, HBase divides its tables into so-called regions, which are served by RegionServers: machines that each host a number of regions from the tables in the HBase database. When a RegionServer fails, its regions are quickly redistributed over the other RegionServers in the cluster, and service is restored.
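As an illustration, one quick way to see how regions are spread across a cluster is via the HBase client’s cluster metrics. The sketch below is a minimal example, assuming an HBase 2.x Java client and a cluster configuration (hbase-site.xml) on the classpath; it is not tooling from our environment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Minimal sketch: print how many regions each RegionServer currently hosts.
// Assumes an HBase 2.x client and hbase-site.xml on the classpath.
public class RegionDistribution {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            ClusterMetrics metrics = admin.getClusterMetrics();
            // One entry per live RegionServer; each value carries per-region metrics.
            metrics.getLiveServerMetrics().forEach((server, serverMetrics) ->
                    System.out.printf("%s hosts %d regions%n",
                            server.getServerName(),
                            serverMetrics.getRegionMetrics().size()));
        }
    }
}
```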
In this incident, we saw a cascading failure that took down one RegionServer after another. It was caused by a rather large region that was migrated to one of our smaller RegionServers (our cluster consists of several generations of hardware, where the oldest machines have less memory, CPU and storage than the newer ones). A smaller RegionServer is unable to handle such a large region and fails while trying to do so. The region is then reassigned to another RegionServer, also a small one, and the process repeats itself.
Because the region in question receives (part of) the incoming data, processing stalls until the region is available again. But then, when the region lands on a larger server, there is a backlog of data to be processed, and the load on this region combines with the load on the other regions hosted on the same RegionServer to reach problematic levels yet again.
To mitigate the issue, we took several approaches. We stopped the consumption of the RIPE Atlas data to ease the load on the cluster so it could recover from the cascading failures. This increased the backlog (and the end-user delay), but the cluster did not stabilise. We then added capacity to the cluster; this helped because it added much-needed resources (memory, CPU, storage) and the incoming data could be distributed more evenly over the consumers that write to HBase. We also increased the memory allocation for the RegionServers, so that they had more resources to do their work. Finally, we manually moved regions to specific RegionServers so they wouldn’t trigger the cascading failure we’d seen (see the sketch below). After that, it still took a few hours to consume the backlog of results that had accumulated during the outage.
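For readers curious what such a manual move looks like, the sketch below pins a region to a chosen RegionServer and switches off the automatic balancer so it doesn’t immediately undo the move. It assumes a recent HBase 2.x Java client (the exact move() signature varies slightly between versions), and the encoded region name and server name are placeholders, not values from our cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch: pin a (large) region to a specific RegionServer by hand.
// Assumes a recent HBase 2.x client; the encoded region name and target
// server below are placeholders, not values from the actual incident.
public class PinRegion {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Switch off the automatic balancer so it doesn't undo the manual move.
            admin.balancerSwitch(false, true);

            // The encoded region name is the hash shown in the HBase master UI;
            // the target is given as "host,port,startcode". Both are placeholders.
            byte[] encodedRegionName = Bytes.toBytes("0123456789abcdef0123456789abcdef");
            ServerName target = ServerName.valueOf("big-regionserver.example.net,16020,1695000000000");

            // Ask the master to reassign the region to the chosen RegionServer.
            admin.move(encodedRegionName, target);
        }
    }
}
```

The memory increase itself is a configuration change rather than an API call: RegionServer heap is typically raised via the JVM options in hbase-env.sh (for example, HBASE_REGIONSERVER_OPTS) and picked up with a rolling restart of the RegionServers.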
To prevent this issue from recurring, we want to free up some more hardware resources in the short term and add them to the cluster. In the long term, we’re looking at re-architecting the solution, since the current environment is reaching its limits (and end of life) - something we hope to share more news about soon.
Another key point to improve is our communication with our users. We’ll look into ways to ensure more timely updates are available to the community.