On Monday, 18 September, RIPE Atlas and RIPEstat suffered significantly increased delays in processing incoming data, caused by problems in our storage environment. We believe several factors contributed to these problems, but no precise root cause has been identified. By taking several steps, we have reduced the delays to normal levels. The situation is now stable, but we continue to monitor it and are working on an alternative solution to prevent this from happening again.
Around 09:14 (UTC), RIPE Atlas result processing stopped for about ten minutes. Our monitoring detected this at 09:20 as a failing RegionServer, which recovered at 09:24. While we were still recovering from this, the incident recurred around 09:31 and lasted one hour. At 11:19, the problem returned and persisted for almost a full day (with brief, intermittent runs in which some results were processed), causing the minimum delay for data in the queue to reach 24.5 hours.
In detail, for RIPE Atlas and RIS/RIPEstat, we run a Hadoop cluster with HBase. HBase is a “NoSQL” database, a key-value store that holds all the data. To scale, HBase divides its tables into so-called regions, which are served by RegionServers: machines that each host a number of regions from the tables in the HBase database. When a RegionServer fails, its regions are quickly redistributed over the other RegionServers in the cluster, and service is restored.
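As an illustration, one quick way to see how regions are spread across a cluster is via the HBase client’s cluster metrics. The sketch below is a minimal example, assuming an HBase 2.x Java client and a cluster configuration (hbase-site.xml) on the classpath; it is not tooling from our environment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Minimal sketch: print how many regions each RegionServer currently hosts.
// Assumes an HBase 2.x client and hbase-site.xml on the classpath.
public class RegionDistribution {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            ClusterMetrics metrics = admin.getClusterMetrics();
            // One entry per live RegionServer; each value carries per-region metrics.
            metrics.getLiveServerMetrics().forEach((server, serverMetrics) ->
                    System.out.printf("%s hosts %d regions%n",
                            server.getServerName(),
                            serverMetrics.getRegionMetrics().size()));
        }
    }
}
```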
In this incident, we saw a cascading failure that took down one RegionServer after another. It was caused by a rather large region that was migrated to one of our smaller RegionServers (our cluster consists of several generations of hardware, where the oldest machines have less memory, CPU and storage than the newer ones). A smaller RegionServer is unable to handle such a large region and fails while trying to do so. The region is then reassigned to another RegionServer, also a small one, and the process repeats itself.
Because the region in question receives (part of) the incoming data, processing stalls until the region is available again. But then, when the region lands on a larger server, there is a backlog of data to be processed, and the load on this region combines with the load on the other regions hosted on the same RegionServer to reach problematic levels yet again.
To mitigate the issue, we took several approaches. We stopped the consumption of the RIPE Atlas data to ease the load on the cluster so it could recover from the cascading failures. This increased the backlog (and the end-user delay), but the cluster did not stabilise. We then added capacity to the cluster; this helped because it added much-needed resources (memory, CPU, storage) and the incoming data could be distributed more evenly over the consumers that write to HBase. We also increased the memory allocation for the RegionServers, so that they had more resources to do their work. Finally, we manually moved regions to specific RegionServers so they wouldn’t trigger the cascading failure we’d seen (see the sketch below). After that, it still took a few hours to consume the backlog of results that had accumulated during the outage.
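For readers curious what such a manual move looks like, the sketch below pins a region to a chosen RegionServer and switches off the automatic balancer so it doesn’t immediately undo the move. It assumes a recent HBase 2.x Java client (the exact move() signature varies slightly between versions), and the encoded region name and server name are placeholders, not values from our cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch: pin a (large) region to a specific RegionServer by hand.
// Assumes a recent HBase 2.x client; the encoded region name and target
// server below are placeholders, not values from the actual incident.
public class PinRegion {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Switch off the automatic balancer so it doesn't undo the manual move.
            admin.balancerSwitch(false, true);

            // The encoded region name is the hash shown in the HBase master UI;
            // the target is given as "host,port,startcode". Both are placeholders.
            byte[] encodedRegionName = Bytes.toBytes("0123456789abcdef0123456789abcdef");
            ServerName target = ServerName.valueOf("big-regionserver.example.net,16020,1695000000000");

            // Ask the master to reassign the region to the chosen RegionServer.
            admin.move(encodedRegionName, target);
        }
    }
}
```

The memory increase itself is a configuration change rather than an API call: RegionServer heap is typically raised via the JVM options in hbase-env.sh (for example, HBASE_REGIONSERVER_OPTS) and picked up with a rolling restart of the RegionServers.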
To prevent this issue from recurring, we want to free up some more hardware resources in the short term and add them to the cluster. In the long term, we’re looking at re-architecting the solution, since the current environment is reaching its limits (and end of life) - something we hope to share more news about soon.
Another key point to improve is our communication with our users. We’ll look into ways to ensure more timely updates are available to the community.