To serve our worldwide set of active Internet measurement vantage points (“probes”), RIPE Atlas relies on external service providers to host some of its infrastructure components (“controllers”). A miscommunication between the RIPE NCC and one of these service providers (“provider”) resulted in the provider suspending its service to us [1].
The controller infrastructure is designed to work around events where parts of the infrastructure become unavailable by re-routing the probes to other controllers. In this case, however, all the controllers hosted by this provider became unavailable at once, affecting about 50% of our total probe handling capacity. On top of this, most of the RIPE Atlas anchors, which generally require more resources to handle, were also affected. Even though the re-routing algorithm did its best to juggle probes, the remaining capacity was ultimately not enough to handle all of them, and the remaining controllers became overloaded.
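To illustrate why re-routing alone cannot absorb the loss of half the capacity, here is a minimal sketch of a greedy, capacity-aware reassignment. It is not the actual RIPE Atlas re-routing algorithm; the `Controller` class, the `reassign` function, the capacity figures, and the per-probe costs are all hypothetical and chosen only for illustration. Unlike the real system, this sketch leaves probes unplaced instead of overloading a controller.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    # Hypothetical model: capacity and load are abstract units, not real figures.
    name: str
    capacity: int
    available: bool = True
    load: int = 0
    probes: list = field(default_factory=list)


def reassign(probes, controllers):
    """Greedily place (probe_id, cost) pairs on available controllers.

    Returns the ids of probes that could not be placed anywhere.
    """
    unplaced = []
    # Place the most expensive probes (for example, anchors) first.
    for probe_id, cost in sorted(probes, key=lambda p: -p[1]):
        candidates = [
            c for c in controllers
            if c.available and c.capacity - c.load >= cost
        ]
        if not candidates:
            unplaced.append(probe_id)
            continue
        # Pick the controller with the most free capacity.
        target = max(candidates, key=lambda c: c.capacity - c.load)
        target.load += cost
        target.probes.append(probe_id)
    return unplaced


# Four equal controllers; two of them (half the capacity) become unavailable.
controllers = [Controller(f"ctr-{i}", capacity=10) for i in range(4)]
controllers[2].available = False
controllers[3].available = False

# Two anchors (assumed cost 4 each) and twenty regular probes (assumed cost 1 each).
probes = [("anchor-0", 4), ("anchor-1", 4)] + [(f"probe-{i}", 1) for i in range(20)]

unplaced = reassign(probes, controllers)
```

With these made-up numbers the total demand is 28 units against 20 units of remaining capacity, so some probes cannot be placed no matter how they are shuffled between the surviving controllers.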
Once we became aware of the root cause, we quickly resolved the issue between the RIPE NCC and the provider, and the provider restored service soon after. This still left the system in an unbalanced state, with some components struggling to keep up with their tasks (handling probes, processing backlogs, and storing the collected results). We are actively working on resolving these issues.
Detailed timeline of events:
Going forward, we’ll investigate how we can improve our internal procedures to prevent similar cases in the future. We have also identified a number of possible improvements related to infrastructure bottlenecks, particularly those exposed by large-scale outages such as this one, and will make plans to address them.
[1] Footnote: This provider is only used for RIPE Atlas, so no other services were affected.