Issues with RIPE Atlas Infrastructure
Incident Report for RIPE NCC
Postmortem

To serve our worldwide set of active Internet measurement vantage points (“probes”), RIPE Atlas relies on external service providers to host some of its infrastructure components (“controllers”). A miscommunication between the RIPE NCC and one of these service providers (the “provider”) resulted in the provider suspending its service to us [1].

The controller infrastructure is designed to work around events where parts of the infrastructure become unavailable by re-routing the probes to other controllers. However, in this case, all the controllers hosted on this provider became unavailable, which affected about 50% of our total probe-handling capacity. On top of this, most of the RIPE Atlas anchors, which in general require more resources to handle, were also affected. Even though the re-routing algorithm did its best to redistribute probes, the remaining capacity was ultimately not enough to handle all of them, and the remaining controllers became overloaded.
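
To illustrate the failure mode described above, here is a minimal sketch of this kind of re-routing, assuming a simple “largest headroom first” placement; it is not the actual RIPE Atlas scheduler, and the controller names, capacities and probe counts are purely illustrative assumptions. It shows why losing roughly half of the total capacity at once leaves probes that cannot be placed anywhere:

```python
from dataclasses import dataclass

@dataclass
class Controller:
    name: str
    capacity: int    # how many probes this controller can handle
    probes: int = 0  # probes currently assigned to it

    @property
    def headroom(self) -> int:
        return self.capacity - self.probes

def reroute(failed: list[Controller], healthy: list[Controller]) -> int:
    """Move probes off failed controllers onto healthy ones, largest headroom
    first; return how many probes could not be placed (the overload)."""
    unplaced = sum(c.probes for c in failed)
    for c in failed:
        c.probes = 0
    for c in sorted(healthy, key=lambda c: c.headroom, reverse=True):
        take = min(c.headroom, unplaced)
        c.probes += take
        unplaced -= take
    return unplaced

# Illustrative numbers only: four equally sized controllers, nearly full,
# with the two hosted at one provider suspended at the same time.
controllers = [Controller(f"ctrl-{i}", capacity=3000, probes=2500) for i in range(4)]
failed, healthy = controllers[:2], controllers[2:]
print(reroute(failed, healthy))  # 4000 probes are left with nowhere to connect
```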

Once we were aware of the root cause, we quickly resolved the issue between the RIPE NCC and the provider, and the provider’s service was restored soon after. This still left the system in an unbalanced state, with some components struggling to keep up with their tasks (handling probes, processing backlogs, and storing the collected results). We are actively working on resolving these issues.

Detailed timeline of events:

  • 2024-01-30, 13:15 CET: the controllers hosted at the provider are suspended and become unavailable for service
  • 2024-01-30, 14:15 CET: first signs of overload are detected; investigation starts
  • 2024-01-30, 14:23 CET: first public mention of possible issues on the mailing list
  • 2024-01-30, 15:25 CET: service is restored by the hosting provider, and recovery operations begin
  • 2024-01-30, 16:00-23:00 CET: service is restored to most anchors and probes
  • 2024-01-31 (ongoing): further rebalancing, stabilisation and backlog processing
  • 2024-01-31, approximately 15:15 CET: rebalancing is done, probe handling is mostly stable, backlog processing is ongoing
  • 2024-02-01, 12:00 CET: the issue is resolved

Going forward, we’ll investigate how we can improve our internal procedures to prevent similar cases in the future. We have also identified a number of infrastructure bottlenecks that become particularly problematic during large-scale outages such as this one, and we will make plans to address them.

[1] Footnote: This provider is only used for RIPE Atlas; therefore, no other services were affected.

Posted Feb 13, 2024 - 16:06 UTC

Resolved
This incident has been resolved.
Posted Feb 01, 2024 - 10:51 UTC
Update
The backlog has been processed, and this incident has been resolved.
Posted Feb 01, 2024 - 10:49 UTC
Update
The probes’ connections have been stabilised, and the system continues to process the backlog.
Posted Jan 31, 2024 - 17:05 UTC
Monitoring
We have identified and fixed the underlying issue and are now recovering. Returning to normal status, including processing the backlogs, will likely take some time. We are monitoring the situation.
Posted Jan 30, 2024 - 16:02 UTC
Update
We are continuing to investigate this issue.
Posted Jan 30, 2024 - 14:27 UTC
Investigating
Our RIPE Atlas infrastructure has been having issues since approximately 13:10 CET.

We are working to find a solution.
Posted Jan 30, 2024 - 14:26 UTC
This incident affected: Non-Critical Services (RIPE Atlas).