Issues with RIPE Atlas Infrastructure
Incident Report for RIPE NCC
Postmortem

To serve our worldwide set of active Internet measurement vantage points (“probes”), RIPE Atlas relies on external service providers to host some of its infrastructure components (“controllers”). A miscommunication between the RIPE NCC and one of these service providers (the “provider”) resulted in the provider suspending its service to us [1].

The controller infrastructure is designed to work around events where parts of the infrastructure become unavailable by re-routing the probes to other controllers. However, in this case, all the controllers hosted on this provider became unavailable, which affected about 50% of our total probe-handling capacity. On top of this, most of the RIPE Atlas anchors, which in general require more resources to handle, were also affected. Even though the re-routing algorithm did its best to redistribute probes, the remaining capacity was ultimately not enough to handle all of them, and the remaining controllers became overloaded.
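
To illustrate the failure mode described above, here is a minimal sketch of this kind of re-routing, assuming a simple “largest headroom first” placement; it is not the actual RIPE Atlas scheduler, and the controller names, capacities and probe counts are purely illustrative assumptions. It shows why losing roughly half of the total capacity at once leaves probes that cannot be placed anywhere:

```python
from dataclasses import dataclass

@dataclass
class Controller:
    name: str
    capacity: int    # how many probes this controller can handle
    probes: int = 0  # probes currently assigned to it

    @property
    def headroom(self) -> int:
        return self.capacity - self.probes

def reroute(failed: list[Controller], healthy: list[Controller]) -> int:
    """Move probes off failed controllers onto healthy ones, largest headroom
    first; return how many probes could not be placed (the overload)."""
    unplaced = sum(c.probes for c in failed)
    for c in failed:
        c.probes = 0
    for c in sorted(healthy, key=lambda c: c.headroom, reverse=True):
        take = min(c.headroom, unplaced)
        c.probes += take
        unplaced -= take
    return unplaced

# Illustrative numbers only: four equally sized controllers, nearly full,
# with the two hosted at one provider suspended at the same time.
controllers = [Controller(f"ctrl-{i}", capacity=3000, probes=2500) for i in range(4)]
failed, healthy = controllers[:2], controllers[2:]
print(reroute(failed, healthy))  # 4000 probes are left with nowhere to connect
```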

Once we were aware of the root cause, we quickly resolved the issue between the RIPE NCC and the provider, and the provider’s service was restored soon after. This still left the system in an unbalanced state, with some components struggling to keep up with their tasks (handling probes, processing backlogs, and storing the collected results). We are actively working on resolving these issues.

Detailed timeline of events:

  • 2024-01-30, 13:15 CET: the controllers hosted at the provider are suspended and become unavailable for service
  • 2024-01-30, 14:15 CET: first signs of overload are detected; investigation starts
  • 2024-01-30, 14:23 CET: first public mention of possible issues on the mailing list
  • 2024-01-30, 15:25 CET: service is restored by the hosting provider, and recovery operations begin
  • 2024-01-30, 16:00-23:00 CET: service is restored to most anchors and probes
  • 2024-01-31 (ongoing): further rebalancing, stabilisation and backlog processing
  • 2024-01-31, approximately 15:15 CET: rebalancing is done, probe handling is mostly stable, backlog processing is ongoing
  • 2024-02-01, 12:00 CET: the issue is resolved

Going forward, we’ll investigate how we can improve our internal procedures to prevent similar cases in the future. We have also identified a number of infrastructure bottlenecks that become particularly problematic during large-scale outages such as this one, and we will make plans to address them.

[1] Footnote: This provider is only used for RIPE Atlas; therefore, no other services were affected.

Posted Feb 13, 2024 - 16:06 UTC

Resolved
This incident has been resolved.
Posted Feb 01, 2024 - 10:51 UTC
Update
The backlog has been processed, and this incident has been resolved.
Posted Feb 01, 2024 - 10:49 UTC
Update
The probes’ connections have been stabilised, and the system continues to process the backlog.
Posted Jan 31, 2024 - 17:05 UTC
Monitoring
We have identified and fixed the underlying issue and are now recovering. Returning to normal status, including processing the backlogs, will likely take some time. We are monitoring the situation.
Posted Jan 30, 2024 - 16:02 UTC
Update
We are continuing to investigate this issue.
Posted Jan 30, 2024 - 14:27 UTC
Investigating
Our RIPE Atlas infrastructure has been having issues since approximately 13:10 CET.

We are working to find a solution.
Posted Jan 30, 2024 - 14:26 UTC
This incident affected: Non-Critical Services (RIPE Atlas).