RIPEstat is unavailable

Incident Report for RIPE NCC

Postmortem

The RIPEstat incident on March 20 was caused by a database cluster whose nodes became stuck. This caused the frontend application to run out of Python workers. The problem initially affected only the looking-glass endpoints. To restart the database, we had to stop the application component ("thrift-api") that communicates with it. Because the thrift-api is also used by other endpoints, many endpoints were unavailable during that maintenance.

Timeline:

10:05-10:45 UTC: Initial (full) unavailability due to the first failed database node and the load-balancer/health-check interaction.
11:05-12:51 UTC: More database nodes fail; the looking-glass API is disabled at 12:51 UTC.
16:00-18:30 UTC: Multiple APIs unavailable during maintenance. Database cluster is restarted.
19:30 UTC: Looking glass backend has recovered.

Below, we will explain our current understanding of the issue:

RIPEstat uses a distributed, in-memory database for the looking glass. For unknown reasons, a query on this database deadlocked with another query. This deadlock caused the number of active queries from the thrift-api to increase sharply, and the database process on one node hit the open file descriptor limit (every connection uses a file descriptor). The issue with the first worker blocked queries to this node and also prevented all writes to the cluster.
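To illustrate why a surge in connections runs into the descriptor limit, the following is a minimal Python sketch, not taken from the RIPEstat code base. It reads the limit for the current process with the standard `resource` module (Unix only) and computes how much headroom is left. The function names and the numbers in the example are hypothetical.

```python
import resource


def fd_headroom(open_fds: int, soft_limit: int) -> float:
    """Return the fraction of the descriptor limit still free (0.0 to 1.0)."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return max(0.0, 1.0 - open_fds / soft_limit)


def current_nofile_limits() -> tuple[int, int]:
    """Return (soft, hard) RLIMIT_NOFILE for this process (Unix only)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = current_nofile_limits()
    print(f"soft={soft} hard={hard}")
    # Hypothetical numbers: 1000 client connections against a limit of 1024.
    # Each connection holds one descriptor, so very little headroom remains.
    print(f"headroom: {fd_headroom(1000, 1024):.1%}")
```

Alerting on low headroom, rather than on the limit being reached, would leave room for management connections and snapshot files.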

Calls from the thrift-api to the in-memory database blocked and took time to time out. In turn, this caused Python processes for the frontend application to be stuck and sometimes be unable to process requests.
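One common way to bound how long a blocked backend call can hold a worker is to derive each call's timeout from the remaining time budget of the incoming request. The sketch below is an illustration under assumptions: the `Deadline` class, the 0.5 second margin, and the cap are hypothetical and do not describe the thrift-api.

```python
import time


class Deadline:
    """Tracks how much of a request's time budget is left."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left before the request as a whole should give up."""
        return max(0.0, self._expires_at - self._clock())

    def rpc_timeout(self, cap_seconds: float, margin_seconds: float = 0.5) -> float:
        """Timeout for the next backend call: capped, and always leaving a margin
        so the worker can still send an error response before the HTTP timeout."""
        available = self.remaining() - margin_seconds
        if available <= 0:
            raise TimeoutError("request budget exhausted before backend call")
        return min(cap_seconds, available)
```

With a 10 second request budget and a 3 second cap, a backend call made early in the request gets 3 seconds, while a call made 8 seconds in gets only 1.5 seconds.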

This sometimes caused health-check calls to fail, which resulted in the load-balancer disabling workers. This increased the load on the remaining workers, degrading the situation. We changed the load-balancer check and restarted the database node that was in the deadlocked state.

The application initially recovered, until a second database node hit a deadlock. At this point, more looking-glass API calls arrived than were timing out. This resulted in almost complete downtime until we disabled the looking-glass API (12:51 UTC).
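The saturation described here follows from Little's law: the average number of occupied workers equals the arrival rate multiplied by the time each request holds a worker. The short sketch below works through this with purely hypothetical numbers; the actual request rates, timeouts, and worker counts are not stated in this report.

```python
def busy_workers(arrivals_per_second: float, seconds_held: float) -> float:
    """Little's law: average number of workers occupied in steady state."""
    return arrivals_per_second * seconds_held


def is_saturated(arrivals_per_second: float, seconds_held: float, worker_count: int) -> bool:
    """True when blocked requests would occupy every available worker."""
    return busy_workers(arrivals_per_second, seconds_held) >= worker_count


if __name__ == "__main__":
    # Hypothetical: 5 requests/s, each blocking for a 30 s timeout, 64 workers.
    print(busy_workers(5, 30), is_saturated(5, 30, 64))
    # Same load with a 2 s timeout leaves most workers free.
    print(busy_workers(5, 2), is_saturated(5, 2, 64))
```

Under these assumed numbers, shortening the timeout helps far more than adding workers: the demand for workers scales linearly with the time each blocked call is held.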

At this point more nodes were stuck than we could safely restart (there was no quorum of functioning instances). This was complicated by the fact that hitting the open file descriptor limit prevented this database from writing its snapshot to disk, and also prevented management commands from cleanly restarting the database.

At 16:00 UTC we started a node-by-node restart of the in-memory database. In addition, we raised the open file limit for this database. After the database restart we re-enabled the thrift-server (recovering many RIPEstat endpoints) and the service inserting data for the looking glass. Maintenance finished by 18:30 UTC, and by 19:30 UTC the looking glass had caught up for most peers.

We took the following steps:

Increased the open file descriptor limit for the in-memory database.
Increased the number of Python processes.

We will take the following actions:

Split the application healthcheck from the readiness endpoint and use readiness (with limited/no backend dependencies) from the load-balancer.
Reduce RPC call timeouts (reduce the timeout to less than the time for an HTTP request to time out).
Split application workers, preventing data backend problems from making the complete application unavailable.
Increase the number of Python processes further, or scale with application load.
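The split between a readiness endpoint and a health check can be sketched as two separate probe functions. This is a minimal illustration under assumptions, not the planned implementation: the function names, status codes, and the shape of the report are hypothetical. The readiness probe touches no backend, so a stuck database cannot cause the load-balancer to remove otherwise working workers, while the health probe still reports each dependency for monitoring.

```python
from typing import Callable, Dict, Tuple


def readiness() -> Tuple[int, str]:
    """Load-balancer probe: only proves this worker can answer; no backend calls."""
    return 200, "ready"


def health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Monitoring probe: runs every dependency check and reports each one."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A raising check is reported, not propagated, so one broken
            # backend does not hide the state of the others.
            report[name] = f"error: {type(exc).__name__}"
    status = 200 if all(value == "ok" for value in report.values()) else 503
    return status, report
```

In practice each check passed to `health` would need its own short timeout, otherwise the monitoring probe itself could block on a stuck backend.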

Posted Mar 21, 2025 - 13:42 CET

Resolved

The issue was fixed with the intervention yesterday evening.

Posted Mar 21, 2025 - 07:32 CET

Update

We are continuing to monitor for any further issues.

Posted Mar 20, 2025 - 20:46 CET

Update

We have finished most of the maintenance. All datasets except the looking glass should be available.

Posted Mar 20, 2025 - 19:33 CET

Update

RIPEstat is fully unavailable while we perform maintenance on multiple backend components.

Posted Mar 20, 2025 - 17:00 CET

Monitoring

The endpoints other than the looking glass are functional. We are working on the looking glass backend.

Posted Mar 20, 2025 - 14:27 CET

Update

We have disabled the looking-glass API for now, to reduce the load on the system.

Posted Mar 20, 2025 - 13:51 CET

Identified

Another node in the database cluster has hit the same failure. We will investigate further and likely do a rolling restart of the database.

Posted Mar 20, 2025 - 12:33 CET

Monitoring

The backend system is working again. We are monitoring the situation.

Posted Mar 20, 2025 - 11:53 CET

Update

The backend system for the looking glass is overloaded. Slow responses from this system blocked worker processes and caused the API to stop responding.

Posted Mar 20, 2025 - 11:37 CET

Identified

The issue has been identified and a fix is being implemented.

Posted Mar 20, 2025 - 11:23 CET

Investigating

RIPEstat is not available at the moment. We are currently investigating this issue.

Posted Mar 20, 2025 - 11:13 CET

This incident affected: Non-Critical Services (RIPEstat).