Increased error rate for RIPEstat requests

Incident Report for RIPE NCC

Resolved

We are resolving this statuspage issue, since the impact has been mitigated for now. The underlying cause has not been fully resolved.


After analysis, we conclude that the Python web application can become overloaded when requests to a backend system stall. RIPEstat is a Python application based on Django, which means it has a limited number of worker threads, each handling one request at a time. There is no asynchronous client for the RPC system (Thrift) used by the backend API, so a thread waiting on a stalled backend call cannot serve other requests, and the available workers are quickly exhausted.
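
To illustrate why stalled backend calls exhaust a fixed pool of synchronous workers, here is a rough back-of-the-envelope sketch. The worker count, latencies, and request rate are hypothetical numbers chosen for illustration; they are not measurements from RIPEstat.

```python
def max_throughput(workers: int, backend_latency_s: float) -> float:
    """Upper bound on requests/second when every worker blocks on one
    backend call at a time (no async client, one request per thread)."""
    if workers <= 0 or backend_latency_s <= 0:
        raise ValueError("workers and backend_latency_s must be positive")
    return workers / backend_latency_s


def is_overloaded(arrival_rate: float, workers: int, backend_latency_s: float) -> bool:
    """True when requests arrive faster than the blocked workers can drain them."""
    return arrival_rate > max_throughput(workers, backend_latency_s)


if __name__ == "__main__":
    WORKERS = 16          # hypothetical pool size
    ARRIVAL_RATE = 20.0   # hypothetical requests per second

    # Hypothetical healthy vs. stalled backend latencies, in seconds.
    for label, latency in (("healthy", 0.25), ("stalled", 4.0)):
        print(label, max_throughput(WORKERS, latency),
              is_overloaded(ARRIVAL_RATE, WORKERS, latency))
```

With these assumed numbers, the same pool that comfortably serves the load at 0.25 s per backend call falls behind once calls take 4 s, even though the backend is only slow and not down.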

We have:
* Added automatic rolling restarts of the Thrift backend system as a stopgap measure

In the short term we will:
* Add metrics based on the nginx and Python application logs
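
As a rough illustration of a log-based metric, the sketch below computes the share of 5xx responses from access log lines. It assumes nginx's default "combined" log format, which may not match our actual configuration; the function name and the sample lines are made up for the example.

```python
import re
from typing import Iterable

# Assumes the nginx "combined" format:
# addr - user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) '
)


def server_error_rate(lines: Iterable[str]) -> float:
    """Fraction of parseable log lines with a 5xx status (0.0 if none parse)."""
    total = 0
    errors = 0
    for line in lines:
        match = COMBINED_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        total += 1
        if match.group("status").startswith("5"):
            errors += 1
    return errors / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample lines, not real traffic.
    sample = [
        '192.0.2.1 - - [01/Jan/2025:12:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "curl/8.0"',
        '192.0.2.2 - - [01/Jan/2025:12:00:02 +0000] "GET /b HTTP/1.1" 502 157 "-" "curl/8.0"',
    ]
    print(server_error_rate(sample))
```

A ratio like this, computed per time window, would surface partial failures that a simple up/down check misses.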

In the medium term we will:
* Scrape metrics natively from the Python app (a Prometheus endpoint is present but cannot be scraped due to infrastructure limitations in our setup)
* Start scraping metrics from _each_ Thrift RPC server instance (instead of scraping a single instance through a webserver)
* Add standard JVM metrics, as well as statistics on Thrift calls in the Thrift RPC server
Posted Jul 04, 2025 - 22:45 CEST

Update

We have more information, and expect to roll out additional metrics by the weekend. This will likely give us more visibility into the underlying cause.
Posted Jul 04, 2025 - 08:58 CEST

Identified

In addition, this root cause led to a near-full outage for RIPEstat between 0:38 and 1:29. We apologize for this inconvenience and are working on the monitoring gaps that are present for partial failures.
Posted Jul 03, 2025 - 11:56 CEST

Investigating

RIPEstat is intermittently experiencing periods of increased error rates. During these periods, a fraction of requests fail.

Users may encounter errors when accessing RIPEstat on the web or making API calls. Data may fail to load in widgets, and API responses may return an HTTP error.

The underlying cause is an intermittent issue with one of our (internal) backend APIs. This backend system degrades in performance, but not enough to raise alerts. In turn, this degrades the performance of RIPEstat. We currently have limited metrics for the performance of this system, which means users are affected before the issue becomes visible in our current monitoring solution.

We are actively implementing enhanced monitoring to improve visibility into error rates for both the main RIPEstat application and the specific backend system. This will help us better identify and resolve issues in the future.

You may need to reload the page if you encounter an error, or retry failed API calls. We will update this notification as soon as we have more information.
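
For API users who want to automate retries, the following is a minimal sketch of retrying with exponential backoff and jitter. The helper name `retry_with_backoff` and its parameters are illustrative and not part of any RIPEstat client; `call` stands in for whatever function performs your API request.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with exponential backoff.

    The wait before retry n is a random value between 0 and
    min(max_delay, base_delay * 2**n), so clients do not retry in lockstep.
    The last failure is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")  # loop always returns or raises
```

This sketch retries on every exception for brevity. A real client would likely retry only on timeouts and HTTP 5xx responses, and keep the number of attempts small to avoid adding load while the service is degraded.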
Posted Jul 03, 2025 - 08:23 CEST
This incident affected: Non-Critical Services (RIPEstat).