Resolved -
This incident has been resolved.
Jun 22, 09:08 CEST
Monitoring -
The likely root cause is a cascade of HBase region server restarts, triggered when a ZooKeeper session expired. Why the session expired is not yet confirmed; the most likely explanation is a session timeout after a worker did not communicate for 40 seconds.
We will add high-resolution network monitoring between each ZooKeeper node and all ZooKeeper clients to track potential network issues.
Jun 18, 15:51 CEST
Identified -
We have identified a faulty node in the HBase cluster and applied mitigations.
The error rate remains elevated at around 5%, but RIPEstat functionality has improved. We are continuing to monitor the situation.
Jun 18, 14:20 CEST
Investigating -
Since around 10:15 UTC, RIPEstat has been experiencing elevated error rates and increased latency.
The issue is caused by a failing node in HBase, the distributed database used by many RIPEstat datasets. We are currently investigating the issue.
Jun 18, 13:45 CEST