Problems loading my.ripe.net/lirportal.ripe.net
Incident Report for RIPE NCC
Postmortem

NFS File Share Unresponsive in Storage Cluster.

Issue Summary

One of the NFS file shares on the storage cluster, which hosts the www static content, was taking a long time to serve its files. As a result, static content was unavailable on the Atlas web page, the LIR Portal and the RPKI Dashboard.

Timeline

(all times in UTC)

(May 3, 2024 4:41 AM) Alert received from atlas.ripe.net because none of the members were active.

(May 3, 2024 5:20 AM) Investigation points to a fault with the underlying hardware cluster.

(May 3, 2024 6:50 AM) Identified the specific hardware and isolated the server which was hosting the affected file share.

(May 3, 2024 7:18 AM) Maintenance mode is enabled on the isolated host to prevent any recurring disruption.

(May 3, 2024 7:40 AM) Alert closed as all members are back online and static content is available. IT opens a case with the vendor to investigate the issue further.

Root Cause

An ESXi host experienced a hardware issue which caused the file share to become unresponsive. The issue is being investigated with our supplier.

Resolution and recovery

The affected host was placed into Maintenance Mode to facilitate the migration of all Virtual Machines to a different host. Additionally, the NFS share was migrated to ensure accessibility of static content. Following these migrations, the issue ceased, and static content became available.

Corrective and Preventative Measures

  • Collaborate with the supplier to improve monitoring of the storage cluster and of the ESXi hosts' hardware.
  • Improve automated checks on the storage infrastructure in the monitoring system.
  • Investigate redundancy and failover mechanisms for the storage cluster.
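
As an illustration of the kind of automated check mentioned above, the following is a minimal sketch, not the check RIPE NCC runs. It times a small read from a file on the share and reports a failure if the read errors out or does not finish within a timeout. The function name `probe_read`, the timeout value and the probe file path are all assumptions made for this example.

```python
import threading
import time


def probe_read(path, timeout_s=5.0):
    """Return (ok, elapsed_seconds) for reading the first byte of `path`.

    The read runs in a daemon thread so that a hung NFS mount cannot
    block the caller beyond `timeout_s`.
    """
    result = {}

    def _read():
        try:
            with open(path, "rb") as fh:
                fh.read(1)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    start = time.monotonic()
    worker = threading.Thread(target=_read, daemon=True)
    worker.start()
    worker.join(timeout_s)
    elapsed = time.monotonic() - start

    if worker.is_alive():
        # The read is still blocked: treat the share as unresponsive.
        return False, elapsed
    return result.get("ok", False), elapsed
```

A monitoring job could call, for example, `probe_read("/mnt/www-static/healthcheck.txt")` (a hypothetical path) at a fixed interval and alert when `ok` is false or `elapsed` exceeds a threshold. Client-side caching can mask a slow share, so a real check would likely need to read a file that is not cached or bypass the cache.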
Posted May 03, 2024 - 11:33 UTC

Resolved
This incident has been resolved.
Posted May 03, 2024 - 08:51 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 03, 2024 - 08:32 UTC
Update
We are continuing to work on a fix for this issue. We can also see that the atlas.ripe.net website is down.
Posted May 03, 2024 - 06:39 UTC
Identified
We have identified the root cause of the issues and are working on a mitigation.
Posted May 03, 2024 - 06:25 UTC
Investigating
Since ~05:44 UTC, some users may be experiencing problems loading parts of my.ripe.net/lirportal.ripe.net due to missing webpage components. This mostly affects users who have not (recently) visited the website. We are investigating the issue.
Posted May 03, 2024 - 06:06 UTC
This incident affected: RPKI (RPKI Dashboard), LIR (Member) Portal, and Non-Critical Services (RIPE Atlas).