Date: August 7 - August 8, 2025
Incident Duration: 40 minutes (23:50 – 00:30 UTC)
Impact: Hosts in the NorthC datacenter were unreachable.
Severity: Critical
On August 8, 2025, scheduled maintenance was carried out by our fiber provider Eurofiber on one of the redundant links between the AM3 and NorthC datacenters. The link was covered by another fiber routed along a different path, so at most a few brief network blips were expected while traffic moved over to the other path.
While no service impact was anticipated due to the redundant network design, an unexpected hardware issue occurred on a core switch interface during the maintenance window. This resulted in network degradation and connectivity loss to several hosts located in the NorthC datacenter.
The failure was detected by our monitoring systems within five minutes. The engineering team promptly began an investigation, and partial recovery was observed approximately 40 minutes after the incident began. The affected interface is currently under investigation with our hardware vendor.
23:05
Initial alerts are triggered because the link goes down for the maintenance. The 24/7 engineer is alerted but takes no action, as this was expected and no services were impacted.
23:50
Incident begins: Connectivity to hosts in NorthC datacenter lost.
Network degradation impacts services.
23:55
Monitoring alerts indicating loss of connectivity are triggered.
Initial suspicion falls on the ongoing fiber maintenance.
00:00 – 00:25
Engineering team validates alert data and confirms network issue.
Network troubleshooting begins, focusing on switch interface behavior.
00:30
Services begin to recover; connectivity is gradually restored.
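Because the timeline crosses midnight UTC, the intervals are easy to miscount by hand. The sketch below recomputes them from the timestamps listed above. It assumes the 23:xx entries fall on August 7 and the 00:xx entries on August 8, which the report implies but does not state; the helper names `utc` and `minutes_between` are illustrative only.

```python
from datetime import datetime, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in August 2025 from a day number and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2025, 8, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the timeline; the calendar days are an assumption.
maintenance_alert = utc(7, "23:05")  # link goes down for maintenance
incident_start = utc(7, "23:50")     # connectivity to NorthC hosts lost
detected = utc(7, "23:55")           # monitoring alerts triggered
recovery_start = utc(8, "00:30")     # services begin to recover

time_to_detect = minutes_between(incident_start, detected)
time_to_recovery = minutes_between(incident_start, recovery_start)
detect_to_recovery = minutes_between(detected, recovery_start)
maintenance_to_incident = minutes_between(maintenance_alert, incident_start)

print(f"Time to detect:            {time_to_detect} min")
print(f"Incident start to recovery: {time_to_recovery} min")
print(f"Detection to recovery:      {detect_to_recovery} min")
print(f"Maintenance alert to incident: {maintenance_to_incident} min")
```

Using full timestamps rather than bare clock times keeps the subtraction correct across the date change.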
The exact cause of the issue is still under investigation. Preliminary findings suggest a malfunction in one of the interfaces on a core switch, which coincided with the fiber maintenance. Although the redundant network topology was expected to prevent disruption, the failure of a single interface led to unexpected impact.
We are working with the hardware vendor to determine whether the issue stems from a physical fault, firmware bug, or misbehavior under failover conditions.