Issues reaching RIPE NCC services
Incident Report for RIPE NCC
Postmortem

Summary

On 1 November, from 10:45 to 12:15 UTC, most names in the ripe.net zone were bogus because expired DNSSEC signatures were being served. This rendered most of the RIPE NCC’s services unreachable. After investigating the issue, we found a typo in a change to our zone: a record had a TTL of 864,000 seconds instead of the intended 86,400, which made it longer than the refresh interval for RRSIGs (seven days). This caused our signer to stop refreshing signatures and to sign only changes to the zone. We are talking to the vendor of our DNSSEC signing solution about this case to see what can be improved on that end. We have also implemented a pre-commit check to prevent TTLs longer than a day in the ripe.net zone, and we are looking at improving monitoring for stale signatures to spot issues like this before they cause problems.

Impact

DNSSEC signatures in the ripe.net zone are valid for 14 days, with our signers configured to re-sign them after half that time (seven days). On 1 November at 10:45 UTC the signatures on most records in the ripe.net zone expired. These records had last been signed on 18 October and were due to be re-signed on 25 October. However, due to a problem with the TTL on one record, our signer stopped re-signing records in the zone on 25 October. This resulted in the expiry of 11,026 out of 11,389 records on 1 November. The remaining 363 records were new or changed and were still properly signed, which meant that our monitoring, which checks the signature validity of the SOA record at the zone apex, missed this issue.
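
As an illustration only, the dates above follow from simple arithmetic on the 14-day lifetime and the seven-day refresh. The Python sketch below is not part of our tooling; the function name is hypothetical, and the 10:45 time of day for the 18 October signing is an assumption inferred from the expiry time.

```python
from datetime import datetime, timedelta, timezone


def signature_schedule(signed_at, lifetime, refresh_before_expiry):
    """Return (refresh_due, expires) for a signature created at signed_at."""
    expires = signed_at + lifetime
    # The refresh is due a fixed period before the signature expires.
    refresh_due = expires - refresh_before_expiry
    return refresh_due, expires


# Assumed signing time: the date is from the report, the time of day is inferred.
signed_at = datetime(2023, 10, 18, 10, 45, tzinfo=timezone.utc)
refresh_due, expires = signature_schedule(
    signed_at, timedelta(days=14), timedelta(days=7)
)

print(refresh_due.isoformat())  # 2023-10-25T10:45:00+00:00
print(expires.isoformat())      # 2023-11-01T10:45:00+00:00
```

With the refresh on 25 October skipped, nothing replaced these signatures before they expired on 1 November.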

Because our internal resolvers are configured for DNSSEC validation, the impact on staff was immediate, as many internal services broke. After first dismissing some alternative causes, we quickly found that the problem was with expired signatures in the ripe.net zone, so we turned our attention to our signers. At the same time, we temporarily disabled DNSSEC validation on our internal resolvers so we could more easily access our own systems while troubleshooting.

Resolution

While debugging, we found that the rrsig-refresh option, which we had configured to seven days (half the value of the rrsig-lifetime option of 14 days), was likely involved. The logs showed:

info: [ripe.net.] DNSSEC, signing zone
error: [ripe.net.] DNSSEC, rrsig-refresh too low to prevent expired RRSIGs in resolver caches
info: [ripe.net.] DNSSEC, next signing at 2023-10-25T10:02:02+0000
error: [ripe.net.] zone event 're-sign' failed (invalid parameter)

At 12:14 UTC we removed that option from our configuration and we could sign the zone again. The freshly signed zone was pushed out and went live a little bit later, which meant that at 12:15 UTC our services were available again for most users. Unfortunately, some users kept seeing problems for several hours after we restored the signatures.

Root cause

After further investigation we found that the change that triggered this problem introduced a record in the ripe.net zone with a TTL of 864,000 seconds (ten days). Because this TTL is longer than our rrsig-refresh configuration, a resolver’s cache could end up holding the record with an expired signature. The signer software rightly complained about this. We were surprised to find that it then stopped refreshing signatures for all records in the zone that didn’t change.
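
The condition described above can be sketched as a simple comparison. This is a simplified illustration in Python, not the signer's actual check, which may take further terms into account; the function name and the example record names are hypothetical.

```python
RRSIG_REFRESH_SECONDS = 7 * 24 * 3600  # seven days, as configured at the time


def ttls_exceeding_refresh(records, refresh_seconds=RRSIG_REFRESH_SECONDS):
    """Return the (name, ttl) pairs whose TTL is longer than the refresh interval."""
    return [(name, ttl) for name, ttl in records if ttl > refresh_seconds]


# Hypothetical records: one with the intended TTL, one with the typo.
records = [
    ("intended.example.", 86_400),   # one day
    ("typo.example.", 864_000),      # ten days
]

print(ttls_exceeding_refresh(records))  # [('typo.example.', 864000)]
```

A record cached for ten days can outlive a signature that is only guaranteed to have seven days of validity left, which is the case the signer warned about.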

Future steps

During the incident and the aftermath we identified a few changes that we want to make to improve the resiliency of our setup and allow us to find cases like these before they become problems. Our current RRSIG freshness monitoring did not catch this case, because the records we monitor still had valid and recent signatures, so we are considering what we can do to cover this situation. We have also improved our zone-editing pipeline to catch typos or misconfigurations for TTL values.
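
One possible shape for broader freshness monitoring is to look at the RRSIG expiration of every record set rather than only the SOA. The Python sketch below is an assumption about how such a check could work, not a description of our monitoring; the function names, the six-day threshold and the example data are hypothetical. It relies only on the standard RRSIG presentation format for timestamps (YYYYMMDDHHmmSS, in UTC).

```python
from datetime import datetime, timedelta, timezone


def parse_rrsig_time(text):
    """Parse an RRSIG timestamp in YYYYMMDDHHmmSS presentation format (UTC)."""
    return datetime.strptime(text, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def stale_signatures(rrsigs, now, min_remaining):
    """Return (owner, type) pairs whose signature expires within min_remaining."""
    stale = []
    for owner, rrtype, expiration in rrsigs:
        remaining = parse_rrsig_time(expiration) - now
        if remaining < min_remaining:
            stale.append((owner, rrtype))
    return stale


# Hypothetical data: the SOA was re-signed recently, the A record was not.
rrsigs = [
    ("example.", "SOA", "20231114090000"),
    ("www.example.", "A", "20231101104500"),
]
now = datetime(2023, 10, 31, 12, 0, tzinfo=timezone.utc)

print(stale_signatures(rrsigs, now, timedelta(days=6)))  # [('www.example.', 'A')]
```

With a 14-day lifetime and a seven-day refresh, a healthy signature should always have roughly seven or more days of validity left, so a threshold slightly below that would flag records that missed a refresh well before they expire.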

Beyond that, the problem also affected our ability to communicate internally, as our internal chat system was unresolvable too. We have some means of out-of-band communication, but will review how we can improve that.

Additionally, while the status.ripe.net website is hosted on separate infrastructure, the fact that it is also in the ripe.net domain meant that it was just as unreachable as our other services. We will evaluate this approach and see how we can improve on it.

Timeline (times in UTC)

25 October

  • 08:52 a record was added to the ripe.net zone with a TTL of 864,000 seconds
  • 08:53 knot incrementally signs ripe.net successfully
  • 09:02 knot fails to sign the ripe.net zone for the first time

1 November

  • 10:45 ripe.net signatures expire and many records go bogus
  • 11:27 DNSSEC validation on internal resolvers is disabled
  • 12:14 configuration is changed and the zone is manually re-signed
  • 12:15 ripe.net zone has new valid signatures
  • 12:38 DNSSEC validation on internal resolvers is re-enabled

2 November

  • 08:39 typo in TTL fixed, bringing it back to 86,400 seconds as intended
  • 08:39 added check in pipeline to detect TTL values that are too large
Posted Nov 02, 2023 - 15:43 UTC

Resolved
This incident has been resolved.
Posted Nov 01, 2023 - 13:08 UTC
Monitoring
According to our current understanding the root cause of the issue was DNSSEC-related.

According to the information available to us the issues started to recover around 12:15. We are monitoring the situation.
Posted Nov 01, 2023 - 12:37 UTC
Identified
We are still investigating the issue.

We have also marked the RPKI RRDP repositories as affected; rrdp.ripe.net was affected between 11:01 and 11:15 UTC.
Posted Nov 01, 2023 - 12:19 UTC
Update
We are continuing to investigate this issue.
Posted Nov 01, 2023 - 12:08 UTC
Update
We are still investigating the issues.

We are aware of the issues some users face resolving names in the ripe.net zone.
Posted Nov 01, 2023 - 11:19 UTC
Investigating
We are currently investigating this issue.
Posted Nov 01, 2023 - 10:52 UTC
This incident affected: RIPE Database, Email to/from the RIPE NCC, LIR (Member) Portal, RIPE NCC Access, www.ripe.net and RPKI (RPKI Dashboard, RRDP Repository, rsync Repository).