Issues reaching RIPE NCC services

Incident Report for RIPE NCC

Postmortem

Summary

On 1 November, from 10:45 to 12:15 UTC, most names in the ripe.net zone were bogus because expired DNSSEC signatures were being served. This rendered most of the RIPE NCC’s services unreachable. After investigating the issue, we found a typo in a change to our zone: a record had been given a TTL of 864,000 seconds instead of 86,400, which made it longer than the refresh interval for RRSIGs (seven days). This caused our signer to stop refreshing signatures and only sign changes to the zone. We are talking to the vendor of our DNSSEC signing solution about this case to see what can be improved on that end. We have also implemented a pre-commit check to prevent TTLs longer than a day in the ripe.net zone, and we are looking at improving our monitoring for stale signatures so that we spot issues like this before they cause problems.

Impact

DNSSEC signatures in the ripe.net zone are valid for 14 days, with our signers configured to re-sign them after half that time (seven days). On 1 November at 10:45 UTC, the signatures on many records in the ripe.net zone expired. These records had last been signed on 18 October and were due to be re-signed on 25 October. However, due to a problem with the TTL on one record, our signer stopped re-signing records in the zone on 25 October. This resulted in the expiry of 11,026 out of 11,389 records on 1 November. The remaining 363 records, which were new or had changed, were still properly signed. As a result, our monitoring, which checks the signature validity of the SOA record at the zone apex, missed this issue.
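
The gap described above is that a fresh signature on the apex SOA record said nothing about the other records. The following is a minimal sketch of a broader check, not the monitoring actually in use: it flags every signature that has less validity left than the refresh interval and so should already have been replaced. The input shape (a list of owner, covered type and expiration stamp), the function names and the sample records are assumptions made for illustration. The timestamp format is the standard YYYYMMDDHHMMSS presentation form of RRSIG records.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy value, taken from the figures in this report.
RRSIG_REFRESH = timedelta(days=7)  # a signature is due for refresh below this


def parse_rrsig_time(stamp: str) -> datetime:
    """Parse an RRSIG timestamp in YYYYMMDDHHMMSS presentation format (UTC)."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def overdue_signatures(rrsigs, now, refresh=RRSIG_REFRESH):
    """Return (owner, covered_type) for every signature past its refresh point.

    `rrsigs` is an iterable of (owner, covered_type, expiration_stamp) tuples.
    This is a hypothetical input shape; collecting the RRSIGs is out of scope.
    """
    overdue = []
    for owner, covered, expiration in rrsigs:
        remaining = parse_rrsig_time(expiration) - now
        if remaining < refresh:
            overdue.append((owner, covered))
    return overdue


# Illustrative data only: a recently re-signed apex record, and a record
# whose signature was last made on 18 October and expires on 1 November.
sample = [
    ("ripe.net.", "SOA", "20231108090000"),
    ("www.ripe.net.", "A", "20231101104500"),
]
now = datetime(2023, 10, 26, 12, 0, tzinfo=timezone.utc)
print(overdue_signatures(sample, now))  # [('www.ripe.net.', 'A')]
```

With these sample values, a check run on 26 October would have reported the second record six days before its signature expired, while a check limited to the SOA record would have reported nothing.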

Because our internal resolvers are configured for DNSSEC validation, the impact was rather immediate for staff, as many internal services broke due to this issue. After first dismissing some alternative causes, we quickly found the problem was with expired signatures in the ripe.net zone, so we turned our attention to our signers. At the same time, we temporarily disabled DNSSEC validation on our internal resolvers so we could more easily access our own systems while troubleshooting.

Resolution

While debugging, we found that the rrsig-refresh option, which we had configured to seven days (half the value of the rrsig-lifetime option of 14 days), was likely involved. The logs showed:

info: [ripe.net.] DNSSEC, signing zone
error: [ripe.net.] DNSSEC, rrsig-refresh too low to prevent expired RRSIGs in resolver caches
info: [ripe.net.] DNSSEC, next signing at 2023-10-25T10:02:02+0000
error: [ripe.net.] zone event 're-sign' failed (invalid parameter)

At 12:14 UTC we removed that option from our configuration and we could sign the zone again. The freshly signed zone was pushed out and went live a little bit later, which meant that at 12:15 UTC our services were available again for most users. Unfortunately, some users kept seeing problems for several hours after we restored the signatures.

Root cause

After further investigation, we found that the change that triggered this problem introduced a record in the ripe.net zone with a TTL of 864,000 seconds (ten days). Because this TTL is longer than our rrsig-refresh setting of seven days, a resolver’s cache could end up holding the record with an expired signature. The signer software rightly complained about this. We were surprised to find that it then stopped refreshing signatures for all records in the zone that had not changed.
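
The arithmetic behind the signer's complaint can be made concrete with a short worked example. This is a simplified model, assuming a resolver keeps a record for its full TTL and does not shorten the cache time to the remaining signature validity. The function name is hypothetical.

```python
from datetime import timedelta


def stale_cache_window(ttl: timedelta, refresh: timedelta) -> timedelta:
    """Worst-case time a cached record can outlive its signature.

    A signature is replaced once only `refresh` of validity remains, so a
    resolver can fetch it with just over `refresh` left. The cached copy is
    then kept for `ttl`; any time beyond `refresh` is spent with an expired
    signature.
    """
    return max(ttl - refresh, timedelta(0))


refresh = timedelta(days=7)
print(stale_cache_window(timedelta(seconds=864_000), refresh))  # 3 days, 0:00:00
print(stale_cache_window(timedelta(seconds=86_400), refresh))   # 0:00:00
```

Under this model the mistyped TTL leaves a window of up to three days, while the intended TTL of 86,400 seconds leaves none.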

Future steps

During the incident and the aftermath we identified a few changes that we want to make to improve the resiliency of our setup and allow us to find cases like these before they become problems. Our current RRSIG freshness monitoring did not catch this case, because the records we monitor still had valid and recent signatures, so we are considering what we can do to cover this situation. We have also improved our zone-editing pipeline to catch typos or misconfigurations for TTL values.
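
A TTL check of the kind mentioned above could look roughly like the sketch below. It is not the check in our pipeline. It assumes a simplified, normalised zone listing in which every record sits on one line as `owner TTL class type rdata` with a numeric TTL; real zone files also allow omitted TTLs, `$TTL` defaults and multi-line records, which this sketch ignores. The function name and the sample records are hypothetical.

```python
MAX_TTL = 86_400  # one day, the limit named in this report


def ttl_violations(lines, max_ttl=MAX_TTL):
    """Yield (line_number, owner, ttl) for records whose TTL exceeds max_ttl.

    Assumes one record per line in the form `owner TTL class type rdata`
    with the TTL given in seconds. Lines in any other shape are skipped.
    """
    for number, line in enumerate(lines, start=1):
        text = line.split(";", 1)[0].strip()  # naive comment stripping
        if not text or text.startswith("$"):  # skip blanks and directives
            continue
        fields = text.split()
        if len(fields) < 4 or not fields[1].isdigit():
            continue  # not in the assumed format
        ttl = int(fields[1])
        if ttl > max_ttl:
            yield number, fields[0], ttl


# Illustrative records only, using reserved example names and addresses.
sample = [
    "typo.example.  864000 IN A 192.0.2.1",
    "fine.example.  86400  IN A 192.0.2.2",
]
for number, owner, ttl in ttl_violations(sample):
    print(f"line {number}: {owner} has TTL {ttl}, limit is {MAX_TTL}")
# prints: line 1: typo.example. has TTL 864000, limit is 86400
```

Run as a pre-commit hook, a script like this would exit with a non-zero status whenever it finds a violation, so the change is rejected before it reaches the signer.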

The problem also affected our ability to communicate internally, as our internal chat system was unresolvable too. We have some means of out-of-band communication, but we will review how we can improve them.

Additionally, while the status.ripe.net website is hosted on separate infrastructure, the fact that it is also in the ripe.net domain meant that it was just as unreachable as our other services. We will evaluate this approach and see how we can improve on it.

Timeline (times in UTC)

25 October

08:52 a record was added to the ripe.net zone with a TTL of 864,000 seconds
08:53 knot incrementally signs ripe.net successfully
09:02 knot fails to sign the ripe.net zone for the first time

1 November

10:45 ripe.net signatures expire and many records go bogus
11:27 DNSSEC validation on internal resolvers was disabled
12:14 changed configuration and manually re-signed zone
12:15 ripe.net zone has new valid signatures
12:38 DNSSEC validation on internal resolvers is re-enabled

2 November

08:39 typo in TTL fixed, bringing it back to 86,400 seconds as intended
08:39 added check in pipeline to detect TTL values that are too large

Posted Nov 02, 2023 - 16:43 CET

Resolved

This incident has been resolved.

Posted Nov 01, 2023 - 14:08 CET

Monitoring

According to our current understanding the root cause of the issue was DNSSEC-related.

According to the information available to us, the issues started to recover around 12:15 UTC. We are monitoring the situation.

Posted Nov 01, 2023 - 13:37 CET

Identified

We are still investigating the issue.

We have also marked the RPKI RRDP repositories as affected; rrdp.ripe.net was affected between 11:01 and 11:15 UTC.

Posted Nov 01, 2023 - 13:19 CET

Update

We are continuing to investigate this issue.

Posted Nov 01, 2023 - 13:08 CET

Update

We are still investigating the issues.

We are aware of the issues some users face resolving names in the ripe.net zone.

Posted Nov 01, 2023 - 12:19 CET

Investigating

We are currently investigating this issue.

Posted Nov 01, 2023 - 11:52 CET

This incident affected: RIPE Database, Email to/from the RIPE NCC, LIR (Member) Portal, RIPE NCC Access, www.ripe.net and RPKI (RPKI Dashboard, RRDP Repository, rsync Repository).