Every month, DNS Belgium performs a key rollover routine in which the Zone Signing Keys (ZSKs) are renewed for the .be, .vlaanderen and .brussels zones. This is a best practice when DNSSEC is enabled on DNS zones. The ZSK is the key used for signing the zone.
During the ZSK rollover executed on 15 November 2018, an error was introduced in all 3 zones and was placed in production. Due to this error, 1,410 out of a total of 1.6 million websites were not reachable for clients behind a validating resolver.
The incident lasted 10 hours and 35 minutes before it was resolved (Thursday 15 November 2018 from 12:06 CET until 22:41 CET). Analysis of the name server logs indicated that the real impact was limited to 559 .be domain names, for which requests were received during the incident window. Since we cannot know how many end users are behind a resolver, it is hard to estimate the number of users that were impacted while trying to access any of the affected websites.
The root cause of the problem was a bug introduced in the new version of the signing software (ISC BIND v9.11.4), which was installed a few weeks before the incident. This bug has a random factor and did not manifest itself when we tested this new BIND version in our test environments.
During the weeks after the incident, we have implemented additional measures to avoid publishing incorrectly signed zone files.
Before 15th November 2018
The software used by DNS Belgium for signing the zones is ISC BIND, and each zone has a dedicated machine assigned to the signing task.
A few weeks before the incident (15 November), the BIND software was upgraded from version 9.9.4 to 9.11.4 on all signers. This was necessary because version 9.9.4 was no longer supported by ISC and because the new version included 2 specific fixes requested by DNS Belgium:
- fix for double RRSIG bug during ZSK rollover
- fix for signature validity collateral damage during ZSK rollover
For reference, the signer is configured as follows:
- The setting “auto-dnssec maintain” is enabled with dynamic updates.
- The KSK is kept offline and is only brought online when a key rollover takes place.
- The NSEC3 opt-out flag is ON.
- The monthly ZSK rollovers are performed with a semi-automated script built in-house, which has been thoroughly tested and has been in use for several years.
The new version of BIND was already put in production on the signer devices weeks before the day of the issue and was continually and correctly signing the zones, which are dynamically updated.
The issue was introduced on 15 November 2018, when the monthly scheduled ZSK rollover was executed, this time with the updated version of BIND.
This new version of BIND included 3 new bugs which are specifically related to signature maintenance tasks, like a ZSK key rollover:
- Bug 1a: broken NSEC/NSEC3 chains
- Bug 1b: improper signing of glue records
- Bug 2: ZSK key signs the DNSKEY RRset
Unfortunately, the issue did not manifest itself during the previously performed tests of the ZSK rollover script on the new BIND version.
Only bug 1a is severe, because it breaks DNSSEC validation. The 2 other bugs do not actually break anything in the DNSSEC validation or in the DNS resolving process, but they do place incorrect entries in the zone and are therefore also undesired.
These bugs were acknowledged by ISC support and are reported in the ISC knowledge base article.
Bug 1a: corrupt NSEC3 chain
As is explained in the ISC knowledge base article: “In some, but not all cases, the newly-signed RRsets are added to the zone's NSEC/NSEC3 chain, but incompletely -- this can result in a broken chain, affecting validation of proof of nonexistence for records in the zone.”
This is exactly what happened to all 3 live zones of DNS Belgium and is also displayed in the screenshot (see below) taken from the dnsviz.net website for an affected domain name (domain name has been removed from the picture).
The issue only affected clients behind a DNSSEC validating resolver (e.g. 18.104.22.168) trying to resolve one of the 1,410 affected domains. At the time of the issue, a validating resolver would return a SERVFAIL response to the client because of the incorrectly signed NSEC3 record whose range contains the hash value of the requested domain. Because only the NSEC3 chain was corrupted by the bug and not the signing of the rest of the zone (and NSEC3 opt-out is enabled), only domains that are not DNSSEC-signed were affected. The number of affected domains remained relatively limited, due to a very specific race condition in the BIND software.
This race condition also made the issue difficult to reproduce afterwards, which is another reason why the problem was not detected when testing the new BIND version in our test environments.
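To make "the NSEC3 record which has the hash value of the requested domain in its range" concrete, the following is a minimal sketch of the NSEC3 hash defined in RFC 5155 (iterated, salted SHA-1, encoded as base32hex) together with a range test. The function names `nsec3_hash` and `covers` are chosen here for illustration; they are not part of BIND or of DNS Belgium's tooling, and the salt and iteration count of the real zones are not given in this report.

```python
import base64
import hashlib

# Standard base32 alphabet and the "extended hex" alphabet used by NSEC3.
_B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_B32HEX = "0123456789abcdefghijklmnopqrstuv"
_TO_B32HEX = str.maketrans(_B32, _B32HEX)


def nsec3_hash(name: str, salt_hex: str, iterations: int) -> str:
    """Return the RFC 5155 NSEC3 hash (SHA-1) of a domain name in base32hex."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    # Canonical wire format: length-prefixed lower-case labels, then the root label.
    wire = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    salt = b"" if salt_hex in ("", "-") else bytes.fromhex(salt_hex)
    digest = hashlib.sha1(wire + salt).digest()
    for _ in range(iterations):  # 'iterations' counts the additional rounds
        digest = hashlib.sha1(digest + salt).digest()
    return base64.b32encode(digest).decode("ascii").translate(_TO_B32HEX)


def covers(owner_hash: str, next_hash: str, target_hash: str) -> bool:
    """True if target_hash lies strictly between owner_hash and next_hash."""
    owner, nxt, target = (h.lower() for h in (owner_hash, next_hash, target_hash))
    if owner < nxt:
        return owner < target < nxt
    # The last record in the chain wraps around to the first one.
    return target > owner or target < nxt
```

With opt-out enabled, an unsigned delegation has no NSEC3 record of its own, so a validating resolver relies on the record that covers the delegation's hash. If that particular record is broken, validation fails for the names in its range only, which is consistent with the limited number of affected domains.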
Until this issue, the ZSK rollover was a standard operational procedure within DNS Belgium, executed monthly for many years without a problem, and it did not include extra DNSSEC validation checks. The signer was not isolated during the process either (isolation means that the signer processes no dynamic updates and forwards no updates to the hidden masters and authoritative slave name servers). Therefore, the invalid entries in the zone were set live on all the authoritative name servers. The problem was not noticed right away.
The issue was detected the same day by DNS Belgium during the execution of the first phase of a KSK (Key Signing Key) rollover. During the KSK rollover procedure, extra DNSSEC verification checks were performed, which indicated the issue.
Once the problem had been noticed, the DNS Belgium engineers, advised by ISC support, took immediate action to rectify the situation. The complete NSEC3 chain was regenerated by changing the NSEC3 salt value and then changing it back to the previous value. Afterwards, extra validation checks for DNSSEC were performed before the signer was taken out of isolation and dynamic updates were processed again.
This procedure lasted a couple of hours for the .be zone, but was significantly faster for the .vlaanderen and .brussels zones.
During the issue, a 60-fold increase in DS queries was noticed on the slave name servers.
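The report does not detail which validation checks were run. As an illustration of one check that could detect a broken chain of the kind described above, the sketch below verifies that the NSEC3 records of a zone form a single closed loop in hash order. The function name and the input format (pairs of owner hash and next hashed owner) are assumptions made for this example.

```python
def find_chain_breaks(records):
    """Return (owner, next_found, next_expected) for every broken link.

    'records' is an iterable of (owner_hash, next_hash) pairs taken from the
    NSEC3 records of one zone. In a complete chain, every record points to
    the next owner hash in sorted order, and the last points to the first.
    """
    chain = sorted((owner.lower(), nxt.lower()) for owner, nxt in records)
    breaks = []
    for i, (owner, nxt) in enumerate(chain):
        expected = chain[(i + 1) % len(chain)][0]
        if nxt != expected:
            breaks.append((owner, nxt, expected))
    return breaks


# Hypothetical data: record "d" was added without updating its predecessor "c".
broken = [("a", "c"), ("c", "f"), ("d", "f"), ("f", "a")]
print(find_chain_breaks(broken))
```

A check like this only looks at chain structure. It does not verify the RRSIGs on the NSEC3 records, so it complements rather than replaces full DNSSEC validation of the zone.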
Bug 1b: glue records with RRSIG signature
This bug was noticed the day after the NSEC3 issue. The previous bug (1a) was actually triggered by this bug (1b), due to the already mentioned race condition within the BIND software.
The ISC knowledge base article explains this as follows: “Code change #4964, intended to prevent double signatures when deleting an inactive zone DNSKEY in some situations, introduced a new problem during zone processing in which some delegation glue RRsets are incorrectly identified as needing RRSIGs, which are then created for them using the current active ZSK for the zone. In some, but not all cases, the newly-signed RRsets are added to the zone's NSEC/NSEC3 chain, but incompletely.”
Bug 2: ZSK key signs the DNSKEY RRset
After the ZSK rollover on 15 November 2018, DNS Belgium also noticed that the DNSKEY RRset was signed by a ZSK. This was undesired: the DNSKEY RRset should only be signed by valid KSKs. ISC later confirmed this as a bug that was also introduced in BIND version 9.11.4.
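Glue address records sit at or below a delegation point, where the parent zone is not authoritative, so they must not carry RRSIGs. A minimal sketch of a check for this condition follows. It assumes the zone has already been parsed into a list of delegation names and a list of (owner name, type covered) pairs for the RRSIGs; both the function names and that input format are hypothetical.

```python
def is_at_or_below(name, delegations):
    """True if 'name' equals or is a subdomain of any delegation point."""
    name = name.lower().rstrip(".")
    for cut in delegations:
        cut = cut.lower().rstrip(".")
        if name == cut or name.endswith("." + cut):
            return True
    return False


def find_signed_glue(delegations, rrsigs):
    """Return the RRSIGs that cover address records at or below a zone cut.

    'rrsigs' is an iterable of (owner_name, type_covered) pairs. DS records
    at the delegation point are authoritative in the parent zone and may be
    signed, so only A and AAAA are flagged here.
    """
    return [
        (owner, rtype)
        for owner, rtype in rrsigs
        if rtype in ("A", "AAAA") and is_at_or_below(owner, delegations)
    ]


# Hypothetical data for illustration only.
cuts = ["example.be"]
sigs = [("be", "SOA"), ("example.be", "DS"), ("ns1.example.be", "A")]
print(find_signed_glue(cuts, sigs))
```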
This bug does not actually break anything, which was also the reason the bug was not noticed during the testing phase. The bug is easy to reproduce, and only occurs when the KSK key is offline during a ZSK rollover procedure.
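Because this bug does not break validation, it has to be looked for explicitly. The sketch below flags RRSIGs over the DNSKEY RRset that were made by a key without the Secure Entry Point (SEP) flag. It relies on the common convention that KSKs carry flags value 257 and ZSKs 256, and on matching signatures to keys by key tag alone, which ignores the rare case of key tag collisions. The class and function names are illustrative, not taken from any library.

```python
from dataclasses import dataclass

SEP_FLAG = 0x0001  # set on KSKs (flags 257), clear on ZSKs (flags 256)


@dataclass(frozen=True)
class DnsKey:
    key_tag: int
    flags: int


@dataclass(frozen=True)
class Rrsig:
    type_covered: str
    key_tag: int


def zsk_signatures_over_dnskey(keys, rrsigs):
    """Return RRSIGs covering DNSKEY that were generated by a non-SEP key."""
    zsk_tags = {key.key_tag for key in keys if not key.flags & SEP_FLAG}
    return [
        sig
        for sig in rrsigs
        if sig.type_covered == "DNSKEY" and sig.key_tag in zsk_tags
    ]


# Hypothetical key tags for illustration only.
keys = [DnsKey(key_tag=11111, flags=257), DnsKey(key_tag=22222, flags=256)]
sigs = [Rrsig("DNSKEY", 11111), Rrsig("DNSKEY", 22222), Rrsig("SOA", 22222)]
print(zsk_signatures_over_dnskey(keys, sigs))
```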
At the time of writing, ISC has not yet fixed this bug; it is tracked on the ISC GitLab as issue #763.
Because the ZSK rollover is a monthly procedure and the ZSK key validity period is 40 days, the necessary preventive measures had to be taken before the next key rollover, which was planned for 20 December.
- First, it was decided to roll BIND back to the previous version, for which workarounds for the known bugs had already been implemented.
- Furthermore, extra DNSSEC validation checks were installed, and these now run hourly instead of daily on each zone.
- In addition, the ZSK rollover procedure has been improved:
  - The ZSK rollovers for the different zones are spread over time.
  - The signer is isolated for the ZSK procedure as well; dynamic updates for the specific zone are temporarily stopped and updates from the signer towards hidden masters are stopped.
  - Extra manual verification checks for DNSSEC are executed before the signer is taken out of isolation.
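The improved procedure amounts to a gate: isolate the signer, perform the rollover, verify, and only release the signer if every check passes. The sketch below shows that control flow only. The real procedure is semi-automated and the verification is manual, so all callables here (`isolate`, `roll_zsk`, `release` and the individual checks) are hypothetical placeholders, not DNS Belgium's actual scripts.

```python
def guarded_rollover(zone, isolate, roll_zsk, checks, release):
    """Run a ZSK rollover for 'zone' and release the signer only if all checks pass.

    'checks' is a list of (name, callable) pairs; each callable takes the zone
    name and returns True when the zone passes. Returns the names of the
    failed checks, or an empty list if the signer was released.
    """
    isolate(zone)   # stop dynamic updates and outgoing zone transfers
    roll_zsk(zone)  # perform the key rollover on the isolated signer
    failures = [name for name, check in checks if not check(zone)]
    if failures:
        # Leave the signer isolated so the faulty zone is never published.
        return failures
    release(zone)   # resume dynamic updates and transfers
    return []
```

Running this per zone at different times, rather than for all three zones at once, would match the decision to spread the rollovers over time, so that a fault found in one zone can be fixed before the next zone is rolled.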