Any change to a BIND zone results in SERVFAIL



  • Hey friends

    I've got a faulty BIND setup, or at least something is going on. Basically, every month or so, my main domain zone stops working. Lookups all fail with SERVFAIL, and the logs show the dreaded query fail message.

    My solution has been to restore from a known working backup. Sometimes that buys me a month or two and sometimes only a week or two.

    Also, if I change anything in that zone, it fails and I have to restore again. I've tried adding new hosts using the GUI and the custom zone file section; both cause clients to get SERVFAIL.

    I have historically had DHCP clients register with DNS, but I've turned that off for testing now.

    Anyone see anything glaring with my zone file? (I know I have some public IPs in there that I've masked with XXX.ZZZ ... I don't think that's the problem unless you smart minds tell me otherwise.)

    Lastly, Aspen (10.50.1.1) is a slave and even when Washington (10.15.1.1) gets itself out of sorts, aspen (slave) will respond correctly and without error.

    $TTL 43200
    ;
    $ORIGIN nsnet.us.
    
    ;	Database file nsnet.us.DB for nsnet.us zone.
    ;	Do not edit this file!!!
    ;	Zone version 2496684162
    ;
    nsnet.us.	 IN  SOA 10.15.1.1\. 	 zonemaster.nsnet.us. (
    		2496684162 ; serial
    		1d ; refresh
    		2h ; retry
    		4w ; expire
    		1h ; default_ttl
    		)
    
    ;
    ; Zone Records
    ;
    @ 	 IN NS 	10.15.1.1.
    @ 	 IN A 	10.15.1.1
    washington 	 IN A  	10.15.1.1
    vail 	 IN A  	10.15.1.15
    ajax 	 IN A  	10.50.1.103
    alta 	 IN A  	192.168.83.2
    blackcomb 	 IN A  	10.50.1.15
    chamonix 	 IN A  	10.15.1.11
    colorado 	 IN A  	198.27.XXX.ZZZ
    frontrange 	 IN A  	192.99.XXX.ZZZ
    osx5 	 IN A  	10.15.1.100
    prima 	 IN A  	10.75.1.20
    telluride 	 IN A  	10.75.1.1
    verbier 	 IN A  	10.75.1.15
    wintergreen 	 IN A  	10.50.1.107
    winterpark 	 IN A  	144.217.XXX.ZZZ
    yonder 	 IN A  	10.75.1.25
    zermatt 	 IN A  	10.15.1.115
    aspen 	 IN A  	10.50.1.1
    elkrange 	 IN A  	217.182.XXX.ZZZ
    rockies 	 IN A  	217.182.XXX.ZZZ
    highline	IN A	10.15.1.105
    
    


  • @SpaceBass:

    I know that the nameserver should be a hostname, not an IP, but if I fix that, it breaks things and I have to restore.

    I've also tried completely deleting the zone and recreating it from scratch with only one host; that also results in SERVFAIL.
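
The point above about the nameserver needing to be a hostname rather than an IP can be checked mechanically. Below is a minimal sketch, not a full zone parser: it only looks at single-line `NS` records and the first line of an `SOA` record, and flags targets that parse as IP addresses. The function name `find_ip_targets` and the trimmed `SAMPLE` zone are made up for illustration.

```python
import ipaddress
import re

# Matches "<owner> [ttl] [IN] NS <target>" and "<owner> [ttl] [IN] SOA <mname> ...".
# Simplified on purpose: multi-line records and $INCLUDE are not handled.
RECORD_RE = re.compile(
    r"^\s*\S+\s+(?:\d+\s+)?(?:IN\s+)?(NS|SOA)\s+(\S+)", re.IGNORECASE
)


def find_ip_targets(zone_text):
    """Return (line_number, record_type, target) for NS/SOA names that parse as IPs."""
    findings = []
    for lineno, line in enumerate(zone_text.splitlines(), start=1):
        line = line.split(";", 1)[0]  # drop zone-file comments
        match = RECORD_RE.match(line)
        if not match:
            continue
        rtype, target = match.group(1).upper(), match.group(2)
        # Strip escape backslashes and the trailing root dot before testing.
        candidate = target.replace("\\", "").rstrip(".")
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # a normal hostname, nothing to report
        findings.append((lineno, rtype, target))
    return findings


# Hypothetical, trimmed-down version of the zone posted in this thread.
SAMPLE = """\
$TTL 43200
$ORIGIN nsnet.us.
nsnet.us. IN SOA 10.15.1.1. zonemaster.nsnet.us. (
        2496684162 1d 2h 4w 1h )
@ IN NS 10.15.1.1.
@ IN A 10.15.1.1
washington IN A 10.15.1.1
"""

for lineno, rtype, target in find_ip_targets(SAMPLE):
    print(f"line {lineno}: {rtype} target {target} looks like an IP address")
```

For a real validation pass, the `named-checkzone` tool that ships with BIND is the better choice; the sketch only illustrates the specific hostname-versus-IP check discussed here.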
    
    


  • I know I'm reviving an old thread but maybe someone can use this information.

    I've just had the same thing: even changing the serial makes BIND respond with SERVFAIL.

    We are using dynamic updates from DHCP, so BIND keeps a journal for the zone (https://ftp.isc.org/www/bind/arm95/Bv9ARM.ch04.html#dynamic_update).

    I noticed this in "Status -> System Logs -> System -> DNS Resolver" (/var/log/resolver.log; you may have to dig into history):

    Aug 16 14:33:34 pfsense named[27097]: zone <zone>/IN/<zone>: journal rollforward failed: journal out of sync with zone
    Aug 16 14:33:34 pfsense named[27097]: zone <zone>/IN/<zone>: not loaded due to errors.
    

    I solved this by first syncing the journal (log in as admin/root on pfSense):

    rndc -c /cf/named/etc/namedb/rndc.conf sync -clean
    

    Then I restarted BIND using the GUI.
    I changed the serial again and the problem did not come back.
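
To spot this condition without scrolling through the log by hand, something like the following could be used. It is a rough sketch that assumes the log lines look like the two quoted above; the function name `zones_with_journal_errors`, the zone name `example.lan`, and the trailing `/internal` part (which appears to be a view name) are placeholders, not taken from a real system.

```python
import re

# Matches lines such as:
#   ... named[27097]: zone <zone>/IN/<view>: journal rollforward failed: <reason>
# The "/<view>" part is treated as optional.
JOURNAL_ERROR_RE = re.compile(
    r"zone (?P<zone>\S+?)/IN(?:/\S+)?: journal rollforward failed: (?P<reason>.+)$"
)


def zones_with_journal_errors(log_lines):
    """Map zone name -> failure reason for every rollforward error found."""
    failures = {}
    for line in log_lines:
        match = JOURNAL_ERROR_RE.search(line)
        if match:
            failures[match.group("zone")] = match.group("reason").strip()
    return failures


# Hypothetical sample modelled on the log excerpt in the post above.
SAMPLE_LOG = [
    "Aug 16 14:33:34 pfsense named[27097]: zone example.lan/IN/internal: "
    "journal rollforward failed: journal out of sync with zone",
    "Aug 16 14:33:34 pfsense named[27097]: zone example.lan/IN/internal: "
    "not loaded due to errors.",
]

for zone, reason in zones_with_journal_errors(SAMPLE_LOG).items():
    print(f"{zone}: {reason}")
```

On a live system the lines would come from /var/log/resolver.log instead of `SAMPLE_LOG`; any zone reported this way is a candidate for the `rndc ... sync -clean` step described above.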



  • Hi,

    Something you didn't mention, but probably did:
    When changing, for example, the SOA in a zone file, and this zone is also updated dynamically (RFC 2136), you first have to run:

    rndc freeze <zone>
    

    Only then can you open, edit, and save the zone file. When you are done:

    rndc reload <zone>
    rndc thaw <zone>
    

    Syncing the ".jnl" (journal) file is OK, but BIND keeps the actual working zone in memory, not in the file you are editing.

    Btw: I don't know whether pfSense does all this for you, but when you edit the file by hand, use these three rndc commands.
    The action is visible in the "general" log, if the BIND package has such a facility.
    Example:

    16-Aug-2018 15:30:14.300 general: received control channel command 'freeze home.brit-hotel-fumel.fr'
    16-Aug-2018 15:30:14.300 general: freezing zone 'home.brit-hotel-fumel.fr/IN': success
    16-Aug-2018 15:30:27.940 general: received control channel command 'reload home.brit-hotel-fumel.fr'
    16-Aug-2018 15:30:35.636 general: received control channel command 'thaw home.brit-hotel-fumel.fr'
    16-Aug-2018 15:30:35.636 general: thawing zone 'home.brit-hotel-fumel.fr/IN': success
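
The freeze, edit, reload, thaw order described in this post can be wrapped in a small helper so that no step gets skipped. This is only a sketch under assumptions: the helper names `zone_edit_commands` and `run` are invented here, `example.lan` is a placeholder zone, and the `rndc.conf` path is the one quoted earlier in the thread. It defaults to a dry run that just prints the commands.

```python
import shlex
import subprocess


def zone_edit_commands(zone, rndc_conf=None):
    """Build the rndc calls to run before and after hand-editing a dynamic zone."""
    base = ["rndc"]
    if rndc_conf:
        base += ["-c", rndc_conf]
    return {
        # Suspend dynamic updates so the zone file on disk is safe to edit.
        "before_edit": [base + ["freeze", zone]],
        # Load the edited file, then re-enable dynamic updates.
        "after_edit": [base + ["reload", zone], base + ["thaw", zone]],
    }


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


plan = zone_edit_commands("example.lan", "/cf/named/etc/namedb/rndc.conf")
run(plan["before_edit"])  # dry run: only prints the freeze command
# ... edit and save the zone file here, bumping the serial ...
run(plan["after_edit"])   # dry run: prints reload, then thaw
```

Passing `dry_run=False` would actually invoke `rndc`, which needs to happen on the BIND host with sufficient privileges.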