Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    DNS/DHCP stop working suddenly

    Scheduled Pinned Locked Moved General pfSense Questions
    59 Posts 6 Posters 8.6k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • stephenw10S
      stephenw10 Netgate Administrator
      last edited by

      This is the first time you've seen this issue? It might not be the disk, I would want to see something conclusive. If it is you could just replace the disk though.

      M 1 Reply Last reply Reply Quote 0
      • M
        michmoor LAYER 8 Rebel Alliance @stephenw10
        last edited by

        @stephenw10
        It happened again..
        Got some more data that may or may not help.

        1. Internet suddenly goes out so i go straight into console.
        2. Nothing on the console screen. Usual output for selecting options.

        WAN port loses IP address

        Mar  1 15:22:59 GAFW dhclient[38626]: No working leases in persistent database - sleeping.
        Mar  1 15:22:59 GAFW dhclient[30526]: FAIL
        Mar  1 15:23:00 GAFW dhclient[38626]: connection closed
        Mar  1 15:23:00 GAFW dhclient[38626]: exiting.
        Mar  1 15:23:01 GAFW dhclient[41673]: PREINIT
        Mar  1 15:23:01 GAFW dhclient[43008]: EXPIRE
        Mar  1 15:23:01 GAFW dhclient[44238]: Deleting old routes
        Mar  1 15:23:01 GAFW dhclient[45556]: PREINIT
        Mar  1 15:23:01 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 2
        Mar  1 15:23:03 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 5
        Mar  1 15:23:08 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 7
        

        CPU usage using top -aSH command.

        last pid: 16707;  load averages:  4.63,  3.67,  2.21                                                                                                         up 0+15:43:45  15:11:13
        519 threads:   9 running, 472 sleeping, 38 waiting
        CPU:  2.0% user,  0.0% nice, 26.4% system,  0.0% interrupt, 71.7% idle
        Mem: 398M Active, 1576M Inact, 1347M Wired, 4465M Free
        ARC: 319M Total, 119M MFU, 187M MRU, 875K Anon, 2154K Header, 10M Other
             261M Compressed, 759M Uncompressed, 2.91:1 Ratio
        Swap: 1024M Total, 1024M Free
        
          PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
            0 root        -60    -     0B  2128K CPU3     3  13:49  99.92% [kernel{if_io_tqg_3}]
           11 root        187 ki31     0B    64K RUN      2 872:54  95.83% [idle{idle: cpu2}]
           11 root        187 ki31     0B    64K CPU1     1 867:56  95.08% [idle{idle: cpu1}]
           11 root        187 ki31     0B    64K CPU0     0 878:02  93.18% [idle{idle: cpu0}]
        60326 root         68    0   114M    51M accept   1   0:02   8.47% php-fpm: pool nginx (php-fpm)
        81924 www          20    0    43M    23M kqread   1  12:44   1.15% /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -x /tmp/haproxy.socket -sf 465
        20240 unbound      20    0   192M   144M kqread   0   0:01   0.60% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
        42858 root         20    0    13M  3084K select   2   4:35   0.57% /usr/sbin/syslogd -O rfc3164 -s -c -c -l /var/dhcpd/var/run/log -l /tmp/haproxy_chroot/var/run/log -P /var/run/sy
        20240 unbound      20    0   192M   144M kqread   1   0:01   0.53% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
        63614 root         20    0  1196M   733M uwait    1   4:34   0.48% /usr/local/bin/suricata --netmap -D -c /usr/local/etc/suricata/suricata_12791_igc1/suricata.yaml --pidfile /var/r
            0 root        -64    -     0B  2128K -        1   1:21   0.28% [kernel{dummynet}]
            0 root        -12    -     0B  2128K -        1   0:41   0.23% [kernel{z_wr_iss_0}]
        48900 root         20    0    13M  3520K bpf      2   2:43   0.22% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
            0 root        -12    -     0B  2128K -        1   0:41   0.21% [kernel{z_wr_iss_1}]
        

        Firewall: NetGate,Palo Alto-VM,Juniper SRX
        Routing: Juniper, Arista, Cisco
        Switching: Juniper, Arista, Cisco
        Wireless: Unifi, Aruba IAP
        JNCIP,CCNP Enterprise

        M 1 Reply Last reply Reply Quote 0
        • M
          michmoor LAYER 8 Rebel Alliance @michmoor
          last edited by

          @stephenw10

          What is also strange is that i could not resolve anything local. My own self hosted services would time out..

          C:\Users\michm>nslookup syslog.xxxxxxx
          Server:  GAFW.xxxxxxx
          Address:  192.168.50.254
          
          DNS request timed out.
              timeout was 2 seconds.
          DNS request timed out.
              timeout was 2 seconds.
          *** Request to GAFW.xxxxxxxxx timed-out
          

          I went ahead and went for the reboot through console. It could not stop the Tailscale package. This forced me to pull the plug and plug it back in.

          Netgate pfSense Plus will reboot. This may take a few minutes, depending on your hardware.
          Do you want to proceed?
          
              Y/y: Reboot normally
              R/r: Reroot (Stop processes, remount disks, re-run startup sequence)
              S: Reboot into Single User Mode (requires console access!)
              Enter: Abort
          
          Enter an option: r
          
          Netgate pfSense Plus is rerooting now.
           Stopping package arpwatch...done.
           Stopping package freeradius3...done.
           Stopping package lldpd...done.
           Stopping package WireGuard...done.
           Stopping package haproxy...done.
           Stopping package nut...done.
           Stopping package syslog-ng...done.
           Stopping package Tailscale...   <----- stuck here and could not go past
          

          I eventually lost access to the web configurator and got the nginx timed out message.

          Firewall: NetGate,Palo Alto-VM,Juniper SRX
          Routing: Juniper, Arista, Cisco
          Switching: Juniper, Arista, Cisco
          Wireless: Unifi, Aruba IAP
          JNCIP,CCNP Enterprise

          tinfoilmattT 1 Reply Last reply Reply Quote 0
          • stephenw10S
            stephenw10 Netgate Administrator
            last edited by

            OK interesting. Not the SSD then, so that's good.

            One CPU core seems to be stuck at 100%. We can only see it's interface io traffic so something attempting (and seemingly failing) to send or receive. From the fact tailscale could not be stopped I'd guess it's that somehow. Are you able to disable that as a test?

            M 1 Reply Last reply Reply Quote 0
            • M
              michmoor LAYER 8 Rebel Alliance @stephenw10
              last edited by

              @stephenw10
              I have uninstalled Tailscale.
              I’m glad I got some more data on this so thanks for the assistance.

              If this is a bug - let’s say it might be for sake of conversation - I don’t have much data to prove this isn’t just environmental to my setup. Any other data points I can gather?
              Losing a WAN ip address does make Tailscale chatty but i can’t find the link to why dns breaks for local resolution.

              Firewall: NetGate,Palo Alto-VM,Juniper SRX
              Routing: Juniper, Arista, Cisco
              Switching: Juniper, Arista, Cisco
              Wireless: Unifi, Aruba IAP
              JNCIP,CCNP Enterprise

              1 Reply Last reply Reply Quote 0
              • stephenw10S
                stephenw10 Netgate Administrator
                last edited by

                It looks like whatever is using that CPU core is preventing anything sending/receiving traffic or possible opening states. Causing problem for numerous services.

                M 1 Reply Last reply Reply Quote 0
                • M
                  michmoor LAYER 8 Rebel Alliance @stephenw10
                  last edited by michmoor

                  @stephenw10
                  Interesting and i love that theory. Seems very logical.
                  Would explain why almost all functionality related to the firewall is broken - web-configurator , dns , dhcp...
                  Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?

                  edit: So whatever is using that core is related to traffic? High throughput? This 6100 can for sure push well over 10G from my testing but looking at the monitoring i see pretty low WAN throughput considering.
                  Tailscale seems like the smoking gun but checking my historical syslog it seems that once ix3 (WAN) goes down and cant renew a dhcp address, tailscale becomes very verbose. As i discovered today i am unable to reboot the router because it gets stuck on the tailscale process. High level but the trigger is losing the WAN.
                  If we're stable this weekend, i will install tailscale again and unplug the wan..see if i can reproduce.

                  ba19cfa2-0b4b-4faf-870e-18fe8195b2d3-image.png

                  Firewall: NetGate,Palo Alto-VM,Juniper SRX
                  Routing: Juniper, Arista, Cisco
                  Switching: Juniper, Arista, Cisco
                  Wireless: Unifi, Aruba IAP
                  JNCIP,CCNP Enterprise

                  stephenw10S 1 Reply Last reply Reply Quote 0
                  • stephenw10S
                    stephenw10 Netgate Administrator @michmoor
                    last edited by

                    @michmoor said in DNS/DHCP stop working suddenly:

                    Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?

                    It is I agree. Though I'm assuming it's choking something in the network stack so it doesn't matter than then other cores are still available.

                    1 Reply Last reply Reply Quote 0
                    • tinfoilmattT
                      tinfoilmatt @michmoor
                      last edited by

                      @michmoor said in DNS/DHCP stop working suddenly:

                      I went ahead and went for the reboot through console. It could not stop the Tailscale package. This forced me to pull the plug and plug it back in.

                      what's the longest you waited for the system to shutdown gracefully when it appeared to be 'stuck" stopping Tailscale package? conservative estimate in minutes...

                      M 1 Reply Last reply Reply Quote 0
                      • M
                        michmoor LAYER 8 Rebel Alliance @tinfoilmatt
                        last edited by

                        @cyberconsultants
                        maybe 5mins or so.
                        Reboots generally do not take long

                        @stephenw10 so far since the TS package removal things have been stable since. I’ll add the package tomorrow and let you know how things go.

                        Firewall: NetGate,Palo Alto-VM,Juniper SRX
                        Routing: Juniper, Arista, Cisco
                        Switching: Juniper, Arista, Cisco
                        Wireless: Unifi, Aruba IAP
                        JNCIP,CCNP Enterprise

                        tinfoilmattT 1 Reply Last reply Reply Quote 1
                        • tinfoilmattT
                          tinfoilmatt @michmoor
                          last edited by

                          @michmoor five minutes doesn't seem reasonable given documented precaution:

                          https://docs.netgate.com/pfsense/en/latest/diagnostics/system-halt.html

                          ...and what you seemed to be dealing with more specifically.

                          abrupt and repeated power losses over time might be closer in line with your 'root cause' than any particular package. i personally think it'd be advisable to make not cutting the power until you've waited at least an unreasonable period of time given the specific reboot context (sysem updates, maintenance, system crash, etc.) a more regular practice.

                          pfSense is a software firewall that doesn't load its config strictly out of NVRAM like typical consumer devices often do. despite filesystem and backup/recovery/snapshot improvements over time, OS corruption is still a real concern with abrupt loss of system power.

                          1 Reply Last reply Reply Quote 0
                          • stephenw10S
                            stephenw10 Netgate Administrator
                            last edited by

                            In reality I don't think I've ever seen a correctly functioning install take more than 5 mins to shutdown.

                            tinfoilmattT 1 Reply Last reply Reply Quote 0
                            • tinfoilmattT
                              tinfoilmatt @stephenw10
                              last edited by

                              @stephenw10 said in DNS/DHCP stop working suddenly:

                              In reality I don't think I've ever seen a correctly functioning install take more than 5 mins to shutdown.

                              so you would pull the physical power after five minutes of Stopping package Tailscale... in some kind of unstable system state?

                              1 Reply Last reply Reply Quote 0
                              • stephenw10S
                                stephenw10 Netgate Administrator
                                last edited by

                                If the console was hung at that point then yes. I might try hitting ctl+t first to see what it's actually waiting for. That won't allow it to continue but might give a clue as to why it failed.

                                tinfoilmattT M 2 Replies Last reply Reply Quote 0
                                • tinfoilmattT
                                  tinfoilmatt @stephenw10
                                  last edited by

                                  @stephenw10 said in DNS/DHCP stop working suddenly:

                                  If the console was hung at that point then yes. I might try hitting ctl+t first to see what it's actually waiting for. That won't allow it to continue but might give a clue as to why it failed.

                                  lol

                                  1 Reply Last reply Reply Quote 0
                                  • M
                                    michmoor LAYER 8 Rebel Alliance @stephenw10
                                    last edited by michmoor

                                    @stephenw10
                                    Hey Stephen,
                                    It happened again. I collected some data to help diagnose

                                    I initially SSH'd to my firewall and was able to run a top -aSH.

                                    last pid: 37748;  load averages:  5.19,  5.30,  5.34                                                                                                         up 5+14:27:25  06:36:49
                                    536 threads:   10 running, 488 sleeping, 38 waiting
                                    CPU:  1.1% user,  0.0% nice, 26.6% system,  0.1% interrupt, 72.3% idle
                                    Mem: 471M Active, 1981M Inact, 1400M Wired, 3937M Free
                                    ARC: 341M Total, 106M MFU, 214M MRU, 2972K Anon, 2640K Header, 14M Other
                                         267M Compressed, 972M Uncompressed, 3.64:1 Ratio
                                    Swap: 1024M Total, 1024M Free
                                    
                                      PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
                                        0 root        -60    -     0B  2160K CPU3     3 260:17  99.91% [kernel{if_io_tqg_3}]
                                       11 root        187 ki31     0B    64K CPU0     0 127.5H  96.51% [idle{idle: cpu0}]
                                       11 root        187 ki31     0B    64K CPU1     1 126.0H  95.49% [idle{idle: cpu1}]
                                       11 root        187 ki31     0B    64K RUN      2 126.7H  95.47% [idle{idle: cpu2}]
                                    98424 unbound      23    0   179M   139M kqread   1   0:06   3.65% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
                                    98424 unbound      21    0   179M   139M kqread   2   0:05   2.39% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
                                    

                                    I noticed that my WAN IP has changed. Being on ATT fiber it has remained the same for over two years but its DHCP so now its time to change.

                                    All logging stopped at the same time

                                    -rw-------  1 root       wheel       572K Mar  7 02:29 filter.log.1.bz2
                                    -rw-------  1 root       wheel       534K Mar  7 02:35 filter.log.0.bz2
                                    -rw-------  1 root       wheel       214K Mar  7 02:37 resolver.log.1.bz2
                                    -rw-------  1 root       wheel       149K Mar  7 02:39 resolver.log.0.bz2
                                    -rw-------  1 root       wheel       2.9M Mar  7 02:39 auth.log
                                    -rw-------  1 root       wheel       938K Mar  7 02:39 dhcpd.log
                                    -rw-------  1 root       wheel       3.1M Mar  7 02:39 gateways.log
                                    -rw-------  1 root       wheel       544K Mar  7 02:39 ntpd.log
                                    -rw-------  1 root       wheel       2.6M Mar  7 02:39 nginx.log
                                    -rw-------  1 root       wheel       2.8M Mar  7 02:39 routing.log
                                    -rw-------  1 root       wheel       2.5M Mar  7 02:39 ipsec.log
                                    -rw-------  1 root       wheel       2.5M Mar  7 02:39 openvpn.log
                                    -rw-------  1 root       wheel       6.5M Mar  7 02:39 system.log
                                    -rw-------  1 root       wheel       8.4M Mar  7 02:39 filter.log
                                    -rw-------  1 root       wheel       426K Mar  7 02:39 resolver.log
                                    -rw-------  1 freeradius freeradius  879K Mar  7 02:40 radius.log
                                    -rw-r--r--  1 root       wheel       591B Mar  7 06:36 utx.lastlogin
                                    -rw-------  1 root       wheel        22K Mar  7 06:36 utx.log
                                    

                                    I still couldn't get to the GUI on the LAN side as it was unresponsive so from SSH I went ahead with the reboot. It was stalling on stopping certain packages so I did ctrl t

                                    Netgate pfSense Plus is rebooting now.
                                     Stopping package arpwatch...done.
                                     Stopping package freeradius3...done.
                                     Stopping package lldpd...done.
                                     Stopping package WireGuard...
                                    load: 5.71  cmd: php_wg 65403 [nanslp] 10.49r 0.68u 0.07s 3% 53036k
                                    done.
                                     Stopping package haproxy...done.
                                     Stopping package nut...done.
                                     Stopping package syslog-ng...done.
                                     Stopping package softflowd...done.
                                     Stopping package suricata...
                                    
                                     Stopping package suricata...
                                    load: 5.59  cmd: php-cgi 56784 [nanslp] 35.07r 0.54u 0.04s 0% 53360k
                                    
                                    

                                    Waiting over an hour...........yes an hour...pfsense never came back from the reboot and within that hour I lost access to the shell. So I went in through the console and the console was flooded with the following logs.

                                    arpresolve: can't allocate llinfo for 192.168.1.254 on ix3
                                    

                                    I couldn't access anything else via console. Just the flooding of the arpresolve log tied up anything. Even disconnecting the ix3 interface from the ATT modem didn't matter..arpresolve still kept flooding console.

                                    Finally, a reboot fixed it. Back online..

                                    Firewall: NetGate,Palo Alto-VM,Juniper SRX
                                    Routing: Juniper, Arista, Cisco
                                    Switching: Juniper, Arista, Cisco
                                    Wireless: Unifi, Aruba IAP
                                    JNCIP,CCNP Enterprise

                                    tinfoilmattT 1 Reply Last reply Reply Quote 0
                                    • tinfoilmattT
                                      tinfoilmatt @michmoor
                                      last edited by

                                      @michmoor sounds like a simple gateway alarm/action triggering a cascade of headache. you might review configuration of the packages whose services you're not allowing to stop, likely compounding the kludge each time you do. it'll only get longer and kludgier if you keep "resolving" it the way you are.

                                      Suricata initialization (both start and stop) is what's causing those arpresolve kernel notices.

                                      1 Reply Last reply Reply Quote 0
                                      • stephenw10S
                                        stephenw10 Netgate Administrator
                                        last edited by

                                        Do you have the service watchdog installed? That can cause problems with stopping services when incorrectly used. It should only really be used for debugging. You would see that logged though.

                                        M 1 Reply Last reply Reply Quote 0
                                        • M
                                          michmoor LAYER 8 Rebel Alliance @stephenw10
                                          last edited by michmoor

                                          @stephenw10
                                          When @cyberconsultants mentioned suricata i did immediately go to my service watchdog and I have the following enabled.

                                          2a81fa14-6f7f-47e7-94b9-1fb1f9ac5b1b-image.png

                                          This is a set up Ive had for years now. I don't mind removing these services from monitoring but i don't see how it prevented a reboot. Ive upgraded/rebooted many times with these enabled.

                                          edit: I would also say that its likely something environmentally changed here but I have no idea what it could be. Short of adding a firewall rule, the config is static. Goes down for an upgrade every few months. So whatever is causing a cascade of headaches for me (love the term) its proving difficult to isolate.

                                          Firewall: NetGate,Palo Alto-VM,Juniper SRX
                                          Routing: Juniper, Arista, Cisco
                                          Switching: Juniper, Arista, Cisco
                                          Wireless: Unifi, Aruba IAP
                                          JNCIP,CCNP Enterprise

                                          bmeeksB 1 Reply Last reply Reply Quote 0
                                          • stephenw10S
                                            stephenw10 Netgate Administrator
                                            last edited by stephenw10

                                            Well the service watchdog should log if it restarts anything so I'd expect to see that in the system logs. Unless it's not logging anything of course.

                                            It's an easy test to remove those though. Unbound is the only thing I could imagine being an issue though.

                                            1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post
                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.