Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    DNS/DHCP stop working suddenly

    Scheduled Pinned Locked Moved General pfSense Questions
    59 Posts 6 Posters 8.6k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • M
      michmoor LAYER 8 Rebel Alliance
      last edited by

      So let me set the stage
      pfsense = Netgate 6100
      OS = 23.09.1

      Its 8pm ET today and the family finished dinner and are in bed. I started getting email alerts from my Zabbix instance in Linode that a few of my servers and self-hosted sites are down. This happened before where there was an issue in the Linode cloud or maybe my VPN path could been messy with the underlay so i didn't think anything to much at first. Then my other monitoring instance at my colo (yes i have a colo and i monitor my house) started alerting on my phone that my WAN address is no longer reachable.
      Now i think there is an issue with my local ISP. I go to my basement and do a visual inspection. Lights are blinking and cable modem is green. My wife informed me that she can no longer RDP to her work computer and i can no longer browse the web on my phone. We are dealing with a home outage now...
      First things first i pinged my gateway which is my pfsense. It responds.
      Then i do an nslookup for google.com...Times out...Ok not unusual if i have a WAN outage i thought.
      Then i started noticing that my phone dropped off the network. I tried to join the wifi and im not receiving my DHCP address. I tried another wifi device and the same thing. Now my focus is on pfSense.
      I log into the console and nothing..No error message. I did notice something unusual and my IX3 , WAN port, was blinking extremely fast. Odd. Checked the system.log file and i noticed dhclient is attempting to get an address over and over and is failing. But that shouldnt have anything to do with clients in my LAN not getting an IP address fro mthe firewall. Troubleshooting for the next 30m i gave up. I restarted the firewall. Everything returned to normal.
      I am very positive there was an issue with pfSense but nothing in the logs indicates what. I log everything...everything........I started hunting in my syslog.

      Evidence 1: Unifi logs show that my client received a dns timeout from pfsense which is 192.168.14.1

      full_message
      <4>Feb 29 20:00:22 BasementWiFi6Lite 7845584827a0,U6-Lite-6.6.55+15189: kernel: [6281512.155374] [STA_TRACKER] DNS request timed out; [STA: a0:d0:dc:cc:44:d6][QUERY: 3k4wih5m6mws2qutry3nx5vcvyf331709194617.uaid.imrworldwide.com.] [DNS_SERVER :192.168.14.1] [TXN_ID d00b] [SRCPORT 62556][SRCPORT 7182]
      
      

      Evidence 2: HA proxy detected that my backend servers are unreachable. These are on vlans with the firewall as the gateway so that cant be possible

      full_message
      <128>Feb 29 20:00:18 HAProxyMasterNode haproxy[86326]: backend idrac_ipvANY has no server available!
      

      Evidence 3: dpinger losing packets

      full_message
      <12>Feb 29 20:00:20 dpinger[8674]: WAN_DHCP 162.193.210.1: Alarm latency 1818us stddev 306us loss 21%
      

      Evidence 4: Although no logs here, i was unable to access my firewall from my desktop which has a static IP. I was able to ping inter-vlan to my unifi controller but i couldnt resolve anything.

      Evidence 5: The earliest timestamp i have of the issue: I think this is pfblocker DNSBL but i dont know for sure.

      full_message
      <14>Feb 29 20:00:06 filterdns[89958]: merge_config: configuration reload
      
      

      Are there any other logs within pfSense that i can filter through and check? This really bothers me that i cant find a root cause.

      Firewall: NetGate,Palo Alto-VM,Juniper SRX
      Routing: Juniper, Arista, Cisco
      Switching: Juniper, Arista, Cisco
      Wireless: Unifi, Aruba IAP
      JNCIP,CCNP Enterprise

      M 1 Reply Last reply Reply Quote 0
      • M
        michmoor LAYER 8 Rebel Alliance @michmoor
        last edited by

        Something else i found interesting is that logs are missing - to me it seems pfSense stopped writing to logs.

        Notice that log.2 is missing. I was able to find bz2 for other logs like gateway

        /var/log: ls -ltrh | grep resolver.log
        -rw-------  1 root       wheel       267K Aug 17  2023 resolver.log.6.bz2
        -rw-------  1 root       wheel       267K Aug 17  2023 resolver.log.5.bz2
        -rw-------  1 root       wheel       272K Aug 18  2023 resolver.log.4.bz2
        -rw-------  1 root       wheel       269K Aug 18  2023 resolver.log.3.bz2
        -rw-------  1 root       wheel       317K Feb 29 21:39 resolver.log.1.bz2
        -rw-------  1 root       wheel       317K Feb 29 22:08 resolver.log.0.bz2
        -rw-------  1 root       wheel       9.0M Feb 29 22:41 resolver.log
        
        
         ls -ltrh | grep system.log
        -rw-------  1 root       wheel       1.5K Aug 12  2023 system.log.6.bz2
        -rw-------  1 root       wheel       1.9K Aug 12  2023 system.log.5.bz2
        -rw-------  1 root       wheel       1.8K Aug 12  2023 system.log.4.bz2
        -rw-------  1 root       wheel       1.3K Aug 12  2023 system.log.3.bz2
        -rw-------  1 root       wheel        17K Feb 29 21:42 system.log.1.bz2
        -rw-------  1 root       wheel        19K Feb 29 22:12 system.log.0.bz2
        -rw-------  1 root       wheel       8.2M Feb 29 22:41 system.log
        
        /var/log: ls -ltrh | grep ipsec.log
        -rw-------  1 root       wheel       256K Aug 16  2023 ipsec.log.6.bz2
        -rw-------  1 root       wheel       254K Aug 16  2023 ipsec.log.5.bz2
        -rw-------  1 root       wheel       254K Aug 17  2023 ipsec.log.4.bz2
        -rw-------  1 root       wheel       259K Aug 17  2023 ipsec.log.3.bz2
        -rw-------  1 root       wheel       218K Feb 28 13:07 ipsec.log.1.bz2
        -rw-------  1 root       wheel       218K Feb 29 15:16 ipsec.log.0.bz2
        -rw-------  1 root       wheel       2.5M Feb 29 22:42 ipsec.log
        

        Firewall: NetGate,Palo Alto-VM,Juniper SRX
        Routing: Juniper, Arista, Cisco
        Switching: Juniper, Arista, Cisco
        Wireless: Unifi, Aruba IAP
        JNCIP,CCNP Enterprise

        1 Reply Last reply Reply Quote 0
        • stephenw10S
          stephenw10 Netgate Administrator
          last edited by

          Those are the sort of symptoms you get if pfSense either loses access to the disk or the disk is full.

          I would expect to see a whole bunch of errors in the console in either of those scenarios. Were you able to check the console at the time?

          M 1 Reply Last reply Reply Quote 0
          • M
            michmoor LAYER 8 Rebel Alliance @stephenw10
            last edited by

            @stephenw10
            I did a visual inspection and i logged into the console.
            Console had no errors. Nothing on screen other than the default output of all the interfaces and the number options.
            Visually, the WAN interface was blinking extremely fast. Abnormal but thats about it.

            Ive been checking syslog all day today and still nothing.

            Just checked SMART and disk status is normal.

            === START OF SMART DATA SECTION ===
            SMART overall-health self-assessment test result: PASSED
            
            SMART/Health Information (NVMe Log 0x02)
            Critical Warning:                   0x00
            Temperature:                        59 Celsius
            Available Spare:                    100%
            Available Spare Threshold:          1%
            Percentage Used:                    22%
            Data Units Read:                    34,397 [17.6 GB]
            Data Units Written:                 24,672,414 [12.6 TB]
            Host Read Commands:                 577,023
            Host Write Commands:                768,172,296
            Controller Busy Time:               10,950
            Power Cycles:                       25
            Power On Hours:                     6,553
            Unsafe Shutdowns:                   18
            Media and Data Integrity Errors:    0
            Error Information Log Entries:      0
            Warning  Comp. Temperature Time:    0
            Critical Comp. Temperature Time:    0
            Temperature Sensor 1:               83 Celsius
            Temperature Sensor 2:               59 Celsius
            Temperature Sensor 3:               60 Celsius
            Temperature Sensor 4:               61 Celsius
            Thermal Temp. 1 Transition Count:   1
            Thermal Temp. 1 Total Time:         25290
            
            Error Information (NVMe Log 0x01, 16 of 64 entries)
            No Errors Logged
            
            

            Firewall: NetGate,Palo Alto-VM,Juniper SRX
            Routing: Juniper, Arista, Cisco
            Switching: Juniper, Arista, Cisco
            Wireless: Unifi, Aruba IAP
            JNCIP,CCNP Enterprise

            1 Reply Last reply Reply Quote 0
            • stephenw10S
              stephenw10 Netgate Administrator
              last edited by

              And nowhere near full?

              M 1 Reply Last reply Reply Quote 0
              • M
                michmoor LAYER 8 Rebel Alliance @stephenw10
                last edited by

                @stephenw10 Its in good shape

                [23.09.1-RELEASE][admin@GAFW]/root: df -H
                Filesystem                                    Size    Used   Avail Capacity  Mounted on
                pfSense/ROOT/UseThis_Current                   93G    1.8G     91G     2%    /
                devfs                                         1.0k      0B    1.0k     0%    /dev
                pfSense/tmp                                    91G    602k     91G     0%    /tmp
                pfSense/var                                    91G     42M     91G     0%    /var
                pfSense/home                                   91G    111k     91G     0%    /home
                pfSense                                        91G     98k     91G     0%    /pfSense
                pfSense/var/tmp                                91G    492k     91G     0%    /var/tmp
                pfSense/var/log                                91G     87M     91G     0%    /var/log
                pfSense/var/db                                 93G    1.3G     91G     1%    /var/db
                pfSense/var/cache                              91G    9.7M     91G     0%    /var/cache
                pfSense/reservation                           103G     98k    103G     0%    /pfSense/reservation
                pfSense/ROOT/UseThis_Current/cf                91G    7.5M     91G     0%    /cf
                pfSense/ROOT/UseThis_Current/var_cache_pkg     91G    233M     91G     0%    /var/cache/pkg
                pfSense/ROOT/UseThis_Current/var_db_pkg        91G     11M     91G     0%    /var/db/pkg
                tmpfs                                         4.2M    332k    3.9M     8%    /var/run
                /lib                                           93G    1.8G     91G     2%    /var/unbound/lib
                devfs                                         1.0k      0B    1.0k     0%    /var/unbound/dev
                /var/log/pfblockerng                           91G     87M     91G     0%    /var/unbound/var/log/pfblockerng
                /usr/local/share/GeoIP                         93G    1.8G     91G     2%    /var/unbound/usr/local/share/GeoIP
                /usr/local/bin                                 93G    1.8G     91G     2%    /var/unbound/usr/local/bin
                /usr/local/lib                                 93G    1.8G     91G     2%    /var/unbound/usr/local/lib
                devfs                                         1.0k      0B    1.0k     0%    /var/dhcpd/dev
                
                

                Firewall: NetGate,Palo Alto-VM,Juniper SRX
                Routing: Juniper, Arista, Cisco
                Switching: Juniper, Arista, Cisco
                Wireless: Unifi, Aruba IAP
                JNCIP,CCNP Enterprise

                1 Reply Last reply Reply Quote 0
                • stephenw10S
                  stephenw10 Netgate Administrator
                  last edited by

                  Yup no problem there. Hmm, hard to say then without any logs to check from the time. It really looks like a disk issue though from those symptoms.

                  M 2 Replies Last reply Reply Quote 0
                  • M
                    michmoor LAYER 8 Rebel Alliance @stephenw10
                    last edited by

                    @stephenw10
                    It's possible. Seeing how all logs get pumped to my syslog collector is there something i should search for?
                    Its also possible that because im missing .bz2 logs as i showed above they probably didnt make it to syslog.

                    Firewall: NetGate,Palo Alto-VM,Juniper SRX
                    Routing: Juniper, Arista, Cisco
                    Switching: Juniper, Arista, Cisco
                    Wireless: Unifi, Aruba IAP
                    JNCIP,CCNP Enterprise

                    1 Reply Last reply Reply Quote 0
                    • M
                      michmoor LAYER 8 Rebel Alliance @stephenw10
                      last edited by

                      @stephenw10
                      There was also something i noticed. So to recover i did a hard reboot. Unplug and plug.
                      I didnt do any checkouts because all my alerts cleared but later on in the evening the wife couldnt connect to captive portal which is weird because there is no captive portal enabled on that interface.
                      -- if you recall i had a CP issue about a week ago and i was tshooting :)

                      So it seems that pfsense loaded a previous config and not the latest config. It was simple to load the recent config from back up and restore but why did an old config load?

                      Firewall: NetGate,Palo Alto-VM,Juniper SRX
                      Routing: Juniper, Arista, Cisco
                      Switching: Juniper, Arista, Cisco
                      Wireless: Unifi, Aruba IAP
                      JNCIP,CCNP Enterprise

                      1 Reply Last reply Reply Quote 0
                      • stephenw10S
                        stephenw10 Netgate Administrator
                        last edited by

                        It will revert to the last good config if the current config file is unreadable. That could also indicate a drive issue.

                        M 1 Reply Last reply Reply Quote 0
                        • M
                          michmoor LAYER 8 Rebel Alliance @stephenw10
                          last edited by

                          @stephenw10 OK now im nervous.
                          What are my options to either replace the 6100 or replace the SSD?

                          Firewall: NetGate,Palo Alto-VM,Juniper SRX
                          Routing: Juniper, Arista, Cisco
                          Switching: Juniper, Arista, Cisco
                          Wireless: Unifi, Aruba IAP
                          JNCIP,CCNP Enterprise

                          1 Reply Last reply Reply Quote 0
                          • stephenw10S
                            stephenw10 Netgate Administrator
                            last edited by

                            This is the first time you've seen this issue? It might not be the disk, I would want to see something conclusive. If it is you could just replace the disk though.

                            M 1 Reply Last reply Reply Quote 0
                            • M
                              michmoor LAYER 8 Rebel Alliance @stephenw10
                              last edited by

                              @stephenw10
                              It happened again..
                              Got some more data that may or may not help.

                              1. Internet suddenly goes out so i go straight into console.
                              2. Nothing on the console screen. Usual output for selecting options.

                              WAN port loses IP address

                              Mar  1 15:22:59 GAFW dhclient[38626]: No working leases in persistent database - sleeping.
                              Mar  1 15:22:59 GAFW dhclient[30526]: FAIL
                              Mar  1 15:23:00 GAFW dhclient[38626]: connection closed
                              Mar  1 15:23:00 GAFW dhclient[38626]: exiting.
                              Mar  1 15:23:01 GAFW dhclient[41673]: PREINIT
                              Mar  1 15:23:01 GAFW dhclient[43008]: EXPIRE
                              Mar  1 15:23:01 GAFW dhclient[44238]: Deleting old routes
                              Mar  1 15:23:01 GAFW dhclient[45556]: PREINIT
                              Mar  1 15:23:01 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 2
                              Mar  1 15:23:03 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 5
                              Mar  1 15:23:08 GAFW dhclient[41096]: DHCPDISCOVER on ix3 to 255.255.255.255 port 67 interval 7
                              

                              CPU usage using top -aSH command.

                              last pid: 16707;  load averages:  4.63,  3.67,  2.21                                                                                                         up 0+15:43:45  15:11:13
                              519 threads:   9 running, 472 sleeping, 38 waiting
                              CPU:  2.0% user,  0.0% nice, 26.4% system,  0.0% interrupt, 71.7% idle
                              Mem: 398M Active, 1576M Inact, 1347M Wired, 4465M Free
                              ARC: 319M Total, 119M MFU, 187M MRU, 875K Anon, 2154K Header, 10M Other
                                   261M Compressed, 759M Uncompressed, 2.91:1 Ratio
                              Swap: 1024M Total, 1024M Free
                              
                                PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
                                  0 root        -60    -     0B  2128K CPU3     3  13:49  99.92% [kernel{if_io_tqg_3}]
                                 11 root        187 ki31     0B    64K RUN      2 872:54  95.83% [idle{idle: cpu2}]
                                 11 root        187 ki31     0B    64K CPU1     1 867:56  95.08% [idle{idle: cpu1}]
                                 11 root        187 ki31     0B    64K CPU0     0 878:02  93.18% [idle{idle: cpu0}]
                              60326 root         68    0   114M    51M accept   1   0:02   8.47% php-fpm: pool nginx (php-fpm)
                              81924 www          20    0    43M    23M kqread   1  12:44   1.15% /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -x /tmp/haproxy.socket -sf 465
                              20240 unbound      20    0   192M   144M kqread   0   0:01   0.60% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
                              42858 root         20    0    13M  3084K select   2   4:35   0.57% /usr/sbin/syslogd -O rfc3164 -s -c -c -l /var/dhcpd/var/run/log -l /tmp/haproxy_chroot/var/run/log -P /var/run/sy
                              20240 unbound      20    0   192M   144M kqread   1   0:01   0.53% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
                              63614 root         20    0  1196M   733M uwait    1   4:34   0.48% /usr/local/bin/suricata --netmap -D -c /usr/local/etc/suricata/suricata_12791_igc1/suricata.yaml --pidfile /var/r
                                  0 root        -64    -     0B  2128K -        1   1:21   0.28% [kernel{dummynet}]
                                  0 root        -12    -     0B  2128K -        1   0:41   0.23% [kernel{z_wr_iss_0}]
                              48900 root         20    0    13M  3520K bpf      2   2:43   0.22% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
                                  0 root        -12    -     0B  2128K -        1   0:41   0.21% [kernel{z_wr_iss_1}]
                              

                              Firewall: NetGate,Palo Alto-VM,Juniper SRX
                              Routing: Juniper, Arista, Cisco
                              Switching: Juniper, Arista, Cisco
                              Wireless: Unifi, Aruba IAP
                              JNCIP,CCNP Enterprise

                              M 1 Reply Last reply Reply Quote 0
                              • M
                                michmoor LAYER 8 Rebel Alliance @michmoor
                                last edited by

                                @stephenw10

                                What is also strange is that i could not resolve anything local. My own self hosted services would time out..

                                C:\Users\michm>nslookup syslog.xxxxxxx
                                Server:  GAFW.xxxxxxx
                                Address:  192.168.50.254
                                
                                DNS request timed out.
                                    timeout was 2 seconds.
                                DNS request timed out.
                                    timeout was 2 seconds.
                                *** Request to GAFW.xxxxxxxxx timed-out
                                

                                I went ahead and went for the reboot through console. It could not stop the Tailscale package. This forced me to pull the plug and plug it back in.

                                Netgate pfSense Plus will reboot. This may take a few minutes, depending on your hardware.
                                Do you want to proceed?
                                
                                    Y/y: Reboot normally
                                    R/r: Reroot (Stop processes, remount disks, re-run startup sequence)
                                    S: Reboot into Single User Mode (requires console access!)
                                    Enter: Abort
                                
                                Enter an option: r
                                
                                Netgate pfSense Plus is rerooting now.
                                 Stopping package arpwatch...done.
                                 Stopping package freeradius3...done.
                                 Stopping package lldpd...done.
                                 Stopping package WireGuard...done.
                                 Stopping package haproxy...done.
                                 Stopping package nut...done.
                                 Stopping package syslog-ng...done.
                                 Stopping package Tailscale...   <----- stuck here and could not go past
                                

                                I eventually lost access to the web configurator and got the nginx timed out message.

                                Firewall: NetGate,Palo Alto-VM,Juniper SRX
                                Routing: Juniper, Arista, Cisco
                                Switching: Juniper, Arista, Cisco
                                Wireless: Unifi, Aruba IAP
                                JNCIP,CCNP Enterprise

                                tinfoilmattT 1 Reply Last reply Reply Quote 0
                                • stephenw10S
                                  stephenw10 Netgate Administrator
                                  last edited by

                                  OK interesting. Not the SSD then, so that's good.

                                  One CPU core seems to be stuck at 100%. We can only see it's interface io traffic so something attempting (and seemingly failing) to send or receive. From the fact tailscale could not be stopped I'd guess it's that somehow. Are you able to disable that as a test?

                                  M 1 Reply Last reply Reply Quote 0
                                  • M
                                    michmoor LAYER 8 Rebel Alliance @stephenw10
                                    last edited by

                                    @stephenw10
                                    I have uninstalled Tailscale.
                                    I’m glad I got some more data on this so thanks for the assistance.

                                    If this is a bug - let’s say it might be for sake of conversation - I don’t have much data to prove this isn’t just environmental to my setup. Any other data points I can gather?
                                    Losing a WAN ip address does make Tailscale chatty but i can’t find the link to why dns breaks for local resolution.

                                    Firewall: NetGate,Palo Alto-VM,Juniper SRX
                                    Routing: Juniper, Arista, Cisco
                                    Switching: Juniper, Arista, Cisco
                                    Wireless: Unifi, Aruba IAP
                                    JNCIP,CCNP Enterprise

                                    1 Reply Last reply Reply Quote 0
                                    • stephenw10S
                                      stephenw10 Netgate Administrator
                                      last edited by

                                      It looks like whatever is using that CPU core is preventing anything sending/receiving traffic or possible opening states. Causing problem for numerous services.

                                      M 1 Reply Last reply Reply Quote 0
                                      • M
                                        michmoor LAYER 8 Rebel Alliance @stephenw10
                                        last edited by michmoor

                                        @stephenw10
                                        Interesting and i love that theory. Seems very logical.
                                        Would explain why almost all functionality related to the firewall is broken - web-configurator , dns , dhcp...
                                        Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?

                                        edit: So whatever is using that core is related to traffic? High throughput? This 6100 can for sure push well over 10G from my testing but looking at the monitoring i see pretty low WAN throughput considering.
                                        Tailscale seems like the smoking gun but checking my historical syslog it seems that once ix3 (WAN) goes down and cant renew a dhcp address, tailscale becomes very verbose. As i discovered today i am unable to reboot the router because it gets stuck on the tailscale process. High level but the trigger is losing the WAN.
                                        If we're stable this weekend, i will install tailscale again and unplug the wan..see if i can reproduce.

                                        ba19cfa2-0b4b-4faf-870e-18fe8195b2d3-image.png

                                        Firewall: NetGate,Palo Alto-VM,Juniper SRX
                                        Routing: Juniper, Arista, Cisco
                                        Switching: Juniper, Arista, Cisco
                                        Wireless: Unifi, Aruba IAP
                                        JNCIP,CCNP Enterprise

                                        stephenw10S 1 Reply Last reply Reply Quote 0
                                        • stephenw10S
                                          stephenw10 Netgate Administrator @michmoor
                                          last edited by

                                          @michmoor said in DNS/DHCP stop working suddenly:

                                          Considering the 6100 has 4x cores, its not odd that a single core choking the firewall breaks connectivity?

                                          It is I agree. Though I'm assuming it's choking something in the network stack so it doesn't matter than then other cores are still available.

                                          1 Reply Last reply Reply Quote 0
                                          • tinfoilmattT
                                            tinfoilmatt @michmoor
                                            last edited by

                                            @michmoor said in DNS/DHCP stop working suddenly:

                                            I went ahead and went for the reboot through console. It could not stop the Tailscale package. This forced me to pull the plug and plug it back in.

                                            what's the longest you waited for the system to shutdown gracefully when it appeared to be 'stuck" stopping Tailscale package? conservative estimate in minutes...

                                            M 1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post
                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.