Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    Dual WAN monitoring failing

    Scheduled Pinned Locked Moved General pfSense Questions
    14 Posts 2 Posters 1.2k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • X
      Ximulate
      last edited by

      I have pfSense with a dual WAN gateway group. WAN1 is DSL PPOE connection; WAN2 is wISP DHCP on CGNAT. Monitoring for WAN2 is not reliable.

      If I "Save" System>Routing>Gateway>WAN2 configuration, monitoring will initially work. "pfSsh.php playback gatewaystatus" will list both gateways.

      After some short period of time (I haven't figured out exact time frame), WAN2 status becomes Unknown. "pfSsh.php playback gatewaystatus" lists only WAN1, but not WAN2.

      Doesn't matter if Monitor IP is blank (default) or I use various DNS servers (like 1.1.1.1 or 8.8.8.8). WAN2 will initially indicate Online, but eventually reverts to Unknown. Saving the config as described above temporarly resolves the issue.

      Any suggestions would be appreciated.

      1 Reply Last reply Reply Quote 0
      • stephenw10S
        stephenw10 Netgate Administrator
        last edited by

        What do the system and gateway logs show when it moves from on-line to unknown?

        Steve

        X 1 Reply Last reply Reply Quote 0
        • X
          Ximulate @stephenw10
          last edited by Ximulate

          I may need to wait for it to go Unknown again to get a better idea, but:

          Jan 19 02:04:00 	sshguard 	31974 	Now monitoring attacks.
          Jan 19 04:31:03 	kernel 		igb2: link state changed to DOWN
          Jan 19 04:31:03 	check_reload_status 	403 	Linkup starting igb2
          Jan 19 04:31:04 	php-fpm 	31244 	/rc.linkup: DEVD Ethernet detached event for opt2
          Jan 19 04:31:06 	php-fpm 	31244 	/rc.linkup: Removing static route for monitor 1.0.0.1 and adding a new route through 10.28.XX.1
          Jan 19 04:31:06 	check_reload_status 	403 	Reloading filter
          Jan 19 04:31:10 	kernel 		arpresolve: can't allocate llinfo for 192.169.XX.1 on igb2
          Jan 19 04:31:11 	kernel 		arpresolve: can't allocate llinfo for 192.169.XX.1 on igb2
          

          arpresolve repeats many times

          Jan 19 04:35:14 	kernel 		arpresolve: can't allocate llinfo for 192.169.XX.1 on igb2
          Jan 19 04:35:55 	kernel 		arpresolve: can't allocate llinfo for 192.169.XX.1 on igb2
          

          followed by

          Jan 19 04:59:32 	check_reload_status 	403 	Linkup starting igb2
          Jan 19 04:59:32 	kernel 		igb2: link state changed to UP
          Jan 19 04:59:33 	php-fpm 	34935 	/rc.linkup: DEVD Ethernet attached event for opt2
          Jan 19 04:59:33 	php-fpm 	34935 	/rc.linkup: HOTPLUG: Configuring interface opt2
          Jan 19 05:00:00 	kernel 		igb2: link state changed to DOWN
          Jan 19 05:00:00 	check_reload_status 	403 	Linkup starting igb2
          Jan 19 05:00:29 	kernel 		igb2: link state changed to UP
          Jan 19 05:00:29 	check_reload_status 	403 	Linkup starting igb2
          Jan 19 05:01:27 	check_reload_status 	403 	rc.newwanip starting igb2
          Jan 19 05:01:27 	check_reload_status 	403 	Restarting IPsec tunnels
          Jan 19 05:01:28 	php-fpm 	374 	/rc.newwanip: rc.newwanip: Info: starting on igb2.
          Jan 19 05:01:28 	php-fpm 	374 	/rc.newwanip: rc.newwanip: on (IP address: 192.169.XX.146) (interface: WAN2[opt2]) (real interface: igb2).
          Jan 19 05:01:28 	check_reload_status 	403 	Reloading filter
          Jan 19 05:01:33 	check_reload_status 	403 	updating dyndns opt2
          Jan 19 05:01:38 	check_reload_status 	403 	Reloading filter
          Jan 19 05:01:38 	php-fpm 	373 	/rc.linkup: DEVD Ethernet detached event for opt2
          Jan 19 05:01:39 	php-fpm 	373 	/rc.linkup: Removing static route for monitor 1.0.0.1 and adding a new route through 10.28.X.1
          Jan 19 05:01:39 	check_reload_status 	403 	Reloading filter
          Jan 19 05:01:39 	php-fpm 	16459 	/rc.linkup: DEVD Ethernet attached event for opt2
          Jan 19 05:01:39 	php-fpm 	16459 	/rc.linkup: HOTPLUG: Configuring interface opt2
          Jan 19 05:01:39 	check_reload_status 	403 	rc.newwanip starting igb2
          Jan 19 05:01:40 	check_reload_status 	403 	Restarting IPsec tunnels
          Jan 19 05:01:40 	php-fpm 	13788 	/rc.newwanip: rc.newwanip: Info: starting on igb2.
          Jan 19 05:01:40 	php-fpm 	13788 	/rc.newwanip: rc.newwanip: on (IP address: 192.169.XX.146) (interface: WAN2[opt2]) (real interface: igb2).
          Jan 19 05:01:41 	check_reload_status 	403 	Reloading filter
          Jan 19 05:01:46 	check_reload_status 	403 	updating dyndns opt2
          Jan 19 05:01:XX 	check_reload_status 	403 	Reloading filter
          Jan 19 05:01:54 	kernel 		config_aqm Unable to configure flowset, flowset busy!
          Jan 19 05:18:00 	sshguard 	31974 	Exiting on signal.
          Jan 19 05:18:00 	sshguard 	11012 	Now monitoring attacks.
          Jan 19 08:12:00 	sshguard 	11012 	Exiting on signal.
          Jan 19 08:12:00 	sshguard 	99330 	Now monitoring attacks.
          Jan 19 08:26:00 	sshguard 	99330 	Exiting on signal.
          Jan 19 08:26:00 	sshguard 	52291 	Now monitoring attacks. 
          

          Nothing in the gateway logs related to WAN2 during this time. However, I do see dpinger entries for what I think are loss alarms on WAN1

          Jan 19 01:45:40 	dpinger 	52544 	WAN2_DHCP 1.1.1.1: Alarm latency 24373us stddev 8916us loss 56%
          Jan 19 01:47:00 	dpinger 	52544 	WAN2_DHCP 1.1.1.1: Clear latency 19491us stddev 3007us loss 0%
          Jan 19 04:31:06 	dpinger 	14112 	send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 850ms loss_alarm 50% dest_addr 1.0.0.1 bind_addr AAA.BBB.CCC.DDD identifier "WAN_PPPOE "
          
          1 Reply Last reply Reply Quote 0
          • stephenw10S
            stephenw10 Netgate Administrator
            last edited by

            Hmm, that's the first thing that appears? igb2 link state changes to UP?

            I expect to see it go down before that and it might show why at that point.

            X 1 Reply Last reply Reply Quote 0
            • X
              Ximulate @stephenw10
              last edited by Ximulate

              @stephenw10 said in Dual WAN monitoring failing:

              I expect to see it go down before that and it might show why at that point.

              Updated previous post with additional prior log entries. Link state does go down first

              Also, I received several email notifications

              1:45:41 MONITOR: WAN2_DHCP has packet loss, omitting from routing group gwgBalanced
              1.1.1.1|192.169.XX.146|WAN2_DHCP|24.373ms|8.916ms|58%|down|highloss
              
              1:47:01 MONITOR: WAN2_DHCP is available now, adding to routing group gwgBalanced
              1.1.1.1|192.169.XX.146|WAN2_DHCP|19.491ms|3.007ms|0.0%|online|none
              

              Received several of this notifications since resetting the gateway earlier today (wISP in horrible weather), though the gateway status is still reporting Online at the moment.

              1 Reply Last reply Reply Quote 0
              • stephenw10S
                stephenw10 Netgate Administrator
                last edited by

                Mmm, Ok so it shows the link on igb2 actually goes down and that's the first thing that happens.

                What is that NIC actually connecting to? Is that thing losing link?

                X 1 Reply Last reply Reply Quote 0
                • X
                  Ximulate @stephenw10
                  last edited by Ximulate

                  @stephenw10 NIC is connected to a POE power injector then out to a small dish antenna (wISP is using ubiquite gear). Maybe a flaky ethernet cable? However...

                  The gateway was in Unknown state again this morning. I see what appears to be the same pattern in the logs, at about the same time. I haven't done a detailed comparision of the logs from each night yet.

                  Speculating that the wISP is doing some late night / early morning maintenace on theire side which causes a service interuption, but when they come back online my router isn't able to detect (ping).

                  1 Reply Last reply Reply Quote 0
                  • stephenw10S
                    stephenw10 Netgate Administrator
                    last edited by stephenw10

                    Try putting switch between igb2 and the injector so it never loses link as a test.

                    Can you see the uptime on the wisp client device?

                    There is a dhclient race condition you might be hitting here if the upstream device reboots.

                    Steve

                    1 Reply Last reply Reply Quote 0
                    • X
                      Ximulate
                      last edited by

                      @stephenw10
                      Adding the switch allows the gateway to recover. Not sure yet why, without the switch, dpinger seems to lose its ability to ping the given Monitor IP and therefore doesn't report the gateway back online.

                      There is a bit more complication to my set-up, and therefore, for testing, I need to simplify a few things and retest a bit longer.

                      1 Reply Last reply Reply Quote 0
                      • stephenw10S
                        stephenw10 Netgate Administrator
                        last edited by

                        So with the switch dpinger still shows monitoring failing but does it still show the link going down in the logs?

                        Try a pcap on the WAN, do you see the ping requests leaving still?

                        X 1 Reply Last reply Reply Quote 0
                        • X
                          Ximulate @stephenw10
                          last edited by Ximulate

                          @stephenw10
                          I never did resolve this issue entirely, but a few things have changed. I think the link down issue was from a faulty POE injector for the wISP equipment. I still have that switch in place though.

                          That said, WAN2 (wISP) monitoring is not working at all. WAN2 Gateway status is now constantly "Unknown", with RRT, RTTsd, Loss all reporting "pending". Doesn't seem to matter if the Monitor IP is blank or specified. Resetting the router or WAN2 gateway makes no difference.

                          I can Diagnostics > Ping using Source WAN2, for eample:

                          PING 8.8.4.4 (8.8.4.4): 56 data bytes
                          64 bytes from 8.8.4.4: icmp_seq=0 ttl=58 time=20.031 ms
                          64 bytes from 8.8.4.4: icmp_seq=1 ttl=58 time=19.862 ms
                          64 bytes from 8.8.4.4: icmp_seq=2 ttl=58 time=19.858 ms
                          
                          --- 8.8.4.4 ping statistics ---
                          3 packets transmitted, 3 packets received, 0.0% packet loss
                          round-trip min/avg/max/stddev = 19.858/19.917/20.031/0.081 ms
                          

                          and Traceroute, except I see that initial hop begins with WAN1 (Source is set for WAN2):

                          1  10.X.X.1  9.732 ms  9.313 ms  9.577 ms
                          2  100.41.18.20  11.356 ms  11.587 ms  11.522 ms
                          3  * * *
                          4  * * *
                          5  209.85.245.245  19.006 ms  21.789 ms  18.562 ms
                          6  142.250.232.97  19.071 ms  20.712 ms  19.280 ms
                          7  8.8.4.4  18.553 ms  17.874 ms  18.492 ms
                          

                          and pfcap from WAN2

                          15:45:23.837132 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:24.837138 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:29.836226 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:30.836189 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:31.836247 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:32.837259 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:33.837309 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:34.837255 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:34.860494 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 300
                          15:45:36.864170 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 300
                          15:45:39.837327 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:40.837308 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:41.837284 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:41.884901 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 300
                          15:45:42.838374 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:43.838329 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:44.838368 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:48.945150 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 300
                          15:45:49.836463 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:50.352587 IP 169.254.138.31.52198 > 255.255.255.255.10002: UDP, length 177
                          15:45:50.836438 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:51.836421 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:52.837437 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:53.837447 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:54.837443 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:45:59.837509 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:46:00.837537 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:46:01.837524 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:46:02.838573 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                          15:46:03.060206 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: UDP, length 300
                          
                          stephenw10S 1 Reply Last reply Reply Quote 0
                          • stephenw10S
                            stephenw10 Netgate Administrator @Ximulate
                            last edited by

                            @Ximulate said in Dual WAN monitoring failing:

                            15:46:00.837537 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                            15:46:01.837524 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46
                            15:46:02.838573 ARP, Request who-has 192.169.51.1 tell 192.169.51.63, length 46

                            That's the WAN2 gateway IP I assume? If that's not responding to ARP that's a very broken situation.

                            X 1 Reply Last reply Reply Quote 0
                            • X Ximulate referenced this topic on
                            • X
                              Ximulate @stephenw10
                              last edited by

                              @stephenw10
                              When this happens, the router GUI locks up & netowk devices start losing network connectivity (DHCP?). I removed the modem outright, so no back-up connection. Then, this happened this morning:
                              https://forum.netgate.com/topic/178428/php-fatal-error/44
                              I was able to log into the GUI & catch the PHP log. Typically I can't even log in to the GUI when this happens. Maybe there's more than one thing going on.

                              1 Reply Last reply Reply Quote 0
                              • stephenw10S
                                stephenw10 Netgate Administrator
                                last edited by

                                Yes, that seems like more than one thing. The notices error has been seen by users in various configs so I doubt it's directly related. It could be some common cause though if you're seeing php stop responding for example. Hmm.

                                1 Reply Last reply Reply Quote 0
                                • First post
                                  Last post
                                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.