Multi-Wan Comcast And Starlink - dpinger restarts every few minutes

SageUser

I have a pair of Netgate 5100's located in separate states that have identical set-ups with Multi-wan using Comcast/Starlink and I am seeing dpinger exiting (exiting on signal 15) every minute. In understand that may be desired behavior in some circumstances but eventually I end up with sendto error: 65 for one or both of my failover wan's (usually the comcast one).

My second Netgate 5100 has dpinger restart every day or so... and I haven't seen the same issues (full disclosure - that unit has FIOS fiber, Comcast and Starlink... ). For both sites the pfsense unit is purely serving as a router/firewall with minimal bells and whistles - just tailscale and mailreport packages. DNS and DHCP is just used for the router itself (DHCP/DNS provided by on-prem windows server) but am using ISC/unbound resolver for appliance.

Is there a 1st place to look for trouble shooting? Do I need to place an unmanaged switch between the WAN and pfsense to avoid triggering dpinger to restart? I switched out the comcast cable modem and upgraded from cellular model to starlink trying to resolve this issue already and it persists.

Many thanks for any ideas!

Gertjan

@SageUser said in Multi-Wan Comcast And Starlink - dpinger restarts every few minutes:

circumstances but eventually I end up with sendto error: 65 for one or both of my failover wan's (usually the comcast one).

Inspect the devices connected to your WAN interface.
Start with the pfSense "comcast" WAN interface : does this WAN interface goes down ? == is the comcast router/modem pulling down this interface ? Was the number of none-returning ICMP packet that high that dpinger pulled the plug (== reset-cycle that WAN interface) ?

SageUser

Log Files April 15.zip

Thanks Gertjan,

I have the gateway log and system log from the most recent full crash. As you can see in the gateway log @ 21:40 on April 15th the gateway goes down hard (send to 65). There are hot-plug error codes going back to 20:06 on that same day in the system log (I don't have anything going back before that - didn't retrieve the log files fast enough). Unfortunately I lost all remote access and had to resort to a non-technical user powering down the entire equipment closet to recover.

Both before the crash and after I can see the behavior in the gateway log that is suspect with dpinger triggering signal 15 every few minutes - starting further apart and gradually happening every minute.

I called out a snippet of the suspect gateway activity below - I still have that activity in my logs from today but I do NOT have the hot plug errors. Maybe this is unrelated.

Apr 16 07:59:34 kiawahpfsense dpinger[59265]: exiting on signal 15
Apr 16 07:59:34 kiawahpfsense dpinger[58690]: exiting on signal 15
Apr 16 07:59:34 kiawahpfsense dpinger[96778]: send_interval 5000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 20000ms latency_alarm 900ms loss_alarm 40% alarm_hold 200000ms dest_addr 1.1.1.1 bind_addr 100.68.67.17 identifier "Starlink "
Apr 16 07:59:42 kiawahpfsense dpinger[96778]: exiting on signal 15
Apr 16 07:59:42 kiawahpfsense dpinger[3018]: send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 2000ms latency_alarm 700ms loss_alarm 30% alarm_hold 20000ms dest_addr 75.75.75.75 bind_addr 24.11.XXX.XX identifier "WAN_DHCP "
Apr 16 07:59:42 kiawahpfsense dpinger[3327]: send_interval 5000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 20000ms latency_alarm 900ms loss_alarm 40% alarm_hold 200000ms dest_addr 1.1.1.1 bind_addr 100.68.67.17 identifier "Starlink "
Apr 16 08:09:29 kiawahpfsense dpinger[3327]: exiting on signal 15
Apr 16 08:09:29 kiawahpfsense dpinger[3018]: exiting on signal 15
Apr 16 08:09:29 kiawahpfsense dpinger[91728]: send_interval 5000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 20000ms latency_alarm 900ms loss_alarm 40% alarm_hold 200000ms dest_addr 1.1.1.1 bind_addr 100.68.67.17 identifier "Starlink "
Apr 16 08:09:37 kiawahpfsense dpinger[91728]: exiting on signal 15
Apr 16 08:09:37 kiawahpfsense dpinger[88394]: send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 2000ms latency_alarm 700ms loss_alarm 30% alarm_hold 20000ms dest_addr 75.75.75.75 bind_addr 24.11.XXX.XX identifier "WAN_DHCP "
Apr 16 08:09:37 kiawahpfsense dpinger[88789]: send_interval 5000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 20000ms latency_alarm 900ms loss_alarm 40% alarm_hold 200000ms dest_addr 1.1.1.1 bind_addr 100.68.67.17 identifier "Starlink "
Apr 16 08:57:54 kiawahpfsense dpinger[88789]: exiting on signal 15
Apr 16 08:57:54 kiawahpfsense dpinger[88394]: exiting on signal 15
Apr 16 08:57:54 kiawahpfsense dpinger[89068]: send_interval 5000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 20000ms latency_alarm 900ms loss_alarm 40% alarm_hold 200000ms dest_addr 1.1.1.1 bind_addr 100.68.67.17 identifier "Starlink "
Apr 16 08:58:02 kiawahpfsense dpinger[89068]: exiting on signal 15
Apr 16 08:58:02 kiawahpfsense dpinger[617]: send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 2000ms latency_alarm 700ms loss_alarm 30% alarm_hold 20000ms dest_addr 75.75.75.75 bind_addr 24.11.XXX.XX identifier "WAN_DHCP "
Apr 16 08:58:02 kiawahpfsense dpinger[756]: send_interval 5000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 20000ms latency_alarm 900ms loss_alarm 40% alarm_hold 200000ms dest_addr 1.1.1.1 bind_addr 100.68.67.17 identifier "Starlink "
Apr 16 09:00:09 kiawahpfsense dpinger[756]: exiting on signal 15
Apr 16 09:00:09 kiawahpfsense dpinger[617]: exiting on signal 15
Apr 16 09:00:09 kiawahpfsense dpinger[98264]: send_interval 5000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 20000ms latency_alarm 900ms loss_alarm 40% alarm_hold 200000ms dest_addr 1.1.1.1 bind_addr 100.68.67.17 identifier "Starlink "

Gertjan

@SageUser

About the snippet : something is missing : look at the same moment the System log, that's the place where system events like Interfaces going down are logged.
That said,

Btw :

@SageUser said in Multi-Wan Comcast And Starlink - dpinger restarts every few minutes:

1.1.1.1

your using 1.1.1.1 as a dpinger destination.
Be ware : that's an IP that can and should be use for : DNS requests.
1.1.1.1 main goal is serving DNS requests - not answering to ICMP (ping) request. Guess what it does when it gets a little bit busy doing so ? It starts to drop less important traffic, like ICMP traffic.
Using 1.1.1.1 or 8.8.8.8 or other classic "let's try that IP get things up and running fast" is ok for a moment, but later on, when finalizing the (pfSense) WAN setup : try to get an IP of a more close Starlink router that is more suite for the job.
Use for example "tracert" to find one.

At Apr 16 07:59:34 a dpinger process started with PID 96778 - and it's exits 8 seconds later at Apr 16 07:59:42 ... that quiet fast indeed, that's not "normal" at all.

Can you ping from a command line to 1.1.1.1 yourself and see what is happening ?

Also : between pfSense, the WAN interface, and the uplink WAN device (a Starlink router ?) can you place a switch and see what happens ?

SageUser

From Pfsense itself ping to 1.1.1.1 looks good. I am using "monitor" addresses other than the gateway/ist hop due to Comcast and Starlink's reported issues discarding ping's inside their infrastructure. If I should change that for troubleshooting I can.

Of course this is a remote site so lot's of fun if it goes down hard.

PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=58 time=43.095 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=58 time=40.778 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=58 time=49.599 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 40.778/44.491/49.599/3.734 ms

SageUser

@Gertjan

You are correct that the system log was really busy at 7:59. See below (link state / hot plug). My problem is that i think dpinger is restarting every handful of minutes for weeks until something else goes wrong. To recover from the April 16th events a local user had to flip the master power off and hot plug events went away afterwards.

I'm going to try reverting to a 1-hop away router instead of a DNS provider for now and see what develops.

Apr 16 07:59:32 kiawahpfsense check_reload_status[483]: Linkup starting igb0
Apr 16 07:59:32 kiawahpfsense kernel: igb0: link state changed to DOWN
Apr 16 07:59:33 kiawahpfsense php-fpm[6717]: /rc.linkup: Hotplug event detected for WAN(wan) dynamic IP address (4: dhcp)
Apr 16 07:59:33 kiawahpfsense php-fpm[6717]: /rc.linkup: DEVD Ethernet detached event for wan
Apr 16 07:59:34 kiawahpfsense php-fpm[6717]: /rc.linkup: Removing static route for monitor 1.1.1.1 and adding a new route through 100.64.0.1
Apr 16 07:59:34 kiawahpfsense rc.gateway_alarm[98543]: >>> Gateway alarm: WAN_DHCP (Addr:24.11.139.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Apr 16 07:59:34 kiawahpfsense check_reload_status[483]: updating dyndns WAN_DHCP
Apr 16 07:59:34 kiawahpfsense check_reload_status[483]: Restarting IPsec tunnels
Apr 16 07:59:34 kiawahpfsense check_reload_status[483]: Restarting OpenVPN tunnels/interfaces
Apr 16 07:59:34 kiawahpfsense check_reload_status[483]: Reloading filter
Apr 16 07:59:36 kiawahpfsense check_reload_status[483]: Linkup starting igb0
Apr 16 07:59:36 kiawahpfsense kernel: igb0: link state changed to UP
Apr 16 07:59:38 kiawahpfsense php-fpm[81990]: /rc.openvpn: dpinger: timeout while retrieving status for gateway Starlink
Apr 16 07:59:38 kiawahpfsense php-fpm[45596]: /rc.dyndns.update: dpinger: timeout while retrieving status for gateway Starlink
Apr 16 07:59:38 kiawahpfsense php-fpm[80641]: /rc.filter_configure_sync: dpinger: timeout while retrieving status for gateway Starlink
Apr 16 07:59:38 kiawahpfsense php-fpm[81990]: /rc.openvpn: Gateway, switch to: Starlink
Apr 16 07:59:38 kiawahpfsense php-fpm[80641]: /rc.filter_configure_sync: Gateway, switch to: Starlink
Apr 16 07:59:38 kiawahpfsense php-fpm[45596]: /rc.dyndns.update: Gateway, switch to: Starlink
Apr 16 07:59:38 kiawahpfsense php-fpm[45596]: /rc.dyndns.update: Default gateway setting Starlink as default.
Apr 16 07:59:38 kiawahpfsense php-fpm[80641]: /rc.filter_configure_sync: Default gateway setting Starlink as default.
Apr 16 07:59:38 kiawahpfsense php-fpm[81990]: /rc.openvpn: Default gateway setting Starlink as default.
Apr 16 07:59:38 kiawahpfsense php-fpm[45596]: /rc.dyndns.update: Dynamic DNS: updatedns() starting
Apr 16 07:59:38 kiawahpfsense php-fpm[45596]: /rc.dyndns.update: Dynamic DNS dyndns-custom (Kiawah.uvajones.org): _checkIP() starting.

Gertjan

@SageUser said in Multi-Wan Comcast And Starlink - dpinger restarts every few minutes:

My problem is that i think dpinger is restarting every handful of minutes for weeks until something else goes wrong

dpinger restarts the interface if the ping losses goes above the max allowed, which is 40 % for your starlink and 30 % for your WAN_DHCP interface.

But looking at the system logs, its not dpinger that pulled in down. It was an external event = the other side of the WAN interface disabled the connection for a moment.

Two tests : disable the dpinger action for a while :

if the issue goes away : it was dpinger. The issue is still there : the issue is on the other side of the WAN cable.
This :

@SageUser said in Multi-Wan Comcast And Starlink - dpinger restarts every few minutes:

DEVD Ethernet detached event for wan

looks like a physical event.

@SageUser said in Multi-Wan Comcast And Starlink - dpinger restarts every few minutes:

had to flip the master power off and hot plug events went away afterwards

Be careful. pfSense isn't a light bulb. It has a filesystem that loves to corrupt itself when a sudden power death is introduced. Which means : chances are it won't boot again. Treat pfSense like your phone or PC : use the soft command to shut it down.

@SageUser said in Multi-Wan Comcast And Starlink - dpinger restarts every few minutes:

route through 100.64.0.1

CGNAT .... Starlink ?

SageUser

I am hoping this fairly ancient -5100 appliance holds on. It's been through a lot including several dead cable modems. However, physical intervention has to wait until my next in-person visit. I only have non-technical people on site (as indicated by the hard power off recovery).

The site is in an area with buried cable and a high water table which is a recipe for disaster (eats a cable modem every 18 months) and explains the starlink back-up. We use tailscale so remote access is fairly tolerant of CGNAT and our windows DC / local users can phone home and network admin can remote in. When the comcast circuit is down I lose remote admin via the fqdn/public IP which is stressful as I'm reliant on tailscale coming up using starlink.

As a first order trouble shooting step I'm stepping up a local VM as a tailscale client to give myself a back door if the netgate box becomes unreachable again and potentially to automatically attempt a pfsense reboot in the event of a sustained loss of connectivity.

I will experiment with turning off gateway monitoring after that and watch for physical events. Next time I'm on site I'm going to insert a fiber media converter between the cable modem and the netgate (or it's successor if I can get the budget for it) to remove physical plug events from cable modem reboots and risk of electrical shocks on the wan port.