Hourly WAN packet loss after updating to pfSense+ 23

stephenw10

Hmm, well the timestamp difference is odd. But that aside there doesn't look to be anything significant there. The gateway shows loss and triggers the other events.

Assuming 1.2.3.4 there is your real gateway address, you should change the monitoring IP to something upstream like 8.8.8.8 or 1.1.1.1. It could simply be that the gateway is loaded and drops pings.

Steve

drueter

@stephenw10 Thanks for the suggestion, but:

The problem persists even with monitoring disabled
The problem results in actual packet loss to workstations: it is not just a monitoring problem
The problem persists even with a fixed WAN IP address (I thought it might be related to DHCP renewals.)
Looking at additional graphs, there is no evidence of CPU or other load increase at the times the problem occurs.

I am out of ideas at this point and intend to re-install version 22.05. I will report back to confirm the status of the problem after that downgrade.

stephenw10

There is no increase in traffic when this happens?

drueter

@stephenw10 That is correct: there is no change in network traffic.

FYI the network nearly always has very light traffic.

stephenw10

When using an external IP do you still see the same level of packet loss? Do you also see latency increases?

Otherwise I would try to capture the ping packets when this happens and see if there are actually missing packets on the wire.
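
A minimal sketch of how such a capture could be checked, assuming the packets were saved as tcpdump's usual one-line text output for ICMP ("ICMP echo request, id N, seq N"). The function name and the sample addresses are made up for illustration. It simply lists the echo requests that never got a matching reply in the capture:

```python
import re

# Matches the ICMP echo part of a tcpdump text line (assumed output format).
ICMP_RE = re.compile(
    r"ICMP echo (?P<kind>request|reply), id (?P<id>\d+), seq (?P<seq>\d+)"
)

def unanswered_requests(lines):
    """Return sorted (id, seq) pairs that have a request but no reply."""
    requests, replies = set(), set()
    for line in lines:
        m = ICMP_RE.search(line)
        if not m:
            continue  # not an ICMP echo line
        key = (int(m["id"]), int(m["seq"]))
        if m["kind"] == "request":
            requests.add(key)
        else:
            replies.add(key)
    return sorted(requests - replies)

# Hypothetical capture lines; the addresses are placeholders.
CAPTURE = [
    "09:16:40.100000 IP 192.0.2.10 > 8.8.8.8: ICMP echo request, id 4321, seq 1, length 64",
    "09:16:40.110000 IP 8.8.8.8 > 192.0.2.10: ICMP echo reply, id 4321, seq 1, length 64",
    "09:16:41.100000 IP 192.0.2.10 > 8.8.8.8: ICMP echo request, id 4321, seq 2, length 64",
    "09:16:42.100000 IP 192.0.2.10 > 8.8.8.8: ICMP echo request, id 4321, seq 3, length 64",
    "09:16:42.112000 IP 8.8.8.8 > 192.0.2.10: ICMP echo reply, id 4321, seq 3, length 64",
]

print(unanswered_requests(CAPTURE))
```

If requests leave on the WAN and the replies never show up in the capture, the loss is happening upstream of the firewall rather than inside it.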

Are you using an igb or ix NIC for WAN? Try re-assigning it to the other type if you can.

drueter

I am starting to think this may be a wild goose chase after all: this may be an ISP problem. It could in fact be that the problem was occurring prior to the firmware upgrade and went unnoticed, and that monitoring was not enabled. (I thought/assumed that monitoring had been enabled the whole time, but I can't say for sure.)

I reverted back to 22.05 and have observed the same behavior: dropped packets each hour.

Yes, the packet loss is real when going through the pfSense (it is observable from a workstation). I have not yet been able to confirm whether the loss also occurs without the pfSense in the mix.

WAN is on igb0 and LAN is on igb1. I could try moving it, but igb0 is the default. Is there a reason why igb versus ix would make a difference?

Sep 28 09:16:44 pMyRouter rc.gateway_alarm[33556]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:10.000ms RTTsd:2.329ms Loss:21%)
Sep 28 09:18:34 pMyRouter rc.gateway_alarm[18934]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:10.422ms RTTsd:3.664ms Loss:6%)
Sep 28 10:16:52 pMyRouter rc.gateway_alarm[58429]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:8.861ms RTTsd:1.637ms Loss:21%)
Sep 28 10:18:25 pMyRouter rc.gateway_alarm[61365]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:9.553ms RTTsd:3.310ms Loss:5%)
Sep 28 11:17:05 pMyRouter rc.gateway_alarm[67044]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:8.698ms RTTsd:2.920ms Loss:22%)
Sep 28 11:18:11 pMyRouter rc.gateway_alarm[41078]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:9.716ms RTTsd:5.033ms Loss:5%)
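
To check how regular the recurrence is, the alarm lines above can be parsed with a short script. This is only a sketch: the function names are made up, the regex is written against the log format shown here, and the year is an assumption because the syslog lines do not include one.

```python
import re
from datetime import datetime

ALARM_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+rc\.gateway_alarm\[\d+\]:\s+"
    r">>> Gateway alarm:\s+(?P<gw>\S+)\s+\(Addr:(?P<addr>\S+)\s+Alarm:(?P<alarm>[01])\s+"
    r"RTT:(?P<rtt>[\d.]+)ms\s+RTTsd:(?P<rttsd>[\d.]+)ms\s+Loss:(?P<loss>\d+)%\)"
)

def parse_alarms(lines, year):
    """Parse gateway alarm lines into dicts; `year` is supplied by the caller."""
    events = []
    for line in lines:
        m = ALARM_RE.search(line)
        if not m:
            continue  # skip anything that is not a gateway alarm line
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append({"time": when, "gateway": m["gw"],
                       "alarm": int(m["alarm"]), "loss": int(m["loss"])})
    return events

def onset_intervals(events):
    """Seconds between consecutive alarm onsets (Alarm:1 lines)."""
    onsets = [e["time"] for e in events if e["alarm"] == 1]
    return [(b - a).total_seconds() for a, b in zip(onsets, onsets[1:])]

def outage_durations(events):
    """Seconds from each Alarm:1 line to the following Alarm:0 line."""
    durations, start = [], None
    for e in events:
        if e["alarm"] == 1:
            start = e["time"]
        elif start is not None:
            durations.append((e["time"] - start).total_seconds())
            start = None
    return durations

SAMPLE = [
    "Sep 28 09:16:44 pMyRouter rc.gateway_alarm[33556]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:10.000ms RTTsd:2.329ms Loss:21%)",
    "Sep 28 09:18:34 pMyRouter rc.gateway_alarm[18934]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:10.422ms RTTsd:3.664ms Loss:6%)",
    "Sep 28 10:16:52 pMyRouter rc.gateway_alarm[58429]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:8.861ms RTTsd:1.637ms Loss:21%)",
    "Sep 28 10:18:25 pMyRouter rc.gateway_alarm[61365]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:9.553ms RTTsd:3.310ms Loss:5%)",
    "Sep 28 11:17:05 pMyRouter rc.gateway_alarm[67044]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:8.698ms RTTsd:2.920ms Loss:22%)",
    "Sep 28 11:18:11 pMyRouter rc.gateway_alarm[41078]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:9.716ms RTTsd:5.033ms Loss:5%)",
]

events = parse_alarms(SAMPLE, year=2023)  # year is an assumption
print(onset_intervals(events))   # seconds between alarm onsets
print(outage_durations(events))  # seconds each alarm lasted
```

For these six lines the onsets are 3608 and 3613 seconds apart, and each alarm clears within about two minutes.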

It's kind of a strange problem (with the hourly recurrence). I'll keep working on getting to the bottom of this and will report back.

stephenw10

Yeah, if it's the same in 22.05 it's probably not a NIC/driver issue. However, on the 5100 it can be worth trying: the igb NICs are separate PCIe devices, while the ix NICs are part of the SoC.

swixo

@drueter Did you ever confirm this? I am troubleshooting a very similar issue - Hourly Gateway Alarms, knocking things offline.
I have zero leads on the issue, so looking for hints...

stephenw10

Nothing logged, just packet loss? Monitoring the gateway IP(s)? Same hardware?

swixo

@stephenw10 My case is somewhat different, but I am looking for clues. The gateway is an OpenVPN gateway; the WAN does not appear to go down, just the OpenVPN link.

What does happen is when the gateway alarm goes off, the DNS lookups that go across the VPN to the remote DNS immediately fail, and so my nagios probes detect the name resolution failure right away and report.

The pfSense logs show ">>> Gateway alarm: ... " at about 1 hour intervals, right when the DNS lookups start failing. The failures last about 2-3 minutes at most. I suspected that could be when VPN re-keying is taking place.
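
One rough way to test the re-keying suspicion, as a sketch only: OpenVPN's `reneg-sec` defaults to 3600 seconds, so if the tunnel start time is known, each alarm time can be compared against the nearest multiple of that interval. The function name, the tunnel start time and the alarm times below are all hypothetical.

```python
from datetime import datetime

def rekey_offsets(alarm_times, tunnel_start, period=3600):
    """Signed seconds from each alarm to the nearest multiple of `period`
    after `tunnel_start`. Values near zero are consistent with re-keying."""
    offsets = []
    for t in alarm_times:
        off = (t - tunnel_start).total_seconds() % period
        if off > period / 2:
            off -= period  # closer to the next multiple than the previous one
        offsets.append(off)
    return offsets

# Hypothetical values for illustration only.
tunnel_start = datetime(2023, 1, 1, 8, 0, 0)
alarms = [
    datetime(2023, 1, 1, 9, 0, 5),
    datetime(2023, 1, 1, 10, 0, 7),
    datetime(2023, 1, 1, 10, 59, 58),
]
print(rekey_offsets(alarms, tunnel_start))
```

Offsets that cluster within a few seconds of zero would support the re-key theory; offsets scattered across the hour would point elsewhere.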

Last night, after I left my comment, I tried clearing the "Fast UDP" feature in OpenVPN, and the issue has stopped (for now). I'll see if it stays working.

stephenw10

Hmm, odd. I wouldn't expect that. I have openvpn tunnels that have been up without issue for weeks.