Hourly WAN packet loss after updating to pfSense+ 23
-
I had been running pfSense+ 22.05 on my Netgate SG-5100 and everything was fine. Now on 23.05.1 I experience WAN packet loss for about 30-60 seconds nearly every hour at 4 to 7 minutes past the hour.
You see the packet loss in blue begin ("out of the blue" as they say...sorry...). You see utilization in orange increase due to ntopng running...which began with the update occurred. You see utilization drop back down when ntopng was removed. But packet loss continues.
Curiously packet loss began some hours (maybe a day or so) after the update and ntopng was installed.
I was tight on disk space, and wondered if log rotation was a factor. I installed an M.2 SATA drive and re-installed pfSense+ (from an image provided by support), and then restored my config backup. This all worked fine...but the packet loss still continues.
It is possible (but unlikely) that my stable internet connection developed a problem concurrent with the pfSense update. I tend to suspect something going on in my router to cause the problem.
I adjusted my gateway monitoring thresholds to minimize failing over to the backup WAN since it seems that dpinger and rc.newwanip resets were causing more problems than the brief packet loss itself.
Here is longer more detailed background:
I had previously run ntopng, but had not been using that for quite some time.
A couple weeks ago I wanted to try ntopng. The package was out of date. I encountered errors when I tried to update it. So I decided to update to pfSense+ 23.05.1 I performed this upgrade without a problem, then upgraded the ntopng package, and everything worked fine.
Except...I noticed periodic WAN disconnections that had not occurred prior to the update. As I gathered further history, I noticed that the outages were brief--approx. 60 seconds, and would occur at about 4 minutes past the hour,..most hours (but not every).
I then began wondering if ntopng was causing the problem (due to disk activity, cpu load, log rotation, or similar), so I remove the package. The frequent packet loss (hourly, for less than a minute) continued.
Is there any change that ntopng would have made that could explain the problem? Or any known problem in ver. 23 that could shed some light?
I suppose as a diagnostic I could obtain a ver. 22 image from support and reinstall that. And/or I could reinstall and apply an old config backup (or even hand-configure if needed)...in case something in the config backup brings about the problem on a fresh install.
Thanks for your help and suggestions
-
What do you see in the main system log at that time when the loss starts?
-
@stephenw10 I see the warning alarm, followed by the alarm...and then gateway reconfiguration-related messages.
Here is a sample containing everything from system.log from one of these events. Note that most log entries seem to be in UTC, but some are in local time (-7) (SHGuard, xxinetd, and rc.gateway_alarm). This is a bit confusing...but what I've pasted below is actually contiguous.
In this excerpt the first three entries are normal / all is well. The 4th entry at "9/26/2023 11:12" (aka "9/26/2023 18:12") is where the outage begins.
9/26/2023 17:13 pfMyRouter php-fpm[62804]: /rc.dyndns.update: phpDynDNS (my.ddnsdomain.com): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. 9/26/2023 10:36 pfMyRouter sshguard[34957]: Exiting on signal. 9/26/2023 10:36 pfMyRouter sshguard[22849]: Now monitoring attacks. 9/26/2023 11:12 pfMyRouter rc.gateway_alarm[42912]: >>> Gateway alarm: WAN_Cox (Addr:1.2.3.4 Alarm:1 RTT:9.331ms RTTsd:3.164ms Loss:22%) 9/26/2023 18:12 pfMyRouter check_reload_status[348]: updating dyndns WAN_Cox 9/26/2023 18:12 pfMyRouter check_reload_status[348]: Restarting IPsec tunnels 9/26/2023 18:12 pfMyRouter check_reload_status[348]: Restarting OpenVPN tunnels/interfaces 9/26/2023 18:12 pfMyRouter check_reload_status[348]: Reloading filter 9/26/2023 18:12 pfMyRouter php-fpm[51310]: /rc.openvpn: MONITOR: WAN_Cox has packet loss, omitting from routing group InternetWANGroup 9/26/2023 18:12 pfMyRouter php-fpm[51310]: 1.2.3.4|1.2.3.478|WAN_Cox|9.52ms|3.521ms|23%|down|highloss 9/26/2023 18:12 pfMyRouter php-fpm[51310]: /rc.openvpn: Gateway, switch to: WANBackupGW 9/26/2023 18:12 pfMyRouter php-fpm[51310]: /rc.openvpn: Default gateway setting as default. 9/26/2023 18:12 pfMyRouter php-fpm[71632]: /rc.filter_configure_sync: Gateway, switch to: WANBackupGW 9/26/2023 18:12 pfMyRouter php-fpm[10059]: /rc.dyndns.update: Gateway, switch to: WANBackupGW 9/26/2023 18:12 pfMyRouter php-fpm[51310]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_Cox. 9/26/2023 11:12 pfMyRouter xinetd[39386]: Starting reconfiguration 9/26/2023 11:12 pfMyRouter xinetd[39386]: Swapping defaults 9/26/2023 11:12 pfMyRouter xinetd[39386]: readjusting service 21000-tcp 9/26/2023 11:12 pfMyRouter xinetd[39386]: readjusting service 21001-tcp 9/26/2023 11:12 pfMyRouter xinetd[39386]: Reconfigured: new=0 old=2 dropped=0 (services) 9/26/2023 18:12 pfMyRouter php-fpm[10059]: /rc.dyndns.update: phpDynDNS (my2.ddnsdomain.com: No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. 9/26/2023 18:12 pfMyRouter php-fpm[10059]: /rc.dyndns.update: phpDynDNS (my.ddnsdomain.com): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. 9/26/2023 11:13 pfMyRouter rc.gateway_alarm[29520]: >>> Gateway alarm: WAN_Cox (Addr:1.2.3.4 Alarm:0 RTT:9.968ms RTTsd:3.869ms Loss:6%) 9/26/2023 18:13 pfMyRouter check_reload_status[348]: updating dyndns WAN_Cox 9/26/2023 18:13 pfMyRouter check_reload_status[348]: Restarting IPsec tunnels 9/26/2023 18:13 pfMyRouter check_reload_status[348]: Restarting OpenVPN tunnels/interfaces 9/26/2023 18:13 pfMyRouter check_reload_status[348]: Reloading filter 9/26/2023 18:13 pfMyRouter php-fpm[62804]: /rc.openvpn: MONITOR: WAN_Cox is available now, adding to routing group InternetWANGroup 9/26/2023 18:13 pfMyRouter php-fpm[62804]: 1.2.3.4|1.2.3.478|WAN_Cox|9.829ms|3.634ms|5%|online|none 9/26/2023 18:13 pfMyRouter php-fpm[62804]: /rc.openvpn: Gateway, switch to: WAN_Cox 9/26/2023 18:13 pfMyRouter php-fpm[62804]: /rc.openvpn: Default gateway setting as default. 9/26/2023 18:13 pfMyRouter php-fpm[286]: /rc.dyndns.update: Gateway, switch to: WAN_Cox 9/26/2023 18:13 pfMyRouter php-fpm[287]: /rc.filter_configure_sync: Gateway, switch to: WAN_Cox 9/26/2023 18:13 pfMyRouter php-fpm[62804]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_Cox. 9/26/2023 11:13 pfMyRouter xinetd[39386]: Starting reconfiguration 9/26/2023 11:13 pfMyRouter xinetd[39386]: Swapping defaults 9/26/2023 11:13 pfMyRouter xinetd[39386]: readjusting service 21000-tcp 9/26/2023 11:13 pfMyRouter xinetd[39386]: readjusting service 21001-tcp 9/26/2023 11:13 pfMyRouter xinetd[39386]: Reconfigured: new=0 old=2 dropped=0 (services) 9/26/2023 18:13 pfMyRouter php-fpm[286]: /rc.dyndns.update: phpDynDNS (my2.ddnsdomain.com: No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. 9/26/2023 18:13 pfMyRouter php-fpm[286]: /rc.dyndns.update: phpDynDNS (my.ddnsdomain.com): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
-
Hmm, well the timestamp difference is odd. But that aside there doesn't look to be anything significant there. The gateway shows loss and triggers the other events.
Assuming 1.2.3.4 there is your real gateway address you should change the monitoring IP to something upstream like 8.8.8.8 or 1.1.1.1. It could be simply that the gateway loaded and drops pings.
Steve
-
@stephenw10 Thanks for the suggestion, but:
- The problem persists even with monitoring disabled
- The problem results in actual packet loss to workstations: it is not just a monitoring problem
- The problem persists even with a fixed WAN IP address (I thought it might be related to DHCP renewals.)
- Looking at additional graphs, there is no evidence of CPU or other load increase at the times the problem ocurrs.
I am out of ideas at this point and intend to re-install version 22.05. I will report back to confirm the status of the problem after that downgrade.
-
There is no increase in traffic when this happens?
-
@stephenw10 That is correct: there is no change in network traffic.
FYI the network nearly always has very light traffic.
-
When using an external IP do you still see the same level of packet loss? Do you also see latency increases?
Otherwise I would try to capture the ping packets when this happens and see if there are actually missing packets on the wire.
Are you using an igb or ix NIC for WAN? Try re-assigning it to the other type if you can.
-
I am starting to think this may be a wild goose chase after all: this may be an ISP problem. It could in fact be that the problem was occurring prior to the firmware upgrade and went unnoticed, and that monitoring was not enabled. (I thought/assumed that monitoring had been enabled the whole time, but I can't say for sure.)
I reverted back to 22.05 and have observed the same behavior: dropped packets each hour.
Yes, the packet loss is real...going through the pfSense (observable from a workstation). I have not yet been able to confirm packet loss is real without the pfSense in the mix.
WAN is on igb0 and LAN is on igb1. I could try moving it, but igb0 is the default. Is there a reason why igb versus ix would make a difference?
Sep 28 09:16:44 pMyRouter rc.gateway_alarm[33556]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:10.000ms RTTsd:2.329ms Loss:21%) Sep 28 09:18:34 pMyRouter rc.gateway_alarm[18934]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:10.422ms RTTsd:3.664ms Loss:6%) Sep 28 10:16:52 pMyRouter rc.gateway_alarm[58429]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:8.861ms RTTsd:1.637ms Loss:21%) Sep 28 10:18:25 pMyRouter rc.gateway_alarm[61365]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:9.553ms RTTsd:3.310ms Loss:5%) Sep 28 11:17:05 pMyRouter rc.gateway_alarm[67044]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:1 RTT:8.698ms RTTsd:2.920ms Loss:22%) Sep 28 11:18:11 pMyRouter rc.gateway_alarm[41078]: >>> Gateway alarm: WAN_DHCP (Addr:1.2.3.4 Alarm:0 RTT:9.716ms RTTsd:5.033ms Loss:5%)
It's kind of a strange problem (with the hourly recurrence). I'll keep working on getting to the bottom of this and will report back.
-
Yeah if it's the same in 22.05 it's probably not a NIC/driver issue. However on the 5100 it can be worth trying. The igb NICs are separate PCIe devices. The ix NICs are part of the SoC.
- 23 days later
-
@drueter Did you ever confirm this? I am troubleshooting a very similar issue - Hourly Gateway Alarms, knocking things offline.
I have zero leads on the issue, so looking for hints... -
Nothing logged, just packet loss? Monitoring the gateway IP(s)? Same hardware?
-
@stephenw10 My case is somewhat different - but I am looking for clues. The gateway is an openvpn gateway, the wan does not appear to go down just the openvpn link.
What does happen is when the gateway alarm goes off, the DNS lookups that go across the VPN to the remote DNS immediately fail, and so my nagios probes detect the name resolution failure right away and report.
The pfSense logs show ">>> Gateway alarm: ... " at about 1 hour intervals, right when the DNS lookups start failing. Last about 2-3 minutes maximum. I suspected that could be when vpn re-keying is taking place?
Last night after I left my comment I tried clearing the "Fast UDP" feature in openvpn, and the issue has stopped. (For now). I'll see if it stays working.
-
Hmm, odd. I wouldn't expect that. I have openvpn tunnels that have been up without issue for weeks.