Random Massive Lag Spikes
-
Up front: I'm an absolute noob when it comes to networking and pfSense in general, so I'm not sure how to navigate pfSense or debug issues with any level of sophistication.
I have a client on my network which does a lot of downloading and, when turned on, causes massive lag spikes for packets moving into my pfSense box. Typically pings to my pfSense gateway addr take around 0.3ms to return, however at random times pings take up to 200ms and sometimes even longer. For example see this paste
So far the best lead I have is that the System Activity screen starts showing less CPU idle time, interrupt load seems to skyrocket, and a program running debug against the ruleset starts showing up and taking massive amounts of CPU time. The interrupts seem to be the culprit here, but I'm not sure what's causing them or how to find out. In addition, I couldn't figure out where or how the pfctl program was being executed, which is a bit suspicious. From what I understand it comes from dynamic rules being applied, but I don't think I have any dynamic rules currently? Here's a pastebin I managed to capture with all of the aforementioned issues: see this paste for a top printout showing high random load
My specs are as follows:
Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
Current: 1992 MHz, Max: 1993 MHz
Memory: 3924 MiB
I've tried the following approaches to mitigate the issue to no avail:
- Traffic shaping - I have a codel limiter working to keep traffic at a max of 85Mbit/s where my internet bandwidth is 100Mbit/s
-
The ping paste: how is the device you are pinging from connected to pfSense? Wifi? Wired?
If it's wired, is this a classic 1 Gbit/sec connection?
The top paste: your pfSense is basically doing 'nothing' with its 4 cores.
That said, on my own 4100 I have 105 processes, and you have twice as many. That's ... a bit strange. Did you install and activate all pfSense packages?
Btw, pfctl: once in a while the firewall rules get reloaded, that's normal.
What is the brand (type) of the NICs used?
@cortadalallo said in Random Massive Lag Spikes:
I have a codel limiter working to keep traffic at a max of 85Mbit/s where my internet bandwidth is 100Mbit/s
If a LAN device is downloading or sending something 'big' and fills up the WAN connection, lower-priority packets might get dropped. ICMP (ping) is a lower-priority protocol.
If your traffic shaping is set up with two queues, one reserved 'channel' for ICMP only and another for the rest of the traffic, the ping latency (buffer bloat) will be gone. That doesn't mean the system will be any faster, though.
-
@cortadalallo
/sbin/pfctl -o basic -f /tmp/rules.debug does not 'run debug against the ruleset', it applies a new ruleset. Setting new rules does impact traffic, so that's probably the cause here.
It is unusual for that to be happening regularly. It might mean that you have an interface that's flapping, or there might be something else triggering this. That's what you need to figure out.
-
Yup that^.
Check the system log to see what's triggering the ruleset reload.
-
@kprovost said in Random Massive Lag Spikes:
@cortadalallo
/sbin/pfctl -o basic -f /tmp/rules.debug does not 'run debug against the ruleset', it applies a new ruleset. Setting new rules does impact traffic, so that's probably the cause here.
It is unusual for that to be happening regularly. It might mean that you have an interface that's flapping, or there might be something else triggering this. That's what you need to figure out.
This provided a good lead. I used the system logs to figure out that the interface is flapping, but I'm not sure why. I just replaced the cable on that interface, to no avail.
From the logs it looks like check_reload_status might be a factor, but it looks like it's trying to bring the interface up and not down? Although that's assuming that the check_reload_status and kernel printouts are synchronous, when they're probably not.
May 2 17:38:51 rc.gateway_alarm 12499 >>> Gateway alarm: WAN_DHCP6 (Addr:fe80::201:5cff:fe95:f846%igb0 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
May 2 17:38:51 check_reload_status 429 Reloading filter
May 2 17:38:51 check_reload_status 429 Restarting OpenVPN tunnels/interfaces
May 2 17:38:51 check_reload_status 429 Restarting IPsec tunnels
May 2 17:38:51 check_reload_status 429 updating dyndns WAN_DHCP
May 2 17:38:51 rc.gateway_alarm 11618 >>> Gateway alarm: WAN_DHCP (Addr:68.112.120.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
May 2 17:38:51 check_reload_status 429 Reloading filter
May 2 17:38:51 kernel igb0: link state changed to UP
May 2 17:38:51 check_reload_status 429 Linkup starting igb0
May 2 17:38:48 php-fpm 53068 /rc.linkup: DEVD Ethernet detached event for wan
May 2 17:38:48 php-fpm 53068 /rc.linkup: Hotplug event detected for WAN(wan) dynamic IP address (4: dhcp, 6: dhcp6)
May 2 17:38:47 kernel igb0: link state changed to DOWN
May 2 17:38:47 check_reload_status 429 Linkup starting igb0
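For anyone wanting to count these flaps over a longer stretch of the system log, here is a minimal Python sketch. It assumes the log has been exported as plain text in the same format as the excerpt above; the function name link_flaps and the placeholder year are my own inventions, not anything pfSense provides. It pairs each kernel "link state changed to DOWN" line with the next UP for the same interface and reports how long the link was gone.

```python
import re
from datetime import datetime

# Matches kernel link-state lines in the format shown in the log excerpt.
LINK_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) kernel "
    r"(?P<iface>\w+): link state changed to (?P<state>UP|DOWN)$"
)


def link_flaps(lines, year=2000):
    """Return (iface, down_time, up_time, seconds_down) per DOWN -> UP pair."""
    events = []
    for line in lines:
        m = LINK_RE.match(line.strip())
        if m:
            # The log carries no year; `year` is an assumed placeholder.
            stamp = " ".join(m["stamp"].split())
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            events.append((when, m["iface"], m["state"]))
    # The GUI shows newest entries first, so sort into chronological order.
    events.sort(key=lambda e: e[0])

    went_down = {}
    flaps = []
    for when, iface, state in events:
        if state == "DOWN":
            went_down[iface] = when
        elif iface in went_down:
            down = went_down.pop(iface)
            flaps.append((iface, down, when, (when - down).total_seconds()))
    return flaps


# A few lines copied from the excerpt above.
sample = [
    "May 2 17:38:51 kernel igb0: link state changed to UP",
    "May 2 17:38:51 check_reload_status 429 Linkup starting igb0",
    "May 2 17:38:48 php-fpm 53068 /rc.linkup: DEVD Ethernet detached event for wan",
    "May 2 17:38:47 kernel igb0: link state changed to DOWN",
    "May 2 17:38:47 check_reload_status 429 Linkup starting igb0",
]

for iface, down, up, secs in link_flaps(sample):
    print(f"{iface}: down {down:%b %d %H:%M:%S}, up {up:%b %d %H:%M:%S} ({secs:.0f}s)")
```

One limitation: the timestamps only have one-second resolution, so a DOWN and an UP logged within the same second cannot be ordered reliably and may be paired wrongly.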
-
It can be hard to see what is cause and what is symptom there. But those log lines from kernel show it's actually losing link.
What is igb0 connected to? Can you try a different port?
-
That interface is connected directly to my modem, so unfortunately I am unable to try a different port.
I could potentially try a different interface in pfSense or even do a ping/stability test directly against the modem with a different device.
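A rough Python sketch of how the output of such a ping test could be summarized. It assumes the common "icmp_seq=N ... time=X ms" reply format printed by FreeBSD and Linux ping; the function names, the 50 ms threshold, and the sample lines (including the 192.168.1.1 address) are made up for illustration and are not taken from the pastes.

```python
import re

# Pulls the sequence number and round-trip time out of a ping reply line.
RTT_RE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def parse_replies(ping_lines):
    """Return {seq: rtt_ms} for every reply line that matches."""
    replies = {}
    for line in ping_lines:
        m = RTT_RE.search(line)
        if m:
            replies[int(m.group(1))] = float(m.group(2))
    return replies


def find_spikes(replies, threshold_ms=50.0):
    """Return (seq, rtt_ms) for replies slower than the assumed threshold."""
    return [(seq, rtt) for seq, rtt in sorted(replies.items()) if rtt > threshold_ms]


def missing_seqs(replies):
    """Return sequence numbers with no reply between the first and last seen."""
    if not replies:
        return []
    return [s for s in range(min(replies), max(replies) + 1) if s not in replies]


# Hypothetical ping output, not real data from this thread.
sample = [
    "64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=0.312 ms",
    "64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=214.730 ms",
    "64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.298 ms",
]

replies = parse_replies(sample)
print("spikes:", find_spikes(replies))
print("lost:", missing_seqs(replies))
```

Running the same test against the pfSense LAN address and against the modem, and comparing where the spikes and losses show up, might help narrow down which hop is misbehaving.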
-
Also, yeah, I just tried turning on the high-bandwidth-consuming client and it triggered 2 "flaps" within about 10 minutes. Is it possible that my ISP is just totally crapping itself whenever there's an uptick in traffic?
-
It could just be the modem crapping out, yes.
Can you try a different port at the pfSense end?
Can you test putting a switch in between the pfSense WAN and the modem? That would prove which end is dropping the link.