Random Massive Lag Spikes

cortadalallo

To say it up front: I'm an absolute noob when it comes to networking and pfSense in general, so I'm not sure how to navigate pfSense or debug issues with any level of sophistication.

I have a client on my network which does a lot of downloading and, when turned on, causes massive lag spikes for packets moving into my pfSense box. Typically pings to my pfSense gateway address take around 0.3ms to return; however, at random times pings take up to 200ms and sometimes even longer. For example, see this paste.
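One way to make spikes like these easier to spot in a long capture is to filter the ping output for slow replies. This is only a minimal sketch: it assumes the usual `icmp_seq=... time=... ms` reply format printed by common ping implementations, and the function name `find_spikes`, the 10 ms threshold, the address 192.168.1.1, and the sample lines are all made up for illustration (they are not taken from the paste).

```python
import re

# Matches reply lines such as
# "64 bytes from 192.168.1.1: icmp_seq=7 ttl=64 time=0.312 ms"
REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+)\s*ms")


def find_spikes(lines, threshold_ms=10.0):
    """Return (sequence number, RTT in ms) for every reply slower than threshold_ms."""
    spikes = []
    for line in lines:
        match = REPLY.search(line)
        if match is None:
            continue  # header, summary, or timeout line
        seq, rtt = int(match.group(1)), float(match.group(2))
        if rtt > threshold_ms:
            spikes.append((seq, rtt))
    return spikes


# Hypothetical sample output, not the real capture.
sample = [
    "64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.312 ms",
    "64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=203 ms",
    "64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.298 ms",
]
print(find_spikes(sample))  # prints [(2, 203.0)]
```

Lines without a reply (timeouts, headers, the summary) are skipped rather than counted, so lost packets would need separate handling.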

So far the best lead I have is that the System Activity screen starts showing less CPU idle time, interrupt load seems to skyrocket, and a program running debug against the ruleset starts showing up and taking up massive amounts of CPU time. The interrupts seem to be the culprit here, but I'm not sure what's causing them or how to find that out. In addition, I couldn't figure out where or how the pfctl program was being executed, which is a bit suspicious. From what I understand it comes from dynamic rules being applied; however, I don't think I have any dynamic rules currently. Here's a pastebin I managed to capture with all of the aforementioned issues: see this paste for a top printout showing high random load.

My specs are as follows:

Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
Current: 1992 MHz, Max: 1993 MHz

Memory: 3924 MiB

I've tried the following approaches to mitigate the issue to no avail:

Traffic shaping - I have a codel limiter working to keep traffic at a max of 85Mbit/s where my internet bandwidth is 100Mbit/s

Gertjan

@cortadalallo

The ping paste : how is the device you are pinging from connected to pfSense ? Wifi ? Wired ?
If it's wired, is this a classic 1 Gbit/sec connection ?

The top paste : your pfSense is basically doing 'nothing' with its 4 cores.
That said, on my own 4100 I have 105 processes, and you have twice as many. That's ... a bit strange. Did you install and activate all pfSense packages ?

Btw : pfctl : once in a while the firewall rules get reloaded, that's normal.

What is the brand (type) of the NICs used ?

@cortadalallo said in Random Massive Lag Spikes:

I have a codel limiter working to keep traffic at a max of 85Mbit/s where my internet bandwidth is 100Mbit/s

If a LAN device is downloading or sending something 'big' and fills up the WAN connection, lower priority packets might get dropped. ICMP (ping) is a lower priority protocol.
If your traffic shaping is set up using two queues, one reserved 'channel' for ICMP only and another for the rest of the traffic, the ping latency (buffer bloat) will be gone. That doesn't mean the system will be any faster, though.

kprovost

@cortadalallo /sbin/pfctl -o basic -f /tmp/rules.debug does not 'run debug against the ruleset', it applies a new ruleset. Setting new rules does impact traffic, so that's probably the cause here.

It is unusual for that to be happening regularly. It might mean that you have an interface that's flapping, or there might be something else triggering this. That's what you need to figure out.

stephenw10

Yup that^.

Check the system log to see what's triggering the ruleset reload.
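If the log is long, one way to do this is to list whatever was logged in the few seconds before each "Reloading filter" entry. This is only a rough sketch, assuming lines shaped like the system log excerpts posted in this thread (timestamp, process, optional PID, message). The function names, the 5 second window, and the default year are assumptions (the log itself carries no year), and the month parsing assumes English month names.

```python
import re
from datetime import datetime, timedelta

# "May 2 17:38:51 <process> <rest of line>", fields separated by tabs or spaces
ENTRY = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def parse(line, year=2024):
    """Split one log line into (timestamp, process, message), or None if it does not match."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    # The year is an assumption; the log lines do not include one.
    when = datetime.strptime(
        f"{year} {m.group(1)} {m.group(2)} {m.group(3)}", "%Y %b %d %H:%M:%S"
    )
    return when, m.group(4), m.group(5).strip()


def reload_context(lines, window_s=5):
    """For each 'Reloading filter' entry, collect the other entries from the
    preceding window_s seconds (inclusive), oldest first."""
    entries = [e for e in map(parse, lines) if e is not None]
    result = []
    for when, _proc, msg in entries:
        if "Reloading filter" not in msg:
            continue
        start = when - timedelta(seconds=window_s)
        nearby = [
            e for e in entries
            if start <= e[0] <= when and "Reloading filter" not in e[2]
        ]
        result.append((when, sorted(nearby)))
    return result


log = [
    "May 2 17:38:51\tcheck_reload_status\t429\tReloading filter",
    "May 2 17:38:51\tkernel\t\tigb0: link state changed to UP",
    "May 2 17:38:47\tkernel\t\tigb0: link state changed to DOWN",
]
for when, nearby in reload_context(log):
    print("reload at", when)
    for stamp, proc, msg in nearby:
        print("   ", stamp, proc, msg)
```

Entries that share a timestamp with the reload are included too, since the log only has one-second resolution and the order within a second is not reliable.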

cortadalallo

@kprovost said in Random Massive Lag Spikes:

@cortadalallo /sbin/pfctl -o basic -f /tmp/rules.debug does not 'run debug against the ruleset', it applies a new ruleset. Setting new rules does impact traffic, so that's probably the cause here.

It is unusual for that to be happening regularly. It might mean that you have an interface that's flapping, or there might be something else triggering this. That's what you need to figure out.

This provided a good lead. I used the system logs to figure out that the interface is flapping, but I'm not sure why. I just replaced the cable on that interface to no avail.

From the logs it looks like check_reload_status might be a factor, but it looks like it's trying to bring the interface up and not down? Although that's assuming that the check_reload_status and kernel printouts are synchronous, when they're probably not.

May 2 17:38:51	rc.gateway_alarm	12499	>>> Gateway alarm: WAN_DHCP6 (Addr:fe80::201:5cff:fe95:f846%igb0 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
May 2 17:38:51	check_reload_status	429	Reloading filter
May 2 17:38:51	check_reload_status	429	Restarting OpenVPN tunnels/interfaces
May 2 17:38:51	check_reload_status	429	Restarting IPsec tunnels
May 2 17:38:51	check_reload_status	429	updating dyndns WAN_DHCP
May 2 17:38:51	rc.gateway_alarm	11618	>>> Gateway alarm: WAN_DHCP (Addr:68.112.120.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
May 2 17:38:51	check_reload_status	429	Reloading filter
May 2 17:38:51	kernel		igb0: link state changed to UP
May 2 17:38:51	check_reload_status	429	Linkup starting igb0
May 2 17:38:48	php-fpm	53068	/rc.linkup: DEVD Ethernet detached event for wan
May 2 17:38:48	php-fpm	53068	/rc.linkup: Hotplug event detected for WAN(wan) dynamic IP address (4: dhcp, 6: dhcp6)
May 2 17:38:47	kernel		igb0: link state changed to DOWN
May 2 17:38:47	check_reload_status	429	Linkup starting igb0
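To see how often and for how long the link actually drops, the kernel "link state changed" lines can be paired up. This is only a sketch, assuming kernel lines shaped like the ones above. The name `link_outages` and the default year are assumptions (the log has no year), and month parsing assumes English month names.

```python
import re
from datetime import datetime

# "May 2 17:38:47 kernel igb0: link state changed to DOWN"
LINK = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+kernel\s+"
    r"(\w+): link state changed to (UP|DOWN)"
)


def link_outages(lines, iface, year=2024):
    """Return (down_time, up_time, seconds_down) for each DOWN -> UP pair on iface."""
    events = []
    for line in lines:
        m = LINK.match(line.strip())
        if m and m.group(4) == iface:
            # The year is an assumption; the log lines do not include one.
            stamp = datetime.strptime(
                f"{year} {m.group(1)} {m.group(2)} {m.group(3)}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, m.group(5)))
    events.sort(key=lambda e: e[0])  # the log above is listed newest first

    outages, down_at = [], None
    for stamp, state in events:
        if state == "DOWN":
            down_at = stamp
        elif down_at is not None:  # UP that follows a recorded DOWN
            outages.append((down_at, stamp, (stamp - down_at).total_seconds()))
            down_at = None
    return outages


log = [
    "May 2 17:38:51\tkernel\t\tigb0: link state changed to UP",
    "May 2 17:38:47\tkernel\t\tigb0: link state changed to DOWN",
]
for down, up, secs in link_outages(log, "igb0"):
    print(f"{down} -> {up} ({secs:.0f} s down)")
```

For the excerpt above this pairs the DOWN at 17:38:47 with the UP at 17:38:51, so the link was down for about four seconds.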

stephenw10

It can be hard to see what is cause or symptom there. But those log lines from the kernel show it's actually losing link.

What is igb0 connected to? Can you try a different port?

cortadalallo

@stephenw10

That interface is connected directly to my modem, so unfortunately I am unable to try a different port.

I could potentially try a different interface in pfSense or even do a ping/stability test directly against the modem with a different device.

cortadalallo

Also, yeah, I just tried turning on the high-bandwidth-consuming client and it triggered 2 "flaps" within about 10 minutes. Is it possible that my ISP is just totally crapping itself whenever there's an uptick in traffic?

stephenw10

It could just be the modem crapping out, yes.

Can you try a different port at the pfSense end?

Can you test putting a switch in between the pfSense WAN and the modem? That would prove which end is dropping the link.