Issues with High Latency on PPPOE Reconnect

stephenw10

Hmm, that's crazy high CPU usage for 100Mb throughput on an i3.

Do you somehow have device polling enabled?
Check the output of ifconfig -a.

Or maybe the CPU is running at a fraction of normal speed somehow?
Check sysctl dev.cpu.0.freq

Steve

Flole

Sorry the data was the same as the one above (copied the same thing twice), I've corrected it now. CPU Usage is pretty low

Flole

Anybody has another Idea what might be going on here? Why is the em0 and em1 que taking that much CPU?

stephenw10

I wonder if it's some interrupt issue. Try checking vmstat -i Is there some huge interrupt rate on the em intertupts? Can you see if it spikes at a reconnect?
Also check the interrupt graph in System > Monitoring.

If there is something there you might be able to tweak the interrupt moderation settings for em.

Steve

Flole

During a reconnect I ran the vmstat -i command a few times, this is what I've got:

#vmstat -i
interrupt                          total       rate
irq4: uart0                            4          0
irq16: ehci0                     1819312          1
irq23: ehci1                     2425630          2
cpu0:timer                    1301077192       1073
cpu1:timer                      42493717         35
cpu3:timer                      57991184         48
cpu2:timer                     599667271        494
irq264: em0                   2223977382       1834
irq265: em1                   1785916192       1472
irq266: em2                   2958347521       2439
irq267: hdac0                       7114          0
irq268: re0                    181548785        150
irq269: ahci0                    3925301          3
Total                         9159196605       7551
# vmstat -i
interrupt                          total       rate
irq4: uart0                            4          0
irq16: ehci0                     1819312          1
irq23: ehci1                     2425630          2
cpu0:timer                    1301080827       1073
cpu1:timer                      42497352         35
cpu3:timer                      57994819         48
cpu2:timer                     599670906        494
irq264: em0                   2223977433       1834
irq265: em1                   1785916305       1472
irq266: em2                   2958347577       2439
irq267: hdac0                       7114          0
irq268: re0                    181548785        150
irq269: ahci0                    3925307          3
Total                         9159211371       7551
# vmstat -i
interrupt                          total       rate
irq4: uart0                            4          0
irq16: ehci0                     1819318          1
irq23: ehci1                     2425638          2
cpu0:timer                    1301083170       1073
cpu1:timer                      42499612         35
cpu3:timer                      57997160         48
cpu2:timer                     599673173        494
irq264: em0                   2223983068       1834
irq265: em1                   1785920972       1472
irq266: em2                   2958353105       2439
irq267: hdac0                       7114          0
irq268: re0                    181548785        150
irq269: ahci0                    3925316          3
Total                         9159236435       7551
#
interrupt                          total       rate
irq4: uart0                            4          0
irq16: ehci0                     1819324          1
irq23: ehci1                     2425646          2
cpu0:timer                    1301087566       1073
cpu1:timer                      42503812         35
cpu3:timer                      58001340         48
cpu2:timer                     599677486        494
irq264: em0                   2223989223       1834
irq265: em1                   1785927209       1472
irq266: em2                   2958360786       2439
irq267: hdac0                       7114          0
irq268: re0                    181548785        150
irq269: ahci0                    3925330          3
Total                         9159273625       7551

There was a spike that caused the vmstat -i command to hang, so if there was something that should have been captured.

Looking at the graph, set at the shortest (1 minute) interval, interrupt is at 15%, system at 50% and user at 15% at the minute with the reconnect. Afterwards everything goes back to normal again (System at 4%, everything else less than 0.5%).

Flole

Anyone has another idea? I really have no clue what else to try or how to debug. It all should work, but it doesn't. How could I try to track down the issue?

stephenw10

I notice you have that high CPU usage in state *if_la which I assume is truncated from if_lagg.

What sort of lagg is that? What interfaces are in it?

Can you remove the lagg entirely and put the VLANs over one interface directly as a test?

It's weird though. I have a setup at home that's almost exactly like that. PPPoE connections on VLANs on an LACP LAGG.

Steve

Flole

It is an LACP lagg with all interfaces in it (3 Intels and 1 Realtek), as I've mentioned before I've tried removing the Realtek from it before but no difference.

I'll try adding them to the interface directly tomorrow, I'll have to find a time where downtime doesn't matter.

Flole

I think I found the issue: It was something with the tuneables. It could have been the ip redirect thingy, but I've changed a few more before this reboot. Anyways, this is how it looks like know during a PPPoE reconnect (only lines with more than 1ms response time are shown, interval is 0.05ms):

64 bytes from 192.168.1.1: icmp_seq=4223 ttl=64 time=79.2 ms
64 bytes from 192.168.1.1: icmp_seq=4224 ttl=64 time=23.2 ms
64 bytes from 192.168.1.1: icmp_seq=4225 ttl=64 time=126 ms
64 bytes from 192.168.1.1: icmp_seq=4226 ttl=64 time=72.8 ms
64 bytes from 192.168.1.1: icmp_seq=4227 ttl=64 time=16.8 ms
64 bytes from 192.168.1.1: icmp_seq=4236 ttl=64 time=1.10 ms
64 bytes from 192.168.1.1: icmp_seq=4255 ttl=64 time=65.5 ms
64 bytes from 192.168.1.1: icmp_seq=4256 ttl=64 time=9.53 ms
64 bytes from 192.168.1.1: icmp_seq=4257 ttl=64 time=121 ms
64 bytes from 192.168.1.1: icmp_seq=4258 ttl=64 time=67.8 ms
64 bytes from 192.168.1.1: icmp_seq=4259 ttl=64 time=12.3 ms
64 bytes from 192.168.1.1: icmp_seq=4278 ttl=64 time=1.10 ms
64 bytes from 192.168.1.1: icmp_seq=4363 ttl=64 time=2.16 ms
64 bytes from 192.168.1.1: icmp_seq=4461 ttl=64 time=63.3 ms
64 bytes from 192.168.1.1: icmp_seq=4462 ttl=64 time=7.35 ms
64 bytes from 192.168.1.1: icmp_seq=4463 ttl=64 time=114 ms
64 bytes from 192.168.1.1: icmp_seq=4464 ttl=64 time=60.7 ms
64 bytes from 192.168.1.1: icmp_seq=4465 ttl=64 time=4.71 ms

64 bytes from 192.168.1.1: icmp_seq=4496 ttl=64 time=79.2 ms
64 bytes from 192.168.1.1: icmp_seq=4497 ttl=64 time=23.2 ms
64 bytes from 192.168.1.1: icmp_seq=4498 ttl=64 time=125 ms
64 bytes from 192.168.1.1: icmp_seq=4499 ttl=64 time=68.1 ms
64 bytes from 192.168.1.1: icmp_seq=4500 ttl=64 time=12.1 ms
64 bytes from 192.168.1.1: icmp_seq=4514 ttl=64 time=103 ms
64 bytes from 192.168.1.1: icmp_seq=4515 ttl=64 time=47.6 ms
64 bytes from 192.168.1.1: icmp_seq=4517 ttl=64 time=104 ms
64 bytes from 192.168.1.1: icmp_seq=4518 ttl=64 time=48.7 ms

This should be okay, right? Or is there more room for improvement?

Flole

Actually that did not solve the issue..... I rebooted pfsense, added the Intel NIC I removed from the LACP earlier back to the bridge and it behaved just like it used to.

Now I have removed em2 from the LAGG again and unplugged it, and I am getting those good times again and also the radvd problem is solved. Now of course I want to figure out the real source of this problem: The card causing the issues is em2, pciconf -lv says:

em2@pci0:0:25:0:        class=0x020000 card=0x1494103c chip=0x15028086 rev=0x04 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82579LM Gigabit Network Connection (Lewisville)'
    class      = network
    subclass   = ethernet

Anybody has any idea why this card is causing such massive problems in my setup?

stephenw10

So you still have the other em NICX and the re NIC in that lagg when the ping latency looks good?
Or is that with the lagg effectively removed?

Steve

Flole

Yes I still have the LACP containing the other 2 em NICs and the re NIC, just em2 is now no longer connected to the LACP and i've pulled out the Ethernet cable aswell.

stephenw10

Curious. You could try enabling lacp debugging.
sysctl net.link.lagg.lacp.debug=1

Are those em NICs all the same type?

Steve

Flole

No they are not all the same, em2 is a different one (it's the onboard NIC). The other ones are 82571EB.

Now that I know what causes the issue I will check if I still have one of those computers somewhere. I need a machine for testing all the stuff.

Flole

Could this patch solve the issue? https://svnweb.freebsd.org/base?view=revision&revision=336313

Looks like there are known Problems with that chipset causing some hangs, looks like exactly what I've experienced.

stephenw10

TSO should be disabled by default in pfSense anyway in System > Advanced > Networking. But check the ifconfig output for the parent NICs to be sure.

Steve

Flole

I have it enabled, is it possible that thats what caused all the issues? I assumed that for Intel NICs that works fine. Do I understand it correctly that the patch is basically disabling the TSO, so there's no point in having it enabled on my installation?

stephenw10

It does appear to. I would definitely try that.

Steve

Flole

Unfortunately i can't post the ifconfig output as that constantly triggers the spam filter. On em2 there's nothing about tso, on em0 and em1 theres tso4. Also em2 is still unplugged, not sure if that matters.

stephenw10

It may be correctly disabled on the NIC but passed through when it's added to the lagg group. Does lagg0 show TSO4?

Steve