Issues with High Latency on PPPoE Reconnect
-
No, I do have 2 Internet connections with 2 gateways each. I'll call them WAN and DSL from now on to make clear which one I mean.
On WAN I have a static IPv4 address configured and I am using DHCP for IPv6.
On DSL I use PPPoE for the IPv4 configuration and DHCP for IPv6.
The CPU is an Intel(R) Core(TM) i3-2120 @ 3.30GHz with 16GB of RAM.
-
Ah OK.
Hmm, I'm not sure if that would apply at all to v6; I don't think it does. That hardware is clearly more than capable of handling WAN events without a problem. Or at least it should be.
Disabling gateway monitoring and/or disabling gateway monitor action on each of those could be telling.
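As a quick sanity check afterwards, you can confirm from a shell whether any dpinger instances are still running (a minimal sketch, assuming console or SSH access to the firewall):
# If monitoring is fully disabled, this should print nothing:
ps aux | grep '[d]pinger'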
Steve
-
I've disabled the monitoring action on all gateways. With Avahi and Snort disabled it looks like this after PPPoE is connected:
64 bytes from 192.168.1.1: icmp_seq=715 ttl=64 time=0.225 ms
64 bytes from 192.168.1.1: icmp_seq=716 ttl=64 time=0.236 ms
64 bytes from 192.168.1.1: icmp_seq=717 ttl=64 time=1996 ms
64 bytes from 192.168.1.1: icmp_seq=718 ttl=64 time=973 ms
64 bytes from 192.168.1.1: icmp_seq=719 ttl=64 time=2895 ms
64 bytes from 192.168.1.1: icmp_seq=720 ttl=64 time=1879 ms
64 bytes from 192.168.1.1: icmp_seq=721 ttl=64 time=855 ms
64 bytes from 192.168.1.1: icmp_seq=722 ttl=64 time=32.5 ms
64 bytes from 192.168.1.1: icmp_seq=723 ttl=64 time=0.297 ms
64 bytes from 192.168.1.1: icmp_seq=724 ttl=64 time=0.213 ms
64 bytes from 192.168.1.1: icmp_seq=725 ttl=64 time=0.166 ms
64 bytes from 192.168.1.1: icmp_seq=726 ttl=64 time=2758 ms
64 bytes from 192.168.1.1: icmp_seq=727 ttl=64 time=1734 ms
64 bytes from 192.168.1.1: icmp_seq=728 ttl=64 time=710 ms
64 bytes from 192.168.1.1: icmp_seq=729 ttl=64 time=772 ms
64 bytes from 192.168.1.1: icmp_seq=730 ttl=64 time=1637 ms
64 bytes from 192.168.1.1: icmp_seq=731 ttl=64 time=628 ms
64 bytes from 192.168.1.1: icmp_seq=732 ttl=64 time=0.299 ms
64 bytes from 192.168.1.1: icmp_seq=733 ttl=64 time=0.310 ms
64 bytes from 192.168.1.1: icmp_seq=734 ttl=64 time=0.339 ms
The following lines all have a latency <1ms.
With Avahi disabled and Snort enabled it looks like this:
64 bytes from 192.168.1.1: icmp_seq=30 ttl=64 time=0.337 ms
64 bytes from 192.168.1.1: icmp_seq=31 ttl=64 time=2198 ms
64 bytes from 192.168.1.1: icmp_seq=32 ttl=64 time=1174 ms
64 bytes from 192.168.1.1: icmp_seq=33 ttl=64 time=151 ms
64 bytes from 192.168.1.1: icmp_seq=34 ttl=64 time=443 ms
64 bytes from 192.168.1.1: icmp_seq=35 ttl=64 time=2097 ms
64 bytes from 192.168.1.1: icmp_seq=36 ttl=64 time=1090 ms
64 bytes from 192.168.1.1: icmp_seq=37 ttl=64 time=66.9 ms
64 bytes from 192.168.1.1: icmp_seq=38 ttl=64 time=0.310 ms
64 bytes from 192.168.1.1: icmp_seq=39 ttl=64 time=0.305 ms
64 bytes from 192.168.1.1: icmp_seq=40 ttl=64 time=2061 ms
64 bytes from 192.168.1.1: icmp_seq=41 ttl=64 time=1037 ms
64 bytes from 192.168.1.1: icmp_seq=42 ttl=64 time=13.8 ms
64 bytes from 192.168.1.1: icmp_seq=43 ttl=64 time=91.6 ms
64 bytes from 192.168.1.1: icmp_seq=44 ttl=64 time=0.270 ms
64 bytes from 192.168.1.1: icmp_seq=45 ttl=64 time=386 ms
64 bytes from 192.168.1.1: icmp_seq=46 ttl=64 time=0.260 ms
The following lines all have a latency <1ms.
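For reference, a filter like the following prints only the spike lines, which makes longer runs easier to compare (a rough sketch using standard awk; adjust the target address as needed):
# Print only ping replies slower than 1 ms:
ping 192.168.1.1 | awk -F'time=' '/time=/ && $2+0 > 1'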
It seems like neither one has an effect on this. I would rather assume that this is somehow related to the IPv6 latency spikes I am experiencing when using the DHCPv6/RA functionality; see this thread for a description of that issue. Is there any way of troubleshooting this further, for example seeing how much CPU time each kernel module uses, or something similar?
-
If you have
top -aSH
open in the console you should see what processes are using CPU cycles.
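A couple of extra flags can make the culprit easier to spot (standard FreeBSD top options; a sketch):
# -P adds per-CPU usage lines, -H lists kernel threads individually,
# and -o cpu sorts by current CPU usage:
top -aSHP -o cpu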
Is it just dpinger?
If you actually disable monitoring for all gateways, dpinger shouldn't be running at all, so that would reveal anything else.
Steve
-
That's interesting... I re-enabled dpinger as I think it's clearly not the cause.
  PID USERNAME  PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   12 root      -92    -     0K   336K CPU2    2  57:12  66.62% [intr{irq268: re0}]
   12 root      -92    -     0K   336K RUN     1   5:40  62.03% [intr{swi4: clock (0)}]
    0 root      -92    -     0K   480K *in6_m  0  28:45  50.42% [kernel{em2 que}]
   11 root      155 ki31     0K    64K RUN     2  23.5H  27.08% [idle{idle: cpu2}]
27976 root       34    0  6900K  2464K nanslp  2   0:01  21.58% [dpinger{dpinger}]
   11 root      155 ki31     0K    64K RUN     0  24.3H  18.74% [idle{idle: cpu0}]
    0 root      -92    -     0K   480K CPU0    0  21:12  18.15% [kernel{em1 que}]
29112 root       33    0  6900K  2508K nanslp  2   0:01  17.51% [dpinger{dpinger}]
   11 root      155 ki31     0K    64K RUN     1  24.5H  16.80% [idle{idle: cpu1}]
Is the Realtek interface the one causing the issues? Right now it is connected as part of the LAGG. Would disconnecting it help, or removing it from the LAGG, or would I have to physically remove it from the machine to see if it's the cause?
-
Well if anything it's probably the Realtek. You should be able to just remove it from the LAGG, unless it's causing some terrible PCI bus issue.
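From a shell that would be something like the following (a sketch, assuming the LAGG is lagg0 and the Realtek is re0; in pfSense you would normally do this via the LAGG settings in the GUI instead):
# Remove re0 as a port from lagg0:
ifconfig lagg0 -laggport re0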
What throughput are you seeing there? Those numbers look high for a 3.3GHz i3.
Steve
-
Unplugged the Ethernet cable from the Realtek and removed it from the LAGG. That's what I'm seeing:
  PID USERNAME  PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    0 root      -92    -     0K   480K *if_la  2  20:48  92.86% [kernel{em0 que}]
   12 root      -92    -     0K   336K CPU1    1   6:14  92.82% [intr{swi4: clock (0)}]
    0 root      -92    -     0K   480K CPU0    0  23:09  92.55% [kernel{em1 que}]
17789 root       34    0  6900K  2456K nanslp  3   0:01  30.08% [dpinger{dpinger}]
22550 root       33    0  6900K  2508K nanslp  3   0:01  28.47% [dpinger{dpinger}]
41010 root       73    0  6216K  2080K *rawcb  3   0:00  18.74% /sbin/route -n get -inet6 default
20069 root       29    0  8948K  2496K nanslp  3   0:00  15.18% [dpinger{dpinger}]
I captured those top outputs right after the ping spike, once top started refreshing again. Normally it looks like this:
  PID USERNAME  PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root      155 ki31     0K    64K CPU1    1  26.5H  96.82% [idle{idle: cpu1}]
   11 root      155 ki31     0K    64K CPU3    3  26.0H  94.34% [idle{idle: cpu3}]
   11 root      155 ki31     0K    64K RUN     2  25.4H  94.21% [idle{idle: cpu2}]
   11 root      155 ki31     0K    64K CPU0    0  26.2H  89.89% [idle{idle: cpu0}]
    0 root      -92    -     0K   480K -       0  26:31   6.14% [kernel{em1 que}]
    0 root      -92    -     0K   480K -       0  32:31   3.60% [kernel{em2 que}]
    0 root      -92    -     0K   480K -       2  21:44   1.36% [kernel{em0 que}]
92981 root       52   20  6968K  2640K wait    3   0:19   0.31% /bin/sh /var/db/rrd/updaterrd.sh
The Realtek and the dual-port Intel are PCI Express; the other Intel is the onboard NIC, in case that matters. The traffic is about 100 Mbit/s constantly, in and out; sometimes it goes up to 1 Gbit/s, but that's rather the exception.
-
Hmm, that's crazy high CPU usage for 100Mb throughput on an i3.
Do you somehow have device polling enabled?
Check the output of
ifconfig -a
Or maybe the CPU is running at a fraction of its normal speed somehow?
Check
sysctl dev.cpu.0.freq
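For example (a quick sketch; the interface name is an assumption):
# Polling shows up as POLLING in the interface options line:
ifconfig em0 | grep options
# Compare the current frequency against the supported levels:
sysctl dev.cpu.0.freq dev.cpu.0.freq_levels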
Steve
-
Sorry, the data was the same as the one above (I copied the same thing twice); I've corrected it now. CPU usage is pretty low.
-
Does anybody have another idea what might be going on here? Why are the em0 and em1 queues taking that much CPU?
-
I wonder if it's some interrupt issue. Try checking
vmstat -i
Is there some huge interrupt rate on the em interrupts? Can you see if it spikes at a reconnect?
Also check the interrupt graph in System > Monitoring. If there is something there you might be able to tweak the interrupt moderation settings for em.
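To catch a spike live, something like this can watch just the em counters across the reconnect (a rough sketch):
# Print the em interrupt counters once per second:
while true; do vmstat -i | grep em; sleep 1; done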
Steve
-
During a reconnect I ran the vmstat -i command a few times; this is what I got:
# vmstat -i
interrupt                    total       rate
irq4: uart0                      4          0
irq16: ehci0               1819312          1
irq23: ehci1               2425630          2
cpu0:timer              1301077192       1073
cpu1:timer                42493717         35
cpu3:timer                57991184         48
cpu2:timer               599667271        494
irq264: em0             2223977382       1834
irq265: em1             1785916192       1472
irq266: em2             2958347521       2439
irq267: hdac0                 7114          0
irq268: re0              181548785        150
irq269: ahci0              3925301          3
Total                   9159196605       7551
# vmstat -i
interrupt                    total       rate
irq4: uart0                      4          0
irq16: ehci0               1819312          1
irq23: ehci1               2425630          2
cpu0:timer              1301080827       1073
cpu1:timer                42497352         35
cpu3:timer                57994819         48
cpu2:timer               599670906        494
irq264: em0             2223977433       1834
irq265: em1             1785916305       1472
irq266: em2             2958347577       2439
irq267: hdac0                 7114          0
irq268: re0              181548785        150
irq269: ahci0              3925307          3
Total                   9159211371       7551
# vmstat -i
interrupt                    total       rate
irq4: uart0                      4          0
irq16: ehci0               1819318          1
irq23: ehci1               2425638          2
cpu0:timer              1301083170       1073
cpu1:timer                42499612         35
cpu3:timer                57997160         48
cpu2:timer               599673173        494
irq264: em0             2223983068       1834
irq265: em1             1785920972       1472
irq266: em2             2958353105       2439
irq267: hdac0                 7114          0
irq268: re0              181548785        150
irq269: ahci0              3925316          3
Total                   9159236435       7551
#
interrupt                    total       rate
irq4: uart0                      4          0
irq16: ehci0               1819324          1
irq23: ehci1               2425646          2
cpu0:timer              1301087566       1073
cpu1:timer                42503812         35
cpu3:timer                58001340         48
cpu2:timer               599677486        494
irq264: em0             2223989223       1834
irq265: em1             1785927209       1472
irq266: em2             2958360786       2439
irq267: hdac0                 7114          0
irq268: re0              181548785        150
irq269: ahci0              3925330          3
Total                   9159273625       7551
There was a spike that caused the vmstat -i command itself to hang, so if there was something to capture, it should have been captured.
Looking at the graph, set at the shortest (1 minute) interval: interrupt is at 15%, system at 50% and user at 15% in the minute with the reconnect. Afterwards everything goes back to normal again (system at 4%, everything else at less than 0.5%).
-
Does anyone have another idea? I really have no clue what else to try or how to debug this. It all should work, but it doesn't. How could I track down the issue?
-
I notice you have that high CPU usage in state
*if_la
which I assume is truncated from if_lagg. What sort of lagg is that? What interfaces are in it?
Can you remove the lagg entirely and put the VLANs over one interface directly as a test?
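The shell equivalent for that test would be roughly the following (a sketch; the VLAN tag 7 and parent em0 are placeholder assumptions, and in pfSense this is normally done via the VLAN and interface assignment pages):
# Create the VLAN directly on the physical interface instead of the lagg:
ifconfig vlan7 create vlan 7 vlandev em0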
It's weird though. I have a setup at home that's almost exactly like that. PPPoE connections on VLANs on an LACP LAGG.
Steve
-
It is an LACP lagg with all interfaces in it (3 Intels and 1 Realtek). As I've mentioned before, I've tried removing the Realtek from it, but that made no difference.
I'll try putting the VLANs on one interface directly tomorrow; I'll have to find a time where downtime doesn't matter.
-
I think I found the issue: it was something with the tunables. It could have been the IP redirect one, but I've changed a few more before this reboot. Anyway, this is how it looks now during a PPPoE reconnect (only lines with more than 1ms response time are shown, the interval is 0.05s):
64 bytes from 192.168.1.1: icmp_seq=4223 ttl=64 time=79.2 ms
64 bytes from 192.168.1.1: icmp_seq=4224 ttl=64 time=23.2 ms
64 bytes from 192.168.1.1: icmp_seq=4225 ttl=64 time=126 ms
64 bytes from 192.168.1.1: icmp_seq=4226 ttl=64 time=72.8 ms
64 bytes from 192.168.1.1: icmp_seq=4227 ttl=64 time=16.8 ms
64 bytes from 192.168.1.1: icmp_seq=4236 ttl=64 time=1.10 ms
64 bytes from 192.168.1.1: icmp_seq=4255 ttl=64 time=65.5 ms
64 bytes from 192.168.1.1: icmp_seq=4256 ttl=64 time=9.53 ms
64 bytes from 192.168.1.1: icmp_seq=4257 ttl=64 time=121 ms
64 bytes from 192.168.1.1: icmp_seq=4258 ttl=64 time=67.8 ms
64 bytes from 192.168.1.1: icmp_seq=4259 ttl=64 time=12.3 ms
64 bytes from 192.168.1.1: icmp_seq=4278 ttl=64 time=1.10 ms
64 bytes from 192.168.1.1: icmp_seq=4363 ttl=64 time=2.16 ms
64 bytes from 192.168.1.1: icmp_seq=4461 ttl=64 time=63.3 ms
64 bytes from 192.168.1.1: icmp_seq=4462 ttl=64 time=7.35 ms
64 bytes from 192.168.1.1: icmp_seq=4463 ttl=64 time=114 ms
64 bytes from 192.168.1.1: icmp_seq=4464 ttl=64 time=60.7 ms
64 bytes from 192.168.1.1: icmp_seq=4465 ttl=64 time=4.71 ms
64 bytes from 192.168.1.1: icmp_seq=4496 ttl=64 time=79.2 ms
64 bytes from 192.168.1.1: icmp_seq=4497 ttl=64 time=23.2 ms
64 bytes from 192.168.1.1: icmp_seq=4498 ttl=64 time=125 ms
64 bytes from 192.168.1.1: icmp_seq=4499 ttl=64 time=68.1 ms
64 bytes from 192.168.1.1: icmp_seq=4500 ttl=64 time=12.1 ms
64 bytes from 192.168.1.1: icmp_seq=4514 ttl=64 time=103 ms
64 bytes from 192.168.1.1: icmp_seq=4515 ttl=64 time=47.6 ms
64 bytes from 192.168.1.1: icmp_seq=4517 ttl=64 time=104 ms
64 bytes from 192.168.1.1: icmp_seq=4518 ttl=64 time=48.7 ms
This should be okay, right? Or is there more room for improvement?
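For reference, if it was the redirect tunable, the relevant OIDs would presumably be these (an assumption; the post doesn't name exactly which tunables were changed):
# Likely candidates for the "IP redirect" tunable (assumed, not confirmed):
sysctl net.inet.ip.redirect=0
sysctl net.inet6.ip6.redirect=0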
-
Actually that did not solve the issue... I rebooted pfSense, added the Intel NIC I had removed from the LACP earlier back in, and it behaved just like it used to.
Now I have removed em2 from the LAGG again and unplugged it, and I am getting those good times again; the radvd problem is solved as well. Now of course I want to figure out the real source of this problem. The card causing the issues is em2;
pciconf -lv
says:
em2@pci0:0:25:0: class=0x020000 card=0x1494103c chip=0x15028086 rev=0x04 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82579LM Gigabit Network Connection (Lewisville)'
    class    = network
    subclass = ethernet
Does anybody have any idea why this card is causing such massive problems in my setup?
-
So you still have the other em NICs and the re NIC in that lagg when the ping latency looks good?
Or is that with the lagg effectively removed?
Steve
-
Yes, I still have the LACP containing the other 2 em NICs and the re NIC; just em2 is now no longer connected to the LACP, and I've pulled out the Ethernet cable as well.
-
Curious. You could try enabling lacp debugging.
sysctl net.link.lagg.lacp.debug=1
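The debug output lands in the kernel message buffer, so something like this shows it (a minimal sketch; where it ends up in the log files depends on the syslog configuration):
# Show the most recent kernel messages, including the LACP debug output:
dmesg | tail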
Are those em NICs all the same type?
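You can compare them quickly with something like this (a sketch):
# Show each em device and its chip/revision details:
pciconf -lv | grep -A4 '^em'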
Steve