Issues with High Latency on PPPoE Reconnect



  • Hi everyone,

    I have a pfSense setup with 2 WAN links, one over DHCP and one using PPPoE. I also have several LANs (DMZ, guest network, normal LAN). Whenever my PPPoE connection reconnects, I get packet loss on every single connection, as if pfSense stops routing for a few seconds. This is enough for the DHCP WAN gateway to be considered down. All networks run over the same LAGG on different VLANs. Neither dmesg nor the system logs give any hint as to what might cause these issues.

    Does anybody know what might be going on here and is there any way to further debug this?


  • Rebel Alliance Developer Netgate

    Is the PPPoE also going over a VLAN on the LAGG?

    There have been a very small number of reports of PPPoE issues with VLANs but so far we have not been able to replicate a problem here.


  • Netgate Administrator

    Do you have State Killing on Gateway Failure set in System > Advanced > Miscellaneous?

    Steve



    Yes, the PPPoE is also connected over a VLAN. It works great; it's just that whenever I do a reconnect, I get those packet drops shortly after it connects. Is there any info I can provide to help you trace the issue? In top I can see dpinger using 100% CPU at that time (refreshing doesn't work, but that's what I can see once it starts to refresh again).


  • Rebel Alliance Developer Netgate

    If you kill dpinger and then edit/save/apply the gateway does it come back OK?



    @stephenw10 No, that option is not set. Besides, if the states were killed, the SSH sessions would also drop and not resume afterwards.


  • Netgate Administrator

    Try disabling gateway monitoring for the PPPoE gateway and retesting.

    Steve



    @jimp There's no need to kill it; just wait a few seconds and it's all back to normal, like nothing happened.

    I am getting an email with this:

    15:56:33 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:33 10292MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:36 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:36 45534MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:36 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:36 53386MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:36 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:36 10292MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|218.804ms|634.286ms|0.0%|delay
    15:56:36 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:36 64937MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:36 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:36 64937MONITOR: GW_WAN is available now, adding to routing group KDG
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:36 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:36 45534MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:36 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:37 10292MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:37 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:37 53386MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:37 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:37 53386MONITOR: GW_WAN is available now, adding to routing group KDG
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:37 MONITOR: GW_WAN has high latency, omitting from routing group Main
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    15:56:37 64937MONITOR: GW_WAN is available now, adding to routing group Failover
    8.8.8.8|x.x.x.x|GW_WAN|204.818ms|611.325ms|0.0%|delay
    

    Also, Network UPS Tools complained about the UPS being unreachable, and a few seconds later it said it was connected again. Pings to the firewall are also lost for a few seconds, and after some time it's all back to normal.


  • Rebel Alliance Developer Netgate

    Killing and restarting dpinger was more about the 100% CPU issue and not the latency. That may be closer to the root cause.



    Dpinger goes back to normal CPU usage after a few seconds. Avahi also uses a lot of CPU on a reconnect when it's enabled.

    I disabled gateway monitoring and also marked the gateway as permanently down. dpinger still had high CPU usage, as did Avahi. So I disabled Avahi as well; now I get high CPU usage from dhcpd and route for a second during the reconnect, and the issue is still there.



    I just noticed that with Avahi disabled it gets a little better. My normal ping to pfSense is 0.2 ms; on a PPPoE reconnect with Avahi disabled I get 2 pings of 60-100 ms. I need Avahi for printer sharing between the server in the DMZ and the printers on the LAN, though.


  • Netgate Administrator

    What hardware are you running this on?

    Steve



  • pciconf -l | grep class=0x020000
    em2@pci0:0:25:0:        class=0x020000 card=0x1494103c chip=0x15028086 rev=0x04 hdr=0x00
    em0@pci0:1:0:0: class=0x020000 card=0x7044103c chip=0x105e8086 rev=0x06 hdr=0x00
    em1@pci0:1:0:1: class=0x020000 card=0x7044103c chip=0x105e8086 rev=0x06 hdr=0x00
    re0@pci0:3:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x06 hdr=0x00
    

    I also tried unplugging that Realtek PCI Express card from the switch so it's no longer an active slave in the bonding, but that didn't help. One of the Intel NICs is an onboard NIC; the other two are on a dual-port PCI Express card. Please let me know if that's enough information or whether I have to open the machine and get the exact model number from the PCI card.


  • Rebel Alliance Developer Netgate

    Do you have any Advanced Configuration options active on the DHCP WAN? Maybe that's the one causing the issue, not PPPoE. See https://redmine.pfsense.org/issues/8507#note-23



    I have tried disabling IPv6 on the PPPoE link so that it's no longer using DHCP, but that did not help. The IPv4 address is acquired over PPPoE; there's nothing else I can configure there.


  • Netgate Administrator

    Ah, OK, so you don't actually have two WAN connections, just two gateways, one v4 and one v6?

    If that's the case, you would expect the DHCPv6 gateway to go down when the parent PPPoE link reconnects.

    Earlier I meant what CPU, RAM, etc. you are running. I was hoping to judge what sort of CPU usage might be expected there.

    Steve



    No, I do have 2 Internet connections, with 2 gateways each. I'll call them WAN and DSL from now on to make clear which one I mean.

    On WAN I have a static IPv4 configured, and I'm using DHCP for IPv6.

    On DSL I use PPPoE for IPv4 and DHCP for IPv6.

    The CPU is an Intel(R) Core(TM) i3-2120 @ 3.30GHz, with 16GB of RAM.


  • Netgate Administrator

    Ah OK.
    Hmm, I'm not sure that would apply to v6 at all; I don't think it does.

    That hardware is clearly more than capable of handling WAN events without a problem. Or should be, at least.

    Disabling gateway monitoring and/or the gateway monitoring action on each of those could be telling.

    Steve



    I've disabled the monitoring action on all gateways. With Avahi and Snort disabled it looks like this after PPPoE is connected:

    64 bytes from 192.168.1.1: icmp_seq=715 ttl=64 time=0.225 ms
    64 bytes from 192.168.1.1: icmp_seq=716 ttl=64 time=0.236 ms
    64 bytes from 192.168.1.1: icmp_seq=717 ttl=64 time=1996 ms
    64 bytes from 192.168.1.1: icmp_seq=718 ttl=64 time=973 ms
    64 bytes from 192.168.1.1: icmp_seq=719 ttl=64 time=2895 ms
    64 bytes from 192.168.1.1: icmp_seq=720 ttl=64 time=1879 ms
    64 bytes from 192.168.1.1: icmp_seq=721 ttl=64 time=855 ms
    64 bytes from 192.168.1.1: icmp_seq=722 ttl=64 time=32.5 ms
    64 bytes from 192.168.1.1: icmp_seq=723 ttl=64 time=0.297 ms
    64 bytes from 192.168.1.1: icmp_seq=724 ttl=64 time=0.213 ms
    64 bytes from 192.168.1.1: icmp_seq=725 ttl=64 time=0.166 ms
    64 bytes from 192.168.1.1: icmp_seq=726 ttl=64 time=2758 ms
    64 bytes from 192.168.1.1: icmp_seq=727 ttl=64 time=1734 ms
    64 bytes from 192.168.1.1: icmp_seq=728 ttl=64 time=710 ms
    64 bytes from 192.168.1.1: icmp_seq=729 ttl=64 time=772 ms
    64 bytes from 192.168.1.1: icmp_seq=730 ttl=64 time=1637 ms
    64 bytes from 192.168.1.1: icmp_seq=731 ttl=64 time=628 ms
    64 bytes from 192.168.1.1: icmp_seq=732 ttl=64 time=0.299 ms
    64 bytes from 192.168.1.1: icmp_seq=733 ttl=64 time=0.310 ms
    64 bytes from 192.168.1.1: icmp_seq=734 ttl=64 time=0.339 ms
    

    The following lines all have a latency <1ms.

    With Avahi disabled and Snort enabled it looks like this:

    64 bytes from 192.168.1.1: icmp_seq=30 ttl=64 time=0.337 ms
    64 bytes from 192.168.1.1: icmp_seq=31 ttl=64 time=2198 ms
    64 bytes from 192.168.1.1: icmp_seq=32 ttl=64 time=1174 ms
    64 bytes from 192.168.1.1: icmp_seq=33 ttl=64 time=151 ms
    64 bytes from 192.168.1.1: icmp_seq=34 ttl=64 time=443 ms
    64 bytes from 192.168.1.1: icmp_seq=35 ttl=64 time=2097 ms
    64 bytes from 192.168.1.1: icmp_seq=36 ttl=64 time=1090 ms
    64 bytes from 192.168.1.1: icmp_seq=37 ttl=64 time=66.9 ms
    64 bytes from 192.168.1.1: icmp_seq=38 ttl=64 time=0.310 ms
    64 bytes from 192.168.1.1: icmp_seq=39 ttl=64 time=0.305 ms
    64 bytes from 192.168.1.1: icmp_seq=40 ttl=64 time=2061 ms
    64 bytes from 192.168.1.1: icmp_seq=41 ttl=64 time=1037 ms
    64 bytes from 192.168.1.1: icmp_seq=42 ttl=64 time=13.8 ms
    64 bytes from 192.168.1.1: icmp_seq=43 ttl=64 time=91.6 ms
    64 bytes from 192.168.1.1: icmp_seq=44 ttl=64 time=0.270 ms
    64 bytes from 192.168.1.1: icmp_seq=45 ttl=64 time=386 ms
    64 bytes from 192.168.1.1: icmp_seq=46 ttl=64 time=0.260 ms
    

    The following lines all have a latency <1ms.

    It seems like neither one has an effect on this. I would rather assume that this is somehow related to the IPv6 latency spikes I'm experiencing when using the DHCPv6/RA functionality; see this thread for a description of that issue. Is there any way of troubleshooting this, for example seeing how much CPU time each kernel module uses, or something similar?


  • Netgate Administrator

    If you have top -aSH open in the console you should see which processes are using CPU cycles.
    Is it just dpinger?
    If you actually disable monitoring for all gateways, dpinger shouldn't be running at all, so that would reveal anything else.

    Steve
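    [Editor's note] Since the interactive top display freezes during the spike, batch mode can log snapshots to a file for later review. A minimal sketch, assuming FreeBSD top(1)'s flags; the 60-second duration and output path are arbitrary:

    ```shell
    # Log one full snapshot per second for 60 s around the reconnect;
    # -b (batch mode) keeps writing even while an interactive display
    # would appear frozen.
    top -aSH -b -d 60 -s 1 > /tmp/reconnect-top.log 2>&1
    ```

    Triggering the reconnect mid-capture and searching the log afterwards avoids losing the interesting seconds.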



    That's interesting... I re-enabled dpinger, as I think it's clearly not the cause.

    PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
       12 root       -92    -     0K   336K CPU2    2  57:12  66.62% [intr{irq268: re0}]
       12 root       -92    -     0K   336K RUN     1   5:40  62.03% [intr{swi4: clock (0)}]
        0 root       -92    -     0K   480K *in6_m  0  28:45  50.42% [kernel{em2 que}]
       11 root       155 ki31     0K    64K RUN     2  23.5H  27.08% [idle{idle: cpu2}]
    27976 root        34    0  6900K  2464K nanslp  2   0:01  21.58% [dpinger{dpinger}]
       11 root       155 ki31     0K    64K RUN     0  24.3H  18.74% [idle{idle: cpu0}]
        0 root       -92    -     0K   480K CPU0    0  21:12  18.15% [kernel{em1 que}]
    29112 root        33    0  6900K  2508K nanslp  2   0:01  17.51% [dpinger{dpinger}]
       11 root       155 ki31     0K    64K RUN     1  24.5H  16.80% [idle{idle: cpu1}]
    

    Is the Realtek interface the one causing the issues? Right now it's connected as part of the LAGG. Would disconnecting it help, or removing it from the LAGG, or can just physically removing it from the machine show whether it's the cause?


  • Netgate Administrator

    Well, if anything it's probably the Realtek. You should be able to just remove it from the LAGG, unless it's causing some terrible PCI bus issue.

    What throughput are you seeing there? Those numbers look high for a 3.3GHz i3.

    Steve



    I unplugged the Ethernet cable from the Realtek and removed it from the LAGG. This is what I'm seeing:

      PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
        0 root       -92    -     0K   480K *if_la  2  20:48  92.86% [kernel{em0 que}]
       12 root       -92    -     0K   336K CPU1    1   6:14  92.82% [intr{swi4: clock (0)}]
        0 root       -92    -     0K   480K CPU0    0  23:09  92.55% [kernel{em1 que}]
    17789 root        34    0  6900K  2456K nanslp  3   0:01  30.08% [dpinger{dpinger}]
    22550 root        33    0  6900K  2508K nanslp  3   0:01  28.47% [dpinger{dpinger}]
    41010 root        73    0  6216K  2080K *rawcb  3   0:00  18.74% /sbin/route -n get -inet6 default
    20069 root        29    0  8948K  2496K nanslp  3   0:00  15.18% [dpinger{dpinger}]
    

    I captured those top outputs right after the ping spike, once top started refreshing again. Normally it looks like this:

      PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
       11 root       155 ki31     0K    64K CPU1    1  26.5H  96.82% [idle{idle: cpu1}]
       11 root       155 ki31     0K    64K CPU3    3  26.0H  94.34% [idle{idle: cpu3}]
       11 root       155 ki31     0K    64K RUN     2  25.4H  94.21% [idle{idle: cpu2}]
       11 root       155 ki31     0K    64K CPU0    0  26.2H  89.89% [idle{idle: cpu0}]
        0 root       -92    -     0K   480K -       0  26:31   6.14% [kernel{em1 que}]
        0 root       -92    -     0K   480K -       0  32:31   3.60% [kernel{em2 que}]
        0 root       -92    -     0K   480K -       2  21:44   1.36% [kernel{em0 que}]
    92981 root        52   20  6968K  2640K wait    3   0:19   0.31% /bin/sh /var/db/rrd/updaterrd.sh
    

    The Realtek and the dual-port Intel are PCI Express cards; the other Intel is the onboard NIC, in case that matters. The traffic is constantly about 100 Mbit/s in and out; sometimes it goes up to 1 Gbit/s, but that's rather the exception.


  • Netgate Administrator

    Hmm, that's crazy-high CPU usage for 100Mb of throughput on an i3.

    Do you somehow have device polling enabled?
    Check the output of ifconfig -a.

    Or maybe the CPU is running at a fraction of normal speed somehow?
    Check sysctl dev.cpu.0.freq

    Steve



    Sorry, that data was the same as above (I copied the same thing twice); I've corrected it now. The CPU usage is pretty low.



    Does anybody have another idea what might be going on here? Why are the em0 and em1 queues taking that much CPU?


  • Netgate Administrator

    I wonder if it's some interrupt issue. Try checking vmstat -i. Is there some huge interrupt rate on the em interrupts? Can you see if it spikes at a reconnect?
    Also check the interrupt graph in System > Monitoring.

    If there is something there you might be able to tweak the interrupt moderation settings for em.

    Steve
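    [Editor's note] For reference, interrupt moderation on em(4) is controlled at boot time by a loader tunable. A config sketch, assuming the stock em driver; 8000 is the commonly cited default, so this only matters if raised or lowered:

    ```shell
    # /boot/loader.conf.local -- hypothetical tuning fragment, not from the thread.
    # Caps the interrupt rate per em NIC; lower values trade latency for
    # less interrupt load.
    hw.em.max_interrupt_rate="8000"
    ```

    A reboot is needed for loader tunables to take effect.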



    During a reconnect I ran the vmstat -i command a few times; this is what I got:

    #vmstat -i
    interrupt                          total       rate
    irq4: uart0                            4          0
    irq16: ehci0                     1819312          1
    irq23: ehci1                     2425630          2
    cpu0:timer                    1301077192       1073
    cpu1:timer                      42493717         35
    cpu3:timer                      57991184         48
    cpu2:timer                     599667271        494
    irq264: em0                   2223977382       1834
    irq265: em1                   1785916192       1472
    irq266: em2                   2958347521       2439
    irq267: hdac0                       7114          0
    irq268: re0                    181548785        150
    irq269: ahci0                    3925301          3
    Total                         9159196605       7551
    # vmstat -i
    interrupt                          total       rate
    irq4: uart0                            4          0
    irq16: ehci0                     1819312          1
    irq23: ehci1                     2425630          2
    cpu0:timer                    1301080827       1073
    cpu1:timer                      42497352         35
    cpu3:timer                      57994819         48
    cpu2:timer                     599670906        494
    irq264: em0                   2223977433       1834
    irq265: em1                   1785916305       1472
    irq266: em2                   2958347577       2439
    irq267: hdac0                       7114          0
    irq268: re0                    181548785        150
    irq269: ahci0                    3925307          3
    Total                         9159211371       7551
    # vmstat -i
    interrupt                          total       rate
    irq4: uart0                            4          0
    irq16: ehci0                     1819318          1
    irq23: ehci1                     2425638          2
    cpu0:timer                    1301083170       1073
    cpu1:timer                      42499612         35
    cpu3:timer                      57997160         48
    cpu2:timer                     599673173        494
    irq264: em0                   2223983068       1834
    irq265: em1                   1785920972       1472
    irq266: em2                   2958353105       2439
    irq267: hdac0                       7114          0
    irq268: re0                    181548785        150
    irq269: ahci0                    3925316          3
    Total                         9159236435       7551
    #
    interrupt                          total       rate
    irq4: uart0                            4          0
    irq16: ehci0                     1819324          1
    irq23: ehci1                     2425646          2
    cpu0:timer                    1301087566       1073
    cpu1:timer                      42503812         35
    cpu3:timer                      58001340         48
    cpu2:timer                     599677486        494
    irq264: em0                   2223989223       1834
    irq265: em1                   1785927209       1472
    irq266: em2                   2958360786       2439
    irq267: hdac0                       7114          0
    irq268: re0                    181548785        150
    irq269: ahci0                    3925330          3
    Total                         9159273625       7551
    

    There was a spike that caused the vmstat -i command to hang, so if there was anything to capture, it should be in there.

    Looking at the graph at the shortest (1 minute) interval, interrupts are at 15%, system at 50% and user at 15% during the minute with the reconnect. Afterwards everything goes back to normal (system at 4%, everything else less than 0.5%).
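    [Editor's note] One caveat with the captures above: the "rate" column of vmstat -i is averaged since boot, so a few seconds of spiking barely moves it. Diffing the "total" column between snapshots shows the true short-term rate. A sketch using the em counters from the first and third captures above (the 5-second gap between snapshots is an assumption):

    ```shell
    # Counters copied from two of the vmstat -i snapshots above.
    cat > vm1.txt <<'EOF'
    irq264: em0 2223977382 1834
    irq265: em1 1785916192 1472
    irq266: em2 2958347521 2439
    EOF
    cat > vm2.txt <<'EOF'
    irq264: em0 2223983068 1834
    irq265: em1 1785920972 1472
    irq266: em2 2958353105 2439
    EOF
    # First pass stores the totals from snapshot 1; second pass prints
    # the per-interrupt delta over the assumed 5-second gap.
    awk 'NR==FNR { tot[$2] = $3; next }
         { printf "%s %d interrupts/s\n", $2, ($3 - tot[$2]) / 5 }' vm1.txt vm2.txt
    ```

    Here the deltas work out to roughly 950-1150 interrupts/s per NIC; a reconnect-time spike would show up as a delta far above the steady-state value, even while the "rate" column stays flat.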



    Does anyone have another idea? I really have no clue what else to try or how to debug this. It should all work, but it doesn't. How could I track down the issue?


  • Netgate Administrator

    I notice you have that high CPU usage in state *if_la, which I assume is truncated from if_lagg.

    What sort of lagg is that? What interfaces are in it?

    Can you remove the lagg entirely and put the VLANs on one interface directly, as a test?

    It's weird though. I have a setup at home that's almost exactly like that: PPPoE connections on VLANs on an LACP LAGG.

    Steve



    It is an LACP lagg with all interfaces in it (3 Intels and 1 Realtek). As I mentioned before, I already tried removing the Realtek from it, but it made no difference.

    I'll try putting the VLANs on the interfaces directly tomorrow; I'll have to find a time when downtime doesn't matter.



    I think I found the issue: it was something with the tunables. It could have been the IP-redirect one, but I changed a few more before this reboot. Anyway, this is what it looks like now during a PPPoE reconnect (only lines with more than 1 ms response time are shown; the ping interval is 0.05 s):

    64 bytes from 192.168.1.1: icmp_seq=4223 ttl=64 time=79.2 ms
    64 bytes from 192.168.1.1: icmp_seq=4224 ttl=64 time=23.2 ms
    64 bytes from 192.168.1.1: icmp_seq=4225 ttl=64 time=126 ms
    64 bytes from 192.168.1.1: icmp_seq=4226 ttl=64 time=72.8 ms
    64 bytes from 192.168.1.1: icmp_seq=4227 ttl=64 time=16.8 ms
    64 bytes from 192.168.1.1: icmp_seq=4236 ttl=64 time=1.10 ms
    64 bytes from 192.168.1.1: icmp_seq=4255 ttl=64 time=65.5 ms
    64 bytes from 192.168.1.1: icmp_seq=4256 ttl=64 time=9.53 ms
    64 bytes from 192.168.1.1: icmp_seq=4257 ttl=64 time=121 ms
    64 bytes from 192.168.1.1: icmp_seq=4258 ttl=64 time=67.8 ms
    64 bytes from 192.168.1.1: icmp_seq=4259 ttl=64 time=12.3 ms
    64 bytes from 192.168.1.1: icmp_seq=4278 ttl=64 time=1.10 ms
    64 bytes from 192.168.1.1: icmp_seq=4363 ttl=64 time=2.16 ms
    64 bytes from 192.168.1.1: icmp_seq=4461 ttl=64 time=63.3 ms
    64 bytes from 192.168.1.1: icmp_seq=4462 ttl=64 time=7.35 ms
    64 bytes from 192.168.1.1: icmp_seq=4463 ttl=64 time=114 ms
    64 bytes from 192.168.1.1: icmp_seq=4464 ttl=64 time=60.7 ms
    64 bytes from 192.168.1.1: icmp_seq=4465 ttl=64 time=4.71 ms
    
    64 bytes from 192.168.1.1: icmp_seq=4496 ttl=64 time=79.2 ms
    64 bytes from 192.168.1.1: icmp_seq=4497 ttl=64 time=23.2 ms
    64 bytes from 192.168.1.1: icmp_seq=4498 ttl=64 time=125 ms
    64 bytes from 192.168.1.1: icmp_seq=4499 ttl=64 time=68.1 ms
    64 bytes from 192.168.1.1: icmp_seq=4500 ttl=64 time=12.1 ms
    64 bytes from 192.168.1.1: icmp_seq=4514 ttl=64 time=103 ms
    64 bytes from 192.168.1.1: icmp_seq=4515 ttl=64 time=47.6 ms
    64 bytes from 192.168.1.1: icmp_seq=4517 ttl=64 time=104 ms
    64 bytes from 192.168.1.1: icmp_seq=4518 ttl=64 time=48.7 ms
    

    This should be okay, right? Or is there more room for improvement?



    Actually, that did not solve the issue... I rebooted pfSense, added the Intel NIC I had removed from the LACP earlier back in, and it behaved just like it used to.

    Now I have removed em2 from the LAGG again and unplugged it, and I'm getting those good times again; the radvd problem is solved as well. Now, of course, I want to figure out the real source of the problem. The card causing the issues is em2; pciconf -lv says:

    em2@pci0:0:25:0:        class=0x020000 card=0x1494103c chip=0x15028086 rev=0x04 hdr=0x00
        vendor     = 'Intel Corporation'
        device     = '82579LM Gigabit Network Connection (Lewisville)'
        class      = network
        subclass   = ethernet
    

    Does anybody have any idea why this card is causing such massive problems in my setup?


  • Netgate Administrator

    So you still have the other em NICs and the re NIC in that lagg when the ping latency looks good?
    Or is that with the lagg effectively removed?

    Steve



    Yes, I still have the LACP containing the other 2 em NICs and the re NIC; em2 is just no longer part of the LACP, and I've pulled out its Ethernet cable as well.


  • Netgate Administrator

    Curious. You could try enabling LACP debugging:
    sysctl net.link.lagg.lacp.debug=1

    Are those em NICs all the same type?

    Steve
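    [Editor's note] A sketch of such a debug session around a reconnect; the LACP debug output lands in the kernel message buffer, though the exact log location may vary on pfSense:

    ```shell
    sysctl net.link.lagg.lacp.debug=1   # enable verbose LACP state logging
    # ...trigger the PPPoE reconnect, then:
    dmesg | grep -i lacp                # inspect the negotiation messages
    sysctl net.link.lagg.lacp.debug=0   # turn the logging back off
    ```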



    No, they are not all the same; em2 is a different one (it's the onboard NIC). The others are 82571EB.

    Now that I know what causes the issue, I will check whether I still have one of those computers somewhere; I need a machine for testing all this stuff.



    Could this patch solve the issue? https://svnweb.freebsd.org/base?view=revision&revision=336313

    It looks like there are known problems with that chipset causing hangs; that looks like exactly what I've experienced.


  • Netgate Administrator

    TSO should be disabled by default in pfSense anyway, in System > Advanced > Networking. But check the ifconfig output for the parent NICs to be sure.

    Steve
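    [Editor's note] Checking the per-NIC flags can be sketched like this; em2 stands in for whichever parent NIC is being tested, and the flag names follow FreeBSD ifconfig(8):

    ```shell
    ifconfig em2 | grep -E -o 'TSO4|TSO6|LRO'   # offload flags currently active
    ifconfig em2 -tso -lro                      # disable TSO (v4+v6) and LRO on the NIC
    sysctl net.inet.tcp.tso                     # global TSO switch; 0 when disabled
    ```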



    I have it enabled; is it possible that that's what caused all the issues? I assumed it works fine with Intel NICs. Do I understand correctly that the patch basically disables TSO, so there's no point in having it enabled on my installation?