VLANs kill my igb i210AT NICs on APU2C4
I have been pulling my hair out on the following issue for the last 4 days - I am stuck, so I am posting here now.
I am currently running my WAN connection over PIA OVPN on a pfSense VM.
For availability reasons I wanted to move that, my Sophos UTM VM, my Debian DHCP & DNS VM and my pihole VM to a bare metal pfSense box and also migrate to a new IP range (Class C -> Class A + VLANs) and DNS suffix.
For that I chose a PCEngines APU2C4. I installed the 2.3.2 memstick image (amd64), updated to 2.3.2_1 and set up my interfaces and rules.
In testing I didn't use any VLANs, just physical interfaces to my notebook and desktop plus a WAN link (WAN as in an IP subnet to my router/modem - no PPPoE and no NAT).
Since everything worked I changed my Netgear GS748TSv1 switch config to give me tagged ports for my VLANs and plugged everything in.
At first this was on an LACP LAGG, since I planned on running a 3-interface LAGG with all my VLANs on it; then on single ports, because after just a few packets all communication would stop.
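For reference, the LAGG I had in mind is what pfSense builds under Interfaces > LAGGs; on plain FreeBSD the equivalent would be roughly this (a sketch - port names taken from my box, yours may differ):

```shell
# create an LACP lagg over the three igb ports
ifconfig lagg0 create
ifconfig lagg0 up laggproto lacp laggport igb0 laggport igb1 laggport igb2
# inspect the lagg and the state of its member ports
ifconfig lagg0
```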
By now I have traced the issue down to inter-VLAN communication on a single NIC. When I run traffic from one VLAN to another it works fine as long as the other VLAN is on a different NIC, and the same goes for traffic over/to a NIC without VLANs. But the second I run traffic between two VLANs hosted on the same NIC, that NIC stops all communication.
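To make the "two VLANs on the same NIC" setup concrete: on FreeBSD/pfSense that means two VLAN child interfaces on one parent, roughly like this (VLAN IDs and addresses here are made-up examples, not my actual config):

```shell
# two VLAN child interfaces on the same parent NIC (igb1)
ifconfig igb1.10 create            # 802.1Q VLAN 10 on igb1
ifconfig igb1.10 inet 10.0.10.1/24 up
ifconfig igb1.20 create            # 802.1Q VLAN 20 on igb1
ifconfig igb1.20 inet 10.0.20.1/24 up
```

Routing between 10.0.10.0/24 and 10.0.20.0/24 then goes in and out of igb1 with different tags - that is exactly the traffic pattern that kills the NIC for me.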
Clients get a "Destination Unreachable" in the terminal or an "ERR_ADDRESS_UNREACHABLE" in Chrome etc., and pfSense gives me a "Host down".
I tried to run a packet capture before recreating the issue, but the second it occurs even the packet capture just stops - the interface it is bound to simply goes dead.
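In case it helps the next person trying this: capturing on the parent interface from the console shows the 802.1Q tags directly (interface name assumed from my setup):

```shell
# watch tagged frames on the parent NIC; -e prints the link-level
# header including the 802.1Q tag, -n skips name resolution
tcpdump -e -n -i igb1 vlan
```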
I have moved VLANs across all 3 NICs by now and all 3 show this issue with VLAN traffic to and from the same NIC. The issue does not occur from mere pinging; it needs somewhat more traffic than that, e.g. opening youtube.com or just browsing around in a vSphere web client.
pfSense does not log anything regarding this nor posts any errors to TTY.
When I was on LACP I got some LACP flapping messages, so I moved to a static LAGG, which didn't resolve my issues ofc, so I went on to a single tagged port.
I also encountered a queue length overflow so I added some tunables which made sense to me:
```
net.inet.tcp.tso=0
kern.ipc.nmbclusters=1000000
kern.ipc.soacceptqueue=4096
hw.igb.max_interrupt_rate=32000
hw.igb.fc_setting=0
```
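For anyone reproducing this: as far as I understand it, the hw.igb.* ones (and kern.ipc.nmbclusters) are loader-time tunables, while the others are runtime sysctls - a sketch of where they would go:

```shell
# /boot/loader.conf.local -- loader-time tunables, need a reboot
#   hw.igb.max_interrupt_rate=32000
#   hw.igb.fc_setting=0
#   kern.ipc.nmbclusters=1000000

# runtime sysctls -- or add them under System > Advanced > System Tunables
sysctl net.inet.tcp.tso=0
sysctl kern.ipc.soacceptqueue=4096
```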
I set up igb0 as a physical interface on my old network so I can manage the box from there, and I am also passing the serial connection through to a Debian VM on one of my ESXi servers so I can administrate it over serial (minicom) from the CLI if all else fails. igb1 and igb2 are carrying my VLANs - that's also how I tested when exactly the issue occurs. If you need any other info I forgot, just ask. Thanks for any help in advance!
I would test it with another distro and, if it still doesn't work, post to the support mailing list for whoever maintains the drivers, such as FreeBSD kernel support or the Intel support forums.
My first thought here would be hardware VLAN tagging. I could see that being handled differently when the traffic stays on the same NIC. You might try disabling that; I believe you need to do it via ifconfig.
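Something like this, if I remember the flags right (igb1 assumed as the trunk NIC):

```shell
# turn off hardware VLAN tag offload and hardware VLAN filtering
ifconfig igb1 -vlanhwtag -vlanhwfilter
# check that VLAN_HWTAGGING no longer shows in the options= line
ifconfig igb1
```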
I would also be sure to check the mbuf values using `netstat -m`. Exhausting those can present like that.
Never had any problems using many APU2C4s with VLANs, it just works. But I never tried this with LACP (does LACP solve a specific problem for you besides more bandwidth and redundancy?).
Did you try the same setup without LACP? Are all VLANs tagged on the switch? (A packet capture on the pfSense port may help.)
I tried a new install on an HP DL360 G6 with Broadcom NICs and am seeing the same issue. Just in this case the NICs come back after one minute and log interface DOWN and UP to syslog.
Since these are bce NICs I added these tunables:
```
kern.ipc.nmbclusters=131072
hw.bce.tso_enable=0
hw.pci.enable_msix=0
net.isr.direct_force=1
net.isr.direct=1
```
I also tried disabling hardware VLAN tagging; it didn't make a difference.
mbufs are around 1/3 of max usage.
By now I am doubting the switch, since once again it works when the inter-VLAN traffic goes through different NICs instead of the same one - which of course also means different rather than the same port on the switch.
And in case it wasn't clear enough: I tried this on LACP, on a static LAG and with no LAG - right now it's on two single trunk ports to the switch.
Any other ideas? I am already looking into getting a HP switch with a CLI instead of another web UI..
OK, so I added a "smart" 8-port switch I still had lying around to the mix to handle the VLAN tagging instead.
So it got all VLANs untagged on ports 1-7 and tagged them all on port 8 to pfSense.
So after all, the Netgear switch was the issue. Going to hunt down a Cisco C3750G if I can and replace it…
Been testing this with UBNT switches, with no issues. So yeah, that's definitely a switch to blame.
So after all, the Netgear switch was the issue.
In the Netgear configuration you often need to add the whole LAG to the VLAN and not only single ports!
Did you know that?
Yeah, that's what I did. But I have sold the APU now, since it was fine, and bought a C3750G 24T for the money.
Still using one of those Netgears as access and distribution switch for the building - works fine trunked over LACP with the Cisco.
Ran into some other issues with the HP server now - the bce driver going ham. Will open a new thread for that though, or just replace the NICs with Intels.