CoS (802.1p) tag bug with virtualized pfsense
-
I've been trying to troubleshoot an issue with my WAN configuration when running pfsense as a VM on ESXi (6.7, tested on U1 and U2).
My ISP (Google Fiber) has a fairly uncommon requirement for bypassing their hardware, which doesn't support bridge mode: I need to tag WAN traffic on VLAN 2 with a CoS priority of 3 (802.1p tag). Luckily, pfsense supports this configuration, and it has worked for me for the better part of a year running on bare metal. I've been tinkering with moving this over to a VM so I could have a bit more flexibility with my server gear.
When I moved the instance to a VM, my WAN performance fell to the characteristic signature of a missing CoS tag: anywhere from 550-700 Mbps down, but an abysmal 15-50 Mbps up (out of a possible 1000/1000). I noticed this behavior on only one of my two hosts, the one I was previously running on bare metal, which has dual i210-AT NICs. On my other host, with quad 82574L NICs, the VM performs completely as expected, at full gigabit.
None of the VM configuration changes between ESXi hosts. Both hosts are on a distributed vSwitch with the port group trunking VLANs 0-4094 (so ESXi is not touching the VLANs), and I've tested both LACP and single-NIC uplink configurations with both VMXNET3 and e1000e VM NICs.
To clarify further: it works on the host with 82574Ls, and once I vMotion to the host with i210-ATs, it starts behaving as if the CoS tag isn't there. Side note: I've also tested it with a separate 82576 dual-port NIC, and that works fine too. I've also tested with i350s, and those exhibit the same behavior as the i210s. Finally, I've tested passing a NIC through to the VM, and THAT works completely fine.
The odd thing is that if I run a tcpdump, I do see the CoS tag of 3 properly on egress, but I'm not really able to test it at any point after it leaves pfsense (that is, once it hits the virtualization layer).
I'm honestly not sure if it's a pfsense or a VMware issue, but I'm inclined to think it's pfsense since it works completely fine with 82574Ls.
I've tried pretty much every troubleshooting step I can possibly think of, to no avail.
The workaround I have for now is to set an ethernet switch rule on my mikrotik switch to add the CoS 3 tag matching VLAN 2 traffic. So I do have it working, but I think this may be an issue with the vmx driver.
-
First off, 802.1p has been obsolete for 20 years. It's been rolled into 802.1Q, which is the VLAN spec. Second, you mention VMs. Do the VMs use bridging or NAT mode for networking? If NAT, you've got a router between the VM and host. Since 802.1Q is a layer 2 protocol, it will not pass through a router.
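For readers following along, here is a minimal sketch of where the priority bits sit in an 802.1Q tag. It is only an illustration of the standard tag layout (a 16-bit TPID of 0x8100, then 3 bits of priority, 1 DEI bit, and a 12-bit VLAN ID). The function name `build_vlan_tag` is made up for this example and is not part of pfSense or ESXi.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def build_vlan_tag(vlan_id: int, priority: int, dei: int = 0) -> bytes:
    """Return the 4-byte 802.1Q tag: TPID followed by the TCI.

    Hypothetical helper for illustration only.
    """
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= priority <= 7:
        raise ValueError("priority (PCP) must fit in 3 bits")
    if dei not in (0, 1):
        raise ValueError("DEI is a single bit")
    # TCI layout: PCP in the top 3 bits, DEI next, VLAN ID in the low 12 bits
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


# The tag described in the first post: VLAN 2 with priority 3
print(build_vlan_tag(vlan_id=2, priority=3).hex())  # 81006002
```

With the priority left at 0, the same VLAN 2 tag would be 81000002, so a tag that lost its priority along the way differs only in those top three TCI bits.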
-
Are you running those NICs full pass-through to pfSense? There might be some hardware off-loading options that are broken by that.
If not, then pfSense cannot see the hardware differences at all; it must be a VMware issue. Can you do that tagging in ESXi instead?
Ultimately I would try to get a switch with a mirror port in the connection to run a pcap and see what is really being sent.
Steve
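If a capture from a mirror port is available, a check along these lines could confirm whether the priority survives on the wire. This is a rough sketch, assuming the raw Ethernet frame bytes have already been extracted from the pcap by some other tool. `read_vlan_tag` and the sample frame are hypothetical, and the sketch only handles a single 802.1Q tag (no QinQ).

```python
import struct


def read_vlan_tag(frame: bytes):
    """Return (vlan_id, priority) for an 802.1Q-tagged frame, or None if untagged.

    Hypothetical helper: expects a raw Ethernet frame starting at the
    destination MAC address.
    """
    if len(frame) < 16:
        return None
    # Bytes 12-13 hold the TPID, bytes 14-15 hold the TCI
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != 0x8100:
        return None
    return tci & 0x0FFF, tci >> 13


# Made-up frame header: broadcast destination, locally administered source,
# 802.1Q tag for VLAN 2 with priority 3, then the IPv4 EtherType
dst = bytes.fromhex("ffffffffffff")
src = bytes.fromhex("020000000001")
tagged = dst + src + bytes.fromhex("81006002") + bytes.fromhex("0800")

print(read_vlan_tag(tagged))  # (2, 3)
```

A frame that left the host with the priority zeroed would come back as (2, 0), which is the case worth looking for on the i210/i350 host.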
-
@JKnott said in CoS (802.1p) tag bug with virtualized pfsense:
First off, 802.1p has been obsolete for 20 years. It's been rolled into 802.1Q, which is the VLAN spec. Second, you mention VMs. Do the VMs use bridging or NAT mode for networking? If NAT, you've got a router between the VM and host. Since 802.1Q is a layer 2 protocol, it will not pass through a router.
It's bridge mode, as it's ESXi's native network handler. The physical connection gets bridged to a virtual uplink that attaches to a virtual switch. It's just the standard ESXi networking that most here use.
@stephenw10 said in CoS (802.1p) tag bug with virtualized pfsense:
Are you running those NICs full pass-through to pfSense? There might be some hardware off-loading options that are broken by that.
If not, then pfSense cannot see the hardware differences at all; it must be a VMware issue. Can you do that tagging in ESXi instead?
Ultimately I would try to get a switch with a mirror port in the connection to run a pcap and see what is really being sent.
Steve
If I pass the NICs through directly to the pfsense VM, then it works completely fine. As I said, this only happens when using virtual NICs with the vmx drivers. Tagging in ESXi doesn't really help, since I still need to tag at the pfsense level to properly pull DHCP on the right interface.
The fact that this works fine with some NICs and not others leads me to believe it's an issue with the vmx driver/pfsense. Setting a trunk group of 0-4094 on the port group in ESXi makes it so that ESXi doesn't touch the traffic at all.
-
But it still goes through a v-switch here, right? In that situation it doesn't matter what the physical NICs are; pfSense only sees the vmx NIC connected to the v-switch.
If it works OK in pass-through mode, where pfSense can see the NICs and use its own driver, why not do that?
Unless I'm misunderstanding the situation here, it looks more like an ESXi driver issue to me.
Steve
-
@stephenw10 and the only thing changing in that situation is the move from igb drivers to vmx drivers.
Coupled with the fact that it does work for certain NICs, I'm inclined to believe it's a pfsense issue, especially since trunking 0-4094 over a vSwitch doesn't touch any of the traffic coming in or out.
-
But you're saying that pfSense is using VMX NICs on both systems. The only difference is which NICs ESXi is using, right? And yet one works and the other does not?
Which ESXi version are you running? Same on both hosts?
Steve
-
Correct. The only difference is using underlying i210 or i350 NICs vs older 82574 or 82576 NICs.
Running 6.7U2 on both, but this behavior existed on U1 as well.
As an aside, I know the CoS tag is the culprit, since I can have the switch add the CoS tag and it works fine after that.
-
An external managed switch you mean?
-
Correct. It's a Mikrotik CRS328-24P-4S+. I can add the tag using
/interface ethernet switch rule add vlan-id=2 ports=<ports> new-vlan-priority=3
As soon as I add that to the corresponding physical ports (on the switch) the VMs are on top of, it all magically starts working again.
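Conceptually, the effect of that switch rule on each frame can be pictured with the sketch below: frames tagged with VLAN 2 get their priority bits overwritten with 3, and everything else passes through unchanged. This is only an illustration of the idea, not how RouterOS or the switch chip implements it. `set_priority` is a hypothetical name, and the sketch assumes a single 802.1Q tag.

```python
import struct


def set_priority(frame: bytes, vlan_id: int = 2, new_priority: int = 3) -> bytes:
    """Rewrite the 802.1Q priority bits on frames tagged with vlan_id.

    Hypothetical helper: frames that are untagged, too short, or on a
    different VLAN are returned unchanged.
    """
    if len(frame) < 16:
        return frame
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != 0x8100 or (tci & 0x0FFF) != vlan_id:
        return frame
    # Keep the DEI bit and VLAN ID (low 13 bits), replace the top 3 priority bits
    new_tci = (tci & 0x1FFF) | (new_priority << 13)
    return frame[:14] + struct.pack("!H", new_tci) + frame[16:]


# Made-up frame header on VLAN 2 that arrived with priority 0
header = bytes.fromhex("ffffffffffff" "020000000001" "81000002" "0800")
print(set_priority(header)[12:16].hex())  # 81006002
```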