CARP VIP: traffic shaping queues lost on secondary upon failover

jtl

Hello

Trying out an HA setup with two machines running pfSense 2.4.3. The primary is a Supermicro with an Intel NIC (em driver), and the secondary is an APU1d4, which has Realtek NICs (re driver).

Before you say "pfsync doesn't work with differing interface names": that's true, but I discovered you can use <earlyshellcmd> to rename the interfaces early enough in the boot process to avoid causing other issues, and the new names persist in the running system (i.e., re1 (WAN) on the secondary is renamed to em3 to match the primary's WAN).
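For illustration, a minimal Python sketch that generates the FreeBSD `ifconfig <old> name <new>` commands one might place in <earlyshellcmd> on the secondary. Only the re1 -> em3 (WAN) mapping comes from this post; the re0 and re2 entries and the `rename_commands` helper are hypothetical. It rejects mappings where two NICs would end up with the same name, or where a new name is also an old name (such as a swap), since those would depend on the order the commands run in.

```python
from __future__ import annotations

# Only re1 -> em3 (WAN) is from the post; re0 and re2 are hypothetical.
RENAMES = {
    "re0": "em0",  # hypothetical
    "re1": "em3",  # WAN on the secondary, renamed to match the primary
    "re2": "em1",  # hypothetical
}


def rename_commands(renames: dict[str, str]) -> list[str]:
    """Build one `ifconfig <old> name <new>` command per interface.

    Rejects mappings that would give two NICs the same name, and mappings
    where a new name is also an old name (e.g. a swap), because those only
    work if the commands run in a particular order.
    """
    new_names = list(renames.values())
    if len(set(new_names)) != len(new_names):
        raise ValueError("two interfaces would get the same new name")
    clashes = set(new_names) & set(renames)
    if clashes:
        raise ValueError(f"new names that are also old names: {sorted(clashes)}")
    # Sorted so the output is stable regardless of dict insertion order.
    return [f"ifconfig {old} name {new}" for old, new in sorted(renames.items())]


if __name__ == "__main__":
    for command in rename_commands(RENAMES):
        print(command)
```

Running it prints `ifconfig re0 name em0`, `ifconfig re1 name em3` and `ifconfig re2 name em1`, one per line.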

Because I only have a DHCP WAN connection, I'm prototyping a concept I call "bettercarp": my WAN connection terminates on a managed switch, and upon failover the primary and secondary each shut down the other's WAN switch port via SNMP to avoid any MAC address conflicts. All the LAN and VLAN interfaces are CARP VIPs. This seems to work fairly well after sorting out some issues with devd.
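A rough sketch of the fencing decision, assuming the switch accepts SNMP SETs on the standard IF-MIB `ifAdminStatus` object (1.3.6.1.2.1.2.2.1.7, where 1 = up and 2 = down). The node names, the switch ifIndex numbers and the `fencing_sets` helper are hypothetical; detecting the CARP transition (the post mentions devd) and actually sending the SETs (for example with net-snmp's `snmpset`) are left out.

```python
from __future__ import annotations

# IF-MIB::ifAdminStatus column (RFC 2863); append ".<ifIndex>" for one port.
IF_ADMIN_STATUS = "1.3.6.1.2.1.2.2.1.7"
UP, DOWN = 1, 2  # ifAdminStatus values: up(1), down(2)

# Hypothetical: switch ifIndex of the port each firewall's WAN NIC plugs into.
WAN_SWITCH_PORT = {"primary": 1, "secondary": 2}


def fencing_sets(node: str, carp_state: str) -> list[tuple[str, int]]:
    """Return the (OID, value) SNMP SETs a node should issue for a CARP event.

    Only a transition to MASTER triggers fencing; any other state returns
    an empty list, so a node dropping to BACKUP never touches the switch.
    """
    if carp_state != "MASTER":
        return []
    own_port = WAN_SWITCH_PORT[node]
    peer_ports = [port for name, port in sorted(WAN_SWITCH_PORT.items()) if name != node]
    # Shut the peer's WAN port first, then enable our own, so our port is
    # never enabled by this sequence while the peer's is still up.
    sets = [(f"{IF_ADMIN_STATUS}.{port}", DOWN) for port in peer_ports]
    sets.append((f"{IF_ADMIN_STATUS}.{own_port}", UP))
    return sets


if __name__ == "__main__":
    for oid, value in fencing_sets("secondary", "MASTER"):
        print(oid, "=", value)
```

Ordering the peer's shutdown before the local enable is a deliberate choice in this sketch, aimed at the MAC address conflicts the setup is meant to avoid.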

If I get this fully working, I'm planning on upgrading my internet connection to 5 usable IPs. Sadly, they are still DHCP-assigned, but I could then have a second WAN interface on both the primary and secondary, so I could do updates without needing to fail over the connection.

An issue I haven't been able to figure out: I use HFSC ALTQ traffic shaping on my WAN to prioritize my LAN over my DMZ network. Upon failover to the secondary, all traffic from connections that were assigned to subqueues ends up in the interface's root queue. Because pfsync otherwise works, no connections are dropped. To rule out an issue with my WAN "fencing" setup, I tested limiting LAN-to-DMZ bandwidth (both are VIP interfaces) with an HFSC upper limit of 10 Mb/s and ran iperf from my workstation (LAN) to a host on my DMZ network.

Upon CARP failover to the secondary, the traffic ended up in the root interface queue, and upon failover back to the primary, the traffic was in the correct queue again. If I reboot the primary while the secondary is master and it then fails back to the primary, the connection isn't dropped, but its queue assignment is lost.

At 14-15 seconds I fail over to the secondary using CARP maintenance mode. I disable maintenance mode soon after, and traffic moves back to the primary at 40-41 seconds with the queues still intact.
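To pinpoint when shaping is lost in a run like this, a small Python helper can flag iperf3 reporting intervals that exceed the 10 Mb/s upper limit. This assumes iperf3 is run with `--json` and that each entry in its `intervals` array carries a `sum` object with `start`, `end` and `bits_per_second`; the 10% tolerance, the `unshaped_intervals` name and the sample numbers are made up for illustration, not measurements from this test.

```python
from __future__ import annotations

import json

UPPER_LIMIT_BPS = 10_000_000  # HFSC upper limit used in the test (10 Mb/s)


def unshaped_intervals(iperf_json: str, limit_bps: float = UPPER_LIMIT_BPS,
                       tolerance: float = 0.10) -> list[tuple[float, float]]:
    """Return (start, end) in seconds for every interval above the cap.

    An interval counts as unshaped when its rate exceeds
    limit_bps * (1 + tolerance); the tolerance allows for small overshoot.
    """
    report = json.loads(iperf_json)
    threshold = limit_bps * (1 + tolerance)
    flagged = []
    for interval in report["intervals"]:
        summary = interval["sum"]  # aggregate over all parallel streams
        if summary["bits_per_second"] > threshold:
            flagged.append((summary["start"], summary["end"]))
    return flagged


if __name__ == "__main__":
    # Made-up sample shaped like iperf3 --json output, not real measurements.
    sample = json.dumps({"intervals": [
        {"sum": {"start": 13.0, "end": 14.0, "bits_per_second": 9.6e6}},
        {"sum": {"start": 14.0, "end": 15.0, "bits_per_second": 94.1e6}},
        {"sum": {"start": 40.0, "end": 41.0, "bits_per_second": 9.8e6}},
    ]})
    print(unshaped_intervals(sample))
```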

Thanks

Derelict

Trying out an HA setup with two machines running pfSense 2.4.3. The primary is a Supermicro with an Intel NIC (em driver), and the secondary is an APU1d4, which has Realtek NICs (re driver).

Not even going to read any further because that is a completely unsupported configuration. HA nodes must match. If you want to test HA, use VMs.

jtl

Grumble grumble. Maybe this weekend I'll try the same thing with a VM lab on my desktop and report back.

jtl

@Derelict:

OK. I've reproduced the same issue with "matching" VM hardware (Proxmox KVM). Just disregard the first sentence of my first post if it makes you happy.

Still running pfSense 2.4.3. I have CARP VIPs on both the "WAN" and "LAN" interfaces, which are just VLANs on my core switch. A server of mine on the LAN VLAN runs an iperf3 server to simulate LAN->WAN throughput and vice versa, and my desktop on the WAN VLAN acts as the iperf3 client.

I suck at diagrams, so here's a brief description of my network topology.

Test network WAN - 10.253.0.1/24
Test network LAN - 10.254.0.1/24
PFSYNC/Management network - 10.100.0.1/24
pfsense-dev-master WAN - 10.253.0.2 (VIP 10.253.0.1)
pfsense-dev-master LAN - 10.254.0.2 (VIP 10.254.0.1)
pfsense-dev-slave WAN - 10.253.0.3 (VIP 10.253.0.1)
pfsense-dev-slave LAN - 10.254.0.3 (VIP 10.254.0.1)

jtl-desktop - 10.253.0.9 (for testing LAN->WAN and vice versa using iperf)
angrybear (server) - 10.254.0.10 (TCP/UDP port 5201 port-forwarded to it from the WAN VIP)
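As a quick consistency check on the addressing plan above, a Python sketch using the standard `ipaddress` module can confirm that each firewall address and test host sits in the right /24 and that nothing collides with a CARP VIP. The addresses are copied from the list; the PFSYNC/management network is left out because no per-node addresses are given, and `check_plan` is a hypothetical helper.

```python
from __future__ import annotations

import ipaddress

# Addresses copied from the topology list in the post.
NETWORKS = {
    "WAN": ipaddress.ip_network("10.253.0.0/24"),
    "LAN": ipaddress.ip_network("10.254.0.0/24"),
}
VIPS = {"WAN": "10.253.0.1", "LAN": "10.254.0.1"}
# (name, network, address) for every non-VIP address in the lab.
HOSTS = [
    ("pfsense-dev-master", "WAN", "10.253.0.2"),
    ("pfsense-dev-master", "LAN", "10.254.0.2"),
    ("pfsense-dev-slave", "WAN", "10.253.0.3"),
    ("pfsense-dev-slave", "LAN", "10.254.0.3"),
    ("jtl-desktop", "WAN", "10.253.0.9"),
    ("angrybear", "LAN", "10.254.0.10"),
]


def check_plan(hosts=HOSTS) -> list[str]:
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    # Start with the VIPs so any host reusing a VIP address is reported.
    seen = {ipaddress.ip_address(v) for v in VIPS.values()}
    for iface, net in NETWORKS.items():
        if ipaddress.ip_address(VIPS[iface]) not in net:
            problems.append(f"{iface} VIP {VIPS[iface]} is outside {net}")
    for name, iface, addr_text in hosts:
        addr = ipaddress.ip_address(addr_text)
        if addr not in NETWORKS[iface]:
            problems.append(f"{name} {iface} {addr} is outside {NETWORKS[iface]}")
        if addr in seen:
            problems.append(f"{name} {iface} {addr} duplicates another address")
        seen.add(addr)
    return problems


if __name__ == "__main__":
    print(check_plan() or "addressing plan looks consistent")
```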

I still have the same issue: when I'm testing LAN->WAN bandwidth, which is traffic shaped, and I fail over to the secondary, the shaped traffic ends up in the root interface queue.

I made a video here as well

Derelict

If you have a reproducible case, please open a report at redmine.pfsense.org outlining the expected behavior, the steps to reproduce, and the actual behavior.