CARP issues

steve40

Hello,

I'm hoping someone can point me in the right direction.

I have an odd situation where my pfSense box, which is configured as part of an HA pair, suddenly stops passing traffic. I have CARP enabled as per specifications and it works fine for about 450 seconds (yes, I timed it, as I originally thought it simply did not ARP the VIP after the 1200 second timeout). If I do nothing, the ARP entry as listed in the table just disappears and the box continues not passing traffic. Interestingly, if I go to the virtual IP section and do nothing other than click "save" at the bottom of the screen and hit "accept changes", it starts to work again for another 450 or so seconds before dying again.

All the while, the CAM table in my switch shows both the physical MAC and the virtual MAC bound to the correct ports in the correct VLAN. IGMP is disabled. The box is currently running alone without its partner, and I've ensured that the VHIDs are not being used anywhere else and are unique on each interface.

To add a little color to the situation, this pfSense box is running on top of a CentOS/KVM hypervisor. I have completely passed the PCI cards that hold the Ethernet ports over to the pfSense guest via PCI stubs configured in GRUB. As it stands now, the host has no knowledge of the Ethernet ports, which is confirmed through ifconfig. I did this after a losing battle with both virtio and e1000 NICs assigned as bridges to the guest. pfSense recognizes the cards and binds the em(4) driver to them appropriately and, like I said, works flawlessly for about 450 seconds with the CARP VIP as the gateway, and indefinitely if I point the gateway to the IP bound to the physical NIC.

These are lab boxes at the moment, so I'll post the config.xml file, loader.conf or any other file on the hypervisor or pfSense guest that you feel will help narrow down the root cause.

thanks all

Derelict

This stuff is almost always something in the switch.

It could also easily be something in the vSwitch in the hypervisor.

"If I do nothing the arp entry as listed in the table just disappears and the box continues not passing traffic."

What ARP entry in what table where?

steve40

Hello Derelict,

Thank you for the reply. Were it not for the fact that a simple "save" and click on apply temporarily resolves the issue, I would be inclined to agree completely with your statement that this is a switch-related issue. With respect to the vSwitch, I have also ruled that out because the Ethernet NIC is being provided directly to the guest at the PCI level utilizing VT-d and PCI stubbing, so the hypervisor has no knowledge of its existence.

The ARP entry I am talking about, which disappears from the table after the 1200 second timeout, is the VHID-generated MAC address on pfSense. This entry is viewable via the "Diagnostics" --> "ARP Table" screen in the pfSense GUI.

As you can see from the cut and paste of a pciconf -lv on the guest, there are a total of three Intel NICs: em0, em1 and em2.

em1 and em2 are the NICs which are completely passed through. em0 is an e1000 provided by QEMU. If you look closely you'll see the difference in chipset between em0 and the other two.

em0@pci0:0:7:0: class=0x020000 card=0x11001af4 chip=0x100e8086 rev=0x03 hdr=0x00
vendor = 'Intel Corporation'
device = '82540EM Gigabit Ethernet Controller'
class = network
subclass = ethernet

em1@pci0:0:10:0: class=0x020000 card=0x000015d9 chip=0x10d38086 rev=0x00 hdr=0x00
vendor = 'Intel Corporation'
device = '82574L Gigabit Network Connection'
class = network
subclass = ethernet

em2@pci0:0:12:0: class=0x020000 card=0x000015d9 chip=0x10d38086 rev=0x00 hdr=0x00
vendor = 'Intel Corporation'
device = '82574L Gigabit Network Connection'
class = network
subclass = ethernet

thanks again for your reply :)

Derelict

I can pretty much guarantee that "it's not pfSense." Saving might kick something out, like a GARP, that works for a while, but these problems are almost invariably out in the layer 2 infrastructure somewhere.

Please back up and look at specifics.

"The arp entry which I am talking about which disappears from the table after the 1200 second timeout is the VHID generated mac address on pfsense. This entry is viewable via the "diagnostics" --> "arp table" screen in the pfsense gui."

ARP entry on the primary or the secondary? For what address? There is no reason for the secondary to maintain an ARP entry for the primary unless it is actively communicating at layer 3 with the CARP VIP. If the secondary is unable to establish an ARP entry for the primary holding the CARP VIP then that also indicates a malfunction outside the firewall. Packet capture and analyze and figure out why.

Not sure if you're expecting to find a bug or something, because there isn't one (with the small exception of some IPv6 addressing issues that are easily remedied).

steve40

The ARP entry is on the primary, as the secondary unit has been brought offline.

Are you aware of any layer 2 issues related to TP-Link switches? As it stands now, the pfSense NICs are connected to the switch as follows:

em1 - WAN = VLAN 2, pfSense VHID 12
em2 - LAN = VLAN 3, pfSense VHID 13

thanks
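
For anyone matching these VHIDs against a switch CAM table: CARP builds its virtual MAC from the fixed prefix 00:00:5e:00:01 followed by the VHID as a single hex byte. A minimal Python sketch of that mapping (the function name `carp_virtual_mac` is made up for illustration, not part of pfSense):

```python
def carp_virtual_mac(vhid: int) -> str:
    """Return the virtual MAC address CARP uses for a given VHID (1-255)."""
    if not 1 <= vhid <= 255:
        raise ValueError(f"VHID must be between 1 and 255, got {vhid}")
    # Fixed prefix 00:00:5e:00:01, last byte is the VHID in hex.
    return f"00:00:5e:00:01:{vhid:02x}"


if __name__ == "__main__":
    # VHIDs from the post above: WAN uses 12, LAN uses 13.
    for name, vhid in (("WAN", 12), ("LAN", 13)):
        print(name, carp_virtual_mac(vhid))
```

With the VHIDs listed above, the switch should show 00:00:5e:00:01:0c on the WAN port and 00:00:5e:00:01:0d on the LAN port.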

steve40

Update.

I swapped out the TP-Link with a Cisco switch and the same issue persists. The primary unit is the only unit online, and ping -t to 8.8.8.8 from a workstation on the local LAN works for about 450 seconds, then begins to time out. If I go into the CARP settings page and just click "save" and "accept changes" (even without making any changes), it begins to work for another 450 seconds. The switch is a plain vanilla install, and the outside interface is running advanced outbound NAT which is mapping to the virtual IP.

As an added test, I simultaneously set up a continuous ping to the outside interface VIP address from the same host inside, and that ping begins to die at the exact same time as the one to 8.8.8.8. It would stand to reason, in my mind, that this would not happen if this were an upstream switch-related issue. I also have a laptop sitting on the outside subnet where this internet-facing VIP resides, and the laptop can ping the VIP and sees the VIP's autogenerated MAC address in its own ARP table.
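
The "about 450 seconds" figure could be measured rather than timed by hand if the ping results are logged. A rough sketch, assuming the samples are available as (timestamp in seconds, reply received) pairs; the function name `working_windows` and the sample format are hypothetical:

```python
from typing import Iterable, List, Tuple


def working_windows(samples: Iterable[Tuple[float, bool]]) -> List[float]:
    """Return the length in seconds of each unbroken run of successful pings."""
    windows: List[float] = []
    start = None    # timestamp of the first reply in the current run
    last_ok = None  # timestamp of the most recent reply in the current run
    for timestamp, ok in samples:
        if ok:
            if start is None:
                start = timestamp
            last_ok = timestamp
        elif start is not None:
            # First lost ping after a run of replies closes the window.
            windows.append(last_ok - start)
            start = None
    if start is not None:
        # The log ended while pings were still succeeding.
        windows.append(last_ok - start)
    return windows
```

If every window comes out at nearly the same length, that points at a timer expiring somewhere rather than at random loss.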

Derelict

You'll have to be more specific, post some screenshots, packet captures, etc.

It does not stand to reason that you are seeing a CARP problem that nobody else is.

Proper functioning of CARP does not require an ARP entry on the local firewall for the CARP VIP.

xn1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=3<RXCSUM,TXCSUM>
	ether f2:92:fa:6a:32:79
	hwaddr f2:92:fa:6a:32:79
	inet6 fe80::f092:faff:fe6a:3279%xn1 prefixlen 64 scopeid 0x6
	inet 172.25.228.18 netmask 0xffffff00 broadcast 172.25.228.255
	inet 172.25.228.140 netmask 0xffffff00 broadcast 172.25.228.255 vhid 228
	inet 172.25.228.65 netmask 0xffffff00 broadcast 172.25.228.255 vhid 228
	inet 172.25.228.66 netmask 0xffffff00 broadcast 172.25.228.255 vhid 228
	inet 172.25.228.67 netmask 0xffffff00 broadcast 172.25.228.255 vhid 228
	inet 172.25.228.17 netmask 0xffffff00 broadcast 172.25.228.255 vhid 228
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
	media: Ethernet manual
	status: active
	carp: MASTER vhid 228 advbase 1 advskew 0

PING 172.25.228.1 (172.25.228.1) from 172.25.228.17: 56 data bytes
64 bytes from 172.25.228.1: icmp_seq=0 ttl=64 time=1.280 ms
64 bytes from 172.25.228.1: icmp_seq=1 ttl=64 time=0.268 ms
64 bytes from 172.25.228.1: icmp_seq=2 ttl=64 time=0.244 ms

--- 172.25.228.1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.244/0.597/1.280/0.483 ms

Shell Output - arp -n 172.25.228.17
172.25.228.17 (172.25.228.17) -- no entry
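
The `carp:` line in the ifconfig output above carries the state and the advertisement timers. As a sketch only, the following Python pulls those fields out of saved ifconfig text and computes the advertisement interval as advbase + advskew/256 seconds, which is how carp(4) describes the two values. The names `CarpStatus` and `parse_carp_status` are illustrative, and the regular expression assumes the exact line layout shown above.

```python
import re
from typing import List, NamedTuple

# Matches lines such as: "carp: MASTER vhid 228 advbase 1 advskew 0"
CARP_LINE = re.compile(
    r"carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)"
    r"\s+advbase\s+(?P<advbase>\d+)\s+advskew\s+(?P<advskew>\d+)"
)


class CarpStatus(NamedTuple):
    state: str
    vhid: int
    advbase: int
    advskew: int

    def advert_interval(self) -> float:
        """Seconds between advertisements: advbase plus advskew/256."""
        return self.advbase + self.advskew / 256


def parse_carp_status(ifconfig_output: str) -> List[CarpStatus]:
    """Extract every carp status line found in ifconfig output."""
    return [
        CarpStatus(m["state"], int(m["vhid"]), int(m["advbase"]), int(m["advskew"]))
        for m in CARP_LINE.finditer(ifconfig_output)
    ]
```

Comparing the state reported here over time (MASTER throughout, or flapping) is one way to check whether the firewall itself ever thinks it lost the VIP.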

steve40

Hello,

Thanks again for responding. I'm uploading a tcpdump taken off the outside interface. There are two packet captures: the first was taken while it's working and the second when it stops. I've also included my config.xml, loader.conf, the output of sysctl, ifconfig and pciconf, and attached the log file generated around the same time.

Thanks.

0_1532554827846_debug.tar
0_1532554834563_logs.txt

steve40

Oh, and sorry for not specifying. The packet captures and other files are in the debug.tar file and the log is in the other file.

thanks again

Derelict

Please download and post actual pcaps so Wireshark can do the heavy lifting.

Thanks.

Also please describe exactly what is "working" and what isn't. Like what traffic to actually look at.

steve40

I'm posting a capture taken on pfSense while it's working, meaning this is the gateway I am literally connecting to this forum through right now. I've got a continuous ping running to 8.8.8.8 as well. In about 5 minutes I'll be posting the capture taken when it stops working.

steve40

0_1532557838891_while-working.pcap

Derelict

You might want to set the pcaps to more than 100 frames.

That showed one request and one reply and no CARP.

The best thing to have in your case is probably a transition from working to not working. 10000 frames, 100000 frames. Whatever it takes.
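
A quick way to check whether a capture is long enough to be useful is to count what is in it before posting. The sketch below is a stdlib-only Python reader for classic pcap files (not pcapng) captured on Ethernet; it tallies ARP, CARP (IP protocol 112, the same number VRRP uses) and ICMP frames. The function names are illustrative and it is no substitute for Wireshark.

```python
import struct
from collections import Counter

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_VLAN = 0x8100
IPPROTO_ICMP = 1
IPPROTO_CARP = 112  # protocol number shared with VRRP


def classify_frame(frame: bytes) -> str:
    """Label one Ethernet frame as arp, carp, icmp, other or short."""
    if len(frame) < 14:
        return "short"
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset = 14
    if ethertype == ETHERTYPE_VLAN and len(frame) >= 18:
        # Skip a single 802.1Q tag to reach the real EtherType.
        ethertype = struct.unpack("!H", frame[16:18])[0]
        offset = 18
    if ethertype == ETHERTYPE_ARP:
        return "arp"
    if ethertype == ETHERTYPE_IPV4 and len(frame) >= offset + 10:
        protocol = frame[offset + 9]  # protocol field of the IPv4 header
        if protocol == IPPROTO_CARP:
            return "carp"
        if protocol == IPPROTO_ICMP:
            return "icmp"
    return "other"


def summarize_pcap(data: bytes) -> Counter:
    """Count frame types in the raw bytes of a classic pcap file."""
    magic = data[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    linktype = struct.unpack(endian + "I", data[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")
    counts: Counter = Counter()
    pos = 24  # first record starts after the 24-byte global header
    while pos + 16 <= len(data):
        _sec, _frac, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[pos:pos + 16]
        )
        pos += 16
        counts[classify_frame(data[pos:pos + incl_len])] += 1
        pos += incl_len
    return counts
```

Usage would be along the lines of `summarize_pcap(open(path, "rb").read())`, where `path` is whatever capture file was downloaded. A capture taken on a CARP interface that reports zero `carp` frames is probably too short or filtered too tightly.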

steve40

0_1532558535343_when-its-broken.pcap

steve40

0_1532558669575_whenitworks.pcap

Derelict

Is there a package you haven't installed?

It also appears you are playing fast and loose with what is and isn't RFC1918.
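
For reference, RFC 1918 reserves exactly three private IPv4 ranges: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. A small Python sketch for checking an address against them (the helper name `is_rfc1918` is made up; the standard library's `is_private` attribute is deliberately not used because it covers more than these three ranges):

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
)


def is_rfc1918(address: str) -> bool:
    """Return True if the IPv4 address falls inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in network for network in RFC1918_NETWORKS)
```

Note that 172.16.0.0/12 ends at 172.31.255.255, so an address such as 172.32.0.1 is public space even though it looks similar.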

Derelict

I see nothing in those captures to indicate a problem.

"when-its-broken" When exactly WHAT is broken?

steve40

Can no longer connect to the internet, and the continuous ping comes to an immediate halt.

Derelict

Cannot connect to the internet from where and continuous pings to what? What does cannot connect mean? DNS resolution? HTTP? HTTPS? What? To where? From where?

Sorry, but you are going to have to be far more specific. From what you have posted it looks like there is no problem on the WAN.

I am fairly sure this is an issue in your virtual environment/switching that will not be solved by changing anything in the firewall settings.

https://www.netgate.com/docs/pfsense/routing/connectivity-troubleshooting.html

https://www.netgate.com/docs/pfsense/highavailability/troubleshooting-high-availability-clusters.html

Derelict

OK so it looks like there is no response to traffic when it is "broken" and the source is the CARP VIP. But the traffic is going out and there is no reply. You will need to investigate upstream and see why that is.

I don't see any ARP requests for the CARP VIP, and certainly none that are going unanswered.

So, it still points to something upstream probably in your layer 2.