Public WAN VIP failing after 20 minutes

BCSE

My first CARP setup seems to be working fine on all LAN VIPs. The only problem is that the public WAN VIP only works for 20 minutes each time I hit the save button on the WAN VIP. The actual WAN IPs on both boxes keep working.

The CARP status page shows all VIPs are up. All traffic on the WAN VIP is dead after 20 minutes. If I edit and save the VIP, it's up again for 20 minutes.

I hope someone can help me.

jimp

20 minutes is a common ARP table timeout. Could be an IP conflict or something upstream that isn't correctly picking up the CARP VIP's MAC from ARP.

BCSE

So this could be something on the ISP side?

At the moment we have 5 LAN subnets. All of them have VIPs and seem to work fine. All of the WAN and LAN traffic passes through an HP-1810 switch, mostly on VLANs. I can see the VIP MAC addresses on the switch. As far as I can see, nothing conflicts.

The ISP side is a /29 subnet. We're currently using 2 IPs on the actual WAN interfaces and one for the VIP.

Hardware = PC Engines APU1D4

I removed the WAN VIP and configured it again with one of the other IP addresses in the /29 range but that didn't help.

cmb

An IP or MAC conflict (or other issue) is the likely cause. Save and apply sends a gratuitous ARP, which will clear up some problems along those lines for a period of time.

Since it happens on multiple IPs, assuming you don't have those IPs assigned elsewhere, try changing the VHID so the virtual MAC changes.

BCSE

I already changed the VHID (from 1 to 10) when I reconfigured the WAN VIP and changed the IP address. Sorry for not mentioning that. It was still failing after 20 minutes. I tried changing the VHID again to 99 and hit the save button. Again it failed after 20 minutes.

So I removed the VIP again. I changed the WAN IP of the 2nd FW to the IP address I had used for the WAN VIP. That one is still working.
I recreated the VIP with a random unused VHID (202) and the IP address that I had been using for the 2nd FW's WAN IP. Again it fails after 20 minutes.

Old config
FW1 WAN IP xx.xx.xx.82/29 <- Working
FW2 WAN IP xx.xx.xx.83/29 <- Working
VIP WAN IP xx.xx.xx.84/29 also tried 85 & 86 <- All failing after 20 min.

Current config
FW1 WAN IP xx.xx.xx.82/29 <- Working
FW2 WAN IP xx.xx.xx.84/29 <- Working
VIP WAN IP xx.xx.xx.83/29 <- Failing after 20 min.

IP addresses are only assigned to the FWs. I have tried the complete /29 range by now. Switching between several VHIDs changed the MAC address every time. Still not working… :(
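
For reference, CARP on FreeBSD (and so pfSense) derives the IPv4 virtual MAC from the VHID as 00:00:5e:00:01 followed by the VHID in hex, which is why every VHID change above produced a new MAC. A minimal Python sketch of that mapping; the function name `carp_virtual_mac` is made up for illustration:

```python
def carp_virtual_mac(vhid: int) -> str:
    """Return the IPv4 CARP virtual MAC for a VHID: 00:00:5e:00:01:<vhid in hex>."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be in the range 1-255")
    return f"00:00:5e:00:01:{vhid:02x}"


if __name__ == "__main__":
    # The VHIDs tried in this thread.
    for vhid in (1, 10, 99, 202):
        print(f"VHID {vhid:>3} -> {carp_virtual_mac(vhid)}")
```

Two CARP groups with the same VHID on one broadcast domain would share a MAC, which is the kind of conflict that changing the VHID rules out.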

cmb

Take a packet capture filtered on the affected IP. What happens?

BCSE

Wireshark

1 0.000000 PcEngine_XX:XX:X8 Broadcast ARP 42 Gratuitous ARP for XX.XX.XX.83 (Request)
2 0.001040 PcEngine_XX:XX:Xc Broadcast ARP 60 Gratuitous ARP for XX.XX.XX.83 (Request)

I removed all the ping requests and replies here.

2445 1198.885005 CiscoSpv_XX:XX:Xb Broadcast ARP 60 Who has XX.XX.XX.83? Tell XX.XX.XX.81
2446 1198.885029 PcEngine_XX:XX:X8 CiscoSpv_XX:XX:Xb ARP 42 XX.XX.XX.83 is at 00:00:XX:XX:XX:Xb

The same capture from the pfSense packet capture field.

09:18:47.564989 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.83, length 28
09:18:47.566029 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.83, length 46

09:38:46.449994 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
09:38:46.450018 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28

cmb

I mean capture when it's not working. Sounds like it was working fine at that point?

BCSE

At the time of the capture it was working for 20 minutes as you can see. At 09:38:46 the interface is down.

Capture from this morning. I didn't edit and save the VIP, so the interface was still down.

07:19:19.562103 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
07:19:19.562128 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28
07:39:19.688892 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
07:39:19.688914 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28
07:59:19.398768 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
07:59:19.398793 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28
08:19:19.255317 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
08:19:19.255341 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28

cmb

That's much more telling. You're not getting anything coming in on that IP. And 20 minutes is definitely the upstream ARP cache timeout. You're replying to the ARP requests correctly. That confirms the problem resides where I said it did previously, with an IP or MAC conflict. Having changed the VHID, it's probably not the MAC. My best guess is something else is replying to that ARP request as well, which you won't see from that perspective. If you have access to the next hop router, check its ARP cache when it's not working. If you don't, have your ISP check it and tell you what MAC they're showing.
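
The 20-minute pattern can be read straight off the timestamps in the capture above. A small Python sketch, assuming the tcpdump-style line format shown in this thread (the names `CAPTURE` and `request_intervals` are made up for illustration), that measures the gap between successive ARP requests from the upstream device:

```python
import re
from datetime import datetime

# The ARP lines from the "not working" capture, copied from the thread.
CAPTURE = """\
07:19:19.562103 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
07:19:19.562128 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28
07:39:19.688892 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
07:39:19.688914 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28
07:59:19.398768 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
07:59:19.398793 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28
08:19:19.255317 ARP, Request who-has XX.XX.XX.83 tell XX.XX.XX.81, length 46
08:19:19.255341 ARP, Reply XX.XX.XX.83 is-at 00:00:XX:XX:XX:Xb, length 28
"""

REQUEST_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) ARP, Request who-has ")


def request_intervals(capture: str) -> list[float]:
    """Return the seconds between consecutive ARP requests.

    Timestamps carry no date, so this assumes the capture does not cross midnight.
    """
    times = []
    for line in capture.splitlines():
        match = REQUEST_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for seconds in request_intervals(CAPTURE):
        print(f"{seconds:.3f} s ({seconds / 60:.2f} min)")
```

Each gap comes out within a second of 1200 s, which fits an upstream ARP cache entry expiring and being refreshed every 20 minutes.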

BCSE

Thanks for pointing us in the right direction. We tested the WAN interfaces a bit more on the WAN side by placing a machine on that side. The VIP seems to work fine.

So it looks like the ISP's modem is not working properly. We Googled some more and found another topic with the same problem, even with the same ISP (UPC Netherlands).

https://forum.pfsense.org/index.php?topic=66838.0

'Problem almost resolved'. Testing with the script from that topic…

cmb

Ah that's fun. Your modem is broken, that behavior is in violation of RFC 826.

BCSE

I added an IP Alias for testing. That one keeps working. I don't understand why an IP Alias keeps working and a CARP IP doesn't.

cmb

Because of the difference in the way ARP is answered between them. It's perfectly valid both ways, but with broken CPE the CARP way can be problematic.
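
One visible difference is in the Wireshark lines earlier in the thread: the CARP reply leaves from the physical NIC (the PcEngine_ source) while the ARP payload advertises the 00:00:… virtual MAC. An IP Alias reply uses the NIC MAC in both places. The Python sketch below builds both kinds of reply and compares the two fields. All addresses are placeholders, and the idea that the modem trusts the Ethernet source instead of the ARP payload is only an assumption, not something the thread confirms.

```python
import socket
import struct

# Placeholder addresses for illustration only (not taken from the capture).
NIC_MAC = "00:0d:b9:00:00:08"      # physical WAN NIC of the firewall
VIRTUAL_MAC = "00:00:5e:00:01:ca"  # CARP virtual MAC for VHID 202
MODEM_MAC = "02:00:00:00:00:0b"    # upstream device
VIP = "203.0.113.83"
GATEWAY = "203.0.113.81"


def mac_to_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def bytes_to_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def build_arp_reply(eth_src: str, eth_dst: str, sender_mac: str,
                    sender_ip: str, target_ip: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying an ARP reply (RFC 826 layout)."""
    ethernet = mac_to_bytes(eth_dst) + mac_to_bytes(eth_src) + struct.pack("!H", 0x0806)
    # htype=Ethernet, ptype=IPv4, hlen=6, plen=4, oper=2 (reply)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += mac_to_bytes(sender_mac) + socket.inet_aton(sender_ip)
    arp += mac_to_bytes(eth_dst) + socket.inet_aton(target_ip)
    return ethernet + arp


def sender_addresses(frame: bytes) -> tuple[str, str]:
    """Return (Ethernet source MAC, ARP sender hardware address)."""
    return bytes_to_mac(frame[6:12]), bytes_to_mac(frame[22:28])


if __name__ == "__main__":
    carp = build_arp_reply(NIC_MAC, MODEM_MAC, VIRTUAL_MAC, VIP, GATEWAY)
    alias = build_arp_reply(NIC_MAC, MODEM_MAC, NIC_MAC, VIP, GATEWAY)
    for name, frame in (("CARP", carp), ("IP Alias", alias)):
        eth_src, arp_sha = sender_addresses(frame)
        print(f"{name}: eth src {eth_src}, ARP sender {arp_sha}, match={eth_src == arp_sha}")
```

A receiver following RFC 826 takes the mapping from the ARP payload, so both forms resolve correctly. A device that learns from the frame header instead would only handle the IP Alias form.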

cr_hyland

We also have exactly this issue with UPC Ireland.

No resolution yet, no matter what we've tried.