VIP addresses stop working

TitanSystems

We have an odd issue where our VIP IP aliases stop passing traffic. We have them set up for many tasks, like OpenVPN, web servers, and the like. After an unpredictable length of time, they simply stop responding. To get them working again, all we have to do is change the subnet to ANYTHING else, save, and apply. Boom, they start working again. They had been working fine since a fresh install on Jan 1, with no changes to the system, and they started failing about a week ago. The WAN primary IP always works; otherwise I would assume failing hardware. I have been using pfSense for a very long time, but this is the first time I am seeing this issue.
We have ordered replacement hardware just in case, but this just does not seem like hardware failure.

We are getting this in our dpinger system log:
send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 184.174.xxx.xx bind_addr 184.174.xxx.xxx identifier "WANGW "

Thank you for your help.

Derelict

That is a perfectly normal log entry that is logged when dpinger starts or restarts.

Going to need more information. Are the IP Alias VIPs still on the interface if you run ifconfig -a when it is broken?

What happens when you try to ping sourced from that VIP?

ping -S VIP_IP_ADDRESS GATEWAY_ADDRESS

ping -S VIP_IP_ADDRESS 8.8.8.8

TitanSystems

To answer, here is ifconfig while broken.

bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
ether 00:e0:66:fd:6c:c3
hwaddr 00:e0:66:fd:6c:c3
inet6 fe80::2e0:66ff:fefd:6cc3%bge0 prefixlen 64 scopeid 0x1
inet 184.xx.xx.83 netmask 0xffffff00 broadcast 184.xx.xx.255
inet 198.xx.xx.51 netmask 0xffffff00 broadcast 198.xx.xx.255
inet 184.xx.xx.118 netmask 0xffffff00 broadcast 184.xx.xx.255
inet 184.xx.xx.84 netmask 0xffffff00 broadcast 184.xx.xx.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active

first ping:
--- 184.xx.xx.1 ping statistics ---
16 packets transmitted, 0 packets received, 100.0% packet loss

Second Ping:
--- 8.8.8.8 ping statistics ---
12 packets transmitted, 0 packets received, 100.0% packet loss

As soon as I change the subnet to anything else and apply

PING 184.xx.xx.1 (184.xx.xx.1) from 184.xx.xx.84: 56 data bytes
64 bytes from 184.174.168.1: icmp_seq=0 ttl=64 time=1.634 ms
64 bytes from 184.174.168.1: icmp_seq=1 ttl=64 time=1.124 ms
64 bytes from 184.174.168.1: icmp_seq=2 ttl=64 time=0.962 ms
64 bytes from 184.174.168.1: icmp_seq=3 ttl=64 time=1.856 ms
64 bytes from 184.174.168.1: icmp_seq=4 ttl=64 time=1.716 ms
64 bytes from 184.174.168.1: icmp_seq=5 ttl=64 time=2.012 ms

Derelict

You'll have to pcap and see what's going on on the WAN.

Which one of those is the interface subnet and which are the VIPs?

Is the 198 address routed to 184.xx.xx.83 or is it a "secondary" interface subnet?

When it is broken is your WAN ARPing for 184.xx.xx.1 or is it sending the echo requests to a MAC address from its ARP table? Is there a response? Does the upstream ARP for 184.xx.xx.84? Is there a response? Is it honored or is it ignored?

Seems like upstream ARP is screwed up to me.

TitanSystems

Afraid I am getting a bit over my head, but will try to keep up.

Interface subnet is 184.xx.xx.1/24
Secondary, still on the same WAN interface, is 198.xx.xx.1/24

I wish they had given me statics all in the same block. I have for testing, however, removed the 198 subnet including the gateway, but still end up with the same issue.

I have created a gateway for both, and they always show up.

Your next questions are where I am getting a bit lost on how to answer.

Would the upstream ARP be my internet provider (EPB Telecom)? Do I need to call them and ask a specific question?

Derelict

Diagnostics > Packet Capture

What we do here sort of depends on how much traffic is on that VIP.

When it is broken, start this capture:

Interface: WAN
Address Family: IPv4
Protocol: Any
Host Address: 184.xx.xx.84
Count: 1000000
Start the capture

Then run the above ping tests with -S 184.xx.xx.84. Let it fail for a bit. Then do whatever you do to fix it, note the time you did this, then run another ping test using -S 184.xx.xx.84.

Then go back to Diagnostics > Packet Capture, stop it and download it. If you've never used it before, this is a good time to download Wireshark and open the capture file with it. It'll do a lot of the interpretation of the protocols for you.
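
For a quick look at just the ARP traffic without Wireshark, here is a minimal Python sketch. It assumes the download is a classic libpcap file captured on Ethernet (pcapng, nanosecond-timestamp, and VLAN-tagged captures are not handled). `iter_arp` and the file name in the closing comment are illustrative names, not anything pfSense provides.

```python
import socket
import struct


def _mac(raw: bytes) -> str:
    """Format six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def iter_arp(stream):
    """Yield one dict per ARP packet found in a classic libpcap stream."""
    header = stream.read(24)
    magic = header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")
    # The link type is the last 4 bytes of the global header; 1 means Ethernet.
    (linktype,) = struct.unpack(endian + "I", header[20:24])
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    while True:
        record = stream.read(16)
        if len(record) < 16:
            break  # end of file
        _sec, _usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        pkt = stream.read(incl_len)
        # Skip anything that is not a full Ethernet + ARP packet (EtherType 0x0806).
        if len(pkt) < 42 or pkt[12:14] != b"\x08\x06":
            continue
        (op,) = struct.unpack("!H", pkt[20:22])
        yield {
            "op": {1: "request", 2: "reply"}.get(op, str(op)),
            "frame_src": _mac(pkt[6:12]),      # source MAC of the Ethernet frame
            "sender_mac": _mac(pkt[22:28]),    # MAC claimed inside the ARP payload
            "sender_ip": socket.inet_ntoa(pkt[28:32]),
            "target_ip": socket.inet_ntoa(pkt[38:42]),
        }


# Example use (the file name is hypothetical):
#   with open("packetcapture.cap", "rb") as f:
#       for arp in iter_arp(f):
#           print(arp)
```

Who is asking for 184.xx.xx.84, who answers, and which MAC addresses appear in each position are the things to look for in the output.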

I'll send you a chat with a drop box link to upload the capture file to.

Do you really have the whole /24 or just some addresses on each of them?

This is EPB in Tennessee?

TitanSystems

Will do. I do not have the entire /24, but that is how they hand them out.

Yes, EPB Chattanooga TN. We have a 1 gig fiber connection, hence the need for netgate / pfsense.

I have just called their helpdesk, after your prompting about upstream ARP, and apparently last Wednesday they moved us to some new equipment early o'clock in the morning. I started seeing the issues on Thursday. At the moment they have escalated the issue, but the higher-level support will not be in until 9am Eastern. Since it is almost 8pm here, I will probably let it go til tomorrow and let you know what they say. If it is on their end, I will ask as much as possible (including what type of equipment is on their end) so I can add it to the forum in case someone else has a similar issue.

Thanks for your help thus far. I have been using pfSense since '07, when I shifted from using m0n0wall and Tomato, and have made donations along the way. I have had the various bugs and such, as were expected, but never needed to ask for help. Glad it may not be pfSense!

TitanSystems

OK, so the problem is solved, sorta.

The issue is that pfSense (and Cisco ASA) does not reply to their equipment's ARP host packets for the VIPs. They have seen it with every pfSense / OPNsense / m0n0wall based package. In order for it to work going forward, they had to direct all traffic sent to the VIPs to the main IP. I don't know if this is something the devs can fix, but I kinda doubt it.

Thank you again

Derelict

They should be routing the packets destined for the new subnet to an address on the existing subnet. That is the proper way to do that.

I believe the fix was applied to the correct side there. Good to know EPB came through.

magnus-maximus

I am working on the same issue with EPB and the CARP VIP ARP problem.

@Derelict, I saw a post that you moved to Chattanooga for EPB's 10Gbps fiber network, and your input is appreciated.

https://forum.netgate.com/topic/155028/so-i-moved-to-chattanooga-so-i-could-get-the-fastest-internet-in-america

We have been looking into this issue with EPB, as the firewall does respond to the VIP ARP request. Still, it seems like EPB's equipment reads the source MAC address from the frame carrying the ARP reply, which is the router's physical interface address, instead of using the virtual MAC and virtual IP inside the ARP reply.

https://redmine.pfsense.org/issues/9476

The ARP reply associates the virtual IP address with a virtual MAC address, and it leaves the local network toward the ISP to register that pairing with the ISP's equipment. However, the frame carrying the reply has the physical router's MAC address as its source, and it looks like the ISP switch discards the frame because of this. The ISP's upstream equipment never sees the ARP reply, even though it is present when leaving the client's network, before the ISP equipment (ONT, modem, etc.). So the physical source MAC on the VIP's ARP reply is why the ISP does not see the reply, which makes it look like the router is not generating one.

https://community.arubanetworks.com/community-home/digestviewer/viewthread?MID=14293

As previously stated, the most common solution is to use a single router as a gateway and then redundant firewalls behind the gateway with routed IPs. But you still have the issue of the single point of failure. Ideally, one will have two ISP connections with one to each firewall.

A workaround that temporarily registers a VIP, until the MAC/IP registration time-out removes it again (4 to 18 hours), is to install arping and generate an ARP frame that carries the virtual IP and uses the virtual MAC address as the source MAC.

arping -A -i eth1 -s 00:00:5e:00:01:01 -S xxx.xxx.xxx.xxx 255.255.255.255
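
The `00:00:5e:00:01:01` in that command is the IPv4 virtual MAC for group ID 1: VRRP and CARP both use the fixed prefix `00:00:5e:00:01` followed by the VRID/VHID as the last byte. A small Python sketch to derive it for other groups (`carp_virtual_mac` is just an illustrative helper name):

```python
def carp_virtual_mac(vhid: int) -> str:
    """Return the IPv4 virtual MAC used by CARP/VRRP for a given VHID/VRID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    # Fixed prefix 00:00:5e:00:01, then the group ID as a single hex byte.
    return f"00:00:5e:00:01:{vhid:02x}"


print(carp_virtual_mac(1))
```

For VHID 1 this gives the same address used with `-s` above.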

I have read many of your posts, @Derelict, where you discuss how some vendors' equipment deviates from the implementation in pfSense CARP, causing much frustration and difficulty for many installations.

As you mentioned in another post, the VRRP specification in the RFC is that the source-MAC address is the physical address of the router.

https://forum.netgate.com/topic/134297/cox-and-the-carp-mac/18

"Note that the source address of the Ethernet frame of this ARP response is the physical MAC address of the physical router. "
https://datatracker.ietf.org/doc/html/rfc5798#page-29

But some vendors insist that the source-mac address for VRRP should be the virtual mac address.

"The virtual MAC address should source ethernet frames from a VRRP router, but not all vendors may implement it this way. "
https://kb.juniper.net/InfoCenter/index?page=content&id=KB7109

I believe this disagreement could have come from the VRRP RFC in 2004, as this early RFC for VRRP does not include the note regarding the source of the MAC address.
https://datatracker.ietf.org/doc/html/rfc3768#section-8.2

The question I have is, are there any known solutions for interoperability that do not require changing the behavior of pfSense, as doing so would be working backward to try to maintain a non-compliant VRRP implementation?

Derelict

@magnus-maximus I do not have business class service so I have not had to deal with EPB and CARP MAC addresses. (One DHCP address is all I have, which is incompatible with CARP/HA).

The ARP response to a CARP VIP is sourced from the interface MAC address and contains the CARP MAC in the IS AT response. This is so layer 3 comms go to the correct MAC address for the CARP VIP on the broadcast domain.

The CARP heartbeats source from the CARP MAC address. This is generally to instruct the layer 2 gear which port to use to contact that MAC address.
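
To make the two MAC addresses concrete, here is a rough Python sketch that builds such an ARP reply by hand and reads the fields back. It only illustrates the frame layout described above; it is not how pfSense generates the reply. All addresses are made-up examples (documentation IP range, locally administered MACs), and the function names are illustrative.

```python
import socket
import struct


def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to six raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_reply(frame_src_mac, sender_mac, sender_ip, target_mac, target_ip):
    """Build an Ethernet frame carrying an ARP reply ("sender_ip is at sender_mac")."""
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    eth = mac_bytes(target_mac) + mac_bytes(frame_src_mac) + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), MAC length 6, IP length 4, opcode 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += mac_bytes(sender_mac) + socket.inet_aton(sender_ip)
    arp += mac_bytes(target_mac) + socket.inet_aton(target_ip)
    return eth + arp


def describe(frame: bytes) -> dict:
    """Pull out the frame source MAC and the ARP sender fields."""
    fmt = lambda raw: ":".join(f"{b:02x}" for b in raw)
    return {
        "frame_src": fmt(frame[6:12]),           # what layer 2 gear sees as the sender
        "arp_sender_mac": fmt(frame[22:28]),     # the "is at" answer inside the payload
        "arp_sender_ip": socket.inet_ntoa(frame[28:32]),
    }


frame = build_arp_reply(
    frame_src_mac="02:00:00:00:00:01",   # hypothetical physical interface MAC
    sender_mac="00:00:5e:00:01:01",      # CARP virtual MAC for VHID 1
    sender_ip="192.0.2.84",              # hypothetical CARP VIP
    target_mac="02:00:00:00:00:fe",      # hypothetical upstream router MAC
    target_ip="192.0.2.1",
)
print(describe(frame))
```

Gear that learns the VIP's MAC from `frame_src` instead of `arp_sender_mac` ends up with the interface MAC rather than the CARP MAC.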

While CARP and VRRP are similar they are not the same and the VRRP RFCs are really meaningless here.

Usually the problems people have are with ISP gear that only permits one MAC address per port or some other "port security" type scheme.

I have not spent nearly as much time looking at VRRP as I have CARP.

Feel free to DM/chat the provisioning information from EPB for your circuit. Their business tiers are just too spendy for me here at the home office so I haven't seen one.

Derelict

@magnus-maximus said in VIP addresses stop working:

https://datatracker.ietf.org/doc/html/rfc3768#section-8.2

That seems to indicate what is included in the ARP IS AT response in the ARP protocol itself. It is silent about the source MAC address of the frame containing the ARP response.

8.2 pretty much describes what CARP does. The MAC address in the ARP response for a CARP VIP is always the virtual CARP MAC address.

What, exactly, is the ISP doing that is breaking things? Why are they not issuing another ARP request when they have traffic for an IP address after the ARP cache has expired?