IPv6 issues after reinstallation

junicast

Hi,

I'm running two dedicated Supermicro systems with Intel X710 dual 10G in CARP mode. For HA sync there is a dedicated igb 1G link.
Ever since I reinstalled and restored the config, there are IPv6 issues on the secondary device only. It just does not answer neighbor solicitations: no neighbor advertisement is sent in reply.

The system is running pfSense 2.6. The 10G ports are combined into lagg0 with LACP, and the upstream device is set to Active / Passive. On top of lagg0 there are plenty of VLAN interfaces.

This is what the lagg looks like:

lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 3c:fd:fe:e8:fb:78
	inet6 fe80::3efd:feff:fee8:fb78%lagg0 prefixlen 64 scopeid 0xb
	laggproto lacp lagghash l2,l3,l4
	laggport: ixl0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: ixl2 flags=c<ACTIVE,COLLECTING>
	groups: lagg
	media: Ethernet autoselect
	status: active
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

I do not see any dropped packets in filter.log related to these IPv6 connection issues. The CARP state is all right: the primary shows MASTER for all interfaces and the secondary shows BACKUP.

This is tcpdump on the secondary when I ping from primary to secondary:

12:26:31.401936 IP6 2a00:abc:0:113::a > ff02::1:ff00:b: ICMP6, neighbor solicitation, who has 2a00:abc:0:113::b, length 32

There is no corresponding neighbor advertisement.
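As a sanity check on the capture above, here is a minimal Python sketch of how the solicited-node multicast group is derived from the target address: the fixed prefix ff02::1:ff00:0/104 plus the low 24 bits of the unicast address. The function name is only illustrative. It shows that the solicitation is addressed to the group the secondary should be listening on for 2a00:abc:0:113::b.

```python
import ipaddress

def solicited_node_multicast(addr: str) -> str:
    """Return the solicited-node multicast group for an IPv6 unicast address."""
    ip = ipaddress.IPv6Address(addr)
    low24 = int(ip) & 0xFFFFFF  # low 24 bits of the unicast address
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))  # fixed /104 prefix
    return str(ipaddress.IPv6Address(base | low24))

# Target address from the tcpdump line above
print(solicited_node_multicast("2a00:abc:0:113::b"))
```

If the result matches the destination in the capture, the solicitation itself is well-formed and the missing answer has to be explained on the receiving side.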

What's somewhat weird is the CARP IPs; see this ifconfig output on the secondary:

lagg0.202: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	description: RXMGMTFW_SL3
	options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 3c:fd:fe:e8:fb:78
	inet6 fe80::3efd:feff:fee8:fb78%lagg0.202 prefixlen 64 scopeid 0x16
	inet6 2a00:abc:0:113::b prefixlen 64
	inet6 2a00:abc:0:113::1 prefixlen 64 vhid 18
	inet 192.168.14.252 netmask 0xffffff00 broadcast 192.168.14.255
	inet 192.168.14.1 netmask 0xffffff00 broadcast 192.168.14.255 vhid 17
	groups: vlan
	carp: BACKUP vhid 17 advbase 1 advskew 100
	carp: BACKUP vhid 18 advbase 1 advskew 100
	vlan: 202 vlanpcp: 0 parent interface: lagg0
	media: Ethernet autoselect
	status: active
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

It says BACKUP, yet when I ping the IP 2a00:abc:0:113::1 and look into ndp, it shows the MAC of the secondary:

[2.6.0-RELEASE][root@fw3-rx.relaix.net]/root: ndp -an |grep "2a00:abc:0:113::1"
2a00:abc:0:113::1                    3c:fd:fe:e8:fb:78 lagg0.202 permanent R
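For comparison, a small Python sketch of the MAC one might expect for a CARP address instead. It assumes CARP uses the VRRP-style virtual MAC 00:00:5e:00:01:<vhid>; that scheme is an assumption on my part and is not confirmed anywhere in this thread, and the function name is only illustrative.

```python
def carp_virtual_mac(vhid: int) -> str:
    """Build the assumed CARP virtual MAC 00:00:5e:00:01:<vhid>."""
    if not 1 <= vhid <= 255:
        raise ValueError("vhid must be in 1..255")
    # Assumption: the last octet is the vhid, as with VRRP virtual MACs
    return "00:00:5e:00:01:%02x" % vhid

# vhid 18 is the IPv6 CARP address on lagg0.202 in the ifconfig output above
print(carp_virtual_mac(18))
```

Comparing this value with what other hosts on the VLAN have in their neighbor tables for 2a00:abc:0:113::1 might show which node is actually answering for the CARP address.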

A reboot doesn't help. I also tried another reinstall, but the problem persists.
Sorry if I might have forgotten some details, please ask.

Junicast

Edit:
Sorry I had to change my posting because I chose an interface at first which does not suffer from the problem. So some interfaces actually work.

junicast

@junicast

I really put some effort into solving this, but first I need to explain how we got here. At first we had some issues with the primary firewall when we changed rules. This resulted in some kind of kernel hang. That in turn resulted in LACP messages not being sent and the upstream shutting down the links (LACP timeout set to fast). That in turn triggered CARP failover. Packets get lost in those cases, but then everything recovers.

The idea was to reinstall both firewalls in order to solve that issue but then things got worse, NDP problems appeared on the secondary as described in my first post.
Today I tried to reinstall the secondary again, but this time with an all-new config made from scratch on test devices.
The NDP issues still persist. On top of that, the interfaces ixl0 and ixl2 both seem to flap. They are the two lagg members.

To me it looks like there might be a problem with the combination of pfSense 2.6 (FreeBSD 12.3) and Intel X710, and maybe also with the fact that the upstream devices are Nokia routers. Maybe some kind of incompatibility. On the test devices I was using a simple 10G Zyxel switch and had none of those issues.

We have not had any issues on pfSense 2.5, which is FreeBSD 12.2 IIRC. There were actually some changes to the ixl driver. https://www.freebsd.org/releases/12.3R/relnotes/

This really seems to be an ugly one. My plan is to build up a new firewall on the hardware I just used for testing. Hook it up to the Nokias. Well yeah the upstream devices are actually VPLS capable Nokia Routers.

Maybe I will just switch to Mellanox ConnectX-4 and see if the problem vanishes. If any of you has an idea, I would be glad to hear it.

junicast

@junicast
To whom it may concern.

~~We just migrated to different hardware and the original problem with reloading firewall rules is now resolved big relief.~~
Actually it happened again. I suspect the Intel X710 cards are just bad and the update to pfSense 2.6 triggers this problem.

Jun 30 10:24:05 fw3-rx kernel: ixl0: Interface stopped DISTRIBUTING, possible flapping
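To get a feel for how often each lagg member flaps, a minimal Python sketch that counts these kernel messages per interface. The regular expression is derived from the single log line above; the function name and the sample list are hypothetical, and on a real system the lines would come from wherever the system log is stored.

```python
import re
from collections import Counter

# Pattern built from the one log line quoted above
FLAP_RE = re.compile(
    r"kernel: (\w+): Interface stopped DISTRIBUTING, possible flapping"
)

def count_flaps(lines):
    """Count 'possible flapping' kernel messages per interface name."""
    counts = Counter()
    for line in lines:
        match = FLAP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "Jun 30 10:24:05 fw3-rx kernel: ixl0: Interface stopped DISTRIBUTING, possible flapping",
]
print(dict(count_flaps(sample)))
```

Running this over a longer log period could show whether the flaps line up with rule reloads or happen on both members independently.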

The other problem persists. Neighbor discovery fails ~~and the reason is that the primary firewall uses its Global Unicast address in the source field instead of the Link Local address.~~ That was not the reason. We observed other occurrences of NDP using a GUA as source, and those worked.

At first I thought some NAT rules might be the reason for that, but after deactivating them the problem persists.

I checked that all interfaces have a Link Local address assigned so that also isn't the reason.
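For anyone who wants to repeat that check, a minimal Python sketch that derives the expected link-local address from an interface MAC. It assumes the address is built with the modified EUI-64 scheme, which is consistent with the ifconfig output earlier in this thread; the function name is only illustrative.

```python
import ipaddress

def eui64_link_local(mac: str) -> str:
    """Derive the fe80::/64 link-local address from a MAC (modified EUI-64)."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-octet MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe in the middle to form the 64-bit interface identifier
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = bytes.fromhex("fe80" + "00" * 6)  # fe80::/64
    return str(ipaddress.IPv6Address(prefix + iid))

# MAC of lagg0 from the ifconfig output in the first post
print(eui64_link_local("3c:fd:fe:e8:fb:78"))
```

The result can be compared with the inet6 fe80:: line that ifconfig shows for each VLAN interface.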

Does someone have an idea under what circumstances this might happen?

Edit:
We contacted Netgate about it. They think this might be an actual FreeBSD bug. They do not have a solution yet.