2.4.2 in HA mode: NBNS storm kills WAN
-
We are running 2.4.2 in HA mode.
We have 6 NICs with different subnets, etc. One subnet is using 172.22.22.x.
If we by mistake enter 172.222.22.xxx or 172.2.22.xx (or presumably any other typo in the second octet, though we haven't tested others), it creates a packet storm that makes our ISP shut us down.
They automatically kill the port when storm traffic hits 25% of the interface's total bandwidth, and it's a 1 Gb port. So they kill our WAN connections, but our connections go through a switch, so the links stay up on our side.
After we shut down the server with the mistyped address, the firewalls continued the storm. I know this because I did a packet capture.
I captured for probably 10-15 minutes after the mistake was corrected and the server was shut down, and the packet storm was still there. I had to disable the WAN interface on the primary firewall for it to stop.
So a simple typo can shut down our circuits.
We really need to figure out how to prevent pfSense from doing this, and we are desperate for suggestions.
If you have any, please let us know.
[packetcapture (27).cap.txt](/public/imported_attachments/1/packetcapture (27).cap.txt) -
You: "Doctor, it hurts when I do this."
Doctor: "Don't do that."
You must be bridging or something. That traffic is not routed.
That has all the markings of a layer 2 loop. HA is incompatible with bridging for reasons such as this. Spanning tree is your friend.
> We really need to figure out how to prevent pfSense from doing this
Doesn't look like pfSense, but your design instead.
-
It's also possible that you are policy routing some traffic that shouldn't be policy routed (broadcast traffic, for example).
We used to see this on older versions with APIPA traffic that would get stuck in a loop when policy routed (see https://redmine.pfsense.org/issues/2073 for example); something similar could be happening in this case.
-
Thank you so much for answering.
I am pretty sure there is no policy route in place, but I did not set up the switches.
The person who did said he double-checked, but we will go over it again. I am not sure if spanning tree is enabled on the HP 5406 switches; I think it might be off by default.
Again, thank you so much for taking the time to respond.
H.
-
So we broke everything down.
We have 2 separate dumb switches for the WAN interfaces.
They are not connected to each other. The sync cable is direct.
We connected each LAN interface into a dumb hub and we ONLY connected a single laptop.
On the laptop we change the IP from 172.22.22.xx to 172.222.22.xx, and it creates a storm within seconds. It still happens.
If I turn off the backup firewall the problem goes away.
If I disconnect the LAN connection on the backup firewall the problem goes away.
So the only loop here is having both LAN interfaces connected to the same switch, but that's how it's designed. So it's not a loop in our switches.
I can't find a policy route that could do this.
I did follow the steps to set this up using the hangout presentation on 2.4 HA.
George (pfSense) did not see anything that looked funky when he looked, but he was not looking for this problem. I have included screenshots of the LAN rules and the outbound NAT mode screen.
It shows we have several LANs, but the only one connected during testing is NAT_LAN. We also have a VPN connection going to another location, but that IP is 192.168.30.xx. So at this point I am not sure where to go.
What kind of info do I need to provide to get some help?
![nat lan rules.JPG](/public/imported_attachments/1/nat lan rules.JPG)
![Outbound nat mode.JPG](/public/imported_attachments/1/Outbound nat mode.JPG) -
OK not a lot of that is making any sense.
Any bridged interfaces on the pfSense nodes?
If this is causing something on WAN because you are allowing traffic from 172.222.22.x out WAN when only 172.22.22.x should be allowed, a simple workaround would be to only pass traffic into NAT_LAN that is sourced from the NAT_LAN network. In other words, change the NAT_LAN rules from passing traffic sourced from any to traffic sourced from NAT_LAN network, as in the sketch below.
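To illustrate the logic, here's a minimal sketch with Python's standard ipaddress module, assuming NAT_LAN is 172.22.22.0/23 (the subnet stated later in this thread); the sample source addresses are taken from the examples above:

```python
# Minimal sketch, assuming NAT_LAN is 172.22.22.0/23 (stated later in
# this thread). A rule whose source is "NAT_LAN network" instead of
# "any" never matches the mistyped addresses, so they never get passed.
import ipaddress

nat_lan = ipaddress.ip_network("172.22.22.0/23")

for src in ("172.22.22.92", "172.222.22.92", "172.2.22.92"):
    if ipaddress.ip_address(src) in nat_lan:
        print(f"{src}: matches NAT_LAN network -> passed")
    else:
        print(f"{src}: outside NAT_LAN network -> not passed")
```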
You might want to draw a physical diagram of what is connected to what, its purpose, addressing, etc.
There is still something not right. Your nodes should not freak out like that. Still looks like a loop to me.
-
Nothing bridged on the network.
Attached is the test network layout.
It's pretty simple. I have the RFC1918 rule in the NAT_LAN rule set, and I even noticed I used the words "Bypass policy route."
I can't recall why I included that one, and I can't make sense of it.
You can see it in the previous post's image. So that could be it, but I am not in a position to test it remotely.
![test layout with ip.jpg](/public/imported_attachments/1/test layout with ip.jpg) -
That looks fine.
You'll need to either stop accepting that traffic into the LAN ports or trace it and figure out what is reflecting it. Before accepting a workaround I would try to isolate the actual cause if for no other reason than understanding what the issue actually is.
One thing to note: with the pass-source-any rule on NAT_LAN, the router has no idea that 172.222.22.255 is a broadcast address because it has no netmask to reference. A .255 last octet is a valid host on, say, a /22. If the "broadcast" arrives addressed to the NAT_LAN interface, it should be forwarded.
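To see that concretely, a minimal sketch with Python's standard ipaddress module (the /22 below is just an example network containing that address):

```python
# Minimal sketch: whether x.x.x.255 is a broadcast address depends
# entirely on the netmask, which the firewall cannot infer from a
# source-any rule. The /22 is an example network containing the address.
import ipaddress

addr = ipaddress.ip_address("172.222.22.255")
net24 = ipaddress.ip_network("172.222.22.0/24")
net22 = ipaddress.ip_network("172.222.20.0/22")  # spans .20.0 through .23.255

print(addr == net24.broadcast_address)  # True: the broadcast of the /24
print(addr == net22.broadcast_address)  # False: an ordinary host in the /22
```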
Are the NAT_LAN interfaces in promiscuous mode or something for some reason?
Have you done any pcaps?
I'd pcap on LAN - you only need enough to see what is looping/reflecting. I'd set the packet counter to 1000 or something.
Then I'd pcap on WAN, same thing.
You'll probably need to look at the MAC addresses, etc.
It would be best to capture the same test on both interfaces at the same time, but without managed switches and mirror ports you'll have to start at least one of the captures manually in the shell.
You can start one and run `ps axww | grep tcpdump` to see the commands that are run for each interface.
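If you'd rather script the captures than use the GUI, here's a minimal sketch using scapy (a third-party Python library); the interface name and the "udp port 137" NBNS filter are assumptions for illustration, so adjust them to your NICs and your test:

```python
# Minimal capture sketch using scapy (pip install scapy). The interface
# name and the "udp port 137" (NBNS) filter are assumptions.
from scapy.all import sniff, wrpcap

def capture(iface, outfile, count=1000):
    """Grab up to `count` NBNS frames on `iface` and save them to a pcap."""
    pkts = sniff(iface=iface, filter="udp port 137", count=count)
    wrpcap(outfile, pkts)

# Run once per interface; for simultaneous LAN/WAN captures, start one
# instance per interface in separate shells.
capture("igb1", "lan_capture.pcap")
```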
-
So here is what I did.
Tested this and captured on LAN interfaces, then on WAN.
Those are attached; just remove .txt if you want to view them. I turned off the policy route, but that made no difference for this problem.
Then I changed the LAN rule to only accept traffic from our 172.22.22.0/23 network.
That seems to have fixed it. I still have the policy routes disabled, but I'm not sure I really need them.
I thought I got that from the HA setup, but I can't recall where or why. So I think this is a temporary fix, but I'm not sure of the consequences of disabling the rule yet.
[capture_ fw1_WAN_failure.cap.txt](/public/imported_attachments/1/capture_ fw1_WAN_failure.cap.txt)
capture_fw2_WAN_failure.cap.txt
capture_fw1_novpn_no_pr_failure.cap.txt
capture_fw2_novpn_nopr_failure.cap.txt -
These MAC addresses are reflecting the bad Src: 172.222.22.92, Dst: 172.222.23.255 traffic that gets irresponsibly put out on WAN back and forth. Fix that and you fix your problem.
00:a0:d1:ea:eb:f4
00:26:6c:f1:ff:d0
Where is this? OVH or something?
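For anyone following along, a minimal sketch with scapy to pull the frames sourced from those MACs out of a capture (the pcap filename is a placeholder for whichever capture you are inspecting):

```python
# Minimal sketch: print every frame in a capture sourced from the
# suspect MACs. Uses scapy; the pcap filename is a placeholder.
from scapy.all import rdpcap, Ether

SUSPECT_MACS = {"00:a0:d1:ea:eb:f4", "00:26:6c:f1:ff:d0"}

for pkt in rdpcap("wan_capture.pcap"):
    if Ether in pkt and pkt[Ether].src in SUSPECT_MACS:
        print(pkt.summary())  # one-line summary: addresses, proto, ports
```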
-
That's the VM that we use for this testing.
It's a Windows 2012 server that we just changed the IP on to create this situation.
It is sitting on the Hyper-V cluster. The problem is that if we accidentally type 172.222.x.x instead of 172.22.x.x, it creates the storm.
We discovered it when one of the guys made exactly that mistake. It does not happen if we turn off the backup firewall or disable the LAN side of the backup firewall.
The way we make the storm stop is to correct the IP and disable the WAN NIC on the primary firewall. I changed the rules for the LAN sections to only allow 172.22.22.0/23, the subnet we are using.
I also removed the policy bypass rule that you can see in the image attached earlier. Not sure why it's there or if it's needed. -
Why is that MAC address in a pcap on WAN, then? Seems you have some sorting out to do there. Inside MAC addresses should never be on the WAN layer 2.
-
> Why is that MAC address in a pcap on WAN, then? Seems you have some sorting out to do there. Inside MAC addresses should never be on the WAN layer 2.
That's what all of this has been about.
Trying to sort that out. -
Hint: It's not pfSense.
-
> Hint: It's not pfSense.
Please explain how it can't be pfSense.
Look at my network drawing:
2 dumb switches, not connected to each other
2 pfSense boxes
1 dumb hub
The switches do not connect to each other.
The pfSense boxes are connected to the switches via 2 WAN interfaces.
The pfSense boxes are connected to the dumb hub with 2 LAN interfaces. The only other device involved is a laptop connected to the hub.
Since the switches/hub have NO connection between them, how can it not be pfSense?
The only thing that connects all the devices together is the pfSense boxes.
The primary pfSense is connected to switch 1 via WAN.
The primary pfSense is connected to switch 2 via WAN.
The primary pfSense is connected to the hub via LAN.
The backup pfSense is connected to switch 1 via WAN.
The backup pfSense is connected to switch 2 via WAN.
The backup pfSense is connected to the hub via LAN. -
pfSense will not put an inside MAC address on the outside without a bridge interface. Period. Check your layer 2.
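One way to check that from the captures already posted is to intersect the source MACs seen on each side; a minimal sketch with scapy, with placeholder filenames for the attached captures:

```python
# Minimal sketch: intersect the source MACs seen on LAN and WAN.
# A non-empty overlap means inside frames are reaching the WAN segment.
# Filenames are placeholders for the captures attached earlier.
from scapy.all import rdpcap, Ether

def src_macs(pcap_path):
    return {p[Ether].src for p in rdpcap(pcap_path) if Ether in p}

overlap = src_macs("lan_capture.pcap") & src_macs("wan_capture.pcap")
print("MACs seen on both sides:", overlap or "none")
```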
-
There is no physical connection between any of the 3 switches.
Two of them are dumb Netgear switches I purchased for this testing.
The hub is something very old and retired. The only physical connection between them is via pfSense, so I am at a loss when you say it's not pfSense.
Thank you for all your help, I really appreciate your input even if I am a bit confused.
-
All I can say is check again. It is pretty much impossible to have an inside MAC address on a WAN pcap without some sort of layer 2 connectivity between inside and outside.