Strange traffic spike on all interfaces cripples boxes

jasonlitka

Not sure where to stick this as I've no idea what is causing it…

I recently set up two boxes (1.2.3 testing) to do failover on my two WAN connections and to run CARP between them on WAN1, WAN2, and LAN for increased uptime. 99% of the time things run fine. However, at what seem to be completely random times, I get huge traffic spikes on all of my interfaces on both devices, and the spikes cripple both systems. My normal traffic on WAN1 is 3-4 Mbit/s down and 1.5-2.0 Mbit/s up. When this happens I end up with about 35 Mbit/s in both directions on WAN1, WAN2, and LAN (but not the SYNC interface between the two systems). My guess is that it would be higher, but these systems seem to have a limit of around 230 Mbit/s for all interfaces combined.

If one is rebooted the issue goes away while it is offline and then typically reappears once it comes back up. The only way I've found to eliminate the issue is to reboot both boxes at once, kind of defeating the purpose of having two...

When it happens I end up with thousands of "kernel: arp: 192.168.1.x is on re0 but got reply from 00:0c:29:9f:xx:xx on re1" messages in my System Log and thousands of "Feb 25 16:45:47 WAN2 67.93.xxx.xxx:65180 239.255.255.250:1900 UDP" messages in my Firewall Log (both records censored slightly). It's worth mentioning that the IP showing up in the firewall log as having originated on WAN2 is actually assigned to WAN1.

Can anyone give me some ideas as to where to look to find out what is going wrong?

jasonlitka

Eh? Actually, those log entries are still showing up, even though I've turned one box off. Can anyone tell me what they mean?

wallabybob

kernel: arp: 192.168.1.x is on re0 but got reply from 00:0c:29:9f:xx:xx on re1

The kernel sent a LAN broadcast message on re0 asking "Who has 192.168.1.x?" Then the kernel received on re1 a message saying "my MAC address is 00:0c:29:9f:xx:xx and I have 192.168.1.x". It would appear BOTH re0 and re1 are in the same subnet. This is a configuration no-no.

You haven't described what is on the "other end" of the "next hop" from the WAN connections. A system attempting to do its own "load balancing" or failover?

Something is broken or incompatibly configured.
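
To see which MAC addresses and interface pairs are involved, the flood of arp messages can be tallied. Below is a minimal Python sketch, assuming the log lines follow the format quoted above. The function name, the sample host name, and the sample IP and MAC values are made up for illustration and are not taken from the real logs.

```python
import re
from collections import Counter

# Pattern for the kernel message quoted in this thread:
# "arp: <ip> is on <ifA> but got reply from <mac> on <ifB>"
ARP_MISMATCH = re.compile(
    r"arp: (?P<ip>\S+) is on (?P<expected>\S+) "
    r"but got reply from (?P<mac>[0-9A-Fa-f:]{17}) on (?P<actual>\S+)"
)

def summarize_arp_mismatches(lines):
    """Count mismatch messages per (mac, expected interface, actual interface)."""
    counts = Counter()
    for line in lines:
        match = ARP_MISMATCH.search(line)
        if match:
            key = (match["mac"].lower(), match["expected"], match["actual"])
            counts[key] += 1
    return counts

# Hypothetical sample lines; the real log entries are censored in this thread.
sample = [
    "Feb 25 16:45:47 fw kernel: arp: 192.168.1.10 is on re0 but got reply from 00:0c:29:9f:aa:01 on re1",
    "Feb 25 16:45:48 fw kernel: arp: 192.168.1.11 is on re0 but got reply from 00:0C:29:9F:AA:01 on re1",
    "Feb 25 16:45:49 fw kernel: arp: 192.168.1.12 is on re0 but got reply from 00:0c:29:9f:bb:02 on re2",
    "Feb 25 16:45:50 fw kernel: re0: link state changed to UP",
]

summary = summarize_arp_mismatches(sample)
for (mac, expected, actual), count in summary.most_common():
    print(f"{mac}: expected on {expected}, seen on {actual} ({count} messages)")
```

If a handful of MACs account for nearly all of the messages, those hosts (or whatever connects their segment to the wrong interface) are a reasonable place to start looking.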

jasonlitka

More info on my config:

IP Info (Censored):
LAN - 192.168.1.0/24
WAN1 - 67.93.x.x/27
WAN2 - 70.20.x.x/24 (10 IP range, not the whole block)
SYNC - 10.0.0.x/24

Firewall Rules:
LAN:

LAN -> !LAN = Use WAN1->WAN2 Failover

WAN1:

Block Bogon Networks
Block RFC 1918 Networks
Allow ICMP responses
Allow IAX (UDP 4569) traffic to specific LAN IP

WAN2:

Block Bogon Networks (using an alias)
Block RFC 1918 Networks (using an alias)
Allow ICMP responses
Allow IAX (UDP 4569) traffic to 192.168.1.54

SYNC:

Allow all to all

NAT Rules:
Port Forward:

WAN1: UDP 4569 to 192.168.1.54
WAN2: UDP 4569 to 192.168.1.54

Outbound:

WAN1: Source = LAN Subnet, Gateway WAN1-CARP
WAN2: Source = LAN Subnet, Gateway WAN2-CARP

Virtual IPs

One on WAN1
One on WAN2
One on LAN

There were more outbound NAT rules as the two above do not allow the backup system to receive updates, sync time, etc. when it does not have the CARP VIPs but I removed them when trying to figure out what was going on.
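
Since the suggestion was that two interfaces might share a subnet, an address plan like the one above can be checked for overlaps. This is a rough Python sketch using the standard `ipaddress` module. The WAN prefixes here are documentation placeholders (the real ones are censored), and the `find_overlaps` helper is a name made up for this example.

```python
import ipaddress
from itertools import combinations

def find_overlaps(interfaces):
    """Return sorted pairs of interface names whose subnets overlap."""
    nets = {
        name: ipaddress.ip_network(cidr, strict=False)
        for name, cidr in interfaces.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]

# LAN and SYNC follow the plan above; WAN1 and WAN2 are placeholder
# prefixes from the documentation ranges, NOT the real addresses.
plan = {
    "LAN": "192.168.1.0/24",
    "WAN1": "198.51.100.0/27",
    "WAN2": "203.0.113.0/24",
    "SYNC": "10.0.0.0/24",
}

# A deliberately broken variant for comparison: WAN2 placed inside the LAN range.
broken_plan = dict(plan, WAN2="192.168.1.128/25")

print(find_overlaps(plan))
print(find_overlaps(broken_plan))
```

An empty result only says the configured subnets are distinct. It says nothing about whether the interfaces are isolated at layer 2, which is a separate thing to check.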

jasonlitka

@wallabybob:

Yeah, I've been reading up on that message and I think I might have a bad switch. Outside my two pfSense boxes I've got an 8-port switch with port-based VLANs: two ports for my internal network, three ports for WAN1, and three ports for WAN2. I think that the switch has decided to ignore the VLANs… I'm going to unplug the LAN link from it for now and then try replacing it tomorrow morning.

jasonlitka

After playing around with it a bit more, it seems like only one port was bad. The three ports dedicated to VLAN 2 worked fine, the three dedicated to VLAN 3 worked fine, and the second port for VLAN 1 worked fine. The first port for VLAN 1, on the other hand, seems to be broadcasting traffic on all three VLANs. That's what I get for spending three times as much on a single switch that supports VLANs instead of a pair of cheap ones that don't but would have been physically segregated…