Strange traffic spike on all interfaces cripples boxes



  • Not sure where to stick this as I've no idea what is causing it…

    I recently set up two boxes (1.2.3 testing) to do failover on my two WAN connections and to run CARP between them on WAN1, WAN2, and LAN for increased uptime.  99% of the time things run fine.  However, at what seem to be completely random times, I get huge traffic spikes on all interfaces of both devices, and the spikes cripple both systems.  My normal traffic on WAN1 is 3-4 Mbit/s down and 1.5-2.0 Mbit/s up.  When this happens I end up with about 35 Mbit/s in both directions on WAN1, WAN2, and LAN (but not on the SYNC interface between the two systems).  My guess is that it would be higher, but these systems seem to have a limit of around 230 Mbit/s for all interfaces combined.

    If one box is rebooted, the issue goes away while that box is offline and then typically reappears once it comes back up.  The only way I've found to eliminate the issue is to reboot both boxes at once, which kind of defeats the purpose of having two...

    When it happens I end up with thousands of "kernel: arp: 192.168.1.x is on re0 but got reply from 00:0c:29:9f:xx:xx on re1" messages in my System Log and thousands of "Feb 25 16:45:47  WAN2  67.93.xxx.xxx:65180  239.255.255.250:1900  UDP" messages in my Firewall Log (both records censored slightly).  It's worth mentioning that the IP showing up in the firewall log as having originated on WAN2 is actually assigned to WAN1.

    Can anyone give me some ideas as to where to look to find out what is going wrong?



  • Eh?  Actually, those log entries are still showing up, even though I've turned one box off.  Can anyone tell me what they mean?



  • kernel: arp: 192.168.1.x is on re0 but got reply from 00:0c:29:9f:xx:xx on re1

    The kernel sent a LAN broadcast message on re0 asking "Who has 192.168.1.x?" Then the kernel received on re1 a message saying "my MAC address is 00:0c:29:9f:xx:xx and I have 192.168.1.x". It would appear BOTH re0 and re1 are in the same subnet. This is a configuration no-no.

    You haven't described what is on the "other end" of the "next hop" from the WAN connections. A system attempting to do its own "load balancing" or failover?

    Something is broken or incompatibly configured.
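
    For anyone wading through thousands of these messages, a rough Python sketch along these lines can tally which MAC addresses are answering on the wrong interface. It assumes the log lines look like the one quoted above; the sample addresses below are made up for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the message wording quoted in this thread; other kernel
# versions may phrase it differently (assumption).
ARP_MISMATCH = re.compile(
    r"arp: (?P<ip>\S+) is on (?P<expected>\S+) "
    r"but got reply from (?P<mac>\S+) on (?P<seen>\S+)"
)


def tally_arp_mismatches(lines):
    """Count messages per (MAC, expected interface, interface the reply arrived on)."""
    counts = Counter()
    for line in lines:
        match = ARP_MISMATCH.search(line)
        if match:
            counts[(match["mac"], match["expected"], match["seen"])] += 1
    return counts


# Made-up sample lines, not taken from the real logs.
sample = [
    "kernel: arp: 192.168.1.10 is on re0 but got reply from 02:00:00:00:00:01 on re1",
    "kernel: arp: 192.168.1.10 is on re0 but got reply from 02:00:00:00:00:01 on re1",
    "kernel: arp: 192.168.1.20 is on re0 but got reply from 02:00:00:00:00:02 on re1",
]

for (mac, expected, seen), n in tally_arp_mismatches(sample).most_common():
    print(f"{mac}: expected on {expected}, seen on {seen}, {n} message(s)")
```

    If a handful of MACs account for nearly all of the messages, tracing where those hosts are physically plugged in is a reasonable next step.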



  • More info on my config:

    IP Info (Censored):
    LAN - 192.168.1.0/24
    WAN1 - 67.93.x.x/27
    WAN2 - 70.20.x.x/24 (10 IP range, not the whole block)
    SYNC - 10.0.0.x/24

    Firewall Rules:
    LAN:

    • LAN -> !LAN = Use WAN1->WAN2 Failover

    WAN1:

    • Block Bogon Networks
    • Block RFC 1918 Networks
    • Allow ICMP responses
    • Allow IAX (UDP 4569) traffic to specific LAN IP

    WAN2:

    • Block Bogon Networks (using an alias)
    • Block RFC 1918 Networks (using an alias)
    • Allow ICMP responses
    • Allow IAX (UDP 4569) traffic to 192.168.1.54

    SYNC:

    • Allow all to all

    NAT Rules:
    Port Forward:

    • WAN1: UDP 4569 to 192.168.1.54
    • WAN2: UDP 4569 to 192.168.1.54

    Outbound:

    • WAN1: Source = LAN Subnet, Gateway WAN1-CARP
    • WAN2: Source = LAN Subnet, Gateway WAN2-CARP

    Virtual IPs

    • One on WAN1
    • One on WAN2
    • One on LAN

    There were more outbound NAT rules, since the two above do not allow the backup system to receive updates, sync time, etc. when it does not hold the CARP VIPs, but I removed them while trying to figure out what was going on.
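
    One quick sanity check on an addressing plan like the one above is to confirm that no two interfaces sit in overlapping subnets. Here is a minimal Python sketch using the standard ipaddress module. The WAN prefixes are placeholders from the documentation ranges because the real ones are censored, and the function name is hypothetical.

```python
import ipaddress
from itertools import combinations


def find_overlaps(interfaces):
    """Return pairs of interface names whose configured networks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in interfaces.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# LAN and SYNC follow the list above; WAN1 and WAN2 are placeholder
# prefixes (assumption), not the real censored addresses.
interfaces = {
    "LAN": "192.168.1.0/24",
    "WAN1": "198.51.100.0/27",
    "WAN2": "203.0.113.0/24",
    "SYNC": "10.0.0.0/24",
}

print(find_overlaps(interfaces))
```

    If nothing overlaps, the addressing itself is probably not what triggers the "is on re0 but got reply ... on re1" message, and the cabling or switching between the interfaces is worth a look instead.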



  • @wallabybob:

    Yeah, I've been reading up on that message and I think I might have a bad switch.  On the outside of my two pfSense boxes I've got an 8-port switch with port-based VLANs: two ports for my internal network, three ports for WAN1, and three ports for WAN2.  I think the switch has decided to ignore the VLANs…  I'm going to unplug the LAN link from it for now and then try replacing it tomorrow morning.



  • After playing around with it a bit more, it seems like only one port was bad.  The three ports dedicated to VLAN 2 worked fine, the three dedicated to VLAN 3 worked fine, and the second port for VLAN 1 worked fine.  The first port for VLAN 1, on the other hand, seems to be broadcasting traffic on all three VLANs.  That's what I get for spending three times as much on a single switch that supports VLANs instead of a pair of cheap ones that don't but would have been physically segregated…

