CARP Split brain scenario with sustained throughput



  • Hello, consider this scenario:
    pfSense in an HA pair on Dell R210 II servers (single Xeon CPU, 8 GB RAM). The NICs are a dual-port Intel X520 SFP+ 10 GbE card and a dual-port
    Broadcom copper 1 GbE card.

    I'm running pfSense 2.4.5, but I had the same issue with previous versions.

    The Intel NICs serve the LAN networks through an LACP LAG to daisy-chained switches (one port per switch). My pfSense interfaces sit on top of the LACP LAG as VLANs. CARP is one of these VLANs.

    The Broadcom NICs are connected via an LACP LAG to a switch, which is connected to the WAN router.

    I have a Backup LAN (a LAN segment where I put the backup servers). The backup servers run SMB and NFS.

    When I use SMB for sustained transfers from the Servers LAN to the Backup LAN (almost 5 hours, with files ranging from 2 to 200 GB, at more than 100 MBps), the copy fails and the entire network suddenly becomes and stays unresponsive. CARP status on the primary looks fine, reporting all interfaces as Master, but some interfaces on the secondary are reported as Master too.
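
    The symptom above (both nodes reporting Master for the same virtual IP) can be checked mechanically. Below is a minimal sketch, assuming you have saved the `ifconfig` output from each node as text and that each VHID is used on only one interface. The sample text and the interface names (`lagg0.10`, `lagg0.20`) are hypothetical, not captured from the systems in this thread.

```python
import re

# Matches the CARP status line that FreeBSD's ifconfig prints per virtual IP,
# e.g. "carp: MASTER vhid 1 advbase 1 advskew 0".
CARP_LINE = re.compile(r"carp:\s+(MASTER|BACKUP|INIT)\s+vhid\s+(\d+)")


def carp_states(ifconfig_output):
    """Return {vhid: state} parsed from ifconfig text.

    Assumes each VHID appears on only one interface of the node.
    """
    return {int(vhid): state for state, vhid in CARP_LINE.findall(ifconfig_output)}


def split_brain_vhids(primary_output, secondary_output):
    """Return the sorted VHIDs for which both nodes claim MASTER."""
    primary = carp_states(primary_output)
    secondary = carp_states(secondary_output)
    return sorted(
        vhid
        for vhid, state in primary.items()
        if state == "MASTER" and secondary.get(vhid) == "MASTER"
    )


# Illustrative sample text only (hypothetical interfaces and VHIDs).
PRIMARY = (
    "lagg0.10: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>\n"
    "\tcarp: MASTER vhid 1 advbase 1 advskew 0\n"
    "lagg0.20: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>\n"
    "\tcarp: MASTER vhid 2 advbase 1 advskew 0\n"
)
SECONDARY = (
    "lagg0.10: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>\n"
    "\tcarp: BACKUP vhid 1 advbase 1 advskew 100\n"
    "lagg0.20: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>\n"
    "\tcarp: MASTER vhid 2 advbase 1 advskew 100\n"
)

if __name__ == "__main__":
    print("VHIDs in split brain:", split_brain_vhids(PRIMARY, SECONDARY))
```

    An empty result means the two nodes agree on who is Master for every VHID found in the primary's output.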

    Rebooting the secondary seems to solve the issue.

    For now I have put a backup server in the same segment as the Servers LAN, and everything seems to run smoothly.

    For obvious reasons (100 people working) I can't try to reproduce the problem and investigate further, because I would risk crippling the network again. While the files are being copied, CPU and memory utilization and Web GUI responsiveness are normal.

    I'm considering a couple of solutions:

    • move the CARP interfaces off the switches by adding a second copper network card
    • implement VLAN priority for the CARP interfaces (could that work?)

    Any advice appreciated.
    Thanks



  • I had a similar setup. VLAN changes on the LAG always resulted in a total core dump.

    I have now simplified the setup after doing a little research on CARP. pfSense performs a total failover if a single CARP interface fails.

    So the full mesh with the switches is not needed. If a switch or a cable goes bad, the whole firewall fails over to the other cluster member.
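
    The "one failed interface moves everything" behaviour described above can be pictured with a toy model. This is only a sketch of the outcome, not of how CARP actually decides (real CARP nodes exchange advertisements rather than consulting a central function), and the member and interface names are hypothetical.

```python
def active_member(members):
    """Pick the cluster member that should hold every CARP virtual IP.

    members: list of (name, {interface: link_ok}) in order of preference,
    primary first. A member qualifies only if all of its CARP interfaces
    are healthy, so a single bad link moves every virtual IP at once.
    Returns None if no member is fully healthy.
    """
    for name, links in members:
        if all(links.values()):
            return name
    return None


if __name__ == "__main__":
    # Hypothetical cluster: the primary has lost its LAN link.
    cluster = [
        ("primary", {"lan": False, "wan": True}),
        ("secondary", {"lan": True, "wan": True}),
    ]
    print(active_member(cluster))
```

    In this model a single switch or cable failure on the primary is enough to hand all virtual IPs to the secondary, which is why a redundant LAG per firewall adds less than it might seem.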

    Maybe you could adopt this as well and get rid of the LAG. That might also solve your problem, since this advanced setup is a little bit shaky.


  • LAYER 8 Moderator

    @thesurf said in CARP Split brain scenario with sustained throughput:

    I had a similar setup. VLAN changes on the LAG always resulted in a total core dump.

    You wouldn't have an Intel X520 (or similar) SFP+ card in that setup, would you? We had 2-3 instances of that chipset/driver misbehaving on LACP/VLAN pairings, where changing anything related to the VLAN setup would break things.

    With 2.4.5_1 and 2.5 those drivers were updated and now the systems run smoothly again.

    The OP mentions X520s but is running 2.4.5, so I don't know whether that happened on 2.4.5 again (and with what errors in syslog). In one case we switched away from that particular NIC/driver to another, better supported 10G SFP+ card, and now the customer is happy and has no problems at all.



  • Honestly, I have never had any problems other than this one. VLANs and LACP behave fine, with no core dumps or other particular problems using the X520s.
    But since a couple of reconditioned SFP+ cards are cheaper than troubleshooting, I will try that route. Any advice on which SFP+ card to put in my servers? Is the Chelsio S320 good?


  • LAYER 8 Moderator

    AFAIR Chelsio cards are the ones Netgate itself uses in the XG series, though I don't know exactly which model or revision. I'd try them!

