Failed master node

  • Hi All,

    I'm using two Supermicro boxes in an HA/CARP setup. The boxes handle DHCP for all the VLANs.
    Today a hardware failure occurred on the master. The failover to the slave went well. I replaced the master's hardware, installed pfSense, and restored a 2-day-old config.xml.

    I was expecting the master to sync with the slave and become master again, but it did not. During the first boot of the new master both interfaces were down, so I brought them up via the shell, but then all kinds of errors started scrolling on the screen, too fast to read. I was also unable to access the webConfigurator or ping the LAN interface.

    Looks like I did something wrong, but I don't know what.
    What is the exact recovery procedure in case of a master hardware failure?

  • LAYER 8 Netgate

    Sounds like you did OK.

    The best thing to do in that situation is a configuration freeze on the secondary, since the secondary will not sync back to the primary. Otherwise, a log will have to be kept of all changes so they can be made again on the primary node.
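
If no change log was kept, one rough way to see what was changed on the secondary since the restored backup was taken is to compare the two config.xml files section by section. The following is only a sketch: the helper names are made up, the XML snippets are invented samples rather than a real pfSense configuration, and the comparison is a plain text comparison of each top-level section, so formatting-only differences would also be reported.

```python
import xml.etree.ElementTree as ET


def section_fingerprints(xml_text):
    """Map each top-level tag to the serialized text of its element(s)."""
    root = ET.fromstring(xml_text)
    sections = {}
    for child in root:
        serialized = ET.tostring(child, encoding="unicode").strip()
        sections.setdefault(child.tag, []).append(serialized)
    return sections


def changed_sections(old_xml, new_xml):
    """Return the sorted top-level tags whose content differs."""
    old = section_fingerprints(old_xml)
    new = section_fingerprints(new_xml)
    return sorted(tag for tag in set(old) | set(new) if old.get(tag) != new.get(tag))


# Invented sample data, for illustration only.
OLD_CONFIG = (
    "<pfsense><system><hostname>fw1</hostname></system>"
    "<dhcpd><lan><enable/></lan></dhcpd></pfsense>"
)
NEW_CONFIG = (
    "<pfsense><system><hostname>fw1</hostname></system>"
    "<dhcpd><lan><enable/></lan><opt1><enable/></opt1></dhcpd></pfsense>"
)

if __name__ == "__main__":
    print(changed_sections(OLD_CONFIG, NEW_CONFIG))
```

This only points at the sections worth reviewing by hand; it does not merge anything back into the primary.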

    I would bring the replacement master back up disconnected from the network, restore the configuration, and set permanent CARP maintenance mode there.

    Then I would connect it to the network, start it, and be sure everything comes up in CARP BACKUP state. Make sure everything looks good, then disable CARP maintenance mode and fail back.
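
To make the "everything comes up in CARP BACKUP state" check less of an eyeball exercise, the output of ifconfig can be scanned for its carp: lines. This is a minimal sketch, assuming the FreeBSD-style `carp: STATE vhid N ...` line format; the function names and the sample text (interfaces, addresses, advskew values) are invented for illustration.

```python
import re

# Matches lines such as: "carp: BACKUP vhid 1 advbase 1 advskew 254"
CARP_LINE = re.compile(r"carp:\s+(MASTER|BACKUP|INIT)\s+vhid\s+(\d+)")


def carp_states(ifconfig_output):
    """Return a list of (vhid, state) tuples, in order of appearance."""
    return [(int(vhid), state) for state, vhid in CARP_LINE.findall(ifconfig_output)]


def all_backup(ifconfig_output):
    """True only if at least one CARP VIP was found and all are BACKUP."""
    states = carp_states(ifconfig_output)
    return bool(states) and all(state == "BACKUP" for _, state in states)


# Invented sample output, for illustration only.
SAMPLE = """\
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tinet 192.0.2.2 netmask 0xffffff00 broadcast 192.0.2.255
\tcarp: BACKUP vhid 1 advbase 1 advskew 254
em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tinet 198.51.100.2 netmask 0xffffff00 broadcast 198.51.100.255
\tcarp: BACKUP vhid 2 advbase 1 advskew 254
"""

if __name__ == "__main__":
    print(carp_states(SAMPLE))
    print("all BACKUP:", all_backup(SAMPLE))
```

On a real node the text would come from running ifconfig and feeding its output to `all_backup`; a VIP stuck in INIT or showing MASTER would make it return False.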

    It sounds like you might have experienced a kernel panic. Was a crash dump present when you restarted it?

  • Thanks for your reply.

    So I started from scratch:

    • Installed the same pfSense version on node1 as on node2
    • Restored the backup XML
    • Set permanent CARP maintenance mode from the console using option 12 and the command "enablecarpmaint exec;"
    • Connected both network cables and ran ifconfig up for both interfaces

    Text started scrolling on the console screen.
    Filter synchronize: beginning XMLRPC sync data to https://XXX/xmlrpc.php
    A communication error occured while attempting to call XMLRPC method host_firmware_version. Unable to connect to tls://xxx:443 Error can't assign requested address.

    arpresolve: can't allocate llinfo for on em1.

  • LAYER 8 Netgate

    Well it is going to need at least a sync cable to sync over.

    You might also want to disable XMLRPC sync on the restored primary until you are ready to do that too. Or ignore that error.

    If it is supposed to be syncing and cannot, you'll have to work out why there is no connectivity between the two.
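
As a first step in working out the connectivity, a plain TCP connection test to the peer's HTTPS port shows whether the sync target is reachable at all. This is only a sketch: `can_reach` is a made-up helper, 192.0.2.10 is a placeholder address rather than anything from this thread, and port 443 is taken from the tls://xxx:443 in the error above.

```python
import socket


def can_reach(host, port=443, timeout=3.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts, and local errors such as
        # "can't assign requested address" when no usable source address exists.
        return False, str(exc)


if __name__ == "__main__":
    # Placeholder peer address; replace with the sync peer's IP.
    ok, reason = can_reach("192.0.2.10", 443)
    print("reachable" if ok else "not reachable: " + reason)
```

A failure here points at cabling, interface state, or addressing on the sync path rather than at XMLRPC itself.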
