IPSEC/CARP behaviour



  • Hi all,

    Using latest snapshot in a cluster having a site to site IPSEC tunnel with a 1.2.3 peer (VPN endpoint is the WAN CARP), the second node try to establish SAs with the peer from times to times, which can result in a loss of communication since the SA got established with the slave node. I have hundred of clusters in the wild, this is the first time I see this behaviour.
    If I stop/start racoon after the reboot of the second node, SAs establishment attempt still occurs but fails with a "send error" and then it does not affect the peer anymore.

    It's Friday, perhaps I need to go home ;-)



  • And the IPsec is on the CARP IP, not WAN?



  • Yes, it's using the WAN CARP VIP as the tunnel endpoint.
    I've just checked the logs on the two systems and I found what is happening.
    Each time the admin makes a change on the primary node, the configuration is synced to the slave (normal behaviour), but CARP interfaces are switched off and then on while the configuration is synced, racoon get restarted, the secondary node then seems to become master for 1 second, time to send a SA request to the peer. If you make many changes on the master configuration, sometimes the SA get establish.

    X.X.X.X is the WAN CARP VIP

    System logs on the slave node at the time it receive a config sync:

    Apr 8 20:20:02 php: /xmlrpc.php: Resyncing OpenVPN instances.
    Apr 8 20:20:02 apinger: Starting Alarm Pinger, apinger(8874)
    Apr 8 20:20:02 kernel: vip126: link state changed to DOWN
    Apr 8 20:20:02 kernel: vip126: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:02 kernel: vip124: link state changed to DOWN
    Apr 8 20:20:02 kernel: vip124: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:02 kernel: vip125: link state changed to DOWN
    Apr 8 20:20:02 kernel: vip125: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:01 kernel: vip122: link state changed to DOWN
    Apr 8 20:20:01 kernel: vip122: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:01 kernel: vip121: link state changed to DOWN
    Apr 8 20:20:01 kernel: vip121: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:01 apinger: Exiting on signal 15.
    Apr 8 20:20:01 php: /xmlrpc.php: ROUTING: change default route to Z.Z.Z.Z
    Apr 8 20:20:01 php: : Config sync not being done because of missing sync IP (normal on secondary systems).
    Apr 8 20:20:01 kernel: vip126: link state changed to UP
    Apr 8 20:20:01 kernel: vip126: INIT -> MASTER (preempting)
    Apr 8 20:20:01 kernel: vip126: link state changed to DOWN
    Apr 8 20:20:01 kernel: vip125: link state changed to UP
    Apr 8 20:20:01 kernel: vip125: INIT -> MASTER (preempting)
    Apr 8 20:20:01 kernel: vip125: link state changed to DOWN
    Apr 8 20:20:01 kernel: vip124: link state changed to UP
    Apr 8 20:20:01 kernel: vip124: INIT -> MASTER (preempting)
    Apr 8 20:20:01 kernel: vip124: link state changed to DOWN
    Apr 8 20:20:01 kernel: vip122: link state changed to UP
    Apr 8 20:20:01 kernel: vip122: INIT -> MASTER (preempting)
    Apr 8 20:20:01 kernel: vip122: link state changed to DOWN
    Apr 8 20:20:01 check_reload_status: reloading filter
    Apr 8 20:20:01 kernel: vip121: link state changed to UP
    Apr 8 20:20:01 kernel: vip121: INIT -> MASTER (preempting)
    Apr 8 20:20:01 kernel: vip121: link state changed to DOWN
    Apr 8 20:20:01 kernel: vip126: link state changed to DOWN
    Apr 8 20:20:01 kernel: vip126: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:01 check_reload_status: syncing firewall
    Apr 8 20:20:00 kernel: vip124: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip124: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:00 kernel: vip125: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip125: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:00 kernel: vip122: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip122: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:00 kernel: vip121: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip121: MASTER -> BACKUP (more frequent advertisement received)
    Apr 8 20:20:00 kernel: vip126: link state changed to UP
    Apr 8 20:20:00 kernel: vip126: INIT -> MASTER (preempting)
    Apr 8 20:20:00 kernel: carp4: changing name to 'vip126'
    Apr 8 20:20:00 kernel: vip125: link state changed to UP
    Apr 8 20:20:00 kernel: vip125: INIT -> MASTER (preempting)
    Apr 8 20:20:00 kernel: carp3: changing name to 'vip125'
    Apr 8 20:20:00 kernel: vip124: link state changed to UP
    Apr 8 20:20:00 kernel: vip124: INIT -> MASTER (preempting)
    Apr 8 20:20:00 kernel: carp2: changing name to 'vip124'
    Apr 8 20:20:00 kernel: vip122: link state changed to UP
    Apr 8 20:20:00 kernel: vip122: INIT -> MASTER (preempting)
    Apr 8 20:20:00 kernel: carp1: changing name to 'vip122'
    Apr 8 20:20:00 php: : Config sync not being done because of missing sync IP (normal on secondary systems).
    Apr 8 20:20:00 kernel: vip121: link state changed to UP
    Apr 8 20:20:00 kernel: vip121: INIT -> MASTER (preempting)
    Apr 8 20:20:00 kernel: carp0: changing name to 'vip121'
    Apr 8 20:20:00 check_reload_status: syncing firewall
    Apr 8 20:20:00 kernel: vip126: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip125: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip124: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip122: link state changed to DOWN
    Apr 8 20:20:00 kernel: vip121: link state changed to DOWN

    IPSEC logs on the slave at the time of the sync:
    Apr 8 20:20:03 racoon: WARNING: unrecognized route message with rtm_type: 4
    Apr 8 20:20:03 racoon: WARNING: unrecognized route message with rtm_type: 4
    Apr 8 20:20:03 racoon: WARNING: unrecognized route message with rtm_type: 4
    Apr 8 20:20:03 racoon: WARNING: unrecognized route message with rtm_type: 4
    Apr 8 20:20:01 racoon: [Self]: INFO: X.X.X.X[500] used as isakmp port (fd=17)
    Apr 8 20:20:01 racoon: [Self]: INFO: X.X.X.X[500] used for NAT-T
    Apr 8 20:20:01 racoon: [Self]: INFO: X.X.X.X[4500] used as isakmp port (fd=16)
    Apr 8 20:20:01 racoon: [Self]: INFO: X.X.X.X[4500] used for NAT-T
    Apr 8 20:20:00 racoon: ERROR: failed to begin ipsec sa negotication.
    Apr 8 20:20:00 racoon: ERROR: phase1 negotiation failed due to send error. 2f22f846442bd455:0000000000000000
    Apr 8 20:20:00 racoon: INFO: begin Identity Protection mode.
    Apr 8 20:20:00 racoon: [Remote Peer]: INFO: initiate new phase 1 negotiation: X.X.X.X[500]<=>Y.Y.Y.Y[500]
    Apr 8 20:20:00 racoon: [Remote Peer]: INFO: IPsec-SA request for Y.Y.Y.Y queued due to no phase1 found.
    Apr 8 20:20:00 racoon: [Self]: INFO: X.X.X.X[500] used as isakmp port (fd=17)
    Apr 8 20:20:00 racoon: [Self]: INFO: X.X.X.X[500] used for NAT-T
    Apr 8 20:20:00 racoon: [Self]: INFO: X.X.X.X[4500] used as isakmp port (fd=16)
    Apr 8 20:20:00 racoon: [Self]: INFO: X.X.X.X[4500] used for NAT-T



  • I confirm what I said before.

    Doing multiple changes on the primary node makes the secondary node reconfigure it's CARP VIP.
    Secondary node negotiate new SA with the peer ! Then remote network becomes unaccessible.

    Do you know what changed in the configuration sync ?



  • anyone else experiencing the same ?

    Thx ;)



  • Yeah that makes sense, I see the state change too, which if you hit the IPsec at just the right time will trigger it to come up and cause issues.
    http://redmine.pfsense.org/issues/1433



  • @Juve:

    anyone else experiencing the same ?

    Thx ;)

    I see "strange behaviours" when I am making many changes to the primary node… sometimes the backup node just hangs with Kernel error 12... hope this is the answer to the problem and will be fixed soon!

    Thanks,
    Michele



  • Thank you cmb.
    I'll check the ticket status on redmine and will apply an update as soon as a patch is available and I'll give you a feedback.



  • I saw on redmine that Ermal had made a patch for the CARP behaviour.
    So I have updated my 2.0 cluster to the latest snapshot (pfSense-Full-Update-2.0-RC1-i386-20110426-1925)

    I have tested the behaviour during a configuration resync.
    IPSECstill doing the same.

    The problem of the CARP switch other is still here. Every time I make a change on the master (using the LAN CARP as the pfsense admin URL), at the time of the mouse click on the "Apply changes" button, I get a SSL Certificate error because I'm switching to the slave and then back to the master.

    I saw on #1433 that Ermal said nothing had changed in the sync process, so why is this happening ?
    It has never been the same with 1.2.3.



  • Using the latest RC2 snaphsot the problem of being switched to the slave when using webconfigurator though lan CARP still occurs from time to time.
    When making many changes within few minutes, the problem ca occurs every two click on "Apply changes".

    Anyone experiencing the same ?



  • There were some fixes done yesterday and some more will come on this today.


Locked