Netgate Discussion Forum

    CARP strange behaviour on all networks

    HA/CARP/VIPs
    • GruensFroeschli

      This sounds to me like you have some kind of packet storm.
      Are you using VLANs?
      Configured a bridge?
      Connected something somewhere to "save hardware"?

      We do what we must, because we can.

      Asking questions the smart way: http://www.catb.org/esr/faqs/smart-questions.html

      • PDJ

        All answers are no.

        We do not have VLANs, no bridges are configured, and every subnet has its own physical Ethernet adapter (port).
        At the moment I have disabled CARP on the slave, because the network is unstable when it is enabled.
        I should add that we were on 2.0 before. We had some problems with the WAN VIPs, but it worked just fine; the real collapse started with 2.1.

        I checked the switches during the outage, but there were no very high loads on any port (at least on the ones I checked; I don't have much time to check when all networks are down).

        This is what the log file showed (on the master):
        Sep 13 14:13:01 kernel: opt12_vip12: link state changed to UP
        Sep 13 14:13:01 kernel: lan_vip1: link state changed to UP
        Sep 13 14:13:01 kernel: opt4_vip7: link state changed to UP
        Sep 13 14:13:01 kernel: opt6_vip9: link state changed to UP
        Sep 13 14:13:01 kernel: opt7_vip15: link state changed to UP
        Sep 13 14:13:01 kernel: opt2_vip6: link state changed to UP
        Sep 13 14:13:01 kernel: opt1_vip5: link state changed to UP
        Sep 13 14:13:01 kernel: wan_vip3: link state changed to UP
        Sep 13 14:13:01 kernel: opt11_vip11: link state changed to UP
        Sep 13 14:13:01 kernel: opt5_vip8: link state changed to UP
        Sep 13 14:13:34 kernel: opt7_vip15: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:34 kernel: opt7_vip15: link state changed to DOWN
        Sep 13 14:13:34 kernel: opt4_vip7: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:34 kernel: opt4_vip7: link state changed to DOWN
        Sep 13 14:13:34 kernel: opt6_vip9: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:34 kernel: opt6_vip9: link state changed to DOWN
        Sep 13 14:13:34 kernel: opt11_vip11: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:34 kernel: opt11_vip11: link state changed to DOWN
        Sep 13 14:13:34 kernel: opt2_vip6: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:34 kernel: opt1_vip5: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:34 kernel: opt2_vip6: link state changed to DOWN
        Sep 13 14:13:34 kernel: opt1_vip5: link state changed to DOWN
        Sep 13 14:13:34 kernel: wan_vip3: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:34 kernel: wan_vip3: link state changed to DOWN
        Sep 13 14:13:35 kernel: opt5_vip8: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:35 kernel: opt5_vip8: link state changed to DOWN
        Sep 13 14:24:28 kernel: opt5_vip8: link state changed to UP
        Sep 13 14:24:28 kernel: opt7_vip15: link state changed to UP
        Sep 13 14:24:29 kernel: opt4_vip7: link state changed to UP
        Sep 13 14:24:29 kernel: opt6_vip9: link state changed to UP
        Sep 13 14:24:29 kernel: opt11_vip11: link state changed to UP
        Sep 13 14:24:29 kernel: opt2_vip6: link state changed to UP
        Sep 13 14:24:29 kernel: opt1_vip5: link state changed to UP
        Sep 13 14:24:29 kernel: wan_vip3: link state changed to UP

        This is what the slave showed:
        Sep 13 14:12:59 kernel: opt1_vip5: link state changed to DOWN
        Sep 13 14:13:00 kernel: opt2_vip6: link state changed to DOWN
        Sep 13 14:13:01 kernel: opt4_vip7: link state changed to DOWN
        Sep 13 14:13:02 kernel: opt5_vip8: link state changed to DOWN
        Sep 13 14:13:03 kernel: opt6_vip9: link state changed to DOWN
        Sep 13 14:13:04 kernel: in_scrubprefix: err=65, prefix delete failed
        Sep 13 14:13:05 kernel: opt11_vip11: link state changed to DOWN
        Sep 13 14:13:05 kernel: in_scrubprefix: err=65, prefix delete failed
        Sep 13 14:13:06 kernel: opt12_vip12: link state changed to DOWN
        Sep 13 14:13:06 kernel: in_scrubprefix: err=65, prefix delete failed
        Sep 13 14:13:07 kernel: wan_vip3: link state changed to DOWN
        Sep 13 14:13:08 kernel: opt7_vip15: link state changed to DOWN
        Sep 13 14:13:09 kernel: lan_vip1: link state changed to DOWN
        Sep 13 14:13:22 kernel: carp0: changing name to 'opt1_vip5'
        Sep 13 14:13:22 kernel: opt1_vip5: INIT -> BACKUP
        Sep 13 14:13:22 kernel: opt1_vip5: link state changed to DOWN
        Sep 13 14:13:23 kernel: carp1: changing name to 'opt2_vip6'
        Sep 13 14:13:23 kernel: opt2_vip6: INIT -> BACKUP
        Sep 13 14:13:23 kernel: opt2_vip6: link state changed to DOWN
        Sep 13 14:13:24 kernel: carp2: changing name to 'opt4_vip7'
        Sep 13 14:13:24 kernel: opt4_vip7: INIT -> BACKUP
        Sep 13 14:13:24 kernel: opt4_vip7: link state changed to DOWN
        Sep 13 14:13:25 kernel: carp3: changing name to 'opt5_vip8'
        Sep 13 14:13:25 kernel: opt5_vip8: INIT -> BACKUP
        Sep 13 14:13:25 kernel: opt5_vip8: link state changed to DOWN
        Sep 13 14:13:25 kernel: opt1_vip5: link state changed to UP
        Sep 13 14:13:26 kernel: carp4: changing name to 'opt6_vip9'
        Sep 13 14:13:26 kernel: Restoring context for interface opt6_vip9 to 1(cpzone)
        Sep 13 14:13:26 kernel: opt6_vip9: INIT -> BACKUP
        Sep 13 14:13:26 kernel: opt6_vip9: link state changed to DOWN
        Sep 13 14:13:26 kernel: opt2_vip6: link state changed to UP
        Sep 13 14:13:27 kernel: carp5: changing name to 'opt10_vip10'
        Sep 13 14:13:27 kernel: ifa_del_loopback_route: deletion failed
        Sep 13 14:13:27 kernel: ifa_add_loopback_route: insertion failed
        Sep 13 14:13:27 kernel: opt4_vip7: link state changed to UP
        Sep 13 14:13:28 kernel: carp6: changing name to 'opt11_vip11'
        Sep 13 14:13:28 kernel: opt11_vip11: INIT -> BACKUP
        Sep 13 14:13:28 kernel: opt11_vip11: link state changed to DOWN
        Sep 13 14:13:28 kernel: opt5_vip8: link state changed to UP
        Sep 13 14:13:29 kernel: carp7: changing name to 'opt12_vip12'
        Sep 13 14:13:29 kernel: opt12_vip12: INIT -> BACKUP
        Sep 13 14:13:29 kernel: opt12_vip12: link state changed to DOWN
        Sep 13 14:13:29 kernel: opt6_vip9: link state changed to UP
        Sep 13 14:13:30 kernel: carp8: changing name to 'wan_vip3'
        Sep 13 14:13:30 kernel: wan_vip3: INIT -> BACKUP
        Sep 13 14:13:30 kernel: wan_vip3: link state changed to DOWN
        Sep 13 14:13:31 kernel: carp9: changing name to 'opt7_vip15'
        Sep 13 14:13:31 kernel: opt7_vip15: INIT -> BACKUP
        Sep 13 14:13:31 kernel: opt7_vip15: link state changed to DOWN
        Sep 13 14:13:31 kernel: opt11_vip11: link state changed to UP
        Sep 13 14:13:32 kernel: carp10: changing name to 'lan_vip1'
        Sep 13 14:13:32 kernel: lan_vip1: INIT -> BACKUP
        Sep 13 14:13:32 kernel: lan_vip1: link state changed to DOWN
        Sep 13 14:13:32 kernel: opt12_vip12: link state changed to UP
        Sep 13 14:13:33 kernel: wan_vip3: link state changed to UP
        Sep 13 14:13:34 php: /carp_status.php: waiting for pfsync…
        Sep 13 14:13:34 php: /carp_status.php: pfsync done in 0 seconds.
        Sep 13 14:13:34 php: /carp_status.php: Configuring CARP settings finalize...
        Sep 13 14:13:34 kernel: opt7_vip15: link state changed to UP
        Sep 13 14:13:35 kernel: opt12_vip12: MASTER -> BACKUP (more frequent advertisement received)
        Sep 13 14:13:35 kernel: opt12_vip12: link state changed to DOWN
        Sep 13 14:24:26 kernel: opt1_vip5: link state changed to DOWN
        Sep 13 14:24:27 kernel: opt2_vip6: link state changed to DOWN
        Sep 13 14:24:28 kernel: opt4_vip7: link state changed to DOWN
        Sep 13 14:24:28 kernel: opt12_vip12: link state changed to UP
        Sep 13 14:24:28 kernel: lan_vip1: link state changed to UP
        Sep 13 14:24:29 kernel: opt5_vip8: link state changed to DOWN
        Sep 13 14:24:30 kernel: opt6_vip9: link state changed to DOWN
        Sep 13 14:24:31 kernel: in_scrubprefix: err=65, prefix delete failed
        Sep 13 14:24:32 kernel: opt11_vip11: link state changed to DOWN
        Sep 13 14:24:33 kernel: opt12_vip12: link state changed to DOWN
        Sep 13 14:24:34 kernel: wan_vip3: link state changed to DOWN
        Sep 13 14:24:35 kernel: opt7_vip15: link state changed to DOWN
        Sep 13 14:24:36 kernel: lan_vip1: link state changed to DOWN
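With this many VIPs the logs above are hard to read by eye. A minimal sketch, assuming the kernel log lines keep the format shown in this post, that counts CARP state transitions per VIP so the flapping ones stand out. The function name `count_transitions` and the `sample` list are made up for illustration; the sample lines are copied from the logs above.

```python
import re
from collections import Counter

# Matches kernel CARP transitions such as
# "kernel: opt7_vip15: MASTER -> BACKUP (more frequent advertisement received)".
# Plain "link state changed to ..." lines are deliberately ignored.
TRANSITION = re.compile(
    r"kernel: (\S+): (INIT|MASTER|BACKUP) -> (INIT|MASTER|BACKUP)"
)


def count_transitions(lines):
    """Return a Counter keyed by (vip, old_state, new_state)."""
    counts = Counter()
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Sample lines taken from the logs in this post.
sample = [
    "Sep 13 14:13:34 kernel: opt7_vip15: MASTER -> BACKUP (more frequent advertisement received)",
    "Sep 13 14:13:34 kernel: opt7_vip15: link state changed to DOWN",
    "Sep 13 14:13:22 kernel: opt1_vip5: INIT -> BACKUP",
]

for (vip, old, new), n in sorted(count_transitions(sample).items()):
    print(f"{vip}: {old} -> {new} x{n}")
```

Feeding it the full system log from both boxes (for example by reading the file line by line) would show whether every VIP flips at the same moment, as the excerpts above suggest.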

        • PDJ

          Anyone?
          I really don't know what it could be; I didn't find much about this on the forums or on other pages.

          • PDJ

            Do I have to report this as a bug?
            Since 2.1 we have had nothing but problems with the network. We have 14 networks and they all go down after a while.
            Do we have to use a different password for every VHID?

            • PDJ

              I have switched the backup server off, because it was very unstable.
              So what should I do? How can I fix this problem?
              Anybody?

              Does it help to become a gold member?

              • ssheikh

                What does your MBUF usage look like?

                • nothing

                  If I were you, I would disconnect all the networks and leave just the WAN and one LAN. If that works, start reconnecting the rest of the LANs one by one to see when it fails.

                  • PDJ

                    Thanks for the answer.

                    I have done that, but the problem is that with all networks connected it runs for a couple of hours and then suddenly collapses, sometimes after an hour, sometimes after a day.
                    Leaving all the networks disconnected for a day is not an option; that would mean downtime for a lot of services.

                    @ssheikh: good question, I'll check that.

                    • PDJ

                      It has been a while; we decided to let it rest and disabled CARP.

                      Now we have built a test network with the same hardware, and I found something very strange.
                      First of all, when the master goes down and comes up again, the slave won't hand the master role back.
                      When I run tcpdump on the slave, I get:

                      IP 192.168.20.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36

                      The funny thing is that the master is configured with skew 0 instead of 240. Where is that 240 coming from?

                      When I manually set the skew to 250 on the backup machine, I see it switch back to slave, and the master becomes master again.

                      But what is causing the strange, unstable behaviour? And why is the prio set to 240?
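For reading captures like the one above: tcpdump decodes CARP traffic as VRRPv2 because both use the same IP protocol number, and as far as I know the field it prints as `prio` sits where CARP carries its advertisement skew, `vrid` is the VHID, and `intvl` is the advertisement base. A minimal sketch under that assumption that pulls those values out of a capture line; `parse_advert` and the dictionary keys are names invented here, not part of any tool.

```python
import re

# Pattern for the one-line tcpdump summary shown in this post.
# Assumption: "prio" is read as the CARP advskew and "intvl" as advbase.
ADVERT = re.compile(
    r"IP (?P<src>\S+) > 224\.0\.0\.18: VRRPv2, Advertisement, "
    r"vrid (?P<vrid>\d+), prio (?P<prio>\d+), .*?intvl (?P<intvl>\d+)s"
)


def parse_advert(line):
    """Return the advertised CARP values as a dict, or None if no match."""
    m = ADVERT.search(line)
    if m is None:
        return None
    return {
        "src": m["src"],
        "vhid": int(m["vrid"]),
        "advskew": int(m["prio"]),
        "advbase": int(m["intvl"]),
    }


line = (
    "IP 192.168.20.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, "
    "prio 240, authtype none, intvl 1s, length 36"
)
print(parse_advert(line))
```

Comparing the advertised `advskew` with the value configured on the sending box shows at a glance whether something other than the configuration is setting it.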

                      • podilarius

                        Don't know. I checked mine and it is listed in tcpdump as:
                        <external IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 124, prio 0, authtype none, intvl 1s, length 36, addrs(7): <removed to protect privacy>
                        It does this on all my CARP stuff. I am on 2.1 final, but all my configs are upgrades and not new installs.

                        Drop to the console and report back the output of this:
                        grep -e advskew -e subnet /cf/conf/config.xml

                        • PDJ

                          Thanks for the answer. I found more info: it has something to do with preempt. If one interface fails, the rest are set to 240 so that all interfaces switch over (that's not something I prefer, but since 2006 you can't change this; pfSense has enabled it by default).
                          However, in my case both boxes do the same. As a result, all interfaces have advskew 240 on both master and slave, and with 20 CARP networks this brings both boxes down because of the constant switching master -> backup -> master…

                          I have set net.inet.carp.preempt to 0 in the system tunables, but it is not changing.
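To make the "both boxes at 240" situation concrete: per the carp(4) documentation, the skew is measured in 1/256 of a second and added to the base advertisement interval, and the box that advertises more frequently is preferred as master (which matches the "more frequent advertisement received" log message). A minimal sketch of that arithmetic follows. The names `advert_interval` and `preferred_master` are invented, the backup skew of 100 is just an example value, and real CARP has timing and tie-breaking details this ignores.

```python
def advert_interval(advbase, advskew):
    """Seconds between advertisements: advbase plus advskew/256."""
    return advbase + advskew / 256


def preferred_master(nodes):
    """Pick the node with the shortest advertisement interval.

    nodes maps a name to (advbase, advskew). Returns None when the two
    best nodes advertise at the same rate, i.e. the skew alone gives no
    clear winner.
    """
    ranked = sorted(nodes.items(), key=lambda kv: advert_interval(*kv[1]))
    if len(ranked) > 1:
        best, runner_up = ranked[0][1], ranked[1][1]
        if advert_interval(*best) == advert_interval(*runner_up):
            return None
    return ranked[0][0]


# Normal case: master at skew 0, backup at an example skew of 100.
print(preferred_master({"master": (1, 0), "slave": (1, 100)}))

# Both boxes demoted to 240, as described in this post: no clear winner.
print(preferred_master({"master": (1, 240), "slave": (1, 240)}))
```

With both boxes advertising at the same interval, neither is clearly preferred, which is at least consistent with the constant master/backup switching described above.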

                          • podilarius

                            On your backup FW, do you have configuration sync turned on?
                            Personally, if I have one link fail, I would need all of them to fail over. Mostly this is because I will need to bring down the master for maintenance. Also, if the WAN died, I don't want any LAN to go to the box where the WAN link failed. If it's one of the LANs, sure, it's not that big a deal; it will just go out the other WAN port. But you never know.

                            • PDJ

                              For me it's easier to have only one network fail over. The setup is such that the slave doesn't have all features (no backup WAN connection), so only one network would lack failover when there is a network failure.
                              If all networks switched independently, I could still shut the master down; all its networks would go down and the slave would take over all of them.

                              I have created a stable situation again. I found out that when there is an open network (both pfSense boxes set to INIT), the network becomes unstable within a couple of hours.

                              But I still want independent failover; I don't get why the option has been taken out.

                              Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.