Netgate Discussion Forum

    State table not synced?

    HA/CARP/VIPs
    15 Posts 2 Posters 9.3k Views
    • B
      bmaster
      last edited by

      We have two pfSense nodes with CARP installed and working. Each node has a LAN interface, a WAN interface, and a third sync interface, configured as described in the pfSense book. Configuration changes (firewall rules, traffic shaper, …) are synced to the backup node, and failover seems to work. However, when I start a TCP connection from my PC to somewhere on the internet and then force a failover (by unplugging the network cables of the master node), the TCP connection is dropped. When I compare the state tables of the two nodes (Diagnostics -> States), I see that they are nowhere near identical. What am I doing wrong? Thanks in advance!

      PS: On the slave node, nothing is checked or filled in under 'CARP settings'. Is that OK?

      • jimpJ
        jimp Rebel Alliance Developer Netgate
        last edited by

        Anything in the system log around the time of the changeover?

        Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

        Need help fast? Netgate Global Support!

        Do not Chat/PM for help!

        • B
          bmaster
          last edited by

          Nothing special. On the master:

          
          Jul 6 16:48:52 	kernel: carp0: link state changed to DOWN
          Jul 6 16:48:52 	kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
          Jul 6 16:48:51 	kernel: carp1: link state changed to DOWN
          Jul 6 16:48:51 	kernel: re0: link state changed to DOWN
          
          

          And on the backup:

          
          Jul 6 16:48:53 	kernel: carp1: link state changed to UP
          Jul 6 16:48:52 	kernel: carp0: link state changed to UP
          Jul 6 16:48:52 	kernel: carp0: BACKUP -> MASTER (preempting a slower master)
          
          

          EDIT: Fixed it. On the slave machine, you have to enable "Synchronize Enabled" under CARP settings. This is quite unclear, because in the book it says "you should not configure synchronization from the backup to the master"…

          • B
            bmaster
            last edited by

            One more thing I noticed. This time I tested the failover by unplugging the power cord from the master (never a good thing, but it can happen in the real world…). Failover to the slave works fine. Then I plug the power back in on the master, so it starts booting. After a few moments, one of the two CARP interfaces (the LAN side) switches back to the master box, but the other CARP interface stays on the slave. Only after 2 minutes are both on the master again. The problem is that any TCP connections get messed up in the meantime, of course.

            Below is some logging from the moment the master starts up again. It shows that on the master both CARP interfaces come up at 08:47:04, while on the slave carp0 drops back to BACKUP about 2 minutes later than carp1...

            Master node:

            
            Jul 7 08:47:04 	kernel: carp1: link state changed to UP
            Jul 7 08:47:04 	kernel: carp1: INIT -> MASTER (preempting)
            Jul 7 08:47:04 	kernel: carp1: link state changed to DOWN
            Jul 7 08:47:04 	kernel: carp0: link state changed to UP
            Jul 7 08:47:04 	kernel: carp0: INIT -> MASTER (preempting)
            Jul 7 08:47:04 	kernel: carp0: link state changed to DOWN
            Jul 7 08:47:02 	pftpx[556]: listening on 127.0.0.1 port 8021
            Jul 7 08:47:02 	pftpx[556]: listening on 127.0.0.1 port 8021
            Jul 7 08:47:10 	kernel: carp1: link state changed to DOWN
            Jul 7 08:47:10 	kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
            Jul 7 08:47:10 	kernel: carp1: link state changed to UP
            Jul 7 08:47:06 	kernel: carp1: link state changed to DOWN
            Jul 7 08:47:06 	kernel: carp1: 2 link states coalesced
            Jul 7 08:47:06 	kernel: re0: link state changed to UP
            Jul 7 08:47:06 	kernel: carp1: INIT -> BACKUP
            Jul 7 08:47:06 	kernel: bge0: link state changed to UP
            Jul 7 08:47:05 	kernel: carp0: link state changed to DOWN
            Jul 7 08:47:05 	kernel: carp0: 2 link states coalesced
            Jul 7 08:47:05 	kernel: em1: link state changed to UP
            Jul 7 08:47:05 	kernel: carp0: INIT -> BACKUP
            
            

            Slave node:

            
            Jul 7 08:49:14 	kernel: carp0: link state changed to DOWN
            Jul 7 08:49:14 	kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
            Jul 7 08:47:04 	kernel: carp1: link state changed to DOWN
            Jul 7 08:47:04 	kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
            
            
            • jimpJ
              jimp Rebel Alliance Developer Netgate
              last edited by

              Are those log entries from the master in order? If so, the time is a little out of whack on that server.
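
A quick way to answer the "in order?" question is to check whether the timestamps in a pasted log run consistently in one direction (the log view may list entries newest first or oldest first). The sketch below is only an illustration: the helper names and the placeholder year are assumptions, since the syslog lines carry no year, and the excerpts are a few lines copied from the logs posted above.

```python
from datetime import datetime

# Assumption: the log lines carry no year, so a placeholder is used for parsing.
PLACEHOLDER_YEAR = 2000


def parse_stamp(line):
    """Parse the leading 'Mon D HH:MM:SS' timestamp of a syslog-style line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(
        f"{PLACEHOLDER_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )


def is_monotonic(lines):
    """True if the timestamps never change direction (all ascending or all descending)."""
    stamps = [parse_stamp(line) for line in lines]
    pairs = list(zip(stamps, stamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


# A few lines from the master's log after the power-cycle test.
master_excerpt = [
    "Jul 7 08:47:04 kernel: carp0: link state changed to DOWN",
    "Jul 7 08:47:02 pftpx[556]: listening on 127.0.0.1 port 8021",
    "Jul 7 08:47:10 kernel: carp1: link state changed to DOWN",
    "Jul 7 08:47:06 kernel: carp1: link state changed to DOWN",
]

# The backup's log from the first failover test.
backup_excerpt = [
    "Jul 6 16:48:53 kernel: carp1: link state changed to UP",
    "Jul 6 16:48:52 kernel: carp0: link state changed to UP",
    "Jul 6 16:48:52 kernel: carp0: BACKUP -> MASTER (preempting a slower master)",
]

print(is_monotonic(master_excerpt))  # False: the timestamps jump back and forth
print(is_monotonic(backup_excerpt))  # True: consistently newest first
```

A `False` result only says the listing is not in a single time order; it does not say whether the clock itself jumped or the entries were merely displayed out of sequence.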

              • B
                bmaster
                last edited by

                I didn't even notice those timestamps  :o  Could it be that the clock was a couple of seconds off after the shutdown and was corrected again during boot? Just a wild guess… But I think that would not explain why it takes 2 minutes for the slave server to change carp0 from master to backup, right?
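
To put numbers on that comparison: the slave's log above shows carp1 going to BACKUP at 08:47:04 and carp0 at 08:49:14. A small sketch of the arithmetic (the helper name is made up for illustration):

```python
from datetime import datetime


def seconds_between(first, second):
    """Seconds elapsed from one HH:MM:SS clock time to a later one on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.total_seconds()


# carp1 and carp0 MASTER -> BACKUP transitions taken from the slave's log.
print(seconds_between("08:47:04", "08:49:14"))  # 130.0
```

The two interfaces hand over 130 seconds apart, which is far larger than the few seconds of jitter visible in the master's timestamps.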

                • jimpJ
                  jimp Rebel Alliance Developer Netgate
                  last edited by

                  It's a known issue that if the clocks are off, CARP will not be right, but usually you get some messages like "Incorrect Hash".

                  Are these physical systems or VMs? If they're physical, what kind of hardware is involved?

                  If the time problems happen on every boot, there may be a BIOS or RTC issue, or it could be an ACPI issue. There are a couple of different timecounters that can be set on the system; changing that setting might improve the situation as well.

                  • B
                    bmaster
                    last edited by

                    You say "if the clocks are off"… do you mean a couple of milliseconds or minutes or ... ?

                    The two boxes are physical machines: two identical HP DC7600 computers (Dual Core 3.4 GHz, 1 GB RAM, onboard Broadcom NetXtreme network interface) with some extra network cards installed.

                    I have to do some extra testing to tell if it's on every boot. Changing which setting might improve the situation?

                    EDIT: We did another test. We made box 1 the backup node and box 2 the master node. Then we rebooted the master (box 2). After the reboot the master shows MASTER on all interfaces, but the slave shows MASTER for carp0 and BACKUP for carp1. After about 2 minutes, both interfaces on the slave show BACKUP.

                    Note: In the log file for box 2, there's no "time jump" like we saw in the log file for box 1 in our previous tests. We entered the correct time in the BIOS of box 1, but on every reboot we see a small time jump (about 8 seconds) in the log file. I don't think this is a problem though, because in the last test box 1 keeps running, so I assume its time is correct.

                    EDIT2: We had another identical PC to test with, so I replaced box 1 with this third box, moved the network cards over, restored the backup of config.xml and tried rebooting box 2 again. The same thing happens: when box 2 comes back up, both of its interfaces are master, but interface carp0 of box 3 stays master as well (for about 2 minutes). All ideas welcome :-)

                    • B
                      bmaster
                      last edited by

                      Any more ideas Jim?

                      • jimpJ
                        jimp Rebel Alliance Developer Netgate
                        last edited by

                        Unless there's something specific to the switch you're using, I'm not sure.

                        You could try setting the advskew of the CARP IPs on the backup even higher (200 or so) but I'm not sure that would really make that large of a difference.
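
For a rough sense of scale, here is a minimal sketch of the CARP timing arithmetic. It assumes the default advbase of 1 second and relies on the carp(4) description of advskew as a value in 1/256 of a second; the factor of three for the master-down timeout reflects how the FreeBSD implementation is commonly described and should be treated as an assumption here, as should the function names.

```python
def advertisement_interval(advbase, advskew):
    """Seconds between CARP advertisements: advbase plus advskew/256."""
    return advbase + advskew / 256


def master_down_timeout(advbase, advskew):
    """Seconds a backup waits without advertisements before taking over.

    Assumption: three missed base intervals plus the skew, as in FreeBSD's carp.
    """
    return 3 * advbase + advskew / 256


ADVBASE = 1  # assumed default base interval in seconds

for skew in (0, 100, 200, 254):
    print(skew, advertisement_interval(ADVBASE, skew), master_down_timeout(ADVBASE, skew))
```

Even at the maximum advskew of 254 the interval stays under 2 seconds and the timeout under 4 seconds, so under these assumptions no advskew value accounts for a delay of about two minutes.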

                        • B
                          bmaster
                          last edited by

                          On the LAN side, both machines are connected to a stack of two Nortel Baystack 5510 switches, each machine on a different unit (machine 1 on unit 1, machine 2 on unit 2). On the WAN side, they are connected to the built-in switch of the Speedtouch modem that we have to use for our ISP.

                          I'll try the tip tomorrow and post the results…

                          • B
                            bmaster
                            last edited by

                            Setting the advskew on the slave box to 200 didn't change anything. Is there a known issue with CARP on Speedtouch routers that you know of?

                            Fixed it! I found a simple switch and installed it between the pfSense boxes and the Speedtouch modem/router. Failover works perfectly now. So it seems that the switch built into the Speedtouch modems isn't really suitable for CARP! Thanks again for all the help!

                            • jimpJ
                              jimp Rebel Alliance Developer Netgate
                              last edited by

                              A switch can definitely cause that kind of issue, but it's usually pretty uncommon for a physical switch to do so.

                              Unfortunately now you've got another single point of failure. :-)

                              • B
                                bmaster
                                last edited by

                                @jimp:

                                Unfortunately now you've got another single point of failure. :-)

                                Yeah, but there will always be single points of failure, I guess. And I'd rather replace a simple, cheap switch that requires no settings than a PC with 4 extra network interfaces and pfSense that has to be configured :-) Besides, that switch is only for internet access. We don't really need internet for our business; the other subnets that we'll connect in the future are more important.

                                • jimpJ
                                  jimp Rebel Alliance Developer Netgate
                                  last edited by

                                  Sounds good, hopefully that's the end of the issue :-)

                                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.