State table not synced?
-
Anything in the system log around the time of the changeover?
-
Nothing special. On the master:
Jul 6 16:48:52 kernel: carp0: link state changed to DOWN
Jul 6 16:48:52 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Jul 6 16:48:51 kernel: carp1: link state changed to DOWN
Jul 6 16:48:51 kernel: re0: link state changed to DOWN
And on the backup:
Jul 6 16:48:53 kernel: carp1: link state changed to UP
Jul 6 16:48:52 kernel: carp0: link state changed to UP
Jul 6 16:48:52 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
EDIT: Fixed it. On the slave machine, you have to enable "Synchronize Enabled" under CARP settings. This is quite unclear, because in the book it says "you should not configure synchronization from the backup to the master"…
-
One more thing I noticed. This time I tested the failover by unplugging the power cord from the master (this is never good, but it can happen in the real world…). Failover to the slave works fine. Then I plug the power back in on the master, so it starts booting. After a few moments, one of the two CARP interfaces (the LAN side) switches back to the master box, but the other CARP interface stays on the slave. Only after 2 minutes are they both on the master again. The problem here is that any TCP connections get messed up, of course.
Below is some logging at the moment that the master is starting up again. There you see that on the master both carp interfaces are up at 08:47:04, while on the slave carp0 is up 2 minutes later than carp1...
Master node:
Jul 7 08:47:04 kernel: carp1: link state changed to UP
Jul 7 08:47:04 kernel: carp1: INIT -> MASTER (preempting)
Jul 7 08:47:04 kernel: carp1: link state changed to DOWN
Jul 7 08:47:04 kernel: carp0: link state changed to UP
Jul 7 08:47:04 kernel: carp0: INIT -> MASTER (preempting)
Jul 7 08:47:04 kernel: carp0: link state changed to DOWN
Jul 7 08:47:02 pftpx[556]: listening on 127.0.0.1 port 8021
Jul 7 08:47:02 pftpx[556]: listening on 127.0.0.1 port 8021
Jul 7 08:47:10 kernel: carp1: link state changed to DOWN
Jul 7 08:47:10 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Jul 7 08:47:10 kernel: carp1: link state changed to UP
Jul 7 08:47:06 kernel: carp1: link state changed to DOWN
Jul 7 08:47:06 kernel: carp1: 2 link states coalesced
Jul 7 08:47:06 kernel: re0: link state changed to UP
Jul 7 08:47:06 kernel: carp1: INIT -> BACKUP
Jul 7 08:47:06 kernel: bge0: link state changed to UP
Jul 7 08:47:05 kernel: carp0: link state changed to DOWN
Jul 7 08:47:05 kernel: carp0: 2 link states coalesced
Jul 7 08:47:05 kernel: em1: link state changed to UP
Jul 7 08:47:05 kernel: carp0: INIT -> BACKUP
Slave node:
Jul 7 08:49:14 kernel: carp0: link state changed to DOWN
Jul 7 08:49:14 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Jul 7 08:47:04 kernel: carp1: link state changed to DOWN
Jul 7 08:47:04 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
-
Are those log entries from the master in order? If so, the time is a little out of whack on that server.
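If it helps, here is a rough way to spot such jumps without reading every line by eye. It is only a sketch: the helper names `parse_stamp` and `find_time_jumps` are made up for illustration, the year is an assumption (syslog lines carry none), and it assumes the log is listed newest entry first, which may not match how your log was exported. The sample lines are a subset of the master log above, in the order posted.

```python
from datetime import datetime

def parse_stamp(line, year=2000):
    # Syslog lines have no year, so an arbitrary one is assumed for parsing.
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

def find_time_jumps(lines, newest_first=True):
    """Return (previous, current) line pairs whose timestamps break the expected order."""
    jumps = []
    for prev, cur in zip(lines, lines[1:]):
        a, b = parse_stamp(prev), parse_stamp(cur)
        # In a newest-first listing, each line should be no later than the one before it.
        broken = b > a if newest_first else b < a
        if broken:
            jumps.append((prev, cur))
    return jumps

# A few lines from the master log in this thread, kept in the posted order.
master_log = [
    "Jul 7 08:47:04 kernel: carp0: INIT -> MASTER (preempting)",
    "Jul 7 08:47:02 pftpx[556]: listening on 127.0.0.1 port 8021",
    "Jul 7 08:47:10 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)",
    "Jul 7 08:47:06 kernel: re0: link state changed to UP",
]

for prev, cur in find_time_jumps(master_log):
    print("order breaks between:")
    print("  " + prev)
    print("  " + cur)
```

If the log is actually listed oldest first, pass `newest_first=False` instead.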
-
I didn't even notice those timestamps :o Could it be that its time was a couple of seconds wrong after the shutdown, and that it synced it again during boot? Just a wild guess… But I think that would not explain why it takes 2 minutes for the slave server to change carp0 from master to backup, right?
-
It's a known issue that if the clocks are off, CARP will not be right, but usually you get some messages like "Incorrect Hash".
Are these physical systems or VMs? If they're physical, what kind of hardware is involved?
If the time problems happen on every boot, there may be a BIOS or RTC issue, or it could be an ACPI issue. There are a couple of different timecounters that can be selected on the system; changing that setting might improve the situation as well.
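For reference, on FreeBSD-based systems the available timecounters are listed by `sysctl kern.timecounter.choice` and the active one by `sysctl kern.timecounter.hardware`. The snippet below is only a sketch of how to read the `choice` output: the function name `parse_timecounter_choice` is made up, and the sample string is illustrative, not taken from the machines in this thread.

```python
import re

def parse_timecounter_choice(text):
    """Turn 'NAME(quality) NAME(quality) ...' into a {name: quality} dict."""
    return {name: int(quality)
            for name, quality in re.findall(r"(\S+?)\((-?\d+)\)", text)}

# Illustrative sample only; run `sysctl -n kern.timecounter.choice` for real values.
sample = "TSC(800) ACPI-fast(900) i8254(0) dummy(-1000000)"

choices = parse_timecounter_choice(sample)
for name, quality in sorted(choices.items(), key=lambda item: -item[1]):
    print(f"{name}: quality {quality}")
```

Switching to another counter is then a matter of setting `kern.timecounter.hardware` to one of the listed names. Whether that helps with the time jump seen here is untested.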
-
You say "if the clocks are off"… do you mean a couple of milliseconds, or minutes, or ... ?
The two boxes are physical machines: two identical HP DC7600 computers (dual core 3.4 GHz, 1 GB RAM, onboard Broadcom NetXtreme network interface) with some extra network cards installed.
I have to do some extra testing to tell if it's on every boot. Changing which setting might improve the situation?
EDIT: We did another test. We made box 1 the backup node and box 2 the master node. Then we rebooted the master (box 2). After the reboot, the master shows MASTER on all interfaces, but the slave shows MASTER for carp0 and BACKUP for carp1. After about 2 minutes, both interfaces on the slave show BACKUP.
Note: In the log file for box2, there's no "time jump" like we saw in the log file for box1 in our previous tests. We entered the correct time in the BIOS for box1, but on every reboot we see a small time jump (about 8 seconds) in the log file. I don't think this is a problem though, because in the last test box1 keeps running, so I assume its time is correct.
EDIT2: We had another identical PC to test with, so I replaced box 1 with this 3rd box, moved the network cards over, restored the backup of config.xml and tried rebooting box2 again. Same thing happens: when box2 wakes up, both its interfaces are master, but interface carp0 of box3 stays master as well (for about 2 minutes). All ideas welcome :-)
-
Any more ideas Jim?
-
Unless there's something specific to the switch you're using, I'm not sure.
You could try setting the advskew of the CARP IPs on the backup even higher (200 or so) but I'm not sure that would really make that large of a difference.
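To put a number on that: as I understand carp(4), a node advertises roughly every advbase + advskew/256 seconds, and the node advertising most frequently wins the master role. The sketch below assumes advbase is 1 second and the master's advskew is 0 (neither value is stated in this thread), and `advert_interval` is just an illustrative helper name.

```python
def advert_interval(advbase, advskew):
    """Approximate CARP advertisement interval in seconds (advskew is in 1/256 s units)."""
    return advbase + advskew / 256

# Assumed values: advbase=1 on both nodes, advskew=0 on the master.
master = advert_interval(1, 0)
backup = advert_interval(1, 200)

print(f"master advertises about every {master} s")
print(f"backup advertises about every {backup} s")
print(f"difference: {backup - master} s")
```

Even at 200 the backup's interval stays under two seconds, so a takeover that lags by two minutes is unlikely to be explained by advskew alone.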
-
On the LAN side, both machines are connected to a stack of two Nortel Baystack 5510 switches, each machine on a different unit (machine 1 on unit 1, machine 2 on unit 2). On the WAN side, they are connected to the built-in switch of the Speedtouch modem that we have to use for our ISP.
I'll try the tip tomorrow and post the results…
-
Setting the advskew on the slave box to 200 didn't change anything. Is there a known issue with CARP on Speedtouch routers that you know of?
Fixed it! I found a simple switch and installed it between the pfSense boxes and the Speedtouch modem/router. Failover works perfectly now. So it seems that the switch built into the Speedtouch modems isn't really suitable for CARP! Thanks again for all the help!
-
A switch can definitely cause that kind of issue, but it's usually pretty uncommon for a physical switch to do so.
Unfortunately now you've got another single point of failure. :-)
-
Unfortunately now you've got another single point of failure. :-)
Yeah, but there will always be single points of failure, I guess. And I'd rather replace a simple, cheap switch that requires no settings than a PC with 4 extra network interfaces and pfSense that has to be configured :-) Besides, that switch is only for internet access. We don't really need internet for our business; the other subnets that we'll connect in the future are more important.
-
Sounds good, hopefully that's the end of the issue :-)