State table not synced?

bmaster

I didn't even notice those timestamps :o Could it be that its clock was off by a couple of seconds after the shutdown, and that it was synced again during boot? Just a wild guess… But I don't think that would explain why it takes 2 minutes for the slave server to change carp0 from master to backup, right?

jimp

It's a known issue that if the clocks are off, CARP will not be right, but usually you get some messages like "Incorrect Hash".

Are these physical systems or VMs? If they're physical, what kind of hardware is involved?

If the time problems happen on every boot, there may be a BIOS or RTC issue, or it could be an ACPI issue. There are a couple of different timecounters that can be selected on the system; changing that setting might improve the situation as well.
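For reference, on FreeBSD-based systems such as pfSense the available timecounters are listed by the `kern.timecounter.choice` sysctl and the active one by `kern.timecounter.hardware`. Below is a minimal Python sketch of how that listing could be parsed to see which counter has the highest quality rating. The helper name `parse_timecounters` is hypothetical, and the sample string and its quality values are illustrative assumptions, not output from the boxes in this thread.

```python
import re

def parse_timecounters(choice_line):
    """Parse text like 'TSC(800) ACPI-fast(900) i8254(0)' into {name: quality}."""
    # Drop an optional 'kern.timecounter.choice:' prefix if present.
    _, _, rest = choice_line.rpartition(":")
    pairs = re.findall(r"(\S+?)\((-?\d+)\)", rest)
    return {name: int(quality) for name, quality in pairs}

# Illustrative sample only (assumed values, not taken from a real system).
sample = "kern.timecounter.choice: TSC(800) ACPI-fast(900) i8254(0) dummy(-1000000)"
counters = parse_timecounters(sample)

# The kernel normally prefers the counter with the highest quality value.
best = max(counters, key=counters.get)
print(counters)
print("highest quality:", best)
```

Switching counters is then a matter of setting `kern.timecounter.hardware` to one of the listed names with `sysctl`; which one behaves best on a given board is something to test rather than assume.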

bmaster

You say "if the clocks are off"… do you mean a couple of milliseconds, or minutes, or ... ?

The two boxes are physical machines: two identical HP DC7600 computers (dual core 3.4 GHz, 1 GB RAM, onboard Broadcom NetXtreme network interface) with some extra network cards installed.

I have to do some extra testing to tell if it's on every boot. Changing which setting might improve the situation?

EDIT: We did another test. We made box 1 the backup node and box 2 the master node. Then we rebooted the master (box 2). After the reboot, the master shows MASTER on all interfaces, but the slave shows MASTER for carp0 and BACKUP for carp1. After about 2 minutes, both interfaces on the slave show BACKUP.

Note: In the log file for box2, there's no "time jump" like we saw in the log file for box1 in our previous tests. We entered the correct time in the BIOS for box1, but on every reboot we see a small time jump (about 8 seconds) in the log file. I don't think this is a problem, though, because in the last test box1 kept running, so I assume its time is correct.

EDIT2: We had another identical PC to test with, so I replaced box 1 with this 3rd box, moved the network cards over, restored the backup of config.xml and tried rebooting box2 again. The same thing happens: when box2 comes back up, both its interfaces are master, but interface carp0 of box3 stays master as well (for about 2 minutes). All ideas welcome :-)
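In case it helps anyone reproduce this kind of test, here is a rough Python sketch of how the "both nodes are master" condition could be spotted from `ifconfig` output collected on the two boxes. It assumes the older carpN pseudo-interface layout, where each interface block contains a line such as `carp: MASTER vhid 1 advbase 1 advskew 0`. The sample text, vhid and skew values are made up for illustration, and `carp_states` and `dual_masters` are hypothetical helper names.

```python
import re

def carp_states(ifconfig_text):
    """Map each carpN interface to the state on its 'carp:' status line."""
    states = {}
    current = None
    for line in ifconfig_text.splitlines():
        header = re.match(r"(carp\d+):", line)
        if header:
            current = header.group(1)
            continue
        status = re.match(r"\s+carp: (\w+)", line)
        if status and current:
            states[current] = status.group(1)
    return states

def dual_masters(node_a, node_b):
    """Interfaces for which both nodes claim to be MASTER."""
    return sorted(i for i in node_a
                  if node_a[i] == "MASTER" and node_b.get(i) == "MASTER")

# Illustrative samples only (assumed output, not captured from the real boxes).
box2_text = (
    "carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500\n"
    "\tcarp: MASTER vhid 1 advbase 1 advskew 0\n"
    "carp1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500\n"
    "\tcarp: MASTER vhid 2 advbase 1 advskew 0\n"
)
box3_text = (
    "carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500\n"
    "\tcarp: MASTER vhid 1 advbase 1 advskew 100\n"
    "carp1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500\n"
    "\tcarp: BACKUP vhid 2 advbase 1 advskew 100\n"
)

conflicts = dual_masters(carp_states(box2_text), carp_states(box3_text))
print("both MASTER on:", conflicts)
```

Running something like this every few seconds after the reboot would give a more precise figure for how long the overlap on carp0 lasts.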

bmaster

Any more ideas, Jim?

jimp

Unless there's something specific to the switch you're using, I'm not sure.

You could try setting the advskew of the CARP IPs on the backup even higher (200 or so) but I'm not sure that would really make that large of a difference.
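To put rough numbers on that: according to the FreeBSD carp(4) manual page, advskew is measured in 1/256 of a second and is added to advbase, so a node advertises every advbase + advskew/256 seconds. A small Python sketch of the arithmetic, assuming an advbase of 1 second (the thread doesn't state the configured value) and a hypothetical helper name:

```python
def advertisement_interval(advbase, advskew):
    """Seconds between CARP advertisements: advbase + advskew/256."""
    return advbase + advskew / 256.0

# Compare a few skew values with an assumed advbase of 1 second.
for skew in (0, 100, 200):
    print(skew, advertisement_interval(1, skew))
```

Even with a skew of 200 the interval stays under two seconds, which is why the skew by itself seems unlikely to account for a delay of two minutes.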

bmaster

On the LAN side, both machines are connected to a stack of two Nortel Baystack 5510 switches, each machine connected to a different unit (machine 1 on unit 1, machine 2 on unit 2). On the WAN side, they are connected to the built-in switch of the Speedtouch modem that we have to use for our ISP.

I'll try the tip tomorrow and post the results…

bmaster

Setting the advertising frequency skew (advskew) on the slave box to 200 didn't change anything. Is there a known issue with CARP on Speedtouch routers that you know of?

Fixed it! I found a simple switch and installed it between the pfSense boxes and the Speedtouch modem/router. Failover works perfectly now. So it seems that the switch built into the Speedtouch modems isn't really suitable for CARP! Thanks again for all the help!

jimp

A switch can definitely cause that kind of issue, but it's pretty uncommon for a physical switch to do so.

Unfortunately now you've got another single point of failure. :-)

bmaster

@jimp:

Unfortunately now you've got another single point of failure. :-)

Yeah, but there will always be single points of failure, I guess. And I'd rather have to replace a simple, cheap switch that requires no settings than a PC with 4 extra network interfaces and pfSense that has to be configured :-) Besides, that switch is only for internet access. We don't really need internet for our business; the other subnets that we'll connect in the future are more important.

jimp

Sounds good, hopefully that's the end of the issue :-)