Backup->master at random intervals
-
Recently I've been seeing some weird behaviour on my firewall setup. The CARP subforum is probably not the right place for this, but CARP is how I noticed the problem.
I have two firewalls running in HA with CARP, both on pfSense 2.0.2 AMD64.
I have noticed that the backup firewall will all of a sudden become the master of some, not all, CARP addresses and then go right back to being backup. It's quite random which addresses it happens to.
At first I just thought it was a CARP bug, and didn't give the issue any real attention because it didn't interfere with regular traffic. Or so I thought…
One day one of my users said that he had a problem transferring a large file over FTP. The FTP server would say "connection reset". I asked him for a timestamp of when it happened, and sure enough the firewall logs showed that the backup firewall had become master and then gone back to being backup at the exact time the user reported.
Yesterday I was pinging a server in one of my VLANs and suddenly got 6 ping timeouts. There were no errors in the firewall log, but I could see that the backup firewall had become master and then gone back to backup. All of this happens within 5 seconds.
Now, what I'm thinking is that the primary firewall experiences some sort of "flow stop". The backup firewall interprets this as the master having gone offline and becomes master, but then the primary resumes normal operation and the backup goes back to being backup.
Do any of you have any idea what could be causing this? I've tried rebooting the firewalls, but it didn't help; otherwise I wouldn't be posting this. Last night I even swapped out the server with my spare server. (Dell R610 with Intel 10Gbit interfaces.)
I have a rather large setup, with about 15 VLANs and about 250 production servers and load balancers behind the firewalls, so I can't just make changes or do anything else that would impact operations.
Any advice would be nice, as I'm running low on ideas. The only thing I haven't explored is rebooting or updating IOS on my Cisco switches. They have uptimes surpassing 400 days.
-
Most commonly that is an issue at layer 2.
The switchover only happens if the backup fails to see advertisements from the master, or if the advertisements it does see are slower than its own.
If you have any kind of multicast/broadcast storm control or limiting in your switches, disable it.
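To put a rough number on "fails to see advertisements": as far as I understand the BSD CARP implementation, a node advertises every advbase + advskew/256 seconds, and a backup takes over after about 3 × advbase + advskew/256 seconds of silence. Here is a small Python sketch of that arithmetic. The values advbase=1 and advskew=100 for the backup are assumptions for illustration, so check the settings on your own VIPs.

```python
def advertisement_interval(advbase: int, advskew: int) -> float:
    """Seconds between CARP advertisements sent by a node."""
    return advbase + advskew / 256


def master_down_timeout(advbase: int, advskew: int) -> float:
    """Seconds a backup waits without hearing the master before taking over."""
    return 3 * advbase + advskew / 256


# Assumed values: master at advbase=1/advskew=0, backup at advbase=1/advskew=100.
master_interval = advertisement_interval(advbase=1, advskew=0)
backup_timeout = master_down_timeout(advbase=1, advskew=100)

print(f"master advertises every {master_interval:.2f} s")
print(f"backup takes over after {backup_timeout:.2f} s of silence")
```

With those assumed values the backup only needs to miss a little over three seconds of advertisements, so a short layer 2 interruption is enough to trigger a brief master/backup flip.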
-
I knew someone would say that. Sadly I don't have any multicast/broadcast limiter in my switches. That would have been an easy fix though.
Alright… time for an IOS upgrade. :)
I'll let you know how it works out.
-
The CARP flapping is the telling symptom of the actual problem, but it isn't the problem in and of itself. It indicates connectivity problems between the systems. Checking whether you lose connectivity to the interface IPs of the firewalls could be telling.
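One way to check that is to ping both firewalls' interface IPs (not the CARP VIPs) once a second and log every loss with a timestamp, then compare against the CARP state changes in the firewall log. A minimal Python sketch, assuming a Unix-like `ping` that accepts `-c`; the addresses and the names `FIREWALLS`, `is_reachable`, `outage_windows` and `watch` are made up for illustration:

```python
import subprocess
import time
from datetime import datetime

# Hypothetical addresses: replace with the real interface IPs of each firewall.
FIREWALLS = {"primary": "192.0.2.1", "backup": "192.0.2.2"}


def is_reachable(host: str, timeout_s: float = 2.0) -> bool:
    """Send one ICMP echo with the system ping and report success."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_loss, last_loss) windows."""
    windows = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            windows.append((start, last))
            start = last = None
    if start is not None:
        windows.append((start, last))
    return windows


def watch(interval_s: float = 1.0) -> None:
    """Probe every firewall forever and print a line for each lost ping."""
    while True:
        for name, ip in FIREWALLS.items():
            if not is_reachable(ip):
                stamp = datetime.now().isoformat(timespec="seconds")
                print(f"{stamp} lost ping to {name} ({ip})")
        time.sleep(interval_s)
```

Call `watch()` from a machine in the same VLAN as the firewalls and leave it running. If you collect the results yourself, `outage_windows` turns the raw samples into start/end pairs that are easier to line up with the log.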
-
Solved.
I didn't get to the IOS update part because I found the problem. Spanning tree was converging at random times even though there were no topology changes. I edited some STP costs, and that did the trick.