Problem with CARP after upgrade to RC3
I have a CARP setup that worked flawlessly for a month on 2 pfSense RC1 machines. Yesterday I upgraded both boxes to RC3, and after 20 minutes of use my network went down.
I don't know what happened. I tested every single device in my network to try to restore communication, and after almost 6 hours of testing I turned off the passive (backup) server. As soon as I turned it off, the network started working again.
I didn't change any other setting on pfSense; I only upgraded using the manual firmware option in the System menu, choosing the RC3 tgz file.
Below is the configuration of my network:
Each server has 4 NICs: 1 WAN / 1 LAN / 1 DMZ / 1 CARP monitoring (crossover cable)
I have 8 CARP IPs: 1 for DMZ, 1 for LAN, 6 for WAN
The LAN interfaces of both servers are plugged into one switch. On the same switch I have a proxy server (CentOS) running in bridge mode, and the proxy server's other NIC is connected to the main network switch.
This setup worked for a month until the upgrade to RC3, so I'd like to ask whether anyone can figure out what could have happened in this case!
If I remove the LAN cable from the backup, the network starts functioning again, so I removed the cable and for now I'm trying to downgrade from RC3 to RC1. But if this is a problem in RC3, I think you should look at what has changed!
I forgot to mention: I'm always using the i386 version.
Hi! It's me again!
We rolled back to the previous state: we downgraded both servers to RC1, but the problem that never happened before occurred again. Now, with both servers on RC1, failover doesn't work.
If I remove the LAN cable from one of the 2 servers, my network starts working. But if I plug the LAN cable back in, after 4 to 10 minutes the network stops answering.
The strange thing is that it only happens on the LAN interface. All other interfaces continue working without any problem!
I'm desperate right now because I need to fix this somehow, and I can't figure out what it could be: a problem with pfSense, a hardware problem (faulty NIC), or something else!
Please, could someone help me with any tip? Thank you very much!
I was monitoring the system, and when the problem happens I only see a message like this, "CLOGJ|##", on both firewalls!
What could be happening?
Maybe you should consider dropping the cluster for now.
Just use one machine for a few days and gather more info.
While this is happening, can you access the GUI from WAN or DMZ?
Well, now everything is working again without any modification. The only difference is that only 10% of my users are working now; all the rest have gone for the weekend and turned off their PCs.
Could it be related to MAC address spoofing or a duplicated MAC address on my network?
I don't think so, but anything is possible.
What hardware are you using?
The problem sounds like the switch is hanging onto MACs in its CAM table past when CARP switches over, so while the master brings up its CARP IPs, the switch is still sending the traffic destined to their MACs to the backup (or vice versa for master/backup). Eventually that would resolve itself once the entry expired in the switch and it moved that MAC over to the correct port.
The symptoms are also somewhat similar to having a conflicting MAC address. The usual cause of that is having two CARP or VRRP IPs with the same VHID on the same network, as they'll share the same MAC address.
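To illustrate the VHID point: CARP and VRRP (IPv4) both derive the virtual MAC from the group ID as 00:00:5e:00:01:XX, where XX is the VHID in hex. Here is a rough Python sketch that computes that MAC and flags segments where two different clusters would end up with the same one. The inventory format and the cluster names are made up for the example; they are not taken from this network.

```python
from collections import defaultdict


def carp_virtual_mac(vhid: int) -> str:
    """Return the virtual MAC that CARP/VRRP (IPv4) uses for a given VHID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return f"00:00:5e:00:01:{vhid:02x}"


def find_vhid_conflicts(entries):
    """Find virtual MACs claimed by more than one cluster on a segment.

    entries: iterable of (segment, cluster, vhid) tuples (hypothetical format).
    The master and backup of ONE cluster legitimately share a VHID, so only
    different clusters on the same segment count as a conflict.
    """
    owners = defaultdict(set)
    for segment, cluster, vhid in entries:
        owners[(segment, carp_virtual_mac(vhid))].add(cluster)
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory: two clusters both using VHID 1 on "lan".
    inventory = [
        ("lan", "cluster-a", 1),
        ("lan", "cluster-b", 1),
        ("dmz", "cluster-a", 2),
    ]
    for (segment, mac), names in find_vhid_conflicts(inventory).items():
        print(f"{segment}: {mac} is shared by {', '.join(names)}")
```

If something like this reports a shared MAC on the LAN segment, changing the VHID on one of the two clusters would be the first thing to try.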
I've checked the configuration of all the CARP IPs and everything seems OK. We have another pfSense server plugged into the same switch, with an extra ADSL link.
This pfSense server has only an internet link and a LAN connection with a different IP address.
There are no CARP settings on this server because it's a standalone server. It has only 3 rules that allow 3 specific machines to browse the internet over this ADSL link.
And on the pfSense CARP pair (main firewall, master/backup) I have 3 rules that forward packets coming from these 3 PCs out to the internet via the standalone pfSense server.
To clarify:
CARP IP: 10.48.3.254 (main gateway for the whole network)
When packets for port TCP/80 come from 10.48.3.150, 10.48.3.146, or 10.48.3.179, the main firewall routes them to the standalone firewall.
On the other side of the main firewall we have two Cisco routers in load-balance and failover, with the same scheme I think (2 specific IPs and 1 virtual IP for both routers). I have never had any problem in the internet segment, nor in the DMZ, only in my LAN segment, where I have just the 1 CARP IP mentioned above.
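To make the port 80 rule above concrete, here is a rough Python sketch of the matching logic only. It is not pfSense rule syntax, and the labels "standalone-adsl" and "default" are placeholders I made up for the two paths; the three source addresses are the ones listed above.

```python
import ipaddress

# The three LAN machines whose web traffic goes out via the standalone firewall.
POLICY_SOURCES = {
    ipaddress.ip_address(addr)
    for addr in ("10.48.3.150", "10.48.3.146", "10.48.3.179")
}


def choose_gateway(src_ip: str, protocol: str, dst_port: int) -> str:
    """Return a placeholder label for the path a packet would take."""
    is_web = protocol.lower() == "tcp" and dst_port == 80
    if is_web and ipaddress.ip_address(src_ip) in POLICY_SOURCES:
        return "standalone-adsl"  # hypothetical label for the ADSL pfSense box
    return "default"  # hypothetical label for the normal path


if __name__ == "__main__":
    print(choose_gateway("10.48.3.150", "tcp", 80))
    print(choose_gateway("10.48.3.150", "tcp", 443))
    print(choose_gateway("10.48.3.10", "tcp", 80))
```

Everything else from those three machines, and all traffic from the rest of the LAN, stays on the normal path through the main firewall.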