CARP broken including latest version 2.1.5

secgeek

My two i386 pfSense Firewalls won't work under the standard CARP configuration.
Config is WAN, LAN and pfSync interfaces in both boxes - nothing more complicated than what is shown in the examples.
All three interfaces use a /24 private network and are on different switches except for the pfSync which uses a crossover cable.
All interfaces seem to work fine.
I can't locate any clear error messages except for the following on the primary: "A communication error occurred while attempting XMLRPC sync with username admin https://x.x.x.x:443"

I tried switching between the admin account and a user named rsync, both of which have full webConfigurator rights on both machines and identical passwords, but this does not resolve the problem.

The most interesting thing is that if the backup is rebooted, it works fine for a couple of minutes (it used to work fine for about 15 minutes with 2.1.4, before upgrading to 2.1.5). Then only the WAN interface flips from backup to master, and communication breaks down.

Based on what I can find on the forums, 2.1.5 included a patch from June that broke CARP when Virtual IPs were included. However, 2.1.5 did not solve my problem. I therefore removed all Virtual IPs to simplify the setup, but it's still not working.

I suspect that something is not being properly set between the GUI and the underlying configuration. However, I'm not an expert on pfSense and could use some thoughts on this.

Thank you

secgeek

some additional information

It appears that CARP stays up for about an hour, and during that time all failover testing works correctly. Video streams fine with no interruptions, although a Citrix session did stop, but I was able to reconnect. Failback, when you reconnect the primary, is somewhat slower than failing over to the backup, but my primary is a P4 versus a P5 on the backup, so I can live with this.

I'm getting the following errors though

Aug 29 09:01:04 php: /xmlrpc.php: Resyncing OpenVPN instances.
Aug 29 09:01:06 php: rc.filter_synchronize: Config sync not being done because of missing sync IP (this is normal on secondary systems).
Aug 29 09:01:07 lighttpd[25631]: (network_openssl.c.118) SSL: 5 -1 1 Operation not permitted
Aug 29 09:01:07 lighttpd[25631]: (connections.c.619) connection closed: write failed on fd 14
Aug 29 09:01:07 lighttpd[25631]: (connections.c.1692) SSL (error): 5 -1 1 Operation not permitted

secgeek

Additional Information - sorry got interrupted by phone call and didn't add the following

It seems that the lighttpd errors are related to the following lighttpd bug #1461, which seems to be old but re-appeared in FreeBSD 9.2
including lighttpd-1.4.35 and php5-5.4.26

Since my systems are on pfSense 2.1.5 (latest), the corresponding software versions are
php5-5.3.28
lighttpd-1.4.35

http://redmine.lighttpd.net/issues/1461

So do I have an issue with lighttpd?

BBMitch

I'm seeing something similar - same version - so far I haven't seen a pattern and I can't see the error logs you mention, but I'm seeing similar symptoms. I seem to have days of trouble free operation though (admittedly I'm not on 24x7) but I've seen it fix itself.

Just now I "fixed" it by disabling CARP on the primary, forcing a transfer.

From the outside, port forwards etc. seem inoperable - active tcp sockets time out - new ones can not be created.

From the inside, even pinging the gateway beyond the pfsense cluster fails.

As soon as I disable the primary though everything works, and on re-enabling it, everything still works - until next time.

This thread seems dead though - did you find a solution elsewhere? Thanks!!!

m

secgeek

I solved my problem by moving the pfSense setup to VMware; when the systems are virtualized everything is A-OK. I do have an issue with clients that need static IPs. If I use the CARP LAN VIP, nothing works; if I set the static gateway and DNS to the master's LAN IP, everything is fine. If I set the clients to DHCP from pfSense, everything works as well, as long as I don't set the DHCP parameters to the LAN VIP. Does this make any sense? If yes, please explain.

If I'm using the latest stable version of pfSense and running into issues, why do so many other folks insist on using older versions, and sometimes not even identical older versions?

Thanks

cmb

The lighttpd errors have no relation (and are safe to ignore). The original issue was a complete inability to reach the port where the web interface is running on the secondary, either a general lack of network connectivity, or missing firewall rules to allow it.
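A quick way to test the reachability point above is to try a plain TCP connection from the primary to the web interface port on the secondary. The following is only a minimal sketch: the function name `can_reach` and the address in the usage comment are hypothetical placeholders, and port 443 is assumed from the URL in the error message. A failure here points at connectivity or firewall rules rather than at lighttpd or the sync credentials.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake only; it does not
        # speak TLS or HTTP, so it isolates basic reachability.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Hypothetical usage, run from the primary against the secondary's sync IP:
#   print(can_reach("192.0.2.10", 443))
```

If this returns False for the secondary's address, XMLRPC sync cannot work regardless of which account is used.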

@secgeek:

I solved my problem by moving the pfSense setup to VMware; when the systems are virtualized everything is A-OK. I do have an issue with clients that need static IPs. If I use the CARP LAN VIP, nothing works; if I set the static gateway and DNS to the master's LAN IP, everything is fine. If I set the clients to DHCP from pfSense, everything works as well, as long as I don't set the DHCP parameters to the LAN VIP. Does this make any sense? If yes, please explain.

Your VMware networking config is incorrect; the virtual CARP MAC can't reach the VMs.
https://doc.pfsense.org/index.php/CARP_Configuration_Troubleshooting#VMware_ESX.2FESXi_Users

@secgeek:

If I'm using the latest stable version of pfSense and running into issues, why do so many other folks insist on using older versions, and sometimes not even identical older versions?

No one with a clue insists on using older versions, and especially not differing older versions. Anyone who thinks that's a good idea has absolutely no idea what they're talking about.

BBMitch

Hey Chris - agreed - httpd has nothing to do with the issue from what I can see - my own occurrence of this issue is using pcengines hardware - no vm / hypervisor of any kind.

There seem to be a number of people dancing around the same problem.

The issue only seems to affect the primary carp IP.

I've got other services on load balancers on alias IP's that are not affected.

Forcing carp to fail over or fail back corrects the issue.

It's hard to "reproduce" though, as it seems to vary from hours to days before it occurs, and can "fix" itself if left alone long enough - at least I think that happened - this was before I figured out how to force it to fix.

If there's some way I can help track the issue, please let me know.

Thanks!

Mitch

cmb

Where forcing CARP to fail over and fail back fixes anything, the problem is one of two things, neither of which is on the firewalls: a problem with the IP or with the virtual MAC. That means an IP conflict on the affected IP, or a MAC conflict from using the same VHID on a separate pair or having the same VRRP VRID in use on the same broadcast domain, among other possibilities at layer 2 or 3. Forcing it back and forth updates switch CAM tables, and potentially other devices' ARP tables, depending on the circumstance. Your issue is significantly different from OP's; starting your own thread with specifics would be best.
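To illustrate why a reused VHID or VRID causes a MAC conflict: CARP and IPv4 VRRP both derive the virtual MAC from the group ID as 00:00:5e:00:01:XX, where XX is the ID in hex. The sketch below shows that mapping and flags groups that would collide on one broadcast domain. The function names and the group labels in the usage note are hypothetical, made up for illustration only.

```python
from collections import defaultdict


def carp_virtual_mac(vhid: int) -> str:
    """Return the virtual MAC that CARP (and IPv4 VRRP) derives from an ID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID/VRID must be between 1 and 255")
    return f"00:00:5e:00:01:{vhid:02x}"


def find_id_conflicts(groups):
    """Map each colliding virtual MAC to the group names that share it.

    groups: iterable of (name, vhid) pairs assumed to live on the same
    broadcast domain. Names are free-form labels chosen by the caller.
    """
    by_mac = defaultdict(list)
    for name, vhid in groups:
        by_mac[carp_virtual_mac(vhid)].append(name)
    # Keep only MACs claimed by more than one group.
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}
```

For example, `find_id_conflicts([("prod-wan", 1), ("old-test", 1), ("lan", 2)])` reports that the hypothetical groups `prod-wan` and `old-test` share 00:00:5e:00:01:01, which is the kind of overlap a forgotten second pair on the same segment would produce.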

BBMitch

You are right - it was my own issue. I hadn't taken an old test system offline. It was not working properly and only connected to the network intermittently causing the issue. I hate it when people don't post their answers to problems so even though I'm "late" I'm hoping that's better than "never".
Thanks again!
m