Whole environment has become slow after introducing HA

nfern

Hi,
I'm new to pfSense.
I have introduced a second pfSense router and configured HA with a CARP IP of 10.x.x.189.
The individual interface IPs are 10.x.x.196 and 10.x.x.197.
I have seen a lot of ping loss to all servers.
When tracing the path between servers, I noticed that traffic is flowing via the secondary device (10.x.x.197) instead of the primary device.
Master and Backup are shown correctly under CARP failover.
Can someone help? Thank you

mjh_ca

Welcome. pfSense HA is great, but it can be very sensitive to configuration errors.

When you say "ping loss to all servers", do you mean pinging your internal servers from externally? Internally to internally? Internal to external?

Are you using CARP VIPs on your WAN as well as LAN? Or only on LAN side? Need more information about your setup. The IPs you list are private IPs (i.e. LAN only) so I'm unclear how you have this configured.

You say "whole environment has become slow". Do you mean internally (i.e. pinging server to server on the same subnet)? HA shouldn't affect pinging between servers on the same subnet at all, unless pfSense is routing between VLANs or something similar. If you're seeing this slowness internally, then you have some other issue; possibly you have introduced a loop on your switches and don't have STP/RSTP enabled.

Assuming it isn't a networking issue (i.e. packet loss is only when traversing pfSense nodes) then it could be a misconfiguration of NAT or some other routing issue.

Are you sure you properly set up Manual Outbound NAT and checked that XMLRPC sync is happening properly between the primary and secondary nodes? Check Status > System Logs: do you see any unusual messages about CARP?

Double check the HA Troubleshooting docs and the links at the bottom of that page:
https://docs.netgate.com/pfsense/en/latest/highavailability/troubleshooting-high-availability-clusters.html

Lots of small errors can cause weird behavior, such as slight configuration differences between the primary and secondary nodes (e.g. an incorrect subnet mask).
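As an illustration of the subnet mask check, here is a minimal Python sketch that compares an interface address with its CARP VIP. It assumes the usual guidance that the VIP's mask should match the interface's mask and that the VIP should sit inside the interface's subnet. The function name `check_vip` and the 192.0.2.x addresses are hypothetical placeholders, not values from this thread.

```python
import ipaddress


def check_vip(interface_cidr, vip_cidr):
    """Return a list of problems between an interface address and its CARP VIP."""
    iface = ipaddress.ip_interface(interface_cidr)
    vip = ipaddress.ip_interface(vip_cidr)
    problems = []

    # The VIP is expected to use the same prefix length as the interface.
    if iface.network.prefixlen != vip.network.prefixlen:
        problems.append(
            f"prefix length differs: interface /{iface.network.prefixlen}, "
            f"VIP /{vip.network.prefixlen}"
        )

    # The VIP address itself is expected to fall inside the interface's subnet.
    if vip.ip not in iface.network:
        problems.append(f"VIP {vip.ip} is outside interface subnet {iface.network}")

    return problems


# Hypothetical addresses (documentation range), one pair per check.
print(check_vip("192.0.2.196/24", "192.0.2.189/24"))  # matching mask, same subnet
print(check_vip("192.0.2.196/24", "192.0.2.189/32"))  # mask mismatch
```

Running the same check against both nodes, with the values copied from each node's interface and Virtual IP pages, is a quick way to spot a mask that was typed differently on one side.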

What is your upstream provider? Certain providers' equipment (cable modems in particular) will freak out and block packets when it sees multiple MACs involved with the same IPs, and that simply can't be fixed unless the provider allows that configuration. Even if they issue you a public /29, which should be enough for CARP, you will still get high packet loss once their equipment sees multiple MACs and starts blocking.
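To make the "multiple MACs" point concrete: CARP uses a virtual MAC for the shared IP, derived from the VHID in the same 00:00:5e:00:01:xx range that VRRP uses, in addition to each node's own NIC MAC. The sketch below only shows that derivation. The function name, the VHID, and the node MACs are made-up values for illustration.

```python
def carp_virtual_mac(vhid):
    """Return the CARP virtual MAC address for a given VHID (1-255)."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return f"00:00:5e:00:01:{vhid:02x}"


# Hypothetical NIC MACs for the two nodes (locally administered, made up).
node_macs = {"primary": "02:00:00:00:00:96", "secondary": "02:00:00:00:00:97"}
vhid = 1  # hypothetical VHID

# Upstream equipment can end up seeing all of these on one port.
macs_seen_upstream = sorted({carp_virtual_mac(vhid), *node_macs.values()})
print(macs_seen_upstream)
```

The same derivation also shows why two CARP or VRRP groups with the same VHID on one broadcast domain collide: they would claim the same virtual MAC.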

Re-read the configuration guides carefully to make sure nothing was missed:
https://docs.netgate.com/pfsense/en/latest/highavailability/configuring-high-availability.html
https://docs.netgate.com/pfsense/en/latest/book/highavailability/index.html

nfern

> When you say "ping loss to all servers", do you mean pinging your internal servers from externally? Internally to internally? Internal to external?

I am using the WAN interface to route internally between VLANs.

> Are you using CARP VIPs on your WAN as well as LAN? Or only on LAN side? Need more information about your setup. The IPs you list are private IPs (i.e. LAN only) so I'm unclear how you have this configured.

CARP is configured on WAN and LAN, and I have alias IPs configured for each of the VLANs. All IPs are private, and nothing goes to the internet through this setup.

> You say "whole environment has become slow". Do you mean internally (i.e. pinging server to server on the same subnet)? HA shouldn't affect pinging between servers on the same subnet at all, unless pfSense is routing between VLANs or something similar. If you're seeing this slowness internally, then you have some other issue; possibly you have introduced a loop on your switches and don't have STP/RSTP enabled.

Before I added the second box for HA, there were no slowness issues. As soon as I introduced HA, I saw ping loss.

> Assuming it isn't a networking issue (i.e. packet loss is only when traversing pfSense nodes) then it could be a misconfiguration of NAT or some other routing issue.

I do not have NAT configured.

I have checked the configuration on both devices, and they match.
I also noticed that although the primary device shows "master" and the secondary device shows "backup", a trace between clients goes out via the primary and returns via the secondary, which makes me think that they are both acting as master.
The advertising frequency settings (base and skew) are also correct: skew is 0 on the master and 100 on the backup.
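For reference, the base and skew values feed into CARP timing roughly as sketched below. This assumes the rule described in FreeBSD's carp(4): skew is counted in 1/256ths of a second and added to the base, and the node that advertises most frequently is preferred as master. The base of 1 second is an assumption (only the skew values are given above), the function names are hypothetical, and the backup timeout formula reflects my understanding of the FreeBSD implementation rather than anything confirmed in this thread.

```python
def advertisement_interval(advbase, advskew):
    """Seconds between CARP advertisements: base plus skew/256."""
    return advbase + advskew / 256


def master_down_timeout(advbase, advskew):
    """Seconds a backup waits without hearing the master before taking over
    (assumed formula: 3 * base + skew/256)."""
    return 3 * advbase + advskew / 256


# Assumed base of 1 second; skew values as reported (0 and 100).
nodes = {"primary": (1, 0), "secondary": (1, 100)}

for name, (base, skew) in nodes.items():
    print(name, advertisement_interval(base, skew))

# The node with the shortest interval is the one expected to be master.
expected_master = min(nodes, key=lambda n: advertisement_interval(*nodes[n]))
print("expected master:", expected_master)
print("secondary takes over after", master_down_timeout(1, 100), "seconds of silence")
```

With these numbers the primary advertises every 1.0 s and the secondary would advertise every 1.390625 s, so the settings themselves favor the primary. If the secondary still acts as master, one possibility worth checking is that it is not receiving the primary's advertisements at all.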

nfern

@mjh_ca
I rebooted the primary device, and when it came back up there was no more ping loss or slowness in the environment. However, after about 6 hours I'm back to square one: the same slowness, and tracert shows traffic passing through the secondary device instead of the primary.

Any advice would be great at this point.
Thank you

mjh_ca

Lots of possibilities. I would simplify the configuration down to the basics and see if you can get it working.

Key suspects for me would be a network structure issue (have you accidentally introduced a loop, so that STP/RSTP is kicking in and disabling ports, causing weird routing? Do you have a bad cable, or ports that are auto-negotiating at the wrong speeds?) or a CARP or Virtual IP configuration step you missed (wrong netmask on a virtual IP? CARP left in temporary maintenance mode?).

Check the pfSense system logs. Check that one node is MASTER on both WAN and LAN (if they are split, then of course you have routing issues) and that the other is BACKUP on both. Also check your switch port status and logs to see if they give you any hints...
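The MASTER/BACKUP check above can be sketched as a small Python helper. It assumes you copy the per-interface CARP status of each node by hand (for example from the CARP status page) into a dictionary. The function name and the sample data are hypothetical.

```python
def find_carp_problems(status):
    """status maps node name -> {interface: "MASTER" or "BACKUP"}.

    Returns a list of human-readable problems (an empty list means consistent).
    """
    problems = []

    # 1. Each node should hold the same role on every interface.
    for node, roles in status.items():
        if len(set(roles.values())) > 1:
            problems.append(f"{node} has mixed roles: {roles}")

    # 2. Each interface should have exactly one MASTER across all nodes.
    interfaces = {iface for roles in status.values() for iface in roles}
    for iface in sorted(interfaces):
        masters = [n for n, roles in status.items() if roles.get(iface) == "MASTER"]
        if len(masters) != 1:
            problems.append(f"{iface} has {len(masters)} MASTER nodes: {masters}")

    return problems


# Hypothetical example: roles are split between the two nodes.
split = {
    "primary": {"WAN": "MASTER", "LAN": "BACKUP"},
    "secondary": {"WAN": "BACKUP", "LAN": "MASTER"},
}
for problem in find_carp_problems(split):
    print(problem)
```

It is worth recording the status again while the slowness is actually happening, since the problem here reappears hours after a reboot and both nodes claiming MASTER at that moment would point to CARP advertisements not getting through between them.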