Intermittent interface blips leading to brief CARP failovers
-
Thanks for your response. It's nice to have a few suggestions and pointers, and I think it's helped already.
@KOM:
Suggestion #1 would be to solve your VPN issue and upgrade to current. Staying on older, buggy versions is bad news in the long run.
Of course, but unfortunately it isn't that straightforward. The next version up of pfSense uses a completely different VPN backend, which is why different configuration is needed. These machines are in production, so first I need to build a test environment and run some suitable tests on the new versions, which is logistically quite complex. Then I need to schedule downtime to do the upgrades, get the VPNs up, test them quickly, be prepared for rollback, etc. It's not just a case of upgrading and then fiddling until it works.
@KOM:
Anything in your Interface Stats (Status - Interfaces) with regard to errors or collisions?
Yes. I hadn't thought to look there, so that's quite enlightening. The primary firewall has 0/8 in/out errors for the WAN interface and 0/127 in/out errors for the LAN interface. The dedicated CARP interface has no errors. None of the interfaces have collisions. There are no errors or collisions on the secondary firewall.
I did some further searching to learn about in/out errors, which led to reports of other people having similar problems. There don't seem to be any easy solutions to this one. It suggests that perhaps the NICs are having some issues, so maybe I need to consider hardware upgrades to machines with more robust NICs.
@KOM:
Anything in Status - System Logs - Gateways?
Getting a few "apinger: ALARM: GW_WAN(1.2.3.4) *** down ***" errors, immediately followed by "apinger: alarm canceled: GW_WAN(1.2.3.4) *** down ***". I guess this is essentially the same problem and perhaps corresponds to the in/out errors on the LAN interface.
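To make those gateway messages easier to line up against other events, they can be pulled apart with a small parser. This is only a sketch: the pattern is built from the two messages quoted above, the function name parse_apinger is made up for illustration, and the timestamp prefix that real log lines carry is not shown in the post, so it is not parsed here.

```python
import re

# Pattern derived from the two apinger messages quoted in the post.
# Assumption: every alarm line follows this same shape.
APINGER_RE = re.compile(
    r"apinger: (?P<kind>ALARM|alarm canceled): "
    r"(?P<gateway>[^\s(]+)\((?P<address>[^)]*)\) "
    r"\*\*\* (?P<state>\w+) \*\*\*"
)


def parse_apinger(line):
    """Return the fields of an apinger alarm line, or None if it does not match."""
    match = APINGER_RE.search(line)
    if match is None:
        return None
    return {
        # True when the alarm is raised, False when it is canceled.
        "raised": match.group("kind") == "ALARM",
        "gateway": match.group("gateway"),
        "address": match.group("address"),
        "state": match.group("state"),
    }
```

An alarm followed straight away by its cancellation for the same gateway would then show up as a raised/canceled pair, which is the brief blip described above.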
@KOM:
Anything in Status - RRD Graphs - Quality?
Not too sure what I should be looking at in there really? Packet loss? Average packet loss over the last three months is 0.1%.
-
Now that you know where to look, I would check again after you have detected the latest failover. See if there is any correlation between the time it starts flapping and other network quality events. You could try disabling the gateway monitoring via System - Routing - Gateway - (edit gateway) - Disable Gateway Monitoring.
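One way to do that correlation, once the times have been noted down, is to pair each failover with any other event seen close to it. This is a minimal sketch, not a pfSense feature: the function name correlate, the 30-second window and the way the timestamps are collected are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def correlate(failovers, events, window=timedelta(seconds=30)):
    """Map each failover time to the events that fall within `window` of it."""
    matches = {}
    for failover in failovers:
        # Keep events on either side of the failover, since the logged
        # times of the two may not be in a fixed order.
        matches[failover] = [e for e in events if abs(e - failover) <= window]
    return matches
```

A failover that maps to an empty list had nothing else logged near it; one that repeatedly maps to the same kind of event points at where to look next.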
-
dedicated Ethernet port for direct CARP connection between firewalls.
Don't confuse CARP with pfsync.
CARP should happen locally on your switches and has nothing to do with gateway up or down status. You need solid layer 2 between the interfaces in the failover group (those sharing the CARP VIP).
The sync interface has nothing to do with which node is master or backup for any particular CARP VIP.
-
Don't confuse CARP with pfsync.
I wasn't, I was just using incorrect/misleading terminology in that bit of my description of our setup, in my haste to get the post written so I could ask for help. Apologies for any confusion.
Did you have any thoughts or suggestions regarding these issues I've described?
-
Figure out why you're dropping/delaying CARP packets between your interfaces.
-
@KOM:
Now that you know where to look, I would check again after you have detected the latest failover. See if there is any correlation between the time it starts flapping and other network quality events.
It looks as if there's been one failover event which has caused the number of "out" errors to increase. Next time I'll hopefully check it in real time.
What I'm not sure about, though, is where I can go from there? If I know that the "out" errors are linked to the failover events, how does that knowledge benefit me and what can I do about it?
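For checking this closer to real time, the counters from Status - Interfaces can be noted before and after a failover and compared. The sketch below only does the comparison; how the snapshots are collected is left open, and the function name counter_deltas and the (interface, counter) key layout are assumptions for illustration. The starting figures are the ones reported earlier in the thread.

```python
def counter_deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    deltas = {}
    for key, value in after.items():
        # A counter missing from the first snapshot is treated as zero.
        change = value - before.get(key, 0)
        if change != 0:
            deltas[key] = change
    return deltas


# Snapshot keys are (interface, counter) pairs; values are the error counts.
before = {("WAN", "out_errors"): 8, ("LAN", "out_errors"): 127}
after = {("WAN", "out_errors"): 8, ("LAN", "out_errors"): 128}
changed = counter_deltas(before, after)
```

Here changed would hold a single entry for the LAN "out" errors, which narrows the question down to one interface and one direction.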
-
So, further to the above, a failover event just occurred and the number of "out" errors increased by 1.
So, what further investigation can I do to find ways of resolving this problem? There's nothing further in the logs and nothing in any console or kernel output that I can find when logging in via SSH. I'm a bit stuck for ideas really!
-
This is all in your layer 2 switching, dude, not pfSense. CARP will work with or without a gateway on the interface. See also your CARP on your LAN interface (no gateway).
What kind of switch are you using? How is it configured? Are the ports taking errors?
-
This is all in your layer 2 switching, dude, not pfSense.
How have you come to this conclusion?
CARP will work with or without a gateway on the interface. See also your CARP on your LAN interface (no gateway).
The problem is with the LAN interface, as I explained in my original post. There is indeed no gateway on the LAN interface. I only mentioned gateways in response to KOM who suggested that I should look in Status - System Logs - Gateways and report what was in there.
What kind of switch are you using? How is it configured? Are the ports taking errors?
2 x Cisco WS-C2960S switches for redundancy. One pfSense firewall goes into one switch, the other firewall into the other switch. Each server has NIC bonding configured, with one NIC going into one switch and the other NIC going into the other switch. So the whole infrastructure is completely redundant.
Everything's working fine. There are no apparent errors on the switches. There are no NIC-related errors on the servers. The only issue is the intermittent, brief CARP failover on pfSense on the LAN interface, which seems to correspond to the "out" errors incrementing on the LAN interface.
-
Everything's working fine.
Look again. Your CARP is failing.
Try new cables. Try Intel NICs - Realtek sucks.
-
Look again. Your CARP is failing.
That's what this entire thread is about and why I posted the question originally. Not sure what your point is.
Try new cables. Try Intel NICs - Realtek sucks.
Thanks for the suggestions. Earlier in the thread I said "it suggests that perhaps the NICs are having some issues, so maybe I need to consider hardware upgrades to machines with more robust NICs" so it seems you're potentially confirming my suspicions.
-
For the benefit of anyone reading this with similar problems in future: I replaced the Mini-ITX firewalls with new pfSense SG appliances and the NIC/CARP errors went away. I therefore conclude that the Realtek NICs in the old hardware weren't up to the job.
-
This can be added to the growing list of "Realtek sucks" threads.
I have had zero problems with a pair of APUs, however.