Intermittent interface blips leading to brief CARP failovers

dz-015

Hardware: Jetway JBC 373 mini-ITX system with Intel Atom D525, 2GB RAM, 4 x RealTek 8111 Ethernet ports.

Software: pfSense 2.1.5 (upgrading to a newer version broke our VPN connections so we had to quickly roll back).

Configuration: 2 x pfSense firewalls as above; WAN ports have separate IP addresses; LAN ports have a failover IP address using CARP; dedicated Ethernet port for direct CARP connection between firewalls.

The issue is that at seemingly random times the LAN port briefly changes its state to down, which causes CARP to fail over so that the floating LAN IP moves to the secondary machine. This lasts a couple of seconds, then the IP moves back to the primary.

I've scoured the logs but can't see anything happening which could be triggering this. No unusual traffic, no scheduled tasks in pfSense.

Since it's a brief blip it doesn't cause any noticeable problems, but it's a nuisance because it fills the logs and sends out lots of emails. More worrying is that I don't understand why it's happening: it may indicate a problem which could become more serious if we were to get more traffic on the network.

I've tried changing the Advertising Frequency for the CARP VIP on the secondary machine to see if that would at least reduce the number of annoying emails, but that doesn't seem to have worked. What I really want is to understand the problem with the Ethernet port so I know why this is happening.

dz-015

Any thoughts or suggestions at all would be very welcome!

KOM

Suggestion #1 would be to solve your VPN issue and upgrade to current. Staying on older, buggy versions is bad news in the long run.

Anything in your Interface Stats (Status - Interfaces) with regard to errors or collisions?

Anything in Status - System Logs - Gateways?

Anything in Status - RRD Graphs - Quality?

dz-015

Thanks for your response. It's nice to have a few suggestions and pointers, and I think it's helped already.

@KOM:

Suggestion #1 would be to solve your VPN issue and upgrade to current. Staying on older, buggy versions is bad news in the long run.

Of course, but unfortunately it isn't that straightforward. The next version up of pfSense uses a completely different VPN backend, which is why a different configuration is needed. These machines are in production, so first I need to build a test environment and run some suitable tests on the new versions, which is logistically quite complex. Then I need to schedule downtime to do the upgrades, get the VPNs up, test them quickly, be prepared for rollback, etc. It's not just a case of upgrading and then fiddling until it works.

@KOM:

Anything in your Interface Stats (Status - Interfaces) with regard to errors or collisions?

Yes. I hadn't thought to look there, so that's quite enlightening. The primary firewall has 0/8 in/out errors for the WAN interface and 0/127 in/out errors for the LAN interface. The dedicated CARP interface has no errors. None of the interfaces have collisions. There are no errors or collisions on the secondary firewall.

I did some further searching to learn about in/out errors, which led to reports of other people having similar problems. There don't seem to be any easy solutions to this one. It suggests that perhaps the NICs are having some issues, so maybe I need to consider hardware upgrades to machines with more robust NICs.

@KOM:

Anything in Status - System Logs - Gateways?

Getting a few "apinger: ALARM: GW_WAN(1.2.3.4) *** down ***" errors, immediately followed by "apinger: alarm canceled: GW_WAN(212.188.163.155) *** down ***". I guess this is essentially the same problem and perhaps corresponds to the in/out errors on the LAN interface.
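To check whether these gateway alarms line up with the number of CARP failovers, the alarm/cancel pairs can be counted from the log text. This is only a rough sketch in Python, run on any machine against a copy of the log. It assumes the messages look exactly like the two quoted above; the regex and the `count_flaps` helper are illustrative and not part of pfSense or apinger.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches the two message shapes quoted in this post (assumed format).
EVENT = re.compile(
    r"apinger: (ALARM|alarm canceled): (\w+)\([^)]*\) \*\*\* (\w+) \*\*\*"
)

def count_flaps(lines: Iterable[str]) -> Dict[str, int]:
    """Count ALARM -> 'alarm canceled' pairs per gateway name."""
    open_alarms = set()
    flaps: Counter = Counter()
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        kind, gateway, state = match.groups()
        key = (gateway, state)
        if kind == "ALARM":
            open_alarms.add(key)
        elif key in open_alarms:
            # A cancel that closes an open alarm is one complete flap.
            open_alarms.discard(key)
            flaps[gateway] += 1
    return dict(flaps)

# The two messages from the post, used here as sample input.
SAMPLE = [
    "apinger: ALARM: GW_WAN(1.2.3.4) *** down ***",
    "apinger: alarm canceled: GW_WAN(212.188.163.155) *** down ***",
]
print(count_flaps(SAMPLE))
```

Pairs are matched on the gateway name rather than the address in brackets, so the mismatched addresses in the sample don't matter.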

@KOM:

Anything in Status - RRD Graphs - Quality?

Not too sure what I should be looking at in there really? Packet loss? Average packet loss over the last three months is 0.1%.

KOM

Now that you know where to look, I would check again after you have detected the latest failover. See if there is any correlation between the time it starts flapping and other network quality events. You could try disabling the gateway monitoring via System - Routing - Gateway - (edit gateway) - Disable Gateway Monitoring.
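A minimal sketch of that correlation check, in Python: take two snapshots of the interface counters and report which interfaces gained "out" errors in between. It assumes the snapshots are saved output of `netstat -ibn` from the firewall and that the header contains an `Oerrs` column. The interface names (`re0`, `re1`), addresses, and most counter values in the sample are made up for illustration; only the 8 and 127 out-error counts come from this thread.

```python
from typing import Dict

def parse_oerrs(netstat_text: str) -> Dict[str, int]:
    """Map interface name -> output error count from netstat -ibn style text."""
    lines = [ln for ln in netstat_text.splitlines() if ln.strip()]
    if not lines or "Oerrs" not in lines[0].split():
        return {}
    header = lines[0].split()
    # The Address column can be blank on some rows, so count from the right.
    from_end = len(header) - header.index("Oerrs")
    counts: Dict[str, int] = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= from_end:
            continue
        value = fields[-from_end]
        # Per-address rows show "-" for error counters; keep the link row only.
        if value.isdigit():
            # A trailing "*" marks a down interface in some netstat versions.
            counts.setdefault(fields[0].rstrip("*"), int(value))
    return counts

def oerrs_increases(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return interfaces whose Oerrs grew between two snapshots, with the delta."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }

# Hypothetical sample data, not real output from the firewalls in this thread.
SAMPLE_BEFORE = """
Name    Mtu Network        Address            Ipkts Ierrs Idrop Ibytes Opkts Oerrs Obytes Coll
re0    1500 <Link#1>       00:00:00:00:00:01   5000     0     0 400000  6000     8 500000    0
re1    1500 <Link#2>       00:00:00:00:00:02   1000     0     0  50000  2000   127  90000    0
re1    1500 192.168.1.0/24 192.168.1.2          900     -     -  45000  1900     -  85000    -
"""
# Same snapshot with one extra out error on re1.
SAMPLE_AFTER = SAMPLE_BEFORE.replace(" 127 ", " 128 ")

print(oerrs_increases(parse_oerrs(SAMPLE_BEFORE), parse_oerrs(SAMPLE_AFTER)))
```

Taking a snapshot every few seconds and logging a timestamp whenever the result is non-empty would give times to compare against the CARP failover entries in the system log.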

Derelict

dedicated Ethernet port for direct CARP connection between firewalls.

Don't confuse CARP with pfsync.

CARP should happen locally on your switches and has nothing to do with gateway up or down status. You need solid layer 2 between the interfaces in the failover group (those sharing the CARP VIP).

The sync interface has nothing to do with which node is master or backup for any particular CARP VIP.

dz-015

@Derelict:

Don't confuse CARP with pfsync.

I wasn't, I was just using incorrect/misleading terminology in that bit of my description of our setup, in my haste to get the post written so I could ask for help. Apologies for any confusion.

Did you have any thoughts or suggestions regarding these issues I've described?

Derelict

Figure out why you're dropping/delaying CARP packets between your interfaces.

dz-015

@KOM:

Now that you know where to look, I would check again after you have detected the latest failover. See if there is any correlation between the time it starts flapping and other network quality events.

It looks as if there's been one failover event which has caused the number of "out" errors to increase. Next time I'll hopefully check it in real time.

What I'm not sure about, though, is where I can go from there? If I know that the "out" errors are linked to the failover events, how does that knowledge benefit me and what can I do about it?

dz-015

So, further to the above, a failover event just occurred and the number of "out" errors increased by 1.

So, what further investigation can I do to find ways of resolving this problem? There's nothing further in the logs and nothing in any console or kernel output that I can find when logging in via SSH. I'm a bit stuck for ideas really!

Derelict

This is all in your layer 2 switching, dude, not pfSense. CARP will work with or without a gateway on the interface. See also your CARP on your LAN interface (no gateway).

What kind of switch are you using? How is it configured? Are the ports taking errors?

dz-015

@Derelict:

This is all in your layer 2 switching, dude, not pfSense.

How have you come to this conclusion?

@Derelict:

CARP will work with or without a gateway on the interface. See also your CARP on your LAN interface (no gateway).

The problem is with the LAN interface, as I explained in my original post. There is indeed no gateway on the LAN interface. I only mentioned gateways in response to KOM who suggested that I should look in Status - System Logs - Gateways and report what was in there.

@Derelict:

What kind of switch are you using? How is it configured? Are the ports taking errors?

2 x Cisco WS-C2960S switches for redundancy. One pfSense firewall goes into one switch, the other firewall into the other switch. Each server has NIC bonding configured, with one NIC going into one switch and the other NIC going into the other switch. So the whole infrastructure is completely redundant.

Everything's working fine. There are no apparent errors on the switches. There are no NIC-related errors on the servers. The only issue is the intermittent, brief CARP failover on pfSense on the LAN interface, which seems to correspond to the "out" errors incrementing on the LAN interface.

Derelict

Everything's working fine.

Look again. Your CARP is failing.

Try new cables. Try Intel NICs - Realtek sucks.

dz-015

@Derelict:

Look again. Your CARP is failing.

That's what this entire thread is about and why I posted the question originally. Not sure what your point is.

@Derelict:

Try new cables. Try Intel NICs - Realtek sucks.

Thanks for the suggestions. Earlier in the thread I said "it suggests that perhaps the NICs are having some issues, so maybe I need to consider hardware upgrades to machines with more robust NICs" so it seems you're potentially confirming my suspicions.

dz-015

For the benefit of anyone reading this with similar problems in future: I replaced the Mini-ITX firewalls with new pfSense SG appliances and the NIC/CARP errors went away. I therefore conclude that the RealTek NICs in the old hardware weren't up to the job.

Derelict

This can be added to the growing list of "Realtek sucks" threads.

I have had zero problems with a pair of APUs, however.