2X Dell R515 servers and 2.1-RC0 CARP

wabashky

We have two Dell R515 servers running:

2.1-RC0 (amd64)
built on Mon Jul 22 03:26:44 EDT 2013
FreeBSD 8.3-RELEASE-p8

We have two separate WANs: a 50/50 fiber pipe and a 15/2 cable connection for failover. We are using CARP with no issues, and failover internet works fine as well. We have six virtual interfaces and some 60 IPsec tunnels to our remote office.

The problem is that we have intermittent downtime even though the box shows as 'up' (not one dropped packet externally). There are absolutely NO corresponding logs to explain the issue. We host multiple systems at different sites, and each period of downtime, though it lasts only seconds, causes the tunnels to drop, so every outage is affecting us.

We had an old Supermicro running 2.0.1 with no outages, but the box wasn't the most reliable (loud as could be and no hard drive space left) and it would slow down … no outages, but slowdowns. We figured bringing in these BOSS Dell servers as our routers would resolve that. :)

Any input is appreciated.

ssheikh

Do you lose internet connectivity as well during the "down time"?

wabashky

Yes we do, but only for as long as it takes a ping to recover. And then it's from our backup internet, not our primary.

wabashky

And like I said, NOTHING in logs …

ssheikh

So if it is switching to the backup internet connection then your default gateway is switching. Do you see evidence of that happening in the System | Gateways log?

Is the Monitor IP being used a reliable, pingable system, and have the thresholds been changed from their defaults?

Is CARP also by any chance failing over from Master to backup?

wabashky

I guess there is something in the logs :) I didn't know about this log.

Jul 30 21:37:03 apinger: ALARM: gw(199.68.254.225) *** down ***
Jul 30 21:37:03 apinger: ALARM: RRGW(67.53.57.105) *** down ***
Jul 30 21:37:22 apinger: alarm canceled: RRGW(67.53.57.105) *** down ***
Jul 30 21:37:22 apinger: alarm canceled: gw(199.68.254.225) *** down ***
Aug 1 15:24:13 apinger: Exiting on signal 15.
Aug 1 15:24:14 apinger: Starting Alarm Pinger, apinger(27599)
Aug 1 15:24:15 apinger: SIGUSR1 received, writting status.
Aug 1 15:46:04 apinger: ALARM: gw(199.68.254.225) *** down ***
Aug 1 15:46:14 apinger: alarm canceled: gw(199.68.254.225) *** down ***
Aug 1 15:46:51 apinger: ALARM: gw(199.68.254.225) *** down ***
Aug 1 15:47:00 apinger: alarm canceled: gw(199.68.254.225) *** down ***
Aug 1 18:40:23 apinger: ALARM: gw(199.68.254.225) *** down ***
Aug 1 18:40:25 apinger: alarm canceled: gw(199.68.254.225) *** down ***

The examples above are from the gateway logs. Roughly 10 seconds down, then up again.
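To put numbers on those outages, here is a minimal Python sketch that pairs each apinger ALARM line with its matching "alarm canceled" line and reports how long the gateway was marked down. The function name `outage_durations` and the regular expression are illustrative assumptions, not part of pfSense or apinger, and the year 2013 is assumed from the build date quoted earlier, because the log timestamps carry no year.

```python
import re
from datetime import datetime

# Log lines copied from the gateway log excerpt above.
LOG = """\
Jul 30 21:37:03 apinger: ALARM: gw(199.68.254.225) *** down ***
Jul 30 21:37:03 apinger: ALARM: RRGW(67.53.57.105) *** down ***
Jul 30 21:37:22 apinger: alarm canceled: RRGW(67.53.57.105) *** down ***
Jul 30 21:37:22 apinger: alarm canceled: gw(199.68.254.225) *** down ***
Aug 1 15:24:13 apinger: Exiting on signal 15.
Aug 1 15:46:04 apinger: ALARM: gw(199.68.254.225) *** down ***
Aug 1 15:46:14 apinger: alarm canceled: gw(199.68.254.225) *** down ***
Aug 1 15:46:51 apinger: ALARM: gw(199.68.254.225) *** down ***
Aug 1 15:47:00 apinger: alarm canceled: gw(199.68.254.225) *** down ***
Aug 1 18:40:23 apinger: ALARM: gw(199.68.254.225) *** down ***
Aug 1 18:40:25 apinger: alarm canceled: gw(199.68.254.225) *** down ***
"""

# Hypothetical pattern written for the excerpt above; other apinger
# versions may format their messages differently.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d) apinger: "
    r"(?P<kind>ALARM|alarm canceled): "
    r"(?P<gw>[^\s(]+)\((?P<ip>[\d.]+)\) \*\*\* down \*\*\*$"
)


def outage_durations(log_text, year=2013):
    """Return (gateway, start, seconds_down) for each completed alarm."""
    open_alarms = {}  # gateway name -> time the alarm was raised
    outages = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip start/stop/status lines
        stamp = " ".join(m["ts"].split())  # collapse padded day spacing
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        gw = m["gw"]
        if m["kind"] == "ALARM":
            open_alarms[gw] = ts
        elif gw in open_alarms:
            start = open_alarms.pop(gw)
            outages.append((gw, start, int((ts - start).total_seconds())))
    return outages


if __name__ == "__main__":
    for gw, start, seconds in outage_durations(LOG):
        print(f"{start:%b %d %H:%M:%S}  {gw:<5} down for {seconds}s")
```

Run against the excerpt, this shows both gateways down together for 19 seconds on Jul 30, and the primary gateway alone down for 10, 9 and 2 seconds on Aug 1.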

I assume the IPs we monitor are good. They are from our ISP.

CARP isn't failing over. We did force it (by unplugging) and that worked, so it does function correctly.

ssheikh

Check your Gateway monitors. Your pings to them are dying.

If this is because the link is saturated with traffic, then try relaxing your gateway monitoring thresholds.
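As a rough illustration of what relaxing a threshold does, the Python sketch below classifies one window of ping samples against a loss limit and a latency limit. This is a simplified model and not apinger's actual algorithm; the function `gateway_state`, the sample values, and both sets of threshold numbers are hypothetical.

```python
from statistics import mean


def gateway_state(samples, max_loss_pct, max_latency_ms):
    """Classify a window of probes as "up" or "down".

    samples: round-trip times in milliseconds, with None for a lost probe.
    """
    if not samples:
        raise ValueError("need at least one sample")
    lost = sum(1 for s in samples if s is None)
    loss_pct = 100.0 * lost / len(samples)
    replies = [s for s in samples if s is not None]
    # With no replies at all, treat latency as infinite.
    avg_latency = mean(replies) if replies else float("inf")
    if loss_pct > max_loss_pct or avg_latency > max_latency_ms:
        return "down"
    return "up"


# Hypothetical window: 3 of 10 probes lost, one slow reply.
window = [12.0, None, 15.0, None, None, 14.0, 13.0, 250.0, 12.0, 11.0]

if __name__ == "__main__":
    print("strict :", gateway_state(window, max_loss_pct=20, max_latency_ms=500))
    print("relaxed:", gateway_state(window, max_loss_pct=50, max_latency_ms=1000))
```

The same window is reported as down under the stricter limits and up under the relaxed ones. Relaxing thresholds changes only what gets flagged; it does not reduce the underlying packet loss.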

wabashky

Thanks. We will try that out over a couple of days and see if it works.

wabashky

That didn't work for us. We were still going down, just not being notified about it or logging it as often. BUT!

We found this doc on pfsense.org

http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

and (knock on wood) we have not had an outage at all today! Maybe we are good? We will monitor over the next week and I'll update if we find anything.

Thanks ssheikh