VMware VM, 2x vCPU, 1024MB RAM
em0 - LAN
em1 - WAN1 - 5 static IPs
em2 - WAN2 - 1 static IP
Gateways configured for failover - WAN1's gateway is the default gateway, and the WAN2 link serves as a backup.
Everything works great for a good period of time, approximately 2 weeks or so. Failover and failback work without issue. I have the latency and packet loss thresholds calibrated pretty well; I rarely lose connectivity on either link unless there's a real upstream issue.
The problem: About once every couple weeks, apinger will start consistently marking WAN1 as down, and record heavy packet loss in the system log and in the RRD quality graph. Apparently pings to the monitor IP are failing. The first time this happened, testing conducted by both myself and my ISP revealed no issue; with a host plugged into the provider's CPE directly, I was able to ping the monitor IP with zero packet loss while pfSense was reporting packet loss upwards of 40% pinging the same IP. Rebooting my provider's CPE did not fix the problem, and pfSense being the last piece of the puzzle, I rebooted it. Doing so alleviated the issue.
The next time it happened, for grins I removed the monitor IP (which is upstream of the CPE and gateway) from the WAN1 GW, and then added it back. apinger restarted, and pfSense/apinger immediately reported that the monitor IP had become responsive. No other changes were made.
There is no correlation between this issue and high traffic or link failover, nor is there any other pattern I've been able to discern. Interestingly, apinger on the WAN2 link never appears to suffer the same issue.
To me, this pretty clearly indicates an issue with apinger. I've searched the forums and read about other similar issues with apinger, but I haven't seen quite exactly the same issue I'm seeing. I've read that other odd problems with apinger have been resolved with a cron job to restart it, but not having much depth of knowledge with BSD, I was wondering if somebody could point me in the right direction for finding a little guide on how to do this.
Also, please let me know if there are other potential fixes for this issue and whether or not I should also post this elsewhere.
Is the ISP a cable provider, and are you using a cable modem for that WAN connection? Is it the same make, model, and revision as the one on the WAN2 connection?
No, they're completely different. WAN1 is business-class cable and WAN2 is wireless.
But that doesn't explain the behavior I've seen: pfSense (apinger) will report 100% packet loss, and if I ping the same monitor IP from outside pfSense at the same time (originating from the same source IP as far as the ISP can tell), I get 0 loss. And then restarting apinger immediately causes pfSense to report 0% packet loss again, where it basically remains for weeks on end.
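For anyone who wants to repeat that side-by-side comparison without eyeballing two terminals, here is a minimal sketch in Python. It assumes it runs on a separate host behind the same WAN (not necessarily on pfSense itself), that the system `ping` accepts `-c`, and that you copy apinger's reported loss by hand from the gateway status page. The function names and the 30-point margin are my own illustrative choices, not anything from pfSense or apinger.

```python
import re
import subprocess

# Matches the summary line printed by both FreeBSD and Linux ping,
# e.g. "10 packets transmitted, 10 packets received, 0.0% packet loss".
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(ping_output):
    """Return the packet-loss percentage from ping output, or None if no summary is found."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def measure_loss(monitor_ip, count=10):
    """Ping the monitor IP with the system ping and return the measured loss percentage."""
    result = subprocess.run(
        ["ping", "-c", str(count), monitor_ip],
        capture_output=True,
        text=True,
    )
    return parse_loss(result.stdout)


def apinger_looks_stuck(reported_loss, independent_loss, margin=30.0):
    """Flag the case where apinger reports far more loss than an independent ping sees.

    `margin` is a hypothetical threshold in percentage points; tune it to taste.
    """
    if independent_loss is None:
        return False  # no independent measurement, so no conclusion
    return reported_loss - independent_loss >= margin
```

Calling `apinger_looks_stuck(100.0, measure_loss(monitor_ip))` with your own monitor address would return True in the situation described above, which points at apinger rather than the link.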
I occasionally have the same issue and it's with business cable. I have a block of 5 static IPs and I use Google (126.96.36.199) as my monitor endpoint.
I saw a thread on here somewhere that I have yet to find again, in which someone mentions an odd behavior with cable modems. If I remember correctly, when the connection to the cable modem (coax) goes down, the modem self-assigns a 192.168.100.x address. For some reason the cable modem won't release its self-assigned IP address even though apinger is pinging the crap out of the destination endpoint. Either dropping the connection in pfSense or cycling the cable modem resolves the issue. My description may not be exactly accurate, but those are the fragments I remember. I'll have to search for it.
However, it only happens with cable modems, and that's why I asked if you were using a cable modem. All of my research (which is somewhat limited) leads to the cable box.
That is interesting, since my setup is the same, and that sounds kind of like what I have going on. What I just can't square with that, though, is that just flip-flopping the monitor IP (which also restarts apinger) resolves the issue.
I do find some of the log entries upon changing the monitor IP interesting (10.x.x.x is cable gateway, 96.x.x.x is cable upstream router, 192.168.x.x is wireless gateway, 71.x.x.x is wireless upstream router):
Jun 5 14:35:23 check_reload_status: Syncing firewall
Jun 5 14:35:24 php: /system_gateways.php: ROUTING: setting default route to 10.x.x.x
Jun 5 14:35:24 php: /system_gateways.php: Removing static route for monitor 96.x.x.x and adding a new route through 10.x.x.x
Jun 5 14:35:24 php: /system_gateways.php: Removing static route for monitor 71.x.x.x and adding a new route through 192.168.x.x
Jun 5 14:35:24 apinger: Exiting on signal 15.
I'm going to do some searching on that and see what I can come up with.