WAN failure, reboot fixes it.
Had our main WAN go down twice now in a week. Strange problem, as everything else works: WAN failover, internal LAN, and OpenVPN all continue to function as normal, just no main WAN. Thought it was our ISP (Frontier) but surprisingly it was not. Found out all I have to do to fix it is reboot the pfSense box (SG-4860).
Fairly simple setup: 2.4.4-RELEASE-p2 (amd64), FIOS main WAN, cell modem for failover WAN, LAN, LAN2, OpenVPN, DNS Resolver. Up until this happened the system had been working fine.
No clue if it is hardware or software. It does not happen very often, but cellular data is expensive and rebooting during business hours is not ideal. What I need is a plan for finding out what is happening. I am aware pfSense has logging functionality; I just have not used it much. When I did look at the logs I did not see anything obvious, and some of the logs had already overwritten the older entries from around the time of the failure. Then again, I may not know what I am supposed to look for.
Do I need to change some logging settings to get a better idea if it happens again?
Any other ideas?
Now this problem has become regular instead of random. Main WAN goes down (no IP) Monday through Friday at 9AM local time. Not on the weekends, however! So far I have found that only a pfSense box reboot fixes it, and then it is fine for the rest of the day.
Any ideas on the best way to discover what is happening here and then how to fix it?
It sounds like your ISP is resetting your connection every day at 9am. This is common with consumer-level connections but shouldn't be the case for a business account. When you reboot, does it come up and grab the same IP address or is it different every time? Anything in System Logs - DHCP with regard to dhclient?
Yes it grabs the same WAN IP it always has. We are on a dynamic IP but it rarely changes.
The DHCP logs have scrolled off the screen. Is there a way to see the older ones?
The logs for various pfSense bits are often circular so those details may be gone.
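One way to keep those entries from being overwritten is to send the logs to another machine using the remote logging options under Status > System Logs > Settings. Below is a minimal sketch of a UDP receiver that appends whatever arrives to a file. The port (5140) and the file name are assumptions picked for illustration, not pfSense defaults, so adjust them to whatever you enter as the remote log server.

```python
import socketserver
from datetime import datetime

LOG_FILE = "pfsense-remote.log"  # hypothetical file name
LISTEN = ("0.0.0.0", 5140)       # assumed unprivileged port; must match the firewall setting


def format_entry(received_at, source_ip, payload):
    """Turn one raw syslog datagram into a single timestamped text line."""
    text = payload.decode("utf-8", errors="replace").strip()
    return f"{received_at.isoformat(timespec='seconds')} {source_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        line = format_entry(datetime.now(), self.client_address[0], data)
        with open(LOG_FILE, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


def main():
    """Listen until interrupted, appending every received message to LOG_FILE."""
    with socketserver.UDPServer(LISTEN, SyslogHandler) as server:
        server.serve_forever()

# To start the receiver, call main() (stop it with Ctrl+C).
```

Since the file only ever grows, the entries from just before 9am will still be there when you go looking afterwards.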
Since you know that this effect happens every day at 9am, start a packet capture on WAN just before the event and then stop it once your WAN is down. Look at it in Wireshark to see what's really going on.
You could always call your ISP and ask them why you lose connectivity every morning at 9am as well.
I will check the pfsense logs right at 9 am tomorrow.
I will have to look into Wireshark, as it has been a long while since I last used it.
I checked the lease expiration time and it says currently 6:11 PM today.
Not sure it is related to the ISP as it does not happen on Saturday or Sunday.
I will also call my ISP but I don't have much hope they can help. Maybe I will catch a break.
Thanks for the help.
Did the exact same thing at 9AM sharp again. Looking at the logs, nothing obvious. I did notice that unbound restarts every time a client accesses the DHCP server. The last restart happened one minute before the WAN went down. I unchecked "Register DHCP leases in the DNS Resolver" and unbound has not restarted once since. So I will leave it that way for now and see what happens at 9AM tomorrow.
The unbound deal explains why sometimes client host names do not resolve as they should. Unbound was probably restarting at the exact time the client was accessing DHCP. pfSense needs to fix this long-standing unbound issue!
Also called my ISP and they ran a bunch of tests on the ONT modem. They can see that the pfSense box stops communicating with the ONT at 9AM every day except Saturday and Sunday. They say it's the router, as their modem tests out fine. Of course they offered to send someone out. I declined as I think the problem is a pfSense issue.
I have a hard time understanding why unbound restarting would happen at exactly 9am. What's the TTL on your DHCP leases? Plus, having no DNS does not just stop all traffic in and out. Something else is going on.
Yes, I really do not know if unbound is restarting at exactly 9AM every day, as today is the first time I looked at the logs right after it happened. I agree with you: not likely.
For the clients on the LAN the leases are 2 hours.
By the way OpenVPN got a SIGHUP[hard] one second after the Gateway alarms that the WAN is down. Very annoying for our VPN user as they have to reboot their desktop so they can reestablish all their connections. Then they are going through our backup WAN at that point. However they have to reboot again after I reboot the pfsense box to get our main WAN back.
Why would the client have to reboot?? That makes no sense either. They should just disconnect if they haven't already been forced to, and then reconnect. How many clients are we talking about here, and what is their nature considering you have such a short lease time?
Only the VPN client has to reboot, as they have a few connections to our CRM and accounting systems and these get left in a strange state when the VPN goes down. We tried to just disconnect and reconnect the VPN but it did not resume the connections to the software correctly. It is just easier for them to reboot and establish new connections to get back in. We only have one VPN client currently. We have about 20 regular clients on our LAN.
I did not set the DHCP lease time in pfsense I left it the default setting. Seems to work for us.
My packet capture suggestion is still valid. Start your cap on WAN just before 9am and set it to monitor ICMP, then start a ping session to google or somewhere from a client. Stop it as soon as you notice that the pings die at 9am-ish. Look at it in Wireshark and see if that tells you anything. You may need to widen your capture scope to all traffic protocols if your initial capture is inconclusive.
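To line the capture up with the exact moment things die, it can help to log timestamps from the client side as well. Here is a rough sketch of that idea. Note that it swaps the ICMP ping for a plain TCP connect test so it runs without special privileges, and the target (8.8.8.8 on port 53) and the one-second interval are just assumptions for illustration. With failover in place the probe may recover over the backup WAN, so the interesting part is the time of the first DOWN line.

```python
import socket
import time
from datetime import datetime


def tcp_probe(host="8.8.8.8", port=53, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(probe=tcp_probe, interval=1.0, max_checks=None,
          now=datetime.now, sleep=time.sleep):
    """Call probe repeatedly; record and print a line each time the state changes."""
    transitions = []
    last = None
    checks = 0
    while max_checks is None or checks < max_checks:
        up = probe()
        if up != last:
            stamp = now()
            state = "UP" if up else "DOWN"
            transitions.append((stamp, state))
            print(f"{stamp.isoformat(timespec='seconds')} {state}")
            last = up
        checks += 1
        sleep(interval)
    return transitions

# Example: start a few minutes before 9am and let it run, e.g. watch(max_checks=1800)
```

The timestamps it prints can then be matched against the packet times in Wireshark and against the gateway alarm entries in the system log.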
May have had a hardware problem here all along. A few minutes before 9AM today I noticed we were on the backup WAN already. Then about 4 minutes before 9 the router died. It would not boot up after power cycling. The status light stays red, and it shuts off after 5 to 10 minutes. Can't even see activity on the console port. Switched Frontier's router back in to get a working network, minus the VPN, LAN2 and backup WAN.
The box is only two years old. Going back today for repair. Have a new box coming tomorrow.
We will see if the new box has the same problem the old box did.
Well, that would certainly explain it. Good luck.
Just to not leave everyone hanging on this, here is the outcome:
Got the new router. Restored the configuration file from backup. Switched the interfaces, as the SG-5100 is slightly different than the 4860. Plugged everything in and everything seemed to work. That is, until 9 am the next work day. Same problem, different router. However, when it switched to the backup WAN it seemed to connect for a minute or less, then it too failed. Rebooted the 5100 and we were back in business on the main WAN. The backup WAN was connected also. That is, until 9 am the next day. Same thing.
Some time that day the backup WAN stopped working. Could not get it to work. However, the next day at 9 am the main WAN did not fail with the cell modem off. Same thing the next day. Without the cell modem, no 9 am anomaly.
However, no backup WAN. Over the course of several days I continued to troubleshoot the cell modem. Multiple settings changes, factory resets, every suggestion I could find on the net about this specific modem. The modem would work just fine plugged directly into a laptop, though. It worked in router mode as well as bridge mode, always on the laptop. I could set it up on the laptop, leave it powered up and quickly plug it into the 5100, and there would be an ethernet link but it would never get an IP. The 5100 would not communicate with the modem no matter what I did.
One night at home I thought: what if the router had a hardware port problem? Not normal for a new device, but possible. Also unusual that it would be the same port that I happened to have the modem connected to. So the next morning I logged into the pfSense GUI, switched the WAN_CELL interface from ix0 to ix2 and plugged the cell modem into ix2. Powered up the cell modem, and when it finished booting it connected and I had my backup WAN again.
Netgate support said I should hook up a laptop to ix0, do some changes to the pfsense settings and see if that port can connect to the laptop. Guess what - it would not connect. To triple check I did the same changes to ix1 and plugged the laptop in and it immediately connected.
On top of that with a working port and cell modem the 9am anomaly has not occurred again.
So I now have an RMA to send the brand new 5100 back. And our repaired 4860 is on its way back to us.
P.S. During this adventure I discovered an anomaly with some settings when you switch interfaces. Under System > Routing > Gateways, in the Advanced section when editing WAN_CELL, the Probe Interval and Alert Interval switch back to their default values. I would think that they should stay tied to the gateway name.
Anyway we are back up for now. I will let you know if anything changes.