Internet access breaks approximately every week
I'm having a strange issue, which I assume is a pfSense problem, and I can't figure out the root cause.
This has been an issue for several months and I've not made much progress.
Here's what happens...
Approximately once a week, sometimes every few days, I lose internet connectivity on the machines downstream of my pfSense router. If I sign in to the webgui, I see that the gateway is marked as down and the indicator turns red. If I ping the gateway's IP address using the webgui diagnostics ping form, the pings come back OK, so the gateway address is responding. I can also ping the gateway from the other side of the pfSense router, using a machine downstream on the network. However, if I use such a machine to ping, for example, 188.8.131.52, the ping command line utility responds "time to live exceeded".
Now let me tell you about my setup.
I have my crappy ISP router which sits facing the internet. It has an inbuilt WAP and some devices on our home network connect to it directly. The IP is 192.168.0.1. It has one ethernet connected device, which is a connection to the WAN side of the pfSense router.
The pfSense router runs virtualized on a Skylake 1156 (?) Xeon based server system. This is a DIY build. There is a network card which is passed through as a PCI device. The native OS is Debian, and virtualization is done through QEMU/KVM. The network card has 4 ports in total, of which 3 are used: WAN, LAN1 and OPT1. LAN1 and OPT1 are two other networks with the addresses 192.168.1.x and 192.168.2.x.
The Debian server connects to the network 192.168.1.x. There is another WAP on the same network.
I have some other machines on the network 192.168.2.x.
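For anyone following along, the layout above can be written down as a small Python sketch that says which segment a given address belongs to. The /24 masks and the mapping of LAN1 to 192.168.1.x and OPT1 to 192.168.2.x are assumptions read from the description, and the segment names are just labels chosen for illustration.

```python
from ipaddress import ip_address, ip_network

# Assumed /24 networks, based on the addresses mentioned in the post.
SEGMENTS = {
    "ISP router LAN (pfSense WAN side)": ip_network("192.168.0.0/24"),
    "LAN1": ip_network("192.168.1.0/24"),
    "OPT1": ip_network("192.168.2.0/24"),
}

def segment_of(addr: str) -> str:
    """Return the label of the segment that contains addr."""
    ip = ip_address(addr)
    for name, net in SEGMENTS.items():
        if ip in net:
            return name
    return "outside the home network"
```

For example, segment_of("192.168.0.199") lands on the ISP router LAN, which is the segment the pfSense WAN port sits on.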
Here's a more detailed description of the issue.
I will be working on a machine on 192.168.2.x and will abruptly lose internet connection. If I try to ping 184.108.40.206 from these machines, I get "TTL Exceeded". I can ping the pfSense router, and I can ping the ISP router.
Machines connected to the ISP router via the ISP inbuilt WAP continue to work fine.
Only machines on 192.168.2.x and 192.168.1.x lose connectivity.
Connectivity between machines on 192.168.2.x is fine; these connect directly via a switch.
If I sign into the pfSense webgui I can ping 192.168.0.1 and 220.127.116.11 fine, but the gateway still shows as "down".
Here's a list of things which fix it.
Using the webgui, if I navigate to the Status > Interfaces page, release the WAN IP and then renew it, the problem is resolved.
Restarting the ISP router OR pfSense system resolves the problem. Restarting both obviously also fixes the issue.
Removing the WAN network cable and plugging it back in. This is the cable linking the ISP router and the pfSense VM.
Here's a list of things I tried that didn't fix it.
I changed the static IP assigned by the ISP router to the pfSense box to a higher number, to avoid conflicts with any other devices. The reservation is set to 192.168.0.199 and the pfSense box uses DHCP to get this address. This seems to make no difference.
I don't know what further steps I might take to diagnose the issue. It's very strange that I can ping 18.104.22.168 from the pfSense VM but not from machines further downstream on the network.
I tried to add as much info as possible, sorry if I missed something obvious.
Since this issue only occurs once every few days I might not be able to reply very quickly to suggested fixes.
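Because the fault only shows up every few days, it may help to log the state of each hop so the moment of failure is captured. Below is a minimal Python sketch, assuming it runs on a Linux machine downstream of pfSense with the iputils ping (flags -c and -W); the function names classify, probe and log_line are made up for illustration, and the matched strings are the usual "Time to live exceeded" (Linux/BSD) and "TTL expired in transit" (Windows) messages.

```python
import subprocess
from datetime import datetime

def classify(output: str, returncode: int) -> str:
    """Turn raw ping output and its exit status into one of three states."""
    text = output.lower()
    if "time to live exceeded" in text or "ttl expired" in text:
        return "ttl_exceeded"
    # ping exits with 0 when at least one reply was received.
    return "ok" if returncode == 0 else "no_reply"

def probe(host: str, timeout_s: int = 2) -> str:
    """Send a single ping and classify the result.

    Assumes Linux iputils flags: -c is the packet count and
    -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
        text=True,
    )
    return classify(result.stdout + result.stderr, result.returncode)

def log_line(results: dict) -> str:
    """Format one timestamped line such as '<time> hostA=ok hostB=no_reply'."""
    stamp = datetime.now().isoformat(timespec="seconds")
    fields = " ".join(f"{host}={state}" for host, state in results.items())
    return f"{stamp} {fields}"
```

Run from cron every minute or so, something like print(log_line({h: probe(h) for h in hosts})) with hosts set to the pfSense address, 192.168.0.1 and one external address would show which hop changes state first when the connection drops.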
If you have two WANs, see https://forum.netgate.com/post/956355. Otherwise it sounds similar: your pfSense is determining that the gateway is down and never marking it as up again. You can brute force it by checking "Disable Gateway Monitoring" on that gateway to turn off the checking (System/Routing/Gateways/Edit).
Yup, TTL exceeded sounds like a routing loop.
I would say your default gateway got changed or the ISP router started routing traffic back to pfSense somehow.
Make sure your default gateway is set as the WAN in System > Routing > Gateways.
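To illustrate why a loop shows up as "TTL exceeded" rather than a plain timeout, here is a toy Python model. The router names (including internal_gw), the routing tables and the destination 203.0.113.10 (a documentation address) are all hypothetical; this is only a sketch of the mechanism, not of how pfSense actually forwards packets.

```python
from ipaddress import ip_address, ip_network

DEFAULT = ip_network("0.0.0.0/0")

def next_hop(table, dst):
    """Longest-prefix match over a list of (network, next_hop) pairs."""
    matches = [(net, hop) for net, hop in table if dst in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def trace(tables, start, dst, ttl=64):
    """Follow a packet hop by hop, decrementing TTL on every forward."""
    dst = ip_address(dst)
    node, path = start, [start]
    while ttl > 0:
        hop = next_hop(tables[node], dst)
        if hop == "internet":
            return "delivered", path
        ttl -= 1
        node = hop
        path.append(node)
    # TTL reached zero: a real router would drop the packet and
    # send back an ICMP "time exceeded" message.
    return "ttl exceeded", path

# Default route points at the WAN side (the ISP router).
HEALTHY = {
    "pfsense": [(DEFAULT, "isp_router")],
    "isp_router": [(DEFAULT, "internet")],
}

# Default route points at a hypothetical internal gateway whose own
# default route leads straight back to pfSense.
LOOPED = {
    "pfsense": [(DEFAULT, "internal_gw")],
    "internal_gw": [(DEFAULT, "pfsense")],
}
```

Calling trace(HEALTHY, "pfsense", "203.0.113.10") delivers the packet after one forward, while trace(LOOPED, "pfsense", "203.0.113.10") bounces between the two routers until the TTL runs out.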
@stephenw10 Thanks for the tip. It was set to "automatic", so presumably that is the cause of the problem? I have now set it to WAN_DHCP. Hopefully that will fix the issue.
If you have more than one gateway it could certainly have been. Especially if that other gateway is something internal causing the loop.
@stephenw10 Yes, there was another gateway to another "internal" network. Everything there was virtualized; however, I can see why that would cause a problem.