Router Locking Up (maybe due to excessive lan traffic?)
-
OK, so it could be the cell modem serving its own subnet via DHCP when it loses cell signal. You might have to reject leases from it to prevent that.
192.0.0.1 could really be what the ISP is using, even if they probably shouldn't!
-
Thanks, I added 192.0.0.1 to "Reject leases from" on the interface. For kicks, I decided to reboot the cell modem. Though the Primary WAN was up, network connectivity went to pot. Here are the logs, filtered for "192.0":
Feb 26 15:41:12 php-fpm 37729 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.0.0.2 - Restarting packages.
Feb 26 15:40:37 php-fpm 14962 8.8.8.8|192.0.0.2|GW_Cellular|306.312ms|389.629ms|0.0%|online|delay
Feb 26 15:40:26 php-fpm 37729 /rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
Feb 26 15:39:43 php-fpm 401 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.0.0.2 - Restarting packages.
Feb 26 15:39:41 php-fpm 37729 /rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:39:07 kernel arpresolve: can't allocate llinfo for 192.0.0.1 on igb1
Feb 26 15:38:59 rc.gateway_alarm 72234 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:38:59 php-fpm 401 /rc.newwanip: dpinger: status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock not found
Feb 26 15:38:59 php-fpm 37729 /rc.dyndns.update: dpinger: status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock not found
Feb 26 15:38:59 php-fpm 36581 /rc.filter_configure_sync: dpinger: status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock not found
Feb 26 15:38:58 php-fpm 401 /rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
Feb 26 15:38:19 php-fpm 36581 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.0.0.2 - Restarting packages.
Feb 26 15:38:16 php-fpm 401 /rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:37:34 php-fpm 36581 /rc.newwanip: The command '/usr/local/bin/dpinger -S -r 0 -i GW_Cellular -B 192.0.0.2 -p /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.pid -u /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock -C "/etc/rc.gateway_alarm" -d 1 -s 2500 -l 5000 -t 60000 -A 5000 -D 350 -L 15 8.8.8.8 >/dev/null' returned exit code '1', the output was ''
Feb 26 15:37:34 rc.gateway_alarm 12798 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:36:50 php-fpm 36581 /rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:36:50 php-fpm 54361 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 192.168.5.145 -> 192.0.0.2 - Restarting packages.
Feb 26 15:36:07 rc.gateway_alarm 92783 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:36:07 php-fpm 54361 /rc.newwanip: dpinger: cannot connect to status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock - No such file or directory (2)
Feb 26 15:36:05 php-fpm 54361 /rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
Feb 26 15:35:58 php-fpm 54361 8.8.8.8|192.0.0.2|GW_Cellular|51.916ms|16.283ms|54%|down|highloss
Feb 26 15:35:28 php-fpm 54361 /rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:34:45 rc.gateway_alarm 10879 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:34:40 php-fpm 54361 /rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
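The down alarms in this excerpt recur at a fairly steady cadence (15:34:45, 15:36:07, 15:37:34, 15:38:59), and each is followed roughly 40 seconds later by rc.newwanip re-running for 192.0.0.2, which may indicate a reconnect loop rather than a single outage. A minimal Python sketch for measuring that cadence from a saved copy of the log, assuming the "Gateway alarm" line format shown above. `parse_alarms` and `down_intervals` are hypothetical helper names, and because syslog timestamps omit the year, a placeholder year is used so that only the differences matter:

```python
import re
from datetime import datetime

# Matches pfSense gateway alarm lines as they appear above, e.g.
# "Feb 26 15:38:59 rc.gateway_alarm 72234 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down ..."
ALARM_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) rc\.gateway_alarm \d+ >>> "
    r"Gateway alarm: (?P<gw>\S+) \(Addr:(?P<addr>\S+) Alarm:(?P<state>\w+)"
)


def parse_alarms(lines, year=2000):
    """Return (datetime, gateway, state) tuples in chronological order.

    Syslog timestamps omit the year, so a placeholder year is supplied;
    only the differences between timestamps are meaningful here.
    """
    events = []
    for line in lines:
        m = ALARM_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["gw"], m["state"]))
    return sorted(events)


def down_intervals(lines, gateway):
    """Seconds between consecutive 'down' alarms for one gateway."""
    downs = [ts for ts, gw, state in parse_alarms(lines)
             if gw == gateway and state == "down"]
    return [int((b - a).total_seconds()) for a, b in zip(downs, downs[1:])]


log = """\
Feb 26 15:38:59 rc.gateway_alarm 72234 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:37:34 rc.gateway_alarm 12798 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:36:07 rc.gateway_alarm 92783 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:34:45 rc.gateway_alarm 10879 >>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
"""
print(down_intervals(log.splitlines(), "GW_Cellular"))  # [82, 87, 85]
```

Non-matching lines (dpinger status lines, rc.newwanip messages) are simply skipped, so the whole filtered log can be fed in as-is.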
-
If anything, I would expect 192.0.0.x to be the real connection and 192.168.225.1 to be something local. However, that isn't the subnet the modem's GUI seems to be using.
So if that fails, I'd try refusing leases from 192.168.225.1 instead.
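Some background on the two kinds of address in play: 192.0.0.0/29 is an IANA special-purpose block that RFC 6333 reserved for DS-Lite (192.0.0.1 for the carrier's AFTR, 192.0.0.2 for the customer's B4 element), and RFC 7335 later generalized it to other IPv4 service continuity mechanisms such as 464XLAT, which some cellular carriers use. 192.168.225.1 and the earlier 192.168.5.145 are ordinary RFC 1918 private addresses, the kind a modem hands out on its own LAN. A minimal Python sketch using the standard `ipaddress` module to label them. `classify` and `rejected` are hypothetical helpers, and `rejected` only approximates a "Reject leases from" list of single IPs; it is not pfSense's actual matching code:

```python
import ipaddress

# IANA special-purpose block for IPv4-over-IPv6 transition
# (RFC 6333 DS-Lite; generalized by RFC 7335).
TRANSITION_BLOCK = ipaddress.ip_network("192.0.0.0/29")
# RFC 1918 private ranges, the kind a modem uses for its own LAN.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def classify(addr: str) -> str:
    """Label an address seen on the cellular WAN interface."""
    ip = ipaddress.ip_address(addr)
    if ip in TRANSITION_BLOCK:
        return "carrier transition block"
    if any(ip in net for net in RFC1918):
        return "RFC 1918 private"
    return "other"


def rejected(server: str, reject_from: list) -> bool:
    """Hypothetical stand-in for a 'Reject leases from' list of single IPs."""
    ip = ipaddress.ip_address(server)
    return any(ip == ipaddress.ip_address(a) for a in reject_from)


for addr in ("192.0.0.1", "192.0.0.2", "192.168.225.1", "192.168.5.145"):
    print(addr, "->", classify(addr))
```

Under that labeling, a 192.0.0.x lease looks carrier-assigned, while a 192.168.x.x lease would point at the modem's own DHCP server.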
-
I reached out to the cellular modem manufacturer, who was helpful. Apparently some of my modem configuration was wrong, and that now appears to be straightened out. However, I'm continuing to experience issues.
https://wirelessjoint.com/viewtopic.php?t=4191
Reviewing the logs from the last two lock-ups, I see the following happening several minutes beforehand. I was not in this morning, but I saw that my Blink cams and several other devices went offline. A few hours later, the alarm monitoring company called to report a communication failure (which means the alarm was able to communicate for some time after the issue started). Also, in the logs I've noticed that both gateways report packet loss/offline within a few seconds of each other.
Mar 2 09:43:43 router unbound[12219]: [12219:1] error: ssl handshake failed crypto error:0A000416:SSL routines::sslv3 alert certificate unknown
Mar 2 09:43:43 router unbound[12219]: [12219:1] notice: ssl handshake failed 10.111.11.118 port 53295
Mar 2 09:44:22 router unbound[12219]: [12219:3] error: ssl handshake failed crypto error:0A000416:SSL routines::sslv3 alert certificate unknown
Mar 2 09:44:22 router unbound[12219]: [12219:3] notice: ssl handshake failed 10.111.11.115 port 62052
Mar 2 09:44:22 router unbound[12219]: [12219:2] error: ssl handshake failed crypto error:0A000416:SSL routines::sslv3 alert certificate unknown
Mar 2 09:44:22 router unbound[12219]: [12219:2] notice: ssl handshake failed 10.111.11.115 port 62053
Mar 2 09:45:08 router filterdns[36239]: merge_config: configuration reload
Mar 2 09:45:08 router filterdns[36239]: Adding Action: pf table: networkABC host: abc.duckdns.org
Mar 2 09:45:08 router filterdns[36239]: Adding Action: pf table: network123 host: 123.duckdns.org
[More of the above, then]
Mar 2 09:46:08 router filterdns[36239]: failed to resolve host ntp.org will retry later again.
Mar 2 09:46:08 router filterdns[36239]: failed to resolve host abc.duckdns.org will retry later again.
Mar 2 09:46:08 router filterdns[36239]: failed to resolve host 123.duckdns.org will retry later again.
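To see whether the same clients show up before each lock-up, the unbound handshake failures can be tallied per client address and the names filterdns could not resolve collected in one pass. A minimal Python sketch, assuming the line formats shown above. `summarize` is a hypothetical helper, and the GUI-style "filterdns PID" variant it also accepts is an assumption based on how such entries are quoted elsewhere in this thread:

```python
import re
from collections import Counter

# unbound: "... notice: ssl handshake failed 10.111.11.118 port 53295"
HANDSHAKE_RE = re.compile(r"notice: ssl handshake failed (?P<client>\S+) port \d+")
# filterdns: accepts the raw syslog "filterdns[PID]:" form shown above and,
# as an assumption, the GUI's "filterdns PID" form.
UNRESOLVED_RE = re.compile(r"filterdns\W+\d+\W+failed to resolve host (?P<host>\S+)")


def summarize(lines):
    """Count TLS handshake failures per client; list unresolved hosts in first-seen order."""
    clients = Counter()
    unresolved = []
    for line in lines:
        if m := HANDSHAKE_RE.search(line):
            clients[m["client"]] += 1
        elif (m := UNRESOLVED_RE.search(line)) and m["host"] not in unresolved:
            unresolved.append(m["host"])
    return clients, unresolved


sample = [
    "Mar 2 09:43:43 router unbound[12219]: [12219:1] notice: ssl handshake failed 10.111.11.118 port 53295",
    "Mar 2 09:44:22 router unbound[12219]: [12219:3] notice: ssl handshake failed 10.111.11.115 port 62052",
    "Mar 2 09:44:22 router unbound[12219]: [12219:2] notice: ssl handshake failed 10.111.11.115 port 62053",
    "Mar 2 09:46:08 router filterdns[36239]: failed to resolve host ntp.org will retry later again.",
    "Mar 2 09:46:08 router filterdns[36239]: failed to resolve host abc.duckdns.org will retry later again.",
]
clients, unresolved = summarize(sample)
print(clients.most_common())  # [('10.111.11.115', 2), ('10.111.11.118', 1)]
print(unresolved)             # ['ntp.org', 'abc.duckdns.org']
```

Running this over the logs from each episode would show whether .115 and .118 consistently precede the failures.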
-
@stephenw10
It seems like something in the router is preventing IPs (not just DNS resolution) from being reached.
-
What are those hosts at 10.111.11.115 and 10.111.11.118?
-
@stephenw10
.115 is an iPhone. The interesting thing here is that the device would not have been on the network at that time (09:44). In this episode and the one before, .115 appeared in the logs reporting the same thing minutes before the issue started (it would have been on the network during the previous episode). I'm not sure yet what .118 is; it might be another iPhone.
-
Is it possible the system clock is wrong?
-
@stephenw10 No, the time is correct.
-
This morning all seemed fine (the smart TV was working, no complaints from the alarm, etc.) until I logged into my desktop PC and it would not load local (i.e. pfSense GUI) or WWW pages. I tried to SSH into the router, but no joy.
The logs are relatively quiet from midnight until the time I power cycled the router. I did not see any "ssl handshake failed crypto error" entries.
There were several "filterdns 18089 Adding Action: pf table: XYZ host: xxx.xxx.xxx.xxx" entries prior to rebooting. I've seen these in the logs prior to other failures.
I also noticed several ntpd log entries like this:
Mar 6 08:31:31 ntpd 87972 Soliciting pool server 45.83.234.123
I tried "ntpq -c pe" per a Stack Exchange post; if I understand correctly, st 16 means out of sync:
remote refid st t when poll reach delay offset jitter
=============================================================================
0.pfsense.pool. .POOL. 16 p - 64 0 0.000 +0.000 0.000
1.pool.ntp.org .POOL. 16 p - 64 0 0.000 +0.000 0.000
2.pool.ntp.org .POOL. 16 p - 64 0 0.000 +0.000 0.000
3.pool.ntp.org .POOL. 16 p - 64 0 0.000 +0.000 0.000
*65-100-46-166.d .SOCK. 1 u 38 128 377 74.742 +2.404 1.990
+ns1.your-site.c 216.218.254.202 3 u 58 128 377 72.907 +1.203 5.012
+104.156.246.53 204.9.54.119 2 u 103 128 377 40.948 -0.402 6.028
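For reading this output: the first character of each row is ntpq's tally code, where `*` marks the peer ntpd is currently synchronized to and `+` marks candidates that survived selection. The `.POOL.` rows are the configured pool directives themselves, not real servers, so they normally sit at stratum 16 with reach 0; the servers ntpd actually picked from those pools are listed as their own rows. A minimal Python sketch that pulls out the system peer, assuming whitespace-separated columns in the standard ntpq order; `Peer`, `parse_peers`, and `system_peer` are hypothetical names:

```python
from typing import NamedTuple

# ntpq tally codes that can appear in column 0 (a space means "none/rejected").
TALLY_CODES = "*+#-xo."


class Peer(NamedTuple):
    tally: str    # '*' = system peer, '+' = candidate, ' ' = none/rejected
    remote: str
    refid: str
    stratum: int
    reach: str    # octal reachability register; "377" = last 8 polls answered


def parse_peers(text: str) -> list:
    """Parse `ntpq -p` / `ntpq -c pe` output into Peer records.

    Assumes whitespace-separated columns in the standard order
    (remote refid st t when poll reach delay offset jitter) and that the
    leading space ntpq prints for untagged rows is preserved.
    """
    peers = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header, the ===== rule, and anything malformed.
        if len(fields) != 10 or not fields[2].isdigit():
            continue
        tally = line[0] if line[0] in TALLY_CODES else " "
        remote = fields[0][1:] if tally != " " else fields[0]
        peers.append(Peer(tally, remote, fields[1], int(fields[2]), fields[6]))
    return peers


def system_peer(peers):
    """Return the '*' peer ntpd is synchronized to, or None."""
    return next((p for p in peers if p.tally == "*"), None)


sample = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.pfsense.pool. .POOL.          16 p    -   64    0    0.000   +0.000   0.000
*65-100-46-166.d .SOCK.           1 u   38  128  377   74.742   +2.404   1.990
+104.156.246.53  204.9.54.119     2 u  103  128  377   40.948   -0.402   6.028
"""
peers = parse_peers(sample)
sp = system_peer(peers)
print(sp.remote, sp.stratum)  # 65-100-46-166.d 1
```

If `system_peer` returns a peer with reach 377, ntpd is synchronized even though the pool rows show stratum 16.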
-
Dashboard shows correct time
NTP service is enabled with:
0.pfsense.pool.ntp.org
1.pool.ntp.org
2.pool.ntp.org
3.pool.ntp.org
The same timeservers are entered in System > General > Timeservers.
I don't see any firewall rules that would block NTP requests.
I'm disabling NTP Server, as I don't think I'm using it.
I'm assuming the other timeservers listed in the ntpq results are requests from LAN devices.
-
@Ximulate said in Router Locking Up (maybe due to excessive lan traffic?):
I logged into my desktop PC and it would not load local (i.e. pfSense GUI) or WWW pages. I tried to SSH into the router, but no joy.
How was it failing? Is it a DNS resolution failure? The services actually stopped responding on the firewall?
Progressively failing services like that could be a disk issue. Do you see gaps in the logging after recovering access?
-
No, the IP addresses appear to be dropped, as if DHCP is failing or devices are not able to see other devices. In other words, if I type the pfSense router's IP address into the browser, it does not load; the browser does not see the pfSense GUI. Once this happens, the only way I'm able to recover access is by power cycling the router.
At one point, I had my laptop connected to the router's serial console, and I was usually able to access the command menu that way. Occasionally I could RDP to the laptop to access the command menu, but normally that would not work either.
I think I've tried this already, but I'll manually set the IP addresses of my desktop and laptop to see if they still communicate the next time the network fails. Currently pfSense is handing out static leases to my desktop and a few other items, and dynamic leases to the rest.
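When setting an address by hand for this test, it should fall inside the LAN subnet but outside the DHCP server's dynamic range, so it cannot collide with a lease handed to another device. A small Python check; the 10.111.11.0/24 subnet is an assumption inferred from the client addresses in the unbound log above, and the pool bounds are hypothetical placeholders for whatever Services > DHCP Server actually shows:

```python
import ipaddress


def check_static(addr, subnet, pool_start, pool_end):
    """Return a list of problems with using `addr` as a manual static IP."""
    ip = ipaddress.ip_address(addr)
    net = ipaddress.ip_network(subnet)
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)
    problems = []
    if ip not in net:
        problems.append("outside LAN subnet")
    elif ip in (net.network_address, net.broadcast_address):
        problems.append("network or broadcast address")
    if start <= ip <= end:
        problems.append("inside DHCP dynamic range")
    return problems


LAN = "10.111.11.0/24"                      # assumed from the log excerpts
POOL = ("10.111.11.100", "10.111.11.199")   # hypothetical DHCP range
print(check_static("10.111.11.20", LAN, *POOL))   # []
print(check_static("10.111.11.150", LAN, *POOL))  # ['inside DHCP dynamic range']
```

An empty list means the address is safe to use for the manual test under those assumed values.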
On the rare occasion that I catch the network acting up but can still get to the router GUI, I have not seen any failing services. I have also tried the pfSense tools in the CLI, like "playback restartallwan", without success. A reboot was required.
-
Was the console responsive if you were at the laptop connected to it directly?
-
@stephenw10 To the best of my recollection, at least within the last few weeks, the console has always been accessible via serial.
-
Ok then I'd try to connect out from it when this happens and see what (if anything) still works.
-
@stephenw10 Maybe I misunderstood your last question. When the network/router fails, I have been able to access the console via the serial connection, but devices on the network still do not communicate. I've tried restarting PHP, restarting the web configurator, and using the playback scripts in the tools; none of those resolves the issue, only rebooting does.
-
Right but can you ping out from the console to external targets? By IP and FQDN? What about internal targets?
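Each combination of those results points at a different layer, which is why it helps to test all three from the console. A minimal Python sketch of that reasoning as a lookup; `likely_layer` is a hypothetical helper, and the labels are a rough heuristic under the assumption of simple pass/fail ping results, not an exhaustive diagnosis:

```python
def likely_layer(ext_ip_ok: bool, ext_fqdn_ok: bool, internal_ok: bool) -> str:
    """Rough triage from console ping results (a heuristic, not a diagnosis)."""
    if not internal_ok and not ext_ip_ok:
        return "nothing passing: interfaces, driver, or the firewall itself"
    if not internal_ok:
        return "LAN side: interface, switch, or ARP"
    if not ext_ip_ok:
        return "WAN side: gateway, routing, or upstream link"
    if not ext_fqdn_ok:
        # External IPs answer but names do not resolve.
        return "DNS resolution"
    return "firewall path looks fine: look at the clients (DHCP, ARP, DNS settings)"


print(likely_layer(ext_ip_ok=True, ext_fqdn_ok=False, internal_ok=True))  # DNS resolution
```

The order of the checks matters: a LAN-side failure is reported before DNS, since name lookups from the console can still work when the LAN is down.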
-
@stephenw10
I had to go back to the first post to refresh my memory, but yes, I did also try pinging back then:
https://forum.netgate.com/post/1152732
When I can get into the GUI, I don't see any issues in the dashboard, like a down WAN or CPU or memory problems. Most of the time, I don't notice until it's too late, so I can't connect to the GUI. I have set up my laptop on the router's console. I've tried various options in the menu, including restarting PHP, the web configurator, tools like "playback restartallwan" and others, to no avail. The one interesting thing is that although the LAN devices aren't connecting, I can, from the console, sometimes ping external IPs like 9.9.9.9 OK, but 8.8.8.8 might not respond. Internal LAN devices don't respond to ping either.
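One possible reason 9.9.9.9 answers while 8.8.8.8 does not: the February logs earlier in this thread show pfSense adding a static route for the dpinger monitor address 8.8.8.8 through the cellular gateway 192.0.0.1, and a /32 host route wins over the default route under longest-prefix matching. If that monitor route is still in place, a console ping to 8.8.8.8 exercises the cellular link rather than the primary WAN. A minimal Python sketch of the lookup; the route table is reduced to two entries, and `PRIMARY_GW` (a documentation address) and `next_hop` are hypothetical:

```python
import ipaddress

# Reduced route table: (destination, gateway). The 8.8.8.8/32 host route
# mirrors the dpinger monitor route from the logs; PRIMARY_GW is a
# hypothetical placeholder for the primary WAN's gateway.
PRIMARY_GW = "203.0.113.1"  # documentation address, not the real gateway
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), PRIMARY_GW),
    (ipaddress.ip_network("8.8.8.8/32"), "192.0.0.1"),
]


def next_hop(dest: str) -> str:
    """Pick the gateway of the most specific (longest-prefix) matching route."""
    ip = ipaddress.ip_address(dest)
    matches = [(net, gw) for net, gw in ROUTES if ip in net]
    _, gw = max(matches, key=lambda m: m[0].prefixlen)
    return gw


print(next_hop("8.8.8.8"))  # 192.0.0.1
print(next_hop("9.9.9.9"))  # 203.0.113.1
```

On the firewall itself, the routing table (Diagnostics > Routes, or `netstat -rn` from the console) would show whether such a host route for the monitor IP actually exists.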
A lot has transpired since that post, so I'll try pinging again next time. However, I do think I'm going to get the same or similar results. I just reconnected my laptop to the console via serial, so it's ready to go as soon as I can get to it.
BTW... Thank you for hanging in there with me on this!
-
Hmm, being unable to ping any internal IP address suggests it just stops passing traffic. That's unlikely to be a NIC issue on an APU2, assuming you are only using the igb NICs.
Hmm, something of a mystery. You can try booting in verbose mode. If it is something hardware-related, that might show something.
Beyond that you can try loading the debug kernel:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/debug-kernel.html