Router Locking Up (maybe due to excessive lan traffic?)

stephenw10

Ok, so it could be the cell modem serving its own subnet via DHCP if it loses cell signal. You might have to reject leases from it to prevent that.
192.0.0.1 could really be what the ISP is using, even if they probably shouldn't!

Ximulate

Thanks, I added 192.0.0.1 to "Reject leases from" in the interface. For kicks, I decided to reboot the cell modem. Though the primary WAN was up, connectivity of the network went to pot. Here are the logs, filtered for "192.0":

Feb 26 15:41:12 	php-fpm 	37729 	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.0.0.2 - Restarting packages.
Feb 26 15:40:37 	php-fpm 	14962 	8.8.8.8|192.0.0.2|GW_Cellular|306.312ms|389.629ms|0.0%|online|delay
Feb 26 15:40:26 	php-fpm 	37729 	/rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
Feb 26 15:39:43 	php-fpm 	401 	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.0.0.2 - Restarting packages.
Feb 26 15:39:41 	php-fpm 	37729 	/rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:39:07 	kernel 		arpresolve: can't allocate llinfo for 192.0.0.1 on igb1
Feb 26 15:38:59 	rc.gateway_alarm 	72234 	>>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:38:59 	php-fpm 	401 	/rc.newwanip: dpinger: status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock not found
Feb 26 15:38:59 	php-fpm 	37729 	/rc.dyndns.update: dpinger: status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock not found
Feb 26 15:38:59 	php-fpm 	36581 	/rc.filter_configure_sync: dpinger: status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock not found
Feb 26 15:38:58 	php-fpm 	401 	/rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
Feb 26 15:38:19 	php-fpm 	36581 	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> 192.0.0.2 - Restarting packages.
Feb 26 15:38:16 	php-fpm 	401 	/rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:37:34 	php-fpm 	36581 	/rc.newwanip: The command '/usr/local/bin/dpinger -S -r 0 -i GW_Cellular -B 192.0.0.2 -p /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.pid -u /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock -C "/etc/rc.gateway_alarm" -d 1 -s 2500 -l 5000 -t 60000 -A 5000 -D 350 -L 15 8.8.8.8 >/dev/null' returned exit code '1', the output was ''
Feb 26 15:37:34 	rc.gateway_alarm 	12798 	>>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:36:50 	php-fpm 	36581 	/rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:36:50 	php-fpm 	54361 	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 192.168.5.145 -> 192.0.0.2 - Restarting packages.
Feb 26 15:36:07 	rc.gateway_alarm 	92783 	>>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:36:07 	php-fpm 	54361 	/rc.newwanip: dpinger: cannot connect to status socket /var/run/dpinger_GW_Cellular~192.0.0.2~8.8.8.8.sock - No such file or directory (2)
Feb 26 15:36:05 	php-fpm 	54361 	/rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
Feb 26 15:35:58 	php-fpm 	54361 	8.8.8.8|192.0.0.2|GW_Cellular|51.916ms|16.283ms|54%|down|highloss
Feb 26 15:35:28 	php-fpm 	54361 	/rc.newwanip: rc.newwanip: on (IP address: 192.0.0.2) (interface: WANSEC[opt6]) (real interface: igb1).
Feb 26 15:34:45 	rc.gateway_alarm 	10879 	>>> Gateway alarm: GW_Cellular (Addr:192.0.0.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
Feb 26 15:34:40 	php-fpm 	54361 	/rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 192.0.0.1
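
For anyone wanting to pull the gateway state out of a log export like the one above, here is a minimal sketch in Python. It only handles the pipe-delimited lines (e.g. the "8.8.8.8|192.0.0.2|GW_Cellular|..." entries). The field names are my own labels, inferred from how the values line up with the "Gateway alarm" lines (RTT, RTTsd, Loss), not an official schema.

```python
from typing import NamedTuple, Optional


class GatewayStatus(NamedTuple):
    # Field names are assumptions inferred from the log lines in this thread.
    monitor_ip: str
    source_ip: str
    gateway: str
    rtt_ms: float
    rtt_sd_ms: float
    loss_pct: float
    status: str
    substatus: str


def _number(text: str, unit: str) -> float:
    """Strip a trailing unit such as 'ms' or '%' and convert to float."""
    if text.endswith(unit):
        text = text[: -len(unit)]
    return float(text)


def parse_gateway_status(line: str) -> Optional[GatewayStatus]:
    """Parse a pipe-delimited gateway status entry.

    Accepts either the bare status string or a full log line; the status
    string contains no spaces, so it is taken as the last token.
    Returns None for lines that do not match.
    """
    tokens = line.split()
    if not tokens:
        return None
    parts = tokens[-1].split("|")
    if len(parts) != 8:
        return None
    try:
        return GatewayStatus(
            monitor_ip=parts[0],
            source_ip=parts[1],
            gateway=parts[2],
            rtt_ms=_number(parts[3], "ms"),
            rtt_sd_ms=_number(parts[4], "ms"),
            loss_pct=_number(parts[5], "%"),
            status=parts[6],
            substatus=parts[7],
        )
    except ValueError:
        return None
```

Running every line of the export through parse_gateway_status() and keeping the non-None results gives a quick timeline of when GW_Cellular flipped between "down" and "online".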

stephenw10

If anything I would expect 192.0.0.X to be the real connection and 192.168.225.1 to be something local. However, that isn't the modem subnet its GUI seems to be using.

So if that fails I'd try refusing leases from 192.168.225.1 instead.

Ximulate

I reached out to the cellular modem manufacturer, who was helpful. Apparently some of my modem config was wrong, so that now appears to be straightened out. However, I'm continuing to experience issues.
https://wirelessjoint.com/viewtopic.php?t=4191

Reviewing the logs from the last two lock-ups, I see the following happening several minutes beforehand. I was not in this morning, but saw my Blink cams & several other devices went offline. A few hours later, the alarm monitoring company called to report a com failure (which means the alarm was able to communicate for some time after the issue started). Also in the logs, I've noticed that both gateways report packet loss/offline within a few seconds of each other.

Mar  2 09:43:43 router unbound[12219]: [12219:1] error: ssl handshake failed crypto error:0A000416:SSL routines::sslv3 alert certificate unknown
Mar  2 09:43:43 router unbound[12219]: [12219:1] notice: ssl handshake failed 10.111.11.118 port 53295
Mar  2 09:44:22 router unbound[12219]: [12219:3] error: ssl handshake failed crypto error:0A000416:SSL routines::sslv3 alert certificate unknown
Mar  2 09:44:22 router unbound[12219]: [12219:3] notice: ssl handshake failed 10.111.11.115 port 62052
Mar  2 09:44:22 router unbound[12219]: [12219:2] error: ssl handshake failed crypto error:0A000416:SSL routines::sslv3 alert certificate unknown
Mar  2 09:44:22 router unbound[12219]: [12219:2] notice: ssl handshake failed 10.111.11.115 port 62053
Mar  2 09:45:08 router filterdns[36239]: merge_config: configuration reload
Mar  2 09:45:08 router filterdns[36239]: 	Adding Action: pf table: networkABC host: abc.duckdns.org
Mar  2 09:45:08 router filterdns[36239]: 	Adding Action: pf table: network123 host: 123.duckdns.org
[More of the above, then]
Mar  2 09:46:08 router filterdns[36239]: failed to resolve host ntp.org will retry later again.
Mar  2 09:46:08 router filterdns[36239]: failed to resolve host abc.duckdns.org will retry later again.
Mar  2 09:46:08 router filterdns[36239]: failed to resolve host 123.duckdns.org will retry later again.
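
To see which clients are behind the handshake failures, the "notice: ssl handshake failed <ip> port <n>" lines can be tallied per address. A rough Python sketch follows. The regular expression is written against the exact log format pasted above and is an assumption about how unbound formats these lines on other versions.

```python
import re
from collections import Counter

# Matches lines like:
#   ... unbound[12219]: [12219:1] notice: ssl handshake failed 10.111.11.118 port 53295
FAILED = re.compile(
    r"unbound\[\d+\]: \[\d+:\d+\] notice: ssl handshake failed (\S+) port \d+"
)


def handshake_failures_by_client(lines):
    """Count 'ssl handshake failed' notices per client IP address."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the whole system log (for example via open(path) as the lines argument) shows whether the failures come from one or two devices or from the whole LAN.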

Ximulate

@stephenw10
It seems like something in the router is preventing IPs (not just DNS resolving) from being found

stephenw10

What are those hosts at 10.111.11.115 and 10.111.11.118?

Ximulate

@stephenw10
.115 is an iPhone. Interesting thing here is that device would not have been on the network at that time (09:44). This episode & the one before, .115 was in the logs reporting the same thing minutes before the issue started (it would have been on the network the previous episode.)

Not sure yet what .118 is. It might be another iPhone.

stephenw10

Is it possible the system clock is wrong?

Ximulate

@stephenw10 no, the time is correct

Ximulate

This morning all seemed fine (smart TV was working, no complaints from the alarm, etc.) until I logged into my desktop PC and it would not load local (i.e. pfSense GUI) or WWW pages. I tried to SSH into the router, but no joy.

The logs are relatively quiet from midnight until the time I power cycled the router. I did not see any "ssl handshake failed crypto error."

There were several "filterdns 18089 Adding Action: pf table: XYZ host: xxx.xxx.xxx.xxx" prior to rebooting. I've seen this in the logs prior to other failures.

I also noticed several ntpd logs like this:

Mar 6 08:31:31 	ntpd 	87972 	Soliciting pool server 45.83.234.123

Tried "ntpq -c pe" per a Stack Exchange post. If I understand correctly, st:16 means out of sync:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.pfsense.pool. .POOL.          16 p    -   64    0    0.000   +0.000   0.000
 1.pool.ntp.org  .POOL.          16 p    -   64    0    0.000   +0.000   0.000
 2.pool.ntp.org  .POOL.          16 p    -   64    0    0.000   +0.000   0.000
 3.pool.ntp.org  .POOL.          16 p    -   64    0    0.000   +0.000   0.000
*65-100-46-166.d .SOCK.           1 u   38  128  377   74.742   +2.404   1.990
+ns1.your-site.c 216.218.254.202  3 u   58  128  377   72.907   +1.203   5.012
+104.156.246.53  204.9.54.119     2 u  103  128  377   40.948   -0.402   6.028

Dashboard shows correct time
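
A note on reading the ntpq output above, with a small Python sketch to go with it. As far as I understand ntpq, the .POOL. lines are placeholders for the pool names and normally sit at stratum 16 with reach 0, while a leading "*" marks the peer ntpd has actually selected and "+" marks other candidates. So a table with a "*" line would normally mean the clock is synced. The sketch assumes the ten-column peers layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter) and only treats "*", "+", "-" and "#" as tally marks.

```python
def parse_peer(line):
    """Parse one row of ntpq peers output; return None for non-data lines."""
    fields = line.split()
    if len(fields) != 10:
        return None
    remote = fields[0]
    tally = ""
    # Only punctuation tally codes are recognised here (assumption: letter
    # codes such as 'x' or 'o' are ignored because they could be hostnames).
    if remote[0] in "*+-#":
        tally, remote = remote[0], remote[1:]
    try:
        stratum = int(fields[2])
    except ValueError:
        return None  # header row or anything else that is not a peer
    return {
        "tally": tally,
        "remote": remote,
        "refid": fields[1],
        "stratum": stratum,
        "type": fields[3],
        "reach": fields[6],
    }


def summarize_peers(lines):
    """Return (selected_peer_or_None, number_of_pool_placeholders)."""
    peers = [p for p in (parse_peer(line) for line in lines) if p]
    selected = next((p for p in peers if p["tally"] == "*"), None)
    placeholders = sum(1 for p in peers if p["refid"] == ".POOL.")
    return selected, placeholders
```

If summarize_peers() returns a selected peer, the stratum 16 rows are most likely just the pool placeholders rather than a sign that ntpd is out of sync.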

Ximulate

NTP service is enabled with
0.pfsense.pool.ntp.org
1.pool.ntp.org
2.pool.ntp.org
3.pool.ntp.org

Same timeservers are input into System > General > Timeservers
I don't see any firewall rules that would block NTP requests.
I'm disabling NTP Server, as I don't think I'm using it.
I'm assuming the other timeservers listed in the ntpq results are requests from LAN devices

stephenw10

@Ximulate said in Router Locking Up (maybe due to excessive lan traffic?):

I logged into my desktop PC and it would not load local (i.e. pfSense GUI) or WWW pages. I tried to SSH into the router, but no joy.

How was it failing? Is it a DNS resolution failure? The services actually stopped responding on the firewall?

Progressively failing services like that could be a disk issue. Do you see gaps in the logging after recovering access?
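
One way to check for logging gaps without scrolling through everything is to sort the timestamps and report any jump larger than some threshold. A minimal Python sketch, assuming the classic syslog prefix seen in this thread ("Mar  2 09:43:43 ..."). Those stamps carry no year, so the caller has to supply one, and the five-minute default threshold is an arbitrary choice.

```python
import re
from datetime import datetime, timedelta

MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})")


def parse_stamp(line, year):
    """Return the datetime at the start of a syslog line, or None."""
    match = STAMP.match(line)
    if not match or match.group(1) not in MONTHS:
        return None
    day, hour, minute, second = (int(g) for g in match.groups()[1:])
    return datetime(year, MONTHS[match.group(1)], day, hour, minute, second)


def find_gaps(lines, year, min_gap=timedelta(minutes=5)):
    """Return (before, after) timestamp pairs separated by at least min_gap.

    Timestamps are sorted first, so newest-first exports also work.
    """
    stamps = sorted(
        s for s in (parse_stamp(line, year) for line in lines) if s is not None
    )
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a >= min_gap]
```

A gap that ends at the reboot is expected if the box was hung. Gaps in the middle of otherwise normal operation would point more towards the disk.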

Ximulate

@stephenw10

No, the IP addresses appear to be dropped, as if DHCP is failing or devices are not able to see other devices. In other words, if I type the pfSense router IP address into the browser it does not load... the browser does not see the pfSense GUI. Once this happens, the only way I'm able to recover access is power cycling the router.

At one point, I had my laptop connected to the serial console of the router. I was usually able to access the command menu that way. Occasionally, I could RDP to the laptop to access the command menu, but that would normally not work either.

I think I've tried this already, but I think I'll manually set the IP address of my desktop & laptop to see if they still communicate next time the network fails. Currently pfSense is handing out static leases to my desktop & a few other items, and dynamic to the rest.

On the rare occasion that I catch the network acting up but can get to the router GUI, I have not seen any failing services. I have also tried the pfSense tools in the CLI like "playback restartallwan" without success. Reboot was required.

stephenw10

Was the console responsive if you were at the laptop connected to it directly?

Ximulate

@stephenw10 To the best of my recollection, at least within the last few weeks, the console has always been accessible via serial.

stephenw10

Ok then I'd try to connect out from it when this happens and see what (if anything) still works.

Ximulate

@stephenw10 maybe I misunderstood your last question. When the network/router fails, I have been able to access the console via serial connection but devices on the network/router still do not communicate. I've tried restarting php, restarting the web configurator, using the playback scripts in the tools... none of those resolve the issue, except rebooting

stephenw10

Right but can you ping out from the console to external targets? By IP and FQDN? What about internal targets?
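
The three checks can be scripted so they are quick to repeat during the next lock-up. This is only a sketch: it assumes a Python 3 interpreter and the system ping command are available wherever it runs, and the wording of the conclusions in interpret() is my own rough reading, not a diagnosis.

```python
import socket
import subprocess


def ping(host, count=3, timeout_s=15):
    """Return True if the system ping exits with status 0 for host.

    Exit status 0 generally means at least one reply was received.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def resolves(name):
    """Return True if the resolver can turn name into an address."""
    try:
        socket.getaddrinfo(name, None)
    except socket.gaierror:
        return False
    return True


def interpret(external_ip_ok, dns_ok, internal_ok):
    """Turn the three results into a rough, non-authoritative summary."""
    if not external_ip_ok and not internal_ok:
        return "no traffic passing on either side"
    if external_ip_ok and not dns_ok:
        return "routing out works but name resolution fails"
    if external_ip_ok and dns_ok and not internal_ok:
        return "WAN side works, LAN side does not"
    if external_ip_ok and dns_ok and internal_ok:
        return "everything reachable from the firewall itself"
    return "mixed result, check each target by hand"


def run_checks(external_ip, fqdn, internal_ip):
    """Run all three checks and return the summary string."""
    return interpret(ping(external_ip), resolves(fqdn), ping(internal_ip))
```

For example, run_checks("9.9.9.9", "ntp.org", "10.111.11.115") uses targets that already appear in this thread. Substitute whatever internal host is known to be powered on.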

Ximulate

@stephenw10
I had to go back to the first post to refresh my memory, but yes I did also try pinging back then
https://forum.netgate.com/post/1152732

When I can get into the GUI, I don't see any issues in the dashboard like down WAN, CPU or memory issues. Most of the time, I don't notice 'til it's too late, so I can't connect to the GUI. I have set up my laptop to the router using the console. I've tried various options in the menu, including restarting PHP, the web configurator, tools like "playback restartallwan" and others to no avail. The one interesting thing is, although the LAN devices aren't connecting, I can, from the console, sometimes ping external IPs like 9.9.9.9 OK, but 8.8.8.8 might not respond. Internal LAN devices don't respond to ping either.

Now a lot has transpired since that post, so I'll try to ping next time. However, I do think I'm going to have the same/similar results. I just reconnected my laptop via serial to the console so it's ready to go as soon as I can get to it.

BTW... Thank you for hanging in there with me on this!

stephenw10

Hmm, unable to ping any internal IP address seems like it just stops moving traffic. Unlikely to be a NIC issue on an APU2 assuming you are only using the igb NICs.

Hmm, something of a mystery. You can try booting in verbose mode. If it is something hardware related that might show something.

Beyond that you can try loading the debug kernel:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/debug-kernel.html