SG3100 needs to reboot every few days after 2.4.4 upgrade
-
Hello, I have an SG3100 that has been running fine since April when it was installed. After the latest update to 2.4.4 was installed (I did this through the GUI Update mechanism), I seem to have developed a new issue. After a few days (unclear if exact duration is the same each time), I seem to lose the ability for clients to make new outbound connections. When this happens, I log into the pfSense GUI and notice that even though it logs in quickly, even just waiting for the main dashboard page to finish displaying takes way longer than the usual couple of seconds. It seems to take a long time to check the status of any available updates.
To fix this problem, all I have done is reboot the SG3100 and when it comes up everything is fine again. All clients work fine for another few days until the problem comes back. I haven't done much config from factory settings, though I have setup a few things back when I was on 2.3. For example, enabled DNS resolver with forwarding to public DNS over TLS hosts. And enabled Snort with a couple of signatures and IP lists. And ntopng.
I realize there could be an issue with a package or its config that was working fine before. But rather than just disabling everything blind and then crossing my fingers, I'd prefer to diagnose and resolve the issue. Does anyone have any suggestions as to any logs I can download or copy off the box the next time this happens, before I restart it?
-
Is your WAN DHCP? Do you have any advanced settings enabled in the WAN DHCP options?
If so, see https://redmine.pfsense.org/issues/8507#note-23
-
@jimp Thanks for the reply! Yes I have a DHCP assigned WAN interface. No advanced settings are enabled. All settings on the page appear to be at defaults. At the very bottom, I do have the 2 checkboxes checked (Block private networks and loopback addresses, and Block bogon networks).
-
I am having a similar issue on my T620 Plus whitebox, a very simple home setup with just 1 DHCP WAN + 1 OpenVPN client. I had the following in the option modifiers coming from 2.4.3:
supersede dhcp-lease-time 7200
I tried adding
supersede interface-mtu 0
as suggested in the patch comments, but this didn't seem to make a difference. In the gateway logs I keep gettingsendto error: 64
when the WAN is down. I am also experiencing the issues with the web UI home page taking a very long time to load, although navigating to other pages as speedy as expected.I've reverted my DHCP options to the blank defaults (although the original reason I needed them was to fix a issue with my ISP), and will report how it goes. It usually takes between 6-24 hours for it to go down again.
-
Are you sure DNS isn't broken but IP is still working? I recently had OOM killer go after unbound on a sg3100 because my DNSBL was too big.
Also I'd try disabling or uninstalling some of the packages.
-
@bbrendon said in SG3100 needs to reboot every few days after 2.4.4 upgrade:
Are you sure DNS isn't broken but IP is still working? I recently had OOM killer go after unbound on a sg3100 because my DNSBL was too big.
Also I'd try disabling or uninstalling some of the packages.
Nope, my RAM is only 6% utilized and I don't see any evidence of Unbound dying. The only package I have installed is Traffic Totals.
-
@bbrendon No I am not sure of that at all.
Just an update, it happened again, and I noticed that it happened pretty much exactly 4 days after the last one.
Just in case it's related to DNS, I went poking around my Resolver settings. In previous releases, back when it wasn't exposed in the settings list, I had set some custom options in the resolver to use TLS for outbound queries to my forward targets. So I went ahead and removed all custom options, enabled forwarding mode, and checked the new box "Use SSL/TLS for outgoing DNS Queries to Forwarding Servers." I don't think I am using DNSBL, as that is part of pfBlockerNG is it not? My only installed packages are ntopng and snort. Though I admit the problem could be related to something leaking... even though it never happened before 2.4.4.
Any general advice as to exactly WHAT I should do to capture logs once the problem starts happening, if it happens again?
-
Another update on my end - I disabled the one OpenVPN client I have running and so far my WAN has been stable. The OpenVPN configuration does use push routes - is it possible that interruptions in OpenVPN connectivity could be affecting my WAN gateway at all?
-
I have another update to share. The problem happened again, after perhaps 6.5 days. It appears the problem isn't that connectivity goes away... DNS Resolver is dying. It appears there is some kind of memory leak in unbound when using DNS over TLS, but that is just an assumption. When I manually start the DNS Resolver service once it gets into this state, it starts working again with no reboot required. I have pasted the log entries below in case anyone has any insight to share about how to resolve this. No pun intended.
Last 50 DNS Resolver Log Entries. (Maximum 50) Oct 21 12:12:25 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853 Oct 21 12:12:25 unbound 71561:1 error: ssl handshake failed crypto error:1409C041:SSL routines:ssl3_setup_read_buffer:malloc failure Oct 21 12:12:25 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853 Oct 21 12:12:25 unbound 71561:1 error: ssl handshake failed crypto error:1409C041:SSL routines:ssl3_setup_read_buffer:malloc failure Oct 21 12:12:25 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853 Oct 21 12:12:46 unbound 71561:1 error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error Oct 21 12:12:46 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853 Oct 21 12:12:46 unbound 71561:1 error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error Oct 21 12:12:46 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853 Oct 21 12:12:46 unbound 71561:1 error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error Oct 21 12:12:46 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853 Oct 21 12:12:48 unbound 71561:0 info: service stopped (unbound 1.7.3). Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 0: 92 queries, 16 answers from cache, 76 recursions, 0 prefetch, 0 rejected by ip ratelimiting Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 0: requestlist max 2 avg 0.276316 exceeded 0 jostled 0 Oct 21 12:12:48 unbound 71561:0 info: average recursion processing time 1.013904 sec Oct 21 12:12:48 unbound 71561:0 info: histogram of recursion processing times Oct 21 12:12:48 unbound 71561:0 info: [25%]=0.049152 median[50%]=0.109227 [75%]=0.220753 Oct 21 12:12:48 unbound 71561:0 info: lower(secs) upper(secs) recursions Oct 21 12:12:48 unbound 71561:0 info: 0.000000 0.000001 12 Oct 21 12:12:48 unbound 71561:0 info: 0.032768 0.065536 14 Oct 21 12:12:48 unbound 71561:0 info: 0.065536 0.131072 18 Oct 21 12:12:48 unbound 71561:0 info: 0.131072 0.262144 19 Oct 21 12:12:48 unbound 71561:0 info: 0.262144 0.524288 9 Oct 21 12:12:48 unbound 71561:0 info: 16.000000 32.000000 4 Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 1: 50 queries, 15 answers from cache, 35 recursions, 0 prefetch, 0 rejected by ip ratelimiting Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 1: requestlist max 1 avg 0.171429 exceeded 0 jostled 0 Oct 21 12:12:48 unbound 71561:0 info: average recursion processing time 1.271249 sec Oct 21 12:12:48 unbound 71561:0 info: histogram of recursion processing times Oct 21 12:12:48 unbound 71561:0 info: [25%]=0.053248 median[50%]=0.155648 [75%]=0.32768 Oct 21 12:12:48 unbound 71561:0 info: lower(secs) upper(secs) recursions Oct 21 12:12:48 unbound 71561:0 info: 0.000000 0.000001 4 Oct 21 12:12:48 unbound 71561:0 info: 0.016384 0.032768 1 Oct 21 12:12:48 unbound 71561:0 info: 0.032768 0.065536 6 Oct 21 12:12:48 unbound 71561:0 info: 0.065536 0.131072 5 Oct 21 12:12:48 unbound 71561:0 info: 0.131072 0.262144 8 Oct 21 12:12:48 unbound 71561:0 info: 0.262144 0.524288 9 Oct 21 12:12:48 unbound 71561:0 info: 16.000000 32.000000 2 Oct 21 12:12:48 unbound 71561:0 notice: Restart of unbound 1.7.3. Oct 21 12:12:48 unbound 71561:0 notice: init module 0: validator Oct 21 12:12:48 unbound 71561:0 notice: init module 1: iterator Oct 21 12:12:48 unbound 71561:0 error: malloc failed Oct 21 12:12:48 unbound 71561:0 error: could not create outgoing sockets Oct 21 12:12:48 unbound 71561:0 error: ./util/alloc.c at 167 could not pthread_spin_destroy(&alloc->lock): Invalid argument Oct 21 12:12:48 unbound 71561:0 fatal error: Could not initialize main thread
-
From the logs that definitely looks like a memory issue.
I see they released unbound 1.8.1 earlier this month and it has some memory leak fixes. We'll get that pulled in soon, it will be in -p1.
EDIT: Issue for tracking: https://redmine.pfsense.org/issues/9059 (also has a link to the unbound release notes)
-
So maybe my DNSBL wasn’t too big. I guess I’m still leaking memory and it’s just going to take longer to fail since I’m not using as much memory now. Interesting.
-
@bbrendon said in SG3100 needs to reboot every few days after 2.4.4 upgrade:
So maybe my DNSBL wasn’t too big. I guess I’m still leaking memory and it’s just going to take longer to fail since I’m not using as much memory now. Interesting.
They fixed several leaks, from the release notes:
A memory leak in the TLS lookup code is fixed. Leaked requests in the requestlist are fixed.
- free memory leaks in config strlist and str2list insert functions.
- Free memory leak in config strlist append.
- Fix memory leak when message parse fails partway through copy.
So it is possible if you have an issue with unbound that happens over time, it may be due to a memory leak.
-
Is there a way to restart unbound every 24 hours to try and mitigate the problem until 2.4.4-p1?
Thanks! -
You can install the cron package and then make a cron entry that periodically does
/usr/local/sbin/pfSsh.php playback svc restart unbound
on whatever schedule you want. -
I don't know if I just caught it at a funny time, but I just went to go look at the DNS Resolver log page again, but it looks like the service restarted itself multiple times within just the last few minutes. I wonder if it has been doing this all day? I am wondering if installing cron and having it restart periodically (say daily) is even worthwhile if it is restarting itself frequently anyway...
I have a mind to disable DNS over TLS for now and wait until 2.4.4-p1 comes out before I try to re-enable it. DNS privacy appeals to me, but in principle only - no real reasons to hide. It's not worth periodically losing DNS functionality. I haven't really decided to turn it off yet, but what do other people think about the trade-off?
-
I woke up without internet this morning :(
Is there a way to install unbound 1.8.1 manually?
-
@bbrendon said in SG3100 needs to reboot every few days after 2.4.4 upgrade:
I woke up without internet this morning :(
Is there a way to install unbound 1.8.1 manually?
Why don't you follow the advice of @jimp above? Install the cron package, then add a restart of unbound every ~6 hours (or hour or whatever works best) as a mitigation?
-
For what it's worth, I have checked back at random periods at the DNS Resolver log entries (but NOT when the problem occurred when I had to manually start it). The few times I checked the DNS Resolver log, I found each time that unbound had stopped (message like info: service stopped (unbound 1.7.3). ) a couple of times within a 5 minute period. I am not sure why it stops so often, and why sometimes it will stop and not start back on its own without my intervention. But I'm not sure adding a restart every hour or few hours would make a difference. If the service is stopping and starting on its own fairly often, one would think adding a cron-based restart wouldn't make a big difference.
-
@muppet said in SG3100 needs to reboot every few days after 2.4.4 upgrade:
restart of unbound every ~6 hours
I could but that's janky.
-
Edit: Updated to remove the bit about it being unsupported.
Unbound 1.8.1 has been pushed to the 2.4.4 repo.
If you do a "pkg update && pkg upgrade" you can find it. If you wanted to be more careful (always a good idea) you could do "pkg update && pkg upgrade unbound" then you'll only get an upgraded unbound and strongswan (ipsec package)
KEEP IN MIND YOU'LL BE RUNNING A SEMI-UNSUPPORTED VERSION OF PFSENSE BY DOING THIS- So only do it if you really really need to. But it should fix your unbound problems.Thanks heaps to netcat6549 in the IRC channel for explaining some of the above to me.