SG3100 needs to reboot every few days after 2.4.4 upgrade
-
Another update on my end - I disabled the one OpenVPN client I have running and so far my WAN has been stable. The OpenVPN configuration does use push routes - is it possible that interruptions in OpenVPN connectivity could be affecting my WAN gateway at all?
-
I have another update to share. The problem happened again, after roughly 6.5 days. It turns out the problem isn't that connectivity goes away; the DNS Resolver is dying. There seems to be some kind of memory leak in unbound when using DNS over TLS, but that is just an assumption. When I manually start the DNS Resolver service once it gets into this state, DNS starts working again with no reboot required. I have pasted the log entries below in case anyone has any insight to share about how to resolve this. No pun intended.
Last 50 DNS Resolver Log Entries. (Maximum 50)
Oct 21 12:12:25 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853
Oct 21 12:12:25 unbound 71561:1 error: ssl handshake failed crypto error:1409C041:SSL routines:ssl3_setup_read_buffer:malloc failure
Oct 21 12:12:25 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853
Oct 21 12:12:25 unbound 71561:1 error: ssl handshake failed crypto error:1409C041:SSL routines:ssl3_setup_read_buffer:malloc failure
Oct 21 12:12:25 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853
Oct 21 12:12:46 unbound 71561:1 error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure
Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure
Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error
Oct 21 12:12:46 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853
Oct 21 12:12:46 unbound 71561:1 error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure
Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure
Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error
Oct 21 12:12:46 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853
Oct 21 12:12:46 unbound 71561:1 error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure
Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure
Oct 21 12:12:46 unbound 71561:1 error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error
Oct 21 12:12:46 unbound 71561:1 notice: ssl handshake failed 1.1.1.1 port 853
Oct 21 12:12:48 unbound 71561:0 info: service stopped (unbound 1.7.3).
Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 0: 92 queries, 16 answers from cache, 76 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 0: requestlist max 2 avg 0.276316 exceeded 0 jostled 0
Oct 21 12:12:48 unbound 71561:0 info: average recursion processing time 1.013904 sec
Oct 21 12:12:48 unbound 71561:0 info: histogram of recursion processing times
Oct 21 12:12:48 unbound 71561:0 info: [25%]=0.049152 median[50%]=0.109227 [75%]=0.220753
Oct 21 12:12:48 unbound 71561:0 info: lower(secs) upper(secs) recursions
Oct 21 12:12:48 unbound 71561:0 info: 0.000000 0.000001 12
Oct 21 12:12:48 unbound 71561:0 info: 0.032768 0.065536 14
Oct 21 12:12:48 unbound 71561:0 info: 0.065536 0.131072 18
Oct 21 12:12:48 unbound 71561:0 info: 0.131072 0.262144 19
Oct 21 12:12:48 unbound 71561:0 info: 0.262144 0.524288 9
Oct 21 12:12:48 unbound 71561:0 info: 16.000000 32.000000 4
Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 1: 50 queries, 15 answers from cache, 35 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 21 12:12:48 unbound 71561:0 info: server stats for thread 1: requestlist max 1 avg 0.171429 exceeded 0 jostled 0
Oct 21 12:12:48 unbound 71561:0 info: average recursion processing time 1.271249 sec
Oct 21 12:12:48 unbound 71561:0 info: histogram of recursion processing times
Oct 21 12:12:48 unbound 71561:0 info: [25%]=0.053248 median[50%]=0.155648 [75%]=0.32768
Oct 21 12:12:48 unbound 71561:0 info: lower(secs) upper(secs) recursions
Oct 21 12:12:48 unbound 71561:0 info: 0.000000 0.000001 4
Oct 21 12:12:48 unbound 71561:0 info: 0.016384 0.032768 1
Oct 21 12:12:48 unbound 71561:0 info: 0.032768 0.065536 6
Oct 21 12:12:48 unbound 71561:0 info: 0.065536 0.131072 5
Oct 21 12:12:48 unbound 71561:0 info: 0.131072 0.262144 8
Oct 21 12:12:48 unbound 71561:0 info: 0.262144 0.524288 9
Oct 21 12:12:48 unbound 71561:0 info: 16.000000 32.000000 2
Oct 21 12:12:48 unbound 71561:0 notice: Restart of unbound 1.7.3.
Oct 21 12:12:48 unbound 71561:0 notice: init module 0: validator
Oct 21 12:12:48 unbound 71561:0 notice: init module 1: iterator
Oct 21 12:12:48 unbound 71561:0 error: malloc failed
Oct 21 12:12:48 unbound 71561:0 error: could not create outgoing sockets
Oct 21 12:12:48 unbound 71561:0 error: ./util/alloc.c at 167 could not pthread_spin_destroy(&alloc->lock): Invalid argument
Oct 21 12:12:48 unbound 71561:0 fatal error: Could not initialize main thread
-
From the logs that definitely looks like a memory issue.
I see they released unbound 1.8.1 earlier this month and it has some memory leak fixes. We'll get that pulled in soon; it will be in -p1.
EDIT: Issue for tracking: https://redmine.pfsense.org/issues/9059 (also has a link to the unbound release notes)
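For anyone wanting to check their own resolver log for the same symptom, here is a minimal sketch that counts allocation-failure lines. The function name count_malloc_failures is hypothetical, and the two patterns are simply the two wordings that appear in the log posted above.

```shell
#!/bin/sh
# Count log lines on stdin that report an allocation failure.
# Matches "malloc failure" (the OpenSSL wording in the log above)
# and "malloc failed" (the wording unbound itself used).
# count_malloc_failures is a hypothetical name chosen for this sketch.
count_malloc_failures() {
    grep -c -E 'malloc (failure|failed)'
}

# Two lines copied from the log above, plus one unrelated line.
count_malloc_failures <<'EOF'
Oct 21 12:12:25 unbound 71561:1 error: ssl handshake failed crypto error:1409C041:SSL routines:ssl3_setup_read_buffer:malloc failure
Oct 21 12:12:48 unbound 71561:0 error: malloc failed
Oct 21 12:12:48 unbound 71561:0 notice: init module 0: validator
EOF
```

How you feed it the real log is an assumption on my part: on 2.4.x the resolver log should be a circular log, so something like piping the output of clog /var/log/resolver.log into the function is probably what you want, but check the path on your own box first.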
-
So maybe my DNSBL wasn’t too big. I guess I’m still leaking memory and it’s just going to take longer to fail since I’m not using as much memory now. Interesting.
-
They fixed several leaks, from the release notes:
A memory leak in the TLS lookup code is fixed. Leaked requests in the requestlist are fixed.
- free memory leaks in config strlist and str2list insert functions.
- Free memory leak in config strlist append.
- Fix memory leak when message parse fails partway through copy.
So if you have an issue with unbound that only shows up over time, it may well be due to one of these memory leaks.
-
Is there a way to restart unbound every 24 hours to try and mitigate the problem until 2.4.4-p1?
Thanks!
-
You can install the cron package and then make a cron entry that periodically does
/usr/local/sbin/pfSsh.php playback svc restart unbound
on whatever schedule you want.
-
I don't know if I just caught it at a funny time, but when I looked at the DNS Resolver log page again, it looked like the service had restarted itself multiple times within just the last few minutes. I wonder if it has been doing this all day? If it is restarting itself that frequently anyway, I am not sure installing cron and having it restart periodically (say daily) is even worthwhile...
I have a mind to disable DNS over TLS for now and wait until 2.4.4-p1 comes out before I try to re-enable it. DNS privacy appeals to me, but in principle only - no real reasons to hide. It's not worth periodically losing DNS functionality. I haven't really decided to turn it off yet, but what do other people think about the trade-off?
-
I woke up without internet this morning :(
Is there a way to install unbound 1.8.1 manually?
-
Why don't you follow the advice of @jimp above? Install the cron package, then add a restart of unbound every ~6 hours (or hour or whatever works best) as a mitigation?
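A rough sketch of what such an entry could look like, assuming a system-crontab style line with a user field. The restart command is the one @jimp posted; the helper name cron_line and the choice of root as the user are my own assumptions, and the Cron package GUI asks for the same pieces as separate form fields rather than one line.

```shell
#!/bin/sh
# Restart command as posted earlier in this thread.
restart_cmd="/usr/local/sbin/pfSsh.php playback svc restart unbound"

# cron_line is a hypothetical helper: print a crontab-style line that
# runs the restart at minute 0 of every Nth hour.
# Field order: minute hour day-of-month month day-of-week user command
cron_line() {
    hours="$1"
    printf '0 */%s * * * root %s\n' "$hours" "$restart_cmd"
}

# Every 6 hours, as suggested above.
cron_line 6
```

This prints "0 */6 * * * root /usr/local/sbin/pfSsh.php playback svc restart unbound". Pick whatever interval is comfortably shorter than the time it takes your box to run out of memory.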
-
For what it's worth, I have checked the DNS Resolver log entries at random times (but NOT at the point when the problem occurred and I had to start the service manually). Each of the few times I checked, I found that unbound had stopped (a message like "info: service stopped (unbound 1.7.3).") a couple of times within a 5 minute period. I am not sure why it stops so often, or why it will sometimes stop and not start back up without my intervention. But if the service is already stopping and starting on its own fairly often, I doubt that adding a cron-based restart every hour or every few hours would make a big difference.
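One variation that targets the "stops and does not come back" case specifically is a check that restarts unbound only when no unbound process exists. This is just a sketch under assumptions: CHECK_CMD, RESTART_CMD and ensure_unbound_running are names made up here, and pgrep -x unbound is assumed to be a good enough test for "running".

```shell
#!/bin/sh
# Both commands can be overridden from the environment so the logic can
# be tried without touching a real service. The variable names are
# hypothetical; the default restart command is the one from this thread.
: "${CHECK_CMD:=pgrep -x unbound}"
: "${RESTART_CMD:=/usr/local/sbin/pfSsh.php playback svc restart unbound}"

# Restart only if the check command reports that unbound is not running.
ensure_unbound_running() {
    if $CHECK_CMD >/dev/null 2>&1; then
        echo "unbound is running"
    else
        echo "unbound is not running, restarting"
        $RESTART_CMD
    fi
}
```

Calling ensure_unbound_running from a cron job every few minutes would leave a healthy resolver alone and only act once the process has actually died. It would not catch a resolver that is still running but failing queries.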
-
@muppet said in SG3100 needs to reboot every few days after 2.4.4 upgrade:
restart of unbound every ~6 hours
I could but that's janky.
-
Edit: Updated to remove the bit about it being unsupported.
Unbound 1.8.1 has been pushed to the 2.4.4 repo.
If you run "pkg update && pkg upgrade" you will get it. If you want to be more careful (always a good idea), run "pkg update && pkg upgrade unbound" instead; then you'll only get an upgraded unbound and strongswan (the IPsec package).
KEEP IN MIND YOU'LL BE RUNNING A SEMI-UNSUPPORTED VERSION OF PFSENSE BY DOING THIS, so only do it if you really, really need to. But it should fix your unbound problems.
Thanks heaps to netcat6549 in the IRC channel for explaining some of the above to me.
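To confirm afterwards that the box really ended up on 1.8.1 or newer, a small version comparison can help. This is a sketch only: version_at_least is a hypothetical helper, it handles plain dotted numeric versions, and it drops any package revision suffix after "_" or ",".

```shell
#!/bin/sh
# version_at_least HAVE WANT
# Succeeds when dotted version HAVE is greater than or equal to WANT.
# Compares up to three numeric components; a suffix such as "_1" or
# ",1" is stripped first. Hypothetical helper for illustration.
version_at_least() {
    have="${1%%[_,]*}"
    want="${2%%[_,]*}"
    old_ifs="$IFS"
    IFS=.
    set -- $have
    h1="${1:-0}"; h2="${2:-0}"; h3="${3:-0}"
    set -- $want
    w1="${1:-0}"; w2="${2:-0}"; w3="${3:-0}"
    IFS="$old_ifs"
    # The first differing component decides the result.
    [ "$h1" -ne "$w1" ] && { [ "$h1" -gt "$w1" ]; return; }
    [ "$h2" -ne "$w2" ] && { [ "$h2" -gt "$w2" ]; return; }
    [ "$h3" -ge "$w3" ]
}

# The two versions discussed in this thread.
if version_at_least 1.7.3 1.8.1; then
    echo "1.7.3 already has the fixes"
else
    echo "1.7.3 predates 1.8.1, upgrade needed"
fi
```

On the firewall itself the installed version could be fed in with something like version_at_least "$(pkg query %v unbound)" 1.8.1, assuming the package is simply named unbound on your install.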
-
That is not unsupported. We pushed it out specifically to address the unbound memory leaks. It's perfectly fine to run it either way, whichever way works better for someone.
In fact if someone were to upgrade to 2.4.4 today, they'd automatically pull in the new version.
tl;dr: If you have problems with unbound eating memory, upgrade, don't suffer until -p1, even if it is coming soon.
-
Ahhh ok, thank you
I was asking in the IRC channel about how "supported" the pkg update/pkg upgrade commands were. The feedback I got was that it would be an unsupported thing to do. Thank you for the clarification!
-
I could see how someone might think that, but in this case we put out some updates for a few issues that some had concerns about (unbound, strongSwan, curl, libssh) and if someone needs to, they can update by hand. We don't do that often, but in this case the issues warranted an OOB update of that nature.