SG3100 needs to reboot every few days after 2.4.4 upgrade



  • Hello, I have an SG3100 that has been running fine since April when it was installed. After the latest update to 2.4.4 was installed (I did this through the GUI Update mechanism), I seem to have developed a new issue. After a few days (unclear if exact duration is the same each time), I seem to lose the ability for clients to make new outbound connections. When this happens, I log into the pfSense GUI and notice that even though it logs in quickly, even just waiting for the main dashboard page to finish displaying takes way longer than the usual couple of seconds. It seems to take a long time to check the status of any available updates.

    To fix this problem, all I have done is reboot the SG3100 and when it comes up everything is fine again. All clients work fine for another few days until the problem comes back. I haven't done much config from factory settings, though I have setup a few things back when I was on 2.3. For example, enabled DNS resolver with forwarding to public DNS over TLS hosts. And enabled Snort with a couple of signatures and IP lists. And ntopng.

    I realize there could be an issue with a package or its config that was working fine before. But rather than just disabling everything blind and then crossing my fingers, I'd prefer to diagnose and resolve the issue. Does anyone have any suggestions as to any logs I can download or copy off the box the next time this happens, before I restart it?


  • Rebel Alliance Developer Netgate

    Is your WAN DHCP? Do you have any advanced settings enabled in the WAN DHCP options?

    If so, see https://redmine.pfsense.org/issues/8507#note-23



  • @jimp Thanks for the reply! Yes I have a DHCP assigned WAN interface. No advanced settings are enabled. All settings on the page appear to be at defaults. At the very bottom, I do have the 2 checkboxes checked (Block private networks and loopback addresses, and Block bogon networks).



  • I am having a similar issue on my T620 Plus whitebox, a very simple home setup with just 1 DHCP WAN + 1 OpenVPN client. I had the following in the option modifiers coming from 2.4.3: supersede dhcp-lease-time 7200

    I tried adding supersede interface-mtu 0 as suggested in the patch comments, but this didn't seem to make a difference. In the gateway logs I keep getting sendto error: 64 when the WAN is down. I am also experiencing the issues with the web UI home page taking a very long time to load, although navigating to other pages as speedy as expected.

    I've reverted my DHCP options to the blank defaults (although the original reason I needed them was to fix a issue with my ISP), and will report how it goes. It usually takes between 6-24 hours for it to go down again.



  • Are you sure DNS isn't broken but IP is still working? I recently had OOM killer go after unbound on a sg3100 because my DNSBL was too big.

    Also I'd try disabling or uninstalling some of the packages.



  • @bbrendon said in SG3100 needs to reboot every few days after 2.4.4 upgrade:

    Are you sure DNS isn't broken but IP is still working? I recently had OOM killer go after unbound on a sg3100 because my DNSBL was too big.

    Also I'd try disabling or uninstalling some of the packages.

    Nope, my RAM is only 6% utilized and I don't see any evidence of Unbound dying. The only package I have installed is Traffic Totals.



  • @bbrendon No I am not sure of that at all.

    Just an update, it happened again, and I noticed that it happened pretty much exactly 4 days after the last one.

    Just in case it's related to DNS, I went poking around my Resolver settings. In previous releases, back when it wasn't exposed in the settings list, I had set some custom options in the resolver to use TLS for outbound queries to my forward targets. So I went ahead and removed all custom options, enabled forwarding mode, and checked the new box "Use SSL/TLS for outgoing DNS Queries to Forwarding Servers." I don't think I am using DNSBL, as that is part of pfBlockerNG is it not? My only installed packages are ntopng and snort. Though I admit the problem could be related to something leaking... even though it never happened before 2.4.4.

    Any general advice as to exactly WHAT I should do to capture logs once the problem starts happening, if it happens again?



  • Another update on my end - I disabled the one OpenVPN client I have running and so far my WAN has been stable. The OpenVPN configuration does use push routes - is it possible that interruptions in OpenVPN connectivity could be affecting my WAN gateway at all?



  • I have another update to share. The problem happened again, after perhaps 6.5 days. It appears the problem isn't that connectivity goes away... DNS Resolver is dying. It appears there is some kind of memory leak in unbound when using DNS over TLS, but that is just an assumption. When I manually start the DNS Resolver service once it gets into this state, it starts working again with no reboot required. I have pasted the log entries below in case anyone has any insight to share about how to resolve this. No pun intended.

    Last 50 DNS Resolver Log Entries. (Maximum 50)
    Oct 21 12:12:25 	unbound 	71561:1 	notice: ssl handshake failed 1.1.1.1 port 853
    Oct 21 12:12:25 	unbound 	71561:1 	error: ssl handshake failed crypto error:1409C041:SSL routines:ssl3_setup_read_buffer:malloc failure
    Oct 21 12:12:25 	unbound 	71561:1 	notice: ssl handshake failed 1.1.1.1 port 853
    Oct 21 12:12:25 	unbound 	71561:1 	error: ssl handshake failed crypto error:1409C041:SSL routines:ssl3_setup_read_buffer:malloc failure
    Oct 21 12:12:25 	unbound 	71561:1 	notice: ssl handshake failed 1.1.1.1 port 853
    Oct 21 12:12:46 	unbound 	71561:1 	error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure
    Oct 21 12:12:46 	unbound 	71561:1 	error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure
    Oct 21 12:12:46 	unbound 	71561:1 	error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error
    Oct 21 12:12:46 	unbound 	71561:1 	notice: ssl handshake failed 1.1.1.1 port 853
    Oct 21 12:12:46 	unbound 	71561:1 	error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure
    Oct 21 12:12:46 	unbound 	71561:1 	error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure
    Oct 21 12:12:46 	unbound 	71561:1 	error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error
    Oct 21 12:12:46 	unbound 	71561:1 	notice: ssl handshake failed 1.1.1.1 port 853
    Oct 21 12:12:46 	unbound 	71561:1 	error: ssl handshake failed crypto error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure
    Oct 21 12:12:46 	unbound 	71561:1 	error: and additionally crypto error:0607B041:digital envelope routines:EVP_CipherInit_ex:malloc failure
    Oct 21 12:12:46 	unbound 	71561:1 	error: and additionally crypto error:140D1044:SSL routines:tls1_change_cipher_state:internal error
    Oct 21 12:12:46 	unbound 	71561:1 	notice: ssl handshake failed 1.1.1.1 port 853
    Oct 21 12:12:48 	unbound 	71561:0 	info: service stopped (unbound 1.7.3).
    Oct 21 12:12:48 	unbound 	71561:0 	info: server stats for thread 0: 92 queries, 16 answers from cache, 76 recursions, 0 prefetch, 0 rejected by ip ratelimiting
    Oct 21 12:12:48 	unbound 	71561:0 	info: server stats for thread 0: requestlist max 2 avg 0.276316 exceeded 0 jostled 0
    Oct 21 12:12:48 	unbound 	71561:0 	info: average recursion processing time 1.013904 sec
    Oct 21 12:12:48 	unbound 	71561:0 	info: histogram of recursion processing times
    Oct 21 12:12:48 	unbound 	71561:0 	info: [25%]=0.049152 median[50%]=0.109227 [75%]=0.220753
    Oct 21 12:12:48 	unbound 	71561:0 	info: lower(secs) upper(secs) recursions
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.000000 0.000001 12
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.032768 0.065536 14
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.065536 0.131072 18
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.131072 0.262144 19
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.262144 0.524288 9
    Oct 21 12:12:48 	unbound 	71561:0 	info: 16.000000 32.000000 4
    Oct 21 12:12:48 	unbound 	71561:0 	info: server stats for thread 1: 50 queries, 15 answers from cache, 35 recursions, 0 prefetch, 0 rejected by ip ratelimiting
    Oct 21 12:12:48 	unbound 	71561:0 	info: server stats for thread 1: requestlist max 1 avg 0.171429 exceeded 0 jostled 0
    Oct 21 12:12:48 	unbound 	71561:0 	info: average recursion processing time 1.271249 sec
    Oct 21 12:12:48 	unbound 	71561:0 	info: histogram of recursion processing times
    Oct 21 12:12:48 	unbound 	71561:0 	info: [25%]=0.053248 median[50%]=0.155648 [75%]=0.32768
    Oct 21 12:12:48 	unbound 	71561:0 	info: lower(secs) upper(secs) recursions
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.000000 0.000001 4
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.016384 0.032768 1
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.032768 0.065536 6
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.065536 0.131072 5
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.131072 0.262144 8
    Oct 21 12:12:48 	unbound 	71561:0 	info: 0.262144 0.524288 9
    Oct 21 12:12:48 	unbound 	71561:0 	info: 16.000000 32.000000 2
    Oct 21 12:12:48 	unbound 	71561:0 	notice: Restart of unbound 1.7.3.
    Oct 21 12:12:48 	unbound 	71561:0 	notice: init module 0: validator
    Oct 21 12:12:48 	unbound 	71561:0 	notice: init module 1: iterator
    Oct 21 12:12:48 	unbound 	71561:0 	error: malloc failed
    Oct 21 12:12:48 	unbound 	71561:0 	error: could not create outgoing sockets
    Oct 21 12:12:48 	unbound 	71561:0 	error: ./util/alloc.c at 167 could not pthread_spin_destroy(&alloc->lock): Invalid argument
    Oct 21 12:12:48 	unbound 	71561:0 	fatal error: Could not initialize main thread
    

  • Rebel Alliance Developer Netgate

    From the logs that definitely looks like a memory issue.

    I see they released unbound 1.8.1 earlier this month and it has some memory leak fixes. We'll get that pulled in soon, it will be in -p1.

    EDIT: Issue for tracking: https://redmine.pfsense.org/issues/9059 (also has a link to the unbound release notes)



  • So maybe my DNSBL wasn’t too big. I guess I’m still leaking memory and it’s just going to take longer to fail since I’m not using as much memory now. Interesting.


  • Rebel Alliance Developer Netgate

    @bbrendon said in SG3100 needs to reboot every few days after 2.4.4 upgrade:

    So maybe my DNSBL wasn’t too big. I guess I’m still leaking memory and it’s just going to take longer to fail since I’m not using as much memory now. Interesting.

    They fixed several leaks, from the release notes:

    A memory leak in the TLS lookup code is fixed. Leaked requests in the requestlist are fixed.

    • free memory leaks in config strlist and str2list insert functions.
    • Free memory leak in config strlist append.
    • Fix memory leak when message parse fails partway through copy.

    So it is possible if you have an issue with unbound that happens over time, it may be due to a memory leak.



  • Is there a way to restart unbound every 24 hours to try and mitigate the problem until 2.4.4-p1?
    Thanks!


  • Rebel Alliance Developer Netgate

    You can install the cron package and then make a cron entry that periodically does /usr/local/sbin/pfSsh.php playback svc restart unbound on whatever schedule you want.



  • I don't know if I just caught it at a funny time, but I just went to go look at the DNS Resolver log page again, but it looks like the service restarted itself multiple times within just the last few minutes. I wonder if it has been doing this all day? I am wondering if installing cron and having it restart periodically (say daily) is even worthwhile if it is restarting itself frequently anyway...

    I have a mind to disable DNS over TLS for now and wait until 2.4.4-p1 comes out before I try to re-enable it. DNS privacy appeals to me, but in principle only - no real reasons to hide. It's not worth periodically losing DNS functionality. I haven't really decided to turn it off yet, but what do other people think about the trade-off?



  • I woke up without internet this morning :(

    Is there a way to install unbound 1.8.1 manually?



  • @bbrendon said in SG3100 needs to reboot every few days after 2.4.4 upgrade:

    I woke up without internet this morning :(

    Is there a way to install unbound 1.8.1 manually?

    Why don't you follow the advice of @jimp above? Install the cron package, then add a restart of unbound every ~6 hours (or hour or whatever works best) as a mitigation?



  • For what it's worth, I have checked back at random periods at the DNS Resolver log entries (but NOT when the problem occurred when I had to manually start it). The few times I checked the DNS Resolver log, I found each time that unbound had stopped (message like info: service stopped (unbound 1.7.3). ) a couple of times within a 5 minute period. I am not sure why it stops so often, and why sometimes it will stop and not start back on its own without my intervention. But I'm not sure adding a restart every hour or few hours would make a difference. If the service is stopping and starting on its own fairly often, one would think adding a cron-based restart wouldn't make a big difference.



  • @muppet said in SG3100 needs to reboot every few days after 2.4.4 upgrade:

    restart of unbound every ~6 hours

    I could but that's janky.



  • Edit: Updated to remove the bit about it being unsupported.

    Unbound 1.8.1 has been pushed to the 2.4.4 repo.
    If you do a "pkg update && pkg upgrade" you can find it. If you wanted to be more careful (always a good idea) you could do "pkg update && pkg upgrade unbound" then you'll only get an upgraded unbound and strongswan (ipsec package)
    KEEP IN MIND YOU'LL BE RUNNING A SEMI-UNSUPPORTED VERSION OF PFSENSE BY DOING THIS - So only do it if you really really need to. But it should fix your unbound problems.

    Thanks heaps to netcat6549 in the IRC channel for explaining some of the above to me.


  • Rebel Alliance Developer Netgate

    That is not unsupported. We pushed it out specifically to address the unbound memory leaks. It's perfectly fine to run it either way, whichever way works better for someone.

    In fact if someone were to upgrade to 2.4.4 today, they'd automatically pull in the new version.

    tl;dr: If you have problems with unbound eating memory, upgrade, don't suffer until -p1, even if it is coming soon.



  • @jimp said in SG3100 needs to reboot every few days after 2.4.4 upgrade:

    That is not unsupported. We pushed it out specifically to address the unbound memory leaks. It's perfectly fine to run it either way, whichever way works better for someone.

    In fact if someone were to upgrade to 2.4.4 today, they'd automatically pull in the new version.

    tl;dr: If you have problems with unbound eating memory, upgrade, don't suffer until -p1, even if it is coming soon.

    Ahhh ok, thank you
    I was asking in the IRC channel about how "supported" the pkg update/pkg upgrade commands were. The feedback I got was that it would be an unsupported thing to do. Thank you for clarification!


  • Rebel Alliance Developer Netgate

    I could see how someone might think that, but in this case we put out some updates for a few issues that some had concerns about (unbound, strongSwan, curl, libssh) and if someone needs to, they can update by hand. We don't do that often, but in this case the issues warranted an OOB update of that nature.