pfSense running out of memory and locking up



  • Hey all, since the 2.4.5 release, I've had two instances of my Netgate SG-3100 locking up and requiring me to restart the box via the console to get all the services back up and running. The instances were May 25th and yesterday, June 7. I can't say for sure what caused the first lockup, as I didn't think to save the log files. The second one, however, appears to have been caused by the system exhausting all available memory and being unable to spawn new processes. Here are the logs from just before I rebooted the system:

    Jun  7 15:30:36 pfSense upsd[86015]: Data for UPS [ups] is stale - check driver
    Jun  7 15:30:37 pfSense upsmon[78626]: Poll UPS [ups] failed - Data stale
    Jun  7 15:30:37 pfSense upsmon[78626]: Communications with UPS ups lost
    Jun  7 15:30:37 pfSense upsmon[78626]: Can't fork to notify: Cannot allocate memory
    Jun  7 15:30:40 pfSense upsd[86015]: UPS [ups] data is no longer stale
    Jun  7 15:30:42 pfSense upsmon[78626]: Communications with UPS ups established
    Jun  7 15:30:43 pfSense upsmon[78626]: Can't fork to notify: Cannot allocate memory
    Jun  7 15:31:04 pfSense upsd[86015]: Data for UPS [ups] is stale - check driver
    Jun  7 15:31:06 pfSense upsd[86015]: UPS [ups] data is no longer stale
    Jun  7 15:33:51 pfSense upsd[86015]: Data for UPS [ups] is stale - check driver
    Jun  7 15:33:53 pfSense upsd[86015]: UPS [ups] data is no longer stale
    Jun  7 15:34:38 pfSense upsd[86015]: Data for UPS [ups] is stale - check driver
    Jun  7 15:34:40 pfSense upsd[86015]: UPS [ups] data is no longer stale
    Jun  7 15:37:20 pfSense upsd[86015]: Data for UPS [ups] is stale - check driver
    Jun  7 15:37:21 pfSense upsmon[78626]: Poll UPS [ups] failed - Data stale
    Jun  7 15:37:21 pfSense upsmon[78626]: Communications with UPS ups lost
    Jun  7 15:37:22 pfSense upsmon[78626]: Can't fork to notify: Cannot allocate memory
    Jun  7 15:37:22 pfSense upsd[86015]: UPS [ups] data is no longer stale
    Jun  7 15:37:27 pfSense upsmon[78626]: Communications with UPS ups established
    Jun  7 15:37:27 pfSense upsmon[78626]: Can't fork to notify: Cannot allocate memory
    Jun  7 15:37:37 pfSense sshd[38954]: fatal: fork of unprivileged child failed
    Jun  7 15:38:04 pfSense sshd[39018]: fatal: fork of unprivileged child failed
    Jun  7 15:39:06 pfSense login: login on ttyu0 as root
    Jun  7 15:39:24 pfSense shutdown: reboot by root: 
    Jun  7 15:39:24 pfSense init: /etc/rc.shutdown returned status 2
    Jun  7 15:39:24 pfSense syslogd: exiting on signal 15
    Jun  7 15:40:19 pfSense syslogd: kernel boot file is /boot/kernel/kernel
    

    As you can see at the bottom, I couldn't ssh in because the box couldn't allocate memory for the process. Prior to that are thousands of lines of the same upsmon/upsd entries. My logs don't go back far enough to determine exactly when the system ran out of memory. Everything was still routing fine until about 2 PM EST, when my family let me know the "Internet was down". I had to finish up yard work before coming in to look at the box and reboot it at ~3:40 PM EST, as you see in the logs. According to the logs, that "Cannot allocate memory" error had been happening since the early morning of June 7. Not sure when it actually started.

    Are there any known issues with the 2.4.5 release regarding memory exhaustion? I didn't have anything like this happen with the 2.4.4 release that my box shipped with.

    Thanks,
    Dan


  • Netgate Administrator

    The virtual memory limits were changed to allow for reasonably sized RAM disks after additional checking was added in the driver there. The limit was previously much lower. If you're not using RAM disks you can add a loader variable to bring it back down to 2.4.4 levels.
    Create the file /boot/loader.conf.local and add the line vm.kmem_size_max="200M"

    You might have to experiment with that value. ~300M is what it is by default in 2.4.5.

    Steve
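
    A minimal sketch of applying that suggestion from a shell, assuming a POSIX sh. The helper name set_kmem_max is hypothetical, and 200M is only the starting value suggested above, not a tuned one. The helper appends the tunable only when the file does not already set it; the change takes effect after a reboot.

    ```shell
    # Hypothetical helper: add vm.kmem_size_max to a loader config file
    # unless the file already contains a line for that tunable.
    set_kmem_max() {
        file=$1
        value=$2
        if [ -f "$file" ] && grep -q '^vm\.kmem_size_max=' "$file"; then
            echo "vm.kmem_size_max already set in $file" >&2
            return 0
        fi
        printf 'vm.kmem_size_max="%s"\n' "$value" >> "$file"
    }

    # On the firewall this would be called as:
    #   set_kmem_max /boot/loader.conf.local 200M
    # and, after a reboot, the active value checked with:
    #   sysctl vm.kmem_size_max
    ```

    Keeping the edit idempotent makes it safe to re-run while experimenting with different values; an existing line still has to be changed by hand.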



  • @stephenw10 , thank you for the suggestion. I didn't set up a ramdisk myself and, as far as I can tell, there isn't one currently on the system that might have been set up by a package or something.

    [2.4.5-RELEASE][admin@pfSense.localdomain]/root: df -h
    Filesystem                     Size    Used   Avail Capacity  Mounted on
    /dev/ufsid/5cdd38aef2899872    6.9G    1.0G    5.3G    16%    /
    devfs                          1.0K    1.0K      0B   100%    /dev
    /dev/msdosfs/FATBOOT0           34M    2.0M     32M     6%    /boot/u-boot
    /dev/md0                       3.4M    116K    3.0M     4%    /var/run
    devfs                          1.0K    1.0K      0B   100%    /var/dhcpd/dev
    

    I'm going to try to keep a closer eye on the logs to see, if this happens again, when and what the point of failure might be.

    ~Dan


  • Netgate Administrator

    You don't need to be using RAM disks to see that change. It was applied to all arm installs in order to accommodate RAM disks better. You might be hitting a side effect of that.

    Steve



  • Well, I didn't apply these changes and it locked up again today, 28 days later. When I logged in with the console, I could do nothing, as nothing I ran from the shell could be spawned due to lack of memory, including "shutdown -r now". Ended up having to power cycle the box and pray the file system didn't get corrupted.

    I've put the suggested fix into place and rebooted. Will see if this occurs again.



  • Hi,

    I'm using a UPS (NUT?) myself. No connection issues, as I'm using an off-the-shelf APC UPS, a rather classic one.
    Memory (max 2 GB) doesn't change.

    Except for pfBlockerNG-devel, which could eat up a lot of memory, I'm not using anything special:

    acme 	0.6.8 	
    Avahi 	2.1_1 	
    Cron 	0.3.7_4 	
    freeradius3 	0.15.7_16 	
    Notes 	0.2.9_2 	
    nut 	2.7.4_7 	
    openvpn-client-export 	1.4.23_1 	
    pfBlockerNG-devel 	2.2.5_33 	
    RRD_Summary 	2.0 	
    Shellcmd 	1.0.5_1 	
    System_Patches
    

    What happens when you stop / remove NUT/UPS ?
    If needed, go bare bone, and test up from there.



  • I was trying to figure out how to list the packages from the command line as it appears you did, but haven't found it yet. This is all I show as installed from the web GUI:

    GUI packages

    I'd rather not disable nut since my box wouldn't shut down properly in the event of a power outage. Same as you, I'm just using an off-the-shelf UPS, the CyberPower CP685AVRG. I can't decide if nut is actually the problem or a symptom of whatever is going on. It does seem to lose connectivity a few times a day, but regains it quickly. That cycling seems to release (or maybe not?) resources and reacquire them. I have wondered if it might not actually be releasing anything and consuming more and more over time, but I have no proof of that. Memory usage has remained low according to the GUI.

    ~Dan



  • Interestingly, the log messages are slightly different from last time. @stephenw10 , would these indicate that your suggested fix might indeed be the problem:

    Jul  5 01:14:36 pfSense upsmon[90689]: Communications with UPS ups established
    Jul  5 01:14:36 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  5 01:14:36 pfSense upsmon[16358]: Can't invoke wall: Cannot allocate memory
    Jul  5 01:14:44 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  5 01:15:30 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  5 01:17:08 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  5 01:18:45 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  5 01:19:11 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  5 01:19:44 pfSense upsd[93206]: Data for UPS [ups] is stale - check driver
    Jul  5 01:19:46 pfSense upsd[93206]: UPS [ups] data is no longer stale
    Jul  5 01:20:31 pfSense upsd[93206]: Data for UPS [ups] is stale - check driver
    

    And when I was trying to reboot from the web GUI:

    Jul  6 11:08:57 pfSense php-fpm[365]: /diag_reboot.php: Stopping all packages.
    Jul  6 11:08:57 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  6 11:08:57 pfSense php-fpm[365]: /diag_reboot.php: The command '/usr/local/etc/rc.d/nut.sh stop' returned exit code '2', the output was 'stopping NUT /usr/local/etc/rc.d/nut.sh: Cannot fork: Cannot allocate memory' 
    Jul  6 11:08:57 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  6 11:08:58 pfSense kernel: vm_thread_new: kstack allocation failed
    Jul  6 11:08:58 pfSense php-fpm[365]: /diag_reboot.php: The command 'nohup /etc/rc.reboot > /dev/null 2>&1 &' returned exit code '-1', the output was ''
    

    No logs are present from when I was trying to reboot from the console, but I was seeing the same kind of messages echoed. It couldn't spawn any of the commands I was issuing.

    ~Dan



  • My bet is the upsd driver. The message says your system is running out of kstack memory. This is special, reserved memory for the kernel stack. I think that is a fixed allocated chunk of memory, so it very well may not show up as being "consumed" in the Dashboard memory consumption indicator. Or stated another way, it is probably part of the base system memory area and the entire block is accounted for once and processes use small bits of that preallocated chunk when running.

    From what I understand from the limited research I did, that block is not expandable. So when a process consumes enough of it, that will kill the kernel because other processes that need some space can't get it.

    My guess is the upsd driver is crashing in some fashion (or some portion of it is crashing) and leaking kstack memory each time it crashes. After enough days of crashing, all of the kstack memory is consumed via those "leaks".

    I know it can be dangerous, especially if power at your location is flaky, but I would test with nut removed and the UPS unplugged from the USB port to see if the kstack errors go away. It will take several days to know.
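
    One way to judge whether the errors go away is to count them per day. A rough sketch, assuming the syslog line format shown earlier in the thread; count_kstack_failures is a hypothetical helper that reads log lines on standard input. It also assumes the 2.4.x circular log format, which is read with clog rather than cat.

    ```shell
    # Hypothetical helper: count "kstack allocation failed" lines per day.
    # Reads syslog-style lines on stdin; prints "<month> <day> <count>".
    count_kstack_failures() {
        awk '/kstack allocation failed/ { n[$1 " " $2]++ }
             END { for (d in n) print d, n[d] }' | sort
    }

    # On the firewall this might be run as:
    #   clog /var/log/system.log | count_kstack_failures
    ```

    A count that stays at zero for longer than the previous interval between lockups would support the theory; the log only holds a limited history, so it needs checking regularly.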



  • @DannyBoy2k said in pfSense running out of memory and locking up:

    from the command line as it appears you did

    I have to disappoint you here.

    I copied the following part of the GUI dashboard with my mouse:

    [image: f6e9ef83-9fdc-4c99-a96b-31434752b8f6-image.png]

    Then I pasted the text and used the format tool

    [image: 3b01dce6-4f74-450d-8ebd-83e11c9b36ac-image.png]

    to make it readable for humans
    (and thus using 190 bytes of storage instead of several kilobytes for the image).

    Btw : command line version :

    ls -al /usr/local/pkg
    

    will list what you have on your pfSense.
    It's a bit messy, but can be useful.

    The directory list doesn't show what is up to date, actually activated etc.
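
    Another option is to filter the output of pkg info, which also shows installed versions. This is a sketch under the assumption that pfSense add-on packages are registered with the FreeBSD pkg tool under names beginning with pfSense-pkg- (worth verifying on your install); list_pfsense_pkgs is a hypothetical helper name.

    ```shell
    # Hypothetical helper: keep only add-on packages from `pkg info`
    # output on stdin and strip the assumed "pfSense-pkg-" prefix.
    list_pfsense_pkgs() {
        grep '^pfSense-pkg-' | sed 's/^pfSense-pkg-//'
    }

    # On the firewall this might be run as:
    #   pkg info | list_pfsense_pkgs
    ```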



  • @Gertjan:
    Are you running nut on an SG-3100? I tried a number of times to get it running on an SG-3100 with a CyberPower UPS and was never successful in getting the UPS to be recognized.





  • @Gertjan said in pfSense running out of memory and locking up:

    @bmeeks said in pfSense running out of memory and locking up:

    @Gertjan:

    You mean @DannyBoy2k

    No, I was asking you since you mentioned nut running okay. Not trying to change the thread topic, but wondering if the SG-3100 due to its ARM architecture acts weird with some peripherals.

    @DannyBoy2k has it sort of running, but with the serious issue he posted about.



  • @bmeeks , yes, I was able to get nut running with the CyberPower just using the usb driver. It's just that it seems to occasionally (maybe once a day) need to restart/reconnect to it.

    ~Dan



  • @bmeeks said in pfSense running out of memory and locking up:

    No, I was asking you

    I'm using NUT (pfSense) on bare-bones Intel PCs from the last decade, with APC UPSs only, using their "USB" ports.



  • @DannyBoy2k said in pfSense running out of memory and locking up:

    @bmeeks , yes, I was able to get nut running with the CyberPower just using the usb driver. It's just that it seems to occasionally (maybe once a day) need to restart/reconnect to it.

    ~Dan

    Okay, but it appears to not be running well. Should not disconnect. I was never able to get it to work, so have that firewall for now running on the UPS but "blind" to battery exhaustion. Not ideal!

    Mentioned this in your thread to say perhaps there are issues with the USB driver for UPS/nut that manifest themselves in various ways.



  • @Gertjan said in pfSense running out of memory and locking up:

    @bmeeks said in pfSense running out of memory and locking up:

    No, I was asking you

    I'm using NUT (pfSense) on bare-bones Intel PCs from the last decade, with APC UPSs only, using their "USB" ports.

    Ah! I've never had issues with my Intel-based firewalls and have used both APC and other UPS boxes. The SG-3100 was the first one to ever kick my butt! It's also the first ARM architecture firewall I've encountered.



  • @bmeeks , thank you for the thoughts. I posted a message in the pfsense packages Category to see if it leads anywhere:
    https://forum.netgate.com/topic/155094/possible-memory-leak-in-nut-package

    ~Dan



  • Review my first post in this thread where I mention the kstack memory allocation error. My bet is still on the USB driver for the UPS being the problem. If you can, disable that driver completely and see if stability returns. Might take a month to be sure since you went as far as 28 or 29 days between lockups.



  • @bmeeks said in pfSense running out of memory and locking up:

    My guess is the upsd driver is crashing in some fashion (or some portion of it is crashing) and leaking kstack memory each time it crashes. After enough days of crashing, all of the kstack memory is consumed via those "leaks".

    I know it can be dangerous, especially if power at your location is flaky, but I would test with nut removed and the UPS unplugged from the USB port to see if the kstack errors go away. It will take several days to know.

    Hello!

    I use a pi running upsd and netgates attaching to it with upsmon (Remote NUT Server). This could be a workaround while exploring local upsd issues.

    John
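
    Before relying on a remote setup like this, it may help to confirm from the firewall that the remote upsd answers. A small sketch, assuming the NUT client upsc is available and that ups.status contains the token OL while on line power; the helper name ups_on_line and the host name below are placeholders, not values from this thread.

    ```shell
    # Hypothetical helper: succeed if a NUT ups.status string (for example
    # "OL CHRG" or "OB DISCHRG") reports the UPS on line power.
    ups_on_line() {
        case " $1 " in
            *" OL "*) return 0 ;;
            *)        return 1 ;;
        esac
    }

    # On the firewall this might be used as (UPS and host names are placeholders):
    #   ups_on_line "$(upsc ups@nut-pi.example ups.status)" && echo "on line power"
    ```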



  • @bmeeks said in pfSense running out of memory and locking up:

    The SG-3100 was the first one to ever kick my butt!

    Hello!

    With a sample size of one...

    https://forum.netgate.com/topic/154674/nut-and-apc-smart-ups-750-rm-usb

    John



  • @serbus said in pfSense running out of memory and locking up:

    @bmeeks said in pfSense running out of memory and locking up:

    The SG-3100 was the first one to ever kick my butt!

    Hello!

    With a sample size of one...

    https://forum.netgate.com/topic/154674/nut-and-apc-smart-ups-750-rm-usb

    John

    I tried a number of things with that SG-3100, and never did get the UPS properly recognized. It is something to do with device file permissions I suspect. I did not want to dive into a bunch of repetitive reboots and tinkering with the base OS at the time. I've never had any issues at all with either nut or apcupsd on several iterations of Intel-based hardware with pfSense. That particular SG-3100 is currently serving duty as a church firewall.

    I found another post or two here in the past about assigning specific permissions to one or more of the /dev pseudo files/directories that get created for peripherals, but as I said above I did not want to get off into those weeds.

    The ARM architecture of the SG-1000, SG-1100 and SG-3100 appliances has turned out to be, shall we just say, "interesting" ... ☺. Some legacy C source code programs that run fine on Intel hardware will crash on the ARM stuff due to memory alignment errors. Other subtle differences in the internal architecture can also contribute to "weirdness" with some software on the ARM devices.


  • Netgate Administrator

    If there is some memory leak I would expect to be able to see it somewhere before it actually locks up.

    The first place I'd look is the Monitoring Graphs, under System - Memory. When we have seen bugs like that before, you can see the usage ramp up there.

    Steve



  • @stephenw10 , I am almost always at 7% of 2020 MiB when I log into the web GUI. I'm barely using the features of this box. A couple of VLANs, DNS, DHCP. That's about it. The only indication I've ever found of something being wrong is the system logs I've pasted above.

    ~Dan


  • Netgate Administrator

    Hmm. Well I guess if it is kernel memory that's harder to see.... try checking the output of sysctl vm.kmem_map_free.

    The first thing you will see is that on the 32-bit ARM system that value is much smaller than on other architectures, so it is far easier to hit an issue. See if that value decreases over time.

    Steve
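
    To watch that value over time without logging in, it could be sampled periodically. A minimal sketch; kmem_record is a hypothetical helper and the log path in the comment is an assumption. It turns the output of the sysctl command above into a "<epoch seconds> <free bytes>" record.

    ```shell
    # Hypothetical helper: format one record as "<epoch seconds> <free bytes>".
    # Expects the output of `sysctl vm.kmem_map_free` as its only argument.
    kmem_record() {
        # ${1##*: } drops everything up to the last ": ", leaving the number.
        printf '%s %s\n' "$(date +%s)" "${1##*: }"
    }

    # For example, run from cron every 10 minutes (log path is an assumption):
    #   kmem_record "$(sysctl vm.kmem_map_free)" >> /root/kmem_map_free.log
    ```

    Writing to a file on disk rather than relying on the system log means the samples survive even when the box can no longer fork a logger, up to the last successful run.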



  • @stephenw10 , I explored the web GUI a bit more and found the Status: Monitoring section. This seems interesting, but I have no idea what it means. I mean, I can see that all free memory suddenly became unavailable, but no idea why.
    Memory for 1 month

    I ran your suggested command:

    [2.4.5-RELEASE][admin@pfSense.localdomain]/root: sysctl vm.kmem_map_free
    vm.kmem_map_free: 141295616
    

    I'll try to log in every now and again and continue to monitor.

    ~Dan



  • Here is a higher fidelity snapshot around the period of interest. It appears it just happened very suddenly, not ramping up over time.
    Higher fidelity image

    ~Dan


  • Netgate Administrator

    That is it failing to log anything, likely when it exhausted the kernel memory.

    However before that you can see the free memory ramping down. If you click on the orange 'free' button to de-select it that will show the other data in more detail.
    Any idea what happened on June 17th to free some memory?

    Steve



  • @stephenw10 , unfortunately, no. I really don't do anything on this box; I just let it do its thing. I pretty much only log in when I suddenly lose Internet connectivity. What's interesting is, based on these graphs, I'm in a pretty bad state for several days before the box ceases to route.

    Here is the graph without the free memory line:
    Graph Without Free Memory

    ~Dan



  • @serbus , thank you for putting out this idea. I'm definitely considering it.

    Cheers,
    Dan



  • Hello, just a me-too post.

    SG-3100 firewalls running 2.4.5-P1.

    I have 9 SG-3100 boxes that don't run Nut, they do use ramdisks, no problem with those.

    I have another 6 SG-3100 boxes that have Nut setup with USB Cyberpower OR500.

    This morning I noticed that two that were installed on the same day, 13 days ago, both failed an automated config backup because ssh failed. I was able to reboot one of them, the other I'm still fighting with. These are 45 and 60 miles away, so I cannot just power cycle them.

    I'm trying to figure out how to tell which processes are using kmem.

    Josh
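
    On FreeBSD, kernel memory is generally accounted per allocation type rather than per process: vmstat -m lists kernel malloc usage by type and vmstat -z lists the UMA zones. A rough sketch for ranking the vmstat -m rows, assuming the MemUse column is the first field of the form <number>K on each row; top_kmem_types is a hypothetical helper name.

    ```shell
    # Hypothetical helper: rank `vmstat -m` rows (on stdin) by memory in use.
    # Optional argument: number of rows to show (default 10).
    top_kmem_types() {
        awk '{
            for (i = 2; i <= NF; i++)
                if ($i ~ /^[0-9]+K$/) {      # assumed to be the MemUse column
                    print substr($i, 1, length($i) - 1), $0
                    break
                }
        }' | sort -rn | head -n "${1:-10}"
    }

    # On the firewall this might be run as:
    #   vmstat -m | top_kmem_types 15
    ```

    Comparing two runs taken some days apart should show which allocation type, if any, keeps growing.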



  • Speculating here: I had one of the SG-3100 boxes run into the IPv6 bogons issue, where it couldn't load the bogonsv6 table because it didn't have enough memory, even after upping the max table entries value... so I'm wondering if the bogons tables use kmem also? Maybe my setup of ramdisks + arpwatch + nut didn't leave enough kmem for the bogon table refresh, which is why it didn't matter how much I increased max table entries.

    I'm wondering if this was triggered after a month because the bogonsv6 table gets reloaded via cron once a month and takes 2x the memory for a reload (I read that in one of the threads about the bogonsv6 table reload issue).

    I've since disabled ipv6 on my SG3100 boxes, so maybe that will take care of this for me? I didn't have it disabled on the two that locked up today.

    <someone upvote me so I can get enough reputation to change my old signature>

    Josh


  • Netgate Administrator

    What size ram disks are you using?

    There was a change to the kmem setup for armv6 in pfSense 2.4.5p1. We discovered that earlier versions would allow you to allocate more RAM disk than is actually available. An update to the driver in FreeBSD 11.3 prevents that. RAM disk space in 2.4.5 on the SG-3100 is fairly limited; I found anything over ~125MB total could hit the limit. The default values should be OK though.

    Steve
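
    As a quick sanity check on planned sizes, the two RAM disk values can be added up and compared with that rough figure. A sketch only: check_ramdisk_total is a hypothetical helper, and the 125 MB threshold is the approximate observation above for the SG-3100, not a documented limit.

    ```shell
    # Hypothetical helper: compare the combined /tmp and /var RAM disk
    # sizes (in MB) with the rough ~125 MB total mentioned above.
    check_ramdisk_total() {
        total=$(( $1 + $2 ))
        if [ "$total" -gt 125 ]; then
            echo "total ${total} MB: may exceed the kernel memory limit"
            return 1
        fi
        echo "total ${total} MB: within the rough limit"
    }

    # Example with illustrative sizes for /tmp and /var:
    #   check_ramdisk_total 40 60
    ```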



  • @stephenw10 Hello Stephen, thanks for the reply.

    I had my ramdisks set way too large (I now understand).

    I took the minimum sizes not as the recommended size, just as the bare minimum, and set them up to use just about all the memory.

    The config page doesn't really indicate that a user shouldn't set the ramdisks to use up all the kernel memory. Maybe some general guidance language would be good there?

    But since the OP wasn't using a ramdisk at all, this may not really be relevant to the original issue, other than causing me to hit it sooner. I think I was just asking for trouble with my original ramdisk config.

    I'm going to set up Zabbix to log the kmem values, or just grab them occasionally with a script.

    I have been looking to try and find a way to show how the kmem is allocated, but haven't found anything yet.

    Josh
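
    If the values are grabbed with a script, a simple trend can be computed from the samples. A minimal sketch, assuming a log with one "<epoch seconds> <free bytes>" pair per line in time order; kmem_trend is a hypothetical helper and the log path in the comment is an assumption.

    ```shell
    # Hypothetical helper: average change in free kmem between the first
    # and last sample on stdin, in bytes per hour (negative = shrinking).
    kmem_trend() {
        awk 'NR == 1 { t0 = $1; v0 = $2 }
             { t1 = $1; v1 = $2 }
             END {
                 if (NR < 2 || t1 == t0) { print "not enough samples"; exit }
                 printf "%d bytes/hour\n", (v1 - v0) * 3600 / (t1 - t0)
             }'
    }

    # For example (log path is an assumption):
    #   kmem_trend < /root/kmem_map_free.log
    ```

    A steadily negative figure across several days would point to a leak, while a sudden drop between two samples would point to a single large allocation.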


  • Netgate Administrator

    Indeed, it may not be, but you should set them correctly in 2.4.5. If they are too big the setup code simply won't create them at boot. It logs that.

    Steve

