Netgate Discussion Forum

    pfSense 23.05 on 2100 - kernel memory leak with ZFS?

    General pfSense Questions (10 Posts, 6 Posters)
    • johan333

      I've been running 23.05 on two 2100 units for about 25 days, and each system seems to hang after about 10 days, requiring a hard reboot. Wired memory gradually increases over the course of this 10-day cycle until all memory is exhausted. So my theory is that ZFS is not relinquishing memory and ultimately gobbles it all up, preventing other processes from acquiring new memory and causing the system to hang.

      I have not changed the ZFS kernel parameters from their default (0 - Auto), but according to the pfSense documentation, the ARC maximum should never exceed 1/2 the RAM or the total RAM minus 1GB, whichever is greater; on a 4GB unit that works out to 3GB, or 75% of RAM. So it appears the default maximum is not being honored, since memory usage has gone as high as 91%. As you can see from the graph below, there is a distinct difference in how memory has been managed since the upgrade to 23.01 on 4/13 and then 23.05 on 5/26, compared to the 22.01 that came with the units.

      Screenshot 2023-06-19 232809.jpg
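
      For reference, the effective ceiling the auto-tune resolved to can be read directly (vfs.zfs.arc_max of 0 means auto; the kstat c_max value shows the computed limit):

      sysctl vfs.zfs.arc_max kstat.zfs.misc.arcstats.c_max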

      My configuration is pretty basic:
      pfSense 2100 4GB RAM 8GB Flash
      BIOS Vendor: U-Boot, Version: 2018.03-devel-1.2.0ROGUE2-01.00.00.02+, Release Date: Fri Feb 7 2020
      Version 23.05-RELEASE (arm64), built on Mon May 22 15:04:36 UTC 2023, FreeBSD 14.0-CURRENT
      Interfaces: WAN, 5 VLANs
      QoS: Basic traffic shaping
      Firewall: 5-6 rules per interface
      Services: pfBlocker, DNS resolver, OpenVPN

      Appreciate any assistance from the community on this issue.

      Thank you,
      Jon

      • keyser (Rebel Alliance) @johan333

        @johan333 You seem very confident that ZFS is the culprit. How did you determine that? Because that behaviour looks more like a proper memory leak in a process. For ZFS to use your memory like that would require not only a memory flaw in the process, but also a very linearly steady read load from your SSD (to fill the ZFS cache). Did you try top to see if you can get a little more info?
        https://docs.netgate.com/pfsense/en/latest/monitoring/status/system-activity.html

        Love the no fuss of using the official appliances :-)

        • mer

          This is simplified, with a lot of details left out.
          ZFS memory use is tied primarily to the ARC. The ARC caches read data, so for it to grow you need something doing a lot of reading from the filesystem, across lots of different files.
          I think the typical pattern on a pfSense box would be lots of reads at startup, then mostly writes to logs. Writes to the device obviously flow through different memory structures, and if the physical device is not responding there may be an increase until things time out, but that memory should then be freed.

          From the GUI (Diagnostics > Command Prompt) you can get ZFS-specific values:
          sysctl -a | grep zfs | grep arcstats
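
          For a quicker check than grepping the whole dump, the current ARC size can be compared against its ceiling directly (these are the standard OpenZFS kstat OIDs):

          sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max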

          • jimp (Rebel Alliance Developer, Netgate)

            On 23.05 the ZFS ARC usage on that graph is under Cache, not Wire, so it seems unlikely that ZFS would be the culprit based on what little information has been given.

            You should monitor the system by keeping a console and/or SSH session open and watching what is using the memory with something like top -aS -o res.
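
            If you'd rather log it than watch live, a crude periodic snapshot works too (a sketch; the path, interval, and process count are arbitrary):

            # append a timestamped, size-sorted process snapshot every 10 minutes
            while true; do
              date >> /root/memwatch.log
              top -baS -o res 15 >> /root/memwatch.log
              sleep 600
            done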

            Also, if you have something else configured that might be using RAM, such as RAM disks for /tmp and /var, that's worth looking into. If you have /var in a RAM disk and something is going nuts with log files, using a bunch of space that way could contribute to the issue.
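
            A quick way to check is the mount table; tmpfs or md-backed entries on /tmp or /var would indicate RAM disks:

            mount | grep -E '/(tmp|var)'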


            • johan333

              @keyser @mer @jimp


              Thanks, gents, for the responses, and I completely agree that my theory may be prematurely incriminating ZFS. I based it on research in this forum, where others saw a signature memory-leak scenario on 23.01/23.05 that pointed to ZFS and the ARC. I'm new to the pfSense platform (and BSD), but I'm fairly computer savvy, so I sincerely appreciate any help in diagnosing this problem. I've included the output you requested below, but please chime in with whatever else I might use to build a better mousetrap, as I expect I'm going to have to monitor this over time, as with most memory-leak issues.

              Here are two 'top -atSzo res -s 3' snapshots taken 16 hours apart. As far as I can see, no system- or user-level process is demonstrating a memory leak. The only increase I see is in the ARC, but I do not know how to diagnose what is driving its usage to ultimately find a culprit.

              Screenshot 2023-06-19 230753.jpg
              Snapshot #1 - 6/19/23 @ 13:27 PT

              Screenshot 2023-06-20 052256.jpg
              Snapshot #2 - 6/20/23 @ 05:22 PT

              Here is a snapshot of the ARC stats:

              kstat.zfs.misc.arcstats.abd_chunk_waste_size: 31232
              kstat.zfs.misc.arcstats.cached_only_in_progress: 0
              kstat.zfs.misc.arcstats.arc_raw_size: 0
              kstat.zfs.misc.arcstats.arc_sys_free: 0
              kstat.zfs.misc.arcstats.arc_need_free: 0
              kstat.zfs.misc.arcstats.demand_iohit_prescient_prefetch: 24
              kstat.zfs.misc.arcstats.demand_hit_prescient_prefetch: 24
              kstat.zfs.misc.arcstats.prescient_prefetch: 48
              kstat.zfs.misc.arcstats.demand_iohit_predictive_prefetch: 123
              kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch: 737
              kstat.zfs.misc.arcstats.predictive_prefetch: 2524
              kstat.zfs.misc.arcstats.async_upgrade_sync: 59
              kstat.zfs.misc.arcstats.arc_meta_min: 16777216
              kstat.zfs.misc.arcstats.arc_meta_max: 45561080
              kstat.zfs.misc.arcstats.arc_dnode_limit: 185929113
              kstat.zfs.misc.arcstats.arc_meta_limit: 1859291136
              kstat.zfs.misc.arcstats.arc_meta_used: 39604200
              kstat.zfs.misc.arcstats.arc_prune: 0
              kstat.zfs.misc.arcstats.arc_loaned_bytes: 0
              kstat.zfs.misc.arcstats.arc_tempreserve: 0
              kstat.zfs.misc.arcstats.arc_no_grow: 0
              kstat.zfs.misc.arcstats.memory_available_bytes: 2446315520
              kstat.zfs.misc.arcstats.memory_free_bytes: 2520383488
              kstat.zfs.misc.arcstats.memory_all_bytes: 3552796672
              kstat.zfs.misc.arcstats.memory_indirect_count: 0
              kstat.zfs.misc.arcstats.memory_direct_count: 0
              kstat.zfs.misc.arcstats.memory_throttle_count: 0
              kstat.zfs.misc.arcstats.l2_rebuild_log_blks: 0
              kstat.zfs.misc.arcstats.l2_rebuild_bufs_precached: 0
              kstat.zfs.misc.arcstats.l2_rebuild_bufs: 0
              kstat.zfs.misc.arcstats.l2_rebuild_asize: 0
              kstat.zfs.misc.arcstats.l2_rebuild_size: 0
              kstat.zfs.misc.arcstats.l2_rebuild_lowmem: 0
              kstat.zfs.misc.arcstats.l2_rebuild_cksum_lb_errors: 0
              kstat.zfs.misc.arcstats.l2_rebuild_dh_errors: 0
              kstat.zfs.misc.arcstats.l2_rebuild_io_errors: 0
              kstat.zfs.misc.arcstats.l2_rebuild_unsupported: 0
              kstat.zfs.misc.arcstats.l2_rebuild_success: 0
              kstat.zfs.misc.arcstats.l2_data_to_meta_ratio: 0
              kstat.zfs.misc.arcstats.l2_log_blk_count: 0
              kstat.zfs.misc.arcstats.l2_log_blk_asize: 0
              kstat.zfs.misc.arcstats.l2_log_blk_avg_asize: 0
              kstat.zfs.misc.arcstats.l2_log_blk_writes: 0
              kstat.zfs.misc.arcstats.l2_hdr_size: 0
              kstat.zfs.misc.arcstats.l2_asize: 0
              kstat.zfs.misc.arcstats.l2_size: 0
              kstat.zfs.misc.arcstats.l2_io_error: 0
              kstat.zfs.misc.arcstats.l2_cksum_bad: 0
              kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
              kstat.zfs.misc.arcstats.l2_free_on_write: 0
              kstat.zfs.misc.arcstats.l2_evict_l1cached: 0
              kstat.zfs.misc.arcstats.l2_evict_reading: 0
              kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
              kstat.zfs.misc.arcstats.l2_writes_lock_retry: 0
              kstat.zfs.misc.arcstats.l2_writes_error: 0
              kstat.zfs.misc.arcstats.l2_writes_done: 0
              kstat.zfs.misc.arcstats.l2_writes_sent: 0
              kstat.zfs.misc.arcstats.l2_write_bytes: 0
              kstat.zfs.misc.arcstats.l2_read_bytes: 0
              kstat.zfs.misc.arcstats.l2_rw_clash: 0
              kstat.zfs.misc.arcstats.l2_feeds: 0
              kstat.zfs.misc.arcstats.l2_bufc_metadata_asize: 0
              kstat.zfs.misc.arcstats.l2_bufc_data_asize: 0
              kstat.zfs.misc.arcstats.l2_mfu_asize: 0
              kstat.zfs.misc.arcstats.l2_mru_asize: 0
              kstat.zfs.misc.arcstats.l2_prefetch_asize: 0
              kstat.zfs.misc.arcstats.l2_misses: 0
              kstat.zfs.misc.arcstats.l2_hits: 0
              kstat.zfs.misc.arcstats.uncached_evictable_metadata: 0
              kstat.zfs.misc.arcstats.uncached_evictable_data: 0
              kstat.zfs.misc.arcstats.uncached_size: 0
              kstat.zfs.misc.arcstats.mfu_ghost_evictable_metadata: 0
              kstat.zfs.misc.arcstats.mfu_ghost_evictable_data: 0
              kstat.zfs.misc.arcstats.mfu_ghost_size: 0
              kstat.zfs.misc.arcstats.mfu_evictable_metadata: 1089024
              kstat.zfs.misc.arcstats.mfu_evictable_data: 36472832
              kstat.zfs.misc.arcstats.mfu_size: 48907264
              kstat.zfs.misc.arcstats.mru_ghost_evictable_metadata: 0
              kstat.zfs.misc.arcstats.mru_ghost_evictable_data: 0
              kstat.zfs.misc.arcstats.mru_ghost_size: 0
              kstat.zfs.misc.arcstats.mru_evictable_metadata: 2090496
              kstat.zfs.misc.arcstats.mru_evictable_data: 99806720
              kstat.zfs.misc.arcstats.mru_size: 123869184
              kstat.zfs.misc.arcstats.anon_evictable_metadata: 0
              kstat.zfs.misc.arcstats.anon_evictable_data: 0
              kstat.zfs.misc.arcstats.anon_size: 151552
              kstat.zfs.misc.arcstats.other_size: 6773040
              kstat.zfs.misc.arcstats.bonus_size: 1121600
              kstat.zfs.misc.arcstats.dnode_size: 4335728
              kstat.zfs.misc.arcstats.dbuf_size: 1315712
              kstat.zfs.misc.arcstats.metadata_size: 31754240
              kstat.zfs.misc.arcstats.data_size: 141173760
              kstat.zfs.misc.arcstats.hdr_size: 1076920
              kstat.zfs.misc.arcstats.overhead_size: 28936192
              kstat.zfs.misc.arcstats.uncompressed_size: 411830784
              kstat.zfs.misc.arcstats.compressed_size: 143991808
              kstat.zfs.misc.arcstats.size: 180809192
              kstat.zfs.misc.arcstats.c_max: 2479054848
              kstat.zfs.misc.arcstats.c_min: 111024896
              kstat.zfs.misc.arcstats.c: 283686656
              kstat.zfs.misc.arcstats.p: 141843840
              kstat.zfs.misc.arcstats.hash_chain_max: 2
              kstat.zfs.misc.arcstats.hash_chains: 22
              kstat.zfs.misc.arcstats.hash_collisions: 13561
              kstat.zfs.misc.arcstats.hash_elements_max: 5494
              kstat.zfs.misc.arcstats.hash_elements: 5245
              kstat.zfs.misc.arcstats.evict_l2_skip: 0
              kstat.zfs.misc.arcstats.evict_l2_ineligible: 0
              kstat.zfs.misc.arcstats.evict_l2_eligible_mru: 0
              kstat.zfs.misc.arcstats.evict_l2_eligible_mfu: 0
              kstat.zfs.misc.arcstats.evict_l2_eligible: 0
              kstat.zfs.misc.arcstats.evict_l2_cached: 0
              kstat.zfs.misc.arcstats.evict_not_enough: 0
              kstat.zfs.misc.arcstats.evict_skip: 0
              kstat.zfs.misc.arcstats.access_skip: 0
              kstat.zfs.misc.arcstats.mutex_miss: 0
              kstat.zfs.misc.arcstats.deleted: 0
              kstat.zfs.misc.arcstats.uncached_hits: 0
              kstat.zfs.misc.arcstats.mfu_ghost_hits: 0
              kstat.zfs.misc.arcstats.mfu_hits: 4685498
              kstat.zfs.misc.arcstats.mru_ghost_hits: 0
              kstat.zfs.misc.arcstats.mru_hits: 1616339
              kstat.zfs.misc.arcstats.prefetch_metadata_misses: 351
              kstat.zfs.misc.arcstats.prefetch_metadata_iohits: 681
              kstat.zfs.misc.arcstats.prefetch_metadata_hits: 857
              kstat.zfs.misc.arcstats.prefetch_data_misses: 658
              kstat.zfs.misc.arcstats.prefetch_data_iohits: 0
              kstat.zfs.misc.arcstats.prefetch_data_hits: 25
              kstat.zfs.misc.arcstats.demand_metadata_misses: 1193
              kstat.zfs.misc.arcstats.demand_metadata_iohits: 107
              kstat.zfs.misc.arcstats.demand_metadata_hits: 3393539
              kstat.zfs.misc.arcstats.demand_data_misses: 2548
              kstat.zfs.misc.arcstats.demand_data_iohits: 56
              kstat.zfs.misc.arcstats.demand_data_hits: 2907416
              kstat.zfs.misc.arcstats.misses: 4750
              kstat.zfs.misc.arcstats.iohits: 844
              kstat.zfs.misc.arcstats.hits: 6301837
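
              For anyone following along, those counters reduce to a hit rate with a one-liner (a small sketch using the hits/misses OIDs above):

              sysctl -n kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses | awk 'NR==1{h=$1} NR==2{m=$1} END{printf "ARC hit rate: %.2f%%\n", 100*h/(h+m)}'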
              

              I am not using any RAM disks on the system:

              [23.05-RELEASE][jonsmall@pfSense-small.home.lan]/home/jonsmall: mount
              pfSense/ROOT/default on / (zfs, local, noatime, nfsv4acls)
              devfs on /dev (devfs)
              pfSense/var on /var (zfs, local, noatime, nfsv4acls)
              pfSense/tmp on /tmp (zfs, local, noatime, nosuid, nfsv4acls)
              pfSense/home on /home (zfs, local, noatime, nfsv4acls)
              pfSense/var/log on /var/log (zfs, local, noatime, noexec, nosuid, nfsv4acls)
              pfSense/var/db on /var/db (zfs, local, noatime, noexec, nosuid, nfsv4acls)
              pfSense/var/cache on /var/cache (zfs, local, noatime, noexec, nosuid, nfsv4acls)
              pfSense/var/tmp on /var/tmp (zfs, local, noatime, nosuid, nfsv4acls)
              pfSense/ROOT/default/cf on /cf (zfs, local, noatime, noexec, nosuid, nfsv4acls)
              pfSense/ROOT/default/var_cache_pkg on /var/cache/pkg (zfs, local, noatime, noexec, nosuid, nfsv4acls)
              pfSense/ROOT/default/var_db_pkg on /var/db/pkg (zfs, local, noatime, noexec, nosuid, nfsv4acls)
              tmpfs on /var/run (tmpfs, local)
              devfs on /var/dhcpd/dev (devfs)
              

              Screenshot 2023-06-20 060438.jpg

              • SteveITS (Galactic Empire) @johan333

                @johan333 FWIW, this page links to a ZFS tuning doc: https://docs.netgate.com/pfsense/en/latest/hardware/memory.html
                I haven't seen hanging 2100s and haven't done any memory tuning myself. Can you view the console while it happens?
                https://docs.netgate.com/pfsense/en/latest/solutions/netgate-2100/connect-to-console.html


                • johan333 @SteveITS

                  @SteveITS Thanks for the reference to the memory tuning doc; I'll check it out. You know, I was thinking the same thing, but I don't currently have the console connected to anything. I've got a spare Raspberry Pi I think I'll throw in the cabinet and connect to the 2100 console port.

                  • johan333

                    I just wanted to follow up on this post and provide some closure, as I believe I've identified the issue. When researching problems in forums like this, it can be so frustrating to read through 20 pages of dialogue only to find no closure. It is as if someone tore out the last pages of a great novel!

                    Also, please allow me to apologize to the community for prematurely blaming ZFS or the kernel, as I'm 90% certain that isn't the case. This is out of character for me; it's no different than a typical doctor/patient exam. I should have started by describing the symptoms and allowed the experts to interrogate with more detailed questions, arrive at a diagnosis, and set a treatment plan. Again, my humble apologies.

                    Now, on to what I've discovered after carefully scrutinizing the system through the lens of what is different about my systems versus other 2100s running 23.05 that aren't raising flags about memory leaks. I found some anecdotal differences, such as running pfBlockerNG-devel (3.2.0_5) instead of pfBlockerNG (3.2.0_5), but I didn't feel that was a likely culprit, as the associated process memory has always been stable. I then remembered the constant errors I have been getting in dmesg from the connected UPS systems:

                    ugen1.2: <CPS ST Series> at usbus1 (disconnected)
                    ugen1.2: <CPS ST Series> at usbus1
                    

                    See, when these systems were installed (six months ago), I set up a UPS on each of them and attempted to use NUT to manage the UPS interface. However, it was not stable, so I opted to just disable NUT for the time being and save the troubleshooting for a rainy day. These messages are constant (every few seconds), so I began thinking that perhaps with each disconnect/connect some kernel-level memory allocation (a malloc call) was happening that wasn't being released. After 10 days of this constant cycle, the system was simply starved of free memory and became unstable. So, two days ago, I disconnected the UPS from pfSense and performed a hard restart. As you can see from the graph below, wired and free memory are looking much better behaved: I would normally see a 7-8% increase in wired memory within 24 hours, and so far it hasn't fluctuated more than +/- 2% over two days.

                    Screenshot 2023-06-22 123229.jpg
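
                    For anyone who wants to chase a suspected kernel-side leak like this directly, FreeBSD can report kernel malloc and zone usage by type; diffing snapshots taken hours apart should show which allocation type is growing (a sketch, not an official procedure; the path is arbitrary):

                    ts=$(date +%Y%m%d%H%M)
                    vmstat -m > /root/kmem-$ts.txt    # kernel malloc stats by type
                    vmstat -z >> /root/kmem-$ts.txt   # UMA zone allocations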

                    Again, thank you for the community's support; these troubleshooting efforts, though daunting, provide a valuable opportunity to learn more about the system.

                    BTW, I have since connected the UPSs (and the pfSense console) to a Raspberry Pi running NUT, and it is working great! Now the rainy-day project is configuring a NUT client on pfSense to respond to triggers from the NUT server on loss-of-power events.
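
                    If anyone attempts the same, the client side of NUT only needs a MONITOR line pointing at the server. A minimal upsmon.conf sketch, assuming the Pi exports the UPS as "ups", is reachable as nut-pi.home.lan, and has a monitoring user "monuser" with password "secret" in its upsd.users (all names hypothetical; the pfSense NUT package normally generates this file from its GUI):

                    # /usr/local/etc/nut/upsmon.conf (client side; path per the FreeBSD port)
                    MONITOR ups@nut-pi.home.lan 1 monuser secret slave
                    SHUTDOWNCMD "/sbin/shutdown -p +0"   # power off when the server signals low battery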

                    • stephenw10 (Netgate Administrator)

                      Nice catch!
                      So the UPS USB device was flapping like that with or without NUT installed/running?

                      • johan333 @stephenw10

                        @stephenw10 Yes, I'm using two different models of CyberPower UPSs (SL700U & CP425SLG) and have been experiencing this flapping since the 2100s were new on v22.05, six months ago. I removed the NUT package shortly after install, when these disconnects were making NUT unstable. They are both HID compliant, and I was using the usbhid-ups driver with NUT, but it still flapped. As mentioned, I now have them connected to a Raspberry Pi with NUT and the usbhid-ups driver, and it is working great (see pic).

                        Screenshot 2023-06-22 152945.jpg

                        In case you're interested, I saved this output. When it disconnected, the UPS device didn't show in a usbconfig listing.

                        [23.05-RELEASE][jonsmall@pfSense-rider.home.lan]/home/jonsmall: sudo usbconfig
                        ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
                        ugen1.1: <Marvell EHCI root HUB> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
                        ugen1.2: <CPS ST Series> at usbus1, cfg=0 md=HOST spd=LOW (1.5Mbps) pwr=ON (50mA)
                        
                        [23.05-RELEASE][jonsmall@pfSense-rider.home.lan]/home/jonsmall: sudo usbconfig -vd ugen1.2
                        ugen1.2: <CPS ST Series> at usbus1, cfg=0 md=HOST spd=LOW (1.5Mbps) pwr=ON (50mA)
                        
                          bLength = 0x0012
                          bDescriptorType = 0x0001
                          bcdUSB = 0x0110
                          bDeviceClass = 0x0000  <Probed by interface class>
                          bDeviceSubClass = 0x0000
                          bDeviceProtocol = 0x0000
                          bMaxPacketSize0 = 0x0008
                          idVendor = 0x0764
                          idProduct = 0x0501
                          bcdDevice = 0x0001
                          iManufacturer = 0x0003  <CPS>
                          iProduct = 0x0001  <ST Series>
                          iSerialNumber = 0x0000  <no string>
                          bNumConfigurations = 0x0001
                        
                        
                         Configuration index 0
                        
                            bLength = 0x0009
                            bDescriptorType = 0x0002
                            wTotalLength = 0x0022
                            bNumInterfaces = 0x0001
                            bConfigurationValue = 0x0001
                            iConfiguration = 0x0000  <no string>
                            bmAttributes = 0x00a0
                            bMaxPower = 0x0019
                        
                            Interface 0
                              bLength = 0x0009
                              bDescriptorType = 0x0004
                              bInterfaceNumber = 0x0000
                              bAlternateSetting = 0x0000
                              bNumEndpoints = 0x0001
                              bInterfaceClass = 0x0003  <HID device>
                              bInterfaceSubClass = 0x0000
                              bInterfaceProtocol = 0x0000
                              iInterface = 0x0000  <no string>
                        
                              Additional Descriptor
                        
                              bLength = 0x09
                              bDescriptorType = 0x21
                              bDescriptorSubType = 0x10
                               RAW dump:
                               0x00 | 0x09, 0x21, 0x10, 0x01, 0x21, 0x01, 0x22, 0x5f,
                               0x08 | 0x02
                        
                             Endpoint 0
                                bLength = 0x0007
                                bDescriptorType = 0x0005
                                bEndpointAddress = 0x0081  <IN>
                                bmAttributes = 0x0003  <INTERRUPT>
                                wMaxPacketSize = 0x0008
                                bInterval = 0x000a
                                bRefresh = 0x0000
                                bSynchAddress = 0x0000
                        
                        