pfSense 23.05 on 2100 - kernel memory leak with ZFS?
-
I've been running 23.05 on two 2100 units for about 25 days, and the system seems to hang after about 10 days, requiring a hard reboot. It appears that wired memory gradually increases over the course of this 10-day cycle until all memory is exhausted. So, my theory is that ZFS is not relinquishing memory and ultimately gobbles it all up, not allowing other processes to acquire new/additional memory and subsequently causing the system to hang. I have not changed the ZFS kernel parameters from their default (0 - Auto), but according to the pfSense documentation, the max should never exceed 1/2 the RAM or the total RAM minus 1GB, whichever is greater. So it appears it is not honoring the default maximum, since with 4GB of RAM that works out to 3GB (75% of memory), yet it has gone as high as 91%. As you can see from the graph below, there has been a distinct difference in how memory is managed since the upgrade to 23.01 on 4/13 and then 23.05 on 5/26, compared to the 22.01 that came with the units.
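For anyone wanting to double-check what ceiling the ARC is actually operating under (rather than the documented formula), a couple of sysctls should show it directly; vfs.zfs.arc_max reads 0 when left on auto, while c_max and size are the limit the ARC computed for itself and its current footprint:
sysctl vfs.zfs.arc_max
sysctl kstat.zfs.misc.arcstats.c_max
sysctl kstat.zfs.misc.arcstats.size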
My configuration is pretty basic:
pfSense 2100, 4GB RAM, 8GB Flash
BIOS Vendor: U-Boot, Version: 2018.03-devel-1.2.0ROGUE2-01.00.00.02+, Release Date: Fri Feb 7 2020
Version 23.05-RELEASE (arm64), built on Mon May 22 15:04:36 UTC 2023, FreeBSD 14.0-CURRENT
Interfaces: WAN, 5 VLANs
QoS: Basic traffic shaping
Firewall: 5-6 rules per interface
Services: pfBlocker, DNS Resolver, OpenVPN
Appreciate any assistance from the community on this issue.
Thank you,
Jon -
@johan333 You seem very confident that ZFS is the culprit. How did you determine that? Because that behaviour seems more like a proper memory leak in a process. For ZFS to use your memory like that would require not only a memory flaw in the process, but also a very steady, linear stream of reads from your SSD (to fill the ZFS cache). Did you try top to see if you can get a little more info?
https://docs.netgate.com/pfsense/en/latest/monitoring/status/system-activity.html -
This is simplified, a lot of details left out.
ZFS memory use is tied primarily to the ARC. The ARC is used to cache read data, so you need something doing a lot of reading from the filesystem - lots of different stuff.
I think the typical pattern on a pfSense box would be lots of reads at startup, then mostly writes to logs. Writes to the device obviously flow through different memory structures, and if the physical device is not responding there may be an increase until things time out, but they should then be getting freed.
From the GUI (Diagnostics > Command Prompt) you can get ZFS-specific values:
sysctl -a | grep zfs | grep arcstats
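If you only want the headline numbers out of that list, the same counters can be queried individually (the OID names are the ones that show up in the arcstats dump further down); comparing size against c_max over a few days shows whether the ARC itself is what's growing:
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.c_max
-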
On 23.05 the ZFS ARC usage on that graph is under Cache, not Wire, so it seems unlikely that ZFS would be the culprit based on what little information has been given.
You should monitor the system by keeping a console and/or ssh session open and watch what is using the memory with something like
top -aS -o res
Also, if you have something else configured that might be using RAM, such as RAM disks for /tmp and /var, that's worth looking into. If you have /var in a RAM disk and something is going nuts with log files and using a bunch of space that way, it could contribute to the issue.
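A quick way to check for RAM disks and for runaway log growth (the paths below are just the usual suspects):
mount | grep tmpfs
df -h /tmp /var /var/log
du -sh /var/log
-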
Re: pfSense 23.05 on 2100 - kernel memory leak with ZFS?
Thanks gents for the responses and I completely agree that my theory may be prematurely incriminating ZFS. I based this on my research on this forum where others saw a signature memory leak type of scenario on 23.01/23.05 and it pointed to ZFS and ARC. I'm new to the pfSense platform (and BSD), but I'm fairly computer savvy so I sincerely appreciate any help in diagnosing this problem. I've included output below from what you guys requested, but please chime in with what else I might want to use to build a better mouse trap as I expect I'm going to have to monitor this over time - as with most memory leak issues.
Here are two 'top -atSzo res -s 3' snapshots taken 16 hours apart. As far as I can see, no system or user level process is demonstrating a memory leak scenario. The only increase I see is with ARC, but I do not know how to diagnose what is driving its usage to ultimately find a culprit.
Snapshot #1 - 6/19/23 @ 13:27 PT
Snapshot #2 - 6/20/23 @ 05:22 PT
Here is a snapshot of the ARC stats:
kstat.zfs.misc.arcstats.abd_chunk_waste_size: 31232 kstat.zfs.misc.arcstats.cached_only_in_progress: 0 kstat.zfs.misc.arcstats.arc_raw_size: 0 kstat.zfs.misc.arcstats.arc_sys_free: 0 kstat.zfs.misc.arcstats.arc_need_free: 0 kstat.zfs.misc.arcstats.demand_iohit_prescient_prefetch: 24 kstat.zfs.misc.arcstats.demand_hit_prescient_prefetch: 24 kstat.zfs.misc.arcstats.prescient_prefetch: 48 kstat.zfs.misc.arcstats.demand_iohit_predictive_prefetch: 123 kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch: 737 kstat.zfs.misc.arcstats.predictive_prefetch: 2524 kstat.zfs.misc.arcstats.async_upgrade_sync: 59 kstat.zfs.misc.arcstats.arc_meta_min: 16777216 kstat.zfs.misc.arcstats.arc_meta_max: 45561080 kstat.zfs.misc.arcstats.arc_dnode_limit: 185929113 kstat.zfs.misc.arcstats.arc_meta_limit: 1859291136 kstat.zfs.misc.arcstats.arc_meta_used: 39604200 kstat.zfs.misc.arcstats.arc_prune: 0 kstat.zfs.misc.arcstats.arc_loaned_bytes: 0 kstat.zfs.misc.arcstats.arc_tempreserve: 0 kstat.zfs.misc.arcstats.arc_no_grow: 0 kstat.zfs.misc.arcstats.memory_available_bytes: 2446315520 kstat.zfs.misc.arcstats.memory_free_bytes: 2520383488 kstat.zfs.misc.arcstats.memory_all_bytes: 3552796672 kstat.zfs.misc.arcstats.memory_indirect_count: 0 kstat.zfs.misc.arcstats.memory_direct_count: 0 kstat.zfs.misc.arcstats.memory_throttle_count: 0 kstat.zfs.misc.arcstats.l2_rebuild_log_blks: 0 kstat.zfs.misc.arcstats.l2_rebuild_bufs_precached: 0 kstat.zfs.misc.arcstats.l2_rebuild_bufs: 0 kstat.zfs.misc.arcstats.l2_rebuild_asize: 0 kstat.zfs.misc.arcstats.l2_rebuild_size: 0 kstat.zfs.misc.arcstats.l2_rebuild_lowmem: 0 kstat.zfs.misc.arcstats.l2_rebuild_cksum_lb_errors: 0 kstat.zfs.misc.arcstats.l2_rebuild_dh_errors: 0 kstat.zfs.misc.arcstats.l2_rebuild_io_errors: 0 kstat.zfs.misc.arcstats.l2_rebuild_unsupported: 0 kstat.zfs.misc.arcstats.l2_rebuild_success: 0 kstat.zfs.misc.arcstats.l2_data_to_meta_ratio: 0 kstat.zfs.misc.arcstats.l2_log_blk_count: 0 kstat.zfs.misc.arcstats.l2_log_blk_asize: 0 kstat.zfs.misc.arcstats.l2_log_blk_avg_asize: 0 kstat.zfs.misc.arcstats.l2_log_blk_writes: 0 kstat.zfs.misc.arcstats.l2_hdr_size: 0 kstat.zfs.misc.arcstats.l2_asize: 0 kstat.zfs.misc.arcstats.l2_size: 0 kstat.zfs.misc.arcstats.l2_io_error: 0 kstat.zfs.misc.arcstats.l2_cksum_bad: 0 kstat.zfs.misc.arcstats.l2_abort_lowmem: 0 kstat.zfs.misc.arcstats.l2_free_on_write: 0 kstat.zfs.misc.arcstats.l2_evict_l1cached: 0 kstat.zfs.misc.arcstats.l2_evict_reading: 0 kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0 kstat.zfs.misc.arcstats.l2_writes_lock_retry: 0 kstat.zfs.misc.arcstats.l2_writes_error: 0 kstat.zfs.misc.arcstats.l2_writes_done: 0 kstat.zfs.misc.arcstats.l2_writes_sent: 0 kstat.zfs.misc.arcstats.l2_write_bytes: 0 kstat.zfs.misc.arcstats.l2_read_bytes: 0 kstat.zfs.misc.arcstats.l2_rw_clash: 0 kstat.zfs.misc.arcstats.l2_feeds: 0 kstat.zfs.misc.arcstats.l2_bufc_metadata_asize: 0 kstat.zfs.misc.arcstats.l2_bufc_data_asize: 0 kstat.zfs.misc.arcstats.l2_mfu_asize: 0 kstat.zfs.misc.arcstats.l2_mru_asize: 0 kstat.zfs.misc.arcstats.l2_prefetch_asize: 0 kstat.zfs.misc.arcstats.l2_misses: 0 kstat.zfs.misc.arcstats.l2_hits: 0 kstat.zfs.misc.arcstats.uncached_evictable_metadata: 0 kstat.zfs.misc.arcstats.uncached_evictable_data: 0 kstat.zfs.misc.arcstats.uncached_size: 0 kstat.zfs.misc.arcstats.mfu_ghost_evictable_metadata: 0 kstat.zfs.misc.arcstats.mfu_ghost_evictable_data: 0 kstat.zfs.misc.arcstats.mfu_ghost_size: 0 kstat.zfs.misc.arcstats.mfu_evictable_metadata: 1089024 kstat.zfs.misc.arcstats.mfu_evictable_data: 36472832 kstat.zfs.misc.arcstats.mfu_size: 
48907264 kstat.zfs.misc.arcstats.mru_ghost_evictable_metadata: 0 kstat.zfs.misc.arcstats.mru_ghost_evictable_data: 0 kstat.zfs.misc.arcstats.mru_ghost_size: 0 kstat.zfs.misc.arcstats.mru_evictable_metadata: 2090496 kstat.zfs.misc.arcstats.mru_evictable_data: 99806720 kstat.zfs.misc.arcstats.mru_size: 123869184 kstat.zfs.misc.arcstats.anon_evictable_metadata: 0 kstat.zfs.misc.arcstats.anon_evictable_data: 0 kstat.zfs.misc.arcstats.anon_size: 151552 kstat.zfs.misc.arcstats.other_size: 6773040 kstat.zfs.misc.arcstats.bonus_size: 1121600 kstat.zfs.misc.arcstats.dnode_size: 4335728 kstat.zfs.misc.arcstats.dbuf_size: 1315712 kstat.zfs.misc.arcstats.metadata_size: 31754240 kstat.zfs.misc.arcstats.data_size: 141173760 kstat.zfs.misc.arcstats.hdr_size: 1076920 kstat.zfs.misc.arcstats.overhead_size: 28936192 kstat.zfs.misc.arcstats.uncompressed_size: 411830784 kstat.zfs.misc.arcstats.compressed_size: 143991808 kstat.zfs.misc.arcstats.size: 180809192 kstat.zfs.misc.arcstats.c_max: 2479054848 kstat.zfs.misc.arcstats.c_min: 111024896 kstat.zfs.misc.arcstats.c: 283686656 kstat.zfs.misc.arcstats.p: 141843840 kstat.zfs.misc.arcstats.hash_chain_max: 2 kstat.zfs.misc.arcstats.hash_chains: 22 kstat.zfs.misc.arcstats.hash_collisions: 13561 kstat.zfs.misc.arcstats.hash_elements_max: 5494 kstat.zfs.misc.arcstats.hash_elements: 5245 kstat.zfs.misc.arcstats.evict_l2_skip: 0 kstat.zfs.misc.arcstats.evict_l2_ineligible: 0 kstat.zfs.misc.arcstats.evict_l2_eligible_mru: 0 kstat.zfs.misc.arcstats.evict_l2_eligible_mfu: 0 kstat.zfs.misc.arcstats.evict_l2_eligible: 0 kstat.zfs.misc.arcstats.evict_l2_cached: 0 kstat.zfs.misc.arcstats.evict_not_enough: 0 kstat.zfs.misc.arcstats.evict_skip: 0 kstat.zfs.misc.arcstats.access_skip: 0 kstat.zfs.misc.arcstats.mutex_miss: 0 kstat.zfs.misc.arcstats.deleted: 0 kstat.zfs.misc.arcstats.uncached_hits: 0 kstat.zfs.misc.arcstats.mfu_ghost_hits: 0 kstat.zfs.misc.arcstats.mfu_hits: 4685498 kstat.zfs.misc.arcstats.mru_ghost_hits: 0 kstat.zfs.misc.arcstats.mru_hits: 1616339 kstat.zfs.misc.arcstats.prefetch_metadata_misses: 351 kstat.zfs.misc.arcstats.prefetch_metadata_iohits: 681 kstat.zfs.misc.arcstats.prefetch_metadata_hits: 857 kstat.zfs.misc.arcstats.prefetch_data_misses: 658 kstat.zfs.misc.arcstats.prefetch_data_iohits: 0 kstat.zfs.misc.arcstats.prefetch_data_hits: 25 kstat.zfs.misc.arcstats.demand_metadata_misses: 1193 kstat.zfs.misc.arcstats.demand_metadata_iohits: 107 kstat.zfs.misc.arcstats.demand_metadata_hits: 3393539 kstat.zfs.misc.arcstats.demand_data_misses: 2548 kstat.zfs.misc.arcstats.demand_data_iohits: 56 kstat.zfs.misc.arcstats.demand_data_hits: 2907416 kstat.zfs.misc.arcstats.misses: 4750 kstat.zfs.misc.arcstats.iohits: 844 kstat.zfs.misc.arcstats.hits: 6301837
I am not using any RAM disks on the system:
[23.05-RELEASE][jonsmall@pfSense-small.home.lan]/home/jonsmall: mount
pfSense/ROOT/default on / (zfs, local, noatime, nfsv4acls)
devfs on /dev (devfs)
pfSense/var on /var (zfs, local, noatime, nfsv4acls)
pfSense/tmp on /tmp (zfs, local, noatime, nosuid, nfsv4acls)
pfSense/home on /home (zfs, local, noatime, nfsv4acls)
pfSense/var/log on /var/log (zfs, local, noatime, noexec, nosuid, nfsv4acls)
pfSense/var/db on /var/db (zfs, local, noatime, noexec, nosuid, nfsv4acls)
pfSense/var/cache on /var/cache (zfs, local, noatime, noexec, nosuid, nfsv4acls)
pfSense/var/tmp on /var/tmp (zfs, local, noatime, nosuid, nfsv4acls)
pfSense/ROOT/default/cf on /cf (zfs, local, noatime, noexec, nosuid, nfsv4acls)
pfSense/ROOT/default/var_cache_pkg on /var/cache/pkg (zfs, local, noatime, noexec, nosuid, nfsv4acls)
pfSense/ROOT/default/var_db_pkg on /var/db/pkg (zfs, local, noatime, noexec, nosuid, nfsv4acls)
tmpfs on /var/run (tmpfs, local)
devfs on /var/dhcpd/dev (devfs)
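Since I expect to be watching this over time, one simple option is to log the key counters periodically from cron or a shell loop, something like this (the log path and choice of counters are just examples; v_wire_count is in pages, not bytes):
( date ; sysctl vm.stats.vm.v_wire_count ; sysctl kstat.zfs.misc.arcstats.size ) >> /root/memtrend.log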
-
@johan333 FWIW this page links to a ZFS tuning doc: https://docs.netgate.com/pfsense/en/latest/hardware/memory.html
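If tuning does turn out to be necessary, the usual knob for capping the ARC on FreeBSD/pfSense is the vfs.zfs.arc_max loader tunable, set in /boot/loader.conf.local (value in bytes; the 1 GB below is only an illustration, not a recommendation):
vfs.zfs.arc_max="1073741824"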
Haven’t seen hanging 2100s, and haven’t done any memory tuning. Can you view the console while it happens?
https://docs.netgate.com/pfsense/en/latest/solutions/netgate-2100/connect-to-console.html -
@SteveITS Thanks for the reference to memory tuning; I'll check it out. You know, I was thinking the same thing, but I don't currently have the console connected to anything. I've got a spare Raspberry Pi that I think I'll throw in the cabinet and connect to the 2100's console port.
-
I just wanted to follow up on this post and provide some closure as I believe I've identified the issue. When researching problems in forums like this, it can be so frustrating to read through 20 pages of dialogue only to find no closure. It is as if someone tore out the last pages of a great novel!
Please also allow me to apologize to the community for prematurely blaming ZFS or the kernel, as I'm 90% certain that isn't the case. This is out of character for me; it's no different than a typical doctor/patient exam, where I should've started by describing the symptoms and allowed the experts to ask more detailed questions, determine a prognosis, and set a treatment plan. Again, my humble apologies.
Now, onto what I've discovered after carefully scrutinizing the system through the lens of what is different about my systems versus other 2100s running 23.05 that aren't raising flags about memory leaks. I found some anecdotal things, such as running pfBlockerNG-devel (3.2.0_5) instead of pfBlockerNG (3.2.0_5), but I didn't feel that was a likely culprit as the associated process memory has always been stable. I then remembered the constant errors I have been getting in dmesg from the connected UPS systems:
ugen1.2: <CPS ST Series> at usbus1 (disconnected)
ugen1.2: <CPS ST Series> at usbus1
See, when these systems were installed (6mo ago), I set up a UPS on each of them and attempted to use NUT to manage the UPS interface. However, it was not stable, so I opted to just disable NUT for the time being and save the troubleshooting for a rainy day. These messages are constant (every few seconds), so I began thinking that perhaps with each disconnect/connect some kernel-level memory allocation (malloc call) was happening that wasn't being released. After 10 days of this constant cycle, the system was just starved of free memory and became unstable. So, 2 days ago, I disconnected the UPS from pfSense and performed a hard restart. As you can see from the graph below, the wired and free memory are looking much better behaved: I would normally see a 7-8% increase in wired memory within 24 hours, and so far it hasn't fluctuated more than +/- 2% over 2 days.
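For anyone who wants to look for this sort of thing on their own box, my understanding is the stock FreeBSD tools can show per-type kernel allocations; capturing them a day apart and comparing should reveal whether any malloc type or UMA zone keeps climbing (file names are just examples):
vmstat -m > /root/vmstat-m.$(date +%Y%m%d)
vmstat -z > /root/vmstat-z.$(date +%Y%m%d)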
Again, thank you for the community's support as these troubleshooting efforts, although menacing, do provide a valuable opportunity to learn more about the system.
BTW - I have since connected the UPSs (and the pfSense console) to a Raspberry Pi running NUT and it is working great! Now the rainy day project is configuring a NUT client on pfSense to respond to loss-of-power triggers from the NUT server.
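In raw NUT terms the client side should just be a MONITOR line in upsmon.conf pointing at the Pi (I believe the pfSense NUT package exposes this through its GUI); the host, UPS name, and credentials below are placeholders:
MONITOR ups@nut-pi.home.lan 1 upsmon secretpass slave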
-
Nice catch!
So the UPS USB device was flapping like that with or without NUT installed/running? -
@stephenw10 Yes, I'm using two different models of CyberPower UPSs (SL700U & CP425SLG) and have been experiencing this flapping since the 2100s were new with v22.05 - 6mo ago. I removed the NUT package shortly after the install when these disconnects were making NUT unstable. They are both HID compliant and I was using the usbhid-ups driver with NUT, but it was still flapping. As mentioned, I now have them connected to a Raspberry Pi with NUT and the usbhid-ups driver and it is working great (see pic).
In case you're interested, I saved this output. When it disconnected, the UPS device didn't show in a usbconfig listing.
[23.05-RELEASE][jonsmall@pfSense-rider.home.lan]/home/jonsmall: sudo usbconfig
ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.1: <Marvell EHCI root HUB> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
ugen1.2: <CPS ST Series> at usbus1, cfg=0 md=HOST spd=LOW (1.5Mbps) pwr=ON (50mA)
[23.05-RELEASE][jonsmall@pfSense-rider.home.lan]/home/jonsmall: sudo usbconfig -vd ugen1.2
ugen1.2: <CPS ST Series> at usbus1, cfg=0 md=HOST spd=LOW (1.5Mbps) pwr=ON (50mA)

  bLength = 0x0012
  bDescriptorType = 0x0001
  bcdUSB = 0x0110
  bDeviceClass = 0x0000 <Probed by interface class>
  bDeviceSubClass = 0x0000
  bDeviceProtocol = 0x0000
  bMaxPacketSize0 = 0x0008
  idVendor = 0x0764
  idProduct = 0x0501
  bcdDevice = 0x0001
  iManufacturer = 0x0003 <CPS>
  iProduct = 0x0001 <ST Series>
  iSerialNumber = 0x0000 <no string>
  bNumConfigurations = 0x0001

  Configuration index 0
    bLength = 0x0009
    bDescriptorType = 0x0002
    wTotalLength = 0x0022
    bNumInterfaces = 0x0001
    bConfigurationValue = 0x0001
    iConfiguration = 0x0000 <no string>
    bmAttributes = 0x00a0
    bMaxPower = 0x0019

    Interface 0
      bLength = 0x0009
      bDescriptorType = 0x0004
      bInterfaceNumber = 0x0000
      bAlternateSetting = 0x0000
      bNumEndpoints = 0x0001
      bInterfaceClass = 0x0003 <HID device>
      bInterfaceSubClass = 0x0000
      bInterfaceProtocol = 0x0000
      iInterface = 0x0000 <no string>

      Additional Descriptor
        bLength = 0x09
        bDescriptorType = 0x21
        bDescriptorSubType = 0x10
        RAW dump:
        0x00 | 0x09, 0x21, 0x10, 0x01, 0x21, 0x01, 0x22, 0x5f,
        0x08 | 0x02

      Endpoint 0
        bLength = 0x0007
        bDescriptorType = 0x0005
        bEndpointAddress = 0x0081 <IN>
        bmAttributes = 0x0003 <INTERRUPT>
        wMaxPacketSize = 0x0008
        bInterval = 0x000a
        bRefresh = 0x0000
        bSynchAddress = 0x0000
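For anyone else chasing the same symptom, a crude way to gauge how often the device is bouncing is to count the events still sitting in the kernel message buffer (the buffer wraps, so this undercounts on long uptimes):
dmesg | grep -c 'ugen1.2.*disconnected'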