Suricata stops randomly with "stale" PID file.

ma0f97

Hey, Suricata seems to randomly stop on 2 of my 3 interfaces (LAN and WAN; OPT1, which is WireGuard, is fine). The bad thing is that I don't see any fatal error in the suricata.log file:

27/2/2022 -- 21:04:00 - <Notice> -- This is Suricata version 6.0.4 RELEASE running in SYSTEM mode
27/2/2022 -- 21:04:00 - <Info> -- CPUs/cores online: 3
27/2/2022 -- 21:04:00 - <Info> -- SSSE3 support not detected, disabling Hyperscan for MPM
27/2/2022 -- 21:04:00 - <Info> -- SSSE3 support not detected, disabling Hyperscan for SPM
27/2/2022 -- 21:04:00 - <Info> -- HTTP memcap: 67108864
27/2/2022 -- 21:04:00 - <Info> -- fast output device (regular) initialized: alerts.log
27/2/2022 -- 21:04:00 - <Info> -- http-log output device (regular) initialized: http.log
27/2/2022 -- 21:04:00 - <Info> -- Using log dir /var/log/suricata/suricata_vtnet139657
27/2/2022 -- 21:04:00 - <Info> -- Selected pcap-log compression method: none
27/2/2022 -- 21:04:00 - <Info> -- using normal logging
27/2/2022 -- 21:04:00 - <Info> -- eve-log output device (regular) initialized: eve.json
27/2/2022 -- 21:04:00 - <Info> -- Going to log the md5 sum of email subject
27/2/2022 -- 21:04:00 - <Warning> -- [ERRCODE: SC_WARN_NO_STATS_LOGGERS(261)] - stats are enabled but no loggers are active
27/2/2022 -- 21:04:00 - <Info> -- SSSE3 support not detected, disabling Hyperscan for SPM
27/2/2022 -- 21:04:00 - <Error> -- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - previous keyword has a fast_pattern:only; set. Can't have relative keywords around a fast_pattern only content
[...]
27/2/2022 -- 21:04:05 - <Warning> -- [ERRCODE: SC_WARN_FLOWBIT(306)] - flowbit 'file.pdf&file.ttf' is checked but not set. Checked in 28585 and 1 other sigs
27/2/2022 -- 21:04:05 - <Warning> -- [ERRCODE: SC_WARN_FLOWBIT(306)] - flowbit 'file.ppsx&file.zip' is checked but not set. Checked in 26068 and 1 other sigs
27/2/2022 -- 21:04:38 - <Info> -- Using 1 live device(s).
27/2/2022 -- 21:04:38 - <Info> -- using interface vtnet1
27/2/2022 -- 21:04:38 - <Info> -- running in 'auto' checksum mode. Detection of interface state will require 1000ULL packets
27/2/2022 -- 21:04:38 - <Info> -- Set snaplen to 1518 for 'vtnet1'
27/2/2022 -- 21:04:38 - <Info> -- Initializing PCAP ring buffer for /var/log/suricata/suricata_vtnet139657/log.pcap.
27/2/2022 -- 21:04:38 - <Notice> -- Ring buffer initialized with 4 files.
27/2/2022 -- 21:04:38 - <Info> -- RunModeIdsPcapAutoFp initialised
27/2/2022 -- 21:04:38 - <Notice> -- all 4 packet processing threads, 2 management threads initialized, engine started.
27/2/2022 -- 21:05:08 - <Info> -- No packets with invalid checksum, assuming checksum offloading is NOT used

When I click the start button on the Interfaces tab, it won't start, and the log says:

27/2/2022 -- 21:12:32 - <Notice> -- This is Suricata version 6.0.4 RELEASE running in SYSTEM mode
27/2/2022 -- 21:12:32 - <Info> -- CPUs/cores online: 3
27/2/2022 -- 21:12:32 - <Info> -- SSSE3 support not detected, disabling Hyperscan for MPM
27/2/2022 -- 21:12:32 - <Info> -- SSSE3 support not detected, disabling Hyperscan for SPM
27/2/2022 -- 21:12:32 - <Info> -- HTTP memcap: 67108864
27/2/2022 -- 21:12:32 - <Error> -- [ERRCODE: SC_ERR_INITIALIZATION(45)] - pid file '/var/run/suricata_vtnet139657.pid' exists but appears stale. Make sure Suricata is not running and then remove /var/run/suricata_vtnet139657.pid. Aborting!

Removing this stale PID file lets me start the interface again, but after a few minutes it goes red again.

I am using Suricata 6.0.4 with blocking mode DISABLED on all interfaces. I use the ET (free), Snort Community and Snort paid rules (and the ones from Suricata itself).

Can anybody help me? I would appreciate it.

SteveITS

@ma0f97 The .pid is left behind when it crashes. Is there anything in the system log file? Out of memory?
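To make "left behind" concrete: a PID file is stale when it still exists but the process ID stored in it no longer belongs to a running process. The sketch below is only an illustration of that idea, not the check Suricata itself performs; the function name `pid_file_is_stale` is made up for the example, and the path in the usage note is the one from the error message above.

```python
import os


def pid_file_is_stale(pid_file: str) -> bool:
    """Return True if pid_file exists but does not name a running process.

    Hypothetical helper for illustration only; not Suricata's own logic.
    """
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return False  # no PID file at all, so nothing is stale
    except ValueError:
        return True  # file exists but does not contain a number

    if pid <= 0:
        return True  # not a valid process ID

    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks the PID
    except ProcessLookupError:
        return True  # no such process: the file was left behind
    except PermissionError:
        return False  # process exists but belongs to another user
    return False
```

Calling `pid_file_is_stale("/var/run/suricata_vtnet139657.pid")` as root on the firewall would tell you whether the file is safe to remove. Note that a PID can be reused by an unrelated process, so a "not stale" answer is only a hint.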

You may not need to run it on WAN as well as internal networks; scanning on WAN happens before the firewall blocks packets so it will end up scanning a lot of packets that the firewall will immediately discard.

ma0f97

@steveits Ah, good idea to check system.log. I got the following:

<2>1 2022-02-27T20:57:05.836728+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(32): failed
<2>1 2022-02-27T20:57:05.836770+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(32): failed
<2>1 2022-02-27T20:57:05.836835+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(32): failed
<2>1 2022-02-27T20:57:05.836893+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(1): failed
<3>1 2022-02-27T20:57:11.928940+01:00 PfSense.pfsense.pve kernel - - - pid 11955 (suricata), jid 0, uid 0, was killed: out of swap space
<6>1 2022-02-27T20:57:11.929050+01:00 PfSense.pfsense.pve kernel - - - vtnet1: promiscuous mode disabled
<6>1 2022-02-27T20:57:13.396581+01:00 PfSense.pfsense.pve kernel - - - vtnet0: promiscuous mode enabled
<3>1 2022-02-27T20:58:40.546557+01:00 PfSense.pfsense.pve kernel - - - pid 37067 (suricata), jid 0, uid 0, was killed: out of swap space
<3>1 2022-02-27T20:59:33.466802+01:00 PfSense.pfsense.pve kernel - - - pid 97260 (suricata), jid 0, uid 0, was killed: out of swap space
<3>1 2022-02-27T20:59:36.710826+01:00 PfSense.pfsense.pve kernel - - - pid 75009 (suricata), jid 0, uid 0, was killed: out of swap space
<6>1 2022-02-27T20:59:36.710867+01:00 PfSense.pfsense.pve kernel - - - vtnet0: promiscuous mode disabled
<13>1 2022-02-27T21:00:00.263859+01:00 PfSense.pfsense.pve php 57413 - - [pfBlockerNG] Starting cron process.
<6>1 2022-02-27T21:00:08.488845+01:00 PfSense.pfsense.pve kernel - - - vtnet0: promiscuous mode enabled
<3>1 2022-02-27T21:00:26.606596+01:00 PfSense.pfsense.pve kernel - - - pid 64071 (suricata), jid 0, uid 0, was killed: out of swap space
<3>1 2022-02-27T21:00:27.866572+01:00 PfSense.pfsense.pve kernel - - - pid 64398 (suricata), jid 0, uid 0, was killed: out of swap space
<6>1 2022-02-27T21:00:27.866656+01:00 PfSense.pfsense.pve kernel - - - vtnet0: promiscuous mode disabled
<6>1 2022-02-27T21:01:08.826772+01:00 PfSense.pfsense.pve kernel - - - vtnet0: promiscuous mode enabled
<3>1 2022-02-27T21:02:22.354979+01:00 PfSense.pfsense.pve kernel - - - pid 77313 (suricata), jid 0, uid 0, was killed: out of swap space
<3>1 2022-02-27T21:02:56.766758+01:00 PfSense.pfsense.pve kernel - - - pid 58351 (suricata), jid 0, uid 0, was killed: out of swap space
<3>1 2022-02-27T21:03:32.886845+01:00 PfSense.pfsense.pve kernel - - - pid 89124 (suricata), jid 0, uid 0, was killed: out of swap space
<2>1 2022-02-27T21:04:33.786861+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(32): failed
<2>1 2022-02-27T21:04:33.786933+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(24): failed
<2>1 2022-02-27T21:04:33.996622+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(32): failed
<2>1 2022-02-27T21:04:33.996771+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(24): failed
<2>1 2022-02-27T21:04:34.626791+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(32): failed
<2>1 2022-02-27T21:04:34.626860+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(24): failed
<2>1 2022-02-27T21:04:34.626890+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(12): failed
<2>1 2022-02-27T21:04:34.626918+01:00 PfSense.pfsense.pve kernel - - - swap_pager_getswapspace(23): failed

I guess it has something to do with swap space. What is the best way to mitigate this? Disable swap? Make swap bigger?
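For anyone who wants to check their own system log for the same pattern, here is a rough sketch that counts the "was killed: out of swap space" lines per process name. The regular expression is written against the line format shown above; the function name `count_oom_kills` is made up for the example, and the sample lines are copied from the log excerpt.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in the log above, e.g.
#   "... kernel - - - pid 11955 (suricata), jid 0, uid 0, was killed: out of swap space"
KILL_RE = re.compile(r"pid (\d+) \(([^)]+)\), .*was killed: out of swap space")


def count_oom_kills(lines):
    """Count 'out of swap space' kills per process name (hypothetical helper)."""
    kills = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            kills[match.group(2)] += 1
    return kills


# Three lines taken from the system log excerpt above.
sample = [
    "<2>1 2022-02-27T20:57:05.836728+01:00 PfSense.pfsense.pve kernel - - - "
    "swap_pager_getswapspace(32): failed",
    "<3>1 2022-02-27T20:57:11.928940+01:00 PfSense.pfsense.pve kernel - - - "
    "pid 11955 (suricata), jid 0, uid 0, was killed: out of swap space",
    "<3>1 2022-02-27T20:58:40.546557+01:00 PfSense.pfsense.pve kernel - - - "
    "pid 37067 (suricata), jid 0, uid 0, was killed: out of swap space",
]

for name, count in count_oom_kills(sample).items():
    print(f"{name}: killed {count} time(s)")
```

To run it against a real log, pass an open file instead of `sample`, for example `count_oom_kills(open("/var/log/system.log"))`; that path is an assumption and may differ on your install.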

SteveITS

@ma0f97 Are you on 22.01/2.6? There was a memory leak in the pcscd service. If you aren't using IPSec, you can stop it; then either upgrade, or apply the patch that disables it properly in older versions (look for threads on it). The patch might be in the new System Patches package, if that's available in older versions.

If you are using IPSec then you have to stop IPSec before pcscd, then start IPSec again, or just reboot the router for a temporary fix. Otherwise IPSec logs a lot of errors.

ma0f97

@steveits No, I am on

2.5.2-RELEASE (amd64)
built on Fri Jul 02 15:33:00 EDT 2021
FreeBSD 12.2-STABLE

so I guess I am not affected.

I think I might add 2GiB of SWAP to the disk following this manual:
https://people.freebsd.org/~blackend/en_US.ISO8859-1/books/handbook/adding-swap-space.html

SteveITS

@ma0f97 Sorry, to be clear the pcscd service was disabled (memory leak fixed) in 2.6.

ma0f97

@steveits Hm, when starting the interfaces, swap was at 25% or so, but RAM was maxed out. The moment I disabled pcscd, swap was immediately at 99%.
Either way, I am upgrading now and hope nothing breaks.

bmeeks

Something is chewing up your RAM, most likely the pcscd daemon as @SteveITS mentioned. That issue is solved in the latest pfSense release.

The stale PID file is a symptom of another issue, not a problem in and of itself.

When your system runs out of available RAM, data within memory (RAM hardware) that is not actively being accessed is written out to a special disk area called swap space. This makes room in hardware RAM for immediate needs. But on the next context switch, some of that data written out to swap has to be read back in. That whole disk I/O business makes your box super sluggish.

You essentially NEVER want to see any swap space in use except in very extreme and rare temporary conditions. But if something is chewing up RAM and not releasing it, then the box runs out of physical RAM and starts using the swap area as a fallback. When it then also runs out of swap space, it's "game over": the kernel starts killing processes, which is the "was killed: out of swap space" message in your system log.

ma0f97

@bmeeks Hello, thanks for the detailed explanation. I updated pfSense (nothing broke) and also gave my machine 1GiB more RAM. The interfaces are now stable and haven't crashed a single time! The only thing I still wonder about is why my Proxmox PVE (a VM management OS) showed that only half of the available RAM was used when pfSense in fact showed 99%. When I use top and look at the Mem stats, I see that the memory in use matches what is reported to Proxmox, but there is an additional portion (about the same size) of "laundry" memory in use, whatever that means.

Anyway, the problem I described is now solved. Thanks again, guys.