Suricata process dying due to hyperscan problem

bmeeks

@jowe78 said in Suricata process dying due to hyperscan problem:

Hello,

I'm having the same problem.

I have 6 interfaces set up with Suricata, and only 2 of them stop randomly.
One using IX0 on WAN
And the other one using VLAN on IX1
device = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
Using IPS Mode - Legacy Mode

I deleted one of the monitored interfaces in Suricata that was having the issue and duplicated a working one, but got the same error on the new (same as before) interface. I also tried disabling some of the working ones, but nothing changed.

Suricata log
[607907 - W#02] 2023-11-27 08:35:42 Error: spm-hs: Hyperscan returned fatal error -1.

System log.
Nov 27 08:35:42 kernel ix0: promiscuous mode disabled

That is really puzzling. It is very hard to pin down what the root cause of this might be. Are the rules different on the interfaces with no issue compared to the interfaces that are crashing?

bmeeks

@kiokoman said in Suricata process dying due to hyperscan problem:

[106211 - Suricata-Main] 2023-11-27 13:12:52 Notice: threads: Threads created -> W: 1 FM: 1 FR: 1 Engine started.
[863533 - W#01-vmx2] 2023-11-27 13:12:53 Info: checksum: No packets with invalid checksum, assuming checksum offloading is NOT used
[863533 - W#01-vmx2] 2023-11-27 13:13:00 Error: spm-hs: Hyperscan returned fatal error -1.

elfctl did not help for me

Hmm...I was sort of afraid that might be the result. Another user tried it and it seemed to work very briefly, but then it crashed. The random nature of this bug is frustrating. It happens with different physical interfaces. For some users it happens immediately (they can't even start an interface), but for others it happens at random points during a long runtime.

@kiokoman said in Suricata process dying due to hyperscan problem:

How about Vectorscan? Is there a plan for it?

Vectorscan is something Suricata upstream would have to incorporate into the binary. All we do on the pfSense side is take the upstream source code for the binary and add the custom blocking plugin for Legacy Blocking Mode.

I'm also unsure at this point what the support level is in Vectorscan for Intel devices. It was first developed to bring hyperscan-like technology to ARM and other non-Intel CPUs.

bmeeks

@chrysmon said in Suricata process dying due to hyperscan problem:

@jowe78 I have Suricata in IPS mode on the WAN interface. It crashed about once a day with the Hyperscan matcher. Yesterday I switched to AC-KS mode and it crashed after half a day of running. There is no error in the logs.
Now I'm switching to AC-BS mode and will keep you updated.

There is no error in any log? Always check BOTH the pfSense system log under STATUS > SYSTEM LOGS and the suricata.log under the LOGS VIEW tab in the Suricata GUI.

Different things are going to be logged in each. For example, if Suricata hard crashes, it can't log anything into suricata.log about the crash because the binary died suddenly. But the pfSense operating system will see the binary crash and log information about it in the pfSense system log.
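
To make the "check both logs" advice concrete, here is a minimal sketch that scans lines from each log for the two signatures quoted earlier in this thread. The function name `scan` and the idea of passing each log as a list of lines are illustrative assumptions, and the real logs may contain other crash signatures that this does not cover.

```python
import re

# Signatures copied from the log excerpts quoted earlier in this thread.
SURICATA_FATAL = re.compile(r"Error: spm-hs: Hyperscan returned fatal error (-?\d+)")
PROMISC_OFF = re.compile(r"kernel (\w+): promiscuous mode disabled")


def scan(suricata_lines, system_lines):
    """Return (source, message) pairs for known signatures in each log.

    `scan` is a hypothetical helper; both arguments are plain lists of
    log lines, however you choose to obtain them.
    """
    findings = []
    for line in suricata_lines:
        m = SURICATA_FATAL.search(line)
        if m:
            findings.append(("suricata.log", f"Hyperscan fatal error {m.group(1)}"))
    for line in system_lines:
        m = PROMISC_OFF.search(line)
        if m:
            findings.append(("system log", f"{m.group(1)} left promiscuous mode"))
    return findings


if __name__ == "__main__":
    suricata_log = [
        "[607907 - W#02] 2023-11-27 08:35:42 Error: spm-hs: Hyperscan returned fatal error -1.",
    ]
    system_log = [
        "Nov 27 08:35:42 kernel ix0: promiscuous mode disabled",
    ]
    for source, message in scan(suricata_log, system_log):
        print(f"{source}: {message}")
```

Matching timestamps between the two sets of findings is then a manual step. The sketch only narrows down which lines are worth comparing.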

chrysmon

@bmeeks I wrote it explicitly because it was unusual: in system.log the last entry about Suricata was a detection log. Nothing about a crash.
Mine is still running with the AC-BS Matcher Algorithm. I even did a manual update, and it was successful.

jowe78

@bmeeks said in Suricata process dying due to hyperscan problem:

That is really puzzling. It is very hard to pin down what the root cause of this might be. Are the rules different on the interfaces with no issue compared to the interfaces that are crashing?

I have all rulesets applied to all interfaces, but not all rules enabled. There are exceptions, and some rules are disabled to ensure functionality. So there are differences between the interfaces.

I will start fresh on one of the interfaces to see how it works.

asdjklfjkdslfdsaklj

@bmeeks in my case, no (configured) variance between interfaces.

I recently removed Suricata, including all configuration, caches, logs, etc., and installed fresh. Created the first interface, then copied it to create the second. First interface seems to be stable, but the second will die fairly shortly after start, due to the aforementioned hyper scan problem.

jowe78

I just removed the old interfaces and set up 3 new Suricata interfaces, then let them run for a couple of hours in IDS mode, disabling rules that break functionality. But as soon as I enabled IPS, the interface stopped 5 minutes after start with the same error.

bmeeks

@asdjklfjkdslfdsaklj said in Suricata process dying due to hyperscan problem:

@bmeeks in my case, no (configured) variance between interfaces.

I recently removed Suricata, including all configuration, caches, logs, etc., and installed fresh. Created the first interface, then copied it to create the second. First interface seems to be stable, but the second will die fairly shortly after start, due to the aforementioned hyper scan problem.

That is just so weird! What should be two practically identical setups, yet one works and the other crashes. I honestly am running out of ideas at this point. There does not seem to be a common thread other than Hyperscan.

asdjklfjkdslfdsaklj

@bmeeks said in Suricata process dying due to hyperscan problem:

@asdjklfjkdslfdsaklj said in Suricata process dying due to hyperscan problem:

@bmeeks in my case, no (configured) variance between interfaces.

I recently removed Suricata, including all configuration, caches, logs, etc., and installed fresh. Created the first interface, then copied it to create the second. First interface seems to be stable, but the second will die fairly shortly after start, due to the aforementioned hyper scan problem.

That is just so weird! What should be two practically identical setups, yet one works and the other crashes. I honestly am running out of ideas at this point. There does not seem to be a common thread other than Hyperscan.

Indeed.

The initially created interface I mentioned just died, with the same Hyperscan fatal error.

No expectations with regard to time and effort here, but if you need a methodical guinea pig, say the word.

chrysmon

@asdjklfjkdslfdsaklj More than 1 day running without a crash, with the AC-BS Pattern Matcher Algorithm.

jowe78

I installed Snort to test, and there I get a different error. Memory usage is at 57% of 8GB with all Snort interfaces running, so it might be a little high.
I tried changing the "Stream Memory Cap" on the Suricata interface (before changing to Snort) from 256MB to 384MB with no luck. I also changed some other memory settings, again without any luck.

Is it using a lot of RAM when reloading? It doesn't show on the main page, at least.

These are the last log entries before the crash. It does not crash every time I disable a rule, though, especially not with Suricata.

Nov 28 12:57:01 kernel ix0: promiscuous mode disabled
Nov 28 12:57:01 kernel pid 79776 (snort), jid 0, uid 0, was killed: failed to reclaim memory
Nov 28 12:56:59 kernel pid 79699 (snort), jid 0, uid 0, was killed: failed to reclaim memory
Nov 28 12:56:44 kernel pid 88772 (php-fpm), jid 0, uid 0, was killed: failed to reclaim memory
Nov 28 12:56:25 php-fpm 88772 [Snort] Snort RELOAD CONFIG for WAN(ix0)...
Nov 28 12:56:25 php-fpm 88772 [Snort] Building new sid-msg.map file for WAN...
Nov 28 12:56:25 php-fpm 88772 [Snort] Enabling any flowbit-required rules for: WAN...
Nov 28 12:56:24 php-fpm 88772 [Snort] Enabling any flowbit-required rules for: WAN...
Nov 28 12:56:23 php-fpm 88772 [Snort] Updating rules configuration for: WAN ...
Nov 28 12:56:23 check_reload_status 438 Syncing firewall
Nov 28 12:56:23 php-fpm 88772 /snort/snort_alerts.php: Configuration Change: admin@1.2.3.4 (Local Database): Snort pkg: User-forced rule state override applied for rule XXX:X on ALERTS tab for interface wan.

bmeeks

@jowe78 said in Suricata process dying due to hyperscan problem:

Nov 28 12:57:01 kernel pid 79776 (snort), jid 0, uid 0, was killed: failed to reclaim memory
Nov 28 12:56:59 kernel pid 79699 (snort), jid 0, uid 0, was killed: failed to reclaim memory
Nov 28 12:56:44 kernel pid 88772 (php-fpm), jid 0, uid 0, was killed: failed to reclaim memory

These log entries are very interesting...

I've been doing some research this morning on memory management in modern operating systems and FreeBSD in particular. Still not an expert in this area by any measure, but I've learned some things that make me suspect a memory allocation/reclamation bug may exist in the recent FreeBSD releases.

Memory management in a modern operating system such as FreeBSD is quite complex. There are several memory area classifications explained here.

The operating system can experience something known as "memory pressure". This is a condition where some process needs additional memory but there is currently no Free memory available (refer to the link a couple of sentences prior in this paragraph for the definition of Free). In this state, the kernel memory management algorithm goes on the hunt for memory it can reclaim and then give to the requesting process. The kernel does its best to find memory to give a requesting process instead of just simply returning an OOM (out-of-memory) error to the requester. It first looks for a process that is sleeping, and if it finds a suitable one, it will either reclaim that memory space temporarily or swap that process' memory out to the swap partition. But if there is no sleeping process and the kernel can't otherwise find memory for the process currently requesting it, it will go on the hunt for something to kill in order to obtain memory. It is possible in that scenario for it to choose one of the largest memory consumption processes to kill.

So, back to the log entries. Snort will be using a lot of extra memory during the rules update process, and it will be using a good chunk of that memory through the PHP interpreter. Look at the log entries I quoted above and notice which processes were killed: snort and the php-fpm engine. These would have been the biggest memory users at the time. But curiously, these were also the processes that were likely asking for additional memory.

I've seen a number of posts since the recent pfSense Plus release and the 2.7.1 CE release with similar log errors. The commonly impacted programs are unbound and snort, with a few others showing up occasionally. The use of ZFS and its ARC (Adaptive Replacement Cache) might play a role here, too.

This post I found does a decent job of explaining how memory management in FreeBSD works: https://unix.stackexchange.com/questions/234446/how-does-freebsd-allocate-memory.
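
As a rough aid for spotting this pattern in a system log, the sketch below tallies which processes the kernel reports killing. The regular expression is modeled only on the three "was killed" lines quoted above, and `killed_processes` is a hypothetical helper name. Other FreeBSD versions may word these messages differently.

```python
import re
from collections import Counter

# Modeled on the kernel messages quoted above, for example:
#   kernel pid 79776 (snort), jid 0, uid 0, was killed: failed to reclaim memory
KILL_RE = re.compile(
    r"kernel pid (?P<pid>\d+) \((?P<name>[^)]+)\), jid \d+, uid \d+, "
    r"was killed: (?P<reason>.+)$"
)


def killed_processes(lines):
    """Count kernel kill messages per process name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            counts[m.group("name")] += 1
    return counts


SAMPLE = [
    "Nov 28 12:57:01 kernel ix0: promiscuous mode disabled",
    "Nov 28 12:57:01 kernel pid 79776 (snort), jid 0, uid 0, was killed: failed to reclaim memory",
    "Nov 28 12:56:59 kernel pid 79699 (snort), jid 0, uid 0, was killed: failed to reclaim memory",
    "Nov 28 12:56:44 kernel pid 88772 (php-fpm), jid 0, uid 0, was killed: failed to reclaim memory",
]

if __name__ == "__main__":
    for name, count in killed_processes(SAMPLE).most_common():
        print(f"{name}: killed {count} time(s)")
```

A tally like this only shows which processes were killed. It does not say why memory ran short.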

chrysmon

@bmeeks said in Suricata process dying due to hyperscan problem:

@jowe78 said in Suricata process dying due to hyperscan problem:

Nov 28 12:57:01 kernel pid 79776 (snort), jid 0, uid 0, was killed: failed to reclaim memory
Nov 28 12:56:59 kernel pid 79699 (snort), jid 0, uid 0, was killed: failed to reclaim memory
Nov 28 12:56:44 kernel pid 88772 (php-fpm), jid 0, uid 0, was killed: failed to reclaim memory

These log entries are very interesting...

I've been doing some research this morning on memory management in modern operating systems and FreeBSD in particular. Still not an expert in this area by any measure, but I've learned some things that make me suspect a memory allocation/reclamation bug may exist in the recent FreeBSD releases.

Memory management in a modern operating system such as FreeBSD is quite complex. There are several memory area classifications explained here.

The operating system can experience something known as "memory pressure". This is a condition where some process needs additional memory but there is currently no Free memory available (refer to the link a couple of sentences prior in this paragraph for the definition of Free). In this state, the kernel memory management algorithm goes on the hunt for memory it can reclaim and then give to the requesting process. The kernel does its best to find memory to give a requesting process instead of just simply returning an OOM (out-of-memory) error to the requester. It first looks for a process that is sleeping, and if it finds a suitable one, it will either reclaim that memory space temporarily or swap that process' memory out to the swap partition. But if there is no sleeping process and the kernel can't otherwise find memory for the process currently requesting it, it will go on the hunt for something to kill in order to obtain memory. It is possible in that scenario for it to choose one of the largest memory consumption processes to kill.

So, back to the log entries. Snort will be using a lot of extra memory during the rules update process. And it will be using a good chunk of that memory through the PHP interpreter. Look at the log entries I quoted above and notice what processes were killed: snort and the php-fpm engine. These would have been the biggest current memory users. But curiously, it was these processes that were likely asking for additional memory.

I've seen a number of posts since the recent pfSense Plus release and the 2.7.1 CE release with similar log errors. Commonly impacted programs are unbound and snort. But sometimes a few others. The use of ZFS and its ARC (Adaptive Replacement Cache) might play a role here, too.

This post I found does a decent job of explaining how memory management in FreeBSD works: https://unix.stackexchange.com/questions/234446/how-does-freebsd-allocate-memory.

Let me share my experience with memory. The same configuration running on three different physical systems:

With 16GB RAM: uses the swap partition, and the performance is unacceptable
With 32GB RAM: uses about 50% (peaks at 54%), no swap
With 64GB RAM: uses about 7%, no swap

The values are from System Information - Memory usage
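
Converting those percentages to absolute figures makes the comparison easier. This is plain arithmetic on the numbers reported above, a sketch only; `used_gb` is an illustrative helper, and the dashboard's own definition of "usage" is not known here.

```python
def used_gb(total_gb, percent):
    """Convert a dashboard percentage into an approximate amount of RAM in GB."""
    return total_gb * percent / 100


# (total RAM in GB, reported usage in percent) from the figures above.
reports = [(32, 50), (32, 54), (64, 7)]

if __name__ == "__main__":
    for total, pct in reports:
        print(f"{total} GB at {pct}% is about {used_gb(total, pct):.1f} GB in use")
```

On these figures the 32GB system holds roughly 16 to 17 GB in use, while the 64GB system holds under 5 GB, so the usage does not simply scale with the same workload.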

kiokoman

LAN is cloned from WAN, there is no difference but wan is stable ...

SteveITS

@kiokoman Weird, some difference in the private subnet or related pass list maybe?

@chrysmon said in Suricata process dying due to hyperscan problem:

With 16GB RAM: uses the swap partition, the performance is unacceptable
With 32GB RAM uses about 50% (peaks at 54%), no swap

Have you found https://docs.netgate.com/pfsense/en/latest/hardware/tune-zfs.html?

"The default maximum ARC size (vfs.zfs.arc.max) is automatic (0) and uses 1/2 RAM or the total RAM minus 1GB, whichever is greater."
(but also it's supposed to give it up on its own)
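
Taking the quoted sentence at face value, the default ARC ceiling can be worked out for the RAM sizes mentioned in this thread. This is a sketch of that one sentence only, not of how ZFS actually sizes the ARC at runtime; it assumes "1GB" means 1 GiB, and `default_arc_max` is a made-up name.

```python
GIB = 1024 ** 3


def default_arc_max(total_ram_bytes):
    """Ceiling per the quoted sentence: the greater of RAM/2 and RAM minus 1 GiB.

    Assumption: "1GB" in the documentation is treated as 1 GiB here.
    """
    return max(total_ram_bytes // 2, total_ram_bytes - GIB)


if __name__ == "__main__":
    for gib in (8, 16, 32, 64):
        arc = default_arc_max(gib * GIB)
        print(f"{gib} GiB RAM: ARC may grow to about {arc / GIB:.0f} GiB")
```

By that rule, "RAM minus 1GB" wins for every size listed, so on an 8 GiB box the ARC would be allowed to grow to about 7 GiB before it has to give memory back.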

bmeeks

@kiokoman said in Suricata process dying due to hyperscan problem:

LAN is cloned from WAN, there is no difference, but WAN is stable ...

Does turning Blocking Mode off completely make any difference on the LAN interface?

chrysmon

@SteveITS said in Suricata process dying due to hyperscan problem:

@kiokoman Weird, some difference in the private subnet or related pass list maybe?

@chrysmon said in Suricata process dying due to hyperscan problem:

With 16GB RAM: uses the swap partition, the performance is unacceptable
With 32GB RAM uses about 50% (peaks at 54%), no swap

Have you found https://docs.netgate.com/pfsense/en/latest/hardware/tune-zfs.html?

"The default maximum ARC size (vfs.zfs.arc.max) is automatic (0) and uses 1/2 RAM or the total RAM minus 1GB, whichever is greater."
(but also it's supposed to give it up on its own)

No, I haven't. The strange thing (for me) is the low usage when the machine has 64GB. It's not consistent with the other two cases.
Sorry for continuing a discussion here that is not relevant to the topic.

jowe78

I seem to have solved my problems by going from 8GB to 16GB RAM. Usage topped out at around 65% with 16GB, and in percentage terms it wasn't much higher with 8GB. (After going back to Suricata, memory usage is around 30% again.)

Before, some apps such as ovpn and adguard would hang strangely.

I had just replaced the hardware. My old system with 16GB RAM was using about 30%, so I thought that 8GB would be fine. But apparently not.

Thanks!

chrysmon

@jowe78 Just to be clear: in my case Suricata stopped crashing after setting the Pattern Matcher to AC-BS. It may still be too early to conclude, but before (with any other algorithm) it had crashes. I will let it run one more day and then set it to Hyperscan again.

jowe78

@chrysmon said in Suricata process dying due to hyperscan problem:

in my case suricata stopped crashing after setting the Pattern Matcher to AC-BS

I changed back, so I'm still using Hyperscan (or auto), just with more RAM.