Suricata process dying due to hyperscan problem
-
@kiokoman said in Suricata process dying due to hyperscan problem:
__sysctl("kern.version",2,0xfb9245edfba,0x820d5f640,0x0,0) ERR#12 'Cannot allocate memory'
How much RAM is configured for the virtual machine? That error says Suricata ran out of memory.
One change made in the Suricata 7.x series was to substantially increase the memory sizes for the TCP stream_memcap and a few related parameters. That was done to match up with the new defaults from upstream and to prevent startup failures on multi-core machines. The more CPU cores you have, the more TCP stream_memcap memory you need to allocate.
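If you want to see what values the package actually generated, a rough check from a shell is to grep the per-interface YAML files (the directory layout under /usr/local/etc/suricata/ shown here is an assumption; adjust it for your install):
# show the memcap values in each interface's generated configuration
grep -E "memcap" /usr/local/etc/suricata/suricata_*/suricata.yaml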
-
-
I just found this topic after creating my own topic on the same issue (searches for some reason didn't find this result before I submitted).
I'm having the exact same issue on 2 interfaces with Suricata blocking in Legacy Mode. Both interfaces show the same message, one with blocks and one without, before they both halt with that error.
I'm running a 7100 with 23.09 and Suricata updated to 7.0.2, but it was occurring on the previous version of Suricata. I just checked and when I updated to Suricata 7.0.2, it still shows hyperscan 5.4.0. Looks like I'll just have to wait for an update to get pushed out for 5.4.2 to resolve this problem.
Are there any temporary config solutions to get these interfaces back online for monitoring? I'm guessing it is just specific rules that are triggering these to fail. I can go back through the suricata.log files for those 2 interfaces. There were some errors on specific rules, so if I disable each of those, might that do the trick for now? One of these 2 interfaces is my WAN, so a little more important than the others.
I'm going to give the AC pattern matcher a try on these 2 interfaces and see if it still crashes. Hopefully that will be a workable solution for the meantime.
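In case it helps anyone else checking their versions, something along these lines from a shell should show both the installed library and what the Suricata binary reports (the package name here is an assumption and may differ on your system):
pkg info -x hyperscan                      # installed HyperScan library package, if any
suricata --build-info | grep -i hyperscan  # what the Suricata binary was built with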
-
@sgnoc said in Suricata process dying due to hyperscan problem:
There were some errors on specific rules, so if I disable each of those, might that do the trick for now?
Unlikely. If you are running any Snort VRT rules with Suricata, then several hundred of those rules (around 700 last time I checked) are incompatible with Suricata and will fail to load. That's because Snort and Suricata are not 100% compatible in their rule syntax. When it encounters such a rule, Suricata will log an error and ignore that rule.
@sgnoc said in Suricata process dying due to hyperscan problem:
Are there any temporary config solutions to get these interfaces back online for monitoring?
Changing the MPM pattern matcher algorithm to something other than Auto or HS should work. If the HyperScan library is present for your system, then the Auto setting will always choose HyperScan. The HS setting forces the selection of HyperScan (if it is available, but remember HyperScan is an Intel creation and only runs on AMD64 CPUs, so not on ARM hardware).
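If you want to confirm which matcher each interface instance actually ended up with after changing the setting, you can grep the generated YAML from a shell (the per-interface directory layout is an assumption; adjust it for your install):
# mpm-algo / spm-algo should read "ac" after forcing the AC pattern matcher
grep -E "mpm-algo|spm-algo" /usr/local/etc/suricata/suricata_*/suricata.yaml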
-
-
@bmeeks I switched the 2 interfaces with the hyperscan fault to the AC pattern matcher last night, and it has run all night without failing. Everything is still running fine. I'll continue with the AC pattern matcher for now.
Are there any repercussions to using AC instead of hyperscan? I'm guessing hyperscan is better for resource usage and performance, since it is the default chosen when available?
One other note: I know you mentioned that the upstream Suricata developer team considers hyperscan 5.4.0 to be fine, but that is the version on my system and I'm definitely getting the hyperscan error on 2 interfaces. I'm hopeful that their 5.4.2 version will be the fix, because even if they believe 5.4.0 is not affected, it is failing on my system with that version.
Thanks for all your hard work! If I can do anything to help from my end, I would be happy to try. I'm running the XG-7100 netgate hardware.
-
@bmeeks My WAN interface has halted again, but I don't see a log where it failed. These are the last few lines of suricata.log:
[101805 - Suricata-Main] 2023-11-23 18:32:40 Warning: detect-flowbits: flowbit 'file.pdf&file.ttf' is checked but not set. Checked in 28585 and 1 other sigs
[101805 - Suricata-Main] 2023-11-23 18:32:40 Warning: detect-flowbits: flowbit 'file.xls&file.ole' is checked but not set. Checked in 30990 and 1 other sigs
[101805 - Suricata-Main] 2023-11-23 18:32:40 Warning: detect-flowbits: flowbit 'ET.gadu.loginsent' is checked but not set. Checked in 2008299 and 0 other sigs
[101805 - Suricata-Main] 2023-11-23 18:32:40 Warning: detect-flowbits: flowbit 'file.onenote' is checked but not set. Checked in 61666 and 1 other sigs
[101805 - Suricata-Main] 2023-11-23 18:33:37 Notice: detect: rule reload complete
This triggered right after the Emerging Threats rules updated and the interface rules reloaded. The pid file is still listed in /var/run, which I guess makes sense since the process halted. There are no processes running for suricata on that interface when I check with "ps aux".
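For reference, these are roughly the commands I used to check (the pid file name is an assumption based on my WAN interface):
ls -l /var/run/suricata_ix0*.pid   # stale pid file still present
pgrep -lf suricata                 # nothing returned for that interface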
I was able to find this in the system logs:
2023-11-23 19:14:37.481228-05:00 kernel - pid 4814 (suricata), jid 0, uid 0: exited on signal 10 (core dumped)
2023-11-23 18:33:08.993276-05:00 php-cgi 81475 [Suricata] The Rules update has finished.
... Other interfaces reloading
2023-11-23 18:30:30.406289-05:00 php-cgi 81475 [Suricata] Suricata signalled with SIGUSR2 for 00_WAN (ix0)...
2023-11-23 18:30:30.400230-05:00 php-cgi 81475 [Suricata] Live-Reload of rules from auto-update is enabled...
2023-11-23 18:30:28.809617-05:00 php-cgi 81475 [Suricata] Building new sid-msg.map file for 00_WAN...
2023-11-23 18:30:28.555533-05:00 php-cgi 81475 [Suricata] Enabling any flowbit-required rules for: 00_WAN...
2023-11-23 18:30:17.944583-05:00 php-cgi 81475 [Suricata] Updating rules configuration for: 00_WAN ...
2023-11-23 18:30:17.493458-05:00 php-cgi 81475 [Suricata] Snort GPLv2 Community Rules are up to date...
2023-11-23 18:30:17.318869-05:00 php-cgi 81475 [Suricata] Snort VRT rules are up to date...
2023-11-23 18:30:17.081341-05:00 php-cgi 81475 [Suricata] Emerging Threats Pro rules file update downloaded successfully.
2023-11-23 18:30:16.694776-05:00 php-cgi 81475 [Suricata] There is a new set of Emerging Threats Pro rules posted. Downloading etpro.rules.tar.gz...
What troubleshooting options do I have? Is this still related to the same problem, or is it a separate problem from the hyperscan issue that I should move to my own topic?
-
@bmeeks
8 vCPU
16 GB RAM
Increasing the stream memory cap up to 2,147,483,648 didn't help.
-
@sgnoc said in Suricata process dying due to hyperscan problem:
2023-11-23 19:14:37.481228-05:00 kernel - pid 4814 (suricata), jid 0, uid 0: exited on signal 10 (core dumped)
Signal 10 is a bus error normally associated with ARM-based hardware. What kind of machine are you running Suricata on? The Signal 10 error is more commonly associated with a non-aligned memory access, and that really can't happen on anything but ARM hardware these days.
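If you want to double-check what you are running on, from a shell:
uname -mp   # hardware platform and processor architecture, e.g. amd64
kill -l     # lists signal names in order; on FreeBSD, number 10 is BUS (bus error) and 11 is SEGV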
-
@bmeeks I'm using a Netgate XG-7100-U, which has an Intel x64 processor, and I have 24 GB of RAM installed. It's only the one Suricata interface that has had that error, so I wouldn't think it's failing memory; if it were, other services should be having issues too, right?
-
@sgnoc said in Suricata process dying due to hyperscan problem:
This triggered right after the Emerging Threats rules updated and the interface rules reloaded.
By my calculations using the log timestamps, Suricata finished the rules update and ran for 41 minutes before crashing, so "right after the rules update" is not entirely correct.
2023-11-23 19:14:37.481228-05:00 kernel - pid 4814 (suricata), jid 0, uid 0: exited on signal 10 (core dumped)
2023-11-23 18:33:08.993276-05:00 php-cgi 81475 [Suricata] The Rules update has finished.
Rules update completed at 18:33:08. That crash happened at 19:14:37, or 41 minutes later.
Other helpful information the next time this happens would be the content of the
suricata.log
file around the same time interval. You would need to capture that log BEFORE you restarted Suricata, because that log is wiped clean each time Suricata is started or restarted in the GUI.
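A quick way to preserve it before restarting is to copy it off from a shell (the log directory name here is an assumption based on the WAN being ix0; adjust it for your interface):
cp /var/log/suricata/suricata_ix0*/suricata.log /root/suricata_wan_crash.log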
-
@bmeeks Those are the only logs available in the suricata.log file; the rule reload notice was the last entry logged before the crash. There was nothing else before the core dump other than the rules reloading. It has not yet occurred again, so hopefully it is an isolated incident and won't occur again.
-
For those of you having the Signal 11 or Signal 10 crashes, it would perhaps be useful if you can submit the core dump backtrace.
The command to execute at a shell prompt is:
gdb /usr/local/bin/suricata /root/suricata.core
Then execute these commands within the gdb prompt:
(gdb) bt
(gdb) bt full
(gdb) info threads
(gdb) thread apply all bt
(gdb) thread apply all bt full
Capture the output of those commands and post it back here.
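If you would rather capture all of that non-interactively, something like this should also work and write everything to a file (a sketch, assuming the core file is at /root/suricata.core):
gdb -batch -ex "bt" -ex "bt full" -ex "info threads" \
    -ex "thread apply all bt" -ex "thread apply all bt full" \
    /usr/local/bin/suricata /root/suricata.core > /root/suricata_backtrace.txt 2>&1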
-
@sgnoc said in Suricata process dying due to hyperscan problem:
It has not yet occurred again, so hopefully it is an isolated incident and won't occur again.
No, I don't think that is a true statement. It should never have occurred in the first place. The fact it did indicates there is a problem, and so it will happen again. It's only the "when" that is unknown.
-
@bmeeks I know it isn't likely, but I can still be hopeful. I'll run the core dump commands on the next crash so I can provide the backtraces. Thanks for your help!
-
After the last error I decided to uninstall everything and reconfigure from scratch; maybe some configuration didn't migrate correctly.
Now I'm unable to reproduce the error at startup.
-
@kiokoman said in Suricata process dying due to hyperscan problem:
After the last error I decided to uninstall everything and reconfigure from scratch; maybe some configuration didn't migrate correctly.
Now I'm unable to reproduce the error at startup.
This has been the experience of a few other users as well, going all the way back to the original release of Suricata 7.x in pfSense. That's what makes this such a maddeningly difficult thing to debug.
-
-
I am continuing to look into this issue. I just sent a new batch of emails to the Suricata development team with questions about some recent changes in this area of the Suricata binary's code.
It would still be nice if I could reliably reproduce this on my test machines with a debug image running.
-
Attention Users hitting the Suricata Hyperscan problem (or other mysterious Suricata stoppages):
To help in pinning down what this problem is, please collect the following information for me when you experience the crash and include it in your post or feedback.
-
Are you seeing a Signal 11 or Signal 10 fault logged in the pfSense system log (under STATUS > SYSTEM LOGS) around the time Suricata crashed? If so, include those log entries in your report.
-
Before attempting to restart Suricata after finding it stopped or crashed, examine the
suricata.log
for the interface under the LOGS VIEW tab in the Suricata GUI. Look for any errors mentioning "hyperscan" and include those in your report.
I am trying to determine if a Signal 11 or Signal 10 happens each time Suricata crashes, or if Suricata is sometimes just stopping on its own when it encounters an internal hyperscan error.
Please provide the information requested above when posting about this issue. It is not helpful at all to simply create a reply saying "I'm having this problem, too" with no additional helpful information.
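A quick way to gather both pieces of information before restarting anything is from a shell (the paths here are assumptions; adjust the interface log directories for your setup):
# hyperscan-related errors from each interface's suricata.log
grep -i hyperscan /var/log/suricata/suricata_*/suricata.log
# Signal 10 / Signal 11 exits recorded in the pfSense system log
grep "exited on signal" /var/log/system.log | grep suricata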
And at this time there is no indication at all the hyperscan crash issue is related to the Legacy Blocking Mode bug shared with Snort. That bug has, I'm fairly confident, been fixed. I think the issue in this thread is something different.
-
-
@kiokoman Do you have a backup config 1) from before upgrading, 2) that wasn't working, and 3) after rebuilding? It might be interesting to compare the Suricata section to see if anything is different across those.
(I usually save one just before upgrading and immediately after)
-
@SteveITS
I have the backup history. The only difference after reconfiguration was:
Old (not working) config:
<stream_bypass>off</stream_bypass>
<stream_drop_invalid>off</stream_drop_invalid>
vs. new config:
<stream_bypass>no</stream_bypass>
<stream_drop_invalid>no</stream_drop_invalid>
Everything else is the same, but I don't have the old generated suricata.yaml.
Anyway, I have a new problem now. Before, it was not even starting; now I get this after some hours, and only on one interface (I have Suricata running on WAN and LAN; the WAN (vmx1) is still running OK):
[843086 - W#01-vmx2] 2023-11-25 12:03:58 Info: pcap: vmx2: running in 'auto' checksum mode. Detection of interface state will require 1000 packets
[843086 - W#01-vmx2] 2023-11-25 12:03:58 Info: pcap: vmx2: snaplen set to 1518
[100515 - Suricata-Main] 2023-11-25 12:03:58 Notice: threads: Threads created -> W: 1 FM: 1 FR: 1 Engine started.
[843086 - W#01-vmx2] 2023-11-25 12:03:59 Info: checksum: No packets with invalid checksum, assuming checksum offloading is NOT used
[843086 - W#01-vmx2] 2023-11-25 12:05:02 Error: spm-hs: Hyperscan returned fatal error -1.
-
@kiokoman said in Suricata process dying due to hyperscan problem:
@SteveITS
I have the backup history. The only difference after reconfiguration was:
Old (not working) config:
<stream_bypass>off</stream_bypass>
<stream_drop_invalid>off</stream_drop_invalid>
vs. new config:
<stream_bypass>no</stream_bypass>
<stream_drop_invalid>no</stream_drop_invalid>
Everything else is the same, but I don't have the old generated suricata.yaml.
Anyway, I have a new problem now. Before, it was not even starting; now I get this after some hours, and only on one interface (I have Suricata running on WAN and LAN; the WAN (vmx1) is still running OK):
[843086 - W#01-vmx2] 2023-11-25 12:03:58 Info: pcap: vmx2: running in 'auto' checksum mode. Detection of interface state will require 1000 packets
[843086 - W#01-vmx2] 2023-11-25 12:03:58 Info: pcap: vmx2: snaplen set to 1518
[100515 - Suricata-Main] 2023-11-25 12:03:58 Notice: threads: Threads created -> W: 1 FM: 1 FR: 1 Engine started.
[843086 - W#01-vmx2] 2023-11-25 12:03:59 Info: checksum: No packets with invalid checksum, assuming checksum offloading is NOT used
[843086 - W#01-vmx2] 2023-11-25 12:05:02 Error: spm-hs: Hyperscan returned fatal error -1.
Those small differences in Boolean values from the
config.xml
file would not be a factor here. Something is most likely wrong within the Suricata binary itself, but I don't know where, nor do I know that is absolutely true.
I've had a virtual machine running for 36 hours, with every single ET Open rule enabled and the Snort IPS Connectivity Policy enabled, and have not seen a crash yet. So, this is a strange problem. To positively identify it is going to require being able to reproduce it easily. Then a debugging version of Suricata can be executed and the precise failure point identified. But so far I cannot reproduce the problem. And even in @kiokoman's case, the problem disappeared for a time and then recurred later under different circumstances (running versus starting up).
There were some upstream changes in the HyperScan portions of the Suricata code starting with version 7.0.1. Those were made to work around problems introduced by a behavior change Intel made in the HyperScan library itself. I've been communicating with the Suricata developer team, and they are pretty confident the fixes they made are sufficient. Nobody on Linux seems to be having a problem, and the vast majority of Suricata users are on Linux derivatives. Very few users are on FreeBSD, mostly just the pfSense and OPNsense users. I'm not seeing this problem reported on the OPNsense forum, but they are still running the 6.0.x branch of Suricata and not the new 7.x branch.