Suricata process dying due to hyperscan problem
-
Environment
- Running pfSense CE 2.7.2 with Suricata plugin 7.0.2_2 and Suricata package 7.0.2_5
- No changes to Suricata ASLR
- Pattern matcher = Auto
- Legacy blocking mode = Enabled
- Multiple VLANs on the LAN interface, but Suricata is only running on a single VLAN (interface is called PC)
Reproducing the issue (with logs)
When I start the Suricata service the WAN interface starts and continues to run without issue, but the PC interface dies immediately. I do not see the Hyperscan error in the Suricata logs. This is 100% reproducible with this VM.
WAN (works) Suricata log - https://pastebin.com/qRRa2P48
PC (crashes) Suricata log - https://pastebin.com/FNcRQnhU
System log excerpt showing that the PC Suricata instance dumps core.
Dec 12 10:05:51 kernel pid 10455 (suricata), jid 0, uid 0: exited on signal 11 (core dumped)
Dec 12 10:05:50 php 3903 [Suricata] Suricata START for PC(vtnet0.700)...
Dec 12 10:05:50 php 3903 [Suricata] Building new sid-msg.map file for PC...
Dec 12 10:05:50 php 3903 [Suricata] Enabling any flowbit-required rules for: PC...
Dec 12 10:05:50 php 3903 [Suricata] Updating rules configuration for: PC ...
Dec 12 10:05:49 php 3903 [Suricata] Building new sid-msg.map file for WAN...
Dec 12 10:05:49 php 3903 [Suricata] Enabling any flowbit-required rules for: WAN...
Dec 12 10:05:49 php 3903 [Suricata] Updating rules configuration for: WAN ...
Dec 12 10:05:49 php-fpm 13080 Starting Suricata on PC(vtnet0.700) per user request...
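To check how often this happens, the kernel message can be picked out of a saved copy of the system log. This is only a minimal Python sketch, assuming the log has been exported as plain text with one entry per line in the format shown above. The function name `find_suricata_segfaults` is made up for illustration and is not part of pfSense or Suricata.

```python
import re

# Matches kernel lines such as:
#   Dec 12 10:05:51 kernel pid 10455 (suricata), jid 0, uid 0: exited on signal 11 (core dumped)
CRASH_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]{8})\s+kernel\s+pid\s+(?P<pid>\d+)\s+"
    r"\(suricata\).*exited on signal (?P<signal>\d+)"
)

def find_suricata_segfaults(lines):
    """Return (timestamp, pid) for every Suricata exit on signal 11."""
    hits = []
    for line in lines:
        m = CRASH_RE.match(line.strip())
        if m and m.group("signal") == "11":
            hits.append((m.group("when"), int(m.group("pid"))))
    return hits

# Two lines copied from the excerpt above; only the first is a crash.
sample = [
    "Dec 12 10:05:51 kernel pid 10455 (suricata), jid 0, uid 0: exited on signal 11 (core dumped)",
    "Dec 12 10:05:50 php 3903 [Suricata] Suricata START for PC(vtnet0.700)...",
]
print(find_suricata_segfaults(sample))
```

Running it over the log before and after the ASLR change gives a quick count of crashes per restart instead of eyeballing the log.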
Workaround
This workaround does not require changing the pattern-matcher or disabling the legacy blocking mode. It works consistently across multiple hosts.
- Stop the Suricata service
- Go to Diagnostics --> Command Prompt
- Execute
elfctl -e +noaslr /usr/local/bin/suricata
- Start the Suricata service
In my case both interfaces start and continue to run without further crashes.
If you compare the failing PC interface suricata.log file with the working suricata.log file you can see where the process dumps core
PC (crashes) Suricata log - https://pastebin.com/FNcRQnhU
PC (working) Suricata log - https://pastebin.com/AE469T7m
The crashing instance fails immediately after attempting to parse a rule that it doesn't like. The working instance still sees that error, but continues to run.
This system log excerpt shows that both interfaces start correctly
Dec 12 10:58:05 kernel vtnet0.700: promiscuous mode enabled
Dec 12 10:58:05 kernel vtnet0: promiscuous mode enabled
Dec 12 10:58:00 kernel vtnet1: promiscuous mode enabled
Dec 12 10:57:36 SuricataStartup 66406 Suricata START for PC(23822_vtnet0.700)...
Dec 12 10:57:35 SuricataStartup 65014 Suricata START for WAN(65037_vtnet1)...
Dec 12 10:57:08 SuricataStartup 98203 Suricata STOP for PC(23822_vtnet0.700)...
Next steps
I'm going to try removing the failing rule and then try starting up Suricata without the ASLR mitigation. I'll report back what I find.
-
@masons said in Suricata process dying due to hyperscan problem:
Environment
- Running pfSense CE 2.7.2 with Suricata plugin 7.0.2_2 and Suricata package 7.0.2_5
- No changes to Suricata ASLR
- Pattern matcher = Auto
- Legacy blocking mode = Enabled
- Multiple VLANs on the LAN interface, but Suricata is only running on a single VLAN (interface is called PC)
Reproducing the issue (with logs)
When I start the Suricata service the WAN interface starts and continues to run without issue, but the PC interface dies immediately. I do not see the Hyperscan error in the Suricata logs. This is 100% reproducible with this VM.
WAN (works) Suricata log - https://pastebin.com/qRRa2P48
PC (crashes) Suricata log - https://pastebin.com/FNcRQnhU
System log excerpt showing that the PC Suricata instance dumps core.
Dec 12 10:05:51 kernel pid 10455 (suricata), jid 0, uid 0: exited on signal 11 (core dumped)
Dec 12 10:05:50 php 3903 [Suricata] Suricata START for PC(vtnet0.700)...
Dec 12 10:05:50 php 3903 [Suricata] Building new sid-msg.map file for PC...
Dec 12 10:05:50 php 3903 [Suricata] Enabling any flowbit-required rules for: PC...
Dec 12 10:05:50 php 3903 [Suricata] Updating rules configuration for: PC ...
Dec 12 10:05:49 php 3903 [Suricata] Building new sid-msg.map file for WAN...
Dec 12 10:05:49 php 3903 [Suricata] Enabling any flowbit-required rules for: WAN...
Dec 12 10:05:49 php 3903 [Suricata] Updating rules configuration for: WAN ...
Dec 12 10:05:49 php-fpm 13080 Starting Suricata on PC(vtnet0.700) per user request...
Workaround
This workaround does not require changing the pattern-matcher or disabling the legacy blocking mode. It works consistently across multiple hosts.
- Stop the Suricata service
- Go to Diagnostics --> Command Prompt
- Execute
elfctl -e +noaslr /usr/local/bin/suricata
- Start the Suricata service
In my case both interfaces start and continue to run without further crashes.
If you compare the failing PC interface suricata.log file with the working suricata.log file you can see where the process dumps core
PC (crashes) Suricata log - https://pastebin.com/FNcRQnhU
PC (working) Suricata log - https://pastebin.com/AE469T7m
The crashing instance fails immediately after attempting to parse a rule that it doesn't like. The working instance still sees that error, but continues to run.
This system log excerpt shows that both interfaces start correctly
Dec 12 10:58:05 kernel vtnet0.700: promiscuous mode enabled
Dec 12 10:58:05 kernel vtnet0: promiscuous mode enabled
Dec 12 10:58:00 kernel vtnet1: promiscuous mode enabled
Dec 12 10:57:36 SuricataStartup 66406 Suricata START for PC(23822_vtnet0.700)...
Dec 12 10:57:35 SuricataStartup 65014 Suricata START for WAN(65037_vtnet1)...
Dec 12 10:57:08 SuricataStartup 98203 Suricata STOP for PC(23822_vtnet0.700)...
Next steps
I'm going to try removing the failing rule and then try starting up Suricata without the ASLR mitigation. I'll report back what I find.
This is very intriguing data. Thank you for the research and posting the results. This sort of jibes with my original hypothesis that ASLR may be involved here. One of the Netgate kernel developers did not think it was, because the currently documented ASLR bug is in the address sanitizer piece of the llvm compiler, and he said that was unlikely to be used outside of debug builds. The documentation for the sanitizer says it results in about a 2x slowdown in execution.
I also now doubt the documented address sanitizer bug in llvm is the likely cause, but your testing seems to imply that ASLR is at fault in some manner with this bug. However, other users experiencing the bug have tried disabling ASLR (as you did) and did not see any change in behavior.
-
I removed the offending rule (SID 26470), removed the ASLR change and restarted Suricata. The PC interface Suricata instance immediately dumps core with Signal 11 again.
Stopping the Suricata service, making the ASLR change and restarting Suricata, results in the PC interface Suricata instance coming up and staying up.
At least for me, across several VMs, this is very consistent behavior.
-
@masons said in Suricata process dying due to hyperscan problem:
I removed the offending rule (SID 26470), removed the ASLR change and restarted Suricata. The PC interface Suricata instance immediately dumps core with Signal 11 again.
Stopping the Suricata service, making the ASLR change and restarting Suricata, results in the PC interface Suricata instance coming up and staying up.
At least for me, across several VMs, this is very consistent behavior.
I was about to test specifically with that offending rule enabled, but your test results suggest that is a moot point (meaning not the actual cause). I have no proof, but ASLR is definitely a suspect in my mind (at least for the Signal 11 segfault issue). Apparently it does little to help with the "Hyperscan returned fatal error -1" issue, though.
-
@Maltz said in Suricata process dying due to hyperscan problem:
kernel kills Suricata with a "failed to reclaim memory" error
I didn't reread the now-long thread, but did you post your memory usage with Suricata running?
ZFS is supposed to give up cache RAM but can be tuned to reduce usage:
https://docs.netgate.com/pfsense/en/latest/hardware/tune-zfs.html
"The default maximum ARC size (vfs.zfs.arc.max) is automatic (0) and uses 1/2 RAM or the total RAM minus 1GB, whichever is greater."
-
@SteveITS said in Suricata process dying due to hyperscan problem:
I didn't reread the now-long thread, but did you post your memory usage with Suricata running?
It's "28% of 3388 MiB" (4GB Netgate 2100) right now. With any algorithm other than AC-BS, RAM usage ramps up a few minutes after Suricata starts then the kernel kills it.
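For context, the default quoted from the Netgate doc can be worked out for this box. This is a rough Python sketch only, assuming the 3388 MiB figure is the total RAM the rule applies to and that "1GB" means 1024 MiB. The function name `default_arc_max_mib` is made up for illustration.

```python
def default_arc_max_mib(total_ram_mib):
    """Default ARC ceiling per the quoted doc: the greater of RAM/2 and RAM - 1 GiB."""
    return max(total_ram_mib // 2, total_ram_mib - 1024)

# 3388 MiB total: half is 1694 MiB, total minus 1 GiB is 2364 MiB.
print(default_arc_max_mib(3388))  # 2364
```

Under those assumptions the ARC would be allowed to grow to roughly 2.3 GiB of the 3.3 GiB available, which is why tuning vfs.zfs.arc.max can matter on small-memory systems.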
-
@tylerevers said in Suricata process dying due to hyperscan problem:
@bmeeks said in Suricata process dying due to hyperscan problem:
My pull request containing the anticipated fix for this Hyperscan error has been merged. An updated Suricata package has built and should appear as an available update for 2.7.2 CE and 23.09.1 Plus users.
Look for an update to version 7.0.2_2 for the Suricata package. When installed, the new package should pull in version 7.0.2_5 of the Suricata binary.
Fingers crossed this fixes the Hyperscan issue. But as I mentioned previously, since I could never reproduce the error in my small test environment, I can't say with 100% certainty the bug I found and fixed is the actual Hyperscan culprit.
Nearly 20 hours since updating to 7.0.2_2 on 23.09.1 Plus with custom bare metal setup and no Hyperscan crash yet. Pattern Match set to AUTO and Blocking Mode ENABLED. Using all VLANs that traverse a LAGG in my case just as a reminder.
Thanks, Bill!
At roughly the 28-hour mark, the Suricata Interface failed with the Hyperscan issue again.
-
@bmeeks
i have removed all the rules from an interface but the hyperscan error is still there after a few moments for me.
+noaslr is still doing nothing
any chance you can provide the dbg pkg of suricata?
-
@bmeeks
For now and maybe going forward as a perm solution can we just have the package updated to use AC-CS as the default with a note stating to avoid HyperScan for its inconsistent performance or something along those lines.
-
@kiokoman said in Suricata process dying due to hyperscan problem:
@bmeeks
i have removed all the rules from an interface but the hyperscan error is still there after a few moments for me.
+noaslr is still doing nothing
any chance you can provide the dbg pkg of suricata?
Not at the moment. I'm trying to reconstruct my package builder for the RELENG_2_7_2 branch of CE (which is the current 2.7.2 release), and that build is failing. Working with the Netgate team on that. Once I get my package builder working again, then I can build a debug package and perhaps share it.
Nothing else can happen until at least after this coming weekend as I am about to be out of town for a few days.
-
@michmoor said in Suricata process dying due to hyperscan problem:
@bmeeks
For now and maybe going forward as a perm solution can we just have the package updated to use AC-CS as the default with a note stating to avoid HyperScan for its inconsistent performance or something along those lines.
I don't see the point in changing the default if users can just simply make the change manually and save it.
And I can't work on this issue anymore until late this Sunday at the earliest as I will be away from all my computing infrastructure until then.
-
@bmeeks said in Suricata process dying due to hyperscan problem:
I don't see the point in changing the default if users can just simply make the change manually and save it.
I think changing the default would be tremendously useful for people who have no way of knowing why Suricata is crashing over a month after the pfSense update that seemingly broke it. People who haven't, or don't have the expertise to, spend hours poring over system logs, find the right log entry to google, and make their way to this thread.
-
@Maltz said in Suricata process dying due to hyperscan problem:
@bmeeks said in Suricata process dying due to hyperscan problem:
I don't see the point in changing the default if users can just simply make the change manually and save it.
I think changing the default would be tremendously useful for people who have no way of knowing why Suricata is crashing over a month after the pfSense update that seemingly broke it. People who haven't, or don't have the expertise to, spend hours poring over system logs, find the right log entry to google, and make their way to this thread.
I beg to differ, why force everybody to use some settings as workaround, in order to track down an issue? This is not a test branch. As far as I understood from the posts here, this happens only if Suricata is in Legacy Mode. For example I use Suricata in inline mode on WAN and also on LAN with multiple VLANS and I don't encounter this issue. I'm not saying that we should not attempt to fix this, but forcing all of us to use the proposed defaults is bad practice.
-
@bmeeks said in Suricata process dying due to hyperscan problem:
@kiokoman said in Suricata process dying due to hyperscan problem:
@bmeeks
i have removed all the rules from an interface but the hyperscan error is still there after a few moments for me.
+noaslr is still doing nothing
any chance you can provide the dbg pkg of suricata?
Not at the moment. I'm trying to reconstruct my package builder for the RELENG_2_7_2 branch of CE (which is the current 2.7.2 release), and that build is failing. Working with the Netgate team on that. Once I get my package builder working again, then I can build a debug package and perhaps share it.
Nothing else can happen until at least after this coming weekend as I am about to be out of town for a few days.
Well, I'm not in a hurry, I just like to solve mysteries.
-
Suricata still hangs on interfaces with higher traffic, even though I set it to use AC-KS.
It is strange that the same message appears with the hyperscan error, although all interfaces are set to AC-KS:
[104160 - W#03] 2023-12-13 16:18:07 Error: spm-hs: Hyperscan returned fatal error -1.
-
@paulp same here:
[122730 - W#04] 2023-12-12 17:20:11 Error: spm-hs: Hyperscan returned fatal error -1.
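If anyone wants to collect these lines from several interfaces and compare them, here is a small Python sketch. It assumes each suricata.log entry is on its own line in the format shown above. The function name `hyperscan_errors` is made up for illustration and is not part of Suricata. It also captures the module prefix (`spm-hs` in both reports), so it is easy to see which component logged the error.

```python
import re

# Matches suricata.log lines such as:
#   [104160 - W#03] 2023-12-13 16:18:07 Error: spm-hs: Hyperscan returned fatal error -1.
HS_RE = re.compile(
    r"^\[(?P<tid>\d+) - (?P<thread>[^\]]+)\]\s+"
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"Error: (?P<module>[\w-]+): Hyperscan returned fatal error (?P<code>-?\d+)"
)

def hyperscan_errors(lines):
    """Return (timestamp, thread, module, code) for each Hyperscan fatal error line."""
    hits = []
    for line in lines:
        m = HS_RE.match(line.strip())
        if m:
            hits.append(
                (m.group("when"), m.group("thread"), m.group("module"), int(m.group("code")))
            )
    return hits

# The two log lines reported in this thread.
sample = [
    "[104160 - W#03] 2023-12-13 16:18:07 Error: spm-hs: Hyperscan returned fatal error -1.",
    "[122730 - W#04] 2023-12-12 17:20:11 Error: spm-hs: Hyperscan returned fatal error -1.",
]
for hit in hyperscan_errors(sample):
    print(hit)
```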
-
@NRgia said in Suricata process dying due to hyperscan problem:
@Maltz said in Suricata process dying due to hyperscan problem:
@bmeeks said in Suricata process dying due to hyperscan problem:
I don't see the point in changing the default if users can just simply make the change manually and save it.
I think changing the default would be tremendously useful for people who have no way of knowing why Suricata is crashing over a month after the pfSense update that seemingly broke it. People who haven't, or don't have the expertise to, spend hours poring over system logs, find the right log entry to google, and make their way to this thread.
I beg to differ, why force everybody to use some settings as workaround, in order to track down an issue? This is not a test branch. As far as I understood from the posts here, this happens only if Suricata is in Legacy Mode. For example I use Suricata in inline mode on WAN and also on LAN with multiple VLANS and I don't encounter this issue. I'm not saying that we should not attempt to fix this, but forcing all of us to use the proposed defaults is bad practice.
Let me rephrase: Changing the logic around the default "Auto" setting would be useful. If AC-BS is the only one that works reliably in Legacy Mode (it's the only one that works for me at any rate) then that's the one "Auto" should choose.
-
@Maltz
Exactly, which is why I proposed it. As someone pointed out, this isn't the dev branch, this is the 'prod' branch, so change the default to what is working for most people instead of having them guess what the issue is, google it, and find their way to a forum thread that is already 205 posts long. Then the person needs to read all of it to conclude that Hyperscan doesn't work.
-
This issue is incredibly frustrating to pin down. It took a full day, but I just saw the Hyperscan error on one of my VMs. The ASLR change does seem to significantly improve my ability to start Suricata instances and keep them running, but it's not a viable workaround. I've switched over to AC-BS. I'll watch this for a few days and see what happens.
-
Good news! At least I hope so :)
I changed Pattern Match to AC-BS and for now I see it has been working for about 20 hours.
With the Pattern Matcher Algorithm set to Auto, AC-KS or Hyperscan, Suricata stopped after a few hours on the interfaces that had higher traffic.