Suricata process dying due to hyperscan problem

tylerevers

Reconfirm Hyperscan Still Crashes

Block Offenders = On
Signature Group Header MPM Context = Auto
Pattern Matcher Algorithm = Auto

Interface failed with error:

[101378 - W#07] 2023-11-29 12:54:32 Error: spm-hs: Hyperscan returned fatal error -1.

Test with Block Offenders Off
Block Offenders = Off
Signature Group Header MPM Context = Auto
Pattern Matcher Algorithm = Auto

It has been three hours without a crash.
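
For anyone who wants to watch a log for this failure automatically, here is a minimal Python sketch. It assumes the line layout seen in the error above (`[thread id - thread name] date time Level: module: message`), which is inferred only from that one quoted line. The regex and the function name `parse_hyperscan_fatal` are illustrative and not part of Suricata.

```python
import re
from typing import Optional

# Assumed layout, inferred only from the log line quoted in this post:
# [<tid> - <thread name>] <YYYY-MM-DD> <HH:MM:SS> <Level>: <module>: <message>
LOG_LINE = re.compile(
    r"^\[(?P<tid>\d+) - (?P<thread>[^\]]+)\] "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+): (?P<module>[\w-]+): (?P<message>.*)$"
)


def parse_hyperscan_fatal(line: str) -> Optional[dict]:
    """Return the parsed fields if the line is a Hyperscan fatal error, else None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # line does not follow the assumed layout
    fields = match.groupdict()
    is_fatal = (
        fields["level"] == "Error"
        and fields["module"] == "spm-hs"
        and "Hyperscan returned fatal error" in fields["message"]
    )
    return fields if is_fatal else None


sample = "[101378 - W#07] 2023-11-29 12:54:32 Error: spm-hs: Hyperscan returned fatal error -1."
hit = parse_hyperscan_fatal(sample)
if hit is not None:
    print(hit["thread"], hit["timestamp"])
```

Feeding each line of an interface's log through this function would show which worker thread hit the error and when, which makes it easier to compare runs with blocking on and off.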

bmeeks

@asdjklfjkdslfdsaklj said in Suricata process dying due to hyperscan problem:

Both ran for a few hours, and eventually LAN 1 died (same hyperscan error), while LAN 2 remains up.

Okay, now swap the blocking mode around. Disable blocking on LAN 1 and Enable blocking on LAN 2. Let's see if the hyperscan error moves over to LAN 2 and it now crashes while LAN 1 remains stable.

If the problem does not move to LAN 2, then that would tend to take blocking mode out of the picture, unless it takes blocking in combination with something else to trigger the hyperscan crash.

bmeeks

@tylerevers said in Suricata process dying due to hyperscan problem:

@bmeeks

Reconfirm Hyperscan Still Crashes

Block Offenders = On
Signature Group Header MPM Context = Auto
Pattern Matcher Algorithm = Auto

Interface failed with error:

[101378 - W#07] 2023-11-29 12:54:32 Error: spm-hs: Hyperscan returned fatal error -1.

Test with Block Offenders Off
Block Offenders = Off
Signature Group Header MPM Context = Auto
Pattern Matcher Algorithm = Auto

It has been three hours without a crash.

How long does it typically take to crash? Is three hours of runtime quite a bit longer than you were getting with blocking enabled?

tylerevers

@bmeeks said in Suricata process dying due to hyperscan problem:

How long does it typically take to crash? Is three hours of runtime quite a bit longer than you were getting with blocking enabled?

Yes, three hours is in the realm of 3-8x longer (and it still hasn't crashed yet ~9 hours total).

bmeeks

@tylerevers said in Suricata process dying due to hyperscan problem:

Yes, three hours is in the realm of 3-8x longer (and it still hasn't crashed yet ~9 hours total).

Well, now I need to figure out how in the world the custom blocking module code could possibly interact with the Hyperscan library.

It makes no sense as they are not even remotely related.

chrysmon

@bmeeks Can confirm that in IDS mode (no blocking) suricata has no crashes. In IPS mode it crashes. Hyperscan, no VLANS.

asdjklfjkdslfdsaklj

@bmeeks swapped, same result. Instance on interface w/blocking disabled remains up, other died.

bmeeks

@asdjklfjkdslfdsaklj said in Suricata process dying due to hyperscan problem:

@bmeeks swapped, same result. Instance on interface w/blocking disabled remains up, other died.

Thank you. This is very helpful. It tells me that somehow the custom blocking module is part of the issue.

I will need to dig into the code and see if something pops out. It will be a few days, though, before I can generate debug versions of the package because the ESXi host that contained all my pfSense package builders and private testing repo crashed and burned last Sunday morning due to a power blip and my UPS failing at the same time. Something is weird with the UPS. It shows the battery as good, but if power blips it drops the load. I will need to get a new one. I've started the process of rebuilding my test environment on that host, but it's going to take a few days. Also have some other non-related obligations over the next 4 days that interfere with the effort.

tylerevers

@bmeeks said in Suricata process dying due to hyperscan problem:

Thank you. This is very helpful. It tells me that somehow the custom blocking module is part of the issue.

I will need to dig into the code and see if something pops out. It will be a few days, though, before I can generate debug versions of the package because the ESXi host that contained all my pfSense package builders and private testing repo crashed and burned last Sunday morning due to a power blip and my UPS failing at the same time. Something is weird with the UPS. It shows the battery as good, but if power blips it drops the load. I will need to get a new one. I've started the process of rebuilding my test environment on that host, but it's going to take a few days. Also have some other non-related obligations over the next 4 days that interfere with the effort.

Godspeed to you, sir. Best wishes in all things.

SteveITS

@bmeeks said in Suricata process dying due to hyperscan problem:

battery as good, but if power blips it drops the load

FWIW we see that a lot on older batteries, or I suppose defective ones. In our experience the UPS "self test" works to proactively alert the majority of the time, but a decent amount of the time the self test itself will trigger a power failure because the battery can't handle the load for the 2 seconds. :( And by "older" I mean over 4-5 years.

bmeeks

@SteveITS said in Suricata process dying due to hyperscan problem:

FWIW we see that a lot on older batteries, or I suppose defective ones. In our experience the UPS "self test" works to proactively alert the majority of the time, but a decent amount of the time the self test itself will trigger a power failure because the battery can't handle the load for the 2 seconds. :( And by "older" I mean over 4-5 years.

I suspect a defective battery at some level. It is a Tripp-Lite. My favorite is APC, and I think that's what I will go back with.

chrysmon

Again, I want to mention that Suricata works fine (on my system at least) in IPS mode with the AC-BS pattern matcher instead of the default (Hyperscan). This may help the developers find the bug and help users stay protected in the meantime.

ajohnson353

@chrysmon I am seeing the same thing in AC mode. It has yet to die since making the switch.

chrysmon

@ajohnson353 said in Suricata process dying due to hyperscan problem:

@chrysmon I am seeing the same thing in AC mode. It has yet to die since making the switch.

If I remember correctly, mine did not work in AC mode. Let it run for a longer time to be sure.

bmeeks

Wonder if this might be the source of the mysterious Hyperscan bug we are seeing in Suricata?

https://www.freebsd.org/security/advisories/FreeBSD-EN-23:15.sanitizer.asc

If so, that would explain a lot of the weirdness. I will keep tabs on this. Thanks to @RobbieTT for the link in another thread unrelated to Suricata.

masons

@bmeeks said in Suricata process dying due to hyperscan problem:

Wonder if this might be the source of the mysterious Hyperscan bug we are seeing in Suricata?

https://www.freebsd.org/security/advisories/FreeBSD-EN-23:15.sanitizer.asc

@bmeeks,

The two machines I posted about earlier are both running with the default hyperscan enabled and with legacy blocking mode enabled. Neither machine has experienced a Suricata core dump since I disabled ASLR for the Suricata binary. Thus it seems increasingly plausible that the root of the issue is linked to ASLR, and the link above about the LLVM sanitizer could certainly explain why this has suddenly happened.

kiokoman

@bmeeks
proccontrol -m aslr -s disable /usr/local/bin/suricata -i vmx2 -D -c /usr/local/etc/suricata/suricata_28559_vmx2/suricata.yaml --pidfile /var/run/suricata_vmx228559.pid

[1022863 - RX#01-vmx2] 2023-12-03 20:36:32 Info: pcap: vmx2: snaplen set to 1518
[600532 - Suricata-Main] 2023-12-03 20:36:32 Notice: threads: Threads created -> RX: 1 W: 8 FM: 1 FR: 1 Engine started.
[1022863 - RX#01-vmx2] 2023-12-03 20:36:35 Info: checksum: No packets with invalid checksum, assuming checksum offloading is NOT used
[1022865 - W#02] 2023-12-03 20:36:41 Error: spm-hs: Hyperscan returned fatal error -1.
[1022866 - W#03] 2023-12-03 20:36:41 Error: spm-hs: Hyperscan returned fatal error -1.

no luck for me
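
To put a number on how quickly the failure appeared in the log above, here is a rough Python sketch that measures the time from the "Engine started" notice to the first Hyperscan fatal error. It assumes each line carries a `YYYY-MM-DD HH:MM:SS` timestamp as shown in the log. The function name `seconds_to_first_fatal` is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp format assumed from the log lines quoted in this post.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def seconds_to_first_fatal(log_lines):
    """Seconds between 'Engine started' and the first Hyperscan fatal error, or None."""
    started = None
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a recognizable timestamp
        when = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
        if started is None and "Engine started" in line:
            started = when
        elif started is not None and "Hyperscan returned fatal error" in line:
            return (when - started).total_seconds()
    return None


log = [
    "[1022863 - RX#01-vmx2] 2023-12-03 20:36:32 Info: pcap: vmx2: snaplen set to 1518",
    "[600532 - Suricata-Main] 2023-12-03 20:36:32 Notice: threads: Threads created -> RX: 1 W: 8 FM: 1 FR: 1 Engine started.",
    "[1022863 - RX#01-vmx2] 2023-12-03 20:36:35 Info: checksum: No packets with invalid checksum, assuming checksum offloading is NOT used",
    "[1022865 - W#02] 2023-12-03 20:36:41 Error: spm-hs: Hyperscan returned fatal error -1.",
    "[1022866 - W#03] 2023-12-03 20:36:41 Error: spm-hs: Hyperscan returned fatal error -1.",
]

print(seconds_to_first_fatal(log))  # 9.0
```

For this log the gap is 9 seconds, so with ASLR disabled via `proccontrol` the error still showed up almost immediately after the engine started.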

sgnoc

I tried to disable the ASLR on my system to test, but it caused the whole system to become unresponsive and I had to do a forced power cycle and revert back. Not sure what happened, since the logs only show suricata coming back online on the interfaces and then no logs until the reboot. I shut down the suricata processes before running "elfctl -e +noaslr /usr/local/bin/suricata" and then restarted the suricata interfaces. So no luck on my end testing with disabled ASLR.

SteveITS

@bmeeks I noticed this in another thread:

@jimp said in Major DNS Bug 23.01 with Quad9 on SSL:

While we are likely to include the patch from that EN in future builds it isn't relevant to Unbound.

They only use those sanitizers for debug/test builds and not for normal/production builds.

JonathanLee

Hello fellow Netgate community members,

I recently learned that the maintainer of Snort and Suricata does all the work you see here unpaid. I opened a ticket to add a Wikipedia-style "donate to maintainer" button to all third-party packages. I personally want to send some money to maintainers. If you feel the same way, please respond to this Redmine.

https://redmine.pfsense.org/issues/15056

Happy Holidays.