Suricata 2.0.3 Package Preview

bmeeks

Yes, I found that bug with Suricata's log management routine forgetting to delete the "rotated" log files when the logging directory size exceeds the set limit. That fix is coming in the next update. It cleans up the current alert log, but does not cleanup the older rotated logs under some conditions.

The interim fix is to periodically go in and delete those files (the ones with alert.log.xxxxx where xxxxx is a timestamp).

Suricata can barf out a lot of log traffic, and running it on a Nano install is risky (in my opinion) due to the potential for exhausting disk space unexpectedly).

Bill

binaryjay

@bmeeks:

Yes, I found that bug with Suricata's log management routine forgetting to delete the "rotated" log files when the logging directory size exceeds the set limit. That fix is coming in the next update. It cleans up the current alert log, but does not cleanup the older rotated logs under some conditions.

The interim fix is to periodically go in and delete those files (the ones with alert.log.xxxxx where xxxxx is a timestamp).

Suricata can barf out a lot of log traffic, and running it on a Nano install is risky (in my opinion) due to the potential for exhausting disk space unexpectedly).

Bill

Awesome, thank you! Curious, is it an upstream problem or pfsense package specific?

bmeeks

@binaryjay:

Awesome, thank you! Curious, is it an upstream problem or pfsense package specific?

The log management bug I was referring to is pfSense-specific. It is a bug in the package GUI code that I created. My code looks for the file alert.log, but fails to then also do a pattern search for .log. so it can catch the rotated files.

It will be fixed in the next Suricata package update.

The Log Management code in the pfSense Suricata package was added to cope with the accumulation of log files that can occur. It is not a perfect solution, though. For one, the routine runs once every 5 minutes via a cron job. If you have a busy network and a very noisy rule, then a lot of megabytes of log traffic could be created in the five-minute window between runs of the cron job.

I also realized I did not answer your question about the "file_data" keyword errors. Suricata does not currently process all of the same rule keywords and options that Snort does. The "file_data" option is one of those. This is an upstream issue. There are some feature requests posted on the Suricata Redmine site asking that support be added for the new VRT keywords and options.

Bill

demco

Bill, is it possible to use clog? Or there are unintentional consequences?

Regarding the "file_data" keyword and signature parsing errors raised by binaryjay. If Suricata start up with these errors, can we assume it is working and just skipped these signatures?

Thanks, appreciate your work on Suricata and Snort.

bmeeks

@demco:

Bill, is it possible to use clog? Or there are unintentional consequences?

Regarding the "file_data" keyword and signature parsing errors raised by binaryjay. If Suricata start up with these errors, can we assume it is working and just skipped these signatures?

Thanks, appreciate your work on Suricata and Snort.

Yes, Suricata will just skip signatures it can't parse. It prints the error you see and goes to the next signature. Snort, on the other hand, will print an error and quit on signature parsing errors.

No, the logging method of Suricata is fixed by the underlying binary. It does its own thing. It would be a rather substantial rewrite/customization of the binary to have it use clog. You may already know this, but I will repeat it anyway. The Suricata and Snort packages on pfSense each consist of two separate but related parts. What you see and interact with in the GUI is simply PHP code that creates the suricata.yaml (or snort.conf) configuration files used by a separate binary process. So the GUI components simply help you create the text configuration file that is read and processed by the binary that does the actual network inspection. That binary runs as a service. The GUI code is only active during the time you have a menu page or tab open and are actually interacting with it.

Bill

Hey Bill,
As far as I can tell, the bug with the logs not getting properly removed happens with the packet captures as well. Dunno if it happens with other logs, those are the first that went over the limits.

bmeeks

@jflsakfja:

Hey Bill,
As far as I can tell, the bug with the logs not getting properly removed happens with the packet captures as well. Dunno if it happens with other logs, those are the first that went over the limits.

Yeah, it's actually with all the rotated logs. Dumb mistake I made in the routine that looks for files to remove. It will be fixed in the next update which I hope to post soon. Mulling over whether to bump the binary version to 2.0.4 as well to stay in sync with upstream releases.

Bill

No worries, most people might not even notice it (assuming they go with the default of not logging packets). That's the worst of all the non-rotating logs, since that's what eats up /var space fast.

I vote for a quick bug fix instead of keeping in line with upstream. On the other hand, if it's not too much extra work, just do the binary bump + bugs.

binaryjay

Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:


1/10/2014 -- 02:17:07 - <info>-- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice>-- Stats for 'bce1':  pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error>-- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly</error></notice></info>

BBcan177

There was another thread with a similar failure.. Does this scenario fit your issue?

https://forum.pfsense.org/index.php?topic=81533.0

binaryjay

@BBcan177:

There was another thread with a similar failure.. Does this scenario fit your issue?

https://forum.pfsense.org/index.php?topic=81533.0

Nope, I don't even have DHCPv6 enabled on any interface.

bmeeks

@binaryjay:

Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:


1/10/2014 -- 02:17:07 - <info>-- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice>-- Stats for 'bce1':  pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error>-- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly</error></notice></info>

As part of testing for an experimental feature in Snort, I've found what I think is the root cause of these random errors where the process dies. Snort has the same issue. Unfortunately the problem is not really within the two IDS packages themselves. It is caused by the way pfSense handles certain events that involve installed packages. It's a bit technical and complicated to explain, but there a number of processes within pfSense that can kick of a procedure to "start/restart all packages". While debugging/testing my theory with Snort recently I saw the Snort package get sent three separate "start" commands within a 7-second window. I had a special shell script that was logging the exact time it was called and with what argument.

It takes Snort much longer than 7 seconds to startup. So when you get multiple start commands, the shell script can wind up launching multiple processes. Two of those "start" commands happened only 2 seconds apart! I have made several attempts to workaround this in both the Snort and Suricata packages. The developers who worked on Snort before me apparently also tried some tricks because their code is still in there. Nothing works 100% when you have your startup script called repeatedly before you can even get started from the first call.

This is how you sometimes wind up with duplicate Suricata or Snort processes.

Bill

binaryjay

@bmeeks:

@binaryjay:
Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:
1/10/2014 -- 02:17:07 - <info>-- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice>-- Stats for 'bce1':  pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error>-- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly</error></notice></info> 
As part of testing for an experimental feature in Snort, I've found what I think is the root cause of these random errors where the process dies. Snort has the same issue. Unfortunately the problem is not really within the two IDS packages themselves. It is caused by the way pfSense handles certain events that involve installed packages. It's a bit technical and complicated to explain, but there a number of processes within pfSense that can kick of a procedure to "start/restart all packages". While debugging/testing my theory with Snort recently I saw the Snort package get sent three separate "start" commands within a 7-second window. I had a special shell script that was logging the exact time it was called and with what argument.

It takes Snort much longer than 7 seconds to startup. So when you get multiple start commands, the shell script can wind up launching multiple processes. Two of those "start" commands happened only 2 seconds apart! I have made several attempts to workaround this in both the Snort and Suricata packages. The developers who worked on Snort before me apparently also tried some tricks because their code is still in there. Nothing works 100% when you have your startup script called repeatedly before you can even get started from the first call.

This is how you sometimes wind up with duplicate Suricata or Snort processes.

Bill

I'm actually a professional developer, granted I am of the type that spends my days doing corporate stuff for (gasp) Windows so I am still a novice when it comes to BSD in particular but can one not export a global variable right at the start of the script that will be set to indicate suricata is starting which subsequent calls to the script can do a sanity check effectively putting in place a lock on the startup? I will not be shocked if this kind of scope is disallowed, but of course there are other options like creating a temporary file to use as a lock immediately when the script is invoked.

I assume that the issue is that you can't even get a process id early enough to lock based on finding a pid?

Anyhow I have not thought this through and it is so obvious that it must have already been attempted lol. If I remember right there is another pfSense package that will continually check if services are started or not and relaunch if they die so that I guess is a good enough workaround.

bmeeks

The problem is the amount of time between when Suricata or Snort get the initial "start" signal and when the PID file is created. So you have a window of time where the process is kicking off but not fully running. I have tried various lock file tricks, and they work in some cases but not all.

Both binaries offer a type of "warm restart" where you can send a running process a signal. The current Snort and Suricata packages attempt to use this so that if a PID file exists, and another "start" command is received, the PID file is used to signal the running process with the "warm restart" instead of "start". I've discovered that if a running process receives a "warm restart" command too early in its startup phase, it will sometimes crash. I think that is what is happening to folks posting here with randomly crashing processes. Compounding this problem is the fact that package startup times for Suricata and Snort are different on different systems due to complexity and number of enabled rules and the CPU horsepower. So what is 15 seconds on one box is 90 seconds on some others.

In defense of pfSense, it is trying to make sure running packages are immediately aware of big configuration changes like new WAN IP addresses, newly started VPNs, etc. The way used to tell packages about these changes today is to simply restart them. During a reboot is when most of these rapid-fire "start" commands are issued as the various init scripts run in succession.

I am certainly open to ideas.

Bill

How about fundamentally changing the way services are started? Instead of using a lock file to prevent a service restarting, use a lock file to indicate that the service is running as expected. The pfsense startup scripts could check for the existence of this file before attempting to start up snort/suricata. Just an idea.

binaryjay

@jflsakfja:

How about fundamentally changing the way services are started? Instead of using a lock file to prevent a service restarting, use a lock file to indicate that the service is running as expected. The pfsense startup scripts could check for the existence of this file before attempting to start up snort/suricata. Just an idea.

I think you just paraphrased the same concept.

Lock file: usually created when the process is starting up. Usually removed when the process ends.

Special file: created after the process is up and running. Removed only after a successful manual shutdown command.

Yes, as you said slightly different variations of the same concept, it's just the file's creation that matters. As far as actually working, that's not for me to decide, since I've always hated programming for a simple reason: Computers do what I tell them, not what I meant.

bmeeks

Without getting down into all the technical details, I have tried a few variations of "lock files" and "special files".

Here is a brief history of how I got back into this problem. Due to popular demand, I was altering the Snort package so that each configured interface would show up in pfSense as a distinct service. So if you ran Snort (and Barnyard2) on both the WAN and LAN, then in the SERVICES applet you would see four services for Snort listed: one each for the WAN and LAN labeled as "snort_wan" and "snort_lan"; plus matching "barnyard2_wan" and "barnyard2_lan" entries. Since these would be showing up as individual services, they could be monitored by something like the Service Watchdog package.

I added the new code and everything was working great on my testing VMs. I recruited some of my private testers and let them try. Low and behold they started getting lots of duplicate processes. After trying a bunch of different approaches, nothing seemed to really work reliably. Since 50% of my testers had the problem of duplicate processes, I don't think this idea is ready for prime time. Investigating with my testers we found that whether or not you got the duplicate processes on reboot was controlled to some degree by the order in which other packages were installed. If you installed Snort first and then after it the Service Watchdog package, things usually worked OK. However, there was an annoyance from the Service Watchdog package because it runs every 60 seconds looking at the services. It immediately tries to restart any service that is down. So during the nightly rules update, it can attempt to restart Snort during the rules update as Snort is restarting itself. There is no facility within the Service Watchdog package to tell it to "stand down" for a while. At least not a way a package like Snort can use. So you get e-mail and log spam from Service Watchdog while Snort is restarting. If you have a minimal Snort setup with few active rules and an i7 3.3 GHz CPU, then you may get nothing from Service Watchdog because Snort can stop and restart quickly. However, if you have thousands of active rules and/or a weaker CPU like an Atom, then Snort takes a while to restart and Service Watchdog will spew restart commands every 60 seconds.

Service Watchdog is not the only issue. The init scripts of pfSense also issue multiple package start/restart commands during a reboot. You get the expected "package start" command as part of the initial boot up script, but then you get additional "package restart" commands from the WAN IP and NEW WAN IP scripts as the interfaces come up. So if you have IPv6 and IPv4 configured, for example, that's three package "start" commands within 7 seconds. One from the initial boot script, then one from NEW IPv4 WAN IP, and then another from NEW IPv6 WAN IP. These can be followed by more "package restart" commands for things like VPNs, etc. In my humble opinion, the boot up process should only issue a single package start command as like the very last thing it does. That means after all the WAN IPs have been assigned and any VPNs have been started, etc.

I have not given up, but for the moment I'm scratching my head looking for a viable solution that works in all cases.

Bill

bmeeks

@jflsakfja:

Hey Bill,
As far as I can tell, the bug with the logs not getting properly removed happens with the packet captures as well. Dunno if it happens with other logs, those are the first that went over the limits.

Clarification and Update on this issue:

There are configuration parameters on the INTERFACE SETTINGS tab where packet capture files are enabled that allow you specify the size limit in megabytes for each pcap file and the total number of pcap files to retain in the log directory. The defaults are 32 MB for file size and 1000 pcap files maintained. Once the number of pcap files to retain value is exceeded, the oldest file overwritten.

With that said, there is also a type of "fail safe" feature associated with both Snort and Suricata that lets you set an overall limit in MB on the amount of disk space in /var/log that will be consumed by ALL of the Suricata (or Snort, in that package) log files. It's this latter fail-safe code that was neglecting to remove the pcap files when the log directory size limit was exceeded. That will be fixed in the next update of Suricata.

Bill

@bmeeks:

Clarification and Update on this issue:

There are configuration parameters on the INTERFACE SETTINGS tab where packet capture files are enabled that allow you specify the size limit in megabytes for each pcap file and the total number of pcap files to retain in the log directory. The defaults are 32 MB for file size and 1000 pcap files maintained. Once the number of pcap files to retain value is exceeded, the oldest file overwritten.

With that said, there is also a type of "fail safe" feature associated with both Snort and Suricata that lets you set an overall limit in MB on the amount of disk space in /var/log that will be consumed by ALL of the Suricata (or Snort, in that package) log files. It's this latter fail-safe code that was neglecting to remove the pcap files when the log directory size limit was exceeded. That will be fixed in the next update of Suricata.

Bill

No no, I meant that the log rotation ignores the limits set for the packet logs. If setting a limit of 100MB, and a log number of 100, logs get cut off at 100MB as expected, and a new one is created until the size is reached, but the number of logs increases to 160, for example.

I identified the bug when the packet logs started filling up all the space that was assigned to suricata's log files. For some strange reason the blocked tab would cut off at a random number of hosts, and suricata would constantly complain that the log directory is full. Did an ls on the directory, which showed that packet logs weren't properly getting rotated, which led to the directory becoming full, and suricata freaking out and cutting off the alerts/blocks/other logs, but keeping the packet logs intact.