Suricata 2.0.3 Package Preview
-
@jflsakfja:
Hey Bill,
As far as I can tell, the bug with the logs not getting properly removed happens with the packet captures as well. Dunno if it happens with other logs, those are the first that went over the limits.
Yeah, it's actually with all the rotated logs. A dumb mistake I made in the routine that looks for files to remove. It will be fixed in the next update, which I hope to post soon. I'm mulling over whether to bump the binary version to 2.0.4 as well to stay in sync with upstream releases.
Bill
-
No worries, most people might not even notice it (assuming they go with the default of not logging packets). That's the worst of all the non-rotating logs, since that's what eats up /var space fast.
I vote for a quick bug fix instead of keeping in line with upstream. On the other hand, if it's not too much extra work, just do the binary bump + bugs.
-
Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:
1/10/2014 -- 02:17:07 - <info> -- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice> -- Stats for 'bce1': pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error> -- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly
-
There was another thread with a similar failure. Does this scenario fit your issue?
https://forum.pfsense.org/index.php?topic=81533.0
-
There was another thread with a similar failure. Does this scenario fit your issue?
https://forum.pfsense.org/index.php?topic=81533.0
Nope, I don't even have DHCPv6 enabled on any interface.
-
Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:
1/10/2014 -- 02:17:07 - <info> -- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice> -- Stats for 'bce1': pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error> -- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly
As part of testing for an experimental feature in Snort, I've found what I think is the root cause of these random errors where the process dies. Snort has the same issue. Unfortunately the problem is not really within the two IDS packages themselves. It is caused by the way pfSense handles certain events that involve installed packages. It's a bit technical and complicated to explain, but there are a number of processes within pfSense that can kick off a procedure to "start/restart all packages". While debugging/testing my theory with Snort recently, I saw the Snort package get sent three separate "start" commands within a 7-second window. I had a special shell script that was logging the exact time it was called and with what argument.
It takes Snort much longer than 7 seconds to start up. So when you get multiple start commands, the shell script can wind up launching multiple processes. Two of those "start" commands happened only 2 seconds apart! I have made several attempts to work around this in both the Snort and Suricata packages. The developers who worked on Snort before me apparently also tried some tricks, because their code is still in there. Nothing works 100% when your startup script is called repeatedly before it can even finish handling the first call.
This is how you sometimes wind up with duplicate Suricata or Snort processes.
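For reference, the usual shell-level defense against this kind of race is an atomic lock taken as the very first action of the script, before any of the slow startup work begins. Here's a minimal sketch (the lock path is hypothetical, and this is not the actual package code) using `mkdir`, whose create-if-absent is atomic:

```shell
# Sketch only: an atomic "start in progress" guard taken before any slow
# startup work. mkdir is atomic, so a second "start" arriving during the
# long startup window is rejected even though no PID file exists yet.
# LOCKDIR is a hypothetical path, not the package's real one.
LOCKDIR="${TMPDIR:-/tmp}/suricata_start.lock"
rmdir "$LOCKDIR" 2>/dev/null || true   # clean slate for the demo

try_start() {
    if mkdir "$LOCKDIR" 2>/dev/null; then
        echo "lock acquired, starting"
        # ... the slow Suricata startup would happen here ...
    else
        echo "start already in progress, ignoring duplicate"
    fi
}

first=$(try_start)     # simulates the boot script's "start"
second=$(try_start)    # simulates a second "start" two seconds later
rmdir "$LOCKDIR" 2>/dev/null || true   # demo cleanup
```

Even this only rejects duplicates while the script itself holds the lock; as Bill describes, none of these tricks has proven 100% reliable once the real daemon startup and PID-file timing are in the picture.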
Bill
-
Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:
1/10/2014 -- 02:17:07 - <info> -- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice> -- Stats for 'bce1': pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error> -- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly
As part of testing for an experimental feature in Snort, I've found what I think is the root cause of these random errors where the process dies. Snort has the same issue. Unfortunately the problem is not really within the two IDS packages themselves. It is caused by the way pfSense handles certain events that involve installed packages. It's a bit technical and complicated to explain, but there are a number of processes within pfSense that can kick off a procedure to "start/restart all packages". While debugging/testing my theory with Snort recently, I saw the Snort package get sent three separate "start" commands within a 7-second window. I had a special shell script that was logging the exact time it was called and with what argument.
It takes Snort much longer than 7 seconds to start up. So when you get multiple start commands, the shell script can wind up launching multiple processes. Two of those "start" commands happened only 2 seconds apart! I have made several attempts to work around this in both the Snort and Suricata packages. The developers who worked on Snort before me apparently also tried some tricks, because their code is still in there. Nothing works 100% when your startup script is called repeatedly before it can even finish handling the first call.
This is how you sometimes wind up with duplicate Suricata or Snort processes.
Bill
I'm actually a professional developer, granted I'm the type that spends my days doing corporate stuff for (gasp) Windows, so I'm still a novice when it comes to BSD in particular. But can one not export a global variable right at the start of the script, set to indicate Suricata is starting, that subsequent calls to the script can sanity-check, effectively putting a lock on the startup? I won't be shocked if that kind of scope is disallowed, but of course there are other options, like creating a temporary file to use as a lock immediately when the script is invoked.
I assume that the issue is that you can't even get a process id early enough to lock based on finding a pid?
Anyhow, I have not thought this through, and it is so obvious that it must have already been attempted lol. If I remember right, there is another pfSense package that will continually check whether services are running and relaunch them if they die, so I guess that is a good enough workaround.
-
The problem is the amount of time between when Suricata or Snort get the initial "start" signal and when the PID file is created. So you have a window of time where the process is kicking off but not fully running. I have tried various lock file tricks, and they work in some cases but not all.
Both binaries offer a type of "warm restart" where you can send a running process a signal. The current Snort and Suricata packages attempt to use this so that if a PID file exists, and another "start" command is received, the PID file is used to signal the running process with the "warm restart" instead of "start". I've discovered that if a running process receives a "warm restart" command too early in its startup phase, it will sometimes crash. I think that is what is happening to folks posting here with randomly crashing processes. Compounding this problem is the fact that package startup times for Suricata and Snort are different on different systems due to complexity and number of enabled rules and the CPU horsepower. So what is 15 seconds on one box is 90 seconds on some others.
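The start-versus-warm-restart decision Bill describes boils down to something like the sketch below (paths are hypothetical and a stand-in process replaces the real daemon; the actual package logic differs). The catch is visible in the condition itself: during the startup window the PID file either doesn't exist yet or points at a process that isn't ready, so the check can't distinguish "starting" from "not running".

```shell
# Sketch of the "start vs. warm restart" decision (hypothetical paths;
# a background sleep stands in for a running Suricata).
decide_action() {
    pidfile=$1
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile" 2>/dev/null)" 2>/dev/null; then
        echo "warm-restart"   # signal the running process instead of starting
    else
        echo "cold-start"     # no live process on record: full start
    fi
}

PIDFILE=$(mktemp)
sleep 30 &                    # stand-in for a running daemon
echo $! > "$PIDFILE"
action_live=$(decide_action "$PIDFILE")   # live process found

kill "$(cat "$PIDFILE")" 2>/dev/null
wait 2>/dev/null || true      # reap it so the PID is truly gone
action_dead=$(decide_action "$PIDFILE")   # stale PID file, dead process
rm -f "$PIDFILE"
```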
In defense of pfSense, it is trying to make sure running packages are immediately aware of big configuration changes like new WAN IP addresses, newly started VPNs, etc. The way pfSense tells packages about these changes today is simply to restart them. During a reboot is when most of these rapid-fire "start" commands are issued, as the various init scripts run in succession.
I am certainly open to ideas.
Bill
-
How about fundamentally changing the way services are started? Instead of using a lock file to prevent a service restarting, use a lock file to indicate that the service is running as expected. The pfsense startup scripts could check for the existence of this file before attempting to start up snort/suricata. Just an idea.
-
@jflsakfja:
How about fundamentally changing the way services are started? Instead of using a lock file to prevent a service restarting, use a lock file to indicate that the service is running as expected. The pfsense startup scripts could check for the existence of this file before attempting to start up snort/suricata. Just an idea.
I think you just paraphrased the same concept.
-
Lock file: usually created when the process is starting up. Usually removed when the process ends.
Special file: created after the process is up and running. Removed only after a successful manual shutdown command.
Yes, as you said, slightly different variations of the same concept; it's just the timing of the file's creation that matters. As for whether it would actually work, that's not for me to decide, since I've always hated programming for a simple reason: computers do what I tell them, not what I meant.
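The "special file" idea above could be sketched like this (file name and functions are hypothetical, not the package's code): the marker is created only after startup has fully succeeded and is removed only by a deliberate stop, so its presence means "running as expected" rather than "a start was attempted".

```shell
# Sketch of the "special file" variant: marker created after the service
# is confirmed up, removed only on a clean manual shutdown. MARKER is a
# hypothetical path (think /var/run/suricata_wan.up).
MARKER=$(mktemp -u)

start_service() {
    # ... launch the daemon, then poll until it is actually ready ...
    :                  # (daemon startup is a no-op for this demo)
    touch "$MARKER"    # created only once startup has fully succeeded
}

stop_service() {
    # ... signal the daemon to shut down cleanly ...
    rm -f "$MARKER"    # removed only on a deliberate stop
}

service_healthy() { [ -f "$MARKER" ]; }

start_service
service_healthy && state1=up || state1=down
stop_service
service_healthy && state2=up || state2=down
```

A crash would leave the marker behind, so anything consuming it still has to decide what a stale marker means, which is the same ambiguity Bill keeps running into.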
-
Without getting down into all the technical details, I have tried a few variations of "lock files" and "special files".
Here is a brief history of how I got back into this problem. Due to popular demand, I was altering the Snort package so that each configured interface would show up in pfSense as a distinct service. So if you ran Snort (and Barnyard2) on both the WAN and LAN, then in the SERVICES applet you would see four services for Snort listed: one each for the WAN and LAN labeled as "snort_wan" and "snort_lan"; plus matching "barnyard2_wan" and "barnyard2_lan" entries. Since these would be showing up as individual services, they could be monitored by something like the Service Watchdog package.
I added the new code and everything was working great on my testing VMs. I recruited some of my private testers and let them try it. Lo and behold, they started getting lots of duplicate processes. After trying a bunch of different approaches, nothing seemed to work reliably. Since 50% of my testers had the duplicate-process problem, I don't think this idea is ready for prime time. Investigating with my testers, we found that whether or not you got duplicate processes on reboot was controlled to some degree by the order in which packages were installed. If you installed Snort first and the Service Watchdog package after it, things usually worked OK.
However, there was an annoyance from the Service Watchdog package because it runs every 60 seconds looking at the services and immediately tries to restart any service that is down. So it can attempt to restart Snort during the nightly rules update, while Snort is restarting itself. There is no facility within Service Watchdog to tell it to "stand down" for a while, at least not one a package like Snort can use. So you get e-mail and log spam from Service Watchdog while Snort is restarting. If you have a minimal Snort setup with few active rules and an i7 3.3 GHz CPU, you may get nothing from Service Watchdog because Snort can stop and restart quickly. However, if you have thousands of active rules and/or a weaker CPU like an Atom, Snort takes a while to restart and Service Watchdog will spew restart commands every 60 seconds.
Service Watchdog is not the only issue. The init scripts of pfSense also issue multiple package start/restart commands during a reboot. You get the expected "package start" command as part of the initial boot-up script, but then you get additional "package restart" commands from the WAN IP and NEW WAN IP scripts as the interfaces come up. So if you have IPv6 and IPv4 configured, for example, that's three package "start" commands within 7 seconds: one from the initial boot script, then one from NEW IPv4 WAN IP, and then another from NEW IPv6 WAN IP. These can be followed by more "package restart" commands for things like VPNs, etc. In my humble opinion, the boot-up process should issue only a single package start command, as the very last thing it does; that is, after all the WAN IPs have been assigned, any VPNs have been started, etc.
I have not given up, but for the moment I'm scratching my head looking for a viable solution that works in all cases.
Bill
-
@jflsakfja:
Hey Bill,
As far as I can tell, the bug with the logs not getting properly removed happens with the packet captures as well. Dunno if it happens with other logs, those are the first that went over the limits.
Clarification and Update on this issue:
There are configuration parameters on the INTERFACE SETTINGS tab, where packet capture files are enabled, that allow you to specify the size limit in megabytes for each pcap file and the total number of pcap files to retain in the log directory. The defaults are 32 MB per file and 1000 pcap files retained. Once the number-of-files-to-retain value is exceeded, the oldest file is overwritten.
With that said, there is also a type of "fail safe" feature associated with both Snort and Suricata that lets you set an overall limit in MB on the amount of disk space in /var/log that will be consumed by ALL of the Suricata (or Snort, in that package) log files. It's this latter fail-safe code that was neglecting to remove the pcap files when the log directory size limit was exceeded. That will be fixed in the next update of Suricata.
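The fail-safe idea reduces to "when the directory's total size exceeds the cap, delete oldest-first until it fits", with the fix being that pcap files must not be exempt. A tiny sketch of that logic (hypothetical file names and a small demo cap; the package's real code is PHP):

```shell
# Demo of the size-cap fail-safe: delete the oldest files, pcaps
# included (skipping pcaps was the bug), until under the cap.
LOGDIR=$(mktemp -d)
LIMIT_BYTES=3072

# three 2 KB "logs" with staggered timestamps, 6 KB total
for stamp in 01 02 03; do
    dd if=/dev/zero of="$LOGDIR/file$stamp" bs=1024 count=2 2>/dev/null
    touch -t "202001${stamp}0000" "$LOGDIR/file$stamp"
done
mv "$LOGDIR/file01" "$LOGDIR/old.pcap"    # the oldest file is a pcap

dir_bytes() { find "$LOGDIR" -type f -exec cat {} + 2>/dev/null | wc -c; }

while [ "$(dir_bytes)" -gt "$LIMIT_BYTES" ]; do
    oldest=$(ls -tr "$LOGDIR" | head -n 1)   # oldest file first
    [ -n "$oldest" ] || break
    rm -f "$LOGDIR/$oldest"                  # pcaps are not exempt
done
```

With a 3 KB cap and 6 KB of files, the pcap goes first because it is oldest, which is exactly the behavior the original code was missing.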
Bill
-
Clarification and Update on this issue:
There are configuration parameters on the INTERFACE SETTINGS tab, where packet capture files are enabled, that allow you to specify the size limit in megabytes for each pcap file and the total number of pcap files to retain in the log directory. The defaults are 32 MB per file and 1000 pcap files retained. Once the number-of-files-to-retain value is exceeded, the oldest file is overwritten.
With that said, there is also a type of "fail safe" feature associated with both Snort and Suricata that lets you set an overall limit in MB on the amount of disk space in /var/log that will be consumed by ALL of the Suricata (or Snort, in that package) log files. It's this latter fail-safe code that was neglecting to remove the pcap files when the log directory size limit was exceeded. That will be fixed in the next update of Suricata.
Bill
No no, I meant that the log rotation ignores the limits set for the packet logs. If I set a limit of 100 MB and a log count of 100, each log gets cut off at 100 MB as expected and a new one is created when the size is reached, but the number of logs keeps growing, to 160 for example.
I identified the bug when the packet logs started filling up all the space that was assigned to suricata's log files. For some strange reason the blocked tab would cut off at a random number of hosts, and suricata would constantly complain that the log directory is full. Did an ls on the directory, which showed that packet logs weren't properly getting rotated, which led to the directory becoming full, and suricata freaking out and cutting off the alerts/blocks/other logs, but keeping the packet logs intact.
-
@jflsakfja:
No no, I meant that the log rotation ignores the limits set for the packet logs. If I set a limit of 100 MB and a log count of 100, each log gets cut off at 100 MB as expected and a new one is created when the size is reached, but the number of logs keeps growing, to 160 for example.
I identified the bug when the packet logs started filling up all the space that was assigned to suricata's log files. For some strange reason the blocked tab would cut off at a random number of hosts, and suricata would constantly complain that the log directory is full. Did an ls on the directory, which showed that packet logs weren't properly getting rotated, which led to the directory becoming full, and suricata freaking out and cutting off the alerts/blocks/other logs, but keeping the packet logs intact.
Oh…that sounds like perhaps a bug in the Suricata binary. The limits set on the INTERFACE SETTINGS tab (where the pcap feature is turned on or off) are all handled by built-in parts of the binary. Only settings on the LOG MGMT tab are controlled by PHP code I wrote. I will test the binary rotation in a VM by setting the number very low (like 1 file) to see what happens.
Bill
-
@jflsakfja:
No no, I meant that the log rotation ignores the limits set for the packet logs. If I set a limit of 100 MB and a log count of 100, each log gets cut off at 100 MB as expected and a new one is created when the size is reached, but the number of logs keeps growing, to 160 for example.
I identified the bug when the packet logs started filling up all the space that was assigned to suricata's log files. For some strange reason the blocked tab would cut off at a random number of hosts, and suricata would constantly complain that the log directory is full. Did an ls on the directory, which showed that packet logs weren't properly getting rotated, which led to the directory becoming full, and suricata freaking out and cutting off the alerts/blocks/other logs, but keeping the packet logs intact.
Looked into the binary source code, and this is not really a bug. It's more a case of a somewhat misleading feature. Suricata keeps track of the number of pcap files it creates during a given session and correctly rotates them. The key, however, is "a given session": it does not maintain state through startup/shutdown cycles. As far as Suricata is concerned, when it starts up it has opened "zero files", so any that existed prior to the last shutdown are ignored. The misleading part is that the "max-files" parameter really only works so long as Suricata is never restarted… :-\
So to address this I will need to put something in the GUI code's cron job. I will do that in the update I am currently working on.
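Since Suricata's own max-files counter resets on every restart, an external cron-driven prune has to enforce the cap across sessions. A sketch of that idea (hypothetical file names and cap; not the package's actual code): keep the newest MAX_FILES pcaps and delete everything older.

```shell
# Demo: enforce a pcap-count cap across Suricata restarts, something the
# binary's per-session max-files counter cannot do on its own.
PCAP_DIR=$(mktemp -d)
MAX_FILES=3

# five pcaps left over from five different "sessions"
for day in 01 02 03 04 05; do
    : > "$PCAP_DIR/log.pcap.$day"
    touch -t "202001${day}0000" "$PCAP_DIR/log.pcap.$day"
done

# newest-first listing; everything past the first MAX_FILES entries goes
( cd "$PCAP_DIR" && ls -t | tail -n +$((MAX_FILES + 1)) | xargs rm -f )
```

After the prune, only the three newest files remain, regardless of how many restarts produced the backlog.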
Bill
-
@jflsakfja:
No no, I meant that the log rotation ignores the limits set for the packet logs. If I set a limit of 100 MB and a log count of 100, each log gets cut off at 100 MB as expected and a new one is created when the size is reached, but the number of logs keeps growing, to 160 for example.
I identified the bug when the packet logs started filling up all the space that was assigned to suricata's log files. For some strange reason the blocked tab would cut off at a random number of hosts, and suricata would constantly complain that the log directory is full. Did an ls on the directory, which showed that packet logs weren't properly getting rotated, which led to the directory becoming full, and suricata freaking out and cutting off the alerts/blocks/other logs, but keeping the packet logs intact.
Looked into the binary source code, and this is not really a bug. It's more a case of a somewhat misleading feature. Suricata keeps track of the number of pcap files it creates during a given session and correctly rotates them. The key, however, is "a given session": it does not maintain state through startup/shutdown cycles. As far as Suricata is concerned, when it starts up it has opened "zero files", so any that existed prior to the last shutdown are ignored. The misleading part is that the "max-files" parameter really only works so long as Suricata is never restarted… :-\
So to address this I will need to put something in the GUI code's cron job. I will do that in the update I am currently working on.
Bill
You've got it worse than I have it in my day job. Thumbs up to you, sir.
-
Has this issue been resolved?
I am also running a PPPoE connection and trying to get Suricata configured, but my logs are flooded with "SC_ERR_DATALINK_UNIMPLEMENTED" errors.
Suricata will not start automatically after a pfSense restart. Is this by design, or caused by these errors?
Regards,
-
Not sure (since I've never needed PPPoE), and my mind is a bit fuzzy right now (been working straight for the past 50 hours, with a pass-out-on-the-office-chair in between), but have you tried making Suricata listen on the underlying interface instead of the PPPoE one? Oh well, someone more knowledgeable on this will come along. Please correct me if I'm wrong in any way.
-
I have not tried that yet, as it was noted earlier in the thread that it was a bad idea: Suricata will receive header information from the PPPoE interface which it won't know how to handle.
I will have to research how to do this, as at the moment Suricata is not generating any alerts at all.