Suricata 2.0.3 Package Preview

bmeeks

@avink and @geudrik:

Thanks for offering to help test. It will be a few weeks before I can get to this. I will contact you via PM when I have something to test.

Bill

binaryjay

@bmeeks:

@binaryjay:

Thank you for the package… I was having stability issues with snort crashing when under heavy load repeatedly and decided to give suricata a try the DAY before the 2.0 package went up... :) Anyway I upgraded of course... and I have not had the service go down a single time yet. I thought snort was supposed to be the more stable of the two? Anyway...

The only issue I've run across was what I suspect to be suricata filling my /var partition to 100% with... something... causing everything else on the box to go haywire. I couldn't do much else but force a reboot. It has been fine since, but I will need to monitor growth of /var more carefully and see if it happens again. Logs were not persisted after reboot (thanks nanoBSD ram disk).

The reason I blame suricata was because it was the first time I've seen this happen and the only thing that changed was adding suricata and removing snort. I do have most of the logging turned OFF and the automatic log management turned ON with a very small size limit.

If I can come up with some actual proof of what happened (if it occurs again) I will post back.

This is on nanoBSD.

Suricata will produce a LOT of log output if some of the logging options are enabled. I highly suspect it is filling up the /var partition. I more or less left the defaults at the "out-of-the-box" values. Go to the LOGS MGMT tab and check the box to enable automatic management of logs. This will prune and rotate the logs. For nanoBSD, you should probably reduce both the LOG LIMIT sizes and RETENTION PERIODS to numbers somewhat smaller than the defaults. If keeping the logs is a big deal, then you would want to export them off the box somehow. Using Barnyard2 and outputting to a remote syslog server is one way of doing that.

Suricata's log files are all in /var/log/suricata and sub-directories underneath.

Oops…went back and read your original post more carefully and saw where you had enabled the management. How busy is your network? The logs management task only runs once every 5 minutes, and it is possible on extremely busy networks for some of the log files (say EVE, http or dns) to get quite large in 5 minutes when compared to the amount of free space that is likely to be on a nanoBSD partition.

Bill

Bill,

It has happened again: clients were refusing to get assigned an IP address. I was able to log into the box this time to see what was going on, and I found that dhcpd was complaining about /var being out of space. A df -h confirmed that it was full.

This installation is a nanoBSD one, and my /var partition is only 120MB (which I believe is more than the default); /var on nanoBSD is a ramdisk. I narrowed it down to the suricata alert logs:

This is a small example of the list for the one interface suricata is running on:


-rw-r--r--  1 root  wheel  1737215 Sep 28 15:15 alerts.log.2014_0928_1515
-rw-r--r--  1 root  wheel  6789234 Sep 28 15:20 alerts.log.2014_0928_1520
-rw-r--r--  1 root  wheel  6462547 Sep 28 15:25 alerts.log.2014_0928_1525
-rw-r--r--  1 root  wheel  2383926 Sep 28 15:30 alerts.log.2014_0928_1530
-rw-r--r--  1 root  wheel  1195393 Sep 28 15:35 alerts.log.2014_0928_1535
-rw-r--r--  1 root  wheel  2762914 Sep 28 16:30 alerts.log.2014_0928_1630
-rw-r--r--  1 root  wheel  1308605 Sep 28 16:35 alerts.log.2014_0928_1635
-rw-r--r--  1 root  wheel  3320858 Sep 28 16:50 alerts.log.2014_0928_1650
-rw-r--r--  1 root  wheel  5628798 Sep 28 16:55 alerts.log.2014_0928_1655
-rw-r--r--  1 root  wheel   674842 Sep 28 17:00 alerts.log.2014_0928_1700
-rw-r--r--  1 root  wheel   540330 Sep 28 17:15 alerts.log.2014_0928_1715
-rw-r--r--  1 root  wheel   758454 Sep 28 17:35 alerts.log.2014_0928_1735
-rw-r--r--  1 root  wheel  3895902 Sep 28 17:40 alerts.log.2014_0928_1740
-rw-r--r--  1 root  wheel  4773497 Sep 28 17:55 alerts.log.2014_0928_1755
-rw-r--r--  1 root  wheel     8192 Sep 28 18:00 alerts.log.2014_0928_1800
-rw-r--r--  1 root  wheel     8192 Sep 28 18:05 alerts.log.2014_0928_1805
-rw-r--r--  1 root  wheel     8192 Sep 28 18:15 alerts.log.2014_0928_1815
-rw-r--r--  1 root  wheel     8192 Sep 28 18:20 alerts.log.2014_0928_1820
-rw-r--r--  1 root  wheel   704512 Sep 28 18:25 alerts.log.2014_0928_1825
-rw-r--r--  1 root  wheel  1200180 Sep 28 18:30 alerts.log.2014_0928_1830

Together, these logs were taking up about 100MB before I removed them.

In logs management:


Enable directory size limit (Default): Yes
Size in MB: 3
alerts: 500KB, 1 Day Retention

Given these settings, I would not expect to see so many rotated alert logs, and I would never expect the suricata log folder to reach 100MB.

And this is the real kicker from the pfSense system log:


Sep 28 18:50:04	kernel: pid 88242 (dhcpd), uid 1002 inumber 12529 on /var: filesystem full
Sep 28 18:50:01	php: suricata_check_cron_misc.inc: [Suricata] Automatic clean-up of Suricata logs completed.
Sep 28 18:50:01	php: suricata_check_cron_misc.inc: [Suricata] Truncating logs for WAN (bce1)...
Sep 28 18:50:01	php: suricata_check_cron_misc.inc: [Suricata] Truncating the Rules Update Log file...
Sep 28 18:50:01	php: suricata_check_cron_misc.inc: [Suricata] Log directory size exceeds configured limit of 3 MB set on Global Settings tab. All Suricata log files will be truncated.
Sep 28 18:49:56	kernel: pid 88242 (dhcpd), uid 1002 inumber 12519 on /var: filesystem full

You can see it recognizes it needs to truncate, says it has done so but only for the rules update log? I see it has been trying to truncate all day long, the above was just the last instance of it in the logs. Is it because of the 1 day retention, of which there is no shorter period of time to choose? It seems like 1 day is too long with nanoBSD in some cases where something causes your network to get hammered with a lot of alerts one day.

Is it something I'm doing wrong? I'm afraid that this is causing the whole network to nearly go down in flames when it happens so I might have to go without suricata until a solution can be reached.

I suppose in the meantime I can run a script with cron to just blow away the logs myself if it gets too big but it does kind of seem like there is a bug here and I don't think this is an ideal solution. In the meantime I have also doubled the size of the ramdisk holding /var in the hopes it will provide some relief but this is kind of a waste of memory.

Thanks!

Bonus question: When rebooting pfsense, I find suricata not started and there are rafts of errors like this. When I then manually start the service, it starts up fine after the usual long delay.


28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - Can't use file_data with flow:to_server or from_client with http.
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tcp $EXTERNAL_NET any -> $SMTP_SERVERS 25 (msg:"BROWSER-FIREFOX Mozilla Firefox iframe and xul element reload crash attempt"; flow:to_server,established; file_data; content:"document.createElement|28 27|iframe|27 29|"; fast_pattern:only; content:"<frame"; content:".xul"; content:".contentdocument.location.reload|28 29|"; metadata:policy balanced-ips drop, policy connectivity-ips drop, policy security-ips drop, service smtp; reference:cve,2011-2982; classtype:attempted-user; sid:25228; rev:4;)" from file /usr/pbi/suricata-i386/etc/suricata/suricata_48468_bce1/rules/suricata.rules at line 14217
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - Can't use file_data with flow:to_server or from_client with http.
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tcp $EXTERNAL_NET any -> $SMTP_SERVERS 25 (msg:"BROWSER-FIREFOX Mozilla Firefox IDB use-after-free attempt"; flow:established,to_server; file_data; content:"IDBKeyRange"; fast_pattern:only; pcre:"/IDBKeyRange\x2e(only|lowerBound|upperBound|bound)\x28.*?\x29.{0,100}\x2e(lower|upper|lowerOpen|upperOpen)/smi"; metadata:policy balanced-ips drop, policy connectivity-ips drop, policy security-ips drop, service smtp; reference:cve,2012-0469; reference:url,bugzilla.mozilla.org/show_bug.cgi?id=738985; classtype:attempted-user; sid:24574; rev:4;)" from file /usr/pbi/suricata-i386/etc/suricata/suricata_48468_bce1/rules/suricata.rules at line 14220
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - Can't use file_data with flow:to_server or from_client with http.
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tcp $EXTERNAL_NET any -> $SMTP_SERVERS 25 (msg:"BROWSER-FIREFOX Mozilla Firefox IDB use-after-free attempt"; flow:established,to_server; file_data; content:"IDBKeyRange.lowerBound("; content:".upper"; within:20; metadata:policy balanced-ips drop, policy connectivity-ips drop, policy security-ips drop, service smtp; reference:cve,2012-0469; reference:url,bugzilla.mozilla.org/show_bug.cgi?id=738985; classtype:attempted-user; sid:24573; rev:3;)" from file /usr/pbi/suricata-i386/etc/suricata/suricata_48468_bce1/rules/suricata.rules at line 14221
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - Can't use file_data with flow:to_server or from_client with http.
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tcp $EXTERNAL_NET any -> $SMTP_SERVERS 25 (msg:"BROWSER-FIREFOX Mozilla Firefox IDB use-after-free attempt"; flow:established,to_server; file_data; content:"IDBKeyRange.only("; content:").lower"; within:20; metadata:policy balanced-ips drop, policy connectivity-ips drop, policy security-ips drop, service smtp; reference:cve,2012-0469; reference:url,bugzilla.mozilla.org/show_bug.cgi?id=738985; classtype:attempted-user; sid:24570; rev:3;)" from file /usr/pbi/suricata-i386/etc/suricata/suricata_48468_bce1/rules/suricata.rules at line 14224
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - Can't use file_data with flow:to_server or from_client with http.
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tcp $HOME_NET any -> $SMTP_SERVERS 25 (msg:"BROWSER-FIREFOX Mozilla Firefox nsTreeRange Use After Free attempt"; flow:to_server,established; file_data; content:"|2E|view|2E|selection"; nocase; content:"|2E|invalidateSelection"; distance:0; nocase; pcre:"/\x2Eview\x2Eselection.*?\x2Etree\s*\x3D\s*null.*?\x2Einvalidate/smi"; metadata:policy balanced-ips drop, policy connectivity-ips drop, policy security-ips drop, service smtp; reference:cve,2011-0073; reference:url,www.mozilla.org/security/announce/2011/mfsa2011-13.html; classtype:attempted-user; sid:29617; rev:1;)" from file /usr/pbi/suricata-i386/etc/suricata/suricata_48468_bce1/rules/suricata.rules at line 14234
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - Can't use file_data with flow:to_server or from_client with http.
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tcp $EXTERNAL_NET any -> $SMTP_SERVERS 25 (msg:"BROWSER-IE Microsoft Internet Explorer execCommand CTreePos memory corruption attempt"; flow:to_server,established; file_data; content:".execCommand"; nocase; content:"undo"; within:15; nocase; content:".execCommand"; within:100; nocase; content:"redo"; within:15; nocase; content:".execCommand"; within:100; nocase; content:"undo"; within:15; nocase; pcre:"/\x2eexecCommand\s*\x28\s*[\x22\x27]\s*undo\s*[\x22\x27].*?\x2eexecCommand\s*\x28\s*[\x22\x27]\s*redo\s*[\x22\x27].*?\x2eexecCommand\s*\x28\s*[\x22\x27]\s*undo\s*[\x22\x27]/smi"; metadata:policy balanced-ips drop, policy connectivity-ips drop, policy security-ips drop, service smtp; reference:cve,2013-3914; reference:url,technet.microsoft.com/en-us/security/bulletin/MS13-088; classtype:attempted-user; sid:28495; rev:1;)" from file /usr/pbi/suricata-i386/etc/suricata/suricata_48468_bce1/rules/suricata.rules at line 14235
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - Can't use file_data with flow:to_server or from_client with http.
28/9/2014 -- 19:55:55 - <error>-- [ERRCODE: SC_ERR_INVALID_SIGNATURE(39)] - error parsing signature "alert tcp $EXTERNAL_NET any -> $SMTP_SERVERS 25 (msg:"BROWSER-IE Microsoft Internet Explorer fontFamily attribute deleted object access memory corruption attempt"; flow:to_server,established; file_data; content:" [truncated]

bmeeks

Yes, I found that bug with Suricata's log management routine forgetting to delete the "rotated" log files when the logging directory size exceeds the set limit. That fix is coming in the next update. It cleans up the current alert log, but does not clean up the older rotated logs under some conditions.

The interim fix is to periodically go in and delete those files (the ones named alerts.log.xxxxx, where xxxxx is a timestamp).
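That interim cleanup could be scripted and run from cron. Below is a minimal sketch in Python, assuming the rotated files follow the alerts.log.<timestamp> naming seen in the directory listing earlier in the thread and live somewhere under /var/log/suricata. The function name is made up for illustration, this is not the package's own cleanup code, and Python would have to be present on the box (the same logic fits in a short shell script).

```python
from pathlib import Path


def remove_rotated_logs(log_root, stem="alerts.log"):
    """Delete rotated copies of a log (stem + "." + suffix) under log_root.

    The live file (exactly `stem`) is left alone because the glob pattern
    requires a dot and a suffix after the stem.
    """
    removed = []
    for path in Path(log_root).rglob(stem + ".*"):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Calling remove_rotated_logs("/var/log/suricata") would walk every interface sub-directory. It only treats the symptom; the directory can still fill up between runs on a busy network.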

Suricata can barf out a lot of log traffic, and running it on a Nano install is risky (in my opinion) due to the potential for exhausting disk space unexpectedly.

Bill

binaryjay

Awesome, thank you! Curious, is it an upstream problem or pfsense package specific?

bmeeks

@binaryjay:

Awesome, thank you! Curious, is it an upstream problem or pfsense package specific?

The log management bug I was referring to is pfSense-specific. It is a bug in the package GUI code that I created. My code looks for the file alerts.log, but fails to then also do a pattern search for names containing ".log." so it can catch the rotated files.

It will be fixed in the next Suricata package update.

The Log Management code in the pfSense Suricata package was added to cope with the accumulation of log files that can occur. It is not a perfect solution, though. For one, the routine runs once every 5 minutes via a cron job. If you have a busy network and a very noisy rule, then a lot of megabytes of log traffic could be created in the five-minute window between runs of the cron job.

I also realized I did not answer your question about the "file_data" keyword errors. Suricata does not currently process all of the same rule keywords and options that Snort does. The "file_data" option is one of those. This is an upstream issue. There are some feature requests posted on the Suricata Redmine site asking that support be added for the new VRT keywords and options.

Bill

demco

Bill, is it possible to use clog? Or are there unintended consequences?

Regarding the "file_data" keyword and signature parsing errors raised by binaryjay: if Suricata starts up with these errors, can we assume it is working and just skipped these signatures?

Thanks, appreciate your work on Suricata and Snort.

bmeeks

Yes, Suricata will just skip signatures it can't parse. It prints the error you see and goes to the next signature. Snort, on the other hand, will print an error and quit on signature parsing errors.
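To see which signatures were skipped, the SIDs can be pulled out of the "error parsing signature" lines in the log. This is a rough sketch in Python, assuming the log lines look like the ones binaryjay posted; the function name is hypothetical and the regular expression simply grabs the first sid:NNNN; field on each matching line.

```python
import re

# Matches the "sid:25228;" option inside a rule quoted in the log line.
SID_RE = re.compile(r"\bsid:(\d+);")


def skipped_sids(log_lines):
    """Return the SIDs of rules that Suricata reported it could not parse."""
    sids = []
    for line in log_lines:
        if "error parsing signature" not in line:
            continue  # e.g. the "Can't use file_data ..." explanation lines
        match = SID_RE.search(line)
        if match:
            sids.append(int(match.group(1)))
    return sids
```

Rules whose text was cut off in the log before the sid option will be missed, so the result is a lower bound rather than a complete list.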

No, the logging method of Suricata is fixed by the underlying binary. It does its own thing. It would be a rather substantial rewrite/customization of the binary to have it use clog. You may already know this, but I will repeat it anyway. The Suricata and Snort packages on pfSense each consist of two separate but related parts. What you see and interact with in the GUI is simply PHP code that creates the suricata.yaml (or snort.conf) configuration files used by a separate binary process. So the GUI components simply help you create the text configuration file that is read and processed by the binary that does the actual network inspection. That binary runs as a service. The GUI code is only active during the time you have a menu page or tab open and are actually interacting with it.

Bill

jflsakfja

Hey Bill,
As far as I can tell, the bug with the logs not getting properly removed happens with the packet captures as well. Dunno if it happens with other logs, those are the first that went over the limits.

bmeeks

@jflsakfja:

Hey Bill,
As far as I can tell, the bug with the logs not getting properly removed happens with the packet captures as well. Dunno if it happens with other logs, those are the first that went over the limits.

Yeah, it's actually with all the rotated logs. Dumb mistake I made in the routine that looks for files to remove. It will be fixed in the next update which I hope to post soon. Mulling over whether to bump the binary version to 2.0.4 as well to stay in sync with upstream releases.

Bill

No worries, most people might not even notice it (assuming they go with the default of not logging packets). That's the worst of all the non-rotating logs, since that's what eats up /var space fast.

I vote for a quick bug fix instead of keeping in line with upstream. On the other hand, if it's not too much extra work, just do the binary bump + bugs.

binaryjay

Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:


1/10/2014 -- 02:17:07 - <info>-- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice>-- Stats for 'bce1':  pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error>-- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly

BBcan177

There was another thread with a similar failure.. Does this scenario fit your issue?

https://forum.pfsense.org/index.php?topic=81533.0

binaryjay

@BBcan177:

There was another thread with a similar failure.. Does this scenario fit your issue?

https://forum.pfsense.org/index.php?topic=81533.0

Nope, I don't even have DHCPv6 enabled on any interface.

bmeeks

@binaryjay:

Another thing I'm wondering about, sometimes I find my Suricata dead in the morning and this is in the logs:


1/10/2014 -- 02:17:07 - <info>-- cleaning up signature grouping structure... complete
1/10/2014 -- 02:17:08 - <notice>-- Stats for 'bce1':  pkts: 0, drop: 0 (nan%), invalid chksum: 0
1/10/2014 -- 02:17:08 - <error>-- [ERRCODE: SC_ERR_DAEMON(87)] - Child died unexpectedly

As part of testing for an experimental feature in Snort, I've found what I think is the root cause of these random errors where the process dies. Snort has the same issue. Unfortunately the problem is not really within the two IDS packages themselves. It is caused by the way pfSense handles certain events that involve installed packages. It's a bit technical and complicated to explain, but there are a number of processes within pfSense that can kick off a procedure to "start/restart all packages". While debugging/testing my theory with Snort recently I saw the Snort package get sent three separate "start" commands within a 7-second window. I had a special shell script that was logging the exact time it was called and with what argument.

It takes Snort much longer than 7 seconds to start up. So when you get multiple start commands, the shell script can wind up launching multiple processes. Two of those "start" commands happened only 2 seconds apart! I have made several attempts to work around this in both the Snort and Suricata packages. The developers who worked on Snort before me apparently also tried some tricks because their code is still in there. Nothing works 100% when you have your startup script called repeatedly before you can even get started from the first call.

This is how you sometimes wind up with duplicate Suricata or Snort processes.

Bill

binaryjay

I'm actually a professional developer, though of the type that spends my days doing corporate stuff for (gasp) Windows, so I am still a novice when it comes to BSD in particular. Can one not export a global variable right at the start of the script to indicate that suricata is starting, so that subsequent calls to the script can check it and effectively treat it as a lock on the startup? I will not be shocked if this kind of scope is disallowed, but of course there are other options, like creating a temporary file to use as a lock immediately when the script is invoked.

I assume that the issue is that you can't even get a process id early enough to lock based on finding a pid?

Anyhow I have not thought this through and it is so obvious that it must have already been attempted lol. If I remember right there is another pfSense package that will continually check if services are started or not and relaunch if they die so that I guess is a good enough workaround.
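For reference, the temporary-file lock idea might look something like the sketch below. It is written in Python for readability and is only an illustration: the function names, the lock path passed in, and the 300-second staleness threshold are all assumptions, and it is not the code used in the Snort or Suricata packages. The key point is creating the file with O_CREAT | O_EXCL, so that only one caller can succeed even if several start commands arrive within a second or two.

```python
import os
import time


def acquire_start_lock(lock_path, stale_after=300.0):
    """Try to take the startup lock; return True if this caller now owns it."""
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        try:
            age = time.time() - os.path.getmtime(lock_path)
        except FileNotFoundError:
            return False  # holder released it just now; let a later call retry
        if age < stale_after:
            return False  # another start is presumably still in progress
        # The lock looks abandoned (e.g. after a crash). Removing it is racy
        # if two callers reach this point together, so it is best-effort only.
        try:
            os.unlink(lock_path)
        except FileNotFoundError:
            pass
        return acquire_start_lock(lock_path, stale_after)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def release_start_lock(lock_path):
    """Drop the lock once the service is fully up (or has failed to start)."""
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
```

The hard part is not taking the lock but knowing when to release it, since startup can take anywhere from seconds to minutes depending on the rule set and CPU.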

bmeeks

The problem is the amount of time between when Suricata or Snort get the initial "start" signal and when the PID file is created. So you have a window of time where the process is kicking off but not fully running. I have tried various lock file tricks, and they work in some cases but not all.

Both binaries offer a type of "warm restart" where you can send a running process a signal. The current Snort and Suricata packages attempt to use this so that if a PID file exists, and another "start" command is received, the PID file is used to signal the running process with the "warm restart" instead of "start". I've discovered that if a running process receives a "warm restart" command too early in its startup phase, it will sometimes crash. I think that is what is happening to folks posting here with randomly crashing processes. Compounding this problem is the fact that package startup times for Suricata and Snort are different on different systems due to complexity and number of enabled rules and the CPU horsepower. So what is 15 seconds on one box is 90 seconds on some others.
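The start-versus-warm-restart decision described above can be sketched as a small function. This is a hypothetical illustration in Python, not the package's actual logic: it assumes the PID file's modification time is a usable proxy for when the process came up, and it adds a grace period (the min_uptime value is an arbitrary placeholder) so that a warm restart is not sent to a process that is still initializing.

```python
import os
import time


def decide_action(pid_file, min_uptime=120.0, now=None):
    """Return "start", "wait", or "reload" for an incoming start command.

    "start"  - no PID file, so nothing appears to be running
    "wait"   - PID file is younger than min_uptime; signalling now is risky
    "reload" - process has been up long enough for a warm restart
    """
    now = time.time() if now is None else now
    try:
        started = os.path.getmtime(pid_file)
    except FileNotFoundError:
        return "start"
    if now - started < min_uptime:
        return "wait"
    return "reload"
```

This does not close the gap between the first "start" command and the moment the PID file appears, and it does not check whether the PID in the file is still alive, so it would need to be combined with some other guard.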

In defense of pfSense, it is trying to make sure running packages are immediately aware of big configuration changes like new WAN IP addresses, newly started VPNs, etc. The way used to tell packages about these changes today is to simply restart them. During a reboot is when most of these rapid-fire "start" commands are issued as the various init scripts run in succession.

I am certainly open to ideas.

Bill

jflsakfja

How about fundamentally changing the way services are started? Instead of using a lock file to prevent a service restarting, use a lock file to indicate that the service is running as expected. The pfsense startup scripts could check for the existence of this file before attempting to start up snort/suricata. Just an idea.

binaryjay

@jflsakfja:

How about fundamentally changing the way services are started? Instead of using a lock file to prevent a service restarting, use a lock file to indicate that the service is running as expected. The pfsense startup scripts could check for the existence of this file before attempting to start up snort/suricata. Just an idea.

I think you just paraphrased the same concept.

Lock file: usually created when the process is starting up. Usually removed when the process ends.

Special file: created after the process is up and running. Removed only after a successful manual shutdown command.

Yes, as you said slightly different variations of the same concept, it's just the file's creation that matters. As far as actually working, that's not for me to decide, since I've always hated programming for a simple reason: Computers do what I tell them, not what I meant.

bmeeks

Without getting down into all the technical details, I have tried a few variations of "lock files" and "special files".

Here is a brief history of how I got back into this problem. Due to popular demand, I was altering the Snort package so that each configured interface would show up in pfSense as a distinct service. So if you ran Snort (and Barnyard2) on both the WAN and LAN, then in the SERVICES applet you would see four services for Snort listed: one each for the WAN and LAN labeled as "snort_wan" and "snort_lan"; plus matching "barnyard2_wan" and "barnyard2_lan" entries. Since these would be showing up as individual services, they could be monitored by something like the Service Watchdog package.

I added the new code and everything was working great on my testing VMs. I recruited some of my private testers and let them try. Lo and behold, they started getting lots of duplicate processes. After trying a bunch of different approaches, nothing seemed to really work reliably. Since 50% of my testers had the problem of duplicate processes, I don't think this idea is ready for prime time.

Investigating with my testers we found that whether or not you got the duplicate processes on reboot was controlled to some degree by the order in which other packages were installed. If you installed Snort first and then after it the Service Watchdog package, things usually worked OK.

However, there was an annoyance from the Service Watchdog package because it runs every 60 seconds looking at the services. It immediately tries to restart any service that is down. So during the nightly rules update, it can attempt to restart Snort during the rules update as Snort is restarting itself. There is no facility within the Service Watchdog package to tell it to "stand down" for a while. At least not a way a package like Snort can use. So you get e-mail and log spam from Service Watchdog while Snort is restarting. If you have a minimal Snort setup with few active rules and an i7 3.3 GHz CPU, then you may get nothing from Service Watchdog because Snort can stop and restart quickly. However, if you have thousands of active rules and/or a weaker CPU like an Atom, then Snort takes a while to restart and Service Watchdog will spew restart commands every 60 seconds.

Service Watchdog is not the only issue. The init scripts of pfSense also issue multiple package start/restart commands during a reboot. You get the expected "package start" command as part of the initial boot up script, but then you get additional "package restart" commands from the WAN IP and NEW WAN IP scripts as the interfaces come up. So if you have IPv6 and IPv4 configured, for example, that's three package "start" commands within 7 seconds. One from the initial boot script, then one from NEW IPv4 WAN IP, and then another from NEW IPv6 WAN IP. These can be followed by more "package restart" commands for things like VPNs, etc. In my humble opinion, the boot up process should only issue a single package start command as the very last thing it does. That means after all the WAN IPs have been assigned and any VPNs have been started, etc.
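The "single start after everything settles" idea amounts to debouncing the start requests. The sketch below models it in Python purely as an illustration: the function name and the 10-second quiet period are assumptions, and nothing like this exists in pfSense as far as this thread shows. Each request resets a timer, and one real start fires only after the quiet period passes with no further requests.

```python
def coalesce_starts(request_times, quiet_period=10.0):
    """Collapse bursts of start requests into single deferred starts.

    request_times: seconds at which "start"/"restart" commands arrive.
    Returns the times at which a real start would run: quiet_period
    seconds after the last request of each burst.
    """
    runs = []
    last_request = None
    for t in sorted(request_times):
        if last_request is not None and t - last_request >= quiet_period:
            # The previous burst went quiet long enough to fire its start.
            runs.append(last_request + quiet_period)
        last_request = t  # every request restarts the quiet timer
    if last_request is not None:
        runs.append(last_request + quiet_period)
    return runs
```

With three requests at 0, 5 and 7 seconds (a made-up burst like the one described above), coalesce_starts([0, 5, 7]) yields a single start at 17 seconds instead of three overlapping ones. The trade-off is that every start is delayed by the quiet period.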

I have not given up, but for the moment I'm scratching my head looking for a viable solution that works in all cases.

Bill