Need a simple script to detect firewall 'hanging'…

ChirpyTurnip

Hi,

I have a PFSense firewall running on a PCEngines APU board. From time to time the FW will hang - it will run the webconfigurator and allow access via the LAN interface, but while it reports (correctly) that the WAN interface is up it will no longer route any traffic between the WAN/LAN in either direction. The fix is to reboot the firewall. There are no clues (obvious ones at least) that I can see in the system log so I'm unsure what causes this problem. Sometimes it happens daily…sometimes it is a month...but sooner or later it will hang. What I need is a simple script that will run every five minutes or so and ping an external host - if it responds it exits, if it fails it will bounce the WAN interface, if it fails twice it will reboot.

There was a script previously posted that would be ideal...but I can't get it to work. All I get is "command not found"....which is weird. I've done the usual (chmod 755, included ./ in the name to execute when connected via SSH). What am I missing?

The script in question looks like this:

#!/bin/sh

#=====================================================================
# pingtest.sh, v1.0.1
# Created 2009 by Bennett Lee
# Released to public domain
#
# (1) Attempts to ping several hosts to test connectivity.  After
#     first successful ping, script exits.
# (2) If all pings fail, resets interface and retries all pings.
# (3) If all pings fail again after reset, then reboots pfSense.
#
# History
# 1.0.1   Added delay to ensure interface resets (thx ktims).
# 1.0.0   Initial release.
#=====================================================================

#=====================================================================
# USER SETTINGS
#
# Set multiple ping targets separated by space.  Include numeric IPs
# (e.g., remote office, ISP gateway, etc.) for DNS issues which
# reboot will not correct.
ALLDEST="google.com yahoo.com 24.93.40.36 24.93.40.37"
# Interface to reset, usually your WAN
BOUNCE=em0
# Log file
LOGFILE=/root/pingtest.log
#=====================================================================

COUNT=1
while [ $COUNT -le 2 ]
do

	for DEST in $ALLDEST
	do
		#echo `date +%Y%m%d.%H%M%S` "Pinging $DEST" >> $LOGFILE
		ping -c1 $DEST >/dev/null 2>/dev/null
		if [ $? -eq 0 ]
		then
			#echo `date +%Y%m%d.%H%M%S` "Ping $DEST OK." >> $LOGFILE
			exit 0
		fi
	done

	if [ $COUNT -le 1 ]
	then
		echo `date +%Y%m%d.%H%M%S` "All pings failed. Resetting interface $BOUNCE." >> $LOGFILE
		/sbin/ifconfig $BOUNCE down
		# Give interface time to reset before bringing back up
		sleep 10
		/sbin/ifconfig $BOUNCE up
		# Give WAN time to establish connection
		sleep 60
	else
		echo `date +%Y%m%d.%H%M%S` "All pings failed twice. Rebooting..." >> $LOGFILE
		/sbin/shutdown -r now >> $LOGFILE
		exit 1
	fi

	COUNT=`expr $COUNT + 1`
done

Full credit to BennTech, who posted this many years ago at https://forum.pfsense.org/index.php?topic=17243.0

Help?

Cheers,
ChirpyTurnip

Derelict

This behavior is decidedly abnormal. You'd probably be better off finding out the cause of the problem instead of hacking around it like that.

Nullity

Try enabling verbose logging to help track down the cause for your freezes. I think there is some parameter that you can add to /boot/loader.conf to enable it.
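
The parameter is most likely FreeBSD's boot_verbose loader setting. A minimal sketch of adding it, assuming a stock FreeBSD loader and that custom settings go in /boot/loader.conf.local (pfSense rewrites /boot/loader.conf itself); add_loader_setting is a hypothetical helper name, and the extra kernel messages may or may not reveal anything about this particular problem:

```shell
# add_loader_setting FILE LINE
# Append LINE to FILE unless an identical line is already present.
add_loader_setting() {
    als_file=$1
    als_line=$2
    touch "$als_file" || return 1
    if ! grep -qxF "$als_line" "$als_file"; then
        printf '%s\n' "$als_line" >> "$als_file"
    fi
}

# Example (assumed path; takes effect on the next reboot):
# add_loader_setting /boot/loader.conf.local 'boot_verbose="YES"'
```

Running it twice leaves a single copy of the line, so it is safe to re-run.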

kejianshi

In the end, if you need to detect if your system is up or not, there are monitoring services that periodically ping IPs of your choosing and will report to you if they go down.

I use one for a chat server and vent server. Find what's good for you.

muswellhillbilly

I'm with Derelict. You should be trying to find out the cause of the problem rather than just set up a monitor to see when the firewall hangs. From the description you give, it could just as well be something to do with hardware or your physical connection as with the firewall software. The fact that it's so ad-hoc makes it seem like something caused by usage or maybe user activity.

Moving forward, you could install the NRPE package in the package list and set up a Nagios monitoring system to run regular checks of memory usage, filesystem usage, network connectivity, etc and have Nagios alert you when any of these monitored services go down. You might even find that Nagios might point you in the direction of a long-term solution to your issue, depending on the type of issue and what you're monitoring.

kejianshi

I found this interesting since it looks like something that can be easily made into a pfsense package…

https://play.google.com/store/apps/details?id=com.emoticode.monyt&hl=en

Seems pfsense would support it easily:

It's required that the machine has a web server with PHP support on it. In the case the machine mounts a particular operating system it might become needed that the "shell_exec" function is enabled in php.ini.
Monyt requires you to install a serverside script, download and install the following file and save it with .php extension in the web directory of the machine you're going to monitor before adding it to Monyt.

http://www.monyt.net/monyt-server-script.txt

Some Android devices have custom power saving modes that prevent monyt from notifying server downtimes.
If monyt doesn't alert properly, try to disable power saving.

charliem

@ChirpyTurnip:

There was a script previously posted that would be ideal…but I can't get it to work. All I get is "command not found"....which is weird. I've done the usual (chmod 755, included ./ in the name to execute when connected via SSH). What am I missing?

Other advice to find the root cause is good advice; but not really helping you learn about 'why won't the script run'. "Command not found" is pretty clear and should be preceded by whatever you typed:

[2.2-RELEASE][root@pfsense.localdomain]/root: ./non-existent-script
./non-existent-script: Command not found.

Does that match the file name shown by the directory listing? What directory did you upload the script to, /usr/local/bin as in the guide you linked? What directory are you in when you try to run the command? Are you appending the '.sh' extension? (You don't need the .sh extension on the filename but you do need to call the script correctly, either with it or without)

Either you are not in the right working directory or the script filename doesn't match what you are typing.

ChirpyTurnip

Hi,

I totally understand the need to track down the root cause….I rather suspect it is something the ISP is doing at their end as this wasn't a problem until I recently changed providers...though it could just be coincidence. The main problem though is that because it is so random and sporadic I can't easily test it in the same way as you can a problem that always happens at a given time, or after a set of steps is taken - it appears random. 99% of the time this happens when I'm not at home, and then I have to fix the problem remotely...which I can't do ...because I'm not home and the WAN interface accepts no inbound connections. So what I need, when this happens, is for PFSense to realise it has died and to restart itself.

Anyway....I have made some progress...though again I can't understand why. The original script (based on the source in the other post) was edited in notepad++ and ftp'd to /usr/local/bin. I then set the permissions to 755, and tried to run the script in a multitude of ways:

autodetectfail
autodetectfail.sh
./autodetectfail
./autodetectfail.sh
/user/local.bin/autodetectfail
/user/local.bin/autodetectfail.sh

Every single time I got command not found.

I then followed the procedure in the original post for uploading the script and used the Diagnostics\Edit File method to copy and paste the script to a file, and ran the chmod +x command to make it executable. Now all of a sudden it will run using ./autodetectfail2.sh. When I do an ls -l the autodetectfail and autodetectfail2 files have identical ownership, they're both 755, they're both the same size, they're both in the same directory. Identical in EVERY way - except one came in via FTP and refuses to run, whereas the other was locally created and works fine (or at least appears to execute - it's creating a log.).
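
One common cause of exactly this symptom (not confirmed in this case) is Windows line endings: a file saved in Notepad++ with CRLF and uploaded unchanged has a first line of "#!/bin/sh" followed by a carriage return, so the system looks for an interpreter that doesn't exist and reports "Command not found". A minimal sketch for checking and fixing that, where has_crlf and strip_crlf are hypothetical helper names:

```shell
# has_crlf FILE: succeed if any line of FILE ends in a carriage return.
has_crlf() {
    hc_cr=$(printf '\r')
    grep -q "${hc_cr}\$" "$1"
}

# strip_crlf FILE: remove every carriage return from FILE in place.
# Writing back with cat keeps the file's existing permissions.
strip_crlf() {
    sc_tmp="$1.tmp.$$"
    tr -d '\r' < "$1" > "$sc_tmp" && cat "$sc_tmp" > "$1" && rm -f "$sc_tmp"
}

# Example:
# if has_crlf /usr/local/bin/autodetectfail.sh; then
#     strip_crlf /usr/local/bin/autodetectfail.sh
# fi
```

If has_crlf reports nothing, the line endings are not the culprit and the difference between the two files lies elsewhere.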

Now I need to add it to cron and then see what happens.... In the meantime I will try to dial up the verbosity on the logging so I can see what happens when it dies....but at least I might recover automatically now.

kejianshi > Really cool app! I can use that to monitor my family's PFSense firewalls too if that was included. If we were going to make packages though I think that pingscript could be a super simple package too - ideally we take a three-pronged approach here:

1. Here's a tool for remote monitoring your stuff
2. Here is a way to temporarily increase logging verbosity (without manually hacking conf files)
3. Here's a way to recover the firewall if it hangs/freezes but is still sort of running

Monitoring using Nagios or some form of syslog is fine, except that home and SMB users are unlikely to use those...and if they do they are cloudbased...and therefore only good while PFSense is up...once it dies its ability to call for help is also lost.... :-(

Thoughts?

Cheers!
ChirpyTurnip.

Derelict

There's a difference between the WAN port/ISP going dead and "it will no longer route any traffic between the WAN/LAN in either direction" which is how you originally described it.

If, when it's "dead," you can run a packet capture on WAN and see outbound connection attempts going to the ISP with no response, then it's not pfSense "hanging," it's your ISP going screwy. Big difference. We can essentially eliminate your hardware and concentrate on what on WAN is screwing up.

doktornotor

@ChirpyTurnip:

autodetectfail
autodetectfail.sh
./autodetectfail
./autodetectfail.sh
/user/local.bin/autodetectfail
/user/local.bin/autodetectfail.sh

Absolutely none of those will work with cron. The first 4 are missing the path altogether and the remaining two get the path wrong. Should be /usr/local/bin and not /user/local.bin/. No wonder it does not run.

ChirpyTurnip

Yeah…a little clarity...

The incorrect paths were just a typo.../user/local.bin/ was SUPPOSED to be /usr/local/bin with a /....I just can't type and look! :-(

Anyway, the correct path is what I actually tried...and it didn't work. In terms of the first four lines I was running it from the command line, via SSH (not via cron), from the /usr/local/bin directory. I know not all the commands were valid, I was just exhausting all the possible options. When I created the second file (locally versus FTP) it would run as /usr/local/bin/autodetectfail2.sh - which is the expected result. It just refuses to work for the file I FTP'd in....

Derelict >
I think there is a misunderstanding here....PFSense believes that the WAN port is up. However, no traffic is processed, therefore it appears, for all intents and purposes, to be 'dead' when viewed from the WAN side. When I check the system logs there are no obvious "X stopped running" type messages, and when I check the FW logs there are no new traffic entries. All services (e.g. gateway, apinger etc) appear to be up. Webconfigurator is up and responds. What happens for me is that I get a text message at 10am to say the phones at home are down. I can't VPN in to see what the problem is, I can't remote to my PCs - nothing is online. When I get home at 6pm PFSense is sitting there doing nothing - but still capable of responding to LAN pings and webconfigurator requests. Hence for me the priority is to get PFSense to recognise failure and to reboot itself. When I then see this has happened (whenever it happens again) I can look at the logs for clues...but at least my system isn't down for the whole day (or a whole long weekend if that's when it happens). Either way, the cause is still completely unknown. Hardware is somewhat unlikely though I think on the grounds that everything else still appears to be fine....

Derelict

The real question is whether pfSense is sending the packets out WAN (or ignoring packets sent to it by the ISP). A packet capture will tell that tale. If everything looks normal on WAN, then pfSense is doing everything it's supposed to be doing and we're looking at simply kicking your ISP back into gear, not fixing pfSense. Just because a reboot fixes it doesn't mean something is wrong on the pfSense side of things.

ChirpyTurnip

Yes…that's true. Hopefully the more verbose logging will tell us more. I can only do a packet capture when it breaks...and I can't just capture all my traffic in the hope that something will happen. The downside of now recovering the firewall in the event of an issue is that the problem will self-remediate without my ever getting the opportunity to do a capture. Maybe I disable the cron task when I'm home so I'm around if it happens...

The main supporting piece of potential evidence for your theory is that there is nothing in the logs that would suggest why PFSense stops working. So it's a distinct possibility that it is something the ISP is doing....

Derelict

I have set up a switch mirror port going to a host running tcpdump in a circular fashion such that it always has a day or so of data to catch intermittent things like this.

ChirpyTurnip

..which is cool if you have a smart switch…but I can't do that from a completely unmanaged device. I can get PFsense to capture it, but then I'd need it to write it to an external USB drive in real-time...rather than caching it and then offering to display/export the results...but there is no easy (user friendly) way of doing this that I'm aware of... Best bet is to catch it in the act and then ask it to capture.... Sigh.

cmb

It sounds like the system isn't hanging at all, it's still network-reachable (from the LAN at least). So next time it happens, start a packet capture on interface WAN, with count 0 and all else at defaults. Then try to get out to something on the Internet, try a few different things within less than a minute or so to keep the capture size down, then stop the capture and see what you get.

Try to ping both by hostname, like "ping google.com", and just by IP, "ping 8.8.8.8", and see if one or the other is any different. You're trying to determine there whether or not DNS is functioning, and whether you have basic IP connectivity.
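
As a rough sketch, that two-step test could be scripted like this. check_connectivity is a hypothetical helper name, and the labels it prints are just for illustration:

```shell
# Allow the ping binary to be overridden; defaults to whatever is in PATH.
PING=${PING:-ping}

# check_connectivity HOSTNAME IP
# Prints one of:
#   ok                  ping by hostname worked (DNS and IP both fine)
#   dns-failure         hostname failed but the bare IP answered
#   no-ip-connectivity  neither answered
check_connectivity() {
    cc_host=$1
    cc_ip=$2
    if "$PING" -c 1 "$cc_host" >/dev/null 2>&1; then
        echo "ok"
    elif "$PING" -c 1 "$cc_ip" >/dev/null 2>&1; then
        echo "dns-failure"
    else
        echo "no-ip-connectivity"
    fi
}

# Example:
# check_connectivity google.com 8.8.8.8
```

A single lost ping can give a false result, so it is worth running a few times before drawing conclusions.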

Derelict

Well you changed ISPs and started having problems…

I don't know if people come here thinking they HAVE to make it look like a problem with pfSense or they won't get any help (like they experience everywhere else they go, particularly their ISP.) That couldn't be farther from the truth here.

So we always spend a couple pages basically having to prove it's not the firewall before we can get to the real issue.

Nothing personal to you, OP. I'm just venting.

doktornotor

@cmb:

It sounds like the system isn't hanging at all, it's still network-reachable (from the LAN at least).

Yeah indeed. If the system did hang, the script would be essentially useless anyway, you'd need a proper watchdog, not a shell script.

kejianshi

I do think if you have something important running, having something to monitor it and email you when it's down is good to have, but my experience is if the pfsense hangs you will need some sort of out of band access to the box to reboot it, or a person to call to reboot it. There is also a box you can attach to pfsense on the LAN to ping the WAN and if the pings stop the box will auto power cycle the pfsense, modem, switch and your kid's xbox…. whatever you plug into it...

Something like this...

http://3gstore.com/product/2062_ip_power_remote_switch.html

I've seen some as low as $60 USD with iPhone and Android apps to remote control them. Can be handy I'd imagine.

ChirpyTurnip

So it has been a week…and nothing has happened so far. The FW has been perfectly stable and there have been no unscheduled reboots except for the one that will have happened at 2am on Sunday morning. Anyway, simple question: I think we all agree that a packet capture would be a good thing to grab to see if there is any traffic trying to reach the Internet...is there any reason why I couldn't get my script to do that? Rather than going "Ping, fault found, reboot" why can't I say "Ping, fault found, packet capture for 2 minutes on WAN interface, save pcap file to log dir, reboot". That way when the event occurs again I will have a capture to look at after a failure that will allow us to (hopefully) narrow down the cause of the fault.

If this is possible, can someone provide the additional lines of code that need to be inserted?
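
A minimal sketch of what those extra lines might look like, assuming tcpdump lives at /usr/sbin/tcpdump (its usual FreeBSD location) and that /root is an acceptable place for the file. capture_before_reboot is a hypothetical helper name; it starts a capture in the background, waits the given number of seconds, stops it, and prints the file name:

```shell
# Allow the tcpdump binary to be overridden (assumed default path).
TCPDUMP=${TCPDUMP:-/usr/sbin/tcpdump}

# capture_before_reboot IFACE SECONDS DIR
# Capture traffic on IFACE for SECONDS into a timestamped pcap file
# in DIR, then print the path of that file.
capture_before_reboot() {
    cb_iface=$1
    cb_secs=$2
    cb_dir=$3
    cb_file="$cb_dir/wan-$(date +%Y%m%d.%H%M%S).pcap"
    "$TCPDUMP" -n -i "$cb_iface" -w "$cb_file" >/dev/null 2>&1 &
    cb_pid=$!
    sleep "$cb_secs"
    # Stop the capture and reap it so the file is closed before reboot.
    kill "$cb_pid" 2>/dev/null
    wait "$cb_pid" 2>/dev/null
    echo "$cb_file"
}

# In pingtest.sh, just before the /sbin/shutdown line, something like:
# echo "Capture saved to $(capture_before_reboot "$BOUNCE" 120 /root)" >> $LOGFILE
```

The capture only shows something useful if traffic is actually being attempted during those two minutes, so it may be worth repeating the pings while it runs.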