PfSense Instability Help

mattlach

Hello all,

I would appreciate some help in trying to solve a pfSense instability issue I have been having.

Every now and then pfSense seems to stop working properly. The time intervals seem random (though hastened by heavy use, like torrents).

The specs are as follows:

2.0.1-RELEASE (amd64) running under ESXi 5 (vSphere Hypervisor) on the following hardware:
AMD E-350 (1.6 GHz dual core)
8 GB RAM (4 GB provisioned for pfSense)
Intel EXPI9402PT dual gigabit NIC (via ESXi virtual switches, as this system is not IOMMU compatible)

I don't think it's a hardware problem, as when pfSense stops working properly ESXi and my other server on the ESXi box keep running normally.

When pfSense stops working properly, all clients lose external internet access. Clients that are already on and active keep their IP addresses, but new clients do not obtain an IP from the DHCP server; instead they make up their own and fail to reach even internal clients.

When this happens, I can force my desktop to use a static IP, and then can sometimes reach the pfSense web interface, sometimes not. I go to the console and hit "5" to reboot the system, which doesn't appear to work, and then finally force a reboot from inside ESXi. When it comes back up again, everything works as normal, but it seems to pull a new external IP.

It takes 1-2 days for the same issue to recur.

I logged onto the console to see if I could track down what happened in the logs, but I am not sure which log I should be looking in.

/var/log/system.log only appears to have information from the most recent boot, so it is not helpful.

I'd appreciate any assistance in figuring this one out, including what log information to pull to look at it.

If it is helpful, I will post my network diagram shortly (as soon as I am done drawing it).

Thanks,
Matt

mattlach

Here is the network diagram.

Appreciate any help/suggestions.

wallabybob

@mattlach:

Every now and then pfSense seems to stop working properly.

When this happens does the pfSense console still respond to shell commands?

Sometimes a system can get stuck looping in the kernel due to resource exhaustion. The following shell script could be run soon after startup to monitor kernel network resource usage:

```
$ more t.sh
while true
do
date
netstat -m
sleep "$1"
done
$
```

where the first parameter gives the interval (in seconds) between the time-stamped reports, for example:

```
$ sh -x ./t.sh 3600
```
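
A minimal sketch of the same loop that appends each report to a file, so the last samples taken before a hang can be read back after the reboot. The file name `/root/mbuf.log`, the optional sample count, the `monitor` function name and the `MONITOR_CMD` override are all assumptions made for illustration; none of them is part of the script above.

```sh
#!/bin/sh
# monitor.sh (hypothetical name): append timestamped reports to a log file.
# Usage: sh monitor.sh INTERVAL_SECONDS [LOGFILE] [COUNT]
#   COUNT is optional; 0 (the default) means run until interrupted.

monitor() {
    interval=$1
    logfile=${2:-/root/mbuf.log}    # assumed path; any disk-backed location will do
    count=${3:-0}
    cmd=${MONITOR_CMD:-netstat -m}  # command to sample; netstat -m as in the script above

    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # Write the timestamp and the report together, including any error output
        {
            date
            $cmd
        } >> "$logfile" 2>&1
        i=$((i + 1))
        # Skip the sleep after the final sample of a bounded run
        if [ "$count" -ne 0 ] && [ "$i" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}

# Start monitoring only when an interval is given on the command line
if [ $# -ge 1 ]; then
    monitor "$@"
fi
```

For example, `sh monitor.sh 3600 &` would take one sample per hour in the background and keep appending to the assumed default log file.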

mattlach

@wallabybob:

@mattlach:

Every now and then pfSense seems to stop working properly.

When this happens does the pfSense console still respond to shell commands?

Sometimes a system can get stuck looping in the kernel due to resource exhaustion. The following shell script could be run soon after startup to monitor kernel network resource usage:

```
$ more t.sh
while true
do
date
netstat -m
sleep "$1"
done
$
```

where the first parameter gives the interval (in seconds) between the time-stamped reports, for example:

```
$ sh -x ./t.sh 3600
```

Thank you, I will do this and give it a shot.

I am a beginner at BSD though, so please bear with me here.

How can I install an editor to perform this task?

It looks like gcc is installed with pfSense. Should I compile one manually? I have heard of the ports package manager, but I have no clue how to use it. (I am familiar with Gentoo Linux's Portage, as well as Red Hat's RPM and Debian's APT.)

How can I get an editor like vi, nano or emacs on the system so I can edit and save the script?

Thanks for your help,
Matt

wallabybob

@mattlach:

How can I install an editor to perform this task?

No need, see later.

@mattlach:

It looks like gcc is installed with pfSense. Should I compile one manually? I have heard of the ports package manager, but I have no clue how to use it. (I am familiar with Gentoo Linux's Portage, as well as Red Hat's RPM and Debian's APT.)

Compile one what? A shell? sh is already installed.

@mattlach:

How can I get an editor like vi, nano or emacs on the system so I can edit and save the script?

vi and ee are installed as part of the base install.

You don't need to do anything to the base system (except create the shell script) to run the script I provided.

bkamen

is your system completely locked hard? (i.e. you hit the HW reset button or power-cycled?)

I recently installed a new setup using a SuperMicro X7SPE Atom MB – and have had similar issues.

I think I may have solved it - but don't have enough up-time to say yet.

-Ben

mattlach

@wallabybob:

vi and ee are installed as part of the base install.

Ahh, I see that they are. My mistake.

I was trying to launch "vim" instead of "vi", as I am used to that being installed on my Linux systems.

Thanks for the help. I will be running your netstat script.

Can you give me an idea of what I might be looking for in the netstat -m output?

Thanks,
Matt

mattlach

@bkamen:

is your system completely locked hard? (i.e. you hit the HW reset button or power-cycled?)

I recently installed a new setup using a SuperMicro X7SPE Atom MB – and have had similar issues.

I think I may have solved it - but don't have enough up-time to say yet.

-Ben

Mine does not lock hard.

I am running it in a VM under VMware ESXi. The rest of the system (and the other VMs) remain up and stable.

The pfSense VM remains somewhat accessible (web interface sometimes, console always) but it is slow and unresponsive after the issue occurs. I typically try to go to console and restart it using command "5", but after waiting what seems like forever without a reboot occurring, I usually just get tired of waiting and force a reset from within ESXi.

What was your issue? How did you solve it?

wallabybob

@mattlach:

Can you give me an idea of what I might be looking for in the netstat -m output?

Look for allocation failures (non-zero "denied" counts), a trend of increasing resource use, and any current or total figure "near" its maximum. The maximum figures need to be big enough for worst-case use. The total figures may rise for a time and should then level off, but they will rise again if you encounter bigger peak loads.
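
As a rough illustration of those checks, the sketch below reads one `netstat -m` report on standard input and flags denied requests and any pool whose total is close to its maximum. The function name `check_mbufs`, the `THRESHOLD` variable and its default of 80 percent are arbitrary assumptions, and the patterns assume the output layout that FreeBSD prints for the `(current/cache/total/max)` and `denied` lines.

```sh
#!/bin/sh
# check_mbufs: read `netstat -m` output on stdin and flag possible trouble.
# THRESHOLD (percent of max) is an assumed cut-off, not an official figure.
check_mbufs() {
    awk -v limit="${THRESHOLD:-80}" '
        # Lines such as "513/5541/6054/25600 mbuf clusters in use (current/cache/total/max)"
        /in use \(current\/cache\/total\/max\)/ {
            split($1, n, "/")
            if (n[4] + 0 > 0) {
                pct = 100 * n[3] / n[4]
                name = $0
                sub(/^[^ ]+ /, "", name)   # drop the leading numbers
                sub(/ in use.*/, "", name) # keep only the pool name
                printf "%s: total %d of max %d (%.1f%%)%s\n", name, n[3], n[4], pct, (pct >= limit ? " NEAR MAX" : "")
            }
        }
        # Lines such as "0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)"
        /requests for .* denied/ {
            split($1, d, "/")
            for (i in d) if (d[i] + 0 > 0) denied = 1
        }
        END { print (denied ? "allocation failures: YES" : "allocation failures: none") }
    '
}
```

On the firewall it would be used as `netstat -m | check_mbufs`. It only looks at a single snapshot, so it does not replace watching the figures over time.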

heper

i've experienced similar issues with esxi, even to the point where all VM's on the same box stop working (i've had this on multiple systems running esxi4.1 & pfsense)

thus far i've been unable to solve it but the problem does not repeat frequently (sometimes +150d uptime without issues).

esxi forum posts indicate that this could be related to a couple of types of raid controller … (check esxi logs for warnings)

mattlach

@heper:

i've experienced similar issues with esxi, even to the point where all VM's on the same box stop working (i've had this on multiple systems running esxi4.1 & pfsense)

thus far i've been unable to solve it but the problem does not repeat frequently (sometimes +150d uptime without issues).

esxi forum posts indicate that this could be related to a couple of types of raid controller … (check esxi logs for warnings)

Thanks for the suggestion.

In my case I know it's not due to any RAID controllers, as I am not using RAID. I appreciate the input though.

jimp

Probably worth applying the em tweaks from here:
http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

mattlach

@jimp:

Probably worth applying the em tweaks from here:
http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

Thank you, I'll have to take a look at this.

I don't think the mbufs are the issue for me.

The pfSense guest became unresponsive again on Friday (I am just getting around to posting now), and the following was the last entry in my scripted log file:


Fri Mar 30 16:16:44 EDT 2012
514/5758/6272 mbufs in use (current/cache/total)
513/5541/6054/25600 mbuf clusters in use (current/cache/total/max)
512/5376 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

This seems to suggest that mbufs are not my issue.
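
To look at the trend across the whole log rather than only the last entry, something like the following sketch could condense each report to one line. It assumes the log consists of a `date` line followed by `netstat -m` output, as in the entry above; the function name `mbuf_trend` and the log file name in the usage note are hypothetical.

```sh
#!/bin/sh
# mbuf_trend: condense a log of repeated `date` + `netstat -m` reports into
# one line per sample: timestamp, cluster figures, denied counts.
mbuf_trend() {
    awk '
        # A date line such as "Fri Mar 30 16:16:44 EDT 2012" starts a new sample
        /^[A-Z][a-z][a-z] [A-Z][a-z][a-z] / { stamp = $0; next }
        # current/cache/total/max figures for mbuf clusters
        / mbuf clusters in use / { clusters = $1; next }
        # The denied line closes the part of the report we care about
        / requests for mbufs denied / {
            print stamp " | clusters " clusters " | denied " $1
        }
    ' "$@"
}
```

Running `mbuf_trend mbuf.log` then gives one line per sample, which makes a steadily rising total or a first non-zero denied count easier to spot.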