Snif: pfSense randomly hangs, how to diagnose please (peep)?

Mr. Jingles

G'day all :-[

Over the last couple of days, my first pfSense box (in my sig) has been hanging spontaneously. Nothing works anymore, not even the keyboard on the console. I have to hard power off and power on again, after which it runs the file system checks and starts up normally.

The last change I made was uninstalling Suricata some days ago, since this conflicted with Snort.

In the GUI, under Status/System logs, there is nothing to be seen (since this is the log since reboot), except that I suppose it happened after this:


Aug 5 12:47:51 syslogd: kernel boot file is /boot/kernel/kernel
Aug 5 12:30:35 php: rc.update_urltables: /etc/rc.update_urltables

As you can see, the update ran at 12.30, and at 12.47 we see the system booting up again after my hard reset and the file system checks.
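For anyone wanting to pull that gap out of a longer log, here is a rough Python sketch. It assumes the lines are listed newest first (as in the snippet above) and that a boot is marked by the "kernel boot file is" message. The year is a placeholder, since syslog lines of this form do not carry one, and the function names are made up for illustration.

```python
from datetime import datetime

BOOT_MARKER = "kernel boot file is"  # text of the syslogd line seen at boot

def parse_ts(line, year=2000):
    # Lines start with "Mon D HH:MM:SS"; the year is not logged,
    # so the default here is only a placeholder.
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

def gap_before_boot(lines):
    """For newest-first lines, return (last line before the boot, time gap)."""
    for i, line in enumerate(lines):
        if BOOT_MARKER in line and i + 1 < len(lines):
            previous = lines[i + 1]
            return previous, parse_ts(line) - parse_ts(previous)
    return None

log = [
    "Aug 5 12:47:51 syslogd: kernel boot file is /boot/kernel/kernel",
    "Aug 5 12:30:35 php: rc.update_urltables: /etc/rc.update_urltables",
]
last_line, gap = gap_before_boot(log)
print(last_line)
print(gap)  # 0:17:16
```

The gap only says when the last message was written, not when the hang happened; the box may have run silently for most of those 17 minutes.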

How would one diagnose this further? In which logs to look for what problem happened?

Thank you in advance for any help :-*

Bye,

EDIT: I just thought: these urltables are updated more than once a day (BB's script for updating the IR* tables also starts the updater hourly), so most of the time the update does not cause a crash, since the crash frequency is about once every 1.5-2 days.
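To put a number on that, a back-of-the-envelope sketch in Python. It assumes the updater really fires once per hour and takes the 1.5-2 day crash interval at face value; both figures are rough.

```python
RUNS_PER_DAY = 24  # assumption: the updater fires hourly

def runs_between_crashes(days_between_crashes, runs_per_day=RUNS_PER_DAY):
    # Number of updater runs that fit into one crash interval
    return days_between_crashes * runs_per_day

low = runs_between_crashes(1.5)
high = runs_between_crashes(2)
print(f"{low:.0f} to {high:.0f} updater runs per crash")
```

So under these assumptions, dozens of updater runs complete fine for every one that is followed by a hang.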

charliem

In my experience, this is almost always hardware. Trouble is, there's not a lot of diagnosing to be done except systematic replacement; I'd start with a new power supply. (I've seen this exact symptom caused by a bad power supply twice: first time took a few months of intermittent failures to figure it out, second time was a day :) )

You can try running memtest and/or a cpu stressing utility (to look for temperature issues). But most times with memory issues, you would still see an oops or something in the log.

Mr. Jingles

Thanks Charlie ;D

I memtested it only a couple of days ago, it was fine. The PSU for this mobo is very expensive, I'd rather not buy a new one if not necessary :-[

CPU stress testing is something I could try, but then I have to put the machine offline again, I was hoping to rule out other causes first. Isn't the great FreeBSD keeping logs of everything everywhere all the time? Even Windows does a lot of logging, so I would be expecting FreeBSD to do the same(?)

charliem

It sounds like you have a monitor and keyboard hooked up; I take it no console error messages appeared?

Yes, of course there are logs ('cd /var/log; clog system.log | less' for example), but if the CPU is halted by safety circuits on the chip due to Vdd being out of range or over temperature (just for example), such logs never make it to the disk. In any case, if the CPU stops, there's no notice given to the OS.

After the system hangs, is there any response to the NumLock key? Any response to pings? I know Linux will flash the keyboard LEDs in Morse code to signal a kernel fault, but I don't think FreeBSD does so.

Sorry I have no other suggestions. I had the same reluctance when I ran into this the first time, hence the two-month time frame to fix it …
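The ping check mentioned above can be left running on another machine on the LAN, so the moment the box stops answering gets recorded. A minimal sketch, assuming a Unix-style `ping` that accepts `-c 1`; the address and the function names are hypothetical, and the demo at the bottom uses a fake probe so it does not touch the network.

```python
import subprocess
import time
from datetime import datetime

def ping_once(host):
    # One ICMP echo via the system ping; "-c 1" is the Unix-style count flag
    result = subprocess.run(["ping", "-c", "1", host],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def watch(host, probe=ping_once, interval=10, max_checks=None):
    """Poll host; return the time of the first failed probe, or None."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe(host):
            return datetime.now()
        checks += 1
        time.sleep(interval)
    return None

# Demo with a fake probe: two replies, then silence
replies = iter([True, True, False])
stopped = watch("192.168.1.1", probe=lambda host: next(replies), interval=0)
print("no reply since", stopped)
```

For real use, call `watch()` with the firewall's actual address and the default probe. A single lost ping is not proof of a hang, so treat the timestamp as a hint for where to look in the logs.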

jasonlitka

@Hollander:

I memtested it only a couple of days ago, it was fine. The PSU for this mobo is very expensive, I'd rather not buy a new one if not necessary :-[

Running MemTest86+ (or similar) is only useful if you commit to it for 48-72 hours. I've had machines that would run overnight without issue but would crumble if left to run over a weekend.

The same goes for power supplies. If your PSU is expensive because it's a high-wattage "gaming" PSU, well, I've had more of those fail than a standard Enermax or Antec.

Mr. Jingles

Thank you to both of you ;D

The PSU is not expensive because it is a high-wattage gaming PSU, au contraire. The problem is it is some special sort of adapter required for this Intel Mobo. And as soon as the word 'special' comes up, it appears to be expensive.

My guess, but I am not sure, is that there are some package conflicts somewhere. I uninstalled Squid, Squidguard, Lightsquid (squid wasn't behaving with Snort anyway), and some other packages I already forgot, and in 24 hours no crash.

I did manage to fetch one crash log that pfSense presented to me in the GUI when I logged in. Of course, being the eternal noob, I have no clue how to interpret this. Would any of you perhaps be able to see anything in there that might give a clue?

Thank you for your help very much ;D

EDIT: sorry, I don't know where I saved the log :-[ ( >:( )

I will have to wait for the next crash. The only thing I actually wrote down is this:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x420
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xf8023be83
stack pointer = 0x28:0xff80000fd5e0
frame pointer = 0x28:0xff80000fd610
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (em0 que)

But I am not sure if that is sufficient information(?)
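For what it is worth, a few fields of that dump can be read mechanically. The sketch below is a Python helper (names are made up) that splits the "key = value" lines and checks whether the fault address sits in the first memory page, which usually points at a structure member being read through a NULL pointer. The 4096-byte page size is an assumption for amd64.

```python
TRAP = """\
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x420
fault code = supervisor read data, page not present
current process = 0 (em0 que)
"""

PAGE_SIZE = 4096  # assumption: 4 KiB pages

def parse_trap(text):
    # Collect "key = value" lines; lines without " = " are skipped
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields

fields = parse_trap(TRAP)
fault_addr = int(fields["fault virtual address"], 16)
near_null = fault_addr < PAGE_SIZE
print(fields["current process"])  # 0 (em0 que)
print(near_null)                  # True
```

The "em0 que" name suggests the fault was hit in a kernel thread servicing the em0 network interface. That narrows down where it happened, not why, so it does not settle the hardware versus software question on its own.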

Mr. Jingles

@charliem:

It sounds like you have a monitor and keyboard hooked up; I take it no console error messages appeared?

Yes, of course there are logs ('cd /var/log; clog system.log | less' for example), but if the CPU is halted by safety circuits on the chip due to Vdd being out of range or over temperature (just for example), such logs never make it to the disk. In any case, if the CPU stops, there's no notice given to the OS.

After the system hangs, is there any response to the NumLock key? Any response to pings? I know Linux will flash the keyboard LEDs in Morse code to signal a kernel fault, but I don't think FreeBSD does so.

Sorry I have no other suggestions. I had the same reluctance when I ran into this the first time, hence the two-month time frame to fix it …

I forgot to answer this, sorry :(

No, it does nothing, and doesn't respond to anything at all.

I will go into the cli and look for the logs. Are there perhaps particular key words I would need to grep for?
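As a starting point for the keyword question, a small Python filter is sketched below. The keyword list is only a guess at what tends to show up around kernel trouble, not an official or complete set, and the function name is made up.

```python
import re

# Assumed keywords; extend or trim as needed
KEYWORDS = ["panic", "Fatal trap", "page fault", "watchdog timeout", "machine check"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def suspicious(lines):
    # Keep only lines that mention one of the keywords (case-insensitive)
    return [line for line in lines if PATTERN.search(line)]

sample = [
    "Aug 5 12:47:51 syslogd: kernel boot file is /boot/kernel/kernel",
    "Fatal trap 12: page fault while in kernel mode",
    "Aug 5 12:30:35 php: rc.update_urltables: /etc/rc.update_urltables",
]
for line in suspicious(sample):
    print(line)
```

On the box itself, the input lines would come from the output of the clog command quoted above rather than from a hard-coded sample.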

BBcan177

Google "Fatal trap 12: page fault while in kernel mode" and there are lots of people with that error. What kind of machine is it? Are you virtualizing this machine?

Mr. Jingles

@BBcan177:

Google "Fatal trap 12: page fault while in kernel mode" and there are lots of people with that error. What kind of machine is it? Are you virtualizing this machine?

'Tis the first machine in my sig, BB; not virtualized ;D

I don't think it was hardware; I uninstalled these packages mentioned before, and so far no hangs anymore. I'll see what happens next.