"Fatal trap 12: page fault while in kernel mode" (w/ screenshot)

jjj

Had a nice fatal error over the weekend. Any ideas what this is about?

jimp

Hard to say without more detail. That doesn't look familiar.

First, update to a current snapshot. If that alone doesn't help, then when you get the panic again, get a new picture and also type "bt" at that prompt, and get a picture of that output as well.

The backtrace (bt) is important to help track down what the code was doing at the time.

jjj

We're on 2.0 Release and it's still occurring.

Cry Havok

Probably hardware related - whether the hardware is faulty/borderline or just incompatible.

Start by running diagnostics on the hardware. I'd suggest a memory test (memtest86 et al.) as the first test to run.

jjj

Well, it started happening right when we updated to 2.0 (RC3). It had been running flawlessly up until then. Therefore, I really doubt it's related to faulty hardware. How do we verify hardware compatibility?

GoldServe

I found that if one of my links goes up and down so often that the gateway monitor removes the link while PF is processing a packet, it will crash. I fixed my link so it doesn't go down as often and everything is good again.

jjj

I don't understand how 1.2.3 was rock solid and never, ever crashed, even once, then we deploy 2.0 and it crashes every other day.

jjj

This happens only during off-hours. Error screenshot

We just switched to new hardware…guess we'll wait and see. :(

jimp

If you run your NICs out of resources, then you could hit a panic with the driver…

Try some of the tweaks here:
http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

wallabybob

Your error screenshot shows:

sf2: watchdog timeout, 240 Tx descriptors are active
sf2: watchdog timeout, 248 Tx descriptors are active
sf0: Tx underrun – increasing Tx threshold to 384 bytes

The FreeBSD man page for sf (http://www.freebsd.org/cgi/man.cgi?query=sf&apropos=0&sektion=0&manpath=FreeBSD+8.2-RELEASE&arch=default&format=html) says

sf%d: watchdog timeout The device has stopped responding to the network,
or there is a problem with the network connection (cable).

Do the watchdog timeout messages consistently appear before the panic? Always the same interface (sf2)?

Do the Tx underrun messages consistently appear before the panic? Always the same interface (sf0)?

Perhaps there is some sf driver related error condition that the software doesn't handle correctly, ultimately resulting in a panic.

Did you notice these messages when running pfSense 1.2.3?
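If you can capture the console or dmesg text around each panic, a short script makes the "always the same interface?" question easier to answer. This is only a rough sketch: the function name `tally_nic_errors` and the regular expressions are my own, modelled on the three lines quoted from the screenshot, and other drivers may word their messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted from the screenshot.
# Assumption: interface names look like "sf2" (letters followed by digits).
PATTERNS = {
    "watchdog timeout": re.compile(r"\b([a-z]+\d+): watchdog timeout"),
    "Tx underrun": re.compile(r"\b([a-z]+\d+): Tx underrun"),
}


def tally_nic_errors(log_text):
    """Count (interface, message kind) pairs found in console or dmesg text."""
    counts = Counter()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), kind)] += 1
    return counts


# The three lines from the screenshot, used here as sample input.
sample = """\
sf2: watchdog timeout, 240 Tx descriptors are active
sf2: watchdog timeout, 248 Tx descriptors are active
sf0: Tx underrun – increasing Tx threshold to 384 bytes
"""

for (iface, kind), n in sorted(tally_nic_errors(sample).items()):
    print(f"{iface}: {kind} x{n}")
```

Running it over the text captured before several panics would show whether sf2 and sf0 keep turning up or whether the errors move between interfaces.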

@jjj:

I don't understand how 1.2.3 was rock solid and never, ever crashed, even once, then we deploy 2.0 and it crashes every other day.

Operating system upgrades often include performance enhancements. Those enhancements often drive some part of a system harder than it was driven before, and sometimes other parts of the system noticeably suffer. For illustration, suppose FreeBSD changed to double the maximum size of a disk transfer. That MIGHT impact sf devices.

It's fundamental to the way Ethernet works that once a NIC starts transmitting a frame it can't pause the transmission mid-frame. If the transmission does pause (transmit underrun), the whole frame must be retransmitted. For cost reasons, older NICs had a small transmit buffer which was refilled from main memory during transmission as required. For performance reasons, newer NICs include a buffer large enough to guarantee transmission of at least a maximum standard-sized frame without pauses.

That the sf driver reports transmit underruns suggests its transmit component buffers less than a whole frame, and that the PCI bus gets busy for long enough that an sf device can't refill its transmit buffer in time to avoid mid-frame pauses. An increase in maximum disk transfer size might increase the likelihood of the disk (or the disk plus other very active NICs) starving the sf device of PCI transfers long enough to cause a transmit underrun. Poor handling of transmit underrun (due to a rare combination of circumstances) might result in a panic somewhat later.
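To put a rough number on how tight the timing is, here is a back-of-the-envelope sketch. It assumes the sf ports run at 100 Mb/s (an assumption on my part, not something the screenshot shows) and treats the Tx threshold from the log (384 bytes) as the amount of data buffered when transmission starts. The function name `drain_time_us` is just for illustration.

```python
# Assumption: the link runs at Fast Ethernet speed (100 Mb/s).
LINK_BITS_PER_SECOND = 100_000_000


def drain_time_us(buffered_bytes, link_bps=LINK_BITS_PER_SECOND):
    """Microseconds the wire needs to send `buffered_bytes` at line rate."""
    return buffered_bytes * 8 / link_bps * 1_000_000


# 384 is the threshold reported in the log; 1518 is a maximum
# standard-sized Ethernet frame; 128 and 256 are arbitrary smaller values.
for threshold in (128, 256, 384, 1518):
    print(f"{threshold:>5} bytes buffered: wire drains them in "
          f"{drain_time_us(threshold):7.2f} us")
```

Under those assumptions 384 buffered bytes are gone in about 30 microseconds, so a stall on the PCI bus of roughly that length while the rest of the frame is still in main memory would be enough to cause an underrun. A card that buffers a whole 1518-byte frame before sending has nothing left to fetch once transmission begins.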

jjj

Thanks for the replies. So far no crash on the new hardware (Dell AMD x64), but we'll see after we get through the weekend.

I did notice however TX underrun issues are happening on the new hardware. The interesting part is we're still using an Intel quad-port NIC (dc0-3), but it's slightly different than before. Also, we're using a 3com NIC (xl0), as the onboard Broadcom NIC wouldn't pass traffic at all.

Here's a screenshot of the errors the current firewall is getting.

@wallabybob: It appears that the watchdog and Tx underrun messages do show up before the panic. We'll keep an eye on that now. If that is indeed the case, then our NICs might need to be replaced. Notice in the screenshot that both the Intel and 3com are having errors. I'm guessing it's because they're both the same age (i.e. old).

@jimp: does that link apply since our NICs are dcX?

jimp

The old dc cards were notoriously crappy, even when new. They were DEC chips, and iirc only rebadged as Intel; they aren't really "proper" Intel cards. The good ones use the fxp/em/igb drivers. The xl cards have been OK, but are showing their age.

I'm not sure I'd trust anything that old to a decent workload.

wallabybob

@jimp:

The old dc cards were notoriously crappy, even when new. They were DEC chips, and iirc only rebadged as Intel,

For a while Intel sold them as Intel parts after they took over the DEC chip business in the 1990s.

@jjj:

So far no crash on the new hardware (Dell AMD x64), but we'll see after we get through the weekend.

I did notice however TX underrun issues are happening on the new hardware. The interesting part is we're still using an Intel quad-port NIC (dc0-3), but it's slightly different than before. Also, we're using a 3com NIC (xl0), as the onboard Broadcom NIC wouldn't pass traffic at all.

So you have five active NICs. Are they mostly pretty busy? If the box you are running this in has another PCI bus (unlikely if it's a desktop, possible if it's a server), you might get fewer Tx underruns if you move one of the cards to the other PCI bus and put the heaviest load on the xl0 interface (to try to balance somewhat the traffic on xl0 and the traffic on the dcX interfaces). Alternatively, if the box has a PCIe slot, you might reduce the Tx underruns by purchasing a PCIe NIC and moving the heaviest traffic to it. My suspicion is that you might be trying to pass enough traffic to saturate the PCI bus at times.
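The saturation suspicion can be sanity-checked with some simple arithmetic. The sketch below assumes a conventional 32-bit, 33 MHz PCI bus shared by all the cards and 100 Mb/s full-duplex links on every port. Neither has been confirmed for this box, and a server board may well have a wider or faster bus. The function name `bus_load_fraction` is hypothetical.

```python
# Assumption: one shared 32-bit PCI bus at 33.33 MHz (4 bytes per clock),
# giving roughly 133 MB/s theoretical peak.
PCI_PEAK_MB_S = 33.33e6 * 4 / 1e6


def bus_load_fraction(nic_count, link_mbit=100, full_duplex=True,
                      pci_peak_mb_s=PCI_PEAK_MB_S):
    """Worst-case NIC traffic as a fraction of theoretical PCI bandwidth."""
    per_direction_mb_s = link_mbit / 8       # 100 Mb/s is 12.5 MB/s
    directions = 2 if full_duplex else 1     # receive plus transmit
    return nic_count * directions * per_direction_mb_s / pci_peak_mb_s


for nics in (1, 3, 5):
    print(f"{nics} NIC(s) at line rate: "
          f"{bus_load_fraction(nics):.0%} of theoretical PCI peak")
```

With five ports all running flat out in both directions the demand comes to about 125 MB/s, close to the theoretical peak, and a real PCI bus delivers less than its theoretical figure once arbitration and other devices such as the disk controller are counted. Actual traffic is probably well below line rate most of the time, so this only shows that bursts could plausibly crowd the bus.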

jjj

All of the NICs are on PCI slots. I just bought the 4-port NIC. It'll plug into the 16x slot on the PC.