Fatal trap 12: page fault while in kernel mode

tlum

I've started getting a periodic crash, about once a week, though it varies. This box has been quite stable for years, but started this behavior after an update this past summer, though correlation does not equal causation. It's hard to peg the exact date and version since it happens so infrequently. From what I can see it looks like it's happening during packet inspection in pf.

This seems to be the same as an issue posted in "2.0-RC Snapshot Feedback and Problems" https://forum.pfsense.org/index.php?topic=21743.40;wap2 That was four years ago and it's not clear whatever became of it.

So today I became aggravated enough to drop everything I'm doing and concentrate on ending this forever. Unfortunately, I don't know of a way to reproduce it on demand, but I suspect that it could be traffic related based on what circumstantial evidence I do have. And yes, this probably is a FreeBSD issue, however I would counter that pfSense is distro based and chooses the OS distro that it's packaged with and tested against, so I would think it's in our mutual best interest to understand and resolve it.

Although I have not come across any recent complaints, can anyone verify this as a current problem? Are the pfSense developers aware of this or related issues? Are there any suggestions for capturing additional information on this? -TIA-

FreeBSD 8.3-RELEASE-p16 #0: Mon Aug 25 08:25:41 EDT 2014
    root@pf2_1_1_i386.pfsense.org:/usr/obj.i386/usr/pfSensesrc/src/sys/pfSense_SMP.8

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100055 td 0xc702db80
rn_match(c1520d4c,c9671300,ed797904,c7c4d200,ed79785c,...) at rn_match+0x11
pfr_match_addr(c94689b0,c80a581a,2,16,ed797844,...) at pfr_match_addr+0xe0
pf_test_tcp(ed797920,ed79791c,1,c7c4d200,c8107900,...) at pf_test_tcp+0xb05
pf_test(1,c70d4400,ed797aec,0,0,...) at pf_test+0x2596
pf_check_in(0,ed797aec,c70d4400,1,0,...) at pf_check_in+0x46
pfil_run_hooks(c156e620,ed797b3c,c70d4400,1,0,...) at pfil_run_hooks+0x93
ip_input(c8107900,c8107900,10,c0ac8dc9,c1569a10,...) at ip_input+0x35a
netisr_dispatch_src(1,0,c8107900,ed797bac,c0b6838f,...) at netisr_dispatch_src+0x71
netisr_dispatch(1,c8107900,5,c70d4400) at netisr_dispatch+0x20
ether_demux(c70d4400,c8107900,3,0,3,...) at ether_demux+0x19f
ether_input(c70d4400,c8107900,c7c4d804,c7acc800) at ether_input+0x174
ether_demux(c7acc800,c8107900,3,0,3,...) at ether_demux+0x65
ether_input(c7031400,c8107900,c155b180,ed797c3c,c6d94000,...) at ether_input+0x174
em_rxeof(0,0,c70143c0,c702a880,ed797cc0,...) at em_rxeof+0x206
em_msix_rx(c7026300,c702db80,0,109,98bc0483,...) at em_msix_rx+0x3f
intr_event_execute_handlers(c6d92560,c702a880,c0f955af,529,c702a8f0,...) at intr_event_execute_handlers+0xd4
ithread_loop(c7001b20,ed797d28,2a90d8a7,0,c7001b20,...) at ithread_loop+0x66
fork_exit(c0a7a4e0,c7001b20,ed797d28) at fork_exit+0x87
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xed797d60, ebp = 0 ---

cmb

are you using schedules on firewall rules?

tlum

@cmb:

are you using schedules on firewall rules?

Nope, ZERO schedules. No traffic shaping, or anything else dynamic either.

The configuration is not simplistic though. Two NICs participate in a LAG, which presents 8 VLANs, two of which are WANs with IP Aliases (two /29 blocks) in addition to the primary. And OpenVPN counts as a ninth interface. This configuration has been stable since at least 2008.

tlum

Well, I disabled textdump in favor of a conventional minidump. I guess I'll wait until it happens again and see if I end up with more useful artifacts next time.

tlum

Alright, so I finally got a dump on 2/5, and then another on 2/11. So, is there a debug build of the 8.3 kernel around that the pfSense developers use, or am I going to have to go build my own?

cmb

Is lagg, VLANs, two WANs + VIPs, and OpenVPN all you're running on it? All those things are fine on 2.2, and that panic is almost certainly fixed in 10.1. It's really not worth the effort to track down unless it happens on 10.1.

tlum

Yes! I have run pfSense for years, first on an IBM x335, and now on a SuperMicro SYS-5015A-EHF-D525 since 5/30/2012. It is dedicated to firewall and routing. Its peer is a Cisco Catalyst 2690 switch. I run the reverse proxy and IDS behind it; I'd rather keep the firewall box native, simple, and clean. It is the network time server (NTP). It logs to a central network syslog server. It is the network DHCP server and manages static as well as dynamic pools. It does not get involved in DNS. I'd prefer not to even run OpenVPN on it, but there is a higher risk of not being able to remotely recover from internal issues if the VPN runs behind it.

I've been having this problem for less than a year but more than six months; I'm not exactly sure how long. As of 12/23 I became fed up, but it took 44 days for it to panic again in order to get a real dump, then just 6 days for another.

I am VERY uncomfortable doing an upgrade without knowing what was causing the issue. Maybe 10.1 will solve the problem, and maybe it will sweep a hardware problem under the carpet for 3 months. Right now time is the only known way to reproduce the issue, and the reason for the seemingly random amount of time is unknown. While I would have no expectation of trying to fix a deprecated version, I would sleep better at night having identified the root cause and being able to identify means of reproducing the issue and testing for its presence and resolution. I know it seems counterproductive, and if I had a reproducible issue I could test against a new version I'd be all over it in a heartbeat. But all I do have is the data contained in two separate core dumps, and no way of knowing if an upgrade will be of any value. So, I'm just ~~crazy~~ persistent enough to take what I do have to its logical conclusion.

So, kgdb works a lot better with debug symbols… not to mention that pfSense doesn't even ship with (k)gdb.

cmb

The panic is in something related to the packet filter. It looks a lot like what happens with schedules, but there is another similar panic in some unusual edge case. If the backtraces all look similar to that one, it's a near-certainty it's not a hardware problem. That would exhibit itself in a different bt, or varying ones.
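
For anyone who wants to check that "do the backtraces all look similar" question mechanically, here is a minimal Python sketch. It assumes the frame lines look like the DDB `bt` output posted above (`name(args) at name+0xoffset`). The helper names `frame_names` and `same_call_path` are made up for illustration and are not part of any pfSense or FreeBSD tool, and the `depth` of 3 is an arbitrary choice. The first sample is the top three frames from the post above; the second reuses the same function names with placeholder arguments and offsets, purely to exercise the comparison. It is not a real second dump.

```python
import re

# Matches a DDB frame line such as:
#   rn_match(c1520d4c,...) at rn_match+0x11
FRAME_RE = re.compile(r"^\s*\w+\(.*\)\s+at\s+(\w+)\+0x[0-9a-fA-F]+\s*$")


def frame_names(bt_text):
    """Return the function names in a DDB backtrace, innermost frame first."""
    names = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def same_call_path(bt_a, bt_b, depth=3):
    """True if both traces have frames and their innermost `depth` names agree."""
    a = frame_names(bt_a)[:depth]
    b = frame_names(bt_b)[:depth]
    return bool(a) and a == b


# Top three frames copied from the backtrace earlier in this thread.
BT_POSTED = """\
Tracing pid 12 tid 100055 td 0xc702db80
rn_match(c1520d4c,c9671300,ed797904,c7c4d200,ed79785c,...) at rn_match+0x11
pfr_match_addr(c94689b0,c80a581a,2,16,ed797844,...) at pfr_match_addr+0xe0
pf_test_tcp(ed797920,ed79791c,1,c7c4d200,c8107900,...) at pf_test_tcp+0xb05
"""

# Placeholder trace (NOT from a real dump): same functions, dummy args/offsets.
BT_PLACEHOLDER = """\
rn_match(0,0,0,0,0,...) at rn_match+0x1
pfr_match_addr(0,0,0,0,0,...) at pfr_match_addr+0x2
pf_test_tcp(0,0,0,0,0,...) at pf_test_tcp+0x3
"""

if __name__ == "__main__":
    print(frame_names(BT_POSTED))
    print(same_call_path(BT_POSTED, BT_PLACEHOLDER))
```

Only function names are compared, since argument values and stack addresses will differ between panics even when the code path is identical. Matching names across dumps would fit the "same software bug" reading, while traces that land in unrelated functions each time would point more toward hardware.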