Kernel Panic

vito

Hopefully it's fixable. lem is the legacy em driver, though, so this affects some rather old/early em chipsets, which might explain why so few people have hit it.

Jim
Where do you see that it is an old legacy NIC/driver? (Not questioning, just curious.) :)
Just an FYI, I had no problem with the Oct snaps.

Thanks for your help

jimp

@vito:

Where do you see that it is an old legacy NIC/driver? (Not questioning, just curious.) :)
Just an FYI, I had no problem with the Oct snaps.


exclusive sleep mutex em0 (EM TX Lock) r = 0 (0xc2f52580) locked @ /usr/pfSensesrc/src/sys/dev/e1000/if_lem.c:1350

The lock was taken in if_lem.c. lem is the legacy em driver; a normal em card would have been in if_em.c.
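For anyone who wants to check their own panic output the same way, here is a minimal sketch that pulls the interface and source file out of a witness lock line. The helper name and the regular expression are assumptions made up for this example; they are not part of pfSense or FreeBSD.

```python
import re

# Hypothetical pattern for the "exclusive sleep mutex" lines quoted in this
# thread; it is a guess at the layout, not an official format definition.
LOCK_LINE = re.compile(
    r"exclusive sleep mutex (?P<ifname>\S+) \((?P<lock>[^)]*)\) "
    r".* locked @ (?P<path>\S+):(?P<line>\d+)"
)

def driver_source(lock_line):
    """Return (interface, source file name, line number), or None if no match."""
    m = LOCK_LINE.search(lock_line)
    if m is None:
        return None
    # Keep only the file name; the directory part depends on the build host.
    filename = m.group("path").rsplit("/", 1)[-1]
    return m.group("ifname"), filename, int(m.group("line"))

line = ("exclusive sleep mutex em0 (EM TX Lock) r = 0 (0xc2f52580) "
        "locked @ /usr/pfSensesrc/src/sys/dev/e1000/if_lem.c:1350")
print(driver_source(line))
```

If the file name that comes back is if_lem.c rather than if_em.c, the lock was taken in the legacy driver.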

FisherKing

Hmm - that line looks similar to the panic I get when captive portal is enabled on my box.


exclusive sleep mutex fxp0 (network driver) r = 0 (0xc36de018) locked @ /usr/pfSensesrc/src/sys/dev/fxp/if_fxp.c:1288

Details here:
http://forum.pfsense.org/index.php/topic,30791.msg159227.html#msg159227

CryoGenID gets a panic but with yet another set of drivers. Cino does as well.
http://forum.pfsense.org/index.php/topic,29839.60.html

Is there anything we can do to help besides posting back traces?

jimp

I just spent a bit of time on the phone with someone who hit this. It does seem to be related to OpenVPN somehow (or to the kind of traffic that is seen more often with OpenVPN, I suppose). Once we had the developer kernel on, it stayed up for quite a while, until someone connected with OpenVPN and generated some traffic.

jimp

A patch was just committed by ermal that might fix this, or at least change the behavior somewhat. Give the next snapshot a try.

Jonb

Just as a side note, I was getting a kernel panic with the PPPOA interface selected as WAN rather than rl1. Not sure if that is due to an incorrect config, but if so it might be worth removing the option to save people the hassle.

LostInIgnorance

Still getting the panic. I don't think the commit made it into the most recent snap.

Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex em0 (EM TX Lock) r = 0 (0xc2f52580) locked @ /usr/pfSensesrc/src/sys/dev/e1000/if_lem.c:1350
KDB: stack backtrace:
X_db_sym_numargs(c0eb72fb,ccc3ca90,c0a41f25,546,0,...) at X_db_sym_numargs+0x146
kdb_backtrace(546,0,ffffffff,c145d42c,ccc3cac8,...) at kdb_backtrace+0x29
witness_display_spinlock(c0eb9813,ccc3cadc,4,1,0,...) at witness_display_spinlock+0x75
witness_warn(5,0,c0ef7bc2,14,c131b3c0,...) at witness_warn+0x20d
trap(ccc3cb68) at trap+0x19e
alltraps(c341ab00,dedeadc0,c341ab00,c341ab00,ccc3cbf0,...) at alltraps+0x1b
m_tag_delete_chain(c341ab00,0,c0e6e75d,0,c2ed9d50,...) at m_tag_delete_chain+0x3f
reallocf(c341ab00,100,0,c0a42978,df,...) at reallocf+0x8a5
uma_zfree_arg(c1d7e380,c341ab00,0,d5,ccc3cc84,...) at uma_zfree_arg+0x29
m_freem(c341ab00,4,c0e6e75d,b87,c2f4e000,...) at m_freem+0x43
ed_probe_RTL80x9(c2f52580,0,c0e6e75d,546,c2f525bc,...) at 0xc06ec4d8
ed_probe_RTL80x9(c2f4e000,1,c0eb8bcc,4f,c2edb918,...) at 0xc06efea0
taskqueue_run(c2edb900,c2edb918,c0ea5f85,0,c0eb222b,...) at taskqueue_run+0x103
taskqueue_thread_loop(c2f525ec,ccc3cd38,c0eaed9a,344,c131b3c0,...) at taskqueue_thread_loop+0x68
fork_exit(c0a3b1a0,c2f525ec,ccc3cd38) at fork_exit+0xb8
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xccc3cd70, ebp = 0 ---

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xdedeadc0
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc0a611c8
stack pointer           = 0x28:0xccc3cba8
frame pointer           = 0x28:0xccc3cbb8
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (em0 taskq)
Stopped at      m_tag_delete+0x48:      movl    0(%ecx),%eax
db>

jimp

It did happen. I manually restarted the builders after the patch went in. So apparently it still isn't quite right.

jimp

Could someone who can readily reproduce this panic give this custom firmware build a try?

http://cvs.pfsense.org/~jimp/pfSense-Full-Update-2.0-BETA5-i386-20110114-2041.tgz

It was built without a patch that does the extra mbuf operations that may be triggering the panic.

LostInIgnorance

Bad news JimP, still crashes.

Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex em0 (EM TX Lock) r = 0 (0xc2f52580) locked @ /usr/pfSensesrc/src/sys/dev/e1000/if_lem.c:1350
KDB: stack backtrace:
X_db_sym_numargs(c0eb72fb,ccc3ca90,c0a41f25,546,0,...) at X_db_sym_numargs+0x146
kdb_backtrace(546,0,ffffffff,c145d42c,ccc3cac8,...) at kdb_backtrace+0x29
witness_display_spinlock(c0eb9813,ccc3cadc,4,1,0,...) at witness_display_spinlock+0x75
witness_warn(5,0,c0ef7bc2,14,c131b3c0,...) at witness_warn+0x20d
trap(ccc3cb68) at trap+0x19e
alltraps(c2feeb00,dedeadc0,c2feeb00,c2feeb00,ccc3cbf0,...) at alltraps+0x1b
m_tag_delete_chain(c2feeb00,0,c0e6e75d,0,c2ed9b50,...) at m_tag_delete_chain+0x3f
reallocf(c2feeb00,100,0,c0a42978,df,...) at reallocf+0x8a5
uma_zfree_arg(c1d7e380,c2feeb00,0,b5,ccc3cc84,...) at uma_zfree_arg+0x29
m_freem(c2feeb00,4,c0e6e75d,b87,c2f4e000,...) at m_freem+0x43
ed_probe_RTL80x9(c2f52580,0,c0e6e75d,546,c2f525bc,...) at 0xc06ec4d8
ed_probe_RTL80x9(c2f4e000,1,c0eb8bcc,4f,c2edb918,...) at 0xc06efea0
taskqueue_run(c2edb900,c2edb918,c0ea5f85,0,c0eb222b,...) at taskqueue_run+0x103
taskqueue_thread_loop(c2f525ec,ccc3cd38,c0eaed9a,344,c131b3c0,...) at taskqueue_thread_loop+0x68
fork_exit(c0a3b1a0,c2f525ec,ccc3cd38) at fork_exit+0xb8
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xccc3cd70, ebp = 0 ---

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xdedeadc0
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc0a611c8
stack pointer           = 0x28:0xccc3cba8
frame pointer           = 0x28:0xccc3cbb8
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (em0 taskq)
Stopped at      m_tag_delete+0x48:      movl    0(%ecx),%eax
db>

FisherKing

currently running 2.0-BETA5 (i386) built on Thu Jan 13 19:33:19 EST 201
not sure how far back this happens.

in a test network -
2 machines, each w/ 4 intel nics (em0 - em3)
WAN, LAN, Opt1, Opt2 (CARP interface)

Running CARP on WAN, LAN, Opt1 interfaces
Syncing on Opt2 interface.

Recently started getting panics on box2 when changing settings on box1.

Panic & BackTrace from box2 included below.


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1a4
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc09ee51d
stack pointer           = 0x28:0xd670aa54
frame pointer           = 0x28:0xd670aa70
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 253 (devd)
Stopped at      _mtx_lock_sleep+0x6d:   movl    0x1a4(%ecx),%eax
db> bt
Tracing pid 253 tid 64081 td 0xc4142000
_mtx_lock_sleep(c40f16d0,c4142000,0,c0ecfc57,fd,...) at _mtx_lock_sleep+0x6d
_mtx_lock_flags(c40f16d0,0,c0ecfc57,fd,0,...) at _mtx_lock_flags+0xf7
carp6_input(c3ae5800,c0286938,c40f3a00,c0ea9fce,3,...) at carp6_input+0x9bd
ifioctl(c46a3b44,c0286938,c40f3a00,c4142000,c40cf900,...) at ifioctl+0x141e
soo_ioctl(c412ddc8,c0286938,c40f3a00,c39aa400,c4142000,...) at soo_ioctl+0x415
kern_ioctl(c4142000,f,c0286938,c40f3a00,1a3b7d0,...) at kern_ioctl+0x1fd
ioctl(c4142000,d670acf8,c0ef7af5,c0ecdaff,c41a77f8,...) at ioctl+0x134
syscall(d670ad38) at syscall+0x220
Xint0x80_syscall() at Xint0x80_syscall+0x20
--- syscall (54, FreeBSD ELF32, ioctl), eip = 0x8088357, esp = 0xbfbfe89c, ebp = 0xbfbfe908 ---
db> reboot

jimp

Out of curiosity, what type of network cards do you have in that box? Is it rl and em both? Or just em? or just rl? Or something else?

LostInIgnorance

One em NIC (gigabit, embedded on the board of an old Dell P4). All network traffic is VLAN'd on that one interface.

jimp

OK, just checking… It looks odd to me that the backtrace references ed_probe_RTL80x9 which is a really old realtek chip, but it may just be something weird that I don't know at that level in the kernel/network stack.

We have arranged serial console access with someone who has been able to reproduce the panic so hopefully we'll have a lead on a fix early next week.

wallabybob

@jimp:

OK, just checking… It looks odd to me that the backtrace references ed_probe_RTL80x9 which is a really old realtek chip,

Here's an extract from the stack trace:

m_freem(c2feeb00,4,c0e6e75d,b87,c2f4e000,...) at m_freem+0x43
ed_probe_RTL80x9(c2f52580,0,c0e6e75d,546,c2f525bc,...) at 0xc06ec4d8
ed_probe_RTL80x9(c2f4e000,1,c0eb8bcc,4f,c2edb918,...) at 0xc06efea0
taskqueue_run(c2edb900,c2edb918,c0ea5f85,0,c0eb222b,...) at taskqueue_run+0x103

Note that the two ed_probe_RTL80x9 references are followed by a raw address rather than a symbol name and offset. I suspect ed_probe_RTL80x9 is merely the closest lower-valued global symbol, but it's too far away to warrant printing the PC as symbol+offset. If that is the case, you shouldn't take too much notice of the ed_probe_RTL80x9.
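To make that guess concrete, here is a rough sketch of "closest lower symbol" lookup. Everything in it is an assumption for illustration: the symbol addresses are made up, the distance cutoff is arbitrary, and I am not claiming this is how the kernel debugger actually decides what to print.

```python
import bisect

# Made-up symbol table, sorted by start address. These are NOT real kernel
# addresses; they only mimic the shape of the trace above.
SYMBOLS = [
    (0xc0600000, "ed_probe_RTL80x9"),
    (0xc0a60000, "m_freem"),
]
ADDRS = [addr for addr, _ in SYMBOLS]

def describe(pc, max_offset=0x10000):
    """Return (name, location) for pc using the closest lower symbol.

    max_offset is an arbitrary cutoff: beyond it the symbol is treated as
    unrelated and the raw address is shown instead of name+offset.
    """
    i = bisect.bisect_right(ADDRS, pc) - 1
    if i < 0:
        return None, f"{pc:#x}"
    base, name = SYMBOLS[i]
    offset = pc - base
    if offset > max_offset:
        return name, f"{pc:#x}"
    return name, f"{name}+{offset:#x}"

print(describe(0xc0a60043))  # close to a symbol: name+offset
print(describe(0xc06ec4d8))  # far from any symbol: raw address only
```

With a table like this, a PC inside code that has no global symbol of its own (static functions, for example) still gets labelled with whatever global symbol happens to sit below it, which would produce exactly the kind of misleading name seen in the trace.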

LostInIgnorance

@jimp:

We have arranged serial console access with someone who has been able to reproduce the panic so hopefully we'll have a lead on a fix early next week.

JimP, is there anything I can do to help out?

jimp

Not that I'm aware of. If the mbuf tag patch isn't the cause, it almost has to be the recent e1000 driver update (em, igb, etc).

jimp

Someone else had seen that once but so far we've been unable to replicate it so the real cause can be tracked down.

It seemed to be something in the configuration, though.

LostInIgnorance

I am afraid to update since I haven't heard anything back. Is it still crashing or has it been fixed?

jimp

Nothing has changed with the drivers, but plenty of other things have been fixed, so it may be worth trying.