Kernel panic 4-5 Nov (i386)



  • Hi,

    pfSense-Full-Update-2.0-BETA4-20101104-1041.tgz
    pfSense-Full-Update-2.0-BETA4-20101104-2246.tgz
    Kernel panic in interrupt routine after some level of network load (bfe0,ath0).
    Rolled back to the 30 Oct snapshot and all is OK. I have no information about the 1-3 Nov snapshots.


  • Rebel Alliance Developer Netgate

    Need a lot more info than that. Any way you can get a capture of the kernel panic message at least? Or possibly switch to a developer kernel and get a "bt" output when the panic happens too?

    Or some idea of the traffic level involved?

    There were some performance patches added on Nov 3 that may be involved, but a lot more detail would be needed to track down how/why.



  • Just to confirm, I have today's nanoBSD 4GB snapshot running on my ALIX board and it seems to reboot when navigating the WebGUI. It rebooted out of the blue when loading the page to add a new firewall rule but also when loading the interface statistics page it rebooted a couple of times. Haven't seen that before but it has now done it about 10 to 15 times in a couple of days.

    I can reproduce the error by invoking the auto updater: when the download reaches approximately 3%, the interface freezes and the box reboots. Upgrading from a URL through SSH is fine.

    I would be happy to provide logfiles when it happens again, but which one would you need? I checked the system log but couldn't find anything interesting.

    When I leave the box alone it doesn't seem to reboot, regardless of the throughput.


  • Rebel Alliance Developer Netgate

    Keep the serial console connected and get the output from around the time it resets. Before the reboot messages show up. It may be a panic message, or something else.

    You should be able to copy/paste it out of the serial terminal.



  • @jimp:

    Keep the serial console connected and get the output from around the time it resets. Before the reboot messages show up. It may be a panic message, or something else.

    You should be able to copy/paste it out of the serial terminal.

    Ok cool, I will do that now. I just updated my previous post that I can reproduce the issue.



  • Unfortunately I can only confirm the kernel panic (page fault). It happens when booting, just after the interface configuration.
    I am reinstalling at this moment to see if I can get the system working again.

    -m4rcu5


  • Rebel Alliance Developer Netgate

    @m4rcu5:

    Unfortunately I can only confirm the kernel panic (page fault). It happens when booting, just after the interface configuration.
    I am reinstalling at this moment to see if I can get the system working again.

    -m4rcu5

    That sounds more like the amd64 problem, this is on i386.



  • Ok, got it. I couldn't invoke the auto updater as there is no newer snapshot, but after randomly clicking around in the web interface for a minute it went down halfway through loading "Diagnostics -> ARP Table":

    
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    
    fault virtual address	= 0x10317
    fault code			= supervisor read, page not present
    instruction pointer	= 0x20:0xc095f46b
    stack pointer		= 0x28:0xe2e21bc4
    frame pointer		= 0x28:0xe2e21bc8
    code segment		= base 0x0, limit 0xfffff, type 0x1b
    				= DPL 0, pres 1, def32 1, gran 1
    processor eflags	= interrupt enables, resume, IOPL = 0
    current process	= 0 (ath0 taskq)
    trap number		= 12
    
    panic: page fault
    cpuid = 0
    
    Cannot dump. Device not defined or unavailable.
    
    Automatic reboot in 15 seconds - press a key on the console to abort
    Rebooting...
    
    


  • @jimp:

    @m4rcu5:

    Unfortunately I can only confirm the kernel panic (page fault). It happens when booting, just after the interface configuration.
    I am reinstalling at this moment to see if I can get the system working again.

    -m4rcu5

    That sounds more like the amd64 problem, this is on i386.

    jimp, I use the i386 image on an Intel Core 2 Duo.

    I have the exact same fault code on the same process (only on interface em0) as pakjebakmeel. This time it happened when booted from the live CD.

    -m4rcu5
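
Since several posts in this thread compare fault codes and process names by eye, here is a minimal Python sketch for pulling the `key = value` fields out of a pasted trap message so that two reports can be compared. The helper names (`parse_trap_summary`, `signature`) are hypothetical and not part of pfSense or FreeBSD, and the sketch assumes the summary lines look like the ones quoted in this thread.

```python
import re

# Matches "key = value" lines of a trap summary, for example
# "fault virtual address   = 0x10317". Keys are lowercase words and spaces.
# Continuation lines that start with "=" and lines without "=" are skipped.
FIELD_RE = re.compile(r"^\s*([a-z][a-z ]*?)\s*=\s*(.+?)\s*$")


def parse_trap_summary(text):
    """Return a dict of the 'key = value' fields found in a pasted panic message."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def signature(fields):
    """Reduce a parsed report to the parts that are comparable across machines.

    Assumption: trap number and fault code are enough to say two reports
    look alike; addresses and process names are expected to differ.
    """
    return (fields.get("trap number"), fields.get("fault code"))
```

Two reports with the same `signature(...)` are only candidates for being the same bug; the backtrace from a developer kernel is still needed to confirm it.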



  • For me it only happens when navigating the webGUI. If I leave the GUI alone and generate massive traffic the box is rock solid.



  • Hi jimp,
    Sorry for the screenshots, but I am using real hardware (an ASUS Pundit) and have no serial console. Before my first post I had to wait at most 5 minutes for a panic. On the latest snapshot it takes longer, but as far as I know there are no differences between the snapshots.
    This panic appeared immediately after running a fetch command on the pfSense host.





    wan.svg.txt
    lan.svg.txt



  • Hi all,

    Is this bug still present or can it be fixed by upgrading to a newer snapshot?

    Thanks!


  • Rebel Alliance Developer Netgate

    It's still there. I just restarted the builder again to see if the fixes checked in yesterday made a difference. It should be done in a while but I'd still wait for an all-clear.


  • Rebel Alliance Developer Netgate

    Current snap is OK:

    2.0-BETA4 (amd64)
    built on Tue Nov 9 17:26:01 UTC 2010



  • pfSense-Full-Update-2.0-BETA4-20101109-1641.tgz



  • Rebel Alliance Developer Netgate

    Ah, well I was hoping it may be a similar issue to amd64. Looks like it may be different.



  • pfSense-Full-Update-2.0-BETA4-20101110-0504.tgz
    Same thing, just from a different angle. It seems to me that traffic generated by the router itself is the cause of the crash. Without such traffic the router runs for longer, but any attempt to fetch something from the router always causes a panic.


  • Rebel Alliance Developer Netgate

    I've been sitting here furiously loading GUI pages on my poor little ALIX running a snapshot from today and though it's gotten slow at times, I have yet to see a panic.

    Is there anything else people in this thread might have in common? What kind of setups do you all have? Can you give a general idea of things that are in use? (Multi-wan, IPsec, OpenVPN, PPPoE, 3G, wireless, etc)



  • Not to cause you more frustration Jimp - but I'm seeing this GUI / Kernel Panic also.

    Running 2010-11-09 (i386) on a PIII.
    Reset to factory defaults yesterday & did a basic install
    I've got 2 dual port intel nics.
    fxp0 = WAN PPPoE
    fxp0 = Opt3 DHCP private (10.x.x.x)
    fxp1 = Opt1 DMZ static public
    fxp2 = LAN static private (172.x.x.x)
    fxp3 = Opt2 static private (172.y.y.y)

    I'm running DHCP server on Lan and Opt2.
    I'm logging in under a 2nd administrator account.
    I'm running Captive Portal on Opt2 w/ local auth
    Firewall allows web, mail, dns traffic through to the public IPs on the DMZ
    I have freeswitch installed on the box, but it isn't the pfSense package.  pfSense doesn't know it's there.

    How does this match up with what the rest of you have?



  • I've noticed the following on the kernel panic screen.

    Cannot dump. Device not defined or unavailable.

    Would the dump be helpful in diagnosing this?  Is there a place where we can find directions on how to give it a dump device?  Is it possible to dump to the local disk, or to a USB memory stick?

    I do have a null modem cable & could dump to the serial port / windows terminal app if that's how it's done, but I'd need directions for that also.



  • Can it be your hardware?
    Can you please install dev kernel in there?


  • Rebel Alliance Developer Netgate

    In case someone missed it earlier in the thread, here are instructions for installing a dev kernel:

    http://doc.pfsense.org/index.php/Switching_Kernels

    Afterwards, capture the panic message and also the output of typing "bt" at the debugger prompt.



  • Hi!
    snapshot pfSense-Full-Update-2.0-BETA4-20101110-1837.tgz
    I switched to the dev kernel but it does not work at all (it can't finish booting). Attaching dmesg for the hardware configuration.
    Rolled back to 13 Oct, because a non-working PPTP MPD is critical for me.







    dmesg.boot.txt



  • Can you please try setting net.inet.tcp.syncookies=0 and see if you still get panic?



  • pfSense-Full-Update-2.0-BETA4-20101111-2017.tgz
    I tried setting net.inet.tcp.syncookies=0 in /boot/loader.conf.local and nothing changed.
    The dev kernel cannot finish the init scripts and panics with "syncache: mbuf too small".
    The uniprocessor kernel works for some time, but panics in interrupt routines (most often for the Ethernet WAN interface bfe0, but sometimes in the clock interrupt).
    I can post screenshots if that would be helpful.



  • WOW that debug kernel is slow (on my PIII 500 MHz)!

    I got the dump, but it's a physical machine so all I could do is take pictures.  For future reference, is there a better way to capture that dump?  The machine does have a serial port.

    pics attached.










  • I can confirm it's not the hardware. I tried a fresh install on 3 different PCs (Pentium 3 and 4).



  • I've found another way to produce the kernel panic and it seems to be related.

    I've noticed that when I log into the GUI that pfSense doesn't panic until the automatic update check has completed.  Related to that, if I ssh into the console and select option 13 (upgrade from console), and then option 1 (Update from a URL), the system will panic when it starts to download the file from the snapshot server.

    Even stranger, if I ssh into the console, choose option 8 (shell), then type something like "fetch http://snapshots.pfsense.org/FreeBSD_RELENG_8_1/i386/pfSense_HEAD/updates/pfSense-Full-Update-2.0-BETA4-20101115-1340.tgz", the kernel will panic then as well.

    My "WAN" port is configured via PPPoE.  I've also set the physical port that the WAN is on as Opt 3.

    Still having this panic on snapshot dated "Nov 15 16:00:39 EST 2010".



  • Ok, I figured out how to setup the serial console on the full version of pfSense.

    Having done that, I've captured a couple of kernel panics and the back traces from that.  I've noticed that the panics appear to be related to my WAN port. My WAN's IP is received via PPPoE.

    The others who are seeing this panic - are you also getting your WAN IP via PPPoE?  Or maybe PPTP?  What driver is your WAN using?

    Here are the kernel panics & back traces.

    After running fetch from an SSH terminal session:

    
    # fetch http://snapshots.pfsense.org/FreeBSD_RELENG_8_1/i386/pfSense_HEAD/updates/pfSense-Full-Update-2.0-BETA4-20101115-1340.tgz
    
    pfSense-Full-Update-2.0-BETA4-20101115-1340.tg  0% of   75 MB    0  Bps
    
    Kernel page fault with the following non-sleepable locks held:
    exclusive sleep mutex fxp0 (network driver) r = 0 (0xc36c2018) locked @ /usr/pfS
    ensesrc/src/sys/dev/fxp/if_fxp.c:1288
    KDB: stack backtrace:
    X_db_sym_numargs(c0ea6373,c330788c,c0a32ac5,508,0,...) at X_db_sym_numargs+0x146
    
    kdb_backtrace(508,0,ffffffff,c144d77c,c33078c4,...) at kdb_backtrace+0x29
    witness_display_spinlock(c0ea888b,c33078d8,4,1,0,...) at witness_display_spinloc
    k+0x75
    witness_warn(5,0,c0ee6c23,c144d778,c3590550,...) at witness_warn+0x20d
    trap(c3307964) at trap+0x19e
    alltraps(c36d1b00,dedeadc0,c36d1b00,c36d1b00,c33079ec,...) at alltraps+0x1b
    m_tag_delete_chain(c36d1b00,0,df,0,c36c2000,...) at m_tag_delete_chain+0x3f
    reallocf(c36d1b00,100,0,9e3,3,...) at reallocf+0x8a5
    uma_zfree_arg(c1d7e380,c36d1b00,0,c36c3020,c3307a60,...) at uma_zfree_arg+0x29
    m_freem(c36d1b00,c36ca5c0,8,c36b9800,c36c2018,...) at m_freem+0x43
    fwohci_init(c36c2018,4,c0e62817,519,c3307ab4,...) at fwohci_init+0x545c
    fwohci_init(c36c2018,0,c0e62817,508,c36b9800,...) at fwohci_init+0x6613
    fwohci_init(c36b9800,c3307c0c,c0aa12bf,c36b9800,0,...) at fwohci_init+0x733b
    if_start(c36b9800,0,c0eb26f9,d1d,2,...) at if_start+0x12
    if_handoff(c36b9800,c39ad800,0,0) at if_handoff+0x25f
    ether_output_frame(c36b9800,c39ad800,c0e98e11,1,c3dac380,...) at ether_output_fr
    ame+0x65
    ng_car_q_event(c3db1080,c3dac380,c0ebbe43,c0e98e11,3,...) at ng_car_q_event+0x2e
    2b
    ng_rmnode(c3903bd0,0,c0ebbe43,d2c,0,...) at ng_rmnode+0x2e4
    ng_rmnode(0,c3307d38,c0e9de12,344,c3590550,...) at ng_rmnode+0x16a1
    fork_exit(c0b0e840,0,c3307d38) at fork_exit+0xb8
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xc3307d70, ebp = 0 ---
    
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address   = 0xdedeadc0
    fault code              = supervisor read, page not present
    instruction pointer     = 0x20:0xc0a51d58
    stack pointer           = 0x28:0xc33079a4
    frame pointer           = 0x28:0xc33079b4
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, def32 1, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process         = 13 (ng_queue0)
    [thread]
    Stopped at      m_tag_delete+0x48:      movl    0(%ecx),%eax
    db>
    db>
    db>
    db>
    db> bt
    Tracing pid 13 tid 64008 td 0xc3592000
    m_tag_delete(c36d1b00,dedeadc0,c36d1b00,c36d1b00,c33079ec,...) at m_tag_delete+0
    x48
    m_tag_delete_chain(c36d1b00,0,df,0,c36c2000,...) at m_tag_delete_chain+0x3f
    reallocf(c36d1b00,100,0,9e3,3,...) at reallocf+0x8a5
    uma_zfree_arg(c1d7e380,c36d1b00,0,c36c3020,c3307a60,...) at uma_zfree_arg+0x29
    m_freem(c36d1b00,c36ca5c0,8,c36b9800,c36c2018,...) at m_freem+0x43
    fwohci_init(c36c2018,4,c0e62817,519,c3307ab4,...) at fwohci_init+0x545c
    fwohci_init(c36c2018,0,c0e62817,508,c36b9800,...) at fwohci_init+0x6613
    fwohci_init(c36b9800,c3307c0c,c0aa12bf,c36b9800,0,...) at fwohci_init+0x733b
    if_start(c36b9800,0,c0eb26f9,d1d,2,...) at if_start+0x12
    if_handoff(c36b9800,c39ad800,0,0) at if_handoff+0x25f
    ether_output_frame(c36b9800,c39ad800,c0e98e11,1,c3dac380,...) at ether_output_fr
    ame+0x65
    ng_car_q_event(c3db1080,c3dac380,c0ebbe43,c0e98e11,3,...) at ng_car_q_event+0x2e
    2b
    ng_rmnode(c3903bd0,0,c0ebbe43,d2c,0,...) at ng_rmnode+0x2e4
    ng_rmnode(0,c3307d38,c0e9de12,344,c3590550,...) at ng_rmnode+0x16a1
    fork_exit(c0b0e840,0,c3307d38) at fork_exit+0xb8
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xc3307d70, ebp = 0 ---
    db>
    
    After logging into the GUI - the dashboard just completed its automatic update check
    [code]
    Kernel page fault with the following non-sleepable locks held:
    exclusive sleep mutex fxp0 (network driver) r = 0 (0xc36c2018) locked @ /usr/pfS
    ensesrc/src/sys/kern/kern_mutex.c:147
    KDB: stack backtrace:
    X_db_sym_numargs(c0ea6373,c3304a4c,c0a32ac5,93,0,...) at X_db_sym_numargs+0x146
    kdb_backtrace(93,0,ffffffff,c144d82c,c3304a84,...) at kdb_backtrace+0x29
    witness_display_spinlock(c0ea888b,c3304a98,4,1,0,...) at witness_display_spinloc
    k+0x75
    witness_warn(5,0,c0ee6c23,c144d828,c35907f8,...) at witness_warn+0x20d
    trap(c3304b24) at trap+0x19e
    alltraps(c39af800,dedeadc0,c39af800,c39af800,c3304bac,...) at alltraps+0x1b
    m_tag_delete_chain(c39af800,0,df,0,c36c2000,...) at m_tag_delete_chain+0x3f
    reallocf(c39af800,100,0,9e3,3,...) at reallocf+0x8a5
    uma_zfree_arg(c1d7e380,c39af800,0,c36c32c0,c3304c20,...) at uma_zfree_arg+0x29
    m_freem(c39af800,c36c9a00,8,c36c2000,c36b9800,...) at m_freem+0x43
    fwohci_init(c36c2018,4,c0e62817,82a,c36c2018,...) at fwohci_init+0x545c
    fwohci_init(c36c2000,1,c0ea4352,189,c130ccf8,...) at fwohci_init+0x7a25
    softclock(c130ccc0,c3304cc8,c09deb04,c1310a80,c35b95b8,...) at softclock+0x24a
    intr_event_execute_handlers(c35907f8,c35b9580,c0e9e0ad,533,c35b95f0,...) at intr
    _event_execute_handlers+0x125
    intr_event_add_handler(c358f110,c3304d38,c0e9de12,344,c35907f8,...) at intr_even
    t_add_handler+0x42f
    fork_exit(c09c78b0,c358f110,c3304d38) at fork_exit+0xb8
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xc3304d70, ebp = 0 ---
    
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address   = 0xdedeadc0
    fault code              = supervisor read, page not present
    instruction pointer     = 0x20:0xc0a51d58
    stack pointer           = 0x28:0xc3304b64
    frame pointer           = 0x28:0xc3304b74
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, def32 1, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process         = 12 (swi4: clock)
    [thread]
    Stopped at      m_tag_delete+0x48:      movl    0(%ecx),%eax
    db>
    db>
    db> bt
    Tracing pid 12 tid 64007 td 0xc3592280
    m_tag_delete(c39af800,dedeadc0,c39af800,c39af800,c3304bac,...) at m_tag_delete+0
    x48
    m_tag_delete_chain(c39af800,0,df,0,c36c2000,...) at m_tag_delete_chain+0x3f
    reallocf(c39af800,100,0,9e3,3,...) at reallocf+0x8a5
    uma_zfree_arg(c1d7e380,c39af800,0,c36c32c0,c3304c20,...) at uma_zfree_arg+0x29
    m_freem(c39af800,c36c9a00,8,c36c2000,c36b9800,...) at m_freem+0x43
    fwohci_init(c36c2018,4,c0e62817,82a,c36c2018,...) at fwohci_init+0x545c
    fwohci_init(c36c2000,1,c0ea4352,189,c130ccf8,...) at fwohci_init+0x7a25
    softclock(c130ccc0,c3304cc8,c09deb04,c1310a80,c35b95b8,...) at softclock+0x24a
    intr_event_execute_handlers(c35907f8,c35b9580,c0e9e0ad,533,c35b95f0,...) at intr
    _event_execute_handlers+0x125
    intr_event_add_handler(c358f110,c3304d38,c0e9de12,344,c35907f8,...) at intr_even
    t_add_handler+0x42f
    fork_exit(c09c78b0,c358f110,c3304d38) at fork_exit+0xb8
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xc3304d70, ebp = 0 ---
    db>
    [/thread][/code][/thread]
    


  • My WAN is DHCP.

    My pfSense box was rebooting while downloading lightsquid.

    It seems to crash only when the box itself is getting something from the internet.

    Slow DSL or fast cable connection, it is the same issue here, I think.



  • Hi singerie - So it probably isn't PPPoE then.  Thanks.

    1. are you running either Captive Portal or Traffic shaping?
    2. What network driver is your WAN using?


  • I'm not using Captive Portal or traffic shaping, but I do use multi-WAN, only as a failover.

    My network card is an Intel PCI-e.



  • I've seen similar panics in a build of 9-Nov. More in http://forum.pfsense.org/index.php/topic,29927.0.html





  • Beerman, Wallabybob, I'm curious, what nic is assigned to your WAN? Or what driver is your WAN using?



  • I'm using rl0 for my WAN interface. I could easily swap to vr0 or a wireless link or a USB NIC if you thought it worthwhile to gather some more data. I can reproduce my problem fairly easily.



  • My Alix 2c3 has this happen when updating and randomly at some other times when accessing the web gui.  WAN is vr1 and uses only DHCP, LAN is vr0, and I have ath0 for an access point on OPT1.

    PJ2: Some good information there.  It looks like it is crashing in the same function at the same line of code every time, called from the same code path.
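
The observation that both panics die in the same function via the same code path can be checked mechanically. The following is a rough Python sketch, not an official tool: `frames` and `common_prefix` are hypothetical helper names, and the sketch assumes `bt` output in the format pasted earlier in this thread, where each frame starts with `name(` and wrapped continuation lines do not.

```python
import re

# A frame line starts with the function name followed directly by "(".
# Wrapped continuation lines such as "x48" or "ame+0x65" do not match,
# and neither do lines like "db> bt" or "Tracing pid 13 ...".
FRAME_RE = re.compile(r"^\s*([A-Za-z_]\w*)\(")


def frames(bt_text):
    """Return the function names of a pasted ddb backtrace, innermost first."""
    names = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def common_prefix(a, b):
    """Return the leading frames that two backtraces share."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared
```

Feeding the two `bt` outputs from the earlier post through `frames` and then `common_prefix` lists the frames both panics have in common, starting from the faulting function.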



  • I swapped interfaces so vr0 was my WAN and on three restarts I've hit the same panic: syncache: mbuf too small



  • I had wondered if it was just an issue w/ intel nics.  Sounds like it's not that specific.  Thanks Wallabybob.



  • @PJ2:

    Beerman, Wallabybob, I'm curious, what nic is assigned to your WAN? Or what driver is your WAN using?

    My WAN is on vr2_vlan7. (ALIX Board)

