PfSense crashed on Alix



  • Hi there,

    today my internet suddenly seemed stuck. When I connected the console I saw this:

    Fatal trap 12: page fault while in kernel mode
    fault virtual address  = 0x8
    fault code              = supervisor read, page not present
    instruction pointer    = 0x20:0xc08f46ce
    stack pointer          = 0x28:0xc23787c8
    frame pointer          = 0x28:0xc23787f4
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, def32 1, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process        = 11 (irq10: vr0)


  • Rebel Alliance Developer Netgate

    That's a new one I haven't seen.

    If it happens again, get the new message and also type "bt" at that prompt to get a backtrace and paste it all here.



  • Okay. Will do that if it happens again.



  • Happened again :-(

    Here is the bt…

    db> bt
    Tracing pid 11 tid 64026 td 0xc2456b40
    rn_match(c0cd4fcc,c283f300,0,c2938000,c23fc3cc,...) at rn_match+0x4e
    pfr_match_addr(c288c9b0,c27b402a,2,c23fc3b8,c23fc3b4,...) at pfr_match_addr+0x63
    pf_test_tcp(c23fc4b4,c23fc4b0,1,c26c2400,c27f3600,...) at pf_test_tcp+0x4cb
    pf_test(1,c24fe400,c23fc678,0,0,...) at pf_test+0x8d2
    init_pf_mutex(0,c23fc678,c24fe400,1,0,...) at init_pf_mutex+0x5e6
    pfil_run_hooks(c0cfd140,c23fc6c8,c24fe400,1,0,...) at pfil_run_hooks+0x7e
    ip_input(c27f3600,57d5,c06fd8a0,c23fc70c,c06fd3b2,...) at ip_input+0x278
    netisr_dispatch_src(1,0,c27f3600,c23fc738,c0935a54,...) at netisr_dispatch_src+0x89
    netisr_dispatch(1,c27f3600,3,0,3,...) at netisr_dispatch+0x20
    ng_car_q_event(c27dc700,c27859c0,c0aaf160,c2456b40,c2456d80,...) at ng_car_q_event+0x2d54
    ng_rmnode_self(c2614380,5e3e,2,2,3,...) at ng_rmnode_self+0x10f
    ng_snd_item(c27859c0,0,c27dcb00,0,7f63,...) at ng_snd_item+0x3ca
    ng_parse_get_token(c27dca80,c27859c0,c0cf6c6c,c24566c0,4,...) at ng_parse_get_token+0xcbbf
    ng_rmnode_self(c27dc980,1,c23fc8e4,c0aa9c59,0,...) at ng_rmnode_self+0x10f
    ng_snd_item(c27859c0,0,c27dc780,0) at ng_snd_item+0x3ca
    ng_parse_get_token(0,c23fc938,c0ac3027,c0cd128c,c2631c80,...) at ng_parse_get_token+0x2d28
    ng_parse_get_token(0,c23fc948,1daf0008,28,28,...) at ng_parse_get_token+0x2e18
    ng_parse_get_token(0,c093f623,c27f3600,1) at ng_parse_get_token+0x2fc3
    ng_parse_get_token(0,0,c24566c0,c09433d0,c2456b40,...) at ng_parse_get_token+0x30f8
    ng_parse_get_token(c27dd200,c27859c0,80000000,c23fca68,c0ac92f4,...) at ng_parse_get_token+0x548d
    ng_rmnode_self(c27b400e,6,c23fca84,c0ac958e,c27b400e,...) at ng_rmnode_self+0x10f
    ng_snd_item(c27859c0,0,c27dd280,0,4,...) at ng_snd_item+0x3ca
    ng_parse_get_token(c27dd300,c27859c0,5e3e4fcf,6b40,ed7d0000,...) at ng_parse_get_token+0x7ad7
    ng_rmnode_self(0,c27f3600,c2534e00,6488,c23fcbc8,...) at ng_rmnode_self+0x10f
    ng_snd_item(c27859c0,0,c2631580,0,8864,...) at ng_snd_item+0x3ca
    ng_car_q_event(c24fd800,c27f3600,c24fd800,c24fd800,c27b4008,...) at ng_car_q_event+0x27b9
    ether_demux(c24fd800,c27f3600,3,0,3,...) at ether_demux+0x20e
    ether_vlanencap(c24fd800,c27f3600,0,c23fcc5c,c085400c,...) at ether_vlanencap+0x43f
    ucom_attach(c23fcc7c,c23f0008,c23f0028,c2450028,0,...) at ucom_attach+0x542b
    ucom_attach(c24aa000,0,109,bee27646,57d5,...) at ucom_attach+0x89d7
    intr_event_execute_handlers(c2436aa0,c2434600,c0b5910d,4f6,c2434670,...) at intr_event_execute_handlers+0x14b
    intr_getaffinity(c24ffc90,c23fcd38,0,0,0,...) at intr_getaffinity+0x14a
    fork_exit(c080dfe0,c24ffc90,c23fcd38) at fork_exit+0x90
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xc23fcd70, ebp = 0 ---
    db>



  • Sorry I couldn't see the part before the bt. I just started up the console and everything looked good with my login prompt. After I put in the admin username I saw the db> prompt…



  • Again :-(

    Fatal trap 12: page fault while in kernel mode
    fault virtual address  = 0x8
    fault code              = supervisor read, page not present
    instruction pointer    = 0x20:0xc08f46ce
    stack pointer          = 0x28:0xc23787c8
    frame pointer          = 0x28:0xc23787f4
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, def32 1, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process        = 11 (irq10: vr0)


  • Rebel Alliance Developer Netgate

    None of that looks familiar, unfortunately. There had been a few 2.0 panics over the last few months but they were all resolved, this one looks new. It seems to have crashed in two dissimilar code paths as well even though the begin and end the same. One seems to involve netgraph (PPPoE on WAN? PPTP or L2TP server active?) and the other doesn't.

    Do you have anything plugged into the ALIX USB ports? It's odd that ucom is showing up there, that's the USB com port driver.

    What kind of features do you have going on this setup? Any services or VPNs enabled? Packages installed? Traffic Shaping?



  • A I have a usb to serial adapter connected to the beast ;) Maybe I should disconnect it from my mac when I am not using it…


  • Rebel Alliance Developer Netgate

    Check the output of "vmstat -i"

    I wonder if your usb adapter is sharing an IRQ with a network card, which seems like a recipe for disaster on a low power device.



  • Hi!

    This is the output…

    interrupt                          total      rate
    irq0: clk                        120917        99
    irq4: uart0                          501          0
    irq8: rtc                        154794        127
    irq9: ath0                        73851        61
    irq10: vr0                          6349          5
    irq11: vr1                          9958          8
    irq12: ohci0 ehci0                    1          0
    irq14: ata0                        11966          9
    Total                            378337        312


  • Rebel Alliance Developer Netgate

    That looks fine, though is that with or without the USB serial adapter plugged in?

    I suppose it may just be a coincidence that it's the irq process that is active when the panic happens, but it's hard to say for certain.



  • This was taken with the usb adapter inserted. But it is usb on the Macintosh side. It is still serial on the Alix…


  • Rebel Alliance Developer Netgate

    ahh… well then that wouldn't explain the ucom then. The ALIX would only see serial, not a USB device on its end.



  • Again :-(

    Build from 01/06

    db> bt
    Tracing pid 11 tid 64025 td 0xc2456d80
    pim_input(c30a5816,c24fa000,c308cd00,0,0,…) at pim_input+0xb8c
    ip_input(c308cd00,246,c24d8700,c2378bcc,c06fd9b1,...) at ip_input+0x604
    netisr_dispatch_src(1,0,c308cd00,c2378c04,c08e3ecf,...) at netisr_dispatch_src+0x89
    netisr_dispatch(1,c308cd00,c24fa000,c24fa000,c30a5808,...) at netisr_dispatch+0x20
    ether_demux(c24fa000,c308cd00,3,0,3,...) at ether_demux+0x16f
    ether_vlanencap(c24fa000,c308cd00,c2456d80,c2378c5c,c0853f81,...) at ether_vlanencap+0x43f
    ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b
    ucom_attach(c24ab000,0,109,82593edb,132,...) at ucom_attach+0x89d7
    intr_event_execute_handlers(c2436aa0,c2434680,c0b5910d,4f6,c24346f0,...) at intr_event_execute_handlers+0x14b
    intr_getaffinity(c24f9b50,c2378d38,0,0,0,...) at intr_getaffinity+0x14a
    fork_exit(c080dfe0,c24f9b50,c2378d38) at fork_exit+0x90
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xc2378d70, ebp = 0 ---
    db>



  • And again. This is the last things I could grep:

    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process        = 11 (irq10: vr0)



  • Okay. Now again. I am testing igmpproxy right now, perhaps it has something to do with it?

    Fatal trap 12: page fault while in kernel mode
    fault virtual address  = 0x72636524
    fault code              = supervisor write, page not present
    instruction pointer    = 0x20:0xc096993c
    stack pointer          = 0x28:0xc2378b10
    frame pointer          = 0x28:0xc2378b64
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, def32 1, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process        = 11 (irq10: vr0)



  • @jlepthien:

    Okay. Now again. I am testing igmpproxy right now, perhaps it has something to do with it?

    Possibly. Did it happen at all before you started testing it?

    We'll get Ermal or someone to take a look at the backtraces when time permits.



  • Yeah, the first two happened before I think. But at least the first. All the ones from yesterday happened when I tested the igmp proxy…



  • Another one again. But I can only see stuff starting with the prompt most of the time….

    db> bt
    Tracing pid 11 tid 64025 td 0xc2456d80
    rn_match(c0cd4fcc,c2842200,0,0,c23788a8,...) at rn_match+0x17
    pfr_match_addr(c288d9b0,c31e5822,2,c2378894,c2378890,...) at pfr_match_addr+0x63
    pf_test_udp(c2378990,c237898c,1,c2562c00,c278a800,...) at pf_test_udp+0x4db
    pf_test(1,c24fa000,c2378b54,0,0,...) at pf_test+0xbb5
    init_pf_mutex(0,c2378b54,c24fa000,1,0,...) at init_pf_mutex+0x5e6
    pfil_run_hooks(c0cfd140,c2378ba4,c24fa000,1,0,...) at pfil_run_hooks+0x7e
    ip_input(c278a800,246,c24d2ac0,c2378bcc,c06fd9b1,...) at ip_input+0x278
    netisr_dispatch_src(1,0,c278a800,c2378c04,c08e3ecf,...) at netisr_dispatch_src+0x89
    netisr_dispatch(1,c278a800,c24fa000,c24fa000,c31e5808,...) at netisr_dispatch+0x20
    ether_demux(c24fa000,c278a800,3,0,3,...) at ether_demux+0x16f
    ether_vlanencap(c24fa000,c278a800,c2456d80,c2378c5c,c0853f81,...) at ether_vlanencap+0x43f
    ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b
    ucom_attach(c24ab000,0,109,cd9a2d5d,38ea,...) at ucom_attach+0x89d7
    intr_event_execute_handlers(c2436aa0,c2434680,c0b5910d,4f6,c24346f0,...) at intr_event_execute_handlers+0x14b
    intr_getaffinity(c24f9b50,c2378d38,0,0,0,...) at intr_getaffinity+0x14a
    fork_exit(c080dfe0,c24f9b50,c2378d38) at fork_exit+0x90
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xc2378d70, ebp = 0 ---
    db>



  • Do we have any info yet? Today this happened again and I really would like to know what this is. I can simply install 1.2.3 again and wait until 2.0 is out of beta, but I want to help the project. So devs, what could be the problem? Anything else I should check?



  • Would you guys please be so kind to give me an answer. Otherwise it is no fun posting these backtraces…
    Today it happened again...


  • Rebel Alliance Developer Netgate

    Unfortunately with the snapshot server out of commission until the new one is put in place there isn't much to do or try except keep track of the traces.



  • Yeah well but this doesn't answer my question what the real problem is. Or do you think that these problems were silently fixed in a new snapshot?


  • Rebel Alliance Developer Netgate

    It's hard to say with any certainty until someone with more in-depth knowledge of the freebsd kernel (such as ermal) can have a look and see if he can tell what is going on.



  • Yep. That's what I'm waiting for ;)



  • I've looked at a lot of FreeBSD dumps. This sort of problem is sometimes fairly straight forward to find but can also be very difficult to find. It can have a variety of causes including passing the wrong type of data structure to a function and freeing a data structure then reusing it while its being used for another purpose.

    If I was looking at this problem I expect the most useful items of information to me would be

    • a precise identification of the build on the which the problem was observed

    • a way of making it happen, even if it makes it happen only one in four times

    One of the back traces shows:
    ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,…) at ucom_attach+0x542b
    ucom_attach(c24ab000,0,109,cd9a2d5d,38ea,...) at ucom_attach+0x89d7
    The offsets can be misleadng in that static functions don't appear in the symbol table available to the crash time debugger. Since 0x1000 is 4k, 0x89d7 is at least 32k and its pretty unlikely that an attach function would have anything like that amount of code. This offset likely is in some static function whose code starts at a higher address than the code for ucom_attach.

    Another of the reports shows:
    Fatal trap 12: page fault while in kernel mode
    fault virtual address  = 0x72636524
    fault code              = supervisor write, page not present
    instruction pointer    = 0x20:0xc096993c
    stack pointer          = 0x28:0xc2378b10
    frame pointer          = 0x28:0xc2378b64

    If you look at the virtual address you might notice that it could be considered to be printable text: "?ecr" (the ? is for the character who binary representation is 0x24; I don't have the mapping from 0x24 to printable character in my head).  From the reported code it would appear that a data structure referenced by rn_match has a text string where rn_match is expecting it to hold the address of another data structure. The challenge is to find out how that happened.



  • What I see now is that this happens every 3-4 days. So I guess I will do a reboot now every night via cron to see if this then stops until I have better builds…



  • With the daily reboot in place I am not seeing this problem anymore. So what is the status of these problems? Has anyone (ermal) taken a look at the bt's? Is this "problem" fixed in newer snaps?



  • Today this happened again. So I cannot use this workaround :(

    Here is the bt:

    rn_match(c0cd504c,c283fd00,0,c2981718,e2992850,…) at rn_match+0x17
    pfr_match_addr(c288b9b0,c2741034,2,e299283c,e2992838,...) at pfr_match_addr+0x63
    pf_test_tcp(e2992938,e2992934,1,c26c4600,c272bd00,...) at pf_test_tcp+0x4cb
    pf_test(1,c2610400,e2992afc,0,0,...) at pf_test+0x8d2
    init_pf_mutex(0,e2992afc,c2610400,1,0,...) at init_pf_mutex+0x5e6
    pfil_run_hooks(c0cfd1c0,e2992b4c,c2610400,1,0,...) at pfil_run_hooks+0x7e
    ip_input(c272bd00,246,c24d38c0,e2992b74,c06fd9a1,...) at ip_input+0x278
    netisr_dispatch_src(1,0,c272bd00,e2992bac,c08e3f0f,...) at netisr_dispatch_src+0x89
    netisr_dispatch(1,c272bd00,c2610400,c2610400,c274101a,...) at netisr_dispatch+0x20
    ether_demux(c2610400,c272bd00,3,0,3,...) at ether_demux+0x16f
    ether_vlanencap(c2610400,c272bd00,ece0,18,c272bd00,...) at ether_vlanencap+0x43f
    ieee80211_hostap_detach(c2700000,c315a000,c272bd00,c2532480,c2438d80,...) at ieee80211_hostap_detach+0x362
    ieee80211_hostap_detach(c315a000,c272bd00,17,ffffffa0,0,...) at ieee80211_hostap_detach+0x29a7
    ath_suspend(c2514000,1,0,c0ca937c,0,...) at ath_suspend+0x1f67
    taskqueue_run(c251d100,c251d118,0,c0b53f14,0,...) at taskqueue_run+0x132
    taskqueue_thread_loop(c2514270,e2992d38,0,0,0,...) at taskqueue_thread_loop+0x88
    fork_exit(c086b060,c2514270,e2992d38) at fork_exit+0x90
    fork_trampoline() at fork_trampoline+0x8
    --- trap 0, eip = 0, esp = 0xe2992d70, ebp = 0 ---

    Please guys. Give me any info. What else do you need? Does nobody use 2.0-beta1 on Alix boards? Can't be...


  • Rebel Alliance Developer Netgate

    I use 2.0-beta1 on my ALIX but has not crashed on me yet. I haven't passed much traffic through it though as it's just been used for light testing and such.



  • i use 22nd snapshot on alix, hasnt crashed for me till now so might be some hardware issue or something like that



  • I don't think it is hardware related since 1.2.3 is running fine on this box. This just happened now with 2.0-beta1…



  • We're looking into it.



  • Thanks! Is there anyway I can tell pfSense to reboot automatically when it panics? But I guess no :-(
    I think I will go back to 1.2.3 because my girlfriend hates me everytime the internet connection dies and now it is almost daily ;)

    Downgrade is only working by re-flashing? I have an old 1.2.3 conf…



  • I can confirm now that it is definitely not a hardware issue. Now my box is running fine again with 1.2.3. I will use 2.0 again when it has RC status the earliest…

    Thanks



  • I just had what is possibly the same problem on an old Dell OptiPlex GX200 with a dual Intel gigabit card installed.

    LAN IP was completely dead, and I did not have a keyboard so no backtrace:
    em1: watchdog timeout – resetting
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address = 0xe0500a4
    fault code = supervisor read, page not present
    instruction pointer - 0x20:0xc0a63aa7
    stack pointer = 0x28:0xe2c547c4
    frame pointer = 0x28:0xe2c547f0
    code segment = base 0x0, limit 0xfffff, type 0x1b
        DPL 0, pres 1, def32 1, gran 1
    processor eflags = interrupt enabled, resume, IOPL = 0
    current process = 0 (em0 taskq)



  • @Uxorious:

    Stopped at rn_match+0x17: movl 0xc(%eax),%ebx

    It happened again some 20 hours later.
    LAN dead again, but stopped at exactly the same instruction.
    Since writing down the bactrace was too painful, I took a picture instead.




  • @Uxorious:

    I just had what is possibly the same problem on an old Dell OptiPlex GX200 with a dual Intel gigabit card installed.

    LAN IP was completely dead, and I did not have a keyboard so no backtrace:
    em1: watchdog timeout – resetting
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address = 0xe0500a4
    fault code = supervisor read, page not present
    instruction pointer - 0x20:0xc0a63aa7
    stack pointer = 0x28:0xe2c547c4
    frame pointer = 0x28:0xe2c547f0
    code segment = base 0x0, limit 0xfffff, type 0x1b
         DPL 0, pres 1, def32 1, gran 1
    processor eflags = interrupt enabled, resume, IOPL = 0
    current process = 0 (em0 taskq)



  • @ttlinna:

    I've had multiple similar problems. Unfortunately I haven't been able to grab the log since the problems have occured in production environments. Network just stops suddenly working. It can run well for days or just for an hour or so.

    My config includes use of limiters. Is it possible that it causes problems?
    That's just my hunch, since I've got older snapshots running fine without limiters.

    My config is fairly simple.
    WAN and another WAN on OPT.
    A couple NAT/FW rules inbound.
    Nothing else.



  • my alix with 20th feb snapshot works perfect and older versions also have been running stable enough for me for as much as 15 days then its no crash but i usually endup trying newer snapshots.


Locked