    PfSense crashed on Alix

    2.0-RC Snapshot Feedback and Problems - RETIRED
    • jimpJ
      jimp Rebel Alliance Developer Netgate

      It's hard to say with any certainty until someone with more in-depth knowledge of the freebsd kernel (such as ermal) can have a look and see if he can tell what is going on.

      Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

      Need help fast? Netgate Global Support!

      Do not Chat/PM for help!

      • J
        jlepthien

        Yep. That's what I'm waiting for ;)

        | apple fanboy | music lover | network and security specialist | in love with cisco systems |

        • W
          wallabybob

          I've looked at a lot of FreeBSD dumps. This sort of problem is sometimes fairly straightforward to find but can also be very difficult to find. It can have a variety of causes, including passing the wrong type of data structure to a function, or freeing a data structure and then reusing it while it's being used for another purpose.

          If I were looking at this problem, I expect the most useful items of information to me would be:

          • a precise identification of the build on which the problem was observed

          • a way of making it happen, even if it makes it happen only one time in four

          One of the back traces shows:
          ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b
          ucom_attach(c24ab000,0,109,cd9a2d5d,38ea,...) at ucom_attach+0x89d7
          The offsets can be misleading because static functions don't appear in the symbol table available to the crash-time debugger. Since 0x1000 is 4k, 0x89d7 is more than 32k, and it's pretty unlikely that an attach function would have anything like that amount of code. This offset is likely in some static function whose code starts at a higher address than the code for ucom_attach.
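As a rough illustration of that offset check, here is a minimal Python sketch that parses backtrace frames of the form shown above and flags those whose offset looks too large for the named function. The helper names and the 4k cutoff are assumptions made for this example, not anything the debugger provides.

```python
import re

# Matches the tail of a frame such as "... at ucom_attach+0x542b".
FRAME_RE = re.compile(r"\bat\s+(\w+)\+0x([0-9a-fA-F]+)\s*$")

# Hypothetical cutoff: treat an offset of 4k or more as a hint that the
# code really belongs to an unlisted static function.
SUSPICIOUS_OFFSET = 0x1000


def parse_frame(line):
    """Return (symbol, offset) for a backtrace line, or None if it is not a frame."""
    match = FRAME_RE.search(line.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2), 16)


def suspicious_frames(lines, limit=SUSPICIOUS_OFFSET):
    """Return the (symbol, offset) pairs whose offset is at least `limit` bytes."""
    frames = (parse_frame(line) for line in lines)
    return [frame for frame in frames if frame is not None and frame[1] >= limit]


trace = [
    "ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b",
    "ucom_attach(c24ab000,0,109,cd9a2d5d,38ea,...) at ucom_attach+0x89d7",
    "pfil_run_hooks(c0cfd1c0,e2992b4c,c2610400,1,0,...) at pfil_run_hooks+0x7e",
]

for symbol, offset in suspicious_frames(trace):
    print(f"{symbol}+{offset:#x} is {offset} bytes past the start of {symbol}")
```

A large offset proves nothing on its own; it is only a hint that the symbol name in that frame should not be taken at face value.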

          Another of the reports shows:
          Fatal trap 12: page fault while in kernel mode
          fault virtual address  = 0x72636524
          fault code              = supervisor write, page not present
          instruction pointer    = 0x20:0xc096993c
          stack pointer          = 0x28:0xc2378b10
          frame pointer          = 0x28:0xc2378b64

          If you look at the virtual address you might notice that it could be read as printable text: "$ecr" (0x24 is the ASCII code for "$"; the bytes are listed here in little-endian memory order).  From the reported code it would appear that a data structure referenced by rn_match holds a text string where rn_match expects the address of another data structure. The challenge is to find out how that happened.
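The text reading of the fault address can be checked with a few lines of Python. This is only a sketch: the function name is made up for the example, and it assumes a 32-bit little-endian machine such as the i386 boxes in this thread.

```python
import struct


def address_as_text(addr):
    """Show the four bytes of a 32-bit value in little-endian memory order.

    Bytes outside the printable ASCII range are rendered as '.'.
    """
    raw = struct.pack("<I", addr)
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# The fault virtual address from the trap report above.
print(address_as_text(0x72636524))  # prints "$ecr"
```

Note that a faulting address can be a stored value plus a small field offset, so the lowest byte (the first character here) is the least trustworthy part of the decoded text.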

          • J
            jlepthien

            What I see now is that this happens every 3-4 days. So I guess I will reboot every night via cron to see whether that stops it until I have better builds…

            • J
              jlepthien

              With the daily reboot in place I am not seeing this problem anymore. So what is the status of these problems? Has anyone (ermal) taken a look at the bt's? Is this "problem" fixed in newer snaps?

              • J
                jlepthien

                Today this happened again. So I cannot use this workaround :(

                Here is the bt:

                rn_match(c0cd504c,c283fd00,0,c2981718,e2992850,...) at rn_match+0x17
                pfr_match_addr(c288b9b0,c2741034,2,e299283c,e2992838,...) at pfr_match_addr+0x63
                pf_test_tcp(e2992938,e2992934,1,c26c4600,c272bd00,...) at pf_test_tcp+0x4cb
                pf_test(1,c2610400,e2992afc,0,0,...) at pf_test+0x8d2
                init_pf_mutex(0,e2992afc,c2610400,1,0,...) at init_pf_mutex+0x5e6
                pfil_run_hooks(c0cfd1c0,e2992b4c,c2610400,1,0,...) at pfil_run_hooks+0x7e
                ip_input(c272bd00,246,c24d38c0,e2992b74,c06fd9a1,...) at ip_input+0x278
                netisr_dispatch_src(1,0,c272bd00,e2992bac,c08e3f0f,...) at netisr_dispatch_src+0x89
                netisr_dispatch(1,c272bd00,c2610400,c2610400,c274101a,...) at netisr_dispatch+0x20
                ether_demux(c2610400,c272bd00,3,0,3,...) at ether_demux+0x16f
                ether_vlanencap(c2610400,c272bd00,ece0,18,c272bd00,...) at ether_vlanencap+0x43f
                ieee80211_hostap_detach(c2700000,c315a000,c272bd00,c2532480,c2438d80,...) at ieee80211_hostap_detach+0x362
                ieee80211_hostap_detach(c315a000,c272bd00,17,ffffffa0,0,...) at ieee80211_hostap_detach+0x29a7
                ath_suspend(c2514000,1,0,c0ca937c,0,...) at ath_suspend+0x1f67
                taskqueue_run(c251d100,c251d118,0,c0b53f14,0,...) at taskqueue_run+0x132
                taskqueue_thread_loop(c2514270,e2992d38,0,0,0,...) at taskqueue_thread_loop+0x88
                fork_exit(c086b060,c2514270,e2992d38) at fork_exit+0x90
                fork_trampoline() at fork_trampoline+0x8
                --- trap 0, eip = 0, esp = 0xe2992d70, ebp = 0 ---

                Please guys. Give me any info. What else do you need? Does nobody use 2.0-beta1 on Alix boards? Can't be...

                • jimpJ
                  jimp Rebel Alliance Developer Netgate

                  I use 2.0-beta1 on my ALIX, but it has not crashed on me yet. I haven't passed much traffic through it, though, as it's just been used for light testing and such.

                  • X
                    xbipin

                    I use the snapshot from the 22nd on an ALIX and it hasn't crashed for me so far, so it might be a hardware issue or something like that.

                    • J
                      jlepthien

                      I don't think it is hardware related since 1.2.3 is running fine on this box. This just happened now with 2.0-beta1…

                      • S
                        sullrich

                        We're looking into it.

                        • J
                          jlepthien

                          Thanks! Is there any way I can tell pfSense to reboot automatically when it panics? But I guess not :-(
                          I think I will go back to 1.2.3 because my girlfriend hates me every time the internet connection dies, and now it is almost daily ;)

                          Is downgrading only possible by re-flashing? I have an old 1.2.3 conf…

                          • J
                            jlepthien

                            I can confirm now that it is definitely not a hardware issue. My box is running fine again with 1.2.3. I will use 2.0 again when it has RC status at the earliest…

                            Thanks

                            • U
                              Uxorious

                              I just had what is possibly the same problem on an old Dell OptiPlex GX200 with a dual Intel gigabit card installed.

                              LAN IP was completely dead, and I did not have a keyboard so no backtrace:
                              em1: watchdog timeout -- resetting
                              Fatal trap 12: page fault while in kernel mode
                              cpuid = 0; apic id = 00
                              fault virtual address = 0xe0500a4
                              fault code = supervisor read, page not present
                              instruction pointer = 0x20:0xc0a63aa7
                              stack pointer = 0x28:0xe2c547c4
                              frame pointer = 0x28:0xe2c547f0
                              code segment = base 0x0, limit 0xfffff, type 0x1b
                                  DPL 0, pres 1, def32 1, gran 1
                              processor eflags = interrupt enabled, resume, IOPL = 0
                              current process = 0 (em0 taskq)

                              • U
                                Uxorious

                                @Uxorious:

                                Stopped at rn_match+0x17: movl 0xc(%eax),%ebx

                                It happened again some 20 hours later.
                                LAN dead again, but stopped at exactly the same instruction.
                                Since writing down the backtrace was too painful, I took a picture instead.

                                IMG_1719.JPG

                                • T
                                  ttlinna

                                  @Uxorious:

                                  I just had what is possibly the same problem on an old Dell OptiPlex GX200 with a dual Intel gigabit card installed.

                                  LAN IP was completely dead, and I did not have a keyboard so no backtrace:
                                  em1: watchdog timeout -- resetting
                                  Fatal trap 12: page fault while in kernel mode
                                  cpuid = 0; apic id = 00
                                  fault virtual address = 0xe0500a4
                                  fault code = supervisor read, page not present
                                  instruction pointer = 0x20:0xc0a63aa7
                                  stack pointer = 0x28:0xe2c547c4
                                  frame pointer = 0x28:0xe2c547f0
                                  code segment = base 0x0, limit 0xfffff, type 0x1b
                                       DPL 0, pres 1, def32 1, gran 1
                                  processor eflags = interrupt enabled, resume, IOPL = 0
                                  current process = 0 (em0 taskq)

                                  • U
                                    Uxorious

                                    @ttlinna:

                                    I've had multiple similar problems. Unfortunately I haven't been able to grab the log, since the problems have occurred in production environments. The network just suddenly stops working. It can run well for days or just for an hour or so.

                                    My config includes the use of limiters. Is it possible that they cause the problems?
                                    That's just my hunch, since I've got older snapshots running fine without limiters.

                                    My config is fairly simple.
                                    WAN and another WAN on OPT.
                                    A couple NAT/FW rules inbound.
                                    Nothing else.

                                    • X
                                      xbipin

                                      My ALIX with the 20th Feb snapshot works perfectly, and older versions have also been stable enough for me, running for as long as 15 days without a crash, but I usually end up trying newer snapshots.

                                      • U
                                        Uxorious

                                        @Uxorious:

                                        Stopped at rn_match+0x17: movl 0xc(%eax),%ebx

                                        It happened again some 20 hours later.
                                        LAN dead again, but stopped at exactly the same instruction.
                                        Since writing down the backtrace was too painful, I took a picture instead.

                                        For the past 5 days I have been running completely stable on 1.2.3 using the same hardware and configuration (recreated configuration since downgrading is not possible).

                                        Something bad is happening in 2.0 for sure…

                                        • E
                                          eri--

                                          Please describe your configuration or, better, send your config.xml so we can investigate further.

                                          • C
                                            computor

                                            I think I'm having a similar issue (trap 12s once or twice a day, more if torrenting, etc.).  I thought it was a HW failure at first, but this crash has followed me through 3 different boxes (a dual PIII, a single P4, and a dual Opteron blade).  I'm using nanoBSD and have upgraded several times to the latest snapshot; I'm probably a week out of date at most at the moment.  The faulting process is usually one of the NIC drivers (I don't think it's a driver problem, since I've seen it on em, fxp, and bge), but once it was the openvpn process.  I have reflashed the card with a fresh nanobsd image a few times, so I don't think it's corruption.

                                            I have a second box hooked up to the serial console logging everything it prints, and at this point it has captured over a dozen such crashes (and the subsequent reboots).  As such, the log is quite large; I can email it to you if you want, Chris.  At least a couple of the crashes include backtraces.

                                            Will M.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.