Netgate Discussion Forum

    First hard crash in years on pfSense

    General pfSense Questions
    46 Posts 6 Posters 6.6k Views
    • stephenw10 Netgate Administrator

      Ah well the backtrace and/or message buffer might reveal something.

      • keyser Rebel Alliance @stephenw10

        @stephenw10 Could I ask you to have a look at it? You strike me as the most knowledgeable person on this subject :-)

        Is there a way to PM you the entire crash dump directly? There is a lot of text, and vetting it for IPs and other secrets will take a long time...

        On another note: if the Error 2 actually implies a parity error on the microcode ROM, did the CPU microcode get updated with 23.01? Can it be re-flashed to perhaps solve read issues (decay)?
        Could something along the lines of a complete power-off be needed to "reset" something? I haven't had it completely powered off since the update.
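
For the vetting step mentioned above, a rough first pass can be scripted. The following is a minimal Python sketch, not an official Netgate tool: the function name `redact_ipv4` and the placeholder string are made up for illustration, and it only masks dotted-quad IPv4 addresses. IPv6 addresses, MAC addresses, hostnames and keys would still need a manual check.

```python
import re

# Matches four dot-separated groups of 1-3 digits. This is deliberately loose:
# it does not check that each group is <= 255, so it may also mask
# version-like strings such as "1.2.3.4".
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_ipv4(text: str, placeholder: str = "x.x.x.x") -> str:
    """Return text with every IPv4-looking token replaced by placeholder."""
    return IPV4_PATTERN.sub(placeholder, text)
```

Running each text file of the dump through `redact_ipv4` before uploading narrows the manual review down to whatever the pattern cannot catch.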

        Love the no fuss of using the official appliances :-)

        • stephenw10 Netgate Administrator

          Sure, if you upload it here I can review it:
          https://nc.netgate.com/nextcloud/index.php/s/6AQJnY7bDd9KZym

          The CPU microcode gets updated at boot by pfSense. If that was actually using bad code I'd expect to see significantly more issues.

          • keyser Rebel Alliance @stephenw10

            @stephenw10 Thank you Stephen. The files are uploaded now.
            I’m very grateful, and very curious whether you can see anything more specific than just defective hardware.

            I’ll still be reverting to 22.05 for now - I just want to make sure/know if this is related to the 23.01 install.

            • stephenw10 Netgate Administrator

              Mmm, unfortunately there's nothing further shown there. Everything looks fine until it throws the MCA errors, which also points to a hardware issue.
              How often does it panic like that?

              I suspect it will also panic in 22.05. I would be very interested to find out.

              Steve
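
To find the point where the MCA errors begin in a long message buffer, the relevant lines can be filtered out. A minimal Python sketch follows; it assumes the kernel labels those lines with an `MCA:` prefix (adjust `marker` if your dump differs), and `mca_lines` is a hypothetical helper name, not part of pfSense.

```python
def mca_lines(message_buffer: str, marker: str = "MCA:") -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for every line containing marker."""
    hits = []
    # Line numbers start at 1 so they match what a text editor shows.
    for number, line in enumerate(message_buffer.splitlines(), start=1):
        if marker in line:
            hits.append((number, line.strip()))
    return hits
```

The line numbers make it easy to jump to the first hit and read what was logged just before it.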

              • keyser Rebel Alliance @stephenw10

                @stephenw10 Thank you for taking a look.
                It has panicked twice: the first time about 7 days after going to 23.01, and then again about 14 days after that.
                It was 100% stable on 22.01/22.05 for 1½ years before going to 23.01.

                • stephenw10 Netgate Administrator

                  Well if it is stable in 22.05 again that would be a very interesting result. I suspect it was just coincidence though.

                  • keyser Rebel Alliance @stephenw10

                    @stephenw10 You might be right - it's just a very statistically unlikely coincidence then :-)

                    QUESTIONS:
                    1: Could 23.01 have a problem with either the NVMe SSD I installed or the SFP modules I’m using that could cause MCAs like this?

                    2: Is there a theoretical situation where this issue is something that needs to be cleared/reset by a full power off (no power applied) for a short while?

                    • stephenw10 Netgate Administrator

                      1. It's possible. You may be the only user with that combination of hardware. Though the error doesn't indicate that directly.

                      2. We have seen bad SFP modules put a NIC into a state that requires a full power cycle to clear. Not on a 6100 though as far as I know.

                      • keyser Rebel Alliance @stephenw10

                        @stephenw10 Thanks. Based on that, my course of action is now as follows:

                        1: Keep it running on 22.05 as it is now, with 6 weeks considered the success criterion based on the 8- and 14-day MCA intervals on 23.01.

                        • If it crashes on 22.05 as well, the next test will be a full power-off for a while, then resuming 23.01 to see if it fails again (which would mean dead hardware).

                        • If it does not crash on 22.05, I'll replace both my SFPs with RJ45 connections, using my switch for fiber termination in a closed untagged VLAN, and resume testing 23.01.

                        Does that not sound like the most conclusive way to go from here?
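
The six-week criterion can be sanity-checked with some rough arithmetic. The sketch below is in Python and assumes, purely for illustration, that crashes on 23.01 arrive at random with a constant average rate (an exponential waiting-time model) estimated from the two observed intervals of 8 and 14 days. Two data points make this a very crude estimate, so treat the output as an order-of-magnitude guide only.

```python
import math


def prob_no_crash(days: float, observed_intervals: list[float]) -> float:
    """Probability of `days` without a crash under an exponential model."""
    # Mean time between crashes, estimated from the observed intervals.
    mean_interval = sum(observed_intervals) / len(observed_intervals)
    return math.exp(-days / mean_interval)


intervals_on_2301 = [8, 14]  # days between panics, as reported in this thread
six_weeks = prob_no_crash(42, intervals_on_2301)
print(f"Chance of 42 crash-free days if the fault were still present: {six_weeks:.1%}")
```

Under these assumptions the chance comes out at roughly 2%, so a clean six weeks on 22.05 would be hard to explain as luck if the box were still as crash-prone as it was on 23.01.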

                        • stephenw10 Netgate Administrator

                          Yes, that would be a great test if you can do it.

                          • keyser Rebel Alliance @stephenw10

                            @stephenw10 said in First hard crash in years on pfSense:

                            Yes, that would be a great test if you can do it.

                            Status report: The current uptime on 22.05 without issues is now 27 days.

                            • Cool_Corona @keyser

                              @keyser Take 23.01 and run it in a VM on the same hardware.

                              Then you will know for sure. I bet it doesn't crash with MCA errors.

                              I bet it's driver-related in 23.01.

                              • stephenw10 Netgate Administrator

                                Yes, we have seen new drivers enable some piece of hardware that then triggers MCA errors. So although it is a hardware problem it's only a problem if you enable that hardware.
                                But that's unlikely in the 6100 because there are so many out there running 23.01. Unless there is any additional hardware in it. And the only things I see in the boot log are USB devices which shouldn't be capable of this.

                                • keyser Rebel Alliance @stephenw10

                                  @stephenw10 said in First hard crash in years on pfSense:

                                  Yes, we have seen new drivers enable some piece of hardware that then triggers MCA errors. So although it is a hardware problem it's only a problem if you enable that hardware.
                                  But that's unlikely in the 6100 because there are so many out there running 23.01. Unless there is any additional hardware in it. And the only things I see in the boot log are USB devices which shouldn't be capable of this.

                                  Yeah, that's my thinking as well. I am using a couple of SFP transceivers (one of which is a BiDi), and I have installed my own NVMe SSD. I have also connected a USB serial cable, which gives me a console backdoor to my Raspberry Pi in case it goes down. Lastly, there is an Eaton USB UPS connected via NUT.

                                  If it stays solid on 22.05 then, apart from the full power-off I planned earlier, should I remove e.g. the USB-to-serial cable? Which of my "anomalies" do you consider most likely to cause a driver issue that can MCA the box?

                                  • stephenw10 Netgate Administrator

                                    Do you have the new BlinkBoot version installed? Not that I'm aware of anything in it that would affect this.

                                    Of those things the NVMe drive is most likely to cause a problem since it's a PCIe device. No USB device should be able to cause that sort of error IMO. But it's easy to remove them as a test.

                                    Steve

                                    • keyser Rebel Alliance @stephenw10

                                      @stephenw10 said in First hard crash in years on pfSense:

                                      Do you have the new blinkboot version installed? Not that I'm aware of anything in it that would affect this.

                                      Of those things the NVMe drive is most likely to cause a problem since it's a PCIe device. No USB device should be able to cause that sort of error IMO. But it's easy to remove them as a test.

                                      Steve

                                      I do - as far as I remember I had that installed for quite a while before the 23.01 upgrade.

                                      Regarding the SSD: if it stays solid and passes my 6-week 22.05 period, I’ll try the full power off/on, remove the USB serial cable, and give it another spin. If it fails again, I’ll look into running for a period without the SSD.
                                      Thanks for sharing :-)

                                      • stephenw10 Netgate Administrator

                                        There's a new BlinkBoot version that was just released in a new Netgate Firmware Update package CORDOBA-03.00.00.03t. I don't believe it will do anything here but it would be an easy test. The updated package is only in 23.01 though.

                                        • keyser Rebel Alliance @stephenw10

                                          @stephenw10 Noted🙏

                                          • keyser Rebel Alliance @stephenw10

                                            @stephenw10 Well, my six-week test period has now concluded, and the box was completely stable on 22.05 throughout.

                                            501e085b-2417-4ac5-9ab7-8f3a426eda90-image.png

                                            So tomorrow I’ll give a full power-off plus disconnecting my USB serial device a spin, and let it boot on 23.01 again.

                                            Here’s crossing my fingers that this will cut it. Otherwise I’ll have to start testing without my SSD and my SFP optics.

                                            I get that FreeBSD 14 may theoretically use some region of memory or cache, or a new instruction, that 12.3 does not, and thus hit an actual hardware error that 12.3 just never triggers. But I find that pretty unlikely…

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.