Netgate Discussion Forum

    May 2nd Snapshot doesn't work, breaks everything! Beware

    2.4 Development Snapshots
    57 Posts 15 Posters 10.5k Views
    • L
      LostInIgnorance

      I would love to send in logs as I have a 4m CSV dump from my syslog server, but still I have not been told where to send them. As they are raw dumps, I am not posting them into the forums but would gladly send them to one of the developers.

      • jimpJ
        jimp Rebel Alliance Developer Netgate

        I don't need 4M worth of records, and I don't have time to sort through all of that. Just the last dozen or so lines of each log file are sufficient.

        I think we have a lead on part of the problem. I pushed a fix for one potential path that could break it, but there is one other that I haven't tracked down yet.

        https://redmine.pfsense.org/issues/8504

        More interesting to me now than logs are two things:

        1. The <gateways> section of your configuration(s) before and after upgrade, or at least after. You can redact IP addresses but do not alter anything else.
        2. Whether or not you have a default route for IPv4 or IPv6 in "netstat -rnW" after upgrade.
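
The default-route check in item 2 can be scripted. The following is a minimal sketch that assumes the usual FreeBSD routing-table layout ("Internet:" and "Internet6:" section headers, destination in the first column). The function name and the sample table are made up for illustration and are not output from any system in this thread.

```python
def default_routes(netstat_output: str) -> dict:
    """Report whether a 'default' destination exists per address family."""
    found = {"IPv4": False, "IPv6": False}
    family = None
    for line in netstat_output.splitlines():
        stripped = line.strip()
        if stripped == "Internet:":
            family = "IPv4"
        elif stripped == "Internet6:":
            family = "IPv6"
        elif family and stripped.split()[:1] == ["default"]:
            # First column is the destination; "default" is the default route.
            found[family] = True
    return found


# Illustrative sample only (documentation address range), not real output.
SAMPLE = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            198.51.100.1       UGS        igb0
127.0.0.1          link#4             UH          lo0

Internet6:
Destination        Gateway            Flags     Netif Expire
::1                link#4             UH          lo0
"""

if __name__ == "__main__":
    print(default_routes(SAMPLE))
```

To use it on a real system, save the output of "netstat -rnW" to a file and pass the file contents to default_routes(). A False for either family means that family has no default route.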

        Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

        Need help fast? Netgate Global Support!

        Do not Chat/PM for help!

        • jimpJ
          jimp Rebel Alliance Developer Netgate

          OK, there are at least three separate issues here from the looks of it:

          0. Harmless route errors spamming the console/logs https://redmine.pfsense.org/issues/8497 (Fixed now)
          1. An issue with the upgrade code not converting and handling default gateways properly in some cases https://redmine.pfsense.org/issues/8504 (Also fixed)
          2. An issue where certain DHCP WANs (igb interfaces at least) constantly link cycle which leads to all sorts of other symptoms (services not running, IP addresses/routes missing, GUI inaccessible, etc) https://redmine.pfsense.org/issues/8506

          We're still working on that last one.

          Now what I need to know is:

          • What hardware are you running where this is happening?
          • What type of network interface is it happening to? (Both systems here, and the logs posted in the thread are all igb, but we don't know if that's a coincidence or not)
          • Check "clog /var/log/system.log | grep link" and/or "dmesg | grep link" output to see if the link is flapping

          • T
            tmushy

            Updated to the latest beta and still getting issues.
            I'm using a Qotom box.

            May 11 17:55:36 pfSense php-fpm[22628]: /rc.linkup: DEVD Ethernet attached event for wan
            May 11 17:55:36 pfSense php-fpm[22628]: /rc.linkup: HOTPLUG: Configuring interface wan
            May 11 17:55:37 pfSense kernel: igb0: link state changed to UP
            May 11 17:55:37 pfSense kernel: igb0: link state changed to DOWN
            May 11 17:55:42 pfSense kernel: igb0: link state changed to UP
            May 11 17:55:43 pfSense php-fpm[22628]: /rc.linkup: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1526086543] unbound[66133:0] error: bind: address already in use [1526086543] unbound[66133:0] fatal error: could not open ports'
            May 11 17:55:43 pfSense kernel: igb0: link state changed to DOWN
            May 11 17:55:45 pfSense php-fpm[71870]: /rc.linkup: DEVD Ethernet detached event for wan

            It's just looping the same thing over and over.
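
The looping in a log like the one above can be counted rather than eyeballed. This is a rough sketch that matches only the "link state changed to UP/DOWN" kernel lines shown in this post. The function names and the threshold of 4 changes are arbitrary assumptions, not anything pfSense itself uses.

```python
import re
from collections import Counter

# Matches kernel lines such as "kernel: igb0: link state changed to UP".
LINK_RE = re.compile(r"kernel: (\w+): link state changed to (UP|DOWN)")


def count_link_changes(log_text: str) -> Counter:
    """Count link state changes per interface name."""
    changes = Counter()
    for match in LINK_RE.finditer(log_text):
        iface, _state = match.groups()
        changes[iface] += 1
    return changes


def flapping(log_text: str, threshold: int = 4) -> list:
    """Interfaces with at least `threshold` changes (threshold is a guess)."""
    counts = count_link_changes(log_text)
    return sorted(iface for iface, n in counts.items() if n >= threshold)


# The four kernel lines quoted in the post above.
SAMPLE = """\
May 11 17:55:37 pfSense kernel: igb0: link state changed to UP
May 11 17:55:37 pfSense kernel: igb0: link state changed to DOWN
May 11 17:55:42 pfSense kernel: igb0: link state changed to UP
May 11 17:55:43 pfSense kernel: igb0: link state changed to DOWN
"""

if __name__ == "__main__":
    print(count_link_changes(SAMPLE))
    print(flapping(SAMPLE))
```

Four changes on igb0 within six seconds is the kind of pattern worth reporting. A single UP at boot is not.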

            • L
              LostInIgnorance

              JimP, let us know when we can begin testing snapshots again as I can't keep rebuilding and restoring my firewall.

              • jimpJ
                jimp Rebel Alliance Developer Netgate

                @LostInIgnorance:

                JimP, let us know when we can begin testing snapshots again as I can't keep rebuilding and restoring my firewall.

                Which is why you don't run snapshots on important production firewalls, at least not without proper lab testing first.

                No progress since my last post except that an additional issue has been found:

                3. Interface MTU being set incorrectly in some cases https://redmine.pfsense.org/issues/8507 - This can lead to what appears to be partially working connectivity. Some sites will load, others will fail, and some may partially work and be partially broken due to resources that can't be fetched. Browsers may return a blank page rather than an error, or fail to fetch links at all.

                • L
                  LostInIgnorance

                  JimP, this is not an important firewall. It is only used for my home environment, but I get to listen to my wife complain about not being able to get online. More of an annoyance to reload than it is anything else. Let me know if there are more logs or testing you need on this.

                  • jimpJ
                    jimp Rebel Alliance Developer Netgate

                    @LostInIgnorance:

                    I get to listen to my wife complain about not being able to get online.

                    If it's carrying your wife's traffic then that is THE very definition of an important production firewall :-)

                    @LostInIgnorance:

                    Let me know if there are more logs or testing you need on this.

                    I think we have an OK grasp of the general issues at the moment but a lack of leads on where the problem lies. So far all I've seen are symptoms and not the root cause yet, but since it's so tricky to reproduce in a lab setup it's a pain to try to dig into it for any length of time.

                    • L
                      LostInIgnorance

                      JimP, I think you're on to something with the MTU size. I can tell you that the interface that is connecting (igb2) shows a default gateway and an IP, which then disappear from the "netstat -rnW" output.
                      I am also available after 6 p.m. CST if you would like remote access. As this appliance is a mirror of the C2758 Atom you used to sell, I am hoping there are not too many people that will experience this issue.

                      slog.jpg

                      • jimpJ
                        jimp Rebel Alliance Developer Netgate

                        The next round of snapshots should be better here. It was related to the MTU. Turns out in 11.2, FreeBSD improved dhclient so it could handle the MTU, but it took the upstream MTU unconditionally and had no way to ignore the value. In each case I've seen so far, the ISP has sent a bogus MTU back which caused two things:

                        1. On e1000 and some other drivers, setting the MTU causes the link to go down and back up, which triggers the interface event scripts, which restarted dhclient, which set the MTU again, which made the link go down and back up, repeat, repeat, repeat, boom.
                        2. On other drivers, the MTU would be set to this value but it may not have been right. In my case and for others, this was a stupid low value like 576 which meant some sites would work and others would fail or be half broken.

                        We have a patch in the tree now from a FreeBSD dev, which will be in the next set of snapshots, that lets us ignore the incoming MTU with a supersede in the dhclient config (which I also added in the tree). Hopefully this will return sanity to cases affected by these issues.
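
For readers unfamiliar with dhclient's "supersede" statement, here is a hedged sketch of what such a config stanza could look like, rendered from Python. The interface block and the "supersede <option> <value>;" statement are standard dhclient.conf syntax, and "interface-mtu" is the standard name of the DHCP MTU option. The value 0 as the "ignore the server's MTU" marker is an assumption, since the thread does not state which value the patch uses. The function name is hypothetical and this is not the actual pfSense code.

```python
def render_dhclient_stanza(interface: str, ignore_server_mtu: bool = True) -> str:
    """Render a minimal dhclient.conf interface block as text."""
    lines = [f'interface "{interface}" {{']
    if ignore_server_mtu:
        # ASSUMPTION: 0 is treated as "do not apply the MTU from the lease".
        # The thread only says a supersede was added, not which value.
        lines.append("\tsupersede interface-mtu 0;")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dhclient_stanza("igb0"), end="")
```

A supersede makes the client use the locally configured value in place of whatever the DHCP server sends, which is what breaks the set-MTU, link-bounce, restart-dhclient loop described above.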

                        • D
                          Dazog

                          @jimp:

                          The next round of snapshots should be better here. It was related to the MTU. Turns out in 11.2, FreeBSD improved dhclient so it could handle the MTU, but it took the upstream MTU unconditionally and had no way to ignore the value. In each case I've seen so far, the ISP has sent a bogus MTU back which caused two things:

                          1. On e1000 and some other drivers, setting the MTU causes the link to go down and back up, which triggers the interface event scripts, which restarted dhclient, which set the MTU again, which made the link go down and back up, repeat, repeat, repeat, boom.
                          2. On other drivers, the MTU would be set to this value but it may not have been right. In my case and for others, this was a stupid low value like 576 which meant some sites would work and others would fail or be half broken.

                          We have a patch in the tree now from a FreeBSD dev, which will be in the next set of snapshots, that lets us ignore the incoming MTU with a supersede in the dhclient config (which I also added in the tree). Hopefully this will return sanity to cases affected by these issues.

                          Latest Build fixes issues with my DHCP WAN connection.

                          Bug is squashed.

                          Thank you for the hard work.

                          • jimpJ
                            jimp Rebel Alliance Developer Netgate

                            @Dazog:

                            Latest Build fixes issues with my DHCP WAN connection.

                            Bug is squashed.

                            Thank you for the hard work.

                            Did you have the link cycling issue, the MTU issue, or both?

                            • pfSenseTestP
                              pfSenseTest

                              @jimp:

                              Did you have the link cycling issue, the MTU issue, or both?

                              I had the link cycling issue on the Netgate MBT-4220 system and the latest snapshot fixed it.

                              2x SG-5100 | MBT-4220 (retired) | SG-1000 (retired)

                              • D
                                Dazog

                                @jimp:

                                @Dazog:

                                Latest Build fixes issues with my DHCP WAN connection.

                                Bug is squashed.

                                Thank you for the hard work.

                                Did you have the link cycling issue, the MTU issue, or both?

                                Cycling Issue.

                                • w0wW
                                  w0w

                                  I am not sure if it's related, but after upgrading from the 20 Apr snapshot to 19 May I lost connectivity to the internet. It shows that the PPPoE WAN is up and running, but I cannot ping any IP on the internet from pfSense or LAN. I don't see anything unusual in the logs except messages that pkg cannot reach servers. Rolling back the ZFS snapshot restores the connection immediately.

                                  P.S. It looks like at some stage it got connected, because it shows my dynamic DNS as updated and it reinstalled packages once, but it cannot get the package list anymore. Ping to 8.8.8.8 shows 100% loss, and traceroute does not even start to trace.

                                  What else can I do to analyze it?

                                  • jimpJ
                                    jimp Rebel Alliance Developer Netgate

                                    @w0w:

                                    I am not sure if it's related, but after upgrading from the 20 Apr snapshot to 19 May I lost connectivity to the internet. It shows that the PPPoE WAN is up and running, but I cannot ping any IP on the internet from pfSense or LAN. I don't see anything unusual in the logs except messages that pkg cannot reach servers. Rolling back the ZFS snapshot restores the connection immediately.

                                    P.S. It looks like at some stage it got connected, because it shows my dynamic DNS as updated and it reinstalled packages once, but it cannot get the package list anymore. Ping to 8.8.8.8 shows 100% loss, and traceroute does not even start to trace.

                                    What else can I do to analyze it?

                                    That doesn't quite sound like it's related to anything in this thread. Start a new thread and post details of your setup there, at least show the routing table and what the gateways list/page looks like, maybe the config.xml entries for the gateways you have configured. You can redact any IP addresses in that info.

                                    • w0wW
                                      w0w

                                      OK jimp.

                                      • jimpJ
                                        jimp Rebel Alliance Developer Netgate

                                        Looks like we can consider all of these issues resolved as far as I can see. Every system I'm aware of that has tried the updated code is working properly now.

                                        • T
                                          tmushy

                                          I can confirm the latest snapshot has indeed fixed all my issues!
                                          Thank you for resolving this. Working great now.

                                          • L
                                            LostInIgnorance

                                            JimP, I am sorry I didn't get to it earlier, but I was out of town. I just upgraded this morning and everything is working correctly as it should. Thanks for getting this fixed.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.