Netgate Discussion Forum

    Wan failover doesn’t work and bigger problems

    General pfSense Questions
    25 Posts 5 Posters 2.6k Views
      idiotzoo @SethGko23

      @sethgko23 Ah, sounds like a different problem. In my case nothing routes at all until I re-run the setup wizard.

        stephenw10 Netgate Administrator

        The first thing I would check in the status file is the routing table. Make sure there is a default route and it's valid.
        Then I would check the state table file and make sure there are some states. If the ruleset is not loading there will be no open states.
        Then the system log and the dmesg (message buffer).
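For anyone who wants to script those first two checks, here is a minimal sketch. It assumes you have copied the routing table text (in `netstat -rn` style) and the state table dump (one state per line) out of the status file yourself. The function names and the sample data are made up for illustration and are not part of pfSense.

```python
def find_default_routes(netstat_output):
    """Return (gateway, interface) for each 'default' row in netstat -rn style text."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Rows look like: Destination Gateway Flags Netif [Expire]
        if len(fields) >= 4 and fields[0] == "default":
            routes.append((fields[1], fields[3]))
    return routes


def count_states(state_dump):
    """Count non-empty lines in a state table dump (assumes one state per line)."""
    return sum(1 for line in state_dump.splitlines() if line.strip())


# Made-up sample data (documentation address ranges), for illustration only.
SAMPLE_ROUTES = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            192.0.2.1          UGS     mvneta2
127.0.0.1          link#4             UH          lo0
192.0.2.0/24       link#3             U       mvneta2
"""

SAMPLE_STATES = """\
all tcp 192.0.2.10:51514 -> 198.51.100.7:443       ESTABLISHED:ESTABLISHED
all udp 192.0.2.10:123 -> 198.51.100.9:123       MULTIPLE:MULTIPLE
"""

if __name__ == "__main__":
    print("default routes:", find_default_routes(SAMPLE_ROUTES))
    print("open states:", count_states(SAMPLE_STATES))
```

An empty list of default routes or a state count of zero would point at the routing table or the ruleset respectively. Finding a default route only shows that one exists; it does not prove the gateway is actually reachable.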

        Steve

          idiotzoo @stephenw10

          @stephenw10 Hi Steve,

          I've captured the status in fault state and after running the wizard to get things going. It now looks like this issue can be reproduced - the site has just had some electrical work done and it seems to break after a power off.

          Routing table looks fine. It's identical when things are working.

          There are states.

          I can't see anything obvious in the system log, not sure what I'm looking for though.

          Likewise the message buffer has no smoking gun... although there is this bit (in bold)

          Configuring LAN interface...done.
          Configuring GUESTWIFI interface...done.
          Configuring BEELINE_WAN interface...done.
          Configuring IPsec VTI interfaces...done.
          Configuring CARP settings...done.
          Syncing OpenVPN settings...
          tun1: changing name to 'ovpns1'
          pid 379 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
          Segmentation fault (core dumped)

          Starting CRON... done.
          Starting package lldpd...done.
          Starting package Traffic Totals...done.
          Starting package OpenVPN Client Export Utility...done.
          Netgate pfSense Plus 21.05-RELEASE arm Tue Jun 01 16:52:45 EDT 2021
          Bootup complete
          pflog0: promiscuous mode enabled

          There's no sea of errors. Everything looks fine but it doesn't work. When it's in the broken state I've tried making firewall and NAT rule changes to trigger a reload. The only thing that seems to make it work again is running the setup wizard.

            stephenw10 Netgate Administrator

            @idiotzoo said in Wan failover doesn’t work and bigger problems:

            pid 379 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
            Segmentation fault (core dumped)

            That's a pretty big smoking gun! If php crashes out at boot any number of the start up scripts may not have run.
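Since a line like that is easy to miss in a long message buffer, a small sketch of how one might scan a saved dmesg for crashed processes follows. The regular expression is modelled only on the single line quoted above, so treat the pattern as an assumption about the general format. `find_crashes` is a hypothetical helper name.

```python
import re

# Pattern modelled on the one crash line quoted in this thread.
CRASH_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<name>[^)]+)\), jid \d+, uid \d+: "
    r"exited on signal (?P<signal>\d+)(?P<core> \(core dumped\))?"
)


def find_crashes(log_text):
    """Return one dict per 'exited on signal' line found in the log text."""
    crashes = []
    for line in log_text.splitlines():
        match = CRASH_RE.search(line)
        if match:
            crashes.append({
                "pid": int(match["pid"]),
                "name": match["name"],
                "signal": int(match["signal"]),
                "core_dumped": match["core"] is not None,
            })
    return crashes


# Excerpt of the boot output posted earlier in the thread.
BOOT_LOG = """\
Syncing OpenVPN settings...
tun1: changing name to 'ovpns1'
pid 379 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
Segmentation fault (core dumped)
Starting CRON... done.
"""

if __name__ == "__main__":
    for crash in find_crashes(BOOT_LOG):
        print(crash)
```

Signal 11 is SIGSEGV, which matches the "Segmentation fault" line that follows it in the log.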

            I would still expect to see something configured incorrectly in the status file. Are you able to send it to me to review?

            Just to be clear this only happens after a power outage and not ever after a normal reboot or shutdown/boot cycle?

            Steve

              idiotzoo @stephenw10

              @stephenw10 ha… so I highlighted the right thing.

              I can certainly share the status files. Just didn’t want to post them publicly.

              This first occurred when I initiated a reboot, so that should have been a polite shutdown. The rest of the reboots have been less so.

                stephenw10 Netgate Administrator

                Ok, well if you can PM me a link that would work. Otherwise you could open a ticket and mark it for my attention. I can review it when time allows.

                Steve

                  idiotzoo @stephenw10

                  @stephenw10 Thanks Steve, I've sent you a PM with the status details. Will be sure to update folks here with the outcome either way.

                    stephenw10 Netgate Administrator

                    The logs show it was only up for ~2h before it stopped routing. Is that correct?

                    Three things jump out there:

                    You have SSH open to the world so your logs are full of random SSH login attempts and the resulting blocks.

                    Only the VoIP gateway group is actually failover. Most of your traffic is being policy routed via gateway groups that only have one gateway in them, is that intentional? Something you were doing for testing?

                    This is an SG-3100 running 21.05-REL. I should have spotted that before but I'd assumed you were on the latest version. You should upgrade to 21.05.2, there were specific fixes for the 3100 in 21.05.1 that addressed the php crash you're seeing and that could easily explain everything you're seeing.
                    https://docs.netgate.com/pfsense/en/latest/releases/21-05-1.html

                    Steve

                      idiotzoo @stephenw10

                      @stephenw10 Thanks Steve, appreciate you taking a look at it. In this case, I suspect that uptime is probably about right. Electrical safety tests meant things kept being switched off. The second power outage was unexpected when a circuit tripped out.

                      SSH being open to the world is certainly suboptimal, but it's been my only way in to make the system work after a power outage without a 2 hour drive. Ordinarily I would not do this.

                      As you spotted, the non-VoIP group is not using failover, basically because I couldn't make failover work properly. It would fail over but then stick and not switch back, or so it seemed. I have more testing to do there.

                      I'll update. The last time I ran an update everything died and I had a 2 hour drive to deal with it, hence I've resisted an update that I fear might kill the system or at least my access to it.

                        stephenw10 Netgate Administrator

                        Hmm, well I would use a gateway directly rather than a group with one gateway in it.
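To illustrate why a one-gateway group gives no failover, here is a simplified model of tiered gateway selection: use the lowest-numbered tier that still has an online member. This is a sketch of the idea only, not pfSense's actual code, and the gateway names `WAN_A` and `WAN_B` are hypothetical.

```python
def pick_gateways(group, online):
    """Return the online gateways in the lowest-numbered tier that has any.

    group:  dict mapping gateway name -> tier number (1 is preferred)
    online: set of gateway names currently considered up
    """
    for tier in sorted(set(group.values())):
        candidates = sorted(
            name for name, t in group.items() if t == tier and name in online
        )
        if candidates:
            return candidates
    return []  # nothing usable left in the group


# Hypothetical groups: one real failover group, one with a single member.
FAILOVER_GROUP = {"WAN_A": 1, "WAN_B": 2}
SINGLE_GROUP = {"WAN_A": 1}

if __name__ == "__main__":
    only_b_up = {"WAN_B"}
    print(pick_gateways(FAILOVER_GROUP, only_b_up))  # falls back to tier 2
    print(pick_gateways(SINGLE_GROUP, only_b_up))    # empty: no fallback
```

In this model the single-member group behaves exactly like pointing the rule at that gateway directly, which is why the extra group adds nothing.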

                        21.05-REL on the 3100 has known issues that will cause you problems so I would update if at all possible.

                        Can you limit SSH to known IPs or to dyndns addresses perhaps?
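As a rough illustration of what limiting SSH to known sources means in practice, the sketch below checks a source address against an allowlist using Python's standard `ipaddress` module. The networks are documentation ranges chosen for the example, and `ssh_allowed` is a hypothetical helper. On the firewall itself this would be done with the source field of the SSH pass rule, not with a script.

```python
import ipaddress

# Example allowlist using documentation ranges; replace with real sources.
ALLOWED = [
    ipaddress.ip_network("192.0.2.0/24"),      # e.g. an office range
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. one home address
]


def ssh_allowed(source, allowed=ALLOWED):
    """Return True if the source address falls inside any allowed network."""
    addr = ipaddress.ip_address(source)
    return any(addr in net for net in allowed)


if __name__ == "__main__":
    for src in ("192.0.2.44", "198.51.100.17", "203.0.113.9"):
        print(src, "allowed" if ssh_allowed(src) else "blocked")
```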

                        Steve

                          idiotzoo @stephenw10

                          @stephenw10 Hi Steve, it looks like issue 12004 is possibly what was causing the pain on boot issues. The significance of a PHP crash wasn't something I'd fully appreciated.

                          With 21.05.2 the system appears to start up happily.

                          It would be super helpful if a fix that affected platforms referenced the models in the release notes.

                          Thanks for your help. Just the WAN failover to do some more testing with.

                            SteveITS Galactic Empire @idiotzoo

                            @idiotzoo said in Wan failover doesn’t work and bigger problems:

                            helpful if a fix that affected platforms referenced the models in the release notes

                            Although the 3100 isn't directly mentioned in https://docs.netgate.com/pfsense/en/latest/releases/21-05-1.html it is one of the few 32-bit ARM models (along with the 1000).

                            Pre-2.7.2/23.09: Only install packages for your version, or risk breaking it. Select your branch in System/Update/Update Settings.
                            When upgrading, allow 10-15 minutes to restart, or more depending on packages and device speed.
                            Upvote 👍 helpful posts!

                              stephenw10 Netgate Administrator

                              Mmm, I agree it would have been useful there. It also didn't affect the SG-1000 directly because it only affected multicore devices. Really only the 3100.

                              But you should always be on the latest version really unless you have a very good reason not to be.

                              Steve

                                serbus @idiotzoo

                                @idiotzoo said in Wan failover doesn’t work and bigger problems:

                                Just the WAN failover to do some more testing with.

                                Hello!

                                All of my multi wan failover configs running on 21.05.2 cratered due to this:

                                https://redmine.pfsense.org/issues/11570

                                I manually reverted the code change from that issue to get it working again.

                                John

                                Lex parsimoniae

                                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.