Netgate Discussion Forum

Wan failover doesn’t work and bigger problems

General pfSense Questions
  • I
    idiotzoo
    last edited by idiotzoo Dec 11, 2021, 11:39 PM Dec 11, 2021, 11:27 PM

    Sadly, I’m about to call it a day with pfSense. Unless someone can tell me I’m doing something wrong, these are show-stopping bugs.

    Firstly, WAN failover just doesn’t work properly. I have two WAN links, one via a WISP router and another via PPPoE. If the WISP goes down (packet loss), pfSense will usually fail over to the PPPoE connection, although it can take far longer than it should. It almost never switches back; sometimes it does, but usually I have to manually remove the PPPoE link from the gateway group, then re-add it. This is repeatable.

    The real biggie, and this has happened twice now, is that after a restart (the first time was a polite restart, the second a power outage) pfSense doesn’t route traffic. Doesn’t get much more basic than that.

    pfSense itself has internet access. All config looks fine in the GUI. Network clients do not have internet access (there’s no traffic allowed between internal subnets). The only thing that fixes it is re-running the setup wizard; then it starts working. This happening once was worrying but perhaps “one of those things”. It’s happened again, which is simply unacceptable.

    pfSense is running on an SG-3100.

    The config for the system in question is not complicated.

    I’ve used this software for years and believe I know what I’m doing. It’s reached junk status for me now. Sad times. This network has unreliable WAN links and doesn’t have the most reliable power. I need a router/firewall to be the rock that keeps things working. Instead pfSense has become the flakiest part of the setup.

    I’m posting this in the hope someone can suggest a remedy for these issues. As I say I know the software and I like it (well I did) so I’m willing to try stuff before I junk this and fight with Mikrotik configs instead (which I hate).

    • S
      stephenw10 Netgate Administrator
      last edited by Dec 12, 2021, 2:35 AM

      How do you have the WAN failover configured? Since 2.4 you have been able to set a gateway group as the system default rather than use only policy routing.

      If both WANs are up, whichever is lowest tier should be the primary in the group. Is that not happening, or are you just not seeing traffic use it as expected? States that have opened on the PPPoE WAN will not be closed when the WISP comes back up.

      Failing to route at all could be a number of things, but I would first check for any open states. If you see 0 states, it could be that the ruleset failed to load for some reason, resulting in no outbound NAT. pfSense itself can still connect out in that situation, but nothing behind it can.
      I would expect to see alerts logged if that did happen though.

      Steve

      • I
        idiotzoo @stephenw10
        last edited by Dec 12, 2021, 11:14 AM

        @stephenw10 Hi Steve. I have a gateway group called "office" which has the WISP as tier1 and the ADSL PPPoE as tier2. Packet loss or high latency is the trigger.

        A firewall rule is hit by traffic from the appropriate subnet and selects the gateway group.

        My experience is, when the tier1 gateway is marked as down, traffic starts using the tier2 but there can be a long delay. Users complain of losing internet access entirely - this seemed to work better in the past.

        When the tier1 gateway is back online traffic continues to use the tier2 gateway. Clearing the states doesn't change this.

        I have enabled "State Killing on Gateway Failure" which should clear the states when the gateways change anyway.

        I believe this should just work, but it always seems to need manual intervention to get it back.

        Regarding the routing failure, it's not something I can reliably reproduce. When it occurs I just have to make it work as quickly as I can. If I see it again before replacing the SG-3100 I'll check the state table.

        There are no alerts in the GUI when this occurs. Unfortunately the log is overwritten now so I can't see if the firewall log contains any errors.

        • I
          idiotzoo @idiotzoo
          last edited by idiotzoo Dec 12, 2021, 11:21 AM Dec 12, 2021, 11:16 AM

          @idiotzoo said in Wan failover doesn’t work and bigger problems:

          "State Killing on Gateway Failure"

          Just recognised the problem with this - it doesn't reset the states when the gateway returns... I don't think this is the issue, as manually clearing the states doesn't resolve the issue, but is there no way to have the states reset when the gateways change?

          One other thing I forgot to mention. When pfSense gets into this broken state on a restart, the web configurator doesn't work and needs to be restarted from the CLI.

          • S
            stephenw10 Netgate Administrator
            last edited by Dec 12, 2021, 6:53 PM

            Mmm, indeed it will not kill states that are in use when the tier one gateway comes back up. Doing so would be needlessly disruptive in many cases.
            It's not usually an issue for most traffic, since states time out and new states are created on the tier 1 gateway seamlessly. However, you might see issues for traffic that retains a state for long periods, like VoIP for example.
            Once the group has failed back you should see that in Status > Gateways > Groups.
            You can also check the firewall ruleset directly in /tmp/rules.debug. That file is updated with whatever gateway is current in the group.
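A rough sketch of that ruleset check, for anyone who has copied `/tmp/rules.debug` off the firewall: it lists the lines containing pf's `route-to` keyword and pulls out the IPv4 addresses on them, so you can see which gateway the rules currently point at. The sample text, the `GWoffice` macro layout and the interface names are made up for illustration and are not taken from a real pfSense ruleset; check them against your own file.

```python
import re

# Matches a dotted-quad IPv4 address (no range validation, good enough for a quick look).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def route_to_lines(rules_text):
    """Return every non-comment, non-blank line that contains 'route-to'."""
    hits = []
    for line in rules_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if "route-to" in stripped:
            hits.append(stripped)
    return hits


def gateways_in_use(rules_text):
    """Collect the distinct IPv4 addresses that appear on route-to lines."""
    addresses = set()
    for line in route_to_lines(rules_text):
        addresses.update(IPV4_RE.findall(line))
    return sorted(addresses)


# Hypothetical excerpt, NOT real pfSense output: names and addresses are placeholders.
SAMPLE_RULES = '''
# gateway group macro (assumed layout)
GWoffice = " route-to ( wan0 203.0.113.1 ) "
pass in quick on lan0 $GWoffice inet from 192.168.10.0/24 to any keep state
'''

print(gateways_in_use(SAMPLE_RULES))
```

To use it on a real file, read the copied ruleset with `open(path).read()` and pass the text to `gateways_in_use`. If the address shown is still the tier 2 gateway after the tier 1 link has recovered, the group has not failed back.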

            If the webgui cannot start I would expect something to be logged. A bad ruleset would not cause that but without something more to go on it's hard to say exactly what might.
            If you see it again I would try to grab a status file before you reboot by browsing to <your_firewall_ip>/status.php directly. That should have enough info to show what happened.

            Steve

            • I
              idiotzoo @stephenw10
              last edited by Dec 12, 2021, 9:27 PM

              @stephenw10 That all makes sense... I'm not sure why clearing states doesn't seem to resolve the tier1 gateway returning.

              My concern is this is a remote network that I support as a volunteer. My time is limited and I'm torn between spending time trying to chase down issues vs learning an alternative product. I'll try and make some time in the new year for some in depth testing. Recreating the issue on reboot is my greatest concern.

              • S
                stephenw10 Netgate Administrator
                last edited by Dec 12, 2021, 9:49 PM

                Mmm, I understand. For the WAN failover try to verify that new connections are at least using the primary WAN when it comes back up.
                For the failing to start correctly case try to gather whatever info you can whilst it's in that state.

                • I
                  idiotzoo @stephenw10
                  last edited by Jan 6, 2022, 1:57 PM

                  @stephenw10 The site has just had a power outage (electrical testing) and the "no internet on reboot" problem showed itself.

                  As previously, I had to restart the web configurator.

                  Looking at the logs, I can't see anything obvious, but I haven't scoured them.

                  I have captured the status output both before and after re-running the setup wizard.

                  Anything specific I should be looking for? I'm happy to share things but not clear what needs redacting from the status output.

                  • S
                    SethGko23
                    last edited by Jan 6, 2022, 4:08 PM

                    Also experienced a power outage. Both devices are back up and appear to be functioning as they should but OpenVPN and any external traffic is now being blocked. I am fairly new to the system so any suggestions on what I could check would be huge.

                    Thanks

                    • I
                      idiotzoo @SethGko23
                      last edited by Jan 6, 2022, 4:19 PM

                      @sethgko23 You could try running through the setup wizard again to see if that makes it work. That's what I have to do.

                      • S
                        SethGko23 @idiotzoo
                        last edited by Jan 6, 2022, 4:22 PM

                        @idiotzoo What's the impact of that? Everything internal to the office is accessible; it's just our remote employees who are affected.

                        • I
                          idiotzoo @SethGko23
                          last edited by Jan 6, 2022, 4:59 PM

                          @sethgko23 Ah, sounds like a different problem. I'm getting no routing at all until I re-run the wizard.

                          • S
                            stephenw10 Netgate Administrator
                            last edited by Jan 6, 2022, 10:51 PM

                            The first thing I would check in the status file is the routing table. Make sure there is a default route and it's valid.
                            Then I would check the state table file and make sure there are some states. If the ruleset is not loading there will be no open states.
                            Then the system log and the dmesg (message buffer).
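The first two checks can be roughed out in a few lines once the routing table and state table text have been saved from the status output. This is only a sketch: it assumes the routing table is in the usual `netstat -rn` column layout and that the state table has one state per line, and the sample text is invented for illustration. It confirms a default route is present but cannot tell whether that route is actually valid.

```python
def has_default_route(routing_table_text):
    """True if any row has 'default' as its destination and at least a gateway column."""
    for line in routing_table_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "default":
            return True
    return False


def count_states(state_table_text):
    """Count non-blank lines, assuming one firewall state per line."""
    return sum(1 for line in state_table_text.splitlines() if line.strip())


# Invented sample data, NOT taken from the firewall discussed in this thread.
SAMPLE_ROUTES = """Destination        Gateway            Flags     Netif Expire
default            203.0.113.1        UGS       wan0
192.168.10.0/24    link#2             U         lan0
"""

SAMPLE_STATES = """all tcp 192.168.10.20:51515 -> 198.51.100.7:443       ESTABLISHED:ESTABLISHED
"""

print("default route present:", has_default_route(SAMPLE_ROUTES))
print("open states:", count_states(SAMPLE_STATES))
```

A result of zero states alongside a present default route would point towards the ruleset not loading rather than a routing problem.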

                            Steve

                            • I
                              idiotzoo @stephenw10
                              last edited by Jan 20, 2022, 2:58 PM

                              @stephenw10 Hi Steve,

                              I've captured the status in fault state and after running the wizard to get things going. It now looks like this issue can be reproduced - the site has just had some electrical work done and it seems to break after a power off.

                              Routing table looks fine. It's identical when things are working.

                              There are states

                              I can't see anything obvious in the system log, not sure what I'm looking for though.

                              Likewise the message buffer has no smoking gun... although there is this bit (the two php-cgi crash lines were in bold)

                              Configuring LAN interface...done.
                              Configuring GUESTWIFI interface...done.
                              Configuring BEELINE_WAN interface...done.
                              Configuring IPsec VTI interfaces...done.
                              Configuring CARP settings...done.
                              Syncing OpenVPN settings...
                              tun1: changing name to 'ovpns1'
                              pid 379 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
                              Segmentation fault (core dumped)

                              Starting CRON... done.
                              Starting package lldpd...done.
                              Starting package Traffic Totals...done.
                              Starting package OpenVPN Client Export Utility...done.
                              Netgate pfSense Plus 21.05-RELEASE arm Tue Jun 01 16:52:45 EDT 2021
                              Bootup complete
                              pflog0: promiscuous mode enabled

                              There's no sea of errors. Everything looks fine but it doesn't work. When it's in the broken state I've tried making firewall and NAT rule changes to trigger a reload. The only thing that seems to make it work again is running the setup wizard.

                              • S
                                stephenw10 Netgate Administrator
                                last edited by Jan 20, 2022, 5:34 PM

                                @idiotzoo said in Wan failover doesn’t work and bigger problems:

                                pid 379 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
                                Segmentation fault (core dumped)

                                That's a pretty big smoking gun! If php crashes out at boot any number of the start up scripts may not have run.
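Since that line is easy to miss in a long boot log, here is a small sketch that scans saved message buffer text for processes that exited on a signal. The log excerpt is copied from the post above; the regular expression is an assumption based on that one line's layout and may need adjusting for other messages.

```python
import re

# Pattern modelled on the crash line quoted in this thread.
CRASH_RE = re.compile(
    r"pid (\d+) \(([^)]+)\), jid \d+, uid \d+: exited on signal (\d+)"
)


def find_crashes(log_text):
    """Return a list of (pid, process_name, signal_number) tuples found in the log."""
    return [
        (int(match.group(1)), match.group(2), int(match.group(3)))
        for match in CRASH_RE.finditer(log_text)
    ]


# Excerpt from the boot messages posted earlier in the thread.
BOOT_LOG = """Syncing OpenVPN settings...
tun1: changing name to 'ovpns1'
pid 379 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
Segmentation fault (core dumped)
Starting CRON... done.
"""

for pid, name, signal_number in find_crashes(BOOT_LOG):
    print(f"{name} (pid {pid}) exited on signal {signal_number}")
```

Signal 11 is SIGSEGV, which matches the "Segmentation fault" line that follows it in the log.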

                                I would still expect to see something configured incorrectly in the status file. Are you able to send it to me to review?

                                Just to be clear this only happens after a power outage and not ever after a normal reboot or shutdown/boot cycle?

                                Steve

                                • I
                                  idiotzoo @stephenw10
                                  last edited by Jan 20, 2022, 8:31 PM

                                  @stephenw10 ha… so I highlighted the right thing.

                                  I can certainly share the status files. Just didn’t want to post them publicly.

                                  This first occurred when I initiated a reboot, so that should have been a polite shutdown. The rest of the reboots have been less so.

                                  • S
                                    stephenw10 Netgate Administrator
                                    last edited by Jan 21, 2022, 12:08 AM

                                    Ok, well if you can PM me a link that would work. Otherwise you could open a ticket and make it for my attention. I can review it when time allows.

                                    Steve

                                    • I
                                      idiotzoo @stephenw10
                                      last edited by Jan 26, 2022, 11:15 AM

                                      @stephenw10 Thanks Steve, I've sent you a PM with the status details. Will be sure to update folks here with the outcome either way.

                                      • S
                                        stephenw10 Netgate Administrator
                                        last edited by Jan 26, 2022, 1:28 PM

                                        The logs show it was only up for ~2h before it stopped routing. Is that correct?

                                        Three things jump out there:

                                        You have SSH open to the world so your logs are full of random SSH login attempts and the resulting blocks.

                                        Only the VoIP gateway group is actually failover. Most of your traffic is being policy routed via gateway groups that only have one gateway in them, is that intentional? Something you were doing for testing?

                                        This is an SG-3100 running 21.05-REL. I should have spotted that before but I'd assumed you were on the latest version. You should upgrade to 21.05.2, there were specific fixes for the 3100 in 21.05.1 that addressed the php crash you're seeing and that could easily explain everything you're seeing.
                                        https://docs.netgate.com/pfsense/en/latest/releases/21-05-1.html
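As a quick way to spot this from a saved boot log, the sketch below parses the version out of a banner line like the one posted earlier and compares it with 21.05.1, the release named above as carrying the SG-3100 fixes. The regular expression is an assumption based on that single banner line, and the second banner string is constructed for illustration rather than copied from a real system.

```python
import re

# Release named in this thread as containing the SG-3100 fixes.
FIXED_IN = (21, 5, 1)

# Pattern modelled on the banner line from the posted boot log.
BANNER_RE = re.compile(r"pfSense Plus (\d+)\.(\d+)(?:\.(\d+))?-RELEASE")


def parse_version(banner):
    """Return (major, minor, patch) from a banner line, or None if it does not match."""
    match = BANNER_RE.search(banner)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def has_fix(banner):
    """True if the banner's version is at least FIXED_IN."""
    version = parse_version(banner)
    return version is not None and version >= FIXED_IN


# Banner copied from the boot log posted in this thread.
POSTED_BANNER = "Netgate pfSense Plus 21.05-RELEASE arm Tue Jun 01 16:52:45 EDT 2021"
# Constructed example of an upgraded system (hypothetical string).
UPGRADED_BANNER = "Netgate pfSense Plus 21.05.2-RELEASE arm"

print(parse_version(POSTED_BANNER), has_fix(POSTED_BANNER))
print(parse_version(UPGRADED_BANNER), has_fix(UPGRADED_BANNER))
```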

                                        Steve

                                        • I
                                          idiotzoo @stephenw10
                                          last edited by Jan 26, 2022, 1:41 PM

                                          @stephenw10 Thanks Steve, appreciate you taking a look at it. In this case, I suspect that uptime is probably about right. Electrical safety tests meant things kept being switched off. The second power outage was unexpected when a circuit tripped out.

                                          SSH being open to the world is certainly suboptimal, but it's been my only way in to make the system work after a power outage without a 2 hour drive. Ordinarily I would not do this.

                                          The non-VoIP group is not using the failover, as you spotted, basically because I couldn't make the failover work properly. It would fail over but then stick and not switch back, or so it seemed. I have more testing to do there.

                                          I'll update. The last time I ran an update everything died and I had a 2 hour drive to deal with it, hence I've resisted an update that I fear might kill the system or at least my access to it.

                                          Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.