Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    Upgrade success (Edit: not quite)

    Scheduled Pinned Locked Moved 2.1.1 Snapshot Feedback and Problems - RETIRED
    15 Posts 5 Posters 4.5k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • A
      athurdent
      last edited by

      Felt brave today and upgraded my home firewall (Lanner FW-7530), so far my em interfaces work fine:

      em0: <intel(r) 1000="" pro="" network="" connection="" 7.3.8="">port 0xac80-0xac9f mem 0xfe7e0000-0xfe7fffff,0xfe7dc000-0xfe7dffff irq 16 at device 0.0 on pci1
      em0: Using MSIX interrupts with 3 vectors
      em0: [ITHREAD]
      em0: [ITHREAD]
      em0: [ITHREAD]</intel(r)> 
      

      The upgrade reminded me that I should get an SSD sometime soon, packages reinstallation takes ages on my CF. Bye bye 124 days of uptime since installing 2.1 ;)

      Many thanks to the pfSense team for the great work!

      Edit: Not that succesful after all.
      It seems that Gateway Groups do not work anymore for me. I.e when using a group with Tier 1 Cable (DHCP) and Tier 2 DSL (PPPoE) the /tmp/rules.debug never gets updated with the Tier 2 gateway when taking Tier 1 offline. Had to go back to 2.1 for now, works fine there.

      1 Reply Last reply Reply Quote 0
      • C
        cmb
        last edited by

        does your gateway status change under Status>Gateways? What is apinger logging?

        1 Reply Last reply Reply Quote 0
        • A
          athurdent
          last edited by

          Thanks for looking into this, cmb!

          My IPv6 tunnel (gif0 tunnel to SixXs) also depends on my DHCP cable connection, so that one goes down to, which is expected, just fyi.
          Both Gateways show Loss 100% and Offline.

          Apinger Log:

          Jan 18 17:14:24 vpn apinger: ALARM: WAN1_DHCP(5.xxx.xxx.1)  *** down ***
          Jan 18 17:14:24 vpn apinger: ALARM: WAN_SixXS(2001:xxxx:xxxx:xxxx::1)  *** down ***
          Jan 18 17:14:34 vpn apinger: Error while starting command form alarm(down) on target(5.xxx.xxx.1-WAN1_DHCP)
          Jan 18 17:14:34 vpn apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN1_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN1_DHCP' -c 'filter reload' ) exited with status: 255
          Jan 18 17:14:34 vpn apinger: Error while starting command form alarm(down) on target(2001:xxxx:xxxx:xxxx::1-WAN_SixXS)
          Jan 18 17:14:34 vpn apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_SixXS' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_SixXS' -c 'filter reload' ) exited with status: 255
          
          
          1 Reply Last reply Reply Quote 0
          • P
            phil.davis
            last edited by

            I have setup 2 WANs:
            WAN (vr1) - gateway 10.49.175.250
            OPT1 (vr2) - gateway 10.49.174.250
            NcellPriority gateway group - OPT1GW tier 1, WANGW tier 2
            Firewall rule to policy-route general traffic into NcellPriority gateway group.

            Both gateways up, traffic routes through OPT1, correct.
            /tmp/rules.debug has the correct line directing to OPT1:

            GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "
            

            Pull the cable from OPT1 (vr2)

            System Log:

            Jan 18 21:53:47 	kernel: vr2: link state changed to DOWN
            

            Gateway log tab:

            Jan 18 21:54:15 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***
            Jan 18 21:54:25 	apinger: Error while starting command form alarm(OPT1GWdown) on target(8.8.4.4-OPT1GW)
            Jan 18 21:54:25 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns OPT1GW' -c 'service reload ipsecdns' -c 'service reload openvpn OPT1GW' -c 'filter reload' ) exited with status: 255
            

            Look in /tmp/rules.debug right up to 21:59 and it still has:

            GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "
            

            A couple of seconds after 22:00, the line changes to:

            GWNcellPriority = "  route-to { ( vr1 10.49.175.250 )  }  "
            

            and now traffic is routed out WAN (vr1), and the general traffic failover is effective.
            The regular cron job that runs every 15 minutes has done a recalculation of the rules and put the currently-correct thing in place.

            Now I put the cable back in to OPT1 at 22:08
            System log:

            Jan 18 22:08:49 	kernel: vr2: link state changed to UP
            

            Gateway log tab:

            Jan 18 22:08:51 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***
            Jan 18 22:09:01 	apinger: Error while starting command form alarm(OPT1GWdown) on target(8.8.4.4-OPT1GW)
            Jan 18 22:09:01 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns OPT1GW' -c 'service reload ipsecdns' -c 'service reload openvpn OPT1GW' -c 'filter reload' ) exited with status: 255
            

            But /tmp/rules.debug up to 10:14 still has:

            GWNcellPriority = "  route-to { ( vr1 10.49.175.250 )  }  "
            

            then after 10:15 it changes to:

            GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "
            

            and the failback/recovery to OPT1 is effective at the 15-minute filter reload.

            filter_configure_sync does not seem to be happening at the time of the gateway status change, only at the 15-minute cron job interval.

            Now to switch slices back to 2.1 and see if it was always like this…

            As the Greek philosopher Isosceles used to say, "There are 3 sides to every triangle."
            If I helped you, then help someone else - buy someone a gift from the INF catalog http://secure.inf.org/gifts/usd/

            1 Reply Last reply Reply Quote 0
            • P
              phil.davis
              last edited by

              With pfSense 2.1-RELEASE and exactly the same config as the post above, the results are good and log messages are quite different:
              At the start, with all links up, /tmp/rules.debug has:

              GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "
              

              Pull OPT1 cable:
              System Log:

              Jan 18 22:50:59 	kernel: vr2: link state changed to DOWN
              Jan 18 22:51:03 	php: rc.linkup: Hotplug event detected for OPT1(opt1) but ignoring since interface is configured with static IP (10.49.174.1 )
              Jan 18 22:51:38 	check_reload_status: updating dyndns OPT1GW
              Jan 18 22:51:38 	check_reload_status: Restarting ipsec tunnels
              Jan 18 22:51:38 	check_reload_status: Restarting OpenVPN tunnels/interfaces
              Jan 18 22:51:38 	check_reload_status: Reloading filter
              

              Gateways Log:

              Jan 18 22:51:28 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***
              

              /tmp/rules.debug changes within 1 minute to have:

              GWNcellPriority = "  route-to { ( vr1 10.49.175.250 )  }  "
              

              checkip has changed to the WAN public IP - good
              Internet browsing has failed over to WAN inside 1 minute - good

              Replace OPT1 cable:
              System Log:

              Jan 18 22:54:19 	check_reload_status: rc.newwanip starting vr2
              Jan 18 22:54:25 	php: rc.newwanip: rc.newwanip: Informational is starting vr2.
              Jan 18 22:54:25 	php: rc.newwanip: rc.newwanip: on (IP address: 10.49.174.1) (interface: opt1) (real interface: vr2).
              Jan 18 22:54:25 	php: rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 10.49.175.250
              Jan 18 22:54:25 	php: rc.newwanip: Removing static route for monitor 8.8.4.4 and adding a new route through 10.49.174.250
              Jan 18 22:54:29 	check_reload_status: updating dyndns OPT1GW
              Jan 18 22:54:29 	check_reload_status: Restarting ipsec tunnels
              Jan 18 22:54:29 	check_reload_status: Restarting OpenVPN tunnels/interfaces
              Jan 18 22:54:29 	check_reload_status: Reloading filter
              Jan 18 22:54:30 	php: rc.newwanip: Resyncing OpenVPN instances for interface OPT1.
              Jan 18 22:54:30 	php: rc.newwanip: Creating rrd update script
              Jan 18 22:54:32 	php: rc.newwanip: pfSense package system has detected an ip change 0.0.0.0 -> 10.49.174.1 ... Restarting packages.
              Jan 18 22:54:33 	check_reload_status: Starting packages
              Jan 18 22:54:40 	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use OPT1GW.
              Jan 18 22:54:45 	php: rc.start_packages: Restarting/Starting all packages.
              Jan 18 22:54:53 	check_reload_status: Reloading filter
              

              Gateway Log:

              Jan 18 22:54:19 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***
              Jan 18 22:54:25 	apinger: SIGHUP received, reloading configuration.
              

              /tmp/rules.debug has quickly changed back to:

              GWNcellPriority = "  route-to { ( vr2 10.49.174.250 )  }  "
              

              checkip has changed back to OPT1 public IP - good
              Internet browsing has failed back to OPT1 inside 1 minute - good

              There are log meesages in 2.1 from rc.linkup and rc.newwanip that do good stuff for failover - none of that appears in the 2.1.1 logs. 2.1.1 has the apinger exit status 255 thing.

              What happened in 2.1.1? Is it something in pfsense-tools that was changed recently and intended for 2.2?

              As the Greek philosopher Isosceles used to say, "There are 3 sides to every triangle."
              If I helped you, then help someone else - buy someone a gift from the INF catalog http://secure.inf.org/gifts/usd/

              1 Reply Last reply Reply Quote 0
              • P
                phil.davis
                last edited by

                There have been some changes to check_reload_status in pfsense-tools. Maybe they will help the failover processing?
                I don't have a suitable multi-WAN test environment at the moment (I am traveling). But On Sunday I will be able to do real testing. So please let me know if there is any change worth testing and I can get the latest snapshot on Saturday night or Sunday and test it.

                As the Greek philosopher Isosceles used to say, "There are 3 sides to every triangle."
                If I helped you, then help someone else - buy someone a gift from the INF catalog http://secure.inf.org/gifts/usd/

                1 Reply Last reply Reply Quote 0
                • jimpJ
                  jimp Rebel Alliance Developer Netgate
                  last edited by

                  Yes, the snapshots from Thursday afternoon or later should have the check_reload_status fixes which hopefully will address the behavior some are seeing in this and other similar threads. If someone can upgrade and test again it would be appreciated.

                  Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

                  Need help fast? Netgate Global Support!

                  Do not Chat/PM for help!

                  1 Reply Last reply Reply Quote 0
                  • A
                    athurdent
                    last edited by

                    My test KVM still has that error when a gateway goes down, so I didn't try on my "production" machine because of that:

                    Jan 25 07:59:00 pfsense-kvm.local-lan apinger: Error while starting command form alarm(down) on target(192.168.xxx.254-WANGW)
                    Jan 25 07:59:00 pfsense-kvm.local-lan apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WANGW' -c 'service reload ipsecdns' -c 'service reload openvpn WANGW' -c 'filter reload' ) exited with status: 255
                    
                    1 Reply Last reply Reply Quote 0
                    • A
                      athurdent
                      last edited by

                      Tried the latest snapshot ( Sat Jan 25 10:43:17 EST 2014 ) and the problem is still there.

                      1 Reply Last reply Reply Quote 0
                      • P
                        phil.davis
                        last edited by

                        2.1.1-PRERELEASE (i386)
                        built on Sat Jan 25 10:00:56 EST 2014
                        FreeBSD 8.3-RELEASE-p14
                        

                        Test Alix 2D13 with its WAN connected to my home LAN. The test WAN gets DHCP from home LAN. WAN gateway monitor IP is set to an external IP on the real internet. Things are running nicely.
                        I pull the home WAN as a test - as expected the test WAN gateway goes to "pending" and eventually "offline" as apinger fails to get ping response from the monitor IP. So apinger takes some action, but this still appears in the gateway log:

                        Jan 26 22:30:47 	apinger: ALARM: WAN_DHCP(216.146.35.35) *** WAN_DHCPdown ***
                        Jan 26 22:30:57 	apinger: Error while starting command form alarm(WAN_DHCPdown) on target(216.146.35.35-WAN_DHCP)
                        Jan 26 22:30:57 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_DHCP' -c 'filter reload' ) exited with status: 255
                        

                        Then I connect home WAN again (so I can post this  :) ). apinger on test pfSense starts getting ping response from the monitor IP - good. It attempts to take action when it decides the WAN gateway is up, and this in the gateway log:

                        Jan 26 22:34:00 	apinger: alarm canceled: WAN_DHCP(216.146.35.35) *** WAN_DHCPdown ***
                        Jan 26 22:34:10 	apinger: Error while starting command form alarm(WAN_DHCPdown) on target(216.146.35.35-WAN_DHCP)
                        Jan 26 22:34:10 	apinger: command (/usr/local/sbin/pfSctl -c 'service reload dyndns WAN_DHCP' -c 'service reload ipsecdns' -c 'service reload openvpn WAN_DHCP' -c 'filter reload' ) exited with status: 255
                        

                        So the "exited with status: 255" thing happens for all gateway transitions - no need for complicated multi-WAN failover scenarios to see this error message.

                        As the Greek philosopher Isosceles used to say, "There are 3 sides to every triangle."
                        If I helped you, then help someone else - buy someone a gift from the INF catalog http://secure.inf.org/gifts/usd/

                        1 Reply Last reply Reply Quote 0
                        • jimpJ
                          jimp Rebel Alliance Developer Netgate
                          last edited by

                          Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

                          Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

                          Need help fast? Netgate Global Support!

                          Do not Chat/PM for help!

                          1 Reply Last reply Reply Quote 0
                          • A
                            athurdent
                            last edited by

                            @jimp:

                            Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

                            Great thanks, no more error messages.

                            2 questions for things I ran into while testing failover with that new version:

                            Are states still flushed when a gateway goes down and the option "State Killing on Gateway Failure" is not checked? I have default gateway switching on.
                            I started a ping to an external host, pulled my primary gateway and the ping stopped, never recovered. Shouldn't it recover and automatically use my second gateway?

                            Is there a configure option to switch of those
                            MONITOR: <gateway>is down, removing from routing group <group>messages? They keep on coming and filling my mailbox as long as one gateway is down, but one message per gateway group would really be enough. And even worse is that one never gets a mail when the gateway has recovered, so this whole feature is a little useless (at least for me) and I have to rely on my monitoring system anyway.</group></gateway>

                            1 Reply Last reply Reply Quote 0
                            • A
                              athurdent
                              last edited by

                              Regarding state flushing, it seems that at some point all the states are flushed, not just those of the gateway going down.
                              I would suggest to modify the anti-lockout rule like this, using the "no state" feature of pf:

                              # make sure the user cannot lock himself out of the webConfigurator or SSH
                              pass in  quick on {$lanif} proto tcp from any to ({$lanif}) port { {$alports} } no state label "anti-lockout rule"
                              pass out quick on {$lanif} proto tcp from ({$lanif}) port { {$alports} } to any no state label "anti-lockout rule"
                              

                              You can keep on using your ssh/gui session that way even if all states get flushed. Also the gui feature "Reset States" would profit from that. You could get rid of:
                              "NOTE: If you reset the firewall state table, the browser session may appear to be hung after clicking "Reset". Simply refresh the page to continue."

                              1 Reply Last reply Reply Quote 0
                              • D
                                doktornotor Banned
                                last edited by

                                @jimp:

                                Finally locked in a fix for this. New snapshots building now should be OK. It'll be a few hours before they upload.

                                Confirmed fixed, finally… Yay!  8)

                                1 Reply Last reply Reply Quote 0
                                • P
                                  phil.davis
                                  last edited by

                                  2.1.1-PRERELEASE (i386)
                                  built on Sun Feb 2 12:42:30 EST 2014
                                  FreeBSD 8.3-RELEASE-p14
                                  

                                  All is well. I tried pulling out the phone line on the ADSL (default gateway), the dynamic DNS name, OpenVPN road warrior server and 2 OpenVPN site-2-site clients all switched to using OPT1. Plugged in the phone line again, ADSL negotiated, apinger detected the WAN online again, everything failed back.
                                  Pulled the cable on OPT1 (which had general internet traffic directed to it as tier1 of a gateway group). General browsing failed over to WAN. OpenVPN server and clients remained running untouched (as they should because they were already on WAN - so no need to restart them).
                                  During messing about, ADSL WAN went down by itself - had some minutes with both WAN and OPT1 down, and things recovered fine from that as links became available again. Always good to have the ISP give you a real test  ;)
                                  Gateways log tab has nice clean entries like this:

                                  Feb 3 10:24:32 	apinger: Starting Alarm Pinger, apinger(28478)
                                  Feb 3 10:30:25 	apinger: SIGHUP received, reloading configuration.
                                  Feb 3 10:52:28 	apinger: ALARM: WANGW(8.8.8.8) *** down ***
                                  Feb 3 11:00:52 	apinger: alarm canceled: WANGW(8.8.8.8) *** down ***
                                  Feb 3 11:08:57 	apinger: ALARM: WANGW(8.8.8.8) *** down ***
                                  Feb 3 11:09:10 	apinger: ALARM: OPT1GW(8.8.4.4) *** OPT1GWdown ***
                                  Feb 3 11:10:49 	apinger: SIGHUP received, reloading configuration.
                                  Feb 3 11:12:59 	apinger: alarm canceled: WANGW(8.8.8.8) *** down ***
                                  Feb 3 11:20:12 	apinger: SIGHUP received, reloading configuration.
                                  Feb 3 11:20:23 	apinger: alarm canceled: OPT1GW(8.8.4.4) *** OPT1GWdown ***
                                  

                                  As the Greek philosopher Isosceles used to say, "There are 3 sides to every triangle."
                                  If I helped you, then help someone else - buy someone a gift from the INF catalog http://secure.inf.org/gifts/usd/

                                  1 Reply Last reply Reply Quote 0
                                  • First post
                                    Last post
                                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.