Netgate Discussion Forum

    Major issue with QUAGGA-OSPF and VLANs (pfsense 2.3.0)

    Routing and Multi WAN
    81 Posts 23 Posters 36.9k Views
      reqlez

      @jimp:

      After restarting services and yanking (virtual) cables I did manage to make it break, once.

      If it is related to restarting zebra, this patch might help:

      http://files.atx.pfsense.org/jimp/patches/skip_restart_for_routing_packages-2.3.1.patch

      Ultimately someone that can reproduce this reliably needs to report this directly to quagga since it appears to be a problematic change introduced in their 1.0.x code base.

      I saw somewhere in the Quagga notes that something got fixed recently. About this no-restart patch: will it work on the latest update? Also, why not just include an option to TURN OFF the restart of network packages, somewhere in the advanced options? That would really help when an unstable line, even a lower-priority link, brings the network down while Quagga restarts.

        Taras_

        @jimp:

        We can't reliably reproduce it here, and it isn't our code to fix. It's something in Quagga 1.x on FreeBSD, so you'd be better off approaching the Quagga developers or maybe FreeBSD developers directly.

        I have 2 fresh pfSense boxes (SG-4860) with 2 ISPs, 4 OpenVPN tunnels and OSPF on top of it.
        The issue reproduces reliably here :) :( :( , i.e. kernel routes aren't removed/updated properly (see the 10.0.9.0/24 route):

        Codes: K - kernel route, C - connected, S - static, R - RIP,
               O - OSPF, I - IS-IS, B - BGP, P - PIM, A - Babel,
               > - selected route, * - FIB route
        
        K>* 0.0.0.0/0 via 192.168.0.1, igb1
        O   10.0.9.0/24 [110/60] via 10.255.255.101, igb5, 00:02:03
        K>* 10.0.9.0/24 via 10.255.2.2, ovpns1
        O   10.1.102.0/24 [110/50] via 10.255.2.2, ovpns1, 00:02:03
        K>* 10.1.102.0/24 via 10.255.2.2, ovpns1
        O   10.11.11.0/24 [110/10] is directly connected, lagg0, 00:02:16
        C>* 10.11.11.0/24 is directly connected, lagg0
        O   10.255.1.0/24 [110/70] via 10.255.2.2, ovpns1, 00:02:03
        K>* 10.255.1.0/24 via 10.255.2.2, ovpns1
        O   10.255.2.0/24 [110/40] is directly connected, ovpns1, 00:02:16
        C>* 10.255.2.0/24 is directly connected, ovpns1
        O   10.255.255.0/24 [110/50] is directly connected, igb5, 00:02:16
        C>* 10.255.255.0/24 is directly connected, igb5
        C>* 127.0.0.0/8 is directly connected, lo0
        C>* 192.168.0.0/24 is directly connected, igb1
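
In a longer table, one way to spot the symptom is to look for prefixes that have both an unselected OSPF entry (O) and a selected kernel entry (K>*). The following is only a minimal sketch, not part of Quagga or pfSense: find_shadowed_ospf is a made-up helper name, and it assumes the route listing (for example from vtysh -c 'show ip route') is piped in on standard input in the format shown above.

```shell
#!/bin/sh
# find_shadowed_ospf (hypothetical helper): reads "show ip route" style output
# on stdin and prints every prefix that has an unselected OSPF entry ("O")
# alongside a selected kernel entry ("K>*").
find_shadowed_ospf() {
    awk '
        $1 == "O"   { ospf[$2] = 1 }     # OSPF knows the prefix but is not selected
        $1 == "K>*" { kernel[$2] = 1 }   # kernel route is selected and in the FIB
        END {
            for (prefix in ospf)
                if (prefix in kernel)
                    print prefix
        }
    ' | LC_ALL=C sort
}
```

Fed the table above, it would report 10.0.9.0/24, 10.1.102.0/24 and 10.255.1.0/24. Only the first has a kernel next hop that differs from the OSPF one, so the output is a list of candidates to inspect, not proof of breakage.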
        

        My primary question: if Quagga introduced problems in recent updates, maybe we should return to a version that doesn't have them and push it through pfSense's packages?
        I have 2 support incidents from the pfSense team; maybe I should spend one of them on this problem?

          reqlez

          @jimp:

          After restarting services and yanking (virtual) cables I did manage to make it break, once.

          If it is related to restarting zebra, this patch might help:

          http://files.atx.pfsense.org/jimp/patches/skip_restart_for_routing_packages-2.3.1.patch

          Ultimately someone that can reproduce this reliably needs to report this directly to quagga since it appears to be a problematic change introduced in their 1.0.x code base.

          Okay, I tried this patch in 2.3.2 and it won't work …

          Also, I submitted a request to the quagga-users list, but nobody has got back to me yet.

            reqlez

            Okay got a reply ( from Martin Winters the quagga god himself ! ) https://lists.quagga.net/pipermail/quagga-users/2016-October/014474.html

            I actually contacted the maintainer of freebsd port for quagga and he referred me to the list as he doesn't think this is port related.

            If you guys want to pitch in, go ahead… Martin is asking to compile the latest code from git ... and honestly I have never compiled zebra before, I think the last thing I compiled on FreeBSD was Java lol

              reqlez

              Here is another comment Martin from Quagga made: "I don’t see why pfsense would restart Quagga - so I think this might be a bug. But there might be other reasons for it which I’m unaware of."

              I actually have some logs that I will be submitting either tonight or tomorrow.

                reqlez

                OSPFD / ZEBRA Debug logs submitted to Martin. Now we wait and see. I have tried his "latest" development package and it does the same thing.

                  reqlez

                  So apparently -9 is a really nasty way of stopping Quagga, as per Martin from Quagga, and he thinks this is not letting it flush routing tables before exit. Maybe there is new code in new version of Quagga that takes a bit more time to flush those routes ? and maybe that is why it was not an issue in 0.99 version but it is with 1.0 ?

                  See code in pfsense:

                  rc_stop() {
                      # Force-kill zebra with SIGKILL, then remove its pid file
                      if [ -e /var/run/quagga/zebra.pid ]; then
                          /bin/kill -9 `/bin/cat /var/run/quagga/zebra.pid`
                          /bin/rm -f /var/run/quagga/zebra.pid
                      fi
                      # Same for ospfd
                      if [ -e /var/run/quagga/ospfd.pid ]; then
                          /bin/kill -9 `/bin/cat /var/run/quagga/ospfd.pid`
                          /bin/rm -f /var/run/quagga/ospfd.pid
                      fi
                  }

                  But then again, why is it being restarted in the first place? Is it because of links that get IPs dynamically allocated ?  A UI option to skip quagga restart would be really appreciated guys! Pulling my hair out here testing this :(

                    kpa

                    The -9 signal is always a bad idea on any service; it is strictly reserved for the situation where no other signal is able to terminate a process that is stuck for whatever reason. This should be common knowledge among pfSense developers and people working on the packages, and I'm really surprised such amateur mistakes are being made with such an important package.

                      reqlez

                      Here is my idea: why not have two waves of shutdowns? The first wave without -9, then sleep for a few seconds and do another wave with -9. Even better, trigger the second wave only if any of the processes are still running.
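
The two-wave idea could look roughly like the sketch below. This is not the pfSense package code: stop_daemon is a made-up helper name, the 5-second default is an arbitrary choice, and it assumes the daemons clean up after themselves when they receive SIGTERM.

```shell
#!/bin/sh
# stop_daemon PIDFILE [TIMEOUT_SECONDS]  (hypothetical helper, not pfSense code)
# First wave: SIGTERM. Second wave: SIGKILL, only if the process is still alive
# after the timeout.
stop_daemon() {
    sd_pidfile=$1
    sd_timeout=${2:-5}

    [ -e "$sd_pidfile" ] || return 0
    sd_pid=$(cat "$sd_pidfile")
    if [ -z "$sd_pid" ]; then
        rm -f "$sd_pidfile"
        return 0
    fi

    # First wave: ask politely so the daemon can run its own shutdown code.
    kill -TERM "$sd_pid" 2>/dev/null

    # Wait up to sd_timeout seconds for the process to go away.
    sd_waited=0
    while kill -0 "$sd_pid" 2>/dev/null && [ "$sd_waited" -lt "$sd_timeout" ]; do
        sleep 1
        sd_waited=$((sd_waited + 1))
    done

    # Second wave: force-kill only if it is still running.
    if kill -0 "$sd_pid" 2>/dev/null; then
        kill -KILL "$sd_pid" 2>/dev/null
    fi

    rm -f "$sd_pidfile"
}

# Possible use inside rc_stop, same order as the existing code:
#   stop_daemon /var/run/quagga/zebra.pid
#   stop_daemon /var/run/quagga/ospfd.pid
```

The second wave is still a kill -9, so this only narrows the window in which the routes are left behind; it does not remove it.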

                        kpa

                        It's still a very bad idea.

                        http://unix.stackexchange.com/questions/281439/why-should-i-not-use-kill-9-sigkill

                        Imagine a very big database that relies on a proper shutdown for its integrity whenever it has to be taken down. It has battery-backed storage and UPS power, and it survives a power outage easily by performing the proper shutdown procedures when an outage is detected, finishing them before the power really goes down. Now, if the main database process gets killed with the -9 signal, none of the shutdown procedures get run because, as described in the linked document, "the process gets the rug pulled from it": it is forcibly removed from the system in the exact state it was in when the -9 signal was sent. This would leave the database in an inconsistent state and could cost days in repair time.

                          Spydre13

                          @reqlez:

                          So apparently -9 is a really nasty way of stopping Quagga, as per Martin from Quagga, and he thinks this is not letting it flush routing tables before exit. Maybe there is new code in new version of Quagga that takes a bit more time to flush those routes ? and maybe that is why it was not an issue in 0.99 version but it is with 1.0 ?

                          I see Martin's reply to you on Oct. 10, but I don't see anything after that.  Are you emailing him off-list?

                          I was looking through the Quagga code last night and found something that I suspect could be the problem.  Quagga (the zebra daemon) puts routes into the kernel with flag "1" (RTF_PROTO1, see the netstat man page).  When zebra starts up it's supposed to ignore (filter out) any kernel routes with flag "1" because it should assume it put those there to begin with.  I think this was working before Quagga version 1, and in version >= 1 it pulls those kernel routes into the zebra RIB.

                          If I reboot a firewall and go to OSPF -> Status -> Zebra routes, I see a bunch of OSPF routes but barely any K (kernel) routes.  If I make any change on the Global Settings or Interface Settings tab quagga restarts, and then when looking at the zebra routes it is filled with kernel routes (one for each OSPF route).
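
To see which kernel routes carry that flag, the netstat routing table can be filtered on its Flags column. A minimal sketch follows, assuming FreeBSD-style netstat -rn output with a header line that contains "Destination" and "Flags"; list_proto1_routes is a made-up helper name.

```shell
#!/bin/sh
# list_proto1_routes (hypothetical helper): reads "netstat -rn" output on stdin
# and prints "destination gateway" for every route whose Flags column contains
# "1" (RTF_PROTO1).
list_proto1_routes() {
    awk '
        # Each address-family section has its own header; find the Flags column.
        $1 == "Destination" {
            col = 0
            for (i = 1; i <= NF; i++)
                if ($i == "Flags")
                    col = i
            next
        }
        # Skip section titles and blank lines, keep rows flagged with "1".
        col && NF >= col && index($col, "1") { print $1, $2 }
    '
}

# Possible use:
#   netstat -rn -f inet | list_proto1_routes
```

Comparing this list before and after a Quagga restart with the K entries in the zebra route status should show whether the flagged routes are the ones being pulled in as kernel routes.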

                          Can you ask Martin to look at this:
                          Commit: https://github.com/Quagga/quagga/commit/0d0686f98e64017415071e590bde262f0ab5a4c9
                          File: zebra/zebra_rib.c
                          Function: rib_sweep_table

                          This function is commented out starting in version 1, but it was used in version 0.99.24.  There is a block of code in it:

                          
                          if (rib->type == ZEBRA_ROUTE_KERNEL &&
                            CHECK_FLAG (rib->flags, ZEBRA_FLAG_SELFROUTE))
                          {
                              ret = rib_uninstall_kernel (rn, rib);
                              if (! ret)
                                  rib_delnode (rn, rib);
                          }
                          
                          

                          The rib_weed_tables function that is still being used doesn't seem to do this same thing, from what I can tell.  This URL shows them side-by-side: https://fossies.org/diffs/quagga/0.99.24.1_vs_1.0.20160315/zebra/zebra_rib.c-diff.html

                          If you can point me to the thread where you are discussing this with Martin, I can pass this along to him if you prefer.

                            reqlez

                            Sorry, I'm a mailing list noob and only realized when you told me that this stuff was not going via the list … I'll post this and include the list this time. Yes, you are correct, I have just been emailing him.

                              reqlez

                              @Spydre13

                              I just posted your comment on the same list:  https://lists.quagga.net/pipermail/quagga-users/2016-October/014476.html

                              This time I'm being less of a noob and actually e-mailing the list.  We can continue this discussion there if you subscribe to it.

                                bgibson

                                Good morning,
                                Has there been any update regarding this issue? Is there another forum or notes I can follow to see when this is resolved? This is causing a huge problem within our company and if not fixed soon - we will have to change routing. I'm on the latest version of pfsense.

                                  echu2016

                                  Hi bgibson,
                                  In the meantime, I suggest you take mi heper's and my recommendation:

                                  https://forum.pfsense.org/index.php?topic=111108.msg620733#msg620733
                                  https://forum.pfsense.org/index.php?topic=111108.msg654483#msg654483

                                    bgibson

                                    Thanks - I will look into the links.

                                      Trey

                                      Hi,

                                      the new version 1.1 of quagga was released a couple of days ago:

                                      http://mirror.yannic-bonenberger.com/nongnu/quagga/quagga-1.1.0.changelog.txt

                                      As the problems started with version 1.0, and having had a look at the changelog, I hope quagga runs smoothly again after the update.

                                      It would be great to see the package updated to quagga 1.1.

                                      Thanks!

                                        reqlez

                                        The problem is that I still have not heard back from Martin after my last post, so I don't think anybody is working on a solution. The guys from pfSense have not commented either, neither on their use of -9 to restart packages nor on why the packages are restarted in the first place. So as for thinking that a new release fixed anything… it probably didn't.

                                          reqlez

                                          Hmmm… maybe I could be wrong... I do see something here...  :

                                          commit 7e73eb740f3c52a5b7c0ae9c2cd33b486d885552
                                          Author: Timo Teräs <timo.teras@iki.fi>
                                          Date:   Sat Apr 9 17:22:32 2016 +0300

                                              zebra: handle multihop nexthop changes properly

                                              The rib entries are normally added and deleted when they are
                                              changed. However, they are modified in placae when the nexthop
                                              reachability changes. This fixes to:
                                               - properly detect nexthop changes from nexthop_active_update()
                                                 calls from rib_process()
                                               - rib_update_kernel() to not reset FIB flags when a RIB entry
                                                 is being modifed (old and new RIB are same)
                                               - improves the "show ip route <prefix>" output to display
                                                 both ACTIVE and FIB flags for each nexthop

                                              Fixes: 325823a5 "zebra: support FIB override routes"
                                              Signed-off-by: Timo Teräs <timo.teras@iki.fi>
                                              Reported-By: Igor Ryzhov <iryzhov@nfware.com>
                                              Tested-by: NetDEF CI System <cisystem@netdef.org>
                                          

                                          Not sure if something here would help: "rib_update_kernel() to not reset FIB flags when a RIB entry is being modifed (old and new RIB are same)". But maybe I'm not understanding the problem properly.

                                            Spydre13

                                            @reqlez:

                                            not sure if something here would help "rib_update_kernel() to not reset FIB flags when a RIB entry is being modifed (old and new RIB are same)"  But maybe i'm not understanding the problem properly.

                                            I looked at the changelog too, and didn't see anything that would fix this.  The main problem is that when Quagga restarts, it doesn't recognize the routes that it previously put in there, so it pulls them in as "kernel" routes and they will always take precedence.  That's why it works fine until Quagga is restarted (which is basically kill & start, there is no graceful restart in Quagga).  Since the rib_sweep_table() function isn't used anymore, when it starts up it doesn't remove routes from the list of kernel routes that it previously put there (which it flags as RTF_PROTO1, or "1" in netstat -r).  I don't see how they aren't having more issues with this, unless the common scenario is that Quagga never gets restarted unless the whole OS is restarted.

                                            I don't see why kill -9 matters here, because it worked fine before v1.0, and there is no graceful restart capability in Quagga.  Ideally pfSense could use the Quagga VTY to make changes live without restarting, and then write changes to the config files for the next time it starts up, but I doubt anyone wants to take on a project like that.
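
As a rough illustration of the "make changes live" idea, a single change could be pushed into the running ospfd through vtysh instead of regenerating the config and restarting. This is only a sketch under assumptions: it presumes vtysh is installed alongside the daemons and can reach them, ospf_add_network is a made-up helper name, and writing the same change to the config files for the next start is left out.

```shell
#!/bin/sh
# ospf_add_network PREFIX AREA  (hypothetical helper)
# Adds a "network PREFIX area AREA" statement to the running ospfd via vtysh,
# without restarting zebra or ospfd. Set VTYSH to override the binary.
ospf_add_network() {
    "${VTYSH:-vtysh}" \
        -c 'configure terminal' \
        -c 'router ospf' \
        -c "network $1 area $2"
}

# Possible use:
#   ospf_add_network 10.11.11.0/24 0.0.0.0
```

Every setting in the pfSense GUI would need a mapping like this, which is why it would be a sizeable project.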

                                            If you want more details let me know, but it would probably make more sense to discuss on the Quagga list instead of here.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.