Major issue with QUAGGA-OSPF and VLANs (pfsense 2.3.0)
-
I believe this is a major issue and should be given top priority. We're talking about routing and deployments where redundancy is a must; this is just unacceptable. Maybe the devs could tell us when we can expect this to be solved.
-
While this is a major issue for you, me & probably some others, chances are that more urgent matters exist.
If you can provide more detailed debugging info, it will help find the root cause & get a solution faster.
I'm just a user of OSPF & don't have the knowledge to find out why it is behaving like it is. AFAIK there have been few changes to the pfSense package (except the conversion of the GUI).
–--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I've just tried going back to an earlier version of quagga on a test system. It appears to solve the 'kernel-route' issue … but my test setup is too limited to fully test this. If I have spare time next week I'll run some further tests.
If your test environment is better (or you wish to risk this on a production environment), run the commands below from a shell:
for 32bit:
pkg add -f http://pkg.freebsd.org/freebsd:10:x86:32/release_3/All/quagga-0.99.24.1_2.txz
for 64bit:
pkg add -f http://pkg.freebsd.org/freebsd:10:x86:64/release_3/All/quagga-0.99.24.1_2.txz
USE WITH CAUTION / THIS MAY HAVE UNWANTED CONSEQUENCES
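For anyone scripting the downgrade across several boxes, here is a minimal sketch that picks one of the two URLs above from the machine architecture. The helper name quagga_pkg_url is made up for illustration, and the architecture strings are the ones `uname -m` is assumed to report; the same caution applies.

```shell
#!/bin/sh
# Sketch only: quagga_pkg_url is a hypothetical helper, not part of pfSense.
# It maps an architecture string to the quagga 0.99.24.1_2 URL quoted above.
quagga_pkg_url() {
    case "$1" in
        amd64|x86_64)
            echo "http://pkg.freebsd.org/freebsd:10:x86:64/release_3/All/quagga-0.99.24.1_2.txz" ;;
        i386|i686)
            echo "http://pkg.freebsd.org/freebsd:10:x86:32/release_3/All/quagga-0.99.24.1_2.txz" ;;
        *)
            # Unknown architecture: refuse rather than guess.
            echo "unsupported architecture: $1" >&2
            return 1 ;;
    esac
}
```

It would be used as `pkg add -f "$(quagga_pkg_url "$(uname -m)")"`, which only runs the install if the architecture was recognised.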
-
Just tried it on one of my production systems. Downgrading seems to have solved the routing issues I had with the dual-OpenVPN failover.
I'll update the Redmine accordingly.
If @shaoranrch & @kennylam could confirm that downgrading helps, then we are getting somewhere :)
-
That worked for me too. OSPF routes on VLAN/OpenVPN are now selected as the primary route, according to the defined costs.
-
Great … I have the same issue. Of course, I only found this post after beating my head against the wall for 2 hours. K and O routes for the same interface show up, the K route obviously doesn't get updated, and my traffic doesn't fail over.
I don't have any VLANs ... maybe rename the topic to "Major issue with QUAGGA-OSPF"
-
So reverting to the older version worked for you?
-
By the way I confirmed that installing an older version as per above instructions fixed the problem.
What I still hate is that when a VPN connection reconnects (even one with lower priority), the OSPF package gets restarted, the routing table gets cleared, and traffic drops for a few seconds. This is an old limitation that still has not been fixed :(
-
I also found something else different on the version of the OSPF that works ( downgraded ).
router ospf
ospf router-id 192.168.2.254
passive-interface re1
network 192.168.2.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0
On the version that works, there is only ONE entry per subnet here … on the NEW version that doesn't work, there are 2 entries per subnet ... so it looks like this :
router ospf
ospf router-id 192.168.2.254
passive-interface re1
network 192.168.2.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0
( NOT EXACT but you get the idea, two entries per subnet under the ospfd.conf )
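To check a box for the doubled entries described above without eyeballing the file, something like the following could be used. It is only a sketch: dup_networks is a made-up helper name, and it assumes the statements have the "network <prefix> area <id>" shape shown in this post.

```shell
#!/bin/sh
# Sketch only: dup_networks is a hypothetical helper.
# Reads ospfd.conf-style text on stdin and prints every
# "network <prefix> area <id>" statement that occurs more than once.
dup_networks() {
    awk '$1 == "network" { seen[$2 " " $3 " " $4]++ }
         END {
             for (k in seen)
                 if (seen[k] > 1)
                     print seen[k] "x network " k
         }' | sort
}
```

Run it as `dup_networks < ospfd.conf` (wherever that file lives on your system); no output means no statement is repeated.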
-
Hi All,
First of all, thanks for this post. We had a lot of major issues in the network after the 2.3 update of pfSense. Thanks to this post we could fix the issue and found out what happened after many hours of troubleshooting.
We found a bug report about this that is already one month old: https://redmine.pfsense.org/issues/6305 and created a new one under our own name with our support subscription. Will post an update if we get one.
Downgrading the package fixed the issue for us.
Also, we cannot redistribute the default route 0.0.0.0/0 to our LAN using zebra.conf. We have a support request open for that question as well, to hopefully get a fix or update.
-
Thanks. I bumped the Redmine ticket yesterday.
Hopefully it'll get fixed soon.
-
Hi!
I am also seeing this behaviour… Asked about it here: https://forum.pfsense.org/index.php?topic=112698.0
Attached are configuration files. These were created manually because I had to include some commands to stop Quagga inserting routes to the OpenVPN addresses into the kernel. It worked before. A recent upgrade stopped the failover from working.
I really hope this is solved quickly.
Cheers,
Miguel
-
No luck with support; they don't give any feedback or acknowledge the issue. The test mentioned in the Redmine is not a fair test. The bug has the effect that the router advertises the whole network to itself, so other BGP/OSPF instances in our network are overwritten with this new data and locations become unreachable. It has nothing to do with WAN failover in our case.
It could be that the extra kernel routes are the issue, but if that is the case then please try to fix this. All was working well with previous configs, and this issue appeared after the upgrade. After reverting the quagga package it is fixed, so don't blame me for thinking that it is related to the quagga package.
How can we get some more action on this from the pfSense side? It is issues like this that make our management lose faith in the solution. We have support but get no response about this issue because it is "not our package", yet it is part of the pfSense firewall suite product.
-
i think it would be ideal if one of the coredevs reverts the pfsense package to quagga 0.99.x.x for now. (there wasn't anything wrong with it)
then the coredevs have more time to find a way to replicate the issue & report it upstream.
I believe this issue might affect a lot of quagga-users, but not all of them have noticed it…. in some cases you only notice it when an interface goes down.
-
Hello All!!!
I've been fighting this annoying bug for three long, long days!!
Once I realized it wasn't only me, I quickly solved it by reverting the package to an older version as proposed previously in this thread (thanks!)
For me it is quite easy to reproduce.
Let's start by assuming we have a running, configured instance on a pfSense box.
Our daemon now learns a brand new route, for instance 10.1.1.0/24.
Example output:
Quagga Zebra Routes:
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, P - PIM, A - Babel,
> - selected route, * - FIB route
O>* 10.1.1.0/24 [110/12] via 192.168.123.13, em5, 00:00:31
We now have a next hop change (because of a link down situation).
That line would now be seen (in my case) like this:
O>* 10.1.1.0/24 [110/12] via 192.168.19.25, em5, 00:00:02
(sorry for the uppercase)
UP TO HERE OK!!!!
BUT, let's go back to the original route:
O>* 10.1.1.0/24 [110/12] via 192.168.123.13, em5, 00:00:31
IF, at this step, zebra is reloaded or restarted for any reason, from then on we will see this:
O> 10.1.1.0/24 [110/12] via 192.168.123.13, em5, 00:00:31
K>* 10.1.1.0/24 [110/12] via 192.168.123.13, em5, 00:00:31
What happens in a failover scenario? Well… This:
O> 10.1.1.0/24 [110/12] via 192.168.19.25, em5, 00:01:20
K> 10.1.1.0/24 [110/12] via 192.168.123.13, em5, 00:00:05
The K line (marked in red in the original post) shows the problem!!! The kernel route is wrong!!
As far as I have read, there is a zebra daemon option "--keep_kernel" that tells zebra to preserve previously learned routes when it starts up.
If my explanation seems OK, then there is one simple way to reproduce it:
1- Make OSPF learn a new route.
2- Go to Services and restart both the Quagga OSPFd and Quagga Zebra daemons.
3- Try to alter the paths and see that the line beginning with K won't change any more!!!
Hope I helped!
Thanks!!!
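The symptom described above (an O route and a K route for the same prefix pointing at different next hops) can be spotted mechanically. Below is a rough sketch, assuming the Zebra route listing is saved or piped as text in the format shown in this post; stale_kernel_routes is a made-up name, not a Quagga or pfSense tool.

```shell
#!/bin/sh
# Sketch only: stale_kernel_routes is a hypothetical helper.
# Reads Zebra route lines on stdin and reports prefixes whose K (kernel)
# next hop differs from the O (OSPF) next hop for the same prefix.
stale_kernel_routes() {
    awk '
        # Return the address following the word "via", without the trailing comma.
        function nexthop(   i, gw) {
            for (i = 1; i < NF; i++)
                if ($i == "via") { gw = $(i + 1); sub(/,$/, "", gw); return gw }
            return ""
        }
        # Only lines whose second field looks like a prefix (contains "/").
        $1 ~ /^O/ && $2 ~ /\// { o[$2] = nexthop() }
        $1 ~ /^K/ && $2 ~ /\// { k[$2] = nexthop() }
        END {
            for (p in k)
                if ((p in o) && o[p] != "" && k[p] != "" && k[p] != o[p])
                    print p ": kernel via " k[p] ", OSPF via " o[p]
        }' | sort
}
```

Fed the two failover lines from this post, it prints the 10.1.1.0/24 prefix with both next hops; when the K and O next hops agree it prints nothing.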
-
Bump!!
Any updates on this? Unfortunately I don't have an appropriate lab to test what echu2016 posted above, and my production systems are currently running the previous version of the Quagga package as suggested.
But what he posted makes perfect sense, and should be pretty simple to reproduce and track.
-
After restarting services and yanking (virtual) cables I did manage to make it break, once.
If it is related to restarting zebra, this patch might help:
http://files.atx.pfsense.org/jimp/patches/skip_restart_for_routing_packages-2.3.1.patch
Ultimately someone that can reproduce this reliably needs to report this directly to quagga since it appears to be a problematic change introduced in their 1.0.x code base.
-
Hi,
this problem is a real show stopper. Has nobody a config, that we can supply to the quagga team in order to fix the problem? This problem really sucks, as it is only showing itself from time to time…
Dear pfsense team, what about a paid bugfix? What should it cost?!
regards
trey
-
We can't reliably reproduce it here, and it isn't our code to fix. It's something in Quagga 1.x on FreeBSD, so you'd be better off approaching the Quagga developers or maybe FreeBSD developers directly.
-
Has anybody opened a ticket with quagga yet ? Because I can easily reproduce it here just have to pull the main link cable at any one of the two sides of the link and it breaks.
If nobody submitted I'll contact them when my projects settle down.
-
i don't think anyone submitted anything.
-
I don't have much experience in submitting bug reports, and honestly don't have time for all the information/testing they require to accept them.
What I can say right now is that yesterday I upgraded to pfSense 2.3.2 and the Quagga package was also upgraded to version 1.+; everything described here before has happened again. Reproducing the issue is quite easy. Just let Quagga learn a few routes, then click save or manually restart the service, and you will see the routes duplicated: one with the preceding O and the preferred one with the preceding K label.
Like this:
O> 10.33.150.128/25 [110/20] via 192.168.45.1, em2, 01:38:55
K>* 10.33.150.128/25 via 192.168.45.1, em2
If for some reason this dynamic route disappears or changes its next hop, the kernel route will still be preferred and consequently routing will be done incorrectly, like this:
O> 10.33.150.128/25 [110/20] via 192.168.129.1, em2, 00:05:13
K>* 10.33.150.128/25 via 192.168.45.1, em2
My solution again was rolling back to version 0.99 and locking the package to prevent further auto-updates.
pkg lock quagga
-
Has anybody opened a ticket with quagga yet ? Because I can easily reproduce it here just have to pull the main link cable at any one of the two sides of the link and it breaks.
If nobody submitted I'll contact them when my projects settle down.
Were you able to contact them? I was slamming my head against the wall for hours this weekend trying to figure out routing problems all over my network when I had a connection go down.
-
can it be related? https://lists.quagga.net/pipermail/quagga-dev/2016-February/014777.html
-
After restarting services and yanking (virtual) cables I did manage to make it break, once.
If it is related to restarting zebra, this patch might help:
http://files.atx.pfsense.org/jimp/patches/skip_restart_for_routing_packages-2.3.1.patch
Ultimately someone that can reproduce this reliably needs to report this directly to quagga since it appears to be a problematic change introduced in their 1.0.x code base.
I saw somewhere in the quagga notes that something got fixed recently. About this no-restart patch: will it work on the latest update? Also, why not just include an option to TURN OFF the restart of network packages, somewhere in the advanced options? That would really help with unstable lines bringing the network down while quagga restarts, even when it's only a lower-priority link that flaps.
-
We can't reliably reproduce it here, and it isn't our code to fix. It's something in Quagga 1.x on FreeBSD, so you'd be better off approaching the Quagga developers or maybe FreeBSD developers directly.
I have 2 fresh pfSenses (SG-4860) with 2 ISPs/4 OpenVPNs and OSPF on top of it.
This issue reliably reproduced :) :( :( , i.e. kernel routes aren't removed/updated properly (see 10.0.9.0/24 route):
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, P - PIM, A - Babel,
> - selected route, * - FIB route
K>* 0.0.0.0/0 via 192.168.0.1, igb1
O   10.0.9.0/24 [110/60] via 10.255.255.101, igb5, 00:02:03
K>* 10.0.9.0/24 via 10.255.2.2, ovpns1
O   10.1.102.0/24 [110/50] via 10.255.2.2, ovpns1, 00:02:03
K>* 10.1.102.0/24 via 10.255.2.2, ovpns1
O   10.11.11.0/24 [110/10] is directly connected, lagg0, 00:02:16
C>* 10.11.11.0/24 is directly connected, lagg0
O   10.255.1.0/24 [110/70] via 10.255.2.2, ovpns1, 00:02:03
K>* 10.255.1.0/24 via 10.255.2.2, ovpns1
O   10.255.2.0/24 [110/40] is directly connected, ovpns1, 00:02:16
C>* 10.255.2.0/24 is directly connected, ovpns1
O   10.255.255.0/24 [110/50] is directly connected, igb5, 00:02:16
C>* 10.255.255.0/24 is directly connected, igb5
C>* 127.0.0.0/8 is directly connected, lo0
C>* 192.168.0.0/24 is directly connected, igb1
My primary question is: if Quagga introduced some problems in recent updates, maybe we should return to a version which doesn't have the problems and push it through pfSense's packages?
I have 2 support incidents from the pfSense team; maybe I should spend one of them on this problem?
-
After restarting services and yanking (virtual) cables I did manage to make it break, once.
If it is related to restarting zebra, this patch might help:
http://files.atx.pfsense.org/jimp/patches/skip_restart_for_routing_packages-2.3.1.patch
Ultimately someone that can reproduce this reliably needs to report this directly to quagga since it appears to be a problematic change introduced in their 1.0.x code base.
Okay, I tried this patch on 2.3.2 and it won't work …
Also, I submitted a request to the quagga-users list; nobody has got back to me yet.
-
Okay, got a reply ( from Martin Winters, the quagga god himself ! ) https://lists.quagga.net/pipermail/quagga-users/2016-October/014474.html
I actually contacted the maintainer of the FreeBSD port for quagga and he referred me to the list, as he doesn't think this is port related.
If you guys want to pitch in, go ahead… Martin is asking to compile the latest code from git ... and honestly I have never compiled zebra before, I think the last thing I compiled on FreeBSD was Java lol
-
Here is another comment Martin from Quagga made: "I don’t see why pfsense would restart Quagga - so I think this might be a bug. But there might be other reasons for it which I’m unaware of."
I actually have some logs that I will be submitting either tonight or tomorrow.
-
OSPFD / ZEBRA Debug logs submitted to Martin. Now we wait and see. I have tried his "latest" development package and it does the same thing.
-
So apparently -9 is a really nasty way of stopping Quagga, as per Martin from Quagga, and he thinks this is not letting it flush routing tables before exit. Maybe there is new code in new version of Quagga that takes a bit more time to flush those routes ? and maybe that is why it was not an issue in 0.99 version but it is with 1.0 ?
See code in pfsense:
rc_stop() {
    if [ -e /var/run/quagga/zebra.pid ]; then
        /bin/kill -9 `/bin/cat /var/run/quagga/zebra.pid`
        /bin/rm -f /var/run/quagga/zebra.pid
    fi
    if [ -e /var/run/quagga/ospfd.pid ]; then
        /bin/kill -9 `/bin/cat /var/run/quagga/ospfd.pid`
        /bin/rm -f /var/run/quagga/ospfd.pid
    fi
}
But then again, why is it being restarted in the first place? Is it because of links that get IPs dynamically allocated ? A UI option to skip quagga restart would be really appreciated guys! Pulling my hair out here testing this :(
-
The -9 signal is always a bad idea on any service, it is strictly reserved for the situation where no other signal is able to terminate the process that is stuck for whatever reason. This should be common knowledge among pfSense developers and people working on the packages and I'm really surprised such amateur mistakes are being made with such an important package.
-
@kpa:
The -9 signal is always a bad idea on any service, it is strictly reserved for the situation where no other signal is able to terminate the process that is stuck for whatever reason. This should be common knowledge among pfSense developers and people working on the packages and I'm really surprised such amateur mistakes are being made with such an important package.
Here is my idea … why not have two waves of shutdowns ... first wave without -9 then sleep for a few seconds and do another wave with -9 ? Even better ... trigger the second wave only if there are any processes still running...
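A rough sketch of that two-wave idea, for discussion only: stop_gracefully is a made-up helper, not the pfSense rc_stop, and the 10-second default wait is an arbitrary assumption. Whether a clean SIGTERM shutdown actually avoids the leftover kernel routes has not been confirmed in this thread.

```shell
#!/bin/sh
# Sketch only: stop_gracefully is a hypothetical helper, not pfSense code.
# First wave: SIGTERM, then wait up to <timeout> seconds for the process to exit.
# Second wave: SIGKILL, sent only if the process is still running.
stop_gracefully() {
    pid=$1
    timeout=${2:-10}                              # seconds to wait (assumed default)
    kill -TERM "$pid" 2>/dev/null || return 0     # nothing to stop
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # process exited on its own
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null                 # last resort
    return 0
}
```

In a stop routine it would replace the kill -9 line, e.g. `stop_gracefully "$(/bin/cat /var/run/quagga/zebra.pid)"`, so the daemon gets a chance to clean up before anything forceful happens.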
-
It's still a very bad idea.
http://unix.stackexchange.com/questions/281439/why-should-i-not-use-kill-9-sigkill
Imagine a very big database that relies on a proper shutdown for its integrity whenever it has to be taken down. It has battery-backed storage and UPS power, and it easily survives a power outage by performing the proper shutdown procedures when an outage is detected, finishing them before the power really goes down. Now, if the main database process gets killed with the -9 signal, none of the shutdown procedures run because, as the linked answer describes, "the process gets the rug pulled from it": it is forcibly removed from the system in the exact state it was in when it was sent the -9 signal. This would leave the database in an inconsistent state and could cost days in repair time.
-
So apparently -9 is a really nasty way of stopping Quagga, as per Martin from Quagga, and he thinks this is not letting it flush routing tables before exit. Maybe there is new code in new version of Quagga that takes a bit more time to flush those routes ? and maybe that is why it was not an issue in 0.99 version but it is with 1.0 ?
I see Martin's reply to you on Oct. 10, but I don't see anything after that. Are you emailing him off-list?
I was looking through the Quagga code last night, and found something that I'm wondering whether or not could be the problem. Quagga (zebra daemon) puts routes into the kernel with flag "1" (RTF_PROTO1, see netstat man page). When zebra starts up it's supposed to ignore (filter out) any kernel routes with flag "1" because it should assume it put those there to begin with. I think before Quagga version 1 this was working, and in version >= 1 it pulls in those kernel routes into the zebra RIB.
If I reboot a firewall and go to OSPF -> Status -> Zebra routes, I see a bunch of OSPF routes but barely any K (kernel) routes. If I make any change on the Global Settings or Interface Settings tab quagga restarts, and then when looking at the zebra routes it is filled with kernel routes (one for each OSPF route).
Can you ask Martin to look at this:
Commit: https://github.com/Quagga/quagga/commit/0d0686f98e64017415071e590bde262f0ab5a4c9
File: zebra/zebra_rib.c
Function: rib_sweep_table
This function is commented out starting in version 1, but it was used in version 0.99.24. There is a block of code in it:
if (rib->type == ZEBRA_ROUTE_KERNEL &&
    CHECK_FLAG (rib->flags, ZEBRA_FLAG_SELFROUTE))
  {
    ret = rib_uninstall_kernel (rn, rib);
    if (! ret)
      rib_delnode (rn, rib);
  }
The rib_weed_tables function that is still being used doesn't seem to do this same thing, from what I can tell. This URL shows them side-by-side: https://fossies.org/diffs/quagga/0.99.24.1_vs_1.0.20160315/zebra/zebra_rib.c-diff.html
If you can point me to the thread where you are discussing this with Martin, I can pass this along to him if you prefer.
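One way to look at the flag "1" routes mentioned above is to filter the kernel routing table on its Flags column. This is only a sketch: proto1_routes is a made-up helper, and it assumes `netstat -rn`-style text with Destination, Gateway and Flags as the first three columns.

```shell
#!/bin/sh
# Sketch only: proto1_routes is a hypothetical helper.
# Reads `netstat -rn`-style text on stdin and prints destination, gateway and
# flags for every route whose Flags column contains "1" (RTF_PROTO1).
proto1_routes() {
    awk 'NF >= 3 && $3 ~ /^[A-Za-z0-9]*1[A-Za-z0-9]*$/ { print $1, $2, $3 }'
}
```

Running `netstat -rn -f inet | proto1_routes` before and after a quagga restart, and comparing against the K lines in the Zebra routes status page, might show whether the duplicated kernel routes are the ones zebra itself installed.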
-
Sorry I'm a mailing list noob and I just realized when you told me that this stuff is not going via lists … I'll post this and include the list this time, yes you are correct I been just emailing him
-
I just posted your comment on the same list: https://lists.quagga.net/pipermail/quagga-users/2016-October/014476.html
This time i'm being less of a noob and actually e-mailing list. We can continue this discussion there if you subscribe to it.
-
Good morning,
Has there been any update regarding this issue? Is there another forum or notes I can follow to see when this is resolved? This is causing a huge problem within our company and if not fixed soon - we will have to change routing. I'm on the latest version of pfsense.
-
Hi bgibson,
Meanwhile, I suggest you follow heper's and my recommendations:
https://forum.pfsense.org/index.php?topic=111108.msg620733#msg620733
https://forum.pfsense.org/index.php?topic=111108.msg654483#msg654483
-
Thanks - I will look into the links.
-
Hi,
The new version 1.1 of quagga was released a couple of days ago:
http://mirror.yannic-bonenberger.com/nongnu/quagga/quagga-1.1.0.changelog.txt
As the problems started with version 1.0, and having had a look at the changelog, I hope quagga runs smoothly again after the update.
It would be great to see the package updated to quagga 1.1.
Thanks!