Major issue with QUAGGA-OSPF and VLANs (pfsense 2.3.0)
- 
 Okay. So basically I will forward your comments again and see if anybody replies… I can't believe this bug has been out for like 8 months already with still no fix and nobody seems too involved with it ... unless it specifically targets pfSense or FreeBSD somehow, but based on what you guys are saying this should affect all platforms, no? I posted a new topic on the quagga-dev list as well, and haven't seen any responses to that. I don't see how they wouldn't be having this issue on any platform as soon as quagga/zebra restarts.
- 
 Does this sound familiar? https://lists.quagga.net/pipermail/quagga-dev/2016-February/014777.html
- 
 I saw that thread a while ago. I'm not sure it's exactly the same issue, but it might help our situation. It seems like they are talking about the zebra (OSPF/BGP) routes not being removed from the kernel when the process is stopped/killed. I'm not sure what their expected behavior is, but that might not be ideal. Ideally (in my opinion) the routes would be removed when stopping the process, but not when restarting it. Otherwise the routes would be removed for a short time while zebra restarts, which I don't think would be ideal, although it would probably be minor compared to the issue we are having. If you follow that thread, it's unclear whether they ended up doing anything with it, because their patches kept failing the CI tests.
 Assuming the routes zebra inserts are not removed when stopping/restarting, the code I was referring to previously was supposed to prevent kernel routes from being inserted into zebra as "kernel" routes when they were originally put there by zebra.
 I'm amazed that there's no response to this on their lists. I'm not sure if it's because so few people are working on it or what. It seems like it would be a big deal unless there's some other code handling this that works on other OSes but not on FreeBSD.
 I'm not sure what the best option is going forward. If they are unresponsive, it seems it would be best for pfSense to lock in the last version <1.0 (short-sighted fix) or fork it and correct the issue just for pfSense, but then it wouldn't continue to be updated. It seems like there are plenty of people who want to use OSPF but not many who are working on Quagga or other OSPF projects. I would be willing to contribute towards paying pfSense, Martin Winter (www.opensourcerouting.org), or someone else to fix this.
 @jimp - can you give your opinion on this? Would it be an option to use a fork of Quagga or specify to use the version before 1.0? OSPF really is broken right now in pfSense as soon as the service restarts (which is triggered by almost any change, and other things).
- 
 Hi …. I understand it's a hack workaround, but do you think a button can be added in Pfsense under advanced options somewhere to not reboot network packages... just like that script you made? The issue right now is while this may or may not be a pfsense issue, users of pfsense cannot use the product... and while I agree this is not a long term fix, we need something ... until maybe a year or two from now or who the f*** knows when somebody will look at this bug and will fix it and we then just have an extra option in pfsense that we will remember, look back and think "Man ... I hope there is not another stupid a*** bug in OSPF again, but if there is... we have a magic button"
 I'm like ready to learn development just to fix this garbage bug ... I have coded in C++ back in the days, maybe I can pick it up fast. :( :( :( :( :(
 not sure if something here would help "rib_update_kernel() to not reset FIB flags when a RIB entry is being modified (old and new RIB are same)" But maybe i'm not understanding the problem properly.
 I looked at the changelog too, and didn't see anything that would fix this. The main problem is that when Quagga restarts, it doesn't recognize the routes that it previously put in there, so it pulls them in as "kernel" routes and they will always take precedence. That's why it works fine until Quagga is restarted (which is basically kill & start, there is no graceful restart in Quagga). Since the rib_sweep_table() function isn't used anymore, when it starts up it doesn't remove routes from the list of kernel routes that it previously put there (which it flags as RTF_PROTO1, or "1" in netstat -r). I don't see how they aren't having more issues with this, unless the common scenario is that Quagga never gets restarted unless the whole OS is restarted. I don't see why kill -9 matters here, because it worked fine before v1.0, and there is no graceful restart capability in Quagga. Ideally pfSense could use the Quagga VTY to make changes live without restarting, and then write changes to the config files for the next time it starts up, but I doubt anyone wants to take on a project like that. If you want more details let me know, but it would probably make more sense to discuss on the Quagga list instead of here.
 That sounds like the issue. Preventing it from restarting is a hackish workaround no matter what signal is used. It will get restarted at some point and failing to recover gracefully is a regression in quagga's behavior in 1.x. It needs to recognize the flags it sets on routes in the table, and it isn't. Hopefully someone at Quagga can pick up and run with that on their list.
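 For anyone trying to picture the startup problem described above, here is a toy model in Python. It is only a sketch and not Quagga's actual code: the KernelRoute class, the import_kernel_routes() helper, the sample prefixes, and the numeric value used for RTF_PROTO1 are all assumptions made for illustration. The only thing taken from the thread is the idea that zebra marks its own routes with RTF_PROTO1 and should skip them when it reads the kernel table after a restart.

```python
from dataclasses import dataclass

# Assumed value for illustration only; check net/route.h on your system.
RTF_PROTO1 = 0x8000


@dataclass(frozen=True)
class KernelRoute:
    prefix: str
    gateway: str
    flags: int


def import_kernel_routes(kernel_table, skip_own):
    """Return the routes a freshly started daemon would treat as 'kernel' routes.

    skip_own=True models the behaviour the thread says existed before 1.0:
    routes carrying the daemon's own marker flag are not imported.
    skip_own=False models the regression: everything is imported.
    """
    imported = []
    for route in kernel_table:
        if skip_own and route.flags & RTF_PROTO1:
            continue  # left over from a previous run; let OSPF re-learn it
        imported.append(route)
    return imported


# Hypothetical kernel table right after the daemon was killed and restarted.
leftovers = [
    KernelRoute("10.0.0.0/24", "192.0.2.1", RTF_PROTO1),  # installed by the old zebra
    KernelRoute("172.16.0.0/16", "192.0.2.254", 0),       # genuinely static
]

print([r.prefix for r in import_kernel_routes(leftovers, skip_own=True)])
print([r.prefix for r in import_kernel_routes(leftovers, skip_own=False)])
```

 With skip_own=False the stale 10.0.0.0/24 entry comes back as a "kernel" route, which is the situation where it then wins over the OSPF-learned route.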
- 
 Let's get like a fund raiser going, and collect like $10,000 and offer 10K to whoever can fix OSPF bug and integrate quagga VTY support into pfsense lol ( oh and integrate TCP/DNS instead of just ping support for gateway monitoring because every ISP now drops ICMP on high usage and gateway monitoring sucks without DNS / TCP ports support ) …. i'm willing to pitch in $1000 ... if all 3 conditions are met lol who else wants to donate here for a good cause ???
- 
 @Moderators 
 Please, rename this topic to 'Major issue with QUAGGA-OSPF and VLANs or VPNs (pfsense 2.3.0)'
 I have this problem on a setup with multiple OpenVPN tunnels, but I never checked this thread because the topic title only mentions VLANs. :'(
- 
 First you need to find someone willing to fix the problem, otherwise the money doesn't help. I've already pointed out where the bug is (fairly confident anyways), and could fix it just by reverting the change they made. However, there's no guarantee that they will accept the fix. I can't get a response on why that change was made, or what the intention was. If they're not going to be responsive it seems like pfSense should either revert to the older version or use a fork that corrects this issue.
- 
 yes but … who is going to be developing the fork lol
- 
 Hi guys… so ... maybe we should try changing the script and remove -9 like Martin suggested, I think he might not be too keen to respond until that is tried since he specifically asked to try that. Is it possible that while that piece of code was removed, another one was added to do the same function for cleanup of routes or similar ? 
- 
 you can find/remove the -9 in line 306-325 of /usr/local/pkg/quagga_ospfd.inc
 after clicking 'save' in the webgui the rc-file will be updated: /usr/local/etc/rc.d/quagga.sh
 i don't have a test environment but i've done this on my home box. adjusting above should be fairly safe in a non-production environment. i also fail to see how this will solve the issue; but it might be a hackish workaround (as jimp already mentioned)
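 If someone would rather script the edit than do it by hand, something along these lines might do. This is a rough Python sketch: drop_sigkill() is a made-up helper and the sample text is invented, since the real contents of quagga_ospfd.inc are not shown in this thread. Check what the file really contains before running anything like it against a firewall.

```python
import re

# Matches "kill -9" or "killall -9" (optionally with a path in front) and
# keeps only the command name, so the default signal (SIGTERM) is sent.
_SIGKILL_FLAG = re.compile(r"\b(kill(?:all)?)[ \t]+-9\b")


def drop_sigkill(script_text):
    """Return script_text with the -9 flag removed from kill/killall calls."""
    return _SIGKILL_FLAG.sub(r"\1", script_text)


# Invented sample; the real stop commands may look different.
sample = "killall -9 ospfd\n/bin/kill -9 $pid\necho done\n"
print(drop_sigkill(sample))
```

 Note that an edit to the generated rc file alone would presumably be overwritten the next time the package settings are saved, so the change belongs in the .inc file as described above.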
- 
 I would say go ahead and try this and see what happens. However, even if it cleans up the routes without using -9, that won't be ideal for two reasons:
- Do you really want it to remove all OSPF routes from your firewall for a few seconds, maybe even longer? It takes some time for OSPF to start back up, establish neighbors, etc.
- What if Quagga (either zebra or ospf) crashes at some point? You would need to restart your firewall; just starting Quagga won't work because it didn't shut down cleanly and remove the OSPF routes.
 Quagga really should be able to detect routes that it put into the kernel. Before v1.0 it did this, and it actually still does detect the routes it put there; it just doesn't remove them from the zebra RIB like it used to.
- 
 Hello there, I'm wondering if anything has changed this last month concerning the Quagga OSPF/kernel problem. It seems I'm still stuck with the kernel route being written and the OSPF route not being used … Thx
- 
 Still not solved? 
- 
 I don't know man … I almost want to switch to WatchGuard for my multi-site OSPF deployments now. Knowing that support for a major component like a routing package is non-existent (not pfSense's fault, it seems Quagga doesn't want to acknowledge the problem) is worrisome. I don't even have any more hours to invest in troubleshooting this as I have to catch up with projects.
- 
 I'd like to confirm that removing the -9 has resolved the OSPF learned routes getting stuck as Kernel routes. I have been attempting a seamless voice failover setup using 2 openvpn tunnels and was running OSPF on those interfaces. This had been the only issue preventing this from working. After several tests in my lab everything appears to be working without issue. 
- 
 @reqlez could you report to your contacts at quagga that removing -9 works around the issue? We need a permanent fix that also works with -9.
- 
 
- 
 Did https://github.com/pfsense/FreeBSD-ports/pull/265 to hack around this stupidity, since this has been going on for way too long… Obviously not a real solution, as noted here and here.
 So just to clarify, if you kill quagga without -9 it will remove the routes from the kernel until it starts back up and re-learns the routes, correct? So it basically creates a brief outage, which is not great either. It would be nice to hear from someone at pfSense about what our options are to get a long-term solution. From my understanding there are two options:
- Prevent quagga from restarting, by using VTY for configuration changes instead of generating new configuration files and restarting.
- Add the code back to quagga (zebra) that was removed that filters out the kernel routes put there by itself.
 I think #2 would be easiest, but I'm not sure if the quagga community will be open to that, as I can't find out why the code was removed/commented out to begin with.
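 For the first option, a live change through the VTY could look roughly like the sketch below. It only builds the command line for Quagga's vtysh shell (using its -c option) and prints it; it does not run anything. The vtysh_argv() helper and the example network are made up, and whether vtysh is installed and usable on a given pfSense box is an assumption.

```python
import shlex


def vtysh_argv(commands, binary="vtysh"):
    """Build an argv list that runs each command in order via `vtysh -c`."""
    argv = [binary]
    for cmd in commands:
        argv += ["-c", cmd]
    return argv


# Hypothetical change: announce one more network in area 0 without a restart.
change = [
    "configure terminal",
    "router ospf",
    "network 10.20.30.0/24 area 0.0.0.0",
]

argv = vtysh_argv(change)
print(shlex.join(argv))
```

 The package would still have to write the same change to its generated config files so that it survives the next real restart, which is the part that makes this option more work than it first looks.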
- 
 I'd figure that dealing with a ~1-2 second outage would be a whole lot better than having bogus "kernel" routes picked up by zebra and getting routing broken. You are of course welcome to provide a better solution. So far, for ~1 year, no one has provided any better ideas for the upstream regression. Also, this thread is not about "pfSense should not restart routing packages". I'd guess that the summary provided by jimp is pretty accurate: "Preventing it from restarting is a hackish workaround no matter what signal is used. It will get restarted at some point and failing to recover gracefully is a regression in quagga's behavior in 1.x." Restarting the package is required at minimum on upgrades; it is not avoidable.
- 
 "Add the code back to quagga (zebra) that was removed that filters out the kernel routes put there by itself." That is what needs to happen. Quagga needs to recognize its own routes by the flags in the routing table. There's no reason they should have removed that code that I can see.
