Major issue with QUAGGA-OSPF and VLANs (pfsense 2.3.0)

shaoranrch

Hello,

I've encountered a major problem with Quagga OSPF running over VLAN interfaces. I'm currently working on deployments that require our edge devices (pfSense) to run OSPF, so yesterday I deployed the first one with the latest pfSense version (2.3.0). This particular deployment is as follows:

LAN01 is a physical interface joined to the OSPF domain
LAN02 is a VLAN interface (tag 100), also joined to the OSPF domain
WAN01, 02 and 03 are all VLAN interfaces not joined to the OSPF domain

Everything seemed to work just fine: the firewall was learning all the networks as expected and we had 100% connectivity. We disconnected LAN01 to check the convergence time and everything worked as intended; pfSense updated its routes and removed the adjacency with the router on the other side of LAN01.

Then I tried the same with LAN02 (remember, this one is a VLAN interface). The adjacency is NEVER lost (I left it like that for 5 minutes and it stayed up), so from pfSense's perspective everything is working fine even though the router on the other side is offline (and there are no other routers on this particular VLAN). It's as if Quagga OSPF isn't noticing that it no longer receives hellos over this interface, or it notices but doesn't react as it should.

Because of this, the firewall kept all the routes pointing towards LAN02, which of course made us lose connectivity to the firewall from all over the network. We did packet captures and can see our routers sending traffic via LAN01 while pfSense answers via LAN02. We even checked the routes on pfSense: it kept all of them going via LAN02.

I thought it was some weird issue that a reboot would probably fix, so I rebooted: same behavior. I repeated the test with LAN01 and it works as it should. The issue only occurs with VLAN interfaces, not physical ones.

Any help about this would be really appreciated.

heper

could you draw a schematic of your setup ?

shaoranrch

Hello,

Thanks for the reply. Sure, here's the topology:

I've checked everything. The problem is as I described: for some reason the adjacency isn't dropped when the VLAN interface fails, and the routes are kept in the routing table. The VLANs themselves are working as intended.

heper

might be because LAN2 likely has kernel routes/gateways towards the WANs & gateway monitoring keeps overriding quagga? (i'm just speculating here)

shaoranrch

Hi,

There's a rule that applies to both interfaces (they are inside an interface group called LAN, the rule is applied here) that sets the load-balancing via gateway-groups. Other than that, there's nothing added here.

Again, this only happens when LAN02 (VLAN) is offline: the router just keeps all the routes there. It doesn't happen when LAN01 (physical) goes offline; then it recalculates everything as intended.

I really don't know what to do anymore, other than ditching pfSense altogether, since we need the setup to work like this and it seems OSPF is broken when VLANs are in use.

heper

could you post the quagga status output for all 3 situations (all online/ lan1 offline / lan2 offline)
best to highlight the related routes & leave out sensitive information

the more relevant info, the more likely the package maintainer might be able to track down the issue, or find a workaround

shaoranrch

Hi,

Upon closer examination I noticed that Quagga does indeed remove the adjacency and the OSPF table is as it should be. But for some reason the routes learned via OSPF from the VLAN 100 neighbor are being treated as kernel routes (just like you speculated, see picture below), so the router keeps using them. What could be causing this? So far:

There aren't any static routes
The routes being treated as kernel routes were all learned via OSPF (and there are a lot of them)
Quagga is working as intended: the adjacency is removed and the topology (as well as the routes) is updated. I didn't notice this the first time, but it's happening as it should
Even though the routes were learnt from OSPF and the adjacency with the neighbor selected as next-hop is offline, the routes are kept in the FIB as kernel routes…

The only gateway-group involves the WANs, other than that, the LAN group is in "allow anything" mode.

Here's the routing table:

All those kernel routes always stay the same, no matter whether R1 or R2 is offline (the OSPF routes and the LSA table, on the other hand, are updated as they should be). I really don't get what's happening here.

heper

i started looking at this more closely.

i'm facing the same/similar issue on a multilink-openvpn-site2site (192.168.99.1 & 192.168.88.1)
while both vpn are online:


O   10.0.0.0/24 [110/120] via 192.168.99.1, ovpnc1, 00:07:54
K>* 10.0.0.0/24 via 192.168.99.1, ovpnc1
O   10.10.10.0/24 [110/110] via 192.168.99.1, ovpnc1, 00:07:54
K>* 10.10.10.0/24 via 192.168.99.1, ovpnc1
O   10.10.44.0/24 [110/110] via 192.168.99.1, ovpnc1, 00:07:54
K>* 10.10.44.0/24 via 192.168.99.1, ovpnc1
O   10.10.100.0/24 [110/110] via 192.168.99.1, ovpnc1, 00:07:54
K>* 10.10.100.0/24 via 192.168.99.1, ovpnc1
O   10.20.10.0/24 [110/10] is directly connected, em2_vlan10, 00:12:26
C>* 10.20.10.0/24 is directly connected, em2_vlan10
O   10.20.100.0/24 [110/10] is directly connected, em2, 00:12:27
C>* 10.20.100.0/24 is directly connected, em2
O   10.30.10.0/24 [110/1010] via 192.168.223.2, ovpns3, 00:12:21
K>* 10.30.10.0/24 via 192.168.223.2, ovpns3
C>* 127.0.0.0/8 is directly connected, lo0

While one vpn is down:


O   10.0.0.0/24 [110/520] via 192.168.88.1, ovpnc4, 00:00:05
K>* 10.0.0.0/24 via 192.168.99.1, ovpnc1
O   10.10.10.0/24 [110/510] via 192.168.88.1, ovpnc4, 00:00:05
K>* 10.10.10.0/24 via 192.168.99.1, ovpnc1
O   10.10.44.0/24 [110/510] via 192.168.88.1, ovpnc4, 00:00:05
K>* 10.10.44.0/24 via 192.168.99.1, ovpnc1
O   10.10.100.0/24 [110/510] via 192.168.88.1, ovpnc4, 00:00:05
K>* 10.10.100.0/24 via 192.168.99.1, ovpnc1
O   10.20.10.0/24 [110/10] is directly connected, em2_vlan10, 00:29:30
C>* 10.20.10.0/24 is directly connected, em2_vlan10
O   10.20.100.0/24 [110/10] is directly connected, em2, 00:29:31
C>* 10.20.100.0/24 is directly connected, em2
O   10.30.10.0/24 [110/1010] via 192.168.223.2, ovpns3, 00:29:25
K>* 10.30.10.0/24 via 192.168.223.2, ovpns3
C>* 127.0.0.0/8 is directly connected, lo0

quagga is not just showing but USING the selected kernel routes, even though there are no static routes set for subnets 10.0.0.0/24 | 10.10.10.0/24 | 10.10.44.0/24
When I take down the link, quagga changes its OSPF route correctly, but the "old" kernel route stays in place & remains selected. This causes the routing to fail
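
For anyone who wants to check their own box for the same symptom, here is a minimal Python sketch (not part of pfSense or Quagga). It scans routing-table output in the format pasted above and flags prefixes whose selected kernel route (K>*) points at a different next hop than the OSPF (O) route. The function name, the regex and the sample are my own assumptions about the output format, and it only handles IPv4 routes that have a "via" next hop.

```python
import re

# Assumed line format, taken from the output pasted in this thread:
#   O   10.0.0.0/24 [110/520] via 192.168.88.1, ovpnc4, 00:00:05
#   K>* 10.0.0.0/24 via 192.168.99.1, ovpnc1
ROUTE_RE = re.compile(
    r"^(?P<code>[A-Z])(?P<flags>[>* ]*)\s*"
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)"
    r".*?\bvia (?P<nexthop>\d+\.\d+\.\d+\.\d+), (?P<ifname>[\w.]+)"
)


def find_stale_kernel_routes(route_output):
    """Return {prefix: (kernel_route, ospf_route)} where the two disagree.

    Each route is a (next_hop, interface) tuple. Only selected kernel
    routes (flag '>') are compared against the OSPF route for the prefix.
    """
    ospf = {}
    kernel = {}
    for line in route_output.splitlines():
        match = ROUTE_RE.match(line.strip())
        if not match:
            continue  # directly connected routes, headers, blank lines
        entry = (match["nexthop"], match["ifname"])
        if match["code"] == "O":
            ospf[match["prefix"]] = entry
        elif match["code"] == "K" and ">" in match["flags"]:
            kernel[match["prefix"]] = entry
    return {
        prefix: (kernel[prefix], ospf[prefix])
        for prefix in kernel
        if prefix in ospf and kernel[prefix] != ospf[prefix]
    }


# A few lines copied from the "one vpn is down" output above.
SAMPLE = """\
O   10.0.0.0/24 [110/520] via 192.168.88.1, ovpnc4, 00:00:05
K>* 10.0.0.0/24 via 192.168.99.1, ovpnc1
O   10.20.10.0/24 [110/10] is directly connected, em2_vlan10, 00:29:30
C>* 10.20.10.0/24 is directly connected, em2_vlan10
O   10.30.10.0/24 [110/1010] via 192.168.223.2, ovpns3, 00:29:25
K>* 10.30.10.0/24 via 192.168.223.2, ovpns3
"""

if __name__ == "__main__":
    for prefix, (k, o) in find_stale_kernel_routes(SAMPLE).items():
        print(f"{prefix}: kernel via {k[0]} ({k[1]}), OSPF via {o[0]} ({o[1]})")
```

With the sample, only 10.0.0.0/24 is flagged, since 10.30.10.0/24 has matching kernel and OSPF next hops.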

Not sure if this is a quagga issue or a freebsd issue.
Might be related to:
https://forum.pfsense.org/index.php?topic=110245.0

Hopefully @jimp will pick up this post / afaik he's one of the few people who might know the root cause of this.
In the mean time i created a bugreport here: https://redmine.pfsense.org/issues/6305

as requested adding config files:
ospfd_client_side


# This file was created by the pfSense package manager.  Do not edit!

password ******
interface ovpnc4
  ip ospf cost 500
interface ovpnc1
  ip ospf cost 100
interface ovpns3
  ip ospf cost 1000

router ospf
  ospf router-id 10.20.10.1
  network 192.168.88.0/30 area 0.0.0.1
  network 192.168.99.0/30 area 0.0.0.1
  network 192.168.223.0/30 area 0.0.0.1
  network 192.168.77.0/26 area 0.0.0.1
  network 192.168.99.2/32 area 0.0.0.1
  network 192.168.223.1/32 area 0.0.0.1
  network 192.168.88.2/32 area 0.0.0.1
  network 192.168.100.1/32 area 0.0.0.1
  network 192.168.226.1/28 area 0.0.0.1
  network 10.20.10.0/24 area 0.0.0.1
  network 192.168.2.0/24 area 0.0.0.1
  network 10.20.100.0/24 area 0.0.0.1
  network 172.20.20.0/24 area 0.0.0.1
  network 192.168.66.0/24 area 0.0.0.1

zebra_client_side


# This file was created by the pfSense package manager.  Do not edit!

password ******
ip prefix-list ACCEPTFILTER deny 192.168.77.0/26
ip prefix-list ACCEPTFILTER deny 192.168.99.2/32
ip prefix-list ACCEPTFILTER deny 192.168.223.1/32
ip prefix-list ACCEPTFILTER deny 192.168.88.2/32
ip prefix-list ACCEPTFILTER deny 192.168.100.1/32
ip prefix-list ACCEPTFILTER deny 192.168.226.1/28
ip prefix-list ACCEPTFILTER permit any
route-map ACCEPTFILTER permit 10
match ip address prefix-list ACCEPTFILTER
ip protocol ospf route-map ACCEPTFILTER

ospf_server_side


# This file was created by the pfSense package manager.  Do not edit!

password ********
interface ovpns7
  ip ospf cost 500
interface ovpns2
  ip ospf cost 1000
interface ovpns5
interface ovpns1
  ip ospf cost 1000
  ip ospf authentication-key *******

router ospf
  ospf router-id 10.10.10.1
  network 192.168.88.0/30 area 0.0.0.1
  network 192.168.99.0/30 area 0.0.0.1
  network 192.168.222.0/30 area 0.0.0.1
  network 192.168.224.0/30 area 0.0.0.1
  area 0.0.0.0 authentication
  network 192.168.88.1/32 area 0.0.0.1
  network 192.168.99.1/32 area 0.0.0.1
  network 192.168.222.1/32 area 0.0.0.1
  network 192.168.224.1/32 area 0.0.0.1
  network 192.168.100.2/32 area 0.0.0.1
  network 10.10.10.0/24 area 0.0.0.1
  network 10.10.100.0/24 area 0.0.0.1
  network 192.168.77.0/24 area 0.0.0.1
  network 192.168.1.0/24 area 0.0.0.1
  network 10.10.44.0/24 area 0.0.0.1

zebra_server_side


# This file was created by the pfSense package manager.  Do not edit!

password ******
ip prefix-list ACCEPTFILTER deny 192.168.88.1/32
ip prefix-list ACCEPTFILTER deny 192.168.99.1/32
ip prefix-list ACCEPTFILTER deny 192.168.222.1/32
ip prefix-list ACCEPTFILTER deny 192.168.224.1/32
ip prefix-list ACCEPTFILTER deny 192.168.100.2/32
ip prefix-list ACCEPTFILTER permit any
route-map ACCEPTFILTER permit 10
match ip address prefix-list ACCEPTFILTER
ip protocol ospf route-map ACCEPTFILTER

shaoranrch

Hi,

I see, incredible at least it's not an isolated issue. Hopefully they'll check this and give us a solution.

Thanks.

kennylam

Same problem applies to my pair of pfSense 2.3 boxes too... the path cost was calculated properly, but the kernel route simply took the highest priority.

Take my case as example:

O 192.168.101.0/24 [110/15] via 172.16.53.254, em1_vlan999, 00:00:12
K>* 192.168.101.0/24 via 192.168.168.1, em5

em1_vlan999 is a direct link with a lower cost (5) and em5 goes to a remote site with a higher cost (200), yet em5 was still selected. The cost settings on all sites are equal.

My setup relied on OpenVPN too, and worked fine on pfSense 2.2.3-2.2.6, until I upgraded all routers to pfSense 2.3.

pfSense 2.3_1 with Quagga_OSPF 0.6.13

shaoranrch

I believe this is a major issue and should be given top priority. We're talking about routing and deployments where redundancy is a must, so this is just unacceptable. Maybe the devs could tell us when we can expect this to be solved.

heper

while this is a major issue for you, me & probably some others, chances are that more urgent matters exist.
If you can provide more detailed debugging info, it will help finding the root cause & will help getting a solution faster.

i'm just a user of ospf & don't have the knowledge to find out why it is behaving like it is. afaik there have been few changes to the pfSense package (except the conversion of the GUI)

--------
I've just tried going back to an earlier version of quagga on a test system. It appears to solve the 'kernel-route' issue... but my test setup is too limited to fully test this. If i have spare time next week i'll run some further tests.
If your test environment is better (or you wish to risk this on a production environment), run the command below from a shell:
for 32bit:


pkg add -f http://pkg.freebsd.org/freebsd:10:x86:32/release_3/All/quagga-0.99.24.1_2.txz

for 64bit:


pkg add -f  http://pkg.freebsd.org/freebsd:10:x86:64/release_3/All/quagga-0.99.24.1_2.txz

USE WITH CAUTION / THIS MAY HAVE UNWANTED CONSEQUENCES

heper

just tried it on one of my production systems. downgrading seems to have solved the routing issues i had with the dual-openvpn failover.
i'll update the redmine accordingly.

If @shaoranrch & @kennylam could confirm that downgrading helps, then we are getting somewhere :)

kennylam

That worked for me too. OSPF routes on VLAN/OpenVPN are now selected as the primary route, according to the defined costs.

reqlez

Great... I have the same issue, and of course I only found this post after beating my head against the wall for 2 hours. K and O routes for the same interface show up, the K route obviously doesn't get updated, and my traffic doesn't fail over.

I don't have any VLANs... maybe rename the topic to "Major issue with QUAGGA-OSPF"

heper

So reverting to the older version works for you?

reqlez

By the way I confirmed that installing an older version as per above instructions fixed the problem.

What I still hate is that when a VPN connection reconnects (even one with lower priority), the OSPF package gets restarted, the routing table gets cleared, and traffic drops for a few seconds. This is an old limitation that still has not been fixed :(

reqlez

I also found another difference in the OSPF config on the version that works (downgraded).

router ospf
ospf router-id 192.168.2.254
passive-interface re1
network 192.168.2.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0

on the version that works, there is only ONE entry per subnet here … on the NEW version that doesn't work, there are 2 entries per subnet ... so it looks like this :

router ospf
ospf router-id 192.168.2.254
passive-interface re1
network 192.168.2.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0

( NOT EXACT but you get the idea, two entries per subnet under the ospfd.conf )
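
To check a generated ospfd.conf for this quickly, a minimal Python sketch could look like the one below. It just counts repeated "network ... area ..." lines. The function name is made up for illustration, and the sample text is the approximate config from this post, not a real file.

```python
from collections import Counter


def duplicate_network_statements(conf_text):
    """Return {(prefix, area): count} for network statements seen more than once."""
    counts = Counter()
    for line in conf_text.splitlines():
        parts = line.split()
        # Expected shape: network <prefix> area <area-id>
        if len(parts) == 4 and parts[0] == "network" and parts[2] == "area":
            counts[(parts[1], parts[3])] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Approximate config from the post above (the poster says it is not exact).
SAMPLE_CONF = """\
router ospf
ospf router-id 192.168.2.254
passive-interface re1
network 192.168.2.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.101.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0
network 192.168.102.0/24 area 0.0.0.0
network 192.168.103.0/24 area 0.0.0.0
network 192.168.104.0/24 area 0.0.0.0
"""

if __name__ == "__main__":
    for (prefix, area), n in duplicate_network_statements(SAMPLE_CONF).items():
        print(f"network {prefix} area {area} appears {n} times")
```

Whether the duplicate entries are related to the kernel-route problem is not established in this thread; the sketch only makes the difference easy to spot.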

r.vanmoerkerk

Hi All,

First of all, thanks for this post. We had a lot of major network issues after the pfSense 2.3 update. Thanks to this post we were able to fix the issue and understand what happened, after many hours of troubleshooting.

We found an existing bug report about this, already one month old: https://redmine.pfsense.org/issues/6305, and created a new one under our own name through our support subscription. Will post an update if we get one.

Downgrading the package fixed the issue for us.

Also, we cannot redistribute the default route 0.0.0.0/0 to our LAN using zebra.conf. We have a support request open for that question too, to hopefully get a fix or update.

heper

Thanks. I bumped the redmine ticket yesterday.

Hopefully it'll get fixed soon