PFS2.4.5p1 FRR 0.6.7_5 (FRR7-7.3.1) OSPF Neighbor Reset Bug - Zebra crash



  • I have a simple dual site multi-WAN setup with OpenVPN and OSPF.

    https://docs.netgate.com/pfsense/en/latest/book/openvpn/openvpn-and-multi-wan.html - like that but with FRR.

    The setup is beautiful except for one thing - whenever one of the OpenVPN tunnels flaps, not only does the neighbor relationship reset on that link (as you'd expect) but the other link resets as well.

    In more detail -

    • site A's primary link has an OpenVPN tunnel to site B's primary link, and Site A's backup link has an OpenVPN tunnel to site B's backup link.
    • Both OpenVPN tunnels are running OSPF
    • When the backup tunnel goes down, there is no issue with the primary tunnel's OSPF
    • When the backup tunnel comes back up, there are no issues with the primary tunnel itself (no ping packets are lost), but the OSPF neighbor relationship over the primary tunnel is reset (and from the logs the backup link gets reset as well.... when Zebra crashes).

    I spent a day today emulating the site design in GNS3 and doing tests and it's the same issue there. I have memtested all RAM regardness and I am ruling out any hardware issues.

    Here is the logs on the head office aka Site A pfsense firewall (I log in reverse order):

    Sep 10 21:55:30	ospfd	54482	void nsm_change_state(struct ospf_neighbor *, int):[192.168.24.1:default], Exchange -> Full): scheduling new router-LSA origination
    Sep 10 21:55:30	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: Exchange -> Full (ExchangeDone)
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is Exchange, seq_num:0x2f10e7cc, local:0x2f10e7cc
    Sep 10 21:55:30	ospfd	54482	void nsm_change_state(struct ospf_neighbor *, int):[192.168.24.1:default], Exchange -> Full): scheduling new router-LSA origination
    Sep 10 21:55:30	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns2:10.255.27.5: Exchange -> Full (ExchangeDone)
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is Exchange, seq_num:0x639ef79a, local:0x639ef79a
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: 192.168.24.1 DB Desc send with seqnum:639ef79a , flags:1
    Sep 10 21:55:30	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns2:10.255.27.5: ExStart -> Exchange (NegotiationDone)
    Sep 10 21:55:30	ospfd	54482	Packet[DD]: Neighbor 192.168.24.1 Negotiation done (Master).
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is ExStart, seq_num:0x639ef799, local:0x639ef799
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: 192.168.24.1 DB Desc send with seqnum:2f10e7cc , flags:1
    Sep 10 21:55:30	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: ExStart -> Exchange (NegotiationDone)
    Sep 10 21:55:30	ospfd	54482	Packet[DD]: Neighbor 192.168.24.1 Negotiation done (Master).
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is ExStart, seq_num:0x2f10e7cb, local:0x2f10e7cb
    Sep 10 21:55:30	ospfd	54482	Packet[DD]: Neighbor 192.168.24.1: Initial DBD from Slave, ignoring.
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: 192.168.24.1 DB Desc send with seqnum:639ef799 , flags:7
    Sep 10 21:55:30	ospfd	54482	default: Intializing [DD]: 192.168.24.1 with seqnum:639ef799 , flags:7
    Sep 10 21:55:30	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns2:10.255.27.5: Init -> ExStart (2-WayReceived)
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is Init, seq_num:0x5762edf5, local:0x639ef798
    Sep 10 21:55:30	ospfd	54482	Packet[DD]: Neighbor 192.168.24.1: Initial DBD from Slave, ignoring.
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: 192.168.24.1 DB Desc send with seqnum:2f10e7cb , flags:7
    Sep 10 21:55:30	ospfd	54482	default: Intializing [DD]: 192.168.24.1 with seqnum:2f10e7cb , flags:7
    Sep 10 21:55:30	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: Init -> ExStart (2-WayReceived)
    Sep 10 21:55:30	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is Init, seq_num:0x302e41b8, local:0x2f10e7ca
    Sep 10 21:55:27	ospfd	54482	void nsm_change_state(struct ospf_neighbor *, int):[192.168.24.1:default], Full -> Init): scheduling new router-LSA origination
    Sep 10 21:55:27	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: Full -> Init (1-WayReceived)
    Sep 10 21:55:27	ospfd	54482	void nsm_change_state(struct ospf_neighbor *, int):[192.168.24.1:default], Full -> Init): scheduling new router-LSA origination
    Sep 10 21:55:27	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns2:10.255.27.5: Full -> Init (1-WayReceived)
    Sep 10 21:55:20	ospfd	54482	void nsm_change_state(struct ospf_neighbor *, int):[192.168.24.1:default], Exchange -> Full): scheduling new router-LSA origination
    Sep 10 21:55:20	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: Exchange -> Full (ExchangeDone)
    Sep 10 21:55:20	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is Exchange, seq_num:0x2f10e7c9, local:0x2f10e7c9
    Sep 10 21:55:20	ospfd	54482	default:Packet[DD]: 192.168.24.1 DB Desc send with seqnum:2f10e7c9 , flags:1
    Sep 10 21:55:20	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: ExStart -> Exchange (NegotiationDone)
    Sep 10 21:55:20	ospfd	54482	Packet[DD]: Neighbor 192.168.24.1 Negotiation done (Master).
    Sep 10 21:55:20	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is ExStart, seq_num:0x2f10e7c8, local:0x2f10e7c8
    Sep 10 21:55:20	ospfd	54482	Packet[DD]: Neighbor 192.168.24.1: Initial DBD from Slave, ignoring.
    Sep 10 21:55:20	ospfd	54482	default:Packet[DD]: 192.168.24.1 DB Desc send with seqnum:2f10e7c8 , flags:7
    Sep 10 21:55:20	ospfd	54482	default: Intializing [DD]: 192.168.24.1 with seqnum:2f10e7c8 , flags:7
    Sep 10 21:55:20	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: Init -> ExStart (2-WayReceived)
    Sep 10 21:55:20	ospfd	54482	default:Packet[DD]: Neighbor 192.168.24.1 state is Init, seq_num:0x3341a031, local:0x0
    Sep 10 21:55:17	ospfd	54482	AdjChg: Nbr 192.168.24.1(default) on ovpns3:10.255.27.9: Down -> Init (PacketReceived)
    

    Here are the logs on the remote site aka Site B pfSense firewall (I bounced the backup tunnel on the site B firewall as that is what dials in. This log is the most telling...

    Sep 10 21:55:32	ospfd	8338	Zebra[Redistribute]: distribute-list update timer fired!
    Sep 10 21:55:30	ospfd	8338	void nsm_change_state(struct ospf_neighbor *, int):[192.168.27.1:default], Loading -> Full): scheduling new router-LSA origination
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc3:10.255.27.10: Loading -> Full (LoadingDone)
    Sep 10 21:55:30	ospfd	8338	void nsm_change_state(struct ospf_neighbor *, int):[192.168.27.1:default], Loading -> Full): scheduling new router-LSA origination
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc2:10.255.27.6: Loading -> Full (LoadingDone)
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc3:10.255.27.10: Exchange -> Loading (ExchangeDone)
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:2f10e7cc , flags:0
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: Neighbor 192.168.27.1 state is Exchange, seq_num:0x2f10e7cc, local:0x2f10e7cb
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc2:10.255.27.6: Exchange -> Loading (ExchangeDone)
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:639ef79a , flags:0
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: Neighbor 192.168.27.1 state is Exchange, seq_num:0x639ef79a, local:0x639ef799
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:639ef799 , flags:0
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc2:10.255.27.6: ExStart -> Exchange (NegotiationDone)
    Sep 10 21:55:30	ospfd	8338	Packet[DD]: Neighbor 192.168.27.1 Negotiation done (Slave).
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: Neighbor 192.168.27.1 state is ExStart, seq_num:0x639ef799, local:0x5762edf5
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:2f10e7cb , flags:0
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc3:10.255.27.10: ExStart -> Exchange (NegotiationDone)
    Sep 10 21:55:30	ospfd	8338	Packet[DD]: Neighbor 192.168.27.1 Negotiation done (Slave).
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: Neighbor 192.168.27.1 state is ExStart, seq_num:0x2f10e7cb, local:0x302e41b8
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:5762edf5 , flags:7
    Sep 10 21:55:30	ospfd	8338	default: Intializing [DD]: 192.168.27.1 with seqnum:5762edf5 , flags:7
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc2:10.255.27.6: Init -> ExStart (2-WayReceived)
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc2:10.255.27.6: Down -> Init (PacketReceived)
    Sep 10 21:55:30	ospfd	8338	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:302e41b8 , flags:7
    Sep 10 21:55:30	ospfd	8338	default: Intializing [DD]: 192.168.27.1 with seqnum:302e41b8 , flags:7
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc3:10.255.27.10: Init -> ExStart (2-WayReceived)
    Sep 10 21:55:30	ospfd	8338	AdjChg: Nbr 192.168.27.1(default) on ovpnc3:10.255.27.10: Down -> Init (PacketReceived)
    Sep 10 21:55:27	ospfd	8338	[EC 134217740] Link State Update: Unknown Neighbor 192.168.27.1 on int: ovpnc3:10.255.27.10
    Sep 10 21:55:27	ospfd	8338	EXT (ospf_ext_link_ism_change): Set Adj. SID to interface ovpnc3
    Sep 10 21:55:27	ospfd	8338	EXT (ospf_ext_link_ism_change): Set Adj. SID to interface ovpnc2
    Sep 10 21:55:27	ospfd	8338	EXT (ospf_ext_link_ism_change): Set Adj. SID to interface em1
    Sep 10 21:55:27	zebra	8096	client 20 says hello and bids fair to announce only ospf routes vrf=0
    Sep 10 21:55:27	ospfd	8338	ASBR[default:Status:1]: Update
    Sep 10 21:55:26	ospfd	39821	EXT (ospf_ext_link_ism_change): Set Adj. SID to interface ovpnc3
    Sep 10 21:55:26	ospfd	39821	void nsm_change_state(struct ospf_neighbor *, int):[192.168.27.1:default], Full -> Deleted): scheduling new router-LSA origination
    Sep 10 21:55:26	ospfd	39821	AdjChg: Nbr 192.168.27.1(default) on ovpnc3:10.255.27.10: Full -> Deleted (KillNbr)
    Sep 10 21:55:26	ospfd	39821	EXT (ospf_ext_link_ism_change): Set Adj. SID to interface ovpnc2
    Sep 10 21:55:26	ospfd	39821	void nsm_change_state(struct ospf_neighbor *, int):[192.168.27.1:default], Full -> Deleted): scheduling new router-LSA origination
    Sep 10 21:55:26	ospfd	39821	AdjChg: Nbr 192.168.27.1(default) on ovpnc2:10.255.27.6: Full -> Deleted (KillNbr)
    Sep 10 21:55:26	ospfd	39821	EXT (ospf_ext_link_ism_change): Set Adj. SID to interface em1
    Sep 10 21:55:26	ospfd	39821	Terminating on signal
    Sep 10 21:55:26	zebra	39212	Zebra final shutdown
    Sep 10 21:55:26	zebra	39212	client 20 disconnected. 15 ospf routes removed from the rib
    Sep 10 21:55:26	zebra	39212	release_daemon_table_chunks: Released 0 table chunks
    Sep 10 21:55:26	zebra	39212	Terminating on signal
    Sep 10 21:55:20	ospfd	39821	void nsm_change_state(struct ospf_neighbor *, int):[192.168.27.1:default], Exchange -> Full): scheduling new router-LSA origination
    Sep 10 21:55:20	ospfd	39821	AdjChg: Nbr 192.168.27.1(default) on ovpnc3:10.255.27.10: Exchange -> Full (ExchangeDone)
    Sep 10 21:55:20	ospfd	39821	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:2f10e7c9 , flags:0
    Sep 10 21:55:20	ospfd	39821	default:Packet[DD]: Neighbor 192.168.27.1 state is Exchange, seq_num:0x2f10e7c9, local:0x2f10e7c8
    Sep 10 21:55:20	ospfd	39821	default:Packet[DD]: 192.168.27.1 DB Desc send with seqnum:2f10e7c8 , flags:0
    

    The backup tunnel is using ovpns3 (10.255.27.8/30) and primary tunnel is ovpns2 (10.255.27.4/30).

    You can see the backup tunnel come up and all looks good. Then after that you can see both tunnels go "Full -> Init " and both start to form neighbor relationships from scratch, seemingly after a Zebra crash. During this time I've lost all my routes.

    This is a nasty show-stopping bug. Seems to be that the "zebra Terminating on signal" is the main culprit, and causing everything to fall in a great big steaming heap. Any ideas why it's doing that?!

    Thanks in advance.



  • Some more information. It doesn't seem to matter what the underlying tunnel is - OpenVPN or IPSec - it still exibits the same faulty behaviour. This is such a fundamental flaw, than I'm in total disbelief as to how this isn't big news in the pfSense user community. Either very few people use OSPF, or the few that do just don't care about losing routes?! Unbelievable.



  • FYI:

    https://forum.netgate.com/topic/149129/frr-ospf-tuning-for-fast-convergence/

    At least with ipsec, there should be some help, which is basically the patch of jimp:

    Under OSPF "Global settings", there is a field "ignore IPSEC restart", have you tried that?

    OSPF under pfsense needs a complete restart, when something changes, with all the problems, you have mentioned. It is not implemented for extended usage.



  • Hi @pete35 thanks for taking the time to respond. Yes "Ignore IPsec Restart" is ticked on both ends for the IPsec VTI tunnel scenario. That's what prompted me to try IPsec tunnels, as I prefer OpenVPN for the ability to run QoS within tunnels themselves.

    My mind is blown how such a fundamental flaw can exist within pFsense. I also tried Quagga but the same issue was there. My guess is both Quagga and FRR on pfSense make use of the same Zebra daemon, and the issue is with Zebra. In any case - doesn't matter if it's FRR or Quagga, or OpenVPN or IPsec - the fatal flaw for dynamic routing on pfSense current version annoyingly persists.

    Am tempted to see if this issue is in pfSense 2.5.0. I am also labbing up some VyOS 1.1.8 pfSense, just to test its routing behaviour for sake of comparision.



  • OK some more information. I created a new two-site topology in GNS3 from scatch, running two pfSense nodes with the current software as per the title of thread. I configured OpenVPN tunnels with routing, with primary-to-primary WAN, and backup-to-backup WAN tunnels. Any instability on one of the tunnels causes OSPF routing instability across all links (even stable links). It was delayed by about 30 seconds to 1 minute, but it was consistantly repeatable. I repeated the experiment several times throughout the day. For example I would leave the OpenVPN sessions up and saw they clocked over 20 minutes. Took down the backup tunnel. Still OK (as expected). Primary OSPF neighbor still stable, even after another 20 mins. Brought the backup link up again. Then within the minute - all OSPF routes and neighbors gone!

    NETGATE - YOU HAVE HUGE ISSUES WITH OSPF ROUTING IN PFSENSE!!!

    What's disheartening is that the response to this thread, barring one reply from @pete35, is basically tumbleweeds.. Netgate has a massive routing issue with OSPF in pfSense.... and seeminly they just don't care (gobsmacked).

    In any case I spent another half-day repeating my experiments but this time with OPNsense. I was pleasantly surprised to find it doesn't have the OSPF routing issue that pfSense does... and can now understand why they forked...

    Anyway I have been a big fan of pfSense in my time using it - wonderful software all told - but reliable routing is a must-have for me, so reluctantly, I'm packing my tools so to speak, and moving onto OPNsense. I just hope it doesn't have any different kinds of show-stopper issues (just my luck it will have).... but if they are minor then I'll purchase support and see if they can sort those out.

    I'll still keep an eye on this thread as with my workload, it'll be at least three weeks before I get anywhere near cutting over to OPNsense, so if you have any knowledge of a fix - I'm all ears!



  • Hello @Gcon ,

    Thanks for the troubleshooting effort so far, it's certainly appreciated. The root issue here warrants some broader proper implementation given that it can affect any package, but the following patch should work for the FRR package on the latest pfSense release (2.4.5-p1). Would you please apply it using the System_Patches package and test again?

    skip_restart_for_routing_packages-2.4.5.patch


  • Netgate Administrator

    Yes, test that patch if you're able to. It's also reported in the thread @pete35 linked to above.

    The correct way to go about this is to open a bug report if you have found a bug:
    https://redmine.pfsense.org/

    Or add to an existing bug if there is one. It sounds like there may well be if what you're reporting affects a lot of users. But we can always merge or link them.

    That way it can be tracked and seen by developers.

    Steve


Log in to reply