After a while, one or more networks stop routing - using FRR OSPF and VTi tunnels



  • This has happened now on two independent sites running different versions of FRR, in both occasions "some" but not all networks across IPSEC VTi tunnels seemingly randomly stopped working.

    In the Sydney case, SiteA 10.10.40.0/23 still could access SiteB 10.11.0.0/24 but SiteA 10.10.39.0/24 could no longer access SiteB 10.11.0.0/24. OSPF display routes at SiteA and SiteB both showed each site knew the correct routes, routing had stopped for 1 subnet, but the route still seemed to be present in FRR OSFP correctly.

    Sydney - pfSense 2.4.4_p3 with FRR 0.5.2
    SiteA to SiteB
    SiteA 10.10.40.0/23 & 10.10.39.0/24
    SiteB 10.11.0.0/24
    Suddenly the 10.10.39.0/24 network at SiteA could no longer get to SiteB 10.11.0.0/24

    solution - restart FRR zebra service at SiteA, routing started working again

    New Zealand - pfSense 2.4.4_p3 with FRR 0.6.2
    SiteA to SiteB
    45,000+ pings overnight, when I looked at it this morning, 3500 lost, so overnight, routing stopped.
    SiteA 192.168.5.0/24
    SiteB 10.1.1.0/24 & 192.168.1.0/24

    • solution - restart FRR zebra service at SiteA, routing started working again instantly

    Here's the OSFP routes from the NZ SiteA and SiteB firewalls, whilst routing was broken. You can clearly see both sites "know" about one another routes, yet, routing has stopped.

    However, SiteA could still get to 192.168.120.0/24 (SiteC) so some FRR routing was still working.

    SiteA pfSense, 192.168.5.0/24
    SiteA.PNG

    SiteB pfSense, 192.168.1.0/24 and 10.1.1.0/24
    SiteB.PNG



  • I know although my fault description is as accurate as I can make it, that it will be quite hard perhaps to even know internally where to look.

    Is there something I can do?
    Some specific logs or something else I can get if this occurs again to help diagnose what's going on?



  • I now have a 100% reproducible fault.

    I have WAN1 and WAN2 at the 192.168.1.251 firewall end.
    If WAN2GW goes offline (it's got an infrequently packet loss fault) then;

    • On WAN2GW going DOWN - 1 ping is lost when OSFP re-converges but everything remains online and functional, routing to remove OSPF neighbors all good.
    • On WAN2GW restoring, that's when the fault occurs.

    Also, the OSFP is setup to prefer WAN1 and not WAN2, so the main route, WAN1 never goes offline anyway, but the bouncing WAN2 causes the fault.

    Strangely the routes to one site:
    10.5.45.0/24
    10.5.55.0/24
    192.168.5.0/24

    Fail to get put back into the pfSense kernel routing table. OSFP status shows it knows about these networks and the neighbor and route to them, but pfSense itself doesn't have the route in the kernel routing table.

    All I need do is restart the FRR ospfd - FRR OSPF routing daemon OSFP service on the 192.168.1.251 firewall and the missing routes are restored (that is, put back into the pfSense kernel routing table)

    These seems to be related:
    [https://github.com/FRRouting/frr/issues/3937](link url)

    Another post listing the same buffer_write: write error on fd 8: Broken pipe error I'm seeing too
    [https://sourceforge.net/p/bsdrp/mailman/bsdrp-users/thread/CAHUyByTAkkZpUp7KBLGN_-9ZDths%2Ba0YPna3xnhoHJmRAL0JZw%40mail.gmail.com/#msg36634206](link url)

    Here's what I collected from SYSLOG:

    2019-07-31 12:04:22 User.Warning 192.168.1.251 Jul 31 12:04:05 dpinger: WAN2GW 114.23.xxx.yy: Alarm latency 593us stddev 42us loss 21%
    2019-07-31 12:04:23 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[366]: /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
    2019-07-31 12:04:23 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[48547]: /rc.dyndns.update: MONITOR: WAN2GW is down, omitting from routing group WANgroup 114.23.xxx.yy|114.23.aaa.bb|WAN2GW|0.594ms|0.035ms|23%|down
    2019-07-31 12:04:24 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[366]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW.
    2019-07-31 12:04:38 Daemon.Error 192.168.1.251 Jul 31 12:04:21 php-fpm[34605]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
    2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:40 Daemon.Warning 192.168.1.251 Jul 31 12:04:22 ospfd[85009]: Link State Acknowledgment: Unknown Neighbor 192.168.4.254.
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.40.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.50.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.61.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.4.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.120.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:06:21 Daemon.Error 192.168.1.251 Jul 31 12:06:04 php-fpm[48547]: /widgets/widgets/dyn_dns_status.widget.php: 48547MONITOR: WAN2GW is available now, adding to routing group WANgroup 114.23.xxx.yy|114.23.aaa.bb|WAN2GW|8.079ms|57.037ms|10%|none
    2019-07-31 12:06:24 User.Warning 192.168.1.251 Jul 31 12:06:07 dpinger: WAN2GW 114.23.xxx.yy: Clear latency 7618us stddev 55288us loss 5%
    2019-07-31 12:06:26 Daemon.Error 192.168.1.251 Jul 31 12:06:08 php-fpm[48547]: /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
    2019-07-31 12:06:26 Daemon.Error 192.168.1.251 Jul 31 12:06:08 php-fpm[48547]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW.
    2019-07-31 12:06:41 Daemon.Error 192.168.1.251 Jul 31 12:06:23 php-fpm[34605]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
    2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.40.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.50.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.61.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 192.168.4.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 192.168.120.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
    2019-07-31 12:07:33 Daemon.Warning 192.168.1.251 Jul 31 12:07:16 ospfd[85009]: buffer_write: write error on fd 8: Broken pipe
    2019-07-31 12:07:33 Daemon.Warning 192.168.1.251 Jul 31 12:07:16 ospfd[85009]: zclient_send_message: buffer_write failed to zclient fd 8, closing
    2019-07-31 12:07:34 Daemon.Error 192.168.1.251 Jul 31 12:07:17 zebra[86569]: ERROR: Command returned Incomplete on config line 12: access-list CONNECTallowACL remark
    2019-07-31 12:07:34 Daemon.Error 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: ERROR: Command returned Incomplete on config line 49: access-list CONNECTallowACL remark
    2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.5.253 on int: ipsec8000:172.27.255.41
    2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.5.253 on int: ipsec8000:172.27.255.41
    2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.4.254 on int: ipsec9000:172.27.255.9
    2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.4.254 on int: ipsec9000:172.27.255.9
    


  • Does bug 9668 look like it might solve this issue too?

    https://redmine.pfsense.org/issues/9668

    I haven't tried 0.6.3 yet, but it looks like it could be related!



  • Ok, I have a 100% reproducible issue with OSPF.
    Now upgraded to FRR 0.6.3 and same issue

    Site: 1
    WAN1 and backup WAN2
    LAN: 192.168.1.0/24

    • kernel routes redistributed via an access list 10.1.1.0/24

    Site: 2
    WAN1 and backup WAN2
    LAN: 192.168.5.0/24

    • connected routes redistributed via an access list (notably 192.168.5.0/24)

    Now, site1 can ping site2
    10.1.1.1 ping 192.168.5.1 - works great

    Disable WAN2 on site2 (it's metric is 20, wan1 is metric 10, WAN2 is just backup)
    10.1.1.1 ping 192.168.5.1 - works great

    Enable WAN2 on site2
    10.1.1.1 ping 192.168.5.1 - stops

    192.168.5.0/24 as a network disappears from the site1 routing table!
    192.168.5.0/24 is shown in the OSPF routing table!

    These screen shots are all from site1.
    site1 - routing table showing the 192.168.5.0/24 route present:
    rt present.PNG

    Now disable WAN2 at site1, wait till, enable WAN2 at site1, route gone!
    rt gone.PNG

    But, OSPF knows all about it....
    ospf knows.PNG


  • Rebel Alliance Developer Netgate

    Did you apply the patch on https://redmine.pfsense.org/issues/9668 to your firewall, or try from a 2.5.0 snapshot?

    The resolution in that bug required two parts: A base change, plus the new FRR package

    You can install the System Patches package and then create an entry for each commit to apply the fix.

    First apply 15701e03e36051907a23ddbe5ab04f42c94c0944 to see if that helps. Then apply a264f870479c36ac1599b936bbdd547f0f8a99ec to see if that helps.



  • I've had issues like this with FRR across a VTI, but my routes are not sitting in the "inactive" state as documented on the bug tracker.

    Instead they appear to be correct in FRR, and are simply not being added to the kernel/system table. These are all incoming announcements from the remote side (a Palo Alto).

    9/10 times I can solve this by just kicking FRR, which I usually do via frrctl restart...

    I am on FRR package v0.6.3, and I just applied both of Jim's patches to see if that improves the situation at all.



  • @ZPrime

    That does seem to be exactly what was occurring for me too..



  • Today, on another site running the 9668 patch and FRR 0.6.3 - routing done using OSFP

    I was adding and changing traditional non VTI IPSEC VPN tunnels and every time I hit apply the VPN from head office (HO) to the Data Centre hiccuped and lost 4 pings.

    It looks like the 9668 patch is forcing FRR to reload even when IPSEC changes are made that do not involve the VTI interfaces, would that be correct?

    If yes, that's quite an issue, this customer has some 60 traditional IPSEC tunnels and only uses FRR and OSPF between HO and the DataC for failover between WAN1 and WAN2.


  • Rebel Alliance Developer Netgate

    Yes, that's right. It's definitely not ideal but the script that gets called when that event fires doesn't see specific tunnels, so it can't be helped currently.



  • Do you mind if ask...
    Is there a change to patch 9668 on the roadmap that would address the issue and only reload FRR on VTI tunnel changes and not all IPSEC changes?


  • Rebel Alliance Developer Netgate

    Not currently. The script where it's called does not have any information which would allow it to make that kind of decision.


Log in to reply