Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    After a while, one or more networks stop routing - using FRR OSPF and VTi tunnels

    Scheduled Pinned Locked Moved FRR
    12 Posts 3 Posters 2.0k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • nzkiwi68N
      nzkiwi68
      last edited by

      This has happened now on two independent sites running different versions of FRR, in both occasions "some" but not all networks across IPSEC VTi tunnels seemingly randomly stopped working.

      In the Sydney case, SiteA 10.10.40.0/23 still could access SiteB 10.11.0.0/24 but SiteA 10.10.39.0/24 could no longer access SiteB 10.11.0.0/24. OSPF display routes at SiteA and SiteB both showed each site knew the correct routes, routing had stopped for 1 subnet, but the route still seemed to be present in FRR OSFP correctly.

      Sydney - pfSense 2.4.4_p3 with FRR 0.5.2
      SiteA to SiteB
      SiteA 10.10.40.0/23 & 10.10.39.0/24
      SiteB 10.11.0.0/24
      Suddenly the 10.10.39.0/24 network at SiteA could no longer get to SiteB 10.11.0.0/24

      solution - restart FRR zebra service at SiteA, routing started working again

      New Zealand - pfSense 2.4.4_p3 with FRR 0.6.2
      SiteA to SiteB
      45,000+ pings overnight, when I looked at it this morning, 3500 lost, so overnight, routing stopped.
      SiteA 192.168.5.0/24
      SiteB 10.1.1.0/24 & 192.168.1.0/24

      • solution - restart FRR zebra service at SiteA, routing started working again instantly

      Here's the OSFP routes from the NZ SiteA and SiteB firewalls, whilst routing was broken. You can clearly see both sites "know" about one another routes, yet, routing has stopped.

      However, SiteA could still get to 192.168.120.0/24 (SiteC) so some FRR routing was still working.

      SiteA pfSense, 192.168.5.0/24
      SiteA.PNG

      SiteB pfSense, 192.168.1.0/24 and 10.1.1.0/24
      SiteB.PNG

      1 Reply Last reply Reply Quote 0
      • nzkiwi68N
        nzkiwi68
        last edited by

        I know although my fault description is as accurate as I can make it, that it will be quite hard perhaps to even know internally where to look.

        Is there something I can do?
        Some specific logs or something else I can get if this occurs again to help diagnose what's going on?

        1 Reply Last reply Reply Quote 0
        • nzkiwi68N
          nzkiwi68
          last edited by

          I now have a 100% reproducible fault.

          I have WAN1 and WAN2 at the 192.168.1.251 firewall end.
          If WAN2GW goes offline (it's got an infrequently packet loss fault) then;

          • On WAN2GW going DOWN - 1 ping is lost when OSFP re-converges but everything remains online and functional, routing to remove OSPF neighbors all good.
          • On WAN2GW restoring, that's when the fault occurs.

          Also, the OSFP is setup to prefer WAN1 and not WAN2, so the main route, WAN1 never goes offline anyway, but the bouncing WAN2 causes the fault.

          Strangely the routes to one site:
          10.5.45.0/24
          10.5.55.0/24
          192.168.5.0/24

          Fail to get put back into the pfSense kernel routing table. OSFP status shows it knows about these networks and the neighbor and route to them, but pfSense itself doesn't have the route in the kernel routing table.

          All I need do is restart the FRR ospfd - FRR OSPF routing daemon OSFP service on the 192.168.1.251 firewall and the missing routes are restored (that is, put back into the pfSense kernel routing table)

          These seems to be related:
          [https://github.com/FRRouting/frr/issues/3937](link url)

          Another post listing the same buffer_write: write error on fd 8: Broken pipe error I'm seeing too
          [https://sourceforge.net/p/bsdrp/mailman/bsdrp-users/thread/CAHUyByTAkkZpUp7KBLGN_-9ZDths%2Ba0YPna3xnhoHJmRAL0JZw%40mail.gmail.com/#msg36634206](link url)

          Here's what I collected from SYSLOG:

          2019-07-31 12:04:22 User.Warning 192.168.1.251 Jul 31 12:04:05 dpinger: WAN2GW 114.23.xxx.yy: Alarm latency 593us stddev 42us loss 21%
          2019-07-31 12:04:23 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[366]: /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
          2019-07-31 12:04:23 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[48547]: /rc.dyndns.update: MONITOR: WAN2GW is down, omitting from routing group WANgroup 114.23.xxx.yy|114.23.aaa.bb|WAN2GW|0.594ms|0.035ms|23%|down
          2019-07-31 12:04:24 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[366]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW.
          2019-07-31 12:04:38 Daemon.Error 192.168.1.251 Jul 31 12:04:21 php-fpm[34605]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
          2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:40 Daemon.Warning 192.168.1.251 Jul 31 12:04:22 ospfd[85009]: Link State Acknowledgment: Unknown Neighbor 192.168.4.254.
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.40.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.50.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.61.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.4.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.120.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:06:21 Daemon.Error 192.168.1.251 Jul 31 12:06:04 php-fpm[48547]: /widgets/widgets/dyn_dns_status.widget.php: 48547MONITOR: WAN2GW is available now, adding to routing group WANgroup 114.23.xxx.yy|114.23.aaa.bb|WAN2GW|8.079ms|57.037ms|10%|none
          2019-07-31 12:06:24 User.Warning 192.168.1.251 Jul 31 12:06:07 dpinger: WAN2GW 114.23.xxx.yy: Clear latency 7618us stddev 55288us loss 5%
          2019-07-31 12:06:26 Daemon.Error 192.168.1.251 Jul 31 12:06:08 php-fpm[48547]: /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
          2019-07-31 12:06:26 Daemon.Error 192.168.1.251 Jul 31 12:06:08 php-fpm[48547]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW.
          2019-07-31 12:06:41 Daemon.Error 192.168.1.251 Jul 31 12:06:23 php-fpm[34605]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
          2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.40.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.50.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.61.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 192.168.4.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 192.168.120.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE
          2019-07-31 12:07:33 Daemon.Warning 192.168.1.251 Jul 31 12:07:16 ospfd[85009]: buffer_write: write error on fd 8: Broken pipe
          2019-07-31 12:07:33 Daemon.Warning 192.168.1.251 Jul 31 12:07:16 ospfd[85009]: zclient_send_message: buffer_write failed to zclient fd 8, closing
          2019-07-31 12:07:34 Daemon.Error 192.168.1.251 Jul 31 12:07:17 zebra[86569]: ERROR: Command returned Incomplete on config line 12: access-list CONNECTallowACL remark
          2019-07-31 12:07:34 Daemon.Error 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: ERROR: Command returned Incomplete on config line 49: access-list CONNECTallowACL remark
          2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.5.253 on int: ipsec8000:172.27.255.41
          2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.5.253 on int: ipsec8000:172.27.255.41
          2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.4.254 on int: ipsec9000:172.27.255.9
          2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.4.254 on int: ipsec9000:172.27.255.9
          
          1 Reply Last reply Reply Quote 0
          • nzkiwi68N
            nzkiwi68
            last edited by

            Does bug 9668 look like it might solve this issue too?

            https://redmine.pfsense.org/issues/9668

            I haven't tried 0.6.3 yet, but it looks like it could be related!

            1 Reply Last reply Reply Quote 0
            • nzkiwi68N
              nzkiwi68
              last edited by

              Ok, I have a 100% reproducible issue with OSPF.
              Now upgraded to FRR 0.6.3 and same issue

              Site: 1
              WAN1 and backup WAN2
              LAN: 192.168.1.0/24

              • kernel routes redistributed via an access list 10.1.1.0/24

              Site: 2
              WAN1 and backup WAN2
              LAN: 192.168.5.0/24

              • connected routes redistributed via an access list (notably 192.168.5.0/24)

              Now, site1 can ping site2
              10.1.1.1 ping 192.168.5.1 - works great

              Disable WAN2 on site2 (it's metric is 20, wan1 is metric 10, WAN2 is just backup)
              10.1.1.1 ping 192.168.5.1 - works great

              Enable WAN2 on site2
              10.1.1.1 ping 192.168.5.1 - stops

              192.168.5.0/24 as a network disappears from the site1 routing table!
              192.168.5.0/24 is shown in the OSPF routing table!

              These screen shots are all from site1.
              site1 - routing table showing the 192.168.5.0/24 route present:
              rt present.PNG

              Now disable WAN2 at site1, wait till, enable WAN2 at site1, route gone!
              rt gone.PNG

              But, OSPF knows all about it....
              ospf knows.PNG

              1 Reply Last reply Reply Quote 0
              • jimpJ
                jimp Rebel Alliance Developer Netgate
                last edited by jimp

                Did you apply the patch on https://redmine.pfsense.org/issues/9668 to your firewall, or try from a 2.5.0 snapshot?

                The resolution in that bug required two parts: A base change, plus the new FRR package

                You can install the System Patches package and then create an entry for each commit to apply the fix.

                First apply 15701e03e36051907a23ddbe5ab04f42c94c0944 to see if that helps. Then apply a264f870479c36ac1599b936bbdd547f0f8a99ec to see if that helps.

                Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

                Need help fast? Netgate Global Support!

                Do not Chat/PM for help!

                1 Reply Last reply Reply Quote 2
                • Z
                  ZPrime
                  last edited by

                  I've had issues like this with FRR across a VTI, but my routes are not sitting in the "inactive" state as documented on the bug tracker.

                  Instead they appear to be correct in FRR, and are simply not being added to the kernel/system table. These are all incoming announcements from the remote side (a Palo Alto).

                  9/10 times I can solve this by just kicking FRR, which I usually do via frrctl restart...

                  I am on FRR package v0.6.3, and I just applied both of Jim's patches to see if that improves the situation at all.

                  nzkiwi68N 1 Reply Last reply Reply Quote 0
                  • nzkiwi68N
                    nzkiwi68 @ZPrime
                    last edited by

                    @ZPrime

                    That does seem to be exactly what was occurring for me too..

                    1 Reply Last reply Reply Quote 0
                    • nzkiwi68N
                      nzkiwi68
                      last edited by

                      Today, on another site running the 9668 patch and FRR 0.6.3 - routing done using OSFP

                      I was adding and changing traditional non VTI IPSEC VPN tunnels and every time I hit apply the VPN from head office (HO) to the Data Centre hiccuped and lost 4 pings.

                      It looks like the 9668 patch is forcing FRR to reload even when IPSEC changes are made that do not involve the VTI interfaces, would that be correct?

                      If yes, that's quite an issue, this customer has some 60 traditional IPSEC tunnels and only uses FRR and OSPF between HO and the DataC for failover between WAN1 and WAN2.

                      1 Reply Last reply Reply Quote 0
                      • jimpJ
                        jimp Rebel Alliance Developer Netgate
                        last edited by

                        Yes, that's right. It's definitely not ideal but the script that gets called when that event fires doesn't see specific tunnels, so it can't be helped currently.

                        Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

                        Need help fast? Netgate Global Support!

                        Do not Chat/PM for help!

                        1 Reply Last reply Reply Quote 0
                        • nzkiwi68N
                          nzkiwi68
                          last edited by

                          Do you mind if ask...
                          Is there a change to patch 9668 on the roadmap that would address the issue and only reload FRR on VTI tunnel changes and not all IPSEC changes?

                          1 Reply Last reply Reply Quote 0
                          • jimpJ
                            jimp Rebel Alliance Developer Netgate
                            last edited by

                            Not currently. The script where it's called does not have any information which would allow it to make that kind of decision.

                            Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

                            Need help fast? Netgate Global Support!

                            Do not Chat/PM for help!

                            1 Reply Last reply Reply Quote 0
                            • First post
                              Last post
                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.