After a while, one or more networks stop routing - using FRR OSPF and VTi tunnels
-
I now have a 100% reproducible fault.
I have WAN1 and WAN2 at the 192.168.1.251 firewall end.
If WAN2GW goes offline (it's got an infrequently packet loss fault) then;- On WAN2GW going DOWN - 1 ping is lost when OSFP re-converges but everything remains online and functional, routing to remove OSPF neighbors all good.
- On WAN2GW restoring, that's when the fault occurs.
Also, the OSFP is setup to prefer WAN1 and not WAN2, so the main route, WAN1 never goes offline anyway, but the bouncing WAN2 causes the fault.
Strangely the routes to one site:
10.5.45.0/24
10.5.55.0/24
192.168.5.0/24Fail to get put back into the pfSense kernel routing table. OSFP status shows it knows about these networks and the neighbor and route to them, but pfSense itself doesn't have the route in the kernel routing table.
All I need do is restart the FRR ospfd - FRR OSPF routing daemon OSFP service on the 192.168.1.251 firewall and the missing routes are restored (that is, put back into the pfSense kernel routing table)
These seems to be related:
[https://github.com/FRRouting/frr/issues/3937](link url)Another post listing the same buffer_write: write error on fd 8: Broken pipe error I'm seeing too
[https://sourceforge.net/p/bsdrp/mailman/bsdrp-users/thread/CAHUyByTAkkZpUp7KBLGN_-9ZDths%2Ba0YPna3xnhoHJmRAL0JZw%40mail.gmail.com/#msg36634206](link url)Here's what I collected from SYSLOG:
2019-07-31 12:04:22 User.Warning 192.168.1.251 Jul 31 12:04:05 dpinger: WAN2GW 114.23.xxx.yy: Alarm latency 593us stddev 42us loss 21% 2019-07-31 12:04:23 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[366]: /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. '' 2019-07-31 12:04:23 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[48547]: /rc.dyndns.update: MONITOR: WAN2GW is down, omitting from routing group WANgroup 114.23.xxx.yy|114.23.aaa.bb|WAN2GW|0.594ms|0.035ms|23%|down 2019-07-31 12:04:24 Daemon.Error 192.168.1.251 Jul 31 12:04:06 php-fpm[366]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW. 2019-07-31 12:04:38 Daemon.Error 192.168.1.251 Jul 31 12:04:21 php-fpm[34605]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing. 2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:39 Daemon.Error 192.168.1.251 Jul 31 12:04:22 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:40 Daemon.Warning 192.168.1.251 Jul 31 12:04:22 ospfd[85009]: Link State Acknowledgment: Unknown Neighbor 192.168.4.254. 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.40.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.50.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.4.61.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.4.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:04:41 Daemon.Error 192.168.1.251 Jul 31 12:04:23 zebra[84277]: kernel_rtm_ipv4: 192.168.120.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:06:21 Daemon.Error 192.168.1.251 Jul 31 12:06:04 php-fpm[48547]: /widgets/widgets/dyn_dns_status.widget.php: 48547MONITOR: WAN2GW is available now, adding to routing group WANgroup 114.23.xxx.yy|114.23.aaa.bb|WAN2GW|8.079ms|57.037ms|10%|none 2019-07-31 12:06:24 User.Warning 192.168.1.251 Jul 31 12:06:07 dpinger: WAN2GW 114.23.xxx.yy: Clear latency 7618us stddev 55288us loss 5% 2019-07-31 12:06:26 Daemon.Error 192.168.1.251 Jul 31 12:06:08 php-fpm[48547]: /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. '' 2019-07-31 12:06:26 Daemon.Error 192.168.1.251 Jul 31 12:06:08 php-fpm[48547]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN2GW. 2019-07-31 12:06:41 Daemon.Error 192.168.1.251 Jul 31 12:06:23 php-fpm[34605]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing. 2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.40.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.50.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 10.4.61.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 192.168.4.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:06:43 Daemon.Error 192.168.1.251 Jul 31 12:06:25 zebra[84277]: kernel_rtm_ipv4: 192.168.120.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 10.5.45.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 10.5.55.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:07:33 Daemon.Error 192.168.1.251 Jul 31 12:07:16 zebra[84277]: kernel_rtm_ipv4: 192.168.5.0/24: rtm_write() unexpectedly returned -4 for command RTM_DELETE 2019-07-31 12:07:33 Daemon.Warning 192.168.1.251 Jul 31 12:07:16 ospfd[85009]: buffer_write: write error on fd 8: Broken pipe 2019-07-31 12:07:33 Daemon.Warning 192.168.1.251 Jul 31 12:07:16 ospfd[85009]: zclient_send_message: buffer_write failed to zclient fd 8, closing 2019-07-31 12:07:34 Daemon.Error 192.168.1.251 Jul 31 12:07:17 zebra[86569]: ERROR: Command returned Incomplete on config line 12: access-list CONNECTallowACL remark 2019-07-31 12:07:34 Daemon.Error 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: ERROR: Command returned Incomplete on config line 49: access-list CONNECTallowACL remark 2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.5.253 on int: ipsec8000:172.27.255.41 2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.5.253 on int: ipsec8000:172.27.255.41 2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.4.254 on int: ipsec9000:172.27.255.9 2019-07-31 12:07:34 Daemon.Warning 192.168.1.251 Jul 31 12:07:17 ospfd[87029]: Link State Update: Unknown Neighbor 192.168.4.254 on int: ipsec9000:172.27.255.9
-
Does bug 9668 look like it might solve this issue too?
https://redmine.pfsense.org/issues/9668
I haven't tried 0.6.3 yet, but it looks like it could be related!
-
Ok, I have a 100% reproducible issue with OSPF.
Now upgraded to FRR 0.6.3 and same issueSite: 1
WAN1 and backup WAN2
LAN: 192.168.1.0/24- kernel routes redistributed via an access list 10.1.1.0/24
Site: 2
WAN1 and backup WAN2
LAN: 192.168.5.0/24- connected routes redistributed via an access list (notably 192.168.5.0/24)
Now, site1 can ping site2
10.1.1.1 ping 192.168.5.1 - works greatDisable WAN2 on site2 (it's metric is 20, wan1 is metric 10, WAN2 is just backup)
10.1.1.1 ping 192.168.5.1 - works greatEnable WAN2 on site2
10.1.1.1 ping 192.168.5.1 - stops192.168.5.0/24 as a network disappears from the site1 routing table!
192.168.5.0/24 is shown in the OSPF routing table!These screen shots are all from site1.
site1 - routing table showing the 192.168.5.0/24 route present:
Now disable WAN2 at site1, wait till, enable WAN2 at site1, route gone!
But, OSPF knows all about it....
-
Did you apply the patch on https://redmine.pfsense.org/issues/9668 to your firewall, or try from a 2.5.0 snapshot?
The resolution in that bug required two parts: A base change, plus the new FRR package
You can install the System Patches package and then create an entry for each commit to apply the fix.
First apply
15701e03e36051907a23ddbe5ab04f42c94c0944
to see if that helps. Then applya264f870479c36ac1599b936bbdd547f0f8a99ec
to see if that helps. -
I've had issues like this with FRR across a VTI, but my routes are not sitting in the "inactive" state as documented on the bug tracker.
Instead they appear to be correct in FRR, and are simply not being added to the kernel/system table. These are all incoming announcements from the remote side (a Palo Alto).
9/10 times I can solve this by just kicking FRR, which I usually do via
frrctl restart
...I am on FRR package v0.6.3, and I just applied both of Jim's patches to see if that improves the situation at all.
-
That does seem to be exactly what was occurring for me too..
-
Today, on another site running the 9668 patch and FRR 0.6.3 - routing done using OSFP
I was adding and changing traditional non VTI IPSEC VPN tunnels and every time I hit apply the VPN from head office (HO) to the Data Centre hiccuped and lost 4 pings.
It looks like the 9668 patch is forcing FRR to reload even when IPSEC changes are made that do not involve the VTI interfaces, would that be correct?
If yes, that's quite an issue, this customer has some 60 traditional IPSEC tunnels and only uses FRR and OSPF between HO and the DataC for failover between WAN1 and WAN2.
-
Yes, that's right. It's definitely not ideal but the script that gets called when that event fires doesn't see specific tunnels, so it can't be helped currently.
-
Do you mind if ask...
Is there a change to patch 9668 on the roadmap that would address the issue and only reload FRR on VTI tunnel changes and not all IPSEC changes? -
Not currently. The script where it's called does not have any information which would allow it to make that kind of decision.