BGP Flaps on pfSense

rahul.yedavi

Looking for the root cause of a BGP flap issue that has been happening more frequently on our pair of pfSense firewalls.

The logs indicate the flaps are caused by Hold Timer Expired; however, on the other end there are no interface flaps or drops.

Any pointers beyond "Hold Timer Expired" are appreciated. It keeps happening quite frequently at random intervals, and we couldn't find any issue on the switch at the other end.

Jun 13 04:09:14 <firewall_hostname> bgpd[80064]: %NOTIFICATION: sent to neighbor ixl1.1001 4/0 (Hold Timer Expired) 0 bytes
Jun 13 04:09:14 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1001(<neigh_switch_2>) in vrf default Down BGP Notification send
Jun 13 04:09:14 <firewall_hostname> zebra[79339]: [EC 100663303] kernel_rtm: 10.0.0.0/8: rtm_write() unexpectedly returned -4 for command RTM_DELETE
Jun 13 04:09:15 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1001(<neigh_switch_2>) in vrf default Up
Jun 13 04:09:18 <firewall_hostname> bgpd[80064]: %NOTIFICATION: sent to neighbor ixl0.1000 4/0 (Hold Timer Expired) 0 bytes
Jun 13 04:09:18 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl0.1000(<neigh_switch_1>) in vrf default Down BGP Notification send
Jun 13 04:09:18 <firewall_hostname> zebra[79339]: [EC 100663303] kernel_rtm: 0.0.0.0/0: rtm_write() unexpectedly returned -4 for command RTM_DELETE
Jun 13 04:09:24 <firewall_hostname> bgpd[80064]: %NOTIFICATION: sent to neighbor ixl1.1001 4/0 (Hold Timer Expired) 0 bytes
Jun 13 04:09:24 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1001(<neigh_switch_2>) in vrf default Down BGP Notification send
Jun 13 04:09:24 <firewall_hostname> bgpd[80064]: %NOTIFICATION: sent to neighbor ixl0.1001 4/0 (Hold Timer Expired) 0 bytes
Jun 13 04:09:24 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl0.1001(<neigh_switch_1>) in vrf default Down BGP Notification send
Jun 13 04:09:26 <firewall_hostname> bgpd[80064]: %NOTIFICATION: sent to neighbor ixl1.1000 4/0 (Hold Timer Expired) 0 bytes
Jun 13 04:09:26 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1000(<neigh_switch_2>) in vrf default Down BGP Notification send
Jun 13 04:09:26 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl0.1000(<neigh_switch_1>) in vrf default Up
Jun 13 04:09:26 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1001(<neigh_switch_2>) in vrf default Up
Jun 13 04:09:26 <firewall_hostname> bgpd[80064]: %NOTIFICATION: sent to neighbor ixl0.1001 6/7 (Cease/Connection collision resolution) 0 bytes
Jun 13 04:09:26 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl0.1001(<neigh_switch_1>) in vrf default Up
Jun 13 04:09:27 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1000(<neigh_switch_2>) in vrf default Up
Jun 13 04:09:29 <firewall_hostname> bgpd[80064]: %NOTIFICATION: rcvd End-of-RIB for IPv4 Unicast from ixl0.1000 in vrf default
Jun 13 04:09:29 <firewall_hostname> bgpd[80064]: %NOTIFICATION: rcvd End-of-RIB for IPv4 Unicast from ixl1.1001 in vrf default
Jun 13 04:09:29 <firewall_hostname> bgpd[80064]: %NOTIFICATION: rcvd End-of-RIB for IPv4 Unicast from ixl0.1001 in vrf default
Jun 13 04:09:29 <firewall_hostname> bgpd[80064]: %NOTIFICATION: rcvd End-of-RIB for IPv4 Unicast from ixl1.1000 in vrf default
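To see how long each session actually stayed down, the Down and Up lines can be paired per neighbor. The following is only a rough sketch that assumes the syslog layout shown above; the function names and the placeholder year are made up for illustration (the log lines carry no year).

```python
import re
from datetime import datetime

# Matches the bgpd %ADJCHANGE lines in the layout shown in the log above.
ADJCHANGE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+bgpd\[\d+\]:\s+"
    r"%ADJCHANGE: neighbor (?P<peer>[^\s(]+)\(.*?\) in vrf \S+ (?P<state>Up|Down)"
)

def parse_ts(ts, year=2000):
    # Syslog timestamps have no year; 2000 is an arbitrary placeholder.
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")

def flap_durations(lines):
    """Return (neighbor, seconds_down) for every Down that is followed by an Up."""
    down_since = {}
    flaps = []
    for line in lines:
        m = ADJCHANGE.match(line)
        if not m:
            continue
        when = parse_ts(m["ts"])
        peer = m["peer"]
        if m["state"] == "Down":
            # Keep the first Down if several are logged before the next Up.
            down_since.setdefault(peer, when)
        elif peer in down_since:
            flaps.append((peer, (when - down_since.pop(peer)).total_seconds()))
    return flaps

# Four lines copied from the log above.
SAMPLE = [
    "Jun 13 04:09:14 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1001(<neigh_switch_2>) in vrf default Down BGP Notification send",
    "Jun 13 04:09:15 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl1.1001(<neigh_switch_2>) in vrf default Up",
    "Jun 13 04:09:18 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl0.1000(<neigh_switch_1>) in vrf default Down BGP Notification send",
    "Jun 13 04:09:26 <firewall_hostname> bgpd[80064]: %ADJCHANGE: neighbor ixl0.1000(<neigh_switch_1>) in vrf default Up",
]

for peer, seconds in flap_durations(SAMPLE):
    print(f"{peer} was down for {seconds:.0f}s")
```

Running the same function over a full day of logs would show whether the sessions always drop together or independently.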

michmoor

@rahul-yedavi Well, the hold timer is pretty important, so we should focus on that. Why aren't the peer's BGP keepalives arriving in time? CPU usage?
A couple of things I would do:

Take a pcap on both sides. See if BGP keepalives are being sent and correlate that with the loss of the adjacency.
Check CPU on both sides along with the health of the link. Is this over a VPN or over a direct connect? If over a VPN, and assuming you have gateway monitoring enabled, how are things looking?
Finally, on a personal note, I always try to enable BFD. BFD does not rely on routing protocol timers, so it can detect a fault more quickly and notify the routing process so routes can converge.
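For the pcap correlation, the check boils down to finding gaps between received keepalives that exceed the negotiated hold time. A minimal sketch, assuming the arrival times have already been exported from the capture as seconds; the function name, the sample times, and the 9-second hold time are all hypothetical, so substitute the value your sessions actually negotiated.

```python
def keepalive_gaps(arrivals, hold_time):
    """Return (previous, current, gap) for consecutive arrivals more than hold_time apart.

    The hold timer is restarted by every KEEPALIVE or UPDATE received, so
    arrivals should contain the timestamps of both message types.
    """
    ordered = sorted(arrivals)
    return [(a, b, b - a) for a, b in zip(ordered, ordered[1:]) if b - a > hold_time]

# Hypothetical arrival times in seconds, with a hypothetical 9 s hold time.
for prev, cur, gap in keepalive_gaps([0, 3, 6, 20, 23], hold_time=9):
    print(f"no keepalive between t={prev}s and t={cur}s ({gap}s gap)")
```

If the capture on the sending side shows keepalives leaving on schedule during such a gap, the loss is on the path or on the receiver; if they were never sent, look at the sender.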

rahul.yedavi

@michmoor Thank you so much for the suggestions. This is very helpful.

We do intend to take a simultaneous pcap from both ends; however, the event occurs at random and we can't keep a pcap running for long intervals. Is there any kind of cron job or scheduler on pfSense that we can set to trigger on a BGP flap event and collect a pcap while the issue is actually occurring?
Noted. We will monitor the CPU trend on both ends to check whether any thresholds were crossed that could point to a resource utilization problem. This is over a direct connect.
We will have a discussion about adding BFD to the network.

michmoor

@rahul-yedavi I'm not aware of any cron job.
Are these BGP sessions made over a VPN or over a direct connection (physically connected to another router)?

If over a VPN, this may come down to the quality of the internet links between your two devices. DPinger may be able to reveal link quality issues, but ultimately you can't do anything about them.
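On the capture question, one way to sidestep a trigger is to leave a rolling capture running and stop it shortly after a flap is logged, so the files on disk cover the event without growing unbounded. This is only a sketch: it assumes a Python 3 interpreter and tcpdump are available on the box, it relies on tcpdump's `-C` (file size) and `-W` (file count) ring-buffer options, and the function names, sizes, and output path are hypothetical.

```python
import subprocess
import time

TRIGGER = "(Hold Timer Expired)"

def ring_capture_cmd(interface, out_path, file_mb=50, file_count=10):
    """Build a tcpdump command that rotates through file_count files of about file_mb each."""
    return ["tcpdump", "-n", "-i", interface,
            "-C", str(file_mb), "-W", str(file_count),
            "-w", out_path, "tcp", "port", "179"]

def is_flap(line):
    """True for bgpd log lines that report a hold timer expiry."""
    return "bgpd[" in line and TRIGGER in line

def capture_until_flap(log_lines, interface, out_path, linger=5):
    """Run the ring capture until a flap line is read, then stop it after linger seconds."""
    proc = subprocess.Popen(ring_capture_cmd(interface, out_path))
    try:
        for line in log_lines:
            if is_flap(line):
                time.sleep(linger)  # keep a few seconds of post-event traffic
                return line
        return None
    finally:
        proc.terminate()
        proc.wait()
```

One way to feed it would be to pipe a followed log into the script and pass `sys.stdin` as `log_lines`, with the interface set to one of the flapping ones such as `ixl0.1000`. The same idea would need a matching capture on the switch side to be useful for comparison.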

rahul.yedavi

@michmoor, thanks for the response. We don't have any VPN between the firewall and the downstream device with which BGP is flapping. The firewall is directly connected to the downstream switch.