Netgate Discussion Forum

2 BGP Session dropping randomly same time

Scheduled Pinned Locked Moved TNSR
14 Posts 5 Posters 2.2k Views
  • N
    NBhatti
    last edited by Aug 2, 2022, 4:54 PM

    Hi, I am testing two BGP sessions with two different carriers in my homelab ahead of a production implementation, but I am seeing BGP drop randomly with both carriers/ISPs. My Dell server is running Version: 22.06-1~tnsr-v22.06-1 Build timestamp: Fri Jun 17 12:46:31 2022 CDT

    show ip bgp neighbor shows the following.

    BGP neighbor is 10.156.1.189, remote AS XXXX, local AS XXXXX, external link
      BGP version 4, remote router ID 192.168.224.139, local router ID 10.99.99.10
      BGP state = Established, up for 03:55:29
      Last read 00:00:01, Last write 00:00:02
      Hold time is 10, keepalive interval is 3 seconds
      Configured hold time is 10, keepalive interval is 3 seconds
      Neighbor capabilities:
        4 Byte AS: advertised and received
        AddPath:
          IPv4 Unicast: RX advertised IPv4 Unicast
        Dynamic: advertised
        Route refresh: advertised and received(old & new)
        Address Family IPv4 Unicast: advertised and received
        Hostname Capability: advertised (name: TNSR-2,domain name: n/a) not received
        Graceful Restart Capability: advertised and received
          Remote Restart timer is 120 seconds
          Address families by peer:
            none
      Graceful restart information:
        End-of-RIB send: IPv4 Unicast
        End-of-RIB received: IPv4 Unicast
        Local GR Mode: Helper*
        Remote GR Mode: Helper
        R bit: False
        Timers:
          Configured Restart Time(sec): 120
          Received Restart Time(sec): 120
        IPv4 Unicast:
          F bit: False
          End-of-RIB sent: Yes
          End-of-RIB sent after update: No
          End-of-RIB received: Yes
          Timers:
            Configured Stale Path Time(sec): 360
      Message statistics:
        Inq depth is 0
        Outq depth is 0
                             Sent       Rcvd
        Opens:                 18         18
        Notifications:         22         12
        Updates:           345953   14043667
        Keepalives:        466360     517920
        Route Refresh:          1          0
        Capability:             0          0
        Total:             812354   14561617
      Minimum time between advertisement runs is 0 seconds
      Update source is vpp7
    
     For address family: IPv4 Unicast
      Update group 12, subgroup 20
      Packet Queue length 0
      Inbound soft reconfiguration allowed
      Private AS numbers (all) removed in updates to this neighbor
      Community attribute sent to this neighbor(all)
      Outbound path policy configured
      Outgoing update prefix filter list is *EXPORT_IPv4
      959078 accepted prefixes
    
      Connections established 12; dropped 11
      **Last reset 03:55:47,   Notification sent (Hold Timer Expired)**
      External BGP neighbor may be up to 10 hops away.
    Local host: 10.156.1.190, Local port: 46565
    Foreign host: 10.156.1.189, Foreign port: 179
    Nexthop: 10.156.1.190
    BGP connection: shared network
    BGP Connect Retry Timer in Seconds: 10
    Estimated round trip time: 79 ms
    Read thread: on  Write thread: on  FD used: 24
    

    The line "Last reset 03:55:47, Notification sent (Hold Timer Expired)" shows that the hold timer expired. There is nothing in the logs, and the ISP says they don't see anything in their logs either. This happens at the same time on both BGP sessions. My server is physically connected with a DAC cable to a Mikrotik 10G switch, which is connected to an old Cisco 3750 switch where the ISP is connected.

    Any pointers on how to troubleshoot this or see why both connections reset at the same time? I have reduced the hold times to speed up recovery, but that does not help either. Could this be a layer 2 issue or something else? How can I debug it?

    Thanks.
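
When comparing several resets, it can help to pull the same few fields out of each neighbor dump. The following is a minimal sketch, assuming the text layout shown in the output above; the function name `summarize_neighbor` and the field keys are illustrative and not part of TNSR or FRR.

```python
import re

# A few lines copied from the neighbor output above.
SAMPLE = """\
  Hold time is 10, keepalive interval is 3 seconds
  Configured hold time is 10, keepalive interval is 3 seconds
  Connections established 12; dropped 11
  **Last reset 03:55:47,   Notification sent (Hold Timer Expired)**
"""

def summarize_neighbor(text):
    """Extract timer values, session counters and the last reset reason."""
    patterns = {
        "hold_time": r"Hold time is (\d+)",
        "keepalive": r"keepalive interval is (\d+) seconds",
        "established": r"Connections established (\d+)",
        "dropped": r"dropped (\d+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        summary[key] = int(match.group(1)) if match else None
    # Tolerate the trailing "**" emphasis markers seen in the pasted output.
    reset = re.search(r"Last reset ([\d:]+),\s+(.+?)\**\s*$", text, re.MULTILINE)
    summary["last_reset_age"] = reset.group(1) if reset else None
    summary["last_reset_reason"] = reset.group(2) if reset else None
    return summary

info = summarize_neighbor(SAMPLE)
print(info)
```

Running this against dumps from both peers after each event gives a quick side-by-side of reset reason and drop count.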

    • M
      michmoor LAYER 8 Rebel Alliance @NBhatti
      last edited by Aug 2, 2022, 7:20 PM

      @nbhatti Do both BGP sessions drop at the same time? If so, this strongly indicates a problem within your domain.
      Also, start checking the connected links for any errors/drops. Do you have a link that's bouncing?
      Check the system health of pfsense: are there CPU spikes or high memory utilization during the outage?

      Is the topology: pfsense----Layer 2 cloud---ISP
      If so, is spanning-tree involved? Anything in the switch logs to indicate STP failure or flapping?

      Firewall: NetGate,Palo Alto-VM,Juniper SRX
      Routing: Juniper, Arista, Cisco
      Switching: Juniper, Arista, Cisco
      Wireless: Unifi, Aruba IAP
      JNCIP,CCNP Enterprise

      • N
        NBhatti @michmoor
        last edited by Aug 2, 2022, 7:43 PM

        @michmoor This is TNSR not pfSense, but yes the topology is more or less the same. TNSR --> Mikrotik Switch --> Cisco Switch --> ISP.

        The Mikrotik runs SwitchOS, which has very limited logging capabilities, but it shows no frame drops, MAC errors or anything like that. The link is a direct 10G SFP+ DAC cable from the 08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) network card. RSTP/STP is also disabled on both the Mikrotik and the Cisco.

        The two sessions should not both go down/reset at the same time. Initially I thought this was hardware, so I changed it, but the new h/w is doing the same thing. I enabled option debug neighbor-events and

        route dynamic manager
            debug events
            log syslog
        exit
        

        but nothing shows in syslog or the FRR logs. I feel like this is a local issue but can't seem to figure out what to look for. Maybe BGP or connectivity debugging would help in this case? BGP says the session timed out, which means it's not able to talk to the peer, but both doing so at the same time? That I cannot figure out, yet.

        • M
          michmoor LAYER 8 Rebel Alliance
          last edited by Aug 2, 2022, 7:55 PM

          @nbhatti Are the switches connected to each other with a single link or multiple links?

          • N
            NBhatti @michmoor
            last edited by Aug 2, 2022, 7:59 PM

            @michmoor The Mikrotik is connected to the Cisco via a single multimode SFP cable.

            3750-01#show interfaces Gi1/0/3
            GigabitEthernet1/0/3 is up, line protocol is up (connected)
              Hardware is Gigabit Ethernet, address is 0024.1406.9183 (bia 0024.1406.9183)
              Description: TO_MIKROTIK_CCS_P3
              MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
                 reliability 255/255, txload 53/255, rxload 18/255
              Encapsulation ARPA, loopback not set
              Keepalive not set
              Full-duplex, 1000Mb/s, link type is force-up, media type is 1000BaseBX10-U SFP
              input flow-control is off, output flow-control is unsupported
              ARP type: ARPA, ARP Timeout 04:00:00
              Last input 00:00:00, output 00:00:01, output hang never
              Last clearing of "show interface" counters 2w3d
              Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 42197
              Queueing strategy: fifo
              Output queue: 0/40 (size/max)
              5 minute input rate 71404000 bits/sec, 21307 packets/sec
              5 minute output rate 211588000 bits/sec, 30550 packets/sec
                 35882878680 packets input, 17777933845598 bytes, 0 no buffer
                 Received 318434790 broadcasts (148384674 multicasts)
                 0 runts, 0 giants, 0 throttles
                 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
                 0 watchdog, 148384674 multicast, 0 pause input
                 0 input packets with dribble condition detected
                 52252747898 packets output, 47370079035222 bytes, 0 underruns
                 0 output errors, 0 collisions, 0 interface resets
                 0 babbles, 0 late collision, 0 deferred
                 0 lost carrier, 0 no carrier, 0 PAUSE output
                 0 output buffer failures, 0 output buffers swapped out
            

            It's 1G connectivity, but it is not showing more than a few hundred Mbps.
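
The counters in the interface output above can be turned into ratios with a little arithmetic. This is a rough sketch using only the numbers shown there; the helper names are illustrative. One caveat, stated as general background rather than a diagnosis: the rates are 5-minute averages, so they cannot show short bursts that fill an egress queue.

```python
def link_utilization(rate_bps, bandwidth_kbit):
    """Fraction of the configured bandwidth used by an average bit rate."""
    return rate_bps / (bandwidth_kbit * 1000)

def drop_ratio(drops, packets_output):
    """Output drops relative to packets output since the counters were cleared."""
    return drops / packets_output

# Values copied from "show interfaces Gi1/0/3" above.
BW_KBIT = 1_000_000
out_util = link_utilization(211_588_000, BW_KBIT)
in_util = link_utilization(71_404_000, BW_KBIT)
txload = 53 / 255  # Cisco reports load as a fraction of 255
ratio = drop_ratio(42_197, 52_252_747_898)

print(f"output utilization: {out_util:.1%}")
print(f"input utilization:  {in_util:.1%}")
print(f"txload 53/255:      {txload:.1%}")
print(f"output drop ratio:  {ratio:.2e}")
```

On these figures the average output load is about a fifth of the 1G link, and the drop counter is a very small fraction of the packets sent over the 2w3d since the counters were last cleared.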

            • M
              michmoor LAYER 8 Rebel Alliance @NBhatti
              last edited by michmoor Aug 2, 2022, 8:10 PM Aug 2, 2022, 8:09 PM

              @nbhatti said in 2 BGP Session dropping randomly same time:

              Total output drops: 42197

              I see drops. Granted, your counters haven't been cleared, so it would be a good idea to clear them.
              Another option is to connect the ISP handoff directly to your TNSR. If it's stable, then we know the problem is between the Cisco and the Mikrotik.

              • N
                NBhatti @michmoor
                last edited by Aug 2, 2022, 8:16 PM

                @michmoor Yes, there are drops, but no CRC errors, no input errors or anything Layer 2. Connecting the ISP to the system is a good idea. I would have to arrange an SFP card for that since the ISP is on fiber. Thanks for the idea. Let's see if anyone from Netgate can chime in as well, to see if some debug can be turned on to check whether DPDK is having some fun here maybe :)

                Thanks.

                • M
                  mleighton Administrator
                  last edited by Aug 3, 2022, 5:02 PM

                  One suggestion is to run a packet capture outside of TNSR on TCP/179 to figure out if any connectivity problems can be seen there. That is most likely a quicker step than obtaining an SFP card to test without the switches in the path.

                  • D
                    Derelict LAYER 8 Netgate @michmoor
                    last edited by Aug 7, 2022, 2:27 PM

                    @michmoor said in 2 BGP Session dropping randomly same time:

                    I see drops.

                    A "drop" is counted any time a packet is sent to the drop node for any reason, even perfectly normal ones. This could be as simple as being blocked by an ACL, having no route to host, etc.

                    If you are working in the <= 10Gbit/sec space and are using reasonably-current physical hardware and Intel NICs, you are probably not experiencing an inability of tnsr to forward whatever traffic you are asking it to forward.

                    Chattanooga, Tennessee, USA
                    A comprehensive network diagram is worth 10,000 words and 15 conference calls.
                    DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
                    Do Not Chat For Help! NO_WAN_EGRESS(TM)

                    • M
                      michmoor LAYER 8 Rebel Alliance @Derelict
                      last edited by Aug 8, 2022, 4:26 PM

                      @derelict The drops are from the Cisco side, in the output provided. More specifically, they are output drops, which are unrelated to any ACL, missing route, or even Intel NICs.

                      • ?
                        A Former User
                        last edited by Aug 10, 2022, 5:00 PM

                        If it's stable, then we know the problem is between the Cisco and the Mikrotik

                        If a MikroTik router is in play and version 7.4 is installed, the BGP problems could be coming from that device. On some devices (I don't know exactly which), only the Ethernet ports can be used together with BGP, not the SFP(+) ports; if you use the SFP(+) ports despite that warning, you will run into trouble with BGP.

                        • N
                          NBhatti @A Former User
                          last edited by Aug 10, 2022, 8:55 PM

                          @dobby_ The Mikrotik is in the path, but it's running SwitchOS 2.3, which is completely different from RouterOS. Unfortunately SwitchOS has very limited logging capabilities; I can't even see syslog. It does show interface stats and other information, in the GUI only, but during all those BGP drops there was not a single packet loss, CRC error or anything like that on the switch.
                          It happens very randomly, sometimes twice in a day and at times once a week. I can't really establish any pattern. The funky thing is that both sessions drop at the same time.
                          I even disabled one session to see if that was related, but the single remaining session also drops. I am going to plug the ISP cables directly into the NIC and see how it behaves.

                          Thanks.

                          • N
                            NBhatti @Derelict
                            last edited by Aug 10, 2022, 8:59 PM

                            @derelict said in 2 BGP Session dropping randomly same time:

                            @michmoor said in 2 BGP Session dropping randomly same time:

                            I see drops.

                            If you are working in the <= 10Gbit/sec space and are using reasonably-current physical hardware and Intel NICs, you are probably not experiencing an inability of tnsr to forward whatever traffic you are asking it to forward.

                            Even if I am using less than 10Gbit/sec, shouldn't that (worst case) cause queuing or outbound packet drops? Why would the BGP sessions drop?

                            • D
                              Derelict LAYER 8 Netgate @NBhatti
                              last edited by Aug 12, 2022, 6:29 AM

                              @nbhatti As has been said, we don't know. For what you are describing to happen there would need to be practically zero traffic passing, to both BGP peers, at the same time, for long enough to trigger the hold timer expiration. It doesn't sound like that is the case from what you have stated. Some occasional packet loss will not cause two TCP sessions to stop passing traffic at the same time and not recover.
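
The reasoning above can be illustrated with a small model of the hold timer: it restarts whenever a KEEPALIVE or UPDATE arrives, and the session resets only if nothing arrives for the full hold time. This is a simplified sketch, not FRR's implementation; the function name and the synthetic arrival times are made up for illustration, using the 10-second hold time and 3-second keepalive from the original post.

```python
def hold_timer_expiry(arrivals, hold_time, start, end):
    """Return the time the hold timer would expire, or None if it survives.

    arrivals: times (seconds) at which any BGP message was received.
    start/end: bounds of the observation window.
    """
    deadline = start + hold_time
    for t in sorted(arrivals):
        if t > deadline:
            return deadline          # silence lasted longer than the hold time
        deadline = t + hold_time     # each received message restarts the timer
    return deadline if deadline <= end else None

HOLD = 10
keepalives = list(range(3, 61, 3))   # one keepalive every 3 s for a minute

# Two consecutive keepalives lost (t=30, t=33): the session survives.
two_lost = [t for t in keepalives if t not in (30, 33)]
print(hold_timer_expiry(two_lost, HOLD, start=0, end=60))

# Four consecutive keepalives lost (t=30..39): the timer expires at t=37.
four_lost = [t for t in keepalives if not 30 <= t <= 39]
print(hold_timer_expiry(four_lost, HOLD, start=0, end=60))
```

In this model a few scattered losses never trip the timer; only a continuous gap longer than the hold time does, which matches the point that occasional packet loss alone would not explain both sessions resetting together.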

                              I would packet capture the BGP sessions to the peers (TCP port 179) and try to capture the event. Then load it up into Wireshark and see what happened to the session(s).

                              This would best be done from a place in the topology such as a switch mirror port mirroring the traffic of the port connected to the tnsr node.
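
Alongside Wireshark, a long capture can be scanned for silent periods on TCP/179. The following is a minimal sketch under several assumptions: the capture is a classic little-endian pcap file with microsecond timestamps (not pcapng), Ethernet framing, no VLAN tags, and IPv4 only. The file name in the usage comment is hypothetical.

```python
import struct

def bgp_packet_times(pcap_bytes):
    """Yield (timestamp, src_ip, dst_ip) for IPv4 TCP packets on port 179."""
    (magic,) = struct.unpack_from("<I", pcap_bytes, 0)
    (linktype,) = struct.unpack_from("<I", pcap_bytes, 20)
    if magic != 0xA1B2C3D4 or linktype != 1:
        raise ValueError("expected little-endian classic pcap with Ethernet frames")
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        ts_sec, ts_usec, incl_len, _orig = struct.unpack_from("<IIII", pcap_bytes, offset)
        frame = pcap_bytes[offset + 16 : offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short, or EtherType is not IPv4
        if frame[23] != 6:
            continue  # IP protocol is not TCP
        ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
        ports = frame[14 + ihl : 14 + ihl + 4]
        if len(ports) < 4:
            continue
        sport, dport = struct.unpack("!HH", ports)
        if 179 in (sport, dport):
            src = ".".join(map(str, frame[26:30]))
            dst = ".".join(map(str, frame[30:34]))
            yield ts_sec + ts_usec / 1e6, src, dst

def silent_gaps(times, threshold):
    """Return (last_seen, next_seen) pairs more than threshold seconds apart."""
    times = sorted(times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > threshold]

# Usage sketch (file name is hypothetical; peer address is from the thread):
# with open("bgp.pcap", "rb") as f:
#     data = f.read()
# from_peer = [t for t, src, _ in bgp_packet_times(data) if src == "10.156.1.189"]
# print(silent_gaps(from_peer, threshold=10))
```

Filtering on the peer as source shows when the local side stopped hearing from that peer; running it once per peer and comparing the gap windows shows whether the silences line up in time.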

                              Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.