BGP + MetalLB (K8s): Intermittent Long Load Times for HTTP Traffic
-
Hi,
I preface this by saying I am very new to pfSense and still learning a lot about it and networking in general. I've been chasing this issue for a few weeks and have been unable to gain traction on it. I am using pfSense with FRR BGP to hand out internal IP addresses to the MetalLB speakers/controller in a K3s cluster for ingress-nginx pods. These pods (2+) sit behind a K8s Service that receives its LoadBalancer IP from MetalLB via BGP peering with pfSense.
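For reference, the MetalLB side of this boils down to three custom resources. A minimal sketch, assuming the v0.13.x CRD-based configuration (the resource names here are made up; the peer address, AS number, and address range match my setup as shown later in this thread):

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: pfsense              # arbitrary name
  namespace: metallb-system
spec:
  myASN: 64512               # MetalLB's AS
  peerASN: 64512             # pfSense FRR's AS (iBGP, same AS)
  peerAddress: 10.0.0.1      # pfSense LAN address
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ingress-pool         # arbitrary name
  namespace: metallb-system
spec:
  addresses:
    - 10.0.30.6-10.0.30.250
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: ingress-adv          # arbitrary name
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool           # announce this pool to the BGP peer(s)

The BGPAdvertisement is what makes the speakers announce pool addresses to pfSense.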
I pretty much followed the doc here to get things set up:
I noticed that roughly one in every ~20 requests would hang and take 15+ seconds to finally complete, as shown below.
This happens consistently and makes it rather a pain to use BGP + MetalLB + K3s for internal websites. I also stumbled upon another user experiencing the same thing: here. I tried various things as discussed here as well, but I am still running into the same issue after some tinkering.
I was wondering if anyone else has had this same issue and was able to resolve it? MetalLB also works with L2 advertisement, but the user experience is better with BGP and I would like to make it work (a sketch of the L2 variant is below, for comparison). As stated, I am still very new to pfSense and have limited networking knowledge, but I can try my best to provide any additional info/logs for this issue.
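For comparison, the working L2 fallback mentioned above only swaps the advertisement resource over the same pool (minimal sketch, same v0.13.x assumptions as the BGP sketch above):

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ingress-l2           # arbitrary name
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool           # same pool as in the BGP sketch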
-
How and what are you testing there to generate those results?
Do you see anything logged on either side?
Was the load-balancer working as expected before adding BGP?
Steve
-
Hey thanks for the response!
How and what are you testing there to generate those results?
- This is from the "Inspect -> Network" panel in the Chrome browser. That panel shows the response/load times of the HTTP calls being made. The same issue happens with the Edge browser; I haven't tried any others yet.
- I am testing various internal web pages (Plex, Proxmox UI, custom apps, etc.) that run through reverse-proxy ingress-nginx pods (2+, though I also tried with just 1) deployed on a k3s cluster (on Proxmox VMs; a 3-leader/5-follower cluster where app pods run only on the followers/workers).
- The MetalLB pods (installed via Helm, version v0.13.9) deployed in the k3s cluster perform BGP peering with pfSense (FRR BGP) and request IP addresses in a specified subnet for any K8s service created with spec.type: LoadBalancer.
- In my case, I launch ingress-nginx (installed via Helm, version 1.6.4) with 2+ pods and a LoadBalancer service (a minimal sketch of such a service follows this list). The service receives an IP address in the expected subnet range, and I am able to use any browser to access internal web pages behind the reverse proxy. I also have cert-manager handling certs through Let's Encrypt and Route 53 for these internal sites. I purchased a public domain for this purpose and also run Pi-hole internally, which re-points the public domain: a wildcard DNS entry in Pi-hole points the domain at the MetalLB IP address.
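For reference, a minimal sketch of a LoadBalancer Service like the one the ingress-nginx chart creates (the name, labels, and ports here are illustrative, not copied from my cluster):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller   # illustrative name
  namespace: ingress-nginx
spec:
  type: LoadBalancer               # this is what triggers a MetalLB IP assignment
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443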
Do you see anything logged on either side?
- I do not see much in pfSense, but as stated I am very new to pfSense and BGP, so I can try to gather anything specific (a log location or wording) that you might find helpful during the window when the problem occurs. Since it occurs very often, I can definitely get some info without too much work.
- From what I can tell, though, there are no obvious errors in the other applications in the stack. The MetalLB controller and speaker logs do not show anything at the current (info) log level when the issue happens. (If useful, I could raise the log level while reproducing; a sketch of the Helm values follows this list.)
- The ingress-nginx logs are a bit more interesting: when HTTP calls start to hang, the request does not show up in the nginx logs at all. That gap suggests the traffic is not even getting there (so it is not a slow-app issue). During the hang, other browser tabs/sessions open to other internal sites work fine, though they eventually hit the same issue later, which points to this being a per-connection problem. That led me to experiment with TCP timeout values in the firewall's advanced settings, but the same issue persists. When the requests do finally reach ingress-nginx, it receives all of the hanging/waiting requests at what appears to be the exact same time.
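For the log-level bump mentioned above, a minimal sketch of Helm values overrides. I am assuming the MetalLB chart exposes controller.logLevel and speaker.logLevel (the key names are my assumption from the chart, not something I have verified in this setup):

# values.yaml overrides for the metallb Helm chart (key names assumed)
controller:
  logLevel: debug   # bump from the default info level
speaker:
  logLevel: debug   # the speakers handle the BGP sessions with pfSense

Applied with something like helm upgrade metallb metallb/metallb -f values.yaml.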
Was the load-balancer working as expected before adding BGP?
- So I had this exact same setup working with a UniFi USG and did not run into this specific issue before. When the USG started randomly shutting down, I replaced it with a pfSense router (and attempted to recreate the BGP config via FRR), and the problem started occurring. So I came here hoping maybe someone else has had similar issue(s) and resolved them.
- As linked, this exact issue appears for other users, so I do believe it's something pfSense-related at this point. I just don't have enough knowledge in this area to identify the problem myself.
- I've also tried tuning ingress-nginx various times to see if it would help. I did things like changing client/read/proxy timeouts, tuning HTTP/2, etc., but was unable to resolve the issue.
- The problem seems to only happen for the internal sites I run.
-
Additional info from the FRR Status section in pfSense that may help:
My router's IP address is 10.0.0.1 and the BGP router ID is set to something outside any used subnet (10.100.100.69).
Zebra Routes
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

K>* 0.0.0.0/0 [0/0] via REDACTED_PUB_IP, 20:11:36
C>* 10.0.0.0/24 [0/1] is directly connected, igb1, 20:11:36
K * 10.0.20.0/24 [0/0] via 10.0.20.2, 20:11:36
C>* 10.0.20.0/24 [0/1] is directly connected, ovpns1, 20:11:36
B>* 10.0.30.250/32 [200/0] via 10.0.0.220, igb1, weight 1, 00:19:25
  *                        via 10.0.0.221, igb1, weight 1, 00:19:25
  *                        via 10.0.0.222, igb1, weight 1, 00:19:25
  *                        via 10.0.0.223, igb1, weight 1, 00:19:25
  *                        via 10.0.0.224, igb1, weight 1, 00:19:25
C>* REDACTED_PUB_IP/22 [0/1] is directly connected, igb0, 20:11:36
Zebra IPv6 Routes
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

K>* ::/0 [0/0] via REDACTED_PUB_IP, 20:11:36
C>* REDACTED::/64 [0/1] is directly connected, igb0, 20:11:36
C>* REDACTED::38/128 [0/1] is directly connected, igb0, 20:11:36
C * fe80::/64 [0/1] is directly connected, ovpns1, 20:11:36
C>* fe80::/64 [0/1] is directly connected, lo0, 20:11:36
C * fe80::/64 [0/1] is directly connected, igb1, 20:11:36
C * fe80::/64 [0/1] is directly connected, igb0, 20:11:36
BGP Routes
BGP table version is 10, local router ID is 10.100.100.69, vrf id 0
Default local pref 100, local AS 64512
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i10.0.30.250/32   10.0.0.220               0             0 i
*=i                 10.0.0.222               0             0 i
*=i                 10.0.0.223               0             0 i
*=i                 10.0.0.224               0             0 i
*=i                 10.0.0.221               0             0 i

Displayed 1 routes and 5 total paths
BGP IPv6 Routes
No BGP prefixes displayed, 0 exist
BGP Summary
IPv4 Unicast Summary:
BGP router identifier 10.100.100.69, local AS number 64512 vrf-id 0
BGP table version 10
RIB entries 1, using 192 bytes of memory
Peers 5, using 71 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
10.0.0.220      4      64512      2430      2419        0    0    0 00:19:25            1        0
10.0.0.221      4      64512      2432      2420        0    0    0 00:19:26            1        0
10.0.0.222      4      64512      2431      2420        0    0    0 00:19:26            1        0
10.0.0.224      4      64512      2432      2421        0    0    0 00:19:26            1        0
10.0.0.223      4      64512      2432      2421        0    0    0 00:19:26            1        0

Total number of neighbors 5
BGP Neighbor
BGP neighbor is 10.0.0.220, remote AS 64512, local AS 64512, internal link
Hostname: k3s-worker-0
 Member of peer-group metallb-k8 for session parameters
  BGP version 4, remote router ID 10.0.0.220, local router ID 10.100.100.69
  BGP state = Established, up for 00:19:25
  Last read 00:00:25, Last write 00:00:24
  Hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      IPv4 Unicast: RX advertised
    IPv4 Unicast Route refresh: advertised
    Address Family IPv4 Unicast: advertised and received
    Address Family IPv6 Unicast: received
    Hostname Capability: advertised (name: pfSense.localdomain,domain name: n/a) not received
    Graceful Restart Capability: advertised
  Graceful restart information:
    Local GR Mode: Helper*
    Remote GR Mode: Disable
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  4          4
    Notifications:          0          0
    Updates:                1          6
    Keepalives:          2414       2418
    Route Refresh:          0          2
    Capability:             0          0
    Total:               2419       2430
  Minimum time between advertisement runs is 0 seconds

 For address family: IPv4 Unicast
  metallb-k8 peer-group member
  Update group 6, subgroup 12
  Packet Queue length 0
  Community attribute sent to this neighbor(large)
  1 accepted prefixes

  Connections established 4; dropped 3
  Last reset 00:22:04, No AFI/SAFI activated for peer
Local host: 10.0.0.1, Local port: 179
Foreign host: 10.0.0.220, Foreign port: 42481
Nexthop: 10.0.0.1
Nexthop global: fe80::6662:66ff:fe21:5b9d
Nexthop local: fe80::6662:66ff:fe21:5b9d
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 28

BGP neighbor is 10.0.0.221, remote AS 64512, local AS 64512, internal link
Hostname: k3s-worker-1
 Member of peer-group metallb-k8 for session parameters
  BGP version 4, remote router ID 10.0.0.221, local router ID 10.100.100.69
  BGP state = Established, up for 00:19:26
  Last read 00:00:26, Last write 00:00:24
  Hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      IPv4 Unicast: RX advertised
    IPv4 Unicast Route refresh: advertised
    Address Family IPv4 Unicast: advertised and received
    Address Family IPv6 Unicast: received
    Hostname Capability: advertised (name: pfSense.localdomain,domain name: n/a) not received
    Graceful Restart Capability: advertised
  Graceful restart information:
    Local GR Mode: Helper*
    Remote GR Mode: Disable
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  4          4
    Notifications:          0          0
    Updates:                1          6
    Keepalives:          2415       2420
    Route Refresh:          0          2
    Capability:             0          0
    Total:               2420       2432
  Minimum time between advertisement runs is 0 seconds

 For address family: IPv4 Unicast
  metallb-k8 peer-group member
  Update group 6, subgroup 12
  Packet Queue length 0
  Community attribute sent to this neighbor(large)
  1 accepted prefixes

  Connections established 4; dropped 3
  Last reset 00:22:03, No AFI/SAFI activated for peer
Local host: 10.0.0.1, Local port: 179
Foreign host: 10.0.0.221, Foreign port: 60415
Nexthop: 10.0.0.1
Nexthop global: fe80::6662:66ff:fe21:5b9d
Nexthop local: fe80::6662:66ff:fe21:5b9d
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 24

BGP neighbor is 10.0.0.222, remote AS 64512, local AS 64512, internal link
Hostname: k3s-worker-2
 Member of peer-group metallb-k8 for session parameters
  BGP version 4, remote router ID 10.0.0.222, local router ID 10.100.100.69
  BGP state = Established, up for 00:19:26
  Last read 00:00:26, Last write 00:00:24
  Hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      IPv4 Unicast: RX advertised
    IPv4 Unicast Route refresh: advertised
    Address Family IPv4 Unicast: advertised and received
    Address Family IPv6 Unicast: received
    Hostname Capability: advertised (name: pfSense.localdomain,domain name: n/a) not received
    Graceful Restart Capability: advertised
  Graceful restart information:
    Local GR Mode: Helper*
    Remote GR Mode: Disable
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  4          4
    Notifications:          0          0
    Updates:                1          6
    Keepalives:          2415       2419
    Route Refresh:          0          2
    Capability:             0          0
    Total:               2420       2431
  Minimum time between advertisement runs is 0 seconds

 For address family: IPv4 Unicast
  metallb-k8 peer-group member
  Update group 6, subgroup 12
  Packet Queue length 0
  Community attribute sent to this neighbor(large)
  1 accepted prefixes

  Connections established 4; dropped 3
  Last reset 00:22:03, No AFI/SAFI activated for peer
Local host: 10.0.0.1, Local port: 179
Foreign host: 10.0.0.222, Foreign port: 52305
Nexthop: 10.0.0.1
Nexthop global: fe80::6662:66ff:fe21:5b9d
Nexthop local: fe80::6662:66ff:fe21:5b9d
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 27

BGP neighbor is 10.0.0.224, remote AS 64512, local AS 64512, internal link
Hostname: k3s-worker-4
 Member of peer-group metallb-k8 for session parameters
  BGP version 4, remote router ID 10.0.0.224, local router ID 10.100.100.69
  BGP state = Established, up for 00:19:26
  Last read 00:00:26, Last write 00:00:24
  Hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      IPv4 Unicast: RX advertised
    IPv4 Unicast Route refresh: advertised
    Address Family IPv4 Unicast: advertised and received
    Address Family IPv6 Unicast: received
    Hostname Capability: advertised (name: pfSense.localdomain,domain name: n/a) not received
    Graceful Restart Capability: advertised
  Graceful restart information:
    Local GR Mode: Helper*
    Remote GR Mode: Disable
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  4          4
    Notifications:          0          0
    Updates:                1          6
    Keepalives:          2416       2420
    Route Refresh:          0          2
    Capability:             0          0
    Total:               2421       2432
  Minimum time between advertisement runs is 0 seconds

 For address family: IPv4 Unicast
  metallb-k8 peer-group member
  Update group 6, subgroup 12
  Packet Queue length 0
  Community attribute sent to this neighbor(large)
  1 accepted prefixes

  Connections established 4; dropped 3
  Last reset 00:22:03, No AFI/SAFI activated for peer
Local host: 10.0.0.1, Local port: 179
Foreign host: 10.0.0.224, Foreign port: 55361
Nexthop: 10.0.0.1
Nexthop global: fe80::6662:66ff:fe21:5b9d
Nexthop local: fe80::6662:66ff:fe21:5b9d
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 25

BGP neighbor is 10.0.0.223, remote AS 64512, local AS 64512, internal link
Hostname: k3s-worker-3
  BGP version 4, remote router ID 10.0.0.223, local router ID 10.100.100.69
  BGP state = Established, up for 00:19:26
  Last read 00:00:26, Last write 00:00:24
  Hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      IPv4 Unicast: RX advertised
    IPv4 Unicast Route refresh: advertised
    Address Family IPv4 Unicast: advertised and received
    Address Family IPv6 Unicast: received
    Hostname Capability: advertised (name: pfSense.localdomain,domain name: n/a) not received
    Graceful Restart Capability: advertised
  Graceful restart information:
    Local GR Mode: Helper*
    Remote GR Mode: Disable
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  4          4
    Notifications:          0          0
    Updates:                1          6
    Keepalives:          2416       2420
    Route Refresh:          0          2
    Capability:             0          0
    Total:               2421       2432
  Minimum time between advertisement runs is 0 seconds

 For address family: IPv4 Unicast
  Update group 7, subgroup 13
  Packet Queue length 0
  Community attribute sent to this neighbor(large)
  1 accepted prefixes

  Connections established 4; dropped 3
  Last reset 00:22:03, No AFI/SAFI activated for peer
Local host: 10.0.0.1, Local port: 179
Foreign host: 10.0.0.223, Foreign port: 41171
Nexthop: 10.0.0.1
Nexthop global: fe80::6662:66ff:fe21:5b9d
Nexthop local: fe80::6662:66ff:fe21:5b9d
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Read thread: on  Write thread: on  FD used: 26
-
Where exactly are you running the browser http test from?
The time periods you're seeing are about what I might expect for ICMP redirects in an asymmetric routing situation. And it would not surprise me to find that the Unifi just deals with that for you behind the scenes.
Otherwise I think I'd be trying to capture this in a pcap on the pfSense internal interface. You would be able to confirm that the traffic is not leaving pfSense, and we can then look at why. Lack of a route is what I'd expect there, but if that were the case I'd expect to see something in the routing logs.
Steve
-
Hey, thanks for the follow-up!
Where exactly are you running the browser http test from?
- I am running these from a Windows 11 desktop on the 10.0.0.0/24 subnet; it gets its IP address from a pfSense DHCP lease.
- MetalLB is set to use IP addresses from:
spec:
  addresses:
    - 10.0.30.6-10.0.30.250
- In my case, the ingress-nginx pods use a service that is assigned the 10.0.30.250 IP address.
I apologize, I didn't have time today to look into more details, but I will do some more digging tomorrow. I should have some time to do a pcap capture through the GUI or tcpdump, but I am not very well versed in BGP yet, so I will need to do some reading on BGP routing. Thank you very much, I appreciate your time. I will post my pcap findings here when I get a chance.
-
Hey, thanks again for working with me on this issue; I am very appreciative.
Attached is a pcap dump done through the pfSense GUI (I can also run tcpdump locally if needed), and I can definitely provide more information on request.
10.0.30.250 is the IP address for the ingress-nginx service running in k3s; it receives that IP from MetalLB + pfSense. 10.0.0.220 is the k3s worker node where ingress-nginx is deployed. It is in the default subnet (10.0.0.0/24) but outside the DHCP client range; I added a DHCP reservation so this node always uses a specific IP address.
10.0.0.24 is the laptop I was accessing the web pages from, also on my default subnet. During the test I reproduced the problem twice, so it should be in the pcap. I am not sure what to look for, but I do see ICMP redirects. I am looking forward to seeing what you find and getting closer to resolving this issue!
Thanks,
Peter
-
Ok, a diagram might be useful here just to remove confusion.
However I think that is your problem:
22:35:53.091736 IP 10.0.0.1 > 10.0.0.24: ICMP redirect 10.0.30.250 to host 10.0.0.220, length 72
As I understand it that is pfSense (10.0.0.1) redirecting your test host (10.0.0.24) to access nginx (10.0.30.250) via 10.0.0.220.
It does that because pfSense has a route to 10.0.30.250 via 10.0.0.220, and it's telling the host it can reach the target directly without having to go through pfSense.
So the host will do that until the ICMP redirect expires, at which point it will start sending traffic to pfSense again. But pfSense will block it, because it will be TCP traffic that is out of state. I would expect to see that blocked traffic in the firewall log.
This: https://docs.netgate.com/pfsense/en/latest/troubleshooting/asymmetric-routing.html#common-scenario
Basically the test is invalid because the test host is in the same subnet as the target, creating an asymmetric route.
If you test from somewhere external I would not expect to see this issue.
Steve
-
I've been busy lately, so I haven't had the chance to respond until now, but I read through your response and the linked documents, and it all makes sense to me. As stated, I am very new to pfSense + BGP, so I tried my best to understand the concepts.
I have resolved the issue for now with your help and will try to post a longer read/explanation in a few days (or edit this existing post) about what I learned/fixed, since I always like to document what I've learned for myself and to help future readers who may stumble on the same issue.
But essentially, yes, it's due to asymmetric routing, as you explained in your post. I ended up using my UniFi 16-port switch and pfSense to create a VLAN for the Proxmox server where the k3s VMs run (setting the LACP-bonded ports in UniFi to the VLAN ID that matches in both systems), then added firewall rules to allow traffic between the VLAN network and the LAN network, which is also briefly mentioned in the doc you linked. I also found this doc very helpful:
https://docs.netgate.com/pfsense/en/latest/routing/static.html#asymmetric-routing
The issue, from my new understanding, was that my traffic flowed as:
Client (10.0.0.0/24 via DHCP) -> pfSense (10.0.0.1) -> MetalLB LoadBalancer IP (10.0.30.0/24) -> k3s workers (10.0.0.220-230)
As Steve and the docs state, pfSense tells the client to access the destination directly, and the client does so until the ICMP redirect expires. Since my return traffic goes straight from the k3s workers -> client, this creates an asymmetric route: traffic is eventually blocked when the ICMP redirect expires, and the redirect has to occur again when you try to reconnect.
With asymmetric routing such as in this example, any stateful firewall will drop legitimate traffic because it cannot properly keep state without seeing traffic in both directions. This generally only affects TCP, since other protocols do not have a formal connection handshake the firewall can recognize for use in state tracking.
From my new/limited understanding, one way to resolve this would be to add a static route so hosts don't rely on the router's ICMP redirects for routing. But since in my previous (problem) setup the client (10.0.0.0/24 via DHCP) and the k3s workers (10.0.0.220-230) were in the same subnet, you can't add such a static route (and it wouldn't make sense anyway).
So a more traditional/proper setup is to put the Proxmox server(s) behind a VLAN in pfSense and the managed UniFi switch. This creates a new subnet (10.0.10.0/24 in my case) for the k3s worker VMs and provides a different gateway (10.0.10.1), plus the ability to create static routes and/or firewall rules that actually make sense. In my case, I only had to add firewall rules allowing traffic between the VLAN and LAN interfaces.
This change produces a path with no asymmetric routing: each subnet uses pfSense as its gateway, so traffic now flows Client (10.0.0.0/24) -> pfSense -> k3s workers (10.0.10.0/24), and the return traffic follows the same path in reverse.
Thanks again @stephenw10 - your help and explanations pointed me in the right direction and showed me how to set up my homelab network more properly!
-
Nice! Yeah, moving the server to a different VLAN so the route always goes through pfSense in both directions is the 'correct' solution here.