BGP - K3S Kubernetes

penguinpages

Unless there is more to the settings than what is shown in the image above,

I believe that is a route map that allows updates from ALL to ALL. I am not sure what sequence 100 maps to, but it seems to match the directions.

stephenw10

Yes but you have to apply that route map to the neighbour in question:
Screenshot from 2024-01-07 23-29-50.png

https://docs.netgate.com/pfsense/en/latest/packages/frr/bgp/config-neighbor.html#peer-filtering

penguinpages

@stephenw10

I think that was the missing piece. Your image showed a "Route Map Filter" field, but it did not show the context, so I poked around and made the two changes below:

and I also set

Now BGP routes display

root@pandora:~# kubectl get services -A
NAMESPACE              NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
default                kubernetes             ClusterIP      10.43.0.1       <none>          443/TCP                      2d4h
kube-system            kube-dns               ClusterIP      10.43.0.10      <none>          53/UDP,53/TCP,9153/TCP       2d4h
kube-system            metrics-server         ClusterIP      10.43.59.241    <none>          443/TCP                      2d4h
kube-system            hubble-peer            ClusterIP      10.43.133.23    <none>          443/TCP                      2d4h
kube-system            hubble-relay           ClusterIP      10.43.129.105   <none>          80/TCP                       2d4h
kubernetes-dashboard   kubernetes-dashboard   ClusterIP      10.43.197.111   <none>          443/TCP                      2d4h
kube-system            cilium-ingress         LoadBalancer   10.43.152.213   172.16.113.72   80:30723/TCP,443:30333/TCP   2d4h

penguinpages

@penguinpages

Update:

So it seems BGP is advertising, and the updates appear to follow deployments.

I did a test deploy of a WordPress test site, and it seems to have set up LoadBalancer services that propagate through BGP:

default                my-wordpress-mariadb   ClusterIP      10.43.21.226    <none>           3306/TCP                     116s
default                my-wordpress           LoadBalancer   10.43.72.205    172.16.113.176   80:31905/TCP,443:30186/TCP   116s

router frr BGP

BGP table version is 3, local router ID is 172.16.103.1, vrf id 0
Default local pref 100, local AS 65014
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

    Network          Next Hop            Metric LocPrf Weight Path
 *> 10.43.0.0/24     172.16.103.110                         0 65013 i
 *> 172.16.113.72/32 172.16.103.110                         0 65013 i
 *> 172.16.113.176/32
                    172.16.103.110                         0 65013 i

Displayed  3 routes and 3 total paths
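A quick way to cross-check the two outputs above: each LoadBalancer EXTERNAL-IP should appear in the FRR table as a /32 host route. The following is only a minimal Python sketch, with the values copied by hand from the outputs above; the variable names are illustrative and not part of any tool.

```python
import ipaddress

# EXTERNAL-IP values of the LoadBalancer services, copied from the kubectl output.
external_ips = {
    "cilium-ingress": "172.16.113.72",
    "my-wordpress": "172.16.113.176",
}

# Prefixes learned from AS 65013, copied from the FRR BGP table.
bgp_prefixes = ["10.43.0.0/24", "172.16.113.72/32", "172.16.113.176/32"]
learned = {ipaddress.ip_network(p) for p in bgp_prefixes}

# A service counts as advertised when its address shows up as a /32 host route.
advertised = {
    name: ipaddress.ip_network(ip + "/32") in learned
    for name, ip in external_ips.items()
}
missing = [name for name, ok in advertised.items() if not ok]
print(advertised, missing)
```

Note that only the individual /32 addresses are advertised here, not the whole 172.16.113.0/24 range.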

And if I request the page from the host running k3s (172.16.103.110), I get:

admin@pandora:~$ curl -Is http://172.16.113.176
HTTP/1.1 200 OK
Date: Mon, 08 Jan 2024 17:12:29 GMT
Server: Apache
Link: <http://172.16.113.176/wp-json/>; rel="https://api.w.org/"
Content-Type: text/html; charset=UTF-8

host route table

admin@pandora:~$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         rt1.penguinpage 0.0.0.0         UG    427    0        0 br100
10.43.0.0       10.43.0.192     255.255.255.0   UG    0      0        0 cilium_host
10.43.0.192     0.0.0.0         255.255.255.255 UH    0      0        0 cilium_host
link-local      0.0.0.0         255.255.0.0     U     1000   0        0 nm-bond
172.16.100.0    0.0.0.0         255.255.255.0   U     427    0        0 br100
172.16.101.0    0.0.0.0         255.255.255.0   U     428    0        0 br101
172.16.102.0    0.0.0.0         255.255.255.0   U     426    0        0 br102
172.16.103.0    0.0.0.0         255.255.255.0   U     425    0        0 br103
admin@pandora:~$

So the host itself has no route for the 172.16.113.0/24 segment; only the router knows about it.

But from remote hosts on other subnets I get nothing.

Ex: Windows system on 172.16.100.0/24

C:\Users\user1>curl -o curl -Is http://172.16.113.176

C:\Users\user1>

But if I add a static route on the client, things start working:

route -p ADD 172.16.113.0 MASK 255.255.255.0 172.16.103.110 METRIC 1
PS C:\Users\Jerem> curl http://172.16.113.176


StatusCode        : 200
StatusDescription : OK
Content           : <!DOCTYPE html>
                    <html lang="en-US">
                    <head>

So this implies that:

the target k3s node understands routing, and the return route works via BGP
it's not a firewall issue
the pfSense router is not routing to this subnet for hosts on other subnets, e.g. VLAN 100 nodes

The only thing I found odd is the route table under FRR -> Status -> Zebra:

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 108.234.144.1, igc0, 3d01h34m
C>* 10.10.10.1/32 [0/1] is directly connected, lo0, 3d01h34m
B>* 10.43.0.0/24 [20/0] via 172.16.103.110, igc1.103, weight 1, 16:36:31
C>* 108.234.144.0/22 [0/1] is directly connected, igc0, 3d01h34m
C>* 172.16.100.0/24 [0/1] is directly connected, igc1.100, 17:01:52
C>* 172.16.101.0/24 [0/1] is directly connected, igc1.101, 17:01:52
C>* 172.16.102.0/24 [0/1] is directly connected, igc1.102, 17:01:52
S   172.16.103.0/24 [1/0] via 172.16.103.110 inactive, weight 1, 17:01:52
C>* 172.16.103.0/24 [0/1] is directly connected, igc1.103, 17:01:52
C>* 172.16.104.0/24 [0/1] is directly connected, ovpns1, 3d01h34m
C>* 172.16.110.0/24 [0/1] is directly connected, igc1.110, 17:01:52
C>* 172.16.111.0/24 [0/1] is directly connected, igc1.111, 17:01:52
C>* 172.16.112.0/24 [0/1] is directly connected, igc1.112, 17:01:52
B>* 172.16.113.72/32 [20/0] via 172.16.103.110, igc1.103, weight 1, 16:36:31
B>* 172.16.113.176/32 [20/0] via 172.16.103.110, igc1.103, weight 1, 01:39:17
C>* 172.16.120.0/24 [0/1] is directly connected, igc1.120, 17:01:52
C>* 172.16.121.0/24 [0/1] is directly connected, igc1.121, 17:01:52
C>* 172.16.122.0/24 [0/1] is directly connected, igc1.122, 17:01:52
C>* 172.16.130.0/24 [0/1] is directly connected, igc1.130, 17:01:52
C>* 172.16.131.0/24 [0/1] is directly connected, igc1.131, 17:01:52
C>* 172.16.132.0/24 [0/1] is directly connected, igc1.132, 17:01:52

My guess is that there is yet another configuration setting I need so that traffic from the internal networks is forwarded properly along the BGP-learned routes.
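To read the Zebra table the way the router does, a longest-prefix-match lookup can be sketched in Python. This is only an illustration under stated assumptions: the route list is a hand-copied subset of the output above, `lookup` is a hypothetical helper, and 172.16.113.5 is a made-up address used to show what happens to an address that was not advertised.

```python
import ipaddress

# (prefix, next hop or "connected", interface), copied from the Zebra output.
zebra_routes = [
    ("0.0.0.0/0", "108.234.144.1", "igc0"),
    ("10.43.0.0/24", "172.16.103.110", "igc1.103"),
    ("172.16.100.0/24", "connected", "igc1.100"),
    ("172.16.103.0/24", "connected", "igc1.103"),
    ("172.16.113.72/32", "172.16.103.110", "igc1.103"),
    ("172.16.113.176/32", "172.16.103.110", "igc1.103"),
]

def lookup(dest, routes):
    """Return the most specific route that contains dest."""
    addr = ipaddress.ip_address(dest)
    matches = [r for r in routes if addr in ipaddress.ip_network(r[0])]
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)

service_route = lookup("172.16.113.176", zebra_routes)
other_route = lookup("172.16.113.5", zebra_routes)  # hypothetical address
print(service_route)
print(other_route)
```

According to this table, pfSense does hold a forwarding entry for 172.16.113.176 via 172.16.103.110 on igc1.103, while any other address in 172.16.113.0/24 falls through to the default route.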

stephenw10

So you're adding that static route to the external system directly?

Sounds like the client is just not using pfSense as its default route. Or whatever it is using as its default route possibly cannot hairpin the traffic back to pfSense.

Or it's creating an asymmetric route; check the firewall logs.

penguinpages

@stephenw10

Yes..

Yes, the Windows laptop can reach the internet and all other subnets (e.g. I can SSH into the k3s node and run the tests above).

 route -p ADD 172.16.113.0 MASK 255.255.255.0 172.16.103.110 METRIC 1

As a baseline I have another VM on the same VLAN as the laptop, and it cannot reach the subnet (I did not run the route add command above on it).

So to me this points to a misconfiguration within pfSense: when my hosts (Windows) try to reach a resource on 172.16.113.0 (per the curl test), the traffic goes to my default GW 172.16.100.1/24, which should then forward it via 172.16.103.110 to reach Cilium at 172.16.113.176.

If it were a hairpin issue, I think it would also fail from nodes within 172.16.103.0/24, but they (as noted in the test) can reach 172.16.113.176.

stephenw10

Yes I would expect it to work if that traffic is going via pfSense.

Try a traceroute.

Check the states when you try to ping.

What's actually happening when it doesn't have that static route?

penguinpages

@stephenw10

Traceroute would require ICMP mapping and services.

I ran it anyway as a baseline from the Linux host directly (which can reach the website without the route add): no response. The same from the Windows host with the route add: also no response.

That is why I am trying to get debug output from pfSense showing why packets work from 172.16.103.110, and from 172.16.100.32 with the route add,

but fail from 172.16.100.22.

packetcapture-igc1.103-20240108164045.pcap

I also tried adding the networks for distribution, but there was no change.

It just seems like the pfSense router is not routing packets to a known subnet, but only when they come from specific networks.

stephenw10

Adding a static route to the windows client changes nothing in how pfSense routes that traffic. It has to be in the client itself.

penguinpages

@stephenw10

Not sure what you mean.

If my systems (windows / linux) are on 172.16.100.0/24 and know nothing of how to route to 172.16.113.0/24, they just use Default GW 172.16.100.1 which is pfsense.

The router (pfSense) then consults its table and, based on BGP, knows that the path to 172.16.113.0/24 is via 172.16.103.110.

PS C:\Users\user> route -p delete 172.16.113.0 MASK 255.255.255.0 172.16.103.110 METRIC 1
 OK!
PS C:\Users\user> route -p add 172.16.113.0 MASK 255.255.255.0 172.16.100.1 METRIC 1
 OK!
PS C:\Users\user> curl 172.16.113.176  # --> timeout

PS C:\Users\user> route -p delete 172.16.113.0 MASK 255.255.255.0 172.16.100.1 METRIC 1
 OK!
PS C:\Users\user> route -p add 172.16.113.0 MASK 255.255.255.0 172.16.103.110 METRIC 1
 OK!
PS C:\Users\user> curl 172.16.113.176


StatusCode        : 200
StatusDescription : OK
Content           : <!DOCTYPE html>
                    <html lang="en-US">
                    <head>

stephenw10

OK how does the client know how to reach 172.16.103.110? That must also be via pfSense at 172.16.100.1 right?

penguinpages

@stephenw10

Yes

Route table from pfsense:

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 108.234.144.1, igc0, 4d02h15m
C>* 10.10.10.1/32 [0/1] is directly connected, lo0, 4d02h15m
B>* 10.43.0.0/24 [20/0] via 172.16.103.110, igc1.103, weight 1, 1d17h17m
C>* 108.234.144.0/22 [0/1] is directly connected, igc0, 4d02h15m
C>* 172.16.100.0/24 [0/1] is directly connected, igc1.100, 1d17h42m
C>* 172.16.101.0/24 [0/1] is directly connected, igc1.101, 1d17h42m
C>* 172.16.102.0/24 [0/1] is directly connected, igc1.102, 1d17h42m
S   172.16.103.0/24 [1/0] via 172.16.103.110 inactive, weight 1, 1d17h42m
C>* 172.16.103.0/24 [0/1] is directly connected, igc1.103, 1d17h42m
C>* 172.16.104.0/24 [0/1] is directly connected, ovpns1, 4d02h15m
C>* 172.16.110.0/24 [0/1] is directly connected, igc1.110, 1d17h42m
C>* 172.16.111.0/24 [0/1] is directly connected, igc1.111, 1d17h42m
C>* 172.16.112.0/24 [0/1] is directly connected, igc1.112, 1d17h42m
B>* 172.16.113.72/32 [20/0] via 172.16.103.110, igc1.103, weight 1, 1d17h17m
B>* 172.16.113.176/32 [20/0] via 172.16.103.110, igc1.103, weight 1, 1d02h20m
C>* 172.16.120.0/24 [0/1] is directly connected, igc1.120, 1d17h42m
C>* 172.16.121.0/24 [0/1] is directly connected, igc1.121, 1d17h42m
C>* 172.16.122.0/24 [0/1] is directly connected, igc1.122, 1d17h42m
C>* 172.16.130.0/24 [0/1] is directly connected, igc1.130, 1d17h42m
C>* 172.16.131.0/24 [0/1] is directly connected, igc1.131, 1d17h42m
C>* 172.16.132.0/24 [0/1] is directly connected, igc1.132, 1d17h42m

stephenw10

So what did the states show when you tried to open it without the static route on the client?

Looking at the pfSense routing table I wonder if the inactive more specific route to 172.16.103.0/24 is causing a problem.

penguinpages

@stephenw10

Could be...

But.......

Why, when I add the route on the client, does it start working?
Why, from the hosting system, can I reach the site without adding a route?

stephenw10

I'd still be checking the states and/or running a ping and pcaps to see where that is actually being sent.

penguinpages

@stephenw10

Just to close this out and also post what I learned.

Root cause: a server with multiple interfaces, where BGP and Cilium are bound to an interface other than the default one, will not work.

Ex:

The idea was to have VLAN 103 for all containers, used by various K8s clusters. But Cilium returns traffic based on the underlying Linux routing, which follows the default gateway through 172.16.100.x, and that confuses hosts waiting for packets to return from pfSense at 172.16.103.1.

Working design:

Change:

remove all L2/L3 subnet configuration for 172.16.103.0
set up the CNI (Cilium) so that its IP pool is now 172.16.103.0/24
redirect all BGP through the host interface bound to the default gateway (172.16.100.110), with the BGP neighbor definition pointing to 172.16.100.1 (pfSense)
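The two properties this change relies on can be written down as a small check. This is a rough Python sketch, not anything Cilium or pfSense provides: the list of connected subnets left on the host after VLAN 103 was removed is an assumption based on the earlier `route` output, and all names are illustrative.

```python
import ipaddress

# Host subnets assumed to remain after the 172.16.103.0 configuration was removed.
connected = {
    "br100": ipaddress.ip_network("172.16.100.0/24"),
    "br101": ipaddress.ip_network("172.16.101.0/24"),
    "br102": ipaddress.ip_network("172.16.102.0/24"),
}
default_gw_iface = "br100"                          # interface holding the default route
bgp_peer = ipaddress.ip_address("172.16.100.1")     # pfSense
lb_pool = ipaddress.ip_network("172.16.103.0/24")   # Cilium LoadBalancer IP pool

# 1. The BGP peer sits on the same subnet as the default-gateway interface.
peer_on_default_iface = bgp_peer in connected[default_gw_iface]
# 2. The LoadBalancer pool no longer overlaps any subnet configured on the host.
pool_overlaps_l2 = any(lb_pool.overlaps(net) for net in connected.values())
print(peer_on_default_iface, pool_overlaps_l2)
```

With both conditions met, requests and replies for the pool addresses use the same interface on the host.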

Now BGP does not take weird packet paths etc.

How I found the root cause: watch packet sessions on the host:
tcpdump -i br103 -s 0 'tcp port http'

then

tcpdump -i br100 -s 0 'tcp port http'

Then from laptop
curl http://172.16.113.176

What I saw was packets arriving (10x, due to the failed return) on both interfaces, which means the return went out a different interface.

Thanks to those who helped by responding and posting. Hope this helps others not shave the same yak.
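The asymmetric return can be reproduced on paper from the host route table posted earlier. The sketch below is a simplified Python model of a plain destination lookup, not of Cilium's datapath; `egress_interface` is a hypothetical helper, the route list is a hand-copied subset, and 172.16.103.50 is a made-up client address on VLAN 103.

```python
import ipaddress

# (prefix, interface) pairs copied from the `route` output on pandora.
HOST_ROUTES = [
    ("0.0.0.0/0", "br100"),
    ("10.43.0.0/24", "cilium_host"),
    ("172.16.100.0/24", "br100"),
    ("172.16.101.0/24", "br101"),
    ("172.16.102.0/24", "br102"),
    ("172.16.103.0/24", "br103"),
]

def egress_interface(dest):
    """Interface a longest-prefix-match lookup would pick for dest."""
    addr = ipaddress.ip_address(dest)
    candidates = [
        (ipaddress.ip_network(prefix), iface)
        for prefix, iface in HOST_ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    return max(candidates, key=lambda c: c[0].prefixlen)[1]

# pfSense forwards the request to next hop 172.16.103.110, so it arrives on br103.
request_in = "br103"
# The reply is addressed to the client that failed in the thread.
reply_out = egress_interface("172.16.100.22")
asymmetric = request_in != reply_out
print(reply_out, asymmetric)
```

A client inside 172.16.103.0/24 would get its reply on br103 in this model, which is consistent with the tests from that subnet succeeding.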

stephenw10

Nice catch.

vacquah

@penguinpages can you share your Cilium BGP peering policy? I'd like to see what it looks like. I am having the same problem. I use almost the same equipment you have: a pfSense doing BGP, a Brocade ICX7250 doing layer 3 routing (all my VLANs are set up there) and a server with 2 NICs. I can't connect to a test nginx demo I have set up on my Kubernetes cluster with Cilium.

penguinpages

@vacquah said in BGP - K3S Kubernetes:

cilium bgp peering policy?

K3S Deployment


Install Cilium and K3S
https://docs.k3s.io/cli/server

CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--flannel-backend=none --disable-network-policy --disable=servicelb --disable=traefik --tls-san=172.16.100.110 --disable-kube-proxy --node-label bgp-policy=pandora' sh -
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc
sudo -E cilium install --version 1.14.5 --set ipam.operator.clusterPoolIPv4PodCIDRList=10.43.0.0/16 --set bgpControlPlane.enabled=true --set k8sServiceHost=172.16.100.110 --set k8sServicePort=6443 --set kubeProxyReplacement=true --set ingressController.enabled=true --set ingressController.loadbalancerMode=dedicated

vi /etc/rancher/k3s/k3s.yaml

replace 127.0.0.1 with host ip 172.16.100.110

sudo -E cilium status --wait
sudo cilium hubble enable # need to run as root.. sudo profile issue
sudo -E cilium connectivity test
sudo -E kubectl get svc --all-namespaces
kubectl get services -A
sudo cilium hubble enable



Then apply policy

sudo su - admin
cd /media/md0/containers/
vi cilium_policy.yaml
######################

apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: rt1
spec:
  nodeSelector:
    matchLabels:
      bgp-policy: pandora
  virtualRouters:
  - localASN: 65013
    exportPodCIDR: true
    neighbors:
    - peerAddress: 172.16.100.1/24
      peerASN: 65014
      eBGPMultihopTTL: 10
      connectRetryTimeSeconds: 120
      holdTimeSeconds: 90
      keepAliveTimeSeconds: 30
      gracefulRestart:
        enabled: true
        restartTimeSeconds: 120
    serviceSelector:
      matchExpressions:
      - {key: somekey, operator: NotIn, values: ['never-used-value']}
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "pandorac"
spec:
  cidrs:
  - cidr: "172.16.103.0/24"
##########
root@pandora:/media/md0/containers# kubectl apply -f cilium_policy.yaml
root@pandora:/media/md0/containers# kubectl get ippools -A
NAME       DISABLED   CONFLICTING   IPS AVAILABLE   AGE
pandorac   false      False         253             4s

root@pandora:~# kubectl get svc -A
NAMESPACE     NAME                             TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)                      AGE
default       kubernetes                       ClusterIP      10.43.0.1      <none>           443/TCP                      6m
kube-system   kube-dns                         ClusterIP      10.43.0.10     <none>           53/UDP,53/TCP,9153/TCP       5m54s
kube-system   metrics-server                   ClusterIP      10.43.76.245   <none>           443/TCP                      5m53s
kube-system   hubble-peer                      ClusterIP      10.43.10.37    <none>           443/TCP                      5m50s
kube-system   hubble-relay                     ClusterIP      10.43.124.90   <none>           80/TCP                       2m53s
cilium-test   echo-same-node                   NodePort       10.43.151.50   <none>           8080:30525/TCP               2m3s
cilium-test   cilium-ingress-ingress-service   NodePort       10.43.4.143    <none>           80:31000/TCP,443:31001/TCP   2m3s
kube-system   cilium-ingress                   LoadBalancer   10.43.36.151   172.16.103.248   80:32261/TCP,443:32232/TCP   5m50s



Check the router BGP status: the state should be Established
Ex: in web ui of rt1 (pfsense router -> status -> frr -> BGP -> Neighbor)

BGP neighbor is 172.16.100.110, remote AS 65013, local AS 65014, external link
Local Role: undefined
Remote Role: undefined
Description: pandorac Container interface Neighbor
Hostname: pandora
BGP version 4, remote router ID 172.16.100.110, local router ID 172.16.100.1
BGP state = Established, up for 00:01:12
Last read 00:00:12, Last write 00:00:12
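In BGP terms a working session reports the state Established; Active is a different state in which the router is still trying to connect. If the neighbor output is saved as text, the state line can be picked out with a short Python sketch like the one below. The sample text is trimmed from the output above and `bgp_state` is a hypothetical helper, not an FRR or pfSense function.

```python
import re

# Trimmed copy of the neighbor status shown above.
neighbor_output = """\
BGP neighbor is 172.16.100.110, remote AS 65013, local AS 65014, external link
BGP version 4, remote router ID 172.16.100.110, local router ID 172.16.100.1
BGP state = Established, up for 00:01:12
"""

def bgp_state(text):
    """Return the word after 'BGP state =' or None if the line is absent."""
    match = re.search(r"BGP state = (\w+)", text)
    return match.group(1) if match else None

state = bgp_state(neighbor_output)
session_up = state == "Established"
print(state, session_up)
```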


Optional: test that external routing to the example test website works, e.g. from a Windows host on 172.16.100.0/24:

PS C:\Users\Jerem> curl http://172.16.103.248

StatusCode : 200
StatusDescription : OK
Content : <!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name='robots' content='max-image-preview:large' />
<t...
RawContent : HTTP/1.1 200 OK

vacquah

@penguinpages Thanks for sharing. I am getting confused / lost with all the IPs

Is 172.16.100.1/24 your pfsense router ip? Is 172.16.100.110 a specific kubernetes controlplane or worker node? I am having a hard time getting the big picture.
