Load balancer behind TNSR, Poor NAT....
Maybe I should just open a ticket on this, but.... here goes:
So, I am running HAProxy on a machine behind TNSR, with a simple static NAT rule passing port 80 through to it. Once I open it up, about 8000 sessions build up over roughly 2 minutes, then it stops making new connections to HAProxy, and I can see this:
TNSR01 tnsr(config)# show packet-counters
Count   Node                      Reason
25      null-node                 blackholed packets
3       nat44-ed-out2in-slowpath  no translation
24516   nat44-ed-out2in-slowpath  maximum sessions exceeded
So... the "maximum sessions exceeded" is the issue here, I believe. I need to push a LOT of traffic through this NAT rule. What's the best way to optimize settings? Surely a measly 8000 NAT sessions doesn't max this beast out, right?
I'm on the default NAT mode: "Endpoint Dependent"
Looking at https://docs.netgate.com/tnsr/en/latest/advanced/dataplane-nat.html#dataplane-nat I'm thinking I need to set:
dataplane nat max-translations-per-thread
To some much larger number. What's the valid range? I don't see that in the docs?
Also, how can I tell how many threads I have?
TNSR01 tnsr(config)# show dataplane cpu threads
ID  Name      Type  PID     LCore  Core  Socket
--  --------  ----  ------  -----  ----  ------
0   vpp_main        102880  1      1     0
That would seem to indicate ONE thread... and this is on an XG-1537... so is something wrong?
So... I did end up opening a ticket.
dataplane nat max-translations-per-thread 1000000
dataplane cpu workers 1
Then restarting the dataplane got things working.
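For completeness, the whole sequence looked roughly like this (a sketch assuming the standard TNSR CLI flow; this is how I recall restarting the dataplane, and the exact restart step may vary by version):

```
tnsr# configure
tnsr(config)# dataplane nat max-translations-per-thread 1000000
tnsr(config)# dataplane cpu workers 1
tnsr(config)# exit
tnsr# service dataplane restart
```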
However, I started thinking (perhaps wrongly) that in order to get the full throughput out of this XG-1537, I really should have more workers; otherwise most of the cores would just be sitting idle.
So, I changed "cpu workers" to 2 and restarted the dataplane, and it would not come up... changing workers back to 1 worked fine...
Next, I figured maybe 2 million translations was somehow "maxing out" the box (it has 16 GB of RAM). OK, I'll reduce to 200k translations-per-thread and spin up 3 workers. That seemed to work fine, as the primary web server on the LAN was processing transactions no problem.
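As a rough sanity check on the memory angle (purely back-of-envelope; the ~1 KB-of-state-per-session figure below is my own assumption, not a documented number):

```shell
# Back-of-envelope NAT session memory estimate.
# ASSUMPTION: ~1 KB of state per NAT session (hypothetical, not from the TNSR docs).
per_thread=200000   # dataplane nat max-translations-per-thread
workers=3           # dataplane cpu workers
total=$((per_thread * workers))
mem_mb=$((total * 1024 / 1024 / 1024))   # total sessions x 1 KB, expressed in MB
echo "${total} sessions -> roughly ${mem_mb} MB of session state"
```

Even at the original 2 million translations on one worker, that estimate stays around 2 GB, which is nowhere near 16 GB, so raw RAM probably wasn't the limiting factor.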
However, the next morning, I found out that this "somehow" broke other simple static NAT rules being used for small services (remote SSH access to a couple hosts, etc.) Moved it back to 1 million max-translations-per-thread and a single worker. Fixed.
Very puzzling... Can I process 10 Gbps worth of traffic through the box, most of it NATed to an internal web server, with only the single worker? If so, that's fine, I guess...
Anyone have any thoughts? Am I doing something "unusual" here? I know networking, but this is my first experience with VPP.
audian:
@dans I know you are working with our TAC support team on this, but I wanted to say thanks for sharing your journey here with the community.
@dans I might be about to have a similar setup as yours. How did it end?
Thanks from a fellow "Dan"