ESXi 5.5 static LAG on vSwitch / VLANs handled inside VM



  • hi

    - I've set up NIC teaming on a standard vSwitch (3 NICs): https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004048
    - configured a static LAG on the switch
    - the pfSense VM uses vmxnet3 NICs. The LAN vmxnet3 NIC handles all the VLANs (configured inside the VM)

    Everything seems to work, but unfortunately I can't seem to get more than 1 Gbit through the router, so I'm guessing there is something wrong with the LAG (it's stuck at exactly 1 Gbit/s).

    Anyone have a clue how I could start debugging?



  • nobody?


  • Rebel Alliance Global Moderator

    How are you testing?  A LAG does not multiply your bandwidth; it just gives you multiple paths at whatever your wire speed is. 1 + 1 does not equal 2, nor does 1 + 1 + 1 = 3 when you set up EtherChannels. You just get multiple 1 gig paths, and how the switches decide to load balance across them can still cause congestion on a single line, etc.

    It's not like frame 1 goes over path 1 and frame 2 goes over path 2, nor does it say "hey, path 1 is full, let's now use path 2," etc.

    Here is a link to Cisco's load-balancing options:
    http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst4500/12-2/54sg/configuration/guide/config/channel.html#wp1020804

    So while you might have a 3 gig uplink now, depending on what traffic you're sending over it, more than likely all that traffic is going over the same path. Now if you had another machine talking to a different machine over the same uplink, they might use path 2 or 3 in your case, but they might also just use path 1, depending on how the load-balancing hash worked out which path to send that traffic down. And you could still run into a congestion issue on path 1, etc.

    If you want to validate that you can fill your now-fatter uplink pipe, you need to make sure, for whatever load-balancing method you're using, that your multiple streams of data all take different paths and that you end up with a combined throughput equal to the total of your EtherChannel members.

    To be honest, if you want more bandwidth per VLAN, you would most likely be better off breaking out your VLANs onto their own paths, so that VLAN 10 uses path 1 and VLAN 20 uses a different path. Then each VLAN has 1 gig to itself, versus thinking you can just use an EtherChannel and get 2 gig, etc. Or, if you want more than 1 gig, use a 10GbE uplink.
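    To make the "same hash, same path" point concrete, here's a minimal sketch of an src-dst-ip style hash. This is not ESXi's or any switch vendor's actual formula (those vary by implementation); it's just a toy XOR-then-modulo function illustrating the behavior: a given src/dst pair always maps to the same member link, so a single flow can never exceed one link's speed, and different flows can still collide on one member.

```python
# Toy src-dst-ip hash, loosely modeled on the XOR-then-modulo schemes
# that switches and ESXi's "Route based on IP hash" policy are commonly
# described as using. NOT any vendor's exact formula -- illustration only.
import ipaddress

def uplink_for_flow(src: str, dst: str, num_uplinks: int) -> int:
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % num_uplinks  # same pair -> same uplink, every time

# A single flow is pinned to exactly one 1 Gbit/s member link:
print(uplink_for_flow("10.0.10.5", "10.0.20.7", 3))

# Different pairs *may* spread across members -- or collide on one:
for dst in ("10.0.20.7", "10.0.20.8", "10.0.20.9"):
    print(dst, "->", uplink_for_flow("10.0.10.5", dst, 3))
```

    The addresses here are made up for illustration; the point is only that the output is a deterministic function of the pair, never of load.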



  • This is a simplified/stripped-down version of the scenario. In reality it also involves multi-WAN and more VLANs, but none of that matters here.

    - It's not a Cisco switch, but I've set up the static LAG on the switch end with src-dst-ip hashing, and the ESXi vSwitch with IP hash, as described in various VMware support pages.

    • Testing inside the host:      VM1 <--> vswitch_vlan10 <--> pfSense NAT <--> vswitch_vlan20 <--> VM2      ==>  maxes out at around 2.5 Gbit/s using pfSense 2.2.6 (that's why I opted for a 3-NIC LAG).

    Breaking out the VLANs, each onto its own path, would involve moving the VLANs outside the pfSense VM. That would mean migrating the pfSense VLAN interfaces to actual interfaces, and this is a production system with no way to power it down during working hours.

    A 10GbE uplink would involve spending >$1000 on new switches & NICs. While it is obviously the easiest/best way, it's not something I see happening any time soon.

    I'm fairly certain there is a configuration issue on either ESXi or the switch ... or it's impossible to do a LAG on the vSwitch while that same vSwitch doesn't handle the VLANs.


  • Rebel Alliance Global Moderator

    Dude, it does not matter what switch it is; they all pretty much work the same. Sorry, but there is no load-balancing setting that turns a lagg/EtherChannel into 1 + 1 = 2; it is always just going to be 1 + 1.

    Now, depending on how your hash works out, if host 1 takes path A across the lagg when talking to host 2, host 3 takes path B when talking to host 4, and you don't run into any other bottlenecks, then sure, they should each be able to talk at 1 gig, or somewhere in the high 800s to low 900s, just like any other physical network connection.

    But all the load-balancing hashes do is some math to figure out which path to send that traffic down. It's quite possible your hash sent the traffic for both the 1-to-2 conversation and the 3-to-4 conversation down the same path.

    That kind of setup works for lots of hosts talking to lots of hosts; yes, you should see more than your 1 gig when under load. But to prove it out to yourself, you're going to have to validate which path the switch is using for the host 1 to 2 traffic, and which path it's using for host 3 to 4. Run the tests at the same time and add them up.
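    On the "quite possible both conversations took the same path" point: with 3 members and a reasonably uniform hash, two independent flows land on the same member roughly one time in three. A quick simulation with a toy hash (again, not any vendor's real algorithm) illustrates why two simultaneous iperf runs can still add up to only 1 gig:

```python
# How often do two independent random flows hash onto the same member
# of a 3-link LAG? (Toy XOR-then-modulo hash, illustration only.)
import random

def uplink(src: int, dst: int, links: int = 3) -> int:
    return (src ^ dst) % links

random.seed(1)  # reproducible runs
trials = 100_000
same = sum(
    uplink(random.getrandbits(32), random.getrandbits(32))
    == uplink(random.getrandbits(32), random.getrandbits(32))
    for _ in range(trials)
)
print(f"same-member rate: {same / trials:.1%}")  # roughly 1 in 3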



  • Run the tests at the same time and add them up..

    Well, that's what I did, and I ended up at exactly 1 Gbit; I ran this multiple times ;)

    But all the load-balancing hashes do is some math to figure out which path to send that traffic down. It's quite possible your hash sent the traffic for both the 1-to-2 conversation and the 3-to-4 conversation down the same path.

    I guess that's possible, but I've got no clue how to verify which path a stream of data is taking across the LAG.


  • Rebel Alliance Global Moderator

    You would have to verify that on the switch. I have not used HP in many, many years, but I assume there is a way to see which path in a lagg/EtherChannel a specific connection is using. On Cisco it's something along the lines of:

    test etherchannel load-balance interface port-channel 1 ip 10.10.10.2 10.10.10.1
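
    If the switch offers nothing like that Cisco test command, a rough offline predictor can at least tell you whether two test pairs *should* hash to different members. This assumes a plain XOR-then-modulo src-dst-ip hash; vendor formulas differ, so treat the result as a guess to compare against the switch's actual per-port counters:

```python
# Rough offline guess at which LAG member a src/dst IP pair maps to,
# assuming an XOR-then-modulo src-dst-ip hash (real vendor formulas differ).
import ipaddress

def predicted_member(src: str, dst: str, members: int) -> int:
    h = int(ipaddress.ip_address(src)) ^ int(ipaddress.ip_address(dst))
    return h % members

# Same address pair the Cisco test command above uses, on a 3-member LAG:
print(predicted_member("10.10.10.2", "10.10.10.1", 3))
```

    If the per-member traffic counters on the switch disagree with the prediction while a test flow runs, the switch is hashing on something else (src-dst-mac, src-port, etc.), which would be the first thing to check against the vSwitch's IP-hash policy.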