Supermicro A1SRM-2758F board seems to not multi-thread

bigjerms

I've run iperf tests through an A1SRM-2758F board and it acts like only one core is being used.

I can set the board to 2, 4, or 8 cores, and throughput with 64-byte packets is the same regardless of the number of cores (about 300 Mbps). I get better performance with larger packets (64 KB): 940 Mbps. I don't know if it's a hashing algorithm limitation or something else. Does anyone know of any tuning options for this board to get better performance or higher pps throughput?

I have 8 GB of RAM and am using the onboard Intel I354 NICs.

Any advice or pointers would be appreciated.

DeLorean

Do you have hardware checksum offload enabled?

Grtz
DeLorean

bigjerms

Yes I have hardware checksum offload enabled.

Could the problem be in the way I'm testing? I'm using iperf with multiple parallel streams. Would these all hash to the same CPU? I'm not familiar with how the load is split across the CPUs.

dopey

Is this over the WAN connection using PPPoE?

If so it's potentially a limitation of the igb driver and PPPoE. See:
https://redmine.pfsense.org/issues/4821

bigjerms

It's just a static IP on the WAN side. We are testing directly between two laptops with the firewall in between: static IP on the outside, with no PPPoE enabled.

dopey

I missed that you were calculating throughput based on 64-byte packets.

See:
https://blog.pfsense.org/?p=1866

Using a bit more pedestrian hardware, such as the C2758 that is for sale on the pfSense store, we find that we can forward at a rate of around 270 Kpps, and with fast forwarding or tryforward, we can obtain 426 Kpps. A simple SG-2220 will support 123 Kpps until we enable fastforward or tryforward, when we can obtain 217 Kpps.

Your 300 Mbps is not too far off the 426 Kpps value, based on some calculations.
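For anyone who wants to redo that comparison, here is a rough sketch of the conversion. It assumes the 300 Mbps figure counts exactly 64 bytes per packet; iperf reports payload throughput, so the real on-wire size per packet is larger and the true pps may differ. The function name is just for illustration.

```python
def packets_per_second(throughput_bps: float, packet_bytes: int) -> float:
    """Convert a throughput figure into a packet rate for a fixed packet size."""
    return throughput_bps / (packet_bytes * 8)


# Assumption: 300 Mbps measured with 64 bytes counted per packet.
measured_pps = packets_per_second(300e6, 64)
print(f"{measured_pps / 1e3:.0f} kpps")  # prints "586 kpps"
```

Under that assumption the result is in the same few-hundred-kpps range as the 426 Kpps quoted from the blog post, not an order of magnitude away from it.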

bigjerms

I read the post you are quoting. For some reason I thought those were single-threaded tests and that the numbers would be per CPU. I'm probably mistaken.

One of the reasons I thought this was per processor is when I disable cores on my motherboard I get the same results. So using 8, 4 or 2 cores still gets the exact same results.

dopey

I guess that's a good point. I've actually wondered that myself. I have the mini-ITX version of the same board and always wondered just what the additional CPUs really buy in terms of pfSense functionality.

Unfortunately, I'm bitten by the PPPoE queue issue, so I'm limited to only ~700 Mbit/s downloads over my gigabit fiber line and never really bothered to dig much further than that.

Pippin

What happens when you run iperf directly between the two laptops, so without pfSense in between?
Try both directions with same values/parameters for iperf.

laptop1-iperfserver --> laptop2-iperfclient
laptop2-iperfserver --> laptop1-iperfclient

bigjerms

I'll have to check that. The other laptop is another user who isn't in the office today or tomorrow. I'll test that and get back to you. Kind of silly we didn't test that before.

I'll respond with results as soon as I can.

DeLorean

@bigjerms:

Yes I have hardware checksum offload enabled.

Do you mean that this option checkbox in pfSense is marked or not?
Marked -> hardware checksum offload is disabled
Unmarked (default) -> hardware checksum offload is enabled

Grtz
DeLorean

bigjerms

I have it marked, so it's disabled.

I was able to test the two test systems back to back and found that the results were the same, so it's not the firewall that is the limiting factor. Now I have to figure out why two Macs back to back have low throughput with the 64-byte packets.

It looks like the problem may be that iperf 2 and 3 are not multi-threaded, so only one CPU is being used on the test boxes. This would explain why a high rate of small packets reduces the throughput.

Pippin

Maybe try the -Z argument?

-Z, --zerocopy: use a 'zero copy' sendfile() method of sending data. This uses much less CPU.

cmb

Lots of small packets are just difficult to process in general. If you want to fill a 1 Gb pipe with 64-byte frames, you likely need something like netmap to do so; no tool like iperf is going to achieve that rate.
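To put a number on "fill a 1 Gb pipe with 64-byte frames", here is a small sketch of the standard Ethernet line-rate arithmetic. It uses the usual per-frame overhead of 8 bytes of preamble/start delimiter plus a 12-byte inter-frame gap; the function and constant names are only illustrative.

```python
PREAMBLE_AND_SFD_BYTES = 8   # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12    # minimum idle time between frames, in byte times


def max_frames_per_second(link_bps: float, frame_bytes: int) -> float:
    """Theoretical maximum frame rate on an Ethernet link for one frame size."""
    wire_bytes = frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES
    return link_bps / (wire_bytes * 8)


print(round(max_frames_per_second(1e9, 64)))    # 1488095 (about 1.49 Mpps)
print(round(max_frames_per_second(1e9, 1518)))  # 81274
```

At 64-byte frames that leaves roughly 672 ns per frame for whatever is generating or forwarding the traffic, which is why small-packet tests stress the sender as much as the device under test.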

With a single stream between a given source and destination, you're not likely going to utilize all the queues on the NIC. Assuming you didn't force it to a single queue (which would be bad), that's likely why you're not getting >1 core utilized.
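A toy sketch of why that happens, under loud assumptions: real NICs pick the receive queue in hardware from a hash of the packet's addresses and ports, whereas this uses CRC32 purely as a stand-in hash. The queue count, addresses, ports, and function name are all hypothetical.

```python
import zlib
from collections import Counter

NUM_QUEUES = 8  # hypothetical: one receive queue per core


def queue_for_flow(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                   num_queues: int = NUM_QUEUES) -> int:
    """Map a flow to a queue index. CRC32 is a stand-in, not the NIC's real hash."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues


# One stream: every packet carries the same addresses and ports,
# so every packet lands on the same queue (and the same core).
single = Counter(queue_for_flow("10.0.0.2", "10.0.1.2", 50000, 5201)
                 for _ in range(1000))
print(len(single))  # 1

# Parallel streams differ in source port, so they can land on different queues.
parallel = Counter(queue_for_flow("10.0.0.2", "10.0.1.2", 50000 + i, 5201)
                   for i in range(8))
print(sorted(parallel))
```

The point is only that the queue is chosen per flow, not per packet: adding cores cannot help a single flow, and parallel streams spread out only as far as their hashes happen to differ.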