10Gb/s Forwarding



  • Hi Y'all,

    So, I've read through the Hardware Sizing Guide and read as far as I am able through the forums, but a few items are blurry, so I was hoping to get some clarification…

    We have a fairly beefy server, and we are going to get the best 10G NIC we can for it (with 2 ports on it).  The server has 24GB RAM and 16 cores, each at 2.4GHz I think (maybe 2.6GHz).  It is a pretty new Sun (Oracle) server.

    My question is: if we push as much traffic as we can through this sucker (via pfSense), how fast can it inspect and forward packets?  The Hardware Sizing Guide says:

    "501+ Mbps - server class hardware with PCI-X or PCI-e network adapters. No less than 3.0 GHz CPU"

    OK...  The 10G NIC will definitely be on a PCI-E bus that can handle 8Gb/s.  So right there I am limited to 8Gb/s on the PCI-E bus.  But my "CPU" is only 2.4GHz; does that mean that I won't even be able to push 500Mb/s through the server?  Is pfSense multithreaded, such that all 16 of my CPU cores can break up the load, thus achieving higher throughput, or rather, able to handle more packets per second?  That bit wasn't clear to me.  If pfSense only uses 1 CPU core, then that will greatly change what hardware we put this on.  My hope is to be able to push 5Gb/s+ through this server, with pfSense as the firewall solution.

    Also, just to explain our load a bit more.  Our multi-gigabit traffic will be coming from the Aspera software package, a file transfer protocol capable of 10+Gb/s transfer rates over high-latency links, so that you can, for example, download files in San Francisco at 10Gb/s from Washington DC.  It does this by opening many "sessions" in parallel and spraying UDP packets like a firehose from one server to the other; the packets are reassembled on the downloading side.  So, the pfSense box would mostly be processing millions and millions of these UDP packets.  It will also handle other stuff like SSH, HTTP, DNS, etc, but obviously at a much lower rate.

    Any insight much appreciated!!

    Thanks,
    erich



  • Throughput is highly dependent on frame size: the bigger the frames you can use, the higher the throughput you are likely to obtain.  Can you effectively use jumbo frames?
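    The reason frame size matters so much is that the per-packet work (interrupt handling, filtering) is roughly constant, so the packet rate the box must sustain drops in proportion to frame size. A quick illustrative calculation (the figures are for the frame payload only and ignore Ethernet preamble/inter-frame gap overhead):

    ```python
    # Packets-per-second needed to fill a given line rate at a given frame
    # size. Illustrative only: ignores preamble/IFG and header overhead.
    def pps(line_rate_bps: float, frame_bytes: int) -> float:
        """Packets per second needed to saturate line_rate_bps with frame_bytes frames."""
        return line_rate_bps / (frame_bytes * 8)

    for size in (1500, 9000):  # standard vs. jumbo frames
        print(f"{size} B frames at 10 Gb/s: {pps(10e9, size):,.0f} pps")
    ```

    Jumbo frames cut the required packet rate by roughly 6x, which is why they make such a difference at 10Gb/s.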

    It's an interesting technical challenge to turn "single CPU" code into "multi CPU" code that scales well across multiple CPUs.  FreeBSD has done a lot of work on turning many parts of the kernel into code that can safely run on multiple CPUs concurrently.  It's more than a year since I looked into this, but I suspect FreeBSD 8.0 (the base for pfSense 2.0 BETA) would be at least a bit better than FreeBSD 7.2 (the base for pfSense 1.2.3).  If I recall correctly, some higher speed NICs can put incoming frames into different queues and generate separate interrupts for those queues, with the interrupts potentially going to different CPUs, thus giving the potential for handling multiple frames in parallel right from the interrupt handlers.  I expect you will get more help on these matters in a FreeBSD forum than a pfSense forum.  I would certainly be interested in hearing how you get on with this.



  • Yes, we plan to try to get jumbo frames enabled as much as we can, not sure it is possible from all angles, but we are certainly going to try.  We are planning on getting one of these Myrinet 10G NICs.  Benchmarking for FreeBSD 7.2-RELEASE is shown at the bottom of this web page:

    http://www.myri.com/scs/performance/Myri10GE/

    It seems that even with 1500 byte frames, they get ~10% CPU utilization with 8 3GHz Nehalem cores while shoving 9.2Gb/s.  I don't know if that means we can get 7-8Gb/s with our 16 2.4GHz Nehalem cores or what, but even that would be OK.  I just wonder how much extra overhead the CPUs have to endure when actually "inspecting or filtering" each packet, rather than simply forwarding them.
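    A very naive extrapolation from those benchmark numbers (all inputs are taken from the Myri page as quoted above, not measured on this box, and this ignores both the 10Gb/s NIC ceiling and any serialization in the filtering path):

    ```python
    # Back-of-envelope scaling from the quoted Myri10GE FreeBSD benchmark.
    # Assumes throughput scales linearly with aggregate CPU-GHz, which real
    # forwarding (locks, serialized pf code paths) will not achieve.
    bench_cores, bench_ghz, bench_util, bench_gbps = 8, 3.0, 0.10, 9.2

    # Aggregate CPU-GHz actually consumed during the benchmark:
    ghz_used = bench_cores * bench_ghz * bench_util   # 2.4 GHz-equivalent
    gbps_per_ghz = bench_gbps / ghz_used

    # Hypothetical budget for 16 cores at 2.4 GHz:
    our_budget = 16 * 2.4
    print(f"~{gbps_per_ghz:.1f} Gb/s per CPU-GHz; naive ceiling "
          f"~{gbps_per_ghz * our_budget:.0f} Gb/s before filtering overhead")
    ```

    The naive ceiling lands far above what the NIC or bus can carry, which at least suggests raw CPU cycles are unlikely to be the first bottleneck for plain forwarding; filtering overhead and lock contention are the real unknowns.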

    Thanks for your thoughts!  I'll see if the FreeBSD forums can shed some light on this too.


  • Banned

    Looking forward to seeing test results… :)



  • So, this is the reply I got from the FreeBSD firewall forums:

    "You're completely nuts to be putting 16 cores into a firewall box.
    Most of the code pathways in PF and IPFW are serialised, so you won't be able to use more than a couple CPU cores in a packet filtering firewall.
    And you really don't need 24 GB of RAM in a firewall box. Our gigabit fibre routers only have 2 GB, and they rarely use more than 512 MB of RAM.
    Find the fastest dual-core CPU you can. Give it 2-4 GB of the fastest RAM it can handle. And give it PCIe NICs with as much offloading capability as you can.
    And be sure to use the latest version of FreeBSD, as network throughput, packet filtering, and forwarding have greatly improved in 8.x over 7.x. Plus, you get the latest drivers for the fastest NICs."

    So - does pfSense use PF and/or IPFW as the firewalling base code (is the "PF" the same "PF" as in "pfSense"?)  Or is it more proprietary code that works differently?  If I am to believe that guy, I should not care about how many cores I have, but rather how fast each core is, and have fewer cores.  But if pfSense uses a more CPU-parallel code base that is not PF/IPFW based, it may change things…?



  • pfSense uses both pf and IPFW.  (It doesn't use much of IPFW, though; it's used only for Captive Portal, scheduling, and some other things.)


  • Banned

    Run it in a virtual machine and load balance it…



  • But then the issue becomes having something fast enough to do the load balancing, which could be just as much of a CPU load.


  • Banned

    Don't you think 16 cores are enough??

    @Efonne:

    But then the issue becomes having something fast enough to do the load balancing, which could be just as much of a CPU load.



  • Depends on whether it uses them for doing the processing.  At least in pfSense the load balancing goes through PF, so the multiple cores still won't make a difference.



  • Running ESXi and 5 instances of pfSense in a load balancing scheme would be an easy setup, I guess.

    But geez, 16 cores and 24GB RAM?  The only customers I have that buy that kind of hardware use it for virtualization, and then we are talking about 50+ VMs of low/medium size.



  • After a long while, we finally put our test case together.

    We have a Dell R610 with 3.47GHz Xeon X5677 Processor (Quad-Core)
    12GB RAM, 1333MHz speed
    Myricom 10G-PCIE2-8B2-2S NIC, with two 10G ports on it
    pfSense 2.0-BETA4  (amd64) Built on August 22

    As a control case, we downloaded data to our datacenter in Santa Cruz, CA all the way from a partner site in Maryland, and were able to achieve 2Gb/s sustained.  No firewall in between.

    Then we put our download server behind the new pfSense box, configured some filtering rules just for fun, and got the same 2Gb/s when downloading, which indicates the pfSense box was not a bottleneck.  We watched the pfSense box for load and saw very little.  No dropped packets, CPU was working at less than 2%, and the "Interrupt" metric as reported by top was at 7%.  Most of the packets were UDP, transferring about 150 million packets per second with ~30 open states.
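    As a rough cross-check of the packet rate the box had to handle (assuming ~1500-byte datagrams on the wire, which matches the non-jumbo MTU used in this test; the real mix of datagram sizes will shift the number somewhat):

    ```python
    # Rough packet-rate estimate for the observed transfer. Assumes
    # ~1500-byte UDP datagrams; the test ran with a 1500-byte MTU.
    rate_bps = 2e9           # observed sustained download rate
    datagram_bytes = 1500    # assumed on-wire datagram size
    packets_per_sec = rate_bps / (datagram_bytes * 8)
    print(f"~{packets_per_sec:,.0f} packets/s at 2 Gb/s")
    ```

    Under that assumption the sustained rate works out to the high hundreds of thousands of packets per second, which is consistent with the low CPU and interrupt load observed.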

    Given that the load was so low, I bet we could filter 5-7Gb/s through this box with little trouble, which I intend to try in a few months.  ;)

    Random note: I noticed the default MTU for both WAN and LAN interfaces was 9000 bytes, even though the web interface has a note saying it should default to 1500 bytes on all interfaces.  I had to manually set 1500 bytes in the web interface, since I am not able to use jumbo frames at this time.  Perhaps that is a small bug in this version of the beta release.



  • @bubble1975:

    OK…  The 10G NIC will definitely be on a PCI-E bus that can handle 8Gb/s.  So right there I am limited to 8Gb/s on the PCI-E bus.  But my "CPU" is only 2.4GHz; does that mean that I won't even be able to push 500Mb/s through the server?

    The Myricom is a PCIe 2.0 8-lane card.  The bandwidth allowed on such an interface is 4GBytes/s (32Gbits/s) in each direction simultaneously.  If the slot is only capable of Gen 1 PCIe, then this would be 2GBytes/s (16Gbits/s).
    However, seeing as you're running on the R610 with the Tylersburg chipset, all the PCIe slots (32 lanes total) are capable of Gen 2 speeds, i.e. your card is not being crippled by the slot bandwidth.
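    Those per-direction figures follow directly from the per-lane rates: PCIe Gen 1 and Gen 2 use 8b/10b encoding, so 2.5 and 5.0 GT/s per lane deliver roughly 250 and 500 MB/s of payload per lane per direction. A small sketch of the arithmetic:

    ```python
    # Approximate usable PCIe bandwidth per direction. Gen1/Gen2 use 8b/10b
    # encoding, so 2.5 / 5.0 GT/s per lane yields ~250 / ~500 MB/s of
    # payload bandwidth per lane (further protocol overhead ignored).
    MB_PER_LANE = {1: 250, 2: 500}  # per direction, per lane

    def pcie_gbps(gen: int, lanes: int) -> float:
        """Approximate per-direction bandwidth in Gbit/s for a PCIe slot."""
        return MB_PER_LANE[gen] * lanes * 8 / 1000

    print(f"Gen2 x8: {pcie_gbps(2, 8):.0f} Gb/s each way")  # 32
    print(f"Gen1 x8: {pcie_gbps(1, 8):.0f} Gb/s each way")  # 16
    ```

    Either way, the x8 slot comfortably exceeds the NIC's 2x10Gb/s of ports, so the slot is not the limiting factor here.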

