Hardware for Gigabit openVPN
I'm a student at Carnegie Mellon University, where we have gigabit all over campus, 802.11n wireless, and an Internet2 connection. Because we have to use DHCP, which gives us publicly routable IPs, we must segment/firewall our own network. Next year we will have a few servers which I would like to put behind a firewall, primarily to offload the VPN connection. At a minimum I want to push 300 Mbps of OpenVPN AES-128-CBC traffic, and ideally closer to 1 Gbps. I also definitely need very close to 1 Gbps unencrypted. So I need some beefy hardware!
The primary thing I'm thinking about is the speed of the crypto. I either need a super-fast processor, or to offload the crypto to hardware. For offloading the crypto to hardware, I'm looking at:
AMD Geode - With CPUs around 500 MHz, and the competition from VIA, this isn't really on the table.
VIA Padlock - Looks very promising and overall faster than Geode, especially with CPU speeds up to 2GHz. Currently supported by pfSense.
AES-NI instruction set - Looks really fast, but requires OS support, currently present in FreeBSD 8.2. From extrapolation of the pfSense release cycle, it would take ~8 months to get pfSense with 8.2.
Hifn/Exar crypto add-in cards - These seem really hard to find or purchase as a consumer.
I've been looking at the VIA M840 with a 1.6 GHz VIA Nano processor ( http://www.e-itx.com/m840-16.html ). This has low power consumption and fast hardware crypto, but the processor is a little on the weak side, which concerns me both for pushing 1 Gbps of unencrypted traffic and for re-use later.
Another option is the Intel Core i5 650, a dual core at 3.2 GHz that has the AES-NI instruction set. This is a beefy processor, but the kernel support for the instruction set, "aesni(4)", is in FreeBSD 8.2 and is not slated for pfSense 2.0, which will be based on 8.1. Again, looking at the release cycle, it would be about 8 months for the upgrade to 8.2.
Strangely enough these systems only differ in cost by ~$100.
Do you think the VIA 1.6 GHz would be capable of doing 1 Gbps unencrypted when paired with good network hardware? What about AES-128-CBC encrypted?
Do you think the i5 could do 1 Gbps unencrypted? Software encrypted without aesni?
Is there a way to backport the aesni module from FreeBSD 8.2 into pfSense 2.0?
Are there any systems with benchmarks that are currently doing this?
What would you do?
Thanks for any suggestions or comments!
In my (limited) experience, merely firewalling at gigabit connection speeds requires Xeon-level processors running at a minimum of 2 GHz, preferably 3 GHz and dual.
In your stead, I might try to get a used IBM Power5 or something of its ilk, as long as you get loads of GHz for the price. You can also consider splitting your setup into two lower-powered machines, one to firewall and one to serve your VPN. Two lower-powered machines might be more affordable than one super-server.
The crypto offload in the Geode is only capable of 10-12 Mbit/s. The VIA Padlock isn't much better at similar clock speeds, though the 1.6 GHz chips are capable of around 65-70 Mbit/s. AES-NI won't be supported under FreeBSD until 8.2, so you're looking at a wait on that, but with an appropriate CPU, AES-NI is a monster. The Hifn cards are decent, but you won't hit 1 Gbit/s with one. The commonly available 7955 has a theoretical max of ~300 Mbit/s, but in practice, at least with the systems I've used for testing, I've not done much better than 100 Mbit/s. The 7956 might be better, but it's still not going to get you 1 Gbit/s.
I think your best bet is going to be overwhelming force. Find a modern CPU with a REAL high clock speed, 3 GHz or more, and cross your fingers.
How parallelizable is firewalling? openVPN? How does the performance of a dual core relate to the performance of an equivalent quad core?
I would think that one or two high throughput VPN connections would not see an improvement moving from dual to quad, as each crypto stream is not parallelizable. For 3+ I would imagine a quad would help as each core could take a crypto stream.
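The reasoning above can be written down as a rough back-of-the-envelope model. This is only a sketch under stated assumptions: each crypto stream is bound to one core, streams do not share a core, and the per-core figure of 400 Mbit/s is a made-up placeholder, not a measurement from this thread. The function name is hypothetical.

```python
def aggregate_throughput_mbps(streams, cores, per_core_mbps):
    """Estimate total VPN throughput when each stream is limited to one core.

    Assumption: a single crypto stream cannot be spread over several cores,
    so only min(streams, cores) cores do useful crypto work.
    """
    busy_cores = min(streams, cores)
    return busy_cores * per_core_mbps


PER_CORE_MBPS = 400  # hypothetical single-core crypto rate, for illustration only

for streams in (1, 2, 3, 4):
    dual = aggregate_throughput_mbps(streams, 2, PER_CORE_MBPS)
    quad = aggregate_throughput_mbps(streams, 4, PER_CORE_MBPS)
    print(f"{streams} stream(s): dual core {dual} Mbit/s, quad core {quad} Mbit/s")
```

Under this model the quad core only pulls ahead of the dual core from the third concurrent stream onward. It ignores NIC limits, interrupt handling, and firewalling cost.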
What about firewalling? Can this be spread out over multiple cores, or would it just max out one core? I had thought the computation necessary for firewalling would be negligible (especially since I am not NATing), but apparently not?
The idea of separating the border firewall from the VPN is interesting, although I'm not eager to have another box to maintain. A while back I had the VPN server in a virtual machine, and it wasn't the fastest thing. Separating them would only offload the cost of firewalling, which brings me back to, "How computationally expensive is firewalling without NAT?"
Thanks for the help guys!
Not very parallelized, but you won't need that much computing power for routing without NAT. It's the VPN requirements that kill.
Unfortunately, if you are looking at that kind of VPN throughput, it's much better to go for a separate box at the moment. This would ideally be something like a commercial firewall meant for VPN rather than just a computer running VPN server software.
Edit: The only hardware I know of that is specified to push that kind of VPN throughput would be the Fortinet 310B and above appliances. These have the advantage of not being a subscription service.
The only hardware I know of capable of running a 1 Gb/s VPN is dedicated hardware and it won't be cheap. The upside is that dedicated VPN kit tends (in my experience) to be pretty hands off.
The hardware sizing page says:
501+ Mbps - server class hardware with PCI-X or PCI-e network adapters. No less than 3.0 GHz CPU.
VPN … relatively new server hardware (Xeon 800 FSB and newer) deployments are pushing over 100 Mbps
The "openssl speed rc4" command will give you an indication of the potential speed of the hardware.
The crypto algorithm will also be quite important. For example, my ULV 1.5 GHz Core 2 Duo is good for approx. 40 MB/s with AES and 170 MB/s if you use RC4.
The 170 MB/s is actually enough raw crypto throughput for gigabit. OpenVPN uses RSA only for key negotiation, so its cost should be negligible.
On my Windows desktop (i5 750) I get 110 MB/s throughput for AES-128 and 380 MB/s for RC4, which gives you a decent margin. That's only using 1 of the 4 cores, although that does mean the one running core is going to be running at a decent speed and gets the lion's share of the cache.
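To make the comparison with gigabit explicit, the MB/s figures quoted above can be converted to Mbit/s. This is a minimal sketch: it treats gigabit as 1000 Mbit/s of raw cipher throughput and ignores packet overhead, HMAC cost, and everything else OpenVPN does, so it is an upper bound at best. The dictionary labels are just descriptions of the figures quoted in this post.

```python
GIGABIT_MBPS = 1000  # nominal line rate, ignoring all protocol overhead


def mbytes_to_mbits(mbytes_per_s):
    """Convert megabytes per second to megabits per second."""
    return mbytes_per_s * 8


# Single-core figures quoted in the post above (MB/s).
quoted = {
    "Core 2 Duo ULV 1.5 GHz, AES": 40,
    "Core 2 Duo ULV 1.5 GHz, RC4": 170,
    "i5 750, AES-128": 110,
    "i5 750, RC4": 380,
}

for label, mbytes in quoted.items():
    mbits = mbytes_to_mbits(mbytes)
    verdict = "above" if mbits >= GIGABIT_MBPS else "below"
    print(f"{label}: {mbits} Mbit/s ({verdict} gigabit line rate)")
```

By this arithmetic the RC4 figures clear gigabit on one core, while the AES figures fall short of it (320 and 880 Mbit/s respectively).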
I suspect you'll find that if you get a good quality modern quad-core chip, say a Sandy Bridge i5, you'll probably be OK, especially if you get a decent Intel server network card; make sure it's PCI-E. The 3 GHz processor requirements on the hardware sizing page are probably old enough to refer to an old P4 Xeon.
Of course, testing it yourself is the only way to work this sort of thing out!
Failing that, or if you want to use low-end hardware, you've got load balancing capabilities within pfSense.
Sounds to me like if you want that sort of level of throughput, you'd be better with a pair of front end firewalls running CARP, and a cluster of backend servers for VPN and crypto.
Please, could you explain in a little more detail how to use the "openssl speed rc4" command on pfSense?
I did a speedtest with
and this is the output on my machine:
Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
2.0-RC1 (amd64) built on Fri Apr 15 00:19:55 EDT 2011
The test used only one core.
type               16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                1663.96k     3438.42k     4725.62k     5215.27k     5376.45k
mdc2               7161.42k     7818.42k     8026.10k     8070.47k     8085.57k
md4               24626.30k    84384.92k   227720.81k   407009.09k   528306.49k
md5               19799.64k    64924.42k   163943.43k   265160.01k   323451.66k
hmac(md5)         17091.68k    57443.85k   151350.55k   256802.11k   322259.33k
sha1              16966.91k    51477.13k   117085.70k   169989.78k   195887.74k
rmd160            15303.45k    42513.38k    87608.20k   119226.45k   133217.59k
rc4              289911.19k   314130.39k   322105.04k   325243.00k   327569.24k
des cbc           36080.76k    36816.73k    37663.01k    37974.90k    38092.36k
des ede3          14070.89k    14415.70k    14502.73k    14530.21k    14546.68k
idea cbc              0.00         0.00         0.00         0.00         0.00
seed cbc              0.00         0.00         0.00         0.00         0.00
rc2 cbc           22989.27k    23602.21k    23686.59k    23687.72k    23762.46k
rc5-32/12 cbc    108711.56k   116991.09k   119624.74k   119942.49k   120324.47k
blowfish cbc      71255.37k    75037.98k    75871.83k    75961.56k    76100.83k
cast cbc          57685.63k    59844.22k    60244.33k    60179.37k    60518.49k
aes-128 cbc      107838.88k   112294.31k   111809.12k   113542.00k   114159.93k
aes-192 cbc       95458.52k    98549.85k    98041.11k    99992.84k   100524.71k
aes-256 cbc       85177.66k    87775.53k    87758.50k    88753.14k    89053.31k
camellia-128 cbc  69351.23k    71338.43k    71403.81k    71845.94k    71843.56k
camellia-192 cbc  53078.23k    54090.60k    54111.13k    54509.17k    54534.17k
camellia-256 cbc  52327.56k    54051.05k    54099.77k    54442.09k    54514.64k
sha256            14032.23k    34896.44k    64516.60k    82351.22k    89491.12k
sha512            10813.82k    43242.23k    78129.24k   117596.02k   138264.42k
aes-128 ige      114732.91k   121908.42k   121101.13k   123703.41k   124635.53k
aes-192 ige      100547.38k   106664.44k   105609.82k   107573.64k   108468.93k
aes-256 ige       89518.31k    94510.48k    93505.73k    95191.07k    95707.15k

                  sign       verify     sign/s   verify/s
rsa  512 bits   0.000288s  0.000032s    3478.0    30938.5
rsa 1024 bits   0.000975s  0.000062s    1026.1    16081.1
rsa 2048 bits   0.005248s  0.000165s     190.5     6047.3
rsa 4096 bits   0.033281s  0.000521s      30.0     1917.8

                  sign       verify     sign/s   verify/s
dsa  512 bits   0.000181s  0.000191s    5523.2     5238.2
dsa 1024 bits   0.000443s  0.000512s    2256.1     1952.0
dsa 2048 bits   0.001371s  0.001640s     729.3      609.6
Perhaps this will help you with your decision.
Yup, right commands.
So look at the 1024-byte column and then at the aes-128 cbc row.
113542.00k is a figure in thousands of bytes per second,
so about 113 MB/s (roughly 908 Mbit/s), which is broadly in line with my Lynnfield.
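The same reading can be done mechanically. The sketch below takes one row of "openssl speed" output as posted earlier in the thread, picks the 1024-byte column, and converts it to Mbit/s. The helper name parse_speed_row is hypothetical, and the parser only handles rows whose values carry the "k" suffix (thousands of bytes per second); it is not a general parser for every openssl output format.

```python
BLOCK_SIZES = [16, 64, 256, 1024, 8192]  # column order in the posted output


def parse_speed_row(line):
    """Split a row into (cipher name, list of rates in thousands of bytes/s)."""
    tokens = line.split()
    rates = []
    # Rates are the trailing tokens ending in "k"; whatever remains is the name.
    while tokens and tokens[-1].endswith("k"):
        rates.append(float(tokens.pop()[:-1]))
    rates.reverse()
    return " ".join(tokens), rates


row = "aes-128 cbc 107838.88k 112294.31k 111809.12k 113542.00k 114159.93k"
name, rates = parse_speed_row(row)

kbytes_per_s = rates[BLOCK_SIZES.index(1024)]   # the 1024-byte column
mbit_per_s = kbytes_per_s * 1000 * 8 / 1e6      # bytes/s -> bits/s -> Mbit/s

print(f"{name}: {kbytes_per_s / 1000:.1f} MB/s, about {mbit_per_s:.0f} Mbit/s")
```

For the row above this comes out just over 900 Mbit/s of raw AES-128-CBC on a single core, before any OpenVPN or network overhead.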
Note that RC4 doesn't appear to be an option for OpenVPN; I had assumed it would support the same ciphers as OpenSSL. But anyway, one of the newer generation of Intel processors can just about do gigabit on a single core without hardware acceleration. And I would assume the other core can manage all the NAT.
Thank you for your additional explanation! This was helpful for me.
Another suggestion could be to run 2 or 3 OpenVPN servers on one pfSense box to split the 1 Gbit/s across them. If this means that OpenSSL could use 2 processes, one for each OpenVPN server, then a dual or quad core CPU should handle that traffic.
But I am not sure whether the en-/decryption is the only thing we have to focus on if we are talking about 1 Gbit/s OpenVPN. Perhaps there are other parameters that matter too.
Well, had a little play
Looks like when you set up multiple OpenVPN instances it starts additional processes:
root 38648 0.0 0.2 5116 3516 ?? Ss 6:18PM 0:00.00 openvpn --config /var/etc/openvpn/server2.conf
root 40283 0.0 0.2 5116 3764 ?? Ss 11:38AM 0:04.11 openvpn --config /var/etc/openvpn/server1.conf
So in principle you could set up multiple processes this way, each listening on a different port, and use the load balancer module. (From a quick Google, OpenVPN isn't multithreaded.)
The only problem is that the load balancer module only seems to do TCP, so you'd need to set up OpenVPN for TCP. Then you've got TCP tunnelled over TCP, which is generally a bad idea for throughput. So instead of a bottleneck from the encryption running on a single processor, you have a bottleneck from setting up the tunnelling badly. ("Proper" load balancers like the F5s I use at work can do UDP, but they're rather more expensive than pfSense!)
OpenVPN is the only thing you've suggested you're doing here which is processor intensive. I believe gigabit throughput is possible with a decent modern processor as long as you take care when picking your network cards and make sure they're decent PCI-E jobs.
Check out the Exar Express DX 17/18xx. Believe they'll be supported by OpenSwan.
There's been a thread on getting high performance OpenVPN traffic on the OpenVPN mailing list over the last few days. I would highly recommend that people read that since it turns out the settings in OpenVPN are critical (look for Retake on openvpn speed over gigabit ethernet (SECOND attempt)). There are 2 settings that were found to make the difference between ~350 Mb/s and ~940 Mb/s in an otherwise identical setup.
For the lazy, here is a link to the thread:
Why not try to push it on a pfSense VM, where you can adjust the hardware settings fairly easily? Monitor the bottlenecks very closely on the VM and see how far you can push it.