Hardware for Gigabit OpenVPN



  • I'm a student at Carnegie Mellon University, where we have gigabit all over campus, 802.11n wireless, and an Internet2 connection. Because we have to use DHCP, which gives us publicly routable IPs, we must segment/firewall our own network. Next year we will have a few servers which I would like to put behind a firewall, primarily to offload the VPN connection. At a minimum I want to push 300 Mbps of OpenVPN AES-128-CBC traffic, and ideally closer to 1 Gbps. I also definitely need very close to 1 Gbps unencrypted. So I need some beefy hardware!

    The primary thing I'm thinking about is the speed of the crypto. I either need a super-fast processor, or to offload the crypto to hardware. For offloading the crypto to hardware, I'm looking at:

    • AMD Geode - With CPUs around 500 MHz, and the competition from VIA, this isn't really on the table.

    • VIA Padlock - Looks very promising and overall faster than Geode, especially with CPU speeds up to 2GHz. Currently supported by pfSense.

    • AES-NI instruction set - Looks really fast, but requires OS support, currently present in FreeBSD 8.2. From extrapolation of the pfSense release cycle, it would take ~8 months to get pfSense with 8.2.

    • Hifn/Exar crypto add-in cards - These seem really hard to find or purchase as a consumer.

    I've been looking at the VIA M840 with a 1.6 GHz VIA Nano processor ( http://www.e-itx.com/m840-16.html ). This has low power consumption and fast hardware crypto, but the processor is a little on the weak side, which concerns me both for pushing 1 Gbps of unencrypted traffic and for the potential for re-use later.

    Another option is the Intel Core i5 650, a dual core at 3.2 GHz that has the AES-NI instruction set. This is a beefy processor, but kernel support for the instruction set, "aesni(4)", arrives in FreeBSD 8.2 and is not slated for pfSense 2.0, which will be based on 8.1. Again, looking at the release cycle, it would be ~8 months until the upgrade to 8.2.

    Strangely enough these systems only differ in cost by ~$100.

    The questions:

    • Do you think the 1.6 GHz VIA would be capable of doing 1 Gbps unencrypted when paired with good network hardware? What about AES-128-CBC encrypted?

    • Do you think the i5 could do 1 Gbps unencrypted? What about software encrypted, without aesni?

    • Is there a way to backport the aesni module from FreeBSD 8.2 into pfSense 2.0?

    • Are there any systems with benchmarks that are currently doing this?

    • What would you do?

    Thanks for any suggestions or comments!

    Craig



  • In my (limited) experience, merely firewalling at gigabit connection speeds requires Xeon-level processors running at a minimum of 2 GHz, preferably 3 GHz and dual.

    In your place, I might try to get a used IBM Power5 or something of its ilk, as long as you get loads of GHz for the price. You can also consider splitting your setup into two lower-powered machines, one to firewall and one to serve your VPN. Two lower-powered machines might be more affordable than one super-server.



  • The crypto offload in the Geode is only capable of 10-12 Mbit/s. The VIA Padlock isn't much better at similar clock speeds, though the 1.6 GHz chips are capable of around 65-70 Mbit/s. AES-NI won't be supported under FreeBSD until 8.2, so you're looking at a wait on that, but with an appropriate CPU, AES-NI is a monster. The Hifn cards are decent, but you won't hit 1 Gbit/s with one. The commonly available 7955 has a theoretical max of ~300 Mbit/s, but in practice, at least with the systems I've used for testing, I've not done much better than 100 Mbit/s. The 7956 might be better, but it's still not going to get you 1 Gbit/s.

    I think your best bet is going to be overwhelming force. Find a modern CPU with a REAL high clock speed, 3 GHz or more, and cross your fingers.



  • How parallelizable is firewalling? OpenVPN? How does the performance of a dual core relate to the performance of an equivalent quad core?

    I would think that one or two high throughput VPN connections would not see an improvement moving from dual to quad, as each crypto stream is not parallelizable. For 3+ I would imagine a quad would help as each core could take a crypto stream.

    What about firewalling? Can this be spread out over multiple cores, or would it just max out one core? I had thought the computation necessary for firewalling would be negligible (especially since I am not NATing), but apparently not?

    The idea of separating the border firewall from the VPN is interesting, although I'm not eager to have another box to maintain. A while back I had the VPN server in a virtual machine, and it wasn't the fastest thing. Separating them would only offload the cost of firewalling, which brings me back to, "How computationally expensive is firewalling without NAT?"

    Thanks for the help guys!



  • Not very parallelized, but you won't need that much computing power for routing without NAT. It's the VPN requirement that kills you.

    Unfortunately, if you are looking at that kind of VPN throughput, it's much better to go for a separate box at the moment. This would ideally be something like a commercial firewall meant for VPN rather than just a computer running VPN server software.

    Edit: The only hardware I know of that is specified to push that kind of VPN throughput would be the Fortinet 310B and above appliances. It has the advantage of not being a subscription service.



  • The only hardware I know of capable of running a 1 Gb/s VPN is dedicated hardware and it won't be cheap. The upside is that dedicated VPN kit tends (in my experience) to be pretty hands off.

    The hardware sizing page says:

    501+ Mbps - server class hardware with PCI-X or PCI-e network adapters. No less than 3.0 GHz CPU.

    VPN … relatively new server hardware (Xeon 800 FSB and newer) deployments are pushing over 100 Mbps



  • The "openssl speed rc4" command will give you an indication of the potential speed of the hardware.

    The crypto algorithm will also be quite important. For example, my ULV 1.5 GHz Core 2 Duo is good for approx. 40 MB/s with AES and 170 MB/s if you use RC4.
    The 170 MB/s is actually enough raw crypto throughput for gigabit. OpenVPN uses RSA only for key negotiation, so that should be negligible.
    On my Windows desktop (i5 750) I get 110 MB/s throughput for AES-128 and 380 MB/s for RC4, which gives you a decent margin. That's only using 1 of the 4 cores, although that does mean the one running core is going to be running at a decent speed and gets the lion's share of the cache.

    I suspect you'll find that if you get a good-quality modern quad-core chip, say a Sandy Bridge i5, you'll probably be OK, especially if you get a decent Intel server network card; make sure it's PCI-E. The 3 GHz processor requirement on the hardware sizing page is probably old enough to refer to an old P4 Xeon.

    Of course, testing it yourself is the only way to work this sort of thing out!

    Failing that, or if you want to use low-end hardware, you've got load balancing capabilities within pfSense.

    It sounds to me like if you want that level of throughput, you'd be better off with a pair of front-end firewalls running CARP and a cluster of back-end servers for VPN and crypto.



  • @wishy

    Could you please explain in a bit more detail how to use the "openssl speed rc4" command on pfSense?

    Thanks!

    ---edit---

    I did a speedtest with

    openssl speed
    

    and this is the output on my machine:
    Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
    2.0-RC1 (amd64) built on Fri Apr 15 00:19:55 EDT 2011
    4GB RAM
    The test used only one core.

    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    md2               1663.96k     3438.42k     4725.62k     5215.27k     5376.45k
    mdc2              7161.42k     7818.42k     8026.10k     8070.47k     8085.57k
    md4              24626.30k    84384.92k   227720.81k   407009.09k   528306.49k
    md5              19799.64k    64924.42k   163943.43k   265160.01k   323451.66k
    hmac(md5)        17091.68k    57443.85k   151350.55k   256802.11k   322259.33k
    sha1             16966.91k    51477.13k   117085.70k   169989.78k   195887.74k
    rmd160           15303.45k    42513.38k    87608.20k   119226.45k   133217.59k
    rc4             289911.19k   314130.39k   322105.04k   325243.00k   327569.24k
    des cbc          36080.76k    36816.73k    37663.01k    37974.90k    38092.36k
    des ede3         14070.89k    14415.70k    14502.73k    14530.21k    14546.68k
    idea cbc             0.00         0.00         0.00         0.00         0.00
    seed cbc             0.00         0.00         0.00         0.00         0.00
    rc2 cbc          22989.27k    23602.21k    23686.59k    23687.72k    23762.46k
    rc5-32/12 cbc   108711.56k   116991.09k   119624.74k   119942.49k   120324.47k
    blowfish cbc     71255.37k    75037.98k    75871.83k    75961.56k    76100.83k
    cast cbc         57685.63k    59844.22k    60244.33k    60179.37k    60518.49k
    aes-128 cbc     107838.88k   112294.31k   111809.12k   113542.00k   114159.93k
    aes-192 cbc      95458.52k    98549.85k    98041.11k    99992.84k   100524.71k
    aes-256 cbc      85177.66k    87775.53k    87758.50k    88753.14k    89053.31k
    camellia-128 cbc    69351.23k    71338.43k    71403.81k    71845.94k    71843.56k
    camellia-192 cbc    53078.23k    54090.60k    54111.13k    54509.17k    54534.17k
    camellia-256 cbc    52327.56k    54051.05k    54099.77k    54442.09k    54514.64k
    sha256           14032.23k    34896.44k    64516.60k    82351.22k    89491.12k
    sha512           10813.82k    43242.23k    78129.24k   117596.02k   138264.42k
    aes-128 ige     114732.91k   121908.42k   121101.13k   123703.41k   124635.53k
    aes-192 ige     100547.38k   106664.44k   105609.82k   107573.64k   108468.93k
    aes-256 ige      89518.31k    94510.48k    93505.73k    95191.07k    95707.15k
                      sign    verify    sign/s verify/s
    rsa  512 bits 0.000288s 0.000032s   3478.0  30938.5
    rsa 1024 bits 0.000975s 0.000062s   1026.1  16081.1
    rsa 2048 bits 0.005248s 0.000165s    190.5   6047.3
    rsa 4096 bits 0.033281s 0.000521s     30.0   1917.8
                      sign    verify    sign/s verify/s
    dsa  512 bits 0.000181s 0.000191s   5523.2   5238.2
    dsa 1024 bits 0.000443s 0.000512s   2256.1   1952.0
    dsa 2048 bits 0.001371s 0.001640s    729.3    609.6
    
    

    Perhaps this will help you with your decision.
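
    For anyone who wants to compare several machines, the table above can be read programmatically. The following is a minimal sketch, not part of the original post: the function name parse_speed_rows is made up for illustration, and it assumes the five-column layout shown above, where a trailing "k" means thousands of bytes per second.

```python
# Sketch: turn the cipher/digest rows of "openssl speed" output into a dict.
# Assumes five throughput columns (16, 64, 256, 1024, 8192 byte blocks).
BLOCK_SIZES = (16, 64, 256, 1024, 8192)

def parse_speed_rows(text):
    """Return {algorithm: {block_size: thousands_of_bytes_per_second}}."""
    rows = {}
    for line in text.splitlines():
        # Algorithm names may contain spaces ("aes-128 cbc"), so split
        # the five numeric columns off from the right-hand side.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        name, *cols = parts
        try:
            values = [float(c.rstrip("k")) for c in cols]
        except ValueError:
            continue  # header line or an rsa/dsa timing row
        rows[" ".join(name.split())] = dict(zip(BLOCK_SIZES, values))
    return rows

# Two rows copied from the output above, plus the header line.
sample = """
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4             289911.19k   314130.39k   322105.04k   325243.00k   327569.24k
aes-128 cbc     107838.88k   112294.31k   111809.12k   113542.00k   114159.93k
"""
rows = parse_speed_rows(sample)
print(rows["aes-128 cbc"][1024])
```

    Splitting from the right keeps multi-word algorithm names intact, and rows that do not end in five numbers (the header and the rsa/dsa timings) are simply skipped.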



  • Yup, right commands.

    So look at the 1024-byte column and then at the aes-128 cbc row.
    113542.00k is a figure in thousands of bytes per second.

    That's about 113 MB/s, which is broadly in line with my Lynnfield.

    Note that RC4 doesn't appear to be an option for OpenVPN; I had assumed it would support the same ciphers as OpenSSL. But anyway, one of the newer generation of Intel processors can just about do gigabit on a single core without hardware acceleration. And I would assume the other core can manage all the NAT.
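
    The unit conversion behind that reading can be written out as a small sketch. The helper names below are invented for illustration; the inputs are the figures quoted in this thread. Raw cipher speed is only an upper bound on tunnel throughput, since packet handling and HMAC also cost CPU time.

```python
# Sketch: convert the crypto figures quoted in this thread to Mbit/s
# and compare them with a 1 Gbit/s link.
GIGABIT_MBIT = 1000.0

def mbyte_to_mbit(mbyte_per_s):
    """Megabytes per second to megabits per second."""
    return mbyte_per_s * 8

def openssl_k_to_mbit(figure):
    """Convert an 'openssl speed' figure such as '113542.00k' to Mbit/s.

    The 'k' suffix means thousands of bytes per second.
    """
    bytes_per_s = float(figure.rstrip("k")) * 1000
    return bytes_per_s * 8 / 1_000_000

figures = [
    ("ULV Core 2 Duo, AES (40 MB/s)", mbyte_to_mbit(40)),
    ("ULV Core 2 Duo, RC4 (170 MB/s)", mbyte_to_mbit(170)),
    ("i5 750, AES-128 (110 MB/s)", mbyte_to_mbit(110)),
    ("Xeon E5506, aes-128 cbc (113542.00k)", openssl_k_to_mbit("113542.00k")),
]
for label, mbit in figures:
    share = mbit / GIGABIT_MBIT
    print(f"{label}: {mbit:.0f} Mbit/s ({share:.0%} of gigabit)")
```

    On these numbers, single-core software AES-128 on the E5506 and the i5 750 lands a little under gigabit line rate, which matches the "just about" above.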



  • @wishy
    Thank you for the additional explanation! This was helpful for me.

    Another suggestion could be to run 2 or 3 OpenVPN servers on one pfSense box to split the 1 Gbit/s between them. If this means that OpenSSL can use 2 processes, one for each OpenVPN server, then a dual or quad core CPU should handle that traffic.

    But I am not sure if the en-/decryption is the only thing we have to focus on when we are talking about 1 Gbit/s OpenVPN. Perhaps there are other parameters that matter?
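
    As a rough way to reason about that idea, here is an idealized sketch. It assumes each OpenVPN process is bound by one core's crypto speed, that processes scale perfectly across cores, and that nothing else (NIC, interrupts, packet handling) is the bottleneck; none of that has been measured here, and the function name is made up. The 320 Mbit/s per-process figure is the 40 MB/s AES number quoted earlier in the thread.

```python
# Idealized model: aggregate VPN throughput when traffic is spread over
# several single-threaded OpenVPN processes.
def aggregate_mbit(per_process_mbit, processes, cores, link_mbit=1000.0):
    """Upper bound on total throughput in Mbit/s.

    At most one process can run per core at a time, and the total can
    never exceed the link speed.
    """
    busy_cores = min(processes, cores)
    return min(busy_cores * per_process_mbit, link_mbit)

print(aggregate_mbit(320, processes=1, cores=2))  # one tunnel, one busy core
print(aggregate_mbit(320, processes=2, cores=2))  # both cores busy
print(aggregate_mbit(320, processes=3, cores=2))  # a third process adds nothing
print(aggregate_mbit(320, processes=4, cores=4))  # capped by the gigabit link
```

    The model also shows why a single high-throughput tunnel gains nothing from extra cores: with one process, only one core is ever busy.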



  • Well, I had a little play.

    It looks like when you set up multiple OpenVPN instances, pfSense starts an additional process for each:
    e.g. …
    root    38648  0.0  0.2  5116  3516  ??  Ss    6:18PM  0:00.00 openvpn --config /var/etc/openvpn/server2.conf
    root    40283  0.0  0.2  5116  3764  ??  Ss  11:38AM  0:04.11 openvpn --config /var/etc/openvpn/server1.conf

    So in principle you could set up multiple processes this way, each listening on a different port, and use the load balancer module. (From a quick Google search, OpenVPN isn't multithreaded.)

    The only problem is that the load balancer module only seems to do TCP, so you'd need to set up OpenVPN for TCP. Then you've got TCP tunnelled over TCP, which is generally a bad idea for throughput, so instead of a bottleneck from encryption running on a single processor you have a bottleneck from a badly set up tunnel. ("Proper" load balancers like the F5s I use at work can do UDP, but they're rather more expensive than pfSense!)

    OpenVPN is the only thing you've suggested you're doing here that is processor intensive. I believe gigabit throughput is possible with a decent modern processor, as long as you take care when picking your network cards and make sure they're decent PCI-E jobs.
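
    To make the "one process per port" idea concrete, here is a sketch that prints what two such server configs might contain. pfSense generates these files from the GUI, so this is illustrative only: the ports, the 10.8.x.0 tunnel subnets and the helper name are assumptions, and the fragments omit the certificate and key directives a real server needs.

```python
# Sketch: minimal OpenVPN server config fragments, one per process.
# Ports and tunnel subnets are hypothetical; ca/cert/key/dh are omitted.
def server_config(index, proto="udp", base_port=1194):
    """Return config text for the server instance with the given index."""
    lines = [
        f"port {base_port + index}",             # each instance gets its own port
        f"proto {proto}",                        # "udp" avoids TCP-over-TCP
        "dev tun",
        f"server 10.8.{index}.0 255.255.255.0",  # distinct tunnel subnet
        "cipher AES-128-CBC",
    ]
    return "\n".join(lines) + "\n"

# File names follow the server1.conf / server2.conf pattern seen above.
configs = {f"server{i + 1}.conf": server_config(i) for i in range(2)}
for name, text in configs.items():
    print(f"# {name}")
    print(text)
```

    Each instance needs its own port and its own tunnel subnet; clients would then have to be distributed across the ports by some means, which is where the load balancer limitation above comes in.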



  • Check out the Exar Express DX 17/18xx. I believe they'll be supported by Openswan.



  • There's been a thread on getting high-performance OpenVPN traffic on the OpenVPN mailing list over the last few days. I would highly recommend that people read it, since it turns out the settings in OpenVPN are critical (look for "Retake on openvpn speed over gigabit ethernet (SECOND attempt)"). There are 2 settings that were found to make the difference between ~350 Mb/s and ~940 Mb/s in an otherwise identical setup.


  • For the lazy, here is a link to the thread:

    http://sourceforge.net/mailarchive/message.php?msg_id=27365978


  • Why not try pushing it on a pfSense VM, where you can adjust the hardware settings fairly easily? Monitor the bottlenecks on the VM very closely and see how far you can push it.

