AES-NI performance

aesguy

VAMike said:

"But when using the freebsd crypto device most of the work is done in kernel space rather than user space, so the cpu time measurement consists entirely of the time spent making system calls."

It does not appear that the crypto device is being used - OpenSSL invokes the appropriate CPU instructions directly. For example, on ARMv8, the AESE instruction is invoked directly: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aesv8-armx.pl

Secondly, we see evidence to support this - it matters not whether you set AES-NI in pfsense but rather does matter whether you invoke openssl with "-evp" or not.

I am not convinced that your assumption about kernel vs userland is valid, and therefore I don't think these numbers are as meaningless as you think.

Engineer

Out of curiosity, I added the -elapsed option to the original speed test and the results fell dramatically. I have never looked at or tried to understand the results, I was simply running the test and passing the supplied results on.

$ openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 1006598 aes-256-cbc's in 3.02s
Doing aes-256-cbc for 3s on 64 size blocks: 965281 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 779226 aes-256-cbc's in 3.02s
Doing aes-256-cbc for 3s on 1024 size blocks: 458171 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 92570 aes-256-cbc's in 3.01s
OpenSSL 1.0.1l-freebsd 15 Jan 2015
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-cbc 5326.91k 20592.66k 65978.50k 156389.03k 252121.25k

I have run the test multiple times now and the results are very repeatable. Without the -elapsed option, the numbers are all over the place but MUCH higher. I'm not going to argue whether the other results are real or not as I don't know enough one way or the other. Just passing on the results with and without the -elapsed command line option for others to evaluate.

VAMike

@aesguy:

It does not appear that the crypto device is being used - OpenSSL invokes the appropriate CPU instructions directly. For example, on ARMv8, the AESE instruction is invoked directly: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aesv8-armx.pl

Secondly, we see evidence to support this - it matters not whether you set AES-NI in pfsense but rather does matter whether you invoke openssl with "-evp" or not.

[2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 1439217 aes-128-cbc's in 0.37s
Doing aes-128-cbc for 3s on 64 size blocks: 1282244 aes-128-cbc's in 0.30s
Doing aes-128-cbc for 3s on 256 size blocks: 1185939 aes-128-cbc's in 0.26s
Doing aes-128-cbc for 3s on 1024 size blocks: 773748 aes-128-cbc's in 0.20s
Doing aes-128-cbc for 3s on 8192 size blocks: 272141 aes-128-cbc's in 0.11s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 62713.12k 276424.81k 1177601.49k 3900642.23k 20382894.37k

So there are the bogus numbers; there's no way this hardware is doing 20GByte/s of crypto. Let's see what happens with -elapsed:

[2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 1392190 aes-128-cbc's in 3.03s
Doing aes-128-cbc for 3s on 64 size blocks: 1415484 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 1560350 aes-128-cbc's in 3.02s
Doing aes-128-cbc for 3s on 1024 size blocks: 1176285 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 314815 aes-128-cbc's in 3.00s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 7348.47k 30118.56k 132117.70k 400462.41k 859654.83k

Now those are believable numbers. A bit low, but this is in a VM. Also note just how tremendously bad the performance is with small block sizes - that's the overhead of the context switches. If your theory is that cryptodev isn't relevant, let's just unload it:

[2.3.2-RELEASE][admin@pfSense.localdomain]/root: kldunload aesni
[2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 197582147 aes-128-cbc's in 2.82s
Doing aes-128-cbc for 3s on 64 size blocks: 49253757 aes-128-cbc's in 2.72s
Doing aes-128-cbc for 3s on 256 size blocks: 12996564 aes-128-cbc's in 2.82s
Doing aes-128-cbc for 3s on 1024 size blocks: 3230639 aes-128-cbc's in 2.77s
Doing aes-128-cbc for 3s on 8192 size blocks: 399946 aes-128-cbc's in 2.72s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 1120909.24k 1159444.76k 1179699.19k 1192806.52k 1205097.06k
[2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 172665690 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 49500772 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 9678881 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2480302 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 344003 aes-128-cbc's in 3.06s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 920883.68k 1056016.47k 825931.18k 844410.76k 920186.96k

Note that the wall clock and cpu clock results are much closer, and the numbers are actually plausible. The overhead of the context switches went away, and the performance is much, much better for small block sizes. You can also sanity check by ignoring the bandwidth summary and looking at the initial status output: with cryptodev it did about 1.4M small block operations in 3 seconds, and without cryptodev it did close to 200M small block operations in 3 seconds. There is no way in reality that 1.4M operations in 3s is better than 200M operations in 3s, unless you're measuring something wrong.
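
A rough sketch of that sanity check in Python, using only the operation counts from the 16-byte status lines above. It assumes each run took about 3 seconds of wall-clock time, and the helper name is made up for illustration:

```python
# Compare raw operation counts from the 16-byte status lines above.
# Assumption: each run really took about 3 s of wall-clock time.

def ops_per_second(count, seconds):
    """Operations completed per second of wall-clock time."""
    return count / seconds

with_cryptodev = ops_per_second(1439217, 3.0)       # aesni module loaded
without_cryptodev = ops_per_second(197582147, 3.0)  # aesni module unloaded

ratio = without_cryptodev / with_cryptodev
print(f"with cryptodev:    {with_cryptodev:,.0f} ops/s")
print(f"without cryptodev: {without_cryptodev:,.0f} ops/s")
print(f"ratio: {ratio:.0f}x")
```

On these counts the run without cryptodev completes well over a hundred times as many 16-byte operations, which is the opposite of what the CPU-time bandwidth summary suggests.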

Another test - AES GCM isn't implemented in cryptodev so let's put the module back and see what happens:

[2.3.2-RELEASE][admin@pfSense.localdomain]/root: kldload aesni
[2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -evp aes-128-gcm
Doing aes-128-gcm for 3s on 16 size blocks: 96121725 aes-128-gcm's in 2.76s
Doing aes-128-gcm for 3s on 64 size blocks: 39512692 aes-128-gcm's in 2.84s
Doing aes-128-gcm for 3s on 256 size blocks: 14260288 aes-128-gcm's in 2.95s
Doing aes-128-gcm for 3s on 1024 size blocks: 3360255 aes-128-gcm's in 2.74s
Doing aes-128-gcm for 3s on 8192 size blocks: 485825 aes-128-gcm's in 2.49s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-gcm 557669.38k 891702.40k 1239472.46k 1254801.55k 1596941.80k

Plausible results again, with the proper relationship between the non-cryptodev CBC and the GCM results. (Though much slower than it should be on this hardware. I don't know if that's an artifact of the VM or the fact that pfsense has an older version of openssl.) Something newer on the bare hardware:

openssl speed -evp aes-128-gcm

Doing aes-128-gcm for 3s on 16 size blocks: 116930558 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 66316891 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 32782942 aes-128-gcm's in 2.99s
Doing aes-128-gcm for 3s on 1024 size blocks: 12712095 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 2004498 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 16384 size blocks: 875464 aes-128-gcm's in 3.00s
OpenSSL 1.1.0c 10 Nov 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR=""/usr/lib/ssl"" -DENGINESDIR=""/usr/lib/x86_64-linux-gnu/engines-1.1""
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-gcm 623629.64k 1414760.34k 2806833.83k 4339061.76k 5473615.87k 4781200.73k

and the CBC output on bare hardware with openssl 1.1:

openssl speed -evp aes-128-cbc

Doing aes-128-cbc for 3s on 16 size blocks: 170365448 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 61436331 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 256 size blocks: 15619487 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 4102787 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 511408 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 254687 aes-128-cbc's in 2.99s
OpenSSL 1.1.0c 10 Nov 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR=""/usr/lib/ssl"" -DENGINESDIR=""/usr/lib/x86_64-linux-gnu/engines-1.1""
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 908615.72k 1315025.15k 1332862.89k 1400417.96k 1396484.78k 1395582.54k

This isn't a matter of opinion, it's a simple mistake that's been getting propagated for some time, leading to some wildly inaccurate claims about crypto performance. If you ask the guys writing the openssl code whether you can get 20GByte/s from a single core on a commodity intel chip you'll get a very clear "no". (And intel wouldn't be trying to sell very expensive quick assist hardware if it had lower performance than a cheap desktop: http://www.intel.com/content/www/us/en/ethernet-products/gigabit-server-adapters/quickassist-adapter-for-servers.html)

biggsy

Running the original test (without -elapsed) a few times gave a very wide range of results.

Microserver Gen 8 with E3-1265L V2 @ 2.50GHz:

[2.3.2-RELEASE][root@philter]/root: openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 1751663 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 64 size blocks: 1609691 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 1285984 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 1024 size blocks: 722643 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 143875 aes-256-cbc's in 3.01s
OpenSSL 1.0.1s-freebsd  1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc       9317.94k    34340.07k   109452.27k   246662.14k   391854.21k

Edit after reading more carefully:

[2.3.2-RELEASE][root@philter]/root: kldunload aesni
[2.3.2-RELEASE][root@philter]/root: openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 76690389 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 20091193 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 256 size blocks: 5107757 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 1285013 aes-256-cbc's in 3.01s
Doing aes-256-cbc for 3s on 8192 size blocks: 160927 aes-256-cbc's in 3.00s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-cbc 409015.41k 427498.84k 435861.93k 437478.50k 439437.99k

RMB

Lanner FW-7525D (Quad-core Atom C2558 @ 2.40GHz)
Shell Output - openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 988744 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 926802 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 762164 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 455059 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 93341 aes-256-cbc's in 3.00s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-cbc 5273.30k 19771.78k 65037.99k 155326.81k 254883.16k

Shell Output - openssl speed -evp aes-256-gcm -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-gcm for 3s on 16 size blocks: 20826334 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 64 size blocks: 8843173 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 2794049 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 754329 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 96056 aes-256-gcm's in 3.00s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-gcm 111073.78k 188654.36k 238425.51k 257477.63k 262296.92k

PfSense SG-2440 (Dual-core Atom C2358 @ 1.74GHz)
Shell Output - openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 727986 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 680875 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 557737 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 327133 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 67983 aes-256-cbc's in 3.01s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-cbc 3882.59k 14525.33k 47593.56k 111661.40k 185156.73k

Shell Output - openssl speed -evp aes-256-gcm -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-gcm for 3s on 16 size blocks: 14925214 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 64 size blocks: 6436982 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 2026331 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 549702 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 70004 aes-256-gcm's in 3.00s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-gcm 79601.14k 137322.28k 172913.58k 187631.62k 191157.59k

aesguy

VAMike,

I am interested in top performance possible for given hardware. "-elapsed" is not useful in that regard because it measures the typical performance on that box given everything in place. In your case, not only are your tests including things like other applications and processes running (and the swapping in and out of all those processes millions of times per second) but additionally you are running in a VM where typically the operating system itself is given limited access to the underlying hardware resources! Of course your elapsed time ("-elapsed") is going to show drastically slower speeds - because the overhead of the operating system swapping and context switching all those applications and processes AND other operating system instances are all coming into play!! You are not only swapping out all the processes millions of times per second, but in your case, your operating system running in a VM itself is causing overhead! Naturally if your system is running other applications AND operating systems then your "-elapsed" is going to show radically different results.

So we're talking apples to oranges. I am interested in the BEST performance that the hardware CAN achieve - not the performance of a particular box as configured and loaded with all sorts of junk - and openssl without "-elapsed" displays that better than with "-elapsed". If you care about maximizing AES performance, then dump all the processes and applications running in the background, tune the OS (for example, minimize context switching), build the application for the hardware - and now we're getting closer to achieving maximum performance possible - and this is what openssl without "-elapsed" can give us an idea of today before investing in a hardware platform. Heck, if one wanted to, you could go further and dump the OS altogether and write an application that boots and runs natively on the hardware itself, but in many cases that would carry non-trivial costs for that last mile of performance - but it's doable.

You seem to be more interested in performance of A GIVEN SYSTEM built for general purpose - which by the way is perfectly valid and what most pfsense users are after, but it's just not what I'm after in trying to get a good idea of the top performing hardware. Indeed, as you suggest by your tests you are seeing a difference of ~20fold - a big difference and hence why AES-NI offers a big gain in performance - if you can harness it properly.

Regarding unloading aesni: try instead invoking openssl with and without "-evp". Like I showed you, the actual OpenSSL source (when invoked using -evp) invokes the actual AES-NI CPU instructions. Then consult the version of openssl you are using and your hardware platform against the source code to see whether the AES-NI instructions are being invoked directly. Here is a link: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/

VAMike

@aesguy:

I am interested in top performance possible for given hardware. "-elapsed" is not useful in that regard because it measures the typical performance on that box given everything in place.

Well, that's not what you're measuring in your results. You're measuring the time the application spends talking to the kernel to ask for encryption services, and not counting at all the time spent doing encryption. If that's really what you want that's great, but it's a number with no utility whatsoever.

In your case, not only are your tests including things like other applications and processes running (and the swapping in and out of all those processes millions of times per second) but additionally you are running in a VM where typically the operating system itself is given limited access to the underlying hardware resources! Of course your elapsed time ("-elapsed") is going to show drastically slower speeds - because the overhead of the operating system swapping and context switching all those applications and processes AND other operating system instances are all coming into play!! You are not only swapping out all the processes millions of times per second, but in your case, your operating system running in a VM itself is causing overhead! Naturally if your system is running other applications AND operating systems then your "-elapsed" is going to show radically different results.

You're simply confused here. The difference isn't system load, it's whether you're actually measuring the time spent doing crypto or not. If you were not using cryptodev then elapsed would be telling you what you think it is - as it is in the cases I showed where the aesni module is unloaded or in the GCM case. Those are valid uses of the defaults, and do show actual cpu time and are worth considering. To put this a different way, -elapsed is usually a less accurate measure, but in the cryptodev case it's the only way to get a measurement that's anywhere close to reality. Unloading the cryptodev aesni module and dropping -elapsed is a better solution (unless you're specifically trying to measure the performance with the aesni module). Again, if you look at the diagnostic output and see something like "in 0.x seconds", that's bogus because it's not even close to the time spent. (And if you see that output you pretty much know it's a system with cryptodev loaded; it doesn't happen otherwise.) If you see something like "in 2.83s" then you're seeing the benefit of basing the calculation on cpu time, because that's an indication that openssl didn't get a full 3 seconds of cpu time and it's more accurate to base the bandwidth number on the amount of time it actually got. But that cpu time is going to be something pretty close to 3 seconds in any case where basing the calculation on cpu time is remotely valid.
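
That check can be sketched in a few lines of Python. This is only an illustration: the regular expression is written against the status lines pasted in this thread, and the 0.5 threshold and the function name are arbitrary choices, not anything openssl defines.

```python
import re

# Matches status lines such as:
#   Doing aes-128-cbc for 3s on 8192 size blocks: 272141 aes-128-cbc's in 0.11s
STATUS = re.compile(
    r"Doing (\S+) for (\d+)s on (\d+) size blocks: (\d+) \S+ in ([\d.]+)s"
)

def check_line(line, min_fraction=0.5):
    """Return (block_size, throughput_k, suspicious) for a status line, or None.

    min_fraction is an arbitrary illustrative threshold, not an openssl setting.
    """
    m = STATUS.search(line)
    if m is None:
        return None
    requested = float(m.group(2))   # seconds the test was asked to run
    block = int(m.group(3))         # bytes per operation
    count = int(m.group(4))         # operations completed
    reported = float(m.group(5))    # time openssl divides by
    throughput_k = count * block / reported / 1000  # 1000s of bytes per second
    suspicious = reported < min_fraction * requested
    return block, throughput_k, suspicious

cpu_time_line = "Doing aes-128-cbc for 3s on 8192 size blocks: 272141 aes-128-cbc's in 0.11s"
elapsed_line = "Doing aes-128-cbc for 3s on 8192 size blocks: 314815 aes-128-cbc's in 3.00s"

for line in (cpu_time_line, elapsed_line):
    print(check_line(line))
```

The first line is flagged because the time used in the division is a small fraction of the 3 seconds requested; the second is not. The printed time is rounded to two decimals, so a throughput recomputed this way only approximately matches the summary row.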

You seem to be more interested in performance of A GIVEN SYSTEM built for general purpose

No, I'm interested in reality. The cryptodev numbers do not reflect reality. Again, look at the number of computations actually being performed, and try to understand what that means. You are looking at bogus numbers because the time used in the calculation simply does not reflect the time the cpu spends doing crypto (because it's being done in kernel space rather than user space). By ignoring the time the CPU spends doing the crypto you aren't getting a better understanding of the performance characteristics of the hardware, you're simply making an incorrect calculation.

which by the way is perfectly valid and what most pfsense users are after, but it's just not what I'm after in trying to get a good idea of the top performing hardware. Indeed, as you suggest by your tests you are seeing a difference of ~20fold - a big difference and hence why AES-NI offers a big gain in performance - if you can harness it properly.

The other thing you seem to not understand is that openssl uses the AES-NI instructions without using cryptodev. If you unload the aesni module on your hardware and run with and without -evp you'll see a significant difference, one that's real rather than one that's an interpretation error, and that difference is the use of the AES-NI instructions. Or focus entirely on the GCM results, which don't get screwed up by cryptodev.

aesguy

You're measuring the time the application spends talking to the kernel to ask for encryption services

OpenSSL isn't "talking to the kernel" per se - but rather the kernel gives the process running openssl small slices of time with the CPU and memory etc and feeds them openssl's instructions. And more importantly, openssl -evp on many architectures supports the native instructions for the specific architecture in use. On those supported architectures, it is NOT "asking for encryption services" - it knows the low-level AES instructions appropriate for that CPU and calls those. This is the 3rd time I'll provide the link to the source code, these are the ARMv8 assembly instructions: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aesv8-armx.pl

Out of that, you'll see for example the AES CBC encrypt instructions:

$code.=<<___;
	subs	$len,$len,#16
	mov	$step,#16
	b.lo	.Lcbc_abort
	cclr	$step,eq
	cmp	$enc,#0			// en- or decrypting?
	ldr	$rounds,[$key,#240]
	and	$len,$len,#-16
	vld1.8	{$ivec},[$ivp]
	vld1.8	{$dat},[$inp],$step
	vld1.32	{q8-q9},[$key]		// load key schedule...
	sub	$rounds,$rounds,#6
	add	$key_,$key,x5,lsl#4	// pointer to last 7 round keys
	sub	$rounds,$rounds,#2
	vld1.32	{q10-q11},[$key_],#32
	vld1.32	{q12-q13},[$key_],#32
	vld1.32	{q14-q15},[$key_],#32
	vld1.32	{$rndlast},[$key_]
	add	$key_,$key,#32
	mov	$cnt,$rounds
	b.eq	.Lcbc_dec
	cmp	$rounds,#2
	veor	$dat,$dat,$ivec
	veor	$rndzero_n_last,q8,$rndlast
	b.eq	.Lcbc_enc128
	vld1.32	{$in0-$in1},[$key_]
	add	$key_,$key,#16
	add	$key4,$key,#16*4
	add	$key5,$key,#16*5
	aese	$dat,q8
	aesmc	$dat,$dat
	add	$key6,$key,#16*6
	add	$key7,$key,#16*7
	b	.Lenter_cbc_enc

As you can see, these use ARM's AES instructions (including AESE), the ARMv8 counterpart to AES-NI. There is no "talking to the kernel" other than the kernel swapping the process in and out. Perhaps you are familiar with cryptodev and so believe everything needs to go through that - but it doesn't; look at openssl itself and what happens when -evp is used. The above instructions are where openssl ends up when it detects the ARMv8 architecture and then calls ARM CPU instructions directly.

The difference isn't system load, it's whether you're actually measuring the time spent doing crypto or not.

I'm not talking about system load - I'm talking about the necessary overhead that the OS incurs to handle multiple processes from running multiple applications. You don't seem to understand that there is overhead - not only are you measuring with other things running but your tests were also performed on a VM. Hardware resources are fixed and things like VMs are really just software implementations to mimic being able to handle multiple images - but you can't multiply the CPU, the bus, the network, the RAM, etc… Like it or not, your measurements are being affected by other things running on your system.

If you were not using cryptodev then elapsed would be telling you what you think it is-

Again, on architectures where openssl supports the native AES-NI instructions, -evp does not involve cryptodev. You can dig up what happens on your architecture: https://github.com/openssl/openssl/tree/master/crypto/aes/asm

Perhaps what is confusing you in all this is that there are two implementations, cryptodev and openssl. They are different, and likely do not support the same architectures.

To put this a different way, -elapsed is usually a less accurate measure, but in the cryptodev case it's the only way to get a measurement that's anywhere close to reality.

In your case, "reality" is the specific system as configured that is being measured. That's what I said last time - but that's different from the maximum top performance which is what I'm trying to gauge.

Unloading the cryptodev aesni module and dropping -elapsed is a better solution (unless you're specifically trying to measure the performance with the aesni module.)

Yes getting closer! That's what we're trying to measure - the top performance possible for a specific hardware. Even if that's not going to happen because say other applications need to run (as is the case in many deployed pfsense installations that run multiple functions).

You are looking at bogus numbers because the time used in the calculation simply does not reflect the time the cpu spends doing crypto (because it's being done in kernel space rather than user space).

Again, I believe you believe cryptodev is involved. And on certain architectures, openssl uses the native AES instruction set. On Intel, those are AESENC, AESENCLAST, etc… - there are 7 of them, but on ARM they're AESE, etc... Completely different CPUs, completely different instruction sets. Really nothing to do with the kernel - the kernel is there to provide operating system functionality, but on supported CPUs there's no "encryption services".

The other thing you seem to not understand is that openssl uses the AES-NI instructions without using cryptodev.

Wrong. Look at the openssl source code. On supported architectures, the AES-NI instructions are invoked directly.

VAMike

@aesguy:

You're measuring the time the application spends talking to the kernel to ask for encryption services

OpenSSL isn't "talking to the kernel" per se - but rather the kernel gives the process running openssl small slices of time with the CPU and memory etc and feeds them openssl's instructions.

No, you're quite simply wrong. openssl on pfsense opens /dev/crypto and offloads the AES CBC routines if a kernel module supporting AES CBC is loaded. That's why the performance changes so dramatically when you load and unload that module. That's why the performance is so bad on small blocks, it has to send each block up through the kernel and back. I even walked through step by step how to demonstrate that.
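
One way to put a rough number on that per-block cost is a two-point fit of a simple linear model, time per call = fixed overhead + per-byte cost × block size, using the 16-byte and 8192-byte "-elapsed" status lines from the run with the aesni module loaded. The linear model and the helper name are assumptions made for illustration, and attributing the fixed part to the kernel round trip is the explanation given above, not something the fit proves.

```python
# Two-point fit of a linear cost model: time_per_call = overhead + per_byte * size.
# Assumption: the model itself is only an illustration; the inputs are the
# 16-byte and 8192-byte "-elapsed" status lines with the aesni module loaded.

def time_per_call(count, seconds):
    """Average wall-clock seconds per encryption call."""
    return seconds / count

small_size, small_t = 16, time_per_call(1392190, 3.03)
large_size, large_t = 8192, time_per_call(314815, 3.00)

per_byte = (large_t - small_t) / (large_size - small_size)  # seconds per byte
overhead = small_t - per_byte * small_size                  # fixed seconds per call

print(f"fixed cost per call: {overhead * 1e6:.2f} microseconds")
print(f"streaming rate:      {1 / per_byte / 1e9:.2f} GB/s")
```

On those two lines the fit gives roughly 2 microseconds of fixed cost per call and a streaming rate of about 1.1 GB/s. The intermediate block sizes do not sit neatly on that line, so treat the figures as order-of-magnitude only.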

And more importantly, openssl -evp on many architectures supports the native instructions for the specific architecture in use.

Yes, except that the implementation on pfsense will ignore the built-in routines in the presence of a /dev/crypto that reports AES capability. "openssl engine -t -c" will show whether openssl has detected such.

If you were not using cryptodev then elapsed would be telling you what you think it is-

Again, on architectures where openssl supports the native AES-NI instructions, -evp does not involve cryptodev.

If that were true, then unloading the module would have no effect, right? But it does. "truss openssl speed -evp aes-128-cbc |& grep /dev/crypto". You can also take a look at the number of ioctls with and without the cryptodev stuff to see openssl passing the blocks off to the kernel one by one. (grep ioctl instead of grep /dev/crypto and be prepared for a lot of output) https://github.com/pfsense/FreeBSD-src/blob/devel/crypto/openssl/crypto/engine/eng_cryptodev.c is the code implementing cryptodev support, including support for evp mode (look for the "EVP" functions).

aesguy

Unfortunately we've strayed far from the overall goal - to roughly measure the top performance possible of various AES-NI implementations. You are missing the point that I want to roughly measure this. I don't care about -elapsed simply because it adds in various other factors which distort the result or do not give a good idea of the maximum performance possible. For example:

openssl on pfsense opens /dev/crypto and offloads the AES CBC routines if a kernel module supporting AES CBC is loaded. That's why the performance changes so dramatically when you load and unload that module. That's why the performance is so bad on small blocks, it has to send each block up through the kernel and back.

I'm not interested in various factors due to a kernel implementation or design of cryptodev. The fact that data is being copied back and forth from userspace to kernel affects performance in non-trivial amounts. Who says everything has to be the cryptodev-way? And in that respect, openssl without "-elapsed" offers a much better indication of maximum possible performance for a particular piece of hardware.

aesguy

VAMike,

You've provided several AES-128-CBC results but any chance you can provide the results for AES-256-CBC (along with box info)?

openssl speed -evp aes-256-cbc

aesguy

RMB, thanks for the info, but unfortunately that was run with "-elapsed". Can you rerun without "-elapsed"?

openssl speed -evp aes-256-cbc

aesguy

Updated table for "openssl speed -evp aes-256-cbc":

8192BYTES	BOX	CPU	USERNAME	LINK
170926276.61k	unknown (China)	gen 5 i5	Koenig	
150749577.22k	Microserver Gen 8	ESXi 6.0	biggsy	
91090845.70k	Zotac ZBOX ID92	Core i5 4570T	highwire	
48454172.67k	SuperMicro Board: X11SBA-LN4F	Intel N3700	Engineer	
48351936.51k	SuperMicro 2758	Intel(R) Atom(TM) CPU C2758 @ 2.40GHz 8 CPUs	AR15USR	
42008576.00k	Gigabyte GA-N3150N-D3V board	Celeron N3150 with AES-NI		https://forum.pfsense.org/index.php?topic=108119.0
32321306.62k	SuperMicro 2758	Intel(R) Atom(TM) CPU C2758 @ 2.40GHz 8 CPUs	AR15USR	
32267479.72k	Supermicro	Intel N3700	Engineer	
29080158.21k	hp microserver gen 8	Xeon 1265Lv2	iorx	
27986842.97k	Gigabyte GA-N3150N-D3V	Celeron N3150 with AES-NI		https://forum.pfsense.org/index.php?topic=105114.msg601520#msg601520
24435715.51k	unknown (China)	gen 5 i5	Koenig	
24345837.57k	Lanner FW-7525D	Quad-core Atom C2558 @ 2.40GHz	RMB	
24332468.22k	Netgate SG-4860  	Intel(R) Atom(TM) CPU C2558 @ 2.40GHz 4 CPUs	bytesizedalex	
19462619.14k	SuperMicro 2758	Intel(R) Atom(TM) CPU C2758 @ 2.40GHz 8 CPUs	AR15USR	
18390712.32k	AM1	Athlon 5370	W4RH34D	
14241549.52k	pfSense SG-2440	Dual-core Atom C2358 @ 1.74GHz	RMB	
7123763.20k	Raspberry Pi 3	ARMv7l		
405686.95k	Intel i7-4510U + 2x Intel 82574 + 2x Intel i350 Mini-ITX Build			https://forum.pfsense.org/index.php?topic=115627.msg646395#msg646395
230708.57k	ci323 nano u	Celeron N3150 with AES-NI w/ -engine cryptodev		https://forum.pfsense.org/index.php?topic=115673.msg656602#msg656602
217617.75k	RCC-VE 2440	Intel Atom C2358		https://forum.pfsense.org/index.php?topic=91974.0
124788.74k	ALIX.APU2B4/APU2C4	1 GHz Quad Core AMD GX-412TC		http://wiki.ipfire.org/en/hardware/pcengines/apu2b4
34204.33k	ALIX.APU1C/APU1D	1 GHz Dual Core AMD G-T40E		http://wiki.ipfire.org/en/hardware/pcengines/apu1c

gjaltemba

Platform pfSense ESXI 6.0 VM
Version 2.3.2-RELEASE-p1 (amd64)
built on Tue Sep 27 12:13:07 CDT 2016
FreeBSD 10.3-RELEASE-p9

CPU Type Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
2 CPUs: 1 package(s) x 2 core(s)

Hardware crypto AES-CBC,AES-XTS,AES-GCM,AES-ICM

Shell Output - openssl speed -evp aes-256-cbc

Doing aes-256-cbc for 3s on 16 size blocks: 1019263 aes-256-cbc's in 0.31s
Doing aes-256-cbc for 3s on 64 size blocks: 959341 aes-256-cbc's in 0.29s
Doing aes-256-cbc for 3s on 256 size blocks: 779985 aes-256-cbc's in 0.22s
Doing aes-256-cbc for 3s on 1024 size blocks: 437868 aes-256-cbc's in 0.18s
Doing aes-256-cbc for 3s on 8192 size blocks: 88484 aes-256-cbc's in 0.01s
OpenSSL 1.0.1s-freebsd 1 Mar 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-256-cbc 52186.27k 212403.28k 912805.30k 2495314.54k 92782198.78k

VAMike

@aesguy:

Unfortunately we've strayed far from the overall goal - to roughly measure top performance possible of various AES-NI implementations. You are missing the point that I want to roughly measure this.

You're not roughly measuring this, you're measuring the wrong thing (basically you're measuring how many syscalls/second you can execute, which is completely unrelated to how fast you can process crypto). The hardware in question IS NOT CAPABLE OF THE SPEEDS YOU ARE DISCUSSING. Using cryptodev without -elapsed is simply wrong. If you take cryptodev out of the equation then you can drop -elapsed and get meaningful results.
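As a rough sanity check on this point, the figure openssl prints can be recomputed from the block count and the time it divides by. The sketch below is not from any post in this thread. It uses the 8192-byte line from gjaltemba's output above; the function name is made up for illustration, and the 3-second wall-clock duration is an assumption taken from the "for 3s" in that output.

```python
def throughput_k(blocks, block_size, seconds):
    """Throughput in 1000s of bytes per second, the unit openssl speed prints."""
    return blocks * block_size / seconds / 1000.0

# Values copied from the 8192-byte line of gjaltemba's output.
blocks = 88484
block_size = 8192
reported_k = 92782198.78  # figure printed without -elapsed

# User CPU time that the reported figure implies (the log shows it rounded to 0.01s).
implied_user_seconds = blocks * block_size / (reported_k * 1000.0)

# Same block count divided by the nominal 3s of wall-clock time instead
# (assumption: the run really lasted about 3 seconds).
wall_clock_k = throughput_k(blocks, block_size, 3.0)

print(f"implied user CPU time: {implied_user_seconds:.7f} s")
print(f"wall-clock throughput: {wall_clock_k:.2f}k")
print(f"inflation factor:      {reported_k / wall_clock_k:.0f}x")
```

Under that assumption the same run works out to roughly 240000k over wall-clock time, which is in the same range as the -elapsed result Engineer posted earlier, rather than the 92782198.78k printed when dividing by a few milliseconds of user CPU time.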

You should really run this by an openssl developer if you won't believe me. When they stop laughing at the idea of a raspberry pi getting 7GByte/s AES 256 CBC (the hardware is capable of something around 40Mbyte/s) they'll tell you what I explained above.

You're clearly not going to actually consider this, but I hope whoever else reads this and sees the fantastical speed test results understands that they're bogus.

Side note: anyone who's getting these crazy results (the 0.whatever seconds) would be well served by turning off aesni.ko in their config unless they're primarily interested in ipsec performance–it's actually slowing down openvpn and anything else that uses openssl. Pippin did a nice writeup a couple of months ago walking through it: https://forum.pfsense.org/index.php?topic=115627.msg646775#msg646775
What pfsense should really be doing is making aesni.ko and cryptodev.ko two separate items. For good kernel ipsec performance you want aesni.ko (that will implement AES-NI in the kernel) without cryptodev.ko (which makes openssl stop using its internal routines in favor of the less efficient kernel syscall implementation). The only case where cryptodev actually makes sense is if you're using off-cpu acceleration like the old hifn or padlock stuff or quickassist. (But in that case you'll still want -elapsed to measure real throughput--the rule is, always use -elapsed with /dev/crypto.)
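
The rule of thumb above can be written down as a small helper. This is only a sketch: the function names are hypothetical, and it assumes that the text printed by "openssl engine -t -c" mentions "cryptodev" when the engine has been detected, which should be checked on the actual box.

```python
def cryptodev_detected(engine_listing):
    """Return True if the text printed by `openssl engine -t -c` mentions cryptodev.

    Assumption: a detected /dev/crypto engine shows up under the name "cryptodev".
    """
    return "cryptodev" in engine_listing.lower()


def speed_command(cipher, engine_listing):
    """Build the openssl speed command line, adding -elapsed when cryptodev is in use."""
    cmd = ["openssl", "speed", "-evp", cipher]
    if cryptodev_detected(engine_listing):
        # Work handed to the kernel is not counted as user CPU time,
        # so only wall-clock time gives a meaningful figure here.
        cmd.append("-elapsed")
    return cmd
```

Pass it the captured output of "openssl engine -t -c" along with the cipher name, for example aes-256-cbc, and run the resulting command.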

aesguy

When they stop laughing at the idea of a raspberry pi getting 7GByte/s AES 256 CBC (the hardware is capable of something around 40Mbyte/s) they'll tell you what I explained above.

Who ever said anything about 7GB/s?! We're comparing the relative performance of different AES-NI hardware implementations.

VAMike

@aesguy:

Who ever said anything about 7GB/s?! We're comparing the relative performance of different AES-NI hardware implementations.

No, you aren't. You're comparing their context switching rates, which has nothing to do with their crypto processing rates. You can't rationalize this into something positive. And the icing on the cake is that when people post the right numbers you turn them down because they don't fit your misconceptions.
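
A toy model may make this easier to see. It is a deliberate simplification, not a description of how openssl or the kernel actually account for time: each call is assumed to cost a fixed slice of user time (the syscall overhead) plus a fixed slice of kernel time (the crypto itself), and all the per-call timings below are invented numbers.

```python
def speed_figures(block_size, user_s_per_call, kernel_s_per_call, wall_s=3.0):
    """Return (figure without -elapsed, figure with -elapsed) in bytes per second.

    Toy model: every call spends user_s_per_call in user space and
    kernel_s_per_call in the kernel, and the test runs for wall_s seconds.
    """
    calls = int(wall_s / (user_s_per_call + kernel_s_per_call))
    user_time = calls * user_s_per_call            # only this part is counted as user CPU time
    without_elapsed = calls * block_size / user_time
    with_elapsed = calls * block_size / wall_s
    return without_elapsed, with_elapsed


# Hypothetical boxes: identical syscall overhead, but the second one
# does the kernel-side crypto twice as fast.
slow_box = speed_figures(8192, user_s_per_call=1e-7, kernel_s_per_call=40e-6)
fast_box = speed_figures(8192, user_s_per_call=1e-7, kernel_s_per_call=20e-6)

print("without -elapsed:", slow_box[0], fast_box[0])
print("with -elapsed:   ", slow_box[1], fast_box[1])
```

In this model the figure without -elapsed reduces to block size divided by per-call user time, so the two boxes score essentially the same even though one is twice as fast at the crypto. Only the wall-clock figure separates them.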

Edit to add: actually, it's worse than that–given two cpus that are otherwise equal, this methodology will actually penalize the one with the more efficient crypto implementation (because it will spend relatively less time doing crypto in kernel space where the time isn't counted and more time in user space doing context switches, which are the only time counted).

aesguy

VAMike, the keyword is "relative" - and in this context refers to comparing results from different hardware.

VAMike

@aesguy:

VAMike, the keyword is "relative" - and in this context refers to comparing results from different hardware.

Again, the words put together don't make any sense. Why would you reject solid data in favor of bogus data? It's not like it's any harder to gather. If you collected the real numbers, then you'd have an actual "relative" comparison rather than an "irrelevant" comparison. Is it really that difficult to just admit you were wrong and move on?

aesguy

Again, the words put together don't make any sense.

Of course they don't make any sense to you - because you're either not reading or not trying to understand what I'm saying.