AES-NI performance



  • I am looking for a solution that will mostly be used for AES encryption and decryption.  A CPU with AES-NI will most likely offer the best value.  Routing is a requirement but not a high priority.  After looking around, I am under the impression that the Raspberry Pi 3 not only offers good value but, more surprisingly, actually offers the best AES performance.  I say this in comparison to the results I've seen posted here by various pfSense users.

    To keep things simple, I've settled on one single value: the 8192-byte column of openssl's AES-256-CBC speed test.  The following are my results:

    openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 551008 aes-256-cbc's in 0.20s
    Doing aes-256-cbc for 3s on 64 size blocks: 398438 aes-256-cbc's in 0.23s
    Doing aes-256-cbc for 3s on 256 size blocks: 193847 aes-256-cbc's in 0.10s
    Doing aes-256-cbc for 3s on 1024 size blocks: 64090 aes-256-cbc's in 0.04s
    Doing aes-256-cbc for 3s on 8192 size blocks: 8696 aes-256-cbc's in 0.01s
    OpenSSL 1.0.2c 12 Jun 2015
    built on: reproducible build, date unspecified
    options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr) 
    compiler: gcc -I. -I.. -I../include  -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -march=armv7-a -Wa,--noexecstack -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc      44080.64k   110869.70k   496248.32k  1640704.00k  7123763.20k
    
    

    That 7123763.20k value is the one I'm comparing against other results, and it significantly outperforms them.

    I am interested to see whether anyone can come close.  Even if a result does not exceed the Pi 3 but is at least in the same ballpark, the routing capabilities (and 2nd NIC) of a pfSense solution may make for a better overall solution.  Please use "openssl speed -evp aes-256-cbc".
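
    For anyone comparing figures: the summary row is in 1000s of bytes per second, so the 8192-byte value above is roughly 7.1 GB/s.  A minimal Python sketch of that conversion follows.  The helper name parse_speed_row is made up for illustration, and it assumes the five-column row layout shown in this post.

```python
def parse_speed_row(row: str) -> dict:
    """Map block size in bytes to bytes/second for one `openssl speed` summary row."""
    sizes = (16, 64, 256, 1024, 8192)  # column order in the output above
    fields = row.split()
    # The last five fields are the throughput columns, each ending in "k"
    # (1000s of bytes per second).
    values = [float(f.rstrip("k")) * 1000 for f in fields[-5:]]
    return dict(zip(sizes, values))


row = "aes-256-cbc      44080.64k   110869.70k   496248.32k  1640704.00k  7123763.20k"
rates = parse_speed_row(row)
gb_per_s = rates[8192] / 1e9
print(f"8192-byte column: {gb_per_s:.2f} GB/s")
```

    Because the helper only reads the last five fields, it also accepts rows where the type is printed with a space, such as "aes-256 cbc".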



  • I'm using a ZBOX small form factor computer with a Core i5 4570T that supports AES-NI.  Here are my results:

    [2.3.2-RELEASE][root@pfSense.home]/root: openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 1090105 aes-256-cbc's in 0.29s
    Doing aes-256-cbc for 3s on 64 size blocks: 1870619 aes-256-cbc's in 0.34s
    Doing aes-256-cbc for 3s on 256 size blocks: 1516771 aes-256-cbc's in 0.23s
    Doing aes-256-cbc for 3s on 1024 size blocks: 865844 aes-256-cbc's in 0.20s
    Doing aes-256-cbc for 3s on 8192 size blocks: 173742 aes-256-cbc's in 0.02s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      60338.78k  356374.67k  1656718.40k  4364919.41k 91090845.70k



  • AM1 box

    Doing aes-256-cbc for 3s on 16 size blocks: 857911 aes-256-cbc's in 0.38s
    Doing aes-256-cbc for 3s on 64 size blocks: 832702 aes-256-cbc's in 0.33s
    Doing aes-256-cbc for 3s on 256 size blocks: 744018 aes-256-cbc's in 0.41s
    Doing aes-256-cbc for 3s on 1024 size blocks: 523430 aes-256-cbc's in 0.24s
    Doing aes-256-cbc for 3s on 8192 size blocks: 140310 aes-256-cbc's in 0.06s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(8x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      36604.20k  162416.54k  459999.66k  2213129.58k 18390712.32k



  • highwire - 91090845.70k is the fastest we've seen on this forum!

    That's a ZOTAC ZBox ID92 right?  Quite a bit more pricey than Raspberry Pi 3 but also better AES performance.



  • Also pretty good W4RH34D - 18390712.32k is about 2.5x Raspberry Pi 3.

    What's in your AM1 box - Athlon 5350?



  • @aesguy:

    Also pretty good W4RH34D - 18390712.32k is about 2.5x Raspberry Pi 3.

    What's in your AM1 box - Athlon 5350?

    5370



  • Lanner FW-7525D (Quad-core Atom C2558 @ 2.40GHz)
    Doing aes-256-cbc for 3s on 16 size blocks: 970824 aes-256-cbc's in 0.31s
    Doing aes-256-cbc for 3s on 64 size blocks: 921585 aes-256-cbc's in 0.27s
    Doing aes-256-cbc for 3s on 256 size blocks: 753715 aes-256-cbc's in 0.20s
    Doing aes-256-cbc for 3s on 1024 size blocks: 449350 aes-256-cbc's in 0.15s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92660 aes-256-cbc's in 0.02s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      42591.25k  160843.69k  719856.10k  2245898.56k 24345837.57k

    PfSense SG-2440 (Dual-core Atom C2358 @ 1.74GHz)
    Doing aes-256-cbc for 3s on 16 size blocks: 695323 aes-256-cbc's in 0.36s
    Doing aes-256-cbc for 3s on 64 size blocks: 676799 aes-256-cbc's in 0.27s
    Doing aes-256-cbc for 3s on 256 size blocks: 550378 aes-256-cbc's in 0.22s
    Doing aes-256-cbc for 3s on 1024 size blocks: 330729 aes-256-cbc's in 0.17s
    Doing aes-256-cbc for 3s on 8192 size blocks: 67846 aes-256-cbc's in 0.04s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      34474.84k  168916.56k  608168.62k  2167026.48k 14241549.52k

    Kind regards,
    Rene.



  • Hi!

    For fun or reference :). A Hyper-v hosted pfsense on a hp microserver gen 8 with a Xeon 1265Lv2.

    
    [2.3.2-RELEASE][n23]/root: openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 1084848 aes-256-cbc's in 0.45s
    Doing aes-256-cbc for 3s on 64 size blocks: 1345250 aes-256-cbc's in 0.24s
    Doing aes-256-cbc for 3s on 256 size blocks: 709374 aes-256-cbc's in 0.23s
    Doing aes-256-cbc for 3s on 1024 size blocks: 472042 aes-256-cbc's in 0.19s
    Doing aes-256-cbc for 3s on 8192 size blocks: 110932 aes-256-cbc's in 0.03s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc      38978.40k   355493.16k   774825.57k  2577978.71k 29080158.21k
    
    


  • This is from a connection over wifi, not sure if that makes a difference.

    On a SuperMicro 2758:

    openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 982926 aes-256-cbc's in 0.38s
    Doing aes-256-cbc for 3s on 64 size blocks: 921181 aes-256-cbc's in 0.27s
    Doing aes-256-cbc for 3s on 256 size blocks: 761431 aes-256-cbc's in 0.33s
    Doing aes-256-cbc for 3s on 1024 size blocks: 448646 aes-256-cbc's in 0.19s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92805 aes-256-cbc's in 0.04s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc      41938.18k   215608.99k   594061.21k  2450205.35k 19462619.14k
    
    

    With the aesni turned off in the advanced settings:

    openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 955806 aes-256-cbc's in 0.29s
    Doing aes-256-cbc for 3s on 64 size blocks: 909612 aes-256-cbc's in 0.26s
    Doing aes-256-cbc for 3s on 256 size blocks: 758911 aes-256-cbc's in 0.30s
    Doing aes-256-cbc for 3s on 1024 size blocks: 446740 aes-256-cbc's in 0.13s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92400 aes-256-cbc's in 0.04s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc      52905.15k   225804.29k   654420.94k  3659694.08k 19377684.48k
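
    One way to compare the two runs without relying on the summary row is to look at the raw operation counts, that is, the work completed during each nominal 3 s test.  The Python sketch below does that with the counts copied from the two outputs above; the variable names are illustrative only.

```python
sizes = (16, 64, 256, 1024, 8192)
aesni_on = (982926, 921181, 761431, 448646, 92805)    # first run
aesni_off = (955806, 909612, 758911, 446740, 92400)   # setting turned off

# Relative difference in operations completed, per block size
diffs = [abs(a - b) / a for a, b in zip(aesni_on, aesni_off)]
for size, d in zip(sizes, diffs):
    print(f"{size:>5} bytes: {d:.2%} difference in operations completed")
```

    Every difference is under 3%, which suggests the setting did not change what openssl was doing in this particular test.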
    
    


  • @aesguy:

    highwire - 91090845.70k is the fastest we've seen on this forum!

    That's a ZOTAC ZBox ID92 right?  Quite a bit more pricey than Raspberry Pi 3 but also better AES performance.

    Yes, an ID92.  Quite pricey (I bought mine on sale, but still).  I bought it for an HTPC but abandoned that plan and repurposed it.  It is very much overkill for this application (even running a VPN server) as my connection is only 100 Mbps/10 Mbps.



  • I was thinking of firing up the 6 core xeon but I just don't really care for epeen stuff anymore.  I mean if someone needs to see it I'll do it, no time for "just for grins" these days.



  • I have a Chinese "mini-computer" (gen 5 i5)

    I ran two tests and got widely varying results:

    Try one:
    Doing aes-256-cbc for 3s on 16 size blocks: 1704782 aes-256-cbc's in 0.28s
    Doing aes-256-cbc for 3s on 64 size blocks: 1762586 aes-256-cbc's in 0.31s
    Doing aes-256-cbc for 3s on 256 size blocks: 1417931 aes-256-cbc's in 0.32s
    Doing aes-256-cbc for 3s on 1024 size blocks: 811284 aes-256-cbc's in 0.13s
    Doing aes-256-cbc for 3s on 8192 size blocks: 163126 aes-256-cbc's in 0.05s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      96983.15k  360977.61k  1133238.12k  6646038.53k 24435715.51k

    Try two:
    Doing aes-256-cbc for 3s on 16 size blocks: 1727740 aes-256-cbc's in 0.41s
    Doing aes-256-cbc for 3s on 64 size blocks: 1742973 aes-256-cbc's in 0.38s
    Doing aes-256-cbc for 3s on 256 size blocks: 1414059 aes-256-cbc's in 0.29s
    Doing aes-256-cbc for 3s on 1024 size blocks: 815243 aes-256-cbc's in 0.13s
    Doing aes-256-cbc for 3s on 8192 size blocks: 163008 aes-256-cbc's in 0.01s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      68046.38k  291396.63k  1252321.22k  6285619.44k 170926276.61k
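
    A quick check of the two tries: the 8192-byte operation counts are nearly identical (163126 vs 163008), so the roughly 7x gap in the last column comes from the reported time, not from the work done.  The Python sketch below redoes the arithmetic.  The 1/128 s tick size is an inference from the numbers, not something the output states; with it, both printed figures are reproduced.

```python
BLOCK = 8192  # bytes per operation in the last column


def rate_k(count: int, seconds: float) -> float:
    """Throughput in 1000s of bytes per second, the unit openssl prints."""
    return count * BLOCK / seconds / 1000


TICK = 1 / 128  # assumed timer granularity, inferred from the figures

try_one = rate_k(163126, 7 * TICK)  # time printed as "0.05s"
try_two = rate_k(163008, 1 * TICK)  # time printed as "0.01s"
print(f"{try_one:.2f}k {try_two:.2f}k")

# Same counts divided by the nominal 3 s each test is stated to run
wall_one = rate_k(163126, 3.0)
wall_two = rate_k(163008, 3.0)
print(f"{wall_one:.2f}k {wall_two:.2f}k")
```

    Divided by the nominal 3 s instead, the two tries differ by well under 1%.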



  • Thanks all for the results, keep them coming!

    Here are the results so far:

    170926276.61k		gen 5 i5
    91090845.70k	Zotac ZBOX ID92	Core i5 4570T
    29080158.21k	hp microserver gen 8	Xeon 1265Lv2
    24435715.51k		gen 5 i5
    24345837.57k	Lanner FW-7525D	Quad-core Atom C2558 @ 2.40GHz
    19462619.14k		SuperMicro 2758
    18390712.32k	AM1	Athlon 5370
    14241549.52k	pfSense SG-2440	Dual-core Atom C2358 @ 1.74GHz
    7123763.20k	Raspberry Pi 3	ARMv7l
    


  • AR15USR,

    There doesn't seem any difference in your tests.  Can you try running without the "-evp" option?

    openssl speed aes-256-cbc
    


  • Koenig,

    Can you provide the make and model of your "gen 5 i5"?



  • Here's an updated list of results:

    170926276.61k		gen 5 i5	
    91090845.70k	Zotac ZBOX ID92	Core i5 4570T	
    42008576.00k	Gigabyte GA-N3150N-D3V board	Celeron N3150 with AES-NI	https://forum.pfsense.org/index.php?topic=108119.0
    29080158.21k	hp microserver gen 8	Xeon 1265Lv2	
    27986842.97k	Gigabyte GA-N3150N-D3V	Celeron N3150 with AES-NI	https://forum.pfsense.org/index.php?topic=105114.msg601520#msg601520
    24435715.51k		gen 5 i5	
    24345837.57k	Lanner FW-7525D	Quad-core Atom C2558 @ 2.40GHz	
    19462619.14k		SuperMicro 2758	
    18390712.32k	AM1	Athlon 5370	
    14241549.52k	pfSense SG-2440	Dual-core Atom C2358 @ 1.74GHz	
    7123763.20k	Raspberry Pi 3	ARMv7l	
    405686.95k	Intel i7-4510U + 2x Intel 82574 + 2x Intel i350 Mini-ITX Build		https://forum.pfsense.org/index.php?topic=115627.msg646395#msg646395
    230708.57k	ci323 nano u	Celeron N3150 with AES-NI w/ -engine cryptodev	https://forum.pfsense.org/index.php?topic=115673.msg656602#msg656602
    217617.75k	RCC-VE 2440	Intel Atom C2358	https://forum.pfsense.org/index.php?topic=91974.0
    124788.74k	ALIX.APU2B4/APU2C4	1 GHz Quad Core AMD GX-412TC	http://wiki.ipfire.org/en/hardware/pcengines/apu2b4
    34204.33k	ALIX.APU1C/APU1D	1 GHz Dual Core AMD G-T40E	http://wiki.ipfire.org/en/hardware/pcengines/apu1c
    


  • @aesguy:

    Koenig,

    Can you provide the make and model of your "gen 5 i5"?

    There's no brand or model on it…

    Something like this: https://www.aliexpress.com/item/Fanless-PC-Intel-NUC-Core-i7-5500u-i5-5257u-Iris-6100-Barebone-Mini-PC-Windows-2HDMI/32755490163.html?spm=2114.01010208.3.100.Dtd346&ws_ab_test=searchweb0_0,searchweb201602_2_10091_10090_10088_10089,searchweb201603_1&btsid=6d47dcd0-df75-47e8-84cf-86813f160f8e

    Some more results:

    aes-256-cbc      99810.65k  375805.41k  1454872.58k  4844784.55k 28507460.95k

    aes-256-cbc      62518.77k  350371.84k  1217122.52k  5055197.38k 34182738.74k

    aes-256-cbc      76404.78k  341786.43k  1224697.10k  4425564.16k 34284240.90k

    aes-256-cbc      91091.47k  242748.12k  1191453.72k  5068092.37k 85483061.25k

    aes-256-cbc    100148.30k  299186.69k  1330803.04k  6668591.10k 86076555.26k

    aes-256-cbc    105877.45k  377916.58k  1538361.48k  6694084.61k 57179897.86k

    aes-256-cbc      84355.12k  320069.81k  1420017.17k  6647087.10k 57598978.73k

    aes-256-cbc    106102.67k  260300.35k  1792681.83k  9638188.87k 34206646.27k

    All from the same machine.
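
    These eight figures cluster in a way that can be checked with a little arithmetic.  The Python sketch below works backwards from each 8192-byte figure to the CPU time it implies.  Two things are assumptions here, not data from this post: the operation count (about 163000, borrowed from the two full outputs posted earlier from the same machine) and a timer granularity of 1/128 s.

```python
BLOCK = 8192
TICK = 1 / 128          # assumed timer granularity (inferred, not stated in the thread)
ASSUMED_COUNT = 163000  # assumed: roughly the 8192-byte count from the earlier full outputs

reported_k = [
    28507460.95, 34182738.74, 34284240.90, 85483061.25,
    86076555.26, 57179897.86, 57598978.73, 34206646.27,
]

# CPU time implied by each figure, expressed in timer ticks
implied_ticks = [ASSUMED_COUNT * BLOCK / (k * 1000) / TICK for k in reported_k]
for k, t in zip(reported_k, implied_ticks):
    print(f"{k:>14.2f}k -> {t:.2f} ticks")
```

    Under those assumptions every figure lands close to a whole number of ticks (between 2 and 6), which would explain why the results jump between a few distinct levels rather than varying smoothly.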


  • Netgate

    14241549.52k pfSense SG-2440 Dual-core Atom C2358 @ 1.74GHz
    217617.75k RCC-VE 2440 Intel Atom C2358 https://forum.pfsense.org/index.php?topic=91974.0

    Obviously something off there.



  • First my system details -

    System: Netgate SG-4860
    Version: 2.3.2-RELEASE-p1 (amd64) built on Fri Sep 30 14:36:56 CDT 2016 FreeBSD 10.3-RELEASE-p9
    CPU Type: Intel(R) Atom(TM) CPU C2558 @ 2.40GHz 4 CPUs: 1 package(s) x 4 core(s)
    Hardware crypto: AES-CBC,AES-XTS,AES-GCM,AES-ICM

    Results (system pretty active so possibility for skewed results) -

    [2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 984814 aes-256-cbc's in 0.35s
    Doing aes-256-cbc for 3s on 64 size blocks: 920037 aes-256-cbc's in 0.30s
    Doing aes-256-cbc for 3s on 256 size blocks: 759776 aes-256-cbc's in 0.26s
    Doing aes-256-cbc for 3s on 1024 size blocks: 452100 aes-256-cbc's in 0.15s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92821 aes-256-cbc's in 0.03s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      44819.98k  193254.95k  754434.54k  3118823.75k 24332468.22k



  • @aesguy:

    AR15USR,

    There doesn't seem any difference in your tests.  Can you try running without the "-evp" option?

    openssl speed aes-256-cbc
    
    /root: openssl speed aes-256-cbc
    Doing aes-256 cbc for 3s on 16 size blocks: 5517180 aes-256 cbc's in 3.01s
    Doing aes-256 cbc for 3s on 64 size blocks: 1544753 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 256 size blocks: 399657 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 1024 size blocks: 258521 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 8192 size blocks: 32712 aes-256 cbc's in 2.99s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256 cbc      29348.53k    32954.73k    34104.06k    88241.83k    89558.79k
    
    

    For comparison:

    /root: openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 957210 aes-256-cbc's in 0.39s
    Doing aes-256-cbc for 3s on 64 size blocks: 893869 aes-256-cbc's in 0.24s
    Doing aes-256-cbc for 3s on 256 size blocks: 751299 aes-256-cbc's in 0.27s
    Doing aes-256-cbc for 3s on 1024 size blocks: 450002 aes-256-cbc's in 0.10s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92472 aes-256-cbc's in 0.02s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc      39207.32k   236212.09k   724075.46k  4537127.86k 32321306.62k
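
    The two runs above can be compared in two ways: by the printed 8192-byte figures, or by the number of operations each run completed.  The Python sketch below computes both ratios from the numbers in this post; it assumes each run lasted the nominal 3 s of wall-clock time, which the output itself does not confirm for the -evp run.

```python
# 8192-byte lines from the two runs above: operations completed and printed summary (in k)
plain = {"count": 32712, "printed_k": 89558.79}    # openssl speed aes-256-cbc
evp = {"count": 92472, "printed_k": 32321306.62}   # openssl speed -evp aes-256-cbc

printed_ratio = evp["printed_k"] / plain["printed_k"]
work_ratio = evp["count"] / plain["count"]  # operations finished per nominal 3 s run

print(f"ratio of printed figures:      {printed_ratio:.0f}x")
print(f"ratio of operations completed: {work_ratio:.2f}x")
```

    The printed figures differ by a factor of about 360, while the operation counts differ by a factor of about 2.8.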
    


  • I have no idea what it means or how good or bad the output is, as I do not understand this, but I thought I'd try it on my box :)

    Any good?

    [2.3.2-RELEASE][admin@pfSense]/root: openssl speed aes-256-cbc
    Doing aes-256 cbc for 3s on 16 size blocks: 12830479 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 64 size blocks: 3389641 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 256 size blocks: 858407 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 1024 size blocks: 217919 aes-256 cbc's in 3.03s
    Doing aes-256 cbc for 3s on 8192 size blocks: 27176 aes-256 cbc's in 3.02s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256 cbc      68429.22k    72312.34k    73250.73k    73616.18k    73633.34k
    
    
    [2.3.2-RELEASE][admin@pfSense]/root:  openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 77185949 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 64 size blocks: 20190084 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 256 size blocks: 5139740 aes-256-cbc's in 3.02s
    Doing aes-256-cbc for 3s on 1024 size blocks: 1286608 aes-256-cbc's in 3.02s
    Doing aes-256-cbc for 3s on 8192 size blocks: 160088 aes-256-cbc's in 3.00s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc     411658.39k   430721.79k   435191.22k   436886.75k   437146.97k
    

    tnx
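
    Both runs in this post report close to 3 s for every block size, so their summary rows can be compared directly.  A small Python sketch of the per-column speedup of the -evp run over the plain run, using only the figures above (variable names are illustrative):

```python
sizes = (16, 64, 256, 1024, 8192)
plain_k = (68429.22, 72312.34, 73250.73, 73616.18, 73633.34)      # openssl speed aes-256-cbc
evp_k = (411658.39, 430721.79, 435191.22, 436886.75, 437146.97)   # openssl speed -evp aes-256-cbc

speedups = [e / p for e, p in zip(evp_k, plain_k)]
for size, s in zip(sizes, speedups):
    print(f"{size:>5} bytes: {s:.2f}x")
```

    The -evp run comes out at roughly 6x for every block size, with the 8192-byte column at about 437 MB/s.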


  • Netgate

    ~~Means you don't have AES-NI or it is disabled or ?

    The first 3 secs indicates clock time. The second time interval indicates CPU time. Note that on the accelerated systems they are performing operations on more data in < 1/10 the CPU time.~~ Don't listen to that guy.



  • SuperMicro with Intel N3700.  Not bad for a 6W CPU (System pulls 11 Watts from the wall).

    $ openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 991459 aes-256-cbc's in 0.25s
    Doing aes-256-cbc for 3s on 64 size blocks: 971848 aes-256-cbc's in 0.26s
    Doing aes-256-cbc for 3s on 256 size blocks: 785303 aes-256-cbc's in 0.28s
    Doing aes-256-cbc for 3s on 1024 size blocks: 393543 aes-256-cbc's in 0.16s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92318 aes-256-cbc's in 0.02s
    OpenSSL 1.0.1l-freebsd 15 Jan 2015
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      63453.38k  241253.90k  714800.24k  2579123.40k 32267479.72k



  • @Koenig: thanks, I've labelled it as Unknown(China)

    @Derelict: I'm taking an indiscriminate method and keeping all data points provided.  It might be due to the OS version, random timing, etc.

    @bytesizedalex: thanks and added to the list

    @AR15USR: as I suspected, with -evp OpenSSL determines for itself whether AES-NI is present and uses it; it doesn't matter what you set in pfSense.

    @NEK4TE: can you provide what your box & CPU are?

    @Engineer: which Supermicro box is it?



  • Updated results list:

    170926276.61k	unknown (China)	gen 5 i5	
    91090845.70k	Zotac ZBOX ID92	Core i5 4570T	
    42008576.00k	Gigabyte GA-N3150N-D3V board	Celeron N3150 with AES-NI	https://forum.pfsense.org/index.php?topic=108119.0
    32321306.62k	SuperMicro 2758		
    32267479.72k	Supermicro	Intel N3700	
    29080158.21k	hp microserver gen 8	Xeon 1265Lv2	
    27986842.97k	Gigabyte GA-N3150N-D3V	Celeron N3150 with AES-NI	https://forum.pfsense.org/index.php?topic=105114.msg601520#msg601520
    24435715.51k	unknown (China)	gen 5 i5	
    24345837.57k	Lanner FW-7525D	Quad-core Atom C2558 @ 2.40GHz	
    24332468.22k	Netgate SG-4860  	Intel(R) Atom(TM) CPU C2558 @ 2.40GHz 4 CPUs	
    19462619.14k	SuperMicro 2758		
    18390712.32k	AM1	Athlon 5370	
    14241549.52k	pfSense SG-2440	Dual-core Atom C2358 @ 1.74GHz	
    7123763.20k	Raspberry Pi 3	ARMv7l	
    405686.95k	Intel i7-4510U + 2x Intel 82574 + 2x Intel i350 Mini-ITX Build		https://forum.pfsense.org/index.php?topic=115627.msg646395#msg646395
    230708.57k	ci323 nano u	Celeron N3150 with AES-NI w/ -engine cryptodev	https://forum.pfsense.org/index.php?topic=115673.msg656602#msg656602
    217617.75k	RCC-VE 2440	Intel Atom C2358	https://forum.pfsense.org/index.php?topic=91974.0
    124788.74k	ALIX.APU2B4/APU2C4	1 GHz Quad Core AMD GX-412TC	http://wiki.ipfire.org/en/hardware/pcengines/apu2b4
    34204.33k	ALIX.APU1C/APU1D	1 GHz Dual Core AMD G-T40E	http://wiki.ipfire.org/en/hardware/pcengines/apu1c
    


  • iorx,

    Interested to see you're running the same processor in a Microserver Gen 8 as I do.

    Mine is running ESXi 6.0 though and produces slightly different numbers:

    [2.3.2-RELEASE] /root: openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 1767436 aes-256-cbc's in 0.38s
    Doing aes-256-cbc for 3s on 64 size blocks: 1616969 aes-256-cbc's in 0.35s
    Doing aes-256-cbc for 3s on 256 size blocks: 1308617 aes-256-cbc's in 0.27s
    Doing aes-256-cbc for 3s on 1024 size blocks: 723750 aes-256-cbc's in 0.13s
    Doing aes-256-cbc for 3s on 8192 size blocks: 143766 aes-256-cbc's in 0.01s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc      75410.60k   294360.22k  1225164.62k  5580197.65k 150749577.22k
    
    



  • @aesguy:

    @Engineer: which Supermicro box is it?

    SuperMicro Board: X11SBA-LN4F with Intel N3700.

    Running 2.2.5 and whatever FreeBSD version comes with it but not sure if there have been improvements in the newer versions or not.

    Just re-ran the test with nobody using the Internet (wife and two kids on Facebook, snapchat, youtube, etc. really change the results) and got this….

    $ openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 951002 aes-256-cbc's in 0.28s
    Doing aes-256-cbc for 3s on 64 size blocks: 961593 aes-256-cbc's in 0.26s
    Doing aes-256-cbc for 3s on 256 size blocks: 770095 aes-256-cbc's in 0.23s
    Doing aes-256-cbc for 3s on 1024 size blocks: 454015 aes-256-cbc's in 0.14s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92419 aes-256-cbc's in 0.02s
    OpenSSL 1.0.1l-freebsd 15 Jan 2015
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      54101.45k  238708.18k  870154.24k  3306036.34k 48454172.67k

    Makes more sense compared to the N3150 in the chart now.



  • @aesguy, here's the stats on my board if you want them:
    Intel(R) Atom(TM) CPU C2758 @ 2.40GHz 8 CPUs

    /root: openssl speed -evp aes-256-cbc
    Doing aes-256-cbc for 3s on 16 size blocks: 944591 aes-256-cbc's in 0.33s
    Doing aes-256-cbc for 3s on 64 size blocks: 888807 aes-256-cbc's in 0.26s
    Doing aes-256-cbc for 3s on 256 size blocks: 743989 aes-256-cbc's in 0.23s
    Doing aes-256-cbc for 3s on 1024 size blocks: 445355 aes-256-cbc's in 0.11s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92224 aes-256-cbc's in 0.02s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx) 
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc      46060.06k   220639.60k   840656.26k  4169540.75k 48351936.51k
    


  • I'm amazed that nobody has pointed out yet that most of these results are COMPLETELY BOGUS. If you have an openssl speed test result based on a time of less than 3 seconds, your result is invalid. What's happening is that openssl by default bases its time on the CPU time registered to the openssl process rather than the elapsed time, because when using software encryption on a loaded system you may not get 100% of the CPU, and using the CPU time figure gives a better accounting of the work actually done. But when using the FreeBSD crypto device most of the work is done in kernel space rather than user space, so the CPU time measurement consists entirely of the time spent making system calls. BUT YOU DID NOT ACTUALLY GET THREE SECONDS OF COMPUTATION DONE IN 0.01 SECONDS!!!! If using the FreeBSD crypto device you MUST add -elapsed to the command line to get a better idea of the real performance. If you do not, you are basing your conclusions on a meaningless number.

    A simple sanity check shows that many (most?) of the results listed here suggest that the machines are performing crypto at a rate greater than their theoretical peak performance (based on the number of operations performed * clock rate of the machine). Any result that shows 170 GB/s of work performed by a commodity PC is OBVIOUSLY INCORRECT. A report of 48 GB/s on an Atom with a 25 GB/s memory implementation is OBVIOUSLY INCORRECT.

    It's been hard to get people to stop using the FreeBSD crypto interface because they really, really want these numbers to be true. But if you compare openssl performance with and without cryptodev on an AES-NI system USING THE REAL NUMBERS you'll find that cryptodev is slower than openssl's native AES-NI (it basically has to be, because they're doing the same crypto operations, but the kernel module has a penalty for going into and out of kernel space).

    The fastest real implementation of AES-NI that I'm aware of is with AES-GCM on the Skylake core, where you should see somewhere in the neighborhood of 6 GB/s per core depending on the clock speed. (Yes, a commodity Skylake desktop will completely stomp a Broadwell Xeon; can't wait to see the Skylake Xeons.) The GCM implementation on the later Intel cores is significantly faster than CBC at larger block sizes when PCLMULQDQ is available.
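
    To illustrate the distinction between CPU time and elapsed time (without settling whether the crypto device is actually involved in the results above), here is a minimal Python sketch.  It uses a sleep as a stand-in for work that happens outside the process's own user time; the helper name measure is made up for illustration.

```python
import os
import time


def measure(work, *args):
    """Return (user_cpu_seconds, elapsed_seconds) spent in work(*args)."""
    cpu_before = os.times().user
    wall_before = time.perf_counter()
    work(*args)
    return os.times().user - cpu_before, time.perf_counter() - wall_before


# Stand-in for work done outside this process's user time
# (for example, waiting on a kernel device): the process just sleeps.
user_cpu, elapsed = measure(time.sleep, 0.2)
print(f"user CPU: {user_cpu:.3f}s, elapsed: {elapsed:.3f}s")
```

    Any throughput figure divided by the user CPU value here would come out far larger than one divided by the elapsed value, even though the same amount of time passed on the wall clock.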



  • VAMike said:

    "But when using the FreeBSD crypto device most of the work is done in kernel space rather than user space, so the CPU time measurement consists entirely of the time spent making system calls."

    It does not appear that the crypto device is being used - OpenSSL invokes the appropriate CPU instructions directly.  For example, on ARMv8, the AESE instruction is invoked directly: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aesv8-armx.pl

    Secondly, we see evidence to support this - it matters not whether you set AES-NI in pfsense but rather does matter whether you invoke openssl with "-evp" or not.

    I am not convinced that your assumption about kernel vs userland is valid, and therefore I am not convinced that these numbers are as meaningless as you think.



  • Out of curiosity, I added the -elapsed option to the original speed test and the results fell dramatically.  I have never looked at or tried to understand the results, I was simply running the test and passing the supplied results on.

    $ openssl speed -evp aes-256-cbc -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-256-cbc for 3s on 16 size blocks: 1006598 aes-256-cbc's in 3.02s
    Doing aes-256-cbc for 3s on 64 size blocks: 965281 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 256 size blocks: 779226 aes-256-cbc's in 3.02s
    Doing aes-256-cbc for 3s on 1024 size blocks: 458171 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 8192 size blocks: 92570 aes-256-cbc's in 3.01s
    OpenSSL 1.0.1l-freebsd 15 Jan 2015
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      5326.91k    20592.66k    65978.50k  156389.03k  252121.25k

    I have run the test multiple times now and the results are very repeatable.  Without the -elapsed option, the numbers are all over the place but MUCH higher.  I'm not going to argue whether the other results are real or not as I don't know enough one way or the other.  Just passing on the results with and without the -elapsed command line option for others to evaluate.
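
    As a consistency check on the -elapsed output above, each printed figure can be turned back into the time it implies (bytes processed divided by bytes per second).  A short Python sketch using only the numbers from this post; the variable names are illustrative:

```python
# (block size in bytes, operations completed, printed summary in 1000s of bytes/s)
elapsed_run = [
    (16, 1006598, 5326.91),
    (64, 965281, 20592.66),
    (256, 779226, 65978.50),
    (1024, 458171, 156389.03),
    (8192, 92570, 252121.25),
]

# Time implied by each printed figure: bytes processed / bytes per second
implied_seconds = [size * count / (k * 1000) for size, count, k in elapsed_run]
for (size, _, _), secs in zip(elapsed_run, implied_seconds):
    print(f"{size:>5} bytes: {secs:.3f}s")
```

    All five columns imply a time between 3.00 s and 3.03 s, matching the times printed on the "Doing" lines, so with -elapsed the summary row and the nominal 3 s test length agree.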



  • @aesguy:

    It does not appear that the crypto device is being used - OpenSSL invokes the appropriate CPU instructions directly.  For example, on ARMv8, the AESE instruction is invoked directly: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aesv8-armx.pl

    Secondly, we see evidence to support this - it matters not whether you set AES-NI in pfsense but rather does matter whether you invoke openssl with "-evp" or not.

    [2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -evp aes-128-cbc
    Doing aes-128-cbc for 3s on 16 size blocks: 1439217 aes-128-cbc's in 0.37s
    Doing aes-128-cbc for 3s on 64 size blocks: 1282244 aes-128-cbc's in 0.30s
    Doing aes-128-cbc for 3s on 256 size blocks: 1185939 aes-128-cbc's in 0.26s
    Doing aes-128-cbc for 3s on 1024 size blocks: 773748 aes-128-cbc's in 0.20s
    Doing aes-128-cbc for 3s on 8192 size blocks: 272141 aes-128-cbc's in 0.11s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-128-cbc      62713.12k  276424.81k  1177601.49k  3900642.23k 20382894.37k

    so there are the bogus numbers; there's no way this hardware is doing 20GByte/s of crypto. Let's see what happens with -elapsed:

    [2.3.2-RELEASE][admin@pfSense.localdomain]/root:  openssl speed -elapsed -evp aes-128-cbc
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-cbc for 3s on 16 size blocks: 1392190 aes-128-cbc's in 3.03s
    Doing aes-128-cbc for 3s on 64 size blocks: 1415484 aes-128-cbc's in 3.01s
    Doing aes-128-cbc for 3s on 256 size blocks: 1560350 aes-128-cbc's in 3.02s
    Doing aes-128-cbc for 3s on 1024 size blocks: 1176285 aes-128-cbc's in 3.01s
    Doing aes-128-cbc for 3s on 8192 size blocks: 314815 aes-128-cbc's in 3.00s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-128-cbc      7348.47k    30118.56k  132117.70k  400462.41k  859654.83k

    Now those are believable numbers. A bit low, but this is in a VM. Also note just how tremendously bad the performance is with small block sizes–that's the overhead of the context switches. If your theory is that cryptodev isn't relevant, let's just unload it:

    [2.3.2-RELEASE][admin@pfSense.localdomain]/root: kldunload aesni
    [2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -evp aes-128-cbc
    Doing aes-128-cbc for 3s on 16 size blocks: 197582147 aes-128-cbc's in 2.82s
    Doing aes-128-cbc for 3s on 64 size blocks: 49253757 aes-128-cbc's in 2.72s
    Doing aes-128-cbc for 3s on 256 size blocks: 12996564 aes-128-cbc's in 2.82s
    Doing aes-128-cbc for 3s on 1024 size blocks: 3230639 aes-128-cbc's in 2.77s
    Doing aes-128-cbc for 3s on 8192 size blocks: 399946 aes-128-cbc's in 2.72s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-128-cbc    1120909.24k  1159444.76k  1179699.19k  1192806.52k  1205097.06k
    [2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -elapsed -evp aes-128-cbc
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-128-cbc for 3s on 16 size blocks: 172665690 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 64 size blocks: 49500772 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 256 size blocks: 9678881 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 1024 size blocks: 2480302 aes-128-cbc's in 3.01s
    Doing aes-128-cbc for 3s on 8192 size blocks: 344003 aes-128-cbc's in 3.06s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-128-cbc    920883.68k  1056016.47k  825931.18k  844410.76k  920186.96k

    Note that the wall clock and cpu clock results are much closer, and the numbers are actually plausible. The overhead of the context switches went away, and the performance is much, much better for small block sizes. You can also sanity check by ignoring the bandwidth summary and looking at the initial status output: with cryptodev it did about 1.4M small block operations in 3 seconds, and without cryptodev it did close to 200M small block operations in 3 seconds. There is no way in reality that 1.4M operations in 3s is better than 200M operations in 3s, unless you're measuring something wrong.
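    The sanity check above can be redone by hand from the "Doing ..." status lines. The following is a minimal Python sketch, not part of the original posts: `throughput_k` is a hypothetical helper name, and the inputs are the 16-byte rows copied from the aes-128-cbc runs in this post. It simply recomputes operations times block size divided by the reported seconds.

```python
def throughput_k(ops: int, block_bytes: int, seconds: float) -> float:
    """Return throughput in 1000s of bytes per second, the unit openssl speed prints."""
    return ops * block_bytes / seconds / 1000.0


# 16-byte rows copied from the aes-128-cbc runs quoted in this post.
cryptodev_cpu = throughput_k(1439217, 16, 0.37)        # aesni loaded, user CPU time
cryptodev_wall = throughput_k(1392190, 16, 3.03)       # aesni loaded, -elapsed
no_cryptodev_wall = throughput_k(172665690, 16, 3.00)  # aesni unloaded, -elapsed

print(f"aesni loaded, CPU time:   {cryptodev_cpu:12.2f}k")
print(f"aesni loaded, -elapsed:   {cryptodev_wall:12.2f}k")
print(f"aesni unloaded, -elapsed: {no_cryptodev_wall:12.2f}k")
```

    For the runs with very short reported times, the recomputed figure differs slightly from the printed summary, presumably because the seconds shown in the status lines are rounded to two decimals.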

    Another test–AES GCM isn't implemented in cryptodev so let's put the module back and see what happens:

    [2.3.2-RELEASE][admin@pfSense.localdomain]/root: kldload aesni
    [2.3.2-RELEASE][admin@pfSense.localdomain]/root: openssl speed -evp aes-128-gcm
    Doing aes-128-gcm for 3s on 16 size blocks: 96121725 aes-128-gcm's in 2.76s
    Doing aes-128-gcm for 3s on 64 size blocks: 39512692 aes-128-gcm's in 2.84s
    Doing aes-128-gcm for 3s on 256 size blocks: 14260288 aes-128-gcm's in 2.95s
    Doing aes-128-gcm for 3s on 1024 size blocks: 3360255 aes-128-gcm's in 2.74s
    Doing aes-128-gcm for 3s on 8192 size blocks: 485825 aes-128-gcm's in 2.49s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-128-gcm    557669.38k  891702.40k  1239472.46k  1254801.55k  1596941.80k

    Plausible results again, with the proper relationship between the non-cryptodev CBC and the GCM results. (Though much slower than it should be on this hardware. I don't know if that's an artifact of the VM or the fact that pfsense has an older version of openssl.) Something newer on the bare hardware:

    openssl speed -evp aes-128-gcm

    Doing aes-128-gcm for 3s on 16 size blocks: 116930558 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 64 size blocks: 66316891 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 256 size blocks: 32782942 aes-128-gcm's in 2.99s
    Doing aes-128-gcm for 3s on 1024 size blocks: 12712095 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 8192 size blocks: 2004498 aes-128-gcm's in 3.00s
    Doing aes-128-gcm for 3s on 16384 size blocks: 875464 aes-128-gcm's in 3.00s
    OpenSSL 1.1.0c  10 Nov 2016
    built on: reproducible build, date unspecified
    options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
    compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR=""/usr/lib/ssl"" -DENGINESDIR=""/usr/lib/x86_64-linux-gnu/engines-1.1""
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes  16384 bytes
    aes-128-gcm    623629.64k  1414760.34k  2806833.83k  4339061.76k  5473615.87k  4781200.73k

    and the CBC output on bare hardware with openssl 1.1:

    openssl speed -evp aes-128-cbc

    Doing aes-128-cbc for 3s on 16 size blocks: 170365448 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 64 size blocks: 61436331 aes-128-cbc's in 2.99s
    Doing aes-128-cbc for 3s on 256 size blocks: 15619487 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 1024 size blocks: 4102787 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 8192 size blocks: 511408 aes-128-cbc's in 3.00s
    Doing aes-128-cbc for 3s on 16384 size blocks: 254687 aes-128-cbc's in 2.99s
    OpenSSL 1.1.0c  10 Nov 2016
    built on: reproducible build, date unspecified
    options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
    compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR=""/usr/lib/ssl"" -DENGINESDIR=""/usr/lib/x86_64-linux-gnu/engines-1.1""
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes  16384 bytes
    aes-128-cbc    908615.72k  1315025.15k  1332862.89k  1400417.96k  1396484.78k  1395582.54k

    This isn't a matter of opinion, it's a simple mistake that's been getting propagated for some time, leading to some wildly inaccurate claims about crypto performance. If you ask the guys writing the openssl code whether you can get 20GByte/s from a single core on a commodity intel chip you'll get a very clear "no". (And intel wouldn't be trying to sell very expensive quick assist hardware if it had lower performance than a cheap desktop: http://www.intel.com/content/www/us/en/ethernet-products/gigabit-server-adapters/quickassist-adapter-for-servers.html)



  • Running the original test (without -elapsed) a few times gave a very wide range of results.

    Microserver Gen 8 with E3-1265L V2 @ 2.50GHz:

    [2.3.2-RELEASE][root@philter]/root: openssl speed -evp aes-256-cbc -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-256-cbc for 3s on 16 size blocks: 1751663 aes-256-cbc's in 3.01s
    Doing aes-256-cbc for 3s on 64 size blocks: 1609691 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 256 size blocks: 1285984 aes-256-cbc's in 3.01s
    Doing aes-256-cbc for 3s on 1024 size blocks: 722643 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 8192 size blocks: 143875 aes-256-cbc's in 3.01s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-256-cbc       9317.94k    34340.07k   109452.27k   246662.14k   391854.21k
    
    Edit after reading more carefully:

    [2.3.2-RELEASE][root@philter]/root: kldunload aesni
    [2.3.2-RELEASE][root@philter]/root: openssl speed -evp aes-256-cbc -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-256-cbc for 3s on 16 size blocks: 76690389 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 64 size blocks: 20091193 aes-256-cbc's in 3.01s
    Doing aes-256-cbc for 3s on 256 size blocks: 5107757 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 1024 size blocks: 1285013 aes-256-cbc's in 3.01s
    Doing aes-256-cbc for 3s on 8192 size blocks: 160927 aes-256-cbc's in 3.00s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc    409015.41k  427498.84k  435861.93k  437478.50k  439437.99k



  • Lanner FW-7525D (Quad-core Atom C2558 @ 2.40GHz)
    Shell Output - openssl speed -evp aes-256-cbc -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-256-cbc for 3s on 16 size blocks: 988744 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 64 size blocks: 926802 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 256 size blocks: 762164 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 1024 size blocks: 455059 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 8192 size blocks: 93341 aes-256-cbc's in 3.00s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      5273.30k    19771.78k    65037.99k  155326.81k  254883.16k

    Shell Output - openssl speed -evp aes-256-gcm -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-256-gcm for 3s on 16 size blocks: 20826334 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 64 size blocks: 8843173 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 256 size blocks: 2794049 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 1024 size blocks: 754329 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 8192 size blocks: 96056 aes-256-gcm's in 3.00s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-gcm    111073.78k  188654.36k  238425.51k  257477.63k  262296.92k

    PfSense SG-2440 (Dual-core Atom C2358 @ 1.74GHz)
    Shell Output - openssl speed -evp aes-256-cbc -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-256-cbc for 3s on 16 size blocks: 727986 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 64 size blocks: 680875 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 256 size blocks: 557737 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 1024 size blocks: 327133 aes-256-cbc's in 3.00s
    Doing aes-256-cbc for 3s on 8192 size blocks: 67983 aes-256-cbc's in 3.01s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-cbc      3882.59k    14525.33k    47593.56k  111661.40k  185156.73k

    Shell Output - openssl speed -evp aes-256-gcm -elapsed
    You have chosen to measure elapsed time instead of user CPU time.
    Doing aes-256-gcm for 3s on 16 size blocks: 14925214 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 64 size blocks: 6436982 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 256 size blocks: 2026331 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 1024 size blocks: 549702 aes-256-gcm's in 3.00s
    Doing aes-256-gcm for 3s on 8192 size blocks: 70004 aes-256-gcm's in 3.00s
    OpenSSL 1.0.1s-freebsd  1 Mar 2016
    built on: date not available
    options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
    compiler: clang
    The 'numbers' are in 1000s of bytes per second processed.
    type            16 bytes    64 bytes    256 bytes  1024 bytes  8192 bytes
    aes-256-gcm      79601.14k  137322.28k  172913.58k  187631.62k  191157.59k



  • VAMike,

    1. I am interested in top performance possible for given hardware.  "-elapsed" is not useful in that regard because it measures the typical performance on that box given everything in place.  In your case, not only are your tests including things like other applications and processes running (and the swapping in and out of all those processes millions of times per second) but additionally you are running in a VM where typically the operating system itself is given limited access to the underlying hardware resources!  Of course your elapsed time ("-elapsed") is going to show drastically slower speeds - because the overhead of the operating system swapping and context switching all those applications and processes AND other operating system instances are all coming into play!!  You are not only swapping out all the processes millions of times per second, but in your case, your operating system running in a VM itself is causing overhead!  Naturally, if your system is running other applications AND operating systems, then your "-elapsed" is going to show radically different results.

    So we're talking apples to oranges.  I am interested in the BEST performance that a piece of hardware CAN achieve - not the performance of a particular box as configured and loaded with all sorts of junk - and openssl without "-elapsed" displays that better than with "-elapsed".  If you care about maximizing AES performance, then dump all the processes and applications running in the background, tune the OS (for example, minimize context switching), build the application for the hardware - and now we're getting closer to achieving the maximum performance possible - and this is what openssl without "-elapsed" can give us an idea of today, before investing in a hardware platform.  Heck, if one wanted to, you could go further and dump the OS altogether and write an application that boots and runs natively on the hardware itself, but in many cases that would mean non-trivial costs for that last mile of performance - but it's doable.

    You seem to be more interested in performance of A GIVEN SYSTEM built for general purpose - which by the way is perfectly valid and what most pfsense users are after, but it's just not what I'm after in trying to get a good idea of the top performing hardware.  Indeed, as your tests suggest, you are seeing a difference of ~20-fold - a big difference and hence why AES-NI offers a big gain in performance - if you can harness it properly.

    2. Regarding unloading aesni, try instead invoking openssl with and without "-evp".  Like I showed you, the actual OpenSSL source (when invoked using -evp) invokes the actual AES-NI CPU instructions.  Then check the version of openssl you are using and your hardware platform against the source code to see whether the AES-NI instructions are being invoked directly.  Here is a link: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/


  • @aesguy:

    1. I am interested in top performance possible for given hardware.  "-elapsed" is not useful in that regard because it measures the typical performance on that box given everything in place.

    Well, that's not what you're measuring in your results. You're measuring the time the application spends talking to the kernel to ask for encryption services, and not counting at all the time spent doing encryption. If that's really what you want that's great, but it's a number with no utility whatsoever.

    In your case, not only are your tests including things like other applications and processes running (and the swapping in and out of all those processes millions of times per second) but additionally you are running in a VM where typically the operating system itself is given limited access to the underlying hardware resources!  Of course your elapsed time ("-elapsed") is going to show drastically slower speeds - because the overhead of the operating system swapping and context switching all those applications and processes AND other operating system instances are all coming into play!!  You are not only swapping out all the processes millions of times per second, but in your case, your operating system running in a VM itself is causing overhead!  Naturally, if your system is running other applications AND operating systems, then your "-elapsed" is going to show radically different results.

    You're simply confused here. The difference isn't system load, it's whether you're actually measuring the time spent doing crypto or not. If you were not using cryptodev then elapsed would be telling you what you think it is–as it is in the cases I showed where the aesni module is unloaded or in the GCM case. Those are valid uses of the defaults, and do show actual cpu time and are worth considering. To put this a different way, -elapsed is usually a less accurate measure, but in the cryptodev case it's the only way to get a measurement that's anywhere close to reality. Unloading the cryptodev aesni module and dropping -elapsed is a better solution (unless you're specifically trying to measure the performance with the aesni module.) Again, if you look at the diagnostic output and see something like "in 0.x seconds", that's bogus because it's not even close to the time spent. (And if you see that output you pretty much know it's a system with cryptodev loaded, it doesn't happen otherwise.) If you see something like "in 2.83s" then you're seeing the benefit of basing the calculation on cpu time, because that's an indication that openssl didn't get a full 3 seconds of cpu time and it's more accurate to base the bandwidth number on the amount of time it actually got. But that cpu time is going to be something pretty close to 3 seconds in any case where basing the calculation on cpu time is remotely valid.
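    The "in 0.x seconds" heuristic described above can be applied mechanically. The following is a rough Python sketch under stated assumptions: `suspicious_rows` is a hypothetical helper, the regular expression is written against the status-line format seen in this thread, and the 50% cutoff is an arbitrary choice for illustration rather than anything openssl defines.

```python
import re

# Matches status lines such as:
#   Doing aes-128-cbc for 3s on 16 size blocks: 1439217 aes-128-cbc's in 0.37s
LINE_RE = re.compile(
    r"Doing (?P<cipher>\S+) for (?P<target>\d+)s on (?P<size>\d+) size blocks: "
    r"(?P<ops>\d+) \S+ in (?P<secs>[\d.]+)s"
)


def suspicious_rows(output: str, min_fraction: float = 0.5):
    """Return (block size, reported seconds) for rows whose reported time
    is below min_fraction of the requested run time."""
    flagged = []
    for m in LINE_RE.finditer(output):
        target = float(m.group("target"))
        secs = float(m.group("secs"))
        if secs < min_fraction * target:
            flagged.append((int(m.group("size")), secs))
    return flagged


# Two rows from the run with aesni loaded, two from the run with it unloaded.
SAMPLE = """\
Doing aes-128-cbc for 3s on 16 size blocks: 1439217 aes-128-cbc's in 0.37s
Doing aes-128-cbc for 3s on 8192 size blocks: 272141 aes-128-cbc's in 0.11s
Doing aes-128-cbc for 3s on 16 size blocks: 197582147 aes-128-cbc's in 2.82s
Doing aes-128-cbc for 3s on 8192 size blocks: 399946 aes-128-cbc's in 2.72s
"""

for size, secs in suspicious_rows(SAMPLE):
    print(f"{size}-byte row reports only {secs}s of a 3s run")
```

    On the sample, only the two rows from the run with the aesni module loaded are flagged; the rows reporting 2.82s and 2.72s pass.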

    You seem to be more interested in performance of A GIVEN SYSTEM built for general purpose

    No, I'm interested in reality. The cryptodev numbers do not reflect reality. Again, look at the number of computations actually being performed, and try to understand what that means. You are looking at bogus numbers because the time used in the calculation simply does not reflect the time the cpu spends doing crypto (because it's being done in kernel space rather than user space). By ignoring the time the CPU spends doing the crypto you aren't getting a better understanding of the performance characteristics of the hardware, you're simply making an incorrect calculation.
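    The gap between user CPU time and elapsed time can be illustrated outside openssl. The sketch below is a Python demo built on assumptions, not openssl's code: `measure` and `kernel_heavy` are hypothetical names, and `os.urandom` merely stands in for any work the kernel performs on a process's behalf. It relies on `os.times()`, which on Unix-like systems reports user, system, and elapsed time separately.

```python
import os


def measure(work):
    """Run work() and return the user CPU, system CPU, and wall-clock seconds it took."""
    before = os.times()
    work()
    after = os.times()
    return {
        "user": after.user - before.user,
        "system": after.system - before.system,
        "elapsed": after.elapsed - before.elapsed,
    }


def kernel_heavy():
    # os.urandom asks the kernel for random bytes, so much of the cost is
    # expected to show up as system time rather than user time (an
    # assumption of this demo; the exact split depends on the platform).
    for _ in range(20):
        os.urandom(1 << 20)


timings = measure(kernel_heavy)
for name, seconds in timings.items():
    print(f"{name:8s} {seconds:.3f}s")
```

    If a throughput figure were computed by dividing the bytes processed by the "user" value alone, any time spent in the kernel would be left out of the denominator, which is the effect being described for the cryptodev runs.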

    ... which by the way is perfectly valid and what most pfsense users are after, but it's just not what I'm after in trying to get a good idea of the top performing hardware.  Indeed, as you suggest by your tests you are seeing a difference of ~20fold - a big difference and hence why AES-NI offers a big gain in performance - if you can harness it properly.

    The other thing you seem to not understand is that openssl uses the AES-NI instructions without using cryptodev. If you unload the aesni module on your hardware and run with and without -evp you'll see a significant difference, one that's real rather than one that's an interpretation error, and that difference is the use of the AES-NI instructions. Or focus entirely on the GCM results, which don't get screwed up by cryptodev.



  • You're measuring the time the application spends talking to the kernel to ask for encryption services

    OpenSSL isn't "talking to the kernel" per se - but rather the kernel gives the process running openssl small slices of time with the CPU and memory etc and feeds them openssl's instructions.  And more importantly, openssl -evp on many architectures supports the native instructions for the specific architecture in use.  On those supported architectures, it is NOT "asking for encryption services" - it knows the low-level CPU AES-NI instructions appropriate for that CPU and calls those.  This is the 3rd time I'll provide the link to the source code, these are the ARMv8 assembly instructions: https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aesv8-armx.pl

    Out of that, you'll see for example the AES CBC encrypt instructions:

    $code.=<<___;
    	subs	$len,$len,#16
    	mov	$step,#16
    	b.lo	.Lcbc_abort
    	cclr	$step,eq
    	cmp	$enc,#0			// en- or decrypting?
    	ldr	$rounds,[$key,#240]
    	and	$len,$len,#-16
    	vld1.8	{$ivec},[$ivp]
    	vld1.8	{$dat},[$inp],$step
    	vld1.32	{q8-q9},[$key]		// load key schedule...
    	sub	$rounds,$rounds,#6
    	add	$key_,$key,x5,lsl#4	// pointer to last 7 round keys
    	sub	$rounds,$rounds,#2
    	vld1.32	{q10-q11},[$key_],#32
    	vld1.32	{q12-q13},[$key_],#32
    	vld1.32	{q14-q15},[$key_],#32
    	vld1.32	{$rndlast},[$key_]
    	add	$key_,$key,#32
    	mov	$cnt,$rounds
    	b.eq	.Lcbc_dec
    	cmp	$rounds,#2
    	veor	$dat,$dat,$ivec
    	veor	$rndzero_n_last,q8,$rndlast
    	b.eq	.Lcbc_enc128
    	vld1.32	{$in0-$in1},[$key_]
    	add	$key_,$key,#16
    	add	$key4,$key,#16*4
    	add	$key5,$key,#16*5
    	aese	$dat,q8
    	aesmc	$dat,$dat
    	add	$key6,$key,#16*6
    	add	$key7,$key,#16*7
    	b	.Lenter_cbc_enc
    

    As you can see, this code uses ARM's equivalent of the AES-NI instructions (including AESE).  There is no "talking to the kernel" other than the kernel swapping the process in and out.  Perhaps you are familiar with cryptodev and so believe everything needs to go through that - but it doesn't; look at openssl itself and what happens when -evp is used.  The above instructions are where openssl ends up when it detects an ARMv8 architecture and then calls ARM CPU instructions directly.

    The difference isn't system load, it's whether you're actually measuring the time spent doing crypto or not.

    I'm not talking about system load - I'm talking about the necessary overhead that the OS incurs to handle multiple processes from running multiple applications.  You don't seem to understand that there is overhead - not only are you measuring with other things running but your tests were also performed on a VM.  Hardware resources are fixed and things like VMs are really just software implementations to mimic being able to handle multiple images - but you can't multiply the CPU, the bus, the network, the RAM, etc…  Like it or not, your measurements are being affected by other things running on your system.

    If you were not using cryptodev then elapsed would be telling you what you think it is-

    Again, on architectures where openssl supports the native AES-NI instructions, -evp does not involve cryptodev.  You can dig up what happens on your architecture: https://github.com/openssl/openssl/tree/master/crypto/aes/asm

    Perhaps what is confusing you in all this is that there are two implementations, cryptodev and openssl.  They are different, and likely do not support the same architectures.

    To put this a different way, -elapsed is usually a less accurate measure, but in the cryptodev case it's the only way to get a measurement that's anywhere close to reality.

    In your case, "reality" is the specific system as configured that is being measured.  That's what I said last time - but that's different from the maximum top performance which is what I'm trying to gauge.

    Unloading the cryptodev aesni module and dropping -elapsed is a better solution (unless you're specifically trying to measure the performance with the aesni module.)

    Yes, getting closer!  That's what we're trying to measure - the top performance possible for specific hardware.  Even if that's not going to happen because, say, other applications need to run (as is the case in many deployed pfsense installations that run multiple functions).

    You are looking at bogus numbers because the time used in the calculation simply does not reflect the time the cpu spends doing crypto (because it's being done in kernel space rather than user space).

    Again, I believe you believe cryptodev is involved.  And on certain architectures, openssl implements the native AES-NI instruction set.  On Intel, those are AESENC, AESENCLAST, etc… - there's 7 of them, but on ARM they're AESE, etc...  Completely different CPUs, completely different instruction sets.  Really nothing to do with the kernel - the kernel is there to provide operating system functionality, but on supported CPUs there's no "encryption services".

    The other thing you seem to not understand is that openssl uses the AES-NI instructions without using cryptodev.

    Wrong.  Look at the openssl source code.  On supported architectures, the AES-NI instructions are invoked directly.



  • @aesguy:

    You're measuring the time the application spends talking to the kernel to ask for encryption services

    OpenSSL isn't "talking to the kernel" per se - but rather the kernel gives the process running openssl small slices of time with the CPU and memory etc and feeds them openssl's instructions.

    No, you're quite simply wrong. openssl on pfsense opens /dev/crypto and offloads the AES CBC routines if a kernel module supporting AES CBC is loaded. That's why the performance changes so dramatically when you load and unload that module. That's why the performance is so bad on small blocks, it has to send each block up through the kernel and back. I even walked through step by step how to demonstrate that.
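    The per-call cost described here can be estimated from the -elapsed numbers posted earlier in the thread. The following is a crude Python sketch, not a measurement: `fit_overhead` is a hypothetical helper that fits time per call = fixed overhead + bytes / bandwidth through just two rows (16 and 8192 bytes, aesni loaded, -elapsed), and reading the fixed term as the kernel round trip is an interpretation, not something the output proves.

```python
def fit_overhead(size_a, ops_a, secs_a, size_b, ops_b, secs_b):
    """Fit time_per_call = overhead + size / bandwidth through two rows.

    Returns (overhead in seconds per call, bandwidth in bytes per second).
    """
    t_a = secs_a / ops_a  # seconds per call at the small block size
    t_b = secs_b / ops_b  # seconds per call at the large block size
    secs_per_byte = (t_b - t_a) / (size_b - size_a)
    overhead = t_a - size_a * secs_per_byte
    return overhead, 1.0 / secs_per_byte


# aes-128-cbc, aesni loaded, -elapsed: 16-byte and 8192-byte rows.
overhead, bandwidth = fit_overhead(16, 1392190, 3.03, 8192, 314815, 3.00)

# aes-128-cbc, aesni unloaded, -elapsed: seconds per 16-byte call.
direct_16 = 3.00 / 172665690

print(f"fixed cost per call with module loaded: {overhead * 1e6:.2f} us")
print(f"fitted large-block rate:                {bandwidth / 1e9:.2f} GB/s")
print(f"16-byte call with module unloaded:      {direct_16 * 1e9:.1f} ns")
```

    With these inputs the fit gives roughly 2 microseconds of fixed cost per call and a large-block rate near 1.1 GB/s, against under 20 nanoseconds per 16-byte call with the module unloaded. The intermediate rows do not fall exactly on the fitted line, so treat this as an order-of-magnitude estimate only.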

    And more importantly, openssl -evp on many architectures supports the native instructions for the specific architecture in use.

    Yes, except that the implementation on pfsense will ignore the built-in routines in the presence of a /dev/crypto that reports AES capability. "openssl engine -t -c" will show whether openssl has detected such.

    If you were not using cryptodev then elapsed would be telling you what you think it is-

    Again, on architectures where openssl supports the native AES-NI instructions, -evp does not involve cryptodev.

    If that were true, then unloading the module would have no effect, right? But it does. "truss openssl speed -evp aes-128-cbc |& grep /dev/crypto". You can also take a look at the number of ioctls with and without the cryptodev stuff to see openssl passing the blocks off to the kernel one by one. (grep ioctl instead of grep /dev/crypto and be prepared for a lot of output) https://github.com/pfsense/FreeBSD-src/blob/devel/crypto/openssl/crypto/engine/eng_cryptodev.c is the code implementing cryptodev support, including support for evp mode (look for the "EVP" functions).



  • Unfortunately we've strayed far from the overall goal - to roughly measure the top performance possible of various AES-NI implementations.  You are missing the point that I want to roughly measure this.  I don't care about -elapsed simply because it adds in various other factors which distort the result or do not give a good idea of the maximum performance possible.  For example:

    openssl on pfsense opens /dev/crypto and offloads the AES CBC routines if a kernel module supporting AES CBC is loaded. That's why the performance changes so dramatically when you load and unload that module. That's why the performance is so bad on small blocks, it has to send each block up through the kernel and back.

    I'm not interested in various factors due to a kernel implementation or the design of cryptodev.  The fact that data is being copied back and forth between userspace and the kernel affects performance by a non-trivial amount.  Who says everything has to be the cryptodev-way?  And in that respect, openssl without "-elapsed" offers a much better indication of maximum possible performance for a particular piece of hardware.



  • VAMike,

    You've provided several AES-128-CBC results but any chance you can provide the results for AES-256-CBC (along with box info)?

    openssl speed -evp aes-256-cbc