10GbE Tuning?



  • Can anyone who is currently running any 10GbE ports with pfSense provide some input as to what sort of tuning you've done and the throughput you've seen?

    Last Friday I installed a pair of Intel X520 ports in each of my firewalls and I can't manage to get more than 1.8-2.0Gbit/s across an SFP+ Direct Attach cable running between them.

    One box is running 2.1 and the other 2.1.1 (built on Mon Jan 27 04:16:45 EST 2014).  Both boxes have Intel Xeon E3-1240 V2 CPUs (3.4GHz).

    I've tried setting the following (a sketch of applying these from a shell follows the list):

    • hw.intr_storm_threshold=10000

    • kern.ipc.maxsockbuf=16777216

    • net.inet.tcp.recvbuf_inc=524288

    • net.inet.tcp.recvbuf_max=16777216

    • net.inet.tcp.sendbuf_inc=16384

    • net.inet.tcp.sendbuf_max=16777216
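
    A minimal sketch of applying and checking values like these from a pfSense/FreeBSD shell at runtime; the same name=value pairs can also be added under System > Advanced > System Tunables so they persist across reboots:

    # Apply at runtime (example values from the list above)
    sysctl kern.ipc.maxsockbuf=16777216
    sysctl net.inet.tcp.recvbuf_max=16777216
    sysctl net.inet.tcp.sendbuf_max=16777216

    # Verify what is actually in effect
    sysctl kern.ipc.maxsockbuf net.inet.tcp.recvbuf_max net.inet.tcp.sendbuf_max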

    EDIT 1: "pciconf -lc" reports that the adapters are running as PCI-Express 2.0 with a link of x8, so that's not it (Note: I could be reading this incorrectly; see the output below).

    ix0@pci0:1:0:0:	class=0x020000 card=0xffffffff chip=0x10fb8086 rev=0x01 hdr=0x00
        cap 01[40] = powerspec 3  supports D0 D3  current D0
        cap 05[50] = MSI supports 1 message, 64 bit, vector masks 
        cap 11[70] = MSI-X supports 64 messages in map 0x20 enabled
        cap 10[a0] = PCI-Express 2 endpoint max data 256(512) link x8(x8)
        cap 03[e0] = VPD
    

    EDIT 2: An iperf loopback returns a bit over 12 Gbit/s (basically maxing one CPU on sending with iperf and another on receiving), so the CPU can handle it.
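
    For anyone reproducing the loopback check, a minimal sketch with iperf2 (the 10-second duration is just an example):

    # Start a server in the background, then run a client against loopback
    iperf -s &
    iperf -c 127.0.0.1 -t 10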

    EDIT 3: As a side note on performance, my main FreeNAS 9.2.0 box, which uses the same settings as above (physical server, same Intel 82599 NICs), has no issues hitting ~8Gbit/s with iperf between it and my Exchange server (a VM with VMXNET3 on vSphere 5.5, on a vDS with Broadcom BCM57800 NICs as uplinks).  The two systems have a pair of Nexus 5548UP switches between them.



  • Having the same problem.

    I'm running 2 Dell R410s with dual-port X520 (SFP+) NICs, and iperf reports an average of 1.5Gbps throughput. pfSense is on 2.1.1 (as of March 17th).
    I've done different tuning (including https://calomel.org/freebsd_network_tuning.html) on various systems but can't seem to get above that limit.

    When iperf'ing against localhost I get an average of 7.5Gbps.

    Not quite sure what is causing such slow speeds (1.5Gbps). Does anyone have any ideas/suggestions?

    Cheers.



  • The newest builds of 2.1.1 (the ones where the newer drivers are back in) have gotten me to about double what I was seeing, when I use multiple iperf threads.
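
    For reference, a multi-stream run between two boxes looks something like this (the address and stream count are just placeholders):

    # On the receiving firewall
    iperf -s

    # On the sending firewall: 4 parallel TCP streams, 30 seconds, 1-second reports
    iperf -c 192.0.2.1 -P 4 -t 30 -i 1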



  • Likely: It's as good as it's going to get prior to 2.2.



  • @gonzopancho:

    Likely: It's as good as it's going to get prior to 2.2.

    I can live with that for now.  If FreeBSD 10 does for pfSense what FreeBSD 9 did for FreeNAS, I should be able to hit wire speed once 2.2 drops.



  • @gonzopancho:

    Likely: It's as good as it's going to get prior to 2.2.

    So, to confirm, does that mean you guys know the source of the poor throughput and that it will be addressed in 2.2 (either by a fix or due to the upgrade to FreeBSD 10)?

    Thanks.



  • FreeBSD doesn't go wirespeed on 10G NICs (without using large frames or tricks like netmap).
    Neither does Linux.

    The Intel 10G driver(s) are good, but not great.

    That all said, the situation should improve with 2.2.



  • @gonzopancho:

    FreeBSD doesn't go wirespeed on 10G NICs (without using large frames or tricks like netmap).
    Neither does Linux.

    The Intel 10G driver(s) are good, but not great.

    That all said, the situation should improve with 2.2.

    I was able to get ~8Gbit/s between two FreeNAS 9.x boxes without jumbo frames when using 4 threads.  That's pretty close to wire.



  • What will change in 2.2 that is expected to improve things for the 10G Intel cards? Would it be an upgrade to FreeBSD 10 or driver/tuning updates?

    For the current pfSense version (2.1.x) would Myricom 10-Gigabit Ethernet (Myri10GE) cards perform better (10Gbps speeds)?

    Thanks.



  • @Jason:

    I was able to get ~8Gbit/s between two FreeNAS 9.x boxes without jumbo frames when using 4 threads.  That's pretty close to wire.

    OK, Jason… FreeBSD won't forward at wirespeed on 10Gbps networks.

    Since the BSDRP guy can only manage to forward (no firewall, just fast forwarding) at a pinch over 1.8Mpps (and you were doing, by my best estimate, 5.5Mpps), I'm going to assert that we still have work to do.

    brunoc:  we're currently engaged in a 10G performance study, but yes, part of the solution will be tuning, and part of it will be the threaded pf in pfSense version 2.2.



  • @gonzopancho:

    @Jason:

    I was able to get ~8Gbit/s between two FreeNAS 9.x boxes without jumbo frames when using 4 threads.  That's pretty close to wire.

    OK, Jason… FreeBSD won't forward at wirespeed on 10Gbps networks.

    Since the BSDRP guy can only manage to forward (no firewall, just fast forwarding) at a pinch over 1.8Mpps (and you were doing, by my best estimate, 5.5Mpps), I'm going to assert that we still have work to do.

    brunoc:  we're currently engaged in a 10G performance study, but yes, part of the solution will be tuning, and part of it will be the threaded pf in pfSense version 2.2.

    One interesting thing of note is that at least one user here has had a lot of luck using pfSense on vSphere.  With virtualized NICs he seems to be getting better throughput than I am on bare-metal, even though I'm using faster CPUs, so I'm wondering how much of this is the Intel drivers.  The newest ones are better than the last, but they're still not exactly screaming along.

    I'll keep an eye on the 2.2 section of the forums.  Once it gets stable enough to run as the backup of a CARP pair (next to a 2.1.x box) maybe I'll upgrade one system at the office for testing.

    If there's any tuning that you want me to test out that can be done on 2.1.x, let me know.  I'd be glad to try a few things on my boxes.



  • @Jason:

    One interesting thing of note is that at least one user here has had a lot of luck using pfSense on vSphere.  With virtualized NICs he seems to be getting better throughput than I am on bare-metal, even though I'm using faster CPUs, so I'm wondering how much of this is the Intel drivers.  The newest ones are better than the last, but they're still not exactly screaming along.

    Where was that discussion about pfSense on ESXi providing more throughput than your similar bare-metal setup? I looked and can't find it. Thanks.


  • Netgate Administrator

    I think it was this thread. I remember this figure seeming surprisingly high at the time; it still does:
    https://forum.pfsense.org/index.php?topic=72142.msg395165#msg395165

    Steve



  • I can't imagine any real performance gain for pf when running under VMware.



  • I have pfSense 2.1.4 on a new box with two CPUs: E5-2667 @ 2.90GHz, all 12 cores enabled, but hyperthreading and VT disabled.
    All traffic goes over one Intel X520-SR2.
    With my simple test setup (iperf between two VMs, traffic goes through the whole datacenter, with the pfSense box in the middle), I got up to 3Gbit/s (perhaps I could get more with better VMware infrastructure) with a CPU load below 2.

    My /boot/loader.conf.local:

    kern.ipc.nmbclusters="262144"
    kern.ipc.nmbjumbop="262144"
    net.isr.bindthreads=0
    net.isr.maxthreads=1
    kern.random.sys.harvest.ethernet=0
    kern.random.sys.harvest.point_to_point=0
    kern.random.sys.harvest.interrupt=0
    net.isr.defaultqlimit=2048
    net.isr.maxqlimit=40960
    

    And my changes in the System Tunables:

    hw.intr_storm_threshold=10000
    kern.ipc.maxsockbuf=16777216
    net.inet.tcp.sendbuf_max=16777216
    net.inet.tcp.recvbuf_max=16777216   
    net.inet.ip.fastforwarding=1
    net.inet.tcp.sendbuf_inc=262144
    net.inet.tcp.recvbuf_inc=262144 
    net.route.netisr_maxqlen=2048
    net.inet6.ip6.redirect=0
    net.inet.ip.redirect=0
    net.inet.ip.intr_queue_maxlen=2048
    

    And make sure to switch off LRO and TSO on the ix interfaces. TSO is broken with IPv6: if it is enabled, only one packet is sent at a time and then the box waits for the ACK before it sends the next one…
    Some of the options I found in the FreeBSD wiki: https://wiki.freebsd.org/NetworkPerformanceTuning
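
    A minimal sketch of switching those off from the shell (the interface names are examples; pfSense also has checkboxes for disabling hardware TSO and LRO under System > Advanced > Networking):

    # Turn off TSO and LRO on each ix interface
    ifconfig ix0 -tso -lro
    ifconfig ix1 -tso -lro

    # Confirm the TSO/LRO flags are gone from the options line
    ifconfig ix0 | grep options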



  • My throughput completely sucks right now… I'm seeing 600Mbps (you read it right, not even 1 gig) when testing iperf from my desktop to my pfSense router.  I've applied the calomel tricks and tips re buffers etc. and I'm still seeing sucky perf, so I need to do some debugging for sure. I'm dreaming of the lofty heights of a 2 gig connection right now!

    BTW, this guy nails 9.x Gbps > https://forum.pfsense.org/index.php?topic=77144.msg435304#msg435304

    FYI I'm using an A1SRM-2758F board with an Intel X520, SFP+ optical cables etc. I'm still limited to 600Mbps on a gigabit Ethernet Cat6 wire to my quad-port i350 too.



  • Just an observation on the 9.22Gbps test result.

    1. The measurement is taken on the LAN interface, which is a bridge of 4x 10Gbps + 1x 1Gbps interfaces. It would be measuring the sum of all 5 interfaces.

    2. The test setup seems to be: connect 1 host to each of the 10Gbps ports and have these 4 hosts run iperf.

    3. Most reports see around 2Gbps on 10Gbps interfaces, so 4x 2Gbps is within reach of the result.

    4. If the 10Gbps ports are doing line rate, shouldn't the test be measuring 40Gbps instead of 9Gbps? Still, 9Gbps is impressive on older hardware.



  • Yes, the LAN reports the traffic on the bridge (mine is set up like this also), but I'd assumed he was reporting line rate on 1 port rather than (4 x 2G + 1 x 1G) speeds. You are right, though; without seeing his other ports there is ambiguity. I'd assumed, given he spent the time to post, that he had close to line rate out of 1 port, which theoretically should be possible, rather than close to line rate from 4+1… Good spot.


  • Netgate Administrator

    You must be hitting some limit. Are the NICs connecting at 10Gbps? Are you seeing errors on the interface? What does your CPU usage look like? Large interrupt load?

    Steve
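
    For anyone chasing this, a minimal sketch of checks along those lines from a shell (the interface name is an example):

    ifconfig ix0 | grep media   # negotiated link speed
    netstat -i                  # per-interface error counters (Ierrs/Oerrs)
    vmstat -i                   # interrupt rates per device
    top -SH                     # per-thread CPU usage, including interrupt threads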



  • @irj972:

    My throughput completely sucks right now… I'm seeing 600Mbps (you read it right, not even 1 gig) when testing iperf from my desktop to my pfSense router.  I've applied the calomel tricks and tips re buffers etc. and I'm still seeing sucky perf, so I need to do some debugging for sure. I'm dreaming of the lofty heights of a 2 gig connection right now!

    BTW, this guy nails 9.x Gbps > https://forum.pfsense.org/index.php?topic=77144.msg435304#msg435304

    FYI I'm using an A1SRM-2758F board with an Intel X520, SFP+ optical cables etc. I'm still limited to 600Mbps on a gigabit Ethernet Cat6 wire to my quad-port i350 too.

    pfSense 2.2 will have better multi-core, multi-stream performance. Your Atom CPU has poor single-thread performance, even though it should have decent aggregate throughput.

    I'm getting 980Mbps (~1.5Gbps with a bidirectional test) with iperf through pfSense NAT, all with 7.7% CPU load and no tweaking. The performance is entirely limited by my two test computers' integrated NICs.


  • Netgate Administrator

    It still has almost double the single-thread rating of, say, a D525, which can itself manage close to 600Mbps of throughput.  :-
    This test used the pfSense box as the endpoint, though, so they are not comparable.

    Steve



  • @stephenw10:

    It still has almost double the single-thread rating of, say, a D525, which can itself manage close to 600Mbps of throughput.  :-
    This test used the pfSense box as the endpoint, though, so they are not comparable.

    Steve

    Steve, did you get anywhere with this?

    I also just ran some iperf tests. I have an Atom D2550, and it's also maxing out at ~450-500Mbps when I do UDP from my pfSense box. I see the CPU staying right at 25-27% load during the tests. I'm thinking that this is getting limited by the single iperf thread on the Atom.

    Interestingly enough, I have a Lenovo T440 laptop with Win7; when I run the UDP test from that (Intel NIC) it also maxes out at 450-500Mbps.

    I'm not sure what to make of that. Maybe an issue with 2.0.x iperf?

    -Dmitri


  • Netgate Administrator

    Run 'top -SH' at the console to see how the usage breaks down across the cores.
    How are the NICs connected? If they're PCI you might hit a bottleneck there.
    Try running a test through pfSense instead of using it as an end-point.
    The previous user who got greater than 600Mbps through his Atom had to make some tweaks. I forget the details, but I think he disabled some PCI power-saving options in the BIOS.
    You could try enabling IP fast-forwarding if you're not using IPsec.

    Steve
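
    A minimal sketch of the last two checks (the "bge" pattern is just an example; substitute your NIC's driver name, and note that fastforwarding can also be added as a System Tunable so it persists):

    # Identify the NIC chip and how it is attached (PCI vs PCIe capabilities)
    pciconf -lvc | grep -A 8 '^bge'

    # Enable IP fast-forwarding at runtime (skip this if IPsec is in use)
    sysctl net.inet.ip.fastforwarding=1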



  • @stephenw10:

    Run 'top -SH' at the console to see how the usage breaks down across the cores.
    How are the NICs connected? If they're PCI you might hit a bottleneck there.
    Try running a test through pfSense instead of using it as an end-point.
    The previous user who got greater than 600Mbps through his Atom had to make some tweaks. I forget the details, but I think he disabled some PCI power-saving options in the BIOS.
    You could try enabling IP fast-forwarding if you're not using IPsec.

    Steve

    I have embedded Broadcom NICs, not PCI.

    Unfortunately I don't have enough (powerful enough) equipment to push a 1Gbps simulation through the pfSense box.  I have a Lenovo T440 with an i5, but like I said in my previous post, I can't get 1Gbps saturation via iperf on it either (it should be able to; maybe it's a Win7 issue or something). I also have a NAS, but it has a very slow processor.  I have a MacBook Air as well, but without a gigabit adapter (WiFi only).

    So, using what I've got: pfSense -> Lenovo, TCP window size of 128KB:

    [ ID] Interval      Transfer    Bandwidth
    [  3]  0.0- 1.0 sec  37.6 MBytes  316 Mbits/sec
    [  3]  1.0- 2.0 sec  39.1 MBytes  328 Mbits/sec
    [  3]  2.0- 3.0 sec  38.4 MBytes  322 Mbits/sec
    [  3]  3.0- 4.0 sec  37.8 MBytes  317 Mbits/sec
    [  3]  4.0- 5.0 sec  37.1 MBytes  311 Mbits/sec
    [  3]  5.0- 6.0 sec  36.9 MBytes  309 Mbits/sec
    [  3]  6.0- 7.0 sec  37.1 MBytes  311 Mbits/sec
    [  3]  7.0- 8.0 sec  37.0 MBytes  310 Mbits/sec
    [  3]  8.0- 9.0 sec  40.0 MBytes  336 Mbits/sec
    [  3]  9.0-10.0 sec  37.9 MBytes  318 Mbits/sec
    [  3]  0.0-10.0 sec  379 MBytes  318 Mbits/sec

    I was running top -SH in another session:

    last pid: 65943;  load averages:  0.18,  0.04,  0.01    up 2+03:16:25  20:26:55
    169 processes: 10 running, 139 sleeping, 3 stopped, 17 waiting
    CPU:  0.0% user,  0.0% nice, 23.7% system, 24.9% interrupt, 51.3% idle
    Mem: 834M Active, 1198M Inact, 699M Wired, 296K Cache, 416M Buf, 1180M Free
    Swap: 8192M Total, 8192M Free

    PID USERNAME PRI NICE  SIZE    RES STATE  C  TIME  WCPU COMMAND
      11 root    171 ki31    0K    64K CPU2    2  49.9H 91.16% idle{idle: cpu2}
      11 root    171 ki31    0K    64K RUN    3  50.3H 87.50% idle{idle: cpu3}
      11 root    171 ki31    0K    64K RUN    1  50.2H 83.25% idle{idle: cpu1}
      12 root    -68    -    0K  336K CPU0    0  10:10 60.89% intr{irq18: bge1
    65943 root      76    0 13556K  2628K CPU1    1  0:08 54.88% iperf{iperf}
      11 root    171 ki31    0K    64K RUN    0  50.5H 43.55% idle{idle: cpu0}
    34264 root      64  20  619M  301M bpf    1  17:53  0.00% snort{snort}
      258 root      76  20  6908K  1404K kqread  3  15:34  0.00% check_reload_stat
      12 root    -68    -    0K  336K WAIT    0  10:05  0.00% intr{irq16: bge0
      12 root    -32    -    0K  336K RUN    0  7:13  0.00% intr{swi4: clock}
    64693 proxy    64  20  380M  364M kqread  2  3:35  0.00% squid
    28093 root      44    0  5784K  1484K select  2  1:29  0.00% apinger
      23 root      20    -    0K    16K syncer  3  0:58  0.00% syncer
        0 root    -16    0    0K  176K sched  2  0:44  0.00% kernel{swapper}
      14 root    -16    -    0K    16K -      2  0:32  0.00% yarrow
    20488 root      44    0 26272K  7532K kqread  0  0:24  0.00% lighttpd
    86216 root      76  20  8296K  1932K wait    0  0:21  0.00% sh
      12 root    -32    -    0K  336K RUN    0  0:18  0.00% intr{swi4: clock}
        8 root    -16    -    0K    16K pftm    1  0:14  0.00% pfpurge
    30278 dhcpd    44    0 15180K 10444K select  2  0:13  0.00% dhcpd

    I'm not sure what the bottleneck is here. On second thought, it doesn't look like a processor issue. Also, I already have IP fast-forwarding turned on (I do use IPsec, but have not had any issues with fast-forwarding yet).

    Thanks for any help!



  • Good news: I figured out the issue. The datagram length was too short (1470 bytes for UDP by default); once I increased it to 16000 bytes, things got moving much quicker.

    Again, pfSense -> Lenovo:

    [2.1.4-RELEASE]: iperf -c 192.168.1.107 -u -b 1000m -i 1 -l 16000
    ------------------------------------------------------------
    Client connecting to 192.168.1.107, UDP port 5001
    Sending 16000 byte datagrams
    UDP buffer size: 56.0 KByte (default)

    [  3] local 192.168.1.1 port 46600 connected with 192.168.1.107 port 5001
    [ ID] Interval      Transfer    Bandwidth
    [  3]  0.0- 1.0 sec  104 MBytes  872 Mbits/sec
    [  3]  1.0- 2.0 sec  105 MBytes  884 Mbits/sec
    [  3]  2.0- 3.0 sec  108 MBytes  908 Mbits/sec
    [  3]  3.0- 4.0 sec  107 MBytes  894 Mbits/sec
    [  3]  4.0- 5.0 sec  109 MBytes  914 Mbits/sec
    [  3]  5.0- 6.0 sec  109 MBytes  915 Mbits/sec
    [  3]  6.0- 7.0 sec  109 MBytes  912 Mbits/sec
    [  3]  7.0- 8.0 sec  108 MBytes  909 Mbits/sec
    [  3]  8.0- 9.0 sec  106 MBytes  890 Mbits/sec
    [  3]  9.0-10.0 sec  105 MBytes  883 Mbits/sec
    [  3]  0.0-10.0 sec  1.05 GBytes  898 Mbits/sec
    [  3] Sent 70583 datagrams

    I'm pretty much hitting the practical limit of a gigabit right there.

    But when I switch to TCP, I'm still getting ~300Mbps.
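
    If the TCP side stays stuck around 300Mbps, a retest with a larger window and parallel streams (echoing the multi-thread observations earlier in the thread) might look like this; the window size and stream count are just examples:

    iperf -c 192.168.1.107 -w 256k -P 4 -t 10 -i 1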


  • Netgate Administrator

    Even though your NICs are on-board, they will still be connected to the chipset via either a PCI or PCIe bus. It seems unlikely that it would be PCI, but you never know; the exact NIC chip code will tell you. Clearly the CPU is not the restriction here: all the cores are still running idle processes.

    Steve



  • @dmitripr:

    12 root    -68    -    0K  336K CPU0    0  10:10 60.89% intr{irq18: bge1

    The interrupt load seems pretty high for <1Gbps throughput.



  • @razzfazz:

    @dmitripr:

    12 root    -68    -    0K  336K CPU0    0  10:10 60.89% intr{irq18: bge1

    The interrupt load seems pretty high for <1Gbps throughput.

    I'm sure these are not the best NICs out there. :)  But considering the 4 cores here, this is only ~15% of total CPU usage. Probably not too bad, but not great either. Intel NICs would fare better for sure.



  • @dmitripr:

    I'm sure these are not the best NICs out there. :)  But considering the 4 cores here, this is only ~15% of total CPU usage. Probably not too bad, but not great either. Intel NICs would fare better for sure.

    And Chelsio better still.



  • A new Intel driver, v2.5.25, for X520/X540 cards was released last week - has anybody tried it yet?

    https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=14688&lang=eng&ProdId=3412



  • @gonzopancho:

    @Jason:

    I was able to get ~8Gbit/s between two FreeNAS 9.x boxes without jumbo frames when using 4 threads.  That's pretty close to wire.

    OK, Jason… FreeBSD won't forward at wirespeed on 10Gbps networks.

    Since the BSDRP guy can only manage to forward (no firewall, just fast forwarding) at a pinch over 1.8Mpps (and you were doing, by my best estimate, 5.5Mpps), I'm going to assert that we still have work to do.

    brunoc:  we're currently engaged in a 10G performance study, but yes, part of the solution will be tuning, and part of it will be the threaded pf in pfSense version 2.2.

    Hmm, if all I need is a pair of routers running CARP and NAT with a pool of IPs and 10GbE Intel NICs, would it make sense to go with 2.2 Alpha snapshots?



  • "8Gbps" is not how we measure these things.

    Quote PPS or go home.



  • @gonzopancho:

    "8Gbps" is not how we measure these things.

    Quote PPS or go home.

    My bad. Let's say I need NAT (PAT, really) for 500kpps.
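
    For rough context, a back-of-the-envelope conversion (assuming ~1500-byte packets):

    500,000 pps x 1500 bytes x 8 bits ≈ 6 Gbit/s
    10GbE line rate ≈ 812 kpps at 1500-byte frames, ≈ 14.88 Mpps at 64-byte frames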



  • There is an active internal project to get the performance of 'pf' up.



  • @gonzopancho:

    There is an active internal project to get the performance of 'pf' up.

    Would be nice to know a little more about that project.  For the time being, how near that mark can I get with a Xeon E5520/E5620, PCIe, and a decent 10GbE Intel NIC?

    Should I stay with 2.1.5 or venture into 2.2 ALPHA because of the FreeBSD 10 baseline?



  • I'd go 2.2-BETA, personally.  There are only a couple of things left to get fixed.

    The test harness is here:  https://github.com/gvnn3/conductor

    (Remember, people say I don't know how to open source.)



  • I didn't know there was a beta already; I'll look at it. Thanks.



  • It's not, but should be quite soon.



  • Now that 2.2 is in beta, a few questions about 10GbE:

    1. Are the system tunable tweaks still necessary for the Intel ix drivers?

    2. Are the tweaks in /boot/loader.conf.local mentioned in reply #14 still needed?

    3. Do LRO and TSO still need to be disabled in 2.2 beta for the ix drivers?

    Thank you in advance for any reply!



  • Dude in #14 doesn't understand what he's doing.

    (People who "tune" TCP variables to get packet filtering / NAT throughput are wasting time.)

    You're getting faster IPsec (AES-GCM w/ AES-NI) with 2.2.  You'll see some improvement from the threaded "pf" in FreeBSD 10(.1), upon which pfSense 2.2 is based.

    I've already discussed the faster version of pf here and elsewhere.  There are a couple of easy improvements (good for 12-15%), and these might make it into 2.2.x.  After that it gets hard; pf is a really crappy architecture for performance.

    In any case, these things take time, and/or money.

    "Patches accepted."