2.2 on Atom 2750 NAT performance report



  • I'm testing 2.2 because of throughput limitations on my firewall with 2.1.5's single-threaded PF, running on an SMC Atom 2750 box under a real workload.

    Before I deployed the Atoms, I tested NAT throughput with iperf using just a small number of streams and got around 930 Mbps on a 1 Gbps link, which matches what Netgate publishes for their similar firewall box. However, when we deployed the box under our real workload, with many more connections, real throughput dropped to about 520 Mbps, and 1 of the 8 CPUs sat at 100% interrupt. Our real workload is a performance test consisting of 200-250 NAT connections.

    It was suggested on IRC that I try 2.2, as the person was interested in what the performance would be with multi-threaded PF. I don't recall who that was.

    To start, I opened up the IGB settings from 1 queue to 4; I'll be moving that to 8 queues for the next test.

    
    kern.ipc.nmbclusters="131072"
    hw.igb.num_queues=4
    
    
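    For anyone reproducing this: on pfSense these are loader tunables, so they belong in `/boot/loader.conf.local` and only take effect after a reboot. A sketch of the file for the planned 8-queue run (the `0 = auto` note reflects the stock FreeBSD igb driver's behavior; treat it as an assumption to verify on your version):

```shell
# /boot/loader.conf.local -- sketch for the next (8-queue) run.
# Loader tunables are read at boot, so a reboot is required to apply.
kern.ipc.nmbclusters="131072"
hw.igb.num_queues=8   # was 4 in the run above; 0 lets the driver auto-size
# After reboot, `vmstat -i | grep igb0` should show one igb0:que IRQ per queue.
```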

    When we started running the workload, we saw a consistent 850 Mbps outbound and 15 Mbps inbound on a 1 Gbps link.

    Top looks like this:

    
    last pid: 35129;  load averages:  2.53,  2.52,  2.45;  up 3+10:49:08  07:05:24
    163 processes: 12 running, 106 sleeping, 45 waiting
    CPU 0:  0.0% user,  0.0% nice,  0.0% system, 62.2% interrupt, 37.8% idle
    CPU 1:  0.0% user,  0.0% nice,  0.0% system, 59.5% interrupt, 40.5% idle
    CPU 2:  0.0% user,  0.0% nice,  0.0% system, 66.7% interrupt, 33.3% idle
    CPU 3:  0.0% user,  0.0% nice,  0.0% system, 62.5% interrupt, 37.5% idle
    CPU 4:  0.0% user,  0.0% nice,  2.7% system,  2.7% interrupt, 94.6% idle
    CPU 5:  0.3% user,  0.0% nice,  4.5% system,  2.4% interrupt, 92.8% idle
    CPU 6:  0.0% user,  0.0% nice,  7.2% system,  1.5% interrupt, 91.3% idle
    CPU 7:  0.0% user,  0.0% nice,  2.7% system,  2.7% interrupt, 94.6% idle
    Mem: 16M Active, 77M Inact, 294M Wired, 136M Buf, 7496M Free
    Swap: 16G Total, 16G Free
    
      PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
       11 root     155 ki31     0K   128K CPU7    7  82.6H 100.00% idle{idle: cpu7}
       11 root     155 ki31     0K   128K CPU5    5  82.6H 100.00% idle{idle: cpu5}
       11 root     155 ki31     0K   128K CPU6    6  82.6H  95.26% idle{idle: cpu6}
       11 root     155 ki31     0K   128K RUN     4  82.6H  94.68% idle{idle: cpu4}
       12 root     -92    -     0K   768K WAIT    2  41:18  64.16% intr{irq259: igb0:que}
       12 root     -92    -     0K   768K CPU0    0  40:37  63.48% intr{irq257: igb0:que}
       12 root     -92    -     0K   768K CPU3    3  41:32  62.99% intr{irq260: igb0:que}
       12 root     -92    -     0K   768K CPU1    1  41:45  62.26% intr{irq258: igb0:que}
       11 root     155 ki31     0K   128K RUN     0  81.5H  44.38% idle{idle: cpu0}
       11 root     155 ki31     0K   128K RUN     1  82.0H  43.46% idle{idle: cpu1}
       11 root     155 ki31     0K   128K RUN     2  82.0H  42.68% idle{idle: cpu2}
       11 root     155 ki31     0K   128K RUN     3  82.0H  41.46% idle{idle: cpu3}
       12 root     -72    -     0K   768K WAIT    5   5:06   6.05% intr{swi1: pfsync}
        0 root     -92    0     0K   416K -       7   4:56   3.47% kernel{igb0 que}
        0 root     -92    0     0K   416K -       6   4:54   3.47% kernel{igb0 que}
        0 root     -92    0     0K   416K -       7   4:39   3.17% kernel{igb0 que}
        0 root     -92    0     0K   416K -       5   4:58   2.98% kernel{igb0 que}
       15 root     -16    -     0K    16K -       7   3:22   0.68% rand_harvestq
       12 root     -60    -     0K   768K WAIT    3  29:12   0.00% intr{swi4: clock}
    14885 root      20    0 16812K  2304K bpf     7   1:54   0.00% filterlog
       12 root     -72    -     0K   768K WAIT    4   1:16   0.00% intr{swi1: netisr 4}
     5453 root      20    0 14664K  2316K select  5   1:08   0.00% syslogd
        5 root     -16    -     0K    16K pftm    6   1:04   0.00% pf purge
        0 root     -16    0     0K   416K swapin  4   0:57   0.00% kernel{swapper}
    
    

    I'm not sure whether PF or our test rig is leaving the extra 70 Mbps on the floor, but we are going to be tuning both to try to push it as high as it will go.
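    As a sanity check that the four igb0 queues really are balanced, here's a quick awk one-liner (my addition, not part of the original test) that averages the %interrupt column for CPUs 0-3, with the data pasted from the top snapshot above:

```shell
# Average the %interrupt figure across the four queue CPUs (0-3)
# from the top snapshot; -F'[ ,]+' splits on runs of spaces/commas,
# so field 9 is the "NN.N%" interrupt value on each CPU line.
avg=$(awk -F'[ ,]+' '/^CPU [0-3]:/ { sum += $9; n++ } END { printf "%.3f", sum / n }' <<'EOF'
CPU 0:  0.0% user,  0.0% nice,  0.0% system, 62.2% interrupt, 37.8% idle
CPU 1:  0.0% user,  0.0% nice,  0.0% system, 59.5% interrupt, 40.5% idle
CPU 2:  0.0% user,  0.0% nice,  0.0% system, 66.7% interrupt, 33.3% idle
CPU 3:  0.0% user,  0.0% nice,  0.0% system, 62.5% interrupt, 37.5% idle
EOF
)
echo "$avg"   # 62.725
```

    So the queues are within a few points of each other, rather than one CPU pegged at 100% as on 2.1.5.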

    If you have any additional questions, or want me to report other data let me know.
    ![Screen Shot 2014-11-07 at 10.08.11 AM.png](/public/imported_attachments/1/Screen Shot 2014-11-07 at 10.08.11 AM.png)
    ![Screen Shot 2014-11-07 at 10.07.55 AM.png](/public/imported_attachments/1/Screen Shot 2014-11-07 at 10.07.55 AM.png)



  • @jeffbearer:

    To start I opened up the IGB settings from 1 queue to 4, I'll be moving that to 8 queues for the next test.

    
    kern.ipc.nmbclusters="131072"
    hw.igb.num_queues=4
    
    

    I've seen in another thread that those settings are not necessary in 2.2, because FreeBSD 10 allocates as needed.

    Perhaps a test letting FreeBSD allocate dynamically would be worth a try.



  • I ran our testing with 2, 4, 6, and 8 queues, and with no queue setting at all. It looks like 4 and 6 queues give us the best performance, and they're equal.
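    For anyone repeating the sweep, a hypothetical helper for rewriting the tunable between runs (the filename and workflow are assumptions, and each change still needs a reboot). It's shown here against a temp copy of the file so it's safe to try anywhere:

```shell
# Hypothetical helper for sweeping hw.igb.num_queues between test runs.
# Works on a temp copy; point `conf` at /boot/loader.conf.local on a
# real box, then reboot for each new value to take effect.
conf=$(mktemp)
printf 'kern.ipc.nmbclusters="131072"\nhw.igb.num_queues=4\n' > "$conf"

set_queues() {
  # $1 = queue count; pass nothing to drop the tunable entirely so
  # FreeBSD 10 sizes the queues itself.
  grep -v '^hw\.igb\.num_queues=' "$conf" > "$conf.new"
  [ -n "${1:-}" ] && echo "hw.igb.num_queues=$1" >> "$conf.new"
  mv "$conf.new" "$conf"
}

set_queues 6
grep '^hw\.igb\.num_queues=' "$conf"   # hw.igb.num_queues=6
```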



  • Duh, I just realized that my LAN and WAN ports are VLANs on the same physical 1 Gbps port, and that is not ideal when your WAN link is 1 Gbps. I'll update when I move the LAN port to its own interface.

