Load testing methods, PPS & Bandwidth - performance with igb/em

ben_uk

Hi guys,

I'm replacing my existing firewalls with some pfsense boxes and I'm just trying to get an idea of performance and how that should be tested. To give a quick overview of the configuration, I'm using the below. The servers are probably overkill, but it's what I have spare. I should note that I tested this with two different motherboards, one with igb and one with em, but saw the same results. The servers listed below use a Supermicro X9SCD-F with 2x integrated Intel 82580DB.

2 pfsense servers: 3.4 GHz Intel Xeon E3-1240v2 / 16GB RAM / 2x 10KRPM SATAIII HDD (RAID1 gmirror)
2 test servers: 3.4 GHz Intel Xeon E3-1240v2 / 16GB RAM / 1x 10KRPM SATAIII HDD
2 switches: 1Gbit Juniper EX-3200

The configuration is a router on a stick set-up to provide firewalling and inter-vlan routing - with a single trunked 1Gbit interface to the switch (carrying the WAN VLAN and the internal VLANs)

Bandwidth
I've been using iperf for inter-vlan testing, using the following command:

Server 1: iperf -s
Server 2: iperf -c 188.94.17.130 -d

Packets per second

pfsense 1: netstat -w 1 -I igb0 (to view packets/second)
Server 1: hping 10.0.1.1 -q -i u2 --data 64 --icmp | tail -n10
Server 2: hping 10.0.2.1 -q -i u2 --data 64 --icmp | tail -n10

(ie. pinging each other).

Are there any better / more accurate ways of performing testing? I'm not quite getting the output that I expect.

Eg. Bandwidth results


[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec    459 MBytes    385 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec    642 MBytes    538 Mbits/sec

With full duplex 1Gb (auto neg. turned off on everything), I would expect this to be 1Gb each way, not 1Gb total?

Eg. Packets per second

The theoretical max over a 1Gbit connection should be about 1.49 million 64 byte packets per second, but I'm falling well short of this


            input         (igb0)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    731579     0     0   77851066     489258     0   56387720     0
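
For reference, the line-rate ceiling depends on the frame size actually on the wire, so it helps to check what the test is really sending. The sketch below works through the arithmetic. It assumes that hping's --data value is the ICMP payload size (so the frame is larger than 64 bytes), that the netstat byte counter includes the Ethernet header but not the FCS, and that the trunk adds a 4 byte VLAN tag. None of these assumptions are confirmed in the thread.

```python
# Per-frame overhead on the wire that never shows up in byte counters.
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter, in bytes
INTERFRAME_GAP = 12   # minimum gap between frames, in byte times


def max_pps(frame_bytes, link_bps=1_000_000_000):
    """Upper bound on frames per second for a frame size that includes the FCS."""
    wire_bits = (frame_bytes + PREAMBLE_SFD + INTERFRAME_GAP) * 8
    return link_bps / wire_bits


def avg_packet_bytes(total_bytes, packets):
    """Average packet size from one netstat -w 1 sample."""
    return total_bytes / packets


# The classic figure for minimum-size (64 byte) frames on 1Gbit.
print(round(max_pps(64)))  # 1488095

# Average sizes from the netstat sample above (input, then output).
print(round(avg_packet_bytes(77851066, 731579), 1))  # 106.4
print(round(avg_packet_bytes(56387720, 489258), 1))  # 115.3

# Assumed frame for "hping --data 64 --icmp" on a tagged trunk:
# Ethernet 14 + VLAN tag 4 + IP 20 + ICMP 8 + payload 64 + FCS 4
assumed_frame = 14 + 4 + 20 + 8 + 64 + 4
print(round(max_pps(assumed_frame)))  # roughly 0.93 million
```

If those assumptions hold, the observed 731579 packets/second inbound is being compared against a ceiling of roughly 0.93 million rather than 1.49 million, because the frames are not minimum size.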

During the test, top -aSCHIP shows


last pid: 55400;  load averages:  0.38,  0.14,  0.09        up 0+00:21:53  14:30:55
157 processes: 10 running, 110 sleeping, 37 waiting
CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 1:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 2:  0.0% user,  0.0% nice,  0.0% system, 28.2% interrupt, 71.8% idle
CPU 3:  0.0% user,  0.0% nice,  0.0% system, 27.4% interrupt, 72.6% idle
CPU 4:  0.0% user,  0.0% nice,  6.0% system,  0.0% interrupt, 94.0% idle
CPU 5:  0.0% user,  0.0% nice, 12.8% system,  0.0% interrupt, 87.2% idle
CPU 6:  0.0% user,  0.0% nice, 13.2% system,  0.0% interrupt, 86.8% idle
CPU 7:  0.0% user,  0.0% nice,  9.8% system,  0.0% interrupt, 90.2% idle
Mem: 52M Active, 15M Inact, 434M Wired, 72K Cache, 34M Buf, 15G Free
Swap: 32G Total, 32G Free

And vmstat -i shows


interrupt                          total       rate
irq1: atkbd0                          18          0
irq16: ehci0                        2014          1
irq19: atapci0                     11985          9
irq23: ehci1                        2015          1
cpu0: timer                      2584464       1995
irq256: igb0:que 0                778068        600
irq257: igb0:que 1                740291        571
irq258: igb0:que 2              10529010       8130
irq259: igb0:que 3              10489491       8099
irq260: igb0:que 4                830229        641
irq261: igb0:que 5                762681        588
irq262: igb0:que 6                798454        616
irq263: igb0:que 7                887188        685
irq264: igb0:link                      3          0
cpu1: timer                      2564435       1980
cpu4: timer                      2564434       1980
cpu3: timer                      2564434       1980
cpu5: timer                      2564434       1980
cpu6: timer                      2564434       1980
cpu2: timer                      2564434       1980
cpu7: timer                      2564434       1980
Total                           46366950      35804
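
The per-queue rates above are quite uneven, and a small script makes the split easier to see. This is only a sketch: the regular expression is written against the vmstat -i lines pasted above and may not match other driver naming. One plausible reading, stated here as an assumption, is that the NIC spreads traffic across queues per flow, so a test with only two ICMP flows mostly exercises two queues.

```python
import re

# The igb0 queue lines copied from the vmstat -i output above.
SAMPLE = """\
irq256: igb0:que 0                778068        600
irq257: igb0:que 1                740291        571
irq258: igb0:que 2              10529010       8130
irq259: igb0:que 3              10489491       8099
irq260: igb0:que 4                830229        641
irq261: igb0:que 5                762681        588
irq262: igb0:que 6                798454        616
irq263: igb0:que 7                887188        685
"""


def queue_rates(text, nic="igb0"):
    """Return {queue number: interrupts per second} for one NIC."""
    pattern = re.compile(
        rf"^irq\d+:\s+{re.escape(nic)}:que (\d+)\s+(\d+)\s+(\d+)\s*$",
        re.MULTILINE,
    )
    return {int(q): int(rate) for q, _total, rate in pattern.findall(text)}


rates = queue_rates(SAMPLE)
total = sum(rates.values())
for queue, rate in sorted(rates.items()):
    print(f"que {queue}: {rate:>5}/s  {100 * rate / total:5.1f}%")
```

On this sample, queues 2 and 3 account for about 81% of the igb0 queue interrupts.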

In terms of BSD tunables


/etc/sysctl.conf

dev.igb.0.enable_lro=0
dev.igb.1.enable_lro=0
kern.random.sys.harvest.interrupt=0
kern.random.sys.harvest.ethernet=0
net.inet.ip.fastforwarding=1
kern.timecounter.hardware=HPET
dev.igb.0.rx_processing_limit=480
dev.igb.1.rx_processing_limit=480
kern.ipc.nmbclusters=512000

/boot/loader.conf

autoboot_delay="3"
vm.kmem_size="435544320"
vm.kmem_size_max="535544320"
kern.ipc.nmbclusters="655356"
hw.igb.num_queues="8"
hw.igb.max_interrupt_rate="30000"
hw.igb.rxd="3096"
hw.igb.txd="3096"

What I want to know is,

1. Are the testing methods I am using accurate?
2. Are the results I am seeing good/average/poor?
3. Is there anything else I should be doing.

NB. There is no NAT/rate limiting, just pure firewalling and VLAN routing.

Incidentally, I did contact the consultancy wing of pfsense for paid professional support - but after 3 emails without a response, I'm not sure that anyone actually supports it?

ben_uk

Actually, regarding bandwidth, I've managed to answer my own question. It appears that the rate is normal (ie. the 1Gbit total). After reviewing systat I can see that it is processing at max performance on the interface.


# systat -ifstat

      Interface           Traffic               Peak                Total

           igb0  in    115.311 MB/s        115.311 MB/s            6.273 GB
                 out   115.497 MB/s        115.497 MB/s            6.033 GB
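
As a rough cross-check, the two iperf directions and the systat figure can be put in the same units. With router on a stick, every routed packet is received and transmitted on the same port, so both iperf directions share each direction of the trunk. The sketch below shows both unit interpretations, since whether systat's MB/s is decimal or binary is an assumption either way.

```python
def mbytes_to_mbits(mb_per_s, binary=False):
    """Convert MB/s to Mbit/s; binary=True treats 1 MB as 1024*1024 bytes."""
    bytes_per_mb = 1024 * 1024 if binary else 1_000_000
    return mb_per_s * bytes_per_mb * 8 / 1_000_000


# Sum of the two iperf -d directions from the first post (payload only).
iperf_total_mbits = 385 + 538
print(iperf_total_mbits)  # 923

# systat -ifstat "in" figure for igb0, under both unit assumptions.
print(round(mbytes_to_mbits(115.311)))               # 922
print(round(mbytes_to_mbits(115.311, binary=True)))  # 967
```

Either way, the interface is carrying close to 1Gbit/s in each direction, which fits the two iperf streams adding up to roughly 1Gbit/s between them.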

I split WAN off from the VLAN trunk and put them on igb0 and igb1 respectively, ran iperf again and saw a full 1Gbps in each direction. So the trunk was certainly the limiting factor.

wallabybob

@ben_uk:

1. Are the testing methods I am using accurate?

They measure what they measure - they are accurate in that respect. But perhaps what they measure is not particularly relevant to your particular circumstance. Consider a motor car. 0-60 km/h is probably a very relevant metric if you want to drag other people at the traffic lights but probably not particularly relevant to an elderly person purchasing a car for trips on suburban roads to destinations at most a few suburbs away.

Your ping statistic is interesting, but how much of your "real life" traffic is continuous pings?

Some people have reported that putting a pfSense box between two systems results in significant loss of bandwidth over a single TCP connection between the systems. This might be relevant if they are looking primarily to reduce the time of a single bulk transfer (e.g. a large backup) through a pfSense box but is perhaps of much less relevance if they are more concerned that the pfSense box is adequate to support large numbers of concurrent web page downloads. What attributes of a pfSense box are most important to you?

ben_uk

At the moment, pfsense is already suitable. The key task is to route <20Mbps over a 1000Mb bearer - but as it is an edge appliance, there is a requirement to be able to cope in non-normal situations (small DOS attacks and high levels of inter-vlan traffic). Note, I say cope, this is not the purpose of the firewall, but it is going to be best if the firewall is tuned to the best of its ability.

So to answer your question. No, they won't be under continuous ping nor sustained transfers.

But I was actually just aiming to work towards a target of 1.4M pps - but I wonder if the single VLAN trunk (ie. just 1x 1Gb interface) is actually the limiting factor, due to tx and rx occurring 4 times over (hence the halved iperf results seen above).

Given the server has 2x 1Gb interfaces, what would be an optimal configuration?

igb0: wan
igb1: vlan trunk

or

igb0 + igb1 (lacp lagg): vlan trunk

or

something else?

Regarding the testing methods - I actually wanted to know how people actually test PPS rates. I've really struggled to find examples of what testing/tools/commands people use when coming up with a figure for 64 byte packet forwarding. Ie. whether they use hping or not, whether it is UDP or not, whether it is ICMP or not, etc.

ben_uk

Again, to follow up here. I set up a LAGG with LACP and bonded the two interfaces for the VLAN trunk to see if it altered the bandwidth test. Between 2 servers, it didn't change anything - but when testing with 4 servers, the improvement showed. There's a good explanation of the limitations of LACP here https://supportforums.cisco.com/thread/2132362

If you are just transferring between 2 addresses that conversation will only flow down a single port within that port channel, that's the way port channels work. As you get more inputs from different addresses then the port channels will be more evened out due to the way the switch hashes the traffic from different sources down each port in the port channel. A single given conversation will only go down a single port.
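
The behaviour described in that quote can be illustrated with a toy hash. This is only a sketch: real switches and lagg(4) use their own hash functions and inputs, and the function name and addresses below are made up for illustration.

```python
import ipaddress


def pick_member(src, dst, n_links):
    """Toy address hash: XOR source and destination, then take modulo link count."""
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % n_links


# Two servers: one address pair, so every packet lands on the same link.
pair = ("10.0.1.10", "10.0.2.10")
print({pick_member(*pair, 2) for _ in range(1000)})  # {0}

# Four servers: more address pairs, so both links can be used.
flows = [("10.0.1.10", "10.0.2.10"), ("10.0.1.11", "10.0.2.10")]
print(sorted({pick_member(src, dst, 2) for src, dst in flows}))  # [0, 1]
```

The point is only that the link is chosen per conversation, not per packet, so a two-host test cannot show any gain from the second port.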

SeventhSon

I would go for the LAGG option, for redundancy (at least for NIC/cable).

As for PPS testing, just lower the MTU on the sending side and run the iperf again?