Load testing methods, PPS & Bandwidth - performance with igb/em

  • Hi guys,

    I'm replacing my existing firewalls with some pfsense boxes and I'm trying to get an idea of performance and how it should be tested. A quick overview of the configuration is below. The servers are probably overkill - but it's what I have spare. I should note that I tested this with two different motherboards, one with igb and one with em, and saw the same results. The servers listed below use a Supermicro X9SCD-F with 2x integrated Intel 82580DB.

    2 pfsense servers: 3.4 GHz Intel Xeon E3-1240v2 / 16GB RAM / 2x 10KRPM SATAIII HDD (RAID1 gmirror)
    2 test servers: 3.4 GHz Intel Xeon E3-1240v2 / 16GB RAM / 1x 10KRPM SATAIII HDD
    2 switches: 1Gbit Juniper EX-3200

    The configuration is a router-on-a-stick setup providing firewalling and inter-VLAN routing - with a single trunked 1Gbit interface to the switch (carrying the WAN VLAN and the internal VLANs).

    I've been using iperf for inter-vlan testing, using the following command:

    Server 1: iperf -s
    Server 2: iperf -c <server-1-ip> -d

    Packets per second

    pfsense 1: netstat -w 1 -I igb0 (to view packets/second)
    Server 1: hping <server-2-ip> -q -i u2 --data 64 --icmp | tail -n10
    Server 2: hping <server-1-ip> -q -i u2 --data 64 --icmp | tail -n10

    (ie. pinging each other).

    Are there any better / more accurate ways of performing testing? I'm not quite getting the output that I expect.
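    For averaging the per-second counters rather than eyeballing them, a small parser over the `netstat -w 1 -I igb0` output works. This is a sketch assuming FreeBSD's column layout (first numeric column = input packets, fifth = output packets); adjust the indices if your netstat differs:

```python
# Average input/output packets-per-second from `netstat -w 1 -I igb0` output.
# Column layout assumed from FreeBSD netstat: the 1st and 5th numeric columns
# are input and output packets for each one-second interval.

def average_pps(netstat_lines):
    """Return (avg_in_pps, avg_out_pps) over all sample rows."""
    in_pps, out_pps = [], []
    for line in netstat_lines:
        fields = line.split()
        # Skip the two header rows; data rows start with a number.
        if not fields or not fields[0].isdigit():
            continue
        in_pps.append(int(fields[0]))   # input packets this interval
        out_pps.append(int(fields[4]))  # output packets this interval
    n = len(in_pps)
    return (sum(in_pps) / n, sum(out_pps) / n) if n else (0.0, 0.0)

sample = """\
            input         (igb0)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    731579     0     0   77851066     489258     0   56387720     0
    729841     0     0   77666101     490113     0   56486120     0
""".splitlines()

avg_in, avg_out = average_pps(sample)
print(f"avg in: {avg_in:.0f} pps, avg out: {avg_out:.0f} pps")
```

    Pipe a captured netstat run through it (e.g. `netstat -w 1 -I igb0 > run.txt` during the test) instead of reading single samples.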

    Eg. Bandwidth results

    [ ID] Interval       Transfer     Bandwidth
    [  4]  0.0-10.0 sec    459 MBytes    385 Mbits/sec
    [ ID] Interval       Transfer     Bandwidth
    [  5]  0.0-10.0 sec    642 MBytes    538 Mbits/sec

    I would have expected this (with full duplex 1Gb and auto-negotiation turned off on everything) to be 1Gb each way, not 1Gb total?

    Eg. Packets per second

    The theoretical max over a 1Gbit connection should be about 1.488 million 64-byte packets per second, but I'm falling well short of this
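    That line-rate figure follows from the per-frame wire overhead; a quick sanity check of the arithmetic:

```python
# Theoretical max frame rate on gigabit Ethernet for minimum-size frames.
# Each 64-byte frame (incl. FCS) also occupies 8 bytes of preamble/SFD and
# a 12-byte inter-frame gap on the wire: 84 bytes per frame.

LINK_BPS = 1_000_000_000  # 1 Gbit/s
FRAME = 64                # minimum Ethernet frame
PREAMBLE_SFD = 8
IFG = 12

wire_bits = (FRAME + PREAMBLE_SFD + IFG) * 8  # 672 bits on the wire per frame
max_pps = LINK_BPS // wire_bits
print(max_pps)  # 1488095 -- the usual "1.488 Mpps" line-rate figure
```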

                input         (igb0)           output
       packets  errs idrops      bytes    packets  errs      bytes colls
        731579     0     0   77851066     489258     0   56387720     0

    During the test, top -aSCHIP shows

    last pid: 55400;  load averages:  0.38,  0.14,  0.09        up 0+00:21:53  14:30:55
    157 processes: 10 running, 110 sleeping, 37 waiting
    CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    CPU 1:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
    CPU 2:  0.0% user,  0.0% nice,  0.0% system, 28.2% interrupt, 71.8% idle
    CPU 3:  0.0% user,  0.0% nice,  0.0% system, 27.4% interrupt, 72.6% idle
    CPU 4:  0.0% user,  0.0% nice,  6.0% system,  0.0% interrupt, 94.0% idle
    CPU 5:  0.0% user,  0.0% nice, 12.8% system,  0.0% interrupt, 87.2% idle
    CPU 6:  0.0% user,  0.0% nice, 13.2% system,  0.0% interrupt, 86.8% idle
    CPU 7:  0.0% user,  0.0% nice,  9.8% system,  0.0% interrupt, 90.2% idle
    Mem: 52M Active, 15M Inact, 434M Wired, 72K Cache, 34M Buf, 15G Free
    Swap: 32G Total, 32G Free

    And vmstat -i shows

    interrupt                          total       rate
    irq1: atkbd0                          18          0
    irq16: ehci0                        2014          1
    irq19: atapci0                     11985          9
    irq23: ehci1                        2015          1
    cpu0: timer                      2584464       1995
    irq256: igb0:que 0                778068        600
    irq257: igb0:que 1                740291        571
    irq258: igb0:que 2              10529010       8130
    irq259: igb0:que 3              10489491       8099
    irq260: igb0:que 4                830229        641
    irq261: igb0:que 5                762681        588
    irq262: igb0:que 6                798454        616
    irq263: igb0:que 7                887188        685
    irq264: igb0:link                      3          0
    cpu1: timer                      2564435       1980
    cpu4: timer                      2564434       1980
    cpu3: timer                      2564434       1980
    cpu5: timer                      2564434       1980
    cpu6: timer                      2564434       1980
    cpu2: timer                      2564434       1980
    cpu7: timer                      2564434       1980
    Total                           46366950      35804

    In terms of BSD tunables


    What I want to know is,

    1. Are the testing methods I am using accurate?
    2. Are the results I am seeing good/average/poor?
    3. Is there anything else I should be doing.

    NB. There is no NAT/rate limiting, just pure firewalling and VLAN routing.

    Incidentally, I did contact the consultancy wing of pfsense for paid professional support - but after 3 emails without a response, I'm not sure that anyone actually supports it?

  • Actually, regarding bandwidth, I've managed to answer my own question. It appears that the rate is normal (ie. the 1Gbit total). After reviewing systat I can see that it is processing at max performance on the interface.

    # systat -ifstat
          Interface           Traffic               Peak                Total
               igb0  in    115.311 MB/s        115.311 MB/s            6.273 GB
                     out   115.497 MB/s        115.497 MB/s            6.033 GB

    I split WAN off from the VLAN trunk and put them on igb0 and igb1 respectively, ran iperf again and saw a full 1Gbps in each direction. So the trunk was certainly the limiting factor.

  • @ben_uk:

    1. Are the testing methods I am using accurate?

    They measure what they measure - they are accurate in that respect. But perhaps what they measure is not particularly relevant to your particular circumstance. Consider a motor car: 0-60 km/h is probably a very relevant metric if you want to drag-race other people at the traffic lights, but probably not particularly relevant to an elderly person purchasing a car for trips on suburban roads to destinations at most a few suburbs away.

    Your ping statistic is interesting, but how much of your "real life" traffic is continuous pings?

    Some people have reported that putting a pfSense box between two systems results in a significant loss of bandwidth over a single TCP connection between the systems. This might be relevant if you are looking primarily to reduce the time of a single bulk transfer (e.g. a large backup) through a pfSense box, but is perhaps of much less relevance if you are more concerned that the pfSense box is adequate to support large numbers of concurrent web page downloads. What attributes of a pfSense box are most important to you?

    At the moment, pfsense is already suitable. The key task is to route <20Mbps over a 1000Mb bearer - but as it is an edge appliance, there is a requirement to be able to cope in non-normal situations (small DoS attacks and high levels of inter-VLAN traffic). Note I say cope - this is not the purpose of the firewall, but it is going to be best if the firewall is tuned as well as it can be.

    So to answer your question. No, they won't be under continuous ping nor sustained transfers.

    But I was actually just aiming to work towards a target of 1.4M pps - but I wonder if the single VLAN trunk (ie. just 1x 1Gb interface) is actually the limiting factor, due to tx and rx occurring over the same link (hence the halved iperf results seen above).
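    The halving is consistent with a bit of arithmetic: with router-on-a-stick, every routed packet traverses the single trunk twice (once toward the router, once back out), so in a bidirectional iperf test both flows share each trunk direction and their sum is capped at line rate. A sketch using the iperf numbers above:

```python
# Router-on-a-stick: each routed packet crosses the one trunk twice.
# In a bidirectional iperf test, both flows therefore share each trunk
# direction, so their SUM is capped at the trunk's per-direction rate.

TRUNK_MBPS = 1000.0   # 1 Gbit/s per direction (full duplex)

flow_a_to_b = 385.0   # iperf result, server 1 -> server 2 (Mbit/s)
flow_b_to_a = 538.0   # iperf result, server 2 -> server 1 (Mbit/s)

# Load on the trunk toward the router = both flows' traffic arriving at once.
trunk_load = flow_a_to_b + flow_b_to_a
print(f"trunk load: {trunk_load} Mbit/s of {TRUNK_MBPS} available")
# ~923 of 1000 Mbit/s: the trunk, not the box, is the bottleneck.
```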

    Given the server has 2x 1Gb interfaces, what would be an optimal configuration?

    igb0: WAN
    igb1: VLAN trunk

    or

    igb0 + igb1 (LACP lagg): VLAN trunk

    or something else?

    Regarding the testing methods - I actually wanted to know how people actually test PPS rates. I've really struggled to find examples of what testing/tools/commands people use when coming up with a figure for 64 byte packet forwarding. Ie. whether they use hping or not, whether it is UDP or not, whether it is ICMP or not, etc.
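    For what it's worth, the principle behind most software PPS generators (hping in flood mode, pkt-gen, etc.) can be sketched in a few lines: blast minimum-size UDP datagrams and count the send rate. This is only a toy sender-side measurement, not a substitute for a proper rig, and the target/port defaults are placeholders:

```python
# Crude packets-per-second generator: send small UDP datagrams at a target
# and report the achieved send rate. Real rigs use hping, netmap/pkt-gen or
# hardware generators; this just shows the principle.
import socket
import time

def udp_pps(target="127.0.0.1", port=9, payload_len=18, duration=1.0):
    """Send minimum-size UDP datagrams for `duration` seconds; return pps.

    An 18-byte UDP payload yields a 64-byte Ethernet frame
    (18 payload + 8 UDP + 20 IP + 18 Ethernet header/FCS).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * payload_len
    sent = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        sock.sendto(payload, (target, port))
        sent += 1
    sock.close()
    return sent / duration

print(f"{udp_pps(duration=0.2):.0f} pps (sender-side)")
```

    The receive-side rate (netstat -w on the pfsense box, as above) is the number that matters; the sender figure only tells you the generator isn't the bottleneck.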

    Again, to follow up here. I set up a LAGG with LACP and bonded the two interfaces for the VLAN trunk to see if it altered the bandwidth test. Between 2 servers it didn't change anything - but when testing with 4 servers, the performance gain showed. There's a good explanation of the limitations of LACP here: https://supportforums.cisco.com/thread/2132362

    If you are just transferring between 2 addresses, that conversation will only flow down a single port within that port channel; that's the way port channels work. As you get more inputs from different addresses, the port channels will be more evened out due to the way the switch hashes the traffic from different sources down each port in the port channel. A single given conversation will only go down a single port.
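    The hashing behaviour that quote describes can be sketched in a few lines. This uses a toy hash purely for illustration - real switches use vendor-specific hashes over MAC/IP/port fields:

```python
# Why one iperf pair never exceeds a single member link: the switch hashes
# each flow's addresses to pick ONE port in the LAG. Toy hash, illustrative
# only -- real hash policies are vendor-specific.

def lag_port(src_mac, dst_mac, n_links=2):
    """Pick a LAG member link from the flow's address pair."""
    return hash((src_mac, dst_mac)) % n_links

# The same conversation always lands on the same link...
a = lag_port("00:25:90:aa:aa:aa", "00:25:90:bb:bb:bb")
assert a == lag_port("00:25:90:aa:aa:aa", "00:25:90:bb:bb:bb")

# ...so two hosts can never exceed one link's bandwidth, while many
# distinct source addresses spread across the bundle.
pairs = [(f"00:25:90:00:00:{i:02x}", "00:25:90:bb:bb:bb") for i in range(16)]
links = {lag_port(s, d) for s, d in pairs}
print(f"16 flows spread over {len(links)} of 2 links")
```

    That matches what you saw: 2 servers gained nothing from the LAGG, 4 servers did.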

  • I would go for the LAGG option, for redundancy (at least for NIC/cable).

    As for PPS testing, just lower the MTU on the sending side and run iperf again?