Firewalling process multithreaded?



  • Hi Y'all,

    I have a question about pfSense and how the filtering process works (pfSense 2.0-RELEASE).  If I have 8 cores on my pfSense box and 8 incoming streams from 8 different WAN hosts, can I expect each of those streams to be handled by a different processor?  In other words, if I have a 10Gb/s-capable NIC and a single processor can handle at most 1Gb/s per stream (with 1500-byte frames), but I have 8 streams coming in from 8 different hosts, could I in theory get 8Gb/s aggregate across all the streams/hosts?

    Or is the entire packet filtering process single threaded no matter how many processors or how many incoming streams I have?

    I'm trying to test bandwidth incoming, and am wondering if I would get higher aggregate bandwidth if I somehow get a whole bunch of hosts (or streams) to send data to me instead of one.

    Can anyone explain to me how it works?

    Any info much appreciated!



  • As I recall, pf does not thread well.  The good news, however, is that some NIC drivers (em at least, in my experience) do balance the load over multiple cores, or at least run one send and one receive process per NIC, IIRC.  So if you have a 2-NIC box, its routing speed should scale well to at least 4 cores (ignoring pf's load, which was less than the interrupt load when I tested).



  • Ack, I suspected as much; thanks for the heads-up.  I'm trying like mad to increase the performance of my box at 10Gb/s speeds, and so far I have discovered that with jumbo frames I can easily saturate 10Gb/s of inbound filtering; it handles that great.  But if 1500-byte frames come in, it is limited to 1-2Gb/s (I can't explain the variance yet).  I'm guessing the single core handling the pf process is just getting choked by the number of packets per second coming in, although the bizarre part is that I can't see any dropped packets on my pfSense box, so I'm not sure where the limiting is happening exactly.

    Thus my desire to parallelize the process as much as possible.  ;)  I've got 8 cores on this box; it's a shame that 7 of them would sit idle while 1 is being hammered by 1500-byte packets…  :(
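
    For a rough sense of the packet rates involved, here's the back-of-envelope math (assuming a 9000-byte jumbo MTU; Ethernet framing overhead ignored):

    $ echo "10000000000 / (1500 * 8)" | bc    # ~833,333 packets/s at 10Gb/s with 1500-byte frames
    $ echo "10000000000 / (9000 * 8)" | bc    # ~138,888 packets/s at 10Gb/s with 9000-byte jumbo frames

    Standard frames mean roughly six times the per-packet work for the same bit rate, which would fit one core choking on packets per second.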



  • Have a look at 'top -SH' in the shell while testing.



  • Roger.  I am pushing 1.2Gb/s and see the following:

    
    last pid: 20597;  load averages:  0.00,  0.00,  0.00                                     up 9+22:00:32  10:38:44
    135 processes: 10 running, 96 sleeping, 29 waiting
    CPU:  0.0% user,  0.0% nice,  0.0% system, 11.6% interrupt, 88.4% idle
    Mem: 69M Active, 20M Inact, 292M Wired, 192K Cache, 24M Buf, 23G Free
    Swap: 32G Total, 32G Free
    
      PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
       11 root     171 ki31     0K   128K CPU5    5 237.8H 100.00% {idle: cpu5}
       11 root     171 ki31     0K   128K CPU4    4 237.5H 100.00% {idle: cpu4}
       11 root     171 ki31     0K   128K CPU1    1 237.5H 100.00% {idle: cpu1}
       11 root     171 ki31     0K   128K CPU2    2 237.5H 100.00% {idle: cpu2}
       11 root     171 ki31     0K   128K CPU3    3 237.6H 99.37% {idle: cpu3}
       11 root     171 ki31     0K   128K CPU6    6 237.4H 99.17% {idle: cpu6}
       11 root     171 ki31     0K   128K RUN     0 222.9H 82.86% {idle: cpu0}
       12 root     -68    -     0K   480K CPU7    7  67.0H 70.75% {irq260: mxge0}
       11 root     171 ki31     0K   128K CPU7    7 170.9H 37.89% {idle: cpu7}
       12 root     -68    -     0K   480K WAIT    0 874:03 20.26% {irq261: mxge1}
       12 root     -68    -     0K   480K WAIT    6  25:09  0.68% {irq259: bce3}
       12 root     -32    -     0K   480K WAIT    3  33:15  0.00% {swi4: clock}
       14 root      44    -     0K    16K -       5  16:59  0.00% yarrow
    19592 root      64   20  5836K  1500K select  4   1:59  0.00% apinger
        0 root      45    0     0K   128K sched   0   1:58  0.00% {swapper}
    60482 root      76   20  8292K  1832K wait    5   1:52  0.00% sh
       12 root     -68    -     0K   480K WAIT    3   0:18  0.00% {irq256: bce0}
    
    

    mxge0 is my WAN interface (with data incoming at 1.2Gb/s) and mxge1 is my LAN interface (with almost no traffic outbound).  Do you see any bottleneck in this picture?  There are about 12 states moving traffic in the state table, all incoming, adding up to 1.2Gb/s.



  • The most CPU-intensive part of PF is giant-locked, so it can only use one CPU/core.  The interrupt handlers, other processes running on the system, etc. can use the others.  At very large scale, what you usually end up with is one core at 70-80+% utilization and ~15-25% across all the others (though that depends a lot on exactly what you're doing with the box).



  • Try adding hw.mxge.max_slices="4" to loader.conf.local and rebooting, then test again.  It should split up the interrupts and loading across more cores.
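
    That is, something along these lines in /boot/loader.conf.local:

    # /boot/loader.conf.local -- spread mxge interrupt handling across more cores
    hw.mxge.max_slices="4"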



  • Wow!  That really helped, something like a 300% improvement!

    I had to kick up a few system tunables before it would work, however:

    kern.ipc.nmbjumbop = 20000
    kern.ipc.nmbjumbo9 = 12800
    kern.ipc.nmbjumbo16 = 6400
    kern.ipc.nmbclusters = 524288

    Then reboot.  Before doing that, it would just complain during boot with "mxge0: couldn't open slice 1"; after some googling, I determined that you need to kick up the jumbo clusters first, and then voila.
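
    After the reboot, you can sanity-check that the new limits took effect (same sysctl names as above; your values will differ per box):

    $ sysctl kern.ipc.nmbclusters kern.ipc.nmbjumbop kern.ipc.nmbjumbo9 kern.ipc.nmbjumbo16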

    One thing, though: I can only seem to get it to use 4 slices per interface, when I would rather it used 6 or 7.  Even if I tell it to use 7, it just uses 4 anyway.  I can see this in dmesg:

    $ dmesg | grep mxge
    mxge0: <Myri10G-PCIE-8B> mem 0xd4000000-0xd4ffffff,0xdf400000-0xdf4fffff irq 55 at device 0.0 on pci7
    mxge0: [ITHREAD]
    mxge0: [ITHREAD]
    mxge0: [ITHREAD]
    mxge0: [ITHREAD]
    mxge1: <Myri10G-PCIE-8B> mem 0xd3000000-0xd3ffffff,0xdf200000-0xdf2fffff irq 40 at device 0.0 on pci8
    mxge1: [ITHREAD]
    mxge1: [ITHREAD]
    mxge1: [ITHREAD]
    mxge1: [ITHREAD]

    Is it because I have an 8-core system and 4+4=8?

    Thanks a million!  This is a game-changing revelation for me.



  • Seems like each core should give you about 1.5Gbit/s of throughput (excluding the ack throughput in the reverse direction).

    I'm not certain what the upstream-to-downstream ratio looks like in real-world testing for your traffic (it appears to be approximately 1:3.5 in interrupt loading, based on your top output), but you aren't likely to see 20Gbps total, since without jumbo frames one side would choke first.  Then there are the actual limits on processing the packets (as cmb mentioned, it's locked to one core).

    That is, 4 cores are used for the downstream NIC and 4 handle the acks coming back on the other NIC.  Since downstream traffic vastly exceeds the acks, the downstream side will saturate its 4 cores first, whilst the 4 cores handling the ack side idle at about 70%.

    Also, note that slices distribute load per connection (TCP/UDP); a single connection cannot be split across slices.  If you really want to test the limits, use more hosts/clients and open more simultaneous connections.
    Try for an even mix (one WAN-to-LAN stream for each LAN-to-WAN stream) so you get approximately equal amounts of upstream and downstream traffic.  That should give the best utilization of the available cores, since there would be equal packet rates in both directions.
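
    For example (assuming iperf on the test hosts; the host name is a placeholder), something like this would generate that mix:

    # On a LAN-side host, run a server:
    $ iperf -s
    # On each WAN-side host, open several parallel streams:
    $ iperf -c <lan-host-ip> -P 4 -t 60
    # Then mirror it (servers on the WAN side, clients on the LAN side)
    # so both directions carry comparable packet rates.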

