    Fabiatech FX5625 improving throughput

    • stephenw10 Netgate Administrator

      The maximum throughput with a D525 is somewhere in the 650Mbps region, but that's with ideal test traffic. With real-world traffic and mixed packet sizes it will be lower. There may not be much that can be done here.

      What load makes up the 100% usage on one core?

      Can we see the output of top -aSH at the command line whilst you are seeing maximum throughput?
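
      If it's easier, a snapshot can be grabbed non-interactively while the traffic is flowing; the batch flags and output file below are just one way of doing it, not a required method:

      top -baSH -d 2 | tail -n 40 > /root/top-snapshot.txt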

      Steve

      • SimonB256 @stephenw10

        @stephenw10

        The 100% CPU usage only seems to happen in the early hours of the morning, always at the same time. I'll get on to it and take a look remotely tomorrow morning, and will post an update.

        • SimonB256

          @stephenw10

          After manually pushing some data through to generate this load, the main process responsible is 'intr{irq257: em0:rx0}', with similar processes for the other interfaces alongside it but not quite as high (understandably, as em0 is the WAN interface).

          Sample output:

          PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
             11 root     155 ki31     0K    64K CPU1    1  23.6H  87.26% [idle{idle: cpu1}]
             12 root     -92    -     0K   832K CPU0    0 233:12  79.73% [intr{irq257: em0:rx0}]
             11 root     155 ki31     0K    64K RUN     3  23.9H  76.87% [idle{idle: cpu3}]
             11 root     155 ki31     0K    64K CPU2    2  23.2H  49.23% [idle{idle: cpu2}]
             12 root     -92    -     0K   832K WAIT    2   4:05  34.74% [intr{irq278: em5:rx0}]
              0 root     -92    -     0K   816K -       3   7:47  20.59% [kernel{em0 rxq (cpuid 0)}]
             11 root     155 ki31     0K    64K RUN     0  18.6H  13.41% [idle{idle: cpu0}]
             12 root     -92    -     0K   832K WAIT    2  51:47  11.30% [intr{irq261: em1:rx0}]
             12 root     -92    -     0K   832K WAIT    0 107:33   5.89% [intr{irq265: em2:rx0}]
              0 root     -92    -     0K   816K -       2  23:04   5.05% [kernel{dummynet}]
             12 root     -92    -     0K   832K WAIT    1  16:14   4.75% [intr{irq258: em0:tx0}]
             12 root     -92    -     0K   832K WAIT    3   0:16   4.39% [intr{irq279: em5:tx0}]
             12 root     -92    -     0K   832K WAIT    3   6:09   1.87% [intr{irq262: em1:tx0}]
             12 root     -92    -     0K   832K WAIT    2  13:49   1.40% [intr{irq269: em3:rx0}]
              0 root     -92    -     0K   816K -       1   1:41   0.75% [kernel{em5 rxq (cpuid 2)}]
             12 root     -92    -     0K   832K WAIT    1  15:43   0.58% [intr{irq266: em2:tx0}]
          74844 root      20    0  9868K  4700K CPU3    3   0:00   0.53% top -aSH
              0 root     -92    -     0K   816K -       1   2:42   0.46% [kernel{em1 rxq (cpuid 2)}]
             12 root     -92    -     0K   832K WAIT    0  11:15   0.42% [intr{irq281: em6:rx0}]
             12 root     -60    -     0K   832K WAIT    1   3:25   0.27% [intr{swi4: clock (0)}]
             12 root     -92    -     0K   832K WAIT    3   2:18   0.26% [intr{irq270: em3:tx0}]
          

          Checking things like mbuf usage et al., there appears to be plenty of room there:

          35554/14801/50355 mbufs in use (current/cache/total)
          33501/13093/46594/249500 mbuf clusters in use (current/cache/total/max)
          33501/13051 mbuf+clusters out of packet secondary zone in use (current/cache)
          0/34/34/124749 4k (page size) jumbo clusters in use (current/cache/total/max)
          0/0/0/36962 9k jumbo clusters in use (current/cache/total/max)
          0/0/0/20791 16k jumbo clusters in use (current/cache/total/max)
          75890K/30022K/105912K bytes allocated to network (current/cache/total)
          0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
          0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
          0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
          0/0/0 requests for jumbo clusters denied (4k/9k/16k)
          0 sendfile syscalls
          0 sendfile syscalls completed without I/O request
          0 requests for I/O initiated by sendfile
          0 pages read by sendfile as part of a request
          0 pages were valid at time of a sendfile request
          0 pages were requested for read ahead by applications
          0 pages were read ahead by sendfile
          0 times sendfile encountered an already busy page
          0 requests for sfbufs denied
          0 requests for sfbufs delayed
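
          That output looks like the standard FreeBSD mbuf report; if so, it can be re-checked at any time from the shell with:

          netstat -m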
          

          Current MBUF limit set as:

          [2.4.5-RELEASE][admin@firewall1.midlandcomputers.com]/root: sysctl kern.ipc.nmbclusters
          kern.ipc.nmbclusters: 249500
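
          If that limit ever did become the bottleneck it can be raised via a boot-time tunable; the value below is purely an example, not a recommendation for this box:

          # /boot/loader.conf.local (example value only)
          kern.ipc.nmbclusters="1000000"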
          
          • stephenw10 Netgate Administrator

            em uses a single receive and transmit queue so you're unlikely to exhaust the mbufs.

            What throughput were you seeing when that was taken, and between which interfaces?

            What throughput do you see without any of those loader variables, just using the em defaults?

            What output do you get from vmstat -i and sysctl net.isr?
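
            For example, both could be captured in one go while the transfer is running (the output path is just illustrative):

            vmstat -i > /root/fx5625-diag.txt
            sysctl net.isr >> /root/fx5625-diag.txt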

            Steve

            • SimonB256 @stephenw10

              Output from sysctl net.isr:

              net.isr.numthreads: 4
              net.isr.maxprot: 16
              net.isr.defaultqlimit: 256
              net.isr.maxqlimit: 10240
              net.isr.bindthreads: 0
              net.isr.maxthreads: 4
              net.isr.dispatch: direct
              

              Output from vmstat -i:

              interrupt                          total       rate
              irq18: uhci2+                     304106          3
              cpu0:timer                     108772857       1036
              cpu1:timer                      68073061        648
              cpu2:timer                       9281390         88
              cpu3:timer                      19118159        182
              irq257: em0:rx0                194215751       1850
              irq258: em0:tx0                229258370       2183
              irq259: em0:link                       1          0
              irq261: em1:rx0                 48310327        460
              irq262: em1:tx0                 82599543        787
              irq263: em1:link                       1          0
              irq265: em2:rx0                113082535       1077
              irq266: em2:tx0                193176467       1840
              irq267: em2:link                       1          0
              irq269: em3:rx0                 23497096        224
              irq270: em3:tx0                 39913436        380
              irq271: em3:link                       1          0
              irq273: em4:rx0                   157084          1
              irq274: em4:tx0                   104642          1
              irq275: em4:link                       1          0
              irq277: pcib8                          1          0
              irq278: em5:rx0                  3537702         34
              irq279: em5:tx0                  3615446         34
              irq280: em5:link                       1          0
              irq281: em6:rx0                 11959127        114
              irq282: em6:tx0                 15965140        152
              irq283: em6:link                       1          0
              irq284: em7:rx0                   421216          4
              irq285: em7:tx0                    21775          0
              irq286: em7:link                       9          0
              Total                         1165385247      11098
              

              In the example I posted above I was simply downloading large files to two hosts without bandwidth caps, where em0 is the WAN interface, and em1 & em5 are the interfaces where the hosts reside.

              I will remove what I have entered from the loader.conf, reboot and retry, but rebooting the firewall during office hours is a pain to arrange. I'll get this done this evening.
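
              For anyone following along, the usual mechanics are to comment the custom lines out of /boot/loader.conf.local and reboot; the entries shown below are purely hypothetical, since the thread doesn't record which tunables were actually set:

              # /boot/loader.conf.local - hypothetical example, commenting out custom em tunables
              #hw.em.rxd="4096"
              #hw.em.txd="4096"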

              • stephenw10 Netgate Administrator

                You might try setting:
                net.isr.bindthreads=1

                The core affinity might give you better distribution.
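
                net.isr.bindthreads is read at boot, so the usual way to apply it would be a line in /boot/loader.conf.local followed by a reboot:

                # /boot/loader.conf.local
                net.isr.bindthreads="1"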

                • SimonB256 @stephenw10

                  Hi,

                  I've set that and rebooted, and will test over the weekend.

                  I might be going completely down the wrong track here, but would net.isr.direct=1 possibly also help?

                  • stephenw10 Netgate Administrator

                    @SimonB256 said in Fabiatech FX5625 improving throughput:

                    net.isr.direct

                    That doesn't exist in FreeBSD after 9 (pfSense 2.4.5 is built on FreeBSD 11.3); that's what net.isr.dispatch=direct does now.
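
                    To check (or, hypothetically, change) the current policy at runtime:

                    sysctl net.isr.dispatch              # show the current policy (direct here)
                    # sysctl net.isr.dispatch=deferred   # runtime-writable; illustration only, not a recommendation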

                    Steve

                    • SimonB256

                      Just to update, it appears that I am now getting better throughput after adding net.isr.bindthreads=1.

                      Thank you for your help.

                      • stephenw10 Netgate Administrator

                        Ah, good to hear. What sort of improvement are you seeing?

                        • SimonB256

                          In terms of throughput I'm only seeing a 15-20Mbps increase (so we're up to 470Mbps), but we're seeing far less packet loss at the top end of these speeds.

                          Looking further at the kind of traffic we're handling, we're talking around 600-700 flows at any given time (according to ntop, which I have running elsewhere in the network), and around 15k-20k states listed on the firewall itself.

                          So I imagine that, for this small device, handling a reasonable number of small connections at any one time might explain why we aren't getting the 600Mbps+ theoretical maximum.
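
                          Those state counts can be confirmed on the firewall itself; for instance:

                          pfctl -si | grep -A3 "State Table"   # summary including current entries
                          pfctl -ss | wc -l                    # roughly one line per state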

                          • stephenw10 Netgate Administrator

                            Yes, that seems reasonable. You would only see >600Mbps with all full-size packets.

                            Steve
