How does pfSense utilize multicore processors and multi-CPU systems?
-
Hi, pfSense Gurus!
Looking at the prospect of upgrading to multi-CPU systems, we have two main questions:
- How does pfSense utilize multicore processors in single-CPU systems?
- How does pfSense utilize multicore processors in multi-CPU systems?
UPDATE - Feb 2021
Hm. It looks like the right answer is hard to find...
I need to expand a little on the question that started this topic: what system is better for network-related operation (i.e. firewall, load balancing, gateway, proxy, media streaming, ...):
a) 1 CPU with 4-10 cores, high frequency
b) 2-4 CPUs with 4-6 cores each, mid frequency
And how does the CPU cache, L2 (2-56 MB) and L3 (2-57 MB), impact network-related operation (in cooperation with the NIC)?
In general the situation looks like this: because pfSense is based on FreeBSD, the answer lies somewhere between the ability of pfSense threads to effectively utilize several CPU cores (the only way for an app to utilize more than one core is to execute more than one thread) and the ability of the FreeBSD kernel and drivers to utilize several CPUs (the kernel is responsible for binding threads to cores).
As for the application side (in our case pfSense), the answer lies in the cpuset(2) system call.
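For illustration, a minimal sketch of what cpuset(1) usage looks like on FreeBSD (the PID, TID and core numbers here are just placeholders, not anything pfSense sets by default):
cpuset -g -p 1234 # show which cores process 1234 is currently allowed to run on
cpuset -l 0-3 -p 1234 # restrict process 1234 to cores 0-3
cpuset -l 2 -t 100077 # pin a single thread (by its TID) to core 2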
I tried searching the archives, but either unsuccessfully or the information was 5-10 years old.
Some people write that it would be a big waste of CPU power unless you plan on terminating a few thousand IPsec VPNs. But I agree with one poster: "It's massive overkill or not... the problem is which one has multi-core support?"
So, the question starts with: is pfSense strictly single-threaded?
As far as I know, pfSense is an upper layer on top of pf, and pf is an upper layer on top of the FreeBSD network stack. And the pfSense developers have made a lot of great modifications to the original FreeBSD pf.
So, what are your answers to the questions at the top of this message?
-
@Sergei_Shablovsky said in How pfSense utilize multicore processors and multi-CPU systems ?:
So, the question starts with: is pfSense strictly single-threaded?
The underlying FreeBSD is multicore and multithreaded.
As are most FreeBSD applications and tools used by pfSense.
pfSense is a web interface that enables you to manipulate all the settings using a GUI, not a command line. Basically, it's a web interface and a lot of PHP script files (I over-simplify). The thing is: are you the one administrating this device, or are you allowing hundreds or thousands of admins to do so? ^^
See here https://www.netgate.com/products/appliances/ so you can see what Netgate itself uses for its devices.
Example :
Intel(R) Pentium(R) 4 CPU 3.20GHz
2 CPUs: 1 package(s) x 2 hardware threads
AES-NI CPU Crypto: No
ancient "PC device" (15 years old) handles Gbit connections easily.Btw : Netgate (pfSense) doesn't modify the original FreeBSD source a lot. It would be far to much work to bring out newer versions. I guess there will be some patches.
-
pfSense is not single threaded. pf is no longer single threaded, so there are certainly advantages to using multiple CPU cores.
Some things are still single threaded. OpenVPN and PPPoE are two we most commonly see. Some NIC drivers cannot use more than one queue but most now do.
There's no significant difference between multiple CPUs and multiple cores in a single CPU as far as I know.
Steve
-
@stephenw10 said in How pfSense utilize multicore processors and multi-CPU systems ?:
Some NIC drivers cannot use more than one queue but most now do.
Where can I see a list of NICs that are able to use multiple threads (queues) on FreeBSD?
-
Well ....
Stay away from Realtek
Prefer Intel
and you're good.
-
@Gertjan said in How pfSense utilize multicore processors and multi-CPU systems ?:
Well ....
Stay away from Realtek
Prefer Intel
and you're good.
Thank you for the advice!
Are you sure about Intel? Because even in the official pfSense docs, among all the NIC troubleshooting entries, I can see at least 2 issues linked to Broadcom and 2 issues linked to Intel. No other NICs.
From a statistical point of view this may not be a good result.
A search on this forum also suggests that many issues are linked to Intel. Of course, it may be that a lot of users prefer Intel NICs, and some of them have issues...
-
@stephenw10 said in How pfSense utilize multicore processors and multi-CPU systems ?:
pfSense is not single threaded. pf is no longer single threaded, so there are certainly advantages to using multiple CPU cores.
Some things are still single threaded. OpenVPN and PPPoE are two we most commonly see. Some NIC drivers cannot use more than one queue but most now do.
There's no significant difference between multiple CPUs and multiple cores in a single CPU as far as I know.
You mean "no significant difference" from the point of view of the FreeBSD kernel and drivers that bind application threads to CPU cores?
-
@Sergei_Shablovsky said in How pfSense utilize multicore processors and multi-CPU systems ?:
@stephenw10 said in How pfSense utilize multicore processors and multi-CPU systems ?:
Some NIC drivers cannot use more than one queue but most now do.
Where can I see a list of NICs that are able to use multiple threads (queues) on FreeBSD?
The FreeBSD web pages would be a good place to start.
A lot of the drivers are provided by the chip manufacturers; igb, for example, is written by Intel.
https://www.freebsd.org/cgi/man.cgi?query=igb&sektion=4&manpath=freebsd-release-ports
https://www.freebsd.org/releases/11.2R/hardware.html#ethernet
-
@Sergei_Shablovsky said in How pfSense utilize multicore processors and multi-CPU systems ?:
Are you sure about Intel?
Very sure. Use Intel based NICs if you want the least likelihood of seeing issues.
Steve
-
@stephenw10 said in How pfSense utilize multicore processors and multi-CPU systems ?:
Steve
Appreciate your help, Steve! :)
-
After FreeBSD and pfSense have gone through several major updates, it is time to return to this question.
In FreeBSD 9-11 a separate kernel interrupt thread was created for each card
(for example, for Intel cards with 2 Ethernet ports):
intr{irq273: igb1:que}
intr{irq292: igb3:que}
...and so on...
And because FreeBSD (for a long time now, btw!) is not able to parallelize PPPoE traffic across several threads, in FreeBSD 9-11 these threads were spread over several cores by using cpuset. This worked not badly until FreeBSD 12 came along.
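For example, a rough sketch of that approach using standard vmstat(8) and cpuset(1); the IRQ numbers are the ones from the thread names above and will differ on other systems:
vmstat -i | grep igb # list the per-queue interrupt sources and their IRQ numbers
cpuset -l 1 -x 273 # bind igb1:que (irq273) to CPU core 1
cpuset -l 2 -x 292 # bind igb3:que (irq292) to CPU core 2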
Now on FreeBSD 12 all the processing is grouped together into common kernel threads:
kernel{if_io_tqg_0}
kernel{if_io_tqg_1}
kernel{if_io_tqg_2}
kernel{if_io_tqg_3}
...and so on...
And it looks like there is no way to assign each card to a separate core.
As a result the first core is 75-80% loaded on average, and up to 100% loaded at peak traffic.
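A quick way to observe this distribution yourself (nothing pfSense-specific assumed here, just stock FreeBSD tools):
top -HSP # per-CPU load plus kernel threads; look for the kernel{if_io_tqg_N} entries
procstat -ta | grep if_io_tqg # list the iflib taskqueue threads and the CPU each one last ran on
vmstat -i # per-queue interrupt counters, to see which queues carry most of the traffic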
Some people suggest tuning the iflib settings (sometimes in conjunction with switching OFF Hyper-Threading).
In loader:
net.isr.maxthreads="1024" # Use at most this many CPUs for netisr processing
net.isr.bindthreads="1" # Bind netisr threads to CPUs.
In sysctl:
net.isr.dispatch=deferred # direct / hybrid / deferred // Interrupt handling via multiple CPUs, but with context switching
Or:
dev.igb.0.iflib.tx_abdicate=1
dev.igb.0.iflib.separate_txrx=1
So the question is still: how to effectively manage the load on multi-core, multi-CPU systems?
Especially when the problems with the powerd and est drivers for ALL Intel CPUs still exist (see this thread about SpeedStep & TurboBoost working together in FreeBSD: https://forum.netgate.com/topic/112201/issue-with-intel-speedstep-settings).
-
Also, this post about FreeBSD network optimization and tuning is worth your attention: https://calomel.org/freebsd_network_tuning.html
-
And another example of how to manually dispatch interrupt handling to certain CPU cores:
# ${basedir}/ix_cpu_16core_2nic contains one "iface:queue:cpu" mapping per entry:
# ix0:0:1 ix0:1:2 ix0:2:3 ix0:3:4 ix0:4:5 ix0:5:6 ix0:6:7 ix0:7:8
# ix1:0:9 ix1:1:10 ix1:2:11 ix1:3:12 ix1:4:13 ix1:5:14 ix1:6:15 ix1:7:16
for l in `cat ${basedir}/ix_cpu_16core_2nic`; do
  if [ -n "$l" ]; then
    iface=`echo $l | cut -f 1 -d ":"`
    queue=`echo $l | cut -f 2 -d ":"`
    cpu=`echo $l | cut -f 3 -d ":"`
    # Look up the IRQ number assigned to this interface queue
    irq=`vmstat -i | grep "${iface}:q${queue}" | cut -f 1 -d ":" | sed "s/irq//g"`
    echo "Binding ${iface} queue #${queue} (irq ${irq}) -> CPU${cpu}"
    cpuset -l $cpu -x $irq
  fi
done
In total 8 interrupts per NIC: one interrupt per CPU core, with CPU0 left for dummynet.
From here (you'll need Google Translate): https://local.com.ua/forum/topic/117570-freebsd-gateway-10g/
-
And finally, another interesting thread about binding igb(4) IRQs and dummynet to CPUs: https://dadv.livejournal.com/139366.html (use translate.google.com to read it).
In short: because of the igb(4) driver's queue binding algorithm (when the first queue is created it is bound to the first core, CPU0, and the same is done for each card), the PPPoE/GRE traffic in the first queue of each Intel card is bound firmly to CPU0, and because this traffic is a high load -> CPU0 quickly becomes overloaded -> packets are held longer in the NIC buffers -> latency increases dramatically.
Another interesting thing is the FreeBSD behavior of the system thread that services dummynet: manually binding dummynet to CPU0 decreased the core load from 80% to 0.1%.
The article is worth reading.
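A rough sketch of the dummynet part of that approach, assuming dummynet is actually in use and using standard procstat(1)/cpuset(1) (the TID below is hypothetical; look up the real one first):
procstat -ta | grep dummynet # find the TID of the dummynet kernel thread
cpuset -l 0 -t 100123 # bind that thread (TID 100123 in this example) to CPU core 0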
-
Note that almost all of these links are about high-load PPPoE/PPTP/GRE traffic, where 90% of the packets are ~600 bytes in size.
It would be interesting to read detailed comments from the pfSense developers' side, even if we are talking about Netgate-branded hardware (SuperMicro motherboard and case, yes?), because the Intel CPUs are the same, FreeBSD is the same, and all the drivers are the same for your own bare metal and for Netgate hardware.
And in the near future we will only see frequencies increasing, core counts increasing, and energy consumption decreasing. So the proper use of multi-core CPUs in a specialized solution like a “network packet grinder” such as pfSense is still a relevant topic.
-
What sort of increase in throughput do you see by applying that?
Were you seeing very uneven CPU core loading before applying it?
PPPoE is a special case in FreeBSD/pfSense. Only one Rx queue on a NIC will ever be used so only one core.
Steve
-
Note that I understand dummynet was written by Luigi Rizzo as a system shaper for simulating low-quality channels (with high latency, packet drops, etc.), which were more common in 2008-2010.
Nowadays ALTQ and Netgraph work better on fast 1-10-100G links.
I am not speaking specifically about dummynet or PPPoE/GRE, but more about how to effectively load multi-CPU systems. Because in the firewall case, a system with 2-4 Intel CPUs (E or X server series) and independent RAM banks on each CPU IS MORE EFFECTIVE THAN a system based on a single, but high-frequency, CPU.
Effective because this means the ability to “fine tune” pfSense (FreeBSD) for professional cases, for example:
- in a small ISP with a high load of PPPoE/GRE traffic;
- in a mid-sized company network with a lot of traffic with small packets (~500-800 bytes);
- in broadcasting services/platforms oriented toward mobile clients (with a lot of reconnections and small packet sizes);
- ...
The initial question in this thread means:
1. How do the processes and FreeBSD services in the pfSense bundle utilize the cores and memory in multi-CPU systems? What is their behavior?
2. When I understand the behavior of each process / system service, I can easily tune pfSense for each use case to achieve MORE BANDWIDTH and LESS LATENCY without spending another $2-3k on a new server + NICs.
From my point of view this is reasonable nowadays, when every company tries to cut costs on a tight budget due to the economic situation on one side, and the need for online services is rapidly increasing (due to COVID-19) on the other side.
-
Hm. It looks like the right answer is hard to find...
I need to expand a little on the question that started this topic:
What system is better for network-related operation (i.e. firewall, load balancing, gateway, proxy, media streaming, ...):
a) 1 CPU with 4-10 cores, high frequency
b) 2-4 CPUs with 4-6 cores each, mid frequency
And how does the CPU cache, L2 (2-56 MB) and L3 (2-57 MB), impact network-related operation (in cooperation with the NIC)?
-
Has anything changed in this regard after the FreeBSD 13-based pfSense rolled out? Better CPU usage? Are more cores better than CPU frequency? Etc...
-
@sergei_shablovsky Your question from 2/2021 is slightly flawed. The CPU package count is not relevant. Cores (threads) and frequency are relevant.
For tasks that are single threaded, frequency is what you want; for tasks that are multi threaded, you want enough threads to allow the concurrency you need. The result is a balance based on your goals. If single threaded tasks are your number one concern, you will lean toward frequency at the expense of cores. However, if you have several packages and many NICs, you will lean toward core count at the expense of frequency, because you will have many threads needing to execute at the same time and it is more efficient for the computer to have many threads than to have to share.
I hope that helps.
For tasks that are single threaded frequency is what you want, for tasks that are multi threaded you want enough threads to allow the concurrency you need. The result is a balance based on your goals. If single threaded tasks are your number one concern, you will lean to frequency at the expense of cores. However if you have several packages and many NICs, you will lean to core count at the expense of frequency because you will have many threads needing to execute at the same time and it is more efficient for the computer to have many threads vs having to share.I hope that helps.