Netgate Discussion Forum

    Suricata inline with Netgate SG-2440 -- high cpu utilization

    IDS/IPS
      bmeeks:

      No, Snort does not offer inline IPS operation on pfSense. While it is technically possible to configure Snort's DAQ (the layer Snort uses to interface with the physical network) for inline use, it is not efficient because two physical NIC ports are required for each instance: one for input and the other for output. The GUI code does not support inline operation at all; you would have to run Snort from the command line with no GUI.

        boobletins:

        Which NIC driver is that using (or what is the chipset)?

        How much traffic are you trying to push through it?

        I can get 400+ Mbps (my external line rate) with netmap and ~30,000 rules enabled, in addition to another ~900 Mbps on the LAN interface, with an i5 (quad core, ~3 GHz).

        If you stick to a single interface and a reasonably limited ruleset, you may be able to get it working at around 300 Mbps, though the i5 is using Hyperscan with AVX2, which I don't think the Atom processor has.

          Tantamount:

          It's using the igb drivers that are compiled into the kernel:

          igb0@pci0:0:20:0:	class=0x020000 card=0x1f418086 chip=0x1f418086 rev=0x03 hdr=0x00
              vendor     = 'Intel Corporation'
              device     = 'Ethernet Connection I354'
          

          If there is a way to determine the version of the driver, I haven't been able to find it. I suppose one could figure it out from what is normally included in a kernel of this version, assuming nothing custom has been done:

          FreeBSD 11.2-RELEASE-p3 FreeBSD 11.2-RELEASE-p3 #12 220591260a0(factory-RELENG_2_4_4): Thu Sep 20 11:00:13 EDT 2018     root@buildbot3:/crossbuild/244/obj/amd64/as0Ifpf7/crossbuild/244/pfSense/tmp/FreeBSD-src/sys/pfSense  amd64
          

          There are 4 interfaces, but only two are used -- one for external, one for internal traffic. Suricata is only listening on the external interface.

          It's supposed to be a gigabit circuit, but traffic rarely gets anywhere near that. I only have a few of the rulesets enabled -- mostly just the IP reputation ones.

          I don't think this problem has anything to do with resources though. This problem happens when there is practically zero traffic, and none of the usual indicators like memory or CPU utilization are anywhere close to their limits.

          I had similar issues a while back when I was running pfSense as a VM on an Intel i7 3.4 GHz server with a server-class Intel NIC. I purchased the Netgate in the hope that the sponsor of pfSense would have equipment designed to work best with it, but that turned out not to be true.

          Here's the output of dmidecode for the CPU:

          	Socket Designation: P0
          	Type: Central Processor
          	Family: Pentium Pro
          	Manufacturer: GenuineIntel
          	ID: D8 06 04 00 FF FB EB BF
          	Signature: Type 0, Family 6, Model 77, Stepping 8
          	Flags:
          		FPU (Floating-point unit on-chip)
          		VME (Virtual mode extension)
          		DE (Debugging extension)
          		PSE (Page size extension)
          		TSC (Time stamp counter)
          		MSR (Model specific registers)
          		PAE (Physical address extension)
          		MCE (Machine check exception)
          		CX8 (CMPXCHG8 instruction supported)
          		APIC (On-chip APIC hardware supported)
          		SEP (Fast system call)
          		MTRR (Memory type range registers)
          		PGE (Page global enable)
          		MCA (Machine check architecture)
          		CMOV (Conditional move instruction supported)
          		PAT (Page attribute table)
          		PSE-36 (36-bit page size extension)
          		CLFSH (CLFLUSH instruction supported)
          		DS (Debug store)
          		ACPI (ACPI supported)
          		MMX (MMX technology supported)
          		FXSR (FXSAVE and FXSTOR instructions supported)
          		SSE (Streaming SIMD extensions)
          		SSE2 (Streaming SIMD extensions 2)
          		SS (Self-snoop)
          		HTT (Multi-threading)
          		TM (Thermal monitor supported)
          		PBE (Pending break enabled)
          	Version:         Intel(R) Atom(TM) CPU  C2358  @ 1.74GHz
          	Voltage: Unknown
          	External Clock: 200 MHz
          	Max Speed: 1600 MHz
          	Current Speed: 1600 MHz
          	Status: Populated, Enabled
          	Upgrade: None
          	L1 Cache Handle: Not Provided
          	L2 Cache Handle: Not Provided
          	L3 Cache Handle: Not Provided
          	Serial Number: Not Specified
          	Asset Tag: Not Specified
          	Part Number: Not Specified
          	Core Count: 16
          	Characteristics: None
          
          
            boobletins @Tantamount:

            @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

            I don't think this problem has anything to do with resources though. This problem happens when there is practically zero traffic, and none of the usual indicators like memory or CPU utilization are anywhere close to their limits.

            So this is different than the problem described up above?

            @tantamount said:

            Top shows the Suricata process gobbling the CPU resources.

            In terms of the driver, "igb" is all I needed.
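
            (If you ever do need the driver version: FreeBSD exposes the driver's description string through the sysctl device tree. The exact contents of these OIDs vary by driver build, so treat this as a starting point rather than a guarantee:)

```shell
# Query the device tree for the igb(4) driver's description and name.
# On many builds %desc includes a driver version string.
sysctl dev.igb.0.%desc
sysctl dev.igb.0.%driver
```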

            Could you start Suricata in inline mode and then paste the output from:

            cat /var/log/system.log | grep netmap
            

            We're looking for things like:

            Dec  6 23:25:38 rawr kernel: 338.512666 [1071] netmap_grab_packets       bad pkt at 1054 len 4939
            Dec  6 23:25:38 rawr kernel: 338.714285 [1071] netmap_grab_packets       bad pkt at 1073 len 4939
            

            Or similar netmap errors. They may take some time to creep into the log depending on what's happening, but you can usually induce the errors by running a speedtest with netmap enabled (the speedtest will probably freeze your firewall and reset the interfaces repeatedly).

            Please also provide the output from:

            ifconfig igb0
            

            (or whichever interface is running netmap -- remove any sensitive ips) and

            sysctl -a | grep netmap
            
              Tantamount:

              I'm trying to reply, but Akismet is flagging my reply as spam and not letting me. :/

                boobletins @Tantamount:

                @tantamount Yeah, that's frustrating. Can you message me?

                I'm trying to post a general how-to on this right now and that's also being blocked by Akismet, eyeroll.

                  Tantamount:

                  Okay, I've figured out how this breaks.

                  After enabling inline, it seemed to work fine.

                  I did get a little block of text about bad pkt, but everything continued to work just fine. I ran a Google speed test and saw Suricata's utilization exceed 100%; while the rate was slower than when I use legacy mode, it was stable.
                  10 .. 20 .. 30 .. 40 minutes later, everything was still working fine.

                  It wasn't until I did another speed test, this time through Ookla -- and then only when the test began its upload phase -- that things went quickly downhill.

                  syslog began constantly dumping these:

                  Dec  7 23:45:05 kernel: 105.834155 [2925] netmap_transmit           igb0 full hwcur 210 hwtail 44 qlen 165 len 1514 m 0xfffff8010d499400
                  Dec  7 23:45:05 kernel: 105.845217 [2925] netmap_transmit           igb0 full hwcur 210 hwtail 44 qlen 165 len 66 m 0xfffff8010d4a2d00
                  Dec  7 23:45:05 dpinger: WAN_DHCP x.x.x.x: sendto error: 55
                  

                  Once these started to flood in, the suricata process pegged itself at 100%.

                  To fix it, I changed the settings back to legacy, restarted Suricata, and then had to forcefully 'kill -s 9' the old suricata process.

                    boobletins:

                    @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                    Ok, so the host RX ring is full and packets are being dropped by netmap because it has no place to send them (the host cannot accept them).

                    Then what happens is the pfSense watchdog notices the interface has high packet loss and starts trying to cycle the interface. It's downhill from there.

                    This is likely because the machine doesn't have the cpu power to handle things, but there could be other issues as well.

                    How much available RAM does the SG-2440 generally have with your setup? We can adjust some settings to buy some time (larger buffers, more rings, etc) -- but that'll be a stopgap and with only 4GB total on the SG-2440 it may not help.

                    Next time you try, see if top -H gives you more detail on the thread in question (for example suricata{RX#01-igb0} vs suricata{W#03} -- we'd like to know which type of thread is blocking).

                    We can start by trying to limit the processing Suricata does on the interface -- try disabling all rules via the Categories tab in the UI. The goal is to see whether this is fundamentally a netmap issue or a processing-power issue. Try the speedtest in that configuration.

                    It seems to me that, in principle, a speedtest can saturate a connection to the point that it starts dropping packets -- that's almost the point: to determine how fast you can transmit/receive before you run into issues. So this particular netmap error may not be telling us much. But since Suricata is pegged at 100% CPU... I don't know. Try disabling the rules and let me know what happens.

                    Also: it will be very helpful if you can give me the information requested above. The output from ifconfig (mtu and also flags= and options= data) along with your current netmap settings will tell us a lot. Message me if it won't let you paste it here.

                      Tantamount:

                      I just wanted to follow up in case anyone else stumbles on this thread. We moved to chat due to the spam issue.

                      We tried two things: disabling flow control, and increasing the ring_num netmap value.

                      dev.igb.0.fc = 0 (Default is 3)
                      dev.netmap.ring_num = 1024 (Default is 200)

                      This helped -- I'd still see those errors, but not in the amount that would cause the lockout problem.
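
                      For anyone reproducing this: both are plain sysctls, so (assuming the OID names on your build match) they can be tested live from the shell before being made permanent under System > Advanced > System Tunables:

```shell
# Apply the two tunables at runtime; verify the OIDs exist first with
# "sysctl dev.igb.0.fc dev.netmap.ring_num".
sysctl dev.igb.0.fc=0            # disable NIC flow control (default 3 = full)
sysctl dev.netmap.ring_num=1024  # larger netmap rings (default 200)
```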

                      However, when I enabled some additional rule categories, the problem returned, so it would seem that the atom CPU is not up to the task for inline filtering.

                      If I understand this correctly, for inline mode to work, traffic has to flow through netmap, but netmap's buffers are finite: if the CPU can't keep up, netmap's rings fill, and then the NIC queue backs up behind netmap. That's when this happens:

                      Dec 8 02:01:15 kernel: 275.078730 [2925] netmap_transmit igb0 full hwcur 586 hwtail 584 qlen 1 len 42 m 0xfffff8014052b800

                      Once the interface gets filled, the watchdog steps in, assumes there's a problem with the interface, and restarts it. However, restarting the interface won't fix the real problem (netmap is full because the CPU can't keep up), so this just makes things worse.

                      The part that still doesn't make sense to me is why the CPU never gets close to full utilization when I use legacy mode. In either mode, Suricata has to look at all of the packets in order to know which rules match, yet Suricata never gets above 20% CPU utilization even when it is handling 4 times the traffic. (Speed tests are approximately 4 times faster in legacy vs inline).

                      dd if=/dev/zero of=/dev/null bs=1M count=100000
                      100000+0 records in
                      100000+0 records out
                      104857600000 bytes transferred in 14.637447 secs (7,163,653,784 bytes/sec)
                      

                      7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?

                        boobletins @Tantamount:

                        @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                        7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?

                        I'm not sure exactly what that command measures, because I'm not sure what FreeBSD is doing with those bytes. At best it looks like a bus/memory speed measurement that has little to do with either your NIC or your CPU's packet-processing speed.

                        Without checking out the Suricata code in detail ( https://github.com/OISF/suricata ), I can't say exactly how legacy mode works. But I can speculate a few ways it might appear to be much faster to you without actually being faster:

                        Here is the netmap model:

                        Packet Enters Network -> Packet Enters Suricata -> Packet Passed To User (or dropped)
                        Packet Enters Network -> [LATENCY] -> Packet Passed To User
                        

                        The [LATENCY] represents the work Suricata has to do on each packet before it can be either dropped or passed on. That work takes time (and CPU cycles), and that time sits between, say, your initial request for some data and you getting that data back. All of the processing must complete before the packet moves on to you, so you want that processing to be fast. If it can't keep up with the rate at which packets arrive, packets are temporarily stored to be checked as soon as possible, and two things start to happen: your internet latency increases while you wait for the processing to complete, and memory starts filling up with backlogged packets. If something goes wrong (e.g. your buffers fill up because the processing can't complete fast enough), packets are dropped.
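
                        To make the "latency increases while memory fills" point concrete, a toy calculation (both numbers are invented purely for illustration):

```shell
# A backlog is latency: each queued packet waits behind every packet
# ahead of it. With an (illustrative) backlog and per-packet cost:
backlog_pkts=2048            # packets queued awaiting inspection
inspect_us_per_pkt=15        # per-packet inspection time, microseconds
added_latency_ms=$(( backlog_pkts * inspect_us_per_pkt / 1000 ))
echo "${added_latency_ms} ms of added latency"   # prints "30 ms of added latency"
```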

                        As I said -- I don't know how legacy mode actually works, but here are 2 ways that it could work:

                        Packet Enters Network -> Packet Copied to Buffer or Disk (no processing, little to no latency) -> Packet Sent to User -> Packet Enters Suricata
                        

                        In this model the packets are only inspected after they get to their destination. This means that any additional processing latency has no effect on the user's experience. Your internet latency should remain low because it could take Suricata 30 seconds to finish processing the packet and it wouldn't matter to you.

                        It's true that there is still a potential buffer/memory issue -- the CPU can only work so fast in either model -- but in the 2nd model you can cache to disk without incurring massive latency. You couldn't do that in the netmap model.

                        The other thing to consider is -- ok, let's say that your memory buffers are all full of packets waiting to be inspected and Suricata isn't doing any writing to disk. What happens now? Packets get dropped -- just like they do in netmap mode -- but now you are none the wiser because your user experience is the same. The packets are dropped from analysis, not from delivery.

                        I suppose what I'm saying is: when Suricata sits between you and packet delivery (as with netmap), your CPU must be able to process packets faster than they can be sent/received. You have a little bit of burst buffer with RAM, but sustained high speeds will require processing power. With 50,000-odd ET Pro rules, that's a lot of processing that needs to be completed in short order. Hyperscan helps, and compiling it with AVX2 support helps even more -- but the Atom doesn't support that.
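
                        To put a rough number on "faster than they can be sent/received", here's a back-of-envelope sketch assuming full-size 1514-byte frames (small packets make the budget far tighter):

```shell
# Per-packet inspection budget at a given line rate.
rate_bps=1000000000   # 1 Gbps line rate
pkt_bytes=1514        # full-size Ethernet frame
pps=$(( rate_bps / (pkt_bytes * 8) ))   # packets per second at line rate
ns_per_pkt=$(( 1000000000 / pps ))      # time budget per packet, nanoseconds
echo "${pps} pps, ${ns_per_pkt} ns per packet"
```

                        That's roughly 82k packets per second, or about 12 microseconds of CPU time per packet across all worker threads -- every rule category you enable eats into that budget.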

                          boobletins:

                          You can also try increasing the overall memory available to netmap with:

                          dev.netmap.buf_num

                          The default is 163804 (see: https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4 ). Mine is currently set to 983040. I would take this up slowly in increments (restart Suricata / netmap each time) so you don't run out of memory.
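
                          Since those buffers cost RAM, it's worth estimating the footprint before raising the value. A sketch, assuming the default 2048-byte buffer size (dev.netmap.buf_size -- check yours):

```shell
# Approximate RAM consumed by netmap packet buffers alone.
buf_num=163804    # dev.netmap.buf_num default quoted above
buf_size=2048     # dev.netmap.buf_size in bytes (assumed default)
bytes=$(( buf_num * buf_size ))
mib=$(( bytes / 1048576 ))
echo "${mib} MiB of netmap buffer memory"
```

                          So the default is roughly 320 MiB, and my 983040 works out to about 1.9 GiB -- which would be nearly half of the SG-2440's 4 GB, another reason to take it up slowly.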

                          But remember that will only be a temporary stopgap and will not help with sustained heavy traffic.

                            bmeeks @boobletins:

                            @boobletins said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                            I suppose what I'm saying is: when Suricata sits between you and packet delivery (as with netmap) -- then your CPU must be able to process packets faster than they can be sent/received. You have a little bit of burst buffer with RAM, but sustained high-speeds will require processing power. With 50,000 some odd ET Pro rules, that's a lot of processing that needs to be completed in short order. Hyperscan helps and compiling it with AVX2 support helps even more -- but the atom doesn't support that.

                            Legacy Mode in both Suricata and Snort uses the libpcap library (or plain old pcap). That code copies every single packet traversing an interface and sends the copy to Suricata (or Snort, if that package is installed). So in Legacy Mode the IDS/IPS engine is examining and working on copies of packets. The original packets were immediately sent on their merry way either to the kernel stack (if inbound) or to the NIC (if outbound). This is why Legacy Mode blocking is not ideal. The original packet (or even packets in many cases) got sent on ahead while the IDS/IPS engine is looking at the copy (or copies, if several packets are needed before making a decision). That's why Legacy Mode blocking has the option for killing states once a block happens. You need that to disrupt and kill the session that got started by the original packets that made it through while the IDS/IPS was looking at the copies. I call this "leakage".

                            Inline IPS Mode (available only with Suricata) does not have the "leakage" problem. But as @boobletins pointed out above, the network throughput is dependent upon Suricata being fast enough to examine all the packets and either pass on the OK packets or drop the bad packets at essentially line rate.

                              Tantamount @bmeeks:

                              @boobletins said

                              In this model the packets are only inspected after they get to their destination. This means that any additional processing latency has no effect on the user's experience. Your internet latency should remain low because it could take Suricata 30 seconds to finish processing the packet and it wouldn't matter to you.

                              It's true that there is still a potential buffer/memory issue -- the CPU can only work so fast in either model -- but in the 2nd model you can cache to disk without incurring massive latency. You couldn't do that in the netmap model.

                              @bmeeks said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                              Legacy Mode in both Suricata and Snort uses the libpcap library (or plain old pcap). That code copies every single packet traversing an interface and sends the copy to Suricata (or Snort, if that package is installed). So in Legacy Mode the IDS/IPS engine is examining and working on copies of packets. The original packets were immediately sent on their merry way either to the kernel stack (if inbound) or to the NIC (if outbound). This is why Legacy Mode blocking is not ideal.

                              While both describe why there could be differences in latency, neither explains why Suricata legacy CPU usage is 1/4 that of inline for the same traffic.

                              In legacy mode, with the buffer, I would still expect to see Suricata hit max CPU while there are packets to process, but I don't. I'll see maybe 40% utilization max.

                              Could this have anything to do with being able to use multi-core vs not? Or is there blocking that occurs with netmap?

                              I just got my hands on another one of these SG-2440's. When I have time I'll load Linux and see if I see the same performance differences when it is configured inline.

                                boobletins @Tantamount:

                                @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                                While both describe why there could be differences in latency, neither explains why Suricata legacy CPU usage is 1/4 that of inline for the same traffic.

                                So this is true assuming that you hold the time to process the packets constant. There's nothing indicating that's the case. The pcap version could have more waits built in because it isn't responsible for real-time communication. It could also have a form of "waits" built in if it is caching to disk (in which case it would be IO limited, not cpu limited).

                                But if we assume for a moment that the time-to-process is the same and there really is higher cpu usage with netmap, then I would start by reading the runmodes and the packet capture documentation under performance.

                                There are several considerations in the performance section -- but I would start with this bit from the load-balancing section:

                                The AF_PACKET and PF_RING capture methods both have options to select the ‘cluster-type’. These default to ‘cluster_flow’, which instructs the capture method to hash by flow (5 tuple). This hash is symmetric. Netmap does not have a cluster_flow mode built-in. It can be added separately by using the ‘lb’ tool: https://github.com/luigirizzo/netmap/tree/master/apps/lb

                                Using lb would require moderate customization (I don't know if it's in the default FreeBSD or not). You would then also have to change Suricata runmodes and some other things. This link may provide a starting point.

                                max-pending-packets: 4096
                                
                                # Runmode the engine should use.
                                runmode: autofp
                                
                                # Specifies the kind of flow load balancer used by the flow pinned autofp mode.
                                autofp-scheduler: active-packets
                                
                                ...
                                
                                # Suricata is multi-threaded. Here the threading can be influenced.
                                threading:
                                  set-cpu-affinity: no
                                  detect-thread-ratio: 1.0
                                

                                It looks to me like Suricata in legacy mode is using autofp, which means there wouldn't be any load balancing, so I'm not sure the above is the issue.

                                You also have options with the threading, processor affinity, and max-pending-packets settings.
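
                                For reference, the relevant stanza lives in suricata.yaml (the key names below are from the stock config; the values are purely illustrative, and on a 2-core box there's little room to pin with):

```yaml
# Illustrative suricata.yaml threading stanza -- values are examples only.
threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0 ]            # keep management threads on core 0
    - worker-cpu-set:
        cpu: [ "1" ]          # pin packet workers to core 1
        mode: "exclusive"
  detect-thread-ratio: 1.0
```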

                                There are also some funky things that happen with interrupts in the netmap driver. If I recall correctly, when I read the igb code they chose a fixed interrupt rate at half the ring size. Yeah -- here it is.

                                There's no guarantee that's the most efficient interrupt frequency, but we're getting well outside my understanding now. Still, that's a possible explanation.
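
                                If anyone wants to experiment, the lb split would look roughly like this. I'm going from my reading of the netmap apps/lb README, so treat the flags and the pipe names as assumptions and verify with lb -h before relying on them:

```shell
# Hypothetical sketch: hash igb0 traffic by flow across 2 netmap pipes
# (flag syntax from the netmap apps/lb README -- verify with "lb -h").
lb -i igb0 -p suri:2 &
# Each Suricata netmap thread would then attach to one pipe
# (netmap pipe naming convention; confirm the {/} side on your build):
#   netmap:suri}0  and  netmap:suri}1
```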

                                  boobletins:

                                  Another thing for you to consider: I'm not sure how you're testing throughput at the moment. I asked you to run a speedtest as an ad hoc confirmation that you weren't dropping packets.

                                  That may or may not be a good way to judge the performance of Suricata in netmap mode depending on how your runmode and threading settings are set. If the speedtest is a single flow then all of the Suricata analysis of that flow would be stuck on a single core.

                                    boobletins:

                                    Some notes on lb:

                                    lb doesn't currently ship with FreeBSD or pfSense. It's possible to build it from the source repo, but if you do that, it's built against a different version of netmap than the one in the kernel.

                                    Building the new version of netmap + lb from source on FreeBSD 11.2 yields driver build errors and it's downhill from there.

                                    This package: https://github.com/bro/packet-bricks is more promising (don't let the "bro" dissuade you).

                                    If I knew how, I would try to put together a pfSense package for packet-bricks. It would help in some cases with Suricata processing because it would allow for better load balancing across CPUs in combination with Suricata's CPU-affinity settings.

                                    packet-bricks is run by the ICSI lab at Berkeley. It's a version of lb (also requires netmap) with creature comforts and additional capabilities.

                                    If I'm reading the commits correctly, the lb tool from the creator of netmap was recently added to FreeBSD as well, but I can't tell when it will be available...

                                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.