Suricata inline with Netgate SG-2440 -- high cpu utilization



  • I did a search for this and found an 8-month-old thread about Suricata inline with high CPU utilization:
    https://forum.netgate.com/topic/127358/suricata-inline-high-cpu-with-no-rules

    I'm curious to know if anyone has gotten Suricata inline mode working correctly with the Netgate SG-2440 (Intel(R) Atom(TM) CPU C2358 @ 1.74GHz 2 CPUs: AES-NI CPU Crypto: Yes (active) )? Is the Suricata <-> Netmap stuff still considered experimental?

    2.4.4-RELEASE (amd64)
    built on Thu Sep 20 09:33:19 EDT 2018
    FreeBSD 11.2-RELEASE-p3

    It seems to work for a while before endpoints have trouble connecting and eventually the web GUI becomes unresponsive (ssh still works though). Top shows the Suricata process gobbling the CPU resources.

    Switching to legacy mode barely touches the CPU, but with "block on alert", at least one packet gets through before the traffic is blocked.



  • Yes, the Netmap side is still experimental. Your issue could be within the Netmap code of the Suricata binary, it could be something weird in that particular NIC driver, or it could be something poorly implemented within the FreeBSD kernel itself with regard to Netmap.



  • Thanks bmeeks.

    If my goal is to block traffic so that not even one packet gets through, is there an alternative? Does Snort provide inline mode with pfSense?

    Is there a turn-key appliance like the Netgate that is known to work happily with Suricata inline on pfSense? (This question is for anyone out there.)



  • No, Snort does not offer inline IPS operation on pfSense. While it is technically possible to configure with DAQ (the library Snort uses to interface with the physical network layer), it is not efficient because two physical NIC ports are required for each instance: one for input and the other for output. The GUI code does not support inline operation at all; you would have to run Snort from the command line with no GUI.



  • Which NIC driver is that using (or what is the chipset)?

    How much traffic are you trying to push through it?

    I can get 400+ Mbps (my external line rate) with netmap and ~30,000 rules enabled, in addition to another ~900 Mbps on the LAN interface, with an i5 (quad core, ~3 GHz).

    If you stick to a single interface and a reasonably limited ruleset, you may be able to get it working at around 300 Mbps, though the i5 is using Hyperscan with AVX2, which I don't think the Atom processor has.
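    For reference, one way to check whether a CPU advertises AVX2 on FreeBSD is to look at the feature flags the kernel logs at boot (a sketch -- it assumes the boot messages are still available in /var/run/dmesg.boot):

```shell
# Check whether the CPU advertises AVX2 (FreeBSD).
# The kernel lists "Structured Extended Features" at boot.
if grep -q AVX2 /var/run/dmesg.boot; then
    echo "AVX2 supported"
else
    echo "AVX2 not supported"
fi
```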



  • It's using the igb driver that is compiled into the kernel:

    igb0@pci0:0:20:0:	class=0x020000 card=0x1f418086 chip=0x1f418086 rev=0x03 hdr=0x00
        vendor     = 'Intel Corporation'
        device     = 'Ethernet Connection I354'
    

    If there is a way to determine the version of the driver, I haven't been able to find it. I suppose one could figure it out from what is normally included in the kernel for a given version, assuming nothing custom has been done:

    FreeBSD 11.2-RELEASE-p3 FreeBSD 11.2-RELEASE-p3 #12 220591260a0(factory-RELENG_2_4_4): Thu Sep 20 11:00:13 EDT 2018     root@buildbot3:/crossbuild/244/obj/amd64/as0Ifpf7/crossbuild/244/pfSense/tmp/FreeBSD-src/sys/pfSense  amd64
    
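    For what it's worth, the igb driver normally logs a version string when it attaches, so the boot messages may show it (a sketch -- the output format assumes a stock FreeBSD/pfSense kernel):

```shell
# The driver typically prints a version string at attach time, e.g.
#   igb0: <Intel(R) PRO/1000 Network Connection, Version - x.y.z> ...
grep '^igb0' /var/run/dmesg.boot

# sysctl also exposes the device description string:
sysctl dev.igb.0.%desc
```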

    There are 4 interfaces, but only two are used -- one for external, one for internal traffic. Suricata is only listening on the external interface.

    It's supposed to be a gigabit circuit, but traffic rarely gets anywhere near that. I only have a few of the rulesets enabled -- mostly just the IP reputation ones.

    I don't think this problem has anything to do with resources though. This problem happens when there is practically zero traffic, and none of the usual indicators like memory or CPU utilization are anywhere close to their limits.

    I had similar issues a while back when I was running pfSense as a VM on an Intel i7 3.4 GHz server with a server-class Intel NIC. I purchased the Netgate in the hope that the sponsor of pfSense would have equipment designed to work best with it, but that turned out not to be true.

    Here's the output of dmidecode for the CPU:

    	Socket Designation: P0
    	Type: Central Processor
    	Family: Pentium Pro
    	Manufacturer: GenuineIntel
    	ID: D8 06 04 00 FF FB EB BF
    	Signature: Type 0, Family 6, Model 77, Stepping 8
    	Flags:
    		FPU (Floating-point unit on-chip)
    		VME (Virtual mode extension)
    		DE (Debugging extension)
    		PSE (Page size extension)
    		TSC (Time stamp counter)
    		MSR (Model specific registers)
    		PAE (Physical address extension)
    		MCE (Machine check exception)
    		CX8 (CMPXCHG8 instruction supported)
    		APIC (On-chip APIC hardware supported)
    		SEP (Fast system call)
    		MTRR (Memory type range registers)
    		PGE (Page global enable)
    		MCA (Machine check architecture)
    		CMOV (Conditional move instruction supported)
    		PAT (Page attribute table)
    		PSE-36 (36-bit page size extension)
    		CLFSH (CLFLUSH instruction supported)
    		DS (Debug store)
    		ACPI (ACPI supported)
    		MMX (MMX technology supported)
    		FXSR (FXSAVE and FXSTOR instructions supported)
    		SSE (Streaming SIMD extensions)
    		SSE2 (Streaming SIMD extensions 2)
    		SS (Self-snoop)
    		HTT (Multi-threading)
    		TM (Thermal monitor supported)
    		PBE (Pending break enabled)
    	Version:         Intel(R) Atom(TM) CPU  C2358  @ 1.74GHz
    	Voltage: Unknown
    	External Clock: 200 MHz
    	Max Speed: 1600 MHz
    	Current Speed: 1600 MHz
    	Status: Populated, Enabled
    	Upgrade: None
    	L1 Cache Handle: Not Provided
    	L2 Cache Handle: Not Provided
    	L3 Cache Handle: Not Provided
    	Serial Number: Not Specified
    	Asset Tag: Not Specified
    	Part Number: Not Specified
    	Core Count: 16
    	Characteristics: None
    
    


  • @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

    I don't think this problem has anything to do with resources though. This problem happens when there is practically zero traffic, and none of the usual indicators like memory or CPU utilization are anywhere close to their limits.

    So this is different than the problem described up above?

    @tantamount said:

    Top shows the Suricata process gobbling the CPU resources.

    In terms of the driver, "igb" is all I needed.

    Could you start Suricata in inline mode and then paste the output from:

    grep netmap /var/log/system.log
    

    We're looking for things like:

    Dec  6 23:25:38 rawr kernel: 338.512666 [1071] netmap_grab_packets       bad pkt at 1054 len 4939
    Dec  6 23:25:38 rawr kernel: 338.714285 [1071] netmap_grab_packets       bad pkt at 1073 len 4939
    

    Or similar netmap errors. They may take some time to creep into the log depending on what's happening, but you can usually induce the errors by running a speedtest with netmap enabled (the speedtest will probably freeze your firewall and reset the interfaces repeatedly).

    Please also provide the output from:

    ifconfig igb0
    

    (or whichever interface is running netmap -- remove any sensitive IPs) and

    sysctl -a | grep netmap
    


  • I'm trying to reply, but Akismet is flagging my reply as spam and not letting me. :/



  • @tantamount Yeah, that's frustrating. Can you message me?

    I'm trying to post a general how-to on this right now and that's also being blocked by Akismet, eyeroll.



  • Okay, I've figured out how this breaks.

    After enabling inline, it seemed to work fine.

    I did get a little block of text about "bad pkt", but everything continued to work just fine. I ran a Google speed test, saw Suricata's utilization exceed 100%, and while the rate was slower than in legacy mode, it was stable.
    10 .. 20 .. 30 .. 40 minutes later, everything was still working fine.

    It wasn't until I ran another speed test, this time through Ookla -- and only when the test began its upload phase -- that things went quickly downhill.

    syslog began constantly dumping these:

    Dec  7 23:45:05 kernel: 105.834155 [2925] netmap_transmit           igb0 full hwcur 210 hwtail 44 qlen 165 len 1514 m 0xfffff8010d499400
    Dec  7 23:45:05 kernel: 105.845217 [2925] netmap_transmit           igb0 full hwcur 210 hwtail 44 qlen 165 len 66 m 0xfffff8010d4a2d00
    Dec  7 23:45:05 dpinger: WAN_DHCP x.x.x.x: sendto error: 55
    

    Once these started to flood in, the suricata process pegged itself at 100%.

    To recover, I changed the settings back to legacy mode, restarted Suricata, and then had to forcefully 'kill -s 9' the old suricata process.



  • Ok, so the host RX ring is full and packets are being dropped by netmap because it has no place to send them (the host cannot accept them).

    Then the pfSense watchdog notices the interface has high packet loss and starts trying to cycle the interface. It's downhill from there.

    This is likely because the machine doesn't have the cpu power to handle things, but there could be other issues as well.

    How much available RAM does the SG-2440 generally have with your setup? We can adjust some settings to buy some time (larger buffers, more rings, etc) -- but that'll be a stopgap and with only 4GB total on the SG-2440 it may not help.

    Next time you try, see if top -H gives you more detail on the thread in question (for example suricata{RX#01-igb0} vs suricata{W#03}) -- we'd like to know which type of thread is blocking.
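    (If capturing top -H output is awkward, procstat can list the same thread names -- a sketch, assuming a single suricata process:)

```shell
# List suricata's threads; the thread-name column shows entries like
# RX#01-igb0 (receive) vs W#03 (worker), telling us which type is busy.
procstat -t $(pgrep -x suricata)
```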

    We can start by limiting the processing Suricata does on the interface -- try disabling all rules via the Categories tab in the UI. The goal is to see whether this is fundamentally a netmap issue or a processing-power issue. Try the speed test in that configuration.

    In principle, a speed test can saturate a connection to the point that it starts dropping packets -- that's almost the point: to determine how fast you can transmit/receive before you run into issues. So this particular netmap error may not be telling us much. But since Suricata is pegged at 100% CPU... I don't know. Try disabling the rules and let me know what happens.

    Also: it will be very helpful if you can give me the information requested above. The output from ifconfig (mtu and also flags= and options= data) along with your current netmap settings will tell us a lot. Message me if it won't let you paste it here.



  • I just wanted to follow up in case anyone else stumbles on this thread. We moved to chat due to the spam issue.

    We tried two things: disabling flow control, and increasing the ring_num netmap value.

    dev.igb.0.fc = 0 (Default is 3)
    dev.netmap.ring_num = 1024 (Default is 200)

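    For anyone following along, those two changes can be applied at runtime with sysctl(8), and pfSense can also persist them under System > Advanced > System Tunables (a sketch):

```shell
# Disable flow control on igb0 (default 3 = full flow control)
# and enlarge the netmap rings (default 200 slots).
# Restart Suricata afterward so netmap re-registers the interface.
sysctl dev.igb.0.fc=0
sysctl dev.netmap.ring_num=1024
```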
    This helped -- I'd still see those errors, but not in the amount that would cause the lockout problem.

    However, when I enabled some additional rule categories, the problem returned, so it would seem that the Atom CPU is not up to the task of inline filtering.

    If I understand this correctly, for inline mode to work, traffic has to flow through netmap, but netmap's storage is limited: if the CPU can't keep up, netmap's rings fill, and then the interface queue fills up behind them. That's when this happens:

    Dec 8 02:01:15 kernel: 275.078730 [2925] netmap_transmit igb0 full hwcur 586 hwtail 584 qlen 1 len 42 m 0xfffff8014052b800

    Once the interface gets filled, the watchdog steps in, assumes there's a problem with the interface, and restarts it. However, restarting the interface won't fix the real problem (netmap is full because the CPU can't keep up), so this just makes things worse.

    The part that still doesn't make sense to me is why the CPU never gets close to full utilization when I use legacy mode. In either mode, Suricata has to look at all of the packets in order to know which rules match, yet Suricata never gets above 20% CPU utilization even when it is handling 4 times the traffic. (Speed tests are approximately 4 times faster in legacy vs inline).

    dd if=/dev/zero of=/dev/null bs=1M count=100000
    100000+0 records in
    100000+0 records out
    104857600000 bytes transferred in 14.637447 secs (7,163,653,784 bytes/sec)
    

    7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?



  • @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

    7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?

    So I'm not sure what exactly is being measured by that command, because I'm not sure what FreeBSD is doing with those bytes. At best, that looks like a bus speed measurement that would have little to do with either your NIC or your CPU's packet-processing speed.

    Without checking out the Suricata code in detail ( https://github.com/OISF/suricata ), I can't say exactly how legacy mode works. But I can speculate a few ways it might appear to be much faster to you without actually being faster:

    Here is the netmap model:

    Packet Enters Network -> Packet Enters Suricata -> Packet Passed To User (or dropped)
    Packet Enters Network -> [LATENCY] -> Packet Passed To User
    

    The [LATENCY] represents the work Suricata has to do on each packet before it can be either dropped or passed on. That work takes time (and CPU cycles), and that time sits between, say, your initial request for some data and you getting that data back. All of the processing must be completed before the packet moves to you, which means you want that processing to be fast. If the processing can't keep up with the rate at which packets arrive, packets are temporarily stored to be checked as soon as possible. Two things then start to happen: your internet latency increases while you wait for processing to complete, and memory starts filling up with backlogged packets. If something goes wrong (e.g., your buffers fill up because the processing can't complete fast enough), then a packet is dropped.

    As I said -- I don't know how legacy mode actually works, but here are 2 ways that it could work:

    Packet Enters Network -> Packet Copied to Buffer or Disk (no processing, little to no latency) -> Packet Sent to User -> Packet Enters Suricata
    

    In this model the packets are only inspected after they get to their destination. This means that any additional processing latency has no effect on the user's experience. Your internet latency should remain low because it could take Suricata 30 seconds to finish processing the packet and it wouldn't matter to you.

    It's true that there is still a potential buffer/memory issue -- the CPU can only work so fast in either model -- but in the 2nd model you can cache to disk without incurring massive latency. You couldn't do that in the netmap model.

    The other thing to consider: okay, let's say your memory buffers are all full of packets waiting to be inspected and Suricata isn't doing any writing to disk. What happens now? Packets get dropped -- just like they do in netmap mode -- but now you are none the wiser because your user experience is the same. The packets are dropped from analysis, not from delivery.

    I suppose what I'm saying is: when Suricata sits between you and packet delivery (as with netmap), your CPU must be able to process packets faster than they can be sent/received. You have a little bit of burst buffer with RAM, but sustained high speeds will require processing power. With some 50,000-odd ET Pro rules, that's a lot of processing to complete in short order. Hyperscan helps, and compiling it with AVX2 support helps even more -- but the Atom doesn't support that.



  • You can also try increasing the overall memory available to netmap with:

    dev.netmap.buf_num

    The default is 163804 (see: https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4 ). Mine is currently set to 983040. I would take this up slowly in increments (restart Suricata / netmap each time) so you don't run out of memory.
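    For example (a sketch -- raise it gradually and restart Suricata after each change so netmap re-registers with the new buffer count; the target value here is just an illustration):

```shell
# Check the current netmap buffer count, then raise it in steps.
sysctl dev.netmap.buf_num
sysctl dev.netmap.buf_num=327680    # example intermediate value
```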

    But remember that will only be a temporary stopgap and will not help with sustained heavy traffic.



  • @boobletins said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

    Without checking out the Suricata code in detail ( https://github.com/OISF/suricata ), I can't say exactly how legacy mode works.

    Legacy Mode in both Suricata and Snort uses the libpcap library (or plain old pcap). That code copies every single packet traversing an interface and sends the copy to Suricata (or Snort, if that package is installed). So in Legacy Mode the IDS/IPS engine is examining and working on copies of packets, while the original packets were immediately sent on their merry way either to the kernel stack (if inbound) or to the NIC (if outbound).

    This is why Legacy Mode blocking is not ideal. The original packet (or even several packets, in many cases) got sent on ahead while the IDS/IPS engine was looking at the copy (or copies, if several packets are needed before making a decision). That's why Legacy Mode blocking has the option for killing states once a block happens: you need that to disrupt and kill the session started by the original packets that made it through while the IDS/IPS was looking at the copies. I call this "leakage".

    Inline IPS Mode (available only with Suricata) does not have the "leakage" problem. But as @boobletins pointed out above, the network throughput is dependent upon Suricata being fast enough to examine all the packets and either pass on the OK packets or drop the bad packets at essentially line rate.