Netgate Discussion Forum

    Suricata inline with Netgate SG-2440 -- high cpu utilization

    IDS/IPS
      bmeeks:

      No, Snort does not offer inline IPS operation on pfSense. While it is technically possible to configure Snort's DAQ (the layer Snort uses to interface with the physical network) for inline use, it is not efficient because two physical NIC ports are required for each instance: one for input and the other for output. The GUI code does not support inline operation at all; you would have to run Snort from the command line with no GUI.

        boobletins:

        Which NIC driver is that using (or what is the chipset)?

        How much traffic are you trying to push through it?

        I can get 400+ Mbps (my external line rate) with netmap and ~30,000 rules enabled, in addition to another ~900 Mbps on the LAN interface, with an i5 (quad core, ~3 GHz).

        If you stick to a single interface and a reasonably limited ruleset, you may be able to get it working at around 300 Mbps, though the i5 is using Hyperscan with AVX2, which I don't think the Atom processor has.

          Tantamount:

          It's using the igb drivers that are compiled into the kernel:

          igb0@pci0:0:20:0:	class=0x020000 card=0x1f418086 chip=0x1f418086 rev=0x03 hdr=0x00
              vendor     = 'Intel Corporation'
              device     = 'Ethernet Connection I354'
          

          If there is a way to determine the version of the driver, I haven't been able to find it. I suppose one could figure it out from what is normally included in a kernel of this version, assuming nothing custom has been done:

          FreeBSD 11.2-RELEASE-p3 FreeBSD 11.2-RELEASE-p3 #12 220591260a0(factory-RELENG_2_4_4): Thu Sep 20 11:00:13 EDT 2018     root@buildbot3:/crossbuild/244/obj/amd64/as0Ifpf7/crossbuild/244/pfSense/tmp/FreeBSD-src/sys/pfSense  amd64
          

          There are 4 interfaces, but only two are used -- one for external, one for internal traffic. Suricata is only listening on the external interface.

          It's supposed to be a gigabit circuit, but traffic rarely gets anywhere near that. I only have a few of the rulesets enabled -- mostly just the IP reputation ones.

          I don't think this problem has anything to do with resources though. This problem happens when there is practically zero traffic, and none of the usual indicators like memory or CPU utilization are anywhere close to their limits.

          I had similar issues a while back when I was running pfSense as a VM on an Intel i7 3.4 GHz server with a server-class Intel NIC. I purchased the Netgate in the hope that the sponsor of pfSense would have equipment designed to work best with it, but that turned out not to be true.

          Here's the output of dmidecode for the CPU:

          	Socket Designation: P0
          	Type: Central Processor
          	Family: Pentium Pro
          	Manufacturer: GenuineIntel
          	ID: D8 06 04 00 FF FB EB BF
          	Signature: Type 0, Family 6, Model 77, Stepping 8
          	Flags:
          		FPU (Floating-point unit on-chip)
          		VME (Virtual mode extension)
          		DE (Debugging extension)
          		PSE (Page size extension)
          		TSC (Time stamp counter)
          		MSR (Model specific registers)
          		PAE (Physical address extension)
          		MCE (Machine check exception)
          		CX8 (CMPXCHG8 instruction supported)
          		APIC (On-chip APIC hardware supported)
          		SEP (Fast system call)
          		MTRR (Memory type range registers)
          		PGE (Page global enable)
          		MCA (Machine check architecture)
          		CMOV (Conditional move instruction supported)
          		PAT (Page attribute table)
          		PSE-36 (36-bit page size extension)
          		CLFSH (CLFLUSH instruction supported)
          		DS (Debug store)
          		ACPI (ACPI supported)
          		MMX (MMX technology supported)
          		FXSR (FXSAVE and FXSTOR instructions supported)
          		SSE (Streaming SIMD extensions)
          		SSE2 (Streaming SIMD extensions 2)
          		SS (Self-snoop)
          		HTT (Multi-threading)
          		TM (Thermal monitor supported)
          		PBE (Pending break enabled)
          	Version:         Intel(R) Atom(TM) CPU  C2358  @ 1.74GHz
          	Voltage: Unknown
          	External Clock: 200 MHz
          	Max Speed: 1600 MHz
          	Current Speed: 1600 MHz
          	Status: Populated, Enabled
          	Upgrade: None
          	L1 Cache Handle: Not Provided
          	L2 Cache Handle: Not Provided
          	L3 Cache Handle: Not Provided
          	Serial Number: Not Specified
          	Asset Tag: Not Specified
          	Part Number: Not Specified
          	Core Count: 16
          	Characteristics: None
          
          
            boobletins @Tantamount:

            @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

            I don't think this problem has anything to do with resources though. This problem happens when there is practically zero traffic, and none of the usual indicators like memory or CPU utilization are anywhere close to their limits.

            So this is different than the problem described up above?

            @tantamount said:

            Top shows the Suricata process gobbling the CPU resources.

            In terms of the driver, "igb" is all I needed.
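
            (If you ever do need the driver version: FreeBSD exposes the driver's description string through the sysctl device tree. The exact contents of these OIDs vary by driver build, so treat this as a starting point rather than a guarantee:)

```shell
# Query the device tree for the igb(4) driver's description and name.
# On many builds %desc includes a driver version string.
sysctl dev.igb.0.%desc
sysctl dev.igb.0.%driver
```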

            Could you start Suricata in inline mode and then paste the output from:

            cat /var/log/system.log | grep netmap
            

            We're looking for things like:

            Dec  6 23:25:38 rawr kernel: 338.512666 [1071] netmap_grab_packets       bad pkt at 1054 len 4939
            Dec  6 23:25:38 rawr kernel: 338.714285 [1071] netmap_grab_packets       bad pkt at 1073 len 4939
            

            Or similar netmap errors. They may take some time to creep into the log depending on what's happening, but you can usually induce the errors by running a speedtest with netmap enabled (the speedtest will probably freeze your firewall and reset the interfaces repeatedly).

            Please also provide the output from:

            ifconfig igb0
            

            (or whichever interface is running netmap -- remove any sensitive ips) and

            sysctl -a | grep netmap
            
              Tantamount:

              I'm trying to reply, but Akismet is flagging my reply as spam and not letting me. :/

                boobletins @Tantamount:

                @tantamount Yeah, that's frustrating. Can you message me?

                I'm trying to post a general how-to on this right now and that's also being blocked by Akismet, eyeroll.

                  Tantamount:

                  Okay, I've figured out how this breaks.

                  After enabling inline, it seemed to work fine.

                  I did get a little block of text about bad pkt, but everything continued to work just fine. I ran a Google speed test and saw Suricata's utilization exceed 100%; while the rate was slower than when I use legacy mode, it was stable.
                  10 .. 20 .. 30 .. 40 minutes later, everything was still working fine.

                  It wasn't until I did another speed test, this time through Ookla -- and then only when the test began its upload phase -- that things went quickly downhill.

                  syslog began constantly dumping these:

                  Dec  7 23:45:05 kernel: 105.834155 [2925] netmap_transmit           igb0 full hwcur 210 hwtail 44 qlen 165 len 1514 m 0xfffff8010d499400
                  Dec  7 23:45:05 kernel: 105.845217 [2925] netmap_transmit           igb0 full hwcur 210 hwtail 44 qlen 165 len 66 m 0xfffff8010d4a2d00
                  Dec  7 23:45:05 dpinger: WAN_DHCP x.x.x.x: sendto error: 55
                  

                  Once these started to flood in, the suricata process pegged itself at 100%.

                  To fix it, I changed the settings back to legacy, restarted Suricata, and then had to forcefully 'kill -s 9' the old suricata process.

                    boobletins:

                    @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                    Ok, so the host RX ring is full and packets are being dropped by netmap because it has no place to send them (the host cannot accept them).

                    Then what happens is the pfSense watchdog notices the interface has high packet loss and starts trying to cycle the interface. It's downhill from there.

                    This is likely because the machine doesn't have the cpu power to handle things, but there could be other issues as well.

                    How much available RAM does the SG-2440 generally have with your setup? We can adjust some settings to buy some time (larger buffers, more rings, etc) -- but that'll be a stopgap and with only 4GB total on the SG-2440 it may not help.

                    Next time you try, see if top -H gives you more detail on the thread in question (for example suricata{RX#01-igb0} vs suricata{W#03} -- we'd like to know which type of thread is blocking).

                    We can start by trying to limit the processing Suricata does on the interface -- try disabling all rules via the Categories tab in the UI. The goal is to see whether this is fundamentally a netmap issue or a processing-power issue. Try the speedtest in that configuration.

                    It seems to me that, in principle, a speedtest can saturate a connection to the point that it starts dropping packets -- that's almost the point: to determine how fast you can transmit/receive before you run into issues. So this particular netmap error may not be telling us much. But since Suricata is pegged at 100% CPU... I don't know. Try disabling the rules and let me know what happens.

                    Also: it will be very helpful if you can give me the information requested above. The output from ifconfig (mtu and also flags= and options= data) along with your current netmap settings will tell us a lot. Message me if it won't let you paste it here.

                      Tantamount:

                      I just wanted to follow up in case anyone else stumbles on this thread. We moved to chat due to the spam issue.

                      We tried two things: disabling flow control, and increasing the ring_num netmap value.

                      dev.igb.0.fc = 0 (Default is 3)
                      dev.netmap.ring_num = 1024 (Default is 200)

                      This helped -- I'd still see those errors, but not in the amount that would cause the lockout problem.
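
                      For anyone reproducing this: both are plain sysctls, so (assuming the OID names on your build match) they can be tested live from the shell before being made permanent under System > Advanced > System Tunables:

```shell
# Apply the two tunables at runtime; verify the OIDs exist first with
# "sysctl dev.igb.0.fc dev.netmap.ring_num".
sysctl dev.igb.0.fc=0            # disable NIC flow control (default 3 = full)
sysctl dev.netmap.ring_num=1024  # larger netmap rings (default 200)
```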

                      However, when I enabled some additional rule categories, the problem returned, so it would seem that the atom CPU is not up to the task for inline filtering.

                      If I understand this correctly, for inline mode to work, traffic has to flow through netmap, but netmap's buffers are finite: if the CPU can't keep up, netmap's rings fill, and then the NIC queue backs up behind netmap. That's when this happens:

                      Dec 8 02:01:15 kernel: 275.078730 [2925] netmap_transmit igb0 full hwcur 586 hwtail 584 qlen 1 len 42 m 0xfffff8014052b800

                      Once the interface gets filled, the watchdog steps in, assumes there's a problem with the interface, and restarts it. However, restarting the interface won't fix the real problem (netmap is full because the CPU can't keep up), so this just makes things worse.

                      The part that still doesn't make sense to me is why the CPU never gets close to full utilization when I use legacy mode. In either mode, Suricata has to look at all of the packets in order to know which rules match, yet Suricata never gets above 20% CPU utilization even when it is handling 4 times the traffic. (Speed tests are approximately 4 times faster in legacy vs inline).

                      dd if=/dev/zero of=/dev/null bs=1M count=100000
                      100000+0 records in
                      100000+0 records out
                      104857600000 bytes transferred in 14.637447 secs (7,163,653,784 bytes/sec)
                      

                      7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?

                        boobletins @Tantamount:

                        @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                        7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?

                        I'm not sure exactly what that command measures, because I'm not sure what FreeBSD is doing with those bytes. At best it looks like a bus/memory speed measurement that has little to do with either your NIC or your CPU's packet-processing speed.

                        Without checking out the Suricata code in detail ( https://github.com/OISF/suricata ), I can't say exactly how legacy mode works. But I can speculate a few ways it might appear to be much faster to you without actually being faster:

                        Here is the netmap model:

                        Packet Enters Network -> Packet Enters Suricata -> Packet Passed To User (or dropped)
                        Packet Enters Network -> [LATENCY] -> Packet Passed To User
                        

                        The [LATENCY] represents the work Suricata has to do on each packet before it can be either dropped or passed on. That work takes time (and CPU cycles), and that time sits between, say, your initial request for some data and you getting that data back. All of the processing must complete before the packet moves on to you, so you want that processing to be fast. If it can't keep up with the rate at which packets arrive, packets are temporarily stored to be checked as soon as possible, and two things start to happen: your internet latency increases while you wait for the processing to complete, and memory starts filling up with backlogged packets. If something goes wrong (e.g. your buffers fill up because the processing can't complete fast enough), packets are dropped.
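
                        To make the "latency increases while memory fills" point concrete, a toy calculation (both numbers are invented purely for illustration):

```shell
# A backlog is latency: each queued packet waits behind every packet
# ahead of it. With an (illustrative) backlog and per-packet cost:
backlog_pkts=2048            # packets queued awaiting inspection
inspect_us_per_pkt=15        # per-packet inspection time, microseconds
added_latency_ms=$(( backlog_pkts * inspect_us_per_pkt / 1000 ))
echo "${added_latency_ms} ms of added latency"   # prints "30 ms of added latency"
```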

                        As I said -- I don't know how legacy mode actually works, but here are 2 ways that it could work:

                        Packet Enters Network -> Packet Copied to Buffer or Disk (no processing, little to no latency) -> Packet Sent to User -> Packet Enters Suricata
                        

                        In this model the packets are only inspected after they get to their destination. This means that any additional processing latency has no effect on the user's experience. Your internet latency should remain low because it could take Suricata 30 seconds to finish processing the packet and it wouldn't matter to you.

                        It's true that there is still a potential buffer/memory issue -- the CPU can only work so fast in either model -- but in the 2nd model you can cache to disk without incurring massive latency. You couldn't do that in the netmap model.

                        The other thing to consider is -- ok, let's say that your memory buffers are all full of packets waiting to be inspected and Suricata isn't doing any writing to disk. What happens now? Packets get dropped -- just like they do in netmap mode -- but now you are none the wiser because your user experience is the same. The packets are dropped from analysis, not from delivery.

                        I suppose what I'm saying is: when Suricata sits between you and packet delivery (as with netmap), your CPU must be able to process packets faster than they can be sent/received. You have a little bit of burst buffer with RAM, but sustained high speeds will require processing power. With 50,000-odd ET Pro rules, that's a lot of processing that needs to be completed in short order. Hyperscan helps, and compiling it with AVX2 support helps even more -- but the Atom doesn't support that.
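
                        To put a rough number on "faster than they can be sent/received", here's a back-of-envelope sketch assuming full-size 1514-byte frames (small packets make the budget far tighter):

```shell
# Per-packet inspection budget at a given line rate.
rate_bps=1000000000   # 1 Gbps line rate
pkt_bytes=1514        # full-size Ethernet frame
pps=$(( rate_bps / (pkt_bytes * 8) ))   # packets per second at line rate
ns_per_pkt=$(( 1000000000 / pps ))      # time budget per packet, nanoseconds
echo "${pps} pps, ${ns_per_pkt} ns per packet"
```

                        That's roughly 82k packets per second, or about 12 microseconds of CPU time per packet across all worker threads -- every rule category you enable eats into that budget.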

                          boobletins:

                          You can also try increasing the overall memory available to netmap with:

                          dev.netmap.buf_num

                          The default is 163804 (see: https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4 ). Mine is currently set to 983040. I would take this up slowly in increments (restart Suricata / netmap each time) so you don't run out of memory.
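
                          Since those buffers cost RAM, it's worth estimating the footprint before raising the value. A sketch, assuming the default 2048-byte buffer size (dev.netmap.buf_size -- check yours):

```shell
# Approximate RAM consumed by netmap packet buffers alone.
buf_num=163804    # dev.netmap.buf_num default quoted above
buf_size=2048     # dev.netmap.buf_size in bytes (assumed default)
bytes=$(( buf_num * buf_size ))
mib=$(( bytes / 1048576 ))
echo "${mib} MiB of netmap buffer memory"
```

                          So the default is roughly 320 MiB, and my 983040 works out to about 1.9 GiB -- which would be nearly half of the SG-2440's 4 GB, another reason to take it up slowly.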

                          But remember that will only be a temporary stopgap and will not help with sustained heavy traffic.

                            bmeeks @boobletins:

                            @boobletins said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                            I suppose what I'm saying is: when Suricata sits between you and packet delivery (as with netmap) -- then your CPU must be able to process packets faster than they can be sent/received. You have a little bit of burst buffer with RAM, but sustained high-speeds will require processing power. With 50,000 some odd ET Pro rules, that's a lot of processing that needs to be completed in short order. Hyperscan helps and compiling it with AVX2 support helps even more -- but the atom doesn't support that.

                            Legacy Mode in both Suricata and Snort uses the libpcap library (or plain old pcap). That code copies every single packet traversing an interface and sends the copy to Suricata (or Snort, if that package is installed). So in Legacy Mode the IDS/IPS engine is examining and working on copies of packets. The original packets were immediately sent on their merry way either to the kernel stack (if inbound) or to the NIC (if outbound). This is why Legacy Mode blocking is not ideal. The original packet (or even packets in many cases) got sent on ahead while the IDS/IPS engine is looking at the copy (or copies, if several packets are needed before making a decision). That's why Legacy Mode blocking has the option for killing states once a block happens. You need that to disrupt and kill the session that got started by the original packets that made it through while the IDS/IPS was looking at the copies. I call this "leakage".

                            Inline IPS Mode (available only with Suricata) does not have the "leakage" problem. But as @boobletins pointed out above, the network throughput is dependent upon Suricata being fast enough to examine all the packets and either pass on the OK packets or drop the bad packets at essentially line rate.

                              Tantamount @bmeeks:

                              @boobletins said

                              In this model the packets are only inspected after they get to their destination. This means that any additional processing latency has no effect on the user's experience. Your internet latency should remain low because it could take Suricata 30 seconds to finish processing the packet and it wouldn't matter to you.

                              It's true that there is still a potential buffer/memory issue -- the CPU can only work so fast in either model -- but in the 2nd model you can cache to disk without incurring massive latency. You couldn't do that in the netmap model.

                              @bmeeks said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                              Legacy Mode in both Suricata and Snort uses the libpcap library (or plain old pcap). That code copies every single packet traversing an interface and sends the copy to Suricata (or Snort, if that package is installed). So in Legacy Mode the IDS/IPS engine is examining and working on copies of packets. The original packets were immediately sent on their merry way either to the kernel stack (if inbound) or to the NIC (if outbound). This is why Legacy Mode blocking is not ideal.

                              While both describe why there could be differences in latency, neither explains why Suricata legacy CPU usage is 1/4 that of inline for the same traffic.

                              In legacy mode, with the buffer, I would still expect to see Suricata hit max CPU while there are packets to process, but I don't. I'll see maybe 40% utilization max.

                              Could this have anything to do with being able to use multi-core vs not? Or is there blocking that occurs with netmap?

                              I just got my hands on another one of these SG-2440's. When I have time I'll load Linux and see if I see the same performance differences when it is configured inline.

                                boobletins @Tantamount:

                                @tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:

                                While both describe why there could be differences in latency, neither explains why Suricata legacy CPU usage is 1/4 that of inline for the same traffic.

                                So this is true assuming that you hold the time to process the packets constant. There's nothing indicating that's the case. The pcap version could have more waits built in because it isn't responsible for real-time communication. It could also have a form of "waits" built in if it is caching to disk (in which case it would be IO limited, not cpu limited).

                                But if we assume for a moment that the time-to-process is the same and there really is higher cpu usage with netmap, then I would start by reading the runmodes and the packet capture documentation under performance.

                                There are several considerations in the performance section -- but I would start with this bit from the load-balancing section:

                                The AF_PACKET and PF_RING capture methods both have options to select the ‘cluster-type’. These default to ‘cluster_flow’, which instructs the capture method to hash by flow (5 tuple). This hash is symmetric. Netmap does not have a cluster_flow mode built-in. It can be added separately by using the ‘lb’ tool: https://github.com/luigirizzo/netmap/tree/master/apps/lb

                                Using lb would require moderate customization (I don't know if it's in the default FreeBSD or not). You would then also have to change Suricata runmodes and some other things. This link may provide a starting point.

                                max-pending-packets: 4096
                                
                                # Runmode the engine should use.
                                runmode: autofp
                                
                                # Specifies the kind of flow load balancer used by the flow pinned autofp mode.
                                autofp-scheduler: active-packets
                                
                                ...
                                
                                # Suricata is multi-threaded. Here the threading can be influenced.
                                threading:
                                  set-cpu-affinity: no
                                  detect-thread-ratio: 1.0
                                

                                It looks to me like Suricata in legacy mode is using autofp, which means there wouldn't be any load balancing, so I'm not sure the above is the issue.

                                You also have options with the threading, processor affinity, and max-pending-packets settings.
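
                                For reference, the relevant stanza lives in suricata.yaml (the key names below are from the stock config; the values are purely illustrative, and on a 2-core box there's little room to pin with):

```yaml
# Illustrative suricata.yaml threading stanza -- values are examples only.
threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0 ]            # keep management threads on core 0
    - worker-cpu-set:
        cpu: [ "1" ]          # pin packet workers to core 1
        mode: "exclusive"
  detect-thread-ratio: 1.0
```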

                                There are also some funky things that happen with interrupts in the netmap driver. If I recall correctly, when I read the igb code they chose a fixed interrupt rate at half the ring size. Yeah -- here it is.

                                There's no guarantee that's the most efficient interrupt frequency, but we're getting well outside my understanding now. Still, that's a possible explanation.
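
                                If anyone wants to experiment, the lb split would look roughly like this. I'm going from my reading of the netmap apps/lb README, so treat the flags and the pipe names as assumptions and verify with lb -h before relying on them:

```shell
# Hypothetical sketch: hash igb0 traffic by flow across 2 netmap pipes
# (flag syntax from the netmap apps/lb README -- verify with "lb -h").
lb -i igb0 -p suri:2 &
# Each Suricata netmap thread would then attach to one pipe
# (netmap pipe naming convention; confirm the {/} side on your build):
#   netmap:suri}0  and  netmap:suri}1
```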

                                  boobletins:

                                  Another thing for you to consider: I'm not sure how you're testing throughput at the moment. I asked you to run a speedtest as an ad hoc confirmation that you weren't dropping packets.

                                  That may or may not be a good way to judge the performance of Suricata in netmap mode depending on how your runmode and threading settings are set. If the speedtest is a single flow then all of the Suricata analysis of that flow would be stuck on a single core.

                                    boobletins:

                                    Some notes on lb:

                                    lb doesn't currently ship with FreeBSD or pfSense. It's possible to build it from the source repo, but if you do that, it's built against a different version of netmap than the one in the kernel.

                                    Building the new version of netmap + lb from source on FreeBSD 11.2 yields driver build errors and it's downhill from there.

                                    This package: https://github.com/bro/packet-bricks is more promising (don't let the "bro" dissuade you).

                                    If I knew how, I would try to put together a pfSense package for packet-bricks. It would help in some cases with Suricata processing because it would allow for better load balancing across CPUs in combination with Suricata's CPU-affinity settings.

                                    packet-bricks is run by the ICSI lab at Berkeley. It's a version of lb (also requires netmap) with creature comforts and additional capabilities.

                                    If I'm reading the commits correctly, the lb tool from the creator of netmap was recently added to FreeBSD as well, but I can't tell when it will be available...

                                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.