Suricata inline with Netgate SG-2440 -- high cpu utilization
-
Okay, I've figured out how this breaks.
After enabling inline, it seemed to work fine.
I did get a little block of text about a bad pkt, but everything continued to work just fine. I ran a Google speed test, saw Suricata's CPU utilization exceed 100%, and while the rate was slower than when I use legacy mode, it was stable.
10 .. 20 .. 30 .. 40 minutes later, everything was still working fine. It wasn't until I ran another speed test, this time through Ookla, and then only when the test reached the upload phase, that things went quickly downhill.
syslog began constantly dumping these:
Dec 7 23:45:05 kernel: 105.834155 [2925] netmap_transmit igb0 full hwcur 210 hwtail 44 qlen 165 len 1514 m 0xfffff8010d499400
Dec 7 23:45:05 kernel: 105.845217 [2925] netmap_transmit igb0 full hwcur 210 hwtail 44 qlen 165 len 66 m 0xfffff8010d4a2d00
Dec 7 23:45:05 dpinger: WAN_DHCP x.x.x.x: sendto error: 55
Once these started to flood in, the suricata process pegged itself at 100%.
To fix it, I changed the settings back to legacy mode, restarted Suricata, and then had to forcefully 'kill -s 9' the old suricata process.
-
@tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:
Ok, so the host RX ring is full and packets are being dropped by netmap because it has no place to send them (the host cannot accept them).
Then what happens is the pfSense watchdog notices the interface has high packet loss and starts trying to cycle the interface. It's downhill from there.
This is likely because the machine doesn't have the cpu power to handle things, but there could be other issues as well.
How much available RAM does the SG-2440 generally have with your setup? We can adjust some settings to buy some time (larger buffers, more rings, etc) -- but that'll be a stopgap and with only 4GB total on the SG-2440 it may not help.
Next time you try, see if
top -H
gives you more detail on the thread in question (for example, suricata{RX#01-igb0} vs suricata{W#03} -- we'd like to know which type of thread is blocking).
We can start by trying to limit the processing Suricata does on the interface -- try disabling all rules via the Categories tab in the UI. The goal is to see whether this is fundamentally a netmap issue or a processing-power issue. Try the speed test in that configuration.
It seems to me in principle it's possible for a speedtest to saturate a connection to the point that it starts dropping packets -- that's almost the point -- to determine how fast you can transmit/receive before you run into issues. So this particular netmap error may not be telling us much. But since Suricata is pegged at 100% cpu... I don't know. Try disabling the rules and let me know what happens.
Also: it will be very helpful if you can give me the information requested above. The output from ifconfig (mtu and also flags= and options= data) along with your current netmap settings will tell us a lot. Message me if it won't let you paste it here.
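The information requested above can be collected in one pass from a pfSense shell. A sketch, assuming the WAN interface is igb0 as in the log messages (these are FreeBSD/pfSense commands; run them via Diagnostics > Command Prompt or SSH):

```shell
# Interface MTU, plus the flags= and options= lines:
ifconfig igb0

# Current netmap tunables (ring sizes, buffer counts, etc.):
sysctl dev.netmap

# Per-thread CPU usage, to see whether an RX thread or a worker
# thread (e.g. suricata{RX#01-igb0} vs suricata{W#03}) is pegged:
top -H
```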
-
I just wanted to follow up in case anyone else stumbles on this thread. We moved to chat due to the spam issue.
We tried two things: disabling flow control, and increasing the ring_num netmap value.
dev.igb.0.fc = 0 (Default is 3)
dev.netmap.ring_num = 1024 (Default is 200)
This helped -- I'd still see those errors, but not in the amount that would cause the lockout problem.
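For anyone following along, a sketch of how those two values can be applied and persisted (the exact values are the ones from this thread; on pfSense the supported place to persist them is the System Tunables page rather than editing files by hand):

```shell
# Apply at runtime (defaults shown in comments):
sysctl dev.igb.0.fc=0            # NIC flow control off (default is 3)
sysctl dev.netmap.ring_num=1024  # netmap ring slots (default is 200)

# To survive a reboot on stock FreeBSD you'd add the same lines to
# /etc/sysctl.conf; on pfSense, add them under
# System > Advanced > System Tunables so the GUI config stays authoritative.
```

Suricata (and with it the netmap instance) needs to be restarted before ring_num changes take effect.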
However, when I enabled some additional rule categories, the problem returned, so it would seem that the atom CPU is not up to the task for inline filtering.
If I understand this correctly, for inline to work, traffic has to flow through netmap, and because netmap's buffering is finite, if the CPU can't keep up, netmap's buffers fill and the interface queue then backs up behind netmap. That's when this happens:
Dec 8 02:01:15 kernel: 275.078730 [2925] netmap_transmit igb0 full hwcur 586 hwtail 584 qlen 1 len 42 m 0xfffff8014052b800
Once the interface gets filled, the watchdog steps in, assumes there's a problem with the interface, and restarts it. However, restarting the interface won't fix the real problem (netmap is full because the CPU can't keep up), so this just makes things worse.
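That failure sequence can be sketched with a toy queue model (illustration only, not pfSense or netmap code): as long as packets arrive faster than the CPU drains them, the backlog grows at a constant rate, and once the ring is full every excess packet is dropped -- the condition behind the "netmap_transmit igb0 full" messages. The rates and ring size here are made-up numbers:

```shell
# Toy model: 'arrive' packets come in per tick, the CPU drains 'service'
# per tick, and the ring holds 'ring' packets; overflow is dropped.
arrive=150; service=100; ring=1024
queued=0; dropped=0
for tick in $(seq 1 100); do
  queued=$((queued + arrive - service))
  [ "$queued" -lt 0 ] && queued=0
  if [ "$queued" -gt "$ring" ]; then
    dropped=$((dropped + queued - ring))
    queued=$ring
  fi
done
echo "queued=$queued dropped=$dropped"
# prints: queued=1024 dropped=3976
```

With arrive at or below service the model never drops anything, which matches the observation that everything was fine until the upload phase of the speed test pushed the packet rate past what the CPU could drain. A bigger ring only delays the first drop; it can't prevent drops under sustained overload.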
The part that still doesn't make sense to me is why the CPU never gets close to full utilization when I use legacy mode. In either mode, Suricata has to look at all of the packets in order to know which rules match, yet Suricata never gets above 20% CPU utilization even when it is handling 4 times the traffic. (Speed tests are approximately 4 times faster in legacy vs inline).
dd if=/dev/zero of=/dev/null bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes transferred in 14.637447 secs (7,163,653,784 bytes/sec)
7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?
-
@tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:
7 gigs a second. Doesn't all of this point to software being the troublemaker? netmap?
So I'm not sure what exactly is being measured by that command, because I'm not sure what FreeBSD is doing with those bytes. At best that looks like a bus speed measurement that has little to do with either your NIC or your CPU's packet-processing speed.
Without checking out the Suricata code in detail ( https://github.com/OISF/suricata ), I can't say exactly how legacy mode works. But I can speculate a few ways it might appear to be much faster to you without actually being faster:
Here is the netmap model:
Packet Enters Network -> Packet Enters Suricata -> Packet Passed To User (or dropped)
Packet Enters Network -> [LATENCY] -> Packet Passed To User
The [LATENCY] represents the work Suricata has to do on each packet before it can be either dropped or passed on. That work takes time (and CPU cycles). That time sits between, say, your initial request for some data and you getting that data back. All of the processing must be completed before the packet moves to you. This means you want that processing to be fast. If the processing isn't fast enough to keep up with the rate at which packets are arriving, then those packets are temporarily stored to be checked as soon as possible. This means 2 things start to happen: your internet latency increases while you wait for the processing to complete and memory starts filling up with backlogged packets. If something goes wrong (eg your buffers fill up because the processing can't be completed fast enough), then a packet is dropped.
As I said -- I don't know how legacy mode actually works, but here are 2 ways that it could work:
Packet Enters Network -> Packet Copied to Buffer or Disk (no processing, little to no latency) -> Packet Sent to User -> Packet Enters Suricata
In this model the packets are only inspected after they get to their destination. This means that any additional processing latency has no effect on the user's experience. Your internet latency should remain low because it could take Suricata 30 seconds to finish processing the packet and it wouldn't matter to you.
It's true that there is still a potential buffer/memory issue -- the CPU can only work so fast in either model -- but in the 2nd model you can cache to disk without incurring massive latency. You couldn't do that in the netmap model.
The other thing to consider is -- ok, let's say that your memory buffers are all full of packets waiting to be inspected and Suricata isn't doing any writing to disk. What happens now? Packets get dropped -- just like they do in netmap mode -- but now you are none-the-wiser because your user experience is the same. The packets are dropped from analysis and not delivery.
I suppose what I'm saying is: when Suricata sits between you and packet delivery (as with netmap) -- then your CPU must be able to process packets faster than they can be sent/received. You have a little bit of burst buffer with RAM, but sustained high-speeds will require processing power. With 50,000 some odd ET Pro rules, that's a lot of processing that needs to be completed in short order. Hyperscan helps and compiling it with AVX2 support helps even more -- but the atom doesn't support that.
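A quick way to confirm what instruction sets the CPU actually offers: on FreeBSD/pfSense the CPU feature flags are printed at boot and kept in the dmesg buffer. A sketch (if the grep prints nothing, the CPU has no AVX2 and Hyperscan falls back to slower code paths):

```shell
# Look for AVX2 among the boot-time CPU feature flags:
grep -i avx2 /var/run/dmesg.boot

# And the CPU model itself:
sysctl hw.model
```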
-
You can also try increasing the overall memory available to netmap with:
dev.netmap.buf_num
The default is 163804 (see: https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4 ). Mine is currently set to 983040. I would take this up slowly in increments (restart Suricata / netmap each time) so you don't run out of memory.
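A sketch of how to step it up (the intermediate value below is just an example). Each buffer costs dev.netmap.buf_size bytes (2048 by default), so 983040 buffers is roughly 2 GB if fully allocated -- a lot on a 4 GB box, which is why increments matter:

```shell
# Check the current values first:
sysctl dev.netmap.buf_num
sysctl dev.netmap.buf_size

# Raise it gradually rather than jumping straight to a large value:
sysctl dev.netmap.buf_num=327680
# ...then restart Suricata so netmap reallocates, watch free RAM, repeat.
```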
But remember that will only be a temporary stopgap and will not help with sustained heavy traffic.
-
@boobletins said in Suricata inline with Netgate SG-2440 -- high cpu utilization:
Without checking out the Suricata code in detail, I can't say exactly how legacy mode works. [...] With 50,000 some odd ET Pro rules, that's a lot of processing that needs to be completed in short order. Hyperscan helps and compiling it with AVX2 support helps even more -- but the atom doesn't support that.
Legacy Mode in both Suricata and Snort uses the libpcap library (or plain old pcap). That code copies every single packet traversing an interface and sends the copy to Suricata (or Snort, if that package is installed). So in Legacy Mode the IDS/IPS engine is examining and working on copies of packets. The original packets were immediately sent on their merry way either to the kernel stack (if inbound) or to the NIC (if outbound). This is why Legacy Mode blocking is not ideal. The original packet (or even packets in many cases) got sent on ahead while the IDS/IPS engine is looking at the copy (or copies, if several packets are needed before making a decision). That's why Legacy Mode blocking has the option for killing states once a block happens. You need that to disrupt and kill the session that got started by the original packets that made it through while the IDS/IPS was looking at the copies. I call this "leakage".
Inline IPS Mode (available only with Suricata) does not have the "leakage" problem. But as @boobletins pointed out above, the network throughput is dependent upon Suricata being fast enough to examine all the packets and either pass on the OK packets or drop the bad packets at essentially line rate.
-
@boobletins said
In this model the packets are only inspected after they get to their destination. This means that any additional processing latency has no effect on the user's experience. Your internet latency should remain low because it could take Suricata 30 seconds to finish processing the packet and it wouldn't matter to you.
It's true that there is still a potential buffer/memory issue -- the CPU can only work so fast in either model -- but in the 2nd model you can cache to disk without incurring massive latency. You couldn't do that in the netmap model.
@bmeeks said in Suricata inline with Netgate SG-2440 -- high cpu utilization:
Legacy Mode in both Suricata and Snort uses the libpcap library (or plain old pcap). That code copies every single packet traversing an interface and sends the copy to Suricata (or Snort, if that package is installed). So in Legacy Mode the IDS/IPS engine is examining and working on copies of packets. The original packets were immediately sent on their merry way either to the kernel stack (if inbound) or to the NIC (if outbound). This is why Legacy Mode blocking is not ideal.
While both describe why there could be differences in latency, neither explains why Suricata legacy CPU usage is 1/4 that of inline for the same traffic.
In legacy mode, with the buffer, I would still expect to see Suricata hit max CPU while there are packets to process, but I don't. I'll see maybe 40% utilization max.
Could this have anything to do with being able to use multi-core vs not? Or is there blocking that occurs with netmap?
I just got my hands on another one of these SG-2440's. When I have time I'll load Linux and see if I see the same performance differences when it is configured inline.
-
@tantamount said in Suricata inline with Netgate SG-2440 -- high cpu utilization:
While both describe why there could be differences in latency, neither explains why Suricata legacy CPU usage is 1/4 that of inline for the same traffic.
So this is true assuming that you hold the time to process the packets constant. There's nothing indicating that's the case. The pcap version could have more waits built in because it isn't responsible for real-time communication. It could also have a form of "waits" built in if it is caching to disk (in which case it would be IO limited, not cpu limited).
But if we assume for a moment that the time-to-process is the same and there really is higher cpu usage with netmap, then I would start by reading the runmodes and the packet capture documentation under performance.
There are several considerations in the performance section -- but I would start with this bit from the load-balancing section:
The AF_PACKET and PF_RING capture methods both have options to select the 'cluster-type'. These default to 'cluster_flow', which instructs the capture method to hash by flow (5-tuple). This hash is symmetric. Netmap does not have a cluster_flow mode built in. It can be added separately by using the lb tool: https://github.com/luigirizzo/netmap/tree/master/apps/lb
Using lb would require moderate customization (I don't know if it's in the default FreeBSD or not). You will then also have to change suricata run modes and some other things. This link may provide a starting point.
max-pending-packets: 4096

# Runmode the engine should use.
runmode: autofp

# Specifies the kind of flow load balancer used by the flow pinned autofp mode.
autofp-scheduler: active-packets

...

# Suricata is multi-threaded. Here the threading can be influenced.
threading:
  set-cpu-affinity: no
  detect-thread-ratio: 1.0
It looks to me like Suricata in legacy mode is using autofp, which means there wouldn't be any load balancing, so I'm not sure the above is the issue.
You also have options with the threading, processer affinity, and max pending packets settings.
There are also some funky things that happen with interrupts in the netmap driver. If I recall correctly, when I read the igb code, they chose a fixed interrupt threshold at half the ring size. Yeah -- here it is.
There's no guarantee that's the most efficient interrupt frequency, but we're getting well outside my understanding now. Still, that's a possible explanation.
-
Another thing for you to consider: I'm not sure how you're testing throughput at the moment. I asked you to run a speedtest as an ad hoc confirmation that you weren't dropping packets.
That may or may not be a good way to judge the performance of Suricata in netmap mode depending on how your runmode and threading settings are set. If the speedtest is a single flow then all of the Suricata analysis of that flow would be stuck on a single core.
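One way to test the single-flow hypothesis is to compare one flow against several parallel flows. A sketch using iperf3, assuming you have an iperf3 server you control on the far side of the WAN (replace <server> with its address); if the parallel run is substantially faster per-total than the single flow, that points at a per-flow/per-core bottleneck rather than raw CPU exhaustion:

```shell
# Single flow -- all analysis of this flow lands on one thread:
iperf3 -c <server> -t 30

# Eight parallel flows -- with flow-based load balancing these can be
# spread across multiple threads/cores:
iperf3 -c <server> -P 8 -t 30

# Repeat in the download direction with -R:
iperf3 -c <server> -P 8 -R -t 30
```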
-
Some notes on lb:
lb doesn't currently ship with FreeBSD or pfSense. It's possible to build it from the source repo, but the netmap you get that way isn't the same version that ships with the FreeBSD kernel.
Building the new version of netmap + lb from source on FreeBSD 11.2 yields driver build errors and it's downhill from there.
This package: https://github.com/bro/packet-bricks is more promising (don't let the "bro" dissuade you).
If I knew how, I would try to put together a pfSense package for packet-bricks. It would help in some cases with Suricata processing because it would allow for better load balancing across CPUs in combination with Suricata's CPU affinity settings.
packet-bricks is run by the ICSI lab at Berkeley. It's a version of lb (also requires netmap) with creature comforts and additional capabilities.
If I'm reading the commits correctly the lb tool from the creator of netmap was recently added to FreeBSD as well, but I can't tell when it will be available...