Suricata sometimes pegs CPU after rule update



  • I have a white-box pfSense 2.4.4-RELEASE-p1 machine with an E5-2609 v2 and 8 GB RAM. I am running the current package-manager version of Suricata in IPS mode on WAN. After some rule updates (using Emerging Threats Open Rules, Snort Subscriber Rules, and Snort GPLv2 Community Rules) Suricata goes nuts and pegs the CPU. Normal CPU usage is in the 3-6% range, but it jumps to 50-70% or higher when things break.
    When Live Rule Swap was disabled, restarting the Suricata service would not fix anything and a reboot was required, though I didn't notice much of a performance hit from a user perspective. When I tried enabling Live Swap there was a noticeable performance problem for users, but restarting the service fixed everything.
    This does not happen after every update; it seems to hit randomly every few weeks. Does anyone know how to prevent it, or have any troubleshooting ideas for the next time it breaks?



  • Anything in system/suricata logs? Are you in IPS or legacy mode?

    Suricata will peg CPU on a rule update temporarily as it processes the new rules. It sounds like you're saying it stays that way for some period of time?

    I've seen this once or twice in my setup as well. I also use ET + Snort Subscriber rules.

    Suricata allows for increased log levels (-vv), but I don't think it's available via the UI right now. You could try using Advanced Configuration Pass-Through for this.
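
    I haven't tried this through the package myself, but as far as I know that pass-through box just appends raw YAML to the generated suricata.yaml, so a sketch along these lines (option name taken from the stock Suricata config, so double-check how it lands on your build) should bump the log verbosity without touching the startup command:

    logging:
      default-log-level: info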



  • You could try monitoring top -H, hitting m to toggle between the CPU/memory and I/O views.

    See if you're out of free memory and swapping?

    You could also try:

    Get your Suricata process id with ps -aux | grep suricata -- in the examples below mine is 47402

    Get the thread ids with procstat -t 47402

    Check procstat -rH 47402 for a thread with, e.g., higher page faults, context switches, etc. than the others

    Monitor procstat -kk -w 1 47402 to get a sense for what the threads are doing.
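
    Putting those together, the whole loop from a shell on the firewall looks something like this (47402 is just my example PID; substitute your own):

    top -H                      # press 'm' to toggle between the CPU and I/O displays
    ps aux | grep [s]uricata    # find the Suricata process ID (47402 in my case)
    procstat -t 47402           # list its threads
    procstat -rH 47402          # per-thread resource usage: page faults, context switches, etc.
    procstat -kk -w 1 47402     # kernel stacks for each thread, refreshed every second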



  • @boobletins Yes, the CPU will remain high until I notice and do something about it, usually for a day or two.

    I'm not sure what you mean by "using Advanced Configuration Pass-Through". Could you please point me in the right direction?

    I have looked at what is going on in System Activity when Suricata is acting up. If I'm understanding things correctly, there is a PID for each interface that is being monitored, so I have 7 running: IPS on WAN and IDS on the internal network. At normal usage levels the CPU on these is around 0.3-0.5% each. When things break this will jump up into the double digits on several of them. I will have to take a screenshot or something next time it happens, and I'll try some of the procstat commands on the rogue PIDs as well.

    I have never seen the memory usage measured on the dashboard higher than the mid-20's. Swap usage is normally at 1% or so, I don't remember seeing it much higher than that.



  • The easiest way to temporarily add the "-vv" verbose logging option would be to edit the shell script that launches Suricata when the firewall boots and/or pfSense itself restarts packages. That file is /usr/local/etc/rc.d/suricata.sh.

    Open that file and you will see the lines that start Suricata on each of your interfaces. Carefully edit those lines to include the "-vv" option.
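
    For reference, the start lines look something like this (the numeric ID embedded in the paths is unique to each install, so yours will differ); the edit is just tacking "-vv" onto the end of each line you care about:

    # original line generated by the package (illustrative paths)
    /usr/local/bin/suricata -i igb0 -D -c /usr/local/etc/suricata/suricata_12345_igb0/suricata.yaml --pidfile /var/run/suricata_igb012345.pid
    # same line with verbose logging added
    /usr/local/bin/suricata -i igb0 -D -c /usr/local/etc/suricata/suricata_12345_igb0/suricata.yaml --pidfile /var/run/suricata_igb012345.pid -vv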

    Now the caveat with this method --

    1. This change is temporary because each time you stop/start Suricata yourself from within the GUI that shell script is rebuilt from scratch by the PHP code. So any changes made will disappear.

    So if you just want to troubleshoot the high CPU usage, then run this set of commands from a shell prompt at the firewall:

    /usr/local/etc/rc.d/suricata.sh stop
    

    Edit the shell script as described above. You can put the "-vv" option on all interfaces or just the ones you want to monitor. Restart Suricata with this command:

    /usr/local/etc/rc.d/suricata.sh start
    

    The shell script will stop/start all interfaces. If you mess up the script, no problem. Just go into the GUI and click Save on the Interface Settings page. That will regenerate a fresh shell script.
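
    Once it's back up, a quick sanity check that the flag actually took effect is to look at the running command lines:

    ps -axww -o pid,command | grep [s]uricata    # each instance should now show -vv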



  • @bmeeks Thanks for the info. Does the edit survive the restart of Suricata after a rule update? Since whatever the issue is usually survives a recycle of the service, it will probably just be easier to turn up the logging when there is an issue.

    Which log file should I be looking at after adding the -vv option?



  • @mouseskowitz said in Suricata sometimes pegs CPU after rule update:

    @bmeeks Thanks for the info. Does the edit survive the restart of Suricata after a rule update? Since whatever the issue is usually survives a recycle of the service, it will probably just be easier to turn up the logging when there is an issue.

    Which log file should I be looking at after adding the -vv option?

    Yes, the edit should survive a rules update. It will not survive any edits/saves within the GUI, nor will it survive a Suricata interface start/stop from within the GUI. Any change you make and save within the GUI will recreate that shell script, even for interfaces you did not edit in the GUI.



  • Thanks for the ideas, everyone. I will post back after it breaks again and I can get some more detailed info on the problem.



  • Well, I had issues again this morning. This is the second time it has broken this way, with a non-Suricata PID showing high CPU usage. The main symptom was the RTTsd on WAN being in the hundreds of ms, to the point of the gateway showing as down. I didn't try to turn Suricata logging on since I wasn't 100% sure it was related, but a recycle of the service fixed things. Hopefully these screenshots will help make sense of things.

    [Screenshots attached: dashboard and monitoring graphs captured 2018-12-31 between 05:33 and 05:47]

    The slight CPU increase around 3:15 looks to be the start of things. The 5:15 download spike is probably a roughly 90 MB Steam update. That triggered a spike in CPU that lasted way longer than the download. The tailing off to normal CPU levels is when I recycled Suricata.

    Should I just turn on the -vv logging now so that it's on for next time?



  • I'm sure others here will be more knowledgeable, but I see high interrupt usage that is probably unrelated to Suricata. This is more likely the result of a hardware or driver issue. (I believe restarting Suricata has the side effect of taking the interface down and back up, in and out of netmap mode, so it may feel like Suricata is the issue.)
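
    If you want to watch the interrupt load live the next time it happens, systat gives a rolling view (interrupt counts show up in the right-hand column of the display):

    systat -vmstat 1    # refresh every second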

    Could you provide more detail on the NIC? Could you also please provide the information requested under "Asking for help?" here.

    Have you added any system tunables?



  • @boobletins This is a Supermicro X9DRD-LF motherboard with dual Intel i350 NICs. I just added the two system tunables mentioned in the article you linked; there was nothing custom before that.

    We'll try the command output as a file since it was flagged for spam when I had it in code brackets.
    [Attachment: pfsense help.txt -- command output]

    [Screenshots attached: system graphs and a 24-hour history of the 8 GB total memory]



  • If you aren't doing anything fancy with CPU pinning/affinity, then you might try setting hw.igb.num_queues=1 in System Tunables.

    You might also (separately) try changing the MSI/MSI-X setting for igb via hw.igb.enable_msix.
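
    One note on those: on stock FreeBSD the hw.igb.* knobs are read when the driver attaches, so however you set them a reboot is needed before they take effect. You can check the current values first (names here are the stock igb(4) ones), and if the System Tunables route doesn't stick, /boot/loader.conf.local is the usual fallback:

    sysctl hw.igb.num_queues    # 0 means auto-size (the default)
    sysctl hw.igb.enable_msix   # 1 means MSI-X enabled (the default)
    echo 'hw.igb.num_queues=1' >> /boot/loader.conf.local   # persist the change, then reboot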

    Something seems off with the balance of these queues, which looks to me like it's reflected in the interrupt usage:

    dev.igb.1.queue0.rx_packets: 64537749
    dev.igb.1.queue0.tx_packets: 454472179
    dev.igb.1.queue1.rx_packets: 132156368
    dev.igb.1.queue1.tx_packets: 117660
    dev.igb.1.queue2.rx_packets: 43342779
    dev.igb.1.queue2.tx_packets: 5487
    dev.igb.1.queue3.rx_packets: 179038759
    dev.igb.1.queue3.tx_packets: 63183
    

    I don't know exactly what would be causing that. Maybe it's something to do with the VLANs it looks like you're running.

    Also note that the issue is with igb1. That's the interface with high interrupt (CPU) usage, but your network traffic graph is for igb0. I would be looking at your VLAN traffic here. Maybe there's a faulty process or configuration somewhere sending a huge number of small packets, and that interrupt usage reflects actual traffic?
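
    If you want to see whether those counters climb during one of the latency episodes, you can pull them live for igb1 with something like:

    sysctl dev.igb.1 | grep -E 'queue.*_packets'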

    Regardless, it doesn't look like this is a Suricata issue at all. I would try some of the FreeBSD or pfSense networking forums. I would be interested in knowing how you get it resolved...



  • @boobletins I added the first tunable. What would I put for the value of the second one?

    I see what you're saying with an issue on igb1, but how would that cause elevated ping on igb0?

    When everything is acting "normally" I was only able to get igb1:que 3 up to 35% WCPU by running an iperf test across VLANs. Normally the busiest VLAN only has about 30 Kbps of traffic that hits the router without going out the WAN. There is no traffic that should have caused that high an interrupt usage. It looks like there were two more problems later in the day, but things have been behaving since then without me touching anything.
    [Screenshots attached: monitoring graphs captured 2019-01-01 around 22:28]

    I'll have to see if anyone in the networking forums has any ideas.



  • @mouseskowitz said in Suricata sometimes pegs CPU after rule update:

    What would I put for the value of the second one?

    I would leave MSIX alone and wait to see what happens with a single queue.

    The i350-T2 should be able to handle the default of 4 queues per port, but something isn't right with the interrupt usage and load balancing across cores. Again, I'm not sure exactly what is causing that -- that's a relatively advanced networking question.

    Could you provide the output from dmesg | grep interrupt and vmstat -i?

    I see what you're saying with an issue on igb1, but how would that cause elevated ping on igb0?

    I don't know, but I'm not convinced the two are related.

    Another way of answering this question is to ask which problem you are currently trying to solve:

    • high cpu usage (as originally reported -- but likely not related to suricata)
    • or WAN latency issues (which may be unrelated to above)


  • @boobletins Ideally I would like to solve both of the issues, but the WAN latency is a higher priority. I think they are interrelated, but I have no idea which is causing the other.

    dmesg | grep interrupt

    igb0: Using MSIX interrupts with 5 vectors
    igb1: Using MSIX interrupts with 5 vectors
    igb0: Using MSIX interrupts with 5 vectors
    igb1: Using MSIX interrupts with 5 vectors
    igb0: Using MSIX interrupts with 5 vectors
    igb1: Using MSIX interrupts with 5 vectors
    

    vmstat -i

    interrupt                          total       rate
    irq1: atkbd0                           6          0
    irq9: acpi0                            2          0
    irq16: ehci0                     3845256          1
    irq23: ehci1                     1831351          1
    cpu0:timer                    2673498411       1035
    cpu1:timer                     377015655        146
    cpu3:timer                     376025919        146
    cpu2:timer                     375885908        146
    irq264: isci0                          1          0
    irq266: igb0:que 0             186663825         72
    irq267: igb0:que 1              39564243         15
    irq268: igb0:que 2              46355532         18
    irq269: igb0:que 3              49144499         19
    irq270: igb0:link                      2          0
    irq271: igb1:que 0             285667733        111
    irq272: igb1:que 1              67109876         26
    irq273: igb1:que 2              43993156         17
    irq274: igb1:que 3              62085685         24
    irq275: igb1:link                      8          0
    irq276: ahci0                    9574415          4
    Total                         4598261483       1781