Suricata InLine with igb NICs

newUser2pfSense

boobletins...I'll let it run for a while with all of the tweaks we've made and check it periodically for any netmap_grab_packets errors.

bmeeks...I agree.

newUser2pfSense

I let my system run for just over a week and I noticed this evening that I couldn't access the interwebs for some reason. I restarted my pfSense computer and everything seemed to go back to normal. I then noticed a few minutes ago the following on the console:

kernel 492.136807 [1071] netmap_grab_packets bad pkt at 878 len 4939
kernel 490.136919 [1071] netmap_grab_packets bad pkt at 667 len 4939
kernel 489.136703 [1071] netmap_grab_packets bad pkt at 933 len 4939
kernel 488.636876 [1071] netmap_grab_packets bad pkt at 875 len 4939
kernel 488.435620 [1071] netmap_grab_packets bad pkt at 806 len 4939
kernel 488.235492 [1071] netmap_grab_packets bad pkt at 766 len 4939

Interesting. I guess I'm going to have to bump up my dev.netmap.buf_size from 4096 to a larger value. I have 64 GB of RAM in my pfSense computer, so maybe I'll bump it up to 8192 and see how that works. Has anyone had a related experience after tuning their system?

Update - Since changing the buffer size to 8192, I've noticed webpages load a tad slower.

boobletins

I still periodically see packets larger than my mtu and netmap.buf_size. I haven't been able to track down the source. After tuning it's down to something like once per week - often without any interface hiccup.

I opened a support question here: https://redmine.openinfosecfoundation.org/issues/2720 -- but so far there's no information. I don't think it's a Suricata issue -- I'm no expert, but I don't see anything in the Suricata netmap code that would be adding length to packets.

It's possible that this type of noise is always there but the netmap configuration is more sensitive to violations of mtu/buf_size.

Really the error message just indicates that a packet was dropped because it exceeded the available buffer length. I believe the interface flap after that is due to the watchdog cycling the interface because it sees high packet loss (or latency). Packets are presumably dropped all the time by the OS and we're only aware of them because we're looking for netmap errors now.

For the record: my logs show the last errors on 12/6 with the same packet size you have above:

kernel: 338.512666 [1071] netmap_grab_packets       bad pkt at 1054 len 4939
kernel: 338.714285 [1071] netmap_grab_packets       bad pkt at 1073 len 4939
kernel: 338.914864 [1071] netmap_grab_packets       bad pkt at 1089 len 4939
kernel: 339.423360 [1071] netmap_grab_packets       bad pkt at 1203 len 4939
kernel: 340.414473 [1071] netmap_grab_packets       bad pkt at 1484 len 4939
kernel: 342.414619 [1071] netmap_grab_packets       bad pkt at 1542 len 4939
kernel: 346.414451 [1071] netmap_grab_packets       bad pkt at 2009 len 4939

The same size strikes me as a little odd -- what's putting packets of that exact size on the wire? They happen so rarely now that I don't want to run a pcap for weeks to catch them. I don't see any particularly odd traffic at the time in my logs (though of course the bad packets are dropped, so if they're all bad nothing would show up).
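
To check whether every error really reports the same length, the "bad pkt" lines can be tallied by size. This is only a rough sketch: the regular expression assumes the log format shown in this thread, and the function name `summarize_bad_pkts` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: 338.512666 [1071] netmap_grab_packets       bad pkt at 1054 len 4939
# (format assumed from the log excerpts quoted in this thread)
BAD_PKT = re.compile(r"netmap_grab_packets\s+bad pkt at (\d+) len (\d+)")

def summarize_bad_pkts(lines):
    """Return a Counter mapping reported packet length -> number of log lines."""
    lengths = Counter()
    for line in lines:
        match = BAD_PKT.search(line)
        if match:
            lengths[int(match.group(2))] += 1
    return lengths

# Three lines copied from the posts above, in both of the formats seen.
sample = """\
kernel: 338.512666 [1071] netmap_grab_packets       bad pkt at 1054 len 4939
kernel: 338.714285 [1071] netmap_grab_packets       bad pkt at 1073 len 4939
kernel 492.136807 [1071] netmap_grab_packets bad pkt at 878 len 4939
""".splitlines()

for length, count in summarize_bad_pkts(sample).most_common():
    print(f"len {length}: {count} occurrence(s)")
```

Feeding it the full system log instead of `sample` would show at a glance whether any length other than 4939 ever appears.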

I'd be curious to know the output of "sysctl -a | grep missed_packets". More precisely, I'd be curious whether, if you note those numbers now and compare them after the next "bad pkt" errors, the NIC counters are still being incremented under netmap or whether we lose that reporting. If they are still accurately incremented on a packet miss, then we should be able to compare inline to legacy mode to see if there's any significant increase in packet loss with netmap mode. I suspect there isn't; it's just louder about its misses.
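
The before/after comparison could be done along these lines. It is a sketch under assumptions: it expects the text of "sysctl -a | grep missed_packets" saved at two points in time, the OID names in the sample are what I would expect from an igb system rather than something confirmed in this thread, and the counter values are invented.

```python
def parse_missed_packets(sysctl_text):
    """Parse 'name: value' lines and keep only the missed_packets counters."""
    counters = {}
    for line in sysctl_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and "missed_packets" in name:
            counters[name.strip()] = int(value.strip())
    return counters

def counter_deltas(before, after):
    """Return how much each counter grew between the two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}

# Illustrative snapshots; the OID names and numbers are assumptions.
snapshot_before = """\
dev.igb.0.mac_stats.missed_packets: 12
dev.igb.1.mac_stats.missed_packets: 0
"""
snapshot_after = """\
dev.igb.0.mac_stats.missed_packets: 19
dev.igb.1.mac_stats.missed_packets: 0
"""

deltas = counter_deltas(parse_missed_packets(snapshot_before),
                        parse_missed_packets(snapshot_after))
for name, delta in deltas.items():
    print(f"{name}: +{delta}")
```

If the deltas stay at zero while "bad pkt" lines keep appearing in the log, that would suggest the NIC counters are not tracking what netmap drops.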

bmeeks

Here are a few Netmap-related links I've found. There are some references in these about various netmap errors and issues, particularly around stripping of VLAN tags and problems with flow control. Lots of the issues are NIC driver specific.

https://github.com/luigirizzo/netmap/blob/master/LINUX/README

http://freebsd.1045724.x6.nabble.com/Netmap-ixgbe-stripping-Vlan-tags-td5838105.html

https://redmine.openinfosecfoundation.org/issues/1925

https://helpmanual.io/man4/netmap-freebsd/

boobletins

I've read through the man pages and netmap code several times now (which is why I'm so confident I know what the errors mean: https://github.com/luigirizzo/netmap/blob/master/sys/dev/netmap/netmap.c#L1169 )

VLAN tag stripping isn't an issue for me, but there was an interesting bit in that link:

When you switch an interface to netmap mode it does a soft-reset first. That reverts the vlanhwfilter configuration to default on. It's not netmap that does it but the driver. It seems to happen in or around ixgbe_setup_vlan_hw_support().

I just tested this on igb and em drivers, and both keep the vlanhwfilter setting across a netmap restart (along with other settings -- checksum offloading most importantly).

I think that the "bad pkt" error is "resolved" for both of us and probably more broadly. My guess is that the remaining errors are normal network noise that is just noisier than usual because the code writes to syslog for every dropped (because malformed) packet.

A flood of "bad pkt" errors would result if a user had a misconfiguration or an incompatible card (e.g., an MTU of 9000 to support jumbo frames with netmap.buf_size left at the default 2048 would produce huge numbers of "bad pkt" errors), and so we tend to think that "bad pkt" means something isn't working correctly. Really it just means what it says: a bad packet was received that violates both our MTU and our buf_size. The packet should be dropped. Why someone is sending us a packet of size 4939 when we're advertising an MTU of 1500 is a good question.
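
The MTU versus buf_size relationship can be put in numbers. This is a simplified sketch, not the netmap code itself: it assumes the reported length is the Ethernet frame without the trailing checksum, and it only models the "longer than the buffer" condition discussed in this thread. The helper names are hypothetical.

```python
ETHERNET_HEADER = 14  # bytes: destination MAC + source MAC + EtherType
VLAN_TAG = 4          # extra bytes when an 802.1Q tag is present

def max_frame_len(mtu, vlan_tagged=False):
    """Largest frame a link with this MTU should carry (checksum not counted)."""
    return mtu + ETHERNET_HEADER + (VLAN_TAG if vlan_tagged else 0)

def fits_in_buffer(frame_len, buf_size):
    """True if a frame of this length fits in a single netmap buffer."""
    return frame_len <= buf_size

# Jumbo frames with the default buffer: full-size frames cannot fit.
print(fits_in_buffer(max_frame_len(9000), 2048))  # False (9014 > 2048)

# Standard MTU with the default buffer: fine.
print(fits_in_buffer(max_frame_len(1500), 2048))  # True (1514 <= 2048)

# The 4939-byte packets from the logs against the two buffer sizes tried above.
print(fits_in_buffer(4939, 4096))  # False
print(fits_in_buffer(4939, 8192))  # True
```

The last two lines show why raising buf_size to 8192 would silence this particular message, although it does not explain where a 4939-byte packet comes from on a 1500 MTU link.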

To really get to the bottom of it I would need to capture and decode one of the oversized packets. Without capturing enormous amounts of traffic, the best way I can think of would be to recompile netmap/suricata with an expanded error message that outputs the packet in base64 to the log for analysis. I'll see how complicated it is.

I started a write-up on how to troubleshoot netmap errors and got discouraged when my initial post was rejected by Akismet. If I write it up and send it your way, can you get it posted, Bill?

bmeeks

@boobletins said in Suricata InLine with igb NICs:

I started a write-up on how to troubleshoot netmap errors and got discouraged when my initial post was rejected by Akismet. If I write it up and send it your way, can you get it posted, Bill?

Sure! Write it up in a format that would make a good Sticky Post to put at the top of this forum along with the others. Give it a title to make it clear what it's about. Be sure to give yourself the credit in the notes, and you can even ask one of the Forum moderators to post the sticky for you. I have asked them to post mine in the past. You can ask @jimp or @johnpoz if they will make it a Sticky Post in this forum. If you run into difficulties or have a question, just let me know.

stephenw10

Nice work.

boobletins

@boobletins said in Suricata InLine with igb NICs:

I'll see how complicated it is.

It looks like this would require a kernel rebuild which I'm not really up to -- I'd then have to run that experimental build on my firewall (and I've never built one before).

bmeeks

Enabling debugging or extra error messages from within netmap itself will require rebuilding the kernel module. Though I've never done it, you might be able to build a compatible module with debugging enabled using a vanilla FreeBSD 11.2 machine. Then just copy the kernel module over to pfSense. If you have virtual machines, you could do this with not much risk. Just save a snapshot, install the new netmap kernel module and give it a try. If it breaks badly, just restore the previous snapshot.

Turning on debugging with Suricata is relatively easy, but I don't think any really useful information will be gleaned from Suricata itself. I think this issue is between the NIC drivers, the netmap kernel module and the kernel itself.

boobletins

@bmeeks

That's how I built the hyperscan module -- a fresh VM and then followed the directions.

If we think it's possible to do something similar and just copy over the netmap module, then I can try that.

bmeeks

I think it should work just fine. There is nothing customized about the netmap module in pfSense. They just enable the module to be built along with the kernel. I believe there are some minor pfSense-specific tweaks to the FreeBSD kernel code, but nothing to the netmap module itself.

newUser2pfSense

boobletins...From bmeeks' post here: https://helpmanual.io/man4/netmap-freebsd/, which you have likely already read, I wanted to get your take on this: "The only parameter worth modifying is dev.netmap.buf_num as it impacts the total amount of memory used by netmap." The default value is 163840; I checked my pfSense and the value matched. I'm wondering if we should change this value as well as dev.netmap.buf_size? Just a thought.

boobletins

You can certainly do that -- mine is higher than default, but it won't help with any "bad pkt" errors if that's what you're trying to solve.

Really what you would be doing there is buying yourself a larger buffer if Suricata starts falling behind -- you've set aside more RAM in case of a backlog of packets.
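
For a feel of how much RAM is involved, here is a back-of-the-envelope estimate. It assumes the buffer pool is simply buf_num times buf_size and ignores whatever netmap allocates for rings and interface structures, so treat the numbers as rough.

```python
def netmap_buffer_bytes(buf_num, buf_size):
    """Approximate size of the netmap buffer pool in bytes (buffers only)."""
    return buf_num * buf_size

DEFAULT_BUF_NUM = 163840  # default dev.netmap.buf_num quoted above

for buf_size in (2048, 4096, 8192):
    total = netmap_buffer_bytes(DEFAULT_BUF_NUM, buf_size)
    print(f"buf_size={buf_size}: {total / 2**20:.0f} MiB")
# buf_size=2048: 320 MiB
# buf_size=4096: 640 MiB
# buf_size=8192: 1280 MiB
```

So on a machine with 64 GB of RAM, even doubling buf_num on top of an 8192-byte buf_size stays a small fraction of memory by this estimate.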