Firewall loses L2/L3 connection, VLAN tagging - Intel igb driver

xciter327

Hi all,

We have had this issue on and off for about 1.5 years now. Every so often, without any notification, the network on the firewall just stops working, and nothing at all shows up in the logs. This is a Supermicro A1SRi-2758F firewall with 4 Intel I350 ports built in. The server passes a 24-hour memtest just fine.

Things I have noticed:

L2/L3 stops working on WAN side. Adding a static ARP entry for the gateway does not fix the issue.
On the LAN side I have 1 tagged VLAN on the interface, which loses its tag and is shown on the switch in VLAN 1
Other interfaces (OPT1,2) also stop working completely. No L2/L3 connectivity at all.
Firewall does not reply to ARP messages
All ports are reported "UP" by ifconfig
IPMI on the firewall still works and console is interactive
If I do "ifconfig igb0 down && ifconfig igb0 up" the console freezes and a reboot is required
A reboot is required to restore network connectivity.
This happens randomly (as far as I can tell). Sometimes it's months, sometimes days, sometimes minutes.

Actions I've taken to attempt to remedy the problem so far:

adding "hw.igb.num_queues=1" to boot loader (still crashes)
adding "dev.igb[0-3].eee_disabled=1" via tunable (still crashes)
Disabling a bunch of hardware offloading features on the network card: "ifconfig igb0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso -vlanhwtag" (still crashes)
Using Intel BootUtil to disable the power saving feature on all adapters: "bootutil -WOLD -ALL"

The last two settings are under testing for now.

Issues that look related:
https://forum.opnsense.org/index.php?topic=5511.45

xciter327

This is already set:

kern.ipc.nmbclusters: 1000000

As well as:
net.inet.tcp.tso: 0
dev.igb.0.fc: 0
dev.igb.2.fc: 0
dev.igb.3.fc: 0
dev.igb.4.fc: 0

xciter327

A quick update:

NtopNG claims that 91% of the TCP packets destined for the firewall are "TCP SYN", so it looks like a TCP SYN flood. I checked some other firewalls and their TCP flag distribution is very different. The packets are all destined for a network which is behind the pfSense box, but it is configured with public IP networking.

Added pfblocker-ng with some block lists, which blocks the vast majority of these packets.
Added rate limiting for packets destined for that network
Enabled TCP SYN cookies.

stephenw10

That's the same board used in the C2758 that we shipped large numbers of. I'm not aware of any particular issue with it.

So all 4 interfaces fail at the same time? But the LAN continues to send traffic just untagged?

It does sound like some network resource being exhausted, such as mbufs, but I would expect something to be logged.
When it fails and you try to ping out from the console what error do you see?

You might also check the sysctl output for the igb counters, see if anything is waaay higher than it should be.

Steve

xciter327

Yes, all four interfaces (well, 3, since I'm only using 3) stop transmitting at the same time.
The LAN registers a functional MAC on the switch, but without a tag, which is more than I can say about the WAN and OPT1/2.
When it fails and I ping 8.8.8.8 from the console, nothing happens. When I ping the gateway I get an lldp error because it cannot ARP the IP/MAC pair of the gateway.
If it happens again, I'll try dumping the state with "sysctl -a > /root/sysctl-crashed.log" or something similar.
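
One possible way to use such a dump for the counter check suggested above is to compare it against a second dump taken while the firewall is healthy. The following is a minimal Python sketch, not a tested tool: it assumes both files contain plain `name: value` lines (as in the sysctl values quoted earlier in the thread), and the function names are hypothetical.

```python
def parse_sysctl(text, prefix="dev.igb."):
    """Return {name: int_value} for numeric 'name: value' lines under prefix."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep or not name.startswith(prefix):
            continue  # not a 'name: value' line, or not an igb entry
        value = value.strip()
        if value.isdigit():  # skip descriptions and other non-numeric values
            counters[name] = int(value)
    return counters


def largest_increases(before_text, after_text, top=10):
    """List the igb entries whose value grew the most between two dumps."""
    before = parse_sysctl(before_text)
    after = parse_sysctl(after_text)
    deltas = [
        (name, after[name] - before[name])
        for name in after
        if name in before and after[name] > before[name]
    ]
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top]
```

Reading the healthy dump and the crashed dump into two strings and passing them to `largest_increases` gives a short list of entries to look at first. It only highlights what changed; it does not say which changes are abnormal.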

I also checked the historical logs in ntopng. It seems the TCP SYN flood has been going on for days. It's a little bit of traffic from each of many IPs, so alerting did not catch it.

stephenw10

Do you know what sort of packet rate those are coming in at?

Were they being passed by the firewall before you added blocklists?

Steve

xciter327

Hi,

If I check with pftop, over the whole lifetime of a connection there are no more than 3 packets, usually only 1. The payload is also no bigger than 180 bytes.
Yes, most of them were being passed by the firewall. pfblocker-ng drops roughly 70%.

Note: Today I prepared a pair of firewalls to replace the 1 at the location. Should give me more time for triage.

xciter327

Interestingly enough, during "normal" functioning of the firewall, if I do a packet capture on the WAN, there isn't an abnormal amount of TCP SYN packets.

Using this guide:
http://www.firewall.cx/general-topics-reviews/network-protocol-analyzers/1224-performing-tcp-syn-flood-attack-and-detecting-it-with-wireshark.html

stephenw10

I meant the total rate of incoming SYN packets, rather than per connection.

I assume they are from different source IPs?

Are you logging that traffic? That can end up consuming a lot of CPU. If you can block or pass without logging the firewall will remain up against a far bigger attack.

Steve

xciter327

Please note I'm checking this while the firewall is functional. The weird thing is that in a packet capture taken on the firewall and exported to Wireshark, the number of SYNs is very small. I'm exporting netflow upstream, and that tells me 91% of TCP packets destined for the WAN have the TCP SYN flag set.

As a percentage of the total traffic, the SYNs are very few according to the Wireshark capture from the firewall WAN: about 0.3-0.5%. I'm not sure how I can check the rate.
Yes, they are all from different source IPs.
Yes. Initially I was not logging traffic on the firewall, but I have now enabled it to see what's going on.
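
On the question of how to check the rate: one rough approach is to count initial SYNs (SYN set, ACK clear) per second from the capture timestamps. The Python sketch below is only an illustration under assumptions: it expects the capture to have been reduced already to (timestamp in seconds, TCP flags byte) pairs, it does not read pcap files itself, and the function names are hypothetical.

```python
from collections import Counter

TCP_SYN = 0x02  # SYN bit in the TCP flags byte
TCP_ACK = 0x10  # ACK bit in the TCP flags byte


def is_initial_syn(flags):
    """True for a connection-opening SYN (SYN set, ACK not set)."""
    return bool(flags & TCP_SYN) and not (flags & TCP_ACK)


def syn_rate_per_second(packets):
    """Count initial SYNs per whole second of capture time."""
    per_second = Counter()
    for timestamp, flags in packets:
        if is_initial_syn(flags):
            per_second[int(timestamp)] += 1
    return dict(per_second)


def syn_share(packets):
    """Fraction of the given TCP packets that are initial SYNs."""
    packets = list(packets)
    if not packets:
        return 0.0
    syns = sum(1 for _, flags in packets if is_initial_syn(flags))
    return syns / len(packets)
```

The per-second counts answer the rate question directly, while `syn_share` gives a percentage that can be compared with the figures reported by netflow and by the Wireshark capture.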

These are the stats from a 100k-packet capture on the WAN:

tman222

Hi @xciter327 - reading through your post I had a few clarification questions:

1. If you unplug the network cable from e.g. igb0 and plug it back in, is connectivity restored or is a reboot absolutely necessary?
2. Does your Gateway latency spike right before the firewall stops working?
3. Does everything stop working? Or, do existing connections still work and you just aren't able to open new connections?
4. What type of internet connection do you have? That is, what's on the other side of the firewall, cable, DSL, fiber, etc.?

Thanks in advance.

xciter327

Hi @tman222 I will try to answer as best as I can:

1. When I unplug the cable it does correctly register a hot-plug event on the interface, but connectivity is not restored. A reboot is absolutely required. I have not found another way to restore connectivity so far.
2. I don't do gateway monitoring due to various bugs we have encountered with dpinger so far. We do have smokeping set up, which does not register any sort of spikes to the firewall in question.
3. Everything stops working. VLAN tagging also stops working, as I mentioned before.
4. On this particular connection we have a leased fiber with an Ethernet circuit on it (so an L2 connection), however I have also seen this happen when connected to one of our own switches at another customer. On that connection we have "dark fiber" (so an L1 connection) between the switch and our datacenter.

tman222

Thanks @xciter327 - a couple additional questions came to mind:

1. Have you checked whether a BIOS or firmware update might address this issue?
2. What power management settings have you configured in pfSense?
3. When the system locks up, have you tried shutting the system down completely, including unplugging the power? Then plug the power back in and restart the system. Does that have any impact on how long it stays up or whether it still crashes?

Thanks again - hope this helps.

xciter327

@tman222

1. Yes. The latest BIOS is installed. Firmware for the card is not updatable via the Intel utility as far as I can see, however I have applied the "-WOLD" flag successfully and have not had any issues since. Still under testing. At another site where I have 2 HA units I've set "-WOLD" on one, but not on the other, for some A/B testing.
2. PowerD is enabled and set to "Maximum" for all three options.
3. No, I have not. I just do a "reset" via IPMI. What benefit do you think fully power cycling the system will give?

tman222

Hi @xciter327 - Regarding 3. I had an interesting situation on my Supermicro system where an SFP+ port would stop working and wouldn't start working again until I shut the system down completely, removed all power, and started it back up. I thought it could be interesting to try in case a complete shutdown resets something that may be impacting the behavior that you are observing.

Hope this helps.

stephenw10

Mmm, I've certainly seen ix ports get stuck in a mode that survives a reboot. Only a complete power cycle cleared it.

Of course that doesn't explain why it fails initially.

Steve

xciter327

We have not had any issues of the sort with the Atom board. I'm having some issues with Intel X710 adapters, but those are very obviously driver related.

On topic, I have the box now in the office and I'll try to reproduce the issue following another round of memtests. I have a sneaking suspicion that it might be related to the number of interrupts generated by the igb driver. I wonder if it could be hitting the limit that is mentioned in the documentation for ix network adapters:
hw.intr_storm_threshold=1000 (the default) is suggested to be raised to 10k. I've seen the igb driver generating about 7-8k on a utilized gigabit link (per interface, that is). Overall, on igb systems all the CPU power is usually hogged by igb, by the looks of it.
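
To put numbers on that suspicion, the per-source interrupt rates could be pulled out of `vmstat -i` and compared with whatever threshold is being considered. This is a rough Python sketch under assumptions: it expects each data line to end with two numeric columns (total and rate), the function names are hypothetical, and comparing an average rate against `hw.intr_storm_threshold` is only the same loose comparison made above, not a statement about how the kernel applies that setting.

```python
def interrupt_rates(vmstat_text, name_filter="igb"):
    """Return {source_name: rate} for lines whose name contains name_filter."""
    rates = {}
    for line in vmstat_text.splitlines():
        # The source name may contain spaces, so split from the right:
        # the last two fields are assumed to be 'total' and 'rate'.
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue
        name, total, rate = parts
        if name_filter in name and total.isdigit() and rate.isdigit():
            rates[name.strip()] = int(rate)
    return rates


def above_threshold(rates, threshold):
    """Names of interrupt sources whose rate exceeds the threshold, sorted."""
    return sorted(name for name, rate in rates.items() if rate > threshold)
```

Feeding it saved `vmstat -i` output from a loaded box and calling `above_threshold(rates, 1000)` and then `above_threshold(rates, 10000)` would show which igb queues sit between the default and the suggested value.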

xciter327

Just wanted to mention the problematic box has passed roughly 6 days of memtesting without errors. I'll probably script some flent tests to run as a next step.

xciter327

Have not had time to script tests yet. One of the 2 brand new boxes with the same hardware and "WOL" disabled froze a couple of days ago as well. The previous box's console was still interactive when the issue happened. This one was a full freeze, not reacting to any inputs.