Issues with an Intel x710 and pfsense 2.4.5-p1
-
Hi!
I have a pfSense box with an X710 acting as an OpenVPN server in HA.
Last Saturday it suddenly started to lose packets at random:
11 packets transmitted, 6 packets received, 45.5% packet loss
round-trip min/avg/max/stddev = 0.488/0.499/0.507/0.007 ms
No drops were detected on the switch that pfSense is connected to, or on the Intel card:
[2.4.5-RELEASE][root@vpn2ha.uv.es]/root: netstat -I ixl1
Name    Mtu Network       Address                  Ipkts Ierrs Idrop        Opkts Oerrs  Coll
ixl1   1500 <Link#2>      40:a6:b7:1c:64:f9  13288705830     0  2104  13195081514     0     0
It seems to have started with a "Malicious Driver Detection" event:
Nov 15 02:51:48 vpn2ha kernel: ixl1: Malicious Driver Detection event 2 on TX queue 771, pf number 1
Nov 15 02:51:48 vpn2ha kernel: ixl1: MDD TX event is for this function!
Nov 15 02:51:50 vpn2ha kernel: ixl1: Malicious Driver Detection event 2 on TX queue 770, pf number 1
Nov 15 02:51:50 vpn2ha kernel: ixl1: MDD TX event is for this function!
Nov 15 02:51:59 vpn2ha kernel: ixl1: WARNING: queue 3 appears to be hung!
Nov 15 02:52:01 vpn2ha kernel: ixl1: WARNING: queue 2 appears to be hung!
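To line up the MDD events against the hung-queue warnings, the log lines can be parsed and their timestamps compared. The following is only a sketch: the regular expressions are written against the exact wording of the six lines above, and the function and variable names are made up for illustration.

```python
import re

# Log lines copied from the post above.
LOG = """\
Nov 15 02:51:48 vpn2ha kernel: ixl1: Malicious Driver Detection event 2 on TX queue 771, pf number 1
Nov 15 02:51:48 vpn2ha kernel: ixl1: MDD TX event is for this function!
Nov 15 02:51:50 vpn2ha kernel: ixl1: Malicious Driver Detection event 2 on TX queue 770, pf number 1
Nov 15 02:51:50 vpn2ha kernel: ixl1: MDD TX event is for this function!
Nov 15 02:51:59 vpn2ha kernel: ixl1: WARNING: queue 3 appears to be hung!
Nov 15 02:52:01 vpn2ha kernel: ixl1: WARNING: queue 2 appears to be hung!
"""

# Common prefix: date, HH:MM:SS, hostname, "kernel:", interface name.
PREFIX = r"^\w{3} +\d+ (\d\d):(\d\d):(\d\d) \S+ kernel: (\w+): "
MDD = re.compile(PREFIX + r"Malicious Driver Detection event (\d+) on (\w+) queue (\d+), pf number (\d+)")
HUNG = re.compile(PREFIX + r"WARNING: queue (\d+) appears to be hung!")


def seconds_of_day(hours, minutes, seconds):
    """Turn HH, MM, SS strings into seconds since midnight."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def parse(log):
    """Split the log into MDD events and hung-queue warnings."""
    mdd_events, hung_warnings = [], []
    for line in log.splitlines():
        m = MDD.match(line)
        if m:
            h, mi, s, nic, event, direction, queue, pf = m.groups()
            mdd_events.append({"t": seconds_of_day(h, mi, s), "nic": nic,
                               "event": int(event), "dir": direction,
                               "queue": int(queue), "pf": int(pf)})
            continue
        m = HUNG.match(line)
        if m:
            h, mi, s, nic, queue = m.groups()
            hung_warnings.append({"t": seconds_of_day(h, mi, s), "nic": nic,
                                  "queue": int(queue)})
    return mdd_events, hung_warnings


mdd_events, hung_warnings = parse(LOG)
# Seconds between the first MDD event and the first hung-queue warning.
delay = hung_warnings[0]["t"] - mdd_events[0]["t"]
print("MDD queues:", [e["queue"] for e in mdd_events])
print("hung queues:", [w["queue"] for w in hung_warnings])
print("first hung warning follows first MDD event by", delay, "seconds")
```

The timestamps only show the order in which the messages were logged; on their own they do not prove that the MDD event caused the hung queues.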
Another strange thing is that the state table shows a lot of searches/inserts/deletes, but no traffic is seen on the switch that pfSense is connected to.
TSO is disabled:
ifconfig ixl1
ixl1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
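One way to double-check which offloads are active is to parse the options list rather than read it by eye. This is a minimal sketch: it assumes that FreeBSD's ifconfig lists TSO4, TSO6 and LRO when those offloads are enabled, and the grouping into two sets is chosen here purely for illustration.

```python
import re

# The options line copied from the ifconfig output above.
IFCONFIG = ("options=6400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,"
            "VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>")

# Assumed flag names for segmentation/receive offloads and checksum offloads.
SEGMENTATION = {"TSO4", "TSO6", "LRO"}
CHECKSUM = {"RXCSUM", "TXCSUM", "RXCSUM_IPV6", "TXCSUM_IPV6"}


def enabled_options(text):
    """Return the set of option names between '<' and '>' on the options line."""
    m = re.search(r"options=[0-9a-f]+<([^>]*)>", text)
    return set(m.group(1).split(",")) if m else set()


opts = enabled_options(IFCONFIG)
print("segmentation offloads on:", sorted(opts & SEGMENTATION))
print("checksum offloads on:", sorted(opts & CHECKSUM))
print("other options:", sorted(opts - SEGMENTATION - CHECKSUM))
```

With the line above, no TSO4/TSO6/LRO flag is present, which agrees with TSO being disabled, while the four checksum offloads are still on. VLAN_HWTSO remains in the list; it appears to be a separate flag from TSO4/TSO6.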
Yesterday the state table showed 745743961 searches per second!
State Table                          Total             Rate
  current entries                      646
  searches                     39524429955    745743961.4/s
  inserts                        101400193      1913211.2/s
  removals                       101399547      1913199.0/s
And today, even though I've brought down interface ixl1 (the only connected interface):
State Table                          Total             Rate
  current entries                        5
  searches                     39525067859       302667.7/s
  inserts                        101418208          776.6/s
  removals                       101418203          776.6/s
Any idea?
Thanks.
Additional info:
pciconf -lv ixl1
ixl1@pci0:1:0:1: class=0x020000 card=0x00008086 chip=0x15728086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller X710 for 10GbE SFP+'
    class      = network
    subclass   = ethernet
Driver version:
dev.ixl.1.%desc: Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k
Firmware version:
dev.ixl.1.fw_version: fw 7.0.50775 api 1.8 nvm 7.00 etid 80004cd5 oem 1.264.0
-
Is that something that just started? It was running without loss for some time?
You have the same NIC in the other node and that is not doing that? Even if you fail over?
Steve
-
@stephenw10 said in Issues with an Intel x710 and pfsense 2.4.5-p1:
Is that something that just started? It was running without loss for some time?
Sorry, yes: it was running OK without packet loss for 10 days.
Then the problem appeared suddenly.
You have the same NIC in the other node and that is not doing that? Even if you fail over?
Exactly, I have another node with the same hardware that has been running for 6 days.
Steve
I don't understand the state table behaviour, with so many searches, inserts and deletes even when the device was disconnected.
Thanks Steve.
-
Commonly that might be seen when there is an interface mismatch between the nodes causing a state sync loop between them.
Check the interfaces are assigned identically in the config on both nodes.
Steve
-
@stephenw10 said in Issues with an Intel x710 and pfsense 2.4.5-p1:
Commonly that might be seen when there is an interface mismatch between the nodes causing a state sync loop between them.
Check the interfaces are assigned identically in the config on both nodes.
Steve
Hmm, pfsync was disabled on both nodes, so I guess no states were being synchronized.
The WAN interface is the same on both nodes (ixl1), and the OpenVPN interfaces were created by config sync (XMLRPC Sync), so they should be identical.
The other interfaces are down (no link).
I also think an internal loop could be involved, but I don't know how to find where the loop comes from.
I know that if I reboot this server everything will go back to normal, but first I would like to find where the bug comes from.
For example, right now only an SNMP query and an SSH session (10 states in the state table) give it 682.0/s inserts and removals.
/root: pfctl -s info
Status: Enabled for 1 days 17:18:29           Debug: Urgent

Interface Stats for ixl1              IPv4             IPv6
  Bytes In                  10792675853232           463536
  Bytes Out                 10912798045880              260
  Packets In
    Passed                     13165783082                0
    Blocked                      115984155             6601
  Packets Out
    Passed                     13117785234                0
    Blocked                         368850                3

State Table                          Total             Rate
  current entries                       10
  searches                     39525266221       265789.3/s
  inserts                        101422148          682.0/s
  removals                       101422138          682.0/s
Counters
  match                          190610985         1281.8/s
  bad-offset                             0            0.0/s
  fragment                            2559            0.0/s
  short                                573            0.0/s
  normalize                            453            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                             0            0.0/s
  ip-option                           2760            0.0/s
  proto-cksum                            0            0.0/s
  state-mismatch                    117091            0.8/s
  state-insert                           0            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s
  map-failed                             0            0.0/s
That blew my mind ;-)
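A quick check on the figures in that output may be worth doing. The sketch below assumes the Rate column is simply each Total divided by the time pf has been enabled (the "Enabled for" line), i.e. a lifetime average and not a current rate. That is an assumption about how pfctl computes the column; the helper names are invented for this example.

```python
# Figures copied from the `pfctl -s info` output above.
UPTIME = "1 days 17:18:29"  # from the "Status: Enabled for ..." line
TOTALS = {"searches": 39525266221, "inserts": 101422148, "removals": 101422138}
PRINTED_RATES = {"searches": 265789.3, "inserts": 682.0, "removals": 682.0}


def uptime_seconds(text):
    """Convert 'D days HH:MM:SS' into seconds."""
    days, _, clock = text.split()
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


def lifetime_rates(totals, seconds):
    """Average each counter over the whole time pf has been enabled."""
    return {name: round(total / seconds, 1) for name, total in totals.items()}


seconds = uptime_seconds(UPTIME)
rates = lifetime_rates(TOTALS, seconds)
for name, rate in rates.items():
    print(f"{name:9s} computed {rate:>10.1f}/s   printed {PRINTED_RATES[name]:>10.1f}/s")
```

The computed values match the printed ones to the displayed decimal. If the assumption holds, 682.0/s describes the average since pf was enabled, not what the ten current states are doing, and a recently reset "Enabled for" timer would inflate every rate.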
-
@elbuit said in Issues with an Intel x710 and pfsense 2.4.5-p1:
Additionally, I've checked "Disable hardware checksum offload", which removes RXCSUM and TXCSUM:
ixl1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
When I disabled all hardware acceleration, the kernel logged a WARNING about queue 3:
kernel: ixl1: WARNING: queue 3 appears to be hung!
rc.gateway_alarm[7790]: >>> Gateway alarm: WANGW (Addr:xxx.xxx.xxx.xxx Alarm:1 RTT:10.504ms RTTsd:55.100ms Loss:31%)
But it started answering pings again and is now more stable:
Gateways
WANGW (default)  xxx.xxx.xxx.xxx  xxx.xxx.xxx.xxx  6.397ms  41.771ms  0.0%  Online
But the state table is still behaving the same way, with only 10 entries and a lot of searches/s; that figure should be approximately the same as the packets per second.
Yet pfSense is receiving fewer than 10 pps while showing 263009.2 searches per second.
-
The 'queue appears to be hung' warning often seems to be triggered when the driver starts or restarts. In itself it does not seem to be a problem. Whether or not that appears, the NIC works as expected after the driver is loaded and you don't see that error again.
Steve
-
@stephenw10 said in Issues with an Intel x710 and pfsense 2.4.5-p1:
The 'queue appears to be hung' warning seems to often be triggered when the driver starts or re-starts. In itself it does not seem to be a problem.
I'm not sure about that; the problem started with a similar message:
Nov 15 02:51:48 vpn2ha kernel: ixl1: Malicious Driver Detection event 2 on TX queue 771, pf number 1
Nov 15 02:51:48 vpn2ha kernel: ixl1: MDD TX event is for this function!
Nov 15 02:51:50 vpn2ha kernel: ixl1: Malicious Driver Detection event 2 on TX queue 770, pf number 1
Nov 15 02:51:50 vpn2ha kernel: ixl1: MDD TX event is for this function!
Nov 15 02:51:59 vpn2ha kernel: ixl1: WARNING: queue 3 appears to be hung!
Nov 15 02:52:01 vpn2ha kernel: ixl1: WARNING: queue 2 appears to be hung!
Could it be not the cause but the consequence?
I don't know.
It seems to be something related to the NIC driver/firmware, but I can't figure out how it ended up in a "state insertions/deletions loop".
That's quite weird.
Whether or not that appears the NIC works as expected after the driver is loaded and you don't see that error again.
Yes, the NIC is working correctly and the state loop is slowing down. Right now:
State Table                          Total             Rate
  current entries                       10
  searches                     39526358708       152927.3/s
  inserts                        101445909          392.5/s
  removals                       101445899          392.5/s
Steve
Thanks Steve for your help.
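Another way to look at the state table, independent of the Rate column, is to compare the raw Total counters across the snapshots posted in this thread. The sketch below does only that subtraction. The snapshot labels are invented for readability, and it assumes the snapshots were taken in the order they were posted.

```python
# Totals copied from the four State Table snapshots posted in this thread.
SNAPSHOTS = [
    ("yesterday", {"searches": 39524429955, "inserts": 101400193, "removals": 101399547}),
    ("ixl1 down", {"searches": 39525067859, "inserts": 101418208, "removals": 101418203}),
    ("snmp+ssh", {"searches": 39525266221, "inserts": 101422148, "removals": 101422138}),
    ("latest", {"searches": 39526358708, "inserts": 101445909, "removals": 101445899}),
]


def deltas(snapshots):
    """Return how much each counter grew between consecutive snapshots."""
    steps = []
    for (prev_label, prev), (label, cur) in zip(snapshots, snapshots[1:]):
        growth = {name: cur[name] - prev[name] for name in cur}
        steps.append((f"{prev_label} -> {label}", growth))
    return steps


steps = deltas(SNAPSHOTS)
for label, growth in steps:
    print(f"{label:24s} {growth}")

first, last = SNAPSHOTS[0][1], SNAPSHOTS[-1][1]
total_growth = {name: last[name] - first[name] for name in first}
print("first -> last:", total_growth)
```

Between the first and last snapshot the searches counter grew by about 1.9 million and inserts by about 46 thousand. That is far less than a sustained rate of hundreds of thousands of searches per second would add over the same period, which suggests the printed rates do not reflect current activity.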
-
I'm trying to capture these inserts/removals.
I've tried tcpdump on all interfaces, including pfsync, pflog, enc0, loopback, ... and found no packets.
Only my SSH session.
-
@elbuit I have had a similar issue come up with my XG-1527 recently.
Queue hung after driver event. This causes incoming requests to hang.
I note this thread that suggests disabling TSO:
https://www.reddit.com/r/PFSENSE/comments/fqtgmj/intel_x710t4_ixl_malicious_driver_detection/
doc here:
https://docs.netgate.com/pfsense/en/latest/config/advanced-networking.html#hardware-tcp-segmentation-offloading
-
Network (Advanced) Settings for reference
-
Hi @greenant
I've disabled TSO and, as the pfSense doc you posted says, TSO is not a good choice if you are running a firewall. It's mainly for servers.
Thanks for your advice.
Regards.
-
TSO should be disabled by default. As you say it doesn't make much sense to have it enabled on a firewall for almost all setups.
Steve
-