Intel Interface Issues

rediske

Just found something, pciconf -l -c em0 gives some PCI info, including the line:

ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected

AER is Advanced Error Reporting and this notes some PCI bus errors. Next time I crash, I'll run this command at the console and see what it reveals.

rediske

EDIT: Changed script for all interfaces

And just 'cause it's Sunday, I wrote a little perl script:

#!/usr/local/bin/perl
use strict;
use warnings;

$| = 1;  # unbuffered, so the blank line stays in order with date/pciconf output when redirected

# Poll the AER counters once a second for up to a week (604800 seconds).
for (my $i = 1; $i <= 604800; $i++) {
    print "\n";
    system('date');
    # system() returns the exit status, not the output; each command prints to stdout itself.
    for my $if (qw(em0 em1 em2 em3 bge0)) {
        system("/usr/sbin/pciconf -l -c $if | grep AER");
    }
    sleep(1);
}

which outputs:

Sun Oct 14 13:13:41 CDT 2018
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 2 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected

I redirected the output to a text file so I have a second-by-second account of the state of the em0-em3 and bge0 interfaces. That way I can see whether PCI errors (and what kind, and how many) occur in the seconds before dpinger makes its syslog entry about the gateway dropping.
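For going through a log like that afterwards, here is a minimal sketch of pulling the three counters out of each AER line. It assumes the line format is exactly what pciconf printed above; the names `AER_RE` and `parse_aer` are made up for illustration and are not part of any tool.

```python
import re

# Assumed format, copied from the pciconf output quoted in this thread:
#   ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
AER_RE = re.compile(r"AER \d+ (\d+) fatal (\d+) non-fatal (\d+) corrected")


def parse_aer(line):
    """Return (fatal, non_fatal, corrected), or None if the line has no AER counters."""
    match = AER_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


sample = "ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected"
print(parse_aer(sample))
```

Lines that do not match, such as the timestamp printed by date, come back as None, so they can be treated as the separator between one-second samples.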

bfeitell

Take a look at the info about MSI/MSIX here:
https://www.netgate.com/docs/pfsense/hardware/tuning-and-troubleshooting-network-cards.html

rediske

So I waited a while until a crash. dpinger says the interface crashed at 16:57:23. My script stopped logging a full minute earlier, at 16:56:10; maybe it was hanging on the system call to pciconf? The log did show 2 additional fatal errors though, on em2 (nothing plugged in) and em3 (MikroTik router). So we went from:

em0 - 1 fatal 3 non-fatal 5 corrected
em1 - 1 fatal 3 non-fatal 5 corrected
em2 - 0 fatal 2 non-fatal 5 corrected
em3 - 0 fatal 3 non-fatal 5 corrected
bge0 - 0 fatal 0 non-fatal 0 corrected

to

em0 - 1 fatal 3 non-fatal 5 corrected
em1 - 1 fatal 3 non-fatal 5 corrected
em2 - 1 fatal 2 non-fatal 5 corrected
em3 - 1 fatal 3 non-fatal 5 corrected
bge0 - 0 fatal 0 non-fatal 0 corrected

But these errors were logged at 14:18:43, about 2.5 hours before the eventual crash. After I rebooted again, without any changes (my son was trying to play time-sensitive games), the machine crashed 2 more times inside 10 minutes.
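The before/after comparison can be done mechanically. This is only a sketch: the two dictionaries are typed in by hand from the numbers above, each tuple is (fatal, non-fatal, corrected), and `changed_counters` is a hypothetical helper name.

```python
def changed_counters(before, after):
    """Map interface -> {counter name: delta} for every counter that differs."""
    names = ("fatal", "non_fatal", "corrected")
    changes = {}
    for iface, new in after.items():
        old = before.get(iface, (0, 0, 0))  # unseen interface: assume it started at zero
        delta = {n: b - a for n, a, b in zip(names, old, new) if b != a}
        if delta:
            changes[iface] = delta
    return changes


# Counters from the two snapshots in this post.
before = {"em0": (1, 3, 5), "em1": (1, 3, 5), "em2": (0, 2, 5),
          "em3": (0, 3, 5), "bge0": (0, 0, 0)}
after = {"em0": (1, 3, 5), "em1": (1, 3, 5), "em2": (1, 2, 5),
         "em3": (1, 3, 5), "bge0": (0, 0, 0)}

print(changed_counters(before, after))
```

With these inputs, only em2 and em3 show up, each with one additional fatal error.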

Oh well, after the third crash/reboot, I swapped the NIC out and put the replacement in a different PCI slot. dpinger logged packet loss on the WAN interface after that, but it hasn't dropped the interface altogether yet after 30 min (knock on wood).

@BFEITELL I thought about MSI maybe causing problems. The dmesg I have above shows the USB device having trouble:

xhci0: Unable to map MSI-X table

but I don't know if that would matter. I could disable USB entirely for that matter; I only need it for booting to install.

rediske

I neglected to say, my little perl script logged PCI status once every second (57,000+ lines) until mysteriously hanging/stopping one minute short of the crash. I doubt that's a coincidence.

Gertjan

@rediske said in Intel Interface Issues:

I neglected to say, my little perl script logged PCI status once every second (57,000+ lines) until mysteriously hanging/stopping one minute short of the crash. I doubt that's a coincidence.

Well, let's say the device driver, and most probably the related NIC, goes down at that moment, or at least becomes very busy.
The NIC takes the system with it a couple of moments later.

Just to exclude outside issues (DDoS): is it possible for you to change your "real" WAN IP?
Or leave WAN disconnected for a while.

rediske

@gertjan said in Intel Interface Issues:

Well, let's say the device driver, and most probably the related NIC, goes down at that moment, or at least becomes very busy.
The NIC takes the system with it a couple of moments later.

Just to exclude outside issues (DDoS): is it possible for you to change your "real" WAN IP?
Or leave WAN disconnected for a while.

I'm sorry, I got a little fast and loose with the term crash. The pfSense router never actually crashes; the ethernet interfaces become unresponsive to network traffic (ping, web configurator, etc).

Since I swapped the NIC out and changed PCI slots, em3 on the second NIC has died twice now. On the first NIC it was em0 that kept dropping. Same config as before: em0 WAN, em1 LAN, em2 empty, em3 MikroTik router for wireless. I see I got a different WAN IP after the reboot last night, but this morning em3 is already down again, and that's on an internal network with very little traffic (wireless for 2 phones and 2 tablets) while my son and I were sleeping.

Right now it shows em2 and em3 have single fatal PCI errors and the ethernet connection and activity lights on em3 both went dark. I'm writing this on a PC plugged into a switch that's connected to em1 and the WAN is on em0, and those seem to work fine.

When this happened last night, I unplugged the em3 cable and plugged it back in and got link lights back, but it still wouldn't talk. This morning when I unplugged it and plugged it back in, the lights stayed dark.

At this point, I think I'm going to reinstall pfSense and maybe try messing with MSI settings. But I'm betting nothing I do will get either of these Intel cards to be stable with this HP PC/mobo. I don't think it's traffic related, as I imaged 2 VMs on my PC at the same time, 60 GB of traffic in 40 min (about 200 Mbit/s), and that went fine.
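For anyone checking the throughput figure, here is the arithmetic as a small sketch. It assumes decimal gigabytes (1 GB = 10^9 bytes) and an even transfer rate over the whole 40 minutes; `average_mbit_per_s` is just an illustrative name.

```python
def average_mbit_per_s(gigabytes, minutes):
    """Average rate in Mbit/s, assuming decimal units (1 GB = 1e9 bytes)."""
    bits = gigabytes * 1e9 * 8
    seconds = minutes * 60
    return bits / seconds / 1e6


print(average_mbit_per_s(60, 40))
```

That works out to roughly 200 Mbit/s sustained, well below what a 1 Gb port should handle.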

It just seems that after some period of time, anything from an hour to 12 hours, it shuts off one or more ethernet interfaces, sometimes putting messages in the system log and sometimes not.

I saw these from the latest crash:

Oct 15 07:24:43 kernel em3: Watchdog timeout Queue[0]-- resetting
Oct 15 07:24:43 kernel Interface is RUNNING and ACTIVE
Oct 15 07:24:43 kernel em3: TX Queue 0 ------
Oct 15 07:24:43 kernel em3: hw tdh = -1, hw tdt = -1
Oct 15 07:24:43 kernel em3: Tx Queue Status = -2147483648
Oct 15 07:24:43 kernel em3: TX descriptors avail = 40
Oct 15 07:24:43 kernel em3: Tx Descriptors avail failure = 5
Oct 15 07:24:43 kernel em3: RX Queue 0 ------
Oct 15 07:24:43 kernel em3: hw rdh = -1, hw rdt = -1
Oct 15 07:24:43 kernel em3: RX discarded packets = 0
Oct 15 07:24:43 kernel em3: RX Next to Check = 525
Oct 15 07:24:43 kernel em3: RX Next to Refresh = 524

That repeated a few times, the last time being:

Oct 15 07:27:12 kernel em3: Watchdog timeout Queue[0]-- resetting
Oct 15 07:27:12 kernel Interface is RUNNING and ACTIVE
Oct 15 07:27:12 kernel em3: TX Queue 0 ------
Oct 15 07:27:12 kernel em3: hw tdh = -1, hw tdt = -1
Oct 15 07:27:12 kernel em3: Tx Queue Status = -2147483648
Oct 15 07:27:12 kernel em3: TX descriptors avail = 58
Oct 15 07:27:12 kernel em3: Tx Descriptors avail failure = 119
Oct 15 07:27:12 kernel em3: RX Queue 0 ------
Oct 15 07:27:12 kernel em3: hw rdh = -1, hw rdt = -1
Oct 15 07:27:12 kernel em3: RX discarded packets = 0
Oct 15 07:27:12 kernel em3: RX Next to Check = 0
Oct 15 07:27:12 kernel em3: RX Next to Refresh = 0

And now it's 10 AM and there have been no kernel errors since.
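The negative numbers in those dumps are easier to read as unsigned 32-bit values. This sketch assumes the driver prints 32-bit values as signed integers, which the log itself does not confirm; `as_u32` is a hypothetical helper.

```python
def as_u32(value):
    """Reinterpret a signed 32-bit integer as its unsigned bit pattern."""
    return value & 0xFFFFFFFF


# Values copied from the em3 watchdog dump above.
for name, raw in (("hw tdh / tdt / rdh / rdt", -1),
                  ("Tx Queue Status", -2147483648)):
    print(f"{name}: {raw} -> 0x{as_u32(raw):08X}")
```

Under that assumption, -1 is 0xFFFFFFFF and -2147483648 is 0x80000000. All-ones is what a read from a PCI/PCIe device usually returns when the device does not respond, so the head/tail values would be consistent with the card no longer answering on the bus, although this by itself does not prove it.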

rediske

I left the machine with em3 down, since I don't need wifi anyway, and it's been functioning fine as far as I can tell. Only 4 entries on the system log:

Oct 15 10:01:07 check_reload_status Syncing firewall
Oct 15 10:01:07 syslogd exiting on signal 15
Oct 15 10:01:07 syslogd kernel boot file is /boot/kernel/kernel
Oct 15 10:01:07 pfsense.localdomain nginx: 2018/10/15 10:01:07 [error] 58467#100412: send() failed (54: Connection reset by peer)

It's been at 1-3% CPU usage and 7% memory, totally normal for a home network with just 1 PC using the web.

As a refresher, I'm using an AMD A4 PRO-7300B processor (3.8 GHz) in an HP EliteDesk 705 G1 SFF with 6 GB RAM and a 500 GB HDD. I did not disable the onboard bge0 ethernet, and it has nothing plugged into it. I run a single Intel PRO 1000 PT Quad Port 1Gb PCIe Ethernet card at a time, and I've tried two different cards in two different slots.

It'll be a bummer if I can't use the Intel cards. When I researched it, I heard they're usually wonderful for pfSense and I got the pair for $70. There's something sexy about having 8 MAC addresses numbered in a row ;)

Mats

One idea.

What happens if you plug the wireless into the mainboard NIC?
My thinking is that if it's an issue between the MikroTik and Intel, it might help to run the MikroTik against a different NIC type.

rediske

I did not try putting the MikroTik on another port. However, I did try having only two of the Intel interfaces up, as WAN and LAN, and I still wound up having problems.

For fun, I tried installing ESXi on the machine to put pfSense inside that. ESXi wouldn't recognize the Intel card at all.