Intel Interface Issues
-
The Intel license ACKs only stop those messages from appearing. They relate only to the wireless drivers iwi(4) and ipw(4), so they will have no effect here.
Those watchdog errors should never normally appear. The interface is failing, trying to recover, and failing to do so.
I would start by making the most basic install possible and checking that it runs OK before adding any packages etc.
You said you installed 2 quad port NICs but it looks like you're using only 4 ports. I would remove the second NIC if you're not using it, or even swap it with the first one to be sure it's not a hardware issue.
Steve
-
@stephenw10 Thanks much for the reply! I did drop to one NIC a couple days ago. These last few crashes are particularly puzzling because the only symptom I see is a log entry about the WAN interface losing some packets (15-25% loss). A few seconds after that, I notice the WAN link is totally unresponsive. I don't think I've had an instance yet where dpinger notes dropped packets and the device recovers. Sometimes the LAN side drops, too, but other times I can still ping, ssh and use the web interface. I "think" the LAN side stays responsive as long as there aren't watchdog timeouts and additional interface related log entries.
It does seem slightly more stable with ACPI/powerd enabled, as it's been staying up longer, but that could be random too.
I think you're right, my next two steps are:
Change NICs, even use the other PCI slot
Reinstall from scratch, don't use the old config and don't install any packages
Currently I have darkstat, iftop, nmap, ntopng and pfBlockerNG. I had Snort, but something was wrong with it as it didn't show up in the menus, so I dropped it. This might also be a symptom of something wrong in the install.
-
I forgot to note, I started out bridging 7 of the 8 ports in the 2 quad NICs. It was working about the same as now, so my first step in troubleshooting was to disable the bridge and also change out the WAN cable and the cable to my PC.
I kind of wanted to use all 8 ports in the pfSense box as it saves cables, extra hardware, interfaces and allows more port by port management overall, not that I need it in a home network. I'm kind of testing this to see if it's something I want to install at work, where there may be some use in connecting 6 home locations via VPN.
-
It should work even if bridging is usually a bad idea.
Yes, try to rule out hardware initially if you can. As I said, start out with a super basic two NIC config and make sure that works.
Disable anything you don't need in the BIOS, sound cards etc.
Steve
-
Just found something, pciconf -l -c em0 gives some PCI info, including the line:
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
AER is Advanced Error Reporting and this notes some PCI bus errors. Next time I crash, I'll run this command at the console and see what it reveals.
-
EDIT: Changed script for all interfaces
And just 'cause it's Sunday, I wrote a little perl script:
#!/usr/local/bin/perl
use strict;
use warnings;
$| = 1;  # flush our own prints immediately so they stay in order with the child output
# Poll the AER status of every interface once per second, for up to a week.
for (my $i = 1; $i <= 604800; $i++) {
    print "\n";
    system('date');  # system() returns the exit status; the output goes straight to stdout
    for my $if (qw(em0 em1 em2 em3 bge0)) {
        system("/usr/sbin/pciconf -l -c $if | grep AER");
    }
    sleep(1);
}
which outputs:
Sun Oct 14 13:13:41 CDT 2018
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 2 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected
I redirected the output to a text file so I have a second-by-second account of the state of the em0-em3 and bge0 interfaces, to see whether PCI errors (and what kind and how many) occur in the seconds before dpinger makes its syslog entry about the gateway dropping.
-
Take a look at the info about MSI/MSIX here:
https://www.netgate.com/docs/pfsense/hardware/tuning-and-troubleshooting-network-cards.html
-
So I waited a while until a crash. dpinger says the interface crashed at 16:57:23. My script stopped logging a full minute earlier at 16:56:10; maybe it was hanging on the system call to pciconf? The log I made found 2 additional fatal errors though, on em2 (nothing plugged in) and em3 (MikroTik router). So we went from:
em0 - 1 fatal 3 non-fatal 5 corrected
em1 - 1 fatal 3 non-fatal 5 corrected
em2 - 0 fatal 2 non-fatal 5 corrected
em3 - 0 fatal 3 non-fatal 5 corrected
bge0 - 0 fatal 0 non-fatal 0 corrected
to
em0 - 1 fatal 3 non-fatal 5 corrected
em1 - 1 fatal 3 non-fatal 5 corrected
em2 - 1 fatal 2 non-fatal 5 corrected
em3 - 1 fatal 3 non-fatal 5 corrected
bge0 - 0 fatal 0 non-fatal 0 corrected
But this error happened at 14:18:43, 2.5 hours before the eventual crash. After I rebooted again, without any changes (my son was trying to play time-sensitive games), the machine crashed 2 more times inside 10 minutes.
Oh well, after the third crash/reboot, I swapped the NIC out and put the replacement in a different PCI slot. dpinger logged packet loss on the WAN interface after that, but it hasn't dropped the interface altogether yet after 30 min, knock on wood.
@BFEITELL I thought about MSI maybe causing problems. The dmesg I have above shows the USB device having trouble:
xhci0: Unable to map MSI-X table
but I don't know if that would matter? I could disable all the USB for that matter, I only need it for booting to install.
-
I neglected to say, my little perl script logged PCI status once every second (57,000+ lines) until mysteriously hanging/stopping one minute short of the crash. I doubt that's a coincidence.
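With that many lines, the sample where a counter changed is easier to find with a script than by eye. Below is a minimal Python sketch, not what was actually used here. It assumes each sample in the log is a `date` line followed by one AER line per interface, in the order the polling script queries them. The function names and the two-sample log are illustrative only; the sample values are put together from the counts and the 14:18:43 timestamp reported in this thread.

```python
import re

# Assumption: AER lines appear in the same order the polling script queries them.
INTERFACES = ["em0", "em1", "em2", "em3", "bge0"]
AER_RE = re.compile(r"AER \d+ (\d+) fatal (\d+) non-fatal (\d+) corrected")


def parse_samples(text):
    """Return a list of (timestamp, {iface: (fatal, non_fatal, corrected)})."""
    samples = []
    timestamp, counts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = AER_RE.search(line)
        if match:
            counts.append(tuple(int(g) for g in match.groups()))
        else:
            # Any non-blank, non-AER line is treated as the start of a new sample.
            if timestamp is not None:
                samples.append((timestamp, dict(zip(INTERFACES, counts))))
            timestamp, counts = line, []
    if timestamp is not None:
        samples.append((timestamp, dict(zip(INTERFACES, counts))))
    return samples


def find_changes(samples):
    """Return (timestamp, iface, old, new) for every counter that changed."""
    changes = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        for iface in INTERFACES:
            if iface in prev and iface in cur and prev[iface] != cur[iface]:
                changes.append((ts, iface, prev[iface], cur[iface]))
    return changes


# Illustrative two-sample log built from the numbers posted in this thread.
SAMPLE_LOG = """
Sun Oct 14 13:13:41 CDT 2018
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 2 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected

Sun Oct 14 14:18:43 CDT 2018
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 1 fatal 2 non-fatal 5 corrected
ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected
"""

for ts, iface, old, new in find_changes(parse_samples(SAMPLE_LOG)):
    print(f"{ts}: {iface} changed from {old} to {new}")
```

Run against the real log file, this would list only the seconds where something changed, which can then be lined up against the dpinger entries in the system log.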
-
@rediske said in Intel Interface Issues:
I neglected to say, my little perl script logged PCI status once every second (57,000+ lines) until mysteriously hanging/stopping one minute short of the crash. I doubt that's a coincidence.
Well, let's say the device driver, and the related NIC most probably, goes down that moment - or, at least, becomes very busy.
The NIC takes the system with it a couple of moments later.
Just to exclude outside issues (DDoS): is it possible for you to change your "real" WAN IP?
Or leave WAN disconnected for a while.
-
@gertjan said in Intel Interface Issues:
Well, let's say the device driver, and the related NIC most probably, goes down that moment - or, at least, becomes very busy.
The NIC takes the system with it a couple of moments later.
Just to exclude outside issues (DDoS): is it possible for you to change your "real" WAN IP?
Or leave WAN disconnected for a while.
I'm sorry, I got a little fast and loose with the term crash. The pfSense router never actually crashes; the ethernet interfaces become unresponsive to network traffic (ping, web configurator, etc).
Since I swapped the NIC out and changed PCI slots, em3 on the second NIC died twice now. On the first NIC it was em0 that kept dropping. Same config as before, em0 WAN, em1 LAN, em2 empty, em3 MikroTik router for wireless. I see I got a different WAN IP after the reboot last night, but this morning em3 is down already again and that's on an internal network, with very little traffic (wireless for 2 phones and 2 tablets) and my son and I were sleeping.
Right now it shows em2 and em3 have single fatal PCI errors and the ethernet connection and activity lights on em3 both went dark. I'm writing this on a PC plugged into a switch that's connected to em1 and the WAN is on em0, and those seem to work fine.
When this happened last night, I unplugged the em3 cable and plugged it back in and got link lights back, but it still wouldn't talk. This morning when I unplugged it and plugged it back in, the lights stayed dark.
At this point, I think I'm going to reinstall pfSense and maybe try messing with MSI settings. But I'm betting nothing I do will get either of these Intel cards to be stable with this HP PC/mobo. I don't think it's traffic related, as I imaged 2 VMs on my PC at the same time, 60 GB of traffic in 40 min (about 200 Mbit/s), and that went fine.
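As a quick sanity check on that throughput figure, here is a small Python sketch. It assumes decimal units (1 GB = 10^9 bytes) and a steady average rate; the function name is just for illustration.

```python
def average_mbit_per_s(gigabytes, minutes):
    """Average transfer rate in Mbit/s, using decimal units (1 GB = 10**9 bytes)."""
    bits = gigabytes * 10**9 * 8
    seconds = minutes * 60
    return bits / seconds / 10**6


# 60 GB moved in 40 minutes, as described above.
print(average_mbit_per_s(60, 40))
```

That comes out at 200 Mbit/s, well below gigabit line rate but a sustained load all the same.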
It just seems after some period of time, anything from an hour to 12 hours, it shuts off one or more ethernet interfaces, sometimes putting messages in the system log and sometimes not.
I saw these from the latest crash:
Oct 15 07:24:43 kernel em3: Watchdog timeout Queue[0]-- resetting
Oct 15 07:24:43 kernel Interface is RUNNING and ACTIVE
Oct 15 07:24:43 kernel em3: TX Queue 0 ------
Oct 15 07:24:43 kernel em3: hw tdh = -1, hw tdt = -1
Oct 15 07:24:43 kernel em3: Tx Queue Status = -2147483648
Oct 15 07:24:43 kernel em3: TX descriptors avail = 40
Oct 15 07:24:43 kernel em3: Tx Descriptors avail failure = 5
Oct 15 07:24:43 kernel em3: RX Queue 0 ------
Oct 15 07:24:43 kernel em3: hw rdh = -1, hw rdt = -1
Oct 15 07:24:43 kernel em3: RX discarded packets = 0
Oct 15 07:24:43 kernel em3: RX Next to Check = 525
Oct 15 07:24:43 kernel em3: RX Next to Refresh = 524
That repeated a few times, the last time being:
Oct 15 07:27:12 kernel em3: Watchdog timeout Queue[0]-- resetting
Oct 15 07:27:12 kernel Interface is RUNNING and ACTIVE
Oct 15 07:27:12 kernel em3: TX Queue 0 ------
Oct 15 07:27:12 kernel em3: hw tdh = -1, hw tdt = -1
Oct 15 07:27:12 kernel em3: Tx Queue Status = -2147483648
Oct 15 07:27:12 kernel em3: TX descriptors avail = 58
Oct 15 07:27:12 kernel em3: Tx Descriptors avail failure = 119
Oct 15 07:27:12 kernel em3: RX Queue 0 ------
Oct 15 07:27:12 kernel em3: hw rdh = -1, hw rdt = -1
Oct 15 07:27:12 kernel em3: RX discarded packets = 0
Oct 15 07:27:12 kernel em3: RX Next to Check = 0
Oct 15 07:27:12 kernel em3: RX Next to Refresh = 0
And now it's 10 AM and there have been no kernel errors since.
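The negative numbers in those dumps are easier to read as unsigned 32-bit values. Here is a minimal Python sketch of the conversion (the helper name is hypothetical). It assumes the driver is printing 32-bit register contents as signed integers, which the values suggest but the log does not state.

```python
def as_u32_hex(value):
    """Reinterpret a signed 32-bit integer as unsigned and format it as hex."""
    return f"0x{value & 0xFFFFFFFF:08X}"


# Values taken from the watchdog dumps above.
for name, value in [
    ("hw tdh", -1),
    ("hw tdt", -1),
    ("Tx Queue Status", -2147483648),
]:
    print(f"{name} = {value} -> {as_u32_hex(value)}")
```

So -1 is 0xFFFFFFFF, every bit set, and -2147483648 is 0x80000000. A register that reads back as all ones is commonly what you see when a PCIe device has stopped responding on the bus. That would fit with the fatal AER errors on em3, but it's an inference, not something the log says outright.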
-
I left the machine with em3 down, since I don't need wifi anyway, and it's been functioning fine as far as I can tell. Only 4 entries on the system log:
Oct 15 10:01:07 check_reload_status Syncing firewall
Oct 15 10:01:07 syslogd exiting on signal 15
Oct 15 10:01:07 syslogd kernel boot file is /boot/kernel/kernel
Oct 15 10:01:07 pfsense.localdomain nginx: 2018/10/15 10:01:07 [error] 58467#100412: send() failed (54: Connection reset by peer)
It's been at 1-3% CPU usage and 7% memory, totally normal for a home network with just 1 PC using the web.
As a refresher, I'm using an AMD A4 PRO-7300B processor (3.8 GHz) in an HP EliteDesk 705 G1 SFF, 6GB RAM 500GB HDD. I did not disable the on board bge0 ethernet and it has nothing plugged into it. I have a single Intel PRO 1000 PT Quad Port 1Gb PCIe Ethernet card and I've tried two different cards in two different slots.
It'll be a bummer if I can't use the Intel cards. When I researched it, I heard they're usually wonderful for pfSense and I got the pair for $70. There's something sexy about having 8 MAC addresses numbered in a row ;)
-
One idea.
What happens if you plug the wireless into the mainboard NIC?
My thinking is that if it's an issue between the MikroTik and Intel, it might help to run the MikroTik against another NIC type.
-
I did not try putting the MikroTik on another port. However, I did try having only two of the Intel interfaces up, as WAN and LAN, and I still wound up having problems.
For fun, I tried installing ESXi on the machine to put pfSense inside that. ESXi wouldn’t recognize the Intel card at all.