Netgate Discussion Forum

    Intel Interface Issues

    Scheduled Pinned Locked Moved Hardware
    20 Posts 5 Posters 2.9k Views
    • R
      rediske
      last edited by rediske

      Just found something, pciconf -l -c em0 gives some PCI info, including the line:

      ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected

      AER is Advanced Error Reporting and this notes some PCI bus errors. Next time I crash, I'll run this command at the console and see what it reveals.
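A minimal sketch of how that AER summary line could be split into separate counters, so they can be compared over time. It is written in Python rather than the Perl used elsewhere in this thread, and it assumes the first number after `AER` is the capability version and the following three are the fatal, non-fatal, and corrected counts; `parse_aer` is a hypothetical helper name.

```python
import re

# Assumed layout: "AER <version> <n> fatal <n> non-fatal <n> corrected",
# as in the pciconf line quoted above.
AER_RE = re.compile(
    r"AER\s+\d+\s+(\d+)\s+fatal\s+(\d+)\s+non-fatal\s+(\d+)\s+corrected"
)


def parse_aer(line):
    """Return the three AER counters as a dict, or None if the line has no AER summary."""
    m = AER_RE.search(line)
    if m is None:
        return None
    fatal, non_fatal, corrected = (int(g) for g in m.groups())
    return {"fatal": fatal, "non_fatal": non_fatal, "corrected": corrected}


print(parse_aer("ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected"))
```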

      • R
        rediske
        last edited by rediske

        EDIT: Changed script for all interfaces

        And just 'cause it's Sunday, I wrote a little perl script:

        #!/usr/local/bin/perl
        use strict;
        use warnings;

        $| = 1;    # unbuffered output, so the blank line stays in order when redirected to a file

        # Log the AER status of every interface once per second, for up to a week.
        for (my $i = 1; $i <= 604800; $i++) {
            print "\n";
            system('date');    # date prints the timestamp itself
            for my $nic (qw(em0 em1 em2 em3 bge0)) {
                # system() returns the exit status, not the output; grep prints the AER line directly
                system("/usr/sbin/pciconf -l -c $nic | grep AER");
            }
            sleep(1);
        }

        which outputs:

        Sun Oct 14 13:13:41 CDT 2018
        ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
        ecap 0001[100] = AER 1 1 fatal 3 non-fatal 5 corrected
        ecap 0001[100] = AER 1 0 fatal 2 non-fatal 5 corrected
        ecap 0001[100] = AER 1 0 fatal 3 non-fatal 5 corrected
        ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected

        I redirected the output to a text file so I can have a second by second account of the state of the em0-em3 and bge0 interfaces, to see if PCI errors (and what kind and how many) occur second(s) before dpinger makes its syslog entry about the gateway dropping.

        • B
          bfeitell
          last edited by

          Take a look at the info about MSI/MSIX here:
          https://www.netgate.com/docs/pfsense/hardware/tuning-and-troubleshooting-network-cards.html

          • R
            rediske
            last edited by

            So I waited for the next crash. dpinger says the interface went down at 16:57:23. My script stopped logging about a minute earlier, at 16:56:10; maybe it was hanging on the system call to pciconf? The log did show 2 additional fatal errors though, on em2 (nothing plugged in) and em3 (MikroTik router). So we went from:

            em0 - 1 fatal 3 non-fatal 5 corrected
            em1 - 1 fatal 3 non-fatal 5 corrected
            em2 - 0 fatal 2 non-fatal 5 corrected
            em3 - 0 fatal 3 non-fatal 5 corrected
            bge0 - 0 fatal 0 non-fatal 0 corrected

            to

            em0 - 1 fatal 3 non-fatal 5 corrected
            em1 - 1 fatal 3 non-fatal 5 corrected
            em2 - 1 fatal 2 non-fatal 5 corrected
            em3 - 1 fatal 3 non-fatal 5 corrected
            bge0 - 0 fatal 0 non-fatal 0 corrected

            But this error happened at 14:18:43, 2.5 hours before the eventual crash. After I rebooted again, without any changes (my son was trying to play time sensitive games), the machine crashed 2 more times inside 10 minutes.
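A rough sketch of how a log like this could be scanned for the moment a counter changes, instead of reading tens of thousands of lines by eye. It is in Python rather than Perl, and it assumes the log layout produced by the script earlier in this thread: samples separated by a blank line, a timestamp line first, then one AER line per interface in the order em0, em1, em2, em3, bge0. If an interface ever prints no AER line the pairing shifts, so treat this as a starting point; `find_changes` is a hypothetical name.

```python
import re

AER_RE = re.compile(
    r"AER\s+\d+\s+(\d+)\s+fatal\s+(\d+)\s+non-fatal\s+(\d+)\s+corrected"
)
# Assumed polling order, taken from the logging script.
IFACES = ["em0", "em1", "em2", "em3", "bge0"]


def find_changes(log_text):
    """Yield (timestamp, iface, old, new) each time an interface's counters change.

    old and new are (fatal, non_fatal, corrected) tuples.
    """
    previous = {}
    # Each sample starts after a blank line: timestamp first, then the AER lines.
    for block in re.split(r"\n\s*\n", log_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        timestamp, aer_lines = lines[0], lines[1:]
        for iface, line in zip(IFACES, aer_lines):
            m = AER_RE.search(line)
            if m is None:
                continue
            counts = tuple(int(g) for g in m.groups())
            if iface in previous and previous[iface] != counts:
                yield timestamp, iface, previous[iface], counts
            previous[iface] = counts
```

Calling `list(find_changes(open("aer.log").read()))` (the file name is a placeholder) would list every timestamp at which a counter moved, which can then be lined up against the dpinger entries in the system log.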

            Oh well, after the third crash/reboot, I swapped the NIC out and put the replacement in a different PCI slot. dpinger logged packet loss on the WAN interface after that, but it hasn't dropped the interface altogether yet after 30 minutes (knock on wood).

            @BFEITELL I thought about MSI maybe causing problems. The dmesg I have above shows the USB device having trouble:

            xhci0: Unable to map MSI-X table

            but I don't know if that would matter? I could disable all the USB for that matter, I only need it for booting to install.

            • R
              rediske @rediske
              last edited by

              I neglected to say, my little perl script logged PCI status once every second (57,000+ lines) until mysteriously hanging/stopping one minute short of the crash. I doubt that's a coincidence.
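If the logger really did stall inside a pciconf call, one option is to put a timeout on each poll so that a hang is recorded in the log rather than silently ending it. The sketch below is in Python rather than Perl; `poll` and `log_once` are hypothetical names, and the 2-second timeout is an arbitrary choice. A process stuck inside the kernel may not be killable, so this is not guaranteed to keep the logger alive.

```python
import subprocess
import time


def poll(cmd, timeout_s=2.0):
    """Run cmd and return its stdout, or a marker line if it hangs or cannot start."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return "TIMEOUT after %.1fs: %s\n" % (timeout_s, " ".join(cmd))
    except OSError as exc:
        return "ERROR: %s\n" % exc


def log_once(interfaces, out):
    """Write one timestamped sample of AER lines (or timeout/error markers) to out."""
    out.write(time.strftime("%c") + "\n")
    for iface in interfaces:
        output = poll(["/usr/sbin/pciconf", "-l", "-c", iface])
        for line in output.splitlines():
            if "AER" in line or line.startswith(("TIMEOUT", "ERROR")):
                out.write(iface + " " + line.strip() + "\n")
    out.flush()  # make sure the sample reaches the file even if the next poll stalls
```

Calling `log_once(["em0", "em1", "em2", "em3", "bge0"], logfile)` in a loop with `time.sleep(1)` would give the same once-per-second record, with a `TIMEOUT` line marking the point at which the tool stopped responding.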

              • GertjanG
                Gertjan @rediske
                last edited by Gertjan

                @rediske said in Intel Interface Issues:

                I neglected to say, my little perl script logged PCI status once every second (57,000+ lines) until mysteriously hanging/stopping one minute short of the crash. I doubt that's a coincidence.

                Well, let's say the device driver, and the related NIC most probably, goes down that moment - or, at least, becomes very busy.
                The NIC takes the system with it a couple of moments later.

                Just to rule out outside issues (DDoS): could you change your "real" WAN IP?
                Or leave WAN disconnected for a while.


                • R
                  rediske @Gertjan
                  last edited by rediske

                  @gertjan said in Intel Interface Issues:

                  Well, let's say the device driver, and the related NIC most probably, goes down that moment - or, at least, becomes very busy.
                  The NIC takes the system with it a couple of moments later.

                  Just to rule out outside issues (DDoS): could you change your "real" WAN IP?
                  Or leave WAN disconnected for a while.

                  I'm sorry, I got a little fast and loose with the term crash. The pfSense router never actually crashes; the ethernet interfaces become unresponsive to network traffic (ping, web configurator, etc).

                  Since I swapped the NIC out and changed PCI slots, em3 on the second NIC has died twice. On the first NIC it was em0 that kept dropping. Same config as before: em0 WAN, em1 LAN, em2 empty, em3 MikroTik router for wireless. I see I got a different WAN IP after the reboot last night, but this morning em3 is already down again, and that's on an internal network with very little traffic (wireless for 2 phones and 2 tablets) while my son and I were sleeping.

                  Right now it shows em2 and em3 have single fatal PCI errors and the ethernet connection and activity lights on em3 both went dark. I'm writing this on a PC plugged into a switch that's connected to em1 and the WAN is on em0, and those seem to work fine.

                  When this happened last night, I unplugged the em3 cable and plugged it back in and got link lights back, but it still wouldn't talk. This morning when I unplugged it and plugged it back in, the lights stayed dark.

                  At this point, I think I'm going to reinstall pfSense and maybe try messing with MSI settings. But I'm betting nothing I do will get either of these Intel cards to be stable with this HP PC/mobo. I don't think it's traffic related, as I imaged 2 VMs on my PC at the same time, 60 GB of traffic in 40 min (about 200 Mbit/s), and that went fine.
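The throughput figure can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 10^9 bytes); `mbit_per_s` is a hypothetical helper name.

```python
def mbit_per_s(gigabytes, minutes):
    """Average rate in Mbit/s for a transfer, using decimal units."""
    megabits = gigabytes * 8 * 1000   # GB -> gigabits -> megabits
    return megabits / (minutes * 60)  # divide by the duration in seconds


print(mbit_per_s(60, 40))  # 200.0
```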

                  It just seems after some period of time, anything from an hour to 12 hours, it shuts off one or more ethernet interfaces, sometimes putting messages in the system log and sometimes not.

                  I saw these from the latest crash:

                  Oct 15 07:24:43 kernel em3: Watchdog timeout Queue[0]-- resetting
                  Oct 15 07:24:43 kernel Interface is RUNNING and ACTIVE
                  Oct 15 07:24:43 kernel em3: TX Queue 0 ------
                  Oct 15 07:24:43 kernel em3: hw tdh = -1, hw tdt = -1
                  Oct 15 07:24:43 kernel em3: Tx Queue Status = -2147483648
                  Oct 15 07:24:43 kernel em3: TX descriptors avail = 40
                  Oct 15 07:24:43 kernel em3: Tx Descriptors avail failure = 5
                  Oct 15 07:24:43 kernel em3: RX Queue 0 ------
                  Oct 15 07:24:43 kernel em3: hw rdh = -1, hw rdt = -1
                  Oct 15 07:24:43 kernel em3: RX discarded packets = 0
                  Oct 15 07:24:43 kernel em3: RX Next to Check = 525
                  Oct 15 07:24:43 kernel em3: RX Next to Refresh = 524

                  That repeated a few times, the last time being:

                  Oct 15 07:27:12 kernel em3: Watchdog timeout Queue[0]-- resetting
                  Oct 15 07:27:12 kernel Interface is RUNNING and ACTIVE
                  Oct 15 07:27:12 kernel em3: TX Queue 0 ------
                  Oct 15 07:27:12 kernel em3: hw tdh = -1, hw tdt = -1
                  Oct 15 07:27:12 kernel em3: Tx Queue Status = -2147483648
                  Oct 15 07:27:12 kernel em3: TX descriptors avail = 58
                  Oct 15 07:27:12 kernel em3: Tx Descriptors avail failure = 119
                  Oct 15 07:27:12 kernel em3: RX Queue 0 ------
                  Oct 15 07:27:12 kernel em3: hw rdh = -1, hw rdt = -1
                  Oct 15 07:27:12 kernel em3: RX discarded packets = 0
                  Oct 15 07:27:12 kernel em3: RX Next to Check = 0
                  Oct 15 07:27:12 kernel em3: RX Next to Refresh = 0

                  And now it's 10 AM and there have been no kernel errors since.
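The negative values in those watchdog dumps are easier to read as unsigned 32-bit register values. A small sketch, assuming the driver prints 32-bit registers as signed integers (`as_u32` is a hypothetical helper name):

```python
def as_u32(value):
    """Reinterpret a signed 32-bit integer as its unsigned bit pattern."""
    return value & 0xFFFFFFFF


samples = [
    ("hw tdh / tdt / rdh / rdt", -1),
    ("Tx Queue Status", -2147483648),
]
for name, value in samples:
    print("%-26s %12d -> 0x%08X" % (name, value, as_u32(value)))
```

The head and tail pointers come out as 0xFFFFFFFF, i.e. every bit set. An all-ones read is what a PCIe device that has stopped responding typically returns, which would fit with the fatal AER errors seen earlier, though that is an interpretation and not something these logs prove.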

                  • R
                    rediske
                    last edited by

                    I left the machine with em3 down, since I don't need wifi anyway, and it's been functioning fine as far as I can tell. Only 4 entries on the system log:

                    Oct 15 10:01:07 check_reload_status Syncing firewall
                    Oct 15 10:01:07 syslogd exiting on signal 15
                    Oct 15 10:01:07 syslogd kernel boot file is /boot/kernel/kernel
                    Oct 15 10:01:07 pfsense.localdomain nginx: 2018/10/15 10:01:07 [error] 58467#100412: send() failed (54: Connection reset by peer)

                    It's been at 1-3% cpu usage and 7% memory, totally normal for a home network with just 1 PC using the web.

                    As a refresher, I'm using an AMD A4 PRO-7300B processor (3.8 GHz) in an HP EliteDesk 705 G1 SFF with 6GB RAM and a 500GB HDD. I did not disable the onboard bge0 ethernet, and it has nothing plugged into it. I run a single Intel PRO 1000 PT Quad Port 1Gb PCIe Ethernet card at a time, and I've tried two different cards in two different slots.

                    It'll be a bummer if I can't use the Intel cards. When I researched it, I heard they're usually wonderful for pfSense and I got the pair for $70. There's something sexy about having 8 MAC addresses numbered in a row ;)

                    • M
                      Mats
                      last edited by

                      One idea.

                      What happens if you plug the wireless into the mainboard NIC?
                      My thinking is that if it's an issue between the MikroTik and Intel, it might help to run the MikroTik against another NIC type.

                      • R
                        rediske
                        last edited by

                        I did not try putting the MikroTik on another port. However, I did try having only two of the Intel interfaces up, as WAN and LAN, and I still wound up having problems.

                        For fun, I tried installing ESXi on the machine to run pfSense inside that. ESXi wouldn't recognize the Intel card at all.

                        Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.