WAN NIC losing link on Intel(R) PRO/1000 (only on 2.2.x, not on 2.1.x)



  • Hi all,

    Following up (with its own thread) on the WAN NIC losing its link since 2.2.x and not on 2.1.x.

    I've been pointed to a possible solution by cmb in a different thread:

    @cmb:

    This looks to match this:
    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199174

    which should be worked around if you disable MSI and MSIX.
    https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#MSI.2FMSIX

    I disabled MSIX and enabled MSI (disabling both leads to non-working interfaces). I can confirm this setting is in effect by looking at the boot messages:

    em1: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0x2000-0x201f mem 0xf0900000-0xf091ffff,0xf0920000-0xf0923fff at device 0.0 on pci2
    em1: MSIX: insufficient vectors, using MSI
    em1: Using an MSI interrupt
    

    My current /boot/loader.conf.local contains:

    autoboot_delay="3"
    vm.kmem_size="536870912"
    vm.kmem_size_max="1073741824"
    kern.ipc.nmbclusters="512000"
    boot_multicons="YES"
    boot_serial="YES"
    comconsole_speed="115200"
    console="comconsole,vidconsole"
    hw.usb.no_pf="1"
    vfs.zfs.prefetch_disable="1"
    hw.pci.enable_msix="0"
    hw.pci.enable_msi="1"
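
    For anyone who wants to check the interrupt mode on several boxes without reading the boot log by hand, here is a minimal sketch. It only looks for the "Using ..." lines that the em driver prints, as shown in the boot messages above. The function name and the "other" fallback label are my own assumptions and not part of pfSense or FreeBSD.

```python
import re

# Sample boot messages copied from this post.
BOOT_LOG = """\
em1: MSIX: insufficient vectors, using MSI
em1: Using an MSI interrupt
"""


def interrupt_modes(log_text):
    """Map each em interface to 'MSI-X', 'MSI' or 'other', based on its 'Using ...' line."""
    modes = {}
    for line in log_text.splitlines():
        # Case-sensitive on purpose: the "insufficient vectors, using MSI" line is skipped.
        match = re.search(r"\b(em\d+): Using (.+)$", line)
        if not match:
            continue
        iface, detail = match.groups()
        if "MSIX" in detail:
            modes[iface] = "MSI-X"
        elif "MSI" in detail:
            modes[iface] = "MSI"
        else:
            modes[iface] = "other"  # hypothetical label for anything else
    return modes


print(interrupt_modes(BOOT_LOG))
```

    The same function should also accept syslog-style lines with a timestamp and "kernel:" prefix, since the pattern is not anchored to the start of the line.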
    

    I just implemented this in production and will report back in 5-7 days. (or sooner in case the problem returns)

    Regards,
    Kristof.



  • Just to follow up on this. (not really 5-7 days but I wanted to be sure we're good)

    I think this is solved, our current uptime (without losing connection) is: 15 Days 05 Hours 05 Minutes 08 Seconds

    Thanks to cmb who pointed out the solution. It was all due to driver changes and only on these Intel NICs.

    Regards,
    Kristof.



  • Unfortunately the problem remains. I tried the different combinations of disabling MSI/MSIX; none of them makes any difference.

    Anything else I can change that might affect this? (It's most likely a driver issue, because it worked in previous pfSense versions, 2.1.x.)


  • So it runs for 15+ days, and you think it's a driver issue? That's a long time for a driver issue not to show itself.



  • It might run that long; sometimes it takes only a few days.

    It's the only (type of) firewall that we have (with Intel Pro 1000's) that has this issue. And it's been replaced with similar h/w. (to make sure hardware is not the issue)

    I'm basing my opinion on the reports that show up in FreeBSD's Bugzilla:

    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199174

    Other suggestions on how to troubleshoot/nuke this problem are appreciated..



  • Hello,

    I have seen the same problem on my Pf-box (2.2.4 32Bit).

    I use an Intel D2500CC with 4 GB RAM (overkill, I know), with a hifn 7955 card on board. The BIOS version is CCCDT10N.86A.0037.2012.1217.1723, and the drive is a 30 GB IBM SSD. (Overkill too, but it was cheap  :D )

    According to Intel's specs, the onboard NICs are 82574L models: https://downloadmirror.intel.com/20718/eng/D2500CC_ProductGuide02_English.pdf

    The modem from my ISP shows an error on the port where the PF is connected. Rebooting the PF fixes the problem for a while: sometimes 2, 5, 10 or 15 minutes, sometimes 1-2 days.

    Two days ago I reinstalled the PF. This evening the problem started again, and after 5 reboots the WAN NIC did not connect to the modem at all. The WAN is set to DHCP (IPv4). In the end not even a reboot could fix it, only a reinstall  :'(

    I have a backup of my settings, so a restore fixes all my settings.

    I run this box without a monitor, but I have now hooked one up, and the console shows the following error when the problem starts:
    "ugen3.2: <unknown> at usbus3 disconnected"

    I run an IPsec and an OpenVPN connection to a remote site. I have also spoofed the MAC address on the WAN NIC, to make my ISP happy.

    I have a buddy running PF on exactly the same HW configuration. He has also had the problem where the WAN NIC lost all connection.

    I had never seen this issue before. I don't remember whether it only happens on version 2.2.x.

    Any suggestions will be appreciated.



  • If the information in FreeBSD bugzilla is correct, turning off TSO will solve the problem for now. You can do this in System -> Advanced, Networking tab, check "Disable hardware TCP segmentation offload" and reboot.

    I haven't checked GitHub to see if the final version of the patch has been added to pfSense 2.2.5-DEVELOPMENT. I think it unlikely, though, as it hasn't been MFCed to FreeBSD stable/10 yet.

    If the patch makes it into stable/10 by the time pfSense 2.3 releases, it should appear in 2.3. (2.3 snapshots are definitely alpha grade at present, and not suitable for production use).



  • When you say reboot, do you mean a power cycle or a warm reset?  Sometimes the hardware doesn't do a "from dead" restart unless it's power cycled.



  • A warm reset should be sufficient. Indeed, it should do the same job if you can get to a command prompt and:

    ifconfig em0 -tso
    ifconfig em0 -tso4
    ifconfig em0 -tso6

    Repeat this for any other em interfaces and ignore any errors. I can't remember which of these three flags em uses, and I don't have any boxes with an em interface to hand. The box I'm using has interfaces that use the later igb driver.

    There are also various ways to do this with the PHP shell, or by saving and applying a change to each interface.
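
    If you have several em interfaces, the repetition can be scripted. Below is a minimal dry-run sketch that only builds and prints the commands; it does not execute anything. It assumes you feed it a space-separated list of interface names such as the one `ifconfig -l` prints. The function name and the example interface list are made up for illustration.

```python
import re

# The three flags from the commands above; em may reject some of them.
TSO_FLAGS = ("-tso", "-tso4", "-tso6")


def tso_off_commands(interface_names):
    """Return one ifconfig argument list per (em interface, flag) pair."""
    em_ifaces = [name for name in interface_names.split()
                 if re.fullmatch(r"em\d+", name)]
    return [["ifconfig", iface, flag]
            for iface in em_ifaces
            for flag in TSO_FLAGS]


# Illustrative interface list; replace it with the real one from your box.
for argv in tso_off_commands("em0 em1 lo0 pflog0"):
    print(" ".join(argv))
```

    Printing the commands first makes it easy to review them before running anything on a production firewall. Non-em interfaces such as igb are skipped on purpose.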



  • I too have this issue, along with 2 other people that I have been talking with on here.  My box is a little different as it is a Firebox, using Marvell NICs.  This has only started since the 2.2.x upgrade went onto my CF; it has never done this before.

    Something changed, and it happens at odd times.  I can normally guarantee it will happen if I try to access the GUI; there is a serious bug in there.  Unplugging the cable from the NIC and plugging it back in resets the issue. I don't know why, but it does.

    Here is the topic I started, with some things I have done to try to fix it.

    https://forum.pfsense.org/index.php?topic=100010.0



  • @deanot:

    I too have this issue, along with 2 other people that I have been talking with on here.  My box is a little different as it is a Firebox, using Marvell NICs.  This has only started since the 2.2.x upgrade went onto my CF, never has done this before now.

    Something changed, and I also get the odd times it does it.

    With respect, your issue is completely different - as it affects a different NIC using a different driver. The issue discussed in this thread is specific to em interfaces (older / lower spec Intel gigabit NICs - newer / higher spec Intel gigabit NICs use igb).

    The only thing in common with your issue is that something changed between FreeBSD 8.3 (pfSense 2.1.x) and FreeBSD 10.1 (pfSense 2.2.x) that causes random NIC failure. As there has been a huge amount of development in FreeBSD between those two releases and you are not running hardware supported by ESF/Netgate, you have four choices - debug it yourself, pay someone with suitable experience to debug it for you, hope someone in the community has an answer or change hardware. If you insist on sticking with the hardware you have, turning off the NIC's hardware offloading features might help, as the driver/hardware issue might be to do with offloading.



  • Hello guys,
    I'm having the same problem as you.

    I have a dual-port Intel 82546EB network adapter and I run VMware ESXi 6.0.

    The board is really very old.

    Which dual-port card would you advise me to buy for my case?

    I want to use the newest version of PF, 2.2.4.

    I thank everyone!



  • guys,

    I changed the settings in order to disable TSO.
    TSO was already disabled in the firewall under the advanced networking settings.
    I still intend to change the network card, but that can wait for now.

    https://calomel.org/freebsd_network_tuning.html
    http://www.peerwisdom.org/2013/04/03/large-send-offload-and-network-performance/

    Thank you.




  • I, too, have an Intel D2500CC with the onboard Intel 82574L dual NIC, running pfSense nanobsd 2.2.4-RELEASE (amd64).  TSO is turned off via System -> Advanced, Networking tab, "Disable hardware TCP segmentation offload".  I have not experienced any noticeable WAN disconnects, and certainly none that required me to reboot the pfSense box.

    After stumbling upon this thread I saw the comment about the D2500CC and was curious about this issue.  It turns out I already had TSO disabled; I am not sure whether that happened when I imported my m0n0wall config or whether it is disabled by default (maybe it is disabled on nanobsd?).  I kept a pretty good record of the settings/tweaks I have applied to pfSense, but it does not say why TSO was disabled.  I am glad it was, as I do not experience any WAN disconnects.  Below are some system logs for comparison.

    Oct 24 15:04:49 kernel: em1: Using MSIX interrupts with 3 vectors
    Oct 24 15:04:49 kernel: em1: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0x2000-0x201f mem 0x80120000-0x8013ffff,0x80100000-0x8011ffff,0x80140000-0x80143fff irq 17 at device 0.0 on pci1
    Oct 24 15:04:49 kernel: pci1: <ACPI PCI bus> on pcib2
    Oct 24 15:04:49 kernel: pcib2: <ACPI PCI-PCI bridge> at device 28.1 on pci0
    Oct 24 15:04:49 kernel: em0: Using MSIX interrupts with 3 vectors
    Oct 24 15:04:49 kernel: em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0x3000-0x301f mem 0x80220000-0x8023ffff,0x80200000-0x8021ffff,0x80240000-0x80243fff irq 16 at device 0.0 on pci2
    Oct 24 15:04:49 kernel: pci2: <ACPI PCI bus> on pcib1
    Oct 24 15:04:49 kernel: pcib1: <ACPI PCI-PCI bridge> at device 28.0 on pci0

    Intel Desktop Board D2500CC
    http://www.intel.com/content/www/us/en/motherboards/desktop-motherboards/desktop-board-d2500cc.html

    Intel Atom D2500 CPU Specifications
    http://ark.intel.com/products/59682/Intel-Atom-Processor-D2500-1M-Cache-1_86-GHz



  • Well, a few days into 2.2.5 now but the issue is not resolved yet.

    Intel(R) PRO/1000 with em driver, keeps disconnecting after random amount of time. (can be 1 day, can be 2-3 weeks)
    Internet connection is a business account from our local cable provider. (Telenet)

    I checked and these are the advanced settings:

    • Disable hardware checksum offload - Unchecked.

    • Disable hardware TCP segmentation offload - Checked.

    • Disable hardware large receive offload - Checked.

    But the problem remains after upgrading to 2.2.5.

    I temporarily put an identical model running OPNsense in place of this box. I will have to wait until a more recent FreeBSD is used and will retry then.



  • Have you tried "Disable Gateway Monitoring"?  Or configure it to be less sensitive?



  • Thanks for the suggestion, I have tried that in the past, same problem.

    The WAN interface keeps losing its ip.



  • @xtofh:

    The WAN interface keeps losing its ip.

    Is it a static IP or a DHCP assignment?  If it is DHCP assigned, then I'd be watching the renewals for failures / issues with a packet capture of ports 67/68.  What's the lifetime being handed out? (/var/db/dhclient.leases….)



  • @NOYB:

    Is it a static IP or a DHCP assignment?  If it is DHCP assigned, then I'd be watching the renewals for failures / issues with a packet capture of ports 67/68.  What's the lifetime being handed out? (/var/db/dhclient.leases….)

    It's dhcp (but a fixed lease), I didn't see any failures on dhcp with tcpdump. When the issue occurs, there are dhcp requests but simply no replies. (and the provider is working) Also, unplugging the ethernet cable, waiting for a minute, plugging it back in, doesn't help. Plugging a different device into the provider gives me an ip immediately.

    I have to reboot (by logging into the lan or a different wan interface) the whole system.

    /var/db/dhclient.leases.em1:

      option dhcp-lease-time 7200;
      option dhcp-message-type 5;
      option dhcp-server-identifier 195.130.x.y;
      option dhcp-renewal-time 3600;
      option dhcp-rebinding-time 3660;
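
    To put those timers in perspective, here is a small sketch that pulls the dhcp-*-time options out of a lease excerpt and prints them in minutes. The sample text is just the excerpt above. The function name is my own, and the parser handles only this one line format, not the full dhclient.leases syntax.

```python
import re

# Lease excerpt copied from this post (server address left masked as posted).
LEASE_TEXT = """\
  option dhcp-lease-time 7200;
  option dhcp-message-type 5;
  option dhcp-server-identifier 195.130.x.y;
  option dhcp-renewal-time 3600;
  option dhcp-rebinding-time 3660;
"""


def lease_timers(text):
    """Return the dhcp-*-time options as a dict of name -> seconds."""
    pairs = re.findall(r"option (dhcp-[a-z]+-time) (\d+);", text)
    return {name: int(seconds) for name, seconds in pairs}


for name, seconds in lease_timers(LEASE_TEXT).items():
    print(f"{name}: {seconds} s ({seconds / 60:.0f} min)")
```

    With these values the lease lasts two hours and the renewal timer is one hour, so a packet capture on ports 67/68 left running for a few hours should show whether the renewals get answered.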
    


  • @xtofh:

    I didn't see any failures on dhcp with tcpdump. When the issue occurs, there are dhcp requests but simply no replies.

    Why do you think that no replies is not an issue?



  • @NOYB:

    @xtofh:

    I didn't see any failures on dhcp with tcpdump. When the issue occurs, there are dhcp requests but simply no replies.

    Why do you think that no replies is not an issue?

    Not being helpful.. Disconnecting cable, reconnecting doesn't change anything. With 2.1.x I don't get the issue in the first place.

    I have a feeling the NIC gets into some sort of broken state. (It might be driver related, as can be seen in my first post or the other threads.)

    Any suggestions to fix/troubleshoot are welcome.



  • I have the same problem. I am running a 4-5 year old AMD64 box on an ASUS motherboard. I have two NICs, one of which is an Intel Pro 100/1000 (4 or 5 years old). I have not yet created any custom firewall rules. When the install was fresh, pfSense would run for a few seconds to a few minutes before it stopped serving requests. My first troubleshooting step was to turn off any equipment that might compete for DHCP addressing, but to no avail.

    After realising that the gateway log was full of error messages such as those below, I stumbled upon this topic.

    
    Dec 30 09:16:08   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:09   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:10   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:11   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:12   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:13   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:14   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:15   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:16:16   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
    Dec 30 09:17:48   apinger: Starting Alarm Pinger, apinger(26023)
    Dec 30 09:36:25   apinger: Starting Alarm Pinger, apinger(21795)
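
    A log like this is easier to judge when it is summarised. Here is a minimal sketch that counts the bind failures per source address and gateway name, plus the number of apinger restarts. The patterns are derived only from the lines above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Three sample lines copied from the gateway log above.
LOG = """\
Dec 30 09:16:15   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
Dec 30 09:16:16   apinger: Could not bind socket on address(90.226.210.126) for monitoring address 90.226.210.1(WAN_DHCP) with error Can't assign requested address
Dec 30 09:17:48   apinger: Starting Alarm Pinger, apinger(26023)
"""

BIND_RE = re.compile(
    r"apinger: Could not bind socket on address\(([\d.]+)\) "
    r"for monitoring address [\d.]+\((\w+)\)"
)


def summarize(log_text):
    """Return (bind failures per (source address, gateway name), restart count)."""
    failures = Counter()
    restarts = 0
    for line in log_text.splitlines():
        match = BIND_RE.search(line)
        if match:
            failures[(match.group(1), match.group(2))] += 1
        elif "apinger: Starting Alarm Pinger" in line:
            restarts += 1
    return failures, restarts


failures, restarts = summarize(LOG)
print(dict(failures), "restarts:", restarts)
```

    "Can't assign requested address" on a bind generally means the source address is not configured on any interface at that moment, so a burst of these lines would be consistent with the interface having lost its IP.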
    
    

    Starting out, I had put the LAN interface on the Intel card, but that caused the web configurator to become unresponsive as soon as the link went down. Then I switched the interfaces (so that the WAN interface uses the Intel NIC), with the somewhat positive result that I could at least reboot the system remotely when the link dropped, as it kept doing a few minutes after each reboot.

    After checking "Disable hardware TCP segmentation offload" in System -> Advanced, Networking tab as advised by David_W, uptime increased from minutes to hours (at least in some cases, but sometimes it is minutes still).

    I have now also tried changing the System: Advanced: System Tunables variable net.inet.tcp.tso from 1 to 0, as advised by julicravo, but that made the system more unstable: after the change it stopped serving requests just a few minutes after rebooting. I must admit, though, that it is hard to tell whether net.inet.tcp.tso made any difference for better or worse.

    I think I will look for a dual interface card to resolve my problem. Any recommendations on such cards that play well with pfsense?



  • After checking one more time I realised that my Intel PRO NIC works fine. It is my other NIC from Marvell Semiconductors Yukon that is causing my problems. Right now I am running my WAN on the onboard ethernet and my LAN on the 1 GB Intel Pro NIC and this configuration works great. I found a thread on problems with Marvell and AMD64 here: https://forum.pfsense.org/index.php?topic=104420.msg582152#msg582152



  • Any updates on this? I'm having a similar issue.



  • The only update I have is that I'm using opnsense for this specific firewall now. Not sure if that solves it because I try to keep updated with their upgrades. (and it gets very frequent updates resulting in (too) frequent reboots)

    All my other pfsense firewalls have a different NIC and they don't have the problem.

    I suppose we'll just have to wait until a more recent FreeBSD is used and we'll (hopefully) get updated nic drivers..



  • I hate to dig up the thread just to say "me too" but this thread accurately describes my problem. My specific WAN card is an Intel PRO/1000 PT Dual Port Server Adapter - network adapter - 2 ports (EXPI9402PTBLK). Happy to provide whatever other information would be useful.

    Any updates on this issue?

    I am running pfSense 2.3.2.

    Would the same issue affect an Intel Pro/1000 PT Quad port D72468 39Y6137 NC364T 10N8556 EXPI9404PTG2L20 D57995?