Igb driver - interface flapping for no apparent reason!?



  • I wasn't sure whether to put this in the Hardware section or not, partially because I'm not sure if it's a hardware problem or a software bug.

    Running 2.4.1-RELEASE
    Hardware is a Supermicro A1SRi-2558F, which is an Atom C2558 board, 4x onboard Intel LAN that use the "igb" driver.  Hardware is very similar to one of the (discontinued) official Netgate models, but has IPMI which is why I built on this instead of buying an official system.

    Was running along fine for several days after upgrading to 2.4.1, and then I noticed abysmal performance on the LAN.  Very basic setup - single WAN, single LAN interface, no VLANs in use. igb0 is WAN, igb1 is LAN.

    clog /var/log/system.log | less

    
    Nov 10 18:45:47 pf check_reload_status: Reloading filter
    Nov 10 18:45:48 pf check_reload_status: Linkup starting igb1
    Nov 10 18:45:48 pf kernel: igb1: link state changed to UP
    Nov 10 18:45:49 pf dhcpleases: kqueue error: unkown
    Nov 10 18:45:51 pf check_reload_status: updating dyndns lan
    Nov 10 18:45:51 pf check_reload_status: Reloading filter
    Nov 10 18:45:51 pf php-fpm[19092]: /rc.linkup: DEVD Ethernet detached event for lan
    Nov 10 18:45:51 pf check_reload_status: Reloading filter
    Nov 10 18:45:51 pf php-fpm[47641]: /rc.linkup: DEVD Ethernet attached event for lan
    Nov 10 18:45:51 pf php-fpm[47641]: /rc.linkup: HOTPLUG: Configuring interface lan
    Nov 10 18:45:51 pf kernel: igb1: link state changed to DOWN
    Nov 10 18:45:51 pf check_reload_status: Linkup starting igb1
    Nov 10 18:45:51 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:45:51 pf check_reload_status: Restarting ipsec tunnels
    Nov 10 18:45:54 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:45:54 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:45:55 pf check_reload_status: Linkup starting igb1
    Nov 10 18:45:55 pf kernel: igb1: link state changed to UP
    Nov 10 18:45:55 pf php-fpm[47641]: /rc.linkup: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1510357555] unbound[72891:0] error: can't bind socket: Can't assign requested address for 2607:fcc8:xx-private-xx [1510357555] unbound[72891:0] fatal error: could not open ports'
    Nov 10 18:45:55 pf dhcpleases: kqueue error: unkown
    Nov 10 18:45:55 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:45:56 pf php-fpm[69404]: /rc.newwanipv6: rc.newwanipv6: Info: starting on igb0.
    Nov 10 18:45:56 pf php-fpm[69404]: /rc.newwanipv6: rc.newwanipv6: on (IP address: 2605:a000:xx-private-xx) (interface: wan) (real interface: igb0).
    Nov 10 18:45:56 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:45:56 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:45:57 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:45:57 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:45:57 pf check_reload_status: updating dyndns lan
    Nov 10 18:45:57 pf check_reload_status: Reloading filter
    Nov 10 18:45:57 pf php-fpm[23299]: /rc.linkup: DEVD Ethernet detached event for lan
    Nov 10 18:45:57 pf check_reload_status: Reloading filter
    Nov 10 18:45:57 pf php-fpm[47641]: /rc.linkup: DEVD Ethernet attached event for lan
    Nov 10 18:45:57 pf php-fpm[47641]: /rc.linkup: HOTPLUG: Configuring interface lan
    Nov 10 18:45:57 pf php-fpm[69404]: /rc.newwanipv6: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1510357557] unbound[90131:0] error: can't bind socket: Can't assign requested address for 192.168.42.1 [1510357557] unbound[90131:0] fatal error: could not open ports'
    Nov 10 18:45:57 pf dhcpleases: kqueue error: unkown
    Nov 10 18:45:57 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:45:58 pf kernel: igb1: link state changed to DOWN
    Nov 10 18:45:58 pf check_reload_status: Linkup starting igb1
    Nov 10 18:45:58 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:45:58 pf check_reload_status: Restarting ipsec tunnels
    Nov 10 18:45:58 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:45:58 pf php-fpm[69404]: /rc.newwanipv6: The command '/usr/local/sbin/dhcpd -6 -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpdv6.conf -pf /var/run/dhcpdv6.pid igb1' returned exit code '1', the output was 'Internet Systems Consortium DHCP Server 4.3.6 Copyright 2004-2017 Internet Systems Consortium. All rights reserved. For info, please visit https://www.isc.org/software/dhcp/ Config file: /etc/dhcpdv6.conf Database file: /var/db/dhcpd6.leases PID file: /var/run/dhcpdv6.pid Wrote 0 NA, 0 TA, 0 PD leases to lease file.  No subnet6 declaration for igb1 (fe80::1:1). ** Ignoring requests on igb1.  If this is not what    you want, please write a subnet6 declaration    in your dhcpd.conf file for the network segment to which interface igb1 is attached. **   Not configured to listen on any interfaces!  If you think you have received this message due to a bug rather than a configuration issue please read the section on submitting bugs on either our web page at www.isc.org or in the README file
    Nov 10 18:45:59 pf php-fpm[69404]: /rc.newwanipv6: ROUTING: setting default route to 173.91.32.1
    Nov 10 18:45:59 pf php-fpm[69404]: /rc.newwanipv6: ROUTING: setting IPv6 default route to fe80::201:5cff:fe8d:8246%igb0
    Nov 10 18:45:59 pf php-fpm[69404]: /rc.newwanipv6: Removing static route for monitor fe80::201:5cff:fe8d:8246 and adding a new route through fe80::201:5cff:fe8d:8246%igb0
    Nov 10 18:45:59 pf check_reload_status: Reloading filter
    Nov 10 18:45:59 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:45:59 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:46:00 pf dhcpleases: kqueue error: unkown
    Nov 10 18:46:01 pf check_reload_status: Linkup starting igb1
    Nov 10 18:46:01 pf kernel: igb1: link state changed to UP
    Nov 10 18:46:02 pf check_reload_status: updating dyndns lan
    Nov 10 18:46:02 pf check_reload_status: Reloading filter
    Nov 10 18:46:02 pf php-fpm[23299]: /rc.linkup: DEVD Ethernet detached event for lan
    Nov 10 18:46:02 pf check_reload_status: Reloading filter
    Nov 10 18:46:02 pf php-fpm[47641]: /rc.linkup: DEVD Ethernet attached event for lan
    Nov 10 18:46:02 pf php-fpm[47641]: /rc.linkup: HOTPLUG: Configuring interface lan
    Nov 10 18:46:03 pf php-fpm[81689]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
    Nov 10 18:46:03 pf kernel: igb1: link state changed to DOWN
    Nov 10 18:46:03 pf check_reload_status: Linkup starting igb1
    Nov 10 18:46:03 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:46:03 pf check_reload_status: Restarting ipsec tunnels
    Nov 10 18:46:05 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:46:05 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:46:06 pf check_reload_status: Linkup starting igb1
    Nov 10 18:46:06 pf kernel: igb1: link state changed to UP
    Nov 10 18:46:14 pf dhcpleases: kqueue error: unkown
    Nov 10 18:46:16 pf check_reload_status: updating dyndns lan
    Nov 10 18:46:17 pf check_reload_status: Reloading filter
    Nov 10 18:46:17 pf php-fpm[23299]: /rc.linkup: DEVD Ethernet detached event for lan
    Nov 10 18:46:17 pf check_reload_status: Reloading filter
    Nov 10 18:46:17 pf php-fpm[45617]: /rc.linkup: DEVD Ethernet attached event for lan
    Nov 10 18:46:17 pf php-fpm[45617]: /rc.linkup: HOTPLUG: Configuring interface lan
    Nov 10 18:46:17 pf kernel: igb1: link state changed to DOWN
    Nov 10 18:46:17 pf check_reload_status: Linkup starting igb1
    Nov 10 18:46:17 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:46:17 pf check_reload_status: Restarting ipsec tunnels
    Nov 10 18:46:18 pf php-fpm[69404]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
    Nov 10 18:46:18 pf check_reload_status: Reloading filter
    Nov 10 18:46:19 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:46:19 pf dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 10 18:46:20 pf check_reload_status: Linkup starting igb1
    Nov 10 18:46:20 pf kernel: igb1: link state changed to UP
    Nov 10 18:46:20 pf syslogd: sendto: Host is down
    Nov 10 18:46:20 pf syslogd: sendto: Host is down
    Nov 10 18:46:20 pf syslogd: sendto: Host is down
    Nov 10 18:46:21 pf php-fpm[69404]: /rc.newwanipv6: rc.newwanipv6: Info: starting on igb0.
    Nov 10 18:46:21 pf php-fpm[69404]: /rc.newwanipv6: rc.newwanipv6: on (IP address: 2605:a000:xxx-private-xxx) (interface: wan) (real interface: igb0).
    Nov 10 18:46:22 pf syslogd: sendto: Host is down
    Nov 10 18:46:22 pf dhcpleases: kqueue error: unkown
    Nov 10 18:46:23 pf dhcpleases: /etc/hosts changed size from original!
    Nov 10 18:46:23 pf php-fpm[69404]: /rc.newwanipv6: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1510357583] unbound[5198:0] error: bind: address already in use [1510357583] unbound[5198:0] fatal error: could not open ports'
    Nov 10 18:46:23 pf dhcpleases: kqueue error: unkown
    Nov 10 18:46:24 pf check_reload_status: updating dyndns lan
    Nov 10 18:46:24 pf check_reload_status: Reloading filter
    Nov 10 18:46:24 pf php-fpm[86176]: /rc.linkup: DEVD Ethernet detached event for lan
    Nov 10 18:46:24 pf check_reload_status: Reloading filter
    Nov 10 18:46:24 pf php-fpm[45617]: /rc.linkup: DEVD Ethernet attached event for lan
    Nov 10 18:46:24 pf php-fpm[45617]: /rc.linkup: HOTPLUG: Configuring interface lan
    Nov 10 18:46:24 pf kernel: igb1: link state changed to DOWN
    --snipped---
    
    

    So, something was making igb1 flap… I have no idea what or why.  This went on for a good 20+ minutes that I'm aware of, and eventually I just rebooted the router (gracefully)... and it worked fine again.
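
    To put a number on the flapping, the kernel's "link state changed" lines can be tallied per interface. This is only a minimal Python sketch, assuming the log has already been dumped to text (for example with the clog command above) and that lines look like the excerpt; the name count_link_flaps is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Nov 10 18:45:51 pf kernel: igb1: link state changed to DOWN
LINK_RE = re.compile(r"kernel: (\w+): link state changed to (UP|DOWN)")

def count_link_flaps(lines):
    """Return {interface: Counter of 'UP'/'DOWN' events} for the given log lines."""
    flaps = {}
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            iface, state = match.groups()
            flaps.setdefault(iface, Counter())[state] += 1
    return flaps

# A few lines copied from the excerpt above, plus one unrelated line.
sample = [
    "Nov 10 18:45:48 pf kernel: igb1: link state changed to UP",
    "Nov 10 18:45:51 pf kernel: igb1: link state changed to DOWN",
    "Nov 10 18:45:55 pf kernel: igb1: link state changed to UP",
    "Nov 10 18:45:51 pf check_reload_status: Reloading filter",
]

for iface, counts in count_link_flaps(sample).items():
    print(iface, "up:", counts["UP"], "down:", counts["DOWN"])
```

    Feeding it the whole log instead of the sample would show whether any interface other than igb1 is changing state.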

    Worth noting: igb1/LAN from the pfSense box is connected directly to a SamKnows monitoring "whitebox," which then connects to my switch. The "whitebox" effectively is just a 2-port bridge between my main switch and pfsense, it performs periodic traffic tests and ping monitoring whenever it determines my WAN isn't busy.  While troubleshooting, I removed the SamKnows box from the equation entirely (cabled pfSense directly to my switch), and that was when I could see the link flapping at the physical level (the switch port lights were going on then off again over and over; I couldn't see the lights on the pfSense box since the rear was facing away from me but I have to guess it was the same there).

    Any ideas / thoughts? Normally I'd think hardware problem, except this equipment had been working perfectly fine for the previous several days and nothing had changed, plus a reboot fixed it... (it has been running fine for another 3 days since the reboot) which makes me lean to software. Obviously not a physical cable issue because it's working fine after a reboot. The WAN side of pfSense seemed to be fine throughout, but when your LAN is flapping repeatedly it makes it very difficult to even get in to the router to gracefully reboot it or troubleshoot what was going on. (I thankfully have a physical keyboard and VGA console on the box, since while it does have IPMI, the IPMI is Java-based and sucks.)



  • Since your pfSense setup is very basic, could you try reinstalling it after taking a config backup?
    It should be a fresh, full install. That normally takes around 30 minutes, and in most cases
    everything works fine afterwards.

    Which version did you upgrade from? 2.3.x or 2.4.x to 2.4.1, or from an earlier version?
    Does your ISP use VLANs on the Internet connection?

    Well-known issues are the following:

    • 2.4.0 has problems with some VLAN labeling
    • 2.4.1 has problems with some VLANs over PPP
    • generally there are problems if you change from 2.2.x to 2.4.x
    • generally there are also problems with connected USB devices such as CD/DVD burners (from time to time)

    (I thankfully have a physical keyboard and VGA console on the box, since while it does have IPMI, the IPMI is Java-based and sucks.)

    Is this a serial ISO install or a VGA ISO install?



  • Was from 2.3.4 to 2.4.0 and then to 2.4.1.  Ran fine for several days or weeks on 2.4.1 before having this problem.

    Don't have time to screw with a fresh install / config restore, this is my main WAN router at home, and it's in an inconvenient location to physically access (small crawlspace).

    ISP does not offer VLANs, ISP is TWC/Charter/Spectrum in the US.

    I am not using VLANs at all currently.

    The only connected USB devices are an APC UPS and a keyboard.

    This is a VGA installation. I wanted to try to set up SOL console redirection (available through the BIOS, which should allow me to SSH into the IPMI to get console access), but never actually figured out how to make it work.



  • have you tried the options mentioned here? https://forum.pfsense.org/index.php?topic=69486.0

    or just try these options in your /boot/loader.conf.local

    kern.cam.boot_delay=10000
    kern.ipc.nmbclusters=1000000
    hw.igb.num_queues=1
    legal.intel_ipw.license_ack=1
    legal.intel_iwi.license_ack=1
    hw.pci.enable_msix=0
    hw.igb.enable_msix=0
    


  • @Birke:

    have you tried the options mentioned here? https://forum.pfsense.org/index.php?topic=69486.0

    or just try these options in your /boot/loader.conf.local

    kern.cam.boot_delay=10000
    kern.ipc.nmbclusters=1000000
    hw.igb.num_queues=1
    legal.intel_ipw.license_ack=1
    legal.intel_iwi.license_ack=1
    hw.pci.enable_msix=0
    hw.igb.enable_msix=0
    

    Very interesting, but the post you linked was from 2013 and I would hope that the drivers/kernel have advanced beyond this! Hoping maybe someone from the development team can comment on whether this is still necessary?



  • There have been a lot of improvements since 2013, but igb(4) still has a lot of bugs. AFAIK kern.ipc.nmbclusters=1000000 at least is still needed to maintain stability. I would also recommend disabling TSO and LRO (System/Advanced/Networking). I have also been using hw.igb.num_queues=1 since 2.3; I know something was improved since then, but I still see problems reported again and again.
    But honestly, I have never encountered a cyclic state change on an igb network adapter, so it could be a hardware issue as well... or it may depend on the hardware: in my case it just gives me a kernel panic, and in yours it cycles the adapter. I am not sure.



  • Does it matter if I put these into loader.conf.local or just use the GUI's system tunables instead?

    I prefer to keep stuff in the GUI / accessible so it's backed up with a config dump rather than having to remember that I put something "important" into loader.conf.local…



  • I hope it does not matter 8)



  • I've seen flapping with a lot of the ONUs that BrightHouse used and resolved it by manually setting the interface speed.  Only other time I've seen it is a bad switch.  I've never seen it where my pfSense router was the issue but anything is possible.  What if you make a different port your LAN port to see if it is a bad port?



  • @Stewart:

    I've seen flapping with a lot of the ONUs that BrightHouse used and resolved it by manually setting the interface speed.  Only other time I've seen it is a bad switch.  I've never seen it where my pfSense router was the issue but anything is possible.  What if you make a different port your LAN port to see if it is a bad port?

    I had problems with the WAN side against an older Surfboard modem a looong time ago, but that was solved by hard-setting the mediatype to "autoselect" (instead of "Default (autoselect/driver's preference)").

    That didn't seem to help here.

    I haven't tried changing the port yet because it doesn't seem like a port hardware problem… I would expect consistent traffic problems rather than the weird intermittent sporadic issues I've had (that mostly seem to come up after rebooting the system for some reason).

    hw.igb.num_queues is apparently only modifiable at boot-time, so it must go into /boot/loader.conf.local and can't be set via system_advanced_sysctl.php.  :( It also doesn't seem to help the problem.
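
    Since boot-time tunables like this live in a file rather than in the GUI, it can help to double-check what is actually set there. This is a rough Python sketch, assuming the simple name=value syntax of the lines posted earlier in the thread; parse_loader_conf is a hypothetical helper, not part of pfSense, and the example text stands in for the real file.

```python
def parse_loader_conf(text):
    """Parse simple name=value lines (as in /boot/loader.conf.local) into a dict."""
    tunables = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "=" not in line:
            continue  # blank line or something that is not an assignment
        name, value = line.split("=", 1)
        tunables[name.strip()] = value.strip().strip('"')
    return tunables

# Stand-in for the file contents; values are taken from the suggestion in this thread.
example = """
# igb tuning
kern.ipc.nmbclusters=1000000
hw.igb.num_queues="1"
"""

tunables = parse_loader_conf(example)
for name in ("hw.igb.num_queues", "kern.ipc.nmbclusters", "hw.igb.enable_msix"):
    print(name, "=", tunables.get(name, "(not set)"))
```

    On the router, the text would come from reading /boot/loader.conf.local instead of the example string.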

    I'm wondering if it's somehow service-related… The only packages I'm running are darkstat, NUT, and Avahi.  When I rebooted the machine (yay for IPMI), I noticed the interface flapping behavior stop immediately after the console printed a notice about the Avahi shutdown, but that may have just been coincidence. I just disabled darkstat for now (which I know puts the interface into promiscuous mode, so I'm wondering if that could have anything to do with it too).

    The board I'm using needs to be RMA'd anyway due to the Atom clock/boot bug... I need to get in touch with Supermicro and hope they can do an "Advance RMA" where they send me a new board first.



  • hw.igb.num_queues is apparently only modifiable at boot-time, so it must go into /boot/loader.conf.local and can't be set via system_advanced_sysctl.php.  :( It also doesn't seem to help the problem.

    If the system has created 4 queues for that port, it needs time to fill them up with packets. If that takes too much time,
    I am pretty sure that more than two of the installed packages will also report issues on that pfSense system. A real
    port mismatch could be solved quickly by setting the proper port speed. Reducing the number of queues is one thing,
    and setting the right mbuf size is another.

    I had problems with the WAN side against an older Surfboard modem a looong time ago, but that was solved by hard-setting the mediatype to "autoselect" (instead of "Default (autoselect/driver's preference)").

    On those boards, the WAN port and the IPMI port are shared by default because of a BIOS setting.
    You can change this quickly by rebooting and changing that setting in the BIOS, which might help.



  • I know that ASRock Rack boards really do have this problem: even if you have a dedicated port for IPMI, the first port is also shared for IPMI. This can cause a lot of problems, like huge packet loss or non-working IPMI, but I am not sure whether this also applies to Supermicro boards.



  • @BlueKobold:

    hw.igb.num_queues is apparently only modifiable at boot-time, so it must go into /boot/loader.conf.local and can't be set via system_advanced_sysctl.php.  :( It also doesn't seem to help the problem.

    If the system has created 4 queues for that port, it needs time to fill them up with packets. If that takes too much time,
    I am pretty sure that more than two of the installed packages will also report issues on that pfSense system. A real
    port mismatch could be solved quickly by setting the proper port speed. Reducing the number of queues is one thing,
    and setting the right mbuf size is another.

    I had problems with the WAN side against an older Surfboard modem a looong time ago, but that was solved by hard-setting the mediatype to "autoselect" (instead of "Default (autoselect/driver's preference)").

    On those boards, the WAN port and the IPMI port are shared by default because of a BIOS setting.
    You can change this quickly by rebooting and changing that setting in the BIOS, which might help.

    I have already adjusted the num_queues to 1 as well as the nmbclusters to 1000000 and it did not help (system has been rebooted multiple times since adding these to loader.conf.local).  This supermicro board's IPMI is a separate dedicated port on the motherboard, and there isn't even an option to "share" one of the main ports.  Even if it could, I would expect that to be port 1 (igb0), which is my WAN port and hasn't shown any problems (yet).

    The interfaces are both configured for "autoselect", which matches with the equipment on the other end, but I'm still having trouble.

    Oddly, the WAN side is fine, it's only the LAN that is acting up.  I'm at work right now and can SSH to the router (I have a VPN tunnel between work and home) and watch as igb1 keeps going from "no media"  (thinks it is unplugged) to "active."  I tried using igb2 as LAN earlier (which is currently unused) and it was doing the same thing, flapping up and down, so I went back to igb1.

    I just altered igb1 to 1000-full (hardcoded) and now it seems to be stable again, but this isn't "correct" since the equipment on the other side is autonegotiating.
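
    For anyone following along, the flapping and the effect of forcing the media type can be checked from the media and status lines that ifconfig prints for the interface. This is a small Python sketch, assuming the usual FreeBSD ifconfig layout; the sample text is illustrative rather than captured from this router, and parse_media is a made-up helper name.

```python
import re

def parse_media(ifconfig_text):
    """Extract the 'media:' and 'status:' fields from ifconfig output for one interface."""
    info = {}
    for line in ifconfig_text.splitlines():
        match = re.match(r"\s*(media|status):\s*(.+)", line)
        if match:
            info[match.group(1)] = match.group(2).strip()
    return info

# Illustrative sample only (assumed layout, not real output from this box).
sample = """igb1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tmedia: Ethernet 1000baseT <full-duplex>
\tstatus: active
"""

info = parse_media(sample)
print(info["media"], "/", info["status"])
```

    On the router itself, the text would come from running ifconfig igb1; polling it every second or so should show the status value changing while the link flaps.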

