After 2.1.x upgrade, check_reload_status loop on rc.linkup



  • Hi!

    I have quite a few pfSense servers, but this one is a little different: it has a 10 Gbit network card.

    Whenever I change a supported option on my ix0 or ix1 interface, such as adding tso and polling, check_reload_status goes into an absurd loop, eating memory as fast as 2 GB a second, going into swap and effectively locking up the environment. The loop is also so tight, and probably creates or uses so many files, that kern.maxfiles is hit every second for users nobody, root, and id 181.

    I've traced this loop to a specific operation: check_reload_status running the rc.linkup script, right after it writes "Linkup starting…" to the log.

    Here is a piece of the log:

    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3005
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3004
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3003
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan10
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3002
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3001
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3000
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3009
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan20
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan21
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan100
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3005
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3004
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3003
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan10
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3002
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3001
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3000
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3009
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan20
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan21
    Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan100
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3005
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3004
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3003
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan10
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3002
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3001
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3000
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3009
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan20
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan21
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan100
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3005
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3004
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3003
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan10
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3002
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3001
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3000
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3009
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan20
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan21
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan100
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3005
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3004
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3003
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan10
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3002
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3001
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3000
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3009
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan20
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan21
    Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan100
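
To put a number on the loop, an excerpt like the one above can be tallied per interface and per second. The following is a minimal Python sketch, not part of pfSense: the function names and the threshold of 2 are assumptions made for illustration, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   "Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3005"
LINKUP_RE = re.compile(
    r"^\s*(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"check_reload_status:\s+Linkup starting\s+(?P<iface>\S+)"
)


def count_linkup_events(lines):
    """Count 'Linkup starting' events per (timestamp, interface) pair."""
    counts = Counter()
    for line in lines:
        match = LINKUP_RE.match(line)
        if match:
            # Normalise runs of whitespace inside the timestamp.
            stamp = " ".join(match.group("stamp").split())
            counts[(stamp, match.group("iface"))] += 1
    return counts


def looping_interfaces(counts, threshold=2):
    """Return interfaces with at least `threshold` events within one second.

    The threshold is a hypothetical cut-off, not a value taken from pfSense.
    """
    return sorted({iface for (_, iface), n in counts.items() if n >= threshold})


# A few lines copied from the excerpt above.
sample = [
    "Nov 9 01:32:17 check_reload_status: Linkup starting ix0",
    "Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan10",
    "Nov 9 01:32:17 check_reload_status: Linkup starting ix0",
    "Nov 9 01:32:18 check_reload_status: Linkup starting ix0",
]

counts = count_linkup_events(sample)
repeated = looping_interfaces(counts)
```

An interface that shows up more than once within the same second, as ix0 does in this sample, is a sign that rc.linkup is being triggered repeatedly rather than once per link event.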

    Some more information:

    Hardware
    Dell R420
    Intel Xeon E5-2430
    8 GB 1333 MHz
    2x HD 500GB SATA2 @ gmirror
    Intel X520DA2 10Gbit
    Intel i350 4P 1Gbit

    ixgbe.ko
    This driver is a slightly modified version 2.4.8, adding support for Dell's X520DA2, and compiled on pfSense 2.0.3. Since the update to 2.1.4 and 2.1.5 the driver seems to be working fine; this is a production server, and no problem was detected after a few weeks running version 2.1.4.

    There is a catch, though: this driver version has a few problems with VLAN hardware tagging, so I have to disable it for the network card to work with the VLANs I have.

    The command I run on every start is: ifconfig ix0 tso polling -lro -vlanhwfilter -vlanhwtag
    Right after I run the command, check_reload_status triggers rc.linkup and loops.

    loader.conf
    autoboot_delay="3"
    vm.kmem_size="435544320"
    vm.kmem_size_max="535544320"
    kern.ipc.nmbclusters="262144"
    kern.ipc.nmbjumbop="262144"
    hw.ixgbe.num_queues="4"
    ixgbe_load="YES"
    hw.intr_storm_threshold="0"
    hw.usb.no_pf="1"

    I'll try to take a look at the check_reload_status source in pfSense's git repository; maybe I can find something.

    If anyone could also help…

    Thanks everyone! :)



  • That sounds like an issue we fixed in 2.2 recently in check_reload_status, though it seems the changes you made trigger that circumstance in a much worse way than anything in stock releases ever would.

    I'd recommend trying 2.2 without making any changes at all to the driver and without custom ifconfig commands; stock everything should be fine there. loader.conf.local should have the two ix changes noted here, and the sysctl for hw.intr_storm_threshold should be added to the system tunables.
    https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Intel_ix.284.29_Cards

    Do not disable vlanhw* and TSO on 2.2; those workarounds should be unnecessary.



  • Oh, I see. It's good to know it's a known bug and that it's fixed in 2.2.

    Unfortunately it's a production environment and the core server of the network infrastructure, so I can't install a beta version. I'll wait for the final version.

    Thanks for the help! :)



  • You're significantly better off with stock 2.2 at this point than a prior release with a kernel module network driver compiled for a different base OS. You've destabilized things with the driver change. Though I guess if you can guarantee a system isn't going to lose link, it's fine. The issue I was referring to was in check_reload_status leaking file handles, which is a tiny portion of the overall problem there, and nothing to do with the source of it. We're about to hit release candidate on 2.2. I'd strongly reconsider at that point, as what you're doing there is a house of cards that will quite possibly collapse (more than it already has).
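
A file-handle leak of the kind mentioned above can be watched by comparing the FreeBSD sysctls kern.openfiles and kern.maxfiles. The following is a minimal Python sketch that parses the "name: value" text printed by `sysctl kern.openfiles kern.maxfiles`; the helper names are hypothetical and the numbers in the sample are made up for illustration, not taken from this system.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl, into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Keep only well-formed lines whose value is a plain integer.
        if sep and value.isdigit():
            values[name.strip()] = int(value)
    return values


def file_table_usage(text):
    """Return open files as a fraction of kern.maxfiles."""
    values = parse_sysctl(text)
    return values["kern.openfiles"] / values["kern.maxfiles"]


# Made-up numbers in the format `sysctl kern.openfiles kern.maxfiles` prints.
sample_output = "kern.openfiles: 9000\nkern.maxfiles: 12000\n"

parsed = parse_sysctl(sample_output)
usage = file_table_usage(sample_output)
```

A usage that climbs steadily toward 1.0 while the link state is flapping would be consistent with a leak, although on its own it does not identify which process is holding the descriptors.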



  • Hmm, maybe some low-level operation is missing from the driver, which is causing the loop in the handlers and making the problem even bigger. That is most definitely solved in 2.2 with the latest stable drivers.

    I think we will stick with your recommendation and switch to 2.2 at RC. We do use a lot of advanced features like Traffic Shaping and OSPF routing (Quagga package). Has Traffic Shaping changed in 2.2 in any way that could cause instability? I mean, it's a major OS change, from FreeBSD 8 to 10. Also, I will check whether the Quagga package is available on 2.2.

    Thanks again for the help :)



  • Everything in 2.2 has been well tested at this point and is at least as good as, and in many cases better than, 2.1.x. Release candidate should be coming in less than a week. The same packages are available.



  • Perfect, waiting on RC :D

    Still stable for now. I can't restart the server, but that's not something we do anyway.