Netgate Discussion Forum

    After 2.1.x upgrade, check_reload_status loop on rc.linkup

    Problems Installing or Upgrading pfSense Software
    • A
      ancientz

      Hi!

      I have quite a few pfSense servers, but this one is a little different: it has a 10 Gbit network card.

      Whenever the supported options on my ix0 or ix1 interfaces change, such as adding tso and polling, or anything else, check_reload_status goes into an absurd loop, eating memory as fast as 2 GB a second, going into swap and effectively locking up the environment. The loop is so tight, and probably creates or uses so many files, that kern.maxfiles is hit every second for the users nobody, root, and id 181.

      I've traced this loop to a specific operation: check_reload_status running the rc.linkup script, right after it writes "Linkup starting…" to the log.

      Here is a piece of the log:

      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3005
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3004
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3003
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan10
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3002
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3001
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3000
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3009
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan20
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan21
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan100
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3005
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3004
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3003
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan10
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3002
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3001
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3000
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3009
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan20
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan21
      Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan100
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3005
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3004
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3003
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan10
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3002
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3001
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3000
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3009
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan20
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan21
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan100
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3005
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3004
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3003
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan10
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3002
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3001
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3000
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3009
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan20
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan21
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan100
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3005
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3004
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3003
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan10
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3002
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3001
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3000
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan3009
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan20
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan21
      Nov 9 01:32:17 check_reload_status: Linkup starting ix0_vlan100
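To confirm the loop from a saved copy of the system log, the repeated entries can be counted per interface and per second. This is a minimal Python sketch, assuming the syslog line format shown above; the function name `count_linkups` and the shortened sample lines are illustrative and not part of pfSense.

```python
import re
from collections import Counter

# Matches lines such as:
#   Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan3005
LINKUP_RE = re.compile(
    r"^\s*(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+check_reload_status:\s+"
    r"Linkup starting\s+(\S+)"
)


def count_linkups(lines):
    """Count 'Linkup starting' events per (timestamp, interface) pair."""
    counts = Counter()
    for line in lines:
        match = LINKUP_RE.match(line)
        if match:
            timestamp, interface = match.groups()
            counts[(timestamp, interface)] += 1
    return counts


# Shortened, illustrative sample in the same format as the log excerpt.
sample = [
    "Nov 9 01:32:18 check_reload_status: Linkup starting ix0",
    "Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan10",
    "Nov 9 01:32:18 check_reload_status: Linkup starting ix0_vlan10",
    "Nov 9 01:32:17 check_reload_status: Linkup starting ix0",
]

counts = count_linkups(sample)
# A pair seen more than once within the same second suggests a re-trigger.
repeated = {key: n for key, n in counts.items() if n > 1}
for (timestamp, interface), n in sorted(repeated.items()):
    print(f"{timestamp} {interface}: {n} linkup events")
```

In the excerpt above, ix0_vlan3005 shows up twice at 01:32:18 and three times at 01:32:17, which is the kind of repetition this count makes visible.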

      Some more information:

      Hardware
      Dell R420
      Intel Xeon E5-2430
      8 GB 1333 MHz
      2x HD 500GB SATA2 @ gmirror
      Intel X520DA2 10Gbit
      Intel i350 4P 1Gbit

      ixgbe.ko
      This driver is a slightly modified version 2.4.8, adding support for Dell's X520DA2, and was compiled on pfSense 2.0.3. Since the update to 2.1.4 and 2.1.5 the driver seems to be working fine. This is a production server, and no problem was detected after a few weeks running version 2.1.4.

      There is a catch, though: this driver version has a few problems with VLAN hardware tagging, so I have to disable it for the network card to work with the VLANs I have.

      The command I run on every start is: ifconfig ix0 tso polling -lro -vlanhwfilter -vlanhwtag
      Right after I run the command, check_reload_status triggers rc.linkup and loops.
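For readers less familiar with ifconfig, a capability name on its own enables that capability and a leading minus disables it. The following Python sketch only splits the command quoted above into those two groups; the helper name `split_capabilities` is hypothetical and nothing here talks to the system.

```python
def split_capabilities(args):
    """Split ifconfig capability arguments into enabled and disabled lists."""
    enabled, disabled = [], []
    for arg in args:
        if arg.startswith("-"):
            # A leading minus asks ifconfig to turn the capability off.
            disabled.append(arg[1:])
        else:
            enabled.append(arg)
    return enabled, disabled


command = "ifconfig ix0 tso polling -lro -vlanhwfilter -vlanhwtag"
_, interface, *capabilities = command.split()
enabled, disabled = split_capabilities(capabilities)
print(f"{interface}: enable {enabled}, disable {disabled}")
```

So the startup command turns on TSO and polling while turning off LRO and both VLAN hardware offloads on ix0.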

      loader.conf
      autoboot_delay="3"
      vm.kmem_size="435544320"
      vm.kmem_size_max="535544320"
      kern.ipc.nmbclusters="262144"
      kern.ipc.nmbjumbop="262144"
      hw.ixgbe.num_queues="4"
      ixgbe_load="YES"
      hw.intr_storm_threshold="0"
      hw.usb.no_pf="1"
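The kmem values above are given in bytes, which makes them hard to read at a glance. This is a small Python sketch, assuming the simple `name="value"` format shown above, that parses the settings and prints the two kmem sizes in MiB; the function name `parse_loader_conf` is illustrative only.

```python
import re

# The loader.conf contents as posted above.
LOADER_CONF = '''
autoboot_delay="3"
vm.kmem_size="435544320"
vm.kmem_size_max="535544320"
kern.ipc.nmbclusters="262144"
kern.ipc.nmbjumbop="262144"
hw.ixgbe.num_queues="4"
ixgbe_load="YES"
hw.intr_storm_threshold="0"
hw.usb.no_pf="1"
'''

MIB = 1024 * 1024


def parse_loader_conf(text):
    """Parse name="value" lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.match(r'^([\w.]+)="(.*)"$', line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


settings = parse_loader_conf(LOADER_CONF)
for key in ("vm.kmem_size", "vm.kmem_size_max"):
    print(f"{key} = {int(settings[key]) / MIB:.1f} MiB")
```

This only converts the values for readability; whether they suit this machine is a separate question.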

      I'll try to take a look at the check_reload_status source in pfSense's git repository; maybe I can find something.

      If anyone could also help…

      Thanks everyone! :)

      • C
        cmb

        That sounds like an issue we fixed in 2.2 recently in check_reload_status. Though it seems the changes you made trigger that circumstance in a much worse way than anything in stock releases ever would.

        I'd recommend trying 2.2 without making any changes at all to the driver or custom ifconfig commands; stock everything should be fine there. loader.conf.local should have the two ix changes noted here, and add the sysctl for hw.intr_storm_threshold to System Tunables.
        https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Intel_ix.284.29_Cards

        Do not disable vlanhw* and TSO on 2.2; disabling them should be unnecessary.

        • A
          ancientz

          Oh I see, good. It's good to know it's a known bug and that it's fixed in 2.2.

          Unfortunately it's a production environment and the core server of the network infrastructure, so I can't install a beta version; I will wait for the final version.

          Thanks for the help! :)

          • C
            cmb

            You're significantly better off with stock 2.2 at this point than a prior release with a kernel module network driver compiled for a different base OS. You've destabilized things with the driver change. Though I guess if you can guarantee a system isn't going to lose link, it's fine. The issue I was referring to was in check_reload_status leaking file handles, which is a tiny portion of the overall problem there, and nothing to do with the source of it. We're about to hit release candidate on 2.2. I'd strongly reconsider at that point, as what you're doing there is a house of cards that will quite possibly collapse (more than it already has).
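As a generic illustration of what a file handle leak looks like, here is a small Python sketch. It is not taken from the check_reload_status source; the two handler functions are hypothetical and simply open the null device once per simulated event. The leaky version never closes its descriptor, so each event consumes a new one, while the clean version closes it and the same descriptor number gets reused.

```python
import os


def handle_event_leaky(path):
    """Open a descriptor for an event and never close it (the leak)."""
    return os.open(path, os.O_RDONLY)


def handle_event_clean(path):
    """Open a descriptor for an event and always close it before returning."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Only the number is returned; the descriptor itself is closed below.
        return fd
    finally:
        os.close(fd)


# Simulate five events with each handler.
leaked = [handle_event_leaky(os.devnull) for _ in range(5)]
reused = [handle_event_clean(os.devnull) for _ in range(5)]

print("leaky handler descriptors:", leaked)
print("clean handler descriptors:", reused)

# Release the leaked descriptors so the demonstration cleans up after itself.
for fd in leaked:
    os.close(fd)
```

In a handler that fires many times per second, the leaky pattern would exhaust the available descriptors quickly, which fits the kern.maxfiles symptom described in the first post.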

            • A
              ancientz

              Hmm, maybe some low-level operation is missing from the driver, which is causing the loop in the handlers and making the problem even bigger. That is most definitely solved in 2.2 with the latest stable drivers.

              I think we will stick with your recommendation and switch to 2.2 at RC. We do use a lot of advanced features like Traffic Shaping and OSPF routing (Quagga package). Has Traffic Shaping changed in 2.2 in a way that could cause instabilities? I mean, it's a major OS change from FreeBSD 8 to 10. Also, I will check if the Quagga package is available on 2.2.

              Thanks again for the help :)

              • C
                cmb

                Everything in 2.2 has been well-tested at this point and is at least as good as, and in many cases better than, 2.1.x. Release candidate should be coming in less than a week. The same packages are available.

                • A
                  ancientz

                  Perfect, waiting on RC :D

                  Still stable for now. I can't restart the server, but that's not something we do anyway.

                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.