Netgate Discussion Forum

    WAN interface cycle thought down and up state

      Draiget

      Hello,

      I ran into an interesting issue a couple of days ago: one of my WAN interfaces sometimes goes through an up/down cycle a crazy number of times, which makes the whole system unresponsive (DNS/webConfigurator/routing).

      pfSense version: 2.5.1-RELEASE.
      I have a Mellanox ConnectX-3 EN as the PPPoE connection to my ISP (one of multiple uplinks) via a VLAN (so two interface assignments: one for the VLAN with PPPoE, the second with IPv4 and IPv6 configuration both set to None):

      mlx4_core0@pci0:6:0:0:  class=0x020000 card=0x005515b3 chip=0x100315b3 rev=0x00 hdr=0x00
          vendor     = 'Mellanox Technologies'
          device     = 'MT27500 Family [ConnectX-3]'
          class      = network
          subclass   = ethernet
      

      For some reason it goes down and then repeats the up/down cycle for minutes, and each down/up event triggers check_reload_status and overflows the php-fpm connections (a hell of a lot of "Could not connect" messages are generated per second, which is also a pain):

      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 kernel: mlx4_en: mlxen0: Link Up
      Jun 20 17:46:42 fw1 kernel: mlxen0: link state changed to UP
      Jun 20 17:46:42 fw1 kernel: mlxen0.102: link state changed to UP
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0.102
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
      

      I had driver problems with MLX before (around 2 years ago, on previous releases), so the up/down cycling may again be caused by the card or the driver itself on 2.5.1. But I wonder whether there is any option to limit this behavior of check_reload_status/php-fpm, to prevent a single interface's issues from bricking the whole system? I tried raising the webConfigurator process count, but obviously it didn't help, as the number of these events is significant.

      I would appreciate any ideas, thanks.
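
To put a number on how often the link flaps, the log lines can be tallied per minute. The following is a minimal sketch that uses a few lines copied from the excerpt above; the `summarize` helper and its regular expression are illustrative assumptions, not a pfSense tool.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt in the post.
LOG = """\
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 kernel: mlx4_en: mlxen0: Link Up
Jun 20 17:46:42 fw1 kernel: mlxen0: link state changed to UP
Jun 20 17:46:42 fw1 kernel: mlxen0.102: link state changed to UP
Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0
Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0.102
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
"""

# Matches the kernel's "link state changed" messages only.
LINK_RE = re.compile(
    r"kernel: (?P<ifname>\S+): link state changed to (?P<state>UP|DOWN)"
)
SOCKET_ERR = "Could not connect to /var/run/php-fpm.socket"


def summarize(text):
    """Count link-state changes per (minute, interface, state) and
    count the php-fpm socket errors."""
    transitions = Counter()
    socket_errors = 0
    for line in text.splitlines():
        minute = line[:12]  # e.g. "Jun 20 17:46"
        match = LINK_RE.search(line)
        if match:
            transitions[(minute, match["ifname"], match["state"])] += 1
        elif SOCKET_ERR in line:
            socket_errors += 1
    return transitions, socket_errors


transitions, socket_errors = summarize(LOG)
for (minute, ifname, state), count in sorted(transitions.items()):
    print(f"{minute} {ifname} -> {state}: {count}")
print("php-fpm socket errors:", socket_errors)
```

Run against the full system log, a summary like this would show whether the socket errors track the number of link transitions or keep going after the link has settled.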

        Gertjan @Draiget

        @draiget

        Easiest solution: use another NIC.
        Or search the web for issues with FreeBSD 12.2 / Mellanox drivers, and see whether someone has found a solution.

        No "help me" PM's please. Use the forum, the community will thank you.
        Edit : and where are the logs ??

          Draiget @Gertjan

          @gertjan
          Fair point about Mellanox; I'll try an Intel one. But this pfSense behavior doesn't look good: so if a NIC has problems, which may occur once in a while, the whole firewall will go offline "just because"? I believe we could rate-limit such infinite loops in check_reload_status or whatever calls that function.
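
The rate-limiting idea can be illustrated with a simple hold-down timer per interface. This is only a sketch of the general technique: `LinkEventDebouncer` and its parameters are hypothetical names invented for illustration, and nothing here reflects how check_reload_status is actually implemented.

```python
import time


class LinkEventDebouncer:
    """Let one link event per interface through, then drop further events
    for that interface until a hold-down window has passed.

    Hypothetical helper for illustration; not part of pfSense.
    """

    def __init__(self, hold_seconds=5.0, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self._last_fired = {}  # interface name -> time the handler last ran

    def should_fire(self, ifname):
        """Return True if the handler may run for this interface now."""
        now = self.clock()
        last = self._last_fired.get(ifname)
        if last is not None and now - last < self.hold_seconds:
            return False  # still inside the hold-down window: drop the event
        self._last_fired[ifname] = now
        return True


# Simulate a burst of events using a fake clock instead of real sleeping.
fake_now = [0.0]
debouncer = LinkEventDebouncer(hold_seconds=5.0, clock=lambda: fake_now[0])
results = []
for t in (0.0, 0.1, 0.2, 4.9, 5.0, 5.1):
    fake_now[0] = t
    results.append(debouncer.should_fire("mlxen0"))
print(results)  # [True, False, False, False, True, False]
```

A leading-edge filter like this drops the later events in a burst, so a real implementation would also have to re-check the final link state once the window expires. Otherwise the system could be left configured for a state the interface is no longer in.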

            Gertjan @Draiget

            @draiget

            Hummm.

            Imagine this: your "mlxen" UP and DOWN bouncing has nothing to do with "Could not connect to /var/run/php-fpm.socket".
            The latter is a 'socket file', created by the PHP process and used by, among others, nginx (the GUI) so it can 'use and speak' PHP.

            The PHP (php-fpm) process should be running since system boot.
            It (nginx, php-fpm) might get restarted when something happens with an interface, like a link going from DOWN to UP, but these are rather rare events.

            I guess these

            Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket

            will be gone as soon as you use NICs that work.

            @draiget said in WAN interface cycle thought down and up state:

            so if a NIC has problems, which may occur once in a while, the whole firewall will go offline "just because"?

            Like a car. Remove just one wheel (out of 4 or more) while speeding on the highway.
            This WILL influence your driving comfort.

            edit :

            Another - better ;) - example :

            A switch copes far more easily with you removing a cable or putting one back in: a switch does not contain 'programs' but shift, compare and lookup registers. They get reset, set, flushed, whatever, during a clock cycle of the switch.
            A software router (which pfSense is) is another beast: a huge bunch of processes 'have to know' that an interface went down or came back. This often means a process is restarted with the new situation as its initial parameters.
            Thus a very good reason to stop interfaces from flapping.

              stephenw10 Netgate Administrator

              @draiget said in WAN interface cycle thought down and up state:

              mlx4_en

              Mmm, I would definitely try a different NIC first. I have an older Mellanox card that initially seemed promising but it always behaved strangely. There's a lot going on with those cards. It could be a firmware or firmware config issue even.

              Steve

                Draiget @Gertjan

                @gertjan said in WAN interface cycle thought down and up state:

                Easiest solution: use another NIC.

                Which NIC will work fine as WAN?
                I have an Intel X520-DA2, but it doesn't work either (unsupported SFP; boot options have no effect on it).

                  stephenw10 Netgate Administrator

                  Doesn't work in what way? What are you connecting to it?

                  I would expect that NIC to work fine.

                  Steve

                    Draiget @stephenw10

                    @stephenw10 said in WAN interface cycle thought down and up state:

                    Doesn't work in what way? What are you connecting to it?

                    I would expect that NIC to work fine.

                    Steve

                    I'm not sure it's supposed to work the way I use it, but I use it as the WAN for an ISP uplink (no more than 500 meters to the closest switch). A D-Link media converter and the MLX worked fine that way before. I see this is a -DA card, which is probably meant only for SAN connections.
                    Actually, I only have these problems with the MLX now, when it's pretty hot outside; last year it was fine. It just can't handle up/down glitches properly, whereas the Intel one just stays silent.

                    From dmesg it looks fine, and the interfaces are visible in both the UI and ifconfig:

                    ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xecc0-0xecdf mem 0xdf300000-0xdf37ffff,0xdf2f8000-0xdf2fbfff irq 38 at device 0.0 on pci6
                    ix0: Using MSI-X interrupts with 9 vectors
                    ix0: Ethernet address: 90:e2:ba:74:96:5c
                    ix0: PCI Express Bus: Speed 5.0GT/s Width x8
                    ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xece0-0xecff mem 0xdf380000-0xdf3fffff,0xdf2fc000-0xdf2fffff irq 45 at device 0.1 on pci6
                    ix1: Using MSI-X interrupts with 9 vectors
                    ix1: Ethernet address: 90:e2:ba:74:96:5d
                    ix1: PCI Express Bus: Speed 5.0GT/s Width x8
                    

                    But it stays in the "no carrier" state, maybe because it needs different SFP modules.

                      stephenw10 Netgate Administrator

                      What module are you trying to use?

                      Does it show the module is present in: ifconfig -vvvm ix0

                        Draiget @stephenw10

                        @stephenw10 said in WAN interface cycle thought down and up state:

                        ifconfig -vvvm ix0

                        ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
                                description: WAN_IX0
                                options=e503bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
                                capabilities=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
                                ether 90:e2:ba:74:96:5c
                                inet6 fe80::92e2:baff:fe74:965c%ix0 prefixlen 64 scopeid 0x5
                                media: Ethernet autoselect
                                status: no carrier
                                supported media:
                                        media autoselect
                                nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
                                plugged: SFP/SFP+/SFP28 Unknown (SC)
                                vendor: Gateray PN: GR-S1-W313S-D SN: W19090202129 DATE: 2019-09-03
                                module temperature: 53.00 C Voltage: 3.30 Volts
                                RX: 0.04 mW (-13.80 dBm) TX: 0.15 mW (-8.02 dBm)
                        
                                SFF8472 DUMP (0xA0 0..127 range):
                                03 04 01 00 00 00 00 12 00 01 01 01 0D 00 03 1E
                                00 00 00 00 47 61 74 65 72 61 79 20 20 20 20 20
                                20 20 20 20 00 00 00 00 47 52 2D 53 31 2D 57 33
                                31 33 53 2D 44 20 20 20 31 2E 30 20 05 1E 00 93
                                00 1A 00 00 57 31 39 30 39 30 32 30 32 31 32 39
                                20 20 20 20 31 39 30 39 30 33 20 20 68 F0 01 F3
                                2D 00 11 FB 5D 59 65 F4 D2 C7 92 AC 1A 76 D5 93
                                78 65 66 00 00 00 00 00 00 00 00 00 A1 AB DE E6
                        
                          stephenw10 Netgate Administrator

                          Ok, well, good news: It allows the NIC to attach. It can talk to the module. The module sees incoming signal.

                          Bad news: It doesn't offer any fixed link speeds and that looks like a 1G module. It's common for an ix card to require being set to a fixed 1G in order to link at 1G.
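
The "looks like a 1G module" reading can be cross-checked against the SFF8472 DUMP posted earlier in the thread. The sketch below decodes a few fields, assuming the standard SFF-8472 layout of the 0xA0 page (byte 12 as nominal signalling rate in units of 100 MBd, bytes 20..35 as vendor name, bytes 60..61 as wavelength in nm, and so on). The `ascii_field` helper is an illustrative name, and the offsets should be verified against the specification.

```python
# Bytes 0..127 of the 0xA0 page, copied from the "SFF8472 DUMP" in the thread.
DUMP = """
03 04 01 00 00 00 00 12 00 01 01 01 0D 00 03 1E
00 00 00 00 47 61 74 65 72 61 79 20 20 20 20 20
20 20 20 20 00 00 00 00 47 52 2D 53 31 2D 57 33
31 33 53 2D 44 20 20 20 31 2E 30 20 05 1E 00 93
00 1A 00 00 57 31 39 30 39 30 32 30 32 31 32 39
20 20 20 20 31 39 30 39 30 33 20 20 68 F0 01 F3
2D 00 11 FB 5D 59 65 F4 D2 C7 92 AC 1A 76 D5 93
78 65 66 00 00 00 00 00 00 00 00 00 A1 AB DE E6
"""

# Strip all whitespace before converting the hex text to bytes.
data = bytes.fromhex("".join(DUMP.split()))


def ascii_field(raw, start, end):
    """Decode a space-padded ASCII field occupying raw[start:end]."""
    return raw[start:end].decode("ascii").strip()


info = {
    "identifier": data[0],                # 0x03 is the SFP family
    "connector": data[2],                 # 0x01 is SC
    "compliance_10g": data[3],            # 10G Ethernet compliance bits
    "compliance_1g": data[6],             # 1G Ethernet compliance bits
    "nominal_rate_mbd": data[12] * 100,   # byte 12 is in units of 100 MBd
    "smf_length_m": data[15] * 100,       # byte 15 is in units of 100 m
    "vendor": ascii_field(data, 20, 36),
    "part_number": ascii_field(data, 40, 56),
    "wavelength_nm": int.from_bytes(data[60:62], "big"),
    "serial": ascii_field(data, 68, 84),
    "date_code": ascii_field(data, 84, 92),
}

for key, value in info.items():
    print(f"{key}: {value}")
```

Under that layout the nominal rate decodes to 1300 MBd at 1310 nm, which fits a 1G-class single-mode optic. Both Ethernet compliance bytes are zero, which may be why ifconfig reports the module type as "Unknown".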

                          The only option you may have there is to set the available advertised speeds to 1G only:
                          Create the file /boot/loader.conf.local
                          Add to it:

                          hw.ix.advertise_speed=2
                          

                          Reboot. Then check sysctl -a | grep advertise_speed

                          It's not always effective though. For example:

                          [21.05-RELEASE][admin@7100.stevew.lan]/root: sysctl -a | grep advertise_speed
                          hw.ix.advertise_speed: 2
                          dev.ix.3.advertise_speed: 0
                          dev.ix.2.advertise_speed: 0
                          dev.ix.1.advertise_speed: 0
                          dev.ix.0.advertise_speed: 7
                          dev.ixl.1.advertise_speed: 6
                          dev.ixl.0.advertise_speed: 6
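
The advertise_speed values are bitmasks. The sketch below decodes the ix values from the output above, assuming the bit assignments 0x1 = 100M, 0x2 = 1G and 0x4 = 10G for the ix driver. Treat that mapping as an assumption to check against the driver source for your release. The ixl driver uses its own table, so the ixl lines are deliberately skipped.

```python
# Assumed bit assignments for the ix driver's advertise_speed sysctl.
# Verify against the driver source for your FreeBSD/pfSense version.
IX_ADVERTISE_BITS = {
    0x1: "100M",
    0x2: "1G",
    0x4: "10G",
}

SYSCTL_OUTPUT = """\
hw.ix.advertise_speed: 2
dev.ix.3.advertise_speed: 0
dev.ix.2.advertise_speed: 0
dev.ix.1.advertise_speed: 0
dev.ix.0.advertise_speed: 7
dev.ixl.1.advertise_speed: 6
dev.ixl.0.advertise_speed: 6
"""


def decode_advertise_speed(value):
    """Return (list of speed names, leftover bits not in the table)."""
    speeds = [name for bit, name in IX_ADVERTISE_BITS.items() if value & bit]
    unknown = value & ~sum(IX_ADVERTISE_BITS)  # sum of the keys is the known mask
    return speeds, unknown


decoded = {}
for line in SYSCTL_OUTPUT.splitlines():
    name, _, raw = line.partition(": ")
    if name.startswith("dev.ix."):  # per-device ix entries only, not ixl
        decoded[name] = decode_advertise_speed(int(raw))

for name, (speeds, unknown) in sorted(decoded.items()):
    print(name, speeds or "(no bits set)", f"unknown=0x{unknown:x}")
```

Read this way, the loader tunable asked for 1G only (2), while dev.ix.0 still reports 7, so the tunable did not take effect on that port.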
                          

                          Steve

                            Draiget @stephenw10

                            @stephenw10

                            There are some interesting messages in dmesg:

                            ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xecc0-0xecdf mem 0xdf300000-0xdf37ffff,0xdf2f8000-0xdf2fbfff irq 34 at device 0.0 on pci4
                            ix0: Using MSI-X interrupts with 9 vectors
                            ix0: Ethernet address: 90:e2:ba:74:96:5c
                            ix0: PCI Express Bus: Speed 5.0GT/s Width x4
                            ix0: Advertised speed can only be set on copper or multispeed fiber media types.
                            Setting sysctl dev.ix.0.advertise_speed failed: 22
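
Assuming the trailing 22 in the last dmesg line is a standard errno value, it can be translated with a short sketch like this:

```python
import errno
import os

code = 22  # from "Setting sysctl dev.ix.0.advertise_speed failed: 22"

# errno.errorcode maps the number to its symbolic name;
# os.strerror gives the platform's human-readable message.
name = errno.errorcode[code]
print(name, "-", os.strerror(code))
```

On common Unix-like systems 22 is EINVAL, which would match the driver message just before it: the requested value was rejected for this media type rather than silently ignored.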
                            

                            Looks like it doesn't work well:

                            ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
                                    description: WAN_IX0
                                    options=8500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO>
                                    capabilities=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
                                    ether 90:e2:ba:74:96:5c
                                    inet6 fe80::92e2:baff:fe74:965c%ix0 prefixlen 64 scopeid 0x5
                                    media: Ethernet autoselect
                                    status: no carrier
                                    supported media:
                                            media autoselect
                                    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
                            

                            But with the Mellanox it works fine:

                            mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
                                    description: WAN_MLX0
                                    options=ed03bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
                                    ether f4:52:14:7a:0d:70
                                    inet6 fe80::f652:14ff:fe7a:d70%mlxen0 prefixlen 64 scopeid 0xc
                                    media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
                                    status: active
                                    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
                            

                            I had an issue with the fiber optics; it was fixed yesterday (for a good while, I hope) and the MLX now works without issues. But since I have the Intel one, I think it's better to use it to prevent this up/down stuff in the future :)

                            Any ideas? Maybe patch the driver to use only 1G (and build it locally)?

                              stephenw10 Netgate Administrator

                              If you run ifconfig -vvvm against the Mellanox NIC does it show different media options available?

                              Anything is possible, it's just a small matter of programming. 😉
                              Not something I've seen attempted though.

                              Steve

                                Draiget @stephenw10

                                @stephenw10

                                Yes, there are different options for mlx:

                                mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
                                        description: WAN_MLX0
                                        options=ed03bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
                                        capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
                                        ether f4:52:14:7a:0d:70
                                        inet6 fe80::f652:14ff:fe7a:d70%mlxen0 prefixlen 64 scopeid 0xc
                                        media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
                                        status: active
                                        supported media:
                                                media autoselect
                                                media 40Gbase-CR4 mediaopt full-duplex
                                                media 10Gbase-CX4 mediaopt full-duplex
                                                media 10Gbase-SR mediaopt full-duplex
                                                media 1000baseT mediaopt full-duplex
                                        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
                                
                                  stephenw10 Netgate Administrator

                                  Hmm, not sure why the ix NIC doesn't see it then. 😕
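
The difference between the two NICs can be made explicit by pulling the "supported media" entries out of each ifconfig output posted in the thread. This is a minimal sketch: `supported_media` is an illustrative helper, and the two sample strings are trimmed copies of the outputs above.

```python
def supported_media(ifconfig_output):
    """Return the entries listed under 'supported media:' in ifconfig -m output."""
    media = []
    in_block = False
    for line in ifconfig_output.splitlines():
        stripped = line.strip()
        if stripped == "supported media:":
            in_block = True
        elif in_block and stripped.startswith("media "):
            media.append(stripped[len("media "):])
        elif in_block:
            break  # the first non-media line ends the block
    return media


IX0 = """\
        media: Ethernet autoselect
        status: no carrier
        supported media:
                media autoselect
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
"""

MLXEN0 = """\
        media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
        status: active
        supported media:
                media autoselect
                media 40Gbase-CR4 mediaopt full-duplex
                media 10Gbase-CX4 mediaopt full-duplex
                media 10Gbase-SR mediaopt full-duplex
                media 1000baseT mediaopt full-duplex
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
"""

ix_media = supported_media(IX0)
mlx_media = supported_media(MLXEN0)
only_mlx = [m for m in mlx_media if m not in ix_media]

print("ix0:", ix_media)
print("mlxen0:", mlx_media)
print("offered only by mlxen0:", only_mlx)
```

With this module inserted, ix0 offers nothing but autoselect, while mlxen0 also lists a fixed 1000baseT entry, which is the media it actually linked at.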

                                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.