WAN interface cycles through down and up states

Draiget

Hello,

I ran into an interesting issue a couple of days ago: one of my WAN interfaces sometimes goes through an up/down cycle a crazy number of times, which makes the whole system unresponsive (DNS, webConfigurator, routing).

pfSense version: 2.5.1-RELEASE.
I have a Mellanox ConnectX-3 EN carrying a PPPoE connection to the ISP (one of multiple uplinks) over a VLAN, so there are two interface assignments: one for the VLAN with PPPoE, and a second with the IPv4 and IPv6 configuration types both set to None:

mlx4_core0@pci0:6:0:0:  class=0x020000 card=0x005515b3 chip=0x100315b3 rev=0x00 hdr=0x00
    vendor     = 'Mellanox Technologies'
    device     = 'MT27500 Family [ConnectX-3]'
    class      = network
    subclass   = ethernet

For some reason it goes down and then repeats the up/down cycle for minutes. Each down/up event triggers check_reload_status and overflows the php-fpm connections (a huge number of "Could not connect" messages are generated per second, which is also a pain):

Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 kernel: mlx4_en: mlxen0: Link Up
Jun 20 17:46:42 fw1 kernel: mlxen0: link state changed to UP
Jun 20 17:46:42 fw1 kernel: mlxen0.102: link state changed to UP
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0
Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0.102
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket

I had driver problems with the MLX before (around 2 years ago, on previous releases), so the up/down cycling is probably caused by the card or the driver itself again on 2.5.1. But I wonder whether there is any option to limit this behavior of check_reload_status/php-fpm, to prevent bricking the whole system just because of issues with a single interface? I've tried raising the webConfigurator process count, but obviously it didn't help, as the number of these events is significant.
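To put a number on "significant", the link transitions in a log excerpt like the one above can be counted per interface. The following is only a minimal Python sketch: the regular expression is written against the kernel lines quoted in this post, and the helper name `count_link_changes` is made up for illustration. On a real system the lines would be read from the system log rather than from a hard-coded list.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "Jun 20 17:46:42 fw1 kernel: mlxen0: link state changed to UP"
LINK_RE = re.compile(r"kernel: (\S+): link state changed to (UP|DOWN)")


def count_link_changes(lines):
    """Return a Counter keyed by (interface, new_state)."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Jun 20 17:46:42 fw1 kernel: mlx4_en: mlxen0: Link Up",
    "Jun 20 17:46:42 fw1 kernel: mlxen0: link state changed to UP",
    "Jun 20 17:46:42 fw1 kernel: mlxen0.102: link state changed to UP",
    "Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket",
]

for (iface, state), n in sorted(count_link_changes(sample).items()):
    print(iface, state, n)
```

Run over a full flapping episode, the per-interface totals show how many reload triggers were generated in a given time window.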

I would appreciate any ideas, thanks.

Gertjan

@draiget

The easiest solution: use another NIC.
Or search the web for issues with FreeBSD 12.2 and the Mellanox drivers, and see whether someone has found a solution.

Draiget

@gertjan
That's reasonable for the Mellanox; I'll try an Intel one. But this pfSense behavior doesn't look good: if a NIC has a problem, which may happen once in a while, the whole firewall goes offline "just because"? I believe such infinite loops could be rate-limited in check_reload_status or in whatever calls that function.
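As an illustration of the kind of rate limiting meant here, a hold-down (debounce) approach would only act on an interface once it has been quiet for a while, so a burst of flaps results in a single reload. This is a hypothetical Python sketch, not how check_reload_status actually works; the class name `LinkEventDebouncer` and the 5 second hold-down are assumptions made up for the example.

```python
class LinkEventDebouncer:
    """Collapse bursts of link events into one action per quiet period."""

    def __init__(self, hold_down_seconds):
        self.hold_down = hold_down_seconds
        self.pending = {}  # interface -> (latest state, time of latest event)

    def record(self, iface, state, now):
        """Remember only the most recent state seen for an interface."""
        self.pending[iface] = (state, now)

    def due(self, now):
        """Return and forget interfaces that have been quiet long enough."""
        ready = {
            iface: state
            for iface, (state, seen) in self.pending.items()
            if now - seen >= self.hold_down
        }
        for iface in ready:
            del self.pending[iface]
        return ready


debouncer = LinkEventDebouncer(hold_down_seconds=5.0)
for t, state in [(0.0, "DOWN"), (0.2, "UP"), (0.4, "DOWN"), (0.6, "UP")]:
    debouncer.record("mlxen0", state, now=t)

print(debouncer.due(now=1.0))  # {} : the interface flapped too recently
print(debouncer.due(now=6.0))  # {'mlxen0': 'UP'} : one action for four events
```

The trade-off is latency: a genuine link change is also acted on only after the hold-down period has passed.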

Gertjan

@draiget

Hummm.

Imagine this: your "mlxen" UP and DOWN bouncing has nothing to do with "Could not connect to /var/run/php-fpm.socket".
The latter is a 'socket file', created by the PHP process and used, among others, by nginx (the GUI), so it can 'use and speak' PHP.

The PHP (php-fpm) process should be running since system boot.
They (nginx, php-fpm) might get restarted when something happens with an interface, like a link going from DOWN to UP, but these are rather rare events.

I guess these

Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket

will be gone as soon as you use NICs that work.

@draiget said in WAN interface cycles through down and up states:

so if a NIC has a problem, which may happen once in a while, the whole firewall goes offline "just because"?

Like a car. Remove just one wheel (out of 4 or more) while speeding on the highway.
This WILL influence your driving comfort.

edit :

Another - better ;) - example :

A switch copes far more easily with a cable being removed or plugged back in: a switch does not contain 'programs', just shift, compare and lookup registers. They get reset, set, flushed, or whatever within a clock cycle of the switch.
A software router (such as pfSense) is another beast: a huge bunch of processes 'have to know' that an interface went down or came back. This often means a process is restarted with the new situation as its initial parameters.
Thus there are very good reasons to stop interfaces from flapping.

stephenw10

@draiget said in WAN interface cycles through down and up states:

mlx4_en

Mmm, I would definitely try a different NIC first. I have an older Mellanox card that initially seemed promising but it always behaved strangely. There's a lot going on with those cards. It could be a firmware or firmware config issue even.

Steve

Draiget

@gertjan said in WAN interface cycles through down and up states:

The easiest solution: use another NIC.

Which NIC will work fine as WAN?
I have an Intel X520-DA2, but it doesn't work either (unsupported SFP; the boot options have no effect on it).

stephenw10

Doesn't work in what way? What are you connecting to it?

I would expect that NIC to work fine.

Steve

Draiget

@stephenw10 said in WAN interface cycles through down and up states:

Doesn't work in what way? What are you connecting to it?

I would expect that NIC to work fine.

Steve

I'm not sure it should work the way I use it, but I use it as a WAN for the ISP uplink (no more than 500 meters to the closest switch). A D-Link media converter and the MLX worked fine that way before. I see this is a -DA card, which is probably only meant for a SAN connection.
Actually, I only have these problems with the MLX now that it is pretty hot outside; last year it was fine. It just isn't able to handle up/down glitches properly, whereas the Intel one just stays silent.

From dmesg it seems fine, and the interfaces are visible in both the UI and ifconfig:

ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xecc0-0xecdf mem 0xdf300000-0xdf37ffff,0xdf2f8000-0xdf2fbfff irq 38 at device 0.0 on pci6
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: 90:e2:ba:74:96:5c
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xece0-0xecff mem 0xdf380000-0xdf3fffff,0xdf2fc000-0xdf2fffff irq 45 at device 0.1 on pci6
ix1: Using MSI-X interrupts with 9 vectors
ix1: Ethernet address: 90:e2:ba:74:96:5d
ix1: PCI Express Bus: Speed 5.0GT/s Width x8

But it stays in "no carrier" state, maybe because it needs different SFP modules.

stephenw10

What module are you trying to use?

Does it show the module is present in: ifconfig -vvvm ix0

Draiget

@stephenw10 said in WAN interface cycles through down and up states:

ifconfig -vvvm ix0

ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: WAN_IX0
        options=e503bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        capabilities=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:74:96:5c
        inet6 fe80::92e2:baff:fe74:965c%ix0 prefixlen 64 scopeid 0x5
        media: Ethernet autoselect
        status: no carrier
        supported media:
                media autoselect
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        plugged: SFP/SFP+/SFP28 Unknown (SC)
        vendor: Gateray PN: GR-S1-W313S-D SN: W19090202129 DATE: 2019-09-03
        module temperature: 53.00 C Voltage: 3.30 Volts
        RX: 0.04 mW (-13.80 dBm) TX: 0.15 mW (-8.02 dBm)

        SFF8472 DUMP (0xA0 0..127 range):
        03 04 01 00 00 00 00 12 00 01 01 01 0D 00 03 1E
        00 00 00 00 47 61 74 65 72 61 79 20 20 20 20 20
        20 20 20 20 00 00 00 00 47 52 2D 53 31 2D 57 33
        31 33 53 2D 44 20 20 20 31 2E 30 20 05 1E 00 93
        00 1A 00 00 57 31 39 30 39 30 32 30 32 31 32 39
        20 20 20 20 31 39 30 39 30 33 20 20 68 F0 01 F3
        2D 00 11 FB 5D 59 65 F4 D2 C7 92 AC 1A 76 D5 93
        78 65 66 00 00 00 00 00 00 00 00 00 A1 AB DE E6
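The hex dump above can also be decoded by hand, which helps when judging what kind of module this is. The following is a rough Python sketch that assumes the standard SFF-8472 layout of the A0h page (identifier at byte 0, connector at byte 2, nominal signalling rate at byte 12 in units of 100 MBd, vendor name at bytes 20-35, part number at bytes 40-55, wavelength at bytes 60-61, base checksum at byte 63). The function name `decode_sff8472` is made up for illustration and only a handful of fields are covered.

```python
# Bytes copied from the "SFF8472 DUMP (0xA0 0..127 range)" output above.
DUMP = """
03 04 01 00 00 00 00 12 00 01 01 01 0D 00 03 1E
00 00 00 00 47 61 74 65 72 61 79 20 20 20 20 20
20 20 20 20 00 00 00 00 47 52 2D 53 31 2D 57 33
31 33 53 2D 44 20 20 20 31 2E 30 20 05 1E 00 93
00 1A 00 00 57 31 39 30 39 30 32 30 32 31 32 39
20 20 20 20 31 39 30 39 30 33 20 20 68 F0 01 F3
2D 00 11 FB 5D 59 65 F4 D2 C7 92 AC 1A 76 D5 93
78 65 66 00 00 00 00 00 00 00 00 00 A1 AB DE E6
"""


def decode_sff8472(data):
    """Decode a few base ID fields, assuming the SFF-8472 A0h layout."""

    def text(start, end):
        return data[start:end].decode("ascii").strip()

    return {
        "identifier": data[0],                # 0x03 = SFP/SFP+
        "connector": data[2],                 # 0x01 = SC
        "nominal_rate_mbd": data[12] * 100,   # unit is 100 MBd
        "length_smf_km": data[14],
        "vendor": text(20, 36),
        "part_number": text(40, 56),
        "revision": text(56, 60),
        "wavelength_nm": int.from_bytes(data[60:62], "big"),
        "serial": text(68, 84),
        "date_code": text(84, 92),
        # CC_BASE is the low byte of the sum of bytes 0..62.
        "cc_base_ok": (sum(data[0:63]) & 0xFF) == data[63],
    }


data = bytes.fromhex("".join(DUMP.split()))
for field, value in decode_sff8472(data).items():
    print(f"{field}: {value}")
```

Byte 12 is 0x0D here, so the module reports a nominal rate of 1300 MBd, which is what a gigabit optic typically reports. A 10G optic would be expected to report a value around 0x67 (10300 MBd) in that byte.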

stephenw10

Ok, well, good news: It allows the NIC to attach. It can talk to the module. The module sees incoming signal.

Bad news: It doesn't offer any fixed link speeds, and that looks like a 1G module. It's common for an ix card to require being set to a fixed 1G speed in order to link at 1G.

The only option you may have there is to set the available advertised speeds to 1G only:
Create the file /boot/loader.conf.local
Add to it:

hw.ix.advertise_speed=2

Reboot. Then check sysctl -a | grep advertise_speed

It's not always effective though. For example:

[21.05-RELEASE][admin@7100.stevew.lan]/root: sysctl -a | grep advertise_speed
hw.ix.advertise_speed: 2
dev.ix.3.advertise_speed: 0
dev.ix.2.advertise_speed: 0
dev.ix.1.advertise_speed: 0
dev.ix.0.advertise_speed: 7
dev.ixl.1.advertise_speed: 6
dev.ixl.0.advertise_speed: 6
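For reading these values: advertise_speed is a bitmask. The following is a small Python sketch of decoding it, assuming the bit meanings as I recall them from the ix driver's sysctl description (0x1 = 100M, 0x2 = 1G, 0x4 = 10G, 0x8 = 10M); checking them on the actual system with `sysctl -d dev.ix.0.advertise_speed` is advisable. The helper name `decode_ix_advertise_speed` is made up, and the ixl values use that driver's own mask, which is not covered here.

```python
# Assumed bit meanings for the ix driver's advertise_speed sysctl.
# Verify with: sysctl -d dev.ix.0.advertise_speed
IX_ADVERTISE_BITS = {
    0x1: "100M",
    0x2: "1G",
    0x4: "10G",
    0x8: "10M",
}


def decode_ix_advertise_speed(value):
    """Return the speed names whose bits are set in the mask."""
    return [name for bit, name in sorted(IX_ADVERTISE_BITS.items()) if value & bit]


for mask in (0, 2, 7):
    print(mask, decode_ix_advertise_speed(mask))
```

Under that assumption, the 2 set in loader.conf.local means "1G only", and the 7 on dev.ix.0 means 100M, 1G and 10G are all still advertised, so the tunable did not take effect on that port.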

Steve

Draiget

@stephenw10

There are some interesting messages in dmesg:

ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xecc0-0xecdf mem 0xdf300000-0xdf37ffff,0xdf2f8000-0xdf2fbfff irq 34 at device 0.0 on pci4
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: 90:e2:ba:74:96:5c
ix0: PCI Express Bus: Speed 5.0GT/s Width x4
ix0: Advertised speed can only be set on copper or multispeed fiber media types.
Setting sysctl dev.ix.0.advertise_speed failed: 22

Looks like it doesn't work well:

ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: WAN_IX0
        options=8500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO>
        capabilities=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:74:96:5c
        inet6 fe80::92e2:baff:fe74:965c%ix0 prefixlen 64 scopeid 0x5
        media: Ethernet autoselect
        status: no carrier
        supported media:
                media autoselect
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

But with the Mellanox it works fine:

mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: WAN_MLX0
        options=ed03bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether f4:52:14:7a:0d:70
        inet6 fe80::f652:14ff:fe7a:d70%mlxen0 prefixlen 64 scopeid 0xc
        media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

I had an issue with the fiber optics; it was fixed yesterday (for a long time, I hope) and the MLX now works without issues. But since I have the Intel card, I think it would be better to use it to prevent such up/down trouble in the future :)

Any ideas? Maybe patch the driver to use only 1G (and build it locally)?

stephenw10

If you run ifconfig -vvvm against the Mellanox NIC does it show different media options available?

Anything is possible; it's just a small matter of programming.
Not something I've seen attempted though.

Steve

Draiget

@stephenw10

Yes, there are different options for mlx:

mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: WAN_MLX0
        options=ed03bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether f4:52:14:7a:0d:70
        inet6 fe80::f652:14ff:fe7a:d70%mlxen0 prefixlen 64 scopeid 0xc
        media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
        status: active
        supported media:
                media autoselect
                media 40Gbase-CR4 mediaopt full-duplex
                media 10Gbase-CX4 mediaopt full-duplex
                media 10Gbase-SR mediaopt full-duplex
                media 1000baseT mediaopt full-duplex
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

stephenw10

Hmm, not sure why the ix NIC doesn't see it then.