WAN interface cycle thought down and up state
-
Hello,
I ran into an interesting issue a couple of days ago: one of my WAN interfaces sometimes goes through an up/down cycle a huge number of times, which makes the whole system unresponsive (DNS, webConfigurator, routing).
pfSense version: 2.5.1-RELEASE.
I have a Mellanox ConnectX-3 EN used for a PPPoE connection to the ISP (one of multiple uplinks) over a VLAN (so two interface assignments: one for the VLAN with PPPoE, and a second with IPv4 and IPv6 configuration set to None):
mlx4_core0@pci0:6:0:0: class=0x020000 card=0x005515b3 chip=0x100315b3 rev=0x00 hdr=0x00
    vendor = 'Mellanox Technologies'
    device = 'MT27500 Family [ConnectX-3]'
    class = network
    subclass = ethernet
For some reason it goes down and then repeats the up/down cycle for several minutes, and each down/up event triggers check_reload_status and overflows the php-fpm connections (a huge number of "Could not connect" messages is generated per second, which is also a pain):
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 kernel: mlx4_en: mlxen0: Link Up
Jun 20 17:46:42 fw1 kernel: mlxen0: link state changed to UP
Jun 20 17:46:42 fw1 kernel: mlxen0.102: link state changed to UP
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0
Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0.102
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
I had driver problems with MLX before (around 2 years ago, on previous releases), so the up/down cycling may again be caused by the card or the driver itself on 2.5.1. But I wonder whether there is any option to limit this behavior of check_reload_status/php-fpm, to prevent a single interface issue from bricking the whole system? I tried raising the webConfigurator process count, but unsurprisingly it didn't help, as the number of these events is significant.
I would appreciate any ideas, thanks.
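To put a number on how bad the flapping is, the system log can be summarized per interface. Below is a minimal Python sketch, assuming the log lines have the same shape as the excerpt above; the `summarize` helper and the sample `LOG` string are illustrative only and are not part of pfSense.

```python
import re
from collections import Counter

# A few lines copied from the excerpt above; in practice, read the real log file.
LOG = """\
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 kernel: mlx4_en: mlxen0: Link Up
Jun 20 17:46:42 fw1 kernel: mlxen0: link state changed to UP
Jun 20 17:46:42 fw1 kernel: mlxen0.102: link state changed to UP
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
Jun 20 17:46:42 fw1 check_reload_status[60338]: Linkup starting mlxen0
"""

# Matches the kernel's "link state changed" messages and captures interface and state.
LINK_RE = re.compile(r"kernel: (\S+): link state changed to (UP|DOWN)")
SOCKET_ERR = "Could not connect to /var/run/php-fpm.socket"


def summarize(log_text):
    """Count link state changes per (interface, state) and php-fpm socket errors."""
    link_events = Counter()
    socket_errors = 0
    for line in log_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            link_events[(match.group(1), match.group(2))] += 1
        elif SOCKET_ERR in line:
            socket_errors += 1
    return link_events, socket_errors


link_events, socket_errors = summarize(LOG)
for (iface, state), count in sorted(link_events.items()):
    print(f"{iface} -> {state}: {count}")
print(f"php-fpm socket errors: {socket_errors}")
```

Running it over a longer stretch of the log shows how many state changes each interface went through and how many socket errors accompanied them.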
-
Most easy solution : use another NIC.
Or search the web for issues with FreeBSD 12.2 and the Mellanox drivers, and see whether someone has found a solution.
-
@gertjan
Reasonable with Mellanox, I'll try an Intel one. But such pfSense behavior does not look good: so in case a NIC has problems, which may occur once in a while, the whole firewall will go offline "just because"? I believe we could rate-limit such infinite loops in check_reload_status or whatever calls that function.
-
Hummm.
Imagine this: your "mlxen" UP and DOWN bouncing has nothing to do with "Could not connect to /var/run/php-fpm.socket".
The latter is a 'socket file', created by the PHP process and used, among others, by nginx (the GUI) so that it can 'use and speak' PHP.
The PHP (php-fpm) process should be running since system boot.
It (nginx, php-fpm) might get restarted when something happens with an interface, like a link going from DOWN to UP, but these are rather rare events.
I guess these
Jun 20 17:46:42 fw1 check_reload_status[60338]: Could not connect to /var/run/php-fpm.socket
will be gone as soon as you use NICs that work.
@draiget said in WAN interface cycle thought down and up state:
so in case a NIC has problems, which may occur once in a while, the whole firewall will go offline "just because"?
Like a car. Remove just one wheel (out of 4 or more) while speeding on the highway.
This WILL influence your driving comfort.
edit:
Another - better ;) - example :
A switch accepts far more easily the fact that you remove a cable or put one back in: a switch does not contain 'programs' but shift, compare and lookup registers. They get reset, set or flushed, whatever is needed, during a clock cycle of the switch.
A software router (as pfSense is) is another beast: a whole bunch of processes 'have to know' that an interface went down or came back. This often means that a process is restarted with the new situation as its initial parameters.
Thus a very good reason to stop interfaces from flapping.
-
@draiget said in WAN interface cycle thought down and up state:
mlx4_en
Mmm, I would definitely try a different NIC first. I have an older Mellanox card that initially seemed promising but it always behaved strangely. There's a lot going on with those cards. It could be a firmware or firmware config issue even.
Steve
-
@gertjan said in WAN interface cycle thought down and up state:
Most easy solution : use another NIC.
Which NIC will work fine as WAN?
I have an Intel X520-DA2, but it does not work either (unsupported SFP, and the boot options have no effect on it).
-
Doesn't work in what way? What are you connecting to it?
I would expect that NIC to work fine.
Steve
-
@stephenw10 said in WAN interface cycle thought down and up state:
Doesn't work in what way? What are you connecting to it?
I would expect that NIC to work fine.
Steve
I'm not sure it should work the way I use it, but I use it as the WAN for an ISP uplink (no more than 500 meters to the closest switch). A DLink media converter and the MLX card worked fine that way before. I see this is a -DA card, which is probably only meant for a SAN connection.
Actually, I have these problems with the MLX only now, when it is pretty hot outside; last year it was fine. It just isn't able to handle up/down glitches properly, while the Intel one just stays silent.
From dmesg it seems fine, and the interfaces are visible in both the UI and ifconfig:
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xecc0-0xecdf mem 0xdf300000-0xdf37ffff,0xdf2f8000-0xdf2fbfff irq 38 at device 0.0 on pci6
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: 90:e2:ba:74:96:5c
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xece0-0xecff mem 0xdf380000-0xdf3fffff,0xdf2fc000-0xdf2fffff irq 45 at device 0.1 on pci6
ix1: Using MSI-X interrupts with 9 vectors
ix1: Ethernet address: 90:e2:ba:74:96:5d
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
But it stays in "no carrier" mode, maybe because it needs different SFP modules.
-
What module are you trying to use?
Does it show the module is present in:
ifconfig -vvvm ix0
-
@stephenw10 said in WAN interface cycle thought down and up state:
ifconfig -vvvm ix0
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: WAN_IX0
    options=e503bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
    capabilities=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
    ether 90:e2:ba:74:96:5c
    inet6 fe80::92e2:baff:fe74:965c%ix0 prefixlen 64 scopeid 0x5
    media: Ethernet autoselect
    status: no carrier
    supported media:
        media autoselect
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    plugged: SFP/SFP+/SFP28 Unknown (SC)
    vendor: Gateray PN: GR-S1-W313S-D SN: W19090202129 DATE: 2019-09-03
    module temperature: 53.00 C Voltage: 3.30 Volts
    RX: 0.04 mW (-13.80 dBm) TX: 0.15 mW (-8.02 dBm)
    SFF8472 DUMP (0xA0 0..127 range):
    03 04 01 00 00 00 00 12 00 01 01 01 0D 00 03 1E
    00 00 00 00 47 61 74 65 72 61 79 20 20 20 20 20
    20 20 20 20 00 00 00 00 47 52 2D 53 31 2D 57 33
    31 33 53 2D 44 20 20 20 31 2E 30 20 05 1E 00 93
    00 1A 00 00 57 31 39 30 39 30 32 30 32 31 32 39
    20 20 20 20 31 39 30 39 30 33 20 20 68 F0 01 F3
    2D 00 11 FB 5D 59 65 F4 D2 C7 92 AC 1A 76 D5 93
    78 65 66 00 00 00 00 00 00 00 00 00 A1 AB DE E6
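The raw SFF8472 dump above already says what kind of module this is. Here is a minimal Python sketch that decodes a few fields from it, assuming the standard SFF-8472 A0h page layout (byte 12 is the nominal signalling rate in units of 100 MBd, bytes 20-35 the vendor name, 40-55 the part number, 60-61 the wavelength in nm). The helper names are made up for illustration.

```python
# First 96 bytes of the SFF8472 dump (0xA0 page) posted above.
DUMP = """
03 04 01 00 00 00 00 12 00 01 01 01 0D 00 03 1E
00 00 00 00 47 61 74 65 72 61 79 20 20 20 20 20
20 20 20 20 00 00 00 00 47 52 2D 53 31 2D 57 33
31 33 53 2D 44 20 20 20 31 2E 30 20 05 1E 00 93
00 1A 00 00 57 31 39 30 39 30 32 30 32 31 32 39
20 20 20 20 31 39 30 39 30 33 20 20 68 F0 01 F3
"""


def parse_dump(text):
    """Turn whitespace-separated hex bytes into a bytes object."""
    return bytes(int(token, 16) for token in text.split())


def ascii_field(raw, first, last):
    """Decode a space-padded ASCII field spanning bytes first..last inclusive."""
    return raw[first:last + 1].decode("ascii").strip()


def decode_sfp(raw):
    """Extract a handful of identification fields from the A0h page."""
    return {
        "vendor": ascii_field(raw, 20, 35),
        "part_number": ascii_field(raw, 40, 55),
        "serial": ascii_field(raw, 68, 83),
        "date_code": ascii_field(raw, 84, 91),
        "nominal_rate_mbd": raw[12] * 100,  # byte 12 is in units of 100 MBd
        "wavelength_nm": int.from_bytes(raw[60:62], "big"),
    }


info = decode_sfp(parse_dump(DUMP))
for key, value in info.items():
    print(f"{key}: {value}")
```

The vendor, part number, serial and date match what ifconfig printed, and the nominal rate field comes out at 1300 MBd, which points to a 1G-class module rather than a 10G one.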
-
Ok, well, good news: It allows the NIC to attach. It can talk to the module. The module sees incoming signal.
Bad news: It doesn't offer any fixed link speeds, and that looks like a 1G module. It's common for an ix card to require being set to a fixed 1G in order to link at 1G.
The only option you may have there is to set the available advertised speeds to 1G only:
Create the file /boot/loader.conf.local
Add to it:
hw.ix.advertise_speed=2
Reboot. Then check:
sysctl -a | grep advertise_speed
It's not always effective though. For example:
[21.05-RELEASE][admin@7100.stevew.lan]/root: sysctl -a | grep advertise_speed
hw.ix.advertise_speed: 2
dev.ix.3.advertise_speed: 0
dev.ix.2.advertise_speed: 0
dev.ix.1.advertise_speed: 0
dev.ix.0.advertise_speed: 7
dev.ixl.1.advertise_speed: 6
dev.ixl.0.advertise_speed: 6
Steve
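For reading the ix values in that output: advertise_speed is a bitmask. Here is a small Python sketch for decoding it, assuming the flag meanings given in the FreeBSD ix driver source comments (0x1 = 100M, 0x2 = 1G, 0x4 = 10G, with 0 meaning the driver default). The function name is hypothetical and the sketch only covers ix values.

```python
# Assumed bit meanings for the ix driver's advertise_speed sysctl/tunable.
IX_SPEED_FLAGS = {0x1: "100M", 0x2: "1G", 0x4: "10G"}


def decode_ix_advertise_speed(value):
    """Return the list of speeds an ix advertise_speed value stands for."""
    if value == 0:
        return ["driver default"]
    speeds = [name for bit, name in IX_SPEED_FLAGS.items() if value & bit]
    unknown = value & ~sum(IX_SPEED_FLAGS)
    if unknown:
        # Bits this sketch does not know about are reported, not guessed.
        speeds.append(f"unknown bits 0x{unknown:x}")
    return speeds


for value in (2, 7, 0):
    print(value, decode_ix_advertise_speed(value))
```

So the tunable value 2 asks for 1G only, while the 7 shown on dev.ix.0 means 100M, 1G and 10G are all still advertised on that port, i.e. the tunable was not applied there.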
-
There are some interesting messages in dmesg:
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.3.24> port 0xecc0-0xecdf mem 0xdf300000-0xdf37ffff,0xdf2f8000-0xdf2fbfff irq 34 at device 0.0 on pci4
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: 90:e2:ba:74:96:5c
ix0: PCI Express Bus: Speed 5.0GT/s Width x4
ix0: Advertised speed can only be set on copper or multispeed fiber media types.
Setting sysctl dev.ix.0.advertise_speed failed: 22
Looks like it doesn't work well:
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: WAN_IX0
    options=8500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO>
    capabilities=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
    ether 90:e2:ba:74:96:5c
    inet6 fe80::92e2:baff:fe74:965c%ix0 prefixlen 64 scopeid 0x5
    media: Ethernet autoselect
    status: no carrier
    supported media:
        media autoselect
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
But with the Mellanox card it works fine:
mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: WAN_MLX0
    options=ed03bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether f4:52:14:7a:0d:70
    inet6 fe80::f652:14ff:fe7a:d70%mlxen0 prefixlen 64 scopeid 0xc
    media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
I had an issue with the fiber optics; it was fixed yesterday (I hope for a long time), and the MLX now works without issues. But since I have the Intel card, I think it is better to use it, to prevent such up/down trouble in the future :)
Any ideas? Maybe patch the driver to use only 1G (and build it locally)?
-
If you run ifconfig -vvvm against the Mellanox NIC, does it show different media options available?
Anything is possible, it's just a small matter of programming.
Not something I've seen attempted though.
Steve
-
Yes, there are different options for mlx:
mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: WAN_MLX0
    options=ed03bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether f4:52:14:7a:0d:70
    inet6 fe80::f652:14ff:fe7a:d70%mlxen0 prefixlen 64 scopeid 0xc
    media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
    status: active
    supported media:
        media autoselect
        media 40Gbase-CR4 mediaopt full-duplex
        media 10Gbase-CX4 mediaopt full-duplex
        media 10Gbase-SR mediaopt full-duplex
        media 1000baseT mediaopt full-duplex
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
-
Hmm, not sure why the ix NIC doesn't see it then.