LACP doesn't work reliably, "slow" PDU transmission rate suspected
-
Hi
I have problems with the reliability of the LACP LAG of two 1 Gb interfaces between pfSense and a Juniper switch stack.
FW1 (pfSense primary) is connected to switches 1 and 3 via LACP.
FW2 (pfSense secondary) is connected to switches 0 and 2 via LACP.
When switch 0 failed earlier, I would have expected the other leg, connected to switch 2 in the Virtual Chassis, to provide resiliency, but that didn't happen. The secondary firewall went active because it could no longer reach the primary firewall via CARP. At the same time FW2 did have some sort of connectivity, since it could take on traffic to some external IP addresses and cause havoc in the process.
One possible error source I would like to explore is that the Juniper tells me that its LACP partners (the firewalls) use slow transmissions:
{master:1}
stephan@us1-swi> show lacp interfaces ae2 extensive
Aggregated interface: ae2
    LACP state:        Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-2/0/1        Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-2/0/1      Partner    No    No   Yes  Yes  Yes   Yes     Slow    Active
      ge-0/0/1        Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/1      Partner    No    No   Yes  Yes  Yes   Yes     Slow    Active
    LACP protocol:  Receive State  Transmit State  Mux State
      ge-2/0/1            Current   Slow periodic  Collecting distributing
      ge-0/0/1            Current   Slow periodic  Collecting distributing
    LACP info:         Role    System             System      Port    Port  Port
                             priority         identifier  priority  number   key
      ge-2/0/1        Actor       127  7c:25:86:ce:a4:2f       127      51     3
      ge-2/0/1      Partner     32768  ac:1f:6b:66:fb:a2     32768       5   427
      ge-0/0/1        Actor       127  7c:25:86:ce:a4:2f       127       1     3
      ge-0/0/1      Partner     32768  ac:1f:6b:66:fb:a2     32768       3   427
The explanation for fast and slow timeouts is as follows (from Juniper docs):
Timeout—LACP timeout preference. Periodic transmissions of LACP PDUs occur at either a slow or fast transmission rate, depending upon the expressed LACP timeout preference (Slow Timeout or Fast Timeout). In a fast timeout, PDUs are sent every second and in a slow timeout, PDUs are sent every 30 seconds. LACP timeout occurs when 3 consecutive PDUs are missed. If LACP timeout is a fast timeout, the time taken when 3 consecutive PDUs are missed is 3 seconds (3x1 second). If LACP timeout is a slow timeout, the time taken is 90 seconds (3x30 seconds).
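To put numbers on that, here is a minimal sketch of the arithmetic from the quoted Juniper text in plain sh. The function name is made up for illustration; the intervals (1 s and 30 s) and the limit of 3 missed PDUs are taken from the quote, nothing else is assumed.

```shell
#!/bin/sh
# Hypothetical helper: time until an LACP timeout is declared,
# i.e. 3 missed PDUs x the PDU interval (figures from the Juniper docs quote).
lacp_detect_seconds() {
    case "$1" in
        fast) interval=1 ;;    # fast timeout: one PDU per second
        slow) interval=30 ;;   # slow timeout: one PDU every 30 seconds
        *) echo "unknown timeout mode: $1" >&2; return 1 ;;
    esac
    echo $((3 * interval))
}

lacp_detect_seconds fast   # prints 3
lacp_detect_seconds slow   # prints 90
```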
So this sounds to me like the most likely reason why this fails. And sure enough, other LACP links, e.g. to some of my Linux servers, do use Fast:
{master:1}
stephan@us1-swi> show lacp interfaces ae4 extensive
Aggregated interface: ae4
    LACP state:        Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-2/0/4        Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-2/0/4      Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/4 FUP    Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/4 FUP  Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:  Receive State  Transmit State  Mux State
      ge-2/0/4            Current   Fast periodic  Collecting distributing
      ge-0/0/4            Current   Fast periodic  Collecting distributing
    LACP info:         Role    System             System      Port    Port  Port
                             priority         identifier  priority  number   key
      ge-2/0/4        Actor       127  7c:25:86:ce:a4:2f       127      53     5
      ge-2/0/4      Partner     65535  c6:e9:d0:d8:94:79       255       1     9
      ge-0/0/4        Actor       127  7c:25:86:ce:a4:2f       127       3     5
      ge-0/0/4      Partner     65535  c6:e9:d0:d8:94:79       255       2     9
So my question is: How can I configure pfSense to use the fast transmissions for LACP?
Thanks
Stephan
-
The lagg manpage states:
BUGS
There is no way to configure LACP administrative variables, including
system and port priorities. The current implementation always performs
active-mode LACP and uses 0x8000 as system and port priorities.
-
I'm interested in this as well.
In FreeBSD you are supposed to be able to set this with ifconfig.
lacp_fast_timeout     Enable lacp fast-timeout on the interface.
-lacp_fast_timeout    Disable lacp fast-timeout on the interface.
I don't know where that would be done in pfSense though.
-
@stephan23 said in LACP doesn't work reliably, "slow" PDU transmission rate suspected:
So my question is: How can I configure pfSense to use the fast transmissions for LACP?
Maybe you should reword the title of this thread to => "How to configure fast transmission for LACP?"
-
@pete-s
Ah, I didn't see that before. I thought the note meant you couldn't tweak any of the lacp knobs.
I don't have anything to test with right now, but I'd guess it would be:
ifconfig lagg0 lacp_fast_timeout
You could use the shellcmd package to set it on boot.
-
Thanks a lot for pointing out this option, that's excellent news!
What are the chances of making "fast" the default in pfSense? I can't think of a good reason why the less resilient setting should be the default. Or, if backwards compatibility is a concern here, could the option at least be exposed in the WebUI? I've not used the shellcmd package before, but it sounds a bit hacky to me ... would shellcmd work without the risk of race conditions at boot time?
-
The shellcmd package has several options. I'd advise you to test on a non-production system, but it's just an ifconfig command that sets an option on the lagg, so I don't think it's a concern. You can test the command from Diagnostics > Command Prompt to verify it does what you need. As for adding the option to the WebGUI, you could put in a feature request in Redmine.
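For that kind of test, a minimal sketch could look like the following. It assumes the LAGG is named lagg0 (as guessed earlier in the thread) and that the ifconfig on your pfSense version accepts lacp_fast_timeout; I haven't tested it on pfSense. The function name and the DRY_RUN variable are hypothetical, made up for this example.

```shell
#!/bin/sh
# Sketch: enable LACP fast timeout on each lagg interface given as an argument.
# With DRY_RUN=1 the commands are only printed, not executed.
enable_fast_timeout() {
    for lagg in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ifconfig $lagg lacp_fast_timeout"
        else
            ifconfig "$lagg" lacp_fast_timeout
        fi
    done
}

# Preview first; set DRY_RUN=0 (or remove the line) to apply for real.
DRY_RUN=1
enable_fast_timeout lagg0
```

If it takes effect, I'd expect the Partner rows in "show lacp interfaces ae2 extensive" on the Juniper side to report Fast instead of Slow, which would be the thing to check before putting the plain ifconfig line into shellcmd.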
-
OK thanks, I raised a feature request:
https://redmine.pfsense.org/issues/10504