LAGG (LACP) - UniFi Switch (16XG)



  • I recently purchased a UniFi 16XG switch and got it installed in our rack.
    Today our task has been setting up a LAGG on the pfSense side, using LACP to bond our two available 10G SFP+ ports going to the switch.
    After getting everything configured on both sides, the link is active but no packets will pass.

    We have the two available ports on our 4-port Chelsio card set up for a LAGG.
    [Screenshots: LAGG configuration]
    This LAGG has been assigned to an enabled interface.
    [Screenshots: interface assignment]
    The interface is also part of a bridge.
    [Screenshot: bridge configuration]
    The two selected ports on our 16XG have been configured for a LAGG.
    [Screenshot: switch port configuration]
    Physically, port 11 on the UniFi switch is connected to cxgbe0 on pfSense, and port 12 is connected to cxgbe1.

    We are using pfSense version 2.4.4-RELEASE-p1 with UniFi controller build atag_5.8.30_11076.

    I have read in other posts that there may be an issue with one side trying to use static LAG and the other using dynamic LACP, but I'm not sure how to confirm this is the issue on my end or how to remedy it if so.

    What other issues might I be running into here, and are there more settings I'm not aware of that need to be configured?

    Thank you for your time and have a wonderful day!



  • @kklouzal said in LAGG (LACP) - UniFi Switch (16XG):

    I have read in other posts that there may be an issue with one side trying to use static LAG and the other using dynamic LACP...

    It is. It doesn't work.

    ... but I'm not sure how to confirm this is the issue on my end...

    You don't know how the switch is configured?

    ...or how to remedy it if so.

    The remedy would be to configure LACP on the relevant ports on both sides.



  • The switch only gives me three options for the port: Switching, Mirroring, and Aggregate. I have Aggregate selected, and from all the research I can find on Google, Aggregate is in fact dynamic LACP.

    How can I confirm on the pfSense side that we're using the correct settings?

    Various other posts describe people having success by changing strict mode to 0 and cycling the LAGG interfaces.
    I've gone ahead and done this but it doesn't help.
    [Screenshot: strict mode setting]

    I've since put it back to the default value of 1.

    As a side note, if I leave the LAGG interface set up in its current state, after about two hours every machine on the network loses its IP address and DHCP will not supply a new one. Manually setting an IP on a machine restores connectivity, but DHCP still will not supply an address. This occurs for machines with static mappings as well.


  • LAYER 8 Netgate

    We can be confident pfSense is using the right config because based on what you have posted that is what you have told it to do.

    Probably start by posting the output of:

    ifconfig -v lagg0
    ifconfig -v cxgbe0
    ifconfig -v cxgbe1

    Why the bridge? What are the bridge members?



  • Thank you for the reply. I'm positive as well that it's a configuration issue on the UniFi switch holding us up here. The GUI gives few if any options for configuration, so I'll have to dig into the CLI some more for better insight.

    As for the bridge, the ports are connected to the IPMI of other servers, one port is connected to an access point but that will be moved over to one of the four RJ-45 ports on the 16XG after this current issue is sorted out. We have plans to get a managed POE switch in the near future but I think the bridge will stay regardless so those ports can continue being used for IPMI.

    [Screenshot: ifconfig output]


  • LAYER 8 Netgate

    If that lacp strict sysctl did not help, I would change it back to the default of 0 (remove the sysctl)

    Why is there no address on lagg0?

    No idea why you would continue to bridge if you had a good link to a good switch.



  • I had changed it from the default of 1 to 0 and since then removed the loader.conf entry and rebooted the machine.

    lagg0 is the Network Port for UniFi_LAGG, which is set up like the other bridged interfaces: enabled with no address, then added as one of the bridge members. The bridge interface itself has the assigned address.

    I was able to get some details from our UniFi switch about the port 11/12 LAGG.
    [Screenshot: switch details for the port 11/12 LAGG]

    It appears to be transmitting and receiving packets without any errors. However, removing the ethernet cable that allows me to SSH into it results in no longer being able to ping the switch.


  • LAYER 8 Netgate

    OK when I asked what interfaces were part of the bridge, that's what I was asking, what are the bridge member interfaces...

    I have no idea if adding a lagg as a bridge member even works. You're certainly poking your head into dark corners. I would take the lagg out of the bridge, number it, and see if it behaves in a more consistent manner. As soon as you know that works, add it to the bridge and see if that breaks it. Then you know.

    What does ifconfig -v bridge0 show?

    And what port is that ssh client plugged into?

    You're going to have to describe things in much more detail and with more specificity. Saying "the ethernet cable that allows me to ssh into it" tells us nothing we can act on.



  • I apologize, there are only 3 connected cables. Two are SFP+ direct attach cables connecting ports 11 and 12 on the switch to cxgbe0 and cxgbe1 on pfSense; with only these in place, no traffic will pass. The third is an RJ-45 cable from port 16 on the switch into one of the free ports on pfSense, which allows us to communicate with the switch and log in via SSH.

    [Screenshot]

    I'll try removing the LAGG from the bridge and assigning an IP so we can further narrow down the issue.


  • LAYER 8 Netgate

    Not sure why STP is enabled on the lagg0 bridge member. Did you specifically enable that?



  • Yes, however I should leave it disabled since this is the only switch currently in the network topology and having it enabled only makes sense for a larger environment with multiple switches also running the protocol.

    I went ahead and removed UniFi_LAGG from the bridge members, assigned it an IP address of 192.168.2.1, enabled DHCP for UniFi_LAGG, removed our 'third ethernet cable', and rebooted the switch. After the switch came online there were no DHCP leases handed out on the 192.168.2 subnet (I was looking for one given to the switch), so I plugged my laptop directly into port 16 on the switch and was unable to receive an IP address. I then manually assigned my laptop an address of 192.168.2.10 and was unable to ping 192.168.2.1.


  • LAYER 8 Netgate

    OK so look at the VLAN configuration on the switch.

    Did you enable a DHCP server on the lagg interface?

    Did you add firewall rules to the lagg interface?



  • Yes, the DHCP server was enabled and the LAGG interface has a firewall rule to permit all traffic. The switch doesn't have any VLANs set up; it should just switch traffic for the subnet it's directly attached to.
    [Screenshot]

    The fact that our switch is unable to obtain an IP address from DHCP after removing the LAGG interface as a bridge member, directly assigning it an IP address, then enabling DHCP on that subnet tells me there is still a misconfiguration in the LAGG interface somewhere, either on the pfSense or UniFi side?


  • LAYER 8 Netgate

    Packet capture for UDP 67 on the lagg and see what you see regarding DHCP traffic. Zero idea what is required for the switch itself to obtain DHCP there. It would have to be the management VLAN at least I would assume.

    Personally I would be less concerned with that than I would be with clients connected to the switch on the lagg VLAN getting addresses.



  • I unplugged both SFP+ cables making up the LAGG, unplugged the RJ-45 connected to my laptop, then started the packet capture. I then plugged the two SFP+ cables back in, followed by the RJ-45, and waited until my laptop stopped trying to identify the network and finally set itself a 169 IP address. Here are the results:
    [Screenshot: packet capture results]
    On the UniFi_LAGG interface DHCP server I have 1 static lease set up for the UniFi switch to use 192.168.2.80 as its address. The DHCP server lease range is from 192.168.2.100 to 192.168.2.200, and from this I can conclude that the pfSense DHCP server is attempting to assign the two devices their proper addresses. It would seem that communication from the switch back to pfSense is allowed to pass, while traffic from pfSense towards the switch is being blocked?

    I tried this with net.link.lagg.lacp.default_strict_mode set to 1 and 0 but it gave the exact same results, I have again set the value back to the default of 1.


  • LAYER 8 Netgate

    Yes, it looks like the DHCP server on the firewall is receiving the requests and responding.

    Is there some sort of DHCP snooping or protection in the switch that might be in play here?

    Odd that capture never sees the requests coming in. What did you filter on there?



  • I'm not aware of any security measures in place, nor can I find any options in the GUI about security that don't pertain to the UniFi Security Gateway. Even then, those options are all disabled since we don't have that specific device present on our network.

    I am still unable to ping PfSense at 192.168.2.1 from my laptop which is connected through the switch even after we set a static IP of 192.168.2.50, Default Gateway of 192.168.2.1 and Primary DNS of 192.168.2.1, each ping attempt results in destination host unreachable. I suppose this could be due to the switch not having a proper address at this point?

    I'm also curious as to why net.link.lagg.lacp.default_strict_mode at 1 or 0 both yield the same results, shouldn't one of these values make the LAGG completely unusable?

    When I did the packet capture these were my settings:
    [Screenshot: packet capture settings]


  • LAYER 8 Netgate

    @kklouzal said in LAGG (LACP) - UniFi Switch (16XG):

    I am still unable to ping PfSense at 192.168.2.1 from my laptop which is connected through the switch even after we set a static IP of 192.168.1.50,

    Typo? 192.168.2.50?
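
    A quick way to rule out this kind of addressing slip is to check that the client address actually falls inside the LAGG interface's subnet. The following is only a minimal sketch using Python's standard ipaddress module; the /24 mask is an assumption (the thread never states the mask) and same_subnet is a hypothetical helper name.

```python
import ipaddress


def same_subnet(gateway: str, client: str, prefix_len: int = 24) -> bool:
    """Return True when client lies in the network defined by gateway/prefix_len.

    prefix_len=24 is an assumption; the thread does not state the mask in use.
    """
    # strict=False lets us pass a host address (192.168.2.1) rather than
    # the network address (192.168.2.0).
    network = ipaddress.ip_network(f"{gateway}/{prefix_len}", strict=False)
    return ipaddress.ip_address(client) in network


if __name__ == "__main__":
    # Addresses taken from the thread: the gateway on the LAGG interface,
    # the mistyped client address, and the intended one.
    for client in ("192.168.1.50", "192.168.2.50"):
        print(client, same_subnet("192.168.2.1", client))
```

    With a /24 mask, only an address in 192.168.2.x can reach 192.168.2.1 directly, so the mistyped address would explain a "destination host unreachable" on its own.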


  • LAYER 8 Netgate

    Don't filter on the IP address. Capture everything on port 67. You are missing the DHCP broadcasts.



  • Yes it was a typo, whoops :)

    I went ahead and removed the Host Address from the filter; it is still only capturing outgoing requests for some reason.
    [Screenshot: packet capture results]


  • LAYER 8 Netgate

    No, anything sourced on port 68 is from the client. Anything sourced from port 67 is from the server.

    I would download that into wireshark and look at what DHCP is actually doing. Looks like it's working there to me, but can't see much other than two-way traffic.
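
    The port rule above can be written down as a small helper for reading a capture. This is only an illustrative sketch: dhcp_sender and saw_both_directions are hypothetical names, and it ignores DHCP relay agents, which also send from port 67.

```python
DHCP_SERVER_PORT = 67  # source port of packets sent by the DHCP server
DHCP_CLIENT_PORT = 68  # source port of packets sent by a DHCP client


def dhcp_sender(src_port: int) -> str:
    """Classify a captured UDP packet by its source port."""
    if src_port == DHCP_CLIENT_PORT:
        return "client"
    if src_port == DHCP_SERVER_PORT:
        return "server"
    return "not DHCP"


def saw_both_directions(src_ports) -> bool:
    """True when the capture holds at least one client and one server packet."""
    senders = {dhcp_sender(port) for port in src_ports}
    return {"client", "server"} <= senders


if __name__ == "__main__":
    # Source ports as they might appear in a capture filtered on port 67.
    capture = [68, 67, 68, 67]
    print([dhcp_sender(port) for port in capture])
    print("two-way traffic:", saw_both_directions(capture))
```

    If only one of the two senders ever shows up, the capture itself tells you which direction is being lost.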



  • I had some free time today and made a new discovery. I went to set up one of our other production servers to use a LAGG connection to this switch. As soon as I enabled ports 1+2 on the UniFi 16XG (the ports connected to this second server), I noticed the status LEDs flashing very rapidly on all connected ports. I immediately unplugged the single ethernet cable running between the 16XG and pfSense and the rapid flashing stopped. However, all devices behind the switch still had access to the internet, which meant the LAGG between the 16XG and pfSense was working!

    After our previous troubleshooting session here I left the two direct attach cables connected between pfSense and the 16XG and tried many different combinations of configurations, including playing around with VLANs. Ultimately I ended up with the original configuration from when I made this post (LAGG assigned to an enabled interface with no IPv4/IPv6 address set and included in the bridge), with one difference: I switched the LAGG interface from LACP to ROUNDROBIN.

    As soon as I set a different pair of ports on the switch to Aggregate, the original ports I had set up as Aggregate (11+12) came to life!

    At this point I thought maybe the UniFi configuration never got applied and that aggregating a different set of ports somehow finally enabled the original configuration. I then switched from ROUNDROBIN back to LACP, but then again we stopped passing packets between the switch and firewall. I rebooted the switch, rebooted the firewall, and switched net.link.lagg.lacp.default_strict_mode between 0 and 1, rebooting the switch and firewall each time, and still no packets would pass.

    I finally decided to go back to ROUNDROBIN but packets still would not pass! I proceeded to reboot everything again and still no packets passed! Finally I went back to the UniFi controller and once again enabled a different set of ports to be aggregate and once again packets started passing!

    I thought to myself again, maybe after switching back to LACP and trying this trick to enable a different set of ports would kick things off, unfortunately it did not.

    So I'm at a loss here for what's happening. You would think there is an issue with the configuration being applied on the UniFi controller however simply switching from ROUNDROBIN over to LACP then back to ROUNDROBIN forces us to use the trick again to get packets passing.



  • @kklouzal

    Have you tried setting a MAC address on the interface page?

    It seems I have to do this to make my WAN work on my MB8600 modem. I spun my wheels for about an hour until I did so.


  • LAYER 8 Netgate

    I have never personally seen a switch that did not work correctly with pfSense LACP.

    That said, I have never used a Ubiquiti switch.

    I have not seen any reports that it does not work properly.

    Are you still messing around with the bridge here? Possible you created a layer 2 loop.



  • I've tried every troubleshooting step with the LAGG in a bridge and as a standalone interface with appropriate firewall rules to allow traffic and no combination will allow packets to pass using LACP.

    Currently ROUNDROBIN is working fine and in bridge mode however I would prefer to get it setup using LACP.

    It is a bit troubling that simply changing the LAGG Protocol to LACP then back to ROUNDROBIN breaks the system again requiring me to fuss around with the switch and set two random unused ports as aggregate before packets will start passing once more.



  • LAGG (LACP) - UniFi Switch (16XG):

    ifconfig -v lagg0

    Will your UniFi switch work while your pfSense box has a MAC address on that LAGG of 00:00:00:00:00:00?

    Yours-
    0_1547753714433_YourLaggMac.jpg

    Mine-
    0_1547753744730_MyLaggMac.jpg



  • Is there a way to force the Lag ID? I tried directly setting the MAC Address on lagg0 however lag id stayed all zeros.


  • LAYER 8 Netgate

    Yeah, but that might be the switch.



  • @derelict

    On my picture that is the MAC address that I spoofed on my WAN page. My modem is the other end of the LAGG in my case.

    I would assume that his case would be similar.. ??



  • @kklouzal said in LAGG (LACP) - UniFi Switch (16XG):

    Is there a way to force the Lag ID? I tried directly setting the MAC Address on lagg0 however lag id stayed all zeros.

    Make sure the address you are trying does not exist anywhere else in your system..

    The other issue I see is that both your ports appear to have the same MAC address.. Are you sure your ports are not in some kind of switch mode?



  • The only difference I can see between my output and yours from the image is that LAG ID is all 0's for mine and yours is set.

    Both of your ports are using the same MAC Address too
    lag id: -------------- 00-90-7f-88-b4-2e & 02-10-18-3a-41-f1
    laggport: em0 - 00-90-7f-88-b4-2e & 02-10-18-3a-41-f1
    laggport: em1 - 00-90-7f-88-b4-2e & 02-10-18-3a-41-f1

    For your setup, I would assume that 00-90-7f-88-b4-2e is the physical address of em0/em1 on pfSense and 02-10-18-3a-41-f1 is the physical address of your modem. The device on each end has multiple ports on the same adapter, so they are sharing a physical address.
    Mine is doing the same thing except with the Chelsio card and my UniFi 16XG switch

    lag id: ------------------ 00-00-00-00-00-00 - 00-00-00-00-00-00
    laggport: cxgbe0 - 98-be-94-12-d5-e0 - b4-fb-e4-50-50-16
    laggport: cxgbe1 - 98-be-94-12-d5-e0 - b4-fb-e4-50-50-16

    lag id of all 0's is telling me the link is not setting itself up properly. Switching over to ROUNDROBIN allows packets to pass but only after doing that tricky/hacky thing of going over to the switch and setting two unused ports as aggregate, which will kick off the link and get packets moving, then unaggregating those ports.
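
    The all-zeros check can be automated against saved `ifconfig -v lagg0` output. What follows is a rough sketch, not a pfSense tool: lag_id_macs and lacp_negotiated are hypothetical helper names, and the UNNEGOTIATED sample is an assumed rendering of an all-zero lag id, since the real output in this thread was only posted as a screenshot.

```python
import re

# A MAC address written the way ifconfig prints it inside a lag id tuple.
MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}-){5}[0-9A-Fa-f]{2}")
ZERO_MAC = "00-00-00-00-00-00"

# A lag id from a working LACP bundle, in ifconfig -v format.
HEALTHY = """\
lag id: [(8000,00-08-A2-0A-59-3F,016B,0000,0000),
         (0001,CC-4E-24-53-94-00,4E21,0000,0000)]
"""

# Assumed formatting of a lag id that never negotiated (illustrative only).
UNNEGOTIATED = """\
lag id: [(0000,00-00-00-00-00-00,0000,0000,0000),
         (0000,00-00-00-00-00-00,0000,0000,0000)]
"""


def lag_id_macs(ifconfig_text: str):
    """Return (actor_mac, partner_mac) from the 'lag id:' entry, or None."""
    _, found, rest = ifconfig_text.partition("lag id:")
    if not found:
        return None
    macs = MAC_RE.findall(rest)
    if len(macs) < 2:
        return None
    # The first tuple is the local (actor) side, the second the partner.
    return macs[0].upper(), macs[1].upper()


def lacp_negotiated(ifconfig_text: str) -> bool:
    """True when both ends of the lag id carry a non-zero system MAC."""
    macs = lag_id_macs(ifconfig_text)
    return macs is not None and ZERO_MAC not in macs


if __name__ == "__main__":
    print("healthy:", lacp_negotiated(HEALTHY))
    print("unnegotiated:", lacp_negotiated(UNNEGOTIATED))
```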

    I'm leaning more towards the side of something being wrong on the UniFi side of things here. I can't find mention of this problem anywhere else on the netgate forums or unifi forums so in all reality I probably have something misconfigured. There aren't many dials to turn and switches to flip without digging into the CLI on our switch. LACP should just work out of the box after aggregating two ports on the switch side.



  • I'm looking at your 1st picture at the top of the thread here.

    That looks strange to me. Both ports should have an HW: address I believe. And they should be different.


  • LAYER 8 Netgate

    Two ports in LACP have the same MAC address. It's perfectly normal.

    [2.4.4-RELEASE][root@fw]/root: ifconfig -v lagg0
    lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=6500bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
            ether 00:08:a2:0a:59:3f
            inet6 fe80::208:a2ff:fe0a:593f%lagg0 prefixlen 64 scopeid 0xb 
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
            media: Ethernet autoselect
            status: active
            groups: lagg 
            laggproto lacp lagghash l2,l3,l4
            lagg options:
                    flags=10<LACP_STRICT>
                    flowid_shift: 16
            lagg statistics:
                    active ports: 2
                    flapping: 0
            lag id: [(8000,00-08-A2-0A-59-3F,016B,0000,0000),
                     (0001,CC-4E-24-53-94-00,4E21,0000,0000)]
            laggport: igb4 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                    [(8000,00-08-A2-0A-59-3F,016B,8000,0005),
                     (0001,CC-4E-24-53-94-00,4E21,0001,0023)]
            laggport: igb5 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                    [(8000,00-08-A2-0A-59-3F,016B,8000,0006),
                     (0001,CC-4E-24-53-94-00,4E21,0001,0024)]
    [2.4.4-RELEASE][root@fw]/root: ifconfig -v igb4
    igb4: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=6500bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
            ether 00:08:a2:0a:59:3f
            hwaddr 00:08:a2:0a:59:3f
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
            media: Ethernet autoselect (1000baseT <full-duplex>)
            status: active
    [2.4.4-RELEASE][root@fw]/root: ifconfig -v igb5
    igb5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=6500bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
            ether 00:08:a2:0a:59:3f
            hwaddr 00:08:a2:0a:59:40
            nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
            media: Ethernet autoselect (1000baseT <full-duplex>)
            status: active
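
    For comparing other systems against a working bundle like this one, the per-port flags are the quickest thing to check. The sketch below is only an illustration: unhealthy_ports is a hypothetical helper, and the cxgbe0 line in the sample is made up to show a port that has not joined the aggregate.

```python
import re

# Matches e.g. "laggport: igb4 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>".
LAGGPORT_RE = re.compile(r"laggport:\s+(\S+)\s+flags=[0-9a-fA-F]+<([^>]*)>")

# Flags a member port carries once it is passing traffic in the bundle.
REQUIRED = {"ACTIVE", "COLLECTING", "DISTRIBUTING"}

SAMPLE = """\
laggport: igb4 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
laggport: cxgbe0 flags=0<>
"""  # the cxgbe0 line is hypothetical, not taken from the thread


def port_flags(ifconfig_text: str) -> dict:
    """Map each laggport name to the set of flags ifconfig reported for it."""
    return {
        match.group(1): {flag for flag in match.group(2).split(",") if flag}
        for match in LAGGPORT_RE.finditer(ifconfig_text)
    }


def unhealthy_ports(ifconfig_text: str) -> list:
    """Return member ports missing any of the REQUIRED flags."""
    return sorted(
        name
        for name, flags in port_flags(ifconfig_text).items()
        if not REQUIRED <= flags
    )


if __name__ == "__main__":
    print(unhealthy_ports(SAMPLE))
```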
    


  • 0_1547779829396_LAGG.jpg

    The 2e address in my picture here is the MAC I spoofed on my WAN page.


  • LAYER 8 Netgate

    OK?

    Is that em0 or em1?

    What does ifconfig -v show for em0 and em1?



  • I was able to get dynamic 802.3ad LACP working between the switch and a Windows 10 machine with no problems at all. The only log entries I can find related to this issue are these:

    cxgbe0: Interface stopped DISTRIBUTING, possible flapping
    cxgbe1: Interface stopped DISTRIBUTING, possible flapping
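
    To see whether one port flaps more than the other, these messages can be counted per interface from a saved log. This is a minimal sketch under the assumption that every relevant line contains exactly the message text quoted above; count_flaps is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the interface name in front of the flapping message.
FLAP_RE = re.compile(r"(\w+): Interface stopped DISTRIBUTING, possible flapping")

# The two log lines quoted in the thread.
LOG = [
    "cxgbe0: Interface stopped DISTRIBUTING, possible flapping",
    "cxgbe1: Interface stopped DISTRIBUTING, possible flapping",
]


def count_flaps(log_lines) -> Counter:
    """Count 'stopped DISTRIBUTING' messages per interface."""
    counts = Counter()
    for line in log_lines:
        match = FLAP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for interface, total in sorted(count_flaps(LOG).items()):
        print(interface, total)
```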


  • LAYER 8 Netgate

    And what does the switch say?

    I can get LACP running between my Brocade, Cisco, and D-Link switches with no problems at all. If your experience points to pfSense, mine points to your switch.



  • I'm not trying to play a who's-at-fault game here, just trying to pin down the issue so it can be corrected.

    Only option left to try is a different NIC and see if that changes things. There could be something physically wrong with the card or with the FreeBSD driver being used, it's an older T4 Chelsio adapter. I'll try one of the built in Intel adapters and report back.



  • @derelict

    It says the same thing that the picture shows: em0 ends with 26, em1 ends with 27, and my spoofed MAC is 2e.

    em0:
    flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC>
    ether 00:90:7f:88:b4:2e
    hwaddr 00:90:7f:88:b4:26
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active

    em1:
    flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC>
    ether 00:90:7f:88:b4:2e
    hwaddr 00:90:7f:88:b4:27
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
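
    The ether/hwaddr pair is what shows whether a port is running on a MAC other than its burned-in one, whether from spoofing or from LAGG membership. A small sketch of that comparison follows; mac_overridden is a hypothetical helper, and the sample is the em0 output from this post.

```python
import re

# em0 as posted in the thread: ether is the spoofed MAC, hwaddr the burned-in one.
EM0 = """\
em0:
flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 00:90:7f:88:b4:2e
hwaddr 00:90:7f:88:b4:26
status: active
"""


def _field(ifconfig_text: str, name: str):
    """Return the value following a line that starts with name, or None."""
    match = re.search(rf"^\s*{name}\s+(\S+)", ifconfig_text, re.MULTILINE)
    return match.group(1).lower() if match else None


def mac_overridden(ifconfig_text: str) -> bool:
    """True when the active MAC (ether) differs from the hardware MAC (hwaddr)."""
    ether = _field(ifconfig_text, "ether")
    hwaddr = _field(ifconfig_text, "hwaddr")
    if ether is None or hwaddr is None:
        return False  # cannot tell without both fields
    return ether != hwaddr


if __name__ == "__main__":
    print("em0 MAC overridden:", mac_overridden(EM0))
```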



  • So after using two of the integrated Intel ports to set up the LAG, everything is working fine, and in bridge mode too. It was super easy and straightforward: just aggregate the ports on the UniFi 16XG, set up the LAG interface on pfSense as LACP, add it to the bridge, done.

    So this leaves us with the conclusion that something is broken with the Chelsio card when attempting to configure a LAG. I have no way of knowing if it's the physical card at fault or if there is a driver issue here. I'd like to say this is a driver issue, as there have been no troubles with this card thus far. It's also an older T4 adapter; most people will be using T5s and T6s, which may not have any issues.

    Can anyone else verify their T4 card works with LACP? I'd like to get another user's confirmation before spending $500 on a new adapter.