Weird multi tap-tunnel bridge lagg setup needs some help

  • Hi all,

    Due to some odd decisions we sadly could not prevent, we have been tasked with setting up a link between two buildings on a rather large campus. Direct wiring, either copper or optical, is not an option, and because of the cable length on the phone lines, the second building could only get very poor internet or WAN access.

    Despite recommendations from the technicians, a WiFi-based radio link has now been set up. We wanted more advanced radio equipment running on different frequencies and providing more features, but we are now stuck with outdoor WiFi access points that support point-to-point connections on 5 GHz with respectable bandwidth (835 MBit/s) over quite a large distance. Not bad for certain things, but not exactly what we need.
    To get some redundancy, we got two pairs of radios, so we should be able to do some load balancing and have a failover option.

    Luckily the devices can work transparently at Layer 2, so with one radio link we can simply plug into the switches on both sides and transport tagged VLANs without issues. The problems come with the second link.
    Those devices have only one Ethernet port, and its link stays up as long as the radios have power, so that you can reach them for management. There is no dedicated port for data.
    If the wireless link is lost for some reason (which has already happened about once per month…), the Ethernet link stays up, so none of the failover mechanisms our switches offer notice the lost connection, and no failover occurs.
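
    What we are missing is basically an end-to-end liveness check per radio path instead of relying on the Ethernet link state. As a rough sketch of the logic only (not something we actually run; the class name, the threshold of three missed probes and the path names are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class LinkMonitor:
    """Tracks end-to-end probe results for one radio path (hypothetical helper)."""
    fail_threshold: int = 3  # assumed: consecutive missed probes before "down"
    misses: int = 0
    up: bool = True

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the resulting link state."""
        if probe_ok:
            # Any successful probe brings the path back up immediately.
            self.misses = 0
            self.up = True
        else:
            self.misses += 1
            if self.misses >= self.fail_threshold:
                self.up = False
        return self.up


def usable_paths(monitors: dict) -> list:
    """Return the names of all paths currently considered up."""
    return [name for name, mon in monitors.items() if mon.up]
```

    The probe itself (a ping to the far side, a tunnel keepalive, ...) is deliberately left out. The point is only that the decision has to come from traffic that really crosses the radio link.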

    A transparent Layer 2 link is required because, for security reasons, nodes of several clusters have to be placed in both separate buildings. For the cluster failover to work, however, all nodes need to be in the same subnet. For various technical reasons, no other solution can be expected for quite some time.

    Now our idea was to install pfSense on some appliances we had as "leftovers" and try the following, among other things:
    Each radio is connected to one Ethernet port of the appliance, in our case let's say OPT2 and OPT3. Each link has its own small subnet, and an OpenVPN connection is set up across each one, with each pfSense box running the server on one interface and the client on the other. This works fine so far. Loss of the wireless link should bring the tap device down, allowing for a proper failover.

    To provide load balancing, or at least failover, the tap devices are combined into a LAGG, using LACP for now. If I set up IP addresses in the same subnet on both LAGG interfaces, the boxes can ping each other, so success up to this point.

    Now VLANs come into the equation. We set up the same VLANs on OPT1 and on the LAGG interface.
    Each VLAN is assigned to an interface, and the matching VLAN interfaces for OPT1 and LAGG are bridged...

    The idea sounds great, but so far it is not working.

    Doing packet captures on the different interfaces in promiscuous mode while trying to ping between nodes in each location (luckily only a lab setup so far), I can see ARP requests from the local server asking for the remote server's IP on each pfSense box, on OPT1, the VLANs and the LAGG. So bridging seems to work,
    but the ARP requests do not seem to cross the tunnel...
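
    To narrow down where the requests get lost, it might help to check frame by frame whether the ARP requests seen on each interface still carry the expected 802.1Q tag. A minimal sketch, assuming raw Ethernet frames exported from the captures as bytes; `parse_arp` is a hypothetical helper written for this post, not part of pfSense or any capture tool:

```python
import struct

ETH_P_8021Q = 0x8100  # 802.1Q VLAN tag
ETH_P_ARP = 0x0806    # ARP


def parse_arp(frame: bytes):
    """Return (vlan_id, opcode, sender_ip, target_ip) for an IPv4 ARP frame.

    vlan_id is None for untagged frames. Returns None for anything else.
    """
    if len(frame) < 14:
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset = 14
    vlan_id = None
    if ethertype == ETH_P_8021Q:
        if len(frame) < 18:
            return None
        # Tag control information, then the real ethertype.
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan_id = tci & 0x0FFF  # lower 12 bits are the VLAN ID
        offset = 18
    if ethertype != ETH_P_ARP or len(frame) < offset + 28:
        return None
    arp = frame[offset:offset + 28]
    _htype, ptype, hlen, plen, opcode = struct.unpack("!HHBBH", arp[:8])
    if ptype != 0x0800 or hlen != 6 or plen != 4:
        return None  # only Ethernet/IPv4 ARP is handled here
    sender_ip = ".".join(str(b) for b in arp[14:18])
    target_ip = ".".join(str(b) for b in arp[24:28])
    return vlan_id, opcode, sender_ip, target_ip
```

    Opcode 1 is a request and 2 is a reply. Comparing the results for OPT1, the VLAN interfaces, the LAGG and the tap devices on both boxes should show the last interface where the request is still visible, and whether it is tagged there.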

    Previously we tried other setups using GIF, GRE and QinQ in a comparable way, but had no luck, if we even got the configuration that far.

    Does anyone have an idea what we are doing wrong, what we might have overlooked, or why this cannot work this way at all, and how we could achieve it? Ideas that do not stick exactly to what we have tried so far are welcome as well.