Cold Lab Start - Tier 1 Gateway Down, Tier 2 Gateway Up - Cannot Ping Externally



  • We have a lab set up for testing gateway groups on a pair of XG7100s.

    Our Tier 1 gateway is our primary ISP (a fixed Virgin line).

    Our Tier 2 gateway is our secondary ISP (fibre broadband PPPoE).

    We can only connect the lab to the primary ISP during overnight maintenance windows. Usually the lab is only connected to the secondary ISP.

    The primary ISP is on the igb0 WAN1 interface, set up with CARP. The pair remains connected to a WAN switch, so disconnecting the primary ISP doesn't trigger a CARP failover (which is what we want).

    The secondary ISP is on igb1 on the primary firewall only (as we can't CARP that because it's PPPoE). We get that means manual failover in some cases (out of scope of this thread).

    After we boot up the lab:

    All the CARP interfaces (the primary and a set of VLANs) correctly show as MASTER on the primary and BACKUP on the secondary. (NOTE: Just to be clear, this isn't a question about CARP, but I'm trying to ensure I've provided all the info in case it is a factor.)

    The status pages show the Tier 1 gateway (primary ISP) as being down. As expected.

    The status pages show the Tier 2 gateway (secondary ISP) as being up. As expected.

    But the primary firewall cannot reach the internet. We cannot ping anything except the WAN2 default gateway and the DNS server we set for WAN2. We used the IP address of bbc.co.uk to test.

    It's as if the default route is wrong or there is no default route. I'm guessing it's something to do with the system starting up with Tier 1 down but I actually have no idea.

    The IPv4 default gateway is set to the gateway group. If we change it to the WAN2 gateway then we can ping. But if we change it back to the gateway group we can still ping. It's as if the toggle reset something and we are fine from that point onwards. We don't really want to have to do this. It feels as if Tier 2 should just be taking over... But that's maybe because I don't properly understand gateway groups.

    I've uploaded some screenshots of various pages. Ideally I'm looking for some diagnostics I can do to understand what's happening.

    Any tips greatly appreciated.

    System_Routing_Gateways_NoIp.PNG
    System_Routing_GatewayGroups.PNG
    Status_gateways_gateways_noip.png
    Status_gateways_gatewaygroups.png
    Dignostics_IPv4routes_noip.png



  • Did you create a rule to pass traffic to the GW group?
    https://docs.netgate.com/pfsense/en/latest/routing/multi-wan.html#firewall-rules



  • Hi @Zawi ,

    Thank you for reading my post.

    I have re-read the linked document. I'm reading it that the gateway group only needs to be set for internal interfaces? I have checked and we hadn't done that for all the rules so that has been corrected. I can't run another test until Friday but will re-try then.

    But, one thing is confusing me. When I realised we didn't have external access from within our VLANs I ran a ping test on the primary firewall itself. Reading the linked doc again I think maybe that test was invalid? The linked doc says the firewall rules use the routing table if no gateway is set. So, what happens during a ping test from the firewall itself? It uses the routing table? And the routing table is for some reason in a state that won't work because the primary gateway in the group is down?

    I'm sure there is just some key point of gateway groups and the routing table that I haven't got yet that if I did get would make this all obvious.
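    On the routing-table question above: traffic that originates on the firewall itself is not matched by the pass rules on the VLAN interfaces, so it follows the system default route rather than any gateway set on those rules. With the default gateway pointed at a failover gateway group, the documented tier logic is that the lowest-numbered tier with a gateway online should be used. The snippet below is only a minimal model of that selection logic, not pfSense code; the function name and the gateway names `WAN1_GW` and `WAN2_PPPOE` are made up for illustration.

```python
def active_tier(members, is_up):
    """Pick the lowest-numbered tier that has at least one gateway online.

    members: dict mapping gateway name -> tier number (hypothetical names)
    is_up:   dict mapping gateway name -> True/False from gateway monitoring
    Returns (tier, [online gateways in that tier]) or None if all are down.
    """
    for tier in sorted(set(members.values())):
        online = sorted(
            name for name, t in members.items()
            if t == tier and is_up.get(name, False)
        )
        if online:
            return tier, online
    return None


# Lab state after a cold start: Tier 1 down, Tier 2 up.
group = {"WAN1_GW": 1, "WAN2_PPPOE": 2}
status = {"WAN1_GW": False, "WAN2_PPPOE": True}
print(active_tier(group, status))  # (2, ['WAN2_PPPOE'])
```

    If the routing table after boot does not show a default route via the Tier 2 gateway, the behaviour differs from this model, and that difference is the thing to diagnose.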



  • OK - update after our experiments today.

    Quick recap - the lab comes up in a state where the tier 1 gateway in the group is down but the tier 2 gateway is up.

    We are trying to solve two issues: machines in the VLANs can't get external access, and the firewall itself can't get external access.

    We went through all the firewall rules for all our VLANs and ensured that the gateway was set to the gateway group. We had forgotten that step even though the docs clearly say we needed to do it. Once we did that, then our test machines inside our VLANs were able to get out externally. So the gateway on the firewall rule was definitely the cause of that issue and thanks for pointing it out. So that's one problem solved.

    But.

    The primary firewall itself still cannot get out externally. We get that the secondary firewall won't be able to because it's not connected to the secondary line (because it's a broadband line and can't be set up with CARP). But I was expecting that the primary firewall should be able to connect externally. This manifests as the dashboard having no status info and pings from the firewall itself failing.

    And.

    The second we change a VLAN firewall rule to use the gateway group, then the machines in the VLAN can no longer ping machines in other VLANs. I am finding this weird because I can't see any block entries in the logs (filtering by source and destination IP).

    Something about having a gateway group is stopping the firewall itself getting out externally and is also messing with the inter-VLAN traffic. I am not even sure anymore if this is happening because the tier 1 gateway is down. It may just be happening period. I can't find out until I get another maintenance window to route our primary line into the lab.

    Again, I'm sure this is just something I don't understand about gateway groups. I am going to go back again and re-read everything I can find about gateway groups. If anyone can fill in whatever I am missing I would greatly appreciate it as I have been struggling with this for days and days.
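    A possible explanation for the inter-VLAN symptom: a pass rule with a gateway set is a policy-routing rule, and matching traffic is sent to that gateway without consulting the routing table. If the rule matches any destination, traffic for other VLANs is pushed out of the WAN as well, which would also explain why nothing shows up as blocked in the logs. The multi-WAN docs describe bypassing policy routing with a pass rule for local destinations, with no gateway set, placed above the gateway-group rule. The sketch below models that rule order only; it is not pfSense code, and the `Rule` class, the `WAN_GROUP` name and the 10.0.0.0/8 addressing are assumptions for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass
class Rule:
    destination: str          # CIDR the pass rule matches ("0.0.0.0/0" = any)
    gateway: Optional[str]    # None = no gateway set, use the routing table


def next_hop(rules: List[Rule], dst: str) -> str:
    """First matching rule wins, top to bottom, as on an interface tab."""
    for rule in rules:
        if ip_address(dst) in ip_network(rule.destination):
            return rule.gateway or "routing table"
    return "no matching pass rule"


# Only a policy-routing rule: everything, including other VLANs, goes to the group.
policy_only = [Rule("0.0.0.0/0", "WAN_GROUP")]

# Bypass rule for local networks first, then the policy-routing rule.
with_bypass = [Rule("10.0.0.0/8", None), Rule("0.0.0.0/0", "WAN_GROUP")]

print(next_hop(policy_only, "10.0.20.5"))     # WAN_GROUP
print(next_hop(with_bypass, "10.0.20.5"))     # routing table
print(next_hop(with_bypass, "203.0.113.10"))  # WAN_GROUP
```

    In the GUI this would correspond to one extra pass rule per VLAN interface, with the destination set to an alias covering the internal networks and the gateway left at default.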

