Cold Lab Start - Tier 1 Gateway Down, Tier 2 Gateway Up - Cannot Ping Externally
-
We have a lab set up for testing gateway groups on a pair of XG7100s.
Our Tier 1 gateway is our primary ISP (a fixed Virgin line).
Our Tier 2 gateway is our secondary ISP (fibre broadband PPPoE).
We can only connect the lab to the primary ISP during overnight maintenance windows. Usually the lab is only connected to the secondary ISP.
The primary ISP is on the igb0 WAN1 interface, set up with CARP. The pair remains connected to a WAN switch, so disconnecting the primary ISP doesn't cause a CARP failover (as we want).
The secondary ISP is on igb1 on the primary firewall only (as we can't CARP that because it's PPPoE). We get that means manual failover in some cases (out of scope of this thread).
After we boot up the lab:
All the CARP interfaces (the primary and a set of VLANs) correctly show as MASTER on the primary and BACKUP on the secondary. (NOTE: Just want to be clear here - this isn't a question about CARP, but I'm trying to ensure I've provided all info in case it is a factor.)
The status pages show the Tier 1 gateway (primary ISP) as being down. As expected.
The status pages show the Tier 2 gateway (secondary ISP) as being up. As expected.
But the primary firewall cannot reach the internet. We cannot ping anything except the WAN2 default gateway and the DNS server we set for WAN2. We used the IP address of bbc.co.uk to test.
It's as if the default route is wrong or there is no default route. I'm guessing it's something to do with the system starting up with Tier 1 down but I actually have no idea.
The IPv4 default gateway is set to the gateway group. If we change it to the WAN2 gateway then we can ping. But if we change it back to the gateway group we can still ping. It's as if the toggle reset something and we are fine from that point onwards. We don't really want to have to do this. It feels as if Tier 2 should just be taking over... But that's maybe because I don't properly understand gateway groups.
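To check my own understanding of what "Tier 2 should just be taking over" means, here is how I picture the selection logic, as a minimal Python sketch. This is not pfSense code: the gateway names are made up, and the rule (use the lowest-numbered tier that still has a gateway up) is only my reading of the docs.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str   # hypothetical gateway name, for illustration only
    tier: int   # lower number = preferred
    up: bool    # monitoring status


def active_gateways(group):
    """Return the gateways in the lowest-numbered tier that has any member up."""
    up = [g for g in group if g.up]
    if not up:
        return []
    best_tier = min(g.tier for g in up)
    return [g for g in up if g.tier == best_tier]


# Our lab at cold start: Tier 1 down, Tier 2 up.
group = [Gateway("WAN1_GW", 1, False), Gateway("WAN2_PPPOE", 2, True)]
print([g.name for g in active_gateways(group)])  # ['WAN2_PPPOE']
```

If that model is right, the group should resolve to the Tier 2 gateway at boot, which is what I expected to see and don't.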
I've uploaded some screenshots of various pages. Ideally I'm looking for some diagnostics I can do to understand what's happening.
Any tips greatly appreciated.
-
Did you create a rule to pass traffic to the GW group?
https://docs.netgate.com/pfsense/en/latest/routing/multi-wan.html#firewall-rules
-
Hi @Zawi ,
Thank you for reading my post.
I have re-read the linked document. I'm reading it that the gateway group only needs to be set for internal interfaces? I have checked and we hadn't done that for all the rules so that has been corrected. I can't run another test until Friday but will re-try then.
But, one thing is confusing me. When I realised we didn't have external access from within our VLANs I ran a ping test on the primary firewall itself. Reading the linked doc again I think maybe that test was invalid? The linked doc says the firewall rules use the routing table if no gateway is set. So, what happens during a ping test from the firewall itself? It uses the routing table? And the routing table is for some reason in a state that won't work because the primary gateway in the group is down?
I'm sure there is just some key point of gateway groups and the routing table that I haven't got yet that if I did get would make this all obvious.
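If the firewall's own traffic does follow the routing table, one thing I could check after boot is which default route is actually installed. A rough Python sketch that picks the default rows out of `netstat -rn -f inet` output. The sample text is invented for illustration using documentation addresses; it is not output from our lab, and the column layout is an assumption based on FreeBSD's usual format.

```python
def default_routes(netstat_output):
    """Return (gateway, interface) pairs for every 'default' row."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Destination, Gateway, Flags, Netif [, Expire]
        if len(fields) >= 4 and fields[0] == "default":
            routes.append((fields[1], fields[3]))
    return routes


# Hypothetical sample, NOT taken from our firewall.
SAMPLE = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            203.0.113.1        UGS        igb0
127.0.0.1          link#4             UH          lo0
198.51.100.1       link#9             UH       pppoe0
"""

print(default_routes(SAMPLE))  # [('203.0.113.1', 'igb0')]
```

An empty list would mean there is no default route at all. A default route still pointing at the WAN1 interface while Tier 1 is down would at least match the symptoms I'm seeing.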
-
OK - update after our experiments today.
Quick recap - the lab comes up in a state where the tier 1 gateway in the group is down but the tier 2 gateway is up.
We are trying to solve 2 issues: machines in the VLANs can't get external access, and the firewall itself can't get external access.
We went through all the firewall rules for all our VLANs and ensured that the gateway was set to the gateway group. We had forgotten that step even though the docs clearly say we needed to do it. Once we did that, then our test machines inside our VLANs were able to get out externally. So the gateway on the firewall rule was definitely the cause of that issue and thanks for pointing it out. So that's one problem solved.
But.
The primary firewall itself still cannot get out externally. We get that the secondary firewall won't be able to because it's not connected to the secondary line (because it's a broadband line and can't be set up with CARP). But I was expecting that the primary firewall should be able to connect externally. This manifests in the dashboard having no status info and pings from the firewall itself failing.
And.
The second we change a VLAN firewall rule to use the gateway group, then the machines in the VLAN can no longer ping machines in other VLANs. I am finding this weird because I can't see any block entries in the logs (filtering by source and destination IP).
Something about having a gateway group is stopping the firewall itself getting out externally and is also messing with the inter-VLAN traffic. I am not even sure anymore if this is happening because the tier 1 gateway is down. It may just be happening period. I can't find out until I get another maintenance window to route our primary line into the lab.
Again, I'm sure this is just something I don't understand about gateway groups. I am going to go back again and re-read everything I can find about gateway groups. If anyone can fill in whatever I am missing I would greatly appreciate it as I have been struggling with this for days and days.
-
I spent the weekend re-reading all the docs and going through forum posts. The following forum posts and doc links explained the inter-VLAN connectivity issue:
Forum post
Firewall Rules and Policy Route Negotiation
By-passing Policy Route Negotiation - Includes example local segment rule
Multi-WAN Policy Routing
Another Multi-WAN Policy Routing page
My missing link was my failure to understand that the second you create a gateway group you enter policy routing territory (as soon as you set the gateway on a rule). My simple "allow all" one rule per VLAN then stops working for local segments, as of course everything is directed out the gateway. Seems obvious now. So I need to split my rules up as explained by the forum post, so that local traffic is passed without a gateway group and the gateway group rule is the last rule in the set.
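For anyone else who got caught by this, here is the rule-ordering point as a small Python model. It is only a sketch of first-match evaluation, not how pfSense implements it; the rule descriptions, the 10.0.0.0/8 local range, the test addresses and the name WAN_GROUP are all hypothetical.

```python
import ipaddress


def match_rule(dst, rules):
    """Return (description, gateway) of the first rule whose destination matches.

    A gateway of None means "no gateway set on the rule", i.e. the packet
    follows the normal routing table instead of being policy routed.
    """
    addr = ipaddress.ip_address(dst)
    for description, network, gateway in rules:
        if addr in ipaddress.ip_network(network):
            return description, gateway
    return "no match (blocked)", None


# What I had: one allow-all rule per VLAN with the gateway group set.
single_rule = [("allow all", "0.0.0.0/0", "WAN_GROUP")]

# What I need: local segments first with no gateway, gateway group rule last.
split_rules = [
    ("local segments", "10.0.0.0/8", None),
    ("internet", "0.0.0.0/0", "WAN_GROUP"),
]

print(match_rule("10.0.20.5", single_rule))  # ('allow all', 'WAN_GROUP')
print(match_rule("10.0.20.5", split_rules))  # ('local segments', None)
print(match_rule("192.0.2.10", split_rules))  # ('internet', 'WAN_GROUP')
```

With the single rule, traffic for another VLAN is pushed out of the WAN gateway and never reaches the neighbouring segment, which would also explain why I saw no block entries in the logs.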
I am still completely confused about why the firewall itself can't get out externally though. The IPv4 default gateway is the gateway group. This forum post says that is what is needed for the firewall to maintain external connectivity. It also talks about a setting that flushes state when any gateway goes down. But in my scenario the Tier 1 gateway is already down as the firewall comes up, so I'm not sure what needs to be reset. I will try this setting anyway and see if anything changes. My understanding is that this setting will cause disconnections for both gateways if either fails, but I'll worry about that if the setting fixes the current issue. I found a couple of old posts that talked about the previous feature "Default Gateway Switching" not working with PPPoE gateways. So maybe my Tier 2 gateway isn't taking over because it is a PPPoE gateway.
I will experiment further and update here after. I've posted all links used in case anyone else is struggling.