Windows Clients cannot access the internet, very strange unexpected DNS problem.
-
@johnpoz Look, when the square computer icon at the bottom turns into a world icon, then I know there is a problem.
So three things occur:
- No internet access in the browser
- The SERVFAIL message in dnslookup from both the client and dnslookup in pfsense.
- From both the client and pfsense at the command line, ping to 8.8.8.8 fails with the TTL error.
The VPN was just to test if I can access the firewall and beyond, because you talked about my loss of IP connectivity as well.
All of a sudden internal clients are not able to resolve. I tried to reproduce the error and, as expected, after a couple of days I did.
I stopped and restarted the unbound daemon and suddenly I have that square icon at the bottom of Windows again and I'm online.
-
@IrixOS said in Windows Clients cannot access the internet, very strange unexpected DNS problem.:
From both the client and pfsense at the command line, ping to 8.8.8.8 fails with the TTL error.
This error you described in the quoted text really sounds like either an ISP issue or something going weird with your VPN setup.
If you can't get a reply from a ping command directly to an IP address, then your basic Layer 2/3 connectivity is broken for the client you are trying the ping command from. At that point DNS and unbound are totally and completely out of the picture.
You may be attacking this problem from the wrong end. Instead of worrying about unbound, you need to see first what is happening to Layer 2/3 connectivity (that is, why is a ping to an outside IP address not working?). The unbound daemon should not break Layer 2/3 connectivity for a client.
Think about this logically and troubleshoot in a logical manner.
- When the problem occurs, don't restart anything. First try a simple ping <pfSense_LAN_IP_address>. Does that work?
- Next try ping 8.8.8.8. Does that work?
If neither of the above works, then most certainly DNS resolving is going to be broken and Windows is going to show the globe icon (for no Internet). At that point you need to be troubleshooting Layer 2/3 connectivity to see why the basic ping to a hard-coded address is not working.
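The two checks above can be scripted so they are run the same way every time the problem shows up. This is only a minimal sketch in Python, not something from the thread: the function names ping_ok and diagnose are made up for illustration, and the mapping from results to "next step" is an assumption that follows the order of the two pings described above.

```python
import platform
import subprocess


def ping_ok(host: str, count: int = 2) -> bool:
    """Return True if the system ping command exits with status 0 for host."""
    # Windows ping takes -n for the packet count; most Unix-like systems take -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", count_flag, str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def diagnose(gateway_ok: bool, external_ok: bool) -> str:
    """Map the two ping results onto where to look next (illustrative only)."""
    if not gateway_ok:
        return "local: check Layer 2/3 between the client and the pfSense LAN address"
    if not external_ok:
        return "upstream: check routing, ISP or VPN beyond the pfSense LAN address"
    return "connectivity ok: only now look at DNS and unbound"


# Example call (192.168.1.1 is a placeholder for your pfSense LAN IP):
#   print(diagnose(ping_ok("192.168.1.1"), ping_ok("8.8.8.8")))
```

One caveat: on Windows the exit code is not fully reliable, because an error reply such as "Destination host unreachable" can still produce exit code 0, so reading the actual ping output is the safer check there.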
-
@bmeeks Yes, not being able to ping an external IP address from a client is strange, even though all connections are set and working and the firewall is reachable.... There must be some ISP issue.... I can hardly believe it's the internal routing.
-
@IrixOS said in Windows Clients cannot access the internet, very strange unexpected DNS problem.:
@bmeeks Yes, not being able to ping an external IP address from a client is strange, even though all connections are set and working and the firewall is reachable.... There must be some ISP issue.... I can hardly believe it's the internal routing.
Then I would concentrate all my troubleshooting efforts on figuring out why external connectivity is broken at the basic Layer 2/3 level. Could be something with routing, could certainly be an ISP issue, or it might be the VPN setup in some fashion.
Only after you can 100% reliably ping an external IP address all the way through the network should you start looking at DNS and unbound issues.
-
The client is connected to a switch configured with a local route (L) that is advertised into OSPF, and the default route is propagated to all OSPF routers from the ASBR that is directly connected to pfsense.
I also had this issue on a past network setup, but instead with SVIs at that time.
You could be right, it's either the cisco hardware or some ISP issue. The thing is, if I connect a laptop or a pc directly to the LAN interface in a /30 subnet, then it works.
Programming the switch is very straightforward. What else can I do to troubleshoot with the tools that exist in cisco IOS?
-
@IrixOS In your post above you show a TTL expired from 10.216.64.17. What device is this? Is it upstream of pfsense, or some router on your network?
That normally points to a routing loop.
Also, you could have some asymmetrical routing going on, depending on what is talking to what and whether there is a stateful firewall in the mix. Stateful firewalls don't like asymmetrical routing because there is no state, or, with only one side of the traffic being seen, the state can expire.
But @bmeeks is right on the money (as always): you need to troubleshoot your connectivity issues before you go looking at what can be wrong with unbound. Unbound is not going to function as it should if your connectivity is broken. And not being able to ping 8.8.8.8 screams connectivity problem!!
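To illustrate why a "TTL expired" reply normally points to a routing loop, here is a small toy simulation in Python. It is a sketch under stated assumptions, not a model of the actual network in this thread: the device names (access, asbr, pfsense, internet) and the next-hop tables are hypothetical, and each device is treated as a router that decrements TTL by one before forwarding.

```python
def trace(next_hop, first_router, destination, ttl=8):
    """Follow next hops until delivery, a missing route, or TTL expiry.

    next_hop maps each router name to the neighbour it forwards to
    for this destination (hypothetical, single-destination view).
    """
    router = first_router
    path = []
    while True:
        path.append(router)
        ttl -= 1  # every router decrements TTL before forwarding
        if ttl <= 0:
            # The router that hits zero drops the packet and reports it.
            return path, f"TTL expired in transit, reported by {router}"
        hop = next_hop.get(router)
        if hop is None:
            return path, f"no route at {router}"
        if hop == destination:
            return path, "delivered"
        router = hop


# Healthy case: each device hands the packet one step closer to the Internet.
healthy = {"access": "asbr", "asbr": "pfsense", "pfsense": "internet"}

# Loop case: two devices point at each other for the same destination.
looped = {"access": "asbr", "asbr": "pfsense", "pfsense": "asbr"}

print(trace(healthy, "access", "internet"))
print(trace(looped, "access", "internet"))
```

In the looped table the packet bounces between the two devices until its TTL runs out, and whichever device holds it at that moment sends the "TTL expired" message. That is why the source address of that message is a useful clue to where the loop sits.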
-
@IrixOS said in Windows Clients cannot access the internet, very strange unexpected DNS problem.:
The client is connected to a switch configured with a local route (L) that is advertised into OSPF, and the default route is propagated to all OSPF routers from the ASBR that is directly connected to pfsense.
I also had this issue on a past network setup, but instead with SVIs at that time.
You could be right, it's either the cisco hardware or some ISP issue. The thing is, if I connect a laptop or a pc directly to the LAN interface in a /30 subnet, then it works.
Programming the switch is very straightforward. What else can I do to troubleshoot with the tools that exist in cisco IOS?
It sounds to me that you may have a routing problem. And that problem may take a little bit to manifest itself as all the network equipment does its OSPF stuff. That's not my area of networking strength. @johnpoz will be much more help there as he does this kind of stuff all the time.
But I do know that these routing protocols are dynamic in that the devices participating periodically recheck the paths to calculate the shortest one. On the surface it seems that at some point they calculate something that is "suboptimal" in terms of staying connected. Restarting and/or disconnecting a port would force a new OSPF algorithm run, and on that run they calculate correctly but then get lost again later and the cycle repeats.
-
@bmeeks Yes Cisco use that 'suboptimal' term in all their concepts all the f* time
-
@bmeeks Good insight. Depending on the setup, you could for sure get different paths taken, or the path could change. It would all come down to the actual setup, and whether there are even multiple paths that could be taken.
But yeah, you could be on to something with the routing changing as to why the issue is seen sometimes and not others.
-
@johnpoz Actually I am not doing anything special here. I just added some cisco switches behind a pfsense box and did everything according to pfsense and cisco regulations, so it should work. And yes, I have multiple paths in the form of ether channels, but the client is only two hops away from pfsense. Even one hop away from pfsense is probably gonna give the same issue; directly connected I know will probably work for sure.
It is probably not the ISP, because when I directly connect the PC with the VDSL modem, then there is no issue. It has to be the cisco hardware.
To my knowledge there is nothing else you can do in Cisco IOS to troubleshoot the problem with the current network condition. Now, frankly, is this firewall ever tested with cisco hardware or ospf in general?
If I have to throw everything into mothballs, that would be very lame if you ask me...
-
@bmeeks
Don't bother with text, text is wrong, just the network model applies.
-
@IrixOS said in Windows Clients cannot access the internet, very strange unexpected DNS problem.:
Don't bother with text, text is wrong, just the network model applies.
But do all the link aggregations still apply? If so, that's a lot of places where something can go weird with a slight misconfiguration.
I also found a long thread from 2020 from a user that was having LACP issues with pfSense that were apparently never resolved. Have a look here: https://forum.netgate.com/topic/158534/lacp-not-working.
-
Yes they still apply, but I have nothing configured on the left side of the CATALYST in the middle yet. The right side of the catalyst in the middle is configured.
Just consider the spot with 'I am here'. From there zooooom over PO1 to the ASBR (the switch that is directly connected to pfsense) and zooooom over PO2 to pfsense.
That's about it. OSPF is configured in between, all routes are advertised, and on the ASBR a Null route 0.0.0.0 0.0.0.0 is configured with the pfsense IP as its next-hop address. A static route in pfsense points back to the internal network in the form of a summary route, so all connections are there.
It should work according to regulations.
-
@IrixOS said in Windows Clients cannot access the internet, very strange unexpected DNS problem.:
Yes they still apply, but I have nothing configured on the left side of the CATALYST in the middle yet. The right side of the catalyst in the middle is configured.
Just consider the spot with 'I am here'. From there zooooom over PO1 to the ASBR (the switch that is directly connected to pfsense) and zooooom over PO2 to pfsense.
That's about it. OSPF is configured in between, all routes are advertised, and on the ASBR a Null route 0.0.0.0 0.0.0.0 is configured with the pfsense IP as its next-hop address. A static route in pfsense points back to the internal network in the form of a summary route, so all connections are there.
It should work according to regulations.
As I mentioned previously, this part of networking is not my strong point. I understand the basic concepts, but in my old job I never had to actually fully design something like this. At my company we had the equivalent of @johnpoz engineers who designed the links. My job was primarily cybersecurity and firewalls, client/server software installation, configuration and administration, and various types of system programming. I interacted very frequently with the link-layer stuff and even did most of the firmware updates on equipment at my sites, but I was not heavy into the design phase.
-
@bmeeks
You mention the LACP issue from a past thread. I know the LACP aggregate with pfsense is working, at least from watching the pings.
JohnPoz advised me last year to do a Wireshark capture to see what is going on. I have the feeling from the Wireshark output, though I'm not sure, that there is some point where DNS doesn't come through.
And that is the IP address of the LACP aggregate (10.216.64.17) as shown in the network diagram.
-
@IrixOS said in Windows Clients cannot access the internet, very strange unexpected DNS problem.:
there seems to be some point where dns doesn't come through.
That may be true, but DNS coming through or not coming through has absolutely nothing to do with not getting a reply when running ping 8.8.8.8. As I've said a few times, forget DNS until you have zero problems pinging an outside IP address. When you ping an IP directly (without using a domain or host name), then DNS is not relevant. DNS is UDP and/or TCP. Ping is ICMP. When you can't ping an IP address directly, then ICMP or routing (or both) is broken.
When the basic Layer 2/3 connectivity is broken, then of course DNS is not going to work. Also be aware that asymmetric routing, if present, is going to drive a stateful firewall like pfSense bonkers. It's going to block certain traffic because it may not have seen the SYN and so did not create an open state for any stateful replies.
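The point about asymmetric routing and state can be shown with a toy model. The Python sketch below is an illustration only and is not how pf is implemented (a real state table also tracks protocol, ports, TCP sequence numbers and timeouts); the class name StatefulFirewall and the client address 10.0.0.10 are made up for the example.

```python
class StatefulFirewall:
    """Toy model: a reply passes only if the matching request was seen first."""

    def __init__(self):
        self.states = set()

    def outbound(self, src, dst):
        # Seeing the first packet of a flow (e.g. a TCP SYN) creates a state.
        self.states.add((src, dst))
        return "pass"

    def inbound(self, src, dst):
        # A reply is allowed only if the reverse direction created a state.
        return "pass" if (dst, src) in self.states else "block"


client, server = "10.0.0.10", "8.8.8.8"

# Symmetric path: request and reply both cross the same firewall.
symmetric = StatefulFirewall()
symmetric.outbound(client, server)
print(symmetric.inbound(server, client))   # reply matches a state

# Asymmetric path: the request left by another route, so this firewall
# never saw it and has no state when the reply arrives.
asymmetric = StatefulFirewall()
print(asymmetric.inbound(server, client))  # reply has no state
```

The second case is the one described above: the firewall only ever sees half of the conversation, so from its point of view the reply is unsolicited traffic and gets dropped.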
-
@bmeeks Yes I agree.
-
Is pfsense actually ever tested with cisco hardware?
-
@IrixOS said in Windows Clients cannot access the internet, very strange unexpected DNS problem.:
Is pfsense actually ever tested with cisco hardware?
pfSense is not tested for connectivity with any type of switch. An Ethernet port is an Ethernet port. It either correctly auto-negotiates and connects at the physical layer, or it does not.
In regards to something like LACP, that is always going to be a question mark regardless of who the vendors are. It seems that no manufacturer can resist the urge to "improve" upon some agreed upon standard. That's why incompatibilities exist.
If you think the Cisco connection to your pfSense box is the source of the problem, then simplify the connection to a single GigE link for a test. You only have 1 Gig to the Internet according to the diagram (but I know you said the text was not always accurate). Collapsing down to a regular single GigE link will eliminate any possibility of protocol incompatibilities because there will be no LAGG and no LACP.
If your diagram is accurate in regards to all those aggregated links, then I think your issue lies there. I think something is happening in OSPF in your internal networks. Maybe a routing loop as @johnpoz hypothesized.