DNS queries being lost
I believe I have a DNS issue, and the best I can figure out at this point is that it may have something to do with my pfSense box, but maybe not. I am hoping someone can help me with a diagnosis to pinpoint the issue.
Two users in my office are having a problem that was originally described to me as an intermittently dropping internet connection: occasionally they are unable to connect to an external website. There does not appear to be any pattern to when it occurs or how long it lasts (anywhere from 2 to 20 minutes or so). It only occurs for these two users; one is running a Win 7 box and the other a Win 10 box. The four other users in the office do not have an internet connection problem when this occurs.
In diagnosing this problem, I have narrowed it down to a DNS issue, but cannot figure out what is going on. Here is what I have done.
1. Pinged an external server (our company website) that I know is up. When I ping by name, the host cannot be found. When I ping by IP address, I get a normal, successful response. This is what first clued me in that they did not have an internet connection problem, but some type of DNS resolution problem.
2. Started a Wireshark capture on one of the problem boxes and another on the network gateway box (pfSense). Then, from the problem box, I pinged by name and by IP address, ran nslookup -debug by name, and entered the name into a web browser. I saved both Wireshark captures so I could compare them, and saved the command-line output.
3. The network route for these tests follows this path: ProblemBox to DNSmasqBox to GatewayBox to Internet.
a. All Windows boxes in the office get their DHCP and DNS setup from the DNSmasqBox; the working and problem boxes have exactly the same setup except for the IP addresses they are given.
b. DNS requests are forwarded from the DNSmasqBox to the GatewayBox for resolution. The DNSmasqBox has the following line in its config file: server=/pfgateway.mei.lan/192.168.112.11
(x.x.x.11 is the GatewayBox)
I believe our LAN names are resolved directly by the DNSmasqBox, and there has never been an issue resolving local names.
c. The GatewayBox (pfSense) is configured to use the Unbound Resolver, and has 127.0.0.1 in the system settings for the DNS resolver.
4. A name query on a problem box gives the following results.
a. Wireshark on a problem box shows the request being forwarded to the DNSmasqBox.
b. Log file on the DNSmasqBox shows the request being forwarded to the GatewayBox.
c. Wireshark on the GatewayBox does not show the name request.
5. Compared nslookup -debug <name.com> between a problem box and a good box. The problem box only tries to resolve the name within the LAN, then stops. A good box, after failing to resolve the name within the LAN, goes on to try resolving it externally and succeeds.
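To make the hop-by-hop comparison in steps 3 to 5 repeatable, here is a minimal Python sketch that sends the same A-record query directly to a chosen DNS server and reports whether it answered. It is only a sketch under assumptions: the function names are my own, it builds a bare RFC 1035 query with no retries or EDNS, and the address of the DNSmasqBox is not shown in this post, so it would have to be filled in by hand.

```python
import socket
import struct


def build_query(name, txid=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, then three zero counts.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def parse_rcode_and_answers(response):
    """Return (rcode, answer_count) from the 12-byte DNS header."""
    _txid, flags, _qd, an, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, an


def query_server(server_ip, name, timeout=3.0):
    """Query one server over UDP port 53.

    Returns (rcode, answer_count), or None if nothing usable came back
    (timeout, socket error, or a reply shorter than a DNS header).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server_ip, 53))
            response, _addr = sock.recvfrom(4096)
        except OSError:  # socket.timeout is a subclass of OSError
            return None
    if len(response) < 12:
        return None
    return parse_rcode_and_answers(response)
```

Run from a problem box while the issue is occurring, query_server("192.168.112.11", "<name.com>") would test the GatewayBox directly, and the same call with the DNSmasqBox address would test the first hop. An rcode of 0 with at least one answer means that server resolved the name; None means no reply arrived within the timeout. Whether the GatewayBox accepts direct queries from client machines depends on its resolver access settings, which I have not verified.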
On these two problem boxes, it appears that DNS requests are being lost and not making it out to a root name server, as they do on our working boxes. As described, I have Wireshark traces from two points on the network, log files, and network test results that I can review. I didn’t include the detailed data in this already long post. Can someone suggest what I should be looking for in my log data to further diagnose this issue? Or let me know if you need to see any specific log or packet capture data. I have no way of triggering the issue on the problem boxes, so I have to wait until a user notifies me that it is occurring. It also doesn’t happen every day; there may be a few days between recurrences.
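Since I can only wait for the problem to recur, one option is to leave a small logger running on a problem box so that each failure gets a timestamp I can line up with the DNSmasq log and the Wireshark captures. The following is a minimal Python sketch of that idea, not something I have deployed: the function names, the 30-second interval, and the log format are all my own assumptions, and it resolves through the operating system's resolver, so any local DNS cache may hide short outages.

```python
import socket
import time
from datetime import datetime


def check_once(hostname, resolver=socket.gethostbyname):
    """Try to resolve hostname once; return (ok, detail)."""
    try:
        return True, resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, str(exc)


def monitor(hostname, interval=30.0, rounds=None, log=print,
            resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve hostname every `interval` seconds and log each result.

    rounds=None keeps running until interrupted; a number stops after
    that many checks.
    """
    done = 0
    while rounds is None or done < rounds:
        ok, detail = check_once(hostname, resolver)
        stamp = datetime.now().isoformat(timespec="seconds")
        log(f"{stamp} {'OK' if ok else 'FAIL'} {hostname} {detail}")
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
```

Calling monitor("<name.com>") with the real site name prints one line per check; redirecting that output to a file would give a list of FAIL timestamps to search for in the other logs.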