Weird bug: DNS randomly stops working, filter blocks DNS server responses…?

tristano

We've been experiencing a crippling bug in DNS, and the only explanation I can find is some kind of corruption in the state table for DNS queries. Everything works fine for a day or two, then out of the blue DNS ceases to function, breaking local internet connectivity and preventing the email system from working properly as well.

When this happens, the firewall filter log begins filling up with messages such as:


Sep 19 07:31:17 firewall.elyria.infod pf 260054 rule 155/0(match): block in on em5: (tos 0x0, ttl 245, id 38686, offset 0, flags [DF], proto: UDP (17), length: 261) 166.102.165.11.53 > 192.168.254.1.59785:  15523[|domain]
Sep 19 07:31:17 firewall.elyria.infod pf 001561 rule 155/0(match): block in on em5: (tos 0x0, ttl 245, id 24924, offset 0, flags [DF], proto: UDP (17), length: 261) 166.102.165.13.53 > 192.168.254.1.59784:  15523[|domain]
Sep 19 07:31:17 firewall.elyria.infod pf 410106 rule 155/0(match): block in on em5: (tos 0x0, ttl 245, id 3451, offset 0, flags [DF], proto: UDP (17), length: 134) 166.102.165.11.53 > 192.168.254.1.59785:  45741[|domain]
Sep 19 07:31:17 firewall.elyria.infod pf 000927 rule 155/0(match): block in on em5: (tos 0x0, ttl 245, id 53786, offset 0, flags [DF], proto: UDP (17), length: 134) 166.102.165.13.53 > 192.168.254.1.59784:  45741[|domain]
Sep 19 07:31:20 firewall.elyria.infod pf 000512 rule 155/0(match): block in on em5: (tos 0x0, ttl 245, id 32058, offset 0, flags [DF], proto: UDP (17), length: 134) 166.102.165.11.53 > 192.168.254.1.59785:  45741[|domain]

166.102.165.11 is the ISP DNS server and 192.168.254.1 is the WAN interface on the router, so what we're seeing is packets from the DNS server being blocked from reaching the router. Because the source port is 53 and the destination is some high port, these appear to be replies rather than new requests (DNS over UDP has no ACKs as such): a client is making a DNS request and the response is being blocked even though the client initiated the exchange.
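To check that reading across a larger chunk of the log, here is a rough Python sketch that pulls the interface, addresses, and ports out of lines shaped like the ones above and flags the ones that look like blocked DNS replies. The regex is only fitted to the sample lines in this post, and the names `parse_pf_line` and `is_blocked_dns_reply` are made up for illustration; other log formats would need a different pattern.

```python
import re

# Fitted to the sample log lines in this post only (assumption): each
# endpoint is printed as "IPv4 address, dot, port", e.g. 166.102.165.11.53
LOG_RE = re.compile(
    r"rule (?P<rule>\d+)/\d+\(match\): (?P<action>\w+) (?P<dir>in|out) "
    r"on (?P<iface>\w+):"
    r".*?proto: (?P<proto>\w+) .*?"
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+):"
)


def parse_pf_line(line):
    """Return a dict of fields from one filter log line, or None if no match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["sport"] = int(entry["sport"])
    entry["dport"] = int(entry["dport"])
    return entry


def is_blocked_dns_reply(entry):
    """True for an inbound UDP packet from port 53 to a high port that was blocked."""
    return (
        entry["action"] == "block"
        and entry["dir"] == "in"
        and entry["proto"] == "UDP"
        and entry["sport"] == 53
        and entry["dport"] >= 1024
    )


if __name__ == "__main__":
    sample = (
        "Sep 19 07:31:17 firewall.elyria.infod pf 260054 rule 155/0(match): "
        "block in on em5: (tos 0x0, ttl 245, id 38686, offset 0, flags [DF], "
        "proto: UDP (17), length: 261) 166.102.165.11.53 > "
        "192.168.254.1.59785:  15523[|domain]"
    )
    entry = parse_pf_line(sample)
    print(entry, is_blocked_dns_reply(entry))
```

If every blocked line comes back as a reply from port 53 to a high port on the WAN address, that at least supports the idea that the state for the outgoing query was missing or gone by the time the answer arrived.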

Why would this happen unless there is a problem in the state table…? This problem occurs on pfSense 1.2 full and embedded, on two very different hardware platforms.

Note: I'm trying to determine if the modem has a hand in this as well. The modem is utter crap and has a built-in router, which is obnoxious and makes it think that it is more important than it is. Currently I'm testing the DHCP lease setting on the modem, which was set to infinite. Since the problem occurs after a day or two of working fine, I'm wondering if this infinite lease is somehow upsetting pfSense. I've set the new lease time to 24 hours to see if that helps. I know pfSense does not like stale leases. Who would, but a pathetic, crap modem?

tristano

Some additional information is worth noting.

We have a dual WAN setup:

WAN1 - 192.168.254.1
WAN2 - 192.168.253.1

Systems in the DMZ use WAN2 and systems on the LAN use WAN1. That makes it somewhat strange that the firewall log is filling with blocked DNS responses going to the WAN1 interface, because the systems that should be actively using DNS in the middle of the night (i.e. email) are on the DMZ. Occasionally there is a block on the WAN2 interface, so this is not exactly conclusive of anything, but I still find it worth a mention.
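To put numbers on the WAN1 versus WAN2 split instead of eyeballing it, something like the following Python sketch could tally blocked DNS replies by destination address. The address-to-WAN mapping comes from the list above; the function name `tally_blocked_replies` and the regex are illustrative assumptions fitted to the log format shown earlier in the thread.

```python
import re
from collections import Counter

# WAN addresses as listed in this post
WAN_BY_ADDR = {
    "192.168.254.1": "WAN1",
    "192.168.253.1": "WAN2",
}

# Source port exactly 53, then capture the destination IPv4 address
# (format assumed from the sample log lines: "a.b.c.d.port > a.b.c.d.port:")
DST_RE = re.compile(r"\.53 > (\d+\.\d+\.\d+\.\d+)\.\d+:")


def tally_blocked_replies(lines):
    """Count blocked inbound packets from port 53, grouped by WAN interface."""
    counts = Counter()
    for line in lines:
        if "block in" not in line:
            continue
        m = DST_RE.search(line)
        if m:
            counts[WAN_BY_ADDR.get(m.group(1), "other")] += 1
    return counts
```

Feeding it the filter log lines (for example from a saved copy of the log, read line by line) gives a per-WAN count. A heavy skew toward WAN1 during hours when only the DMZ should be resolving names would suggest the queries are leaving by a different interface than expected.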

Also, pfSense is configured to allow the DNS server list to be overridden by DHCP on WAN, and DHCP on WAN provides pfSense with the ISP DNS server addresses. I have also tried statically setting the DNS servers in pfSense, with the same result. :-\