Tough time with Unbound

johnpoz

@Stewart said in Tough time with Unbound:

If I query from a PC connected to the VLAN it just times out with no response from the DNS server.

Well sniff, does pfsense see the query?

With timeout either unbound not listening on the IP your sending the query too, or your traffic not getting to unbound..Firewall rule on the client or the server...

You seem to be listening on all IPs.. So sniff - does pfsense actually see the query when you send it.. And your sending it to the vlan IP of pfsense?

Gertjan

@Stewart said in Tough time with Unbound:

I don't see any rejections in the firewall log.

If it's unbound that doesn't accept the queries, and refuses the query because of an "acces-controm" rule, it's normal that the firewall doesn't log.
It would be unbound that discards the request. which you could see happening if you crank up the verbosity of the logging of unbound Services > DNS Resolver > Advanced Settings > Log level.

It's easy to see if the firewall is doing anything with the DNS traffic. Decalre a pass rule for UDP and TCP, destination port 53. Put the rule at the top on any interface and you'll see the "usage" counter starting to increment.

Like this :

Or de activate the acces-control all together.

edit :
After several seconds in place, see the counter :

Stewart

@Gertjan You know, I've never thought to add those rules just to log if there is traffic. That's a good idea.

I'll need to get back out to the site and do some more testing. I think recall seeing traffic coming into the firewall on port 53 and nothing going out but I ran a lot of captures so I'm not positive. It's a new install. All that's there are 2 Cisco SG250 switches with the VLANs set up on them. IPs are assigned via DHCP from the pfSense box. I can ping the box and log into the interface from the various VLANs. Anything else I should put on my list to check before going out?

Gertjan

@Stewart said in Tough time with Unbound:

Anything else I should put on my list to check before going out?

Probably this :

@Stewart said in Tough time with Unbound:

All that's there are 2 Cisco SG250 switches with the VLANs set up on them.

put dumb switches in place, and hard tag your devices so it's using a VLAN ID. This will bypass all setup issues related to these switches.
Btw : never used VLAN's myself, I prefer adding 'just another NIC'.

Stewart

@Gertjan Since I can successfully ping and navigate to the IP then the VLANs are set up properly. In the firewall it just shows up as another interface. On the switches you just untag the access VLAN and TAG the rest on the trunks. There really isn't much to it at this level.

Stewart

So I came out onsite and ran some tests. Now NONE of the dig commands work! Crazy!

top -aSH
last pid: 16121;  load averages:  3.07,  3.91,  3.87                                                                                                                                                                 up 3+01:27:52  14:38:09
201 processes: 7 running, 157 sleeping, 37 waiting
CPU: 26.3% user,  0.0% nice, 30.6% system,  5.1% interrupt, 38.0% idle
Mem: 96M Active, 103M Inact, 12K Laundry, 672M Wired, 317M Buf, 3035M Free
Swap: 4096M Total, 513M Used, 3583M Free, 12% Inuse

  PID USERNAME      PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
23941 root          102    0  6392K  2140K CPU2    2  27:51  98.63% /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/syslog.pid -f /etc/syslog.conf
   11 root          155 ki31     0K    64K RUN     0  57.3H  50.41% [idle{idle: cpu0}]
   11 root          155 ki31     0K    64K CPU3    3  55.8H  37.81% [idle{idle: cpu3}]
   11 root          155 ki31     0K    64K RUN     1  53.7H  35.18% [idle{idle: cpu1}]
39062 unbound        44    0   100M 81476K CPU1    1   0:14  33.34% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
39062 unbound        46    0   100M 81476K umtxn   3   0:14  33.27% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
39062 unbound        45    0   100M 81476K umtxn   3   0:14  33.24% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
39062 unbound        42    0   100M 81476K umtxn   1   5:35  32.93% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}

Logs are empty in the GUI. Enabled advanced logging in the GUI which caused the service to restart I suppose. Now resolution works, but only on the second try. If I try to go to, say, arstechnica.com, it won't load the first time. Wireshark shows me:
If I reload the page it'll come up, but slowly, sometimes.

At the moment unbound is taking up most of the CPU:

last pid: 26202;  load averages:  4.30,  3.64,  3.12                                                                                                                                                                 up 3+01:57:31  15:07:48
237 processes: 7 running, 193 sleeping, 37 waiting
CPU: 39.6% user,  0.0% nice, 29.7% system,  6.5% interrupt, 24.2% idle
Mem: 1437M Active, 209M Inact, 12K Laundry, 745M Wired, 370M Buf, 1516M Free
Swap: 4096M Total, 511M Used, 3585M Free, 12% Inuse

  PID USERNAME      PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
23941 root           99    0  6392K  2140K CPU0    0  46:36  89.78% /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/syslog.pid -f /etc/syslog.conf
13662 unbound        81    0 63396K 51684K CPU2    2   1:31  31.61% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
13662 unbound        52    0 63396K 51684K umtxn   3   1:32  30.94% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
13662 unbound        52    0 63396K 51684K umtxn   2   4:55  30.82% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
13662 unbound        52    0 63396K 51684K umtxn   2   1:31  29.71% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}

running clog tells me some more:

Feb 17 15:08:41 GateKeeper unbound: [13662:0] info: 63vRDCDd mod1  ns5-65.akam.net. AAAA IN
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: mesh_run: end 153 recursion states (17 with reply, 84 detached),                                                                                                                         51 waiting replies, 28 recursion replies sent, 0 replies dropped, 0 states jostled out
Feb 17 15:08:41 GateKeeper unbound: [13662:2] debug: EDNS lookup known=0 vs=0
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: average recursion processing time 45.800638 sec
Feb 17 15:08:41 GateKeeper unbound: [13662:2] debug: serviced query UDP timeout=376 msec
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: histogram of recursion processing times
Feb 17 15:08:41 GateKeeper unbound: [13662:2] debug: inserted new pending reply id=89af
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: [25%]=1e-06 median[50%]=44.8 [75%]=72
Feb 17 15:08:41 GateKeeper unbound: [13662:2] debug: Need to send query but have no outgoing interfaces of that famil                                                                                                                        y
Feb 17 15:08:41 GateKeeper unbound: [13662:2] info: error sending query to auth server ip6 2620:0:34::53 port 53 (len                                                                                                                         28)
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info:    1.000000    2.000000 1
Feb 17 15:08:41 GateKeeper unbound: [13662:1] debug: servselect ip4 192.31.80.30 port 53 (len 16)
Feb 17 15:08:41 GateKeeper unbound: [13662:2] info: processQueryTargets: ns1.msft.net. A IN
Feb 17 15:08:41 GateKeeper unbound: [13662:0] info: 70vRDCD mod1  ns-1471.awsdns-55.org. AAAA IN
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info:   16.000000   32.000000 2
Feb 17 15:08:41 GateKeeper unbound: [13662:1] debug:    rtt=68675
Feb 17 15:08:41 GateKeeper unbound: [13662:2] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount                                                                                                                         16
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info:   32.000000   64.000000 10
Feb 17 15:08:41 GateKeeper unbound: [13662:2] info: DelegationPoint<msft.net.>: 4 names (0 missing), 8 addrs (4 resul                                                                                                                       t, 0 avail) cacheNS
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info:   64.000000  128.000000 8
Feb 17 15:08:41 GateKeeper unbound: [13662:1] debug:    rtt=54538
Feb 17 15:08:41 GateKeeper unbound: [13662:2] info:   ns4.msft.net. * A AAAA
Feb 17 15:08:41 GateKeeper unbound: [13662:0] info: 73vRDCD mod1  ns-1627.awsdns-11.co.uk. AAAA IN
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: 0vRDCD mod1  ns.bahnhof.net. A IN
Feb 17 15:08:41 GateKeeper unbound: [13662:1] debug: servselect ip4 192.33.14.30 port 53 (len 16)
Feb 17 15:08:41 GateKeeper unbound: [13662:2] info:   ns2.msft.net. * A AAAA
Feb 17 15:08:41 GateKeeper unbound: [13662:0] info: 74vRDCD mod1  ns-1655.awsdns-14.co.uk. AAAA IN
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: 1vRDCD mod1  ns1.azprdmig.msft.net. A IN
Feb 17 15:08:41 GateKeeper unbound: [13662:1] debug:    rtt=43107
Feb 17 15:08:41 GateKeeper unbound: [13662:2] info:   ns1.msft.net. * A AAAA
Feb 17 15:08:41 GateKeeper unbound: [13662:0] info: 75vRDCDd mod1  ns-1870.awsdns-41.co.uk. AAAA IN
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: 2vRDCD mod1  ns2.bahnhof.net. A IN
Feb 17 15:08:41 GateKeeper unbound: [13662:1] debug: servselect ip4 192.5.6.30 port 53 (len 16)
Feb 17 15:08:41 GateKeeper unbound: [13662:2] info:   ns3.msft.net. * A AAAA
Feb 17 15:08:41 GateKeeper unbound: [13662:0] info: 76vRDCD mod1  ns1-201.azure-dns.com. AAAA IN
Feb 17 15:08:41 GateKeeper unbound: [13662:3] info: 3vRDCD mod1  ns2.azprdmig.msft.net. A IN
Feb 17 15:08:41 GateKeeper unbound: [13662:1] debug:    rtt=47876
Feb 17 15:08:41 GateKeeper unbound: [13662:2] debug:    ip6 2620:0:34::53 port 53 (len 28)
Feb 17 15:08:41 GateKeeper unbound: [13662:0] info: 77vRDCD mod1  ns1prod.6893.azuredns-prd.org. AAAA IN

The "Need to send query but have no outgoing interfaces of that family" concerns me.
Along with "error sending query to auth server"

johnpoz

Why would unbound be using that much cpu... That is nuts..

As to your auth error, that is ns3.msft.net, yeah its an authoritative ns.. for msft.net

looks like you have some issue for sure.. Why don't you stop unbound.. And then restart it... you should not be using anywhere close to that amount of cpu.. All 4 threads?

Stewart

@johnpoz I've restarted the whole firewall with no help. My laptop is currently the only thing on the network besides the 2 switches and the firewall. I don't even know where to look to troubleshoot. There's nothing in the logs. I'm currently disabling everything. Suricata and pfblockerNG are now both off.

I do have a theory, though. Running sockstat -l | grep unbound showed 386 responses. This is a new laptop and I had uTorrent running in the background from when I was re-downloading all my updated ISOs. Once I closed it it dropped down to 15 responses and things are working normally, at least for now. Am I running out of resources?

Edit: If that's the case then I'm concerned 1 laptop can take down the whole shebang. I've had clients with old ASAs and RV units that buckle under someone running a torrent. This is the first time I've seen it affect pfSense, if that's the case.

johnpoz

You had your utorrent client looking up every single IP in the swarm it was talking too - hehehe!! While yeah that can be a lot of queries... Shouldn't be too many to be honest..

Unless your having issues with resolving - but then again PTRs for these IPs that most likely do not resolve will force time outs and retries, etc..

Yeah trying to resolve the PTR for the hundreds if not 1000's of IPs you might talk to in p2p swarm - yeah that could be problematic...

edit: its not the torrents.. its the DNS queries that don't resolve and then waiting for timeouts, and then retrying, etc. Not like you get back NX or something... Freaking p2p swarm is going to have a ton of shit ips everyone using vpns, etc. that don't have any ptrs set, etc. etc..

Yeah doing 1000's of shit dns queries can be problematic ;)

Stewart

@johnpoz If that's what it is, how do I combat it? We can't have 1 rogue laptop take down the network. I'm looking at the options in the Advanced tab and shifting them all higher, but which will actually help?

johnpoz

No you just need to make sure your dns can handle it... Change the settings, more threads, etc.. Maybe not run dns on pfsense - run it on some other box if your pfsense box is not up to high levels of dns queries..

Look maybe you have some bad setup with ipv6 that your trying to use.. ??? Your average recursion time is was horrible..

Here is mine
Feb 17 07:56:26 unbound 72571:0 info: average recursion processing time 0.274979 sec

Stewart

@johnpoz I pushed up Advanced settings to:
Cache Size: 20MB
Outgoing TCP Buffers: 50
Incoming TCP Buffers: 50
Number of Queries per Thread: 2048
I then saved and opened up my uTorrent to see what happens. It's been running for a few minutes now and haven't seen any spikes.

clog -f /var/log/resolver.log | grep "average recursion"
Feb 17 16:22:37 GateKeeper unbound: [77568:0] info: average recursion processing time 0.133651 sec
Feb 17 16:22:37 GateKeeper unbound: [77568:0] info: average recursion processing time 0.140601 sec
Feb 17 16:22:37 GateKeeper unbound: [77568:0] info: average recursion processing time 0.163783 sec
Feb 17 16:22:37 GateKeeper unbound: [77568:0] info: average recursion processing time 0.149740 sec

Recursion time is also very quick now.

Stewart

@Stewart While I thought that was the issue, it isn't. The started plugging things in and internet isn't working. I come out and it's the same thing. Unbound periodically high cpu even with nothing plugged in. Running dig from the CLI when it doesn't work shows:

/root: dig @127.0.0.1 google.com A

; <<>> DiG 9.12.2-P1 <<>> @127.0.0.1 google.com A
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

It comes and goes.

Edit: When i run clog -f on the resolver.log file I can see it just scrolling through with all the lookups. Then it just stops and no DNS resolution can occur. Then it alll starts again.

Stewart

System log after a reboot shows

Mar 19 10:03:54 	php-fpm 	341 	/services_unbound_advanced.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1584626634] unbound[3232:0] error: bind: address already in use [1584626634] unbound[3232:0] fatal error: could not open ports'
Mar 19 10:03:54 	dhcpleases 		kqueue error: unknown
Mar 19 10:09:30 	dhcpleases 		Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.

Stewart

@Stewart said in Tough time with Unbound:

/usr/local/sbin/unbound -c /var/unbound/unbound.conf

Removed pfBlockerNG-devel and rebooted but that didn't help. Still the same errors. Went into services and restarted unbound and it started working. Rebooted and it's working. This is the first time I haven't seen error messages. Maybe some setting in pfBlockerNG-devel?