pfSense dropping connection every hour (XX:50)
-
It looks like an issue with Unbound stopping?
Jan 30 14:32:14 unbound 42937 [42937:0] info: start of service (unbound 1.12.0).
Jan 30 14:32:14 unbound 42937 [42937:0] notice: init module 0: iterator
Jan 30 14:31:57 unbound 42937 [42937:0] notice: Restart of unbound 1.12.0.
Jan 30 14:31:57 unbound 42937 [42937:0] info: 64.000000 128.000000 1
Jan 30 14:31:57 unbound 42937 [42937:0] info: 32.000000 64.000000 15
Jan 30 14:31:57 unbound 42937 [42937:0] info: 16.000000 32.000000 3
Jan 30 14:31:57 unbound 42937 [42937:0] info: 8.000000 16.000000 6
Jan 30 14:31:57 unbound 42937 [42937:0] info: 4.000000 8.000000 8
Jan 30 14:31:57 unbound 42937 [42937:0] info: 2.000000 4.000000 4
Jan 30 14:31:57 unbound 42937 [42937:0] info: 1.000000 2.000000 23
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.524288 1.000000 54
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.262144 0.524288 94
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.131072 0.262144 208
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.065536 0.131072 23
Jan 30 14:31:57 unbound 42937 [42937:0] info: lower(secs) upper(secs) recursions
Jan 30 14:31:57 unbound 42937 [42937:0] info: [25%]=0.185738 median[50%]=0.254897 [75%]=0.561728
Jan 30 14:31:57 unbound 42937 [42937:0] info: histogram of recursion processing times
Jan 30 14:31:57 unbound 42937 [42937:0] info: average recursion processing time 2.425595 sec
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 3: requestlist max 5 avg 0.553288 exceeded 0 jostled 0
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 3: 784 queries, 343 answers from cache, 441 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Jan 30 14:31:57 unbound 42937 [42937:0] info: 64.000000 128.000000 4
Jan 30 14:31:57 unbound 42937 [42937:0] info: 32.000000 64.000000 15
Jan 30 14:31:57 unbound 42937 [42937:0] info: 16.000000 32.000000 12
Jan 30 14:31:57 unbound 42937 [42937:0] info: 8.000000 16.000000 8
Jan 30 14:31:57 unbound 42937 [42937:0] info: 4.000000 8.000000 13
Jan 30 14:31:57 unbound 42937 [42937:0] info: 2.000000 4.000000 10
Jan 30 14:31:57 unbound 42937 [42937:0] info: 1.000000 2.000000 36
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.524288 1.000000 68
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.262144 0.524288 119
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.131072 0.262144 219
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.065536 0.131072 22
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.016384 0.032768 1
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.000000 0.000001 1
Jan 30 14:31:57 unbound 42937 [42937:0] info: lower(secs) upper(secs) recursions
Jan 30 14:31:57 unbound 42937 [42937:0] info: [25%]=0.19571 median[50%]=0.308405 [75%]=0.762144
Jan 30 14:31:57 unbound 42937 [42937:0] info: histogram of recursion processing times
Jan 30 14:31:57 unbound 42937 [42937:0] info: average recursion processing time 3.105196 sec
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 2: requestlist max 6 avg 0.879017 exceeded 0 jostled 0
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 2: 1012 queries, 483 answers from cache, 529 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Jan 30 14:31:57 unbound 42937 [42937:0] info: 64.000000 128.000000 2
Jan 30 14:31:57 unbound 42937 [42937:0] info: 32.000000 64.000000 4
Jan 30 14:31:57 unbound 42937 [42937:0] info: 16.000000 32.000000 9
Jan 30 14:31:57 unbound 42937 [42937:0] info: 8.000000 16.000000 10
Jan 30 14:31:57 unbound 42937 [42937:0] info: 4.000000 8.000000 4
Jan 30 14:31:57 unbound 42937 [42937:0] info: 2.000000 4.000000 4
Jan 30 14:31:57 unbound 42937 [42937:0] info: 1.000000 2.000000 9
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.524288 1.000000 24
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.262144 0.524288 40
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.131072 0.262144 123
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.065536 0.131072 4
Jan 30 14:31:57 unbound 42937 [42937:0] info: lower(secs) upper(secs) recursions
Jan 30 14:31:57 unbound 42937 [42937:0] info: [25%]=0.188882 median[50%]=0.250955 [75%]=0.677903
Jan 30 14:31:57 unbound 42937 [42937:0] info: histogram of recursion processing times
Jan 30 14:31:57 unbound 42937 [42937:0] info: average recursion processing time 3.457023 sec
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 1: requestlist max 2 avg 0.361702 exceeded 0 jostled 0
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 1: 350 queries, 115 answers from cache, 235 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Jan 30 14:31:57 unbound 42937 [42937:0] info: 64.000000 128.000000 1
Jan 30 14:31:57 unbound 42937 [42937:0] info: 32.000000 64.000000 15
Jan 30 14:31:57 unbound 42937 [42937:0] info: 16.000000 32.000000 11
Jan 30 14:31:57 unbound 42937 [42937:0] info: 8.000000 16.000000 16
Jan 30 14:31:57 unbound 42937 [42937:0] info: 4.000000 8.000000 12
Jan 30 14:31:57 unbound 42937 [42937:0] info: 2.000000 4.000000 15
Jan 30 14:31:57 unbound 42937 [42937:0] info: 1.000000 2.000000 36
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.524288 1.000000 100
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.262144 0.524288 160
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.131072 0.262144 314
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.065536 0.131072 43
Jan 30 14:31:57 unbound 42937 [42937:0] info: 0.032768 0.065536 1
Jan 30 14:31:57 unbound 42937 [42937:0] info: lower(secs) upper(secs) recursions
Jan 30 14:31:57 unbound 42937 [42937:0] info: [25%]=0.188259 median[50%]=0.268698 [75%]=0.643216
Jan 30 14:31:57 unbound 42937 [42937:0] info: histogram of recursion processing times
Jan 30 14:31:57 unbound 42937 [42937:0] info: average recursion processing time 2.094595 sec
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 0: requestlist max 7 avg 1.03039 exceeded 0 jostled 0
Jan 30 14:31:57 unbound 42937 [42937:0] info: server stats for thread 0: 1301 queries, 577 answers from cache, 724 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Jan 30 14:31:57 unbound 42937 [42937:0] info: service stopped (unbound 1.12.0).
-
It's being restarted because of the WAN gateway event, packet loss > 20%.
If you only have one WAN you can disable the gateway action since it never fails over.
In System > Routing > Gateways, edit the WAN gateway and check: Disable Gateway Monitoring Action
Only that option; do not disable gateway monitoring completely or you'll have no stats to see what's happening. You can also try setting a different monitor IP. It could just be the gateway IP dropping pings under load.
If you really are seeing 20% packet loss every hour you need to find out why.
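If you want to sanity-check what dpinger is reporting, a quick manual test from the pfSense shell (Diagnostics > Command Prompt or SSH) around the time the loss usually appears can help. This is just a rough cross-check, not part of the normal monitoring, and the target and interval here are only examples:
# ping a monitor target for ~60 seconds at dpinger's default 500 ms send interval (run as root)
ping -c 120 -i 0.5 1.1.1.1
# the summary line reports loss, e.g. "120 packets transmitted, 95 packets received, 20.8% packet loss"
If a plain ping shows similar loss, the problem is on the line or upstream rather than anything dpinger is doing.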
Steve
-
@stephenw10 Thanks Steve, I was assuming it was something to do with cron tasks and I turned off Snort thinking it was one of its clearing tasks every hour causing the issue.
My gateway log looks a mess with latency alarms -
Jan 30 16:14:48 dpinger 93244 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.25.0.3 bind_addr 10.25.0.3 identifier "PROTONVPN_NL_VPNV4 "
Jan 30 16:14:48 dpinger 92888 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 1.1.1.1 bind_addr 95.172.224.197 identifier "WAN_DHCP "
Jan 30 16:11:06 dpinger 62125 WAN_DHCP 1.1.1.1: Clear latency 36650us stddev 13715us loss 5%
Jan 30 16:09:53 dpinger 62125 WAN_DHCP 1.1.1.1: Alarm latency 22761us stddev 8864us loss 21%
Jan 30 15:10:18 dpinger 62125 WAN_DHCP 1.1.1.1: Clear latency 34044us stddev 2790us loss 5%
Jan 30 15:09:07 dpinger 62125 WAN_DHCP 1.1.1.1: Alarm latency 32782us stddev 7266us loss 22%
Jan 30 13:57:15 dpinger 62374 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.25.0.3 bind_addr 10.25.0.3 identifier "PROTONVPN_NL_VPNV4 "
Jan 30 13:57:15 dpinger 62125 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 1.1.1.1 bind_addr 95.172.224.197 identifier "WAN_DHCP "
Jan 30 12:06:40 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33406us stddev 965us loss 6%
Jan 30 12:05:30 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 45450us stddev 35793us loss 21%
Jan 30 10:04:17 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 27470us stddev 8759us loss 5%
Jan 30 10:03:05 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 38152us stddev 18275us loss 21%
Jan 30 09:04:02 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33917us stddev 6029us loss 5%
Jan 30 09:02:41 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 52262us stddev 44016us loss 22%
Jan 30 08:03:03 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33318us stddev 275us loss 6%
Jan 30 08:01:51 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 54793us stddev 35239us loss 22%
Jan 30 07:02:17 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33292us stddev 410us loss 5%
Jan 30 07:01:06 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 17105us stddev 2230us loss 22%
Jan 30 06:00:44 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33548us stddev 2374us loss 5%
Jan 30 05:58:43 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 33444us stddev 3965us loss 21%
Jan 30 03:59:15 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33242us stddev 416us loss 5%
Jan 30 03:58:04 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 32853us stddev 3544us loss 21%
Jan 30 01:58:10 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33781us stddev 5139us loss 5%
Jan 30 01:56:58 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 33210us stddev 3804us loss 21%
Jan 30 00:57:35 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33298us stddev 427us loss 5%
Jan 30 00:56:22 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 23335us stddev 8280us loss 21%
Jan 30 00:56:07 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 24385us stddev 8422us loss 5%
Jan 30 00:54:53 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 82937us stddev 85001us loss 21%
Jan 29 23:56:46 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33331us stddev 284us loss 5%
Jan 29 23:55:34 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 23615us stddev 9275us loss 21%
Jan 29 22:54:47 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 33420us stddev 319us loss 5%
Jan 29 22:53:35 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 32657us stddev 3438us loss 21%
Jan 29 19:52:14 dpinger 78542 WAN_DHCP 1.1.1.1: Clear latency 26039us stddev 16170us loss 5%
Jan 29 19:51:15 dpinger 78542 WAN_DHCP 1.1.1.1: Alarm latency 52668us stddev 70776us loss 22%
I'll try a different monitor IP to see if that makes a difference, and tick Disable Gateway Monitoring Action on the gateway.
-
I've also set the default gateway from Automatic to WAN_DHCP and restarted Snort. Unbound still appears to stop, but System/Gateways isn't showing anything new - as yet. FreePBX's trunks are staying connected; previously they would only reconnect after a FreePBX restart or after killing FreePBX's states in pfSense.
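For reference, by killing FreePBX's states I mean clearing them in Diagnostics > States, or the equivalent from the shell - the address below is just a stand-in for my PBX's LAN IP, not my real one:
# kill firewall states from, then to, the PBX (192.168.1.50 is a placeholder)
pfctl -k 192.168.1.50
pfctl -k 0.0.0.0/0 -k 192.168.1.50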
As to why Unbound is stopping, I can't figure that out; as I said before, I assumed it was a cron job 'messing' with Unbound.
1,31 0-5 * * * root /usr/bin/nice -n20 adjkerntz -a
1 3 * * 0 root /usr/bin/nice -n20 /etc/rc.update_bogons.sh
1 1 * * * root /usr/bin/nice -n20 /etc/rc.dyndns.update
*/60 * * * * root /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 virusprot
30 12 * * * root /usr/bin/nice -n20 /etc/rc.update_urltables
1 0 * * * root /usr/bin/nice -n20 /etc/rc.update_pkg_metadata
1 7 * * * root /usr/local/bin/php -q /root/pkg_check.php
*/1 * * * * root /usr/sbin/newsyslog
1 3 * * * root /etc/rc.periodic daily
15 4 * * 6 root /etc/rc.periodic weekly
30 5 1 * * root /etc/rc.periodic monthly
1 0 * * 1 root pkg update && pkg upgrade -y
51 04 * * 2,4,6 root /usr/bin/nice -n20 /usr/local/bin/php /usr/local/sbin/execacb.php
14 0 */1 * * root /usr/bin/nice -n20 /usr/local/bin/php -f /usr/local/pkg/snort/snort_check_for_rule_updates.php
0 7 4-10 * * root /usr/local/bin/php /usr/local/www/pfblockerng/pfblockerng.php dcc >> /var/log/pfblockerng/extras.log 2>&1
15 0,12 * * * root /usr/local/bin/php /usr/local/www/pfblockerng/pfblockerng.php cron >> /var/log/pfblockerng/pfblockerng.log 2>&1
*/5 * * * * root /usr/bin/nice -n20 /usr/local/bin/php -f /usr/local/pkg/snort/snort_check_cron_misc.inc
*/2 * * * * root /usr/bin/nice -n20 /sbin/pfctl -q -t snort2c -T expire 900
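None of those obviously fire every hour around XX:50, so to see whether the Unbound restarts actually line up with the gateway alarms I've been grepping the logs like this (assuming the plain-text logs in current pfSense; older releases kept binary clog files):
# when did Unbound stop/start?
grep -E 'start of service|service stopped' /var/log/resolver.log
# when did dpinger raise or clear an alarm?
grep -E 'Alarm|Clear' /var/log/gateways.log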
-
Unbound is restarted by the gateway down/up scripts along with a bunch of other things that don't need to happen on a single WAN system. So the first thing is to disable the gateway action which will stop all that.
The next place I would look is the traffic graphs. Do you have something uploading a load of stuff at that time? Saturating the WAN upload can produce latency spikes like that.
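If you'd rather watch it live than rely on the graphs, the shell can show per-second interface throughput; the interface name below is only an example, substitute whatever your WAN NIC actually is (em0, igb0, etc.):
# live per-interface traffic, refreshed every second (q to quit)
systat -ifstat 1
# or watch a single interface's packet/byte counters
netstat -w 1 -I em0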
Steve
-
@stephenw10 Things now seem a lot better than previously, but Unbound is still restarting. FreePBX's trunks have stayed connected so far. I had no idea that so many things can restart Unbound. The WAN isn't very loaded, but I do have a poor ADSL line, which could be causing the issue? My line currently connects at 835 kbps up and 2,723 kbps down. I have pfSense running on a mini-ITX board (Intel DQ77KB) with an i5-2390T CPU @ 2.70GHz, 3GB of RAM and dual onboard Intel NICs (Intel 82579LM & Intel 82574L). Could saturating my slow line be causing an issue? Just a thought.
Thanks Steve for all your suggestions and help
Dave.
-
Hmm, well it doesn't look like the packet loss is directly correlated with upload from that, but it's hard to say from that graph.
Try de-selecting all but Latency, Packet Loss, Inpass Total and Outpass Total. With that sort of connection (I assume you mean 2.7Mbps down and 0.8Mbps up?) it would be very easy to saturate it and induce latency spikes, though DSL is generally better at managing it. YMMV!
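To put some rough numbers on it (assuming that ~0.8Mbps upstream figure): a single full-size packet takes about 15ms just to serialize onto the wire, so a queue of only a few dozen packets in the modem is already enough to blow past a 500ms latency alarm.
# back-of-envelope: milliseconds to send one 1500-byte packet at 0.8 Mbit/s
echo "scale=1; 1500 * 8 * 1000 / 800000" | bc   # ~15.0 ms, so ~35 queued packets ~= 525 ms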
Steve
-
@stephenw10 Yes, sorry, I should have made that clearer: 2.7Mbps down and 0.8 up. Very slow, but I'm ~2.5 miles from the exchange with a poor line with some aluminium wires, I've been told. I didn't realise saturating my line speed would have an effect on pfSense; I was thinking more along the lines of saturating pfSense itself, which I would think should be good for a gig or two of bandwidth.
I redrew the graph zoomed in and with fewer options selected
-
Mmm, so it does appear to be spiking, but with download peaks. At least in bps terms.
That is the nature of your connection; the actual packet loss is pretty ugly though. You should tune your monitoring values to reflect that.
You might want to experiment with fq_codel to mitigate upload saturation:
https://docs.netgate.com/pfsense/en/latest/recipes/codel-limiters.html
That doesn't really help much with download though.
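Once the limiters from that recipe are in place you can check they're actually active and passing traffic from the shell. This is just a verification step, not part of the recipe, and it assumes your version ships the dnctl tool (recent pfSense releases do); the output format varies by version:
# show the dummynet pipes/schedulers the limiter recipe creates
dnctl pipe show
dnctl sched show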
Steve
-
@stephenw10
I've added an upload limiter from the guide and the results aren't good:
https://www.waveform.com/tools/bufferbloat?test-id=cd305f29-d16d-4843-9b3c-bef4820eaaf8
Ah, that would explain why with speed tests the upload is not steady, with the needle very spiky. A limiter would limit the upload flow speed so the 'line' could handle it better?
I went from an 'F' to a 'D' result!
Wahey! When I followed the guide properly... I've gone from an 'F' to an 'A+'
https://www.waveform.com/tools/bufferbloat?test-id=c01f5e80-3629-44bc-b19d-fbac720eb9b4
Wow, pfSense is such a cool thing.
-
Wow that's awesome! Nice.
-
@stephenw10 Maybe I spoke too soon; another test or few now give a 'D' or an 'F' with nothing changed.
As you mentioned, I think it needs a bit of fine-tuning, especially with the bandwidth settings.
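My plan is to set the limiter bandwidth to roughly 85-90% of what speed tests actually measure, so the queue builds inside pfSense where fq_codel can manage it rather than in the modem. Rough sizing for my line (these figures are just my own sync rates, not anything from the guide):
# ~90% of 2.7 Mbit/s down and 0.8 Mbit/s up, in kbit/s
echo "2700 * 90 / 100" | bc   # download limiter ~= 2430 kbit/s
echo "800 * 90 / 100" | bc    # upload limiter ~= 720 kbit/s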
-