Should I be using Unbound Python mode? Is it stable?

Gertjan

@keyser said in Should I be using Unbound Python mode? Is it stable?:

I still have a command named "Unbound" in "top -m io" that does about 200-300 KB/s of writes to disk. It only goes away if I stop pfBlockerNG.

That's not what I see.

When I

Then unbound isn't running any more.
This :

ps ax | grep 'unbound'

should not list any 'unbound' instances (apart from, possibly, the grep command itself).

And when unbound isn't running, the underlying pfBlockerNG Python script isn't running any more either, which means the script doesn't produce any output any more.

pfBlockerNG could stop and restart unbound as part of its internal update procedure, once a day, more or less.

What do you see when you execute :

grep 'start' /var/log/resolver.log
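
To see how often unbound was (re)started, rather than scrolling through the raw matches, the start lines can be tallied per day. This is only a sketch: it assumes syslog-style lines that begin with a month and a day, and that unbound logs the phrase "start of service" when it comes up. `starts_per_day` is a made-up helper name, not part of pfSense or pfBlockerNG.

```shell
# Count unbound start messages per day in a syslog-style log file.
# Assumptions: fields 1 and 2 are month and day, and a start is
# logged with the text "start of service".
starts_per_day() {
    awk '/start of service/ {
        day = $1 " " $2
        if (!(day in n)) order[++k] = day   # remember first-seen order
        n[day]++
    }
    END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }' "$1"
}

# Usage: starts_per_day /var/log/resolver.log
```

One or two starts per day would fit a daily update; many more would point at something restarting unbound repeatedly.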

keyser

@gertjan said in Should I be using Unbound Python mode? Is it stable?:

@keyser said in Should I be using Unbound Python mode? Is it stable?:

I still have a command named "Unbound" in "top -m io" that does about 200-300 KB/s of writes to disk. It only goes away if I stop pfBlockerNG.

That's not what I see.

When I

Then unbound isn't running any more.
This :
ps ax | grep 'unbound'
should not list any 'unbound' instances.

And when unbound isn't running, the underlying pfBlockerNG Python script isn't running any more either, which means the script doesn't produce any output any more.

pfBlockerNG could stop and restart unbound as part of its internal update procedure, once a day, more or less.

What do you see when you execute :
grep 'start' /var/log/resolver.log

I understand what you are getting at, and for some bizarre reason I can no longer replicate the issue where the writing continues even though "Unbound" is stopped.
I agree with you that the writing should stop if unbound is stopped, but on two separate occasions I saw it continue. It would seem that was just impatience on my side: I did not give the residual parts of the unbound shutdown time to complete.
Right now, the writing does stop if I stop unbound (just as it does if I only stop pfBlockerNG).

However: The problem with excessive writing is still very much present when using "Unbound python mode".
Right now I have 2 feeds active, and I have them both set to "null blocking (no logging)". As expected I'm no longer seeing new log entries in dnsbl.log. I have disabled DNS reply logging, and no logfiles are seeing new entries apart from the odd hit on my IP block list/log.

Yet, Unbound is still sustaining about 350 KB/s of writes to my SSD. So in the 2 min it took me to write this reply, some 30MB of data was written. Data that is not present in any logfiles.
Sidenote: There are only 8 clients on my network in total, and they were all asleep during this little test.

fireodo

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Yet, Unbound is still sustaining about 350 KB/s of writes to my SSD. So in the 2 min it took me to write this reply, some 30MB of data was written. Data that is not present in any logfiles.
Sidenote: There are only 8 clients on my network in total, and they were all asleep during this little test.

Try to look with this command:

top -SH -o write (and after that "m")

to see which process is most active in writing.

Just an idea ...

keyser

@fireodo said in Should I be using Unbound Python mode? Is it stable?:

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Yet, Unbound is still sustaining about 350 KB/s of writes to my SSD. So in the 2 min it took me to write this reply, some 30MB of data was written. Data that is not present in any logfiles.
Sidenote: There are only 8 clients on my network in total, and they were all asleep during this little test.

Try to look with this command:

top -SH -o write (and after that "m")

to see which process is most active in writing.

Just an idea ...

Unbound by far. It's more or less permanently at the top, with anywhere from 7 to 60+ writes per refresh. At times there are two unbound lines, each with their own writes, but they belong to the same PID.

Syncer is the second most active, with the occasional 1 or 2 writes whenever unbound is quiet for a second.

keyser

@keyser Hmm, just noticed something:

1: If I enable DNS Reply logging it has a huge impact on the sustained write issue - it more or less doubles the write rate.

2: With DNS reply logging active I noticed that Unbound is receiving a noticeable amount of DNS requests every second even though I have no clients.... Tracked it down and it turns out it is my Zabbix monitoring server that is requesting name resolution repeatedly for names of monitored devices (Seems Zabbix on Linux does not cache DNS resolutions by default).

With that in mind - and the writing almost exactly doubling:
Could it be that the Python integration script is actually storing DNS replies to disk temporarily (1x write) even if DNS reply logging is disabled?
Enabling DNS reply logging would then require another write of the reply (2x write).

Just a hunch.....

Gertjan

@keyser said in Should I be using Unbound Python mode? Is it stable?:

.....

So it boils down to "you do something I (we) don't".
Then, "do what we do" ;) and you'll see that unbound or unbound related traffic becomes close to nothing.

My company network doesn't even show 'unbound' in the top 20 of "top -o write + m" - and most of my colleagues are back from holiday and are all trying to give the impression that they work.

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Zabbix monitoring

Is that permanently doing something to gather stats?
I'm using Munin myself, but Munin's node scripts only run every 5 minutes, and never last more than 30 seconds or so.

keyser

@gertjan said in Should I be using Unbound Python mode? Is it stable?:

@keyser said in Should I be using Unbound Python mode? Is it stable?:

.....

So it boils down to "you do something I (we) don't".
Then, "do what we do" ;) and you'll see that unbound or unbound related traffic becomes close to nothing.

My company network doesn't even show 'unbound' in the top 20 of "top -o write + m" - and most of my colleagues are back from holiday and are all trying to give the impression that they work.

I would love to, if that makes this problem go away - the thing is, I don’t think I have done anything particularly custom in my pfBlockerNG setup. It's more or less vanilla/default settings in DNSBL. In IP I have the one exception of having it make ALIAS lists instead of auto rules.

This just crossed my mind: Are you perhaps using RAMdisk for /var? That could explain why your Unbound python script does not touch the disk.

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Zabbix monitoring

Is that permanently doing something to gather stats?
I'm using Munin myself, but Munin's node scripts only run every 5 minutes, and never last more than 30 seconds or so.

Yeah, it's monitoring a bunch of network equipment and servers for bandwidth and utilization statistics.

azdeltawye

@keyser said in Should I be using Unbound Python mode? Is it stable?:
...

I would love to if we can make this problem go away...

I agree.

I have a similar problem with my SG-5100. I changed from Unbound to Unbound Python mode about 3 weeks ago and my disk usage (ufs) has steadily increased from about 45% to 75% currently.

Nothing fancy in my setup: about 30 - 35 clients, a few VLANs, VPN server, and minimal packages (avahi, nut, pfBlockerng-devel, snort, traffic stats).

I am considering changing back to non-Python mode.

keyser

@azdeltawye said in Should I be using Unbound Python mode? Is it stable?:

@keyser said in Should I be using Unbound Python mode? Is it stable?:
...

I would love to if we can make this problem go away...

I agree.

I have a similar problem with my SG-5100. I changed from Unbound to Unbound Python mode about 3 weeks ago and my disk usage (ufs) has steadily increased from about 45% to 75% currently.

Nothing fancy in my setup: about 30 - 35 clients, a few VLANs, VPN server, and minimal packages (avahi, nut, pfBlockerng-devel, snort, traffic stats).

I am considering changing back to non-Python mode.

Yeah, that’s the disk filling issue with pfBlockerNG. I have that as well.

There is definitely something fishy with the python script and how it handles data. The filling issue in my case is related to doing “DNS Reply Logging” in Unbound Python Mode.
When Unbound is under moderate DNS resolution load, the Python script fails to respect the default 20,000-line maximum for the DNS_reply log. Once the file is past the 20,000-line limit, the script no longer attempts to delete the old entries, and the log file just keeps growing. Mine had some 900,000+ lines in it when my disk was almost full. Stopping pfBlockerNG and starting it again causes the script to truncate the logfile again (and you regain your lost disk space). But in my case it’s only a matter of time before DNS resolution load causes the script to miss truncation again, and you are on your way to a full disk.
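
For anyone who wants to cap such a file by hand in the meantime, keeping only the newest lines can be sketched as below. This is not how pfBlockerNG itself trims its logs (that mechanism is not shown in this thread); `trim_log` is a hypothetical helper. It writes the kept lines back into the same file rather than replacing it, on the assumption that a process may still have the file open.

```shell
# Keep only the last MAX lines of FILE.
# Hypothetical helper, not pfBlockerNG's own truncation logic.
trim_log() {
    file=$1
    max=$2
    tmp=$(mktemp) || return 1
    # Copy the tail aside, then overwrite the original in place so a
    # writer holding the file open keeps writing to the same file.
    tail -n "$max" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage: trim_log /var/unbound/var/log/pfblockerng/dns_reply.log 20000
```

Lines written between the `tail` and the overwrite can be lost, so this is a stopgap rather than a fix.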

Disabling DNS reply logging has so far prevented me from seeing the disk running full again. But who knows if it can happen to some of the other logfiles?

keyser

@gertjan

As a little followup on my investigations, I have taken measures to “quiet” down my zabbix server so it does not flood my Unbound resolver.
Doing that had a pretty big impact on the sustained write rate which now dropped to somewhere around 150 KB/s (from around 380 KB/s).

So there is no doubt the issue is somewhat related to DNS lookup activity in Unbound. The thing is - I have disabled DNS Reply logging, DNSBL block logging and so forth, and none of my logfiles are growing now. What is the Python script doing that touches/writes so much to disk when it’s not logging?
I suspect there is some temporary storing of DNS lookup data going on inside the script.
What DNSBL configuration could cause my script/python integration to store temporary data, that most other installations do not?

At 150 KB/s I’m looking at roughly 4 TB of write I/O per year to my 8 GB eMMC - that will wear through its endurance in about 3 years. Still too fast in my opinion.
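
The yearly figure is easy to re-check for any write rate. A rough sketch, assuming decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes) and a constant rate all year; the endurance value passed to `years_to_wear` is whatever the part is rated for and is a placeholder here, not a datasheet number.

```shell
# TB written per year at a sustained rate given in KB/s (decimal units).
yearly_tb() {
    awk -v r="$1" 'BEGIN { printf "%.2f\n", r * 1000 * 86400 * 365 / 1e12 }'
}

# Years until ENDURANCE_TB of total writes is reached at RATE KB/s.
# The endurance figure is an input you must supply yourself.
years_to_wear() {
    awk -v r="$1" -v e="$2" \
        'BEGIN { printf "%.1f\n", e * 1e12 / (r * 1000 * 86400 * 365) }'
}

yearly_tb 150          # prints 4.73
years_to_wear 150 12   # prints 2.5 (12 TB is a hypothetical endurance)
```

Flash wear also depends on write amplification, which this simple estimate ignores.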

Gertjan

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Stopping pfBlockerNG and starting it again causes the script to truncate the logfile again (and regain your lost diskspace).

My experience is not the same.

I had set every log file to "40 000".
I changed it to the default "20 000" lines for all of them - and saved at the bottom of the page.
Just to be sure, I checked one of the files concerned, dns_reply.log, and counted the number of lines :

[2.5.2-RELEASE][admin@pfsense.my-network.net]/var/unbound/var/log/pfblockerng: wc -l dns_reply.log
   41443 dns_reply.log

That's just a bit more than 40 000, the number I was using before.

I executed a manual Force -> reload.
Again :

[2.5.2-RELEASE][admin@pfsense.my-network.net]/var/unbound/var/log/pfblockerng: wc -l dns_reply.log
   20066 dns_reply.log

So, for me ok, the file was reset to "20 000" and some new lines (66) were already added.

Next test :

I asked the CRON task to execute every hour (you should normally not do this !!!) :
I waited.

At 11h00 AM local time, the cron executed.
I checked the file size again at 11h10, so 10 minutes later :

[2.5.2-RELEASE][admin@pfsense.my-network.net.net]/var/unbound/var/log/pfblockerng: wc -l dns_reply.log
   20826 dns_reply.log

That's a bit more than 20 000 - the 826 lines were added after 11h00.

So, for me, it works.

But : true, I can imagine that these files can get pretty big after "24 hours".
Keep track of the files the old-fashioned way : manually, and adapt the CRON timing accordingly.

I'll set my pfBlockerNG CRON timer back to once a day, and my log file size back to "40 000", and report back in a day or two.
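
The manual check can be made a little less tedious with a small helper that lists the logs that have outgrown a given line limit. A sketch only: `over_limit` is a made-up name, and it simply assumes the files of interest end in `.log` and live in one directory.

```shell
# Print "name line_count" for every *.log in DIR with more than MAX lines.
over_limit() {
    dir=$1
    max=$2
    for f in "$dir"/*.log; do
        [ -f "$f" ] || continue               # skip if the glob matched nothing
        n=$(awk 'END { print NR }' "$f")      # line count without padding
        if [ "$n" -gt "$max" ]; then
            echo "$(basename "$f") $n"
        fi
    done
    return 0
}

# Usage: over_limit /var/unbound/var/log/pfblockerng 20000
```

Run shortly before and after the cron job, it shows whether the pruning actually happened.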

keyser

@gertjan Yeah, you could work the issue that way as well.

You missed my main question however: Are you using RAMdisk for /tmp and /var in your config? Because that might explain why you are not seeing sustained writeIO to your SSD.

Just out of curiosity, what is your average write IO load over time in your installation? (iostat -x)

Gertjan

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Are you using RAMdisk for /tmp and /var in your config?

No way.
Just classic hard disks - the ones I removed from desktop devices after a 4-year life span.
I'll 'burn them up' in my pfSense as a retirement plan.
Btw : I'm using a completely stripped-down PC, consuming about 60 W.
Easy to maintain, and I have a pile of spare parts.

@keyser said in Should I be using Unbound Python mode? Is it stable?:

(iostat -x)

[2.5.2-RELEASE][admin@pfsense.my-network.net]/root: iostat -x
                        extended device statistics
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b
md0            0       0      0.0      0.0     0     0     0     0    0   0
ada0           0       9      0.2    205.9     5     1     0     1    0   0
cd0            0       0      0.0      0.0     0     0     0     0    0   0
pass0          0       0      0.0      0.0     3     0     6     4    0   0
pass1          0       0      0.0      0.0     0     0     0     0    0   0

tomashk

Do you use zfs? Maybe it is somehow related to this topic. Or maybe it is something that you can see while having both pfblocker and zfs? I haven't checked it so I'm just guessing.

keyser

@gertjan said in Should I be using Unbound Python mode? Is it stable?:

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Are you using RAMdisk for /tmp and /var in your config?

No way.
Just classic hard disks - the ones I removed from desktop devices after a 4-year life span.
I'll 'burn them up' in my pfSense as a retirement plan.
Btw : I'm using a completely stripped-down PC, consuming about 60 W.
Easy to maintain, and I have a pile of spare parts.

@keyser said in Should I be using Unbound Python mode? Is it stable?:

(iostat -x)
[2.5.2-RELEASE][admin@pfsense.my-network.net]/root: iostat -x
                        extended device statistics
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b
md0            0       0      0.0      0.0     0     0     0     0    0   0
ada0           0       9      0.2    205.9     5     1     0     1    0   0
cd0            0       0      0.0      0.0     0     0     0     0    0   0
pass0          0       0      0.0      0.0     3     0     6     4    0   0
pass1          0       0      0.0      0.0     0     0     0     0    0   0

Thanks for showing that. I see your box averages 200 KB/s of writes, or about 5.5TB/year, so if you were on an 8GB eMMC, you too would suffer a dead card after a couple of years.
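
To pull that write rate out of the output without reading it by eye, the `kw/s` column can be picked by its header name. A minimal sketch with a hypothetical `kw_per_sec` helper; it assumes the header line starts with `device`, as in the output quoted above. Keep in mind that the first `iostat` report is an average since boot, not a live reading.

```shell
# Read `iostat -x` output on stdin and print the kw/s value for one device.
kw_per_sec() {
    awk -v dev="$1" '
        # Locate the kw/s column from the header line.
        $1 == "device" { for (i = 1; i <= NF; i++) if ($i == "kw/s") col = i }
        # Print that column for the requested device.
        $1 == dev && col { print $col }
    '
}

# Usage: iostat -x | kw_per_sec ada0
```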

Unless you have something else doing some “heavy” logging, you seem to be suffering the issue as well. Can you confirm that Unbound is NOT your top sustained writer in “top -m io”? You have stated before that you are not having the issue.

PS: 60W sustained…. OUCH…. I’m doing 4.8W on my SG-2100. You’re obviously not living in Denmark - with our electricity prices that would be $200/year in running costs alone.

keyser

@tomashk said in Should I be using Unbound Python mode? Is it stable?:

Do you use zfs? Maybe it is somehow related to this topic. Or maybe it is something that you can see while having both pfblocker and zfs? I haven't checked it so I'm just guessing.

Thanks, but that has already been covered earlier. ZFS is not in use here.

Gertjan

@keyser said in Should I be using Unbound Python mode? Is it stable?:

so if you were on an 8GB eMMC, you too would suffer a dead card

That's why I mentioned that I use classic spinning-platter disks, the ancient ones.
No "fancy" SSD or similar drives for my pfSense application.
I can do so because I have the space for them = big box solution, also called "old PC".
( And the electricity bill isn't an issue either. )

As I showed above, the log files are purged, so I don't have any space issues. The pfBlockerNG log files do grow, but are pruned by the daily pfBlockerNG cron task.

@keyser said in Should I be using Unbound Python mode? Is it stable?:

Unless you have something else doing some “heavy” logging

I do.
pfSense logs (not pfBlockerNG) are syslogged to a local NAS.

@keyser said in Should I be using Unbound Python mode? Is it stable?:

“top -m io”

and then 'm' :

Once in a while unbound is at the top of the chart.

@keyser said in Should I be using Unbound Python mode? Is it stable?:

60w sustained…. OUCH….

We had a choice.
Plan A : giving away our solar over production "to the community".
Plan B : don't throw away our old equipment, and give it a second life as a firewall.

Plan B won.

Btw : I'm talking about pfSense @work. @home it runs out of a VM, so when I'm not there == 0 Watt.

azdeltawye

Update:

Last night I changed my pfB config from Unbound Python mode back to Unbound and immediately saw my disk usage (ufs) drop from 76% to 46%. This would appear to confirm that pfB Python mode was responsible for the excessive disk usage and that it is not ready for prime time yet... I don't have the time to be a Beta tester right now so I will have to wait this one out.

Unfortunately, with pfB Unbound mode, I'm back to no logging on the DNSBL count: 0 log entries logged since changing back to Unbound mode (Firewall/pfBlockerNG/Alerts/DNSBL Block). I started a thread about this a few months ago:
https://forum.netgate.com/topic/164252/pfblockerng-devel-dnsbl-not-working-after-21-05-upgrade

Hopefully in the next few months we will see a new version of pfBlockerng-devel that will address these issues.

keyser

@azdeltawye said in Should I be using Unbound Python mode? Is it stable?:

Update:

Last night I changed my pfB config from Unbound Python mode back to Unbound and immediately saw my disk usage (ufs) drop from 76% to 46%. This would appear to confirm that pfB Python mode was responsible for the excessive disk usage and that it is not ready for prime time yet... I don't have the time to be a Beta tester right now so I will have to wait this one out.

Unfortunately, with pfB Unbound mode, I'm back to no logging on the DNSBL count: 0 log entries logged since changing back to Unbound mode (Firewall/pfBlockerNG/Alerts/DNSBL Block). I started a thread about this a few months ago:
https://forum.netgate.com/topic/164252/pfblockerng-devel-dnsbl-not-working-after-21-05-upgrade

Hopefully in the next few months we will see a new version of pfBlockerng-devel that will address these issues.

Yeah, that issue certainly needs a fix - as does the sustained temporary writing to disk.
It would be such a shame if pfBlockerNG got a reputation for killing the small SG-xxxx boxes with very small eMMCs and SSDs because of wear.

I hope @BBcan177 will return at some point and have a look at this thread. I’d be happy to help/provide additional data in identifying the source of the problem.

Gertjan

@azdeltawye

The reason why "Python mode" was invented is spread out all over the forum, but the basic explanation is : people want more details, faster details, and more DNSBL control.
Also : faster unbound restarts.

The classic unbound log capabilities are not enough.
Even if you crank this one up to level 5 :

The needed info wouldn't be there.

But, just for fun, do it for yourself, activate level 5 unbound logging.
And keep an eye on this file /var/log/resolver.log
It will EAT your disk space - and as said, the needed info still isn't there.

Also : log file parsing is slow. The python mode uses 'precompiled' code and is thus much faster.
Another major advantage is that unbound doesn't need to load these megabyte-size DNSBL feeds at restart. The Python script parses these files now.

The bottom line is : want DNSBL ? Throw resources at it. Or trim it down (use fewer feeds). And if you have little headroom, keep watching for that ceiling.