DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs

Klaus2314

@Klaus2314 Sorry, I just realised that no matter what directory I try to change to using "cd" in the command line window of the pfsense gui it will always stay in the root folder. Odd...

jrey

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

in the command line window of the pfsense gui

You need to SSH in to run this -- sorry, I didn't specifically mention that.

Klaus2314

@jrey Yeah, no worries. Sorry for my ignorance. Got this

and

Gertjan

@Klaus2314

That's not "/var/unbound/" as you've listed many files and folders that don't belong there.

edit : ok, you've figured that out.

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

watchdog active for unbound

Ah, ok, you have the 'create random issues on my system' tool installed.
My advice : don't use it.
It's a developer tool.
You are not developing unbound, you are using it. When 'pfSense' decides that it has to restart unbound because pfBlocker (for example) was reloaded with a new DNSBL, then let pfSense restart unbound.

On my pfSense, unbound can run for days, weeks or longer. It is restarted at least once a week because my pfBlocker reloads DNSBL (with updates, if any), and as a consequence unbound is restarted.

Your situation : you use unbound and pfBlockerNG DNSBL, so unbound can get restarted more often.
Because you use pfBlockerNG DNSBL and probably not the python mode ( is that so ? ), the restart can take a long time. One, two or more minutes to start again has been seen before, depending on the number of DNSBL entries.
Restarting it at that moment, while it was already restarting, will create ... well, it won't do any good, that's for sure.
That's the scenario you've created : the watchdog package checks every minute, for every selected service, whether it is running. If it isn't, the watchdog will start it.
But what if it was already in the startup phase ? ( nope, the watchdog package isn't smart ).
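
To make that race concrete, here is a minimal sketch of the kind of decision a watchdog would have to make to avoid it. This is purely illustrative and is not how the Service Watchdog package works; the function name `should_start` and the whole grace-period idea are my own assumptions.

```shell
# Hypothetical decision helper, NOT part of pfSense or the watchdog package.
# $1 = 1 if the service process is currently running, 0 otherwise
# $2 = seconds since a restart of the service was last requested
# $3 = grace period in seconds to allow for a slow startup
should_start() {
    running=$1
    since_restart=$2
    grace=$3
    if [ "$running" -eq 1 ]; then
        echo no        # service is up, nothing to do
        return
    fi
    if [ "$since_restart" -lt "$grace" ]; then
        echo wait      # probably still loading DNSBL, leave it alone
        return
    fi
    echo yes           # down for longer than the grace period
}

should_start 0 30 180   # prints "wait"
```

A check that only looks at "is it running right now" has no "wait" branch, which is exactly the problem when unbound needs a minute or two to come back.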

That's why python mode was invented : unbound restarts way faster = no more, or far smaller, DNS outages.
For myself : that's why I sync my DNSBL every week, not xx times a day.
Result - for me - rock solid 'DNS'.

( and no DHCP DNS registration, as that option also restarts unbound on every incoming lease (renewal) )

edit :

Try this :
In the GUI :
stop unbound
stop pfblocker

Then : on the console or SSH :
delete these two files.

Now, in the GUI, start unbound and start pfBlockerNG.

"PHP" cannot create a 10 Mbytes file xx times per second. That's just impossible.

edit 2 :

68 Mbytes ...
If you don't use the python mode, your DNS is worthless as unbound can't read this file 'in a snap' during startup. It will take some serious time to do so.
Consider making this file waaay smaller == way less DNSBL.

I've a 4100, and, true, it has an Intel processor, but it's a small one.
4 Mbytes is already a noticeable load on the unbound startup.

jrey

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

I have watchdog active for unbound could that interfere with the pfblocker in terms of timing?

Can watchdog interfere? Yes, but only if unbound is failing and/or in the process of doing a "normal" restart.
Keep in mind that it is generally recommended that watchdog is not running. Depending on your configuration, pfblocker will restart unbound when it needs to (for example, when a DNSBL list has changed).
Personally I don't even have watchdog installed, and as the DNSBL lists I use only change every 3 or 4 days, even though I check daily for updates, unbound only restarts when an updated list is available. The restart only takes a few seconds here, you'd never notice.

But there is the problem, the file is rather large, not that size is an issue for a DB unless you are out of disk space.
"Error: database disk image is malformed"

in this case you could simply delete the two files when you are in that directory

cd /var/unbound
rm pfb_py_resolver.sqlite
rm pfb_py_resolver.sqlite.invalid

alternative is you could just move the .sqlite version to a backup and the system will create a new one -- once that happens, you could then delete the backup and invalid versions. to do that you would

cd /var/unbound
mv pfb_py_resolver.sqlite pfb_py_resolver.sqlite.bak
(wait for the new file to be created; check with ls -l until it appears, it won't take long, then)
rm pfb_py_resolver.sqlite.bak
rm pfb_py_resolver.sqlite.invalid
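
If you'd rather not keep re-running ls -l by hand, the "wait until the new file appears" step can be sketched as a small polling loop. The helper name `wait_for_file` and the timeout values are my own choices, nothing pfSense-specific.

```shell
# Wait until a file exists, polling once per second.
# $1 = path to wait for
# $2 = maximum number of seconds to wait (defaults to 60)
# Returns 0 as soon as the file appears, 1 on timeout.
wait_for_file() {
    path=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Example, run from /var/unbound after the mv step:
# wait_for_file pfb_py_resolver.sqlite 120 \
#     && rm pfb_py_resolver.sqlite.bak pfb_py_resolver.sqlite.invalid
```

The rm only runs if the new database actually showed up, so a timeout leaves the backup in place.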

Did you manually copy or mv the file to .invalid? That would not be there normally.
Based on the timestamps, it looks like the system would be trying to write to the one dated Sep 3, but the .invalid copy was created today, Sep 18 -- that may be when the error causing the corruption started, at 14:25 (check the logs).
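
If you want to confirm the "malformed" diagnosis before deleting anything, SQLite has a built-in check. The sketch below assumes the sqlite3 command-line tool is available in your shell; `check_db` is just a helper name I made up.

```shell
# Print "ok" if SQLite's integrity check passes, "bad" for any other
# result, and "missing" if the file does not exist.
# Assumes the sqlite3 command-line tool is installed and in PATH.
check_db() {
    if [ ! -f "$1" ]; then
        echo missing
        return
    fi
    # "|| true" keeps going when sqlite3 exits non-zero on a damaged file
    result=$(sqlite3 "$1" "PRAGMA integrity_check;" 2>&1) || true
    if [ "$result" = "ok" ]; then
        echo ok
    else
        echo bad
    fi
}

# Example (path taken from the commands above):
# check_db /var/unbound/pfb_py_resolver.sqlite
```

Anything other than a plain "ok" is treated as bad here, which also covers the case where the database is busy, so it is best run while unbound and pfblocker are stopped.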

Klaus2314

@jrey Thanks a lot for your advice! I'll dig into this. I have watchdog on because unbound (and sometimes kea dhcp) stopped working (I suspect after pfblocker updating at night).
I'll try again without.

Gertjan

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

stopped working

I've edited my post above with some observations.
You will find some very possible reasons why.

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

and sometimes kea dhcp

Be gentle to yourself : switch back to ISC for the moment.

Klaus2314

@Gertjan ok, thanks again. I can't delete the files. permission denied.
Doesn't sudo work on BSD? It says unknown command.

Again, sorry for my novice ignorance :-D

jrey

@Gertjan said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

It will take some serious time to do so.

You'd think that with default settings a zone file that big would have the system throwing memory errors. That's a lot of zones. Of course there is a limit on how many it loads, and it ignores the rest. Either way, though, the list is too big.

@Gertjan said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

switch back to ISC for the moment.

Also true.

jrey

@Klaus2314

does your ssh prompt say you are root or admin?

if you ssh in with a "normal" user account you won't have permission.

ssh bob@... logs in, but the delete won't work (bob doesn't have permission)
ssh admin@... or ssh root@... logs in, and the delete will work.
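
A quick way to see which case you are in, once logged in, is to look at the numeric user ID. This is a generic Unix sketch, assuming (as on most Unix-like systems) that UID 0 is what carries the permission; `is_superuser` is a name I picked for illustration.

```shell
# Succeeds only when the given numeric UID is 0 (superuser).
is_superuser() {
    [ "$1" -eq 0 ]
}

# id -u prints the numeric user ID of the current session.
if is_superuser "$(id -u)"; then
    echo "running with root privileges"
else
    echo "not root: deleting files in /var/unbound will likely fail"
fi
```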

Gertjan

@jrey said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

default settings a zone

The default pfb_py_zone.txt file size is 'zero' as pfBlockerng has no DNSBL pre set or pre loaded.

Then the admin comes along, and selects a whole bunch of these :

and then .... major issues happen. The system can't cope anymore.

Btw : unbound doesn't just 'allocate' memory, load the file, and store it in that memory chunk.
Nope, it reads the file, interprets every line, and starts to fill structures 'per line', making each line part of a big array of DNSBL entries to be matched on every host name request ...
This is done using ARM or Intel instructions, true, but even so : it's very time consuming.
How long can be seen here :

so with 119414 entries, it took 15:57-15:22 = already 35 seconds.
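
For anyone who wants to do the same subtraction on their own log, here is a small sketch. It assumes the two stamps are the minute:second part of log timestamps within the same hour; `to_seconds` and `elapsed` are helper names of my own.

```shell
# Convert a "MM:SS" string to seconds.
# A leading zero is stripped so that 08 and 09 are not read as octal.
to_seconds() {
    m=${1%%:*}; m=${m#0}; m=${m:-0}
    s=${1##*:}; s=${s#0}; s=${s:-0}
    echo $(( m * 60 + s ))
}

# Elapsed seconds between two "MM:SS" stamps within the same hour.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 15:22 15:57   # prints 35
```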

Oh, I get it : @Klaus2314 thought that unbound was 'broken' ....
Klaus2314 : check for yourself, ask for a pfBlockerNG reload here and see what the log tells you :

and hit the blue run button.

Klaus2314

@Gertjan OK, I have now managed to delete those files, de-installed watchdog and followed your other advice. I will see how it goes now and report back.
Thanks again for your patience!

jrey

@Gertjan said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

so with 119414 entries, it took 15:57-15:22 = already 35 seconds.

What version are you running? And this is on a 4100, I think you said?

Below is from a crap little test box. My production box, a 2100, has much the same timing but a larger "list" than shown here.

 Original    Matches    Removed    Final     
 ----------------------------------------
 166351      61104      67425      98926     
 -----------------------------------------
TLD finalize... completed [ 09/15/24 04:15:10 ]

Saving DNSBL statistics... completed
Reloading Unbound Resolver (DNSBL python)
Stopping Unbound Resolver.
Unbound stopped in 2 sec.
Additional mounts (DNSBL python):
  No changes required.
Starting Unbound Resolver... completed [ 09/15/24 04:15:12 ]
DNSBL update [ 98926 | PASSED  ]... completed

Certainly no 35 seconds..
Cache you say ?

I've never really seen a benefit to "Resolver cache restored" that takes a lot of time.
Restoring the cache and the number of entries in the zone are generally not related.

Cache works just fine... and any 2 hour sample here will run cache from about 35% to 60% cache hit depending who's doing what. The past 2 hours looks kind of like this.

Screen Shot 2024-09-18 at 9.50.49 AM.png
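
If you want a number rather than a graph, the hit rate can be worked out from unbound's own counters. A rough sketch, assuming the statistics arrive as key=value lines with the total.num.cachehits and total.num.cachemiss counters; the helper name `cache_hit_pct` and the sample numbers are made up.

```shell
# Read "key=value" statistics on stdin and print the cache hit percentage.
# Uses the total.num.cachehits and total.num.cachemiss counters.
cache_hit_pct() {
    awk -F= '
        $1 == "total.num.cachehits" { hits = $2 }
        $1 == "total.num.cachemiss" { miss = $2 }
        END {
            if (hits + miss == 0) { print "n/a"; exit }
            printf "%.1f\n", 100 * hits / (hits + miss)
        }'
}

# Example with made-up numbers (not taken from this thread):
printf 'total.num.queries=1000\ntotal.num.cachehits=450\ntotal.num.cachemiss=550\n' | cache_hit_pct

# On a live system the input would come from something like:
# unbound-control -c /var/unbound/unbound.conf stats_noreset | cache_hit_pct
```

The config path in the last comment is the usual location on pfSense as far as I know; adjust it if yours differs.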

Gertjan

@jrey said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

Certainly no 35 seconds..

My pfb_py_zone.txt has 119 414 entries.
Yours has 98 926 entries, somewhat less.

before, you showed this :

That's waaaay bigger. Like 68 Mbytes big.

Your pfSense probably has a bit more 'horse power' than mine.

@jrey said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

I've never really seen a benefit to "Resolver cache restored" that takes a lot of time.

The unbound cache - there are command line commands to dump it - isn't that big.
It's just useful to load it back in again. It's a cache after all.

My cache : same info, shown differently :

gray = served out of cache.
Other colors : resolving was needed.

jrey

@Gertjan

You're confused, that file size you are quoting back is from the OP not me.

I was asking about your 35 second time from your system compared to my unbound and that section of the log where you show and calculate the 35 seconds.
a) yes on my little test box the "list" is a little smaller (process time 2 seconds) I said that.
b) I also said that on my 2100 the list is a little larger than yours, but the timing of the unbound restart in that section of the log is similar to my little test box: ~ 2-4 seconds

The only difference I can see is that you are restoring the unbound cache, that is where the time difference appears to be from. And I went on to say that I've never seen a benefit to doing that and my cache hit rate is just fine.

All I've said is I've never seen a benefit to restoring resolver cache...and therefore taking the time hit doing so. The cache works just fine here.

Gertjan

@jrey said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

You're confused,

Nope. The @Klaus2314 didn't make it into my post - I've edited it.

Klaus2314

Hi all,

Since I deleted the files and de-installed watchdog, no more errors have occurred in the last 24 hours (which included a few cron jobs by pfblocker), so things seem to be fine again.

Oh, and yes, I have been running the python mode before (and still am) on DNSBL.

thanks again for helping.