DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs

Klaus2314

Hi,

I couldn't find a thread about this but recently I've been getting hundreds of these errors logged.
I removed the package and re-installed but no change. I have re-booted the FW but no change.
The system has been running for years and has been regularly updated without issues. Also pf-blocker seems to be running fine otherwise nothing in the pfblocker error logs. These only appear in system logs.

Running
2.7.2-RELEASE (amd64)
pfblocker ng 3.2.0_8

All recommended patches applied.
Any idea what could be causing this?

Thanks!

jrey

@Klaus2314

interesting.

you might want to check file permissions?

cd /var/unbound
ls -l

this is the database it is using
pfb_py_resolver.sqlite

what happens if you try to access the database / table directly do you get any errors?

/var/unbound: sqlite3 pfb_py_resolver.sqlite
SQLite version 3.43.1 2023-09-11 12:01:27
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE resolver (row integer, totalqueries integer, queries integer);
CREATE TABLE lastclear ( row INTEGER, lastipclear TEXT, lastdnsblclear TEXT );
sqlite> select * from lastclear;
0|May 22 11:29:48|May 22 11:55:07
sqlite> .exit

so in the above you are typing the following and checking results as above.

cd /var/unbound
sqlite3 pfb_py_resolver.sqlite
 .schema
select * from lastclear;
 .exit

Gertjan

@Klaus2314

pfBlocker needed to 'do something' with one of its SQL3 database files, and 5 times per second ( ! ) if found the file 'invalid' so recreated it.
Sure enough, the file recreated was also invalid.
And so on.

The database is called 'lastevent', and its in the "/var/unbound/pfb_py_resolver.sqlite" file.

[24.03-RELEASE][root@pfSense.bhf.tld]/root: sqlite3
SQLite version 3.44.0 2023-11-01 11:23:50
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open /var/unbound/pfb_py_resolver.sqlite
sqlite> .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE resolver (row integer, totalqueries integer, queries integer);
INSERT INTO resolver VALUES(0,345511,0);
CREATE TABLE lastclear ( row INTEGER, lastipclear TEXT, lastdnsblclear TEXT );
INSERT INTO lastclear VALUES(0,'Sep 15 00:00:00','Sep 15 00:00:00');
CREATE TABLE lastevent ( row INTEGER, groupname TEXT, entry TEXT, details TEXT );
INSERT INTO lastevent VALUES(0,'DNSBL_Regex','incoming.telemetry.mozilla.org192.168.1.6','incoming.telemetry.mozilla.org,192.168.1.6,-|PRI|HTTP/2.0|-,Python_HTTPS,DNSBL_Regex,incoming.telemetry.mozilla.org,test_rgx2');
COMMIT;
sqlite> .quit

Quiet small actually.

Anyway, you know now what it is.
The thing is, pfBlockerng, PHP actually, can't create this 20Kbytes file in the /var/unbound/pfb_py_resolver.sqlite folder.
That's ... not good.

I did discover that that file is update very often ... mine was regenerated every couple of seconds for sure.

So ... disk full ? Or the partition that hosts "/var/unbound/" is full ?
The drive/partition was mounted cleanly during the last reboot ?
Or does you pfblockerng has so much activity that the number of "last events" are arriving faster as that PHP can write out the disk ?

Klaus2314

@jrey said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

this is the database it is using
pfb_py_resolver.sqlite

Thank you!
forgot to mention: I have watchdog active for unbound could that interfere with the pfblocker in terms of timing?

Also: the drive is far from full. tons of space left.

This doesn't seem to exist in that folder:

Klaus2314

@Klaus2314 Sorry, I just realised that no matter what directory I try to change to using "cd" in the command line window of the pfsense gui it will always stay in the root folder. Odd...

jrey

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

in the command line window of the pfsense gui

you need to ssh in to run this -- sorry I didn't specifically mention that

Klaus2314

@jrey Yeah, no worries. sorry for my ignorance. Got this

and

Gertjan

@Klaus2314

That's not "/var/unbound/" as you've listed many files and folders that don't belong there.

edit : ok, you've figured that out.

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

watchdog active for unbound

Ah, ok, you've the 'create random issues on my system' tool installed.
My advise : don't use it.
It a developer tool.
You not developing unbound, you are using it. When 'pfSense' decides that it has to rstart unbound because pfBlocker (for example) was realoaded with new DNSBL, then let pfSense restart unbound.

On my pfSense, unbound can run for days, weeks or longer. It is restarted at lest ones a week because my pfBlocker reloads (updates, if any) DNSBL and a consequence is : unbound is restarted.

Your situation : you use unbound, pfBlockerbl so unbound can get restarted more often.
Because you use pfBlockerbl and probably not the python mode ( is that so ? ) the restart can take a long time. One, two or more minutes to start again has been seen before, deping on the number of DNSBL entries.
Restarting it at that moment, while it was already restarting, will create ... well, it won't do any good, that's for sure.
That's te scenario you've created : the watchdog package checks every minute, for every selected service, if they are running. If it does,'t it will start it.
But what if it was already in the startup phase ? ( noop, the watchdog package isn't smart ).

That's why python mode was invented : unbound restarts way faster = no more, or far smaller, DNS outages.
For myself : that's why I sync my DNSBL every week, not xx times a day.
Result - for me - rock solid 'DNS'.

( and no DHCP DNS registration, as that option also restarts unbound on every incomonig lease (renewal) )

edit :

Try this :
In the GUI :
stop unbound
stop pfblocker

Then : on the console or SSH :
delete these two files.

Now, in the GUI, start unbound and start pfblockerng.

"PHP" can not create a 10 Mbytes file xx time per second. That's just impossible.

edit 2 :

68 Mbytes ...
If you don't use the python mode, your DNS is worthless as unbound can't read this file 'in a snap' during startup. It will take some serious time to do so.
Consider making this file waaay smaller == way less DNSBL.

I've a 4100, and, true, it has an Intel processor, but it's a small one.
4 Mbytes is already a noticeable load on the unbound startup.

jrey

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

I have watchdog active for unbound could that interfere with the pfblocker in terms of timing?

can watchdog interfere - yes but only if unbound is failing and/or in the process of doing a "normal" restart.
Keep in mind the it is generally recommended that watchdog is not running. Depending on your configuration pfblocker will restart unbound when it needs to (for example, a DNSBL list has changed).
Personally I don't even have watchdog installed and as the DNSBL iists I use only change every 3 or 4 days even though I check daily for updates, unbound only restarts when an updated lists is available. The restart only takes a few seconds here, you'd never notice.

But there is the problem, the file is rather large, not that size is an issue for a DB unless you are out of disk space.
"Error: database disk image is malformed"

in this case you could simply delete the two files when you are in that directory

cd /var/unbound
rm pfb_py_resolver.sqlite
rm pfb_py_resolver.sqlite.invalid

alternative is you could just move the .sqlite version to a backup and the system will create a new one -- once that happens, you could then delete the backup and invalid versions. to do that you would

cd /var/unbound
mv pfb_py_resolver.sqlite pfb_py_resolver.sqlite.bak
(wait for the new file to be created check with ls -l until it appears, won't take long, then)
rm pfb_py_resolver.sqlite.bak
rm pfb_py_resolver.sqlite.invalid

Did you manually copy or mv the file to .invalid - that would not be there normally.
looks like based on the time stamps the system would be trying to write to the one dated Sep 3, but that today Sep 18 the invalid copy was created -- that may be when the error causing the corruption started? 14:25 (check logs)

Klaus2314

@jrey Thank's a lot for your advice! I'll dig into this. I have watchdog on because unbound (and somertimes kea dhcp) stopped working (I suspect after pfblocker updatding at night).
I'll try again without.

Gertjan

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

stopped working

I've edited my post above with some observations.
You will find some very possible reasons why.

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

and somertimes kea dhcp

Be gentile to yourself : switch back to ISC for the moment.

Klaus2314

@Gertjan ok, thanks again. I can't delete the files. permission denied.
Doesn't sudo work on BSD? It says unknown command.

Again, sorry for my novice ignorance :-D

jrey

@Gertjan said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

It will take some serious time to do so.

you'd think that will default settings a zone file that big the system would be throwing memory errors. That's a lot of zones. of course there is a limit as to how many it loads and ignores the rest. Either way though the list is too big.

@Gertjan said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

switch back to ISC for the moment.

Also true.

jrey

@Klaus2314

does your ssh prompt say you are root or admin?

if you ssh in with a "normal" user account you won't have permission.

ssh bob@... logs in won't work (bob doesn't have permission)
ssh admin@... or ssh root@... logs in will work.

Gertjan

@jrey said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

default settings a zone

The default pfb_py_zone.txt file size is 'zero' as pfBlockerng has no DNSBL pre set or pre loaded.

Then the admin comes along, and selects a whole bunch of these :

and then .... major issues happens. The system can't cope anymore.

Btw : unbound doesn't only 'allocate' memory to load the file, and stores it into that memort check.
Noop, it reads the fiie, interpretes every line, and starts to fill structures 'per line', making each line part of a big array with DNSBL to be matched on every host name request ...
This is done using arm or intel instrcution, true, but even so : it's very time consumming.
How long can be seen here :

so with 119414 entries, it took 15:57-15:22 = already 35 seconds.

Oh, I get it : @Klaus2314 thought that unbound was 'broken' ....
Klaus2314 : check for yourself, ask for apfBlockerng reload here and see what the log tells you :

and hit the blue run button.

Klaus2314

@Gertjan OK, I have now managed to delete those files, de-installed watchdog and followed your other advice. I will see how it goes now and report back.
Thanks again for your patience!

jrey

@Gertjan said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

so with 119414 entries, it took 15:57-15:22 = already 35 seconds.

What version are you running?, and this is on a 4100 I think you said?

Below is from a crap little test box (my production box a 2100) and has much the same with regards to timing, but has a larger "list" than shown here

 Original    Matches    Removed    Final     
 ----------------------------------------
 166351      61104      67425      98926     
 -----------------------------------------
TLD finalize... completed [ 09/15/24 04:15:10 ]

Saving DNSBL statistics... completed
Reloading Unbound Resolver (DNSBL python)
Stopping Unbound Resolver.
Unbound stopped in 2 sec.
Additional mounts (DNSBL python):
  No changes required.
Starting Unbound Resolver... completed [ 09/15/24 04:15:12 ]
DNSBL update [ 98926 | PASSED  ]... completed

Certainly no 35 seconds..
Cache you say ?

I've never really seen a benefit to "Resolver cache restored" that takes a lot of time.
restoring cache, number of entries in zone are generally not related.

Cache works just fine... and any 2 hour sample here will run cache from about 35% to 60% cache hit depending who's doing what. The past 2 hours looks kind of like this.

Screen Shot 2024-09-18 at 9.50.49 AM.png

Gertjan

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

Certainly no 35 seconds..

My pfb_py_zone.txt has 119 414 entries.
Yours has 98 926 entries, somewhat less.

before, you showed this :

That's waaaay bigger. Like 68 Mbytes big.

Your pfSEnse has probably bit more 'hose power' as mine.

@Klaus2314 said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

I've never really seen a benefit to "Resolver cache restored" that takes a lot of time.

The unbound cache - there are command line commands to dump it - isn't that big.
Its just useful to load it back in again. It's a cache after all.

My cache : same info, shown differently :

gray = served out of cache.
Other colors : resolving was needed.

jrey

@Gertjan

You're confused, that file size you are quoting back is from the OP not me.

I was asking about your 35 second time from your system compared to my unbound and that section of the log where you show and calculate the 35 seconds.
a) yes on my little test box the "list" is a little smaller (process time 2 seconds) I said that.
b) I also said on my 2100 the list is a little larger than yours but the the timing of the unbound restart in that section of the log is similar to my little test box ~ 2-4 seconds

The only difference I can see is that you are restoring the unbound cache, that is where the time difference appears to be from. And I went on to say that I've never seen a benefit to doing that and my cache hit rate is just fine.

All I've said is I've never seen a benefit to restoring resolver cache...and therefore taking the time hit doing so. The cache works just fine here.

Gertjan

@jrey said in DNSBL SQLite3 database [ lastevent ] corrupt messages in system logs:

You're confused,

Noop. The @Klaus2314 didn't made into my post - I've edited.