SquidGuard category filtering silently fails with large blacklist - a workaround
-
@mikeinnyc Hi Mike, great post. How many resources are being consumed with this configuration?
-
Here is a better example of Jails in use: each one runs within the host machine. Maybe attacking the problem this way is a solution?
You can even change the IP addresses of the Jails while they are live, so you could also set them to one LAN-based IP address or another.
-
When I worked on Wall Street for a really fast three decades, before the days of IT lockdowns, we had dial-up 56k, T1s, and the most expensive DSL connections at $800 a month. We "may have" used gambling sites and others (see the movie The Wolf of Wall Street!). I do not recall ever seeing porn on computer screens! Well, after a couple of industry lawsuits, here's how it rolled in every company afterward:
- DENY EVERYTHING - Guess what, we have this with pfSense by default!
- Beg Compliance to open website access - always denied. Thank gawd for cell phones :)
- Read the approved websites list, maybe one page in total - all business related.
WHY? Because LOG FILES must be kept for life on Wall Street. So it's possible that "Girls Gone Really Wild" may come back to haunt me! :) Probably would help me, hahaha.
Now, my recommendations - I have vast experience in getting sued as a CEO.
IF you have employees or people you can physically tap on the shoulder (1099), then use the above. Only allow approved outgoing websites. If they complain - they won't, because personal cell phone 5G is outside the scope and jurisdiction of record keeping unless it's business related. Besides, they screen-record and video-record everything. Using encrypted apps like Signal and DatChat for personal use is unstoppable.
If web hosting: deny all countries by default, except a whitelist block of IPs from the USA. You should probably use Cloudflare if you serve more than a few countries. So many bad actors come from certain countries.
Deny all ports, except a whitelist of ports. You can further add IPs to lock it down more.
Then protect those web hosting ports with rate-limiting stick tables (HAProxy) and other filters.
The point is that with web hosting you should rarely need outbound deny lists. Why? Because by default you deny all except this IP, that IP, and this port. Your production network traffic should not leak into private LANs, period.
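For reference, the HAProxy stick-table rate limiting mentioned above looks roughly like this (the frontend name, window, and threshold are only examples, not a recommended policy):

```
frontend web_in
    bind :443
    # Track each client IP's HTTP request rate over a 10-second window.
    stick-table type ip size 100k expire 10m store http_req_rate(10s)
    http-request track-sc0 src
    # Deny clients exceeding ~100 requests per 10s (example threshold).
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
```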
Buy enterprise hardware, not VMs. You will always have problems, so check the logs. -
Guys, all this talk about firewall rule policies, HAProxy, corporate policies, etc. has nothing to do with what this topic is about and is only serving to derail and dilute the thread.
Please try to stick to discussing the bug in the SquidGuard package, which allows a too-small ramdisk to overflow during extraction without any warnings or errors and then imports a corrupted database into squidGuard; that is what this thread is about. Thanks.
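For anyone wondering what the missing error checking would look like, here is a minimal sketch in sh: check that the uncompressed archive fits in the destination before extracting, and never ignore tar's exit status. The function and path handling are illustrative, not the actual package code.

```shell
#!/bin/sh
# Sketch of the error checking the import is missing
# (illustrative only, not the real squidGuard package code).

# Succeed only if the archive's uncompressed size fits in the
# free space at the destination.
fits_in_dest() {
    archive=$1; dest=$2
    # gzip -l, line 2: compressed, uncompressed, ratio, name
    need_kb=$(( $(gzip -l "$archive" | awk 'NR==2 {print $2}') / 1024 ))
    free_kb=$(df -k "$dest" | awk 'NR==2 {print $4}')
    [ "$need_kb" -lt "$free_kb" ]
}

# Extract and propagate tar's exit status instead of ignoring it:
# a partial extraction must never reach the squidGuard DB rebuild.
safe_extract() {
    archive=$1; dest=$2
    if ! fits_in_dest "$archive" "$dest"; then
        echo "ERROR: archive too large for $dest" >&2
        return 1
    fi
    if ! tar -xzf "$archive" -C "$dest"; then
        echo "ERROR: extraction failed, aborting import" >&2
        return 1
    fi
}
```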
-
It is about Business Impact Analysis and Mission Essential Functions.
The rabbit hole:
Yes, configure your system on an as-needed basis. Take into consideration many factors: data classifications, geographical considerations, data sovereignty, and organizational consequences. The serverless architecture of modern systems also starts to play a key role when using something like Azure as a domain. Moreover, software-defined networking plays more of a role than most end users know about today, with the use of hyper-convergence. "User-facing" problems are far different from confidentiality and data integrity problems, and they require different considerations within risk mitigation.
It is not only machine virtualization that needs consideration, but container virtualization and full application virtualization. They take different roles within risk mitigation, as they can perform data marshalling over the NIC cards easily. Look at how many cloud service models there are today: Infrastructure as a Service, Software as a Service, Platform as a Service, Anything as a Service, and even Security as a Service. The toxic idea of just avoiding or ignoring virtualization no longer applies; risk mitigation plans have to include virtualization today. All needs must be taken into consideration when implementing authorization solutions. It's Discretionary and Role-Based Access Control. Windows 11 helped solve some of the issues with virtualization risks on an end-user platform, as some issues were occurring inside of Windows 10.
Access Control Lists:
I have also noticed that some websites, if you simply block the IP address inside the access control lists, can still be accessed over HTTPS or HTTP; that is why I am using SquidGuard. It checks the HTTP/HTTPS GET requests, but it needs a blacklist to function, manual or downloadable.
One of the reasons I am studying software and computer science is to help find a really good solution, and it seems that as soon as we get a good one working, some prototype protocol evolves that needs to be accounted for and that is not following the rules or compliance of the Internet Assigned Numbers Authority. Why have rules like the ones from IANA if there is no compliance? Now in comes the need for something like internet backbone compliance servers, or cards installed right on a Ciena system, that can track and block prototype protocol abuses. Let's agree on one thing: the wild west days before GDPR and CCPA are gone forever.
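For context, the blacklist SquidGuard needs is just a set of category files wired into squidGuard.conf; a minimal sketch, where the paths, the category name, and the redirect URL are examples only:

```
# squidGuard.conf sketch (paths and names are examples)
dbhome /var/db/squidGuard
logdir /var/log/squidGuard

# Each category points at plain-text domain/URL lists, which
# squidGuard compiles into .db files when the config is applied.
dest ads {
    domainlist ads/domains
    urllist    ads/urls
}

acl {
    default {
        pass !ads all
        redirect http://192.168.1.1/blocked.html
    }
}
```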
-
This post is deleted! -
@dbmandrake Please take the time to check my hypothetical solution with SquidGuard lists above; let me know what you think.
-
@jonathanlee I know this sounds off-topic but it's dead on.
Don't block the world and call it a bug :)
This is a hardware limitation issue. Try adding more memory first and then add many IP aliases. The real question is network design.
By Default pfSense Blocks everything. How can we add more whitelists?
Do you really want employees accessing everything or is this "Home use only?"
Sorry to be abrupt, but this solves your problem: better hardware and more RAM. -
@jonathanlee Re: Containers / Jails.
This seems like a massive degree of overkill; what problem do they solve, exactly?
The reason the ramdisk exists at all is for small devices with limited storage, where the temporary disk space needed to extract the plain ASCII version of the blocklists (around 300MB for the blocklist I'm using) would cause the device to run out of disk space.
I'm just not seeing how a container solves the problem of lack of disk space.
The ramdisk also speeds up the extraction and importing process, since what are essentially temp files don't have to be written out to disk.
However, on a server with a decent-sized SSD there isn't really any advantage to using the ramdisk apart from a slight speed increase. The disadvantage is that it can fail with larger blocklists, and due to inadequate error checking the failure is not detected: the incomplete blocklist is imported into squidGuard without complaint, which then silently breaks your filter categories. This is a big problem in an environment like a school, which has a duty of care not to allow pupils to access certain kinds of websites.
If I do write a patch to add a ramdisk enable/disable preference option I will also write a patch to fix the error checking so that a failure due to exceeding the ramdisk size (when enabled) is reported to the user and the incomplete blocklists do not overwrite the currently active ones.
I would like to do this; it's just a matter of finding the time to work on it, as I'm bogged down with too many things at the moment.
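Roughly what I have in mind for the "don't overwrite the currently active lists on failure" half, as a sketch only (function names and path handling are made up, not the actual patch): stage the extraction in a temp directory and swap it in only on success.

```shell
#!/bin/sh
# Staged blocklist import sketch (hypothetical, not the real patch).

staged_import() {
    archive=$1; active=$2
    stage=$(mktemp -d) || return 1
    if ! tar -xzf "$archive" -C "$stage"; then
        # Failed or partial extraction: the live blocklists stay untouched.
        rm -rf "$stage"
        echo "ERROR: extraction failed, keeping current blocklists" >&2
        return 1
    fi
    # Fully extracted: rotate the old lists out and the new ones in.
    rm -rf "${active}.old"
    if [ -d "$active" ]; then
        mv "$active" "${active}.old"
    fi
    mv "$stage" "$active"
}
```

A failed update then leaves the previous lists in place, and the last good set survives in `.old` as a fallback.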
-
@mikeinnyc Sorry, but your comments are way off the mark - "better hardware and more RAM" doesn't solve anything. You clearly haven't read and/or understood the original post and grasped the issue with the fixed-size ramdisk which is currently part of the blocklist import process.
The hardware is absolutely capable of working with a blocklist of the size in question without even breaking a sweat - once the limitation of the small, fixed-size ramdisk is removed, that is.
-
@dbmandrake I'm thinking that post was some type of spam. I could be wrong.
-
@dbmandrake I actually forgot about the speeds of SSD drives today. The hypothetical solution I hoped would also help solve the issue of downtime when updating blacklists: on my firewall everything goes offline during blacklist updates, and the firewall can't use the full blacklist because of the same issue you described and solved. My system is the MAX, so it has an extra 30GB SSD on it. Additionally, it could protect blacklist uptime if a bad blacklist update corrupted something; the firewall could fall back to the other container if that issue ever occurred. Kind of like HAProxy, just for blacklists: primary and secondary. High availability.
Thanks for looking at that post; I just wanted some input on it with SquidGuard, alongside more visibility on FreeBSD Jails and possibly retooling them for something else.
-
@jonathanlee said in SquidGuard category filtering silently fails with large blacklist - a workaround:
@dbmandrake I actually forgot about the speeds of SSD drives today, the hypothetical solution I hoped would also help solve the issue with downtime when updating blacklists, on my firewall everything goes offline during blacklist updates, and cant use the full blacklist because of the issue you described.
I've been using the full-size blacklist since before I started this thread without issue - with the patch to disable the ramdisk. No issues have cropped up yet; in fact, the firewall hasn't been rebooted since before this thread was started. I actually have a second firewall running this patch as well, as I've had to temporarily set up a second proxy server for a slightly different use case.
Regarding going offline during the update, I'd have to check but as far as I know Squid doesn't go offline during the extraction of the tar file - which is the longest part of the process.
I think it's only offline for a few seconds at the end of the import process for the same amount of time as if you'd pressed the Apply button in the squidguard config page, which forces squidguard to re-read the on disk version of the blacklist binary database into memory.
But I should run a test to time how long the proxy is out of action. I have mine scheduled to do the blacklist update automatically at 2am anyway so if the proxy is down for a few seconds at 2am nobody cares.
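If I do run that test, a rough way to measure the outage would be to probe the proxy port once a second during the update window and count the failures. A sketch, where the nc probe, host, and port are examples for a typical setup:

```shell
#!/bin/sh
# Count how many seconds out of a window a probe command fails,
# e.g. to measure proxy downtime during a blacklist update.

probe_downtime() {
    probe=$1; seconds=$2
    down=0; i=0
    while [ "$i" -lt "$seconds" ]; do
        # One probe attempt per second; count each failure as 1s down.
        $probe >/dev/null 2>&1 || down=$((down + 1))
        i=$((i + 1))
        sleep 1
    done
    echo "$down"
}

# Example (host/port are assumptions):
#   probe_downtime "nc -z -w 1 192.168.1.1 3128" 600
# reports roughly how many seconds of a 10-minute window the
# proxy port refused connections.
```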
Not sure what you mean when you say "everything" goes offline on your firewall when the blacklist updates - only the proxy (and transparent proxy) will be affected, all other traffic is unaffected.
-
@dbmandrake everything on my network is pointed at the proxy, plus I run a WPAD. What I mean is that when SquidGuard updates the blacklist, the proxy starts to update and users have no internet access until it restores. I am running a Netgate SG2100-MAX; it only has 4GB of RAM. It takes a bit longer for me, around 5 minutes; long enough that it will stop a streaming movie. I need to set it to update during the AM too. Again, I am running 6-meg DSL.
-
@jonathanlee For automatic scheduled updates, see my post in another thread.
-
@dbmandrake thanks for the information on the auto update.