Netgate Discussion Forum

    Can pfsense and/or snort help prevent automated content theft/site scraping?

    • redgoblet

      Client of mine operates an e-comm site with 150,000 product SKUs. They've invested significantly in product descriptions, images and technical content.
      Automated site scraping is becoming a problem. It's easy enough to block obvious scrapers with firewall rules (IP blocks), but it requires constant manual oversight.

      Has anyone ever used pfsense and/or snort to combat this? The trick is distinguishing between good bots (Google, etc) and bad bots.

      thanks

      • firewalluser

        I'm not aware of any way pfSense/Snort can do this on their own, as they would need to know your decision-making process in order to categorise which searching/web scraping is acceptable and which is not.

        I know from the search engines and the webservers I have built for my customers that the webserver tracks all IP addresses, and suspect ones that appear to be looking up more than normal* amounts of product info can be blocked. However, this doesn't work for scrapers that gather data in batches from various IP addresses over time: they spot the threshold, come back from a different IP address, and request amounts of data that won't trigger it.

        Bottom line is you can't stop all web scraping/searching except the most blatant, i.e. software that systematically crawls the website, exploring every link and product, with no intelligence built into it, unlike an AI search engine.

        *Normal is defined by previous customer enquiries and orders. In other words, some products are enquired about/ordered together with certain other parts more often than others, and this knowledge is used to set the threshold automatically, so the threshold differs depending on what is being looked up.
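
        As a rough sketch of that kind of per-product threshold check (the SKUs, limits and function names below are purely illustrative, and the periodic reset of the counters is left out):

        from collections import defaultdict

        # Hypothetical per-product "normal" lookup limits per time window,
        # derived offline from historical enquiry/order data; anything not
        # listed falls back to a default.
        PRODUCT_LIMITS = {"SKU-1001": 30, "SKU-2002": 5}
        DEFAULT_LIMIT = 10

        # counts[(ip, sku)] -> lookups seen from that IP in the current window
        counts = defaultdict(int)

        def record_lookup(ip, sku):
            """Count a product lookup; return True if this IP now looks suspect."""
            counts[(ip, sku)] += 1
            return counts[(ip, sku)] > PRODUCT_LIMITS.get(sku, DEFAULT_LIMIT)

        # Example: a scraper hammering the same SKU quickly trips the limit.
        for _ in range(6):
            if record_lookup("203.0.113.7", "SKU-2002"):
                print("203.0.113.7 exceeded the normal lookup limit for SKU-2002")
                break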

        However, as I said to one customer in a similar situation to yours: they manufacture their own trademarked products, so does it matter if the product descriptions and images end up on places like Alibaba.com? After all, it raises awareness of the product line, which is ultimately all they want anyway, and counterfeit goods are a different ball game.

        • redgoblet

          Thanks, firewalluser…

          In this case, the company is a retailer, not a manufacturer, so it competes with other e-comms selling the same products. They have spent tons of money compiling and presenting product data, and to have it ripped off and re-presented by competitors upsets everybody.

          There are several companies that filter this type of traffic...and such services are very expensive and very much in demand.

          Again, thanks for your response...

          • doktornotor

            I'd like to point out that this is primarily a firewall (packet filter)/router – not "the answer to life, the universe and everything."

            • kejianshi

              I think the only thing that's going to protect you from getting your site ripped off is:

              System > packages > Lawyers

              Be careful with the "Lawyers" package.  It is resource intensive and often consumes more than it saves you.

              • firewalluser

                @redgoblet:

                In this case, the company is a retailer, not a manufacturer, so it competes with other e-comms selling the same products. They have spent tons of money compiling and presenting product data, and to have it ripped off and re-presented by competitors upsets everybody.

                The tons of money (and time) spent putting the data together is really a cost of doing business. However, if you are paying a photographer to take photos of the products, see if you can get some sort of branding into the picture: if a garment is hung on a dummy or draped over a flat surface, pin or stick a company logo to the product. It would take time and effort to Photoshop it out, which might act as a deterrent. Likewise, look at other options such as steganography, or simply adding some hidden bits/watermarks to the pictures before publishing.

                However, any pictures that are copied and then modified might lose the added bits or the steganography, making it harder to claim your pictures have been ripped off. Just as people might use an effect to swirl their face in a picture and a reverse algorithm exists to unswirl it, resizing or recompressing will strip some data out of a picture, and then you lose your hidden bits.

                There is no perfect solution unfortunately and even though pfSense is a good firewall, what you are looking for is not really the domain of a firewall.

                What I'd suggest is getting your web guys to come up with a script that highlights, in real time, when a search engine/user agent is going through your site and visiting a high number of pages within a short space of time (customers like to linger), and then setting up a rule in pfSense to block the IP address if it isn't coming from a recognised address block assigned to a search engine. Google and Bing publish the address blocks their spiders use, so genuine search engines are easy to spot. I'd also suggest looking up the address block an offending IP belongs to and maybe even considering blocking the whole block, but be aware that if you block an ISP's address block you could be blocking (potential) customers.

                Have a look at pfBlocker in the packages. I'm using an earlier version on one site to block a large number of countries, because it's a private webserver for employees and there is no need for overseas visitors. It uses a text file stored on the webserver listing the CIDRs assigned to various countries and ISPs, so I can block quite a few people very easily. The text file just needs to sit somewhere accessible, so maybe you could have your webserver add offending IP addresses to the text file that pfBlocker is using, and then trigger a reload or refresh in pfSense if one is needed when a new address is added to the block list. That might be your quickest and cheapest solution, but it depends on the webserver you are using, as something like this might already exist.
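
                As a rough sketch of how those two ideas could be tied together, assuming the access log has already been parsed into a list of client IPs for the current time window, and assuming pfBlocker is pointed at a list file served from the webserver (the CIDRs, file path and threshold below are illustrative only):

                import ipaddress
                from collections import Counter

                # Example search-engine source ranges (illustrative; use the
                # ranges Google and Bing actually publish for their crawlers).
                GOOD_BOT_NETS = [ipaddress.ip_network(n)
                                 for n in ("66.249.64.0/19", "157.55.39.0/24")]

                PAGES_PER_WINDOW = 200                     # customers linger; scrapers don't
                BLOCKLIST = "/var/www/html/blocklist.txt"  # hypothetical file pfBlocker fetches

                def is_good_bot(ip):
                    addr = ipaddress.ip_address(ip)
                    return any(addr in net for net in GOOD_BOT_NETS)

                def update_blocklist(ips_this_window):
                    """ips_this_window: iterable of client IPs parsed from the access log."""
                    hits = Counter(ips_this_window)
                    offenders = [ip for ip, n in hits.items()
                                 if n > PAGES_PER_WINDOW and not is_good_bot(ip)]
                    if offenders:
                        with open(BLOCKLIST, "a") as f:
                            f.write("\n".join(offenders) + "\n")
                    return offenders

                pfBlocker would then pick up the new addresses on its next scheduled update of that list (or when you force a reload).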

                • jwelter99

                  @redgoblet:

                  Client of mine operates an e-comm site with 150,000 product SKUs. They've invested significantly in product descriptions, images and technical content.
                  Automated site scraping is becoming a problem. It's easy enough to block obvious scrapers with firewall rules (IP blocks), but it requires constant manual oversight.

                  Has anyone ever used pfsense and/or snort to combat this? The trick is distinguishing between good bots (Google, etc) and bad bots.

                  thanks

                  We've used HAProxy to limit such scrapers through creative use of ACLs and delays. Basically, you set up ACLs to detect the scraping activity (it's pretty easy to tell from the rate of requests that it isn't a human browsing) and then blacklist that IP for a timeout period.

                  It's not pretty but does work…..

                  There are some examples in the HAProxy docs on how to do this.
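
                  For reference, a minimal sketch along those lines (assuming HAProxy 1.6 or later; the rate threshold, backend name and search-engine CIDRs are illustrative only):

                  frontend web
                      bind *:80
                      # Track per-source HTTP request rate over the last minute.
                      stick-table type ip size 200k expire 10m store http_req_rate(60s)
                      http-request track-sc0 src
                      # Don't rate-limit known search-engine source ranges (example CIDRs).
                      acl good_bot src 66.249.64.0/19 157.55.39.0/24
                      acl scraper sc_http_req_rate(0) gt 120
                      http-request deny deny_status 429 if scraper !good_bot
                      default_backend app

                  backend app
                      server web1 192.0.2.10:80

                  Instead of an outright deny, http-request tarpit (together with a timeout tarpit) can be used to slow suspected scrapers down rather than block them, which is closer to the delays mentioned above.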
