Netgate Discussion Forum

    Can pfsense and/or snort help prevent automated content theft/site scraping?

    • redgoblet

      Client of mine operates an e-comm site with 150,000 product SKUs. They've invested significantly in product descriptions, images and technical content.
      Automated site scraping is becoming a problem. It's easy enough to block obvious scrapers with firewall rules (IP blocks), but it requires constant manual oversight.

      Has anyone ever used pfsense and/or snort to combat this? The trick is distinguishing between good bots (Google, etc) and bad bots.

      thanks

      • firewalluser

        I'm not aware of any way pfSense/Snort can do this on their own, as they would need to know your decision-making process in order to categorise which searching/web scraping is acceptable and which is not.

        I know from the search engines and the webservers I have built for my customers that the webserver tracks all IP addresses, and suspect ones that appear to be looking up more than normal* amounts of product info can be blocked. However, this doesn't work for scrapers that gather data in batches from various IP addresses over time: they spot the threshold, come back from a different IP address, and request amounts of data that won't trigger it.

        Bottom line is you can't stop all web scraping/searching except the most blatant, i.e. software that systematically crawls the website, exploring every link and product, with no intelligence built into it, unlike an AI search engine.

        *Normal is defined by previous customer enquiries and orders. In other words, some products are enquired about/ordered together with certain other parts more often than others, and this knowledge is used to set the threshold automatically, so the threshold differs depending on what is being looked up.
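
        As a rough sketch of that kind of per-product threshold check (the SKUs, limits and function names below are purely illustrative, and the periodic reset of the counters is left out):

        from collections import defaultdict

        # Hypothetical per-product "normal" lookup limits per time window,
        # derived offline from historical enquiry/order data; anything not
        # listed falls back to a default.
        PRODUCT_LIMITS = {"SKU-1001": 30, "SKU-2002": 5}
        DEFAULT_LIMIT = 10

        # counts[(ip, sku)] -> lookups seen from that IP in the current window
        counts = defaultdict(int)

        def record_lookup(ip, sku):
            """Count a product lookup; return True if this IP now looks suspect."""
            counts[(ip, sku)] += 1
            return counts[(ip, sku)] > PRODUCT_LIMITS.get(sku, DEFAULT_LIMIT)

        # Example: a scraper hammering the same SKU quickly trips the limit.
        for _ in range(6):
            if record_lookup("203.0.113.7", "SKU-2002"):
                print("203.0.113.7 exceeded the normal lookup limit for SKU-2002")
                break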

        However, as I said to one customer in a similar situation to yours: they manufacture their own trademarked products, so does it matter if the product descriptions and images end up on places like Alibaba.com? After all, it raises awareness of the product line, which is ultimately all they want anyway, and counterfeit goods are a different ball game.

        • redgoblet

          Thanks, firewalluser…

          In this case, the company is a retailer, not a manufacturer, so it competes with other e-comms selling the same products. They have spent tons of money compiling and presenting product data, and to have it ripped off and re-presented by competitors upsets everybody.

          There are several companies that filter this type of traffic...and such services are very expensive and very much in demand.

          Again, thanks for your response...

          • doktornotor

            I'd like to point out that this is primarily a firewall (packet filter)/router – not "the answer to life, the universe and everything."

            • kejianshi

              I think the only thing that's going to protect you from getting your site ripped off is:

              System > packages > Lawyers

              Be careful with the "Lawyers" package.  It is resource intensive and often consumes more than it saves you.

              • firewalluser

                @redgoblet:

                In this case, the company is a retailer, not a manufacturer, so it competes with other e-comms selling the same products. They have spent tons of money compiling and presenting product data, and to have it ripped off and re-presented by competitors upsets everybody.

                The tons of money (and time) spent putting the data together is really a cost of doing business. However, if you are paying a photographer to take photos of the products, see if you can get some sort of branding into the picture: if a garment is hung on a dummy or draped over a flat surface, pin or stick a company logo to the product. It would take time and effort to Photoshop it out, which might act as a deterrent. Likewise, look at other options such as steganography, or simply adding some hidden bits/watermarks to the pictures before publishing.

                However, any pictures that are copied and then modified might lose the added bits or the steganography, making it harder to claim your pictures have been ripped off. Just as people might use an effect to swirl their face in a picture and a reverse algorithm exists to unswirl it, resizing or recompressing will strip some data out of a picture, and then you lose your hidden bits.

                There is no perfect solution unfortunately and even though pfSense is a good firewall, what you are looking for is not really the domain of a firewall.

                What I'd suggest is getting your web guys to come up with a script that highlights, in real time, when a search engine/user agent is going through your site and visiting a high number of pages within a short space of time (customers like to linger), and then setting up a rule in pfSense to block the IP address if it isn't coming from a recognised address block assigned to a search engine. Google and Bing publish the address blocks their spiders use, so genuine search engines are easy to spot. I'd also suggest looking up the address block an offending IP belongs to and maybe even considering blocking the whole block, but be aware that if you block an ISP's address block you could be blocking (potential) customers.

                Have a look at pfBlocker in the packages. I'm using an earlier version on one site to block a large number of countries, because it's a private webserver for employees and there is no need for overseas visitors. It uses a text file stored on the webserver listing the CIDRs assigned to various countries and ISPs, so I can block quite a few people very easily. The text file just needs to sit somewhere accessible, so maybe you could have your webserver add offending IP addresses to the text file that pfBlocker is using, and then trigger a reload or refresh in pfSense if one is needed when a new address is added to the block list. That might be your quickest and cheapest solution, but it depends on the webserver you are using, as something like this might already exist.
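
                As a rough sketch of how those two ideas could be tied together, assuming the access log has already been parsed into a list of client IPs for the current time window, and assuming pfBlocker is pointed at a list file served from the webserver (the CIDRs, file path and threshold below are illustrative only):

                import ipaddress
                from collections import Counter

                # Example search-engine source ranges (illustrative; use the
                # ranges Google and Bing actually publish for their crawlers).
                GOOD_BOT_NETS = [ipaddress.ip_network(n)
                                 for n in ("66.249.64.0/19", "157.55.39.0/24")]

                PAGES_PER_WINDOW = 200                     # customers linger; scrapers don't
                BLOCKLIST = "/var/www/html/blocklist.txt"  # hypothetical file pfBlocker fetches

                def is_good_bot(ip):
                    addr = ipaddress.ip_address(ip)
                    return any(addr in net for net in GOOD_BOT_NETS)

                def update_blocklist(ips_this_window):
                    """ips_this_window: iterable of client IPs parsed from the access log."""
                    hits = Counter(ips_this_window)
                    offenders = [ip for ip, n in hits.items()
                                 if n > PAGES_PER_WINDOW and not is_good_bot(ip)]
                    if offenders:
                        with open(BLOCKLIST, "a") as f:
                            f.write("\n".join(offenders) + "\n")
                    return offenders

                pfBlocker would then pick up the new addresses on its next scheduled update of that list (or when you force a reload).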

                • jwelter99

                  @redgoblet:

                  Client of mine operates an e-comm site with 150,000 product SKUs. They've invested significantly in product descriptions, images and technical content.
                  Automated site scraping is becoming a problem. It's easy enough to block obvious scrapers with firewall rules (IP blocks), but it requires constant manual oversight.

                  Has anyone ever used pfsense and/or snort to combat this? The trick is distinguishing between good bots (Google, etc) and bad bots.

                  thanks

                  We've used HAProxy to limit such scrapers through creative use of ACLs and delays. Basically, you set up ACLs to detect the scraping activity (it's pretty easy to tell from the rate of requests that it isn't a human browsing) and then blacklist that IP for a timeout period.

                  It's not pretty but does work…..

                  There are some examples in the HAProxy docs on how to do this.
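
                  For reference, a minimal sketch along those lines (assuming HAProxy 1.6 or later; the rate threshold, backend name and search-engine CIDRs are illustrative only):

                  frontend web
                      bind *:80
                      # Track per-source HTTP request rate over the last minute.
                      stick-table type ip size 200k expire 10m store http_req_rate(60s)
                      http-request track-sc0 src
                      # Don't rate-limit known search-engine source ranges (example CIDRs).
                      acl good_bot src 66.249.64.0/19 157.55.39.0/24
                      acl scraper sc_http_req_rate(0) gt 120
                      http-request deny deny_status 429 if scraper !good_bot
                      default_backend app

                  backend app
                      server web1 192.0.2.10:80

                  Instead of an outright deny, http-request tarpit (together with a timeout tarpit) can be used to slow suspected scrapers down rather than block them, which is closer to the delays mentioned above.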
