There are many techniques used these days by websites that want to keep crawlers away from their pages. Here is a summary of the most common ones and how they work.

Suspicious Cookies

Most websites set cookies, which the web client then sends back on every subsequent HTTP request. The website that set them can use those cookies to decide whether an incoming request should be blocked: it typically bans a session after the web server receives too many requests carrying the same cookies in a short period of time, or coming from many different IP addresses, both of which are very unlikely for a human visitor. To avoid this check, one can simply disable cookies in the spider (in Scrapy, this is done with the COOKIES_ENABLED setting), as shown below.
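A minimal sketch of this, assuming a hypothetical spider: cookies can be disabled project-wide with COOKIES_ENABLED = False in settings.py (as in the example further below), or per request via the dont_merge_cookies meta key.

spider.py

from scrapy import Request, Spider

class NoCookiesSpider(Spider):
    name = 'no-cookies-example'

    def start_requests(self):
        # dont_merge_cookies tells Scrapy's cookies middleware not to send
        # stored cookies with this request nor to keep the ones it receives
        yield Request('http://example.com/', meta={'dont_merge_cookies': True})

    def parse(self, response):
        pass  # process the response here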

Javascript Challenge

In combination with cookies, websites also employ Javascript challenges to detect and ban crawlers. This works by sending a challenge page as the response when a client requests an ordinary web page without sending cookies. The challenge page is a simple HTML page containing a Javascript calculation whose result must be submitted back to the web server automatically.

If the calculation result is correct, the web server will send the requested web page along with some extra cookies that “whitelist” the session. Those cookies are sent on subsequent requests, letting the web server know that the client has already solved the challenge, and the requested pages are returned. However, if the calculation result is incorrect (or never arrives), the web server will return another challenge page or simply block the request.

To handle this, one needs to reimplement the challenge calculation logic in the spider. This article contains some useful tips for translating Javascript code to Python.

Please note that the challenge page is sometimes sent with an HTTP 4xx/5xx status code, which means those status codes need to be handled too (see the sketch after this paragraph). The Javascript code is often obfuscated, so it helps to “beautify” it first; JSBeautifier is useful for this. Bitwise operations are often involved in the calculation logic, and this StackOverflow answer gives some advice on reimplementing those Javascript operations in Python.
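A minimal sketch of both points, using a hypothetical spider: handle_httpstatus_list lets the callbacks receive responses that Scrapy would otherwise discard as errors, and a small helper emulates Javascript's 32-bit signed integer arithmetic when porting bitwise calculations.

spider.py

from scrapy import Spider

def to_int32(value):
    # Javascript bitwise operators work on 32-bit signed integers, while
    # Python integers have arbitrary precision, so wrap the value accordingly
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value -= 0x100000000
    return value

class ChallengeAwareSpider(Spider):
    name = 'challenge-example'
    # let responses with these status codes reach the callbacks instead of
    # being discarded by the HttpError middleware, so a challenge page served
    # with an error status can still be parsed and solved
    handle_httpstatus_list = [403, 503]

    def parse(self, response):
        pass  # detect and solve the challenge here, as in the example below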

Besides reimplementing the Javascript calculation logic, we also need to capture the initial values of the Javascript variables used in it. One way to do that is with regular expressions; another is to use js2xml, which converts Javascript code into an XML document that can be parsed with familiar XPath expressions.
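Here is a rough sketch of the js2xml approach; the element names below reflect js2xml's usual output for simple variable declarations, but it is worth running js2xml.pretty_print() on the real script to confirm the exact tree:

import js2xml

script = """
var x = 17;
var y = 25;
var challenge_answer = (x * y) + 9;
"""

parsed = js2xml.parse(script)

# simple declarations such as "var x = 17;" typically show up as
# <var name="x"><number value="17"/></var> nodes in the XML tree;
# js2xml.pretty_print(parsed) shows the exact structure
x = int(parsed.xpath('//var[@name="x"]/number/@value')[0])
y = int(parsed.xpath('//var[@name="y"]/number/@value')[0])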

Here is a (simplified) example of a Javascript challenge page and the spider used to handle it:

page.html


<html>
<head>
<script type="text/javascript">
function challenge() {
    var x = 17;
    var y = 25;
    var challenge_answer = (x * y) + 9;
    document.forms[0].elements[0].value = challenge_answer;
    document.forms[0].submit();
}
</script>
</head>
<body onload="challenge()">
<noscript>Please enable JavaScript to view the page content.<br /></noscript>
<form action="/answer-challenge" method="POST">
    <input type="hidden" name="challenge_answer" value="0" />
</form>
</body>
</html>

spider.py

import re
import urllib
import urlparse

from scrapy import Request, Spider

class ProjectSpider(Spider):
    name = 'project-website.com'
    start_urls = ['http://project-website.com/']

    # pre-compile regular expressions to save cpu time
    _x_re = re.compile(r'var x = (\d+);')
    _y_re = re.compile(r'var y = (\d+);')
    _addition_re = re.compile(r'var challenge_answer = \(x \* y\) \+ (\d+)')

    def parse(self, response):
        is_challenge_page = response.xpath('/html/body[@onload="challenge()"]')

        if is_challenge_page:
            return self.parse_challenge_page(response)

        # not a challenge page: process the requested page here

    def parse_challenge_page(self, response):
        form_action = response.xpath('/html/body/form/@action').extract()[0]
        form_url = urlparse.urljoin(response.url, form_action)
        script = response.xpath('/html/head/script/text()').extract()[0]
        x = int(self._x_re.search(script).group(1))
        y = int(self._y_re.search(script).group(1))
        addition = int(self._addition_re.search(script).group(1))
        challenge_answer = (x * y) + addition
        input_name = response.xpath('/html/body/form/input/@name').extract()[0]
        body = '%s=%s' % (urllib.quote(input_name), urllib.quote(str(challenge_answer)))

        return Request(
            form_url,
            method='POST',
            # the server expects a regular form submission, so set the
            # Content-Type header accordingly
            headers={'Content-Type': 'application/x-www-form-urlencoded'},
            body=body,
            callback=self.parse
        )

settings.py

COOKIES_ENABLED = False

User-Agent Blacklisting

Some websites block requests based on the value of the User-Agent HTTP header, comparing it against a pre-compiled list of forbidden values. Hence, the most common way to deal with this is to rotate the User-Agent value, picking it from a pool of well-known user agents used by web browsers. Here is an example of how to implement that as a downloader middleware in Scrapy:

middlewares.py

from random import choice
from scrapy import signals
from scrapy.exceptions import NotConfigured

class RotateUserAgentMiddleware(object):
    """Rotate user-agent for each request."""
    def __init__(self, user_agents):
        self.enabled = False
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        user_agents = crawler.settings.get('USER_AGENT_CHOICES', [])

        if not user_agents:
            raise NotConfigured("USER_AGENT_CHOICES not set or empty")

        o = cls(user_agents)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)

        return o

    def spider_opened(self, spider):
        self.enabled = getattr(spider, 'rotate_user_agent', self.enabled)

    def process_request(self, request, spider):
        if not self.enabled or not self.user_agents:
            return

        request.headers['user-agent'] = choice(self.user_agents)

spider.py

from scrapy import Spider

class ProjectSpider(Spider):
    name = 'project-website.com'
    rotate_user_agent = True

settings.py

DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.RotateUserAgentMiddleware': 110,
}

USER_AGENT_CHOICES = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
]

IP Address

In addition to checking the User-Agent header, a website can block HTTP requests based on their originating IP address: it will ban an IP address if it detects an unusually large number of requests coming from it. To avoid this, one can reduce the crawl rate by setting a longer download delay (2 seconds or higher, see the DOWNLOAD_DELAY setting). Reducing the number of concurrent requests helps too; Scrapy allows limiting concurrent requests either per domain or per IP, as shown below.
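A minimal settings.py sketch of these throttling options (the values shown are just reasonable starting points):

settings.py

# wait at least 2 seconds between consecutive requests to the same website
DOWNLOAD_DELAY = 2
# randomize that delay (between 0.5x and 1.5x) so the request pattern looks less robotic
RANDOMIZE_DOWNLOAD_DELAY = True
# limit how many requests are performed in parallel against a single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# if set to a non-zero value, this limit is applied per IP instead of per domain
# CONCURRENT_REQUESTS_PER_IP = 4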

Another way to handle this issue is by using a pool of rotating IPs. There are plenty of services available for implementing this functionality: the free (albeit slow and unreliable) Tor network, paid proxy providers and more advanced solutions like Crawlera that provide automatic IP rotation, throttling and ban management.
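As a rough illustration, a downloader middleware along these lines can rotate requests through a pool of proxies by setting the proxy meta key that Scrapy's built-in HttpProxyMiddleware understands (the middleware name, the PROXY_POOL setting and the proxy URLs below are made up for this sketch):

middlewares.py

from random import choice

from scrapy.exceptions import NotConfigured

class RotateProxyMiddleware(object):
    """Assign a random proxy from a pool to each outgoing request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        proxies = crawler.settings.getlist('PROXY_POOL')
        if not proxies:
            raise NotConfigured("PROXY_POOL not set or empty")
        return cls(proxies)

    def process_request(self, request, spider):
        # HttpProxyMiddleware will route the request through this proxy
        request.meta['proxy'] = choice(self.proxies)

settings.py

DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.RotateUserAgentMiddleware': 110,
    # any value below 750 makes it run before the built-in HttpProxyMiddleware
    'project.middlewares.RotateProxyMiddleware': 610,
}

PROXY_POOL = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
]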

robots.txt

Websites also use the Robots Exclusion Standard (also known as robots.txt) to tell specific crawlers to stay out of all or part of the site. This technique is not very effective, since it only works with crawlers that honor it (and the ones that cause the most damage are likely to ignore it), but it’s widely adopted nevertheless due to the simplicity of its implementation.
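For reference, a typical robots.txt entry looks like this (BadBot is a placeholder for a crawler's User-agent token):

robots.txt

# block one specific crawler from the whole site
User-agent: BadBot
Disallow: /

# keep every other crawler out of a private section
User-agent: *
Disallow: /private/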

Thanks to Pablo, Mikhail, Alexis and Elias who have reviewed and improved this post.
