Anti Anti-spider Strategy

Common strategies for coping with anti-spider measures

Posted by aaronice on February 15, 2016

The Anti Anti-scraping Strategy

Before looking at how to avoid being blacklisted while scraping, we need to understand how a website detects web crawlers.

There are multiple anti-scraping mechanisms for distinguishing a spider from a normal user. Some of them are[1]:

  1. Unusual traffic or a high download rate, especially from a single client or IP address within a short time span.

  2. Repetitive tasks performed on the website – based on the assumption that a human user won’t perform the same repetitive tasks all the time.

  3. Detection through honeypots – these honeypots are usually links that aren’t visible to a normal user, only to a spider. When a scraper tries to access such a link, the alarm is tripped.

Corresponding to the anti-scraping mechanisms are the following strategies:

IP Address Rotation

IP blacklisting is perhaps the easiest anti-scraping measure. Creating a pool of IP addresses and making requests through different IPs makes it harder for the server to detect the crawler (a sketch follows the list below).

  • Use a list of IP addresses from proxy services
    • Randomly pick an IP and switch to a new one after a certain time interval
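
A minimal sketch of this idea in Python with the requests library; the proxy addresses and target URL below are placeholders, and a real proxy service would supply its own list.

```python
import random
import requests

# Hypothetical proxy pool -- in practice these would come from a proxy service.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def fetch(url):
    # Pick a random proxy for this request so traffic is spread across IPs.
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = fetch("http://example.com/page")
print(response.status_code)
```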

Cookie Management

Cookies are small pieces of data stored on the client side, and some websites use cookies to identify users. If a client sends requests at a high frequency, it may be identified as a suspicious crawler and denied access.

  • Customize and manage a pool of cookies (a sketch follows this list)
    • Send a request to the server without cookie headers, parse the response for the Set-Cookie value, and store it in the cookie collector;
    • Get a cookie from the cookie collector; if the cookie is no longer usable, delete it from the collector;
    • Manage the cookie collector with timestamps, so that the spider always gets the oldest cookie first.
  • Disable cookies
    • Disabling cookies may help prevent being banned by websites that use cookies to identify crawlers.
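
A rough sketch of such a cookie collector, assuming the target site issues a session cookie via Set-Cookie on a plain request; the URLs and the ten-minute validity window are illustrative assumptions.

```python
import time
import requests

cookie_pool = []  # list of (timestamp, cookie_dict), oldest first

def harvest_cookie(url):
    # Request without any cookies; the server replies with a Set-Cookie header.
    resp = requests.get(url, timeout=10)
    if resp.cookies:
        cookie_pool.append((time.time(), resp.cookies.get_dict()))

def get_oldest_cookie():
    # Hand out the oldest cookie first, dropping entries deemed stale.
    while cookie_pool:
        ts, cookie = cookie_pool[0]
        if time.time() - ts < 600:   # assume a cookie stays usable ~10 minutes
            return cookie
        cookie_pool.pop(0)           # discard unusable (expired) cookie
    return None

harvest_cookie("http://example.com")
resp = requests.get("http://example.com/data", cookies=get_oldest_cookie() or {})
```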

User-Agent Imitation

To disguise the crawler as a browser, one approach is to modify the User-Agent. The User-Agent is a string in the request header that identifies the client, such as the web browser and its version and the operating system.

  • Spoof the User-Agent by making a list of user-agent strings and randomly picking one for each request. Set the User-Agent to that of a common web browser instead of using the library’s default, as sketched below.
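
A simple sketch with the requests library; the User-Agent strings below are just examples of common browser values, not an exhaustive or current list.

```python
import random
import requests

# A few common browser User-Agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 "
    "(KHTML, like Gecko) Version/9.0.2 Safari/601.3.9",
    "Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0",
]

def fetch(url):
    # Pick a random User-Agent for each request instead of the library default.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

print(fetch("http://example.com").status_code)
```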

Rate Limit

Slow down the crawling, treat the website nicely, and don’t overwhelm or DDoS the server (a short sketch follows the list below).

  • Put some random sleeps between requests
  • Add some delays after crawling a number of pages
  • Use the lowest number of concurrent requests possible
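
A minimal sketch of the random-sleep approach from the list above; the URLs, delay ranges, and batch size are arbitrary placeholders.

```python
import random
import time
import requests

urls = ["http://example.com/page/%d" % i for i in range(1, 51)]  # placeholder URLs

for i, url in enumerate(urls, start=1):
    requests.get(url, timeout=10)
    # Random sleep between requests so the traffic pattern is not uniform.
    time.sleep(random.uniform(1.0, 3.0))
    # Longer pause after every batch of pages.
    if i % 10 == 0:
        time.sleep(random.uniform(10, 20))
```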

Be Careful About Honeypots

These honeypots are usually links that a normal user can’t see but a spider can[1]. They often carry an inline CSS style such as display: none, so detecting them can be tricky; checking a link’s style and attributes before following it helps (a rough sketch follows).
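
A rough heuristic sketch using requests and BeautifulSoup that skips links hidden with inline CSS; honeypots hidden via external stylesheets or JavaScript would not be caught by this simple check.

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://example.com", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

safe_links = []
for a in soup.find_all("a", href=True):
    # Normalize the inline style and skip links hidden from human users.
    style = (a.get("style") or "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        continue  # likely a honeypot link -- do not follow
    safe_links.append(a["href"])
```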

Avoid Repetitive Crawling Pattern

Some websites have implemented intelligent anti-crawling mechanisms, so a repetitive action pattern is likely to be detected as a spider. To make a spider look more like a human, add random clicks, mouse movements, and other random actions. An automated testing tool such as Selenium can simulate these “human actions”, as sketched below.
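
A minimal sketch using Selenium, assuming a ChromeDriver binary is available; it hovers over a few randomly chosen links with random pauses instead of fetching pages in a rigid order. The target URL is a placeholder.

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes ChromeDriver is installed and on PATH
driver.get("http://example.com")

# Hover over a few random links with random pauses in between,
# instead of visiting pages in a uniform, repetitive pattern.
links = driver.find_elements(By.TAG_NAME, "a")
for link in random.sample(links, min(3, len(links))):
    try:
        ActionChains(driver).move_to_element(link).perform()
    except Exception:
        pass  # hidden or off-screen links cannot be hovered; skip them
    time.sleep(random.uniform(1.0, 3.0))

driver.quit()
```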

References

  1. How to Prevent Getting Blacklisted While Scraping

  2. 防止爬虫被墙的方法总结 (A summary of methods for keeping crawlers from being blocked)

  3. 应对反爬虫之换Cookie (Coping with anti-crawling measures by rotating cookies)

