Limit/block bad bots

MageStack has native functionality to prioritise certain "good" bots (Google, Bing, Yahoo, Pingdom) and to reduce the priority of "bad" third-party bots (Majestic SEO, Rogerbot, etc.). This is handled within the WAF itself; see the DOS filter rules for more information.

The default thresholds are:

Type      Rate                   Soft Warning Action            Hard Warning Action
Good bot  Unlimited              -                              -
Bad bot   3 concurrent requests  429 Too Many Requests Header   503 Service Unavailable Header

When the connection rate is exceeded, bad bots are presented with a 429 Too Many Requests header, followed by a 503 Service Unavailable header if the requests continue.

It is important to note that no legitimate web browser or search engine will be affected by this. The rule is designed to identify the subtle differences between legitimate traffic and bad bots, without causing false positives.

Re-qualify bad bots

You can also override the default bad bot definition by re-qualifying bots yourself. The later examples can be used in conjunction with this setting, so that your specified bots are not treated as bad bots.

if ($http_user_agent ~* (DeepCrawl)) {
  set $magestack_bot_type "Good";
}

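! It is important to define this setting before the other actions below. The directives are evaluated in order, so a bot must be re-qualified before any rule that tests $magestack_bot_type.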
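
The same pattern can cover several crawlers at once by using regex alternation. The following is a minimal sketch, not a recommended list: "AhrefsBot" is only an illustrative second user agent, and it assumes the match is against a substring of the User-Agent header, as in the DeepCrawl example above.

# Re-qualify several crawlers in one rule (user agent names are examples only)
if ($http_user_agent ~* (DeepCrawl|AhrefsBot)) {
  set $magestack_bot_type "Good";
}

Because ~* is a case-insensitive match, "deepcrawl" and "DeepCrawl" are treated the same.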

Limiting crawl bots

There are two mechanisms that can be used to further control the behaviour of both good and bad bots, to give you more granular control of how you want to treat them.

  • Bot-side crawl delay
  • Server-side rate limit

Crawl delay

The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation depends on the crawler reading it. Yandex interprets the value as the number of seconds to wait between subsequent visits, while Bing uses the value as an opaque number that controls visit frequency, with no quantitative specification.

In your robots.txt, specify:

User-agent: *
Crawl-delay: 10

Server-side

If a crawl bot ignores the Crawl-delay value, you can deploy a custom server-side rate limit instead. To globally rate limit bad bots, you can use:

if ($magestack_bot_type = "Bad") {
  set $magestack_custom_limit one_per_five_seconds;
}
limit_req zone=custom_one_per_five_seconds;
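
If only one crawler is misbehaving, the same limit can be keyed on its user agent rather than on the bot type. This is a sketch that reuses the variable and zone names from the example above. It assumes "MJ12bot" (the user agent token used by Majestic's crawler) is the bot you want to slow down, and it is meant as an alternative to the global rule rather than an addition to it.

# Rate limit a single crawler by user agent (the bot name is an example)
if ($http_user_agent ~* (MJ12bot)) {
  set $magestack_custom_limit one_per_five_seconds;
}
limit_req zone=custom_one_per_five_seconds;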

Blocking bad bots

As a last resort, you can deny access completely to bad bots.

if ($magestack_bot_type = "Bad") {
  return 403;
}
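
When blocking is combined with re-qualification, the order of the rules matters. The sketch below puts the two snippets from this page together, assuming DeepCrawl is a bot you want to keep: it is re-qualified first, so the 403 only applies to bots that are still classed as bad.

# 1. Re-qualify the bots you want to allow
if ($http_user_agent ~* (DeepCrawl)) {
  set $magestack_bot_type "Good";
}

# 2. Deny everything that is still classed as a bad bot
if ($magestack_bot_type = "Bad") {
  return 403;
}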