Limit/block bad bots

MageStack has native functionality to prioritise certain "good" bots (Google, Bing, Yahoo, Pingdom) and to reduce the priority of "bad" third-party bots (Majestic SEO, Rogerbot etc.). This is handled within the WAF itself; see DOS filter rules for more information.

The default thresholds are:

Type     | Rate                  | Soft Warning Action          | Hard Warning Action
Good bot | Unlimited             | -                            | -
Bad bot  | 3 concurrent requests | 429 Too Many Requests Header | 503 Service Unavailable Header

When the connection rate is exceeded, bad bots are sent a 429 Too Many Requests header, followed by a 503 Service Unavailable header if the requests continue.

It is important to note that no legitimate web browser or search engine will be affected by this. The rule distinguishes them from bad bots and is designed to avoid false positives.

Re-qualify bad bots

You can also override the default bad bot definition by re-qualifying bots yourself. The later examples can be used in conjunction with this setting, so that the bots you specify are not treated as bad bots.

if ($http_user_agent ~* (DeepCrawl)) {
  set $magestack_bot_type "Good";
}

! It is important to define this setting before the other actions (rate limiting, blocking), so that they see the re-qualified bot type
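As a sketch of the ordering, assuming the snippets from this page are combined in one VHost configuration, the re-qualification is placed first and the server-side rate limit (described later on this page) follows it. DeepCrawl is only the example crawler used above; substitute the bots you want to exempt.

```nginx
# 1. Re-qualify first: treat DeepCrawl as a good bot
if ($http_user_agent ~* (DeepCrawl)) {
  set $magestack_bot_type "Good";
}

# 2. Actions afterwards: anything still classed as "Bad" is rate limited
if ($magestack_bot_type = "Bad") {
  set $magestack_custom_limit one_per_five_seconds;
}
limit_req zone=custom_one_per_five_seconds;
```

With this order, DeepCrawl requests are already marked "Good" by the time the bad bot check runs, so the custom limit is not set for them.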

Limiting crawl bots

There are two mechanisms that can be used to further control the behaviour of both good and bad bots, to give you more granular control of how you want to treat them.

  • Bot-side crawl delay
  • Server-side rate limit

Crawl delay

The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation depends on the crawler reading it. Yandex interprets the value as the number of seconds to wait between subsequent visits, whereas Bing treats it as an opaque number that controls visit frequency, with no quantitative specification.

In your robots.txt, specify:

User-agent: *
Crawl-delay: 10

Server-side

A custom rate limit can be deployed if the crawl bot in question ignores the Crawl-delay value. To rate limit all bad bots globally, use:

if ($magestack_bot_type = "Bad") {
  set $magestack_custom_limit one_per_five_seconds;
}
limit_req zone=custom_one_per_five_seconds;
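To limit a single crawler rather than every bad bot, the same variable can in principle be set from a user agent match. This is a minimal sketch that assumes $magestack_custom_limit may be set under any condition, not only the bot type check; Yandex is used purely as an example of a crawler that might ignore your crawl delay.

```nginx
# Assumption: the custom limit can be selected by user agent
# in the same way as it is selected by bot type above
if ($http_user_agent ~* "(Yandex)") {
  set $magestack_custom_limit one_per_five_seconds;
}
limit_req zone=custom_one_per_five_seconds;
```

Test this in a staging environment first and confirm that the named crawler starts receiving rate limit responses while other visitors are unaffected.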

Blocking bad bots

As a last resort, you can deny access completely to bad bots.

if ($magestack_bot_type = "Bad") {
  return 403;
}
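If only one crawler is causing problems, a narrower block may be preferable to denying every bad bot. The sketch below uses standard nginx directives and assumes the crawler identifies itself with "rogerbot" in its User-Agent header; adjust the pattern to the bot you are targeting.

```nginx
# Deny a single crawler by user agent (case-insensitive match)
# Assumption: the bot sends "rogerbot" somewhere in its User-Agent
if ($http_user_agent ~* "rogerbot") {
  return 403;
}
```

A bot that spoofs its user agent will not be caught by this rule, so it suits well-behaved but unwanted crawlers only.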

Bot detection

MageStack also incorporates a bot detection system that will challenge and attempt to block non-human visitors.

Enable bot detection on an entire VHost

While commonly required URLs have been added, it is highly recommended that you test this in your staging/dev environment with any ERP system, payment provider's sandbox or other third-party services to ensure they function as expected.

# Enable Testcookie
testcookie on;

# Enable this when using HTTPS
testcookie_https_location on;

# Enable BotProtect on all requests by default
set $magestack_botprotect_disabled 0;

# Disable protection on URL
if ($request_uri ~* /about-magento-demo-store/)
{
    set $magestack_botprotect_disabled 1;
}

# Disable protection by IP address
# (exact match; an unanchored regex such as ~* "127.0.0.1" would also match 127.0.0.10)

if ($remote_addr = "127.0.0.1") { set $magestack_botprotect_disabled 1; }

# Disable protection by user agent

if ($http_user_agent ~* "(Semrush|Yandex)") { set $magestack_botprotect_disabled 1; }
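Third-party services such as an ERP system or a payment provider usually call back to fixed endpoints and cannot answer the challenge. A sketch of exempting several such paths in a single rule follows; the paths /erp-sync/ and /payment/callback/ are hypothetical placeholders, not MageStack defaults, and should be replaced with the endpoints your integrations actually use.

```nginx
# Hypothetical callback endpoints; replace with your own
# $request_uri starts with "/" and includes the query string
if ($request_uri ~* "^/(erp-sync|payment/callback)/") {
    set $magestack_botprotect_disabled 1;
}
```

Anchoring the pattern with ^ keeps the exemption limited to URLs that begin with those paths, rather than any URL that happens to contain them.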

Enable bot detection on specific URLs

# Enable Testcookie
testcookie on;

# Enable this when using HTTPS
testcookie_https_location on;

# Enable protection on a URL
if ($request_uri ~* /about-magento-demo-store/)
{
    set $magestack_botprotect_disabled 0;
}

# Enable bot detection per IP address
# (exact match; an unanchored regex such as ~* "127.0.0.1" would also match 127.0.0.10)

if ($remote_addr = "127.0.0.1") { set $magestack_botprotect_disabled 0; }

# Enable bot detection by user agent (not effective or recommended; included for completeness)

if ($http_user_agent ~* "(Semrush|Yandex)") { set $magestack_botprotect_disabled 0; }