Updated on 2024-04-12 GMT+08:00

Configuring Anti-Crawler Rules

You can configure website anti-crawler protection rules to defend against crawlers such as search engine bots, scanners, script tools, and crawlers used for other purposes.

Prerequisites

You have added the website you want to protect to WAF.

Constraints

  • Any browser used to access a website protected by anti-crawler rules must have cookies enabled and support JavaScript.
  • It takes several minutes for a new rule to take effect. After the rule takes effect, protection events triggered by the rule are displayed on the Events page (a verification sketch follows this list).
  • If your service is connected to CDN, exercise caution when using this function.

    CDN caching may impact Anti-Crawler performance and page accessibility.
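
Once a rule has taken effect, one way to confirm that it is matching traffic is to send a test request from a plain HTTP client and then look for the corresponding entry on the Events page. The following is a minimal sketch, not part of the product documentation: the URL is a placeholder for your protected domain, it assumes the Script Tool feature is enabled, and it uses the third-party Python requests package. WAF may evaluate more signals than the User-Agent header, so the absence of an event is not conclusive.

```python
# Verification sketch. Assumptions (not from this guide): the URL below is a
# placeholder for your protected domain, the Script Tool feature is enabled,
# and the third-party "requests" package is installed.
import requests

PROTECTED_URL = "https://www.example.com/"  # replace with your protected domain

# requests sends a default User-Agent such as "python-requests/2.x", which
# looks like a script tool and therefore makes a reasonable test case.
response = requests.get(PROTECTED_URL, timeout=10)
print("Status code:", response.status_code)

# Wait a few minutes, then check the Events page: in Log only mode the request
# should appear as a logged event; in Block mode it should have been blocked.
```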

Procedure

  1. Log in to the management console.
  2. Click in the upper left corner of the management console and select a region or project.
  3. Click in the upper left corner and choose Web Application Firewall under Security & Compliance.
  4. In the navigation pane on the left, choose Website Settings.
  5. In the Policy column of the row containing the target domain name, click the number to go to the Policies page.
  6. In the Anti-Crawler configuration area, use the toggle on the right to enable anti-crawler protection, and then click Configure Bot Mitigation.
  7. Select the Feature Library tab and enable the protection by referring to Table 1.

    A feature-based anti-crawler rule has two protective actions:
    • Block

      WAF blocks and logs detected attacks.

      Enabling this feature may have the following impacts:

      • Blocking requests from search engines may affect your website's SEO.
      • Blocking script tools may also block some of your own applications, because an application that does not modify its user-agent field can trigger anti-crawler rules (see the sketch before Table 1).
    • Log only

      Detected attacks are logged only. This is the default protective action.

    Scanner is enabled by default, but you can enable other protection types if needed.
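
As noted in the impacts above, applications that call the protected domain through libraries such as HttpClient, OkHttp, or Python programs may be treated as script tools if they keep the library's default user-agent. The following is a minimal sketch of that mitigation using the third-party Python requests package; the application name, contact URL, and API path are placeholders, and disabling Script Tool (see Table 1) remains the documented alternative.

```python
# Sketch of giving an in-house client a descriptive User-Agent so its traffic
# is distinguishable from a generic script tool. The identifier and URLs below
# are placeholders, not values from this guide.
import requests

CUSTOM_USER_AGENT = "ExampleInventorySync/1.2 (+https://www.example.com/bot-info)"

session = requests.Session()
# Apply the custom User-Agent to every request made through this session.
session.headers.update({"User-Agent": CUSTOM_USER_AGENT})

response = session.get("https://www.example.com/api/items", timeout=10)
print(response.status_code)
```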

    Table 1 Anti-crawler detection features

    Search Engine
      Description: This rule is used to block web crawlers, such as Googlebot and Baiduspider, from collecting content from your site.
      Remarks: If you enable this rule, WAF detects and blocks search engine crawlers (a sketch for verifying genuine Googlebot traffic follows this table).
      NOTE: If Search Engine is not enabled, WAF does not block POST requests from Googlebot or Baiduspider.

    Scanner
      Description: This rule is used to block scanners, such as OpenVAS and Nmap. A scanner scans for vulnerabilities, viruses, and performs similar detection jobs.
      Remarks: If you enable this rule, WAF detects and blocks scanner crawlers.

    Script Tool
      Description: This rule is used to block script tools. A script tool is often used to execute automated tasks and program scripts, such as HttpClient, OkHttp, and Python programs.
      Remarks: If you enable this rule, WAF detects and blocks the execution of automated tasks and program scripts.
      NOTE: If your application uses scripts or libraries such as HttpClient, OkHttp, or Python programs, disable Script Tool. Otherwise, WAF will identify such script tools as crawlers and block the application.

    Other
      Description: This rule is used to block crawlers used for other purposes, such as site monitoring, access proxies, and web page analysis.
      Remarks: If you enable this rule, WAF detects and blocks crawlers used for these other purposes.
      NOTE: To avoid being blocked by WAF, crawlers may use a large number of IP address proxies.
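
Because enabling Block for Search Engine stops crawlers such as Googlebot, and blocking genuine search engines may affect SEO, you may want to confirm whether a source IP recorded in a protection event really belongs to Googlebot before choosing the action. The following sketch uses the reverse-plus-forward DNS check that Google documents for verifying Googlebot; this technique is not part of this guide, and the sample IP address is only an illustration to be replaced with an address from your own event logs.

```python
# Reverse-plus-forward DNS check for Googlebot (a technique documented by
# Google, not by this WAF guide). Replace the sample IP with a source IP taken
# from a protection event on the Events page.
import socket

def is_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except OSError:
        return False
    # Genuine Googlebot hosts resolve under googlebot.com or google.com.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm that the hostname maps back to the same IP address.
        return socket.gethostbyname(hostname) == ip
    except OSError:
        return False

print(is_googlebot("66.249.66.1"))  # sample address; replace with your own
```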