Updated on 2024-11-19 GMT+08:00

Configuring Anti-Crawler Rules

You can configure website anti-crawler protection rules to protect against search engines, scanners, script tools, and other crawlers, and use JavaScript to create custom anti-crawler protection rules.

If you have enabled enterprise projects, ensure that you have all operation permissions for the project where your WAF instance is located. Then, you can select the project from the Enterprise Project drop-down list and configure protection policies for the domain names in the project.

Prerequisites

Constraints

  • Cookies must be enabled and JavaScript supported by any browser used to access a website protected by anti-crawler protection rules.
  • If your service is connected to CDN, exercise caution when using the JS anti-crawler function.

    CDN caching may impact JS anti-crawler performance and page accessibility.

  • The JavaScript anti-crawler function is unavailable for pay-per-use WAF instances.
  • This function is not supported in the standard edition.
  • JS anti-crawler protection is not supported if you use the cloud-ELB access mode.
  • If JavaScript anti-crawler event logs cannot be viewed, see Why Are There No Protection Logs for Some Requests Blocked by WAF JavaScript Anti-Crawler Rules?
  • The protective action for website anti-crawler JavaScript challenge is Log only, and that for JavaScript authentication is Verification code. If a visitor fails the JavaScript authentication, a verification code is required for access. Requests will be forwarded as long as the visitor enters a valid verification code.
  • WAF JavaScript-based anti-crawler rules only check GET requests and do not check POST requests.

How JavaScript Anti-Crawler Protection Works

Figure 1 shows how JavaScript anti-crawler detection works, which includes JavaScript challenges (step 1 and step 2) and JavaScript authentication (step 3).

Figure 1 JavaScript Anti-Crawler protection process

If JavaScript anti-crawler is enabled when a client sends a request, WAF returns a piece of JavaScript code to the client.
  • If the client is a normal browser, it executes the returned JavaScript code and automatically sends the request to WAF again. WAF then forwards the request to the origin server. This process is called JavaScript verification.
  • If the client is a crawler, it cannot execute the returned JavaScript code and does not send the request to WAF again. The client fails JavaScript authentication.
  • If a crawler fabricates a WAF authentication request and sends it to WAF, WAF blocks the request. The client fails JavaScript authentication.
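
The three outcomes above can be illustrated with a small simulation. This is only a conceptual sketch: the token scheme (an HMAC over a client ID), the SECRET value, and the function names are assumptions made for illustration, not the mechanism WAF actually uses.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Hypothetical secret; the real WAF token scheme is not described in this document.
SECRET = secrets.token_bytes(16)


def issue_challenge(client_id: str) -> str:
    """Token embedded in the JavaScript code that the simulated WAF returns."""
    return hmac.new(SECRET, client_id.encode(), hashlib.sha256).hexdigest()


def handle_request(client_id: str, token: Optional[str]) -> str:
    """Decide what the simulated WAF does with a GET request."""
    if token is None:
        return "challenge"  # first request: return the JavaScript code
    if hmac.compare_digest(token, issue_challenge(client_id)):
        return "forward"  # the client ran the script and sent the request again
    return "block"  # fabricated authentication request


# A browser executes the script and resends the request with the token.
browser_token = issue_challenge("browser")
browser_result = [
    handle_request("browser", None),
    handle_request("browser", browser_token),
]

# A crawler that cannot run JavaScript never resends the request.
crawler_result = [handle_request("crawler", None)]

# A crawler that fabricates the authentication request is blocked.
forged_result = handle_request("crawler", "0" * 64)
```

In this sketch the browser ends with "forward", the plain crawler never gets past "challenge", and the forged request is answered with "block".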

By counting JavaScript challenge responses and JavaScript authentication responses, the system calculates how many requests the JavaScript anti-crawler has defended against. In Figure 2, the JavaScript anti-crawler has logged 18 events, 16 of which are JavaScript challenge responses and 2 of which are JavaScript authentication responses. Others indicates the number of WAF authentication requests fabricated by crawlers.
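
As a quick check of the numbers in this example, and assuming that the total is simply the sum of the three categories (the text implies this but does not state it), the Others count can be derived as follows:

```python
# Figures taken from the example: 18 logged events in total.
total_events = 18
js_challenge = 16
js_authentication = 2

# Assumption: total = challenge + authentication + others.
others = total_events - js_challenge - js_authentication
print(others)  # 0
```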

Figure 2 Parameters of a JavaScript anti-crawler protection rule

Configuring an Anti-Crawler Rule

  1. Log in to the management console.
  2. Click the region icon in the upper left corner of the management console and select a region or project.
  3. Click the service list icon in the upper left corner and choose Web Application Firewall under Security & Compliance.
  4. In the navigation pane on the left, choose Policies.
  5. Click the name of the target policy to go to the protection configuration page.
  6. Click the Anti-Crawler configuration area and toggle it on or off if needed.

    • Toggle on: enabled.
    • Toggle off: disabled.

  7. Select the Feature Library tab and enable the protection by referring to Table 1.

    A feature-based anti-crawler rule has two protective actions:
    • Block

      WAF blocks and logs detected attacks.

      Enabling this feature may have the following impacts:

      • Blocking requests of search engines may affect your website SEO.
      • Blocking scripts may also block some legitimate applications, because those applications can trigger anti-crawler rules if their User-Agent field is not modified.
    • Log only

      Detected attacks are logged only. This is the default protective action.

    Scanner is enabled by default, but you can enable other protection types if needed.
    Figure 3 Feature Library
    Table 1 Anti-crawler detection features

    | Type | Description | Remarks |
    | --- | --- | --- |
    | Search Engine | This rule is used to block web crawlers, such as Googlebot and Baiduspider, from collecting content from your site. | If you enable this rule, WAF detects and blocks search engine crawlers. NOTE: If Search Engine is not enabled, WAF does not block POST requests from Googlebot or Baiduspider. If you want to block POST requests from Baiduspider, use the configuration described in Configuration Example - Search Engine. |
    | Scanner | This rule is used to block scanners, such as OpenVAS and Nmap. A scanner scans for vulnerabilities and viruses and performs other similar jobs. | After you enable this rule, WAF detects and blocks scanners. |
    | Script Tool | This rule is used to block script tools. A script tool is often used to execute automatic tasks and program scripts, such as HttpClient, OkHttp, and Python programs. | If you enable this rule, WAF detects and blocks the execution of automatic tasks and program scripts. NOTE: If your application uses scripts such as HttpClient, OkHttp, and Python, disable Script Tool. Otherwise, WAF will identify such script tools as crawlers and block the application. |
    | Other | This rule is used to block crawlers used for other purposes, such as site monitoring, access proxies, and web page analysis. NOTE: To avoid being blocked by WAF, crawlers may use a large number of IP address proxies. | If you enable this rule, WAF detects and blocks crawlers that are used for various purposes. |

  8. Select the JavaScript tab and change Status if needed.

    JavaScript anti-crawler is disabled by default. To enable it, click the Status toggle and then click OK in the displayed dialog box.

    Protective Action: Block or Log only. You can also select Verification code. If the JavaScript challenge fails, a verification code is required. As long as the visitor provides a valid verification code, their request will not be restricted.

  9. Configure a JavaScript-based anti-crawler rule by referring to Table 2.

    Two protection modes are provided: Protect all requests and Protect specified requests.

    • To protect all requests except requests that hit a specified rule
      Set Protection Mode to Protect all requests. Then, click Exclude Rule, configure the request exclusion rule, and click Confirm.
      Figure 4 Exclude Rule
    • To protect a specified request only

      Set Protection Mode to Protect specified requests, click Add Rule, configure the request rule, and click Confirm.

      Figure 5 Add Rule
    Table 2 Parameters of a JavaScript-based anti-crawler protection rule

    | Parameter | Description | Example Value |
    | --- | --- | --- |
    | Rule Name | Name of the rule. | waf |
    | Rule Description | A brief description of the rule. This parameter is optional. | - |
    | Effective Date | Time the rule takes effect. | Immediate |
    | Condition List | Parameters for configuring a condition are as follows: • Field: Select the field you want to protect from the drop-down list. Currently, only Path and User Agent are included. • Subfield • Logic: Select a logical relationship from the drop-down list. NOTE: If you set Logic to Include any value, Exclude any value, Equal to any value, Not equal to any value, Prefix is any value, Prefix is not any of them, Suffix is any value, or Suffix is not any of them, you need to select a reference table. • Content: Enter or select the content that matches the condition. • Case sensitive: This parameter can be configured if Path is selected for Field. If you enable it, path matching is case-sensitive, which helps the system identify and handle crawler requests more accurately. | Path Include /admin |
    | Priority | Rule priority. If you have added multiple rules, they are matched by priority. The smaller the value, the higher the priority. | 5 |
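
The interaction between the protection mode, the condition list, and the priority can be sketched in a few lines. This is a simplified model under stated assumptions: only the Include logic is modeled, and the Rule class, first_match, and is_protected are hypothetical names for illustration, not part of any WAF API.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    field_name: str  # "path" or "user_agent"
    content: str
    priority: int
    case_sensitive: bool = True

    def matches(self, request: dict) -> bool:
        value = request.get(self.field_name, "")
        needle = self.content
        if not self.case_sensitive:
            value, needle = value.lower(), needle.lower()
        return needle in value  # models the "Include" logic only


def first_match(rules, request):
    """Return the matching rule with the smallest priority value, if any."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(request):
            return rule
    return None


def is_protected(mode, rules, request):
    """Decide whether JavaScript anti-crawler applies to the request."""
    hit = first_match(rules, request) is not None
    if mode == "protect_all":
        return not hit  # rules act as exclusions
    return hit  # "protect_specified": rules select the requests to protect


# Example values from Table 2: Path Include /admin, priority 5.
rules = [Rule("waf", "path", "/admin", priority=5)]
protected = is_protected("protect_specified", rules, {"path": "/admin/login"})
excluded = is_protected("protect_all", rules, {"path": "/admin/login"})
```

With the example rule, a request to /admin/login is protected in Protect specified requests mode and excluded from protection in Protect all requests mode. Because case sensitivity is on in this sketch, a request to /Admin would not match.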

Related Operations

  • To disable a rule, click Disable in the Operation column of the rule. The default Rule Status is Enabled.
  • To modify a rule, click Modify in the row containing the rule.
  • To delete a rule, click Delete in the row containing the rule.

Configuration Example - Logging Script Crawlers Only

To verify that an anti-crawler rule is protecting domain name www.example.com:

  1. Execute a JavaScript tool to crawl web page content.
  2. On the Feature Library tab, enable Script Tool and select Log only for Protective Action. (If WAF detects an attack, it logs the attack only.)

    Figure 6 Enabling Script Tool

  3. Enable anti-crawler protection.

    Figure 7 Anti-Crawler configuration area

  4. In the navigation pane on the left, choose Events to go to the Events page.

    Figure 8 Viewing Events - Script crawlers
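
To make the effect of this example easier to reason about, the following sketch models feature-based detection as a keyword match on the User-Agent header. The keyword lists reuse the tool names mentioned in Table 1. The matching logic, the function names, and the returned strings are assumptions for illustration, because the actual WAF feature library is not described in this document.

```python
# Illustrative keywords only; these are not the real WAF feature library.
FEATURES = {
    "Search Engine": ("googlebot", "baiduspider"),
    "Scanner": ("openvas", "nmap"),
    "Script Tool": ("httpclient", "okhttp", "python"),
}


def classify(user_agent: str):
    """Return the feature type that the User-Agent matches, or None."""
    ua = user_agent.lower()
    for feature, keywords in FEATURES.items():
        if any(keyword in ua for keyword in keywords):
            return feature
    return None


def handle(user_agent: str, enabled: set, action: str) -> str:
    """Apply the protective action if the matched feature type is enabled."""
    feature = classify(user_agent)
    if feature is None or feature not in enabled:
        return "forward"
    return "block and log" if action == "Block" else "log only"


# Script Tool enabled with Log only, as in this configuration example.
enabled = {"Scanner", "Script Tool"}
script_result = handle("python-requests/2.31.0", enabled, "Log only")
browser_result = handle("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", enabled, "Log only")
```

In this model the script request is only logged, which corresponds to the event that should appear on the Events page, while the browser request is forwarded without an event.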

Configuration Example - Search Engine

To allow the Baidu and Google search engine crawlers but block POST requests from Baidu:

  1. Set Status of Search Engine to disabled by referring to step 6.
  2. Configure a precise protection rule by referring to Configuring Custom Precise Protection Rules.

    Figure 9 Blocking POST requests
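
The intent of the precise protection rule can be expressed as a simple predicate. The exact conditions configured in Figure 9 are not reproduced here, so the two conditions below (method is POST, User-Agent contains Baiduspider) and the function name should_block are assumptions for illustration.

```python
def should_block(method: str, user_agent: str) -> bool:
    """Block only POST requests whose User-Agent contains Baiduspider."""
    return method.upper() == "POST" and "baiduspider" in user_agent.lower()


baidu_ua = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
post_blocked = should_block("POST", baidu_ua)  # POST from Baidu is blocked
get_blocked = should_block("GET", baidu_ua)    # GET from Baidu is still allowed
```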