Updated on 2024-03-14 GMT+08:00

Configuring Anti-Crawler Rules

You can configure website anti-crawler protection rules to protect against search engines, scanners, script tools, and other crawlers, and use JavaScript to create custom anti-crawler protection rules.

Prerequisites

You have added your website to a policy.

Constraints

Cookies must be enabled and JavaScript supported by any browser used to access a website protected by anti-crawler protection rules.
If your service is connected to CDN, exercise caution when using the JS anti-crawler function.
CDN caching may impact JS anti-crawler performance and page accessibility.
WAF only logs JavaScript challenge and JavaScript authentication events. No other protective actions can be configured for JavaScript challenge and authentication.
WAF JavaScript-based anti-crawler rules only check GET requests and do not check POST requests.

How JavaScript Anti-Crawler Protection Works

Figure 1 shows how JavaScript anti-crawler detection works, which includes JavaScript challenges (step 1 and step 2) and JavaScript authentication (step 3).

Figure 1 JavaScript Anti-Crawler protection process

If JavaScript anti-crawler is enabled when a client sends a request, WAF returns a piece of JavaScript code to the client.

If the client is a normal browser, the returned JavaScript code runs and the client automatically sends the request to WAF again. WAF then forwards the request to the origin server. This process is called JavaScript verification.
If the client is a crawler, it cannot run the returned JavaScript code and does not send the request to WAF again. The client fails JavaScript authentication.
If a crawler fabricates a WAF authentication request and sends it to WAF, WAF blocks the request. The client fails JavaScript authentication.
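
The three outcomes above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how WAF implements the check: the cookie names js_nonce and js_answer, the SHA-256 puzzle, and the functions solve_challenge and waf_handle are all hypothetical.

```python
import hashlib
import secrets

# Nonces handed out by the simulated WAF (hypothetical bookkeeping).
issued_nonces = set()


def solve_challenge(nonce: str) -> str:
    """Stand-in for the work the returned JavaScript code would do in a browser."""
    return hashlib.sha256(nonce.encode()).hexdigest()


def waf_handle(method: str, cookies: dict) -> tuple:
    """Return (action, nonce) for one request reaching the simulated WAF."""
    if method != "GET":
        # Per the constraints, only GET requests are checked.
        return ("forward", None)
    nonce = cookies.get("js_nonce")
    answer = cookies.get("js_answer")
    if answer is None:
        # JavaScript challenge: hand the client a fresh puzzle.
        nonce = secrets.token_hex(8)
        issued_nonces.add(nonce)
        return ("challenge", nonce)
    if nonce in issued_nonces and answer == solve_challenge(nonce):
        # JavaScript authentication passed: forward to the origin server.
        return ("forward", None)
    # Fabricated authentication request.
    return ("block", None)


# A browser runs the script and retries; a crawler never retries or forges.
action, nonce = waf_handle("GET", {})
retry = {"js_nonce": nonce, "js_answer": solve_challenge(nonce)}
print(action, waf_handle("GET", retry)[0])
print(waf_handle("GET", {"js_nonce": "made-up", "js_answer": "made-up"})[0])
```

A crawler that ignores the challenge simply never reaches the second call, which matches the case where the client does not send the request to WAF again.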

WAF counts the JavaScript challenge and JavaScript authentication responses to calculate how many requests JavaScript anti-crawler protection has defended against. In Figure 2, JavaScript anti-crawler protection has logged 18 events, 16 of which are JavaScript challenge responses and 2 of which are JavaScript authentication responses. Others indicates the number of WAF authentication requests fabricated by crawlers.
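
The arithmetic behind the Figure 2 example can be reproduced as follows. This is an illustrative sketch only: the event labels js_challenge, js_auth, and other are made-up names, not fields from WAF logs.

```python
from collections import Counter


def summarize(events):
    """Tally logged anti-crawler events by type and return (counts, total)."""
    counts = Counter(events)
    return counts, sum(counts.values())


# The Figure 2 example: 16 challenge responses and 2 authentication responses.
events = ["js_challenge"] * 16 + ["js_auth"] * 2
counts, total = summarize(events)
print(counts["js_challenge"], counts["js_auth"], counts["other"], total)
```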

Figure 2 Parameters of a JavaScript anti-crawler protection rule

Procedure

Log in to the management console.
In the upper left corner of the management console, select a region or project.
In the upper left corner, open the service list and choose Security > Web Application Firewall to go to the Dashboard page.
In the navigation pane on the left, choose Policies.
Click the name of the target policy to go to the protection configuration page.
In the Anti-Crawler configuration area, toggle on the function if needed. Then, click Configure Bot Mitigation.

Select the Feature Library tab and enable the protection by referring to Table 1.

A feature-based anti-crawler rule has two protective actions:

Block
WAF blocks and logs detected attacks.
Enabling this feature may have the following impacts:
- Blocking requests of search engines may affect your website SEO.
- Blocking scripts may block some applications because those applications may trigger anti-crawler rules if their user-agent field is not modified.
Log only
Detected attacks are logged only. This is the default protective action.

Scanner is enabled by default, but you can enable other protection types if needed.

**Table 1** Anti-crawler detection features
Type	Description	Remarks
Search Engine	This rule is used to block web crawlers, such as Googlebot and Baiduspider, from collecting content from your site.	If you enable this rule, WAF detects and blocks search engine crawlers. NOTE: If Search Engine is not enabled, WAF does not block POST requests from Googlebot or Baiduspider.
Scanner	This rule is used to block scanners, such as OpenVAS and Nmap. A scanner scans for vulnerabilities and viruses and performs other such jobs.	If you enable this rule, WAF detects and blocks scanners.
Script Tool	This rule is used to block script tools. A script tool is often used to execute automatic tasks and program scripts, such as HttpClient, OkHttp, and Python programs.	If you enable this rule, WAF detects and blocks the execution of automatic tasks and program scripts. NOTE: If your application uses scripts such as HttpClient, OkHttp, and Python, disable Script Tool. Otherwise, WAF will identify such script tools as crawlers and block the application.
Other	This rule is used to block crawlers used for other purposes, such as site monitoring, access proxies, and web page analysis. NOTE: To avoid being blocked by WAF, crawlers may use a large number of IP address proxies.	If you enable this rule, WAF detects and blocks crawlers that are used for various purposes.
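
To make the interaction between the enabled types and the protective action concrete, the following sketch classifies a request by its User-Agent header. It is an illustration under loud assumptions: the keyword lists are invented from the examples in Table 1, the real feature library is not described in this guide, and WAF may inspect more than the User-Agent field. The names SIGNATURES and evaluate are hypothetical.

```python
# Hypothetical keyword signatures built from the examples in Table 1.
SIGNATURES = {
    "Search Engine": ("googlebot", "baiduspider"),
    "Scanner": ("openvas", "nmap"),
    "Script Tool": ("httpclient", "okhttp", "python"),
}


def evaluate(user_agent: str, enabled_types: set, action: str) -> str:
    """Return 'pass', 'log', or 'block' for one request.

    action is the protective action configured for the rule:
    'block' (block and log) or 'log' (log only).
    """
    ua = user_agent.lower()
    for crawler_type, keywords in SIGNATURES.items():
        if crawler_type in enabled_types and any(k in ua for k in keywords):
            return action
    return "pass"


# Only Scanner is enabled by default, so a Python client passes until
# Script Tool is also enabled.
print(evaluate("python-requests/2.31.0", {"Scanner"}, "block"))
print(evaluate("python-requests/2.31.0", {"Scanner", "Script Tool"}, "block"))
```

The second call shows why an application built on HttpClient, OkHttp, or Python may be blocked once Script Tool is enabled with the Block action.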

Select the JavaScript tab and change Status if needed.

JavaScript anti-crawler is disabled by default. To enable it, click the Status toggle, and then click OK in the displayed dialog box.

Configure a JavaScript-based anti-crawler rule by referring to Table 2.

Two protection modes are provided: Protect all requests and Protect specified requests.

To protect all paths except a specified path
Set Protection Mode to Protect all paths. Then, click Exclude Path, configure the paths to exclude, and click Confirm.

To protect a specified path only
Set Protection Mode to Protect specified requests. Then, click Add Rule, configure the request rule, and click Confirm.
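
The difference between the two modes can be sketched as a single decision function. This is a minimal illustration, not the WAF matching engine: the mode labels "all" and "specified", the function name is_checked, and the treatment of the Include logic in Table 2 as a substring match are all assumptions.

```python
def is_checked(path: str, mode: str, rule_paths: list) -> bool:
    """Decide whether a GET request to path gets the JavaScript check.

    mode "all": every path is protected except the excluded rule_paths.
    mode "specified": only paths matching rule_paths are protected.
    Include logic is assumed to mean "the request path contains the rule path".
    """
    matched = any(rule in path for rule in rule_paths)
    if mode == "all":
        return not matched
    if mode == "specified":
        return matched
    raise ValueError(f"unknown protection mode: {mode}")


# With /admin configured, the two modes give opposite results.
print(is_checked("/admin/login", "all", ["/admin"]))
print(is_checked("/admin/login", "specified", ["/admin"]))
```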

**Table 2** Parameters of a JavaScript-based anti-crawler protection rule
Parameter	Description	Example Value
Rule Name	Name of the rule	wafjs
Path	A part of the URL, not including the domain name. A URL defines the address of a web page. The basic URL format is as follows: Protocol name://Domain name or IP address[:Port]/[Path/.../File name]. For example, if the URL is http://www.example.com/admin, set Path to /admin. NOTE: The path does not support regular expressions. The path cannot contain two or more consecutive slashes. For example, if you enter ///admin, WAF converts /// to /.	/admin
Logic	Select a logical relationship from the drop-down list.	Include
Rule Description	A brief description of the rule.	None
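
To work out the value for Path from a full URL, the following sketch may help. It uses Python's standard urllib.parse module to drop the protocol, domain name, and port, and it mimics the documented conversion of consecutive slashes to a single slash. The function names are hypothetical, and the code only approximates the behavior described in Table 2.

```python
import re
from urllib.parse import urlsplit


def normalize_path(path: str) -> str:
    """Collapse runs of two or more slashes into one, as Table 2 describes."""
    return re.sub(r"/{2,}", "/", path)


def path_from_url(url: str) -> str:
    """Return the Path value for a rule: the URL without protocol, host, or port."""
    path = urlsplit(url).path or "/"
    return normalize_path(path)


print(path_from_url("http://www.example.com/admin"))
print(normalize_path("///admin"))
```

Because the path does not support regular expressions, enter the resulting string literally.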

Related Operations

To disable a rule, click Disable in the Operation column of the rule. The default Rule Status is Enabled.
To modify a rule, click Modify in the row containing the rule.
To delete a rule, click Delete in the row containing the rule.
