Updated on 2024-10-31 GMT+08:00

Configuring Anti-Crawler Rules

You can configure website anti-crawler protection rules to defend against crawlers such as search engines, scanners, and script tools, and create custom JavaScript-based anti-crawler protection rules.

Prerequisites

A protected website has been added. For details, see Adding a Website to EdgeSec.

Constraints

  • Any browser used to access a website protected by anti-crawler rules must have cookies enabled and support JavaScript.
  • It takes several minutes for a new rule to take effect. After the rule takes effect, protection events triggered by the rule will be displayed on the Events page.
  • If your service is connected to CDN, exercise caution when using this function.

    CDN caching may impact Anti-Crawler performance and page accessibility.

How JavaScript Anti-Crawler Protection Works

Figure 1 shows the JavaScript anti-crawler detection process, which includes a JavaScript challenge (steps 1 and 2) and JavaScript authentication (step 3).

Figure 1 JavaScript Anti-Crawler protection process

If JavaScript anti-crawler is enabled when a client sends a request, EdgeSec returns a piece of JavaScript code to the client.

  • If the client is a normal browser, the received JavaScript code executes automatically and resends the request to EdgeSec, which then forwards the request to the origin server. This process is called JavaScript verification.
  • If the client is a crawler, it cannot execute the received JavaScript code and therefore does not resend the request to EdgeSec. The client fails JavaScript authentication.
  • If a crawler fabricates an EdgeSec authentication request and sends it to EdgeSec, EdgeSec blocks the request. The client fails JavaScript authentication.
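The verification flow above can be sketched as a simplified simulation. The signed-token scheme below is purely illustrative (it is not EdgeSec's actual mechanism): the point is that only a client that executes the returned JavaScript can resend a request that passes verification.

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # per-deployment secret (illustrative)

def issue_challenge() -> dict:
    """Steps 1-2: return a JS 'challenge' carrying a signed token."""
    nonce = secrets.token_hex(8)
    sig = hmac.new(SERVER_KEY, nonce.encode(), hashlib.sha256).hexdigest()
    return {"nonce": nonce, "sig": sig}

def browser_reply(challenge: dict) -> dict:
    """A real browser executes the JS and resends the request with the token."""
    return {"nonce": challenge["nonce"], "sig": challenge["sig"]}

def crawler_reply(challenge: dict):
    """A basic crawler cannot execute JS, so it never resends the request."""
    return None

def fake_reply(challenge: dict) -> dict:
    """A crawler fabricating the authentication request must guess the signature."""
    return {"nonce": challenge["nonce"], "sig": "0" * 64}

def verify(reply) -> bool:
    """Step 3: forward to the origin server only if the token checks out."""
    if reply is None:
        return False
    expected = hmac.new(SERVER_KEY, reply["nonce"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, reply["sig"])
```

In this sketch, `verify(browser_reply(c))` succeeds, while a silent crawler (`None`) and a fabricated reply both fail, matching the three cases listed above.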

By collecting statistics on the number of JavaScript challenge and authentication responses, the system calculates how many requests the JavaScript anti-crawler function defends against. In Figure 2, the JavaScript anti-crawler function has logged 18 events: 16 JavaScript challenge responses and 2 JavaScript authentication responses. Others indicates the number of EdgeSec authentication requests fabricated by crawlers.

Figure 2 Parameters of a JavaScript anti-crawler protection rule

EdgeSec only logs JavaScript challenge and JavaScript authentication events. No other protective actions can be configured for JavaScript challenge and authentication.

Procedure

  1. Log in to the management console.
  2. Click in the upper left corner of the page and choose Content Delivery & Edge Computing > CDN and Security.
  3. In the navigation pane on the left, choose Edge Security > Website Settings. The Website Settings page is displayed.
  4. In the Policy column of the row containing the domain name, click the number to go to the Policies page.

    Figure 3 Website list

  5. In the Anti-Crawler configuration area, toggle on the anti-crawler function, and then click Configure Bot Mitigation.

    Figure 4 Anti-Crawler configuration area

  6. Select the Feature Library tab and enable the protection by referring to Figure 5.

    A feature-based anti-crawler rule has two protective actions:
    • Block

      EdgeSec blocks and logs detected attacks.

    • Log only

      Detected attacks are logged only. This is the default protective action.

    Scanner is enabled by default, but you can enable other protection types if needed.
    Figure 5 Feature Library
    Table 1 Anti-crawler detection features

    • Search Engine

      Description: This rule is used to block web crawlers, such as Googlebot and Baiduspider, from collecting content from your site.

      Remarks: If you enable this rule, EdgeSec detects and blocks search engine crawlers.

      NOTE: If Search Engine is not enabled, EdgeSec does not block POST requests from Googlebot or Baiduspider. If you want to block POST requests from Baiduspider, use the configuration described in Configuration Example - Search Engine.

    • Scanner

      Description: This rule is used to block scanners, such as OpenVAS and Nmap, which scan for vulnerabilities, viruses, and other weaknesses.

      Remarks: If you enable this rule, EdgeSec detects and blocks scanner crawlers.

    • Script Tool

      Description: This rule is used to block script tools, which are often used to run automated tasks and program scripts, such as HttpClient, OkHttp, and Python programs.

      Remarks: If you enable this rule, EdgeSec detects and blocks automated tasks and program scripts.

      NOTE: If your application uses script tools such as HttpClient, OkHttp, or Python, disable Script Tool. Otherwise, EdgeSec will identify them as crawlers and block the application.

    • Other

      Description: This rule is used to block crawlers used for other purposes, such as site monitoring, access proxies, and web page analysis.

      Remarks: If you enable this rule, EdgeSec detects and blocks crawlers used for these purposes.

      NOTE: To avoid being blocked by EdgeSec, crawlers may use a large number of proxy IP addresses.
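Script-tool detection typically keys on the default User-Agent strings that HTTP client libraries advertise, which is why an application built on such a library can be misidentified as a crawler. As a minimal illustration, Python's standard urllib opener identifies itself as `Python-urllib/<version>`; the custom User-Agent value below is hypothetical:

```python
import urllib.request

# The default opener advertises the library, e.g. "Python-urllib/3.12" --
# exactly the kind of signature a script-tool rule matches on.
opener = urllib.request.build_opener()
default_ua = dict((k.lower(), v) for k, v in opener.addheaders)["user-agent"]
print(default_ua)

# A descriptive custom User-Agent (hypothetical value) replaces the default,
# so the application no longer presents a bare library signature.
opener.addheaders = [("User-agent", "ExampleApp/1.0 (+https://www.example.com)")]
```

This is only a sketch of why the Script Tool note above recommends disabling the rule when your own application relies on such libraries.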

  7. Select the JavaScript tab and configure Status and Protective Action.

    JavaScript anti-crawler is disabled by default. To enable it, toggle it on and click OK in the displayed dialog box.

    • Any browser used to access a website protected by anti-crawler rules must have cookies enabled and support JavaScript.
    • If your service is connected to CDN, exercise caution when using the JS anti-crawler function.

      CDN caching may impact JS anti-crawler performance and page accessibility.

  8. Configure a JavaScript-based anti-crawler rule by referring to Table 2.

    Two protection modes are provided: Protect all requests and Protect specified requests.

    • To protect all requests except requests that hit a specified rule
      Set Protection Mode to Protect all requests. Then, click Exclude Rule, configure the request exclusion rule, and click Confirm.
      Figure 6 Exclude Path
    • To protect a specified request only

      Set Protection Mode to Protect specified requests, click Add Rule, configure the request rule, and click Confirm.

      Figure 7 Add Rule
    Table 2 Parameters of a JavaScript-based anti-crawler protection rule

    • Rule Name

      Name of the rule.

      Example value: EdgeSec

    • Rule Description

      A brief description of the rule. This parameter is optional.

      Example value: -

    • Effective Date

      Time when the rule takes effect.

      Example value: Immediate

    • Condition List

      Parameters for configuring a condition:

      • Field: Select the field you want to protect from the drop-down list. Currently, only Path and User Agent are supported.
      • Subfield
      • Logic: Select a logical relationship from the drop-down list.

        NOTE: If you select Include any value, Exclude any value, Equal to any value, Not equal to any value, Prefix is any value, Prefix is not any of them, Suffix is any value, or Suffix is not any of them, a reference table must be selected for Content. For details about reference tables, see Creating a Reference Table.

      • Content: Enter or select the content that matches the condition.

      Example value: Path Include /admin

    • Priority

      Rule priority. If you have added multiple rules, they are matched by priority. A smaller value indicates a higher priority.

      Example value: 5
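The condition list and priority in Table 2 can be sketched as a simple matcher. The rule structure and operator names below are illustrative, not EdgeSec's internal format:

```python
def matches(rule: dict, request: dict) -> bool:
    """Evaluate one condition: field + logic + content."""
    value = request.get(rule["field"], "")
    if rule["logic"] == "Include":
        return rule["content"] in value
    if rule["logic"] == "Prefix is":
        return value.startswith(rule["content"])
    return False

def first_match(rules: list, request: dict):
    """Rules are matched by priority; a smaller value means higher priority."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if matches(rule, request):
            return rule["name"]
    return None

# Two hypothetical rules, mirroring the "Path Include /admin" example value.
rules = [
    {"name": "protect-admin", "field": "Path", "logic": "Include",
     "content": "/admin", "priority": 5},
    {"name": "protect-api", "field": "Path", "logic": "Prefix is",
     "content": "/api", "priority": 10},
]
print(first_match(rules, {"Path": "/admin/login"}))  # protect-admin
```

A request for /admin/login hits the priority-5 rule first; a request matching neither condition falls through to no rule.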

Other Operations

  • To modify a rule, click Modify in the row containing the rule.
  • To delete a rule, click Delete in the row containing the rule.

Configuration Example - Logging Script Crawlers Only

To verify that EdgeSec protects the domain name www.example.com against script crawlers:

  1. Execute a script tool to crawl web page content.
  2. On the Feature Library tab, enable Script Tool and select Log only for Protective Action. (If EdgeSec detects an attack, it logs the attack only.)

    Figure 8 Enabling Script Tool

  3. Enable anti-crawler protection.

    Figure 9 Anti-Crawler configuration area

  4. In the navigation pane on the left, choose Events to go to the Events page.

    Figure 10 Viewing Events - Script crawlers
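The difference between the two protective actions in this example can be sketched as follows. The signature list and status codes are illustrative, not EdgeSec internals:

```python
events = []  # stands in for the Events page

SCRIPT_TOOL_SIGNATURES = ("python-urllib", "python-requests", "okhttp", "httpclient")

def handle_request(user_agent: str, action: str) -> int:
    """Return 403 when a script tool is blocked, otherwise 200.

    With "log_only", the event is recorded but the request still passes
    through, which is what the Log only action in this example does.
    """
    if any(sig in user_agent.lower() for sig in SCRIPT_TOOL_SIGNATURES):
        events.append({"user_agent": user_agent, "action": action})
        if action == "block":
            return 403
    return 200
```

With Log only, a crawl from a script tool still receives its response, but an event appears in the log; with Block, the same request is rejected.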

Configuration Example - Search Engine

The following shows how to allow search engine crawlers from Baidu and Google while blocking POST requests from Baiduspider.

  1. Set Status of Search Engine by referring to the instructions in 5.
  2. Configure a precise protection rule by referring to Configuring a Precise Protection Rule.

    Figure 11 Blocking POST requests
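A precise protection rule like the one in Figure 11 amounts to matching on the request method and the User-Agent header together. The matching logic below is an illustrative sketch, not the actual rule engine:

```python
def should_block(method: str, user_agent: str) -> bool:
    """Block POST requests from Baiduspider while allowing its GET
    crawling and all other traffic (illustrative logic)."""
    return method.upper() == "POST" and "baiduspider" in user_agent.lower()

print(should_block("GET", "Mozilla/5.0 (compatible; Baiduspider/2.0)"))   # False
print(should_block("POST", "Mozilla/5.0 (compatible; Baiduspider/2.0)"))  # True
```

GET crawling by Baiduspider and all Googlebot requests pass through; only Baiduspider POST requests are blocked, matching the goal of this example.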