ALM-45745 Average RPC Queuing Time of the Guardian TokenServer Exceeds the Threshold

Alarm Description

The system checks the average RPC queuing time of the TokenServer service every 30 seconds. This alarm is generated when the average RPC queuing time of a TokenServer instance exceeds the threshold for five consecutive checks.

This alarm is cleared when the system detects that the average RPC queuing time falls below the threshold.
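
The trigger and clear rules above can be sketched as a small state machine. This is only an illustration, not the monitoring system's actual implementation: the class name `ConsecutiveThresholdAlarm` and its fields are hypothetical, and clearing on the first sample at or below the threshold is an assumption about how "falls below the threshold" is applied.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveThresholdAlarm:
    """Hypothetical model: raise after N consecutive samples above the threshold."""

    threshold_ms: float
    required_hits: int = 5  # five consecutive checks, per the description
    consecutive_hits: int = 0
    active: bool = False

    def observe(self, avg_queue_time_ms: float) -> bool:
        """Feed one 30-second sample and return whether the alarm is active."""
        if avg_queue_time_ms > self.threshold_ms:
            self.consecutive_hits += 1
            if self.consecutive_hits >= self.required_hits:
                self.active = True
        else:
            # Assumption: one sample at or below the threshold clears the alarm.
            self.consecutive_hits = 0
            self.active = False
        return self.active


# Example with the default Major threshold (200 ms); sample values are made up.
alarm = ConsecutiveThresholdAlarm(threshold_ms=200)
samples = [250, 260, 240, 230, 220, 180]
states = [alarm.observe(s) for s in samples]
print(states)  # [False, False, False, False, True, False]
```

With checks every 30 seconds, five consecutive exceedances mean the condition must persist for roughly two minutes before the alarm is raised.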

This alarm applies only to MRS 3.5.0 or later.

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
45745	Critical (default threshold: 300 ms) Major (default threshold: 200 ms)	Yes
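
As a rough illustration of how the two default thresholds map to severities, the sketch below classifies a single average queuing time. The function name `classify_severity` is hypothetical, and the strict comparison (a value must exceed the threshold, not merely equal it) is an assumption based on the wording "exceeds the threshold".

```python
from typing import Optional

CRITICAL_THRESHOLD_MS = 300  # default Critical threshold from the table
MAJOR_THRESHOLD_MS = 200     # default Major threshold from the table


def classify_severity(
    avg_queue_time_ms: float,
    critical_ms: float = CRITICAL_THRESHOLD_MS,
    major_ms: float = MAJOR_THRESHOLD_MS,
) -> Optional[str]:
    """Return the severity a value would fall under, or None if below both thresholds."""
    if avg_queue_time_ms > critical_ms:
        return "Critical"
    if avg_queue_time_ms > major_ms:
        return "Major"
    return None


print(classify_severity(350))  # Critical
print(classify_severity(250))  # Major
print(classify_severity(150))  # None
```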

Alarm Parameters

Type	Parameter	Description
Location Information	Source	Specifies the cluster for which the alarm was generated.
	ServiceName	Specifies the service for which the alarm was generated.
	RoleName	Specifies the role for which the alarm was generated.
	HostName	Specifies the host for which the alarm was generated.
Additional Information	Trigger Condition	Specifies the alarm triggering condition.

Impact on the System

If the average RPC queuing time of the Guardian TokenServer instance exceeds the threshold, access to OBS may slow down or fail entirely.

Possible Causes

The alarm threshold is improperly configured.
The memory configured for the Guardian TokenServer instance is too small, and the JVM stalls because of frequent full garbage collection.

Handling Procedure

Check whether the alarm threshold is set properly.

1. Log in to FusionInsight Manager and choose O&M > Alarm > Alarms. In the Location field of the alarm details, note the host name of the TokenServer instance for which this alarm was generated.
2. On FusionInsight Manager, choose Cluster > Services > Guardian. On the page that is displayed, click the Instances tab and click the TokenServer role for the host name obtained in Step 1. Click the drop-down list in the upper right corner of the Chart area and select Customize. On the Customize Statistics page, choose RPC > Average Time of TokenServer RPC Queuing, and click OK.
3. Check whether the average RPC queuing time of TokenServer reaches the alarm threshold (300 ms for critical alarms and 200 ms for major alarms).
- If yes, go to Step 4.
- If no, go to Step 6.
4. On FusionInsight Manager, choose O&M > Alarm > Thresholds, select the desired cluster, choose Guardian > RPC, and click Average Time of TokenServer RPC Queuing. In the right pane, locate the default rule and click Modify in the Operation column. On the Modify Rule page, change the threshold for the Critical or Major alarm severity to 150% of the peak value observed within one day of the alarm being generated, and click OK.
5. Wait 5 minutes and check whether the alarm is automatically cleared.
- If yes, no further action is required.
- If no, go to Step 6.
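
The threshold adjustment in Step 4 is simple arithmetic: take the peak of the monitored values and multiply by 1.5. The sketch below assumes the day's values have already been read off the chart into a list; the function name `suggested_threshold_ms` and the sample values are hypothetical.

```python
from typing import Iterable


def suggested_threshold_ms(samples_ms: Iterable[float], factor: float = 1.5) -> float:
    """Return 150% (by default) of the peak average RPC queuing time."""
    values = list(samples_ms)
    if not values:
        raise ValueError("at least one sample is required")
    return max(values) * factor


# Made-up one-day readings of the average RPC queuing time, in milliseconds.
readings = [120, 340, 280]
print(suggested_threshold_ms(readings))  # 510.0
```

Here the peak is 340 ms, so the suggested threshold is 510 ms.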

Check whether the memory of the Guardian TokenServer is too small.

6. On FusionInsight Manager, choose O&M > Alarm > Alarms and check whether the alarm ALM-45737 TokenServer Heap Memory Usage Exceeds the Threshold is reported for the TokenServer instance.
- If yes, go to Step 7.
- If no, go to Step 9.
7. Rectify the fault by following the handling procedure of ALM-45737 TokenServer Heap Memory Usage Exceeds the Threshold.
8. Wait 10 minutes and check whether the alarm is automatically cleared.
- If yes, no further action is required.
- If no, go to Step 9.

Collect fault information.

9. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
10. Expand the Service drop-down list, and select Guardian for the target cluster.
11. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
12. Contact O&M engineers and provide the collected logs.
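
The log collection window described above (10 minutes before and after the alarm generation time) can be computed as follows. This is a small sketch; the function name `log_collection_window` and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_collection_window(
    alarm_time: datetime, margin: timedelta = timedelta(minutes=10)
) -> Tuple[datetime, datetime]:
    """Return (start, end) covering `margin` before and after the alarm time."""
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time.
start, end = log_collection_window(datetime(2024, 1, 15, 14, 30))
print(start)  # 2024-01-15 14:20:00
print(end)    # 2024-01-15 14:40:00
```

The two values correspond to the Start Date and End Date fields in the log download dialog.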