ALM-19006 HBase Replication Sync Failed (For MRS 2.x or Earlier)
Description
This alarm is generated when disaster recovery (DR) data fails to be synchronized to a standby cluster.
This alarm is cleared when DR data synchronization succeeds.
Attribute
Alarm ID | Alarm Severity | Auto Clear |
---|---|---|
19006 | Major | Yes |
Parameters
Parameter | Description |
---|---|
ServiceName | Specifies the service for which the alarm is generated. |
RoleName | Specifies the role for which the alarm is generated. |
HostName | Specifies the host for which the alarm is generated. |
Impact on the System
HBase data in the active cluster fails to be synchronized to the standby cluster, causing data inconsistency between the active and standby clusters.
Possible Causes
- The HBase service on the standby cluster is abnormal.
- The network is abnormal.
Procedure
- Observe whether the system automatically clears the alarm.
- Go to the cluster details page and choose Alarms.
- In the alarm list, click the alarm and obtain the alarm generation time from Generated Time in Alarm Details. Check whether the alarm has existed for more than 5 minutes.
- Wait 5 minutes and check whether the alarm is automatically cleared.
- If yes, no further action is required.
- If no, go to 2.a.
- Check the HBase service status of the standby cluster.
- Go to the cluster details page and choose Alarms.
- In the alarm list, click the alarm and obtain HostName from Location in Alarm Details.
- Log in to the node where the HBase client of the active cluster is located. Run the following command to switch to user omm:
su - omm
- In the HBase shell, run the status 'replication', 'source' command to check the synchronization status of the faulty node.
The synchronization status of a node is as follows.
10-10-10-153:
SOURCE: PeerID=abc, SizeOfLogQueue=0, ShippedBatches=2, ShippedOps=2, ShippedBytes=320, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=0, TimeStampsOfLastShippedOp=Mon Jul 18 09:53:28 CST 2016, Replication Lag=0, FailedReplicationAttempts=0
SOURCE: PeerID=abc1, SizeOfLogQueue=0, ShippedBatches=1, ShippedOps=1, ShippedBytes=160, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=16788, TimeStampsOfLastShippedOp=Sat Jul 16 13:19:00 CST 2016, Replication Lag=16788, FailedReplicationAttempts=5
- Obtain the PeerID of each record whose FailedReplicationAttempts value is greater than 0.
In the preceding output, data on the faulty node 10-10-10-153 fails to be synchronized to the standby cluster whose PeerID is abc1.
- Run the list_peers command to find the cluster and the HBase instance corresponding to the PeerID.
PEER_ID CLUSTER_KEY STATE TABLE_CFS
abc1 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase2 ENABLED
abc 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase ENABLED
In the preceding information, /hbase2 indicates that data is synchronized to the HBase2 instance of the standby cluster.
- In the service list of the standby cluster, check whether the health status of the HBase instance obtained in 2.f is Good.
- In the alarm list, check whether the alarm ALM-19000 HBase Service Unavailable exists.
- If the alarm exists, rectify the fault by following the steps provided in ALM-19000 HBase Service Unavailable.
- Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 3.a.
- Check the network connection between RegionServers on active and standby clusters.
- Go to the cluster details page and choose Alarms.
- In the alarm list, click the alarm and obtain HostName from Location in Alarm Details.
- Log in to the faulty RegionServer node.
- Run the ping command to check whether the network connection between the faulty RegionServer node and the host where RegionServer of the standby cluster resides is normal.
- If the network connection is abnormal, contact the O&M personnel to restore the network.
- After the network recovers, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 4.
- Collect fault information.
- On MRS Manager, choose .
- Contact the O&M engineers and send the collected logs.
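The manual inspection in steps 2.d to 2.f (finding the PeerID with failed attempts and mapping it to an HBase instance) can be approximated with a small script. The following is a minimal sketch, not part of MRS: it assumes the output of status 'replication', 'source' and list_peers has been saved as text in the format shown above, and the names failed_peers, peer_znodes, STATUS_OUTPUT, and LIST_PEERS_OUTPUT are hypothetical.

```python
import re

# Sample text trimmed from the command output shown in step 2 (assumption:
# real output has the same "SOURCE: key=value, ..." layout).
STATUS_OUTPUT = """10-10-10-153:
SOURCE: PeerID=abc, SizeOfLogQueue=0, Replication Lag=0, FailedReplicationAttempts=0
SOURCE: PeerID=abc1, SizeOfLogQueue=0, Replication Lag=16788, FailedReplicationAttempts=5
"""

LIST_PEERS_OUTPUT = """PEER_ID CLUSTER_KEY STATE TABLE_CFS
abc1 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase2 ENABLED
abc 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase ENABLED
"""


def failed_peers(status_text):
    """Return {peer_id: attempts} for records with FailedReplicationAttempts > 0."""
    failed = {}
    # Every record starts with "SOURCE:", so splitting on it also works
    # when several records share one line.
    for record in status_text.split("SOURCE:")[1:]:
        peer = re.search(r"PeerID=([^,\s]+)", record)
        attempts = re.search(r"FailedReplicationAttempts=(\d+)", record)
        if peer and attempts and int(attempts.group(1)) > 0:
            failed[peer.group(1)] = int(attempts.group(1))
    return failed


def peer_znodes(list_peers_text):
    """Map each PEER_ID to the znode path at the end of its CLUSTER_KEY."""
    znodes = {}
    for line in list_peers_text.splitlines():
        fields = line.split()
        # Data rows carry a CLUSTER_KEY of the form hosts:port:/znode;
        # the header row has no colons and is skipped.
        if len(fields) >= 2 and fields[1].count(":") >= 2:
            znodes[fields[0]] = fields[1].rsplit(":", 1)[1]
    return znodes


if __name__ == "__main__":
    znodes = peer_znodes(LIST_PEERS_OUTPUT)
    for peer_id, attempts in failed_peers(STATUS_OUTPUT).items():
        print(peer_id, attempts, znodes.get(peer_id, "unknown"))
```

With the sample text, the script reports peer abc1 together with /hbase2, which matches the HBase2 instance identified manually in step 2.f. If the status output covers several nodes, a PeerID that fails on more than one node appears only once in the result, so the per-node detail still has to be read from the raw output.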
Reference
None