创建大量ZNode后ZooKeeper Server启动失败
问题
创建大量ZNode后,ZooKeeper集群处于故障状态不能自动恢复,尝试重启失败,ZooKeeper Server日志显示如下内容:
Follower:
2016-06-23 08:00:18,763 | WARN | QuorumPeer[myid=26](plain=/10.16.9.138:2181)(secure=disabled) | Exception when following the leader | org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:93) java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:156) at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:276) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1094)
2016-06-23 08:00:18,764 | INFO | QuorumPeer[myid=26](plain=/10.16.9.138:2181)(secure=disabled) | shutdown called | org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:198) java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:198) at org.apache.zookeeper.server.quorum.QuorumPeer.stopFollower(QuorumPeer.java:1141) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1098)
Leader:
2016-06-23 07:30:57,481 | WARN | QuorumPeer[myid=25](plain=/10.16.9.136:2181)(secure=disabled) | Unexpected exception | org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1108) java.lang.InterruptedException: Timeout while waiting for epoch to be acked by quorum at org.apache.zookeeper.server.quorum.Leader.waitForEpochAck(Leader.java:1221) at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:487) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1105)
2016-06-23 07:30:57,482 | INFO | QuorumPeer[myid=25](plain=/10.16.9.136:2181)(secure=disabled) | Shutdown called | org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:623) java.lang.Exception: shutdown Leader! reason: Forcing shutdown at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:623) at org.apache.zookeeper.server.quorum.QuorumPeer.stopLeader(QuorumPeer.java:1149) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1110)
回答
创建大量节点后,follower与leader同步时数据量大,在集群数据同步限定时间内不能完成同步过程,导致超时,各个ZooKeeper Server启动失败。
参考修改集群服务配置参数章节,进入ZooKeeper服务“全部配置”页面。不断尝试调大ZooKeeper配置文件“zoo.cfg”中的“syncLimit”和“initLimit”两参数值,直到ZooKeeperServer正常。
参数 |
描述 |
默认值 |
---|---|---|
syncLimit |
follower与leader进行同步的时间间隔(时长为ticket时长的倍数)。如果在该时间范围内leader没响应,连接将不能被建立。 |
15 |
initLimit |
follower连接到leader并与leader同步的时间(时长为ticket时长的倍数)。 |
15 |
如果将参数“initLimit”和“syncLimit”的参数值均配置为“300”之后,ZooKeeper Server仍然无法恢复,则需确认没有其他应用程序正在kill ZooKeeper。例如,参数值为“300”,ticket时长为2000毫秒,即同步限定时间为300*2000ms=600s。
可能存在以下场景,在ZooKeeper中创建的数据过大,需要大量时间与leader同步,并保存到硬盘。在这个过程中,如果ZooKeeper需要运行很长时间,则需确保没有其他监控应用程序kill ZooKeeper而判断其服务停止。