Error content:
2019-06-05 02:09:03.008888 W | rafthttp: health check for peer 8816eaa680e63c73 could not connect: dial tcp 192.168.49.138:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-06-05 02:09:03.010827 W | rafthttp: health check for peer 8816eaa680e63c73 could not connect: dial tcp 192.168.49.138:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05 02:09:04.631367 I | rafthttp: peer 8816eaa680e63c73 became active
2019-06-05 02:09:04.631405 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream MsgApp v2 reader)
2019-06-05 02:09:04.632227 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream Message reader)
2019-06-05 02:09:04.634697 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream MsgApp v2 writer)
2019-06-05 02:09:04.635154 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream Message writer)
2019-06-05 02:09:04.961320 I | etcdserver: updating the cluster version from 3.0 to 3.3
2019-06-05 02:09:04.965052 N | etcdserver/membership: updated the cluster version from 3.0 to 3.3
2019-06-05 02:09:04.965231 I | etcdserver/api: enabled capabilities for version 3.3
2019-06-05 02:20:39.344648 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 237.022208ms, to a3d1fb0d28ed2953)
2019-06-05 02:20:39.344676 W | etcdserver: server is likely overloaded
2019-06-05 02:20:39.344685 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 237.127928ms, to 8816eaa680e63c73)
2019-06-05 02:20:39.344689 W | etcdserver: server is likely overloaded
The main error message is: failed to send out heartbeat on time (exceeded the 100ms timeout for 401.80886ms)
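When triaging a long log, it can help to pull out only the heartbeat warnings and see which peers are affected and by how much the timeout was exceeded. The following is a minimal sketch, not part of etcd: the function name parse_heartbeat_warnings is hypothetical, and the pattern assumes the message format shown in the log above, with durations printed in milliseconds.

```python
import re

# Assumption: durations are printed in "ms", as in the log excerpt above.
# Lines that report the duration in another unit are simply not matched.
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>[\d.]+)ms timeout for (?P<over>[\d.]+)ms"
    r"(?:, to (?P<peer>[0-9a-f]+))?\)"
)


def parse_heartbeat_warnings(lines):
    """Return a list of (peer_id, timeout_ms, exceeded_ms) tuples.

    peer_id is None when the message does not name a peer.
    """
    results = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            results.append(
                (
                    match.group("peer"),
                    float(match.group("timeout")),
                    float(match.group("over")),
                )
            )
    return results
```

Feeding it the lines of the log file gives one tuple per warning, which can then be grouped by peer or sorted by the exceeded duration.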
This heartbeat warning is mainly related to three factors: disk speed, CPU performance, and network instability.
etcd uses the Raft algorithm, in which the leader periodically sends heartbeats to every follower. If the leader fails to send a heartbeat to a follower for two consecutive heartbeat intervals, etcd prints this warning. The first and most common reason is a slow disk. The leader usually attaches some metadata to the heartbeat and must persist that metadata to disk before sending it. The disk write may be competing with other applications, or the disk itself may be too slow, for example a virtualized disk or a SATA drive. In that case, only faster disk hardware solves the problem.
etcd exposes the Prometheus metric wal_fsync_duration_seconds (full name etcd_disk_wal_fsync_duration_seconds), which records how long fsync calls on the WAL take. As a rule of thumb, this value should stay below 10ms; the etcd documentation applies that threshold to the 99th percentile.
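Because wal_fsync_duration_seconds is a Prometheus histogram, the 10ms check is done on a quantile estimated from cumulative buckets. The sketch below illustrates that estimate with linear interpolation inside a bucket, similar in spirit to Prometheus' histogram_quantile; it is not Prometheus' implementation, and the bucket counts are made-up sample values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, ending with an upper bound of math.inf ("le=+Inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for WAL fsync latency (seconds).
sample_buckets = [
    (0.001, 500),
    (0.002, 900),
    (0.004, 980),
    (0.008, 995),
    (0.016, 1000),
    (math.inf, 1000),
]

p99 = histogram_quantile(0.99, sample_buckets)
disk_looks_ok = p99 < 0.010  # the 10ms guideline from the text
```

With these sample counts the estimated p99 falls between 4ms and 8ms, so the disk would pass the 10ms guideline. In a real setup the same check is normally written as a Prometheus query over the metric rather than computed by hand.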
The second reason is insufficient CPU capacity. If the monitoring system shows that CPU utilization is consistently high, move etcd to a more powerful machine, then either use cgroups to reserve some cores for the etcd process or raise the priority of the etcd process.
The third reason may be a slow network. If Prometheus shows poor network quality, such as high latency or a high packet loss rate, moving etcd to a less congested network solves the problem. However, if etcd is deployed across data centers, high latency is unavoidable. In that case, set the heartbeat interval (--heartbeat-interval) according to the RTT between the data centers, and set the election timeout (--election-timeout) to at least 5 times the heartbeat interval.
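The tuning rule above is simple arithmetic, sketched here for illustration. The helper suggest_raft_timing is hypothetical, and flooring the heartbeat at etcd's default of 100ms is an assumption of this sketch, not a rule from etcd; both etcd flags take values in milliseconds.

```python
import math


def suggest_raft_timing(rtt_ms, min_heartbeat_ms=100, election_factor=5):
    """Suggest (heartbeat_interval_ms, election_timeout_ms) from a measured RTT.

    heartbeat interval: roughly the RTT between members, never below
    min_heartbeat_ms (assumption: do not go below etcd's 100ms default).
    election timeout: election_factor times the heartbeat interval,
    which is the minimum the rule allows; a larger value is also valid.
    """
    heartbeat = max(min_heartbeat_ms, math.ceil(rtt_ms))
    election = election_factor * heartbeat
    return heartbeat, election


# Example: a measured RTT of 180ms between two data centers.
heartbeat_ms, election_ms = suggest_raft_timing(180)
flags = f"--heartbeat-interval={heartbeat_ms} --election-timeout={election_ms}"
```

The resulting values should be applied consistently to every member of the cluster, since members with mismatched timing settings can make the cluster unstable.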