How to categorize this issue?
/area high-availability
/kind bug
/kind regression
What happened:
It has been observed that the bootstrap case (0->3 replicas) of an HA etcd cluster can fail and lead to a split-brain problem in the HA etcd cluster.
Consider this scenario:
- Bootstrap case: an HA etcd cluster with `replicas=3`; assume that one of the etcd pod members was unable to come up for some time for any reason (scheduling, etc.)
- The other two etcd pod members were able to come up and form the cluster (electing a leader once quorum was reached)
- This led the backup-restore sidecar of the etcd leader to take the first full snapshot and upload it to the bucket:
```
2025-03-18 13:37:55 | {"log":"Creating snapstore from provider: Swift","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"backup-restore started leading...","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"backup-restore became: Leader","severity":"INFO"}
...
2025-03-18 13:37:56 | {"log":"Applied watch on etcd from revision: 2","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Successfully saved full snapshot at: Full-00000000-00000001-1742305076.gz","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Total time to save Full snapshot: 0.182602 seconds.","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Manifest object uploaded successfully.","severity":"INFO"}
```
- Meanwhile, the third pod member (which was stuck in step 1) also started coming up and triggered the usual initialization, but this time a full snapshot was present in the bucket, so it triggered a restoration from the bucket.
- This restoration from the full snapshot can lead this member to start its own cluster (although there will be a ClusterID mismatch error in the etcd logs).
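The race above can be boiled down to the initializer's decision logic. The following Go sketch is a hypothetical simplification (the real etcd-backup-restore initializer is far more involved, and the function and its parameters here are illustrative only); it shows how "a full snapshot exists in the bucket" can win during bootstrap and send a lagging member down the restoration path instead of letting it join its peers:

```go
package main

import "fmt"

// decideInitialization is a hypothetical simplification of the initializer's
// choice for a member whose local data directory is empty or invalid.
// snapshotInBucket reports whether the leader has already uploaded a full snapshot.
func decideInitialization(dataDirValid, snapshotInBucket bool) string {
	if dataDirValid {
		// Nothing to do; just start etcd with the existing data directory.
		return "start-etcd"
	}
	if snapshotInBucket {
		// Bug surface: during bootstrap this member should simply join the
		// freshly formed cluster, but because the leader's first full
		// snapshot already landed in the bucket, it restores from it and
		// bootstraps its own single-member cluster -> split brain.
		return "restore-from-snapshot"
	}
	// Empty bucket: start fresh and join the peers from the initial cluster.
	return "start-fresh"
}

func main() {
	// t=13:37:55 - bucket still empty: the member would have started fresh.
	fmt.Println(decideInitialization(false, false))
	// t=13:37:56 - the leader's full snapshot exists: the member restores.
	fmt.Println(decideInitialization(false, true))
}
```

The one-second difference between the two calls mirrors the timestamps in the logs above: the outcome flips purely because the leader's upload won the race.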
Logs of backup-restore while doing the restoration:
```
2025-03-18 13:37:55 | {"log":"Member etcd-main-2 part of running cluster","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"Checking whether the backup bucket is empty or not...","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"Temporary directory /var/etcd/data/temp does not exist. Creating it...","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"No snapshot found. BackupBucket is empty","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Responding to status request with: Progress","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Finding latest set of snapshot to recover from...","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Removing directory(/var/etcd/data/new.etcd.part).","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Creating temporary directory /var/etcd/data/restoration.temp for persisting full and delta snapshots locally.","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Restoring from base snapshot: Full-00000000-00000001-1742305076.gz","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"start decompressing the snapshot with gzip compressionPolicy","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Fetched the snapshot from the object store in 0.042820924 seconds","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Successfully fetched and saved data of the base snapshot in 0.042820924 seconds [CompressionPolicy:gzip]","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"No delta snapshots present over base snapshot.","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Successfully restored from base snapshot: Full-00000000-00000001-1742305076.gz","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Removing directory(/var/etcd/data/new.etcd).","severity":"INFO"}
```
The cluster state will look like this:
```
+------------------+-----------+-------------+----------------------------------------------------------------------------+----------------------------------------------------------------------------+------------+
|        ID        |  STATUS   |    NAME     |                                 PEER ADDRS                                 |                                CLIENT ADDRS                                | IS LEARNER |
+------------------+-----------+-------------+----------------------------------------------------------------------------+----------------------------------------------------------------------------+------------+
| 4f318e68482038f4 | started   | etcd-main-0 | https://etcd-main-0.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2380 | https://etcd-main-0.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 | false      |
| 8744d798b794c6ab | started   | etcd-main-1 | https://etcd-main-1.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2380 | https://etcd-main-1.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 | false      |
| cd4ba37ba8467c6f | unstarted |             | https://etcd-main-2.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2380 |                                                                            | false      |
+------------------+-----------+-------------+----------------------------------------------------------------------------+----------------------------------------------------------------------------+------------+
```
```
I have no name!@etcd-main-0:/$ etcdctl endpoint status --cluster -w table
+----------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|                                  ENDPOINT                                  |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd-main-0.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 | 4f318e68482038f4 | 3.4.34  | 18 MB   | false     | false      |        33 |    4094554 |            4094554 |        |
| https://etcd-main-1.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 | 8744d798b794c6ab | 3.4.34  | 18 MB   | true      | false      |        33 |    4094555 |            4094555 |        |
+----------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
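Note that the member that split off does not show up in `endpoint status --cluster` at all, because it belongs to a different cluster with a different cluster ID. One way to surface this earlier would be to compare the cluster ID that each member reports in its status response header. This Go sketch is hypothetical (the `MemberStatus` type and `splitBrain` helper are not part of etcd-backup-restore); in practice the cluster IDs would come from each member's `etcdctl endpoint status -w json` output (`Status.header.cluster_id`):

```go
package main

import "fmt"

// MemberStatus is a hypothetical, trimmed-down view of what
// `etcdctl endpoint status -w json` reports per endpoint.
type MemberStatus struct {
	Endpoint  string
	ClusterID uint64 // Status.header.cluster_id
}

// splitBrain returns true if the queried members do not all
// agree on a single cluster ID.
func splitBrain(statuses []MemberStatus) bool {
	ids := map[uint64]bool{}
	for _, s := range statuses {
		ids[s.ClusterID] = true
	}
	return len(ids) > 1
}

func main() {
	// Illustrative values only: etcd-main-2 restored from the snapshot and
	// bootstrapped its own cluster, so its cluster ID differs.
	statuses := []MemberStatus{
		{Endpoint: "etcd-main-0:2379", ClusterID: 0xaaaa},
		{Endpoint: "etcd-main-1:2379", ClusterID: 0xaaaa},
		{Endpoint: "etcd-main-2:2379", ClusterID: 0xbbbb},
	}
	fmt.Println(splitBrain(statuses))
}
```

Such a check would flag the split brain immediately, instead of it staying hidden behind an `AllMembersReady` status until the stray member restarts.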
What you expected to happen:
Bootstrap of the etcd cluster should succeed, with all members joining the same cluster.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
The split-brain problem will not be noticed, since the Etcd resource status reports `AllMembersReady`, until the member that formed the separate cluster restarts; after that, it can't come up:
```
time="2025-03-24T03:08:48Z" level=info msg="Responding to status request with: New" actor=backup-restore-server
time="2025-03-24T03:08:48Z" level=info msg="Received start initialization request." actor=backup-restore-server
time="2025-03-24T03:08:48Z" level=info msg="Updating status from New to Progress" actor=backup-restore-server
...
time="2025-03-24T03:08:48Z" level=info msg="Validation mode: sanity" actor=backup-restore-server
time="2025-03-24T03:08:48Z" level=info msg="Checking if member etcd-main-2 is part of a running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Member etcd-main-2 not part of any running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Could not find member etcd-main-2 in the list" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="member heartbeat is not present" actor=initializer
time="2025-03-24T03:08:48Z" level=info msg="backup-restore will start the scale-up check" actor=initializer
time="2025-03-24T03:08:48Z" level=info msg="Checking whether etcd cluster is marked for scale-up" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Checking if member etcd-main-2 is part of a running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Member etcd-main-2 not part of any running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Could not find member etcd-main-2 in the list" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Etcd cluster scale-up is detected" actor=initializer
{"level":"warn","ts":"2025-03-24T03:08:48.802811Z","caller":"clientv3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003150e0/etcd-main-client:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2025-03-24T03:08:49Z" level=info msg="Responding to status request with: Progress" actor=backup-restore-server
....
{"level":"warn","ts":"2025-03-24T03:09:51.938461Z","caller":"clientv3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000712b40/etcd-main-client:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2025-03-24T03:09:51Z" level=fatal msg="unable to add a learner in a cluster: error while adding member as a learner: etcdserver: unhealthy cluster" actor=initializer
```