[Edge case] Split brain problem can occur during bootstrap of HA Etcd Cluster #864

@ishan16696

Description

How to categorize this issue?
/area high-availability
/kind bug
/kind regression

What happened:
It has been observed that the bootstrap case (0 -> 3 replicas) of an HA etcd cluster can fail and lead to a split-brain problem in the cluster.

Consider this scenario:

  1. Bootstrap case: an HA etcd cluster with replicas=3, where one etcd pod member is unable to come up for some time for any reason (scheduling, etc.).
  2. The other two etcd pod members come up and form the cluster (a new leader is elected once quorum is formed).
  3. The backup-restore sidecar of the etcd leader then takes the first full snapshot and uploads it to the bucket:
2025-03-18 13:37:55 | {"log":"Creating snapstore from provider: Swift","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"backup-restore started leading...","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"backup-restore became: Leader","severity":"INFO"}
...
2025-03-18 13:37:56 | {"log":"Applied watch on etcd from revision: 2","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Successfully saved full snapshot at: Full-00000000-00000001-1742305076.gz","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Total time to save Full snapshot: 0.182602 seconds.","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Manifest object uploaded successfully.","severity":"INFO"}
  4. Meanwhile, the third pod member (which was stuck in step 1) starts coming up and triggers initialization as usual, but this time a full snapshot is present in the bucket, so it triggers a restoration from the bucket.
  5. This restoration from the full snapshot can lead this member to start its own cluster (although there will be a ClusterID mismatch error in the etcd logs).
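The race in steps 4-5 can be summed up as a decision problem for the joining member. A minimal sketch of the kind of guard that would avoid it (hypothetical names, not the actual etcd-backup-restore initializer API): when other members have already formed a cluster, the starting member must join it as a learner and never restore from the bucket, because restoring seeds a new, independent cluster.

```go
package main

import "fmt"

type action string

const (
	joinAsLearner  action = "join-existing-cluster-as-learner"
	restoreFromBkt action = "restore-from-backup-bucket"
	startFresh     action = "start-fresh-cluster"
)

// bootstrapAction decides what a starting member should do.
// clusterHasQuorum: other members have already formed a cluster.
// snapshotInBucket: a full snapshot exists in the backup bucket.
// Both inputs are assumptions for illustration, not the real initializer's state.
func bootstrapAction(clusterHasQuorum, snapshotInBucket bool) action {
	// If a cluster is already running, never restore from the bucket:
	// restoring would start a second, independent cluster (split brain).
	if clusterHasQuorum {
		return joinAsLearner
	}
	if snapshotInBucket {
		return restoreFromBkt
	}
	return startFresh
}

func main() {
	// The buggy scenario: quorum already exists AND a fresh full
	// snapshot is present in the bucket at the same time.
	fmt.Println(bootstrapAction(true, true)) // join-existing-cluster-as-learner
}
```

In the observed failure, the third member effectively took the `restore-from-backup-bucket` branch even though quorum already existed, because the snapshot check ran before (or instead of) the cluster-membership check.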

Logs of backup-restore while doing the restoration:

2025-03-18 13:37:56 | {"log":"Removing directory(/var/etcd/data/new.etcd).","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"No delta snapshots present over base snapshot.","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Successfully restored from base snapshot: Full-00000000-00000001-1742305076.gz","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Successfully fetched and saved data of the base snapshot in 0.042820924 seconds [CompressionPolicy:gzip]","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Fetched the snapshot from the object store in 0.042820924 seconds","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"start decompressing the snapshot with gzip compressionPolicy","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Restoring from base snapshot: Full-00000000-00000001-1742305076.gz","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Creating temporary directory /var/etcd/data/restoration.temp for persisting full and delta snapshots locally.","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Removing directory(/var/etcd/data/new.etcd.part).","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Finding latest set of snapshot to recover from...","severity":"INFO"}
2025-03-18 13:37:56 | {"log":"Responding to status request with: Progress","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"No snapshot found. BackupBucket is empty","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"Temporary directory /var/etcd/data/temp does not exist. Creating it...","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"Checking whether the backup bucket is empty or not...","severity":"INFO"}
2025-03-18 13:37:55 | {"log":"Member etcd-main-2 part of running cluster","severity":"INFO"}

Cluster state will look like this:

+------------------+-----------+-------------+----------------------------------------------------------------------------+----------------------------------------------------------------------------+------------+
|        ID        |  STATUS   |    NAME     |                                 PEER ADDRS                                 |                                CLIENT ADDRS                                | IS LEARNER |
+------------------+-----------+-------------+----------------------------------------------------------------------------+----------------------------------------------------------------------------+------------+
| 4f318e68482038f4 |   started | etcd-main-0 | https://etcd-main-0.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2380 | https://etcd-main-0.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 |      false |
| 8744d798b794c6ab |   started | etcd-main-1 | https://etcd-main-1.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2380 | https://etcd-main-1.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 |      false |
| cd4ba37ba8467c6f | unstarted |             | https://etcd-main-2.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2380 |                                                                            |      false |
+------------------+-----------+-------------+----------------------------------------------------------------------------+----------------------------------------------------------------------------+------------+


I have no name!@etcd-main-0:/$ etcdctl endpoint status --cluster -w table
+----------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|                                  ENDPOINT                                  |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd-main-0.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 | 4f318e68482038f4 |  3.4.34 |   18 MB |     false |      false |        33 |    4094554 |            4094554 |        |
| https://etcd-main-1.etcd-main-peer.shoot--hc-dev--ccahccd7e5-haas.svc:2379 | 8744d798b794c6ab |  3.4.34 |   18 MB |      true |      false |        33 |    4094555 |            4094555 |        |
+----------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
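The two tables disagree: `member list` knows three members (one unstarted), while `endpoint status --cluster` only reaches two. Comparing the two ID sets is one way to flag the diverged member; a small illustrative sketch (not part of etcd-druid), using the IDs from the tables above:

```go
package main

import "fmt"

// splitBrainSuspects returns member IDs that appear in `etcdctl member list`
// but are not reachable via `etcdctl endpoint status --cluster`.
func splitBrainSuspects(memberList, endpointStatus []string) []string {
	reachable := make(map[string]bool, len(endpointStatus))
	for _, id := range endpointStatus {
		reachable[id] = true
	}
	var suspects []string
	for _, id := range memberList {
		if !reachable[id] {
			suspects = append(suspects, id)
		}
	}
	return suspects
}

func main() {
	// IDs taken from the member-list and endpoint-status tables above.
	memberList := []string{"4f318e68482038f4", "8744d798b794c6ab", "cd4ba37ba8467c6f"}
	endpointStatus := []string{"4f318e68482038f4", "8744d798b794c6ab"}
	fmt.Println(splitBrainSuspects(memberList, endpointStatus)) // [cd4ba37ba8467c6f]
}
```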

What you expected to happen:
Bootstrap of the etcd cluster should succeed, with all members joining the same cluster.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
The split-brain problem goes unnoticed, as the Etcd resource status still reports AllMembersReady, until the member that formed the separate cluster restarts; at that point it can no longer come up.

time="2025-03-24T03:08:48Z" level=info msg="Responding to status request with: New" actor=backup-restore-server
time="2025-03-24T03:08:48Z" level=info msg="Received start initialization request." actor=backup-restore-server
time="2025-03-24T03:08:48Z" level=info msg="Updating status from New to Progress" actor=backup-restore-server
...
time="2025-03-24T03:08:48Z" level=info msg="Validation mode: sanity" actor=backup-restore-server
time="2025-03-24T03:08:48Z" level=info msg="Checking if member etcd-main-2 is part of a running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Member etcd-main-2 not part of any running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Could not find member etcd-main-2 in the list" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="member heartbeat is not present" actor=initializer
time="2025-03-24T03:08:48Z" level=info msg="backup-restore will start the scale-up check" actor=initializer
time="2025-03-24T03:08:48Z" level=info msg="Checking whether etcd cluster is marked for scale-up" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Checking if member etcd-main-2 is part of a running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Member etcd-main-2 not part of any running cluster" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Could not find member etcd-main-2 in the list" actor=member-add
time="2025-03-24T03:08:48Z" level=info msg="Etcd cluster scale-up is detected" actor=initializer
{"level":"warn","ts":"2025-03-24T03:08:48.802811Z","caller":"clientv3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003150e0/etcd-main-client:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2025-03-24T03:08:49Z" level=info msg="Responding to status request with: Progress" actor=backup-restore-server
....
{"level":"warn","ts":"2025-03-24T03:09:51.938461Z","caller":"clientv3/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000712b40/etcd-main-client:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2025-03-24T03:09:51Z" level=fatal msg="unable to add a learner in a cluster: error while adding member as a learner: etcdserver: unhealthy cluster" actor=initializer
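One way the status check could catch this earlier: each member reports the cluster ID it belongs to, and a diverged member reports a different one (the same ClusterID mismatch already visible in the etcd logs). A hedged sketch of such a consistency check, with hypothetical ID values:

```go
package main

import "fmt"

// sameCluster reports whether every member returned the same cluster ID.
// A member that restored from the bucket and started its own cluster
// reports a different ID, so AllMembersReady alone is not enough.
func sameCluster(clusterIDs map[string]string) bool {
	var first string
	for _, id := range clusterIDs {
		if first == "" {
			first = id
		} else if id != first {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical cluster IDs, for illustration only.
	ids := map[string]string{
		"etcd-main-0": "cdf818194e3a8c32",
		"etcd-main-1": "cdf818194e3a8c32",
		"etcd-main-2": "9a2f31c5d07e44b1", // the diverged member
	}
	fmt.Println(sameCluster(ids)) // false
}
```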

Labels

area/high-availability, kind/bug, kind/regression, lifecycle/frozen
