Closing an out-of-band snapshot request connection should stop snapshotting process. #944

@anveshreddy18

Description

How to categorize this issue?

/area backup quality robustness usability
/kind bug

What happened:

Currently the backup-restore server exposes a list of endpoints; among them, the /snapshot route has /full and /delta sub-routes, which are registered at

mux.HandleFunc("/snapshot/full", h.serveFullSnapshotTrigger)
mux.HandleFunc("/snapshot/delta", h.serveDeltaSnapshotTrigger)

Now, when one of these endpoints is hit and the client then closes its connection for any reason, the server does not stop the snapshotting process: it proceeds to take the snapshot and upload it to the configured cloud provider bucket.

FullSnapshot Handler 👇


s, err := h.Snapshotter.TriggerFullSnapshot(req.Context(), isFinal)

func (ssr *Snapshotter) TriggerFullSnapshot(_ context.Context, isFinal bool) (*brtypes.Snapshot, error) {

Note: Observe that TriggerFullSnapshot ignores its context parameter entirely (the blank identifier `_` in the signature above).


DeltaSnapshot Handler 👇


func (h *HTTPHandler) serveDeltaSnapshotTrigger(rw http.ResponseWriter, req *http.Request) {

s, err := h.Snapshotter.TriggerDeltaSnapshot()

Note: Observe that the TriggerDeltaSnapshot method doesn't accept any context.


Ideally this should not happen: if the connection closes midway, the backup-restore server should also stop the snapshotting process, by propagating the HTTP request's context down the call chain and halting the snapshot when that context is cancelled.

Why is this important:

Assume a client (etcd-druid via a compaction or OnDemand Snapshot OpsTask, a script, curl, etc.) triggers an out-of-band snapshot and waits for the response, but the connection closes for one of various reasons: a networking issue, a timeout, or a deliberate attempt to stop the process. In these cases the client would inevitably, and rightly, assume that the request failed and no snapshot was taken. But the current logic does not account for this: once triggered, backup-restore sees the snapshot through to completion come what may. This is potentially misleading, since the client expects the snapshot to have failed while etcdbr takes it and pushes it to the bucket. From a contract point of view, this seems like a big problem.

How to reproduce it (as minimally and precisely as possible):

It can be easily reproduced with the etcd-druid setup: load up the etcd cluster to a decent size (upwards of 4 GB to see it clearly) using the benchmark tool, trigger an out-of-band full snapshot using curl, and while it is processing, close the connection by pressing ^C. Then observe in the backup-restore leader logs that it still proceeds to take the full snapshot.
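For reference, the manual trigger step could look like this (the host and port are assumptions; adjust them to wherever your backup-restore server is reachable):

```shell
# Trigger an out-of-band full snapshot against the backup-restore leader
# (host/port are assumptions; adjust to your deployment).
curl -v http://localhost:8080/snapshot/full
# While the request is in flight, press ^C to close the connection, then
# tail the backup-restore leader logs: the full snapshot is still taken
# and uploaded despite the disconnect.
```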

Anything else we need to know?: NA

Environment:

  • etcd-backup-restore version:
  • etcd-druid version (if relevant):
  • Gardener version (if relevant):
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
