How to categorize this issue?
/area backup quality robustness usability
/kind bug
What happened:
Currently the backup-restore server exposes a list of endpoints. Among them, the `/snapshot` route has `/full` and `/delta` sub-routes, which are defined in `etcd-backup-restore/pkg/server/httpAPI.go` (lines 139 to 140 at `0d19f28`):

```go
mux.HandleFunc("/snapshot/full", h.serveFullSnapshotTrigger)
mux.HandleFunc("/snapshot/delta", h.serveDeltaSnapshotTrigger)
```
Now, when these endpoints are hit and the client then closes its connection for any reason, the server does not stop the snapshotting process: it proceeds to take the snapshot and upload it to the configured cloud provider bucket.
FullSnapshot Handler 👇

```go
s, err := h.Snapshotter.TriggerFullSnapshot(req.Context(), isFinal)
```

```go
func (ssr *Snapshotter) TriggerFullSnapshot(_ context.Context, isFinal bool) (*brtypes.Snapshot, error) {
```

Note: Observe that `TriggerFullSnapshot` ignores its context parameter (it is bound to `_`).
DeltaSnapshot Handler 👇

```go
func (h *HTTPHandler) serveDeltaSnapshotTrigger(rw http.ResponseWriter, req *http.Request) {
```

```go
s, err := h.Snapshotter.TriggerDeltaSnapshot()
```

Note: Observe that the `TriggerDeltaSnapshot` method doesn't accept any context at all.
Ideally this should not happen: if the connection closes midway, the backup-restore server should also stop the snapshotting process. This means propagating the HTTP request's context through the call chain and halting the snapshot when that context is cancelled.
Why is this important:
Assume a client (it could be etcd-druid (compaction or an OnDemand Snapshot OpsTask), a script, curl, etc.) triggers an out-of-band snapshot and waits for the response, but the connection closes due to a networking issue, a timeout, or a deliberate attempt to stop the process. In these cases the client would reasonably assume the request did not succeed and expect the snapshot trigger to have failed. But the current logic doesn't take this into account: once triggered, backup-restore sees the snapshot through to completion come what may. This is potentially misleading when the client expects the snapshot to have failed, but etcdbr takes it and pushes it to the bucket anyway. From the contract point of view, this seems like a significant problem.
How to reproduce it (as minimally and precisely as possible):
It can be easily reproduced with the etcd-druid setup: load up the etcd cluster to a decent size (upwards of 4 GB, to see the effect clearly) using the benchmark tool, trigger an out-of-band full snapshot using curl, and, while it is processing, close the connection by pressing ^C. Then observe the backup-restore leader logs to see that it still proceeds to take the full snapshot.
Anything else we need to know?: NA
Environment:
- etcd-backup-restore version:
- etcd-druid version (if relevant):
- Gardener version (if relevant):
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- Others: