Closing an out-of-band snapshot request connection should stop snapshotting process. #944

@anveshreddy18

Description

How to categorize this issue?

/area backup quality robustness usability
/kind bug

What happened:

Currently the backup-restore server exposes a list of endpoints; among them, the /snapshot route has /full and /delta sub-routes, which are registered at

mux.HandleFunc("/snapshot/full", h.serveFullSnapshotTrigger)
mux.HandleFunc("/snapshot/delta", h.serveDeltaSnapshotTrigger)

Now, when one of these endpoints is hit and the client then closes its connection for any reason, the server does not stop the snapshotting process: it proceeds to take the snapshot and upload it to the configured cloud provider bucket.

FullSnapshot Handler 👇


s, err := h.Snapshotter.TriggerFullSnapshot(req.Context(), isFinal)

func (ssr *Snapshotter) TriggerFullSnapshot(_ context.Context, isFinal bool) (*brtypes.Snapshot, error) {

Note: Observe that TriggerFullSnapshot ignores its context parameter entirely (the blank identifier `_` in the signature above).


DeltaSnapshot Handler 👇


func (h *HTTPHandler) serveDeltaSnapshotTrigger(rw http.ResponseWriter, req *http.Request) {

s, err := h.Snapshotter.TriggerDeltaSnapshot()

Note: Observe that the TriggerDeltaSnapshot method doesn't accept any context.


Ideally this should not happen: if the connection closes midway, the backup-restore server should also stop the snapshotting process, by propagating the HTTP request's context down the call chain and halting the snapshot when that context is cancelled.

Why is this important:

Assume a client (etcd-druid via a compaction or OnDemand Snapshot OpsTask, a script, curl, etc.) triggers an out-of-band snapshot and waits for the response, but the connection closes for one of various reasons: a networking issue, a timeout, or a deliberate attempt to stop the process. In these cases the client would inevitably, and rightly, assume that the request failed and no snapshot was taken. But the current logic does not account for this: once triggered, backup-restore sees the snapshot through to completion come what may. This is potentially misleading, since the client expects the snapshot to have failed while etcdbr takes it and pushes it to the bucket. From a contract point of view, this seems like a big problem.

How to reproduce it (as minimally and precisely as possible):

It can be easily reproduced with the etcd-druid setup: load up the etcd cluster to a decent size (upwards of 4 GB to see it clearly) using the benchmark tool, trigger an out-of-band full snapshot using curl, and while it is processing, close the connection by pressing ^C. Then observe in the backup-restore leader logs that it still proceeds to take the full snapshot.
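For reference, the manual trigger step could look like this (the host and port are assumptions; adjust them to wherever your backup-restore server is reachable):

```shell
# Trigger an out-of-band full snapshot against the backup-restore leader
# (host/port are assumptions; adjust to your deployment).
curl -v http://localhost:8080/snapshot/full
# While the request is in flight, press ^C to close the connection, then
# tail the backup-restore leader logs: the full snapshot is still taken
# and uploaded despite the disconnect.
```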

Anything else we need to know?: NA

Environment:

  • etcd-backup-restore version:
  • etcd-druid version (if relevant):
  • Gardener version (if relevant):
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
