Skip to content

Fix silent volume restoration failures for encrypted snapshots#267

Open
kaovilai wants to merge 2 commits intovelero-io:mainfrom
kaovilai:issue3145
Open

Fix silent volume restoration failures for encrypted snapshots#267
kaovilai wants to merge 2 commits intovelero-io:mainfrom
kaovilai:issue3145

Conversation

@kaovilai
Copy link
Copy Markdown
Collaborator

Resolves velero-io/velero#3145 and velero-io/velero#9128

Previously, when restoring EBS snapshots with KMS encryption, the plugin
would report success even when volume creation failed due to missing KMS
permissions (kms:Decrypt, kms:ReEncrypt*, kms:CreateGrant). This created
a silent failure scenario where Velero logs showed successful restoration
but the volume was never actually created.

Changes:

  • Add volume creation verification with polling in CreateVolumeFromSnapshot
  • Wait for volume to reach 'available' state before returning success
  • Enhanced error handling for KMS permission failures with actionable messages
  • Add configurable timeout (volumeCreationTimeout) and poll interval (volumeCreationPollInterval)
  • Comprehensive test coverage for new error handling and configuration

The fix transforms silent failures into clear error messages, helping users
quickly identify and resolve KMS permission issues during volume restoration.

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes silent volume restoration failures for encrypted EBS snapshots by adding proper verification when creating volumes from snapshots. Previously, the plugin would report success even when KMS permission issues prevented actual volume creation.

Key changes:

  • Added volume status polling to verify successful creation before reporting success
  • Enhanced error handling with specific KMS permission guidance
  • Added configurable timeout and polling interval parameters

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
velero-plugin-for-aws/volume_snapshotter.go Implements volume creation verification with polling, timeout configuration, and enhanced error messaging for KMS failures
velero-plugin-for-aws/volume_snapshotter_test.go Adds comprehensive test coverage for new configuration parsing and error enhancement functionality
Comments suppressed due to low confidence (1)

velero-plugin-for-aws/volume_snapshotter.go:152

  • The variable volumeID is being assigned but was not declared in this function. This should use := for declaration or the variable should be declared earlier.
	volumeID = *output.VolumeId

@kaovilai kaovilai marked this pull request as ready for review July 30, 2025 23:03
@github-actions github-actions Bot requested review from sseago and ywk253100 July 30, 2025 23:03
sseago
sseago previously approved these changes Aug 6, 2025
@tropnikovvl
Copy link
Copy Markdown
Contributor

Hi @ywk253100. Could you please take a look?

@kaovilai
Copy link
Copy Markdown
Collaborator Author

@tropnikovvl can you help check if this fixes the issue?

@tropnikovvl
Copy link
Copy Markdown
Contributor

tropnikovvl commented Dec 25, 2025

Hi @kaovilai ,
I tried this functionality and it doesn't seem to work.
I have two EKS clusters with different KMS keys for disk encryption. I removed access rights from Velero from EKS 2 to access the KMS keys of EKS 1
And when I tried to restore a snapshot, I saw the following in the logs:

time="2025-12-25T15:27:51Z" level=info msg="Executing PVCAction" cmd=/velero logSource="pkg/restore/actions/pvc_action.go:81" pluginName=velero restore=velero/test
time="2025-12-25T15:27:51Z" level=info msg="Adding PV pvc-40e631cd-6334-44f4-a6e8-db80bad1901a as an additional item to restore" cmd=/velero kind=PersistentVolumeClaim logSource="pkg/restore/actions/pvc_action.go:159" name=test namespace=test pluginName=velero restore=velero/test
time="2025-12-25T15:27:51Z" level=info msg="Done executing PVCAction" cmd=/velero logSource="pkg/restore/actions/pvc_action.go:167" pluginName=velero restore=velero/test
time="2025-12-25T15:27:51Z" level=info msg="Getting client for /v1, Kind=PersistentVolume" logSource="pkg/restore/restore.go:1034" restore=velero/test
time="2025-12-25T15:27:51Z" level=info msg="Find BackupVolumeInfo for PV pvc-40e631cd-6334-44f4-a6e8-db80bad1901a." groupResource=persistentvolumes logSource="pkg/restore/restore.go:1212" namespace= original name=pvc-40e631cd-6334-44f4-a6e8-db80bad1901a restore=velero/test
time="2025-12-25T15:27:51Z" level=info msg="Restoring persistent volume from snapshot." logSource="pkg/restore/restore.go:2492" restore=velero/test
time="2025-12-25T15:27:51Z" level=info msg="Waiting for volume to become available" cmd=/plugins/velero-plugin-for-aws interval=1.5e+10 logSource="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/volume_snapshotter.go:210" pluginName=velero-plugin-for-aws restore=velero/test timeout=6e+11 volumeID=vol-05e6f49ce19cf5764
time="2025-12-25T15:37:55Z" level=info msg="Skip action velero.io/csi-pvc-restorer for resource persistentvolumeclaims:test/test, because the CSI feature is not enabled. Feature setting is ." groupResource=persistentvolumeclaims logSource="pkg/restore/restore.go:1321" namespace=test original name=test restore=velero/test

However, in Cloudtrail I see a KMS error
Screenshot 2025-12-25 at 17 11 18

@kaovilai
Copy link
Copy Markdown
Collaborator Author

This is new territory for me. If you know proper fix feel free to take over by creating another pr and I'll close this one.

1 similar comment
@kaovilai
Copy link
Copy Markdown
Collaborator Author

This is new territory for me. If you know proper fix feel free to take over by creating another pr and I'll close this one.

@tropnikovvl
Copy link
Copy Markdown
Contributor

tropnikovvl commented Dec 25, 2025

@kaovilai

The Problem

  1. CreateVolume returns successfully with a volume ID
  2. AWS validates KMS permissions asynchronously after the API call returns
  3. If KMS permissions are missing, the volume either stays in creating forever or just disappears
  4. Your polling code waits 10 minutes and times out without seeing any error
  5. The real error only shows up in CloudTrail as KMS AccessDenied

So basically, you are polling for a volume that will never become available, but AWS doesn't tell us why through the EC2 API.

Solution:

You need to check KMS permissions before calling CreateVolume, not after. Here's the approach:

  1. Add KMS client and validate permissions upfront
type VolumeSnapshotter struct {
    log                    logrus.FieldLogger
    ec2                    *ec2.Client
    kms                    *kms.Client  // Add this
    volumeCreationTimeout  time.Duration
    volumePollInterval     time.Duration
}

Initialize it in the Init method:

b.ec2 = ec2.NewFromConfig(cfg)
b.kms = kms.NewFromConfig(cfg)  // Add this
  1. Add a pre-flight KMS check function

This is the key part - we validate KMS access before creating the volume:

  func (b *VolumeSnapshotter) verifyKMSKeyAccess(ctx context.Context, kmsKeyID *string) error {
      if kmsKeyID == nil || *kmsKeyID == "" {
          return nil  // Not encrypted, skip check
      }

      b.log.WithField("kmsKeyID", *kmsKeyID).Info("Checking KMS key access before volume creation")

      // Check if key exists and is enabled
      describeOutput, err := b.kms.DescribeKey(ctx, &kms.DescribeKeyInput{
          KeyId: kmsKeyID,
      })
      if err != nil {
          return errors.Wrapf(err, "KMS key %s is not accessible. Need kms:DescribeKey permission", *kmsKeyID)
      }

      if describeOutput.KeyMetadata.KeyState != kmstypes.KeyStateEnabled {
          return errors.Errorf("KMS key %s is %s, must be Enabled", *kmsKeyID, describeOutput.KeyMetadata.KeyState)
      }

      // Test encryption/decryption permissions
      testGen, err := b.kms.GenerateDataKey(ctx, &kms.GenerateDataKeyInput{
          KeyId:   kmsKeyID,
          KeySpec: kmstypes.DataKeySpecAes256,
      })
      if err != nil {
          return errors.Wrapf(err, "Missing kms:GenerateDataKey permission for key %s", *kmsKeyID)
      }

      _, err = b.kms.Decrypt(ctx, &kms.DecryptInput{
          CiphertextBlob: testGen.CiphertextBlob,
      })
      if err != nil {
          return errors.Wrapf(err, "Missing kms:Decrypt permission for key %s", *kmsKeyID)
      }

      b.log.Info("KMS permissions verified successfully")
      return nil
  }
  1. Update CreateVolumeFromSnapshot to check KMS first

Modify the function to validate KMS before creating the volume:

  func (b *VolumeSnapshotter) CreateVolumeFromSnapshot(snapshotID, volumeType, volumeAZ string, iops *int64) (volumeID string, err error) {
      ctx := context.Background()

      // Get snapshot details
      descSnapOutput, err := b.ec2.DescribeSnapshots(ctx, &ec2.DescribeSnapshotsInput{
          SnapshotIds: []string{snapshotID},
      })
      if err != nil {
          return "", errors.Wrapf(err, "failed to describe snapshot %s", snapshotID)
      }

      snapshot := descSnapOutput.Snapshots[0]

      // THIS IS THE KEY CHANGE: Check KMS permissions BEFORE CreateVolume
      if snapshot.Encrypted != nil && *snapshot.Encrypted {
          if err := b.verifyKMSKeyAccess(ctx, snapshot.KmsKeyId); err != nil {
              return "", errors.Wrapf(err,
                  "Cannot create volume from encrypted snapshot %s. "+
                  "IAM role needs: kms:Decrypt, kms:GenerateDataKey, kms:CreateGrant, kms:DescribeKey",
                  snapshotID)
          }
      }

      // Create volume (existing code)
      input := &ec2.CreateVolumeInput{
          SnapshotId:       &snapshotID,
          AvailabilityZone: &volumeAZ,
          VolumeType:       types.VolumeType(volumeType),
          Encrypted:        snapshot.Encrypted,
          KmsKeyId:         snapshot.KmsKeyId,  // Keep the KMS key from snapshot
          TagSpecifications: []types.TagSpecification{
              {
                  ResourceType: types.ResourceTypeVolume,
                  Tags:         getTagsForCluster(snapshot.Tags),
              },
          },
      }

      if iopsVolumeTypes.Has(volumeType) && iops != nil {
          input.Iops = aws.Int32(int32(*iops))
      }

      output, err := b.ec2.CreateVolume(ctx, input)
      if err != nil {
          return "", errors.WithStack(err)
      }

      volumeID = *output.VolumeId

      // Keep the existing waiter code
      if err := b.waitForVolumeAvailable(volumeID); err != nil {
          return "", errors.Wrapf(err, "volume creation failed for snapshot %s", snapshotID)
      }

      return volumeID, nil
  }
  1. Optional: Use AWS SDK waiter instead of manual polling

The AWS SDK has a built-in waiter that handles edge cases better:

  func (b *VolumeSnapshotter) waitForVolumeAvailableWithWaiter(ctx context.Context, volumeID string) error {
      ctx, cancel := context.WithTimeout(ctx, b.volumeCreationTimeout)
      defer cancel()

      waiter := ec2.NewVolumeAvailableWaiter(b.ec2, func(o *ec2.VolumeAvailableWaiterOptions) {
          o.MinDelay = b.volumePollInterval
          o.MaxDelay = 2 * b.volumePollInterval
      })

      output, err := waiter.Wait(ctx, &ec2.DescribeVolumesInput{
          VolumeIds: []string{volumeID},
      }, b.volumeCreationTimeout)

      if err != nil {
          // Check if volume disappeared (sign of silent KMS failure)
          _, descErr := b.ec2.DescribeVolumes(context.Background(), &ec2.DescribeVolumesInput{
              VolumeIds: []string{volumeID},
          })
          if descErr != nil {
              var apiErr smithy.APIError
              if errors.As(descErr, &apiErr) && apiErr.ErrorCode() == "InvalidVolume.NotFound" {
                  return errors.Errorf(
                      "Volume %s disappeared - likely KMS permission failure. Check CloudTrail for KMS errors",
                      volumeID)
              }
          }
          return err
      }

      if len(output.Volumes) == 0 || output.Volumes[0].State != types.VolumeStateAvailable {
          return errors.Errorf("Volume %s not available after creation", volumeID)
      }

      return nil
  }

Why This Works

The pre-flight KMS check catches permission issues before we even try to create the volume. This gives users an immediate, clear error instead of a 10-minute timeout with no explanation.

The approach tests actual KMS operations (GenerateDataKey + Decrypt) that are required for encrypted volume creation, so if the test passes, the real operation should succeed too.

Documentation Note

Should update the README to emphasize that encrypted volumes need these KMS permissions:

  • kms:Decrypt
  • kms:GenerateDataKey
  • kms:CreateGrant
  • kms:DescribeKey

Without them, volume restoration will fail.

Let me know if you need clarification on any part of this.
P.S. I haven't tested the code above, it's just an example implementation.

@kaovilai
Copy link
Copy Markdown
Collaborator Author

K can try update pr later and you can help test

@kaovilai
Copy link
Copy Markdown
Collaborator Author

pushed

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 20.94595% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.52%. Comparing base (59a6774) to head (dab3881).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
velero-plugin-for-aws/volume_snapshotter.go 20.94% 116 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #267      +/-   ##
==========================================
- Coverage   35.39%   34.52%   -0.87%     
==========================================
  Files           7        7              
  Lines         664      808     +144     
==========================================
+ Hits          235      279      +44     
- Misses        404      501      +97     
- Partials       25       28       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tropnikovvl
Copy link
Copy Markdown
Contributor

I can take care of this issue after merging this PR #276

@tropnikovvl
Copy link
Copy Markdown
Contributor

Hi @kaovilai , could you please resolve conflicts?

@kaovilai
Copy link
Copy Markdown
Collaborator Author

resolving

Resolves velero-io/velero#3145 and velero-io/velero#9128

Previously, when restoring EBS snapshots with KMS encryption, the plugin
would report success even when volume creation failed due to missing KMS
permissions (kms:Decrypt, kms:ReEncrypt*, kms:CreateGrant). This created
a silent failure scenario where Velero logs showed successful restoration
but the volume was never actually created.

Changes:
- Add volume creation verification with polling in CreateVolumeFromSnapshot
- Wait for volume to reach 'available' state before returning success
- Enhanced error handling for KMS permission failures with actionable messages
- Add configurable timeout (volumeCreationTimeout) and poll interval (volumeCreationPollInterval)
- Comprehensive test coverage for new error handling and configuration

The fix transforms silent failures into clear error messages, helping users
quickly identify and resolve KMS permission issues during volume restoration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tiger Kaovilai <passawit.kaovilai@gmail.com>
- Introduced a new method `verifyKMSKeyAccess` in the VolumeSnapshotter to check necessary KMS permissions before creating a volume from an encrypted snapshot.
- The method verifies if the KMS key exists, is enabled, and checks permissions for `kms:GenerateDataKey`, `kms:Decrypt`, and `kms:CreateGrant`.
- Updated the `CreateVolumeFromSnapshot` method to call `verifyKMSKeyAccess` for encrypted snapshots.
- Enhanced unit tests to cover scenarios for nil and empty KMS key checks.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai
Copy link
Copy Markdown
Collaborator Author

@tropnikovvl done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EBS Snapshot restores do not show failures when KMS permission is denied

5 participants