Fix silent volume restoration failures for encrypted snapshots#267
Fix silent volume restoration failures for encrypted snapshots#267kaovilai wants to merge 2 commits intovelero-io:mainfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes silent volume restoration failures for encrypted EBS snapshots by adding proper verification when creating volumes from snapshots. Previously, the plugin would report success even when KMS permission issues prevented actual volume creation.
Key changes:
- Added volume status polling to verify successful creation before reporting success
- Enhanced error handling with specific KMS permission guidance
- Added configurable timeout and polling interval parameters
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| velero-plugin-for-aws/volume_snapshotter.go | Implements volume creation verification with polling, timeout configuration, and enhanced error messaging for KMS failures |
| velero-plugin-for-aws/volume_snapshotter_test.go | Adds comprehensive test coverage for new configuration parsing and error enhancement functionality |
Comments suppressed due to low confidence (1)
velero-plugin-for-aws/volume_snapshotter.go:152
- The variable
volumeIDis being assigned but was not declared in this function. This should use:=for declaration or the variable should be declared earlier.
volumeID = *output.VolumeId
|
Hi @ywk253100. Could you please take a look? |
|
@tropnikovvl can you help check if this fixes the issue? |
|
Hi @kaovilai ,
|
|
This is new territory for me. If you know proper fix feel free to take over by creating another pr and I'll close this one. |
1 similar comment
|
This is new territory for me. If you know proper fix feel free to take over by creating another pr and I'll close this one. |
|
The Problem
So basically, you are polling for a volume that will never become available, but AWS doesn't tell us why through the EC2 API. Solution: You need to check KMS permissions before calling
type VolumeSnapshotter struct {
log logrus.FieldLogger
ec2 *ec2.Client
kms *kms.Client // Add this
volumeCreationTimeout time.Duration
volumePollInterval time.Duration
}Initialize it in the Init method: b.ec2 = ec2.NewFromConfig(cfg)
b.kms = kms.NewFromConfig(cfg) // Add this
This is the key part - we validate KMS access before creating the volume: func (b *VolumeSnapshotter) verifyKMSKeyAccess(ctx context.Context, kmsKeyID *string) error {
if kmsKeyID == nil || *kmsKeyID == "" {
return nil // Not encrypted, skip check
}
b.log.WithField("kmsKeyID", *kmsKeyID).Info("Checking KMS key access before volume creation")
// Check if key exists and is enabled
describeOutput, err := b.kms.DescribeKey(ctx, &kms.DescribeKeyInput{
KeyId: kmsKeyID,
})
if err != nil {
return errors.Wrapf(err, "KMS key %s is not accessible. Need kms:DescribeKey permission", *kmsKeyID)
}
if describeOutput.KeyMetadata.KeyState != kmstypes.KeyStateEnabled {
return errors.Errorf("KMS key %s is %s, must be Enabled", *kmsKeyID, describeOutput.KeyMetadata.KeyState)
}
// Test encryption/decryption permissions
testGen, err := b.kms.GenerateDataKey(ctx, &kms.GenerateDataKeyInput{
KeyId: kmsKeyID,
KeySpec: kmstypes.DataKeySpecAes256,
})
if err != nil {
return errors.Wrapf(err, "Missing kms:GenerateDataKey permission for key %s", *kmsKeyID)
}
_, err = b.kms.Decrypt(ctx, &kms.DecryptInput{
CiphertextBlob: testGen.CiphertextBlob,
})
if err != nil {
return errors.Wrapf(err, "Missing kms:Decrypt permission for key %s", *kmsKeyID)
}
b.log.Info("KMS permissions verified successfully")
return nil
}
Modify the function to validate KMS before creating the volume: func (b *VolumeSnapshotter) CreateVolumeFromSnapshot(snapshotID, volumeType, volumeAZ string, iops *int64) (volumeID string, err error) {
ctx := context.Background()
// Get snapshot details
descSnapOutput, err := b.ec2.DescribeSnapshots(ctx, &ec2.DescribeSnapshotsInput{
SnapshotIds: []string{snapshotID},
})
if err != nil {
return "", errors.Wrapf(err, "failed to describe snapshot %s", snapshotID)
}
snapshot := descSnapOutput.Snapshots[0]
// THIS IS THE KEY CHANGE: Check KMS permissions BEFORE CreateVolume
if snapshot.Encrypted != nil && *snapshot.Encrypted {
if err := b.verifyKMSKeyAccess(ctx, snapshot.KmsKeyId); err != nil {
return "", errors.Wrapf(err,
"Cannot create volume from encrypted snapshot %s. "+
"IAM role needs: kms:Decrypt, kms:GenerateDataKey, kms:CreateGrant, kms:DescribeKey",
snapshotID)
}
}
// Create volume (existing code)
input := &ec2.CreateVolumeInput{
SnapshotId: &snapshotID,
AvailabilityZone: &volumeAZ,
VolumeType: types.VolumeType(volumeType),
Encrypted: snapshot.Encrypted,
KmsKeyId: snapshot.KmsKeyId, // Keep the KMS key from snapshot
TagSpecifications: []types.TagSpecification{
{
ResourceType: types.ResourceTypeVolume,
Tags: getTagsForCluster(snapshot.Tags),
},
},
}
if iopsVolumeTypes.Has(volumeType) && iops != nil {
input.Iops = aws.Int32(int32(*iops))
}
output, err := b.ec2.CreateVolume(ctx, input)
if err != nil {
return "", errors.WithStack(err)
}
volumeID = *output.VolumeId
// Keep the existing waiter code
if err := b.waitForVolumeAvailable(volumeID); err != nil {
return "", errors.Wrapf(err, "volume creation failed for snapshot %s", snapshotID)
}
return volumeID, nil
}
The AWS SDK has a built-in waiter that handles edge cases better: func (b *VolumeSnapshotter) waitForVolumeAvailableWithWaiter(ctx context.Context, volumeID string) error {
ctx, cancel := context.WithTimeout(ctx, b.volumeCreationTimeout)
defer cancel()
waiter := ec2.NewVolumeAvailableWaiter(b.ec2, func(o *ec2.VolumeAvailableWaiterOptions) {
o.MinDelay = b.volumePollInterval
o.MaxDelay = 2 * b.volumePollInterval
})
output, err := waiter.Wait(ctx, &ec2.DescribeVolumesInput{
VolumeIds: []string{volumeID},
}, b.volumeCreationTimeout)
if err != nil {
// Check if volume disappeared (sign of silent KMS failure)
_, descErr := b.ec2.DescribeVolumes(context.Background(), &ec2.DescribeVolumesInput{
VolumeIds: []string{volumeID},
})
if descErr != nil {
var apiErr smithy.APIError
if errors.As(descErr, &apiErr) && apiErr.ErrorCode() == "InvalidVolume.NotFound" {
return errors.Errorf(
"Volume %s disappeared - likely KMS permission failure. Check CloudTrail for KMS errors",
volumeID)
}
}
return err
}
if len(output.Volumes) == 0 || output.Volumes[0].State != types.VolumeStateAvailable {
return errors.Errorf("Volume %s not available after creation", volumeID)
}
return nil
}Why This Works The pre-flight KMS check catches permission issues before we even try to create the volume. This gives users an immediate, clear error instead of a 10-minute timeout with no explanation. The approach tests actual KMS operations (GenerateDataKey + Decrypt) that are required for encrypted volume creation, so if the test passes, the real operation should succeed too. Documentation Note Should update the README to emphasize that encrypted volumes need these KMS permissions:
Without them, volume restoration will fail. Let me know if you need clarification on any part of this. |
|
K can try update pr later and you can help test |
|
pushed |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #267 +/- ##
==========================================
- Coverage 35.39% 34.52% -0.87%
==========================================
Files 7 7
Lines 664 808 +144
==========================================
+ Hits 235 279 +44
- Misses 404 501 +97
- Partials 25 28 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I can take care of this issue after merging this PR #276 |
|
Hi @kaovilai , could you please resolve conflicts? |
|
resolving |
Resolves velero-io/velero#3145 and velero-io/velero#9128 Previously, when restoring EBS snapshots with KMS encryption, the plugin would report success even when volume creation failed due to missing KMS permissions (kms:Decrypt, kms:ReEncrypt*, kms:CreateGrant). This created a silent failure scenario where Velero logs showed successful restoration but the volume was never actually created. Changes: - Add volume creation verification with polling in CreateVolumeFromSnapshot - Wait for volume to reach 'available' state before returning success - Enhanced error handling for KMS permission failures with actionable messages - Add configurable timeout (volumeCreationTimeout) and poll interval (volumeCreationPollInterval) - Comprehensive test coverage for new error handling and configuration The fix transforms silent failures into clear error messages, helping users quickly identify and resolve KMS permission issues during volume restoration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Tiger Kaovilai <passawit.kaovilai@gmail.com>
- Introduced a new method `verifyKMSKeyAccess` in the VolumeSnapshotter to check necessary KMS permissions before creating a volume from an encrypted snapshot. - The method verifies if the KMS key exists, is enabled, and checks permissions for `kms:GenerateDataKey`, `kms:Decrypt`, and `kms:CreateGrant`. - Updated the `CreateVolumeFromSnapshot` method to call `verifyKMSKeyAccess` for encrypted snapshots. - Enhanced unit tests to cover scenarios for nil and empty KMS key checks. Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
|
@tropnikovvl done |

Resolves velero-io/velero#3145 and velero-io/velero#9128
Previously, when restoring EBS snapshots with KMS encryption, the plugin
would report success even when volume creation failed due to missing KMS
permissions (kms:Decrypt, kms:ReEncrypt*, kms:CreateGrant). This created
a silent failure scenario where Velero logs showed successful restoration
but the volume was never actually created.
Changes:
The fix transforms silent failures into clear error messages, helping users
quickly identify and resolve KMS permission issues during volume restoration.
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com