[proposal] remote storage refactor proposal #3750

andaaron wants to merge 2 commits into project-zot:main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #3750   +/-  ##
=======================================
  Coverage   91.46%   91.46%
=======================================
  Files         193      193
  Lines       27456    27456
=======================================
  Hits        25112    25112
  Misses       1520     1520
  Partials      824      824
```
### S3 Blob Path Structure

```
{rootDir}/storage/blobs/{algorithm}/{digest}   # Shared across all repos
```
I'm curious why we are using hard links for local storage, but are fine without any kind of reference-counting links in S3. i.e., if the implementation logic for blobs without links is possible, then would local storage also work fine without the hardlinks?
I'm wondering whether there are any drawbacks to not using hard links in local storage.
Using hardlinks saves space when you have multiple instances of the same blob under different paths on the same drive. If we removed hardlinks, disk usage would go up significantly.
For example: you push a base image and a derived image; because they live in different repositories, the base image layers would take twice as much disk space, once under the base image repo and once under the derived image repo.
Sure, we could implement a single instance of `{rootDir}/storage/blobs/{algorithm}/{digest}` for local storage as well, but that would break tools that read directly from disk and expect each repository to be a valid OCI layout of its own. One such case is the product we developed zot for in the first place.
```go
if is.storageDriver != nil {
	return path.Join(is.rootDir, "storage", ispec.ImageBlobsDir,
		digest.Algorithm().String(), digest.Encoded())
}
```
In newer code, we should probably not assume that a non-empty storage driver == S3. Zot may support other storage drivers in the future. Since we have the interfaces in place, we should check whether it is really S3 and then return the appropriate path.
Agreed. There's also the GCS PR we should be merging soon.
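The driver-name check being asked for could look something like the sketch below. The `storageDriver` interface and `usesSharedBlobs` helper are stand-ins invented for illustration (the real driver interface lives in the docker/distribution storage-driver package); the driver name `"s3aws"` is the one the distribution S3 driver reports, and treating `"gcs"` as shared is an assumption anticipating the GCS PR.

```go
package main

import "fmt"

// storageDriver mimics the Name() method exposed by remote
// storage drivers; this is a stand-in for the real interface in
// github.com/docker/distribution/registry/storage/driver.
type storageDriver interface {
	Name() string
}

type s3Driver struct{}

func (s3Driver) Name() string { return "s3aws" }

// usesSharedBlobs is a hypothetical check: rather than treating
// any non-nil driver as S3, inspect the driver's name so future
// backends (e.g. GCS) are handled explicitly.
func usesSharedBlobs(d storageDriver) bool {
	if d == nil {
		return false // local filesystem: keep hardlink dedupe
	}
	switch d.Name() {
	case "s3aws", "gcs": // known remote backends sharing one blob pool
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(usesSharedBlobs(nil))        // false
	fmt.Println(usesSharedBlobs(s3Driver{})) // true
}
```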
**Key Point:** No configuration flag needed. S3 detection is automatic via `storeDriver != nil`.
I'm not on board with this - if zot supports something new in the future, all this code would need to be updated; instead we should check whether the storage is really S3 before proceeding.
This reads "S3" because that's the only option besides local disk right now.
I would keep the existing dedupe logic only for local storage.
So it makes little difference what value storeDriver has, as long as it is not the local driver, which is the one that needs the dedupe.
S3 and any other possible future backend would use `{rootDir}/storage/blobs/{algorithm}/{digest}`.
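The split being described, per-repo paths with dedupe for local, one shared pool for everything remote, can be sketched as a single path helper. `blobPath` and its `remote` flag are hypothetical names for illustration, not zot's API:

```go
package main

import (
	"fmt"
	"path"
)

// blobPath is a hypothetical sketch of the path logic discussed
// above: local storage keeps per-repo blob paths (each repo stays
// a valid OCI layout and hardlink dedupe applies), while every
// remote backend shares one blob pool under
// {rootDir}/storage/blobs/{algorithm}/{digest}.
func blobPath(rootDir, repo, algorithm, digest string, remote bool) string {
	if remote {
		// Shared pool: the repo name does not appear in the path.
		return path.Join(rootDir, "storage", "blobs", algorithm, digest)
	}
	// Local: per-repo OCI layout.
	return path.Join(rootDir, repo, "blobs", algorithm, digest)
}

func main() {
	fmt.Println(blobPath("/var/lib/zot", "base", "sha256", "abc", false))
	// /var/lib/zot/base/blobs/sha256/abc
	fmt.Println(blobPath("/var/lib/zot", "base", "sha256", "abc", true))
	// /var/lib/zot/storage/blobs/sha256/abc
}
```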
3. Review results
4. Run migration: `zot migrate-s3-storage --config config.json`
5. **Restart zot** (automatically uses shared blob storage)
6. Verify and monitor
Since this is a destructive operation (migrating files to the new path), will the migration logic also keep track of what was moved so that it can be rolled back if required?
Out of curiosity, what would a rollback cost on S3?
Those are both good questions.
I think we can keep track of what was moved, but unless we keep it in the "cache DB", keeping it up to date once the service is back in use would be practically impossible. Even if we do track it in the cache DB, it complicates maintenance.
Maybe we should keep an entry there for the original blob location, same as we do now, and simply use that instead of `{rootDir}/storage/blobs/{algorithm}/{digest}` for blobs from before the transition. I think this could potentially help cut the cost of copying data back in case of a rollback.
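The idea of keeping the original location in the cache DB can be sketched as a lookup rule. `blobEntry` and `resolvePath` are hypothetical names for illustration; they are not zot's cache-DB schema:

```go
package main

import "fmt"

// blobEntry is a hypothetical cache-DB record: alongside the
// digest we keep the pre-migration path, so lookups for old blobs
// use the original location and a rollback only has to consult
// this table instead of copying objects back in S3.
type blobEntry struct {
	digest       string
	originalPath string // empty for blobs pushed after the migration
}

// resolvePath prefers the recorded pre-migration path; blobs
// pushed after the transition fall through to the shared layout.
func resolvePath(e blobEntry, rootDir, algorithm string) string {
	if e.originalPath != "" {
		return e.originalPath
	}
	return rootDir + "/storage/blobs/" + algorithm + "/" + e.digest
}

func main() {
	old := blobEntry{digest: "abc", originalPath: "/zot/base/blobs/sha256/abc"}
	fresh := blobEntry{digest: "def"}
	fmt.Println(resolvePath(old, "/zot", "sha256"))   // /zot/base/blobs/sha256/abc
	fmt.Println(resolvePath(fresh, "/zot", "sha256")) // /zot/storage/blobs/sha256/def
}
```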
Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.