Feature Request
Description
Context: #12132
TL;DR:
talosctl upgrade does not validate whether the chosen image is compatible with the node. To make matters worse, the --image option has a default value, so omitting it upgrades the node to a "default" image.
From the administrator's point of view the "default" image is effectively arbitrary: it is unlikely to have the same schematic ID as the image on the node, and it might be an older version than what is currently installed.
An implicit upgrade to a "default" image would be unexpected for a Talos administrator, but it should generally be safe, since it is assumed to be reversible thanks to Talos' A/B upgrades and the talosctl rollback command.
However, as shown in the discussion linked above, there is an edge case. For Talos nodes with secure boot enabled, upgrading to a non-secure boot image completely bricks the node, as systemd-boot gets replaced.
Recovering from this situation requires, at the very least, physical access and a live USB. However, if the node also uses TPM-encrypted partitions/disks, booting from a USB leaves it stuck in a "booting" state, as the TPM will refuse to unseal the keys for the partitions/disks. Wiping the STATE and EPHEMERAL partitions and starting fresh then becomes the only option, leading to data loss.
Feature Request
With that in mind, I would like to suggest a few ideas on how to improve the upgrade command, so that you cannot accidentally shoot yourself in the foot:
- Remove the default value for the --image option
- Perform validation checks when upgrading and require --force (or something to that effect) if upgrading to an older image, from a secure boot to a non-secure boot image, or if changing image arch.
- Change the default --image value to be "what is currently specified in the machine config" or require --image to be provided if --insecure is used
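To make the validation idea above more concrete, here is a minimal Go sketch of what such checks could look like. It is not Talos code: ImageInfo, CheckUpgrade and the force variable are hypothetical names made up for illustration, the version numbers are placeholders, and it assumes the version, arch and secure boot state of both the running node and the target image can be determined somehow. Pre-release suffixes such as "-beta.1" are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ImageInfo is a hypothetical summary of either the running node
// or the installer image the user wants to upgrade to.
type ImageInfo struct {
	Version    string // e.g. "v1.11.2" (illustrative value)
	Arch       string // e.g. "amd64"
	SecureBoot bool
}

// parseVersion turns "v1.11.2" into [1 11 2].
// Anything that is not three dot-separated integers is rejected.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("invalid version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// isOlder reports whether version a is strictly older than version b.
func isOlder(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// CheckUpgrade returns the reasons why an upgrade should require
// explicit confirmation. An empty result means no check objected.
func CheckUpgrade(current, target ImageInfo) ([]string, error) {
	cur, err := parseVersion(current.Version)
	if err != nil {
		return nil, err
	}
	tgt, err := parseVersion(target.Version)
	if err != nil {
		return nil, err
	}

	var issues []string
	if isOlder(tgt, cur) {
		issues = append(issues, fmt.Sprintf("downgrade from %s to %s", current.Version, target.Version))
	}
	if current.SecureBoot && !target.SecureBoot {
		issues = append(issues, "node uses secure boot but the target image does not")
	}
	if current.Arch != target.Arch {
		issues = append(issues, fmt.Sprintf("arch change from %s to %s", current.Arch, target.Arch))
	}
	return issues, nil
}

func main() {
	current := ImageInfo{Version: "v1.11.2", Arch: "amd64", SecureBoot: true}
	target := ImageInfo{Version: "v1.10.0", Arch: "amd64", SecureBoot: false}
	force := false // stands in for the proposed --force flag

	issues, err := CheckUpgrade(current, target)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(issues) > 0 && !force {
		fmt.Println("refusing to upgrade without --force:")
		for _, issue := range issues {
			fmt.Println(" -", issue)
		}
		return
	}
	fmt.Println("upgrade checks passed")
}
```

The point of returning a list of reasons rather than a single error is that the user sees every problem at once and can decide whether --force is really what they want.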
Bonus: Add a "boot into maintenance mode without wiping the system disk" boot entry to the Talos ISO