Problem
The linux.devices and linux.resources.devices sections of config-linux.md describe how to configure device access for containers but include no security guidance about the implications of granting r (read) or w (write) access to block devices.
When a block device is configured in linux.devices and linux.resources.devices grants access: "rw" or "rwm", the container process can perform raw block-level I/O via standard read() and write() syscalls — regardless of the process capabilities set.
Specifically:
- read() on a block device fd does not require CAP_SYS_RAWIO or any other capability
- write() on a block device fd does not require CAP_SYS_RAWIO or any other capability
- mount() correctly requires CAP_SYS_ADMIN
This means a container with a block device entry and only the default unprivileged capability set can read the entire contents of the host device (including all filesystem data, credentials, and keys) and potentially write to it (modifying or corrupting the host filesystem at the block level).
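The access path described above is an ordinary open() followed by read(). A minimal sketch, assuming a Linux host: the helper name read_prefix is hypothetical, and /dev/zero (a character device) stands in for a block device so the example is harmless to run.

```python
import os


def read_prefix(path: str, nbytes: int = 512) -> bytes:
    """Read the first nbytes of a device node using plain open()/read()."""
    # open(2) is checked against Unix file permissions and, in a container,
    # the device cgroup rules for this node.
    fd = os.open(path, os.O_RDONLY)
    try:
        # read(2) on the resulting fd; per the description above, no
        # capability such as CAP_SYS_RAWIO is consulted here.
        return os.read(fd, nbytes)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # /dev/zero is a stand-in; substituting a block device path would
    # return the first bytes of that device instead.
    data = read_prefix("/dev/zero", 16)
    print(len(data), data == bytes(16))
```

Nothing in this sketch is specific to block devices, which is the point: the same two syscalls that read a regular file also read the raw device.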
The specification does not document this behavior. As a result, runtime implementors and container orchestrators may assume that Linux capabilities serve as a security boundary for device access — which they do for mount(), but not for raw I/O.
Impact
The gap affects the entire container ecosystem that consumes this specification:
- Container runtimes (runc, crun, youki) faithfully implement the spec and create device nodes with the specified access — no additional validation is performed on block devices
- Container orchestrators (containerd, CRI-O, Docker) populate linux.devices based on higher-level configuration (--device, device plugins, hostPath BlockDevice) without security warnings
- Kubernetes exposes devices via hostPath type: BlockDevice, device plugins (GPU, FPGA, SR-IOV), and CSI raw block volumes, all of which result in linux.devices entries
- Security tooling (admission controllers, policy engines) commonly audit capabilities and seccomp profiles but rarely inspect device cgroup rules for block device access
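A check of the kind that policy tooling rarely performs could be sketched as follows. This is an illustration under stated assumptions, not an existing tool: the function name block_devices_with_rw and the example configuration are hypothetical, and the sketch assumes that an unset rule type means "a", that unset major/minor act as wildcards, and that an unset access string covers "rwm".

```python
from typing import Any


def block_devices_with_rw(config: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (path, access) for block device nodes whose cgroup access includes r or w."""
    linux = config.get("linux", {})
    rules = linux.get("resources", {}).get("devices", [])
    findings = []
    for dev in linux.get("devices", []):
        if dev.get("type") != "b":
            continue  # only block device nodes are of interest here
        granted: set[str] = set()
        for rule in rules:  # rules are evaluated in listed order
            if rule.get("type", "a") not in ("a", "b"):
                continue
            # Assumption: an unset major/minor matches any device number.
            if rule.get("major") not in (None, dev.get("major")):
                continue
            if rule.get("minor") not in (None, dev.get("minor")):
                continue
            # Assumption: an unset access string covers all of r, w, and m.
            access = set(rule.get("access", "rwm"))
            if rule.get("allow"):
                granted |= access
            else:
                granted -= access
        if granted & {"r", "w"}:
            findings.append((dev["path"], "".join(c for c in "rwm" if c in granted)))
    return findings


# Hypothetical config.json content, expressed as a Python dict.
example = {
    "linux": {
        "devices": [
            {"path": "/dev/hostdisk", "type": "b", "major": 8, "minor": 0,
             "fileMode": 0o660, "uid": 0, "gid": 0},
            {"path": "/dev/null", "type": "c", "major": 1, "minor": 3},
        ],
        "resources": {
            "devices": [
                {"allow": False, "access": "rwm"},
                {"allow": True, "type": "b", "major": 8, "minor": 0, "access": "rw"},
                {"allow": True, "type": "c", "major": 1, "minor": 3, "access": "rwm"},
            ]
        },
    }
}

for path, access in block_devices_with_rw(example):
    print(f"warning: block device {path} is accessible with '{access}'")
```

The sketch only approximates how a runtime merges rules; a production check would need to follow the exact rule semantics of the runtime in use.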
Verified behavior
Tested with runc 1.3.4 on cgroup v2 (eBPF device controller), default seccomp profile active:
# Container capabilities (default set, no SYS_ADMIN, no SYS_RAWIO):
CapPrm: 0x00000000a80425fb
# mount() — correctly blocked:
mount: permission denied (are you root?)
# Raw read via dd - succeeds, recovers host /etc/passwd and /etc/shadow entries:
$ dd if=/dev/hostdisk bs=4096 count=38400 2>/dev/null | strings | grep '^root:'
root:x:0:0:root:/root:/bin/sh
root:*::0:::::
# Raw write via dd — succeeds:
$ echo TEST | dd of=/dev/hostdisk bs=1 seek=153000000 count=5 conv=notrunc
5+0 records in
5+0 records out
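The CapPrm mask shown in the transcript can be decoded to confirm which capabilities the test container held. A small sketch, using bit positions from linux/capability.h; the helper name has_cap is hypothetical and only the capabilities relevant to this report are listed.

```python
# Capability bit positions as defined in linux/capability.h.
CAP_BITS = {
    "CAP_DAC_OVERRIDE": 1,
    "CAP_SYS_RAWIO": 17,
    "CAP_SYS_ADMIN": 21,
    "CAP_MKNOD": 27,
}


def has_cap(mask: int, name: str) -> bool:
    """Return True if the named capability bit is set in the mask."""
    return bool((mask >> CAP_BITS[name]) & 1)


cap_prm = 0x00000000A80425FB  # CapPrm value from the transcript above

for name in CAP_BITS:
    print(name, has_cap(cap_prm, name))
```

In this mask CAP_SYS_ADMIN and CAP_SYS_RAWIO are clear, while CAP_DAC_OVERRIDE and CAP_MKNOD are set, which matches the default set described in the transcript.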
Proposed Changes
1. Add security note to linux.devices section
After the existing description of linux.devices, add:
Security consideration: Creating a block device node (type "b") and granting r or w access in linux.resources.devices allows the container process to perform raw block-level I/O on the underlying host device using standard read() and write() syscalls. These syscalls are not gated by any Linux capability — device cgroup permission and Unix file permissions are the only controls. Removing CAP_SYS_ADMIN prevents mount() but does not prevent raw data access.
Runtimes and orchestrators SHOULD warn when block devices are configured with read or write access. Effective defenses include user namespaces (remapped UID 0 cannot open root-owned device nodes) and running container processes as non-root users.
2. Add note to linux.resources.devices access field
After the access field description, add:
Note: The r and w permissions control access through the device cgroup controller (or eBPF device program on cgroup v2). When applied to block devices, these permissions enable raw block-level I/O that is independent of Linux capabilities. CAP_SYS_RAWIO is not required for read() or write() on block device file descriptors.
References
- r/w control read()/write() on device inodes
- dd uses read()/write(), not raw I/O ioctls