docs/design/direct-blk-device-assignment.md
Today, there exist a few gaps between Container Storage Interface (CSI) and virtual machine (VM) based runtimes such as Kata Containers that prevent them from working together smoothly.
First, it’s cumbersome to use a persistent volume (PV) with Kata Containers. Today, for a PV with Filesystem volume mode, Virtio-fs is the only way to surface it inside a Kata Container guest VM. But often mounting the filesystem (FS) within the guest operating system (OS) is desired due to performance benefits, availability of native FS features and security benefits over the Virtio-fs mechanism.
Second, it’s difficult if not impossible to resize a PV online with Kata Containers. While a PV can be expanded on the host OS, the updated metadata needs to be propagated to the guest OS in order for the application container to use the expanded volume. Currently, there is not a way to propagate the PV metadata from the host OS to the guest OS without restarting the Pod sandbox.
Because of the OS boundary, these features cannot be implemented in the CSI node driver plugin running on the host OS
as is normally done in the runc container. Instead, they can be done by the Kata Containers agent inside the guest OS,
but it requires the CSI driver to pass the relevant information to the Kata Containers runtime.
An ideal long term solution would be to have the kubelet coordinating the communication between the CSI driver and
the container runtime, as described in KEP-2857.
However, as the KEP is still under review, we would like to propose a short/medium term solution to unblock our use case.
The proposed solution is built on top of a previous proposal described by Eric Ernst. The previous proposal has two gaps:
Writing a csiPlugin.json file to the volume root path introduced a security risk. A malicious user can gain unauthorized
access to a block device by writing their own csiPlugin.json to the above location through an ephemeral CSI plugin.
The proposal didn't describe how to establish a mapping between a volume and a kata sandbox, which is needed for implementing CSI volume resize and volume stat collection APIs.
This document particularly focuses on how to address these two gaps.
PersistentVolumeClaim (PVC) needs to have the accessMode as ReadWriteOncePod.fsGroup, fsGroupChangePolicy, and subPath are not supported.--extra-create-metadata set
to be able to perform a lookup of the PVC annotations from the API server. Pro: API server lookup of annotations only required during creation of PV.
Con: The CSI plugin will always skip host mounting of the PV.runtimeclass during NodePublish. This approach can be found in the ALIBABA CSI plugin.kata-runtime command line commands.
NodePublishVolume -- It invokes kata-runtime direct-volume add --volume-path [volumePath] --mount-info [mountInfo]
to propagate the volume mount information to the Kata Containers runtime for it to carry out the filesystem mount operation.
The volumePath is the target_path in the CSI NodePublishVolumeRequest.
The mountInfo is a serialized JSON string.NodeGetVolumeStats -- It invokes kata-runtime direct-volume stats --volume-path [volumePath] to retrieve the filesystem stats of direct-assigned volume.NodeExpandVolume -- It invokes kata-runtime direct-volume resize --volume-path [volumePath] --size [size] to send a resize request to the Kata Containers runtime to
resize the direct-assigned volume.NodeStageVolume/NodeUnStageVolume -- It invokes kata-runtime direct-volume remove --volume-path [volumePath] to remove the persisted metadata of a direct-assigned volume.The mountInfo object is defined as follows:
type MountInfo struct {
// The type of the volume (ie. block)
VolumeType string `json:"volume-type"`
// The device backing the volume.
Device string `json:"device"`
// The filesystem type to be mounted on the volume.
FsType string `json:"fstype"`
// Additional metadata to pass to the agent regarding this volume.
Metadata map[string]string `json:"metadata,omitempty"`
// Additional mount options.
Options []string `json:"options,omitempty"`
}
Notes: given that the mountInfo is persisted to the disk by the Kata runtime, it shouldn't container any secrets (such as SMB mount password).
Instead of the CSI node driver writing the mount info into a csiPlugin.json file under the volume root,
as described in the original proposal, here we propose that the CSI node driver passes the mount information to
the Kata Containers runtime through a new kata-runtime commandline command. The kata-runtime then writes the mount
information to a mountInfo.json file in a predefined location (/run/kata-containers/shared/direct-volumes/[volume_path]/).
When the Kata Containers runtime starts a container, it verifies whether a volume mount is a direct-assigned volume by checking
whether there is a mountInfo file under the computed Kata direct-volumes directory. If it is, the runtime parses the mountInfo file,
updates the mount spec with the data in mountInfo. The updated mount spec is then passed to the Kata agent in the guest VM together
with other mounts. The Kata Containers runtime also creates a file named by the sandbox id under the direct-volumes/[volume_path]/
directory. The reason for adding a sandbox id file is to establish a mapping between the volume and the sandbox using it.
Later, when the Kata Containers runtime handles the get-stats and resize commands, it uses the sandbox id to identify
the endpoint of the corresponding containerd-shim-kata-v2.
containerd-shim-kata-v2 provides an API for sandbox management through a Unix domain socket. Two new handlers are proposed: /direct-volume/stats and /direct-volume/resize:
Example:
$ curl --unix-socket "$shim_socket_path" -I -X GET 'http://localhost/direct-volume/stats/[urlSafeVolumePath]'
$ curl --unix-socket "$shim_socket_path" -I -X POST 'http://localhost/direct-volume/resize' -d '{ "volumePath"": [volumePath], "Size": "123123" }'
The shim then forwards the corresponding request to the kata-agent to carry out the operations inside the guest VM. For resize operation,
the Kata runtime also needs to notify the hypervisor to resize the block device (e.g. call block_resize in QEMU).
The mount spec of a direct-assigned volume is passed to kata-agent through the existing Storage GRPC object.
Two new APIs and three new GRPC objects are added to GRPC protocol between the shim and agent for resizing and getting volume stats:
rpc GetVolumeStats(VolumeStatsRequest) returns (VolumeStatsResponse);
rpc ResizeVolume(ResizeVolumeRequest) returns (google.protobuf.Empty);
message VolumeStatsRequest {
// The volume path on the guest outside the container
string volume_guest_path = 1;
}
message ResizeVolumeRequest {
// Full VM guest path of the volume (outside the container)
string volume_guest_path = 1;
uint64 size = 2;
}
// This should be kept in sync with CSI NodeGetVolumeStatsResponse (https://github.com/container-storage-interface/spec/blob/v1.5.0/csi.proto)
message VolumeStatsResponse {
// This field is OPTIONAL.
repeated VolumeUsage usage = 1;
// Information about the current condition of the volume.
// This field is OPTIONAL.
// This field MUST be specified if the VOLUME_CONDITION node
// capability is supported.
VolumeCondition volume_condition = 2;
}
message VolumeUsage {
enum Unit {
UNKNOWN = 0;
BYTES = 1;
INODES = 2;
}
// The available capacity in specified Unit. This field is OPTIONAL.
// The value of this field MUST NOT be negative.
uint64 available = 1;
// The total capacity in specified Unit. This field is REQUIRED.
// The value of this field MUST NOT be negative.
uint64 total = 2;
// The used capacity in specified Unit. This field is OPTIONAL.
// The value of this field MUST NOT be negative.
uint64 used = 3;
// Units by which values are measured. This field is REQUIRED.
Unit unit = 4;
}
// VolumeCondition represents the current condition of a volume.
message VolumeCondition {
// Normal volumes are available for use and operating optimally.
// An abnormal volume does not meet these criteria.
// This field is REQUIRED.
bool abnormal = 1;
// The message describing the condition of the volume.
// This field is REQUIRED.
string message = 2;
}
Given the following definition:
---
apiVersion: v1
kind: Pod
metadata:
name: app
spec:
runtime-class: kata-qemu
containers:
- name: app
image: centos
command: ["/bin/sh"]
args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
volumeMounts:
- name: persistent-storage
mountPath: /data
volumes:
- name: persistent-storage
persistentVolumeClaim:
claimName: ebs-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
annotations:
skip-hostmount: "true"
name: ebs-claim
spec:
accessModes:
- ReadWriteOncePod
volumeMode: Filesystem
storageClassName: ebs-sc
resources:
requests:
storage: 4Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
csi.storage.k8s.io/fstype: ext4
Let’s assume that changes have been made in the aws-ebs-csi-driver node driver.
Node publish volume
NodePublishVolume API invokes: kata-runtime direct-volume add --volume-path "/kubelet/a/b/c/d/sdf" --mount-info "{\"Device\": \"/dev/sdf\", \"fstype\": \"ext4\"}".Kata-runtime writes the mount-info JSON to a file called mountInfo.json under /run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf.Node unstage volume
NodeUnstageVolume API invokes: kata-runtime direct-volume remove --volume-path "/kubelet/a/b/c/d/sdf"./run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf.Use the volume in sandbox
containerd-shim-kata-v2 examines the container spec,
and iterates through the mounts. For each mount, if there is a mountInfo.json file under /run/kata-containers/shared/direct-volumes/[mount source path],
it generates a storage GRPC object after overwriting the mount spec with the information in mountInfo.json./run/kata-containers/shared/direct-volumes/[mount source path].Node expand volume
NodeExpandVolume API invokes: kata-runtime direct-volume resize –-volume-path "/kubelet/a/b/c/d/sdf" –-size 8Gi./run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf.Node get volume stats
NodeGetVolumeStats API invokes: kata-runtime direct-volume stats –-volume-path "/kubelet/a/b/c/d/sdf"./run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf.