design/ceph/ceph-nvmeof-gateway.md
This design document proposes adding NVMe over Fabrics (NVMe-oF) gateway support to Rook Ceph, enabling RBD volumes to be exposed and accessed from outside the Kubernetes cluster via the NVMe/TCP protocol.
Currently, Rook Ceph provides excellent support for RBD volumes through CSI drivers, but lacks a mechanism to expose RBD volumes to clients outside the Kubernetes cluster. With Ceph's deprecation of iSCSI gateway support and the introduction of NVMe-oF gateway functionality, there is an opportunity to provide external block storage access over the NVMe-oF protocol.
Rook will introduce a CephNVMeOFGateway CRD that handles the communication between block PVCs in the Kubernetes cluster and external clients.
The NVMe-oF implementation separates concerns between infrastructure management and storage provisioning:
- Rook Operator: Manages NVMe-oF gateway pod lifecycle, scaling, and health monitoring
- Ceph CSI Driver: Handles dynamic provisioning, subsystem creation, and NVMe namespace management
- NVMe-oF Gateways: Serve the NVMe-oF protocol and manage RBD backend connections
```
                         +----------------------+
                         |  Kubernetes Cluster  |
                         +----------+-----------+
                                    |
         +--------------------------+--------------------------+
         |                          |                          |
+--------v--------+        +--------v--------+        +--------v--------+
|  Rook Operator  |        |  Ceph Cluster   |        | NVMe-oF Gateway |
+--------+--------+        +--------+--------+        +--------+--------+
          \                         |                         /
           \                        |                        /
            +-----------------------v-----------------------+
            |                 Storage Layer                 |
            |           +----------------------+            |
            |           |  OSD 1 (Physical)    |            |
            |           +----------------------+            |
            +-----------------------+-----------------------+
                                    |
                         +----------v----------+
                         |     RBD Volume      |
                         |   (Block Device)    |
                         +----------+----------+
                                    |
                         +----------v----------+
                         |   NVMe-oF Target    |
                         |  (Presents RBD as   |
                         |   NVMe Namespace)   |
                         +----------+----------+
                                    |
                         +----------v----------+
                         |       Network       |
                         |   TCP/IP or RDMA    |
                         | (NVMe-oF Protocol)  |
                         +----------+----------+
                                    |
                         +----------v----------+
                         |     Client Node     |
                         +---------------------+
                         | NVMe-oF Initiator   |
                         |    (nvme-cli)       |
                         |                     |
                         |  Application sees   |
                         |   /dev/nvmeXnY      |
                         +---------------------+
```
The CephNVMeOFGateway CRD focuses solely on gateway infrastructure deployment, leaving storage provisioning to the CSI driver.
In the example below, `gatewayGroup: "gateway-group-1"` defines the logical grouping of gateways, and `replicas: 2` requests two gateway instances for HA:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephNVMeOFGateway
metadata:
  name: nvmeof-gateway
  namespace: rook-ceph
spec:
  gatewayGroup: "gateway-group-1"
  replicas: 2
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/storage
                operator: In
                values: ["true"]
  resources:
  # limits:
  #   cpu: "500m"
  #   memory: "1024Mi"
  # requests:
  #   cpu: "500m"
  #   memory: "1024Mi"
```
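From this CR the operator would generate a gateway Deployment. The sketch below is illustrative only: the container image (the upstream ceph-nvmeof image, with an assumed tag), Deployment name, and port layout are assumptions rather than committed implementation details; the `app` and `gateway-group` labels match the service selector shown later in this document.

```yaml
# Illustrative sketch of the Deployment the operator might generate
# from the CR above. Name, image tag, and ports are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-nvmeof-gateway-gateway-group-1
  namespace: rook-ceph
spec:
  replicas: 2                       # from spec.replicas
  selector:
    matchLabels:
      app: rook-ceph-nvmeof-gateway
      gateway-group: gateway-group-1
  template:
    metadata:
      labels:
        app: rook-ceph-nvmeof-gateway
        gateway-group: gateway-group-1
    spec:
      containers:
        - name: nvmeof-gateway
          image: quay.io/ceph/nvmeof:latest   # upstream ceph-nvmeof image (assumed tag)
          ports:
            - containerPort: 4420   # NVMe/TCP listener
            - containerPort: 5500   # gRPC management/control plane
```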
## StorageClass Configuration

The NVMe-oF StorageClass integrates with the CSI driver for dynamic provisioning:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-nvmeof
provisioner: nvmeof.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  subsystemNQN: nqn.2016-06.io.ceph:rook-ceph
  nvmeofGatewayAddress: ceph-nvmeof-gateway.rook-ceph.svc.cluster.local
  nvmeofGatewayPort: "5500"
  listenerPort: "4420"
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  imageFormat: "2"
  imageFeatures: layering,deep-flatten,exclusive-lock,object-map,fast-diff
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: false
```
```yaml
# Example PVC using NVMe-oF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvmeof-external-volume
  namespace: default
spec:
  storageClassName: ceph-nvmeof
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
Note: This PVC is created solely for CSI driver provisioning. No Kubernetes pod will mount it as the volume is accessed by external NVMe-oF clients.
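Once the PVC is bound, the external client needs the subsystem NQN and gateway address for the provisioned volume. Assuming the CSI driver records these in the PersistentVolume's `volumeAttributes` (an assumption of this sketch; the exact attribute names depend on the eventual CSI implementation), an administrator could retrieve them with:

```sh
# Hypothetical: attribute names depend on the eventual CSI implementation.
PV=$(kubectl get pvc nvmeof-external-volume -n default -o jsonpath='{.spec.volumeName}')
kubectl get pv "$PV" -o jsonpath='{.spec.csi.volumeAttributes}'
```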
The initiator (the external client) can then be configured by following the procedure in the Ceph documentation for connecting to the NVMe-backed block device.
Multiple gateways within the same group share configuration and provide high availability.
Rook automatically creates a Kubernetes service for gateway discovery:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: ceph-nvmeof-gateway-group-1
  namespace: rook-ceph
  labels:
    gateway-group: gateway-group-1
spec:
  selector:
    app: rook-ceph-nvmeof-gateway
    gateway-group: gateway-group-1
  ports:
    - name: nvmeof-data
      port: 4420
      targetPort: 4420
      protocol: TCP
    - name: management
      port: 5500
      targetPort: 5500
      protocol: TCP
  type: ClusterIP
```
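A ClusterIP service is only reachable from inside the cluster; for initiators outside the cluster the data port would additionally need to be exposed. The sketch below shows one standard Kubernetes pattern for this (the service name is hypothetical and external exposure details are not decided by this proposal):

```yaml
# Hypothetical external exposure of the NVMe/TCP data port.
apiVersion: v1
kind: Service
metadata:
  name: ceph-nvmeof-gateway-group-1-external
  namespace: rook-ceph
spec:
  selector:
    app: rook-ceph-nvmeof-gateway
    gateway-group: gateway-group-1
  ports:
    - name: nvmeof-data
      port: 4420
      targetPort: 4420
      protocol: TCP
  type: LoadBalancer
```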
External clients connect using standard NVMe-oF procedures; detailed client setup instructions are available in the Ceph documentation. Basic client connection steps:
```sh
# Discover subsystems advertised by the gateway
nvme discover -t tcp -a <gateway-service-ip> -s 4420
# Connect to a subsystem (4420 is the NVMe/TCP listener port defined in the StorageClass)
nvme connect -t tcp -n <nqn> -a <gateway-ip> -s 4420
```
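After connecting, the RBD-backed namespace should appear as a local NVMe block device, which can be verified with standard nvme-cli commands:

```sh
# List connected NVMe devices; the namespace appears as /dev/nvmeXnY
nvme list
# Show subsystem/controller topology, useful when multiple gateways provide paths
nvme list-subsys
```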
- NVMe Namespace: The logical block device presented by the NVMe-oF gateway.
- subsystems: A list of NVMe-oF subsystems to be configured on this gateway group. Each subsystem acts as a container for NVMe namespaces and defines access control.
- nqn: The NVMe Qualified Name (NQN) for the subsystem (e.g., nqn.2016-06.io.spdk:production). This NQN is advertised to initiators.
- hosts: A list of initiator NQNs allowed to connect to this specific subsystem. If empty, any initiator can connect (less secure). This list can be updated dynamically by the CSI driver or manually, as illustrated in the sketch below.
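A minimal sketch of how these fields might appear in the CephNVMeOFGateway spec; the schema shown is illustrative, with only the field names taken from the descriptions above:

```yaml
# Illustrative only: shows the subsystems/nqn/hosts fields described above.
spec:
  gatewayGroup: "gateway-group-1"
  replicas: 2
  subsystems:
    - nqn: nqn.2016-06.io.spdk:production
      hosts:
        # Only these initiator NQNs may connect; an empty list means open access (less secure)
        - nqn.2014-08.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555
```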
References:

- https://docs.ceph.com/en/latest/rbd/nvmeof-overview/
- https://github.com/ceph/ceph-nvmeof
- https://github.com/ceph/ceph-csi/pull/5397/