Balance CPU Crimson.

We introduce the following utilities to help analyse the performance impact of two strategies for allocating CPU cores to Seastar reactor threads. This is currently limited to single-host deployments.

OSD-based: allocates CPU cores from the same NUMA socket to the same OSD. For simplicity, if the OSD id is even, all its reactor threads are allocated to NUMA socket 0; conversely, if the OSD id is odd, all its reactor threads are allocated to NUMA socket 1.
NUMA socket-based: allocates CPU cores evenly from each NUMA socket to the reactors, so every OSD ends up with reactors on both NUMA sockets.

A new option --crimson-balance-cpu <osd|socket> has been implemented in vstart.sh to support these strategies.

Counting the default behaviour, there are three CPU allocation strategies in total:

default (the new flag is not specified): Seastar reactors use CPUs in ascending contiguous order (unbalanced across sockets),
osd: distribute across sockets uniformly, don't split within an OSD,
socket: distribute across sockets uniformly, split within an OSD.
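
The three strategies above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in balance-cpu.py: the function name allocate, its parameters, and the 2-socket, 16-CPU layout are assumptions made for the example.

```python
def allocate(strategy, num_osd, num_reactors, sockets):
    """Return {osd_id: [cpu ids]} for one allocation strategy.

    sockets is a list with one list of CPU ids per NUMA socket.
    Hypothetical helper for illustration; not the balance-cpu.py API.
    """
    # Work on copies so the caller's lists are left untouched.
    free = [list(cpus) for cpus in sockets]

    def take(socket_idx):
        # Hand out the lowest free CPU id of the given socket.
        if not free[socket_idx]:
            raise ValueError(f"socket {socket_idx} has no free CPU left")
        return free[socket_idx].pop(0)

    plan = {}
    if strategy == "default":
        # Ascending contiguous order, ignoring socket boundaries.
        flat = sorted(cpu for cpus in sockets for cpu in cpus)
        if num_osd * num_reactors > len(flat):
            raise ValueError("not enough CPUs for the requested reactors")
        for osd in range(num_osd):
            plan[osd] = flat[osd * num_reactors:(osd + 1) * num_reactors]
    elif strategy == "osd":
        # All reactors of an OSD share one socket (even id -> 0, odd id -> 1
        # on a two-socket host).
        for osd in range(num_osd):
            socket_idx = osd % len(free)
            plan[osd] = [take(socket_idx) for _ in range(num_reactors)]
    elif strategy == "socket":
        # Alternate sockets reactor by reactor, so each OSD spans sockets.
        turn = 0
        for osd in range(num_osd):
            plan[osd] = []
            for _ in range(num_reactors):
                plan[osd].append(take(turn % len(free)))
                turn += 1
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return plan


# Assumed layout: socket 0 owns CPUs 0-7, socket 1 owns CPUs 8-15.
SOCKETS = [list(range(0, 8)), list(range(8, 16))]
for name in ("default", "osd", "socket"):
    print(name, allocate(name, 3, 3, SOCKETS))
```

With three OSDs and three reactors each, the default plan keeps OSDs 0 and 1 entirely on socket 0, the osd plan places OSD 1 alone on socket 1, and the socket plan gives every OSD CPUs from both sockets.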

The utilities introduced are:

balance-cpu.py: a stand-alone script to produce the list of CPU core ids used by vstart.sh when allocating Seastar reactor threads. It takes as input the .json file produced by lscpu --json (see lscpu.py below).
lscpu.py: a Python module to parse the .json file created by lscpu --json. This produces a Python dictionary with the NUMA details, that is, the number of sockets and the ranges of CPU core ids (physical and HT-siblings).
tasksetcpu.py: a stand-alone script to produce a grid showing the current CPU allocation, useful to quickly visualise the allocation strategy.
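
To give an idea of the parsing step, here is a minimal sketch of extracting the per-node CPU ranges from lscpu --json output. It is not the lscpu.py module itself: the function names are hypothetical, and the embedded SAMPLE is a trimmed, made-up document that assumes the usual "field"/"data" entries (optionally nested under "children" in newer util-linux releases).

```python
import json
import re


def parse_cpu_list(spec):
    """Expand an lscpu-style CPU list such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def walk(entries):
    """Yield (field, data) pairs, descending into nested 'children' lists."""
    for entry in entries:
        yield entry.get("field", ""), entry.get("data")
        yield from walk(entry.get("children", []))


def numa_layout(doc):
    """Return {numa_node_id: [cpu ids]} from a parsed lscpu --json document."""
    nodes = {}
    for field, data in walk(doc.get("lscpu", [])):
        match = re.match(r"NUMA node(\d+) CPU\(s\):", field)
        if match and data:
            nodes[int(match.group(1))] = parse_cpu_list(data)
    return nodes


# Made-up sample for a two-node host with 16 logical CPUs.
SAMPLE = """
{"lscpu": [
  {"field": "CPU(s):", "data": "16"},
  {"field": "NUMA:", "data": null, "children": [
    {"field": "NUMA node(s):", "data": "2"},
    {"field": "NUMA node0 CPU(s):", "data": "0-3,8-11"},
    {"field": "NUMA node1 CPU(s):", "data": "4-7,12-15"}
  ]}
]}
"""

layout = numa_layout(json.loads(SAMPLE))
print(layout)
```

On a real host, SAMPLE would be replaced by the contents of the file written by lscpu --json.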

Usage:

The following is a typical example of creating a cluster with three OSDs and three reactors per OSD, and the desired CPU allocation policy:

# MDS=0 MON=1 OSD=3 MGR=1 /ceph/src/vstart.sh --new -x --localhost --without-dashboard --cyanstore --redirect-output --crimson --crimson-smp 3 --no-restart --crimson-balance-cpu osd

The following is the corresponding CPU distribution:

The following snippet shows the typical usage of the balance-cpu.py script:

lscpu --json > /tmp/numa_nodes.json
python3 ${CEPH_DIR}/../src/tools/contrib/balance-cpu.py -o $CEPH_NUM_OSD -r $crimson_smp \
  -b $balance_strategy -u /tmp/numa_nodes.json > /tmp/numa_args.out

The accepted balance strategies are "osd" and "socket".
The file produced, /tmp/numa_args.out, contains the list of CPU ids that vstart.sh consumes to issue the corresponding Ceph configuration commands.

The grid can be printed as follows:

  [ ! -f "${NUMA_NODES_OUT}" ] && lscpu --json > ${NUMA_NODES_OUT}
  python3 /ceph/src/tools/contrib/tasksetcpu.py -c $TEST_NAME -u ${NUMA_NODES_OUT} -d ${RUN_DIR}
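
For readers unfamiliar with this kind of view, the sketch below renders a simple text grid from a NUMA layout and a CPU-to-owner mapping. It is a hypothetical illustration of the idea, not the output format of tasksetcpu.py: the function render_grid, the labels, and the layout are all assumptions.

```python
def render_grid(sockets, owners, width=8):
    """Render one block of rows per socket; each cell is 'cpu:owner'.

    sockets: {socket_id: [cpu ids]}; owners: {cpu id: short label}.
    CPUs without an owner are shown with '.'.
    """
    lines = []
    for socket_id in sorted(sockets):
        lines.append(f"socket {socket_id}")
        cpus = sockets[socket_id]
        for start in range(0, len(cpus), width):
            row = cpus[start:start + width]
            cells = [f"{cpu:>3}:{owners.get(cpu, '.'):<2}" for cpu in row]
            lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)


# Assumed example: an OSD-based allocation of 3 OSDs x 3 reactors.
layout = {0: list(range(0, 8)), 1: list(range(8, 16))}
owners = {0: "o0", 1: "o0", 2: "o0", 8: "o1", 9: "o1", 10: "o1",
          3: "o2", 4: "o2", 5: "o2"}
grid = render_grid(layout, owners)
print(grid)
```

A real tool would fill the owners mapping from the running reactor threads (for example from taskset or /proc) rather than from a hard-coded dictionary.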

Performance

The following charts compare IOPS for the three CPU allocation policies: default (contiguous allocation, no balance), OSD-based, and NUMA socket-based. There does not seem to be any significant throughput degradation for this small configuration (3 OSDs, 3 reactors per OSD). However, the OSD-based allocation shows higher memory utilisation than the other two configurations, which is an interesting finding that requires further investigation.