Back to Ceph

Crimson developer documentation

doc/dev/crimson/index.rst

21.0.014.8 KB
Original Source

.. _crimson_dev_doc:

=============================== Crimson developer documentation

See :ref:Crimson's User Guide <crimson_doc> for more information.

Building Crimson

Crimson is not enabled by default. Enable it at build time by running::

$ WITH_CRIMSON=true ./install-deps.sh $ ./do_cmake.sh -DWITH_CRIMSON=ON

Please note, ASan_ is enabled by default if Crimson is built from a source cloned using git.

.. _ASan: https://github.com/google/sanitizers/wiki/AddressSanitizer

.. _dev_crimson_vstart:

vstart.sh

.. note:: vstart.sh_ enables crimson :ref:crimson-required-flags automatically when --crimson is used.

The following options can be used with vstart.sh.

--crimson Start crimson-osd instead of ceph-osd.

--nodaemon Do not daemonize the service.

--redirect-output Redirect the stdout and stderr to out/$type.$num.stdout.

--osd-args Pass extra command line options to crimson-osd or ceph-osd. This is useful for passing Seastar options to crimson-osd. For example, one can supply --osd-args "--memory 2G" to set the amount of memory to use. Please refer to the output of::

  crimson-osd --help-seastar

for additional Seastar-specific command line options.

--crimson-smp The number of cores to use for each OSD. If BlueStore is used, the balance of available cores (as determined by nproc) will be assigned to the object store.

--bluestore Use the alienized BlueStore as the object store backend. This is the default (see above section on the :ref:Crimson's Object Store Backends <crimson-backends> for more details)

--cyanstore Use CyanStore as the object store backend.

--memstore Use the alienized MemStore as the object store backend.

--seastore Use SeaStore as the back end object store.

--seastore-devs Specify the block device used by SeaStore.

--seastore-secondary-devs Optional. SeaStore supports multiple devices. Enable this feature by passing the block device to this option.

--seastore-secondary-devs-type Optional. Specify the type of secondary devices. When the secondary device is slower than main device passed to --seastore-devs, cold data on the faster device will be migrated to the slower devices over time. Valid types include HDD, SSD(default), ZNS, and RANDOM_BLOCK_SSD Note secondary devices should not be faster than the main device.

To start a cluster with a single Crimson node, run::

$ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh
--without-dashboard --bluestore --crimson
--redirect-output

Another SeaStore example::

$ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x
--without-dashboard --seastore
--crimson --redirect-output
--seastore-devs /dev/sda
--seastore-secondary-devs /dev/sdb
--seastore-secondary-devs-type HDD

Stop this vstart cluster by running::

$ ../src/stop.sh --crimson

daemonize

Unlike ceph-osd, crimson-osd does not daemonize itself even if the daemonize option is enabled. In order to read this option, crimson-osd needs to ready its config sharded service, but this sharded service lives in the Seastar reactor. If we fork a child process and exit the parent after starting the Seastar engine, that will leave us with a single thread which is a replica of the thread that called fork()_. Tackling this problem in Crimson would unnecessarily complicate the code.

Since supported GNU/Linux distributions use systemd, which is able to daemonize processes, there is no need to daemonize ourselves. Those using sysvinit can use start-stop-daemon to daemonize crimson-osd. If this is does not work out, a helper utility may be devised.

.. _fork(): http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html

logging

Crimson-osd currently uses the logging utility offered by Seastar. See src/common/dout.h for the mapping between Ceph logging levels to the severity levels in Seastar. For instance, messages sent to derr will be issued using logger::error(), and the messages with a debug level greater than 20 will be issued using logger::trace().

+---------+---------+ | ceph | seastar | +---------+---------+ | < 0 | error | +---------+---------+ | 0 | warn | +---------+---------+ | [1, 6) | info | +---------+---------+ | [6, 20] | debug | +---------+---------+ | > 20 | trace | +---------+---------+

Note that crimson-osd does not send log messages directly to a specified log_file. It writes the logging messages to stdout and/or syslog. This behavior can be changed using --log-to-stdout and --log-to-syslog command line options. By default, --log-to-stdout is enabled, and --log-to-syslog is disabled.

Profiling Crimson

Fio

crimson-store-nbd exposes configurable FuturizedStore internals as an NBD server for use with fio.

In order to use fio to test crimson-store-nbd, perform the below steps.

#. You will need to install libnbd, and compile it into fio

.. prompt:: bash $

  apt-get install libnbd-dev
  git clone git://git.kernel.dk/fio.git
  cd fio
  ./configure --enable-libnbd
  make

#. Build crimson-store-nbd

.. prompt:: bash $

  cd build
  ninja crimson-store-nbd

#. Run the crimson-store-nbd server with a block device. Specify the path to the raw device, for example /dev/nvme1n1, in place of the created file for testing with a block device.

.. prompt:: bash $

  export disk_img=/tmp/disk.img
  export unix_socket=/tmp/store_nbd_socket.sock
  rm -f $disk_img $unix_socket
  truncate -s 512M $disk_img
  ./bin/crimson-store-nbd \
    --device-path $disk_img \
    --smp 1 \
    --mkfs true \
    --type transaction_manager \
    --uds-path ${unix_socket} &

Below are descriptions of these command line arguments:

--smp The number of CPU cores to use (Symmetric MultiProcessor)

--mkfs Initialize the device first.

--type The back end to use. If transaction_manager is specified, SeaStore's TransactionManager and BlockSegmentManager are used to emulate a block device. Otherwise, this option is used to choose a backend of FuturizedStore, where the whole "device" is divided into multiple fixed-size objects whose size is specified by --object-size. So, if you are only interested in testing the lower-level implementation of SeaStore like logical address translation layer and garbage collection without the object store semantics, transaction_manager would be a better choice.

#. Create a fio job file named nbd.fio

.. code:: ini

  [global]
  ioengine=nbd
  uri=nbd+unix:///?socket=${unix_socket}
  rw=randrw
  time_based
  runtime=120
  group_reporting
  iodepth=1
  size=512M

  [job0]
  offset=0

#. Test the Crimson object store, using the custom fio built just now

.. prompt:: bash $

  ./fio nbd.fio

CBT

We can use cbt_ for performance tests::

$ git checkout main $ make crimson-osd $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml $ git checkout yet-another-pr $ make crimson-osd $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml $ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v 19:48:23 - INFO - cbt - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155 => accepted 19:48:23 - INFO - cbt - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0 => accepted 19:48:23 - WARNING - cbt - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833 => rejected 19:48:23 - INFO - cbt - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712 => accepted 19:48:23 - INFO - cbt - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619 => accepted 19:48:23 - INFO - cbt - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0 => accepted 19:48:23 - INFO - cbt - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495 => accepted 19:48:23 - INFO - cbt - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251 => accepted 19:48:23 - INFO - cbt - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336 => accepted 19:48:23 - INFO - cbt - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0 => accepted 19:48:23 - INFO - cbt - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted 19:48:23 - INFO - cbt - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029 => accepted 19:48:23 - INFO - cbt - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372 => accepted 19:48:23 - INFO - cbt - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0 => accepted 19:48:23 - INFO - cbt - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted 19:48:23 - INFO - cbt - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.0508262/0.0557337 => accepted 19:48:23 - WARNING - cbt - 1 tests failed out of 16

Here we compile and run the same test against two branches: main and yet-another-pr. We then compare the results. Along with every test case, a set of rules is defined to check for performance regressions when comparing the sets of test results. If a possible regression is found, the rule and corresponding test results are highlighted.

.. _cbt: https://github.com/ceph/cbt

Hacking Crimson

Seastar Documents

See Seastar Tutorial <https://github.com/scylladb/seastar/blob/master/doc/tutorial.md>_ . Or build a browsable version and start an HTTP server::

$ cd seastar $ ./configure.py --mode debug $ ninja -C build/debug docs $ python3 -m http.server -d build/debug/doc/html

Install pandoc and other dependencies beforehand.

Debugging Crimson

Debugging with GDB

The tips_ for debugging Scylla also apply to Crimson.

.. _tips: https://github.com/scylladb/scylla/blob/master/docs/dev/debugging.md#tips-and-tricks

Human-readable backtraces with addr2line

When a Seastar application crashes, it leaves us with a backtrace of addresses, like::

Segmentation fault. Backtrace: 0x00000000108254aa 0x00000000107f74b9 0x00000000105366cc 0x000000001053682c 0x00000000105d2c2e 0x0000000010629b96 0x0000000010629c31 0x00002a02ebd8272f 0x00000000105d93ee 0x00000000103eff59 0x000000000d9c1d0a /lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a 0x000000000d833ac9 Segmentation fault

The seastar-addr2line utility provided by Seastar can be used to map these addresses to functions. The script expects input on stdin, so we need to copy and paste the above addresses, then send EOF by inputting control-D in the terminal. One might use echo or cat instead::

$ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd

0x00000000108254aa
0x00000000107f74b9
0x00000000105366cc
0x000000001053682c
0x00000000105d2c2e
0x0000000010629b96
0x0000000010629c31
0x00002a02ebd8272f
0x00000000105d93ee
0x00000000103eff59
0x000000000d9c1d0a
0x00000000108254aa

[Backtrace #0] seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136 seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157 seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164 seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119 seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105 seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101 ?? ??:0 seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418 seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5) main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1)

Note that seastar-addr2line is able to extract addresses from its input, so you can also paste the log messages as below::

2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace: 2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e78dbc 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e7f0 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e8b8 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e985 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: /lib64/libpthread.so.0+0x0000000000012dbf

Unlike the classic ceph-osd, Crimson does not print a human-readable backtrace when it handles fatal signals like SIGSEGV or SIGABRT. It is also more complicated with a stripped binary. So instead of planting a signal handler for those signals into Crimson, we can use script/ceph-debug-docker.sh to map addresses in the backtrace::

assuming you are under the source tree of ceph

$ ./src/script/ceph-debug-docker.sh --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8 .... [root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line [root@3deb50a8ad51 ~]# dnf install -q -y file [root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd

paste the backtrace here

Code Walkthroughs

  • Ceph Code Walkthroughs: Crimson <https://www.youtube.com/watch?v=rtkrHk6grsg>_

  • Ceph Code Walkthroughs: SeaStore <https://www.youtube.com/watch?v=0rr5oWDE2Ck>_

.. rubric:: Contents

.. toctree:: :maxdepth: 1

OSDState <osd> The ClientRequest Pipeline <pipeline> Error Handling <error-handling> BackfillMachine <backfillmachine> SeaStore <seastore> PoseidonStore <poseidonstore>