doc/dev/osd_internals/map_message_handling.rst
The OSD handles routing incoming messages to PGs, in some cases creating the PG first.
PG messages generally come in two varieties:

  1. Peering messages
  2. Client operations
There are several ways in which a message might be dropped or delayed. It is important that delaying a message does not violate certain message ordering requirements on the way to the relevant PG handling logic:

  1. Ops referring to the same object must not be reordered.
  2. Peering messages must not be reordered.
  3. Subsets of peering messages must not be reordered.
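These ordering requirements amount to per-source FIFO behavior: however long delivery is delayed, messages from the same source must reach the PG in receipt order. A minimal sketch of that invariant (class and field names here are illustrative, not Ceph's actual types):

```cpp
#include <cassert>
#include <deque>
#include <map>

// Illustrative sketch: queuing messages per source, in receipt order,
// preserves ordering even when delivery to the PG is delayed.
struct OrderedQueues {
  std::map<int, std::deque<int>> per_source;  // source id -> message seqs

  void enqueue(int source, int seq) { per_source[source].push_back(seq); }

  // Always yields messages in receipt order; assumes the queue is non-empty.
  int dequeue(int source) {
    int seq = per_source[source].front();
    per_source[source].pop_front();
    return seq;
  }
};
```

Note that ordering is only maintained within a single source's queue; messages from different sources may still interleave arbitrarily, which is acceptable under the requirements above.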
MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the OSD must perform several tasks:

  1. Persist the new maps to the filestore (several PG operations rely on having access to maps dating back to the last time the PG was clean).
  2. Update and persist the superblock.
  3. Update OSD state related to the current map.
  4. Expose new maps to PG processes via OSDService.
  5. Remove PGs due to pool removal.
  6. Queue dummy events to trigger PG map catchup.
Each PG asynchronously catches up to the currently published map during process_peering_events before processing the event. As a result, different PGs may have different views as to the "current" map.
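The catchup behavior can be sketched as each PG tracking its own epoch and advancing it independently toward the OSD's published epoch (a hypothetical model, not the actual Ceph `PG` class):

```cpp
#include <cassert>

// Hypothetical sketch of per-PG map catchup; not Ceph's actual classes.
struct PgMapView {
  unsigned epoch = 0;  // this PG's current view of the map

  // Advance one epoch at a time until we reach the OSD's published epoch.
  // In Ceph this happens during process_peering_events, consuming each
  // intervening map before the queued event is processed.
  void catch_up(unsigned published_epoch) {
    while (epoch < published_epoch) {
      ++epoch;
    }
  }
};
```

Because each PG calls catch-up only when it processes an event, two PGs on the same OSD can legitimately disagree about the "current" epoch at any instant.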
One consequence of this design is that messages containing submessages from multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage with the PG's epoch as well as tagging the message as a whole with the OSD's current published epoch.
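The tagging scheme can be sketched as follows (struct and field names are illustrative stand-ins, not the actual Ceph wire format):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch: each submessage carries the epoch of the PG that
// generated it, while the message as a whole carries the sending OSD's
// current published epoch.
struct SubMessage {
  uint64_t pg_epoch;  // epoch of the originating PG's view of the map
  // ... per-PG payload would go here
};

struct MultiPGMessage {
  uint64_t osd_epoch;            // sender's current published epoch
  std::vector<SubMessage> subs;  // one entry per PG
};
```

Since PGs catch up to the published map asynchronously, each `pg_epoch` may lag the sender's `osd_epoch`; the receiver needs both to route each submessage correctly.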
See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used between OSDs to coordinate most non-peering activities, including replicating MOSDPGOp operations.
OSD::require_same_or_newer_map checks that the current OSDMap is at least as new as the map epoch indicated on the message. If not, the message is queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. This cannot violate the above conditions: any two messages are queued in order of receipt, and if a message is received with epoch e0, a later message from the same source must have epoch at least e0. Note that for single-PG messages, two PGs on the same OSD count as different sources; that is, messages from different PGs may be reordered.
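The check can be sketched as follows (a minimal model; the `Osd` struct and its fields are illustrative, and the real code queues the message itself rather than an epoch number):

```cpp
#include <cassert>
#include <deque>

// Minimal sketch of the "same or newer map" check.
struct Osd {
  unsigned current_epoch = 0;
  std::deque<unsigned> waiting_for_osdmap;  // stand-in for queued messages

  // Returns true if the message may be dispatched now. If the message's
  // epoch is ahead of our current map, park it until a new map arrives;
  // queuing in receipt order preserves per-source ordering.
  bool require_same_or_newer_map(unsigned msg_epoch) {
    if (msg_epoch > current_epoch) {
      waiting_for_osdmap.push_back(msg_epoch);  // wait_for_new_map
      return false;
    }
    return true;
  }
};
```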
MOSDPGOps are processed as follows:

  1. OSD::handle_op validates permissions and the crush mapping, and discards the request if the client is no longer connected and cannot receive the reply (see OSD::op_is_discardable, OSDService::handle_misdirected_op, PG::op_has_sufficient_caps, OSD::require_same_or_newer_map).
  2. OSD::enqueue_op queues the op.
MOSDSubOps are processed as follows:

  1. OSD::handle_sub_op checks that the sender is in fact an OSD.
  2. OSD::enqueue_op queues the op.
OSD::enqueue_op calls PG::queue_op, which checks waiting_for_map before calling OpWQ::queue, which adds the op to the queue of the PG responsible for handling it.
OSD::dequeue_op is then eventually called with a lock on the PG. At this time, the op is passed to PG::do_request, which checks that:

  1. the PG map is new enough (PG::must_delay_request)
  2. the client requesting the op has sufficient permissions (PG::op_has_sufficient_caps)
  3. the op is not to be discarded (PG::can_discard_{request,op,subop,scan,backfill})
  4. the PG is active
If these conditions are not met, the op is either discarded or queued for later processing. If all conditions are met, the op is processed according to its type:

  1. CEPH_MSG_OSD_OP is handled by PG::do_op
  2. MSG_OSD_SUBOP is handled by PG::do_sub_op
  3. MSG_OSD_SUBOPREPLY is handled by PG::do_sub_op_reply
  4. MSG_OSD_PG_SCAN is handled by PG::do_scan
  5. MSG_OSD_PG_BACKFILL is handled by PG::do_backfill
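The dispatch by message type reduces to a switch over the type constant; a sketch (the enum values are stand-ins for the real CEPH_MSG_* / MSG_OSD_* constants, and the returned strings stand in for the PG::do_* handler calls):

```cpp
#include <cassert>
#include <string>

// Illustrative type-based dispatch; enum values are stand-ins for the
// real message type constants.
enum MsgType { OSD_OP, OSD_SUBOP, OSD_SUBOPREPLY, PG_SCAN, PG_BACKFILL };

// Returns the name of the handler that would be invoked for each type.
std::string dispatch(MsgType t) {
  switch (t) {
    case OSD_OP:        return "do_op";
    case OSD_SUBOP:     return "do_sub_op";
    case OSD_SUBOPREPLY: return "do_sub_op_reply";
    case PG_SCAN:       return "do_scan";
    case PG_BACKFILL:   return "do_backfill";
  }
  return "";
}
```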
PrimaryLogPG::do_op handles CEPH_MSG_OSD_OP ops and may queue an op for later processing, for example while an object the op requires is still missing or while the scrubber is working on it.
See OSD::handle_pg_(notify|info|log|query)
Peering messages are tagged with two epochs:

  1. epoch_sent: the map epoch at which the message was sent
  2. query_epoch: the map epoch at which the event triggering the message was sent
These are the same in cases where there was no triggering message. We discard a peering message if the PG in question has entered a new epoch since the message's query_epoch (see PG::old_peering_evt, PG::queue_peering_event). Notifies, infos, queries, and logs are all handled as PG::PeeringMachine events and are wrapped by PG::queue_* in PG::CephPeeringEvts, which include the created state machine event along with epoch_sent and query_epoch in order to generically check PG::old_peering_msg upon insertion into and removal from the queue.
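The staleness check can be sketched as comparing the event's query_epoch against the epoch at which the PG last reset its peering state (field and function names here are illustrative; the real logic lives in PG::old_peering_evt and the related message check):

```cpp
#include <cassert>

// Illustrative sketch of the peering-event staleness check.
struct PeeringEvt {
  unsigned epoch_sent;   // map epoch at which the message was sent
  unsigned query_epoch;  // map epoch of the triggering event
};

// Discard the event if the PG entered a new interval (modeled here as
// last_peering_reset) after the event's query_epoch was recorded.
bool old_peering_evt(unsigned last_peering_reset, const PeeringEvt& e) {
  return e.query_epoch < last_peering_reset;
}
```

Carrying both epochs on the queued event is what lets this check run generically, both when the event is inserted into the queue and again when it is removed.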
Note that notifies, logs, and infos can trigger the creation of a PG. See OSD::get_or_create_pg.