Data Collector Design

Motivation

As a Chef user who uses both Chef Client Mode and Chef Solo Mode (including the mode commonly known as "Chef Client Local Mode"),
I want to be able to collect data about my entire fleet regardless of their client operation type,
so that I may better understand the impacts of my changes and may better detect failures.

Definitions

To eliminate ambiguity and confusion, the following terms are used throughout this RFC:

Chef: the tool used to automate your system.
Chef Client Mode: Chef configured in "client mode" where a Chef Server is used to provide Chef its resources and artifacts
Chef Solo Mode: Chef configured in a mode that utilizes a local Chef Zero server. Formerly known as "Chef Client Local Mode" (run as chef-client --local-mode)
Chef Solo Legacy Mode: Chef in the pre 12.10 Solo operational mode (run as chef-solo) or Chef run as chef-solo --legacy-mode

Specification

Similar to how data is collected and reported for Chef Reporting, we expect to implement a new EventDispatch class/instance that collects data about the Chef run and reports it accordingly. Unlike Chef Reporting, the server that receives this data is not running on the Chef Server, allowing users to utilize this function whether they use Chef Server or not. No new data collection methods are expected to be implemented as a result of this change; this change serves to implement a generic way to report the collected data in a "webhook-like" fashion to a non-Chef-Server receiver.

The implementation must work with Chef running in any mode:

Chef Client Mode
Chef Solo Mode
Chef Solo Legacy Mode

Protocol and Authentication

All payloads will be sent to the Data Collector server via HTTP POST to the URL specified in the data_collector_server_url configuration parameter. Users should be encouraged to use a TLS-protected endpoint.

Optionally, payloads may also be written out to multiple HTTP endpoints or JSON files on the local filesystem (of the node running chef-client) by specifying the data_collector_output_locations configuration parameter.

For the initial implementation, transmissions to the Data Collector server can optionally be authenticated with the use of a pre-shared token, which will be sent in a HTTP header. Given that the receiver is not the Chef Server, existing methods of using a Chef client key to authenticate the request are unavailable.

Configuration

The configuration required for this new functionality can be placed in the client.rb or any other Chef::Config-supported location (such as a client.d or solo.d directory).

Parameters

data_collector_server_url: required*. The full URL to the data collector server API. All messages will be POST'd to this URL. The Data Collector class will be registered and enabled if this config parameter is specified. * If the data_collector_output_locations configuration parameter is specified, this setting may be omitted.
data_collector_token: optional. A pre-shared token that, if present, will be passed as an HTTP header named x-data-collector-token to the Data Collector server. The server can choose to accept or reject the data posted based on the token or lack thereof.
data_collector_mode: The Chef mode in which the Data Collector will be enabled. For example, you may wish to only enable the Data Collector when running in Chef Solo Mode. Must be one of: :solo, :client, or :both. The :solo value is used for Chef operating in Chef Solo Mode or Chef Solo Legacy Mode. Default: :both.
data_collector_raise_on_failure: If true, the Chef run will fatally exit if it is unable to successfully POST to the Data Collector server. Default: false.
data_collector_output_locations: optional. An array of URLs and/or file paths to which data collection payloads will also be written. This may be used without specifying the data_collector_server_url configuration parameter.

Schemas

For the initial implementation, three JSON schemas will be utilized.

Action Schema

The Action Schema is used to notify when a Chef object changes. In our case, the primary use will be to update the Data Collector server with the current node object.

json

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Data Collector - action schema",
  "properties": {
    "entity_name": {
      "description": "The name of the entity",
      "type": "string"
    },
    "entity_type": {
      "description": "The type of the entity",
      "type": "string",
      "enum": [
        "bag",
        "client",
        "cookbook",
        "environment",
        "group",
        "item",
        "node",
        "organization",
        "permission",
        "role",
        "user",
        "version"]
    },
    "entity_uuid": {
      "description": "Unique ID identifying this object, which should persist across runs and invocations",
      "type": "string",
      "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$"
    },
    "id": {
      "description": "Globally Unique ID for this message",
      "type": "string",
      "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$"
    },
    "message_version": {
      "description": "Message Version",
      "type": "string",
      "enum": [
        "1.1.0"
      ]
    },
    "message_type": {
      "description": "Message Type",
      "type": "string",
      "enum": ["action"]
    },
    "organization_name": {
      "description": "It is the name of the org on which the run took place",
      "type": ["string", "null"]
    },
    "recorded_at": {
      "description": "It is the ISO timestamp when the action happened",
      "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-5][0-9]:[0-9]{2}Z$",
      "type": "string"
    },
    "remote_hostname": {
      "description": "The remote hostname which initiated the action",
      "type": "string"
    },
    "requestor_name": {
      "description": "The name of the client or user that initiated the action",
      "type": "string"
    },
    "requestor_type": {
      "description": "Was the requestor a client or user?",
      "type": "string",
      "enum": ["client", "user"]
    },
    "run_id": {
      "description": "The run ID of the run in which this node object was updated",
      "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$",
      "type": "string"
    },
    "service_hostname": {
      "description": "The FQDN of the Chef server, if appropriate",
      "type": "string"
    },
    "source": {
      "description": "The tool / client mode that initiated the action. Note that 'chef_solo' includes Chef Solo Mode and Chef Solo Legacy Mode.",
      "type": "string",
      "enum": ["chef_solo", "chef_client"]
    },
    "task": {
      "description": "What action was performed?",
      "type": "string",
      "enum": ["associate", "create", "delete", "dissociate", "invite", "reject", "update"]
    },
    "user_agent": {
      "description": "The User-Agent of the requestor",
      "type": "string"
    },
    "data": {
      "description": "The payload containing the entire request data",
      "type": "object"
    }
  },
  "required": [
    "entity_name",
    "entity_type",
    "entity_uuid",
    "id",
    "message_type",
    "message_version",
    "organization_name",
    "recorded_at",
    "remote_hostname",
    "requestor_name",
    "requestor_type",
    "run_id",
    "service_hostname",
    "source",
    "task",
    "user_agent"
  ],
  "title": "ActionSchema",
  "type": "object"
}

The data field will contain the value of the object on which an action took place.

Run Start Schema

The Run Start Schema will be used by Chef to notify the data collection server at the start of the Chef run.

json

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Data Collector - Runs run_start schema",
  "properties": {
    "chef_server_fqdn": {
      "description": "It is the FQDN of the chef_server against which current reporting instance runs",
      "type": "string"
    },
    "entity_uuid": {
      "description": "Unique ID identifying this node, which should persist across Chef runs",
      "type": "string",
      "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$"
    },
    "id": {
      "description": "It is the internal message id for the run",
      "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$",
      "type": "string"
    },
    "message_version": {
      "description": "Message Version",
      "type": "string",
      "enum": [
        "1.0.0"
      ]
    },
    "message_type": {
      "description": "It defines the type of message being sent",
      "type": "string",
      "enum": ["run_start"]
    },
    "node_name": {
      "description": "It is the name of the node on which the run took place",
      "type": "string"
    },
    "organization_name": {
      "description": "It is the name of the org on which the run took place",
      "type": "string"
    },
    "run_id": {
      "description": "It is the runid for the run",
      "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$",
      "type": "string"
    },
    "source": {
      "description": "The tool / client mode that initiated the action. Note that 'chef_solo' includes Chef Solo Mode and Chef Solo Legacy Mode.",
      "type": "string",
      "enum": ["chef_solo", "chef_client"]
    },
    "start_time": {
      "description": "It is the ISO timestamp of when the run started",
      "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$",
      "type": "string"
    }
  },
  "required": [
    "chef_server_fqdn",
    "entity_uuid",
    "id",
    "message_version",
    "message_type",
    "node_name",
    "organization_name",
    "run_id",
    "source",
    "start_time"
  ],
  "title": "RunStartSchema",
  "type": "object"
}

Run End Schema

The Run End Schema will be used by Chef Client to notify the data collection server at the completion of the Chef Client's converge phase, and to report data on the Chef Client run, including resources changed and any errors encountered.

json

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "description": "Data Collector - Runs run_converge schema",
    "properties": {
        "chef_server_fqdn": {
            "description": "It is the FQDN of the chef_server against which current reporting instance runs",
            "type": "string"
        },
        "end_time": {
            "description": "It is the ISO timestamp of when the run ended",
            "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$",
            "type": "string"
        },
        "entity_uuid": {
          "description": "Unique ID identifying this node, which should persist across Chef Client/Solo runs",
          "type": "string",
          "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$"
        },
        "error": {
            "description": "It has the details of the error in the run if any",
            "type": "object"
        },
        "expanded_run_list": {
            "description": "The expanded run list object from the node",
            "type": "object"
        },
        "id": {
            "description": "It is the internal message id for the run",
            "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$",
            "type": "string"
        },
        "message_type": {
            "description": "It defines the type of message being sent",
            "type": "string",
            "enum": ["run_converge"]
        },
        "message_version": {
            "description": "Message Version",
            "type": "string",
            "enum": [
                "1.1.0"
            ]
        },
        "node": {
            "description": "The node object after the converge completed",
            "type": "object"
        },
        "node_name": {
            "description": "Node Name",
            "type": "string",
            "format": "node-name"
        },
        "organization_name": {
            "description": "Organization Name",
            "type": "string"
        },
        "resources": {
            "description": "This is the list of all resources for the run",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "after": {
                        "description": "Final State of the resource",
                        "type": "object"
                    },
                    "before": {
                        "description": "Initial State of the resource",
                        "type": "object"
                    },
                    "cookbook_name": {
                        "description": "Name of the cookbook that initiated the change",
                        "type": "string"
                    },
                    "cookbook_version": {
                        "description": "Version of the cookbook that initiated the change",
                        "type": "string",
                        "pattern": "^[0-9]*\\.[0-9]*(\\.[0-9]*)?$"
                    },
                    "delta": {
                        "description": "Difference between initial and final value of resource",
                        "type": "string"
                    },
                    "duration": {
                        "description": "Duration of the run consumed by processing of this resource, in milliseconds",
                        "type": "string"
                    },
                    "id": {
                        "description": "Resource ID",
                        "type": "string"
                    },
                    "ignore_failure": {
                        "description": "the ignore_failure setting on a resource, indicating if a failure on this resource should be ignored",
                        "type": "boolean"
                    },
                    "name": {
                        "description": "Resource Name",
                        "type": "string"
                    },
                    "result": {
                        "description": "The action taken on the resource",
                        "type": "string"
                    },
                    "status": {
                        "description": "Status indicating how Chef processed the resource",
                        "type": "string",
                        "enum": [
                          "failed",
                          "skipped",
                          "unprocessed",
                          "up-to-date",
                          "updated"
                        ]
                    },
                    "type": {
                        "description": "Resource Type",
                        "type": "string"
                    }
                },
                "required": [
                    "after",
                    "before",
                    "delta",
                    "duration",
                    "id",
                    "ignore_failure",
                    "name",
                    "result",
                    "status",
                    "type"
                ]
            }
        },
        "run_id": {
            "description": "It is the runid for the run",
            "pattern": "^[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}$",
            "type": "string"
        },
        "run_list": {
            "description": "It is the runlist for the run",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "source": {
            "description": "The tool / client mode that initiated the action. Note that 'chef_solo' includes Chef Solo Mode and Chef Solo Legacy Mode.",
            "type": "string",
            "enum": ["chef_solo", "chef_client"]
        },
        "start_time": {
            "description": "It is the ISO timestamp of when the run started",
            "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$",
            "type": "string"
        },
        "status": {
            "description": "It gives the status of the run",
            "type": "string",
            "enum": [
                "success",
                "failure"
            ]
        },
        "total_resource_count": {
            "description": "It is the total number of resources for the run",
            "type": "integer",
            "minimum": 0
        },
        "updated_resource_count": {
            "description": "It is the number of updated resources during the course of the run",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": [
        "chef_server_fqdn",
        "entity_uuid",
        "id",
        "end_time",
        "expanded_run_list",
        "message_type",
        "message_version",
        "node",
        "node_name",
        "organization_name",
        "resources",
        "run_id",
        "run_list",
        "source",
        "start_time",
        "status",
        "total_resource_count",
        "updated_resource_count"
    ],
    "title": "RunEndSchema",
    "type": "object"
}

Technical Implementation

The remainder of document will focus entirely on the nuts and bolts of the Data Collector.

Action Collection Integration

Most of the work is done by a separate Action Collection to track the actions of Chef resources. If the Data Collector is not enabled, it never registers with the Action Collection and no work will be done by the Action Collection to track resources.

Additional Collected Information

The Data Collector also collects:

the expanded run list
deprecations
the node
formatted error output for exceptions

Most of this is done through hooking events directly in the Data Collector itself. The ErrorHandlers module is broken out into a module, which is directly mixed into the Data Collector, to separate that concern out into a different file. This ErrorHandlers module is straightforward with fairly little state, but involves a lot of hooked methods.

Basic Configuration Modes

Configured for Automate

Do nothing. The URL is constructed from the base Chef::Config[:chef_server_url], auth is just Chef Server API authentication, and the default behavior is that it is configured.

Configured to Log to a File

Setup a file output location, no token is necessary:

Chef::Config[:data_collector][:output_locations] = { files:  [ "/Users/lamont/data_collector.out" ] }

Note the fact that you can't assign to Chef::Config[:data_collector][:output_locations][:files] and will NoMethodError if you try.

Configured to Log to a Non-Chef Server Endpoint

Setup a server url, requiring a token:

Chef::Config[:data_collector][:server_url] = "https://chef.acme.local/myendpoint.html"
Chef::Config[:data_collector][:token] = "mytoken"

This works for chef-clients, which are configured to hit a Chef server, but use a custom non-Chef-Automate endpoint for reporting, or for chef-solo/zero users.

XXX: There is also the Chef::Config[:data_collector][:output_locations] = { urls: [ "https://chef.acme.local/myendpoint.html" ] } method, which will behave differently, particularly on non-chef-solo use cases. In that case, the Data Collector server_url will still be automatically derived from the chef_server_url and the Data Collector will attempt to contact that endpoint. But with the token being supplied, the Data Collector will use that token and will not use Chef Server authentication. Thus, the server should 403 back. Also, if raise_on_failure is left to the default of false, then the Data Collector will simply drop that failure and continue without raising, which will appear to work, and output will be send to the configured output_locations.

Note that the presence of a token flips all external URIs to using the token. So it is not possible to use this feature to talk to both a Chef Automate endpoint and a custom URI reporting endpoint. This would seem to be the most useful of an incredibly marginally useful feature and it does not work. But given how hopelessly complicated this is, the recommendation is to use the server_url and to avoid using any url options in the output_locations since that feature is fairly poorly designed at this point in time.

Resiliency to Failures

The Data Collector in Chef >= 15.0 is resilient to failures that occur anywhere in the main loop of the Chef::Client#run method. In order to do this, there is a lot of defensive coding around internal data structures that may be nil. (e.g. failures before the node is loaded will result in the node being nil.) The spec tests for the Data Collector now run through a large sequence of events -- which must, unfortunately, be manually kept in sync with the events in the Chef::Client if those events are ever 'moved' around -- which should catch any issues in the Data Collector with early failures. The specs should also serve as documentation for what the messages will look like under different failure conditions. The goal was to keep the format of the messages to look as near as possible to the same schema as possible, even in the presence of failures, but some data structures will be entirely empty.

When the Data Collector fails extraordinarily early, it still sends both a start and an end message. This will happen if it fails so early that it would not normally have sent a start message.

Decision to Be Enabled

This is complicated due to over-design and is encapsulated in the #should_be_enabled? method and the ConfigValidation module. The #should_be_enabled? message and ConfigValidation should probably be merged into one renamed Config module to isolate the concern of processing the Chef::Config options and of doing the correct thing.

Run Start and Run End Message modules

These are separated out into their own modules, which are very deliberately not mixed into the main Data Collector. They use the Data Collector and Action Collection public interfaces. They are stateless themselves. This keeps the collaboration between them and the Data Collector very easy to understand. The start message is relatively simple and straightforwards. The complication of the end message is mostly due to walking through the Action Collection and all the collected action records from the entire run, along with a lot of defensive programming to deal with early errors.

Relevant Event Sequence

As it happens in the actual chef-client run:

events.register(data_collector)
events.register(action_collection)
run_status.run_id = request_id
events.run_start(Chef::VERSION, run_status)

failures during registration will cause registration_failed(node_name, exception, config) here and skip to #13
failures during node loading will cause node_load_failed(node_name, exception, config) here and skip to #13

events.node_load_success(node)
run_status.node = node

failures during run list expansion will cause run_list_expand_failed(node, exception) here and skip to #13

events.run_list_expanded(expansion)
run_status.start_clock
events.run_started(run_status)

failures during cookbook resolution will cause events.cookbook_resolution_failed(node, exception) here and skip to #13
failures during cookbook synch will cause events.cookbook_sync_failed(node, exception) and skip to #13

events.cookbook_compilation_start(run_context)
< the resource events happen here, which hit the Action Collection, may throw any of the other failure events >
events.converge_complete or events.converge_failed(exception)
run_status.stop_clock
run_status.exception = exception if it failed
events.run_completed(node, run_status) or events.run_failed(exception, run_status)