Entry Point JSON Graph format

docs/code/GraphRunner.md

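The entry point graph in ML.NET is an array of nodes. In the JSON example at the end of this document, that array is the value of a top-level nodes field. More information about the definition of entry points, and about the classes that help construct entry point graphs, can be found in the EntryPoint.md document.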

Each node is an object with the following fields:

  • name: string. Required. Name of the entry point.
  • inputs: object. Optional. Specifies non-default inputs to the entry point. Note that if the entry point has required inputs (which is very common), the inputs field is required.
  • outputs: object. Optional. Specifies the variables that will hold the node's outputs.
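
As an illustration of the three fields above, here is a minimal Python sketch that checks the top-level shape of a node. The function name check_node and the wording of its messages are hypothetical; this is not ML.NET's own validation code.

```python
def check_node(node):
    """Return a list of problems with a node's top-level fields (empty if none)."""
    if not isinstance(node, dict):
        return ["node must be a JSON object"]
    problems = []
    # name: string, required.
    if not isinstance(node.get("name"), str):
        problems.append("'name' is required and must be a string")
    # inputs / outputs: object, optional.
    for field in ("inputs", "outputs"):
        if field in node and not isinstance(node[field], dict):
            problems.append(f"'{field}' must be an object when present")
    return problems
```

Whether inputs is actually required depends on the entry point's manifest, so this check only covers the shape of the node, not its contents.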

Input and output types

The following types are supported in JSON graphs:

  • string. Represented as a JSON string, maps to a C# string.
  • float. Represented as a JSON number, maps to a C# float or double.
  • bool. Represented as a JSON boolean (true or false), maps to a C# bool.
  • enum. Represented as a JSON string, maps to a C# enum. The allowed values are those of the C# enum (they are also listed in the manifest).
  • int. Represented as a JSON number without a fractional part, maps to a C# int or long.
  • array of the above. Represented as a JSON array, maps to a C# array.
  • dictionary. Represented as a JSON object, maps to a C# Dictionary<string,T>. Currently not implemented.
  • component. Represented as a JSON object with 2 fields: name:string and settings:object.
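
To make the mapping concrete, the following Python sketch classifies parsed JSON values according to the list above. It is a heuristic for illustration only: the function json_kind and the sample values (including the component name SomeComponent) are hypothetical, and the real type of an input is determined by the entry point manifest, not by the JSON value alone.

```python
import json


def json_kind(value):
    """Classify a parsed JSON value using the type list above (heuristic)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        # JSON alone cannot tell a string from an enum; the manifest decides.
        return "string or enum"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        # Assumed component shape: exactly the two fields name and settings.
        if (set(value) == {"name", "settings"}
                and isinstance(value["name"], str)
                and isinstance(value["settings"], dict)):
            return "component"
        return "dictionary"
    raise ValueError(f"unsupported JSON value: {value!r}")


# Hypothetical input values, for illustration only.
sample = json.loads('{"a": 2, "b": 0.5, "c": true, "d": ["x", "y"], '
                    '"e": {"name": "SomeComponent", "settings": {}}}')
kinds = {key: json_kind(val) for key, val in sample.items()}
```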

Variables

The following input/output types cannot be represented as a JSON value:

  • IDataView
  • IFileHandle
  • ITransformModel
  • IPredictorModel

These must be passed as variables. A variable is represented as a JSON string that begins with $, for example "$data1". Note the following rules:

  • A variable can appear in the outputs only once per graph. That is, the variable can be 'assigned' only once.
  • If the variable is present in inputs of one node and in the outputs of another node, this signifies a graph 'edge'. The same variable can participate in many edges.
  • If the variable is present only in inputs, but never in outputs, it is a graph input. All graph inputs must be provided before a graph can be run.
  • A variable has a type, which is the type of the inputs (and, if it is assigned, the output) in which it appears. If the type of the variable is ambiguous, ML.NET throws an exception.
  • Circular references. The experiment graph is expected to be a DAG. If a circular dependency is detected, ML.NET throws an exception. Currently, this is done lazily: if a node can never run because it is still waiting for inputs, an exception is thrown.
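
The rules above can be sketched in Python as follows. This is a simplified model, not ML.NET's implementation: the helper names are hypothetical, every JSON string starting with $ is treated as a variable, type checking is omitted, and the cycle check mimics the lazy approach by failing only when no remaining node can run.

```python
def variables_in(value):
    """Yield variable names ("$x") found in a JSON value, including nested ones."""
    if isinstance(value, str) and value.startswith("$"):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from variables_in(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from variables_in(item)


def analyze_graph(nodes):
    """Return (graph_inputs, run_order); raise ValueError if a rule is broken."""
    # Rule 1: a variable may appear in outputs only once per graph.
    producers = {}
    for index, node in enumerate(nodes):
        for var in variables_in(node.get("outputs", {})):
            if var in producers:
                raise ValueError(f"{var} is assigned more than once")
            producers[var] = index

    # Rule 3: variables that are consumed but never produced are graph inputs.
    consumed = set()
    for node in nodes:
        consumed.update(variables_in(node.get("inputs", {})))
    graph_inputs = consumed - set(producers)

    # Rule 5 (lazy): repeatedly run every node whose inputs are available.
    available = set(graph_inputs)
    pending = list(range(len(nodes)))
    order = []
    while pending:
        ready = [i for i in pending
                 if all(v in available
                        for v in variables_in(nodes[i].get("inputs", {})))]
        if not ready:
            raise ValueError("circular dependency: no remaining node can run")
        for i in ready:
            available.update(variables_in(nodes[i].get("outputs", {})))
            order.append(i)
        pending = [i for i in pending if i not in ready]
    return graph_inputs, order
```

For a two-node graph in which the second node produces the variable that the first node consumes, analyze_graph reports the single unassigned variable as the graph input and orders the producer before the consumer.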

Variables for arrays and dictionaries

Variables can also be defined for arrays and dictionaries, as long as the item type is a valid variable type (one of the four types listed above). They are treated the same way as regular 'scalar' variables.

If we want to reference an item of the collection, we can use the [] syntax:

  • $var[5] denotes the element at index 5 of an array variable.
  • $var[foo] and $var['foo'] both denote the element with key 'foo' of a dictionary variable. This is not yet implemented.

Conversely, if we want to build a collection (array or dictionary) of variables, we can do it using JSON arrays and objects:

  • ["$v1", "$v2", "$v3"] denotes an array containing 3 variables.
  • {"foo": "$v1", "bar": "$v2"} denotes a dictionary containing 2 key-value pairs. This is also not yet implemented.
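
The reference syntax described above can be recognized with a small parser. The following Python sketch is an assumption-laden illustration: the function parse_variable is hypothetical, and the set of characters allowed in a variable name (letters, digits, underscore) is a guess, since the text does not specify it. It only parses the syntax; it says nothing about which forms ML.NET actually supports at run time.

```python
import re

# $name, optionally followed by [digits], ['key'] or [key].
# The allowed name characters are an assumption, not taken from ML.NET.
_VAR = re.compile(r"""
    ^\$(?P<name>[A-Za-z_][A-Za-z0-9_]*)
    (?:\[(?:(?P<index>\d+)|'(?P<quoted>[^']*)'|(?P<bare>[^\[\]']+))\])?$
""", re.VERBOSE)


def parse_variable(text):
    """Split a variable reference into (name, selector).

    selector is None for a plain variable, an int for an array index,
    and a str for a dictionary key.
    """
    match = _VAR.match(text)
    if match is None:
        raise ValueError(f"not a variable reference: {text!r}")
    name = match["name"]
    if match["index"] is not None:
        return name, int(match["index"])
    key = match["quoted"] if match["quoted"] is not None else match["bare"]
    return name, key
```

With this sketch, "$var[foo]" and "$var['foo']" parse to the same result, matching the statement that both denote the element with key 'foo'.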

Example of a JSON entry point manifest object, and the respective entry point graph node

Let's consider the following manifest snippet, describing an entry point 'CVSplit.Split':

```javascript
{
  "name": "CVSplit.Split",
  "desc": "Split the dataset into the specified number of cross-validation folds (train and test sets)",
  "inputs": [
    {
      "name": "Data",
      "type": "DataView",
      "desc": "Input dataset",
      "required": true
    },
    {
      "name": "NumFolds",
      "type": "Int",
      "desc": "Number of folds to split into",
      "required": false,
      "default": 2
    },
    {
      "name": "StratificationColumn",
      "type": "String",
      "desc": "Stratification column",
      "aliases": [
        "strat"
      ],
      "required": false,
      "default": null
    }
  ],
  "outputs": [
    {
      "name": "TrainData",
      "type": {
        "kind": "Array",
        "itemType": "DataView"
      },
      "desc": "Training data (one dataset per fold)"
    },
    {
      "name": "TestData",
      "type": {
        "kind": "Array",
        "itemType": "DataView"
      },
      "desc": "Testing data (one dataset per fold)"
    }
  ]
}
```

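As we can see, the entry point has 3 inputs (one of them required) and 2 outputs. The following is a correct graph containing a call to this entry point. Only the required input Data is supplied, so NumFolds and StratificationColumn take their default values, and only the TrainData output is bound to a variable: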

```javascript
{
  "nodes": [
    {
      "name": "CVSplit.Split",
      "inputs": {
        "Data": "$data1"
      },
      "outputs": {
        "TrainData": "$cv"
      }
    }
  ]
}
```
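
To show why this node is consistent with the manifest, here is a hedged Python sketch that compares a node with an abbreviated copy of the manifest entry (descriptions and output types omitted). The function check_against_manifest is hypothetical and deliberately simple: it only checks field names and required inputs, and it does not consider aliases such as strat or the types of the values.

```python
import json

# Abbreviated copy of the CVSplit.Split manifest entry shown above.
MANIFEST = json.loads("""
{
  "name": "CVSplit.Split",
  "inputs": [
    {"name": "Data", "type": "DataView", "required": true},
    {"name": "NumFolds", "type": "Int", "required": false, "default": 2},
    {"name": "StratificationColumn", "type": "String", "required": false, "default": null}
  ],
  "outputs": [
    {"name": "TrainData"},
    {"name": "TestData"}
  ]
}
""")


def check_against_manifest(node, manifest):
    """Return a list of mismatches between a graph node and a manifest entry."""
    problems = []
    if node.get("name") != manifest["name"]:
        problems.append("node name does not match the manifest entry")
    declared_inputs = {param["name"]: param for param in manifest["inputs"]}
    supplied = node.get("inputs", {})
    # Every required input must be supplied; optional ones fall back to defaults.
    for name, param in declared_inputs.items():
        if param.get("required") and name not in supplied:
            problems.append(f"missing required input {name}")
    for name in supplied:
        if name not in declared_inputs:
            problems.append(f"unknown input {name}")
    declared_outputs = {param["name"] for param in manifest["outputs"]}
    # Outputs are optional, but any that are bound must exist in the manifest.
    for name in node.get("outputs", {}):
        if name not in declared_outputs:
            problems.append(f"unknown output {name}")
    return problems


node = {
    "name": "CVSplit.Split",
    "inputs": {"Data": "$data1"},
    "outputs": {"TrainData": "$cv"},
}
problems = check_against_manifest(node, MANIFEST)
```

For the node from the example graph, the list of problems is empty. Removing the inputs field would produce a single "missing required input Data" entry, which reflects the earlier note that inputs is required whenever the entry point has required inputs.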