MediaSequence keys and functions reference

third_party/mediapipe/src/mediapipe/util/sequence/README.md


This document first gives an overview of using MediaSequence for machine learning tasks. It then describes the function prototypes MediaSequence provides for storing multimedia data in SequenceExamples. Finally, it lists the keys used for storing specific types of data.

Overview of MediaSequence for machine learning

The goal of MediaSequence is to provide a tool for transforming annotations of multimedia into input examples ready for use with machine learning models in TensorFlow. The most semantically appropriate data type for this task that can be easily parsed in TensorFlow is the SequenceExample (tensorflow.train.SequenceExample in Python, tensorflow::SequenceExample in C++). Using SequenceExamples enables quick integration of new features into TensorFlow pipelines, easy open sourcing of models and data, reasonable debugging, and efficient TensorFlow decoding.

For many machine learning tasks, TensorFlow Examples can fill that role. However, Examples become unwieldy for sequence data, particularly when the number of features per timestep varies, creating a ragged structure. Video object detection is one task that requires this ragged structure, because the number of detections per frame varies. SequenceExamples can easily encode this ragged structure. Sequences also naturally match the semantics of video as a sequence of frames, and of other common media patterns. The video feature lists are stored in order with strictly increasing timestamps, so the data is unambiguously ordered. The interpretable semantics simplify debugging and decoding of potentially complicated data.

One potential disadvantage of SequenceExamples is that keys and formats can vary widely. The MediaSequence library provides tools for manipulating and decoding SequenceExamples in Python and C++ in a consistent format. The consistent format enables creating a pipeline for processing data sets. A goal of MediaSequence as a pipeline is that users should only need to specify the metadata (e.g. videos and labels) for their task. The pipeline will turn the metadata into training data.

The pipeline has two stages. First, users must generate the metadata describing the data and applicable labels. This process is straightforward and described in the next section. Second, users run MediaPipe graphs with the UnpackMediaSequenceCalculator and PackMediaSequenceCalculator to extract the relevant data from multimedia files. A sequence of graphs can be chained together in this second stage to achieve complex processing, such as first extracting a subset of frames from a video and then extracting deep features or object detections for each extracted frame. Because MediaPipe is built to process media files simply and reproducibly, the two-stage approach separates and simplifies data management.

Creating metadata for a new data set

Generating examples for a new data set typically only requires defining the metadata. MediaPipe graphs can interpret this metadata to fill out the SequenceExamples using the UnpackMediaSequenceCalculator and PackMediaSequenceCalculator. This section lists the metadata required for different types of tasks and provides a limited description of the data filled in by MediaPipe. The input media will be referred to as video because that is a common case, but audio files or other sequences could be supported. The function calls in the Python API will be used in examples, and the equivalent C++ calls are described below.

The video metadata is a way to access the video, using set_clip_data_path to define the path on disk, and the time span to include using set_clip_start_timestamp and set_clip_end_timestamp. The data path can be absolute or can be relative to a root directory passed to the UnpackMediaSequenceCalculator. The start and end timestamps should be valid MediaPipe timestamps in microseconds. Given this information, the pipeline can extract the portion of the media between the start and end timestamps. If you do not specify a start time, the video is decoded from the beginning. If you do not specify an end time, the entire video is decoded. The start and end times are not filled if left empty.
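As a rough illustration of the microsecond convention above, the following sketch converts start and end times given in seconds into integer microsecond timestamps. The helper names (seconds_to_microseconds, make_clip_span) are hypothetical and not part of MediaSequence, and the check that the end falls after the start is an assumption of this sketch rather than documented behavior. A value of None stands for a field that is left unset.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def seconds_to_microseconds(seconds):
    """Convert a time in seconds to an integer microsecond timestamp."""
    return int(round(seconds * MICROSECONDS_PER_SECOND))


def make_clip_span(start_seconds=None, end_seconds=None):
    """Return (start_us, end_us); None means "leave this field unset"."""
    start_us = None if start_seconds is None else seconds_to_microseconds(start_seconds)
    end_us = None if end_seconds is None else seconds_to_microseconds(end_seconds)
    # Assumption of this sketch: a clip must end after it starts.
    if start_us is not None and end_us is not None and end_us <= start_us:
        raise ValueError("clip end must be after clip start")
    return start_us, end_us


start_us, end_us = make_clip_span(1.0, 6.0)
print(start_us, end_us)  # 1000000 6000000
```

The resulting integers could then be passed to set_clip_start_timestamp and set_clip_end_timestamp; for a None value the corresponding setter would simply not be called.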

The features extracted from the video depend on the MediaPipe graph that is run. The documentation of keys below and in PackMediaSequenceCalculator provides the best description.

The annotations including labels should be added as metadata. They will be passed through the MediaPipe pipeline unchanged. The label format will vary depending on the task you want to do. Several examples are included below. In general, the MediaPipe processing is independent of any labels that you provide: only the clip data path, start time, and end time matter.

Clip classification

For clip classification (e.g. "is this video clip about basketball?"), use set_clip_label_index with the integer index of the correct class and set_clip_label_string with the human readable version of the correct class. The index is often used when training the model, and the string is used for human readable debugging. The same number of indices and strings must be provided. The association between the two is just their relative positions in the list.

Example lines creating metadata for clip classification
python
# Python: functions from media_sequence.py as ms
sequence = tf.train.SequenceExample()
ms.set_clip_data_path(b"path_to_video", sequence)
ms.set_clip_start_timestamp(1000000, sequence)
ms.set_clip_end_timestamp(6000000, sequence)
ms.set_clip_label_index((4, 3), sequence)
ms.set_clip_label_string((b"run", b"jump"), sequence)
c++
// C++: functions from media_sequence.h
tensorflow::SequenceExample sequence;
SetClipDataPath("path_to_video", &sequence);
SetClipStartTimestamp(1000000, &sequence);
SetClipEndTimestamp(6000000, &sequence);
SetClipLabelIndex({4, 3}, &sequence);
SetClipLabelString({"run", "jump"}, &sequence);
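Because label indices and label strings are associated only by position, it can help to check the two lists before calling the setters. The following is a minimal sketch of such a check; paired_clip_labels is a hypothetical helper, not a MediaSequence function.

```python
def paired_clip_labels(label_indices, label_strings):
    """Pair indices and strings by position, rejecting mismatched lengths."""
    if len(label_indices) != len(label_strings):
        raise ValueError(
            "got %d label indices but %d label strings"
            % (len(label_indices), len(label_strings)))
    return list(zip(label_indices, label_strings))


pairs = paired_clip_labels((4, 3), (b"run", b"jump"))
print(pairs)  # [(4, b'run'), (3, b'jump')]
```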

Temporal detection

For temporal event detection or localization (e.g. classifying regions in time where people are playing a sport), the labels are referred to as segments. Set the segment timespans with set_segment_start_timestamp and set_segment_end_timestamp, and the labels with set_segment_label_index and set_segment_label_string. All of these are repeated fields, so you can provide multiple segments for each clip. The label index and string have the same meaning as for clip classification. You only need to provide the start and end timestamps, not the frame indices. (The pipeline will automatically set the segment start index to the index of the image frame under the image/timestamp key that is closest in time, and similarly for the segment end index. Allowing the pipeline to fill in the indices corrects for frame rate changes automatically.) The same number of values must be present in each field. If the same segment has multiple labels, the segment start and end times must be duplicated.

Example lines creating metadata for temporal detection
python
# Python: functions from media_sequence.py as ms
sequence = tf.train.SequenceExample()
ms.set_clip_data_path(b"path_to_video", sequence)
ms.set_clip_start_timestamp(1000000, sequence)
ms.set_clip_end_timestamp(6000000, sequence)

ms.set_segment_start_timestamp((2000000, 4000000), sequence)
ms.set_segment_end_timestamp((3500000, 6000000), sequence)
ms.set_segment_label_index((4, 3), sequence)
ms.set_segment_label_string((b"run", b"jump"), sequence)
c++
// C++: functions from media_sequence.h
tensorflow::SequenceExample sequence;
SetClipDataPath("path_to_video", &sequence);
SetClipStartTimestamp(1000000, &sequence);
SetClipEndTimestamp(6000000, &sequence);

SetSegmentStartTimestamp({2000000, 4000000}, &sequence);
SetSegmentEndTimestamp({3500000, 6000000}, &sequence);
SetSegmentLabelIndex({4, 3}, &sequence);
SetSegmentLabelString({"run", "jump"}, &sequence);
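Two points from the description of segments can be sketched in plain Python: duplicating a segment's timespan once per label, and mapping a timestamp to the index of the closest stored image frame. Both helpers below (flatten_segments, closest_frame_index) are hypothetical illustrations. The index rule actually applied by the pipeline lives in media_sequence.cc and may differ from this closest-in-time sketch, and the tie-breaking choice here is arbitrary.

```python
import bisect


def closest_frame_index(image_timestamps, timestamp):
    """Index of the entry in sorted `image_timestamps` closest to `timestamp`.

    Ties resolve to the earlier frame (an arbitrary choice in this sketch).
    """
    if not image_timestamps:
        raise ValueError("no image timestamps")
    pos = bisect.bisect_left(image_timestamps, timestamp)
    if pos == 0:
        return 0
    if pos == len(image_timestamps):
        return len(image_timestamps) - 1
    before, after = image_timestamps[pos - 1], image_timestamps[pos]
    return pos - 1 if timestamp - before <= after - timestamp else pos


def flatten_segments(segments):
    """Expand (start, end, [labels]) tuples into parallel lists.

    A segment with several labels is repeated once per label.
    """
    starts, ends, labels = [], [], []
    for start, end, segment_labels in segments:
        for label in segment_labels:
            starts.append(start)
            ends.append(end)
            labels.append(label)
    return starts, ends, labels


# Frames stored every 0.5 s, starting at the clip start of 1 s.
image_timestamps = [1000000, 1500000, 2000000, 2500000, 3000000, 3500000]
starts, ends, labels = flatten_segments([(2000000, 3500000, [b"run", b"jump"])])
start_indices = [closest_frame_index(image_timestamps, t) for t in starts]
end_indices = [closest_frame_index(image_timestamps, t) for t in ends]
print(starts, ends, labels)
print(start_indices, end_indices)  # [2, 2] [5, 5]
```

Note that the indices are positions within the stored frames of the clip, while the timestamps remain absolute times within the video.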

Tracking and spatiotemporal detection

For object tracking or detection in videos, e.g. classify regions in time and space, the labels are typically bounding boxes. Unlike previous tasks, the annotations are provided as a FeatureList instead of in a context Feature because they occur in multiple frames. Set up a detection task with add_bbox, add_bbox_timestamp, add_bbox_label_string, and add_bbox_label_index. Only add metadata for annotated frames. The pipeline will add empty features to each feature list to align the box annotations with the nearest image frame. add_bbox_is_annotated distinguishes between annotated frames and frames added as padding. 1 is added if the frame was annotated and 0 otherwise. It is automatically maintained in PackMediaSequenceCalculator. Other fields can be used for tracking tasks: add_bbox_track_string identifies instances over time and add_bbox_class_string can be concatenated to the track string if track ids are not already unique. If track ids are unique across classes, you do not need to fill out the class information.

Example lines creating metadata for spatiotemporal detection or tracking
python
# Python: functions from media_sequence.py as ms
import numpy as np
import tensorflow as tf

sequence = tf.train.SequenceExample()
ms.set_clip_data_path(b"path_to_video", sequence)
ms.set_clip_start_timestamp(1000000, sequence)
ms.set_clip_end_timestamp(6000000, sequence)

# For an object tracking task with action labels:
# Each row is one box, ordered as in the region/bbox/* key
# (ymin, xmin, ymax, xmax), in normalized coordinates.
locations_on_frame_1 = np.array([[0.1, 0.2, 0.3, 0.4],
                                 [0.2, 0.3, 0.4, 0.5]])
ms.add_bbox(locations_on_frame_1, sequence)
ms.add_bbox_timestamp(3000000, sequence)
ms.add_bbox_label_index((4, 3), sequence)
ms.add_bbox_label_string((b"run", b"jump"), sequence)
ms.add_bbox_track_string((b"id_0", b"id_1"), sequence)
# ms.add_bbox_class_string((b"cls_0", b"cls_0"), sequence)  # if required
# Slice with [:1] so the single remaining box stays a 2-D array.
locations_on_frame_2 = locations_on_frame_1[:1]
ms.add_bbox(locations_on_frame_2, sequence)
ms.add_bbox_timestamp(5000000, sequence)
ms.add_bbox_label_index((3,), sequence)
ms.add_bbox_label_string((b"jump",), sequence)
ms.add_bbox_track_string((b"id_0",), sequence)
# ms.add_bbox_class_string((b"cls_0",), sequence)  # if required
c++
// C++: functions from media_sequence.h
tensorflow::SequenceExample sequence;
SetClipDataPath("path_to_video", &sequence);
SetClipStartTimestamp(1000000, &sequence);
SetClipEndTimestamp(6000000, &sequence);

// For an object tracking task with action labels:
std::vector<mediapipe::Location> locations_on_frame_1;
AddBBox(locations_on_frame_1, &sequence);
AddBBoxTimestamp(3000000, &sequence);
AddBBoxLabelIndex({4, 3}, &sequence);
AddBBoxLabelString({"run", "jump"}, &sequence);
AddBBoxTrackString({"id_0", "id_1"}, &sequence);
// AddBBoxClassString({"cls_0", "cls_0"}, &sequence); // if required
std::vector<mediapipe::Location> locations_on_frame_2;
AddBBox(locations_on_frame_2, &sequence);
AddBBoxTimestamp(5000000, &sequence);
AddBBoxLabelIndex({3}, &sequence);
AddBBoxLabelString({"jump"}, &sequence);
AddBBoxTrackString({"id_0"}, &sequence);
// AddBBoxClassString({"cls_0"}, &sequence); // if required
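The key reference lists region/bbox/* as a special accessor that operates on ymin, xmin, ymax, and xmax with a single call. The sketch below shows, under that column-order assumption, how an array of boxes could be split into the four per-coordinate lists stored under the separate keys. split_bbox_columns is a hypothetical helper written for illustration; it is not the library's implementation.

```python
def split_bbox_columns(boxes):
    """Split [[ymin, xmin, ymax, xmax], ...] into per-coordinate lists."""
    for box in boxes:
        if len(box) != 4:
            raise ValueError("each box needs exactly 4 values")
    names = ("ymin", "xmin", "ymax", "xmax")
    # Column i of every box goes into the list for names[i].
    return {name: [box[i] for box in boxes] for i, name in enumerate(names)}


boxes = [[0.1, 0.2, 0.3, 0.4],
         [0.2, 0.3, 0.4, 0.5]]
columns = split_bbox_columns(boxes)
print(columns["ymin"])  # [0.1, 0.2]
print(columns["xmax"])  # [0.4, 0.5]
```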

Running a MediaSequence through MediaPipe

UnpackMediaSequenceCalculator and PackMediaSequenceCalculator

MediaSequence utilizes MediaPipe for processing by providing two special calculators. The UnpackMediaSequenceCalculator is used to extract data from SequenceExamples. This will often be the metadata, such as the path to the video file, and the clip start and end times. However, after storing images in a SequenceExample, the images themselves can also be unpacked for further processing, such as computing optical flow. Whatever data is extracted during processing is added to the SequenceExample by the PackMediaSequenceCalculator. The values that are unpacked and packed into these calculators are determined by the tags on the streams in the MediaPipe calculator graph. (Tags are required to be all capitals and underscores. To encode prefixes for feature keys as tags, prefixes for feature keys should follow the same convention.) The documentation for these two calculators describes the variety of data they support. The timestamps of each feature list being unpacked must be in strictly increasing order. Any other MediaPipe processing can be used between these calculators to extract features.
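The requirement that timestamps of an unpacked feature list be strictly increasing can be checked ahead of time. A minimal sketch, using a hypothetical helper name:

```python
def is_strictly_increasing(timestamps):
    """True if every timestamp is greater than the one before it."""
    return all(a < b for a, b in zip(timestamps, timestamps[1:]))


print(is_strictly_increasing([1000000, 1500000, 2000000]))  # True
print(is_strictly_increasing([1000000, 1000000, 2000000]))  # False
```

Repeated timestamps fail the check as well as out-of-order ones, since the ordering must be strict.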

Adding data and reconciling metadata

In general, the pipeline will decode the specified media between the clip start and end timestamps and store any requested features. A common feature to request is JPEG encoded images, so it will be used as an example. Each image between the clip start and end timestamps (generally inclusive) is added to the SequenceExample's feature list with add_image_encoded, and the timestamp it arrived at is added with add_image_timestamp. At the end of the image stream, the pipeline will determine and store what metadata it can about the stream. For images, it will store the height and width of the image as well as the number of channels and encoding format. Similar storage and metadata computation is done when adding audio, float feature vectors, or encoded optical flow to the SequenceExample. The code that reconciles the metadata is in media_sequence.cc.

Automatically aligning bounding boxes to images

At the time of writing, the image/timestamp is also used to update the closest timestamp for segment/start/index and segment/end/index and bounding box data. Segment indices are relative to the start of the clip (i.e. only reference data within the SequenceExample), while timestamps are absolute times within the video. Bounding box data is aligned to the image/timestamps by inserting empty bounding box annotations and indicating this with add_bbox_is_annotated. If images are stored at a lower rate than the bounding box data, then only the nearest annotation to each frame is retained and any others are dropped. Be careful when downsampling frame rates with bounding box annotations; downsampling bounding box annotations is the only time annotations will be lost in the MediaPipe pipeline.
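The alignment behavior described above can be approximated in a few lines of plain Python. This is only a sketch of the idea, not the logic in media_sequence.cc: each annotation is assigned to its nearest image timestamp, frames without an annotation receive an empty box list with is_annotated set to 0, and when several annotations compete for one frame only the nearest is kept. The function name and the dictionary input format are assumptions made for this example.

```python
def align_boxes_to_frames(image_timestamps, annotations):
    """Align {timestamp: boxes} annotations to the given image timestamps.

    Returns (aligned_boxes, is_annotated), both with one entry per frame.
    """
    if not image_timestamps:
        return [], []
    nearest = {}  # frame index -> (distance, boxes)
    for timestamp, boxes in sorted(annotations.items()):
        index = min(range(len(image_timestamps)),
                    key=lambda i: abs(image_timestamps[i] - timestamp))
        distance = abs(image_timestamps[index] - timestamp)
        # Keep only the annotation closest to this frame; others are dropped.
        if index not in nearest or distance < nearest[index][0]:
            nearest[index] = (distance, boxes)
    aligned = [nearest[i][1] if i in nearest else []
               for i in range(len(image_timestamps))]
    is_annotated = [1 if i in nearest else 0
                    for i in range(len(image_timestamps))]
    return aligned, is_annotated


image_timestamps = [0, 1000000, 2000000]
annotations = {
    900000: [[0.1, 0.2, 0.3, 0.4]],
    1200000: [[0.2, 0.3, 0.4, 0.5]],  # farther from frame 1, so dropped
}
aligned, is_annotated = align_boxes_to_frames(image_timestamps, annotations)
print(aligned)       # [[], [[0.1, 0.2, 0.3, 0.4]], []]
print(is_annotated)  # [0, 1, 0]
```

In this example the images are stored at a lower rate than the annotations, so one of the two annotations is lost, which is the downsampling hazard the text warns about.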

Chaining processing graphs

A common use case is to derive deep features from frames in a video when those features are too expensive to compute during training, for example extracting ResNet-50 features on each frame of video. In the MediaSequence pipeline, the way to generate these features is to first extract the images to the SequenceExample in one MediaPipe graph, then create a second MediaPipe graph that unpacks the images from the SequenceExample and appends the new features to a copy of that SequenceExample. This chaining makes it easy to add features incrementally in a modular way, and it simplifies debugging because the stage that produced an anomaly is easier to identify. Once the pipeline is complete, any unnecessary features can be removed. Be aware that the number of derived feature timestamps may differ from the number of input features, e.g. optical flow can't be estimated for the last frame of a video clip, so it adds one fewer frame of data. With the exception of aligning bounding boxes, the pipeline does nothing to require consistent timestamps between features.
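Since timestamps are not kept consistent across features, a consumer may want to find which input frames lack a derived feature. The sketch below assumes, purely for illustration, that flow for a frame is stored at that frame's timestamp and that the last frame has none; both helper names are hypothetical.

```python
def flow_timestamps(image_timestamps):
    """Timestamps of frames that have a following frame to compute flow with."""
    return image_timestamps[:-1]


def frames_without_feature(image_timestamps, feature_timestamps):
    """Image timestamps that have no entry in the derived feature list."""
    present = set(feature_timestamps)
    return [t for t in image_timestamps if t not in present]


image_timestamps = [1000000, 1500000, 2000000]
flow = flow_timestamps(image_timestamps)
print(flow)                                            # [1000000, 1500000]
print(frames_without_feature(image_timestamps, flow))  # [2000000]
```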

Using prefixes

Prefixes enable storing semantically identical data without collisions. For example, it is possible to store predicted and ground truth bounding boxes by using different prefixes. We can also store bounding boxes and labels from different tasks by utilizing prefixes.

To keep the API and documentation simple, avoid prefixes unless they are necessary.

The recommended prefix format, enforced by some MediaPipe functions, is all capital letters and underscores, with numeric characters allowed after the first character, e.g. MY_FAVORITE_FEATURE_V1.
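One way to read the recommended format is as a regular expression: an uppercase letter first, followed by uppercase letters, digits, or underscores. The sketch below encodes that reading. It is an interpretation of the sentence above, not the check MediaPipe itself performs, and is_recommended_prefix is a hypothetical name.

```python
import re

# Assumed reading of the convention: capital letter first, then capitals,
# digits, or underscores.
_PREFIX_PATTERN = re.compile(r"[A-Z][A-Z0-9_]*")


def is_recommended_prefix(prefix):
    """True if `prefix` matches the assumed all-caps convention."""
    return _PREFIX_PATTERN.fullmatch(prefix) is not None


print(is_recommended_prefix("MY_FAVORITE_FEATURE_V1"))  # True
print(is_recommended_prefix("my_feature"))              # False
print(is_recommended_prefix("1ST_FEATURE"))             # False
```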

The convention for encoding ground truth labels is to use no prefix, while predicted labels are typically tagged with prefixes. For example:

  • Example ground truth keys:

    • region/label/string
    • region/label/confidence
  • Example predicted label keys:

    • PREDICT_V1/region/label/string
    • PREDICT_V1/region/label/confidence
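The example keys above suggest that a prefixed key is the prefix and the base key joined by a slash. A minimal sketch of that layout follows; prefixed_key is a hypothetical helper, and the real key construction is handled by the generated get_feature_key functions.

```python
def prefixed_key(key, prefix=""):
    """Join an optional prefix and a base key with a slash."""
    return prefix + "/" + key if prefix else key


print(prefixed_key("region/label/string"))                # region/label/string
print(prefixed_key("region/label/string", "PREDICT_V1"))  # PREDICT_V1/region/label/string
```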

Function prototypes for each data type

MediaSequence provides accessors to store common data patterns in SequenceExamples. The exact functions depend on the type of data and the key, but the patterns are similar. Each function has a name related to the key, so the functions are documented here with a generic name, Feature. Note that due to different conventions for Python and C++ code, the capitalization and parameter order vary, but the functionality should be equivalent.

Each function takes an optional prefix parameter. For some common cases, such as storing instance segmentation labels along with images, named versions with prefixes baked in are provided as documented below. Lastly, generic features and audio streams should almost always use a prefix because storing multiple features or transformed audio streams is common.

The code generating these functions resides in media_sequence.h/.cc/.py and media_sequence_util.h/.cc/.py. The media_sequence files generally define the API that should be used directly by developers. The media_sequence_util files provide the function generation code used to define new features. If you require additional features not supplied in the media_sequence files, use the functions in media_sequence_util to create more in the appropriate namespace / module_dict in your own files and import those as well.

In these prototypes, the prefix is optional, as indicated by [ ]s. The C++ types are abbreviated. The code and test cases are recommended for understanding the exact types. The purpose of these examples is to illustrate the pattern.
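To make the idea of generated accessors concrete, the sketch below builds has/get/clear/set functions for one key and registers them in a dictionary standing in for a module_dict. Everything here is a simplified assumption: a plain Python dict replaces the SequenceExample, make_context_accessors is a hypothetical name, and the real generators in media_sequence_util have their own signatures.

```python
def make_context_accessors(name, key, module_dict):
    """Register accessor functions for one context key in `module_dict`."""

    def get_key(prefix=""):
        return prefix + "/" + key if prefix else key

    def has(example, prefix=""):
        return get_key(prefix) in example["context"]

    def get(example, prefix=""):
        return example["context"][get_key(prefix)]

    def clear(example, prefix=""):
        example["context"].pop(get_key(prefix), None)

    def set_value(value, example, prefix=""):
        # Overwrites any previous value, mirroring "clears and stores".
        example["context"][get_key(prefix)] = value

    module_dict["has_" + name] = has
    module_dict["get_" + name] = get
    module_dict["clear_" + name] = clear
    module_dict["set_" + name] = set_value
    module_dict["get_" + name + "_key"] = get_key


accessors = {}
make_context_accessors("clip_data_path", "clip/data_path", accessors)

example = {"context": {}}  # stand-in for a SequenceExample
accessors["set_clip_data_path"](b"path_to_video", example)
accessors["set_clip_data_path"](b"other_video", example, prefix="ALT")
print(accessors["get_clip_data_path"](example))    # b'path_to_video'
print(accessors["get_clip_data_path_key"]("ALT"))  # ALT/clip/data_path
```

The point of the pattern is that every key gets the same family of functions, differing only in name and stored type, which is why the tables that follow use the generic name Feature.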

Singular Context Features

| python call | c++ call | description |
|---|---|---|
| has_feature(example [, prefix]) | HasFeature([const string& prefix,] const tf::SE& example) | Returns a boolean if the feature is present. |
| get_feature(example [, prefix]) | GetFeature([const string& prefix,] const tf::SE& example) | Returns a single feature of the appropriate type (string, int64, float). |
| clear_feature(example [, prefix]) | ClearFeature([const string& prefix,] tf::SE* example) | Clears the feature. |
| set_feature(value, example [, prefix]) | SetFeature([const string& prefix,] const TYPE& value, tf::SE* example) | Clears and stores the feature of the appropriate type. |
| get_feature_key([prefix]) | GetFeatureKey([const string& prefix]) | Returns the key used by related functions. |
| get_feature_default_parser() | | Returns the tf.io.FixedLenFeature for this type. (Python only.) |

List Context Features

| python call | c++ call | description |
|---|---|---|
| has_feature(example [, prefix]) | HasFeature([const string& prefix,] const tf::SE& example) | Returns a boolean if the feature is present. |
| get_feature(example [, prefix]) | GetFeature([const string& prefix,] const tf::SE& example) | Returns a sequence feature of the appropriate type (comparable to list/vector of string, int64, float). |
| clear_feature(example [, prefix]) | ClearFeature([const string& prefix,] tf::SE* example) | Clears the feature. |
| set_feature(values, example [, prefix]) | SetFeature([const string& prefix,] const vector<TYPE>& values, tf::SE* example) | Clears and stores the list of features of the appropriate type. |
| get_feature_key([prefix]) | GetFeatureKey([const string& prefix]) | Returns the key used by related functions. |
| get_feature_default_parser() | | Returns the tf.io.VarLenFeature for this type. (Python only.) |

Singular Feature Lists

| python call | c++ call | description |
|---|---|---|
| has_feature(example [, prefix]) | HasFeature([const string& prefix,] const tf::SE& example) | Returns a boolean if the feature is present. |
| get_feature_size(example [, prefix]) | GetFeatureSize([const string& prefix,] const tf::SE& example) | Returns the number of features under this key. Will be 0 if the feature is absent. |
| get_feature_at(index, example [, prefix]) | GetFeatureAt([const string& prefix,] const tf::SE& example, const int index) | Returns a single feature of the appropriate type (string, int64, float) at position index of the feature list. |
| clear_feature(example [, prefix]) | ClearFeature([const string& prefix,] tf::SE* example) | Clears the entire feature. |
| add_feature(value, example [, prefix]) | AddFeature([const string& prefix,] const TYPE& value, tf::SE* example) | Appends a feature of the appropriate type to the feature list. |
| get_feature_key([prefix]) | GetFeatureKey([const string& prefix]) | Returns the key used by related functions. |
| get_feature_default_parser() | | Returns the tf.io.FixedLenSequenceFeature for this type. (Python only.) |

List Feature Lists

| python call | c++ call | description |
|---|---|---|
| has_feature(example [, prefix]) | HasFeature([const string& prefix,] const tf::SE& example) | Returns a boolean if the feature is present. |
| get_feature_size(example [, prefix]) | GetFeatureSize([const string& prefix,] const tf::SE& example) | Returns the number of feature sequences under this key. Will be 0 if the feature is absent. |
| get_feature_at(index, example [, prefix]) | GetFeatureAt([const string& prefix,] const tf::SE& example, const int index) | Returns a repeated feature of the appropriate type (comparable to list/vector of string, int64, float) at position index of the feature list. |
| clear_feature(example [, prefix]) | ClearFeature([const string& prefix,] tf::SE* example) | Clears the entire feature. |
| add_feature(value, example [, prefix]) | AddFeature([const string& prefix,] const vector<TYPE>& value, tf::SE* example) | Appends a sequence of features of the appropriate type to the feature list. |
| get_feature_key([prefix]) | GetFeatureKey([const string& prefix]) | Returns the key used by related functions. |
| get_feature_default_parser() | | Returns the tf.io.VarLenFeature for this type. (Python only.) |

Keys

These keys are broadly useful for covering the range of multimedia-based machine learning tasks. Each key is meant to be human interpretable on its own; the descriptions provide further detail.

| key | type | python call / c++ call | description |
|---|---|---|---|
| example/id | context bytes | set_example_id / SetExampleId | A unique identifier for each example. |
| example/dataset_name | context bytes | set_example_dataset_name / SetExampleDatasetName | The name of the data set, including the version. |
| example/dataset/flag/string | context bytes list | set_example_dataset_flag_string / SetExampleDatasetFlagString | A list of bytes for dataset related attributes or flags for this example. |

| key | type | python call / c++ call | description |
|---|---|---|---|
| clip/data_path | context bytes | set_clip_data_path / SetClipDataPath | The relative path to the data on disk from some root directory. |
| clip/start/timestamp | context int | set_clip_start_timestamp / SetClipStartTimestamp | The start time, in microseconds, for the start of the clip in the media. |
| clip/end/timestamp | context int | set_clip_end_timestamp / SetClipEndTimestamp | The end time, in microseconds, for the end of the clip in the media. |
| clip/label/index | context int list | set_clip_label_index / SetClipLabelIndex | A list of label indices for this clip. |
| clip/label/string | context string list | set_clip_label_string / SetClipLabelString | A list of label strings for this clip. |
| clip/label/confidence | context float list | set_clip_label_confidence / SetClipLabelConfidence | A list of label confidences for this clip. |
| clip/media_id | context bytes | set_clip_media_id / SetClipMediaId | Any identifier for the media beyond the data path. |
| clip/alternative_media_id | context bytes | set_clip_alternative_media_id / SetClipAlternativeMediaId | Yet another alternative identifier. |
| clip/encoded_media_bytes | context bytes | set_clip_encoded_media_bytes / SetClipEncodedMediaBytes | The encoded bytes for storing media directly in the SequenceExample. |
| clip/encoded_media_start_timestamp | context int | set_clip_encoded_media_start_timestamp / SetClipEncodedMediaStartTimestamp | The start time for the encoded media if not preserved during encoding. |

| key | type | python call / c++ call | description |
|---|---|---|---|
| segment/start/timestamp | context int list | set_segment_start_timestamp / SetSegmentStartTimestamp | A list of segment start times in microseconds. |
| segment/start/index | context int list | set_segment_start_index / SetSegmentStartIndex | A list of indices marking the first frame index >= the start time. |
| segment/end/timestamp | context int list | set_segment_end_timestamp / SetSegmentEndTimestamp | A list of segment end times in microseconds. |
| segment/end/index | context int list | set_segment_end_index / SetSegmentEndIndex | A list of indices marking the last frame index <= the end time. |
| segment/label/index | context int list | set_segment_label_index / SetSegmentLabelIndex | A list with the label index for each segment. Multiple labels for the same segment are encoded as repeated segments. |
| segment/label/string | context bytes list | set_segment_label_string / SetSegmentLabelString | A list with the label string for each segment. Multiple labels for the same segment are encoded as repeated segments. |
| segment/label/confidence | context float list | set_segment_label_confidence / SetSegmentLabelConfidence | A list with the label confidence for each segment. Multiple labels for the same segment are encoded as repeated segments. |

Prefixes are used to distinguish between different semantic meanings of regions. This practice is so common that BBox versions of the function calls are provided. Each call accepts an optional prefix to avoid name collisions. "Region" is used in the keys because of the similar semantic meaning between different types of regions.

A few special accessors are provided to work with multiple keys at once.

Regions can be given identifiers for labels, tracks, and classes. Although similar information can be stored in each identifier, the intended use is different. Labels should be used when predicting a label for a region (such as the class of the bounding box or action performed by a person). Tracks should be used to uniquely identify regions over sequential frames. Classes are only intended to be used to disambiguate track ids if those ids are not unique across object labels. The recommendation is to prefer label fields for classification tasks and tracking (or class) fields for tracking information.
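When track ids repeat across classes, the class string can be combined with the track string to form a unique identifier. The sketch below shows one way to do that. The "/" separator and the helper name unique_track_ids are assumptions of this example, since the document does not specify how the two strings are concatenated.

```python
def unique_track_ids(track_strings, class_strings, separator=b"/"):
    """Combine class and track byte strings into per-region unique ids."""
    if len(track_strings) != len(class_strings):
        raise ValueError("need one class string per track string")
    return [cls + separator + track
            for cls, track in zip(class_strings, track_strings)]


# Both regions use track id "id_0", but they belong to different classes.
ids = unique_track_ids((b"id_0", b"id_0"), (b"person", b"ball"))
print(ids)  # [b'person/id_0', b'ball/id_0']
```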

keytypepython call / c++ calldescription
region/bbox/yminfeature list float listadd_bbox_ymin / AddBBoxYMinA list of normalized minimum y values of bounding boxes in a frame.
region/bbox/xminfeature list float listadd_bbox_xmin / AddBBoxXMinA list of normalized minimum x values of bounding boxes in a frame.
region/bbox/ymaxfeature list float listadd_bbox_ymax / AddBBoxYMaxA list of normalized maximum y values of bounding boxes in a frame.
region/bbox/xmaxfeature list float listadd_bbox_xmax / AddBBoxXMaxA list of normalized maximum x values of bounding boxes in a frame.
region/bbox/\*specialadd_bbox / AddBBoxOperates on ymin,xmin,ymax,xmax with a single call.
region/point/xfeature list float listadd_bbox_point_x / AddBBoxPointXA list of normalized x values for points in a frame.
region/point/yfeature list float listadd_bbox_point_y / AddBBoxPointYA list of normalized y values for points in a frame.
region/point/\*specialadd_bbox_point / AddBBoxPointOperates on point/x,point/y with a single call.
region/radiusfeature list float listadd_bbox_point_radius / AddBBoxRadiusA list of radii for points in a frame.
region/3d_point/xfeature list float listadd_bbox_3d_point_x / AddBBox3dPointXA list of normalized x values for points in a frame.
region/3d_point/yfeature list float listadd_bbox_3d_point_y / AddBBox3dPointYA list of normalized y values for points in a frame.
region/3d_point/zfeature list float listadd_bbox_3d_point_z / AddBBox3dPointZA list of normalized z values for points in a frame.
region/3d_point/\*specialadd_bbox_3d_point / AddBBox3dPointOperates on 3d_point/{x,y,z} with a single call.
region/timestampfeature list intadd_bbox_timestamp / AddBBoxTimestampThe timestamp in microseconds for the region annotations.
region/num_regionsfeature list intadd_bbox_num_regions / AddBBoxNumRegionsThe number of boxes or other regions in a frame. Should be 0 for unannotated frames.
region/is_annotatedfeature list intadd_bbox_is_annotated / AddBBoxIsAnnotated1 if this timestep is annotated. 0 otherwise. Distinguishes empty from unannotated frames.
region/is_generatedfeature list int listadd_bbox_is_generated / AddBBoxIsGeneratedFor each region, 1 if the region is procedurally generated for this frame.
region/is_occludedfeature list int listadd_bbox_is_occluded / AddBBoxIsOccludedFor each region, 1 if the region is occluded in the current frame.
region/label/indexfeature list int listadd_bbox_label_index / AddBBoxLabelIndexFor each region, lists the integer label. Multiple labels for one region require duplicating the region.
region/label/stringfeature list bytes listadd_bbox_label_string / AddBBoxLabelStringFor each region, lists the string label. Multiple labels for one region require duplicating the region.
region/label/confidencefeature list float listadd_bbox_label_confidence / AddBBoxLabelConfidenceFor each region, lists the confidence or weight for the label. Multiple labels for one region require duplicating the region.
region/track/indexfeature list int listadd_bbox_track_index / AddBBoxTrackIndexFor each region, lists the integer track id. Multiple track ids for one region require duplicating the region.
region/track/stringfeature list bytes listadd_bbox_track_string / AddBBoxTrackStringFor each region, lists the string track id. Multiple track ids for one region require duplicating the region.
region/track/confidencefeature list float listadd_bbox_track_confidence / AddBBoxTrackConfidenceFor each region, lists the confidence or weight for the track. Multiple track ids for one region require duplicating the region.
region/class/indexfeature list int listadd_bbox_class_index / AddBBoxClassIndexFor each region, lists the integer class. Multiple classes for one region require duplicating the region.
region/class/stringfeature list bytes listadd_bbox_class_string / AddBBoxClassStringFor each region, lists the string class. Multiple classes for one region require duplicating the region.
region/class/confidencefeature list float listadd_bbox_class_confidence / AddBBoxClassConfidenceFor each region, lists the confidence or weight for the class. Multiple classes for one region require duplicating the region.
region/embedding/floatfeature list float listadd_bbox_embedding_floats / AddBBoxEmbeddingFloatsFor each region, provide an embedding as sequence of floats.
region/partscontext bytes listset_bbox_parts / SetBBoxPartsThe list of region parts expected in this example.
region/embedding/ dimensions_per_regioncontext int listset_bbox_embedding_dimensions_per_region / SetBBoxEmbeddingDimensionsPerRegionProvide the dimensions for each embedding.
region/embedding/formatcontext stringset_bbox_embedding_format / SetBBoxEmbeddingFormatProvides the encoding format, if any, for region embeddings.
region/embedding/encodedfeature list bytes listadd_bbox_embedding_encoded / AddBBoxEmbeddingEncodedFor each region, provide an encoded embedding.
region/embedding/confidencefeature list float listadd_bbox_embedding_confidence / AddBBoxEmbeddingConfidenceFor each region, provide a confidence for the embedding.
region/unmodified_timestampfeature list intadd_bbox_unmodified_timestamp / AddUnmodifiedBBoxTimestampUsed to store the original timestamps if procedurally aligning timestamps to image frames.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
| `image/encoded` | feature list bytes | `add_image_encoded` / `AddImageEncoded` | The encoded image at each timestep. |
| `image/timestamp` | feature list int | `add_image_timestamp` / `AddImageTimestamp` | The timestamp in microseconds for the image. |
| `image/multi_encoded` | feature list bytes list | `add_image_multi_encoded` / `AddImageMultiEncoded` | Storing multiple images at each timestep (e.g. from multiple camera views). |
| `image/label/index` | feature list int list | `add_image_label_index` / `AddImageLabelIndex` | If an image at a specific timestamp should have a label, use this. If a range of time, prefer Segments instead. |
| `image/label/string` | feature list bytes list | `add_image_label_string` / `AddImageLabelString` | If an image at a specific timestamp should have a label, use this. If a range of time, prefer Segments instead. |
| `image/label/confidence` | feature list float list | `add_image_label_confidence` / `AddImageLabelConfidence` | If an image at a specific timestamp should have a label, use this. If a range of time, prefer Segments instead. |
| `image/format` | context bytes | `set_image_format` / `SetImageFormat` | The encoding format of the images. |
| `image/channels` | context int | `set_image_channels` / `SetImageChannels` | The number of channels in the image. |
| `image/colorspace` | context bytes | `set_image_colorspace` / `SetColorspace` | The colorspace of the images. |
| `image/height` | context int | `set_image_height` / `SetImageHeight` | The height of the image in pixels. |
| `image/width` | context int | `set_image_width` / `SetImageWidth` | The width of the image in pixels. |
| `image/frame_rate` | context float | `set_image_frame_rate` / `SetImageFrameRate` | The rate of images in frames per second. |
| `image/data_path` | context bytes | `set_image_data_path` / `SetImageDataPath` | The path of the image file if it did not come from a media clip. |
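
The image keys combine two units: `image/timestamp` is an integer number of microseconds per image, while `image/frame_rate` is given in frames per second. As a minimal sketch in plain Python (not part of MediaSequence; the helper name `frame_timestamps_us`, the `start_us` parameter, and the choice to round to the nearest microsecond are assumptions made for illustration), evenly spaced timestamps for a clip could be derived like this:

```python
def frame_timestamps_us(num_frames, frame_rate, start_us=0):
    """Return one integer microsecond timestamp per frame (hypothetical helper)."""
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    # 1 second = 1_000_000 microseconds; round so each timestamp is an int.
    return [start_us + round(i * 1_000_000 / frame_rate) for i in range(num_frames)]


timestamps = frame_timestamps_us(num_frames=4, frame_rate=25.0)

# Feature lists are stored with strictly increasing timestamps, so check that.
assert all(a < b for a, b in zip(timestamps, timestamps[1:]))
```

Each value would then be stored alongside its encoded frame using the calls listed in the table.
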

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
| `CLASS_SEGMENTATION/image/encoded` | feature list bytes | `add_class_segmentation_encoded` / `AddClassSegmentationEncoded` | The encoded image of class labels at each timestep. |
| `CLASS_SEGMENTATION/image/timestamp` | feature list int | `add_class_segmentation_timestamp` / `AddClassSegmentationTimestamp` | The timestamp in microseconds for the class labels. |
| `CLASS_SEGMENTATION/image/multi_encoded` | feature list bytes list | `add_class_segmentation_multi_encoded` / `AddClassSegmentationMultiEncoded` | Storing multiple segmentation masks in case they overlap. |
| `CLASS_SEGMENTATION/image/format` | context bytes | `set_class_segmentation_format` / `SetClassSegmentationFormat` | The encoding format of the class label images. |
| `CLASS_SEGMENTATION/image/height` | context int | `set_class_segmentation_height` / `SetClassSegmentationHeight` | The height of the image in pixels. |
| `CLASS_SEGMENTATION/image/width` | context int | `set_class_segmentation_width` / `SetClassSegmentationWidth` | The width of the image in pixels. |
| `CLASS_SEGMENTATION/image/class/label/index` | context int list | `set_class_segmentation_class_label_index` / `SetClassSegmentationClassLabelIndex` | If necessary, a mapping from values in the image to class labels. |
| `CLASS_SEGMENTATION/image/class/label/string` | context bytes list | `set_class_segmentation_class_label_string` / `SetClassSegmentationClassLabelString` | A mapping from values in the image to class labels. |
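
The two class label keys describe how pixel values in a segmentation image map to class labels. The sketch below is plain Python rather than MediaSequence code, and it assumes that the index list and the string list are parallel (entry `i` of one corresponds to entry `i` of the other); the label names and the tiny mask are invented for illustration.

```python
# Assumed to be parallel lists: label_indices[i] is the pixel value for label_strings[i].
label_indices = [0, 1, 2]
label_strings = [b"background", b"person", b"car"]

# Build a lookup from pixel value to class label.
pixel_value_to_label = dict(zip(label_indices, label_strings))

# A made-up 2x2 decoded segmentation mask of pixel values.
mask = [
    [0, 1],
    [2, 1],
]

# Replace every pixel value with its class label.
decoded_labels = [[pixel_value_to_label[value] for value in row] for row in mask]
```
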

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
| `INSTANCE_SEGMENTATION/image/encoded` | feature list bytes | `add_instance_segmentation_encoded` / `AddInstanceSegmentationEncoded` | The encoded image of object instance labels at each timestep. |
| `INSTANCE_SEGMENTATION/image/timestamp` | feature list int | `add_instance_segmentation_timestamp` / `AddInstanceSegmentationTimestamp` | The timestamp in microseconds for the object instance labels. |
| `INSTANCE_SEGMENTATION/image/multi_encoded` | feature list bytes list | `add_instance_segmentation_multi_encoded` / `AddInstanceSegmentationMultiEncoded` | Storing multiple segmentation masks in case they overlap. |
| `INSTANCE_SEGMENTATION/image/format` | context bytes | `set_instance_segmentation_format` / `SetInstanceSegmentationFormat` | The encoding format of the object instance labels. |
| `INSTANCE_SEGMENTATION/image/height` | context int | `set_instance_segmentation_height` / `SetInstanceSegmentationHeight` | The height of the image in pixels. |
| `INSTANCE_SEGMENTATION/image/width` | context int | `set_instance_segmentation_width` / `SetInstanceSegmentationWidth` | The width of the image in pixels. |
| `INSTANCE_SEGMENTATION/image/class/label/index` | context int list | `set_instance_segmentation_class_label_index` / `SetInstanceSegmentationClassLabelIndex` | If necessary, a mapping from values in the image to class labels. |
| `INSTANCE_SEGMENTATION/image/class/label/string` | context bytes list | `set_instance_segmentation_class_label_string` / `SetInstanceSegmentationClassLabelString` | A mapping from values in the image to class labels. |
| `INSTANCE_SEGMENTATION/image/object/class/index` | context int | `set_instance_segmentation_object_class_index` / `SetInstanceSegmentationObjectClassIndex` | If necessary, a mapping from values in the image to class indices. |

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
| `FORWARD_FLOW/image/encoded` | feature list bytes | `add_forward_flow_encoded` / `AddForwardFlowEncoded` | The encoded forward optical flow field at each timestep. |
| `FORWARD_FLOW/image/timestamp` | feature list int | `add_forward_flow_timestamp` / `AddForwardFlowTimestamp` | The timestamp in microseconds for the optical flow field. |
| `FORWARD_FLOW/image/multi_encoded` | feature list bytes list | `add_forward_flow_multi_encoded` / `AddForwardFlowMultiEncoded` | Storing multiple optical flow fields at each timestep (e.g. from multiple camera views). |
| `FORWARD_FLOW/image/format` | context bytes | `set_forward_flow_format` / `SetForwardFlowFormat` | The encoding format of the optical flow field. |
| `FORWARD_FLOW/image/channels` | context int | `set_forward_flow_channels` / `SetForwardFlowChannels` | The number of channels in the optical flow field. |
| `FORWARD_FLOW/image/height` | context int | `set_forward_flow_height` / `SetForwardFlowHeight` | The height of the optical flow field in pixels. |
| `FORWARD_FLOW/image/width` | context int | `set_forward_flow_width` / `SetForwardFlowWidth` | The width of the optical flow field in pixels. |
| `FORWARD_FLOW/image/frame_rate` | context float | `set_forward_flow_frame_rate` / `SetForwardFlowFrameRate` | The rate of optical flow field in frames per second. |
| `FORWARD_FLOW/image/saturation` | context float | `set_forward_flow_saturation` / `SetForwardFlowSaturation` | The saturation value used before encoding the flow field to an image. |

Storing generic features is powerful, but potentially confusing. The recommendation is to use more specific methods if possible. When using these generic features, always supply a prefix. (The recommended prefix format, enforced by some MediaPipe functions, is all caps with underscores, e.g. MY_FAVORITE_FEATURE.) Following this recommendation, the keys will be listed with a generic PREFIX. Calls exist for storing generic features in both the feature_list and the context. For anything that occurs with a timestamp, use the feature_list; for anything that applies to the example as a whole, without timestamps, use the context.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
| `PREFIX/feature/floats` | feature list float list | `add_feature_floats` / `AddFeatureFloats` | A list of floats at a timestep. |
| `PREFIX/feature/bytes` | feature list bytes list | `add_feature_bytes` / `AddFeatureBytes` | A list of bytes at a timestep. May be encoded. |
| `PREFIX/feature/ints` | feature list int list | `add_feature_ints` / `AddFeatureInts` | A list of ints at a timestep. |
| `PREFIX/feature/timestamp` | feature list int | `add_feature_timestamp` / `AddFeatureTimestamp` | A timestamp for a set of features. |
| `PREFIX/feature/duration` | feature list int list | `add_feature_duration` / `AddFeatureDuration` | It is occasionally useful to indicate that a feature applies to a time range. This should only be used for features; annotations should be provided as Segments. |
| `PREFIX/feature/confidence` | feature list float list | `add_feature_confidence` / `AddFeatureConfidence` | The confidence for each generated feature. |
| `PREFIX/feature/dimensions` | context int list | `set_feature_dimensions` / `SetFeatureDimensions` | A list of integer dimensions for each feature. |
| `PREFIX/feature/rate` | context float | `set_feature_rate` / `SetFeatureRate` | The rate at which features are calculated, in features per second. |
| `PREFIX/feature/bytes/format` | context bytes | `set_feature_bytes_format` / `SetFeatureBytesFormat` | The encoding format, if any, for features stored as bytes. |
| `PREFIX/context_feature/floats` | context float list | `set_context_feature_floats` / `AddContextFeatureFloats` | A list of floats for the entire example. |
| `PREFIX/context_feature/bytes` | context bytes list | `set_context_feature_bytes` / `AddContextFeatureBytes` | A list of bytes for the entire example. May be encoded. |
| `PREFIX/context_feature/ints` | context int list | `set_context_feature_ints` / `AddContextFeatureInts` | A list of ints for the entire example. |
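
To make the prefix convention concrete, the sketch below shows how a prefix combines with a generic key name such as `feature/floats`. It is plain Python, not the MediaSequence implementation: `prefixed_key` is a hypothetical helper, and the regular expression is only one possible reading of "all caps with underscores" (it also allows digits after the first letter, which is an assumption).

```python
import re

# Assumed interpretation of the recommended prefix style, e.g. MY_FAVORITE_FEATURE.
_PREFIX_PATTERN = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*")


def prefixed_key(prefix, key):
    """Join a prefix and a key name with '/', rejecting non-conforming prefixes."""
    if not _PREFIX_PATTERN.fullmatch(prefix):
        raise ValueError(f"prefix {prefix!r} is not all caps with underscores")
    return f"{prefix}/{key}"


floats_key = prefixed_key("MY_FAVORITE_FEATURE", "feature/floats")
timestamp_key = prefixed_key("MY_FAVORITE_FEATURE", "feature/timestamp")
```

Here `floats_key` is `MY_FAVORITE_FEATURE/feature/floats`, which matches the `PREFIX/feature/floats` pattern in the table.
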

Audio is a special subtype of generic features with additional data about the audio format. When using audio, always supply a prefix. The keys here will be listed with a generic PREFIX.

To understand the terminology, it is helpful to conceptualize the audio as a list of matrices. The columns of the matrix are called samples. The rows of the matrix are called channels. Each matrix is called a packet. The packet rate is how often packets appear per second. The sample rate is the rate of columns per second. The audio sample rate is used for derived features such as spectrograms where the STFT is computed over audio at some other rate.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
| `PREFIX/feature/floats` | feature list float list | `add_feature_floats` / `AddFeatureFloats` | A list of floats at a timestep. |
| `PREFIX/feature/timestamp` | feature list int | `add_feature_timestamp` / `AddFeatureTimestamp` | A timestamp for a set of features. |
| `PREFIX/feature/sample_rate` | context float | `set_feature_sample_rate` / `SetFeatureSampleRate` | The number of features per second. (e.g. for a spectrogram, this is the rate of STFT windows.) |
| `PREFIX/feature/num_channels` | context int | `set_feature_num_channels` / `SetFeatureNumChannels` | The number of channels of audio in each stored feature. |
| `PREFIX/feature/num_samples` | context int | `set_feature_num_samples` / `SetFeatureNumSamples` | The number of samples of audio in each stored feature. |
| `PREFIX/feature/packet_rate` | context float | `set_feature_packet_rate` / `SetFeaturePacketRate` | The number of packets per second. |
| `PREFIX/feature/audio_sample_rate` | context float | `set_feature_audio_sample_rate` / `SetFeatureAudioSampleRate` | The sample rate of the original audio for derived features. |
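
The packet terminology can be checked with a little arithmetic. The sketch below is plain Python with made-up numbers, not MediaSequence code. It assumes raw audio stored in contiguous, non-overlapping packets, in which case the packet rate is the sample rate divided by the number of samples per packet; `packet_rate_for` is a hypothetical helper.

```python
# Illustrative values only; none of these numbers come from MediaSequence.
num_channels = 2      # rows of each packet matrix
num_samples = 1024    # columns of each packet matrix
sample_rate = 16000.0 # columns (samples) per second

# One packet: a num_channels x num_samples matrix, here filled with silence.
packet = [[0.0] * num_samples for _ in range(num_channels)]


def packet_rate_for(samples_per_second, samples_per_packet):
    """Packets per second, assuming contiguous, non-overlapping packets."""
    if samples_per_packet <= 0:
        raise ValueError("samples_per_packet must be positive")
    return samples_per_second / samples_per_packet


packet_rate = packet_rate_for(sample_rate, num_samples)

# Going the other way recovers the rate of columns per second.
recovered_sample_rate = packet_rate * num_samples
```

With these numbers, 16000 samples per second split into packets of 1024 samples gives 15.625 packets per second. Overlapping windows, as in many spectrogram pipelines, would not follow this simple relationship.
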

Text features may be timed with the media, such as captions or automatic speech recognition results, or may be descriptions. This collection of keys should be used for many, very short text features. For a few longer segments, please use the Segment keys in the context as described above. As always, prefixes can be used to store different types of text, such as automated and ground truth transcripts.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
| `text/language` | context bytes | `set_text_language` / `SetTextLanguage` | The language for the corresponding text. |
| `text/context/content` | context bytes | `set_text_context_content` / `SetTextContextContent` | Storage for large blocks of text in the context. |
| `text/context/token_id` | context int list | `set_text_context_token_id` / `SetTextContextTokenId` | Storage for large blocks of text in the context as token ids. |
| `text/context/embedding` | context float list | `set_text_context_embedding` / `SetTextContextEmbedding` | Storage for large blocks of text in the context as embeddings. |
| `text/content` | feature list bytes | `add_text_content` / `AddTextContent` | Storage for time aligned segments of text. |
| `text/timestamp` | feature list int | `add_text_timestamp` / `AddTextTimestamp` | When a text token occurs in microseconds. |
| `text/duration` | feature list int | `add_text_duration` / `AddTextDuration` | The duration in microseconds for the corresponding text tokens. |
| `text/confidence` | feature list float | `add_text_confidence` / `AddTextConfidence` | How likely the text is correct. |
| `text/embedding` | feature list float list | `add_text_embedding` / `AddTextEmbedding` | A floating point vector for the corresponding text token. |
| `text/token/id` | feature list int | `add_text_token_id` / `AddTextTokenId` | An integer id for the corresponding text token. |