Data Sampling

docs/integrations/block-integrations/sampling.md

What it is

The Data Sampling block selects a subset of items from a larger dataset using one of several sampling methods.

What it does

This block takes a dataset as input and returns a smaller sample of that data based on specified criteria. It supports multiple sampling methods, allowing users to choose the most appropriate technique for their needs.

How it works

The block applies the chosen sampling method to the input data and selects a subset of items. It accepts a single dictionary, a list of dictionaries, or a list of lists. When accumulation is enabled, it collects incoming data before sampling, which suits scenarios where data is received in batches.
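The accumulate-then-sample behaviour described above can be sketched as follows. This is a minimal illustration, not the block's actual implementation: the `BatchSampler` class and its method names are hypothetical, and it assumes simple random sampling without replacement, with a seed for reproducibility.

```python
import random
from typing import Any, List, Optional, Tuple


class BatchSampler:
    """Hypothetical helper: buffers incoming batches, then samples from the buffer."""

    def __init__(self, sample_size: int, seed: Optional[int] = None) -> None:
        self.sample_size = sample_size
        # A private RNG keeps results repeatable for a given seed
        # without touching the global random state.
        self.rng = random.Random(seed)
        self.buffer: List[Any] = []

    def add_batch(self, batch: List[Any]) -> None:
        # Accumulate instead of sampling each batch on its own.
        self.buffer.extend(batch)

    def sample(self) -> Tuple[List[Any], List[int]]:
        if len(self.buffer) < self.sample_size:
            raise ValueError("not enough accumulated items to draw the sample")
        # Draw distinct positions, then look the items up by position.
        indices = self.rng.sample(range(len(self.buffer)), self.sample_size)
        return [self.buffer[i] for i in indices], indices


sampler = BatchSampler(sample_size=3, seed=42)
sampler.add_batch([{"id": 1}, {"id": 2}])
sampler.add_batch([{"id": 3}, {"id": 4}, {"id": 5}])
sampled, indices = sampler.sample()
```

Creating a second `BatchSampler` with the same seed and feeding it the same batches yields the same sample, which is the point of a random seed.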

Inputs

| Input | Description |
|---|---|
| Data | The dataset to sample from. This can be a single dictionary, a list of dictionaries, or a list of lists. |
| Sample Size | The number of items to select from the dataset. |
| Sampling Method | The technique used to select the sample. Options include random, systematic, top, bottom, stratified, weighted, reservoir, and cluster sampling. |
| Accumulate | A flag indicating whether to accumulate data before sampling. This is useful for scenarios where data is received in batches. |
| Random Seed | An optional value to ensure reproducible random sampling. |
| Stratify Key | The key to use for stratified sampling (required when using the stratified sampling method). |
| Weight Key | The key to use for weighted sampling (required when using the weighted sampling method). |
| Cluster Key | The key to use for cluster sampling (required when using the cluster sampling method). |
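To make a few of the method names concrete, here is a sketch of how the top, bottom, and systematic options could pick positions from a dataset of `n` items. The function names are hypothetical and the definitions follow the textbook meaning of each technique. The block's own behaviour may differ in details such as how the systematic start offset is chosen.

```python
from typing import List


def top_indices(n: int, k: int) -> List[int]:
    # First k positions (or all of them if k exceeds n).
    return list(range(min(k, n)))


def bottom_indices(n: int, k: int) -> List[int]:
    # Last k positions (or all of them if k exceeds n).
    return list(range(max(n - k, 0), n))


def systematic_indices(n: int, k: int, start: int = 0) -> List[int]:
    # Every step-th position beginning at `start`, wrapping around the end.
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and n")
    step = n // k
    return [(start + i * step) % n for i in range(k)]


print(top_indices(10, 3))                  # [0, 1, 2]
print(bottom_indices(10, 3))               # [7, 8, 9]
print(systematic_indices(10, 3))           # [0, 3, 6]
print(systematic_indices(10, 3, start=2))  # [2, 5, 8]
```

Because `step * k` never exceeds `n`, the systematic positions are always distinct even when the start offset causes wrap-around.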

Outputs

| Output | Description |
|---|---|
| Sampled Data | The selected subset of the input data. |
| Sample Indices | The indices of the sampled items in the original dataset. |
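The relationship between the two outputs is that each sampled item can be traced back to its position in the original data. The sketch below shows this with reservoir sampling (the classic "Algorithm R"), one of the listed methods. It is an illustration under assumptions, not the block's code: `reservoir_sample` is a hypothetical name, and the input is treated as a stream of unknown length.

```python
import random
from typing import Any, Iterable, List, Optional, Tuple


def reservoir_sample(
    stream: Iterable[Any], k: int, seed: Optional[int] = None
) -> Tuple[List[Any], List[int]]:
    """Keep a uniform sample of k items from a stream, with original positions."""
    rng = random.Random(seed)
    reservoir: List[Tuple[int, Any]] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append((i, item))
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = (i, item)
    items = [item for _, item in reservoir]
    indices = [i for i, _ in reservoir]
    return items, indices


data = [{"id": n} for n in range(100)]
sampled_data, sample_indices = reservoir_sample(iter(data), k=5, seed=1)
# Each sampled item matches the original at its recorded index.
assert all(data[i] == item for i, item in zip(sample_indices, sampled_data))
```

Carrying the index alongside each item is what lets a downstream step join the sample back to the full dataset.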

Possible use case

A data scientist working with a large customer dataset wants to create a representative sample for analysis. They could use this Data Sampling block to select a smaller subset of customers using stratified sampling, ensuring that the sample maintains the same proportions of different customer segments as the full dataset.
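A sketch of this use case, assuming proportional allocation: each segment receives a share of the sample matching its share of the dataset, with leftover slots going to the segments with the largest remainders. The function name `stratified_sample` and the `segment` field are hypothetical, and the block may allocate slots differently.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


def stratified_sample(
    rows: List[Dict[str, Any]], key: str, sample_size: int, seed: Optional[int] = None
) -> Tuple[List[Dict[str, Any]], List[int]]:
    if not 0 < sample_size <= len(rows):
        raise ValueError("sample_size must be between 1 and len(rows)")
    rng = random.Random(seed)

    # Group row positions by the value of the stratify key.
    strata: Dict[Any, List[int]] = defaultdict(list)
    for i, row in enumerate(rows):
        strata[row[key]].append(i)

    # Proportional allocation in integer arithmetic (largest-remainder method).
    counts: Dict[Any, int] = {}
    remainders: Dict[Any, int] = {}
    for value, positions in strata.items():
        counts[value], remainders[value] = divmod(sample_size * len(positions), len(rows))
    leftover = sample_size - sum(counts.values())
    for value in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[value] += 1

    # Sample within each stratum, then restore original order.
    indices: List[int] = []
    for value, positions in strata.items():
        indices.extend(rng.sample(positions, counts[value]))
    indices.sort()
    return [rows[i] for i in indices], indices


customers = [{"segment": "retail"}] * 6 + [{"segment": "enterprise"}] * 4
sample, picked = stratified_sample(customers, key="segment", sample_size=5, seed=0)
# 60% retail and 40% enterprise in the full data gives 3 retail and 2 enterprise.
```

With a 60/40 split and a sample size of 5, the allocation is exact. When the proportions do not divide evenly, the largest-remainder step decides which segments get the extra slots.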