docs/getting-started/concepts/data-ingestion.md
A data source in Feast refers to raw underlying data that users own (e.g. a table in BigQuery). Feast does not manage the raw underlying data; instead, it is in charge of loading this data and performing operations on it to retrieve or serve features.
Feast uses a time-series data model to represent data. This data model is used to interpret feature data in data sources in order to build training datasets or materialize features into an online store.
Below is an example data source with a single entity column (driver) and two feature columns (trips_today and rating).
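For illustration, such a source might look like the following (hypothetical values; the event_timestamp column reflects the time-series data model described above):

| event_timestamp | driver | trips_today | rating |
| --- | --- | --- | --- |
| 2021-04-12 08:00:00 | 1001 | 5 | 4.7 |
| 2021-04-12 08:00:00 | 1002 | 2 | 4.1 |
| 2021-04-12 09:00:00 | 1001 | 7 | 4.7 |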
Feast primarily supports time-stamped tabular data as data sources. These include batch sources (e.g. tables in a data warehouse or files) and stream sources, both of which are discussed below.
Ingesting from batch sources is only necessary to power real-time models. This is done through materialization. Under the hood, Feast manages an offline store (to scalably generate training data from batch sources) and an online store (to provide low-latency access to features for real-time models).
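To make the distinction concrete, here is a minimal sketch of the two access paths (the feature view name driver_stats, the feature references, and the entity values are hypothetical):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Offline store: scalably build a training dataset from the batch source
entity_df = pd.DataFrame(
    {
        "driver": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2021-04-12 08:00:00"] * 2),
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:trips_today", "driver_stats:rating"],
).to_df()

# Online store: low-latency feature reads for real-time models
# (values are only present here after materialization)
online_features = store.get_online_features(
    features=["driver_stats:trips_today", "driver_stats:rating"],
    entity_rows=[{"driver": 1001}],
).to_dict()
```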
A key command in Feast is materialize_incremental, which fetches the latest values for all entities in the batch source and ingests these values into the online store.
When working with On Demand Feature Views with write_to_online_store=True, you can also control whether transformations are applied during ingestion by using the transform_on_write parameter. Setting transform_on_write=False allows you to materialize pre-transformed features without reapplying transformations, which is particularly useful for large batch datasets that have already been processed.
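A minimal sketch of this flow, assuming an on demand feature view named driver_stats_fresh defined with write_to_online_store=True, and assuming transform_on_write is passed to write_to_online_store (the names and the parameter placement are illustrative assumptions, not a definitive API reference):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Rows that were already transformed upstream (hypothetical columns)
pretransformed_df = pd.DataFrame(
    {
        "driver": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2021-04-12 08:00:00"] * 2),
        "trips_today_plus_rating": [9.7, 6.1],
    }
)

# Assumption: transform_on_write=False skips reapplying the on demand
# transformation, so the pre-transformed values are written as-is.
store.write_to_online_store(
    feature_view_name="driver_stats_fresh",
    df=pretransformed_df,
    transform_on_write=False,
)
```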
Materialization can be called programmatically or through the CLI:
Code example (programmatic scheduled materialization): the snippet below creates a feature store object that points to the registry (which knows of all defined features) and the online store (DynamoDB in this case), and then calls materialize_incremental to load the latest feature values into the online store.
```python
import datetime

from airflow.operators.python import PythonOperator
from feast import FeatureStore, RepoConfig
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
from feast.repo_config import RegistryConfig

# Define Python callable
def materialize():
    repo_config = RepoConfig(
        registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
        project="feast_demo_aws",
        provider="aws",
        offline_store="file",
        online_store=DynamoDBOnlineStoreConfig(region="us-west-2"),
    )
    store = FeatureStore(config=repo_config)
    # Ingest the latest feature values (up to now) into the online store
    store.materialize_incremental(datetime.datetime.now())

# (In production) Use Airflow PythonOperator inside a DAG definition
materialize_python = PythonOperator(
    task_id='materialize_python',
    python_callable=materialize,
)
```
With timestamps:

```bash
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```

Simple materialization (for data without event timestamps):

```bash
feast materialize --disable-event-timestamp
```
In production, you can wrap the CLI call in an Airflow BashOperator:

```python
import datetime

from airflow.operators.bash import BashOperator

# Use BashOperator to run the CLI command on a schedule
materialize_bash = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
)
```
If the schema parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during feast apply. How it does this depends on the implementation of the offline store. For the offline stores that ship with Feast out of the box, this inference is performed by inspecting the schema of the table in the cloud data warehouse, or, if a query is provided to the source, by running the query with a LIMIT clause and inspecting the result.
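As a sketch of the two inference paths (the table name and query below are hypothetical):

```python
from feast import BigQuerySource

# Table-backed source: the schema is inferred by inspecting the warehouse table
driver_stats = BigQuerySource(
    name="driver_stats_source",
    table="my_project.feast.driver_stats",
    timestamp_field="event_timestamp",
)

# Query-backed source: the schema is inferred by running the query with a LIMIT
driver_stats_query = BigQuerySource(
    name="driver_stats_query_source",
    query="SELECT * FROM my_project.feast.driver_stats",
    timestamp_field="event_timestamp",
)
```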
Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.
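A minimal sketch of the Push API path, assuming a push source named driver_stats_push_source has already been defined and applied (the name and columns are hypothetical):

```python
import pandas as pd
from feast import FeatureStore
from feast.data_source import PushMode

store = FeatureStore(repo_path=".")

# Fresh feature values arriving from a stream consumer (hypothetical columns)
event_df = pd.DataFrame(
    {
        "driver": [1001],
        "event_timestamp": pd.to_datetime(["2021-04-12 10:00:00"]),
        "trips_today": [8],
        "rating": [4.8],
    }
)

# Push the rows into the online store (use PushMode.ONLINE_AND_OFFLINE to also
# write them to the offline store)
store.push("driver_stats_push_source", event_df, to=PushMode.ONLINE)
```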