content/influxdb3/cloud-serverless/write-data/best-practices/schema-design.md
Use the following guidelines to design your schema for simpler and more performant queries.
<!-- TOC -->The InfluxDB data model organizes time series data into buckets and measurements. A bucket can contain multiple measurements. Measurements contain multiple tags and fields.
In time series data, the primary key for a row of data is typically a combination of timestamp and other attributes that uniquely identify each data point. In InfluxDB, the primary key for a row is the combination of the point's timestamp and tag set - the collection of tag keys and tag values on the point. A row's primary key tag set does not include tags with null values.
[!Important] Overwriting points with the same primary key (timestamp and tag set) is not reliable for maintaining a last-value view. For recommended patterns, see Duplicate points in the line protocol reference.
When designing your schema for InfluxDB, a common question is, "what should be a tag and what should be a field?" The following guidelines should help answer that question as you design your schema.
{{% product-name %}} indexes tag keys, field keys, and other metadata to optimize performance. It doesn't index tag values or field values.
[!Note] The InfluxDB 3 storage engine supports infinite tag value and series cardinality. Unlike InfluxDB backed by the TSM storage engine, tag value cardinality doesn't affect the overall performance of your bucket.
Use unique names for tags and fields within the same measurement. {{% product-name %}} stores tags and fields as unique columns in a measurement that represents the measurement on disk. If you attempt to write a measurement that contains tags or fields with the same name, the write fails due to a column conflict.
A measurement has a maximum number of columns. Each row must include a time column. As a result, a measurement can have the following:
If you attempt to write to a measurement and exceed the column limit, then the write request fails and InfluxDB returns an error.
How you structure your schema within a measurement can affect the overall performance of queries against that measurement. The following guidelines help to optimize query performance:
A wide schema refers to a schema with a large number of columns (tags and fields).
Wide schemas can lead to the following issues:
To prevent wide schema issues, limit the number of tags and fields stored in a measurement. If you need to store more than the maximum number of columns, consider segmenting your fields into separate measurements.
In InfluxDB, the primary key for a row is the combination of the point's timestamp and tag set - the collection of tag keys and tag values on the point. A point that contains more tags has a more complex primary key, which could impact sorting performance if you sort using all parts of the key.
A sparse schema is one where, for many rows, columns contain null values.
These generally stem from the following:
Sparse schemas require the InfluxDB query engine to evaluate many null columns, adding unnecessary overhead to storing and querying data.
For an example of a sparse schema, view the non-homogenous schema example below.
Reporting fields at different times with different timestamps creates distinct rows that contain null values--for example:
You report fieldA with tagset, and then report field B with the same tagset, but with a different timestamp.
The result is two rows: one row has a null value for field A and the other has a null value for field B.
In contrast, if you report fields at different times while using the same tagset and timestamp, the existing row is updated. This requires slightly more resources at ingestion time, but then gets resolved at persistence time or compaction time and avoids a sparse schema.
Data stored within a measurement should be "homogenous," meaning each row should have the same tag and field keys. All rows stored in a measurement share the same columns, but if a point doesn't include a value for a column, the column value is null. A measurement full of null values has a "sparse" schema.
{{< expand-wrapper >}} {{% expand "View example of a sparse, non-homogenous schema" %}}
Non-homogenous schemas are often caused by writing points to a measurement with inconsistent tag or field sets. In the following example, data is collected from two different sources and each source returns data with different tag and field sets.
{{< flex >}} {{% flex-content %}}
These sets of data written to the same measurement will result in a measurement full of null values (also known as a sparse schema):
| time | source | src | code | currency | crypto | price | cost | volume |
|---|---|---|---|---|---|---|---|---|
| 2023-01-01T12:00:00Z | src1 | USD | bitcoin | 16588.45865 | ||||
| 2023-01-01T12:00:00Z | 2 | EUR | bitcoin | 16159.5806 | 16749450200 | |||
| 2023-01-01T13:00:00Z | src1 | USD | bitcoin | 16559.49871 | ||||
| 2023-01-01T13:00:00Z | 2 | EUR | bitcoin | 16131.3694 | 16829683245 | |||
| 2023-01-01T14:00:00Z | src1 | USD | bitcoin | 16577.46667 | ||||
| 2023-01-01T14:00:00Z | 2 | EUR | bitcoin | 16148.8727 | 17151722208 | |||
| 2023-01-01T15:00:00Z | src1 | USD | bitcoin | 16591.36998 | ||||
| 2023-01-01T15:00:00Z | 2 | EUR | bitcoin | 16162.4167 | 17311854919 |
{{% /expand %}} {{< /expand-wrapper >}}
When writing data to a field, use the most appropriate data type for your data--write integers as integers, decimals as floats, and booleans as booleans. A query against a field that stores integers outperforms a query against string data; querying over many long string values can negatively affect performance.
Naming conventions for measurements, tag keys, and field keys can simplify or complicate the process of writing queries for your data. The following guidelines help to ensure writing queries for your data is as simple as possible.
Use one tag or one field for each data attribute. If your source data contains multiple data attributes in a single parameter, split each attribute into its own tag or field.
Measurement names, tag keys, and field keys should be simple and accurately describe what each contains. Keep names free of data. The most common cause of a complex naming convention is when you try to "embed" data attributes into a measurement name, tag key, or field key.
When each key and value represents one attribute (not multiple concatenated attributes) of your data, you'll reduce the need for regular expressions in your queries. Without regular expressions, your queries will be easier to write and more performant.
For example, consider the following line protocol that embeds multiple attributes (location, model, and ID) into a sensor tag value:
home,sensor=loc-kitchen.model-A612.id-1726ZA temp=72.1
home,sensor=loc-bath.model-A612.id-2635YB temp=71.8
{{< expand-wrapper >}} {{% expand "View written data" %}}
{{% influxql/table-meta %}} name: home {{% /influxql/table-meta %}}
| time | sensor | temp |
|---|---|---|
| 2023-01-01T00:00:00Z | loc-kitchen.model-A612.id-1726ZA | 72.1 |
| 2023-01-01T00:00:00Z | loc-bath.model-A612.id-2635YB | 71.8 |
{{% /expand %}} {{< /expand-wrapper >}}
To query data from the sensor with ID 1726ZA, you have to use either SQL pattern
matching or regular expressions to evaluate the sensor tag:
{{< code-tabs-wrapper >}} {{% code-tabs %}} SQL InfluxQL {{% /code-tabs %}} {{% code-tab-content %}}
SELECT * FROM home WHERE sensor LIKE '%id-1726ZA%'
{{% /code-tab-content %}} {{% code-tab-content %}}
SELECT * FROM home WHERE sensor =~ /id-1726ZA/
{{% /code-tab-content %}} {{< /code-tabs-wrapper >}}
SQL pattern matching and regular expressions both complicate the query and are less performant than simple equality expressions.
The better approach would be to write each sensor attribute as a separate tag:
home,location=kitchen,sensor_model=A612,sensor_id=1726ZA temp=72.1
home,location=bath,sensor_model=A612,sensor_id=2635YB temp=71.8
{{< expand-wrapper >}} {{% expand "View written data" %}}
{{% influxql/table-meta %}} name: home {{% /influxql/table-meta %}}
| time | location | sensor_model | sensor_id | temp |
|---|---|---|---|---|
| 2023-01-01T00:00:00Z | kitchen | A612 | 1726ZA | 72.1 |
| 2023-01-01T00:00:00Z | bath | A612 | 2635YB | 71.8 |
{{% /expand %}} {{< /expand-wrapper >}}
To query data from the sensor with ID 1726ZA using this schema, you can use a
simple equality expression:
{{< code-tabs-wrapper >}} {{% code-tabs %}} SQL & InfluxQL {{% /code-tabs %}} {{% code-tab-content %}}
SELECT * FROM home WHERE sensor_id = '1726ZA'
{{% /code-tab-content %}} {{< /code-tabs-wrapper >}}
This query is easier to write and is more performant than using pattern matching or regular expressions.
To simplify query writing, avoid using reserved keywords or special characters in measurement names, tag keys, and field keys.
When using SQL or InfluxQL to query measurements, tags, and fields with special characters or keywords, you have to wrap these keys in double quotes.
SELECT
"example-field", "tag@1-23"
FROM
"example-measurement"
WHERE
"tag@1-23" = 'ABC'