Back to Lance

Schema Format Specification

docs/src/format/table/schema.md

4.0.114.7 KB
Original Source

Schema Format Specification

Overview

The schema describes the structure of a Lance table, including all fields, their data types, and metadata. Schemas use a logical type system where data types are represented as strings that map to Apache Arrow data types. Each field in the schema has a unique identifier (field ID) that enables robust schema evolution and version tracking.

!!! note

Logical types are currently being simplified through discussion [#5864](https://github.com/lance-format/lance/discussions/5864).
Proposed changes include consolidating encoding-specific variants (e.g., `large_string` and `string`, `large_binary` and `binary`)
into single logical types with runtime optimization. Additionally, [#5817](https://github.com/lance-format/lance/discussions/5817) proposes adding
`string_view` and `binary_view` types. This document describes the current implementation.

Data Types

Lance supports a comprehensive set of data types that map to Apache Arrow types. Data types are represented as strings in the schema and can be grouped into several categories.

Primitive Types

Logical TypeArrow TypeDescription
nullNullNull type (no values)
boolBooleanBoolean (true/false)
int8Int8Signed 8-bit integer
uint8UInt8Unsigned 8-bit integer
int16Int16Signed 16-bit integer
uint16UInt16Unsigned 16-bit integer
int32Int32Signed 32-bit integer
uint32UInt32Unsigned 32-bit integer
int64Int64Signed 64-bit integer
uint64UInt64Unsigned 64-bit integer

Floating Point Types

Logical TypeArrow TypeDescription
halffloatFloat16IEEE 754 half-precision floating point (16-bit)
floatFloat32IEEE 754 single-precision floating point (32-bit)
doubleFloat64IEEE 754 double-precision floating point (64-bit)

String and Binary Types

Logical TypeArrow TypeDescription
stringUtf8Variable-length UTF-8 encoded string
binaryBinaryVariable-length binary data
large_stringLargeUtf8Variable-length UTF-8 string (supports large offsets)
large_binaryLargeBinaryVariable-length binary data (supports large offsets)

Decimal Types

Decimal types support arbitrary-precision numeric values. The format is: decimal:<bit_width>:<precision>:<scale>

Logical TypeArrow TypePrecisionExample
decimal:128:P:SDecimal128Up to 38 digitsdecimal:128:10:2 (10 total digits, 2 after decimal)
decimal:256:P:SDecimal256Up to 76 digitsdecimal:256:20:5
  • Precision (P): Total number of digits (1-38 for Decimal128, up to 76 for Decimal256)
  • Scale (S): Number of digits after the decimal point (0 ≤ S ≤ P)

Date and Time Types

Logical TypeArrow TypeDescription
date32:dayDate32Date (days since epoch)
date64:msDate64Date (milliseconds since epoch)
time32:sTime32Time (seconds since midnight)
time32:msTime32Time (milliseconds since midnight)
time64:usTime64Time (microseconds since midnight)
time64:nsTime64Time (nanoseconds since midnight)
duration:sDurationDuration (seconds)
duration:msDurationDuration (milliseconds)
duration:usDurationDuration (microseconds)
duration:nsDurationDuration (nanoseconds)

Timestamp Types

Timestamp types represent a point in time and may include timezone information. Format: timestamp:<unit>:<timezone>

  • Unit: s (seconds), ms (milliseconds), us (microseconds), ns (nanoseconds)
  • Timezone: IANA timezone string (e.g., UTC, America/New_York) or - for no timezone

Examples:

  • timestamp:us:UTC - Microsecond precision timestamp in UTC
  • timestamp:ms:America/New_York - Millisecond precision timestamp in America/New_York timezone
  • timestamp:ns:- - Nanosecond precision timestamp with no timezone

Complex Types

Struct Type

A struct is a container for named fields with heterogeneous types.

Logical TypeArrow TypeDescription
structStructComposite type containing multiple named fields

Struct fields are represented as child fields in the schema.

Example schema with a struct:

protobuf
Field {
    name: "address"
    type: "struct"
    children: [
        Field { name: "street", type: "string" },
        Field { name: "city", type: "string" },
        Field { name: "zip", type: "int32" }
    ]
}

List Types

Lists represent variable-length arrays of a single type.

Logical TypeArrow TypeDescription
listListVariable-length list of values
list.structList(Struct)Variable-length list of struct values
large_listLargeListVariable-length list (supports large offsets)
large_list.structLargeList(Struct)Variable-length list of struct values (large offsets)

The element type is specified as a child field.

Fixed-Size List Types

Fixed-size lists have a predetermined size known at schema definition time. Format: fixed_size_list:<element_type>:<size>

Logical TypeDescriptionExample
fixed_size_list:float:128Fixed-size list of 128 floatsVector embeddings (128-dimensional)
fixed_size_list:int32:10Fixed-size list of 10 integers

Special extension types:

  • fixed_size_list:lance.bfloat16:256 - Fixed-size list of bfloat16 values

Fixed-Size Binary Type

Fixed-size binary data with a predetermined size in bytes. Format: fixed_size_binary:<size>

Logical TypeDescriptionExample
fixed_size_binary:16Fixed-size binary of 16 bytesMD5 hash
fixed_size_binary:32Fixed-size binary of 32 bytesSHA-256 hash

Dictionary Type

Dictionary-encoded data with separate keys and values. Format: dict:<value_type>:<key_type>:<ordered>

  • Value type: The type of dictionary values
  • Key type: The type used for dictionary indices (typically int8, int16, or int32)
  • Ordered: Boolean indicating if dictionary values are sorted (currently not fully supported)

Example: dict:string:int16:false - Dictionary-encoded strings with int16 keys

Map Type

Key-value pairs stored in a structured format.

Logical TypeArrow TypeDescription
mapMapKey-value pairs (currently supports unordered keys only)

Maps have key and value types specified as child fields.

Extension Types

Lance supports custom extension types that provide semantic meaning on top of Arrow types.

Blob Type

Represents large binary data stored externally.

Logical TypeDescription
blobLarge binary data with external storage reference
jsonJSON-encoded data stored as binary

Blob types are stored as large binary data with metadata describing storage location.

BFloat16 Type

Brain float (bfloat16) is a 16-bit floating point format optimized for ML. Used within fixed-size lists: fixed_size_list:lance.bfloat16:SIZE

Field IDs

Field IDs are unique integer identifiers assigned to each field in a schema. They are essential for robust schema evolution, as they allow fields to be renamed or reordered without breaking references.

Field ID Assignment

Initial assignment (depth-first order): When a table is created, field IDs are assigned to all fields in depth-first order, starting from 0.

Nested fields are linked via the parent_id field in the protobuf message. For example, if field "c" (id: 2) is a struct containing fields "x", "y", "z", those child fields will have parent_id: 2. Top-level fields have parent_id: -1.

Example with nested structure:

Field order: a, b, c.x, c.y, c.z, d

Assigned IDs with parent relationships:
- a: 0 (parent_id: -1)
- b: 1 (parent_id: -1)
- c: 2 (parent_id: -1, struct type)
- c.x: 3 (parent_id: 2)
- c.y: 4 (parent_id: 2)
- c.z: 5 (parent_id: 2)
- d: 6 (parent_id: -1)

Note: A parent_id of -1 indicates a top-level field. For nested fields, parent_id references the ID of the parent field. Child fields reference their parent via parent_id rather than being stored as separate "children" arrays in the protobuf message (though the Rust in-memory representation maintains a children vector for convenience).

New field assignment (incremental): When fields are added later (e.g., through schema evolution), they receive the next available ID incrementally. This preserves the history of field additions.

Field ID Properties

  • Immutable: Once assigned, a field's ID never changes
  • Unique: Each field within a table has a unique ID
  • Stable: IDs are preserved across schema evolution operations
  • Sparse: Field IDs may not form a contiguous sequence after schema evolution

Using Field IDs

When referencing fields internally within the format, use the field ids rather than field names or positions.

Field Metadata

Fields can carry additional metadata as key-value pairs to configure encoding, primary key behavior, and other properties.

Primary Key Metadata

Primary key configuration is handled by two protobuf fields in the Field message:

  • unenforced_primary_key (bool): Whether this field is part of the primary key
  • unenforced_primary_key_position (uint32): Position in primary key ordering (1-based for ordered, 0 for unordered)

For detailed discussion on primary key configuration, see Unenforced Primary Key in the table format overview.

Encoding Metadata

Column encoding configurations are specified with the lance-encoding: prefix. See File Format Encoding Specification for complete details on available encodings.

Arrow Extension Type Metadata

Custom Arrow extension types may have metadata under the ARROW:extension: namespace (e.g., ARROW:extension:name).

Schema Protobuf Definition

The schema is serialized using protobuf messages. Key messages include:

Field Message

protobuf
%%% proto.message.lance.file.Field %%%

The Field message contains:

  • id: Unique field identifier (int32)
  • name: Field name (string)
  • type: Field type enum (PARENT, REPEATED, or LEAF)
  • logical_type: Logical type string representation (string) - e.g., "int64", "struct", "list"
  • nullable: Whether the field can be null (bool)
  • parent_id: Parent field ID for nested fields; -1 for top-level fields (int32)
  • metadata: Key-value pairs for additional configuration (map<string, bytes>)
  • unenforced_primary_key: Whether this field is part of the primary key (bool)
  • unenforced_primary_key_position: Position in primary key ordering (uint32, 0 = unordered)

Schema Message

The complete schema is represented as a collection of top-level fields plus metadata.

Schema Evolution

Field IDs enable efficient schema evolution:

  • Add Column: Assign a new field ID and add to schema
  • Drop Column: Remove field from schema; its ID may be reused in some systems
  • Rename Column: Change field name; ID remains the same
  • Reorder Columns: Change field order in schema; IDs remain the same
  • Type Evolution: Data type can be changed. This might require rewriting the column in the data, depending on how the type was changed.

The use of field IDs ensures that data files can be correctly interpreted even as the schema changes over time.

Example Schemas

The examples below use a simplified representation of the field structure. In the actual protobuf format, type refers to the field type enum (PARENT/REPEATED/LEAF) and logical_type contains the data type string representation.

Simple Table

Field {
    id: 0
    name: "id"
    logical_type: "int64"
    nullable: false
    parent_id: -1
}
Field {
    id: 1
    name: "name"
    logical_type: "string"
    nullable: true
    parent_id: -1
}
Field {
    id: 2
    name: "created_at"
    logical_type: "timestamp:us:UTC"
    nullable: true
    parent_id: -1
}

Nested Structure

Field {
    id: 0
    name: "id"
    logical_type: "int64"
    nullable: false
    parent_id: -1  // Top-level field
}
Field {
    id: 1
    name: "user"
    logical_type: "struct"
    nullable: true
    parent_id: -1  // Top-level field
}
Field {
    id: 2
    name: "name"
    logical_type: "string"
    nullable: true
    parent_id: 1  // Nested under "user" struct (id: 1)
}
Field {
    id: 3
    name: "email"
    logical_type: "string"
    nullable: true
    parent_id: 1  // Nested under "user" struct (id: 1)
}
Field {
    id: 4
    name: "tags"
    logical_type: "list"
    nullable: true
    parent_id: -1  // Top-level field
}
Field {
    id: 5
    name: "item"
    logical_type: "string"
    nullable: true
    parent_id: 4  // Nested under "tags" list (id: 4)
}

With Vector Embeddings

Field {
    id: 0
    name: "id"
    logical_type: "int64"
    nullable: false
    parent_id: -1  // Top-level field
    unenforced_primary_key: true
    unenforced_primary_key_position: 1  // Ordered position in primary key
}
Field {
    id: 1
    name: "text"
    logical_type: "string"
    nullable: true
    parent_id: -1  // Top-level field
}
Field {
    id: 2
    name: "embedding"
    logical_type: "fixed_size_list:lance.bfloat16:384"
    nullable: true
    parent_id: -1  // Top-level field
}

Type Conversion Reference

When converting between logical types and Arrow types, Lance uses the following mappings:

Arrow TypeLogical Type Format
Arrow::Nullnull
Arrow::Booleanbool
Arrow::Int8 to Int64int8, int16, int32, int64
Arrow::UInt8 to UInt64uint8, uint16, uint32, uint64
Arrow::Float16halffloat
Arrow::Float32float
Arrow::Float64double
Arrow::Utf8string
Arrow::LargeUtf8large_string
Arrow::Binarybinary
Arrow::LargeBinarylarge_binary
Arrow::Decimal128(p, s)decimal:128:p:s
Arrow::Decimal256(p, s)decimal:256:p:s
Arrow::Date32date32:day
Arrow::Date64date64:ms
Arrow::Time32(TimeUnit)time32:s, time32:ms
Arrow::Time64(TimeUnit)time64:us, time64:ns
Arrow::Timestamp(unit, tz)timestamp:unit:tz
Arrow::Duration(unit)duration:s, duration:ms, duration:us, duration:ns
Arrow::Structstruct
Arrow::List(Element)list or list.struct if element is Struct
Arrow::LargeList(Element)large_list or large_list.struct
Arrow::FixedSizeList(Element, Size)fixed_size_list:type:size
Arrow::FixedSizeBinary(Size)fixed_size_binary:size
Arrow::Dictionary(KeyType, ValueType)dict:value_type:key_type:false
Arrow::Mapmap