.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _glossary:

========
Glossary
========

.. glossary::
   :sorted:

array
vector
   A contiguous, one-dimensional sequence of values with known
   length where all values have the same type. An array consists of
   zero or more :term:`buffers <buffer>`, a non-negative length, and
   a :term:`data type`. The buffers of an array are laid out
   according to the data type as defined by the columnar format.

   Arrays are contiguous in the sense that iterating the values of
   an array will iterate through a single set of buffers, even
   though an array may consist of multiple disjoint buffers, or
   may consist of child arrays that themselves span multiple
   buffers.

   Arrays are one-dimensional in that they are a sequence of
   :term:`slots <slot>` or singular values, even though for some
   data types (like structs or unions), a slot may represent
   multiple values.

   Defined by the :doc:`./Columnar`.
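
As a rough illustration of how an array is made up of buffers plus a length, the sketch below packs a nullable ``Int32`` array into a validity bitmap and a data buffer. It is plain Python, not an Arrow API; the helper names ``build_int32_buffers`` and ``read_slot`` are hypothetical, and buffer padding and alignment are left out for brevity.

```python
import struct


def build_int32_buffers(values):
    """Pack ints/None into (validity_bitmap, data_buffer, length).

    Hypothetical helper for illustration only; not part of any Arrow library.
    """
    length = len(values)
    validity = bytearray((length + 7) // 8)  # one bit per slot
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            # Bits are numbered least-significant first within each byte.
            validity[i // 8] |= 1 << (i % 8)
        # A null slot still occupies 4 bytes; its content is arbitrary (0 here).
        data += struct.pack("<i", 0 if value is None else value)
    return bytes(validity), bytes(data), length


def read_slot(validity, data, i):
    """Return the value in slot i, or None if the validity bit is unset."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, i * 4)[0]


validity, data, length = build_int32_buffers([1, None, 3])
print(length, validity, len(data))
print([read_slot(validity, data, i) for i in range(length)])
```

Reading a slot touches only this one set of buffers, which is the sense in which the array is contiguous.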

buffer
   A contiguous region of memory with a given length. Buffers are
   used to store data for arrays.

   Buffers may be in CPU memory, memory-mapped from a file, in
   device (e.g. GPU) memory, etc., though not all Arrow
   implementations support all of these possibilities.

canonical extension type
   An :term:`extension type` that has been standardized by the Arrow
   community so as to improve interoperability between
   implementations.

   .. seealso::
      :ref:`format_canonical_extensions`.

child array
parent array
   In an array of a :term:`nested type`, the parent array corresponds
   to the :term:`parent type` and the child array(s) correspond to
   the :term:`child type(s) <child type>`. For example, a
   ``List[Int32]``-type parent array has an ``Int32``-type child
   array.

child type
parent type
   In a :term:`nested type`, the nested type is the parent type, and
   the child type(s) are its parameters. For example, in
   ``List[Int32]``, ``List`` is the parent type and ``Int32`` is the
   child type.

chunked array
   A discontiguous, one-dimensional sequence of values with known
   length where all values have the same type. Consists of zero or
   more :term:`arrays <array>`, the "chunks".

   Chunked arrays are discontiguous in the sense that iterating
   the values of a chunked array may require iterating through
   different buffers for different indices.

   Not part of the columnar format; this term is specific to
   certain language implementations of Arrow (primarily C++ and
   its bindings).

   .. seealso:: :term:`record batch`, :term:`table`
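
To make "different buffers for different indices" concrete, the sketch below resolves a logical index into a chunk number and an offset within that chunk. Plain Python lists stand in for arrays here, and the names ``chunk_offsets`` and ``get`` are hypothetical; real implementations differ in detail.

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_offsets(chunks):
    """Starting logical index of every chunk, plus the total length at the end."""
    return [0] + list(accumulate(len(chunk) for chunk in chunks))


def get(chunks, i):
    """Return the value at logical index i of a list of chunks."""
    offsets = chunk_offsets(chunks)
    if not 0 <= i < offsets[-1]:
        raise IndexError(i)
    # The last chunk whose starting offset is <= i holds the value;
    # empty chunks are skipped because they share an offset with their successor.
    k = bisect_right(offsets, i) - 1
    return chunks[k][i - offsets[k]]


chunks = [[1, 2, 3], [], [4, 5]]  # three chunks, one of them empty
print([get(chunks, i) for i in range(5)])
```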

complex type
nested type
   A :term:`data type` whose structure depends on one or more other
   :term:`child data types <child type>`. For instance, ``List`` is a
   nested type that has one child.

   Two nested types are equal if and only if their child types are
   also equal.

data type
type
   A type that a value can take, such as ``Int8`` or ``List[Utf8]``.
   The type of an array determines how its values are laid out in
   memory according to :doc:`./Columnar`.

   .. seealso:: :term:`nested type`, :term:`primitive type`

dictionary
   An array of values that accompanies a
   :term:`dictionary-encoded <dictionary-encoding>` array.

dictionary-encoding
   An array that stores its values as indices into a
   :term:`dictionary` array instead of storing the values directly.

   .. seealso:: :ref:`dictionary-encoded-layout`
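
The relationship between the indices and the dictionary can be sketched in a few lines of plain Python. This only illustrates the idea and is not how any Arrow implementation encodes data; ``dictionary_encode`` and ``dictionary_decode`` are hypothetical names.

```python
def dictionary_encode(values):
    """Split values into (indices, dictionary), keeping first-seen order."""
    dictionary = []
    positions = {}  # value -> its index in the dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def dictionary_decode(indices, dictionary):
    """Recover the original values by looking each index up in the dictionary."""
    return [dictionary[i] for i in indices]


values = ["a", "b", "a", "c", "b"]
indices, dictionary = dictionary_encode(values)
print(indices, dictionary)
```

Each distinct value is stored once in the dictionary, so repeated values cost only one index each.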

extension type
storage type
   An extension type is a user-defined :term:`data type` that adds
   additional semantics to an existing data type. This allows
   implementations that do not support a particular extension type to
   still handle the underlying data type (the "storage type").

   For example, a UUID can be represented as a 16-byte fixed-size
   binary type.

   .. seealso:: :ref:`format_metadata_extension_types`
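
The UUID case can be sketched with Python's standard ``uuid`` module: the storage values are plain 16-byte strings that any consumer can carry around, and only a consumer that knows the extension semantics turns them back into UUIDs. The helpers ``to_storage`` and ``from_storage`` are hypothetical and do not correspond to an Arrow API.

```python
import uuid

STORAGE_WIDTH = 16  # bytes per value in the fixed-size binary storage type


def to_storage(values):
    """Convert UUIDs to their 16-byte storage representation."""
    return [value.bytes for value in values]


def from_storage(storage):
    """Reinterpret 16-byte storage values as UUIDs."""
    for raw in storage:
        if len(raw) != STORAGE_WIDTH:
            raise ValueError("expected 16-byte values")
    return [uuid.UUID(bytes=raw) for raw in storage]


ids = [uuid.UUID("12345678-1234-5678-1234-567812345678")]
storage = to_storage(ids)
print(len(storage[0]), from_storage(storage) == ids)
```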

field
   A column in a :term:`schema`. Consists of a field name, a
   :term:`data type`, a flag indicating whether the field is nullable
   or not, and optional key-value metadata.

IPC format
   A specification for how to serialize Arrow data, so it can be sent
   between processes/machines, or persisted on disk.

   .. seealso:: :term:`IPC file format`,
                :term:`IPC streaming format`

IPC file format
file format
random-access format
   An extension of the :term:`IPC streaming format` that can be used
   to serialize Arrow data to disk, then read it back with random
   access to individual record batches.

IPC message
message
   The IPC representation of a particular in-memory structure, like a
   :term:`record batch` or :term:`schema`. A message is always one of
   the members of ``MessageHeader`` in the `Flatbuffers protocol file
   <https://github.com/apache/arrow/blob/main/format/Message.fbs>`_.

IPC streaming format
streaming format
   A protocol for streaming Arrow data or for serializing data to a
   file, consisting of a stream of
   :term:`IPC messages <IPC message>`.

physical layout
   A specification for how to arrange values in memory.

   .. seealso:: :ref:`format_layout`

primitive type
   A data type that does not have any child types.

   .. seealso:: :term:`data type`

record batch
   **In the** :ref:`IPC format <format-ipc>`: the primitive unit of
   data. A record batch consists of an ordered list of
   :term:`buffers <buffer>` corresponding to a :term:`schema`.

   **In some implementations** (primarily C++ and its bindings): a
   *contiguous*, *two-dimensional* chunk of data.  A record batch
   consists of an ordered collection of :term:`arrays <array>` of
   the same length.

   Like arrays, record batches are contiguous in the sense that
   iterating the rows of a record batch will iterate through a
   single set of buffers.

schema
   A collection of :term:`fields <field>` with optional metadata that
   determines all the :term:`data types <data type>` of an object
   like a :term:`record batch` or :term:`table`.

slot
   A single logical value within an array, i.e. a "row".

table
   A discontiguous, two-dimensional chunk of data consisting of an
   ordered collection of :term:`chunked arrays <chunked array>`. All
   chunked arrays have the same length, but may have different types.
   Different columns may be chunked differently.

   Like chunked arrays, tables are discontiguous in the sense that
   iterating the rows of a table may require iterating through
   different buffers for different indices.

   Not part of the columnar format; this term is specific to
   certain language implementations of Arrow (for example C++ and
   its bindings, and Go).

   .. image:: ../cpp/tables-versus-record-batches.svg
      :alt: A graphical representation of an Arrow Table and a
            Record Batch, with structure as described in text above.

   .. seealso:: :term:`chunked array`, :term:`record batch`
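
The point that different columns may be chunked differently can be illustrated with a toy table whose columns are lists of plain Python lists. Reading one row resolves the index separately for each column. The names ``num_rows``, ``column_value`` and ``row`` are hypothetical, and this sketch does not mirror any Arrow implementation.

```python
from bisect import bisect_right
from itertools import accumulate


def column_value(chunks, i):
    """Return the value at logical index i of one chunked column."""
    offsets = [0] + list(accumulate(len(chunk) for chunk in chunks))
    k = bisect_right(offsets, i) - 1  # chunk that contains index i
    return chunks[k][i - offsets[k]]


def num_rows(table):
    """All columns must have the same total length, whatever their chunking."""
    lengths = {sum(len(chunk) for chunk in chunks) for chunks in table.values()}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    return lengths.pop() if lengths else 0


def row(table, i):
    """Assemble row i by resolving the index in every column independently."""
    if not 0 <= i < num_rows(table):
        raise IndexError(i)
    return {name: column_value(chunks, i) for name, chunks in table.items()}


# Same length (4 rows), different types, different chunk boundaries.
table = {
    "id": [[1, 2], [3, 4]],
    "name": [["a"], ["b", "c", "d"]],
}
print(num_rows(table), row(table, 2))
```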