docs/src/main/sphinx/connector/hive.md
The Hive connector allows querying data stored in an Apache Hive data warehouse. Hive is a combination of three components:

- Data files in varying formats, that are typically stored in the Hadoop
  Distributed File System (HDFS) or in object storage systems such as Amazon S3.
- Metadata about how the data files are mapped to schemas and tables. This
  metadata is stored in a database, such as MySQL, and is accessed via the Hive
  metastore service.
- A query language called HiveQL. This query language is executed on a
  distributed computing framework such as MapReduce or Tez.

Trino only uses the first two components: the data and the metadata. It does not use HiveQL or any part of Hive's execution environment.
The Hive connector requires a
{ref}`Hive metastore service <hive-thrift-metastore>` (HMS), or a compatible
implementation of the Hive metastore, such as
{ref}`AWS Glue <hive-glue-metastore>`.
You must select and configure a supported file system in your catalog configuration file.
The coordinator and all workers must have network access to the Hive metastore and the storage system. Hive metastore access with the Thrift protocol defaults to using port 9083.
Data files must be in a supported file format. File formats can be
configured using the `format` table property and other format-specific
configuration properties:

- {ref}`ORC <orc-format-configuration>`
- {ref}`Parquet <parquet-format-configuration>`

In the case of serializable formats, only specific SerDes are allowed:

- `ColumnarSerDe`
- `LazyBinaryColumnarSerDe`
- `org.apache.hadoop.io.Text`
- `org.apache.hadoop.io.BytesWritable` containing protocol buffer records
  using `com.twitter.elephantbird.hive.serde.ProtobufDeserializer`
- `org.apache.hadoop.hive.serde2.OpenCSVSerde`
- `org.apache.hive.hcatalog.data.JsonSerDe`
- `org.openx.data.jsonserde.JsonSerDe`. Find more details about the Trino
  implementation in the source repository.
- `com.esri.hadoop.hive.serde.EsriJsonSerDe`

(hive-configuration)=
To configure the Hive connector, create a catalog properties file
etc/catalog/example.properties that references the hive connector.
You must configure a metastore for metadata.
You must select and configure one of the supported file systems.
connector.name=hive
hive.metastore.uri=thrift://example.net:9083
fs.x.enabled=true
Replace the fs.x.enabled configuration property with the desired file system.
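For example, a catalog backed by S3 using the native S3 file system support could look like the following sketch. The metastore URI, region value, and the choice of S3 are assumptions for illustration only:

```text
connector.name=hive
hive.metastore.uri=thrift://example.net:9083
fs.native-s3.enabled=true
s3.region=us-east-2
```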
If you are using {ref}AWS Glue <hive-glue-metastore> as your metastore, you
must instead set hive.metastore to glue:
connector.name=hive
hive.metastore=glue
Each metastore type has specific configuration properties along with general
metastore configuration properties.
You can have as many catalogs as you need, so if you have additional
Hive clusters, simply add another properties file to etc/catalog
with a different name, making sure it ends in .properties. For
example, if you name the property file sales.properties, Trino
creates a catalog named sales using the configured connector.
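For example, a second catalog file named etc/catalog/sales.properties could contain the following minimal sketch, assuming a hypothetical metastore host sales-hms.example.net; as above, replace fs.x.enabled with the file system you use:

```text
connector.name=hive
hive.metastore.uri=thrift://sales-hms.example.net:9083
fs.x.enabled=true
```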
(hive-configuration-properties)=
The following table lists general configuration properties for the Hive connector. There are additional sets of configuration properties throughout the Hive connector documentation.
:::{list-table} Hive general configuration properties
:widths: 35, 50, 15
:header-rows: 1

* - Property name
  - Description
  - Default
* - `hive.recursive-directories`
  - Enable reading data from subdirectories of table or partition locations.
    If disabled, subdirectories are ignored. This is equivalent to the
    `hive.mapred.supports.subdirectories` property in Hive.
  - `false`
* - `hive.ignore-absent-partitions`
  - Ignore partitions when the file system location does not exist rather
    than failing the query.
  - `false`
* - `hive.storage-format`
  - The default file format used when creating new tables.
  - `ORC`
* - `hive.parquet.time-zone`
  - Adjusts timestamp values to a specific time zone. For Hive 3.1 and later,
    set this to UTC.
  -
* - `hive.compression-codec`
  - The compression codec to use when writing files. Possible values are
    `NONE`, `SNAPPY`, `LZ4`, `ZSTD`, or `GZIP`.
  - `GZIP`
* - `hive.force-local-scheduling`
  - Force splits to be scheduled on the same node as the storage node serving
    the split data.
  - `false`
* - `hive.respect-table-format`
  - Should new partitions be written using the existing table format or the
    default Trino format?
  - `true`
* - `hive.immutable-partitions`
  - Can new data be inserted into existing partitions? If `true`, then setting
    `hive.insert-existing-partitions-behavior` to `APPEND` is not allowed.
    This also affects the `insert_existing_partitions_behavior` session
    property in the same way.
  - `false`
* - `hive.insert-existing-partitions-behavior`
  - What happens when data is inserted into an existing partition? Possible
    values are `APPEND` - appends data to existing partitions, `OVERWRITE` -
    overwrites existing partitions, and `ERROR` - modifying existing
    partitions is not allowed. The equivalent catalog session property is
    `insert_existing_partitions_behavior`.
  - `APPEND`
* - `hive.target-max-file-size`
  - Best effort maximum size of new files.
  - `1GB`
* - `hive.create-empty-bucket-files`
  - Should empty files be created for buckets that have no data?
  - `false`
* - `hive.validate-bucketing`
  - Enables validation that data is in the correct bucket when reading
    bucketed tables.
  - `true`
* - `hive.partition-statistics-sample-size`
  - Specifies the number of partitions to analyze when computing table
    statistics.
  -
* - `hive.max-partitions-per-writers`
  - Maximum number of partitions per writer.
  -
* - `hive.max-partitions-for-eager-load`
  - The maximum number of partitions for a single table scan to load eagerly
    on the coordinator.
  -
* - `hive.max-partitions-per-scan`
  - Maximum number of partitions for a single table scan.
  -
* - `hive.non-managed-table-writes-enabled`
  - Enable writes to non-managed (external) Hive tables.
  - `false`
* - `hive.non-managed-table-creates-enabled`
  - Enable creating non-managed (external) Hive tables.
  - `true`
* - `hive.file-status-cache-tables`
  - Cache directory listing for specific tables. Examples:
    `fruit.apple,fruit.orange` to cache listings only for tables `apple` and
    `orange` in schema `fruit`, `fruit.*,vegetable.*` to cache listings for
    all tables in schemas `fruit` and `vegetable`, `*` to cache listings for
    all tables in all schemas.
  -
* - `hive.file-status-cache.excluded-tables`
  - While `hive.file-status-cache-tables` is an inclusion list, this is an
    exclusion list for the cache. Examples: `fruit.apple,fruit.orange` to not
    cache listings only for tables `apple` and `orange` in schema `fruit`,
    `fruit.*,vegetable.*` to not cache listings for all tables in schemas
    `fruit` and `vegetable`.
  -
* - `hive.file-status-cache.max-retained-size`
  - Maximum retained size of cached file status entries.
  - `1GB`
* - `hive.file-status-cache-expire-time`
  - How long a cached directory listing is considered valid.
  - `1m`
* - `hive.per-transaction-file-status-cache.max-retained-size`
  - Maximum retained size of file status entries cached per transaction.
  - `100MB`
* - `hive.rcfile.time-zone`
  - Adjusts binary encoded timestamp values to a specific time zone. For Hive
    3.1 and later, set this to UTC.
  -
* - `hive.timestamp-precision`
  - Specifies the precision used for Hive columns of type `TIMESTAMP`.
    Possible values are `MILLISECONDS`, `MICROSECONDS` and `NANOSECONDS`.
    Values with higher precision than configured are rounded. The equivalent
    catalog session property is `timestamp_precision` for session specific
    use.
  - `MILLISECONDS`
* - `hive.temporary-staging-directory-enabled`
  - Controls whether the temporary staging directory configured at
    `hive.temporary-staging-directory-path` is used for write operations.
    Temporary staging directory is never used for writes to non-sorted tables
    on S3, encrypted HDFS or external location. Writes to sorted tables will
    utilize this path for staging temporary files during sorting operation.
    When disabled, the target storage will be used for staging while writing
    sorted tables which can be inefficient when writing to object stores like
    S3.
  - `true`
* - `hive.temporary-staging-directory-path`
  - Controls the location of the temporary staging directory that is used for
    write operations. The `${USER}` placeholder can be used to use a
    different location for each user.
  - `/tmp/presto-${USER}`
* - `hive.hive-views.enabled`
  - Enable translation for Hive views.
  - `false`
* - `hive.hive-views.legacy-translation`
  - Use the legacy algorithm to translate Hive views. You can use the
    `hive_views_legacy_translation` catalog session property for temporary,
    catalog specific use.
  - `false`
* - `hive.parallel-partitioned-bucketed-writes`
  - Improve parallelism of partitioned and bucketed table writes. When
    disabled, the number of writing threads is limited to the number of
    buckets.
  - `true`
* - `hive.query-partition-filter-required`
  - Set to `true` to force a query to use a partition filter. You can use the
    `query_partition_filter_required` catalog session property for temporary,
    catalog specific use.
  - `false`
* - `hive.query-partition-filter-required-schemas`
  - Specify the list of schemas for which the partition filter requirement is
    enforced with `hive.query-partition-filter-required-schemas`, or the
    `query_partition_filter_required_schemas` session property. The list is
    taken into consideration only if the
    `hive.query-partition-filter-required` configuration property or the
    `query_partition_filter_required` session property is set to `true`.
  - `[]`
* - `hive.table-statistics-enabled`
  - Enables table statistics. The equivalent catalog session property is
    `statistics_enabled` for session specific use. Set to `false` to disable
    statistics. Disabling statistics means that the cost-based optimizer can
    not make smart decisions about the query plan.
  - `true`
* - `hive.partition-projection-enabled`
  - Enables Athena partition projection support.
  - `true`
* - `hive.s3-glacier-filter`
  - Filter S3 objects by storage class. Possible values are `READ_ALL` - read
    files from all S3 storage classes, `READ_NON_GLACIER` - read files from
    non S3 Glacier storage classes, and `READ_NON_GLACIER_AND_RESTORED` -
    read files from non S3 Glacier storage classes and restored objects from
    Glacier storage class.
  - `READ_ALL`
* - `hive.max-partition-drops-per-query`
  - Maximum number of partitions that can be dropped in a single query.
  -
* - `hive.metastore.partition-batch-size.max`
  - Maximum number of partitions processed in a single metastore batch
    request.
  -
* - `hive.single-statement-writes`
  - Enables auto-commit for all writes. This can be used to disallow
    multi-statement write transactions.
  - `false`
* - `hive.metadata.parallelism`
  - Number of threads used for retrieving metadata.
  - `8`
* - `hive.protobuf.descriptors.location`
  - Location of the binary protobuf descriptors used to read tables stored
    with the `com.twitter.elephantbird.hive.serde.ProtobufDeserializer`
    format.
  -
* - `hive.protobuf.descriptors.cache.max-size`
  - Maximum number of protobuf descriptors kept in the cache.
  - `64`
* - `hive.protobuf.descriptors.cache.refresh-interval`
  - How often the cached protobuf descriptors are refreshed.
  - `1d`
:::

(hive-file-system-configuration)=
The connector supports accessing the following file systems:

- Azure Storage
- Google Cloud Storage
- S3 and S3-compatible storage
- HDFS
- Local file system
You must enable and configure the specific file system access. Legacy support is not recommended and will be removed.
(hive-fte-support)=
The connector supports {doc}`/admin/fault-tolerant-execution` of query
processing. Read and write operations are both supported with any retry policy
on non-transactional tables.
Read operations are supported with any retry policy on transactional tables.
Write operations and CREATE TABLE ... AS operations are not supported with
any retry policy on transactional tables.
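As a minimal sketch, fault-tolerant execution is turned on cluster-wide in config.properties by choosing a retry policy; QUERY is shown here as one possible value, and the TASK policy has additional requirements described in the fault-tolerant execution documentation:

```text
retry-policy=QUERY
```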
(hive-security)=
The connector supports different means of authentication for the used file system and metastore.
In addition, the following security-related features are supported.
(hive-authorization)=
You can enable authorization checks by setting the hive.security property in
the catalog properties file. This property must be one of the following values:
:::{list-table} hive.security property values
:widths: 30, 60
:header-rows: 1
* - Property value
  - Description
* - `allow-all` (default value)
  - No authorization checks are enforced.
* - `read-only`
  - Operations that read data or metadata, such as `SELECT`, are permitted,
    but none of the operations that write data or metadata, such as `CREATE`,
    `INSERT` or `DELETE`, are allowed.
* - `sql-standard`
  - Users are permitted to perform the operations as long as they have the
    required privileges as per the SQL standard. In this mode, Trino enforces
    the authorization checks for queries based on the privileges defined in
    the Hive metastore.
:::

(hive-sql-standard-based-authorization)=
When sql-standard security is enabled, Trino enforces the same SQL
standard-based authorization as Hive does.
Since Trino's ROLE syntax support matches the SQL standard, and
Hive does not exactly follow the SQL standard, there are the following
limitations and differences:
- `CREATE ROLE role WITH ADMIN` is not supported.
- The `admin` role must be enabled to execute `CREATE ROLE`, `DROP ROLE` or
  `CREATE SCHEMA`.
- `GRANT role TO user GRANTED BY someone` is not supported.
- `REVOKE role FROM user GRANTED BY someone` is not supported.
- By default, all a user's roles, except `admin`, are enabled in a new user
  session.
- One particular role can be selected by executing `SET ROLE role`.
- `SET ROLE ALL` enables all of a user's roles except `admin`.
- The `admin` role must be enabled explicitly by executing `SET ROLE admin`.
- `GRANT privilege ON SCHEMA schema` is not supported. Schema ownership can
  be changed with `ALTER SCHEMA schema SET AUTHORIZATION user`.

(hive-parquet-encryption)=
The Hive connector supports reading Parquet files encrypted with Parquet Modular Encryption (PME). Decryption keys can be provided via environment variables. Writing encrypted Parquet files is not supported.
:::{list-table} Parquet encryption properties
:widths: 35, 50, 15
:header-rows: 1

* - Property name
  - Description
  - Default
* - `pme.environment-key-retriever.enabled`
  - Retrieve decryption keys from environment variables.
  - `false`
* - `pme.aad-prefix`
  - Additional authenticated data (AAD) prefix used when decrypting files.
  -
* - `pme.check-footer-integrity`
  - Verify the integrity of the footer of encrypted Parquet files.
  - `true`
:::

When `pme.environment-key-retriever.enabled` is set, provide keys with the
following environment variables:

- `pme.environment-key-retriever.footer-keys`
- `pme.environment-key-retriever.column-keys`

Each variable accepts either a single base64-encoded key, or a comma-separated
list of `id:key` pairs (base64-encoded keys) where `id` must match the key
metadata embedded in the Parquet file.
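The following sketch illustrates the two accepted value formats, using hypothetical key identifiers and base64-encoded 16-byte keys:

```text
# A single base64-encoded key:
AAECAwQFBgcICQoLDA0ODw==

# A comma-separated list of id:key pairs, where each id must match the key
# metadata embedded in the Parquet file:
footer_key:AAECAwQFBgcICQoLDA0ODw==,column_key:EBESExQVFhcYGRobHB0eHw==
```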
(hive-sql-support)=
The connector provides read access and write access to data and metadata in the configured object storage system and metadata stores:
- {ref}`Globally available statements <sql-globally-available>`; see also
  {ref}`Procedures <hive-procedures>`
- {ref}`Read operations <sql-read-operations>`
- {ref}`sql-write-operations`:
  - {ref}`sql-data-management`; see also
    {ref}`Hive-specific data management <hive-data-management>`
  - {ref}`sql-schema-table-management`; see also
    {ref}`Hive-specific schema and table management <hive-schema-and-table-management>`
  - {ref}`sql-view-management`; see also
    {ref}`Hive-specific view management <hive-sql-view-management>`
- {ref}`sql-security-operations`; see also
  {ref}`SQL standard-based authorization for object storage <hive-sql-standard-based-authorization>`
- {ref}`sql-transactions`
Refer to {doc}`the migration guide </appendix/from-hive>` for practical advice
on migrating from Hive to Trino.
The following sections provide Hive-specific information regarding SQL support.
(hive-examples)=
The examples shown here work on Google Cloud Storage by replacing s3:// with
gs://.
Create a new Hive table named page_views in the web schema
that is stored using the ORC file format, partitioned by date and
country, and bucketed by user into 50 buckets. Note that Hive
requires the partition columns to be the last columns in the table:
CREATE TABLE example.web.page_views (
view_time TIMESTAMP,
user_id BIGINT,
page_url VARCHAR,
ds DATE,
country VARCHAR
)
WITH (
format = 'ORC',
partitioned_by = ARRAY['ds', 'country'],
bucketed_by = ARRAY['user_id'],
bucket_count = 50
)
Create a new Hive schema named web that stores tables in an
S3 bucket named my-bucket:
CREATE SCHEMA example.web
WITH (location = 's3://my-bucket/')
Drop a schema:
DROP SCHEMA example.web
Drop a partition from the page_views table:
DELETE FROM example.web.page_views
WHERE ds = DATE '2016-08-09'
AND country = 'US'
Query the page_views table:
SELECT * FROM example.web.page_views
List the partitions of the page_views table:
SELECT * FROM example.web."page_views$partitions"
Create an external Hive table named request_logs that points at
existing data in S3:
CREATE TABLE example.web.request_logs (
request_time TIMESTAMP,
url VARCHAR,
ip VARCHAR,
user_agent VARCHAR
)
WITH (
format = 'TEXTFILE',
external_location = 's3://my-bucket/data/logs/'
)
Collect statistics for the request_logs table:
ANALYZE example.web.request_logs;
Drop the external table request_logs. This only drops the metadata
for the table. The referenced data directory is not deleted:
DROP TABLE example.web.request_logs
{doc}/sql/create-table-as can be used to create transactional tables in ORC format like this:
CREATE TABLE <name>
WITH (
format='ORC',
transactional=true
)
AS <query>
Add an empty partition to the page_views table:
CALL system.create_empty_partition(
schema_name => 'web',
table_name => 'page_views',
partition_columns => ARRAY['ds', 'country'],
partition_values => ARRAY['2016-08-09', 'US']);
Drop stats for a partition of the page_views table:
CALL system.drop_stats(
schema_name => 'web',
table_name => 'page_views',
partition_values => ARRAY[ARRAY['2016-08-09', 'US']]);
Trino can read tables created in Hive with Twitter Elephantbird.
The binary protobuf descriptor for the class named in
`serialization.class` must be stored in a directory that is configured via
`hive.protobuf.descriptors.location` on every worker.
...
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.example.proto.gen.Storage$User"
)
(hive-procedures)=
Use the {doc}`/sql/call` statement to perform data manipulation or
administrative tasks. Procedures must include a qualified catalog name. For
example, if your Hive catalog is called `web`:
CALL web.system.example_procedure()
The following procedures are available:
system.create_empty_partition(schema_name, table_name, partition_columns, partition_values)
Create an empty partition in the specified table.
system.sync_partition_metadata(schema_name, table_name, mode, case_sensitive)
Check and update partitions list in metastore. There are three modes available:
- `ADD`: add any partitions that exist on the file system, but not in the
  metastore.
- `DROP`: drop any partitions that exist in the metastore, but not on the
  file system.
- `FULL`: perform both `ADD` and `DROP`.

The `case_sensitive` argument is optional. The default value is `true` for
compatibility with Hive's `MSCK REPAIR TABLE` behavior, which expects the
partition column names in file system paths to use lowercase (e.g.
`col_x=SomeValue`). Partitions on the file system not conforming to this
convention are ignored, unless the argument is set to `false`.
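For example, the following call brings the metastore in sync with the file system layout of the page_views table from the earlier examples:

```sql
CALL example.system.sync_partition_metadata(
    schema_name => 'web',
    table_name => 'page_views',
    mode => 'FULL');
```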
system.drop_stats(schema_name, table_name, partition_values)
Drops statistics for a subset of partitions or the entire table. The partitions are specified as an
array whose elements are arrays of partition values (similar to the partition_values argument in
create_empty_partition). If partition_values argument is omitted, stats are dropped for the
entire table.
(register-partition)=
system.register_partition(schema_name, table_name, partition_columns, partition_values, location)
Registers existing location as a new partition in the metastore for the specified table.
When the location argument is omitted, the partition location is
constructed using partition_columns and partition_values.
For security reasons, the procedure is enabled only when
`hive.allow-register-partition-procedure` is set to `true`.
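For example, the following call registers a directory as a new partition of the page_views table; the location value is a hypothetical path, and the argument can be omitted when the directory follows the standard partition layout:

```sql
CALL example.system.register_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US'],
    location => 's3://my-bucket/data/page_views/ds=2016-08-09/country=US');
```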
(unregister-partition)=
system.unregister_partition(schema_name, table_name, partition_columns, partition_values)
Unregisters an existing partition in the metastore for the specified table.
The partition data is not deleted.
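A matching call removes the partition registration from the metastore while leaving the data in place:

```sql
CALL example.system.unregister_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['ds', 'country'],
    partition_values => ARRAY['2016-08-09', 'US']);
```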
(hive-flush-metadata-cache)=
system.flush_metadata_cache()
Flush all Hive metadata caches.
system.flush_metadata_cache(schema_name => ..., table_name => ...)
Flush Hive metadata cache entries connected with the selected table. The
procedure requires named parameters to be passed.
system.flush_metadata_cache(schema_name => ..., table_name => ..., partition_columns => ARRAY[...], partition_values => ARRAY[...])
Flush Hive metadata cache entries connected with the selected partition. The
procedure requires named parameters to be passed.
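For example, the following call flushes the cached metadata for a single table:

```sql
CALL example.system.flush_metadata_cache(
    schema_name => 'web',
    table_name => 'page_views');
```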
(hive-data-management)=
The {ref}sql-data-management functionality includes support for INSERT,
UPDATE, DELETE, and MERGE statements, with the exact support
depending on the storage system, file format, and metastore.
When connecting to a Hive metastore version 3.x, the Hive connector supports reading from and writing to insert-only and ACID tables, with full support for partitioning and bucketing.
{doc}/sql/delete applied to non-transactional tables is only supported if the
table is partitioned and the WHERE clause matches entire partitions.
Transactional Hive tables with ORC format support "row-by-row" deletion, in
which the WHERE clause may match arbitrary sets of rows.
{doc}/sql/update is only supported for transactional Hive tables with format
ORC. UPDATE of partition or bucket columns is not supported.
{doc}/sql/merge is only supported for ACID tables.
ACID tables created with Hive Streaming Ingest are not supported.
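As an illustration of row-level changes, the following statements assume a hypothetical transactional ORC table example.web.page_view_counts; they are a sketch only and require a transactional table as described above:

```sql
-- Row-level UPDATE and DELETE require a transactional ORC table.
UPDATE example.web.page_view_counts
SET views = views + 1
WHERE page_url = 'https://example.org/';

DELETE FROM example.web.page_view_counts
WHERE views = 0;
```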
(hive-schema-and-table-management)=
The Hive connector supports querying and manipulating Hive tables and schemas (databases). While some uncommon operations must be performed using Hive directly, most operations can be performed using Trino.
Hive table partitions can differ from the current table schema. This occurs when the data types of columns of a table are changed from the data types of columns of preexisting partitions. The Hive connector supports this schema evolution by allowing the same conversions as Hive. The following table lists possible data type conversions.
:::{list-table} Hive schema evolution type conversion
:widths: 25, 75
:header-rows: 1

* - Data type
  - Converted to
* - `BOOLEAN`
  - `VARCHAR`
* - `VARCHAR`
  - `BOOLEAN`, `TINYINT`, `SMALLINT`, `INTEGER`, `BIGINT`, `REAL`, `DOUBLE`,
    `TIMESTAMP`, `DATE`, `CHAR` as well as narrowing conversions for `VARCHAR`
* - `CHAR`
  - `VARCHAR`, narrowing conversions for `CHAR`
* - `TINYINT`
  - `VARCHAR`, `SMALLINT`, `INTEGER`, `BIGINT`, `DOUBLE`, `DECIMAL`
* - `SMALLINT`
  - `VARCHAR`, `INTEGER`, `BIGINT`, `DOUBLE`, `DECIMAL`
* - `INTEGER`
  - `VARCHAR`, `BIGINT`, `DOUBLE`, `DECIMAL`
* - `BIGINT`
  - `VARCHAR`, `DOUBLE`, `DECIMAL`
* - `REAL`
  - `DOUBLE`, `DECIMAL`
* - `DOUBLE`
  - `FLOAT`, `DECIMAL`
* - `DECIMAL`
  - `DOUBLE`, `REAL`, `VARCHAR`, `TINYINT`, `SMALLINT`, `INTEGER`, `BIGINT`,
    as well as narrowing and widening conversions for `DECIMAL`
* - `DATE`
  - `VARCHAR`
* - `TIMESTAMP`
  - `VARCHAR`, `DATE`
* - `VARBINARY`
  - `VARCHAR`
:::

Any conversion failure results in null, which is the same behavior
as Hive. For example, converting the string `'foo'` to a number,
or converting the string `'1234'` to a `TINYINT` (which has a
maximum value of `127`).
(hive-avro-schema)=
Trino supports querying and manipulating Hive tables with the Avro storage format, which has the schema set based on an Avro schema file or literal. Trino is also capable of creating tables by inferring the schema from a valid Avro schema file located locally, or remotely in HDFS or on a web server.
To specify that the Avro schema should be used for interpreting table data, use
the avro_schema_url table property.
The schema can be placed in the local file system or remotely in the following locations:
- HDFS (e.g. `avro_schema_url = 'hdfs://user/avro/schema/avro_data.avsc'`)
- S3 (e.g. `avro_schema_url = 's3n:///schema_bucket/schema/avro_data.avsc'`)
- A web server (e.g. `avro_schema_url = 'http://example.org/schema/avro_data.avsc'`)

The URL, where the schema is located, must be accessible from the Hive
metastore and Trino coordinator/worker nodes.
Alternatively, you can use the table property avro_schema_literal to define
the Avro schema.
The table created in Trino using the avro_schema_url or
avro_schema_literal property behaves the same way as a Hive table with
avro.schema.url or avro.schema.literal set.
Example:
CREATE TABLE example.avro.avro_data (
id BIGINT
)
WITH (
format = 'AVRO',
avro_schema_url = '/usr/local/avro_data.avsc'
)
The columns listed in the DDL (`id` in the above example) are ignored if `avro_schema_url` is specified.
The table schema matches the schema in the Avro schema file. Before any read operation, the Avro schema is
accessed so the query result reflects any changes in schema. Thus Trino takes advantage of Avro's backward compatibility abilities.
If the schema of the table changes in the Avro schema file, the new schema can still be used to read old data. Newly added/renamed fields must have a default value in the Avro schema file.
The schema evolution behavior is as follows:

- Column added in new schema: Data created with an older schema produces a
  default value when the table is using the new schema.
- Column removed in new schema: Data created with an older schema no longer
  outputs the data from the column that was removed.
- Column is renamed in the new schema: This is equivalent to removing the
  column and adding a new one, and data created with an older schema produces
  a default value when the table is using the new schema.
- Changing the type of a column in the new schema: If the type coercion is
  supported by Avro or the Hive connector, then the conversion happens. An
  error is thrown for incompatible types.
The following operations are not supported when avro_schema_url is set:
- `CREATE TABLE AS` is not supported.
- Bucketing (`bucketed_by`) columns are not supported in `CREATE TABLE`.
- `ALTER TABLE` commands modifying columns are not supported.

(hive-alter-table-execute)=
The connector supports the `optimize` command for use with
{ref}`ALTER TABLE EXECUTE <alter-table-execute>`. It rewrites the content of
the specified table so that it is merged into fewer but larger files. If the
table is partitioned, the data compaction acts separately on each partition
selected for optimization.

The `optimize` command is disabled by default, and can be enabled for a
catalog with the `<catalog_name>.non_transactional_optimize_enabled`
session property:
SET SESSION <catalog_name>.non_transactional_optimize_enabled=true
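With the session property enabled, an optimize run can then specify a file size threshold; the 128MB value is only an illustration:

```sql
ALTER TABLE example.web.page_views EXECUTE optimize(file_size_threshold => '128MB')
```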
:::{warning}
Because Hive tables are non-transactional, take note of the following possible
outcomes:

- If a failure occurs during the `optimize` operation, a manual cleanup of the
  table directory is needed. In this situation, refer to the Trino logs and
  query failure messages to see which files must be deleted.
:::

(hive-table-properties)=
Table properties supply or set metadata for the underlying tables. This
is key for {doc}/sql/create-table-as statements. Table properties are passed
to the connector using a {doc}WITH </sql/create-table-as> clause:
CREATE TABLE tablename
WITH (format='CSV',
csv_escape = '"')
:::{list-table} Hive connector table properties
:widths: 20, 60, 20
:header-rows: 1

* - Property name
  - Description
  - Default
* - `auto_purge`
  - Indicates to the configured metastore to perform a purge when a table or
    partition is deleted instead of a soft deletion using the trash.
  -
* - `bucket_count`
  - The number of buckets to group data into. Only valid if used with
    `bucketed_by`.
  -
* - `bucketed_by`
  - The bucketing column for the storage table. Only valid if used with
    `bucket_count`.
  - `[]`
* - `bucketing_version`
  - Specifies which Hive bucketing version to use. Valid values are `1` or
    `2`.
  -
* - `csv_escape`
  - The CSV escape character. Requires CSV format.
  -
* - `csv_quote`
  - The CSV quote character. Requires CSV format.
  -
* - `csv_separator`
  - The CSV separator character. Requires CSV format. You can use other
    separators such as `|` or use Unicode to configure invisible separators
    such as tabs with `U&'\0009'`.
  - `,`
* - `format`
  - The table file format. Valid values include `ORC`, `PARQUET`, `AVRO`,
    `RCBINARY`, `RCTEXT`, `SEQUENCEFILE`, `JSON`, `OPENX_JSON`, `TEXTFILE`,
    `CSV`, and `REGEX`. The catalog property `hive.storage-format` sets the
    default value and can change it to a different default.
  -
* - `null_format`
  - The serialization format for the `NULL` value. Requires TextFile, RCText,
    or SequenceFile format.
  -
* - `orc_bloom_filter_columns`
  - Comma separated list of columns to use for ORC bloom filters. It improves
    the performance of queries using `=`, `IN` and small range predicates,
    when reading ORC files. Requires ORC format.
  - `[]`
* - `orc_bloom_filter_fpp`
  - The ORC bloom filters false positive probability. Requires ORC format.
  -
* - `partitioned_by`
  - The partitioning columns for the table. The columns listed in the
    `partitioned_by` clause must be the last columns as defined in the DDL.
  - `[]`
* - `parquet_bloom_filter_columns`
  - Comma separated list of columns to use for Parquet bloom filters. It
    improves the performance of queries using `=`, `IN` and small range
    predicates, when reading Parquet files. Requires Parquet format.
  - `[]`
* - `skip_footer_line_count`
  - The number of footer lines to ignore when parsing the file for data.
  -
* - `skip_header_line_count`
  - The number of header lines to ignore when parsing the file for data.
  -
* - `sorted_by`
  - The column to sort by to determine bucketing for row. Only valid if
    `bucketed_by` and `bucket_count` are specified as well.
  - `[]`
* - `textfile_field_separator`
  - Allows the use of custom field separators for TextFile formatted tables.
  -
* - `textfile_field_separator_escape`
  - Allows the use of a custom escape character for TextFile formatted tables.
  -
* - `transactional`
  - Set this property to `true` to create an ORC ACID transactional table.
    Requires ORC format. This property may be shown as true for insert-only
    tables created using older versions of Hive.
  -
* - `partition_projection_enabled`
  - Enables partition projection for the selected table. Mapped from the AWS
    Athena table property `projection.enabled`.
  -
* - `partition_projection_ignore`
  - Ignore any partition projection properties stored in the metastore for
    the selected table. This is a Trino-only property.
  -
* - `partition_projection_location_template`
  - Projected partition location template, such as
    `s3a://test/name=${name}/`. Mapped from the AWS Athena table property
    `storage.location.template`.
  - `${table_location}/${partition_name}`
* - `extra_properties`
  - Additional properties added to a Hive table. The properties are not used
    by Trino, and are available in the `$properties` metadata table. The
    properties are not included in the output of `SHOW CREATE TABLE`
    statements.
  -
:::
(hive-special-tables)=
The raw Hive table properties are available as a hidden table, containing a separate column per table property, with a single row containing the property values.
The `$properties` table name is composed of the table name with `$properties`
appended. It exposes the parameters of the table in the metastore.
You can inspect the property names and values with a simple query:
SELECT * FROM example.web."page_views$properties";
stats_generated_via_stats_task | auto.purge | trino_query_id | trino_version | transactional
---------------------------------------------+------------+-----------------------------+---------------+---------------
workaround for potential lack of HIVE-12730 | false | 20230705_152456_00001_nfugi | 434 | false
The `$partitions` table provides a list of all partition values
of a partitioned table.
The following example query returns all partition values from the
page_views table in the web schema of the example catalog:
SELECT * FROM example.web."page_views$partitions";
day | country
------------+---------
2023-07-01 | POL
2023-07-02 | POL
2023-07-03 | POL
2023-03-01 | USA
2023-03-02 | USA
(hive-column-properties)=
:::{list-table} Hive connector column properties
:widths: 20, 60, 20
:header-rows: 1

* - Property name
  - Description
  - Default
* - `partition_projection_type`
  - Defines the type of partition projection to use on this column. May be
    used only on partition columns. Available types: `ENUM`, `INTEGER`,
    `DATE`, `INJECTED`. Mapped from the AWS Athena table property
    `projection.${columnName}.type`.
  -
* - `partition_projection_values`
  - Used with `partition_projection_type` set to `ENUM`. Contains a static
    list of values used to generate partitions. Mapped from the AWS Athena
    table property `projection.${columnName}.values`.
  -
* - `partition_projection_range`
  - Used with `partition_projection_type` set to `INTEGER` or `DATE` to
    define a range. It is a two-element array, describing the minimum and
    maximum range values used to generate partitions. Generation starts from
    the minimum, then increments by the defined
    `partition_projection_interval` to the maximum. For example, the format
    is `['1', '4']` for a `partition_projection_type` of `INTEGER` and
    `['2001-01-01', '2001-01-07']` or `['NOW-3DAYS', 'NOW']` for a
    `partition_projection_type` of `DATE`. Mapped from the AWS Athena table
    property `projection.${columnName}.range`.
  -
* - `partition_projection_interval`
  - Used with `partition_projection_type` set to `INTEGER` or `DATE`. It
    represents the interval used to generate partitions within the given
    range `partition_projection_range`. Mapped from the AWS Athena table
    property `projection.${columnName}.interval`.
  -
* - `partition_projection_digits`
  - Used with `partition_projection_type` set to `INTEGER`. The number of
    digits to be used with integer column projection. Mapped from the AWS
    Athena table property `projection.${columnName}.digits`.
  -
* - `partition_projection_format`
  - Used with `partition_projection_type` set to `DATE`. The date column
    projection format, defined as a string such as `yyyy MM` or
    `MM-dd-yy HH:mm:ss` for use with the Java `DateTimeFormatter` class.
    Mapped from the AWS Athena table property
    `projection.${columnName}.format`.
  -
* - `partition_projection_interval_unit`
  - Used with `partition_projection_type` set to `DATE`. The date column
    projection range interval unit given in `partition_projection_interval`.
    Mapped from the AWS Athena table property
    `projection.${columnName}.interval.unit`.
  -
:::
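As a sketch of how the table and column properties combine, the following hypothetical table enables partition projection for an integer partition column; the table name and range are illustrative only:

```sql
CREATE TABLE example.web.request_metrics (
  request_count BIGINT,
  hour_id INTEGER WITH (
    partition_projection_type = 'INTEGER',
    partition_projection_range = ARRAY['0', '23']
  )
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['hour_id'],
  partition_projection_enabled = true
)
```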
(hive-special-columns)=
In addition to the defined columns, the Hive connector automatically exposes metadata in a number of hidden columns in each table:
- `$bucket`: Bucket number for this row
- `$path`: Full file system path name of the file for this row
- `$file_modified_time`: Date and time of the last modification of the file
  for this row
- `$file_size`: Size of the file for this row
- `$partition`: Partition name for this row

You can use these columns in your SQL statements like any other column. They
can be selected directly, or used in conditional statements. For example, you
can inspect the file size, location and partition for each record:
SELECT *, "$path", "$file_size", "$partition"
FROM example.web.page_views;
Retrieve all records that belong to files stored in the partition
ds=2016-08-09/country=US:
SELECT *, "$path", "$file_size"
FROM example.web.page_views
WHERE "$partition" = 'ds=2016-08-09/country=US'
(hive-sql-view-management)=
Trino allows reading from Hive materialized views, and can be configured to support reading Hive views.
The Hive connector supports reading from Hive materialized views. In Trino, these views are presented as regular, read-only tables.
(hive-views)=
Hive views are defined in HiveQL and stored in the Hive Metastore Service. They are analyzed to allow read access to the data.
The Hive connector includes support for reading Hive views with three different modes.
If using Hive views from Trino is required, you must compare results in Hive and Trino for each view definition to ensure identical results. Use the experimental mode whenever possible. Avoid using the legacy mode. Leave Hive views support disabled, if you are not accessing any Hive views from Trino.
You can configure the behavior in your catalog properties file.
By default, Hive views are executed with the RUN AS DEFINER security mode.
Set the hive.hive-views.run-as-invoker catalog configuration property to
true to use RUN AS INVOKER semantics.
Disabled
The default behavior is to ignore Hive views. This means that your business logic and data encoded in the views is not available in Trino.
Legacy
A very simple implementation to execute Hive views, and therefore allow read
access to the data in Trino, can be enabled with
hive.hive-views.enabled=true and
hive.hive-views.legacy-translation=true.
For temporary usage of the legacy behavior for a specific catalog, you can set
the hive_views_legacy_translation {doc}catalog session property </sql/set-session> to true.
This legacy behavior interprets any HiveQL query that defines a view as if it is written in SQL. It does not do any translation, but instead relies on the fact that HiveQL is very similar to SQL.
This works for very simple Hive views, but can lead to problems for more complex queries. For example, if a HiveQL function has an identical signature but different behaviors to the SQL version, the returned results may differ. In more extreme cases the queries might fail, or not even be able to be parsed and executed.
Experimental
The new behavior is better engineered and has the potential to become a lot more powerful than the legacy implementation. It can analyze, process, and rewrite Hive views and contained expressions and statements.
It supports the following Hive view functionality:
- `UNION [DISTINCT]` and `UNION ALL` against Hive views
- `GROUP BY` clauses
- `current_user()`
- `LATERAL VIEW OUTER EXPLODE`
- `LATERAL VIEW [OUTER] EXPLODE` on array of struct
- `LATERAL VIEW json_tuple`

You can enable the experimental behavior with
`hive.hive-views.enabled=true`. Remove the
`hive.hive-views.legacy-translation` property or set it to `false` to make
sure legacy is not enabled.
Keep in mind that numerous features are not yet implemented when experimenting with this feature. The following is an incomplete list of missing functionality:
- `current_date`, `current_timestamp`, and others
- Hive function calls, including `translate()`, window functions, and others

The connector includes a number of performance improvements, detailed in the
following sections.
(hive-table-statistics)=
The Hive connector supports collecting and managing {doc}table statistics </optimizer/statistics> to improve query processing performance.
When writing data, the Hive connector always collects basic statistics
(numFiles, numRows, rawDataSize, totalSize)
and by default will also collect column level statistics:
:::{list-table} Available table statistics
:widths: 35, 65
:header-rows: 1

* - Column type
  - Collectible statistics
* - `TINYINT`
  - Number of nulls, number of distinct values, min/max values
* - `SMALLINT`
  - Number of nulls, number of distinct values, min/max values
* - `INTEGER`
  - Number of nulls, number of distinct values, min/max values
* - `BIGINT`
  - Number of nulls, number of distinct values, min/max values
* - `DOUBLE`
  - Number of nulls, number of distinct values, min/max values
* - `REAL`
  - Number of nulls, number of distinct values, min/max values
* - `DECIMAL`
  - Number of nulls, number of distinct values, min/max values
* - `DATE`
  - Number of nulls, number of distinct values, min/max values
* - `TIMESTAMP`
  - Number of nulls, number of distinct values, min/max values
* - `VARCHAR`
  - Number of nulls, number of distinct values
* - `CHAR`
  - Number of nulls, number of distinct values
* - `VARBINARY`
  - Number of nulls
* - `BOOLEAN`
  - Number of nulls, number of true/false values
:::

(hive-analyze)=
If your queries are complex and include joining large data sets,
running {doc}/sql/analyze on tables/partitions may improve query performance
by collecting statistical information about the data.
When analyzing a partitioned table, the partitions to analyze can be specified
via the optional partitions property, which is an array containing
the values of the partition keys in the order they are declared in the table schema:
ANALYZE table_name WITH (
partitions = ARRAY[
ARRAY['p1_value1', 'p1_value2'],
ARRAY['p2_value1', 'p2_value2']])
This query will collect statistics for two partitions with keys
p1_value1, p1_value2 and p2_value1, p2_value2.
On wide tables, collecting statistics for all columns can be expensive and can have a
detrimental effect on query planning. It is also typically unnecessary - statistics are
only useful on specific columns, like join keys, predicates, grouping keys. One can
specify a subset of columns to be analyzed via the optional columns property:
ANALYZE table_name WITH (
partitions = ARRAY[ARRAY['p2_value1', 'p2_value2']],
columns = ARRAY['col_1', 'col_2'])
This query collects statistics for columns col_1 and col_2 for the partition
with keys p2_value1, p2_value2.
Note that if statistics were previously collected for all columns, they must be dropped before re-analyzing just a subset:
CALL system.drop_stats('schema_name', 'table_name')
You can also drop statistics for selected partitions only:
CALL system.drop_stats(
schema_name => 'schema',
table_name => 'table',
partition_values => ARRAY[ARRAY['p2_value1', 'p2_value2']])
(hive-dynamic-filtering)=
The Hive connector supports the {doc}dynamic filtering </admin/dynamic-filtering> optimization.
Dynamic partition pruning is supported for partitioned tables stored in any file format
for broadcast as well as partitioned joins.
Dynamic bucket pruning is supported for bucketed tables stored in any file format for
broadcast joins only.
For tables stored in ORC or Parquet file format, dynamic filters are also pushed into local table scan on worker nodes for broadcast joins. Dynamic filter predicates pushed into the ORC and Parquet readers are used to perform stripe or row-group pruning and save on disk I/O. Sorting the data within ORC or Parquet files by the columns used in join criteria significantly improves the effectiveness of stripe or row-group pruning. This is because grouping similar data within the same stripe or row-group greatly improves the selectivity of the min/max indexes maintained at stripe or row-group level.
It can often be beneficial to wait for the collection of dynamic filters before starting a table scan. This extra wait time can potentially result in significant overall savings in query and CPU time, if dynamic filtering is able to reduce the amount of scanned data.
For the Hive connector, a table scan can be delayed for a configured amount of
time until the collection of dynamic filters by using the configuration property
hive.dynamic-filtering.wait-timeout in the catalog file or the catalog
session property <hive-catalog>.dynamic_filtering_wait_timeout.
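For example, a temporary override for a catalog named example could look like the following; the timeout value is only an illustration:

```sql
SET SESSION example.dynamic_filtering_wait_timeout = '10s';
```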
(hive-table-redirection)=
The connector supports redirection from Hive tables to Iceberg, Delta Lake, and Hudi tables with the following catalog configuration properties:
- `hive.iceberg-catalog-name`: Name of the catalog, configured with the
  Iceberg connector, to use for reading Iceberg tables.
- `hive.delta-lake-catalog-name`: Name of the catalog, configured with the
  Delta Lake connector, to use for reading Delta Lake tables.
- `hive.hudi-catalog-name`: Name of the catalog, configured with the Hudi
  connector, to use for reading Hudi tables.

The connector supports configuring and using file system caching.
(hive-performance-tuning-configuration)=
The following table describes performance tuning properties for the Hive connector.
:::{warning} Performance tuning configuration properties are considered expert-level features. Altering these properties from their default values is likely to cause instability and performance degradation. :::
:::{list-table}
:widths: 30, 50, 20
:header-rows: 1

* - Property name
  - Description
  - Default
* - `hive.max-outstanding-splits`
  - The target size of the buffer of splits that are waiting to be scheduled
    per table scan in a query, before the scheduler tries to pause.
  - `1000`
* - `hive.max-outstanding-splits-size`
  - The maximum size allowed for buffered splits for each table scan in a
    query, before the query fails.
  - `256 MB`
* - `hive.max-splits-per-second`
  - The maximum number of splits generated per second per table scan. This
    can be used to reduce the load on the storage system. By default, there
    is no limit.
  -
* - `hive.max-initial-splits`
  - For each table scan, the coordinator first assigns file sections of up to
    `max-initial-split-size`. After `max-initial-splits` have been assigned,
    `max-split-size` is used for the remaining splits.
  - `200`
* - `hive.max-initial-split-size`
  - The size of a single file section assigned to a worker until
    `max-initial-splits` have been assigned. Smaller splits results in more
    parallelism, which gives a boost to smaller queries.
  - `32 MB`
* - `hive.max-split-size`
  - The largest size of a single file section assigned to a worker. Smaller
    splits result in more parallelism, and thus can decrease latency, but
    also have more overhead and increase load on the system.
  - `64 MB`
:::

The following Hive 3-related limitations apply:

- For security reasons, the `sys` system catalog is not accessible.
- Hive's `timestamp with local zone` data type is mapped to
  `timestamp with time zone` with UTC timezone. It only supports reading
  values - writing to tables with columns of this type is not supported.
- Trino does not correctly read `TIMESTAMP` values from Parquet, RCBinary, or
  Avro file formats created by Hive 3.1 or later. When reading from these
  file formats, Trino returns different results than Hive.