# BigQuery connector
The BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.
The Storage API streams data in parallel directly from BigQuery via gRPC without using Google Cloud Storage as an intermediary. It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
: It does not leave any temporary files in Google Cloud Storage. Rows are read
  directly from BigQuery servers using an Avro wire format.

Column Filtering
: The new API allows column filtering to only read the data you are interested
  in. Backed by a columnar datastore, it can efficiently stream data without
  reading all columns.

Dynamic Sharding
: The API rebalances records between readers until they all complete. This
  means that all Map phases will finish nearly concurrently. See this blog
  article on how dynamic sharding is similarly used in Google Cloud Dataflow.
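For example, once a BigQuery catalog is configured (see the following
sections), you can join BigQuery data with data from another catalog. The
following sketch assumes an `example` BigQuery catalog and a `hive` catalog
that both contain a `web` schema; the table and column names are hypothetical:

```sql
-- Join BigQuery data with a Hive table (hypothetical tables and columns)
SELECT v.page_id, m.page_title
FROM example.web.page_views v
JOIN hive.web.page_metadata m ON v.page_id = m.page_id;
```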
(bigquery-requirements)=

## Requirements

To connect to BigQuery, you need:

- To enable the BigQuery Storage Read API.
- Network access from your Trino coordinator and workers to the Google Cloud
  API service endpoint. This endpoint uses HTTPS, or port 443.
- To configure BigQuery so that the Trino coordinator and workers have
  permissions in BigQuery.
- To set up authentication. Your authentication options differ depending on
  whether you are using Dataproc/Google Compute Engine (GCE) or not.
  - On Dataproc/GCE the authentication is done from the machine's role.
  - Outside Dataproc/GCE you have 3 options:
    - Use a service account JSON key and `GOOGLE_APPLICATION_CREDENTIALS` as
      described in the Google Cloud authentication getting started guide.
    - Set `bigquery.credentials-key` in the catalog properties file. It should
      contain the contents of the JSON file, encoded using base64.
    - Set `bigquery.credentials-file` in the catalog properties file. It should
      point to the location of the JSON file.

## Configuration

To configure the BigQuery connector, create a catalog properties file in
`etc/catalog` named `example.properties`, to mount the BigQuery connector as
the `example` catalog. Create the file with the following contents, replacing
the connection properties as appropriate for your setup:
```text
connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
```
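If you authenticate with a service account key file, one of the options
described in the requirements section, the catalog file can reference it
directly. A minimal sketch, where the key file path is hypothetical:

```text
connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
# Hypothetical path; point this at your service account JSON key
bigquery.credentials-file=/etc/trino/bigquery-credentials.json
```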
The BigQuery connector can only access a single GCP project. If you have data
in multiple GCP projects, you must create several catalogs, each pointing to
a different GCP project. For example, if you have two GCP projects, one for
sales and one for analytics, you can create two properties files in
`etc/catalog` named `sales.properties` and `analytics.properties`, both
having `connector.name=bigquery` but with different `project-id`. This
creates the two catalogs, `sales` and `analytics`, respectively.
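For example, the two catalog files could look as follows, with the project
IDs as placeholders for your own values:

```text
# etc/catalog/sales.properties
connector.name=bigquery
bigquery.project-id=example-sales-project

# etc/catalog/analytics.properties
connector.name=bigquery
bigquery.project-id=example-analytics-project
```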
(bigquery-project-id-resolution)=

### Project ID resolution

The BigQuery connector determines the project ID to use based on the
configuration settings. This behavior provides users with flexibility in
selecting both the project to query and the project to bill for BigQuery
operations. The following table explains how project IDs are resolved in
different scenarios:
:::{list-table} Billing and data project ID resolution
:widths: 30, 33, 33
:header-rows: 1

* - Provided configuration properties
  - Project used for billing
  - Project used for querying data
* - `bigquery.credentials-key`
  - The project ID from the credentials key is used for billing.
  - The project ID from the credentials key is used for querying data.
* - `bigquery.credentials-key` and `bigquery.project-id`
  - The project ID from the credentials key is used for billing.
  - `bigquery.project-id` is used for querying data.
* - `bigquery.credentials-key` and `bigquery.parent-project-id`
  - `bigquery.parent-project-id` is used for billing.
  - The project ID from the credentials key is used for querying data.
* - `bigquery.credentials-key`, `bigquery.parent-project-id`,
    and `bigquery.project-id`
  - `bigquery.parent-project-id` is used for billing.
  - `bigquery.project-id` is used for querying data.
:::
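For example, to bill Storage API usage to a project other than the one
holding the data, combine `bigquery.parent-project-id` with
`bigquery.project-id`; the project IDs below are placeholders:

```text
connector.name=bigquery
bigquery.credentials-key=<base64 encoded service account key>
# Placeholder project IDs: bill one project, query another
bigquery.parent-project-id=example-billing-project
bigquery.project-id=example-data-project
```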
(bigquery-arrow-serialization-support)=

### Arrow serialization support

This feature introduces support for using Apache Arrow as the serialization
format when reading from BigQuery. Add the following required, additional JVM
arguments to the {ref}`jvm-config`:
```text
--add-opens=java.base/java.nio=ALL-UNNAMED
--sun-misc-unsafe-memory-access=allow
```
(bigquery-reading-from-views)=

### Reading from views

The connector has preliminary support for reading from BigQuery views. Note
there are a few caveats:

- Reading from views is disabled by default. To enable it, set the
  `bigquery.views-enabled` configuration property to `true`.
- By default, the materialized views are created in the same project and
  dataset. They can be configured with the optional
  `bigquery.view-materialization-project` and
  `bigquery.view-materialization-dataset` properties, respectively. The
  service account must have write permission to the project and the dataset in
  order to materialize the view.

### Configuration properties

The connector supports the following configuration properties:

:::{list-table} BigQuery configuration properties
:widths: 30, 55, 15
:header-rows: 1

* - Property
  - Description
  - Default
* - `bigquery.views-enabled`
  - Enables the connector to read from views and not only tables.
  - `false`
* - `bigquery.view-expire-duration`
  - Expire duration for the materialized view.
  - `24h`
* - `bigquery.view-materialization-project`
  - The project where the materialized view is going to be created.
  - The view's project
* - `bigquery.view-materialization-dataset`
  - The dataset where the materialized view is going to be created.
  - The view's dataset
* - `bigquery.skip-view-materialization`
  - Use REST API to access views instead of the Storage API. BigQuery
    `BIGNUMERIC` and `TIMESTAMP` types are unsupported.
  - `false`
* - `bigquery.view-materialization-with-filter`
  - Use filter conditions when materializing views.
  - `false`
* - `bigquery.views-cache-ttl`
  - Duration for which the materialization of a view is cached and reused.
    Set to `0ms` to disable the cache.
  - `15m`
* - `bigquery.metadata.cache-ttl`
  - Duration for which metadata retrieved from BigQuery is cached and reused.
    Set to `0ms` to disable the cache.
  - `0ms`
* - `bigquery.metadata.parallelism`
  - Number of threads used for retrieving metadata. Must be between
    `1` and `32`.
  - `32`
* - `bigquery.metadata-page-size`
  - The maximum number of results returned per page when listing metadata.
  - `1000`
* - `bigquery.max-read-rows-retries`
  - The number of retries in case of retryable server issues.
  - `3`
* - `bigquery.credentials-key`
  - The base64 encoded credentials key.
  -
* - `bigquery.credentials-file`
  - The path to the JSON credentials file.
  -
* - `bigquery.case-insensitive-name-matching`
  - Match dataset and table names case insensitively.
  - `false`
* - `bigquery.case-insensitive-name-matching.cache-ttl`
  - Duration for which case insensitive name matching results are cached.
    Set to `0ms` to disable the cache.
  - `0ms`
* - `bigquery.query-results-cache.enabled`
  - Enable usage of the BigQuery query results cache.
  - `false`
* - `bigquery.job.label-name`
  - The name of the label to add to the BigQuery jobs created by the
    connector.
  -
* - `bigquery.job.label-format`
  - The format of the value for the label set with
    `bigquery.job.label-name`. May consist of letters, digits, underscores,
    hyphens, commas, spaces, equal signs, and predefined values `$QUERY_ID`,
    `$SOURCE`, `$USER`, and `$TRACE_TOKEN`.
  -
* - `bigquery.arrow-serialization.enabled`
  - Enable using Apache Arrow serialization when reading data from BigQuery.
  - `true`
* - `bigquery.arrow-serialization.max-allocation`
  - Maximum memory allocation used by the Arrow serialization.
  - `100MB`
* - `bigquery.projection-pushdown-enabled`
  - Enable projection pushdown for the `ROW` type.
  - `true`
* - `bigquery.max-parallelism`
  - The maximum number of read streams requested from the BigQuery Storage
    API.
  -
* - `bigquery.channel-pool.initial-size`
  - The initial size of the gRPC channel pool.
  - `1`
* - `bigquery.channel-pool.min-size`
  - The minimum size of the gRPC channel pool.
  - `1`
* - `bigquery.channel-pool.max-size`
  - The maximum size of the gRPC channel pool.
  - `1`
* - `bigquery.channel-pool.min-rpc-per-channel`
  - Threshold of concurrent RPC calls per channel below which the channel
    pool can shrink.
  - `0`
* - `bigquery.channel-pool.max-rpc-per-channel`
  - Threshold of concurrent RPC calls per channel above which the channel
    pool can grow.
  - `2147483647`
* - `bigquery.rpc-retries`
  - The maximum number of retry attempts to perform for an RPC call. If set
    to `0`, the value from `bigquery.rpc-timeout` is used.
    Retry is deactivated when both `bigquery.rpc-retries` and
    `bigquery.rpc-timeout` are `0`.
    If this value is positive, and the number of attempts exceeds the
    `bigquery.rpc-retries` limit, retries stop even if
    the total retry time is still lower than `bigquery.rpc-timeout`.
  - `0`
* - `bigquery.rpc-timeout`
  - The total duration allowed for an RPC call, including retries. If set to
    `0s`, then `bigquery.rpc-retries` is used to determine retries.
    Retry is deactivated when `bigquery.rpc-retries` and
    `bigquery.rpc-timeout` are both `0`.
    If this value is positive, and the retry duration has reached the timeout
    value, retries stop even if the number of attempts is lower than
    the `bigquery.rpc-retries` value.
  - `0s`
* - `bigquery.rpc-retry-delay`
  - The delay before the first retry of an RPC call.
  - `0s`
* - `bigquery.rpc-retry-delay-multiplier`
  - The previous retry delay is multiplied by
    `bigquery.rpc-retry-delay-multiplier` to calculate the retry delay
    for the next RPC call.
  - `1.0`
* - `bigquery.rpc-proxy.enabled`
  - Use a proxy for communication with BigQuery.
  - `false`
* - `bigquery.rpc-proxy.uri`
  - The URI of the proxy server.
  -
* - `bigquery.rpc-proxy.username`
  - The username used to authenticate against the proxy server.
  -
* - `bigquery.rpc-proxy.password`
  - The password used to authenticate against the proxy server.
  -
* - `bigquery.rpc-proxy.keystore-path`
  - The path to the keystore containing the client certificate presented to
    the proxy server.
  -
* - `bigquery.rpc-proxy.keystore-password`
  - The password of the keystore specified by
    `bigquery.rpc-proxy.keystore-path`.
  -
* - `bigquery.rpc-proxy.truststore-path`
  - The path to the truststore containing the certificates of the proxy
    server.
  -
* - `bigquery.rpc-proxy.truststore-password`
  - The password of the truststore specified by
    `bigquery.rpc-proxy.truststore-path`.
  -
:::
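As an illustration of the properties above, the following catalog file sketch
enables reading from views and adjusts the view caching; the values are
examples only, tune them for your workload:

```text
connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.views-enabled=true
# Example values: cache materializations for 30 minutes,
# and let the materialized views expire after 2 hours
bigquery.views-cache-ttl=30m
bigquery.view-expire-duration=2h
```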
(bigquery-fte-support)=

### Fault-tolerant execution support

The connector supports {doc}`/admin/fault-tolerant-execution` of query
processing. Read and write operations are both supported with any retry policy.
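Note that the retry policy is a cluster-wide setting in Trino's
`config.properties`, not a catalog property. A minimal sketch:

```text
# Enable query-level retries; TASK is also supported
retry-policy=QUERY
```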
(bigquery-type-mapping)=

## Type mapping

Because Trino and BigQuery each support types that the other does not, this
connector {ref}`modifies some types <type-mapping-overview>` when reading or
writing data. Data types may not map the same way in both directions between
Trino and the data source. Refer to the following sections for type mapping in
each direction.
### BigQuery type to Trino type mapping

The connector maps BigQuery types to the corresponding Trino types according
to the following table:
:::{list-table} BigQuery type to Trino type mapping
:widths: 30, 30, 50
:header-rows: 1

* - BigQuery type
  - Trino type
  - Notes
* - `BOOLEAN`
  - `BOOLEAN`
  -
* - `INT64`
  - `BIGINT`
  - `INT`, `SMALLINT`, `INTEGER`, `BIGINT`, `TINYINT`, and `BYTEINT` are
    aliases for `INT64` in BigQuery.
* - `FLOAT64`
  - `DOUBLE`
  -
* - `NUMERIC`
  - `DECIMAL(P,S)`
  - The default precision and scale of `NUMERIC` is `(38, 9)`.
* - `BIGNUMERIC`
  - `DECIMAL(P,S)`
  - The default precision and scale of `BIGNUMERIC` is `(77, 38)`.
* - `DATE`
  - `DATE`
  -
* - `DATETIME`
  - `TIMESTAMP(6)`
  -
* - `STRING`
  - `VARCHAR`
  -
* - `BYTES`
  - `VARBINARY`
  -
* - `TIME`
  - `TIME(6)`
  -
* - `TIMESTAMP`
  - `TIMESTAMP(6) WITH TIME ZONE`
  - Time zone is UTC.
* - `GEOGRAPHY`
  - `VARCHAR`
  - In well-known text (WKT) format.
* - `JSON`
  - `JSON`
  -
* - `ARRAY`
  - `ARRAY`
  -
* - `RECORD`
  - `ROW`
  -
:::
No other types are supported.
### Trino type to BigQuery type mapping

The connector maps Trino types to the corresponding BigQuery types according
to the following table:
:::{list-table} Trino type to BigQuery type mapping
:widths: 30, 30, 50
:header-rows: 1

* - Trino type
  - BigQuery type
  - Notes
* - `BOOLEAN`
  - `BOOLEAN`
  -
* - `VARBINARY`
  - `BYTES`
  -
* - `DATE`
  - `DATE`
  -
* - `DOUBLE`
  - `FLOAT`
  -
* - `BIGINT`
  - `INT64`
  - `INT`, `SMALLINT`, `INTEGER`, `BIGINT`, `TINYINT`, and `BYTEINT` are
    aliases for `INT64` in BigQuery.
* - `DECIMAL(P,S)`
  - `NUMERIC`
  - The default precision and scale of `NUMERIC` is `(38, 9)`.
* - `VARCHAR`
  - `STRING`
  -
* - `TIMESTAMP(6)`
  - `DATETIME`
  -
:::
No other types are supported.
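For example, creating a table through the connector maps the Trino types of
its columns to the BigQuery types shown in the table above; the table and
column names below are hypothetical:

```sql
CREATE TABLE example.web.page_view_summary (
  view_count BIGINT,      -- stored as INT64 in BigQuery
  page_url VARCHAR,       -- stored as STRING
  created TIMESTAMP(6)    -- stored as DATETIME
);
```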
## System tables

For each Trino table which maps to a BigQuery view there exists a system
table which exposes the BigQuery view definition. Given a BigQuery view
`example_view` you can send the query
`SELECT * FROM "example_view$view_definition"` to see the SQL which defines
the view in BigQuery.
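For example, assuming a view named `example_view` in the `web` dataset of the
`example` catalog:

```sql
SELECT *
FROM example.web."example_view$view_definition";
```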
(bigquery-special-columns)=

## Special columns

In addition to the defined columns, the BigQuery connector exposes partition
information in a number of hidden columns:

- `$partition_date`: Equivalent to the `_PARTITIONDATE` pseudo-column in
  BigQuery
- `$partition_time`: Equivalent to the `_PARTITIONTIME` pseudo-column in
  BigQuery

You can use these columns in your SQL statements like any other column. They
can be selected directly, or used in conditional statements. For example, you
can inspect the partition date and time for each record:
SELECT *, "$partition_date", "$partition_time"
FROM example.web.page_views;
Retrieve all records stored in the partition `_PARTITIONDATE = '2022-04-07'`:
```sql
SELECT *
FROM example.web.page_views
WHERE "$partition_date" = date '2022-04-07';
```
:::{note}
Two special partitions `__NULL__` and `__UNPARTITIONED__` are not supported.
:::
(bigquery-sql-support)=

## SQL support

The connector provides read and write access to data and metadata in the
BigQuery database. In addition to the
{ref}`globally available <sql-globally-available>` and
{ref}`read operation <sql-read-operations>` statements, the connector supports
the following features:
- {doc}`/sql/insert`
- {doc}`/sql/delete`
- {doc}`/sql/truncate`
- {doc}`/sql/create-table`
- {doc}`/sql/create-table-as`
- {doc}`/sql/drop-table`
- {doc}`/sql/create-schema`
- {doc}`/sql/drop-schema`
- {doc}`/sql/comment`

### Wildcard table

The connector provides support to query multiple tables using a concise
wildcard table notation.
```sql
SELECT *
FROM example.web."page_views_*";
```
### Table functions

The connector provides specific {doc}`table functions </functions/table>` to
access BigQuery.
(bigquery-query-function)=

#### `query(varchar) -> table`

The `query` function allows you to query the underlying BigQuery directly. It
requires syntax native to BigQuery, because the full query is pushed down and
processed by BigQuery. This can be useful for accessing native features which are
not available in Trino or for improving query performance in situations where
running a query natively may be faster.
For example, query the `example` catalog and group and concatenate all
employee IDs by manager ID:
```sql
SELECT
  *
FROM
  TABLE(
    example.system.query(
      query => 'SELECT
        manager_id, STRING_AGG(employee_id)
      FROM
        company.employees
      GROUP BY
        manager_id'
    )
  );
```
## Performance

The connector includes a number of performance improvements, detailed in the
following sections.
(bigquery-pushdown)=

### Pushdown

The connector supports pushdown for a number of operations:

- {ref}`limit-pushdown`
## FAQ

### What is the pricing for the Storage API?

See the BigQuery pricing documentation.