presto-docs/src/main/sphinx/cache/service.rst
A common optimization to improve Presto query latency is to cache the working set to avoid unnecessary I/O from remote data sources or through a slow network. This section describes how to leverage Alluxio as a caching layer for Presto.
Alluxio File System <https://docs.alluxio.io/os/user/stable/en/core-services/Caching.html?utm_source=prestodb&utm_medium=prestodocs>_ serves Presto Hive Connector as an independent distributed caching file system on top of HDFS or object stores like AWS S3, GCP, Azure blob store.
Users can understand the cache usage and control cache explicitly through a file system interface.
For example, one can preload all files in an Alluxio directory to warm the cache for Presto queries,
and set the TTL (time-to-live) for cached data to reclaim cache capacity.
Presto Hive connector can connect to
AlluxioFileSystem <https://docs.alluxio.io/os/user/stable/en/core-services/Caching.html?utm_source=prestodb&utm_medium=prestodocs>_ as a Hadoop-compatible file system,
on top of other persistent storage systems.
Setup ^^^^^
First, configure ${PRESTO_HOME}/etc/catalog/hive.properties to use the Hive connector.
.. code-block:: none
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
Second, ensure the Alluxio client jar is already in
${PRESTO_HOME}/plugin/hive-hadoop2/ on all Presto servers.
If this is not the case,
download Alluxio binary <https://alluxio.io/download>_, extract the tarball to
${ALLUXIO_HOME} and copy Alluxio client jar
${ALLUXIO_HOME}/client/alluxio-<VERSION>-client.jar into this directory. Restart Presto service:
.. code-block:: none
$ ${PRESTO_HOME}/bin/launcher restart
Third, configure Hive Metastore connects to Alluxio File System when serving Presto.
Edit ${HIVE_HOME}/conf/hive-env.sh to include Alluxio client jar on the Hive classpath:
.. code-block:: none
export HIVE_AUX_JARS_PATH=${ALLUXIO_HOME}/client/alluxio-<VERSION>-client.jar
Then restart Hive Metastore
.. code-block:: none
$ ${HIVE_HOME}/hcatalog/sbin/hcat_server.sh start
Query ^^^^^
After completing the basic configuration, Presto should be able to access Alluxio File System
with tables pointing to alluxio:// address.
Refer to the :doc:/connector/hive documentation
to learn how to configure Alluxio file system in Presto. Here is a simple example:
.. code-block:: none
$ cd ${ALLUXIO_HOME}
$ bin/alluxio-start.sh local -f
$ bin/alluxio fs mount --readonly /example \
s3://apc999/presto-tutorial/example-reason/
Start a Prest CLI connecting to the server started in the previous step.
Download :github_download:cli, rename it to presto,
make it executable with chmod +x, then run it:
.. code-block:: none
$ ./presto --server localhost:8080 --catalog hive --debug
presto> use default;
USE
Create a new table based on the file mounted in Alluxio:
.. code-block:: none
presto:default> DROP TABLE IF EXISTS reason;
DROP TABLE
presto:default> CREATE TABLE reason (
r_reason_sk integer,
r_reason_id varchar,
r_reason_desc varchar
) WITH (
external_location = 'alluxio://localhost:19998/example',
format = 'PARQUET'
);
CREATE TABLE
Scan the newly created table on Alluxio:
.. code-block:: none
presto:default> SELECT * FROM reason LIMIT 3;
r_reason_sk | r_reason_id | r_reason_desc
-------------+------------------+---------------------------------------------
1 | AAAAAAAABAAAAAAA | Package was damaged
4 | AAAAAAAAEAAAAAAA | Not the product that was ordred
5 | AAAAAAAAFAAAAAAA | Parts missing
Basic Operations ^^^^^^^^^^^^^^^^
With Alluxio file system this approach supports the following features:
alluxio fs distributedLoad <https://docs.alluxio.io/os/user/stable/en/operation/User-CLI.html#distributedload>_,
in addition to caching data transparently based on the data access pattern.alluxio fs setTtl <https://docs.alluxio.io/os/user/stable/en/operation/User-CLI.html#setttl>_.alluxio fs ls <https://docs.alluxio.io/os/user/stable/en/operation/User-CLI.html#ls>,
or browse the corresponding files on
Alluxio WebUI <https://docs.alluxio.io/os/user/stable/en/operation/Web-Interface.html>.alluxio fsadmin report <https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#report>_ and plan the resource accordingly.