content/influxdb3/clustered/process-data/tools/pyarrow.md
Use PyArrow to read and analyze query results from {{% product-name %}}. The PyArrow library provides efficient computation, aggregation, serialization, and conversion of Arrow format data.
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process and move data fast.
The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow. {{% caption %}}PyArrow documentation{{% /caption %}}
The examples in this guide assume that you use a Python virtual environment and the InfluxDB 3 influxdb3-python client library.
For more information, see how to get started using Python to query InfluxDB.
Installing influxdb3-python also installs the pyarrow library that provides Python bindings for Apache Arrow.
The following example shows how to use influxdb3-python and pyarrow to query InfluxDB and view Arrow data as a PyArrow Table.
In your editor, copy and paste the following sample code into a new file, for example, pyarrow-example.py:
{{% tabs-wrapper %}} {{% code-placeholders "DATABASE_NAME|DATABASE_TOKEN" %}}
# pyarrow-example.py

from influxdb_client_3 import InfluxDBClient3

def querySQL():
    # Instantiate an InfluxDB client configured for a database
    client = InfluxDBClient3(
        "https://{{< influxdb/host >}}",
        database="DATABASE_NAME",
        token="DATABASE_TOKEN")

    # Execute the query to retrieve all record batches in the stream
    # formatted as a PyArrow Table.
    table = client.query(
        '''SELECT *
            FROM home
            WHERE time >= now() - INTERVAL '90 days'
            ORDER BY time'''
    )

    client.close()

    # Return the table so that the caller can print it
    return table

print(querySQL())
{{% /code-placeholders %}} {{% /tabs-wrapper %}}
Replace the following configuration values:

- {{% code-placeholder-key %}}DATABASE_TOKEN{{% /code-placeholder-key %}}: a database token with read permissions on the databases you want to query
- {{% code-placeholder-key %}}DATABASE_NAME{{% /code-placeholder-key %}}: the name of the database to query

In your terminal, use the Python interpreter to run the file:
python pyarrow-example.py
The InfluxDBClient3.query() method sends the query request, and then returns a pyarrow.Table that contains all the Arrow record batches from the response stream.
Next, use PyArrow to analyze data.
With a pyarrow.Table, you can use values in a column as keys for grouping.
The following example shows how to query InfluxDB, and then use PyArrow to group the table data and calculate an aggregate value for each group:
{{% code-placeholders "DATABASE_NAME|DATABASE_TOKEN" %}}
# pyarrow-example.py

from influxdb_client_3 import InfluxDBClient3

def querySQL():
    # Instantiate an InfluxDB client configured for a database
    client = InfluxDBClient3(
        "https://{{< influxdb/host >}}",
        database="DATABASE_NAME",
        token="DATABASE_TOKEN")

    # Execute the query to retrieve data
    # formatted as a PyArrow Table
    table = client.query(
        '''SELECT *
            FROM home
            WHERE time >= now() - INTERVAL '90 days'
            ORDER BY time'''
    )

    client.close()

    return table

table = querySQL()

# Use PyArrow to aggregate data
print(table.group_by('room').aggregate([('temp', 'mean')]))
{{% /code-placeholders %}}
Replace the following:

- {{% code-placeholder-key %}}DATABASE_TOKEN{{% /code-placeholder-key %}}: a database token with read permissions on the databases you want to query
- {{% code-placeholder-key %}}DATABASE_NAME{{% /code-placeholder-key %}}: the name of the database to query

{{< expand-wrapper >}} {{% expand "View example results" %}}
pyarrow.Table
temp_mean: double
room: string
----
temp_mean: [[22.581987577639747,22.10807453416151]]
room: [["Kitchen","Living Room"]]
{{% /expand %}} {{< /expand-wrapper >}}
For more detail and examples, see the PyArrow documentation and the Apache Arrow Python Cookbook.