Use the PyArrow library to analyze data - InfluxDB

Use PyArrow to read and analyze query results from {{% product-name %}}. The PyArrow library provides efficient computation, aggregation, serialization, and conversion of Arrow format data.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process and move data fast.

The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow. {{% caption %}}PyArrow documentation{{% /caption %}}

- Install prerequisites
- Use PyArrow to read query results
- Use PyArrow to analyze data
  - Group and aggregate data

Install prerequisites

The examples in this guide assume that you use a Python virtual environment and the InfluxDB 3 influxdb3-python client library. For more information, see how to get started using Python to query InfluxDB.

Installing influxdb3-python also installs the pyarrow library that provides Python bindings for Apache Arrow.
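
To confirm that both packages are available in the active virtual environment, a quick check like the following may help. It is a minimal sketch: `is_installed` is a hypothetical helper written for this guide, not part of either library, and it only tests whether Python can locate each module.

```python
from importlib import util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return util.find_spec(module_name) is not None


# influxdb_client_3 is the import name used by influxdb3-python
for name in ("influxdb_client_3", "pyarrow"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either module is reported as missing, activate the intended virtual environment and install influxdb3-python there.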

Use PyArrow to read query results

The following example shows how to use influxdb3-python and pyarrow to query InfluxDB and view Arrow data as a PyArrow Table.

In your editor, copy and paste the following sample code into a new file (for example, pyarrow-example.py):

{{% tabs-wrapper %}} {{% code-placeholders "DATABASE_NAME|DATABASE_TOKEN" %}}

# pyarrow-example.py

from influxdb_client_3 import InfluxDBClient3

def querySQL():
  
  # Instantiate an InfluxDB client configured for a database
  client = InfluxDBClient3(
    "https://{{< influxdb/host >}}",
    database="DATABASE_NAME",
    token="DATABASE_TOKEN")

  # Execute the query to retrieve all record batches in the stream formatted as a PyArrow Table.
  table = client.query(
    '''SELECT *
      FROM home
      WHERE time >= now() - INTERVAL '90 days'
      ORDER BY time'''
  )

  client.close()

  # Return the table so that the caller can print or analyze it
  return table

print(querySQL())

Replace the following configuration values:
- {{% code-placeholder-key %}}DATABASE_TOKEN{{% /code-placeholder-key %}}: a database token with read permissions on the databases you want to query
- {{% code-placeholder-key %}}DATABASE_NAME{{% /code-placeholder-key %}}: the name of the database to query

In your terminal, use the Python interpreter to run the file:

```sh
python pyarrow-example.py
```

The InfluxDBClient3.query() method sends the query request, and then returns a pyarrow.Table that contains all the Arrow record batches from the response stream.

Next, use PyArrow to analyze data.

Use PyArrow to analyze data

Group and aggregate data

With a pyarrow.Table, you can use values in a column as keys for grouping.

The following example shows how to query InfluxDB, and then use PyArrow to group the table data and calculate an aggregate value for each group:

# pyarrow-example.py

from influxdb_client_3 import InfluxDBClient3

def querySQL():
  
  # Instantiate an InfluxDB client configured for a database
  client = InfluxDBClient3(
    "https://{{< influxdb/host >}}",
    database="DATABASE_NAME",
    token="DATABASE_TOKEN")

  # Execute the query to retrieve data 
  # formatted as a PyArrow Table
  table = client.query(
    '''SELECT *
      FROM home
      WHERE time >= now() - INTERVAL '90 days'
      ORDER BY time'''
  )

  client.close()

  return table

table = querySQL()

# Use PyArrow to aggregate data
print(table.group_by('room').aggregate([('temp', 'mean')]))

Replace the following:

- {{% code-placeholder-key %}}DATABASE_TOKEN{{% /code-placeholder-key %}}: a database token with read permissions on the databases you want to query
- {{% code-placeholder-key %}}DATABASE_NAME{{% /code-placeholder-key %}}: the name of the database to query

The output is similar to the following:

pyarrow.Table
temp_mean: double
room: string
----
temp_mean: [[22.581987577639747,22.10807453416151]]
room: [["Kitchen","Living Room"]]

For more detail and examples, see the PyArrow documentation and the Apache Arrow Python Cookbook.