Back to Arrow

Integrating PyArrow with Java

docs/source/python/integration/python_java.rst

latest20.8 KB
Original Source

.. Licensed to the Apache Software Foundation (ASF) under one .. or more contributor license agreements. See the NOTICE file .. distributed with this work for additional information .. regarding copyright ownership. The ASF licenses this file .. to you under the Apache License, Version 2.0 (the .. "License"); you may not use this file except in compliance .. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing, .. software distributed under the License is distributed on an .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY .. KIND, either express or implied. See the License for the .. specific language governing permissions and limitations .. under the License.

Integrating PyArrow with Java

Arrow supports exchanging data within the same process through the :ref:c-data-interface.

This can be used to exchange data between Python and Java functions and methods so that the two languages can interact without any cost of marshaling and unmarshaling data.

.. note::

The article takes for granted that you have a ``Python`` environment
with ``pyarrow`` correctly installed and a ``Java`` environment with
``arrow`` library correctly installed.
The ``Arrow Java`` version must have been compiled with ``mvn -Parrow-c-data`` to
ensure CData exchange support is enabled.
See `Python Install Instructions <https://arrow.apache.org/docs/python/install.html>`_
and `Java Documentation <https://arrow.apache.org/docs/java/>`_
for further details.

Invoking Java methods from Python

Suppose we have a simple Java class providing a number as its output:

.. code-block:: Java

public class Simple {
    public static int getNumber() {
        return 4;
    }
}

We would save such class in the Simple.java file and proceed with compiling it to Simple.class using javac Simple.java.

Once the Simple.class file is created we can use the class from Python using the JPype <https://jpype.readthedocs.io/>_ library which enables a Java runtime within the Python interpreter.

jpype1 can be installed using pip like most Python libraries

.. code-block:: console

$ pip install jpype1

The most basic thing we can do with our Simple class is to use the Simple.getNumber method from Python and see if it will return the result.

To do so, we can create a simple.py file which uses jpype to import the Simple class from Simple.class file and invoke the Simple.getNumber method:

.. code-block:: python

import jpype
from jpype.types import *

jpype.startJVM(classpath=["./"])

Simple = JClass('Simple')

print(Simple.getNumber())

Running the simple.py file will show how our Python code is able to access the Java method and print the expected result:

.. code-block:: console

$ python simple.py
4

Java to Python using pyarrow.jvm

PyArrow provides a pyarrow.jvm module that makes easier to interact with Java classes and convert the Java objects to actual Python objects.

To showcase pyarrow.jvm we could create a more complex class, named FillTen.java

.. code-block:: java

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;


public class FillTen {
    static RootAllocator allocator = new RootAllocator();

    public static BigIntVector createArray() {
        BigIntVector intVector = new BigIntVector("ints", allocator);
        intVector.allocateNew(10);
        intVector.setValueCount(10);
        FillTen.fillVector(intVector);
        return intVector;
    }

    private static void fillVector(BigIntVector iv) {
        iv.setSafe(0, 1);
        iv.setSafe(1, 2);
        iv.setSafe(2, 3);
        iv.setSafe(3, 4);
        iv.setSafe(4, 5);
        iv.setSafe(5, 6);
        iv.setSafe(6, 7);
        iv.setSafe(7, 8);
        iv.setSafe(8, 9);
        iv.setSafe(9, 10);
    }
}

This class provides a public createArray method that anyone can invoke to get back an array containing numbers from 1 to 10.

Given that this class now has a dependency on a bunch of packages, compiling it with javac is not enough anymore. We need to create a dedicated pom.xml file where we can collect the dependencies:

.. code-block:: xml

<project>
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.apache.arrow.py2java</groupId>
    <artifactId>FillTen</artifactId>
    <version>1</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-memory</artifactId>
        <version>8.0.0</version>
        <type>pom</type>
        </dependency>
        <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-memory-netty</artifactId>
        <version>8.0.0</version>
        <type>jar</type>
        </dependency>
        <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-vector</artifactId>
        <version>8.0.0</version>
        <type>pom</type>
        </dependency>
        <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-c-data</artifactId>
        <version>8.0.0</version>
        <type>jar</type>
        </dependency>
    </dependencies>
</project>

Once the FillTen.java file with the class is created as src/main/java/FillTen.java we can use maven to compile the project with mvn package and get it available in the target directory.

.. code-block:: console

$ mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------
[INFO] Building FillTen 1
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ FillTen ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 1 source file to /experiments/java2py/target/classes
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ FillTen ---
[INFO] Building jar: /experiments/java2py/target/FillTen-1.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

Now that we have the package built, we can make it available to Python. To do so, we need to make sure that not only the package itself is available, but that also its dependencies are.

We can use maven to collect all dependencies and make them available in a single place (the dependencies directory) so that we can more easily load them from Python:

.. code-block:: console

$ mvn org.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies -DoutputDirectory=dependencies
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------
[INFO] Building FillTen 1
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.7:copy-dependencies (default-cli) @ FillTen ---
[INFO] Copying jsr305-3.0.2.jar to /experiments/java2py/dependencies/jsr305-3.0.2.jar
[INFO] Copying netty-common-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-common-4.1.72.Final.jar
[INFO] Copying arrow-memory-core-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-core-8.0.0-SNAPSHOT.jar
[INFO] Copying arrow-vector-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.jar
[INFO] Copying arrow-c-data-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-c-data-8.0.0-SNAPSHOT.jar
[INFO] Copying arrow-vector-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.pom
[INFO] Copying jackson-core-2.11.4.jar to /experiments/java2py/dependencies/jackson-core-2.11.4.jar
[INFO] Copying jackson-annotations-2.11.4.jar to /experiments/java2py/dependencies/jackson-annotations-2.11.4.jar
[INFO] Copying slf4j-api-1.7.25.jar to /experiments/java2py/dependencies/slf4j-api-1.7.25.jar
[INFO] Copying arrow-memory-netty-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-netty-8.0.0-SNAPSHOT.jar
[INFO] Copying arrow-format-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-format-8.0.0-SNAPSHOT.jar
[INFO] Copying flatbuffers-java-1.12.0.jar to /experiments/java2py/dependencies/flatbuffers-java-1.12.0.jar
[INFO] Copying arrow-memory-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-memory-8.0.0-SNAPSHOT.pom
[INFO] Copying netty-buffer-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-buffer-4.1.72.Final.jar
[INFO] Copying jackson-databind-2.11.4.jar to /experiments/java2py/dependencies/jackson-databind-2.11.4.jar
[INFO] Copying commons-codec-1.10.jar to /experiments/java2py/dependencies/commons-codec-1.10.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

.. note::

Instead of manually collecting dependencies, you could also rely on the
``maven-assembly-plugin`` to build a single ``jar`` with all dependencies.

Once our package and all its dependencies are available, we can invoke it from fillten_pyarrowjvm.py script that will import the FillTen class and print out the result of invoking FillTen.createArray

.. code-block:: python

import jpype
import jpype.imports
from jpype.types import *

# Start a JVM making available all dependencies we collected
# and our class from target/FillTen-1.jar
jpype.startJVM(classpath=["./dependencies/*", "./target/*"])

FillTen = JClass('FillTen')

array = FillTen.createArray()
print("ARRAY", type(array), array)

# Convert the proxied BigIntVector to an actual pyarrow array
import pyarrow.jvm
pyarray = pyarrow.jvm.array(array)
print("ARRAY", type(pyarray), pyarray)
del pyarray

Running the python script will lead to two lines getting printed:

.. code-block::

ARRAY <java class 'org.apache.arrow.vector.BigIntVector'> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ARRAY <class 'pyarrow.lib.Int64Array'> [
    1,
    2,
    3,
    4,
    5,
    6,
    7,
    8,
    9,
    10
]

The first line is the raw result of invoking the FillTen.createArray method. The resulting object is a proxy to the actual Java object, so it's not really a pyarrow Array, it will lack most of its capabilities and methods. That's why we subsequently use pyarrow.jvm.array to convert it to an actual pyarrow array. That allows us to treat it like any other pyarrow array. The result is the second line in the output where the array is correctly reported as being of type pyarrow.lib.Int64Array and is printed using the pyarrow style.

.. note::

At the moment the ``pyarrow.jvm`` module is fairly limited in capabilities,
nested types like structs are not supported and it only works on a JVM running
within the same process like JPype.

Java to Python communication using the C Data Interface

The C Data Interface is a protocol implemented in Arrow to exchange data within different environments without the cost of marshaling and copying data.

This allows to expose data coming from Python or Java to functions that are implemented in the other language.

.. note::

In the future the ``pyarrow.jvm`` will be implemented to leverage the C Data
interface, at the moment is instead specifically written for JPype

To showcase how C Data works, we are going to tweak a bit both our FillTen Java class and our fillten.py Python script. Given a PyArrow array, we are going to expose a function in Java that sets its content to by the numbers from 1 to 10.

Using C Data interface in pyarrow at the moment requires installing cffi explicitly, like most Python distributions it can be installed with

.. code-block:: console

$ pip install cffi

The first thing we would have to do is to tweak the Python script so that it sends to Java the exported references to the Array and its Schema according to the C Data interface:

.. code-block:: python

import jpype
import jpype.imports
from jpype.types import *

# Init the JVM and make FillTen class available to Python.
jpype.startJVM(classpath=["./dependencies/*", "./target/*"])
FillTen = JClass('FillTen')

# Create a Python array of 10 elements
import pyarrow as pa
array = pa.array([0]*10)

from pyarrow.cffi import ffi as arrow_c

# Export the Python array through C Data
c_array = arrow_c.new("struct ArrowArray*")
c_array_ptr = int(arrow_c.cast("uintptr_t", c_array))
array._export_to_c(c_array_ptr)

# Export the Schema of the Array through C Data
c_schema = arrow_c.new("struct ArrowSchema*")
c_schema_ptr = int(arrow_c.cast("uintptr_t", c_schema))
array.type._export_to_c(c_schema_ptr)

# Send Array and its Schema to the Java function
# that will populate the array with numbers from 1 to 10
FillTen.fillCArray(c_array_ptr, c_schema_ptr)

# See how the content of our Python array was changed from Java
# while it remained of the Python type.
print("ARRAY", type(array), array)

.. note::

Changing content of arrays is not a safe operation, it was done
for the purpose of creating this example, and it mostly works only
because the array hasn't changed size, type or nulls.

In the FillTen Java class, we already have the fillVector method, but that method is private and even if we made it public it would only accept a BigIntVector object and not the C Data array and schema references.

So we have to expand our FillTen class adding a fillCArray method that is able to perform the work of fillVector but on the C Data exchanged entities instead of the BigIntVector one:

.. code-block:: java

import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.BigIntVector;


public class FillTen {
    static RootAllocator allocator = new RootAllocator();

    public static void fillCArray(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);

        FieldVector v = Data.importVector(allocator, arrow_array, arrow_schema, null);
        FillTen.fillVector((BigIntVector)v);
    }

    private static void fillVector(BigIntVector iv) {
        iv.setSafe(0, 1);
        iv.setSafe(1, 2);
        iv.setSafe(2, 3);
        iv.setSafe(3, 4);
        iv.setSafe(4, 5);
        iv.setSafe(5, 6);
        iv.setSafe(6, 7);
        iv.setSafe(7, 8);
        iv.setSafe(8, 9);
        iv.setSafe(9, 10);
    }
}

The goal of the fillCArray method is to get the Array and Schema received in C Data exchange format and turn them back to an object of type FieldVector so that Arrow Java knows how to deal with it.

If we run again mvn package, update the maven dependencies and then our Python script, we should be able to see how the values printed by the Python script have been properly changed by the Java code:

.. code-block:: console

$ mvn package
$ mvn org.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies -DoutputDirectory=dependencies
$ python fillten.py
ARRAY <class 'pyarrow.lib.Int64Array'> [
    1,
    2,
    3,
    4,
    5,
    6,
    7,
    8,
    9,
    10
]

We can also use the C Stream Interface to exchange :py:class:pyarrow.RecordBatchReader between Java and Python. We'll use this Java class as a demo, which lets you read an Arrow IPC file via Java's implementation, or write data to a JSON file:

.. code-block:: java

import java.io.File; import java.nio.file.Files; import java.nio.file.Paths;

import org.apache.arrow.c.ArrowArrayStream; import org.apache.arrow.c.Data; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.ipc.ArrowFileReader; import org.apache.arrow.vector.ipc.ArrowReader; import org.apache.arrow.vector.ipc.JsonFileWriter;

public class PythonInteropDemo implements AutoCloseable { private final BufferAllocator allocator;

 public PythonInteropDemo() {
   this.allocator = new RootAllocator();
 }

 public void exportStream(String path, long cStreamPointer) throws Exception {
   try (final ArrowArrayStream stream = ArrowArrayStream.wrap(cStreamPointer)) {
     ArrowFileReader reader = new ArrowFileReader(Files.newByteChannel(Paths.get(path)), allocator);
     Data.exportArrayStream(allocator, reader, stream);
   }
 }

 public void importStream(String path, long cStreamPointer) throws Exception {
   try (final ArrowArrayStream stream = ArrowArrayStream.wrap(cStreamPointer);
        final ArrowReader input = Data.importArrayStream(allocator, stream);
        JsonFileWriter writer = new JsonFileWriter(new File(path))) {
     writer.start(input.getVectorSchemaRoot().getSchema(), input);
     while (input.loadNextBatch()) {
       writer.write(input.getVectorSchemaRoot());
     }
   }
 }

 @Override
 public void close() throws Exception {
   allocator.close();
 }

}

On the Python side, we'll use JPype as before, except this time we'll send RecordBatchReaders back and forth:

.. code-block:: python

import tempfile

import jpype import jpype.imports from jpype.types import *

Init the JVM and make demo class available to Python.

jpype.startJVM(classpath=["./dependencies/", "./target/"]) PythonInteropDemo = JClass("PythonInteropDemo") demo = PythonInteropDemo()

Create a Python record batch reader

import pyarrow as pa schema = pa.schema([ ("ints", pa.int64()), ("strs", pa.string()) ]) batches = [ pa.record_batch([ [0, 2, 4, 8], ["a", "b", "c", None], ], schema=schema), pa.record_batch([ [None, 32, 64, None], ["e", None, None, "h"], ], schema=schema), ] reader = pa.RecordBatchReader.from_batches(schema, batches)

from pyarrow.cffi import ffi as arrow_c

Export the Python reader through C Data

c_stream = arrow_c.new("struct ArrowArrayStream*") c_stream_ptr = int(arrow_c.cast("uintptr_t", c_stream)) reader._export_to_c(c_stream_ptr)

Send reader to the Java function that writes a JSON file

with tempfile.NamedTemporaryFile() as temp: demo.importStream(temp.name, c_stream_ptr)

   # Read the JSON file back
   with open(temp.name) as source:
       print("JSON file written by Java:")
       print(source.read())

Write an Arrow IPC file for Java to read

with tempfile.NamedTemporaryFile() as temp: with pa.ipc.new_file(temp.name, schema) as sink: for batch in batches: sink.write_batch(batch)

   demo.exportStream(temp.name, c_stream_ptr)
   with pa.RecordBatchReader._import_from_c(c_stream_ptr) as source:
       print("IPC file read by Java:")
       print(source.read_all())

.. code-block:: console

$ mvn package $ mvn org.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies -DoutputDirectory=dependencies $ python demo.py JSON file written by Java: {"schema":{"fields":[{"name":"ints","nullable":true,"type":{"name":"int","bitWidth":64,"isSigned":true},"children":[]},{"name":"strs","nullable":true,"type":{"name":"utf8"},"children":[]}]},"batches":[{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[1,1,1,1],"DATA":["0","2","4","8"]},{"name":"strs","count":4,"VALIDITY":[1,1,1,0],"OFFSET":[0,1,2,3,3],"DATA":["a","b","c",""]}]},{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[0,1,1,0],"DATA":["0","32","64","0"]},{"name":"strs","count":4,"VALIDITY":[1,0,0,1],"OFFSET":[0,1,1,1,2],"DATA":["e","","","h"]}]}]} IPC file read by Java: pyarrow.Table ints: int64 strs: string

ints: [[0,2,4,8],[null,32,64,null]] strs: [["a","b","c",null],["e",null,null,"h"]]