docs/storage-openstack-swift.md
Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the same URI formats as in Hadoop. You can specify a path in Swift as input through a URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your Swift security credentials, through <code>core-site.xml</code> or via <code>SparkContext.hadoopConfiguration</code>. The current Swift driver requires Swift to use the Keystone authentication method, or its Rackspace-specific predecessor.
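For instance, once the credentials are configured (see below), reading from Swift looks like reading from any other Hadoop-compatible file system. A minimal sketch, assuming a hypothetical container <code>logs</code> under <code>PROVIDER=SparkTest</code> and an existing <code>data.txt</code> object:

{% highlight scala %}
// Read an object from Swift through the Hadoop file system layer.
// "logs", "SparkTest" and "data.txt" are hypothetical names; the Keystone
// credentials for SparkTest must already be configured as described below.
val rdd = sc.textFile("swift://logs.SparkTest/data.txt")
println(rdd.count())
{% endhighlight %}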
Although not mandatory, it is recommended to configure the proxy server of Swift with <code>list_endpoints</code> for better data locality. More information on the <code>list_endpoints</code> middleware is available in the OpenStack Swift documentation.
The Spark application should include the <code>hadoop-openstack</code> dependency, which can
be done by including the <code>hadoop-cloud</code> module for the specific version of Spark used.
For example, for Maven support, add the following to the <code>pom.xml</code> file:
{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>hadoop-cloud_2.13</artifactId>
    <version>${spark.version}</version>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}
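For sbt builds, a roughly equivalent sketch (the version string is a placeholder; use the Spark version you build against):

{% highlight scala %}
// build.sbt sketch: pull in the hadoop-cloud module matching your Spark version.
// "3.4.0" is a placeholder version, not a recommendation.
libraryDependencies += "org.apache.spark" %% "hadoop-cloud" % "3.4.0"
{% endhighlight %}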
Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory. The main parameters that need to be configured are the authentication parameters required by Keystone.
The following table lists the Keystone configuration parameters. <code>PROVIDER</code> can be any (alphanumeric) name.
<table>
<thead><tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr></thead>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
  <td>Keystone Authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.username</code></td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.password</code></td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.region</code></td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.public</code></td>
  <td>Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints</td>
  <td>Mandatory</td>
</tr>
</table>

For example, assume <code>PROVIDER=SparkTest</code> and Keystone contains user <code>tester</code> with password <code>testing</code> defined for tenant <code>test</code>. Then <code>core-site.xml</code> should include:
{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}
Notice that <code>fs.swift.service.PROVIDER.tenant</code>, <code>fs.swift.service.PROVIDER.username</code>, and <code>fs.swift.service.PROVIDER.password</code> contain sensitive information, so keeping them in <code>core-site.xml</code> is not always a good approach. We suggest keeping these parameters in <code>core-site.xml</code> only for testing purposes when running Spark via <code>spark-shell</code>. For job submissions they should be provided via <code>SparkContext.hadoopConfiguration</code>, as sketched below.
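A minimal sketch of the programmatic approach, reusing the placeholder tenant, username, and password from the <code>core-site.xml</code> example above:

{% highlight scala %}
// Set the sensitive Keystone parameters on the Hadoop configuration at runtime
// instead of storing them in core-site.xml. The values below are the same
// test placeholders used in the example above.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.swift.service.SparkTest.tenant", "test")
hadoopConf.set("fs.swift.service.SparkTest.username", "tester")
hadoopConf.set("fs.swift.service.SparkTest.password", "testing")
{% endhighlight %}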