hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/reading.md
<!--- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. See accompanying LICENSE file. -->
One of the most important, and most performance-sensitive, parts of the S3A connector is reading data from storage. This is always evolving, based on experience and benchmarking, and in collaboration with other projects.
## Stream Types

The S3A connector provides multiple implementations of its input stream, all subclasses of `ObjectInputStream`: `classic`, `analytics` and `prefetch`. These are called *stream types*.

## Configuration Options
| Property | Permitted Values | Default | Meaning |
|----------|------------------|---------|---------|
| `fs.s3a.input.stream.type` | `default`, `classic`, `analytics`, `prefetch`, `custom` | `classic` | Name of stream type to use |
### Stream type: `default`

The default implementation for this release of Hadoop.
```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>default</value>
</property>
```
The choice of which stream type to use by default may change in future releases.
It is currently `classic`.
### Stream type: `classic`

This is the classic S3A input stream, present since the original addition of the S3A connector to the Hadoop codebase.
```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>classic</value>
</property>
```
**Strengths**

**Weaknesses**
### Stream type: `analytics`

An input stream aware of, and adapted to, the columnar storage formats used in production, currently with specific support for Apache Parquet.
```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>analytics</value>
</property>
```
**Strengths**

**Weaknesses**
It delivers a tangible speedup when reading Parquet files from within AWS infrastructure; however, it will take time for this stream to encounter, and address, all the failure conditions which the classic stream has already had to handle.
This library is where all future feature development is focused, including benchmark-based tuning for other file formats.
### Stream type: `prefetch`

This input stream prefetches data in multi-MB blocks and caches these in the local disk's buffer directory.
```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>prefetch</value>
</property>
```
**Strengths**

**Weaknesses**
## Vectored IO

All streams support VectorIO to some degree.
| Stream | Support |
|--------|---------|
| `classic` | Parallel issuing of GET requests with range coalescing |
| `prefetch` | Sequential reads, using prefetched blocks as appropriate |
| `analytics` | Sequential reads, using prefetched blocks where possible |
Because the analytics stream performs Parquet-aware RowGroup prefetching, its prefetched blocks should align with the Parquet read sequences issued through vectored reads, as well as with unvectored reads.
This does not hold for ORC. When reading ORC files with a version of the ORC library which is configured to use the vectored IO API, it is likely to be significantly faster to use the classic stream and its parallel reads.
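The range coalescing mentioned for the classic stream can be sketched as follows. This is an illustrative model, not the S3A implementation: the class names are invented, and the fixed 64-byte gap threshold stands in for the connector's configurable tuning. The idea is that nearby byte ranges are merged into one GET, trading a little wasted transfer for fewer HTTP requests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of range coalescing for vectored reads:
// sorted byte ranges whose gaps are small enough are merged,
// so several application reads become one larger GET request.
public class RangeCoalescer {
    // A [start, end) byte range within an object.
    record Range(long start, long end) {}

    // Merge ranges whose inter-range gap is at most maxGap bytes.
    // Input must be sorted by start offset.
    static List<Range> coalesce(List<Range> sorted, long maxGap) {
        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()
                    && r.start() - merged.get(merged.size() - 1).end() <= maxGap) {
                // Close enough to the previous range: extend it.
                Range last = merged.remove(merged.size() - 1);
                merged.add(new Range(last.start(), Math.max(last.end(), r.end())));
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Range> requests = List.of(
                new Range(0, 100), new Range(120, 200), new Range(10_000, 10_500));
        // The first two ranges merge (gap of 20 <= 64); the third stays separate.
        System.out.println(coalesce(requests, 64));
    }
}
```

Each merged range then becomes a single GET, issued in parallel with the others.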
### `fs.s3a.experimental.input.fadvise`

The S3A filesystem client supports the notion of input policies, similar to that of the POSIX `fadvise()` API call. This tunes the behavior of the S3A client to optimize HTTP GET requests for the different use cases.
See *Improving data input performance through fadvise* for the details.
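For example, a workload dominated by random IO, such as seek-heavy reads of columnar files, can declare the `random` policy; `random` is one of the documented values of this option, alongside `sequential` and `normal`:

```xml
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```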
## Stream IOStatistics

Some of the streams support detailed IOStatistics, which are aggregated into the filesystem IOStatistics when the stream is `close()`d, or possibly after `unbuffer()`.
The filesystem aggregation can be displayed when the instance is closed, which happens during process termination, if not earlier:
```xml
<property>
  <name>fs.thread.level.iostatistics.enabled</name>
  <value>true</value>
</property>
```
`StreamCapabilities.hasCapability()` can be used to probe for the active stream type and its capabilities.
## Unbuffering

The `unbuffer()` operation requires the stream to release all client-side resources: buffers, connections to remote servers, cached files etc.
This is used in some query engines, including Apache Impala, to keep streams open for rapid reuse, avoiding the overhead of re-opening files.
Only the classic stream supports `CanUnbuffer.unbuffer()`; the other streams must be closed rather than kept open for an extended period of time.
All input streams MUST be closed via a `close()` call once no longer needed; this is the only way to guarantee a timely release of HTTP connections and local resources.
Some applications and libraries neglect to close the stream, leaking these resources until garbage collection or process exit.
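In Java, try-with-resources is the simplest way to guarantee a timely `close()` on every code path. A minimal sketch: a local `ByteArrayInputStream` stands in here for an S3A `FSDataInputStream`, since the pattern is identical for a stream returned by `FileSystem.open()`:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class CloseExample {
    // Read the first byte of a stream, with close() guaranteed to run
    // whether the read succeeds or throws.
    static int readFirstByte(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return in.read();
        } catch (IOException e) {
            // Local streams will not actually throw; a real S3A stream might.
            throw new UncheckedIOException(e);
        }
        // in.close() has run by the time either exit path completes.
    }

    public static void main(String[] args) {
        System.out.println(readFirstByte(new byte[]{42, 7}));  // prints 42
    }
}
```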
## Custom Stream Types

There is a special stream type, `custom`.
This is primarily used internally for testing, however it may also be used by anyone who wishes to experiment with alternative input stream implementations.
If it is requested, then the name of the factory for streams must be set in the property `fs.s3a.input.stream.custom.factory`.
This must be the classname of an implementation of the factory service, `org.apache.hadoop.fs.s3a.impl.streams.ObjectInputStreamFactory`.
Consult the source and javadocs of the package `org.apache.hadoop.fs.s3a.impl.streams` for details.
Note this is very much internal and unstable code: any use of it should be considered experimental, and it is not recommended for production use.
| Property | Permitted Values | Meaning |
|----------|------------------|---------|
| `fs.s3a.input.stream.custom.factory` | name of factory class on the classpath | classname of custom factory |
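Putting the two properties together, a deployment of a custom factory might look like the following; `org.example.MyStreamFactory` is a placeholder classname for illustration, not a real class:

```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>custom</value>
</property>
<property>
  <name>fs.s3a.input.stream.custom.factory</name>
  <value>org.example.MyStreamFactory</value>
</property>
```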