hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
This module includes both unit tests, which can run in isolation without
connecting to the S3 service, and integration tests, which require a working
connection to S3 to interact with a bucket. Unit test suites follow the naming
convention Test*.java. Integration tests follow the naming convention
ITest*.java.
The Apache Jenkins infrastructure does not run any S3 integration tests for the hadoop-aws module, due to the need to keep credentials secure.
This is important: patches which do not include this declaration will be ignored.
This policy has proven to be the only mechanism to guarantee full regression testing of code changes. Why require the declaration of the test region? Because S3 behavior varies between regions and stores, knowing where a patch was verified tells reviewers what the test run actually covered.
You don't need to test from a VM within the AWS infrastructure; with the
-Dparallel-tests option the non-scale tests complete in under twenty minutes.
Because the tests clean up after themselves, they are also designed to be low
cost. It's neither hard nor expensive to run the tests; if you can't,
there's no guarantee your patch works. The reviewers have enough to do, and
don't have the time to run these tests, especially as every failure simply
makes for slow iterative development.
Please: run the tests. And if you don't, we are sorry for declining your patch, but we have to.
Some of the tests do fail intermittently, especially in parallel runs. If this happens, try to run the test on its own to see if the test succeeds.
If it still fails, include this fact in your declaration. We know some tests are intermittently unreliable.
The tests and the S3A client are designed to be configurable for different timeouts. If you are seeing problems and this configuration isn't working, that's a sign that the configuration mechanism isn't complete. If it's happening in the production code, that could be a sign of a problem which may surface over long-haul connections. Please help us identify and fix these problems; you are the one best placed to verify the fixes work.
To integration test the S3* filesystem clients, you need to provide
auth-keys.xml which passes in authentication details to the test runner.
It is a Hadoop XML configuration file, which must be placed into
hadoop-tools/hadoop-aws/src/test/resources.
The file core-site.xml pre-exists and sources the configuration created
under auth-keys.xml.
For most purposes you will not need to edit this file unless you need to apply a specific, non-default property change during the tests.
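If you do need such an override, it takes the usual Hadoop property form. The property and value below are purely illustrative:

```xml
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>20</value>
</property>
```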
The presence of the file auth-keys.xml triggers the testing of the S3 classes.
Without this file, none of the integration tests in this module will be executed.
The XML file must contain all the ID/key information needed to connect each of the filesystem clients to the object stores, and a URL for each filesystem for its testing.
test.fs.s3a.name: the URL of the bucket for S3a tests.
fs.contract.test.fs.s3a: the URL of the bucket for S3a filesystem contract tests.

The contents of the bucket will be destroyed during the test process:
do not use the bucket for any purpose other than testing. Furthermore, for
s3a, all in-progress multi-part uploads to the bucket will be aborted at the
start of a test (by forcing fs.s3a.multipart.purge=true) to clean up the
temporary state of previously failed tests.
Example:
<configuration>
<property>
<name>test.fs.s3a.name</name>
<value>s3a://test-aws-s3a/</value>
</property>
<property>
<name>fs.contract.test.fs.s3a</name>
<value>${test.fs.s3a.name}</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID. Omit for IAM role-based authentication.</description>
<value>DONOTCOMMITTHISKEYTOSCM</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key. Omit for IAM role-based authentication.</description>
<value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
<property>
<name>test.sts.endpoint</name>
<description>Specific endpoint to use for STS requests.</description>
<value>sts.amazonaws.com</value>
</property>
</configuration>
For the S3a encryption tests to run correctly,
fs.s3a.encryption.key must be configured in the s3a contract xml
file or auth-keys.xml file with an AWS KMS encryption key ARN, as this value is
different for each AWS account and region. Please note this KMS key should be created in the
same region as your S3 bucket; otherwise, you may get a KMS.NotFoundException.
Example:
<property>
<name>fs.s3a.encryption.key</name>
<value>arn:aws:kms:us-west-2:360379543683:key/071a86ff-8881-4ba0-9230-95af6d01ca01</value>
</property>
You can also force all the tests to run with a specific SSE encryption method
by configuring the property fs.s3a.encryption.algorithm in the s3a
contract file.
Buckets can be configured with default encryption on the AWS side. Some S3AFileSystem tests are skipped when default encryption is enabled due to unpredictability in how ETags are generated.
If the S3 store/storage class doesn't support server-side encryption, these tests will fail. They can be turned off:
<property>
<name>test.fs.s3a.encryption.enabled</name>
<value>false</value>
</property>
Encryption is only used for those specific test suites with Encryption in
their classname.
After completing the configuration, execute the test run through Maven.
mvn clean verify
It's also possible to execute multiple test suites in parallel by passing the
parallel-tests property on the command line. The tests spend most of their
time blocked on network I/O with the S3 service, so running in parallel tends to
complete full test runs faster.
mvn -Dparallel-tests clean verify
Some tests must run with exclusive access to the S3 bucket, so even with the
parallel-tests property, several test suites will run in serial in a separate
Maven execution step after the parallel tests.
By default, parallel-tests runs 4 test suites concurrently. This can be tuned
by passing the testsThreadCount property.
mvn -Dparallel-tests -DtestsThreadCount=8 clean verify
To run just unit tests, which do not require S3 connectivity or AWS credentials,
use any of the above invocations, but switch the goal to test instead of
verify.
mvn clean test
mvn -Dparallel-tests clean test
mvn -Dparallel-tests -DtestsThreadCount=8 clean test
To run only a specific named subset of tests, pass the test property for unit
tests or the it.test property for integration tests.
mvn clean test -Dtest=TestS3AInputPolicies
mvn clean verify -Dit.test=ITestS3AFileContextStatistics -Dtest=none
mvn clean verify -Dtest=TestS3A* -Dit.test=ITestS3A*
Note that when running a specific subset of tests, the patterns passed in test
and it.test override the configuration of which tests need to run in isolation
in a separate serial phase (mentioned above). This can cause unpredictable
results, so the recommendation is to avoid passing parallel-tests in
combination with test or it.test. If you know that you are specifying only
tests that can run safely in parallel, then it will work. For wide patterns,
like ITestS3A* shown above, it may cause unpredictable test failures.
S3A can connect to different regions; the tests support this. Simply
define the target region in auth-keys.xml:
<property>
<name>fs.s3a.endpoint.region</name>
<value>eu-central-1</value>
</property>
The ITestS3AInputStreamPerformance tests require read access to a multi-MB
text file. The default file for these tests is a public one:
s3a://noaa-cors-pds/raw/2023/001/akse/AKSE001a.23_.gz
from the NOAA Continuously Operating Reference Stations (CORS) Network (NCN).
Historically it was required to be a csv.gz file to validate S3 Select
support. Now that S3 Select support has been removed, other large files
may be used instead.
The path to this object is set in the option fs.s3a.scale.test.csvfile:
<property>
<name>fs.s3a.scale.test.csvfile</name>
<value>s3a://noaa-cors-pds/raw/2023/001/akse/AKSE001a.23_.gz</value>
</property>
If fs.s3a.scale.test.csvfile is set to an empty value, tests which require it will be skipped. If the file it references is not a .gz file, expect decompression-related test failures.
(The reason the space or newline is needed is to add "an empty entry"; an empty
<value/> would be considered undefined and pick up the default.)
If using a test file in a different AWS S3 region then
a bucket-specific region must be defined.
For the default test dataset, hosted in the noaa-cors-pds bucket, this is:
<property>
<name>fs.s3a.bucket.noaa-cors-pds.endpoint.region</name>
<value>us-east-1</value>
</property>
S3a supports using Access Point ARNs to access data in S3. If you think your changes affect VPC integration, request signing, ARN manipulation, or any code path that deals with the actual sending and retrieving of data to/from S3, make sure you run the entire integration test suite with this feature enabled.
Check out our documentation for steps on how to enable this feature. To create access points for your S3 bucket you can use the AWS Console or CLI.
Integration test results and logs are stored in target/failsafe-reports/.
An HTML report can be generated during site generation, or with the surefire-report
plugin:
mvn surefire-report:failsafe-report-only
Some tests (specifically some in ITestS3ARemoteFileChanged) require
a versioned bucket for full test coverage.
To enable versioning in a bucket, use the AWS console or CLI.
Once a bucket is converted to being versioned, it cannot be converted back to being unversioned.
The tests are run with prefetch if the prefetch property is set in the
maven build. This can be combined with the scale tests as well.
mvn verify -Dprefetch
mvn verify -Dparallel-tests -Dprefetch -DtestsThreadCount=8
mvn verify -Dparallel-tests -Dprefetch -Dscale -DtestsThreadCount=8
There is a set of tests designed to measure the scalability and performance at scale of S3A: the Scale Tests. Tests include: creating and traversing directory trees, uploading large files, renaming them, deleting them, seeking through the files, performing random IO, and others. This makes them a foundational part of the benchmarking.
By their very nature they are slow. And, as their execution time is often limited by bandwidth between the computer running the tests and the S3 endpoint, parallel execution does not speed these tests up.
The tests are enabled if the scale property is set in the maven build;
this can be done regardless of whether or not the parallel-tests profile
is used:
mvn verify -Dscale
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8
The most bandwidth intensive tests (those which upload data) always run sequentially; those which are slow due to HTTPS setup costs or server-side actions are included in the set of parallelized tests.
Some of the tests can be tuned from the maven build or from the configuration file used to run the tests.
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8 -Dfs.s3a.scale.test.huge.filesize=128M
The algorithm is:

1. The value is queried from the configuration file, using a default value if it is not set.
1. The value is queried from the JVM system properties, where it is passed down by maven.
1. If the system property is null, empty, or set to the literal string unset,
then the configuration value is used. The unset option is used to
work round a quirk in maven property propagation.

Only a few properties can be set this way; more will be added.
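That precedence can be sketched as follows (illustrative Python, not Hadoop code; the function name and structure are invented for this note):

```python
def resolve_scale_option(maven_value, conf_value, default):
    """Resolve a scale-test option, roughly mirroring the precedence
    above: a maven -D value wins, unless it is missing, empty, or the
    literal string "unset", in which case the configuration-file value
    (or, failing that, the built-in default) is used."""
    if maven_value in (None, "", "unset"):
        return conf_value if conf_value is not None else default
    return maven_value

# A -D value on the maven command line overrides the configuration file:
print(resolve_scale_option("128M", "200M", "100M"))  # 128M
# Passing "unset" falls back to the configuration file:
print(resolve_scale_option("unset", "200M", "100M"))  # 200M
```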
| Property | Meaning |
|---|---|
| fs.s3a.scale.test.timeout | Timeout in seconds for scale tests |
| fs.s3a.scale.test.huge.filesize | Size for huge file uploads |
| fs.s3a.scale.test.huge.partitionsize | Size for partitions in huge file uploads |
The file and partition sizes are numeric values with a k/m/g/t/p suffix depending on the desired size. For example: 128M, 128m, 2G, 4T or even 1P.
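The suffix arithmetic can be illustrated with a small sketch (a hypothetical helper, not the actual Hadoop parser):

```python
# Powers of 1024 for the size suffixes accepted by the scale test options.
SUFFIX_POWER = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5}

def parse_test_size(value):
    """Parse a size string such as '128M' or '6G' into a byte count.
    Suffixes are case-insensitive; a plain number is taken as bytes."""
    value = value.strip()
    power = SUFFIX_POWER.get(value[-1].lower())
    if power is None:
        return int(value)
    return int(value[:-1]) * 1024 ** power

print(parse_test_size("128M"))  # 134217728 bytes
```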
Some scale tests perform multiple operations (such as creating many directories).
The exact number of operations to perform is configurable in the option
scale.test.operation.count
<property>
<name>scale.test.operation.count</name>
<value>10</value>
</property>
Larger values generate more load, and are recommended when testing locally, or in batch runs.
Smaller values result in faster test runs, especially when the object store is a long way away.
Operations which work on directories have a separate option: this controls the width and depth of tests creating recursive directories. Larger values create exponentially more directories, with consequent performance impact.
<property>
<name>scale.test.directory.count</name>
<value>2</value>
</property>
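To see why growth is exponential, consider a simple model in which `scale.test.directory.count` sets both the fan-out and the depth of the created tree (an assumption made purely for illustration; the exact tree shape is defined by the individual tests):

```python
def directories_created(count):
    """Total directories in a tree where each of `count` levels fans out
    into `count` children: count + count**2 + ... + count**count."""
    return sum(count ** level for level in range(1, count + 1))

print(directories_created(2))  # 6
print(directories_created(4))  # 340
```

Doubling the option from 2 to 4 raises the directory count from 6 to 340 in this model, which is why small values are recommended against remote stores.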
DistCp tests targeting S3A support a configurable file size. The default is 10 MB, but the configuration value is expressed in KB so that it can be tuned smaller to achieve faster test runs.
<property>
<name>scale.test.distcp.file.size.kb</name>
<value>10240</value>
</property>
S3A-specific scale test properties are:
fs.s3a.scale.test.huge.filesize: size in MB for "Huge file tests".
The Huge File tests validate S3A's ability to handle large files —the property
fs.s3a.scale.test.huge.filesize declares the file size to use.
<property>
<name>fs.s3a.scale.test.huge.filesize</name>
<value>200M</value>
</property>
Amazon S3 handles files larger than 5GB differently than smaller ones. Setting the huge filesize to a number greater than that validates support for huge files:
<property>
<name>fs.s3a.scale.test.huge.filesize</name>
<value>6G</value>
</property>
Tests at this scale are slow: they are best executed from hosts running in
the cloud infrastructure where the S3 endpoint is based.
Otherwise, set a large timeout in fs.s3a.scale.test.timeout
<property>
<name>fs.s3a.scale.test.timeout</name>
<value>432000</value>
</property>
The tests are executed in an order to only clean up created files after the end of all the tests. If the tests are interrupted, the test data will remain.
For CI testing of the module, including the integration tests, it is generally necessary to support testing multiple PRs simultaneously.
To do this:

1. Set the job.id property, so each job works on an isolated directory
tree. This should be a number or unique string, which will be used within a path element, so
must only contain characters valid in an S3/hadoop path element.
1. Set fs.s3a.root.tests.enabled to
false, either on the command line to maven or in the XML configurations.

mvn verify -T 1C -Dparallel-tests -DtestsThreadCount=14 -Dscale -Dfs.s3a.root.tests.enabled=false -Djob.id=001
This parallel execution feature is only for isolated builds sharing a single S3 bucket; it does not support parallel builds and tests from the same local source tree.
Without the root tests being executed, set up a scheduled job to purge the test bucket of all data on a regular basis, to keep costs down. The easiest way to do this is to have a bucket lifecycle rule for the bucket to delete all files more than a few days old, alongside one to abort all pending uploads more than 24h old.
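A sketch of such a pair of lifecycle rules, in the JSON accepted by `aws s3api put-bucket-lifecycle-configuration` (the rule IDs and the 7-day expiry are arbitrary placeholders):

```json
{
  "Rules": [
    {
      "ID": "expire-old-test-data",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 7}
    },
    {
      "ID": "abort-stale-uploads",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1}
    }
  ]
}
```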
It's clearly unsafe to have CI infrastructure testing PRs submitted to the Apache GitHub account with AWS credentials, which is why it isn't done by the Yetus-initiated builds.
Anyone doing this privately should take appropriate precautions to keep their credentials and test bucket secure.
Some tests are designed to overload AWS services with more requests per second than an AWS account is permitted.
The operation of these tests may be observable to other users of the same account, especially if they are working in the AWS region to which the tests are targeted.
They may also run up larger bills.
These tests all have the prefix ILoadTest
They do not run automatically: they must be explicitly run from the command line or an IDE.
Look in the source for these and read the Javadocs before executing them.
Note: one fear here was that asking for too many session/role credentials in a short period of time would actually lock an account out of a region. It doesn't: it simply triggers throttling of STS requests.
The S3A filesystem is designed to work with S3 stores which implement the S3 protocols to the extent that the Amazon S3 SDK is capable of talking to them. We encourage testing against other filesystems and submissions of patches which address issues. In particular, we encourage testing of Hadoop release candidates, as these third-party endpoints get even less testing than the S3 endpoint itself.
The core XML settings to turn off tests of features unavailable on third-party stores are:
<property>
<name>test.fs.s3a.encryption.enabled</name>
<value>false</value>
</property>
<property>
<name>test.fs.s3a.create.storage.class.enabled</name>
<value>false</value>
</property>
<property>
<name>test.fs.s3a.sts.enabled</name>
<value>false</value>
</property>
<property>
<name>test.fs.s3a.create.create.acl.enabled</name>
<value>false</value>
</property>
<property>
<name>test.fs.s3a.performance.enabled</name>
<value>false</value>
</property>
<!--
If the store reports errors when trying to list/abort completed multipart uploads,
expect failures in ITestUploadRecovery and ITestS3AContractMultipartUploader.
The tests can be reconfigured to expect failure.
Note how this can be set as a per-bucket option.
-->
<property>
<name>fs.s3a.ext.test.multipart.commit.consumes.upload.id</name>
<value>true</value>
</property>
See Third Party Stores for more on this topic.
Some tests rely on the presence of existing public datasets available on Amazon S3.
You may find a number of these in org.apache.hadoop.fs.s3a.test.PublicDatasetTestUtils.
When testing against an endpoint which is not part of Amazon S3's standard commercial partition
(aws), such as third-party implementations or AWS's China regions, you should replace these
configurations either with an empty space ( ) to disable the tests, or with an existing path in your object
store that supports these tests.
An example of this might be the MarkerTools tests which require a bucket with a large number of objects or the requester pays tests that require requester pays to be enabled for the bucket.
When running storage class tests against a third-party object store that doesn't support S3 storage classes, those tests might fail. They can be disabled:
<property>
<name>test.fs.s3a.create.storage.class.enabled</name>
<value>false</value>
</property>
To test on alternate infrastructures supporting
the same APIs, the option fs.s3a.scale.test.csvfile must either be
set to " ", or an object of at least 10MB is uploaded to the object store, and
the fs.s3a.scale.test.csvfile option set to its path.
<property>
<name>fs.s3a.scale.test.csvfile</name>
<value> </value>
</property>
(yes, the space is necessary. The Hadoop Configuration class treats an empty
value as "do not override the default").
If ITestS3AContractGetFileStatusV1List fails with an error about an unsupported API, disable the v1 list tests:
<property>
<name>test.fs.s3a.list.v1.enabled</name>
<value>false</value>
</property>
Note: there's no equivalent for turning off v2 listing API, which all stores are now required to support.
By default, the requester pays tests will look for a bucket that exists on Amazon S3 in us-east-1.
If the endpoint does support requester pays, you can specify an alternative object. The test only requires an object of at least a few bytes in order to check that lists and basic reads work.
<property>
<name>test.fs.s3a.requester.pays.file</name>
<value>s3a://my-req-pays-enabled-bucket/on-another-endpoint.json</value>
</property>
If the endpoint does not support requester pays, you can also disable the tests by configuring the test URI as a single space.
<property>
<name>test.fs.s3a.requester.pays.file</name>
<value> </value>
</property>
Some tests request session credentials and assumed-role credentials from the AWS Secure Token Service, then use them to authenticate with S3 either directly or via delegation tokens.
If an S3 implementation does not support STS, then these functional test cases must be disabled:
<property>
<name>test.fs.s3a.sts.enabled</name>
<value>false</value>
</property>
These tests request a temporary set of credentials from the STS service endpoint.
An alternate endpoint may be defined in fs.s3a.assumed.role.sts.endpoint.
If this is set, a delegation token region must also be defined:
in fs.s3a.assumed.role.sts.endpoint.region.
This is useful not just for testing alternative infrastructures,
but to reduce latency on tests executed away from the central
service.
<property>
<name>fs.s3a.delegation.token.endpoint</name>
<value>${fs.s3a.assumed.role.sts.endpoint}</value>
</property>
<property>
<name>fs.s3a.assumed.role.sts.endpoint.region</name>
<value>eu-west-2</value>
</property>
The default is "", meaning "use the Amazon default endpoint" (sts.amazonaws.com).
Consult the AWS documentation for the full list of locations.
Tests in ITestS3AContentEncoding may need disabling
<property>
<name>test.fs.s3a.content.encoding.enabled</name>
<value>false</value>
</property>
Some tests running in performance mode turn off the safety checks. They expect operations which break POSIX semantics to succeed. For stores with stricter semantics, these test cases must be disabled.
<property>
<name>test.fs.s3a.performance.enabled</name>
<value>false</value>
</property>
ITestS3AContractMultipartUploader and ITestUploadRecovery: if the store reports errors when trying to list/abort completed multipart uploads,
expect failures in these suites.
The tests can be reconfigured to expect failure by setting the option
fs.s3a.ext.test.multipart.commit.consumes.upload.id to true.
Note how this can be set as a per-bucket option.
<property>
<name>fs.s3a.ext.test.multipart.commit.consumes.upload.id</name>
<value>true</value>
</property>
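For example, scoped to a single bucket (the bucket name below is a placeholder):

```xml
<property>
  <name>fs.s3a.bucket.example-bucket.ext.test.multipart.commit.consumes.upload.id</name>
  <value>true</value>
</property>
```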
ITestS3AMiscOperations.testEmptyFileChecksums: disable if the FS always encrypts data.

Logging at debug level is the standard way to provide more diagnostics output; after setting this, rerun the tests:
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
There are also some logging options for debug logging of the AWS client; consult the file.
There is also the option of enabling logging on a bucket; this could perhaps
be used to diagnose problems from that end. This isn't something actively
used, but remains an option. If you are forced to debug this way, consider
setting the fs.s3a.user.agent.prefix to a unique prefix for a specific
test run, which will enable the specific log entries to be more easily
located.
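For example (the prefix value below is an arbitrary placeholder; pick anything unique per run):

```xml
<property>
  <name>fs.s3a.user.agent.prefix</name>
  <value>test-run-20240501-a</value>
</property>
```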
New tests are always welcome. Bear in mind that we need to keep costs and test time down, which is done by:

No duplication: if an operation is tested elsewhere, don't repeat it. This applies as much for metadata operations as it does for bulk IO. If a new test case is added which completely obsoletes an existing test, it is OK to cut the previous one, after showing that coverage is not worsened.

Being efficient: prefer calling getFileStatus() and examining the results, rather than
calls to exists(), isFile(), etc.
Isolating scale tests: any S3A test doing large amounts of IO MUST extend the
class S3AScaleTestBase, so that it only runs if scale is defined on a build and
supports test timeouts configurable by the user. Scale tests should also
support configurability as to the actual size of objects/number of operations,
so that behavior at different scale can be verified.
Designed for parallel execution. A key need here is for each test suite to work
on isolated parts of the filesystem. Subclasses of AbstractS3ATestBase
SHOULD use the path() method, with a base path of the test suite name, to
build isolated paths. Tests MUST NOT assume that they have exclusive access
to a bucket.
Extending existing tests where appropriate. This recommendation goes against normal testing best practise of "test one thing per method". Because it is so slow to create directory trees or upload large files, we do not have that luxury. All the tests against real S3 endpoints are integration tests where sharing test setup and teardown saves time and money.
A standard way to do this is to extend existing tests with some extra predicates, rather than write new tests. When doing this, make sure that the new predicates fail with meaningful diagnostics, so any new problems can be easily debugged from test logs.
Effective use of FS instances during S3A integration tests. Tests using
FileSystem instances are fastest if they can recycle the existing FS
instance from the same JVM.
If you do that, you MUST NOT close or do unique configuration on them. If you want a guarantee of 100% isolation or an instance with unique config, create a new instance which you MUST close in the teardown to avoid leakage of resources.
Do NOT add FileSystem instances manually
(with e.g org.apache.hadoop.fs.FileSystem#addFileSystemForTesting) to the
cache that will be modified or closed during the test runs. This can cause
other tests to fail when using the same modified or closed FS instance.
For more details see HADOOP-15819.
This is what we expect from new tests; they're an extension of the normal Hadoop requirements, based on the need to work with remote servers whose use requires the presence of secret credentials, where tests may be slow, and where finding out why something failed from nothing but the test output is critical.
Extend AbstractS3ATestBase or AbstractSTestS3AHugeFiles unless justifiable.
These set things up for testing against the object stores, provide good threadnames,
help generate isolated paths, and for AbstractSTestS3AHugeFiles subclasses,
only run if -Dscale is set.
Key features of AbstractS3ATestBase:

getFileSystem() returns the S3A filesystem bonded to the contract test filesystem
defined in fs.contract.test.fs.s3a; this behavior comes from the base class AbstractFSContractTestBase.

Having shared base classes may help reduce future maintenance too. Please use them.
We adopted AssertJ assertions long before the move to JUnit 5. While there are still many tests with legacy JUnit assertions, all new test cases should use AssertJ assertions and MUST NOT use the JUnit 5 assertion classes.
Don't ever log credentials. The credential tests go out of their way to not provide meaningful logs or assertion messages precisely to avoid this.
This means efficient in test setup/teardown, and, ideally, making use of existing public datasets to save setup time and tester cost.
Strategies of particular note are:
ITestS3ADirectoryPerformance: a single test case sets up the directory
tree then performs different list operations, measuring the time taken.

AbstractSTestS3AHugeFiles: marks the test suite as
@FixMethodOrder(MethodSorters.NAME_ASCENDING) then orders the test cases such
that each test case expects the previous test to have completed (here: uploaded a file,
renamed a file, ...). This provides for independent tests in the reports, yet still
permits an ordered sequence of operations. Do note the use of Assume.assume()
to detect when the preconditions for a single test case are not met; the tests
then become skipped, rather than failing with a trace which is really a false alarm.

The ordered test case mechanism of AbstractSTestS3AHugeFiles is probably
the most elegant way of chaining test setup/teardown.
Regarding reusing existing data, we tend to use the noaa-cors-pds archive of AWS US-East for our testing of input stream operations. This doesn't work against other regions, or with third party S3 implementations. Thus the URL can be overridden for testing elsewhere.
Don't assume AWS S3 US-East only, do allow for working with external S3 implementations. Those may be behind the latest S3 API features, not support encryption, session APIs, etc.
They won't have the same CSV/large test files as some of the input tests rely on.
Look at ITestS3AInputStreamPerformance to see how tests can be written
to support the declaration of a specific large test file on alternate filesystems.
As well as making file size and operation counts scalable, this includes
making test timeouts adequate. The Scale tests make this configurable; it's
hard coded to ten minutes in AbstractS3ATestBase(); subclasses can
change this by overriding getTestTimeoutMillis().
Equally importantly: support proxies, as some testers need them.
S3AFileSystem and its input
and output streams all provide useful statistics in their toString()
calls; logging them is useful on its own.

Use AbstractS3ATestBase.describe(format-string, args); it
adds some newlines so messages are easier to spot.

Use ContractTestUtils.NanoTimer to measure the duration of operations,
and log the output.

The ContractTestUtils class contains a whole set of assertions for making
statements about the expected state of a filesystem, e.g.
assertPathExists(FS, path), assertPathDoesNotExists(FS, path), and others.
These do their best to provide meaningful diagnostics on failures (e.g. directory
listings, file status, ...), so help make failures easier to understand.
At the very least, do not use assertTrue() or assertFalse() without
including error messages.
Tests can override createConfiguration() to add new options to the configuration
used for the S3A filesystem instance in their tests.
However, filesystem caching may mean that a test suite gets a cached instance created with a different configuration. For tests which don't need specific configurations, caching is good: it reduces test setup time.
For those tests which do need unique options (encryption, magic files), things can break, and they will do so in hard-to-replicate ways.
Use S3ATestUtils.disableFilesystemCaching(conf) to disable caching when
modifying the config. As an example from AbstractTestS3AEncryption:
@Override
protected Configuration createConfiguration() {
Configuration conf = super.createConfiguration();
S3ATestUtils.disableFilesystemCaching(conf);
removeBaseAndBucketOverrides(conf,
SERVER_SIDE_ENCRYPTION_ALGORITHM);
conf.set(Constants.SERVER_SIDE_ENCRYPTION_ALGORITHM,
getSSEAlgorithm().getMethod());
return conf;
}
Then verify in the setup method or test cases that their filesystem actually has
the desired feature (fs.getConf().get(...)). This not only
catches filesystem reuse problems, it catches the situation where the
filesystem configuration in auth-keys.xml has explicit per-bucket settings
which override the test suite's general option settings.
This keeps costs down.
We really appreciate this; you will too.
Tests must be designed to run in parallel with other tests, all working with the same shared S3 bucket. This means:

Use relative and isolated paths, built via AbstractFSContractTestBase.path(String filepath).

Tests which need exclusive access to the bucket (such as the root directory tests) can only be run as sequential tests. When adding one,
exclude it in the POM file from the parallel failsafe run and add it to the
sequential one afterwards. The IO-heavy ones must also be subclasses of
S3AScaleTestBase and so only run if the system/maven property
fs.s3a.scale.test.enabled is true.
This is invaluable for debugging test failures.
Test options can be set in your hadoop configuration rather than on the maven command line.
Most of the base S3 tests are designed to delete files after test runs, so you don't have to pay for storage costs. The scale tests work with more data, so they will cost more as well as generally taking more time to execute.
You are however billed for the requests made during the test runs, any data transferred out of AWS, and any test data left in the bucket afterwards.
The GET/decrypt costs are incurred on each partial read of a file, so random IO can cost more than sequential IO; the speedup of queries with columnar data usually justifies this.
How to keep costs down
Don't run the scale tests with large datasets; keep fs.s3a.scale.test.huge.filesize unset, or a few MB (minimum: 5).
Remove all files in the filesystem. The root tests usually do this, but it can be manually done:
hadoop fs -rm -r -f -skipTrash s3a://test-bucket/\*
Abort all outstanding uploads:
hadoop s3guard uploads -abort -force s3a://test-bucket/
Although the auth-keys.xml file is marked as ignored in git and subversion,
it is still in your source tree, and there's always that risk that it may
creep out.
You can avoid this by keeping your keys outside the source tree and using an absolute XInclude reference to it.
<configuration>
<include xmlns="http://www.w3.org/2001/XInclude"
href="file:///users/ubuntu/.auth-keys.xml" />
</configuration>
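If the same source tree is sometimes built on hosts without that file, an XInclude fallback element keeps the configuration loadable when the file is absent (a sketch; the path is a placeholder):

```xml
<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="file:///users/ubuntu/.auth-keys.xml">
    <fallback/>
  </include>
</configuration>
```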
S3A provides an "Inconsistent S3 Client Factory" that can be used to simulate throttling by injecting random failures on S3 client requests.
Note
In previous releases, this factory could also be used to simulate inconsistencies during testing of S3Guard. Now that S3 is consistent, injecting inconsistency is no longer needed during testing.
Tests for the AWS Assumed Role credential provider require an assumed role to request.
If this role is not declared in fs.s3a.assumed.role.arn,
the tests which require it will be skipped.
The specific tests an Assumed Role ARN is required for are `ITestAssumeRole`, `ITestRoleDelegationTokens` and `ITestDelegatedMRJob`. To run these tests you need:

* A role in your AWS account with full read and write access rights to the S3 bucket used in the tests, and to KMS for any SSE-KMS or DSSE-KMS tests.
* Your IAM user to have the permissions to "assume" that role.
* The role ARN set in `fs.s3a.assumed.role.arn`:

```xml
<property>
  <name>fs.s3a.assumed.role.arn</name>
  <value>arn:aws:iam::9878543210123:role/role-s3-restricted</value>
</property>
```
The tests assume the role with different subsets of permissions and verify that the S3A client (mostly) works when the caller has only write access to part of the directory tree.
You can also run the entire test suite in an assumed role, a more thorough test, by switching to the credentials provider:

```xml
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value>
</property>
```
The usual credentials needed to log in to the bucket will be used, but now the credentials used to interact with S3 will be temporary role credentials, rather than the full credentials.
Updating the AWS SDK is something which does need to be done regularly, but is rarely without complications, major or minor.
Assume that the version of the SDK will remain constant for an X.Y release, excluding security fixes, so it's good to have an update before each release, as long as that update doesn't trigger any regressions.
In `hadoop-project/pom.xml`, update `aws-java-sdk.version` to the new SDK version, then qualify the update:

* Run the `hadoop-aws` tests. This includes the `-Pscale` set, with a role defined in `fs.s3a.assumed.role.arn` for testing assumed roles, and `fs.s3a.encryption.key` set for encryption, for full coverage. If you can, scale up the scale tests.
* Run the `ITest*` integration tests from your IDE or via maven.
* Run the `ILoadTest*` load tests from your IDE or via maven through `mvn verify -Dtest=skip -Dit.test=ILoadTest\*`; look for regressions in performance as much as failures.
* Run `mvn site -DskipTests`; look in `target/site` for the report.
* Review the `-output.txt` files in `hadoop-tools/hadoop-aws/target/failsafe-reports`, paying particular attention to `org.apache.hadoop.fs.s3a.scale.ITestS3AInputStreamPerformance-output.txt`, as that is where changes in stream close/abort logic will surface.
* Run `mvn install` to install the artifacts, then in `hadoop-cloud-storage-project/hadoop-cloud-storage` run `mvn dependency:tree -Dverbose > target/dependencies.txt`. Examine the `target/dependencies.txt` file to verify that no new artifacts have unintentionally been declared as dependencies of the shaded `software.amazon.awssdk:bundle:jar` artifact.
* Run the client-side encryption tests by setting `fs.s3a.encryption.algorithm` to `CSE-KMS` and setting up the AWS-KMS Key ID in `fs.s3a.encryption.key`.
* Run `TestAWSV2SDK`, which checks that the shaded bundle doesn't contain any unshaded classes.

The dependency chain of the hadoop-aws module should be similar to this, albeit
with different version numbers:
```
[INFO] +- org.apache.hadoop:hadoop-aws:jar:3.4.0-SNAPSHOT:compile
[INFO] |  +- software.amazon.awssdk:bundle:jar:2.23.5:compile
[INFO] |  \- org.wildfly.openssl:wildfly-openssl:jar:1.1.3.Final:compile
```
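The SDK version update itself is a one-line change to a maven property in `hadoop-project/pom.xml`; the version shown below is only illustrative:

```xml
<aws-java-sdk.version>2.23.5</aws-java-sdk.version>
```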
We need a run through of the CLI to see if there have been changes there which cause problems, especially whether new log messages have surfaced, or whether some packaging change breaks that CLI.
It is always interesting when doing this to enable IOStatistics reporting:

```xml
<property>
  <name>fs.iostatistics.logging.level</name>
  <value>info</value>
</property>
```
From the root of the project, create a command line release: `mvn package -Pdist -DskipTests -Dmaven.javadoc.skip=true -DskipShade`. Change into the `hadoop-dist/target/hadoop-x.y.z-SNAPSHOT` directory, copy a `core-site.xml` file into `etc/hadoop`, and set the `HADOOP_OPTIONAL_TOOLS` env var on the command line or in `~/.hadoop-env`:

```bash
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
```
Run some basic s3guard CLI commands as well as file operations:
```bash
export BUCKETNAME=example-bucket-name
export BUCKET=s3a://$BUCKETNAME

bin/hadoop s3guard bucket-info $BUCKET
bin/hadoop s3guard uploads $BUCKET
# repeat twice, once with "no" and once with "yes" as responses
bin/hadoop s3guard uploads -abort $BUCKET

# ---------------------------------------------------
# root filesystem operations
# ---------------------------------------------------

bin/hadoop fs -ls $BUCKET/
# assuming file is not yet created, expect error and status code of 1
bin/hadoop fs -ls $BUCKET/file

# exit code of 0 even when path doesn't exist
bin/hadoop fs -rm -R -f $BUCKET/dir-no-trailing
bin/hadoop fs -rm -R -f $BUCKET/dir-trailing/

# error because it is a directory
bin/hadoop fs -rm $BUCKET/

bin/hadoop fs -touchz $BUCKET/file
# expect I/O error as it is the root directory
bin/hadoop fs -rm -r $BUCKET/
# succeeds
bin/hadoop fs -rm -r $BUCKET/\*

# ---------------------------------------------------
# File operations
# ---------------------------------------------------

bin/hadoop fs -mkdir $BUCKET/dir-no-trailing
bin/hadoop fs -mkdir $BUCKET/dir-trailing/
bin/hadoop fs -touchz $BUCKET/file
bin/hadoop fs -ls $BUCKET/
bin/hadoop fs -mv $BUCKET/file $BUCKET/file2
# expect "No such file or directory"
bin/hadoop fs -stat $BUCKET/file
# expect success
bin/hadoop fs -stat $BUCKET/file2

# expect "file exists"
bin/hadoop fs -mkdir $BUCKET/dir-no-trailing
bin/hadoop fs -mv $BUCKET/file2 $BUCKET/dir-no-trailing
bin/hadoop fs -stat $BUCKET/dir-no-trailing/file2
# treated the same as the file stat
bin/hadoop fs -stat $BUCKET/dir-no-trailing/file2/
bin/hadoop fs -ls $BUCKET/dir-no-trailing/file2/
bin/hadoop fs -ls $BUCKET/dir-no-trailing
# expect a "0" here:
bin/hadoop fs -test -d $BUCKET/dir-no-trailing ; echo $?
# expect a "1" here:
bin/hadoop fs -test -d $BUCKET/dir-no-trailing/file2 ; echo $?
# will return NONE unless bucket has checksums enabled
bin/hadoop fs -checksum $BUCKET/dir-no-trailing/file2
# expect "etag" + a long string
bin/hadoop fs -D fs.s3a.etag.checksum.enabled=true -checksum $BUCKET/dir-no-trailing/file2
bin/hadoop fs -expunge -immediate -fs $BUCKET

# ---------------------------------------------------
# Delegation Token support
# ---------------------------------------------------

# failure unless delegation tokens are enabled
bin/hdfs fetchdt --webservice $BUCKET secrets.bin
# success
bin/hdfs fetchdt -D fs.s3a.delegation.token.binding=org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding --webservice $BUCKET secrets.bin
bin/hdfs fetchdt -print secrets.bin
# expect warning "No TokenRenewer defined for token kind S3ADelegationToken/Session"
bin/hdfs fetchdt -renew secrets.bin

# ---------------------------------------------------
# Copy to/from local
# ---------------------------------------------------

time bin/hadoop fs -copyFromLocal -t 10 share/hadoop/tools/lib/*aws*jar $BUCKET/
# expect the iostatistics object_list_request value to be O(directories)
bin/hadoop fs -ls -R $BUCKET/
# expect the iostatistics object_list_request and op_get_content_summary values to be 1
bin/hadoop fs -du -h -s $BUCKET/
mkdir tmp
time bin/hadoop fs -copyToLocal -t 10 $BUCKET/\*aws\* tmp

# ---------------------------------------------------
# Cloudstore
# check out and build https://github.com/steveloughran/cloudstore
# then for these tests, set CLOUDSTORE env var to point to the JAR
# ---------------------------------------------------

bin/hadoop jar $CLOUDSTORE storediag $BUCKET
time bin/hadoop jar $CLOUDSTORE bandwidth 64M $BUCKET/testfile
```
Also run `ILoadTestS3ABulkDeleteThrottling` and the cloudstore commands (`storediag` etc). A Jenkins run should tell you if there are new deprecations. If so, you should think about how to deal with them.
Moving to methods and APIs which weren't in the previous SDK release makes it harder to roll back if there is a problem; but there may be good reasons for the deprecation.
At the same time, there may be good reasons for staying with the old code.
When the patch is committed: update the JIRA to the version number actually used; use that title in the commit message.
Be prepared to roll-back, re-iterate or code your way out of a regression.
There may be some problem which surfaces with wider use, which can get fixed by a new AWS release, by rolling back to an older one, or just worked around (see HADOOP-14596).
Don't be surprised if this happens, don't worry too much, and, while that rollback option is there to be used, ideally try to work forwards.
If the problem is with the SDK, file issues with the AWS V2 SDK Bug tracker. If the problem can be fixed or worked around in the Hadoop code, do it there too.