Initial commit
markjschreiber committed Apr 1, 2022
1 parent 1ed0e55 commit dbcaffd
Showing 4 changed files with 239 additions and 22 deletions.
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -1,4 +1,4 @@
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
[email protected] with any additional questions or comments.
42 changes: 30 additions & 12 deletions CONTRIBUTING.md
@@ -11,13 +11,21 @@ information to effectively respond to your bug report or contribution.

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check [existing open](https://github.com/aws/amazon-genomics-cli/issues),
or [recently closed](https://github.com/aws/amazon-genomics-cli/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20),
issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* Environment
  * Java version
  * OS version
  * Location of this extension JAR (or if it was used to compile another application)
  * IAM S3 permissions of the role used (mask sensitive information if any)
  * Bucket ACL (mask sensitive information if any)
* Steps to reproduce the error
* Expected result
* Actual result
* AWS region(s) where the issue was observed and region of the bucket being read from


## Contributing via Pull Requests
@@ -31,17 +39,22 @@ To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Use only Java 1.8 language features and ensure the code compiles with JDK 1.8.0_322 (8.322) or later patch versions.
4. Ensure unit tests cover your change and demonstrate the expected behavior.
5. Ensure unit tests do NOT require AWS credentials or S3 connectivity by using mocks for any `S3Client` or `S3AsyncClient`. Remember, unit tests test this library and not the functionality of S3.
6. Run `./gradlew check` to ensure local tests pass and test coverage reports are produced.
7. Ensure test coverage is not degraded. Reports can be found at `build/reports/jacoco/test/html/index.html`.
8. Send us a pull request, answering any default questions in the pull request interface.
9. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the
default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix),
looking at any ['help wanted'](https://github.com/aws/amazon-genomics-cli/labels/help%20wanted) issues is a great place to start.


## Code of Conduct
@@ -51,9 +64,14 @@ [email protected] with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via
our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/).
Please do **not** create a public GitHub issue.


## Licensing
See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement)
for larger changes.
2 changes: 2 additions & 0 deletions NOTICE
@@ -1 +1,3 @@
AWS Java NIO SPI for S3

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
215 changes: 206 additions & 9 deletions README.md
@@ -1,17 +1,214 @@
# AWS Java NIO SPI for S3

A Java NIO2 service provider for S3 allowing Java NIO operations to be performed on paths using the `s3` scheme.

## Using this package as a provider

There are several ways that this package can be used to provide Java NIO operations on S3 objects:

1. Use this library's JAR as one of your application's compile dependencies
2. Include the library's "shadowJar" in your `$JAVA_HOME/jre/lib/ext/` directory
3. Include this library on your class path at runtime
4. Include the library as an extension at runtime with `-Djava.ext.dirs=$JAVA_HOME/jre/lib/ext:/path/to/extension/`

## Example usage

Assuming that `myExecutableJar` is a Java application that has been built to read from `java.nio.file.Path`s, and
this library has been exposed by one of the mechanisms above, then S3 URIs may be used to identify inputs. For example:

```shell
java -jar myExecutableJar s3://some-bucket/input/file
```

If this library is exposed as an extension (see above), then no code changes or recompilation of `myExecutableJar` are
required.

## AWS Credentials

This library performs all actions using credentials according to the AWS SDK for Java [default credential provider
chain](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html). The library does not allow any
library-specific configuration of credentials. In essence, this means that when using this library you (or the service
using this library) should have, or be able to assume, a role that allows access to the S3 buckets and objects you
want to interact with.

Note also that although your IAM role may be sufficient to access the desired objects and buckets, you may still be
blocked by bucket access control lists and/or bucket policies.
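
As an illustration of the chain in action (this is the standard AWS SDK v2 API, not part of this library), the sketch
below checks which identity the default provider chain resolves to before running an application that reads S3 paths:

```java
import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;

public class CredentialsCheck {
    public static void main(String[] args) {
        // Resolves credentials from the same default chain this library relies on:
        // environment variables, system properties, profile files, or instance/container roles.
        AwsCredentials credentials = DefaultCredentialsProvider.create().resolveCredentials();
        System.out.println("Resolved access key id: " + credentials.accessKeyId());
    }
}
```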

## Reading Files

Bytes from S3 objects can be read using `S3SeekableByteChannel`, an implementation of `java.nio.channels.SeekableByteChannel`.
Because S3 is a high-throughput but high-latency service (compared to a native filesystem), the `S3SeekableByteChannel`
uses an in-memory read-ahead cache of `ByteBuffer`s optimized for the scenario where bytes are typically
read sequentially.

To perform this, the `S3SeekableByteChannel` delegates read operations to an `S3ReadAheadByteChannel`, which
implements `java.nio.channels.ReadableByteChannel`. When the first `read` operation is called, the channel reads its
first fragment and enters it into the buffer; requests for bytes in that fragment are fulfilled from that buffer. When
a buffer fragment is more than half read, all empty fragment slots in the cache are asynchronously filled. Further,
any cached fragments that precede the fragment currently being read are invalidated in the cache, freeing up space
for additional fragments to be retrieved asynchronously. Once the cache is "warm" the application should not be blocked
on I/O, up to the limits of your network connection.
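
For illustration, the sketch below reads an object through the standard NIO API. The bucket and key are hypothetical,
and it assumes the provider has been installed by one of the mechanisms above so that the `s3` scheme resolves to this
library:

```java
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical bucket and key, for illustration only
        Path path = Paths.get(URI.create("s3://some-bucket/input/file"));
        try (SeekableByteChannel channel = Files.newByteChannel(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            while (channel.read(buffer) != -1) {
                buffer.flip();
                // process the bytes in the buffer here ...
                buffer.clear();
            }
        }
    }
}
```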

### Configuration

The read-ahead buffer prefetches `n` sequential fragments of `m` bytes from S3 asynchronously. The
values of `n` and `m` can be configured to your needs by using command line properties or environment variables.

If no configuration is supplied, the values in `resources/s3-nio-spi.properties` are used; currently, 50 fragments of 5 MB each.
Each fragment is downloaded concurrently on a unique thread.

#### Environment Variables

You may use `S3_SPI_READ_MAX_FRAGMENT_NUMBER` and `S3_SPI_READ_MAX_FRAGMENT_SIZE` to set the maximum number of cached
fragments and the maximum fragment size, respectively. For example:

```shell
export S3_SPI_READ_MAX_FRAGMENT_SIZE=100000
export S3_SPI_READ_MAX_FRAGMENT_NUMBER=5
java -Djava.ext.dirs=$JAVA_HOME/jre/lib/ext:<location-of-this-spi-jar> -jar <jar-file-to-run>
```

#### Java Properties

You may use Java command line properties to set the maximum fragment size and maximum number of fragments
with `s3.spi.read.max-fragment-size` and `s3.spi.read.max-fragment-number`, respectively. For example:

```shell
java -Djava.ext.dirs=$JAVA_HOME/jre/lib/ext:<location-of-this-spi-jar> -Ds3.spi.read.max-fragment-size=10000 -Ds3.spi.read.max-fragment-number=2 -jar <jar-file-to-run>
```

#### Order of Precedence

Configurations use the following order of precedence, from highest to lowest:

1. Java properties
2. environment variables
3. default values

#### S3 limits

As each `S3SeekableByteChannel` can potentially spawn 50 concurrent fragment download threads, you may find that you exceed S3
limits, especially when the application using this SPI reads from multiple files at the same time or has multiple threads
each opening its own byte channel. In this situation you should reduce the value of `S3_SPI_READ_MAX_FRAGMENT_NUMBER`.

In some cases it may also help to increase the value of `S3_SPI_READ_MAX_FRAGMENT_SIZE`, as fewer, larger fragments will
reduce the number of requests to the S3 service.

## Design Decisions

As an object store, S3 is not completely analogous to a traditional file system. Therefore, several opinionated decisions
were made to map filesystem concepts to S3 concepts.

### Read Only

The current implementation only supports read operations. It is possible to add write operations; however, special consideration
will be needed due to the lack of support for random writes in S3 and the read-after-write consistency of S3 objects.

### A Bucket is a `FileSystem`

An S3 bucket is represented as a `java.nio.file.FileSystem` using an `S3FileSystem`. Although buckets are globally
namespaced, they are owned by individual accounts and have their own permissions, regions, and potentially endpoints.
An application that accesses objects from multiple buckets will generate multiple `FileSystem` instances.

### S3 Objects are `Path`s

Objects in S3 are analogous to files in a filesystem and are identified using `S3Path` instances, which can be built
from S3 URIs (e.g. `s3://mybucket/some-object`) or from POSIX-style patterns such as `/some-object` obtained from an
`S3FileSystem` for `mybucket`.
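A minimal sketch of both forms is shown below. It uses only the standard `Paths` and `FileSystems` factory methods;
the bucket and key are hypothetical, and whether the filesystem must be created explicitly (rather than obtained
implicitly from the URI) depends on how the provider is registered, so treat this as illustrative rather than canonical:

```java
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

public class PathExamples {
    public static void main(String[] args) throws Exception {
        // Form 1: build a path directly from an S3 URI (hypothetical bucket and key)
        Path fromUri = Paths.get(URI.create("s3://mybucket/some-object"));

        // Form 2: obtain the FileSystem for the bucket, then build a POSIX-style path from it
        FileSystem fs = FileSystems.newFileSystem(URI.create("s3://mybucket"), Collections.emptyMap());
        Path fromPattern = fs.getPath("/some-object");

        System.out.println(fromUri + " and " + fromPattern);
    }
}
```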

### No hidden files

S3 doesn't support hidden files; therefore, files named with a `.` prefix, such as `.hidden`, are not considered hidden.

### Creation time and Last modified time

S3 objects do not have a creation time, and modification of an S3 object is actually a re-write of the object, so both
attributes are given the same date (represented as a `FileTime`). If for some reason a last modified time cannot be determined,
the Unix epoch time is used.
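
As a sketch of what this means for callers, the snippet below reads both attributes through the standard
`Files.readAttributes` call. It assumes the provider supports `BasicFileAttributes` for S3 paths, and the path is
hypothetical:

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class AttributesExample {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get(URI.create("s3://mybucket/some-object"));  // hypothetical object
        BasicFileAttributes attributes = Files.readAttributes(path, BasicFileAttributes.class);
        // Both times reflect the object's last write, per the design decision above
        System.out.println("created:  " + attributes.creationTime());
        System.out.println("modified: " + attributes.lastModifiedTime());
    }
}
```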

### No symbolic links

S3 doesn't support symbolic links; therefore, no `S3Path` is a symbolic link, and any NIO `LinkOption`s are ignored.

### Posix-like path representations

Technically, S3 doesn't have directories; there are only buckets and keys. For example, in `s3://mybucket/path/to/file/object`
the bucket name is `mybucket` and the key is `/path/to/file/object`. By convention, the use of `/` in a key is
treated as a path separator, so `object` can be inferred to be a file in a directory called `/path/to/file/`,
even though that directory technically doesn't exist. This package infers directories using what we call "posix-like"
path representations. The logic for these is encoded in the `PosixLikePathRepresentation` class.

#### Directories

An `S3Path` is inferred to be a directory if the path ends with `/`, `/.`, or `/..`, or contains only `.` or `..`.

All of these paths are inferred to be directories: `/dir/`, `/dir/.`, and `/dir/..`. However, `dir` cannot be inferred to be a directory.
This is a divergence from a true POSIX filesystem, where if `/dir/` is a directory then `/dir` must also be a directory.
S3 holds no metadata that can be used to make this inference.

#### Working directory

As directories don't exist and are only inferred, there is no concept of being "in a directory". Therefore, the working
directory is always the root, and `/object`, `./object`, and `object` can all be inferred to be the same file. In addition,
`../object` is also the same file, as you may not navigate past the root, and no error is produced if you attempt to.

#### Relative path resolution

Although there are no working directories, paths may be resolved relative to one another as long as one is a directory.
So if `some/path` is resolved against `/this/location/`, the resulting path is `/this/location/some/path`.

Because directories are inferred, you may not resolve `some/path` against `/this/location`, as the latter cannot be
inferred to be a directory (it lacks a trailing `/`).
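
A short sketch of this behavior using the standard `Path.resolve` API follows; the bucket name is hypothetical, and it
assumes `getPath` and `resolve` behave as described above:

```java
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.Collections;

public class ResolveExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystems.newFileSystem(URI.create("s3://mybucket"), Collections.emptyMap());

        // "/this/location/" ends with '/' so it is inferred to be a directory and can be resolved against
        Path directory = fs.getPath("/this/location/");
        Path resolved = directory.resolve("some/path");
        System.out.println(resolved);  // expected: /this/location/some/path
    }
}
```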

#### Resolution of `..` and `.`

The POSIX path special symbols `.` and `..` are treated as they would be in a normal POSIX path. Note that this could
cause some S3 objects to be effectively invisible to this implementation. For example, `s3://mybucket/foo/./baa` is
an allowed S3 URI that is *not* equivalent to `s3://mybucket/foo/baa`, even though this library will resolve the path
`/foo/./baa` to `/foo/baa`.

## Building this library

The library uses the Gradle build system and targets Java 1.8. To build, you can simply run:

```shell
./gradlew build
```

This will run all unit tests and then generate a JAR file in `libs` with the name `s3fs-spi-<version>.jar`.

### Shadowed Jar with dependencies

To build a "fat" jar with the required dependencies (including aws s3 client libraries) you can run:

```shell
./gradlew shadowJar
```

which will produce `s3fs-spi-<version>-all.jar`. If you are using this library as an extension, this is the recommended
JAR to use. Don't put both JARs on your extension path, or you will observe class conflicts.

## Testing

To run unit tests and produce code coverage reports, run this command:

```shell
./gradlew test
```

HTML output of the test reports can be found at `build/reports/tests/test/index.html`, and test coverage reports are
found at `build/reports/jacoco/test/html/index.html`.

## Contributing

We encourage community contributions via pull requests. Please refer to our [code of conduct](./CODE_OF_CONDUCT.md) and
[contributing](./CONTRIBUTING.md) for guidance.

Code must compile with JDK 1.8, and matching unit tests are required.

### Contributing Unit Tests

We use JUnit 4 and Mockito for unit testing.

When contributing code for bug fixes or feature improvements, matching tests should also be provided. Tests must not
rely on specific S3 bucket access or credentials. To this end, S3 clients and other artifacts should be mocked as
necessary. Remember, you are testing this library, not the behavior of S3.
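
As a hedged sketch of the mocking pattern (not taken from this project's test suite; the request and response values
are purely illustrative), a test might stub the SDK's `S3Client` like this so that no credentials or network access
are needed:

```java
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.any;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

public class ExampleMockedS3Test {

    @Test
    public void headObjectIsMockedAndNeverCallsS3() {
        // Mock the SDK client so the test exercises this library's logic, not S3 itself
        S3Client client = mock(S3Client.class);
        when(client.headObject(any(HeadObjectRequest.class)))
                .thenReturn(HeadObjectResponse.builder().contentLength(42L).build());

        HeadObjectResponse response = client.headObject(
                HeadObjectRequest.builder().bucket("test-bucket").key("test-key").build());

        assertEquals(Long.valueOf(42L), response.contentLength());
    }
}
```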
