Initial commit
markjschreiber committed Apr 1, 2022
1 parent 1ed0e55 commit dbcaffd
Showing 4 changed files with 239 additions and 22 deletions.
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -1,4 +1,4 @@
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
[email protected] with any additional questions or comments.
42 changes: 30 additions & 12 deletions CONTRIBUTING.md
@@ -11,13 +11,21 @@ information to effectively respond to your bug report or contribution.

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check [existing open](https://github.com/aws/amazon-genomics-cli/issues),
or [recently closed](https://github.com/aws/amazon-genomics-cli/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20),
issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* Environment
  * Java version
  * OS version
  * Location of this extension JAR (or if it was used to compile another application)
  * IAM S3 permissions of the role used (mask sensitive information if any)
  * Bucket ACL (mask sensitive information if any)
* Steps to reproduce the error
* Expected result
* Actual result
* AWS region(s) where the issue was observed and region of the bucket being read from


## Contributing via Pull Requests
@@ -31,17 +39,22 @@ To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Use only Java 1.8 language features and ensure the code compiles with JDK 1.8.0_322 (8.322) or later patch versions.
4. Ensure unit tests cover your change and demonstrate the expected behavior.
5. Ensure unit tests do NOT require AWS credentials or S3 connectivity by using mocks for any `S3Client` or `S3AsyncClient`. Remember, unit tests test this library and not the functionality of S3.
6. Run `./gradlew check` to ensure local tests pass and test coverage reports are produced.
7. Ensure test coverage is not degraded. Reports can be found at `build/reports/jacoco/test/html/index.html`.
8. Send us a pull request, answering any default questions in the pull request interface.
9. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the
default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix),
looking at any ['help wanted'](https://github.com/aws/amazon-genomics-cli/labels/help%20wanted) issues is a great place to start.


## Code of Conduct
@@ -51,9 +64,14 @@ [email protected] with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via
our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/).
Please do **not** create a public GitHub issue.


## Licensing
See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement)
for larger changes.
2 changes: 2 additions & 0 deletions NOTICE
@@ -1 +1,3 @@
AWS Java NIO SPI for S3

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
215 changes: 206 additions & 9 deletions README.md
@@ -1,17 +1,214 @@
# AWS Java NIO SPI for S3

A Java NIO2 service provider for S3 allowing Java NIO operations to be performed on paths using the `s3` scheme.

## Using this package as a provider

There are several ways that this package can be used to provide Java NIO operations on S3 objects:

1. Use this library's JAR as one of your application's compile dependencies
2. Include the library's "shadowJar" in your `$JAVA_HOME/jre/lib/ext/` directory
3. Include this library on your class path at runtime
4. Include the library as an extension at runtime with `-Djava.ext.dirs=$JAVA_HOME/jre/lib/ext:/path/to/extension/`

## Example usage

Assuming that `myExecutableJar` is a Java application that has been built to read from `java.nio.file.Path`s, and
this library has been exposed by one of the mechanisms above, then S3 URIs may be used to identify inputs. For example:

```shell
java -jar myExecutableJar s3://some-bucket/input/file
```

If this library is exposed as an extension (see above), then no code changes or recompilation of `myExecutableJar` are
required.

## AWS Credentials

This library performs all actions using credentials according to the AWS SDK for Java [default credential provider
chain](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html). The library does not allow any
library-specific configuration of credentials. In essence, this means that when using this library you (or the service
using this library) should have, or be able to assume, a role that allows access to the S3 buckets and objects you
want to interact with.

Note also that although your IAM role may be sufficient to access the desired objects and buckets, you may still be
blocked by bucket access control lists and/or bucket policies.
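
As an illustration of the chain in action (this is the standard AWS SDK v2 API, not part of this library), the sketch
below checks which identity the default provider chain resolves to before running an application that reads S3 paths:

```java
import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;

public class CredentialsCheck {
    public static void main(String[] args) {
        // Resolves credentials from the same default chain this library relies on:
        // environment variables, system properties, profile files, or instance/container roles.
        AwsCredentials credentials = DefaultCredentialsProvider.create().resolveCredentials();
        System.out.println("Resolved access key id: " + credentials.accessKeyId());
    }
}
```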

## Reading Files

Bytes from S3 objects can be read using `S3SeekableByteChannel`, an implementation of `java.nio.channels.SeekableByteChannel`.
Because S3 is a high-throughput but high-latency service (compared to a native filesystem), the `S3SeekableByteChannel`
uses an in-memory read-ahead cache of `ByteBuffer`s optimized for the scenario where bytes are typically
read sequentially.

To perform this, the `S3SeekableByteChannel` delegates read operations to an `S3ReadAheadByteChannel`, which
implements `java.nio.channels.ReadableByteChannel`. When the first `read` operation is called, the channel reads its
first fragment and enters it into the buffer; requests for bytes in that fragment are fulfilled from that buffer. When
a buffer fragment is more than half read, all empty fragment slots in the cache are asynchronously filled. Further,
any cached fragments that precede the fragment currently being read are invalidated in the cache, freeing up space
for additional fragments to be retrieved asynchronously. Once the cache is "warm" the application should not be blocked
on I/O, up to the limits of your network connection.
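
For illustration, the sketch below reads an object through the standard NIO API. The bucket and key are hypothetical,
and it assumes the provider has been installed by one of the mechanisms above so that the `s3` scheme resolves to this
library:

```java
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical bucket and key, for illustration only
        Path path = Paths.get(URI.create("s3://some-bucket/input/file"));
        try (SeekableByteChannel channel = Files.newByteChannel(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            while (channel.read(buffer) != -1) {
                buffer.flip();
                // process the bytes in the buffer here ...
                buffer.clear();
            }
        }
    }
}
```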

### Configuration

The read-ahead buffer prefetches `n` sequential fragments of `m` bytes from S3 asynchronously. The
values of `n` and `m` can be configured to your needs by using command line properties or environment variables.

If no configuration is supplied, the values in `resources/s3-nio-spi.properties` are used; currently, 50 fragments of 5 MB each.
Each fragment is downloaded concurrently on a unique thread.

#### Environment Variables

You may use `S3_SPI_READ_MAX_FRAGMENT_NUMBER` and `S3_SPI_READ_MAX_FRAGMENT_SIZE` to set the maximum number of cached
fragments and the maximum fragment size, respectively. For example:

```shell
export S3_SPI_READ_MAX_FRAGMENT_SIZE=100000
export S3_SPI_READ_MAX_FRAGMENT_NUMBER=5
java -Djava.ext.dirs=$JAVA_HOME/jre/lib/ext:<location-of-this-spi-jar> -jar <jar-file-to-run>
```

#### Java Properties

You may use Java command line properties to set the maximum fragment size and maximum number of fragments
with `s3.spi.read.max-fragment-size` and `s3.spi.read.max-fragment-number`, respectively. For example:

```shell
java -Djava.ext.dirs=$JAVA_HOME/jre/lib/ext:<location-of-this-spi-jar> -Ds3.spi.read.max-fragment-size=10000 -Ds3.spi.read.max-fragment-number=2 -jar <jar-file-to-run>
```

#### Order of Precedence

Configurations use the following order of precedence, from highest to lowest:

1. Java properties
2. environment variables
3. default values

#### S3 limits

As each `S3SeekableByteChannel` can potentially spawn 50 concurrent fragment download threads, you may find that you exceed S3
limits, especially when the application using this SPI reads from multiple files at the same time or has multiple threads
each opening its own byte channel. In this situation you should reduce the value of `S3_SPI_READ_MAX_FRAGMENT_NUMBER`.

In some cases it may also help to increase the value of `S3_SPI_READ_MAX_FRAGMENT_SIZE`, as fewer, larger fragments will
reduce the number of requests to the S3 service.

## Design Decisions

As an object store, S3 is not completely analogous to a traditional file system. Therefore, several opinionated decisions
were made to map filesystem concepts to S3 concepts.

### Read Only

The current implementation only supports read operations. It is possible to add write operations; however, special consideration
will be needed due to the lack of support for random writes in S3 and the read-after-write consistency of S3 objects.

### A Bucket is a `FileSystem`

An S3 bucket is represented as a `java.nio.file.FileSystem` using an `S3FileSystem`. Although buckets are globally
namespaced, they are owned by individual accounts and have their own permissions, regions, and potentially endpoints.
An application that accesses objects from multiple buckets will generate multiple `FileSystem` instances.

### S3 Objects are `Path`s

Objects in S3 are analogous to files in a filesystem and are identified using `S3Path` instances, which can be built
from S3 URIs (e.g. `s3://mybucket/some-object`) or from POSIX-style patterns such as `/some-object` obtained from an
`S3FileSystem` for `mybucket`.
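A minimal sketch of both forms is shown below. It uses only the standard `Paths` and `FileSystems` factory methods;
the bucket and key are hypothetical, and whether the filesystem must be created explicitly (rather than obtained
implicitly from the URI) depends on how the provider is registered, so treat this as illustrative rather than canonical:

```java
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

public class PathExamples {
    public static void main(String[] args) throws Exception {
        // Form 1: build a path directly from an S3 URI (hypothetical bucket and key)
        Path fromUri = Paths.get(URI.create("s3://mybucket/some-object"));

        // Form 2: obtain the FileSystem for the bucket, then build a POSIX-style path from it
        FileSystem fs = FileSystems.newFileSystem(URI.create("s3://mybucket"), Collections.emptyMap());
        Path fromPattern = fs.getPath("/some-object");

        System.out.println(fromUri + " and " + fromPattern);
    }
}
```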

### No hidden files

S3 doesn't support hidden files; therefore, files named with a `.` prefix, such as `.hidden`, are not considered hidden.

### Creation time and Last modified time

S3 objects do not have a creation time, and modification of an S3 object is actually a re-write of the object, so both
attributes are given the same date (represented as a `FileTime`). If for some reason a last modified time cannot be determined,
the Unix epoch time is used.
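
As a sketch of what this means for callers, the snippet below reads both attributes through the standard
`Files.readAttributes` call. It assumes the provider supports `BasicFileAttributes` for S3 paths, and the path is
hypothetical:

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class AttributesExample {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get(URI.create("s3://mybucket/some-object"));  // hypothetical object
        BasicFileAttributes attributes = Files.readAttributes(path, BasicFileAttributes.class);
        // Both times reflect the object's last write, per the design decision above
        System.out.println("created:  " + attributes.creationTime());
        System.out.println("modified: " + attributes.lastModifiedTime());
    }
}
```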

### No symbolic links

S3 doesn't support symbolic links; therefore, no `S3Path` is a symbolic link, and any NIO `LinkOption`s are ignored.

### Posix-like path representations

Technically, S3 doesn't have directories; there are only buckets and keys. For example, in `s3://mybucket/path/to/file/object`
the bucket name is `mybucket` and the key is `/path/to/file/object`. By convention, the use of `/` in a key is
treated as a path separator, so `object` can be inferred to be a file in a directory called `/path/to/file/`,
even though that directory technically doesn't exist. This package infers directories using what we call "posix-like"
path representations. The logic for these is encoded in the `PosixLikePathRepresentation` class.

#### Directories

An `S3Path` is inferred to be a directory if the path ends with `/`, `/.`, or `/..`, or contains only `.` or `..`.

All of these paths are inferred to be directories: `/dir/`, `/dir/.`, and `/dir/..`. However, `dir` cannot be inferred to be a directory.
This is a divergence from a true POSIX filesystem, where if `/dir/` is a directory then `/dir` must also be a directory.
S3 holds no metadata that can be used to make this inference.

#### Working directory

As directories don't exist and are only inferred, there is no concept of being "in a directory". Therefore, the working
directory is always the root, and `/object`, `./object`, and `object` can all be inferred to be the same file. In addition,
`../object` is also the same file, as you may not navigate past the root, and no error is produced if you attempt to.

#### Relative path resolution

Although there are no working directories, paths may be resolved relative to one another as long as one is a directory.
So if `some/path` is resolved against `/this/location/`, the resulting path is `/this/location/some/path`.

Because directories are inferred, you may not resolve `some/path` against `/this/location`, as the latter cannot be
inferred to be a directory (it lacks a trailing `/`).
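
A short sketch of this behavior using the standard `Path.resolve` API follows; the bucket name is hypothetical, and it
assumes `getPath` and `resolve` behave as described above:

```java
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.Collections;

public class ResolveExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystems.newFileSystem(URI.create("s3://mybucket"), Collections.emptyMap());

        // "/this/location/" ends with '/' so it is inferred to be a directory and can be resolved against
        Path directory = fs.getPath("/this/location/");
        Path resolved = directory.resolve("some/path");
        System.out.println(resolved);  // expected: /this/location/some/path
    }
}
```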

#### Resolution of `..` and `.`

The POSIX path special symbols `.` and `..` are treated as they would be in a normal POSIX path. Note that this could
cause some S3 objects to be effectively invisible to this implementation. For example, `s3://mybucket/foo/./baa` is
an allowed S3 URI that is *not* equivalent to `s3://mybucket/foo/baa`, even though this library will resolve the path
`/foo/./baa` to `/foo/baa`.

## Building this library

The library uses the Gradle build system and targets Java 1.8. To build, you can simply run:

```shell
./gradlew build
```

This will run all unit tests and then generate a JAR file in `libs` with the name `s3fs-spi-<version>.jar`.

### Shadowed Jar with dependencies

To build a "fat" jar with the required dependencies (including aws s3 client libraries) you can run:

```shell
./gradlew shadowJar
```

which will produce `s3fs-spi-<version>-all.jar`. If you are using this library as an extension, this is the recommended
JAR to use. Don't put both JARs on your extension path, or you will observe class conflicts.

## Testing

To run unit tests and produce code coverage reports, run this command:

```shell
./gradlew test
```

HTML output of the test reports can be found at `build/reports/tests/test/index.html`, and test coverage reports are
found at `build/reports/jacoco/test/html/index.html`.

## Contributing

We encourage community contributions via pull requests. Please refer to our [code of conduct](./CODE_OF_CONDUCT.md) and
[contributing](./CONTRIBUTING.md) for guidance.

Code must compile with JDK 1.8, and matching unit tests are required.

### Contributing Unit Tests

We use JUnit 4 and Mockito for unit testing.

When contributing code for bug fixes or feature improvements, matching tests should also be provided. Tests must not
rely on specific S3 bucket access or credentials. To this end, S3 clients and other artifacts should be mocked as
necessary. Remember, you are testing this library, not the behavior of S3.
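
As a hedged sketch of the mocking pattern (not taken from this project's test suite; the request and response values
are purely illustrative), a test might stub the SDK's `S3Client` like this so that no credentials or network access
are needed:

```java
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.any;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

public class ExampleMockedS3Test {

    @Test
    public void headObjectIsMockedAndNeverCallsS3() {
        // Mock the SDK client so the test exercises this library's logic, not S3 itself
        S3Client client = mock(S3Client.class);
        when(client.headObject(any(HeadObjectRequest.class)))
                .thenReturn(HeadObjectResponse.builder().contentLength(42L).build());

        HeadObjectResponse response = client.headObject(
                HeadObjectRequest.builder().bucket("test-bucket").key("test-key").build());

        assertEquals(Long.valueOf(42L), response.contentLength());
    }
}
```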
