Merge pull request #48 from bu-ist/doc-updates

Add readme
bu-ist · Jan 22, 2024 · 8652ed3 · 8652ed3
2 parents 5b9d48d + 1fd6d68
commit 8652ed3

Showing 8 changed files with 176 additions and 70 deletions.
diff --git a/docs/cicd.md b/docs/cicd.md
@@ -3,9 +3,9 @@
 Since [Github Actions](https://docs.github.com/actions) runs our CI/CD pipeline, a recommended refresher on CI/CD is the github ["CI/CD explained" article](https://resources.github.com/ci-cd/).
 The implementation of this pipeline is a one-time exercise, with this as a record detailing what was done.
 
-### Overview
+## Overview
 
-A rudimentary [workflow](https://docs.github.com/en/actions/using-workflows/about-workflows#about-workflows) has been setup for deployment of the app that breaks down into the following sequence:
+A [workflow](https://docs.github.com/en/actions/using-workflows/about-workflows#about-workflows) has been set up to deploy the app. It breaks down into the following sequence:
 
 1. A feature branch is approved and merged into the main branch of the github repository for the app.
    This kicks off the [workflow](https://docs.github.com/en/actions/using-workflows/about-workflows#about-workflows).
@@ -31,7 +31,7 @@ The abridged steps are:
    The role "WordpressProtectedAssetsGithubActionsCloudformingRole" can be found [here](https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/roles/details/WordpressProtectedAssetsGithubActionsCloudformingRole?section=permissions)
    Role policy:
 
-   ```
+   ```json
    {
        "Version": "2012-10-17",
        "Statement": [
@@ -60,7 +60,7 @@ The abridged steps are:
 
    Trust relationship:
 
-   ```
+   ```json
    {
        "Version": "2012-10-17",
        "Statement": [
@@ -87,7 +87,7 @@ The abridged steps are:
 3. Create the github action (located in `.github/workflows/cicd.yml`)
    Below are relevant excerpts that show the deploy job step that uses the role (`role-session-name`)
 
-   ```
+   ```yaml
      env:
        AWS_REGION: us-east-1
 
@@ -98,9 +98,9 @@ The abridged steps are:
      ...
 
      jobs:
-   	...
-   	 deploy:
-   	 ...
+    ...
+    deploy:
+    ...
            - uses: aws-actions/configure-aws-credentials@v2
              with: 
                role-to-assume: arn:aws:iam::115619461932:role/WordpressProtectedAssetsGithubActionsCloudformingRole
@@ -121,4 +121,4 @@ OIDC (OpenID Connect) is an identity layer built on top of OAuth 2.0 that allows
 4. **Defense Against Key Exfiltration:** If an attacker gains access to the OIDC provider's private key used for token signing, they might try to insert their own public key into the provider's configuration. The OIDC thumbprint can help prevent such attacks by verifying that the public key used for token validation aligns with the trusted key.
 5. **Third-Party OIDC Providers:** In scenarios where the relying party trusts multiple OIDC providers, the thumbprint can help ensure that tokens are only accepted from the intended and validated OIDC provider, preventing tokens from unauthorized providers.
 
-In summary, the OIDC thumbprint is a security mechanism that enhances the trustworthiness of the OIDC authentication process by providing a means to verify the authenticity of the OIDC provider's token validation endpoint. It adds an additional layer of protection against various attack vectors, particularly those involving tampering, impersonation, and unauthorized token sources.
+In summary, the OIDC thumbprint is a security mechanism that enhances the trustworthiness of the OIDC authentication process by providing a means to verify the authenticity of the OIDC provider's token validation endpoint. It adds an additional layer of protection against various attack vectors, particularly those involving tampering, impersonation, and unauthorized token sources.
diff --git a/docs/image-resizing.md b/docs/image-resizing.md
@@ -0,0 +1,37 @@
+# Image resizing
+
+Where WordPress normally generates all scaled derivatives during file upload, this application serves images at specific sizes without generating all of them in advance. This can yield significant storage and processing savings for WordPress sites that define a large number of image sizes but use only a few of them. Only sizes that are actually requested are generated and stored in the bucket.
+
+It also allows original media to be stored separately from the scaled derivatives, making it easier to manage very large media libraries. For example, when cloning sites for development or testing, the original media can be copied to the new site without copying all of the scaled versions. The scaled versions are generated automatically when the media is requested.
+
+The access controls are applied based on the path of the file, not the file name, so access controls are applied consistently across the original and any scaled versions of the file.
+
+The logic of the image scaling handler and how it relates to access control is described in the following diagram:
+![Image scaling flow diagram](./images/image-scaling-flow-diagram.png)
+
+## WordPress compatible custom image crop controls
+
+Custom crop controls are available with the image scaling, either through GET parameters to the image request or through custom image sizes defined in a WordPress theme.
+
+### WordPress image sizes
+
+WordPress themes can define custom image sizes using the `add_image_size()` function. This function takes a `crop` parameter that determines if portions of the image should be removed to fit the custom size. The BU Media S3 WordPress plugin has a wp-cli command that can be used to push the custom image sizes defined in the theme to the DynamoDB table. These records use a primary key based on the site url and look like this: `SIZES#example.bu.edu/example-site`.
+
+The `getOrCreateObject()` function uses the `lookupCustomCrop()` function to load any custom sizes that may be defined for the site in the DynamoDB table. If there is a size that matches the dimensions of the requested image and that size has custom crop settings, then that crop parameter is added to the rewritten file name in S3. It is also passed to the `resizeAndSave()` function which passes it to the `sharp` library to perform the crop.
+
+By adding the crop parameter to the name of the object in S3, crop factors can change without causing a file name conflict.
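The matching step described above can be sketched as follows. This is a minimal illustration, not the repository's `lookupCustomCrop()` implementation: the helper names `parseSizeSuffix` and `findCustomCrop` and the shape of the size records are assumptions made for the example. The only conventions taken as given are the WordPress `-{width}x{height}` file name suffix and the `crop` setting from `add_image_size()`.

```javascript
// Hypothetical helper: read WordPress-style dimensions from a file name such
// as "picture-800x300.jpg". Returns null for files without a size suffix.
function parseSizeSuffix(fileName) {
  const match = /-(\d+)x(\d+)(\.[^./]+)$/.exec(fileName);
  if (!match) return null;
  return { width: Number(match[1]), height: Number(match[2]) };
}

// Hypothetical helper: find a custom size whose dimensions match the request
// and that declares a crop setting. The record shape used here
// ({ name: { width, height, crop } }) is an assumption for illustration.
function findCustomCrop(sizes, width, height) {
  for (const [name, size] of Object.entries(sizes)) {
    if (size.width === width && size.height === height && size.crop) {
      return { name, crop: size.crop };
    }
  }
  return null;
}
```

A request whose dimensions match no custom size, or match a size without crop settings, yields `null`, in which case no crop parameter would be added to the rewritten file name.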
+
+### Custom Crop Parameters
+
+In addition to crop parameters defined in WordPress themes, custom crop parameters can be added as GET parameters to the URL. For example, the following URL crops the image to its top part:
+
+```text
+https://example.bu.edu/example-site/files/2024/01/16/picture-800x300.jpg?resize-position=top
+```
+
+Currently only `top`, `bottom`, `left`, and `right` are accepted as `resize-position` values, but in theory any of the [sharp library resize position options](https://sharp.pixelplumbing.com/api-resize) could be added.
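One way to accept only the four listed values is an allow-list check on the query string. The sketch below is illustrative; the function name `getResizePosition` is an assumption and does not come from the repository.

```javascript
// The four positions the documentation lists as currently accepted.
const ALLOWED_POSITIONS = new Set(["top", "bottom", "left", "right"]);

// Hypothetical helper: return the requested resize position, or null when
// the parameter is missing or not on the allow-list.
function getResizePosition(requestUrl) {
  const value = new URL(requestUrl).searchParams.get("resize-position");
  return value !== null && ALLOWED_POSITIONS.has(value) ? value : null;
}
```

Treating unknown values as `null` rather than passing them through keeps unvalidated input from reaching the resize step.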
+
+## References
+
+- [Developer reference on add_image_size() and the crop parameter](https://developer.wordpress.org/reference/functions/add_image_size/)
+- [Sharp library resize position options](https://sharp.pixelplumbing.com/api-resize)
diff --git a/docs/images/image-scaling-flow-diagram.png b/docs/images/image-scaling-flow-diagram.png
diff --git a/docs/images/request-flow-diagram.png b/docs/images/request-flow-diagram.png
diff --git a/docs/lambda-function-description.md b/docs/lambda-function-description.md
@@ -0,0 +1,18 @@
+# Lambda Function Description
+
+The Lambda function is responsible for authorizing requests to the access point and generating resized versions of original media.
+
+The code includes the following modules:
+
+* authorizeRequest: A module that checks if a user is authorized to access an object in S3 based on the user's IP address and the site rules defined in DynamoDB.
+* getOrCreateObject: A module that retrieves an object from S3 or creates a scaled version of the object if it doesn't exist.
+
+The Lambda function (`app.js`) receives the details of the request from the event parameter, which contains the information for the native S3 WriteGetObjectResponse request. It first checks if the request is for a site on the protected sites list. It uses that information, combined with the path of the request, to determine whether the request is for a protected object. For protected objects, it runs the authorizeRequest module to check if the user is authorized to access the object.
+
+If the user is authorized or the object is unprotected, the `getOrCreateObject()` function retrieves the object from S3 or creates a sized version of the object if it doesn't exist. Finally, the Lambda function returns the image data with a 200 OK response or a 404 Not Found response if the image is not found.
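The decision flow described above can be sketched with the collaborators passed in as stand-ins. This is not the code in `app.js`: `handleRequest`, the `deps` object, `findProtectingGroup`, and the argument lists are assumptions for illustration, and the 403 response for denied requests follows the protected media documentation.

```javascript
// Sketch of the request flow. Every dependency is an injected stand-in, so
// the signatures below are assumptions rather than the real module APIs.
async function handleRequest(request, deps) {
  // Hypothetical lookup: returns an access group name, or null when the
  // request is for an unprotected object.
  const group = deps.findProtectingGroup(request);

  if (group !== null) {
    const allowed = await deps.authorizeRequest(request, group);
    if (!allowed) return { statusCode: 403 };
  }

  // Fetch the object, or create the sized version if it does not exist yet.
  const object = await deps.getOrCreateObject(request);
  if (object === null) return { statusCode: 404 };

  return { statusCode: 200, body: object };
}
```

Authorization runs before any object is fetched or generated, so a denied request never triggers a resize.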
+
+## S3 Object Lambda Pre-signed keys
+
+In typical usage, S3 Object Lambdas do not need direct read or write access to the S3 bucket. This is because the request event that the Lambda function receives contains a pre-signed URL that allows one-time read access directly without the need for additional credentials.
+
+In our case, with image resizing involved, the original request may be rewritten for a different location (for example, sized media are stored in a different location than original media). The scaled media object also may not exist yet. For these reasons, we side-step the pre-signed URL and use the AWS SDK to get the object directly from S3, at the potentially rewritten location. The Lambda is also granted write access to the bucket, so that it can save the scaled media object if it doesn't exist. These extra permissions are added as a policy (`S3CrudPolicy`) to the ObjectLambdaFunction in the SAM template.
diff --git a/docs/protected-media.md b/docs/protected-media.md
@@ -0,0 +1,55 @@
+# Protected media
+
+The Lambda function applies access restrictions based on rules determined by records in DynamoDB and by the url of the request. These urls and DynamoDB records are managed by the BU Access Control WordPress plugin.
+
+There are two ways to protect media files in this application: individual file protections and whole site protections.
+
+## Individual file protections
+
+By convention, the Lambda function recognizes files with the string `/__restricted/` in the path as being protected, and uses the next path segment as the name of the associated access group to apply. For example, a request for the following URL would be recognized as a request for a protected file:
+
+```text
+https://sites.bu.edu/example-site/files/__restricted/example-group/protected-file.pdf
+```
+
+The Lambda function would use the site url and the group name to look up the `example-group` access group in the DynamoDB table and apply the access rules to the request. If the user is authorized to access the file, the Lambda function returns the file with a 200 OK response (the standard HTTP response code for a successful request). If the user is not authorized to access the file, the Lambda function returns a 403 Forbidden response (the standard HTTP response code when access is denied).
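The path convention can be illustrated with a small parser. This is a sketch under the stated convention only; the function name `getRestrictedGroup` is an assumption and is not taken from the repository.

```javascript
const RESTRICTED_MARKER = "/__restricted/";

// Hypothetical helper: return the access group named by the path segment
// that follows "/__restricted/", or null when the path is not restricted.
function getRestrictedGroup(path) {
  const index = path.indexOf(RESTRICTED_MARKER);
  if (index === -1) return null;
  const rest = path.slice(index + RESTRICTED_MARKER.length);
  const group = rest.split("/")[0];
  return group.length > 0 ? group : null;
}
```

For the example URL above, the path `/example-site/files/__restricted/example-group/protected-file.pdf` yields the group `example-group`.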
+
+## Whole site protections
+
+In addition to individual file protections, the Lambda function can also apply access controls to an entire site. This is useful for sites that need to be completely protected from public access. The list of protected sites is stored in a single DynamoDB item with the key of `PROTECTED_SITES`. The list is a JSON encoded array of key-value pairs, where the key is the url of the site and the value is the name of the access group to be applied.
+
+This is an example of a `PROTECTED_SITES` record:
+
+```json
+[
+    {
+        "https://sites.bu.edu/example-site": "example-group"
+    },
+    {
+        "https://sites.bu.edu/another-example-site": "another-example-group"
+    }
+]
+```
+
+To apply the access controls for these protected sites efficiently, the Lambda function caches the `PROTECTED_SITES` record for up to a minute. This means that changes to the `PROTECTED_SITES` record do not take effect until the cache expires. The Lambda stores the cache in a value declared outside of the handler function and refreshes it when the value is empty or expired. Values declared outside the handler persist between invocations as a standard part of the Lambda execution environment; there is a [good summary blog post about it here](https://katiyarvipinknp.medium.com/how-to-cache-the-data-in-aws-lambda-function-using-node-js-use-tmp-storage-of-aws-lambda-2c7e1e01d923).
+
+On each request, the Lambda function checks if the site url is in the `PROTECTED_SITES` list. If so, it gets the access group name from the list and uses it to apply the access rules the same as with individual file protections.
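A minimal sketch of this caching pattern follows. It assumes a caller-supplied `loadSites` function that reads and decodes the `PROTECTED_SITES` item; that function, along with the names `getProtectedSites` and `findGroupForSite`, is hypothetical and not taken from the repository.

```javascript
const CACHE_TTL_MS = 60 * 1000; // "up to a minute"

// Declared outside the handler, so the value survives between invocations
// that reuse the same Lambda execution environment.
let cache = { sites: null, expiresAt: 0 };

// Return the cached list, reloading it when it is empty or expired.
// `loadSites` is an assumed async function that fetches the record.
async function getProtectedSites(loadSites, now = Date.now()) {
  if (cache.sites === null || now >= cache.expiresAt) {
    cache = { sites: await loadSites(), expiresAt: now + CACHE_TTL_MS };
  }
  return cache.sites;
}

// Look up the access group for a site url in the array of single-pair
// objects shown in the example record above.
function findGroupForSite(sites, siteUrl) {
  for (const entry of sites) {
    if (Object.prototype.hasOwnProperty.call(entry, siteUrl)) {
      return entry[siteUrl];
    }
  }
  return null;
}
```

Passing `now` as a parameter is a choice made here so the expiry logic can be exercised without waiting; a handler would normally rely on the default.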
+
+## Access control rules
+
+The access control rules are stored in a DynamoDB table, which is created by the CloudFormation template and populated by the BU Access Control WordPress plugin. The access rule records in the table have a composite primary key of `site` and `group`, where the `site` attribute is the url of the site and the `group` attribute is the name of the access group. The two parts of the primary key are combined with a `#` character as a delimiter. For example, the access rule record for the `example-group` access group on the `https://sites.bu.edu/example-site` site would have a primary key of `sites.bu.edu/example-site#example-group`.
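The key format in this example can be reproduced with a short helper. This is a sketch that assumes the scheme is dropped and a trailing slash is trimmed, as the example key suggests; the name `buildAccessRuleKey` is hypothetical.

```javascript
// Hypothetical helper: combine the site (host plus path, without scheme)
// and the group name using "#" as the delimiter.
function buildAccessRuleKey(siteUrl, group) {
  const { host, pathname } = new URL(siteUrl);
  const site = (host + pathname).replace(/\/+$/, ""); // trim trailing slash
  return `${site}#${group}`;
}
```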
+
+Each record has an attribute called `rules` which is a JSON encoded array of access control rules. Each rule is a key-value pair where the key is the name of the rule and the value is the rule data. The following is an example of an access rule record:
+
+```json
+{
+    "users":["webteam","authorized-user"],
+    "states":["faculty"],
+    "entitlements":["http:\/\/iam.bu.edu\/hr\/OrgUnitParent\/9999999"],
+    "ranges":["crc","bmc"],
+    "satisfy_all":null,
+    "admins":["site-admin1","site-admin2"]
+}
+```
+
+The `authorizeRequest()` function is responsible for taking the authentication data from the user request headers and comparing them to the access rules in the DynamoDB table. The function returns `true` if the user is authorized to access the file and `false` if the user is not authorized.
diff --git a/docs/x-forwarded-host-header.md b/docs/x-forwarded-host-header.md
@@ -0,0 +1,7 @@
+# Using the X-Forwarded-Host Header
+
+For use in a multi-site and multi-network WordPress setup, the X-Forwarded-Host header is used to determine the path of the S3 object that is requested by the user. This header is added to the request by the upstream Apache instance that uses `mod_proxy` to proxy the user request to the signing container.
+
+Apache sets the X-Forwarded-Host header according to the domain name of the original request. This domain name is used to construct the path of the S3 object by concatenating the domain with the url from the userRequest object.
+
+The multi-network WordPress installation is configured to upload files to S3 using the domain name of the site as the prefix. This ensures that the path of the S3 object is unique across all sites and networks.
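The concatenation can be sketched as follows. This is an illustration only: the function name `buildObjectKey`, the header casing fallback, and the choice to drop the query string are assumptions, and the access point host in any sample URL is a placeholder.

```javascript
// Hypothetical helper: prefix the request path with the forwarded domain to
// form the S3 object key. Returns null when the header is absent.
function buildObjectKey(headers, requestUrl) {
  const host = headers["X-Forwarded-Host"] || headers["x-forwarded-host"];
  if (!host) return null;
  const { pathname } = new URL(requestUrl); // the query string is dropped
  return host + pathname;
}
```

Because the domain is part of the key, two sites with the same file path on different networks resolve to different objects.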