This document explains the layout of the translationConvertor (tX) conversion platform and how the components of the system should interact with one another.
If you just want to use the tX API, see tX API Example Usage.
Keep reading if you want to contribute to tX.
tX is intended to be a conversion tool for the content in the Door43 Platform. The goal is to support several different input formats, output formats, and resource types.
Development goals are:
- Keep the system modular, in order to:
  - Encourage others to contribute and make it simple to do so
  - Contain development, testing, and deployment to each individual component
  - Constrain features, bug fixes, and security issues to a smaller codebase for each component
- Continuous Deployment, which means:
  - Automated testing is required
  - Continuous integration is required
  - Checks and balances on our process
- RESTful API utilizing JSON
All code for tX is run by AWS Lambda. The AWS API Gateway service is what provides routing from URL requests to Lambda functions. Data and any required persistent metadata are stored in AWS S3 buckets. This is a "serverless" API.
Developers use Apex, Travis CI, and Coveralls.
Permissions (mostly for accessing S3 buckets) are managed by the role assigned to each Lambda function.
Modules may be written in any language supported by AWS Lambda (including some that are available via "shimming"). As of July, 2016, this list includes:
- Java (v8)
- Python (v2.7)
- Node.js (v0.10 or v4.3)
- Go lang (any version)
Modules MUST all present an API endpoint that the other components of the system can use. Modules MAY present API endpoints that the public can use.
We want our code to not know or care whether it is running in the production, development or test environment. Yet plenty of variables and storage locations for data and files vary among the three, such as bucket names, which differ between our two AWS accounts since all bucket names on AWS must be unique.
So that the clients, tx-manager and the conversion modules don't have to worry about this, everything that varies by environment will be set up in the API Gateway Stage Variables. These are variables we set up in AWS for a particular API's URL. Along with the payload sent by the requesting client, these variables will also be put into the "event" variable passed to the Lambda handler function.
For example, such variables may be (test | development | production):
- cdn_bucket = "test-cdn.door43.org" | "dev-cdn.door43.org" | "cdn.door43.org"
- api_bucket = "test-api.door43.org" | "dev-api.door43.org" | "api.door43.org"
- door43_bucket = "test-door43.org" | "dev-door43.org" | "door43.org"
- api_url = "https://test-api.door43.org" | "https://dev-api.door43.org" | "https://api.door43.org"
- gogs_user_token = "<a user token from test.door43.org:3000>" | "<a user token from dev.door43.org:3000>" | "<a user token from git.door43.org>"
- gogs_username = "<username of the above user_token>"
- env = "test" or "dev" or "prod" (just in case you want to still do something different based on environment in your code)
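A minimal sketch of how a handler might read these values. It assumes the API Gateway mapping template places the stage variables under a `vars` key and the client payload under a `data` key of the event; both key names and the `handle` function are assumptions for illustration, and the real layout depends on how the mapping template is configured.

```python
def handle(event, context):
    """Hypothetical Lambda handler that reads environment-specific settings."""
    # Assumed layout: stage variables under "vars", client payload under "data".
    stage_vars = event.get("vars", {})
    payload = event.get("data", {})

    # Bucket names and URLs come from the stage variables, never from the code,
    # so the same function runs unchanged in test, development and production.
    return {
        "env": stage_vars.get("env"),
        "upload_to": stage_vars["cdn_bucket"],
        "callback_base": stage_vars["api_url"],
        "job": payload,
    }
```

Calling `handle` with the test stage's variables would yield `test-cdn.door43.org` as the upload bucket, while the production stage would yield `cdn.door43.org`, with no branching on `env` in the code.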
The test environment should use the WA AWS account. There are 3 test buckets that have been created that mirror the production buckets:
- test-api.door43.org - for tx-manager to manage data for tX (only /tx namespace should be used) (public access disabled on this)
- test-cdn.door43.org - for conversion modules to upload their output to (only /tx namespace should be used) (public access enabled on this)
- test-door43.org - For Jekyll and /u generated files to upload to (public access enabled on this)
Use apex deploy to upload code for Lambda functions to the test environment.
The development environment should use the Door43 AWS account. There are 3 development buckets that have been created that mirror the production buckets:
- dev-api.door43.org - for tx-manager to manage data for tX (only /tx namespace should be used) (public access disabled on this)
- dev-cdn.door43.org - for conversion modules to upload their output to (only /tx namespace should be used) (public access enabled on this)
- dev-door43.org - For Jekyll and /u generated files to upload to (public access enabled on this)
The develop branch for each repo should automatically deploy to this account and make use of the above buckets.
The production environment should use the Door43 AWS account. The production buckets are:
- api.door43.org - for tx-manager to manage data for tX (only /tx namespace should be used) (public access disabled on this)
- cdn.door43.org - for conversion modules to upload their output to (only /tx namespace should be used) (public access enabled on this)
- door43.org - For Jekyll and /u generated files to upload to (public access enabled on this)
The master branch for each repo should automatically deploy to this account and make use of the above buckets.
Every part of tX is broken into components referred to as tX modules. Each tX module has one or more functions that it provides to the overall system. The list of tX modules is given here, with a full description in its respective heading below.
- tX Webhook Client - Handles webhooks from git.door43.org (Gogs). It formats the repo files, massaging them based on resource and format into a flat directory structure, zips them up, and invokes a job request with the tX Manager Module.
- tX Manager Module - Manages the registration of conversion modules and handles job requests for conversions. Makes a callback to the client when conversion job is complete.
- tX Authorization Module (actually just the python-gogs-client)
- tX Conversion Modules - modules that handle the conversion from one file format to another of one or more resources
- tX Door43 Module - When a conversion job is completed, it is invoked to make the converted file accessible through the door43.org site, setting up a new revision page for the corresponding Gogs repository. It also maintains stats on the particular project or project revision, such as views and stars.
The tX Manager Module provides the following functions:
- Maintains the registry for all tX Conversion Modules
- Authorization for requests via the tx-auth module
  - Accepts user credentials via HTTP Basic Auth (over HTTPS) to verify the calling client is a Gogs user
  - Counts requests made by each token [not implemented]
  - Blocks access if requests per minute reaches a certain threshold [not implemented]
- Handles the public API paths that tX Conversion modules register
- Job queue management. Accepts job requests with parameters given to it, the most important being a URL to a zip file of the source files, the resource type, input format, and output format. These files must be in a flat ZIP file (no sub-directories, at least not for the files of the input format), conforming to what the tX Converter expects.
- Makes a callback to the client when the job is completed or has failed, if a callback URL was given by the client when the job was requested
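As a rough illustration of how the registry and the job queue fit together, the sketch below picks a conversion module for a job request. The function name `find_module` and the in-memory list are hypothetical; the field names (`type`, `resource_types`, `input_format`, `output_format`) follow the module registration example in this document, and the real tx-manager may store and match modules differently.

```python
def find_module(registry, resource_type, input_format, output_format):
    """Return the first registered conversion module able to handle the job, or None."""
    for module in registry:
        if (
            module.get("type") == "conversion"
            and resource_type in module.get("resource_types", [])
            and input_format in module.get("input_format", [])
            and output_format in module.get("output_format", [])
        ):
            return module
    return None


# Example registry with a single entry (values taken from the md2pdf example).
registry = [
    {
        "name": "tx-md2pdf_convert",
        "type": "conversion",
        "resource_types": ["obs", "bible"],
        "input_format": ["md"],
        "output_format": ["pdf"],
    }
]
```

With this registry, a job for `obs` / `md` / `pdf` resolves to `tx-md2pdf_convert`, while a job for `bible` / `usfm` / `docx` finds no module, so the manager could reject it before queuing.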
The tX manager does NOT concern itself with nor has knowledge of:
- git.door43.org repositories
- door43.org pages
The tX Authorization Module is an authorization module for the tX system. In reality, this is just the python-gogs-client. The tx-manager module uses it to perform authorization of requests. The module handles the following:
- Grants access to the API based on a Gogs user token
Conversion modules include (some are still to be implemented):
- tx-md2html - Converts Markdown to HTML (obs, ta, tn, tw, tq)
- tx-md2pdf - Converts Markdown to PDF (obs, ta, tn, tw, tq)
- tx-md2docx - Converts Markdown to DOCX (obs, ta, tn, tw, tq)
- tx-md2epub - Converts Markdown to ePub (obs, ta, tn, tw, tq)
- tx-usfm2html - Converts USFM to HTML (bible)
- tx-usfm2pdf - Converts USFM to PDF (bible)
- tx-usfm2docx - Converts USFM to DOCX (bible)
- tx-usfm2epub - Converts USFM to ePub (bible)
Each conversion module accepts a specific type of text format as its input and the module returns a specific type of output document. For example, there is a md2pdf module that converts Markdown text into a rendered PDF. The conversion modules also require that you specify the resource type (e.g. obs, ta, tn, tw or tq), which affects the formatting of the output document.
There are currently two accepted input format types:
- Markdown - md
- Unified Standard Format Markers - usfm
A few notes on input formatting:
- Conversion modules do not do pre-processing of the text. The data supplied must be well formed.
- Conversion modules expect a single file, either:
  - A plaintext file of the appropriate format (md or usfm).
  - A zip file with multiple plaintext files of the appropriate format.
In the case of a zip file, the conversion module should process the files in alphabetical order. According to our obs file naming convention and the usfm standard, this process should yield the correct output in both cases.
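The alphabetical-order rule can be sketched as follows. The helper name `read_sources_in_order` is hypothetical, and the sketch is written for Python 3. It uses Python's default string ordering, which matches the intended order only when file names are zero-padded (for example `01.md` before `10.md`).

```python
import io
import zipfile


def read_sources_in_order(zip_bytes, extension):
    """Return (name, text) pairs for files with the given extension, sorted by name."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = sorted(
            name for name in archive.namelist()
            if name.lower().endswith("." + extension)
        )
        # Read each file only after sorting, so the output order never
        # depends on the order in which the archive was written.
        return [(name, archive.read(name).decode("utf-8")) for name in names]
```

Files with other extensions are skipped rather than rejected here; a real module might prefer to report them as warnings.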
For each type of input format, the following output formats are supported:
- PDF - pdf
- DOCX - docx
- HTML - html
Each of these resource types affects the expected input and the rendered output of the text. The recognized resource types are:
- Open Bible Stories - obs
- Scripture/Bible - bible
- translationNotes - tn
- translationWords - tw
- translationQuestions - tq
- translationAcademy - ta
Conversion modules specify a list of options that they accept to help format the output document. Every conversion module MUST support these options:
- "language": "en" - Defaults to en if not provided, MUST be a valid IETF code, may affect font used
- "css": "http://some.url/your_custom_css" - A CSS file that you provide. You can override or extend any of the CSS in the templates with your own values.
Conversion modules MAY support these options:
- "columns": [1, 2, 3, 4] - Not available for obs input
- "page_size": ["A4", "A5", "Letter", "Statement"] - Not available for HTML output
- "line_spacing": "100%"
- "toc_levels": [1, 2, 3, 4, ...] - To specify how many heading levels you want to appear in your TOC.
- "page_margins": { "top": ".5in","right": ".5in","bottom": ".5in","left": ".5in" } - If you want to override the default page margins for PDF or DOCX output.
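One way a conversion module might apply these rules is sketched below. The function `resolve_options` is hypothetical, and silently dropping an option that does not apply (rather than returning an error) is an assumption made for illustration, not documented tX behavior.

```python
# Only "language" has a documented default; "css" is used only when supplied.
REQUIRED_DEFAULTS = {"language": "en"}


def resolve_options(requested, resource_type, output_format):
    """Apply defaults and drop options that do not apply to this job."""
    options = dict(REQUIRED_DEFAULTS)
    options.update(requested)

    # "columns" is not available for obs input.
    if resource_type == "obs":
        options.pop("columns", None)
    # "page_size" is not available for HTML output.
    if output_format == "html":
        options.pop("page_size", None)
    # "page_margins" only applies to PDF or DOCX output.
    if output_format not in ("pdf", "docx"):
        options.pop("page_margins", None)
    return options
```

For example, a request for `obs` HTML output that asks for two columns and A4 pages would end up with only its `language` option, since neither of the other two applies.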
Each module is initially deployed to AWS Lambda via the apex command. After this, Travis CI is configured to manage continuous deployment of the module (see Deploying to AWS from Travis CI).
Continuous deployment of the module should be set up such that:
- the master branch is deployed to production whenever it is updated
- the develop branch is deployed to development whenever it is updated
The deployment process looks like this:
- Code in progress lives in a feature-named branch until the developer is happy and automated tests pass.
- Code is peer-reviewed, then:
  - Merged into develop until automated testing passes and it integrates correctly in development.
  - Merged into master, which triggers the auto-deployment.
Every module (except tx-manager) MUST register itself with tx-manager. A module MUST provide the following information to tx-manager:
- Public endpoints (for tx-manager to present)
- Private endpoints (will not be published by tx-manager)
- Module type (one of conversion, authorization, utility)
A conversion module MUST also provide:
- Input format types accepted
- Output format types accepted
- Resource types accepted
- Conversion options accepted
Example registration for md2pdf:
Request:
    POST https://api.door43.org/tx/module
    {
        "name": "tx-md2pdf_convert",
        "version": "1",
        "type": "conversion",
        "resource_types": [ "obs", "bible" ],
        "input_format": [ "md" ],
        "output_format": [ "pdf" ],
        "options": [ "language", "css", "line_spacing" ],
        "private_links": [ ],
        "public_links": [
            { "href": "/md2pdf", "rel": "list", "method": "GET" },
            { "href": "/md2pdf", "rel": "create", "method": "POST" }
        ]
    }
Response:
    201 Created
    {
        "name": "md2pdf",
        "version": "1",
        "type": "conversion",
        "resource_types": [ "obs", "bible" ],
        "input_format": [ "md" ],
        "output_format": [ "pdf" ],
        "options": [ "language", "css", "line_spacing" ],
        "private_links": [ ],
        "public_links": [
            { "href": "/md2pdf", "rel": "list", "method": "GET" },
            { "href": "/md2pdf", "rel": "create", "method": "POST" }
        ]
    }
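A module could build and send this request along the following lines. This is a sketch for Python 3 using only the standard library; the helper names `build_registration` and `register` are hypothetical, and authentication is left out because this document does not specify how a registration request is authorized.

```python
import json
import urllib.request


def build_registration(name, version, resource_types, input_format,
                       output_format, options, path):
    """Build the registration payload for a conversion module."""
    return {
        "name": name,
        "version": version,
        "type": "conversion",
        "resource_types": list(resource_types),
        "input_format": list(input_format),
        "output_format": list(output_format),
        "options": list(options),
        "private_links": [],
        "public_links": [
            {"href": path, "rel": "list", "method": "GET"},
            {"href": path, "rel": "create", "method": "POST"},
        ],
    }


def register(api_url, registration):
    """POST the registration to tx-manager and return the HTTP status code."""
    request = urllib.request.Request(
        api_url + "/tx/module",
        data=json.dumps(registration).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Passing the `api_url` stage variable to `register` keeps the module free of environment-specific URLs. Serializing with `json.dumps` also avoids hand-written JSON mistakes such as trailing commas.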
The tX Webhook Client is a client to tX. The purpose of this client is to pre-process the git repos from Gogs' webhook notifications, send them through tX, and upload the resulting HTML files to the cdn.door43.org bucket. The process looks like this:
When a Gogs webhook is triggered, the client:
- Accepts the default webhook notification from git.door43.org
- Gets the data from the repository for the given commit (via an HTTPS request that returns a zip file)
- Identifies the Resource Type (via the name of the repo or the manifest.json file)
- Formats the request (turns the repo into valid Markdown or USFM file(s), then creates a zip file with the files in the root of the archive)
- Sends the valid data (in zip format) through an API call to the tX Manager Module, requesting HTML output, and should receive a JSON confirmation that the job has been queued ('status' = 'requested')
- Uploads an initial build_log.json file to the cdn.door43.org bucket as u/<owner>/<repo>/<commit>/build_log.json with information returned from the call to the tX Manager (this file will be updated when the job is completed)
- Uploads the repo's manifest.json file to the cdn.door43.org bucket as u/<owner>/<repo>/<commit>/manifest.json
- Returns its own JSON response, which will be seen in the Gogs webhook request results, stating that the request was made and giving the source ZIP and expected output ZIP locations
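The "files in the root of the archive" requirement can be sketched like this. The helper `make_flat_zip` is hypothetical and flattens by file name only; the real client massages files per resource type and format, which this sketch does not attempt.

```python
import io
import posixpath
import zipfile


def make_flat_zip(files):
    """Build a ZIP whose entries all sit in the archive root.

    `files` maps a repository path (e.g. "content/01.md") to its bytes.
    """
    buffer = io.BytesIO()
    seen = set()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in sorted(files):
            flat_name = posixpath.basename(path)
            # Two files with the same name in different directories would
            # overwrite each other once flattened, so fail loudly instead.
            if flat_name in seen:
                raise ValueError("duplicate file name after flattening: " + flat_name)
            seen.add(flat_name)
            archive.writestr(flat_name, files[path])
    return buffer.getvalue()
```

The returned bytes can then be uploaded and referenced by URL in the job request sent to the tX Manager Module.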
When the callback is made, the client:
- Extracts each file from the resulting output ZIP file to the cdn.door43.org bucket with the prefix key of u/<owner>/<repo>/<commit>/
- Updates the u/<owner>/<repo>/<commit>/build_log.json in the cdn.door43.org bucket with the information given by tX Manager through the callback (e.g. conversion status, log, warnings, errors, timestamps, etc.)
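The key layout and the build log update might look like the sketch below. Both helper names are hypothetical, and so are the callback field names and values in the usage note other than `status`; this document only names the `requested` status.

```python
import json


def commit_prefix(owner, repo, commit):
    """Return the key prefix under which a commit's files are stored."""
    return "u/{0}/{1}/{2}/".format(owner, repo, commit)


def merge_build_log(initial_log_json, callback_data):
    """Return the updated build_log.json text after a callback.

    Fields supplied by the callback overwrite those written when the job
    was first requested; fields the callback omits are kept as they were.
    """
    build_log = json.loads(initial_log_json)
    build_log.update(callback_data)
    return json.dumps(build_log, indent=2, sort_keys=True)
```

For example, `commit_prefix("owner", "repo", "abc123") + "build_log.json"` gives `u/owner/repo/abc123/build_log.json`, and merging a callback containing a new `status` and a list of `warnings` replaces the initial `requested` status while preserving the rest of the log.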
tX Webhook Client does NOT concern itself with:
- Converting files for presentation on door43.org
The tX Door43 Module contains processes that will update the door43.org bucket/site when conversion jobs are completed. It works behind the scenes, so it is not an API. Its tasks include:
- Converting the files for presentation on door43.org when a conversion job is completed and files have been deployed to the cdn.door43.org bucket, applying a template and other styling and JavaScript, and deploying them to the door43.org bucket, prefixed with u/<owner>/<repo>/<commit>
- Updating stats of a project or revision such as views, followers and stars from git.door43.org
Requirements for a Python script must reside within the directory of the function that calls them. A requirement for the convert function should exist within functions/convert/.
The list of requirements for a function should be in a requirements.txt file within that function's directory, for example: functions/convert/requirements.txt.
Requirements must be installed before deploying to Lambda. For example:
pip install -r functions/convert/requirements.txt -t functions/convert/
The -t option tells pip to install the files into the specified target directory. This ensures that the Lambda environment has direct access to the dependency.
If you have any Python files in subdirectories that also have dependencies, you can import the ones available in the main function by using sys.path.append('/var/task/').
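A short sketch of that pattern, placed at the top of a module that lives in a subdirectory of the function. The constant name `TASK_ROOT` is an assumption for illustration.

```python
import sys

# /var/task/ is the directory where the function's deployment package,
# including dependencies installed with `pip install -t`, is available.
TASK_ROOT = "/var/task/"

# Guard against adding the same entry again on repeated imports.
if TASK_ROOT not in sys.path:
    sys.path.append(TASK_ROOT)

# From here on, packages installed into the main function directory can be
# imported by this module even though it sits in a subdirectory.
```

Outside Lambda the directory does not exist, so the extra path entry is harmless and local tests keep using locally installed packages.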
Lastly, if you install dependencies for a function you need to include the following in an .apexignore file:
*.dist-info
There is a similar API that has good documentation at https://developers.zamzar.com/docs. This can be consulted if we run into blockers or need examples of how to implement tX.