Architecture: graphql
This project is organized into vertical slices of functionality, grouped by Business Domain. This is an intentional design choice to minimize the amount of layering and boilerplate that is common in web server projects. Most simple CRUD logic should be implemented directly inside GraphQL resolver functions.
sm-graphql starts several servers/processes, all managed in the `server.js` file:
- Apollo GraphQL API server
- Apollo GraphQL Subscription API server (WebSockets)
- HTTP registration/login/etc REST API
- HTTP "Storage Server" for raw optical image upload
- (in the Storage Server) Uppy Companion server for signing upload URLs for direct dataset/molDB upload to S3
- A scheduled "cron" job for sending reminder emails to users to publish their data if they have old private projects
Additionally, TypeORM runs any new database migrations on startup.
The GraphQL API can be easily explored at https://metaspace2020.eu/graphql (or your local dev server equivalent). Set `"request.credentials": "include"` in the settings and it will use your login details from the browser cookies.
Almost all security-related logic happens in sm-graphql:
- User creation/login is handled by the REST API in `src/modules/auth/controller.ts`
- Authentication is handled by Passport middleware based on each request's cookies/JWTs/Api-Keys
- Authorization needs to be handled explicitly in GraphQL resolver functions. This is usually done when retrieving the entity, e.g.:
  - The ElasticSearch queries in `esConnector.ts` filter datasets/annotations to only include results that should be visible to the current user. Some controllers will even run a query and discard the result just to check that the current user is allowed to access the dataset.
  - When there are multiple different levels of access privilege, it should be explicit in the function names, e.g. `getDatasetForEditing`, which will raise an exception if the user isn't allowed to edit the dataset. (A rough sketch of this pattern is shown after this list.)
  - Operations that call sm-api must still handle authorization! sm-api doesn't do any authorization itself.
- As an optimization, some resolvers pass authorization information to their child resolvers through a `scopeRole` field, e.g. as described in the ScopeRole explanation below.
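As an illustration, a mutation resolver that edits a dataset might look roughly like the sketch below. This is not the actual code: the mutation name, argument types and the exact `getDatasetForEditing` signature are assumptions, and only the general shape (authorize first, then act) is the point.

```typescript
// Illustrative sketch only - names and signatures are simplified/assumed, not copied from the codebase.
const MutationResolvers: FieldResolversFor<Mutation, void> = {
  async renameDataset(source: void, args: { datasetId: string, name: string }, ctx: Context) {
    // Authorization: a getDatasetForEditing-style helper throws if the current user
    // isn't allowed to edit this dataset
    const dataset = await getDatasetForEditing(ctx.entityManager, ctx.user, args.datasetId)

    // Remember: sm-api does no authorization of its own, so this check must happen here,
    // before anything is done on the user's behalf
    dataset.name = args.name
    await ctx.entityManager.save(dataset)
    return dataset
  },
}
```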
Session cookies are managed by the Passport library and work like on most other websites: the cookie contains a signed session ID, and the actual session data is stored in Redis. Cookies are the primary authentication mechanism - (non-anonymous) JWTs and Api-Keys can only be generated by a user authenticated with a cookie.
The cookie is the same whether a user logs in with Google or Email+Password.
GraphQL requests from webapp use a JWT for authentication. This isn't really needed anymore - previously webapp and graphql were separate services and webapp handled authentication. It's just more work to clean up - getting access to the cookies in the GraphQL Subscription Server has been difficult or impossible in the past, although the subscription server library has probably fixed that by now.
Python Client also uses JWTs if Email+Password authentication is used. For Api-Key authentication, the JWT isn't needed.
API Keys use a similar authentication code path to JWTs, but have significant restrictions (only specific mutations are allowed, some queries are blocked, all usages are logged) to limit the impact if they're leaked. They're intended for use with the Python Client.
The project publication workflow allows a user to create a share link for a project. Anyone who accesses this link is allowed to see the datasets in the project - the authorization details are persisted in the visitor's session, even if they're not logged in.
Email validation links are not intended to be used continually, but for new users' convenience, clicking the link will give them a logged-in cookie for up to 1 hour after account creation. This technically counts as an authentication method from a security perspective.
It's common for one business entity to have multiple representations at the various interfaces. These follow a naming convention:
```typescript
import { Dataset } from '../../../binding'            // The "Binding" is the GraphQL schema type - no suffix is used
import { DatasetSource } from '../../../bindingTypes' // The "Source" is the type returned by resolvers, which may have additional fields for internal use, e.g. authentication data
import { Dataset as DatasetModel } from '../model'    // The "Model" is the TypeORM DB Entity class
```
When a parent resolver needs to share authentication data with one or more child resolvers, it's done through a "ScopeRole" field. E.g. only group members may see a user's email address, but the group member status is easiest to query when selecting the user. Resolvers like `Group.members` include a `scopeRole` field in their returned "source" object, which is eventually used in `User.email` to check if the email should be visible.
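A minimal sketch of this pattern follows. Everything except the `scopeRole` field itself is an assumption for illustration (the helper `findGroupMembers`, the `currentUserRole` field, the `'GROUP_MEMBER'` value and the exact source types all differ in the real code):

```typescript
// Illustrative sketch - not the real Group.members/User.email resolvers.
const GroupResolvers: FieldResolversFor<Group, GroupSource> = {
  async members(group: GroupSource, args: any, ctx: Context): Promise<UserSource[]> {
    const users = await findGroupMembers(ctx, group.id) // hypothetical helper
    // The parent resolver already knows the caller's role in this group, so it attaches
    // that information to every returned "source" object for child resolvers to use
    return users.map(user => ({ ...user, scopeRole: group.currentUserRole }))
  },
}

const UserResolvers: FieldResolversFor<User, UserSource> = {
  email(user: UserSource, args: any, ctx: Context): string | null {
    // The child resolver decides visibility based on the scopeRole passed down by the parent
    const allowed = ctx.user != null && (ctx.user.id === user.id || user.scopeRole === 'GROUP_MEMBER')
    return allowed ? user.email : null
  },
}
```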
GraphQL queries are executed hierarchically, e.g. for this query:
```graphql
query {
  currentUser {
    id
    email
    projects {
      role
      project { id name }
    }
  }
}
```
the Apollo GraphQL server performs these operations:
- Call the `Query.currentUser` resolver (which returns a `UserModel` object). The GraphQL Schema defines the return value as `User`, so nested fields use `User` resolvers.
- Check for a `User.id` resolver - there is none, so the raw `id` value from `UserModel` is used instead
- Check for a `User.email` resolver - it exists, so it's called with the `UserModel` object as the `source` parameter
- Check for a `User.projects` resolver - it exists, so it's called with the `UserModel` object as the `source`. It returns an array of `UserProjectModel` objects. The GraphQL return type is `[UserProject]`, so it's handled as an array and the nested fields use `UserProject` resolvers
- For each `UserProjectModel` object in `User.projects`' return value:
  - Check for a `UserProject.role` resolver - it doesn't exist, so `role` from `UserProjectModel` is used
  - Check for a `UserProject.project` resolver - it exists, so it's called with the `UserProjectModel` as the `source`. It returns a `ProjectModel` object, and the schema says the GraphQL return type is `Project`
  - Check for a `Project.id` resolver - it doesn't exist, so the `id` field from the `ProjectModel` is used
  - Check for a `Project.name` resolver - it doesn't exist, so the `name` field from the `ProjectModel` is used
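To make the walkthrough concrete, here is a rough sketch of what the resolvers involved could look like. It assumes `ctx.user` holds the authenticated user's details and that the models have columns like `userId`/`projectId`; the real implementations live in the corresponding controller files and handle authorization, pagination, etc.

```typescript
// Simplified, assumed sketch - not the project's actual resolvers.
const QueryResolvers: FieldResolversFor<Query, void> = {
  async currentUser(source: void, args: any, ctx: Context): Promise<UserModel | null> {
    if (ctx.user == null) return null
    return (await ctx.entityManager.getRepository(UserModel).findOne(ctx.user.id)) || null
  },
}

const UserResolvers: FieldResolversFor<User, UserModel> = {
  // No `id` resolver is defined, so GraphQL falls back to the raw UserModel.id field
  email(user: UserModel, args: any, ctx: Context): string | null {
    return user.email // visibility checks omitted here - see the ScopeRole section
  },
  async projects(user: UserModel, args: any, ctx: Context): Promise<UserProjectModel[]> {
    return await ctx.entityManager.getRepository(UserProjectModel)
      .find({ where: { userId: user.id } })
  },
}

const UserProjectResolvers: FieldResolversFor<UserProject, UserProjectModel> = {
  // No `role` resolver - the raw UserProjectModel.role field is used
  async project(userProject: UserProjectModel, args: any, ctx: Context): Promise<ProjectModel> {
    // Called once per UserProjectModel - this is where the "SELECT N+1" problem can appear
    return await ctx.entityManager.getRepository(ProjectModel).findOneOrFail(userProject.projectId)
  },
}
```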
This approach allows API consumers (i.e. webapp and python-client) to specify exactly what data they need. Well-written GraphQL resolvers are extremely flexible and often don't need any server-side code changes as the client-side application evolves.
The biggest drawback is that the hierarchical method calling makes it very easy to hit the "SELECT N+1" problem, e.g. in the above query the `UserProject.project` resolver is called for every project - if there are 10 projects and `UserProject.project` contains an SQL query, then 10 SQL queries will be issued. There's an easy solution though - see the Caching/DataLoaders section.
The type annotations around resolvers are a bit janky because we want to enforce that the code is type-safe against the `.graphql` definitions, but "graphql-binding" was the only `.graphql`-to-TypeScript interface compiler available when this was written. graphql-binding has many shortcomings in the generated types, and updated versions of the library don't seem to support our use-case, so we're currently stuck with it. If you want to fix this, there's an idea task that suggests a newer library that looks suitable. It's also viable to just write our own compiler, as GraphQL schemas are actually very simple to process if you use the programmatic API.
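For example, a home-grown compiler could start from something like the sketch below, which uses the `graphql` package's programmatic API to walk the object type definitions in a schema file. The file path and the output format are made up; the point is just how little code the schema traversal needs.

```typescript
// Minimal sketch: parse a .graphql file and walk its object type definitions.
import * as fs from 'fs'
import { parse } from 'graphql'

const doc = parse(fs.readFileSync('schemas/example.graphql', 'utf-8')) // hypothetical path
for (const def of doc.definitions) {
  if (def.kind === 'ObjectTypeDefinition') {
    // def.name.value is e.g. "User"; def.fields lists its fields and their GraphQL types
    const fields = (def.fields || []).map(f => `${f.name.value}: any /* map f.type to a TS type */`)
    console.log(`interface ${def.name.value} { ${fields.join('; ')} }`)
  }
}
```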
The compiled bindings are stored in `graphql/src/binding.ts` and are generated by `yarn run gen-binding`, which runs automatically when the Docker container starts or detects changes to `.graphql` files.
Here's an example resolver definition:
```typescript
const ProjectResolvers: FieldResolversFor<Project, ProjectSource> = {
  async hasPendingRequest(project: ProjectSource, args: any, ctx: Context, info: GraphQLResolveInfo): Promise<boolean | null> {
    ...
  },
}
```
The `FieldResolversFor<TBinding, TSource>` type allows TypeScript to enforce that the contained functions are loosely compatible with the GraphQL schema:
- `TBinding` (`Project` from `binding.ts` in this case) is used to ensure the `args` and return type of each function match the GraphQL schema
- `TSource` (`ProjectSource` in this case) is used for type-checking that all resolvers handle the "Source" (first argument) correctly
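A rough sketch of how such a helper type can be expressed is shown below. This is illustrative only - the real definition in the codebase differs, particularly in how argument types are extracted from the bindings.

```typescript
import { GraphQLResolveInfo } from 'graphql'
// `Context` is the project's per-request context type (assumed to be imported from the graphql project).

// Illustrative only - not the project's actual FieldResolversFor definition.
type FieldResolversFor<TBinding, TSource> = {
  [TField in keyof TBinding]?: (
    source: TSource,
    args: any,
    ctx: Context,
    info: GraphQLResolveInfo,
  ) => TBinding[TField] | null | Promise<TBinding[TField] | null>
}
```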
The "resolver function" (hasPendingRequest
) accepts up to 4 arguments:
-
source
(calledproject
in this case) - seeTSource
explanation above -
args
- arguments to this resolver. This only contains a value if the GraphQL schema includes arguments for this resolver. It's mostly used for Queries/Mutations, but some fields also use it e.g.Annotation.colocalizationCoeff
. NOTE:args
has the most problems with bad types generated inbindings.ts
- e.g.String
is translated tostring[]|string
andID
is translated tostring|number
. Often it's better to manually write type definitions for this argument. -
ctx
- the "Context" object for the current request. It's shared between all resolver calls in this request (allowing it to be used for caching), and includes useful stuff like theentityManager
connection to the database, and theuser
details. -
info
- almost never needed. It contains metadata about the whole GraphQL query, including which sub-fields will be resolved from the current resolver's returned object.
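As an example of manually typing `args`, a resolver can declare its own interface instead of relying on the generated bindings. The mutation below is hypothetical and only illustrates the pattern:

```typescript
// Hypothetical example - a hand-written arg type avoids the string[]|string and
// string|number unions that the generated bindings produce.
interface RenameProjectArgs {
  projectId: string
  name: string
}

const MutationResolvers: FieldResolversFor<Mutation, void> = {
  async renameProject(source: void, args: RenameProjectArgs, ctx: Context): Promise<boolean> {
    // Authorization checks omitted for brevity
    await ctx.entityManager.getRepository(ProjectModel)
      .update(args.projectId, { name: args.name })
    return true
  },
}
```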
It's common for GraphQL resolvers within the same query to need access to the same data, but it's not easy to just select the data once and pass it to the functions that need it, because resolvers are so isolated. `Context` manages a cache that's specific to the current request to help with these cases, e.g.
```typescript
const getMolecularDbById = async (ctx: Context, dbId: number): Promise<MolecularDB> => {
  return await ctx.contextCacheGet('getMolecularDbById', [dbId],
    (dbId: number) =>
      ctx.entityManager.getCustomRepository(MolecularDbRepository)
        .findDatabaseById(ctx, dbId)
  )
}
```
This ensures that no matter how many times the outer `getMolecularDbById` is called during the request, the inner `findDatabaseById` function will only be called once per value of `dbId`.
Virtually all data in graphql is dependent on the current user's permissions, so no global caches have been set up.
DataLoaders can combine many independent function calls into a single function call that receives an array of the calls' parameters, allowing optimizations e.g. using a single SQL query to get many rows by ID. Here's a good explanation. Use `Context.contextCacheGet` to create a single instance of the DataLoader for each request, e.g.
```typescript
const getDbDatasetById = async (ctx: Context, id: string): Promise<DbDataset | null> => {
  const dataloader = ctx.contextCacheGet('getDbDatasetByIdDataLoader', [], () => {
    return new DataLoader(async (datasetIds: string[]): Promise<any[]> => {
      const results = await ctx.entityManager.query('SELECT...', [datasetIds])
      const keyedResults = _.keyBy(results, 'id')
      // DataLoader requires results in the same order as the requested keys, with null for misses
      return datasetIds.map(id => keyedResults[id] || null)
    })
  })
  return await dataloader.load(id)
}
```
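For instance, if several resolvers call this helper while resolving the same request, the loads are batched into one query (the dataset IDs below are made up):

```typescript
// Both calls are collected by the DataLoader within the same tick and resolved
// with a single SQL query that receives both dataset IDs.
const [ds1, ds2] = await Promise.all([
  getDbDatasetById(ctx, '2021-01-01_10h00m00s'),
  getDbDatasetById(ctx, '2021-01-02_10h00m00s'),
])
```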
The `src/modules/auth` module contains authentication middleware and a non-GraphQL REST API for registration, login, JWT issuing, etc.
Contains the Storage Server and code to run Uppy Companion
Contains the GraphQL schema files. These are compiled by Apollo into a single schema at runtime.
Webapp's tests also use a compiled version of these schema files so that they can run a mock graphql server for the tests to call. The schema is kept in `webapp/tests/utils/graphql-schema.json` (not stored in Git) and is generated by running `yarn run gen-graphql-schema` in the graphql project. Webapp automatically calls this as part of `yarn run test`.
The Dataset Upload page was originally planned to be much more dynamic, holding different fields of metadata for different dataset types, different projects, etc. A JSON schema was developed for configuring this form. We didn't need the dynamic configurability in the end, so the schema very rarely changes. However, many parts of the upload page are still dynamically built based on the JSON schema.
These files are usually generated on container startup as part of the `deref-schema` script in `package.json`. You can rebuild them manually with `yarn run deref-schema`.