Hatrac (pronounced "hat rack") is a simple object storage service for web-based, data-oriented collaboration. It presents a simple HTTP RESTful service model with:
- Hierarchical data naming
- Access control suitable for collaboration
- Trivial support for browser-based applications
- Referential stability for immutable data
- Atomic binding of names to data
- Consistent use of distributed data
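As a rough illustration of this service model, the following Python sketch drives the REST API with the `requests` library. The base URL, credentials, and object paths are placeholders for a hypothetical deployment, and details such as the `application/x-hatrac-namespace` media type and the versioned `Location` response follow our reading of the REST API documentation; treat this as a sketch, not a normative client.

```python
# Hypothetical walk-through of the Hatrac service model using "requests".
# The deployment URL and credentials below are placeholders.
import requests
from urllib.parse import urljoin

BASE = "https://example.org/hatrac"   # assumed deployment prefix
auth = ("someuser", "somepassword")   # stand-in; real deployments authenticate via webauthn2

# Hierarchical naming: create a nested namespace.
r = requests.put(
    BASE + "/studies/study1",
    headers={"Content-Type": "application/x-hatrac-namespace"},
    auth=auth,
)
r.raise_for_status()

# Atomic binding of a name to data: the object becomes visible all at
# once, and the Location header names the immutable version just created.
with open("image.tiff", "rb") as f:
    r = requests.put(
        BASE + "/studies/study1/image.tiff",
        data=f,
        headers={"Content-Type": "image/tiff"},
        auth=auth,
    )
r.raise_for_status()
version_path = r.headers["Location"]  # e.g. /hatrac/studies/study1/image.tiff:VERSION

# Referential stability: the bare name resolves to the current version,
# while the versioned name always denotes this exact content.
current = requests.get(BASE + "/studies/study1/image.tiff", auth=auth)
pinned = requests.get(urljoin(BASE, version_path), auth=auth)
```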
Hatrac is research software but runs in production for several informatics projects using the filesystem storage backend. It is developed in a continuous-integration style with an automated regression test suite covering all core API features.
As a protocol, the Hatrac REST API can be easily accessed by browser-based applications or any basic HTTP client library. The API can also be easily re-implemented by other services if interoperability is desired. As a piece of software and reference implementation, it targets two deployment scenarios:
- A standalone Linux Apache HTTPD server with local filesystem storage of data objects and PostgreSQL storage of namespace and policy metadata.
- An Amazon AWS scalable service with S3 storage of data objects and RDS PostgreSQL storage of namespace and policy metadata.
Both scenarios share much of the same basic software stack, though additional administrative effort is required to assemble a scalable deployment.
Hatrac is developed and tested primarily on an enterprise Linux distribution with Python 3.x. It has a conventional web service stack:
- Apache HTTPD
- mod_wsgi
- flask web framework
- psycopg2 database driver
- PostgreSQL 10 or newer
- webauthn2 security adaptation layer (another product of our group)
There is not much installation automation yet. Please see our detailed installation instructions. In a deployed system:
- The HTTP(S) connection is terminated by Apache HTTPD.
- The Hatrac service code executes as the `hatrac` daemon user.
- The service configuration is loaded from `~hatrac/hatrac_config.json` (see the example configuration after this list):
  - Security configuration
  - Storage back-end
- All object naming and detailed policy metadata are stored in the RDBMS.
- All bulk object data is stored in the configured storage backend.
- Client authentication context is determined by callouts to the webauthn2 module:
  - Client identity
  - Client roles/group membership
- Authorization of service requests is determined by the service code:
  - ACLs retrieved from the RDBMS
  - ACLs intersected with the authenticated client context
- The RDBMS and backend storage are accessed using deployed service credentials, which have no necessary relation to the client security model.
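For illustration only, a filesystem-backed `~hatrac/hatrac_config.json` might combine the storage and security settings along these lines; the key names and values below are assumptions for this sketch (an S3 deployment would select the S3 backend and its bucket settings instead), so consult the detailed installation instructions for the authoritative schema:

```json
{
    "storage_backend": "filesystem",
    "storage_path": "/var/www/hatrac",

    "database_type": "postgres",
    "database_dsn": "dbname=hatrac",

    "sessionids_provider": "webcookie",
    "sessionstates_provider": "database",
    "clients_provider": "database",
    "attributes_provider": "database"
}
```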
The purpose of Hatrac is to facilitate data-oriented collaborations. To understand our goals requires an understanding of what we mean by data-oriented architecture and what we mean by data-oriented collaboration. With that understanding, we can then consider specific examples of object storage systems and their suitability to purpose.
Data orientation, like service orientation, is a philosophy for decomposing a complex system into modular pieces. These pieces, in turn, are meant to be put back together in novel combinations. However, the nature of the pieces and the means of recombination are different:
- Service orientation: a service encapsulates a (possibly hidden) state model and a set of computational behaviors behind a message-passing interface which can trigger computation and state mutation. Over time, new services may be developed to synthesize a behavior on top of existing services, and compatible services may be developed to support the same message protocol while having differences in their internal computation or state.
- Data orientation: a universe of actors take on roles of producing, referencing, and/or consuming data objects. There is a basic assumption that actors will somehow discover and consume data objects, leading them to synthesize new data products. Over time, the set of available data evolves as a result of the combined activity of the community of actors.
Crucially, data itself is the main shared resource and point of integration in a data-oriented architecture. This may include digests, indices, and other derived data products as well as the synthesis and discovery of new products. Services with domain-specific message interfaces and "business logic" are considered to be transient just like any other actor; they cannot be relied upon to mediate access to data over the long term. Rather, community activity causes the data to evolve over time while individual services, applications, and actors come and go. The data artifacts are passed down through time, much like the body of literature and knowledge passed through human civilization. This requires different mechanisms for the collection and dissemination of data among a community while remaining agnostic to the methods and motivations of the actors.
To simplify the sharing of data resources while remaining agnostic to the methods and motivations of actors, we choose a simple collaborative object sharing semantics:
- Generic bulk data: objects have a generic byte-sequence representation that can be further interpreted by actors aware of the encoding or data model used to produce the object.
- Atomic object creation and naming: an object becomes visible all at once or not at all.
- Immutable objects and object references: any reference to or attempt to retrieve an object by name always denotes the same object content, once it is defined.
- Accountable creation and access: policies can be enforced to restrict object creation or retrieval within a community of mutually trusting actors.
- Delegated trust: the coordination of naming and policy management can be delegated among community members to allow for self-help and a lower barrier to entry (see the sketch following this list).
- Hierarchical naming: perhaps less important than the preceding requirements, we find that many users are more comfortable with a hierarchical namespace which can encode some normalized semantic information about a set of objects, provide a focal point for collaboration tasks and conventions, and scope policy that relates to a smaller sub-community.
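As a concrete sketch of the delegated-trust and scoped-policy semantics above, the Python fragment below grants roles access under a single sub-namespace. The `;acl` addressing scheme and the `subtree-*` access mode names reflect our reading of the REST API, and the deployment URL, role names, and credentials are placeholders:

```python
# Hypothetical policy delegation scoped to one sub-namespace.
import requests

BASE = "https://example.org/hatrac"   # assumed deployment prefix
auth = ("admin", "password")          # placeholder administrative credentials

# Let a sub-community role create objects anywhere under /projects/alpha,
# delegating day-to-day curation without granting site-wide rights.
requests.put(
    BASE + "/projects/alpha;acl/subtree-create/proj-alpha-curators",
    auth=auth,
).raise_for_status()

# Let a broader membership role read everything under the same subtree.
requests.put(
    BASE + "/projects/alpha;acl/subtree-read/proj-alpha-members",
    auth=auth,
).raise_for_status()
```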
It is important to note that we restrict Hatrac to bulk data where a generic byte-sequence representation makes sense. We realize that effective collaboration may benefit from structured metadata and search facilities, but these are considered out of scope for Hatrac. We believe that these object storage semantics are necessary but not sufficient for data-oriented collaboration, and we are simultaneously exploring related concepts for data-oriented search and discovery.
To support a range of collaboration scenarios we have observed with our scientific peers, we adopt several additional implementation requirements:
- Integrate into conventional web architectures
- Use URL structure for naming data objects in a federated universe
- Use HTTP protocol for accessing and managing data objects
- Flexible deployment scenarios
- Run in a traditional server or workstation for small groups with local resources
- Run in a conventional hosted/colocated server
- Run in a cloud/scale-out environment
- Configurable client identity and role providers
- Standalone database
- Enterprise directory
- Cloud-hosted identity and group providers
- Configurable storage and "bring your own disks" scenarios
- Store objects as files in a regular filesystem
- Store objects in an existing object system (such as Amazon S3)
- Allow communities to mix and match these options
Amazon's Simple Storage Service (S3) can provide the sharing semantics we seek when object versioning is enabled on a bucket. However, it has drawbacks in terms of deployment options, security integration, and ease of use:
- There is no simple standalone server option for groups who wish to locally host their data and "bring their own disks".
- The access control model in S3 is with respect to Amazon AWS accounts and does not allow simple integration with other user account systems.
- Awkward support for hierarchical naming and policy.
Of course, we support S3 as a storage target. The purpose of Hatrac is to enable the collaboration environment on top of conventional storage options like S3 which are more focused on providing infrastructure than coordinating end-user collaboration.
File-sharing services such as Dropbox make it trivial to share data files between users but pose several challenges for larger collaborations:
- Lack of immutability guarantees to provide stable references to data.
- All-or-nothing trust model: a user invited into the shared folder has full privileges to add, delete, or modify content.
- Shared/replication model: every user potentially downloads the entire collection, which may be inconvenient or impractical for large, data-intensive collaborations.
A web server such as Apache HTTPD provides many convenient options for security integration and download of objects. However, there is a large gap in supporting upload or submission of new objects and management of object policy by community members. Historically, this has been handled in an ad hoc fashion by each web service implemented on top of Apache. As such an add-on service, Hatrac provides the minimal data-oriented service interface we seek on top of Apache HTTPD, so we can build different services and applications that share data resources without adding a private, back-end data store to each one.
Please direct questions and comments to the project issue tracker at GitHub.
Hatrac is made available as open source under the Apache License, Version 2.0. Please see the LICENSE file for more information.
Hatrac is developed in the Informatics group at the USC Information Sciences Institute.