From 2a8ea6858aad1f06390e1865ad8368e8920e1e83 Mon Sep 17 00:00:00 2001 From: Andrew Nesbitt Date: Wed, 8 May 2019 14:09:59 +0100 Subject: [PATCH] Move documents into /docs folder --- README.md | 6 +- blockers.md => docs/blockers.md | 0 categories.md => docs/categories.md | 0 docs/decentralization.md | 202 ++++++++++++++++++++++++++++ glossary.md => docs/glossary.md | 0 docs/linking.md | 65 +++++++++ docs/problems.md | 57 ++++++++ 7 files changed, 327 insertions(+), 3 deletions(-) rename blockers.md => docs/blockers.md (100%) rename categories.md => docs/categories.md (100%) create mode 100644 docs/decentralization.md rename glossary.md => docs/glossary.md (100%) create mode 100644 docs/linking.md create mode 100644 docs/problems.md diff --git a/README.md b/README.md index 1d83097..d107c6e 100644 --- a/README.md +++ b/README.md @@ -7,9 +7,9 @@ IPFS Package Managers Special Interest Group -- [Package Management Glossary](glossary.md) -- [Package Management Categories](categories.md) -- [Package Manager list](package-managers) +- [Package Management Glossary](docs/glossary.md) +- [Package Management Categories](docs/categories.md) +- [Package Manager list](docs/package-managers) ## Integrations diff --git a/blockers.md b/docs/blockers.md similarity index 100% rename from blockers.md rename to docs/blockers.md diff --git a/categories.md b/docs/categories.md similarity index 100% rename from categories.md rename to docs/categories.md diff --git a/docs/decentralization.md b/docs/decentralization.md new file mode 100644 index 0000000..b604b2a --- /dev/null +++ b/docs/decentralization.md @@ -0,0 +1,202 @@ +# Decentralized Publishing + +I started thinking about what the parts of a fully decentralized package manager would look like, ignoring to a certain extent what the actual technology doing the work would be, as a follow up from https://github.com/ipfs/package-managers/issues/50. + +In the majority of issues we've discussed mirroring and supporting existing package manager registries that generally take the form of a centralized authority managing publishing to a particular index. + +_The exception being [Registry-less](https://github.com/ipfs/package-managers/blob/master/categories.md#registry-less) package managers like Go, Swift, Carthage, Deno etc that use DNS as the centralized authority and package/release identifiers are URLs, sometimes with shorthands for GitHub urls._ + +Some terminology: + +**Release:** The actual artifact of software being published as source code and/or binary, usually delivered as a compressed archive (tar, zip etc), they may contain some metadata about the contents but do not know their own CID within a content addressable system like IPFS. + +**Version index:**: Data for a collection of one or more of releases of a particular software project, including the human readable name/number of each release, the content address of the release and usually some integrity data to confirm the content at that address matches what was expected, with IPFS that comes for free. It may also contain some information about the state of each release (deprecated, insecure etc). + +**Version index owner:** A record of one or more identities that can do one or more of the following to a version index: +- add a new release +- update an existing release +- remove an existing release + +In a content addressable world, they are actually publishing a new version of a version index. + +**Package index:** Data for a collection of one or more Version Indexes, including the human readable name for each version index and it's content address. + +**Package index owner:** A record of one or more identities that can do one or more of the following to a package index: +- add a new version index to package index +- update an existing version index +- remove an existing version index + +In a content addressable world, they are actually publishing a new version of a Package index when making one of those changes. + +![Screenshot 2019-05-01 at 14 35 13](https://user-images.githubusercontent.com/1060/57025551-53668080-6c2f-11e9-880f-1d095bc93242.png) + +## Identity + +Identity and permissions around changing indexes depending on the [category](https://github.com/ipfs/package-managers/blob/master/categories.md#implementation-categories) of package manager. + +For example: + +**File system based:** These usually store releases, version indexes and package indexes all within the same root filesystem, meaning that the administrators of that system usually have permission to update all indexes, those administrators also the ability to make more administrators but must trust them with all indexes. + +**Git based:** These usually store releases, version indexes and package indexes all within the same git repository, meaning that the administrators of that repository usually have permission to update all indexes, those administrators also the ability to make more administrators but must trust them with all indexes. + +**Database based:** These tend to have more fine grained permissions, identities can be given limited access to only certain version indexes, and only the ability to update information about those version indexes in the package indexes. Those identities often have the ability to manage access to those version indexes by other identities as well. Access can also be limited to certain actions like disabling deleting data from indexes. Administrators of the database have access to all indexes. + +**Registry-less:** The permission setup can vary with url based setups, administrators who control the DNS for a domain will have the ultimate ability to change the data available from urls on that domain, although usually the web server that the DNS record resolves to takes care of permission setups, any administrator of that web server usually also has access to change the data for that domain. + +One difference with registry-less systems is that there is usually no single administrator that has control DNS of all domains involved (ignoring top level DNS authorities like ICANN), although within a private network, a network administrator can override certain levels of DNS configuration but other security features like TLS may limit functionality in that case. + +In a decentralized package manager, identity and access controls require a different approach due to the lack of an administrator that has access over the whole system. + +Rather than there being just one version index for a project with a unique, human readable name, there can be instead any number of instances of a version index for a project, each one owned by an identity which factors into the globally unique name for that version index. + +Because there is no central naming authority, any identity has the ability to publish a new release of a project and update their own copy of a version index to include that release. + +For example, below are two different version indexes created by two different identities, which share the first two releases: + +```json +{ + "owner_identity": "dave:4ba0d198425a112c335a112c334e92848d76905e69a60de3f57e", + "version_index_name": "libfoo", + "releases": [ + { + "number": "1.0.1", + "cid": "Qm2453256c...510b2" + }, + { + "number": "1.0.2", + "cid": "Qm3f53392...4f877" + } + ] +} +``` + +```json +{ + "owner_identity": "lucy:dc99e9aa86fab83a062cff5e0808391757071a3d5dbb942802d5", + "version_index_name": "libfoo", + "releases": [ + { + "number": "1.0.1", + "cid": "Qm2453256c...510b2" + }, + { + "number": "1.0.2", + "cid": "Qm3f53392...4f877" + }, + { + "number": "1.0.3", + "cid": "Qm20d2ca...b41a4" + } + ] +} +``` + +Those indexes are effectively named `%{owner_identity}:%{version_index_name}`, for example `lucy:dc99e9aa86fab83a062cff5e0808391757071a3d5dbb942802d5:libfoo`, but these names are not very meaningful or memorable to humans. + +_Note: I'm skimming over content addressability and versioning of version indexes for simplicity here_ + +Package indexes also have a similar naming approach, where the owner has curated a number of version indexes together. Again any identity has the ability to publish a new package index, for example, two similar package indexes with libfoo from different version indexes: + +```json +{ + "owner_identity": "molly:f6f7e983afc59354c91673d637c22072ec68f710794f899ae747", + "package_index_name": "lib_node_modules", + "version_indexes": { + "libfoo": "lucy:dc99e9aa86fab83a062cff5e0808391757071a3d5dbb942802d5:libfoo", + "libbar": "fred:d0cfc2e5319b82cdc71a33873e826c93d7ee11363f8ac91c4fa3:libbar", + "libbaz": "anna:55579b557896d0ce1764c47fed644f9b35f58bad620674af23f3:libbaz" + } +} +``` + +```json +{ + "owner_identity": "andrew:d979885447a413abb6d606a5d0f45c3b7809e6fde2c83f0df", + "package_index_name": "lib_node_modules", + "version_indexes": { + "libfoo": "dave:4ba0d198425a112c335a112c334e92848d76905e69a60de3f57e:libfoo", + "libbar": "fred:d0cfc2e5319b82cdc71a33873e826c93d7ee11363f8ac91c4fa3:libbar", + "libbaz": "anna:55579b557896d0ce1764c47fed644f9b35f58bad620674af23f3:libbaz" + } +} +``` + +Note here how a package index can provide human-meaningful names to version indexes that it includes. The standalone version index `"dave:4ba0d1...3f57e:libfoo"` is mapped to a short readable name `libfoo`, much in the same way that a version index provides a mapping between the content address of a release and it's human-meaningful number, `"dave:4ba0d1...3f57e:libfoo"` maps the content address `Qm3f53392...4f877` to the number `1.0.2`. + +This is same mapping happens in centralized registries but it is usually implicit rather than explicit, the npm module `react` by default a shorthand for https://registry.npmjs.org/react, but an end user may choose to override that default and point to a private registry instead for their project: `https://internal-npm-registry.enterprise.com/react`. + +## Dependencies + +When a software project declares a dependency on another software project via it's package manager configuration, the usual practice to specify the short human-meaningful name and an acceptable range of releases that it should work with. + +Because a release can be included in many version indexes and package indexes, the release itself usually declares it's dependency requirements in a index agnostic approach. For example, rather than specifying the full URL (`https://registry.npmjs.org/react`) of a dependency, a short name or alias is used `react`, this allows different registries to host the same packages without needing to rewrite the metadata for every package with their specific registry details. + +This is key for decentralized package managers as the package index name (`andrew:d979885447a413abb6d606a5d0f45c3b7809e6fde2c83f0df:lib_node_modules`) is much less human-meaningful than a centralized registry url. + +Acceptable version numbers in dependency requires have similar properties, when declaring a range of acceptable versions, the author is encouraged to be as broad as possible in acceptable version ranges so as to reduce the likelihood of conflicts with other packages requirements of that same dependency during resolution. + +Unlike package names, version number requirements for dependencies may also attempt to take into account future releases that are likely to work with that release, for example using [semantic versioning](https://semver.org/) to allow for minor patch releases whilst avoiding major breaking change releases. + +From a decentralized package manager's point of view the key attribute of dependency requirements are that they are agnostic to the exact name of the version and package indexes used, instead preferring the local names (`react`) and number ranges (`>= 1.0.0`) which are defined within the indexes that the end user chooses to use. + +## Discovery + +One of the more challenging problems with decentralized package management is that of discovery, finding information about packages in a network. There are three main categories of discovery: + +### Search + +Search is the initial starting point of finding an open source package that solves a problem for you, for example "a library to parse xml documents in javascript". This usually involves searching for keyword matches across names, descriptions, tags and other textual metadata, with the aim of finding and comparing details of various matches, resulting in the address of one or more version indexes. + +In centralized registries, relevant data from the package index is often replicated to a separate, specialized full-text-search index. More general web search engines like Google also index the contents of individual html package pages. + +In a decentralized environment there won't be an automatic index created of every package published, services may need to be built to help index the variety of new indexes and packages being published, things like: + +- opt-in announcement of releases and index updates to indexing services +- trawling existing known indexes and source code to discover package/version indexes and releases +- peer-to-peer sharing of package/version indexes and releases where connected users also keep track of their peers published package data + +### Resolution + +Once you have a package that you wish to install, say [`react`](https://www.npmjs.com/package/react), there is are two further stages of discovery that happen before the installation is complete: + +1. A version index needs to be discovered for that package which provides list of available releases, of which one will be chosen to be installed that doesn't conflict with other existing package requirements usually the newest or highest number. + +2. A specific release many declare a number of dependencies by name, [`react@16.8.6`](https://unpkg.com/react@16.8.6/package.json) has four dependencies with version range requirements: + +```json + "dependencies": { + "loose-envify": "^1.1.0", + "object-assign": "^4.1.1", + "prop-types": "^15.6.2", + "scheduler": "^0.13.6" + }, +``` + +For each one of those dependencies the same two discovery steps need to be performed until the full dependency tree has been resolved. + +In a centralized registry this stage of discovery usually involves looking up existing version indexes within the same database, some even have specialized APIs to query the indexes in bulk because the data is all stored in the same place. + +This process can also fail in two ways: + +1. If you cannot find a version index to match the package name you are searching for, the classic example is [`left-pad`](https://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/) where a popular package's version index was removed from the central registry completely and many users couldn't successfully install that package. + +2. If you cannot find a release number within a version index that satisfies all of the dependency requirements for that package that other packages with depend on it have specified. This may be because an existing release has been removed, or because one of the dependency requirements specifies a release number that conflicts with other dependency requirements within the same tree. + +One way that decentralized publishing can help minimize these kinds of discovery problems is via validation on or before the publishing of indexes, ensuring that there are no unreachable parts of a dependency tree within an index, and then automatically advising on how to find and correct those errors if they are caught. + +### Updates + +Existing software packages often have multiple releases published as bugs are found and fixed or functionality improvements added. When these new releases are published, end users of the package need a way of finding out that those new versions are available. + +This usually is done by adding the new release to existing version indexes, there are also services that can announce new updates via email, rss, push notifications or even as a direct request to update that dependency requirement within a codebase such as [Dependabot](https://dependabot.com/). + +There may also be updates to be discovered about existing releases, such as a state change ("This release has been marked as deprecated") or a security vulnerability or legal notice has been published about it, although the contents of the release is not expected to change, the metadata about it within a version index may be updated to reflect the change. + +Whilst centralized registries make it very easy to have update information available to all users, it also makes it harder for end users to opt-out of getting that update information which can have a negative impact on the reproducibility of a projects dependency tree, either due to data being removed or new data being added that changes the results of dependency resolution in an undesirable manner. + +## Tooling + +When it comes to decentralized publishing, and consuming those decentralized indexes, new tooling will need to be built to help encourage the use of standards and compatibility between indexes as well as extra services for aiding in some of the discovery problems that arise. Experimenting and finding out what those tools might look like would be a useful task to think about before starting to implement decentralized publishing. + +One other aspect that I've yet to explore is the idea that version and package index "owners" don't need to be humans, but could instead be software that does the heavy lifting of publishing and curating indexes together. This could open the door to multiple levels of identities of ownership and connection between different indexes, perhaps even federations of indexes that get combined together by groups or communities of end users. diff --git a/glossary.md b/docs/glossary.md similarity index 100% rename from glossary.md rename to docs/glossary.md diff --git a/docs/linking.md b/docs/linking.md new file mode 100644 index 0000000..613f160 --- /dev/null +++ b/docs/linking.md @@ -0,0 +1,65 @@ +# Package indexing and linking + +**TL;DR:** have releases declare their parent via IPFS CID. This would enable a package to produce an immutable history of it's previous releases without the need for an external index, improving reproducibility and reducing reliance on large centralized indexes. + +
+ +The crux of package management is really the act of maintaining one or more **indexes** of package releases, with some rules around how to group, connect and order releases. + +In an index releases usually have a version identifier (e.g. `1.0.0`) and are grouped together by a project identifier (e.g. `react`), but those identifiers don't have to match up with the what the source code contents (often a tarball) of a release believes it is (although it usually does). + +Indexes tend to store project/release identifiers, which usually translate into an address that points to where the actual contents of a release is. + +This has some benefits, for one, it decouples the index from the storage and transportation of the release content. It also allows for the index to be updated, mutated and mirrored without changing any data inside the packages. + +Many package managers take further advantage of these indexes when declaring dependencies within an package, using only the bare minimum project identifier with an optional version range selector, relying on the index to provide the available version identifiers within that project identifier group and then the actual address of the desired release of that dependency once selected. + +That separation also requires a lot of trust in both the index and the translation of the addresses the index provides into actual package contents. There are a lot of assumptions that the relevant data that was in the index when the release was initially published will continue to be there. + +*Side note: The community solution to this trust problem is solved by many package managers by defaulting everyone to a single, large centralized index that only allows certain users to add new data to it. These large indexes often become central points of failure, becoming too large for regular users to mirror and operationally taxing for the maintainers of the index.* + +When the contents of a release has so little knowledge of itself and it's context, it always needs to be paired with a separate index. + +So what if... **a package could also be it's own index?** + +A release can't contain a hash of itself or be updated with information about later releases, but it can contain an IPFS CID of the release that came before it (except for the initial release), creating a linked list of previous releases: + +``` + 0.1.1 <----0.1.2 + / +0.0.1 <--- 0.1.0 <--- 1.0.0 <--- 1.1.0 <--- 1.1.1 + \ + 1.0.1 <--- 1.0.2 +``` + +#### Why is this useful? + +One of the most frequent lookups a package manager will do in an index is to get a list of available releases for a given project identifier, the resulting list of releases is then passed into dependency resolution. + +- Reproducibility: If you already have a particular package, the list of available versions that can be pulled from this linked list of previous packages will always be the same. + +- Trust: once published, the list of previous versions for that package is immutable, this reduces the reliance on external indexes which have the possibility to change the items in a list of available versions. + +- Size: Rather than having one large index that contains all available releases across all packages, this approach breaks it up into many small indexes, one per package and community indexes then only need index the leaf nodes of each package tree. + +#### What's missing? + +- Discovery: Users expect to be able to run `xpm update` and automatically discover newer releases than the ones they already have, which still requires a separate index, although that index only needs to point at the latest version rather than keep track of all possible versions + +#### A note about dependencies: + +You could use this same method for referencing other package linked lists (via IPFS CID) for dependencies, pointing to the tip of a branch from your release, which then points to an immutable, reproducible transitive dependency tree, again without an index: + +``` +0.0.1 <--- 0.0.2 <--- 0.1.0 <--- 1.0.0 <--- 1.1.0 <--- 1.1.1 + / / + dep@1.0.1 <--- dep@2.0.2 <--- dep@2.1.0 <--- dep@3.0.0 + / / + trans-dep@0.0.1 <--- trans-dep@0.0.1 +``` + +Whilst being very reliable for reproducibility, fast moving communities will often want to pick up the latest versions of transitive dependencies without requiring changes to intermediate dependencies. + +This is one of the pain points for [gx](https://github.com/whyrusleeping/gx), as often the developer does not have the ability to publish new releases of an intermediate dependency to the central index. + +One solution to this may be to enable users to easily create and manage their own small indexes, possibly even at a per-application level which would allow easy forking and patching of dependencies without needing to change project identifiers. I'll be writing more on this idea soon. diff --git a/docs/problems.md b/docs/problems.md new file mode 100644 index 0000000..b1a2399 --- /dev/null +++ b/docs/problems.md @@ -0,0 +1,57 @@ +# Problems with Package Managers + +I thought it would be an interesting thought experiment to outline problems that package publishers, package consumers and package manager maintainers currently experience today, rather than outlining the benefits of IPFS and looking for places where those benefits can be applied to package management. + +Doing this may highlight areas where introducing IPFS can provide significant improvements that wouldn't be a clear headline feature of IPFS. + +For example, IPFS can offer large savings in bandwidth costs, but almost all the large community package managers are offered free CDN and hosting services by Fastly, Cloudflare, Bintray etc, so that package managers don't see that as being a problem right now. + +Not all problems will be present in all package managers and IPFS definitely won't be able to fix all problems either but this list may spark some interesting ideas that would otherwise have been missed, as well as helping us to focus our efforts on the key problems that IPFS can help with. + +Feel free to add in other problems or group similar problems together, this list is off the top of my head and will likely be missing loads things. + +## Package consumers + +- Package releases being removed from registries +- Package maintainers transferring ownership to unknown entities +- Registry downtime stops building/developing of software that is dependent on it +- Not being able to reproduce a known working set of dependencies at a later date +- Not being able to opt-out of using new releases which have breaking changes +- Not being able to update/fix/swap old packages deep within a dependency tree when they cause problems +- Not being able to resolve conflicting dependency requirements when building a dependency tree +- Not knowing the status of a package (deprecated, unmaintained, broken etc) +- Not being able to confirm that the contents of a download package is the same as was originally published by the author +- Not being able to install more than one version of a dependency at once +- Not being able to install or build packages whilst offline +- Not being able to efficiently review the downloaded code within each dependency of an application +- Not being able to filter packages by compatible licenses +- Not being able to validate the license/copyright/patent details of a package +- Hosting internal or private mirrors of registries is time consuming and requires ongoing maintenance and infrastructure costs +- Language level packages that depend system level packages do not communicate or automate those dependency links effectively +- Not being able to find new packages that are relevant to consumers interests/needs +- Difficult to know when new releases of packages are published +- Difficult to known if a new release of a package is stable + +## Package Producers + +- Communicating with consumers of packages +- Maintaining compatibility with multiple platforms, architectures and run times +- Vetting contributions for security concerns +- Coordinating key signing infrastructure between maintainers is time consuming and error prone +- Difficult to test package against a range of different versions of upstream and downstream dependencies +- Difficult to know what percentage of consumers are using newer releases vs stuck on old releases due to incompatibilities +- Difficult to discourage users from using broken/deprecated releases + +## Package Manager Maintainers + +- Difficulties funding maintenance, development and infrastructure costs +- Large amount of trust placed on very small amount of maintainers +- Heavy support burden from both consumers and producers +- Having to police illegal/pirate/malicious packages from the registry +- Recovery of data after loss due to server failure or security breach +- Deploying significant changes can result in community backlash +- Difficulties communicating with package consumers and producers +- Difficult to hand over control/trust when maintainers step down +- Distributing infrastructure globally can be costly/complex +- Greatly increased infrastructure costs when storing binaries as well as source code +- Package managers often can't use themselves for managing dependencies, due to bootstrapping issues, resulting in duplicate efforts and reduced productivity