Skip to content

Commit

Permalink
Version file tree diff: design doc
Browse files Browse the repository at this point in the history
ref #11319
  • Loading branch information
stsewd committed Jul 30, 2024
1 parent 35ec392 commit 2ba73fa
Showing 1 changed file with 228 additions and 0 deletions.
228 changes: 228 additions & 0 deletions docs/dev/design/file-tree-diff.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,228 @@
Version file tree diff
======================

Goals
-----

- Compare files from two versions to identify the files that have been added, removed, or modified.
- Provide an API for this feature.
- Integrate this feature to suggest redirects on files that were removed.
- Integrate this feature to list the files that changed in a pull request.

Non-goals
---------

- Replace the `docdiff <https://github.com/readthedocs/addons?tab=readme-ov-file#docdiff>`__ feature from addons.
That works on the client side, and it's good for comparing the content of files.

Current problems
----------------

Currently, when a user opens a PRs, they need to manually search for the files of interest (new and modified files).
We have a GitHub action that links to the root of the documentation preview, that helps a little, but it's not enough.

When files are removed or renamed, users may not be aware that a redirect may be needed.
We track 404s in our traffic analytics, but they don't keep track of the version,
and it may be too late to add a redirect when users are already seeing a 404.

In the past, we haven't implemented those features, because it's hard to map the source files to the generated files,
since that depends on the build tool and configuration used by the project.

Git providers may already offer a way to compare file trees, but again,
they work on the source files, and not on the generated files.

All hope was lost for having nice features like this, until now.

Proposed solution
-----------------

Since redirects and files of interest are related to the generated files.
Instead of working over the source files, we will work over the generated files, which we have access to.

The key points of this feature are:

- Get the diff of the file tree between two versions.
- Expose that as an API.
- Integrate that in PR previews.

Diff between two S3 directories
-------------------------------

We are already using ``rclone`` to speed up uploads to S3,
``rclone`` has a command (``rclone check``) to return the diff between two directories.
For this, it uses the metadata of the files, like size and hash
(it doesn't download the files).

.. code:: console
$ ls a
changed.txt new.txt unchanged.txt
$ ls b
changed.txt deleted.txt unchanged.txt
$ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
+ new.txt
- deleted.txt
= unchanged.txt
* changed.txt
The result is a list of files with a mark indicating if they were added, removed, or modified.
The result is easy to parse.

To start, we will only consider HTML files (``--include=*.html``).

Lines changed between two files
-------------------------------

Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed.
This is useful to link to the most relevant files that changed in a PR.

.. code:: console
$ cat a.txt
One
Two
Three
Four
Five
$ cat b.txt
Ore
Three
Four
Five
Six
$ diff --side-by-side --suppress-common-lines a.txt b.txt
One | Ore
Two <
> Six
.. note::

Taken from https://stackoverflow.com/questions/27236891/diff-command-to-get-number-of-different-lines-only.

The command will return only the lines that changed between the two files.
We can just count the lines, or maybe even parse each symbol to check if the line was added or removed.

Another alternative is to use the `difflib <https://docs.python.org/3/library/difflib.html>`__ module,
the only downside is that it doesn't distinguish lines that were changed from lines that were added or removed.
But maybe that's ok? Do we really need to know if a line was changed instead of added or removed?

.. code:: python
import difflib
diff = difflib.ndiff(['one', 'two', 'three', 'four'], ['ore', 'three', 'four', 'five'])
print(list(diff))
# ['+ ore', '- one', '- two', ' three', ' four', '+ five']
Storing results
---------------

Doing a diff between two versions can be expensive, so we need to store the results.

We can store the results in the DB (``VersionDiff``), or in S3 (``diff/project/version-a/version-b.json``), or maybe a combination of both.
The information to store would contain some information about the versions compared, the build, and the diff itself.

.. code:: json
{
"version_a": {"id": 1, "build": {"id": 1}},
"version_b": {"id": 2, "build": {"id": 2}},
"diff": {
"added": [{"file": "new.txt"}],
"removed": [{"file": "deleted.txt"}],
"modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
}
}
The information is stored in a similar way that it will be returned by the API.
Things important to note:

- We need to take into consideration the diff of the latest successful builds only.
If any of the builds from the stored diff don't match the latest successful build of any of the versions,
we need to the diff again.
- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``.
We can store that information as well, or just calculate it on the fly.
- The list of files are objects, so we can store additional information in the future.
- When a file has been modified, we also store the number of lines that changed.
We could also show this for files that were added or removed.
- Instead of using the slugs on storage (if we decide to use S3), we could use the IDs, that will prevent the need to update the data if the slugs change.
But if the slugs change, the builds will probably be different, so we will need to update the data anyway.
- If a project or version is deleted (or deactivated), we should delete the diff as well.

We could store the changed files sorted by the number of changes, or make that an option in the API,
or just let the client sort the files as they see fit.

API
---

This operation is expensive, so it won't be available to unauthenticated users.
And a diff can only be done between versions of the same project that the user has access to.

The endpoint will be:

.. code:: http
GET /api/v3/projects/{project_slug}/versions/{version_slug}/diff/{other_version_slug}/
And the response will be:

.. code:: json
{
"version_a": {"id": 1, "build": {"id": 1}},
"version_b": {"id": 2, "build": {"id": 2}},
"diff": {
"added": [{"file": "new.txt"}],
"removed": [{"file": "deleted.txt"}],
"modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
}
}
The version and build can be the full objects, or just the IDs and slugs.

We will generate a lock on this request, to avoid multiple calls to the API for the same versions.
We can reply with a ``202 Accepted`` if the diff is being calculated in another request.

Integrations
------------

You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong!

Doing the API call is easy, but knowing *when* to call it is hard.
We need to call the API after the build has finished successfully,
or we will be comparing the files of an incomplete or stale build.

Luckily, we have a webhook that tells us when a build has finished successfully.
But, we don't want users to have to implement the integration by themselves.

We could:

- Use this as an opportunity to explore using GitHub Apps.
- Request additional permissions in our existing OAuth2 integration (``project`` scope). Probably not a good idea.
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
Maybe don't even expose the API to the public, just use it internally.
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
to trigger the action from our webhook. This requires the user to do some additional setup,
and for our webhooks to support custom headers.

Possible issues
---------------

Even if we don't download files from S3, we are still making calls to S3, and AWS charges for those calls.
But since we are doing this on demand, and we can cache the results, we can minimize the costs
(maybe is not that much).

``rclone check`` returns only the list of files that changed,
if we want to make additional checks over those files, we will need to make additional calls to S3.

We should also just check a X number of files, we don't want to run a diff of thousands of files,
and also a limit on the size of the files.

Future improvements and ideas
-----------------------------

- Detect moved files.
- Detect changes in sections of HTML files.
- Expand to other file types?
- Integrate with addons?
- Allow doing a diff between versions of different projects?

0 comments on commit 2ba73fa

Please sign in to comment.