eLabFTW integration via Galaxy file source #19319

kysrpex · 2024-12-12T13:17:38Z

Closes #18665, requires #19154 and #19256 (and also includes them, sorry; to review have a look at c00d250). Implements a file source to integrate Galaxy with eLabFTW.

eLabFTW revolves around the concepts of experiment and resource. Experiments and resources can have files attached to them. To get a quick overview, try out the live demo at demo.elabftw.net. The scope of this implementation is exporting data from and importing data to eLabFTW as file attachments of already existing experiments and resources. Each user can configure their preferred eLabFTW instance entering its URL and an API Key.

File sources reference files via a URI, while eLabFTW uses auto-incrementing positive integers. For more details read #18665. This leads to the need to declare a mapping between said identifiers and Galaxy URIs.

Those take the form elabftw://demo.elabftw.net/entity_type/entity_id/attachment_id, where:

entity_type is either 'experiments' or 'resources'
entity_id is the id (an integer in string form) of an experiment or resource
attachment_id is the id (an integer in string form) of an attachment

This implementation uses both aiohttp and the requests libraries as underlying mechanisms to communicate with eLabFTW via its REST API. A significant limitation of the implementation is that, due to the fact that the API does
not have an endpoint that can list attachments for several experiments and/or resources with a single request, when
listing the root directory or an entity type recursively, a list of entities has to be fetched first, then to fetch
the information on their attachments, a separate request has to be sent for each one of them. The aiohttp library makes it bearable to recursively browse instances with up to ~500 experiments or resources with attachments by sending them
concurrently, but ultimately solving the problem would require changes to the API from the eLabFTW side.

This is the third and last PR of a series of PRs that integrate eLabFTW with Galaxy via a file source (together they address issue #18665):

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. Run eLabFTW, for example using Docker Compose.
2. Copy the configuration samples from file_sources_conf.yml.sample and user_preferences_extra_conf.yml.sample to your own configuration files.
3. Create an account and generate an API Key on eLabFTW.
4. Configure the endpoint and API Key on the user preferences page.
5. Create an experiment or resource in eLabFTW and attach a file.
6. Import the file to your history from the eLabFTW file source.
7. Export your history to the eLabFTW file source.

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

The method `write_from()` of `SingleFileSource` and `BaseFilesSource` reads a local file from `native_path` and saves it to `target_path` on a file source. This commit allows the service backing the file source to choose which will be the path of the saved file, meaning that `target_path` and the actual path where the file can be recovered later do not have to match. The latter is the return value of `write_from()`. Therefore, all usages of `write_from()` have also been refactored to consider the paths chosen by the file source's backing service. In addition, when exporting a history, the URI that the service backing the file source assigns to it will be saved to the history export result metadata object.

Change the implementation so that the definitions of `_write_from` in classes that inherit from `BaseFilesSource` do not need to change.

Define a new class `FileSourceModelExportStore` that abstracts the common details of `BcoModelExportStore`, `ROCrateArchiveModelExportStore`, `TarModelExportStore` and `BagArchiveModelExportStore`. This new class manages exports to file sources, from where data can be retrieved later on via a URI. It takes the responsibility of creating a temporary directory to set up the file to export and uploading it to the file source.

The method `list()` from `galaxy.files.sources.BaseFilesSource` lists the directories and files within a file source. An optional keyword argument `recursive` (`False` by default) lets it recursively retrieve directories and files within a specific directory. This operation is very cheap in terms of CPU and expensive in IO terms, be it network or filesystem IO. Depending on how the underlying system is built, it may support retrieving directories and files recursively or not. If it does not, then every time a directory is listed, it is necessary to make another request to list each subdirectory. This may end up involving hundreds of requests. Done sequentially, this can be extremely slow, especially if each one involves network access. This commit makes the `list()` method asynchronous, which enables Galaxy to wait for the underlying system to complete the requests concurrently, resulting in a massive speedup. The price to pay is the extra complexity of using the async primitives. Since this change implies that all functions in the chain up to the API endpoints and the test functions must also be made asynchronous, this commit also takes care of it.

kysrpex · 2024-12-12T13:21:18Z

Eventually I can add automated tests, I would rather ask for your feedback first though.

eLabFTW [1] revolves around the concepts of experiment [2] and resource [3]. Experiments and resources can have files attached to them. To get a quick overview, try out the live demo [4]. The scope of this implementation is exporting data from and importing data to eLabFTW as file attachments of already existing experiments and resources. Each user can configure their preferred eLabFTW instance entering its URL and an API Key. File sources reference files via a URI, while eLabFTW uses auto-incrementing positive integers. For more details read galaxyproject#18665 [5]. This leads to the need to declare a mapping between said identifiers and Galaxy URIs. Those take the form `elabftw://demo.elabftw.net/entity_type/entity_id/attachment_id`, where: - `entity_type` is either 'experiments' or 'resources' - `entity_id` is the id (an integer in string form) of an experiment or resource - `attachment_id` is the id (an integer in string form) of an attachment This implementation uses both `aiohttp` and the `requests` libraries as underlying mechanisms to communicate with eLabFTW via its REST API [6]. A significant limitation of the implementation is that, due to the fact that the API does not have an endpoint that can list attachments for several experiments and/or resources with a single request, when listing the root directory or an entity type _recursively_, a list of entities has to be fetched first, then to fetch the information on their attachments, a separate request has to be sent _for each one_ of them. The `aiohttp` library makes it bearable to recursively browse instances with up to ~500 experiments or resources with attachments by sending them concurrently, but ultimately solving the problem would require changes to the API from the eLabFTW side. References: - [1] https://www.elabftw.net/ - [2] https://doc.elabftw.net/user-guide.html#experiments - [3] https://doc.elabftw.net/user-guide.html#resources - [4] https://demo.elabftw.net - [5] galaxyproject#18665 - [6] https://doc.elabftw.net/api/v2

davelopez

As I mentioned in person, the implementation looks amazing! especially because of the care and attention to detail! Thank you very much!

I also think that the recursive option that has motivated many of these changes (and increased complexity) is a use case that we should probably not support. Being able to recursively list or select a potentially huge number of experiments without strict pagination will likely be unusable or cause more trouble than it is worth... 😓

mvdbeek · 2024-12-13T14:55:44Z

lib/galaxy/files/sources/__init__.py

@@ -296,7 +299,7 @@ def get_uri_root(self) -> str:
        """Return a prefix for the root (e.g. gxfiles://prefix/)."""

    @abc.abstractmethod
-    def list(
+    async def list(


Please don't change this for existing interfaces. You can add a <operation>_async method where needed. We can't do blocking IO in async calls, but every plugin apart from yours will make blocking calls.

Thanks a lot for mentioning this, I was not aware of the inner workings of FastAPI and it would have been a big mistake to merge this PR as it is. I reverted commits from #19256 (so the endpoint is synchronous again), then chose a different mechanism to send the requests concurrently (see 204b5fe).

Namely, I learned that while FastAPI runs async routes on the main thread's event loop (that's precisely the reason why the previous approach is a bad idea), it runs sync routes on a separate thread from a threadpool. Therefore, the alternative I chose is to spawn an event loop on the sync route's thread, send the concurrent requests using it, and close it afterward.

This implies that #19256 would no longer be needed.

Thanks. I guess my other question is, have you checked with the elab people that concurrent requests are not considered abuse ? They often are in other contexts, e.g. in galaxy we would ban your ip or you would be hit by rate limits.

TL;DR: no public instances thus hard to define abuse and to be abused, cannot be engineered much better without eLabFTW API changes

eLabFTW's scope is one instance per lab or at most one per research institution. Unlike Galaxy, there are no public main instances. Either you host your own, or you use their hosted offer, which I assume is a single instance just for you (it costs 4985 € / yr) and is supposed to handle "~128 active users max".

Therefore, there is no authority defining abuse that it makes sense to ask or that would ban your ip per se. It depends on how many resources the endpoint you have configured has at their disposal, what their admins consider reasonable, and whether they enforce rate limits or not.

Nevertheless, and foreseeing this problem, I included a hard limit MAX_CONCURRENT_REQUESTS: int = 75, setting the number to something that would not cause issues (for example timed out requests according to the timeouts I also have defined) on their live demo site. To decrease the number of requests, I also avoid making requests if it is known that the experiment or resource has no attachments wrapped_entity.source["has_attachment"].

From a software engineering standpoint, given the fact that the file sources API requires the plugin to be able to list everything on the server (recursive=True) and the pagination is client-side I guess I cannot do much better besides a global queue with a limit that applies to the sum of requests per endpoint from all users at any time.

From this point on, achieving a better result if the plugin is supposed to work in the case recursive=True becomes a social engineering (from the point of view of our project) problem rather than a software engineering one, as we'd need changes to the eLabFTW API. At best an endpoint like this one but that lists all attachments a user can access, ideally takes offset, limit, sorting and the filtering parameters we need.

That is why I think just not supporting recursive=True in those file sources that don't make sense would be the safest option. We could add a "support" flag, similar to what we already have for pagination, etc. here:

galaxy/lib/galaxy/files/sources/__init__.py

Line 84 in b1217af

class FileSourceSupports(TypedDict):

Then we can disable the recursive option in the UI and the API according to this flag.

This reverts commit a05d5cc.

This reverts commit 668437a.

…nc route thread FastAPI runs sync routes on a separate thread (from a threadpool), and async routes on the main thread's event loop. Originally, it was conceived to convert `_list` to an async method (see galaxyproject#19256). However, given that all other file source plugins are blocking, they could block the main thread for a significant amount of time (see fastapi/fastapi#3091). Thus, the implementation below creates a new event loop within the sync route's thread so that the eLabFTW plugin can still send concurrent requests without blocking the main thread.

kysrpex added 9 commits November 18, 2024 17:02

Keep BaseFilesSource API backwards compatible

5926778

Change the implementation so that the definitions of `_write_from` in classes that inherit from `BaseFilesSource` do not need to change.

Ignore mypy [arg-type] errors on unmodified functions

62c8955

Merge branch 'dev' into file_sources_assign_paths

6fe9264

Mark tests using assert_list_names as async

a05d5cc

Merge branch 'file_sources_assign_paths' into eLabFTW

5203f68

Merge branch 'file_sources_asynchronous_listing' into eLabFTW

f8f5976

kysrpex self-assigned this Dec 12, 2024

github-actions bot added area/UI-UX area/testing area/API area/database Galaxy's database or data access layer area/tool-framework area/util area/packaging labels Dec 12, 2024

github-actions bot added this to the 25.0 milestone Dec 12, 2024

This was referenced Dec 12, 2024

Let file sources choose a path for uploaded files #19154

Open

Support listing file sources asynchronously #19256

Closed

kysrpex added 2 commits December 12, 2024 14:26

Merge branch 'dev' into eLabFTW

3b311ca

kysrpex force-pushed the eLabFTW branch from 7f1c1c6 to 3b311ca Compare December 12, 2024 13:27

kysrpex added area/backend and removed area/UI-UX area/testing area/database Galaxy's database or data access layer area/tool-framework area/packaging labels Dec 12, 2024

davelopez reviewed Dec 13, 2024

View reviewed changes

mvdbeek reviewed Dec 13, 2024

View reviewed changes

kysrpex added 3 commits December 16, 2024 08:37

Revert "Mark tests using assert_list_names as async"

1a51d1e

This reverts commit a05d5cc.

Revert "Support listing file sources asynchronously"

5e216ea

This reverts commit 668437a.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eLabFTW integration via Galaxy file source #19319

eLabFTW integration via Galaxy file source #19319

kysrpex commented Dec 12, 2024 •

edited

Loading

kysrpex commented Dec 12, 2024

davelopez left a comment

mvdbeek Dec 13, 2024

kysrpex Dec 16, 2024 •

edited

Loading

mvdbeek Dec 17, 2024

kysrpex Dec 17, 2024 •

edited

Loading

davelopez Dec 17, 2024

eLabFTW integration via Galaxy file source #19319

Are you sure you want to change the base?

eLabFTW integration via Galaxy file source #19319

Conversation

kysrpex commented Dec 12, 2024 • edited Loading

How to test the changes?

License

kysrpex commented Dec 12, 2024

davelopez left a comment

Choose a reason for hiding this comment

mvdbeek Dec 13, 2024

Choose a reason for hiding this comment

kysrpex Dec 16, 2024 • edited Loading

Choose a reason for hiding this comment

mvdbeek Dec 17, 2024

Choose a reason for hiding this comment

kysrpex Dec 17, 2024 • edited Loading

Choose a reason for hiding this comment

davelopez Dec 17, 2024

Choose a reason for hiding this comment

kysrpex commented Dec 12, 2024 •

edited

Loading

kysrpex Dec 16, 2024 •

edited

Loading

kysrpex Dec 17, 2024 •

edited

Loading