-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eLabFTW integration via Galaxy file source #19319
base: dev
Are you sure you want to change the base?
Conversation
The method `write_from()` of `SingleFileSource` and `BaseFilesSource` reads a local file from `native_path` and saves it to `target_path` on a file source. This commit allows the service backing the file source to choose which will be the path of the saved file, meaning that `target_path` and the actual path where the file can be recovered later do not have to match. The latter is the return value of `write_from()`. Therefore, all usages of `write_from()` have also been refactored to consider the paths chosen by the file source's backing service. In addition, when exporting a history, the URI that the service backing the file source assigns to it will be saved to the history export result metadata object.
Change the implementation so that the definitions of `_write_from` in classes that inherit from `BaseFilesSource` do not need to change.
Define a new class `FileSourceModelExportStore` that abstracts the common details of `BcoModelExportStore`, `ROCrateArchiveModelExportStore`, `TarModelExportStore` and `BagArchiveModelExportStore`. This new class manages exports to file sources, from where data can be retrieved later on via a URI. It takes the responsibility of creating a temporary directory to set up the file to export and uploading it to the file source.
The method `list()` from `galaxy.files.sources.BaseFilesSource` lists the directories and files within a file source. An optional keyword argument `recursive` (`False` by default) lets it recursively retrieve directories and files within a specific directory. This operation is very cheap in terms of CPU and expensive in IO terms, be it network or filesystem IO. Depending on how the underlying system is built, it may support retrieving directories and files recursively or not. If it does not, then every time a directory is listed, it is necessary to make another request to list each subdirectory. This may end up involving hundreds of requests. Done sequentially, this can be extremely slow, especially if each one involves network access. This commit makes the `list()` method asynchronous, which enables Galaxy to wait for the underlying system to complete the requests concurrently, resulting in a massive speedup. The price to pay is the extra complexity of using the async primitives. Since this change implies that all functions in the chain up to the API endpoints and the test functions must also be made asynchronous, this commit also takes care of it.
Eventually I can add automated tests, I would rather ask for your feedback first though. |
eLabFTW [1] revolves around the concepts of experiment [2] and resource [3]. Experiments and resources can have files attached to them. To get a quick overview, try out the live demo [4]. The scope of this implementation is exporting data from and importing data to eLabFTW as file attachments of already existing experiments and resources. Each user can configure their preferred eLabFTW instance entering its URL and an API Key. File sources reference files via a URI, while eLabFTW uses auto-incrementing positive integers. For more details read galaxyproject#18665 [5]. This leads to the need to declare a mapping between said identifiers and Galaxy URIs. Those take the form `elabftw://demo.elabftw.net/entity_type/entity_id/attachment_id`, where: - `entity_type` is either 'experiments' or 'resources' - `entity_id` is the id (an integer in string form) of an experiment or resource - `attachment_id` is the id (an integer in string form) of an attachment This implementation uses both `aiohttp` and the `requests` libraries as underlying mechanisms to communicate with eLabFTW via its REST API [6]. A significant limitation of the implementation is that, due to the fact that the API does not have an endpoint that can list attachments for several experiments and/or resources with a single request, when listing the root directory or an entity type _recursively_, a list of entities has to be fetched first, then to fetch the information on their attachments, a separate request has to be sent _for each one_ of them. The `aiohttp` library makes it bearable to recursively browse instances with up to ~500 experiments or resources with attachments by sending them concurrently, but ultimately solving the problem would require changes to the API from the eLabFTW side. References: - [1] https://www.elabftw.net/ - [2] https://doc.elabftw.net/user-guide.html#experiments - [3] https://doc.elabftw.net/user-guide.html#resources - [4] https://demo.elabftw.net - [5] galaxyproject#18665 - [6] https://doc.elabftw.net/api/v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in person, the implementation looks amazing! especially because of the care and attention to detail! Thank you very much!
I also think that the recursive
option that has motivated many of these changes (and increased complexity) is a use case that we should probably not support. Being able to recursively list or select a potentially huge number of experiments without strict pagination will likely be unusable or cause more trouble than it is worth... 😓
lib/galaxy/files/sources/__init__.py
Outdated
@@ -296,7 +299,7 @@ def get_uri_root(self) -> str: | |||
"""Return a prefix for the root (e.g. gxfiles://prefix/).""" | |||
|
|||
@abc.abstractmethod | |||
def list( | |||
async def list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't change this for existing interfaces. You can add a <operation>_async
method where needed. We can't do blocking IO in async calls, but every plugin apart from yours will make blocking calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for mentioning this, I was not aware of the inner workings of FastAPI and it would have been a big mistake to merge this PR as it is. I reverted commits from #19256 (so the endpoint is synchronous again), then chose a different mechanism to send the requests concurrently (see 204b5fe).
Namely, I learned that while FastAPI runs async routes on the main thread's event loop (that's precisely the reason why the previous approach is a bad idea), it runs sync routes on a separate thread from a threadpool. Therefore, the alternative I chose is to spawn an event loop on the sync route's thread, send the concurrent requests using it, and close it afterward.
This implies that #19256 would no longer be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I guess my other question is, have you checked with the elab people that concurrent requests are not considered abuse ? They often are in other contexts, e.g. in galaxy we would ban your ip or you would be hit by rate limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: no public instances thus hard to define abuse and to be abused, cannot be engineered much better without eLabFTW API changes
eLabFTW's scope is one instance per lab or at most one per research institution. Unlike Galaxy, there are no public main instances. Either you host your own, or you use their hosted offer, which I assume is a single instance just for you (it costs 4985 € / yr) and is supposed to handle "~128 active users max".
Therefore, there is no authority defining abuse that it makes sense to ask or that would ban your ip per se. It depends on how many resources the endpoint you have configured has at their disposal, what their admins consider reasonable, and whether they enforce rate limits or not.
Nevertheless, and foreseeing this problem, I included a hard limit MAX_CONCURRENT_REQUESTS: int = 75
, setting the number to something that would not cause issues (for example timed out requests according to the timeouts I also have defined) on their live demo site. To decrease the number of requests, I also avoid making requests if it is known that the experiment or resource has no attachments wrapped_entity.source["has_attachment"]
.
From a software engineering standpoint, given the fact that the file sources API requires the plugin to be able to list everything on the server (recursive=True
) and the pagination is client-side I guess I cannot do much better besides a global queue with a limit that applies to the sum of requests per endpoint from all users at any time.
From this point on, achieving a better result if the plugin is supposed to work in the case recursive=True
becomes a social engineering (from the point of view of our project) problem rather than a software engineering one, as we'd need changes to the eLabFTW API. At best an endpoint like this one but that lists all attachments a user can access, ideally takes offset, limit, sorting and the filtering parameters we need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is why I think just not supporting recursive=True
in those file sources that don't make sense would be the safest option. We could add a "support" flag, similar to what we already have for pagination, etc. here:
class FileSourceSupports(TypedDict): |
Then we can disable the recursive option in the UI and the API according to this flag.
This reverts commit a05d5cc.
This reverts commit 668437a.
…nc route thread FastAPI runs sync routes on a separate thread (from a threadpool), and async routes on the main thread's event loop. Originally, it was conceived to convert `_list` to an async method (see galaxyproject#19256). However, given that all other file source plugins are blocking, they could block the main thread for a significant amount of time (see fastapi/fastapi#3091). Thus, the implementation below creates a new event loop within the sync route's thread so that the eLabFTW plugin can still send concurrent requests without blocking the main thread.
Closes #18665, requires #19154 and #19256 (and also includes them, sorry; to review have a look at c00d250). Implements a file source to integrate Galaxy with eLabFTW.
eLabFTW revolves around the concepts of experiment and resource. Experiments and resources can have files attached to them. To get a quick overview, try out the live demo at demo.elabftw.net. The scope of this implementation is exporting data from and importing data to eLabFTW as file attachments of already existing experiments and resources. Each user can configure their preferred eLabFTW instance entering its URL and an API Key.
File sources reference files via a URI, while eLabFTW uses auto-incrementing positive integers. For more details read #18665. This leads to the need to declare a mapping between said identifiers and Galaxy URIs.
Those take the form
elabftw://demo.elabftw.net/entity_type/entity_id/attachment_id
, where:entity_type
is either 'experiments' or 'resources'entity_id
is the id (an integer in string form) of an experiment or resourceattachment_id
is the id (an integer in string form) of an attachmentThis implementation uses both
aiohttp
and therequests
libraries as underlying mechanisms to communicate with eLabFTW via its REST API. A significant limitation of the implementation is that, due to the fact that the API doesnot have an endpoint that can list attachments for several experiments and/or resources with a single request, when
listing the root directory or an entity type recursively, a list of entities has to be fetched first, then to fetch
the information on their attachments, a separate request has to be sent for each one of them. The
aiohttp
library makes it bearable to recursively browse instances with up to ~500 experiments or resources with attachments by sending themconcurrently, but ultimately solving the problem would require changes to the API from the eLabFTW side.
This is the third and last PR of a series of PRs that integrate eLabFTW with Galaxy via a file source (together they address issue #18665):
How to test the changes?
(Select all options that apply)
License