Deliverable 8.2 (May 2022) #77

Open
14 of 17 tasks
llivermore opened this issue Mar 27, 2022 · 1 comment
Labels
D8.2 Deliverable associated with tools and services.

Comments

llivermore commented Mar 27, 2022

Description
Corresponds to SYNTHESYS+ Task 8.2 "A set of automated and high throughput software services to extract, enhance and annotate Natural History specimen data (confined to microscope slides, herbarium sheets and pinned insects)."

Tools should include:

  • OCR/WTR (Subtask 8.2.1)
  • Data mining and linkage (Subtask 8.2.2) - includes NER
  • Image analysis (Subtask 8.2.3) - Configurable specimen colour analysis and measurement tools will be developed. Includes segmentation.
  • Georeferencing (Subtask 8.2.4)
  • Training datasets (Subtask 8.2.5) - will require a discussion on future feasibility of HIT and crowdsourced data
  • Workflow and tool registry (Subtask 8.2.6)

Tools:

Tool testing/functionality for UAT:

Training data:

@llivermore llivermore changed the title Deliverable 8.2 Deliverable 8.2 (May 2022) Mar 27, 2022
@llivermore
Contributor Author

Full task description:

SYNTHESYS+ will develop and improve accessibility to tools and services, building on some of the prototypes developed in SYNTHESYS3, which support high throughput analysis and enrichment of NH specimen images and associated information using contemporary approaches. These will be constructed in a modular way and made available to the consortium as workflows and tools (task 8.3). All services will be provided through an API following established standards for service management (e.g. http://fitsm.itemo.org/, https://www.biodiversitycatalogue.org/).

Subtask 8.2.1: Optical character recognition data capture and analysis
An optical character recognition service will be developed for both handwritten and printed text of specimens. This will facilitate discovery of duplicate text data and data transfer (linked with subtask 8.2.3). It will also build on and improve the approaches and success rates of the READ, tranScriptorium, Herbadrop and ICEDIG projects. A pilot will be undertaken focussing on handwriting style / label recognition in groups of similar labels and text to aid OCR (linked with subtask 8.2.3) and also to improve the work efficiency in subtask 8.2.5.
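A minimal sketch of how duplicate or near-duplicate label text might be grouped once OCR output is available. This is an illustration only, not the project's method: the function names, the greedy grouping strategy and the 0.85 similarity threshold are all assumptions, and it uses Python's standard `difflib` rather than any SYNTHESYS+ service.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def group_similar_labels(labels: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedily group labels whose text is similar to a group's first member.

    `threshold` is a hypothetical cut-off on difflib's similarity ratio (0..1).
    """
    groups: list[list[str]] = []
    for label in labels:
        for group in groups:
            ratio = SequenceMatcher(None, normalise(label), normalise(group[0])).ratio()
            if ratio >= threshold:
                group.append(label)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([label])
    return groups


labels = [
    "Coll. J. Smith, Kew 1902",
    "Coll. J. Smith, Kew 19O2",  # OCR confused the digit 0 with the letter O
    "Herb. Mus. Paris",
]
groups = group_similar_labels(labels)
```

Here the first two transcriptions end up in one group and the third in its own. A production service would likely need a more scalable comparison than this pairwise loop.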

Subtask 8.2.2: Data mining and linkage
This subtask will extract and map specimen metadata to Darwin Core terms from OCR results in subtask 8.2.1. Data will be mined to group similar texts/specimen fields to facilitate the usage of metadata in large collections. Specimen links will be created with external sources of taxonomies, literature and molecular data. The resulting data will be integrated and contributed to Catalogue of Life+ services. The service will link to and implement NA4 work on Stable Identifiers and the IIIF standard.
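As a rough illustration of the mapping step, the sketch below turns a list of (entity label, text) pairs, such as an NER step might produce, into a record keyed by Darwin Core terms. The entity labels (`TAXON`, `COLLECTOR`, and so on) and the function name are hypothetical; the Darwin Core term names are real, and repeated values are joined with " | ", the separator Darwin Core recommends for lists.

```python
# Hypothetical NER entity labels mapped to real Darwin Core term names.
ENTITY_TO_DWC = {
    "TAXON": "scientificName",
    "COLLECTOR": "recordedBy",
    "DATE": "eventDate",
    "LOCALITY": "locality",
    "COUNTRY": "country",
    "CATALOG_NO": "catalogNumber",
}


def to_darwin_core(entities: list[tuple[str, str]]) -> dict[str, str]:
    """Build a flat Darwin Core record from (entity label, text) pairs.

    Unmapped labels are skipped; repeated terms are joined with " | ".
    """
    record: dict[str, str] = {}
    for label, value in entities:
        term = ENTITY_TO_DWC.get(label)
        if term is None:
            continue
        record[term] = f"{record[term]} | {value}" if term in record else value
    return record


record = to_darwin_core([
    ("TAXON", "Bellis perennis L."),
    ("COLLECTOR", "J. Smith"),
    ("COLLECTOR", "A. Jones"),
    ("STAMP", "ex herb."),  # no mapping defined, so it is dropped
])
```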

Subtask 8.2.3: Image analysis
Configurable specimen colour analysis and measurement tools will be developed. This will include semantic segmentation of features of interest in images (e.g. labels, specimens, landmarks). Prototypes will be developed for 1) Automated collection assessment tools (e.g. verdigris in entomological collections, degradation of microscope slide mountants, pest damage/activity); 2) Automated identification of higher taxa and morphological traits using machine learning; and 3) Detection of erroneous determinations of specimens through pattern recognition (e.g. outlier detection).
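The outlier-detection idea in prototype 3 could, in its simplest form, look like the sketch below. It assumes that a single numeric measurement (for example a length in millimetres) is available for each specimen sharing a determination, and it flags values using the modified z-score based on the median absolute deviation. The function name and the 3.5 threshold are assumptions, and the task description does not say which technique the prototypes will use.

```python
from statistics import median


def mad_outliers(values: list[float], threshold: float = 3.5) -> list[int]:
    """Return the indices of values whose modified z-score exceeds `threshold`.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical measurements for specimens with the same determination.
measurements = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]
suspects = mad_outliers(measurements)
```

Only the last measurement is flagged here. A flagged specimen is a candidate for expert review, not proof of an erroneous determination.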

Subtask 8.2.4: Georeferencing
Provisioning of semi-automated georeferencing tools and workflows as a shared service, linking with subtasks 8.2.2 and 8.2.5. This will include the development of functionality for coordinate transformations, reverse geocoding, and the development of electronic itineraries, to aid georeferencing and utilisation of gazetteers developed in SYNTHESYS3. The tool sets will be integrated within a sample production collection management system (DINA) to demonstrate typical data curation workflows (supporting task 8.4).
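One of the simplest coordinate transformations is converting a degrees/minutes/seconds string from a label into decimal degrees. The sketch below is an assumption about what such a helper might look like: the function name and accepted input format are hypothetical, and it does not handle datum conversions or check that a value is a plausible latitude or longitude.

```python
import re

# Matches strings such as 51°28'40"N or 0° 0′ 5.5″ W (hypothetical input format).
_DMS = re.compile(
    r"""^\s*(\d{1,3})\s*°\s*(\d{1,2})\s*['′]\s*(\d{1,2}(?:\.\d+)?)\s*["″]\s*([NSEW])\s*$""",
    re.IGNORECASE,
)


def dms_to_decimal(text: str) -> float:
    """Convert a degrees/minutes/seconds coordinate to signed decimal degrees.

    South and west hemispheres give negative values.
    """
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {text!r}")
    degrees = int(match.group(1))
    minutes = int(match.group(2))
    seconds = float(match.group(3))
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"Minutes or seconds out of range: {text!r}")
    value = degrees + minutes / 60 + seconds / 3600
    if match.group(4).upper() in ("S", "W"):
        value = -value
    return value


latitude = dms_to_decimal("51°28'40\"N")
longitude = dms_to_decimal("0° 0′ 5.5″ W")
```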

Subtask 8.2.5: Human interaction and crowdsourcing
AI and machine learning need ground truth datasets and learning datasets as a reference, both to function and to test the quality and accuracy of their output. These reference datasets need to be large and sufficiently complete. The content or the measurements are collected by combining existing validated datasets produced by experts with content mobilised via crowdsourcing. The aim is to improve the other automated processes of task 8.2 through human interaction. Automated processes have limitations and, with currently available technologies, will not completely replace human interaction. The second approach will be to assess which processes can be automated and to what degree: for example, to support users in routine tasks such as sorting, feature recognition, or error detection; and where expert human input is still needed, to ensure that this is as efficient and targeted as possible.

Subtask 8.2.6: Specimen Data Refinery Workflow and Tools Registry
This task focuses on the creation of the SDR Registry for all the services and workflows defined above, and the registration and curation of those services using a FAIR metadata schema compatible with EOSC and ELIXIR registries, with Common Workflow Language (CWL) tools, and with emerging tool description de facto standards.

The SDR Registry will be developed based on SEEK, myExperiment, and GitHub CWLViewer (all developed and supported by UNIMAN) and informed by the landscape review in task 8.1, leveraging GA4GH and ELIXIR community initiatives such as Dockstore and ToolDog. The tools will be described in a standard modular way and all services will be provided through an Application Programming Interface (API) following established standards for service management (e.g. http://fitsm.itemo.org/). A service definition will encapsulate a pre-existing tool, and will be specified in a manner compatible with Common Workflow Language nodes. Definitions will be richly annotated to facilitate their discovery by users. The registry schema will hold the metadata for the tasks surfaced in task 8.2 above (e.g. models and parameters), and the schema will be compatible with other registry initiatives.
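To make "a service definition will encapsulate a pre-existing tool" more concrete, the sketch below builds a minimal CWL `CommandLineTool` description as a Python dictionary and serialises it to JSON, which CWL accepts as a document format alongside YAML. The tool name `sdr-segment`, the helper function and the input/output names are hypothetical; only the CWL field names (`cwlVersion`, `class`, `baseCommand`, `inputs`, `outputs`, `inputBinding`, `outputBinding`) come from the CWL specification. The actual registry schema is not described in this issue.

```python
import json


def cwl_tool_definition(base_command: str, label: str, doc: str) -> dict:
    """Return a minimal CWL v1.2 CommandLineTool wrapping an existing command.

    The wrapped tool is assumed to take one image file as its first
    positional argument and to write a JSON result file.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "label": label,
        "doc": doc,
        "baseCommand": base_command,
        "inputs": {
            "image": {"type": "File", "inputBinding": {"position": 1}},
        },
        "outputs": {
            "result": {"type": "File", "outputBinding": {"glob": "*.json"}},
        },
    }


# "sdr-segment" is a made-up command used purely for illustration.
tool = cwl_tool_definition(
    base_command="sdr-segment",
    label="Specimen image segmentation",
    doc="Segments labels and specimens in a specimen image.",
)
document = json.dumps(tool, indent=2)
```

The `label` and `doc` fields are where discovery-oriented annotation would start; richer registry metadata would sit alongside the definition.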

Services will be made available as containers (VM, Docker container or Bioconda specification), available to the consortium as cloud services (task 8.3). Services will be quality checked to ensure that they read/write in formats identified in task 8.1, and are not specific to very limited execution environments, taking into account recent advancements in integration testing of workflows. We will develop and implement service discovery functions for search, related services and registry adherence to FAIR metrics developed by EOSC, ELIXIR and other initiatives like FAIRmetrics.org.

The system will maintain analytics, including uptime and usage statistics, for service analysis and review. Some support will be given for users to create simple workflows based upon instances of CWL tools (e.g. Rabix Composer).

@llivermore llivermore self-assigned this May 11, 2022
@llivermore llivermore added the D8.2 Deliverable associated with tools and services. label Jul 6, 2022