Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Germplasm Accession Importer #6

Merged
merged 51 commits into from
Feb 9, 2024
Merged

Conversation

carolyncaron
Copy link
Contributor

@carolyncaron carolyncaron commented Oct 23, 2023

Issue #2

Motivation

To create a Tripal 4 Importer instance for germplasm accessions which accommodates all the metadata required to meet BrAPI standards. Essentially, to upgrade this Tripal 3 code to Tripal 4: https://github.com/UofS-Pulse-Binfo/uofspb_germplasm/blob/7.x-3.x/includes/TripalImporter/GermplasmAccessionImporter.inc

User documentation for this importer, for which I am also open to feedback, has been updated here: https://tripalcultivate.github.io/docs/docs/curation/germplasm-data

What does this PR do?

  1. Implements the form function to display a form prompting the user for the genus, file and provides a shortened description of the expected file input format. Here is a screenshot of what the form currently looks like:

Screen Shot 2023-10-23 at 10 49 29-fullpage

  1. Sets up CV term configuration in YML format to store CV term IDs that are relevant to processing germplasm. Also implements the getCVterm() and setCVterm() function for grabbing and setting throughout the form as well as in kernel tests. In the future, the curator will be able to set these terms through the UI. This will be implemented at a later time since configuration will be shared among all of the TripalCultivate Germplasm modules.

  2. Divides the run() function into multiple smaller functions that each take care of a specific chunk of the importer. Kernel tests for each function are also given their own testing function in the format testGermplasmAccessionImporterFunctionName(). The following have been implemented:

    • getOrganismID()
    • getStockID()
    • getDbxrefID()
    • loadStockProperties()
    • loadSynonyms()

Testing

Automated Testing

The test is marked complete if implemented

  • testGermplasmAccessionImporterForm()
  • testGermplasmAccessionImporterRun()
    • A simple file with 2 germplasm and no optional columns
    • A file with missing required columns an insufficient number of columns
    • A file with 1 germplasm that has some optional columns specified, and 3 synonyms
  • testGermplasmAccessionImporterGetOrganismID()
    • Selects an existing organism
    • Tries selecting a non-existing organism
  • testGermplasmAccessionImporterGetStockID()
    • Selects an already existing stock
    • Inserts a stock that doesn't already exist
    • When a stock name exists but with a different accession
    • When multiple stocks with the same name + uniquename exist (only type_id differentiates them)
  • testGermplasmAccessionImporterGetDbxrefID()
    • Tries to get DbxrefID using a non-existing external database
    • After inserting the external db, check the dbxref was inserted and stock was updated
    • Call the function again to trigger the elseif statements where it already exists. Ensure dbxref and stock are not changed
    • Manually insert a dbxref with same accession, different db_id and then call the function
  • testGermplasmAccessionImporterloadStockProperties()
    • Load empty stock property values
    • Load all 6 stock properties
    • Load a couple of new stock properties for the same stock
    • Try loading a stock property that already exists
  • testGermplasmAccessionImporterloadSynonyms()
    • Test loading an empty string as a synonym
    • Check that a single synonym gets entered into the synonym table and stock_synonym table
    • Check that a synonym not already in the stock table triggers a notice but otherwise no errors
    • Check that a 2nd synonym with a stock_id prompts a stock_relationship record to be created
    • Check that a comma separated list of synonyms get entered correctly
    • Check that a semicolon separated list of synonyms get entered correctly

Manual Testing

  1. You can set up a docker for testing this PR by running the following commands:
git clone https://github.com/TripalCultivate/TripalCultivate-Germplasm.git
git checkout g2.2-germAccessionImporter
docker build --tag=trpcultivate:latest .
docker run --publish=80:80 --name=trpcultivate -tid --volume=`pwd`:/var/www/drupal/web/modules/contrib/TripalCultivate-Germplasm trpcultivate:latest
docker exec trpcultivate service postgresql restart
docker exec trpcultivate drush cr
  1. At localhost in your browser, login to the site and navigate in your Admin navigation bar to Tripal -> Data Loaders and make sure Tripal Cultivate: Germplasm Accessions is in the listing. If not, try re-running docker exec trpcultivate drush cr a couple more times and refresh the page.

  2. You will need to configure CVterms through the command line since this module does not yet have an interface for it. You will also need an organism and an external DB to already exist. You can either use docker exec -it trpcultivate drush php:cli or PHP Execute through the interface (you may have to configure the Devel bar to show this). Copy and paste the following PHP code:

$config_factory = \Drupal::configFactory();
$germplasm_config = $config_factory->getEditable('trpcultivate_germplasm.settings');
$germplasm_config->set('terms.accession', 9);
$germplasm_config->set('terms.institute_code', 10);
$germplasm_config->set('terms.institute_name', 11);
$germplasm_config->set('terms.country_of_origin_code',12);
$germplasm_config->set('terms.biological_status_of_accession_code', 13);
$germplasm_config->set('terms.breeding_method_DbId', 14);
$germplasm_config->set('terms.pedigree', 15);
$germplasm_config->set('terms.synonym', 16);
$germplasm_config->set('terms.stock_relationship_type_synonym', 17);
$germplasm_config->set('terms.subtaxa', 18);
$germplasm_config->save();

$connection = \Drupal::service('tripal_chado.database');

$subspecies_cvterm_query = $connection->select('1:cvterm','cvt')
	->fields('cvt', ['cvterm_id', 'name', 'definition'])
	->condition('name', 'subspecies', '=');
$subspecies_cvterm_record = $subspecies_cvterm_query->execute()->fetchAll();
$subspecies_cvterm_id = $subspecies_cvterm_record[0]->cvterm_id;

$organism_id = $connection->insert('1:organism')
	->fields([
		'genus' => 'Tripalus',
		'species' => 'databasica',
		'infraspecific_name' => 'chadoii',
		'type_id' => $subspecies_cvterm_id,
	])
	->execute();

$db_id = $connection->insert('1:db')
	->fields([
		'name' => 'TestDB',
	])
	->execute();
  1. Go to the importer page at Tripal -> Data Loaders -> Tripal Cultivate: Germplasm Accessions. Select the genus 'Tripalus' from the dropdown. For your test files, you can try with one (or both!) of these test files included with the module for testing:
  • TripalCultivate-Germplasm/trpcultivate_germplasm/tests/src/Fixtures/simple_example.txt
  • TripalCultivate-Germplasm/trpcultivate_germplasm/tests/src/Fixtures/props_syns_example.txt
  1. Submit the form, then you will be prompted to run a Tripal job. Just ensure the job command is prefixed with docker exec trpcultivate to run it within the docker container. Hopefully, you will see output such as Reached end of file without encountering any errors. Transaction will be committed to the database. on your terminal, indicating that the import worked!

  2. You can verify that the stocks in your database match the ones in your file of choice by going back to the php CLI or Execute PHP and running the following select query:

$connection = \Drupal::service('tripal_chado.database');
$stock_query = $connection->select('1:stock', 's')->fields('s', ['organism_id', 'name', 'uniquename', 'type_id']);
$stock_record = $stock_query->execute()->fetchAll();

If testing with the file props_syns_example.txt, it is also good to check the stockprop and stock_synonym for values as well:

$stockprop_query = $connection->select('1:stockprop', 'sp')->fields('sp', ['stock_id', 'type_id', 'value']);
$stockprop_records = $stockprop_query->execute()->fetchAll();

$stock_relationship_query = $connection->select('1:stock_relationship', 'sr')->fields('sr', ['subject_id', 'object_id', 'type_id', 'value']);
$stock_relationship_record = $stock_relationship_query->execute()->fetchAll();

@carolyncaron carolyncaron linked an issue Oct 23, 2023 that may be closed by this pull request
Copy link
Member

@laceysanderson laceysanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I will do this review in stages.

Review of the associated documentation can be seen here: TripalCultivate/docs#2

Review of the import form:

  • It would be good to include a link on the import page to your more detailed documentation! Something like "For more detailed information on this format including links to lookup various codes, please see our official documentation."
  • Instructions block feels a bit like placeholder text. Suggestion in separate review comment.

Copy link
Member

@laceysanderson laceysanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next I tried a functional test. I followed the steps outlined for setup and then tested as follows:

  1. Put in my own simple test file with altered names / organism: simple_example.txt

✅ This worked beautifully. Importer ran with no errors and the records in the stock and dbxref tables were as expected.

  1. Imported the sample file as in 1 a second time.

✅ This also worked beautifully. Importer ran with no errors and detected that the records already existed (did not say "inserting" in the job output.
✅ Records in database still as expected.

  1. Imported my version of the properties example file and included the same two accessions as in the first file but with more details:
    props_syns_example.txt

❌ This unfortunately failed. The job produced the following output which shows it was unable to determine that the same germplasm already existed 🙈

docker exec trpcultivate drush trp-run-jobs --username=drupaladmin --root=/var/www/drupal/web

2023-12-12 22:18:34
Tripal Job Launcher
Running as user 'drupaladmin'
-------------------
2023-12-12 22:18:34: Job ID 3.
2023-12-12 22:18:34: Calling: tripal_run_importer(36)
Running 'Tripal Cultivate: Germplasm Accessions' importer
NOTE: Loading of this file is performed using a database transaction. If it fails or is terminated prematurely then all insertions and updates are rolled back and will not be found in the database
Percent complete: 0%. Memory: 33,996,184 bytes.
Percent complete: 1.12 %. Memory: 34,005,008 bytes.
Percent complete: 3.21 %. Memory: 34,005,080 bytes.
Percent complete: 4.33 %. Memory: 34,005,144 bytes.
Percent complete: 32.96 %. Memory: 34,005,424 bytes.
Percent complete: 56.56 %. Memory: 34,005,552 bytes.
Percent complete: 74.58 %. Memory: 34,014,168 bytes.
Percent complete: 100.00 %. Memory: 34,014,288 bytes.
Percent complete: 100.00 %. Memory: 34,014,648 bytes.
Inserting "Test1".
ERROR: SQLSTATE[23505]: Unique violation: 7 ERROR:  duplicate key value violates unique constraint "stock_c1"
DETAIL:  Key (organism_id, uniquename, type_id)=(1, GERM:00001, 9) already exists.: INSERT INTO "testchado"."stock" ("organism_id", "name", "uniquename", "type_id") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3) RETURNING stock_id; Array
(
)

[site http://default] [TRIPAL] ERROR: SQLSTATE[23505]: Unique violation: 7 ERROR:  duplicate key value violates unique constraint "stock_c1"DETAIL:  Key (organism_id, uniquename, type_id)=(1, GERM:00001, 9) already exists.: INSERT INTO "testchado"."stock" ("organism_id", "name", "uniquename", "type_id") VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3) RETURNING stock_id; Array()

❌ I also ended up with partially inserted data rather then everything being rolled back. Interestingly the output says Test1 (end of file) was inserted but it was NOT. However, the stockprop table does have a bunch of records that should have been rolled back 🤔

@carolyncaron
Copy link
Contributor Author

carolyncaron commented Dec 18, 2023

Thank you for your review Lacey! I made the following two changes to address the error that you saw:

  1. In commit 19e91ce, I altered the query in getStockID() to check for duplicates in either the germplasm name OR the accession_number. This is to ensure the unique constraint of organism_id+accession_number+type_id does not get violated, which is what happened during your test. I also opened Issue G2.11 Do we want to insert a germplasm name or accession that already exists with a different organism_id or type_id? #11 if we want to further discuss changing any other restrictions imposed by this importer that don't fall under the unique constraint.
  2. In commit 707ceed I added a try-catch around the 5 smaller functions within the run() function. This is to catch any error that hasn't been anticipated yet and to pass on the error message at the very end, along with the "transaction not committed" message. I also created a new test method testGermplasmAccessionImporterRunIncomplete() to test the following cases that trigger an exception:
    • A non-existent input file
    • A genus specified which does not currently exist in the database
    • 2 stocks which share the same Accession Number

Copy link
Member

@laceysanderson laceysanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the same file I used last time on the new changes:
✅ I did not getting an error "ERROR: A stock already exists for accession "GERM:00001" but with a germplasm name of "CDC Ch@d" which does not match the input file."
❌ It still didn't actually roll back the inserted records 😕

$connection = \Drupal::service('tripal_chado.database');
$stock_query = $connection->select('1:stock', 's')->fields('s', ['organism_id', 'name', 'uniquename', 'type_id']);
$stock_record = $stock_query->execute()->fetchAll();
= [
    {#6212
      +"organism_id": "1",
      +"name": "CDC Ch@d",
      +"uniquename": "GERM:00001",
      +"type_id": "9",
    },
    {#6211
      +"organism_id": "1",
      +"name": "USASK Ken",
      +"uniquename": "GERM:00002",
      +"type_id": "9",
    },
  ]

To replicate, I used a fresh docker image/container on this branch, ran the following in drush php and then submitted an import job with the same file I used before.

$config_factory = \Drupal::configFactory();
$germplasm_config = $config_factory->getEditable('trpcultivate_germplasm.settings');
$germplasm_config->set('terms.accession', 9);
$germplasm_config->set('terms.institute_code', 10);
$germplasm_config->set('terms.institute_name', 11);
$germplasm_config->set('terms.country_of_origin_code',12);
$germplasm_config->set('terms.biological_status_of_accession_code', 13);
$germplasm_config->set('terms.breeding_method_DbId', 14);
$germplasm_config->set('terms.pedigree', 15);
$germplasm_config->set('terms.synonym', 16);
$germplasm_config->set('terms.stock_relationship_type_synonym', 17);
$germplasm_config->set('terms.subtaxa', 18);
$germplasm_config->save();

$connection = \Drupal::service('tripal_chado.database');

$subspecies_cvterm_query = $connection->select('1:cvterm','cvt')
        ->fields('cvt', ['cvterm_id', 'name', 'definition'])
        ->condition('name', 'subspecies', '=');
$subspecies_cvterm_record = $subspecies_cvterm_query->execute()->fetchAll();
$subspecies_cvterm_id = $subspecies_cvterm_record[0]->cvterm_id;

$organism_id = $connection->insert('1:organism')
        ->fields([
                'genus' => 'Tripalus',
                'species' => 'postgresquelus',
                'infraspecific_name' => 'chadoii',
                'type_id' => $subspecies_cvterm_id,
        ])
        ->execute();

$db_id = $connection->insert('1:db')
        ->fields([
                'name' => 'Germplasm Database',
        ])
        ->execute();

Copy link
Member

@laceysanderson laceysanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working through the code review :-) I'm going to list all the functions in this comment and as I review them I'm going to check them off. All comments/suggestions I'll do inline in the code.

Functions to review in GermplasmAccessionImporter class:

  • Dependency injection (create + _construct)
  • describeUploadFileFormat()
  • setCVterm() + getCVterm()
  • form() + formValidate()
  • setUpCVterms()
  • run()
  • getOrganismID()
  • getStockID()
  • getDbxrefID()
  • loadStockProperties()
  • loadSynonyms()

continue 2;
}
}
$germplasm_name = $germplasm_columns[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll want to add trim() around all of these ;-p

Suggested change
$germplasm_name = $germplasm_columns[0];
$germplasm_name = trim($germplasm_columns[0]);

Copy link
Member

@laceysanderson laceysanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review complete 🎉

Overall, this importer is looking really solid :-) I have a few inline comments but for the most part everything is looking good and these suggestions are relatively minor. I'll create an issue for the discussion regarding stock lookup. Any updates you get added in based on review are a bonus but if you run out of time before vacation, I feel comfortable merging this as is and moving the suggestions into another issue.

For prioritizing, I would say the biggest priority is getting the test updated so it fails demonstrating the bug in core regarding transactions.

Copy link
Member

@laceysanderson laceysanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes addressed or moved to issues for later.
✅ READY TO MERGE 🎉

@laceysanderson laceysanderson merged commit 4f5e0e0 into 4.x Feb 9, 2024
5 checks passed
@laceysanderson laceysanderson deleted the g2.2-germAccessionImporter branch February 9, 2024 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Importer: Germplasm Accessions
2 participants