Acquires and loads gene and mutation data #42

stephenshank · 2016-11-16T00:57:48Z

Motivation

Addresses #27 by attempting to load remaining cancer-static data (mutations and genes).

API changes

No changes to API, but populates data required by related views.

Implementation Notes

Loading the mutations is painfully slow, and will be the subject of future pull requests to this and cancer-data repositories. We may want to bypass the Django ORM here. Had an issue removing django-genes, so left it in for the moment. Also had an issue using bulk_create on all mutations; this is the reason for doing 1000 at a time.
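The 1000-at-a-time workaround amounts to chunking the list before each bulk insert. A minimal sketch, assuming the mutations have already been built as a list; `chunked` is a hypothetical helper rather than code from this PR, and the Django call appears only as a comment because `Mutation` is the project's own model:

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical use inside the loader (assumes a Django model named Mutation):
# for batch in chunked(mutation_list, 1000):
#     Mutation.objects.bulk_create(batch)
```

Django's bulk_create also accepts a batch_size argument, which may give the same effect without a helper.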

Functional Tests

Recommendations on how to do this would be appreciated.

dhimmel · 2016-11-16T16:59:45Z

api/management/commands/acquiredata.py

+
+        gene_path = os.path.join(options['path'], 'genes.tsv')
+        if not os.path.exists(gene_path):
+            gene_url = 'http://www.stephenshank.com/genes.tsv'


Use https://github.com/cognoma/genes/raw/master/data/genes.tsv instead.

dhimmel · 2016-11-16T17:04:45Z

api/management/commands/acquiredata.py

+
+        mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2')
+        if not os.path.exists(mutation_path):
+            mutation_url = 'https://ndownloader.figshare.com/files/5864862'


Would be nice if we could use the figshare download logic from cognoml. See cognoml/figshare.py. @jessept is our code modular enough that @stephenshank can use cognoml to download the figshare data here, or would this application be out of scope?

I created a corresponding issue for the cognoml team: cognoma/cognoml#15.

I just assigned cognoma/cognoml#15 to myself to help move this forward. We can definitely use our data retrieval code in other places, @stephenshank check if the code here works for what you need. I'm happy to add any additional helper code as well, just let me know what you're looking for.

@jessept -- nice. My main worry here is that hardcoding the URL for a specific dataset of a specific version of the figshare data is going to cause an upkeep issue later on. So the goal will be to use the cognoml logic for figshare downloads to avoid repeating any efforts and clean this up!
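One way to avoid hardcoding a per-file URL is to resolve it by file name from the article's file listing. The sketch below is hypothetical and is not cognoml's figshare.py; it assumes the listing has already been fetched and parsed into a list of dicts with name and download_url keys, as in the figshare response quoted later in this thread.

```python
def find_download_url(files, name):
    """Return the download_url of the file entry whose name matches."""
    for entry in files:
        if entry["name"] == name:
            return entry["download_url"]
    raise KeyError("no file named {!r} in article".format(name))


# Two entries copied from the figshare v2 response for article 3487685.
files = [
    {"id": 5864862, "name": "mutation-matrix.tsv.bz2",
     "download_url": "https://ndownloader.figshare.com/files/5864862"},
    {"id": 6207135, "name": "samples.tsv",
     "download_url": "https://ndownloader.figshare.com/files/6207135"},
]
mutation_url = find_download_url(files, "mutation-matrix.tsv.bz2")
```

Looking the file up by name means a new figshare version only requires pointing at a different article listing, not editing each file id.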

dhimmel · 2016-11-16T17:16:05Z

api/management/commands/loaddata.py

+                gene_list = []
+                for row in gene_reader:
+                    gene = Gene(
+                        entrezid=row['entrez_gene_id'],


Can we use entrez_gene_id rather than entrezid here as the field name?

dhimmel · 2016-11-16T17:18:05Z

api/management/commands/loaddata.py

+                        description=row['description'],
+                        chromosome=row['chromosome'] or None,
+                        gene_type=row['gene_type'],
+                        synonyms=row['synonyms'] or None,


synonyms and aliases are array types. Split by |. See #40 (comment).

Is it possible to just do Gene(**row)? @awm33 do you know?

@stephenshank I'm guessing the or None is to prevent empty strings in favor of null?

@awm33 You are correct; perhaps I should be using blank=True as opposed to null=True in models which may contain missing data... if I recall correctly, trying to insert an empty string threw an error.
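The two points in this thread (pipe-delimited array fields, and empty strings stored as null) could be handled in one row-cleaning step. A sketch under the assumption that the TSV columns are named after the model fields; `clean_gene_row` is a hypothetical helper, not code from this PR:

```python
def clean_gene_row(row, array_fields=("synonyms", "aliases")):
    """Return a copy of `row` with empty strings as None and arrays split."""
    cleaned = {}
    for key, value in row.items():
        if value == "":
            # Missing data becomes None so it is stored as NULL.
            cleaned[key] = None
        elif key in array_fields:
            # Array columns are pipe-delimited in the source file.
            cleaned[key] = value.split("|")
        else:
            cleaned[key] = value
    return cleaned


row = {"entrez_gene_id": "1", "chromosome": "", "synonyms": "foo|bar", "aliases": ""}
cleaned = clean_gene_row(row)
```

If the column names match the model's field names, the cleaned dict could then be passed as Gene(**cleaned).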

dhimmel · 2016-11-16T17:24:35Z

api/migrations/0003_cognoma_genes.py

+        migrations.CreateModel(
+            name='Gene',
+            fields=[
+                ('entrezid', models.IntegerField(primary_key=True, serialize=False)),


Should be entrez_gene_id. See the API docs.

dhimmel · 2016-11-16T17:28:16Z

api/management/commands/loaddata.py

+        # Mutations
+        if Mutation.objects.count() == 0:
+            mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2')
+            mutation_df = pd.read_table(mutation_path, index_col=0)


Not sure if it's worth introducing the pandas dependency here. I think a DictReader could do the job.

for row in reader:
    sample_id = row.pop('sample_id')
    for entrez_gene_id, mutation_status in row.items():
        if mutation_status == '1':
            # Create mutation from entrez_gene_id, sample_id

Loading the mutations is painfully slow, and will be the subject of future pull requests to this and cancer-data repositories.

This could be due to using iterrows in pandas which can be slow.

Ditto, let's nix pandas if we can.

Pandas will be eliminated in the next revision.

See more detailed workaround at cognoma/cancer-data#34 (comment).
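The DictReader approach suggested in this thread can be sketched with the standard library alone. This assumes the matrix is a bz2-compressed TSV whose first column is headed sample_id and whose remaining headers are Entrez gene IDs; `iter_mutations` is a hypothetical name:

```python
import bz2
import csv


def iter_mutations(path):
    """Yield (sample_id, entrez_gene_id) for every cell equal to '1'."""
    # Text mode lets csv read the decompressed stream line by line.
    with bz2.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            sample_id = row.pop("sample_id")
            for entrez_gene_id, mutation_status in row.items():
                if mutation_status == "1":
                    yield sample_id, entrez_gene_id
```

Because the function is a generator, only one row of the matrix is held in memory at a time.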

dhimmel · 2016-11-16T17:30:02Z

api/management/commands/loaddata.py

+                        )
+                        mutation_list.append(mutation)
+                    except:
+                        print('OOPS! Had an issue inserting sample', sample_id, 'mutation', mutated_gene)


This shouldn't happen? Have you had any issues? I don't think we want to except this.

Yes, this should be a hard fail.

awm33 · 2016-12-03T23:40:57Z

api/management/commands/loaddata.py

@@ -2,8 +2,9 @@
 import csv

 from django.core.management.base import BaseCommand
+import pandas as pd


Why are we adding pandas as a dependency in a web app? Using it to just load a CSV for a data model seems like overkill.

awm33 · 2016-12-03T23:46:10Z

api/models.py


 GENDER_CHOICES = (
    ("male", "Male"),
    ("female", "Female")
 )

+


Typo? Usually one line is enough

This was for PEP 8 compliance with classes and blank lines. Will revert.

awm33 · 2016-12-03T23:48:23Z

api/models.py

@@ -21,13 +21,15 @@ class Meta:
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

+


Looks like you added double spacing to this file, please revert back to single spacing for consistency.

awm33 · 2016-12-03T23:49:02Z

requirements.txt

@@ -16,3 +16,4 @@ pycparser==2.16
 pycrypto==2.6.1
 PyJWT==1.4.2
 six==1.10.0
+pandas==0.18.1


If we remove pandas, remember to remove it from the requirements as well.

awm33 · 2016-12-03T23:50:02Z

@stephenshank I left some feedback. It also looks like the tests are failing. But overall, good PR, just need to address some issues.

stephenshank · 2016-12-05T14:33:06Z

@dhimmel @awm33 Thanks for the comments! I have opened a PR in cancer-data with a reformatted mutation matrix that is much more amenable for this task.

I also have an updated version of this branch which passes all tests, after some required API changes. I will further modify it to make use of the new mutation matrix, which will get rid of the pandas dependency, and hope to update this PR by tomorrow morning at the latest.

stephenshank · 2016-12-06T00:21:36Z

api/management/commands/loaddata.py

+                    sample = Sample.objects.get(sample_id=sample_id)
+                    for entrez_gene_id, mutation_status in row.items():
+                        if mutation_status == '1':
+                            try:


This does indeed catch a lone exception, even with the up-to-date genes.tsv file. The Entrez ID is 117153. Putting this into NCBI's Gene database shows that it is an out of date ID, and related to melanoma. The current ID is 4253, which is also found in mutation-matrix.tsv.bz2.

Are you using the latest figshare files (v5)?

From https://api.figshare.com/v2/articles/3487685:

[
  {
    "is_link_only": false,
    "size": 173889264,
    "id": 5864859,
    "download_url": "https://ndownloader.figshare.com/files/5864859",
    "name": "expression-matrix.tsv.bz2"
  },
  {
    "is_link_only": false,
    "size": 1564703,
    "id": 5864862,
    "download_url": "https://ndownloader.figshare.com/files/5864862",
    "name": "mutation-matrix.tsv.bz2"
  },
  {
    "is_link_only": false,
    "size": 772313,
    "id": 6207135,
    "download_url": "https://ndownloader.figshare.com/files/6207135",
    "name": "samples.tsv"
  },
  {
    "is_link_only": false,
    "size": 1211305,
    "id": 6207138,
    "download_url": "https://ndownloader.figshare.com/files/6207138",
    "name": "covariates.tsv"
  }
]

Looks like you are. Will keep investigating.

Can you delete your local mutation-matrix.tsv.bz2? Maybe it's outdated but, since it exists, it is not being re-downloaded... that's all I can think of.

Can you delete your local mutation-matrix.tsv.bz2...

Just tried this, and the issue persists. ID 117153 (the lone exception) was discontinued on 9/10/16 and replaced with 4253. To investigate, I downloaded mutation-matrix.tsv.bz2 from the URL in your JSON above. After loading with df=pandas.read_table(path, index_col=0), when I run '4253' in df.columns and '117153' in df.columns, both return True.

Looking at commit histories in cancer-data, the mutation matrix appears to have been made before this date. Looking at commits from genes, Entrez information was obtained after this date. My guess is that this ID changed between the creation of these two files.

Hmm, mutation-matrix.tsv.bz2 should filter out all genes that are not in cognoma/genes. I'll look into this, but for now keeping the error handling makes sense.
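If the handling stays, an explicit membership check may be easier to reason about than a bare except, because only the known failure (a gene ID absent from the genes table) is tolerated and anything else still fails loudly. A sketch under that assumption; `split_known` and the sample IDs are hypothetical, and the two gene IDs are the ones discussed in this thread:

```python
def split_known(pairs, known_gene_ids):
    """Separate (sample_id, gene_id) pairs by whether the gene id is known."""
    loadable, skipped = [], []
    for sample_id, gene_id in pairs:
        if gene_id in known_gene_ids:
            loadable.append((sample_id, gene_id))
        else:
            skipped.append((sample_id, gene_id))
    return loadable, skipped


# Hypothetical inputs: 4253 is the current ID, 117153 the discontinued one.
known = {"4253"}
pairs = [("sample-a", "4253"), ("sample-a", "117153")]
loadable, skipped = split_known(pairs, known)
for sample_id, gene_id in skipped:
    print("Skipping unknown gene", gene_id, "for sample", sample_id)
```

Collecting the skipped pairs also makes it straightforward to report them upstream once loading finishes.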

In the event that this is useful, when I download the latest genes file, load with pandas, and run 4253 in df.entrez_gene_id I get True, but when I run 117153 in df.entrez_gene_id, I get False. I feel as though this further supports the idea that the discrepancy between these two files is due to the dates on which they were made, given that some information from Entrez changed in between.
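One caveat when reproducing checks like these: in pandas, `x in series` tests the index labels rather than the values, so a values-based test (for example against `series.values`) is the safer form. The standard-library sketch below sidesteps the question by reading one column into a set. The two-row excerpt is made up for illustration and `column_values` is a hypothetical helper:

```python
import csv
import io


def column_values(handle, column):
    """Collect the distinct values of one column from a TSV file object."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[column] for row in reader}


# Made-up two-row excerpt standing in for genes.tsv.
sample = io.StringIO("entrez_gene_id\tsymbol\n4253\tGENE_A\n1\tGENE_B\n")
gene_ids = column_values(sample, "entrez_gene_id")

# csv yields strings, so the IDs are compared as strings.
has_current = "4253" in gene_ids
has_stale = "117153" in gene_ids
```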

Thanks for looking into this @stephenshank. I reported the issue in cognoma/cancer-data#36.

stephenshank · 2016-12-06T00:23:20Z

One other major change to the models: now using postgresfields.ArrayField(models.CharField(), null=True) in the synonyms and aliases fields of the Gene model.

awm33 · 2016-12-07T00:35:17Z

api/test/test_classifiers.py

+                                         description='foo',
+                                         chromosome='1',
+                                         gene_type='bar',
+                                         synonyms='foo|bar'.split('|'),


Why not just an array?

dhimmel · 2016-12-09T00:02:17Z

@awm33 @stephenshank looking good. Just a few small comments left I believe. Let's prioritize this so we can start analyzing actual data.

stephenshank · 2016-12-09T02:01:20Z

@awm33 Hopefully I did not miss any loose ends with the latest commit... any additional thoughts?

awm33 · 2016-12-09T04:17:57Z

@stephenshank 👍 Looking good

stephenshank · 2016-12-11T21:01:19Z

@dhimmel @awm33 Any other requests before this is ready to merge?

awm33 · 2016-12-11T21:02:32Z

@stephenshank Nope. You're good to go

stephenshank added 3 commits November 13, 2016 18:41

Initial attempt to load gene/mutation data, having Docker issues.

d12acf8

Rough draft of loading of mutation data; issues with postgres.

12c3276

Does a try/catch to handle missing genes gracefully.

91064b9

dhimmel mentioned this pull request Nov 16, 2016

Improving figshare download modularity cognoma/cognoml#15

Open

dhimmel reviewed Nov 16, 2016

Updates API to be compatible with cognoma/genes.

d48475f

awm33 reviewed Dec 3, 2016

dhimmel mentioned this pull request Dec 5, 2016

Reshape mutation matrix for use by core-service repository cognoma/cancer-data#34

Closed

Gets genes file from correct source and gets rid of pandas.

47e65c5

stephenshank commented Dec 6, 2016

awm33 reviewed Dec 7, 2016

Reverts modifications to spacing, uses arrays in Gene tests.

68b0e45

stephenshank changed the title ~~WIP: Acquires and loads gene and mutation data~~ Acquires and loads gene and mutation data Dec 9, 2016

dhimmel mentioned this pull request Dec 9, 2016

Invalid gene in mutation-matrix.tsv.bz2 on figshare v5 cognoma/cancer-data#36

Closed

awm33 merged commit 1cec5df into cognoma:master Dec 11, 2016
