Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Acquires and loads gene and mutation data #42

Merged
merged 6 commits into from
Dec 11, 2016

Conversation

stephenshank
Copy link
Member

Motivation

Addresses #27 by attempting to load remaining cancer-static data (mutations and genes).

API changes

No changes to API, but populates data required by related views.

Implementation Notes

Loading the mutations is painfully slow, and will be the subject of future pull requests to this and cancer-data repositories. We may want to bypass the Django ORM here. Had an issue removing django-genes, so left it in for the moment. Also had an issue using bulk_create on all mutations; this is the reason for doing 1000 at a time.

Functional Tests

Recommendations on how to do this would be appreciated.


gene_path = os.path.join(options['path'], 'genes.tsv')
if not os.path.exists(gene_path):
gene_url = 'http://www.stephenshank.com/genes.tsv'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use https://github.com/cognoma/genes/raw/master/data/genes.tsv instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2')
if not os.path.exists(mutation_path):
mutation_url = 'https://ndownloader.figshare.com/files/5864862'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice if we could use the figshare download logic from cognoml. See cognoml/figshare.py. @jessept is our code modular enough that @stephenshank can use cognoml to download the figshare data here, or would this application be out of scope.

I created a corresponding issue for the cognoml team: cognoma/cognoml#15.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just assigned cognoma/cognoml#15 to myself to help move this forward. We can definitely use our data retrieval code in other places, @stephenshank check if the code here works for what you need. I'm happy to add any additional helper code as well, just let me know what you're looking for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jessept -- nice. My main worry here is that hardcoding the URL for a specific dataset of a specific version of the figshare data is going to cause an upkeep issue later on. So the goal will be to use the cognoml logic for figshare downloads to avoid repeating any efforts and clean this up!

gene_list = []
for row in gene_reader:
gene = Gene(
entrezid=row['entrez_gene_id'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use entrez_gene_id rather than entrezid here as the field name?

description=row['description'],
chromosome=row['chromosome'] or None,
gene_type=row['gene_type'],
synonyms=row['synonyms'] or None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synonyms and aliases are array types. Split by |. See #40 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to just do Gene(**row)? @awm33 do you know?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephenshank I'm guessing the or None is to prevent empty strings in favor of null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awm33 You are correct; perhaps I should be using blank=True as opposed to null=True in models which may contain missing data... if I recall correctly, trying to insert an empty string threw an error.

migrations.CreateModel(
name='Gene',
fields=[
('entrezid', models.IntegerField(primary_key=True, serialize=False)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be entrez_gene_id. See api docs

# Mutations
if Mutation.objects.count() == 0:
mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2')
mutation_df = pd.read_table(mutation_path, index_col=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if it's worth introducing the pandas dependency here. I think a DictReader could do the job.

for row in reader:
    sample_id = row.pop('sample_id')
    for entrez_gene_id, mutation_status in row.items():
        if mutation_status == '1':
            # Create mutation from entrez_gene_id, sample_id

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loading the mutations is painfully slow, and will be the subject of future pull requests to this and cancer-data repositories.

This could be due to using iterrows in pandas which can be slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dido, let's nix Pandas if we can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas will be eliminated in the next revision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See more detailed workaround at cognoma/cancer-data#34 (comment).

)
mutation_list.append(mutation)
except:
print('OOPS! Had an issue inserting sample', sample_id, 'mutation', mutated_gene)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't happen? Have you had any issues? I don't think we want to except this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be a hard fail.

@@ -2,8 +2,9 @@
import csv

from django.core.management.base import BaseCommand
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we adding pandas an a dependency in a web app? Using it to just load a CSV for a data model seems like overkill


GENDER_CHOICES = (
("male", "Male"),
("female", "Female")
)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? Usually one line is enough

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -21,13 +21,15 @@ class Meta:
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you added double spacing to this file, please revert back to single spacing for consistency.

@@ -16,3 +16,4 @@ pycparser==2.16
pycrypto==2.6.1
PyJWT==1.4.2
six==1.10.0
pandas==0.18.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove pandas, remember to remove it from the requirements as well.

@awm33
Copy link
Member

awm33 commented Dec 3, 2016

@stephenshank I left some feedback. It also looks like the tests are failing. But overall, good PR, just need to address some issues.

@stephenshank
Copy link
Member Author

@dhimmel @awm33 Thanks for the comments! I have opened a PR in cancer-data with a reformatted mutation matrix that is much more amenable for this task.

I also have an updated version of this branch which passes all tests, after some required API changes. I will further modify it to make use of the new mutation matrix, which will get rid of the pandas dependency, and hope to update this PR by tomorrow morning at the latest.

sample = Sample.objects.get(sample_id=sample_id)
for entrez_gene_id, mutation_status in row.items():
if mutation_status == '1':
try:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed catch a lone exception, even with the up-to-date genes.tsv file. The Entrez ID is 117153. Putting this into NCBI's Gene database shows that it is an out of date ID, and related to melanoma. The current ID is 4253, which is also found in mutation-matrix.tsv.bz2.

Copy link
Member

@dhimmel dhimmel Dec 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using the latest figshare files (v5)?

From https://api.figshare.com/v2/articles/3487685:

[
   {
      "is_link_only":false,
      "size":173889264,
      "id":5864859,
      "download_url":"https://ndownloader.figshare.com/files/5864859",
      "name":"expression-matrix.tsv.bz2"
   },
   {
      "is_link_only":false,
      "size":1564703,
      "id":5864862,
      "download_url":"https://ndownloader.figshare.com/files/5864862",
      "name":"mutation-matrix.tsv.bz2"
   },
   {
      "is_link_only":false,
      "size":772313,
      "id":6207135,
      "download_url":"https://ndownloader.figshare.com/files/6207135",
      "name":"samples.tsv"
   },
   {
      "is_link_only":false,
      "size":1211305,
      "id":6207138,
      "download_url":"https://ndownloader.figshare.com/files/6207138",
      "name":"covariates.tsv"
   }
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are. Will keep investingating

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you delete your local mutation-matrix.tsv.bz2, maybe it's outdated but since it exists is not re-downloading... that's all I can think of.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you delete your local mutation-matrix.tsv.bz2...

Just tried this, and the issue persists. ID 117153 (the lone exception) was discontinued on 9/10/16 and replaced with 4253. To investigate, I downloaded mutation-matrix.tsv.bz2 from the url in your JSON above. After loading with df=pandas.read_table(path, index_col=0), when I run '4253' in df.columns and '117153' in df.columns, both return True.

Looking at commit histories in cancer-data, the mutation matrix appears to have been made before this date. Looking at commits from genes, Entrez information was obtained after this date. My guess is that this ID changed between the creation of these two files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, mutation-matrix.tsv.bz2 should filter out all genes that are not in cognoma/genes. I'll look into this, but for now keeping the error handling makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the event that this is useful, when I download the latest genes file, load with pandas, and run 4253 in df.entrez_gene_id I get True, but when I run 117153 in df.entrez_gene_id, I get False. I feel as though this further supports the idea that the discrepancy between these two files is due to the dates on which they were made, given that some information from Entrez changed in between.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this @stephenshank. I reported the issue in cognoma/cancer-data#36.

@stephenshank
Copy link
Member Author

One other major change to the models: now using postgresfields.ArrayField(models.CharField(), null=True) in the synonyms and aliases fields of the Gene model.

description='foo',
chromosome='1',
gene_type='bar',
synonyms='foo|bar'.split('|'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just an array?

@dhimmel
Copy link
Member

dhimmel commented Dec 9, 2016

@awm33 @stephenshank looking good. Just a few small comments left I believe. Let's prioritize this so we can start analyzing actual data.

@stephenshank stephenshank changed the title WIP: Acquires and loads gene and mutation data Acquires and loads gene and mutation data Dec 9, 2016
@stephenshank
Copy link
Member Author

@awm33 Hopefully I did not miss any loose ends with the latest commit... any additional thoughts?

@awm33
Copy link
Member

awm33 commented Dec 9, 2016

@stephenshank 👍 Looking good

@stephenshank
Copy link
Member Author

@dhimmel @awm33 Any other requests before this is ready to merge?

@awm33
Copy link
Member

awm33 commented Dec 11, 2016

@stephenshank Nope. You're good to go

@awm33 awm33 merged commit 1cec5df into cognoma:master Dec 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants