-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Acquires and loads gene and mutation data #42
Conversation
|
||
gene_path = os.path.join(options['path'], 'genes.tsv') | ||
if not os.path.exists(gene_path): | ||
gene_url = 'http://www.stephenshank.com/genes.tsv' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use https://github.com/cognoma/genes/raw/master/data/genes.tsv
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
||
mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2') | ||
if not os.path.exists(mutation_path): | ||
mutation_url = 'https://ndownloader.figshare.com/files/5864862' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if we could use the figshare download logic from cognoml
. See cognoml/figshare.py
. @jessept is our code modular enough that @stephenshank can use cognoml to download the figshare data here, or would this application be out of scope.
I created a corresponding issue for the cognoml
team: cognoma/cognoml#15.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just assigned cognoma/cognoml#15 to myself to help move this forward. We can definitely use our data retrieval code in other places, @stephenshank check if the code here works for what you need. I'm happy to add any additional helper code as well, just let me know what you're looking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jessept -- nice. My main worry here is that hardcoding the URL for a specific dataset of a specific version of the figshare data is going to cause an upkeep issue later on. So the goal will be to use the cognoml logic for figshare downloads to avoid repeating any efforts and clean this up!
gene_list = [] | ||
for row in gene_reader: | ||
gene = Gene( | ||
entrezid=row['entrez_gene_id'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use entrez_gene_id
rather than entrezid
here as the field name?
description=row['description'], | ||
chromosome=row['chromosome'] or None, | ||
gene_type=row['gene_type'], | ||
synonyms=row['synonyms'] or None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
synonyms
and aliases
are array types. Split by |
. See #40 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to just do Gene(**row)
? @awm33 do you know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephenshank I'm guessing the or None
is to prevent empty strings in favor of null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@awm33 You are correct; perhaps I should be using blank=True
as opposed to null=True
in models which may contain missing data... if I recall correctly, trying to insert an empty string threw an error.
migrations.CreateModel( | ||
name='Gene', | ||
fields=[ | ||
('entrezid', models.IntegerField(primary_key=True, serialize=False)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be entrez_gene_id
. See api docs
# Mutations | ||
if Mutation.objects.count() == 0: | ||
mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2') | ||
mutation_df = pd.read_table(mutation_path, index_col=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if it's worth introducing the pandas dependency here. I think a DictReader could do the job.
for row in reader:
sample_id = row.pop('sample_id')
for entrez_gene_id, mutation_status in row.items():
if mutation_status == '1':
# Create mutation from entrez_gene_id, sample_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loading the mutations is painfully slow, and will be the subject of future pull requests to this and cancer-data repositories.
This could be due to using iterrows in pandas which can be slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dido, let's nix Pandas if we can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas will be eliminated in the next revision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See more detailed workaround at cognoma/cancer-data#34 (comment).
) | ||
mutation_list.append(mutation) | ||
except: | ||
print('OOPS! Had an issue inserting sample', sample_id, 'mutation', mutated_gene) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't happen? Have you had any issues? I don't think we want to except this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should be a hard fail.
@@ -2,8 +2,9 @@ | |||
import csv | |||
|
|||
from django.core.management.base import BaseCommand | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we adding pandas an a dependency in a web app? Using it to just load a CSV for a data model seems like overkill
|
||
GENDER_CHOICES = ( | ||
("male", "Male"), | ||
("female", "Female") | ||
) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo? Usually one line is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was for PEP 8 compliancy with classes and blank lines. Will revert
@@ -21,13 +21,15 @@ class Meta: | |||
created_at = models.DateTimeField(auto_now_add=True) | |||
updated_at = models.DateTimeField(auto_now=True) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you added double spacing to this file, please revert back to single spacing for consistency.
@@ -16,3 +16,4 @@ pycparser==2.16 | |||
pycrypto==2.6.1 | |||
PyJWT==1.4.2 | |||
six==1.10.0 | |||
pandas==0.18.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove pandas, remember to remove it from the requirements as well.
@stephenshank I left some feedback. It also looks like the tests are failing. But overall, good PR, just need to address some issues. |
@dhimmel @awm33 Thanks for the comments! I have opened a PR in I also have an updated version of this branch which passes all tests, after some required API changes. I will further modify it to make use of the new mutation matrix, which will get rid of the pandas dependency, and hope to update this PR by tomorrow morning at the latest. |
sample = Sample.objects.get(sample_id=sample_id) | ||
for entrez_gene_id, mutation_status in row.items(): | ||
if mutation_status == '1': | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does indeed catch a lone exception, even with the up-to-date genes.tsv
file. The Entrez ID is 117153. Putting this into NCBI's Gene database shows that it is an out of date ID, and related to melanoma. The current ID is 4253, which is also found in mutation-matrix.tsv.bz2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you using the latest figshare files (v5)?
From https://api.figshare.com/v2/articles/3487685:
[
{
"is_link_only":false,
"size":173889264,
"id":5864859,
"download_url":"https://ndownloader.figshare.com/files/5864859",
"name":"expression-matrix.tsv.bz2"
},
{
"is_link_only":false,
"size":1564703,
"id":5864862,
"download_url":"https://ndownloader.figshare.com/files/5864862",
"name":"mutation-matrix.tsv.bz2"
},
{
"is_link_only":false,
"size":772313,
"id":6207135,
"download_url":"https://ndownloader.figshare.com/files/6207135",
"name":"samples.tsv"
},
{
"is_link_only":false,
"size":1211305,
"id":6207138,
"download_url":"https://ndownloader.figshare.com/files/6207138",
"name":"covariates.tsv"
}
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you are. Will keep investingating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you delete your local mutation-matrix.tsv.bz2
, maybe it's outdated but since it exists is not re-downloading... that's all I can think of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you delete your local mutation-matrix.tsv.bz2...
Just tried this, and the issue persists. ID 117153 (the lone exception) was discontinued on 9/10/16 and replaced with 4253. To investigate, I downloaded mutation-matrix.tsv.bz2
from the url in your JSON above. After loading with df=pandas.read_table(path, index_col=0)
, when I run '4253' in df.columns
and '117153' in df.columns
, both return True
.
Looking at commit histories in cancer-data, the mutation matrix appears to have been made before this date. Looking at commits from genes, Entrez information was obtained after this date. My guess is that this ID changed between the creation of these two files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, mutation-matrix.tsv.bz2
should filter out all genes that are not in cognoma/genes. I'll look into this, but for now keeping the error handling makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the event that this is useful, when I download the latest genes file, load with pandas
, and run 4253 in df.entrez_gene_id
I get True
, but when I run 117153 in df.entrez_gene_id
, I get False
. I feel as though this further supports the idea that the discrepancy between these two files is due to the dates on which they were made, given that some information from Entrez changed in between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this @stephenshank. I reported the issue in cognoma/cancer-data#36.
One other major change to the models: now using |
description='foo', | ||
chromosome='1', | ||
gene_type='bar', | ||
synonyms='foo|bar'.split('|'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just an array?
@awm33 @stephenshank looking good. Just a few small comments left I believe. Let's prioritize this so we can start analyzing actual data. |
@awm33 Hopefully I did not miss any loose ends with the latest commit... any additional thoughts? |
@stephenshank 👍 Looking good |
@stephenshank Nope. You're good to go |
Motivation
Addresses #27 by attempting to load remaining cancer-static data (mutations and genes).
API changes
No changes to API, but populates data required by related views.
Implementation Notes
Loading the mutations is painfully slow, and will be the subject of future pull requests to this and
cancer-data
repositories. We may want to bypass the Django ORM here. Had an issue removingdjango-genes
, so left it in for the moment. Also had an issue usingbulk_create
on all mutations; this is the reason for doing 1000 at a time.Functional Tests
Recommendations on how to do this would be appreciated.