This repository has been archived by the owner on Mar 17, 2023. It is now read-only.

Add step to filter for only NP_ sequences from NCBI #28

Open
olgabot opened this issue Apr 10, 2020 · 0 comments

olgabot commented Apr 10, 2020

In some analyses, I was getting a ton of matches to ribosomal proteins, mitochondrial genes, and ferritin.

Ferritin example

A00111:192:HFVL5DMXX:1:1228:17002:16094 XP_021536930.1  100.0   1.8e-08 62.4    XP_021536930.1 ferritin light chain [Neomonachus schauinslandi] 29088   Neomonachus schauinslandi       Eukaryota       Chordata

Ribosomal example

A00111:43:H252WDMXX:1:2458:30481:16986  XP_004873387.2  100.0   9.5e-10 66.6    XP_004873387.2 60S ribosomal protein L18a [Heterocephalus glaber]       10181   Heterocephalus glaber   Eukaryota       Chordata

Mitochondrial example

A00111:192:HFVL5DMXX:1:1167:31439:29512 YP_009412754.1  100.0   2.3e-08 62.0    YP_009412754.1 cytochrome c oxidase subunit II (mitochondrion) [Microcebus arnholdi]    864580  Microcebus arnholdi     Eukaryota       Chordata

Why are these coming up?

Here is the GeneCards summary for ferritin (emphasis mine):

This gene encodes the heavy subunit of ferritin, the major intracellular iron storage protein in prokaryotes and eukaryotes. It is composed of 24 subunits of the heavy and light ferritin chains. Variation in ferritin subunit composition may affect the rates of iron uptake and release in different tissues. A major function of ferritin is the storage of iron in a soluble and nontoxic state. Defects in ferritin proteins are associated with several neurodegenerative diseases. This gene has multiple pseudogenes. Several alternatively spliced transcript variants have been observed, but their biological validity has not been determined. [provided by RefSeq, Jul 2008]

So it's very possible that these proteins only appear in these genomes as a result of contamination that wasn't properly filtered out. If you look closely, the protein IDs in the hits above start with either XP_ or YP_, whereas a "good" result has an NP_ ID and looks like:

A00111:47:H2HT7DMXX:1:1172:23556:31454  NP_001334083.1  100.0   8.1e-09 63.5    NP_001334083.1 major urinary protein 22 precursor [Mus musculus]        10090   Mus musculus    Eukaryota       Chordata
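As a rough sketch of what the filter could look like on the hit table itself, the Python below keeps only rows whose subject accession starts with NP_. It assumes the hits are tab-separated with the accession in the second column, as in the examples above; the function name `keep_np_hits` and its parameters are made up for illustration and are not part of the existing pipeline.

```python
def keep_np_hits(lines, accession_column=1, prefix="NP_"):
    """Yield the fields of each hit whose subject accession starts with `prefix`.

    Assumes tab-separated lines with the accession in `accession_column`
    (0-based), matching the layout of the example hits in this issue.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or short lines instead of raising an IndexError.
        if len(fields) <= accession_column:
            continue
        if fields[accession_column].startswith(prefix):
            yield fields
```

Calling `list(keep_np_hits(open_file))` on a results file would drop the XP_ and YP_ rows and keep rows like the NP_001334083.1 one.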

Why is this happening?

From NCBI’s website: https://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_is_the_difference_between

Accession numbers that begin with the prefix XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein) are model RefSeqs produced either by NCBI’s genome annotation pipeline or copied from computationally annotated submissions to the INSDC. These RefSeq records are derived from the genome sequence and have varying levels of transcript or protein homology support. They represent the predicted transcripts and proteins annotated on the NCBI RefSeq contigs and may differ from INSDC mRNA submissions or from the subsequently curated RefSeq records (with NM_, NR_, or NP_ accession prefixes).

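Therefore → ONLY NP_ RECORDS SHOULD BE USED FROM NCBI!!
XP_ records are computationally predicted models with varying levels of support, so they are not to be trusted here. Filtering to NP_ also drops the YP_ mitochondrial hits.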
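One possible shape for the filtering step, applied to the reference FASTA before building the search database, is sketched below. It assumes each header starts with the RefSeq accession right after `>` (e.g. `>NP_001334083.1 major urinary protein 22 precursor [Mus musculus]`); `filter_fasta_by_prefix` is a hypothetical helper name, not something that exists in this repo.

```python
def filter_fasta_by_prefix(in_handle, out_handle, prefix="NP_"):
    """Copy only FASTA records whose accession starts with `prefix`.

    Assumes the accession is the first whitespace-delimited token of each
    header line. Returns the number of records written.
    """
    kept = 0
    keep_record = False
    for line in in_handle:
        if line.startswith(">"):
            tokens = line[1:].split(None, 1)
            accession = tokens[0] if tokens else ""
            keep_record = accession.startswith(prefix)
            if keep_record:
                kept += 1
        # Sequence lines follow the decision made at their header.
        if keep_record:
            out_handle.write(line)
    return kept
```

It streams line by line, so it should be fine on a large protein FASTA; both arguments are ordinary text file handles opened by the caller.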
