This CL adds a new import for NCBI Gene. The data cleaning and testing are documented on [GitHub](datacommonsorg/data#1084). NCBI Gene is updated daily. We included the following datasets in this import:

1. [NCBI Gene](https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz).
2. [gene2pubmed](https://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz).
3. [gene_neighbors](https://ftp.ncbi.nih.gov/gene/DATA/gene_neighbors.gz).
4. [gene_orthologs](https://ftp.ncbi.nih.gov/gene/DATA/gene_orthologs.gz).
5. [gene_group](https://ftp.ncbi.nih.gov/gene/DATA/gene_group.gz).
6. [mim2gene_medgen](https://ftp.ncbi.nih.gov/gene/DATA/mim2gene_medgen).
7. [gene2go](https://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz).
8. [gene2accession](https://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz).
9. [gene2ensembl](https://ftp.ncbi.nih.gov/gene/DATA/gene2ensembl.gz).
10. [generifs_basic](https://ftp.ncbi.nih.gov/gene/GeneRIF/generifs_basic.gz).
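
As a rough illustration of how one of the gzip-compressed files listed above could be read, here is a minimal Python sketch. It is not the import pipeline from this CL. It assumes the file is tab-delimited and that its first line is a `#`-prefixed header of column names; the function name `read_gene_table` and the three-column sample are hypothetical and trimmed for brevity.

```python
import csv
import gzip
import io


def read_gene_table(binary_stream):
    """Yield one dict per data row of a gzip-compressed, tab-delimited file.

    Assumption: the first line is a header that starts with '#'.
    """
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        # Strip the line ending and the leading '#' from the header line.
        header = text.readline().rstrip("\r\n").lstrip("#").split("\t")
        # QUOTE_NONE keeps quote characters inside fields untouched.
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield dict(zip(header, row))


# Illustrative in-memory sample (hypothetical, only three columns).
sample = (
    "#tax_id\tGeneID\tSymbol\n"
    "9606\t672\tBRCA1\n"
    "9606\t7157\tTP53\n"
)
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
rows = list(read_gene_table(compressed))
print(rows[0]["Symbol"], rows[1]["GeneID"])
```

Because the function is a generator, a large file can be streamed row by row instead of being loaded into memory at once.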

[NCBI Gene](https://www.ncbi.nlm.nih.gov/gene) is a comprehensive resource containing information about genes from a wide range of species. It serves as a central hub for gene-specific data, integrating information from various sources and providing links to other relevant resources. It includes gene identification (e.g., official gene symbols, aliases, and cross-references to other databases), sequence information (e.g., genomic location and reference sequences (RefSeqs) for genomic DNA, transcripts, proteins, and mature peptides), functional information (gene function descriptions, associated pathways, related biological processes, orthologs, and related genes), phenotypic associations (i.e., links to phenotypes and diseases associated with the gene), and links to relevant scientific papers (i.e., PubMed IDs).

"[NCBI Gene](https://www.ncbi.nlm.nih.gov/gene) supplies gene-specific connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are used throughout NCBI's databases and tracked through updates of annotation. Gene includes genomes represented by [NCBI Reference Sequences](https://www.ncbi.nlm.nih.gov/refseq/) (or RefSeqs) and is integrated for indexing and query and retrieval from NCBI's Entrez and [E-Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/) systems. Gene comprises sequences from thousands of distinct taxonomic identifiers, ranging from viruses to bacteria to eukaryotes. It represents chromosomes, organelles, plasmids, viruses, transcripts, and millions of proteins."

PiperOrigin-RevId: 690868739
spiekos authored and copybara-github committed Oct 30, 2024
1 parent f405b86 commit aa40401
Showing 5 changed files with 1,259 additions and 164 deletions.
16 changes: 1 addition & 15 deletions biomedical_schema/chemical_compound.mcf
@@ -21,7 +21,7 @@ description: "An antibody is a kind of protective protein which is produced by t
Node: dcid:BiomedicalEntity
name: "BiomedicalEntity"
typeOf: schema:Class
subClassOf: schema:BioChemEntity
subClassOf: dcs:Thing
description: "Biomedical related entities."

Node: dcid:ChemicalCompound
@@ -546,20 +546,6 @@ rangeIncludes: schema:Boolean
description: "The Clinical Pharmacogenetics Implementation Consortium (CPIC) was established in 2009 as a shared project between PharmGKB and the Pharmacogenomics Research Network (PGRN). CPIC is funded by the NIH/NHGRI. This indicates whether a gene has a drug dosing guideline issued by the CPIC that is associated with it."
descriptionUrl: "https://www.pharmgkb.org/page/cpic"

Node: dcid:hasGenomicCoordinates
name: "hasGenomicCoordinates"
typeOf: schema:Property
domainIncludes: dcs:GenomeAnnotation
rangeIncludes: dcs:GenomicCoordinates
description: "Genomic coordinates specify the location of the position of an element within a specified genome assembly. It is a specified set of chromosome start_position end_position."

Node: dcid:hasGeneticVariantAnnotation
name: "hasGeneticVariantAnnotation"
typeOf: schema:Property
domainIncludes: dcs:Gene
rangeIncludes: schema:Boolean
description: "Indicates whether there are gene has genetic variants that are associated with it."

Node: dcid:humanCellType
typeOf: schema:Property
name: "humanCellType"
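
The hunks above show the MCF layout used by these schema files: blank-line-separated blocks that open with a `Node:` line and continue with `property: value` lines. A minimal Python sketch of reading that layout follows. It is an assumption-laden illustration, not Data Commons tooling: `parse_mcf` is a hypothetical helper, values are kept as raw strings, and the sample text is copied from the hunk above with the `description` lines left out.

```python
def parse_mcf(text):
    """Split MCF text into a list of nodes, each a dict of property -> list of values."""
    nodes = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current block, if any.
            if current:
                nodes.append(current)
                current = {}
            continue
        # Split on the first colon only, so "dcid:..." values stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines that are not "property: value"
        # A property may appear more than once, so collect values in a list.
        current.setdefault(key.strip(), []).append(value.strip())
    if current:
        nodes.append(current)
    return nodes


sample = """\
Node: dcid:hasGenomicCoordinates
name: "hasGenomicCoordinates"
typeOf: schema:Property
domainIncludes: dcs:GenomeAnnotation
rangeIncludes: dcs:GenomicCoordinates

Node: dcid:hasGeneticVariantAnnotation
name: "hasGeneticVariantAnnotation"
typeOf: schema:Property
domainIncludes: dcs:Gene
rangeIncludes: schema:Boolean
"""

nodes = parse_mcf(sample)
for node in nodes:
    print(node["Node"][0], "->", node["rangeIncludes"][0])
```

Keeping values unsplit avoids breaking quoted descriptions that contain commas; a stricter reader would need to handle quoting before splitting multi-valued properties.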
18 changes: 13 additions & 5 deletions biomedical_schema/disease.mcf
@@ -258,6 +258,15 @@ domainIncludes: dcs:MeSHConcept
rangeIncludes: dcs:MeSHConcept
description: "The preferred MeSH Concept to which the MeSH Concept that is narrower in scope is related."

Node: dcid:scopeNote
name: "scopeNote"
typeOf: schema:Property
rangeIncludes: schema:Text
domainIncludes: dcs:Thing
specializationOf: dcs:description
description: "A scope note is a concise explanatory text that defines the intended meaning and usage of a term or concept within a specific context. It clarifies the meaning of the term, specifies the boundaries of the concept, and provides guidance on its usage."
descriptionUrl: "https://www.nlm.nih.gov/mesh/xml_data_elements.html#ScopeNote"

Node: dcid:snomedCT
typeOf: schema:Property
name: "snomedCT"
@@ -269,12 +278,11 @@ descriptionUrl: "https://www.snomed.org/use-snomed-ct"
Node: dcid:umlsConceptUniqueID
typeOf: schema:Property
name: "umlsConceptUniqueID"
domainIncludes: dcs:Disease
rangeIncludes: schema:Text,dcs:MeSHConcept
description: "A Unified Medical Language System (UMLS) Concept Unique ID (CUI) is a unique identifier in the Metathesaurus for a concept. CUI contain the letter C followed by seven numbers. An example of a CUI is C0018681."
domainIncludes: dcs:Disease, dcs:UmlsConceptUniqueIdentifier
rangeIncludes: dcs:MeSHConcept, dcs:UmlsConceptUniqueIdentifier, schema:Text
abbreviation: "CUI"
description: "A concept is a meaning. A meaning can have many different names. A key goal of Metathesaurus construction is to understand the intended meaning of each name in each source vocabulary and to link all the names from all of the source vocabularies that mean the same thing (the synonyms). CUI contain the letter C followed by seven numbers. In the example on the right the CUI is C0018681."
descriptionUrl: "https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html"
synonym: "unified medical language system concept ID"
sameAs: dcs:unifiedMedicalLanguageSystemConceptUniqueIdentifier

Node: dcid:unifiedMedicalLanguageSystemConceptUniqueIdentifier
typeOf: schema:Property
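
The `umlsConceptUniqueID` descriptions in the hunk above state that a CUI is the letter C followed by seven numbers, with C0018681 as the example. A minimal Python sketch of a format check based only on that statement follows; the names `CUI_PATTERN` and `is_valid_cui` are hypothetical, and the check says nothing about whether a given CUI actually exists in the Metathesaurus.

```python
import re

# Assumption taken from the description text: an uppercase "C" followed by
# exactly seven ASCII digits.
CUI_PATTERN = re.compile(r"C[0-9]{7}")


def is_valid_cui(value):
    """Return True if value has the CUI shape described in the schema text."""
    return CUI_PATTERN.fullmatch(value) is not None


print(is_valid_cui("C0018681"))  # example CUI from the description
print(is_valid_cui("C001868"))   # only six digits
```

A check like this could run before values are written to the `umlsConceptUniqueID` property, so that malformed identifiers are caught early.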