Archive and load updated RSC linksets #23

Open
stain opened this issue Oct 14, 2015 · 5 comments

stain commented Oct 14, 2015

Available at http://ops.rsc.org/download/RDF-2015.10.09.zip or as separate resources from http://ops.rsc.org/download/20151009/void_2015-10-09.ttl by following void:dataDump.

TODO: Modify this build job https://github.com/openphacts/ops-rsc-wikipathways-dataset/ -- not sure if this should be one big ops-rsc-dataset, or probably better, one per linkset.

Now easier to download from http://ops.rsc.org/download/ without authentication needed.

@stain stain self-assigned this Oct 14, 2015
@stain stain added this to the 2.1 milestone Oct 14, 2015

stain commented Oct 14, 2015

RDF syntax issue in 20151009.

Tested with riot --validate from Apache Jena 3.0.0

MESH/ISSUES_MESH20151009.ttl.gz

ERROR riot :: [line: 161, col: 47] Bad character in IRI (space): http://purl.bioontology.org/ontology/MSH/...[space]...

      <http://purl.bioontology.org/ontology/MSH/...  (((4-(1,4,5,6R-trans-tetrahydro-2- pyrimidinyl)phenyl)acetyl)amino)-5-thia-> cheminf:CHEMINF_000560 "Contains completely undefined stereo: enantiomers"@en .

MESH/LINKSET_EXACT_MESH20151009.ttl.gz

ERROR riot :: [line: 125, col: 95] Bad character in IRI (space): http://purl.bioontology.org/ontology/MSH/...[space]...

Line 125:

      <http://ops.rsc.org/OPS1965918> skos:exactMatch <http://purl.bioontology.org/ontology/MSH/... th 3-(aminocarbonyl)-1-beta-D-ribofuranosylpyridinium hydroxide inner saltN-> .
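
A quick way to locate such lines before running the full riot validation could look like the sketch below. It is a regular-expression heuristic, not a Turtle parser, and the function name `find_bad_iris` and the example.org sample data are made up for illustration. It reports the column where the bracketed IRI starts, which is not necessarily the column riot prints.

```python
import re

# Candidate IRIs: "<...>" containing whitespace, with no nested brackets or quotes.
# Heuristic only: a string literal that contains "<" and ">" could also match.
IRI_WITH_SPACE = re.compile(r'<[^<>"]*\s[^<>"]*>')


def find_bad_iris(lines):
    """Yield (line_number, start_column, iri) for each bracketed IRI containing whitespace."""
    for lineno, line in enumerate(lines, start=1):
        for match in IRI_WITH_SPACE.finditer(line):
            yield lineno, match.start() + 1, match.group(0)


# Hypothetical sample data; real input would be the lines of an unzipped .ttl file.
sample = [
    '<http://example.org/a> skos:exactMatch <http://example.org/b> .',
    '<http://example.org/c> skos:exactMatch <http://example.org/d e> .',
]
bad = list(find_bad_iris(sample))
for lineno, col, iri in bad:
    print(f"line {lineno}, col {col}: {iri}")
```

For the gzipped files, the same function can be fed from `gzip.open(path, "rt", encoding="utf-8")`.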

stain commented Oct 14, 2015

URI mismatch in void:inDataset statements - date changed during data generation?

    stain@biggie:~/Downloads/rsc/20151009/HUMAN_METABOLOME_DATABASE$ riot * |grep inData | cut -d ">" -f 3

     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_charge_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_isotope_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_stereo_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_super_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_tautomer_insensitive_parent_closeMatch
     <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_exactMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_ops_chemspider_exactMatch
     <http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_fragment_relatedMatch
     <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#openphacts-human_metabolome_database
     <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#openphacts-human_metabolome_database

while the VoID consistently says 20151009 or 2015-10-09:

    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_exactMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#exactMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_ops_chemspider_exactMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#exactMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_charge_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_fragment_relatedMatch> <http://rdfs.org/ns/void#linkPredicate>
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_isotope_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_stereo_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_super_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .
    <http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_parent_child_tautomer_insensitive_parent_closeMatch> <http://rdfs.org/ns/void#linkPredicate> <http://www.w3.org/2004/02/skos/core#closeMatch> .

(Note that void_2015-10-09.ttl here is correct as it is not a .ttl.gz)
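
The check above could be automated roughly as follows. This is only a sketch: the regular expression assumes the `/download/YYYYMMDD/void_YYYY-MM-DD.ttl#fragment` URI shape seen in this release, and `mismatched_datasets` is a hypothetical helper name.

```python
import re

# Release directory, VoID file date and fragment in URIs shaped like
# http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#fragment
VOID_URI = re.compile(r'/download/(\d{8})/void_(\d{4}-\d{2}-\d{2})\.ttl#([^>\s]+)')


def mismatched_datasets(uris, expected_dir):
    """Return fragments whose directory or file date differs from expected_dir (YYYYMMDD)."""
    expected_date = f"{expected_dir[:4]}-{expected_dir[4:6]}-{expected_dir[6:]}"
    bad = []
    for uri in uris:
        m = VOID_URI.search(uri)
        if m is None:
            continue  # not a VoID dataset URI of the assumed shape
        directory, date, fragment = m.groups()
        if directory != expected_dir or date != expected_date:
            bad.append(fragment)
    return bad


# Two of the void:inDataset values from the output above.
uris = [
    "<http://ops.rsc.org/download/20151010/void_2015-10-10.ttl#human_metabolome_database_parent_child_charge_insensitive_parent_closeMatch",
    "<http://ops.rsc.org/download/20151009/void_2015-10-09.ttl#human_metabolome_database_exactMatch",
]
wrong = mismatched_datasets(uris, "20151009")
print(wrong)
```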

stain commented Oct 14, 2015

The VoID has the wrong void:dataDump directory for HMDB and MESH: the URLs are missing the subfolder names.

    :openphacts-human_metabolome_database dcterms:description "The subset of OpenPhacts that contains Human Metabolome Database data."@en;
        dcterms:title "OpenPhacts Human Metabolome Database Subset"@en;
        void:dataDump <http://ops.rsc.org/download/20151009/ISSUES_HUMAN_METABOLOME_DATABASE20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/PROPERTIES_HUMAN_METABOLOME_DATABASE20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/SYNONYMS_HUMAN_METABOLOME_DATABASE20151009.ttl.gz>;
    :openphacts-mesh dcterms:description "The subset of OpenPhacts that contains MeSH data."@en;
        dcterms:title "OpenPhacts MeSH Subset"@en;
        void:dataDump <http://ops.rsc.org/download/20151009/ISSUES_MESH20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/PROPERTIES_MESH20151009.ttl.gz>,
                      <http://ops.rsc.org/download/20151009/SYNONYMS_MESH20151009.ttl.gz>;
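
If the VoID had to be patched on our side, the missing subfolder could perhaps be derived from the file name. The sketch below assumes the subfolder equals the dataset part of the file name (as in MESH/ISSUES_MESH20151009.ttl.gz and the HUMAN_METABOLOME_DATABASE directory in the unpacked zip); whether the server uses the same layout is not confirmed. The prefix list and the name `with_subfolder` are illustrative only.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# File kinds seen in this thread; the real release may use other prefixes.
KNOWN_PREFIXES = ("ISSUES_", "PROPERTIES_", "SYNONYMS_", "LINKSET_EXACT_")
# Remainder of the file name after the prefix: DATASET + YYYYMMDD + .ttl[.gz]
DATASET_AND_DATE = re.compile(r'^(?P<dataset>[A-Z_]+?)(?P<date>\d{8})\.ttl(\.gz)?$')


def with_subfolder(url):
    """Insert the assumed dataset subfolder before the file name of a dataDump URL."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    for prefix in KNOWN_PREFIXES:
        if not filename.startswith(prefix):
            continue
        m = DATASET_AND_DATE.match(filename[len(prefix):])
        if m is None:
            continue
        dataset = m.group("dataset")
        if directory.endswith("/" + dataset):
            return url  # subfolder already present
        return urlunsplit(parts._replace(path=f"{directory}/{dataset}/{filename}"))
    return url  # unknown naming pattern: leave untouched


fixed = with_subfolder("http://ops.rsc.org/download/20151009/ISSUES_MESH20151009.ttl.gz")
print(fixed)
```

URLs that do not follow the assumed naming pattern are returned unchanged, so the function is safe to run over every void:dataDump value.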

stain commented Oct 14, 2015

The pav:previousVersion statements in the VoID point, misleadingly, to the same version:

:chebi_exactMatch pav:previousVersion :chebi_exactMatch .
:drugbank_exactMatch pav:previousVersion :drugbank_exactMatch .

Should these go to anchors within the previous VoID release under ftp://[email protected]/OPS/ somewhere?
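
Self-references like these are easy to flag mechanically. The following is a minimal sketch that only handles simple one-line statements written with prefixed names, as in the excerpt above; it is not a Turtle parser, and `self_referencing` is a hypothetical helper. The `:example_a` / `:example_b` line is invented to show a non-matching case.

```python
import re

# Matches one-line statements such as ":a pav:previousVersion :b ."
# Multi-line or ";"-chained statements are deliberately ignored.
PREVIOUS_VERSION = re.compile(r'^\s*(\S+)\s+pav:previousVersion\s+(\S+)\s*\.\s*$')


def self_referencing(lines):
    """Return subjects whose pav:previousVersion points back at themselves."""
    found = []
    for line in lines:
        m = PREVIOUS_VERSION.match(line)
        if m and m.group(1) == m.group(2):
            found.append(m.group(1))
    return found


statements = [
    ":chebi_exactMatch pav:previousVersion :chebi_exactMatch .",
    ":drugbank_exactMatch pav:previousVersion :drugbank_exactMatch .",
    ":example_a pav:previousVersion :example_b .",  # made-up, well-formed case
]
loops = self_referencing(statements)
print(loops)
```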

stain added a commit to openphacts/ops-rsc-dataset that referenced this issue Nov 9, 2015

stain commented Nov 9, 2015

Update for RDF-2015.11.04.zip from http://ops.rsc.org/download/RDF-2015.11.04.zip (2.2 GiB, 20 GB unzipped):

I made a Maven job to archive and patch the data (still building; the download speed from http://ops.rsc.org/ is not ideal, it seems to be about 5 MBit/s?). Once it is archived I can use http://repository.mygrid.org.uk/artifactory/ops/org/openphacts/data/ops-rsc-dataset/20151104-SNAPSHOT/ instead, so this is not a big issue.

MESH errors remain - but the rest of the linksets are all valid Turtle. I added patches to remove the offending lines - this means those URIs won't have matching links to MESH identifiers.

The void:dataDump links are now updated, but now all of them are 404, e.g.

Simply unpacking the zip file in its current download directory should fix this, which would make
http://ops.rsc.org/download/20151104/ work.

I see the files are now .ttl instead of .ttl.gz, which increases the disk space required for unzipping roughly tenfold, but I can repackage them in the archival job.
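
The repackaging itself is meant to happen in the Maven archival job; purely as an illustration of the step, a standalone Python equivalent might look like this. The function name `gzip_turtle_files` is hypothetical, and the sketch leaves the original .ttl files in place.

```python
import gzip
import shutil
from pathlib import Path


def gzip_turtle_files(directory):
    """Compress every *.ttl file under directory to *.ttl.gz and return the new paths."""
    written = []
    for ttl in sorted(Path(directory).rglob("*.ttl")):
        target = ttl.with_name(ttl.name + ".gz")
        # Stream the file in chunks so multi-gigabyte dumps are never held in memory.
        with ttl.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

Calling `gzip_turtle_files("20151104")` on the unpacked release directory would then write a .ttl.gz file next to each .ttl file.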
