datastore-specifications

Specifications for directory naming, file naming, and file contents in the LIS datastore

Any of the file-containing directories can contain a README file and a CHANGES file.

README YAML files

Every file-containing directory, AKA "collection", in the LIS datastore should contain a README file in YAML format.

Examples:

Validation

The basic README structure (acceptable field names, strings vs. lists vs. dates) can be validated using the following command:

ajv -s readme.schema.json -d README.[collection].yml --all-errors --coerce-types=array --remove-additional=all --changes

using the JSON schema definition readme.schema.json.

This schema must be kept up to date along with the sample template README.collection.yml when any changes are made to the README spec.
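To check many collections at once, the command above can be wrapped in a small script. The following is a minimal sketch, not part of the specification: the function names are hypothetical, the flags are copied verbatim from the command above, and it assumes that ajv is on the PATH and exits with a non-zero status when validation fails.

```python
from pathlib import Path
import subprocess

# Flags copied from the ajv command shown above; nothing else is assumed about ajv.
AJV_FLAGS = ["--all-errors", "--coerce-types=array", "--remove-additional=all", "--changes"]


def build_ajv_command(readme_path, schema_path="readme.schema.json"):
    """Return the ajv argument list for one README file."""
    return ["ajv", "-s", str(schema_path), "-d", str(readme_path), *AJV_FLAGS]


def find_readmes(root):
    """Return README.*.yml files found anywhere under root, in sorted order."""
    return sorted(Path(root).rglob("README.*.yml"))


def validate_tree(root, schema_path="readme.schema.json"):
    """Run ajv on each README under root and return the paths that failed.

    Assumption: ajv signals a validation failure through its exit status.
    """
    failed = []
    for readme in find_readmes(root):
        result = subprocess.run(build_ajv_command(readme, schema_path))
        if result.returncode != 0:
            failed.append(readme)
    return failed
```

Calling validate_tree(".") from the top of a datastore checkout would then return the READMEs that need attention.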

Content requirements

READMEs must be YAML-compliant, meaning they pass the check at http://www.yamllint.com/ or with the yamllint command-line utility. Here are some, but not all, of the requirements for a valid LIS README:

identifier at the top repeats the name of the collection, i.e. the name of the containing directory.
synopsis should be short: 100 characters or fewer.
genotype is a YAML array, but use a single "strain1 x strain2" value for bi-parental crosses.
publication_doi (and any other DOI) is a bare DOI, not a URL (e.g. 10.1534/g3.118.200521).
Dates are in the format YYYY-MM-DD, e.g. 2020-03-23.
Use spaces, not tabs (tabs may not appear anywhere in a YAML file).
Enclose values in quotes when they contain a colon or quotes (use single or double quotes, whichever does not clash with the quotes in the content).
Do not include empty keys; leave them out entirely. All keys must have values.
publication_doi is REQUIRED. If the data were generated by LIS, use the default LIS publication:

publication_doi: 10.1093/nar/gkv1159
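Several of the requirements above are mechanical enough to check in a script. The sketch below is illustrative only and does not replace schema validation or yamllint. It assumes the README has already been parsed into a dict by a YAML parser of your choice, that date keys end in "_date" and DOI keys end in "doi" (both naming conventions are assumptions), and the function name check_readme is hypothetical.

```python
import re
from datetime import date

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")   # bare DOI, no URL prefix
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # e.g. 2020-03-23


def check_readme(raw_text, fields, collection_dir):
    """Return a list of problems found; an empty list means none were found.

    raw_text: the README file contents as a string.
    fields: the same README parsed into a dict (the parser is not shown here).
    collection_dir: name of the directory that contains the README.
    """
    problems = []
    if "\t" in raw_text:
        problems.append("tab character found; use spaces")
    if fields.get("identifier") != collection_dir:
        problems.append("identifier does not match the collection directory name")
    if len(str(fields.get("synopsis", ""))) > 100:
        problems.append("synopsis is longer than 100 characters")
    if "genotype" in fields and not isinstance(fields["genotype"], list):
        problems.append("genotype is not a list")
    if "publication_doi" not in fields:
        problems.append("publication_doi is required")
    for key, value in fields.items():
        if value is None or value == "" or value == []:
            problems.append(f"{key} is empty; leave the key out instead")
        # Assumption: every DOI-valued key has a name ending in "doi".
        if key.endswith("doi") and not DOI_PATTERN.match(str(value)):
            problems.append(f"{key} is not a bare DOI")
        # Assumption: every date-valued key has a name ending in "_date".
        # A YAML parser may already have converted the value to a date object.
        if key.endswith("_date") and not isinstance(value, date):
            if not (isinstance(value, str) and DATE_PATTERN.match(value)):
                problems.append(f"{key} is not in YYYY-MM-DD format")
    return problems
```

For example, a README whose publication_doi is written as https://doi.org/10.1093/nar/gkv1159 would be reported as not being a bare DOI.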

Gotchas

READMEs may share content. For example, the README for a genome assembly (under /genomes/) often cites the same publication as the README for its annotations (under /annotations/). The publication fields in the two READMEs must match exactly. Otherwise, the mine loader will fail with an error like "Conflicting values for field Publication.title between Zh13.gnm2.LV9P (value "Update soybean Zhonghuang 13 genome to a golden reference. Sci China" in database with ID 99000176) and Zh13.gnm2.ann1.FJ3G.cds (value "Update soybean Zhonghuang 13 genome to a golden reference" being stored)."
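A mismatch of this kind can be caught before loading by comparing the publication fields of two parsed READMEs. This is a minimal sketch under stated assumptions: the READMEs are already parsed into dicts, publication keys share the prefix "publication_" (publication_title is an assumed key name; only publication_doi appears in this document), and two READMEs refer to the same publication when their publication_doi values are equal. The function name is hypothetical.

```python
def publication_conflicts(readme_a, readme_b, prefix="publication_"):
    """Return {key: (value_a, value_b)} for shared publication keys that differ.

    Assumption: READMEs with different publication_doi values cite different
    publications, so they are not compared at all.
    """
    if readme_a.get("publication_doi") != readme_b.get("publication_doi"):
        return {}
    conflicts = {}
    for key in sorted(readme_a.keys() & readme_b.keys()):
        if key.startswith(prefix) and readme_a[key] != readme_b[key]:
            conflicts[key] = (readme_a[key], readme_b[key])
    return conflicts


# Illustrative data: the DOI is a placeholder, the titles come from the error above.
genome = {
    "publication_doi": "10.0000/placeholder",
    "publication_title": "Update soybean Zhonghuang 13 genome to a golden reference. Sci China",
}
annotation = {
    "publication_doi": "10.0000/placeholder",
    "publication_title": "Update soybean Zhonghuang 13 genome to a golden reference",
}
print(publication_conflicts(genome, annotation))
```

An empty result means the shared publication fields agree; any key in the result needs to be made identical in both READMEs.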

MANIFEST files

A directory may contain a MANIFEST.collection.correspondence.yml file which lists the current filenames and prior filenames:

---
# filename in this repository: previous names
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: Gmax_275_v2.0.hardmasked.fa.gz
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: Gmax_275_v2.0.softmasked.fa.gz

... and also a MANIFEST.collection.descriptions.yml file which briefly describes the files:

---
# filename in this repository: description
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: "Genome assembly: masked with 'N's"
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: "Genome assembly: masked with lowercase"
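Because both MANIFEST files are keyed by filename, it is easy to check that they list the same files. The sketch below is a rough illustration, not a YAML parser: it assumes each entry is a single "filename: value" line, as in the examples above, and the function names are hypothetical.

```python
def parse_flat_manifest(text):
    """Parse one-line "filename: value" entries into a dict.

    Skips blank lines, comments, and the "---" document marker, and strips
    one pair of surrounding quotes from each value. Not a general YAML parser.
    """
    entries = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "---" or stripped.startswith("#"):
            continue
        # Split on the first ": " only, so colons inside the value are kept.
        key, sep, value = stripped.partition(": ")
        if not sep:
            raise ValueError(f"not a 'filename: value' line: {line!r}")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        entries[key] = value
    return entries


def manifest_mismatches(correspondence, descriptions):
    """Return filenames that appear in only one of the two manifests, sorted."""
    return sorted(correspondence.keys() ^ descriptions.keys())
```

An empty list from manifest_mismatches means the two files describe the same set of filenames. The same key set could also be compared against the actual directory listing.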

CHANGES files

A directory may contain a CHANGES.collection.txt file which lists file transformations and changes. For example:

file transformations:

seqlen.awk vigan.Gyeongwon.a3.v1.cds.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest

fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.cds.fa | sort > tmp.fa.1ln

join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5c.cds_primaryTranscript.fna

seqlen.awk vigan.Gyeongwon.a3.v1.peptide.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest

fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.peptide.fa | sort > tmp.fa.1ln

join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5f.protein_primaryTranscript.faa

changes: 

2018-03-03 Added MANIFEST files
2018-09-15 Changed fastas to include full prefixing (s/vigan/vigan.Gyeongwon.gnm3.ann1/)
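The file transformations recorded in this example appear to keep the longest transcript for each gene. For readers who do not have the helper scripts (seqlen.awk, top_line.awk, fasta_to_zero_lines.awk) at hand, the following is a rough sketch of that idea, not a reimplementation of the pipeline. It assumes record IDs have the form gene.N, where N is the transcript number, and that ties go to the transcript seen first; the behavior of the original scripts on ties is not documented here.

```python
def parse_fasta(text):
    """Return {record_id: sequence}; the ID is the first word of each header."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def longest_transcript_ids(records):
    """Pick one transcript ID per gene: the one with the longest sequence.

    Assumption: the gene name is everything before the last "." in the ID.
    Ties keep the transcript that appears first in the input.
    """
    best = {}
    for record_id, sequence in records.items():
        gene, sep, _ = record_id.rpartition(".")
        if not sep:
            gene = record_id  # no transcript suffix; treat the ID as the gene
        if gene not in best or len(sequence) > len(records[best[gene]]):
            best[gene] = record_id
    return sorted(best.values())


# Illustrative input, not real data.
example = ">g.A.1\nATG\n>g.A.2\nATGAAA\n>g.B.1\nATGC\n"
print(longest_transcript_ids(parse_fasta(example)))
```

Whatever tool is used, recording the exact commands in CHANGES, as in the example above, is what makes the derived files reproducible.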
