apple_data_aggregation_scripts

A collection of scripts used to aggregate apple datasets from isolated sources. It includes two thesauri, one for aggregating accession names and one for aggregating marker names. The scripts can be downloaded and used directly in R or RStudio. The user will need raw ID input files (id.Raw files).

The Germplasm Identification Thesaurus (GIT) is used to aggregate multiple historical datasets whose records may refer to the same individual under one or more synonyms. For example, data for the 'Cripps Pink' cultivar could also be linked to the brand name Pink Lady. The SNP locus Identification Thesaurus (SIT) links the aliases of a marker identifier, which facilitates the aggregation of data linked to the same position in the genome.
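The synonym-linking idea can be sketched as a simple lookup. This is a minimal illustration in Python, not the repository's R implementation: the `THESAURUS` table, the `preferred_name` helper, the choice of 'Cripps Pink' as the preferred name, and the file names are all hypothetical.

```python
# Hypothetical thesaurus: plain alias -> preferred name.
# The real GIT is maintained in the repository; these two rows only
# mirror the 'Cripps Pink' / Pink Lady example from the text.
THESAURUS = {
    "crippspink": "Cripps Pink",
    "pinklady": "Cripps Pink",
}


def preferred_name(alias_plain):
    """Return the preferred name for a plain alias, or None if it is unknown."""
    return THESAURUS.get(alias_plain)


# Made-up records from two isolated sources that use different names.
records = [
    {"id_plain": "crippspink", "source": "file_a.csv"},
    {"id_plain": "pinklady", "source": "file_b.csv"},
]

# Group the sources under the shared preferred name.
aggregated = {}
for rec in records:
    name = preferred_name(rec["id_plain"])
    aggregated.setdefault(name, []).append(rec["source"])

print(aggregated)
```

Both records end up under the same key, which is the effect the thesaurus is meant to have when datasets are merged.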

The first id.Raw file can be used as input for the Germplasm Identification Thesaurus (GIT). These id.Raw files are flat files (.csv) with the following fields:

id.Original - the original germplasm name or identifier of the received file
id.Plain - the original germplasm name or identifier (id.Original) of the received file without special characters/spaces, using only lower case letters and numbers of the English alphabet
id.Raw - in many cases, the same as the id.Plain, but if ambiguous IDs are found, then this identifier is appended with extra passport information of the sample
id.Syn (if needed) - Extra passport information. Any available alias of the id.Original
syn.Plain (if needed) - Extra passport information. Similar to the id.Plain in format, but generated from the id.Syn
contrib - The contributor of the file
file_name - The name of the file
confidentiality - Whether the file's content is private or public information

id.Raw files should be generated for all field measurement (phenotype) and marker (genotype) data.
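The id.Plain rule described above (lower case, no spaces or special characters, only English letters and numbers) could be applied as in the following Python sketch. The `to_plain` helper is hypothetical, and it assumes that characters outside a-z and 0-9 are simply dropped rather than transliterated; the repository's R scripts may handle such characters differently.

```python
import re


def to_plain(identifier):
    """Lower-case an identifier and drop everything except a-z and 0-9.

    Assumption: accented letters and symbols are removed, not converted.
    """
    return re.sub(r"[^a-z0-9]", "", identifier.lower())


# Example identifiers; the second includes a symbol that gets stripped.
for original in ["Cripps Pink", "Pink Lady®", "M.9 T337"]:
    print(original, "->", to_plain(original))
```

The same function could be used to derive syn.Plain from id.Syn, since the text describes both fields as sharing one format.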

The second type of input file, the sid.Raw file (for SNP), can be used as input for the SNP locus Identification Thesaurus (SIT). These are also flat files (.csv) with the following fields:

sid.Raw - The original raw marker identifier linking the data to a specific position of the genome
sid.Syn - Extra passport information; other names or identifiers that the sid.Raw is known by
sid.Plain - plain text identifier of the sid.Raw or sid.Syn, using only lower case letters and numbers of the English alphabet and no special characters or spaces.
Chromosome - Chromosome number of the organism
Position - physical position (bp) on the chromosome
Array - The array on which the SNP was originally located
contrib - The contributor of the file
file_name - The name of the file
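Before passing a sid.Raw file to the SIT, it may help to check that all listed fields are present and to see which raw identifiers share a plain identifier. The Python sketch below is only an illustration: `missing_fields`, `group_by_plain`, and the sample rows are made up, and grouping on sid.Plain is an assumption about how aliases are linked, not a description of the repository's R scripts.

```python
import csv
import io
from collections import defaultdict

# Field names as listed above.
REQUIRED_FIELDS = [
    "sid.Raw", "sid.Syn", "sid.Plain", "Chromosome",
    "Position", "Array", "contrib", "file_name",
]


def missing_fields(header):
    """Return the required fields that are absent from a CSV header."""
    return [field for field in REQUIRED_FIELDS if field not in header]


def group_by_plain(rows):
    """Map each sid.Plain value to the sorted raw identifiers that share it."""
    groups = defaultdict(set)
    for row in rows:
        groups[row["sid.Plain"]].add(row["sid.Raw"])
    return {plain: sorted(raw_ids) for plain, raw_ids in groups.items()}


# Made-up sample content; every value here is a placeholder.
SAMPLE = """sid.Raw,sid.Syn,sid.Plain,Chromosome,Position,Array,contrib,file_name
Marker_001,,marker001,1,1000,array_x,lab_a,file_a.csv
marker-001,,marker001,1,1000,array_y,lab_b,file_b.csv
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
problems = missing_fields(reader.fieldnames)
groups = group_by_plain(reader)

print(problems)
print(groups)
```

An empty `problems` list means the header is complete, and each entry in `groups` lists the raw marker names that would be treated as aliases of one locus under this assumption.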
