Genotypes API: genotype_call table management #1
Can each group please post a detailed comment on this issue using the following template:
Group/Tool information
Group: University of Saskatchewan, Pulse Bioinformatics Materialized Views
Materialized View Name: mview_ndg_[genus]_variants
Columns (include name, data type, brief description):
Indices (include columns covered):
Query / Approach to Populate: Data is broken into 50-million-record chunks and copied into a file using the following query. A Linux command is then used on the file to collapse it to unique variants; this is done because it is faster than ordering and grouping in PostgreSQL. The collapsed file is then copied into the mview table. Before each sync the mview is truncated and its indices dropped; after the sync the indices are re-created.

```sql
SELECT
  call.variant_id,
  v.name as variant_name,
  (SELECT value FROM chado.featureprop WHERE feature_id=v.feature_id AND type_id=:variantprop_type) as variant_type,
  loc.srcfeature_id as srcfeature_id,
  b.name as srcfeature_name,
  loc.fmin,
  loc.fmax,
  json_build_object('strand', loc.strand, 'featureloc_id', loc.featureloc_id) as meta_data
FROM {$ndg_calls_name} call
LEFT JOIN {feature} v ON v.feature_id=call.variant_id
LEFT JOIN {featureloc} loc ON loc.feature_id=call.variant_id
LEFT JOIN {feature} b ON b.feature_id=loc.srcfeature_id
WHERE call.variant_id BETWEEN :min AND :max
```

Materialized View Name: mview_ndg_[genus]_calls
Columns (include name, data type, brief description):
Indices (include columns covered):
Query / Approach to Populate: We chunk the dataset into hopefully reasonable pieces for updating the materialized view. However, we need to avoid sorting first since that has a large performance hit. Our solution is to break the dataset up by the allele/genotype for each call. While this makes no sense from a biological standpoint, it is at least something we can assume is present (since calls are genotypes ;-) ). This approach will suffer if there are a large number of alleles (i.e. MNPs or indels), especially if the overall dataset is small. That said, it is well suited to SNP data, which is the bulk of our data currently. We use a similar approach of copying to a file and then back in for performance reasons; there is no processing of the resulting file in this case. We also truncate and drop indices to start, then re-create the indices afterwards.

```sql
SELECT
  gc.variant_id,
  gc.marker_id,
  m.name as marker_name,
  mt.value as marker_type,
  gc.stock_id,
  s.name as stock_name,
  g.stock_id as germplasm_id,
  g.name as germplasm_name,
  gc.project_id,
  gc.genotype_id,
  a.description as allele_call,
  gc.meta_data
FROM {genotype_call} gc
LEFT JOIN {feature} m ON m.feature_id=gc.marker_id
LEFT JOIN {featureprop} mt ON mt.feature_id=m.feature_id AND mt.type_id=:markerprop_type
LEFT JOIN {stock} s ON s.stock_id=gc.stock_id
LEFT JOIN {organism} o ON o.organism_id=s.organism_id
LEFT JOIN {stock}_relationship sr ON sr.".$sample_pos."=s.stock_id AND sr.type_id=:stockrel_type
LEFT JOIN {stock} g ON g.stock_id=sr.".$germ_pos."
LEFT JOIN {genotype} a ON a.genotype_id=gc.genotype_id
WHERE gc.genotype_id=:id AND LOWER(o.genus)=LOWER(:partition)
```

Partitioning
We currently partition by genus but are considering restricting it further to species and/or genome assembly :thinking:. We're not sure on this front :-)

Optimizations
Scripts / Features / General discussion
We already have this well handled by our module nd_genotypes but are happy to expand the implementation during the Tripal 4 upgrade to help other groups as well. We're also interested to see what approaches other groups have taken to optimize retrieval.
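The chunked populate-by-range scheme and genus partitioning described above can be sketched roughly as follows. This is a minimal illustration, not the nd_genotypes implementation: the chunk size comes from the comment above, but the function names, the example genus, and the maximum id are hypothetical.

```python
# Sketch (assumptions labeled): variant ids are processed in 50-million-record
# windows, and materialized view names are partitioned by genus.

CHUNK_SIZE = 50_000_000  # chunk size stated in the comment above

def chunk_ranges(max_variant_id, chunk_size=CHUNK_SIZE):
    """Yield inclusive (min, max) id windows covering 1..max_variant_id."""
    start = 1
    while start <= max_variant_id:
        yield start, min(start + chunk_size - 1, max_variant_id)
        start += chunk_size

def mview_name(genus, kind):
    """Build a genus-partitioned mview name, e.g. mview_ndg_lens_calls (example genus)."""
    return f"mview_ndg_{genus.lower()}_{kind}"

# Each (min, max) pair would parameterize the BETWEEN clause of the variants
# query; each chunk is COPYed out to a file, collapsed to unique rows with a
# Linux command (faster than ORDER BY/GROUP BY in PostgreSQL for this data),
# and COPYed back into the truncated mview before indices are re-created.
for lo, hi in chunk_ranges(120_000_000):  # hypothetical max variant_id
    pass  # e.g. run SELECT ... WHERE call.variant_id BETWEEN lo AND hi
```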
There was a discussion in Slack that a general API that could handle management of materialized views related to the genotype_call table would be useful.
In the discussion I suggested that an API to manage the following could reduce duplicated effort and allow us all to benefit from optimizations for large data sets.
In order to support the very different needs of the tools using the genotype_call table, this API would provide a means for tools to describe the materialized view name, columns, composition, queries, indices, optimization approaches, etc., to be used by the API. It is understood that a single best-practice materialized view for this type of data is not possible: with large datasets it is important to cater to the specific composition of the data and the needs of the tool in order to be performant.
This API is NOT trying to force us all to use the same materialized views or even the same optimization approaches. Rather, it is trying to provide all tools with a set of optimization approaches from which each tool can select what supports it optimally.
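To make the idea concrete, one way a tool might describe its materialized view to such an API is with a simple declarative structure. The shape below is purely illustrative: the field names, column subset, and helper function are assumptions, not a defined interface.

```python
# Purely illustrative descriptor a tool might hand to the proposed
# genotype_call mview API. All field names here are hypothetical.

calls_mview_definition = {
    "name": "mview_ndg_[genus]_calls",      # [genus] substituted per partition
    "partition_by": "genus",
    "columns": [                            # (name, type, description) - subset only
        ("variant_id",  "bigint", "FK to feature for the variant"),
        ("marker_id",   "bigint", "FK to feature for the marker"),
        ("stock_id",    "bigint", "FK to stock for the sample"),
        ("allele_call", "text",   "genotype.description for the call"),
        ("meta_data",   "jsonb",  "per-call metadata"),
    ],
    "indices": [
        ["variant_id"],
        ["stock_id", "project_id"],
    ],
    "populate": {
        "strategy": "copy-out-copy-in",  # COPY to file, then COPY back in
        "chunk_by": "genotype_id",       # chunk on the allele/genotype per call
        "drop_indices_first": True,      # truncate + drop, re-create after sync
    },
}

def index_ddl(definition, genus="lens"):
    """Generate CREATE INDEX statements from a definition (illustrative only)."""
    table = definition["name"].replace("[genus]", genus)
    for cols in definition["indices"]:
        yield f"CREATE INDEX ON {table} ({', '.join(cols)});"
```

The API would consume such a description to create, index, and sync each tool's partitioned views, while the tool retains full control over columns, queries, and optimization strategy.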