Sub-path support #29

glennhickey · 2019-08-20T13:33:49Z

Subpath information is useful when working with subgraphs. It will also help with rGFA compatibility (SN/SO fields).

The following ideas emerged from the last meeting:

A subpath is a path.
It contains information linking it to the name and positional offset of its source path
It represents a contiguous range in its source path
All paths in a graph (regardless of whether or not they are subpaths) must have unique non-empty names. There's no naming structure enforced beyond this, though implementations will probably adopt conventions for subpath naming under the hood.

The subpath interface would be added to PathHandleGraph. It could look something like

/// is a path a subpath
virtual bool is_subpath(path_handle_t path_handle) const = 0;

/// get the name of the source path
virtual string get_source_path_name(path_handle_t path_handle) const = 0;

/// get the offset in the source path
virtual off_t get_source_path_offset(path_handle_t path_handle) const = 0;

I'm not sure if, implementation-wise, it's better to combine the above? In theory, they could be a single method like

/// get subpath information, not a subpath if returned name is empty
virtual pair<string, off_t> get_source_path_position(path_handle_t path_handle) const = 0;

Similar story for querying. We can provide default implementations of these using the above methods:

/// get subpaths for a path
vector<path_handle_t> get_subpaths(const string& source_path_name) const;

/// get_subpaths for a range in a path (start/end 0-based inclusive).  if last flag is true, only paths
/// fully contained in the range are returned
vector<path_handle_t> get_subpaths_for_range(const string& source_path_name, off_t start, off_t end, bool fully_contained = false)

I think implementing this should be pretty straightforward, but making sure that existing algorithms that modify paths use it properly may take some doing.

@ekg @jeizenga @adamnovak thoughts?

The text was updated successfully, but these errors were encountered:

glennhickey · 2019-08-20T14:01:41Z

There's probably a method (that could have a default implementation) or flag needed in the above to walk up all the way to the "root" source path and/or compute the equivalent to rGFA's SR/Rank in the case of multiple nesting layers.

adamnovak · 2019-08-20T18:50:26Z

In this formulation, I think we definitely need a set of functions which operate on either the path itself if not a subpath, or the parent path if there is one. When we get the path name and offset for output (or read them in as input), we want to automatically articulate things in terms of the base path, and interpret them in terms of the subpaths we actually happen to have available. Something like get_root_path_name(path_handle_t) and get_offset_in_root_path(occurrence_handle_t). We also would want some more accessors for working with root paths. We want to be able to iterate over the subpaths on a root path even when we don't have any idea of their coordinates along the path, for example. We also want to be able to visit every occurrence along a root path in order, regardless of the subpath that each comes from, with some notion of when we cross a discontinuity. And we should be able to iterate over all of the root paths, so we don't waste our time looping over all paths and filtering.

Given that, I'm not sure that we don't want a handle instead of just a string to identify root paths that have subpaths but are not themselves present in the graph.

I think the Right Way to structure things might be to make all actual paths into subpaths, and to separate the root path namespace from the subpath namespace. You can no longer have subpaths of subpaths; they just become possibly overlapping subpaths of the same ultimate root path. You can have a subpath called "chr1" with an offset of 0 that runs the length of chr1, but it is still a subpath and not the actual "chr1" root path, and the only thing that makes it better than some other subpath of the actual "chr1" root path is that it is probably longer. Subpaths would then be very firmly views and not actual paths, so updating one would update all other overlapping subpaths. A design like this would make it harder to accidentally ignore the existence of subpaths.

glennhickey · 2019-08-21T13:44:12Z

Good points. I like get_root_path_name(path_handle_t) and get_offset_in_root_path(occurrence_handle_t).

Personally, I'm mostly interested in getting global coordinates out of subgraphs, for which the above two methods are sufficient. In addition to vg chunk, this allows for things like extracting an MHC graph from a genome graph, and focusing on it transparently using real chr6 coordinates. I think these methods would also be sufficient for a fairly simple 2-pass subgraph merging algorithm.

The interface where we can iterate and query root paths transparently via a collection of subgraphs sounds cool, but seems like a far taller order implementation wise. What would the use case be for it?

glennhickey · 2019-08-21T13:49:20Z

I guess on my MHC example, you may want to go from a root path position to a step in the subgraph which would indeed require some kind of root path handle to do consistently.

glennhickey · 2019-08-21T14:03:49Z

Can we do things The Right Way while being backward compatible with the existing interface? Perhaps if we keep everything as is and assume the current path interface gives handles to root paths. Then behind the scenes the implementation can manage subpaths if they exist. An additional subpath interface could be used to map root-path steps to their actual subpaths and detect discontinuities during iteration, examine the subpaths themselves etc.

adamnovak · 2019-08-21T17:55:12Z

The interface where we can iterate and query root paths transparently via a collection of subgraphs sounds cool, but seems like a far taller order implementation wise. What would the use case be for it?

I'm not quite sure what you mean here; I think you meant to say "subpaths" and not "subgraphs"? The use case for being able to query the root path name and coordinates from a handle to a subpath is for VCF output when calling on a subgraph. The use case for being able to see the root path through transparent subpath views is any time we want to modify a path, because we want the modification to affect all the subpaths to maintain consistency.

Can we do things The Right Way while being backward compatible with the existing interface?

I don't think we can have the existing interface work for root paths, because the existing iteration interface has no way to signal a discontinuity. We don't want to feed a graph with only parts of a root path into some existing algorithm and have it ignore the discontinuities between the actually-provided subpath parts of the path, without intending to. I think we need a new interface for iterating along the represented parts of a root path, or through the subpaths of a root path, that doesn't let the iteratee ignore the discontinuities accidentally.

jeizenga · 2019-08-21T18:00:38Z

One comment about the suggestions above: if we're going to do efficient range queries we'll also need to build a BBT index over subpath segments. I don't think that's horrible, tbh, but it is a design consideration.

I think we also need to extend the MutablePathHandleGraph interface to have some methods for creating subpaths. The implementation here actually seems somewhat more challenging to me. Off the top of my head, I see two options here: one offset-based, one step-based.

path_handle_t create_subpath(const string& parent_name, const size_t& offset, const size_t& length);
path_handle_t create_subpath(const PathHandleGraph& from, const step_handle_t& begin, const step_handle_t& end);

In the first option, there doesn't seem to be any simple mechanism for ensuring consistency between the subpath and the parent path. For example, what if the offset does not fall at a node boundary? Where do the steps come from, and how would we ensure that they match the parent?

In the second option, the only way to calculate the offset is by traversing the entire prefix of the path. That's not difficult to implement, but it's not very efficient.

Any suggestions for improving the mutable interface?

ekg · 2019-08-21T22:04:44Z

What's a BBT index? I think that we should add a source path position and source path name accessor for paths. We should add functions get the position in a given path relative to its source path. I don't think there is any need to make a new concept (subpath) that's different than a regular path. This just lets us say who the parent of any given path is. In serialization, we'd just add some flags to the path line to hold onto these values. If we allow for multiple source paths and positions then this scheme can be used to compress multiple paths into one. I'm not sure how useful this is though.

…

On Wed, Aug 21, 2019 at 8:00 PM Jordan Eizenga ***@***.***> wrote: One comment about the suggestions above: if we're going to do efficient range queries we'll also need to build a BBT index over subpath segments. I don't think that's horrible, tbh, but it is a design consideration. I think we also need to extend the MutablePathHandleGraph interface to have some methods for creating subpaths. The implementation here actually seems somewhat more challenging to me. Off the top of my head, I see two options here: one offset-based, one step-based. path_handle_t create_subpath(const string& parent_name, const size_t& offset, const size_t& length); path_handle_t create_subpath(const PathHandleGraph& from, const step_handle_t& begin, const step_handle_t& end); In the first option, there doesn't seem to be any simple mechanism for ensuring consistency between the subpath and the parent path. For example, what if the offset does not fall at a node boundary? Where do the steps come from, and how would we ensure that they match the parent? In the second option, the only way to calculate the offset is by traversing the entire prefix of the path. That's not difficult to implement, but it's not very efficient. Any suggestions for improving the mutable interface? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#29?email_source=notifications&email_token=AABDQENJBJPJY6RCKEP5MF3QFV7ENA5CNFSM4INWVMVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD42TGXI#issuecomment-523580253>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AABDQEJVDRJFCU26WTSCSQLQFV7ENANCNFSM4INWVMVA> .

jeizenga · 2019-08-21T22:26:09Z

BBT = balanced binary tree

This was referenced Oct 11, 2019

vg view reports that a subgraph by vg chunk is invalid vgteam/vg#2504

Open

Append offset to subpaths output by vg chunk -p vgteam/vg#2506

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sub-path support #29

Sub-path support #29

glennhickey commented Aug 20, 2019

glennhickey commented Aug 20, 2019

adamnovak commented Aug 20, 2019

glennhickey commented Aug 21, 2019

glennhickey commented Aug 21, 2019

glennhickey commented Aug 21, 2019 •

edited

Loading

adamnovak commented Aug 21, 2019

jeizenga commented Aug 21, 2019

ekg commented Aug 21, 2019 via email

jeizenga commented Aug 21, 2019

Sub-path support #29

Sub-path support #29

Comments

glennhickey commented Aug 20, 2019

glennhickey commented Aug 20, 2019

adamnovak commented Aug 20, 2019

glennhickey commented Aug 21, 2019

glennhickey commented Aug 21, 2019

glennhickey commented Aug 21, 2019 • edited Loading

adamnovak commented Aug 21, 2019

jeizenga commented Aug 21, 2019

ekg commented Aug 21, 2019 via email

jeizenga commented Aug 21, 2019

glennhickey commented Aug 21, 2019 •

edited

Loading