While we (@cmungall @caufieldjh @hrshdhgd) were working on oakx-grape, we observed that it would be useful, in this and other use cases, to be able to handle edge lists that reference nodes not present in the nodes list.
e.g.
nodes.tsv:
id category
foo biolink:Gene
bar biolink:Protein
edges.tsv:
subject predicate object
foo biolink:interacts_with bar
foo biolink:interacts_with baz
Do either of these two possible behaviors seem reasonable/doable?
1) add an argument to from_csv to ignore edges that reference nodes that are not in the nodes list (ignore_edges_with_unknown_nodes=False or some such?) - here we'd ignore the foo biolink:interacts_with baz edge
2) add an argument to from_csv to instantiate nodes with the default node type when they are referenced in the edge file but not in the node file (autocreate_nodes_from_edge_list=False or some such?) - here we'd create a node baz with default_node_type
We could add support for this, but such corner cases are intentionally unsupported for the following reasons:
These are malformed input files. The set of nodes should contain all nodes.
Loading of the graph can no longer be parallel when such corner cases are present, making it much slower.
Many assumptions in creating the graph data structure are no longer valid when edges can get thrown out because of malformations.
From our perspective, since we want the best and fastest experience loading graph objects from CSVs, the graph files should be fixed before loading, not as they are being loaded.
What are the reasons for having incomplete node lists?
Possibly a solution here would be to build a helper tool as part of Grape (or OAK or KG-Hub) that reads in node and edge files and does 1) and 2) above - either rejects edges or adds missing nodes to the nodes file, respectively. Maybe we could help write this if it's of interest?
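To make the idea concrete, here is a rough sketch of what such a helper could look like, using only the Python standard library. It is not part of Grape, OAK or KG-Hub: the function names (`find_missing_nodes`, `drop_dangling_edges`, `add_missing_nodes`) are made up for illustration, the column names follow the example files above, and `biolink:NamedThing` as the fallback category is an assumption.

```python
import csv
import io

# Example data from this issue, tab-separated.
NODES_TSV = "id\tcategory\nfoo\tbiolink:Gene\nbar\tbiolink:Protein\n"
EDGES_TSV = (
    "subject\tpredicate\tobject\n"
    "foo\tbiolink:interacts_with\tbar\n"
    "foo\tbiolink:interacts_with\tbaz\n"
)


def read_tsv(handle):
    """Read a tab-separated file-like object into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def find_missing_nodes(nodes, edges):
    """Return IDs used as subject/object in edges but absent from nodes."""
    known = {row["id"] for row in nodes}
    missing = []
    for edge in edges:
        for column in ("subject", "object"):
            node_id = edge[column]
            if node_id not in known and node_id not in missing:
                missing.append(node_id)
    return missing


def drop_dangling_edges(nodes, edges):
    """Option 1: keep only edges whose subject and object are both known."""
    known = {row["id"] for row in nodes}
    return [e for e in edges if e["subject"] in known and e["object"] in known]


def add_missing_nodes(nodes, edges, default_category="biolink:NamedThing"):
    """Option 2: append a node row for every ID seen only in the edge list.

    The default category is an assumption, not something Grape prescribes.
    """
    extra = [
        {"id": node_id, "category": default_category}
        for node_id in find_missing_nodes(nodes, edges)
    ]
    return nodes + extra


nodes = read_tsv(io.StringIO(NODES_TSV))
edges = read_tsv(io.StringIO(EDGES_TSV))

missing = find_missing_nodes(nodes, edges)
kept_edges = drop_dangling_edges(nodes, edges)
completed_nodes = add_missing_nodes(nodes, edges)
```

With the example files, `missing` contains only `baz`, `kept_edges` keeps just the `foo`/`bar` edge, and `completed_nodes` gains a `baz` row. Either result could then be written back out with `csv.DictWriter` so the files are well-formed before they are loaded.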
What are the reasons for having incomplete node lists?
In the oakx-grape use case this happens, I believe, because "dangling edges" (edges that reference nodes that are not explicitly mentioned as entities/nodes) are permitted in the OWL specification - @cmungall I think can elaborate
In the KG-Hub-like use cases, this will likely happen occasionally when, during an ingest, the developer forgets to write out node information while ingesting and processing edges. Essentially a data bug. This isn't currently a problem when reading KG-Hub graphs into Grape, because KGX rejects edges like this during the merge step in our ETL pipeline (which is possibly not ideal, since it silently omits information from the final graph, but that's a separate discussion)