
JGF is missing vertices in rv1 match-format writer #1310

Open
jameshcorbett opened this issue Oct 15, 2024 · 6 comments
@jameshcorbett

On rzadams, which was just today configured to use the rv1 match format:

$ flux alloc -N2
flux-job: fqxGU4MP3XV started                                                                 00:00:17
Oct 14 19:16:24.615448 PDT sched-fluxion-resource.err[0]: grow_resource_db_jgf: db.load: unpack_edge: source and/or target vertex not found1654 -> 2196.
Oct 14 19:16:24.615462 PDT sched-fluxion-resource.err[0]: : Invalid argument
Oct 14 19:16:24.615469 PDT sched-fluxion-resource.err[0]: update_resource_db: grow_resource_db: Invalid argument
Oct 14 19:16:24.615473 PDT sched-fluxion-resource.err[0]: update_resource: update_resource_db: Invalid argument
Oct 14 19:16:24.616098 PDT sched-fluxion-resource.err[0]: populate_resource_db_acquire: update_resource: Invalid argument
Oct 14 19:16:24.616106 PDT sched-fluxion-resource.err[0]: populate_resource_db: loading resources using resource.acquire
Oct 14 19:16:24.616108 PDT sched-fluxion-resource.err[0]: init_resource_graph: can't populate graph resource database
Oct 14 19:16:24.616109 PDT sched-fluxion-resource.err[0]: mod_main: can't initialize resource graph database
Oct 14 19:16:24.616397 PDT sched-fluxion-resource.crit[0]: module exiting abnormally
Oct 14 19:16:24.842895 PDT sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Function not implemented
Oct 14 19:16:24.842907 PDT sched-fluxion-qmanager.err[0]: handshake_resource: update_on_resource_response: Function not implemented
Oct 14 19:16:24.842909 PDT sched-fluxion-qmanager.err[0]: handshake: handshake_resource: Function not implemented
Oct 14 19:16:24.842912 PDT sched-fluxion-qmanager.err[0]: mod_start: handshake: Function not implemented
Oct 14 19:16:24.842934 PDT sched-fluxion-qmanager.crit[0]: module exiting abnormally

I confirmed that vertex 1654 is not in the JGF produced for the scheduler, although 2196 is. 1654 is a rack vertex, 2196 is a node vertex.
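For reference, dangling edges like this are easy to spot mechanically. Here is a quick sketch (assuming the usual JGF layout of a top-level `graph` object containing `nodes` and `edges` arrays, which is what Fluxion reads and writes; `find_dangling_edges` is just an illustrative helper name):

```python
import json

def find_dangling_edges(path):
    """Return (source, target) pairs for edges whose source or
    target id has no matching vertex in the JGF nodes list."""
    with open(path) as f:
        graph = json.load(f)["graph"]
    vertex_ids = {node["id"] for node in graph["nodes"]}
    return [
        (edge["source"], edge["target"])
        for edge in graph["edges"]
        if edge["source"] not in vertex_ids
        or edge["target"] not in vertex_ids
    ]
```

Running this against the JGF handed to the child instance should report the `1654 -> 2196` edge if the rack vertex really is missing.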

@jameshcorbett

The system instance's resource graph has cluster -> rack -> node. The JGF it writes out for child instances does not include rack vertices, but it still writes out the edges from cluster to rack and from rack to node. My current hypothesis is that the writer is coded to include the root of the graph but then skip any intermediate vertices on its way down to node vertices. Hopefully it will be a simple fix?

@jameshcorbett

Strangely, hetchy does not have this problem; it writes out the rack vertex. Something is off, and since this is the same cluster as #1305 I wonder if the JGF is wrong somehow.

@trws commented Oct 16, 2024

Could you pull an example JSON object from each of these? I'm looking at the RV1 code, and it doesn't have anything that would trim vertices. It's possible something in the match code is doing it, but something is clearly fishy here.

@jameshcorbett

Some nodes on the cluster hit the issue, some don't. Here is the JGF for the overall system, along with the JGF for one node that hit the error and one that didn't.
bad_jgf_cluster.json
cluster_R.json
good_jgf_cluster.json

@jameshcorbett

I didn't see any obvious errors in the system JGF but I may well have missed something.

@trws commented Oct 17, 2024

Looking at this yesterday, it occurred to me that there's something here we usually don't see in our graphs: the cluster-level graph has nodes under a rack, plus exactly one node directly under the cluster vertex. There's no reason that should cause a problem, but I'm not sure it's tested.
