
Jobspecs with flexible resource types #1259

Open · zekemorton opened this issue Jul 29, 2024 · 2 comments

@zekemorton (Collaborator)

This feature would allow jobspecs to specify multiple possible resource configurations, letting the scheduler select a configuration based on what is available or what is best.

I've prototyped this feature in the traverser by adding a new slot type, an or_slot. The intent is that the jobspec lists multiple possible slot configurations and the scheduler can select any combination of them based on what's available.

Implementing a prototype of this in the traverser consisted of adding handling for or_slot, much like how slot is handled in dfu_impl_t::match and dfu_impl_t::test, except that multiple or_slot resource types may be specified. I also added a new function, dfu_impl_t::dom_or_slot, that behaves similarly to dfu_impl_t::dom_slot. A few key differences: it traverses the rest of the resource graph and jobspec on the union of all resources specified under all of the or_slots, to ensure that we get proper counts; it then determines the best configuration of the or_slot options and creates the edge groups for those options. You can find the branch of this prototype here: https://github.com/zekemorton/flux-sched/tree/resource-or. The latest commit shows an example where the optimal configuration is selected using dynamic programming, while earlier commits show how it was implemented with greedy selection.
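To make the selection step concrete, here is a minimal, self-contained sketch of choosing among or_slot options. This is not the traverser code: the option shapes, the pool, and the objective are assumptions for illustration (the shapes mirror the example jobspec below, the pool mirrors the 72 cores and 4 GPUs visible in the tiny.graphml output below, and exhaustive search stands in for the dynamic program):

```python
# Toy model of or_slot selection -- NOT flux-sched code. Option shapes,
# pool contents, and the objective are illustrative assumptions.
from itertools import product

options = [
    {"core": 12},            # or_slot option A: 12 cores
    {"core": 6, "gpu": 1},   # or_slot option B: 6 cores + 1 GPU
]
pool = {"core": 72, "gpu": 4}  # resources assumed available
nslots = 8                     # slot count requested by the jobspec

def feasible(counts):
    """counts[i] = number of slots using options[i]; check pool limits."""
    need = {}
    for n, opt in zip(counts, options):
        for rtype, c in opt.items():
            need[rtype] = need.get(rtype, 0) + n * c
    return all(v <= pool.get(r, 0) for r, v in need.items())

def greedy():
    """Fill slots one at a time with the first option that still fits;
    this can strand slots even when a feasible mix exists."""
    counts = [0] * len(options)
    for _ in range(nslots):
        for i in range(len(options)):
            counts[i] += 1
            if feasible(counts):
                break
            counts[i] -= 1
    return counts

def exhaustive():
    """Score every feasible split (a stand-in for the prototype's
    dynamic program); toy objective: use as many cores as possible."""
    splits = (c for c in product(range(nslots + 1), repeat=len(options))
              if sum(c) == nslots and feasible(c))
    return max(splits,
               key=lambda c: sum(n * o.get("core", 0)
                                 for n, o in zip(c, options)),
               default=None)

print("greedy pick:    ", greedy())      # [6, 0] -- only 6 of 8 slots fit
print("exhaustive pick:", exhaustive())  # (4, 4) -- all 8 slots fit
```

In this toy setup, greedy selection strands two slots while the exhaustive pass finds the mix that satisfies all eight, which is the kind of case that motivates replacing greedy selection with the dynamic program.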

This leaves me with some open questions that I would like to discuss with the larger group:

- Is the traverser an appropriate place to implement this kind of functionality? What are some other options?
- What other kinds of logical operators would be useful for this kind of flexibility?
- By what different means could we select the desired configuration? Would this be a whole new set of policies?

@zekemorton (Collaborator, Author)

An example jobspec with or_slots:

```yaml
resources:
  - type: or_slot
    count: 8
    label: default
    with:
      - type: core
        count: 12
  - type: or_slot
    count: 8
    label: default
    with:
      - type: core
        count: 6
      - type: gpu
        count: 1
```
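As a side note, one way to read this jobspec (my interpretation, not an official parser): or_slot entries that share a label are alternative shapes for the same logical slot. A small PyYAML sketch of that grouping:

```python
# Group or_slot alternatives by label -- one reading of the example
# jobspec above, not an official flux-sched parser.
import yaml
from collections import defaultdict

jobspec = yaml.safe_load("""
resources:
  - type: or_slot
    count: 8
    label: default
    with:
      - type: core
        count: 12
  - type: or_slot
    count: 8
    label: default
    with:
      - type: core
        count: 6
      - type: gpu
        count: 1
""")

alternatives = defaultdict(list)
for entry in jobspec["resources"]:
    if entry["type"] == "or_slot":
        alternatives[entry["label"]].append(
            {r["type"]: r["count"] for r in entry["with"]})

for label, shapes in alternatives.items():
    print(f"slot {label!r}: each instance may take any shape in {shapes}")
```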

When running a match allocate on this jobspec against tiny.graphml, we get:

```
m allocate t/data/resource/jobspecs/basics/test004.yaml
      ---------------core0[1:x]
      ---------------core1[1:x]
      ---------------core2[1:x]
      ---------------core3[1:x]
      ---------------core4[1:x]
      ---------------core5[1:x]
      ---------------core6[1:x]
      ---------------core7[1:x]
      ---------------core8[1:x]
      ---------------core9[1:x]
      ---------------core10[1:x]
      ---------------core11[1:x]
      ---------------core12[1:x]
      ---------------core13[1:x]
      ---------------core14[1:x]
      ---------------core15[1:x]
      ---------------core16[1:x]
      ---------------core17[1:x]
      ---------------gpu0[1:x]
      ------------socket0[1:s]
      ---------------core18[1:x]
      ---------------core19[1:x]
      ---------------core20[1:x]
      ---------------core21[1:x]
      ---------------core22[1:x]
      ---------------core23[1:x]
      ---------------core24[1:x]
      ---------------core25[1:x]
      ---------------core26[1:x]
      ---------------core27[1:x]
      ---------------core28[1:x]
      ---------------core29[1:x]
      ---------------core30[1:x]
      ---------------core31[1:x]
      ---------------core32[1:x]
      ---------------core33[1:x]
      ---------------core34[1:x]
      ---------------core35[1:x]
      ---------------gpu1[1:x]
      ------------socket1[1:s]
      ---------node0[1:s]
      ---------------core0[1:x]
      ---------------core1[1:x]
      ---------------core2[1:x]
      ---------------core3[1:x]
      ---------------core4[1:x]
      ---------------core5[1:x]
      ---------------core6[1:x]
      ---------------core7[1:x]
      ---------------core8[1:x]
      ---------------core9[1:x]
      ---------------core10[1:x]
      ---------------core11[1:x]
      ---------------core12[1:x]
      ---------------core13[1:x]
      ---------------core14[1:x]
      ---------------core15[1:x]
      ---------------core16[1:x]
      ---------------core17[1:x]
      ---------------gpu0[1:x]
      ------------socket0[1:s]
      ---------------core18[1:x]
      ---------------core19[1:x]
      ---------------core20[1:x]
      ---------------core21[1:x]
      ---------------core22[1:x]
      ---------------core23[1:x]
      ---------------core24[1:x]
      ---------------core25[1:x]
      ---------------core26[1:x]
      ---------------core27[1:x]
      ---------------core28[1:x]
      ---------------core29[1:x]
      ---------------core30[1:x]
      ---------------core31[1:x]
      ---------------core32[1:x]
      ---------------core33[1:x]
      ---------------core34[1:x]
      ---------------core35[1:x]
      ---------------gpu1[1:x]
      ------------socket1[1:s]
      ---------node1[1:s]
      ------rack0[1:s]
      ---tiny0[1:s]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================
```
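Reading the output above: the allocation appears to use all 72 cores and all 4 GPUs shown, which is consistent with the selector mixing the two shapes, four slots of 12 cores plus four slots of 6 cores + 1 GPU (4 × 12 + 4 × 6 = 72 cores, 4 × 1 = 4 GPUs). Neither shape alone can satisfy all eight slots here: 8 × 12 = 96 cores would overcommit the 72 available, and 8 GPUs are not available.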

@zekemorton (Collaborator, Author)

Opened PR #1296 for this feature.
