Unsupported resource type 'gpu' on LC cluster despite resource-list reporting GPUs #6371

Closed
jameshcorbett opened this issue Oct 15, 2024 · 4 comments

Comments

@jameshcorbett
Member

[corbett8@cluster1081:~]$ flux run -t15m -N1 -n1 -c24 -g 1 -o mpibind=on hostname
0.022s: job.exception type=alloc severity=0 Unsupported resource type 'gpu'
[corbett8@cluster1081:~]$ flux resource list
     STATE PROPERTIES       NNODES NCORES NGPUS NODELIST
      free plarge,pdev,pall      2    192     8 cluster[1081,1084]
 allocated                       0      0     0 
      down                       0      0     0 


[corbett8@cluster1081:~]$ flux kvs get resource.R | jq .execution
{
  "R_lite": [
    {
      "rank": "0-1",
      "children": {
        "gpu": "0-3",
        "core": "0-95"
      }
    }
  ],
  "starttime": 1729014140.0,
  "expiration": 1729028540.0,
  "nodelist": [
    "cluster[1081,1084]"
  ],
  "properties": {
    "plarge": "0-1",
    "pdev": "0-1",
    "pall": "0-1"
  }
}
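The totals in `flux resource list` can be cross-checked against this `R_lite` entry. Below is a minimal sketch, assuming idsets only take the simple forms seen here (`"0-3"`, or comma-separated ranges); `expand_idset` and `count_resources` are hypothetical helpers for illustration, not part of Flux.

```python
import json


def expand_idset(idset: str) -> list[int]:
    """Expand a simple idset such as "0-3" or "0,2-3" into a list of ints.

    Simplified parser: assumes no brackets or other idset syntax.
    """
    ids = []
    for part in idset.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def count_resources(execution: dict) -> dict:
    """Sum the children of each type over every rank in each R_lite entry."""
    totals = {}
    for entry in execution["R_lite"]:
        nranks = len(expand_idset(entry["rank"]))
        for rtype, ids in entry["children"].items():
            totals[rtype] = totals.get(rtype, 0) + nranks * len(expand_idset(ids))
    return totals


# The R_lite portion of the output above.
execution = json.loads(
    '{"R_lite": [{"rank": "0-1", "children": {"gpu": "0-3", "core": "0-95"}}]}'
)
totals = count_resources(execution)
print(totals)  # {'gpu': 8, 'core': 192}
```

Two ranks with four GPUs and 96 cores each give the 8 GPUs and 192 cores that `flux resource list` reports, so the resource module does know about the GPUs.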


[corbett8@cluster1081:~]$ flux module list
Module                   Idle  S Sendq Recvq Service
content-sqlite           idle  R     0     0 content-backing
job-manager              idle  R     0     0 
cron                     idle  R     0     0 
sched-simple             idle  R     0     0 feasibility,sched
resource                 idle  R     0     0 
job-info                 idle  R     0     0 
job-exec                 idle  R     0     0 
heartbeat                   0  R     0     0 
barrier                  idle  R     0     0 
job-ingest               idle  R     0     0 
job-list                 idle  R     0     0 
connector-local             0  R     0     0 
kvs                        22  R     0     0 
content                    22  R     0     0 
kvs-watch                idle  R     0     0 

The instance is actually configured to use Fluxion with JGF, but due to flux-framework/flux-sched#1310, sched-simple is loaded instead. FWIW, the JGF does contain GPU vertices.
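One way to spot this situation is to look for the module that registers the `sched` service in the `flux module list` output. A rough sketch, assuming the whitespace-separated column layout shown above, with the service list as an optional sixth column; `sched_provider` is a hypothetical helper, and the sample text is a trimmed copy of the listing.

```python
# Trimmed copy of the `flux module list` output above.
MODULE_LIST = """\
Module                   Idle  S Sendq Recvq Service
content-sqlite           idle  R     0     0 content-backing
job-manager              idle  R     0     0
sched-simple             idle  R     0     0 feasibility,sched
resource                 idle  R     0     0
"""


def sched_provider(module_list_text):
    """Return the name of the module providing the 'sched' service, or None."""
    for line in module_list_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # this module registers no extra services
        if "sched" in fields[5].split(","):
            return fields[0]
    return None


provider = sched_provider(MODULE_LIST)
print(provider)  # sched-simple
```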

@grondo
Contributor

grondo commented Oct 15, 2024

sched-simple doesn't currently support scheduling GPUs, so it raises an exception on jobs that ask for them.
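As a rough illustration of that check (not the actual sched-simple code), the sketch below walks a jobspec-style resource tree and reports the first resource type outside a supported set. The tree shape for `flux run -N1 -n1 -c24 -g 1` and the `SUPPORTED` set are assumptions made for the example.

```python
# Assumed jobspec-style resource tree for: flux run -N1 -n1 -c24 -g 1 ...
RESOURCES = [
    {"type": "node", "count": 1, "with": [
        {"type": "slot", "count": 1, "label": "task", "with": [
            {"type": "core", "count": 24},
            {"type": "gpu", "count": 1},
        ]},
    ]},
]

# Assumption for illustration only: the types a GPU-unaware scheduler accepts.
SUPPORTED = {"node", "slot", "core"}


def requested_types(resources):
    """Collect every resource type named anywhere in the tree."""
    types = set()
    for res in resources:
        types.add(res["type"])
        types |= requested_types(res.get("with", []))
    return types


def check_request(resources, supported=SUPPORTED):
    """Return an exception note for the first unsupported type, else None."""
    unsupported = sorted(requested_types(resources) - supported)
    if unsupported:
        return f"Unsupported resource type '{unsupported[0]}'"
    return None


note = check_request(RESOURCES)
print(note)  # Unsupported resource type 'gpu'
```

The check depends only on what the job asks for and what the scheduler understands, which is why the job fails even though the resource set itself lists GPUs.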

Overall, flux-core doesn't support JGF; that's a Fluxion-only thing.

So the root issue here is that sched-simple was loaded instead of Fluxion. This happens when there is an error loading the Fluxion modules, because unfortunately rc1 currently loads sched-simple if no other scheduler is loaded after all rc1.d/* files have run. (There's already an issue and a plan for improving this.)
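The fallback behavior can be modeled roughly as follows. This is a toy model of the logic described, not the real rc1 script: `run_rc1`, the two script functions, and the `"fluxion"` module name are all hypothetical stand-ins.

```python
def run_rc1(scripts):
    """Run each rc1.d-style script, then fall back to sched-simple if needed.

    `loaded` maps a module name to the set of services it registers.
    """
    loaded = {}
    for script in scripts:
        try:
            script(loaded)
        except RuntimeError as err:
            # A failing script does not abort startup in this model.
            print(f"rc1.d script failed: {err}")
    # Fallback: no module registered the 'sched' service.
    if not any("sched" in services for services in loaded.values()):
        loaded["sched-simple"] = {"sched", "feasibility"}
    return loaded


def load_fluxion_ok(loaded):
    """Stand-in for an rc1.d script that loads Fluxion successfully."""
    loaded["fluxion"] = {"sched"}


def load_fluxion_broken(loaded):
    """Stand-in for an rc1.d script whose module load fails."""
    raise RuntimeError("module load failed")


healthy = run_rc1([load_fluxion_ok])
broken = run_rc1([load_fluxion_broken])
print(sorted(healthy))  # ['fluxion']
print(sorted(broken))   # ['sched-simple']
```

In the broken case the instance still comes up with a working scheduler, which is how a Fluxion load error ends up surfacing later as a GPU exception from sched-simple.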

So I think this is a flux-sched issue, not a core issue.

@jameshcorbett
Member Author

Oh OK, I wasn't aware sched-simple doesn't support GPUs. Sounds like it's just a symptom of flux-framework/flux-sched#1310 then. Closing this.

@grondo
Contributor

grondo commented Oct 15, 2024

Maybe the exception should make it clearer, e.g. "sched-simple does not support resource type 'gpu'" or similar.

@jameshcorbett
Member Author

That might be nice!
