parameter devicePlugin.deviceSplitCount does not work #35

Open
2232729885 opened this issue Mar 13, 2024 · 3 comments

@2232729885

I used Helm to install k8s-vgpu-scheduler with devicePlugin.deviceSplitCount=5. After it deployed successfully, I ran 'kubectl describe node' and the allocatable 'nvidia.com/gpu' count was 40 (the machine has 8 A40 cards). I then created 6 pods, each requesting 1 'nvidia.com/gpu'. But when I created a pod that requests 3 'nvidia.com/gpu', Kubernetes said the pod could not be scheduled.
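
For reference, the failing pod was created with a manifest roughly like the one below (a minimal sketch; the pod name, container name, and image are placeholders, not my exact manifest - only the nvidia.com/gpu limit of 3 matches the actual request):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example        # placeholder pod name
spec:
  containers:
  - name: cuda-test            # placeholder container name and image
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 3      # the request that fails to schedule
EOF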

The vgpu-scheduler logs are shown below; they seem to say that only 2 GPU cards are usable?
I0313 00:58:35.594437 1 score.go:65] "devices status"
I0313 00:58:35.594467 1 score.go:67] "device status" device id="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a" device detail={"Id":"GPU-0707087e-8264-4ba4-bc45-30c70272ec4a","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594519 1 score.go:67] "device status" device id="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce" device detail={"Id":"GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce","Index":1,"Used":0,"Count":10,"Usedmem":0,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594542 1 score.go:67] "device status" device id="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4" device detail={"Id":"GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4","Index":2,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594568 1 score.go:67] "device status" device id="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e" device detail={"Id":"GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e","Index":3,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594600 1 score.go:67] "device status" device id="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e" device detail={"Id":"GPU-56967eb2-30b7-c808-367a-225b8bd8a12e","Index":4,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594639 1 score.go:67] "device status" device id="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb" device detail={"Id":"GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb","Index":5,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594671 1 score.go:67] "device status" device id="GPU-e731cd15-879f-6d00-485d-d1b468589de9" device detail={"Id":"GPU-e731cd15-879f-6d00-485d-d1b468589de9","Index":6,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594693 1 score.go:67] "device status" device id="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6" device detail={"Id":"GPU-865edbf8-5d63-8e57-5e14-36682179eaf6","Index":7,"Used":1,"Count":10,"Usedmem":46068,"Totalmem":46068,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA A40","Health":true}
I0313 00:58:35.594725 1 score.go:90] "Allocating device for container request" pod="default/gpu-pod-2" card request={"Nums":5,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
I0313 00:58:35.594757 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=5 device index=7 device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce"
I0313 00:58:35.594800 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-b3e35ad4-81ee-0aee-9865-4787748b93ce"
I0313 00:58:35.594829 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=4 device index=6 device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a"
I0313 00:58:35.594850 1 score.go:140] "first fitted" pod="default/gpu-pod-2" device="GPU-0707087e-8264-4ba4-bc45-30c70272ec4a"
I0313 00:58:35.594869 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=5 device="GPU-865edbf8-5d63-8e57-5e14-36682179eaf6"
I0313 00:58:35.594889 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=4 device="GPU-e731cd15-879f-6d00-485d-d1b468589de9"
I0313 00:58:35.594911 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=3 device="GPU-54191405-e5a9-2f7b-8ac4-f4e86c6669cb"
I0313 00:58:35.594929 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=2 device="GPU-56967eb2-30b7-c808-367a-225b8bd8a12e"
I0313 00:58:35.594948 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=1 device="GPU-7099a282-5a75-55f8-0cd0-a4b48098ae1e"
I0313 00:58:35.594966 1 score.go:93] "scoring pod" pod="default/gpu-pod-2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=3 device index=0 device="GPU-d38a391c-9f2f-395e-2f91-1785a648f6c4"
I0313 00:58:35.594989 1 score.go:211] "calcScore:node not fit pod" pod="default/gpu-pod-2" node="gpu-230"

kubectl describe node gpu-230 shows:
[screenshot]

nvidia-smi shows:
[screenshot]

Can somebody help with this issue? Thanks.

@2232729885 (Author)

The Helm install command I used is:

# add the vgpu repo
helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler

# install the chart
helm upgrade --install vgpu vgpu-charts/vgpu \
  -n kube-system \
  --set scheduler.kubeScheduler.imageTag=v1.27.2 \
  --set devicePlugin.deviceSplitCount=5 \
  --set devicePlugin.deviceMemoryScaling=1 \
  --set devicePlugin.migStrategy=none \
  --set resourceName=nvidia.com/gpu \
  --set resourceMem=nvidia.com/gpumem \
  --set resourceMemPercentage=nvidia.com/gpumem-percentage \
  --set resourceCores=nvidia.com/gpucores \
  --set resourcePriority=nvidia.com/priority \
  --set devicePlugin.tolerations[0].key=nvidia.com/gpu \
  --set devicePlugin.tolerations[0].operator=Exists \
  --set devicePlugin.tolerations[0].effect=NoSchedule
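
For completeness, the allocatable count of 40 mentioned above was read from the node roughly like this (gpu-230 is the node name in my cluster; the grep is only there to narrow the output):

# check what the node advertises for the vGPU resource after install
kubectl describe node gpu-230 | grep -i nvidia.com/gpu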

@2232729885 (Author)

[screenshot]

@2232729885 (Author)

[screenshot]
