KNN on Max Series seems slower than CUDA-based implementation on comparable devices? #1441
There is actually an error in my initial snippet, in that it imports

Fixing it significantly improves the walltime on CPU:

(NB: the CPU it runs on provides 254 cores, which is a lot of cores; users usually have easier access to mid-range GPUs than to workstation CPUs with 64+ cores.) But still no luck running it on GPU; now I have the following error:
I thought about converting the data to on-device arrays:

```python
import numpy as np
import sklearn
import dpctl.tensor as dpt

# device = "cpu"
device = "gpu"

from sklearnex import patch_sklearn, config_context
patch_sklearn()

# imported after patch_sklearn() so the patched implementation is used
from sklearn.neighbors import NearestNeighbors

seed = 123
rng = np.random.default_rng(seed)

n_samples = 10_000_000
dim = 100
n_queries = 10_000
k = 100

# random float32 index and query sets
data = rng.random((n_samples, dim), dtype=np.float32)
query = rng.random((n_queries, dim), dtype=np.float32)

# convert to dpctl USM arrays (default device selection)
data = dpt.asarray(data)
query = dpt.asarray(query)

with config_context(target_offload=f"{device}"):
    knn = NearestNeighbors(n_neighbors=k, algorithm="brute")
    knn.fit(data)
    %time knn.kneighbors(X=query)
```

but then the compute just hangs and outputs nothing.
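(Editor's note, not from the original comment: a minimal sketch of placing the arrays on an explicitly chosen SYCL device rather than relying on the default selector, in case default device selection is part of the hang. It assumes the `device=` keyword of `dpctl.tensor.asarray` and the `"gpu"` filter string are available in the installed dpctl version.)

```python
import dpctl
import dpctl.tensor as dpt
import numpy as np

rng = np.random.default_rng(123)
host_data = rng.random((1_000, 100), dtype=np.float32)

# Allocate the USM array directly on an explicitly chosen SYCL device,
# instead of relying on the default device selector.
gpu = dpctl.SyclDevice("gpu")              # raises if no GPU is visible to SYCL
data_on_gpu = dpt.asarray(host_data, device=gpu)

# Check where the array actually lives before calling kneighbors on it.
print(data_on_gpu.sycl_device)
print(data_on_gpu.usm_type)
```

This only confirms where the data lives; whether sklearnex then dispatches the `kneighbors` call to that device is a separate question.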
So I found out I had a version mismatch in the conda dependency tree if I don't install everything with the
and now here's what I get on the GPU Max Series:
This time it seems to work and to be properly dispatched to the GPU. There's about a 5x slowdown compared to the cuML backend on an NVIDIA A100 (see report in the OP). The performance cap one can reach on an Intel Max Series is unknown, but the gap still feels larger than it should be, judging by the respective GPU specs.
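(Editor's note, not from the original comment: one way to check "properly dispatched to GPU" is the verbose mode of scikit-learn-intelex; the sketch below assumes the `sklearnex` Python logger described in the sklearnex documentation, and the exact log wording varies by version.)

```python
import logging
import numpy as np

from sklearnex import patch_sklearn, config_context
patch_sklearn()

# Verbose mode: the "sklearnex" logger reports, for each patched call,
# which backend was used (oneDAL on CPU/GPU, or fallback to stock sklearn).
logging.basicConfig()
logging.getLogger("sklearnex").setLevel(logging.INFO)

from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1_000, 100), dtype=np.float32)
Q = rng.random((10, 100), dtype=np.float32)

with config_context(target_offload="gpu"):
    knn = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(X)
    knn.kneighbors(X=Q)   # the log lines show where this call was dispatched
```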
@fcharras thank you for the report. Let me reproduce and investigate the issue.
Hi @fcharras, thank you for providing these results. We have reproduced the experiments and will create an internal feature request to identify ways to speed up this computation for more comparable results.
The initial report contained an error; please read through the first comment for a better explanation.

Running the snippet shows the following results:

device=cpu:

device=gpu (Max Series on the Intel beta cloud):

but one could expect a significant speedup on GPU.
Comparing on an A100 with the cuML implementation (in fact inherited from the OSS FAISS implementation), it's about 3s:
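(Editor's note: the actual benchmark script is not included in this issue. Below is a minimal sketch of the kind of cuML baseline being referred to; the estimator arguments and the use of CuPy arrays are assumptions, not the code that produced the 3s figure.)

```python
import cupy as cp
from cuml.neighbors import NearestNeighbors

rng = cp.random.default_rng(123)
n_samples, dim, n_queries, k = 10_000_000, 100, 10_000, 100

# Generate the data directly on the GPU to avoid host<->device copies.
data = rng.random((n_samples, dim), dtype=cp.float32)
query = rng.random((n_queries, dim), dtype=cp.float32)

nn = NearestNeighbors(n_neighbors=k, algorithm="brute")
nn.fit(data)
distances, indices = nn.kneighbors(query)
```

As the OP notes, this brute-force path is inherited from FAISS, so it serves as the CUDA baseline the Max Series timings are compared against.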
Also, looking at total CPU times with scikit-learn-intelex, it's unexpected that I see 25+ minutes for both the CPU and GPU runs despite the walltime being under 15s. It suggests the CPU is also under heavy load for the GPU snippet; is this possibility really dismissed by https://github.com//issues/1416? (A sketch for checking this is included after the environment details below.)

Environment:
sklearn-intelex + dpcpp_cpp_rt installed with conda, with a Max Series GPU on the Intel beta cloud.
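(Editor's note, not from the original report: a minimal sketch of how the wall time vs. CPU time discrepancy could be measured independently of `%time`. The `timed_kneighbors` helper is hypothetical and assumes the `knn`/`query` objects from the snippet in the comments above.)

```python
import time

def timed_kneighbors(knn, query):
    """Report wall time and process CPU time for a single kneighbors call.

    A cpu/wall ratio much larger than 1 hints that many host threads stay
    busy while the call is supposedly offloaded to the GPU.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()   # CPU time summed over all threads of this process
    knn.kneighbors(X=query)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    print(f"wall: {wall:.2f}s, cpu: {cpu:.2f}s, cpu/wall ratio: {cpu / wall:.1f}")

# timed_kneighbors(knn, query)  # run inside the config_context from the snippet above
```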