CategoricalDomain performance in transform step #431
Comments
Just to be sure, are you experiencing this performance issue with the latest SkLearn2PMML version (currently 0.110.0)?
Sorry, I forgot to mention the version. Yes, I'm using the latest, 0.110.0.
Here is some timing for the performance. 99% of the time of the mapper step is spent in CategoricalDomain's transform.
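A measurement of this kind can be reproduced with a small harness. The following is a minimal sketch, not the benchmark used in this thread: the helper `best_time`, the synthetic column, and the use of `np.isin` as a stand-in for a membership mask over many categories are all assumptions made for illustration.

```python
import time

import numpy as np


def best_time(fn, *args, repeat=3):
    """Return the fastest wall-clock time (in seconds) over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Synthetic high-cardinality categorical column (hypothetical data).
rng = np.random.default_rng(0)
categories = np.array([f"cat_{i}" for i in range(5_000)])
column = rng.choice(categories, size=100_000)

# Time a membership test of every cell against the full category list.
elapsed = best_time(np.isin, column, categories)
print(f"membership mask over {categories.size} categories: {elapsed:.4f} s")
```

Swapping `np.isin` for the actual transform call of a fitted decorator would give per-feature timings comparable to the ones discussed here.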
For categorical features with the CategoricalDomain decorator, timings look like this (I know there is a really 'bad' feature with tons of categories):
Timings for numerical features with the ContinuousDomain decorator are in a normal range:
The Indeed, in case of @a-rudnik, can you implement something along those lines locally and run your benchmarks again? This way you can be sure that the fix is relevant and sufficient. However, the valid subspace mask is calculated using the
I added some time measurements to the code. Some time is certainly also spent computing the other masks, but that becomes insignificant as the number of categories grows. Most of the time is spent calculating the valid mask:
Following are the changes I made to the code (diff output):
Now the performance is much better:
Hi,
I'm currently working with a larger dataset that has numerous categorical features, some of them with many categories. I have set up the pipeline with the corresponding decorators for each feature, and I'm using xgb as the model.
I noticed that the CategoricalDomain decorator spends a lot of time in the transform step. I did a bit more digging in the code and found that most of the time is spent in _compute_masks, specifically in computing the valid mask (_valid_value_mask). I'm using the CategoricalDomain decorator with invalid_value_treatment='as_is', in which case the valid/invalid masks are not really needed, as no transformation happens.
Would it be possible to skip calculating the valid/invalid mask when invalid_value_treatment is set to 'as_is'?
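The idea behind this request can be sketched in a few lines. This is an illustration under stated assumptions, not SkLearn2PMML's actual code: `compute_valid_mask` and `transform_column` are hypothetical names, only the `'as_is'` option comes from this issue, and the `'as_value'` branch is included just to show a case where the mask is genuinely needed.

```python
import numpy as np


def compute_valid_mask(X, valid_values):
    """Hypothetical stand-in: True where a cell is one of the valid values."""
    return np.isin(X, valid_values)


def transform_column(X, valid_values, invalid_value_treatment="as_is",
                     invalid_value_replacement=None):
    """Apply an invalid-value treatment to one categorical column (sketch)."""
    X = np.asarray(X)
    if invalid_value_treatment == "as_is":
        # Nothing is rewritten, so the (expensive) membership mask is skipped.
        return X
    # Every other treatment needs to know which cells are invalid.
    valid_mask = compute_valid_mask(X, valid_values)
    if invalid_value_treatment == "as_value":
        # Replace invalid cells with the configured replacement value.
        return np.where(valid_mask, X, invalid_value_replacement)
    raise ValueError(f"Unsupported treatment: {invalid_value_treatment!r}")


print(transform_column(["a", "b", "z"], ["a", "b"]))
print(transform_column(["a", "b", "z"], ["a", "b"], "as_value", "other"))
```

With the early return, the cost of the `'as_is'` path no longer depends on the number of categories, which is the behavior asked for above.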