CategoricalDomain performance in transform step #431

Open
ghost opened this issue Aug 20, 2024 · 5 comments
@ghost commented Aug 20, 2024

Hi,

I'm currently working with a large dataset that has numerous categorical features, some of them with many categories. I have set up the pipeline with the corresponding decorators for each feature, and I'm using XGBoost as the classifier.

I noticed that the CategoricalDomain decorator spends a lot of time in the transform step. I did a bit more digging in the code, and found that most of the time is spent in _compute_masks, specifically in computing the valid mask (_valid_value_mask). I'm using the CategoricalDomain decorator with invalid_value_treatment='as_is', in which case the valid/invalid masks are not really needed, as no transformation happens.

Would it be possible to skip calculating the valid/invalid masks when invalid_value_treatment is set to 'as_is'?
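For reference, here is a minimal numpy sketch of the three-mask computation being described. This is my own reconstruction for illustration, not the actual sklearn2pmml code:

```python
import numpy

def compute_masks(X, data_values):
    # Hypothetical reconstruction of what Domain._compute_masks(X) is
    # described as doing: the missing mask flags None/NaN cells, the valid
    # mask flags cells whose value appears in the fitted category list,
    # and the invalid mask is everything else.
    missing_mask = numpy.array([v is None or v != v for v in X], dtype=bool)
    valid_mask = numpy.isin(X, data_values) & ~missing_mask
    invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
    return missing_mask, valid_mask, invalid_mask

X = numpy.array(["a", "b", "z", "a"], dtype=object)
missing, valid, invalid = compute_masks(X, ["a", "b", "c"])
# valid is [True, True, False, True]; the unseen category "z" lands in invalid
```

The numpy.isin call is the part that scales with the number of fitted categories, which matches where the time is going above.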

@vruusmann (Member) commented

Just to be sure, you're experiencing this bad performance issue when using the latest SkLearn2PMML version (currently 0.110.0)?

@and-ruid commented

Sorry, I forgot to mention the version. Yes, I'm using the latest, 0.110.0.

@and-ruid commented Aug 20, 2024

Here are some timings for the performance. 99% of the mapper step's time is spent in CategoricalDomain's transform:

[Pipeline] ............ (step 1 of 2) Processing mapper, total=20.6min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  36.2s

For categorical features with the CategoricalDomain decorator, timings look like this (I know there is a really 'bad' feature with tons of categories):

2024-08-18 17:47:23,358:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 1.369734 secs
2024-08-18 17:47:33,321:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 9.88299 secs
2024-08-18 18:06:07,930:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 1106.247428 secs

Timings for numerical features with the ContinuousDomain decorator are in a normal range:

2024-08-18 18:06:11,677:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.019175 secs
2024-08-18 18:06:11,706:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.017644 secs
2024-08-18 18:06:11,748:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.020861 secs

@vruusmann (Member) commented

The Domain._compute_masks(X) method returns a 3-tuple of boolean arrays (the elements representing the missing mask, valid mask and invalid mask).

Indeed, in the case of invalid_value_treatment = "as_is" there is no need to distinguish between the valid and invalid subspaces (only missing vs. non-missing is needed). In such a situation, the second and third elements of the tuple could be set to None (instead of boolean arrays), and the Domain.transform(X) method could simply skip a value space if the corresponding mask is None.
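A minimal sketch of that proposal (the function name and signature here are illustrative, not the actual sklearn2pmml internals):

```python
import numpy

def compute_masks(X, data_values, invalid_value_treatment):
    # Sketch of the proposed short-circuit: under "as_is" treatment the
    # valid/invalid split is never acted upon, so the expensive
    # numpy.isin() call is skipped and None placeholders are returned.
    missing_mask = numpy.array([v is None or v != v for v in X], dtype=bool)
    if invalid_value_treatment == "as_is":
        return missing_mask, None, None
    valid_mask = numpy.isin(X, data_values) & ~missing_mask
    invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
    return missing_mask, valid_mask, invalid_mask
```

Domain.transform(X) would then guard each value-space treatment with `if valid_mask is not None:` (and likewise for the invalid mask) before applying it.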

@a-rudnik Can you implement something along those lines locally, and run your benchmarks again? This way you can be sure that the fix is relevant/sufficient.

However, the valid subspace mask is calculated using the numpy.isin(x, values) method. Is the time really spent in there, or somewhere around it?
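One way to answer that is to time numpy.isin() in isolation against arrays of comparable shape (a sketch; the category and row counts are made up, not taken from the benchmark above):

```python
import time
import numpy

# Standalone micro-benchmark isolating numpy.isin() on object-dtype string
# data, which is what a categorical column typically looks like here.
rng = numpy.random.default_rng(13)
values = numpy.array(["cat%d" % i for i in range(50000)], dtype=object)
X = rng.choice(values, size=200000)

start = time.perf_counter()
mask = numpy.isin(X, values)
print("numpy.isin: %.3f s" % (time.perf_counter() - start))
```

Comparing this number against the total _compute_masks time for the same column would show how much is numpy.isin itself versus the surrounding code.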

@and-ruid commented

I added some time measurements to the code. Some time is certainly also spent computing the other masks, but that becomes insignificant as the number of categories grows. Most of the time is spent calculating the valid mask:

Code block '_isin_mask' took: 491.21950 ms
Code block '_valid_value_mask' took: 584.56533 ms
Code block '_compute_masks' took: 791.44471 ms
2024-08-21 09:24:57,090:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 1.070773 secs
Code block '_isin_mask' took: 9017.28933 ms
Code block '_valid_value_mask' took: 9093.18437 ms
Code block '_compute_masks' took: 9303.03583 ms
2024-08-21 09:25:06,774:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 9.605799 secs
Code block '_isin_mask' took: 1103150.89917 ms
Code block '_valid_value_mask' took: 1103226.58283 ms
Code block '_compute_masks' took: 1103420.87033 ms
2024-08-21 09:43:36,908:INFO:sklearn_pandas:_transform - [TRANSFORM] ['xxx']: 1103.709195 secs

Following are the changes I made to the code (diff output):

180c180
< 		elif self.invalid_value_treatment == "as_is":
---
> 		elif (self.invalid_value_treatment == "as_is") or (self.invalid_value_treatment == "as_missing" and self.missing_value_treatment == "as_is"):
196,197c196,201
< 		valid_mask = self._valid_value_mask(X, nonmissing_mask)
< 		invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)
---
> 		if (self.invalid_value_treatment == "as_is") or (self.invalid_value_treatment == "as_missing" and self.missing_value_treatment == "as_is"):
> 			valid_mask = None
> 			invalid_mask = None
> 		else:
> 			valid_mask = self._valid_value_mask(X, nonmissing_mask)
> 			invalid_mask = ~numpy.logical_or(missing_mask, valid_mask)

Now the performance is much better:

Code block '_compute_masks' took: 204.30683 ms
2024-08-21 10:37:12,503:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.746888 secs
Code block '_compute_masks' took: 207.54517 ms
2024-08-21 10:37:13,380:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.799804 secs
Code block '_compute_masks' took: 184.08946 ms
2024-08-21 10:37:20,148:INFO:sklearn_pandas:_transform - [FIT_TRANSFORM] ['xxx']: 0.798046 secs
