max_vocab_count won't work for CATEGORICAL integerized in tfdf.keras.GradientBoostedTreesModel #190
Comments
Hi, I looked into the code a bit more deeply and the behaviour you're seeing is expected and probably hasn't changed since 0.2.6. The simplest fix is to feed the feature as a string instead of an int, so TF-DF will not assume it's pre-integerized. Alternatively, you can continue feeding the values as integers and set every value but the
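The string workaround mentioned above might look like the following minimal sketch. It assumes the data lives in a pandas DataFrame; the `label` column, the helper names `cast_to_string_categorical` and `build_model`, and the sample values are hypothetical and only serve the illustration. The idea is that once `request_id` is fed as a string, TF-DF should build a dictionary for it, so `max_vocab_count` can take effect.

```python
import pandas as pd


def cast_to_string_categorical(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with the given integer columns converted to strings."""
    out = df.copy()
    for col in columns:
        # String values are not treated as pre-integerized categories.
        out[col] = out[col].astype(str)
    return out


def build_model(max_vocab_count: int = 20):
    """Build a GBT model that caps the vocabulary of the string feature."""
    # Imported lazily so the casting helper works without TF-DF installed.
    import tensorflow_decision_forests as tfdf

    features = [
        tfdf.keras.FeatureUsage(
            name="request_id",
            semantic=tfdf.keras.FeatureSemantic.CATEGORICAL,
            max_vocab_count=max_vocab_count,
        ),
    ]
    return tfdf.keras.GradientBoostedTreesModel(
        features=features,
        max_vocab_count=max_vocab_count,
    )


# Hypothetical sample data: request_id is stored as an integer.
df = pd.DataFrame({"request_id": [101, 102, 101], "label": [0, 1, 0]})
df_str = cast_to_string_categorical(df, ["request_id"])
```

The converted frame can then be passed to `tfdf.keras.pd_dataframe_to_tf_dataset(df_str, label="label")` before calling `fit`. The same cast has to be applied at serving time, since the model will then expect string inputs for this feature.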
Hi, Thank you for your quick reply. If I want to avoid data manipulation, could you please point me to the code where I can make the guide apply to pre-integerized categorical features as well? I found this source file, which matches the release we are using. Given our constraints, applying a custom fix there would be the best solution. Thank you for your help!
Hi, The relevant piece is in this function. If you read through it, you'll see some preprocessing is not applied if the column is integerized. You'll have to make sure that the parameter
Note that this code resides in the Yggdrasil Decision Forests repository, which is a C++ library (developed by the same team as TF-DF). During compilation (with bazel), you'll have to make sure your local, modified copy of Yggdrasil Decision Forests is used. Finally, follow these instructions for building the old TF-serving with the old TF-DF.
Unfortunately, our team does not have the bandwidth to support you in this process. If this is too cumbersome, I suggest you also try to just exclude the
I thought about this a bit more and I believe this can be considered a bug and it's probably something we should address. I'll keep this issue open (and labeled) as a reminder.
Hi All,
versions:
python 3.9
tensorflow_decision_forests==0.2.6
tensorflow==2.9.1
Running on AWS instance type: ml.m5.24xlarge
Problem description:
When max_vocab_count is set to 20 in both tfdf.keras.FeatureUsage and tfdf.keras.GradientBoostedTreesModel, features of type CATEGORICAL integerized are not affected and keep their original vocabulary size, whereas for features of type CATEGORICAL has-dict max_vocab_count is applied correctly.
For example, see these statistics from the training log; both features use the same feature usage:
"request_id" CATEGORICAL integerized vocab-size:8806 no-ood-item
"request_tile" CATEGORICAL has-dict vocab-size:21 num-oods:2823 (0.0014115%) most-frequent:"851fb467fffffff" 2395895 (1.19795%)
request_id is ignored by the guide and does not use max_vocab_count, while request_tile is handled correctly.
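To make the expected behaviour concrete, here is a rough sketch of frequency-based vocabulary capping: keep the most frequent values and map everything else to a single out-of-dictionary item. This is only an illustration of the effect one would expect from max_vocab_count, not TF-DF's actual implementation; the function `cap_vocabulary`, the `OOD` placeholder, and the sample values are assumptions made for the example.

```python
from collections import Counter

OOD = "<OOD>"  # hypothetical placeholder for out-of-dictionary values


def cap_vocabulary(values, max_vocab_count):
    """Keep the max_vocab_count most frequent values; map the rest to OOD."""
    counts = Counter(values)
    kept = {value for value, _ in counts.most_common(max_vocab_count)}
    return [value if value in kept else OOD for value in values]


# Four distinct values, capped to the two most frequent ones.
values = ["a", "a", "a", "b", "b", "c", "d"]
capped = cap_vocabulary(values, max_vocab_count=2)
```

In the log above, request_tile shows this kind of pruning (a small vocabulary plus some OOD items), while request_id keeps all 8806 values.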
I would appreciate your help. Thank you.