
Accuracy degradation with approx and noisy data. #8901

Closed
trivialfis opened this issue Mar 12, 2023 · 6 comments

Comments

@trivialfis
Member

trivialfis commented Mar 12, 2023

https://discuss.xgboost.ai/t/xgboost-1-7-fits-much-worse-than-1-5-on-noisy-data-with-reproducible-experiment/3108

Related:
jpmml/jpmml-sparkml#128

@trivialfis
Member Author

Some of the default parameters were changed over the course of multiple releases, including max_bin and some subtle parameters in sketching. I can reduce the logloss to something similar to 1.5 by using a smaller number of bins (like 64).
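For readers who want to try this mitigation, here is a minimal sketch (not from the thread) of training with a reduced max_bin on synthetic noisy data, assuming the standard xgboost Python API; the dataset, parameter values, and metric are all illustrative.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hypothetical noisy dataset: flip_y randomly reassigns 30% of the labels.
X, y = make_classification(n_samples=10_000, n_features=20, flip_y=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",
    "tree_method": "approx",
    "max_bin": 64,  # default is 256; fewer bins can reduce over-fitting on noisy data
    "eval_metric": "logloss",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
print("test logloss:", log_loss(y_test, booster.predict(dtest)))
```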

@r-luo

r-luo commented Mar 14, 2023

> Some of the default parameters were changed over the course of multiple releases, including max_bin and some subtle parameters in sketching. I can reduce the logloss to something similar to 1.5 by using a smaller number of bins (like 64).

Thanks for looking into this! When you matched the performance to 1.5, was it with the same hyperparameters, or did you have to do more parameter tuning than with 1.5?

@trivialfis
Member Author

trivialfis commented Mar 14, 2023

@r-luo Unfortunately, there's no universally good parameter set. max_bin is currently set to 256 and was set to 64 before. A smaller number of bins can help mitigate over-fitting, but it also introduces the chance of under-fitting. We usually benchmark with large datasets; as a result, the default parameters might favor over-fitting. I opened this issue only to track whether there's a bug in the implementation.

I guess the only thing we as maintainers of the project can do is to spend more time on #4986.
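To see where this over-/under-fitting trade-off lands for a particular dataset, one option is to scan a few max_bin values with cross-validation. A hedged sketch, reusing the dtrain DMatrix from the earlier snippet; the grid of bin counts is arbitrary:

```python
import xgboost as xgb

# Compare cross-validated logloss across bin counts: a smaller max_bin
# coarsens the split candidates, a larger one fits the data more closely.
for max_bin in (32, 64, 128, 256):
    params = {
        "objective": "binary:logistic",
        "tree_method": "approx",
        "max_bin": max_bin,
        "eval_metric": "logloss",
    }
    cv = xgb.cv(params, dtrain, num_boost_round=100, nfold=5, seed=0)
    print(f"max_bin={max_bin:>3}  cv logloss={cv['test-logloss-mean'].iloc[-1]:.4f}")
```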

@r-luo

r-luo commented Mar 14, 2023

> @r-luo Unfortunately, there's no universally good parameter set. max_bin is currently set to 256 and was set to 64 before. A smaller number of bins can help mitigate over-fitting, but it also introduces the chance of under-fitting. We usually benchmark with large datasets; as a result, the default parameters might favor over-fitting. I opened this issue only to track whether there's a bug in the implementation.
>
> I guess the only thing we as maintainers of the project can do is to spend more time on #4986.

That makes sense. I agree that there's no universally good set of parameters for every problem. I'll do more experiments on my side to see if the two versions can perform the same after extensive hyperparameter tuning. Would you mind pointing me to the parameters that affect sketching that you mentioned above? Are those accessible through training parameters?

@trivialfis
Member Author

Unfortunately, the parameters for sketching are hard coded. But for this specific case, I think max_bin is the most significant one. The change in performance comes from this PR: #7214.
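In other words, the internal sketching parameters are not settable, but max_bin is passed like any other training parameter, including through the sklearn wrapper. A small sketch (names assumed from the standard xgboost API; X_train and y_train come from the earlier snippet):

```python
from xgboost import XGBClassifier

# max_bin is an ordinary training parameter for the hist/approx tree methods;
# the sketching internals themselves are not exposed.
clf = XGBClassifier(
    tree_method="approx",
    max_bin=64,
    n_estimators=100,
    eval_metric="logloss",
)
clf.fit(X_train, y_train)
```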

@r-luo

r-luo commented Mar 17, 2023

> Unfortunately, the parameters for sketching are hard coded. But for this specific case, I think max_bin is the most significant one. The change in performance comes from this PR: #7214.

Makes sense, thanks so much for looking into this issue!
