
Accuracy degradation with approx and noisy data. #8901

Closed
trivialfis opened this issue Mar 12, 2023 · 6 comments

Comments

@trivialfis
Member

trivialfis commented Mar 12, 2023

https://discuss.xgboost.ai/t/xgboost-1-7-fits-much-worse-than-1-5-on-noisy-data-with-reproducible-experiment/3108

Related:
jpmml/jpmml-sparkml#128

@trivialfis
Member Author

Some of the default parameters were changed over the course of multiple releases, including max_bin and some subtle parameters in sketching. I can reduce the logloss to something similar to 1.5 by using a smaller number of bins (like 64).
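For readers who want to try this mitigation, here is a minimal sketch (not from the thread) of training with a reduced max_bin on synthetic noisy data, assuming the standard xgboost Python API; the dataset, parameter values, and metric are all illustrative.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hypothetical noisy dataset: flip_y randomly reassigns 30% of the labels.
X, y = make_classification(n_samples=10_000, n_features=20, flip_y=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",
    "tree_method": "approx",
    "max_bin": 64,  # default is 256; fewer bins can reduce over-fitting on noisy data
    "eval_metric": "logloss",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
print("test logloss:", log_loss(y_test, booster.predict(dtest)))
```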

@r-luo

r-luo commented Mar 14, 2023

> Some of the default parameters were changed over the course of multiple releases, including max_bin and some subtle parameters in sketching. I can reduce the logloss to something similar to 1.5 by using a smaller number of bins (like 64).

Thanks for looking into this! When you matched the performance to 1.5, was it with the same hyperparameters, or did you have to do more parameter tuning than with 1.5?

@trivialfis
Member Author

trivialfis commented Mar 14, 2023

@r-luo Unfortunately, there's no universally good parameter set. max_bin is currently set to 256 and was set to 64 before. A smaller number of bins can help mitigate over-fitting, but it also introduces the chance of under-fitting. We usually benchmark with large datasets; as a result, the default parameters might favor over-fitting. I opened this issue only to track whether there's a bug in the implementation.

I guess the only thing we as maintainers of the project can do is to spend more time on #4986.
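To see where this over-/under-fitting trade-off lands for a particular dataset, one option is to scan a few max_bin values with cross-validation. A hedged sketch, reusing the dtrain DMatrix from the earlier snippet; the grid of bin counts is arbitrary:

```python
import xgboost as xgb

# Compare cross-validated logloss across bin counts: a smaller max_bin
# coarsens the split candidates, a larger one fits the data more closely.
for max_bin in (32, 64, 128, 256):
    params = {
        "objective": "binary:logistic",
        "tree_method": "approx",
        "max_bin": max_bin,
        "eval_metric": "logloss",
    }
    cv = xgb.cv(params, dtrain, num_boost_round=100, nfold=5, seed=0)
    print(f"max_bin={max_bin:>3}  cv logloss={cv['test-logloss-mean'].iloc[-1]:.4f}")
```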

@r-luo

r-luo commented Mar 14, 2023

> @r-luo Unfortunately, there's no universally good parameter set. max_bin is currently set to 256 and was set to 64 before. A smaller number of bins can help mitigate over-fitting, but it also introduces the chance of under-fitting. We usually benchmark with large datasets; as a result, the default parameters might favor over-fitting. I opened this issue only to track whether there's a bug in the implementation.
>
> I guess the only thing we as maintainers of the project can do is to spend more time on #4986.

That makes sense. I agree that there's no universally good set of parameters for every problem. I'll do more experiments on my side to see if the two versions can perform the same after extensive hyperparameter tuning. Would you mind pointing me to the parameters that affect sketching that you mentioned above? Are those accessible through training parameters?

@trivialfis
Member Author

Unfortunately, the parameters for sketching are hard coded. But for this specific case, I think max_bin is the most significant one. The change in performance comes from this PR: #7214.
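In other words, the internal sketching parameters are not settable, but max_bin is passed like any other training parameter, including through the sklearn wrapper. A small sketch (names assumed from the standard xgboost API; X_train and y_train come from the earlier snippet):

```python
from xgboost import XGBClassifier

# max_bin is an ordinary training parameter for the hist/approx tree methods;
# the sketching internals themselves are not exposed.
clf = XGBClassifier(
    tree_method="approx",
    max_bin=64,
    n_estimators=100,
    eval_metric="logloss",
)
clf.fit(X_train, y_train)
```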

@r-luo

r-luo commented Mar 17, 2023

> Unfortunately, the parameters for sketching are hard coded. But for this specific case, I think max_bin is the most significant one. The change in performance comes from this PR: #7214.

Makes sense, thanks so much for looking into this issue!
