
Generate confidence interval with bootstrap #20

Open
shenxiangzhuang opened this issue Dec 12, 2024 · 6 comments · May be fixed by #21
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@shenxiangzhuang
Contributor

As shown in: https://lmsys.org/blog/2023-12-07-leaderboard/.

I think this function would be useful and convenient in practice, but I'm not sure whether it's appropriate to add it to Evalica. Evalica is very concise and focused on the core computation of the algorithms, which is good enough currently.

If it's not appropriate, I'll try to build a simple Python package on top of Evalica that adds this function, and maybe more visualization functions (as shown in the Jupyter notebooks) too.

@dustalov
Owner

I'm happy to welcome more quality-of-life improvements to Evalica. At the same time, I'm focused on achieving a clean design and maintaining 100% test coverage, as reproducibility is one of the core goals.

Originally, Evalica was built to accelerate the computation of confidence intervals. However, its API currently lacks specialized utilities for this purpose. You can see an example of achieving this at https://github.com/VikhrModels/ru_llm_arena/blob/56d1edabb069945c81254969cdc9dd1df62c0d89/show_result.py. It works roughly like this:

import evalica
import pandas as pd

BOOTSTRAP_ROUNDS = 100  # number of bootstrap resamples

df = pd.read_csv(...)  # pairwise comparisons with model_a, model_b, winner columns

*_, index = evalica.indexing(
    xs=df["model_a"],  # series with model A identifiers
    ys=df["model_b"],  # series with model B identifiers
)

bootstrap: list["pd.Series[float]"] = []

for r in range(BOOTSTRAP_ROUNDS):
    df_sample = df.sample(frac=1.0, replace=True, random_state=r)

    result_sample = evalica.bradley_terry(
        xs=df_sample["model_a"],
        ys=df_sample["model_b"],
        winners=df_sample["winner"],
        index=index  # to save time by not re-indexing the elements
    )

    bootstrap.append(result_sample.scores)

df_bootstrap = pd.DataFrame(bootstrap)
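From `df_bootstrap`, a per-model confidence interval can then be read off with quantiles. A minimal self-contained sketch, with synthetic bootstrap scores standing in for real Evalica output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for df_bootstrap: rows are bootstrap rounds,
# columns are models, values are (hypothetical) Bradley-Terry scores.
df_bootstrap = pd.DataFrame(
    rng.normal(loc=[0.6, 0.4], scale=0.05, size=(1000, 2)),
    columns=["model_a", "model_b"],
)

# 95% percentile confidence interval per model:
# first row is the lower bound, second row the upper bound.
ci = df_bootstrap.quantile([0.025, 0.975])
print(ci)
```

The same two-line quantile call works on any score method's bootstrap output, since every round produces one score per model.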

@dustalov dustalov changed the title [Feature Request] Add function to generate confidence interval with bootstrap Generate confidence interval with bootstrap Dec 12, 2024
@dustalov dustalov added enhancement New feature or request help wanted Extra attention is needed labels Dec 12, 2024
@shenxiangzhuang
Contributor Author

Can you assign this task to me? I'd like to implement it.

@dustalov
Owner

Sure, why not. Please go ahead but please outline the API usage examples first so we’ll be on the same page.

@shenxiangzhuang
Contributor Author

shenxiangzhuang commented Dec 18, 2024

> Sure, why not. Please go ahead but please outline the API usage examples first so we’ll be on the same page.

I agree. Let's explore some examples to make it clear. For simplicity, let's call the core function bootstrap_ci.

First, the output of bootstrap_ci could be the raw bootstrap result, like df_bootstrap above, which can be used to compute a confidence interval at any level.

The simplest way to call it:

import evalica
import pandas as pd


df = pd.read_csv(...)
df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry')

If df has column names other than left, right, and winner, the user can pass the column names explicitly.
And we can also pass a weight_column if we have one:

df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry', left_column='left_model', right_column='right_model', winner_column='winner_column', weight_column='weight_column')

We can also set the win_weight and tie_weight:

df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry', win_weight=1.0, tie_weight=0.3)

If we are using the Elo score method, we can just pass the method-specific params:

df_bootstrap = evalica.bootstrap_ci(df, score_method='elo', initial=1200, base=10, ...)

And we can also control the bootstrap process settings:

df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry', num_rounds=1000, sample_rate=0.99, with_replace=True)

Based on the usage examples above, here is a reference design for the API input.

The input of bootstrap_ci can be separated into 4 groups:

  • data settings: df, the dataset as a pd.DataFrame; left_column, right_column, winner_column, weight_column, the four params that specify the related columns of df
  • pair weight settings: win_weight and tie_weight
  • score method settings: score_method (elo or bradley-terry, for example); solver (naive or pyo3); and other method-specific params (like initial, base, scale, and k in the Elo method)
  • bootstrap process settings: num_rounds; sample_rate; with_replace
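The grouping above could translate into a signature like the following. Everything here is a hypothetical sketch of the proposed API, not existing Evalica code; a toy win-rate scorer stands in for the real Bradley-Terry and Elo methods:

```python
import numpy as np
import pandas as pd


def _win_rate(df, left_column, right_column, winner_column):
    # Toy scorer (hypothetical stand-in for a real rating method):
    # fraction of rounds each model wins.
    models = pd.unique(pd.concat([df[left_column], df[right_column]]))
    wins = pd.Series(0.0, index=models)
    winners = np.where(df[winner_column] == "left",
                       df[left_column], df[right_column])
    for w in winners:
        wins[w] += 1
    return wins / len(df)


def bootstrap_ci(df, score_method="win-rate",
                 left_column="left", right_column="right",
                 winner_column="winner",
                 num_rounds=100, sample_rate=1.0, with_replace=True, seed=0):
    # A real design would map 'bradley-terry', 'elo', ... to implementations.
    methods = {"win-rate": _win_rate}
    score = methods[score_method]
    rows = []
    for r in range(num_rounds):
        sample = df.sample(frac=sample_rate, replace=with_replace,
                           random_state=seed + r)
        rows.append(score(sample, left_column, right_column, winner_column))
    return pd.DataFrame(rows)  # one row per bootstrap round


df = pd.DataFrame({
    "left": ["a", "a", "b"] * 10,
    "right": ["b", "c", "c"] * 10,
    "winner": ["left", "left", "right"] * 10,
})
df_bootstrap = bootstrap_ci(df, num_rounds=50)
```

The string-based score_method dispatch is the part the next comment questions, since it duplicates the library's existing per-method entry points.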

@dustalov
Owner

In this setting, we replicate the inconvenient aspect of Crowd-Kit's design, which required developers to manually construct a data frame in the proper format. In practice, we found this process to be highly inconvenient. By contrast, Evalica's columnar approach is significantly more user-friendly.

The currently proposed approach requires a proxy to resolve the score_method argument to its downstream implementation. I believe we can simplify it into a lightweight wrapper:

evalica.bootstrap(
  method=evalica.bradley_terry,
  xs=df['left'],
  ys=df['right'],
  winners=df['winner'],
  weights=df['weights'],  # this one is optional like in the rest of the library
  n_resamples=10000,
  confidence_level=0.95,
  **kwargs,  # for simplicity; these arguments are passed to the specified method
)

Also, for reference, consider scipy.stats.bootstrap. Do you think we could use it directly to avoid writing custom bootstrapping code?
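A minimal sketch of such a wrapper, with a dummy `method` standing in for `evalica.bradley_terry` (the function name `bootstrap`, the dummy scorer, and the paired row-index resampling are all illustrative assumptions, not Evalica's API):

```python
import numpy as np
import pandas as pd


def bootstrap(method, xs, ys, winners, n_resamples=1000,
              confidence_level=0.95, seed=0, **kwargs):
    # Hypothetical wrapper sketch: resample comparison rows with replacement,
    # re-score each resample, and return per-item CI bounds as two rows.
    rng = np.random.default_rng(seed)
    n = len(xs)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # paired resampling by row index
        scores.append(method(xs.iloc[idx], ys.iloc[idx],
                             winners.iloc[idx], **kwargs))
    scores = pd.DataFrame(scores)
    alpha = (1.0 - confidence_level) / 2.0
    return scores.quantile([alpha, 1.0 - alpha])


def dummy_method(xs, ys, winners):
    # Stand-in for evalica.bradley_terry: normalized win counts per item.
    models = pd.unique(pd.concat([xs, ys]))
    wins = pd.Series(0.0, index=models)
    for x, y, w in zip(xs, ys, winners):
        wins[x if w == "x" else y] += 1
    return wins / len(xs)


df = pd.DataFrame({"left": ["a", "b"] * 20, "right": ["b", "c"] * 20,
                   "winner": ["x", "y"] * 20})
ci = bootstrap(dummy_method, df["left"], df["right"], df["winner"],
               n_resamples=200)
```

Resampling rows jointly keeps each (left, right, winner) triple intact, which is the same paired-resampling behavior scipy.stats.bootstrap offers for correlated samples.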

@shenxiangzhuang
Contributor Author

Sorry for the late reply. Thanks for your suggestion; I agree this lightweight wrapper is more flexible, and I'll try to implement it in this way (maybe a few days later, as I'm afraid I don't have much time recently). As for scipy.stats.bootstrap, I think it would be better if we could use it directly without sacrificing the readability of the code. I'll explore it when implementing this function.
