
Generate confidence interval with bootstrap #20

Open
shenxiangzhuang opened this issue Dec 12, 2024 · 6 comments · May be fixed by #21
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@shenxiangzhuang
Contributor

As shown in: https://lmsys.org/blog/2023-12-07-leaderboard/.

I think this function would be useful and convenient in practice, but I'm not sure whether it's appropriate to add it to Evalica. Evalica is very concise and focused on the core computation of the algorithms, which is good enough currently.

If it's not appropriate, I'll try to build a simple Python package on top of Evalica that adds this function, and maybe more visualization functions (as shown in the Jupyter notebooks) too.

@dustalov
Owner

I'm happy to welcome more quality-of-life improvements to Evalica. At the same time, I'm focused on achieving a clean design and maintaining 100% test coverage, as reproducibility is one of the core goals.

Originally, Evalica was built to accelerate the computation of confidence intervals. However, its API currently lacks specialized utilities for this purpose. You can see an example of achieving this at https://github.com/VikhrModels/ru_llm_arena/blob/56d1edabb069945c81254969cdc9dd1df62c0d89/show_result.py. It works roughly like this:

import evalica
import pandas as pd

BOOTSTRAP_ROUNDS = 100  # number of bootstrap resamples

df = pd.read_csv(...)  # pairwise comparisons with model_a, model_b, winner columns

*_, index = evalica.indexing(
    xs=df["model_a"],  # series with model A identifiers
    ys=df["model_b"],  # series with model B identifiers
)

bootstrap: list["pd.Series[float]"] = []

for r in range(BOOTSTRAP_ROUNDS):
    df_sample = df.sample(frac=1.0, replace=True, random_state=r)

    result_sample = evalica.bradley_terry(
        xs=df_sample["model_a"],
        ys=df_sample["model_b"],
        winners=df_sample["winner"],
        index=index  # to save time by not re-indexing the elements
    )

    bootstrap.append(result_sample.scores)

df_bootstrap = pd.DataFrame(bootstrap)
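From `df_bootstrap`, a per-model confidence interval can then be read off with quantiles. A minimal self-contained sketch, with synthetic bootstrap scores standing in for real Evalica output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for df_bootstrap: rows are bootstrap rounds,
# columns are models, values are (hypothetical) Bradley-Terry scores.
df_bootstrap = pd.DataFrame(
    rng.normal(loc=[0.6, 0.4], scale=0.05, size=(1000, 2)),
    columns=["model_a", "model_b"],
)

# 95% percentile confidence interval per model:
# first row is the lower bound, second row the upper bound.
ci = df_bootstrap.quantile([0.025, 0.975])
print(ci)
```

The same two-line quantile call works on any score method's bootstrap output, since every round produces one score per model.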

@dustalov dustalov changed the title [Feature Request] Add function to generate confidence interval with bootstrap Generate confidence interval with bootstrap Dec 12, 2024
@dustalov dustalov added enhancement New feature or request help wanted Extra attention is needed labels Dec 12, 2024
@shenxiangzhuang
Contributor Author

Can you assign this task to me? I'd like to implement it.

@dustalov
Owner

Sure, why not. Please go ahead but please outline the API usage examples first so we’ll be on the same page.

@shenxiangzhuang
Contributor Author

shenxiangzhuang commented Dec 18, 2024

> Sure, why not. Please go ahead but please outline the API usage examples first so we’ll be on the same page.

I agree. Let's explore some examples to make it clear. For simplicity, let's call the core function bootstrap_ci.

First, the output of bootstrap_ci could be the raw bootstrap result, like df_bootstrap above, which can be used to compute a confidence interval at any level.

The simplest way to call it:

import evalica
import pandas as pd


df = pd.read_csv(...)
df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry')

If df has column names other than left, right, and winner, the user can pass the column names explicitly.
And we can also pass a weight_column if we have one:

df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry', left_column='left_model', right_column='right_model', winner_column='winner_column', weight_column='weight_column')

We can also set the win_weight and tie_weight:

df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry', win_weight=1.0, tie_weight=0.3)

If we are using the Elo score method, we can just pass the method-specific params:

df_bootstrap = evalica.bootstrap_ci(df, score_method='elo', initial=1200, base=10, ...)

And we can also control the bootstrap process settings:

df_bootstrap = evalica.bootstrap_ci(df, score_method='bradley-terry', num_rounds=1000, sample_rate=0.99, with_replace=True)

Based on the usage examples above, here is a reference design for the API input.

The input of bootstrap_ci can be separated into 4 groups:

  • data settings: df, the dataset as a pd.DataFrame; left_column, right_column, winner_column, weight_column, the four params that specify the related columns of df
  • pair weight settings: win_weight and tie_weight
  • score method settings: score_method (elo or bradley-terry, for example); solver (naive or pyo3); and other method-specific params (like initial, base, scale, and k in the Elo method)
  • bootstrap process settings: num_rounds; sample_rate; with_replace
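The grouping above could translate into a signature like the following. Everything here is a hypothetical sketch of the proposed API, not existing Evalica code; a toy win-rate scorer stands in for the real Bradley-Terry and Elo methods:

```python
import numpy as np
import pandas as pd


def _win_rate(df, left_column, right_column, winner_column):
    # Toy scorer (hypothetical stand-in for a real rating method):
    # fraction of rounds each model wins.
    models = pd.unique(pd.concat([df[left_column], df[right_column]]))
    wins = pd.Series(0.0, index=models)
    winners = np.where(df[winner_column] == "left",
                       df[left_column], df[right_column])
    for w in winners:
        wins[w] += 1
    return wins / len(df)


def bootstrap_ci(df, score_method="win-rate",
                 left_column="left", right_column="right",
                 winner_column="winner",
                 num_rounds=100, sample_rate=1.0, with_replace=True, seed=0):
    # A real design would map 'bradley-terry', 'elo', ... to implementations.
    methods = {"win-rate": _win_rate}
    score = methods[score_method]
    rows = []
    for r in range(num_rounds):
        sample = df.sample(frac=sample_rate, replace=with_replace,
                           random_state=seed + r)
        rows.append(score(sample, left_column, right_column, winner_column))
    return pd.DataFrame(rows)  # one row per bootstrap round


df = pd.DataFrame({
    "left": ["a", "a", "b"] * 10,
    "right": ["b", "c", "c"] * 10,
    "winner": ["left", "left", "right"] * 10,
})
df_bootstrap = bootstrap_ci(df, num_rounds=50)
```

The string-based score_method dispatch is the part the next comment questions, since it duplicates the library's existing per-method entry points.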

@dustalov
Owner

In this setting, we replicate the inconvenient aspect of Crowd-Kit's design, which required developers to manually construct a data frame in the proper format. In practice, we found this process to be highly inconvenient. By contrast, Evalica's columnar approach is significantly more user-friendly.

The currently proposed approach requires a proxy to resolve the score_method argument to its downstream implementation. I believe we can simplify it into a lightweight wrapper:

evalica.bootstrap(
  method=evalica.bradley_terry,
  xs=df['left'],
  ys=df['right'],
  winners=df['winner'],
  weights=df['weights'],  # this one is optional like in the rest of the library
  n_resamples=10000,
  confidence_level=0.95,
  **kwargs,  # for simplicity; these arguments are passed to the specified method
)

Also, for reference, consider scipy.stats.bootstrap. Do you think we could use it directly to avoid writing custom bootstrapping code?
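A minimal sketch of such a wrapper, with a dummy `method` standing in for `evalica.bradley_terry` (the function name `bootstrap`, the dummy scorer, and the paired row-index resampling are all illustrative assumptions, not Evalica's API):

```python
import numpy as np
import pandas as pd


def bootstrap(method, xs, ys, winners, n_resamples=1000,
              confidence_level=0.95, seed=0, **kwargs):
    # Hypothetical wrapper sketch: resample comparison rows with replacement,
    # re-score each resample, and return per-item CI bounds as two rows.
    rng = np.random.default_rng(seed)
    n = len(xs)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # paired resampling by row index
        scores.append(method(xs.iloc[idx], ys.iloc[idx],
                             winners.iloc[idx], **kwargs))
    scores = pd.DataFrame(scores)
    alpha = (1.0 - confidence_level) / 2.0
    return scores.quantile([alpha, 1.0 - alpha])


def dummy_method(xs, ys, winners):
    # Stand-in for evalica.bradley_terry: normalized win counts per item.
    models = pd.unique(pd.concat([xs, ys]))
    wins = pd.Series(0.0, index=models)
    for x, y, w in zip(xs, ys, winners):
        wins[x if w == "x" else y] += 1
    return wins / len(xs)


df = pd.DataFrame({"left": ["a", "b"] * 20, "right": ["b", "c"] * 20,
                   "winner": ["x", "y"] * 20})
ci = bootstrap(dummy_method, df["left"], df["right"], df["winner"],
               n_resamples=200)
```

Resampling rows jointly keeps each (left, right, winner) triple intact, which is the same paired-resampling behavior scipy.stats.bootstrap offers for correlated samples.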

@shenxiangzhuang
Contributor Author

Sorry for the late reply. Thanks for your suggestion; I agree this lightweight wrapper is more flexible, and I'll try to implement it in this way (maybe a few days later, as I'm afraid I don't have much time recently). As for scipy.stats.bootstrap, I think it would be better if we could use it directly without sacrificing the readability of the code. I'll explore it when implementing this function.
