
Add KS tests for weighted sampling #1530

Open · wants to merge 19 commits into master
Conversation

@dhardy (Member) commented Nov 18, 2024

  • Added a CHANGELOG.md entry

Motivation

Some of these are non-trivial distributions we didn't really test before.

To validate the solution of #1476.

Details

Single-element weighted sampling is simple enough.

fn choose_two_iterator is also simple enough: there are no weights, so we can just assign each pair of results a unique index in the list of 100 * 99 / 2 possibilities (noting that we sort pairs, since the order of chosen elements is not specified).
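A minimal sketch of that indexing (the helper below is illustrative, not the PR's test code):

```rust
/// Illustrative helper: map a sorted pair (a, b), a < b < n, to a unique
/// index in 0..n*(n-1)/2 by counting the pairs that precede it.
fn pair_index(a: usize, b: usize, n: usize) -> usize {
    debug_assert!(a < b && b < n);
    // Pairs starting with 0..a occupy the first a*n - a*(a+1)/2 slots.
    a * n - a * (a + 1) / 2 + (b - a - 1)
}

fn main() {
    let n = 100;
    // Sort each sampled pair first, since the order of chosen elements
    // is not specified.
    let (x, y) = (42, 7);
    let (a, b) = if x < y { (x, y) } else { (y, x) };
    assert!(pair_index(a, b, n) < n * (n - 1) / 2);
}
```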

fn choose_two_weighted_indexed gets a bit more complicated; I chose to approach it by building a table for the CDF of size num*num including impossible variants. Most of the tests don't pass, so there must be a mistake here.
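A minimal sketch of one way to build such a table, assuming ordered pairs (i, j) map to index i * num + j (illustrative, not the PR's actual test code):

```rust
/// Illustrative CDF over all num*num ordered pairs, including the
/// impossible i == j entries (which contribute probability zero).
fn pair_cdf(weights: &[f64]) -> Vec<f64> {
    let num = weights.len();
    let total: f64 = weights.iter().sum();
    let mut cdf = Vec::with_capacity(num * num);
    let mut acc = 0.0;
    for i in 0..num {
        for j in 0..num {
            if i != j {
                // P(draw i first, then j) without replacement:
                acc += weights[i] / total * weights[j] / (total - weights[i]);
            }
            cdf.push(acc);
        }
    }
    // The final entry accumulates to 1.0 (up to rounding error).
    cdf
}
```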

Aside: using let key = rng.random::<f64>().ln() / weight; (src/seq/index.rs:392) may help with #1476 but does not fix the above.
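For context, a minimal sketch of why the log-space key helps with the tiny weights in #1476 (this helper is illustrative, not the PR's code):

```rust
use rand::Rng;

// Illustrative comparison of the plain and log-space Efraimidis-Spirakis keys.
fn keys(rng: &mut impl Rng, weight: f64) -> (f64, f64) {
    let u: f64 = rng.random();
    // Plain key u^(1/w): for a tiny weight such as 1e-300 this underflows to
    // 0.0, so all such elements compare equal (the symptom in #1476).
    let key_plain = u.powf(1.0 / weight);
    // Log-space key ln(u)/w: ln is monotonic, so the ordering of u^(1/w) is
    // preserved, and the value (about -7e299 for w = 1e-300) stays finite.
    let key_log = u.ln() / weight;
    (key_plain, key_log)
}
```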

@dhardy (Member, Author) commented Nov 19, 2024

I can confirm that choose_multiple_weighted has a significant problem, since sampling two elements from 0, 1, 2 with weights 1, 1/2, 1/3 a million times and sorting yields 532298 counts of (0, 1), 338524 counts of (0, 2) and 129178 counts of (1, 2). (Unlike #1476, this example does not require very small weights.)

This is sampling without replacement, so the expected samples (derived in the sketch after this list) are:

  • (0,1) or (1, 0): 531818
  • (0, 2) or (2, 0): 339393
  • (1, 2) or (2, 1): 128788
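These follow from the pair probability P({i, j}) = w_i/W · w_j/(W − w_i) + w_j/W · w_i/(W − w_j); a minimal check (illustrative, not part of the PR):

```rust
fn main() {
    let w = [1.0, 0.5, 1.0 / 3.0];
    let total: f64 = w.iter().sum(); // 11/6

    // P({i, j}) = P(i first, then j) + P(j first, then i).
    let p = |i: usize, j: usize| {
        w[i] / total * w[j] / (total - w[i]) + w[j] / total * w[i] / (total - w[j])
    };

    let n = 1_000_000.0;
    // Matches the listed expectations to within rounding of n * p.
    assert!((n * p(0, 1) - 531818.0).abs() < 1.0);
    assert!((n * p(0, 2) - 339393.0).abs() < 1.0);
    assert!((n * p(1, 2) - 128788.0).abs() < 1.0);
}
```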

@dhardy marked this pull request as ready for review November 19, 2024 11:02
@dhardy (Member, Author) commented Nov 19, 2024

I fixed my calculation of the CDF, found a variant which failed like #1476, fixed this by taking the logarithm of keys, and applied some optimisation to the Efraimidis-Spirakis algorithm.

@dhardy mentioned this pull request Nov 23, 2024
@benjamin-lieser (Collaborator) left a comment

Looks correct. We might want to test the performance for large amounts.

distr_test/tests/weighted.rs
}

#[test]
fn choose_two_weighted_indexed() {
@benjamin-lieser (Collaborator):

This is probably more complex than needed, but looks correct.
It's probably worth implementing chi squared at some point, but this should also be quite sensitive.

@dhardy (Member, Author):

> This is probably more complex than needed, but looks correct.

You mean the use of an Adapter? Yes, but I'd sooner do this than revise the KS test API (which is well adapted for other usages).

> It's probably worth implementing chi squared at some point, but this should also be quite sensitive.

A fair point.

@benjamin-lieser (Collaborator):
I mean using KS for these distributions (chi squared would be more straightforward); Adapter I think is fine.
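For reference, a minimal sketch of such a chi-squared statistic, using the counts from the earlier comment (illustrative, not part of this PR):

```rust
/// Illustrative chi-squared statistic for observed vs expected counts.
fn chi_squared(observed: &[f64], expected: &[f64]) -> f64 {
    observed
        .iter()
        .zip(expected)
        .map(|(o, e)| (o - e).powi(2) / e)
        .sum()
}

fn main() {
    // Counts from the comment above; 2 degrees of freedom (3 categories - 1).
    let observed = [532298.0, 338524.0, 129178.0];
    let expected = [531818.0, 339393.0, 128788.0];
    // Compare against a chi-squared(2) critical value, e.g. 5.99 at p = 0.05.
    let stat = chi_squared(&observed, &expected);
    println!("chi^2 = {stat:.2}");
}
```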

src/seq/index.rs
// candidates is sorted ascending, so candidates[0] holds the smallest key, ln(u) / weight.
let t = core::f64::consts::E.powf(candidates[0].key * weight);
// Drawing u from (t, 1) guarantees the new key ln(u) / weight exceeds that minimum.
let key = rng.random_range(t..1.0).ln() / weight;
candidates[0] = Element { index, key };
// Restore the ascending order.
candidates.sort_unstable();
@benjamin-lieser (Collaborator):

I guess it is very likely that a tree data structure would perform much faster if amount is big enough; not sure where the threshold is. Depending on its implementation, sort_unstable could even perform particularly badly on an almost-sorted slice.

@dhardy (Member, Author):
Good point, though I won't address it now.

@benjamin-lieser (Collaborator):
https://doc.rust-lang.org/std/collections/struct.BinaryHeap.html should always be faster, as it also stores its elements in a Vec; it should be an easy change.
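A minimal sketch of that change, with an illustrative Element type standing in for the PR's candidate entries (keys assumed finite):

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

// Illustrative stand-in for the PR's candidate type.
struct Element {
    index: usize,
    key: f64, // log-space key: ln(u) / weight
}

impl PartialEq for Element {
    fn eq(&self, other: &Self) -> bool {
        self.key == other.key
    }
}
impl Eq for Element {}
impl PartialOrd for Element {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Element {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp imposes a total order on f64, so this never panics.
        self.key.total_cmp(&other.key)
    }
}

fn main() {
    // Reverse turns the max-heap into a min-heap: peek() yields the element
    // with the smallest key in O(1), and each replacement costs O(log amount)
    // instead of a full re-sort.
    let mut candidates: BinaryHeap<Reverse<Element>> = BinaryHeap::new();
    candidates.push(Reverse(Element { index: 0, key: -0.7 }));
    candidates.push(Reverse(Element { index: 1, key: -0.2 }));
    if let Some(Reverse(min)) = candidates.peek() {
        assert_eq!(min.index, 0); // the smallest key sits at the top
    }
}
```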
