
Interpolate merging digest percentiles by centroid weight #110

Open · wants to merge 2 commits into main

Conversation

@pyckle commented Jun 10, 2018

Hi Ted,

Thanks for the great work on the project. The code is well documented, and easy to understand considering the complexity of the problem it is solving.

I have a suggestion that perhaps improves accuracy for datasets that contain data points with very large weights. In addition to interpolating by the percentile's distance to the nearest centroids, take into account the weight of the centroid.

In the case of the endpoints, we can estimate the weight of the min/max as the minimum of sin^2(pi/(2*compression)) and the weight of the first/last centroid. The former is the inverse of the quantile-to-notional-index function for the last centroid.

In cases where the centroids fit the quantile notional function well, this won't make much difference because adjacent centroids typically have relatively similar weights. In datasets which have elements with large weights, this can substantially improve accuracy in the MergingDigest.
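To make the suggestion concrete, here is a minimal sketch of the idea (this is not the code in the PR; the class and parameter names are purely illustrative):

// Plain linear interpolation between two adjacent centroids would be
// leftMean + t * (rightMean - leftMean); this weight-aware version pulls the
// estimate toward the heavier centroid.
final class WeightAwareInterpolation {
    static double interpolate(double targetIndex,
                              double leftIndex, double leftMean, double leftWeight,
                              double rightIndex, double rightMean, double rightWeight) {
        // fraction of the way from the left centroid to the right one in index space
        double t = (targetIndex - leftIndex) / (rightIndex - leftIndex);
        // scale each side's influence by its centroid weight
        double wLeft = (1 - t) * leftWeight;
        double wRight = t * rightWeight;
        return (leftMean * wLeft + rightMean * wRight) / (wLeft + wRight);
    }
}

For example, with a left centroid of weight 1000 at mean 5 and a right centroid of weight 1 at mean 100, a target index halfway between them gives roughly 5.1 with the weight-aware rule, whereas plain linear interpolation gives 52.5.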

On a related point, while changing the code, I realized I don't quite understand what's going on in the interpolation near the max:

double z1 = index - totalWeight - weight[n - 1] / 2.0;
double z2 = weight[n - 1] / 2 - z1;
return weightedAverage(mean[n - 1], z1, max, z2);

It seems z1 will always be negative, and z2 will always be positive.
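A quick check, assuming index <= totalWeight and weight[n-1] >= 1: z1 = index - totalWeight - weight[n-1]/2 <= -weight[n-1]/2 < 0, and therefore z2 = weight[n-1]/2 - z1 >= weight[n-1] > 0.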

Any feedback on the changes would be appreciated. Cheers.

@tdunning (Owner) commented Jun 10, 2018 via email

@pyckle (Author) commented Jun 10, 2018

Hi,

I think you may have misunderstood the suggestion. I was not referring to interpolation between the weights during the merge step, although that is a fascinating idea - I'll elaborate below. I was only referring to interpolating the already compressed/merged centroids to generate quantiles. This change can be made independently of the centroid merging/compression algorithm. Have a look at the diff - it should make this clearer.

Regarding biasing the merging algorithm to have different heuristics with which to merge the incoming centroids - I tend to think you're correct that any heuristics that do not enforce k-size during the merge function will cause unpredictable inaccuracies. And it should trivially follow that these heuristics will cause this biased behavior in many datasets. I presume this is not an acceptable tradeoff for better accuracy in specific use cases.

One idea that could perhaps make this feasible is adding logic to split centroids once they become bigger than the optimal k-size. The splitting, in fact, could be done in a way that makes the centroids adhere better to the k-size limits as well as offering more flexibility in where new centroids are merged. I suspect this is worth spending some time thinking about.

I'm looking forward to seeing the math for the q(1-q) k-size function/the paper updates. It's very interesting.

@tdunning (Owner) commented Jun 11, 2018 via email

@pyckle (Author) commented Jun 13, 2018

Hi,

The image that you sent didn't get attached. Would you mind sending it again?

In the meantime, I made some visualizations of what the code I wrote does:
https://github.com/pyckle/t-digest/blob/master/docs/t-digest-paper/weighted-vs-linear-interpolation.pdf

Instead of interpolating linearly, the code I wrote interpolates taking centroid weight into account. This has the advantage of making the algorithm much more accurate when values with weights larger than the k.to.q() function limit are present. This is demonstrated in the Java test cases. I am using t-digest with datasets of arbitrary weights, some of which can contain individual points that violate those limits.

However, this change has the unfortunate consequence of increasing errors at extreme percentiles, where adjacent centroid weights differ more. The increased error is bounded by the difference between the centroids being interpolated. For the percentiles and datasets I am using, this error is irrelevant.

As such, I'm no longer convinced that it's worth the trade-off for general use, but it is useful for this specific use case, and presumably for anybody else whose datasets contain elements with very heavy weights. If you think this idea has merit, but only for some use cases, I'm also happy to refactor the code to expose it as an extra method, or as an interpolation-algorithm parameter to quantile(); a rough sketch of what that could look like follows.
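For example, the parameter could look something like this (purely hypothetical; neither the enum nor such an overload exists in t-digest today):

enum Interpolation {
    LINEAR,   // current behaviour: interpolate by index distance only
    WEIGHTED  // proposed: also take centroid weights into account
}
// illustrative call: digest.quantile(0.99, Interpolation.WEIGHTED)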

Two last things that I hope are of some use.

There is an existing bug in the interpolation code between weight[n-1] and max. I fixed it here:
pyckle@41c0066
It is also fixed in this pull request. If this pull request is not accepted, I will open a separate one for that fix. The current interpolation code for max always produces a negative z1 and a z2 greater than 1.
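For readers without the commit at hand, one plausible shape of the correction is sketched below (plain linear interpolation between mean[n-1] and max, with the weighted average written out explicitly; this is not necessarily what the linked commit does):

final class UpperTailSketch {
    // mean[n-1] sits at index position totalWeight - weight[n-1]/2; max sits at totalWeight
    static double quantileAboveLastCentroid(double index, double totalWeight,
                                            double[] mean, double[] weight, double max) {
        int n = mean.length;
        double z1 = index - (totalWeight - weight[n - 1] / 2.0); // distance past the last centroid's midpoint
        double z2 = totalWeight - index;                         // distance remaining to max
        // the point nearer to index gets the larger weight, i.e. the distance to the other point
        return (mean[n - 1] * z2 + max * z1) / (z1 + z2);
    }
}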

I also think these charts visually represent the centroids better than the ones that linear-interpolation.r generates. Those charts led me to believe that each centroid has a start/end point and a slope, whereas the charts I generated show each centroid as a point, display its weight, and connect the points the way the interpolation in the code does.

Thanks for your time.

@tdunning (Owner) commented Feb 9, 2019

Picking this up again.

The current version of the paper has an extensive section on how I do the interpolation. Revamping this has improved accuracy substantially.

Could you take a look at section 2.9 and beyond?

@tdunning (Owner) commented

Notably, interpolation is now weight-aware.

@pyckle (Author) commented Feb 28, 2019

Hi Ted,

Sorry for the delay. I've been very busy recently and just got a chance to start looking at the changes you've made. I will try and find a few more hours sometime soon to verify my findings.

I read the paper. The method of interpolating as described appears to be far better than what I had come up with originally. Fantastic work, and the clear drawings make it very easy to understand.

However, this doesn't address my original issue. The issue I had is an 'apparent' limitation of the algorithm: if you have a weighted dataset with one sample whose weight is extreme compared to the rest of the data, the quantile estimate will still interpolate toward the nearby centroids. Example:

MergingDigest digest = new MergingDigest(100); // compression value chosen only for illustration
digest.add(1, 2);         // small weight
digest.add(50, 1000000);  // one sample with an extreme weight
digest.add(100, 2);       // small weight

This change does not address that issue. The situation it does help with is probably an even less common use case: it prevents incorrect interpolation in tiny datasets, since centroids always end up with weight >= 1 as datasets grow. One elegant way to address this situation within the current design would be to use the sign bit of the weight in each centroid: set it to 1 if there is exactly one sample at the centroid, and 0 if there is more than one. This complicates the code, but it would have minimal performance impact and should handle this case. If I have some spare time, I may try to write this myself.
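A rough sketch of that sign-bit encoding (illustrative names only; this is not part of t-digest):

final class SignBitFlag {
    // Mark a stored weight as belonging to a centroid that holds exactly one sample.
    static double markSingleton(double weight) {
        return -Math.abs(weight);
    }
    // A negative stored weight means "exactly one sample at this centroid".
    static boolean isSingleton(double storedWeight) {
        return storedWeight < 0;
    }
    // Recover the actual (positive) weight regardless of the flag.
    static double actualWeight(double storedWeight) {
        return Math.abs(storedWeight);
    }
}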

Also, the logic for the end cases seems incorrect as coded. See test case and possible fix in this commit: pyckle@a367ab9

I think the root of the issue is that the code does not correctly take into account that a weight of 1 from the min or max is included in the first/last centroid. I have not tested my fix for this, nor am I confident it is correct. But at least it doesn't divide by zero.

I have not yet had time to review the interpolation for the middle cases in the datasets in depth, but it appears correct at first glance. I will try and find time to do this soon.

Thank you again for the incredible work. This project has been very beneficial to me and my organization.

@tdunning (Owner) commented

OK.

I think I understand better what you are getting at.

You are correct that the current algorithm makes no distinction between a cluster of 100,000 data points with a variety of values, a cluster of 100,000 data points all with the same value, and a single data point with a high weight.

This is absolutely true. The key question is whether this is a feature or a bug. From the point of view that a sketch is purposely losing information, this could be a feature. From the point of view that essentially nobody inserts weighted points anyway, it could be viewed as moot.

But there is precedent for handling this. As you suggest, we can mark clusters according to whether they incorporate samples with non-identical values. Individual samples, regardless of weight, would carry the flag indicating that they have a single value. Merging single-value clusters could preserve this characteristic if the means are the same. All other kinds of merges would result in a cluster without the special flag. As it stands, we essentially have a flag for this implicitly in that singleton clusters are known to have a single value.
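As a tiny sketch of that merge rule (illustrative only, not library code):

// the merged cluster keeps the single-value flag only when both inputs carry it
// and their means coincide; every other merge drops the flag
static boolean mergedIsSingleValue(boolean aSingleValue, double aMean,
                                   boolean bSingleValue, double bMean) {
    return aSingleValue && bSingleValue && aMean == bMean;
}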

The interpolation algorithms can already handle this case by simply extending the current singleton cases.

I am a bit dubious about the positive impact here, though. The case of a single point with weight so large that the merging algorithm won't touch it will absolutely work better. A mixed continuous/discrete distribution, however, won't see much improvement. The reason is that when you look into a digest constructed with lots of repeated points, you see centroids that look like this:

{Centroid@1222} "Centroid{centroid=18.651162790697672, count=43}"
{Centroid@1223} "Centroid{centroid=19.0, count=50}"
{Centroid@1224} "Centroid{centroid=476200.42857142846, count=42}"
{Centroid@1225} "Centroid{centroid=1000000.0, count=40}"
{Centroid@1226} "Centroid{centroid=1000000.0, count=40}"

The data here have lots of points with a value of exactly 19 and lots with a value of exactly 10^6. What we see are transitional clusters like 1222 and 1224 and one or more "pure" clusters like 1223, 1225 and 1226. When interpolating, we won't know that cluster 1223 contains pure data, but the fact is that half of the mass of those samples is split between 1222 and 1224 anyway. That means that understanding that 1223 contains only a single value won't really fix more than half the problem.

On the other hand, once we get to the middle cluster, 1225, we are interpolating between two clusters at the same location, so everything is just fine even without understanding what is going on in terms of unique values.

In practice, if we have a lot of repeated samples, then we will actually get a number of clusters with the same mean. There will be some error on the shoulders, but most of the interpolation will be between clusters with identical means. The errors are relatively small.

Moreover, a diversity flag wouldn't help much because, during the early phases, some of the repeated samples would inevitably get merged with other values and pure clusters would only appear later. This means that even with a diversity flag, the problem of shoulder clusters would still exist.

Does this make sense?

@tdunning (Owner) commented

Regarding the handling of interpolation at terminal clusters, I experimented with your approach and found that it gave slightly less accurate results.

The basic question is whether the min or max value should be considered to be a singleton cluster in its own right or whether it should be an observation about the largest sample in the first or last cluster. The way that clusters are currently constructed makes the second interpretation more accurate. As such, half of the weight of a terminal cluster is assumed to be on either side of the mean and one sample from the weight on the outer side is treated as a singleton. It is handled this way because that singleton contributed to the mean of the terminal cluster.

If the min and max were treated as forced singletons, this could be done differently. In that case, the min or max samples would not contribute to the mean of the first or last cluster and your code would be correct. Doing that, however, would require changes to the insertion path. Essentially, what would happen is that if the new sample is larger than the old max or smaller than the old min, the old max or min would have to be inserted into the last (or first) cluster and the value of max (or min) would be updated. I considered making this change, but the K_2 and K_3 scale functions both force a singleton at the edge of the digest anyway, so it didn't seem like a big deal beyond making slightly better use of the bits used by the min and max values.
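A minimal, self-contained sketch of that alternative insertion rule (an assumption-laden illustration, not how the library actually works; the class and field names are invented, and the symmetric handling of min is omitted):

final class ForcedSingletonSketch {
    static final class LastCluster {
        double mean;
        long weight;

        void add(double x, long w) {
            // fold x into the cluster's running mean
            weight += w;
            mean += w * (x - mean) / weight;
        }
    }

    double max = Double.NEGATIVE_INFINITY;
    final LastCluster last = new LastCluster();

    void add(double x) {
        if (x > max) {
            double oldMax = max;
            max = x;                        // the new sample becomes the forced singleton at the edge
            if (oldMax != Double.NEGATIVE_INFINITY) {
                last.add(oldMax, 1);        // the previous max now contributes to the last cluster's mean
            }
        } else {
            last.add(x, 1);                 // ordinary samples go into the last cluster as usual
        }
    }
}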

Base automatically changed from master to main January 18, 2021 06:15