
Interpolate merging digest percentiles by centroid weight #110

Open · wants to merge 2 commits into main

Conversation

@pyckle commented Jun 10, 2018

Hi Ted,

Thanks for the great work on the project. The code is well documented, and easy to understand considering the complexity of the problem it is solving.

I have a suggestion that perhaps improves accuracy for datasets that contain data points with very large weights. In addition to interpolating by the percentile's distance to the nearest centroids, take into account the weight of the centroid.

In the case of the endpoints, we can estimate the weight of the min/max as the minimum of sin^2(pi/(2*compression)) and the weight of the first/last centroid. The former is the inverse of the quantile-to-notional-index function for the last centroid.

In cases where the centroids fit the quantile notional function well, this won't make much difference because adjacent centroids typically have relatively similar weights. In datasets which have elements with large weights, this can substantially improve accuracy in the MergingDigest.
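To make the suggestion concrete, here is a minimal sketch of the idea (this is not the code in the PR; the class and parameter names are purely illustrative):

// Plain linear interpolation between two adjacent centroids would be
// leftMean + t * (rightMean - leftMean); this weight-aware version pulls the
// estimate toward the heavier centroid.
final class WeightAwareInterpolation {
    static double interpolate(double targetIndex,
                              double leftIndex, double leftMean, double leftWeight,
                              double rightIndex, double rightMean, double rightWeight) {
        // fraction of the way from the left centroid to the right one in index space
        double t = (targetIndex - leftIndex) / (rightIndex - leftIndex);
        // scale each side's influence by its centroid weight
        double wLeft = (1 - t) * leftWeight;
        double wRight = t * rightWeight;
        return (leftMean * wLeft + rightMean * wRight) / (wLeft + wRight);
    }
}

For example, with a left centroid of weight 1000 at mean 5 and a right centroid of weight 1 at mean 100, a target index halfway between them gives roughly 5.1 with the weight-aware rule, whereas plain linear interpolation gives 52.5.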

On a related point, while changing the code, I realized I don't quite understand what's going on in the interpolation near the max:

double z1 = index - totalWeight - weight[n - 1] / 2.0;
double z2 = weight[n - 1] / 2 - z1;
return weightedAverage(mean[n - 1], z1, max, z2);

It seems z1 will always be negative, and z2 will always be positive.
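A quick check, assuming index <= totalWeight and weight[n-1] >= 1: z1 = index - totalWeight - weight[n-1]/2 <= -weight[n-1]/2 < 0, and therefore z2 = weight[n-1]/2 - z1 >= weight[n-1] > 0.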

Any feedback on the changes would be appreciated. Cheers.

@tdunning (Owner) commented Jun 10, 2018 via email

@pyckle (Author) commented Jun 10, 2018

Hi,

I think you may have misunderstood the suggestion. I was not referring to interpolation between the weights during the merge step, although that is a fascinating idea - I'll elaborate below. I was only referring to interpolating the already compressed/merged centroids to generate quantiles. This change can be made independently of the centroid merging/compression algorithm. Have a look at the diff - it should make this clearer.

Regarding biasing the merging algorithm to have different heuristics with which to merge the incoming centroids - I tend to think you're correct that any heuristics that do not enforce k-size during the merge function will cause unpredictable inaccuracies. And it should trivially follow that these heuristics will cause this biased behavior in many datasets. I presume this is not an acceptable tradeoff for better accuracy in specific use cases.

One idea that could perhaps make this feasible is adding logic to split centroids once they become bigger than the optimal k-size. The splitting, in fact, could be done in a way that makes the centroids adhere better to the k-size limits as well as offering more flexibility in where new centroids are merged. I suspect this is worth spending some time thinking about.

I'm looking forward to seeing the math for the q(1-q) k-size function/the paper updates. It's very interesting.

@tdunning (Owner) commented Jun 11, 2018 via email

@pyckle (Author) commented Jun 13, 2018

Hi,

The image that you sent didn't get attached. Would you mind sending it again?

In the meantime, I made some visualizations of what the code I wrote does:
https://github.com/pyckle/t-digest/blob/master/docs/t-digest-paper/weighted-vs-linear-interpolation.pdf

Instead of interpolating linearly, the code I wrote interpolates taking centroid weight into account. This has the advantage of making the algorithm much more accurate when values with weights larger than the k.to.q() function limit are present. This is demonstrated in the Java test cases. I am using t-digest with datasets of arbitrary weights, some of which can contain individual points that violate those limits.

However, this change has the unfortunate consequence of increasing errors at extreme percentiles, where adjacent centroid weights differ more. The increased error is bounded by the difference between the centroids being interpolated. For the percentiles and datasets I am using, this error is irrelevant.

As such, I'm no longer convinced that it's worth the trade-off for general use, but it is useful for this specific use case, and presumably for anybody else whose datasets contain elements with very heavy weights. If you think this idea has merit, but only for some use cases, I'm also happy to refactor the code to expose it as an extra method, or as an interpolation-algorithm parameter to quantile(); a rough sketch of what that could look like follows.
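For example, the parameter could look something like this (purely hypothetical; neither the enum nor such an overload exists in t-digest today):

enum Interpolation {
    LINEAR,   // current behaviour: interpolate by index distance only
    WEIGHTED  // proposed: also take centroid weights into account
}
// illustrative call: digest.quantile(0.99, Interpolation.WEIGHTED)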

Two last things that I hope are of some use.

There is an existing bug in the interpolation code between weight[n-1] and max. I fixed it here:
pyckle@41c0066
It is also fixed in this pull request. If this pull request is not accepted, I will open a separate one for that fix. The current interpolation code for max always produces a negative z1 and a z2 greater than 1.
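For readers without the commit at hand, one plausible shape of the correction is sketched below (plain linear interpolation between mean[n-1] and max, with the weighted average written out explicitly; this is not necessarily what the linked commit does):

final class UpperTailSketch {
    // mean[n-1] sits at index position totalWeight - weight[n-1]/2; max sits at totalWeight
    static double quantileAboveLastCentroid(double index, double totalWeight,
                                            double[] mean, double[] weight, double max) {
        int n = mean.length;
        double z1 = index - (totalWeight - weight[n - 1] / 2.0); // distance past the last centroid's midpoint
        double z2 = totalWeight - index;                         // distance remaining to max
        // the point nearer to index gets the larger weight, i.e. the distance to the other point
        return (mean[n - 1] * z2 + max * z1) / (z1 + z2);
    }
}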

I also think these charts visually represent the centroids better than the ones that linear-interpolation.r generates. Those charts led me to believe that each centroid has a start/end point and a slope, whereas the charts I generated show each centroid as a point, display its weight, and connect the points the way the interpolation in the code does.

Thanks for your time.

@tdunning (Owner) commented Feb 9, 2019

Picking this up again.

The current version of the paper has an extensive section on how I do the interpolation. Revamping this has improved accuracy substantially.

Could you take a look at section 2.9 and beyond?

@tdunning (Owner) commented

Notably, interpolation is now weight-aware.

@pyckle (Author) commented Feb 28, 2019

Hi Ted,

Sorry for the delay. I've been very busy recently and just got a chance to start looking at the changes you've made. I will try and find a few more hours sometime soon to verify my findings.

I read the paper. The method of interpolating as described appears to be far better than what I had come up with originally. Fantastic work, and the clear drawings make it very easy to understand.

However, this doesn't address my original issue. The issue I had is an 'apparent' limitation of the algorithm: if you have a weighted dataset with one sample whose weight is extreme compared to the rest of the data, the quantile estimate will still interpolate toward the nearby centroids. Example:

MergingDigest digest = new MergingDigest(100); // compression value chosen only for illustration
digest.add(1, 2);         // small weight
digest.add(50, 1000000);  // one sample with an extreme weight
digest.add(100, 2);       // small weight

This change does not address that issue. The situation it does help with is probably an even less common use case: it prevents incorrect interpolation in tiny datasets, since centroids always end up with weight >= 1 as datasets grow. One elegant way to address this situation within the current design would be to use the sign bit of the weight in each centroid: set it to 1 if there is exactly one sample at the centroid, and 0 if there is more than one. This complicates the code, but it would have minimal performance impact and should handle this case. If I have some spare time, I may try to write this myself.
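A rough sketch of that sign-bit encoding (illustrative names only; this is not part of t-digest):

final class SignBitFlag {
    // Mark a stored weight as belonging to a centroid that holds exactly one sample.
    static double markSingleton(double weight) {
        return -Math.abs(weight);
    }
    // A negative stored weight means "exactly one sample at this centroid".
    static boolean isSingleton(double storedWeight) {
        return storedWeight < 0;
    }
    // Recover the actual (positive) weight regardless of the flag.
    static double actualWeight(double storedWeight) {
        return Math.abs(storedWeight);
    }
}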

Also, the logic for the end cases seems incorrect as coded. See test case and possible fix in this commit: pyckle@a367ab9

I think the root of the issue is that the code does not correctly take into account that a weight of 1 from the min or max is included in the first/last centroid. I have not tested my fix for this, nor am I confident it is correct. But at least it doesn't divide by zero.

I have not yet had time to review the interpolation for the middle cases in the datasets in depth, but it appears correct at first glance. I will try and find time to do this soon.

Thank you again for the incredible work. This project has been very beneficial to me and my organization.

@tdunning (Owner) commented

OK.

I think I understand better what you are getting at.

You are correct that the current algorithm makes no distinction between a cluster of 100,000 data points with a variety of values, a cluster of 100,000 data points all with the same value, and a single data point with a high weight.

This is absolutely true. The key question is whether this is a feature or a bug. From the point of view that a sketch is purposely losing information, this could be a feature. From the point of view that essentially nobody inserts weighted points anyway, it could be viewed as moot.

But there is precedent for handling this. As you suggest, we can mark clusters according to whether they incorporate samples with non-identical values. Individual samples, regardless of weight, would carry the flag indicating that they have a single value. Merging single-value clusters could preserve this characteristic if the means are the same. All other kinds of merges would result in a cluster without the special flag. As it stands, we essentially have a flag for this implicitly in that singleton clusters are known to have a single value.
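As a tiny sketch of that merge rule (illustrative only, not library code):

// the merged cluster keeps the single-value flag only when both inputs carry it
// and their means coincide; every other merge drops the flag
static boolean mergedIsSingleValue(boolean aSingleValue, double aMean,
                                   boolean bSingleValue, double bMean) {
    return aSingleValue && bSingleValue && aMean == bMean;
}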

The interpolation algorithms can already handle this case by simply extending the current singleton cases.

I am a bit dubious about the positive impact here, though. The case of a single point with weight so large that the merging algorithm won't touch it will absolutely work better. A mixed continuous/discrete distribution, however, won't see much improvement. The reason is that when you look into a digest constructed with lots of repeated points, you see centroids that look like this:

{Centroid@1222} "Centroid{centroid=18.651162790697672, count=43}"
{Centroid@1223} "Centroid{centroid=19.0, count=50}"
{Centroid@1224} "Centroid{centroid=476200.42857142846, count=42}"
{Centroid@1225} "Centroid{centroid=1000000.0, count=40}"
{Centroid@1226} "Centroid{centroid=1000000.0, count=40}"

The data here have lots of points with a value of exactly 19 and lots with a value of exactly 10^6. What we see are transitional clusters like 1222 and 1224 and one or more "pure" clusters like 1223, 1225 and 1226. When interpolating, we won't know that cluster 1223 contains pure data, but the fact is that half of the mass of those samples is split between 1222 and 1224 anyway. That means that understanding that 1223 contains only a single value won't really fix more than half the problem.

On the other hand, once we get to the middle cluster, 1225, we are interpolating between two clusters at the same location, so everything is just fine even without understanding what is going on in terms of unique values.

In practice, if we have a lot of repeated samples, then we will actually get a number of clusters with the same mean. There will be some error on the shoulders, but most of the interpolation will be between clusters with identical means. The errors are relatively small.

Moreover, a diversity flag wouldn't help much because, during the early phases, some of the repeated samples would inevitably get merged with other values and pure clusters would only appear later. This means that even with a diversity flag, the problem of shoulder clusters would still exist.

Does this make sense?

@tdunning (Owner) commented

Regarding the handling of interpolation at terminal clusters, I experimented with your approach and found that it gave slightly less accurate results.

The basic question is whether the min or max value should be considered to be a singleton cluster in its own right or whether it should be an observation about the largest sample in the first or last cluster. The way that clusters are currently constructed makes the second interpretation more accurate. As such, half of the weight of a terminal cluster is assumed to be on either side of the mean and one sample from the weight on the outer side is treated as a singleton. It is handled this way because that singleton contributed to the mean of the terminal cluster.

If the min and max were treated as forced singletons, this could be done differently. In that case, the min or max samples would not contribute to the mean of the first or last cluster and your code would be correct. Doing that, however, would require changes to the insertion path. Essentially, what would happen is that if the new sample is larger than the old max or smaller than the old min, the old max or min would have to be inserted into the last (or first) cluster and the value of max (or min) would be updated. I considered making this change, but the K_2 and K_3 scale functions both force a singleton at the edge of the digest anyway, so it didn't seem like a big deal beyond making slightly better use of the bits used by the min and max values.
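A minimal, self-contained sketch of that alternative insertion rule (an assumption-laden illustration, not how the library actually works; the class and field names are invented, and the symmetric handling of min is omitted):

final class ForcedSingletonSketch {
    static final class LastCluster {
        double mean;
        long weight;

        void add(double x, long w) {
            // fold x into the cluster's running mean
            weight += w;
            mean += w * (x - mean) / weight;
        }
    }

    double max = Double.NEGATIVE_INFINITY;
    final LastCluster last = new LastCluster();

    void add(double x) {
        if (x > max) {
            double oldMax = max;
            max = x;                        // the new sample becomes the forced singleton at the edge
            if (oldMax != Double.NEGATIVE_INFINITY) {
                last.add(oldMax, 1);        // the previous max now contributes to the last cluster's mean
            }
        } else {
            last.add(x, 1);                 // ordinary samples go into the last cluster as usual
        }
    }
}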

Base automatically changed from master to main January 18, 2021 06:15