Interpolate merging digest percentiles by centroid weight #110
Conversation
Hi,
Your suggestion is reasonable. But there are some very serious subtleties
involved here that I am only barely beginning to understand.
The key issue, as I understand it so far, is whether your policy encourages
biased assignment of new points. The best illustration of what happens when
there is bias is that the errors produced by the merging digest are larger than
expected. This apparently has to do with the fact that merging is delayed
while buffering incoming data and then adding things by a sequential scan
over sorted centroids. That sounds great, but it causes a bias in how
points are added that tends to make centroids start too close to the
median and then slide outward towards the extremes. This means that
centroids are constructed from data that is smeared over a longer range than
we want. That leads to higher estimation errors than expected.
Similar (but smaller) slightly bad things happen when you have a preference
for attaching new data points to big or to small centroids.
So, I think your idea might be useful. I just can't say.
You should also look at the k-size limiting ideas. The idea is that you map
the beginning and ending values of q onto a non-linear k-scale. The size of
each cluster on the k-scale then is constrained to be between 0.5 and 1.
This gives a very natural limit on cluster size. It is also pretty
easy to show how to derive the k-scale from a desired cluster size limit
expressed in terms of q. This k function can be of the form asin(2q-1) to
get a sqrt(q(1-q)) size limit.
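To make the k-scale idea concrete, here is a minimal sketch of the arcsine form mentioned above and the cluster-size limit it implies (illustrative only; the exact normalization used in the library may differ):
// k(q) maps quantiles onto a non-linear scale; with this normalization the
// whole range 0 <= q <= 1 spans delta/2 units of k, and clusters are limited
// to at most one unit of k
static double k(double q, double delta) {
    return delta / (2 * Math.PI) * Math.asin(2 * q - 1);
}

// the implied size limit: a cluster spanning one unit of k covers at most
// about dq/dk = (2 * pi / delta) * sqrt(q * (1 - q)) of the total weight,
// which is the sqrt(q(1-q)) limit described above
static double maxClusterSize(double q, double delta) {
    return 2 * Math.PI / delta * Math.sqrt(q * (1 - q));
}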
Regarding behavior at the extremes, I have recently developed some methods
that use a different k-function to get a q(1-q) proportional limit on
centroid size. A particular virtue of this alternative is that it is
guaranteed to have singleton centroids at the min and max values of the
distribution. This function also gives tighter error bounds (as long as the
centroids actually get the right samples).
Anyway, these alternative formulations make it harder to put in a
preference of the sort you describe, but I think that the k-scale approach
may give something very similar.
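For comparison, here is a similarly hedged sketch of a log-odds style k-function that yields the q(1-q) proportional limit just described (the real scale functions use a specific normalizer that depends on n and the compression; the one here is a stand-in):
// k(q) grows like the log-odds of q; the normalizer converts this into
// units where each cluster spans at most one unit of k
static double kLogOdds(double q, double normalizer) {
    return Math.log(q / (1 - q)) / normalizer;
}

// the implied size limit: dq/dk = normalizer * q * (1 - q), which vanishes
// at q = 0 and q = 1 and therefore forces singleton centroids at the
// minimum and maximum of the distribution
static double maxClusterSizeLogOdds(double q, double normalizer) {
    return normalizer * q * (1 - q);
}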
I would love any input you have on these ideas. Please note that I am
updating the paper on the master branch.
…On Sun, Jun 10, 2018 at 2:28 PM pyckle ***@***.***> wrote:
Hi Ted,
Thanks for the great work on the project. The code is well documented, and
easy to understand considering the complexity of the problem it is solving.
I have a suggestion that perhaps improves accuracy for datasets that
contain data points with very large weights. In addition to interpolating
by the percentile's distance to the nearest centroids, take into account
the weight of the centroid.
In the case of the endpoint, we can estimate the weight of the min/max to
be the minimum of sin^2(pi*(1/compression)/2) and lowestCentroidWeight. The
former is the inverse of the quantile to notional index value for the last
centroid.
In cases where the centroids fit the quantile notional function well, this
won't make much difference because adjacent centroids typically have
relatively similar weights. In datasets which have elements with large
weights, this can substantially improve accuracy in the merging digest.
On a related point, while changing the code, I realized I don't quite
understand what's going on at the interpolation by the max:
From:
t-digest/core/src/main/java/com/tdunning/math/stats/MergingDigest.java
<https://github.com/tdunning/t-digest/blob/b2d0bd2481ce12803bbe05c78f226b87ac94576e/core/src/main/java/com/tdunning/math/stats/MergingDigest.java#L701>
Lines 701 to 703 in b2d0bd2
// weightSoFar = totalWeight - weight[n-1]/2 (very nearly)
// so we interpolate out to max value ever seen
double z1 = index - totalWeight - weight[n - 1] / 2.0;
double z2 = weight[n - 1] / 2 - z1;
return weightedAverage(mean[n - 1], z1, max, z2);
It seems z1 will always be negative, and z2 will always be positive.
Any feedback on the changes would be appreciated. Cheers.
Hi,
I think you may have misunderstood the suggestion. I was not referring to interpolation between the weights during the merge step, although this is a fascinating idea - I'll elaborate below. I was only referring to interpolating already compressed/merged centroids to generate quantiles. This change can be done independently of the centroid merging/compression algorithm. Have a look at the diff - it should make this clearer.
Regarding biasing the merging algorithm to have different heuristics with which to merge the incoming centroids - I tend to think you're correct that any heuristics that do not enforce k-size during the merge function will cause unpredictable inaccuracies. And it should trivially follow that these heuristics will cause this biased behavior in many datasets. I presume this is not an acceptable tradeoff for better accuracy in specific use cases.
One idea that could perhaps make this feasible is adding logic to split the centroids once they become bigger than the optimal k-size. The splitting, in fact, could be done in a way to make the centroids better adhere to the k-size limits as well as offer better flexibility in where new centroids are merged. I suspect this is worth spending some time thinking about.
I'm looking forward to seeing the math for the q(1-q) k-size function and the paper updates. It's very interesting.
On Mon, Jun 11, 2018 at 1:39 AM pyckle ***@***.***> wrote:
I think you may have misunderstood the suggestion.
Sounds like I did.
I was not referring to interpolation between the weights during the merge
step [but] to interpolating already compressed/merged centroids to generate
quantiles.
Ah.
I think that this is already done effectively. See the latest unreleased
code and paper.
I think that this diagram conveys the method. I don't have time right now
to add explanatory text, but you can see how weights are involved:
[image: image.png]
This change can be done independently of the centroid merging/compression
algorithm. Have a look at the diff - it should make this clearer.
I will look.
In the meantime, have you tested this algorithm for accuracy?
Regarding biasing the merging algorithm to have different heuristics with
which to merge the incoming centroids - I tend to think you're correct that
any heuristics that do not enforce k-size during the merge function will
cause unpredictable inaccuracies.
Actually, I was saying that even algorithms that always enforce k-size can
have variable accuracy.
One idea that could perhaps make this feasible is adding logic to split
the centroids once they become bigger than the optimal k-size.
None of the implementations ever produce a centroid that exceeds k-size.
That is absolutely key to the whole thing because it avoids all need for
splitting.
The splitting, in fact, could be done in a way to make the centroids
better adhere to the k-size limits as well as offer better flexibility in
where new centroids are merged. I suspect this is worth spending some time
thinking about.
Yes. It is. But splitting is incredibly difficult to do in a way that is
guaranteed to meet any accuracy bounds because you have lost the resolution
you need to split by doing the combining in the first place.
Hi,
The image that you sent didn't get attached. Would you mind sending it again? In the meantime, I made some visualizations of what the code I wrote does:
Instead of interpolating linearly, the code I wrote interpolates taking centroid weight into account. This has the advantage of making the algorithm much more accurate when values have weights larger than the limit implied by the k.to.q() function. This is demonstrated in the Java test cases. I am using t-digest with datasets of arbitrary weights, some of which could be individual points that violate the limits.
However, this change has the unfortunate consequence of increasing errors at extreme percentiles, where adjacent centroid weights differ more in relative terms. The increased error is bounded by the difference between the centroids that are interpolated. For the percentiles and datasets which I am using, this error is irrelevant. As such, I'm no longer convinced that it's worth the trade-off for general use, but it is useful for this specific use case, and presumably for anybody else who has datasets with elements that carry very heavy weights. If you think this idea has some merit, but only for some use cases, I'm also happy to refactor the code to add this as an extra method, or to expose the interpolation algorithm as a parameter to quantile().
Two last things that I hope are of some use. There is an existing bug in the interpolation code between weight[n-1] and max. I fixed it here:
I think the charts better visually represent centroids than the ones that linear-interpolation.r generates. Those charts led me to believe the centroids have a start/end point and a slope, whereas the sample charts I generated show that each centroid is a point, display its weight, and connect the centroids like the interpolation in the code.
Thanks for your time.
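For readers following along, here is a minimal, hypothetical sketch of weight-aware quantile interpolation of the kind described in the comment above. It is not the code from this PR's diff; the exact weighting rule is an illustrative assumption, and centroids are given as sorted parallel arrays in the style of MergingDigest:
static double weightAwareQuantile(double[] mean, double[] weight, double q) {
    double totalWeight = 0;
    for (double w : weight) {
        totalWeight += w;
    }
    double index = q * totalWeight;
    double weightSoFar = 0;
    for (int i = 0; i < mean.length - 1; i++) {
        // each centroid is treated as sitting at the middle of its own weight
        double leftCenter = weightSoFar + weight[i] / 2;
        double rightCenter = weightSoFar + weight[i] + weight[i + 1] / 2;
        if (index >= leftCenter && index <= rightCenter) {
            // distances are normalized by each centroid's weight, so a very
            // heavy centroid (e.g. one high-weight sample) pulls the estimate
            // toward its own mean over a wider range of q
            double dLeft = (index - leftCenter) / weight[i];
            double dRight = (rightCenter - index) / weight[i + 1];
            double t = dLeft / (dLeft + dRight);
            return mean[i] + t * (mean[i + 1] - mean[i]);
        }
        weightSoFar += weight[i];
    }
    // indices outside the centroid centers clamp to the nearest end
    return index < totalWeight / 2 ? mean[0] : mean[mean.length - 1];
}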
Picking this up again. The current version of the paper has an extensive section on how I do the interpolation. Revamping this has improved accuracy substantially. Could you take a look at section 2.9 and beyond?
Notably, interpolation is now weight aware.
Hi Ted,
Sorry for the delay. I've been very busy recently and just got a chance to start looking at the changes you've made. I will try to find a few more hours sometime soon to verify my findings.
I read the paper. The method of interpolating as described appears to be far better than what I had come up with originally. Fantastic work, and the clear drawings make it very easy to understand.
However, this doesn't address my original issue. The issue I had is an "apparent" limitation of the algorithm: if you have a weighted dataset in which one sample has an extreme weight compared to the rest, the algorithm will interpolate to the nearby centroids. Example: digest.add(1, 2);
This change does not address this issue. I think the situation it assists with is actually probably an even less common use case. It will assist with preventing incorrect interpolation on tiny datasets, since centroids will always get a weight >= 1 as datasets get bigger.
One elegant way to address this situation within the current design would be to use the sign bit on the weight in each centroid. The sign bit can be set to 1 if there's exactly one sample at this centroid, and 0 if there is more than one. This complicates the code, but would have minimal performance impact, and should handle this case. If I have some spare time, I will perhaps try to write this myself.
Also, the logic for the end cases seems incorrect as coded. See the test case and possible fix in this commit: pyckle@a367ab9. I think the root of the issue is that the code does not correctly take into account that a weight of 1 from the min or max is included in the first/last centroid. I have not tested my fix for this, nor am I confident it is correct. But it doesn't divide by zero.
I have not yet had time to review the interpolation for the middle cases in depth, but it appears correct at first glance. I will try to find time to do this soon.
Thank you again for the incredible work. This project has been very beneficial to me and my organization.
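Purely as an illustration of the sign-bit idea just described (hypothetical code, not part of this project), one way to pack a "singleton" flag into the stored weight looks like this:
final class SingletonFlaggedWeight {
    // a freshly inserted point is a singleton, so its weight is stored negated
    static double encodeSingleton(double weight) {
        return -weight;
    }

    static double weightOf(double stored) {
        return Math.abs(stored);
    }

    static boolean isSingleton(double stored) {
        return stored < 0;
    }

    // any merge produces a centroid with more than one sample, so the
    // combined weight is stored as a plain positive value
    static double merge(double storedA, double storedB) {
        return weightOf(storedA) + weightOf(storedB);
    }
}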
OK. I think I understand better what you are getting at. You are correct that the current algorithm makes no distinction between a cluster of 100,000 data points with a variety of values, versus a cluster with 100,000 data points all with the same value, versus a single data point with a high weight. This is absolutely true.
The key question is whether this is a feature or a bug. From the point of view that a sketch is purposely losing information, this could be a feature. From the point of view that essentially nobody inserts weighted points anyway, it could be viewed as moot.
But there is precedent for handling this. As you suggest, we can mark clusters according to whether they incorporate samples with non-identical values. Individual samples, regardless of weight, would carry the flag indicating that they have a single value. Merging single-value clusters could preserve this characteristic if the means are the same. All other kinds of merges would result in a cluster without the special flag. As it stands, we essentially have a flag for this implicitly, in that singleton clusters are known to have a single value. The interpolation algorithms can already handle this case by simply extending the current singleton cases.
I am a bit dubious about the positive impact here, though. The case of a single point with weight so large that the merging algorithm won't touch it will absolutely work better. A mixed continuous/discrete dataset, however, won't see much improvement. The reason is that when you look into a digest constructed with lots of repeated points, you see centroids that look like this:
The data here have lots of points with a value of exactly 19 and lots with a value of exactly 10^6. What we see are transitional clusters like 1222 and 1224 and one or more "pure" clusters like 1223, 1225 and 1226. When interpolating, we won't know that cluster 1223 contains pure data, but the fact is that half of the mass of those samples is split between 1222 and 1224 anyway. That means that understanding that 1223 contains only a single value won't really fix more than half the problem. On the other hand, once we get to the middle cluster, 1225, we are interpolating between two clusters at the same location, so everything is just fine even without understanding what is going on in terms of unique values.
In practice, if we have a lot of repeated samples, then we will actually get a number of clusters with the same mean. There will be some error on the shoulders, but most of the interpolation will be between clusters with identical means. The errors are relatively small. Moreover, a diversity flag wouldn't help much because during the early phases some of the repeated samples would inevitably get merged with other values, and pure clusters would only appear later. This means that even with a diversity flag, the problem of shoulder clusters would still exist.
Does this make sense?
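As a rough, hedged illustration of these transitional versus pure clusters (not the actual data behind the cluster numbers quoted above), one can feed a digest many exact repeats of two values and inspect the centroids near the transition:
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.MergingDigest;

public class RepeatedValuesDemo {
    public static void main(String[] args) {
        MergingDigest digest = new MergingDigest(100);
        for (int i = 0; i < 100_000; i++) {
            digest.add(19);   // many identical low values
            digest.add(1e6);  // many identical high values
        }
        for (Centroid c : digest.centroids()) {
            // pure clusters have a mean of exactly 19 or exactly 1e6;
            // transitional clusters near the boundary mix both values
            System.out.printf("mean=%.3f count=%d%n", c.mean(), c.count());
        }
    }
}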
Regarding the handling of interpolation at terminal clusters, I experimented with your approach and found that it gave slightly less accurate results.
The basic question is whether the min or max value should be considered to be a singleton cluster in its own right or whether it should be an observation about the largest sample in the first or last cluster. The way that clusters are currently constructed makes the second interpretation more accurate. As such, half of the weight of a terminal cluster is assumed to be on either side of the mean, and one sample from the weight on the outer side is treated as a singleton. It is handled this way because that singleton contributed to the mean of the terminal cluster.
If the min and max were treated as forced singletons, this could be done differently. In that case, the min or max samples would not contribute to the mean of the first or last cluster, and your code would be correct. To do that, however, would require changes to the insertion path. Essentially, what would happen is that if a new sample is larger than the old max or smaller than the old min, the old max or min would have to be inserted into the last (or first) cluster and the value of max (or min) would be updated.
I considered making this change, but the K_2 and K_3 scale functions both force a singleton at the edge of the digest anyway, so it didn't seem like a big deal other than making slightly better use of the bits used by the min and max values.
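For what it's worth, here is a hedged numeric sketch of that second interpretation (half of the terminal cluster's weight on either side of its mean, with one sample of the outer half pinned at max). It illustrates the reasoning rather than reproducing the library's actual code, and it assumes the caller only asks about indices in the upper tail:
static double upperTailValue(double index, double totalWeight,
                             double lastMean, double lastWeight, double max) {
    double outerHalf = lastWeight / 2;                    // weight assumed above the last mean
    double intoTail = index - (totalWeight - outerHalf);  // how far past the last mean we are
    if (intoTail >= outerHalf - 1) {
        return max;                                       // inside the implicit singleton at max
    }
    // interpolate between the last mean and max over the remaining weight
    return lastMean + (max - lastMean) * intoTail / (outerHalf - 1);
}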