Improve percentile calculations #112

Open
sclasen opened this issue Aug 28, 2013 · 8 comments

Comments

@sclasen

sclasen commented Aug 28, 2013

It would be great to have this instead of, or in addition to, the raw percentiles over the reporting period.
Perhaps use perks: https://github.com/bmizerany/perks

@ryandotsmith
Owner

Just out of curiosity, do you think there is a user benefit to having this? I know that it is more efficient from an engineering perspective; I'm just not sure it adds any value for users of l2met.

@ryandotsmith
Owner

Also, I would be happy to review a PR if you wanted to take a stab at using the library.

@sclasen
Author

sclasen commented Aug 28, 2013

Unless I am misunderstanding the l2met code (a distinct possibility!), it looks like the percentiles for a given reporting period are in no way related to previous periods. Is this the case?

I guess I should have added more detail to the request. This would involve storing the perks structures in Redis, pulling them back out, and running the current period's measurements through them, giving you a set of percentiles that represents more than just the current measurement period.
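
A minimal sketch of that round trip, written in Python for brevity rather than l2met's Go. Everything here is an assumption for illustration: a plain dict stands in for Redis, a JSON list of all samples stands in for the perks structure, and the key name and helper names are hypothetical. A real implementation would store a bounded-size summary instead of every measurement.

```python
import json
import math

store = {}  # stand-in for Redis: key -> serialized state (assumption)

def load_state(key):
    """Fetch the running sample list for a metric, or start empty."""
    raw = store.get(key)
    return json.loads(raw) if raw is not None else []

def save_state(key, samples):
    """Write the updated state back to the store."""
    store[key] = json.dumps(samples)

def percentile(sorted_samples, p):
    """Nearest-rank percentile of an already sorted list; p in (0, 100]."""
    rank = max(math.ceil(p * len(sorted_samples) / 100), 1)
    return sorted_samples[rank - 1]

def process_period(key, measurements, p=99):
    """Merge this period's measurements into the stored state and
    return a percentile that covers all periods seen so far."""
    samples = load_state(key)
    samples.extend(measurements)
    samples.sort()
    save_state(key, samples)
    return percentile(samples, p)

# Two consecutive reporting periods for one hypothetical metric key.
key = "l2met:web.latency"
print(process_period(key, [12, 15, 11, 14]))
print(process_period(key, [13, 250, 12, 16]))
```

The second call reports a percentile over both periods, because the state written by the first call is read back before the new measurements are added.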

Does that make sense?

@ryandotsmith
Owner

I see. I wonder, though: do you really want to carry your statistics across time intervals? For example, let's say there was a strange instance failure at t=0 which caused your latency metrics to spike. Then, at t=1, the problem went away and your latencies returned to normal values. Currently, l2met computes statistics for each period in isolation from every other period. What you suggest would aggregate different time periods together. Thus, the incident you had at t=0 would impact your metrics for t=1, which might make understanding what happened much more difficult.
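
To make the t=0 / t=1 scenario concrete, here is a small Python sketch. The latency values are invented for illustration, and the nearest-rank percentile is an assumption, not necessarily how l2met computes its percentiles.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]

# Hypothetical latencies in ms: an incident at t=0, a healthy period at t=1.
period_0 = [20] * 90 + [900] * 10
period_1 = [20] * 100

# Per-period statistics: t=1 is computed from t=1 data only.
isolated_p99_t1 = percentile(period_1, 99)

# Carried-over statistics: t=0 measurements are pooled into t=1.
pooled_p99_t1 = percentile(period_0 + period_1, 99)

print("isolated p99 at t=1:", isolated_p99_t1)
print("pooled p99 at t=1:", pooled_p99_t1)
```

With these numbers the isolated p99 at t=1 is 20 ms, while the pooled p99 is still 900 ms, because 10 of the 200 pooled samples come from the incident.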

@sclasen
Author

sclasen commented Aug 28, 2013

So in the context of alerting, I think having periods related to each other is not desirable. But for understanding the long-term performance characteristics of a service, you would want to consider measurements across periods.

This, I think, is one reason why all these statistical methods for calculating percentiles over an unbounded stream have been developed :)

I think your example is correct for t=0 and t=1, but by the time you got to t=1000 your outlier would have much less (but, importantly, still measurable) impact. Similarly, if you were at t=1000 when the outlier happened, it would have much less impact at t=1001, I think, if these methods were used.

perks is based on one paper, and the percentiles in Coda Hale's Metrics are based on others.

perks:

http://www.cs.rutgers.edu/~muthu/bquant.pdf

Coda Hale's Metrics:

http://www.cs.umd.edu/~samir/498/vitter.pdf

from https://github.com/codahale/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/UniformReservoir.java

and

http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf

from https://github.com/codahale/metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/ExponentiallyDecayingReservoir.java
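
For a feel of the simplest of these approaches, here is a rough Python sketch of uniform reservoir sampling (Vitter's Algorithm R), the technique behind the first Metrics link. The class and method names are hypothetical and this is not a port of the Java code. It keeps a fixed-size, uniformly random sample of an unbounded stream and reads percentiles from that sample. It does not cover the biased-quantile summary used by perks or the forward-decay weighting used by the exponentially decaying reservoir.

```python
import math
import random

class UniformReservoirSketch:
    """Fixed-size uniform sample of a stream (Algorithm R)."""

    def __init__(self, size, rng=None):
        self.size = size
        self.count = 0          # total values seen so far
        self.values = []        # current sample, at most `size` items
        self.rng = rng or random.Random()

    def update(self, value):
        self.count += 1
        if len(self.values) < self.size:
            # Fill phase: keep everything until the reservoir is full.
            self.values.append(value)
        else:
            # Keep the new value with probability size / count by
            # drawing a slot index uniformly from [0, count).
            j = self.rng.randrange(self.count)
            if j < self.size:
                self.values[j] = value

    def percentile(self, p):
        """Nearest-rank percentile of the current sample; p in (0, 100]."""
        ordered = sorted(self.values)
        rank = max(math.ceil(p * len(ordered) / 100), 1)
        return ordered[rank - 1]

# Feed measurements from many reporting periods through one reservoir.
reservoir = UniformReservoirSketch(size=100, rng=random.Random(1))
for latency in range(10_000):
    reservoir.update(latency)
print(reservoir.count, len(reservoir.values), reservoir.percentile(50))
```

Memory stays bounded at `size` values no matter how many periods are fed in, which is what makes it plausible to store the state between reporting periods.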

Thoughts?

@ryandotsmith
Owner

Can't deny math. Let's keep this issue open and see what happens.

@ryandotsmith
Owner

@sclasen Hope you don't mind that I wordsmithed your issue title and description...

@sclasen
Author

sclasen commented Aug 28, 2013

Looks good!
