Is per service disk space monitoring possible? #263
Comments
Thanks for the report. That is indeed a doc bug. There is currently no disk-related monitoring done by AppObserver. I will correct the documentation.
Fixed the documentation to reflect reality. Thanks again for catching that. In terms of adding disk monitoring capability to AppObserver, feel free to create a Work Item and it will be looked into. What disk I/O metrics do you want to measure and apply thresholds to?
Hey, I'm not quite sure what you mean by creating a work item. Do you want a separate issue? I'm mostly interested in disk consumption on a per-app/service basis, so that we can track how much disk space each service consumes and how it changes over time. I don't know how practical that would be to do, though.
Hi, so that would be something like tracking (on Windows) WriteTransferCount, which is the number of bytes written to disk by a process. It's what you see in Task Manager's Details view for a process when you add the "I/O write bytes" column.

Implementation-wise, that is easy to add to AppObserver, but users would need to supply a Warning threshold to enable it, and it is unclear to me whether users know what constitutes misbehavior. A service might write data to logs or some other file(s), and this could amount to GBs of data. What constitutes too much? That would be left to the user to decide. Observers only monitor resources that have thresholds specified, so you could use a really large value to limit Warning noise, or, if you know that your service is supposed to manage the disk space it consumes, you could warn when it eats 10GB or so, which could signal that your disk cleanup code is failing. Again, this would be up to the user.
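For illustration, here is a minimal sketch of how a per-process WriteTransferCount could be read on Windows via the Win32 GetProcessIoCounters API. This is not FabricObserver code; the class and method names are hypothetical.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Runtime.InteropServices;

internal static class ProcessIoSketch
{
    // Mirrors the Win32 IO_COUNTERS structure filled in by GetProcessIoCounters.
    [StructLayout(LayoutKind.Sequential)]
    private struct IO_COUNTERS
    {
        public ulong ReadOperationCount;
        public ulong WriteOperationCount;
        public ulong OtherOperationCount;
        public ulong ReadTransferCount;
        public ulong WriteTransferCount;  // lifetime bytes written ("I/O write bytes" in Task Manager)
        public ulong OtherTransferCount;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool GetProcessIoCounters(IntPtr hProcess, out IO_COUNTERS counters);

    // Returns the total number of bytes the target process has written since it started.
    public static ulong GetWriteTransferCount(int processId)
    {
        using Process process = Process.GetProcessById(processId);

        if (!GetProcessIoCounters(process.Handle, out IO_COUNTERS counters))
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }

        return counters.WriteTransferCount;
    }
}
```

Note that this is a cumulative counter: it only tells you how much a process has ever written, not how much disk space it currently occupies, which is the limitation raised in the next comment.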
Re: Work Item, I mean not an Issue (where there is some problem), but a Feature Request. When you create an Issue, you can choose the type. Feature Request means you want something added to the technology. The template explains what you need to provide.
That makes sense, thanks. I'm not sure I/O Write Bytes would be useful, as it looks to me like that value never goes down, so it's not a representation of the current disk space utilisation of a process, but of how much it has written to disk in its lifetime (which for very long-lived processes is going to end up being huge). The scenario we have is a lot of stateful services that co-locate their state on disk with the code, and it's this "state" disk consumption that would be interesting to track, but I'm not sure if a metric easily exists to do so.
Yeah, you are right. That won't really help. I am not sure what performance counter would help you here. Are you trying to measure how much replicated state exists on disk?
Yes, if possible, ideally with a breakdown per app or service.
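One rough way to approximate this, purely as a sketch and not something FabricObserver does today, is to sum the size of the files under each service's on-disk state folder. The folder path here is a placeholder the caller would have to supply.

```csharp
using System.IO;
using System.Linq;

internal static class StateFolderSizeSketch
{
    // Sums the size of all files under the given folder, in bytes.
    // stateFolderPath is a placeholder: the caller would need to know where
    // a given service or replica keeps its on-disk state.
    public static long GetFolderSizeBytes(string stateFolderPath)
    {
        var directory = new DirectoryInfo(stateFolderPath);

        if (!directory.Exists)
        {
            return 0;
        }

        return directory.EnumerateFiles("*", SearchOption.AllDirectories)
                        .Sum(file => file.Length);
    }
}
```

Whether this is meaningful depends entirely on knowing which folders hold a service's state, which is part of what makes a general per-service disk metric hard.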
Hi Charles. Thanks for this, it looks interesting. I found this page that describes the counters: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-diagnostics
I used Perfmon to look at them on some of my clusters. They obviously aren't present on nodes that run stateless services, but I did find them on my nodes that have stateful services. However, it seemed a bit strange that for every instance the value was the same (79 in this case). I was expecting it to vary between different services, and it was also hard to work out which instance was for which service, which I assume is determinable from the ID it returns. If you are interested in exploring this, I'd be keen to see how it could be implemented in AppObserver so that it more easily allows you to see the values per service/app. Oh, I also found they weren't present on my clusters that are still on SF 9.0, but they were on my ones that are on SF 9.1. Do you know if these counters are new in SF 9.1?
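On the question of which counter instance belongs to which service: if the instance name does encode a partition ID (an assumption based on the linked page, not something confirmed in this thread), it could in principle be resolved back to a service and application with FabricClient, roughly like this:

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;

internal static class CounterInstanceLookupSketch
{
    // Given a partition ID parsed from a counter instance name, resolves the
    // owning service and application names. Requires connectivity to the cluster.
    public static async Task<(Uri ServiceName, Uri ApplicationName)> ResolveOwnerAsync(Guid partitionId)
    {
        using var fabricClient = new FabricClient();

        var serviceNameResult = await fabricClient.QueryManager.GetServiceNameAsync(
            partitionId, TimeSpan.FromSeconds(30), CancellationToken.None);

        var applicationNameResult = await fabricClient.QueryManager.GetApplicationNameAsync(
            serviceNameResult.ServiceName, TimeSpan.FromSeconds(30), CancellationToken.None);

        return (serviceNameResult.ServiceName, applicationNameResult.ApplicationName);
    }
}
```

As the next comments explain, the feature behind these counters was not considered ready at the time, so this mapping is only a sketch of how the instance-to-service question might be answered.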
Hi Mark, this feature isn't actually ready for prime time. There is an active work item tracking this (internal) and I will let you know when it comes to fruition. I think the documentation is ahead of reality here. Even in my local tests, I am not getting the results I expect, which means there is nothing FO can provide in the near term. The data you are getting back is not correct (it doesn't seem like 79 bytes is realistic for your stateful service replicas...). Sorry for confusing you. Let's hold off on this for now until I get back to you. Feel free to leave this Issue open in the interim.
I think the issue here is unrelated to the counter implementation, which is fine. It is more of a problem of understanding how the counter actually works. I verified that the results are accurate. However, there is something to keep in mind: a checkpoint is initiated when the specified threshold (CheckpointThresholdInMB) is reached, that is, when log usage exceeds this threshold. At that point, the counter will return a non-zero value (greater than CheckpointThresholdInMB, expressed as bytes). The default value for this setting is 50MB. You can do a local experiment and change the value to be lower for your stateful service (only for testing, mind you; do not use a small value in production).

So, the counter is not a problem. It just took some time to understand what is going on (plus I frankly haven't had much time to revisit this, and when I did, I talked to a dev on the SF data team to clear this up). Note that there is still a work item in progress to work on the overall feature, including performance improvements, better documentation of what the data means, etc. Also, querying the counters from C# code (via the PerformanceCounter class) does not work, so that needs to be sorted out before FO can do any monitoring/reporting for this.
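For the local experiment mentioned above, one way to lower the checkpoint threshold is through the replicator settings a stateful service passes to its ReliableStateManager. This is a sketch, assuming ReliableStateManagerReplicatorSettings exposes a CheckpointThresholdInMB property, with a hypothetical service class name:

```csharp
using System.Fabric;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

// Hypothetical stateful service that lowers the checkpoint threshold for local testing only.
internal sealed class MyStatefulService : StatefulService
{
    public MyStatefulService(StatefulServiceContext context)
        : base(context, new ReliableStateManager(context,
            new ReliableStateManagerConfiguration(
                new ReliableStateManagerReplicatorSettings
                {
                    // Default is 50MB; a small value like this is for local experiments only.
                    CheckpointThresholdInMB = 5
                })))
    {
    }
}
```

The same replicator setting can also come from the service's Settings.xml configuration package rather than code, per the Service Fabric replicator configuration docs.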
In the documentation page for Observers, it states for AppObserver:
However, I don't see that AppObserver returns metrics for disk consumption. It appears to be just the following:
I assume the documentation is just incorrect and should be updated, but does Service Fabric Observer have any mechanism to get a per-service disk usage metric?
Thanks,
Mark