Is per service disk space monitoring possible? #263

Open
markwragg opened this issue Mar 8, 2023 · 12 comments

@markwragg

In the documentation page for Observers it states for AppObserver:

Monitors CPU usage (Total CPU Time; percentage), Memory usage (Working Set; total or private, MB or percentage of total), and logical Disk space consumption for Service Fabric service processes and their descendants (aka child processes). Alerts when user-supplied thresholds are breached.

However, I don't see AppObserver returning any metrics for disk consumption. It appears to return only the following:

  • Active Ephemeral Ports (Percent)
  • CPU Time (Percent)
  • Allocated File Handles
  • Active Ephemeral Ports
  • Thread Count
  • Memory Usage (MB)

I assume the documentation is just incorrect and should be updated, but does Service Fabric Observer have any mechanism to get a per service disk usage metric?

Thanks,
Mark

@GitTorre
Member

GitTorre commented Mar 8, 2023

Thanks for the report. That is indeed a doc bug.

There is currently no disk-related monitoring done by AppObserver.

I will correct the documentation.

@GitTorre
Member

GitTorre commented Mar 8, 2023

Fixed the documentation to reflect reality. Thanks again for catching that.

In terms of adding disk monitoring capability to AppObserver, feel free to create a Work Item and it will be looked into. Which disk IO metrics do you want to measure and apply thresholds to?

@markwragg
Author

Hey, not quite sure what you mean by creating a work item. Do you want a separate issue? I'm mostly interested in disk consumption on a per app / service basis, so that we can have a way to track how much disk space each service consumes and how it changes over time. I don't know how practical that would be to do though.

@GitTorre
Member

Hi,

So, that would be something like tracking (on Windows) WriteTransferCount, which is the number of bytes written to disk by a process. It's what you see in Task Manager's Details view when you add the "I/O write bytes" column. Implementation-wise, that is easy to add to AppObserver, but users would need to supply a Warning threshold to enable it, and it is unclear to me whether users know what constitutes misbehavior. A service may write data to logs or other files, and this could amount to GBs of data. What constitutes too much? That is left to the user to decide. Observers only monitor resources that have thresholds specified, so you could use a really large value to limit Warning noise. Or, if you know that your service is supposed to manage the disk space it consumes, you could warn when it eats 10GB or so, which could signal that your disk cleanup code is failing. Again, this would be up to the user.
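
To illustrate the threshold behavior described above, here is a minimal Python sketch. It is not FabricObserver code: the function name `check_write_bytes`, the message format, and the use of GiB as the unit are all assumptions made for illustration. The byte count passed in stands for a cumulative value such as a process's WriteTransferCount.

```python
from typing import Optional

GIB = 1024 ** 3  # bytes per GiB (assumed unit for this sketch)


def check_write_bytes(write_bytes: int,
                      warning_threshold_bytes: Optional[int]) -> Optional[str]:
    """Return a warning message if a threshold is set and reached, else None.

    Hypothetical helper: mirrors the idea that a metric is only
    monitored when the user supplies a threshold for it.
    """
    if warning_threshold_bytes is None:
        # No threshold supplied, so the metric is not monitored at all.
        return None
    if write_bytes >= warning_threshold_bytes:
        return (f"Warning: process has written {write_bytes / GIB:.1f} GiB "
                f"(threshold {warning_threshold_bytes / GIB:.1f} GiB)")
    return None


# No threshold: nothing is reported, however much was written.
print(check_write_bytes(12 * GIB, None))
# Threshold of 10 GiB: 12 GiB written triggers the warning.
print(check_write_bytes(12 * GIB, 10 * GIB))
```

Choosing the threshold remains the hard part: the sketch only shows the mechanics, not what value counts as misbehavior for a given service.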

@GitTorre
Member

Re Work Item: I mean not an Issue (where there is some problem), but a Feature Request. When you create an Issue, you can choose the type. Feature Request means you want something added to the technology. The template explains what you need to provide.

@markwragg
Author

That makes sense, thanks. I'm not sure I/O Write Bytes would be useful, as it looks to me like that value never goes down. So it's not a representation of a process's current disk space utilisation, but of how much it has written to disk in its lifetime (which for very long-lived processes is going to end up being huge).

The scenario we have is a lot of stateful services that co-locate their state on disk with the code, and it's this "state" disk consumption that would be interesting to track, but I'm not sure whether a metric exists to do so easily.
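
The difference between a cumulative write counter and current disk usage can be shown with a small Python simulation. This is only a sketch: the `replay` function and the event format are made up for illustration and do not correspond to any Service Fabric or Windows API.

```python
def replay(events):
    """Replay write/delete events and return (cumulative_written, current_on_disk).

    events: iterable of ("write", n_bytes) or ("delete", n_bytes) tuples.
    """
    written = 0   # lifetime total, like an I/O write bytes counter
    on_disk = 0   # what is actually occupying disk right now
    for kind, n in events:
        if kind == "write":
            written += n
            on_disk += n
        elif kind == "delete":
            # Deleting frees space but never reduces the lifetime total.
            on_disk = max(0, on_disk - n)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return written, on_disk


events = [("write", 500), ("delete", 400), ("write", 300), ("delete", 350)]
print(replay(events))  # (800, 50)
```

The lifetime total only grows, while the on-disk figure rises and falls, which is why the former is a poor proxy for how much state a service currently holds.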

@GitTorre
Member

Yeah, you are right. That won't really help.

I am not sure what performance counter would help you here. Are you trying to measure how much replicated state exists on disk?

@markwragg
Author

Yes, if possible, with ideally a breakdown per app or service.

@GitTorre
Member

GitTorre commented Mar 17, 2023

This information is actually available via TStore SF perfcounters.

I have not had a chance to experiment with this yet, however. You can open Performance Monitor, go to SF counters, look under TStore. Disk Size and Item Count are the droids you're looking for, particularly Disk Size.

[Image: TStorePCtr]

@markwragg
Author

Hi Charles. Thanks for this, it looks interesting. I found this page that describes the counters: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-diagnostics

Item Count = The number of items in the store.
Disk Size = The total disk size, in bytes, of checkpoint files for the store.

I used Perfmon to look at them on some of my clusters. They obviously aren't present on nodes that run stateless services, but I did find them on my nodes that have stateful services. However, it seemed a bit strange that the value was the same for every instance (79 in this case); I was expecting it to vary between different services. It was also hard to work out which instance was for which service, which I assume can be determined from the ID it returns.

If you are interested in exploring this, I'd be keen to see how it could be implemented in AppObserver so that the values are easier to see per service/app.

Oh, I also found they weren't present on my clusters that are still on SF 9.0, but were on the ones that are on SF 9.1. Do you know if these counters are new in SF 9.1?
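
On working out which counter instance belongs to which service: a minimal Python sketch follows, assuming the instance name starts with `PartitionId:ReplicaId:` followed by state provider details. That layout is an assumption (it should be checked against the diagnostics page linked above and against real instance names in Perfmon), and the sample name below is invented. The helper `parse_instance_name` is hypothetical.

```python
from typing import NamedTuple


class TStoreInstance(NamedTuple):
    partition_id: str
    replica_id: str
    remainder: str  # state provider details, deliberately left unparsed


def parse_instance_name(name: str) -> TStoreInstance:
    """Split a counter instance name on its first two colons.

    Assumes the layout "<partition id>:<replica id>:<rest>"; anything
    else raises ValueError rather than guessing.
    """
    parts = name.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected instance name: {name!r}")
    return TStoreInstance(*parts)


# Invented sample value, for illustration only.
sample = ("11111111-2222-3333-4444-555555555555:133000000000000001:"
          "133000000000000002_7_urn:MyDictionary/dataStore")
instance = parse_instance_name(sample)
print(instance.partition_id)
print(instance.replica_id)
```

If the assumption holds, the partition ID could then be matched against the cluster's partitions to find the owning service.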

@GitTorre
Member

Hi Mark,

This feature isn't actually ready for prime time. There is an active work item tracking this (internal) and I will let you know when it comes to fruition. I think the documentation is ahead of reality here. Even in my local tests, I am not getting the results I expect.

This means that there is nothing FO can provide in the near term. The data you are getting back is not correct (it doesn't seem like 79 bytes is realistic for your stateful service replicas...). Sorry for confusing you. Let's hold off on this for now until I get back to you. Feel free to leave this Issue open in the interim.

GitTorre changed the title from "AppObserver documentation incorrect. Is per service disk space monitoring possible?" to "Is per service disk space monitoring possible?" on Mar 21, 2023

@GitTorre
Member

GitTorre commented May 31, 2023

I think the issue here is unrelated to the counter implementation, which is fine. It is more a matter of understanding how the counter actually works. I verified that the results are accurate. However, there is something to keep in mind here:

A checkpoint is initiated when the specified threshold (CheckPointThresholdInMB) is reached, that is, when log usage exceeds this threshold. At that point, the counter will return a non-zero value (greater than CheckPointThresholdInMB expressed in bytes). The default value for this setting is 50MB. You can run a local experiment and lower the value for your stateful service (only for testing, mind you; do not use a small value in production).

So, the counter is not a problem. It was just understanding what is going on that took some time (plus I frankly haven't had much time to revisit this and when I did I talked to a dev on the SF data team to clear this up).

Note that there is still a work item in progress to work on the overall feature, including performance improvements, better documentation of what the data means, etc. Also, querying the counters from C# code (via PerformanceCounter class) does not work. So, that needs to be sorted out before FO can do any monitoring/reporting for this.
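
For a sense of scale on the checkpoint threshold mentioned above, here is a small Python sketch of the arithmetic. It assumes 1 MB means 1024 × 1024 bytes, which is not confirmed in this thread, and it takes the description above literally (a reading above the threshold in bytes suggests a checkpoint has happened). The helper names are hypothetical.

```python
DEFAULT_CHECKPOINT_THRESHOLD_MB = 50  # default quoted in the comment above


def threshold_in_bytes(threshold_mb: int, bytes_per_mb: int = 1024 * 1024) -> int:
    """Convert a CheckPointThresholdInMB-style value to bytes (1 MB = 1024*1024 assumed)."""
    return threshold_mb * bytes_per_mb


def exceeds_checkpoint_threshold(disk_size_bytes: int,
                                 threshold_mb: int = DEFAULT_CHECKPOINT_THRESHOLD_MB) -> bool:
    """True if a Disk Size reading is above the threshold expressed in bytes."""
    return disk_size_bytes > threshold_in_bytes(threshold_mb)


print(threshold_in_bytes(DEFAULT_CHECKPOINT_THRESHOLD_MB))  # 52428800
# The reading of 79 reported earlier in the thread is far below that figure.
print(exceeds_checkpoint_threshold(79))  # False
```

Lowering the threshold in a local test, as suggested above, would shrink the first figure accordingly.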
