Is per service disk space monitoring possible? #263

markwragg · 2023-03-08T10:52:02Z

In the documentation page for Observers it states for AppObserver:

Monitors CPU usage (Total CPU Time; percentage), Memory usage (Working Set; total or private, MB or percentage of total), and logical Disk space consumption for Service Fabric service processes and their descendants (aka child processes). Alerts when user-supplied thresholds are breached.

However, I don't see AppObserver returning any metrics for disk consumption. It appears to return only the following:

Active Ephemeral Ports (Percent)
CPU Time (Percent)
Allocated File Handles
Active Ephemeral Ports
Thread Count
Memory Usage (MB)

I assume the documentation is just incorrect and should be updated, but does Service Fabric Observer have any mechanism to get a per service disk usage metric?

Thanks,
Mark

GitTorre · 2023-03-08T16:52:43Z

Thanks for the report. That is indeed a doc bug.

There is currently no disk-related monitoring done by AppObserver.

I will correct the documentation.

GitTorre · 2023-03-08T17:04:30Z

Fixed the documentation to reflect reality. Thanks again for catching that.

In terms of adding disk monitoring capability to AppObserver, feel free to create a Work Item and it will be looked into. What disk IO metrics do you want to measure and apply thresholds to?

markwragg · 2023-03-10T20:14:54Z

Hey, not quite sure what you mean by creating a work item. Do you want a separate issue? I'm mostly interested in disk consumption on a per app / service basis, so that we can have a way to track how much disk space each service consumes and how it changes over time. I don't know how practical that would be to do though.

GitTorre · 2023-03-10T20:39:22Z

Hi,

So, that would be something like tracking (on Windows) WriteTransferCount, which is the number of bytes a process has written through I/O operations (this counts file, network, and device I/O, not only disk). It's what you see in Task Manager, Details view, for a process when you add the "I/O write bytes" column. Implementation-wise, that is easy to add to AppObserver, but users would need to supply a Warning threshold to enable it, and it is unclear to me whether users know what constitutes misbehavior. Say the service writes data to logs or some other file(s) and this amounts to GBs of data. What constitutes too much? That would be left to the user to decide. Observers only monitor resources that have thresholds specified, so you could use a really large value to limit Warning noise. Or, if you know that your service is supposed to manage the disk space it consumes, you could warn when it eats 10GB or something, which could signal that your disk cleanup code is failing. Again, this would be up to the user.
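
To make the threshold rule concrete, here is a minimal Python sketch. It is not FabricObserver code (FO is written in C#), and the names `should_warn` and `warning_threshold_bytes` are made up for illustration. It only shows the idea that a metric without a user-supplied threshold is not monitored at all, and that a cumulative write-byte count trips a Warning once it reaches the threshold.

```python
from typing import Optional

GIB = 1024 ** 3  # bytes in one gibibyte


def should_warn(bytes_written: int, warning_threshold_bytes: Optional[int]) -> bool:
    """Return True when a Warning should be raised for a process.

    Hypothetical helper: a threshold of None means the user did not
    configure one, so the metric is not monitored and never warns.
    """
    if warning_threshold_bytes is None:
        return False
    return bytes_written >= warning_threshold_bytes


# No threshold supplied: nothing is monitored, so no Warning.
print(should_warn(25 * GIB, None))      # False
# With a 10 GB threshold, 4 GB written is fine and 12 GB warns.
print(should_warn(4 * GIB, 10 * GIB))   # False
print(should_warn(12 * GIB, 10 * GIB))  # True
```

Because the input is a lifetime total, a long-running service will eventually cross any fixed threshold, so the value chosen has to account for expected uptime.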

GitTorre · 2023-03-10T20:44:26Z

Re Work Item, I mean not an Issue (where there is some problem), but a Feature Request. When you create an Issue, you can choose the type. Feature Request means you want something added to the technology. The template explains what you need to provide.

markwragg · 2023-03-10T20:48:21Z

That makes sense, thanks. I'm not sure I/O Write Bytes would be useful, as it looks to me like that value never goes down, so it's not a representation of the current disk space utilisation of a process, but of how much it has written to disk in its lifetime (which for very long-lived processes is going to end up being huge).

The scenario we have is a lot of stateful services that co-locate their state on disk with the code, and it's this "state" disk consumption that would be interesting to track, but I'm not sure a metric readily exists to do so.
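
A rough sketch of the distinction, in Python: current disk consumption is the sum of file sizes that exist right now, which can shrink as well as grow, unlike a lifetime write total. The function name `directory_size_bytes` and the idea of pointing it at a service's folder on the node are assumptions for illustration only, not something Service Fabric or FabricObserver provides.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (current usage, not lifetime writes)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinked files so only real files in this tree are counted.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # Files can disappear or be locked while a service is running.
                continue
    return total
```

Sampling this value on a schedule for each service folder would give the change over time, at the cost of walking the directory tree on every sample.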

GitTorre · 2023-03-10T21:07:17Z

Yeah, you are right. That won't really help.

I am not sure what performance counter would help you here. Are you trying to measure how much replicated state exists on disk?

markwragg · 2023-03-10T21:12:19Z

Yes, if possible, with ideally a breakdown per app or service.

GitTorre · 2023-03-17T20:18:40Z

This information is actually available via TStore SF perfcounters.

I have not had a chance to experiment with this yet, however. You can open Performance Monitor, go to SF counters, look under TStore. Disk Size and Item Count are the droids you're looking for, particularly Disk Size.

markwragg · 2023-03-20T12:11:00Z

Hi Charles. Thanks for this, it looks interesting. I found this page that describes the counters: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-diagnostics

Item Count = The number of items in the store.
Disk Size = The total disk size, in bytes, of checkpoint files for the store.

I used Perfmon to look at them on some of my clusters. They obviously aren't present on nodes that run stateless services, but I did find them on my nodes that have stateful services. However, it seemed a bit strange that the value was the same for every instance (79 in this case); I was expecting it to vary between services. It was also hard to work out which instance belonged to which service, which I assume can be determined from the ID it returns.

If you are interested in exploring this I'd be keen to see how it could be implemented in AppObserver so that it more easily allows you to see the values per service/app.

Oh, I also found they weren't present on my clusters that are still on SF 9.0, but were on the ones running SF 9.1. Do you know if these counters are new in SF 9.1?

GitTorre · 2023-03-20T17:06:41Z

Hi Mark,

This feature isn't actually ready for prime time. There is an active work item tracking this (internal) and I will let you know when it comes to fruition. I think the documentation is ahead of reality here. Even in my local tests, I am not getting the results I expect.

This means that there is nothing FO can provide in the near term. The data you are getting back is not correct (it doesn't seem like 79 bytes is realistic for your stateful service replicas...). Sorry for confusing you. Let's hold off on this for now until I get back to you. Feel free to leave this Issue open in the interim.

GitTorre · 2023-05-31T23:14:50Z

I think the issue here is unrelated to the counter implementation, which is fine. It is more of an understanding problem vis-à-vis how the counter actually works. So, I verified that the results are accurate. However, there is something to keep in mind here:

A checkpoint will be initiated when the specified threshold (CheckPointThresholdInMB) is reached, that is, when log usage exceeds this threshold. At that point, the counter will return a non-zero value (so greater than CheckPointThresholdInMB expressed in bytes). The default value for this setting is 50MB. You can do a local experiment and change the value to be lower for your stateful service (only for testing, mind you; do not use a small value in production).

So, the counter is not a problem. It was just understanding what is going on that took some time (plus I frankly haven't had much time to revisit this and when I did I talked to a dev on the SF data team to clear this up).

Note that there is still a work item in progress to work on the overall feature, including performance improvements, better documentation of what the data means, etc. Also, querying the counters from C# code (via PerformanceCounter class) does not work. So, that needs to be sorted out before FO can do any monitoring/reporting for this.

GitTorre changed the title ~~AppObserver documentation incorrect. Is per service disk space monitoring possible?~~ Is per service disk space monitoring possible? Mar 21, 2023
