Adding metrics collection from MTV #916

Merged
merged 7 commits into from
Jun 27, 2024

bkhizgiy
Member

@bkhizgiy bkhizgiy commented Jun 2, 2024

Adds Prometheus metrics reporting for migrations and plans, making it possible to gather additional information about the different MTV plans and migrations per provider type.

How to test quickly: open a terminal on the forklift-controller and run the following command:

curl http://localhost:2112/metrics | grep mtv

Example output:

# HELP mtv_migration_data_transferred_bytes Total data transferred during VM migrations in bytes
# TYPE mtv_migration_data_transferred_bytes gauge
mtv_migration_data_transferred_bytes{mode="Cold",plan="2a86619c-8c84-43a9-ac6e-bfbea268acd6",provider="ova",target="Local"} 1.27926272e+08
mtv_migration_data_transferred_bytes{mode="Cold",plan="4ba105e2-cea9-4828-b115-52bac12738a9",provider="ova",target="Local"} 3.85875968e+08
mtv_migration_data_transferred_bytes{mode="Cold",plan="60b4e3ae-f23b-4b30-a289-a26c113e41e3",provider="ova",target="Local"} 3.85875968e+08
mtv_migration_data_transferred_bytes{mode="Cold",plan="6a735494-f22e-4c76-b179-424f90f82fd1",provider="ova",target="Local"} 3.85875968e+08
mtv_migration_data_transferred_bytes{mode="Cold",plan="73780b25-f4cf-4095-9667-61e5c9777895",provider="vsphere",target="Local"} 2.9892804608e+10
mtv_migration_data_transferred_bytes{mode="Cold",plan="81d7e453-1a91-4385-8139-e04c143a0923",provider="ova",target="Local"} 0
mtv_migration_data_transferred_bytes{mode="Cold",plan="825726f4-4996-47e2-8a16-681d7d332072",provider="ova",target="Local"} 3.85875968e+08
mtv_migration_data_transferred_bytes{mode="Cold",plan="8f165ea7-fa6c-48c3-909d-f7a539927ec0",provider="vsphere",target="Local"} 2.9892804608e+10
mtv_migration_data_transferred_bytes{mode="Cold",plan="9e4322ec-73f8-4732-9b6d-258725e60014",provider="ova",target="Local"} 1.073741824e+10
mtv_migration_data_transferred_bytes{mode="Cold",plan="c89fea6e-13d8-4e53-adb0-d19da9b36e5c",provider="ova",target="Local"} 4.294967296e+09
mtv_migration_data_transferred_bytes{mode="Cold",plan="e15ffb14-fa00-4350-8561-382819d08181",provider="ova",target="Local"} 3.85875968e+08
mtv_migration_data_transferred_bytes{mode="Cold",plan="f790528c-4a2d-4e27-9168-3ab7f94472a8",provider="vsphere",target="Local"} 2.9892804608e+10
# HELP mtv_migration_duration_seconds Duration of VM migrations in seconds
# TYPE mtv_migration_duration_seconds gauge
mtv_migration_duration_seconds{mode="Cold",plan="01bfbd1b-b0bb-4086-9d5d-bdfe6b81e1fc",provider="ova",target="Local"} 571
mtv_migration_duration_seconds{mode="Cold",plan="1d510b7c-51b8-48c9-8428-82d0853483a7",provider="ova",target="Local"} 412
mtv_migration_duration_seconds{mode="Cold",plan="2a86619c-8c84-43a9-ac6e-bfbea268acd6",provider="ova",target="Local"} 280
mtv_migration_duration_seconds{mode="Cold",plan="4ba105e2-cea9-4828-b115-52bac12738a9",provider="ova",target="Local"} 315
mtv_migration_duration_seconds{mode="Cold",plan="57fd64c1-3c71-4336-a6b7-a07697aa5154",provider="vsphere",target="Local"} 4198
mtv_migration_duration_seconds{mode="Cold",plan="60b4e3ae-f23b-4b30-a289-a26c113e41e3",provider="ova",target="Local"} 311
mtv_migration_duration_seconds{mode="Cold",plan="6150bb77-8489-4ba6-802f-76b76053551e",provider="ova",target="Local"} 918
mtv_migration_duration_seconds{mode="Cold",plan="6a735494-f22e-4c76-b179-424f90f82fd1",provider="ova",target="Local"} 376
mtv_migration_duration_seconds{mode="Cold",plan="73780b25-f4cf-4095-9667-61e5c9777895",provider="vsphere",target="Local"} 66987
mtv_migration_duration_seconds{mode="Cold",plan="81d7e453-1a91-4385-8139-e04c143a0923",provider="ova",target="Local"} 320
mtv_migration_duration_seconds{mode="Cold",plan="825726f4-4996-47e2-8a16-681d7d332072",provider="ova",target="Local"} 359
mtv_migration_duration_seconds{mode="Cold",plan="8f165ea7-fa6c-48c3-909d-f7a539927ec0",provider="vsphere",target="Local"} 48859
mtv_migration_duration_seconds{mode="Cold",plan="9e4322ec-73f8-4732-9b6d-258725e60014",provider="ova",target="Local"} 1369
mtv_migration_duration_seconds{mode="Cold",plan="c46088f5-a97a-497e-83fe-a1b230a3cdd2",provider="vsphere",target="Local"} 37962
mtv_migration_duration_seconds{mode="Cold",plan="c89fea6e-13d8-4e53-adb0-d19da9b36e5c",provider="ova",target="Local"} 281
mtv_migration_duration_seconds{mode="Cold",plan="ced50622-8b7b-4894-9166-6bf7073333e7",provider="ova",target="Local"} 287
mtv_migration_duration_seconds{mode="Cold",plan="e15ffb14-fa00-4350-8561-382819d08181",provider="ova",target="Local"} 393
mtv_migration_duration_seconds{mode="Cold",plan="e9969f48-e8a7-45ad-86ca-35e66d5cb6c2",provider="ova",target="Local"} 608
mtv_migration_duration_seconds{mode="Cold",plan="f2c2e684-1576-4310-a708-3e7bfec794c8",provider="vsphere",target="Local"} 6879
mtv_migration_duration_seconds{mode="Cold",plan="f790528c-4a2d-4e27-9168-3ab7f94472a8",provider="vsphere",target="Local"} 71620
# HELP mtv_migration_duration_seconds Histogram of VM migrations duration in seconds
# TYPE mtv_migration_duration_seconds histogram
mtv_migration_duration_seconds_bucket{mode="Cold",provider="ova",target="Local",le="3600"} 14
mtv_migration_duration_seconds_bucket{mode="Cold",provider="ova",target="Local",le="7200"} 14
mtv_migration_duration_seconds_bucket{mode="Cold",provider="ova",target="Local",le="18000"} 14
mtv_migration_duration_seconds_bucket{mode="Cold",provider="ova",target="Local",le="36000"} 14
mtv_migration_duration_seconds_bucket{mode="Cold",provider="ova",target="Local",le="86400"} 14
mtv_migration_duration_seconds_bucket{mode="Cold",provider="ova",target="Local",le="172800"} 14
mtv_migration_duration_seconds_bucket{mode="Cold",provider="ova",target="Local",le="+Inf"} 14
mtv_migration_duration_seconds_sum{mode="Cold",provider="ova",target="Local"} 6800
mtv_migration_duration_seconds_count{mode="Cold",provider="ova",target="Local"} 14
mtv_migration_duration_seconds_bucket{mode="Cold",provider="vsphere",target="Local",le="3600"} 0
mtv_migration_duration_seconds_bucket{mode="Cold",provider="vsphere",target="Local",le="7200"} 2
mtv_migration_duration_seconds_bucket{mode="Cold",provider="vsphere",target="Local",le="18000"} 2
mtv_migration_duration_seconds_bucket{mode="Cold",provider="vsphere",target="Local",le="36000"} 2
mtv_migration_duration_seconds_bucket{mode="Cold",provider="vsphere",target="Local",le="86400"} 6
mtv_migration_duration_seconds_bucket{mode="Cold",provider="vsphere",target="Local",le="172800"} 6
mtv_migration_duration_seconds_bucket{mode="Cold",provider="vsphere",target="Local",le="+Inf"} 6
mtv_migration_duration_seconds_sum{mode="Cold",provider="vsphere",target="Local"} 236505
mtv_migration_duration_seconds_count{mode="Cold",provider="vsphere",target="Local"} 6
# HELP mtv_migrations_status VM Migrations sorted by status, provider, mode and destination
# TYPE mtv_migrations_status gauge
mtv_migrations_status{mode="Cold",provider="ova",status="Canceled",target="Local"} 1
mtv_migrations_status{mode="Cold",provider="ova",status="Failed",target="Local"} 1
mtv_migrations_status{mode="Cold",provider="ova",status="Succeeded",target="Local"} 1
mtv_migrations_status{mode="Cold",provider="vsphere",status="Canceled",target="Local"} 1
mtv_migrations_status{mode="Cold",provider="vsphere",status="Failed",target="Local"} 4
mtv_migrations_status{mode="Cold",provider="vsphere",status="Succeeded",target="Local"} 1
# HELP mtv_plans_status VM migration Plans sorted by status, provider, mode and destination
# TYPE mtv_plans_status gauge
mtv_plans_status{mode="Cold",provider="ova",status="Canceled",target="Local"} 4
mtv_plans_status{mode="Cold",provider="ova",status="Executing",target="Local"} 5
mtv_plans_status{mode="Cold",provider="ova",status="Failed",target="Local"} 1
mtv_plans_status{mode="Cold",provider="ova",status="Succeeded",target="Local"} 14
mtv_plans_status{mode="Cold",provider="vsphere",status="Executing",target="Local"} 1
mtv_plans_status{mode="Cold",provider="vsphere",status="Failed",target="Local"} 3
mtv_plans_status{mode="Cold",provider="vsphere",status="Succeeded",target="Local"} 6
# HELP mtv_workload_migrations_status VM Migrations status by provider, mode, destination and plan
# TYPE mtv_workload_migrations_status gauge
mtv_workload_migrations_status{mode="Cold",plan="00963712-6050-4597-8c2c-1217ab7f521a",provider="vsphere",status="Failed",target="Local"} 1
mtv_workload_migrations_status{mode="Cold",plan="01bfbd1b-b0bb-4086-9d5d-bdfe6b81e1fc",provider="ova",status="Succeeded",target="Local"} 1
mtv_workload_migrations_status{mode="Cold",plan="1b70f85b-11c8-40c0-bc27-7465aa757fba",provider="ova",status="Failed",target="Local"} 1
mtv_workload_migrations_status{mode="Cold",plan="1d510b7c-51b8-48c9-8428-82d0853483a7",provider="ova",status="Succeeded",target="Local"} 1
mtv_workload_migrations_status{mode="Cold",plan="25e8cfdf-103a-4af8-9a49-1aa2bdd55cd8",provider="ova",status="Canceled",target="Local"} 1
mtv_workload_migrations_status{mode="Cold",plan="2a86619c-8c84-43a9-ac6e-bfbea268acd6",provider="ova",status="Succeeded",target="Local"} 1
mtv_workload_migrations_status{mode="Cold",plan="471dcee9-5fe7-4ba5-9e23-2a964761dd96",provider="ova",status="Canceled",target="Local"} 1
mtv_workload_migrations_status{mode="Cold",plan="4ba105e2-cea9-4828-b115-52bac12738a9",provider="ova",status="Canceled",target="Local"} 1

sonarqubecloud bot commented Jun 2, 2024

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code


codecov bot commented Jun 2, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 15.83%. Comparing base (ccb54b8) to head (08e86d7).

| Files | Patch % | Lines |
|---|---|---|
| pkg/controller/plan/controller.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #916   +/-   ##
=======================================
  Coverage   15.83%   15.83%           
=======================================
  Files         109      108    -1     
  Lines       19970    19911   -59     
=======================================
- Hits         3162     3153    -9     
+ Misses      16527    16479   -48     
+ Partials      281      279    -2     
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 15.83% <0.00%> (+<0.01%) | ⬆️ |

Member

@liranr23 liranr23 left a comment

Regarding cold/warm and remote/local: I guess they are kept apart from the plans because that was requested.
But how harmful would it be to have a single metric that contains all of the plans and their data?
Did you think about deriving percentages from the data, e.g. success rate?

@bkhizgiy
Member Author

bkhizgiy commented Jun 3, 2024

Regarding cold/warm and remote/local: I guess they are kept apart from the plans because that was requested. But how harmful would it be to have a single metric that contains all of the plans and their data? Did you think about deriving percentages from the data, e.g. success rate?

It will be more problematic to have it in a single metric because we will have to specify the provider type, status, and destination in a single line, which will provide us with a very specific output. This is the opposite of what we currently want. Additionally, these are only the metrics in the controller. Now, I'm working on recording rules which will have more specific data based on the different labels and counters, so we will end up having a more combined output like you are referring to.

Regarding the percentage, I think it can be done. We just need to count all the migrations (there are some statuses we are not counting currently) and see the success/failure percentage from it.

@liranr23
Member

liranr23 commented Jun 3, 2024

Regarding cold/warm and remote/local: I guess they are kept apart from the plans because that was requested. But how harmful would it be to have a single metric that contains all of the plans and their data? Did you think about deriving percentages from the data, e.g. success rate?

It will be more problematic to have it in a single metric because we will have to specify the provider type, status, and destination in a single line, which will provide us with a very specific output. This is the opposite of what we currently want. Additionally, these are only the metrics in the controller. Now, I'm working on recording rules which will have more specific data based on the different labels and counters, so we will end up having a more combined output like you are referring to.

Regarding the percentage, I think it can be done. We just need to count all the migrations (there are some statuses we are not counting currently) and see the success/failure percentage from it.

All my questions about combining/adding metrics, and even about the time between updates, depend largely on who consumes this data and how. If the consumer expects the data to be accurate, 10 seconds can fit (maybe someone changed the plan from warm to cold?); if not, we can increase the interval to consume fewer resources and improve Forklift's performance.
The main question is whether we know who plans to consume it and what their expectations are.

@sradco

sradco commented Jun 17, 2024

@bkhizgiy I don't think there is a need for the metric 'mtv_operator_destination'. Why not integrate it into the migration and plan metrics?

@sradco

sradco commented Jun 17, 2024

Regarding cold/warm and remote/local: I guess they are kept apart from the plans because that was requested.
But how harmful would it be to have a single metric that contains all of the plans and their data?
Did you think about deriving percentages from the data, e.g. success rate?

I agree with the part of making the labels part of the plans. I'm not sure if everything can be in a single metric since I'm not sure about the difference between the plan and the migration, but it is worth discussing.

About the percentages, it's better to keep the metrics as they are, since you can calculate the percentage in Prometheus, and reporting precomputed percentages is against best practices.

"sigs.k8s.io/controller-runtime/pkg/client"
)

var processedPlans = make(map[string]struct{})
Member

Can we handle the migrations the same way as the plans?
Or did you not do that because you are still checking whether we need both migrations and plans?

Member Author

Currently we don't need to, because the migration metrics are set with atomic counters that always reflect the status: the real value is set each time instead of being incremented. The plan metric, however, has too many combinations and options now that we have 4 labels on the same metric. We would need every combination of Local/Remote, Warm/Cold, Status, and ProviderType, which leads to too many counters and if conditions, so this implementation simplifies things.

Also, as I commented above, it's the migration entity that corresponds to the plan. I left it because the original team had it, but do we really need both? I think one of them should be enough, but which would be better, plans or migrations? So the question is whether we really need it and, if we do, whether we should add additional labels there as well.

@ahadas ahadas added this to the 2.6.3 milestone Jun 19, 2024
var succeededRHV, succeededOCP, succeededOVA, succeededVsphere, succeededOpenstack float64
var failedRHV, failedOCP, failedOVA, failedVsphere, failedOpenstack float64
// Initialize or reset the counter map at the beginning of each iteration
counterMap := make(map[string]float64)
Member Author

We need to fetch the current state and save it. With the previous implementation, plans/migrations that switched between statuses were ignored and not counted. Later, this will be processed in telemetry for what we need. Since there are too many combinations to add a specific counter for each providerType/migrationType/destination/status, I've added a map that holds the different combinations as keys and the counter for each as values. This map is reset each iteration to represent the actual state of the system.

}

// Set the metrics for duration and data transferred and update the map of scanned migrations
if _, exists := processedSucceededMigrations[string(m.UID)]; !exists {
Member Author

Since we don't want to add a new metric for each successful migration on every iteration, once we have processed it we save its ID to a map and ignore it in subsequent iterations. This applies only to successful migrations.

@sradco

sradco commented Jun 24, 2024

@bkhizgiy you report the metrics as Gauges, but their names end with "total", which is reserved for counters.
I suggest adopting the metrics name linter that CNV uses.
Also, if a metric always goes up, it's a counter. If it reports the current state, it's a Gauge.
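
To show the kind of rule such a linter enforces, here is a tiny sketch that checks only the one convention raised here. It is not the CNV linter and does not reproduce its rules. `checkMetricName` and the metric name `mtv_example_migrations_total` are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// checkMetricName flags the naming problem raised in this review:
// a "_total" suffix on a metric that is not a counter.
// Hypothetical helper, not the CNV linter.
func checkMetricName(name, kind string) error {
	if kind != "counter" && strings.HasSuffix(name, "_total") {
		return fmt.Errorf("%s %q ends with _total, which is reserved for counters", kind, name)
	}
	return nil
}

func main() {
	// The first name is hypothetical; the second is a gauge from this PR.
	fmt.Println(checkMetricName("mtv_example_migrations_total", "gauge"))
	fmt.Println(checkMetricName("mtv_migrations_status", "gauge"))
}
```

A check like this could run in a unit test over the registered metric names, so a naming regression fails CI instead of reaching review.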

@bkhizgiy
Member Author

@sradco you are right, thanks. I first switched it to a counter but then decided to reflect the current state instead, so I returned it to a Gauge. I will change the name to reflect that. Regarding the linter, it's a good suggestion, I think it can be done next when adding more advanced logic, after we have our basic metrics implementation in.

Member

@liranr23 liranr23 left a comment

Did #916 (comment) change? If so, please update.

Member

@ahadas ahadas left a comment

looks good, just a minor comment inside

@bkhizgiy bkhizgiy force-pushed the telemtry branch 2 times, most recently from dfc99ac to 9257de6 Compare June 26, 2024 10:57
@sradco

sradco commented Jun 26, 2024

@bkhizgiy you can drop the word "in" from all the metric names.
For example:
mtv_data_transferred_in_bytes -> mtv_data_transferred_bytes.

@sradco

sradco commented Jun 26, 2024

@bkhizgiy I saw you updated the metric names. Please update the PR description with the new names.

@ahadas
Member

ahadas commented Jun 26, 2024

@sradco I can't find the comment in which you wrote:

Hi @ahadas, there are still a few metric name changes that are worth fixing before merging this PR.
It's important since they are also visible in the Observe -> Metrics UI.

sure, I'm waiting for your ack

@sradco

sradco commented Jun 27, 2024

@sradco I can't find the comment in which you wrote:

Hi @ahadas, there are still a few metric name changes that are worth fixing before merging this PR.
It's important since they are also visible in the Observe -> Metrics UI.

sure, I'm waiting for your ack

Actually, I saw that @bkhizgiy fixed the issue I raised. I added a few follow-ups.
Most are nits. The one thing I think you should consider is whether mtv_plans_status is still needed now that the plan information has been added to the migration metric.

@bkhizgiy bkhizgiy force-pushed the telemtry branch 2 times, most recently from c4c2db9 to 86bc5c2 Compare June 27, 2024 11:35
@bkhizgiy bkhizgiy changed the title from "Adding metrics collection to Telemetry" to "Adding metrics collection for MTV" Jun 27, 2024
@bkhizgiy bkhizgiy changed the title from "Adding metrics collection for MTV" to "Adding metrics collection from MTV" Jun 27, 2024
Signed-off-by: Bella Khizgiyaev <[email protected]>
@sradco

sradco commented Jun 27, 2024

/lgtm

@ahadas ahadas merged commit 50a2566 into kubev2v:main Jun 27, 2024
11 of 12 checks passed
@ahadas ahadas removed this from the 2.6.3 milestone Jun 30, 2024
// 'target' - [Local, Remote]
planStatusGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
Name: "mtv_plans_status",
Help: "VM migration Plans sorted by status, provider, mode and destination",

destination -> target

// 'target' - [Local, Remote]
migrationStatusGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
Name: "mtv_migrations_status",
Help: "VM Migrations sorted by status, provider, mode and destination",

destination -> target
