-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(PE-38408) Remove expensive Regexes from puppet profiler Java impl #2880
Conversation
Timer metric = metricsByID.get(getMetricName(sliceOfArrayToList(metricId, 2))); | ||
this.catalog_timers.put(metricId[1], metric); | ||
} | ||
} else if ("puppetdb".equals(metricId[0])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This puppetdb clause does not contain all the necessary complexity of the former PUPPETDB_PATTERN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the root metrics we report on based the PUPPETDB_PATTERN:
puppetdb.catalog.munge
puppetdb.command.submit.replace facts
puppetdb.command.submit.replace catalog
puppetdb.command.submit.store report
puppetdb.facts.encode
puppetdb.query
puppetdb.resource.search
puppetdb.report.convert_to_wire_format_hash
The code in the current PR dumbs this down to:
puppetdb.catalog
puppetdb.command
puppetdb.facts
puppetdb.query
puppetdb.resource
puppetdb.report
Going through the PuppetDB Ruby code (at this commit puppetlabs/puppetdb@8a1e2ef) I see these metrics created:
[:puppetdb, :aliases, :map_to_title]
[:puppetdb, :catalog, :munge]
[:puppetdb, :catalog, :save, request.key]
[:puppetdb, :command, :submit]
[:puppetdb, :command, :submit, catalog_command_name, catalog_version]
[:puppetdb, :command, :submit, command_name, version]
[:puppetdb, :command, :submit, inputs_command_name, inputs_version]
[:puppetdb, :edges, :munge]
[:puppetdb, :edges, :synthesize]
[:puppetdb, :edges, :synthesize, :make_unique]
[:puppetdb, :edges, :synthesize, :primary_synthesis]
[:puppetdb, :edges, :synthesize, :resource_table, :build]
[:puppetdb, :events_list, :build]
[:puppetdb, :facts, :encode]
[:puppetdb, :facts, :find, :parse_response, request.key]
[:puppetdb, :facts, :find, :query_nodes, request.key]
[:puppetdb, :facts, :find, request.key]
[:puppetdb, :facts, :save, request.key]
[:puppetdb, :facts, :search, :parse_query_response, request.key,]
[:puppetdb, :facts, :search, :query_request, request.key]
[:puppetdb, :facts, :search, request.key]
[:puppetdb, :keys, :filter_extraneous]
[:puppetdb, :logs_list, :build]
[:puppetdb, :metaparams, :sort]
[:puppetdb, :metrics_list, :build]
[:puppetdb, :namevar_aliases, :add]
[:puppetdb, :parameters, :add_missing]
[:puppetdb, :payload, :format]
[:puppetdb, :query, query]
[:puppetdb, :report, :convert_to_wire_format_hash]
[:puppetdb, :report, :process]
[:puppetdb, :resource, :search, :build_up_collected_objects, request.key]
[:puppetdb, :resource, :search, :parse_query_response, request.key]
[:puppetdb, :resource, :search, :query, request.key]
[:puppetdb, :resource, :search, request.key]
[:puppetdb, :resources_list, :build]
[:puppetdb, :titles, :stringify]
Are the current metrics correct? We could add a few while we're here. Some easier ones to add would be:
puppetdb.catalog.save
puppetdb.edges.munge
puppetdb.edges.synthesize
puppetdb.facts.find
puppetdb.facts.save
puppetdb.facts.search
puppetdb.payload.format
puppetdb.report.process
But I don't know if they duplicate some existing reported metrics, or aren't very valuable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reports.process
, catalog.save
and facts.save
metrics would likely be useful as they are the highest level metrics closest to "how long did the puppetdb terminus take?".
We should be sure to keep all the versions of submit like puppetdb.command.submit.store report
as those, along with puppetdb.facts.encode
, puppetdb.catalog.munge
, and puppetdb.report.convert_to_wire_format_hash
can help support determine if its spending its time in our Ruby code or trying to send the data over the net.
puppetdb.payload.format
would be good to include as well in case our json conversion is slow. We do have to do some string manipulation there, so it isn't just Ruby's to_json function.
I'm less familiar with the terminus's use for querying, but I imagine puppetdb.query
, puppetdb.facts.find
, and puppetdb.facts.search
would be good to include assuming those terminus functions are used somewhere.
To me, the rest fall into the nice-to-have category. If their portion is what's slow, they would be helpful to determine that, but I don't know how much value they'd have outside that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sharpie , how does adding those new puppetdb metrics sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the updated metrics:
"puppetdb-metrics": [
{
"metric": "catalog_save",
"count": 2,
"mean": 243,
"aggregate": 486
},
{
"metric": "resource_search",
"count": 8,
"mean": 46,
"aggregate": 368
},
{
"metric": "query",
"count": 1,
"mean": 355,
"aggregate": 355
},
{
"metric": "facts_find",
"count": 2,
"mean": 117,
"aggregate": 234
},
{
"metric": "catalog_munge",
"count": 2,
"mean": 94,
"aggregate": 188
},
{
"metric": "facts_save",
"count": 2,
"mean": 76,
"aggregate": 152
},
{
"metric": "command_submit_replace_catalog",
"count": 2,
"mean": 54,
"aggregate": 108
},
{
"metric": "command_submit_replace_facts",
"count": 2,
"mean": 49,
"aggregate": 98
},
{
"metric": "report_process",
"count": 2,
"mean": 41,
"aggregate": 82
},
{
"metric": "command_submit_store_report",
"count": 2,
"mean": 30,
"aggregate": 60
},
{
"metric": "payload_format",
"count": 8,
"mean": 2,
"aggregate": 16
},
{
"metric": "report_convert_to_wire_format_hash",
"count": 2,
"mean": 4,
"aggregate": 8
},
{
"metric": "facts_encode",
"count": 2,
"mean": 2,
"aggregate": 4
}
],
"inline-metrics": []
}
},
"active_alerts": [],
"service_name": "puppet-profiler"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated set of metrics looks good to me 👍
} | ||
|
||
private void updateMetricsTrackers(String[] metricId, Map<String, Timer> metricsByID) { | ||
if ("functions".equals(metricId[0])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note on how to start reviewing this because it was hard to identify for me... We need to look for areas like https://github.com/puppetlabs/puppet/blob/1fba2505e6fd5c7ed17257c07667da98d6ed1dfb/lib/puppet/pops/functions/function.rb#L115 and evaluate if all the cases are handled correctly. Using array indices for each of these fields requires that we have a good sense of the possible arrays that are coming in.
I think there will be some in puppet and puppetdb. Not sure if there are any in puppetserver as well?
Overall, i like the idea of doing the handling when an event comes in such that there is no cost for querying :).
src/java/com/puppetlabs/puppetserver/MetricsPuppetProfiler.java
Outdated
Show resolved
Hide resolved
} else if ("static_compile_inlining".equals(metricId[1])) { | ||
Timer metric = metricsByID.get(getMetricName(sliceOfArrayToList(metricId, 3))); | ||
this.inlining_timers.put(metricId[2], metric); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous regex based selection explicitly grabbed these submetrics: static_compile_postprocessing, static_compile, compile, and find_node. This else will grab all metrics that aren't evaluate_resource or static_compile_inlining, don't know if that is a good thing or a bad thing. Thoughts @Sharpie ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean these more specific metrics will no longer be counted in catalog timers? Does that matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They will still be there, but we'll report a lot more now. I don't know if that will be helpful or just noise. Here are the ones I can easily find with grep:
init_server_facts
find_facts
iterate_on_generators
evaluate_definitions
evaluate_collections
validate_final
finish_catalog
validate_pre_finish
evaluate_generators
evaluate_node_classes
evaluate_ast_node
evaluate_main
create_settings_scope
set_node_params
I think I've addressed the logic issues with the existing puppetdb metrics. Looking forward to hear from Charlie, et al if others would also be valuable. I need to add some tests, it appears that we test the function, catalog, and resource metrics from the profiler interface, but test the puppetdb metrics only via the graphite exporter (unaffected by this PR). |
The unit/integration tests for puppetdb metrics I was thinking of are testing allow listing, not actual creation of those metrics (which is hard since we don't have a PDB during our unit/integration tests. However there is an acceptance test to validate these metrics, and that's been helpful locally to make sure these are formatted correctly. The acceptance test is passing with a package built prior a small refactor (puppetserver 7.17.3.SNAPSHOT.2024.09.26T1533 on builds). I'll add those metrics Austin mentioned (unless I hear otherwise from Charlie) and then build a final package to acceptance test and that others can validate if they wish. |
We have triggered a nasty performance regression in the JVM[1] with some of our Regex usage in the MetricsPuppetProfiler (particularly the PuppetDB Patterns). Regardless of the performance regression, our Regex usage is unnecessary and unneedlessly expensive (I believe it was taking up ~3% of samples from JFR in an unrelated support escalation). This patch moves from iterating over all metrics and doing regex matches to pull out only those that match the desired pattern when metrics are requested, to caching references to desired metrics in maps as they are recorded. The keys of these maps should be the same as the match group from the previous patterns.
Based on discussion with Austin Blatt these additional metrics could be helpful in diagnosing issues with PuppetDB. > The reports.process, catalog.save and facts.save metrics would likely be > useful as they are the highest level metrics closest to "how long did the > puppetdb terminus take?". > > puppetdb.payload.format would be good to include as well in case our json > conversion is slow. We do have to do some string manipulation there, so it > isn't just Ruby's to_json function. > > I'm less familiar with the terminus's use for querying, but I imagine > puppetdb.query, puppetdb.facts.find, and puppetdb.facts.search would be > good to include assuming those terminus functions are used somewhere.
clj-kondo version 2024.03.05 fixed a bug which suppressed the :java-static-field-call warning in some cases. Linting is now erroring in CI and this patch should resolve that issue.
c3f314e
to
f85021c
Compare
Okay. I think this is ready to go. There's a package internally available with 7.17.3.SNAPSHOT.2024.09.27T1145 that works for me. I had to update our acceptance tests - it looks like our PDB integration isn't correct. I have a feeling that mean our PDB tests aren't really doing anything in CI, but first we need to get CI working and then I take up updating the PDB integration acceptance tests after that. fwiw, these are those local changes to the acceptance suite: https://github.com/justinstoller/puppetserver/compare/PE-38408...justinstoller:puppetserver:pdbupdate?expand=1 |
} | ||
|
||
@Override | ||
public void shutdown() { | ||
} | ||
|
||
private List<String> sliceOfArrayToList(String[] idSegments, int lengthOfID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cant see how subList
does not work for this?
Seems like for example the getMetricName
could be called with the result of a metricId.subList(0,2)
as the result of subList
I think would be mutable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We start with a String[]
which doesn't have the subList()
method defined, but we can turn an array into a List with Arrays.asList()
.
I originally tried Arrays.asList(ids).subList(0,2)
. But the problem is asList()
returns a fixed sized List and subList()
simply returns a view of the original fixed size list. So we end up erring when getMetricName
attempts to prepend the hostname to the list here: https://github.com/puppetlabs/puppetserver/pull/2880/files#diff-300731d432bcbd4678957c962c2aa20d068698fff7a4441434c3d1ac5e55fdfdR196
I should probably clean up getMetricName()
but I didn't want to lead with a larger refactor than was strictly necessary (at least after I had changed getTimers
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, i see, i misread that parameter type.
} else { | ||
String thirdElemet = safeGet(metricId, 2); | ||
|
||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its probably overkill, but i wonder if it would be worth comparing metrics from two reference workflows. Matching the regex to this logic is hurting my brain. It may not matter too much though... These are hard to reason about 🫠
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex matching definitely hurt my brain as well! I got it wrong a couple of times. I can run the acceptance test and collect what metrics are reported before and after this change. I've ran it after to make sure they look right (see this comment #2880 (comment)) , but haven't done a full comparison. I'll do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the previous metrics from 7.17.2:
{"service_version":"7.17.2","service_status_version":1,"detail_level":"debug","state":"running","status":{"experimental":{"function-metrics":[{"function":"puppetdb_query","count":1,"mean":369,"aggregate":369},{"function":"lookup","count":2,"mean":45,"aggregate":90}],"resource-metrics":[{"resource":"Class[main]","count":2,"mean":193,"aggregate":386}],"catalog-metrics":[{"metric":"compile","count":2,"mean":886,"aggregate":1772},{"metric":"find_node","count":2,"mean":11,"aggregate":22}],"puppetdb-metrics":[{"metric":"query","count":1,"mean":366,"aggregate":366},{"metric":"resource_search","count":8,"mean":31,"aggregate":248},{"metric":"command_submit_replace_catalog","count":2,"mean":68,"aggregate":136},{"metric":"command_submit_store_report","count":2,"mean":51,"aggregate":102},{"metric":"command_submit_replace_facts","count":2,"mean":43,"aggregate":86},{"metric":"catalog_munge","count":2,"mean":9,"aggregate":18},{"metric":"report_convert_to_wire_format_hash","count":2,"mean":5,"aggregate":10},{"metric":"facts_encode","count":2,"mean":2,"aggregate":4}],"inline-metrics":[]}},"active_alerts":[],"service_name":"puppet-profiler"}
and here are the metrics returned after this pr:
{"service_version":"7.17.3-SNAPSHOT","service_status_version":1,"detail_level":"debug","state":"running","status":{"experimental":{"function-metrics":[{"function":"puppetdb_query","count":1,"mean":411,"aggregate":411},{"function":"lookup","count":2,"mean":46,"aggregate":92}],"resource-metrics":[{"resource":"Node[weakest-byword.delivery.puppetlabs.net]","count":1,"mean":421,"aggregate":421},{"resource":"Class[main]","count":2,"mean":184,"aggregate":368},{"resource":"Node[resource-exporter.test]","count":1,"mean":208,"aggregate":208}],"catalog-metrics":[{"metric":"compile","count":2,"mean":912,"aggregate":1824},{"metric":"evaluate_ast_node","count":2,"mean":321,"aggregate":642},{"metric":"evaluate_main","count":2,"mean":188,"aggregate":376},{"metric":"evaluate_generators","count":2,"mean":115,"aggregate":230},{"metric":"iterate_on_generators","count":3,"mean":75,"aggregate":225},{"metric":"evaluate_collections","count":2,"mean":108,"aggregate":216},{"metric":"create_settings_scope","count":2,"mean":54,"aggregate":108},{"metric":"set_node_params","count":2,"mean":14,"aggregate":28},{"metric":"find_facts","count":2,"mean":7,"aggregate":14},{"metric":"find_node","count":2,"mean":6,"aggregate":12},{"metric":"finish_catalog","count":2,"mean":1,"aggregate":2},{"metric":"init_server_facts","count":1,"mean":0,"aggregate":0},{"metric":"validate_pre_finish","count":2,"mean":0,"aggregate":0},{"metric":"evaluate_node_classes","count":2,"mean":0,"aggregate":0},{"metric":"validate_final","count":2,"mean":0,"aggregate":0},{"metric":"evaluate_definitions","count":3,"mean":0,"aggregate":0}],"puppetdb-metrics":[{"metric":"query","count":1,"mean":409,"aggregate":409},{"metric":"resource_search","count":8,"mean":36,"aggregate":288},{"metric":"facts_find","count":2,"mean":139,"aggregate":278},{"metric":"catalog_save","count":2,"mean":131,"aggregate":262},{"metric":"facts_save","count":2,"mean":91,"aggregate":182},{"metric":"command_submit_replace_catalog","count":2,"mean":67,"aggregate":134},{"metric":"command_submit_replace_facts","count":2,"mean":56,"aggregate":112},{"metric":"report_process","count":2,"mean":50,"aggregate":100},{"metric":"command_submit_store_report","count":2,"mean":39,"aggregate":78},{"metric":"payload_format","count":8,"mean":3,"aggregate":24},{"metric":"facts_encode","count":2,"mean":10,"aggregate":20},{"metric":"catalog_munge","count":2,"mean":10,"aggregate":20},{"metric":"report_convert_to_wire_format_hash","count":2,"mean":4,"aggregate":8}],"inline-metrics":[]}},"active_alerts":[],"service_name":"puppet-profiler"}
These aren't pretty printed, not as an f-u, but because I think putting them pretty printed in a github comment would make the content very hard to compare. I put them into files, piped them through jq, and compared two terminals side by side. I think they look correct.
@@ -90,8 +176,8 @@ private boolean shouldTime(String[] metric_id) { | |||
return true; | |||
} | |||
|
|||
private List<Timer> getTimers(String[] metric_id) { | |||
List<Timer> timers = new ArrayList<Timer>(); | |||
private Map<String, Timer> getOrCreateTimersByIDs(String[] metric_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is not used anywhere else right? Its not important for
(->> (.getTimers metric-registry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure i'm following the "or" in the name here. It seems like this just creates a new map and returns it unconditionally. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is not used anywhere else right? Its not important for
It's not, the method is private and external callers are calling this method on the MetricRegistry https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/4.1.0-rc2/com/codahale/metrics/MetricRegistry.html#getTimers--
I think the two cases of "getTimers()" here where inspired by that method. But, the call to MetricRegister.timer() in this method is actually the place in this class were we create timers (if they don't exist). See https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/4.1.0-rc2/com/codahale/metrics/MetricRegistry.html#timer-java.lang.String-
The getOrCreateTimers
is in reference to that. A more accurate name would probably be getOrCreateTimersAndReturnThemById
, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for the reference. I think leaving it as is will be fine!
Do you feel strongly about this going in 7.x? Would it make sense just to target main? |
I don't have strong feelings. The escalation that started this research was using 7.x and so I started there. |
I think given the main benefit will be in 7.x (even though its close to the end) i think leaving it at that stream is best. Thanks! |
We have triggered a nasty performance regression in the JVM with some of our Regex usage in the MetricsPuppetProfiler. Regardless of the performance regression, our Regex usage is unnecessary and unneedlessly expensive.