Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PE-38408) Remove expensive Regexes from puppet profiler Java impl #2880

Merged
merged 3 commits into from
Sep 30, 2024

Conversation

justinstoller
Copy link
Member

We have triggered a nasty performance regression in the JVM with some of our Regex usage in the MetricsPuppetProfiler. Regardless of the performance regression, our Regex usage is unnecessary and unneedlessly expensive.

@justinstoller justinstoller requested review from a team as code owners September 23, 2024 23:08
Timer metric = metricsByID.get(getMetricName(sliceOfArrayToList(metricId, 2)));
this.catalog_timers.put(metricId[1], metric);
}
} else if ("puppetdb".equals(metricId[0])) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This puppetdb clause does not contain all the necessary complexity of the former PUPPETDB_PATTERN

Copy link
Member Author

@justinstoller justinstoller Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the root metrics we report on based the PUPPETDB_PATTERN:

puppetdb.catalog.munge
puppetdb.command.submit.replace facts
puppetdb.command.submit.replace catalog
puppetdb.command.submit.store report
puppetdb.facts.encode
puppetdb.query
puppetdb.resource.search
puppetdb.report.convert_to_wire_format_hash

The code in the current PR dumbs this down to:

puppetdb.catalog
puppetdb.command
puppetdb.facts
puppetdb.query
puppetdb.resource
puppetdb.report

Going through the PuppetDB Ruby code (at this commit puppetlabs/puppetdb@8a1e2ef) I see these metrics created:

[:puppetdb, :aliases, :map_to_title]
[:puppetdb, :catalog, :munge]
[:puppetdb, :catalog, :save, request.key]
[:puppetdb, :command, :submit]
[:puppetdb, :command, :submit, catalog_command_name, catalog_version]
[:puppetdb, :command, :submit, command_name, version]
[:puppetdb, :command, :submit, inputs_command_name, inputs_version]
[:puppetdb, :edges, :munge]
[:puppetdb, :edges, :synthesize]
[:puppetdb, :edges, :synthesize, :make_unique]
[:puppetdb, :edges, :synthesize, :primary_synthesis]
[:puppetdb, :edges, :synthesize, :resource_table, :build]
[:puppetdb, :events_list, :build]
[:puppetdb, :facts, :encode]
[:puppetdb, :facts, :find, :parse_response, request.key]
[:puppetdb, :facts, :find, :query_nodes, request.key]
[:puppetdb, :facts, :find, request.key]
[:puppetdb, :facts, :save, request.key]
[:puppetdb, :facts, :search, :parse_query_response, request.key,]
[:puppetdb, :facts, :search, :query_request, request.key]
[:puppetdb, :facts, :search, request.key]
[:puppetdb, :keys, :filter_extraneous]
[:puppetdb, :logs_list, :build]
[:puppetdb, :metaparams, :sort]
[:puppetdb, :metrics_list, :build]
[:puppetdb, :namevar_aliases, :add]
[:puppetdb, :parameters, :add_missing]
[:puppetdb, :payload, :format]
[:puppetdb, :query, query]
[:puppetdb, :report, :convert_to_wire_format_hash]
[:puppetdb, :report, :process]
[:puppetdb, :resource, :search, :build_up_collected_objects, request.key]
[:puppetdb, :resource, :search, :parse_query_response, request.key]
[:puppetdb, :resource, :search, :query, request.key]
[:puppetdb, :resource, :search, request.key]
[:puppetdb, :resources_list, :build]
[:puppetdb, :titles, :stringify]

Are the current metrics correct? We could add a few while we're here. Some easier ones to add would be:

puppetdb.catalog.save
puppetdb.edges.munge
puppetdb.edges.synthesize
puppetdb.facts.find
puppetdb.facts.save
puppetdb.facts.search
puppetdb.payload.format
puppetdb.report.process

But I don't know if they duplicate some existing reported metrics, or aren't very valuable.

Thoughts @austb, @rbrw , @Sharpie ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reports.process, catalog.save and facts.save metrics would likely be useful as they are the highest level metrics closest to "how long did the puppetdb terminus take?".

We should be sure to keep all the versions of submit like puppetdb.command.submit.store report as those, along with puppetdb.facts.encode, puppetdb.catalog.munge, and puppetdb.report.convert_to_wire_format_hash can help support determine if its spending its time in our Ruby code or trying to send the data over the net.

puppetdb.payload.format would be good to include as well in case our json conversion is slow. We do have to do some string manipulation there, so it isn't just Ruby's to_json function.

I'm less familiar with the terminus's use for querying, but I imagine puppetdb.query, puppetdb.facts.find, and puppetdb.facts.search would be good to include assuming those terminus functions are used somewhere.

To me, the rest fall into the nice-to-have category. If their portion is what's slow, they would be helpful to determine that, but I don't know how much value they'd have outside that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sharpie , how does adding those new puppetdb metrics sound?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the updated metrics:

      "puppetdb-metrics": [
        {                                         
          "metric": "catalog_save",
          "count": 2,
          "mean": 243,   
          "aggregate": 486
        },
        {                            
          "metric": "resource_search",
          "count": 8,
          "mean": 46,    
          "aggregate": 368
        },
        {                                                                                                
          "metric": "query",
          "count": 1,
          "mean": 355,  
          "aggregate": 355
        },
        {                          
          "metric": "facts_find",
          "count": 2,
          "mean": 117,  
          "aggregate": 234
        },
        {                 
          "metric": "catalog_munge",
          "count": 2,
          "mean": 94, 
          "aggregate": 188         
        },
        {
          "metric": "facts_save",                                                                        
          "count": 2,
          "mean": 76,
          "aggregate": 152
        },
        {
          "metric": "command_submit_replace_catalog",
          "count": 2,
          "mean": 54,
          "aggregate": 108
        },
        {
          "metric": "command_submit_replace_facts",
          "count": 2,
          "mean": 49,
          "aggregate": 98
        },
        {
          "metric": "report_process",
          "count": 2,
          "mean": 41,
          "aggregate": 82
        },
        {
          "metric": "command_submit_store_report",
          "count": 2,
          "mean": 30,
          "aggregate": 60
        },
        {
          "metric": "payload_format",
          "count": 8,
          "mean": 2,
          "aggregate": 16
        },
        {
          "metric": "report_convert_to_wire_format_hash",
          "count": 2,
          "mean": 4,
          "aggregate": 8
        },
        {
          "metric": "facts_encode",
          "count": 2,
          "mean": 2,
          "aggregate": 4
        }
      ],
      "inline-metrics": []
    }
  },
  "active_alerts": [],
  "service_name": "puppet-profiler"
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated set of metrics looks good to me 👍

}

private void updateMetricsTrackers(String[] metricId, Map<String, Timer> metricsByID) {
if ("functions".equals(metricId[0])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note on how to start reviewing this because it was hard to identify for me... We need to look for areas like https://github.com/puppetlabs/puppet/blob/1fba2505e6fd5c7ed17257c07667da98d6ed1dfb/lib/puppet/pops/functions/function.rb#L115 and evaluate if all the cases are handled correctly. Using array indices for each of these fields requires that we have a good sense of the possible arrays that are coming in.

I think there will be some in puppet and puppetdb. Not sure if there are any in puppetserver as well?

Overall, i like the idea of doing the handling when an event comes in such that there is no cost for querying :).

} else if ("static_compile_inlining".equals(metricId[1])) {
Timer metric = metricsByID.get(getMetricName(sliceOfArrayToList(metricId, 3)));
this.inlining_timers.put(metricId[2], metric);
} else {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous regex based selection explicitly grabbed these submetrics: static_compile_postprocessing, static_compile, compile, and find_node. This else will grab all metrics that aren't evaluate_resource or static_compile_inlining, don't know if that is a good thing or a bad thing. Thoughts @Sharpie ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean these more specific metrics will no longer be counted in catalog timers? Does that matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will still be there, but we'll report a lot more now. I don't know if that will be helpful or just noise. Here are the ones I can easily find with grep:

init_server_facts
find_facts
iterate_on_generators
evaluate_definitions
evaluate_collections
validate_final
finish_catalog
validate_pre_finish
evaluate_generators
evaluate_node_classes
evaluate_ast_node
evaluate_main
create_settings_scope
set_node_params

@justinstoller
Copy link
Member Author

I think I've addressed the logic issues with the existing puppetdb metrics. Looking forward to hear from Charlie, et al if others would also be valuable. I need to add some tests, it appears that we test the function, catalog, and resource metrics from the profiler interface, but test the puppetdb metrics only via the graphite exporter (unaffected by this PR).

@justinstoller
Copy link
Member Author

The unit/integration tests for puppetdb metrics I was thinking of are testing allow listing, not actual creation of those metrics (which is hard since we don't have a PDB during our unit/integration tests.

However there is an acceptance test to validate these metrics, and that's been helpful locally to make sure these are formatted correctly. The acceptance test is passing with a package built prior a small refactor (puppetserver 7.17.3.SNAPSHOT.2024.09.26T1533 on builds).

I'll add those metrics Austin mentioned (unless I hear otherwise from Charlie) and then build a final package to acceptance test and that others can validate if they wish.

We have triggered a nasty performance regression in the JVM[1] with some
of our Regex usage in the MetricsPuppetProfiler (particularly the
PuppetDB Patterns). Regardless of the performance regression, our Regex
usage is unnecessary and unneedlessly expensive (I believe it was
taking up ~3% of samples from JFR in an unrelated support escalation).

This patch moves from iterating over all metrics and doing regex matches
to pull out only those that match the desired pattern when metrics are
requested, to caching references to desired metrics in maps as they are
recorded. The keys of these maps should be the same as the match group
from the previous patterns.
Based on discussion with Austin Blatt these additional metrics could be
helpful in diagnosing issues with PuppetDB.

> The reports.process, catalog.save and facts.save metrics would likely be
> useful as they are the highest level metrics closest to "how long did the
> puppetdb terminus take?".
>
> puppetdb.payload.format would be good to include as well in case our json
> conversion is slow. We do have to do some string manipulation there, so it
> isn't just Ruby's to_json function.
>
> I'm less familiar with the terminus's use for querying, but I imagine
> puppetdb.query, puppetdb.facts.find, and puppetdb.facts.search would be
> good to include assuming those terminus functions are used somewhere.
clj-kondo version 2024.03.05 fixed a bug which suppressed the
:java-static-field-call warning in some cases. Linting is now erroring
in CI and this patch should resolve that issue.
@justinstoller
Copy link
Member Author

justinstoller commented Sep 27, 2024

Okay. I think this is ready to go. There's a package internally available with 7.17.3.SNAPSHOT.2024.09.27T1145 that works for me.

I had to update our acceptance tests - it looks like our PDB integration isn't correct. I have a feeling that mean our PDB tests aren't really doing anything in CI, but first we need to get CI working and then I take up updating the PDB integration acceptance tests after that.

fwiw, these are those local changes to the acceptance suite: https://github.com/justinstoller/puppetserver/compare/PE-38408...justinstoller:puppetserver:pdbupdate?expand=1

}

@Override
public void shutdown() {
}

private List<String> sliceOfArrayToList(String[] idSegments, int lengthOfID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cant see how subList does not work for this?

Seems like for example the getMetricName could be called with the result of a metricId.subList(0,2) as the result of subList I think would be mutable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We start with a String[] which doesn't have the subList() method defined, but we can turn an array into a List with Arrays.asList().

I originally tried Arrays.asList(ids).subList(0,2). But the problem is asList() returns a fixed sized List and subList() simply returns a view of the original fixed size list. So we end up erring when getMetricName attempts to prepend the hostname to the list here: https://github.com/puppetlabs/puppetserver/pull/2880/files#diff-300731d432bcbd4678957c962c2aa20d068698fff7a4441434c3d1ac5e55fdfdR196

I should probably clean up getMetricName() but I didn't want to lead with a larger refactor than was strictly necessary (at least after I had changed getTimers)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i see, i misread that parameter type.

} else {
String thirdElemet = safeGet(metricId, 2);

if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its probably overkill, but i wonder if it would be worth comparing metrics from two reference workflows. Matching the regex to this logic is hurting my brain. It may not matter too much though... These are hard to reason about 🫠

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex matching definitely hurt my brain as well! I got it wrong a couple of times. I can run the acceptance test and collect what metrics are reported before and after this change. I've ran it after to make sure they look right (see this comment #2880 (comment)) , but haven't done a full comparison. I'll do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the previous metrics from 7.17.2:

{"service_version":"7.17.2","service_status_version":1,"detail_level":"debug","state":"running","status":{"experimental":{"function-metrics":[{"function":"puppetdb_query","count":1,"mean":369,"aggregate":369},{"function":"lookup","count":2,"mean":45,"aggregate":90}],"resource-metrics":[{"resource":"Class[main]","count":2,"mean":193,"aggregate":386}],"catalog-metrics":[{"metric":"compile","count":2,"mean":886,"aggregate":1772},{"metric":"find_node","count":2,"mean":11,"aggregate":22}],"puppetdb-metrics":[{"metric":"query","count":1,"mean":366,"aggregate":366},{"metric":"resource_search","count":8,"mean":31,"aggregate":248},{"metric":"command_submit_replace_catalog","count":2,"mean":68,"aggregate":136},{"metric":"command_submit_store_report","count":2,"mean":51,"aggregate":102},{"metric":"command_submit_replace_facts","count":2,"mean":43,"aggregate":86},{"metric":"catalog_munge","count":2,"mean":9,"aggregate":18},{"metric":"report_convert_to_wire_format_hash","count":2,"mean":5,"aggregate":10},{"metric":"facts_encode","count":2,"mean":2,"aggregate":4}],"inline-metrics":[]}},"active_alerts":[],"service_name":"puppet-profiler"}

and here are the metrics returned after this pr:

{"service_version":"7.17.3-SNAPSHOT","service_status_version":1,"detail_level":"debug","state":"running","status":{"experimental":{"function-metrics":[{"function":"puppetdb_query","count":1,"mean":411,"aggregate":411},{"function":"lookup","count":2,"mean":46,"aggregate":92}],"resource-metrics":[{"resource":"Node[weakest-byword.delivery.puppetlabs.net]","count":1,"mean":421,"aggregate":421},{"resource":"Class[main]","count":2,"mean":184,"aggregate":368},{"resource":"Node[resource-exporter.test]","count":1,"mean":208,"aggregate":208}],"catalog-metrics":[{"metric":"compile","count":2,"mean":912,"aggregate":1824},{"metric":"evaluate_ast_node","count":2,"mean":321,"aggregate":642},{"metric":"evaluate_main","count":2,"mean":188,"aggregate":376},{"metric":"evaluate_generators","count":2,"mean":115,"aggregate":230},{"metric":"iterate_on_generators","count":3,"mean":75,"aggregate":225},{"metric":"evaluate_collections","count":2,"mean":108,"aggregate":216},{"metric":"create_settings_scope","count":2,"mean":54,"aggregate":108},{"metric":"set_node_params","count":2,"mean":14,"aggregate":28},{"metric":"find_facts","count":2,"mean":7,"aggregate":14},{"metric":"find_node","count":2,"mean":6,"aggregate":12},{"metric":"finish_catalog","count":2,"mean":1,"aggregate":2},{"metric":"init_server_facts","count":1,"mean":0,"aggregate":0},{"metric":"validate_pre_finish","count":2,"mean":0,"aggregate":0},{"metric":"evaluate_node_classes","count":2,"mean":0,"aggregate":0},{"metric":"validate_final","count":2,"mean":0,"aggregate":0},{"metric":"evaluate_definitions","count":3,"mean":0,"aggregate":0}],"puppetdb-metrics":[{"metric":"query","count":1,"mean":409,"aggregate":409},{"metric":"resource_search","count":8,"mean":36,"aggregate":288},{"metric":"facts_find","count":2,"mean":139,"aggregate":278},{"metric":"catalog_save","count":2,"mean":131,"aggregate":262},{"metric":"facts_save","count":2,"mean":91,"aggregate":182},{"metric":"command_submit_replace_catalog","count":2,"mean":67,"aggregate":134},{"metric":"command_submit_replace_facts","count":2,"mean":56,"aggregate":112},{"metric":"report_process","count":2,"mean":50,"aggregate":100},{"metric":"command_submit_store_report","count":2,"mean":39,"aggregate":78},{"metric":"payload_format","count":8,"mean":3,"aggregate":24},{"metric":"facts_encode","count":2,"mean":10,"aggregate":20},{"metric":"catalog_munge","count":2,"mean":10,"aggregate":20},{"metric":"report_convert_to_wire_format_hash","count":2,"mean":4,"aggregate":8}],"inline-metrics":[]}},"active_alerts":[],"service_name":"puppet-profiler"}

These aren't pretty printed, not as an f-u, but because I think putting them pretty printed in a github comment would make the content very hard to compare. I put them into files, piped them through jq, and compared two terminals side by side. I think they look correct.

@@ -90,8 +176,8 @@ private boolean shouldTime(String[] metric_id) {
return true;
}

private List<Timer> getTimers(String[] metric_id) {
List<Timer> timers = new ArrayList<Timer>();
private Map<String, Timer> getOrCreateTimersByIDs(String[] metric_id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is not used anywhere else right? Its not important for

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure i'm following the "or" in the name here. It seems like this just creates a new map and returns it unconditionally. Am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is not used anywhere else right? Its not important for

It's not, the method is private and external callers are calling this method on the MetricRegistry https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/4.1.0-rc2/com/codahale/metrics/MetricRegistry.html#getTimers--

I think the two cases of "getTimers()" here where inspired by that method. But, the call to MetricRegister.timer() in this method is actually the place in this class were we create timers (if they don't exist). See https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/4.1.0-rc2/com/codahale/metrics/MetricRegistry.html#timer-java.lang.String-

The getOrCreateTimers is in reference to that. A more accurate name would probably be getOrCreateTimersAndReturnThemById , maybe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the reference. I think leaving it as is will be fine!

@donoghuc
Copy link
Contributor

Do you feel strongly about this going in 7.x? Would it make sense just to target main?

@justinstoller
Copy link
Member Author

I don't have strong feelings. The escalation that started this research was using 7.x and so I started there.

@donoghuc
Copy link
Contributor

I think given the main benefit will be in 7.x (even though its close to the end) i think leaving it at that stream is best. Thanks!

@justinstoller justinstoller merged commit acd44da into puppetlabs:7.x Sep 30, 2024
12 checks passed
@justinstoller justinstoller deleted the PE-38408 branch September 30, 2024 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants