-
Notifications
You must be signed in to change notification settings - Fork 672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve debug logging performance #5441
Comments
Yep -- I think if we want to avoid getting aggressive about reducing the debug log volume, the logger will need to support an async mode. I think if we're opening the "async logging" can of worms, though, it would probably be worthwhile to move to the |
As an aside, I am very interested in getting tracing support added to the stacks node (and potentially signer node) as well. If tokio's tracing lib were used to generate spans (as part of a separate issue), it may make sense to also look into it for logging as well since it may be cleaner and easier than mixing libs. |
It seems that 33% of the flame graph is just log output from Perhaps for now, we can move the inner loop of that to
I'm 100% down to substantially reduce the debug log volume. Some of the things we log in the networking stack really need to be Down the road, I'm in favor of a combination of @CharlieC3's request to have per-module or per-aspect logging, and my request to have an RPC interface for toggling logging, so we can selectively activate subsystem logging when diagnosing a problem.
This right here is the reason I'm not in favor of asynchronous logging, regardless of the concurrency paradigm used. We want to see crashes or other debug-worthy behaviors reported as they happen, so we shouldn't introduce an arbitrary delay between when the offending logic executes and when the log data is emitted. That is, after all, why we log things. |
It came up on the sprint call today to consider the possibility of using It is my understanding that the value proposition of Given the heterogeneity of node users, there isn't a one-size-fits-all solution to log ingestion. Home users have different needs from developers, who in turn have different needs from exchanges (and each exchange likely has their own internal tooling). It's not up to us to provide solutions to log ingestion, and trying to do so only makes our lives harder in the form of more object code to compile, more threads to trace, and more service tickets to deal with incompatibility issues which may arise. Instead, it's the node's job to give them the log data in an ingestible format (e.g. lines of text, JSON objects over stderr, etc), so it can be sent to the log-processing back-end. |
If folks feel strongly that we have to help users with log ingestion, then that's all well and good. Let's create one or more log ingestion binaries into which the node's output can be piped. And, let's do so in a separate repository, and let's treat this as a "green field" out-of-scope project that can be taken on by 3rd parties. |
@jcnelson For posterity, do you recall which backends from the tracing lib were being discussed? If it were regarding backends which ship logs, like tracing-loki, then I have to agree. It's often safer, easier, and more performant to configure an external log collector which ships these logs rather than doing it in code. |
I don't think any backends in particular were being discussed -- I don't think it got that far. The conversation stopped at the conceptual level of why there are distinct frontends and backends in the |
This flame graph taken by @kantai shows debug logging has a large impact on node performance.
I've also observed that RPC endpoints return much quicker when the node does not have debug logging enabled; this can have a significant impact if a stacks node with debug logs enabled is broadcasting to a signer or other downstream observer.
One idea from @hstove that can alleviate this is to switch to async logging via
slog_async
, however this would need to be researched and weighed against any other suggestions, as it may result in not getting critical logs in the event of a panic under the right conditions.The text was updated successfully, but these errors were encountered: