Support for collection of backtrace memory addresses #966

Open · wants to merge 10 commits into main
Conversation

hammad45

  • Added support for collecting backtrace memory addresses using backtrace() and backtrace_symbols() (see the sketch below)
  • Resolved address-to-line mappings using addr2line for the unique memory addresses in the binary
  • Modified Darshan logs to include the address-to-line mappings as part of the Darshan header and the complete memory address stack as part of the DXT trace data
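
For readers unfamiliar with those glibc calls, here is a minimal, self-contained sketch of backtrace()/backtrace_symbols() usage (this is not the PR's code, just an illustration of the API the first bullet refers to):

```c
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 32

/* Capture the current call stack and print the raw return addresses
 * plus best-effort symbol strings. addr2line can later map each
 * address to file:line, provided the binary was built with -g. */
static void capture_backtrace(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    char **symbols = backtrace_symbols(frames, n);

    for (int i = 0; i < n; i++)
        printf("%p  %s\n", frames[i], symbols ? symbols[i] : "?");

    free(symbols);
}
```

The offline mapping step in the second bullet would then be something along the lines of `addr2line -e ./app 0x400f2a` for each unique address (the address shown here is a placeholder).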

@jakobluettgau
Collaborator

Hi Hammad, this looks really nice. I'll try to create some logs with this new mode for DXT in MPI and POSIX as well, but could you share one of your logs for testing too?

Also, since this appears to change the log format, it should bump the log format versions, e.g. the DXT_*_VER macros for the affected modules in darshan-dxt-log-format.h.
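
For illustration, such a bump would look roughly like this (the macro names follow the DXT_*_VER pattern mentioned above; the concrete values are placeholders, not the header's actual contents):

```c
/* darshan-dxt-log-format.h (sketch): bump the per-module format
 * version whenever the on-disk DXT record layout changes. */
#define DXT_POSIX_VER 2   /* placeholder: previous version + 1 */
#define DXT_MPIIO_VER 2   /* placeholder: previous version + 1 */
```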

I'll try to run some tests and get back with additional feedback.

@jakobluettgau
Collaborator

It looks like this regresses handling of old Darshan logs. It should not be a big deal to support both, but as-is, old logs will error out for both darshan-parser and darshan-dxt-parser, as well as PyDarshan:

Error: failed to read darshan log file header. Error: darshan_log_open failed to read darshan log file header: Success.

@jakobluettgau
Collaborator

I guess a small paragraph for the documentation might be helpful as well. Something along the lines of:

  • The target application needs to be compiled with debugging symbols (-g); otherwise the line mappings are less meaningful and just show ??
  • To collect backtrace information, a new environment variable has to be set: `export DXT_ENABLE_STACK_TRACE=1` (see the sketch below)
  • Maybe a reference to the online man pages of backtrace and addr2line for interested users
  • And maybe, at some point with more experience, an expectation of the added overhead when enabled

Maybe some other noteworthy remarks from your experience when implementing this :)
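
As a minimal sketch of how the runtime toggle from the second bullet could be checked (the variable name comes from that bullet; the helper function and its placement inside DXT are assumptions, not the PR's actual code):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch only: gate stack-trace collection on the proposed
 * DXT_ENABLE_STACK_TRACE environment variable. */
static int dxt_stack_trace_enabled(void)
{
    const char *env = getenv("DXT_ENABLE_STACK_TRACE");
    return (env && strcmp(env, "1") == 0);
}
```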

@shanedsnyder
Contributor

Hi Hammad,

Thanks for submitting this PR!

Could you provide some detailed comments/discussion on how exactly the stack traces are collected with this code? I think it would take me some time to grok all the code changes, but it will be easier if I'm able to better understand how this process is intended to be carried out. From a relatively quick first scan, it seems:

  • Processes independently capture stacktrace info as read/write calls come into DXT
  • At DXT module shutdown time, information related to these stacktraces is extracted and written to per-process files
  • At Darshan shutdown time, rank 0 serially reads each per-rank file, extracts/transforms the data, then writes the resulting output data into the Darshan header

Any more elaborations there would be very welcome.

Without understanding the full changes yet, I do have a couple of higher level concerns:

  1. Ultimately storing this stack data in the Darshan log header is almost certainly not what we want to do

    • The header is a small, uncompressed region of the Darshan log file to store compact metadata about the modules (i.e., their version, how much compressed data they wrote, etc.), so it's not really where we'd imagine storing big chunks of characterization data
    • If we can't store the stack traces alongside the trace segments captured by the DXT modules, I think I'd recommend we create an entirely new module (e.g., DXT_STACKS) that stores this info
  2. The shutdown process seems pretty inefficient. It looks like the DXT module on each process writes out its own file at module shutdown time, but then, as Darshan is shutting down and writing its log file, rank 0 has to read each of these per-rank files serially.

    • Could we just use MPI collective operations at module shutdown time to reduce all of the stack data to rank 0 (a sketch follows below)? I'd guess that would be much more efficient than serializing all of this through the file system.
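
A minimal sketch of that collective approach, assuming each rank holds its serialized stack data in a flat local buffer (the function name and buffer layout are hypothetical, not taken from the PR):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch only: gather variable-length, per-rank stack-trace buffers
 * onto rank 0 with MPI_Gatherv instead of per-rank files. */
void gather_stack_data(const char *local_buf, int local_len, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int *lengths = NULL, *displs = NULL;
    char *all_data = NULL;

    if (rank == 0) {
        lengths = malloc(nprocs * sizeof(int));
        displs  = malloc(nprocs * sizeof(int));
    }

    /* rank 0 learns how many bytes each rank will contribute */
    MPI_Gather(&local_len, 1, MPI_INT, lengths, 1, MPI_INT, 0, comm);

    if (rank == 0) {
        int total = 0;
        for (int i = 0; i < nprocs; i++) {
            displs[i] = total;
            total += lengths[i];
        }
        all_data = malloc(total);
    }

    /* concatenate all per-rank buffers on rank 0 */
    MPI_Gatherv(local_buf, local_len, MPI_CHAR,
                all_data, lengths, displs, MPI_CHAR, 0, comm);

    /* rank 0 would then transform the gathered data and hand it to
     * whichever module record ends up holding it */
    free(lengths);
    free(displs);
    free(all_data);
}
```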
