Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Do not merge] CI test, cmake option to enable address santitizer. #2890

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

bernhardu
Copy link
Contributor

This are the patches of my recent address santizer work.

With these all tests pass except two.
But at least some take much longer, the whole run around 24 minutes instead of 9.

The tests nested_detach-no-syscallbuf and nested_detach-32-no-syscallbuf do not finish even with 980 seconds timeout, and given the tests nested_detach and nested_detach-32 finish also with asan in a few seconds, I guess they show a hanging test.

What do you think? Is there something already suitable for merging?

@bernhardu
Copy link
Contributor Author

bernhardu commented Jun 10, 2021

From #2883: We would also need to be able to run rr with the ASAN library dynamically linked, but NOT pass that library name through to rr's spawned children.

Here "dynamically linked" means this?

$ ldd bin/rr
        libasan.so.6 => /usr/lib/x86_64-linux-gnu/libasan.so.6 (0x00007f730ea2e000)
...

Is this "NOT pass that library name" just related to LD_PRELOAD? Could you please give some more details?

@rocallahan
Copy link
Collaborator

Is this "NOT pass that library name" just related to LD_PRELOAD? Could you please give some more details?

Ah, I was thinking that ASAN uses LD_PRELOAD, but I guess it doesn't the way you're building it.

@bernhardu
Copy link
Contributor Author

bernhardu commented Jun 10, 2021

Ah, I was thinking that ASAN uses LD_PRELOAD, but I guess it doesn't the way you're building it.

That would be at least needed when inspecting a library built with ASAN with an executable built without.
But yes, while working at the strip_outer_ld_preload change, I got the impression that ASAN prepends itself to LD_PRELOAD when spawning childs to make sure it is the first library. But I could not trace any negative consequence back to it, when running the tests.

@Bob131
Copy link
Contributor

Bob131 commented Jun 11, 2021

Dang, you beat me to it! I've also been working on this, but it seems I spent too much time messing with hacks ;)

I have some comments (and maybe some patches) that might help.

CMakeLists.txt Outdated
@@ -522,6 +522,12 @@ if (x86ish)
set(RR_SOURCES ${RR_SOURCES} src/test/x86/cpuid_loop.S)
endif()

option(asan "Build with address santitizer enabled.")
if (asan)
set(ASAN_FLAGS "-fno-omit-frame-pointer -fsanitize=address")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is -fno-omit-frame-pointer necessary here? If this was because of less-than-useful stacktraces from leaks, there's ASAN_OPTIONS=fast_unwind_on_malloc=false. I'm wondering if -fno-omit-frame-pointer might be causing some of the test timeout issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is -fno-omit-frame-pointer necessary here?

I am just using it because the usage example I looked at had it too. Have to test if that makes a difference.
At least even with this the stack e.g. in 7214d96 was not complete.

CMakeLists.txt Outdated
@@ -1539,7 +1539,7 @@ if(BUILD_TESTS)
# reduced single-core performance.
exec_program(cat ARGS "/proc/cpuinfo" OUTPUT_VARIABLE CPUINFO)
string(REGEX MATCH "^.*(Xeon Phi).*$" CPU_MODEL_PHI ${CPUINFO})
if(NOT "${CPU_MODEL_PHI}" STREQUAL "")
if(NOT "${CPU_MODEL_PHI}" STREQUAL "" OR asan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't seeing this kind of issue running with an ASan build. Would you mind checking if you're still getting timeouts without this commit and without -fno-omit-frame-pointer? (Sorry, I know the test suite speed is kinda painful)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed -fno-omit-frame-pointer, but timeouts seem unchanged:

1265/2625 Test  #869: conditional_breakpoint_offload-no-syscallbuf ......................   Passed  261.44 sec
2620/2625 Test #2358: x86/string_instructions_async_signals_shared-32 ...................   Passed  264.50 sec
1280/2625 Test  #868: conditional_breakpoint_offload ....................................   Passed  274.21 sec
2586/2625 Test #2180: conditional_breakpoint_offload-32 .................................   Passed  329.61 sec
2/3 Test #1217: record_replay-no-syscallbuf ......   Passed  533.47 sec
2624/2625 Test #2528: record_replay-32 ..................................................   Passed  574.54 sec
2625/2625 Test #2529: record_replay-32-no-syscallbuf ....................................   Passed  608.18 sec
2621/2625 Test #1216: record_replay .....................................................   Passed  741.66 sec

With these tests still failing:

        1203 - nested_detach-no-syscallbuf (Failed)
        2515 - nested_detach-32-no-syscallbuf (Failed)

So from my point of view the -fno-omit-frame-pointer does not change much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to avoid misunderstanding - I run the test suite also in parallel (ctest --verbose -j$(nproc)).
Therefore the it depends much of the related other tests that run at that time.

Copy link
Contributor

@Bob131 Bob131 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really solve the issue, and it actually masks the fact that support for running ASan binaries in rr is currently broken (I'll open a PR to fix that in a bit). This is actually what I've been stuck on for a few days.

I'm thinking maybe the right approach is to detect ASan-using binaries before they exec() and fix-up LD_PRELOAD there.

EDIT: Woops, this was regarding 23fafc1 (" Run child process with ASAN_OPTIONS=verify_asan_link_order=0. ")

src/Task.cc Outdated
@@ -3671,6 +3671,12 @@ void Task::dup_from(Task *other) {
create_mapping(this, remote_this, km);
LOG(debug) << "Copying mapping into " << tid;
if (!(km.flags() & MAP_SHARED)) {
#if defined(__x86_64__) && defined(__SANITIZE_ADDRESS__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think possibly masks a real issue that maybe should be investigated? I haven't looked into it myself yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is just a workaround for asan mapping a really huge area, which cannot be allocated by plain malloc.
Maybe rr has to reproduce this mapping and copy just the "used" memory. Have not looked any further into this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today I tried to modify the copy_mem_mapping to not try to allocate the whole chunk in one, instead to copy by e.g. 256 MB blocks. Unfortunately this did just start to fill RAM.
I guess if we really try to duplicate such a 20 TB "MAP_NORESERVE" mapping, a way is needed to tell pages which got accessed once and therefore are "in use" apart from these pages, which never got touched, and therefore the kernel did never allocate a hardware page for it. (If I understand this situation right?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing you can do here is read /proc/<pid>/pagemap and check if the page frame number for each anonymous page is zero. If it is, the page is all zeroes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint.
I found the page frame number (PFN), bits 0-54 were all 0.
By testing the bit 63 "page present" seems to better show if something is in use.
Unfortunately testing the whole 20 TB increases the test time from 3 seconds to nearly 400 seconds.
And all in all there seem just around 126 pages of it in use.
So I guess this is not the final solution.
88792e9 contains my attempt to use /proc/pid/pagemap.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That program processes 1M pages per read (so 4GB instead of 1GB per read), and uses pread() to halve the number of syscalls, but I don't think that would explain why my program is 20x faster than your code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code calls page_size(), i.e. sysconf(_SC_PAGE_SIZE), twice per chunk (and twice per page present!) ... maybe that's really slow? Or maybe building rr non-optimized, the std::vector operations (size() and operator[]) per page, are really slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your sample code.
I tried to do some modifications:

  • 4m55: before
  • 5m18: Assigned page_size() to a local variable (tested just once, seems I did something in parallel)
  • 2m10: replaced the vector by a c-array

Then I changed the loop to your version (I hope it is ok to use it?).
And then I noticed, that I commented the && km.size() >= 0x100000000 and therefore used
copy_mem_mapping_just_used for all MAP_NORESERVE mappings,
which might not use holes and therefore get copied slowly page by page.
If I just use it for I got it down to 35 seconds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you use pread instead of lseek/read? You should.

copy_mem_mapping_just_used for all MAP_NORESERVE mappings,
which might not use holes and therefore get copied slowly page by page.

Seems like the right thing to do is to identify runs of consecutive pages that are present, and handle each run with a single copy_mem_mapping call. I think you should use copy_mem_mapping_just_used for all anonymous mappings (which you're not checking for currently!), not just MAP_NORESERVE, and not just for big mappings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly I first found that just the first lseek with read was needed, so moved lseek before the loop.
But I could not see an improvement in time. Similar for removing lseek and changing read to pread.
Then I ripped out my version and used your lines.

I tried to add this identification of consecutive pages and call it for all sizes of ANONYMOUS|NORESERVE mappings.

@Bob131 Bob131 mentioned this pull request Jun 11, 2021
@bernhardu
Copy link
Contributor Author

This doesn't really solve the issue, and it actually masks the fact that support for running ASan binaries in rr is currently broken (I'll open a PR to fix that in a bit). This is actually what I've been stuck on for a few days.

I'm thinking maybe the right approach is to detect ASan-using binaries before they exec() and fix-up LD_PRELOAD there.

EDIT: Woops, this was regarding 23fafc1 (" Run child process with ASAN_OPTIONS=verify_asan_link_order=0. ")

Because I was specifically interested in running just the tests against an asan enabled rr (where I intentionally did not build the tests with asan), I took this "shortcut".
This also gets just activated when building rr with asan. Running a asan enabled binary with a regular rr should still show the issue.

@bernhardu
Copy link
Contributor Author

Rebased to current head, because a few cheap patches got merged separately. Remaining patches are unchanged.

@@ -2230,6 +2230,10 @@ static string lookup_by_path(const string& name) {
env.push_back("SYSTEMD_RDRAND=0");
}

#if defined(__SANITIZE_ADDRESS__)
env.push_back("ASAN_OPTIONS=verify_asan_link_order=0");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this, and I'd suggest that this should be fine as a testsuite-specific hack provided there's a comment mentioning that

  1. This is needed to run the test suite with an ASan-instrumented rr because of rr's lack of support for children execing ASan binaries, and should be removed if/when rr gains such support.
  2. This is specific to the case of running the test suite and isn't supposed to help with general usage.

Please feel free to word it slightly less awkwardly ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to add a comment.

@@ -5,11 +5,17 @@
// environment. The runner uses this to test
// --nested=ignore in the absence of RR_UNDER_RR
// environment variable.
int main(int argc, char *argv[]) {
int main(int argc, char *argv[], char *envp[]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto above re comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole change to the test seems not needed in a tree that contains your vdso soname patch, so I remove this from my draft.

@bernhardu
Copy link
Contributor Author

@Bob131 I hope it is ok to integrate your patches to this draft?

@bernhardu bernhardu force-pushed the cmake-option-asan branch 2 times, most recently from ab793c9 to 30e13b8 Compare June 24, 2021 12:10
@@ -629,7 +629,7 @@ static WaitStatus record(const vector<string>& args, const RecordFlags& flags) {
flags.use_syscall_buffer, flags.syscallbuf_desched_sig,
flags.bind_cpu, flags.output_trace_dir,
flags.trace_id.get(),
flags.stap_sdt, flags.unmap_vdso);
flags.stap_sdt, flags.unmap_vdso, flags.asan);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khuey While working at this draft branch I found that the asan flag is currently set but never used, as far as I see.
And I am guessing right, that this flag is intended to go into RecordSession::create?
Should I put this into a separate PR?

@bernhardu
Copy link
Contributor Author

Just rebased the patches from 2021 to current git tip.
Half of them are still just for better control of running the tests.

@Keno
Copy link
Member

Keno commented Mar 28, 2023

Can we split this up into a few separate PRs?

  • Just the CMake option to turn on ASAN?
  • The sparse copy stuff
  • The preload changes

and then if we still can't figure out the test failures, the extra changes to disable tests, but I'd prefer to just go through and debug why it's not working. I think it'll make it easier to review and faster to merge in isolated pieces.

@bernhardu bernhardu force-pushed the cmake-option-asan branch 2 times, most recently from 9444af3 to b2e9c5e Compare April 1, 2023 09:56
@bernhardu bernhardu force-pushed the cmake-option-asan branch 5 times, most recently from 1c9362c to 66f32e2 Compare April 22, 2023 15:05
@bernhardu bernhardu changed the title RFC: Add cmake option to enable address santitizer. [Do not merge] CI test, cmake option to enable address santitizer. Apr 22, 2023
@bernhardu bernhardu force-pushed the cmake-option-asan branch 14 times, most recently from 6645058 to 5c1c898 Compare April 29, 2023 11:47
Error message:
  [FATAL src/MemoryRange.h:20:MemoryRange()] start_ <= end_
CI showed timeouts with test:
  madvise_fracture_flags (exceeded 120s, aarch64)
  nested_release (exceeded 120s)
  record_replay (exceeded 600s)
Remove those tests for now.
No idea how that could be avoided?

Error message:
  [FATAL src/PerfCounters.cc:785:read_ticks()] 1 (speculatively) executed strex instructions detected.

Example:
  https://buildkite.com/julialang/rr/builds/1141
  BUILDKITE_AGENT_NAME="default-armageddon.3"
Avoids error message:
  ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

Affected tests:
  nested_detach
  nested_detach_kill
  nested_detach_kill_stuck
  nested_detach_stop
  nested_detach_wait
@GitMensch
Copy link
Contributor

@bernhardu Can you rebase + split those changes? [not sure how the buildkite stuff is related]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants