-
Notifications
You must be signed in to change notification settings - Fork 179
Instrumentation support for PIE & shared libs on x86_64 Linux #184
Conversation
de0e296
to
8609edf
Compare
Thanks a lot for working on this! I pulled the branch, but it is failing two tests in my local machine: Failed Tests (2): Can you verify that? I see the tests are introduced in the first commit "Tests: add instrumentation tests for PIE exec & shared libs". Usually we add tests in the same commit that introduces a functionality, instead of before. The reason for that is that we want to preserve our ability to bisect problems, and we can't have a given commit in the history that introduces tests that are failing. So I suggest adding the tests together with the first commit that makes them pass. If they were made by different authors, then it's fine to commit then separately, as long as the test commit comes after the functionality. |
8609edf
to
4627ce0
Compare
Hello @rafaelauler As for failed tests - as I can see from Docker CI log - the test failed with an bolt-error mentioned in issues #117 & #181. I reproduced the same behavior on Ubuntu-20.04 with gcc-9. On Ubuntu-18.04 with gcc-7 - there are no such issues. As I understood from #181 - Bolt team is already working on resolving the root cause of the issue. Until then we can consider using a workaround commit added on top of my branch (Tests: temp fix for failing tests on newer gcc versions). With this fix - there are no failed BOLT tests on Ubuntu-20.04 with gcc-9. However, more tests will fail with older gcc versions, because "-fcf-protection" option was introduced in gcc-8. So feel free to cherry-pick all commits except the last one with a workaround. I'll appreciate your suggestions on how to handle this gcc version-specific behavior more properly. |
Summary: Not all symbol table entries were updated after ICF. (cherry picked from commit b48548684ffd88e48d86e66fc82ce162216763b1)
Summary: In strict mode, a jump table with targets generated by builtin_unreachable (located at the very end of the function) was asserting when being recreated by postProcessIndirectBranches. Fix this. (cherry picked from commit 2e9d2bc624656bf50a85e0d9a8b9cf5a60488548)
Summary: Use proper function while printing modified function name to file. (cherry picked from commit 8ba42abb8611cce9919c2c4999a3f05511602344)
Summary: In this diff we refactor the code around getting the original binary encoding of function's body. The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData. (cherry picked from commit 20af623a27a65c78645ad82ad4e6ff4d6ae2a172)
Summary: Make the methods isText/isData work for MachO. (cherry picked from commit e771d2e8a2670072bbbc13834173ed9ae440363e)
Summary: This diff ports reviews.llvm.org/D70157 to our LLVM tree, which makes the integrated assembler able to align X86 control-flow changing instructions in a way to reduce the performance impact of the ucode update on Intel processors that implement the JCC erratum mitigation. See white paper "Mitigations for Jump Conditional Code Erratum" by Intel published November 2019. To port this patch, I changed classifySecondInstInMacroFusion to analyze instruction opcodes directly instead of analyzing the CondCond operand (in more recent versions of LLVM, all conditional branches share the same opcode, but with a different conditional operand). I also pulled to our tree Alignment.h as a dependency, and the macroop analyzing helpers. x86-align-branch-boundary and -x86-align-branch are the two flags that control nop insertion to avoid disabling the decoder cache, following the original patch. In BOLT, I added the flag x86-align-branch-boundary-hot-only to request the alignment to only be applied to hot code, which is turned on by default. The reason is because such alignment is expensive to perform on large modules, but if we limit it to hot code, the relaxation pass runtime becomes tolerable. (cherry picked from commit 36739a87e353de66536945a8991cb0d8c1810076)
Summary: There are two peephole subpasses, remove-double-jumps and remove-useless-conditional-branches, that operates by reading branches directly, which makes them tricky to run before fix-branches. In the case of remove-double-jumps, it will even lead to suboptimal code if the patched branch was going to be removed by fix-branches when the target is the fall-through. If the final target is a tail call, it will lead to a broken CFG in the worst case. Fix this by moving these passes after SCTC, which already produces CFGs with conditional tail calls. (cherry picked from commit 8f9b6e4dd276e1fed4e34057d8db18926f8e4788)
Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later. (cherry picked from commit 527f7d05751053d29bfaeae8c3e13c7bbb8f71b2)
Summary: Add missing std::move in the method BinaryFunction::addAlternativeName (cherry picked from commit 44c9e038bb12848cb7a1a8a6c0a4f93fe0cd2414)
Summary: Factor out the helper class NameResolver from the class RewriteInstance. (cherry picked from commit ffec8b03fcceb9ef285af03ae2dbe1acd52e509a)
Summary: The flag is no longer used/needed. (cherry picked from commit ae10b98bfa5c3cb53295e0e79489de2bef1c8dc0)
Summary: Change our X86 target to use long nops by default. In general, BOLT does not put nops into the instruction stream that is going to be executed, since it doesn't align basic blocks, only functions. Since we rebased BOLT, our relationship with MCAssembler changed because it stopped using multibyte nops and we never needed to revisit that. But it makes a difference if we want to mitigate perf issues with the Intel JCC erratum, since the nops inserted are going to be decoded and executed. To make MCAssembler emit long nops again, we need to explictly set mattr (Features) of the X86 target. (cherry picked from commit f19079b7617a79ff0a98930400f47ba6d6fa13fc)
Summary: Add first bits to disassemble functions from a MachO binary. (cherry picked from commit 24251682342940f7d560f0b206298fb86499fc5b)
Summary: Add first bits to build CFG. (cherry picked from commit 36b89229c9897caf30c1d30d0596f7b90a41babf)
Summary: Enable reversing the order of basic blocks. (cherry picked from commit 46ebeb83b67d2be5e5e3eb8ac6ddc7b42993dc44)
Summary: (cherry picked from commit 7cd3b8e217f93b85df0090002df438c0fdda009f)
Summary: Add missing override in X86MCPlusBuilder.cpp. (cherry picked from commit f5735b400d2eb086f4b3e98d83cc21743071ba87)
Summary: The interface is no longer in use. (cherry picked from commit 460bdaaf5c48910d228d17211548bdc410f89d44)
(cherry picked from commit b392da1fbd055a555ef0fab3f943bbf214ace0ee)
(cherry picked from commit 0290dbaa15546fafa27a9a17240f0da4218299ce)
Summary: Shrink wrapping has a mode where it will directly move push pop pairs, instead of replacing them with stores/loads. This is an ambitious mode that is triggered sometimes, but whenever matching with a push, it would operate with the assumption that the restoring instruction was a pop, not a load, otherwise it would assert. Fix this assertion to bail nicely back to non-pushpop mode (use regular store and load instructions). (cherry picked from commit 2ff448dee7461b1b31ae06837e8d859b77985e6b)
(cherry picked from commit d4667d4e3eb21f481d2fd8df30d1be160fa0f646)
Summary: Fix begin decrementing. (cherry picked from commit 664a2dbdae4741bb4a6048c9bb5de6e262708845)
Summary: The parameter is no longer used. (cherry picked from commit 7ac00974b07895aeb9ad53f5f417119c8b15c7e5)
Summary: Add code to read more dynamic relocations (DT_JMPREL) and enforce strict checks that corresponding sections sizes match .dynamic entry description.
Summary: clang-12 now compiles bolt without warnings. Some warnings were fixed if possible while others were suppressed by doing (void)variable for unused variable warnings or moving code inside assert statements of LLVM_DEBUG blocks.
Summary: A binary can contain multiple PLT sections with different name and attributes (such as an entry size). Extend the support to .plt.sec and refactor the code to make future extensions simpler.
Summary: This commit implements new method for _start & _fini functions hooking which allows to use relative jumps for future PIE & .so library support. Instead of using absolute address of _start & _fini functions known on linking stage - we'll use dynamically created trampoline functions and use corresponding symbols in instrumentation runtime library. As we would like to use instrumentation for dynamically loaded binaries (with PIE & .so), thus we need to compile instrumentation library with "-fPIC" flag to support relative address resolution for functions and data. For shared libraries we need to handle initialization of instrumentation library case by using DT_INIT section entry point. Also this commit adds detection if the binary is executable or shared library based on existence of PT_INTERP header. In case of shared library we save information about real library init function address for further usage for instrumentation library init trampoline function creation and also update DT_INIT to point instrumentation library init function. Functions called from init/fini functions should be called with forced stack alignment to avoid issues with instructions which relies on it. E.g. optimized string operations. Vasily Leonenko, Advanced Software Technology Lab, Huawei
Summary: This commit adds support for getting directory entries and reading value of a symbolic link in instrumentation runtime library Elvina Yakubova, Advanced Software Technology Lab, Huawei
…lf/map_files Summary: This commit adds support for opening libs based on links /proc/self/map_files. For this we're getting current virtual address and searching the lib in the directory with such address range. After that, we're getting full path to the binary by using readlink function. Direct read from link in /proc/self/map_files entries is not possible because of lack of permissions. Elvina Yakubova, Advanced Software Technology Lab, Huawei
Summary: This commit introduces static binaries instrumentation support. Note that current implementation does not support profile output on the instrumented binary finalization. So it requires to use -instrumentation-sleep-time=N (N>0) option usage. Note: There is unhandled case with static PIE executable which might have dynamic header. Vasily Leonenko, Advanced Software Technology Lab, Huawei
Summary: This commit fixes runtime instrumentation handlers for PIE binaries case. Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei
Summary: This commit replaces IsStaticExecutable with !hasDynamicHeader to improve actual meaning of this value and corresponding checks. Also this commit introduces isExecutable() wrapper function for further handling static-pie executable case during it's enabling. Vasily Leonenko, Advanced Software Technology Lab, Huawei
This reverts commit 1af5e3c.
Vasily Leonenko, Advanced Software Technology Lab, Huawei
Summary: This commit introduces -instrumentation-binpath argument used to point instumented binary in runtime in case if /proc/self/map_files path is not accessible due to access restriction issues. Vasily Leonenko Advanced Software Technology Lab, Huawei
The trampolines are no loger pointers to the functions. For propper name resolving by bolt use extern "C" for all external symbols in instr.cpp Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei
Summary: This commit adds dummy tests for checking instrumentation support for PIE executables and shared libraries. Vasily Leonenko, Advanced Software Technology Lab, Huawei
This is a temporary fix for failing tests on newer gcc versions. Note that older gcc versions doesnot support -fcf-protection option, so this commit will cause failure on these versions. Related issues: facebookarchive#117 facebookarchive#181 Vasily Leonenko, Advanced Software Technology Lab, Huawei
ea0da0c
to
b9ce0d7
Compare
Rebased on top of current facebookincubator:rebased branch |
Thanks Vasily! Overall the diffs look pretty nicely written. But before I go over them, I'm still trying to figure out what is happening with testing to check that things are working as intended. I'm getting: At X86/instrumentation-pie.c,
And in the next command:
Are you able to repro that? My host compiler is clang9. Any idea about what might be happening? I can investigate more deeply later, if you need help, let me know. |
Hello Rafael! Thanks for checking it. |
(This was automatically closed because I renamed our branch from "rebased" to "main", so ignore this). Ok, I was able to debug what is happening. In the mmap call[1] we do during initialization to share the page of our counters, the page granularity is too large for the small counters vector in that test case, so we end up mmap'ing a bunch of unrelated sections that are next in the layout (including .got section). When we do this, mmap sets all contents to zero. So, previously, we had a .got entry with the correct PLT address for "dlsym". After instrumentation initialization code, thanks to this mmap, the address for 'dlsym" in the got table is now set to zero. When we call "dlsym", the code jumps to zero and crashes. I think the solution to this would be (1) to ensure that the size of our counters vector is always in multiples of the page size (0x1000). Either that, or (2) we need to add code in initialization to copy the previous contents of the page that is being mmap'ed, mmap it and then initialize it with the previously copied contents. I think (1) is simpler. [1] https://github.com/facebookincubator/BOLT/blob/main/bolt/runtime/instr.cpp#L1465 |
Thanks for sharing your results of the issue debugging. In order to make this case more clear - could you share your binary to reproduce the issue on my side or at least "readelf -lSW" output for the file. Unfortunately, I'm not able to build the same binary with gcc or clang compilers in my environment. In my case .got section resides before .text and .bolt.instr.counters sections, so it si not corrupted. |
You're right, CountersEnd is already padded. I looked again to understand what might be happening and discovered that the problem is __bolt_num_counters fetching an incorrect value in https://github.com/facebookincubator/BOLT/blob/main/bolt/runtime/instr.cpp#L1458 Is your __bolt_num_counters sane? I took a look at it here and discovered that we are not generating dynamic relocations to instruct the dynamic linker and loader to patch our GOT entries, so, for example, the got entry with the address of __bolt_num_counters has the value 0x602008, which is the correct static address of __bolt_num_counters, but after the dynamic library is loaded, it is loaded at a random location in memory (a given base address, say, 0x7ffff77f0000), and the got entry stays with the original address without the base address where the dynamic library was loaded added to it. |
@rafaelauler
Also in the runtime library, the following instruction is related to __bolt_num_counters:
It is replaced with the following instruction in the instrumented test file:
So it seems that in my environment __bolt_num_counters relocation was resolved during linking instrumentation library to output executable, but in your environment, it was not. Or maybe your libbolt_rt_instr.a doesn't contain relocations for __bolt_num_counters? |
PR reopened #192 due to merge target branch renaming. |
@rafaelauler Hello! As I understand the problem, your instr library has a GOT table. Please take a look at the patch in reopened PR, it should fix the problem :) |
That's correct, the runtime library is being generated with GOT entries and our linking mechanism doesn't handle them. |
This PR enables instrumentation support for PIE executables, shared libraries, and initial support for statically linked executables on x86_64 Linux systems.
We are using relative jumps in trampoline functions to avoid usage of absolute addresses calculated during build time on the linking stage for initialization/finalization functions and instrumentation calls. These absolute addresses are not valid for randomized addresses for PIE & shared libraries.
Also for correct operation for instrumented shared libraries, we added a set of syscalls and functions for searching a path to the current binary in /proc/self/map_files directory based on the current virtual address and symlinks in this directory. In case/proc/self/map_files directory access is restricted (e.g. in container environment), we added -instrumentation-binpath option to pass information about expected path to an instrumented binary during runtime.
Co-authored-by: Vladislav Khmelevsky [email protected]
Co-authored-by: Elvina Yakubova [email protected]
Fixes #137
Vasily Leonenko,
Advanced Software Technology Lab, Huawei