Skip to content
This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Instrumentation support for PIE & shared libs on x86_64 Linux #184

Closed
wants to merge 904 commits into from

Conversation

vleonen
Copy link

@vleonen vleonen commented Jul 4, 2021

This PR enables instrumentation support for PIE executables, shared libraries, and initial support for statically linked executables on x86_64 Linux systems.
We are using relative jumps in trampoline functions to avoid usage of absolute addresses calculated during build time on the linking stage for initialization/finalization functions and instrumentation calls. These absolute addresses are not valid for randomized addresses for PIE & shared libraries.
Also for correct operation for instrumented shared libraries, we added a set of syscalls and functions for searching a path to the current binary in /proc/self/map_files directory based on the current virtual address and symlinks in this directory. In case/proc/self/map_files directory access is restricted (e.g. in container environment), we added -instrumentation-binpath option to pass information about expected path to an instrumented binary during runtime.

Co-authored-by: Vladislav Khmelevsky [email protected]
Co-authored-by: Elvina Yakubova [email protected]

Fixes #137

Vasily Leonenko,
Advanced Software Technology Lab, Huawei

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Jul 4, 2021
@vleonen vleonen force-pushed the inst-pie-shlib-x86 branch from de0e296 to 8609edf Compare July 5, 2021 13:44
@rafaelauler
Copy link
Contributor

Thanks a lot for working on this!

I pulled the branch, but it is failing two tests in my local machine:

Failed Tests (2):
BOLT :: X86/instrumentation-pie.c
BOLT :: X86/instrumentation-shlib.c

Can you verify that?

I see the tests are introduced in the first commit "Tests: add instrumentation tests for PIE exec & shared libs". Usually we add tests in the same commit that introduces a functionality, instead of before. The reason for that is that we want to preserve our ability to bisect problems, and we can't have a given commit in the history that introduces tests that are failing. So I suggest adding the tests together with the first commit that makes them pass. If they were made by different authors, then it's fine to commit then separately, as long as the test commit comes after the functionality.

@vleonen vleonen force-pushed the inst-pie-shlib-x86 branch from 8609edf to 4627ce0 Compare July 7, 2021 19:31
@vleonen
Copy link
Author

vleonen commented Jul 7, 2021

Hello @rafaelauler
Thank you for your comments for this PR and for checking it.
As complete instrumentation support was implemented in several commits from several authors, I rewrote my branch to put tests on top of it.

As for failed tests - as I can see from Docker CI log - the test failed with an bolt-error mentioned in issues #117 & #181. I reproduced the same behavior on Ubuntu-20.04 with gcc-9. On Ubuntu-18.04 with gcc-7 - there are no such issues.

As I understood from #181 - Bolt team is already working on resolving the root cause of the issue.

Until then we can consider using a workaround commit added on top of my branch (Tests: temp fix for failing tests on newer gcc versions). With this fix - there are no failed BOLT tests on Ubuntu-20.04 with gcc-9. However, more tests will fail with older gcc versions, because "-fcf-protection" option was introduced in gcc-8. So feel free to cherry-pick all commits except the last one with a workaround. I'll appreciate your suggestions on how to handle this gcc version-specific behavior more properly.

maksfb and others added 24 commits July 7, 2021 17:10
Summary: Not all symbol table entries were updated after ICF.

(cherry picked from commit b48548684ffd88e48d86e66fc82ce162216763b1)
Summary:
In strict mode, a jump table with targets generated by
builtin_unreachable (located at the very end of the function) was
asserting when being recreated by postProcessIndirectBranches. Fix
this.

(cherry picked from commit 2e9d2bc624656bf50a85e0d9a8b9cf5a60488548)
Summary:
Use proper function while printing modified function name to file.

(cherry picked from commit 8ba42abb8611cce9919c2c4999a3f05511602344)
Summary:
In this diff we refactor the code around getting the original binary encoding of function's body.
The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData.

(cherry picked from commit 20af623a27a65c78645ad82ad4e6ff4d6ae2a172)
Summary: Make the methods isText/isData work for MachO.

(cherry picked from commit e771d2e8a2670072bbbc13834173ed9ae440363e)
Summary:
This diff ports reviews.llvm.org/D70157 to our LLVM tree, which
makes the integrated assembler able to align X86 control-flow changing
instructions in a way to reduce the performance impact of the ucode
update on Intel processors that implement the JCC erratum mitigation.

See white paper "Mitigations for Jump Conditional Code Erratum" by Intel
published November 2019.

To port this patch, I changed classifySecondInstInMacroFusion to analyze
instruction opcodes directly instead of analyzing the CondCond operand
(in more recent versions of LLVM, all conditional branches share the
same opcode, but with a different conditional operand). I also pulled to
our tree Alignment.h as a dependency, and the macroop analyzing helpers.

x86-align-branch-boundary and -x86-align-branch are the two flags that
control nop insertion to avoid disabling the decoder cache, following
the original patch. In BOLT, I added the flag
x86-align-branch-boundary-hot-only to request the alignment to only be
applied to hot code, which is turned on by default. The reason is
because such alignment  is expensive to perform on large modules, but if
we limit it to hot code, the relaxation pass runtime becomes tolerable.

(cherry picked from commit 36739a87e353de66536945a8991cb0d8c1810076)
Summary:
There are two peephole subpasses, remove-double-jumps and
remove-useless-conditional-branches, that operates by reading branches
directly, which makes them tricky to run before fix-branches. In the
case of remove-double-jumps, it will even lead to suboptimal code if
the patched branch was going to be removed by fix-branches when the
target is the fall-through. If the final target is a tail call, it will
lead to a broken CFG in the worst case. Fix this by moving these passes
after SCTC, which already produces CFGs with conditional tail calls.

(cherry picked from commit 8f9b6e4dd276e1fed4e34057d8db18926f8e4788)
Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later.

(cherry picked from commit 527f7d05751053d29bfaeae8c3e13c7bbb8f71b2)
Summary: Add missing std::move in the method BinaryFunction::addAlternativeName

(cherry picked from commit 44c9e038bb12848cb7a1a8a6c0a4f93fe0cd2414)
Summary: Factor out the helper class NameResolver from the class RewriteInstance.

(cherry picked from commit ffec8b03fcceb9ef285af03ae2dbe1acd52e509a)
Summary: The flag is no longer used/needed.

(cherry picked from commit ae10b98bfa5c3cb53295e0e79489de2bef1c8dc0)
Summary:
Change our X86 target to use long nops by default. In general,
BOLT does not put nops into the instruction stream that is going to be
executed, since it doesn't align basic blocks, only functions. Since we
rebased BOLT, our relationship with MCAssembler changed because it
stopped using multibyte nops and we never needed to revisit that. But it
makes a difference if we want to mitigate perf issues with the Intel
JCC erratum, since the nops inserted are going to be decoded and
executed. To make MCAssembler emit long nops again, we need to explictly
set mattr (Features) of the X86 target.

(cherry picked from commit f19079b7617a79ff0a98930400f47ba6d6fa13fc)
Summary: Add first bits to disassemble functions from a MachO binary.

(cherry picked from commit 24251682342940f7d560f0b206298fb86499fc5b)
Summary: Add first bits to build CFG.

(cherry picked from commit 36b89229c9897caf30c1d30d0596f7b90a41babf)
Summary: Enable reversing the order of basic blocks.

(cherry picked from commit 46ebeb83b67d2be5e5e3eb8ac6ddc7b42993dc44)
Summary:
(cherry picked from commit 7cd3b8e217f93b85df0090002df438c0fdda009f)
Summary: Add missing override in X86MCPlusBuilder.cpp.

(cherry picked from commit f5735b400d2eb086f4b3e98d83cc21743071ba87)
Summary: The interface is no longer in use.

(cherry picked from commit 460bdaaf5c48910d228d17211548bdc410f89d44)
(cherry picked from commit b392da1fbd055a555ef0fab3f943bbf214ace0ee)
(cherry picked from commit 0290dbaa15546fafa27a9a17240f0da4218299ce)
Summary:
Shrink wrapping has a mode where it will directly move push
pop pairs, instead of replacing them with stores/loads. This is an
ambitious mode that is triggered sometimes, but whenever matching with
a push, it would operate with the assumption that the restoring
instruction was a pop, not a load, otherwise it would assert. Fix this
assertion to bail nicely back to non-pushpop mode (use regular store and
load instructions).

(cherry picked from commit 2ff448dee7461b1b31ae06837e8d859b77985e6b)
(cherry picked from commit d4667d4e3eb21f481d2fd8df30d1be160fa0f646)
Summary: Fix begin decrementing.

(cherry picked from commit 664a2dbdae4741bb4a6048c9bb5de6e262708845)
Summary: The parameter is no longer used.

(cherry picked from commit 7ac00974b07895aeb9ad53f5f417119c8b15c7e5)
maksfb and others added 3 commits July 7, 2021 17:11
Summary:
Add code to read more dynamic relocations (DT_JMPREL) and enforce strict
checks that corresponding sections sizes match .dynamic entry
description.
Summary:
clang-12 now compiles bolt without warnings.
Some warnings were fixed if possible while others were suppressed by
doing (void)variable for unused variable warnings or moving code inside
assert statements of LLVM_DEBUG blocks.
Summary:
A binary can contain multiple PLT sections with different name and
attributes (such as an entry size). Extend the support to .plt.sec and
refactor the code to make future extensions simpler.
Vasily Leonenko and others added 12 commits July 8, 2021 08:52
Summary:
This commit implements new method for _start & _fini functions hooking
which allows to use relative jumps for future PIE & .so library support.
Instead of using absolute address of _start & _fini functions known on
linking stage - we'll use dynamically created trampoline functions and
use corresponding symbols in instrumentation runtime library.

As we would like to use instrumentation for dynamically loaded binaries
(with PIE & .so), thus we need to compile instrumentation library with
"-fPIC" flag to support relative address resolution for functions and
data.

For shared libraries we need to handle initialization of instrumentation
library case by using DT_INIT section entry point.

Also this commit adds detection if the binary is executable or shared
library based on existence of PT_INTERP header. In case of shared
library we save information about real library init function address
for further usage for instrumentation library init trampoline function
creation and also update DT_INIT to point instrumentation library init
function.

Functions called from init/fini functions should be called with forced
stack alignment to avoid issues with instructions which relies on it.
E.g. optimized string operations.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Summary:
This commit adds support for getting directory entries and
reading value of a symbolic link in instrumentation runtime library

Elvina Yakubova,
Advanced Software Technology Lab, Huawei
…lf/map_files

Summary:
This commit adds support for opening libs based on links /proc/self/map_files.
For this we're getting current virtual address and searching the lib in the
directory with such address range. After that, we're getting full path to the
binary by using readlink function. Direct read from link in /proc/self/map_files
entries is not possible because of lack of permissions.

Elvina Yakubova,
Advanced Software Technology Lab, Huawei
Summary:
This commit introduces static binaries instrumentation support.
Note that current implementation does not support profile output
on the instrumented binary finalization. So it requires to use
-instrumentation-sleep-time=N (N>0) option usage.
Note: There is unhandled case with static PIE executable which
might have dynamic header.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Summary:
This commit fixes runtime instrumentation handlers for PIE
binaries case.

Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Summary:
This commit replaces IsStaticExecutable with !hasDynamicHeader to improve
actual meaning of this value and corresponding checks. Also this commit
introduces isExecutable() wrapper function for further handling static-pie
executable case during it's enabling.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Summary:
This commit introduces -instrumentation-binpath argument used to point
instumented binary in runtime in case if /proc/self/map_files path
is not accessible due to access restriction issues.

Vasily Leonenko
Advanced Software Technology Lab, Huawei
The trampolines are no loger pointers to the functions.
For propper name resolving by bolt use extern "C" for
all external symbols in instr.cpp

Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Summary: This commit adds dummy tests for checking instrumentation support for PIE executables and shared libraries.

Vasily Leonenko, Advanced Software Technology Lab, Huawei
This is a temporary fix for failing tests on newer gcc versions.
Note that older gcc versions doesnot support -fcf-protection
option, so this commit will cause failure on these versions.

Related issues:
facebookarchive#117
facebookarchive#181

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
@vleonen vleonen force-pushed the inst-pie-shlib-x86 branch from ea0da0c to b9ce0d7 Compare July 8, 2021 07:19
@vleonen
Copy link
Author

vleonen commented Jul 8, 2021

Rebased on top of current facebookincubator:rebased branch

@rafaelauler
Copy link
Contributor

Thanks Vasily! Overall the diffs look pretty nicely written. But before I go over them, I'm still trying to figure out what is happening with testing to check that things are working as intended. I'm getting:

At X86/instrumentation-pie.c,

clang -no-pie -fcf-protection=none bolt/test/X86/instrumentation-pie.c -o build/tools/bolt/test/X86/Output/instrumentation-pie.c.tmp.exe -Wl,-q -pie -fpie

bin/llvm-bolt bolt/test/X86/Output/instrumentation-pie.c.tmp.exe -instrument -instrumentation-file=test/X86/Output/instrumentation-pie.c.tmp.fdata -o test/X86/Output/instrumentation-pie.c.tmp.instrumented
Output:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: b9ce0d7
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x400000, offset 0x400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 3
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 13
BOLT-INSTRUMENTER: Number of function descriptors: 13
BOLT-INSTRUMENTER: Number of branch counters: 5
BOLT-INSTRUMENTER: Number of ST leaf node counters: 15
BOLT-INSTRUMENTER: Number of direct call counters: 0
BOLT-INSTRUMENTER: Total number of counters: 20
BOLT-INSTRUMENTER: Total size of counters: 160 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 231 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 1500 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file test/X86/Output/instrumentation-pie.c.tmp.fdata
BOLT-INFO: 0 out of 16 functions in the binary (0.0%) have non-empty execution profile
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0x6095c0
BOLT-INFO: clear procedure is 0x602660
BOLT-INFO: setting _end to 0x6006c0

And in the next command:
bolt/test/X86/Output/instrumentation-pie.c.tmp.instrumented 1 2 3
Output:

Segmentation fault (core dumped)

Are you able to repro that? My host compiler is clang9. Any idea about what might be happening? I can investigate more deeply later, if you need help, let me know.

@vleonen
Copy link
Author

vleonen commented Jul 8, 2021

Hello Rafael! Thanks for checking it.
I checked how the test works with clang-9 & clang-10 on Ubuntu-20.04 and the test passed. bolt/test/X86/Output/instrumentation-pie.c.tmp.instrumented runs as expected.
Definitely, we have different input binaries produced by clang. Please share if you have any clues on what to check on my side.

@rafaelauler rafaelauler deleted the branch facebookarchive:rebased July 16, 2021 19:57
@rafaelauler
Copy link
Contributor

(This was automatically closed because I renamed our branch from "rebased" to "main", so ignore this).

Ok, I was able to debug what is happening. In the mmap call[1] we do during initialization to share the page of our counters, the page granularity is too large for the small counters vector in that test case, so we end up mmap'ing a bunch of unrelated sections that are next in the layout (including .got section). When we do this, mmap sets all contents to zero. So, previously, we had a .got entry with the correct PLT address for "dlsym". After instrumentation initialization code, thanks to this mmap, the address for 'dlsym" in the got table is now set to zero. When we call "dlsym", the code jumps to zero and crashes.

I think the solution to this would be (1) to ensure that the size of our counters vector is always in multiples of the page size (0x1000). Either that, or (2) we need to add code in initialization to copy the previous contents of the page that is being mmap'ed, mmap it and then initialize it with the previously copied contents. I think (1) is simpler.

[1] https://github.com/facebookincubator/BOLT/blob/main/bolt/runtime/instr.cpp#L1465

@vleonen
Copy link
Author

vleonen commented Jul 19, 2021

Thanks for sharing your results of the issue debugging. In order to make this case more clear - could you share your binary to reproduce the issue on my side or at least "readelf -lSW" output for the file. Unfortunately, I'm not able to build the same binary with gcc or clang compilers in my environment. In my case .got section resides before .text and .bolt.instr.counters sections, so it si not corrupted.
As I can see - CountersEnd is aligned to page boundary: https://github.com/facebookincubator/BOLT/blob/main/bolt/runtime/instr.cpp#L1457
Also I see that there is a padding for the table: https://github.com/facebookincubator/BOLT/blob/main/bolt/src/RuntimeLibs/InstrumentationRuntimeLibrary.cpp#L183
So your solution (1) seems to be already in place and we need to debug it on a reproducible case.

@rafaelauler
Copy link
Contributor

You're right, CountersEnd is already padded. I looked again to understand what might be happening and discovered that the problem is __bolt_num_counters fetching an incorrect value in https://github.com/facebookincubator/BOLT/blob/main/bolt/runtime/instr.cpp#L1458

Is your __bolt_num_counters sane? I took a look at it here and discovered that we are not generating dynamic relocations to instruct the dynamic linker and loader to patch our GOT entries, so, for example, the got entry with the address of __bolt_num_counters has the value 0x602008, which is the correct static address of __bolt_num_counters, but after the dynamic library is loaded, it is loaded at a random location in memory (a given base address, say, 0x7ffff77f0000), and the got entry stays with the original address without the base address where the dynamic library was loaded added to it.

@vleonen
Copy link
Author

vleonen commented Jul 21, 2021

@rafaelauler
Yes, __bolt_num_counters variable is in correct state in runtime. Could you check if there are relocations in libbolt_rt_instr.a ? In my case it looks as following:

$ readelf -aW ./build/lib/libbolt_rt_instr.a |grep num_counters
0000000000006716  0000005300000002 R_X86_64_PC32          0000000000000000 __bolt_num_counters - 4
0000000000006cf9  0000005300000002 R_X86_64_PC32          0000000000000000 __bolt_num_counters - 4
    83: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __bolt_num_counters

Also in the runtime library, the following instruction is related to __bolt_num_counters:

$ objdump -dS ./build/lib/libbolt_rt_instr.a|grep 6cf7
    6cf7:       8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6cfd <__bolt_instr_setup+0x2d>

It is replaced with the following instruction in the instrumented test file:

$ objdump -dS ./build/tools/bolt/test/X86/Output/instrumentation-pie.c.tmp.instrumented |grep 409767
  409767:       8b 05 9b 88 ff ff       mov    -0x7765(%rip),%eax        # 402008 <_end+0x1988>

So it seems that in my environment __bolt_num_counters relocation was resolved during linking instrumentation library to output executable, but in your environment, it was not. Or maybe your libbolt_rt_instr.a doesn't contain relocations for __bolt_num_counters?

@vleonen
Copy link
Author

vleonen commented Jul 21, 2021

PR reopened #192 due to merge target branch renaming.

@yota9
Copy link
Contributor

yota9 commented Jul 21, 2021

@rafaelauler Hello! As I understand the problem, your instr library has a GOT table. Please take a look at the patch in reopened PR, it should fix the problem :)
P.S. I'm not a big fan of using pragmas, but here I know only two options - to use either it, either attribute for each extern variable, since fvisibility flag does not affect them. I think it will be just easier to use pragma here to avoid this problem in the future with new variables.

@rafaelauler
Copy link
Contributor

That's correct, the runtime library is being generated with GOT entries and our linking mechanism doesn't handle them.
Thanks for the fix! Let me take a look at it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CLA Signed Do not delete this pull request or issue due to inactivity.
Projects
None yet
Development

Successfully merging this pull request may close these issues.