Redesign syscallbuf to always unwind on interruption
This is a major redesign of the syscallbuf code with the goal of
establishing the invariant that we never switch away from a tracee
while it's in the syscallbuf code. Instead, we unwind the syscallbuf
code completely and execute the syscall at a special syscall instruction
now placed in the extended jump patch.

The primary motivation for this is that it fixes #3285, but I think
the change is overall very beneficial. We have significant complexity
in the recorder to deal with the possibility of interrupting the tracee
while it is in the syscallbuf code. This commit does not yet remove that
complexity (the change is already very big), but that should be easy to do
as a follow-up. Additionally, we used to be unable to perform syscall
buffering for syscalls performed inside a signal handler that had interrupted
a syscall. This had performance implications for use cases like stack walkers,
which often perform multiple memory-probing system calls per frame to
deal with the possibility of invalid unwind info.

There are many details here, but here's a high-level overview. The layout of
the new extended jump patch is:

```
<stack setup>
call <syscallbuf_hook>
// Bail path returns here
<stack restore>
syscall
<code from the original patch site>
// Non-bail path returns here.
jmp return_addr
```

One detail worth mentioning is what happens if a signal gets delivered once
the tracee is out of the syscallbuf, but still in the extended jump patch
(i.e. after the stack restore). In this case, rr will rewrite the ip of the
signal frame to point to the equivalent ip in the original, now-patched-out
code section. Of course the instructions in question are no longer there, but
the CFI will nevertheless be generally accurate for the current register state
(excluding weird CFI that explicitly references the ip, of course).
This allows unwinders in the end-user application to avoid ever having to
unwind through any frame in the rr syscallbuf, which seems like a desirable
property. Of course, `sigreturn` must perform the opposite transformation
to avoid actually returning into a patched-out location.
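Both directions of this ip rewriting reduce to simple offset arithmetic between the patch site and the stub. Here is a minimal standalone sketch of that arithmetic; the struct and field names are modeled on the patch's `patched_syscall` but simplified, with plain integers standing in for remote pointers:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of one patched syscall site (illustrative
// only; the real bookkeeping lives in Monkeypatcher).
struct PatchedSyscall {
  uint64_t patch_addr;  // start of the original, now-patched-out code
  uint64_t stub_addr;   // start of the extended jump patch
  uint64_t size;        // total size of the stub
  uint64_t safe_suffix; // trailing bytes that need no ip adjustment
};

// Signal delivery: an ip inside the stub is presented to the handler as the
// equivalent ip in the original code.
uint64_t stub_to_patch(const PatchedSyscall& ps, uint64_t ip) {
  return ps.patch_addr + (ip - ps.stub_addr) - (ps.size - ps.safe_suffix);
}

// sigreturn: the inverse transformation, so we never actually resume at a
// patched-out location.
uint64_t patch_to_stub(const PatchedSyscall& ps, uint64_t ip) {
  return ps.stub_addr + (ip - ps.patch_addr) + (ps.size - ps.safe_suffix);
}
```

The two functions are exact inverses of each other, which is what lets signal delivery and `sigreturn` round-trip an interrupted tracee without it ever observing a stub address.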

The main drawback of this scheme is that while the application will never
see a location without CFI, GDB does still lack unwind information in
the extended jump stub. This is not a new problem, but syscall events
are now in the extended jump stub, so they come up quite frequently.

I don't think this is a huge problem - it's basically the same situation
we used to have before the vdso changes. I believe the best way to fix
this would be to establish some way of having rr inform gdb of its
jump patches (in fact gdb already has this kind of mechanism for tracepoints,
it's just not exposed for tracepoints initiated by the gdb server),
but I don't intend to do this myself anytime in the near future.
That said, I should note that doing this would not require any changes
on the record side, so it could be done at any time and would start working
retroactively for already-recorded traces.
Keno committed Jul 9, 2022
1 parent ecc4124 commit ab53550
Showing 17 changed files with 1,025 additions and 566 deletions.
7 changes: 4 additions & 3 deletions src/DiversionSession.cc
@@ -72,9 +72,10 @@ static void process_syscall_arch(Task* t, int syscallno) {
if (syscallno == t->session().syscall_number_for_rrcall_rdtsc()) {
uint64_t rdtsc_value = static_cast<DiversionSession*>(&t->session())->next_rdtsc_value();
LOG(debug) << "Faking rrcall_rdtsc syscall with value " << rdtsc_value;
remote_ptr<uint64_t> out_param(t->regs().arg1());
t->write_mem(out_param, rdtsc_value);
finish_emulated_syscall_with_ret(t, 0);
Registers r = t->regs();
r.set_dx(rdtsc_value >> 32);
t->set_regs(r);
finish_emulated_syscall_with_ret(t, (uint32_t)rdtsc_value);
return;
}

220 changes: 133 additions & 87 deletions src/Monkeypatcher.cc

Large diffs are not rendered by default.

20 changes: 15 additions & 5 deletions src/Monkeypatcher.h
@@ -125,24 +125,34 @@ class Monkeypatcher {
};
std::vector<ExtendedJumpPage> extended_jump_pages;

bool is_jump_stub_instruction(remote_code_ptr p, bool include_safearea);
// Return the breakpoint instruction (i.e. the last branch back to caller)
// if we are on the exit path in the jump stub
remote_code_ptr get_jump_stub_exit_breakpoint(remote_code_ptr ip, RecordTask *t);

struct patched_syscall {
// Pointer to hook inside the syscall_hooks array, which gets initialized
// once and is fixed afterwards.
const syscall_patch_hook *hook;
remote_ptr<uint8_t> patch_addr;
remote_ptr<uint8_t> stub_addr;
size_t size;
uint16_t safe_prefix = 0;
uint16_t safe_suffix = 0;
};

patched_syscall *find_jump_stub(remote_code_ptr ip, bool include_safearea);
bool is_jump_stub_instruction(remote_code_ptr p, bool include_safearea) {
return (bool)find_jump_stub(p, include_safearea);
}

patched_syscall *find_syscall_patch(remote_code_ptr patch_location);

// Return the breakpoint instruction (i.e. the last branch back to caller)
// if we are on the exit path in the jump stub
remote_code_ptr get_jump_stub_exit_breakpoint(remote_code_ptr ip, RecordTask *t);
/**
* Addresses/lengths of syscallbuf stubs.
*/
std::map<remote_ptr<uint8_t>, patched_syscall> syscallbuf_stubs;
std::vector<patched_syscall> syscall_stub_list;
std::map<remote_ptr<uint8_t>, int> syscallbuf_stubs_by_extended_patch;
std::map<remote_ptr<uint8_t>, int> syscallbuf_stubs_by_patch_addr;

private:
/**
140 changes: 88 additions & 52 deletions src/RecordSession.cc
@@ -471,7 +471,6 @@ void RecordSession::handle_seccomp_traced_syscall(RecordTask* t,
SupportedArch syscall_arch = t->detect_syscall_arch();
t->canonicalize_regs(syscall_arch);
if (!process_syscall_entry(t, step_state, result, syscall_arch)) {
step_state->continue_type = RecordSession::DONT_CONTINUE;
return;
}
*did_enter_syscall = true;
@@ -508,6 +507,8 @@ static void seccomp_trap_done(RecordTask* t) {
(uint8_t)1);
}

extern void disarm_desched_event(RecordTask *t);
extern void leave_syscallbuf(RecordTask *t);
static void handle_seccomp_trap(RecordTask* t,
RecordSession::StepState* step_state,
uint16_t seccomp_data) {
@@ -542,27 +543,21 @@ static void handle_seccomp_trap(RecordTask* t,
}
}

if (t->is_in_untraced_syscall()) {
ASSERT(t, !t->delay_syscallbuf_reset_for_seccomp_trap);
// Don't reset the syscallbuf immediately after delivering the trap. We have
// to wait until this buffered syscall aborts completely before resetting
// the buffer.
t->delay_syscallbuf_reset_for_seccomp_trap = true;

t->push_event(Event::seccomp_trap());

bool is_untraced_syscall = t->is_in_untraced_syscall();
if (is_untraced_syscall) {
// desched may be armed but we're not going to execute the syscall, let
// alone block. If it fires, ignore it.
t->write_mem(
REMOTE_PTR_FIELD(t->syscallbuf_child, desched_signal_may_be_relevant),
(uint8_t)0);
// alone block. Disarm the event and if it fires, ignore it.
disarm_desched_event(t);
leave_syscallbuf(t);
r = t->regs();
}

t->canonicalize_regs(t->detect_syscall_arch());
t->push_syscall_event(syscallno);
t->ev().Syscall().failed_during_preparation = true;
note_entering_syscall(t);

if (t->is_in_untraced_syscall() && !syscall_entry_already_recorded) {
if (is_untraced_syscall && !syscall_entry_already_recorded) {
t->record_current_event();
}

@@ -578,10 +573,21 @@ static void handle_seccomp_trap(RecordTask* t,
si.native_api.si_code = SYS_SECCOMP;
si.native_api._sifields._sigsys._arch = to_audit_arch(r.arch());
si.native_api._sifields._sigsys._syscall = syscallno;

// Documentation says that si_call_addr is the address of the syscall
// instruction, but in tests it's immediately after the syscall
// instruction.
si.native_api._sifields._sigsys._call_addr = t->ip().to_data_ptr<void>();
remote_code_ptr seccomp_ip = t->ip();

/* If we actually deliver this signal, we will fudge the ip value to instead
point into the patched-out syscall. The callee may rely on these values
matching, so do the same adjustment here. */
Monkeypatcher::patched_syscall *ps = t->vm()->monkeypatcher().find_jump_stub(seccomp_ip, true);
if (ps) {
seccomp_ip = (ps->patch_addr + (seccomp_ip - ps->stub_addr.as_int()).register_value() - (ps->size - ps->safe_suffix)).as_int();
}

si.native_api._sifields._sigsys._call_addr = seccomp_ip.to_data_ptr<void>();
LOG(debug) << "Synthesizing " << si.linux_api;
t->stash_synthetic_sig(si.linux_api, DETERMINISTIC_SIG);

@@ -591,16 +597,24 @@
t->set_regs(r);
t->maybe_restore_original_syscall_registers();

if (t->is_in_untraced_syscall()) {
if (is_untraced_syscall) {
Registers r = t->regs();
// Cause kernel processing to skip the syscall
r.set_original_syscallno(SECCOMP_MAGIC_SKIP_ORIGINAL_SYSCALLNO);
t->set_regs(r);

// For buffered syscalls, go ahead and record the exit state immediately.
t->ev().Syscall().state = EXITING_SYSCALL;
t->record_current_event();
t->pop_syscall();

// The tracee is currently in the seccomp ptrace-stop. Advance it to the
// syscall-exit stop so that when we try to deliver the SIGSYS via
// The tracee is currently in the seccomp ptrace-stop or syscall-entry stop.
// Advance it to the syscall-exit stop so that when we try to deliver the SIGSYS via
// PTRACE_SINGLESTEP, that doesn't trigger a SIGTRAP stop.
t->resume_execution(RESUME_SYSCALL, RESUME_WAIT, RESUME_NO_TICKS);
if (t->status().ptrace_event() == PTRACE_EVENT_SECCOMP) {
t->resume_execution(RESUME_SYSCALL, RESUME_WAIT, RESUME_NO_TICKS);
}
}

// Don't continue yet. At the next iteration of record_step, if we
@@ -815,12 +829,6 @@ void RecordSession::task_continue(const StepState& step_state) {
// A task in an emulated ptrace-stop must really stay stopped
ASSERT(t, !t->emulated_stop_pending);

bool may_restart = t->at_may_restart_syscall();

if (may_restart && t->seccomp_bpf_enabled) {
LOG(debug) << " PTRACE_SYSCALL to possibly-restarted " << t->ev();
}

if (!t->vm()->first_run_event()) {
t->vm()->set_first_run_event(trace_writer().time());
}
@@ -892,7 +900,7 @@ void RecordSession::task_continue(const StepState& step_state) {
makes PTRACE_SYSCALL traps be delivered *before* seccomp RET_TRACE
traps.
Detect and handle this. */
if (!t->seccomp_bpf_enabled || may_restart ||
if (!t->seccomp_bpf_enabled ||
syscall_seccomp_ordering_ == PTRACE_SYSCALL_BEFORE_SECCOMP_UNKNOWN) {
resume = RESUME_SYSCALL;
} else {
@@ -1232,6 +1240,17 @@ void RecordSession::syscall_state_changed(RecordTask* t,
ASSERT(t, t->regs().original_syscallno() == -1);
}
rec_did_sigreturn(t);

/* The inverse of the processing we do during signal delivery - if the IP
points into a region that we patched out, move us to the extended jump
patch instead. */
Monkeypatcher::patched_syscall *ps = t->vm()->monkeypatcher().find_syscall_patch(t->ip());
if (ps) {
Registers r = t->regs();
r.set_ip((ps->stub_addr + (r.ip() - ps->patch_addr.as_int()).register_value() + (ps->size - ps->safe_suffix)).as_int());
t->set_regs(r);
}

t->record_current_event();
t->pop_syscall();

@@ -1500,6 +1519,33 @@ static bool inject_handled_signal(RecordTask* t) {
t->stashed_signal_processed();

int sig = t->ev().Signal().siginfo.si_signo;

/*
* If we're delivering a signal while in the extended jump patch, pretend we're in the
* unpatched code instead. That way, any unwinder that makes use of CFI for unwinding
* will see the correct unwind info of the patch site rather than that of the extended
* jump patch. The instruction sequence in the original code was of course altered by
* the patch, so if the signal handler inspects that, it might get confused. However,
* that is already a general problem with our patching strategy, in that the application
* is not allowed to read its own code.
* Naturally, we need to perform the inverse transformation in sigreturn.
*/
Monkeypatcher::patched_syscall *ps = t->vm()->monkeypatcher().find_jump_stub(t->ip(), true);
if (ps) {
Registers r = t->regs();
uint64_t translated_patch_offset = (r.ip() - ps->stub_addr.as_int()).register_value() - (ps->size - ps->safe_suffix);
// We patch out the jump stub with nop, but of course, if we happen to find ourselves
// in the middle of the nop sled, we just want to end up at the end of the patch
// region.
size_t total_patch_region_size = ps->hook->patch_region_length +
rr::syscall_instruction_length(t->arch());
if (translated_patch_offset > total_patch_region_size) {
translated_patch_offset = total_patch_region_size;
}
r.set_ip(ps->patch_addr.as_int() + translated_patch_offset);
t->set_regs(r);
}

do {
// We are ready to inject our signal.
// XXX we assume the kernel won't respond by notifying us of a different
@@ -1909,32 +1955,22 @@ static bool is_ptrace_any_sysemu(SupportedArch arch, int command)
bool RecordSession::process_syscall_entry(RecordTask* t, StepState* step_state,
RecordResult* step_result,
SupportedArch syscall_arch) {
if (const RecordTask::StashedSignal* sig = t->stashed_sig_not_synthetic_SIGCHLD()) {
// The only four cases where we allow a stashed signal to be pending on
// syscall entry are:
// -- the signal is a ptrace-related signal, in which case if it's generated
// during a blocking syscall, it does not interrupt the syscall
// -- rrcall_notify_syscall_hook_exit, which is effectively a noop and
// lets us dispatch signals afterward
// -- when we're entering a blocking untraced syscall. If it really blocks,
// we'll get the desched-signal notification and dispatch our stashed
// signal.
// -- when we're doing a privileged syscall that's internal to the preload
// logic
// We do not generally want to have stashed signals pending when we enter
// a syscall, because that will execute with a hacked signal mask
// (see RecordTask::will_resume_execution) which could make things go wrong.
ASSERT(t,
t->desched_rec() || is_rrcall_notify_syscall_hook_exit_syscall(
t->regs().original_syscallno(), t->arch()) ||
t->ip() ==
t->vm()
->privileged_traced_syscall_ip()
.increment_by_syscall_insn_length(t->arch()))
<< "Stashed signal pending on syscall entry when it shouldn't be: "
<< sig->siginfo << "; regs=" << t->regs()
<< "; last_execution_resume=" << t->last_execution_resume()
<< "; sig ip=" << sig->ip;
if (!t->is_in_syscallbuf() && t->stashed_sig_not_synthetic_SIGCHLD()) {
// If we have a pending signal, deliver it as if it had happened just before
// execution of the syscall instruction. To this end, kick us out of the
// current syscall again and set up the registers for a restart. Regular
// signal injection will do the rest.
LOG(debug) << "Entered syscall, but signal pending - setting up pre-syscall signal delivery";
Registers entry_regs = t->regs();
Registers r = entry_regs;
// Cause kernel processing to skip the syscall
r.set_original_syscallno(SECCOMP_MAGIC_SKIP_ORIGINAL_SYSCALLNO);
t->set_regs(r);
t->exit_syscall();
entry_regs.set_ip(entry_regs.ip().decrement_by_syscall_insn_length(syscall_arch));
entry_regs.set_syscallno(entry_regs.original_syscallno());
t->set_regs(entry_regs);
return false;
}

// We just entered a syscall.
11 changes: 6 additions & 5 deletions src/RecordTask.cc
@@ -634,6 +634,9 @@ bool RecordTask::will_resume_execution(ResumeRequest, WaitRequest,
if (!set_sigmask(sigset)) {
return false;
}
LOG(debug) << "Set signal mask to block all signals (bar "
<< "SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we "
<< " have a stashed signal";
}

// RESUME_NO_TICKS means that tracee code is not going to run so there's no
@@ -710,7 +713,9 @@ void RecordTask::did_wait() {
// state, because we do not allow stashed_signals_blocking_more_signals
// to hold across syscalls (traced or untraced) that change the signal mask.
ASSERT(this, !blocked_sigs_dirty);
xptrace(PTRACE_SETSIGMASK, remote_ptr<void>(8), &blocked_sigs);
if (set_sigmask(blocked_sigs)) {
LOG(debug) << "Blocked signals restored";
}
} else if (syscallbuf_child) {
// The syscallbuf struct is only 32 bytes currently so read the whole thing
// at once to avoid multiple calls to read_mem. Even though this shouldn't
@@ -1294,10 +1299,6 @@ bool RecordTask::set_sigmask(sig_set_t mask) {
return false;
}
ASSERT(this, errno == EINVAL);
} else {
LOG(debug) << "Set signal mask to block all signals (bar "
<< "SYSCALLBUF_DESCHED_SIGNAL/TIME_SLICE_SIGNAL) while we "
<< " have a stashed signal";
}
return true;
}
