Use portable JIT compilation for accelerating RISC-V emulation #81
Think about whether we want to load a 64-bit immediate under the present RISC-V specification. Despite memory access being substantially slower, it is simpler to just load the constant from memory, since synthesizing it in registers requires more instructions. In RISC-V: 24 bytes, 6 instruction cycles, and 2 registers. In x86-64: 11 bytes, 1 instruction cycle, and 1 register.
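One way to make the comparison concrete is to compile a trivial snippet for both targets and inspect the output with objdump. The sketch below is only for illustration (the file name, function names, and the constant are invented): the first function forces the compiler to synthesize the 64-bit immediate in registers, while the second forces a load from data memory.

/* imm64.c -- compile and inspect, e.g.:
 *   riscv64-linux-gnu-gcc -O2 -c imm64.c && riscv64-linux-gnu-objdump -d imm64.o
 *   gcc -O2 -c imm64.c && objdump -d imm64.o     (on an x86-64 host)
 */
#include <stdint.h>

/* The compiler must materialize the 64-bit immediate in registers:
 * typically a single movabs on x86-64, but a multi-instruction
 * lui/addi/slli chain on RV64. */
uint64_t imm_in_registers(void)
{
    return 0x123456789ABCDEF0ULL;
}

/* Keeping the constant in memory turns the sequence into one load,
 * at the cost of a (slower) data-memory access. */
static const uint64_t pool = 0x123456789ABCDEF0ULL;

uint64_t imm_from_memory(void)
{
    return *(volatile const uint64_t *) &pool;
}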
It is not so apparent where the starting and ending position of a particular code block is. Does this mean hard-coding pseudo patterns (common code block patterns) for the binary to match on?
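No pattern matching should be required: a translator can simply decode forward from a block's entry point until it reaches an instruction that changes control flow. Below is a minimal, self-contained sketch of that idea for RV32I; the helper names and the toy code array are invented for illustration, and compressed (RVC) instructions are ignored.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* RV32I major opcodes (bits [6:0]) that end a basic block. */
static bool is_block_terminator(uint32_t insn)
{
    switch (insn & 0x7F) {
    case 0x6F: /* JAL             */
    case 0x67: /* JALR            */
    case 0x63: /* BEQ/BNE/BLT/... */
    case 0x73: /* ECALL/EBREAK    */
        return true;
    default:
        return false;
    }
}

/* Scan forward from 'entry' and return the number of instructions in the
 * block, i.e. up to and including the first control-flow instruction. */
size_t block_length(const uint32_t *code, size_t entry, size_t n_insn)
{
    for (size_t i = entry; i < n_insn; i++)
        if (is_block_terminator(code[i]))
            return i - entry + 1;
    return n_insn - entry; /* fell off the end of the image */
}

int main(void)
{
    const uint32_t code[] = {
        0x00500513, /* addi a0, zero, 5                    */
        0x00A50533, /* add  a0, a0, a0                     */
        0x00008067, /* jalr zero, 0(ra) -- block ends here */
    };
    printf("block length = %zu\n", block_length(code, 0, 3));
    return 0;
}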
blink is a virtual machine for running x86-64-linux programs on different operating systems and hardware architectures. It recently implemented a JIT. Quote from blink/jit.c:
wasm3 is a fast WebAssembly interpreter without a JIT. In general, it executes code around 4-15x slower than compiled code on a modern x86 processor.
Tails is a minimal, fast Forth-like interpreter core. It uses no assembly code, only C++, but an elegant tail-recursion technique inspired by Wasm3 makes it nearly as efficient as hand-written assembly.
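The tail-call technique mentioned above can be sketched in C: each opcode handler finishes with a tail call into the handler of the next opcode, which an optimizing compiler lowers to a plain jump, so there is no central dispatch loop and no stack growth. The sketch assumes Clang (or any compiler honoring the musttail attribute); the bytecode format and handler names are made up and are not wasm3's or Tails' actual interfaces.

#include <stdint.h>
#include <stdio.h>

typedef struct vm vm_t;
typedef void (*op_fn)(vm_t *vm, const uint8_t *ip);

struct vm {
    int64_t acc;        /* single accumulator register      */
    const op_fn *table; /* opcode -> handler dispatch table */
};

/* Every handler ends with a guaranteed tail call into the next handler,
 * so the compiler emits a jump instead of a call/return pair. */
#define DISPATCH(vm, ip) \
    __attribute__((musttail)) return (vm)->table[*(ip)]((vm), (ip) + 1)

static void op_halt(vm_t *vm, const uint8_t *ip)
{
    (void) ip;
    printf("acc = %lld\n", (long long) vm->acc);
}

static void op_push(vm_t *vm, const uint8_t *ip)
{
    vm->acc = *ip++; /* one-byte immediate operand */
    DISPATCH(vm, ip);
}

static void op_add1(vm_t *vm, const uint8_t *ip)
{
    vm->acc += 1;
    DISPATCH(vm, ip);
}

int main(void)
{
    static const op_fn table[] = {op_halt, op_push, op_add1};
    const uint8_t prog[] = {1, 41, 2, 0}; /* push 41; add1; halt */
    vm_t vm = {.acc = 0, .table = table};
    vm.table[prog[0]](&vm, prog + 1);     /* start execution */
    return 0;
}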
Benchmark results of rv8: (RV32 only, smaller is better)
Our strategy for developing the JIT is to translate extended basic blocks (EBBs). Consider an instruction sequence that is a hot spot in Mandelbrot, together with its corresponding EBB; the generated code for this EBB is shown below.
insn_10750:
    ...                      /* translated instructions of this block */
    goto insn_10754;         /* unconditional jump to the next block */
insn_10754:
    ...
    if (...)                 /* translated conditional branch */
        goto insn_10760;     /* one successor block */
    goto insn_10758;         /* the other successor block */
...
The commit 36f304c implements the JIT strategy described above. The benchmark results, as shown in the statistics below, demonstrate that the JIT has a positive effect on benchmarks with long execution times. However, in the case of Mandelbrot, its short execution time means that the overhead of the JIT outweighs its benefits.
We experimented with the same JIT strategy on top of two different compilers: Clang and MIR. The issue with Clang is that we need to fork a Clang process, which results in significant overhead; however, its ability to optimize code is strong. In contrast, launching MIR has relatively small overhead, but its ability to optimize code is relatively weak, and we cannot determine the size of the machine code it produces. This limitation prevents us from using a code cache to manage machine code effectively.
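For reference, the Clang-based variant essentially amounts to writing the translated C source of a block to a file, spawning a compiler process, and loading the resulting shared object. The sketch below is illustrative only (the paths, the block_fn type, and the helper name are not the emulator's actual API); it shows where the fork/exec and compile cost that dominates short-running programs comes from.

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*block_fn)(void *cpu_state);

/* Compile the C source of one translated block into a shared object and
 * return a callable function pointer.  Every call pays for a full compiler
 * process (fork + exec + optimization), which is the overhead noted above. */
block_fn compile_block_with_clang(const char *c_src_path,
                                  const char *entry_symbol)
{
    char cmd[512];
    snprintf(cmd, sizeof(cmd),
             "clang -O2 -shared -fPIC -o /tmp/block.so %s", c_src_path);
    if (system(cmd) != 0) /* forks and execs the compiler */
        return NULL;

    void *handle = dlopen("/tmp/block.so", RTLD_NOW);
    if (!handle)
        return NULL;
    return (block_fn) dlsym(handle, entry_symbol);
}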
The preliminary baseline JIT compiler has been landed.
Ongoing
You shall describe the details in #142 rather than here. In #142, we care about the feasibility of improving block-based execution by introducing a dominator tree.
The author of RVVM discussed the design choices where it differs substantially from QEMU.
lightrec is a MIPS-to-everything dynamic recompiler (aka JIT compiler or dynarec) for PlayStation emulators, using GNU Lightning as the code emitter. Check optimizer.c, blockcache.c, and the TLSF allocator for the implementation. Test hardware: Desktop PC with a Core i7-7700K, Windows 10
copyjit draws inspiration from the paper "Copy-and-Patch Compilation." However, what if patching could be entirely eliminated? The core concept revolves around using the compiler to generate 'templates' that can be directly copied into place. This approach relies heavily on continuation passing, which means that all operations defined by the JIT library must allow for continuation-passing optimizations.

In copy-and-patch, the templates are filled in at runtime with user-selected values. Unfortunately, this method relies on parsing ELF relocations, which necessitates porting the library to different platforms. While not a major issue, avoiding runtime patching of relocations could potentially enable a JIT library that is architecture-agnostic and offers very low latencies.

bcgen generates a number of files in a directory called gen in the working directory. These generated files are included by bcode.c, which you can compile into an object file that provides an interface for compiling and running bytecode.

See also: A Template-Based Code Generation Approach for MLIR
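The continuation-passing template style can be illustrated in plain C. Each operation is written so that its last action is a call, in tail position, to an externally visible continuation; built with optimizations (sibling-call optimization), the template body becomes straight-line code that ends in a single jump, which is what makes the compiled bytes candidates for copying. This is only a sketch of the style under those assumptions, not copyjit's actual interface.

#include <stdint.h>
#include <stdio.h>

/* The continuation is an ordinary external function.  In an optimized build,
 * the call in tail position is compiled into one trailing jump. */
void continue_here(int64_t *regs);

void template_add(int64_t *regs)
{
    regs[0] = regs[1] + regs[2];
    continue_here(regs); /* tail position -> trailing jmp */
}

/* Out-of-line stub continuation so the example links and runs normally. */
__attribute__((noinline)) void continue_here(int64_t *regs)
{
    printf("result = %lld\n", (long long) regs[0]);
}

int main(void)
{
    int64_t regs[3] = {0, 20, 22};
    template_add(regs); /* prints: result = 42 */
    return 0;
}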
QEMU employs a two-step process for executing binaries, involving an intermediate representation known as tiny code. This tiny code can be handled in two ways: it is either interpreted, or compiled into native code by a JIT compiler, which usually yields better speed. However, a JIT requires the allocation of executable memory to hold the compiled code, which is not permitted on iOS.

To circumvent this restriction, a technique is employed that reuses portions of code that are already in executable memory. This concept goes by various names, such as code reuse, ROP (return-oriented programming), and ret2code, and it has been formalized as "weird machines" because the semantics of the final execution differ from those of the original code. The approach builds programs out of small code gadgets.

This inventive approach allows the creation of complete programs by reusing existing code, a technique historically employed for writing exploits. It provides a creative solution for implementing JIT compilers on architectures that disallow the allocation of executable memory. See commit 4de86e. UTM already merges the above qemu-tcg-tcti effort (see its patches directory).
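With this approach, the 'JIT output' is not freshly emitted machine code but plain data: a sequence of pointers to gadgets that were compiled ahead of time and therefore already reside in executable memory, walked by a small threaded-code dispatcher. The following self-contained sketch illustrates the idea; the gadget set and operand encoding are invented and unrelated to QEMU's actual TCTI backend.

#include <stdint.h>
#include <stdio.h>

/* Gadgets are ordinary ahead-of-time compiled functions, so they already
 * live in executable memory; no runtime code generation is needed. */
typedef struct machine {
    int64_t reg[4];
    const void **ip; /* cursor into the "compiled" gadget stream */
} machine_t;

typedef void (*gadget_t)(machine_t *m);

static void gadget_load_imm(machine_t *m)
{
    int64_t dst = (int64_t)(intptr_t) *m->ip++; /* operands interleave */
    int64_t imm = (int64_t)(intptr_t) *m->ip++; /* with gadget pointers */
    m->reg[dst] = imm;
}

static void gadget_add(machine_t *m)
{
    int64_t dst = (int64_t)(intptr_t) *m->ip++;
    int64_t src = (int64_t)(intptr_t) *m->ip++;
    m->reg[dst] += m->reg[src];
}

static void gadget_print(machine_t *m)
{
    int64_t src = (int64_t)(intptr_t) *m->ip++;
    printf("r%lld = %lld\n", (long long) src, (long long) m->reg[src]);
}

int main(void)
{
    /* "Translation" output: data only -- gadget pointers plus operands. */
    const void *stream[] = {
        (void *) gadget_load_imm, (void *) 1, (void *) 40,
        (void *) gadget_load_imm, (void *) 2, (void *) 2,
        (void *) gadget_add,      (void *) 1, (void *) 2,
        (void *) gadget_print,    (void *) 1,
        NULL, /* end of block */
    };

    machine_t m = {.ip = stream};
    while (*m.ip) {
        gadget_t g = (gadget_t) *m.ip++; /* threaded dispatch */
        g(&m);                           /* prints: r1 = 42   */
    }
    return 0;
}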
Reference:
pylbbv is a lazy basic block versioning + copy and patch JIT interpreter for CPython. The copy-and-patch JIT compiler uses a stencil compiler.
luajit-remake transforms an LLVM function to make it suitable for compilation and back-parsing into a copy-and-patch stencil. The function is split into two parts: the fast path and the slow path. Slow-path logic is identified using BlockFrequencyInfo, and annotations are added to the LLVM IR so that the slow path can be identified and separated during assembly generation.
It is important to note that the IR-level rewrite pass should be executed immediately before the LLVM module is compiled to assembly. Once this pass is applied, no further transformations of the LLVM IR are allowed.
Jonathan Müller has an excellent talk, "A Deep Dive into Dispatching Techniques." He compared a manual jump table with the one generated by an optimizing compiler: the compiler generates a jump table with 4-byte relative offsets rather than 8-byte absolute pointers, resulting in faster execution on an Intel Core i5-1145G7.
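The difference is easy to reproduce with a toy dispatcher: a hand-rolled table of label addresses stores full 8-byte pointers on a 64-bit host, while a dense switch lets the compiler emit a more compact table of offsets relative to the table base (visible in the disassembly; with only a few cases it may fall back to compares). The sketch below uses the GCC/Clang computed-goto extension for the manual variant; the opcode set is made up.

#include <stdint.h>
#include <stdio.h>

/* Manual jump table: an array of label addresses (8 bytes each on a 64-bit
 * host), indexed via the GCC/Clang computed-goto extension. */
static int64_t run_manual(const uint8_t *op, int64_t acc)
{
    static void *table[] = {&&do_halt, &&do_inc, &&do_dbl};
    goto *table[*op++];
do_inc: acc += 1; goto *table[*op++];
do_dbl: acc *= 2; goto *table[*op++];
do_halt: return acc;
}

/* Switch-based dispatch: with enough dense cases, the compiler lowers this
 * to a jump table of small relative offsets instead of absolute pointers. */
static int64_t run_switch(const uint8_t *op, int64_t acc)
{
    for (;;) {
        switch (*op++) {
        case 0: return acc;
        case 1: acc += 1; break;
        case 2: acc *= 2; break;
        }
    }
}

int main(void)
{
    const uint8_t prog[] = {1, 2, 2, 0}; /* (0 + 1) * 2 * 2 = 4 */
    printf("%lld %lld\n", (long long) run_manual(prog, 0),
           (long long) run_switch(prog, 0));
    return 0;
}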
WebAssembly Micro Runtime (WAMR) is a lightweight standalone WebAssembly (Wasm) runtime with a small footprint, high performance, and highly configurable features for applications ranging from embedded devices to the cloud.
Possible lightweight JIT framework:
The core of the security concern lies in the inherent complexity of the system. Even extensively used and battle-tested tools like wasmtime have experienced severe vulnerabilities, such as the recent critical bug that could potentially lead to remote code execution (as seen in Guest-controlled out-of-bounds read/write on x86_64 · bytecodealliance/wasmtime). The strategy employed here, assuming it progresses beyond the experimental phase, comprises three key elements to ensure robust security:
An experimental JIT for PHP, built upon the dstogov/ir project, has been developed and can be found in the master branch of the php-src repository. By following the provided build instructions, we can build a development version of PHP, which reports:

$ sapi/cli/php --version
PHP 8.4.0-dev (cli) (built: Dec 1 2023 01:59:26) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.0-dev, Copyright (c) Zend Technologies

Check if opcache is loaded:
$ sapi/cli/php -v | grep -i opcache

$ sapi/cli/php -d opcache.jit=off Zend/bench.php
...
Total 0.310
$ sapi/cli/php -d opcache.jit=tracing Zend/bench.php
...
Total 0.089

With the tracing JIT enabled, Zend/bench.php finishes in 0.089 s instead of 0.310 s, roughly a 3.5x speedup.
Whose baseline compiler is it anyway? by Ben L. Titzer
The concept of the delay slot in MIPS was initially a straightforward solution to manage pipeline hazards in five-stage pipelines. However, it became a burden for processors with longer pipelines and the ability to issue multiple instructions per clock cycle. From a software perspective, the delay slot also has drawbacks: it makes programs harder to read and often less efficient, since the slot frequently has to be filled with a NOP.

Historically, in the 1980s, the branch delay slot made sense for pipelines of 5 or 6 stages, as it helped mitigate the one-cycle branch penalty inherent in those designs. With the evolution of processor architectures, this approach has become outdated. In modern Pentium microarchitectures, for instance, the branch penalty can range from 15 to 25 cycles, rendering a single-instruction delay slot ineffective; a delay slot long enough to cover a 15-instruction delay would be impractical and would break instruction-set compatibility.

Advancements in hardware have brought more efficient solutions. Branch prediction is now a mature technology, and the misprediction rate of current predictors is low enough that a delay slot no longer pays for its costs. Given these considerations, both in hardware and software terms, delay slots offer little advantage, and modern architectures like RISC-V have chosen to omit them.

lightrec, a MIPS recompiler that employs GNU Lightning for code emission, must handle the delay-slot behavior of MIPS. This feature is not present in RISC-V and other recent RISC designs, reflecting a broader trend in newer architectures to move away from this once-common design element.
rv64_emulator is a RISC-V ISA emulation suite that contains a full-system emulator and an ELF instruction-frequency analyzer, with a JIT compiler targeting Arm64.
rv8 demonstrates how RISC-V instruction emulation can benefit from JIT compilation and aggressive optimizations. However, it is dedicated to x86-64, which makes it hard to support other host architectures such as Apple M1 (AArch64). SFUZZ is a high-performance fuzzer using RISC-V to x86 binary translation together with modern fuzzing techniques. RVVM is another example that implements a tracing JIT.
The goal of this task is to utilize an existing JIT framework as a new abstraction layer while we accelerate RISC-V instruction execution. In particular, we would:
The high-level operation of the JIT compilation can be summed up as follows:
Since translation occurs at the basic-block level, every block ends once a branch instruction has been translated. There is then room for further optimization passes over the generated code, as in the sketch below.
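Concretely, the main loop looks up the current guest PC in a cache of translated blocks, translates and caches the block on a miss, and then runs the native code. The following is a structural sketch only; every type and helper here is a placeholder rather than the emulator's actual API.

#include <stdint.h>

typedef struct riscv riscv_t;          /* guest CPU state (placeholder) */
typedef void (*block_fn)(riscv_t *rv); /* one translated basic block    */

extern uint32_t guest_pc(riscv_t *rv);
extern int      guest_halted(riscv_t *rv);
extern block_fn cache_lookup(uint32_t pc);                 /* NULL on miss    */
extern block_fn translate_block(riscv_t *rv, uint32_t pc); /* stops at branch */
extern void     cache_insert(uint32_t pc, block_fn fn);

/* Main loop: reuse a previously translated block when possible; otherwise
 * translate up to the terminating branch, cache the result, and execute it. */
void jit_run(riscv_t *rv)
{
    while (!guest_halted(rv)) {
        uint32_t pc = guest_pc(rv);
        block_fn fn = cache_lookup(pc);
        if (!fn) {
            fn = translate_block(rv, pc); /* optimization passes fit here */
            cache_insert(pc, fn);
        }
        fn(rv); /* executes the block and advances the guest PC */
    }
}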
We gain speed by using the technique for the reasons listed below:
Reference: