workaround cdt's small const memcpy host function calls in EOS VM OC #1016

Draft · wants to merge 2 commits into main
Conversation

spoonincode (Member):

AntelopeIO/cdt#102 is a long-time thorn. While it would obviously be ideal to fix cdt so it does not generate these host function calls in the first place, it's not clear when that will happen -- and regardless, there are many contracts that may never be recompiled, and years of historical blocks that will contain these calls forever.

This implements a workaround in EOS VM OC that identifies small constant-size memcpy host function calls during code generation and replaces them with simple memory loads and stores, plus a small call to a native function that verifies the memcpy ranges do not overlap, as required by protocol rules.

As a simple example, a contract such as,

(module
 (import "env" "memcpy" (func $memcpy (param i32 i32 i32) (result i32)))
 (export "apply" (func $apply))
 (memory 1)
 (func $apply (param $0 i64) (param $1 i64) (param $2 i64)
    (drop (call $memcpy (i32.const 4) (i32.const 200) (i32.const 8)))
 )
)

will generate machine code like the following (annotations mine):

    pushq	%rax
    decl	%gs:-18888     ;;decrement depth count and jump if zero
    je  	51
    movq	%gs:200, %rax  ;;load 8 bytes from address 200 into a register
    movq	%rax, %gs:4    ;;store 8 bytes to address 4 from the register
    movl	$4, %edi       ;;prepare the 3 parameters to check_memcpy_params
    movl	$200, %esi
    movl	$8, %edx
    callq	*%gs:-21056    ;;call check_memcpy_params
    incl	%gs:-18888     ;;increment depth count
    popq	%rax
    retq
    callq	*%gs:-18992    ;;call depth_assert

(in practice the source and destination addresses are typically not constants)
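To make the transformation concrete, here is a rough C++ sketch (my own illustration -- the function names, signatures, and the throwing check are assumptions, not the PR's actual code) of what the generated code above does for memcpy(4, 200, 8): the copy itself becomes one fixed-width load/store pair, and only the overlap validation remains a call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

// stand-in for check_memcpy_params (hypothetical name reuse): enforces the
// same rule as the memcpy host function -- the ranges may not overlap
static void check_memcpy_params(uint32_t udest, uint32_t usrc, uint32_t ulength) {
   if((size_t)std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc) < ulength)
      throw std::runtime_error("memcpy ranges overlap");
}

// the 8-byte copy is just one load and one store on the linear memory `mem`;
// as in the machine code above, the overlap check is called after the copy
static void const_memcpy8(uint8_t* mem, uint32_t dest, uint32_t src) {
   uint64_t tmp;
   std::memcpy(&tmp, mem + src, 8);   // movq %gs:200, %rax
   std::memcpy(mem + dest, &tmp, 8);  // movq %rax, %gs:4
   check_memcpy_params(dest, src, 8);
}
```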

For tested block ranges starting at 346025446, 348083350, 371435719, and 396859576, replay performance increases by 3.5% to 6%. This is a bit lower than I expected -- I was expecting consistently north of 5%. The shortfall is because many memcpy host function calls are still not optimized out, and the overlap check can still consume a significant amount of CPU.

A run of the LLVM exhaustive workflow is at:
https://github.com/AntelopeIO/spring/actions/runs/11711214896

@@ -25,4 +25,7 @@ using intrinsic_map_t = std::map<std::string, intrinsic_entry>;

const intrinsic_map_t& get_intrinsic_map();

static constexpr unsigned minimum_const_memcpy_intrinsic_to_optimize = 1;
static constexpr unsigned maximum_const_memcpy_intrinsic_to_optimize = 128;
spoonincode (Member Author):
128 is a bit of a dice roll. I don't have data pinpointing the exact threshold where this local copy becomes slower than the full call off to memcpy. I wouldn't want to go higher than this anyway, though, since it bloats the code a little more.

spoonincode (Member Author):

This is a weird place for these, but OC's header files are a little awkward because some parts are still compiled as C++17 -- these constants need to live in a header that is consumed by the C++17 code.

const_memcpy_sz->getZExtValue() >= minimum_const_memcpy_intrinsic_to_optimize &&
const_memcpy_sz->getZExtValue() <= maximum_const_memcpy_intrinsic_to_optimize) {
const unsigned sz_value = const_memcpy_sz->getZExtValue();
llvm::IntegerType* type_of_memcpy_width = llvm::Type::getIntNTy(context, sz_value*8);
spoonincode (Member Author):

The use of a non-power-of-2 type size here is rather unorthodox, though one imagines something like _BitInt from C23 would do exactly this. Unit tests exhaustively explore every size (1 through 128). An oddball size generates mostly the expected machine code (the loads and stores may not be in the "natural" order one would expect, but of course that doesn't matter); for example, for size 13:

    pushq	%rax
    decl	%gs:-18888
    je  	83
    movl	%gs:208, %eax  ;;load 4 bytes (208-211)
    movq	%gs:200, %rcx  ;;load 8 bytes (200-207)
    movb	%gs:212, %dl   ;;load 1 byte  (212)
    movb	%dl, %gs:16    ;;store 1 byte  (16)
    movq	%rcx, %gs:4    ;;store 8 bytes (4-11)
    movl	%eax, %gs:12   ;;store 4 bytes (12-15)
    movl	$4, %edi
    movl	$200, %esi
    movl	$13, %edx
    callq	*%gs:-21056
    incl	%gs:-18888
    popq	%rax
    retq
    callq	*%gs:-18992
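The 8+4+1 decomposition above can be checked with a tiny sketch (my own, purely illustrative -- not the PR's code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// a 13-byte copy decomposed into power-of-two chunks -- one 8-byte, one
// 4-byte, and one 1-byte access -- matching the load/store pattern in the
// annotated listing above
static void copy13(uint8_t* dest, const uint8_t* src) {
   uint64_t q; uint32_t d; uint8_t b;
   std::memcpy(&q, src,      8);  // bytes 0-7  (movq)
   std::memcpy(&d, src + 8,  4);  // bytes 8-11 (movl)
   std::memcpy(&b, src + 12, 1);  // byte 12    (movb)
   std::memcpy(dest,      &q, 8);
   std::memcpy(dest + 8,  &d, 4);
   std::memcpy(dest + 12, &b, 1);
}
```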

const unsigned ulength = length;

//this must remain the same behavior as the memcpy host function
if((size_t)(std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc)) >= ulength)
spoonincode (Member Author):

This check compiles down to ~11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever could shave off another 0.5%.

Contributor:

I wonder if

if ((std::max(udest, usrc) - std::min(udest, usrc)) >= ulength)

might be faster?
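As a quick sanity check (my own sketch, not part of the PR), the two formulations agree for 32-bit wasm offsets widened to 64 bits, since the subtraction in ptrdiff_t cannot wrap:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// the PR's std::abs formulation of the non-overlap predicate
static bool non_overlapping_abs(uint32_t udest, uint32_t usrc, uint32_t ulength) {
   return (size_t)std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc) >= ulength;
}

// the suggested max/min formulation; unsigned subtraction of the smaller
// from the larger gives the same distance
static bool non_overlapping_minmax(uint32_t udest, uint32_t usrc, uint32_t ulength) {
   return std::max(udest, usrc) - std::min(udest, usrc) >= ulength;
}
```

Whether the max/min form is actually faster would still need measurement; semantically the two are interchangeable here.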

Member:

> This check compiles down to ~11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever could shave off another 0.5%.

Did you try plain if comparisons without using std::abs?

spoonincode linked an issue on Nov 6, 2024 that may be closed by this pull request.
ericpassmore (Contributor):

Note:start
category: System Stability
component: EOS EVM
summary: Improve EOS EVM by working around CDT's memcpy function.
Note:end


Successfully merging this pull request may close these issues.

experiment with workaround for cdt inefficient memcpy usage
4 participants