workaround cdt's small const memcpy host function calls in EOS VM OC #1016

Draft · wants to merge 2 commits into main
Conversation

spoonincode (Member):

AntelopeIO/cdt#102 is a long-time thorn. While it would obviously be ideal to fix cdt so it does not generate these host function calls in the first place, it's not clear when that will happen -- and regardless, there are many contracts that may never be recompiled, and years of historical blocks that will contain these calls forever.

This implements a workaround in EOS VM OC that identifies small constant-size memcpy host function calls during code generation and replaces them with simple memory loads and stores, plus a small call to a native function that verifies the memcpy ranges do not overlap, as required by protocol rules.

As a simple example, a contract such as,

(module
 (import "env" "memcpy" (func $memcpy (param i32 i32 i32) (result i32)))
 (export "apply" (func $apply))
 (memory 1)
 (func $apply (param $0 i64) (param $1 i64) (param $2 i64)
    (drop (call $memcpy (i32.const 4) (i32.const 200) (i32.const 8)))
 )
)

will generate machine code like the following (annotations mine):

    pushq	%rax
    decl	%gs:-18888     ;;decrement depth count and jump if zero
    je  	51
    movq	%gs:200, %rax  ;;load 8 bytes from address 200 into a register
    movq	%rax, %gs:4    ;;store 8 bytes to address 4 from the register
    movl	$4, %edi       ;;prepare the 3 parameters to check_memcpy_params
    movl	$200, %esi
    movl	$8, %edx
    callq	*%gs:-21056    ;;call check_memcpy_params
    incl	%gs:-18888     ;;increment depth count
    popq	%rax
    retq
    callq	*%gs:-18992    ;;call depth_assert

(in practice the source and destination addresses are typically not constants)
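To make the transformation concrete, here is a rough C++ sketch (my own illustration -- the function names, signatures, and the throwing check are assumptions, not the PR's actual code) of what the generated code above does for memcpy(4, 200, 8): the copy itself becomes one fixed-width load/store pair, and only the overlap validation remains a call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

// stand-in for check_memcpy_params (hypothetical name reuse): enforces the
// same rule as the memcpy host function -- the ranges may not overlap
static void check_memcpy_params(uint32_t udest, uint32_t usrc, uint32_t ulength) {
   if((size_t)std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc) < ulength)
      throw std::runtime_error("memcpy ranges overlap");
}

// the 8-byte copy is just one load and one store on the linear memory `mem`;
// as in the machine code above, the overlap check is called after the copy
static void const_memcpy8(uint8_t* mem, uint32_t dest, uint32_t src) {
   uint64_t tmp;
   std::memcpy(&tmp, mem + src, 8);   // movq %gs:200, %rax
   std::memcpy(mem + dest, &tmp, 8);  // movq %rax, %gs:4
   check_memcpy_params(dest, src, 8);
}
```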

For tested block ranges starting at 346025446, 348083350, 371435719, and 396859576, replay performance increases by 3.5% to 6%. This is a bit lower than I expected -- I was expecting consistently north of 5%. The shortfall is because many memcpy host function calls are still not optimized out, and the overlap check can still consume a significant amount of CPU.

A run of the LLVM exhaustive workflow is at:
https://github.com/AntelopeIO/spring/actions/runs/11711214896

@@ -25,4 +25,7 @@ using intrinsic_map_t = std::map<std::string, intrinsic_entry>;

const intrinsic_map_t& get_intrinsic_map();

static constexpr unsigned minimum_const_memcpy_intrinsic_to_optimize = 1;
static constexpr unsigned maximum_const_memcpy_intrinsic_to_optimize = 128;
spoonincode (Member Author):
128 is a bit of a dice roll. I don't have data pinpointing the exact threshold where this local copy becomes slower than the full call off to memcpy. I wouldn't want to go higher than this anyway, though, since it bloats the code a little more.

spoonincode (Member Author):

This is a weird place for these, but OC's header files are a little awkward because some parts are still compiled as C++17 -- these constants need to live in a header that is consumed by the C++17 code.

const_memcpy_sz->getZExtValue() >= minimum_const_memcpy_intrinsic_to_optimize &&
const_memcpy_sz->getZExtValue() <= maximum_const_memcpy_intrinsic_to_optimize) {
const unsigned sz_value = const_memcpy_sz->getZExtValue();
llvm::IntegerType* type_of_memcpy_width = llvm::Type::getIntNTy(context, sz_value*8);
spoonincode (Member Author):

The use of a non-power-of-2 type size here is rather unorthodox, though one imagines something like _BitInt from C23 would do exactly this. Unit tests exhaustively explore every size (1 through 128). An oddball size generates mostly the expected machine code (the loads and stores may not be in the "natural" order one would expect, but of course that doesn't matter); for example, for size 13:

    pushq	%rax
    decl	%gs:-18888
    je  	83
    movl	%gs:208, %eax  ;;load 4 bytes (208-211)
    movq	%gs:200, %rcx  ;;load 8 bytes (200-207)
    movb	%gs:212, %dl   ;;load 1 byte  (212)
    movb	%dl, %gs:16    ;;store 1 byte  (16)
    movq	%rcx, %gs:4    ;;store 8 bytes (4-11)
    movl	%eax, %gs:12   ;;store 4 bytes (12-15)
    movl	$4, %edi
    movl	$200, %esi
    movl	$13, %edx
    callq	*%gs:-21056
    incl	%gs:-18888
    popq	%rax
    retq
    callq	*%gs:-18992
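The 8+4+1 decomposition above can be checked with a tiny sketch (my own, purely illustrative -- not the PR's code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// a 13-byte copy decomposed into power-of-two chunks -- one 8-byte, one
// 4-byte, and one 1-byte access -- matching the load/store pattern in the
// annotated listing above
static void copy13(uint8_t* dest, const uint8_t* src) {
   uint64_t q; uint32_t d; uint8_t b;
   std::memcpy(&q, src,      8);  // bytes 0-7  (movq)
   std::memcpy(&d, src + 8,  4);  // bytes 8-11 (movl)
   std::memcpy(&b, src + 12, 1);  // byte 12    (movb)
   std::memcpy(dest,      &q, 8);
   std::memcpy(dest + 8,  &d, 4);
   std::memcpy(dest + 12, &b, 1);
}
```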

const unsigned ulength = length;

//this must remain the same behavior as the memcpy host function
if((size_t)(std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc)) >= ulength)
spoonincode (Member Author):

This check compiles down to ~11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever could shave off another 0.5%.

Contributor:

I wonder if

if ((std::max(udest, usrc) - std::min(udest, usrc)) >= ulength)

might be faster?
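As a quick sanity check (my own sketch, not part of the PR), the two formulations agree for 32-bit wasm offsets widened to 64 bits, since the subtraction in ptrdiff_t cannot wrap:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// the PR's std::abs formulation of the non-overlap predicate
static bool non_overlapping_abs(uint32_t udest, uint32_t usrc, uint32_t ulength) {
   return (size_t)std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc) >= ulength;
}

// the suggested max/min formulation; unsigned subtraction of the smaller
// from the larger gives the same distance
static bool non_overlapping_minmax(uint32_t udest, uint32_t usrc, uint32_t ulength) {
   return std::max(udest, usrc) - std::min(udest, usrc) >= ulength;
}
```

Whether the max/min form is actually faster would still need measurement; semantically the two are interchangeable here.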

Member:

> This check compiles down to ~11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever could shave off another 0.5%.

Did you try plain if comparisons without using std::abs?

spoonincode linked an issue on Nov 6, 2024 that may be closed by this pull request.
ericpassmore (Contributor):

Note:start
category: System Stability
component: EOS EVM
summary: Improve EOS EVM by working around CDT's memcpy function.
Note:end


Successfully merging this pull request may close these issues.

experiment with workaround for cdt inefficient memcpy usage
4 participants