optimize time wasted in passing argument and fetching return #137
Comments
Regarding _value._convert: I believe that in the constructor of class Func we need to set a property
where Val.struct_code is, and this should be done once in the Func constructor, not on every call.
I was able to make it 4x faster.
Here is the code:

    def cdb_djp_hash_wasm(input: bytearray):
        size = len(input)
        start = hb
        stop = start + size
        # copy the input into wasm linear memory in one slice assignment
        np_mem[start:stop] = input
        params = (start, size)
        # one raw slot per parameter, filled directly without per-value conversion
        raw = (wasmtime_val_raw_t * len(params))()
        for row_i, param in zip(raw, params):
            row_i.i32 = param
        raw_ptr_casted = ctypes.cast(raw, ctypes.POINTER(wasmtime_val_raw_t))
        with enter_wasm(store) as trap:
            error = ffi.wasmtime_func_call_unchecked(
                store._context,
                ctypes.byref(_cdb_djp_hash_wasm_in._func),
                raw_ptr_casted,
                trap)
            if error:
                raise WasmtimeError._from_ptr(error)
        # the result is written back into slot 0; reinterpret it as unsigned
        return ctypes.c_uint32(raw[0].i32).value

And there is still more room for improvement.
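The argument-packing part of the code above can be tried without wasmtime at all. The following is only a sketch: `RawVal` is a hypothetical stand-in for `wasmtime_val_raw_t` (a 16-byte union, matching the size reported in this thread), and the "call" is simulated by writing a result into slot 0.

```python
import ctypes


class RawVal(ctypes.Union):
    """Hypothetical stand-in for wasmtime_val_raw_t (assumed to be a 16-byte union)."""
    _fields_ = [
        ("i32", ctypes.c_int32),
        ("i64", ctypes.c_int64),
        ("f32", ctypes.c_float),
        ("f64", ctypes.c_double),
        ("v128", ctypes.c_uint8 * 16),
    ]


def pack_i32_args(*params):
    """Write each i32 argument straight into its own slot, with no type checks."""
    raw = (RawVal * len(params))()
    for slot, param in zip(raw, params):
        slot.i32 = param
    return raw


raw = pack_i32_args(1024, 13)
raw_ptr = ctypes.cast(raw, ctypes.POINTER(RawVal))

# A real callee would overwrite slot 0 with its result; simulate that here.
raw_ptr[0].i32 = -1

# Reinterpret the signed i32 slot as an unsigned 32-bit value.
result = ctypes.c_uint32(raw[0].i32).value
```

The point of the pattern is that the caller already knows every parameter is an i32, so nothing has to be inspected or boxed per call.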
Now it's ~66ms instead of ~80ms (compared to the old 350ms), by doing things once outside of the loop (the function won't change its prototype/signature inside the loop):

    def func_init(func, store):
        ty = func.type(store)
        ty_params = ty.params
        ty_results = ty.results
        # a tuple, not a generator: it is iterated again on every call
        params_str = tuple(str(i) for i in ty_params)
        params_n = len(ty_params)
        results_n = len(ty_results)
        # the same array holds the parameters going in and the results coming out
        n = max(params_n, results_n)
        raw_type = wasmtime_val_raw_t * n
        func.raw_type = raw_type

        def _create_raw(params):
            raw = raw_type()
            for i, param_str in enumerate(params_str):
                setattr(raw[i], param_str, params[i])
            return raw

        func._create_raw = _create_raw

    func_init(_cdb_djp_hash_wasm_in, store)
    #....
    raw = _cdb_djp_hash_wasm_in._create_raw(params)
    raw_ptr_casted = ctypes.cast(raw, ctypes.POINTER(wasmtime_val_raw_t))
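The "do it once, outside the loop" idea can be sketched independently of wasmtime. Everything named here is hypothetical (`RawVal`, `make_packer`); the only assumption carried over from the thread is that the field name for each parameter can be derived once from the function signature. Note that the field names are stored in a tuple: a generator would be exhausted after the first call and later calls would silently pack nothing.

```python
import ctypes


class RawVal(ctypes.Union):
    """Hypothetical stand-in for wasmtime_val_raw_t."""
    _fields_ = [
        ("i32", ctypes.c_int32),
        ("i64", ctypes.c_int64),
        ("f32", ctypes.c_float),
        ("f64", ctypes.c_double),
    ]


def make_packer(param_kinds, results_n):
    """Do the signature-dependent work once and return a cheap per-call packer."""
    kinds = tuple(param_kinds)      # must survive many calls, so not a generator
    n = max(len(kinds), results_n)  # slots are shared by params and results
    raw_type = RawVal * n

    def create_raw(params):
        raw = raw_type()
        for i, kind in enumerate(kinds):
            setattr(raw[i], kind, params[i])
        return raw

    return create_raw


# Built once per function signature...
create_raw = make_packer(["i32", "i32"], results_n=1)

# ...and reused on every call inside the hot loop.
first = create_raw((1024, 13))
second = create_raw((2048, 26))
```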
Using struct.Struct gave ~80ms, that is, no extra benefit. Each wasmtime_val_raw_t is 16 bytes (that is, each one is 4 x i32):

    # outside the loop
    st = Struct('<LLLLLLLL')
    raw_ptr_type = ctypes.c_uint8 * st.size
    st_ret = Struct('<L')
    # inside the loop
    raw = bytearray(st.pack(start, 0, 0, 0, size, 0, 0, 0))
    raw_ptr = raw_ptr_type.from_buffer(raw)
    raw_ptr_casted = ctypes.cast(raw_ptr, ctypes.POINTER(wasmtime_val_raw_t))

I'll continue using the previous approach.
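To make the layout in the struct.Struct variant easier to see, here is a small self-contained sketch. It assumes, as stated above, that each raw value occupies 16 bytes with the i32 in the first 4 bytes; the `x` pad-byte spelling is just an alternative way to write the same format and is not what the measured code used.

```python
import ctypes
from struct import Struct

# Two 16-byte slots, written as eight little-endian uint32 fields.
st_explicit = Struct("<LLLLLLLL")
# Same layout using pad bytes: one uint32 followed by 12 zero bytes, twice.
st_padded = Struct("<L12xL12x")
st_ret = Struct("<L")

start, size = 1024, 13
explicit_bytes = st_explicit.pack(start, 0, 0, 0, size, 0, 0, 0)
padded_bytes = st_padded.pack(start, size)

# A writable buffer that ctypes can view without copying.
raw = bytearray(padded_bytes)
raw_view = (ctypes.c_uint8 * st_padded.size).from_buffer(raw)

# Reading a 32-bit result back out of slot 0.
(slot0,) = st_ret.unpack_from(raw, 0)
```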
With the profiler enabled, the old method took ~700ms. The profile of the new method is very clean, and it does not seem possible to optimize it further. Here are the reports:
I just wonder why |
This is a great improvement: from 43 ms to 6 ms on @alexprengere's question in #96.
While discussing #96 I was able to identify a bottleneck:
a needlessly large amount of time was wasted in passing the arguments and retrieving the return value.
Basically, I made a simple 32-bit hash function, cdb_djp_hash.c.
In pure Python the time is proportional to the string length.
In wasm it was almost constant, ~40ms, regardless of string length, from 13 bytes to 1300 bytes (100x),
which means that most of those 40ms are spent passing the arguments, not in the actual loop (if it were the loop, the time would be proportional to the string length, or a large fraction of it would be).
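The reasoning above (time that does not grow with input length cannot be coming from the hashing loop) can be checked with a small timing harness. This is only a sketch: cdb_djp_hash.c is not shown in this issue, so `cdb_hash_py` assumes a DJB-style cdb hash, and `time_per_call` is a hypothetical helper.

```python
import timeit


def cdb_hash_py(data: bytes) -> int:
    """Assumed DJB-style 32-bit hash; the actual C source is not shown here."""
    h = 5381
    for byte in data:
        h = (((h << 5) + h) ^ byte) & 0xFFFFFFFF
    return h


def time_per_call(func, payload, number=200):
    """Average seconds per call of func(payload)."""
    return timeit.timeit(lambda: func(payload), number=number) / number


short_input = b"x" * 13
long_input = b"x" * 1300  # 100x longer

t_short = time_per_call(cdb_hash_py, short_input)
t_long = time_per_call(cdb_hash_py, long_input)
print(f"13 bytes: {t_short * 1e6:.1f} us, 1300 bytes: {t_long * 1e6:.1f} us")
```

If a wasm-backed version were timed with the same helper and came out flat across both inputs, the fixed per-call overhead would be dominating, which is the observation made here.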
This was confirmed by profiling.
First, note that the actual WASM call is fast (169ms out of 800ms),
which means that we might be able to squeeze out ~650ms of the 800ms, making it 4x faster,
and I was able to identify the bottlenecks.
Examples of such things:
isinstance, which is called 170k times (in a 10k iteration benchmark), that is, 17 times per call
_value.py:129(_convert), which took ~200ms out of 800ms
{method 'append' of 'list' objects}
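Entries like the ones above are what Python's cProfile reports. The sketch below shows how such a profile can be collected; `convert_stand_in` and `call_stand_in` are hypothetical stand-ins for a per-argument conversion path, not the actual wasmtime-py code.

```python
import cProfile
import io
import pstats


def convert_stand_in(value):
    """Hypothetical per-argument conversion that probes types one by one."""
    for kind in (bool, int, float, bytes):
        if isinstance(value, kind):
            return kind.__name__, value
    raise TypeError(f"unsupported argument type: {type(value)!r}")


def call_stand_in(*args):
    """Hypothetical call wrapper that converts every argument on every call."""
    converted = []
    for arg in args:
        converted.append(convert_stand_in(arg))
    return converted


profiler = cProfile.Profile()
profiler.enable()
for _ in range(10_000):
    call_stand_in(1024, 13)
profiler.disable()

# Render the ten most expensive entries by own time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(10)
report = stream.getvalue()
print(report)
```

The call counts in the first column make it easy to spot helpers that run many times per outer call.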
Are we sure that passing arguments / converting can be made faster?
Yes, I can confirm that
_value.py:129(_convert)
is too slow. I moved the conversion outside the loop and confirmed it was the bottleneck.
Also, I can confirm that this time is pure waste, because passing parameters should only do the following:
memory
(#135, "Setting a slice of memory without having to enumerate" #81, "FIXES #81: 400x faster slice assignment" #134) to prove that we can convert in almost no time.