Since we need to iterate over the "columns" of data packets (e.g. altitude, accel, gyro) more often than the "rows" (e.g. in the logger), it should be faster to represent data packets as numpy structured arrays: the operations we do can be vectorized, while fields remain accessible by name, so the code doesn't become completely obscure.
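For concreteness, here is a minimal sketch of what column-wise, vectorized access on a structured array looks like (the field names and values are just illustrative, not our actual packet schema):

```python
import numpy as np

# A hypothetical batch of data packets, one row per packet
packets = np.array(
    [(100.0, 0.1, 9.8), (101.5, 0.2, 9.7), (103.0, 0.1, 9.9)],
    dtype=[('est_alt', 'float64'), ('accel_x', 'float64'), ('accel_z', 'float64')],
)

# A whole "column" is accessible by name and can be operated on as a vector
mean_alt = packets['est_alt'].mean()
net_accel = np.hypot(packets['accel_x'], packets['accel_z'])
```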
Similarly, pandas can achieve the same effect while providing a higher-level interface than struct arrays. That raises the question: which approach is fastest without sacrificing readability, ease of use, and maintainability? Here's a performance test, tailored to our use case:
Testing script
```python
import timeit

setup = """
import numpy as np
import msgspec
import pandas as pd

class A(msgspec.Struct):
    a: float
    b: float
    t: float = None
    est_alt: float = None
    accel_x: float = None
    accel_y: float = None
    accel_z: float = None
    gyro_x: float = None
    gyro_y: float = None
    gyro_z: float = None
"""

test_1 = """a = A(**{'a': 1.1, 'b': 2.2})"""
test_2 = """b = pd.Series({'a': 1.2, 'b': 2.1}, name='A')"""
test_3 = """x = np.array([(1.0, 2.0)], dtype=[('age', 'float64'), ('weight', 'float64')])"""

setup_2 = setup + """
msg_way = [A(**{'a': 1.0, 'b': 2.0}) for _ in range(100)]
# Create a pd.DataFrame with 100 rows and 2 columns:
pd_way = pd.DataFrame({'age': [1.0 for _ in range(100)], 'weight': [2.0 for _ in range(100)]}, dtype='float64')
np_way = np.array([(1.0, 2.0) for _ in range(100)], dtype=[('age', 'float64'), ('weight', 'float64')])
"""

# Test extracting a field from a list of objects:
test_4 = """a = [x.a for x in msg_way]"""
test_5 = """b = pd_way['age']"""
test_6 = """b = np_way['age']"""

time_1 = timeit.timeit(test_1, setup, number=10000)
time_2 = timeit.timeit(test_2, setup, number=10000)
time_3 = timeit.timeit(test_3, setup, number=10000)

print(f"Time for msgspec class creation: {time_1}")
print(f"Time for pandas Series creation: {time_2}")
print(f"Time for numpy structured array creation: {time_3}")

time_4 = timeit.timeit(test_4, setup_2, number=10000)
time_5 = timeit.timeit(test_5, setup_2, number=10000)
time_6 = timeit.timeit(test_6, setup_2, number=10000)

print()
print(f"Time for msgspec to extract: {time_4}")
print(f"Time for pandas dataframe to extract: {time_5}")
print(f"Time for numpy structured array to extract: {time_6}")

# Create msgspec class, convert to numpy structured array:
print()

setup_3 = setup + """
data_dtype = np.dtype([(i, "float64") for i in A.__struct_fields__])

def convert_to_structured_array(data_packets) -> np.ndarray:
    # Replace None with np.nan
    def packet_to_tuple(packet):
        return tuple(
            getattr(packet, field) if getattr(packet, field) is not None else np.nan
            for field in data_dtype.names
        )

    # Convert list of packets to NumPy structured array
    return np.array([msgspec.structs.astuple(packet) for packet in data_packets], dtype=data_dtype)

def convert_to_rec_array(data_packets) -> np.recarray:
    return np.rec.array([msgspec.structs.astuple(packet) for packet in data_packets], dtype=data_dtype)
"""

test_7 = """
classes = [A(**{'a': 1.0, 'b': 2.0, 't': 2.4, 'est_alt': 1.3, 'accel_x': 1.1, 'accel_y': 1.2, 'accel_z': 1.4, 'gyro_x': 1.5, 'gyro_y': 1.6, 'gyro_z': 1.7}) for _ in range(15)]
structured_array = convert_to_structured_array(classes)
a = structured_array['a']
b = structured_array['b']
t = structured_array['t']
est_alt = structured_array['est_alt']
accel_x = structured_array['accel_x']
accel_y = structured_array['accel_y']
accel_z = structured_array['accel_z']
gyro_x = structured_array['gyro_x']
gyro_y = structured_array['gyro_y']
gyro_z = structured_array['gyro_z']
"""
time_7 = timeit.timeit(test_7, setup_3, number=10000)
print(f"Time for creating 15 msgspec structs to numpy structured array to extract: {time_7}")

test_8 = """
classes = [A(**{'a': 1.0, 'b': 2.0, 't': 2.4, 'est_alt': 1.3, 'accel_x': 1.1, 'accel_y': 1.2, 'accel_z': 1.4, 'gyro_x': 1.5, 'gyro_y': 1.6, 'gyro_z': 1.7}) for _ in range(15)]
only_a = [x.a for x in classes]
only_b = [x.b for x in classes]
only_t = [x.t for x in classes]
only_est_alt = [x.est_alt for x in classes]
only_accel_x = [x.accel_x for x in classes]
only_accel_y = [x.accel_y for x in classes]
only_accel_z = [x.accel_z for x in classes]
only_gyro_x = [x.gyro_x for x in classes]
only_gyro_y = [x.gyro_y for x in classes]
only_gyro_z = [x.gyro_z for x in classes]
"""
time_8 = timeit.timeit(test_8, setup, number=10000)
print(f"Time for creating 15 msgspec structs to extract: {time_8}")

test_9 = """
classes = [A(**{'a': 1.0, 'b': 2.0, 't': 2.4, 'est_alt': 1.3, 'accel_x': 1.1, 'accel_y': 1.2, 'accel_z': 1.4, 'gyro_x': 1.5, 'gyro_y': 1.6, 'gyro_z': 1.7}) for _ in range(15)]
structured_array = convert_to_rec_array(classes)
a = structured_array.a
b = structured_array.b
t = structured_array.t
est_alt = structured_array.est_alt
accel_x = structured_array.accel_x
accel_y = structured_array.accel_y
accel_z = structured_array.accel_z
gyro_x = structured_array.gyro_x
gyro_y = structured_array.gyro_y
gyro_z = structured_array.gyro_z
"""
time_9 = timeit.timeit(test_9, setup_3, number=10000)
print(f"Time for creating 15 msgspec structs to numpy rec array to extract: {time_9}")

test_10 = """
fields = [1.0, 2.0, 2.4, 1.3, 1.1, 1.2, 1.4, 1.5, 1.6, 1.7]
field_names = ['a', 'b', 't', 'est_alt', 'accel_x', 'accel_y', 'accel_z', 'gyro_x', 'gyro_y', 'gyro_z']
structured_array = np.array(tuple(fields), dtype=[(field_name, 'float64') for field_name in field_names])
a = structured_array['a']
b = structured_array['b']
t = structured_array['t']
est_alt = structured_array['est_alt']
accel_x = structured_array['accel_x']
accel_y = structured_array['accel_y']
accel_z = structured_array['accel_z']
gyro_x = structured_array['gyro_x']
gyro_y = structured_array['gyro_y']
gyro_z = structured_array['gyro_z']
"""
time_10 = timeit.timeit(test_10, setup_3, number=10000)
print(f"Time for creating 15 tuples to numpy structured array to extract: {time_10}")
```
Output:

```
Time for msgspec class creation: 0.0032883570002013585
Time for pandas Series creation: 0.7121300260005228
Time for numpy structured array creation: 0.01445766399956483

Time for msgspec to extract: 0.021531827000217163
Time for pandas dataframe to extract: 0.01614268999946944
Time for numpy structured array to extract: 0.0008738220003579045
Time for creating 15 msgspec structs to numpy structured array to extract: 0.21062122500006808
Time for creating 15 msgspec structs to extract: 0.14083567499983474
Time for creating 15 msgspec structs to numpy rec array to extract: 0.49987129799956165
Time for creating 15 tuples to numpy structured array to extract: 0.05613298300158931
```
So a couple of observations:
- `msgspec` is insanely fast, even faster than numpy at creating new instances.
- Creating a pandas `Series` or even a `DataFrame` is too slow to gain any advantage from vectorization later.
- Converting the fetched packets to a numpy struct array and then using it is slower overall.
- Numpy rec arrays (i.e. fields accessible by dot notation instead of dictionary notation) are slower than struct arrays.
- The only faster solution (~60% faster) is to get rid of msgspec entirely and use struct arrays from the beginning (most of the time savings only show up when looping); a rough sketch of this appears after the list.
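As a rough, hypothetical sketch of that last point (the field list mirrors test_10 above; this is not the actual implementation):

```python
import numpy as np

FIELD_NAMES = ['a', 'b', 't', 'est_alt', 'accel_x', 'accel_y',
               'accel_z', 'gyro_x', 'gyro_y', 'gyro_z']
PACKET_DTYPE = np.dtype([(name, 'float64') for name in FIELD_NAMES])

def packets_to_struct_array(rows: list[tuple]) -> np.ndarray:
    # Build the structured array directly from raw value tuples,
    # skipping the msgspec struct entirely (the test_10 approach).
    return np.array(rows, dtype=PACKET_DTYPE)

batch = packets_to_struct_array(
    [(1.0, 2.0, 2.4, 1.3, 1.1, 1.2, 1.4, 1.5, 1.6, 1.7)] * 15
)
altitudes = batch['est_alt']  # all 15 values as one float64 vector
```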
Tbh, even though it's faster, I still don't like the loss of readability, especially since you can't name your numpy struct array the way you can a `pandas.Series`. Additionally, creating msgspec classes is way faster than creating numpy struct arrays, so the overhead of building the struct arrays before transferring them between processes would grow, making the overall difference much smaller.
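To illustrate the naming point (the label below is made up): a `pandas.Series` can carry a name for the whole packet, while a numpy struct array only names its fields through the dtype:

```python
import numpy as np
import pandas as pd

# A pandas Series can be labelled as a whole:
s = pd.Series({'a': 1.1, 'b': 2.2}, name='IMUDataPacket')  # hypothetical label

# A structured array only names its fields via the dtype;
# the array object itself has no equivalent label:
arr = np.array([(1.1, 2.2)], dtype=[('a', 'float64'), ('b', 'float64')])
print(s.name)           # 'IMUDataPacket'
print(arr.dtype.names)  # ('a', 'b')
```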
So what action should we take? I'd say none. I just wanted to elaborate on #107 (comment), and if anyone has more thoughts on it, please share them.