[DISCUSSION] Alternative way of representing data packets #112

Open
harshil21 opened this issue Dec 2, 2024 · 0 comments

Since we iterate over the "columns" of data packets (e.g. altitude, accel, gyro) more often than over the "rows", for example in the logger, it could be faster to represent data packets as numpy structured arrays: the operations we do can then be vectorized, while fields stay accessible by name, so the code isn't completely obscure.
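
To make the "columns by name" idea concrete, here is a minimal sketch of a structured array. The dtype layout and the three sample packets are made up for illustration; only the field names are borrowed from this issue.

```python
import numpy as np

# Hypothetical packet layout; field names mirror the ones used in this issue.
packet_dtype = np.dtype([
    ("est_alt", "float64"),
    ("accel_x", "float64"),
    ("accel_y", "float64"),
    ("accel_z", "float64"),
])

# Three made-up packets, one tuple per "row".
packets = np.array(
    [(10.0, 0.0, 3.0, 4.0), (12.0, 1.0, 2.0, 2.0), (15.0, 0.0, 0.0, 1.0)],
    dtype=packet_dtype,
)

# Selecting a field by name gives the whole column without a Python loop.
altitudes = packets["est_alt"]
max_alt = altitudes.max()

# Vectorized per-row acceleration magnitude computed from three columns.
accel_mag = np.sqrt(
    packets["accel_x"] ** 2 + packets["accel_y"] ** 2 + packets["accel_z"] ** 2
)
```

With a list of msgspec structs, the same magnitude would need a list comprehension per field before any math can happen.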

Similarly, pandas can achieve the same effect while providing a higher-level interface than struct arrays. This raises the question: which method is fastest without sacrificing readability, ease of use, and maintainability? Here's a performance test tailored to our use case:

Testing script
import timeit

setup = """
import numpy as np
import msgspec
import pandas as pd

class A(msgspec.Struct):
    a: float
    b: float
    t: float | None = None
    est_alt: float | None = None
    accel_x: float | None = None
    accel_y: float | None = None
    accel_z: float | None = None
    gyro_x: float | None = None
    gyro_y: float | None = None
    gyro_z: float | None = None

"""

test_1 = """
a = A(**{'a': 1.1, 'b': 2.2})
"""

test_2 = """
b = pd.Series({'a': 1.2, 'b': 2.1}, name='A')
"""

test_3 = """
x = np.array([(1.0, 2.0)],
             dtype=[('age', 'float64'), ('weight', 'float64')])
"""

setup_2 = setup + """

msg_way = [A(**{'a': 1.0, 'b': 2.0}) for _ in range(100)]

# Create a pd.DataFrame with 100 rows and 2 columns:
pd_way = pd.DataFrame({'age': [1.0 for _ in range(100)],
                       'weight': [2.0 for _ in range(100)]}, dtype='float64')


np_way = np.array([(1.0, 2.0) for _ in range(100)],
                  dtype=[('age', 'float64'), ('weight', 'float64')])
"""

# Test extracting a field from a list of objects:
test_4 = """
a = [x.a for x in msg_way]
"""

test_5 = """
b = pd_way['age']
"""

test_6 = """
b = np_way['age']
"""

time_1 = timeit.timeit(test_1, setup, number=10000)
time_2 = timeit.timeit(test_2, setup, number=10000)
time_3 = timeit.timeit(test_3, setup, number=10000)

print(f"Time for msgspec class creation: {time_1}")
print(f"Time for pandas Series creation: {time_2}")
print(f"Time for numpy structured array creation: {time_3}")

time_4 = timeit.timeit(test_4, setup_2, number=10000)
time_5 = timeit.timeit(test_5, setup_2, number=10000)
time_6 = timeit.timeit(test_6, setup_2, number=10000)

print()
print(f"Time for msgspec to extract: {time_4}")
print(f"Time for pandas dataframe to extract: {time_5}")
print(f"Time for numpy structured array to extract: {time_6}")

# Create msgspec class, convert to numpy structured array:

print()
setup_3 = setup + """

data_dtype = np.dtype([
    (i, "float64") for i in A.__struct_fields__])


def convert_to_structured_array(data_packets) -> np.ndarray:
    # Convert a list of packets to a NumPy structured array, one row per packet.
    return np.array([msgspec.structs.astuple(packet) for packet in data_packets], dtype=data_dtype)

def convert_to_rec_array(data_packets) -> np.recarray:
    return np.rec.array([msgspec.structs.astuple(packet) for packet in data_packets], dtype=data_dtype)
"""


test_7 = """

classes = [A(**{'a': 1.0, 'b': 2.0, 't': 2.4, 'est_alt': 1.3, 'accel_x': 1.1, 'accel_y': 1.2, 'accel_z': 1.4, 'gyro_x': 1.5, 'gyro_y': 1.6, 'gyro_z': 1.7}) for _ in range(15)]

structured_array = convert_to_structured_array(classes)

a = structured_array['a']
b = structured_array['b']
t = structured_array['t']
est_alt = structured_array['est_alt']
accel_x = structured_array['accel_x']
accel_y = structured_array['accel_y']
accel_z = structured_array['accel_z']
gyro_x = structured_array['gyro_x']
gyro_y = structured_array['gyro_y']
gyro_z = structured_array['gyro_z']
"""

time_7 = timeit.timeit(test_7, setup_3, number=10000)
print(f"Time for creating 15 msgspec structs to numpy structured array to extract: {time_7}")


test_8 = """

classes = [A(**{'a': 1.0, 'b': 2.0, 't': 2.4, 'est_alt': 1.3, 'accel_x': 1.1, 'accel_y': 1.2, 'accel_z': 1.4, 'gyro_x': 1.5, 'gyro_y': 1.6, 'gyro_z': 1.7}) for _ in range(15)]

only_a = [x.a for x in classes]
only_b = [x.b for x in classes]
only_t = [x.t for x in classes]
only_est_alt = [x.est_alt for x in classes]
only_accel_x = [x.accel_x for x in classes]
only_accel_y = [x.accel_y for x in classes]
only_accel_z = [x.accel_z for x in classes]
only_gyro_x = [x.gyro_x for x in classes]
only_gyro_y = [x.gyro_y for x in classes]
only_gyro_z = [x.gyro_z for x in classes]

"""

time_8 = timeit.timeit(test_8, setup, number=10000)
print(f"Time for creating 15 msgspec structs to extract: {time_8}")


test_9 = """

classes = [A(**{'a': 1.0, 'b': 2.0, 't': 2.4, 'est_alt': 1.3, 'accel_x': 1.1, 'accel_y': 1.2, 'accel_z': 1.4, 'gyro_x': 1.5, 'gyro_y': 1.6, 'gyro_z': 1.7}) for _ in range(15)]

structured_array = convert_to_rec_array(classes)

a = structured_array.a
b = structured_array.b
t = structured_array.t
est_alt = structured_array.est_alt
accel_x = structured_array.accel_x
accel_y = structured_array.accel_y
accel_z = structured_array.accel_z
gyro_x = structured_array.gyro_x
gyro_y = structured_array.gyro_y
gyro_z = structured_array.gyro_z
"""

time_9 = timeit.timeit(test_9, setup_3, number=10000)
print(f"Time for creating 15 msgspec structs to numpy rec array to extract: {time_9}")


test_10 = """

# Note: this builds a single record (a 0-d structured array), not 15 rows.
fields = [1.0, 2.0, 2.4, 1.3, 1.1, 1.2, 1.4, 1.5, 1.6, 1.7]
field_names = ['a', 'b', 't', 'est_alt', 'accel_x', 'accel_y', 'accel_z', 'gyro_x', 'gyro_y', 'gyro_z']
structured_array = np.array(tuple(fields), dtype=[(field_name, 'float64') for field_name in field_names])

a = structured_array['a']
b = structured_array['b']
t = structured_array['t']
est_alt = structured_array['est_alt']
accel_x = structured_array['accel_x']
accel_y = structured_array['accel_y']
accel_z = structured_array['accel_z']
gyro_x = structured_array['gyro_x']
gyro_y = structured_array['gyro_y']
gyro_z = structured_array['gyro_z']
"""

time_10 = timeit.timeit(test_10, setup_3, number=10000)

print(f"Time for creating 15 tuples to numpy structured array to extract: {time_10}")
Results (total seconds for 10000 runs of each test):
Time for msgspec class creation: 0.0032883570002013585
Time for pandas Series creation: 0.7121300260005228
Time for numpy structured array creation: 0.01445766399956483

Time for msgspec to extract: 0.021531827000217163
Time for pandas dataframe to extract: 0.01614268999946944
Time for numpy structured array to extract: 0.0008738220003579045

Time for creating 15 msgspec structs to numpy structured array to extract: 0.21062122500006808
Time for creating 15 msgspec structs to extract: 0.14083567499983474
Time for creating 15 msgspec structs to numpy rec array to extract: 0.49987129799956165
Time for creating 15 tuples to numpy structured array to extract: 0.05613298300158931

So a couple of observations:

  1. msgspec is insanely fast, even faster than numpy at creating new instances.
  2. Creating a pandas Series, or even a DataFrame, is too slow to take any advantage of vectorization later.
  3. Converting the fetched packets to a numpy struct array and then using it is slower overall.
  4. Numpy rec arrays (i.e. fields accessible by dot notation instead of dictionary notation) are slower than struct arrays.
  5. The only faster solution (~60% faster) is to get rid of msgspec entirely and use struct arrays from the beginning (most time savings will only come when looping). Note that the last test builds a single record rather than 15, so it is not a like-for-like comparison with the 15-struct test.
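
For anyone who wants to check the ratios behind these observations, here is a small sketch that recomputes them from the timings reported above. The variable names are mine; the numbers are copied verbatim from the results.

```python
# Timings copied from the results above (total seconds for 10000 runs).
structs_only = 0.14083567499983474             # 15 msgspec structs, list comprehensions
structs_to_struct_array = 0.21062122500006808  # 15 structs -> structured array
structs_to_rec_array = 0.49987129799956165     # 15 structs -> rec array
direct_struct_array = 0.05613298300158931      # tuple -> structured array, no msgspec

# How many times slower each conversion path is than plain msgspec structs.
rel_struct_array = structs_to_struct_array / structs_only
rel_rec_array = structs_to_rec_array / structs_only

# Fraction of time saved by skipping msgspec entirely.
saving = 1 - direct_struct_array / structs_only

print(f"struct array path: {rel_struct_array:.2f}x the msgspec-only time")
print(f"rec array path:    {rel_rec_array:.2f}x the msgspec-only time")
print(f"direct struct array saves {saving:.0%}")
```

This works out to roughly 1.5x and 3.5x slower for the two conversion paths, and about 60% saved for the direct path, matching the figure quoted in the list.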

Tbh, even though it's faster, I still don't like the loss of readability, especially since you can't name a numpy struct array the way you can a pandas.Series. Additionally, creating msgspec classes is way faster than creating numpy struct arrays, so if we passed struct arrays between processes instead, that creation overhead would add up and shrink the overall difference.
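
On the readability side, this is a rough sketch of what the two access styles look like next to each other. It uses `ndarray.view(np.recarray)` to get dot access on the same data, which is a different route from the `np.rec.array` constructor used in the benchmark, so it says nothing about the timings above. The dtype and values are made up.

```python
import numpy as np

# Hypothetical two-field packet layout with made-up values.
packet_dtype = np.dtype([("t", "float64"), ("est_alt", "float64")])
struct_arr = np.array([(0.0, 100.0), (0.1, 101.5)], dtype=packet_dtype)

# Dictionary-style access on a plain structured array.
alt_by_key = struct_arr["est_alt"]

# Attribute-style access on a recarray view of the same array.
rec_arr = struct_arr.view(np.recarray)
alt_by_attr = rec_arr.est_alt

# Both styles read the same column values.
same_values = np.array_equal(alt_by_key, alt_by_attr)
```

Either way the array itself carries no name, only its fields do, which is the readability gap compared to a named `pandas.Series`.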

So what action should we take? I'd say none. I just wanted to elaborate on #107 (comment), and if anyone has more thoughts on it, please share them.

harshil21 changed the title from "Alternative way of representing data packets" to "[DISCUSSION] Alternative way of representing data packets" on Dec 6, 2024