Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for deserialization "into" existing collection (instead of overwrite) #337

Open
dusrdev opened this issue Oct 9, 2024 · 1 comment

Comments

@dusrdev
Copy link

dusrdev commented Oct 9, 2024

In order to reduce allocations of collections when deserializing, MemoryPackSerializer has a method that overwrites an existing instance. For 1 item this is fine, but for it would have been very helpful for collections that instead of overwriting or replacing them, it would serialize into the available space.

For instance, if a T[] is used there, the entire instance is replaced, regardless if the deserialized items could have been inserted from some index to some index. In some cases, the opposite would have been very efficient.

For the sake of the example, consider that I use MemoryPack to serialize a User[] in some API. In the crud operation of adding a new user, right now I have to do this:

User[] currentUsers = MemoryPack.Deserialize<User[]>(binaryData);
User[] usersAfterAddition = new User[currentUsers.Length + 1];
currentUsers.CopyTo(usersAfterAddition);
usersAfterAddition[^1] = newUser;
binaryData = MemoryPack.Serialize(usersAfterAddition);

So I am forced to allocate a secondary array with larger length just to add items.
I want to be able to do this:

User[] currentUsers = new User[currentUserCount + 1];
Span<User> current = new Span(currentUsers, 0, currentUserCount);
int written = MemoryPackSerializer.Deserialize(binaryData, current);
// I could use written in case I want to advance some index, perhaps as part of buffer or whatever
// for the sake of the example I won't use ^1 and use the written instead
currentUsers[written] = newUser;
binaryData = MemoryPack.Serialize(currentUsers);

This essentially uses a destination span for the required items, it could be a part of a larger buffer (which also enables usage with BufferWriter(s), Arrays rented from pools etc...
And using this functionality I completely avoid allocating an array of size n. (this is big, especially if the usage is in an API like in the example, which calls it lots of times, using an array from ArrayPool.Shared would increase the benefit immensely, and the same API would support it).

Currently, it is impossible to do from my tries, firstly because the overload of Deserialize that overwrites the instance uses a ref object? obj for the instance, and obviously span<t> cannot be cast to object. I have tried to implement a workaround using Memory, like this:

User[] currentUsers = new User[currentUserCount + 1];
Memory<User> current = new Memory(currentUsers, 0, currentUserCount);
ref object? obj = ref Unsafe.As<Memory<User>, object?>(ref current);
int written = MemoryPackSerializer.Deserialize(binaryData, ref obj);
// I could use written in case I want to advance some index, perhaps as part of buffer or whatever
// for the sake of the example I won't use ^1 and use the written instead
currentUsers[written] = newUser;
binaryData = MemoryPack.Serialize(currentUsers);

Which logically should produce similar results as Memory.Slice is also very efficient and doesn't actually change the backing array. But even after the Deserialize call, my debugger showed that the backing array was unchanged (remained only initialized without the deserialized data). And my unit tests failed.

Btw, as it currently stands, written doesn't actually tell the count of deserialized items, it just returns the length of binaryData which means if you are expecting to get 3 Users, from a byte[] of length 80 for instance, written would be 80, and trying to use it to advance indexes will throw you out of bounds. (So this is also a bug)

I know that implementing such a change would lead to thinking that it would mean a lot of overloads and code to add, but you don't really need much, supporting Span<T> would cover 99% of cases, as you can get the span out of arrays and lists and they are the most common data types for such operations.

@dusrdev
Copy link
Author

dusrdev commented Oct 11, 2024

I managed to find a workaround for now:

// Input
ReadOnlySpan<byte> serialized = ... // This is the serialized data
int length = ... // length of T[] before serialization
// Actual work
var state = MemoryPackReaderOptionalStatePool.Rent(null);
var reader = new MemoryPackReader(data, state);
var arr = new T[length + 1];
Span<T?> span = arr.AsSpan(0, length)!;
reader.ReadSpan(ref span);
arr[length] = new T() // the added item
// Then proceed to serialize after addition
var newSerialized = MemoryPackSerializer.Serialize(arr);

This works, and definitely better than allocating a new array just for deserialization. However, it is still suboptimal as:

  • MemoryPackReaderOptionalState is required for every such operation. This is a method to Rent it, but the Return to pool is internal, thus the Rent just allocated a new instance every time. Which is weird to be honest to have a pool with public Rent but internal Return, it doesn't make sense to me.
  • Even if MemoryPackReaderOptionalState allocation had been "free" (which it isn't because it is a reference type). It isn't even used in MermoyPackReader.ReadSpan(ref span) so it is a completely an unnecessary object allocation.
  • There is no way to get the actual number of items written to the span to "advance" it, MemoryPackReader.Consumed, same as the int return type of the Deserialize overload that was mentioned in the previous comment, just returns the number of input bytes that were read. Which you already know before beginning this... This means that if I tried to use this workaround with a rented buffer, and you didn't know the length of the items from before serialization, you have two options:
    1. Somehow traverse the array afterward, trying to find the end of the written portion.
    2. Use a "hacky" way of reading the length from the header before using this workaround, which works, but since MemoryPackReader.ReadSpan(ref span) already does this, it means you are doing duplicated work. (Which again, it is just reading one integer so not big deal - but this library, at least as I see it, meant to squeeze out every last bit of perf out of the language, and duplicated work is against that methodology).

By the way - BUG: as the documentation explains, the overwrite method, if receives a collection that has a .Clear() method, will instead just clear the existing instance and reuse it. I tried to leverage this to hack a different workaround, such as:

public class ICollectionWrappedArray<T> : ICollection<T> {
    private readonly T[] _arr;
    private int _index;
    public ICollectionWrapperArray(T[] arr) {
        _arr = arr;
    }
    // Empty ctor required for MemoryPack for some reason - otherwise runtime error
    public ICollectionWrapperArray() { _arr =[]; }
    public void Add(T item) {
        _arr[_index] = item;
        _index++;
     }
     public void Clear() { _index = 0; }
     // Rest of implementation doesn't really matter for this use case
}

// The instead of the span workaround above
T[] newArr = new[length + 1];
var collection = new ICollectionWrappedArray(newArr);
int written = MemoryPackSerializer.Deserialize(serializedData, ref collection);

This should have worked according to the documentation, as the collection has .Clear method, it should have just reset the index (which was 0 in any case), then just added to the array.

But, my tests failed, and after debugging I saw that it always went to the empty ctor, causing it to allocate a new array instead of using the one that was provided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant