Encoding version 2: binary-only memory layout #1317

willdealtry · 2024-02-09T18:11:36Z

This PR adds an additional data encoding that is entirely binary in terms of the essential data structure descriptors. The aim is to make encoding and decoding faster, have a storage structure that can be entirely described by a set of POD structures (located in memory_layout.hpp), and pave the way for a more sophisticated approach to data encoding that will follow in a separate (smaller) PR.

The main work in this PR is to remove protobuf structures entirely from the internal implementation of data encoding and compression, and provide a mapping layer which can translate from standard C++ structures to either the legacy (protobuf) format or the new binary one.
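For readers unfamiliar with the idea, here is a minimal sketch of what such a mapping layer could look like. It is not the code in this PR: `FieldInfo`, `FieldHeader`, `write_legacy`, `write_binary` and `serialize` are hypothetical names, and the "legacy" writer emits a tagged text string purely as a stand-in for protobuf serialization so that the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical internal description of a field: plain C++, no protobuf types.
struct FieldInfo {
    std::string name;
    std::uint8_t data_type;
    std::uint8_t dimension;
};

enum class EncodingVersion : std::uint16_t { V1 = 1, V2 = 2 };

// Stand-in for the legacy path. The real legacy format is protobuf; a tagged
// text string is used here only to keep the sketch dependency-free.
std::vector<std::uint8_t> write_legacy(const FieldInfo& f) {
    const std::string s = "name=" + f.name +
                          ";type=" + std::to_string(f.data_type) +
                          ";dim=" + std::to_string(f.dimension);
    return std::vector<std::uint8_t>(s.begin(), s.end());
}

// Assumed fixed-size header for the binary path, followed by the name bytes.
struct FieldHeader {
    std::uint8_t data_type;
    std::uint8_t dimension;
    std::uint16_t name_size;
};
static_assert(sizeof(FieldHeader) == 4, "header is expected to have no padding");

std::vector<std::uint8_t> write_binary(const FieldInfo& f) {
    assert(f.name.size() <= UINT16_MAX);
    const FieldHeader h{f.data_type, f.dimension,
                        static_cast<std::uint16_t>(f.name.size())};
    std::vector<std::uint8_t> out(sizeof(h) + f.name.size());
    std::memcpy(out.data(), &h, sizeof(h));
    std::memcpy(out.data() + sizeof(h), f.name.data(), f.name.size());
    return out;
}

// The mapping layer: callers work with FieldInfo only and pick a wire format.
std::vector<std::uint8_t> serialize(const FieldInfo& f, EncodingVersion v) {
    return v == EncodingVersion::V1 ? write_legacy(f) : write_binary(f);
}
```

The point of the sketch is that the encoding and compression code only ever sees `FieldInfo`; the choice of wire format is confined to one function.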

The binary encoding is a much more direct representation of the in-memory structures, which fall into three main groups. The structures around EncodedFieldCollection describe the layout of a (usually compressed) field in storage, although a fixed set of optional EncodedFields is used at the head of the segment to bootstrap the decoding. StreamDescriptor describes features such as the names, types and dimensionality of the data being represented (thus it is oriented primarily towards the features of data in memory, as opposed to data in storage). Finally, the TimeseriesDescriptor represents elements specific to segments that describe and reference sets of other segments, such as indexes, i.e. it contains data relevant to time series and dataframes as a whole rather than to their component parts.
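As an illustration of why trivially copyable (POD) structures make encoding and decoding cheap, the sketch below writes one header per group into a byte buffer and reads them back with `memcpy`. The struct names and all of their fields are assumptions made up for this example; the actual definitions live in memory_layout.hpp and will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical headers, one per group described above. Field choices are
// illustrative only and are not taken from memory_layout.hpp.
struct EncodedFieldHeader {          // layout of a field in storage
    std::uint32_t num_blocks;
    std::uint32_t compressed_bytes;
    std::uint32_t uncompressed_bytes;
};

struct StreamDescriptorHeader {      // shape of the data in memory
    std::uint64_t stream_id;
    std::uint32_t field_count;
    std::uint32_t index_type;
};

struct TimeseriesDescriptorHeader {  // whole-dataframe information
    std::uint64_t total_rows;
    std::uint64_t referenced_segments;
};

// Append the raw bytes of a trivially copyable value to the buffer.
template <typename T>
void append_pod(std::vector<std::uint8_t>& buf, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be POD-like");
    const auto* p = reinterpret_cast<const std::uint8_t*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Read a value back from the buffer at `offset` and advance the offset.
template <typename T>
T read_pod(const std::vector<std::uint8_t>& buf, std::size_t& offset) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be POD-like");
    assert(offset + sizeof(T) <= buf.size());
    T value;
    std::memcpy(&value, buf.data() + offset, sizeof(T));
    offset += sizeof(T);
    return value;
}
```

Decoding here is a bounds check plus a copy, with no parsing step. A real layout would also need to pin down byte order, padding and versioning, which this sketch ignores.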

IvoDD

I've paid closest attention to the protobuf mappings and the new memory layout to check that it all looks compatible. It all makes sense, and most of my comments are just questions to double-check that.

cpp/arcticdb/memory_layout.hpp

python/tests/util/mark.py

cpp/arcticdb/memory_layout.hpp

cpp/arcticdb/codec/encoded_field.hpp

cpp/arcticdb/memory_layout.hpp

cpp/arcticdb/codec/protobuf_mappings.cpp

cpp/arcticdb/codec/protobuf_mappings.hpp

cpp/arcticdb/entity/timeseries_descriptor.hpp

cpp/arcticdb/async/tasks.hpp

cpp/arcticdb/codec/codec.cpp

cpp/arcticdb/codec/encode_common.hpp

cpp/arcticdb/codec/encoded_field.hpp

cpp/arcticdb/CMakeLists.txt

cpp/arcticdb/stream/stream_utils.hpp

cpp/arcticdb/util/buffer.hpp

cpp/arcticdb/util/timer.hpp

cpp/arcticdb/version/python_bindings.cpp

cpp/arcticdb/entity/protobuf_mappings.cpp

cpp/arcticdb/storage/memory_layout.hpp

willdealtry force-pushed the hash_descriptor_v2 branch from 468ae02 to 4de4d5a Compare March 14, 2024 13:41

willdealtry force-pushed the hash_descriptor_v2 branch 3 times, most recently from 398eed5 to 2789314 Compare April 24, 2024 15:06

willdealtry force-pushed the hash_descriptor_v2 branch from 2789314 to 388f323 Compare May 7, 2024 14:11

willdealtry force-pushed the hash_descriptor_v2 branch 4 times, most recently from 9d9ec12 to 38c65c7 Compare May 22, 2024 11:18

willdealtry changed the title ~~WIP descriptor changes~~ Encoding version 2: binary-only memory layout May 30, 2024

willdealtry marked this pull request as ready for review May 30, 2024 12:17

willdealtry requested review from alexowens90 and poodlewars as code owners May 30, 2024 12:17

willdealtry force-pushed the hash_descriptor_v2 branch 3 times, most recently from 95601e1 to 26c23f9 Compare June 5, 2024 13:21

willdealtry requested a review from IvoDD June 11, 2024 10:49

IvoDD reviewed Jun 12, 2024


willdealtry force-pushed the hash_descriptor_v2 branch from 26c23f9 to faea781 Compare June 13, 2024 10:35