Encoding version 2: binary-only memory layout #1317
Merged
willdealtry force-pushed the hash_descriptor_v2 branch from 468ae02 to 4de4d5a (March 14, 2024 13:41)
willdealtry force-pushed the hash_descriptor_v2 branch 3 times, most recently from 398eed5 to 2789314 (April 24, 2024 15:06)
willdealtry force-pushed the hash_descriptor_v2 branch from 2789314 to 388f323 (May 7, 2024 14:11)
willdealtry force-pushed the hash_descriptor_v2 branch 4 times, most recently from 9d9ec12 to 38c65c7 (May 22, 2024 11:18)
willdealtry changed the title from "WIP descriptor changes" to "Encoding version 2: binary-only memory layout" (May 30, 2024)
willdealtry force-pushed the hash_descriptor_v2 branch 3 times, most recently from 95601e1 to 26c23f9 (June 5, 2024 13:21)
IvoDD reviewed (Jun 12, 2024)
I've paid closest attention to the protobuf mappings and the new memory layout to check that it all looks compatible. It all makes sense, and most of my comments are just questions to double-check that everything makes sense.
willdealtry force-pushed the hash_descriptor_v2 branch from 26c23f9 to faea781 (June 13, 2024 10:35)
alexowens90 requested changes (Jun 20, 2024)
willdealtry force-pushed the hash_descriptor_v2 branch 4 times, most recently from dead6f5 to 1f6ae95 (June 23, 2024 21:54)
willdealtry force-pushed the hash_descriptor_v2 branch from 1f6ae95 to 57a67d6 (June 24, 2024 11:30)
alexowens90 approved these changes (Jun 25, 2024)
willdealtry force-pushed the hash_descriptor_v2 branch from 8e7a57a to 3d1290f (June 25, 2024 13:13)
This PR adds a second data encoding in which the essential data structure descriptors are entirely binary. The aim is to make encoding and decoding faster, to have a storage structure that can be described entirely by a set of POD structures (located in memory_layout.hpp), and to pave the way for a more sophisticated approach to data encoding that will follow in a separate (smaller) PR.
The main work in this PR is to remove protobuf structures entirely from the internal implementation of data encoding and compression, and provide a mapping layer which can translate from standard C++ structures to either the legacy (protobuf) format or the new binary one.
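As a rough illustration of what such a mapping layer could look like, the sketch below keeps the internal descriptor as a plain C++ struct and chooses the wire format only at the serialization boundary. Every name here (`FieldDescriptor`, `EncodingVersion`, `write_legacy`, `write_binary`, `serialize`) is hypothetical and not taken from the PR, and the legacy path uses a simple text form as a stand-in for protobuf so that the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical internal descriptor: plain C++ only, no protobuf types.
struct FieldDescriptor {
    std::uint8_t data_type;
    std::uint8_t dimension;
    std::string name;
};

// Hypothetical version tag used to pick the output format.
enum class EncodingVersion : std::uint16_t { V1 = 0, V2 = 1 };

// Stand-in for the legacy path (assumption: the real code would fill a
// protobuf message here; a text form keeps the sketch dependency-free).
std::string write_legacy(const FieldDescriptor& f) {
    return "name=" + f.name +
           ";type=" + std::to_string(static_cast<int>(f.data_type)) +
           ";dim=" + std::to_string(static_cast<int>(f.dimension));
}

// Binary path: two fixed bytes, a 16-bit name length in host byte order,
// then the raw name bytes.
std::vector<std::uint8_t> write_binary(const FieldDescriptor& f) {
    assert(f.name.size() <= 0xFFFF);  // the length must fit in 16 bits
    std::vector<std::uint8_t> out;
    out.push_back(f.data_type);
    out.push_back(f.dimension);
    const std::uint16_t len = static_cast<std::uint16_t>(f.name.size());
    std::uint8_t len_bytes[sizeof(len)];
    std::memcpy(len_bytes, &len, sizeof(len));
    out.insert(out.end(), len_bytes, len_bytes + sizeof(len));
    out.insert(out.end(), f.name.begin(), f.name.end());
    return out;
}

// The mapping layer: callers only ever handle FieldDescriptor, and the
// format decision is made in exactly one place.
std::vector<std::uint8_t> serialize(const FieldDescriptor& f, EncodingVersion v) {
    if (v == EncodingVersion::V2) {
        return write_binary(f);
    }
    const std::string text = write_legacy(f);
    return std::vector<std::uint8_t>(text.begin(), text.end());
}
```

The point of the design is that encoding and compression code depends only on the plain struct, so a new format can be added without touching that code.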
The binary encoding is a much more direct representation of the in-memory structures, which fall into three main groups. The structures around EncodedFieldCollection describe the layout of a (usually compressed) field in storage, although a fixed set of optional EncodedFields is used at the head of the segment to bootstrap the decoding. StreamDescriptor describes features such as the names, types and dimensionality of the data being represented; it is therefore oriented primarily towards the features of data in memory, as opposed to data in storage. Finally, the TimeseriesDescriptor represents elements specific to segments that describe and reference sets of other segments, such as indexes. In other words, it contains data relevant to time series and dataframes as a whole rather than to their component parts.
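To make the idea of a storage layout described entirely by POD structures more concrete, here is a minimal sketch of a packed, trivially copyable header that is written to and read from a byte buffer with `memcpy`. The struct `FieldHeader`, its members, and the helpers `append_pod` and `read_pod` are invented for illustration; they are not the actual definitions in memory_layout.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical header in the spirit of an encoded-field description.
// Packing to 1 byte removes padding so the on-disk size is predictable.
#pragma pack(push, 1)
struct FieldHeader {
    std::uint8_t type;               // e.g. a field-kind tag
    std::uint16_t num_blocks;        // number of blocks that follow
    std::uint32_t compressed_bytes;  // total compressed payload size
};
#pragma pack(pop)

// Compile-time checks that the struct really is safe to copy bytewise.
static_assert(std::is_trivially_copyable<FieldHeader>::value,
              "FieldHeader must be trivially copyable");
static_assert(std::is_standard_layout<FieldHeader>::value,
              "FieldHeader must have standard layout");
static_assert(sizeof(FieldHeader) == 7, "unexpected padding in FieldHeader");

// Append the raw bytes of a POD value to a buffer.
template <typename T>
void append_pod(std::vector<std::uint8_t>& buf, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be POD-like");
    const std::size_t old_size = buf.size();
    buf.resize(old_size + sizeof(T));
    std::memcpy(buf.data() + old_size, &value, sizeof(T));
}

// Read a POD value back from a byte offset. memcpy avoids alignment and
// strict-aliasing problems that a pointer cast could cause.
template <typename T>
T read_pod(const std::vector<std::uint8_t>& buf, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be POD-like");
    assert(offset + sizeof(T) <= buf.size());
    T value;
    std::memcpy(&value, buf.data() + offset, sizeof(T));
    return value;
}
```

Decoding such a header is a bounds check plus a fixed-size copy, with no parsing step, which is one plausible reason a binary layout can be faster than a general-purpose serialization format. The sketch copies bytes in host byte order and makes no attempt at cross-endian portability.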