-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow array schema proxy #185
Arrow array schema proxy #185
Conversation
c1dbdcd
to
4b6cf82
Compare
[[nodiscard]] int64_t n_buffers() const; | ||
[[nodiscard]] int64_t n_children() const; | ||
[[nodiscard]] std::vector<sparrow::buffer_view<uint8_t>> buffers() const; | ||
[[nodiscard]] std::vector<arrow_proxy> children() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these methods should return const reference to cached variables instead of rebuilding the vector for each call:
- dynamic allocating the vector each time we need to read a buffer in an aray implementation class will be costly
- this will guarantee immutability (notice we could fix the current implementation by returning a const std::vector, but we would still have the allocation issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok done
public: | ||
|
||
explicit arrow_proxy(ArrowArray&& array, ArrowSchema&& schema); | ||
explicit arrow_proxy(ArrowArray* array, ArrowSchema* schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are situations where we want to reuse the same schema structure for different Array structures, so I think the followinf constructor would also make sense arrao_proxy(ArrowArray&&, ArrowSchema*)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case the schema or array members are pointers, we should not call the release callback, that's right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed.
[[nodiscard]] int64_t n_children() const; | ||
[[nodiscard]] std::vector<sparrow::buffer_view<uint8_t>> buffers() const; | ||
[[nodiscard]] std::vector<arrow_proxy> children() const; | ||
[[nodiscard]] std::optional<arrow_proxy> dictionary() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we provide an additional bitmap
method that returns a dynamic_bitset_view
(cached)? This would simpify the implementation of the array implementation classes.
In that case, the first buffer_view in the buffer vector should still be the bitmap to avoid confusion (contrary to what has been done in the array_data).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no bitmap for Sparse and dense unions, map, and run encoded format.
In that case, we can return an optional<dynamic_bitset>,
You can check the get_buffer_types_from_data_type
to see what are the buffers for each format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case it might be better to handle that in the array implementation classes. We can probably factorize out the creation of the bitmap view in an intermediate class rather than using an std::optional for something known at build time.
overloaded(Ts...) -> overloaded<Ts...>; | ||
|
||
template <typename T> | ||
[[nodiscard]] T& get_variant(auto& var) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is confusing, and this function should probably be a private method or array_proxy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
6af9ffa
to
79e0218
Compare
void arrow_proxy::initialize_buffers() | ||
{ | ||
m_buffers.clear(); | ||
const auto buffer_count = static_cast<size_t>(array().n_buffers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this will fail in 32bit, I'll fix it with the 32bit fix by replacing static_cast
by a custom one.
ArrowArray& array = std::get<1>(m_array); | ||
if (array.release != nullptr) | ||
{ | ||
array.release(&array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this throws , abort will be called (because we are in a destructor which is implicitly noexcept(true)
). Same issue with schema's release. Maybe wrap these calls around a "safe invoke"? Something like we did in mamba: https://github.com/mamba-org/mamba/blob/main/libmamba/include/mamba/core/invoke.hpp#L12
Not sure what to do if it fails though, but I suppose at a minimum both release must be called, one failing shouldnt prevent the call to the second one.
If we want to crash anyway, ignore this.
|
||
arrow_proxy::~arrow_proxy() | ||
{ | ||
if (m_array.index() == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of the duplication here, although it could be refactored later.
SIZES_64BIT, | ||
}; | ||
|
||
constexpr std::vector<buffer_type> get_buffer_types_from_data_type(data_type data_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what a "buffer type" is here, so noob question: doest it have to be a vector here? (maybe a todo not to replace by a "small vector" later would be nice here?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buffer_type is an enum which is just above.
We just return an ordered container of enum. So yes it could be improved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw the enum but it is not clear to me exactly what the values means etc. Should be fixed once documented 👍🏼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only suggested tweaks to the documentation. The arrow_proxy
type also needs to be better clarified as to what to expect of it's behavior, but once it's done it's good for me 👍🏽
@@ -19,12 +19,14 @@ | |||
|
|||
namespace sparrow | |||
{ | |||
/// Validates the number of buffers in an ArrowArray for a given data type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Validates the number of buffers in an ArrowArray for a given data type. | |
/// @returns `true` if the number of buffers in an `ArrowArray` for a given data type is valid, `false` otherwise. |
constexpr bool validate_buffers_count(data_type data_type, int64_t n_buffers) | ||
{ | ||
const std::size_t expected_buffer_count = get_expected_buffer_count(data_type); | ||
return static_cast<std::size_t>(n_buffers) == expected_buffer_count; | ||
} | ||
|
||
/// Get the expected number of children for a given data type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Get the expected number of children for a given data type. | |
/// @returns The expected number of children for a given data type. |
@@ -64,6 +66,7 @@ namespace sparrow | |||
mpl::unreachable(); | |||
} | |||
|
|||
/// Validates the format of an ArrowArray for a given data type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Validates the format of an ArrowArray for a given data type. | |
/// @returns `true` if the format of an `ArrowArray` for a given data type is valid, `false` otherwise. |
@@ -84,6 +87,8 @@ namespace sparrow | |||
SIZES_64BIT, | |||
}; | |||
|
|||
/// Provide a vector of buffer types for a given data type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Provide a vector of buffer types for a given data type. | |
/// @returns A vector of buffer types for a given data type. |
@@ -130,6 +135,7 @@ namespace sparrow | |||
mpl::unreachable(); | |||
} | |||
|
|||
/// Get The expected offset element count for a given data type and array length. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Get The expected offset element count for a given data type and array length. | |
/// @returns The expected offset element count for a given data type and array length. |
@@ -16,6 +16,7 @@ | |||
|
|||
namespace sparrow | |||
{ | |||
/// Check that the given value is a valid ArrowFlag value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Check that the given value is a valid ArrowFlag value. | |
/// @returns `true` if the given value is a valid `ArrowFlag` value, `false` otherwise. |
@@ -32,6 +33,7 @@ namespace sparrow | |||
); | |||
} | |||
|
|||
/// Convert a bitfield of ArrowFlag values to a vector of ArrowFlag values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Convert a bitfield of ArrowFlag values to a vector of ArrowFlag values. | |
/// Converts a bitfield of `ArrowFlag` values to a vector of `ArrowFlag` values. |
@@ -52,6 +54,7 @@ namespace sparrow | |||
return flags; | |||
} | |||
|
|||
/// Convert a vector of ArrowFlag values to a bitfield of ArrowFlag values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Convert a vector of ArrowFlag values to a bitfield of ArrowFlag values. | |
/// Converts a vector of `ArrowFlag` values to a bitfield of `ArrowFlag` values. |
Clarifications: the issue in the destructor should be fixed (or explicitly stated in the code as intently ignored) before merging this. |
The conan failure is unrelated to this PR, let's merge and fix it in a dedicated PR. |
This first PR is for "read only". You can't reallocate buffers, change the size, etc ...