Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow array schema proxy #185

Merged
merged 14 commits into from
Sep 6, 2024

Conversation

Alex-PLACET
Copy link
Collaborator

@Alex-PLACET Alex-PLACET commented Aug 30, 2024

This first PR is for "read only". You can't reallocate buffers, change the size, etc ...

@Alex-PLACET Alex-PLACET force-pushed the arrow_array_schema_proxy branch from c1dbdcd to 4b6cf82 Compare September 3, 2024 09:03
@Alex-PLACET Alex-PLACET marked this pull request as ready for review September 3, 2024 09:24
[[nodiscard]] int64_t n_buffers() const;
[[nodiscard]] int64_t n_children() const;
[[nodiscard]] std::vector<sparrow::buffer_view<uint8_t>> buffers() const;
[[nodiscard]] std::vector<arrow_proxy> children() const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these methods should return const reference to cached variables instead of rebuilding the vector for each call:

  • dynamic allocating the vector each time we need to read a buffer in an aray implementation class will be costly
  • this will guarantee immutability (notice we could fix the current implementation by returning a const std::vector, but we would still have the allocation issue).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok done

public:

explicit arrow_proxy(ArrowArray&& array, ArrowSchema&& schema);
explicit arrow_proxy(ArrowArray* array, ArrowSchema* schema);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are situations where we want to reuse the same schema structure for different Array structures, so I think the followinf constructor would also make sense arrao_proxy(ArrowArray&&, ArrowSchema*)

Copy link
Collaborator Author

@Alex-PLACET Alex-PLACET Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case the schema or array members are pointers, we should not call the release callback, that's right ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

[[nodiscard]] int64_t n_children() const;
[[nodiscard]] std::vector<sparrow::buffer_view<uint8_t>> buffers() const;
[[nodiscard]] std::vector<arrow_proxy> children() const;
[[nodiscard]] std::optional<arrow_proxy> dictionary() const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide an additional bitmap method that returns a dynamic_bitset_view (cached)? This would simpify the implementation of the array implementation classes.

In that case, the first buffer_view in the buffer vector should still be the bitmap to avoid confusion (contrary to what has been done in the array_data).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no bitmap for Sparse and dense unions, map, and run encoded format.
In that case, we can return an optional<dynamic_bitset>,
You can check the get_buffer_types_from_data_type to see what are the buffers for each format.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case it might be better to handle that in the array implementation classes. We can probably factorize out the creation of the bitmap view in an intermediate class rather than using an std::optional for something known at build time.

overloaded(Ts...) -> overloaded<Ts...>;

template <typename T>
[[nodiscard]] T& get_variant(auto& var)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is confusing, and this function should probably be a private method or array_proxy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@Alex-PLACET Alex-PLACET force-pushed the arrow_array_schema_proxy branch from 6af9ffa to 79e0218 Compare September 4, 2024 14:07
void arrow_proxy::initialize_buffers()
{
m_buffers.clear();
const auto buffer_count = static_cast<size_t>(array().n_buffers);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will fail in 32bit, I'll fix it with the 32bit fix by replacing static_cast by a custom one.

include/sparrow/arrow_array_schema_proxy.hpp Outdated Show resolved Hide resolved
ArrowArray& array = std::get<1>(m_array);
if (array.release != nullptr)
{
array.release(&array);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this throws , abort will be called (because we are in a destructor which is implicitly noexcept(true)). Same issue with schema's release. Maybe wrap these calls around a "safe invoke"? Something like we did in mamba: https://github.com/mamba-org/mamba/blob/main/libmamba/include/mamba/core/invoke.hpp#L12
Not sure what to do if it fails though, but I suppose at a minimum both release must be called, one failing shouldnt prevent the call to the second one.

If we want to crash anyway, ignore this.


arrow_proxy::~arrow_proxy()
{
if (m_array.index() == 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of the duplication here, although it could be refactored later.

include/sparrow/arrow_array_schema_proxy.hpp Show resolved Hide resolved
include/sparrow/arrow_array_schema_proxy.hpp Show resolved Hide resolved
SIZES_64BIT,
};

constexpr std::vector<buffer_type> get_buffer_types_from_data_type(data_type data_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what a "buffer type" is here, so noob question: doest it have to be a vector here? (maybe a todo not to replace by a "small vector" later would be nice here?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer_type is an enum which is just above.
We just return an ordered container of enum. So yes it could be improved

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the enum but it is not clear to me exactly what the values means etc. Should be fixed once documented 👍🏼

include/sparrow/utils/variant_visitor.hpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@Klaim Klaim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only suggested tweaks to the documentation. The arrow_proxy type also needs to be better clarified as to what to expect of it's behavior, but once it's done it's good for me 👍🏽

@@ -19,12 +19,14 @@

namespace sparrow
{
/// Validates the number of buffers in an ArrowArray for a given data type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Validates the number of buffers in an ArrowArray for a given data type.
/// @returns `true` if the number of buffers in an `ArrowArray` for a given data type is valid, `false` otherwise.

constexpr bool validate_buffers_count(data_type data_type, int64_t n_buffers)
{
const std::size_t expected_buffer_count = get_expected_buffer_count(data_type);
return static_cast<std::size_t>(n_buffers) == expected_buffer_count;
}

/// Get the expected number of children for a given data type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Get the expected number of children for a given data type.
/// @returns The expected number of children for a given data type.

@@ -64,6 +66,7 @@ namespace sparrow
mpl::unreachable();
}

/// Validates the format of an ArrowArray for a given data type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Validates the format of an ArrowArray for a given data type.
/// @returns `true` if the format of an `ArrowArray` for a given data type is valid, `false` otherwise.

@@ -84,6 +87,8 @@ namespace sparrow
SIZES_64BIT,
};

/// Provide a vector of buffer types for a given data type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Provide a vector of buffer types for a given data type.
/// @returns A vector of buffer types for a given data type.

@@ -130,6 +135,7 @@ namespace sparrow
mpl::unreachable();
}

/// Get The expected offset element count for a given data type and array length.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Get The expected offset element count for a given data type and array length.
/// @returns The expected offset element count for a given data type and array length.

@@ -16,6 +16,7 @@

namespace sparrow
{
/// Check that the given value is a valid ArrowFlag value.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Check that the given value is a valid ArrowFlag value.
/// @returns `true` if the given value is a valid `ArrowFlag` value, `false` otherwise.

@@ -32,6 +33,7 @@ namespace sparrow
);
}

/// Convert a bitfield of ArrowFlag values to a vector of ArrowFlag values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Convert a bitfield of ArrowFlag values to a vector of ArrowFlag values.
/// Converts a bitfield of `ArrowFlag` values to a vector of `ArrowFlag` values.

@@ -52,6 +54,7 @@ namespace sparrow
return flags;
}

/// Convert a vector of ArrowFlag values to a bitfield of ArrowFlag values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Convert a vector of ArrowFlag values to a bitfield of ArrowFlag values.
/// Converts a vector of `ArrowFlag` values to a bitfield of `ArrowFlag` values.

@Klaim
Copy link
Collaborator

Klaim commented Sep 5, 2024

Clarifications: the issue in the destructor should be fixed (or explicitly stated in the code as intently ignored) before merging this.

@JohanMabille
Copy link
Collaborator

The conan failure is unrelated to this PR, let's merge and fix it in a dedicated PR.

@JohanMabille JohanMabille merged commit 702fc0a into man-group:main Sep 6, 2024
55 of 56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants