-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43070: [C++][Parquet] Check for valid ciphertext length to prevent segfault #43071
Conversation
|
Ah it looks like all the Windows builds are failing as I'm using non-exported classes in the new tests. Would it make sense to add |
std::stringstream ss; | ||
ss << "Invalid ciphertext length " << ciphertext_len << ". Expected at least " | ||
<< length_buffer_length_ + kNonceLength + kGcmTagLength << "\n"; | ||
throw ParquetException(ss.str()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the existing code in these methods, several things stand out:
- we don't validate ciphertext length before fetching the 4 bytes encoding the actual length (a corrupt file could perhaps have a ciphertext length < 4?)
- use of raw C arrays instead of
std::array<uint8_t>
for example - why is the
ciphertext_len
argument optional in this API? this looks fickle and error-prone. - the part that extracts and validates the actual length is duplicated in the two methods
I would suggest we take the opportunity and refactor this into a cleaner and less error-prone implementation. In particular, the GCM and CTR-specific methods should probably have a mandatory ciphertext length, and would not have to bother with reading the length bytes.
@ggershinsky @thamht4190 Do we have an explanation for the very odd choices here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I didn't dig in too deep to find why the ciphertext_len
is optional, it would be nice if that could be mandatory. But if that's not possible we should at least be able to provide the size of the buffer that the ciphertext is being read from to ensure that ciphertext_len
isn't greater than this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does look like we don't always know the length of the ciphertext but sometimes just an upper bound that's used to allocate the buffer. I found an example of this when reading the bloom filter header:
arrow/cpp/src/parquet/bloom_filter.cc
Lines 116 to 121 in e615a30
// NOTE: we don't know the bloom filter header size upfront without | |
// bloom_filter_length, and we can't rely on InputStream::Peek() which isn't always | |
// implemented. Therefore, we must first Read() with an upper bound estimate of the | |
// header size, then once we know the bloom filter data size, we can Read() the exact | |
// number of remaining data bytes. | |
bloom_filter_header_read_size = kBloomFilterHeaderSizeGuess; |
I'm thinking that rather than having separate arguments like ciphertext_buffer_len
(required) and ciphertext_expected_len
(optional), it's probably fine to make ciphertext_len
required and mean the size of the buffer, so we would validate that the actual length is <= this after accounting for the 4 byte length rather than enforcing an exact match. Does that seem reasonable? (I've gone ahead and made this change but am happy to adjust the approach)
Another potential weak point is this: arrow/cpp/src/parquet/thrift_internal.h Lines 415 to 417 in 1da71ba
and of course this part where we totally ignore the physical buffer length, letting the arrow/cpp/src/parquet/thrift_internal.h Lines 419 to 420 in 1da71ba
All in all, this warrants a refactor for sanity and robustness. |
@pitrou, I think I've addressed all of your comments now thank you
I haven't touched the |
@adamreeve I just want to let you know that I'm currently sick and may not be able to review this before the next week. Thanks for doing this! |
OK no problem, thanks for letting me know, and I hope you're feeling better soon. |
I'll try to review this today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style looks ok but I'm not familiar with encryption
@@ -22,6 +22,7 @@ | |||
|
|||
#include "arrow/io/file.h" | |||
#include "arrow/testing/gtest_compat.h" | |||
#include "arrow/util/config.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ever used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is needed so that ARROW_WITH_SNAPPY
is defined, otherwise the tests below are always skipped. This is a bit unrelated to this change but I noticed this problem when running the tests locally, and I'd come across this problem before in #40327.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM. Thanks for the fix!
@@ -89,6 +89,14 @@ inline const uint8_t* str2bytes(const std::string& str) { | |||
return reinterpret_cast<const uint8_t*>(cbytes); | |||
} | |||
|
|||
inline ::arrow::util::span<const uint8_t> str2span(const std::string& str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems pretty common, does it make sense to relocate it to arrow/util/span.h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can already construct a span from a string but that creates a span<const char>
. Converting to a uint8_t
span might be more specific to the encryption use case so I'm not sure about this.
@@ -315,8 +315,10 @@ class AesDecryptor::AesDecryptorImpl { | |||
|
|||
~AesDecryptorImpl() { WipeOut(); } | |||
|
|||
int Decrypt(const uint8_t* ciphertext, int ciphertext_len, const uint8_t* key, | |||
int key_len, const uint8_t* aad, int aad_len, uint8_t* plaintext); | |||
int Decrypt(::arrow::util::span<const uint8_t> ciphertext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add using ::arrow::util::span
to this source file to make the signatures shorter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've done that now
std::vector<uint8_t> ciphertext(expected_ciphertext_len, '\0'); | ||
|
||
int ciphertext_length = | ||
encryptor.Encrypt(str2bytes(plain_text_), static_cast<int>(plain_text_.size()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can refactor this to use span in the follow up changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I've just made #43142 for this, and I will follow up after this PR
} else { | ||
if (ciphertext_len == 0) { | ||
throw ParquetException("Zero ciphertext length"); | ||
if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) { | |
if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int32_t>::max())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 fixed
int aad_len, uint8_t* plaintext) { | ||
int len; | ||
int plaintext_len; | ||
int AesDecryptor::PlaintextLength(int ciphertext_len) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we replace int
with int32_t
to be more portable? Same for functions below. Of course this can be a followup change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it would make sense to do this as a follow up to avoid changing too much in this PR, I've made #43141 for this
if (length_buffer_length_ > 0) { | ||
if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to understand this line. Perhaps it is better to explicitly use kBufferSizeLength
here as line 484 to 486 have assumed this length is 4 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good point, that also confused me a bit. I've changed that and added a comment to hopefully make this more readable
if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) { | ||
std::stringstream ss; | ||
ss << "Ciphertext buffer length " << ciphertext.size() | ||
<< " is insufficient to read the ciphertext length"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: include the length (4 bytes) in the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks!
CI failures are unrelated. I will merge it shortly. |
@raulcd Please feel free to port it to maint-17.0.0 |
(Gang says he has api error when run merge script so I merged this, lol) |
Thanks both for jumping in on the review and thanks @adamreeve for the PR |
… segfault (#43071) ### Rationale for this change See #43070 ### What changes are included in this PR? Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type. Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer. ### Are these changes tested? Yes I've added new unit tests for this. ### Are there any user-facing changes? No * GitHub Issue: #43070 Authored-by: Adam Reeve <[email protected]> Signed-off-by: mwish <[email protected]>
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit eadeb74. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…revent segfault (apache#43071) ### Rationale for this change See apache#43070 ### What changes are included in this PR? Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type. Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer. ### Are these changes tested? Yes I've added new unit tests for this. ### Are there any user-facing changes? No * GitHub Issue: apache#43070 Authored-by: Adam Reeve <[email protected]> Signed-off-by: mwish <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Belated review, sorry.
ss << "Negative plaintext length " << plaintext_len; | ||
throw ParquetException(ss.str()); | ||
} | ||
return plaintext_len + ciphertext_size_delta_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should ideally check for signed addition overflow here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Is it ok to add this to the changes in #43195, or should I make a separate PR for these follow ups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO you can add them to the aforementioned PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this check in fb63bcc
if (ciphertext_len > 0 && | ||
ciphertext_len != (written_ciphertext_len + length_buffer_length_)) { | ||
throw ParquetException("Wrong ciphertext length"); | ||
if (written_ciphertext_len < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the compiler won't elide this, as ((ciphertext[3] & 0xff) << 24)
becoming negative implies signed integer overflow which is undefined behavior. @felipecrv What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is weird. written_ciphertext_len
should be uint32_t
, then checking that written_ciphertext_len
is not greater than std::numeric_limits<int32_t>::max()
is enough. In other words: the encoded value is never negative, but it can be too big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ciphertext[3] & 0xff
is weird as well, since ciphertext[3]
is a uint8, thus in [0,255] already.
Perhaps the proper check should simply be ciphertext[3] < 0x80u
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah the & 0xff
s are all redundant, maybe it ended up written this way to be similar to the code for writing the length. Rather than just checking the length isn't negative, we should also check that written_ciphertext_len + length_buffer_length_
doesn't overflow, so I think it would probably be simplest to read as uint32_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint8_t tag[kGcmTagLength]; | ||
memset(tag, 0, kGcmTagLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or more idiomatically:
std::array<uint8_t, kGcmTagLength> tag{};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this as well as the other uses of C style arrays in ff31c7b
AllocateBuffer(decryptor->pool(), | ||
static_cast<int64_t>(clen - decryptor->CiphertextSizeDelta()))); | ||
const uint8_t* cipher_buf = buf; | ||
auto decrypted_buffer = std::static_pointer_cast<ResizableBuffer>(AllocateBuffer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated, but since we're not attempting to resize the buffer, the cast to ResizableBuffer
should be superfluous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AllocateBuffer
here is ::parquet::AllocateBuffer
which returns a std::shared_ptr<ResizableBuffer>
, rather than ::arrow::AllocateBuffer
, so it looks like the cast would be unnecessary even if we were resizing it:
arrow/cpp/src/parquet/platform.cc
Lines 36 to 37 in 031497d
std::shared_ptr<ResizableBuffer> AllocateBuffer(MemoryPool* pool, int64_t size) { | |
PARQUET_ASSIGN_OR_THROW(auto result, ::arrow::AllocateResizableBuffer(size, pool)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see. We can try removing it at some point then, as the code currently looks weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this as well as another occurrence of the same behaviour in f4a7721
…pan instead of raw pointers (#43195) ### Rationale for this change See #43142. This is a follow up to #43071 which refactored the Decryptor API and added extra checks to prevent segfaults. This PR makes similar changes to the Encryptor API for consistency and better maintainability. ### What changes are included in this PR? * Change `AesEncryptor::Encrypt` and `Encryptor::Encrypt` to use `arrow::util::span` instead of raw pointers * Replace the `AesEncryptor::CiphertextSizeDelta` method with a `CiphertextLength` method that checks for overflow and abstracts the size difference behaviour away from consumer code for improved readability. ### Are these changes tested? * This is mostly a refactoring of existing code so is covered by existing tests. ### Are there any user-facing changes? No * GitHub Issue: #43142 Lead-authored-by: Adam Reeve <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
See #43070
What changes are included in this PR?
Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type.
Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer.
Are these changes tested?
Yes I've added new unit tests for this.
Are there any user-facing changes?
No