GH-43070: [C++][Parquet] Check for valid ciphertext length to prevent segfault #43071

Merged (18 commits) on Jul 4, 2024


adamreeve
Contributor

@adamreeve adamreeve commented Jun 26, 2024

Rationale for this change

See #43070

What changes are included in this PR?

Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type.

Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer.
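For illustration, the minimum-length check described above could be sketched roughly as follows. This is a standalone sketch and not the code from this PR: the constant values are assumptions based on the Parquet encryption format, `CheckMinGcmCiphertextLength` and `has_length_prefix` are hypothetical names, and `std::invalid_argument` stands in for `ParquetException`.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>

// Assumed sizes; the names mirror constants quoted in this PR, the values are
// assumptions taken from the Parquet encryption format.
constexpr std::size_t kBufferSizeLength = 4;  // little-endian length prefix
constexpr std::size_t kNonceLength = 12;      // AES-GCM nonce
constexpr std::size_t kGcmTagLength = 16;     // AES-GCM authentication tag

// Throws if a GCM ciphertext buffer is too short to hold its fixed-size fields.
// `has_length_prefix` is a hypothetical flag standing in for
// "length_buffer_length_ > 0" (the length is only present if it was written).
inline void CheckMinGcmCiphertextLength(std::size_t ciphertext_len,
                                        bool has_length_prefix) {
  const std::size_t min_len =
      (has_length_prefix ? kBufferSizeLength : 0) + kNonceLength + kGcmTagLength;
  if (ciphertext_len < min_len) {
    std::stringstream ss;
    ss << "Invalid ciphertext length " << ciphertext_len << ". Expected at least "
       << min_len;
    throw std::invalid_argument(ss.str());
  }
}
```

Under these assumed sizes, a buffer that includes the length prefix needs at least 32 bytes before any decryption is attempted, and one without it needs at least 28.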

Are these changes tested?

Yes I've added new unit tests for this.

Are there any user-facing changes?

No

@adamreeve adamreeve requested a review from wgtmac as a code owner June 26, 2024 23:53

⚠️ GitHub issue #43070 has been automatically assigned in GitHub to PR creator.

@adamreeve
Contributor Author

Ah it looks like all the Windows builds are failing as I'm using non-exported classes in the new tests. Would it make sense to add PARQUET_EXPORT to these, or should I not be testing these internal classes? I think it would be quite difficult to add tests for this change at a higher level.

@mapleFU
Member

mapleFU commented Jun 27, 2024

The code LGTM, but I'm not familiar with the decryption module, so pinging @wgtmac @pitrou for help.

cpp/src/parquet/CMakeLists.txt (outdated)
std::stringstream ss;
ss << "Invalid ciphertext length " << ciphertext_len << ". Expected at least "
<< length_buffer_length_ + kNonceLength + kGcmTagLength << "\n";
throw ParquetException(ss.str());
}
Member

Looking at the existing code in these methods, several things stand out:

  1. we don't validate ciphertext length before fetching the 4 bytes encoding the actual length (a corrupt file could perhaps have a ciphertext length < 4?)
  2. use of raw C arrays instead of std::array<uint8_t> for example
  3. why is the ciphertext_len argument optional in this API? this looks fickle and error-prone.
  4. the part that extracts and validates the actual length is duplicated in the two methods

I would suggest we take the opportunity and refactor this into a cleaner and less error-prone implementation. In particular, the GCM and CTR-specific methods should probably have a mandatory ciphertext length, and would not have to bother with reading the length bytes.

@ggershinsky @thamht4190 Do we have an explanation for the very odd choices here?

Contributor Author

Yeah, I didn't dig too deep into why ciphertext_len is optional; it would be nice if it could be mandatory. But if that's not possible, we should at least be able to provide the size of the buffer that the ciphertext is being read from, to ensure that ciphertext_len isn't greater than this.

Contributor Author

@adamreeve adamreeve Jun 30, 2024

It does look like we don't always know the length of the ciphertext but sometimes just an upper bound that's used to allocate the buffer. I found an example of this when reading the bloom filter header:

// NOTE: we don't know the bloom filter header size upfront without
// bloom_filter_length, and we can't rely on InputStream::Peek() which isn't always
// implemented. Therefore, we must first Read() with an upper bound estimate of the
// header size, then once we know the bloom filter data size, we can Read() the exact
// number of remaining data bytes.
bloom_filter_header_read_size = kBloomFilterHeaderSizeGuess;

I'm thinking that rather than having separate arguments like ciphertext_buffer_len (required) and ciphertext_expected_len (optional), it's probably fine to make ciphertext_len required and mean the size of the buffer, so we would validate that the actual length is <= this after accounting for the 4 byte length rather than enforcing an exact match. Does that seem reasonable? (I've gone ahead and made this change but am happy to adjust the approach)
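A rough sketch of the approach proposed here, where the required length means the buffer size and the written length only has to fit inside it. The function and constant names are hypothetical, `std::invalid_argument` stands in for `ParquetException`, and the 4-byte prefix size is taken from the discussion above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical name for the size of the 4-byte length prefix.
constexpr std::size_t kLengthPrefixSize = 4;

// Given the size of the buffer the ciphertext was read into (an upper bound) and
// the length decoded from the prefix, returns how many bytes of the buffer
// (prefix + ciphertext) are actually in use. Throws rather than allowing a read
// past the end of the buffer.
inline std::size_t CheckedCiphertextExtent(std::size_t buffer_len,
                                           std::uint32_t written_len) {
  if (buffer_len < kLengthPrefixSize) {
    throw std::invalid_argument("Buffer too small to hold the length prefix");
  }
  // Subtract from buffer_len instead of adding to written_len so that the
  // comparison itself cannot overflow.
  if (written_len > buffer_len - kLengthPrefixSize) {
    throw std::invalid_argument("Written ciphertext length exceeds the buffer");
  }
  return kLengthPrefixSize + written_len;
}
```

With this shape a buffer allocated from an upper-bound estimate is accepted as long as the written length fits, which is the "<=" rather than "==" semantics described above.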

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Jun 27, 2024
@pitrou
Member

pitrou commented Jun 27, 2024

Another potential weak point is this:

auto decrypted_buffer = std::static_pointer_cast<ResizableBuffer>(
AllocateBuffer(decryptor->pool(),
static_cast<int64_t>(clen - decryptor->CiphertextSizeDelta())));

and of course this part where we totally ignore the physical buffer length, letting the Decrypt function happily read past the end of the buffer:

uint32_t decrypted_buffer_len =
decryptor->Decrypt(cipher_buf, 0, decrypted_buffer->mutable_data());

All in all, this warrants a refactor for sanity and robustness.

@pitrou pitrou added the Critical Fix label (bugfixes for security vulnerabilities, crashes, or invalid data) Jun 27, 2024
@adamreeve
Contributor Author

adamreeve commented Jul 3, 2024

@pitrou, I think I've addressed all of your comments now, thank you.

  • I've added extra validation of the buffer sizes in various places
  • I've switched from raw pointers to arrow::util::span (I think this is more appropriate than std::array which is for fixed length arrays, but let me know if I've misunderstood)
  • The ciphertext size is no longer optional, although it's no longer checked for an exact match, since the actual ciphertext length might be less than the buffer size
  • I've pulled out the length reading and validation into a shared method
  • I've refactored how the plaintext/ciphertext length conversions are handled by adding methods for these rather than adding or subtracting the size delta in consumer code

I haven't touched the AesEncryptor API at all, but it would probably make sense to follow up after this to at least change that to use arrow::util::span too for consistency.

@pitrou
Member

pitrou commented Jul 3, 2024

@adamreeve I just want to let you know that I'm currently sick and may not be able to review this before next week. Thanks for doing this!

@adamreeve
Contributor Author

OK no problem, thanks for letting me know, and I hope you're feeling better soon.

@raulcd
Member

raulcd commented Jul 3, 2024

@mapleFU @wgtmac is there any possibility you could take a look at this? Otherwise this fix will unfortunately have to miss the 17.0.0 release.

@wgtmac
Member

wgtmac commented Jul 3, 2024

I'll try to review this today.

Member

@mapleFU mapleFU left a comment

The style looks ok but I'm not familiar with encryption

@@ -22,6 +22,7 @@

#include "arrow/io/file.h"
#include "arrow/testing/gtest_compat.h"
#include "arrow/util/config.h"
Member

Is this ever used?

Contributor Author

@adamreeve adamreeve Jul 4, 2024

Yes this is needed so that ARROW_WITH_SNAPPY is defined, otherwise the tests below are always skipped. This is a bit unrelated to this change but I noticed this problem when running the tests locally, and I'd come across this problem before in #40327.

Member

@wgtmac wgtmac left a comment

Generally LGTM. Thanks for the fix!

@@ -89,6 +89,14 @@ inline const uint8_t* str2bytes(const std::string& str) {
return reinterpret_cast<const uint8_t*>(cbytes);
}

inline ::arrow::util::span<const uint8_t> str2span(const std::string& str) {
Member

It seems pretty common, does it make sense to relocate it to arrow/util/span.h?

Contributor Author

You can already construct a span from a string but that creates a span<const char>. Converting to a uint8_t span might be more specific to the encryption use case so I'm not sure about this.
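To illustrate the distinction being made here, a byte-oriented view over a string might be sketched as below. `Span` is a hypothetical minimal stand-in for `arrow::util::span` (only a pointer and a length), and `StringToByteSpan` is an illustrative name rather than the helper added in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical minimal stand-in for arrow::util::span: a non-owning view that
// carries its length together with the pointer.
template <typename T>
struct Span {
  T* ptr;
  std::size_t len;
  T* data() const { return ptr; }
  std::size_t size() const { return len; }
};

// Views the characters of a std::string as unsigned bytes without copying.
// The view is only valid while `str` is alive and unmodified.
inline Span<const std::uint8_t> StringToByteSpan(const std::string& str) {
  return {reinterpret_cast<const std::uint8_t*>(str.data()), str.size()};
}
```

The cast is what makes this specific to code that wants raw bytes, such as encryption; a generic string-to-span conversion would keep `const char` as the element type.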

@@ -315,8 +315,10 @@ class AesDecryptor::AesDecryptorImpl {

~AesDecryptorImpl() { WipeOut(); }

int Decrypt(const uint8_t* ciphertext, int ciphertext_len, const uint8_t* key,
int key_len, const uint8_t* aad, int aad_len, uint8_t* plaintext);
int Decrypt(::arrow::util::span<const uint8_t> ciphertext,
Member

Should we add using ::arrow::util::span to this source file to make the signatures shorter?

Contributor Author

Good point, I've done that now

std::vector<uint8_t> ciphertext(expected_ciphertext_len, '\0');

int ciphertext_length =
encryptor.Encrypt(str2bytes(plain_text_), static_cast<int>(plain_text_.size()),
Member

We can refactor this to use span in the follow up changes.

Contributor Author

Yes I've just made #43142 for this, and I will follow up after this PR

} else {
if (ciphertext_len == 0) {
throw ParquetException("Zero ciphertext length");
if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) {
Member

Suggested change
- if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) {
+ if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int32_t>::max())) {

Contributor Author

👍 fixed
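As a standalone illustration of the guard in the suggested change above, a size can be range-checked against `int32_t` before it is narrowed. `CheckedInt32Size` is a hypothetical helper name and `std::length_error` stands in for `ParquetException`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Converts a buffer size to int32_t, throwing instead of silently truncating
// when the size does not fit.
inline std::int32_t CheckedInt32Size(std::size_t size) {
  if (size > static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max())) {
    throw std::length_error("Size does not fit in int32_t");
  }
  return static_cast<std::int32_t>(size);
}
```

Using `int32_t` rather than `int` in the limit makes the bound explicit, since the width of `int` is implementation-defined.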

int aad_len, uint8_t* plaintext) {
int len;
int plaintext_len;
int AesDecryptor::PlaintextLength(int ciphertext_len) const {
Member

Should we replace int with int32_t to be more portable? Same for functions below. Of course this can be a followup change.

Contributor Author

Yes it would make sense to do this as a follow up to avoid changing too much in this PR, I've made #43141 for this

if (length_buffer_length_ > 0) {
if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) {
Member

It took me a while to understand this line. Perhaps it is better to explicitly use kBufferSizeLength here, as lines 484 to 486 assume this length is 4 bytes.

Contributor Author

Yes good point, that also confused me a bit. I've changed that and added a comment to hopefully make this more readable

if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) {
std::stringstream ss;
ss << "Ciphertext buffer length " << ciphertext.size()
<< " is insufficient to read the ciphertext length";
Member

nit: include the length (4 bytes) in the error message.

Contributor Author

👍 fixed

Member

@wgtmac wgtmac left a comment

+1

Thanks!

@wgtmac
Member

wgtmac commented Jul 4, 2024

CI failures are unrelated. I will merge it shortly.

@mapleFU mapleFU merged commit eadeb74 into apache:main Jul 4, 2024
34 of 40 checks passed
@mapleFU mapleFU removed the awaiting committer review label Jul 4, 2024
@wgtmac
Member

wgtmac commented Jul 4, 2024

@raulcd Please feel free to port it to maint-17.0.0

@mapleFU
Member

mapleFU commented Jul 4, 2024

(Gang says he hit an API error when running the merge script, so I merged this, lol)

@adamreeve adamreeve deleted the decrypt-segfault-fix branch July 4, 2024 05:37
@raulcd
Member

raulcd commented Jul 4, 2024

Thanks both for jumping in on the review and thanks @adamreeve for the PR

raulcd pushed a commit that referenced this pull request Jul 4, 2024
… segfault (#43071)

* GitHub Issue: #43070

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: mwish <[email protected]>

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit eadeb74.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Jul 9, 2024
…revent segfault (apache#43071)

* GitHub Issue: apache#43070

Authored-by: Adam Reeve <[email protected]>
Signed-off-by: mwish <[email protected]>
Member

@pitrou pitrou left a comment

Belated review, sorry.

ss << "Negative plaintext length " << plaintext_len;
throw ParquetException(ss.str());
}
return plaintext_len + ciphertext_size_delta_;
Member

We should ideally check for signed addition overflow here.

Contributor Author

Good point. Is it ok to add this to the changes in #43195, or should I make a separate PR for these follow ups?

Member

@pitrou pitrou Jul 10, 2024

IMHO you can add them to the aforementioned PR.

Contributor Author

@adamreeve adamreeve Jul 11, 2024

I've added this check in fb63bcc
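For readers following this thread, a signed addition overflow check of the kind requested above might look like the sketch below. It is not taken from the referenced commit: `CheckedCiphertextLength` and `size_delta` are hypothetical names, the delta is assumed to be non-negative, and standard exceptions stand in for `ParquetException`.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Returns plaintext_len + size_delta, throwing instead of overflowing.
// Assumes both values are expected to be non-negative.
inline std::int32_t CheckedCiphertextLength(std::int32_t plaintext_len,
                                            std::int32_t size_delta) {
  if (plaintext_len < 0 || size_delta < 0) {
    throw std::invalid_argument("Negative length");
  }
  // Rearranged so the check never performs the addition that could overflow.
  if (plaintext_len > std::numeric_limits<std::int32_t>::max() - size_delta) {
    throw std::overflow_error("Ciphertext length overflows int32_t");
  }
  return plaintext_len + size_delta;
}
```

The comparison is written as a subtraction from the maximum because testing the sum after the fact would already have invoked the overflow being guarded against.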

if (ciphertext_len > 0 &&
ciphertext_len != (written_ciphertext_len + length_buffer_length_)) {
throw ParquetException("Wrong ciphertext length");
if (written_ciphertext_len < 0) {
Member

I wonder if the compiler won't elide this, as ((ciphertext[3] & 0xff) << 24) becoming negative implies signed integer overflow which is undefined behavior. @felipecrv What do you think?

Contributor

This check is weird. written_ciphertext_len should be uint32_t, then checking that written_ciphertext_len is not greater than std::numeric_limits<int32_t>::max() is enough. In other words: the encoded value is never negative, but it can be too big.

Member

ciphertext[3] & 0xff is weird as well, since ciphertext[3] is a uint8, thus in [0,255] already.
Perhaps the proper check should simply be ciphertext[3] < 0x80u?

Contributor Author

Hmm yeah the & 0xffs are all redundant, maybe it ended up written this way to be similar to the code for writing the length. Rather than just checking the length isn't negative, we should also check that written_ciphertext_len + length_buffer_length_ doesn't overflow, so I think it would probably be simplest to read as uint32_t.

Contributor Author

@adamreeve adamreeve Jul 11, 2024

I've changed this check as part of #43195, in this commit: b81cafe
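The idea discussed in this thread, decoding the 4-byte length as `uint32_t` and then range-checking it, can be sketched as follows. This is an illustration under assumptions rather than the code from the referenced commit: the function names are hypothetical, `std::out_of_range` stands in for `ParquetException`, and the little-endian byte order is inferred from the `ciphertext[3] ... << 24` expression quoted above.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Decodes 4 little-endian bytes. Each byte is widened to uint32_t before
// shifting, so no shift is ever applied to a signed value.
inline std::uint32_t ReadLittleEndianUint32(const std::uint8_t* bytes) {
  return static_cast<std::uint32_t>(bytes[0]) |
         (static_cast<std::uint32_t>(bytes[1]) << 8) |
         (static_cast<std::uint32_t>(bytes[2]) << 16) |
         (static_cast<std::uint32_t>(bytes[3]) << 24);
}

// The encoded value is never negative, but it can be too big for int32_t.
// The caller must guarantee that at least 4 bytes are readable.
inline std::int32_t ReadCheckedLength(const std::uint8_t* bytes) {
  const std::uint32_t value = ReadLittleEndianUint32(bytes);
  if (value > static_cast<std::uint32_t>(std::numeric_limits<std::int32_t>::max())) {
    throw std::out_of_range("Written ciphertext length is too large");
  }
  return static_cast<std::int32_t>(value);
}
```

Reading into an unsigned type also makes the `& 0xff` masks unnecessary, since each `uint8_t` is already in the range 0 to 255.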

Comment on lines +520 to +521
uint8_t tag[kGcmTagLength];
memset(tag, 0, kGcmTagLength);
Member

Or more idiomatically:

  std::array<uint8_t, kGcmTagLength> tag{};

Contributor Author

@adamreeve adamreeve Jul 11, 2024

I've changed this as well as the other uses of C style arrays in ff31c7b

AllocateBuffer(decryptor->pool(),
static_cast<int64_t>(clen - decryptor->CiphertextSizeDelta())));
const uint8_t* cipher_buf = buf;
auto decrypted_buffer = std::static_pointer_cast<ResizableBuffer>(AllocateBuffer(
Member

Unrelated, but since we're not attempting to resize the buffer, the cast to ResizableBuffer should be superfluous?

Contributor Author

AllocateBuffer here is ::parquet::AllocateBuffer which returns a std::shared_ptr<ResizableBuffer>, rather than ::arrow::AllocateBuffer, so it looks like the cast would be unnecessary even if we were resizing it:

std::shared_ptr<ResizableBuffer> AllocateBuffer(MemoryPool* pool, int64_t size) {
PARQUET_ASSIGN_OR_THROW(auto result, ::arrow::AllocateResizableBuffer(size, pool));

Member

Hmm, I see. We can try removing it at some point then, as the code currently looks weird.

Contributor Author

@adamreeve adamreeve Jul 11, 2024

I've removed this as well as another occurrence of the same behaviour in f4a7721

@github-actions github-actions bot added the awaiting committer review, awaiting review and awaiting changes labels and removed the awaiting review and awaiting committer review labels Jul 9, 2024
pitrou added a commit that referenced this pull request Jul 11, 2024
…pan instead of raw pointers (#43195)

### Rationale for this change

See #43142. This is a follow up to #43071 which refactored the Decryptor API and added extra checks to prevent segfaults. This PR makes similar changes to the Encryptor API for consistency and better maintainability.

### What changes are included in this PR?

* Change `AesEncryptor::Encrypt` and `Encryptor::Encrypt` to use `arrow::util::span` instead of raw pointers
* Replace the `AesEncryptor::CiphertextSizeDelta` method with a `CiphertextLength` method that checks for overflow and abstracts the size difference behaviour away from consumer code for improved readability.

### Are these changes tested?

* This is mostly a refactoring of existing code so is covered by existing tests.

### Are there any user-facing changes?

No
* GitHub Issue: #43142

Lead-authored-by: Adam Reeve <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Labels
awaiting changes, Component: C++, Component: Parquet, Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data)
6 participants