GH-43070: [C++][Parquet] Check for valid ciphertext length to prevent segfault #43071

adamreeve · 2024-06-26T23:53:17Z

Rationale for this change

See #43070.

What changes are included in this PR?

Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type.

Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer.
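The minimum-size check described above can be sketched roughly as follows. This is an illustration rather than the code from the PR: `CheckGcmCiphertextSize` is a hypothetical helper, the constant names follow the ones quoted in the review, and the nonce and tag sizes (12 and 16 bytes) are assumptions based on common AES-GCM usage.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sizes assumed for this sketch; the names mirror those quoted in the review.
constexpr size_t kBufferSizeLength = 4;  // length prefix, when written
constexpr size_t kNonceLength = 12;      // assumed AES-GCM nonce size
constexpr size_t kGcmTagLength = 16;     // assumed AES-GCM tag size

// Hypothetical helper: throws if the buffer cannot hold the fixed-size parts
// of a GCM ciphertext (optional length prefix, nonce, and tag).
inline void CheckGcmCiphertextSize(size_t ciphertext_size, bool length_prefix_written) {
  const size_t minimum = (length_prefix_written ? kBufferSizeLength : size_t{0}) +
                         kNonceLength + kGcmTagLength;
  if (ciphertext_size < minimum) {
    throw std::runtime_error("Invalid ciphertext length " +
                             std::to_string(ciphertext_size) +
                             ". Expected at least " + std::to_string(minimum));
  }
}
```

Running a check like this before touching the nonce or tag means a truncated buffer fails with an exception instead of an out-of-bounds read.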

Are these changes tested?

Yes, I've added new unit tests for this.

Are there any user-facing changes?

No

GitHub Issue: [C++][Parquet] Reading corrupted encrypted Parquet files can cause a segfault #43070

github-actions · 2024-06-26T23:53:44Z

⚠️ GitHub issue #43070 has been automatically assigned in GitHub to PR creator.

adamreeve · 2024-06-27T00:25:59Z

Ah, it looks like all the Windows builds are failing because I'm using non-exported classes in the new tests. Would it make sense to add PARQUET_EXPORT to these, or should I not be testing these internal classes? I think it would be quite difficult to add tests for this change at a higher level.

mapleFU · 2024-06-27T02:57:22Z

The code LGTM, but I'm not familiar with the decryption module, so pinging @wgtmac and @pitrou for help.

cpp/src/parquet/CMakeLists.txt

pitrou · 2024-06-27T07:43:22Z

cpp/src/parquet/encryption/encryption_internal.cc

+    std::stringstream ss;
+    ss << "Invalid ciphertext length " << ciphertext_len << ". Expected at least "
+       << length_buffer_length_ + kNonceLength + kGcmTagLength << "\n";
+    throw ParquetException(ss.str());
  }


Looking at the existing code in these methods, several things stand out:

we don't validate ciphertext length before fetching the 4 bytes encoding the actual length (a corrupt file could perhaps have a ciphertext length < 4?)

use of raw C arrays instead of std::array<uint8_t> for example

why is the ciphertext_len argument optional in this API? this looks fickle and error-prone.

the part that extracts and validates the actual length is duplicated in the two methods

I would suggest we take the opportunity and refactor this into a cleaner and less error-prone implementation. In particular, the GCM and CTR-specific methods should probably have a mandatory ciphertext length, and would not have to bother with reading the length bytes.

@ggershinsky @thamht4190 Do we have an explanation for the very odd choices here?

Yeah, I didn't dig in deep enough to find out why ciphertext_len is optional; it would be nice if it could be mandatory. But if that's not possible, we should at least be able to provide the size of the buffer that the ciphertext is being read from, to ensure that ciphertext_len isn't greater than this.

It does look like we don't always know the length of the ciphertext but sometimes just an upper bound that's used to allocate the buffer. I found an example of this when reading the bloom filter header:

arrow/cpp/src/parquet/bloom_filter.cc

Lines 116 to 121 in e615a30

// NOTE: we don't know the bloom filter header size upfront without
// bloom_filter_length, and we can't rely on InputStream::Peek() which isn't always
// implemented. Therefore, we must first Read() with an upper bound estimate of the
// header size, then once we know the bloom filter data size, we can Read() the exact
// number of remaining data bytes.
bloom_filter_header_read_size = kBloomFilterHeaderSizeGuess;

I'm thinking that rather than having separate arguments like ciphertext_buffer_len (required) and ciphertext_expected_len (optional), it's probably fine to make ciphertext_len required and mean the size of the buffer, so we would validate that the actual length is <= this after accounting for the 4 byte length rather than enforcing an exact match. Does that seem reasonable? (I've gone ahead and made this change but am happy to adjust the approach)
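A rough sketch of the "buffer size as upper bound" validation proposed here, assuming the written length has already been decoded from the 4-byte prefix. `CheckedTotalLength` is a hypothetical name and this is not the code that was merged.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

constexpr size_t kBufferSizeLength = 4;  // size of the length prefix in bytes

// Hypothetical helper: given the length decoded from the prefix and the size
// of the buffer it came from, return the number of bytes to consume
// (prefix + payload), or throw if that would run past the end of the buffer.
inline int32_t CheckedTotalLength(uint32_t written_len, size_t buffer_size) {
  if (buffer_size < kBufferSizeLength) {
    throw std::runtime_error("Buffer too small to hold the 4-byte length prefix");
  }
  // The buffer may be larger than the ciphertext (an upper-bound read), so
  // the written length only has to fit, not match exactly.
  const size_t available = buffer_size - kBufferSizeLength;
  if (written_len > available) {
    throw std::runtime_error("Written ciphertext length exceeds the buffer size");
  }
  const uint64_t total = static_cast<uint64_t>(written_len) + kBufferSizeLength;
  if (total > static_cast<uint64_t>(std::numeric_limits<int32_t>::max())) {
    throw std::runtime_error("Ciphertext length does not fit in int32_t");
  }
  return static_cast<int32_t>(total);
}
```

With this shape, a caller such as the bloom filter reader can pass an over-sized buffer, while a corrupt length that points past the buffer is still rejected.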

pitrou · 2024-06-27T07:45:45Z

Another potential weak point is this:

arrow/cpp/src/parquet/thrift_internal.h

Lines 415 to 417 in 1da71ba

    
auto decrypted_buffer = std::static_pointer_cast<ResizableBuffer>(
    AllocateBuffer(decryptor->pool(),
                   static_cast<int64_t>(clen - decryptor->CiphertextSizeDelta())));

and of course this part where we totally ignore the physical buffer length, letting the Decrypt function happily read past the end of the buffer:

arrow/cpp/src/parquet/thrift_internal.h

Lines 419 to 420 in 1da71ba

    
uint32_t decrypted_buffer_len =
    decryptor->Decrypt(cipher_buf, 0, decrypted_buffer->mutable_data());

All in all, this warrants a refactor for sanity and robustness.


adamreeve · 2024-07-03T04:03:36Z

@pitrou, I think I've addressed all of your comments now, thank you.

I've added extra validation of the buffer sizes in various places.
I've switched from raw pointers to arrow::util::span (I think this is more appropriate than std::array, which is for fixed-length arrays, but let me know if I've misunderstood).
The ciphertext size is no longer optional, but it's no longer checked for an exact match: the actual ciphertext length might be less than the buffer size.
I've pulled out the length reading and validation into a shared method.
I've refactored how the plaintext/ciphertext length conversions are handled by adding methods for these, rather than adding or subtracting the size delta in consumer code.

I haven't touched the AesEncryptor API at all, but it would probably make sense to follow up after this to at least change that to use arrow::util::span too for consistency.
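To illustrate why passing sizes together with pointers helps, here is a minimal sketch of a span-style signature. `ByteSpan` and `MutableByteSpan` are hypothetical stand-ins for arrow::util::span, `FakeDecrypt` performs no real decryption (it only copies the payload), and the 12-byte nonce and 16-byte tag are assumed sizes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-ins for ::arrow::util::span<const uint8_t> and
// ::arrow::util::span<uint8_t>: a pointer that always travels with its size.
struct ByteSpan {
  const uint8_t* data;
  size_t size;
};
struct MutableByteSpan {
  uint8_t* data;
  size_t size;
};

constexpr size_t kNonceLength = 12;   // assumed
constexpr size_t kGcmTagLength = 16;  // assumed

// Not real decryption: copies the bytes between nonce and tag, to show that
// both the input and the output buffer can be validated by the callee.
inline size_t FakeDecrypt(ByteSpan ciphertext, MutableByteSpan plaintext) {
  if (ciphertext.size < kNonceLength + kGcmTagLength) {
    throw std::runtime_error("Ciphertext buffer too small");
  }
  const size_t plaintext_len = ciphertext.size - kNonceLength - kGcmTagLength;
  if (plaintext.size < plaintext_len) {
    throw std::runtime_error("Plaintext buffer too small");
  }
  std::copy_n(ciphertext.data + kNonceLength, plaintext_len, plaintext.data);
  return plaintext_len;
}
```

With raw pointers, neither of these checks is possible inside the callee unless every caller remembers to pass the right lengths separately.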

pitrou · 2024-07-03T09:29:02Z

@adamreeve I just want to let you know that I'm currently sick and may not be able to review this before the next week. Thanks for doing this!

adamreeve · 2024-07-03T09:53:49Z

OK no problem, thanks for letting me know, and I hope you're feeling better soon.

raulcd · 2024-07-03T09:59:34Z

@mapleFU @wgtmac is there any possibility you could take a look at this? Otherwise this fix will unfortunately have to miss the 17.0.0 release.

wgtmac · 2024-07-03T10:10:58Z

I'll try to review this today.

mapleFU

The style looks ok but I'm not familiar with encryption

mapleFU · 2024-07-03T10:32:18Z

cpp/src/parquet/encryption/read_configurations_test.cc

@@ -22,6 +22,7 @@

 #include "arrow/io/file.h"
 #include "arrow/testing/gtest_compat.h"
+#include "arrow/util/config.h"


Yes this is needed so that ARROW_WITH_SNAPPY is defined, otherwise the tests below are always skipped. This is a bit unrelated to this change but I noticed this problem when running the tests locally, and I'd come across this problem before in #40327.

cpp/src/parquet/encryption/encryption_internal.h

wgtmac

Generally LGTM. Thanks for the fix!

cpp/src/parquet/encryption/encryption_internal.cc

wgtmac · 2024-07-03T14:27:12Z

cpp/src/parquet/encryption/encryption.h

@@ -89,6 +89,14 @@ inline const uint8_t* str2bytes(const std::string& str) {
  return reinterpret_cast<const uint8_t*>(cbytes);
 }

+inline ::arrow::util::span<const uint8_t> str2span(const std::string& str) {


It seems pretty common, does it make sense to relocate it to arrow/util/span.h?

You can already construct a span from a string but that creates a span<const char>. Converting to a uint8_t span might be more specific to the encryption use case so I'm not sure about this.

wgtmac · 2024-07-03T14:38:53Z

cpp/src/parquet/encryption/encryption_internal.cc

@@ -315,8 +315,10 @@ class AesDecryptor::AesDecryptorImpl {

  ~AesDecryptorImpl() { WipeOut(); }

-  int Decrypt(const uint8_t* ciphertext, int ciphertext_len, const uint8_t* key,
-              int key_len, const uint8_t* aad, int aad_len, uint8_t* plaintext);
+  int Decrypt(::arrow::util::span<const uint8_t> ciphertext,


Should we add using ::arrow::util::span to this source file to make the signatures shorter?

Good point, I've done that now

wgtmac · 2024-07-03T14:49:37Z

cpp/src/parquet/encryption/encryption_internal_test.cc

+    std::vector<uint8_t> ciphertext(expected_ciphertext_len, '\0');
+
+    int ciphertext_length =
+        encryptor.Encrypt(str2bytes(plain_text_), static_cast<int>(plain_text_.size()),


We can refactor this to use span in the follow up changes.

Yes I've just made #43142 for this, and I will follow up after this PR

cpp/src/parquet/encryption/encryption_internal.h

wgtmac · 2024-07-03T14:59:56Z

cpp/src/parquet/encryption/encryption_internal.cc

  } else {
-    if (ciphertext_len == 0) {
-      throw ParquetException("Zero ciphertext length");
+    if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) {


Suggested change

-    if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) {
+    if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int32_t>::max())) {

wgtmac · 2024-07-03T15:00:59Z

cpp/src/parquet/encryption/encryption_internal.cc

-                                               int aad_len, uint8_t* plaintext) {
-  int len;
-  int plaintext_len;
+int AesDecryptor::PlaintextLength(int ciphertext_len) const {


Should we replace int with int32_t to be more portable? Same for functions below. Of course this can be a followup change.

Yes it would make sense to do this as a follow up to avoid changing too much in this PR, I've made #43141 for this

wgtmac · 2024-07-03T15:15:35Z

cpp/src/parquet/encryption/encryption_internal.cc

  if (length_buffer_length_ > 0) {
+    if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) {


It took me a while to understand this line. Perhaps it is better to explicitly use kBufferSizeLength here as line 484 to 486 have assumed this length is 4 bytes.

Yes good point, that also confused me a bit. I've changed that and added a comment to hopefully make this more readable

wgtmac · 2024-07-03T15:16:17Z

cpp/src/parquet/encryption/encryption_internal.cc

+    if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) {
+      std::stringstream ss;
+      ss << "Ciphertext buffer length " << ciphertext.size()
+         << " is insufficient to read the ciphertext length";


nit: include the length (4 bytes) in the error message.

wgtmac

+1

Thanks!

wgtmac · 2024-07-04T05:19:38Z

CI failures are unrelated. I will merge it shortly.

wgtmac · 2024-07-04T05:30:04Z

@raulcd Please feel free to port it to maint-17.0.0

mapleFU · 2024-07-04T05:31:00Z

(Gang says he hit an API error when running the merge script, so I merged this, lol)

raulcd · 2024-07-04T08:53:30Z

Thanks both for jumping in on the review and thanks @adamreeve for the PR

… segfault (#43071) ### Rationale for this change See #43070 ### What changes are included in this PR? Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type. Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer. ### Are these changes tested? Yes I've added new unit tests for this. ### Are there any user-facing changes? No * GitHub Issue: #43070 Authored-by: Adam Reeve <[email protected]> Signed-off-by: mwish <[email protected]>

conbench-apache-arrow · 2024-07-04T13:22:02Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit eadeb74.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them.


pitrou

Belated review, sorry.

pitrou · 2024-07-09T15:48:52Z

cpp/src/parquet/encryption/encryption_internal.cc

+      ss << "Negative plaintext length " << plaintext_len;
+      throw ParquetException(ss.str());
+    }
+    return plaintext_len + ciphertext_size_delta_;


We should ideally check for signed addition overflow here.

Good point. Is it ok to add this to the changes in #43195, or should I make a separate PR for these follow ups?

IMHO you can add them to the aforementioned PR.

I've added this check in fb63bcc
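One way to guard the addition discussed in this thread is to compare against the maximum before adding. This is a sketch under the assumption that the size delta is non-negative; it is not necessarily what commit fb63bcc does, and the free-function form of `CiphertextLength` is hypothetical.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical free-function version of a CiphertextLength conversion.
// Assumes size_delta >= 0 (nonce, tag, and optional length prefix).
inline int32_t CiphertextLength(int64_t plaintext_len, int32_t size_delta) {
  if (plaintext_len < 0) {
    throw std::invalid_argument("Negative plaintext length");
  }
  // Rearranged so the comparison itself cannot overflow: instead of testing
  // plaintext_len + size_delta > max, test plaintext_len > max - size_delta.
  if (plaintext_len > std::numeric_limits<int32_t>::max() - size_delta) {
    throw std::overflow_error("Ciphertext length overflows int32_t");
  }
  return static_cast<int32_t>(plaintext_len + size_delta);
}
```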

pitrou · 2024-07-09T15:51:09Z

cpp/src/parquet/encryption/encryption_internal.cc

-    if (ciphertext_len > 0 &&
-        ciphertext_len != (written_ciphertext_len + length_buffer_length_)) {
-      throw ParquetException("Wrong ciphertext length");
+    if (written_ciphertext_len < 0) {


I wonder if the compiler won't elide this, as ((ciphertext[3] & 0xff) << 24) becoming negative implies signed integer overflow which is undefined behavior. @felipecrv What do you think?

This check is weird. written_ciphertext_len should be uint32_t, then checking that written_ciphertext_len is not greater than std::numeric_limits<int32_t>::max() is enough. In other words: the encoded value is never negative, but it can be too big.

ciphertext[3] & 0xff is weird as well, since ciphertext[3] is a uint8, thus in [0,255] already.
Perhaps the proper check should simply be ciphertext[3] < 0x80u?

Hmm yeah the & 0xffs are all redundant, maybe it ended up written this way to be similar to the code for writing the length. Rather than just checking the length isn't negative, we should also check that written_ciphertext_len + length_buffer_length_ doesn't overflow, so I think it would probably be simplest to read as uint32_t.

I've changed this check as part of #43195, in this commit: b81cafe
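The approach suggested in this thread, decoding the prefix as an unsigned value and only then range-checking it, might look like the sketch below. `ReadWrittenLength` is a hypothetical name, and the little-endian byte order is inferred from the `ciphertext[3] << 24` expression quoted above; this is not the code from commit b81cafe.

```cpp
#include <array>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: decode a 4-byte little-endian length prefix.
inline int32_t ReadWrittenLength(const std::array<uint8_t, 4>& bytes) {
  // Cast each byte to uint32_t before shifting, so the shift by 24 happens on
  // an unsigned type and can never be a signed overflow.
  const uint32_t value = static_cast<uint32_t>(bytes[0]) |
                         (static_cast<uint32_t>(bytes[1]) << 8) |
                         (static_cast<uint32_t>(bytes[2]) << 16) |
                         (static_cast<uint32_t>(bytes[3]) << 24);
  // The encoded value is never negative, but it can be too big.
  if (value > static_cast<uint32_t>(std::numeric_limits<int32_t>::max())) {
    throw std::runtime_error("Written ciphertext length is too large");
  }
  return static_cast<int32_t>(value);
}
```

Because the range check is on an unsigned value, there is no "is it negative?" test for the compiler to reason about.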

pitrou · 2024-07-09T15:53:21Z

cpp/src/parquet/encryption/encryption_internal.cc

+  uint8_t tag[kGcmTagLength];
+  memset(tag, 0, kGcmTagLength);


Or more idiomatically:

std::array<uint8_t, kGcmTagLength> tag{};

I've changed this as well as the other uses of C style arrays in ff31c7b
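For reference, a small self-contained sketch of the suggested idiom: the empty braces value-initialize the array, so every byte is zero without a separate memset. The tag length of 16 is an assumed value and `MakeZeroedTag` is a hypothetical helper.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr size_t kGcmTagLength = 16;  // assumed tag size for this sketch

// Returns a tag buffer with all bytes set to zero by value-initialization.
inline std::array<uint8_t, kGcmTagLength> MakeZeroedTag() {
  std::array<uint8_t, kGcmTagLength> tag{};
  return tag;
}
```

Unlike a raw C array, the std::array also carries its size (`tag.size()`) and offers `tag.data()` when a raw pointer is needed for a C API.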

pitrou · 2024-07-09T15:55:54Z

cpp/src/parquet/thrift_internal.h

-          AllocateBuffer(decryptor->pool(),
-                         static_cast<int64_t>(clen - decryptor->CiphertextSizeDelta())));
-      const uint8_t* cipher_buf = buf;
+      auto decrypted_buffer = std::static_pointer_cast<ResizableBuffer>(AllocateBuffer(


Unrelated, but since we're not attempting to resize the buffer, the cast to ResizableBuffer should be superfluous?

AllocateBuffer here is ::parquet::AllocateBuffer which returns a std::shared_ptr<ResizableBuffer>, rather than ::arrow::AllocateBuffer, so it looks like the cast would be unnecessary even if we were resizing it:

arrow/cpp/src/parquet/platform.cc

Lines 36 to 37 in 031497d

std::shared_ptr<ResizableBuffer> AllocateBuffer(MemoryPool* pool, int64_t size) {
  PARQUET_ASSIGN_OR_THROW(auto result, ::arrow::AllocateResizableBuffer(size, pool));

Hmm, I see. We can try removing it at some point then, as the code currently looks weird.

I've removed this as well as another occurrence of the same behaviour in f4a7721
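The point about the redundant cast can be seen in a toy model. The `Buffer` and `ResizableBuffer` classes and the single-argument `AllocateBuffer` below are simplified, hypothetical stand-ins, not the Arrow or Parquet types.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Simplified stand-ins for the real buffer classes.
struct Buffer {
  virtual ~Buffer() = default;
  virtual int64_t size() const = 0;
};

class ResizableBuffer : public Buffer {
 public:
  explicit ResizableBuffer(int64_t size) : data_(static_cast<size_t>(size)) {}
  int64_t size() const override { return static_cast<int64_t>(data_.size()); }
  void Resize(int64_t new_size) { data_.resize(static_cast<size_t>(new_size)); }

 private:
  std::vector<uint8_t> data_;
};

// Like the helper quoted above, this factory already returns the derived
// pointer type, so callers gain nothing from std::static_pointer_cast.
inline std::shared_ptr<ResizableBuffer> AllocateBuffer(int64_t size) {
  return std::make_shared<ResizableBuffer>(size);
}
```

Here `auto buf = AllocateBuffer(8);` and `std::static_pointer_cast<ResizableBuffer>(AllocateBuffer(8))` yield the same type, which is why the cast can simply be dropped.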

…pan instead of raw pointers (#43195) ### Rationale for this change See #43142. This is a follow up to #43071 which refactored the Decryptor API and added extra checks to prevent segfaults. This PR makes similar changes to the Encryptor API for consistency and better maintainability. ### What changes are included in this PR? * Change `AesEncryptor::Encrypt` and `Encryptor::Encrypt` to use `arrow::util::span` instead of raw pointers * Replace the `AesEncryptor::CiphertextSizeDelta` method with a `CiphertextLength` method that checks for overflow and abstracts the size difference behaviour away from consumer code for improved readability. ### Are these changes tested? * This is mostly a refactoring of existing code so is covered by existing tests. ### Are there any user-facing changes? No * GitHub Issue: #43142 Lead-authored-by: Adam Reeve <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>

adamreeve added 3 commits June 27, 2024 11:24

Add tests for AES encrytion round trip

6707dd0

Add tests to reproduce segfaults

39b1465

Check for invalid read lengths to prevent segfaults

b88ea0a

adamreeve requested a review from wgtmac as a code owner June 26, 2024 23:53

github-actions bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Jun 26, 2024

pitrou reviewed Jun 27, 2024


github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 27, 2024

pitrou added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Jun 27, 2024

adamreeve added 11 commits June 30, 2024 21:01

Move test file to encryption-test target

e74a7f5

Export AesEncryptor/Decryptor classes for unit testing

ccaa231

Fix incorrect parameter names in method implementation

e09fb7a

Make ciphertext_len required and mean the buffer size

a6ba7fe

Fix some tests being incorrectly skipped

fb64cbe

Fix signed/unsigned mismatch error

05f7907

Add tests for passing a ciphertext length that is too small

5f9b6ef

Replace CiphertextSizeDelta with PlaintextLength and CiphertextLength methods

125a4b2

Fix C++ lint error

0473fec

Fix NoSSL version of AesDecryptor

d91d092

Use arrow::util::span in Decryptor API

d36ede6

mapleFU reviewed Jul 3, 2024


wgtmac reviewed Jul 3, 2024


adamreeve added 4 commits July 4, 2024 13:22

Add using ::arrow::util::span

1e249c0

Check for negative plaintext length

4bf5ee0

Tidy up checking ciphertext length against length buffer length

2c312d2

Fix incorrect use of int instead of int32

a39f535

adamreeve mentioned this pull request Jul 4, 2024

[C++][Parquet] Refactor Encryptor API to use arrow::util::span instead of raw pointers #43142

Closed

wgtmac approved these changes Jul 4, 2024


mapleFU merged commit eadeb74 into apache:main Jul 4, 2024
34 of 40 checks passed

mapleFU removed the awaiting committer review Awaiting committer review label Jul 4, 2024

mapleFU mentioned this pull request Jul 4, 2024

[C++][Parquet] Reading corrupted encrypted Parquet files can cause a segfault #43070

Closed

adamreeve deleted the decrypt-segfault-fix branch July 4, 2024 05:37

adamreeve mentioned this pull request Jul 9, 2024

GH-43142: [C++][Parquet] Refactor Encryptor API to use arrow::util::span instead of raw pointers #43195

Merged

pitrou reviewed Jul 9, 2024


github-actions bot added awaiting committer review Awaiting committer review awaiting review Awaiting review awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Jul 9, 2024
