Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40633: [C++][Python] Fix ORC crash when file contains unknown timezone #45051

Merged
merged 5 commits into from
Dec 18, 2024

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Dec 17, 2024

Rationale for this change

If the timezone database is present on the system, but does not contain a timezone referenced in a ORC file, the ORC reader will crash with an uncaught C++ exception.

This can happen for example on Ubuntu 24.04 where some timezone aliases have been removed from the main tzdata package to a tzdata-legacy package. If tzdata-legacy is not installed, trying to read a ORC file that references e.g. the "US/Pacific" timezone would crash.

Here is a backtrace excerpt:

#12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116
#17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900

What changes are included in this PR?

Catch C++ exceptions when iterating ORC batches instead of letting them slip through.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Our ORC adapter would let a C++ exception slip through instead of converting it into a Status error.
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou pitrou changed the title [C++][Python] Fix ORC crash when file contains unknown timezone GH-40633: [C++][Python] Fix ORC crash when file contains unknown timezone Dec 17, 2024
@@ -140,6 +143,33 @@ def test_example_using_json(filename, datadir):
check_example_file(path, table, need_fix=True)


def test_unknown_timezone(datadir):
# Example file relies on the timezone "US/Pacific". It should gracefully
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this test checks an unrelated issue.

@pitrou pitrou marked this pull request as ready for review December 17, 2024 17:47
@pitrou pitrou requested a review from wgtmac as a code owner December 17, 2024 17:47
@pitrou
Copy link
Member Author

pitrou commented Dec 17, 2024

@github-actions crossbow submit wheelcp312*

@pitrou
Copy link
Member Author

pitrou commented Dec 17, 2024

@wgtmac I don't think the ORC adapter code is completely safe yet. But this is a step forward (and it adds tests for it).

This comment was marked as outdated.

@pitrou
Copy link
Member Author

pitrou commented Dec 17, 2024

@github-actions crossbow submit wheelcp312*

@apache apache deleted a comment from github-actions bot Dec 17, 2024
Copy link

Revision: 1251958

Submitted crossbow builds: ursacomputing/crossbow @ actions-cb6caa5686

Task Status
wheel-macos-monterey-cp312-cp312-amd64 GitHub Actions
wheel-macos-monterey-cp312-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp312-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-cp312-arm64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Dec 17, 2024
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! Sorry that I forgot to do this.

@kou kou merged commit 1e5d6e5 into apache:main Dec 18, 2024
39 checks passed
@kou kou removed the awaiting merge Awaiting merge label Dec 18, 2024
@pitrou pitrou deleted the orc-missing-timezone branch December 18, 2024 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants