enable ASAN/UBSAN in pandas CI #55102

Merged
merged 57 commits into from
Dec 21, 2023

WillAyd
Member

@WillAyd WillAyd commented Sep 11, 2023

@WillAyd WillAyd requested a review from mroeschke as a code owner September 11, 2023 21:11
@@ -25,8 +25,8 @@ runs:
     - name: Build Pandas
       run: |
         if [[ ${{ inputs.editable }} == "true" ]]; then
-          pip install -e . --no-build-isolation -v
+          pip install -e . --no-build-isolation -v --config-settings=setup-args="-Db_sanitize=address,undefined"
Member Author

Probably don't want to hard-code this. Is there a way with GHA to only do this for certain action invocations, @lithomas1?

Member

You can define an input above in the workflow and pass a variable from the job when you want to enable these flags

Member

I can push to your branch if you need any help with this, but it should be as Matt stated.

Member Author

Feel free to push
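
The suggestion in this thread (add an input and only enable the sanitizer flags when a job asks for them) can be sketched outside of GitHub Actions as a small Python helper. This is only an illustration under stated assumptions: `editable_install_command` and its `sanitize` parameter are hypothetical names, not part of the pandas build scripts, and only the editable-install command shown in the diff above is reproduced.

```python
import shlex

# Meson setup argument taken from the diff in this PR.
SANITIZE_SETUP_ARGS = "-Db_sanitize=address,undefined"


def editable_install_command(sanitize: bool) -> list[str]:
    """Return the argv for an editable build, adding sanitizer flags on request.

    Hypothetical helper: `sanitize` plays the role of the action input
    discussed in the thread.
    """
    cmd = ["pip", "install", "-e", ".", "--no-build-isolation", "-v"]
    if sanitize:
        # In argv form the shell quotes from the YAML are not needed.
        cmd.append(f"--config-settings=setup-args={SANITIZE_SETUP_ARGS}")
    return cmd


if __name__ == "__main__":
    print(shlex.join(editable_install_command(sanitize=False)))
    print(shlex.join(editable_install_command(sanitize=True)))
```

The point of the design is that the default invocation stays unchanged and only callers that opt in get the instrumented build.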

@@ -157,19 +157,25 @@ jobs:
- name: Build Pandas
id: build
uses: ./.github/actions/build_pandas
env:
CFLAGS: "$CFLAGS -fno-sanitize-recover=all"
Member Author

Meson has the config option b_sanitize=address,undefined, but I don't think it sets this. UBSAN errors are non-fatal by default, so without this flag they just get printed to stderr and pytest still continues.

Looks like NumPy does something similar with halt_on_error=1, but that didn't seem to stop pytest from continuing as I tried this locally

https://github.com/numpy/numpy/pull/24208/files
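
The failure mode described here (a report on stderr while the process still exits successfully) can be reproduced without a sanitized build. The sketch below is an assumption-laden stand-in: the child process only simulates a UBSAN-style message, shortened from the log quoted later in this conversation, and `has_sanitizer_report` is a hypothetical helper.

```python
import subprocess
import sys

UBSAN_MARKER = "runtime error:"

# Stand-in for an instrumented program: it writes a UBSAN-style report to
# stderr and then exits with status 0, as a recoverable UBSAN error would.
CHILD_CODE = r'''import sys
sys.stderr.write("objToJSON.c:2066:3: runtime error: signed integer overflow\n")
'''


def run_child() -> subprocess.CompletedProcess:
    """Run the simulated program and capture its output as text."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
    )


def has_sanitizer_report(stderr_text: str) -> bool:
    """Return True if stderr contains a UBSAN-style report line."""
    return UBSAN_MARKER in stderr_text


if __name__ == "__main__":
    result = run_child()
    # The exit status alone says "success"; only stderr reveals the problem.
    print("exit status:", result.returncode)
    print("sanitizer report seen:", has_sanitizer_report(result.stderr))
```

A check based only on the exit status would pass here, which is why the thread reaches for `-fno-sanitize-recover=all` to turn such reports into hard failures.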

@lithomas1 lithomas1 added the Build Library building on various platforms label Sep 11, 2023
@@ -154,22 +154,36 @@ jobs:
with:
environment-file: ci/deps/${{ matrix.env_file }}

- name: Set sanitizer flags
run: |
echo "CFLAGS=$CFLAGS -fno-sanitize-recover=all" >> "$GITHUB_ENV"
Member Author

Having trouble setting this with GHA. There may also be a way to pass the flag directly through meson python? @lithomas1 any idea?

- name: Build Pandas
id: build
uses: ./.github/actions/build_pandas
env:
CFLAGS: "$CFLAGS"
Member

@lithomas1 lithomas1 Sep 12, 2023

IIRC, this doesn't set CFLAGS inside of the action. It just makes CFLAGS available under env.CFLAGS.

Why don't you try setting CFLAGS in action.yml directly if sanitize = true?

Member Author

Ah great idea - much cleaner
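
The idea agreed on here is to compute `CFLAGS` at the place where the build actually runs instead of relying on the caller's environment. A minimal Python sketch of that pattern follows; `build_env` and its `sanitize` parameter are hypothetical names, and the only value taken from the PR is the `-fno-sanitize-recover=all` flag.

```python
import os
from typing import Mapping

# Flag quoted in the workflow diff above.
NO_RECOVER_FLAG = "-fno-sanitize-recover=all"


def build_env(base_env: Mapping[str, str], sanitize: bool) -> dict[str, str]:
    """Return a copy of base_env, appending the no-recover flag to CFLAGS.

    Hypothetical helper: existing CFLAGS are preserved and the input
    mapping is never modified.
    """
    env = dict(base_env)
    if sanitize:
        existing = env.get("CFLAGS", "")
        env["CFLAGS"] = f"{existing} {NO_RECOVER_FLAG}".strip()
    return env


if __name__ == "__main__":
    env = build_env(os.environ, sanitize=True)
    print(env["CFLAGS"])
```

The resulting mapping could then be passed as `env=` to `subprocess.run` for the build step, so the flag is guaranteed to reach the compiler.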

@WillAyd
Member Author

WillAyd commented Sep 12, 2023

Somewhat working now. Guessing we need a way for pytest-xdist to fail and signal the test it failed on. Right now looks like a worker just crashed

@rgommers
Contributor

This looks pretty good to me. Let me Cc @ngoldbaum, who implemented the NumPy CI job and has more experience with these sanitizers than I have.

@ngoldbaum
Contributor

Guessing we need a way for pytest-xdist to fail and signal the test it failed on.

Ah that's probably why the halt_on_error=1 didn't work, I wasn't running with pytest-xdist locally when I was setting up the numpy config.

@mroeschke
Member

Somewhat working now. Guessing we need a way for pytest-xdist to fail and signal the test it failed on. Right now looks like a worker just crashed

IMO I would add 1 new testing job in the matrix (e.g. one that uses a 3.11 dependency file) that runs the tests with sanitize=True and pytest-xdist with -n 0

@WillAyd
Member Author

WillAyd commented Sep 12, 2023

In that case are you still planning to run against the entire test base or a subset of modules? I think removing multiple workers would slow down our CI a good deal? But maybe this gets to a state where it only runs when C/Cython files are touched?

@WillAyd
Member Author

WillAyd commented Sep 12, 2023

Worth noting I tried -n 0 locally a few times and it didn't make a difference. Not sure if the mere installation of pytest-xdist changes that. Needs further investigation

@lithomas1
Member

In that case are you still planning to run against the entire test base or a subset of modules? I think removing multiple workers would slow down our CI a good deal? But maybe this gets to a state where it only runs when C/Cython files are touched?

We should do a minimal run, kind of like the npdev situation. So only us, numpy, and arrow installed.
Maybe that helps with the runtime?

@mroeschke
Member

Since the GHA Ubuntu runners only have 2 cores, I don't think running the entire test suite (even with all the dependencies) with -n 0 will be that significant, though I don't know the impact that will have on the debugger.

FWIW that's how the Windows tests run currently and they take 15ish minutes longer than the non xdist runs

@WillAyd
Member Author

WillAyd commented Sep 12, 2023

though I don't know the impact that will have on the debugger.

In theory the average runtime of ASAN would be 2x (see https://github.com/google/sanitizers/wiki/AddressSanitizer), though since we are not detecting leaks but also adding UBSAN I'm not sure how that all evens out

@WillAyd
Member Author

WillAyd commented Sep 14, 2023

A lot of the datetime stuff in this PR is hacked together just to appease UBSAN, but there are definitely quite a few code paths where datetime conversions can lead to undefined behavior.

The current ASAN failure looks like it comes from matplotlib, so @lithomas1 is probably right in that we need to pare this down to a smaller set of packages that we know can be clean
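
To make the datetime concern concrete: converting a count of days to nanoseconds multiplies by a large constant, and in C that product can exceed what a signed 64-bit integer holds, which is undefined behavior. The sketch below shows the usual guard (check the range before trusting the result) in Python. It is not the code from this PR; `checked_mul_int64` and `days_to_ns` are hypothetical names used only for illustration.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NS_PER_DAY = 86_400 * 1_000_000_000  # seconds per day times ns per second


def checked_mul_int64(a: int, b: int) -> int:
    """Multiply two integers, raising if the result would not fit in int64.

    Python integers are arbitrary precision, so the product itself is exact;
    the range check stands in for the overflow check a C implementation
    would need to perform before (or instead of) the raw multiplication.
    """
    result = a * b
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError(f"{a} * {b} does not fit in a signed 64-bit integer")
    return result


def days_to_ns(days: int) -> int:
    """Convert whole days to nanoseconds with an int64 range check."""
    return checked_mul_int64(days, NS_PER_DAY)


if __name__ == "__main__":
    print(days_to_ns(1))
    try:
        days_to_ns(106_752)
    except OverflowError as exc:
        print("rejected:", exc)
```

With nanosecond resolution the limit sits between 106,751 and 106,752 days, so inputs past that point have to be rejected rather than multiplied.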

@mroeschke
Member

Gotcha. Or rather, if we introduce undefined behavior and address violations, this job will hopefully fail, correct? Just want to ensure there's a definitive job failure -> rectification -> job success path for this job

@WillAyd
Member Author

WillAyd commented Dec 19, 2023

Yes exactly - this will fail when either of those are detected

@WillAyd
Member Author

WillAyd commented Dec 19, 2023

The error messaging you see in CI is something that could be improved. It just "fails" right now but that feedback gets lost along the way from the crashed process. I think that can be tackled in a follow up

@WillAyd
Member Author

WillAyd commented Dec 19, 2023

This is what happens today if either of these pops up:

https://github.com/pandas-dev/pandas/actions/runs/7066657149/job/19241456914#step:8:61

@mroeschke
Member

This is what happens today if either of these pops up:

https://github.com/pandas-dev/pandas/actions/runs/7066657149/job/19241456914#step:8:61

Ah OK. Could you at least include a test_args: "-v" in the job configuration in unit-test.yml? At least then the last test should be printed before the job fails so when it fails we don't have to rerun to figure out where this fails

@WillAyd
Member Author

WillAyd commented Dec 19, 2023

OK sure. Here is what that looks like:

https://github.com/pandas-dev/pandas/actions/runs/7267349987/job/19800967532?pr=55102#step:8:18053

Ends up being too much for GHA to show but if you go to the raw logs you will see the error:

2023-12-19T21:11:17.6390164Z ../../pandas/_libs/src/vendored/ujson/python/objToJSON.c:2066:3: runtime error: signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'
2023-12-19T21:12:44.8157270Z pandas/tests/io/test_common.py::TestCommonIOCapabilities::test_write_missing_parent_directory[to_json-os-OSError-json] 
2023-12-19T21:12:44.8158122Z [gw3] node down: Not properly terminated
2023-12-19T21:12:44.8158390Z 
2023-12-19T21:12:44.8158505Z replacing crashed worker gw3
2023-12-19T21:12:44.8158882Z INTERNALERROR> Traceback (most recent call last):
2023-12-19T21:12:44.8159868Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/_pytest/main.py", line 271, in wrap_session
2023-12-19T21:12:44.8160796Z INTERNALERROR>     session.exitstatus = doit(config, session) or 0
2023-12-19T21:12:44.8161413Z INTERNALERROR>                          ^^^^^^^^^^^^^^^^^^^^^
2023-12-19T21:12:44.8162455Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/_pytest/main.py", line 325, in _main
2023-12-19T21:12:44.8163512Z INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
2023-12-19T21:12:44.8164538Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/pluggy/_hooks.py", line 493, in __call__
2023-12-19T21:12:44.8165536Z INTERNALERROR>     return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
2023-12-19T21:12:44.8166476Z INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-12-19T21:12:44.8167481Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/pluggy/_manager.py", line 115, in _hookexec
2023-12-19T21:12:44.8168482Z INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
2023-12-19T21:12:44.8169138Z INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-12-19T21:12:44.8170131Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 152, in _multicall
2023-12-19T21:12:44.8170977Z INTERNALERROR>     return outcome.get_result()
2023-12-19T21:12:44.8171378Z INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^
2023-12-19T21:12:44.8172287Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/pluggy/_result.py", line 114, in get_result
2023-12-19T21:12:44.8173179Z INTERNALERROR>     raise exc.with_traceback(exc.__traceback__)
2023-12-19T21:12:44.8174405Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/pluggy/_callers.py", line 77, in _multicall
2023-12-19T21:12:44.8175220Z INTERNALERROR>     res = hook_impl.function(*args)
2023-12-19T21:12:44.8175633Z INTERNALERROR>           ^^^^^^^^^^^^^^^^^^^^^^^^^
2023-12-19T21:12:44.8176555Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/xdist/dsession.py", line 123, in pytest_runtestloop
2023-12-19T21:12:44.8177342Z INTERNALERROR>     self.loop_once()
2023-12-19T21:12:44.8178296Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/xdist/dsession.py", line 148, in loop_once
2023-12-19T21:12:44.8179020Z INTERNALERROR>     call(**kwargs)
2023-12-19T21:12:44.8179889Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/xdist/dsession.py", line 273, in worker_collectionfinish
2023-12-19T21:12:44.8180703Z INTERNALERROR>     self.sched.schedule()
2023-12-19T21:12:44.8181588Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/xdist/scheduler/loadscope.py", line 339, in schedule
2023-12-19T21:12:44.8182381Z INTERNALERROR>     self._reschedule(node)
2023-12-19T21:12:44.8183278Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/xdist/scheduler/loadscope.py", line 321, in _reschedule
2023-12-19T21:12:44.8184099Z INTERNALERROR>     self._assign_work_unit(node)
2023-12-19T21:12:44.8185057Z INTERNALERROR>   File "/home/runner/micromamba/envs/test/lib/python3.11/site-packages/xdist/scheduler/loadscope.py", line 259, in _assign_work_unit
2023-12-19T21:12:44.8186005Z INTERNALERROR>     worker_collection = self.registered_collections[node]
2023-12-19T21:12:44.8186635Z INTERNALERROR>                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
2023-12-19T21:12:44.8187150Z INTERNALERROR> KeyError: <WorkerController gw7>
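
Since the sanitizer message is buried in the raw log, one way to surface it is to scan the log text for UBSAN-style `file:line:column: runtime error:` lines. The following is a rough sketch, assuming the line shape seen in the log above; `SanitizerReport` and `find_reports` are hypothetical names, and the sample text is copied from two of the log lines quoted in this comment.

```python
import re
from typing import NamedTuple


class SanitizerReport(NamedTuple):
    path: str
    line: int
    column: int
    message: str


# Matches "<path>:<line>:<column>: runtime error: <message>" anywhere in a
# line, so a leading CI timestamp does not get in the way.
REPORT_RE = re.compile(
    r"(?P<path>\S+):(?P<line>\d+):(?P<column>\d+): runtime error: (?P<message>.+)"
)


def find_reports(log_text: str) -> list[SanitizerReport]:
    """Collect every UBSAN-style report found in the given log text."""
    reports = []
    for raw_line in log_text.splitlines():
        match = REPORT_RE.search(raw_line)
        if match:
            reports.append(
                SanitizerReport(
                    path=match["path"],
                    line=int(match["line"]),
                    column=int(match["column"]),
                    message=match["message"].strip(),
                )
            )
    return reports


SAMPLE_LOG = """\
2023-12-19T21:11:17.6390164Z ../../pandas/_libs/src/vendored/ujson/python/objToJSON.c:2066:3: runtime error: signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'
2023-12-19T21:12:44.8158122Z [gw3] node down: Not properly terminated
"""

if __name__ == "__main__":
    for report in find_reports(SAMPLE_LOG):
        print(f"{report.path}:{report.line}:{report.column} -> {report.message}")
```

This only locates the C source position; mapping it back to a test name still depends on the verbose pytest output discussed in the thread.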

@mroeschke
Member

Great thanks!

@lithomas1
Member

This is what happens today if either of these pops up:
https://github.com/pandas-dev/pandas/actions/runs/7066657149/job/19241456914#step:8:61

Ah OK. Could you at least include a test_args: "-v" in the job configuration in unit-test.yml? At least then the last test should be printed before the job fails so when it fails we don't have to rerun to figure out where this fails

Hm, I wonder if there's a better way to do this.
(My main worry is that this makes the logs very hard to scroll through, e.g. if I had to look through the logs because of a flaky non-ASAN related test)

Do you know if e.g. something like PYTEST_CURRENT_TEST might help?
https://docs.pytest.org/en/7.1.x/example/simple.html#pytest-current-test-env
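
For reference, pytest sets `PYTEST_CURRENT_TEST` to the test node ID followed by the current stage in parentheses (`setup`, `call`, or `teardown`). A small sketch of reading it is shown below; `current_test` is a hypothetical helper, and whether the variable can be recovered after a sanitizer kills the process is exactly the open question raised here, since it lives in the environment of the process that crashed.

```python
import os
from typing import Mapping, Optional, Tuple


def current_test(env: Mapping[str, str] = os.environ) -> Optional[Tuple[str, str]]:
    """Return (nodeid, stage) from PYTEST_CURRENT_TEST, or None if unset.

    The value has the form "<nodeid> (<stage>)"; splitting on the last space
    keeps node IDs that contain spaces in their parameters intact.
    """
    value = env.get("PYTEST_CURRENT_TEST", "")
    if not value:
        return None
    nodeid, sep, stage = value.rpartition(" ")
    if not sep:
        # Unexpected shape: return the raw value with an empty stage.
        return value, ""
    return nodeid, stage.strip("()")


if __name__ == "__main__":
    print(current_test())
```

Something outside the test process (for example a monitoring script) would have to read the variable for it to help with a hard crash.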

@WillAyd
Member Author

WillAyd commented Dec 19, 2023

Is there anything particular to CI that we know of that does not redirect stderr from pytest-xdist to the logs? If you run things locally you get the error and a stacktrace. While that isn't 1-to-1 to the test name it is pretty helpful to figure out what is going on so might be the best medium

@mroeschke
Member

Is there anything particular to CI that we know of that does not redirect stderr from pytest-xdist to the logs?

If I understand your question, this is a pytest xdist limitation: https://pytest-xdist.readthedocs.io/en/stable/known-limitations.html#output-stdout-and-stderr-from-workers

I know the -v isn't elegant, but I would prefer to have a way to narrow down what is causing the failure before merging this in.
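
One alternative to `-v`, not tried in this PR, would be to record each test's node ID in a file as it starts, so that the last line identifies the test that was running when a worker died. The sketch below is an assumption-heavy illustration: `record_test_start`, the `LAST_TEST_LOG_DIR` variable, and the file naming are all hypothetical, while `pytest_runtest_logstart(nodeid, location)` is a standard pytest hook that could live in a `conftest.py`.

```python
import os
from pathlib import Path

# Hypothetical setting: directory that receives one log file per process.
LOG_DIR = Path(os.environ.get("LAST_TEST_LOG_DIR", "."))


def record_test_start(nodeid: str, log_dir: Path = LOG_DIR) -> Path:
    """Append nodeid to a per-process file and force it to disk.

    Flushing and fsync-ing matters because the process may be killed by the
    sanitizer before normal buffers are written out.
    """
    path = log_dir / f"last-test-{os.getpid()}.log"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(nodeid + "\n")
        fh.flush()
        os.fsync(fh.fileno())
    return path


def pytest_runtest_logstart(nodeid, location):
    """pytest hook: called at the start of each test's run protocol."""
    # `location` is part of the hook signature but is not needed here.
    record_test_start(nodeid)
```

Because the file name includes the process ID, each pytest-xdist worker would write its own file, sidestepping the stdout/stderr limitation linked above.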

@lithomas1
Member

I know the -v isn't elegant, but I would prefer to have a way to narrow down what is causing the failure before merging this in.

I agree. I don't mean to block this PR, but maybe we should look into the reporting a little more?

At the very least, I think I could settle for a solution where we get pytest to print the filenames of the tests.
We could have this if we turned off pytest-xdist.
(there seems to be an issue with pytest-xdist where it swallows the filenames.
pytest-dev/pytest-xdist#450)

Would this work?

@WillAyd
Member Author

WillAyd commented Dec 21, 2023

OK here is what @lithomas1's suggestion looks like:

https://github.com/pandas-dev/pandas/actions/runs/7281808581/job/19843041899?pr=55102#step:8:590

Out of the two options so far I would prefer to go that route. It looks like turning off pytest-xdist for the ASAN build had little to no effect on the overall runtime

@lithomas1
Member

OK here is what @lithomas1's suggestion looks like:

https://github.com/pandas-dev/pandas/actions/runs/7281808581/job/19843041899?pr=55102#step:8:590

Out of the two options so far I would prefer to go that route. It looks like turning off pytest-xdist for the ASAN build had little to no effect on the overall runtime

Ok, this looks correct to me, at a first glance. Just to double check, the failing test happens in test_common.py, right?
(and not in pandas/tests/extension/json/test_json.py)

This reverts commit 677da0e.
@WillAyd
Member Author

WillAyd commented Dec 21, 2023

Yea it does happen in test_common. I suppose the downside to this one is you don't get exactly the test that failed, but the first one I see failing locally is pandas/tests/io/test_common.py::TestCommonIOCapabilities::test_write_missing_parent_directory[to_json-os-OSError-json]

Member

@lithomas1 lithomas1 left a comment

LGTM (pending resolution of Matt's last comment)!

Excited to see this finally go in.

@mroeschke mroeschke merged commit 8f32ea5 into pandas-dev:main Dec 21, 2023
83 checks passed
@mroeschke
Member

Awesome! Thanks @WillAyd

cbpygit pushed a commit to cbpygit/pandas that referenced this pull request Jan 2, 2024
* enable ASAN/UBSAN in pandas CI

* try input

* try removing sanitize

* try no CFLAGS

* try GH string substituion

* change flags in build script

* quotes

* update script run

* single_cpu updates

* asan checks for datetime funcs

* try smaller config

* checkpoint

* bool fixup

* reverts

* known UB marker

* Finished marking tests with known UB

* dedicated CI job

* identifier fix

* fixes

* more test skip

* try quotes

* simplify ci

* try CFLAGS

* preload args

* skip single_cpu tests

* wording

* removed unneeded marker

* float set implementations

* Revert "float set implementations"

This reverts commit 6266422.

* change marker name

* dedicated actions file

* consolidated into matrix

* fixup

* typos

* fixups

* add qt?

* intentional UB with verbose

* disable pytest-xdist

* original issue

* remove UB

* Revert "remove UB"

This reverts commit 677da0e.

* merge fixup

* remove UB

---------

Co-authored-by: Thomas Li <[email protected]>
@WillAyd WillAyd deleted the pandas-asan branch January 2, 2024 19:50
Labels
Build Library building on various platforms
Development

Successfully merging this pull request may close these issues.

ENH: Setup ASAN in CI
5 participants