
CI: Debug timeouts #55687

Closed
wants to merge 37 commits into from

Conversation

@mroeschke (Member) commented Oct 25, 2023

  • The python dev jobs look to be timing out
  • Sometimes, the ubuntu single cpu jobs are timing out

@jbrockmendel (Member)

The recently un-skipped pyarrow parser tests are a likely candidate

@jbrockmendel (Member)

xref #51409. I never figured out how to access PYTEST_CURRENT_TEST when a freeze/timeout occurs, but it seems related.
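
For reference, one way it could be surfaced is a signal handler in a conftest.py. This is only a sketch (not something tried in this PR), assuming the runner delivers SIGTERM when it cancels the job; it won't help if the hang is inside a C call that never returns:

# Sketch only (not from this PR): dump the in-progress test when the job is
# cancelled, assuming the runner sends SIGTERM. If the hang is inside a C call
# that never returns, a Python-level signal handler may never get to run.
import faulthandler
import os
import signal
import sys

def _dump_current_test(signum, frame):
    print("PYTEST_CURRENT_TEST =", os.environ.get("PYTEST_CURRENT_TEST"), file=sys.stderr)
    faulthandler.dump_traceback(file=sys.stderr)  # dumps stacks of all threads

signal.signal(signal.SIGTERM, _dump_current_test)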

@lithomas1 (Member) commented Oct 25, 2023

pyarrow isn't installed on the python dev tests.

I suspect NEP 50 (Matt, did you ever get your old PR working?) is the cause, or something else that changed on the latest numpy dev. Not sure why the actual numpydev build isn't failing, though (maybe a cache issue?).

You can see a huge amount of Fs (for failures) in the log before the job stops.

@mroeschke (Member, Author)

I suspect NEP 50 (Matt, did you ever get your old PR working?)

Ah, good point. No, I never addressed the NEP 50 failures, but that could be the issue, as other libraries are starting to see NEP 50 being activated: #48867 (comment)

@lithomas1 (Member)

Sorry for pushing to your branch.
I just wanted to try out legacy mode again, to see if it gets us back to green.
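
For context, a minimal sketch of forcing pre-NEP 50 "legacy" promotion; NPY_PROMOTION_STATE and np._get_promotion_state are provisional numpy-dev APIs, so this is only illustrative, not the exact change pushed here:

# Illustrative sketch only: force NumPy's legacy (pre-NEP 50) promotion rules.
# NPY_PROMOTION_STATE must be set before numpy is first imported; both it and
# _get_promotion_state are provisional APIs on numpy dev and may change.
import os
os.environ["NPY_PROMOTION_STATE"] = "legacy"

import numpy as np
print(np._get_promotion_state())  # should print "legacy" if the override took effect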

@mroeschke (Member, Author)

No problem, push as needed!

@mroeschke (Member, Author)

  • Sometimes, the ubuntu single cpu jobs are timing out

Log from a test that's timing out on the single CPU job:

2023-10-26T04:06:36.0274702Z pandas/tests/io/parser/usecols/test_parse_dates.py::test_usecols_with_parse_dates[pyarrow-usecols1] XFAIL
2023-10-26T05:30:32.0440504Z ##[error]The operation was canceled.

@jorisvandenbossche (Member)

Can anyone reproduce this locally?

@mroeschke (Member, Author)

Can anyone reproduce this locally?

I tried reproducing the Ubuntu 3.11 Single CPU failures locally on my macOS machine, but I couldn't reproduce the timeout. It appears to be timing out around the pyarrow CSV tests.

@mroeschke (Member, Author)

Example of the logs where the stall happens

2023-10-31T18:14:24.1476884Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/13/2019] 
2023-10-31T18:14:24.1481880Z         SETUP    F configure_tests
2023-10-31T18:14:24.1483449Z         SETUP    F all_parsers['pyarrow']
2023-10-31T18:14:24.1485985Z         SETUP    F pyarrow_xfail
2023-10-31T18:14:24.1490063Z         SETUP    F date_string['13/13/2019']
2023-10-31T18:14:24.4276588Z         pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/13/2019] (fixtures used: all_parsers, configure_tests, date_string, pyarrow_xfail, request)XFAIL
2023-10-31T18:14:24.4282882Z         TEARDOWN F date_string['13/13/2019']
2023-10-31T18:14:24.4284062Z         TEARDOWN F pyarrow_xfail
2023-10-31T18:14:24.4285135Z         TEARDOWN F all_parsers['pyarrow']
2023-10-31T18:14:24.4290439Z         TEARDOWN F configure_tests
2023-10-31T18:14:24.4298463Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/2019] 
2023-10-31T18:14:24.4304013Z         SETUP    F configure_tests
2023-10-31T18:14:24.4305717Z         SETUP    F all_parsers['pyarrow']
2023-10-31T18:14:24.4308194Z         SETUP    F pyarrow_xfail
2023-10-31T18:14:24.4312410Z         SETUP    F date_string['13/2019']
2023-10-31T18:14:24.7108009Z         pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/2019] (fixtures used: all_parsers, configure_tests, date_string, pyarrow_xfail, request)XFAIL
2023-10-31T18:14:24.7113099Z         TEARDOWN F date_string['13/2019']
2023-10-31T18:14:24.7114685Z         TEARDOWN F pyarrow_xfail
2023-10-31T18:14:24.7115815Z         TEARDOWN F all_parsers['pyarrow']
2023-10-31T18:14:24.7120264Z         TEARDOWN F configure_tests
2023-10-31T18:14:24.7130084Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-a3/11/2018] 
2023-10-31T18:14:24.7135875Z         SETUP    F configure_tests
2023-10-31T18:14:24.7137102Z         SETUP    F all_parsers['pyarrow']
2023-10-31T18:14:24.7139492Z         SETUP    F pyarrow_xfail
2023-10-31T18:14:24.7143338Z         SETUP    F date_string['a3/11/2018']
2023-10-31T19:48:18.0840443Z ##[error]The operation was canceled.

Interestingly, there seems to be a small delay between the end of the setup and the start of the test run.
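
If it helps to pin down where it stalls, a rough conftest.py watchdog sketch (illustrative only, not something in this PR) that dumps all thread stacks when a single test runs too long:

# Rough sketch (not part of this PR): arm a watchdog before each test that dumps
# all thread stacks if the test runs longer than 20 minutes. faulthandler uses a
# separate watchdog thread, so it still fires when the main thread is stuck in C.
import faulthandler

def pytest_runtest_setup(item):
    faulthandler.dump_traceback_later(timeout=20 * 60, exit=False)

def pytest_runtest_teardown(item, nextitem):
    faulthandler.cancel_dump_traceback_later()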

@lithomas1 (Member) commented Nov 3, 2023

Not sure about the status of this PR, but I'm fine with just skipping the hanging tests or marking all pyarrow CSV tests as single CPU, if we can't make the timeout go away.

@mroeschke (Member, Author)

Yeah I'm still trying to narrow down which test is hanging (so far there doesn't seem to be a consistent offender)

marking all pyarrow CSV tests as single CPU.

They are already marked as single cpu 😢

@mroeschke (Member, Author) commented Nov 8, 2023

I think the tests that are hanging are ones enabled to xfail in #55576 or #55637.

Might be a coincidence, but the two tests where I've observed a job getting stuck both hit this exception in Arrow: Empty CSV file or block

https://github.com/apache/arrow/blob/db19a358cdf2b550b741fc7574dea38a6e32261b/cpp/src/arrow/csv/parser.cc#L545

@jorisvandenbossche (Member)

FWIW, I think using xfail is also not necessarily the best approach for some of those pyarrow csv tests.

I think the tests are hanging are ones enabled to xfail in #55576 or #55637

The first PR also has a comment

# GH#43650: Some expected failures with the pyarrow engine can occasionally
# cause a deadlock instead, so we skip these instead of xfailing

that was removed. So it seems we experienced this in the past as well.
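
Something along these lines (an illustrative sketch, not the actual pandas fixture) would skip before the hanging call is ever made, whereas xfail still executes it:

import pytest

# Illustrative sketch only: skip, rather than xfail, pyarrow-engine cases that
# can deadlock, since an xfail-ed test still runs the call that hangs.
@pytest.fixture
def pyarrow_skip(all_parsers):
    # all_parsers is the pandas parser fixture from the logs above; its .engine
    # attribute is assumed here for illustration.
    if all_parsers.engine == "pyarrow":
        pytest.skip(reason="pyarrow engine can deadlock instead of failing (GH#43650)")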

@jorisvandenbossche (Member)

Based on your hint of hanging on reading an empty file, and the fact we run our tests in parallel, I think I might have a reproducer:

import pandas as pd
from io import StringIO
from concurrent.futures import ThreadPoolExecutor

data = "x,y,z"

def read_csv_pyarrow(i):
    try:
        pd.read_csv(StringIO(data), engine="pyarrow")
    except:
        pass
    print(i)
    return i

with ThreadPoolExecutor(4) as e:
    list(e.map(read_csv_pyarrow, range(4)))

gives:

# this case gives the empty CSV error
In [4]: pd.read_csv(StringIO(data), engine="pyarrow")
...
File ~/miniconda3/envs/dev311/lib/python3.11/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: CSV parse error: Empty CSV file or block: cannot infer number of columns

# first time it worked
In [6]: with ThreadPoolExecutor(4) as e:
   ...:     list(e.map(read_csv_pyarrow, range(4)))
   ...: 
1
0
3
2

# second time it hangs
In [7]: with ThreadPoolExecutor(4) as e:
   ...:     list(e.map(read_csv_pyarrow, range(4)))
   ...: 
0
^C^C^C^C^C^C^C   <-- hanging

Will report this to Arrow then!

@lithomas1 (Member)

Based on your hint of hanging on reading an empty file, and the fact we run our tests in parallel, I think I might have a reproducer: […]

Note that I have a sketchy _BytesIOWrapper class here that converts StringIO -> BytesIO, which might also be part of the problem.

pandas/pandas/io/common.py, lines 1093 to 1119 at 09ed69e:

class _BytesIOWrapper:
    # Wrapper that wraps a StringIO buffer and reads bytes from it
    # Created for compat with pyarrow read_csv
    def __init__(self, buffer: StringIO | TextIOBase, encoding: str = "utf-8") -> None:
        self.buffer = buffer
        self.encoding = encoding
        # Because a character can be represented by more than 1 byte,
        # it is possible that reading will produce more bytes than n
        # We store the extra bytes in this overflow variable, and append the
        # overflow to the front of the bytestring the next time reading is performed
        self.overflow = b""

    def __getattr__(self, attr: str):
        return getattr(self.buffer, attr)

    def read(self, n: int | None = -1) -> bytes:
        assert self.buffer is not None
        bytestring = self.buffer.read(n).encode(self.encoding)
        # When n=-1/n greater than remaining bytes: Read entire file/rest of file
        combined_bytestring = self.overflow + bytestring
        if n is None or n < 0 or n >= len(combined_bytestring):
            self.overflow = b""
            return combined_bytestring
        else:
            to_return = combined_bytestring[:n]
            self.overflow = combined_bytestring[n:]
            return to_return

@jorisvandenbossche (Member)

-> apache/arrow#38676

Note that I have a sketchy BytesIOWrapper class here that converts StringIO -> BytesIO that might also be part of the problem.

I adapted the reproducer to pure pyarrow using BytesIO, and that also hangs sometimes, so it seems to at least also be an issue on that side
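
Roughly like this (a sketch of what the adaptation could look like; the exact input isn't shown, so an empty buffer is assumed here since it hits the same "Empty CSV file or block" error path):

# Sketch of a pure-pyarrow adaptation (assumes an empty buffer triggers the same
# "Empty CSV file or block" error); not the exact script reported upstream.
from io import BytesIO
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from pyarrow import csv

def read_csv_arrow(i):
    try:
        csv.read_csv(BytesIO(b""))  # raises ArrowInvalid: Empty CSV file or block
    except pa.ArrowInvalid:
        pass
    return i

with ThreadPoolExecutor(4) as e:
    list(e.map(read_csv_arrow, range(4)))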

@WillAyd (Member) commented Nov 10, 2023

Here is some info from helgrind that might be of use:

==87545== Possible data race during write of size 4 at 0x642320 by thread #1
==87545== Locks held: none
==87545==    at 0x399780: PyMem_SetAllocator (obmalloc.c:544)
==87545==    by 0x398844: pymem_set_default_allocator (obmalloc.c:251)
==87545==    by 0x3E7C35: _PyRuntimeState_Fini (pystate.c:165)
==87545==    by 0x3C6B52: UnknownInlinedFun (pylifecycle.c:227)
==87545==    by 0x3C6B52: Py_FinalizeEx (pylifecycle.c:2027)
==87545==    by 0x3D2A4F: Py_RunMain (main.c:682)
==87545==    by 0x398006: Py_BytesMain (main.c:734)
==87545==    by 0x49A40CF: (below main) (libc_start_call_main.h:58)
==87545==  Address 0x642320 is 0 bytes inside data symbol "_PyMem_Raw"
==87545== 
==87545== ----------------------------------------------------------------
==87545== 
==87545==  Lock at 0x7E803A68 was first observed
==87545==    at 0x4852F6B: pthread_mutex_init (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==87545==    by 0x6CDD46EF: je_arrow_private_je_malloc_mutex_init (mutex.c:171)
==87545==    by 0x6CDE1A0A: je_arrow_private_je_ecache_init (ecache.c:9)
==87545==    by 0x6CDD626C: je_arrow_private_je_pac_init (pac.c:49)
==87545==    by 0x6CDD4B84: je_arrow_private_je_pa_shard_init (pa.c:43)
==87545==    by 0x6CD83A76: je_arrow_private_je_arena_new (arena.c:1648)
==87545==    by 0x6CD7C708: je_arrow_private_je_arena_choose_hard (jemalloc.c:582)
==87545==    by 0x6CDDF6BA: arena_choose_impl (jemalloc_internal_inlines_b.h:46)
==87545==    by 0x6CDDF6BA: arena_choose_impl (jemalloc_internal_inlines_b.h:32)
==87545==    by 0x6CDDF6BA: arena_choose (jemalloc_internal_inlines_b.h:88)
==87545==    by 0x6CDDF6BA: je_arrow_private_je_tsd_tcache_data_init (tcache.c:740)
==87545==    by 0x6CDDF967: je_arrow_private_je_tsd_tcache_enabled_data_init (tcache.c:644)
==87545==    by 0x6CDE12E1: tsd_data_init (tsd.c:244)
==87545==    by 0x6CDE12E1: je_arrow_private_je_tsd_fetch_slow (tsd.c:311)
==87545==    by 0x6CD7CE74: tsd_fetch_impl (tsd.h:422)
==87545==    by 0x6CD7CE74: tsd_fetch (tsd.h:448)
==87545==    by 0x6CD7CE74: imalloc (jemalloc.c:2681)
==87545==    by 0x6CD7CE74: je_arrow_mallocx (jemalloc.c:3424)
==87545==    by 0x6C0C346C: arrow::memory_pool::internal::JemallocAllocator::AllocateAligned(long, long, unsigned char**) (in /home/willayd/mambaforge/envs/pandas-dev/lib/libarrow.so.1400)
==87545==  Address 0x7e803a68 is in a rw- anonymous segment
==87545== 
==87545== Possible data race during write of size 8 at 0x7E806040 by thread #24
==87545== Locks held: 1, at address 0x7E803A68
==87545==    at 0x6CDC8891: atomic_store_zu (atomic.h:93)
==87545==    by 0x6CDC8891: je_arrow_private_je_eset_insert (eset.c:109)
==87545==    by 0x6CDCB1C8: extent_deactivate_locked_impl (extent.c:256)
==87545==    by 0x6CDCB1C8: extent_deactivate_locked (extent.c:263)
==87545==    by 0x6CDCB1C8: je_arrow_private_je_extent_record (extent.c:950)
==87545==    by 0x6CDD5909: pac_dalloc_impl (pac.c:277)
==87545==    by 0x6CD81CB2: je_arrow_private_je_arena_slab_dalloc (arena.c:570)
==87545==    by 0x6CDDD6E5: tcache_bin_flush_impl (tcache.c:477)
==87545==    by 0x6CDDD6E5: tcache_bin_flush_bottom (tcache.c:519)
==87545==    by 0x6CDDD6E5: je_arrow_private_je_tcache_bin_flush_small (tcache.c:529)
==87545==    by 0x6CDDE364: tcache_flush_cache (tcache.c:790)
==87545==    by 0x6CDDE9BE: tcache_destroy.constprop.0 (tcache.c:809)
==87545==    by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545==    by 0x4A1393F: start_thread (pthread_create.c:455)
==87545==    by 0x4AA42E3: clone (clone.S:100)
==87545==  Address 0x7e806040 is in a rw- anonymous segment
==87545== 
==87545== ----------------------------------------------------------------
==87545== 
==87545== Possible data race during read of size 1 at 0x6F41677C by thread #24
==87545== Locks held: none
==87545==    at 0x6CD81459: atomic_load_b (atomic.h:89)
==87545==    by 0x6CD81459: background_thread_indefinite_sleep (background_thread_inlines.h:45)
==87545==    by 0x6CD81459: arena_background_thread_inactivity_check (arena.c:207)
==87545==    by 0x6CD81459: je_arrow_private_je_arena_handle_deferred_work (arena.c:223)
==87545==    by 0x6CD81CE2: je_arrow_private_je_arena_slab_dalloc (arena.c:572)
==87545==    by 0x6CDDD6E5: tcache_bin_flush_impl (tcache.c:477)
==87545==    by 0x6CDDD6E5: tcache_bin_flush_bottom (tcache.c:519)
==87545==    by 0x6CDDD6E5: je_arrow_private_je_tcache_bin_flush_small (tcache.c:529)
==87545==    by 0x6CDDE364: tcache_flush_cache (tcache.c:790)
==87545==    by 0x6CDDE9BE: tcache_destroy.constprop.0 (tcache.c:809)
==87545==    by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545==    by 0x4A1393F: start_thread (pthread_create.c:455)
==87545==    by 0x4AA42E3: clone (clone.S:100)
==87545==  Address 0x6f41677c is in a rw- anonymous segment
==87545== 
==87545== ----------------------------------------------------------------
==87545== 
==87545==  Lock at 0x6F403AE8 was first observed
==87545==    at 0x4852F6B: pthread_mutex_init (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==87545==    by 0x6CDD46EF: je_arrow_private_je_malloc_mutex_init (mutex.c:171)
==87545==    by 0x6CDE1A0A: je_arrow_private_je_ecache_init (ecache.c:9)
==87545==    by 0x6CDD626C: je_arrow_private_je_pac_init (pac.c:49)
==87545==    by 0x6CDD4B84: je_arrow_private_je_pa_shard_init (pa.c:43)
==87545==    by 0x6CD83A76: je_arrow_private_je_arena_new (arena.c:1648)
==87545==    by 0x6CD7A254: je_arrow_private_je_arena_init (jemalloc.c:443)
==87545==    by 0x6CD7BBAC: malloc_init_hard_a0_locked (jemalloc.c:1885)
==87545==    by 0x6CD7BE70: malloc_init_hard (jemalloc.c:2129)
==87545==    by 0x400536D: call_init.part.0 (dl-init.c:90)
==87545==    by 0x4005472: call_init (dl-init.c:136)
==87545==    by 0x4005472: _dl_init (dl-init.c:137)
==87545==    by 0x4001561: _dl_catch_exception (dl-catch.c:211)
==87545==  Address 0x6f403ae8 is in a rw- anonymous segment
==87545== 
==87545== Possible data race during write of size 8 at 0x6F4060C0 by thread #24
==87545== Locks held: 1, at address 0x6F403AE8
==87545==    at 0x6CDC8891: atomic_store_zu (atomic.h:93)
==87545==    by 0x6CDC8891: je_arrow_private_je_eset_insert (eset.c:109)
==87545==    by 0x6CDCB1C8: extent_deactivate_locked_impl (extent.c:256)
==87545==    by 0x6CDCB1C8: extent_deactivate_locked (extent.c:263)
==87545==    by 0x6CDCB1C8: je_arrow_private_je_extent_record (extent.c:950)
==87545==    by 0x6CDD5909: pac_dalloc_impl (pac.c:277)
==87545==    by 0x6CDD38F9: large_dalloc_finish_impl (large.c:253)
==87545==    by 0x6CDD38F9: je_arrow_private_je_large_dalloc (large.c:273)
==87545==    by 0x6CDDECE7: arena_dalloc_large_no_tcache (arena_inlines_b.h:253)
==87545==    by 0x6CDDECE7: arena_dalloc_no_tcache (arena_inlines_b.h:276)
==87545==    by 0x6CDDECE7: arena_dalloc (arena_inlines_b.h:308)
==87545==    by 0x6CDDECE7: idalloctm (jemalloc_internal_inlines_c.h:120)
==87545==    by 0x6CDDECE7: tcache_destroy.constprop.0 (tcache.c:817)
==87545==    by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545==    by 0x4A1393F: start_thread (pthread_create.c:455)
==87545==    by 0x4AA42E3: clone (clone.S:100)
==87545==  Address 0x6f4060c0 is in a rw- anonymous segment
==87545== 
==87545== ----------------------------------------------------------------
==87545== 
==87545== Possible data race during read of size 1 at 0x6F4166AC by thread #24
==87545== Locks held: none
==87545==    at 0x6CD81459: atomic_load_b (atomic.h:89)
==87545==    by 0x6CD81459: background_thread_indefinite_sleep (background_thread_inlines.h:45)
==87545==    by 0x6CD81459: arena_background_thread_inactivity_check (arena.c:207)
==87545==    by 0x6CD81459: je_arrow_private_je_arena_handle_deferred_work (arena.c:223)
==87545==    by 0x6CDD393B: large_dalloc_finish_impl (large.c:255)
==87545==    by 0x6CDD393B: je_arrow_private_je_large_dalloc (large.c:273)
==87545==    by 0x6CDDECE7: arena_dalloc_large_no_tcache (arena_inlines_b.h:253)
==87545==    by 0x6CDDECE7: arena_dalloc_no_tcache (arena_inlines_b.h:276)
==87545==    by 0x6CDDECE7: arena_dalloc (arena_inlines_b.h:308)
==87545==    by 0x6CDDECE7: idalloctm (jemalloc_internal_inlines_c.h:120)
==87545==    by 0x6CDDECE7: tcache_destroy.constprop.0 (tcache.c:817)
==87545==    by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545==    by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545==    by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545==    by 0x4A1393F: start_thread (pthread_create.c:455)
==87545==    by 0x4AA42E3: clone (clone.S:100)
==87545==  Address 0x6f4166ac is in a rw- anonymous segment

@jorisvandenbossche (Member)

Maybe you can post that on the Arrow issue?

@mroeschke (Member, Author)

fact we run our tests in parallel

These Arrow CSV tests don't run in parallel anymore (they are run on the single CPU job with pytest-xdist disabled).

@mroeschke (Member, Author)

Closing since I merged a PR to skip these empty CSV tests

@mroeschke closed this Nov 14, 2023
@mroeschke deleted the ci/debug branch November 14, 2023 17:04