TST(string dtype): Resolve HDF5 xfails in test_put.py #60625

rhshadrach · 2024-12-30T13:56:20Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

It would be more performant to implement a way of writing the string dtypes without converting to NumPy, but this approach gets us correct behavior for now.
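As a rough illustration of the "convert to NumPy first" idea, here is a minimal sketch. The helper name `to_object_ndarray` is hypothetical, and the public `pd.StringDtype` check is an assumption standing in for the internal `BaseStringArray` check in the diff; it is not the code in `pandas/io/pytables.py`.

```python
import numpy as np
import pandas as pd


def to_object_ndarray(values):
    """Hypothetical helper: convert string-dtype extension arrays to NumPy."""
    # Assumption: pd.StringDtype is used as a public stand-in for the
    # internal BaseStringArray isinstance check shown in the diff.
    if isinstance(getattr(values, "dtype", None), pd.StringDtype):
        # Materialize the values as a plain object ndarray.
        return values.to_numpy(dtype=object)
    # Anything else is passed through untouched.
    return values


arr = pd.array(["a", None, "b"], dtype="string")
converted = to_object_ndarray(arr)
print(type(converted), converted.dtype)
```

The missing value survives the conversion as an NA marker inside the object array, which is why a reader can still infer a string dtype when loading.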

mroeschke · 2024-12-30T17:33:29Z

pandas/io/pytables.py

@@ -3185,6 +3186,8 @@ def write_array(
        #  both self._filters and EA

        value = extract_array(obj, extract_numpy=True)
+        if isinstance(value, BaseStringArray):


Should this logic live in extract_array?

The line from the docstring of extract_array states:

Extract the ndarray or ExtensionArray from a Series or Index.

and the extract_numpy argument:

Whether to extract the ndarray from a NumpyExtensionArray.

So I think no - I would expect that function to still return the ExtensionArray.

Do the tests currently cover the "string" data type going through pytables? This seems like it might mangle the NA markers?

Not super familiar with pytables so not saying this is right or wrong - just want to double check

There are certainly tests with dataframe roundtrip that contain string columns, not entirely sure if there are also tests where those columns have missing values, though

@jorisvandenbossche - I'm not seeing this particular line hit with string dtype. I added a test, including an NA value.

@WillAyd - the current behavior is to write out the underlying objects, and then infer upon loading. So if we start with string and future.infer_string=False, we get object back. When that option is True, we get str.


jorisvandenbossche · 2025-01-02T09:33:17Z

pandas/tests/io/pytables/test_put.py

+        if using_infer_string:
+            msg = "Saving a MultiIndex with an extension dtype is not supported."


Suggested change

-        if using_infer_string:
-            msg = "Saving a MultiIndex with an extension dtype is not supported."
+        if using_infer_string:
+            # TODO(infer_string) make this work for string dtype
+            msg = "Saving a MultiIndex with an extension dtype is not supported."

(just as a reminder, because ideally we still solve this)

jorisvandenbossche · 2025-01-02T09:39:20Z

pandas/io/pytables.py

-            if using_string_dtype() and is_string_array(values, skipna=True):
+            if (
+                using_string_dtype()
+                and isinstance(values, np.ndarray)
+                and is_string_array(values, skipna=True)
+            ):


The same code pattern appears above in the SeriesFixed read() method. I'm not entirely sure the same change should be applied there as well, but I would expect so (maybe it is not covered by any test?)

Indeed, added a test
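A small sketch of what the added `isinstance(values, np.ndarray)` guard does. The helper `is_object_string_ndarray` is hypothetical and leaves out the `using_string_dtype()` part of the condition; `pandas._libs.lib` is a private module, so treat this as an illustration rather than a supported API.

```python
import numpy as np
import pandas as pd
from pandas._libs import lib  # private module; may change between releases


def is_object_string_ndarray(values) -> bool:
    """Hypothetical helper mirroring the guarded condition in the diff."""
    # The isinstance check short-circuits, so extension arrays never
    # reach is_string_array, which is written for NumPy arrays.
    return isinstance(values, np.ndarray) and bool(
        lib.is_string_array(values, skipna=True)
    )


strings = np.array(["a", None, "b"], dtype=object)  # strings plus a missing value
numbers = np.array([1, 2, 3])                       # not strings at all
ext = pd.array(["a", None, "b"], dtype="string")    # extension array, not an ndarray

for name, values in [("strings", strings), ("numbers", numbers), ("ext", ext)]:
    print(name, is_object_string_ndarray(values))
```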

jorisvandenbossche · 2025-01-02T09:47:30Z

pandas/tests/io/pytables/test_put.py



 def test_put_mixed_type(setup_path, performance_warning):
    df = DataFrame(
        np.random.default_rng(2).standard_normal((10, 4)),
-        columns=Index(list("ABCD"), dtype=object),
+        columns=Index(list("ABCD")),
        index=date_range("2000-01-01", periods=10, freq="B"),
    )
    df["obj1"] = "foo"


Suggested change

-    df["obj1"] = "foo"
+    df["obj1"] = np.array(["foo"] * 10, dtype=object)
+    df["str"] = pd.Series(["a", None, "b", "c", "d"] * 2).array

To explicitly test here with both object dtype and string dtype, and with strings with missing values (this is passing locally, so that should address Will's comment at https://github.com/pandas-dev/pandas/pull/60625/files#r1900476124).
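For reference, a standalone sketch of the frame this suggestion builds. One deliberate difference, which is an assumption of this sketch and not part of the suggestion: the `str` column is created with an explicit `dtype="string"` so that it is a string dtype regardless of the `future.infer_string` setting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.default_rng(2).standard_normal((10, 4)),
    columns=pd.Index(list("ABCD")),
    index=pd.date_range("2000-01-01", periods=10, freq="B"),
)
# Explicit object-dtype column of strings.
df["obj1"] = np.array(["foo"] * 10, dtype=object)
# String-dtype column with two missing values (explicit dtype is an
# assumption of this sketch; the suggestion relies on inference instead).
df["str"] = pd.array(["a", None, "b", "c", "d"] * 2, dtype="string")

print(df.dtypes)
print(df["str"].isna().sum())
```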

TST(string dtype): Resolve HDF5 xfails in test_put.py

2be559a

rhshadrach added the "IO HDF5" (read_hdf, HDFStore) and "Strings" (String extension data type and string data) labels Dec 30, 2024

rhshadrach added this to the 2.3 milestone Dec 30, 2024

rhshadrach mentioned this pull request Dec 30, 2024

TST(string dtype): Resolve HDF5 xfails in test_round_trip.py #60627


mroeschke reviewed Dec 30, 2024


WillAyd requested changes Dec 31, 2024


Change tests

4c0e07d

jorisvandenbossche reviewed Jan 2, 2025


rhshadrach added 2 commits January 2, 2025 16:22

Add test for string dtype

0851ea8

Refinements

846b2b5


rhshadrach commented Dec 30, 2024

mroeschke Dec 30, 2024

rhshadrach Dec 31, 2024

WillAyd Jan 1, 2025

jorisvandenbossche Jan 2, 2025

rhshadrach Jan 2, 2025

jorisvandenbossche Jan 2, 2025 (edited)

jorisvandenbossche Jan 2, 2025

rhshadrach Jan 2, 2025

jorisvandenbossche Jan 2, 2025
