Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #60105

kastkeepitjumpinlikekangaroos · 2024-10-25T04:38:41Z

closes BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #59242
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR allows bytes-based data to be returned instead of raising an exception, as happened before.

kastkeepitjumpinlikekangaroos · 2024-10-25T23:23:51Z

The most recent change on main also seems to have 17 failing tasks, so I think these failures are unrelated to the change in my PR, but please let me know if that's wrong!

mroeschke · 2024-10-30T19:21:57Z

pandas/core/internals/construction.py

+
+                            # Try and greedily convert to string
+                            # Will fail if the object is bytes
+                            arr = arr_cls._from_sequence(arr, dtype=new_dtype)


Shouldn't this ideally return pandas.ArrowDtype(pyarrow.binary()) type?

That makes sense, thanks! It looks like the previous logic was not taking pyarrow types into account when doing this conversion, so I've added logic similar to my initial change: we try to convert to a pyarrow string, then fall back to binary if we run into an invalid error (i.e. we tried to parse but it failed due to an encoding error). Please let me know what you think! I also considered type-checking the contents of arr to see whether it holds string or bytes data, but greedily trying to convert seems to perform better in most cases, since a check might have to scan the whole arr to find a single bytes element that can't be converted to a string.
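The greedy strategy described in this comment can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the PR: `greedy_text_or_bytes` is a hypothetical helper, `bytes.decode` stands in for the pyarrow string conversion, and `UnicodeDecodeError` plays the role of the invalid-encoding error.

```python
def greedy_text_or_bytes(values):
    """Hypothetical helper: try the text interpretation first, keep bytes on failure."""
    try:
        # Greedy attempt: assume every bytes element decodes to valid UTF-8 text.
        decoded = [
            v.decode("utf-8") if isinstance(v, bytes) else v for v in values
        ]
    except UnicodeDecodeError:
        # At least one element is not valid UTF-8, so treat the column as binary.
        return "binary", list(values)
    return "string", decoded


# The 16-byte blob used in the PR's test query; 0x89 is not a valid UTF-8 start byte.
blob = bytes.fromhex("0123456789abcdef0123456789abcdef")

print(greedy_text_or_bytes([b"abc", "def"]))
print(greedy_text_or_bytes([blob])[0])
```

When every element is text, the single pass over the values is all the work there is, with no separate type scan beforehand. The cost of a failed attempt is only paid for columns that really do hold non-text bytes.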

kastkeepitjumpinlikekangaroos · 2024-11-13T22:34:02Z

pandas/tests/io/test_sql.py

@@ -4355,3 +4355,19 @@ def test_xsqlite_if_exists(sqlite_buildin):
        (5, "E"),
    ]
    drop_table(table_name, sqlite_buildin)
+
+
+@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default])


I also found that this was failing for dtype_backend=numpy_nullable, so I implemented a fix and added test cases for each possible dtype_backend.

WillAyd

I definitely appreciate what we are trying to do here although I'm wary about trying to infer "non-first class" types at runtime.

Have you tried using an ADBC driver instead? That should be Arrow-native and yield the proper dtypes

WillAyd · 2024-11-14T20:11:05Z

pandas/core/internals/construction.py

@@ -968,9 +972,31 @@ def convert(arr):
                    # i.e. maybe_convert_objects didn't convert
                    arr = maybe_infer_to_datetimelike(arr)
                    if dtype_backend != "numpy" and arr.dtype == np.dtype("O"):


I think this is the wrong place to be doing this; in the sql.py module can't we read in the type of the database and only try to convert BINARY types to Arrow binary types?

Based on some local testing using the ADBC driver I can confirm that it yields a pandas.core.arrays.arrow.array.ArrowExtensionArray with ._dtype of pandas.ArrowDtype. When the query returns a bytes type column we get a .type of bytes, and likewise a .type of string is returned for a string type column. Seems like we don't need to do any conversions when using the ADBC driver as you've stated if I'm understanding correctly here!

I'm wondering if it makes sense to remove the code here that tries to convert when dtype_backend != 'numpy', since that would fix the cause of the exception in the issue. Maybe we could also raise an exception when a pyarrow dtype_backend is used with the SQLiteDatabase connection type here: https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L695 ?

I think the general problem is that pandas does not have a first class "binary" data type, so I'm not sure how to solve this for anything but the pyarrow backend.

With the pyarrow backend, I think you can still move this logic to sql.py and check the type of the column coming back from the database. If it is a binary type in the database, using the PyArrow binary type with that backend makes sense.
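As a rough illustration of "check the type of the column coming back from the database", the sketch below uses only the standard-library `sqlite3` module to find columns declared with a binary type. It is not how `sql.py` works: the helper name `binary_columns`, the table `items`, and the set of type names treated as binary are all assumptions made for the example.

```python
import sqlite3

# Assumed set of declared type names to treat as binary (not taken from pandas).
BINARY_TYPE_NAMES = {"BLOB", "BINARY", "VARBINARY"}


def binary_columns(conn, table):
    """Hypothetical helper: names of columns whose declared type is binary."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # PRAGMA statements do not take bound parameters, so `table` must be trusted.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    found = []
    for _cid, name, decl_type, *_rest in rows:
        # Drop any length suffix such as "(16)" before comparing.
        base_type = decl_type.split("(")[0].strip().upper()
        if base_type in BINARY_TYPE_NAMES:
            found.append(name)
    return found


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, payload BLOB, label TEXT)")
found = binary_columns(conn, "items")
conn.close()
print(found)
```

This only covers columns of a named table. An arbitrary query such as the `select cast(... as blob)` used in the test has no declared schema to inspect this way, so a real implementation would need some other source of type information.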

Not sure if @mroeschke has other thoughts to the general issue. This is likely another good use case to track in PDEP-13 #58455

I agree that this is the incorrect place to handle this conversion logic and this should only be a valid conversion for the pyarrow backend (ArrowExtensionArray._from_sequence should be able to return a binary type with binary data.)

In sql.py it looks like the result of this conversion is being overwritten when the dtype_backend is pyarrow: https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L181 and the dtype returned by the current logic is ArrowDtype(pa.binary()) for the example in the issue, so maybe just removing the conversion logic is all that's needed to resolve this issue? I've removed the block doing the conversion and added a test case showing that the resulting df has a dtype of ArrowDtype(pa.binary()) when dtype_backend='pyarrow'.

WillAyd

Very nice - simplification is always good and I think this implementation looks reasonable :-)

Just need a few tweaks to the test case

WillAyd · 2024-11-20T04:55:55Z

pandas/tests/io/test_sql.py

+    select cast(x'0123456789abcdef0123456789abcdef' as blob) a
+    """
+    df = pd.read_sql(query, sqlite_buildin, dtype_backend=dtype_backend)
+    assert df.a.values[0] == b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"


Can you use our built-in test helpers instead? I think you can just do:

result = pd.read_sql(...)
expected = pd.DataFrame({"a": ...}, dtype=pd.ArrowDtype(pa.binary()))
tm.assert_frame_equal(result, expected)

What data type does this produce currently with the numpy_nullable backend - object?

For sure, I changed the testing logic over to use this! For numpy_nullable and lib.no_default the dtype returned is object.
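For context on the object dtype, the snippet below runs the test query through the standard-library `sqlite3` driver directly, without pandas. It shows that the driver returns a plain Python `bytes` value, which is consistent with a NumPy-backed column ending up as object, though that link is an inference rather than something stated in the thread. It also shows that the escaped literal from the original assertion is the same 16 bytes as the hex string in the query.

```python
import sqlite3

query = "select cast(x'0123456789abcdef0123456789abcdef' as blob) a"

conn = sqlite3.connect(":memory:")
value = conn.execute(query).fetchone()[0]
conn.close()

# The driver returns a plain Python bytes object for a BLOB value.
assert isinstance(value, bytes)

# The escaped literal is just these 16 bytes: 0x23, 0x45 and 0x67 happen to be
# the printable characters '#', 'E' and 'g'.
assert value == bytes.fromhex("0123456789abcdef0123456789abcdef")
assert value == b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"
```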

WillAyd · 2024-11-21T17:59:34Z

pandas/tests/io/test_sql.py

+
+
+@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default])
+def test_bytes_column(sqlite_buildin, dtype_backend):


Suggested change

-def test_bytes_column(sqlite_buildin, dtype_backend):
+def test_bytes_column(all_connectable, dtype_backend):

Should test this against all databases

Sure thing! I was trying to pass the cartesian product of the all_connectable and dtype_backend arrays to @pytest.mark.parametrize using itertools.product, but was running into issues with the connections getting passed. Instead, I made it so the connectables are passed in the parametrize and we loop through the dtypes inside the test. I would love to know if there's a better way to do this so we're testing each dtype_backend/connection combination independently.
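One way to get each combination as an independent test case is to stack two `@pytest.mark.parametrize` decorators, which pytest expands into the cartesian product, and to pass fixture names as strings that are resolved with `request.getfixturevalue`. The sketch below assumes that approach; the fixtures `memory_conn` and `autocommit_conn` are hypothetical stand-ins for entries of all_connectable, and `None` stands in for lib.no_default.

```python
import sqlite3

import pytest


@pytest.fixture
def memory_conn():
    # Hypothetical fixture standing in for one entry of all_connectable.
    conn = sqlite3.connect(":memory:")
    yield conn
    conn.close()


@pytest.fixture
def autocommit_conn():
    # Second hypothetical fixture, so the product covers more than one connection.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    yield conn
    conn.close()


CONNECTABLES = ["memory_conn", "autocommit_conn"]  # fixture names, not objects
BACKENDS = ["pyarrow", "numpy_nullable", None]  # None stands in for lib.no_default


# Stacked parametrize decorators generate every (conn_name, dtype_backend) pair.
@pytest.mark.parametrize("dtype_backend", BACKENDS)
@pytest.mark.parametrize("conn_name", CONNECTABLES)
def test_bytes_roundtrip(conn_name, dtype_backend, request):
    # Resolve the fixture by name at run time instead of passing a live connection.
    conn = request.getfixturevalue(conn_name)
    value = conn.execute("select cast(x'0123' as blob) a").fetchone()[0]
    # dtype_backend is unused in this sketch; a real test would pass it to pd.read_sql.
    assert value == b"\x01\x23"
```

With two connection fixtures and three backends, pytest would collect six separate test cases, so a failure for one combination does not hide the others.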

WillAyd · 2024-11-21T18:00:08Z

pandas/tests/io/test_sql.py

+@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default])
+def test_bytes_column(sqlite_buildin, dtype_backend):
+    pa = pytest.importorskip("pyarrow")
+    """


This is well intentioned but can you remove the docstring? We don't use them in tests.

Instead, you can just add a comment pointing to the github issue number in the function body
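A minimal sketch of that convention follows, using the standard-library `sqlite3` module so that it runs without pandas. The function name and the exact comment format are assumptions; the thread only asks for a comment that points to the issue number.

```python
import sqlite3


def test_bytes_column_sketch():
    # GH#59242
    conn = sqlite3.connect(":memory:")
    try:
        value = conn.execute("select cast(x'01' as blob) a").fetchone()[0]
    finally:
        conn.close()
    assert value == b"\x01"
```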

kastkeepitjumpinlikekangaroos added 2 commits October 25, 2024 00:35

Add fix for pandas-dev#59242

2244869

add skip import

bd00fc5

mroeschke requested changes Oct 30, 2024

mroeschke added the "IO SQL" (to_sql, read_sql, read_sql_query) and "Arrow" (pyarrow functionality) labels Oct 30, 2024

kastkeepitjumpinlikekangaroos added 2 commits November 12, 2024 17:40

address comment

6a23f05

merged main

2f261c8

kastkeepitjumpinlikekangaroos requested a review from mroeschke November 13, 2024 04:15

also fix for dtype_backend=numpy_nullable

8f900b8

kastkeepitjumpinlikekangaroos commented Nov 13, 2024

Merge branch 'main' into fix-59242

536b1ed

WillAyd requested changes Nov 14, 2024

kastkeepitjumpinlikekangaroos added 3 commits November 17, 2024 09:14

merge main

8965a1d

fix

a32b4a6

Merge branch 'main' into fix-59242

f412c72

WillAyd requested changes Nov 20, 2024

kastkeepitjumpinlikekangaroos added 2 commits November 20, 2024 18:48

address comment

a0200d0

Merge branch 'main' into fix-59242

10ca030

WillAyd requested changes Nov 21, 2024

kastkeepitjumpinlikekangaroos added 4 commits December 4, 2024 00:13

address comment

ba2e82d

Merge branch 'fix-59242' of github.com:kastkeepitjumpinlikekangaroos/…

8dd45ca

…pandas into fix-59242

remove cast

602194a

fix test

f8ae285

kastkeepitjumpinlikekangaroos commented Dec 4, 2024

kastkeepitjumpinlikekangaroos added 5 commits December 4, 2024 17:39

fix logic

02514e0

fix

ef5c2ed

fix

da7817a

also include pa.OpaqueType

de5c604

fix

e025d84

kastkeepitjumpinlikekangaroos added 2 commits December 8, 2024 22:57

Merge branch 'main' into fix-59242

3d83bab

Merge branch 'main' into fix-59242

6ea5785

mroeschke Oct 30, 2024

kastkeepitjumpinlikekangaroos Nov 12, 2024

kastkeepitjumpinlikekangaroos Nov 13, 2024

WillAyd left a comment

WillAyd Nov 14, 2024

kastkeepitjumpinlikekangaroos Nov 15, 2024

WillAyd Nov 15, 2024

mroeschke Nov 15, 2024

kastkeepitjumpinlikekangaroos Nov 17, 2024

WillAyd left a comment

WillAyd Nov 20, 2024

kastkeepitjumpinlikekangaroos Nov 20, 2024

WillAyd Nov 21, 2024

kastkeepitjumpinlikekangaroos Dec 4, 2024

WillAyd Nov 21, 2024


