-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #60105
base: main
Are you sure you want to change the base?
Changes from 5 commits
2244869
bd00fc5
6a23f05
2f261c8
8f900b8
536b1ed
8965a1d
a32b4a6
f412c72
a0200d0
10ca030
ba2e82d
8dd45ca
602194a
f8ae285
02514e0
ef5c2ed
da7817a
de5c604
e025d84
3d83bab
6ea5785
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -4355,3 +4355,19 @@ def test_xsqlite_if_exists(sqlite_buildin): | |||||
(5, "E"), | ||||||
] | ||||||
drop_table(table_name, sqlite_buildin) | ||||||
|
||||||
|
||||||
@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default]) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also found that this was failing for |
||||||
def test_bytes_column(sqlite_buildin, dtype_backend): | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Should test this against all databases There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure thing! I was trying to pass in the cartesian product of the |
||||||
pytest.importorskip("pyarrow") | ||||||
""" | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is well intentioned but can you remove the docstring? We don't use them in tests. Instead, you can just add a comment pointing to the github issue number in the function body |
||||||
Regression test for (#59242) | ||||||
Bytes being returned in a column that could not be converted | ||||||
to a string would raise a UnicodeDecodeError | ||||||
when using dtype_backend='pyarrow' or dtype_backend='numpy_nullable' | ||||||
""" | ||||||
query = """ | ||||||
select cast(x'0123456789abcdef0123456789abcdef' as blob) a | ||||||
""" | ||||||
df = pd.read_sql(query, sqlite_buildin, dtype_backend=dtype_backend) | ||||||
assert df.a.values[0] == b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef" | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you use our built-in test helpers instead? I think you can just do: result = pd.read_sql(...)
expected = pd.DataFrame({"a": ...}, dtype=pd.ArrowDtype(pa.binary()))
tm.assert_frame_equal(result, expected) What data type does this produce currently with the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for sure, changed the testing logic over to using this! for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the wrong place to be doing this; in the sql.py module can't we read in the type of the database and only try to convert BINARY types to Arrow binary types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on some local testing using the ADBC driver I can confirm that it yields a
pandas.core.arrays.arrow.array.ArrowExtensionArray
with._dtype
ofpandas.ArrowDtype
. When the query returns a bytes type column we get a.type
ofbytes
, and likewise a.type
ofstring
is returned for a string type column. Seems like we don't need to do any conversions when using the ADBC driver as you've stated if I'm understanding correctly here!Wondering if it makes sense to remove the code here trying to convert based on a
dtype_backend != 'numpy'
since this will fix the cause of the exception in the issue? and maybe raise an exception when trying to use apyarrow
dtype_backend
with theSQLiteDatabase
connection type here: https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L695 ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the general problem is that pandas does not have a first class "binary" data type, so I'm not sure how to solve this for anything but the pyarrow backend.
With the pyarrow backend, I think you can still move this logic to
sql.py
and check the type of the column coming back from the database. If it is a binary type in the database, using the PyArrow binary type with that backend makes sense.Not sure if @mroeschke has other thoughts to the general issue. This is likely another good use case to track in PDEP-13 #58455
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is the incorrect place to handle this conversion logic and this should only be a valid conversion for the pyarrow backend (
ArrowExtensionArray._from_sequence
should be able to return a binary type with binary data.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in
sql.py
it looks like the result of this conversion is being overwritten when thedtype_backend
ispyarrow
: https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L181 and the dtype returned by the current logic isArrowDtype(pa.binary())
for the example in the issue, so maybe just removing the conversion logic is all that's needed to resolve this issue? I've removed the block doing the conversion and added a test case showing that the resulting df has a dtype ofArrowDtype(pa.binary())
when thedtype_backend='pyarrow'