Improve type handling in read_sql and read_sql_table #13049
Comments
There is already an existing keyword for part of this. However, the problem of e.g. possible NaNs in integer columns will not be solved by this, I think? The only way to be certain of always having a consistent dtype across chunks is to always convert integer columns to float (unless a not-nullable constraint is put on the column). I am not sure we should do that, as in many cases we would be unnecessarily converting integer columns without NaNs to float.
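As an illustration, a minimal sketch (assuming a throwaway in-memory SQLite table; exact inferred dtypes can vary with the pandas version and driver) that reproduces the chunk-dependent inference described above:

import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, val INTEGER)")
# The first two rows have no NULLs; the third does.
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, 10), (2, 20), (3, None), (4, 40)])
con.commit()

for i, chunk in enumerate(pd.read_sql("SELECT * FROM t", con, chunksize=2)):
    # Typically: chunk 0 infers val as int64, chunk 1 as float64 (NULL -> NaN).
    print(i, dict(chunk.dtypes))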
@jorisvandenbossche Reporting nullability in cursor.description is supported by most major drivers, with the exception of sqlite3.
IMO the problem with this is that, by default, columns can hold NULLs, so I suppose in many cases people will not specify this, even though in practice their columns may not hold NULLs. For all those cases the dtype of the returned column would now change, in many cases unnecessarily. I am not saying the issue you raise is not a problem, because it certainly is, but I am considering what would be the best solution for all cases.
I doubt that in serious environments non-nullable columns are left as nullable... but I guess we will never know. I think this could be handled by a keyword.
@rsdenijs That is quite possible, but the fact is that there are also a lot of less experienced people using pandas with SQL. The question then, of course, is to what extent we have to take those into account for this issue. Anyway, trying to think of other ways to deal with this issue:
Would you be interested in doing a PR for the first bullet point? That is the non-controversial part, I think, and could already solve it for string columns (leaving only int columns to handle manually).
@jorisvandenbossche Actually, I might have been confused regarding the strings. String columns are always of type object, regardless of the presence of NaNs; for some reason I thought there was an actual string type in pandas. So although I would like to take a stab at it, I'm no longer sure what the goal would be.
Regarding the int types, if for some reason we cannot verify that the column is nullable (SQLAlchemy), the default behaviour when chunking should IMO be to cast ints to float.
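For whole tables, declared nullability can be checked up front with SQLAlchemy reflection; a sketch below, assuming a hypothetical engine URL and table name (this does not help for arbitrary queries):

from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:///example.db")  # hypothetical database file
insp = inspect(engine)

for col in insp.get_columns("my_table"):  # hypothetical table name
    # Each entry is a dict with at least 'name', 'type' and 'nullable'.
    print(col["name"], col["type"],
          "NULL allowed" if col["nullable"] else "NOT NULL")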
My problem is that even if the detection works, integers lose precision when cast to float, and my values are record IDs, so I need full 64-bit integer precision. Any workaround?
It would be extremely helpful to be able to specify the types of columns as read_sql() input arguments! Could we maybe have at least that for the moment? |
Yes, we can... if somebody makes a contribution to add it!
I'm interested in taking this on! Is a fix on this still welcome? |
@sam-hoffman Contributions to improve type handling in SQL reading are certainly welcome, but I am not sure there is already a clear, actionable conclusion from the above discussion (if I remember correctly; I didn't yet reread the whole thread). So maybe you can first propose more concretely what you would like to change?
@sam-hoffman please do! |
@jorisvandenbossche Based on the above discussion, I would propose to add a dtype argument for read_sql.
Hello, what is the latest situation with this issue?
As a temporary workaround for nullable int64 types I use the following and prefer to specify the type of each column explicitly:
dtype={
    'column': pd.Int64Dtype()
}
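A sketch of how that workaround can be applied while chunking, with hypothetical database, table, and column names; casting each chunk to the nullable Int64 extension dtype keeps the dtype consistent whether or not the chunk contains NULLs:

import sqlite3
import pandas as pd

con = sqlite3.connect("example.db")  # hypothetical database
chunks = []
for chunk in pd.read_sql("SELECT * FROM my_table", con, chunksize=10_000):
    # NaN values (from NULLs) become <NA> in the nullable integer dtype,
    # so 64-bit precision is preserved instead of falling back to float64.
    chunks.append(chunk.astype({"column": pd.Int64Dtype()}))
result = pd.concat(chunks, ignore_index=True)

Newer pandas versions also accept a dtype argument directly in the SQL readers; check the documentation for the version you are running.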
Seems like this is stale now? |
Problem
In pd.read_sql and pd.read_sql_table, when the chunksize parameter is set, pandas builds each DataFrame with dtypes inferred from the data in that chunk. This is a problem if an INTEGER column contains null values in some chunks but not in others, leading the same column to be int64 in some chunks and float64 in others. A similar problem happens with strings.
In ETL processes, or simply when dumping large queries to disk in HDF5 format, the user currently bears the burden of explicitly handling the type conversions of potentially many columns.
Solution?
Instead of guessing the type from a subset of the data, it should be possible to obtain the type information from the database and map it to the appropriate dtypes.
It is possible to obtain column information from SQLAlchemy when querying a full table by inspecting its metadata, but I was unsuccessful in finding a way to do it for a general query.
Although I am unaware of all the possible type problems that can arise, the DBAPI does require cursor.description to specify whether each result column is nullable.
Pandas could use this information (optionally) to always interpret nullable numeric columns as floats and strings as object columns.
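A sketch of where that nullability flag lives: PEP 249 defines cursor.description as a sequence of 7-item sequences (name, type_code, display_size, internal_size, precision, scale, null_ok). With sqlite3 every field except the column name is None, which is the limitation mentioned earlier in the thread, so any logic built on null_ok needs a fallback:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER NOT NULL, name TEXT)")
cur = con.execute("SELECT id, name FROM t")
for name, type_code, *_rest, null_ok in cur.description:
    # sqlite3 reports null_ok as None ("unknown"); drivers that populate it
    # report True/False, which is the information pandas could use here.
    print(name, type_code, null_ok)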