
BUG: DataFrame.to_json OverflowError with np.long* dtypes #55495

Merged · 3 commits merged into pandas-dev:main on Oct 16, 2023

Conversation

gupta-paras (Contributor)

gupta-paras requested a review from WillAyd as a code owner on October 12, 2023 13:02
WillAyd (Member) commented Oct 12, 2023

We don't support long double as a data type and I don't think upcasting double to long double is appropriate. We should probably just raise here
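
For context, here is a minimal sketch of the failure mode described in the title; it is not taken from the PR itself, np.longdouble is platform dependent, and the exact exception raised after this change is not shown in this thread:

    import numpy as np
    import pandas as pd

    # np.longdouble may simply alias float64 on some platforms, in which case
    # nothing interesting happens here.
    df = pd.DataFrame({"a": np.array([1.5, 2.5], dtype=np.longdouble)})

    try:
        print(df.to_json())
    except Exception as exc:  # before this PR: OverflowError; after: an explicit error
        print(type(exc).__name__, exc)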

gupta-paras force-pushed the long_double_json_overflow branch from 0b279ba to b972f99 on October 12, 2023 14:24
gupta-paras (Contributor, Author) commented Oct 12, 2023

@WillAyd, thanks for identifying the issue with the commit. I have updated it with the changes suggested in the comment above.
Thanks

gupta-paras force-pushed the long_double_json_overflow branch from b972f99 to 3e3c713 on October 12, 2023 15:30
@@ -1610,6 +1610,11 @@ void Object_beginTypeContext(JSOBJ _obj, JSONTypeContext *tc) {
PyArray_DescrFromType(NPY_DOUBLE));
tc->type = JT_DOUBLE;
return;
} else if (PyArray_IsScalar(obj, LongDouble)) {
WillAyd (Member):

This works, but I think the lines directly following it are supposed to be a catchall for unsupported types. Do you know why that isn't being hit? I would rather we keep this generic instead of having to specify an error message for every type that we don't serialize

gupta-paras (Contributor, Author):

There is a code section at the end of this function:

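    /* Generic fallback used when no earlier branch matches; the Dir_* handlers
       presumably iterate the object's attributes. As explained below, unsupported
       NumPy scalars that reach this point loop forever. */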
    pc->iterBegin = Dir_iterBegin;
    pc->iterEnd = Dir_iterEnd;
    pc->iterNext = Dir_iterNext;
    pc->iterGetValue = Dir_iterGetValue;
    pc->iterGetName = Dir_iterGetName;
    return;

By default, everything falls through to this section, and that causes an infinite loop.

WillAyd (Member):

Right, but I'm asking about the next branch after what you've added.

} else if (PyArray_Check(obj) && PyArray_CheckScalar(obj)) {

Do you know what hits that currently? From reading the function, I think the intent was to generically catch the issue you've described, but it's possible the invariant is incorrect. Would something pass both PyArray_Check and PyArray_CheckScalar? Maybe the PyArray_Check call is incorrect, and removing that alone would fix your issue?

gupta-paras (Contributor, Author) commented Oct 13, 2023:

Any value like np.array(1) (a 0-d array) evaluates true for both of them, so that check was intended to handle only this case. Also, for any NumPy scalar type, the second check is true, so simply removing PyArray_Check handles both cases. The only thing we need to make sure of is that all NumPy scalar types are handled before this if block. I will add a comment to that effect in the code as well.
Thanks
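
To illustrate the distinction being discussed, here is a small Python sketch using hypothetical analogues of the two C-API checks (PyArray_Check matches ndarray instances; PyArray_CheckScalar matches array scalars and 0-d arrays); the helper names are invented for illustration only:

    import numpy as np

    def pyarray_check(obj):         # rough analogue of PyArray_Check
        return isinstance(obj, np.ndarray)

    def pyarray_check_scalar(obj):  # rough analogue of PyArray_CheckScalar
        return isinstance(obj, np.generic) or (
            isinstance(obj, np.ndarray) and obj.ndim == 0
        )

    zero_d = np.array(1)         # 0-d array: passes both checks
    scalar = np.longdouble(1.5)  # array scalar: passes only the second check

    print(pyarray_check(zero_d), pyarray_check_scalar(zero_d))  # True True
    print(pyarray_check(scalar), pyarray_check_scalar(scalar))  # False True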

mroeschke added the IO JSON (read_json, to_json, json_normalize) label on Oct 13, 2023
gupta-paras force-pushed the long_double_json_overflow branch from 0f7f840 to 7a29022 on October 13, 2023 16:45
nmacholl commented, quoting WillAyd's comment above:

We don't support long double as a data type and I don't think upcasting double to long double is appropriate. We should probably just raise here

I'm curious: if pandas doesn't support these NumPy types in a DataFrame, would it make sense to raise an exception at the time of creation or dtype assignment? There may be other places in the code where these unsupported types would break.

WillAyd (Member) commented Oct 13, 2023

@nmacholl you might want to check existing issues for any discussion, but I don't think it's quite that simple. For starters, assignment/creation isn't the only way those types might make their way into a DataFrame. There is also a subset of users who may not use pandas for I/O or computations (where a lot of those limitations are imposed) and instead use it simply as an easy way to get labeled access to NumPy arrays. As an example, you'll see quite a few use cases in the issue log where users keep NumPy unicode dtypes in DataFrames even though pandas has almost no support for those; in such cases the DataFrame is just a labeled indexer, and I don't think we want to break that type of workflow.

WillAyd (Member) commented Oct 13, 2023

@gupta-paras this looks good to me aside from outstanding comments / code suggestions. @mroeschke any thoughts on your end?

mroeschke added this to the 2.2 milestone on Oct 16, 2023
mroeschke merged commit 32c9c8f into pandas-dev:main on Oct 16, 2023 (33 checks passed)
mroeschke (Member)

Thanks @gupta-paras

Labels: IO JSON (read_json, to_json, json_normalize)
Development: successfully merging this pull request may close issue: BUG: DataFrame.to_json OverflowError with np.long* dtypes
4 participants