ENH: Support pd.json_normalize for normalizing only meta fields #60460

Closed
wants to merge 3 commits into from

Ynjxsjmh

@Ynjxsjmh Ynjxsjmh commented Dec 1, 2024

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Currently, meta is only used when record_path is not None. The logic is to extract both the record_path and the meta. For example:

import pandas as pd

data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott", "year": 2014},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich", "year": 2015},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337},
        ],
    },
]
result = pd.json_normalize(
    data, "counties", ["state", "shortname", ["info", "governor"]]
)

result
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

In the above example, pd.json_normalize not only retrieves counties, but also retrieves state, shortname and info.governor.

When record_path is not given, meta is ignored, for example:

result = pd.json_normalize(
    data, meta=["state", "shortname", ["info", "governor"]]
)

result
     state shortname                                           counties info.governor  info.year
0  Florida        FL  [{'name': 'Dade', 'population': 12345}, {'name...    Rick Scott       2014
1     Ohio        OH  [{'name': 'Summit', 'population': 1234}, {'nam...   John Kasich       2015

This PR adds a feature: when record_path is None or an empty list, only the meta fields are extracted.

result = pd.json_normalize(
    data, meta=["state", "shortname", ["info", "governor"]]
)

result
  shortname    state info.governor
0        FL  Florida    Rick Scott
1        OH     Ohio   John Kasich
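
For comparison, a similar frame can be obtained without this change by normalizing every field and then selecting the flattened column names. This is only a minimal sketch of a workaround, not part of the PR: it assumes the default separator `.`, uses a shortened copy of `data` for illustration, and the `meta_columns` name is made up here. Column order follows the selection list, so it may differ from the output above.

```python
import pandas as pd

# Shortened copy of the example data (illustrative only).
data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott", "year": 2014},
        "counties": [{"name": "Dade", "population": 12345}],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich", "year": 2015},
        "counties": [{"name": "Summit", "population": 1234}],
    },
]

# Flatten everything, then keep only the columns that correspond to the
# meta paths; ["info", "governor"] becomes "info.governor" with sep=".".
meta_columns = ["state", "shortname", "info.governor"]
result = pd.json_normalize(data)[meta_columns]
print(result)
```

The drawback of this approach is that every nested field is flattened first, even the ones that are discarded afterwards.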

The behavior can be summarized as:

  • record_path is None, meta is None: normalize all records.
  • record_path is not None, meta is None: normalize only record_path.
  • record_path is not None, meta is not None: normalize record_path and meta.
  • record_path is None, meta is not None: normalize only meta. [This PR]
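
The first three cases can be checked with a short script. This is a sketch using a shortened copy of the example data; the variable names are illustrative only. The fourth case is shown with the behavior described above for pandas without this PR, where meta is ignored.

```python
import pandas as pd

# Shortened copy of the example data (illustrative only).
data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott", "year": 2014},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich", "year": 2015},
        "counties": [{"name": "Summit", "population": 1234}],
    },
]

# record_path is None, meta is None: one row per top-level record,
# nested dicts flattened, lists left as-is.
all_fields = pd.json_normalize(data)

# record_path is not None, meta is None: one row per county.
records_only = pd.json_normalize(data, record_path="counties")

# record_path is not None, meta is not None: county rows plus meta columns.
records_and_meta = pd.json_normalize(
    data, record_path="counties", meta=["state", ["info", "governor"]]
)

# record_path is None, meta is not None: without this PR, meta is ignored
# (per the description above), so this is expected to match all_fields.
meta_only = pd.json_normalize(data, meta=["state", ["info", "governor"]])

for name, frame in [
    ("all_fields", all_fields),
    ("records_only", records_only),
    ("records_and_meta", records_and_meta),
    ("meta_only", meta_only),
]:
    print(name, list(frame.columns), len(frame))
```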

@Ynjxsjmh
Author

Ynjxsjmh commented Dec 2, 2024

I couldn't reproduce the df.columns dtype mismatch error ("Future infer strings (without pyarrow)" and "Future infer strings"). I think my test is essentially the same as other tests that use record_path, such as test_nonetype_record_path and test_nested_meta_path_with_nested_record_path. I don't understand why my test hits these errors while the others don't.

Member

@mroeschke mroeschke left a comment

Thanks for the PR but is there a related open issue discussing this feature? We require that (and agreement from the core team) before proceeding.

@Ynjxsjmh
Author

Ynjxsjmh commented Dec 3, 2024

@mroeschke I didn't know about that rule. I did a thorough search and found no related issues. Should I close this PR and open a new issue?

@mroeschke
Member

Yes, let's wait for feedback on the issue before proceeding with a PR. We can reopen if there's agreement from the core team to support this feature.

@mroeschke mroeschke closed this Dec 3, 2024