ENH adding metadata argument to DataFrame.to_parquet #20521
Comments
cc @cpcloud What's the purpose here? Would this be in addition to, or in place of, the usual pandas metadata?
The user-supplied dictionary updates the file's current key-value metadata. If the user passes the pandas key, it overwrites the pandas metadata, but a warning (via warnings.warn) is issued. Purpose: user-defined metadata is needed in a number of common workflows.
For me it is a very important feature and one of the main reasons I want to switch to parquet.
That all sounds reasonable.
Slight cosmetic suggestion: the code could be a bit more Pythonic.
Added a whatsnew entry and rebased onto the current master.
Note for readers: the PR was closed but mentions a work-around that can be used for now if you need this: #20534 (comment)
I have been thinking about this and am wondering what the general thoughts are on using DataFrame.attrs and Series.attrs for reading and writing metadata to/from parquet. For example, here is how the metadata would be written:

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
pdf.to_parquet("file.parquet")

Then, when loading the data:

pdf = pandas.read_parquet("file.parquet")
pdf.attrs
pdf.a.attrs

Is this something that would need to be done in pandas or in pyarrow/fastparquet? EDIT: Added an issue to pyarrow here
Here is a hack to get the attrs to work with pyarrow:

import json

import pandas
import pyarrow
import pyarrow.parquet


def _write_attrs(table, pdf):
    # copy the existing schema metadata and the embedded "pandas" entry
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    # collect per-column attrs; only string column names are kept as JSON keys
    column_attrs = {}
    for col in pdf.columns:
        attrs = pdf[col].attrs
        if not attrs or not isinstance(col, str):
            continue
        column_attrs[col] = attrs
    pandas_metadata.update(
        attrs=pdf.attrs,
        column_attrs=column_attrs,
    )
    schema_metadata[b"pandas"] = json.dumps(pandas_metadata)
    return table.replace_schema_metadata(schema_metadata)


def _read_attrs(table, pdf):
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    pdf.attrs = pandas_metadata.get("attrs", {})
    col_attrs = pandas_metadata.get("column_attrs", {})
    for col in pdf.columns:
        pdf[col].attrs = col_attrs.get(col, {})


def to_parquet(pdf, filename):
    # write parquet file with attributes
    table = pyarrow.Table.from_pandas(pdf)
    table = _write_attrs(table, pdf)
    pyarrow.parquet.write_table(table, filename)


def read_parquet(filename):
    # read parquet file with attributes
    table = pyarrow.parquet.read_pandas(filename)
    pdf = table.to_pandas()
    _read_attrs(table, pdf)
    return pdf

Example. Writing:

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
to_parquet(pdf, "a.parquet")

Reading:

pdf = read_parquet("a.parquet")
pdf.attrs
pdf.a.attrs
I have a PR that seems to do the trick: #41545
Ideally, I think this would actually be done in pyarrow/fastparquet, as those are the libraries in which the "pandas" metadata item currently gets constructed.
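For context, a minimal sketch (assuming pyarrow is installed; the DataFrame is purely illustrative) of where that "pandas" metadata item ends up once a DataFrame is converted to an Arrow table:

import json

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1]})
table = pa.Table.from_pandas(df)

# pyarrow serializes column/index information into the "pandas" entry
# of the Arrow schema metadata; this only inspects it
pandas_meta = json.loads(table.schema.metadata[b"pandas"])
print(pandas_meta["columns"])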
Use a workaround until this ENH is implemented: pandas-dev/pandas#20521
So... can we have something simple to work with df.attrs? The goal is to replace the various pseudo-CSV formats that add #-prefixed comments at the beginning of a file with something systematic. I believe everyone would agree that this is 1) a common use case, 2) supportable by parquet, and 3) should work without hassle for the reader (I'm OK with hassle for the writer).
Yes, and a contribution to add this functionality is welcome, I think. And a PR to add generic parquet file-level metadata with a metadata keyword would also be welcome.
Edit: the workaround below is no longer needed, since this is now done (see the comment further down).
import fastparquet

# write
df.to_parquet(path)
meta = {'foo': 'bar'}
fastparquet.update_file_custom_metadata(path, meta)

# read
pf = fastparquet.ParquetFile(path)
df_ = pf.to_pandas()
meta_ = pf.key_value_metadata
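For reference, a rough pyarrow counterpart of the fastparquet snippet above; this is only a sketch, df and path are assumed to be defined as above, and the foo/bar entries are placeholders. Merging into the existing schema metadata keeps the "pandas" entry that from_pandas adds:

import pyarrow as pa
import pyarrow.parquet as pq

# write: merge custom key/value pairs into the existing schema metadata
table = pa.Table.from_pandas(df)
merged = {**(table.schema.metadata or {}), b"foo": b"bar"}
pq.write_table(table.replace_schema_metadata(merged), path)

# read: the custom pairs come back in the file's schema metadata
meta_ = pq.read_schema(path).metadata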
This is done and in.
Code Sample, a copy-pastable example if possible
Please consider merging
master...JacekPliszka:master
Problem description
Currently, pandas cannot add custom metadata to a parquet file.
This patch adds a metadata argument to DataFrame.to_parquet that allows for that.
A warning is issued when the pandas key is present in the dictionary passed.
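As a rough sketch of what the proposed call could look like, based on the description above (the metadata keyword is the proposal of this issue/patch, not an argument in released pandas, and the keys shown are purely illustrative):

import pandas as pd

df = pd.DataFrame({"a": [1]})

# proposed usage: the extra key/value pairs would be merged into the
# parquet file's key-value metadata (hypothetical keyword)
df.to_parquet("data.parquet", metadata={"owner": "data-team", "source": "sensor-42"})

# passing the reserved "pandas" key would overwrite the pandas metadata
# and trigger a warning, per the patch description
df.to_parquet("data.parquet", metadata={"pandas": "{}"})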