
Croissant refers to incomplete parquet branch in native parquet datasets #3101

Open
fylux opened this issue Nov 7, 2024 · 6 comments


fylux commented Nov 7, 2024

The Croissant file exposed by Hugging Face seems to correspond to the parquet branch of the dataset, even when the dataset is native parquet.

IIUC, the parquet branch is not complete for datasets larger than 5GB (not exactly, since the 5GB limit applies per split), so the branch can often be incomplete for large datasets. There are exceptions, though, where the parquet branch does seem complete.

Instead, there should be a way to retrieve a Croissant file that refers to the main, native-parquet branch. For backward compatibility it might be better to expose both Croissant files (parquet branch and main branch), although exposing only the "complete" one could also be an option.
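For reference, the Croissant record can be fetched from the public /api/datasets/<id>/croissant endpoint; here is a minimal sketch to see which branch the file URLs point at (the dataset name is just an example, and the distribution/contentUrl fields follow the Croissant spec):

import requests

# Fetch the Croissant metadata for an example dataset
url = "https://huggingface.co/api/datasets/mlfoundations/dclm-baseline-1.0-parquet/croissant"
croissant = requests.get(url).json()

# The distribution entries currently point at the auto-converted branch,
# visible as "refs%2Fconvert%2Fparquet" in the contentUrl values
for dist in croissant.get("distribution", []):
    print(dist.get("contentUrl"))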


fylux commented Nov 8, 2024

@lhoestq is there anything we can contribute to fix this, or does it need to be done on the Hugging Face server?


lhoestq commented Nov 8, 2024

Hi ! Yes, if a dataset is already in parquet, the Croissant file doesn't need to point to the parquet branch (which may contain incomplete data). You can maybe check https://github.com/huggingface/dataset-viewer/blob/main/services/worker/src/worker/job_runners/dataset/croissant_crumbs.py and see if you can adapt the code for this case.


fylux commented Nov 12, 2024

My impression is that the required change goes deeper than croissant_crumbs.py.

That file assumes it already has a dataset info (containing configs and splits), and IIUC the dataset_info is retrieved from the parquet branch (or, in general, there is a clear mapping dataset_info -> parquet branch folder hierarchy). The first step would be to specify how the configs and splits are distributed across the folders of the main branch, which, unlike the parquet branch, doesn't follow a predefined structure.

For example, https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet/tree/refs%2Fconvert%2Fparquet (both layouts are listed in the sketch below):

  • Parquet branch (predictable config/split layout):
    default/partial-train/0000.parquet
  • Main branch (arbitrary folder hierarchy):
    filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet
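To make the contrast concrete, here is a sketch using huggingface_hub's HfFileSystem, which accepts an @revision suffix in paths; the glob patterns are assumptions about where the parquet files sit:

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
repo = "datasets/mlfoundations/dclm-baseline-1.0-parquet"

# Auto-converted branch: predictable <config>/<split>/ layout
print(fs.glob(f"{repo}@refs%2Fconvert%2Fparquet/**/*.parquet")[:3])

# Main branch: dataset-specific, arbitrary folder hierarchy
print(fs.glob(f"{repo}/**/*.parquet")[:3])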


lhoestq commented Nov 15, 2024

You can use the datasets library to list the files of a given dataset:

>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("mlfoundations/dclm-baseline-1.0-parquet")
>>> builder.config.data_files
{
    NamedSplit('train'): [
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet',
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000001_processed.parquet',
        ...
    ]
}
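For completeness, a small sketch iterating that mapping (note that each URL is pinned to a specific revision hash after the @):

from datasets import load_dataset_builder

# config.data_files maps each split to the resolved hf:// URLs on the main branch
builder = load_dataset_builder("mlfoundations/dclm-baseline-1.0-parquet")
for split, files in builder.config.data_files.items():
    print(split, "->", len(files), "files")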


fylux commented Nov 15, 2024

Thanks @lhoestq, that's exactly what I was looking for. Combined with listing the configs, this should cover the mapping config/split -> paths in the main branch:

from datasets import load_dataset_builder, get_dataset_config_names

get_dataset_config_names("ai4bharat/sangraha")  # ['verified', 'unverified', 'synthetic']
builder = load_dataset_builder("ai4bharat/sangraha", "synthetic")
builder.config.data_files

One challenge we will probably face is that the data_files are listed individually (without globs), so listing each file could lead to huge Croissant files.
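One rough mitigation, sketched with a hypothetical helper (not what dataset-viewer does), is to collapse each split's file list into a single glob via the longest common directory prefix:

import os

def files_to_glob(files: list[str]) -> str:
    # Hypothetical helper: collapse a list of parquet paths into one glob
    if len(files) == 1:
        return files[0]
    prefix = os.path.commonprefix(files)
    # Trim back to the last directory separator so the pattern stays valid
    prefix = prefix[: prefix.rfind("/") + 1]
    return prefix + "**/*.parquet"

For the two dclm shards listed above, this collapses to a single pattern ending in local-shard_0_of_10/**/*.parquet.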


lhoestq commented Nov 15, 2024

Alternatively, we can rely on the dataset-compatible-libraries job in dataset-viewer, which creates code snippets, e.g. for Dask, using a glob pattern for the parquet files. The glob pattern is computed using heuristics.

For example, for dclm it obtains this code snippet and glob:

import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/mlfoundations/dclm-baseline-1.0-parquet/**/*_train/**")

So we could just reuse the glob, which is stored alongside the generated code snippet (no need to parse the code).
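That job's output can be fetched from the dataset viewer API; here is a sketch assuming the /compatible-libraries endpoint and its current response shape (the libraries / loading_codes / arguments field names are an assumption):

import requests

resp = requests.get(
    "https://datasets-server.huggingface.co/compatible-libraries",
    params={"dataset": "mlfoundations/dclm-baseline-1.0-parquet"},
).json()

# Each library entry carries loading code plus the arguments used to build it;
# for Dask those arguments should include the computed glob (field names assumed)
for lib in resp.get("libraries", []):
    if lib.get("library") == "dask":
        for loading_code in lib.get("loading_codes", []):
            print(loading_code.get("arguments"))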
