
Croissant refers to incomplete parquet branch in native parquet datasets #3101

Open
fylux opened this issue Nov 7, 2024 · 6 comments


fylux commented Nov 7, 2024

The Croissant file exposed by Hugging Face seems to correspond to the parquet branch of the dataset, even when the dataset is native parquet.

IIUC, the parquet branch is not complete for datasets larger than 5GB (not exactly, since the 5GB limit applies per split), so the branch can often be incomplete for large datasets. There are exceptions, though, where the parquet branch does seem complete.

Instead, there should be a way to retrieve a Croissant file that refers to the main, native-parquet branch. For backward compatibility it might be better to expose both Croissant files (parquet branch and main branch), although exposing only the "complete" one could also be an option.
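For reference, the Croissant record can be fetched from the public /api/datasets/<id>/croissant endpoint; here is a minimal sketch to see which branch the file URLs point at (the dataset name is just an example, and the distribution/contentUrl fields follow the Croissant spec):

import requests

# Fetch the Croissant metadata for an example dataset
url = "https://huggingface.co/api/datasets/mlfoundations/dclm-baseline-1.0-parquet/croissant"
croissant = requests.get(url).json()

# The distribution entries currently point at the auto-converted branch,
# visible as "refs%2Fconvert%2Fparquet" in the contentUrl values
for dist in croissant.get("distribution", []):
    print(dist.get("contentUrl"))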


fylux commented Nov 8, 2024

@lhoestq is there anything we can contribute to fix this, or does it need to be done on the Hugging Face server?


lhoestq commented Nov 8, 2024

Hi ! Yes, if a dataset is already in parquet, the Croissant file doesn't need to point to the parquet branch (which may contain incomplete data). You can maybe check https://github.com/huggingface/dataset-viewer/blob/main/services/worker/src/worker/job_runners/dataset/croissant_crumbs.py and see if you can adapt the code for this case.


fylux commented Nov 12, 2024

My impression is that the required change goes deeper than croissant_crumbs.py.

That file assumes it already has a dataset info (containing configs and splits), and IIUC the dataset_info is retrieved from the parquet branch (or, in general, there is a clear mapping dataset_info -> parquet branch folder hierarchy). The first step would be to specify how the configs and splits are distributed across the folders of the main branch, which, unlike the parquet branch, doesn't follow a predefined structure.

For example, https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet/tree/refs%2Fconvert%2Fparquet (both layouts are listed in the sketch below):

  • Parquet branch (predictable config/split layout):
    default/partial-train/0000.parquet
  • Main branch (arbitrary folder hierarchy):
    filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet
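To make the contrast concrete, here is a sketch using huggingface_hub's HfFileSystem, which accepts an @revision suffix in paths; the glob patterns are assumptions about where the parquet files sit:

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
repo = "datasets/mlfoundations/dclm-baseline-1.0-parquet"

# Auto-converted branch: predictable <config>/<split>/ layout
print(fs.glob(f"{repo}@refs%2Fconvert%2Fparquet/**/*.parquet")[:3])

# Main branch: dataset-specific, arbitrary folder hierarchy
print(fs.glob(f"{repo}/**/*.parquet")[:3])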


lhoestq commented Nov 15, 2024

You can use the datasets library to list the files of a given dataset:

>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("mlfoundations/dclm-baseline-1.0-parquet")
>>> builder.config.data_files
{
    NamedSplit('train'): [
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet',
        'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000001_processed.parquet',
        ...
    ]
}
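For completeness, a small sketch iterating that mapping (note that each URL is pinned to a specific revision hash after the @):

from datasets import load_dataset_builder

# config.data_files maps each split to the resolved hf:// URLs on the main branch
builder = load_dataset_builder("mlfoundations/dclm-baseline-1.0-parquet")
for split, files in builder.config.data_files.items():
    print(split, "->", len(files), "files")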


fylux commented Nov 15, 2024

Thanks @lhoestq, that's exactly what I was looking for. Combined with listing the configs, this should cover the mapping config/split -> paths in the main branch:

from datasets import load_dataset_builder, get_dataset_config_names

get_dataset_config_names("ai4bharat/sangraha")  # ['verified', 'unverified', 'synthetic']
builder = load_dataset_builder("ai4bharat/sangraha", "synthetic")
builder.config.data_files

One challenge we will probably face is that the data_files are listed individually (without globs), so listing each file could lead to huge Croissant files.
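One rough mitigation, sketched with a hypothetical helper (not what dataset-viewer does), is to collapse each split's file list into a single glob via the longest common directory prefix:

import os

def files_to_glob(files: list[str]) -> str:
    # Hypothetical helper: collapse a list of parquet paths into one glob
    if len(files) == 1:
        return files[0]
    prefix = os.path.commonprefix(files)
    # Trim back to the last directory separator so the pattern stays valid
    prefix = prefix[: prefix.rfind("/") + 1]
    return prefix + "**/*.parquet"

For the two dclm shards listed above, this collapses to a single pattern ending in local-shard_0_of_10/**/*.parquet.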


lhoestq commented Nov 15, 2024

Alternatively, we can rely on the dataset-compatible-libraries job in dataset-viewer, which creates code snippets, e.g. for Dask, using a glob pattern for the parquet files. The glob pattern is computed using heuristics.

For example, for dclm it obtains this code snippet and glob:

import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/mlfoundations/dclm-baseline-1.0-parquet/**/*_train/**")

So we could just reuse the glob, which is stored alongside the generated code snippet (no need to parse the code).
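That job's output can be fetched from the dataset viewer API; here is a sketch assuming the /compatible-libraries endpoint and its current response shape (the libraries / loading_codes / arguments field names are an assumption):

import requests

resp = requests.get(
    "https://datasets-server.huggingface.co/compatible-libraries",
    params={"dataset": "mlfoundations/dclm-baseline-1.0-parquet"},
).json()

# Each library entry carries loading code plus the arguments used to build it;
# for Dask those arguments should include the computed glob (field names assumed)
for lib in resp.get("libraries", []):
    if lib.get("library") == "dask":
        for loading_code in lib.get("loading_codes", []):
            print(loading_code.get("arguments"))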
