This directory contains links to download the datasets used in this repository, supporting the article "Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics".
The following commands can be used to download full data directories (tested on Linux):
# Select the test data set to download
DATASET_NAME="testUndersampled_HLF_features.parquet"
#DATASET_NAME="testUndersampled_HLF_features.tfrecord"
#DATASET_NAME="testUndersampled_InclusiveClassifier.parquet"
#DATASET_NAME="testUndersampled_InclusiveClassifier.tfrecord"
#DATASET_NAME="testUndersampled.parquet"
# wget flags: -r recursive, -np don't ascend to the parent directory, -nH no host-prefixed
# directories, -R skip index files, -e robots=off ignore robots.txt
wget -r -np -nH -R "index.html*" -e robots=off http://sparkdltrigger.web.cern.ch/sparkdltrigger/$DATASET_NAME
# Download the corresponding training data sets
DATASET_NAME="trainUndersampled_HLF_features.parquet"
#DATASET_NAME="trainUndersampled_HLF_features.tfrecord"
#DATASET_NAME="trainUndersampled_InclusiveClassifier.parquet"
#DATASET_NAME="trainUndersampled_InclusiveClassifier.tfrecord"
#DATASET_NAME="trainUndersampled.parquet"
wget -r -np -nH -R "index.html*" -e robots=off http://sparkdltrigger.web.cern.ch/sparkdltrigger/$DATASET_NAME
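As a quick sanity check after downloading, the Parquet directories can be inspected with PySpark. This is a minimal sketch, assuming a local PySpark installation; the path matches what the wget commands above produce (a sparkdltrigger/ directory under the current working directory):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-check").getOrCreate()

# Point this at any of the downloaded Parquet directories
df = spark.read.parquet("sparkdltrigger/testUndersampled_HLF_features.parquet")
df.printSchema()         # should match the schemas documented below
print(df.count())        # test datasets have 86K rows
df.show(3, truncate=60)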
For the largest datasets (raw data and the output of the first step of pre-processing), we have currently uploaded
only representative samples. The full dataset is expected to be made available via CERN Open Data.
Datasets are made available under the terms of the CC0 waiver.
Credits for the original (rawData) dataset go to the authors of "Topology classification with deep learning to improve real-time event selection at the LHC".
Datasets for machine learning, available in Apache Parquet and TFRecord formats, have been produced using the notebooks published in this repository.
Note: If you have access to CERN computing resources, you can contact the authors for
more information on where to find the full datasets, which are available both on the CERN Hadoop platform and on CERN EOS storage.
This is the dataset for the simplest model. Each event contains an array of 14 "High Level Features" (HLF). The classifier has 3 output classes, labeled from 0 to 2. The training dataset has 3.4M rows and the test dataset has 86K rows.
Schema:
|-- HLF_input: array
| |-- element: double
|-- encoded_label: array
| |-- element: double
- HLF features in Apache Parquet format (training and test datasets): trainUndersampled_HLF_features.parquet, testUndersampled_HLF_features.parquet
- HLF features in TFRecord format: trainUndersampled_HLF_features.tfrecord, testUndersampled_HLF_features.tfrecord
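The HLF features test dataset is small enough to convert to NumPy arrays for training a simple classifier. A minimal sketch, assuming PySpark and that the data fits in driver memory (true for the 86K-row test set; the 3.4M-row training set is better served by a distributed reader):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hlf-to-numpy").getOrCreate()
pdf = spark.read.parquet("sparkdltrigger/testUndersampled_HLF_features.parquet").toPandas()

X = np.stack(pdf["HLF_input"].values)      # shape: (n_events, 14)
y = np.stack(pdf["encoded_label"].values)  # shape: (n_events, 3), one-hot labels
print(X.shape, y.shape)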
This is the complete dataset for training the more complex models (based on GRUs): the "Particle Sequence Classifier" and the "Inclusive Classifier". It is a superset of, and much larger than, the HLF features dataset described above, as it also contains the large particle arrays used by the GRU-based models. The training dataset has 3.4M rows and the test dataset has 86K rows. HLF_input arrays contain 14 elements (high-level features). GRU_input arrays have shape (801, 19): each contains a list of 801 particles with 19 "low-level" features per particle. The classifier has 3 output classes, labeled from 0 to 2. A shape-check sketch follows the schema below.
Schema:
|-- hfeatures: vector
|-- label: long
|-- lfeatures: array
| |-- element: array
| | |-- element: double
|-- hfeatures_dense: vector
|-- encoded_label: vector
|-- HLF_input: vector
|-- GRU_input: array
| |-- element: array
| | |-- element: double
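To verify the documented shapes, a single event can be inspected with PySpark. A sketch, assuming the full test dataset has been downloaded as shown above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inclusive-check").getOrCreate()
df = spark.read.parquet("sparkdltrigger/testUndersampled.parquet")

row = df.select("HLF_input", "GRU_input", "encoded_label").first()
print(len(row["HLF_input"]))      # 14 high-level features
print(len(row["GRU_input"]))      # 801 particles per event
print(len(row["GRU_input"][0]))   # 19 low-level features per particle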
- Sample dataset with 2k events in Apache Parquet format:
- 162 MB: testUndersampled_2kevents.parquet, a sample of the test dataset with all the features, produced by the filtering and feature engineering steps, in Apache Parquet format
- Full datasets in Apache Parquet format:
- 255 GB: trainUndersampled.parquet
- 64 GB: testUndersampled.parquet
- Prepared for the Inclusive Classifier with Petastorm and TensorFlow/PyTorch, in Parquet and TFRecord formats.
  Note: this dataset is derived from the full Parquet dataset. It contains the same number of rows, but only the 3 fields needed by the Inclusive Classifier with TensorFlow: HLF_input, GRU_input and encoded_label (see the reading sketch after this list).
- In TFRecord format:
- 195 GB: trainUndersampled_InclusiveClassifier.tfrecord
- 49 GB: testUndersampled_InclusiveClassifier.tfrecord
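The TFRecord files can be consumed directly with tf.data. The sketch below makes assumptions about the serialized layout (fixed-length float features, GRU_input flattened to 801*19 values, Spark-style part files inside the .tfrecord directory); adapt the feature specification to match how the notebooks in this repository actually wrote the records:

import tensorflow as tf

# Assumed feature specification; field names are taken from the schema above
feature_spec = {
    "HLF_input": tf.io.FixedLenFeature([14], tf.float32),
    "GRU_input": tf.io.FixedLenFeature([801 * 19], tf.float32),
    "encoded_label": tf.io.FixedLenFeature([3], tf.float32),
}

def parse(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    gru = tf.reshape(parsed["GRU_input"], (801, 19))  # restore (particles, features)
    return (gru, parsed["HLF_input"]), parsed["encoded_label"]

# The part-file glob is an assumption about how Spark wrote the directory
files = tf.data.Dataset.list_files(
    "sparkdltrigger/testUndersampled_InclusiveClassifier.tfrecord/part*")
dataset = tf.data.TFRecordDataset(files).map(parse).batch(128)

The Parquet variant of the same dataset can instead be fed to TensorFlow or PyTorch through Petastorm (for example with petastorm.make_batch_reader), as noted above.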
Only a sample of the raw data is provided at present. The full dataset used in this work occupies 4.5 TB.
- 14 GB: lepFilter_rawData_SAMPLE
Only a sample of the data (the output of the first pre-processing step) is provided at present; the full dataset occupies 943 GB.