Data

This directory contains links to download the datasets used in this repository, supporting the article "Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics".

How to download the datasets using wget

This technique can be used to download full data directories (tested on Linux):

# Select the test data set to download
DATASET_NAME="testUndersampled_HLF_features.parquet"
#DATASET_NAME="testUndersampled_HLF_features.tfrecord"
#DATASET_NAME="testUndersampled_InclusiveClassifier.parquet"
#DATASET_NAME="testUndersampled_InclusiveClassifier.tfrecord"
#DATASET_NAME="testUndersampled.parquet"

wget -r -np -nH -R "index.html*" -e robots=off http://sparkdltrigger.web.cern.ch/sparkdltrigger/$DATASET_NAME

# Download the corresponding training data sets
DATASET_NAME="trainUndersampled_HLF_features.parquet"
#DATASET_NAME="trainUndersampled_HLF_features.tfrecord"
#DATASET_NAME="trainUndersampled_InclusiveClassifier.parquet"
#DATASET_NAME="trainUndersampled_InclusiveClassifier.tfrecord"
#DATASET_NAME="trainUndersampled.parquet"

wget -r -np -nH -R "index.html*" -e robots=off http://sparkdltrigger.web.cern.ch/sparkdltrigger/$DATASET_NAME

Notes

• For the largest datasets (raw data and the output of the first step of pre-processing) we have currently uploaded only representative samples. The full datasets are expected to be made available via CERN Open Data. Datasets are made available under the terms of the CC0 waiver.
• Credits for the original (rawData) dataset go to the authors of "Topology classification with deep learning to improve real-time event selection at the LHC".
• The datasets for machine learning, available in Apache Parquet and TFRecord formats, have been produced using the notebooks published in this repository.
• Note: if you have access to CERN computing resources, you can contact the authors for more information on where to find the full datasets, which are available both on the CERN Hadoop platform and on CERN EOS storage.

HLF Features

This is the dataset for the simplest model. Each event contains an array of 14 "High Level Features" (HLF). The classifier has 3 output classes, labeled from 0 to 2. The training dataset has 3.4M rows and the test dataset has 86K rows.

Schema:
 |-- HLF_input: array 
 |    |-- element: double 
 |-- encoded_label: array 
 |    |-- element: double 
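As an illustration, the following sketch shows one way to inspect the HLF features dataset after downloading it with the wget commands above. It assumes PySpark is available; the local path is only an example and should be adjusted to wherever the Parquet directory was saved.

# Minimal sketch for inspecting the HLF features dataset (assumes PySpark is installed;
# the path below is an example and should point to the downloaded Parquet directory)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hlf_features_inspect").getOrCreate()

df = spark.read.parquet("sparkdltrigger/trainUndersampled_HLF_features.parquet")
df.printSchema()       # expect HLF_input and encoded_label as arrays of doubles
print(df.count())      # roughly 3.4M rows for the training dataset
df.select("HLF_input", "encoded_label").show(3, truncate=False)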

Low Level Features for GRU-based model

This is the complete dataset for training the more complex models (based on GRU): the "Particle Sequence Classifier" and the "Inclusive Classifier". It is a superset of, and much larger than, the HLF Features dataset described above, as it also contains the large arrays of particles used by the GRU model. The training dataset has 3.4M rows and the test dataset has 86K rows. HLF_input arrays contain 14 elements (high-level features). GRU_input arrays have shape (801, 19): each contains a list of 801 particles with 19 "low level" features per particle. The classifier has 3 output classes, labeled from 0 to 2.

Schema:
 |-- hfeatures: vector
 |-- label: long 
 |-- lfeatures: array
 |    |-- element: array
 |    |    |-- element: double
 |-- hfeatures_dense: vector
 |-- encoded_label: vector 
 |-- HLF_input: vector
 |-- GRU_input: array 
 |    |-- element: array
 |    |    |-- element: double
  • Sample dataset with 2k events in Apache Parquet format:
    • 162 MB: testUndersampled_2kevents.parquet. Contains a sample of the test dataset with all the features, produced by the filtering and feature engineering steps.
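A similar sketch for the Inclusive Classifier dataset is shown below; again PySpark is assumed and the local path is only an example. It checks that the GRU_input arrays have the (801, 19) shape described above.

# Minimal sketch for inspecting the Inclusive Classifier dataset (PySpark assumed;
# the path is an example and should point to the downloaded Parquet directory)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inclusive_classifier_inspect").getOrCreate()

df = spark.read.parquet("sparkdltrigger/testUndersampled_InclusiveClassifier.parquet")
df.printSchema()

# Check the shape of the low-level feature arrays for one event:
# 801 particles, each with 19 low-level features
row = df.select("GRU_input", "encoded_label").first()
print(len(row["GRU_input"]), len(row["GRU_input"][0]))   # expect: 801 19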

Low Level Features in Apache Parquet format

Low Level Features in TFRecord format
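The TFRecord versions of the datasets can be inspected directly with TensorFlow. The sketch below is only an illustration: the directory path and part-file glob pattern are examples (mimicking the part files typically written by Spark), and the exact feature names and types stored in the records are determined by the conversion notebooks in this repository, so the first record is parsed generically to discover them.

# Minimal sketch for inspecting a TFRecord dataset (TensorFlow assumed; the path and
# part-file pattern are examples and depend on how the files were written and downloaded)
import glob
import tensorflow as tf

files = glob.glob("sparkdltrigger/testUndersampled_HLF_features.tfrecord/part-*")
dataset = tf.data.TFRecordDataset(files)

# Parse the first serialized record to see the stored feature names and types
for raw_record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)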

Raw Data - SAMPLE

Only a sample of the raw data is provided at present. The full dataset used by this work occupies 4.5 TB.

Output of the First Data Processing Step - SAMPLE

Only a sample of the data is provided at present. The full dataset occupies 943 GB.