An implementation of ROOT I/O designed to read TTrees into Spark DataFrames. It consists of the following three components:
- DataSource - Spark DataSourceV2 implementation
- ArrayInterpretation - Accepts raw TBasket byte ranges and returns deserialized arrays
- root_proxy - Deserializes ROOT metadata to locate TBasket byte ranges
The scope of this project is limited to vectorized (i.e. column-based) reads of TTrees consisting of relatively simple branches: fundamental numeric types and fixed-length or jagged arrays of those types.
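To make that mapping concrete, a scalar numeric branch would surface as a plain numeric column, while a fixed-length or jagged array branch would surface as an array column. The sketch below illustrates what such a schema could look like in PySpark; the branch names and exact Spark types are illustrative assumptions, not output produced by this library.

```python
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, FloatType, ArrayType)

# Illustrative sketch only: branch names and types are hypothetical.
# It shows how simple TTree branches could map onto Spark columns.
example_schema = StructType([
    StructField("nMuon", IntegerType()),             # scalar numeric branch
    StructField("Muon_pt", ArrayType(FloatType())),  # jagged array branch
])
print(example_schema.simpleString())
```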
Note that the most recent version number can be found here. To use a different version, replace `1.0.0` below with your desired version.
```python
import pyspark.sql

# Pull Laurelin from Maven Central and start a local Spark session
spark = pyspark.sql.SparkSession.builder \
    .master("local[1]") \
    .config('spark.jars.packages', 'edu.vanderbilt.accre:laurelin:1.0.0') \
    .getOrCreate()
sc = spark.sparkContext

# Read the TTree named "tree" from the ROOT file into a DataFrame
df = spark.read.format('root') \
    .option("tree", "tree") \
    .load('small-flat-tree.root')
df.printSchema()
```
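Once the DataFrame is loaded, ordinary PySpark operations apply. The snippet below is only a sketch: the branch names are hypothetical and should be replaced with branches present in your own tree (inspect them with `df.printSchema()` above).

```python
# Branch names here are hypothetical -- substitute names from df.printSchema().
df.count()                          # number of entries in the TTree
df.select("Muon_pt").show(5)        # array branches appear as array columns
df.filter(df["nMuon"] > 0).count()  # standard DataFrame operations work as usual
```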
- The I/O is currently completely unoptimized -- there is no caching or prefetching. Remote reads will be slow as a consequence.
- Arrays (both fixed and jagged) of booleans return the wrong result
- Float16/Double32 types are currently not supported
- String types are currently not supported
- C++ standard library types are currently not supported (most importantly, std::vector)