This is a performance testing framework for Spark SQL in Apache Spark 2.2+.
Note: This README is still under development. Please also check our source code for more information.
$sbt assemby
$ spark-submit target/scala-2.11/spark-sql-perf-assembly*.jar --help
spark-sql-perf 0.5.1-SNAPSHOT
Usage: spark-sql-perf [options]
-b, --benchmark <value> the name of the benchmark to run
--format <value> Format to use to read data (table is a hive table and not a data source)
-d, --dataset <value> dataset to find data in
-p, --path <value> HCFS directory holding data
-f, --filter <value> a filter on the name of the queries to run
-i, --iterations <value>
the number of iterations to run
-o, --output <value> HCFS directory containing the output report
-c, --compare <value> the timestamp of the baseline experiment to compare with
--help prints this usage text
$ spark-submit target/scala-2.11/spark-sql-perf-assembly*.jar --benchmark DatasetPerformance
Use sbt package
or sbt assembly
to build the library jar.
Use sbt +package
to build for scala 2.11 and 2.12.
The framework contains twelve benchmarks that can be executed in local mode. They are organized into three classes and target different components and functions of Spark:
- DatasetPerformance compares the performance of the old RDD API with the new Dataframe and Dataset APIs.
These benchmarks can be launched with the command
bin/run --benchmark DatasetPerformance
- JoinPerformance compares the performance of joining different table sizes and shapes with different join types.
These benchmarks can be launched with the command
bin/run --benchmark JoinPerformance
- AggregationPerformance compares the performance of aggregating different table sizes using different aggregation types.
These benchmarks can be launched with the command
bin/run --benchmark AggregationPerformance
To run MLlib tests, run /bin/run-ml yamlfile
, where yamlfile
is the path to a YAML configuration
file describing tests to run and their parameters.
Before running any query, a dataset needs to be setup by creating a Benchmark
object. Generating
the TPCDS data requires dsdgen built and available on the machines. We have a fork of dsdgen that
you will need. The fork includes changes to generate TPCDS data to stdout, so that this library can
pipe them directly to Spark, without intermediate files. Therefore, this library will not work with
the vanilla TPCDS kit.
TPCDS kit needs to be installed on all cluster executor nodes under the same path!
It can be found here.
import com.databricks.spark.sql.perf.tpcds.TPCDSTables
// Set:
val rootDir = ... // root directory of location to create data in.
val databaseName = ... // name of database to create.
val scaleFactor = ... // scaleFactor defines the size of the dataset to generate (in GB).
val format = ... // valid spark format like parquet "parquet".
// Run:
val tables = new TPCDSTables(sqlContext,
dsdgenDir = "/tmp/tpcds-kit/tools", // location of dsdgen
scaleFactor = scaleFactor,
useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
useStringForDate = false) // true to replace DateType with StringType
tables.genData(
location = rootDir,
format = format,
overwrite = true, // overwrite the data that is already there
partitionTables = true, // create the partitioned fact tables
clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.
filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
tableFilter = "", // "" means generate all tables
numPartitions = 100) // how many dsdgen partitions to run - number of input tasks.
// Create the specified database
sql(s"create database $databaseName")
// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)
// Or, if you want to create temporary tables
// tables.createTemporaryTables(location, format)
// For CBO only, gather statistics on all columns:
tables.analyzeTables(databaseName, analyzeColumns = true)
After setup, users can use runExperiment
function to run benchmarking queries and record query execution time. Taking TPC-DS as an example, you can start an experiment by using
$ spark-submit target/scala-2.11/spark-sql-perf-assembly*.jar --benchmark tpcds.TPCDS
While the experiment is running you can use experiment.html
to get a summary, or experiment.getCurrentResults
to get complete current results.
Once the experiment is complete, you can still access experiment.getCurrentResults
, or you can load the results from disk.
// Get all experiments results.
val resultTable = spark.read.json(resultLocation)
resultTable.createOrReplaceTempView("sqlPerformance")
sqlContext.table("sqlPerformance")
// Get the result of a particular run by specifying the timestamp of that run.
sqlContext.table("sqlPerformance").filter("timestamp = 1429132621024")
// or
val specificResultTable = spark.read.json(experiment.resultPath)
You can get a basic summary by running:
experiment.getCurrentResults // or: spark.read.json(resultLocation).filter("timestamp = 1429132621024")
.withColumn("Name", substring(col("name"), 2, 100))
.withColumn("Runtime", (col("parsingTime") + col("analysisTime") + col("optimizationTime") + col("planningTime") + col("executionTime")) / 1000.0)
.select('Name, 'Runtime)
TPC-H can be run similarly to TPC-DS replacing tpcds
for tpch
.
Take a look at the data generator and tpch_run
notebook code below.
There are example notebooks in src/main/notebooks
for running TPCDS and TPCH in the Databricks environment.
These scripts can also be run from spark-shell command line with minor modifications using :load file_name.scala
.
This notebook (or scala script) can be use to generate both TPCDS and TPCH data at selected scale factors.
It is a newer version from the tpcds_datagen
notebook below. To use it:
- Edit the config variables the top of the script.
- Run the whole notebook.
This notebook can be used to install dsdgen on all worker nodes, run data generation, and create the TPCDS database.
Note that because of the way dsdgen is installed, it will not work on an autoscaling cluster, and num_workers
has
to be updated to the number of worker instances on the cluster.
Data generation may also break if any of the workers is killed - the restarted worker container will not have dsdgen
anymore.
This notebook can be used to run TPCDS queries.
For running parallel TPCDS streams:
- Create a Cluster and attach the spark-sql-perf library to it.
- Create a Job using the notebook and attaching to the created cluster as "existing cluster".
- Allow concurrent runs of the created job.
- Launch appriopriate number of Runs of the Job to run in parallel on the cluster.
This notebook can be used to run TPCH queries. Data needs be generated first.