In this document, we will show you the steps to submit a simple FeatHub job to a standalone Spark cluster. The job consumes data from the local filesystem, computes a feature, and prints out the result.
- Unix-like operating system (e.g. Linux, Mac OS X)
- Python 3.7/3.8/3.9
- Java 8
Download Spark 3.3.1 (pre-built for Hadoop 3), then extract the archive:
$ curl -LO https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
$ tar -xzf spark-3.3.1-bin-hadoop3.tgz
You can deploy a standalone Spark cluster, consisting of one master and one worker, in your local environment with the following command.
$ ./spark-3.3.1-bin-hadoop3/sbin/start-all.sh
You should then be able to navigate to the web UI at http://localhost:8080 to view the Spark master's dashboard and verify that the cluster is up and running.
Install the FeatHub nightly build with Spark support:

$ python -m pip install --upgrade "feathub-nightly[spark]"
Execute the following command from the root directory of the FeatHub repository to run the nyc_taxi_spark_client.py demo.
$ python python/feathub/examples/nyc_taxi_spark_client.py
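In case you are curious what such a job looks like in code, below is a minimal sketch of a FeatHub job targeting the standalone cluster started above. It is not the demo's exact code: the master URL `spark://localhost:7077`, the CSV path, and the column names are illustrative assumptions, and the exact FeatHub API may vary across nightly builds.

```python
from feathub.common import types
from feathub.feathub_client import FeathubClient
from feathub.feature_tables.sources.file_system_source import FileSystemSource
from feathub.feature_views.derived_feature_view import DerivedFeatureView
from feathub.feature_views.feature import Feature
from feathub.table.schema import Schema

# Point the Spark processor at the standalone cluster. The master URL is an
# assumption; check the dashboard at http://localhost:8080 for the actual one.
client = FeathubClient(
    props={
        "processor": {
            "type": "spark",
            "spark": {"master": "spark://localhost:7077"},
        },
        "registry": {
            "type": "local",
            "local": {"namespace": "default"},
        },
        "feature_service": {
            "type": "local",
            "local": {},
        },
    }
)

# Describe a CSV file on the local filesystem. The path and the columns below
# are hypothetical placeholders, not the demo's actual dataset.
schema = (
    Schema.new_builder()
    .column("name", types.String)
    .column("cost", types.Float64)
    .column("distance", types.Float64)
    .build()
)
source = FileSystemSource(
    name="sample_source",
    path="/tmp/sample.csv",
    data_format="csv",
    schema=schema,
)

# Define a derived feature with a FeatHub expression, then compute it on the
# Spark cluster and print the result.
feature_view = DerivedFeatureView(
    name="sample_features",
    source=source,
    features=[
        Feature(
            name="cost_per_mile",
            dtype=types.Float64,
            transform="cost / distance",
        ),
    ],
    keep_source_fields=True,
)

print(client.get_features(feature_view).to_pandas())
```

When `get_features(...).to_pandas()` runs, the Spark processor translates the feature view into a Spark job, submits it to the cluster, and collects the result into a pandas DataFrame, which mirrors the consume-compute-print flow the demo performs on the NYC taxi dataset.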