Skip to content

johnmuller87/spark-udf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WBAA Spark UDF

Introduction

This projects contains an example Scala UDF function, for use in PySpark.

Why use a Scala UDF?

Native Spark UDFs written in Python are slow, because they have to be executed in a Python process, rather than a JVM-based Spark Executor. For a Spark Executor to run a Python UDF, it must:

  • send data from the partition over to a Python process associated with the Executor, and
  • wait for the Python process to deserialize the data, run the UDF on it, reserialize the data, and send it back.

By contrast, a Spark Scala UDF, whether written in Scala or Java, can be executed in the Executor JVM, even if the DataFrame logic is in Python.

Building

To build the jar file, use this command:

$ sbt clean assembly

That command will download the dependencies (if they haven't already been downloaded), compile the code, run the unit tests, and create a jar files for Scala 2.11. That jars will be:

  • Scala 2.11: target/scala-2.11/spark-udf-assembly-0.2.0.jar

Using UDF in Spark

You can now register the UDF in Spark with the following line:

spark.udf.registerJavaFunction("ValidateIBAN", "com.ing.wbaa.spark.udf.ValidateIBAN", T.BooleanType())

You can now use the function as you would any other function:

spark.sql("""SELECT ValidateIBAN('NL20INGB0001234567')""").show()

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages