It will be much easier to run this script if you open your OS's terminal inside the folder where you saved the hello.py file. After you have opened the terminal inside that folder, just run the python3 hello.py command. As a result, Python will execute hello.py, and the text Hello World! should be printed to the terminal:
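Something along these lines (assuming hello.py contains a single print("Hello World!") statement):
Terminal$ python3 hello.py
Hello World!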
Or, a list of multiple strings:
names = [
    "Anne", "Vanse", "Elliot",
    "Carlyle", "Ed", "Memphis"
]
Or a python dict:
product = {
    'name': 'Coca Cola',
    'volume': '2 litters',
}
1.4 Expressions
Python programs are organized in blocks of expressions (or statements). A python expression is a statement that describes an operation to be performed by the program. For example, the expression below describes the sum of 3 and 5:
3 + 5
8
The expression above is composed of numbers (like 3 and 5) and an operator, more specifically, the sum operator (+). But any python expression can include a multitude of different items. It can be composed of functions (like print(), map() and str()), constant strings (like "Hello World!"), logical operators (like !=, <, > and ==), arithmetic operators (like *, /, **, %, - and +), structures (like lists, arrays and dicts) and many other types of commands.
Below we have a more complex example, which contains the def keyword (which starts a function definition; in the example below, the new function being defined is double()), several built-in functions (list(), map() and print()), an arithmetic operator (*), numbers and a list (delimited by the pair of brackets []).
def double(x):
    return x * 2
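The rest of that example would then apply double() over a list through map() and print the result. A minimal sketch, with an illustrative list of numbers (the original values may differ):
numbers = [2, 4, 6, 8]
print(list(map(double, numbers)))
[4, 8, 12, 16]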
Python expressions are evaluated in a sequential manner (from top to bottom of your python file). In other words, python runs the first expression at the top of your file, then goes to the second expression and runs it, then goes to the third expression and runs it, and goes on and on in that way until it hits the end of the file. So, in the example above, python executes the function definition (initiated at def double(x):) before it executes the print() statement, because the print statement is below the function definition.
This order of evaluation is commonly referred to as "control flow" in many programming languages. Sometimes, this order can be a fundamental part of the python program. Meaning that, sometimes, if we change the order of the expressions in the program, we can produce unexpected results (like an error), or change the results produced by the program.
As an example, the program below prints the result 4, because the print statement is executed before the expression x = 40.
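A minimal program consistent with that description (the initial assignment of 4 is implied by the printed result):
x = 4
print(x)
x = 40
4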
But, if we execute the expression x = 40 before the print statement, we then change the result produced by the program.
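For instance, one possible reordering (a sketch; the original example may be arranged slightly differently):
x = 4
x = 40
print(x)
40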
If we go a little further and put the print statement as the first expression of the program, we then get a name error. This error warns us that the object named x is not defined (i.e. it does not exist).
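A sketch of that version, which produces the traceback shown below:
print(x)
x = 4
x = 40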
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
A python package (or a python "library") is basically a set of functions and classes that provides important functionality to solve a specific problem. And pyspark is one of these many python packages available.
Python packages are usually published (that is, made available to the public) through the PyPI archive. If a python package is published in PyPI, then you can easily install it through the pip tool.
To use a python package, you always need to: 1) have this package installed on your machine; 2) import this package in your python script. If a package is not installed on your machine, you will face a ModuleNotFoundError as you try to use it, like in the example below.
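For instance, trying to import pandas on a machine where it is not installed (a sketch of the kind of statement that produces the traceback below):
import pandas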
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
If your program produces this error, it is very likely that you are trying to use a package that is not currently installed on your machine. To install it, you may use the pip install <name of the package> command on the terminal of your OS.
Terminal$ pip install pandas
But if this package is already installed on your machine, then you can just import it into your script. To do this, you just include an import statement at the start of your python file. For example, if I want to use the DataFrame function from the pandas package:
# Now that I installed the `pandas` package with `pip`
# this `import` statement works without any errors:
import pandas
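From there, the DataFrame function is reached through the package name. A small illustrative call (the data passed here is hypothetical):
df = pandas.DataFrame({'x': [1, 2, 3]})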
Therefore, with import pandas I can access any of the functions available in the pandas package by using the dot operator after the name of the package (pandas.<name of the function>). However, it can become very annoying to write pandas. every time you want to access a function from pandas, especially if you use it constantly in your code.
To make life a little easier, python offers some alternative ways to define this import statement. First, you can give an alias to this package that is shorter/easier to write. As an example, nowadays, it is virtually an industry standard to import the pandas package as pd. To do this, you use the as keyword in your import statement. This way, you can access the pandas functionality with pd.<name of the function>:
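A minimal sketch of this aliased import (the DataFrame call is just for illustration):
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})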
In contrast, if you want to make your life even easier and produce cleaner code, you can import (from the package) just the functions that you need to use. In this method, you can eliminate the dot operator and refer directly to the function by its name. To use this method, you include the from keyword in your import statement, like this:
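For example, a sketch using the same pandas function:
from pandas import DataFrame
df = DataFrame({'x': [1, 2, 3]})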
Just to be clear, you can import multiple functions from the package by listing them. Or, if you prefer, you can import all components of the package (or module/sub-module) by using the star shortcut (*):
# Import `search()`, `match()` and `compile()` functions:
from re import search, match, compile
# Import all functions from the `os` package
from os import *
Some packages may be very big and include many different functions and classes. As the size of the package becomes bigger and bigger, developers tend to divide this package into many "modules". In other words, the functions and classes of this python package are usually organized in "modules".
As an example, the pyspark package is a fairly large package that contains many classes and functions. Because of that, the package is organized in a number of modules, such as sql (to access Spark SQL), pandas (to access the pandas API of Spark) and ml (to access Spark MLlib).
To access the functions available in each one of these modules, you use the dot operator between the name of the package and the name of the module. For example, to import all components from the sql and pandas modules of pyspark, you would do this:
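A sketch of those two star imports:
from pyspark.sql import *
from pyspark.pandas import *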
Going further, we can have sub-modules (or modules inside a module) too. As an example, the sql module of pyspark has the functions and window sub-modules. To access these sub-modules, you use the dot operator again:
# Importing `functions` and `window` sub-modules:
import pyspark.sql.functions as F
import pyspark.sql.window as W
1.6 Methods versus Functions
Beginners tend to mix these two types of functions in python, but they are not the same. So let's describe the differences between the two.
Standard python functions are functions that we apply over an object. A classical example is the print() function. You can see in the example below that we are applying print() over the result object.
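A minimal sketch (the value stored in result is hypothetical):
result = 3 + 5
print(result)
8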
Other examples of a standard python function would be map() and list(). See in the example below that we apply the map() function over a set of objects:
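A small illustrative example (the inputs and the function being mapped are assumptions):
numbers = [1, 2, 3]
result = list(map(str, numbers))
print(result)
['1', '2', '3']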
In contrast, a python method is a function registered inside a python class. In other words, this function belongs to the class itself, and cannot be used outside of it. This means that, in order to use a method, you need to have an instance of the class where it is registered.
For example, the startswith() method belongs to the str class (this class is used to represent strings in python). So to use this method, we need to have an instance of this class saved in an object that we can access. Note in the example below that we access the startswith() method through the name object. This means that startswith() is a function. But we cannot use it without an object of class str, like name.
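A minimal sketch (the string stored in name is hypothetical):
name = "Anne"
name.startswith("A")
True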
So, if we have a class called people, and this class has a method called location(), we can use this location() method by using the dot operator (.) with the name of an object of class people. If an object called x is an instance of the people class, then we can do x.location().
But if this object x is of a different class, like int, then we can no longer use the location() method, because this method does not belong to the int class. For example, if your object is from class A, and you try to use a method of class B, you will get an AttributeError.
In the example exposed below, I have an object called number of class int, and I try to use the method startswith() from the str class with this object:
number = 2
# You can see below that the `number` object has class `int`
type(number)
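The check above reports int, and the attempt to use the string method then raises the expected error; a sketch of that call and the resulting traceback:
int
number.startswith("2")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute 'startswith'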
1.7 Identifying classes and their methods
Over the next chapters, you will realize that pyspark programs tend to use more methods than standard functions. So most of the functionality of pyspark resides in class methods. As a result, the capability of understanding the objects that you have in your python program, and identifying their classes and methods, will be crucial while you are developing and debugging your Spark applications.
Every existing object in python represents an instance of a class. In other words, every object in python is associated to a given class. You can always identify the class of an object by applying the type() function over this object. In the example below, we can see that the name object is an instance of the str class.
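A minimal sketch (the string stored in name is hypothetical):
name = "Anne"
type(name)
str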
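You can list every method available in a class by applying the dir() function over an object of that class (or over the class itself), for example:
dir(name)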
['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
We can take the df DataFrame presented in Section 3.4 and look at the class of any column of it, like the id column:
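A sketch of that check (assuming the DataFrame object is named df):
type(df.id)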
pyspark.sql.column.Column
4.1 Building a column object
You can refer to or create a column by using the col() and column() functions from the pyspark.sql.functions module. These functions receive a string input with the name of the column you want to create/refer to.
Their result is always an object of class Column. For example, the code below creates a column called ID:
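A minimal sketch of such a call:
from pyspark.sql.functions import col
id_column = col('ID')
print(id_column)
Column<'ID'>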
4.2 Columns are strongly related to expressions
Many kinds of transformations that we want to apply over a Spark DataFrame are usually described through expressions, and these expressions in Spark are mainly composed of column transformations. That is why the Column class, and its methods, are so important in Apache Spark.
Columns in Spark are so strongly related to expressions that the columns themselves are initially interpreted as expressions. If we look again at the column id from the df DataFrame, Spark will bring an expression as a result, and not the values held by this column.
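For instance, a sketch of accessing that column:
df.id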
Column<'id'>
This result is just an expression, and as a consequence, Spark does not know what the values of column ID are, or where it lives (that is, which DataFrame this column belongs to). For now, Spark is not interested in this information; it just knows that we have an expression referring to a column called ID.
These ideas relate a lot to the lazy aspect of Spark that we talked about in Section 3.5. Spark will not perform any heavy calculation, or show you the actual results/values from your code, until you trigger the calculations with an action (we will talk more about these "actions" in Section 5.2). As a result, when you access a column, Spark will only deliver an expression that represents that column, and not the actual values of that column.
This is handy, because we can store our expressions in variables and reuse them later, in multiple parts of our code. For example, I can keep building and merging a column with different kinds of operators, to build a more complex expression. In the example below, I create an expression that doubles the values of the ID column:
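A sketch of such an expression, built with the col() function shown earlier:
from pyspark.sql.functions import col
double_id = col('ID') * 2   # still just an expression, not computed values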
4.3 Literal values versus expressions
We know now that columns of a Spark DataFrame have a deep connection with expressions. But, on the other hand, there are some situations where you write a value (it can be a string, an integer, a boolean, or anything) inside your pyspark code, and you might actually want Spark to interpret this value as a constant (or literal) value, rather than an expression.
As an example, let's suppose you control the data generated by the sales of five different stores, scattered across different regions of Belo Horizonte city (in Brazil). Now, let's suppose you receive a batch of data generated by the 4th store in the city, which is located at Amazonas Avenue, 324. This batch of data is exposed below:
So how do we force Spark to interpret a value as a literal (or constant) value, rather than an expression? To do this, you must write this value inside the lit() (short for "literal") function from the pyspark.sql.functions module.
In other words, when you write in your code the statement lit(4), Spark understands that you want to create a new column which is filled with 4's; that is, a column filled with the constant integer 4.
With the code below, I am creating two new columns (called store_number and store_address), and adding them to the sales DataFrame.
from pyspark.sql.functions import lit
store_number = lit(4).alias('store_number')
store_address = lit('Amazonas Avenue, 324').alias('store_address')
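These two literal columns can then be attached to the DataFrame. A sketch of one way to do it (the select() call here is just one possible way of adding them):
sales = sales.select('*', store_number, store_address)
sales.show()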
In pyspark, every Spark DataFrame is stored inside a python object of class pyspark.sql.dataframe.DataFrame, or, more succinctly, an object of class DataFrame.
Like any python class, the DataFrame class comes with multiple methods that are available for every object of this class. This means that you can use any of these methods in any Spark DataFrame that you create through pyspark.
As an example, in the code below I expose all the available methods from this DataFrame class. First, I create a Spark DataFrame with spark.range(5) and store it in the object df5. After that, I use the dir() function to show all the methods that I can use through this df5 object:
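A sketch of those two steps (assuming a SparkSession is already available as spark):
df5 = spark.range(5)
dir(df5)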
+['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_collect_as_arrow', '_ipython_key_completions_', '_jcols', '_jdf', '_jmap', '_joinAsOf', '_jseq', '_lazy_rdd', '_repr_html_', '_sc', '_schema', '_session', '_sort_cols', '_sql_ctx', '_support_repr_html', 'agg', 'alias', 'approxQuantile', 'cache', 'checkpoint', 'coalesce', 'colRegex', 'collect', 'columns', 'corr', 'count', 'cov', 'createGlobalTempView', 'createOrReplaceGlobalTempView', 'createOrReplaceTempView', 'createTempView', 'crossJoin', 'crosstab', 'cube', 'describe', 'distinct', 'drop', 'dropDuplicates', 'dropDuplicatesWithinWatermark', 'drop_duplicates', 'dropna', 'dtypes', 'exceptAll', 'explain', 'fillna', 'filter', 'first', 'foreach', 'foreachPartition', 'freqItems', 'groupBy', 'groupby', 'head', 'hint', 'id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal', 'isStreaming', 'is_cached', 'join', 'limit', 'localCheckpoint', 'mapInArrow', 'mapInPandas', 'melt', 'na', 'observe', 'offset', 'orderBy', 'pandas_api', 'persist', 'printSchema', 'randomSplit', 'rdd', 'registerTempTable', 'repartition', 'repartitionByRange', 'replace', 'rollup', 'sameSemantics', 'sample', 'sampleBy', 'schema', 'select', 'selectExpr', 'semanticHash', 'show', 'sort', 'sortWithinPartitions', 'sparkSession', 'sql_ctx', 'stat', 'storageLevel', 'subtract', 'summary', 'tail', 'take', 'to', 'toDF', 'toJSON', 'toLocalIterator', 'toPandas', 'to_koalas', 'to_pandas_on_spark', 'transform', 'union', 'unionAll', 'unionByName', 'unpersist', 'unpivot', 'where', 'withColumn', 'withColumnRenamed', 'withColumns', 'withColumnsRenamed', 'withMetadata', 'withWatermark', 'write', 'writeStream', 'writeTo']
All the methods present in this DataFrame class are commonly referred to as the DataFrame API of Spark. Remember, this is the most important API of Spark, because much of your Spark applications will heavily use this API to compose your data transformations and data flows (Chambers and Zaharia 2018).
3.4 Building a Spark DataFrame
There are some different methods to create a Spark DataFrame. For example, because a DataFrame is basically a Dataset of rows, we can build a DataFrame from a collection of Row objects, through the createDataFrame() method of your Spark Session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from datetime import date
from pyspark.sql import Row

data = [
    Row(id = 1, value = 28.3, date = date(2021,1,1)),
    Row(id = 2, value = 15.8, date = date(2021,1,1)),
    Row(id = 3, value = 20.1, date = date(2021,1,2)),
    Row(id = 4, value = 12.6, date = date(2021,1,3))
]

df = spark.createDataFrame(data)
Remember that a Spark DataFrame in python is an object of class pyspark.sql.dataframe.DataFrame, as you can see below:
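A sketch of that check:
type(df)
pyspark.sql.dataframe.DataFrame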
If you try to see what is inside this kind of object, you will get a small description of the columns present in the DataFrame as a result:
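For instance, evaluating the df object built above yields a summary along these lines (the exact types depend on the inferred schema):
df
DataFrame[id: bigint, value: double, date: date]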
But you can use different methods to create the same Spark DataFrame. As another example, with the code below we are creating a DataFrame called students from two different python lists (data and columns).
The first list (data) is a list of rows. Each row is represented by a python tuple, which contains the values in each column. The second list (columns) contains the names for each column in the DataFrame.
To create the students DataFrame, we deliver these two lists to the createDataFrame() method:
data = [
    (12114, 'Anne', 21, 1.56, 8, 9, 10, 9, 'Economics', 'SC'),
    (13007, 'Adrian', 23, 1.82, 6, 6, 8, 7, 'Economics', 'SC'),
    (10045, 'George', 29, 1.77, 10, 9, 10, 7, 'Law', 'SC'),
    (12459, 'Adeline', 26, 1.61, 8, 6, 7, 7, 'Law', 'SC'),
    (10190, 'Mayla', 22, 1.67, 7, 7, 7, 9, 'Design', 'AR'),
    (11552, 'Daniel', 24, 1.75, 9, 9, 10, 9, 'Design', 'AR')
]

columns = [
    'StudentID', 'Name', 'Age', 'Height', 'Score1',
    'Score2', 'Score3', 'Score4', 'Course', 'Department'
]

students = spark.createDataFrame(data, columns)
students
DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]
3.5 Visualizing a Spark DataFrame
A key aspect of Spark is its laziness. In other words, for most operations, Spark will only check if your code is correct and if it makes sense. Spark will not actually run or execute the operations you are describing in your code unless you explicitly ask for it with a trigger operation, which is called an "action" (this kind of operation is described in Section 5.2).
You can notice this laziness in the output below:
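For instance, simply evaluating the students object created above:
students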
DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]
That is because, when we call for an object that stores a Spark DataFrame (like df and students), Spark will only calculate and print a summary of the structure of your Spark DataFrame, and not the DataFrame itself.
So how can we actually see our DataFrame? How can we visualize the rows and values that are stored inside of it? For this, we use the show() method. With this method, Spark will print the table as pure text, as you can see in the example below:
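For example, a sketch of that call on the students DataFrame:
students.show()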
+---------+-------+---+------+------+------+------+------+---------+----------+
|StudentID|   Name|Age|Height|Score1|Score2|Score3|Score4|   Course|Department|
@@ -516,8 +525,8 @@
By default, this method shows only the top rows of your DataFrame, but you can specify exactly how many rows you want to see by using show(n), where n is the number of rows. For example, I can visualize only the first 2 rows of df like this:
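A sketch of that call:
df.show(2)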
+---+-----+----------+
| id|value|      date|
3.6 Getting the name of the columns
If you need to, you can easily collect a python list with the column names present in your DataFrame, in the same way you would do in a pandas DataFrame. That is, by using the columns attribute of your DataFrame, like this:
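For instance, on the students DataFrame:
students.columns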
['StudentID',
'Name',
'Age',
3.7 Getting the number of rows
If you want to know the number of rows present in a Spark DataFrame, just use the count() method of this DataFrame. As a result, Spark will build this DataFrame and count the number of rows present in it.
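For instance, on the students DataFrame created above (which holds six rows):
students.count()
6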
MapType(keyType, valueType, valueContainsNull): Represents a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
Each one of these Spark data types has a corresponding python class in pyspark; these classes are stored in the pyspark.sql.types module. As a result, to access, let's say, the StringType type, we can do this:
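A minimal sketch of that access:
from pyspark.sql.types import StringType
StringType()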
-
-
+
-
+
-
+
@@ -344,11 +344,11 @@
+
Or, a list of multiple strings:
-
+
names = [
"Anne", "Vanse", "Elliot",
"Carlyle", "Ed", "Memphis"
@@ -360,7 +360,7 @@
+
product = {
'name': 'Coca Cola',
'volume': '2 litters',
@@ -379,7 +379,7 @@
1.4 Expressions
Python programs are organized in blocks of expressions (or statements). A python expression is a statement that describes an operation to be performed by the program. For example, the expression below describes the sum between 3 and 5.
-
+
8
@@ -387,7 +387,7 @@
The expression above is composed of numbers (like 3 and 5) and a operator, more specifically, the sum operator (+
). But any python expression can include a multitude of different items. It can be composed of functions (like print()
, map()
and str()
), constant strings (like "Hello World!"
), logical operators (like !=
, <
, >
and ==
), arithmetic operators (like *
, /
, **
, %
, -
and +
), structures (like lists, arrays and dicts) and many other types of commands.
Below we have a more complex example, that contains the def
keyword (which starts a function definition; in the example below, this new function being defined is double()
), many built-in functions (list()
, map()
and print()
), a arithmetic operator (*
), numbers and a list (initiated by the pair of brackets - []
).
-
+
def double(x):
return x * 2
@@ -399,7 +399,7 @@ Python expressions are evaluated in a sequential manner (from top to bottom of your python file). In other words, python runs the first expression in the top of your file, them, goes to the second expression, and runs it, them goes to the third expression, and runs it, and goes on and on in that way, until it hits the end of the file. So, in the example above, python executes the function definition (initiated at def double(x):
), before it executes the print()
statement, because the print statement is below the function definition.
This order of evaluation is commonly referred as “control flow†in many programming languages. Sometimes, this order can be a fundamental part of the python program. Meaning that, sometimes, if we change the order of the expressions in the program, we can produce unexpected results (like an error), or change the results produced by the program.
As an example, the program below prints the result 4, because the print statement is executed before the expression x = 40
.
-
+
But, if we execute the expression x = 40
before the print statement, we then change the result produced by the program.
-
+
If we go a little further, and, put the print statement as the first expression of the program, we then get a name error. This error warns us that, the object named x
is not defined (i.e. it does not exist).
-
+
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
-
+
@@ -437,16 +437,16 @@ A python package (or a python “libraryâ€) is basically a set of functions and classes that provides important functionality to solve a specific problem. And pyspark
is one of these many python packages available.
Python packages are usually published (that is, made available to the public) through the PyPI archive5. If a python package is published in PyPI, then, you can easily install it through the pip
tool.
To use a python package, you always need to: 1) have this package installed on your machine; 2) import this package in your python script. If a package is not installed in your machine, you will face a ModuleNotFoundError
as you try to use it, like in the example below.
-
+
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
If your program produce this error, is very likely that you are trying to use a package that is not currently installed on your machine. To install it, you may use the pip install <name of the package>
command on the terminal of your OS.
-pip install pandas
+Terminal$ pip install pandas
But, if this package is already installed in your machine, then, you can just import it to your script. To do this, you just include an import
statement at the start of your python file. For example, if I want to use the DataFrame
function from the pandas
package:
-
+
# Now that I installed the `pandas` package with `pip`
# this `import` statement works without any errors:
import pandas
@@ -467,7 +467,7 @@
Therefore, with import pandas
I can access any of the functions available in the pandas
package, by using the dot operator after the name of the package (pandas.<name of the function>
). However, it can become very annoying to write pandas.
every time you want to access a function from pandas
, specially if you use it constantly in your code.
To make life a little easier, python offers some alternative ways to define this import
statement. First, you can give an alias to this package that is shorter/easier to write. As an example, nowadays, is virtually a industry standard to import the pandas
package as pd
. To do this, you use the as
keyword in your import
statement. This way, you can access the pandas
functionality with pd.<name of the function>
:
-
+
In contrast, if you want to make your life even easier and produce a more “clean†code, you can import (from the package) just the functions that you need to use. In this method, you can eliminate the dot operator, and refer directly to the function by its name. To use this method, you include the from
keyword in your import statement, like this:
-
+
Just to be clear, you can import multiple functions from the package, by listing them. Or, if you prefer, you can import all components of the package (or module/sub-module) by using the star shortcut (*
):
-
+
# Import `search()`, `match()` and `compile()` functions:
from re import search, match, compile
# Import all functions from the `os` package
@@ -512,12 +512,12 @@ Some packages may be very big, and includes many different functions and classes. As the size of the package becomes bigger and bigger, developers tend to divide this package in many “modulesâ€. In other words, the functions and classes of this python package are usually organized in “modulesâ€.
As an example, the pyspark
package is a fairly large package, that contains many classes and functions. Because of it, the package is organized in a number of modules, such as sql
(to access Spark SQL), pandas
(to access the Pandas API of Spark), ml
(to access Spark MLib).
To access the functions available in each one of these modules, you use the dot operator between the name of the package and the name of the module. For example, to import all components from the sql
and pandas
modules of pyspark
, you would do this:
-
+
Going further, we can have sub-modules (or modules inside a module) too. As an example, the sql
module of pyspark
have the functions
and window
sub-modules. To access these sub-modules, you use the dot operator again:
-
+
# Importing `functions` and `window` sub-modules:
import pyspark.sql.functions as F
import pyspark.sql.window as W
@@ -527,7 +527,7 @@ 1.6 Methods versus Functions
Beginners tend mix these two types of functions in python, but they are not the same. So lets describe the differences between the two.
Standard python functions, are functions that we apply over an object. A classical example, is the print()
function. You can see in the example below, that we are applying print()
over the result
object.
-
+
@@ -535,7 +535,7 @@
Other examples of a standard python function would be map()
and list()
. See in the example below, that we apply the map()
function over a set of objects:
-
+
@@ -545,7 +545,7 @@
In contrast, a python method is a function registered inside a python class. In other words, this function belongs to the class itself, and cannot be used outside of it. This means that, in order to use a method, you need to have an instance of the class where it is registered.
For example, the startswith()
method belongs to the str
class (this class is used to represent strings in python). So to use this method, we need to have an instance of this class saved in a object that we can access. Note in the example below, that we access the startswith()
method through the name
object. This means that, startswith()
is a function. But, we cannot use it without an object of class str
, like name
.
-
+
@@ -556,7 +556,7 @@ So, if we have a class called people
, and, this class has a method called location()
, we can use this location()
method by using the dot operator (.
) with the name of an object of class people
. If an object called x
is an instance of people
class, then, we can do x.location()
.
But if this object x
is of a different class, like int
, then we can no longer use the location()
method, because this method does not belong to the int
class. For example, if your object is from class A
, and, you try to use a method of class B
, you will get an AttributeError
.
In the example exposed below, I have an object called number
of class int
, and, I try to use the method startswith()
from str
class with this object:
-
+
number = 2
# You can see below that, the `x` object have class `int`
type(number)
@@ -569,7 +569,7 @@ 1.7 Identifying classes and their methods
Over the next chapters, you will realize that pyspark
programs tend to use more methods than standard functions. So most of the functionality of pyspark
resides in class methods. As a result, the capability of understanding the objects that you have in your python program, and, identifying its classes and methods will be crucial while you are developing and debugging your Spark applications.
Every existing object in python represents an instance of a class. In other words, every object in python is associated to a given class. You can always identify the class of an object, by applying the type()
function over this object. In the example below, we can see that, the name
object is an instance of the str
class.
-
+
@@ -577,7 +577,7 @@
+
['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
diff --git a/docs/Chapters/04-columns.html b/docs/Chapters/04-columns.html
index 1f6f9eb7..eebc1925 100644
--- a/docs/Chapters/04-columns.html
+++ b/docs/Chapters/04-columns.html
@@ -285,7 +285,7 @@ 4 Section 3.4, and look at the class of any column of it. Like the id
column:
-
+
pyspark.sql.column.Column
@@ -295,7 +295,7 @@ 4 4.1 Building a column object
You can refer to or create a column, by using the col()
and column()
functions from pyspark.sql.functions
module. These functions receive a string input with the name of the column you want to create/refer to.
Their result are always a object of class Column
. For example, the code below creates a column called ID
:
-
+
@@ -308,7 +308,7 @@ 4.2 Columns are strongly related to expressions
Many kinds of transformations that we want to apply over a Spark DataFrame, are usually described through expressions, and, these expressions in Spark are mainly composed by column transformations. That is why the Column
class, and its methods, are so important in Apache Spark.
Columns in Spark are so strongly related to expressions that the columns themselves are initially interpreted as expressions. If we look again at the column id
from df
DataFrame, Spark will bring a expression as a result, and not the values hold by this column.
-
+
Column<'id'>
@@ -317,7 +317,7 @@ is just an expression, and as consequence, Spark does not know which are the values of column ID
, or, where it lives (which is the DataFrame that this column belongs?). For now, Spark is not interested in this information, it just knows that we have an expression referring to a column called ID
.
These ideas relates a lot to the lazy aspect of Spark that we talked about in Section 3.5. Spark will not perform any heavy calculation, or show you the actual results/values from you code, until you trigger the calculations with an action (we will talk more about these “actions†on Section 5.2). As a result, when you access a column, Spark will only deliver a expression that represents that column, and not the actual values of that column.
This is handy, because we can store our expressions in variables, and, reuse it latter, in multiple parts of our code. For example, I can keep building and merging a column with different kinds of operators, to build a more complex expression. In the example below, I create a expression that doubles the values of ID
column:
-
+
@@ -327,7 +327,7 @@
+
@@ -339,7 +339,7 @@ 4.3 Literal values versus expressions
We know now that columns of a Spark DataFrame have a deep connection with expressions. But, on the other hand, there are some situations that you write a value (it can be a string, a integer, a boolean, or anything) inside your pyspark
code, and you might actually want Spark to intepret this value as a constant (or a literal) value, rather than a expression.
As an example, lets suppose you control the data generated by the sales of five different stores, scattered across different regions of Belo Horizonte city (in Brazil). Now, lets suppose you receive a batch of data generated by the 4th store in the city, which is located at Amazonas Avenue, 324. This batch of data is exposed below:
-
+
@@ -365,7 +365,7 @@ So how do we force Spark to interpret a value as a literal (or constant) value, rather than a expression? To do this, you must write this value inside the lit()
(short for “literalâ€) function from the pyspark.sql.functions
module.
In other words, when you write in your code the statement lit(4)
, Spark understand that you want to create a new column which is filled with 4’s. In other words, this new column is filled with the constant integer 4.
With the code below, I am creating two new columns (called store_number
and store_address
), and adding them to the sales
DataFrame.
-
+
from pyspark.sql.functions import lit
store_number = lit(4).alias('store_number')
store_address = lit('Amazonas Avenue, 324').alias('store_address')
diff --git a/docs/Chapters/04-dataframes.html b/docs/Chapters/04-dataframes.html
index 4ea6294f..6ae969ec 100644
--- a/docs/Chapters/04-dataframes.html
+++ b/docs/Chapters/04-dataframes.html
@@ -423,39 +423,45 @@ In pyspark
, every Spark DataFrame is stored inside a python object of class pyspark.sql.dataframe.DataFrame
. Or more succintly, a object of class DataFrame
.
Like any python class, the DataFrame
class comes with multiple methods that are available for every object of this class. This means that you can use any of these methods in any Spark DataFrame that you create through pyspark
.
As an example, in the code below I expose all the available methods from this DataFrame
class. First, I create a Spark DataFrame with spark.range(5)
, and, store it in the object df5
. After that, I use the dir()
function to show all the methods that I can use through this df5
object:
-
+
+
+['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_collect_as_arrow', '_ipython_key_completions_', '_jcols', '_jdf', '_jmap', '_joinAsOf', '_jseq', '_lazy_rdd', '_repr_html_', '_sc', '_schema', '_session', '_sort_cols', '_sql_ctx', '_support_repr_html', 'agg', 'alias', 'approxQuantile', 'cache', 'checkpoint', 'coalesce', 'colRegex', 'collect', 'columns', 'corr', 'count', 'cov', 'createGlobalTempView', 'createOrReplaceGlobalTempView', 'createOrReplaceTempView', 'createTempView', 'crossJoin', 'crosstab', 'cube', 'describe', 'distinct', 'drop', 'dropDuplicates', 'dropDuplicatesWithinWatermark', 'drop_duplicates', 'dropna', 'dtypes', 'exceptAll', 'explain', 'fillna', 'filter', 'first', 'foreach', 'foreachPartition', 'freqItems', 'groupBy', 'groupby', 'head', 'hint', 'id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal', 'isStreaming', 'is_cached', 'join', 'limit', 'localCheckpoint', 'mapInArrow', 'mapInPandas', 'melt', 'na', 'observe', 'offset', 'orderBy', 'pandas_api', 'persist', 'printSchema', 'randomSplit', 'rdd', 'registerTempTable', 'repartition', 'repartitionByRange', 'replace', 'rollup', 'sameSemantics', 'sample', 'sampleBy', 'schema', 'select', 'selectExpr', 'semanticHash', 'show', 'sort', 'sortWithinPartitions', 'sparkSession', 'sql_ctx', 'stat', 'storageLevel', 'subtract', 'summary', 'tail', 'take', 'to', 'toDF', 'toJSON', 'toLocalIterator', 'toPandas', 'to_koalas', 'to_pandas_on_spark', 'transform', 'union', 'unionAll', 'unionByName', 'unpersist', 'unpivot', 'where', 'withColumn', 'withColumnRenamed', 'withColumns', 'withColumnsRenamed', 'withMetadata', 'withWatermark', 'write', 'writeStream', 'writeTo']
+
All the methods present in this DataFrame
class, are commonly referred as the DataFrame API of Spark. Remember, this is the most important API of Spark. Because much of your Spark applications will heavily use this API to compose your data transformations and data flows (Chambers and Zaharia 2018).
3.4 Building a Spark DataFrame
There are some different methods to create a Spark DataFrame. For example, because a DataFrame is basically a Dataset of rows, we can build a DataFrame from a collection of Row
’s, through the createDataFrame()
method from your Spark Session:
-
-from pyspark.sql import SparkSession
-spark = SparkSession.builder.getOrCreate()
-from datetime import date
-from pyspark.sql import Row
-
-data = [
- Row(id = 1, value = 28.3, date = date(2021,1,1)),
- Row(id = 2, value = 15.8, date = date(2021,1,1)),
- Row(id = 3, value = 20.1, date = date(2021,1,2)),
- Row(id = 4, value = 12.6, date = date(2021,1,3))
-]
-
-df = spark.createDataFrame(data)
+
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.getOrCreate()
+from datetime import date
+from pyspark.sql import Row
+
+data = [
+ Row(id = 1, value = 28.3, date = date(2021,1,1)),
+ Row(id = 2, value = 15.8, date = date(2021,1,1)),
+ Row(id = 3, value = 20.1, date = date(2021,1,2)),
+ Row(id = 4, value = 12.6, date = date(2021,1,3))
+]
+
+df = spark.createDataFrame(data)
Remember that a Spark DataFrame in python is a object of class pyspark.sql.dataframe.DataFrame
as you can see below:
-
-
+
If you try to see what is inside of this kind of object, you will get a small description of the columns present in the DataFrame as a result:
-
-
-
+
@@ -463,24 +469,24 @@ But you can use different methods to create the same Spark DataFrame. As another example, with the code below, we are creating a DataFrame called students
from two different python lists (data
and columns
).
The first list (data
) is a list of rows. Each row is represent by a python tuple, which contains the values in each column. But the secont list (columns
) contains the names for each column in the DataFrame.
To create the students
DataFrame we deliver these two lists to createDataFrame()
method:
-
-data = [
- (12114, 'Anne', 21, 1.56, 8, 9, 10, 9, 'Economics', 'SC'),
- (13007, 'Adrian', 23, 1.82, 6, 6, 8, 7, 'Economics', 'SC'),
- (10045, 'George', 29, 1.77, 10, 9, 10, 7, 'Law', 'SC'),
- (12459, 'Adeline', 26, 1.61, 8, 6, 7, 7, 'Law', 'SC'),
- (10190, 'Mayla', 22, 1.67, 7, 7, 7, 9, 'Design', 'AR'),
- (11552, 'Daniel', 24, 1.75, 9, 9, 10, 9, 'Design', 'AR')
-]
-
-columns = [
- 'StudentID', 'Name', 'Age', 'Height', 'Score1',
- 'Score2', 'Score3', 'Score4', 'Course', 'Department'
-]
-
-students = spark.createDataFrame(data, columns)
-students
-
+
+data = [
+ (12114, 'Anne', 21, 1.56, 8, 9, 10, 9, 'Economics', 'SC'),
+ (13007, 'Adrian', 23, 1.82, 6, 6, 8, 7, 'Economics', 'SC'),
+ (10045, 'George', 29, 1.77, 10, 9, 10, 7, 'Law', 'SC'),
+ (12459, 'Adeline', 26, 1.61, 8, 6, 7, 7, 'Law', 'SC'),
+ (10190, 'Mayla', 22, 1.67, 7, 7, 7, 9, 'Design', 'AR'),
+ (11552, 'Daniel', 24, 1.75, 9, 9, 10, 9, 'Design', 'AR')
+]
+
+columns = [
+ 'StudentID', 'Name', 'Age', 'Height', 'Score1',
+ 'Score2', 'Score3', 'Score4', 'Course', 'Department'
+]
+
+students = spark.createDataFrame(data, columns)
+students
+
DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]
@@ -491,16 +497,19 @@ 3.5 Visualizing a Spark DataFrame
A key aspect of Spark is its laziness. In other words, for most operations, Spark will only check if your code is correct and if it makes sense. Spark will not actually run or execute the operations you are describing in your code, unless you explicit ask for it with a trigger operation, which is called an “action†(this kind of operation is described in Section 5.2).
You can notice this laziness in the output below:
-
-
-
+
+
+
DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]
Because when we call for an object that stores a Spark DataFrame (like df
and students
), Spark will only calculate and print a summary of the structure of your Spark DataFrame, and not the DataFrame itself.
So how can we actually see our DataFrame? How can we visualize the rows and values that are stored inside of it? For this, we use the show()
method. With this method, Spark will print the table as pure text, as you can see in the example below:
-
-
+
+
+
+[Stage 0:> (0 + 1) / 1]
+
+---------+-------+---+------+------+------+------+------+---------+----------+
|StudentID| Name|Age|Height|Score1|Score2|Score3|Score4| Course|Department|
@@ -516,8 +525,8 @@
By default, this method shows only the top rows of your DataFrame, but you can specify how much rows exactly you want to see, by using show(n)
, where n
is the number of rows. For example, I can visualize only the first 2 rows of df
like this:
-
-
+
+
+---+-----+----------+
| id|value| date|
@@ -533,9 +542,9 @@
3.6 Getting the name of the columns
If you need to, you can easily collect a python list with the column names present in your DataFrame, in the same way you would do in a pandas
DataFrame. That is, by using the columns
method of your DataFrame, like this:
-
-
-
+
+
+
['StudentID',
'Name',
'Age',
@@ -552,9 +561,9 @@
3.7 Getting the number of rows
If you want to know the number of rows present in a Spark DataFrame, just use the count()
method of this DataFrame. As a result, Spark will build this DataFrame, and count the number of rows present in it.
-
-
-
+
@@ -578,10 +587,10 @@ MapType(keyType, valueType, valueContainsNull)
: Represents a set of key-value pairs. The data type of keys is described by keyType
and the data type of values is described by valueType
. For a MapType
value, keys are not allowed to have null
values. valueContainsNull
is used to indicate if values of a MapType
value can have null
values.
Each one of these Spark data types have a corresponding python class in pyspark
, which are stored in the pyspark.sql.types
module. As a result, to access, lets say, type StryngType
, we can do this:
-
-
+
Or, a list of multiple strings:
-
+
names = [
"Anne", "Vanse", "Elliot",
"Carlyle", "Ed", "Memphis"
@@ -360,7 +360,7 @@
+
product = {
'name': 'Coca Cola',
'volume': '2 litters',
@@ -379,7 +379,7 @@
1.4 Expressions
Python programs are organized in blocks of expressions (or statements). A python expression is a statement that describes an operation to be performed by the program. For example, the expression below describes the sum between 3 and 5.
-
+
8
@@ -387,7 +387,7 @@
The expression above is composed of numbers (like 3 and 5) and a operator, more specifically, the sum operator (+
). But any python expression can include a multitude of different items. It can be composed of functions (like print()
, map()
and str()
), constant strings (like "Hello World!"
), logical operators (like !=
, <
, >
and ==
), arithmetic operators (like *
, /
, **
, %
, -
and +
), structures (like lists, arrays and dicts) and many other types of commands.
Below we have a more complex example, that contains the def
keyword (which starts a function definition; in the example below, this new function being defined is double()
), many built-in functions (list()
, map()
and print()
), a arithmetic operator (*
), numbers and a list (initiated by the pair of brackets - []
).
-
+
def double(x):
return x * 2
@@ -399,7 +399,7 @@ Python expressions are evaluated in a sequential manner (from top to bottom of your python file). In other words, python runs the first expression in the top of your file, them, goes to the second expression, and runs it, them goes to the third expression, and runs it, and goes on and on in that way, until it hits the end of the file. So, in the example above, python executes the function definition (initiated at def double(x):
), before it executes the print()
statement, because the print statement is below the function definition.
This order of evaluation is commonly referred as “control flow†in many programming languages. Sometimes, this order can be a fundamental part of the python program. Meaning that, sometimes, if we change the order of the expressions in the program, we can produce unexpected results (like an error), or change the results produced by the program.
As an example, the program below prints the result 4, because the print statement is executed before the expression x = 40
.
-
+
But, if we execute the expression x = 40
before the print statement, we then change the result produced by the program.
-
+
If we go a little further, and, put the print statement as the first expression of the program, we then get a name error. This error warns us that, the object named x
is not defined (i.e. it does not exist).
-
+
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
-
+
@@ -437,16 +437,16 @@ A python package (or a python “libraryâ€) is basically a set of functions and classes that provides important functionality to solve a specific problem. And pyspark
is one of these many python packages available.
Python packages are usually published (that is, made available to the public) through the PyPI archive5. If a python package is published in PyPI, then, you can easily install it through the pip
tool.
To use a python package, you always need to: 1) have this package installed on your machine; 2) import this package in your python script. If a package is not installed in your machine, you will face a ModuleNotFoundError
as you try to use it, like in the example below.
-
+
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
If your program produce this error, is very likely that you are trying to use a package that is not currently installed on your machine. To install it, you may use the pip install <name of the package>
command on the terminal of your OS.
-pip install pandas
+Terminal$ pip install pandas
But, if this package is already installed in your machine, then, you can just import it to your script. To do this, you just include an import
statement at the start of your python file. For example, if I want to use the DataFrame
function from the pandas
package:
-
+
# Now that I installed the `pandas` package with `pip`
# this `import` statement works without any errors:
import pandas
@@ -467,7 +467,7 @@
Therefore, with import pandas
I can access any of the functions available in the pandas
package, by using the dot operator after the name of the package (pandas.<name of the function>
). However, it can become very annoying to write pandas.
every time you want to access a function from pandas
, specially if you use it constantly in your code.
To make life a little easier, python offers some alternative ways to define this import
statement. First, you can give an alias to this package that is shorter/easier to write. As an example, nowadays, is virtually a industry standard to import the pandas
package as pd
. To do this, you use the as
keyword in your import
statement. This way, you can access the pandas
functionality with pd.<name of the function>
:
-
+
In contrast, if you want to make your life even easier and produce a more “clean†code, you can import (from the package) just the functions that you need to use. In this method, you can eliminate the dot operator, and refer directly to the function by its name. To use this method, you include the from
keyword in your import statement, like this:
-
+
Just to be clear, you can import multiple functions from the package, by listing them. Or, if you prefer, you can import all components of the package (or module/sub-module) by using the star shortcut (*
):
-
+
# Import `search()`, `match()` and `compile()` functions:
from re import search, match, compile
# Import all functions from the `os` package
@@ -512,12 +512,12 @@ Some packages may be very big, and includes many different functions and classes. As the size of the package becomes bigger and bigger, developers tend to divide this package in many “modulesâ€. In other words, the functions and classes of this python package are usually organized in “modulesâ€.
As an example, the pyspark
package is a fairly large package, that contains many classes and functions. Because of it, the package is organized in a number of modules, such as sql
(to access Spark SQL), pandas
(to access the Pandas API of Spark), ml
(to access Spark MLib).
To access the functions available in each one of these modules, you use the dot operator between the name of the package and the name of the module. For example, to import all components from the sql
and pandas
modules of pyspark
, you would do this:
-
+
Going further, we can have sub-modules (or modules inside a module) too. As an example, the sql
module of pyspark
have the functions
and window
sub-modules. To access these sub-modules, you use the dot operator again:
-
+
# Importing `functions` and `window` sub-modules:
import pyspark.sql.functions as F
import pyspark.sql.window as W
@@ -527,7 +527,7 @@ 1.6 Methods versus Functions
Beginners tend mix these two types of functions in python, but they are not the same. So lets describe the differences between the two.
Standard python functions, are functions that we apply over an object. A classical example, is the print()
function. You can see in the example below, that we are applying print()
over the result
object.
-
+
@@ -535,7 +535,7 @@
Other examples of a standard python function would be map()
and list()
. See in the example below, that we apply the map()
function over a set of objects:
-
+
@@ -545,7 +545,7 @@
In contrast, a python method is a function registered inside a python class. In other words, this function belongs to the class itself, and cannot be used outside of it. This means that, in order to use a method, you need to have an instance of the class where it is registered.
For example, the startswith()
method belongs to the str
class (this class is used to represent strings in python). So to use this method, we need to have an instance of this class saved in a object that we can access. Note in the example below, that we access the startswith()
method through the name
object. This means that, startswith()
is a function. But, we cannot use it without an object of class str
, like name
.
-
+
@@ -556,7 +556,7 @@ So, if we have a class called people
, and, this class has a method called location()
, we can use this location()
method by using the dot operator (.
) with the name of an object of class people
. If an object called x
is an instance of people
class, then, we can do x.location()
.
But if this object x
is of a different class, like int
, then we can no longer use the location()
method, because this method does not belong to the int
class. For example, if your object is from class A
, and, you try to use a method of class B
, you will get an AttributeError
.
In the example exposed below, I have an object called number
of class int
, and, I try to use the method startswith()
from str
class with this object:
-
+
number = 2
# You can see below that, the `x` object have class `int`
type(number)
@@ -569,7 +569,7 @@ 1.7 Identifying classes and their methods
Over the next chapters, you will realize that pyspark
programs tend to use more methods than standard functions. So most of the functionality of pyspark
resides in class methods. As a result, the capability of understanding the objects that you have in your python program, and, identifying its classes and methods will be crucial while you are developing and debugging your Spark applications.
Every existing object in python represents an instance of a class. In other words, every object in python is associated to a given class. You can always identify the class of an object, by applying the type()
function over this object. In the example below, we can see that, the name
object is an instance of the str
class.
-
+
@@ -577,7 +577,7 @@
+
['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
diff --git a/docs/Chapters/04-columns.html b/docs/Chapters/04-columns.html
index 1f6f9eb7..eebc1925 100644
--- a/docs/Chapters/04-columns.html
+++ b/docs/Chapters/04-columns.html
@@ -285,7 +285,7 @@ 4 Section 3.4, and look at the class of any column of it. Like the id
column:
-
+
pyspark.sql.column.Column
@@ -295,7 +295,7 @@ 4 4.1 Building a column object
You can refer to or create a column by using the col() and column() functions from the pyspark.sql.functions module. These functions receive a string input with the name of the column you want to create/refer to. Their result is always an object of class Column. For example, the code below creates a column called ID:
-
+
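A minimal sketch of such a call:
from pyspark.sql.functions import col
ID = col('ID')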
@@ -308,7 +308,7 @@ 4.2 Columns are strongly related to expressions
Many kinds of transformations that we want to apply over a Spark DataFrame are usually described through expressions, and these expressions in Spark are mainly composed of column transformations. That is why the Column class, and its methods, are so important in Apache Spark.
Columns in Spark are so strongly related to expressions that the columns themselves are initially interpreted as expressions. If we look again at the column id from the df DataFrame, Spark will bring an expression as a result, and not the values held by this column.
-
+
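That is, simply evaluating the column (a sketch, assuming the df DataFrame from Section 3.4):
df.id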
Column<'id'>
@@ -317,7 +317,7 @@ is just an expression, and as a consequence, Spark does not know what the values of column ID are, or where it lives (that is, which DataFrame this column belongs to). For now, Spark is not interested in this information; it just knows that we have an expression referring to a column called ID.
These ideas relate a lot to the lazy aspect of Spark that we talked about in Section 3.5. Spark will not perform any heavy calculation, or show you the actual results/values from your code, until you trigger the calculations with an action (we will talk more about these "actions" in Section 5.2). As a result, when you access a column, Spark will only deliver an expression that represents that column, and not the actual values of that column.
This is handy, because we can store our expressions in variables and reuse them later, in multiple parts of our code. For example, I can keep building and merging a column with different kinds of operators, to build a more complex expression. In the example below, I create an expression that doubles the values of the ID column:
-
+
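A sketch of such an expression, reusing the ID column reference created earlier:
expr = ID * 2
expr
# Column<'(ID * 2)'>  (the exact rendering may vary by Spark version)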
@@ -327,7 +327,7 @@
+
@@ -339,7 +339,7 @@ 4.3 Literal values versus expressions
We know now that columns of a Spark DataFrame have a deep connection with expressions. But, on the other hand, there are some situations where you write a value (it can be a string, an integer, a boolean, or anything) inside your pyspark code, and you might actually want Spark to interpret this value as a constant (or literal) value, rather than as an expression.
As an example, let's suppose you control the data generated by the sales of five different stores, scattered across different regions of the city of Belo Horizonte (in Brazil). Now, let's suppose you receive a batch of data generated by the 4th store in the city, which is located at Amazonas Avenue, 324. This batch of data is exposed below:
-
+
@@ -365,7 +365,7 @@ So how do we force Spark to interpret a value as a literal (or constant) value, rather than as an expression? To do this, you must write this value inside the lit() (short for "literal") function from the pyspark.sql.functions module.
In other words, when you write the statement lit(4) in your code, Spark understands that you want to create a new column which is filled with 4's. That is, this new column is filled with the constant integer 4.
With the code below, I am creating two new columns (called store_number and store_address), and adding them to the sales DataFrame.
-
+
from pyspark.sql.functions import lit
store_number = lit(4).alias('store_number')
store_address = lit('Amazonas Avenue, 324').alias('store_address')
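One possible way to actually attach these two literal columns to the DataFrame (a sketch, assuming sales is the DataFrame that holds the batch of data described above) is to select all existing columns plus the two new ones:
sales_with_store = sales.select('*', store_number, store_address)
sales_with_store.show()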
diff --git a/docs/Chapters/04-dataframes.html b/docs/Chapters/04-dataframes.html
index 4ea6294f..6ae969ec 100644
--- a/docs/Chapters/04-dataframes.html
+++ b/docs/Chapters/04-dataframes.html
@@ -423,39 +423,45 @@ In pyspark, every Spark DataFrame is stored inside a python object of class pyspark.sql.dataframe.DataFrame. Or, more succinctly, an object of class DataFrame.
Like any python class, the DataFrame class comes with multiple methods that are available for every object of this class. This means that you can use any of these methods in any Spark DataFrame that you create through pyspark.
As an example, in the code below I expose all the available methods from this DataFrame class. First, I create a Spark DataFrame with spark.range(5) and store it in the object df5. After that, I use the dir() function to show all the methods that I can use through this df5 object:
-
+
+
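A sketch of the code that would produce the listing below:
df5 = spark.range(5)
print(dir(df5))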
+['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_collect_as_arrow', '_ipython_key_completions_', '_jcols', '_jdf', '_jmap', '_joinAsOf', '_jseq', '_lazy_rdd', '_repr_html_', '_sc', '_schema', '_session', '_sort_cols', '_sql_ctx', '_support_repr_html', 'agg', 'alias', 'approxQuantile', 'cache', 'checkpoint', 'coalesce', 'colRegex', 'collect', 'columns', 'corr', 'count', 'cov', 'createGlobalTempView', 'createOrReplaceGlobalTempView', 'createOrReplaceTempView', 'createTempView', 'crossJoin', 'crosstab', 'cube', 'describe', 'distinct', 'drop', 'dropDuplicates', 'dropDuplicatesWithinWatermark', 'drop_duplicates', 'dropna', 'dtypes', 'exceptAll', 'explain', 'fillna', 'filter', 'first', 'foreach', 'foreachPartition', 'freqItems', 'groupBy', 'groupby', 'head', 'hint', 'id', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty', 'isLocal', 'isStreaming', 'is_cached', 'join', 'limit', 'localCheckpoint', 'mapInArrow', 'mapInPandas', 'melt', 'na', 'observe', 'offset', 'orderBy', 'pandas_api', 'persist', 'printSchema', 'randomSplit', 'rdd', 'registerTempTable', 'repartition', 'repartitionByRange', 'replace', 'rollup', 'sameSemantics', 'sample', 'sampleBy', 'schema', 'select', 'selectExpr', 'semanticHash', 'show', 'sort', 'sortWithinPartitions', 'sparkSession', 'sql_ctx', 'stat', 'storageLevel', 'subtract', 'summary', 'tail', 'take', 'to', 'toDF', 'toJSON', 'toLocalIterator', 'toPandas', 'to_koalas', 'to_pandas_on_spark', 'transform', 'union', 'unionAll', 'unionByName', 'unpersist', 'unpivot', 'where', 'withColumn', 'withColumnRenamed', 'withColumns', 'withColumnsRenamed', 'withMetadata', 'withWatermark', 'write', 'writeStream', 'writeTo']
+
All the methods present in this DataFrame class are commonly referred to as the DataFrame API of Spark. Remember, this is the most important API of Spark, because much of your Spark applications will heavily use this API to compose your data transformations and data flows (Chambers and Zaharia 2018).
3.4 Building a Spark DataFrame
There are different methods to create a Spark DataFrame. For example, because a DataFrame is basically a Dataset of rows, we can build a DataFrame from a collection of Row objects, through the createDataFrame() method of your Spark Session:
+
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.getOrCreate()
+from datetime import date
+from pyspark.sql import Row
+
+data = [
+ Row(id = 1, value = 28.3, date = date(2021,1,1)),
+ Row(id = 2, value = 15.8, date = date(2021,1,1)),
+ Row(id = 3, value = 20.1, date = date(2021,1,2)),
+ Row(id = 4, value = 12.6, date = date(2021,1,3))
+]
+
+df = spark.createDataFrame(data)
Remember that a Spark DataFrame in python is an object of class pyspark.sql.dataframe.DataFrame, as you can see below:
-
-
+
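A quick check (a sketch, assuming the df DataFrame created above):
type(df)
# pyspark.sql.dataframe.DataFrame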
If you try to see what is inside this kind of object, you will get a small description of the columns present in the DataFrame as a result:
-
-
-
+
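For example, evaluating the object by itself only prints its column description (a sketch; the exact types depend on the data used above):
df
# DataFrame[id: bigint, value: double, date: date]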
@@ -463,24 +469,24 @@ But you can use different methods to create the same Spark DataFrame. As another example, with the code below, we are creating a DataFrame called students from two different python lists (data and columns).
The first list (data) is a list of rows. Each row is represented by a python tuple, which contains the values in each column. The second list (columns) contains the names for each column in the DataFrame.
To create the students DataFrame, we deliver these two lists to the createDataFrame() method:
+
+data = [
+ (12114, 'Anne', 21, 1.56, 8, 9, 10, 9, 'Economics', 'SC'),
+ (13007, 'Adrian', 23, 1.82, 6, 6, 8, 7, 'Economics', 'SC'),
+ (10045, 'George', 29, 1.77, 10, 9, 10, 7, 'Law', 'SC'),
+ (12459, 'Adeline', 26, 1.61, 8, 6, 7, 7, 'Law', 'SC'),
+ (10190, 'Mayla', 22, 1.67, 7, 7, 7, 9, 'Design', 'AR'),
+ (11552, 'Daniel', 24, 1.75, 9, 9, 10, 9, 'Design', 'AR')
+]
+
+columns = [
+ 'StudentID', 'Name', 'Age', 'Height', 'Score1',
+ 'Score2', 'Score3', 'Score4', 'Course', 'Department'
+]
+
+students = spark.createDataFrame(data, columns)
+students
+
DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]
@@ -491,16 +497,19 @@ 3.5 Visualizing a Spark DataFrame
A key aspect of Spark is its laziness. In other words, for most operations, Spark will only check if your code is correct and if it makes sense. Spark will not actually run or execute the operations you are describing in your code, unless you explicitly ask for it with a trigger operation, which is called an "action" (this kind of operation is described in Section 5.2).
You can notice this laziness in the output below:
-
-
-
+
+
+
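The output below presumably comes from evaluating the students object by itself, for example:
students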
DataFrame[StudentID: bigint, Name: string, Age: bigint, Height: double, Score1: bigint, Score2: bigint, Score3: bigint, Score4: bigint, Course: string, Department: string]
This is because, when we call for an object that stores a Spark DataFrame (like df and students), Spark will only calculate and print a summary of the structure of your Spark DataFrame, and not the DataFrame itself.
So how can we actually see our DataFrame? How can we visualize the rows and values that are stored inside of it? For this, we use the show() method. With this method, Spark will print the table as pure text, as you can see in the example below:
-
-
+
+
+
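A sketch of the call that produces the table below:
students.show()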
+
+---------+-------+---+------+------+------+------+------+---------+----------+
|StudentID| Name|Age|Height|Score1|Score2|Score3|Score4| Course|Department|
@@ -516,8 +525,8 @@
By default, this method shows only the top rows of your DataFrame, but you can specify exactly how many rows you want to see by using show(n), where n is the number of rows. For example, I can visualize only the first 2 rows of df like this:
-
-
+
+
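That is, something like:
df.show(2)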
+---+-----+----------+
| id|value| date|
@@ -533,9 +542,9 @@
3.6 Getting the name of the columns
If you need to, you can easily collect a python list with the column names present in your DataFrame, in the same way you would do in a pandas DataFrame. That is, by using the columns attribute of your DataFrame, like this:
-
-
-
+
+
+
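For example, with the students DataFrame (a sketch):
students.columns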
['StudentID',
'Name',
'Age',
@@ -552,9 +561,9 @@
3.7 Getting the number of rows
If you want to know the number of rows present in a Spark DataFrame, just use the count()
method of this DataFrame. As a result, Spark will build this DataFrame, and count the number of rows present in it.
-
-
-
+
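For example, counting the rows of the students DataFrame built earlier (a sketch):
students.count()
# 6, since the students DataFrame above has six rows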
@@ -578,10 +587,10 @@ MapType(keyType, valueType, valueContainsNull): Represents a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
Each one of these Spark data types has a corresponding python class in pyspark; these classes are stored in the pyspark.sql.types module. As a result, to access, let's say, the StringType type, we can do this:
-
-
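One way to do it (a sketch):
from pyspark.sql.types import StringType
s = StringType()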