Skip to content

Latest commit

 

History

History
2364 lines (1588 loc) · 61.9 KB

README.MD

File metadata and controls

2364 lines (1588 loc) · 61.9 KB

Complete SQL Boot Camp: Go from Zero0️⃣ to Hero🦸‍♂️!

Databases are systems that allow users to store and organize data, GREAT with large sets of data. They contain a wide variety of users with various technical skill-sets (sales, ops, engineer, etc.)

From Spreadsheets to Databases

The first step to becoming a database administrator is to learn how to use spreadsheets.

  • Spreadsheets are great for small data sets, or a one-time analysis.
  • Reasonably sized data sets can be used
  • Ability for untrained people to work with data.

Then we can move on to databases once we understand the spreadsheet.

  • Better data integrity
  • Can handle massive amounts of data
  • Quickly combine different datasets
  • Automate steps for ure
  • Can support data for websites and applications

Databases have columns and rows just like a spreadsheet, but rather than a 'tab', we have individual 'tables' that contain a given dataset.


For this tutorial, we'll be using PostgreSQL, but keep in mind there are many other approaches and abstractions to working in SQL...
  • MySQL, Oracle Databases, Microsoft Access, Amazon's Redshift, MemSQL, Periscope Data, Hive (runs on top of hadoop), Google's BigQuery, Facebook's Presto.

NOTE: We'll use yellow text to note when we are using a PostGreSQL specific command.



SQL Syntax (can all be applied to any SQL database)

We use all caps for the SQL commands (though we could do lowercase) to differentiate them from the columns, rows, or data objects we're accessing.


In general, it is not good practice to use an asterisk(*) in the SELECT statement if you don't need all columns. It will automatically query everything, which increases traffic between the database server and the application, which can slow down the retrieval of results.

Alt text




The DISTINCT keyword operates on a column and returns only the distinct values in a column. The syntax is...

SELECT DISTINCT column_name FROM table_name

Or, you can place the column name in parenthesis

SELECT DISTINCT(column_name) FROM table_name

What if we have two of the same distinct instances within a column? We don't know if the returned data is the person with the name "David" as a duplicate entry, or two different people with the same first name... We only that David is one of the distinct 'first_name' values within that column.

Other useful cases for DISTINCT.


  1. How many different release years do we have? One, 2006 (results from query shown below).
SELECT DISTINCT release_year FROM film

  1. How many different rental_rates are there? Three,
SELECT DISTINCT (rental_rate) FROM film

Let's say we have a visitor from Australia who's not familiar with our movie ratings system in the USA. We want to know...

  1. What types (or, distinct values) of ratings do we have in the database? "PG" "R" "NC-17" "PG-13" and "G"
SELECT DISTINCT rating FROM film;



Count function

By default, the count function returns the number of input rows of a table.

SELECT COUNT(*) FROM table_name



By including an argument, the returned query will then match that specific argument/condition.

SELECT COUNT(first_name) FROM table_name

It will not work without () to pass in the value to be counted.


Comparison operators

  • = equal to
  • <> not equal to (or !=)
  • '> greater than'
  • < less than
  • '> = greater than or equal to'
  • <= less than or equal to

Along with comparison operators, there are 3 logical operators.


AND, OR, and NOT.

  • AND - returns true if both conditions are true
  • OR - returns true if either condition is true
  • NOT - returns true if the condition is false

SELECT first_name, choice FROM table_name
WHERE name = 'David'
AND choice = 'red'

How would we find the number of films with a rating of R or PG-13?

SELECT COUNT(*) FROM film
WHERE rating = 'R'
OR rating = 'PG-13'


Where statement

One of the most fundamental statements alongside SELECT. With Where, we can specify conditions on columns for which rows are to be returned. We decide what columns we want back, then the table, then WHERE to describe the conditions.

You can combine these in any fashion as long as spelling and syntax are correct. For example, we want the phone # from the customer with the address of '259 Ipoh Drive...

SELECT customer FROM customer
WHERE address = '259 Ipoh Drive'


Order By

With Order By, we can sort the data by a column in either ascending or descending (order for strings, numerical for numbers)

Basic syntax for ORDER BY...

  • Use ASC for ascending order (Default if left blank)
  • Use DESC for descending order
SELECT column_1, column_2 FROM table_name
ORDER BY column_1 ASC/DESC


You can use ORDER BY on multiple columns. This makes a lot of sense with duplicate entries For example...

Company Name Sales
Apple Andrew 100
Google David 500
Apple Zach 300
Google Claire 200
Xerox Steven 100

To sort by company and then sales in descending order:
SELECT company, sales FROM table
ORDER BY company, sales

Company Name Sales
Apple Andrew 100
Apple Zach 300
Google Claire 200
Google David 500
Xerox Steven 100

NOTE: You can sort by columns you didn't include in your query request statement. SQL knows that those fields are still there and can access them for purposes outside those we've seen so far.



Limit

  • The limit keyword allows us to limit the number of rows returned for a query.
  • Useful for not wanting to return every single row in a table, but only view the top few rows to get an idea of the table layout.
  • LIMIT also becomes useful in combination with ORDER BY.
  • Goes at the very end of a query and is the last command to be executed.

Limit Syntax:

SELECT * FROM payment
WHERE amount != 0.00
ORDER BY payment_date DESC
LIMIT(5)

This will return 5 recent payments (not 0) ordered by the payment date in descending order.


POP QUIZ:


1. We want to reward the first 10 customers who made a payment. Write a query to gather the customer_ids.
SELECT customer_id FROM payment
ORDER BY payment_date ASC
LIMIT 10

2. A customer wants to quickly rent a video over their short lunch break. What are the titles of the 5 shortest (length of runtime) movies?
SELECT title FROM film
ORDER BY length ASC
limit 5

Now that we know the shortest runtime ('length') is 46 minutes, we can query for how many films there are with that runtime instead of just returning 5. Coincidentally, there are 5 in this example for both queries.

SELECT title FROM film
WHERE length = '46'

NOTE: 'length' is actually a SQL keyword (which is why it's highlighted in PgAdmin), so be careful where you call it as a column name. Do your best to avoid column names that are the same as SQL keywords..

BONUS: If the previous customer can watch any movie that is 50 minutes or less in run time, how many options does she have? Answer: 37.
SELECT COUNT(title) FROM film
WHERE length <= '50'


Between

Can be used to match a value against a range of values

  • Value BETWEEN low AND high.
  • (Value >= low) AND (value <= high).
  • We include the low and high...

Combining NOT with BETWEEN...

  • value NOT BETWEEN low and HIGH.
  • (value < low) OR (value > high)
  • We DO NOT include the low and high, only the value in between them.


BETWEEN with DATES!

We can also use this on dates using the following syntax:

date_column BETWEEN start_date AND end_date

OR...

date BETWEEN '2007-01-01' AND '2007-02-01'

NOTE:

  1. You need to format dates in the ISO 8601 standard format, which is YYYY-MM-DD.
  2. When using BETWEEN operator with dates that also include timestamp information, pay careful attention to using BETWEEN versus <=, >= comparison operators, because a datetime start with 0:00.

How many payments are there above, or below 9 dollars?
Answer: 14157
SELECT count(*) from payment
WHERE amount NOT BETWEEN 8 and 9

How many payments were there between February 1st, 2007, and February 15th, 2007?

SELECT * FROM payment
WHERE payment_date BETWEEN '2007-02-01' AND '2007-02-15'

Remember Note 2. above? Here we have an example of this. We're only returning payments up to 2007-02-15, but not including them. We may want to include payments from the 15th, in which case we would search BETWEEN '2007-02-01' and '2007-02-16'



~ IN ~

  • You may want to check for multiple possible value options, for example, if a user's name shows up IN/NOT IN list of options.
  • We can use the IN/NOT IN operator to create a condition that checks to see if a value is included in a list of multiple options.

Syntax:

  • value IN (option1, option2,...,option_n)
  • WHERE color IN ('red', 'blue')
    NOT IN example:
  • WHERE color NOT IN ('green', 'blue')

Let's say we want to know the number of payments that are not '0.99', '1.98', or '1.99'.

SELECT * from payment
WHERE amount NOT IN (0.99, 1.98, 1.99)
ORDER BY amount ASC


LIKE and ILIKE

What if we want to match against a general pattern in a string?

  • All emails ending in '@gmail.com'.
  • All names that begin with an 'A'.

The LIKE operator allows us to perform pattern matching against string data with the use of wildcard characters:

  • Percent %
    • Matches any sequence of characters
  • Underscore _
    • Matches any single character.

Using %

  • All names that begin with an 'A'.
    • WHERE name LIKE 'A%'
    • Saying: match any name where the first letter is 'A' followed by and sequence of characters.
  • All names that end with an 'a'
    • WHERE name LIKE '%a'
    • Saying: match any name where the last letter is 'a', preceded by and sequence of characters.

You can use ANY combinations of _ and %! Let's find the first_name and email of anyone who has an 'er' in their name - How? Br wrapping % on both sides of what we're looking for we're asking for "any 'er' with any other characters on either side".

SELECT first_name, email from customer
WHERE first_name ILIKE '%er%'

NOTE:

  • LIKE is case-sensitive,
  • ILIKE is case-insensitive,

Using _

Using the underscore allows us to replace just a single character.


NOTE:

  • You can combine LIKE and ILIKE operators for far greater detail
  • REGEX expressions are fully supported in SQL!

Get all customer names that are some letter, followed by 'her', then any combination of letters.

SELECT first_name FROM customer
WHERE first_name LIKE '_her%'

How many people's name starts with just the letter J?

SELECT count(*) FROM customer
WHERE first_name LIKE 'J%'



CHALLENGE! (some basics)

  1. How many payment transactions were greater than $5.00?
    •  SELECT COUNT(*) FROM payment
       WHERE amount > 5.00
  2. How many actors have a first name that starts with the letter P?
    •  SELECT COUNT(actor) FROM actor
       WHERE first_name LIKE 'P%'
  3. How many unique districts are our customers from?
    • SELECT DISTINCT (COUNT(district)) FROM address
  4. Retrieve the list of names for those distinct districts from the previous question.
    • SELECT DISTINCT (district) FROM address
  5. How many films have a rating of R and a replacement cost between $5 and $15? Answer: 52.
    • SELECT COUNT(FILM) FROM film
      WHERE rating = 'R'
      AND replacement_cost BETWEEN 5 AND 15
  6. How many films have the word Truman somewhere in the title?
    • SELECT COUNT(*) FROM film
      WHERE title LIKE '%Truman%'



GROUP BY

Allows us to aggregate data and apply functions to better understand how data is distributed per category.

Aggregate Functions

Further info: https://www.postgresql.org/files/documentation/pdf/16/postgresql-16-US.pdf

SQL provides a variety of aggregate functions but some of the most common are:

  • AVG() and ROUND() - returns the average value (returns 2.41512346112347898).
    • Use ROUND() function to specify the precision of decimals.
    • (1st parameter = to be rounded, 2nd parameter = # of decimal places to round)
  • COUNT() - returns number of values
  • MAX() - returns maximum value
  • MIN() - returns minimum value
  • SUM() - returns the sum of all values.

NOTE: Aggregate function calls happen only in the SELECT clause of the HAVING clause.


MIN and MAX

We know now we'll have to spend at least 9.99 to replace a film, OR at most 29.99

SELECT MIN(replacement_cost) A FROM film
SELECT MAX(replacement_cost) A FROM film

This is messy and repetitive. Instead, we can separate the aggregate function calls with a comma. They're both returning a singular value so this sort of query makes sense.

SELECT MAX(replacement_cost),MIN(replacement_cost) FROM film

AVG and ROUND

SELECT AVG(replacement_cost) FROM film

We get waaaaaaay too many decimal points here. So, let's use the ROUND function, which takes in the 1st parameter as the value to be rounded, and the second as the number of decimal points to round to.

SELECT ROUND(AVG(replacement_cost),2) FROM film

SUM

SELECT SUM(replacement_cost) FROM film


GROUP BY - Part One - Theory

Allows us to aggregate columns per some category.

Things to consider:

  1. We need to choose a categorical column to GROUP BY
  2. Categorical columns are non-continuous
  3. Keep in mind, they can still be numerical, such as cabin class categories on a ship( e.g. Class 1, Class 2, Class 3)
Category Data Value
A 10
A 5
B 2
B 4
C 12
C 6

Considering the table above, we have category A, then B, then C. Remember, an aggregate function takes a group of numbers and returns a single value. So, the 10 & 5, 2 & 4, and 12 & 6 for each respective category would SUM to 15, 6, and 18, respectively.

Or AVG, or COUNT, etc...


Group by Syntax:

SELECT category_col, AGG(data_col)
FROM table
GROUP BY category_col

NOTE:

  • The GROUP BY clause must appear right after a FROM or WHERE statement
  • In the SELECT statement, columns must either have an aggregate function or be the GROUP by call.

For Example:

SELECT company, division, SUM(sales)
FROM finance_table
WHERE division IN ('marketing', 'transport')
GROUP BY company, division

WHERE statements should not refer to the aggregation result (SUM(sales)), later on, we will learn to use HAVING to filter those results.

Here's an example we've been working with:

SELECT amount, customer_id, SUM(amount)
FROM payment
WHERE amount != 0
GROUP BY customer_id,amount
ORDER BY SUM(amount) DESC
LIMIT 5

This is to say: "Find me the top 5 customers with the highest payments" - Sorted from highest to lowest, omitting anyone with 0 payment.



GROUP BY - Part Two - Implementation

Which customer has the highest payment?

SELECT customer_id,SUM(amount) FROM payment
GROUP BY customer_id
ORDER BY SUM(amount) DESC

Which customer has the highest number of payments?

SELECT customer_id,COUNT(amount) FROM payment
GROUP BY customer_id,
ORDER BY COUNT(amount) DESC

Using GROUP BY on multiple columns.

We can find the highest transactions made between which customer and which employee.

SELECT customer_id,staff_id,SUM(amount) FROM payment
GROUP BY staff_id, customer_id
ORDER BY customer_id


Using the DATE() function with GROUP BY

Simply call the DATE() function to convert the timestamped format to the standard YYYY-MM-DD format.

Below, we're able to find the days on which the lowest (ASC) amount transactions took place.

SELECT DATE(payment_date), SUM(amount) FROM payment
GROUP BY DATE(payment_date)
ORDER BY SUM(amount)


GROUP BY CHALLENGE! 🏘️

We have two staff members, with Staff IDs 1 and 2. We want to give a bonus to the staff member that handled the most payments. (most in terms of the number of payments processed, not the dollar amount).

How many payments did each staff member handle and who gets the bonus?

SELECT rating, ROUND(AVG(replacement_cost), 2) FROM film
GROUP BY rating
ORDER BY AVG(replacement_cost)
Answer:

Alt text

SELECT rating, AVG(replacement_cost) FROM film GROUP BY rating ORDER BY AVG(replacement_cost)



What is the average replacement cost per MPAA rating?

  • You may need to expand the AVG column to view correct results.
SELECT rating, AVG(replacement_cost) FROM film
GROUP BY rating
ORDER BY AVG(replacement_cost)
Answer:

Alt text



Running promotion to reward our top 5 customers with coupons. What are the customer IDs of the top 5 customers by total spend?

SELECT customer_id, SUM(amount) from payment
GROUP BY customer_id
ORDER BY SUM(amount) DESC
limit 5
Answer:

Alt text






Having

Having clause allows us to filter after an aggregation has already taken place. From before..

SELECT company, SUM(sales)
FROM finance_table
WHERE company != 'Google'
GROUP BY company

We've already seen we can filter before executing the GROUP BY, but what if we want to filter based on SUM(sales)?


NOTE: We can not use WHERE to filter based off of aggregate results, because those happen after a WHERE is executed.


HAVING allows us to use the aggregate result as a filter along with a GROUP BY.

This is to say - "Go ahead and do a filter with the where for all company's other than Google, then group by those companies, then only return those having more than 1000 in sales.

SELECT customer_id, SUM(amount)
FROM payment
WHERE customer_id NOT IN (184,87,477)
GROUP BY customer_id HAVING SUM(amount) > 100
Solution:

Alt text



Another example of using HAVING to filter the aggregate result... (Stores with more than 300 customers)

SELECT store_id,COUNT(customer_id)FROM customer
GROUP BY store_id
HAVING COUNT(customer_id) > 300


CHALLENGE tasks for HAVING 🎒⛏️

We are launching a platinum service for our most loyal customers. We will assign platinum status to customers that have had 40 or more transaction payments.

What customer_ids are eligible for platinum status?

SELECT customer_id, COUNT(payment_id)
FROM payment
GROUP BY customer_id
HAVING COUNT(payment_id) >= 40
Solution:

Alt text



What are the customer ids of customers who have spent more than $100 in payment transactions with our staff_id member 2?

SELECT customer_id, staff_id,SUM(amount) FROM payment
WHERE staff_id = 2
GROUP BY customer_id, staff_id
HAVING SUM(amount) > 100
ORDER BY SUM(amount)
Solution:

Alt text





ASSESSMENT TEST:

  1. Return the customer IDs of customers who have spent at least $110 with the staff member who has an ID of 2 (Answer: 187 and 148).
SELECT customer_id, SUM(amount) FROM payment
WHERE staff_id = 2
GROUP BY customer_id
HAVING SUM(amount) > 110
ORDER BY SUM(amount)
  1. How many films begin with the letter J? (Answer: 20)
SELECT COUNT(*) FROM film
WHERE title LIKE 'J%'
  1. What customer has the highest customer ID number who's name starts with an 'E' and has an address ID lower than 500 (Answer: Eddie Tomlin)
SELECT * FROM customer
WHERE first_name LIKE 'E%'
AND address_id < 500
ORDER BY customer_id DESC

Lets pull out the customer who has the most sales with staff_id #2

SELECT * FROM customer
WHERE first_name LIKE 'E%'
AND address_id < 500
ORDER BY customer_id DESC
LIMIT 3;





JOINS

JOINS allow us to combine information from multiple tables.

First, we'll learn how to create an alias with the AS clause. Then, we'll move on to the 4 joins listed below.

  • INNER JOINS
  • OUTER JOINS
  • FULL JOINS
  • UNIONS

AS Statement

Syntax for re-declaring a column name using AS:

SELECT column AS new_name FROM table

So, a real world example might be...

SELECT amount AS rental_price FROM payment
-- Or ...
SELECT SUM(amount) AS net_revenue FROM payment


NOTE: The AS operator gets executed at the very end of a query, meaning that we can not use the ALIAS inside a WHERE operator.

SELECT COUNT(amount) FROM payment

We can use AS conveniently to rename selections into what they "actually" are. These names are fore the column headers only.

Here, it's hard to tell specifically what we want, other than customer_id information and a sum of the amount.

SELECT customer_id, SUM(amount) FROM payment
GROUP BY customer_id

We can tidy this up by renaming the selection into.
SELECT customer_id AS cust_id, SUM(amount) AS total_spent
FROM payment
GROUP BY customer_id

NOTE: When calling on a column using HAVING, you have to use the original column name (SUM(amount)) and not the alias (total_spent) because the AS assignment happens at the very end.
-- ! BAD SQL ! --
SELECT customer_id, SUM(amount) AS total_spent
FROM payment
GROUP BY customer_id
HAVING total_spent > 100

Now that we know how to use AS, let's to back to JOINS...


INNER JOINS

INNER JOINS will result with the set of records that match in both tables.

Let's start by assuming our company is holding a conference for people in the movie rental industry. We'll have people register online (REGISTRATIONS) beforehand and then login the day of the conference (LOGINS). We're going to assume there are no duplicate first names.

We want to know who has registered for the event, and also who has logged in.


We're saying - "Grab table A and then grab table B. Because it's an inner join, only grab the rows that happen to be in both tables A and B. Join them ON TableA columns that match TableB columns.

Syntax:

SELECT * FROM TableA
INNER JOIN TableB
ON TableA.col_match = TableB.col_match

Alt text



Working example:

Alt text


Alt text


Now we have A LOT of results! But, we have duplicate name columns (because we selected *) with duplicate information and we know that's not how we like to do things.

To clean this up, we can declare specific columns that we want. Below (syntax example), We're saying select 3 things: Registration ID, Login names, and Login ID from the Registrations table. Then, join Logins where the names are equal to each other... However, because 'name' existed in both tables, and as we saw before, it will simply produce both columns with the same 'name' header. Here, we're only returning the Logins.name column (we could also use registrations.name, it doesn't matter since they're the same)

Alt text



REMEMBER!

  • Table order won't matter in an INNER JOIN
  • Also, if you see JOIN without the INNER, PostgreSQl will treat it as an INNER JOIN


This request pulls back all fields using the the JOIN with the same customer_id.

SELECT * FROM payment
JOIN customer
ON payment.customer_id = customer.customer_id

Let's widdle it down a bit and only select certain columns.

  • If we're referring to a column name that only exists in one table we don't have to do anything specific
  • If both tables contain the same column name (customer_id), then we can specify which table we want to use it from.
    • Since we're already pulling FROM the payment table, it would be best practice to use that one...
SELECT payment_id, payment.customer_id, first_name
FROM payment
JOIN customer
ON payment.customer_id = customer.customer_id

The above only JOINS customers FROM payment who have an existing payment (FROM payment), and that are both on the customer and payment table with matching ids.




FULL (OUTER) JOINS

Alt text

  • There are a few different types of OUTER (OUTER) JOINS
  • They will allow us to specify how to deal with values only present in one of the tables being joined.
  • There are the more complex JOINs, take your time!

Below, we'll explain...

  • FULL JOIN (Clarifying WHERE null)
  • LEFT JOIN (Clarifying WHERE null)
  • RIGHT JOIN (Clarifying WHERE null)


First, let's describe the FULL (OUTER) JOIN.

syntax: Select all the columns from table A, and join all of Table B.

SELECT * FROM TableA
FULL JOIN TableB
ON tableA.col_match = TableB.col_match

A more realistic example might be...

SELECT * FROM Registrations
FULL JOIN Logins
ON Registrations.name = Logins.name

The returned table will contain the matches where the names exist in both columns, but since we asked to return all, what about the rows where the names don't match? It will place 'null' instead of the values into those rows.

reg_id Name log_id name
1 Andrew 2 2023-01-05
2 Bob 4 2023-01-06
3 Charlie null null
4 David null null
null null 1 Xavier
null null 3 Yolanda


Full OUTER JOIN with WHERE - Get rows unique to either table (rows not found in both tables)

Meaning, we're looking for the actual values where one of the rows is null. This grabs the rows that are completely unique to either Table A, or Table B.

Syntax: (see diagram above for example venn diagram of an [Exclusive] Full Join)

SELECT * FROM TableA
FULL OUTER JOIN TableB
ON TableA.col_match = TableB.col_match
WHERE Table.id IS null OR Table.id IS null
SELECT * FROM Registrations
FULL OUTER JOIN Logins
ON Registrations.reg_id = Logins.log_id
WHERE Registrations.reg_id IS null
OR Logins.log_id IS null


Using the table from above, we would now see all results that are unique to the either the Registration table, or the Login table.

reg_id Name log_id name
3 Charlie null null
4 David null null
null null 1 Xavier
null null 3 Yolanda


Let's do an example in PgAdmin! We want to ensure that we don't have any customer information that's not tied to a payment, or any payment information that's not tied to a customer. Luckily, there are none.

SELECT * FROM customer
FULL OUTER JOIN payment
ON customer.customer_id = payment.customer_id
WHERE customer.customer_id IS null
OR payment.payment_id IS null
Solution:

Alt text


Let's double check by counting the distinct IDs - You'd have to check each table and ensure the numbers match. There also could be different id numbers in different tables.

SELECT COUNT(DISTINCT customer_id) FROM payment.



LEFT (OUTER) JOIN

NOTE:

  • With both LEFT and RIGHT JOINS, the order we enter the query now matters, since we have to specify what the left table will be.
  • LEFT OUTER JOIN, and RIGHT OUTER JOIN can both be shorted to simply LEFT JOIN and RIGHT JOIN

A LEFT OUTER JOIN results in the set of records that are in the left table, if there is no match with the right table, the results are null.

Think of a venn diagram with the left circle and center filled up with the right circle being empty.

Syntax

SELECT * FROM TableA
LEFT OUTER JOIN TableB
ON TableA.col_match = TableB.col_match

If something is only found in tableB, it will not be returned in the query.

Let's explore this with our login example.

  • Grab everything from the Registrations table
  • If there happens to be a match with the logins table, then go ahead and return it.
  • If there is no match, we're still going to grab the information from left hand table (Logins), and place 'null' in as the value.
SELECT * FROM Registrations
LEFT OUTER JOIN Logins
ON Registrations.name = Logins.name

reg_id Name log_id name
1 Andrew 2 Andrew
2 Bob 4 Bob
3 Charlie null null
4 David null null

As we did before, let's further qualify this with a WHERE statement to get rows only unique to the left table.

SELECT * FROM Registrations
LEFT OUTER JOIN Logins
ON Registrations.name = Logins.name
WHERE Logins.id IS null
Visualization:

Alt text



PgAdmin examples! Let's look at the film and inventory tables. We're returning the rows that are either just in the film table, or in both the film and inventory table.

Adding the WHERE, it filters it to films that are not in inventory, or in any store! But, we do have information available on them.

SELECT film.film_id,title, inventory_id FROM film
LEFT JOIN inventory ON
inventory.film_id = film.film_id
WHERE inventory.film_id IS null
Solution:

Not in inventory!

Alt text



RIGHT (OUTER) JOIN

Essentially the exact same that as a LEFT JOIN, but the tables are switched up... Yes, it is sorta confusing the LEFT JOIN excludes the right table, and the RIGHT JOIN excludes the left table.

You can apply a 'WHERE TableA.id IS null' to either join to exclude the results that match both columns (see below).

Visualization:

Alt text


NOTE: It is up to you how you have the tables organized "in your mind" when it comes to choosing a LEFT vs RIGHT join, since depending ont he table order you specify in the JOIN, you can perform duplicate JOINs with either method.



UNION

A UNION is essentially stacking the return queries from SELECT statements.

You can have multiple UNION calls in one query to combine results.

Syntax:

SELECT * FROM Sales2021_Q1
UNION
SELECT * FROM Sales2021_Q2
UNION
SELECT * FROM Sales2021_Q3
ORDER BY name

_ UNION calls don't have a good use case example in the dvdRental examples we've been using so we'll revisit it later on... _



JOIN CHALLENGE tasks!

  1. California sales tax laws have changed and we need to alert our customers to this through email.
  • What are the emails for the customers who live in California?
SELECT email, district FROM address
LEFT JOIN customer
ON address.address_id = customer.address_id
WHERE district = 'California'
Solution:
  • Remember, you could use an INNER or RIGHT join here.
  • You could also switch the FROM and LEFT JOIN tables in this basic example and it would be the same result.

Alt text


  1. A customer walks in to the store adn there a huge fan of the actor "Nick Wahlberg" and wants to know what movies he is in.
  • What movies are they in?

We first need to identify what table is the "go between", "link", "bridge", (call it what you want) to get us the information in the prompt.

The film table doesn't have direct information for what actors there are, only the title... So, we at least know we'll need the film table

The actor table has the actor_id, AND the film_actor table has both the actor_id id.

So, we first join the the film_actor table on the actor on matching actor and film_actor id. This would return the list of matching actor/film ids, but not the titles.

SELECT * FROM actor
INNER JOIN film_actor
ON actor.actor_id = film_actor.actor_id

This is where we would perform a 2nd JOIN of the film table onto the existing table (film_actor joined on actor) where actor_id matches the film_id.

SELECT * FROM actor
INNER JOIN film_actor ON actor.actor_id = film_actor.actor_id
INNER JOIN film ON film.actor_id = film.film_id

This is a good start, but this doesn't specify "Nick Wahlberg" anywhere, or if 'actor_id = 2'. So, we add a WHERE with this condition to the end.

SELECT title, first_name, last_name FROM film_actor
INNER JOIN actor ON film_actor.actor_id = actor.actor_id
INNER JOIN film ON film_actor.film_id = film.film_id
WHERE actor.actor_id = 2



Advanced SQL topics

These topics don't fit into the general flow of the previous section so we'll go over them one at a time here.

We'll go over commands that report back date and time information, as well as topics that will be useful when creating our own tables and databases.

Timestamps and Extracts - Part One

TIME - Contains only time

DATE - Contains only date

TIMESTAMP - Contains date and time

TIMESTAMPTZ - Contains date, time, and timezone.


REMEMBER: You can always remove historical information, but you can't add it!

Functions and operations related to these specific data types:

  • TIMEZONE

  • NOW()

  • TIMEOFDAY()

  • CURRENT_TIME

  • CURRENT_DATE



SHOW ALL - Will return a large list of parameters for your SQL environment.

Let's checkout timezone... (returns 'America/Chicago' format timezone)

SHOW TIMEZONE

Returns timezone stamp of the immediate moment.

SELECT NOW()

If you'd like a more user friendly format, a literal string interpretation , use TIMEOFDAY()

SELECT TIMEOFDAY()

Or, grab the current time

SELECT CURRENT_TIME

Or, the current date

SELECT CURRENT_DATE


Timestamps and Extracts - Part Two

Let's explore extracting information from a time based data type using:

  • EXTRACT()
  • AGE()
  • TO_CHAR()

EXTRACT()

Allows you to "extract" or obtain a sub-component of a date value.

  • YEAR
  • MONTH
  • DAY
  • WEEK
  • QUARTER

Syntax:

EXTRACT(YEAR FROM date_col)

AGE()

Calculates and returns the current age given a timestamp Syntax:

AGE(date_col)

13 years 1 mon 5 days: 01:34:13.003425

TO_CHAR()

General function to convert data types to text. Very useful for timestamp formatting. Syntax:

TO_CHAR(date_col, 'mm-dd-yyyy')


A working example would be...

SELECT EXTRACT(YEAR FROM payment_date) AS pay_year FROM payment

Or...

NOTE: When using WHERE you cannot call on the created variable name (pay_quarter), you have to call on the argument for the variable (EXTRACT(QUARTER FROM payment_date))

SELECT EXTRACT(QUARTER FROM payment_date) AS pay_quarter FROM payment
WHERE EXTRACT(QUARTER FROM payment_date) = 4

How old is a particular timestamp in regards to our current date?

SELECT AGE(payment_date) FROM payment.


TO_CHAR()

SELECT TO_CHAR(payment_date, 'YYYY/MM/DD HH24:MI:SS') as formatted_time FROM payment

Google 'SQL time formatting' to get a full list of options, there are tons!




CHALLENGE! (timestamps, operators)

  • During which months did payments occur?
    • Format your answer to return back the full month name.
SELECT DISTINCT(TO_CHAR(payment_date, 'Month')) payment_date FROM payment
Solution:

Alt text



  • How many payments occurred on a Monday?
    • We had to google the 'dow' function, day-of-week, which pulls out a numeric number (0-6, Sunday-Saturday) for each day of the week.
SELECT COUNT(EXTRACT(DOW FROM payment_date)) FROM payment
WHERE EXTRACT(DOW FROM payment_date) = 1

A cleaner answer would be...

SELECT COUNT(*) FROM payment
WHERE EXTRACT(DOW FROM payment_date) = 1



Mathematical Functions

PgSQL Mathematical functions documentation

We're going to go through some mathematical operations where we can add one column, or do something, with column and another (or 2 or 3...)

Back to PgAdmin: What percentage of the replacement cost is a rental rate?

SELECT rental_rate/replacement_cost FROM film
Returns:

Alt text

This is a good start, but it's a long number, not in the right form and for each row of the table... Let's clean it up.

SELECT ROUND(rental_rate/replacement_cost, 2)*100 AS percent_cost FROM film

Now we get back a good looking percent format! (33.00, 28.00, etc)




String Functions and Operators

We can edit, combine, and alter text data columns in PgSql.

String Functions and Operators documentation

Let's use the customer table.

In PgSQL, we use || as the 'concat' operand, so..

SELECT first_name || ' ' || last_name as full_name
FROM customer

You can change the case type of a string easily using UPPER() or LOWER()

SELECT UPPER(first_name) || ' ' || lower(last_name) as full_name FROM customer

What if there were missing emails, but we had the name information used to create the email (in our format)?

SELECT first_name || last_name || '@gmail.com' FROM customer
Returns:

Alt text



This is a very basic approach to creating emails. Let's assume we dont want the first letter of the last name capatilized, we only want the first letter for the first name capatalized, and we want '@gmail.com' for each...

 SELECT UPPER(LEFT(first_name, 1))
 || LOWER(last_name)
 || '@gmail.com' as customer_email
 FROM customer
Returns:

Alt text




SubQuery

Sub queries are queries nested inside another query. They allow you to perform complex operations by combining multiple tables and performing calculations on them.

Another way to say it is; A sub query is a query that is nested inside another query and returns one row for each result produced by the outer query.


We know how to grab all the info from a table...

SELECT student_grade FROM test_scores

And how to get the AVG() of some info.

SELECT AVG(student_scores) FROM test_scores

How can we get a list of students who scored BETTER than the average grade?

  • By placing a 2nd SELECT statement as the "less than" returned condition in the WHERE statement below, wrapped inside parenthesis.
SELECT student,grade
FROM test_scores
WHERE grade > (SELECT AVG(grade) FROM test_scores)

Above, the subquery ((SELECT AVG(grade) FROM test_scores)) is performed first since it is inside the parenthesis.


REMEMBER: You can also use the IN operator in conjunction with a subquery to check against multiple results returned.



A subquery can operate on a separate table:

Below, the separate select narrows down the overall return to only contain students that are in both the test_scores and honor_roll_table.

SELECT student, grade FROM test_scores
WHERE student IN (SELECT student FROM honor_roll_table)


EXISTS()

The EXISTS operator is used to test for the existence of any rows returned by a subquery - Think of it as a boolean

Syntax:

SELECT column_name FROM table_name
WHERE EXISTS
(SELECT column_name FROM table_name WHERE ~condition goes here~)


OK - Back to subquerys. To set one up first, create the subquery. Then, wrap it in parenthesis and place it at the end of your query with a line above it.

-- Leave a line here to start
(SELECT AVG(rental_rate) FROM film)

Then, add in the large (top-level) query to get your columns and table...

  • This first gets the average rental_rate from film, then returns the title and rental rate of films where the rental rate is higher than the average.
SELECT title, rental_rate
FROM film
WHERE rental_rate >
(SELECT AVG(rental_rate) FROM film)


NOTE: If your subquery returns multiple values, you'll have to use the IN operator!


Let's see this in combination with a join...

Situation: We want to grab the film titles that have been returned BETWEEN a certain set of dates, May 29th 2005 - May 30th 2005. Below we have the films from the dates we need in bulk format with no titles or inventory information.

SELECT * FROM rental
WHERE return_date BETWEEN '2005-05-29' AND '2005-05-30'

Next, we want to join the inventory table on matching inventory ids to make sure they are in inventory.

SELECT inventory.film_id FROM rental
INNER JOIN inventory
ON inventory.inventory_id = rental.inventory_id
WHERE return_date BETWEEN '2005-05-29' AND '2005-05-30'

Great! Now what would be the easiest way to just get the title information of these films added? (What section are we in?!) That's right, wrap the query below in parenthesis and make it the subquery with the line above it, then create the top-level query to get the film id titles included IN the subquery.

Remember, we want the title of the film ids where the id is in the subquery...

SELECT title, film_id FROM film
WHERE film_id IN
(SELECT inventory.film_id FROM rental
INNER JOIN inventory
ON inventory.inventory_id = rental.inventory_id
WHERE return_date BETWEEN '2005-05-29' AND '2005-05-30')

using EXISTS(), we can find the customers who have at least 1 payment that's greater than 11. We want both the first and last names... What tables need accessed to get all required information?

  • You could also use NOT before EXISTS() to return all the customers who have not met the subquery condition.
SELECT first_name, last_name FROM customer AS c
WHERE EXISTS(
SELECT * FROM payment AS p
WHERE c.customer_id = p.customer_id
AND amount > 11)
Returns:

Results



CHALLENGE!

Try to do the same, but only with customers from California.

SELECT first_name, last_name FROM customer AS c
WHERE EXISTS(
	SELECT *
	FROM payment AS p
	WHERE c.customer_id = p.customer_id
	AND amount > 11
	AND EXISTS(
		SELECT *
		FROM address AS ad
		JOIN address
		ON c.address_id = ad.address_id
		WHERE address.district LIKE 'California')
)


How can we double check this? We did a quick manual check of the # of customers total from California and there are 9.

self-join

A self-join in SQL is a join operation where a table is joined with itself. This is done by creating two or more aliases for the same table and specifying how the rows in the same table relate to each other. Self-joins are useful in situations where you need to compare and retrieve data from the same table based on some relationship between rows within that table. Here's the syntax for a self-join:

SELECT t1.column1, t2.column2
FROM table_name AS t1
INNER JOIN table_name AS t2 ON t1.related_column = t2.related_column;

There is no special keyword for a self join, its simply standard JOIN syntax with the same table in both parts. However, because you're using the same table you you must use an alias to differentiate between the original and "copy" (not an actual copy, it's just pulled twice)

Alt text



Let's explore a more realistic situation of when you would us a self-join!

You have a company and some employees, and employees have to send reports to other employees (below). We can see that Andrew sends a report to report_id 3, which is Charlie (emp_id = 3). Following this, you can see who see who sends reports to who.

Alt text


Let's show the results of the employee name, and their reports recipient name:

SELECT emp.name, report.name
FROM employees as emp
JOIN employees as report ON
emp.emp_id = report.report_id;
Returns (➡):

Alt text



Into PgAdmin we go for some examples!

Find all the PAIRS of films that have the same length (as the initial one at the top of each category, ie each different length)

To start...

SELECT * FROM film
WHERE length = 117

Now, using aliases we want to pair each with the movie where the ON identifies film ids that are different from the first, and where both are the same length, all from the same table.

SELECT f1.title, f2.title, f1.length
FROM film AS f1
INNER JOIN film AS f2
ON f1.film_id != f2.film_id
AND f1.length = f2.length
Returns:

Alt text




Restoring Database Exercises

This new database has a public and cd schema. Meaning, the queries for the FROM tables will have cd. in front of them, for example:

SELECT * FROM cd.bookings

NOTE: The exercise.tar file ( available through paid Udemy course, or the Google drive download link below) will be used for this portion of the course.

.tar file download

Once this is added, restored, and refreshed in PgAdmin (quick setup video!) you'll notice there are 2 schemas. One public, and one cd.

SELECT * FROM cd.bookings

This database is like a "gym", with facility info, booking information, and members. Some fields may be null...

For this portion we're going to use the Google link provided ( SQL Assessment Test 2 Questions and Expected Results ) so you can test yourself with the new table information using the notes taken above.

Try to think first about how you would construct your query. Does it need a subquery? How many tables (therefore, joins) are needed to gather the necessary columns? Do you need to find all of a match, or everything but the match?

Read the questions carefully and take your time, this is designed to be somewhat challenging so you can work through and solidify what's already been presented.

NOTE: Keep in mind there is usually more than one way to answer these questions. For the solutions you can visit this link (answers)





Quiz time!

  1. How can you retrieve all the information from the cd.facilities table? Expected Result should look similar to this (with more rows):

Answer:

SELECT * FROM cd.facilities

  1. You want to print out a list of all of the facilities and their cost to members. How would you retrieve a list of only facility names and costs?

Answer: (Below, we're actually using the AS shorthand which is simply not including the AS before the new variable names, facility_name and guest_cost)

SELECT cd.facilities.name facility_name, cd.facilities.guestcost guest_cost FROM cd.facilities
ORDER BY guest_cost
Returns:

Alt text


  1. How can you produce a list of facilities that charge a fee to members?
SELECT * FROM cd.facilities
WHERE cd.facilities.membercost > 0
Returns:

Alt text


  1. How can you produce a list of facilities that charge a fee to members, and that fee is less than 1/50th of the monthly maintenance cost?

Return the facility_id, fac_name, member cost, and monthly maintenance cost.

SELECT facid, name, membercost, monthlymaintenance
FROM cd.facilities
WHERE membercost > 0 AND
(membercost < monthlymaintenance/50.0);
Returns:

Alt text


  1. how can you produce a list of all facilities with the word 'Tennis' in their name?
SELECT * FROM cd.facilities
WHERE cd.facilities.name ILIKE '%Tennis%'
Returns:

Alt text


  1. How can you retrieve the details of facilities with ID 1 and 5? Try to do without using the OR operator
SELECT * FROM cd.facilities
WHERE facid IN (1,5)
Returns:

Alt text


  1. How can you produce a list of members who joined after the start of September 2012?
  • Return the memid, surname, firstname, and joindate of the members in question.
SELECT memid, surname, firstname, joindate FROM cd.members
WHERE joindate > '08-31-2012'
ORDER BY joindate
Returns:

Alt text


  1. How can you produce an ordered list of the first 10 surnames in the members table? The list must not contain duplicates.
  • Expected result should be 10 rows if you include GUEST as a last name.
SELECT DISTINCT(cd.members.surname) FROM cd.members
ORDER BY surname
LIMIT 10
Returns:

Alt text


  1. You'd like to get the sign-up date of your last member. How can you retrieve this information?
SELECT cd.members.joindate first_mem_joined FROM cd.members
ORDER BY cd.members.joindate DESC
LIMIT 1
Returns:

Alt text


  1. Produce a count of the number of facilities that have a cost to guests of 10 or more.
SELECT COUNT(cd.facilities.guestcost) AS num_of_fac_cost_over_10 FROM cd.facilities
WHERE cd.facilities.guestcost > 10
Returns:

Alt text


  1. Produce a list of the total number of slots booked per facility in the month of september 2012. Produce an output table consisting of facility id and slots, sorted by the number of slots.
SELECT cd.bookings.facid, SUM(cd.bookings.slots) AS sept_slots FROM cd.bookings
WHERE EXTRACT(MONTH FROM cd.bookings.starttime) = 9
GROUP BY cd.bookings.facid ORDER BY sept_slots
Returns:

Alt text


  1. Produce a list of facilities with more than 1000 slots booked. Produce an output table consisting of facility id and total slots, sorted by facility id.
SELECT cd.bookings.facid, SUM(cd.bookings.slots) FROM cd.bookings
GROUP BY cd.bookings.facid
HAVING SUM(cd.bookings.slots) > 1000
ORDER BY cd.bookings.facid
Returns:

Alt text


  1. Produce a list of the start times for bookings for tennis courts, for the date '2012-09-21'. Return a list of start time and facility name pairings, ordered by the time.
SELECT starttime, facilities.name FROM cd.bookings
JOIN cd.facilities ON bookings.facid = facilities.facid
WHERE DATE(starttime) = '2012-09-21' AND facilities.name LIKE '%Tennis Court%'
ORDER BY starttime
Returns:

Alt text


  1. How can you produce a list of the start time for bookings by members name 'David Farrell'?
SELECT starttime, members.firstname FROM cd.bookings
JOIN cd.members ON bookings.memid = members.memid
WHERE members.firstname = 'David'
AND members.surname = 'Farrell'
Returns: (34 rows total)

Alt text





Creating Database and Tables

We've only focused on querying and reading data from existing tables thus far.

Section overview:

  • Data Types
  • Primary and Foreign Keys
  • Constraints
  • CREATE
  • INSERT
  • UPDATE
  • DELETE, ALTER, DROP

We'll first focus on learning a few theoretical concepts, such as choosing the correct data types for a stored value and setting possible constraints on it.

Also, primary and foreign keys.


Data Types


We've already encountered numerous data types.

  • Boolean
    • True or False
  • Character
    • char, varchar, and text
  • Numeric
    • integer, and floating-point number
  • Temporal date, time, timestamp, and interval

Less common ones...

  • UUID
    • Universally Unique Identifiers
  • Array
    • Stores an array of strings, numbers, etc.
  • JSON
  • Hstore key-value pair
  • Special types such as network address and geometric data.

For example, using a phone number. Do we want to store it as numeric? If so, which type of numeric? 'bigint' would be the obvious choice, but what is actually best for the situation?

Use Google to search for "best practice, phone number, SQL" and go from there based off what the community says.

Stack Overflow discussion link: "What datatype should be used for storing phone numbers in SQL Server"?

These are just phone numbers that do not get any arithmetic done on them at all. So, let's actually use a VARCHAR data type!

When creating a database and table, take the time to plan for long term storage. You can always remove historical information, but you cannot go back and add. Always record more information, not less!

To decide what data types to use, review the docs: https://www.postgresql.org/docs/current/datatype.html

Alt text




Primary and Foreign Keys

Primary Key

  • Definition: A primary key is a column or a set of columns that uniquely identify each record in a table.
  • Uniqueness: Each value in the primary key column must be unique; it cannot contain NULL values.
  • Purpose: Uniquely identifies each record within the table. Acts as a unique identifier for a table's records.
  • Constraints: Only one primary key is allowed per table. It can comprise one or multiple columns.
  • Example: In a table of 'customers', the 'customer_id' column might be the primary key, ensuring each customer has a unique identification number.

See PgAdmin visual PK representation (customer_id, [PK] integer):

Alt text



Foreign Key

  • Definition: A foreign key is a column or set of columns in a table that establishes a link between data in two tables.
    • The table that contains the foreign key is called referencing table or child table.
    • The table to which the foreign key references is called referenced table or parent table.
  • Relationship: It creates a relationship between this column and the primary key of another table.
    • Recall in the dvdrental database payment table, each payment row had its unique payment_id( a primary key) and identified the customer that made the payment through the customer_id (a foreign key, since it references the customer tables primary key)
  • Purpose: Maintains referential integrity between tables. Defines a relationship between two tables, referencing the primary key in another table.
  • Constraints: Values in the foreign key column must either match a value in the primary key column of the referenced table or be NULL.
  • Example: In a 'Payment' table, a 'customer ID' column could be a foreign key, referencing the 'customer ID' in the 'Payments' table, linking each payment to the customer.

See PgAdmin visual representation of multiple foreign keys within one table:

Alt text



Key Differences

  • Uniqueness:

    Primary keys must be unique within their table. Foreign keys maintain referential integrity by referencing the primary key in another table.

  • Purpose: Primary keys identify unique records within a table. Foreign keys establish and maintain relationships between tables.

  • Constraints: Primary keys do not allow NULL values. Foreign keys can contain NULL values, allowing for optional relationships between tables.

    • For example: in our dvdrental database we saw customers had a unique, non-null customer_id column as their primary key.

      See PgAdmin visual PK representation for the example above:

      Alt text


In summary, primary keys uniquely identify records within a table, while foreign keys create relationships between tables by referencing the primary key of another table, ensuring data consistency and integrity.

When creating tables and defining columns, we can use constraints to define columns as being a primary key, or attaching a foreign key relationship to another table.




What if we want to view to view a list of the relationships of PKs and FKs for a given table?

In Pg Admin, you can go to: Schemas > Tables > desired_table > Constraints, and then view the PK shown in gold, and the FKs shown as double grey keys.

PgAdmin view for PK and FK relationship constraints

Alt text



From there, select one of the foreign keys and click on the 'Dependencies' tab to the right. It will show you a list of the dependencies for that particular column and alert you to what tables its referencing. Alt text


You can also right click the constraint, click on the 'Columns' tab, and it will tell you what it's referencing.

Alt text