- Complete SQL + Databases Bootcamp: Zero to Mastery [2020]
-
Section 3: Databases + SQL Fundamentals
- 13. SQL Playground
- 14. Exercise: Setting Up Your First Database
- 15. What Is SQL?
- 16. What Is A Query?
- 17. Imperative vs Declarative
- 18. History of SQL
- 19. Exercises: The Select Statement
- 20. Optional: History of SQL Deep Dive
- 21. SQL Standards
- 22. What Is A Database? Revisited
- 23. Database Oriented Approach
- 24. Exercise: SQL Starter Quiz
- 25. Database Models
- 26. Hierarchical And Networking Model
- 27. Relational Model
- 28. DBMS Revisited
- 29. Relational Model Revisited
- 30. Tables
- 31. Columns
- 32. Rows
- 33. Primary And Foreign Keys
- 34. OLTP vs OLAP
- 35. Exercise: OLTP vs OLAP
- 36. Exercise: Relational Model Quiz
-
- 52. Starting With Query
- 53. Exercise: Simple Queries
- 54. Exercise: Renaming Columns
- 55. Concat Function
- 56. What Is A Function In SQL?
- 57. Aggregate Functions
- 58. Exercise: Aggregate Functions
- 59. Commenting Your Queries
- 60. Common SELECT Mistakes
- 61. Filtering Data
- 62. AND and OR
- 63. Exercise: Filtering Data
- 64. Exercise: The Where Clause
- 65. The NOT Keyword
- 66. Comparison Operators
- 67. Exercise: Comparison Operators
- 68. Logical Operators
- 69. Operator Precedence
- 71. Exercise: Operator Precedence
- 72. Checking For NULL Values
- 73. IS Keyword
- 74. NULL Coalescing
- 75. Exercise: Null Value Coalescing
- 76. 3 Valued Logic
- 77. Exercise: 3 Valued Logic
- 78. BETWEEN + AND
- 79. IN Keyword
- 80. Exercise: IN Keyword
- 81. LIKE
- 82. Exercise: Like Keyword
- 83. Dates And Timezones
- 84. Setting Up Timezones
- 85. How Do We Format Date And Time?
- 86. Timestamps
- 87. Date Functions
- 88. Date Difference And Casting
- 89. Age Calculation
- 90. Extracting Information
- 91. Intervals
- 92. Exercise: Date and Timestamp
- 93. DISTINCT
- 94. Exercise: Distinct Keyword
- 95. Sorting Data
- 96. Multi Table SELECT
- 97. Inner Join
- 98. Self Join
- 99. Outer Join
- 100. Less Common Joins
- 101. Inner-Join Exercises
- 102. USING Keyword
-
- 103. GROUP BY
- 104. HAVING Keyword
- 105. Ordering Grouped Data
- 106. Group By Mental Model
- 107. Grouping Sets
- 108. Rollup
- 109. Group By Exercises
- 110. Window What?
- 111. Looking Through The Window
- 112. PARTITION BY
- 113. Order By Acting Strange
- 114. Using Framing In Window Function
- 115. Solving For Current Salary
- 116. FIRST_VALUE
- 117. LAST_VALUE
- 118. SUM
- 119. ROW_NUMBER
- 120. Window Function Exercises
- 121. Conditional Statements
- 122. NULLIF
- 123. Views...What Are They Good For?
- 124. View Syntax
- 125. Using Views
- 126. Indexes
- 127. Index Types
- 128. Index Algorithms
- 129. What Are Subqueries?
- 130. Subqueries vs Joins
- 131. Subquery Guidelines As Types
- 132. Using Subqueries
- 133. Getting The Latest Salaries
- 134. Subquery Operators
- 135. Subquery Exercises
-
Section 7: Database Management
- 136. What we'll be learning
- 137. Time To Create Some Stuff!
- 138. Types Of Databases In A RDBMS
- 139. Default PostgreSQL Database
- 140. Template Databases
- 141. Creating A Database
- 142. Database Organization
- 143. Roles In Postgres
- 144. Role Attributes And Creation
- 145. Creating Users And Configuring Login
- 146. Privileges
- 147. Granting Privileges and Role Management
- 148. Best Practices For Role Management
- 149. Data Types & Boolean Type
- 150. Storing Text
- 151. Storing Numbers
- 152. Storing Arrays
- 153. Data Models And Naming Conventions
- 154. CREATE TABLE
- 155. Extra information on CREATE TABLE
- 156. Column Constraints
- 157. Table Constraints
- 158. Regexes!
- 159. UUID Explained
- 160. Custom Data Types And Domains
- 161. Creating The Tables For ZTM
- 162. Extra information on ALTER TABLE
- 163. Adding Students And Teachers
- 164. Creating A Course
- 165. Adding Feedback To A Course
- 166. A Tale Of 2 Feedbacks
- 167. SQL Exercises
- 168. SQL Quiz
- 169. Backups And Why They Are Important
- 170. Backing Up In Postgres
- 171. Restoring A Database
- 172. Transactions
-
Section 8: Solving The Mystery
-
- 185. System Design And SDLC
- 186. SDLC Phases
- 187. System Design Deep Dive
- 188. Top-Down vs Bottom-Up
- 189. DRIVEME Academy
- 190. Top Down Design
- 191. ER Model
- 192. Step 1: Determining Entities
- 193. Tooling For Diagramming
- 194. DRIVEME Academy Entities
- 195. Step 2: Attributes
- 196. Relational Model Extended
- 197. Relational Schema And Instance
- 198. Super Key and Candidate Key
- 199. Primary Key and Foreign Key
- 200. Compound Composite And Surrogate Key
- 201. DRIVEME Attributes
- 202. Step 3: Relationships
- 203. DRIVEME Relationships
- 204. Step 4: Solving Many To Many
- 205. Step 5: Subject Areas
- 206. DRIVEME Subject Areas
- 207. Exercise: Painting Reservations
- 208. Exercise: Movie Theatre
- 209. Bottom Up Design
- 210. Anomalies
- 211. Normalization
- 212. Functional Dependencies
- 213. Functional Dependencies 2
- 214. The Normal Forms
- 215. Going from 0NF to 1NF
- 216. Going from 1NF to 2NF
- 217. Going from 2NF to 3NF
- 218. Boyce-Codd Normal Form
- 219. Why 4NF And 5NF Are Not Useful
- 220. Exercise: Database Design Quiz
-
Section 10: Database Landscape, Performance and Security
-
Section 14: Extra Bits: Databases In A Web App (with Node.js)
- Somebody stole our trade secrets!
It's a collection of data and a method for accessing and manipulating that data.
Data is collected from many sources
- Drone
- Mobile App
- Web App
- Videocam
- Automobile
Why is a database always represented with a cylinder?
Data is collected and used by different people
- Product Managers: learn from data to improve the product
- Marketers: analyse data to support business decisions
- Web Developers
- Data Analysts: analyse data and build machine learning models
- Data Engineer
- Data Administrator
- How to put data in DB
- How to use/update/learn from data
- How to remove data
Confusing Acronyms
- DBMS: Database Management System
- RDBMS: Relational Database Management System
- SQL: Structured Query Language
5 data models
- Relational: supports ACID transactions
- Document: data is in a document, scalability
- Key Value: simple way to access data
- Graph: good for connected data
- Wide Columnar
- A database is a hardware and software system that allows a user to store, organize and use data.
- A database is a hardware and software system that stores data and answers questions.
- Do you have useful data?
- Install a DB to Store Data
- Use SQL to Setup + Use Data
DB Fiddle - SQL Database Playground
CREATE TABLE Class (
id character(255),
year integer NOT NULL,
PRIMARY KEY(id)
);
CREATE TABLE Student (
id varchar(255) NOT NULL,
class varchar(255) NOT NULL,
firstName varchar(255) NOT NULL,
lastName varchar(255) NOT NULL,
dob date NOT NULL,
sex varchar(1) NOT NULL,
FOREIGN KEY (class) REFERENCES Class(id),
PRIMARY KEY (id)
);
INSERT INTO Class(id, year) VALUES('c1', 1);
INSERT INTO Class(id, year) VALUES('c2', 2);
INSERT INTO Class(id, year) VALUES('c2a', 2);
INSERT INTO Student VALUES ('s1', 'c1', 'George', 'Jacobson', '1992-01-01', 'm');
INSERT INTO Student VALUES ('s2', 'c2', 'Macy', 'Waterson', '1992-01-01', 'f');
INSERT INTO Student VALUES ('s3', 'c1', 'Bill', 'Peters', '1992-01-01', 'm');
INSERT INTO Student VALUES ('s4', 'c1', 'Janine', 'Wilson', '1992-01-01', 'f');
INSERT INTO Student VALUES ('s5', 'c2', 'Jason', 'Lipton', '1992-01-01', 'm');
- SQL is a programming language used to talk to databases
- A database is a structured set of data
- A query is a question or request written as a SQL statement
CREATE TABLE User (
id varchar(255) NOT NULL,
name varchar(255) NOT NULL,
lastName varchar(255) NOT NULL,
dob date NOT NULL,
sex varchar(1) NOT NULL,
role varchar(255) NOT NULL,
PRIMARY KEY (id)
);
INSERT INTO User VALUES ('u1', 'George', 'Jacobson', '1992-01-01', 'm', 'manager');
INSERT INTO User VALUES ('u2', 'Macy', 'Waterson', '1992-01-01', 'f', 'employee');
INSERT INTO User VALUES ('u3', 'Bill', 'Peters', '1992-01-01', 'm', 'employee');
INSERT INTO User VALUES ('u4','Janine', 'Wilson', '1992-01-01', 'f', 'manager');
INSERT INTO User VALUES ('u5', 'Jason', 'Lipton', '1992-01-01', 'm', 'manager');
SELECT *
FROM User
WHERE role = 'employee';
- Declarative: describe what you want (the result)
- Imperative: describe how to get it (the steps)
- Edgar F. Codd
- Donald Chamberlin
- The Story of two Genius Scientists Behind SQL
- A Relational Model of Data for Large Shared Data Banks
- SEQUEL: A Structured English Query Language
Exercises: The Select Statement
-- Insert the missing statement to get all the columns from the Customers table.
SELECT *
FROM Customers;
-- Write a statement that will select the City column from the Customers table.
SELECT City
FROM Customers;
-- Select all the different values from the Country column in the Customers table.
SELECT DISTINCT Country
FROM Customers;
The SQL Standard – ISO/IEC 9075:2016 (ANSI X3.135)
File processing systems
- save data in individual files
- no correlation between the files
- no relationship between the data
- an update in one file is not reflected in related files
- no common standards across systems
- no common models for the data
- data is isolated and duplicated across systems
Model: structure of data
How do you organize and store your data?
- Hierarchical
- Networking
- Entity-Relationship
- Relational
- Object Oriented
- Flat
- Semi-Structured
Hierarchical Model
- each child can have only one parent
- a child is tightly coupled with its parent
- represents one-to-many relationships
Networking Model
- represents many-to-many relationships
DBMS features
- App <-> SQL <-> DBMS
- CRUD operations
- Manage your data
- Secure your data
- Transactional management
- OLTP: supports day-to-day operations
- OLAP: supports data analysis
- A database is being used to log orders and customers: OLTP
- A database is being used to figure out what new products we should offer: OLAP
- A database is being used to derive statistics for reporting to the executives: OLAP
- A database is being used to keep track of logged in users: OLTP
Exercise: Relational Model Quiz
PostgreSQL is the world's best database
Our choice was based on multiple criteria:
- Is it cross-platform?
- Is it easy to use?
- Is it free?
- Can it be extended to multiple database providers?
Alternatives
psql -U chesterheng -d Employees < employees.sql
psql -U chesterheng -d World < world.sql
psql -U chesterheng -d Store < store.sql
psql -U chesterheng -d France < france.sql
- DCL: Data control language
- DDL: Data definition language
- DQL: Data query language
- DML: Data manipulation language
What is a select? Retrieve data
SELECT * FROM "public"."employees";
SELECT * FROM "public"."departments";
SELECT emp_no AS "Employee #",
birth_date AS "Birthday",
first_name AS "First Name"
FROM "public"."employees";
- Make your data prettier/simpler
SELECT emp_no,
CONCAT(first_name, ' ', last_name) AS "full name"
FROM "public"."employees";
- input -> function -> output
- a function is a set of steps that creates a single value
Types of functions
- Aggregate: operate on many records to produce 1 value
- Aggregate: produce a summary
- Scalar: operate on each record independently
- AVG()
- COUNT()
- MIN()
- MAX()
- SUM()
SELECT COUNT(emp_no) FROM "public"."employees";
SELECT MAX(salary) FROM "public"."salaries";
SELECT SUM(salary) FROM "public"."salaries";
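For contrast, a minimal sketch of scalar functions (my own example against the Employees database, not from the course): each call works on one row's value instead of summarising the whole table.
-- scalar functions run once per row
SELECT UPPER(first_name) AS "uppercase name",
       LENGTH(last_name) AS "name length"
FROM "public"."employees"
LIMIT 5;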
--
/*
* What database should I use for these exercises?
* Name: Employees
*/
--
-- Question 1: What is the average salary for the company?
-- Table: Salaries
-- Result: 63810.744836143706
select avg(salary) from salaries;
-- Question 2: What year was the youngest person born in the company?
-- Table: employees
select max(birth_date) from employees;
--
/*
* What database should I use for these exercises?
* Name: France
*/
--
-- Question 1: How many towns are there in france?
-- Table: Towns
-- Result: 36684
select count(id) from towns;
--
/*
* What database should I use for these exercises?
* Name: World
*/
--
-- Question 1: How many official languages are there?
-- Table: countrylanguage
-- Result: 238
select count(countrycode) from countrylanguage
where isofficial = true;
-- Question 2: What is the average life expectancy in the world?
-- Table: country
-- Result: 66.48603611164265
select avg(lifeexpectancy) from country;
-- Question 3: What is the average population for cities in the netherlands?
-- Table: city
-- Result: 185001
select AVG(population) from city
where countrycode = 'NLD';
How to make comments the most important ‘code’ you write
Ten Common SQL Programming Mistakes
- Misspelling commands
- Using ; instead of , and vice versa
- Using " instead of '
- " is for table and column names (identifiers)
- ' is for text values (string literals)
- Invalid column names
SELECT first_name
FROM "public"."employees"
WHERE gender = 'F';
SELECT first_name, last_name, hire_date
FROM "public"."employees"
WHERE first_name = 'Georgi'
AND last_name = 'Facello'
OR first_name = 'Bezalel'
AND last_name = 'Simmel';
SELECT firstname, lastname, gender, state
FROM customers
WHERE gender = 'F' AND (state = 'OR' OR state = 'NY') ;
-- Select all records where the City column has the value "Berlin".
SELECT *
FROM Customers
WHERE City = 'Berlin';
-- Use the NOT keyword to select all records where City is NOT "Berlin".
SELECT *
FROM Customers
WHERE NOT City = 'Berlin';
-- Select all records where the CustomerID column has the value 32.
SELECT *
FROM Customers
WHERE CustomerID = 32 ;
-- Select all records where the City column has the value 'Berlin' and the PostalCode column has the value 12209.
SELECT *
FROM Customers
WHERE City = 'Berlin'
AND PostalCode = 12209;
-- Select all records where the City column has the value 'Berlin' or 'London'.
SELECT * FROM Customers
WHERE City = 'Berlin'
OR City = 'London';
What if you want to filter on everything but ... ?
SELECT age FROM customers WHERE NOT age = 55;
SELECT COUNT(age) FROM customers WHERE NOT age = 55;
Comparison Functions and Operators
-- How many female customers do we have from the state of Oregon (OR)?
-- Result: 106
SELECT COUNT(firstName)
FROM customers
WHERE gender = 'F' and state = 'OR';
-- Who over the age of 44 has an income of 100 000 or more? (excluding 44)
-- Result: 2497
SELECT COUNT(income)
FROM customers
WHERE age > 44 and income >= 100000;
-- Who between the ages of 30 and 50 has an income less than 50 000?
-- (include 30 and 50 in the results)
-- Result: 2362
SELECT COUNT(income)
FROM customers
WHERE age >= 30 and age <= 50 AND income < 50000;
-- What is the average income between the ages of 20 and 50? (Excluding 20 and 50)
-- Result: 59409.926240780098
SELECT AVG(income)
FROM customers
WHERE age > 20 and age < 50;
Logical Operators
- AND
- OR
- NOT
Order of operations
- FROM
- WHERE
- SELECT
A statement with multiple operators is evaluated based on operator precedence; see the sketch below.
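A small sketch of how precedence changes a result (assuming the Store customers table used in the exercises): AND binds tighter than OR, so the two counts below can differ.
-- without parentheses this is read as: gender = 'F' OR (state = 'OR' AND age > 30)
SELECT COUNT(*) FROM customers
WHERE gender = 'F' OR state = 'OR' AND age > 30;
-- parentheses force: (gender = 'F' OR state = 'OR') AND age > 30
SELECT COUNT(*) FROM customers
WHERE (gender = 'F' OR state = 'OR') AND age > 30;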
/*
* DB: Store
* Table: Customers
* Question:
* Select people either under 30 or over 50 with an income above 50000
* Include people that are 50
* that are from either Japan or Australia
*/
SELECT firstname, income, age
FROM customers
WHERE income > 50000
AND (age < 30 OR age >= 50)
AND (country = 'Japan' OR country = 'Australia')
/*
* DB: Store
* Table: Orders
* Question:
* What was our total sales in June of 2004 for orders over 100 dollars?
*/
SELECT SUM(totalamount)
FROM orders
WHERE (orderdate >= '2004-06-01' AND orderdate <= '2004-06-30')
AND totalamount > 100
- When a record does not have a value, it is considered empty
- NULL is used to represent a missing/empty value
- Operating on NULL yields NULL, e.g. SELECT NULL = NULL returns NULL
- Best practice is to avoid NULL where possible
Database Design Follies: NULL vs. NOT NULL
Should you use NULL?
- Optional or required?
- Future info?
- Relational? What impact?
Be defensive
- Always check for NULL when necessary
- Filter out NULL with IS operator
- Clean up your data
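A quick sketch of defensive NULL filtering with IS (assuming the Store customers table, where address2 is often empty):
-- IS NULL / IS NOT NULL are the only reliable NULL checks
SELECT COUNT(*) FROM customers WHERE address2 IS NULL;
SELECT COUNT(*) FROM customers WHERE address2 IS NOT NULL;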
In database table creation, do you allow NULL or use a default value to represent empty data?
- Use default value: cannot filter NULL data
- Use default value: cannot distinguish user-entered values from defaults
- Replace NULL values to operate on the data
- Return the first non-null value in a list
SELECT COALESCE(<column>, 'Empty')
FROM <table>
CREATE TABLE "Student" (
id serial PRIMARY KEY,
name varchar(255),
lastName varchar(255),
age int
);
INSERT INTO "Student" (name, lastName, age) VALUES ('STUDENT 1', NULL, NULL);
INSERT INTO "Student" (name, lastName, age) VALUES ('STUDENT 2', NULL, 25);
INSERT INTO "Student" (name, lastName, age) VALUES (null, 'LAST NAME 3', 20);
INSERT INTO "Student" (name, lastName, age) VALUES (null, null, 34);
INSERT INTO "Student" (name, lastName, age) VALUES ('STUDENT 5', NULL, 58);
-- Assuming a student's minimum age for the class is 15, what is the average age of a student?
SELECT avg(coalesce(age, 15)) FROM "Student";
-- Replace all empty first or last names with a default
SELECT id, coalesce(name, 'fallback'), coalesce(lastName, 'fallback'), age FROM "Student";
- TRUE | NULL | FALSE
- NULL value can be anything
- NULL is always unknown
- NULL = NULL (NULL)
- IS NULL (TRUE)
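A tiny sketch showing 3-valued logic directly (runs in any Postgres session):
SELECT NULL = NULL;   -- NULL (unknown), not TRUE
SELECT NULL IS NULL;  -- TRUE
SELECT TRUE AND NULL; -- NULL
SELECT TRUE OR NULL;  -- TRUE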
/*
* DB: Store
* Table: customers
* Question: adjust the following query to display the null values as "No Address"
*/
SELECT COALESCE(address2, 'No Address')
FROM customers
/*
* DB: Store
* Table: customers
* Question: Fix the following query to apply proper 3VL
*/
SELECT *
FROM customers
WHERE address2 IS NOT null;
/*
* DB: Store
* Table: customers
* Question: Fix the following query to apply proper 3VL
*/
SELECT coalesce(lastName, 'Empty'), * from customers
where (age IS NULL);
- Shorthand to match against a range of values
- More readable and maintainable; see the example after the templates below
WHERE <column> >= X AND <column> <= Y
WHERE <column> BETWEEN X AND Y
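A concrete sketch (assuming the Store customers table): the two queries below are equivalent, since BETWEEN is inclusive on both ends.
SELECT COUNT(*) FROM customers WHERE age >= 30 AND age <= 50;
SELECT COUNT(*) FROM customers WHERE age BETWEEN 30 AND 50;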
- Filter multiple values
- Check if a value matched any values in a list
WHERE <column> IN (value1, value2, value3)
WHERE <column> = value1
OR <column> = value2
OR <column> = value3
/*
* DB: Store
* Table: orders
* Question: How many orders were made by customer 7888, 1082, 12808, 9623
*/
SELECT COUNT(orderid)
FROM orders
WHERE customerid IN (7888, 1082, 12808, 9623)
/*
* DB: World
* Table: city
* Question: How many cities are in the district of Zuid-Holland, Noord-Brabant and Utrecht?
*/
SELECT COUNT(id)
FROM city
WHERE district IN ('Zuid-Holland', 'Noord-Brabant', 'Utrecht');
- partial lookups
- What if you don't know exactly what you are searching for?
- You need to build patterns to match against when using LIKE
- Postgres LIKE only does text comparison, so cast non-text columns to text
Pattern wildcards
- %: match any number of characters
- _: match exactly 1 character
/*
* DB: Employees
* Table: employees
* Question: Find the age of all employees whose name starts with M.
* Sample output: https://imgur.com/vXs4093
*/
SELECT emp_no, first_name, EXTRACT (YEAR FROM AGE(birth_date)) as "age" FROM employees
WHERE first_name ILIKE 'M%';
/*
* DB: Employees
* Table: employees
* Question: How many people's names start with A and end with R?
* Expected output: 1846
*/
SELECT count(emp_no) FROM employees
WHERE first_name ILIKE 'A%R';
/*
* DB: Store
* Table: customers
* Question: How many people's zipcodes have a 2 in them?
* Expected output: 4211
*/
SELECT count(customerid) FROM customers
WHERE zip::text LIKE '%2%';
/*
* DB: Store
* Table: customers
* Question: How many people's zipcode start with 2 with the 3rd character being a 1.
* Expected output: 109
*/
SELECT count(customerid) FROM customers
WHERE zip::text LIKE '2_1%';
/*
* DB: Store
* Table: customers
* Question: Which states have phone numbers starting with 302?
* Replace null values with "No State"
* Expected output: https://imgur.com/AVe6G4c
*/
SELECT coalesce(state, 'No State') as "State" FROM customers
WHERE phone::text LIKE '302%';
UTC is enough for everyone...right?
- Dates are hard
- What is a timezone? Standard time zones were proposed by Sir Sandford Fleming
- Greenwich Mean Time (GMT): a time zone
- Coordinated Universal Time (UTC): a time standard
- No territory officially uses UTC as a time zone; it is a standard
- GMT and UTC share the same current time
SHOW TIMEZONE;
SET TIME ZONE 'UTC';
ALTER USER chesterheng SET timezone='UTC';
SHOW TIMEZONE;
- postgres uses ISO-8601
What do dates look like?
- YYYY-MM-DDTHH:MM:SS
- 2017-08-17T12:47:16+02:00
What is a format?
- Is a way of representing a date and time
- A timestamp is a date with time and timezone info
SELECT NOW();
CREATE TABLE timezones (
ts TIMESTAMP WITHOUT TIME ZONE,
tz TIMESTAMP WITH TIME ZONE
);
INSERT INTO timezones VALUES (
TIMESTAMP WITHOUT TIME ZONE '2000-01-01 10:00:00',
TIMESTAMP WITH TIME ZONE '2000-01-01 10:00:00-05'
);
Timestamp or date?
- depends on what to store
- normally date is enough
-- Current Date
SELECT NOW()::date;
SELECT CURRENT_DATE;
-- format
SELECT TO_CHAR(CURRENT_DATE, 'dd/mm/yyyy');
SELECT TO_CHAR(CURRENT_DATE, 'DDD');
SELECT TO_CHAR(CURRENT_DATE, 'WW');
- Subtracting dates returns the difference in days
SELECT NOW() - '1800/01/01';
SELECT date '1800/01/01';
- cast to date before calculating age
SELECT AGE(date '1800/01/01');
SELECT AGE(date '1992/11/13', date '1800/01/01');
SELECT EXTRACT (DAY FROM date '1992/11/13') DAY;
SELECT EXTRACT (DAY FROM date '1992/11/13') AS DAY;
SELECT EXTRACT (MONTH FROM date '1992/11/13') AS MONTH;
SELECT EXTRACT (YEAR FROM date '1992/11/13') AS YEAR;
SELECT DATE_TRUNC ('year', date '1992/11/13');
SELECT DATE_TRUNC ('month', date '1992/11/13');
SELECT DATE_TRUNC ('day', date '1992/11/13');
- A Comprehensive Look at PostgreSQL Interval Data Type
- can store and manipulate a period of time in years, months, days, hours, mins, secs
-- 30 days before given date
SELECT NOW() - interval '30 days';
SELECT *
FROM orders
WHERE purchaseDate <= now() - interval '30 days'
SELECT NOW() - INTERVAL '1 year 2 months 3 days';
SELECT NOW() - INTERVAL '2 weeks ago';
SELECT NOW() - INTERVAL '1 year 3 hours 20 minutes';
/*
* DB: Employees
* Table: employees
* Question: Get me all the employees above 60, use the appropriate date functions
*/
SELECT AGE(birth_date), * FROM employees
WHERE (
EXTRACT (YEAR FROM AGE(birth_date))
) > 60 ;
-- alternative
SELECT count(birth_date) FROM employees
WHERE birth_date < now() - interval '61 years' -- 61 years before the current date
/*
* DB: Employees
* Table: employees
* Question: How many employees where hired in February?
*/
SELECT count(emp_no) FROM employees
where EXTRACT (MONTH FROM hire_date) = 2;
/*
* DB: Employees
* Table: employees
* Question: How many employees were born in november?
*/
SELECT COUNT(emp_no) FROM employees
WHERE EXTRACT (MONTH FROM birth_date) = 11;
/*
* DB: Employees
* Table: employees
* Question: Who is the oldest employee?
*/
SELECT MAX(AGE(birth_date)) FROM employees;
/*
* DB: Store
* Table: orders
* Question: How many orders were made in January 2004?
*/
SELECT COUNT(orderid)
FROM orders
WHERE DATE_TRUNC('month', orderdate) = date '2004-01-01';
- removes duplicates
- keeps one row from each group of duplicates
- multiple columns are evaluated based on the combination of columns
/*
* DB: Employees
* Table: titles
* Question: What unique titles do we have?
*/
SELECT DISTINCT title FROM titles;
/*
* DB: Employees
* Table: employees
* Question: How many unique birth dates are there?
*/
SELECT COUNT(DISTINCT birth_date)
from employees;
/*
* DB: World
* Table: country
* Question: Can I get a list of distinct life expectancy ages
* Make sure there are no nulls
*/
SELECT DISTINCT lifeexpectancy FROM country
WHERE lifeexpectancy IS NOT NULL
ORDER BY lifeexpectancy;
- sort data ascending or descending by column
- using expressions
SELECT * FROM customers
ORDER BY LENGTH(name)
- A join combines columns from one table with columns from another
SELECT a.emp_no, b.salary, b.from_date
FROM employees AS a, salaries AS b
WHERE a.emp_no = b.emp_no
ORDER BY a.emp_no;
- Table A [Matched] Table B
- Inner join syntax is generally considered a best practice. It's more readable than the WHERE syntax and shows you what's being joined
- can become more complicated as you add more tables to combine
SELECT a.emp_no, b.salary, b.from_date
FROM employees AS a
INNER JOIN salaries AS b
ON a.emp_no = b.emp_no
ORDER BY a.emp_no;
-- You want to know the original salary and also salary at a promotion
SELECT a.emp_no,
CONCAT(a.first_name, ' ', a.last_name) AS "name",
b.salary,
COALESCE(c.title, 'No title change'),
COALESCE(c.from_date::text, '-') AS "title taken on"
FROM employees AS a
INNER JOIN salaries AS b
ON a.emp_no = b.emp_no
INNER JOIN titles AS c
ON c.emp_no = a.emp_no AND (
c.from_date = (b.from_date + interval '2 days') OR
c.from_date = b.from_date
)
ORDER BY a.emp_no;
- join a table to itself
- a table has a foreign key that references its own primary key
- similar to inner join
id | name | startDate | supervisorId |
---|---|---|---|
1 | Binni | 1990/01/13 | 2 |
2 | Andrei | 1980/01/23 | 2 |
CREATE TABLE employee(
id varchar(5) NOT NULL,
name varchar(20) NULL,
start_date date NULL,
supervisorId varchar(5) NULL,
CONSTRAINT id PRIMARY KEY(id),
CONSTRAINT supervisorId FOREIGN KEY(supervisorId) REFERENCES employee(id)
);
INSERT INTO employee VALUES ('1', 'Boss Man', date '1980/01/23', '1');
INSERT INTO employee VALUES ('1a', 'Andrei', date '1980/01/23', '1');
INSERT INTO employee VALUES ('1abc', 'Boss Man', date '1980/01/23', '1a');
SELECT a.id,
a.name AS "employee",
b.name AS "supervisor"
FROM employee AS a,
employee AS b
WHERE a.supervisorId = b.id;
SELECT a.id,
a.name AS "employee",
b.name AS "supervisor"
FROM employee AS a
INNER JOIN employee AS b
ON a.supervisorId = b.id;
- What if I need the rows that don't match?
- Table A <-> Table B
- Left outer join: also includes rows from table A that have no match in table B
- Right outer join: also includes rows from table B that have no match in table A
- Any value that does not have a match is set to NULL
SELECT *
FROM <table A> AS a
LEFT [OUTER] JOIN <table B> AS b
ON a.id = b.id;
-- You want to know every salary and also salary at a promotion
SELECT a.emp_no,
CONCAT(a.first_name, ' ', a.last_name) AS "name",
b.salary,
COALESCE(c.title, 'No title change'),
COALESCE(c.from_date::text, '-') AS "title taken on"
FROM employees AS a
INNER JOIN salaries AS b
ON a.emp_no = b.emp_no
LEFT JOIN titles AS c
ON c.emp_no = a.emp_no AND (
c.from_date = (b.from_date + interval '2 days') OR
c.from_date = b.from_date
)
ORDER BY a.emp_no;
-- cross join: creates every combination of rows from both tables
SELECT * FROM <tableA>
CROSS JOIN <tableB>;
-- full outer join: returns rows from both tables whether they match or not
SELECT * FROM <tableA> AS a
FULL JOIN <tableB> AS b
ON a.id = b.id;
/*
* DB: Store
* Table: orders
* Question: Get all orders from customers who live in Ohio (OH), New York (NY) or Oregon (OR) state
* ordered by orderid
*/
SELECT c.firstname, c.lastname, o.orderid FROM orders AS o
INNER JOIN customers AS c ON o.customerid = c.customerid
WHERE c.state IN ('NY', 'OH', 'OR')
ORDER BY o.orderid;
/*
* DB: Store
* Table: products
* Question: Show me the inventory for each product
*/
SELECT p.prod_id, i.quan_in_stock
FROM products as p
INNER JOIN inventory AS i ON p.prod_id = i.prod_id
/*
* DB: Employees
* Table: employees
* Question: Show me for each employee which department they work in
*/
SELECT e.first_name, dp.dept_name
FROM employees AS e
INNER JOIN dept_emp AS de ON de.emp_no = e.emp_no
INNER JOIN departments AS dp ON dp.dept_no = de.dept_no
- simplify the join syntax
SELECT *
FROM <tableA>
INNER JOIN <tableB> USING(id)
SELECT e.emp_no, e.first_name, d.dept_name
FROM employees AS e
INNER JOIN dept_emp AS de ON de.emp_no = e.emp_no
INNER JOIN departments AS d USING(dept_no);
- summarise or aggregate data by groups
- why group data? to get in-depth information by group
- GROUP BY splits data into groups to apply functions to group rather than entire table
- When we GROUP BY, we apply the function by group
- reduce all records found in the group to a single record
- GROUP BY utilizes SPLIT-APPLY-COMBINE strategy
- GROUP BY happens after WHERE / FROM
- FROM -> WHERE -> GROUP BY -> SELECT -> ORDER
- what if I want to filter groups?
- HAVING apply filters to a group as a whole
- FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER
SELECT col1, COUNT(col2)
FROM <table>
WHERE col2 > X
GROUP BY col1
HAVING col1 = Y;
SELECT d.dept_name,
COUNT(e.emp_no) AS "# of employees"
FROM employees AS e
INNER JOIN dept_emp AS de
ON de.emp_no = e.emp_no
INNER JOIN departments AS d
ON de.dept_no = d.dept_no
WHERE e.gender = 'M'
GROUP BY d.dept_name
HAVING COUNT(e.emp_no) > 25000;
SELECT d.dept_name,
COUNT(e.emp_no) AS "# of employees"
FROM employees AS e
INNER JOIN dept_emp AS de
ON de.emp_no = e.emp_no
INNER JOIN departments AS d
ON de.dept_no = d.dept_no
GROUP BY d.dept_name
ORDER BY COUNT(e.emp_no) DESC;
- GROUP BY emp_no
- we can apply aggregate functions to from_date and salary
- however, we cannot get the salary of MAX(from_date)
- or the from_date of MAX(salary)
SELECT emp_no, MAX(from_date), MAX(salary)
FROM salaries
GROUP BY emp_no;
- What if we want to combine the results of multiple groupings?
- UNION removes duplicate records
- UNION ALL does not remove duplicate records
- Grouping Sets is a subclause of GROUP BY that allows you to define multiple groupings
SELECT NULL AS "prod_id", SUM(ol.quantity)
FROM "public"."orderlines" AS ol
UNION
SELECT prod_id AS "prod_id", SUM(ol.quantity)
FROM "public"."orderlines" AS ol
GROUP BY prod_id
ORDER BY prod_id DESC;
SELECT prod_id AS "prod_id", SUM(ol.quantity)
FROM "public"."orderlines" AS ol
GROUP BY
GROUPING SETS (
(),
(prod_id)
)
ORDER BY prod_id DESC;
SELECT prod_id AS "prod_id", orderlineid, SUM(ol.quantity)
FROM "public"."orderlines" AS ol
GROUP BY
GROUPING SETS (
(),
(prod_id),
(orderlineid)
)
ORDER BY prod_id DESC, orderlineid DESC;
- useful to replace GROUPING SETS when you need the hierarchical combinations (e.g. year, year + month, year + month + day, and the grand total)
SELECT EXTRACT (YEAR FROM orderdate) AS "year",
EXTRACT (MONTH FROM orderdate) AS "month",
EXTRACT (DAY FROM orderdate) AS "day",
sum(ol.quantity)
FROM orderlines AS ol
GROUP BY
GROUPING SETS (
(EXTRACT (YEAR FROM orderdate)),
(
EXTRACT (YEAR FROM orderdate),
EXTRACT (MONTH FROM orderdate)
),
(
EXTRACT (YEAR FROM orderdate),
EXTRACT (MONTH FROM orderdate),
EXTRACT (DAY FROM orderdate)
),
(
EXTRACT (MONTH FROM orderdate),
EXTRACT (DAY FROM orderdate)
),
(EXTRACT (MONTH FROM orderdate)),
(EXTRACT (DAY FROM orderdate)),
()
)
ORDER BY
EXTRACT (YEAR FROM orderdate),
EXTRACT (MONTH FROM orderdate),
EXTRACT (DAY FROM orderdate);
SELECT EXTRACT (YEAR FROM orderdate) AS "year",
EXTRACT (MONTH FROM orderdate) AS "month",
EXTRACT (DAY FROM orderdate) AS "day",
sum(ol.quantity)
FROM orderlines AS ol
GROUP BY
ROLLUP (
EXTRACT (YEAR FROM orderdate),
EXTRACT (MONTH FROM orderdate),
EXTRACT (DAY FROM orderdate)
)
ORDER BY
EXTRACT (YEAR FROM orderdate),
EXTRACT (MONTH FROM orderdate),
EXTRACT (DAY FROM orderdate);
/*
* How many people did we hire on any given hire date?
* Database: Employees
* Table: Employees
*/
SELECT a.hire_date, COUNT(b.hire_date) as "amount"
FROM employees as a, employees as b
WHERE a.hire_date = b.hire_date
GROUP BY a.hire_date
ORDER BY "amount" DESC;
/*
* Show me all the employees, hired after 1991, that have had more than 2 titles
* Database: Employees
*/
SELECT e.emp_no, count(t.title) as "amount of titles"
FROM employees as e
JOIN titles as t USING(emp_no)
WHERE EXTRACT (YEAR FROM e.hire_date) > 1991
GROUP BY e.emp_no
HAVING count(t.title) > 2
ORDER BY e.emp_no;
/*
* Show me all the employees that have had more than 15 salary changes that work in the department development
* Database: Employees
*/
SELECT e.emp_no, count(s.from_date) as "amount of raises"
FROM employees as e
JOIN salaries as s USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
WHERE de.dept_no = 'd005'
GROUP BY e.emp_no
HAVING count(s.from_date) > 15
ORDER BY e.emp_no;
/*
* Show me all the employees that have worked for multiple departments
* Database: Employees
*/
SELECT e.emp_no, count(de.dept_no) as "worked for # departments"
FROM employees as e
JOIN dept_emp AS de USING(emp_no)
GROUP BY e.emp_no
HAVING count(de.dept_no) > 1
ORDER BY e.emp_no;
- grouping data is useful
- grouping happens after FROM / WHERE
- HAVING is a special filter for groups
- GROUPING SETS and ROLLUP are useful for multiple groupings in a single query
- grouping data is not a silver bullet
Window functions
- How do we apply functions against a set of rows related to the current row?
- e.g. add the average salary to every salary row to see how far each employee is from the average
- Window Functions
- window functions create a new column based on functions performed on a subset or "window" of the data
window_function(arg1, arg2,..) OVER (
[PARTITION BY partition_expression]
[ORDER BY sort_expression [ASC | DESC] [NULLS {FIRST | LAST}]]
)
-- how to display the max salary in each individual row?
SELECT *, MAX(salary) OVER()
FROM salaries
LIMIT 100;
- divide rows into groups to apply the function against
SELECT *, AVG(salary)
OVER (
PARTITION BY d.dept_name
)
FROM salaries
JOIN dept_emp AS de USING (emp_no)
JOIN departments AS d USING (dept_no);
- ORDER BY: order the results
window function ORDER BY
- ORDER BY changes the frame of the window function
- accumulating: takes into account everything before the current row plus the current row itself
SELECT emp_no, salary, COUNT(salary)
OVER (ORDER BY emp_no)
FROM salaries
- with a frame, we can restrict the window to a sub-range of rows
- without ORDER BY, the default frame is usually all partition rows
- with ORDER BY, the default frame is usually everything before the current row plus the current row
Key: Meaning
- Rows or range: use a range or rows as a frame
- preceding: rows before the current one
- following: rows after the current one
- unbounded preceding or following: returns all before or after
- current row: your current row
PARTITION BY category
ORDER BY price
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
SELECT emp_no, salary,
COUNT(salary) OVER (ORDER BY emp_no)
FROM salaries
SELECT emp_no, salary,
COUNT(salary) OVER (
PARTITION BY emp_no
ORDER BY salary
)
FROM salaries
SELECT emp_no, salary,
COUNT(salary) OVER (
PARTITION BY emp_no
ORDER BY salary
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
FROM salaries
SELECT e.emp_no, e.first_name, d.dept_name, MAX(s.salary)
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
GROUP BY e.emp_no, e.first_name, d.dept_name
ORDER BY e.emp_no;
SELECT DISTINCT e.emp_no,
LAST_VALUE(salary) OVER (
PARTITION BY s.emp_no
ORDER BY s.from_date
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
ORDER BY e.emp_no;
SELECT DISTINCT e.emp_no,
LAST_VALUE(s.from_date) OVER (
PARTITION BY s.emp_no
ORDER BY s.from_date
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
),
LAST_VALUE(salary) OVER (
PARTITION BY s.emp_no
ORDER BY s.from_date
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
ORDER BY e.emp_no;
SELECT DISTINCT e.emp_no,
s.from_date,
s.salary
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
ORDER BY e.emp_no;
- return a value evaluated against the first row within its partition
SELECT prod_id, price, category,
FIRST_VALUE(price) OVER (
PARTITION BY category
ORDER BY price
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS "cheapest in category"
FROM products
SELECT prod_id, price, category,
MIN(price) OVER (
PARTITION BY category
) AS "cheapest in category"
FROM products
- return a value evaluated against the last row within its partition
SELECT prod_id, price, category,
LAST_VALUE(price) OVER (
PARTITION BY category
ORDER BY price
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS "most expensive in category"
FROM products
SELECT prod_id, price, category,
MAX(price) OVER (
PARTITION BY category
) AS "most expensive in category"
FROM products
- sums values within the group; the result depends on the frame
SELECT o.orderid, o.customerid, o.netamount,
SUM(o.netamount) OVER (
PARTITION BY o.customerid
ORDER BY o.orderid
) AS "cum sum"
FROM orders as o
ORDER BY o.customerid
- number the current row within the partition starting from 1 regardless of framing
SELECT prod_id, price, category,
ROW_NUMBER() OVER(
PARTITION BY category
ORDER BY price
) AS "position in category by price"
FROM products
/*
* Show the population per continent
* Database: World
* Table: Country
*/
SELECT
DISTINCT continent,
SUM(population) OVER w1 as "continent population"
FROM country
WINDOW w1 AS( PARTITION BY continent );
/*
* To the previous query add on the ability to calculate the percentage of the world population
* What that means is that you will divide the population of that continent by the total population and multiply by 100 to get a percentage.
* Make sure you convert the population numbers to float using `population::float` otherwise you may see zero pop up
*
* Database: World
* Table: Country
*/
SELECT
DISTINCT continent,
SUM(population) OVER w1 as "continent population",
CONCAT(
ROUND(
(
SUM( population::float4 ) OVER w1 /
SUM( population::float4 ) OVER()
) * 100
),'%' ) as "percentage of population"
FROM country
WINDOW w1 AS( PARTITION BY continent );
/*
* Count the number of towns per region
*
* Database: France
* Table: Regions (Join + Window function)
*/
SELECT
DISTINCT r.id,
r."name",
COUNT(t.id) OVER (
PARTITION BY r.id
ORDER BY r."name"
) AS "# of towns"
FROM regions AS r
JOIN departments AS d ON r.code = d.region
JOIN towns AS t ON d.code = t.department
ORDER BY r.id;
- What if you only want to select something when a certain criteria is met?
- case statements can be used in multiple places in a query
- each return must be a single output
Case statement
SELECT a,
CASE
WHEN a=1 THEN 'one'
WHEN a=2 THEN 'two'
END
FROM test;
Render custom row data
SELECT
o.orderid,
o.customerid,
CASE
WHEN o.customerid = 1
THEN 'my first customer'
ELSE 'not my first customer'
END,
o.netamount
FROM orders as o
ORDER BY o.customerid;
Filter in a where
SELECT
o.orderid,
o.customerid,
o.netamount
FROM orders as o
WHERE CASE
WHEN o.customerid > 10
THEN o.netamount < 100
ELSE o.netamount > 100
END
ORDER BY o.customerid;
In an aggregate function
SELECT SUM(
CASE
WHEN o.netamount <100
THEN -100
ELSE o.netamount
END
) AS "returns",
SUM(o.netamount) AS "normal total"
FROM orders as o;
- what if you want to return NULL if a condition is met?
- if value 1 = value 2, return NULL
- fill in empty spots with a NULL to avoid divide by zero issues
NULLIF(val_1, val_2)
NULLIF(0, 0) -- NULL
NULLIF('ABC', 'DEF') -- 'ABC'
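A minimal sketch of the divide-by-zero use case (the column names below are made up for illustration):
-- dividing by NULL yields NULL instead of raising a division-by-zero error
SELECT total_amount / NULLIF(item_count, 0) AS "amount per item"
FROM (VALUES (100, 4), (50, 0)) AS t(total_amount, item_count);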
- What if you want to store the results of a query?
- What if you want to query the results of a query?
- views allow you to store and re-run previously written queries
- there are 2 types of views: materialized and non-materialized
- non-materialized: the query gets re-run each time the view is called
- materialized: stores the data physically and periodically updates it when tables change
- a view is the output of the query we ran
- views act like tables; you can query them
- a non-materialized view takes very little space to store
- we only store the definition of the view
- not all of the data that it returns
CREATE VIEW view_name AS query;
CREATE OR REPLACE VIEW view_name AS query;
ALTER VIEW <view_name> RENAME TO <new_view_name>;
DROP VIEW [ IF EXISTS] <view_name>;
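For the materialized variant, a minimal sketch (placeholder names, refreshed manually whenever the underlying tables change):
CREATE MATERIALIZED VIEW view_name AS query;
REFRESH MATERIALIZED VIEW view_name;
DROP MATERIALIZED VIEW [ IF EXISTS ] view_name;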
- get the most recent salary of an employee
- we could use window functions
- but a view is easier to reason about than window functions
SELECT e.emp_no, e.first_name, d.dept_name, MAX(s.salary)
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
GROUP BY e.emp_no, e.first_name, d.dept_name
ORDER BY e.emp_no;
SELECT DISTINCT e.emp_no,
LAST_VALUE(salary) OVER (
PARTITION BY s.emp_no
ORDER BY s.from_date
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
ORDER BY e.emp_no;
CREATE OR REPLACE VIEW last_salary_change AS
SELECT e.emp_no,
MAX(s.from_date)
FROM salaries AS s
JOIN employees AS e USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
GROUP BY e.emp_no
ORDER BY e.emp_no;
SELECT * FROM salaries
JOIN last_salary_change AS l USING(emp_no)
WHERE from_date = l.max
ORDER BY emp_no;
SELECT s.emp_no,
d.dept_name,
s.from_date,
s.salary
FROM last_salary_change
JOIN salaries AS s USING(emp_no)
JOIN dept_emp AS de USING(emp_no)
JOIN departments AS d USING(dept_no)
WHERE max = s.from_date;
Poor database indexing – a SQL query performance killer – recommendations
What are indexes?
- An index is a pointer to data in a table
- An index is a construct to speed up data access in a table
- An index is used to find data without having to scan the whole table
- An index is used to improve query performance with filter conditions
- An index works like a table of contents to help you find a piece of data
What does it do?
- speed up queries
- slow down data insertion and updates
Types of indexes
- Single column
- Multi column
- Unique (PK)
- Partial
- Implicit Indexes (by default)
CREATE UNIQUE INDEX <name>
on <table> (column1, column2, ...)
DROP INDEX <name>
When to use
- Index foreign keys
- Index primary keys and unique columns
- Index on columns that end up in the ORDER BY / WHERE clause often
When NOT to use
- Don't add an index just to add an index
- Don't use indexes on small tables
- Don't use on tables that are updated frequently
- Don't use on columns that can contain NULL values
- Don't use on columns that have large values
- Single column
- Most frequently used column in a query
- when? retrieve data that satisfies one condition
- Multi column
- Most frequently used columns in a query
- when? retrieve data that satisfies multiple conditions
- Unique
- normally primary key or column with unique value
- for speed and integrity
CREATE UNIQUE INDEX <name>
ON <table> (column1);
- Partial
- index over a subset of a table
CREATE INDEX <name>
ON <table> (<column>) WHERE <condition>;
- Implicit Indexes (by default)
- auto created by database
- PRIMARY KEY
- UNIQUE KEY
EXPLAIN ANALYZE
SELECT "name", "district", "countrycode"
FROM "public"."city"
WHERE countrycode IN ('TUN', 'BE', 'NL');
CREATE INDEX idx_countrycode
ON city (countrycode);
Without Index
With Index
CREATE INDEX idx_countrycode
ON city (countrycode)
WHERE countrycode IN ('TUN', 'BE', 'NL');
With Partial Index
Postgres provides several types of indexes
- B-Tree: default algorithm
- best used for comparisons
- <, <=, =, >, >=, BETWEEN, IN, IS NULL, IS NOT NULL
- Hash
- can only handle equality (=) operations
- Gin: generalized inverted index
- best used when multiple values are stored in a single field
- Gist: generalized search tree
- useful for indexing geometric data and full-text search
Each index type uses a different algorithm
CREATE [UNIQUE] INDEX <name>
ON <table> USING <method> (column1, ...)
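For example, a hash index sketch (my own example against the World city table used earlier); it only helps equality lookups:
CREATE INDEX idx_city_name_hash
ON city USING hash (name);
-- equality filters can use the hash index; range filters cannot
EXPLAIN ANALYZE SELECT * FROM city WHERE name = 'Amsterdam';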
- a construct that allows you to build extremely complex queries
- also called an inner query or inner select
- is a query within another query
- most often found in the WHERE clause
- can also be used in the SELECT, FROM and HAVING clauses
- returns a single value in the SELECT clause
- returns a single column with one or more rows in the WHERE and HAVING clauses
- returns a table in the FROM or JOIN clause
SELECT *
FROM <table>
WHERE <column> <condition> (
SELECT <column>
FROM <table>
[WHERE | GROUP BY | ORDER BY | ...]
)
SELECT (
SELECT <column>
FROM <table>
[WHERE | GROUP BY | ORDER BY | ...]
)
FROM <table> AS <name>
SELECT *
FROM (
SELECT <column>, <column>, ...
FROM <table>
[WHERE | GROUP BY | ORDER BY | ...]
) AS <name>
SELECT *
FROM <table> AS <name>
GROUP BY <column>
HAVING (
SELECT <column>
FROM <table>
[WHERE | GROUP BY | ORDER BY | ...]
)
- how is a subquery different from a join?
- both combine data from different tables
- a subquery can stand alone; a join cannot
- a join combines rows from one or more tables based on a match condition
- a subquery can return a single result or a row set (table)
- joins can only return a row set (table)
- a subquery's results cannot be referenced by the outer query
- a joined table can be referenced by the outer query
- use a join over a subquery where possible, for performance reasons
SELECT AVG(price) FROM products
SELECT title, price,
(SELECT AVG(price) FROM products) AS "global average price"
FROM products;
SELECT prod_id, title, price, i.quan_in_stock
FROM products
JOIN inventory AS i USING(prod_id)
SELECT title, price,
(SELECT AVG(price) FROM products) AS "global average price"
FROM (
SELECT * FROM products WHERE price < 10
) AS "products_sub";
- a subquery must be enclosed in parentheses
- must be placed on the right side of the comparison operator
- cannot manipulate its results internally (ORDER BY is ignored)
- use single-row operators with single-row subqueries
- a subquery that returns NULL may cause the outer query to return no results
SELECT *
FROM <table> AS <name>
WHERE X >= | <= | = | != (
SELECT MAX(<column>)
FROM <table>
[WHERE | GROUP BY | ORDER BY | ...]
)
Types of subquery
- single row: return 0 or 1 row
- multiple row: return 0 or more rows
- multiple column: return 0 or more columns
- correlated: reference 1 or more columns in the other statement - runs against each row
- correlated: can have performance bottleneck
- nested: subquery in a subquery
single row
SELECT name, salary
FROM salaries
WHERE salary =
(SELECT AVG(salary) FROM salaries);
SELECT name, salary,
(SELECT AVG(salary) FROM salaries)
AS "Company average salary"
FROM salaries;
multiple row
SELECT title, price, category
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
);
multiple column
SELECT emp_no, salary,
dea.avg AS "Department average salary"
FROM salaries AS s
JOIN dept_emp AS de USING(emp_no)
JOIN (
SELECT dept_no, AVG(salary)
FROM salaries AS s2
JOIN dept_emp AS e USING(emp_no)
GROUP BY dept_no
) AS dea USING(dept_no)
WHERE salary > dea.avg;
correlated
SELECT emp_no, salary, from_date
FROM salaries AS s
WHERE from_date = (
SELECT MAX(s2.from_date) AS max
FROM salaries AS s2
WHERE s2.emp_no = s.emp_no
)
ORDER BY emp_no;
nested
SELECT orderlineid, prod_id, quantity
FROM orderlines
JOIN (
SELECT prod_id
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
)
) AS limited USING(prod_id)
show all employees older than the average age
SELECT
first_name,
last_name,
birth_date,
AGE(birth_date),
(
SELECT AVG(AGE(birth_date))
FROM employees
)
FROM employees
WHERE AGE(birth_date) > (
SELECT AVG(AGE(birth_date))
FROM employees
);
Show the title by salary for each employee
SELECT
emp_no,
salary,
from_date,
(
SELECT title
FROM titles AS t
WHERE t.emp_no = s.emp_no
AND t.from_date = s.from_date
)
FROM salaries AS s
ORDER BY emp_no;
use a join instead of a subquery if possible, for performance reasons
SELECT
emp_no,
salary,
from_date,
t.title
FROM salaries AS s
JOIN titles AS t USING(emp_no, from_date)
ORDER BY emp_no;
Show the most recent employee salary
SELECT
emp_no,
salary AS "most recent salary",
from_date
FROM salaries AS s
WHERE from_date = (
SELECT MAX(from_date)
FROM salaries AS sp
WHERE sp.emp_no = s.emp_no
)
ORDER BY emp_no ASC;
SELECT
emp_no,
salary AS "most recent salary",
from_date
FROM salaries AS s
JOIN last_salary_change AS ls USING(emp_no)
WHERE from_date = ls.max
ORDER BY emp_no;
SELECT
emp_no,
salary AS "most recent salary",
from_date
FROM salaries AS s
JOIN (
SELECT emp_no, MAX(from_date)
FROM salaries AS sp
GROUP BY emp_no
) AS ls USING(emp_no)
WHERE ls.max = from_date
ORDER BY emp_no ASC;
- operators you can apply in the WHERE clause to a subquery
- EXISTS: check if subquery returns any rows
- IN: check if value is equal to any of the rows in the return (NULL yields NULL)
- NOT IN: check if value is not equal to any of the rows in the return (NULL yields NULL)
- ANY/SOME: check each row against the operator and if any comparison matches return TRUE
- ALL: check each row against the operator and if all comparisons match return TRUE
- SINGLE VALUE COMPARISON: the subquery must return a single row; the comparison operator is checked against that row
SELECT firstname, lastname, income
FROM customers AS c
WHERE EXISTS (
SELECT * FROM orders AS o
WHERE c.customerid = o.customerid
AND totalamount > 400
) AND income > 90000;
SELECT prod_id
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
);
SELECT prod_id
FROM products
WHERE category IN (
SELECT category FROM categories
WHERE categoryname NOT IN ('Comedy', 'Family', 'Classics')
);
SELECT prod_id
FROM products
WHERE category = ANY (
SELECT category FROM categories
WHERE categoryname IN ('Comedy', 'Family', 'Classics')
);
SELECT prod_id, title, sales
FROM products
JOIN inventory AS i USING(prod_id)
WHERE i.sales > ALL (
SELECT AVG(sales) FROM inventory
JOIN products AS pl USING(prod_id)
GROUP BY pl.category
);
SELECT prod_id
FROM products
WHERE category = (
SELECT category FROM categories
WHERE categoryname IN ('Comedy')
);
/* TRY TO WRITE THESE AS JOINS FIRST */
/*
* DB: Store
* Table: orders
* Question: Get all orders from customers who live in Ohio (OH), New York (NY) or Oregon (OR) state
* ordered by orderid
*/
SELECT c.firstname, c.lastname, o.orderid
FROM orders AS o, (
SELECT customerid, state, firstname, lastname
FROM customers
) AS c
WHERE o.customerid = c.customerid AND
c.state IN ('NY', 'OH', 'OR')
ORDER BY o.orderid;
/*
* DB: Employees
* Table: employees
* Question: Filter employees who have emp_no 110183 as a manager
*/
SELECT emp_no, first_name, last_name
FROM employees
WHERE emp_no IN (
SELECT emp_no
FROM dept_emp
WHERE dept_no = (
SELECT dept_no
FROM dept_manager
WHERE emp_no = 110183
)
)
ORDER BY emp_no
-- Written with JOIN
SELECT e.emp_no, first_name, last_name
FROM employees as e
JOIN dept_emp as de USING (emp_no)
JOIN dept_manager as dm USING (dept_no)
WHERE dm.emp_no = 110183
- different types of data
- how to create a database
- how to create tables
- how to insert/update data
- ... and much more
Why is it important?
- creating and maintaining data is extremely hard
Data definition language (DDL)
- commands to create, modify and delete database structures
Data manipulation language (DML)
- commands to add, delete and modify data
Types Of Databases
- regular
- template: build a regular database based on a template
Create a database
- when you setup your postgres, 3 databases were created
- postgres
- template0
- template1
- postgres is the default database that is created when you set up postgres (initdb)
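You can confirm this with the psql meta-command that lists all databases (a quick check, not from the course): it shows postgres, template0 and template1 plus anything you created.
psql -U postgres
postgres=# \l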
Postgres Database
- So the first time you initialize postgres, you can connect to the postgres database to set up your first database
psql -U <user> <database>
- postgres by default will assume a connection to a database with the same name as the user if no database is supplied
psql -U postgres
- This command shows you your current connection
\conninfo
Template0 Database
- This is the template that is used to create template1 - never change it
- It is in essence a backup template
Template1 Database
- This is the template that is used to create new databases
- Any changes we make to the template are automatically applied to all new databases
- One thing to note: because template1 is the default template, if it is being accessed / connected to, no new databases can be created until the connection is closed!
- Case 1: connect to Template1 and try to create new database
psql -U postgres template1
template1=#
psql -U postgres Employees
Employees=# CREATE DATABASE myblockeddb;
ERROR: source database "template1" is being accessed by other users
DETAIL: There is 1 other session using the database.
- Case 2: exit Template1 and try to create new database
psql -U postgres Employees
Employees=# CREATE DATABASE myblockeddb;
CREATE DATABASE
Case 3: connect to Template1 and try to create new database with template0
psql -U postgres template1
template1=#
psql -U postgres Employees
Employees=# CREATE DATABASE myblockeddb2 WITH TEMPLATE template0;
CREATE DATABASE
Template Databases
- You can create your own template database to create other databases with, but this is not very common!
CREATE DATABASE mysupertemplate;
CREATE TABLE supertable ();
CREATE DATABASE mysuperdatabase with TEMPLATE mysupertemplate;
- So how to create a database?
- CREATE DATABASE
CREATE DATABASE name
[ [ WITH ]
[ OWNER [=] user_name ]
[ TEMPLATE [=] template ]
[ ENCODING [=] encoding ]
[ LC_COLLATE [=] lc_collate ]
[ LC_CTYPE [=] lc_ctype ]
[ TABLESPACE [=] tablespace ]
[ CONNECTION_LIMIT [=] connlimit ] ]
Setting | Default |
---|---|
TEMPLATE | template1 |
ENCODING | UTF8 |
CONNECTION_LIMIT | 100 |
OWNER | Current User |
Let's create an example database to store courses!
psql -U postgres postgres
postgres=# CREATE DATABASE ztm;
CREATE DATABASE
postgres=# DROP DATABASE ztm;
DROP DATABASE
- What is a Database Schema?
- Databases often contain many tables, views, etc. Depending on how much you care, you may want to organize them in a logical way!
Postgres schema
- Postgres offers the concept of schemas. Think of it like a box in which you can organise tables, views, indexes, etc
- by default, each database gets a public schema
- unless you specify a schema, the default is always assumed to be public!
SELECT * FROM employees
SELECT * FROM public.employees
postgres=# \dn
List of schemas
Name | Owner
--------+----------
public | postgres
(1 row)
Create a schema
postgres=# CREATE SCHEMA sales;
CREATE SCHEMA
postgres=# \dn
List of schemas
Name | Owner
--------+-------------
public | postgres
sales | postgres
(2 rows)
Reasons to use schemas
- allow many users to use one database without interfering with each other
- organise database objects into logical groups to make them more manageable
- 3rd party applications can be put into separate schemas so they do not collide with the names of other objects
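A small sketch of using a non-default schema (the table name is hypothetical): qualify the table with the schema name, otherwise it is looked up in public.
CREATE TABLE sales.invoices (
  id serial PRIMARY KEY,
  amount numeric NOT NULL
);
SELECT * FROM sales.invoices;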
Restricted
- creating databases is a restricted action
- not everyone is allowed to do it
Roles
- roles are vital to any DBMS; they determine what's allowed
- a role can be a user or a group
- it depends on how you setup the role
- roles have the ability to grant membership to another role
- a role has attributes which define its privileges
Role Attribute
- privileges of a role are determined in part by its attributes
- attribute example
- SUPERUSER | NOSUPERUSER
- CREATEDB | NOCREATEDB
- CREATEROLE | NOCREATEROLE
- LOGIN | NOLOGIN
Create role
- always encrypt when storing a role that can log in
CREATE ROLE readonly WITH LOGIN ENCRYPTED PASSWORD 'readonly';
- by default only the creator of the database or a superuser has access to its objects
SELECT * FROM departments;
ERROR: permission denied for table departments
postgres=# \du
List of roles
Role name | Attributes | Member of
-------------+------------------------------------------------------------+-----------
chesterheng | Superuser, Create role, Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
- A user is the same as a role
- It assumes LOGIN by default
CREATE USER user1 WITH ENCRYPTED PASSWORD 'user1';
postgres=# CREATE role test_role_with_login WITH LOGIN ENCRYPTED PASSWORD 'password';
CREATE ROLE
postgres=# \du
List of roles
Role name | Attributes | Member of
----------------------+------------------------------------------------------------+-----------
chesterheng | Superuser, Create role, Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
test_role_with_login |
postgres=# CREATE USER test_user_with_login WITH ENCRYPTED PASSWORD 'password';
CREATE ROLE
postgres=# \du
List of roles
Role name | Attributes | Member of
----------------------+------------------------------------------------------------+-----------
chesterheng | Superuser, Create role, Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
test_role_with_login | | {}
test_user_with_login |
createuser --interactive
Enter name of role to add: test_interactive
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) n
Shall the new role be allowed to create more new roles? (y/n) y
postgres=# \du
List of roles
Role name | Attributes | Member of
----------------------+------------------------------------------------------------+-----------
chesterheng | Superuser, Create role, Create DB | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
test_interactive | Create role | {}
test_role_with_login | | {}
test_user_with_login |
postgres=# ALTER ROLE test_interactive WITH ENCRYPTED PASSWORD 'password';
ALTER ROLE
- Did not prompt for a password when logging in to the Employees database with the test_interactive user
psql -U test_interactive Employees
Employees=>
- Force a password prompt when logging in to the Employees database with the test_interactive user
- Can get in with any password
- Postgres is configured to trust all local connections by default
psql -U test_interactive Employees -W
Employees=>
- show hba_file and config_file
- hba_file
- Password Authentication
- config_file
postgres=# show hba_file;
hba_file
----------------------------------------------------------------------------
/Users/chesterheng/Library/Application Support/Postgres/var-12/pg_hba.conf
(1 row)
postgres=# show config_file;
config_file
--------------------------------------------------------------------------------
/Users/chesterheng/Library/Application Support/Postgres/var-12/postgresql.conf
(1 row)
- Set method to scram-sha-256 in hba_file
# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all scram-sha-256
# IPv4 local connections:
host all all 127.0.0.1/32 scram-sha-256
# IPv6 local connections:
host all all ::1/128 scram-sha-256
# Allow replication connections from localhost, by a user with the
# replication privilege.
local replication all scram-sha-256
host replication all 127.0.0.1/32 scram-sha-256
host replication all ::1/128 scram-sha-256
- Log in to the Employees database as the test_interactive user
- it will ask for a password now
psql -U test_interactive Employees
Password for user test_interactive:
psql: error: could not connect to server: FATAL: password authentication failed for user "test_interactive"
- need to tell postgres to use scram-sha-256 in config_file
password_encryption = scram-sha-256 # md5 or scram-sha-256
- Set method back to trust in hba_file
- Alter test_interactive's password again so it is re-hashed with scram-sha-256
# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all trust
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
postgres=# ALTER USER test_interactive WITH PASSWORD 'password';
ALTER ROLE
- Set method to scram-sha-256 in hba_file
# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all scram-sha-256
# IPv4 local connections:
host all all 127.0.0.1/32 scram-sha-256
# IPv6 local connections:
host all all ::1/128 scram-sha-256
# Allow replication connections from localhost, by a user with the
# replication privilege.
local replication all scram-sha-256
host replication all 127.0.0.1/32 scram-sha-256
host replication all ::1/128 scram-sha-256
psql -U test_interactive Employees
Password for user test_interactive:
Employees=>
- set method trust for postgres user
# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all postgres trust
psql -U postgres postgres
postgres=# exit
- attributes can only define so many privileges
- when you are not a superuser
- what you can do is fairly limited
- depending on what default privileges are set
- by default objects are only available to the one who creates them
- privileges need to be granted for new roles and users to have access to certain data
Grant Privileges
- be wise about the privileges you grant
GRANT ALL PRIVILEGES ON <table> TO <user>;
GRANT ALL ON ALL TABLES [IN SCHEMA <schema>] TO <user>;
GRANT [SELECT, INSERT, UPDATE, DELETE, ...] ON <table> [IN SCHEMA <schema>] TO <user>;
- Set method back to trust in hba_file for convenience
# TYPE DATABASE USER ADDRESS METHOD
# "local" is for Unix domain socket connections only
local all all trust
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
- create a new user privilegetest
- privilegetest has no privilege to query the titles table created by postgres
createuser --interactive
Enter name of role to add: privilegetest
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) n
Shall the new role be allowed to create more new roles? (y/n) n
psql -U privilegetest Employees
Employees=# \conninfo
You are connected to database "Employees" as user "privilegetest" via socket in "/tmp" at port "5432".
Employees=# \dt
List of relations
Schema | Name | Type | Owner
--------+--------------+-------+-------------
public | departments | table | postgres
public | dept_emp | table | postgres
public | dept_manager | table | postgres
public | employees | table | postgres
public | salaries | table | postgres
public | titles | table | postgres
(6 rows)
Employees=> SELECT * FROM titles;
ERROR:  permission denied for table titles
- use postgres user to grant privilegetest the privilege to query titles table created by postgres
psql -U postgres Employees
Employees=# GRANT SELECT ON titles TO privilegetest;
GRANT
- use privilegetest to query titles table created by postgres
psql -U privilegetest Employees
Employees=> SELECT * FROM titles;
emp_no | title | from_date | to_date
--------+--------------------+------------+------------
10001 | Senior Engineer | 1986-06-26 | 9999-01-01
10002 | Staff | 1996-08-03 | 9999-01-01
10003 | Senior Engineer | 1995-12-03 | 9999-01-01
10004 | Engineer | 1986-12-01 | 1995-12-01
10004 | Senior Engineer | 1995-12-01 | 9999-01-01
10005 | Senior Staff | 1996-09-12 | 9999-01-01
10005 | Staff | 1989-09-12 | 1996-09-12
10006 | Senior Engineer | 1990-08-05 | 9999-01-01
10007 | Senior Staff | 1996-02-11 | 9999-01-01
10007 | Staff | 1989-02-10 | 1996-02-11
10008 | Assistant Engineer | 1998-03-11 | 2000-07-31
10009 | Assistant Engineer | 1985-02-18 | 1990-02-18
10009 | Engineer | 1990-02-18 | 1995-02-18
10009 | Senior Engineer | 1995-02-18 | 9999-01-01
10010 | Engineer | 1996-11-24 | 9999-01-01
10011 | Staff | 1990-01-22 | 1996-11-09
10012 | Engineer | 1992-12-18 | 2000-12-18
10012 | Senior Engineer | 2000-12-18 | 9999-01-01
10013 | Senior Staff | 1985-10-20 | 9999-01-01
10014 | Engineer | 1993-12-29 | 9999-01-01
10015 | Senior Staff | 1992-09-19 | 1993-08-22
:
- use the postgres user to revoke from privilegetest the privilege to query the titles table created by postgres
psql -U postgres Employees
Employees=# REVOKE SELECT ON titles FROM privilegetest;
REVOKE
psql -U privilegetest Employees
Employees=> SELECT * FROM titles;
ERROR: permission denied for table titles
- grant all privileges to privilegetest in public schema
psql -U postgres Employees
Employees=# GRANT ALL ON ALL TABLES IN SCHEMA public TO privilegetest;
GRANT
Employees=# REVOKE ALL ON ALL TABLES IN SCHEMA public FROM privilegetest;
REVOKE
- create a role with granular privileges, such as employee_read
- grant that role to a specific user
Employees=# CREATE ROLE employee_read;
CREATE ROLE
Employees=# GRANT SELECT ON ALL TABLES IN SCHEMA public TO employee_read;
GRANT
Employees=# GRANT employee_read TO privilegetest;
GRANT ROLE
Employees=# \du
List of roles
Role name | Attributes | Member of
----------------------+------------------------------------------------------------+-----------------
chesterheng | Superuser, Create role, Create DB | {}
employee_read | Cannot login | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
privilegetest | | {employee_read}
test_interactive | Create role | {}
test_role_with_login | | {}
test_user_with_login | | {}
Employees=# REVOKE employee_read FROM privilegetest;
REVOKE ROLE
- when managing roles and permissions, always follow the principle of least privilege
- assign attributes and grant privileges to a role
- assign the role to a specific user
- don't use the admin / superuser account by default
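- a minimal sketch of that pattern (the sales_read role and alice user are made up for this example):
CREATE ROLE sales_read;                                      -- group role, no login
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sales_read;   -- privileges go on the role
CREATE ROLE alice WITH LOGIN ENCRYPTED PASSWORD 'secret';    -- login user
GRANT sales_read TO alice;                                   -- the user inherits the role's privileges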
Before we start
- before we create a full-fledged database, it's important to know what types of data we can store in a database
Data types
- when we talk about the types of data we can store, we categorize them into buckets or types
- types: Numeric Types, Arrays, Character Types, Date/Time Types, Boolean Types, UUID Types
- Data Types
- A data type is nothing more than a constraint placed on a field to only allow that type of data to be filled in
- Specifying a type is important because it tells the system how it can / should handle the data in that field
- it's also what allows the dbms to optimize its algorithms for processing data by knowing its type
Boolean
- a boolean type can hold TRUE, FALSE or NULL
- it can also do smart conversions
- 1, yes, y, t, true => TRUE
- 0, no, n, f, false => FALSE
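- for example, these casts all work in postgres:
SELECT 'yes'::boolean, 'y'::boolean, 't'::boolean, '1'::boolean;   -- all return true
SELECT 'no'::boolean, 'n'::boolean, 'f'::boolean, '0'::boolean;    -- all return false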
Character
- postgres provides 3 character data types: CHAR(N), VARCHAR(N) and TEXT
- CHAR(N): fixed length with space padding
- VARCHAR(N): variable length with no padding
- TEXT: unlimited length text
CREATE TABLE test_text (
fixed char(4),
variable varchar(20),
unlimited text
);
INSERT INTO test_text VALUES ('mo', 'mo', 'I have unlimited space');
SELECT * FROM test_text;
- there are 2 types of number in postgres: integers and floating points
- integers: a whole number, a number that is not a fraction
- 3 types of integers: SMALLINT, INT, BIGINT
- SMALLINT: number that can range from -32,768 to 32,767
- INT: number that can range from -2,147,483,648 to 2,147,483,647
- BIGINT: number that can range from -9223372036854775808 to 9223372036854775807
- floating point: number that can contain a decimal point
- FLOAT4: single precision, roughly 6-7 significant decimal digits
- 1.123456789 => 1.1234568 (extra digits are rounded off)
- FLOAT8: double precision, roughly 15 significant decimal digits
- 1.12345678901234567 => 1.1234567890123457 (extra digits are rounded off)
- DECIMAL / NUMERIC: up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
CREATE TABLE test_float (
four float4,
eight float8,
big decimal
);
INSERT INTO test_float VALUES (1.123456789, 1.12345678901234567, 1.12345678901234567);
SELECT * FROM test_float;
four | eight | big |
---|---|---|
1.1234568 | 1.1234567890123457 | 1.12345678901234567 |
- an array is a group of elements of the same type
- an array is denoted by a bracket syntax
- every data type in postgres has an array equivalent
- Example: store phone numbers
CREATE TABLE test_array (
four char(2)[],
eight text[],
big float4[]
);
INSERT INTO test_array VALUES (ARRAY ['mo', 'm', 'm', 'd'], ARRAY ['test', 'long text', 'longer text'], ARRAY [1.23, 2.11, 3.23, 5.32]);
SELECT * FROM test_array;
four | eight | big |
---|---|---|
["mo","m ","m ","d "] | ["test","long text","longer text"] | [1.23,2.11,3.23,5.32] |
Database Models
- create a model before we create a database
What is a model?
- a model is a design that is used to visualize what we are going to build
Naming conventions
- table names must be singular!
- columns must be lowercase with underscores
- columns with mixed case are acceptable
- columns with upper case are unacceptable
- be consistent; write down your rules
CREATE TABLE <name> (
<col1> TYPE [CONSTRAINT],
table_constraint [CONSTRAINT]
) [INHERITS <existing_tables>];
- Create Student table
CREATE TABLE student (
student_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL,
date_of_birth DATE NOT NULL
);
psql -U postgres ztm
ztm=# \conninfo
You are connected to database "ztm" as user "postgres" via socket in "/tmp" at port "5432".
ztm=# CREATE TABLE student (
ztm(# student_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
ztm(# first_name VARCHAR(255) NOT NULL,
ztm(# last_name VARCHAR(255) NOT NULL,
ztm(# email VARCHAR(255) NOT NULL,
ztm(# date_of_birth DATE NOT NULL
ztm(# );
ERROR: function uuid_generate_v4() does not exist
LINE 2: student_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
ztm=# CREATE extension if not exists "uuid-ossp";
CREATE EXTENSION
ztm=# CREATE TABLE student (
ztm(# student_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
ztm(# first_name VARCHAR(255) NOT NULL,
ztm(# last_name VARCHAR(255) NOT NULL,
ztm(# email VARCHAR(255) NOT NULL,
ztm(# date_of_birth DATE NOT NULL
ztm(# );
CREATE TABLE
ztm=# \dt
List of relations
Schema | Name | Type | Owner
--------+---------+-------+----------
public | student | table | postgres
(1 row)
ztm=# \d student
Table "public.student"
Column | Type | Collation | Nullable | Default
---------------+------------------------+-----------+----------+--------------------
student_id | uuid | | not null | uuid_generate_v4()
first_name | character varying(255) | | not null |
last_name | character varying(255) | | not null |
email | character varying(255) | | not null |
date_of_birth | date | | not null |
Indexes:
"student_pkey" PRIMARY KEY, btree (student_id)
Temporary tables
- They are a type of table that exist in a special schema, so you cannot define a schema name when declaring a temporary table.
- These types of tables will be dropped at the end of your session. It’s important to also note that they are only visible to the creator.
- Now you may be wondering, why would I ever use these? Well if you’re writing intensive queries against a data set it might be beneficial to temporarily create a table based off another table. This is because:
- Temporary tables behave just like normal ones
- Postgres will apply less “rules” (logging, transaction locking, etc.) to temporary tables so they execute more quickly
- You have full access rights to the data even if you otherwise didn’t, so you can test things out.
CREATE TEMPORARY TABLE <name> (<columns>);
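- a quick sketch, assuming the employees table from the Employees sample database used earlier:
CREATE TEMPORARY TABLE recent_hires AS
SELECT * FROM employees WHERE hire_date >= '1995-01-01';
SELECT COUNT(*) FROM recent_hires;   -- behaves like a normal table for the rest of the session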
Constraints
- Constraints are a tool to apply validation methods against data that will be inserted
Column Constraints
- A column constraint is defined as part of a column definition
Examples
- NOT NULL: cannot be NULL
- PRIMARY KEY: column will be primary key
- UNIQUE: can only contain unique values (NULL is unique)
- CHECK: apply a special condition check against the values in the column
- REFERENCES: constraints the values of the column to only values that exist in the column of another table (Foreign Key)
Table Constraints
- a table constraint definition is not tied to a particular column, and it can encompass more than one column
Examples
- UNIQUE(column_list): can only contain unique values (NULL is unique)
- PRIMARY KEY(column_list): columns that will be the primary key
- CHECK(condition): a condition to check when insert or update
- REFERENCES: Foreign Key relationship to a column
- every column constraint can be written as a table constraint
- use a column constraint when it involves only one column
- use a table constraint when it involves multiple columns
CREATE TABLE student (
student_id UUID DEFAULT uuid_generate_v4(),
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL,
date_of_birth DATE NOT NULL,
CONSTRAINT pk_student_id PRIMARY KEY (student_id)
);
CREATE TABLE category (
cat_id SMALLINT PRIMARY KEY,
type TEXT
);
CREATE TABLE column_constraints (
cc_id SMALLINT PRIMARY KEY,
something TEXT NOT NULL,
email TEXT CHECK (email ~* '^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+$'),
cat_id SMALLINT REFERENCES category(cat_id)
);
CREATE TABLE table_constraints (
cc_id SMALLINT,
something TEXT NOT NULL,
email TEXT,
cat_id SMALLINT REFERENCES category(cat_id),
CONSTRAINT pk_table_constraints PRIMARY KEY (cc_id),
CONSTRAINT valid_email CHECK (email ~* '^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+$')
);
INSERT INTO category VALUES (1, 'category 1');
INSERT INTO column_constraints VALUES (1, 'something', '[email protected]', 1);
INSERT INTO table_constraints VALUES (1, 'something', '[email protected]', 1);
SELECT * FROM category;
SELECT * FROM table_constraints;
SELECT * FROM column_constraints;
- Query error when inserting an invalid email
INSERT INTO column_constraints VALUES (1, 'something', '@email.com', 1);
Query Error: error: new row for relation "column_constraints" violates check constraint "column_constraints_email_check"
INSERT INTO table_constraints VALUES (1, 'something', '@email.com', 1);
Query Error: error: new row for relation "table_constraints" violates check constraint "valid_email"
- Regex for regular folk - A fun, illustrative and simple guide to learning regex
- Regex Crossword - Challenging puzzles to practice your Regex-fu
- Regex Search - A tool to find quick and easy regex references
- Regex Tester - A Regex testing tool that supports multiple flavours
- Regexone
- need to install an extension
- extensions are pieces of software that expand what postgres can do or how certain processes run
- UUID extension allows us to generate unique identifiers for PRIMARY KEY
- UUID stands for universally unique identifier
CREATE extension if not exists "uuid-ossp";
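- once the extension is installed you can call the function directly:
SELECT uuid_generate_v4();   -- returns a random version-4 UUID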
PROS | CONS |
---|---|
unique everywhere | large values to store |
easy to shard | can have performance impact |
easy to merge / replicate | more difficult to debug |
expose less info about your system |
Custom Data Types
- Feedback custom data type
- postgres allows you to create custom data types to store more complex shapes of data
Domains
- a domain is an alias for an existing type that can have a check constraint
CREATE DOMAIN Rating SMALLINT
CHECK (VALUE > 0 AND VALUE <= 5);
CREATE TYPE Feedback AS (
student_id UUID,
rating Rating,
feedback TEXT
);
CREATE TABLE student (
student_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL,
date_of_birth DATE NOT NULL
);
CREATE TABLE subject (
subject_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
subject TEXT NOT NULL,
description TEXT
);
CREATE TABLE teacher (
teacher_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL,
date_of_birth DATE NOT NULL
);
CREATE TABLE course (
course_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
"name" TEXT NOT NULL,
description TEXT,
subject_id UUID REFERENCES subject(subject_id),
teacher_id UUID REFERENCES teacher(teacher_id),
feedback feedback[]
);
CREATE TABLE enrollment (
course_id UUID REFERENCES course(course_id),
student_id UUID REFERENCES student(student_id),
enrollment_date DATE NOT NULL,
CONSTRAINT pk_enrollment PRIMARY KEY (course_id, student_id)
);
ALTER TABLE student
ADD COLUMN email VARCHAR(255) NOT NULL;
- Change the SCHEMA of a table
- Change the name of a table-level CONSTRAINT
- Add and remove constraints
- Change a column level constraint
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
ADD COLUMN <col> <type> <constraint>
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
ALTER COLUMN <name> TYPE <new type> [USING <expression>]
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
RENAME COLUMN <old name> TO <new name>
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
DROP COLUMN <col> [ RESTRICT | CASCADE ]
Our main advice when using this command is to communicate changes early and put a plan in place for other systems/software to migrate. The last thing you want is to change a column and suddenly have all your services start breaking!
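- for instance, against the student table from earlier (the phone column here is purely illustrative), the four forms above might look like:
ALTER TABLE student ADD COLUMN phone VARCHAR(20);
ALTER TABLE student ALTER COLUMN phone TYPE TEXT;
ALTER TABLE student RENAME COLUMN phone TO phone_number;
ALTER TABLE student DROP COLUMN phone_number;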
INSERT INTO student (first_name, last_name, email, date_of_birth) VALUES ( 'Mo', 'Binni', '[email protected]', '1992-11-13'::DATE);
INSERT INTO teacher (first_name, last_name, email, date_of_birth) VALUES ( 'Mo', 'Binni', '[email protected]', '1992-11-13'::DATE);
INSERT INTO subject (subject, description) VALUES ('SQL', 'A database management language');
DELETE FROM subject WHERE subject = 'SQL';
INSERT INTO course ("name", description) VALUES ('SQL Zero to Mastery', 'The #1 resource for SQL mastery');
UPDATE course
SET subject_id = 'e4a6e53f-db6f-438f-b4d0-80d4e6a9341e'
WHERE subject_id is NULL;
ALTER TABLE course
ALTER COLUMN subject_id SET NOT NULL;
UPDATE course
SET teacher_id = '90de5068-bc2f-4d37-b4f0-1364671fb89e'
WHERE teacher_id is NULL;
ALTER TABLE course
ALTER COLUMN teacher_id SET NOT NULL;
INSERT INTO course ("name", description, subject_id, teacher_id) VALUES ('SQL Zero to Mastery', 'The #1 resource for SQL mastery', 'e4a6e53f-db6f-438f-b4d0-80d4e6a9341e', '90de5068-bc2f-4d37-b4f0-1364671fb89e');
- the problem is that there is no data integrity check on student_id when appending a feedback object to the feedback array
- if we need to check that student_id is valid in the student table, we should use a feedback table instead of a feedback array
INSERT INTO enrollment (student_id, course_id, enrollment_date) VALUES ('f914ba46-9ac4-42cf-b0ac-d10cfe625be8', '6174589a-7f97-4d40-b063-6a974a20a7ec', NOW()::DATE);
UPDATE course
SET feedback = array_append(
feedback, ROW('f914ba46-9ac4-42cf-b0ac-d10cfe625be8', 5, 'Great course!')::feedback
)
WHERE course_id = '6174589a-7f97-4d40-b063-6a974a20a7ec';
- Feedback is important
- Store feedback in a table instead of a feedback array data type
- Deprecate the feedback array data type by renaming the type feedback to feedback_deprecated
- Deprecate the feedback field in course by renaming the column feedback to feedback_deprecated (see the sketch after this list)
- Use array data types for less important data, such as phone numbers
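- a sketch of how those renames might look (not necessarily the exact statements used in the course):
ALTER TABLE course RENAME COLUMN feedback TO feedback_deprecated;
ALTER TYPE feedback RENAME TO feedback_deprecated;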
CREATE TABLE feedback (
student_id UUID NOT NULL REFERENCES student(student_id),
course_id UUID NOT NULL REFERENCES course(course_id),
feedback TEXT,
rating rating,
CONSTRAINT pk_feedback PRIMARY KEY (student_id, course_id)
);
INSERT INTO feedback (student_id, course_id, feedback, rating) VALUES ('f914ba46-9ac4-42cf-b0ac-d10cfe625be8', '6174589a-7f97-4d40-b063-6a974a20a7ec', 'Interesting Course', 4)
- disaster will strike
- so it's important to have a plan
Have a plan
- backup plan
- disaster recovery plan
- test your plans
What can go wrong?
- hardware failures
- viruses
- power outages
- hackers
- human error
How do I make a plan?
- 1. Determine what needs to be backed up
- 2. Decide what the appropriate way to back up is
- 3. Decide how frequently you are going to back up
- 4. Decide where you are going to store the backups
- 5. Have a retention policy for the backups
What to backup?
Type | Purpose | Frequency |
---|---|---|
full backup | backup all the data | less often |
incremental | backup since last incremental | often |
differential | backup since last full backup | often |
transaction log | backup of the database transactions | a lot |
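- a full backup of the Employees database can be taken with pg_dump (file names below are illustrative):
pg_dump -U postgres Employees > ~/Desktop/Employees_backup.sql
- or in the custom format, which is restored with pg_restore instead of \i:
pg_dump -U postgres -F c Employees -f ~/Desktop/Employees.dump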
psql -d postgres
postgres=# \conninfo
You are connected to database "postgres" as user "postgres" via socket in "/tmp" at port "5432".
postgres=# \i ~/Desktop/Employees_20200927_153756.sql
Transactions
- a database is a shared resource, so many users may access it concurrently
- a transaction is a unit of work made up of one or more instructions (queries, for instance)
- so how do you keep things consistent?
- the dbms has a mechanism in place to manage transactions
- the dbms wraps INSERT, UPDATE and DELETE commands in a transaction by default
- Transaction lifecycle
- session 1
Employees=# BEGIN;
BEGIN
Employees=# DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005;
DELETE 5
Employees=# SELECT * FROM employees;
emp_no | birth_date | first_name | last_name | gender | hire_date
--------+------------+----------------+------------------+--------+------------
10006 | 1953-04-20 | Anneke | Preusig | F | 1989-06-02
10007 | 1957-05-23 | Tzvetan | Zielinski | F | 1989-02-10
10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15
10009 | 1952-04-19 | Sumant | Peac | F | 1985-02-18
10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24
- session 2
- session 2 hangs on its DELETE because session 1 holds locks on those rows
Employees=# SELECT * FROM employees;
emp_no | birth_date | first_name | last_name | gender | hire_date
--------+------------+----------------+------------------+--------+------------
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26
10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21
10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28
10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01
10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12
10006 | 1953-04-20 | Anneke | Preusig | F | 1989-06-02
10007 | 1957-05-23 | Tzvetan | Zielinski | F | 1989-02-10
10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15
10009 | 1952-04-19 | Sumant | Peac | F | 1985-02-18
10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24
Employees=# BEGIN;
BEGIN
Employees=# DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005;
- session 1 rolls back
Employees=# ROLLBACK;
- session 2's DELETE completes after session 1 rolls back
Employees=# DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005;
DELETE 5
- session 1 begins another transaction and tries to DELETE
- session 1 hangs now because session 2 still holds the locks
Employees=# BEGIN;
BEGIN
Employees=# DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005;
- session 2 commits and ends its transaction
Employees=# DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005;
DELETE 5
Employees=# END;
COMMIT
- session 1 deletes no records because session 2 already deleted and committed them
Employees=# BEGIN;
BEGIN
Employees=# DELETE FROM employees WHERE emp_no BETWEEN 10000 AND 10005;
DELETE 0
Employees=# END;
COMMIT
Transactions
- Maintain the integrity of a database, all transactions must obey ACID properties
- Atomicity: either executes entirely or not at all
- Consistency: each transaction should leave the database in a consistent state (commit or rollback)
- Isolation: transaction should be executed in isolation from other transactions
- Durability: after completion of a transaction, the changes in the database should persist
- Clue #1: Bruno insists it's one of his employees
- Clue #2: Data breach happened during a certain time
We have two solid clues so far:
- It was an inside job
- It was done on 2020-06-23
For this part of the course you'll need 2 databases:
- Movr_Employees: This database contains all relevant information about the Employees of Keiko Corp.
- Movr: A ride-sharing product that Keiko offers to consumers.
To restore the Movr backup make sure to do the following steps in order!
- Create a database named Movr
- Restore the schema.sql first to the Movr database
- Restore the data.sql next
psql -U postgres Movr
Movr=# \i ~/Desktop/schema.sql
Movr=# \i ~/Desktop/data.sql
To restore the Movr_Employees backup make sure to do the following steps in order!
- Create a database named Movr_Employees
- Restore the movr_employees.sql
psql -U postgres Movr_Employees
Movr_Employees=# \i ~/Desktop/movr_employees.sql
You had to be on location to steal the data
- Date of incident: 2020-06-23
- Keiko Corp Latitude: -74.997 to -74.9968
- Keiko Corp Longitude: 40.5 to 40.6
CREATE VIEW suspected_rides AS
SELECT *
FROM "public"."vehicle_location_histories"
WHERE
"city" = 'new york' AND
"timestamp"::date = '2020-06-23'::date AND
"lat" BETWEEN -74.997 AND -74.9968 AND
"long" BETWEEN 40.5 AND 40.6
ORDER BY long;
SELECT DISTINCT r.vehicle_id
FROM suspected_rides AS sr
JOIN rides AS r ON r.id = sr.ride_id;
SELECT
DISTINCT r.vehicle_id,
u.name AS "owner name",
u.address,
v.status,
v.current_location
FROM suspected_rides AS sr
JOIN rides AS r ON r.id = sr.ride_id
JOIN vehicles AS v ON v.id = r.vehicle_id
JOIN users AS u ON u.id = v.owner_id;
It's not the drivers
So with our current setup, we go ahead and filter out all of the unique riders that were on those suspected rides on that horrible day of the theft.
SELECT DISTINCT r.vehicle_id, u.name AS "rider name", u.address
FROM suspected_rides AS sr
JOIN rides AS r ON r.id = sr.ride_id
JOIN users AS u ON u.id = r.rider_id;
- Clue #5: It's not a rider
- Clue #6: It was an inside job
We're going to have to cross-reference data between 2 separate databases.
-- run this to run queries across databases
CREATE extension dblink;
SELECT *
FROM dblink(
'host=localhost user=postgres password=postgres dbname=Employees',
'SELECT <column> FROM employees;')
AS t1(<column> NAME)
CREATE VIEW suspected_rider_names AS
SELECT DISTINCT
split_part(u.name, ' ', 1) AS "first_name",
split_part(u.name, ' ', 2) AS "last_name"
FROM suspected_rides AS vlh
JOIN rides AS r ON r.id = vlh.ride_id
JOIN users AS u ON u.id = r.rider_id;
SELECT * FROM suspected_rider_names;
SELECT DISTINCT
CONCAT(t1.first_name, ' ', t1.last_name) AS "employee",
CONCAT(u.first_name, ' ', u.last_name) AS "rider"
FROM dblink(
'host=localhost user=chesterheng password=postgres dbname=Movr_Employees',
'SELECT first_name, last_name FROM employees;')
AS t1(first_name NAME, last_name NAME)
JOIN suspected_rider_names AS u ON t1.last_name = u.last_name
ORDER BY "rider";
Software development lifecycle
- Phase 1: System planning and selection
- Phase 2: System analysis
- Phase 3: System design
- Phase 4: System implementation and operation
- Goal of SDLC is to design robust systems
- The process can be implemented in different ways: agile, waterfall, v-model (for hardware system)
- different models cycle through the steps at different paces
- Phase 1: requirements analysis
- gather the information about what needs to be done (scope)
- Phase 2: system analysis
- take in requirements and analyze if it can be done on time and on budget
- Phase 3: system design
- design the system architecture for all related databases, apps, etc.
- Phase 4: system implementation and operation
- build the software
- there are more phases that can be added: testing, maintenance, ... (inside phase 4)
- From chaos to structure: create structure that can be understood and communicated
Database design
- In any system, there is data
- There are different techniques to design database
- Top-Down vs Bottom-Up
Top-Down
- Start from 0
- Optimal choice when create a new database
- All requirements are gathered up-front
Bottom-Up
- There is an existing system or specific data already in place (example: data already used by doctors)
- Shape a new system around the existing data
- Optimal choice when migrating an existing database
What to use? Often you'll end up using a bit of both
- In top-down design, you create the universe based on the requirements
- So, let's work through an example based on DRIVEME, a driving school
- Every school has instructors on payroll and an inventory of cars, trucks and motorcycles for teaching
- Initial stakeholders: CEO, CTO, BAs, customers
- Core mission: become a household name across the USA for learning how to drive
- Current situation: they have an outdated website and their customer acquisition is mostly word of mouth
- They want to start gaining market share through an online presence
- Core requirements are a well-thought out document with sections of functionality
- There is a vehicle inventory for students to rent
- There are employees at every branch
- There is maintenance for the vehicles
- There is an optional exam at the end of lessons
- You can only take the exam twice, fail twice and you must take more lessons
- Goal: create a data model based on requirements
- What are requirements?
- high-level requirements
- user interviews
- data collection
- deep understanding of the business
- one of the key methods of top-down design is ER modeling
Phase 3: system design
- How to design?
- Top-down: ER modeling
- Bottom-up: ???
ER Model: Entity-Relationship Model or diagram
- A diagram that functions as a way to structure high-level requirements
- Step 1: Determine what entities are in the system
What is an entity?
- a person / place or a thing
- has a singular name
- has an identifier
- should contain more than one instance of data
Core requirements
- There is a vehicle inventory for students to rent
- There are employees at every branch
- There is maintenance for the vehicles
- There is an optional exam at the end of lessons
- You can only take the exam twice, fail twice and you must take more lessons
- give entities the information they will store
Attributes
- must be a property of the entity
- must be atomic
- can be stored in an individual field
- not an address, for example
- an address can be broken down into street name, postal code, region
- single / multivalued (a single phone number or many phone numbers)
- keys
- relation schema
- attribute
- degree
- cardinality
- tuple
- column
- relation key
- domain
- tables
- relation instance
Relation schema
- table schema
- represent data that is going into table
Relation instance
- set of data related to relation schema
- Concept of Keys in DBMS - Super, Primary, Candidate, Foreign Key, etc
- DBMS Keys: Candidate, Super, Primary, Foreign (Example)
- relation key: uniquely identifies rows and relationships
- relation keys consist of super keys and candidate keys
- super key: an attribute or a set of attributes that can be used to identify a row of data in a table
- candidate key: minimal subset of super key
- candidate key: if any proper subset of a super key is a super key, then that key cannot be a candidate key
- primary key: uniquely identify a row
- foreign key: reference a key in foreign table
- compound key: multiple columns that together uniquely identify a row, including at least one foreign key
- composite key: multiple columns that together uniquely identify a row, without including a foreign key
- surrogate key: used to uniquely identify a row and has no meaning to the data in the table
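- a small hypothetical example of how these key types might appear in SQL (customer and purchase are made up for illustration):
CREATE TABLE customer (
  customer_id SERIAL PRIMARY KEY,   -- surrogate key: no business meaning
  email TEXT UNIQUE NOT NULL        -- candidate key: could also identify a row
);
CREATE TABLE purchase (
  customer_id INT REFERENCES customer(customer_id),        -- foreign key
  product_code TEXT,
  purchase_date DATE,
  PRIMARY KEY (customer_id, product_code, purchase_date)   -- compound key: includes a foreign key
);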
School:
- school_id, street_name, street_nr, postal_code, state, city
Instructor:
- instructor_id, first_name, last_name, date_of_birth, hiring_date, school_id
Student:
- student_id, first_name, last_name, date_of_birth, enrollment_date, ...
Exam:
- student_id, teacher_id, date_taken, passed, lesson_id
Lesson:
- lesson_id, date_of_enrollment, package, student_id
- determine relationships between entities
- links 2 entities together
- 1 to 1
- 1 to many
- many to many
- connect entities with a line that tells us what the relationship is
- in the relational model it isn't possible to store a many to many relationship directly
- technically you can do it, but you really don't want to
- you create more overhead: INSERT overhead, UPDATE overhead, DELETE overhead, potential redundancy
- as a rule of thumb, always try to resolve many to many relationships
- the solution is to use intermediate entities (see the sketch after this list)
- divide entities into logical groups that are related (think schemas)
- Subject Areas rules
- all entities must belong to one
- an entity can only belong to one
- you can nest subject areas
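- a minimal sketch of such an intermediate (junction) entity; it mirrors the enrollment table created earlier, which resolves the many-to-many between student and course:
CREATE TABLE student_course (
  student_id UUID REFERENCES student(student_id),
  course_id  UUID REFERENCES course(course_id),
  PRIMARY KEY (student_id, course_id)   -- each pairing is stored exactly once
);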
System requirements
- a rich business man has tons of paintings
- he wants to build a system to catalog and track where his art is
- he lends it to museums all across the world
- he even wants to see reservations
Some constraints to note:
- a painting can only have one artist
Let's answer some questions about the system:
- what's the goal of the system?
- track painting reservations for a wealthy man
- who are our stakeholders
- owner, museums
5 steps of top-down design to create an ERD of the system
Step 1: entities
- Painting
- Reservation
- Museum
- Artist
Step 2: attributes
Entity | Attributes |
---|---|
Painting | name, creation_date, style |
Reservation | creation_date, date_from, date_to, accepted |
Artist | name, birth_date, email |
Museum | name, address, phone_nr, email |
Step 3: Relationships
Step 4: Solving Many To Many
- Bottom-up design: create a data model from specific details, existing systems, legacy systems, ...
Steps
- identify the data (attributes)
- group them (entities)
Goal
- create a perfect data model without redundancy and anomalies
- modification anomalies are problems that arise when a database is not structured correctly
There are 3 types
- Insert anomalies
- Update anomalies
- Delete anomalies
- normalization is key to avoiding anomalies
- Edgar Codd proposed the theory of normalization
Goal
- a design technique to reduce redundancy and anomalies
Normalization
- to do Normalization we must understand 2 things
- functional dependencies
- normal forms
- A functional dependency shows a relationship between attributes
- a functional dependency exists when the relationship between 2 attributes allows you to uniquely determine the corresponding attribute's value
- we refer to an entity as R and attributes as A and B
- A is functionally dependent on B when a value of B determines a value of A: B -> A
- it means that based on the value of B, you can determine the value of A: EMP_NO -> FIRST_NAME
- DETERMINANT -> DEPENDENT
- example: STUDENT_ID -> BIRTH_DATE
- create unique relationship between the data
emp_no | proj_id | salary |
---|---|---|
1 | 1 | 300 |
2 | 1 | 500 |
1 | 2 | 400 |
3 | 2 | 300 |
Example:
- EMP_NO -> SALARY (no)
- PROJ_ID -> SALARY (no)
- PROJ_ID, EMP_NO -> SALARY (yes)
student_no | activity | contribution |
---|---|---|
1 | swimming | 300 |
2 | sailing | 500 |
3 | bungee | 150 |
4 | shark diving | 200 |
1 | shark diving | 300 |
2 | bungee | 300 |
Example:
- STUDENT_NO -> ACTIVITY (no)
- ACTIVITY -> STUDENT_NO (no)
- ACTIVITY -> CONTRIBUTION (no)
- STUDENT_NO, ACTIVITY -> CONTRIBUTION (yes)
- 0 NF -> 1 NF -> 2 NF -> 3 NF -> BCNF -> 4 NF -> 5 NF -> 6 NF
- normalization happens through a process of running attributes through the normal forms
- each normal form aims to further separate relations into smaller relations so as to create less redundancy and fewer anomalies
- normal forms 0 through BCNF are the most common normal forms to run through
- NF 4 and 5 are there to further reduce anomalies and 6 NF is not yet standardized
data is in 0 NF when it is unnormalized
- repeating groups of fields
- positional dependence of data
- non-atomic data
1 NF
- eliminate repeating columns of the same data
- each attribute should contain a single value
- determine a primary key
A repeating group means that a table contains 2 or more columns that are closely related
Example 1:
0 NF
book | author 1 | author 2 | author 3 |
---|---|---|---|
1 | 1 | 2 | 3 |
2 | 2 | 1 | 3 |
3 | 3 | 2 | 1 |
1 NF
BOOK
book_id | title |
---|---|
BOOK_AUTHOR
book_id | author_id | author_name | author_address | author_email |
---|---|---|---|---|
Example 2:
0 NF
color | quantity | price |
---|---|---|
red, green, blue | 20 | 9.99 |
yellow, orange, purple | 10 | 10.99 |
blue, cyan | 15 | 3.99 |
green, magenta | 200 | 15.99 |
1 NF
PRODUCT
prod_id | quantity | price |
---|---|---|
PRODUCT_COLOUR
prod_id | color |
---|---|
2 NF
- it is in 1 NF
- all non-key attributes are fully functionally dependent on the primary key
Example:
0 NF
- book
- title
- author
- author_name
- author_address
- author_email
1 NF
BOOK
- book_id (pk)
- title
BOOK_AUTHOR
- book_id (pk)
- author_id (pk)
- author_name
- author_address
- author_email
2 NF
BOOK
- book_id (pk)
- title
BOOK_AUTHOR
- book_id (pk / fk)
- author_id (pk / fk)
AUTHOR
- author_id (pk)
- author_name
- author_address
- author_email
3 NF
- it is in 2 NF
- no transitive dependencies
- if A is functionally dependent on B and B is functionally dependent on C, then A is transitively dependent on C via B
- C -> B and B -> A means that C -> A
Example:
0 NF
- branch
- first name
- last name
- title
- hours
1 NF
EMPLOYEE
- emp_no (pk)
- first name
- last name
- title
BRANCH
- branch_no (pk)
- emp_no (fk)
- street
- street_no
- province
- postal_code
- hours_logged
- work_date
- country
2 NF
EMPLOYEE
- emp_no (pk)
- first name
- last name
- title
BRANCH
- branch_no (pk)
- street
- street_no
- province
- postal_code
- country
TIMESHEET
- branch_no (pk / fk)
- emp_no (pk / fk)
- hours_logged
- work_date
3 NF
EMPLOYEE
- emp_no (pk)
- first name
- last name
- title
BRANCH
- branch_no (pk)
- street
- street_no
- postal_code
- province_id (fk)
PROVINCE
- province_id (pk)
- province
- country
TIMESHEET
- branch_no (pk / fk)
- emp_no (pk / fk)
- hours_logged
- work_date
BCNF
- it is in 3 NF
- for any dependency A -> B, A should be a super key
- most relations in 3 NF are also in BCNF, but not all of them
- 3 NF allows attributes to be part of a candidate key that is not the primary key - BCNF does not
A relation can fall out of BCNF when
- the primary key is a composite key
- there is more than one candidate key
- the candidate keys have attributes in common (they overlap)
3 NF
student_id | tutor_id | tutor_sin |
---|---|---|
1 | 998 | 838 383 494 |
2 | 234 | 343 535 352 |
3 | 999 | 112 134 134 |
4 | 1234 | 354 464 234 |
candidate keys
- STUDENT_ID, TUTOR_ID
- STUDENT_ID, TUTOR_SIN
functional dependency
- TUTOR_ID -> TUTOR_SIN
- TUTOR_SIN -> TUTOR_ID
- STUDENT_ID, TUTOR_ID -> TUTOR_SIN
- STUDENT_ID, TUTOR_SIN -> TUTOR_ID
BCNF
student_id | tutor_id |
---|---|
1 | 998 |
2 | 234 |
3 | 999 |
4 | 1234 |
tutor_id | tutor_sin |
---|---|
998 | 838 383 494 |
234 | 343 535 352 |
999 | 112 134 134 |
1234 | 354 464 234 |
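- a sketch of how that BCNF decomposition might be expressed in SQL (column types are illustrative):
CREATE TABLE tutor (
  tutor_id  INT PRIMARY KEY,
  tutor_sin TEXT UNIQUE NOT NULL    -- both determinants (tutor_id, tutor_sin) are now keys
);
CREATE TABLE student_tutor (
  student_id INT,
  tutor_id   INT REFERENCES tutor(tutor_id),
  PRIMARY KEY (student_id, tutor_id)
);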
- watch out for over-normalization
- it can incur functionality loss
Scalability
- the capability of a database to handle a growing amount of data
- more data to store in the database
- there is a limit on how much data a single database can store
- more operations - lots of users querying the database for more information
- how many operations per second can it handle
2 strategies
- vertical scalability: add more capacity to a single database server - more disk space, more memory, a physically bigger machine; typically used for relational databases
- horizontal scalability: add more databases that work together as a single database; typically used for NoSQL databases such as MongoDB
- split the data across many databases
- for example: 3 databases
- 1st database stores A to J
- 2nd database stores K to T
- 3rd database stores U to Z
- a routing system is needed to know which database a given piece of data lives in
- it distributes queries to the different databases
- replicate data across multiple machines and different locations
- happens in real time
- eventual consistency: a user writes data to one database and that database then updates all of its replicas
2 strategies
- synchronous: the client updates a database, and the database updates all of its replicas before responding to the client
- asynchronous: the client updates a database and gets a response immediately; the replicas are updated in the background
- not done often
- normally at night
- dump the data into a SQL file ("database backup")
- Centralized: kept in a single location on a given network or controlled by one company; centrally controlled
- data integrity is easier to maintain
- Distributed: composed of multiple databases stored in multiple locations or controlled by different organizations
- more complexity
- users can only see the data they are authorized to see
- keep unauthorized users from accessing the database
- prevent data corruption
- database availability, no system crashes
- create databases that store only the necessary info and give access only to the necessary users
- the principle of least privilege states that you should grant exactly the privileges that are needed
Big drop in developer job postings
NoSQL (MongoDB)
- duplicated data
- no standard way to query
- no need to have schema
- easy to scale horizontally
- related data is in one location
- data is not linked to other locations
Relational (PostgreSQL)
- normalisation, no data duplication
- standard way to query
- need a schema
- hard to scale horizontally
- need to get related data from different tables
- data is linked to many locations
NewSQL
- relational database and easy to scale horizontally
- good for data that needs to be searched
- built for full-text search
- less reliable than a relational database
- fast at searching
- use a relational database as the primary data source
- the relational database stores the important info
- use an Elasticsearch database as a secondary data source
- copy the data that needs to be searchable into the Elasticsearch database
- document model
- used for objects
- dump data in and retrieve it
- Postgres
- Microsoft SQL Server
- MongoDB
- Amazon DocumentDB
- Firebase
- Elasticsearch
- Redis
- Amazon S3
CREATE TABLE users (
id serial PRIMARY KEY,
"name" VARCHAR(100),
email text UNIQUE NOT NULL,
entries BIGINT DEFAULT 0,
joined TIMESTAMP NOT NULL
);
CREATE TABLE login (
id serial PRIMARY KEY,
hash VARCHAR(100) NOT NULL,
email text UNIQUE NOT NULL
);
// a minimal sketch of the course's smart-brain server setup; assumes knex, pg and express are installed
const knex = require('knex');
const express = require('express');

const app = express();
app.use(express.json());

const db = knex({
  client: 'pg',
  connection: {
    host : 'localhost',
    user : 'postgres',
    password : '',
    database : 'smart-brain'
  }
});

db.select('*').from('users');

app.post('/register', (req, res) => {
  const { email, name } = req.body;
  db('users')
    .returning('*')
    .insert({
      name,
      email,
      joined: new Date()
    })
    .then(user => res.json(user[0]))
    .catch(err => res.status(400).json('unable to register'));
});