SQL Queries: Joins and set theory clauses
Contents
- Introduction
- Additive joins
- Set theory clauses
- Semi join
- Anti join
Introduction
The SQL JOIN
clause is used to create a new table from the rows of existing tables. There are many types of joins, depending on what rows we want from the source tables. The joins are performed by using a key field, the field (aka column) that's common between the tables to be joined.
JOIN
s are known to be time- and compute-intensive. One way to mitigate this burden is by using table indices as the join key.
Also, while reviewing the examples below, note thatDefinition A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space needed to maintain the index data structure. Instead of having to search every row, indexes are used for quick lookups. The index itself is a copy of selected column(s) called key(s).
JOIN
s can be used as a method of filtering in themselves!
-
The result of a
JOIN
will ONLY include rows that are in common in both tables, whether or not a column from theJOIN
ed table is used. -
We can join on multiple conditions using the
AND
clause. - We can include explicit filtering conditions here.
-
When the key column we want to join on has the same name in both tables, we can use a
USING (TableA.key)
clause rather thanON TableA.key = TableB.key
. (IfJOIN
ing on multiple columns, the argument toUSING
is a comma-separated list.)
I regularly have to refresh my memory on the different kinds of JOIN
s so this article is written as a personal reference 😅
*In the following examples, the column named key refers to the join key, not an indexing key.
Inner join
INNER JOIN
is the rows associated with the keys that show up in both tables.
-
The default
JOIN
type is anINNER JOIN
. - For every key match in TableB, there's going to be a replication of that row.
SELECT A.key,
, A.val AS aval
, B.val AS bval
FROM TableA A
INNER JOIN TableB B
ON A.key = B.key
Self join
SELECT e1.emp_id,
, e1.first_name AS emp_name
, e2.first_name AS manager_name
FROM emp e1
INNER JOIN emp e2
ON e1.manager_id = e2.emp_id
- Alias'ing tables is required when performing a self join.
- It's a useful tool for restructuring tables from wide to long, for example, when we want to compare values in a column to other values in the same column (e.g., emp.first_name).
-
Notice how we performed an
INNER JOIN
so Jane, who has no manager, was omitted from the result set. This is because the key valueNULL
has no match, not even to anotherNULL
! We could have kept this row had we performed aLEFT JOIN
.
Left join
The result of a left join includes all the rows in the left table, and column values in the right table that have a match. Do we want to include values that appear in both tables? Or exclude values that appear in both tables?
Left inclusive join
SELECT A.key,
, A.val AS aval
, B.val AS bval
FROM TableA A
LEFT JOIN TableB B
ON A.key = B.key
LEFT [OUTER] JOIN
performs a left inclusive join. The result contains all the rows TableA. Columns from TableB are also present, but only the rows that have a matching key in TableA.
- Rows in TableB with no match are completely ignored.
- For multiple matches, every key match in TableB results in a replication of the corresponding TableA row.
Left exclusive join
SELECT A.key,
, A.val AS aval
, B.val AS bval
FROM TableA A
LEFT JOIN TableB B
ON A.key = B.key
WHERE B.key is NULL
Right join
The result of a right join includes all the rows in the right table, and column values in the left table that have a match. Do we want to include values that appear in both tables? Or exclude values that appear in both tables?
Right inclusive join
SELECT B.key,
, A.val AS aval
, B.val AS bval
FROM TableA A
RIGHT JOIN TableB B
ON A.key = B.key
RIGHT [OUTER] JOIN
performs a right inclusive join. The result contains all the rows TableB. Columns from TableA are also present, but only the rows that have a matching key in TableB.
- Rows in TableA with no match are completely ignored.
- For multiple matches, every key match in TableA results in a replication of the corresponding TableB row.
Right exclusive join
SELECT B.key,
, A.val AS aval
, B.val AS bval
FROM TableA A
RIGHT JOIN TableB B
ON A.key = B.key
WHERE A.key IS NULL
Full join
Full inclusive join
SELECT A.key
, A.val AS aval
, B.val AS bval
FROM TableA A
FULL JOIN TableB B
ON A.key = B.key
FULL [OUTER] JOIN
performs a full inclusive join. The result keeps all rows from both TableA and TableB, so can yield a very large result set!
- Rows in TableA with no match in TableB are included.
- Rows in TableB with no match in TableA are included.
- For multiple matches, every key match results in a replication of that row in the result.
Full exclusive join
SELECT A.key
, A.val AS aval
, B.val AS bval
FROM TableA A
FULL JOIN TableB B
ON A.key = B.key
WHERE A.key IS NULL
OR B.key IS NULL
Cross join
SELECT A.key
, A.val AS aval
, B.val AS bval
FROM TableA A
CROSS JOIN TableB B
ON A.key = B.key
CROSS JOIN
is the cartesian product of the rows in TableA with the rows in TableB. In simpler terms, think of every possible pairing of a row from TableA and a row from TableB.
-
As stated in PostgreSQL documentation, listing more than one table reference in the
FROM
clause is equivalent to a cross join. -
CROSS JOIN
s do not useON
orUSING
.
Union
SELECT A.key,
, A.val
FROM TableA A
WHERE A.key < 3
UNION
SELECT B.key,
, B.val
FROM TableB B
WHERE B.key > 3
UNION
operator works similarly to rbind()
in R. In short, it combines the results of two SELECT
queries by stacking the results on top of each other.
- The order and number of the columns must be the same in both queries.
- There's no guarantee on the order that the rows will be returned.
- Ordering and filters can be applied to the result by adding them to the second query.
-
Duplicate rows are removed. To retain them use
UNION ALL
. -
Any
WHERE
statements need to be applied to each query individually.
However,ORDER BY
can be added to the second query and will be applied to the whole result.
Intersect
-- what customers placed an order in August 2019?
SELECT a.id
, a.name
FROM accounts a
INTERSECT
SELECT o.customer_id
, aa.customer_name
FROM orders o
LEFT JOIN accounts aa
ON aa.id = o.customer_id
WHERE a.id = o.customer_id
AND o.occured_at BETWEEN '2019-08-01'
AND '2019-08-31')
INTERSECT
operator works similarly to UNION
in how it combines the results of two SELECT
queries. It returns the rows that are present in both queries.
- The order and number of the columns must be the same in both queries.
- There's no guarantee on the order that the rows will be returned.
- Ordering and filters can be applied to the result by adding them to the second query.
-
Duplicate rows are removed. To retain them use
INTERSECT ALL
.
Except
-- what customers didn't place an order
-- in August 2019?
SELECT a.id
, a.name
FROM accounts a
EXCEPT
SELECT o.customer_id
, aa.customer_name
FROM orders o
LEFT JOIN accounts aa
ON aa.id = o.customer_id
WHERE a.id = o.customer_id
AND o.occured_at BETWEEN '2019-08-01'
AND '2019-08-31')
EXCEPT
operator works similarly to UNION
in how it combines the results of two SELECT
queries. It returns the difference between the two queries, the rows present in the first query that are not in the second.
- The order and number of the columns must be the same in both queries.
- There's no guarantee on the order that the rows will be returned.
- Ordering and filters can be applied to the result by adding them to the second query.
-
Duplicate rows are removed. To retain them use
EXCEPT ALL
.
Semi join
-- which customers have > 1 order
-- during the month of August 2019?
SELECT a.id AS customer_id,
, a.name AS customer_name
FROM accounts a
WHERE EXISTS
(SELECT 1
FROM orders o
WHERE a.id = o.customer_id
AND o.occured_at BETWEEN '2019-08-01'
AND '2019-08-31')
LEFT JOIN
, for example. It refers to a kind of join where we determine what rows to keep in the left table by using values in the right table.
- The result of this query chooses rows from the first table based on a condition being met in the second table.
-
Condition-checking is done using a subquery in a
WHERE
statement or a subquery expression. - The main difference between a semi join and a conventional join is that a row from the left table will only be returned once, even if there are multiple matches in the right table.
-
In the example here,
EXISTS
checks that the subquery returns at least one row then stops; the full query is not run.
Anti join
LEFT JOIN
, for example. It refers to a kind of join where we determine what rows to keep in the left table by using values in the right table.
- The result of this query chooses rows from the first table based on a condition not being met in the second table.
-
Condition-checking is done using a subquery in a
WHERE
statement or a subquery expression. - Anti joins are useful when trying to find something that hasn't happened, for example identifying the customers that have not placed an order during a particular month (see example above).
- The main difference between an anti join and a conventional join is that a row from the left table will only be returned once, even if there are multiple matches in the right table.