Thoughts, summaries, and tutorials on a variety of topics.

SQL Queries: Joins and set theory clauses

Contents


Introduction

The SQL JOIN clause is used to create a new table from the rows of existing tables. There are many types of joins, depending on what rows we want from the source tables. The joins are performed by using a key field, the field (aka column) that's common between the tables to be joined.

Pictorial of the various kinds of left, right, full, and inner joins in SQL.
Diagram of the various kinds of joins in SQL. Click to enlarge.
Credit: Wikimedia/CC with purple text added by me.

JOINs are known to be time- and compute-intensive. One way to mitigate this burden is by using table indices as the join key.

Definition A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space needed to maintain the index data structure. Instead of having to search every row, indexes are used for quick lookups. The index itself is a copy of selected column(s) called key(s).

Also, while reviewing the examples below, note that JOINs can be used as a method of filtering in themselves!
  • The result of a JOIN will ONLY include rows that are in common in both tables, whether or not a column from the JOINed table is used.
  • We can join on multiple conditions using the AND clause.
  • We can include explicit filtering conditions here.
  • When the key column we want to join on has the same name in both tables, we can use a USING (TableA.key) clause rather than ON TableA.key = TableB.key. (If JOINing on multiple columns, the argument to USING is a comma-separated list.)

I regularly have to refresh my memory on the different kinds of JOINs so this article is written as a personal reference 😅

*In the following examples, the column named key refers to the join key, not an indexing key.


Inner join

The result of an INNER JOIN is the rows associated with the keys that show up in both tables.
  • The default JOIN type is an INNER JOIN.
  • For every key match in TableB, there's going to be a replication of that row.
SELECT A.key,
     , A.val AS aval
     , B.val AS bval
FROM TableA A
INNER JOIN TableB B
ON A.key = B.key
Depiction of an INNER JOIN using a toy example.
Click to enlarge.

Self join

SELECT e1.emp_id,
    , e1.first_name AS emp_name
    , e2.first_name AS manager_name
FROM emp e1
INNER JOIN emp e2
ON e1.manager_id = e2.emp_id
A self join is when we join a table to itself.
  • Alias'ing tables is required when performing a self join.
  • It's a useful tool for restructuring tables from wide to long, for example, when we want to compare values in a column to other values in the same column (e.g., emp.first_name).
  • Notice how we performed an INNER JOIN so Jane, who has no manager, was omitted from the result set. This is because the key value NULL has no match, not even to another NULL! We could have kept this row had we performed a LEFT JOIN.
Depiction of an self join using a toy example.
Click to enlarge.

Left join

The result of a left join includes all the rows in the left table, and column values in the right table that have a match. Do we want to include values that appear in both tables? Or exclude values that appear in both tables?

Left inclusive join
SELECT A.key,
     , A.val AS aval
     , B.val AS bval
FROM TableA A
LEFT JOIN TableB B
ON A.key = B.key
By default, a LEFT [OUTER] JOIN performs a left inclusive join. The result contains all the rows TableA. Columns from TableB are also present, but only the rows that have a matching key in TableA.
  • Rows in TableB with no match are completely ignored.
  • For multiple matches, every key match in TableB results in a replication of the corresponding TableA row.
Depiction of a left inclusive join using a toy example.
Click to enlarge.
Left exclusive join
SELECT A.key,
     , A.val AS aval
     , B.val AS bval
FROM TableA A
LEFT JOIN TableB B
ON A.key = B.key
WHERE B.key is NULL
The syntax for an exclusive join is similar. The difference is we only keep TableA rows with no match in TableB.
Depiction of a left inclusive using a toy example.
Click to enlarge.

Right join

The result of a right join includes all the rows in the right table, and column values in the left table that have a match. Do we want to include values that appear in both tables? Or exclude values that appear in both tables?

Right inclusive join
SELECT B.key,
     , A.val AS aval
     , B.val AS bval
FROM TableA A
RIGHT JOIN TableB B
ON A.key = B.key
By default, a RIGHT [OUTER] JOIN performs a right inclusive join. The result contains all the rows TableB. Columns from TableA are also present, but only the rows that have a matching key in TableB.
  • Rows in TableA with no match are completely ignored.
  • For multiple matches, every key match in TableA results in a replication of the corresponding TableB row.
Depiction of a right inclusive join using a toy example.
Click to enlarge.
Right exclusive join
SELECT B.key,
     , A.val AS aval
     , B.val AS bval
FROM TableA A
RIGHT JOIN TableB B
ON A.key = B.key
WHERE A.key IS NULL
The syntax for an exclusive join is similar. The difference is we only keep TableB rows with no match in TableA.
Depiction of a right exclusive using a toy example.
Click to enlarge.

Full join

Full inclusive join
SELECT A.key
     , A.val AS aval
     , B.val AS bval
FROM TableA A
FULL JOIN TableB B
ON A.key = B.key
By default, a FULL [OUTER] JOIN performs a full inclusive join. The result keeps all rows from both TableA and TableB, so can yield a very large result set!
  • Rows in TableA with no match in TableB are included.
  • Rows in TableB with no match in TableA are included.
  • For multiple matches, every key match results in a replication of that row in the result.
Depiction of an full inclusive join using a toy example.
Click to enlarge.
Full exclusive join
SELECT A.key
     , A.val AS aval
     , B.val AS bval
FROM TableA A
FULL JOIN TableB B
ON A.key = B.key
WHERE A.key IS NULL
   OR B.key IS NULL
The syntax for an exclusive join is similar. The only difference is that we only keep rows that appear in either TableA or TableB, but not both.
Depiction of an full join using a toy example.
Click to enlarge.

Cross join

SELECT A.key
     , A.val AS aval
     , B.val AS bval
FROM TableA A
CROSS JOIN TableB B
ON A.key = B.key
The result of a CROSS JOIN is the cartesian product of the rows in TableA with the rows in TableB. In simpler terms, think of every possible pairing of a row from TableA and a row from TableB.
  • As stated in PostgreSQL documentation, listing more than one table reference in the FROM clause is equivalent to a cross join.
  • CROSS JOINs do not use ON or USING.
Depiction of an cross join using a toy example.
Click to enlarge.

Union

SELECT A.key,
     , A.val
FROM TableA A
WHERE A.key < 3
UNION
SELECT B.key,
     , B.val
FROM TableB B
WHERE B.key > 3
The UNION operator works similarly to rbind() in R. In short, it combines the results of two SELECT queries by stacking the results on top of each other.
  • The order and number of the columns must be the same in both queries.
  • There's no guarantee on the order that the rows will be returned.
  • Ordering and filters can be applied to the result by adding them to the second query.
  • Duplicate rows are removed. To retain them use UNION ALL.
  • Any WHERE statements need to be applied to each query individually.
    However, ORDER BY can be added to the second query and will be applied to the whole result.

Intersect

-- what customers placed an order in August 2019?
SELECT a.id
     , a.name
FROM accounts a
INTERSECT
SELECT o.customer_id
     , aa.customer_name
FROM orders o
LEFT JOIN accounts aa
  ON aa.id = o.customer_id
WHERE a.id = o.customer_id
AND o.occured_at BETWEEN '2019-08-01' 
                     AND '2019-08-31')
The INTERSECT operator works similarly to UNION in how it combines the results of two SELECT queries. It returns the rows that are present in both queries.
  • The order and number of the columns must be the same in both queries.
  • There's no guarantee on the order that the rows will be returned.
  • Ordering and filters can be applied to the result by adding them to the second query.
  • Duplicate rows are removed. To retain them use INTERSECT ALL.

Except

-- what customers didn't place an order 
-- in August 2019?
SELECT a.id
     , a.name
FROM accounts a
EXCEPT
SELECT o.customer_id
    , aa.customer_name
FROM orders o
LEFT JOIN accounts aa
    ON aa.id = o.customer_id
WHERE a.id = o.customer_id
AND o.occured_at BETWEEN '2019-08-01' 
                     AND '2019-08-31')
The EXCEPT operator works similarly to UNION in how it combines the results of two SELECT queries. It returns the difference between the two queries, the rows present in the first query that are not in the second.
  • The order and number of the columns must be the same in both queries.
  • There's no guarantee on the order that the rows will be returned.
  • Ordering and filters can be applied to the result by adding them to the second query.
  • Duplicate rows are removed. To retain them use EXCEPT ALL.

Semi join

-- which customers have > 1 order
-- during the month of August 2019?
SELECT a.id AS customer_id,
     , a.name AS customer_name
FROM accounts a
WHERE EXISTS
  (SELECT 1
   FROM orders o
   WHERE a.id = o.customer_id
     AND o.occured_at BETWEEN '2019-08-01' 
                          AND '2019-08-31')
A semi join doesn't have special syntax like a LEFT JOIN, for example. It refers to a kind of join where we determine what rows to keep in the left table by using values in the right table.
  • The result of this query chooses rows from the first table based on a condition being met in the second table.
  • Condition-checking is done using a subquery in a WHERE statement or a subquery expression.
  • The main difference between a semi join and a conventional join is that a row from the left table will only be returned once, even if there are multiple matches in the right table.
  • In the example here, EXISTS checks that the subquery returns at least one row then stops; the full query is not run.

Anti join

-- what customers didn't place an order
-- during the month of August 2019?
SELECT a.id AS customer_id,
     , a.name AS customer_name
FROM accounts a
WHERE a.id NOT IN
  (SELECT DISTINCT o.id
   FROM orders o
   WHERE o.occured_at BETWEEN '2019-08-01' 
                          AND '2019-08-31')
Example inspired by blog post from Mode Analytics.
An anti join doesn't have special syntax like a LEFT JOIN, for example. It refers to a kind of join where we determine what rows to keep in the left table by using values in the right table.
  • The result of this query chooses rows from the first table based on a condition not being met in the second table.
  • Condition-checking is done using a subquery in a WHERE statement or a subquery expression.
  • Anti joins are useful when trying to find something that hasn't happened, for example identifying the customers that have not placed an order during a particular month (see example above).
  • The main difference between an anti join and a conventional join is that a row from the left table will only be returned once, even if there are multiple matches in the right table.
Depiction of an anti join using a toy example.
Click to enlarge.