When our data was all in a single table, we could easily retrieve a particular row from that table in order to get all the pertinent data we needed for a particular query. For example, looking at our unnormalized table below, if we wanted information on the books that 'John Smith' has checked out, we could use a simple SELECT
query such as SELECT * WHERE full_name = 'John Smith'
. This would return us the first two rows from that table, containing information such as the book title and author, and checkout and return dates.
Now that this data is split across three tables, users
, books
, and checkouts
, we first have to join those tables together before we can select the data that we need.
SQL handles queries across more than one table through the use of JOINs. JOINs are clauses in SQL statements that link two tables together, usually based on the keys that define the relationship between those two tables. There are several types of JOINs: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS; they all do slightly different things, but the basic theory behind them all is the same. We'll take a look at each type of JOIN
in turn, but first let's go over the general syntax that JOIN statements share.
The general syntax of a JOIN
statement is as follows:
SELECT table_nameN.column_name, ...
FROM table_name1
join_type JOIN table_name2
ON join_condition;
The ellipsis (...
) in the above command format indicates that we can provide any number of columns with the table_nameN.column_name
format.
The first part of this:
SELECT table_nameN.column_name...
is essentially the SELECT column_list
form that you've already seen in previous SELECT
queries, with the slight difference that column names are prepended by table names in the column list.
Let's first focus on the second part of the statement, the part that joins the tables:
FROM table_name1
join_type JOIN table_name2
ON join_condition
To join one table to another, PostgreSQL needs to know several pieces of information:
These pieces of information are combined together using the JOIN
and ON
keywords. The part of the statement that comes after the ON
keyword is the join condition; this defines the logic by which a row in one table is joined to a row in another table. In most cases this join condition is created using the primary key of one table and the foreign key of the table we want to join it with.
Remember our colors
and shapes
example from the previous chapter? The color_id
column of the shapes
table is a Foreign Key which references the id
column of the colors
table.
If we wanted a list of shapes and their colors, we could use a query like this:
SELECT colors.color, shapes.shape
FROM colors
JOIN shapes
ON colors.id = shapes.color_id;
Within the second part of this query, colors JOIN shapes ON colors.id = shapes.color_id
, the join condition will look at each id
value in the colors
table and attempt to match it with a color_id
value in the shapes
table. If there is a match then those two rows are joined together to form a new row in a transient table known as a join table. Since the id
1
for the color Red
appears twice in the color_id
column of our shapes
table, this row of the colors
table appears twice in our transient table, joined to both Square
and Star
. Since the id
3
for the color Orange
does not appear at all in the color_id
column of our shapes
table, this row of the colors
table is omitted completely from our transient table.
If you're rereading this book or chapter, you may be wondering what happened to the discussion of virtual join tables. We realized that that term was too easily confused with something else in SQL systems called a virtual table. Since our use of the term was non-standard, we have decided to replace it with the term transient table.
Note that transient tables are strictly a mental model designed to help you think about joins, especially when you start working with multiple joins.
With this transient join table created, the SELECT column_list FROM
part of our statement can then be executed to select columns from this transient table. Those columns could originally be from the first table or the second table; to avoid confusion, we therefore need to specify both the table name and column name in our column list, in the form table_name.column_name1
. Looking at our example, selecting columns from our transient join table is effectively the same as saying:
SELECT colors.color, shapes.shape
FROM shapes_colors_join_table;
The resulting data would look like this:
Now that we understand the basic principle behind how joins work, let's take a look at some specific examples of different types of joins.
As mentioned earlier, a JOIN
statement can come in various forms. To specify which type of join to use, you can add either INNER
, LEFT
, RIGHT
, FULL
or CROSS
just before the keyword JOIN
. We'll look at an example of each of those types of join using the tables in our sql_book
database.
An INNER JOIN
returns a result set that contains the common elements of the tables, i.e the intersection where they match on the joined condition. INNER JOINs are the most frequently used JOINs; in fact if you don't specify a join type and simply use the JOIN
keyword, then PostgreSQL will assume you want an inner join. Our shapes
and colors
example from earlier used an INNER JOIN
in this way.
In the query below, the line INNER JOIN (addresses) ON (users.id = addresses.user_id)
creates the intersection between the two tables, which means that the join table contains only rows where there is a definite match between the values in the two columns used in the condition.
SELECT users.*, addresses.*
FROM users
INNER JOIN addresses
ON users.id = addresses.user_id;
The data in our unjoined tables looks like this:
The result of our SELECT
query using an INNER JOIN
would look like this:
The value in the id
column of the users
table for the user Jane Smith
is 5
; since this value does not appear in the user_id
column of the addresses table, she is omitted entirely from the join table and so only 3 records are returned by the query.
If we did want to include Jane Smith in our results despite her not having an address, we would have to use a different type of join, an outer join.
A LEFT JOIN or a LEFT OUTER JOIN takes all the rows from one table, defined as the LEFT
table, and joins it with a second table. The JOIN
is based on the conditions supplied in the ON
clause. A LEFT JOIN
will always include the rows from the LEFT
table, even if there are no matching rows in the table it is JOINed with. When there is no match, the corresponding rows will use NULL
to represent the missing values from the second table.
Let's try and use the same JOIN query as before, but this time we'll use a left join:
SELECT users.*, addresses.*
FROM users
LEFT JOIN addresses
ON users.id = addresses.user_id;
Here, the Jane Smith
row from the users
table is included in the join table, since she doesn't have any matching rows in the addresses
table, the join table has NULL
values in her row for the columns of that table.
Note that using either LEFT JOIN
or LEFT OUTER JOIN
does exactly the same thing, and the OUTER
part is often omitted. Even so, it is still common to refer to this type of join as an 'outer' join in order to differentiate it from an 'inner' join. Another type of outer join is a RIGHT JOIN
, which we'll look at next.
A RIGHT JOIN
is similar to a LEFT JOIN
except that the roles between the two tables are reversed, and all the rows on the second table are included along with any matching rows from the first table. In the last chapter we mentioned that in our sql_book
database we have books, and also reviews for those books. Not all of our books have reviews, however. Let's make a RIGHT JOIN
or RIGHT OUTER JOIN
that displays all reviews and their associated books, along with any books that don't have a review. When there is no match, the corresponding rows will use NULL
to represent the missing values from the first table.
SELECT reviews.book_id, reviews.content,
reviews.rating, reviews.published_date,
books.id, books.title, books.author
FROM reviews
RIGHT JOIN books
ON reviews.book_id = books.id;
The data in our unjoined tables looks like this:
The result of our SELECT
query using an RIGHT JOIN
would look like this:
As you can see, My Third SQL Book
doesn't yet have a review, and so all the columns from the review table have NULL
values for that row in the join table.
A FULL JOIN
or FULL OUTER JOIN
is essentially a combination of LEFT JOIN
and RIGHT JOIN
. This type of join contains all of the rows from both of the tables. Where the join condition is met, the rows of the two tables are joined, just as in the previous examples we've seen. For any rows on either side of the join where the join condition is not met, the columns for the other table have NULL
values for that row.
A FULL JOIN
is a little less common than the other ones we've looked at so far and so we won't show an example for this.
Another uncommon type of join is a CROSS JOIN
; let's take a look.
A CROSS JOIN
, also known as a Cartesian JOIN, returns all rows from one table crossed with every row from the second table. In other words, the join table of a cross join contains every possible combination of rows from the tables that have been joined. Since it returns all combinations, a CROSS JOIN
does not need to match rows using a join condition, therefore it does not have an ON
clause.
The way this join works is sometimes a little difficult to envisage, so it's worth looking at an example in this case. This SQL query has the similar syntax to other JOIN
s, but without the ON
clause:
SELECT * FROM users CROSS JOIN addresses;
The query above returns the addresses
and users
tables, cross joined. The result set consists of every record in users
mapped to every record in addresses
. For 4 users and 3 addresses, we get a total of 4x3=12
records. In mathematical terms, this is the cross product of a set.
In an application, it's very unlikely that you would use a CROSS JOIN
. Most of the time, it's more important to match rows together through a join condition in order to return a meaningful result. It's still important to be aware of CROSS JOIN
s however, since you may occasionally encounter them.
Now that we've covered the basics of joins, let's explore a couple of other useful techniques for working with multiple tables.
It is possible, and indeed common, to join more than just two tables together. This is done by adding additional JOIN
clauses to your SELECT
statement. To join multiple tables in this way, there must be a logical relationship between the tables involved. One example would be joining our users
, checkouts
, and books
tables.
SELECT users.full_name, books.title,
checkouts.checkout_date
FROM users
INNER JOIN checkouts
ON users.id = checkouts.user_id
INNER JOIN books
ON books.id = checkouts.book_id;
Here we are using two INNER JOIN
s. One between users
and checkouts
and one between checkouts
and books
. In both cases the JOIN
is implemented by using the Primary Key of one table (either users
or books
) and the Foreign Key for that table in the checkouts
table.
To understand what's happening here, we have to dive a little deeper into how joins work. We won't go into too much depth, though - just enough to clarify how the above command works.
When you perform a JOIN
between two tables, PostgreSQL creates a transient table that contains data from both the original table and the table identified by the JOIN
. It's convenient to think of this transient table as containing all columns from the records in the FROM
table as well as all columns from the JOIN
table. Before PostgreSQL displays the results, it picks out just the columns that you've mentioned in the SELECT
statement.
If you add a second JOIN
, like we did above, PostgreSQL creates yet another transient table that contains all of the columns from the previous transient table as well as all of the columns from the matching rows in the second JOIN
table. Note that we're not referring back to the FROM
table here -- we're only working with a transient table and a JOIN
table, so we don't need to reference the FROM
table at all.
A third and fourth JOIN
acts in the same way -- each works with the previous transient table and adds data from matching rows in the JOIN
tables.
In the command shown above, the first JOIN
combines data from users
and checkouts
into a transient table that contains users.full_name
and checkouts.checkout_date
. The second JOIN
combines the data from this transient table with the title
column from the books
table. Together, the command creates a transient table that contains the 3 columns we're displaying.
You may have noticed that some of the queries we list above can get a bit long. We can cut back on the length of these queries by using aliasing. Aliasing allows us to specify another name for a column or table and then use that name in later parts of a query to allow for more concise syntax. Let's use our three table join from above as an example. Using aliasing, the query would look like this:
SELECT u.full_name, b.title, c.checkout_date
FROM users AS u
INNER JOIN checkouts AS c
ON u.id = c.user_id
INNER JOIN books AS b
ON b.id = c.book_id;
Here we specify single letter aliases for our tables, and use those aliases instead of our table names in order to prepend the columns from those tables in the column list and in the join conditions. This is commonly referred to as 'table aliasing'.
We can even use a shorthand for aliasing by leaving out the AS
keyword entirely. FROM users u
and FROM users AS u
are equivalent SQL clauses.
Aliasing isn't just useful for shortening SQL queries. We can also use it to display more meaningful information in our result table. For instance, if we want to display the number of checkouts from the library we could write something like:
SELECT count(id) AS "Number of Books Checked Out"
FROM checkouts;
Number of Books Checked Out
-----------------------------
4
(1 row)
If we hadn't used aliasing above then we lose context about what was counted.
SELECT count(id) FROM checkouts;
count
-------
4
(1 row)
If you're a user just trying to access information, then most likely you wouldn't know about the exact tables being queried; being explicit about what information we're displaying can be important in a case like that.
Thus far in this chapter, we've looked at using JOIN
s to work with more than one table. Although joining tables together is probably the most common way of working with multiple tables, you can often achieve the same results through use of a subquery. Before we compare subqueries and joins, let's examine what a subquery is.
Imagine executing a SELECT
query, and then using the results of that SELECT query as a condition in another SELECT
query. This is called nesting, and the query that is nested is referred to as a subquery.
For example, suppose we need to select users that have no books checked out. We could do this by finding users
whose user_id
is not in the checkouts
table. If no relation is found, that would mean that the user has not checked out any books.
SELECT u.full_name FROM users u
WHERE u.id NOT IN (
SELECT c.user_id FROM checkouts c
);
full_name
-------------
Harry Potter
(1 row)
In the code above, the NOT IN
clause compares the current user_id
to all of the rows in the result of the subquery. If that id
number isn't part of the subquery results, then the full_name
for current row is added to the result set.
This might seem a bit confusing, so let's break it down. Our initial checkouts
table looks like this:
The nested query SELECT c.user_id FROM checkouts c
returns the following results:
This transient table can then effectively be used by our NOT IN
clause as a list of values against which to check the values in the id
column of our users
table.
The only value in that column that is not in the results of the nested query is the id
for 'Harry Potter': 3
.
Subquery Expressions
PostgreSQL provides a number of expressions that can be used specifically with sub-queries, such as IN
, NOT IN
, ANY
, SOME
, and ALL
. These all work slightly differently, but essentially they all compare values to the results of a subquery.
We won't go into too much detail about subqueries here as you'll get to work with them some more later in course LS180. One thing it is useful to know though is that in some situations a subquery can be used as an alternative to a join.
As you write more queries, you may find that there is more than one way to write a query and achieve the same results. The most common choices are between subqueries and JOINs.
For instance, we can get the same result table as in our previous example by using a JOIN
clause instead of a subquery.
SELECT u.full_name
FROM users u
LEFT JOIN checkouts c ON u.id = c.user_id
WHERE c.user_id IS NULL;
full_name
--------------
Harry Potter
(1 row)
When creating queries that return the same result, a differentiator between them may be their performance when compared to each other. As a general rule, JOINs are faster to run than subqueries. This may be something to bear in mind if working with large datasets.
We've covered a lot of content in this chapter, from exploring how joins work at a conceptual level, through working with different types of joins, and finally to useful techniques such as aliasing and subqueries.
One of the most important things to remember about how joins work is that we set a condition that compares a value from the first table (usually a primary key), with one from the second table (usually a foreign key). If the condition that uses these two values evaluates to true, then the row that holds the first value is joined with the row that holds the second value.
Let's quickly recap on some of the different types of join we can use:
Join Type | Notes |
---|---|
INNER | Combines rows from two tables whenever the join condition is met. |
LEFT | Same as an inner join, except rows from the first table are added to the join table, regardless of the evaluation of the join condition. |
RIGHT | Same as an inner join, except rows from the second table are added to the join table, regardless of the evaluation of the join condition. |
FULL | A combination of left join and right join. |
CROSS | Doesn't use a join condition. The join table is the result of matching every row from the first table with the second table, the cross product of all rows across both tables. |
When using joins, sometimes our queries can get unwieldy, especially when we're dealing with 2 or more JOIN
s. To better manage this we can alias table and column names to shorten our query. We can also use aliasing to give more context about the query results.
Finally, the result from a join query can sometimes be obtained using different methods. Subqueries offer another method for us to query the database and retrieve the same results, or similar results, as if we had used a JOIN clause.
We're almost finished with our Introduction to SQL. In the next and final chapter we'll sum up everything we've learned so far and point you towards some next steps and resources to continue your learning!