SQL Joins

When our data was all in a single table, we could easily retrieve a particular row from that table in order to get all the pertinent data we needed for a particular query. For example, looking at our unnormalized table below, if we wanted information on the books that 'John Smith' has checked out, we could use a simple SELECT query such as SELECT * WHERE full_name = 'John Smith'. This would return us the first two rows from that table, containing information such as the book title and author, and checkout and return dates.

Unormalized Users Table

Now that this data is split across three tables, users, books, and checkouts, we first have to join those tables together before we can select the data that we need.

What is a SQL Join?

SQL handles queries across more than one table through the use of JOINs. JOINs are clauses in SQL statements that link two tables together, usually based on the keys that define the relationship between those two tables. There are several types of JOINs: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS; they all do slightly different things, but the basic theory behind them all is the same. We'll take a look at each type of JOIN in turn, but first let's go over the general syntax that JOIN statements share.

Join Syntax

The general syntax of a JOIN statement is as follows:

SELECT table_nameN.column_name, ...
       FROM table_name1
       join_type JOIN table_name2
                 ON join_condition;

The ellipsis (...) in the above command format indicates that we can provide any number of columns with the table_nameN.column_name format.

The first part of this:

SELECT table_nameN.column_name...

is essentially the SELECT column_list form that you've already seen in previous SELECT queries, with the slight difference that column names are prepended by table names in the column list.

Let's first focus on the second part of the statement, the part that joins the tables:

FROM table_name1
join_type JOIN table_name2
          ON join_condition

To join one table to another, PostgreSQL needs to know several pieces of information:

  • The name of the first table to join
  • The type of join to use
  • The name of the second table to join
  • The join condition.

These pieces of information are combined together using the JOIN and ON keywords. The part of the statement that comes after the ON keyword is the join condition; this defines the logic by which a row in one table is joined to a row in another table. In most cases this join condition is created using the primary key of one table and the foreign key of the table we want to join it with.

Remember our colors and shapes example from the previous chapter? The color_id column of the shapes table is a Foreign Key which references the id column of the colors table.

Shapes and Colors, separate tables

If we wanted a list of shapes and their colors, we could use a query like this:

SELECT colors.color, shapes.shape
       FROM colors
       JOIN shapes
            ON colors.id = shapes.color_id;

Within the second part of this query, colors JOIN shapes ON colors.id = shapes.color_id, the join condition will look at each id value in the colors table and attempt to match it with a color_id value in the shapes table. If there is a match then those two rows are joined together to form a new row in a transient table known as a join table. Since the id 1 for the color Red appears twice in the color_id column of our shapes table, this row of the colors table appears twice in our transient table, joined to both Square and Star. Since the id 3 for the color Orange does not appear at all in the color_id column of our shapes table, this row of the colors table is omitted completely from our transient table.

Shapes and Colors, transient join table

If you're rereading this book or chapter, you may be wondering what happened to the discussion of virtual join tables. We realized that that term was too easily confused with something else in SQL systems called a virtual table. Since our use of the term was non-standard, we have decided to replace it with the term transient table.

Note that transient tables are strictly a mental model designed to help you think about joins, especially when you start working with multiple joins.

With this transient join table created, the SELECT column_list FROM part of our statement can then be executed to select columns from this transient table. Those columns could originally be from the first table or the second table; to avoid confusion, we therefore need to specify both the table name and column name in our column list, in the form table_name.column_name1. Looking at our example, selecting columns from our transient join table is effectively the same as saying:

SELECT colors.color, shapes.shape
       FROM shapes_colors_join_table;

The resulting data would look like this:

Shapes and Colors, resulting data

Now that we understand the basic principle behind how joins work, let's take a look at some specific examples of different types of joins.

Types of Joins

As mentioned earlier, a JOIN statement can come in various forms. To specify which type of join to use, you can add either INNER, LEFT, RIGHT, FULL or CROSS just before the keyword JOIN. We'll look at an example of each of those types of join using the tables in our sql_book database.

INNER JOIN

An INNER JOIN returns a result set that contains the common elements of the tables, i.e the intersection where they match on the joined condition. INNER JOINs are the most frequently used JOINs; in fact if you don't specify a join type and simply use the JOIN keyword, then PostgreSQL will assume you want an inner join. Our shapes and colors example from earlier used an INNER JOIN in this way.

In the query below, the line INNER JOIN (addresses) ON (users.id = addresses.user_id) creates the intersection between the two tables, which means that the join table contains only rows where there is a definite match between the values in the two columns used in the condition.

SELECT users.*, addresses.*
       FROM users
       INNER JOIN addresses
             ON users.id = addresses.user_id;

The data in our unjoined tables looks like this:

Users and Addresses

The result of our SELECT query using an INNER JOIN would look like this:

Inner Join

The value in the id column of the users table for the user Jane Smith is 5; since this value does not appear in the user_id column of the addresses table, she is omitted entirely from the join table and so only 3 records are returned by the query.

If we did want to include Jane Smith in our results despite her not having an address, we would have to use a different type of join, an outer join.

LEFT JOIN

A LEFT JOIN or a LEFT OUTER JOIN takes all the rows from one table, defined as the LEFT table, and joins it with a second table. The JOIN is based on the conditions supplied in the ON clause. A LEFT JOIN will always include the rows from the LEFT table, even if there are no matching rows in the table it is JOINed with. When there is no match, the corresponding rows will use NULL to represent the missing values from the second table.

Let's try and use the same JOIN query as before, but this time we'll use a left join:

SELECT users.*, addresses.*
       FROM users
       LEFT JOIN addresses
            ON users.id = addresses.user_id;

Here, the Jane Smith row from the users table is included in the join table, since she doesn't have any matching rows in the addresses table, the join table has NULL values in her row for the columns of that table.

Left Join

Note that using either LEFT JOIN or LEFT OUTER JOIN does exactly the same thing, and the OUTER part is often omitted. Even so, it is still common to refer to this type of join as an 'outer' join in order to differentiate it from an 'inner' join. Another type of outer join is a RIGHT JOIN, which we'll look at next.

RIGHT JOIN

A RIGHT JOIN is similar to a LEFT JOIN except that the roles between the two tables are reversed, and all the rows on the second table are included along with any matching rows from the first table. In the last chapter we mentioned that in our sql_book database we have books, and also reviews for those books. Not all of our books have reviews, however. Let's make a RIGHT JOIN or RIGHT OUTER JOIN that displays all reviews and their associated books, along with any books that don't have a review. When there is no match, the corresponding rows will use NULL to represent the missing values from the first table.

SELECT reviews.book_id, reviews.content,
       reviews.rating, reviews.published_date,
       books.id, books.title, books.author
    FROM reviews
    RIGHT JOIN books
          ON reviews.book_id = books.id;

The data in our unjoined tables looks like this:

Reviews and Books

The result of our SELECT query using an RIGHT JOIN would look like this:

Right Join

As you can see, My Third SQL Book doesn't yet have a review, and so all the columns from the review table have NULL values for that row in the join table.

FULL JOIN

A FULL JOIN or FULL OUTER JOIN is essentially a combination of LEFT JOIN and RIGHT JOIN. This type of join contains all of the rows from both of the tables. Where the join condition is met, the rows of the two tables are joined, just as in the previous examples we've seen. For any rows on either side of the join where the join condition is not met, the columns for the other table have NULL values for that row.

A FULL JOIN is a little less common than the other ones we've looked at so far and so we won't show an example for this.

Another uncommon type of join is a CROSS JOIN; let's take a look.

CROSS JOIN

A CROSS JOIN, also known as a Cartesian JOIN, returns all rows from one table crossed with every row from the second table. In other words, the join table of a cross join contains every possible combination of rows from the tables that have been joined. Since it returns all combinations, a CROSS JOIN does not need to match rows using a join condition, therefore it does not have an ON clause.

The way this join works is sometimes a little difficult to envisage, so it's worth looking at an example in this case. This SQL query has the similar syntax to other JOINs, but without the ON clause:

SELECT * FROM users CROSS JOIN addresses;

The query above returns the addresses and users tables, cross joined. The result set consists of every record in users mapped to every record in addresses. For 4 users and 3 addresses, we get a total of 4x3=12 records. In mathematical terms, this is the cross product of a set.

CROSS Join

In an application, it's very unlikely that you would use a CROSS JOIN. Most of the time, it's more important to match rows together through a join condition in order to return a meaningful result. It's still important to be aware of CROSS JOINs however, since you may occasionally encounter them.

Now that we've covered the basics of joins, let's explore a couple of other useful techniques for working with multiple tables.

Multiple Joins

It is possible, and indeed common, to join more than just two tables together. This is done by adding additional JOIN clauses to your SELECT statement. To join multiple tables in this way, there must be a logical relationship between the tables involved. One example would be joining our users, checkouts, and books tables.

SELECT users.full_name, books.title,
       checkouts.checkout_date
    FROM users
    INNER JOIN checkouts
          ON users.id = checkouts.user_id
    INNER JOIN books
          ON books.id = checkouts.book_id;

Here we are using two INNER JOINs. One between users and checkouts and one between checkouts and books. In both cases the JOIN is implemented by using the Primary Key of one table (either users or books) and the Foreign Key for that table in the checkouts table.

To understand what's happening here, we have to dive a little deeper into how joins work. We won't go into too much depth, though - just enough to clarify how the above command works.

When you perform a JOIN between two tables, PostgreSQL creates a transient table that contains data from both the original table and the table identified by the JOIN. It's convenient to think of this transient table as containing all columns from the records in the FROM table as well as all columns from the JOIN table. Before PostgreSQL displays the results, it picks out just the columns that you've mentioned in the SELECT statement.

If you add a second JOIN, like we did above, PostgreSQL creates yet another transient table that contains all of the columns from the previous transient table as well as all of the columns from the matching rows in the second JOIN table. Note that we're not referring back to the FROM table here -- we're only working with a transient table and a JOIN table, so we don't need to reference the FROM table at all.

A third and fourth JOIN acts in the same way -- each works with the previous transient table and adds data from matching rows in the JOIN tables.

In the command shown above, the first JOIN combines data from users and checkouts into a transient table that contains users.full_name and checkouts.checkout_date. The second JOIN combines the data from this transient table with the title column from the books table. Together, the command creates a transient table that contains the 3 columns we're displaying.

Aliasing

You may have noticed that some of the queries we list above can get a bit long. We can cut back on the length of these queries by using aliasing. Aliasing allows us to specify another name for a column or table and then use that name in later parts of a query to allow for more concise syntax. Let's use our three table join from above as an example. Using aliasing, the query would look like this:

SELECT u.full_name, b.title, c.checkout_date
       FROM users AS u
       INNER JOIN checkouts AS c
           ON u.id = c.user_id
       INNER JOIN books AS b
           ON b.id = c.book_id;

Here we specify single letter aliases for our tables, and use those aliases instead of our table names in order to prepend the columns from those tables in the column list and in the join conditions. This is commonly referred to as 'table aliasing'.

We can even use a shorthand for aliasing by leaving out the AS keyword entirely. FROM users u and FROM users AS u are equivalent SQL clauses.

Column Aliasing

Aliasing isn't just useful for shortening SQL queries. We can also use it to display more meaningful information in our result table. For instance, if we want to display the number of checkouts from the library we could write something like:

SELECT count(id) AS "Number of Books Checked Out"
       FROM checkouts;
Number of Books Checked Out
-----------------------------
                          4
(1 row)

If we hadn't used aliasing above then we lose context about what was counted.

SELECT count(id) FROM checkouts;
 count
-------
     4
(1 row)

If you're a user just trying to access information, then most likely you wouldn't know about the exact tables being queried; being explicit about what information we're displaying can be important in a case like that.

Subqueries

Thus far in this chapter, we've looked at using JOINs to work with more than one table. Although joining tables together is probably the most common way of working with multiple tables, you can often achieve the same results through use of a subquery. Before we compare subqueries and joins, let's examine what a subquery is.

Imagine executing a SELECT query, and then using the results of that SELECT query as a condition in another SELECT query. This is called nesting, and the query that is nested is referred to as a subquery.

For example, suppose we need to select users that have no books checked out. We could do this by finding users whose user_id is not in the checkouts table. If no relation is found, that would mean that the user has not checked out any books.

SELECT u.full_name FROM users u
       WHERE u.id NOT IN (
           SELECT c.user_id FROM checkouts c
       );
  full_name
-------------
 Harry Potter
(1 row)

In the code above, the NOT IN clause compares the current user_id to all of the rows in the result of the subquery. If that id number isn't part of the subquery results, then the full_name for current row is added to the result set.

This might seem a bit confusing, so let's break it down. Our initial checkouts table looks like this:

Checkouts table

The nested query SELECT c.user_id FROM checkouts c returns the following results:

Nested Query Results

This transient table can then effectively be used by our NOT IN clause as a list of values against which to check the values in the id column of our users table.

NOT IN example

The only value in that column that is not in the results of the nested query is the id for 'Harry Potter': 3.

Subquery Expressions

PostgreSQL provides a number of expressions that can be used specifically with sub-queries, such as IN, NOT IN, ANY, SOME, and ALL. These all work slightly differently, but essentially they all compare values to the results of a subquery.

We won't go into too much detail about subqueries here as you'll get to work with them some more later in course LS180. One thing it is useful to know though is that in some situations a subquery can be used as an alternative to a join.

Subqueries vs Joins

As you write more queries, you may find that there is more than one way to write a query and achieve the same results. The most common choices are between subqueries and JOINs.

For instance, we can get the same result table as in our previous example by using a JOIN clause instead of a subquery.

SELECT u.full_name
       FROM users u
       LEFT JOIN checkouts c ON u.id = c.user_id
       WHERE c.user_id IS NULL;
  full_name
--------------
 Harry Potter
(1 row)

When creating queries that return the same result, a differentiator between them may be their performance when compared to each other. As a general rule, JOINs are faster to run than subqueries. This may be something to bear in mind if working with large datasets.

Summary

We've covered a lot of content in this chapter, from exploring how joins work at a conceptual level, through working with different types of joins, and finally to useful techniques such as aliasing and subqueries.

One of the most important things to remember about how joins work is that we set a condition that compares a value from the first table (usually a primary key), with one from the second table (usually a foreign key). If the condition that uses these two values evaluates to true, then the row that holds the first value is joined with the row that holds the second value.

Let's quickly recap on some of the different types of join we can use:

Join Type Notes
INNER Combines rows from two tables whenever the join condition is met.
LEFT Same as an inner join, except rows from the first table are added to the join table, regardless of the evaluation of the join condition.
RIGHT Same as an inner join, except rows from the second table are added to the join table, regardless of the evaluation of the join condition.
FULL A combination of left join and right join.
CROSS Doesn't use a join condition. The join table is the result of matching every row from the first table with the second table, the cross product of all rows across both tables.

When using joins, sometimes our queries can get unwieldy, especially when we're dealing with 2 or more JOINs. To better manage this we can alias table and column names to shorten our query. We can also use aliasing to give more context about the query results.

Finally, the result from a join query can sometimes be obtained using different methods. Subqueries offer another method for us to query the database and retrieve the same results, or similar results, as if we had used a JOIN clause.

We're almost finished with our Introduction to SQL. In the next and final chapter we'll sum up everything we've learned so far and point you towards some next steps and resources to continue your learning!