Thus far in this book, all the work we've done has been with a single database table. The majority of databases you'll work with as a developer will have more than one table, and those tables will be connected together in various ways to form table relationships. In this chapter we'll explore the reasons for having multiple tables in a database, look at how to define relationships between different tables, and outline the different types of table relationships that can exist.
At this point, our users
table doesn't need to hold that much data for each user in our system. It stores a name for the user, whether their account is enabled or not, when they last logged in, and an id to identify each user record. In reality, the requirements of our application will mean that we need to store a lot more data than that. Our app will be used to manage a library of SQL books and allow users to check out the books and also review them.
To implement some of these requirements we could simply try adding more columns to our users
table; the resulting table might look a little like this:
Wow, that's a lot of information all in one table! There are other issues here as well, such as duplication of data (often referred to as 'redundancy'). For each book that a user checks out, we have to repeat all of the user data in our table. It's a similar story with the book data, if more than one person checks out a book, such as with 'My Second SQL Book', we have to repeat the book title, author, isbn, and published date.
Duplicating data in this way can lead to issues with data integrity. For example what if for one of the 'My Second SQL Book' checkouts the title is entered as 'My 2nd SQL Book' instead, or a typo had been made with the isbn on one of the rows? From looking at the data in the table, how would we know which piece of data is correct?
How do we deal with this situation? The answer is to split our data up across multiple different tables, and create relationships between them. The process of splitting up data in this way to remove duplication and improve data integrity is known as normalization.
Normalization is a deep topic, and there are complex sets of rules which dictate the extent to which a database is judged to be normalized. These rule-sets, known as 'normal forms', are beyond the scope of this book; for now there are two important things to remember:
We know that we want to split the data for our application across multiple tables, but how do we decide what those tables should be and what relationships should exist between them? When answering questions such as these it is often useful to zoom out and think at a higher level of abstraction, and this is where the process of database design comes in.
At a high level, the process of database design involves defining entities to represent different sorts of data and designing relationships between those entities. But what do we mean by entities, and how do different entities relate to each other? Let's find out.
An entity represents a real world object, or a set of data that we want to model within our database; we can often identify these as the major nouns of the system we're modeling. For the purposes of this book we're going to try and keep things simple and draw a direct correlation between an entity and a single table of data; in a real database however, the data for a single entity might be stored in more than one table.
What entities might we define for our SQL Book application?
users
table, and we can think of a user as a specific entity within our app; a 'user' is someone who uses our app.
Now that we have defined the entities we need, we can plan tables to store the data for each entity. Those tables might look something like this:
We're making good progress with our database design. We've decided on the entities we want and have formed a picture of the tables we need, the columns in those tables, and even examples of the data that those columns will hold. There's something missing though, and that's the relationships between our entities.
If we look at the diagram of our five tables, the tables are all isolated and it's not obvious how these tables should relate to each other. Let's simplify our tables a bit and explicitly define some relationships between them.
This diagram shows an abstract representation of our various entities and also the relationships between them, (note: in reality we could imagine that more than one user might share the same address; this structure is intended for illustration purposes). We can think of this diagram as a simple Entity Relationship Diagram, or ERD. An ERD is a graphical representation of entities and their relationships to each other, and is a commonly used tool within database design.
There are different types of ERD varying from conceptual to detailed, and often using specific conventions such as crow's foot notation to model relationships. We won't go into the details of these different types, or the conventions they use, in this book. For now it's useful to simply think of an ERD as any diagram which models relationships between entities.
Okay, so we now know the tables that we need and we've also defined the relationships that should exist between those tables in our ERD, but how do we actually implement those relationships in terms of our table schema? The answer to that is to use keys.
In an earlier section of this book we looked at an aspect of schema called constraints, and explored how constraints act on and work with the data in our database tables. Keys are a special type of constraint used to establish relationships and uniqueness. They can be used to identify a specific row in the current table, or to refer to a specific row in another table. In this chapter we'll look at two types of keys that fulfill these particular roles: Primary Keys, and Foreign Keys.
A necessary part of establishing relationships between two entities or two pieces of data is being able to identify the data correctly. In SQL, uniquely identifying data is critical. A Primary Key is a unique identifier for a row of data.
In order to act as a unique identifier, a column must contain some data, and that data should be unique to each row. If you're thinking that those requirements sound a lot like our NOT NULL
and UNIQUE
constraints, you'd be right; in fact, making a column a PRIMARY KEY
is essentially equivalent to adding NOT NULL
and UNIQUE
constraints to that column.
The id
column in our users
table has both of these constraints, and we've used that column in many of our SELECT
queries in order to uniquely identify rows; we've effectively had id
as a primary key all along although we haven't explicitly set it as the Primary Key. Let's do that now:
ALTER TABLE users ADD PRIMARY KEY (id);
Although any column in a table can have UNIQUE
and NOT NULL
constraints applied to them, each table can have only one Primary Key. It is common practice for that Primary Key to be a column named id
. If you look at the other tables we've defined for our database, most of them have an id
column. While a column of any name can serve as the primary key, using a column named id
is useful for mnemonic reasons and so is a popular convention.
Being able to uniquely identify a row of data in a table via that table's Primary Key column is only half the story when it comes to creating relationships between tables. The other half of this story is the Primary Key's partner, the Foreign Key.
A Foreign Key allows us to associate a row in one table to a row in another table. This is done by setting a column in one table as a Foreign Key and having that column reference another table's Primary Key column. Creating this relationship is done using the REFERENCES
keyword in this form:
FOREIGN KEY (fk_col_name)
REFERENCES target_table_name (pk_col_name);
We'll explore some specific examples of how this is used when we look at setting up various kinds of relationships later in this chapter, but in general terms you can think of this reference as creating a connection between rows in different tables.
Imagine for instance that we have two tables, one called colors
and one called shapes
. The color_id
column of the shapes
table is a Foreign Key which references the id
column of the colors
table.
In the diagram above, the 'Red' row of our colors
table is associated with the 'Square' and 'Star' rows of our shapes
table. Similarly, 'Blue' is associated with 'Triangle' and 'Green' with 'Circle'. 'Orange' isn't currently associated with any row in the shapes
table, but there is the potential to create such an association if we insert a row into shapes
with a color_id
of 3
.
By setting up this reference, we're ensuring the referential integrity of a relationship. Referential integrity guarantees that a given foreign key value references an existing record in the target table; if it doesn't, anan error occurs. In other words, PostgreSQL won't let you add a value to the Foreign Key column of a table if the Primary Key column of the target table doesn't already have that value. We'll discuss this concept in a bit more detail later on.
The specific way in which a Foreign Key is used as part of a table's schema depends on the type of relationship we want to define between our tables. In order to implement that schema correctly it is useful to formally describe the relationships we need to model between our entities:
The entity relationships described above can be classified into three relationship types:
Let's look at them each in turn.
A one-to-one relationship between two entities exists when a particular entity instance exists in one table, and it can have only one associated entity instance in another table.
Example: A user can have only one address, and an address belongs to only one user.
This example is contrived: in the real world, users can have multiple addresses and multiple people can live at the same address.
In the database world, this sort of relationship is implemented like this: the id
that is the PRIMARY KEY
of the users
table is used as both the FOREIGN KEY
and PRIMARY KEY
of the addresses
table.
/*
one-to-one: User has one address
*/
CREATE TABLE addresses (
user_id int, -- Both a primary and foreign key
street varchar(30) NOT NULL,
city varchar(30) NOT NULL,
state varchar(30) NOT NULL,
PRIMARY KEY (user_id),
FOREIGN KEY (user_id)
REFERENCES users (id)
ON DELETE CASCADE
);
Executing the above SQL statement will create an addresses
table, and create a relationship between it and the users
table. Notice the PRIMARY KEY
and FOREIGN KEY
clauses at the end of the CREATE
statement. These two clauses make the user_id
both the Primary key of the addresses
table and the Foreign key that references the users
table.
Let's go ahead and add some data to our table.
INSERT INTO addresses
(user_id, street, city, state)
VALUES (1, '1 Market Street', 'San Francisco', 'CA'),
(2, '2 Elm Street', 'San Francisco', 'CA'),
(3, '3 Main Street', 'Boston', 'MA');
The user_id
column uses values that exist in the id
column of the users
table in order to connect the tables through the foreign key constraint we just created.
We're going to take a slight detour here to discuss a topic that's extremely important when dealing with table relationships: referential integrity. This is a concept used when discussing relational data to indicate that that table relationships must be consistent. Different RDBMSes might enforce referential integrity rules differently, but the concept is the same.
The constraints we've defined for our addresses
table enforce the one-to-one relationship we want between it and our users
table. A user can only have one address and an address must have one, and only one, user. This is an example of referential integrity. Let's demonstrate how this works.
What happens if we try to add another address for a user who already has one?
INSERT INTO addresses (user_id, street, city, state)
VALUES (1, '2 Park Road', 'San Francisco', 'CA');
ERROR: duplicate key value violates unique constraint "addresses_pkey"
DETAIL: Key (user_id)=(1) already exists.
The error above occurs because we are trying to insert a value 1
into the user_id
column when such a value already exists in that column. The UNIQUE
constraint on the column prevents us from doing so.
How about if we try to add an address for a user who doesn't exist?
INSERT INTO addresses (user_id, street, city, state)
VALUES (7, '11 Station Road', 'Portland', 'OR');
ERROR: insert or update on table "addresses" violates foreign key constraint "addresses_user_id_fkey"
DETAIL: Key (user_id)=(7) is not present in table "users".
Here we get a different error. The FOREIGN KEY
constraint on the user_id
column prevents us from adding the value 7
to that column because that value is not present in the id
column of the users
table.
If you're wondering why we can add a user without an address but can't add an address without a user, this is due to the modality of the relationship between the two entities. Don't worry about exactly what this means for now, just think of it as another aspect of entity relationships.
You might have noticed in the table creation statement for our addresses
table, the FOREIGN KEY
definition included a clause which read ON DELETE CASCADE
. Adding this clause, and setting it to CASCADE
basically means that if the row being referenced is deleted, the row referencing it is also deleted. There are alternatives to CASCADE
such as SET NULL
or SET DEFAULT
which instead of deleting the referencing row will set a new value in the appropriate column for that row.
Determining what to do in situations where you delete a row that is referenced by another row is an important design decision. It has a direct impact on maintaining referential integrity. If we don't set such clauses we leave the decision of what to do up to the RDBMS we are using. In the case of PostgreSQL, if we try to delete a row that is referenced by a row in another table and we have no ON DELETE
clause for that reference, PostgreSQL throws an error.
Okay, time to get back to our different table relationship types with a look at one-to-many. A one-to-many relationship exists between two entities if an entity instance in one of the tables can be associated with multiple records (entity instances) in the other table. The opposite relationship does not exist; that is, each entity instance in the second table can only be associated with one entity instance in the first table.
Example: A review belongs to only one book. A book has many reviews.
Let's set up the necessary data. First let's create our tables
CREATE TABLE books (
id serial,
title varchar(100) NOT NULL,
author varchar(100) NOT NULL,
published_date timestamp NOT NULL,
isbn char(12),
PRIMARY KEY (id),
UNIQUE (isbn)
);
/*
one-to-many: Book has many reviews
*/
CREATE TABLE reviews (
id serial,
book_id integer NOT NULL,
reviewer_name varchar(255),
content varchar(255),
rating integer,
published_date timestamp DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
FOREIGN KEY (book_id)
REFERENCES books(id)
ON DELETE CASCADE
);
These table creation statements for our books
and reviews
tables are fairly similar to our previous example. There's a key difference worth pointing out in the statement for our reviews
table however:
addresses
table, the PRIMARY KEY
and FOREIGN KEY
reference different columns, id
and book_id
respectively. This means that the FOREIGN KEY
column, book_id
is not bound by the UNIQUE
constraint of our PRIMARY KEY
and so the same value from the id
column of the books
table can appear in this column more than once. In other words a book can have many reviews.
Note that the foreign key book_id
has a NOT NULL
constraint. In general, foreign keys in one-to-many relationships should not allow NULL
. In this case, it makes no sense to have a review that isn't tied to a book.
Now that we have created our books
and reviews
tables, let's add some data to them.
INSERT INTO books
(id, title, author, published_date, isbn)
VALUES
(1, 'My First SQL Book', 'Mary Parker',
'2012-02-22 12:08:17.320053-03',
'981483029127'),
(2, 'My Second SQL Book', 'John Mayer',
'1972-07-03 09:22:45.050088-07',
'857300923713'),
(3, 'My Third SQL Book', 'Cary Flint',
'2015-10-18 14:05:44.547516-07',
'523120967812');
INSERT INTO reviews
(id, book_id, reviewer_name, content, rating,
published_date)
VALUES
(1, 1, 'John Smith', 'My first review', 4,
'2017-12-10 05:50:11.127281-02'),
(2, 2, 'John Smith', 'My second review', 5,
'2017-10-13 15:05:12.673382-05'),
(3, 2, 'Alice Walker', 'Another review', 1,
'2017-10-22 23:47:10.407569-07');
The order in which we add the data is important here. Since a column in reviews
references data in books
we must first ensure that the data exists in the books
table for us to reference.
Just as with the users
/ addresses
relationship, the FOREIGN KEY
creates relationships between the reviews
table and the books
table. Unlike the users
/ addresses
relationship however, both books and users can have multiple reviews. For example the id
value of 2
for My Second SQL Book
appears twice in the book_id
column of the reviews
table.
In a real database our reviews
table would probably also have a Foreign Key reference to the id
column in users
table rather than have user type data directly in a reviewer_name
column. We set up the table in this way for our example because we wanted to focus on the one-to-many relationship type. If we had added such a Foreign Key to reviews
we'd effectively be setting up a many-to-many relationship between books
and users
, which is what we'll look at next.
A many-to-many relationship exists between two entities if for one entity instance there may be multiple records in the other table, and vice versa.
Example: A user can check out many books. A book can be checked out by many users (over time).
There isn't a way to implement a many-to-many relationship between two tables directly. Instead, we break apart this many-to-many relationship into two one-to-many relationships using a third, cross-reference, table (also known as a join table). This table holds the relationship between the two entities, by having two FOREIGN KEY
s, each of which references the PRIMARY KEY of one of the tables for which we want to create this relationship. We already have our books
and users
tables, so we just need to create the cross-reference table: checkouts
.
Here, the user_id
column in checkouts
references the id
column in users
, and the book_id
column in checkouts
references the id
column in books
. Each row of the checkouts
table uses these two Foreign Keys to create an association between rows of users
and books
.
We can see on the first row of checkouts
, the user with an id
of 1
is associated with the book with an id
of 1
. On the second row, the same user is also associated with the book with an id
of 2
. On the third row, a different user, with an id
of 2
is associated with the same book from the previous row. On the fourth row, the user with an id
of 5
is associated with the book with an id
of 3
.
Don't worry if you don't completely understand this right away. Shortly, we'll expand on what these associations look like in terms of the data in users
and books
. First, though, let's create our checkouts
table and add some data to it.
CREATE TABLE checkouts (
id serial,
user_id int NOT NULL,
book_id int NOT NULL,
checkout_date timestamp,
return_date timestamp,
PRIMARY KEY (id),
FOREIGN KEY (user_id) REFERENCES users(id)
ON DELETE CASCADE,
FOREIGN KEY (book_id) REFERENCES books(id)
ON DELETE CASCADE
);
You may have noticed that our table contains a couple of other columns checkout_date
and return_date
. While these aren't necessary to create the relationship between the users
and books
table, they can provide additional context to that relationship. Attributes like a checkout date or return date don't pertain specifically to users or specifically to books, but to the association between a user and a book.
This kind of additional context can be useful within the business logic of the application using our database. For example, in order to prevent more than one user trying to check out the same book at the same time, the app could determine which books are currently checked out by querying those that have a value in the checkout_date
column of the checkouts
table but where the return_date
is set to NULL
.
As with one-to-many relationships, the foreign keys in many-to-many relationships should not allow NULL
entries. Thus, both user_id
and book_id
have the NOT NULL
constraint. In this case, it makes no sense to have a checkout row for a user if they haven't checked out a specific book, nor does it make sense to have a row for a book that hasn't been checked out by a specific user.
Now that we have our checkouts
created, we can add the data that will create the associations between the rows in users
and books
.
INSERT INTO checkouts
(id, user_id, book_id, checkout_date, return_date)
VALUES
(1, 1, 1, '2017-10-15 14:43:18.095143-07',
NULL),
(2, 1, 2, '2017-10-05 16:22:44.593188-07',
'2017-10-13 13:0:12.673382-05'),
(3, 2, 2, '2017-10-15 11:11:24.994973-07',
'2017-10-22 17:47:10.407569-07'),
(4, 5, 3, '2017-10-15 09:27:07.215217-07',
NULL);
Let's have a look at what this data looks like in terms of the relationships between the tables.
Here we can see that the id
value of 1
from the users
table for 'John Smith' appears twice in the user_id
column of checkouts
, but alongside different values for book_id
(1
and 2
); this satisfies the 'a user can check out many books' part of the relationship. Similarly we can see that id
value of 2
from the books
table for 'My Second SQL Book' appears twice in the book_id
column of checkouts
, alongside different values for user_id
(1
and 2
); this satisfies the 'a book can be checked out by many users' part of the relationship.
We can perhaps think of a many-to-many relationship as combining two one-to-many relationships; in this case between checkouts
and users
, and between checkouts
and books
.
In this chapter we covered a number of different topics regarding table relationships:
To recap, here is a list of common relationships that you'll encounter when working with SQL:
Relationship | Example |
---|---|
one-to-one | A User has ONE address |
one-to-many | A Book has MANY reviews |
many-to-many | A user has MANY books and a book has MANY users |
Earlier in this book we looked at how to query data in a database table using SELECT
. Now that our data is split across multiple tables, how can we structure our queries if we need data from more than one table at the same time? In order to do this, we need to join our tables together. In the next chapter we'll explore how to do exactly that by introducing another SQL keyword, JOIN
.