As we stated earlier, the SELECT
command is one of the most important and frequently used features of the SQL language. It is also very powerful. In the earlier chapters we discovered a few of the features we get with SELECT
-- namely, selecting specific columns, selecting specific rows by filtering in various ways, and ordering the results of our queries. In this chapter we'll look at some more of the features available within the syntax of the SELECT
statement.
We'll introduce how to use functions to process data. We'll also explore how data can be grouped together based on various criteria. First of all though, let's look at how we can further filter our data by adding LIMIT
, OFFSET
, and DISTINCT
clauses to our queries.
The SELECT
statements we've used so far have returned all of the results that match the conditions of our statement. That's not a problem for our users
table, since we only have a few rows of data in it. When working with large datasets, a common requirement is to only display one portion of the data at a time.
Displaying portions of data as separate 'pages' is a user interface pattern used in many web applications, generally referred to as 'pagination'. An example of this can be seen in the Launch School forum pages, where twelve forum posts are displayed on the first 'page' and you need to navigate to the next 'page' to see the next twelve.
The LIMIT
and OFFSET
clauses of SELECT
are the base on which pagination is built. Let's look at how it works.
Say we are outputting the details of all of our users to a user admin page of our web application. We perhaps use this page to manage individual users and so only want to display one user at a time. We can do this by adding a LIMIT
clause to our SELECT
query:
SELECT * FROM users LIMIT 1;
Setting the limit to 1
lets us select a single row, the first row, from our table.
Executing this query should produce the following result:
id | full_name | enabled | last_login
----+------------+---------+----------------------------
1 | John Smith | f | 2017-10-25 10:26:10.015152
(1 row)
If we want to return the second row instead, we need a way to skip the first row when querying data.
We can do this with the OFFSET
clause. This clause will skip the first N rows. Let's give it a try.
SELECT * FROM users LIMIT 1 OFFSET 1;
id | full_name | enabled | last_login
----+--------------+---------+----------------------------
2 | Jane Smith | t | 2017-10-25 10:26:50.295461
(1 row)
Here we still only return one row, due to the LIMIT
clause, but since we have OFFSET
our query by 1
, it is the second row that is returned instead of the first. For each row we want to skip, we can increment the value of offset by 1.
If we want to return more than one result per page, we can adjust the values of LIMIT
and OFFSET
accordingly. For example, we could imagine the query for page 2 of the Launch School general forum looking something like this:
SELECT topic, author, publish_date, category,
replies_count, likes_count, last_activity_date
FROM posts
LIMIT 12
OFFSET 12;
As well as specific use cases such as pagination, LIMIT
can also be useful in development when testing our queries. We can use LIMIT
to get a preview or taste of what data is available or would be returned rather than returning the entire dataset. This is especially useful during development when forming your queries and getting an understanding of the dataset and data quality.
On the subject of data quality issues, a common one that you might encounter is duplicate data in your tables. There are situations where you might need to formulate queries in a certain way in order to deal with duplication. One way to do that is with the DISTINCT
clause, which we'll look at next.
Sometimes duplicate data is unavoidable. For example, we might get duplication when joining more than one table together. We'll delve into working with multiple tables later in this book, for now let's look at a way to deal with duplication by using the DISTINCT
qualifier. To demonstrate how this works let's first add some duplicate data to our users
table.
INSERT INTO users (id, full_name)
VALUES (4, 'Harry Potter');
INSERT INTO users (id, full_name)
VALUES (5, 'Jane Smith');
id | full_name | enabled | last_login
----+--------------+---------+----------------------------
1 | John Smith | f | 2017-10-25 10:26:10.015152
2 | Jane Smith | t | 2017-10-25 10:26:50.295461
3 | Harry Potter | t | 2017-10-25 10:26:50.295461
4 | Harry Potter | t | 2017-10-25 10:36:38.188286
5 | Jane Smith | t | 2017-10-25 10:36:43.324015
(5 rows)
If we select the full_name
column from our table, we get five rows back, two of which contain duplicate names:
SELECT full_name FROM users;
full_name
--------------
John Smith
Jane Smith
Harry Potter
Harry Potter
Jane Smith
(5 rows)
We can use DISTINCT
as part of our SELECT
query to only return distinct, or unique, values. Try running the following query:
SELECT DISTINCT full_name FROM users;
full_name
--------------
John Smith
Jane Smith
Harry Potter
(3 rows)
Here we only get three rows back, since there are only three distinct values in our full_name
column.
DISTINCT
can be useful when used in conjunction with SQL functions. Compare, for example, the results of the following two queries using the count()
function:
SELECT count(full_name) FROM users;
count
-------
5
(1 row)
SELECT count(DISTINCT full_name) FROM users;
count
-------
3
(1 row)
The first query only really tells us how many rows of data we have in our table. The second one tells us how many unique names we have in the full_name
column.
We haven't really looked at SQL functions yet, so let's do that now.
Functions are a way of working with data in SQL that may seem a little more familiar if you're coming to SQL from a programming language such as Ruby or JavaScript. Functions are a set of commands included as part of the RDBMS that perform particular operations on fields or data. Some functions provide data transformations that can be applied before returning results. Others simply return information on the operations carried out.
These functions can generally be grouped into different types. Some of the most commonly used types of functions are:
We're not going to go into detail here on each function that exists for every function type. The most important thing is to know what types of functions exist and have a general understanding of how and when to use them. When you need to perform a specific operation, you can always look up a function for that operation in the documentation.
String Functions, as their name suggests, perform some sort of operation on values whose data type is String. Some examples are:
Function | Example | Notes |
---|---|---|
length |
SELECT length(full_name) FROM users; |
This returns the length of every user's name. You could also use length in a WHERE clause to filter data based on name length. |
trim |
SELECT trim(leading ' ' from full_name) FROM users; |
If any of the data in our full_name column had a space in front of the name, using the trim function like this would remove that leading space. |
Just as string functions perform operations on String data, Date/ Time functions, for the most part, perform operations on date and time data. Many of the date/ time functions take time or timestamp inputs. Our last_login
column, for example, has a data type of timestamp
and so data in that column can act as an argument to such functions:
Function | Example | Notes |
---|---|---|
date_part |
SELECT full_name, date_part('year', last_login) FROM users; |
date_part allows us to view a table that only contains a part of a user's timestamp that we specify. The example query allows us to see each user's name along with the year of the last_login date. Sometimes having date/time data down to the second isn't needed. |
age |
SELECT full_name, age(last_login) FROM users; |
The age function, when passed a single timestamp as an argument, calculates the time elapsed between that timestamp and the current time. The example query allows us to see how long it has been since each user last logged in. |
Aggregate functions perform aggregation; that is, they compute a single result from a set of input values. We briefly looked at one of these, count
, a little earlier in this chapter.
Function | Example | Notes |
---|---|---|
count |
SELECT count(id) FROM users; |
Returns the number of values in the column passed in as an argument. This type of function can be very useful depending on the context. We could find the number of users who have enabled an account, or even how many users have certain last names if we use the above statement with other clauses. |
sum |
SELECT sum(id) FROM users; |
Not to be confused with count . This sums numeric type values for all of the selected rows and returns the total. |
min |
SELECT min(last_login) FROM users; |
This returns the lowest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string. |
max |
SELECT max(last_login) FROM users; |
This returns the highest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string. |
avg |
SELECT avg(id) FROM users; |
Returns the average (arithmetic mean) of numeric type values for all of the selected rows. |
Currently the count
function in our table above isn't all that useful. It would simply count how many values we have in the id
column; since we have five rows in our table, we have five values in the id
column, so the result would be 5
. You can quickly execute that statement just to check:
SELECT count(id) FROM users;
count
-------
5
(1 row)
Aggregate functions really start to be useful when grouping table rows together. The way we do that is by using the GROUP BY
clause.
Sometimes you need to combine data results together to form more meaningful information. Let's say we wanted to implement one of the examples mentioned earlier, and count the number of users who have accounts that are or are not enabled. How would we approach this?
We could try using a WHERE
clause to first filter the results to rows where the enabled
column has a value of true
, and then use the count
function to count the values in the id
column (which is equal to the number of rows, we could actually use any column here).
SELECT count(id) FROM users WHERE enabled = true;
count
-------
4
(1 row)
This only shows the number of rows where the value of enabled
is true
though; if we want to know how many rows there are where enabled
is false
, we have to change our condition to WHERE enabled = false
.
What if we want to show the counts for true
and false
at the same time? What we need is some way to split up the true
and false
rows into two separate groups and then count the rows in each group.
The GROUP BY
clause allows us to do exactly that:
SELECT enabled, count(id) FROM users GROUP BY enabled;
enabled | count
---------+-------
f | 1
t | 4
(2 rows)
Although we don't need to include enabled
in our column list in order to use it in our GROUP BY
clause, doing so makes our output more meaningful as we can clearly see the correlation between the count and the values in enabled
.
One thing to be aware of when using aggregate functions, is that if you include columns in the column list alongside the function, then those columns must also be included in a GROUP BY
clause. For example, the following statement returns an error:
SELECT enabled, full_name, count(id)
FROM users
GROUP BY enabled; -- full_name is not grouped
ERROR: column "users.full_name" must appear in the GROUP BY clause or be used in an aggregate function
The PostgreSQL documentation says:
When
GROUP BY
is present, or any aggregate functions are present, it is not valid for theSELECT
list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
Of particular interest here is that all of the columns in the select list must either be included in the GROUP BY
clause, be the result of an aggregate function, or the GROUP BY
clause must be based on the primary key. This requirement ensures that there is a single value for every column in the result set. This rule can be a bit difficult to work with at first. However, practice using GROUP BY
will gradually lead to a working understanding.
Ian Austin, a Launch School student, wrote an excellent article on the GROUP BY
clause and aggregate functions. If you're having trouble understanding GROUP BY
, be sure to read this article - it's worth your time.
In this chapter we've built on our knowledge of SELECT
, and looked at a number of different ways we can make our SELECT
queries more flexible and powerful:
LIMIT
and OFFSET
DISTINCT
GROUP BY
Thus far in this book we've looked at a lot of the things that you need to know in order to work with a single table in a database; from creating, altering, or even deleting the table itself, through to adding data and then querying that data in a number of different ways. There's one thing we haven't looked at yet, and that's changing data that already exists in your table.
In the next chapter we'll do exactly that by looking at the UPDATE
and DELETE
commands that will let us modify existing data in our table.