More on Select

As we stated earlier, the SELECT command is one of the most important and frequently used features of the SQL language. It is also very powerful. In the earlier chapters we discovered a few of the features we get with SELECT -- namely, selecting specific columns, selecting specific rows by filtering in various ways, and ordering the results of our queries. In this chapter we'll look at some more of the features available within the syntax of the SELECT statement.

We'll introduce how to use functions to process data. We'll also explore how data can be grouped together based on various criteria. First of all though, let's look at how we can further filter our data by adding LIMIT, OFFSET, and DISTINCT clauses to our queries.

LIMIT and OFFSET

The SELECT statements we've used so far have returned all of the results that match the conditions of our statement. That's not a problem for our users table, since we only have a few rows of data in it. When working with large datasets, a common requirement is to only display one portion of the data at a time.

Displaying portions of data as separate 'pages' is a user interface pattern used in many web applications, generally referred to as 'pagination'. An example of this can be seen in the Launch School forum pages, where twelve forum posts are displayed on the first 'page' and you need to navigate to the next 'page' to see the next twelve.

pagination example

The LIMIT and OFFSET clauses of SELECT are the base on which pagination is built. Let's look at how it works.

Say we are outputting the details of all of our users to a user admin page of our web application. We perhaps use this page to manage individual users and so only want to display one user at a time. We can do this by adding a LIMIT clause to our SELECT query:

SELECT * FROM users LIMIT 1;

Setting the limit to 1 lets us select a single row, the first row, from our table.

LIMIT example

Executing this query should produce the following result:

 id | full_name  | enabled |         last_login
----+------------+---------+----------------------------
  1 | John Smith | f       | 2017-10-25 10:26:10.015152
(1 row)

If we want to return the second row instead, we need a way to skip the first row when querying data.

LIMIT and OFFSET example

We can do this with the OFFSET clause. This clause will skip the first N rows. Let's give it a try.

SELECT * FROM users LIMIT 1 OFFSET 1;
 id |  full_name   | enabled |         last_login
----+--------------+---------+----------------------------
  2 | Jane Smith   | t       | 2017-10-25 10:26:50.295461
(1 row)

Here we still only return one row, due to the LIMIT clause, but since we have OFFSET our query by 1, it is the second row that is returned instead of the first. For each row we want to skip, we can increment the value of offset by 1.

If we want to return more than one result per page, we can adjust the values of LIMIT and OFFSET accordingly. For example, we could imagine the query for page 2 of the Launch School general forum looking something like this:

SELECT topic, author, publish_date, category,
       replies_count, likes_count, last_activity_date
    FROM posts
    LIMIT 12
    OFFSET 12;

As well as specific use cases such as pagination, LIMIT can also be useful in development when testing our queries. We can use LIMIT to get a preview or taste of what data is available or would be returned rather than returning the entire dataset. This is especially useful during development when forming your queries and getting an understanding of the dataset and data quality.

On the subject of data quality issues, a common one that you might encounter is duplicate data in your tables. There are situations where you might need to formulate queries in a certain way in order to deal with duplication. One way to do that is with the DISTINCT clause, which we'll look at next.

DISTINCT

Sometimes duplicate data is unavoidable. For example, we might get duplication when joining more than one table together. We'll delve into working with multiple tables later in this book, for now let's look at a way to deal with duplication by using the DISTINCT qualifier. To demonstrate how this works let's first add some duplicate data to our users table.

INSERT INTO users (id, full_name)
           VALUES (4, 'Harry Potter');
INSERT INTO users (id, full_name)
           VALUES (5, 'Jane Smith');
id |  full_name   | enabled |         last_login
----+--------------+---------+----------------------------
 1 | John Smith   | f       | 2017-10-25 10:26:10.015152
 2 | Jane Smith   | t       | 2017-10-25 10:26:50.295461
 3 | Harry Potter | t       | 2017-10-25 10:26:50.295461
 4 | Harry Potter | t       | 2017-10-25 10:36:38.188286
 5 | Jane Smith   | t       | 2017-10-25 10:36:43.324015
(5 rows)

If we select the full_name column from our table, we get five rows back, two of which contain duplicate names:

SELECT full_name FROM users;
 full_name
--------------
 John Smith
 Jane Smith
 Harry Potter
 Harry Potter
 Jane Smith
(5 rows)

We can use DISTINCT as part of our SELECT query to only return distinct, or unique, values. Try running the following query:

SELECT DISTINCT full_name FROM users;
 full_name
--------------
 John Smith
 Jane Smith
 Harry Potter
(3 rows)

Here we only get three rows back, since there are only three distinct values in our full_name column.

SELECT DISTINCT example

DISTINCT can be useful when used in conjunction with SQL functions. Compare, for example, the results of the following two queries using the count() function:

SELECT count(full_name) FROM users;
 count
-------
     5
(1 row)
SELECT count(DISTINCT full_name) FROM users;
 count
-------
     3
(1 row)

The first query only really tells us how many rows of data we have in our table. The second one tells us how many unique names we have in the full_name column.

We haven't really looked at SQL functions yet, so let's do that now.

Functions

Functions are a way of working with data in SQL that may seem a little more familiar if you're coming to SQL from a programming language such as Ruby or JavaScript. Functions are a set of commands included as part of the RDBMS that perform particular operations on fields or data. Some functions provide data transformations that can be applied before returning results. Others simply return information on the operations carried out.

These functions can generally be grouped into different types. Some of the most commonly used types of functions are:

  1. String
  2. Date/Time
  3. Aggregate

We're not going to go into detail here on each function that exists for every function type. The most important thing is to know what types of functions exist and have a general understanding of how and when to use them. When you need to perform a specific operation, you can always look up a function for that operation in the documentation.

String Functions

String Functions, as their name suggests, perform some sort of operation on values whose data type is String. Some examples are:

Function Example Notes
length SELECT length(full_name) FROM users; This returns the length of every user's name. You could also use length in a WHERE clause to filter data based on name length.
trim SELECT trim(leading ' ' from full_name) FROM users; If any of the data in our full_name column had a space in front of the name, using the trim function like this would remove that leading space.

Date/ Time Functions

Just as string functions perform operations on String data, Date/ Time functions, for the most part, perform operations on date and time data. Many of the date/ time functions take time or timestamp inputs. Our last_login column, for example, has a data type of timestamp and so data in that column can act as an argument to such functions:

Function Example Notes
date_part SELECT full_name, date_part('year', last_login) FROM users; date_part allows us to view a table that only contains a part of a user's timestamp that we specify. The example query allows us to see each user's name along with the year of the last_login date. Sometimes having date/time data down to the second isn't needed.
age SELECT full_name, age(last_login) FROM users; The age function, when passed a single timestamp as an argument, calculates the time elapsed between that timestamp and the current time. The example query allows us to see how long it has been since each user last logged in.

Aggregate Functions

Aggregate functions perform aggregation; that is, they compute a single result from a set of input values. We briefly looked at one of these, count, a little earlier in this chapter.

Function Example Notes
count SELECT count(id) FROM users; Returns the number of values in the column passed in as an argument. This type of function can be very useful depending on the context. We could find the number of users who have enabled an account, or even how many users have certain last names if we use the above statement with other clauses.
sum SELECT sum(id) FROM users; Not to be confused with count. This sums numeric type values for all of the selected rows and returns the total.
min SELECT min(last_login) FROM users; This returns the lowest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string.
max SELECT max(last_login) FROM users; This returns the highest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string.
avg SELECT avg(id) FROM users; Returns the average (arithmetic mean) of numeric type values for all of the selected rows.

Currently the count function in our table above isn't all that useful. It would simply count how many values we have in the id column; since we have five rows in our table, we have five values in the id column, so the result would be 5. You can quickly execute that statement just to check:

SELECT count(id) FROM users;
 count
-------
     5
(1 row)

Aggregate functions really start to be useful when grouping table rows together. The way we do that is by using the GROUP BY clause.

GROUP BY

Sometimes you need to combine data results together to form more meaningful information. Let's say we wanted to implement one of the examples mentioned earlier, and count the number of users who have accounts that are or are not enabled. How would we approach this?

We could try using a WHERE clause to first filter the results to rows where the enabled column has a value of true, and then use the count function to count the values in the id column (which is equal to the number of rows, we could actually use any column here).

SELECT count(id) FROM users WHERE enabled = true;
 count
-------
     4
(1 row)

This only shows the number of rows where the value of enabled is true though; if we want to know how many rows there are where enabled is false, we have to change our condition to WHERE enabled = false.

What if we want to show the counts for true and false at the same time? What we need is some way to split up the true and false rows into two separate groups and then count the rows in each group.

GROUP BY example

The GROUP BY clause allows us to do exactly that:

SELECT enabled, count(id) FROM users GROUP BY enabled;
 enabled | count
---------+-------
 f       |     1
 t       |     4
(2 rows)

Although we don't need to include enabled in our column list in order to use it in our GROUP BY clause, doing so makes our output more meaningful as we can clearly see the correlation between the count and the values in enabled.

One thing to be aware of when using aggregate functions, is that if you include columns in the column list alongside the function, then those columns must also be included in a GROUP BY clause. For example, the following statement returns an error:

SELECT enabled, full_name, count(id)
       FROM users
       GROUP BY enabled;   -- full_name is not grouped
ERROR:  column "users.full_name" must appear in the GROUP BY clause or be used in an aggregate function

The PostgreSQL documentation says:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

Of particular interest here is that all of the columns in the select list must either be included in the GROUP BY clause, be the result of an aggregate function, or the GROUP BY clause must be based on the primary key. This requirement ensures that there is a single value for every column in the result set. This rule can be a bit difficult to work with at first. However, practice using GROUP BY will gradually lead to a working understanding.

Ian Austin, a Launch School student, wrote an excellent article on the GROUP BY clause and aggregate functions. If you're having trouble understanding GROUP BY, be sure to read this article - it's worth your time.

Summary

In this chapter we've built on our knowledge of SELECT, and looked at a number of different ways we can make our SELECT queries more flexible and powerful:

  • Returning portions of a dataset using LIMIT and OFFSET
  • Returning unique values using DISTINCT
  • Using SQL functions to work with data in various ways
  • Aggregating data using GROUP BY

Thus far in this book we've looked at a lot of the things that you need to know in order to work with a single table in a database; from creating, altering, or even deleting the table itself, through to adding data and then querying that data in a number of different ways. There's one thing we haven't looked at yet, and that's changing data that already exists in your table.

In the next chapter we'll do exactly that by looking at the UPDATE and DELETE commands that will let us modify existing data in our table.