Open Bookshelf Cover Page
Preface
Getting Started
Your First Database: Schema
Your First Database: Data
Working With Multiple Tables
- Create Multiple Tables
- SQL Joins
Conclusion
- Summary and Additional Resources
Share on

More on Select

As we stated earlier, the SELECT command is one of the most important and frequently used features of the SQL language. It is also very powerful. In the earlier chapters we discovered a few of the features we get with SELECT -- namely, selecting specific columns, selecting specific rows by filtering in various ways, and ordering the results of our queries. In this chapter we'll look at some more of the features available within the syntax of the SELECT statement.

We'll introduce how to use functions to process data. We'll also explore how data can be grouped together based on various criteria. First of all though, let's look at how we can further filter our data by adding LIMIT, OFFSET, and DISTINCT clauses to our queries.

LIMIT and OFFSET

The SELECT statements we've used so far have returned all of the results that match the conditions of our statement. That's not a problem for our users table, since we only have a few rows of data in it. When working with large datasets, a common requirement is to only display one portion of the data at a time.

Displaying portions of data as separate 'pages' is a user interface pattern used in many web applications, generally referred to as 'pagination'. An example of this can be seen in the Launch School forum pages, where twelve forum posts are displayed on the first 'page' and you need to navigate to the next 'page' to see the next twelve.

pagination example

The LIMIT and OFFSET clauses of SELECT are the base on which pagination is built. Let's look at how it works.

Say we are outputting the details of all of our users to a user admin page of our web application. We perhaps use this page to manage individual users and so only want to display one user at a time. We can do this by adding a LIMIT clause to our SELECT query:

SELECT * FROM users LIMIT 1;

Setting the limit to 1 lets us select a single row, the first row, from our table.

LIMIT example

Executing this query should produce the following result:

 id | full_name  | enabled |         last_login
----+------------+---------+----------------------------
  1 | John Smith | f       | 2017-10-25 10:26:10.015152
(1 row)

If we want to return the second row instead, we need a way to skip the first row when querying data.

LIMIT and OFFSET example

We can do this with the OFFSET clause. This clause will skip the first N rows. Let's give it a try.

SELECT * FROM users LIMIT 1 OFFSET 1;

 id |  full_name   | enabled |         last_login
----+--------------+---------+----------------------------
  2 | Jane Smith   | t       | 2017-10-25 10:26:50.295461
(1 row)

Here we still only return one row, due to the LIMIT clause, but since we have OFFSET our query by 1, it is the second row that is returned instead of the first. For each row we want to skip, we can increment the value of offset by 1.

If we want to return more than one result per page, we can adjust the values of LIMIT and OFFSET accordingly. For example, we could imagine the query for page 2 of the Launch School general forum looking something like this:

SELECT topic, author, publish_date, category,
       replies_count, likes_count, last_activity_date
    FROM posts
    LIMIT 12
    OFFSET 12;

As well as specific use cases such as pagination, LIMIT can also be useful in development when testing our queries. We can use LIMIT to get a preview or taste of what data is available or would be returned rather than returning the entire dataset. This is especially useful during development when forming your queries and getting an understanding of the dataset and data quality.

On the subject of data quality issues, a common one that you might encounter is duplicate data in your tables. There are situations where you might need to formulate queries in a certain way in order to deal with duplication. One way to do that is with the DISTINCT clause, which we'll look at next.

DISTINCT

Sometimes duplicate data is unavoidable. For example, we might get duplication when joining more than one table together. We'll delve into working with multiple tables later in this book, for now let's look at a way to deal with duplication by using the DISTINCT qualifier. To demonstrate how this works let's first add some duplicate data to our users table.

INSERT INTO users (id, full_name)
           VALUES (4, 'Harry Potter');
INSERT INTO users (id, full_name)
           VALUES (5, 'Jane Smith');

id |  full_name   | enabled |         last_login
----+--------------+---------+----------------------------
 1 | John Smith   | f       | 2017-10-25 10:26:10.015152
 2 | Jane Smith   | t       | 2017-10-25 10:26:50.295461
 3 | Harry Potter | t       | 2017-10-25 10:26:50.295461
 4 | Harry Potter | t       | 2017-10-25 10:36:38.188286
 5 | Jane Smith   | t       | 2017-10-25 10:36:43.324015
(5 rows)

If we select the full_name column from our table, we get five rows back, two of which contain duplicate names:

SELECT full_name FROM users;

 full_name
--------------
 John Smith
 Jane Smith
 Harry Potter
 Harry Potter
 Jane Smith
(5 rows)

We can use DISTINCT as part of our SELECT query to only return distinct, or unique, values. Try running the following query:

SELECT DISTINCT full_name FROM users;

 full_name
--------------
 John Smith
 Jane Smith
 Harry Potter
(3 rows)

Here we only get three rows back, since there are only three distinct values in our full_name column.

SELECT DISTINCT example

DISTINCT can be useful when used in conjunction with SQL functions. Compare, for example, the results of the following two queries using the count() function:

SELECT count(full_name) FROM users;

 count
-------
     5
(1 row)

SELECT count(DISTINCT full_name) FROM users;

 count
-------
     3
(1 row)

The first query only really tells us how many rows of data we have in our table. The second one tells us how many unique names we have in the full_name column.

We haven't really looked at SQL functions yet, so let's do that now.

Functions

Functions are a way of working with data in SQL that may seem a little more familiar if you're coming to SQL from a programming language such as Ruby or JavaScript. Functions are a set of commands included as part of the RDBMS that perform particular operations on fields or data. Some functions provide data transformations that can be applied before returning results. Others simply return information on the operations carried out.

These functions can generally be grouped into different types. Some of the most commonly used types of functions are:

String
Date/Time
Aggregate

We're not going to go into detail here on each function that exists for every function type. The most important thing is to know what types of functions exist and have a general understanding of how and when to use them. When you need to perform a specific operation, you can always look up a function for that operation in the documentation.

String Functions

String Functions, as their name suggests, perform some sort of operation on values whose data type is String. Some examples are:

Function	Example	Notes
`length`	`SELECT length(full_name) FROM users;`	This returns the length of every user's name. You could also use `length` in a `WHERE` clause to filter data based on name length.
`trim`	`SELECT trim(leading ' ' from full_name) FROM users;`	If any of the data in our `full_name` column had a space in front of the name, using the `trim` function like this would remove that leading space.

Date/ Time Functions

Just as string functions perform operations on String data, Date/ Time functions, for the most part, perform operations on date and time data. Many of the date/ time functions take time or timestamp inputs. Our last_login column, for example, has a data type of timestamp and so data in that column can act as an argument to such functions:

Function	Example	Notes
`date_part`	`SELECT full_name, date_part('year', last_login) FROM users;`	`date_part` allows us to view a table that only contains a part of a user's timestamp that we specify. The example query allows us to see each user's name along with the year of the `last_login` date. Sometimes having date/time data down to the second isn't needed.
`age`	`SELECT full_name, age(last_login) FROM users;`	The `age` function, when passed a single `timestamp` as an argument, calculates the time elapsed between that timestamp and the current time. The example query allows us to see how long it has been since each user last logged in.

Aggregate Functions

Aggregate functions perform aggregation; that is, they compute a single result from a set of input values. We briefly looked at one of these, count, a little earlier in this chapter.

Function	Example	Notes
`count`	`SELECT count(id) FROM users;`	Returns the number of values in the column passed in as an argument. This type of function can be very useful depending on the context. We could find the number of users who have enabled an account, or even how many users have certain last names if we use the above statement with other clauses.
`sum`	`SELECT sum(id) FROM users;`	Not to be confused with `count`. This sums numeric type values for all of the selected rows and returns the total.
`min`	`SELECT min(last_login) FROM users;`	This returns the lowest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string.
`max`	`SELECT max(last_login) FROM users;`	This returns the highest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string.
`avg`	`SELECT avg(id) FROM users;`	Returns the average (arithmetic mean) of numeric type values for all of the selected rows.

Currently the count function in our table above isn't all that useful. It would simply count how many values we have in the id column; since we have five rows in our table, we have five values in the id column, so the result would be 5. You can quickly execute that statement just to check:

SELECT count(id) FROM users;

 count
-------
     5
(1 row)

Aggregate functions really start to be useful when grouping table rows together. The way we do that is by using the GROUP BY clause.

GROUP BY

Sometimes you need to combine data results together to form more meaningful information. Let's say we wanted to implement one of the examples mentioned earlier, and count the number of users who have accounts that are or are not enabled. How would we approach this?

We could try using a WHERE clause to first filter the results to rows where the enabled column has a value of true, and then use the count function to count the values in the id column (which is equal to the number of rows, we could actually use any column here).

SELECT count(id) FROM users WHERE enabled = true;

 count
-------
     4
(1 row)

This only shows the number of rows where the value of enabled is true though; if we want to know how many rows there are where enabled is false, we have to change our condition to WHERE enabled = false.

What if we want to show the counts for true and false at the same time? What we need is some way to split up the true and false rows into two separate groups and then count the rows in each group.

GROUP BY example

The GROUP BY clause allows us to do exactly that:

SELECT enabled, count(id) FROM users GROUP BY enabled;

 enabled | count
---------+-------
 f       |     1
 t       |     4
(2 rows)

Although we don't need to include enabled in our column list in order to use it in our GROUP BY clause, doing so makes our output more meaningful as we can clearly see the correlation between the count and the values in enabled.

One thing to be aware of when using aggregate functions, is that if you include columns in the column list alongside the function, then those columns must also be included in a GROUP BY clause. For example, the following statement returns an error:

SELECT enabled, full_name, count(id)
       FROM users
       GROUP BY enabled;   -- full_name is not grouped

ERROR:  column "users.full_name" must appear in the GROUP BY clause or be used in an aggregate function

The PostgreSQL documentation says:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

Of particular interest here is that all of the columns in the select list must either be included in the GROUP BY clause, be the result of an aggregate function, or the GROUP BY clause must be based on the primary key. This requirement ensures that there is a single value for every column in the result set. This rule can be a bit difficult to work with at first. However, practice using GROUP BY will gradually lead to a working understanding.

Ian Austin, a Launch School student, wrote an excellent article on the GROUP BY clause and aggregate functions. If you're having trouble understanding GROUP BY, be sure to read this article - it's worth your time.

Summary

In this chapter we've built on our knowledge of SELECT, and looked at a number of different ways we can make our SELECT queries more flexible and powerful:

Returning portions of a dataset using LIMIT and OFFSET
Returning unique values using DISTINCT
Using SQL functions to work with data in various ways
Aggregating data using GROUP BY

Thus far in this book we've looked at a lot of the things that you need to know in order to work with a single table in a database; from creating, altering, or even deleting the table itself, through to adding data and then querying that data in a number of different ways. There's one thing we haven't looked at yet, and that's changing data that already exists in your table.

In the next chapter we'll do exactly that by looking at the UPDATE and DELETE commands that will let us modify existing data in our table.

Exercises

Make sure you are connected to the encyclopedia database. Write a query to retrieve the first row of data from the countries table.

Solution

SELECT * FROM countries LIMIT 1;

 id |  name  | capital | population
----+--------+---------+------------
  1 | France | Paris   |   67158000
(1 row)

Write a query to retrieve the name of the country with the largest population.
Solution
SELECT name FROM countries ORDER BY population DESC LIMIT 1;
name ------ USA (1 row)

Here we first order the rows in descending order by population, and then retrieve the name value from the first of the ordered rows.
Write a query to retrieve the name of the country with the second largest population.
Solution
SELECT name FROM countries ORDER BY population DESC LIMIT 1 OFFSET 1;
name ------ Japan (1 row)

This is similar to the previous solution, but we retrieve the second row, rather than the first, by using OFFSET.
Write a query to retrieve all of the unique values from the binomial_name column of the animals table.
Solution
SELECT DISTINCT binomial_name FROM animals;
binomial_name -------------------------- Aquila Chrysaetos Strigops habroptila Falco Peregrinus Columbidae Columbiformes (4 rows)

Since Columbidae Columbiformes occurs twice (for Dove and for Pigeon), we can prepend the column name with DISTINCT in our query in order to return only the unique values.
Write a query to return the longest binomial name from the animals table.
Solution
SELECT binomial_name FROM animals ORDER BY length(binomial_name) DESC LIMIT 1;
binomial_name -------------------------- Columbidae Columbiformes (1 row)

Here we use the length() function to determine the length of each string in the binomial_name column, and use the resulting values to order our query (in descending order). By setting a LIMIT of 1, we return only the longest value.
Write a query to return the first name of any celebrity born in 1958.
Solution
SELECT first_name FROM celebrities WHERE date_part('year', date_of_birth) = 1958;
first_name ------------ Madonna Prince (2 rows)

Here we use the date_part() function to extract the year from the date_of_birth, and then use it within a WHERE clause.
Write a query to return the highest maximum age from the animals table.
Solution
SELECT max(max_age_years) FROM animals;
max ----- 60 (1 row)

Write a query to return the average maximum weight from the animals table.

Solution

SELECT avg(max_weight_kg)
FROM animals;

        avg
--------------------
 3.1700000000000000
(1 row)

Write a query to return the number of rows in the countries table.
Solution
SELECT count(id) FROM countries;
count ------- 4 (1 row)

Here we use the count() function to count the number of rows. We could have used any column to count the rows, e.g. count(name) would have returned the same result.
Write a query to return the total population of all the countries in the countries table.
Solution
SELECT sum(population) FROM countries;
sum ----------- 601544589 (1 row)
Write a query to return each unique conservation status code alongside the number of animals that have that code.
Solution
SELECT conservation_status, count(id) FROM animals GROUP BY conservation_status;
conservation_status | count ---------------------+------- CR | 1 LC | 4 (2 rows)

The key thing here is the GROUP BY clause in conjunction with the count() function. We are effectively combining the rows into groups based on a certain value (in this case conservation_status), and calling the function on the rows in each group. Without the GROUP BY clause, the function would treat all the rows as one single group.

Connect to the ls_burger database. Write a query that returns the average burger cost for all orders that include fries.

Solution

SELECT avg(burger_cost)
FROM orders
WHERE side = 'Fries';

        avg
--------------------
 4.0000000000000000
(1 row)

Write a query that returns the cost of the cheapest side ordered.

Solution

SELECT min(side_cost)
FROM orders
WHERE side IS NOT NULL;

 min
------
 0.99
(1 row)

Alternatively:

SELECT side_cost FROM orders
WHERE side IS NOT NULL
ORDER BY side_cost ASC LIMIT 1;

Write a query that returns the number of orders that include Fries and the number of orders that include Onion Rings.

Solution

SELECT side, count(id)
FROM orders
WHERE side = 'Fries'
OR side = 'Onion Rings'
GROUP BY side;

    side     | count
-------------+-------
 Fries       |     2
 Onion Rings |     1
(2 rows)