Introduction to Collections

Collection Types

Python has a surprisingly large number of built-in and auxiliary collection types. Collections are objects that contain zero or more member objects, often called elements. There are 3 main categories of collection: sequences, mappings, and sets.

Let's recap the types table from the Data Types chapter. We've reproduced a portion of the table below:

Type Class Category Kind Mutable
ranges range sequences Non-primitive No
tuples tuple sequences Non-primitive No
lists list sequences Non-primitive Yes
dictionaries dict mappings Non-primitive Yes
sets set sets Non-primitive Yes
frozen sets frozenset sets Non-primitive No

As you can see, the collections fall into several categories: sequences, mappings, and sets.

We often also include text sequences when talking about collections:

Type Class Category Kind Mutable
strings str text sequences Primitive No

The primary text sequence of interest is strings. Strings and other text sequences are not collections, however. They are sufficiently similar that we can treat them as sequence collections, but its important to remember that they are not. For the purposes of this chapter, we will discuss text sequences as though they were collections.

Some collections are mutable; others are immutable. Keeping them straight takes some practice; be patient. In particular, lists and tuples differ only in that lists are mutable; tuples are not. The set collections differ similarly: sets are mutable; frozen sets are not.

We've met all these types earlier in this book. In this lesson, we'll explore them in depth.

What are Sequences?

Sequences are types that maintain an ordered collection of objects (also: elements or values) that can be indexed by whole numbers. Ordered collections are collections organized in some sequence: a first element, a second element, a third element, and so on. Indexed sequences associate every object member with a whole number (0, 1, 2, etc.) that can be used to access or modify that object. The built-in sequences (ranges, lists, and tuples) always have the first object at index 0, the second at index 1, and so on.

letters = ['a', 'b', 'c']
print(letters[0])         # a (first element)
print(letters[1])         # b (second element)
print(letters[2])         # c (third element)

Lists and tuples are heterogeneous; they may contain different kinds of objects, including other sequences:

my_list = [
    'May',
    2.99,
    {None, True, False},
    [1, 2, 3],
]

Ranges are homogenous; they always contain integers.

Strings are a form of sequence called a text sequence. They differ from ordinary sequences in several ways:

  1. Strings are homogenous; all characters in a string are, um, characters.
  2. Characters are not a distinct kind of object; they are merely strings of length 1.
  3. A string's individual characters are not separate strings until you reference a character.
  4. Strings are not actual collections since the characters inside the string aren't objects.

Bullet 3 merits an example, but it requires using Unicode. The Greek letter theta (θ) is a fine choice. Let's look at two snippets, one using a list and the other a string:

letters = ['a', 'b', 'θ', 'd', 'e']
char = letters[2]
print(char is letters[2])     # True
letters = 'abθde'
char = letters[2]
print(char is letters[2])     # False

Example 1 shows that letters[2] returns the object at index 2 of the letters list. Both char and letters[2] reference the same object, as shown on line 3.

However, in Example 2, the two characters are not the same object. They have the same value but are distinct objects.

We used θ to avoid interning, a topic we discussed earlier. Python would have interned any character from the extended ASCII character set, and both snippets would have returned True. By switching to a non-ASCII character, we could show that characters in a string aren't objects.

While there are differences, we can usually treat strings as ordinary sequences.

What are Sets?

Sets are types that maintain an unordered collection of unique objects (also called elements or members). Unlike sequences, sets cannot be indexed. Unordered means no well-defined order exists for the objects in a set. In fact, the sets {1, 2, 3} and {3, 1, 2} are equal since the order doesn't matter. By unique, we mean a set can not have duplicate members.

Python has two main built-in set types: sets and frozen sets. Regular sets are mutable; frozen sets are immutable. This is the only significant difference between the two:

letters = {'a', 'b', 'c'}
letters.add('d')
print(letters)
# {'a', 'b', 'c', 'd'} (order may differ)

frozen_letters = frozenset(letters)
frozen_letters.add('e')
# AttributeError: 'frozenset' object has no
# attribute 'add'

Sets and frozen sets, like Lists and tuples, are heterogeneous; they may contain different kinds of objects, including other hashable collections:

my_set = {
    'May',
    2.99,
    (None, True, False),
    range(5),
}

Python will ignore any attempt to add duplicate members to a set:

letters = {'a', 'b', 'c', 'b', 'a'}
print(letters)
# {'a', 'c', 'b'} (order may differ)

letters.add('c')
print(letters)
# {'a', 'c', 'b'} (order may differ)

Since Python 3.7, sets of integers sometimes seem to be ordered. For instance:

numbers = { 1, 2, 3, 4, 5 }
print(numbers)      # { 1, 2, 3, 4, 5 }

numbers = { 5, 3, 1, 2, 4 }
print(numbers)      # { 1, 2, 3, 4, 5 }

No matter how often you run that code, it prints both sets with the same ordered values. It's an illusion, however, produced by an implementation detail. The behavior isn't guaranteed: it's easy to construct sets of integers that are clearly unordered:

numbers = { 1, 2, 37, 4, 5 }
print(numbers)      # {1, 2, 4, 37, 5}

The order of the numbers you see may not match ours, but you will likely see that the numbers are no longer predictably ordered.

What are Mappings?

Mappings are types that maintain an unordered collection of key/value pairs (also called elements or members). Unlike sequences, mappings are accessed by their keys, which usually are not numbers. While Python has several mapping types, the only one we'll meet in this book is the dictionary or dict type.

Dicts have the following characteristics:

  • Dicts are mutable.
  • The keys in a dict must be unique. Trying to add a duplicate key to a dict merely replaces the existing key's associated value.
  • Keys must be "hashable" values. We won't define "hashable" right now. However, hashable values are usually (not always) immutable. All built-in immutable types (numbers, strings, booleans, tuples, frozen sets, and None) are hashable and can be dict keys.
  • The values in each key/value pair may be any object.
d = {'a': 1, (1, 3): 3, 1: 'x'}
print(d)         # {'a': 1, (1, 3): 3, 1: 'x'}
print(d['a'])    # 1
print(d[(1, 3)]) # 3
print(d[1])      # 'x'

d['a'] = 'A'
print(d)        # {'a': 'A', (1, 3): 3, 1: 'x'}

Since version 3.7, Python dicts maintain the insertion order of the pairs. Python guarantees it. Thus, you can iterate over the pairs in the same order they were inserted into the dictionary. However, this behavior doesn't fundamentally alter whether dicts are unordered collections. In practice, you should mentally think of dicts as unordered collections, but it's OK to rely on the ordering in contexts where the insertion order matters.

d = {'a': 1, (1, 3): 3, 1: 'x'}

del d['a']      # Deletes key/value pair 'a': 1
print(d)        # {(1, 3): 3, 1: 'x'}

d['a'] = 42
print(d)        # {(1, 3): 3, 1: 'x', 'a': 42}

Sequence Constructors

Most built-in Python data types let the programmer create new objects using literal values. Literals are great. However, you can also use special functions called constructors to create new objects. In fact, sometimes you can't use literals; you must use constructors to create ranges, frozen sets, and empty sets.

Let's examine the constructors used with collections.

String Constructor

The string constructor, str, has two basic formats, str() and str(something), where something is any Python object. Regardless of what you pass to str, it returns a string. For instance:

str()            # returns '' (empty string)
str('abc')       # returns 'abc'
str(42)          # returns '42'
str(3.141592)    # returns '3.141592'
str(False)       # returns 'False'
str(None)        # returns 'None'
str(range(3, 7)) # returns 'range(3, 7)'
str([1, 2, 3])   # returns '[1, 2, 3]'
str(int)         # returns "<class 'int'>"
str(list)        # returns "<class 'list'>"

class Person: pass
str(Person())
# returns "<__main__.Person object at 0x1006d21e0>"

In most cases, the values returned by str give a good idea of the original value. However, the last 6 lines show that the return value isn't always helpful. This is especially true with custom types like the Person class above (we'll learn about custom types in the OO Python book). You can fix this issue with custom types, but we won't discuss that until later in the Core Curriculum.

Range Constructor

The range constructor comes in 3 forms:

  • range(start, stop, step)

    This constructor generates a sequence of integers between start and stop - 1 with an increment of step between each consecutive integer. You can use a negative step to generate a sequence in reverse order. In this case, the range ends at stop + 1. For instance:

    r = range(5, 12, 2)
    print(list(r))            # [5, 7, 9, 11]
    
    r = range(12, 8, -1)
    print(list(r))            # [12, 11, 10, 9]
    
    r = range(12, 5, -2)
    print(list(r))            # [12, 10, 8, 6]
    

    You can create empty ranges by giving values where start >= stop when step is positive or start <= stop when step is negative. Empty ranges are often bugs.

    r = range(5, 5, 1)
    print(list(r))           # []
    
    r = range(5, 7, -1)
    print(list(r))            # []
    
  • range(start, stop)

    When you omit the step argument, Python uses a default value of 1. Hence, range(start, stop) is identical to range(start, stop, 1).

  • range(stop)

    When you omit the start argument, Python uses a default value of 0 for start. Hence, range(stop) is identical to range(0, stop, 1).

Ranges are lazy sequences: they don't create any element values until your program needs them. This is why we use list(r) above; without it, print(r) prints something like range(5, 12, 2), which may not be helpful. You can also use the tuple constructor to do the same thing. The set and frozenset constructors also work, but the sets may not be in the same order.

The List, Tuple, Set, and Frozen Set Constructors

Lists, tuples, sets, and frozen sets share two common constructor forms:

  • list() or list(iterable)
  • tuple() or tuple(iterable)
  • set() or set(iterable)
  • frozenset() or frozenset(iterable)

The constructors with no arguments create an empty list, tuple, set, or frozen set, as appropriate: a sequence or set with no members.

The constructors that take an iterable argument expect an object that Python can iterate: an iterable. From the built-in types, the iterables include all sequences, strings, sets, and mappings. (Note that iterating a mapping iterates the mapping's keys.) Thus, you can use these constructors to convert lists to tuples, tuples to sets, strings to lists, and so on:

my_str = 'Python'

my_list = list(my_str)
print(my_list)  # ['P', 'y', 't', 'h', 'o', 'n']

my_tuple = tuple(my_list)
print(my_tuple) # ('P', 'y', 't', 'h', 'o', 'n')

my_set = set(my_tuple)
print(my_set)   # {'t', 'o', 'n', 'h', 'P', 'y'}

You can also use these constructors to duplicate a collection:

my_list = [5, 12, 2]
another_list = list(my_list)

print(my_list)                          # [5, 12, 2]
print(another_list)                     # [5, 12, 2]

print(my_list == another_list)          # True
print(my_list is another_list)          # False

As you can see in lines 4, 5, and 7, the lists are identical. However, line 8 shows that the list objects are not the same object.

Let's look at a more complex example involving nested lists:

my_list = [[1, 2, 3], [4, 5, 6]]
another_list = list(my_list)

print(my_list)                          # [[1, 2, 3], [4, 5, 6]]
print(another_list)                     # [[1, 2, 3], [4, 5, 6]]

print(my_list == another_list)          # True
print(my_list is another_list)          # False
print(my_list[0] is another_list[0])    # True
print(my_list[1] is another_list[1])    # True

In this case, we can see from lines 4, 5, and 7 that the lists have the same values, and from line 8, we can see that they are different objects. Strangely, in lines 9 and 10, we can see that their elements are the same objects. That's because the list constructor performs what is known as a shallow copy. In a later chapter, we'll discuss shallow and deep copies and what is happening here. For now, remember that nested collections are subject to shallow copies. That's often fine. However, it can lead to problems that are difficult to diagnose.

Passing a string to any of these constructors causes it to iterate over the characters in the string. It places each character in a separate element of the resulting collection:

print(tuple('Python'))
# ('P', 'y', 't', 'h', 'o', 'n')

Summary

Python has a rich variety of collections for storing and manipulating data. You'll see collections in almost all real-world programs; nearly every useful program uses them. We'll continue to explore them in the next chapter.

Exercises

  1. If you have a list named people, how can you determine the number of entries in that list?

    Solution

    You can use len(people) to determine the number of entries.

    Video Walkthrough

    Please register to play this video

  2. Can you write some code to change the value 'bye' in the following tuple to 'goodbye'?

    stuff = ('hello', 'world', 'bye', 'now')
    

    Note: this problem is a bit tricky.

    Solution

    Since tuples are immutable, you can't replace 'bye' with 'goodbye'. At best, you can create a new tuple and reassign it:

    stuff = ('hello', 'world', 'goodbye', 'now')
    
    stuff = stuff[0:2] + ('goodbye', stuff[3])
    
    stuff = list(stuff)
    stuff[2] = 'goodbye'
    stuff = tuple(stuff)
    

    Solution 1 creates an entirely new tuple with the changed value. That's not quite in the spirit of what we're asking. It would also be tedious if the tuple contained more than a few elements.

    Solution 2 is a little closer to being in the proper spirit. This one concatenates a slice of the original tuple combined with a new tuple that includes 'goodbye' and 'now' (from stuff[3]). However, that code is difficult to read. Furthermore, the slicing and indexing is a sure-fire way to get an off-by-one error. For example, if you wrote stuff[0:1] instead of stuff[0:2], the result would be missing 'world'.

    Solution 3 is as close as we can get to the spirit of the problem. Here, we convert the original tuple to a list and reassign the element to the new value. Finally, we transform the list into a new tuple. This approach still creates an entirely new tuple. However, it avoids the problem of re-entering a long list of members, is cleaner than slicing and indexing, and is much easier to read.

    Video Walkthrough

    Please register to play this video

  3. Identify at least 2 differences between lists and tuples. Identify at least 2 ways that lists and tuples are similar.

    Solution

    • Differences
      1. Lists are mutable; tuples are immutable.
      2. List literals use []; tuple literals use ().
    • Similarities
      1. Lists and tuples are both sequences. Sequences are ordered collections that can use numeric indexes to access the members.
      2. Lists and tuples are heterogeneous; elements do not need to all be the same type.

    Video Walkthrough

    Please register to play this video

  4. Why can we treat strings as sequences?

    Solution

    Strings are ordered; you can access the individual characters with indexing.

    Video Walkthrough

    Please register to play this video

  5. How do the set types differ from the sequence types?

    Solution

    Sets are unordered; they don't support indexing.

    Video Walkthrough

    Please register to play this video

  6. Assuming you have the following assignment:

    pi = 3.141592
    

    Write some code that converts the value of pi to a string and assigns it to a variable named str_pi.

    Solution

    pi = 3.141592
    str_pi = str(pi)
    
    pi = 3.141592
    str_pi = f'{pi}'
    

    Solution 1 is the preferred solution since it is slightly more readable. Solution 2 works, too. However, f-strings are better suited for creating strings mixed with other text.

    Video Walkthrough

    Please register to play this video

  7. Without running the following code, identify the numbers that are included in each of the following ranges:

    range(7)
    range(1, 6)
    range(3, 15, 4)
    range(3, 8, -1)
    range(8, 3, -1)
    

    Solution

    range(7)            # [0, 1, 2, 3, 4, 5, 6]
    range(1, 6)         # [1, 2, 3, 4, 5]
    range(3, 15, 4)     # [3, 7, 11]
    range(3, 8, -1)     # []
    range(8, 3, -1)     # [8, 7, 6, 5, 4]
    

    The most important thing to observe here is that ranges never include the "stop" value. Furthermore, a negative step value counts downwards from the start to the stop value. Thus, the start value should typically be larger than the stop value when the step value is negative.

    Video Walkthrough

    Please register to play this video

  8. How would you print all the numbers in the following range?

    range(3, 17, 4)
    

    Solution

    print(list(range(3, 17, 4)))
    
    print(tuple(range(3, 17, 4)))
    

    Since ranges are lazy sequences, you must either iterate over the range or convert the range to a non-lazy sequence: a list or tuple. Here, we transform the range into a list and a tuple.

    Video Walkthrough

    Please register to play this video

  9. my_list = [1, 2, 3, [4, 5, 6]]
    another_list = list(my_list)
    

    Given the above code, answer the following questions and explain your answers:

    1. Are the lists assigned to my_list and another_list equal?
    2. Are the lists assigned to my_list and another_list the same object?
    3. Are the nested lists at index 3 of my_list and another_list equal?
    4. Are the nested lists at index 3 of my_list and another_list the same object?

    Don't write any code for this exercise.

    Solution

    1. The two lists are equal: they are both lists with the same elements.
    2. The two lists are not the same object: The list constructor creates a new object.
    3. The two nested lists are equal: they are both lists that have the same elements.
    4. The two nested lists are the same object: the list constructor creates a shallow copy of the iterable passed as an argument. Shallow copies don't create new nested lists. Instead, only a reference to the nested list gets copied.

    Video Walkthrough

    Please register to play this video

  10. Bob expects the following code to print the names in alphabetical order. Without running the code, can you say whether it will? Explain your answer.

    names = { 'Chris', 'Clare', 'Karis', 'Karl',
              'Max', 'Nick', 'Victor' }
    print(names)
    

    Solution

    This code probably won't print the names alphabetically. Sets, by definition, are unordered collections. Thus, the order in which members are shown when printing or iterating is arbitrary. Bob's code may print the names alphabetically on rare occasions, but it would do so only once every 5040 times.

    Video Walkthrough

    Please register to play this video

  11. Consider the data in the following table:

    Name Country
    Alice USA
    Francois Canada
    Inti Peru
    Monika Germany
    Sanyu Uganda
    Yoshitaka Japan

    You need to write some Python code to determine the country name associated with one of the listed names. Your code should include the data structure(s) you need and at least one test case to ensure the code works.

    Solution

    countries = {
        'Alice':     'USA',
        'Francois':  'Canada',
        'Inti':      'Peru',
        'Monika':    'Germany',
        'Sanyu':     'Uganda',
        'Yoshitaka': 'Japan',
    }
    
    print(countries['Monika'])
    

    Video Walkthrough

    Please register to play this video