Anchors

Let's get wet by moving into chest-deep waters. Keep your feet anchored to the bottom, though. That's what we're about to discuss: anchors. Anchors provide a way to limit how a regex matches a particular string by telling the regex engine where matches can begin and where they can end.

Anchors are a bit strange in the world of regex; they don't match any characters. What they do is ensure that a regex matches a string at a specific place: the beginning or end of the string, the beginning or end of a line, or a word or non-word boundary.
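A small Python sketch can make the "no characters" point concrete: a match that consists of nothing but an anchor has zero length. The sample string and variable names are ours, chosen only for illustration.

```python
import re

sample = "cat"  # hypothetical test string

# An anchor on its own matches a position, not a character.
start = re.search(r'^', sample)
end = re.search(r'$', sample)

print(start.span(), repr(start.group()))  # (0, 0) ''
print(end.span(), repr(end.group()))      # (3, 3) ''
```

Both matches succeed, yet each one is an empty string: the span starts and stops at the same position.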

If you've ever used regex in any other context, there's a pretty good chance that you are familiar with the ^ and $ anchors, so we'll start our exploration of anchors there. Don't skip ahead though! Each of the Big Three languages has its own idea of what these anchors should do. If you really want to see the inconsistent behaviors of the Big Three, read all three sections.

Furthermore, our recommended test beds (Rubular, Scriptular, and Pythex) each treat test strings differently. For now, we recommend using the irb command for Ruby, the node command for JavaScript, and the python command for Python. Also, if you try to use any of the test beds, you'll quickly find that they don't recognize \n in the test string, which can make properly testing these anchors difficult.

Line Anchors in Ruby

This section has Ruby-specific information for anchors. If you aren't using Ruby, you can advance to the JavaScript or Python sections.

Instead of using Rubular, we will use Ruby itself to run some Ruby code from a file. For each example, you should put the code in a file named something like regex.rb, then run that file with the command ruby regex.rb in your terminal. We show the output as a comment to the right of or below each invocation of p.

In Ruby, the ^ and $ meta-characters are anchors that match the beginning (^) or ending ($) of a line of text. In Ruby, a line of text is the entire string if the string contains no newline characters (\n). However, if the string contains embedded newlines, each newline marks the end of a line and the beginning of a new line of text.

Let's see how the ^ anchor works with strings that don't contain embedded newlines. We'll use the /^c.t/i regex to test a variety of strings:

p "cat".scan(/^c.t/i)          # ["cat"]
p "cot\n".scan(/^c.t/i)        # ["cot"]
p "CATASTROPHE".scan(/^c.t/i)  # ["CAT"]
p "WILDCAUGHT".scan(/^c.t/i)   # []
p "wildcat\n".scan(/^c.t/i)    # []
p "-CET-".scan(/^c.t/i)        # []
p "Yacht".scan(/^c.t/i)        # []

This example demonstrates that ^ forces the c.t pattern to match at the beginning of each string. The tests on lines 1-3 show successful matches against the strings "cat", "cot\n", and "CATASTROPHE", all of which start with the case-insensitive c.t pattern. The remaining tests return no matches, as indicated by the empty array ([]).

Note that the strings on lines 2 and 5 end with newline characters. This has no effect on whether the regex matches those strings.

Let's repeat those tests with the regex /c.t$/i:

p "cat".scan(/c.t$/i)          # ["cat"]
p "cot\n".scan(/c.t$/i)        # ["cot"]
p "CATASTROPHE".scan(/c.t$/i)  # []
p "WILDCAUGHT".scan(/c.t$/i)   # []
p "wildcat\n".scan(/c.t$/i)    # ["cat"]
p "-CET-".scan(/c.t$/i)        # []
p "Yacht".scan(/c.t$/i)        # ["cht"]

This time, our regex matches on lines 1, 2, 5, and 7. Thus, the $ pattern tells Ruby to match at the end of each string. Note that the strings on lines 2 and 5 end with newline characters. This has no effect on whether the regex matches those strings; in Ruby, $ matches just before a newline as well as at the end of the string.

Now things get more interesting. Let's see what happens when the string contains embedded newlines. Try the following code:

text = "cat\ncot\nCATASTROPHE\nWILDCAUGHT\n" +
       "wildcat\n-GET-\nYacht"

p text.scan(/^c.t/i) # ["cat", "cot", "CAT"]
p text.scan(/c.t$/i) # ["cat", "cot", "cat", "cht"]

Though we only have one test string in this example, both tests show multiple matches: three for the first and four for the second. This shows that Ruby treats each embedded newline as the end of one line and the start of the next. The scan method returns a list of all matches that it finds. For the first test, the matches came from the lines "cat\n", "cot\n", and "CATASTROPHE\n"; for the second test, the lines "cat\n", "cot\n", "wildcat\n", and "Yacht" all matched.

Suppose we want to match only at the start or end of the string? How do we do that? The solution is to use two additional anchors: \A and \z (there's also a \Z metacharacter that we won't discuss). The \A anchors at the beginning of a string, while \z anchors at the end. Newlines in the string are not given any special consideration. Let's try it:

text = "cat\ncot\nCATASTROPHE\nWILDCAUGHT\n" +
       "wildcat\n-GET-\nYacht"

p text.scan(/\Ac.t/i) # ["cat"]
p text.scan(/c.t\z/i) # ["cht"]

This time, we see one match per test. These tests demonstrate that the matches only occur at the beginning and ending of the string.

Line Anchors in JavaScript

This section has JavaScript-specific information for anchors. If you aren't using JavaScript, you should go to either the Ruby or Python sections.

Instead of using Scriptular or Rubular, we will use Node.js to run some JavaScript code from a file. For each example, you should put the code in a file named something like regex.js, then run that file with the command node regex.js in your terminal. We show the output as a comment to the right of or below each invocation of p.

In JavaScript, the ^ and $ meta-characters are anchors that match the beginning (^) or ending ($) of a string. By default, JavaScript mostly ignores embedded newlines in the string.

Let's see how the ^ anchor works with strings that don't contain embedded newlines. We'll use the /^c.t/ig regex to test a variety of strings:

function p(str) {
  console.log(str.match(/^c.t/ig));
}

p("cat")          // [ 'cat' ]
p("cot\n")        // [ 'cot' ]
p("CATASTROPHE")  // [ 'CAT' ]
p("WILDCAUGHT")   // null
p("wildcat\n")    // null
p("-CET-")        // null
p("Yacht")        // null

The /ig flags tell JavaScript to match the string case-insensitively and to return an array of all matches in the string. The p function defined here is there to help us unclutter the test code.

This example demonstrates that ^ forces the c.t pattern to match at the beginning of each string. The tests on lines 5-7 show successful matches against the strings "cat", "cot\n", and "CATASTROPHE", all of which start with the case-insensitive c.t pattern. The remaining tests return no matches, as indicated by null values.

Note that the strings on lines 6 and 9 end with newline characters. This has no effect on whether the regex matches those strings.

Let's repeat those tests with the regex /c.t$/ig:

function p(str) {
  console.log(str.match(/c.t$/ig));
}

p("cat")          // [ 'cat' ]
p("cot\n")        // null
p("CATASTROPHE")  // null
p("WILDCAUGHT")   // null
p("wildcat\n")    // null
p("-CET-")        // null
p("Yacht")        // [ 'cht' ]

This time, our regex matches on lines 5 and 11. Thus, the $ pattern tells JavaScript to match at the end of each string. Note that the strings on lines 6 and 9 do not match; the trailing newline characters prevent a match. If you want $ to match just before a newline as well, you can use the /m flag:

function p(str) {
  console.log(str.match(/c.t$/mig));
}

p("cat")          // [ 'cat' ]
p("cot\n")        // [ 'cot' ]
p("CATASTROPHE")  // null
p("WILDCAUGHT")   // null
p("wildcat\n")    // [ 'cat' ]
p("-CET-")        // null
p("Yacht")        // [ 'cht' ]

Now things get more interesting. Let's see what happens when the string contains embedded newlines. Try the following code:

function p(str, regex) {
  console.log(str.match(regex));
}

let text = "cat\ncot\nCATASTROPHE\nWILDCAUGHT\n" +
           "wildcat\n-GET-\nYacht"

p(text, /^c.t/mig) // [ 'cat', 'cot', 'CAT' ]
p(text, /c.t$/mig) // [ 'cat', 'cot', 'cat', 'cht' ]

Though we only have one test string, both tests show multiple matches: three for the first and four for the second. This shows that JavaScript treats each embedded newline as the end of one line and the start of the next when you use the /m flag. The match method returns a list of all matches that it finds. For the first test, the matches came from the lines "cat\n", "cot\n", and "CATASTROPHE\n"; for the second test, the lines "cat\n", "cot\n", "wildcat\n", and "Yacht" all matched.

Line Anchors in Python

This section has Python-specific information for anchors. If you aren't using Python, you should go to either the Ruby or JavaScript sections.

Instead of using Pythex or Rubular, we will use Python itself to run some Python code from a file. For each example, you should put the code in a file named something like regex.py, then run that file with the command python regex.py in your terminal. We show the output as a comment to the right of or below each invocation of p.

In Python, the ^ and $ meta-characters are anchors that, by default, match the beginning (^) or ending ($) of the string; $ also matches just before a newline at the very end of the string. With the re.MULTILINE flag, they instead match the beginning or ending of each line of text, where each embedded newline marks the end of one line and the beginning of the next.

Let's see how the ^ anchor works with strings that don't contain embedded newlines. We'll use the r'^c.t' regex with the re.IGNORECASE flag to test a variety of strings:

import re

def p(text):
    print(re.findall(r'^c.t',
                     text,
                     flags=re.IGNORECASE))

p("cat")         # ['cat']
p("cot\n")       # ['cot']
p("CATASTROPHE") # ['CAT']
p("WILDCAUGHT")  # []
p("wildcat\n")   # []
p("-CET-")       # []
p("Yacht")       # []

This example demonstrates that ^ forces the c.t pattern to match at the beginning of each string. The tests on lines 8-10 show successful matches against the strings "cat", "cot\n", and "CATASTROPHE", all of which start with the case-insensitive c.t pattern. The remaining tests return no matches, as indicated by the empty list ([]).

Note that the strings on lines 9 and 12 end with newline characters. This has no effect on whether the regex matches those strings.

Let's repeat those tests with the regex r'c.t$':

import re

def p(text):
    print(re.findall(r'c.t$',
                     text,
                     flags=re.IGNORECASE))

p("cat")         # ['cat']
p("cot\n")       # ['cot']
p("CATASTROPHE") # []
p("WILDCAUGHT")  # []
p("wildcat\n")   # ['cat']
p("-CET-")       # []
p("Yacht")       # ['cht']

This time, our regex matches on lines 8, 9, 12, and 14. Thus, the $ pattern tells Python to match at the end of each string. Note that the strings on lines 9 and 12 end with newline characters. This has no effect on whether the regex matches those strings; in Python, $ also matches just before a single newline at the end of a string.

Now things get more interesting. Let's see what happens when the string contains embedded newlines. Try the following code:

import re

def p(regex, text):
    print(re.findall(regex,
                     text,
                     flags=re.IGNORECASE))

text = ("cat\ncot\nCATASTROPHE\nWILDCAUGHT\n" +
        "wildcat\n-GET-\nYacht")

p(r'^c.t', text) # ['cat']
p(r'c.t$', text) # ['cht']

This time we got only one match in each test: the cat at the beginning of the string, and the Yacht at the end. If that's what we want, this is perfect. However, if you want to match against each line in the string, you must use the re.MULTILINE flag:

import re

def p(regex, text):
    print(re.findall(regex,
                     text,
                     flags=re.IGNORECASE | re.MULTILINE))

text = ("cat\ncot\nCATASTROPHE\nWILDCAUGHT\n" +
        "wildcat\n-GET-\nYacht")

p(r'^c.t', text) # ['cat', 'cot', 'CAT']
p(r'c.t$', text) # ['cat', 'cot', 'cat', 'cht']

This time, we got three matches for the first pattern ("cat\n", "cot\n", and "CATASTROPHE\n") and four matches for the second ("cat\n", "cot\n", "wildcat\n", and "Yacht").

Note that we're using two flags this time. To combine them, we use the | operator (the bitwise "or" operator). This operator sometimes arises when you have a choice of binary options that aren't mutually exclusive.
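Here's a minimal sketch of what combining flags with | does. The variable names and the two-line sample string are ours; the flags themselves come from Python's re module.

```python
import re

# Each flag is a distinct bit; | sets both bits in a single value.
combined = re.IGNORECASE | re.MULTILINE

# The bitwise "and" operator (&) tests whether a flag is present.
print(bool(combined & re.IGNORECASE))  # True
print(bool(combined & re.MULTILINE))   # True
print(bool(combined & re.DOTALL))      # False

text = "cat\nCot"  # hypothetical two-line string
print(re.findall(r'^c.t', text, flags=combined))  # ['cat', 'Cot']
```

Because both flags are active at once, the regex ignores case and also treats each line as a separate anchor point.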

Line Anchors Wrapup

If you took the time to read all 3 sections above, your head is probably spinning a bit. Mine is, and I had to explain them. It's really hard to keep things straight when each language interprets line anchors differently. Nobody should have to memorize these differences, and that includes you. We've got a few recommendations to help you navigate the use of line anchors:

  • If at all possible, avoid using the anchors when working with strings that have embedded newlines, including newlines at the end.
  • If you must use anchors with newlines:
    • Split the string into an array or list of substrings delimited by newlines.
    • Strip the trailing newlines from each of the substrings.
    • Apply the regex matches against each substring in a loop.
  • Use the \A and \z anchors in Ruby and the \A and \Z anchors in Python (Python's re module did not accept \z before version 3.14); use ^ and $ in JavaScript.
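The split, strip, and loop recommendation above can be sketched in Python roughly as follows. The function name match_each_line is hypothetical, and the sample text is the one used earlier with a trailing newline added. Note that the sketch uses \Z, which is how Python's re module spells the end-of-string anchor.

```python
import re

# Same sample as the earlier examples, with a trailing newline added.
text = ("cat\ncot\nCATASTROPHE\nWILDCAUGHT\n"
        "wildcat\n-GET-\nYacht\n")

def match_each_line(pattern, text):
    """Apply pattern to every line of text; collect what matched."""
    found = []
    # splitlines() splits on newlines and drops them in one step.
    for line in text.splitlines():
        match = re.search(pattern, line, flags=re.IGNORECASE)
        if match:
            found.append(match.group())
    return found

print(match_each_line(r'\Ac.t', text))  # ['cat', 'cot', 'CAT']
print(match_each_line(r'c.t\Z', text))  # ['cat', 'cot', 'cat', 'cht']
```

Since every substring is newline-free by the time the regex sees it, the string anchors behave the same way no matter which flags are in effect.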

Even though we recommend using \A and \z (\A and \Z in Python) for anchored matches in Ruby and Python, most examples and exercises in this book use ^ and $ instead. It is easier to demonstrate certain behaviors when using ^ and $ on Rubular.
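To see why the string anchors are the safer choice in Python, here's a sketch that runs both kinds of anchors with re.MULTILINE turned on. It reuses the sample text from the Python section; Python's end-of-string anchor is written \Z.

```python
import re

text = ("cat\ncot\nCATASTROPHE\nWILDCAUGHT\n"
        "wildcat\n-GET-\nYacht")
flags = re.IGNORECASE | re.MULTILINE

# ^ and $ follow the MULTILINE flag and match on every line.
print(re.findall(r'^c.t', text, flags=flags))   # ['cat', 'cot', 'CAT']
print(re.findall(r'c.t$', text, flags=flags))   # ['cat', 'cot', 'cat', 'cht']

# \A and \Z ignore the flag and match only at the ends of the string.
print(re.findall(r'\Ac.t', text, flags=flags))  # ['cat']
print(re.findall(r'c.t\Z', text, flags=flags))  # ['cht']
```

The \A and \Z results stay the same whether or not the flag is set, so you don't have to remember which mode a regex runs in.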

Word Boundaries

The last two anchors anchor regex matches to word boundaries (\b) and non-word boundaries (\B). For these anchors, words are sequences of word characters (\w), while non-words are sequences of non-word characters (\W). A word boundary occurs:

  • between any pair of characters, one of which is a word character and one of which is not.
  • at the beginning of a string if the first character is a word character.
  • at the end of a string if the last character is a word character.

A non-word boundary matches any place else:

  • between any pair of characters, both of which are word characters or both of which are not word characters.
  • at the beginning of a string if the first character is a non-word character.
  • at the end of a string if the last character is a non-word character.

For instance:

Eat some food.

Here, word boundaries occur before the E, s, and f at the start of the three words, and after the t, e, and d at their ends. Non-word boundaries occur elsewhere, such as between the o and m in some, and following the . at the end of the sentence.
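If you'd like to see those positions as numbers, here's a small Python sketch that lists every place where \b and \B match in the sentence. Position n means the gap just before character n, and the variable names are ours.

```python
import re

sentence = "Eat some food."

# Every position where \b or \B matches (each match is zero-width).
word_boundaries = [m.start() for m in re.finditer(r'\b', sentence)]
non_boundaries = [m.start() for m in re.finditer(r'\B', sentence)]

print(word_boundaries)  # [0, 3, 4, 8, 9, 13]
print(non_boundaries)   # [1, 2, 5, 6, 7, 10, 11, 12, 14]
```

Positions 0, 4, and 9 fall just before E, s, and f; positions 3, 8, and 13 fall just after t, e, and d. Position 14 is the end of the string, after the period, so it shows up as a non-word boundary.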

To anchor a regex to a word boundary, use the \b pattern. For example, to match 3 letter words consisting of "word characters", you can use /\b\w\w\w\b/. Try it with:

One fish,
Two fish,
Red fish,
Blue fish.
123 456 7890
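Outside of Rubular, the same test can be sketched in Python. The text variable simply joins the lines above with newlines.

```python
import re

text = "One fish,\nTwo fish,\nRed fish,\nBlue fish.\n123 456 7890"

# \b on each side keeps the match from being part of a longer word.
three_char_words = re.findall(r'\b\w\w\w\b', text)

print(three_char_words)  # ['One', 'Two', 'Red', '123', '456']
```

Note that 123 and 456 match because digits are word characters, while fish, Blue, and 7890 are too long.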

It's rare that you must use the non-word boundary anchor, \B. Here's a somewhat contrived example you can try. Try the regex /\Bjohn/i against these strings:

John Silver
Randy Johnson
Duke Pettijohn
Joe_Johnson

The regex matches john in the last two strings, but not the first two.
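Here's one way to run that test in Python; the names list and results dictionary are ours, added only for illustration.

```python
import re

names = ["John Silver", "Randy Johnson", "Duke Pettijohn", "Joe_Johnson"]

# \B before the j means the character in front of it must also be a
# word character, so a John that begins a word is skipped.
results = {name: re.findall(r'\Bjohn', name, flags=re.IGNORECASE)
           for name in names}

for name, found in results.items():
    print(name, found)
# John Silver []
# Randy Johnson []
# Duke Pettijohn ['john']
# Joe_Johnson ['John']
```

The last string matches because the underscore is a word character, so there is no word boundary between _ and J.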

\b and \B do not work as word boundaries inside of character classes (between square brackets). In fact, \b means something else entirely when inside square brackets: it matches a backspace character.
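A short Python sketch shows the two meanings side by side. The sample string is hypothetical and contains a real backspace character between the two letters.

```python
import re

text = "a\bb"  # 'a', a backspace character (\x08), then 'b'

# Outside square brackets, \b is a zero-width word boundary.
boundary = re.search(r'\b', text)
print(repr(boundary.group()))           # ''

# Inside square brackets, \b matches the backspace character itself.
backspaces = re.findall(r'[\b]', text)
print(backspaces)                       # ['\x08']
print(re.sub(r'[\b]', '<BS>', text))    # a<BS>b
```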

Summary

With the use of anchors, you now have a great deal more flexibility. These simple constructs provide a degree of control over your regex that you didn't have before -- you can tell the regex engine where matches can occur. If you need it, more is available with look-ahead and look-behind assertions, but that topic is beyond the scope of this book.

In the next chapter, we'll get into quantifiers. Quantifiers, more than any other feature, lie at the heart of what makes regex so useful.

But, before you wade out any further, take a little while to work the exercises below. In these exercises, use Rubular to write and test your regex. You don't need to write any code.

Exercises

  1. Write a regex that matches the word The when it occurs at the beginning of a line. Test it with these strings:

    The lazy cat sleeps.
    The number 623 is not a word.
    Then, we went to the movies.
    Ah. The bus has arrived.
    

    There should be two matches.

    Solution

    /^The\b/
    

    This regex should match the word The in the first two lines, but should not match anything on the last two.

    If you tried using /\AThe\b/ on Rubular, the match probably didn't work. Why not? If you haven't already tried, try it now. In most cases, you should use \A instead of ^ in Ruby, but Rubular treats the test string as a single multi-line string, so you should use ^ instead.

    Trying r'\AThe\b' on pythex also doesn't work for the same reason. pythex always treats the test string as a single long line. Worse yet, enabling MULTILINE mode has no impact on this test. Also, r'^The\b' won't work unless you enable MULTILINE mode.

    These two issues aren't about Ruby or Python, specifically. They are more about the implementations of Rubular and pythex.

  2. Write a regex that matches the word cat when it occurs at the end of a line. Test it with these strings:

    The lazy cat sleeps
    The number 623 is not a cat
    The Alaskan drives a snowcat
    

    There should be one match.

    Solution

    /\bcat$/
    

    This regex should match the word cat in the second line, but should not match anything else.

    If you tried using /\bcat\z/ on Rubular or r'\bcat\Z' on pythex (Python spells its end-of-string anchor \Z), the match probably didn't work correctly. Why not? See the solution to the previous problem for an explanation.

  3. Write a regex that matches any three-letter word; a word is any string composed entirely of letters. You can use these test strings:

    reds and blues
    The lazy cat sleeps.
    The number 623 is not a word. Or is it?
    

    There should be five matches.

    Solution

    /\b[a-z][a-z][a-z]\b/i
    

    As expected, this regex matches and, cat, The (both occurrences), and not. Notice that it does not match 623 or it?.

  4. Challenge: Write a regex that matches an entire line of text that consists of exactly 3 words as follows:

    • The first word is A or The.
    • There is a single space between the first and second words.
    • The second word is any 4-letter word.
    • There is a single space between the second and third words.
    • The third word -- the last word -- is either dog or cat.

    Test your solution with these strings:

    A grey cat
    A blue caterpillar
    The lazy dog
    The white cat
    A loud dog
    --A loud dog
    Go away dog
    The ugly rat
    The lazy, loud dog
    

    There should be three matches.

    Solution

    /^(A|The) [a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z] (dog|cat)$/
    

    The valid matches are A grey cat, The lazy dog, and A loud dog.

    This solution employs alternation from the first chapter in this section to define the words that occur at the beginning and end of each line and includes a match for a four-letter word in the middle. We have assumed that the middle word can contain both uppercase and lowercase letters, so we have to specify [a-zA-Z] for each of the four letters. We don't use \w because the problem explicitly asked for four-letter words, and \w also matches digits and underscores.

    As with the other exercises, a proper Ruby or Python solution would use \A and \z (\A and \Z in Python) instead of ^ and $, but to allow for Rubular and pythex limitations, we use ^ and $ instead.
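You don't need code for these exercises, but if you'd like to check the expected match counts outside of Rubular, here's a rough Python sketch. The count_matches helper is hypothetical; it joins the test strings with newlines and turns on re.MULTILINE so that ^ and $ anchor to each line, which is our approximation of how Rubular treats its test string.

```python
import re

def count_matches(pattern, lines, flags=0):
    """Count matches with ^ and $ anchored to each line."""
    text = "\n".join(lines)
    return len(re.findall(pattern, text, flags=flags | re.MULTILINE))

ex1 = ["The lazy cat sleeps.", "The number 623 is not a word.",
       "Then, we went to the movies.", "Ah. The bus has arrived."]
ex2 = ["The lazy cat sleeps", "The number 623 is not a cat",
       "The Alaskan drives a snowcat"]
ex3 = ["reds and blues", "The lazy cat sleeps.",
       "The number 623 is not a word. Or is it?"]
ex4 = ["A grey cat", "A blue caterpillar", "The lazy dog",
       "The white cat", "A loud dog", "--A loud dog", "Go away dog",
       "The ugly rat", "The lazy, loud dog"]

four_letters = "[a-zA-Z]" * 4  # same as writing the class four times

print(count_matches(r'^The\b', ex1))                              # 2
print(count_matches(r'\bcat$', ex2))                              # 1
print(count_matches(r'\b[a-z][a-z][a-z]\b', ex3, re.IGNORECASE))  # 5
print(count_matches(r'^(A|The) ' + four_letters + r' (dog|cat)$',
                    ex4))                                         # 3
```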