Using Regular Expressions

Now that you're bobbing along atop the waves, it's time to relax and explore your surroundings. Get your swim fins on, and head on out into deeper waters.

Thus far, our explorations have given us a good handle on the different types of patterns that can appear in a regex. You know how to match specific characters and classes of characters, how to anchor your matches, and even how to match strings of different sizes and content. However, you've seen only a handful of examples that show what this looks like in real code. We're going to rectify that a bit in this section and introduce some of the coding options available in the Big Three. This discussion isn't comprehensive, but it does provide the tools you'll need in the future. Most developers will never need anything more.

Oddly, the Regexp (Ruby) and RegExp (JavaScript) classes don't provide the regex methods you'll use most often. Instead, the String class does. In Python, the re module supplies all the functionality you need.

Matching Strings

We've already seen a few examples of interacting with regex in code. The techniques vary by language and intended use. For example:

fetch_url(url) if url.match(/\Ahttps?:\/\/\S+\z/)
if (url.search(/^https?:\/\/\S+$/) !== -1) {
  fetchUrl(url);
}
import re

if re.search(r'^https?://\S+$', url):
    fetch_url(url)

Here we call fetch_url(url) when a match occurs: that is, when url contains something that looks like a URL.

We won't discuss the return values of match and search in detail; see the documentation for that. For now, here's what you need to know for each language:

  • Ruby
    • String#match
      • Returns a MatchData object that describes the first instance of a substring that matches the regex.
      • Returns nil if no matches are found.
    • String#scan
      • Returns an array of all instances of substrings that match a regex.
      • Returns an empty array if no matches are found.
  • JavaScript
    • String.prototype.match
      • If the regex does not include the /g flag:
        • Returns a one-element array that contains the first substring that matches the regex. The array has a few extra properties (such as index and input) that describe the match in more detail.
        • Returns null if no matches are found.
      • If the regex includes the /g flag:
        • Returns an array of all instances of the matching substrings. This array lacks the additional information provided when /g isn't used.
        • Returns null if no matches are found.
  • Python
    • re.search
      • Returns a Match object that describes the first instance of a substring that matches the regex.
      • Returns None if no matches are found.
    • re.findall
      • Returns a list of all instances of substrings that match a regex.
      • Returns an empty list if no matches are found.
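
The Python entries in the list above can be seen in a short sketch. The sample text and patterns here are made up purely for illustration:

```python
import re

text = "cat hat bat"    # sample text, chosen only for this illustration

# re.search describes the first match with a Match object.
first = re.search(r'[chb]at', text)
print(first.group())    # cat
print(first.start())    # 0

# re.search returns None when nothing matches.
missing = re.search(r'dog', text)
print(missing)          # None

# re.findall returns every matching substring as a list.
every = re.findall(r'[chb]at', text)
print(every)            # ['cat', 'hat', 'bat']

# re.findall returns an empty list when nothing matches.
none_found = re.findall(r'dog', text)
print(none_found)       # []
```

Since None is falsy and a Match object is truthy, the result of re.search can go straight into an if statement, as in the URL example earlier.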

Each of these tools can also deal with capturing groups, though that complicates matters somewhat. We won't worry about it here.

In Ruby, you sometimes see something like this:

fetch_url(text) if text =~ /\Ahttps?:\/\/\S+\z/

=~ is similar to match, except that it returns the index within the string at which the regex matched, or nil if there was no match. =~ is measurably faster than match, so some Rubyists prefer to use it when they can. Others dislike it because it is unfamiliar, or solely because =~ reminds them of the Perl language where it saw widespread use.

Splitting Strings

Applications that process text often must analyze data composed of records and fields separated by special delimiter characters. A typical format has records separated by newlines and fields delimited by tabs. Such data often needs parsing before you can use it in your program; the split method is a handy parsing tool.

split is frequently used with a simple string as a delimiter:

record = "xyzzy\t3456\t334\tabc"
fields = record.split("\t")
p fields
# ["xyzzy", "3456", "334", "abc"]
let record = "xyzzy\t3456\t334\tabc";
let fields = record.split("\t");
console.log(fields);
// [ 'xyzzy', '3456', '334', 'abc' ]
record = "xyzzy\t3456\t334\tabc"
fields = record.split("\t")
print(fields)
# ['xyzzy', '3456', '334', 'abc']
import re

record = "xyzzy\t3456\t334\tabc"
fields = re.split(r'\t', record)
print(fields)
# ['xyzzy', '3456', '334', 'abc']

As you can see, split returns an array (a list in Python) that contains the values from each of the split fields. Note that Python has two split methods: str.split, which takes a string as a delimiter, and re.split, which uses a regex delimiter.
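
The examples above split a single record. As a minimal sketch of the newline-and-tab format described earlier, the following splits a whole block of data into records first and then into fields. The second record is invented for the illustration:

```python
# Hypothetical tab-delimited data: one record per line, fields separated by tabs.
data = "xyzzy\t3456\t334\tabc\nplugh\t7890\t12\tdef"

# Split into records at newlines, then split each record into fields at tabs.
records = [line.split("\t") for line in data.split("\n")]
print(records)
# [['xyzzy', '3456', '334', 'abc'], ['plugh', '7890', '12', 'def']]
```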

Not all delimiters are as simple as that, though. Sometimes, formatting is much more relaxed. For example, you may encounter data where arbitrary whitespace characters separate fields, and there may be more than one whitespace character between each pair of items. The regex form of split comes in handy in such cases:

record = "xyzzy  3456  \t  334\t\t\tabc"
fields = record.split(/\s+/)
p fields
# ["xyzzy", "3456", "334", "abc"]
let record = "xyzzy  3456  \t  334\t\t\tabc";
let fields = record.split(/\s+/);
console.log(fields);
// [ 'xyzzy', '3456', '334', 'abc' ]
record = "xyzzy  3456  \t  334\t\t\tabc"
fields = record.split()
print(fields)
# ['xyzzy', '3456', '334', 'abc']

fields = record.split(' ')
print(fields)
# Oops: ['xyzzy', '', '3456', '', '\t', '', '334\t\t\tabc']
import re

record = "xyzzy  3456  \t  334\t\t\tabc"
fields = re.split(r'\s+', record)
print(fields)
# ['xyzzy', '3456', '334', 'abc']

Note that calling Python's str.split with no delimiter splits at runs of one or more whitespace characters. If you pass it a literal space, it splits at every space character.

Beware of regex like /:*/ and /\t?/ when using split. Recall that the * quantifier matches zero or more occurrences of the pattern it is modifying, while ? matches zero or one occurrence. In the case of split, the result may be totally unexpected:

'abc:xyz'.split(/:*/)
# -> ['a', 'b', 'c', 'x', 'y', 'z']

'abc:xyz'.split(/\t?/)
# -> ['a', 'b', 'c', ':', 'x', 'y', 'z']

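In the first case, you get a six-element array instead of the two-element array you may have expected; in the second case, you get a seven-element array. These results occur because each regex can match the empty string, so it matches the gaps between letters: zero occurrences of : (or of a tab) occur between each pair of adjacent characters.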

Similar behaviors arise in Python and Ruby, though the exact results can differ from language to language.
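
Here is a sketch of what this looks like with Python's re.split, assuming Python 3.7 or later (earlier versions did not split on empty matches). Note that Python also reports the empty strings it finds at the edges of the string and next to the colon:

```python
import re

# A pattern that can match the empty string splits between every character.
loose = re.split(r':*', 'abc:xyz')
print(loose)
# ['', 'a', 'b', 'c', '', 'x', 'y', 'z', '']

# Requiring at least one colon gives the two fields you probably wanted.
strict = re.split(r':+', 'abc:xyz')
print(strict)
# ['abc', 'xyz']
```

The usual remedy is to choose a quantifier such as + that cannot match zero occurrences of the delimiter.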

Capture Groups: A Diversion

Before moving on to the final methods in our whirlwind tour, we first need to talk about capture groups. (Regex also have non-capture groups, but we won't cover them here.) You've already encountered capture groups, though we called them something different at the time: grouping parentheses. We didn't mention it then, but these meta-characters have another function: they provide capture and non-capture groups.

Capture groups capture the matching characters that correspond to part of a regex. You can reuse these matches later in the same regex, and when constructing new values based on the matched string.

We'll start with a simple example. Suppose you need to match quoted strings inside some text, where either single or double quotes delimit the strings. How would you do that using the regex patterns you know? You might consider:

/['"].+?['"]/

as your first attempt to match quotes, but you'll soon find that it also matches strings that mix single and double quotes. This may not be what you want. Instead, you need a way to capture the opening quote and reuse that character for the closing quote. It's time to call on capture groups:

/(['"]).+?\1/

Here the group captures the part of the string that matches the pattern between parentheses; in this case, either a single or double quote. We then match one or more of any character and end with \1. We call this sequence a backreference; it references the first capture group in the regex. If the first group matches a double quote, then \1 matches a double quote, but not a single quote.
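
To see the difference between the two patterns, here is a small Python sketch. The sample strings are assumptions chosen for the illustration; the last one opens with a double quote and closes with a single quote:

```python
import re

mixed = r'''['"].+?['"]'''      # first attempt: any quote opens, any quote closes
matched = r'''(['"]).+?\1'''    # backreference: closing quote must equal the opening quote

samples = ['"hello"', "'hello'", '"hello\'']

mixed_results = [bool(re.search(mixed, sample)) for sample in samples]
matched_results = [bool(re.search(matched, sample)) for sample in samples]

print(mixed_results)      # [True, True, True]
print(matched_results)    # [True, True, False]
```

Only the version with the backreference rejects the string with mismatched quotes.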

It may be more reasonable to use two regex to solve this problem:

if text.match(/".+?"/) || text.match(/'.+?'/)
  puts "Got a quoted string"
end

It's easier to read and maintain when written like this. However, you will almost certainly encounter problems where a single regex with a backreference is the preferred solution.

A regex may contain multiple capture groups, numbered from left to right as groups 1, 2, 3, and so on, up to 9. As you might expect, the backreferences are \1, \2, \3, ..., and \9.
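
As a brief illustration of group numbering, the following Python sketch uses a made-up date string with three capture groups, then a second pattern in which \1 and \2 refer back to the first two groups:

```python
import re

# Hypothetical input; the three groups are numbered 1, 2, 3 from left to right.
m = re.search(r'(\d{4})-(\d\d)-(\d\d)', 'Released on 2024-01-15.')
print(m.group(1))    # 2024
print(m.group(2))    # 01
print(m.group(3))    # 15
print(m.groups())    # ('2024', '01', '15')

# \2 must repeat whatever group 2 matched, and \1 whatever group 1 matched.
mirrored = re.search(r'(\w)(\w)\2\1', 'abba')
print(mirrored.group())    # abba
```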

Note that Ruby, JavaScript, and Python all support named groups and named backreferences, but these are beyond the scope of this book. If you find yourself needing multiple groups in a regex, you may want to investigate them.

While you can use capture groups in any regex, they are most useful in conjunction with methods that use regex to transform strings. The sections that follow introduce those methods.

By the way: did you notice that lazy quantifier in our regex? Why do you think we used that here?
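
One way to explore that question is to run both the greedy and the lazy version against a string that contains more than one quoted phrase. The sample sentence in this Python sketch is invented for the purpose:

```python
import re

text = 'He said "yes" and then "no".'

# Greedy: .+ grabs as much as it can, so the match runs to the last quote.
greedy = re.search(r'''(['"]).+\1''', text).group()
print(greedy)    # "yes" and then "no"

# Lazy: .+? stops at the first closing quote it can use.
lazy = re.search(r'''(['"]).+?\1''', text).group()
print(lazy)      # "yes"
```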

Transformations in Ruby

While regex-based transformations in the Big Three are conceptually similar, the implementations are different. We'll cover these transformations in separate sections.

Transforming a string with regex involves matching that string against the regex, and using the results to construct a new value. In Ruby, we typically use String#sub and String#gsub. #sub transforms the first part of a string that matches a regex, while #gsub transforms every part of a string that matches.

Here's a simple example:

text = 'Four score and seven'
vowelless = text.gsub(/[aeiou]/, '*')
# -> 'F**r sc*r* *nd s*v*n'

Here we replace every vowel in text with an *.

Transformations in JavaScript

Transforming a string with regex involves matching that string against the regex, and using the results of the match to construct a new value. In JavaScript, we can use the replace method, which transforms the first part of a string that matches a regex. If the regex includes a g flag, the transformation applies to every match in the string.

Here's a simple example:

let text = 'Four score and seven';
let vowelless = text.replace(/[aeiou]/g, '*');
// -> 'F**r sc*r* *nd s*v*n'

Here we replace every vowel in text with an *. We applied the transformation globally since we used the g flag on the regex.

Transformations in Python

Transforming a string with regex involves matching that string against the regex, and using the results to construct a new value. In Python, we typically use re.sub. It transforms every substring that matches a regex to a new value. You can use the count keyword argument to limit the number of changes.

Here's a simple example:

import re

text = 'Four score and seven'

vowelless = re.sub(r'[aeiou]', '*', text)
print(vowelless)           # F**r sc*r* *nd s*v*n

first_3 = re.sub(r'[aeiou]', '*', text, count=3)
print(first_3)             # F**r sc*re and seven

In the first re.sub invocation, we replaced every vowel with an *. In the second, we replaced only the first three vowels.
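
Capture groups become useful in transformations because the replacement string can refer to them. The following is a minimal sketch with made-up inputs; in Python, \1 and \2 in the replacement stand for the text captured by groups 1 and 2:

```python
import re

# Hypothetical input: a name written as "Last, First".
name = 'Lovelace, Ada'
flipped = re.sub(r'(\w+), (\w+)', r'\2 \1', name)
print(flipped)      # Ada Lovelace

# Group 1 captures the quote (and \1 in the pattern matches the closing quote);
# group 2 captures the quoted text, which the replacement wraps in brackets.
text = "Four score and 'seven' years"
bracketed = re.sub(r'''(['"])(.+?)\1''', r'[\2]', text)
print(bracketed)    # Four score and [seven] years
```

Using a raw string (r'...') for the replacement keeps Python from treating \1 and \2 as escape sequences.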

Summary

We now conclude our little dive into the regex ocean. We hope you've learned a lot and enjoyed the experience. We have one more section: it includes a regex cheat sheet and a few other useful tidbits.

But before you proceed, take a little while to work the exercises below. In these exercises, write your code using your language of choice. Rubyists may want to use IRB to test their methods, JavaScripters can check their answers in node or their browser's JavaScript console, and Pythonistas can use the Python REPL.

Exercises

  1. Write a method that returns true if its argument looks like a URL, false if it does not.

    Examples:

    url?('https://launchschool.com')     # -> true
    url?('http://example.com')           # -> true
    url?('https://example.com hello')    # -> false
    url?('   https://example.com')       # -> false
    
    isUrl('https://launchschool.com');   // -> true
    isUrl('http://example.com');         // -> true
    isUrl('https://example.com hello');  // -> false
    isUrl('   https://example.com');     // -> false
    
    is_url('https://launchschool.com')    # -> true
    is_url('http://example.com')          # -> true
    is_url('https://example.com hello')   # -> false
    is_url('   https://example.com')      # -> false
    

    Solution

    def url?(text)
      !!text.match(/\Ahttps?:\/\/\S+\z/)
    end
    
    def url?(text)
      text.match?(/\Ahttps?:\/\/\S+\z/)
    end
    
    let isUrl = function (text) {
      return !!text.match(/^https?:\/\/\S+$/);
    };
    
    import re
    
    def is_url(text):
        return bool(re.search(r'^https?://\S+$', text))
    

    Note that we use !! to coerce the result of the match call to a boolean value in our first Ruby solution and in our JavaScript solution. More recent Ruby versions add the String#match? method, which we demonstrate in our second Ruby solution. In Python, bool plays the same role as !!.

  2. Write a method that returns all of the fields in a haphazardly formatted string. A variety of spaces, tabs, and commas separate the fields, with possibly multiple occurrences of each delimiter.

    Examples:

    fields("Pete,201,Student")     # ["Pete", "201", "Student"]
    fields("Pete \t 201   ,  TA")  # ["Pete", "201", "TA"]
    fields("Pete \t 201")          # ["Pete", "201"]
    fields("Pete \n 201")          # ["Pete", "\n", "201"]
    
    fields("Pete,201,Student");    // ['Pete', '201', 'Student']
    fields("Pete \t 201   ,  TA"); // ['Pete', '201', 'TA']
    fields("Pete \t 201");         // ['Pete', '201']
    fields("Pete \n 201");         // ['Pete', '\n', '201']
    
    fields("Pete,201,Student")     # ['Pete', '201', 'Student']
    fields("Pete \t 201   ,  TA")  # ['Pete', '201', 'TA']
    fields("Pete \t 201")          # ['Pete', '201']
    fields("Pete \n 201")          # ['Pete', '\n', '201']
    

    Solution

    def fields(text)
      text.split(/[ \t,]+/)
    end
    
    let fields = function (text) {
      return text.split(/[ \t,]+/);
    };
    
    import re
    
    def fields(text):
        return re.split(r'[ \t,]+', text)
    

    Note that we don't use \s here since we want to split at spaces and tabs, but not at other whitespace characters such as the newline in the last example.

  3. Write a method that changes the first arithmetic operator (+, -, *, /) in a string to a '?' and returns the resulting string. Don't modify the original string.

    Examples:

    mystery_math('4 + 3 - 5 = 2')
    # '4 ? 3 - 5 = 2'
    
    mystery_math('(4 * 3 + 2) / 7 - 1 = 1')
    # '(4 ? 3 + 2) / 7 - 1 = 1'
    
    mysteryMath('4 + 3 - 5 = 2');
    // '4 ? 3 - 5 = 2'
    
    mysteryMath('(4 * 3 + 2) / 7 - 1 = 1');
    // '(4 ? 3 + 2) / 7 - 1 = 1'
    
    mystery_math('4 + 3 - 5 = 2')
    # '4 ? 3 - 5 = 2'
    
    mystery_math('(4 * 3 + 2) / 7 - 1 = 1')
    # '(4 ? 3 + 2) / 7 - 1 = 1'
    

    Solution

    def mystery_math(equation)
      equation.sub(/[+\-*\/]/, '?')
    end
    
    let mysteryMath = function (equation) {
      return equation.replace(/[+\-*\/]/, '?');
    };
    
    import re
    
    def mystery_math(equation):
        return re.sub(r'[+\-*\/]', '?', equation, count=1)
    

    Note that we need to escape the - character in our character class so it gets interpreted as a literal hyphen, not a range specification. We also must escape the / character in the Ruby code; in the JavaScript and Python code, we don't need to escape the / character but do so here for consistency.

  4. Write a method that changes every arithmetic operator (+, -, *, /) to a '?' and returns the resulting string. Don't modify the original string.

    Examples:

    mysterious_math('4 + 3 - 5 = 2')
    # '4 ? 3 ? 5 = 2'
    mysterious_math('(4 * 3 + 2) / 7 - 1 = 1')
    # '(4 ? 3 ? 2) ? 7 ? 1 = 1'
    
    mysteriousMath('4 + 3 - 5 = 2');
    // '4 ? 3 ? 5 = 2'
    mysteriousMath('(4 * 3 + 2) / 7 - 1 = 1');
    // '(4 ? 3 ? 2) ? 7 ? 1 = 1'
    
    mysterious_math('4 + 3 - 5 = 2')
    # '4 ? 3 ? 5 = 2'
    
    mysterious_math('(4 * 3 + 2) / 7 - 1 = 1')
    # '(4 ? 3 ? 2) ? 7 ? 1 = 1'
    

    Solution

    def mysterious_math(equation)
      equation.gsub(/[+\-*\/]/, '?')
    end
    
    let mysteriousMath = function (equation) {
      return equation.replace(/[+\-*\/]/g, '?');
    };
    
    import re
    
    def mysterious_math(equation):
        return re.sub(r'[+\-*\/]', '?', equation)
    

    Note that we now use the gsub method in Ruby, and apply the g flag to the regex in JavaScript. In Python, we drop the count keyword argument.

  5. Write a method that changes the first occurrence of the word apple, blueberry, or cherry in a string to danish.

    Examples:

    danish('An apple a day keeps the doctor away')
    # -> 'An danish a day keeps the doctor away'
    
    danish('My favorite is blueberry pie')
    # -> 'My favorite is danish pie'
    
    danish('The cherry of my eye')
    # -> 'The danish of my eye'
    
    danish('apple. cherry. blueberry.')
    # -> 'danish. cherry. blueberry.'
    
    danish('I love pineapple')
    # -> 'I love pineapple'
    
    danish('An apple a day keeps the doctor away');
    // -> 'An danish a day keeps the doctor away'
    
    danish('My favorite is blueberry pie');
    // -> 'My favorite is danish pie'
    
    danish('The cherry of my eye');
    // -> 'The danish of my eye'
    
    danish('apple. cherry. blueberry.');
    // -> 'danish. cherry. blueberry.'
    
    danish('I love pineapple');
    // -> 'I love pineapple'
    
    danish('An apple a day keeps the doctor away')
    # -> 'An danish a day keeps the doctor away'
    
    danish('My favorite is blueberry pie')
    # -> 'My favorite is danish pie'
    
    danish('The cherry of my eye')
    # -> 'The danish of my eye'
    
    danish('apple. cherry. blueberry.')
    # -> 'danish. cherry. blueberry.'
    
    danish('I love pineapple')
    # -> 'I love pineapple'
    

    Solution

    def danish(text)
      text.sub(/\b(apple|blueberry|cherry)\b/, 'danish')
    end
    
    let danish = function (text) {
      return text.replace(/\b(apple|blueberry|cherry)\b/, 'danish');
    };
    
    import re
    
    def danish(text):
        return re.sub(r'\b(apple|blueberry|cherry)\b',
                      'danish', text, count=1)
    

    Note that pineapple is not changed in the last example for each language.
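
The reason pineapple survives in the last exercise is the pair of \b word-boundary anchors in the solution. As a quick sketch, here is the same Python substitution with and without them:

```python
import re

with_boundaries = r'\b(apple|blueberry|cherry)\b'
without_boundaries = r'(apple|blueberry|cherry)'

text = 'I love pineapple'

# \b requires a word boundary before "apple", and "pineapple" has none there.
kept = re.sub(with_boundaries, 'danish', text, count=1)
print(kept)       # I love pineapple

# Without \b, the "apple" inside "pineapple" is replaced.
changed = re.sub(without_boundaries, 'danish', text, count=1)
print(changed)    # I love pinedanish
```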