Now that you're bobbing along atop the waves, it's time to relax and explore your surroundings. Get your swim fins on, and head on out into deeper waters.
Thus far, our explorations have given us a good handle on the different types of patterns that can appear in a regex. You know how to match specific characters and classes of characters, how to anchor your matches, and even how to match strings of different sizes and content. However, you've seen only a handful of examples that show what this looks like in real code. We're going to rectify that a bit in this section and introduce some of the coding options available in the Big Three. This discussion isn't comprehensive, but it does provide the tools you'll need in the future. Most developers will never need anything more.
Oddly, the Regexp (Ruby) and RegExp (JavaScript) classes don't provide the regex methods you'll use most often. Instead, the String class does. In Python, the re module supplies all the functionality you need.
We've already seen a few examples of interacting with regex in code. The techniques vary by language and intended use. For example:
fetch_url(url) if url.match(/\Ahttps?:\/\/\S+\z/)
if (url.search(/^https?:\/\/\S+$/) !== -1) {
  fetchUrl(url);
}
import re
if re.search(r'^https?://\S+$', url):
    fetch_url(url)
Here we call fetch_url(url) when a match occurs: that is, when url contains something that looks like a URL.
We won't discuss the return values of match and search in detail; see the documentation instead. For now, here's what you need to know for each language:
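String#match (Ruby): returns a MatchData object that describes the first instance of a substring that matches the regex, or nil if no matches are found.
String#scan (Ruby): returns an array of all substrings that match the regex; the array is empty if no matches are found.
String.prototype.match (JavaScript) with the /g flag: returns an array of all substrings that match the regex, or null if no matches are found.
String.prototype.match (JavaScript) when /g isn't used: returns an array that describes the first match, or null if no matches are found.
re.search (Python): returns a Match object that describes the first instance of a substring that matches the regex, or None if no matches are found.
re.findall (Python): returns a list of all substrings that match the regex; the list is empty if no matches are found.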
Each of these tools can also deal with capturing groups, though that complicates matters somewhat. We won't worry about it here.
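As a rough illustration of the difference between "first match" and "all matches", here is a small Python sketch. The sample text and pattern are made up for this example and are not taken from the chapter.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration.
text = "Visit https://example.com or http://example.org today"
pattern = r"https?://\S+"

first = re.search(pattern, text)    # a Match object, or None
every = re.findall(pattern, text)   # a list of matching substrings

if first:
    print(first.group())   # https://example.com
    print(first.start())   # 6

print(every)   # ['https://example.com', 'http://example.org']

# When nothing matches, search returns None and findall returns [].
print(re.search(pattern, "no links here"))    # None
print(re.findall(pattern, "no links here"))   # []
```

Because re.search returns None on failure, the result works directly as the condition of an if statement.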
In Ruby, you sometimes see something like this:
fetch_url(text) if text =~ /\Ahttps?:\/\/\S+\z/
=~ is similar to match, except that it returns the index within the string at which the regex matched, or nil if there was no match. =~ is measurably faster than match, so some Rubyists prefer to use it when they can. Others dislike it because it is unfamiliar, or solely because =~ reminds them of the Perl language, where it saw widespread use.
Applications that process text often must analyze data composed of records and fields separated by special delimiter characters. A typical format has records separated by newlines and fields delimited by tabs. Such data often needs parsing before you can use it in your program; the split method is a handy parsing tool.
split is frequently used with a simple string as a delimiter:
record = "xyzzy\t3456\t334\tabc"
fields = record.split("\t")
p fields
# ["xyzzy", "3456", "334", "abc"]
let record = "xyzzy\t3456\t334\tabc";
let fields = record.split("\t");
console.log(fields);
// [ 'xyzzy', '3456', '334', 'abc' ]
record = "xyzzy\t3456\t334\tabc"
fields = record.split("\t")
print(fields)
# ['xyzzy', '3456', '334', 'abc']
import re
record = "xyzzy\t3456\t334\tabc"
fields = re.split(r'\t', record)
print(fields)
# ['xyzzy', '3456', '334', 'abc']
As you can see, split returns an array (a list in Python) that contains the values from each of the split fields. Note that Python has two split methods: str.split, which takes a string as a delimiter, and re.split, which uses a regex delimiter.
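To tie this back to the newline-and-tab format described earlier, here is a minimal sketch that parses several records at once. The data string and the names data and records are invented for illustration.

```python
# Hypothetical data: records separated by newlines, fields by tabs.
data = "xyzzy\t3456\t334\tabc\nplugh\t12\t7\tdef\n"

# splitlines() breaks the text into records; split("\t") breaks each
# record into its fields.
records = [line.split("\t") for line in data.splitlines()]

for fields in records:
    print(fields)
# ['xyzzy', '3456', '334', 'abc']
# ['plugh', '12', '7', 'def']
```

A plain string delimiter is enough here because every field is separated by exactly one tab.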
Not all delimiters are as simple as that, though. Sometimes, formatting is much more relaxed. For example, you may encounter data where arbitrary whitespace characters separate fields, and there may be more than one whitespace character between each pair of items. The regex form of split comes in handy in such cases:
record = "xyzzy  3456  \t  334\t\t\tabc"
fields = record.split(/\s+/)
p fields
# ["xyzzy", "3456", "334", "abc"]
let record = "xyzzy  3456  \t  334\t\t\tabc";
let fields = record.split(/\s+/);
console.log(fields);
// [ 'xyzzy', '3456', '334', 'abc' ]
record = "xyzzy  3456  \t  334\t\t\tabc"
fields = record.split()
print(fields)
# ['xyzzy', '3456', '334', 'abc']
fields = record.split(' ')
print(fields)
# Oops: ['xyzzy', '', '3456', '', '\t', '', '334\t\t\tabc']
import re
record = "xyzzy  3456  \t  334\t\t\tabc"
fields = re.split(r'\s+', record)
print(fields)
# ['xyzzy', '3456', '334', 'abc']
Note that calling Python's str.split with no delimiter splits at runs of one or more whitespace characters. If you pass it a literal space, it splits at every space character.
Beware of regex like /:*/ and /\t?/ when using split. Recall that the * quantifier matches zero or more occurrences of the pattern it modifies, while ? matches zero or one occurrence. In the case of split, the result may be totally unexpected:
'abc:xyz'.split(/:*/)
// -> ['a', 'b', 'c', 'x', 'y', 'z']
'abc:xyz'.split(/\t?/)
// -> ['a', 'b', 'c', ':', 'x', 'y', 'z']
The first call returns a six-element array instead of the two-element array you may have expected. This result occurs because the regex matches the gaps between the letters; zero occurrences of : occur between each pair of characters.
Similar behaviors arise in Python and Ruby.
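Here is a small Python sketch of the same pitfall. It assumes Python 3.7 or later, where re.split also splits on empty matches; the exact pieces you get back can differ between languages and versions, so the example only checks that the result is not the two-field list you might expect.

```python
import re

record = "abc:xyz"

# ':*' can match the empty string, so it also "finds" a delimiter in
# the gaps between characters (Python 3.7+ behavior).
loose = re.split(r":*", record)

# ':+' needs at least one colon, so it only splits at real delimiters.
strict = re.split(r":+", record)

print(len(loose) > 2)   # True: far more pieces than the two fields
print(strict)           # ['abc', 'xyz']
```

Requiring at least one occurrence of the delimiter (+ instead of *) is usually what you want in a split pattern.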
Before moving on to the final methods in our whirlwind tour, we first need to talk about capture groups. (Regex also have non-capture groups, but we won't cover them here.) You've already encountered capture groups, though we called them something different at the time: grouping parentheses. We didn't mention it then, but these meta-characters have another function: they provide capture and non-capture groups.
Capture groups capture the matching characters that correspond to part of a regex. You can reuse these matches later in the same regex, and when constructing new values based on the matched string.
We'll start with a simple example. Suppose you need to match quoted strings inside some text, where either single or double quotes delimit the strings. How would you do that using the regex patterns you know? You might consider:
/['"].+?['"]/
as your first attempt to match quoted strings, but you'll soon find that it also matches strings that open with one kind of quote and close with the other. This may not be what you want. Instead, you need a way to capture the opening quote and reuse that character for the closing quote. It's time to call on capture groups:
/(['"]).+?\1/
Here the group captures the part of the string that matches the pattern between parentheses; in this case, either a single or double quote. We then match one or more of any other character and end with a \1: we call this sequence a backreference because it references the first capture group in the regex. If the first group matches a double quote, then \1 matches a double quote, but not a single quote.
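The following Python sketch tries the backreference pattern against a few strings. The sample strings and the name quoted are made up for illustration.

```python
import re

# Same pattern as above, written as a raw triple-quoted string so that
# both kinds of quote can appear inside it.
quoted = re.compile(r"""(['"]).+?\1""")

samples = [
    'say "hello" now',     # double quotes on both sides
    "say 'hello' now",     # single quotes on both sides
    "say \"hello' now",    # mismatched quotes
]

for sample in samples:
    match = quoted.search(sample)
    print(match.group() if match else None)
# "hello"
# 'hello'
# None
```

The third string produces no match because \1 must repeat whatever the first group captured.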
It may be more reasonable to use two regex to solve this problem:
if text.match(/".*?"/) || text.match(/'.*?'/)
  puts "Got a quoted string"
end
It's easier to read and maintain when written like this. However, you will almost certainly encounter problems where a single regex with a backreference is the preferred solution.
A regex may contain multiple capture groups, numbered from left to right as groups 1, 2, 3, and so on, up to 9. As you might expect, the backreferences are \1, \2, \3, ..., and \9.
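As a quick illustration of group numbering, here is a Python sketch with made-up sample strings. The first pattern reads the groups back by number; the second uses \2 and \1 as backreferences to find a four-letter sequence that reads the same in both directions.

```python
import re

# Three capture groups, numbered 1, 2, and 3 from left to right.
date = re.search(r"(\d+)-(\d+)-(\d+)", "Order 2024-06-15 shipped")
print(date.group(1), date.group(2), date.group(3))   # 2024 06 15
print(date.groups())                                 # ('2024', '06', '15')

# \2 repeats whatever group 2 captured, \1 whatever group 1 captured.
mirror = re.search(r"(\w)(\w)\2\1", "a noon meal")
print(mirror.group())                     # noon
print(mirror.group(1), mirror.group(2))   # n o
```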
Note that there are patterns in Ruby and Python that allow for named groups and named backreferences, but this is beyond the scope of this book. If you find yourself needing multiple groups in Ruby or Python regex, you may want to investigate these named groups and backreferences.
While you can use capture groups in any regex, they are most useful in conjunction with methods that use regex to transform strings. We'll see this in the next two sections.
By the way: did you notice that lazy quantifier in our regex? Why do you think we used that here?
While regex-based transformations in the Big Three are conceptually similar, the implementations are different. We'll cover these transformations in separate sections.
Transforming a string with regex involves matching that string against the regex, and using the results to construct a new value. In Ruby, we typically use String#sub and String#gsub. #sub transforms the first part of a string that matches a regex, while #gsub transforms every part of a string that matches.
Here's a simple example:
text = 'Four score and seven'
vowelless = text.gsub(/[aeiou]/, '*')
# -> 'F**r sc*r* *nd s*v*n'
Here we replace every vowel in text with an *.
Transforming a string with regex involves matching that string against the regex, and using the results of the match to construct a new value. In JavaScript, we can use the replace method, which transforms the matched part of a string. If the regex includes a g flag, the transformation applies to every match in the string.
Here's a simple example:
let text = 'Four score and seven';
let vowelless = text.replace(/[aeiou]/g, '*');
// -> 'F**r sc*r* *nd s*v*n'
Here we replace every vowel in text with an *. We applied the transformation globally since we used the g flag on the regex.
Transforming a string with regex involves matching that string against the regex, and using the results to construct a new value. In Python, we typically use re.sub. It transforms every substring that matches a regex to a new value. You can use the count keyword argument to limit the number of changes.
Here's a simple example:
import re
text = 'Four score and seven'
vowelless = re.sub(r'[aeiou]', '*', text)
print(vowelless) # F**r sc*r* *nd s*v*n
first_3 = re.sub(r'[aeiou]', '*', text, count=3)
print(first_3) # F**r sc*re and seven
In the first re.sub invocation, we replaced every vowel with an *. In the second, we replaced only the first three vowels.
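Capture groups become especially handy in transformations, since the replacement string can refer to what each group captured. The sketch below is one possible illustration; the sample text and the name swapped are invented for this example.

```python
import re

# Hypothetical input: "Last, First" names separated by semicolons.
text = "Lovelace, Ada; Hopper, Grace"

# Group 1 captures the last name and group 2 the first name. In the
# replacement string, \2 and \1 insert those captures in swapped order.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", text)

print(swapped)   # Ada Lovelace; Grace Hopper
```

Writing the replacement as a raw string keeps Python from treating \1 and \2 as escape sequences.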
We now conclude our little dive into the regex ocean. We hope you've learned a lot and enjoyed the experience. We have one more section: it includes a regex cheat sheet and a few other useful tidbits.
But, before you proceed, take a little while to work the exercises below. In these exercises, write your code using your language of choice. Rubyists may want to use IRB to test their methods, while JavaScripters can check their answers in node or their browser's JavaScript console.
Write a method that returns true if its argument looks like a URL, false if it does not.
Examples:
url?('https://launchschool.com') # -> true
url?('http://example.com') # -> true
url?('https://example.com hello') # -> false
url?(' https://example.com') # -> false
isUrl('https://launchschool.com'); // -> true
isUrl('http://example.com'); // -> true
isUrl('https://example.com hello'); // -> false
isUrl(' https://example.com'); // -> false
is_url('https://launchschool.com') # -> true
is_url('http://example.com') # -> true
is_url('https://example.com hello') # -> false
is_url(' https://example.com') # -> false
def url?(text)
  !!text.match(/\Ahttps?:\/\/\S+\z/)
end
def url?(text)
  text.match?(/\Ahttps?:\/\/\S+\z/)
end
let isUrl = function (text) {
  return !!text.match(/^https?:\/\/\S+$/);
};
import re
def is_url(text):
    return bool(re.search(r'^https?://\S+$', text))
Note that we use !! to coerce the result of our Ruby match call to a boolean value. More recent Ruby versions add the String#match? method, which we demonstrate in our second Ruby solution.
Write a method that returns all of the fields in a haphazardly formatted string. A variety of spaces, tabs, and commas separate the fields, with possibly multiple occurrences of each delimiter.
Examples:
fields("Pete,201,Student") # ["Pete", "201", "Student"]
fields("Pete \t 201 , TA") # ["Pete", "201", "TA"]
fields("Pete \t 201") # ["Pete", "201"]
fields("Pete \n 201") # ["Pete", "\n", "201"]
fields("Pete,201,Student"); // ['Pete', '201', 'Student']
fields("Pete \t 201 , TA"); // ['Pete', '201', 'TA']
fields("Pete \t 201"); // ['Pete', '201']
fields("Pete \n 201"); // ['Pete', '\n', '201']
fields("Pete,201,Student") # ['Pete', '201', 'Student']
fields("Pete \t 201 , TA") # ['Pete', '201', 'TA']
fields("Pete \t 201") # ['Pete', '201']
fields("Pete \n 201") # ['Pete', '\n', '201']
def fields(text)
  text.split(/[ \t,]+/)
end
let fields = function (text) {
  return text.split(/[ \t,]+/);
};
import re
def fields(text):
    return re.split(r'[ \t,]+', text)
Note that we don't use \s here since we want to split at spaces and tabs, not other whitespace characters.
Write a method that changes the first arithmetic operator (+, -, *, /) in a string to a '?' and returns the resulting string. Don't modify the original string.
Examples:
mystery_math('4 + 3 - 5 = 2')
# '4 ? 3 - 5 = 2'
mystery_math('(4 * 3 + 2) / 7 - 1 = 1')
# '(4 ? 3 + 2) / 7 - 1 = 1'
mysteryMath('4 + 3 - 5 = 2');
// '4 ? 3 - 5 = 2'
mysteryMath('(4 * 3 + 2) / 7 - 1 = 1');
// '(4 ? 3 + 2) / 7 - 1 = 1'
mystery_math('4 + 3 - 5 = 2')
# '4 ? 3 - 5 = 2'
mystery_math('(4 * 3 + 2) / 7 - 1 = 1')
# '(4 ? 3 + 2) / 7 - 1 = 1'
def mystery_math(equation)
  equation.sub(/[+\-*\/]/, '?')
end
let mysteryMath = function (equation) {
  return equation.replace(/[+\-*\/]/, '?');
};
import re
def mystery_math(equation):
    return re.sub(r'[+\-*\/]', '?', equation, count=1)
Note that we need to escape the - character in our character class so it gets interpreted as a literal hyphen, not a range specification. We also must escape the / character in the Ruby code; in the JavaScript and Python code, we don't need to escape the / character but do so here for consistency.
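To see why the escape matters, here is a small Python sketch. Without the backslash, +-* is read as a range that starts at + and ends at *; since + comes after * in character order, Python rejects the pattern. The names unescaped_ok and escaped are made up for this illustration.

```python
import re

# Unescaped hyphen: '+-*' is treated as a (backwards) character range.
try:
    re.compile(r"[+-*/]")
    unescaped_ok = True
except re.error as err:
    unescaped_ok = False
    print("invalid pattern:", err)

# Escaped hyphen: every operator is a literal character.
escaped = re.compile(r"[+\-*/]")
print(escaped.findall("4 + 3 - 5 * 2 / 1"))   # ['+', '-', '*', '/']
```

Another common convention is to place the hyphen first or last in the class, where it cannot form a range.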
Write a method that changes every arithmetic operator (+, -, *, /) to a '?' and returns the resulting string. Don't modify the original string.
Examples:
mysterious_math('4 + 3 - 5 = 2')
# '4 ? 3 ? 5 = 2'
mysterious_math('(4 * 3 + 2) / 7 - 1 = 1')
# '(4 ? 3 ? 2) ? 7 ? 1 = 1'
mysteriousMath('4 + 3 - 5 = 2');
// '4 ? 3 ? 5 = 2'
mysteriousMath('(4 * 3 + 2) / 7 - 1 = 1');
// '(4 ? 3 ? 2) ? 7 ? 1 = 1'
mysterious_math('4 + 3 - 5 = 2')
# '4 ? 3 ? 5 = 2'
mysterious_math('(4 * 3 + 2) / 7 - 1 = 1')
# '(4 ? 3 ? 2) ? 7 ? 1 = 1'
def mysterious_math(equation)
  equation.gsub(/[+\-*\/]/, '?')
end
let mysteriousMath = function (equation) {
  return equation.replace(/[+\-*\/]/g, '?');
};
import re
def mysterious_math(equation):
    return re.sub(r'[+\-*\/]', '?', equation)
Note that we now use the gsub method in Ruby, and apply the g flag to the regex in JavaScript. In Python, we drop the count keyword argument.
Write a method that changes the first occurrence of the word apple, blueberry, or cherry in a string to danish.
Examples:
danish('An apple a day keeps the doctor away')
# -> 'An danish a day keeps the doctor away'
danish('My favorite is blueberry pie')
# -> 'My favorite is danish pie'
danish('The cherry of my eye')
# -> 'The danish of my eye'
danish('apple. cherry. blueberry.')
# -> 'danish. cherry. blueberry.'
danish('I love pineapple')
# -> 'I love pineapple'
danish('An apple a day keeps the doctor away');
// -> 'An danish a day keeps the doctor away'
danish('My favorite is blueberry pie');
// -> 'My favorite is danish pie'
danish('The cherry of my eye');
// -> 'The danish of my eye'
danish('apple. cherry. blueberry.');
// -> 'danish. cherry. blueberry.'
danish('I love pineapple');
// -> 'I love pineapple'
danish('An apple a day keeps the doctor away')
# -> 'An danish a day keeps the doctor away'
danish('My favorite is blueberry pie')
# -> 'My favorite is danish pie'
danish('The cherry of my eye')
# -> 'The danish of my eye'
danish('apple. cherry. blueberry.')
# -> 'danish. cherry. blueberry.'
danish('I love pineapple')
# -> 'I love pineapple'
def danish(text)
  text.sub(/\b(apple|blueberry|cherry)\b/, 'danish')
end
let danish = function (text) {
  return text.replace(/\b(apple|blueberry|cherry)\b/, 'danish');
};
import re
def danish(text):
    return re.sub(r'\b(apple|blueberry|cherry)\b',
                  'danish', text, count=1)
Note that pineapple is not changed in the last example for each language.