Basic Matching

In this section, we'll get our feet wet in the calmer waters of the regex ocean with a quick introduction to regex patterns that match substrings. We'll also touch on some more intricate patterns: those that can match one of two or more subpatterns.

We will run most of our examples through Rubular, so please go ahead and open it now. You can enter each pattern and see what happens when you attempt to match it against a variety of different strings. Watch for how Rubular highlights matched characters; it shows you what your regex matches. Note that you can enter multiple lines in the test strings area.

JavaScript and Python programmers can use Scriptular or pythex instead of Rubular to test regex. However, there are some differences in behavior from Rubular. Our narrative uses Rubular's behavior; avoid confusion and use it for this book even if you're learning regex for JavaScript or Python.

Note that these test beds provide the / characters that delimit regex in most languages. You shouldn't type the / characters yourself when entering a regex in a test bed. However, remember that you need them in your programs.

Python doesn't use / to delimit regex, but pythex still shows them. We will also use them in most examples.

In Python code, regex are represented as ordinary strings. However, the plethora of backslashes in regex means that a regex that looks like /\d\d\w\w\d\d/ in Ruby or JavaScript looks like '\\d\\d\\w\\w\\d\\d' in Python. To avoid leaning toothpicks like this, you can use a Python raw string when writing regex: r'\d\d\w\w\d\d'. While the raw string syntax only has an effect when backslashes are present, we will opt for consistency: all of our Python regex will be specified as raw strings.
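As a quick check of the raw-string point, the following sketch compares the two spellings in Python. The sample string 12ab34 is our own invention for illustration.

```python
import re

# The same regex written two ways: with doubled backslashes, and as a raw string.
escaped = '\\d\\d\\w\\w\\d\\d'
raw = r'\d\d\w\w\d\d'

# Both spellings produce the identical Python string.
same_pattern = (escaped == raw)
print(same_pattern)

# Hypothetical sample text, chosen only for illustration.
sample = '12ab34'
found = re.search(raw, sample) is not None
print(found)
```

Since the two strings are identical, the choice between them is purely about readability.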

You only need to worry about quotes and raw strings when writing Python code. pythex doesn't need either.

Alphanumerics

The most basic regex of all is one that matches a specific alphanumeric character. You can construct such a regex by placing the letter or number of interest between two slashes (/).

For example, /s/ matches the letter s anywhere it appears inside a string. It matches s, sand, cats, cast, and even Mississippi. In this last example, /s/ matches four times, at each of the occurrences of s in the string.

When we say that /s/ matches four times, we refer specifically to how regex work in Rubular. By default, in most languages, a regex matches each string once or not at all; that is, regex matching is a boolean operation. We won't mention this again until near the end of the book.
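To make the distinction concrete, here is a minimal Python sketch. It uses re.search for the yes-or-no check and re.findall to list every occurrence. The choice of re.findall is ours, for illustration only; Ruby and JavaScript have their own ways of doing this.

```python
import re

text = 'Mississippi'

# re.search stops at the first match; we only learn whether one exists.
has_s = re.search(r's', text) is not None

# re.findall collects every non-overlapping match, similar to what
# Rubular highlights.
all_s = re.findall(r's', text)

print(has_s)
print(len(all_s))
```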

Note, however, that /s/ does not match S or KANSAS. Regex are case sensitive by default, so if you want to match a capital S, you need to specify /S/.

Go ahead and give this a try in Rubular: enter the regex /s/, then enter the following strings:

s
sand
cats
cast
Mississippi
S
KANSAS

Rubular should highlight all the s characters in the "Match result" box, thus showing where the regex matches. However, Rubular doesn't highlight the uppercase S characters, since the regex doesn't match the last two strings. If you change the regex to /S/, Rubular should light up all the S characters, but not the s-es.

Great. This discussion is interesting, but how do you put it to use in a real program? Regex usage in a program is language dependent, and also dependent upon what you need to do. As a starter, though, you can use the match method from the Ruby and JavaScript string classes; for Python, you can use the re.search function from the re module.

Ruby:

mystr = 'cast'

print "matched 's'" if mystr.match(/s/)
print "matched 'x'" if mystr.match(/x/)

JavaScript:

let mystr = 'cast';

if (mystr.match(/s/)) {
  console.log("matched 's'");
}

if (mystr.match(/x/)) {
  console.log("matched 'x'");
}

Python:

import re

mystr = 'cast'

if re.search(r's', mystr):
    print("matched 's'")

if re.search(r'x', mystr):
    print("matched 'x'")

All three of these print matched 's' since mystr contains the letter 's'. On the other hand, none of them prints matched 'x' since mystr does not contain the letter 'x'.

Ruby and JavaScript: If you aren't acquainted with match already, you can learn enough with a few minutes skimming the documentation. We won't use anything more complex than the basic form of match that takes a single regex argument and a string caller. You can ignore the rest of the documentation for now.

Python: If you aren't acquainted with re.search already, you can learn enough with a few minutes skimming the documentation. We won't use anything more complex than the basic form of search that takes a regex and a string as arguments. You can ignore the rest of the documentation for now.

Special Characters

Regex can also match non-alphanumeric characters. However, some of those have special meaning in a pattern and require specialized treatment. Others have no additional interpretation and need no special treatment.

The following characters have special meaning in the Big Three regex:

$ ^ * + ? . ( ) [ ] { } | \ /

There is one exception in Python: / is not special in Python regex.

We call such characters meta-characters. If you want to match a literal meta-character, you must escape it with a leading backslash (\). To match a question mark, for instance, use the regex /\?/. Go ahead and try /\?/ in Rubular now with these strings (and some of your own if you aren't sure what will happen):

?
What's up, doc?
Silence!
"What's that?"

You should find that /\?/ matches all of the question marks in these strings. Try the same strings using /?/ - you should see an error message instead.
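The same experiment can be sketched in Python. This is only an illustration: it assumes the standard re module, where an invalid pattern raises re.error, and it uses re.escape, which adds the backslashes for you.

```python
import re

strings = ['?', "What's up, doc?", 'Silence!', '"What\'s that?"']

# An escaped question mark matches a literal ? character.
matches = [bool(re.search(r'\?', s)) for s in strings]
print(matches)

# An unescaped ? is a meta-character with nothing to act on, so the
# pattern is invalid and Python refuses to compile it.
try:
    re.compile(r'?')
    compiled = True
except re.error:
    compiled = False
print(compiled)

# re.escape builds the escaped form of a literal string.
escaped = re.escape('?')
print(escaped)
```

Only Silence! fails to match, since it contains no question mark.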

Inside square brackets, the rules for meta-characters change. We'll talk about meta-characters in "character classes" a little later.

Some variants of regex have different meta-characters, and some reverse the sense of escaped characters. In vim, for example, \( and \) are meta-characters, while ( and ) match literal parentheses. This reversal can be confusing, but you must be aware of it.

In recent years, programs that use regex have begun to support multiple regex styles. vim, for instance, now has what it calls extended syntax, which provides enhanced regex and also lets you swap the way escaped characters work. You can choose to use ( and ) for grouping like most other programs, and use \( and \) for literal parentheses. Check your documentation to see whether your software supports different syntaxes.

The remaining characters aren't meta-characters; they have no special meaning inside a regex. Both colons (':') and spaces (' ') fall into this category. You can use these characters without an escape since they have no special meaning inside a pattern. For example, try /:/ against these strings:

chris:x:300
A thought; no, forget it.
::::

Try changing the regex to / /.

As of this writing, Rubular does not detect a single space as a regex. Try /[ ]/ instead - this is equivalent to / /, but it works in Rubular.

Now change the regex to /./ (that's a period between the / characters). Whoa! What happened here? Oh, right, . is a meta-character; you must escape it. Change the regex to /\./ instead. That's better, isn't it? (We'll return to /./ and why everything lit up in a later chapter.)
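If you'd like to see the same thing outside Rubular, the sketch below uses Python's re.findall (our choice for illustration) to list what each pattern matches in one of the strings above.

```python
import re

text = 'A thought; no, forget it.'

# A colon and a space need no escape. This string has no colons.
colons = re.findall(r':', text)
spaces = re.findall(r' ', text)

# An unescaped . is a meta-character and matches every character here.
any_char = re.findall(r'.', text)

# An escaped \. matches only a literal period.
periods = re.findall(r'\.', text)

print(len(colons), len(spaces), len(any_char), len(periods))
```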

You don't need to memorize the list of meta-characters. You can escape all non-alphanumerics even when you don't need to. However, it's good to get a feel for which are meta-characters; unnecessary escapes make your regex harder to read. Keep the list of meta-characters handy until you have them fully loaded into your brain.

Concatenation

You can concatenate two or more patterns into a new pattern that matches each of the originals in sequence. The regex /cat/, for instance, consists of the concatenation of the c, a, and t patterns, and matches any string that contains a c followed by an a followed by a t.

Give /cat/ a try using the following strings:

cat
catalog
copycat
scatter
the lazy cat.
CAT
cast

If all went well, the first five strings matched the regex, but the last two did not. CAT didn't match since it is uppercase, and cast didn't match because s isn't part of the pattern.
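For readers who want to confirm this in code, here is a small Python sketch that runs the same strings through re.search. The variable names are ours.

```python
import re

strings = ['cat', 'catalog', 'copycat', 'scatter', 'the lazy cat.', 'CAT', 'cast']

# Keep only the strings that contain c, then a, then t, in that order.
matched = [s for s in strings if re.search(r'cat', s)]
unmatched = [s for s in strings if not re.search(r'cat', s)]

print(matched)
print(unmatched)
```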

The fact that we use a fancy name like concatenation should give you a hint that more is going on here than meets the eye. The patterns we concatenated are simple; they each match a single, specific character. We aren't limited to these simple patterns though; in fact, you can concatenate any pattern to another to produce a larger regex. There are no practical limits to the number of concatenations you perform other than the physical limitations of your hardware.

This fundamental idea is one of the more important concepts behind regex; patterns are the building blocks of regex, not characters or strings. You can construct complex regex by concatenating a series of patterns, and you can analyze a complex regex by breaking it down into its component patterns.

In theory, your computer's capacity to handle large regex places some limitations on the size and complexity of your regex. In practice, though, your ability to understand and maintain your code places more severe restrictions on it. Your head will reach the breaking point long before your computer does. You'll sometimes hear regex called write-only expressions or line noise because it's easy to write an unreadable and unmaintainable mess. Use regex not because you can; use them because your code demands them. Often, a bit of refactoring will eliminate the need for a complex regex.

Alternation

In this section, we introduce alternation, a simple way to construct a regex that matches one of several sub-patterns. In its most basic form, you write two or more patterns separated by pipe (|) characters, and then surround the entire expression in parentheses. For example, try the regex /(cat|dog|rabbit)/ with the following strings:

The lazy cat.
The dog barks.
Down the rabbit hole.
The lazy cat, chased by the barking dog,
dives down the rabbit hole.
catalog
The Yellow Dog
My bearded dragon's name is Darwin

As with other patterns, case matters, so the Dog in The Yellow Dog is not matched.

As with concatenation, there are no built-in restrictions on alternation.
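The alternation example can also be sketched in Python. We assume re.finditer here so that we can list every match on each line, much as Rubular highlights them; the variable names are ours.

```python
import re

lines = [
    'The lazy cat.',
    'The dog barks.',
    'Down the rabbit hole.',
    'The lazy cat, chased by the barking dog,',
    'dives down the rabbit hole.',
    'catalog',
    'The Yellow Dog',
    "My bearded dragon's name is Darwin",
]

pattern = re.compile(r'(cat|dog|rabbit)')

# group(0) is the full text of each match.
results = {line: [m.group(0) for m in pattern.finditer(line)] for line in lines}

for line, found in results.items():
    print(repr(line), found)
```

The last two lines produce empty lists: Dog differs in case, and dragon contains none of the three alternatives.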

Even though parentheses and pipes are meta-characters that require escaping, we don't do that here. We aren't performing a literal match, but are instead using the "meta" meaning of those characters.

To see the difference, give the regex /\(cat\|dog\)/ a try with the following strings:

(cat|dog)
bird(cat|dog)zebra
cat
dog

You'll notice this time that we don't match either cat or dog; since we escaped everything, the regex matcher looks for literal instances of those characters and doesn't treat them as an alternation operation.
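A short Python sketch makes the contrast explicit by running both the escaped and unescaped patterns over the same four strings. The list names are ours.

```python
import re

strings = ['(cat|dog)', 'bird(cat|dog)zebra', 'cat', 'dog']

# Escaped: the parentheses and the pipe must appear literally in the string.
literal = [bool(re.search(r'\(cat\|dog\)', s)) for s in strings]

# Unescaped: alternation, so either cat or dog anywhere is enough.
alternation = [bool(re.search(r'(cat|dog)', s)) for s in strings]

print(literal)
print(alternation)
```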

Control Character Escapes

Most modern computing languages use control character escapes in strings to represent characters that don't have a visual representation. For example, \n, \r, and \t are nearly universal ways to represent line feeds, carriage returns, and tabs, respectively. The Big Three support these escapes, as do most regex engines. For example:

Ruby:

text = "1 2 3 \t 4 5 6"
puts "has tab" if text.match(/\t/)

JavaScript:

let text = "1 2 3 \t 4 5 6";
if (text.match(/\t/)) {
  console.log("has tab");
}

Python:

import re

text = "1 2 3 \t 4 5 6"
if re.search(r'\t', text):
    print("has tab")

All three print has tab since text contains a tab character.

Note that not everything that looks like a control character escape is a genuine control character escape. For instance:

  • \s and \d are special character classes (we'll cover these later)
  • \A and \z are anchors (we'll cover these as well)
  • \x and \u are special character code markers (we won't cover these)
  • \y and \q have no special meaning at all
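Python adds one subtlety that a sketch may help with: a control character escape works whether Python or the regex engine interprets it. The example below is illustrative only; the second line in the sample text is our own addition so that there is a line feed to find.

```python
import re

# Sample text with a tab and a line feed (the second line is our addition).
text = "1 2 3 \t 4 5 6\nsecond line"

# Raw string: the regex engine receives backslash-t and treats it as a tab.
raw_tab = re.search(r'\t', text) is not None

# Ordinary string: Python converts \t to a tab before the regex engine sees it.
plain_tab = re.search('\t', text) is not None

# \n works the same way for line feeds.
newline_count = len(re.findall(r'\n', text))

# \d is a character class, not a control character: it matches each digit.
digits = re.findall(r'\d', text)

print(raw_tab, plain_tab, newline_count, digits)
```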

Ignoring Case

As we've seen, regex are case sensitive by default. If you want to match a lowercase s, you need to use a lowercase s in your regex. If you want to match an uppercase S, you must use an S in your regex.

You can change this default behavior by appending an i to the closing / of a regex, which makes the entire regex ignore case. For example, try the pattern /launch/ against these strings:

I love Launch School!
LAUNCH SCHOOL! Gotta love it!
launchschool.com

You should see one match -- launch in the domain name. Now add an i flag (or option or modifier) to the regex, i.e., /launch/i and try again. This time, Rubular will highlight all three instances of launch without regard to their case. Nifty!

Python: To use Python regex in a case-insensitive manner, add a flags=re.I or flags=re.IGNORECASE argument to the re.search call:

re.search(r'd', 'xyzDabc', flags=re.I)
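To tie this back to the Rubular experiment, here is a small Python sketch that runs the launch pattern over the same three strings, with and without the flag. The list names are ours.

```python
import re

strings = [
    'I love Launch School!',
    'LAUNCH SCHOOL! Gotta love it!',
    'launchschool.com',
]

# Default behavior: case sensitive, so only the all-lowercase launch matches.
sensitive = [bool(re.search(r'launch', s)) for s in strings]

# With the flag, case is ignored and all three strings match.
insensitive = [bool(re.search(r'launch', s, flags=re.IGNORECASE)) for s in strings]

# re.I is a short alias for re.IGNORECASE.
same_flag = (re.I == re.IGNORECASE)

print(sensitive)
print(insensitive)
print(same_flag)
```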

There are other useful flags like /i, but the flags are language specific. We don't have a dedicated section to discuss any other flags, some of which we'll meet later. See the documentation for your language of choice for a complete list of available flags.

Summary

The discussion so far is straightforward. You've learned the basic regex syntax, seen an example of using regex, and played around with a few basic regex. You've also learned about one of the fundamental concepts behind regex: concatenation of patterns. In the next chapter, we'll explore a little further and examine regex that can match any set of characters.

But, before you proceed, take a little while to work the exercises below. In these exercises, use Rubular to write and test your regex. You don't need to write any code.

Exercises

  1. Write a regex that matches an uppercase K. Test it with these strings:

    Kx
    BlacK
    kelly
    

    There should be two matches.

    Solution

    /K/
    

    The correct matches are K at the beginning of line 1, and K at the end of line 2.

  2. Write a regex that matches an uppercase or lowercase H. Test it with these strings:

    Henry
    perch
    golf
    

    There should be two matches.

    Solution

    /h/i
    

    If you are using Python, you will need to use re.I or re.IGNORECASE:

    re.search(r'h', text, re.IGNORECASE)
    

    An alternative solution is to replace h with alternation and remove the flag:

    /(h|H)/
    

    The correct matches are H at the beginning of line 1, and h at the end of line 2.

    Can you think of a situation where you might want to use alternation instead of the i flag?

  3. Write a regex that matches the string dragon. Test it with these strings:

    snapdragon
    bearded dragon
    dragoon
    

    There should be two matches.

    Solution

    /dragon/
    

    The regex should match the word dragon at the end of lines 1 and 2.

  4. Write a regex that matches any of the following fruits: banana, orange, apple, strawberry. The fruits may appear in other words. Test it with these strings:

    banana
    orange
    pineapples
    strawberry
    raspberry
    grappler
    

    There should be five matches.

    Solution

    /(banana|orange|apple|strawberry)/
    

    Note that our regex matches apple in the words pineapples and grappler. You'll learn how to prevent this later on.

    The solution matches everything except raspberry.

  5. Write a regex that matches a comma or space. Test your regex with these strings:

    This line has spaces
    This,line,has,commas,
    No-spaces-or-commas
    

    There should be seven matches.

    Solution

    /( |,)/
    

    The expression should match three spaces on line 1 and four commas on line 2.

  6. Challenge: Write a regex that matches blueberry or blackberry, but write berry precisely once. Test it with these strings:

    blueberry
    blackberry
    black berry
    strawberry
    

    There should be two matches.

    Hint: you need both concatenation and alternation.

    Solution

    /(blue|black)berry/
    

    The key to this challenge is that concatenation works with patterns, not characters. Thus, we can concatenate (blue|black) with berry to produce the final result.

    The expression matches the first two lines.

    How come the regex doesn't match black berry?