Hidden Emphasis

First Expression

Regular Expressions Would Increase Your Productivity (If You Just Let Them)

Knowledge workers should take a page out of a programmer's book. Working with texts involves so many tedious tasks that it would appear irresponsible not to learn how to avoid them altogether. And to save plenty of time in the process.

Regular Expressions: Man lost in a labyrinth
Photo: unsplash.com / Tobias Rademacher (@tobbes_rd)

A large chunk of our day consists of writing. And while we might not dread writing itself, there are tasks associated with it that make us cringe. Imagine spending fifteen minutes learning a technique that will likely save you hundreds of hours of exactly this tedious work in the long run. Would you consider learning more about regular expressions?

Regular expressions explained for non-programmers

Regular expressions are sequences of characters that define a search pattern. Or simply put: Regular expressions are recipes which computers can understand and use to search and replace text. You have surely used the “search & replace” function of your text editor countless times by now. Regular expressions take it to another level by allowing more sophisticated search patterns than you are used to.

Look at this example of a regular expression:

\w+@[A-Za-z0-9_]+?\.[a-zA-Z]{2,4}

To most people, this looks like gibberish. But I’d like to make two promises to you: First, this gibberish actually makes sense. Second, you will understand what this pattern does and how it works once you’ve finished reading this article. In about fifteen minutes, you will even be able to create these kinds of patterns yourself.

Why you should learn regular expressions

Everyone who writes makes mistakes, and changes can become inevitable for a variety of other reasons, too. Oftentimes it isn’t even your fault; circumstances simply change. In the past, search & replace has probably saved you from making some of these changes by hand.

However, this basic function has its limits. Replacing “Susan” with “Susanne”? Easy. Did you realize you should’ve written “€” instead of “$”? Still no problem. But what if you sometimes forget to put a space before and after a dash? Or if you have a long list of nations next to their populations in parentheses, but you want to get rid of all the numbers?

In all those cases, a basic search & replace isn’t feasible. This function can neither differentiate between dashes that stick to other characters and those that don’t, nor can it find an arbitrary number in parentheses at the end of a line. Search & replace can only literally find what you command it to seek. You can make it look for “(123)”, but for “(321)” you’d have to start another search.

Regular expressions can be precise but also fuzzy. They can find strings of text exactly as intended, or matches that you didn’t even think of because they simply adhered to a particular pattern. Regular expressions are great when you want to find specific text, but they excel when you don’t know what exactly you’re looking for.

Sometimes, regular expressions make possible what wouldn’t be imaginable without them. And other times, they just save you a lot of time by condensing the number of searches you’d have to perform into a single efficient pattern.

Almost every programmer knows regular expressions, and it is no coincidence that they look like computer code to some people. But to believe that only programmers should know their way around regular expressions would be shortsighted. It would rob everyone else who has to work with texts of a tremendously helpful and time-saving skill.

There’s an app for that

Aside from programming languages, you can use regular expressions in various apps and tools, even in Google Analytics. For starters, a web app is sufficient. Later, you might want to explore your favorite writing app’s integrated “regex” function or even a dedicated app just for this job.

Writing apps and code editors

My favorite all-purpose writing app (that also majored in code editing) is BBEdit. It offers extensive support for regular expressions and is an excellent and fast writing app in general. Its basic search & replace function is undoubtedly powerful, but as soon as you check the “Grep” box, it becomes supercharged. “Grep pattern” is a synonym for “regular expression.”

Another example of a writing app with regex support would be Nisus Writer Pro, which has many fans among the writing community. And the popular app Scrivener offers regular expressions in their document search, too.

The app you are using might support it as well. Developers sometimes hide this function, though, because only advanced users should enable it. Otherwise, it could lead to unexpected results.

Apps just for regular expressions

If you’d prefer a dedicated app, there are a few of them on the market that focus exclusively on regular expressions. That is especially useful if you use various apps for your texts and want to work with one familiar interface.

A popular and user-friendly app is Patterns. In addition to showing you what matches your pattern in any given text, it offers search & replace support. A handy reference sheet keeps the syntax right at your fingertips.

RegExRX is another candidate, but it’s much more extensive than Patterns. It offers a template menu for frequently-used patterns and can export the matches to various file formats, for example, Microsoft Excel.

An app like that is not necessary when starting with regular expressions, but it can prove helpful. Though aimed mostly at people who use them daily, the colorful syntax highlighting and reference sheets can aid your learning even as a beginner.

Web apps (for starters)

If you don’t happen to own an app that supports regular expressions, there’s no need to download one right now. There are plenty of web apps that match text against your search patterns for free. Examples would be RegExr or RegEx101.

Many might appear complicated at first, but they all share the same structure: one field for the pattern, and another for the text. That is all you need to get going.

Learn how to use regular expressions

Even though the syntax of regular expressions appears overwhelming at first, the basic principle is straightforward. You have a search pattern and a text (e.g., a blog post or an entire book) against which you want to match the pattern. And you might also want to have a string of text you’d like to replace the matches with, though this is optional.

Let’s assume you had this poorly-formatted list of people’s names and ages:

  • Peter Smith, 46
  • Dorothy Miller,102
  • Quentin Scott, 39
  • Julia Martin,9

Imagine you wanted to get rid of the age and could hand your computer a recipe like this:

  1. Find text that:
    1. starts with a comma,
    2. is (sometimes) followed by a space,
    3. is a number consisting of 1-3 digits.
  2. Replace the matches with an empty string.

With a typical search & replace function, you could find a particular age and remove it by searching for “, 46”. But you’d have to perform another search for each name, adjusting for the number of omitted spaces. In the end, you might as well remove the age by hand. It wouldn’t make any difference in terms of time spent.

A computer could find any age using the recipe above – if only it were able to understand human language. That’s where regular expressions come into play. You can translate that recipe to a syntax that computers can understand (you’ll learn how at the end of this article). And suddenly, the search & replace function is supercharged, leaving you with endless possibilities.

Let’s find out how to create such a pattern ourselves. Please make sure to try at least some of the following examples in an app of your choice. Otherwise, the concepts might be hard to grasp.

Relieving: Most characters match themselves

Even if regular expressions appear very complex, there is one fundamental rule that should bring some relief: Most characters match themselves.

That means an “a” in the pattern matches an “a” in the text.

Pattern: a

Text: Dogs on a train.

That also works for whole words. If your pattern were “train”, it would match “train” in the corresponding text.
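If you happen to have Python at hand, you can try this outside of a regex app as well. The following is only a sketch that assumes Python’s built-in re module (my pick for illustration; any regex tool should behave the same way here), and the variable names are made up.

```python
import re

text = "Dogs on a train."

# A plain letter matches every occurrence of itself.
letter_matches = re.findall(r"a", text)

# A whole word works the same way.
word_matches = re.findall(r"train", text)

print(letter_matches)  # ['a', 'a']
print(word_matches)    # ['train']
```

Note that the single “a” is found twice: once as the standalone word and once inside “train”.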

Besides letters, there are a lot of symbols in a pattern. And it’s these special characters that make the patterns powerful.

You’ll need to escape some characters

You can use a variety of symbols to build your search pattern, such as “.”, “*”, or “$”. But what if you wanted to match them in a text?

Pattern: $

Text: Jack owes me $20.

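Unescaped, the “$” does not stand for the dollar sign but for the end of a line (more on that later). If you want to match characters literally that also serve a special purpose in regular expressions, you need to escape them. To do that, just insert a “\” in front of them.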

Pattern: \$

Text: Jack owes me $20.
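Here is a small sketch of the difference, again assuming Python’s built-in re module rather than a dedicated app. The helper re.escape() is part of that module; the variable names are only for illustration.

```python
import re

text = "Jack owes me $20."

# Unescaped, "$" means "end of the text": it matches an empty position.
unescaped = re.findall(r"$", text)

# Escaped, it matches the literal dollar sign.
escaped = re.findall(r"\$", text)

# re.escape() inserts the backslashes for you.
auto_escaped = re.escape("$20.")

print(unescaped)     # ['']
print(escaped)       # ['$']
print(auto_escaped)  # \$20\.
```

Letting a helper do the escaping is handy when the text you are looking for contains several special characters at once.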

Pars pro toto: wildcards and special characters

The reason we use symbols in regular expressions is to differentiate them from letters and numbers. They usually don’t represent themselves (unless escaped) but a significant number of other characters.

The most comprehensive symbol is the dot (.), a wildcard that stands for any character other than a hard line break. It’s the all-purpose character in regular expressions.

Pattern: .

Text: Obviously, this apple – despite its green color – is sweet.

The “.” matches every single character in the text, including spaces and dashes.
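To see the wildcard at work, you could count the matches. This is a sketch under the assumption that you use Python’s re module; the sample strings are my own.

```python
import re

text = "this apple – despite its color – is sweet."

# Each character (letters, spaces, dashes, the period) is its own match.
matches = re.findall(r".", text)
same_length = len(matches) == len(text)

# The one thing the dot skips by default is the line break.
two_lines = "ab\ncd"
without_break = re.findall(r".", two_lines)

print(same_length)    # True
print(without_break)  # ['a', 'b', 'c', 'd']
```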

Symbols like this make regular expressions powerful because they can stand for what you don’t know. After all, these patterns are so useful because you often can only provide general search terms, not specific ones.

Usually, though, the symbols are not so extensive. After all, we want a certain specificity in our patterns that allows us to find exactly what we’re looking for.

The following are examples of compound symbols that are not single characters but escaped letters. You need to escape them because otherwise, the engine will assume you meant the actual letter.

  • \d All digits (0-9)
  • \D Everything that is not a digit
  • \w Word characters (all letters, digits, and underscores)
  • \W Everything that is not a word character
  • \s Any whitespace (space, tab, carriage return, etc.)
  • \S Everything that is not a whitespace character
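A quick way to get a feeling for these shorthands is to run a few of them against the same text. The sketch below assumes Python’s re module, and the sample text is invented.

```python
import re

text = "Flat 12b, 3rd floor"

digits = re.findall(r"\d", text)      # every single digit
whitespace = re.findall(r"\s", text)  # every whitespace character
non_word = re.findall(r"\W", text)    # everything that is not a word character

print(digits)      # ['1', '2', '3']
print(whitespace)  # [' ', ' ', ' ']
print(non_word)    # [' ', ',', ' ', ' ']
```

The uppercase variants simply return whatever their lowercase siblings leave out.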

Ranges & sets – collector’s edition

It can surely help to define whether a character in your search pattern is a digit or a letter. But what if it can only be a number between 1 and 3 or a letter between a and f?

In that case, you can define a range. Put both letters or digits in square brackets with a hyphen between them.

Pattern: [a-f]

Text: Another day in paradise.

Only the characters that are in the range between a and f will match. But did you notice that the capital “A” in “Another” did not match? That’s because you only specified a range of lowercase letters.

To include uppercase letters and digits from 1 to 3, we can concatenate all ranges in the brackets.

Pattern: [A-Fa-f1-3]

Text: Another day in paradise at 5:12 am.

Notice that there are no commas that separate the different ranges from each other. If we added commas, the engine would recognize them as individual characters and try to match them.

Of course, you can string together not only ranges but also arbitrary characters. That mix will serve as a pool of characters from which the engine can choose.

Pattern: [aeiou:]

Text: Another day in paradise at 5:12 am.

By the way, we didn’t need to escape the colon in our character class (that’s what sets are also called) because it has no special meaning inside of a set.
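The three sets from this section can be compared side by side. This is a sketch assuming Python’s re module; joining the matches into one string is just my way of keeping the output short.

```python
import re

text = "Another day in paradise at 5:12 am."

# Lowercase range only: the capital "A" is skipped.
lowercase = "".join(re.findall(r"[a-f]", "Another day in paradise."))

# Several ranges concatenated, without commas.
mixed = "".join(re.findall(r"[A-Fa-f1-3]", text))

# A pool of arbitrary characters, including the colon.
pool = "".join(re.findall(r"[aeiou:]", text))

print(lowercase)  # edaaade
print(mixed)      # Aedaaadea12a
print(pool)       # oeaiaaiea:a
```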

Negating – when you know what you don’t know

Often, you don’t know what you’re looking for, but you know what you’re not looking for. That’s when negating comes in handy.

Let’s say you wanted to match every character that is not a vowel. You could write it like this:

Pattern: [^aeiou]

Text: That’s a great idea.

Notice that the caret symbol at the beginning of this set means that you want to match any character that is not a vowel. That doesn’t mean, however, that it will only match consonants. It will literally match any character other than a, e, i, o, or u. So, every apostrophe, space, dot, etc. will also be matched. After all, the engine isn’t well-versed in phonetics – it will think like a machine.
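You can make the leftovers visible by gluing all matches back together. A sketch, assuming Python’s re module; the straight apostrophe in the sample text is only there to keep the code easy to type.

```python
import re

text = "That's a great idea."

# Everything that is not a lowercase vowel: consonants, but also
# the apostrophe, the spaces, and the period.
not_vowels = "".join(re.findall(r"[^aeiou]", text))

print(not_vowels)  # Tht's  grt d.
```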

Quantifiers make your regular expressions grow

Up until now, we’ve only matched single characters. Even if all characters in a line were highlighted in previous examples, each of them represented an individual match. Take this example:

Pattern: [of]

Text: Duke of Buckingham

This pattern would match the letters “o” and “f” but not the word “of”. There is a crucial difference. You can observe this by replacing the match with the word “in”.

Pattern: [of]

Text: Duke of Buckingham

Replacement: in

Result: Duke inin Buckingham

What happened here? Both letters, “o” and “f”, were replaced with the word “in”. That’s because you searched for a set with the letters “o” and “f”, not for the word “of”. Thus, the engine found two independent matches.

To match the entire word “of”, you can use a quantifier. This determines that a particular character or set (immediately to its left) has to occur a certain number of times in order to match. Quantifiers are:

  • * Zero or more occurrences
  • + One or more occurrences
  • ? Zero or one occurrence
  • {3} Exactly three occurrences
  • {3,} At least three occurrences
  • {3,5} At least three, but no more than five occurrences

Remember the previous example. Instead of matching “o” and “f” individually, we wanted to match “of” and replace it with “in”. Since we know exactly how many characters we’re targeting, we can write it like this:

Pattern: [of]{2}

Text: Duke of Buckingham

Replacement: in

Result: Duke in Buckingham
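Both replacements can be reproduced in a few lines. The sketch assumes Python’s re module, where re.sub() plays the role of search & replace. The last line is my own addition and shows a caveat: a quantified set accepts any combination of its characters, not just the word “of”.

```python
import re

text = "Duke of Buckingham"

# Without a quantifier, "o" and "f" are two separate matches.
single = re.sub(r"[of]", "in", text)

# With {2}, two characters from the set form one match.
double = re.sub(r"[of]{2}", "in", text)

# Caveat: the set does not care about the order of its characters.
caveat = re.findall(r"[of]{2}", "food")

print(single)  # Duke inin Buckingham
print(double)  # Duke in Buckingham
print(caveat)  # ['fo']
```

If you really mean the word “of” and nothing else, the plain pattern “of” is the safer choice, since most characters match themselves.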

Not only sets, but also characters, can be quantified. Take a look at the following example:

Pattern: .*

Text: The Duke of Buckingham goes to a pub in Buckingham.

That matches the entire line because we’ve said it should match any character (except for a hard line break), and it can occur zero or more times.

Important: Regular expressions are naturally greedy. You might ask yourself why this pattern matches the whole line, while zero or one or maybe three occurrences would have sufficed to satisfy the pattern. The reason is that regular expressions will, by default, always try to match as much as possible. That’s the concept of the “longest match.”

Procrastinating with lazy quantifiers

To avoid the greediness that is inherent to regular expressions, you can use non-greedy (also known as lazy or reluctant) quantifiers:

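  • *? Zero or more occurrences, but as few as possible
  • +? One or more occurrences, but as few as possible
  • ?? Zero or one occurrence, preferably zero
  • {3}? Exactly three occurrences (the question mark makes no difference here)
  • {3,}? At least three occurrences, but as few as possible
  • {3,5}? At least three, but no more than five occurrences, and as few as possible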

Following the quantifier with a question mark tells the engine to stop when a match satisfies the pattern.

Pattern: .*\.

Text: This is Lola. She is adorable.

This pattern matches any character zero or more times up until a period (which we need to escape because it has a special meaning). With regular expressions, you can just concatenate multiple characters or character sets as you desire. The engine will go through the instructions step by step, starting on the left. Don’t separate the individual parts by a comma or space, though. Put them right next to each other.

In our example, this matches the whole line because the quantifier * is greedy and will not stop until the last period in the line. What would we have to change to match only the first sentence? We could add a question mark to the quantifier.

Pattern: .*?\.

Text: This is Lola. She is adorable.

But wait – did that still match the whole line? Not quite. These are actually two individual matches: “This is Lola.” and “ She is adorable.” Replacing these matches with the word “knock” would result in “knockknock”.

Remember that the dot technically matches the space before the second sentence as well. In the next section, you’ll learn how to target only the first sentence.
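The greedy and the lazy variant are easy to compare in code. This is a sketch assuming Python’s re module; the variable names are made up.

```python
import re

text = "This is Lola. She is adorable."

# Greedy: runs up to the last period in the line.
greedy = re.findall(r".*\.", text)

# Lazy: stops at the first period, then starts over for the rest.
lazy = re.findall(r".*?\.", text)

# Two matches mean two replacements.
replaced = re.sub(r".*?\.", "knock", text)

print(greedy)    # ['This is Lola. She is adorable.']
print(lazy)      # ['This is Lola.', ' She is adorable.']
print(replaced)  # knockknock
```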

Anchors keep your expressions in place

You often know that what you’re searching for is either at the beginning or end of a line. That can be a crucial cue for the engine. You can use the following two characters to determine precisely that:

  • ^ Beginning of the line
  • $ End of the line

And that’s also where you use them. If you wanted a match to occur only at the beginning of a line, you’d start the pattern with a caret (^). Conversely, if you wanted only matches at the end of a line, you’d end the pattern with a dollar symbol ($).

Pattern: ^.*?\.

Text: This is Lola. She is adorable.

You might be tempted to use the “$” at the end of the pattern to target just the last sentence. But that won’t work.

Pattern: .*?\.$

Text: This is Lola. She is adorable.

A non-greedy quantifier only tells the engine where to stop, not where to start. Because of the “$”, it has to match the end of a line. It will start right at the beginning of the line (because here the whole line matches the pattern), and it won’t stop before the end of the line because it simply can’t.

Note: You can use “^” and “$” at the same time to match a full line.
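Here is how the two anchored patterns behave in practice. The sketch assumes Python’s re module, which brings one quirk worth knowing: by default, “^” and “$” refer to the beginning and end of the whole text there, and only the re.MULTILINE flag makes them work line by line. The second sample text is my own.

```python
import re

text = "This is Lola. She is adorable."

# Anchored at the beginning: only the first sentence.
first = re.findall(r"^.*?\.", text)

# Anchored at the end: the lazy quantifier still has to start on the left.
last = re.findall(r".*?\.$", text)

# With several lines, re.MULTILINE makes "^" match at every line start.
two_lines = "One. Two.\nThree. Four."
per_line = re.findall(r"^.*?\.", two_lines, flags=re.MULTILINE)

print(first)     # ['This is Lola.']
print(last)      # ['This is Lola. She is adorable.']
print(per_line)  # ['One.', 'Three.']
```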

(More or less) practical examples

Regular expressions can be quite abstract. In the following examples, we’ll explore the practical uses of regular expressions and how to build the corresponding patterns.

The whitespace conundrum

One annoyance writers face is the whitespace conundrum. Throughout their texts, they omit necessary spaces, which makes their writing look careless.

For the sake of simplicity, imagine the writer sometimes forgets to put spaces before and after dashes. What could a regular expression do about that?

The easiest way would be to match all dashes and replace them with a dash surrounded by a space on each side. But our writer might sometimes remember to insert the space, so our pattern would occasionally result in double spaces.

To accommodate this, we could match all dashes with or without spaces before and after them and replace them with the adequately-spaced dash. Additionally, we could target those dashes with more than one space before or after them, which are likely mistakes as well. And while we’re at it, we should take care of the most sinister of writers who would dare to put a tab anywhere around a dash.

Pattern: \s*–\s*

Text: The weather– it’s so hot. Can we please– please!–have some water?

Replacement: “ – ”

Result: The weather – it’s so hot. Can we please – please! – have some water?

This pattern is very straightforward. It matches a dash (–) surrounded by zero or more whitespace characters (that includes tabs). It will then be replaced by a dash that is surrounded by one space on either side.

Note: We don’t have to escape the dash, as it has no special meaning. A hyphen has none either, except inside a set, where it denotes a range (as in [a-f]).
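If you’d rather fix a whole file than a single paragraph, the same pattern can be applied in code. The following is a sketch that assumes Python’s re module; the second sample with the tab is invented to show the “sinister” case.

```python
import re

text = "The weather– it's so hot. Can we please– please!–have some water?"

# Any amount of whitespace around the dash becomes exactly one space per side.
fixed = re.sub(r"\s*–\s*", " – ", text)

# Tabs and double spaces count as whitespace, too.
fixed_tab = re.sub(r"\s*–\s*", " – ", "yes\t–  no")

print(fixed)      # The weather – it's so hot. Can we please – please! – have some water?
print(fixed_tab)  # yes – no
```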

Getting rid of all the numbers

Earlier, we looked at a list of people’s names and ages:

  • Peter Smith, 46
  • Dorothy Miller,102
  • Quentin Scott, 39
  • Julia Martin,9

With regular expressions, it’s easy to get rid of the numbers, so we’re left with only the names.

Pattern: ,\s*\d{1,3}

Text:
Peter Smith, 46
Dorothy Miller,102
Quentin Scott, 39
Julia Martin,9

Replacing the pattern with an empty string would result in a neat list like this:

  • Peter Smith
  • Dorothy Miller
  • Quentin Scott
  • Julia Martin

The pattern above would even accommodate a sloppy writer who occasionally omitted the space after the comma.
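In code, the whole cleanup is a single call. This sketch assumes Python’s re module and keeps the list in one multi-line string; in real life, you would probably read it from a file.

```python
import re

text = (
    "Peter Smith, 46\n"
    "Dorothy Miller,102\n"
    "Quentin Scott, 39\n"
    "Julia Martin,9"
)

# A comma, optional whitespace, and one to three digits are replaced
# with an empty string.
names_only = re.sub(r",\s*\d{1,3}", "", text)

print(names_only)
# Peter Smith
# Dorothy Miller
# Quentin Scott
# Julia Martin
```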

Finding (overly) long sentences

Maybe you tend to go overboard with your sentences, adding more and more words, even though you know it’s enough already, but you can’t seem to stop, adding even more words, until – finally – a period comes to your rescue.

With regular expressions, we can quickly identify sentences that might be too long. Let’s say 50 characters or more. Look at this example:

Pattern: [A-Z0-9][^.?!]{50,}[.?!]

Text: Ice cream is delicious! In the summer we eat it every day. The bravest of us can even eat it in the winter, regardless of the temperatures outside.

The last sentence is 88 characters long. It matches our pattern of overly-long sentences. But how?

  • [A-Z0-9] First of all, we need to make sure that the string starts with a capital letter or a digit. We assume that a sentence usually begins like that.
  • [^.?!] The body of the sentence will likely consist of various characters but not of periods, question marks, or exclamation marks. To keep it simple, we assume there’s no quote in the sentence.
  • {50,} The aforementioned characters need to occur at least 50 times. If we don’t specify the second number (the maximum), the engine infers that we only want to set a minimum.
  • [.?!] The end of the sentence should be either a period, a question mark, or an exclamation mark.

This example is very simplistic, and an astute reader will not take long to find plenty of examples that this pattern wouldn’t capture accurately. However, it shows what such an expression could generally look like.
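To check your own drafts this way, you could list every match together with its length. A sketch, assuming Python’s re module; the threshold of 50 is the one used in the pattern above.

```python
import re

text = (
    "Ice cream is delicious! In the summer we eat it every day. "
    "The bravest of us can even eat it in the winter, "
    "regardless of the temperatures outside."
)

# Capital letter or digit, at least 50 non-terminating characters,
# then a period, question mark, or exclamation mark.
long_sentences = re.findall(r"[A-Z0-9][^.?!]{50,}[.?!]", text)

for sentence in long_sentences:
    print(len(sentence), sentence)
```

Only the last sentence shows up in the list; the first two are too short to satisfy the quantifier.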

Putting it all to the test

With your newly-acquired knowledge, look at the pattern from the beginning of this article again. Can you imagine what you could use the following pattern for?

Pattern: \w+@[A-Za-z0-9_]+?\.[a-zA-Z]{2,4}

The most telling clue in this pattern is the “@” symbol. If you guessed that this pattern could find e-mail addresses, you would be right. However, this is a very simplistic form. According to the RFC 5322 specification, an e-mail address can be much more elaborate than our example would be able to handle.
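You can watch the pattern succeed and fail with a few invented addresses. This is a sketch assuming Python’s re module; the addresses are made up for illustration.

```python
import re

pattern = r"\w+@[A-Za-z0-9_]+?\.[a-zA-Z]{2,4}"

# Simple addresses are found without trouble.
simple = re.findall(pattern, "Write to jane_doe@example.com or bob@mail.org.")

# A dot in the name or a longer domain is more than the pattern can handle:
# it only captures a fragment of the address.
tricky = re.findall(pattern, "first.last@sub.example.co.uk")

print(simple)  # ['jane_doe@example.com', 'bob@mail.org']
print(tricky)  # ['last@sub.exam']
```

The second result shows why this form is only a starting point: the match is cut off on both sides.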

If you’d like to see a pattern that actually catches 99.99% of all e-mail addresses, take a look at this one from emailregex.com:

Pattern: (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Don’t feel intimidated, though. Even senior regular expressionists would have serious trouble taming this behemoth of a pattern.