The Power of Regex: A Beginner’s Guide

This is a very old piece I wrote, even before the release of OpenOffice 2.4. I’m bringing it back because people often ask me for quick text editing tricks. I’ll show you how to do it using RegEx. The OpenOffice RegEx engine is quite usable. Microsoft Word can do similar things too, but it handles them in a very peculiar and cumbersome way. In the Find and Replace function, you need to enable the “Use wildcards” (or similar) option, and then you can use a kind of dumbed-down version of RegEx.

Searching within texts, selecting certain parts, and replacing them are everyday tasks. When editing a larger manuscript, such as a thesis or a book, this might be needed a thousand times, and often for the same kinds of changes. For example, in Hungarian text, straight quotation marks at the beginning of words need to be replaced with low-9 quotes, and closing quotes with high-9 ones. Sequences like space-hyphen-space need to be replaced with dashes. You’ll want to standardize the formatting of numbers, phone numbers, and units of measurement. In such Sisyphean tasks, regular expressions (RegExp or RegEx for short) can save a lot of time.

Regular expressions can also be used in programming tasks, such as input validation, database queries, or any situation where you need to detect a pattern in a text or match character sequences.

You’ll encounter RegEx in many places. On UNIX systems, it’s widely used in text-searching tools (like grep, egrep), in text editors (such as emacs, sed, or vi), and even the bash shell supports it. Many programming languages (e.g., C, Perl, Delphi, Java) support RegEx, and many text editors do as well—for example, Notepad++, OpenOffice.org, and even Microsoft Word. There are some differences between various RegEx engines, so expressions may need to be slightly adjusted from one application to another. The available features also vary between programs, but the core logic is the same.

To a beginner, something like \<[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\> might look intimidatingly complex. But once you understand the unique logic of RegEx, the meaning becomes quite straightforward. We’ll decode this shortly, but let’s not get ahead of ourselves.

What follows is intended for complete beginners and demonstrates RegEx usage with OpenOffice.org. We won’t cover everything, so for those interested in diving deeper, references at the end of the article will help guide further exploration.

In OpenOffice.org Writer, go to Edit → Find & Replace to open the standard dialog. Click More Options, and you’ll see the Regular expressions checkbox. If you enable this, the software will interpret the text in the Search for and Replace with fields as regular expressions.

The Regex Riddle of Column 80

While cleaning out the attic, a yellowed corner of a newspaper turned up. On it was a partially solved crossword puzzle fragment. In the vertical column number 80, the following letters could be seen: O___A_A_Z. What could the solution be? The clues are missing, so we can only guess. Let’s grab a dictionary and look up words that could fit! In the attached file o_letter_words.txt (UTF-8 encoded), there are about 2000 Hungarian words that start with the letter O. That should be enough. Let’s open the file and start searching!

We’re looking for a word that matches the pattern above. In the Search for field, enter o...a.a.z (without quotation marks). In RegEx, the dot (.) stands for any single character. Make sure to enable the Regular expressions option, and ensure that Match case is turned off! When clicking the Find button, Writer returns words like odaragaszt and odatapaszt. Clearly, something’s not right here.

Searching with Regular Expressions in OpenOffice.org

One issue is that these words are longer than they should be. The pattern matches them, but they continue with an extra “t” at the end. Another problem is that in crossword puzzles, accented and unaccented letters are not distinguished—so the first letter could be ó, ö, or ő, and the a could also be á. You can express this in RegEx by listing possible characters inside square brackets: [oóöő]...[aá].[aá].z. With this expression, we’ll get many more matches.

Inside square brackets, you can list as many characters as you like, or you can use a hyphen to specify a full alphabetical range. The pattern will match if any one of the listed characters appears. It’s very important to note that […] always matches exactly one character. If you want to cover more characters, you can specify how many using curly braces {…}.

For example, our previous search expression can also be written like this: [oóöő].{3}[aá].[aá].z. The .{3} means that after the first letter (any of o, ó, ö, ő), there should be exactly three arbitrary characters.

Another thing is that we’re only interested in words that start with the letter o and end with z. We don’t want words where this pattern appears in the middle. You can indicate this with two more special symbols: \< marks the beginning of a word, and \> marks the end of a word. So our full expression becomes: \<[oóöő].{3}[aá].[aá].z\>.

With that, it’s clear that the correct answer must be orvvadász or orvhalász.
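
The same search can be tried outside Writer. The following is a minimal Python sketch, assuming the standard re module and a small hypothetical word list standing in for o_letter_words.txt. Python does not support the \< and \> word markers, so re.fullmatch is used here to anchor the pattern to the whole word instead.

```python
import re

# Hypothetical sample; the article's real list lives in o_letter_words.txt.
words = ["odaragaszt", "odatapaszt", "orvvadász", "orvhalász", "oroszlán"]

# Same pattern as in the text. fullmatch() requires the whole word to match,
# which plays the role of \< and \> in OpenOffice.org.
pattern = re.compile(r"[oóöő].{3}[aá].[aá].z", re.IGNORECASE)

candidates = [w for w in words if pattern.fullmatch(w)]
print(candidates)
```

The two ten-letter words drop out because the anchored pattern allows exactly nine characters.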

It’s also worth noting that quantifiers ({…}, +, *) behave greedily. This means they match as many characters as possible. For instance, the expression a.*a in the word adavakedavra won’t stop at ada, but will match the entire adavakedavra.
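
Greedy matching is easy to observe in a scripting language. This is a small sketch using Python's re module as a stand-in for the Writer dialog; the variable names are only for illustration.

```python
import re

word = "adavakedavra"

# .* grabs as much as it can, so the match runs to the last "a" in the word.
greedy = re.search(r"a.*a", word).group()
print(greedy)
```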

Money, money, money, must be funny

Let’s put what we’ve learned to use. Let’s write a RegEx that hunts for bank account numbers in documents. A good starting point for matching sequences like 12345678-12345678-12345678 would be the expression:
\<[0-9]{8}-[0-9]{8}-[0-9]{8}\>

But there’s a problem: account numbers aren’t always written like that. First of all, not all of them are 24 digits long—some are only 16 digits. Secondly, some people use spaces instead of dashes, while others write the entire number without any separators.

We could try replacing the dashes with [ -]. That way, the expression would match numbers written with spaces—but then it won’t match the ones with dashes anymore. Why? The reason is simple: inside square brackets, the dash doesn’t act like a character, but as a range operator (just like in [0-9]). Since the second half of the range is missing in [ -], the RegEx engine simply doesn’t understand it.

We need to tell the engine explicitly that we mean a literal dash. That’s what the backslash (\) is for—it “escapes” the character that follows, protecting it from the RegEx processor. So we must write the dash as \-. Similarly, if you want to match a dot, you’d write \..
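
Escaping can be illustrated with a short Python sketch (an assumption of this example; the article itself works in OpenOffice.org). Python happens to be more lenient than the behaviour described above and also accepts a trailing unescaped dash inside brackets as a literal, but the escaped form is unambiguous in both tools.

```python
import re

sample = "3.14 3x14"

# An unescaped dot matches any character, so both tokens are found.
unescaped = re.findall(r"3.14", sample)

# An escaped dot matches only a literal dot.
escaped = re.findall(r"3\.14", sample)

# An escaped dash inside brackets is a literal dash, not a range operator.
separators = re.findall(r"[ \-]", "12-34 56")

print(unescaped, escaped, separators)
```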

So a version of our pattern that’s tolerant to both dashes and spaces would be:

\<[0-9]{8}[ \-][0-9]{8}[ \-][0-9]{8}\>

And what if the account number is written as a single block, without any separators? That’s when we can add a quantifier to the [ \-] part—like [ \-]{0,1} (note that the zero must be written explicitly here). But there’s a simpler option: use a question mark:

\<[0-9]{8}[ \-]?[0-9]{8}[ \-]?[0-9]{8}\>

The question mark means that the preceding element appears once or not at all.

In some RegEx implementations, the question mark can also be used to make greedy quantifiers non-greedy. For example, in the word adavakedavra, the pattern a.*?a would first match ada, then akeda, and so on—rather than gobbling up the entire adavakedavra as a.*a would.
In OpenOffice.org, the question mark doesn’t work like that, but it does in tools like Dreamweaver.
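
Python's re module is one of the implementations where the question mark does switch a quantifier to non-greedy mode, so the behaviour can be checked there. A minimal sketch:

```python
import re

word = "adavakedavra"

# .*? takes as few characters as possible, so each match stops at the next "a".
lazy_matches = re.findall(r"a.*?a", word)
print(lazy_matches)
```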

Finally, we still need to make the expression work for both 16- and 24-digit account numbers. RegEx allows grouping parts of a pattern with parentheses—just like in math. So we can rewrite our expression like this:

\<[0-9]{8}([ \-]?[0-9]{8}){1,2}\>

This reads as follows: match 8 digits at the beginning of the word, followed by 1 or 2 groups consisting of 8 more digits, each optionally preceded by a space or dash.
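
For readers who want to run the hunt from a script, here is a sketch in Python. It is an adaptation, not the OpenOffice.org expression itself: \b replaces \< and \>, and the group is written as (?:...) so that re.findall returns whole matches. The sample sentence and the account numbers in it are made up.

```python
import re

# 8 digits, then one or two further blocks of 8 digits,
# each optionally preceded by a space or a dash.
ACCOUNT = re.compile(r"\b[0-9]{8}(?:[ \-]?[0-9]{8}){1,2}\b")

text = ("Transfer to 12345678-12345678-12345678, or to 11112222 33334444; "
        "old ref 1234567812345678.")

found = ACCOUNT.findall(text)
too_short = ACCOUNT.findall("123456789")

print(found)
```

The closing \b matters: without it, the first 16 or 24 digits of a longer digit run would be reported as an account number.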

Phew! Let the account number hunt begin!

Stuttering and Other Complex Examples

A well-known text error is word repetition, also known as stuttering. This phenomenon is mostly a byproduct of computer-based text editing. There are several methods to detect it. For example, we can search for repeated words using the expression:

\<([:alpha:]+)\>(.{1,3})\1

Here, \1 is a backreference to the portion previously captured in parentheses.

This expression means: look for a word made up of letters, followed by 1 to 3 arbitrary characters, and then the same word again.

The [:alpha:] element is not supported in every RegEx implementation. It comes from the POSIX regex extensions and represents an alphabetic character. OpenOffice.org supports it, and it includes accented characters too. According to what we’ve seen earlier, this is equivalent to the character class [a-záéőúüóöűí], assuming the “Match case” option is turned off.
(Incidentally, OpenOffice.org handles this option correctly for accented letters as well, though many other applications may require special care when dealing with uppercase and lowercase accented characters.)
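
The stuttering search can be approximated in Python as well. This sketch makes several assumptions: [^\W\d_] is used as a stand-in for [:alpha:], \b replaces \< and \>, and an extra \b is added after the backreference so that a pair like "the then" is not reported. The sample sentence is invented.

```python
import re

# A word of letters, 1 to 3 arbitrary characters, then the same word again.
STUTTER = re.compile(r"\b([^\W\d_]+)\b(.{1,3})\1\b", re.IGNORECASE)

text = "It is is a known error that that happens when editing."

# group(1) holds the repeated word.
doubled = [m.group(1) for m in STUTTER.finditer(text)]
print(doubled)
```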

Understanding the examples in the table should now be easy. You can try all of them out on the sample text provided on the disk.

Replacement with a RegEx backreference in Notepad++. (The checkbox label “Behelyettesítéssel” means “Substitution”.)

Examples of Searching for More Complex Patterns

Expression    What it finds
<[a-z/][a-z]*>    HTML (or XML, etc.) tags
\(\<[:alpha:]*\>\)    Words in parentheses
\<[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\>    Email addresses
\<(\+36|06)?[\- \/\(]*(1|[0-9]{2})[\)\- \/\(]*([0-9\- ]{6,9})\>    Hungarian phone numbers
\<((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\>    IP addresses
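
Two of these patterns, the email and the IP address expressions, can be turned into simple validators. The sketch below is in Python and assumes re.fullmatch in place of the \< and \> markers; the helper names is_email and is_ipv4 are hypothetical.

```python
import re

EMAIL = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}", re.IGNORECASE)

# One number between 0 and 255, as in the table's IP address pattern.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
IPV4 = re.compile(rf"({OCTET}\.){{3}}{OCTET}")


def is_email(s: str) -> bool:
    """True if the whole string looks like an email address."""
    return EMAIL.fullmatch(s) is not None


def is_ipv4(s: str) -> bool:
    """True if the whole string is four dot-separated numbers in 0..255."""
    return IPV4.fullmatch(s) is not None


print(is_email("jane.doe@example.com"), is_ipv4("192.168.1.1"))
```

Note that the {2,4} at the end of the email pattern rejects top-level domains longer than four letters, so treat it as a rough filter rather than a full validator.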

Find and Replace

Many software tools allow you to use backreferences to parts of the search RegEx pattern during replacement. This can be done using \1, \2, \3, etc., similar to how backreferences work within RegEx, or using $1, $2, $3 and so on.

For example, to convert dates written in the form 2008-03-15 or 2008/03/15 into the proper Hungarian format (2008. 03. 15.), you would search with the RegEx:
([21]?[0-9]{3})[\-\/]([01]?[0-9])[\-\/]([0-3]?[0-9])
and in the Replace with field, you would enter:
$1. $2. $3.

This means: take the part of the match captured by the first group, add a dot and a space, then the second group, and so on.
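
The same replacement can be sketched in Python, which is one of the tools that writes backreferences in the replacement text as \1, \2, \3 rather than $1, $2, $3. The function name to_hungarian_dates and the sample sentence are made up for this example.

```python
import re

# Year, month, and day, separated by a dash or a slash.
DATE = re.compile(r"([21]?[0-9]{3})[\-\/]([01]?[0-9])[\-\/]([0-3]?[0-9])")


def to_hungarian_dates(text: str) -> str:
    """Rewrite 2008-03-15 or 2008/03/15 as 2008. 03. 15."""
    return DATE.sub(r"\1. \2. \3.", text)


converted = to_hungarian_dates("Due 2008-03-15 or 2008/03/15")
print(converted)
```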

Note: Tomas Bilek has written a free alternative search tool for OpenOffice.org. It’s easy to use—after installation, it appears right on the toolbar. It makes extensive use of RegEx capabilities and includes great prebuilt patterns. It can search styles, comments, footnotes, tables, images, text frames, and many other elements. During replacements, you can reference up to 9 parenthesized expressions. Among other things, it can insert non-breaking spaces, line and column breaks, objects, clipboard content, and styles during replacement. And best of all, it supports batch processing. You can save a series of find-and-replace operations and run them automatically one after the other.

Essential RegEx Patterns

RegEx Expression    Meaning
. (dot)             Matches any single character.
[characters]        Matches any one of the listed characters.
[^characters]       Matches any character not listed.
[a-e]               A character between a and e (inclusive). Note: uppercase and lowercase letters are distinct. If both are needed, use [a-zA-Z]. Accented characters come after the basic alphabet, so [a-z] matches only unaccented English letters.
[a-z0-9]            A letter or a digit.
[^0-9]              Any character that is not a digit.
(…)                 Groups expressions. E.g., ék(es)?írás matches both ékírás and ékesírás. Grouped expressions can be referenced later.
|                   Alternation (OR). For example, (apple|pear) matches both apple and pear.
{n}                 Matches the preceding character exactly n times. E.g., hal{2} matches hall.
{3,5}               Matches the preceding character at least 3, but no more than 5 times. E.g., or.{2,4}t finds a match in orvost and orangután, but not in ország, where no t follows. {,4} means up to 4 times; {2,} means at least 2 times, with no upper limit.
*                   Matches zero or more occurrences (same as {0,}).
+                   Matches one or more occurrences (same as {1,}).
?                   Matches zero or one occurrence of the preceding character.
\<                  Word boundary (start of a word). E.g., \<vackor matches vackor and all its inflected forms.
\>                  Word boundary (end of a word). E.g., (andó|endő)\> matches the Hungarian suffixes -andó and -endő.
^                   Start of a paragraph.
$                   End of a paragraph.
\                   Escapes the character that follows, so it’s treated literally. Exceptions: \>, \<, \x, \t, \n. \x#### represents a Unicode character by its hexadecimal code. \t is a tab. In OpenOffice.org, \n has special meanings: in the Search field it represents a line break (Shift+Enter), in the Replace field it means a paragraph break (Enter).
\1, \2, \3          Backreferences to previously matched parenthesized groups. They do not apply the group’s pattern again; they match the exact text that the corresponding group captured.
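
A few of the table entries can be checked quickly in Python. This is only a sketch: Python's re module shares these particular constructs with OpenOffice.org, but not every entry in the table (\< and \>, for instance, are not available there).

```python
import re

# (pattern, text) pairs taken from the examples in the table.
examples = [
    (r"ék(es)?írás", "ékírás"),    # optional group left out
    (r"ék(es)?írás", "ékesírás"),  # optional group present
    (r"(apple|pear)", "pear"),     # alternation
    (r"hal{2}", "hall"),           # exact repetition of the preceding character
    (r"[^0-9]", "x"),              # negated character class
]
results = [bool(re.fullmatch(p, s)) for p, s in examples]

# A backreference repeats the captured text, not the group's pattern:
same_letter = bool(re.fullmatch(r"([a-e])\1", "aa"))
other_letter = bool(re.fullmatch(r"([a-e])\1", "ab"))

print(results, same_letter, other_letter)
```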

Find and Replace: Practical Tips

What you search for    What you replace it with    Description
\n    \n    Convert line breaks (Shift+Enter) to paragraph breaks (Enter); in OpenOffice.org, \n means a line break in the Search field and a paragraph break in the Replace field
\t    ' ' (a single space)    Replace tab characters with a single space
[:space:]{2,}    ' ' (a single space)    Collapse multiple consecutive spaces into one
^[:space:]    (nothing)    Remove leading spaces used as indentation
[:space:]$    (nothing)    Remove trailing spaces at the end of lines
^$    (nothing)    Remove empty lines
--    ' – ' (space, en dash, space)    Convert double hyphens to an en dash
\.{2,3}    … (ellipsis character)    Replace two or three dots with a proper ellipsis
\>[:space:]([…,;\.!\?])    $1    Remove the incorrect space before punctuation marks that stick to the previous word
([““]|”)\<    „ (low double quote)    Fix incorrect opening quotation marks in Hungarian text
\>([““„]|”)    ” (upper double quote)    Fix incorrect closing quotation marks in Hungarian text
([01]?[0-9])/([0-3]?[0-9])/([21]?[0-9]{0,3})    $3.$1.$2    Convert dates in MM/DD/YYYY and M/D/YY format to YYYY.MM.DD and YY.M.D format
\<[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\>    (search only)    Find email addresses
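
Several of these clean-ups can be chained in a script. The sketch below is a Python approximation, not a transcription of the OpenOffice.org patterns: a plain space stands in for [:space:], \b for \>, and the function name tidy is hypothetical. The double-hyphen replacement runs before the space clean-up so that the spaces it introduces get collapsed too.

```python
import re


def tidy(text: str) -> str:
    """Apply a handful of the find-and-replace tips in sequence."""
    text = text.replace("\t", " ")                  # tabs to single spaces
    text = text.replace("--", " – ")                # double hyphen to spaced en dash
    text = re.sub(r" {2,}", " ", text)              # collapse runs of spaces
    text = re.sub(r"^ +| +$", "", text, flags=re.MULTILINE)  # trim every line
    text = re.sub(r"\b ([…,;.!?])", r"\1", text)    # no space before punctuation
    text = re.sub(r"\.{2,3}", "…", text)            # dots to the ellipsis character
    return text


cleaned = tidy("Hello ,\tworld  --  wait...  ")
print(cleaned)
```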

References
