A Short Guide to Regular Expressions

Regular expressions and how to use them to search the Enzyme List

1. What are regular expressions?

A regular expression, commonly abbreviated as “regex” or “regexp”, is a sequence of alphanumeric characters and symbols from the ASCII character set that forms a text-matching template.

At its simplest, a regexp can be used for case-sensitive searching. Most characters from the ASCII character set match themselves; hence, “Glucose” and “glucose” match themselves, but not each other.
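
As a rough illustration, the case-sensitive behaviour can be tried out in Python. Using Python's re module is an assumption made for this guide's examples; the Enzyme List search has its own regexp engine, and the sample text below is made up.

```python
import re

# Hypothetical sample text, for illustration only.
text = "Glucose oxidase acts on D-glucose"

# Literal characters match themselves, and matching is case-sensitive.
upper_hits = re.findall("Glucose", text)
lower_hits = re.findall("glucose", text)

print(upper_hits)  # ['Glucose']
print(lower_hits)  # ['glucose']

# "Glucose" and "glucose" do not match each other.
print(re.fullmatch("Glucose", "glucose"))  # None
```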

The exceptions are the reserved characters, which are

   $  .  ^  *  ?  (  )  \  <  >  {  }  [  ]  +  -

To search for these characters literally, each must be preceded by a backslash ‘\’, a process referred to as “escaping”:

   \$  \.  \^  \*  \?  \(  \)  \\  \<  \>  \{  \}  \[  \]  \+  \-
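
The effect of escaping can be sketched in Python as follows. This assumes Python's re module, whose set of reserved characters differs slightly from the list above (for example, ‘<’ and ‘>’ are not special in Python); the enzyme name is used only as sample text.

```python
import re

# Sample name containing the reserved characters '(', ')' and '+'.
name = "alcohol dehydrogenase (NADP+)"

# Escaped by hand: each reserved character is preceded by a backslash.
manual = r"alcohol dehydrogenase \(NADP\+\)"

# re.escape() inserts the backslashes automatically.
automatic = re.escape(name)

matched_manual = re.search(manual, name) is not None
matched_auto = re.search(automatic, name) is not None

# Unescaped, the parentheses form a group and '+' repeats the 'P',
# so the literal '(' in the name is never matched.
matched_raw = re.search(name, name) is not None

print(matched_manual, matched_auto, matched_raw)  # True True False
```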

2. What are the meanings of these special characters?

To search for zero or more occurrences of a character in a text string, append the ‘*’ symbol to that character. The regexp

a*

would match the null string ‘’, as well as ‘a’, ‘aa’, ‘aaa’, etc.

a+

would match one or more occurrences of ‘a’, thus: ‘a’, ‘aa’, or ‘aaa’, etc.

ab?

would specify that ‘a’ is followed optionally by a ‘b’.
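
A minimal sketch of the three repetition symbols, again assuming Python's re module. The helper fully_matches is a hypothetical name introduced here; it keeps only the strings that a pattern matches from start to end.

```python
import re

def fully_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["", "a", "aa", "aaa", "ab", "abb", "b"]

star = fully_matches("a*", candidates)       # zero or more 'a'
plus = fully_matches("a+", candidates)       # one or more 'a'
optional = fully_matches("ab?", candidates)  # 'a', optionally followed by 'b'

print(star)      # ['', 'a', 'aa', 'aaa']
print(plus)      # ['a', 'aa', 'aaa']
print(optional)  # ['a', 'ab']
```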

The full stop, or period, symbol matches any single character except a NEWLINE.
Therefore, the regexp

.*

matches zero or more occurrences of any character. This is a powerful regexp, which should be used with caution since it can match more characters than anticipated.
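
The following sketch, which assumes Python's re module and uses made-up sample strings, shows both points: ‘.*’ takes as many characters as it can, and it stops at a NEWLINE.

```python
import re

line = "alcohol dehydrogenase"

# 'a.*e' runs from the first 'a' to the LAST 'e' on the line,
# not to the nearest one.
greedy = re.search("a.*e", line).group()
print(greedy)  # alcohol dehydrogenase

two_lines = "alcohol\ndehydrogenase"

# '.' does not match a NEWLINE, so '.*' stops at the end of the first line.
first_line = re.search(".*", two_lines).group()
print(first_line)  # alcohol
```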

   [  ]  -

Square brackets are used to denote ranges of characters. Thus, [a-z] matches any single lowercase letter; [A-Z] matches any single capital letter; [A-Zabc] matches any single capital letter or any of the lowercase letters a, b or c (but not d through z); [A-Za-z] matches any single letter; and [0-9] matches any single digit. Ranges can also be combined with the repetition symbols *, + and ?, so that, for example, [0-9]+ matches one or more digits. Since the full stop is a reserved character, it must be escaped to be matched literally; to match an EC number, the following regexp could be used:

   [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
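
As a hedged example, the EC-number regexp can be tried on a sentence in Python. The re module, the name EC_NUMBER and the sample sentence are assumptions made for illustration and are not part of the Enzyme List search itself.

```python
import re

# The regexp from the text: four groups of digits separated by literal full stops.
EC_NUMBER = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

# Sample sentence, for illustration only.
text = "Hexokinase (EC 2.7.1.1) and glucokinase (EC 2.7.1.2) phosphorylate glucose."

found = EC_NUMBER.findall(text)
print(found)  # ['2.7.1.1', '2.7.1.2']

# Without the backslashes, '.' matches any character, so a string that is
# clearly not an EC number is accepted as well.
unescaped_ok = re.fullmatch(r"[0-9]+.[0-9]+.[0-9]+.[0-9]+", "2x7y1z1") is not None
escaped_ok = EC_NUMBER.fullmatch("2x7y1z1") is not None
print(unescaped_ok, escaped_ok)  # True False
```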