Regular Expressions

This is an introduction to regular expressions for those of you who have heard the phrase “just use a regular expression on that string to extract the data…” but don’t understand what a “regular expression” is. I’ll write RE instead of “regular expression” from now on to save myself some typing.

The phrase given above hints that one should use a RE on a text string and that it can be used to extract data from such a string. This is what a RE does: it matches something. A RE is a short way of representing a potentially very complex pattern that is to be matched against a string of characters.

The simplest RE is one that just matches itself. An example of such a RE is “a”. This pattern matches the string “aabbaa” or any other string that contains at least one “a”.

You can combine two REs by concatenating them (writing them one after the other). The concatenation of REs A and B will match any string of the form “xxABzz”, that is, the substring matched by A must be immediately followed by the substring matched by B. So “ab” will match the string “abx” but not the string “axb”.
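Here is a minimal sketch of these first matches. It assumes Python and its standard “re” module, which the text above does not prescribe; other RE engines should behave the same way for patterns this simple. The variable names are made up for the illustration.

```python
import re

# re.search returns a match object when the pattern occurs somewhere in the
# string, and None otherwise.
single = re.search("a", "aabbaa")  # a lone "a" is found right at the start
joined = re.search("ab", "abx")    # "a" immediately followed by "b"
split = re.search("ab", "axb")     # the "x" gets in the way

print(single is not None)  # True
print(joined is not None)  # True
print(split is not None)   # False
```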

So far you haven’t seen anything that couldn’t be done just as easily without REs. The real power of REs lies in the use of wildcards.

A wildcard is a character that doesn’t match itself but has a special meaning instead. You have probably already seen examples of wildcards in shells: the command “rm *” deletes all (non-hidden) files in the current directory because the shell expands “*” to a list of the files in the directory.

One of the most basic uses of a wildcard in a RE is to match a character a given number of times. To match a character any number of consecutive times, append a “*” to the character like this: “a*”. This pattern will match strings like “aa”, “bab” and even a string like “b”, because “a*” asks for “a” to be present any number of times, including zero times.

If you want to match something at least once, then append a “+”: “a+” matches “a”, “aaa”, and “aba” but not “b”.

To further restrict the number of occurrences you can use a “?”. This makes the preceding character optional, so it may occur once or not at all: “ab?a” matches “aa”, “aba”, “aaa”, but not “abba”.
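The three examples above can be checked with a short sketch. It assumes Python’s standard “re” module; “found” is a hypothetical helper written for this illustration, not part of any library.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# "a*" allows zero occurrences, so it matches even a string without any "a".
star = [found("a*", s) for s in ["aa", "bab", "b"]]
# "a+" needs at least one "a".
plus = [found("a+", s) for s in ["a", "aaa", "aba", "b"]]
# "ab?a" allows at most one "b" between the two "a"s.
optional = [found("ab?a", s) for s in ["aa", "aba", "aaa", "abba"]]

print(star)      # [True, True, True]
print(plus)      # [True, True, True, False]
print(optional)  # [True, True, True, False]
```

Notice that “aaa” is accepted by “ab?a” because the pattern only has to match somewhere inside the string, and “aa” occurs inside “aaa”.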

The characters “*”, “+”, and “?” are really shortcuts into a more general system: you can append “{x,y}” to an expression to have it matched between x and y times, both inclusive. If you leave out a number, then it means “no limit in that direction”. So the equivalent RE for “a* b+ c?” is “a{0,} b{1,} c{0,1}”.
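To see that the short and the general forms agree, here is a small sketch, again assuming Python’s “re” module. The sample strings are made up for the test, and “fullmatch” is used so that the whole string has to fit the pattern.

```python
import re

short = re.compile("a* b+ c?")
general = re.compile("a{0,} b{1,} c{0,1}")

# The spaces in the patterns are ordinary characters and must appear in the
# string as well.
samples = ["aa bb c", " b ", "aaa bbb ", "a  c", " b cc"]
agree = all(
    bool(short.fullmatch(s)) == bool(general.fullmatch(s)) for s in samples
)
print(agree)  # True

# Both limits are inclusive: "a{2,3}" accepts two or three "a"s, nothing else.
counts = [bool(re.fullmatch("a{2,3}", "a" * n)) for n in range(1, 5)]
print(counts)  # [False, True, True, False]
```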

We’re almost done with the introduction of new symbols. The next wildcard is “.”, the full stop character. A full stop matches any single character, so the RE “a.b” matches “axb” and “aabb”, but not “ab”.
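A quick sketch of the full stop in action, assuming Python’s “re” module. One detail specific to that module: by default its “.” matches any character except a newline.

```python
import re

dot = re.compile("a.b")
results = {s: dot.search(s) is not None for s in ["axb", "aabb", "ab"]}
print(results)  # {'axb': True, 'aabb': True, 'ab': False}

# In "aabb" the matched substring is "aab": the "." consumed the second "a".
print(dot.search("aabb").group())  # aab
```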

The square brackets are used to match a single character from a set of allowed characters. The set can be a range of characters, a list of characters, or a combination of both. A range of characters is specified by the first character in the range, a hyphen, and then the last character in the range. An example would be the RE “[0-9]” which matches a single digit. A list is made by simply listing the allowed characters in any order: to match any of the letters “a”, “c”, or “e” use “[ace]”. You can combine a range with a list like this: “[0-9acex-z]”. This will match either a digit, one of the letters “a”, “c”, or “e”, or one of the letters in the range between “x” and “z”, that is “x”, “y”, or “z”. To match a literal hyphen, put it where it cannot be read as part of a range, for example as the last character inside the brackets. Notice that the square bracket construct always matches a single character.
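The bracket rules can be tried out one character at a time. This is a sketch assuming Python’s “re” module; the test characters are arbitrary.

```python
import re

allowed = re.compile("[0-9acex-z]")
# "7", "c" and "y" are in the set, "b" is not, and "-" is not either,
# because here the hyphens only mark the ranges.
checks = [allowed.fullmatch(ch) is not None for ch in "7cyb-"]
print(checks)  # [True, True, True, False, False]

# A hyphen placed last inside the brackets is an ordinary character.
with_hyphen = re.compile("[0-9-]")
hyphen_ok = with_hyphen.fullmatch("-") is not None
print(hyphen_ok)  # True

# The brackets match exactly one character, so two digits are too many.
two_digits = re.fullmatch("[0-9]", "42") is not None
print(two_digits)  # False
```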

You can also invert the logic so that anything in the brackets is excluded. This is done by putting a “^” inside the brackets as the first character. So to match anything that falls outside of the 26 lowercase letters of the English alphabet, you would use “[^a-z]”. You should read the “^” as “any character except”. The “^” is only special when it is the first character inside the brackets.
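The following sketch, assuming Python’s “re” module and a made-up sample string, collects every character that the inverted set accepts, and shows that a “^” further inside the brackets is just an ordinary character.

```python
import re

# Everything that is not a lowercase letter: punctuation, capitals, the
# space and the digits.
outside = re.findall("[^a-z]", "abc-DEF 42")
print(outside)  # ['-', 'D', 'E', 'F', ' ', '4', '2']

# Not in first position, so "^" is one of the allowed characters here.
literal = re.findall("[a^z]", "a^b")
print(literal)  # ['a', '^']
```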

The final two special characters are “^” (this time outside of any brackets) and “$”. They match the beginning and end of a line respectively. You use them when you want better control over the match: the RE “^a+b$” will only match a string that consists of at least one “a” followed by a single “b”. Nothing more is allowed, so “abc” is not accepted but “ab”, “aab”, and so on are.
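To see what the two anchors add, this sketch compares the anchored RE with the same RE without anchors. It assumes Python’s “re” module; the names “anchored” and “loose” are only for the illustration.

```python
import re

anchored = re.compile("^a+b$")
loose = re.compile("a+b")

samples = ["ab", "aab", "abc", "xab"]
strict_hits = [anchored.search(s) is not None for s in samples]
loose_hits = [loose.search(s) is not None for s in samples]

print(strict_hits)  # [True, True, False, False]
print(loose_hits)   # [True, True, True, True]
```

Without the anchors the pattern is happy to find “ab” in the middle of a longer string; with them, the extra “c” or “x” makes the match fail.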

You can use parentheses “()” to make groups: “^(ab)+$” will only match strings like “ab”, “abab”, and so on. The groups are also used when you are doing search-replace. Each group can be extracted from the string: the RE “^([a-zA-Z]+) ([a-zA-Z]+)$” has two groups, each of which will match a word consisting of letters from the English alphabet. If this RE matches a string, then each word will have been stored for later use.
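Here is a sketch of how the stored groups can be used afterwards, assuming Python’s “re” module. The sample string “Hello World” is made up, and the “\1” and “\2” references in the replacement are Python’s way of referring to the groups; other tools may use a different notation.

```python
import re

two_words = re.compile("^([a-zA-Z]+) ([a-zA-Z]+)$")
match = two_words.search("Hello World")

first = match.group(1)   # text matched by the first pair of parentheses
second = match.group(2)  # text matched by the second pair
print(first, second)     # Hello World

# Search and replace: put the two stored words back in the opposite order.
swapped = two_words.sub(r"\2 \1", "Hello World")
print(swapped)  # World Hello

# A group can also be repeated as a unit.
repeated = re.compile("^(ab)+$")
unit_hits = [repeated.search(s) is not None for s in ["ab", "abab", "aba"]]
print(unit_hits)  # [True, True, False]
```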

Be sure to check out my cool examples of regular expressions, for one of the best ways to learn is to look at (and try to understand!) other people’s code.

2 Comments

  1. Sleepy:

    Martin,

    this has helped me a lot to finally understand the RegEx concept. Especially the great examples.

    Thanks a lot and keep up the good work.

  2. Martin Geisler:

    Thanks, I’m happy to hear that!