Good References

Quick Reference

Character Classes

To match either gray or grey: gr[ae]y
You can find a word, even if it is misspelled, such as: sep[ae]r[ae]te or li[cs]en[cs]e.
C-style hexadecimal number: 0[xX][A-Fa-f0-9]+
Find an identifier in a programming language: [A-Za-z_][A-Za-z_0-9]*
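
A quick way to check the character-class patterns above is to run them with Python's re module. This is only a sketch: the sample strings are made up, and other regex flavors may differ in details.

```python
import re

# Patterns are copied from the notes above; the inputs are illustrative.
gray_hits = re.findall(r"gr[ae]y", "gray grey griy")
hex_ok = re.fullmatch(r"0[xX][A-Fa-f0-9]+", "0x1F") is not None
hex_bad = re.fullmatch(r"0[xX][A-Fa-f0-9]+", "0xZZ") is not None
ident_ok = re.fullmatch(r"[A-Za-z_][A-Za-z_0-9]*", "_tmp1") is not None
ident_bad = re.fullmatch(r"[A-Za-z_][A-Za-z_0-9]*", "1tmp") is not None

print(gray_hits)            # ['gray', 'grey']
print(hex_ok, hex_bad)      # True False
print(ident_ok, ident_bad)  # True False
```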

#---------- Negated Character Classes

# Unlike the dot, negated character classes also match (invisible) line break characters.
[^0-9\r\n] matches any character that is not a digit or a line break.

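q[^u] does not mean "a q not followed by a u". It means "a q followed by a character that is not a u".
It does not match the q in the string "Iraq".
It does match the q and the space after the q in "Iraq is a country".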
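
The behaviour described here can be tried in Python. The sketch assumes the "Iraq" remarks refer to the pattern q[^u] (a q followed by one character that is not a u); the inputs are illustrative.

```python
import re

# [^0-9\r\n] skips digits and line breaks, so only the letters survive.
kept = re.findall(r"[^0-9\r\n]", "a1\r\nb2")

# q[^u] must consume a real character after the q, so a q at the very
# end of the string cannot match.
no_match = re.search(r"q[^u]", "Iraq")
with_space = re.search(r"q[^u]", "Iraq is a country")

print(kept)                         # ['a', 'b']
print(no_match)                     # None
print(repr(with_space.group(0)))    # 'q '
```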


The only special characters or metacharacters inside a character class are: ], \, ^, and -.

To search for a star or plus, use [+*].

Shorthand Character Classes

\d is short for [0-9].
\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_].
\s stands for "whitespace character". It includes [ \t\r\n\f].

Shorthand character classes can be used both inside and outside the square brackets:
[\s\d] matches a single character that is either whitespace or a digit.

#---------- Negated Shorthand Character Classes

\D is the same as [^\d]
\W is short for [^\w]
\S is the equivalent of [^\s]
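
A small Python sketch of the shorthand classes. One flavor-specific caveat: for str patterns, Python's \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given, so the ASCII-only definitions above hold in Python only with that flag.

```python
import re

text = "a 1\tb2"

# A shorthand inside brackets: whitespace or digit.
ws_or_digit = re.findall(r"[\s\d]", text)

# \D and [^\d] select the same characters.
non_digits = re.findall(r"\D", "a1b2")
same_as_negated = re.findall(r"\D", text) == re.findall(r"[^\d]", text)

# U+0663 is an Arabic-Indic digit: matched by \d by default, not with re.ASCII.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None

print(ws_or_digit)                  # [' ', '1', '\t', '2']
print(non_digits, same_as_negated)  # ['a', 'b'] True
print(unicode_digit, ascii_digit)   # True False
```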

#---------- In summary

.   Any character except a newline.
\d  Any single digit.
\D  Non-digit.
\s  Whitespace character; same as [ \t\n\r\f].
\S  Non-whitespace character.
\w  Word character; same as [0-9A-Za-z_].
\W  Non-word character.
\b  Word boundary.
\B  Non-word boundary.

Non-Printable Characters

\t: tab (0x09)
\r: carriage return (0x0D). Windows text files use \r\n, while UNIX text files use \n.
\n: line feed (0x0A)
\a: bell (0x07)
\e: escape (0x1B)
\f: form feed (0x0C)
\v: vertical tab (0x0B)


[^/]+ 	One or more characters until (and not including) a forward slash.

<[A-Za-z][A-Za-z0-9]*> 	matches an HTML tag without any attributes.

{0,1} is the same as ?
{0,} is the same as *
{1,} is the same as +
In {min,max}, omitting both the comma and max, as in {min}, tells the engine to repeat the token exactly min times.

\b[1-9][0-9]{3}\b 	to match a number between 1000 and 9999. \b means word boundary.
\b[1-9][0-9]{2,4}\b 	matches a number between 100 and 99999.
A word boundary, in most regex dialects, is a position between \w and \W (non-word char). So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
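
The quantifier and word-boundary notes can be checked in Python. The sketch uses \b[1-9][0-9]{3}\b and \b[1-9][0-9]{2,4}\b as plausible patterns for the two number ranges (an assumption), and made-up inputs.

```python
import re

text = "7 42 100 1000 9999 12345 123456"

four_digits = re.findall(r"\b[1-9][0-9]{3}\b", text)       # 1000..9999
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)   # 100..99999

# {0,1} behaves like ?.
same_optional = (re.findall(r"colou{0,1}r", "color colour")
                 == re.findall(r"colou?r", "color colour"))

# Word boundaries in "-12": before the 1 and after the 2.
boundaries = [m.start() for m in re.finditer(r"\b", "-12")]

print(four_digits)     # ['1000', '9999']
print(three_to_five)   # ['100', '1000', '9999', '12345']
print(same_optional)   # True
print(boundaries)      # [1, 3]
```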

Greediness of +
Using <.+> on <EM>first</EM> will match <EM>first</EM>.
Reason: The plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack.

Using <.+?> on <EM>first</EM> will match <EM> and </EM>.
*?, +?, ??, and {m,n}? are the non-greedy (lazy) versions of *, +, ?, and {m,n}.

Best way
Using <[^>]+> on <EM>first</EM> will also match <EM> and </EM>.
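
The three variants side by side, as a Python sketch run on the same input as above:

```python
import re

html = "<EM>first</EM>"

greedy = re.findall(r"<.+>", html)       # .+ runs to the last '>'
lazy = re.findall(r"<.+?>", html)        # .+? stops at the first '>'
negated = re.findall(r"<[^>]+>", html)   # cannot cross a '>' at all

print(greedy)    # ['<EM>first</EM>']
print(lazy)      # ['<EM>', '</EM>']
print(negated)   # ['<EM>', '</EM>']
```

The negated class gets the same result as the lazy version without needing to backtrack, which is probably why the notes call it the best way.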



Matched Subexpressions


Input = "He said that that was the the correct answer.";
Matched: "that that", "the the".
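
The pattern for this example did not survive in the notes. A minimal Python sketch that reproduces the matches listed above, assuming a pattern of the form \b(\w+)\s\1\b (a captured word, whitespace, then a backreference to the same word):

```python
import re

text = "He said that that was the the correct answer."

# \1 refers back to whatever group 1 captured (hypothetical pattern).
doubled = [m.group(0) for m in re.finditer(r"\b(\w+)\s\1\b", text)]

print(doubled)   # ['that that', 'the the']
```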

Named Matched Subexpressions


((?<One>abc)\d+)?(?<Two>xyz)(.*) produces the following capturing groups by number and by name. The first capturing group (number 0) always refers to the entire pattern:
Number  Name                Pattern
0       0 (default name)    ((?<One>abc)\d+)?(?<Two>xyz)(.*)
1       1 (default name)    ((?<One>abc)\d+)
2       2 (default name)    (.*)
3       One                 (?<One>abc)
4       Two                 (?<Two>xyz)
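
For comparison, a Python sketch of the same idea. Python spells named groups (?P<name>...) and, unlike the table above (which numbers unnamed groups first, as .NET does), numbers all groups in order of their opening parenthesis. The input string is made up.

```python
import re

pattern = re.compile(r"((?P<One>abc)\d+)?(?P<Two>xyz)(.*)")
m = pattern.match("abc123xyzrest")

whole = m.group(0)
one = m.group("One")
two = m.group("Two")
first_numbered = m.group(1)        # outer group around One
last_numbered = m.group(4)         # the trailing (.*)
name_to_number = dict(pattern.groupindex)

print(whole, one, two)                 # abc123xyzrest abc xyz
print(first_numbered, last_numbered)   # abc123 rest
print(name_to_number)                  # {'One': 2, 'Two': 3}
```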

Noncapturing Groups


input = "This is a short sentence.";
// The example displays the following output:
//       Match: This is a short sentence.
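
The pattern for this example is missing from the notes. A Python sketch that gives the same match, assuming a pattern like (?:\b(?:\w+)\W*)+\. built only from non-capturing groups:

```python
import re

sentence = "This is a short sentence."

# (?:...) groups tokens without creating a capture (hypothetical pattern).
m = re.search(r"(?:\b(?:\w+)\W*)+\.", sentence)

matched = m.group(0)
captures = m.groups()   # empty: nothing was captured

print(matched)    # This is a short sentence.
print(captures)   # ()
```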

Difference between () and []

These regexes are equivalent (for matching purposes): (7|8|9), [789], and [7-9].

(a|b|c) is a regex "OR" and means "a or b or c". The parentheses, which the OR needs, also capture the matched digit. To be strictly equivalent to the character class, you would code (?:7|8|9) to make it a non-capturing group.

[abc] is a "character class" that means "any character from a,b or c" (a character class may use ranges, e.g. [a-d] = [abcd])

The reason these regexes are similar is that a character class is a shorthand for an "or" (but only for single characters). In an alternation, you can also do something like (abc|def) which does not translate to a character class.
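
A Python sketch of the comparison, using the digits 7, 8, and 9 mentioned above; the inputs are illustrative.

```python
import re

digits = "6789"

alternation = re.findall(r"(?:7|8|9)", digits)
char_class = re.findall(r"[789]", digits)
char_range = re.findall(r"[7-9]", digits)

# With plain parentheses, the alternation also captures what it matched.
captured = re.search(r"(7|8|9)", digits).group(1)

# Multi-character alternatives have no character-class equivalent.
words = re.findall(r"(?:abc|def)", "abcdef")

print(alternation, char_class, char_range)   # three times ['7', '8', '9']
print(captured)                              # 7
print(words)                                 # ['abc', 'def']
```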

Zero-Width Lookahead/Lookbehind

Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions, just like the start-of-line, end-of-line, and word-boundary anchors.

Positive lookahead: q(?=u) matches a q that is followed by a u. Negative lookahead: q(?!u) matches a q that is not followed by a u.

Lookbehind tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b (negative lookbehind) matches a "b" that is not preceded by an "a". It doesn't match cab, but matches the b (and only the b) in bed or debt. (?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt.
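
The four lookaround forms can be checked with a short Python sketch. The string "quit Iraq" is made up; the other inputs come from the text above.

```python
import re

# Lookahead: the u (or its absence) is checked but not consumed.
q_then_u = [m.start() for m in re.finditer(r"q(?=u)", "quit Iraq")]
q_no_u = [m.start() for m in re.finditer(r"q(?!u)", "quit Iraq")]

# Negative lookbehind: b not preceded by a.
neg_cab = re.search(r"(?<!a)b", "cab")
neg_bed = re.search(r"(?<!a)b", "bed").start()
neg_debt = re.search(r"(?<!a)b", "debt").start()

# Positive lookbehind: b preceded by a; only the b is matched.
pos_cab = re.search(r"(?<=a)b", "cab").group(0)
pos_bed = re.search(r"(?<=a)b", "bed")

print(q_then_u, q_no_u)             # [0] [8]
print(neg_cab, neg_bed, neg_debt)   # None 0 2
print(pos_cab, pos_bed)             # b None
```

Note that q(?!u) does match the final q of "Iraq", because a lookahead needs no character to consume.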

Zero-Width Positive Lookahead Assertions, written (?=subexpression):
* For a match to be successful, the input string must match the regular expression pattern in subexpression.
* The matched substring is not included in the match result.
* A zero-width positive lookahead assertion does not backtrack.

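using System;
using System.Text.RegularExpressions;

// Pattern reconstructed to fit the output below (an assumption; the original line was lost).
string pattern = @"\b\w+(?=\sis\b)";
string[] inputs = { "The dog is a Malamute.",
                    "The island has beautiful birds.",
                    "The pitch missed home plate.",
                    "Sunday is a weekend day." };
foreach (string input in inputs)
{
    Match match = Regex.Match(input, pattern);
    if (match.Success)
        Console.WriteLine("'{0}' precedes 'is'.", match.Value);
    else
        Console.WriteLine("'{0}' does not match the pattern.", input);
}
// The example displays the following output:
//    'dog' precedes 'is'.
//    'The island has beautiful birds.' does not match the pattern.
//    'The pitch missed home plate.' does not match the pattern.
//    'Sunday' precedes 'is'.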

Example: find a regexp that transforms
"<please help me> hello everyone <HI!>"
into
"<please_help_me> hello everyone <HI!>"
Regex:  ( )(?=[^<]+>)
Replacement string: _
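
The same replacement as a Python sketch, with re.sub standing in for replace-regexp. The lookahead only succeeds when a > can be reached without crossing a <, so spaces outside the angle brackets are left alone.

```python
import re

text = "<please help me> hello everyone <HI!>"

# Replace a space only if a '>' follows before any '<'.
result = re.sub(r"( )(?=[^<]+>)", "_", text)

print(result)   # <please_help_me> hello everyone <HI!>
```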


Zero-Width Negative Lookahead Assertions, written (?!subexpression):

input = "unite one unethical ethics use untie ultimate";
// The example displays the following output:
//       one
//       ethics
//       use
//       ultimate
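
The pattern for this example is missing from the notes. A Python sketch that yields the same four words, assuming a pattern like \b(?!un)\w+\b (a word that does not start with "un"):

```python
import re

text = "unite one unethical ethics use untie ultimate"

# (?!un) rejects any word start that is followed by "un" (hypothetical pattern).
words = re.findall(r"\b(?!un)\w+\b", text)

print(words)   # ['one', 'ethics', 'use', 'ultimate']
```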


Zero-Width Positive Lookbehind Assertions, written (?<=subexpression):
* For a match to be successful, subexpression must occur in the input string immediately to the left of the current position.

input = @"2010 1999 1861 2140 2009";
pattern = @"(?<=\b20)\d{2}\b";
// The example displays the following output:
//       10
//       09
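
The pattern above also runs unchanged in Python, since the lookbehind \b20 has a fixed width. A minimal sketch:

```python
import re

text = "2010 1999 1861 2140 2009"

# Two digits at the end of a word, preceded by "20" at the start of that word.
years = re.findall(r"(?<=\b20)\d{2}\b", text)

print(years)   # ['10', '09']
```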


Zero-Width Negative Lookbehind Assertions, written (?<!subexpression):

{ "Monday February 1, 2010", 
  "Wednesday February 3, 2010", 
  "Saturday February 6, 2010", 
  "Sunday February 7, 2010", 
  "Monday, February 8, 2010" };
pattern = @"(?<!(Saturday|Sunday) )\b\w+ \d{1,2}, \d{4}\b";
// The example displays the following output:
//       February 1, 2010
//       February 3, 2010
//       February 8, 2010
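
Python's re module requires fixed-width lookbehind, so (?<!(Saturday|Sunday) ) is rejected there because the two alternatives differ in length. A sketch of one workaround, assuming it is acceptable to split the assertion into two separate lookbehinds:

```python
import re

dates = ["Monday February 1, 2010",
         "Wednesday February 3, 2010",
         "Saturday February 6, 2010",
         "Sunday February 7, 2010",
         "Monday, February 8, 2010"]

# Two fixed-width lookbehinds replace the single alternation.
pattern = re.compile(r"(?<!Saturday )(?<!Sunday )\b\w+ \d{1,2}, \d{4}\b")

weekday_dates = []
for date in dates:
    m = pattern.search(date)
    if m:
        weekday_dates.append(m.group(0))

print(weekday_dates)
# ['February 1, 2010', 'February 3, 2010', 'February 8, 2010']
```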

Group Options:

(?:) is the non-capturing group. See StackOverflow.
Match ""
	Group 1: ""
	Group 2: "/"

Match ""
	Group 1: ""
	Group 2: "/questions/tagged/regex"


Anchors include: ^, $, \b. Anchors do not match any character at all.

^, $

Start/End of String Anchors.

If you have a string consisting of multiple lines, like "first line\nsecond line" (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, most regex engines have an option (usually called multi-line mode) that expands the meaning of both anchors: ^ then also matches after each line break, and $ also matches before each line break.
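
In Python this option is the re.MULTILINE flag. A short sketch on the two-line string from the paragraph above:

```python
import re

text = "first line\nsecond line"

# Default: ^ matches only at the start of the string.
default_starts = re.findall(r"^\w+", text)

# re.MULTILINE: ^ and $ also match at each line break.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(default_starts)   # ['first']
print(line_starts)      # ['first', 'second']
print(line_ends)        # ['line', 'line']
```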

\A, \Z, \`, \'

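\A only ever matches at the start of the string, and \Z only matches at the end of the string (or, in most flavors, just before a final line break; see below).
These two tokens never match at line breaks inside the string, even in multi-line mode. This holds in the flavors that support them.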

The GNU extensions to POSIX regular expressions use \` (backtick) to match the start of the string, and \' (single quote) to match the end of the string.

\Z, \z

Because Perl returns a string with a newline at the end when reading a line from a file, Perl's regex engine matches $ at the position before the line break at the end of the string even when multi-line mode is turned off. So ^\d+$ matches 123 whether the subject string is 123 or 123\n.

If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A\d+\z does not match 123\n.
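
A flavor caveat when trying this in Python: Python's $ also matches before a final line break, but Python's \Z matches only at the very end of the string, so it plays the role of the \z described above. A minimal sketch:

```python
import re

subject = "123\n"

# $ tolerates one trailing line break even without multi-line mode.
dollar = re.search(r"^\d+$", subject)

# Python's \Z is strict: it matches at the absolute end only.
strict = re.search(r"\A\d+\Z", subject)
strict_clean = re.search(r"\A\d+\Z", "123")

print(dollar.group(0))             # 123
print(strict)                      # None
print(strict_clean is not None)    # True
```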


The metacharacter \b matches at a position called a "word boundary".

Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words.

Exactly which characters are word characters depends on the regex flavor you're working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.

Example: "\bis\b" matches the third "is" in the string "This island is beautiful".
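
The example can be verified in Python by comparing the offsets of every "is" with the offset of the whole-word match:

```python
import re

text = "This island is beautiful"

# All occurrences of "is", whole word or not.
all_is = [m.start() for m in re.finditer(r"is", text)]

# Only the standalone word.
whole_word = [m.start() for m in re.finditer(r"\bis\b", text)]

print(all_is)       # [2, 5, 12]
print(whole_word)   # [12]
```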

The start/end-of-word metachar

Most flavors have only one metacharacter (e.g. \b) that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.

GNU uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl's \m. \> matches at the end of a word, like Tcl's \M. The POSIX standard defines [[:<:]] as a start-of-word boundary, and [[:>:]] as an end-of-word boundary.


\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
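
A small Python sketch of \B as the complement of \b; the inputs are illustrative.

```python
import re

# \Bis: an "is" that does not start at a word boundary (only the one in "This").
inside_word = [m.start() for m in re.finditer(r"\Bis", "This island is beautiful")]

# In "-12" every position is either a \b position or a \B position.
b_positions = [m.start() for m in re.finditer(r"\b", "-12")]
not_b_positions = [m.start() for m in re.finditer(r"\B", "-12")]

print(inside_word)        # [2]
print(b_positions)        # [1, 3]
print(not_b_positions)    # [0, 2]
```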


Elisp regexp

^[\t]*?[^[:space:]]+ [^[:space:]]+(.+)