Regular expressions are patterns that extract from a text the words or phrases matching given criteria. They can find not just a given word but also an arbitrary word that has required words nearby. A pattern can also contain optional regions that need not be present, so that, for instance, the singular and plural forms of a word are recognized by the same regular expression. A single expression can cover several entities that must be found. Regular expressions can, of course, look for arbitrary sequences of characters, not just ordinary words. Their syntax differs somewhat between the programming languages that provide them. Depending on the options, a search with regular expressions may or may not be case sensitive. This article explains Java regular expressions, but the expressions of other languages are not fundamentally different. It tries to present the material in an understandable way rather than in the strictly formal and complete way of the official specification[1], which we suggest reading next.
The match can be either case sensitive or case insensitive. Case insensitivity can be specified directly in the pattern (with the embedded flag (?i)), but this is clumsy; usually it is specified in the Java call that compiles the pattern, by passing the Pattern.CASE_INSENSITIVE flag to Pattern.compile.
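The two ways of controlling case sensitivity can be compared in a minimal sketch. The class name, method names and sample text are made up for illustration; Pattern.CASE_INSENSITIVE and the embedded flag (?i) are part of java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class CaseSensitivityDemo {
    // Default behaviour: the search is case sensitive.
    static boolean findsSensitive(String text) {
        return Pattern.compile("white cat").matcher(text).find();
    }

    // Usual way: pass the flag to Pattern.compile and leave the pattern unchanged.
    static boolean findsWithFlag(String text) {
        return Pattern.compile("white cat", Pattern.CASE_INSENSITIVE).matcher(text).find();
    }

    // Alternative: switch case-insensitive matching on inside the pattern itself.
    static boolean findsWithInlineFlag(String text) {
        return Pattern.compile("(?i)white cat").matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "A White Cat sat here";
        System.out.println(findsSensitive(text));      // prints "false"
        System.out.println(findsWithFlag(text));       // prints "true"
        System.out.println(findsWithInlineFlag(text)); // prints "true"
    }
}
```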
Grouping allows regular expressions to find a pattern that is not fully known in advance. For instance, if we expect a phone number in the form 00-370-4567788, where "00" is the international prefix and "370" is the country code, we can write the expression 00-(\d+)-(\d+). It picks the country code as the first group and the rest of the number as the second group, but only if the two zeros precede them. Groups are surrounded by parentheses.
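As a sketch of how the phone number expression might be used from Java, the following code retrieves the two groups with Matcher.group. The class and method names are hypothetical; note that the backslashes of \d are doubled inside the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PhoneGroupsDemo {
    private static final Pattern PHONE = Pattern.compile("00-(\\d+)-(\\d+)");

    // Returns {countryCode, restOfNumber} for the first match, or null if none is found.
    static String[] parse(String text) {
        Matcher m = PHONE.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Group 0 would be the whole match; groups 1 and 2 are the parenthesized parts.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("Call 00-370-4567788 today");
        System.out.println(parts[0]); // prints "370"
        System.out.println(parts[1]); // prints "4567788"
        // Without the leading two zeros the pattern is not found at all.
        System.out.println(parse("Call 370-4567788 today") == null); // prints "true"
    }
}
```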
When using groups, some additional rules about the star and plus become important. Normally, these quantifiers are "greedy", trying to reach as far ahead as possible. For instance, applied to the text "one white cat and another black cat here", the pattern one(.*)cat will extract " white cat and another black ", stopping at the second occurrence of cat and not the first. When needed, "reluctant" quantifiers can be used instead; they take the smallest possible match. Star and plus can be made reluctant by adding a question mark after them. For instance, one(.*?)cat will extract a single word " white " from the previous sample.
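The difference between greedy and reluctant matching can be checked with a short sketch. The class and the helper firstGroup are hypothetical names; the sample text is the one discussed above, and the square brackets in the output only make the captured spaces visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GreedyReluctantDemo {
    // Returns the text captured by group 1 of the first match, or null if there is no match.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "one white cat and another black cat here";
        // Greedy: .* extends up to the last "cat" in the text.
        System.out.println("[" + firstGroup("one(.*)cat", text) + "]");
        // prints "[ white cat and another black ]"

        // Reluctant: .*? stops at the first "cat" that completes the pattern.
        System.out.println("[" + firstGroup("one(.*?)cat", text) + "]");
        // prints "[ white ]"
    }
}
```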
There can be more than one group in the pattern. In this case the pattern counts as found (and the group values become available) only after all groups have been matched, exactly in the order in which they appear in the pattern (the regions both inside and between the group parentheses must match). The pattern is found a second time when all its groups are matched again further in the text.
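A pattern with two groups can be found repeatedly by calling Matcher.find in a loop. The following is only a sketch: the pattern (\w+)=(\d+), the sample text and the names are assumptions chosen for illustration and do not come from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern and names are illustrative only.
class MultipleGroupsDemo {
    // Two groups separated by a literal "=": a name and a number.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Collects "name:number" for every place where the whole pattern is found.
    static List<String> findAll(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        // Each successful find() means both groups matched again, in order.
        while (m.find()) {
            result.add(m.group(1) + ":" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // "depth=" is skipped: the second group has no digits to match there.
        System.out.println(findAll("width=10, height=20, depth="));
        // prints "[width:10, height:20]"
    }
}
```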
Several alternatives can be specified with the | operator: A|B means either A or B. It is usually used together with parentheses, to specify alternative sequences or even sub-expressions rather than single symbols. For instance, ((white)|(grey)) cat matches either white cat or grey cat but does not match black cat. Do not forget the outer pair of parentheses around the enumerated cases, as in ((aa)|(bb)|(cc)): it is required for proper interpretation, because without it | would split the whole expression, and (white)|(grey) cat would mean either white or grey cat.
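The effect of the enclosing parentheses can be seen in a small sketch. The class and method names are hypothetical; the comparison uses matches(), which requires the whole input to match the expression.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class AlternationDemo {
    private static final Pattern CAT = Pattern.compile("((white)|(grey)) cat");

    // True if the whole input is "white cat" or "grey cat".
    static boolean isWhiteOrGreyCat(String text) {
        return CAT.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWhiteOrGreyCat("white cat")); // prints "true"
        System.out.println(isWhiteOrGreyCat("grey cat"));  // prints "true"
        System.out.println(isWhiteOrGreyCat("black cat")); // prints "false"

        // Without the enclosing parentheses, | separates "white" from "grey cat".
        System.out.println(Pattern.matches("(white)|(grey) cat", "white"));     // prints "true"
        System.out.println(Pattern.matches("(white)|(grey) cat", "white cat")); // prints "false"
    }
}
```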
By default, parentheses also specify that the group inside must be captured for later retrieval of the matched value. If the content inside the parentheses need not be captured, capturing can be suppressed by adding ?: after the opening (. For instance, the above example can be written as (?:(?:white)|(?:grey)) cat. Marking which groups capture and which are needed only for the internal structure of the expression makes the code easier to understand. It also helps because group content is retrieved by number: when many groups capture needlessly, it may be tricky to tell the number of the group that in the end must capture something.
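How non-capturing groups change the group numbering can be shown with a sketch. It assumes a variant of the cat example in which the animal name is captured by an extra group (\w+); that group, the class and the helper names are additions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the extra (\w+) group and the names are illustrative only.
class NonCapturingDemo {
    // All parentheses capture: the animal ends up in group 4.
    static final Pattern CAPTURING = Pattern.compile("((white)|(grey)) (\\w+)");
    // Colour groups marked with ?: do not capture: the animal is group 1.
    static final Pattern NON_CAPTURING = Pattern.compile("(?:(?:white)|(?:grey)) (\\w+)");

    // Returns the content of group n for the first match, or null if there is no match.
    static String group(Pattern pattern, String text, int n) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String text = "a grey cat";
        System.out.println(CAPTURING.matcher(text).groupCount());     // prints "4"
        System.out.println(group(CAPTURING, text, 4));                // prints "cat"
        System.out.println(NON_CAPTURING.matcher(text).groupCount()); // prints "1"
        System.out.println(group(NON_CAPTURING, text, 1));            // prints "cat"
    }
}
```

Groups are numbered by their opening parentheses from left to right, which is why the fully capturing variant needs group 4 to reach the animal name.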
If a symbol that has a special meaning in patterns must itself be searched for, a backslash prefix is added. For instance, \* searches for a literal star and \\ searches for a literal backslash. In strings inside Java code every backslash must be doubled once more, because the backslash is also the standard escape character there. Hence, to find a backslash in Java, one must write Pattern.compile("\\\\").
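The double level of escaping can be tried out with a short sketch. The class and method names are hypothetical. The last method uses Pattern.quote, a standard helper not discussed above, as one possible way to avoid escaping every special symbol by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class EscapingDemo {
    // Regex \* written as the Java string "\\*": finds a literal star.
    static boolean containsStar(String text) {
        return Pattern.compile("\\*").matcher(text).find();
    }

    // Regex \\ written as the Java string "\\\\": finds a literal backslash.
    static boolean containsBackslash(String text) {
        return Pattern.compile("\\\\").matcher(text).find();
    }

    // Pattern.quote turns a whole string into a literal pattern.
    static boolean containsLiteral(String literal, String text) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsStar("2*3"));             // prints "true"
        System.out.println(containsStar("23"));              // prints "false"
        System.out.println(containsBackslash("C:\\temp"));   // prints "true"
        System.out.println(containsLiteral("a*b", "x a*b")); // prints "true"
        System.out.println(containsLiteral("a*b", "x aab")); // prints "false"
    }
}
```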
There are more "pattern programming" constructs than are covered here, including even some binary matching capabilities, but these are used less frequently. They are described in the formal specification referenced below. The applet supports these advanced patterns as well.