Regular expressions (Java)

Regular expressions are patterns that extract from a text the words or phrases matching some criteria. They can find not just a given word, but also an arbitrary word that has the required words nearby. A pattern may also contain predefined regions that need not match exactly, which allows, for instance, recognizing the singular and plural forms of a word with the same regular expression. A single expression can also cover multiple entities that must be found. Of course, regular expressions can look for arbitrary sequences of characters, not just ordinary words. Regular expressions differ between the programming languages that provide them. Depending on the options, a search with regular expressions may or may not be case sensitive. This article explains Java regular expressions, but the expressions of other languages are not fundamentally different. It tries to present the material in an understandable way rather than in the strictly formal and fully complete way found in the official specification[1], which we suggest reading next.

This applet supports both basic and advanced constructs. In the screenshot, we see a pattern with a simple group that has been found twice. When moving patterns to or from Java code, keep in mind that backslashes (\) must be doubled in Java code, because the backslash is also the escape symbol in a standard Java string literal.

Basics

  • Simple match. Symbols other than those defined below match only themselves (for instance, cat matches cat and nothing else).
  • Dot (.) matches any character (by default, any character except a line terminator). For instance, c.t matches both cat and cut.
  • Question mark (?) indicates that the character preceding it may or may not appear; the match counts either way. For instance, in cats?7 the question mark makes s optional, so cat7 matches as well as cats7.
  • Star (*) states that the character before it may occur any number of times, including zero. It is often combined with the dot: for instance, cat.*here matches both cat is here and cat from the great mountain is hot here.
  • Plus (+) states that the character before it must occur at least once but may occur more times. For instance, 123+4 matches 1234, 12334 and 123334 but does not match 124.
  • \w matches "word" characters (letters, digits and underscore). \W matches non-word characters.
  • \d matches digits only. \D matches non-digits.
  • \s matches whitespace symbols (space, tab, etc.). \S matches any symbol that is not whitespace.
  • Normally a pattern can match anywhere in the target string. If required, ^ can be used to mark the start of the target string and $ to mark the end.
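
The basic constructs above can be tried out with a few lines of Java. The following is only a minimal sketch: the class name BasicPatternsDemo and the helper found are made up for this illustration, and the sample strings are arbitrary. Note the doubled backslash in "^\\d+$", as required inside a Java string literal.

```java
import java.util.regex.Pattern;

final class BasicPatternsDemo {
    // Hypothetical helper: true if the pattern is found anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("c.t", "a cut above"));   // true: dot matches 'u'
        System.out.println(found("cats?7", "cat7"));       // true: 's' is optional
        System.out.println(found("123+4", "12334"));       // true: one or more '3'
        System.out.println(found("123+4", "124"));         // false: no '3' at all
        System.out.println(found("^\\d+$", "2024"));       // true: digits from start to end
        System.out.println(found("^\\d+$", "year 2024"));  // false: text before the digits
    }
}
```

The helper uses find(), which looks for the pattern anywhere in the text; the anchors ^ and $ in the last two calls are what restrict the match to the whole string.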

The match can be either case sensitive or case insensitive. It is possible to specify this directly in the pattern, but this is very clumsy; usually case sensitivity is specified in the Java call that compiles the pattern.
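
As a hedged illustration of both approaches, the sketch below compares the default behaviour, the Pattern.CASE_INSENSITIVE flag passed to Pattern.compile, and the embedded flag (?i) written inside the pattern itself. The class name CaseSensitivityDemo and the helper found are assumptions made for this example only.

```java
import java.util.regex.Pattern;

final class CaseSensitivityDemo {
    // Hypothetical helper: compiles the pattern with the given flags and searches the text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Black CAT";
        // Default: case sensitive, so "cat" does not match "CAT".
        System.out.println(found("cat", 0, text));                        // false
        // Case insensitivity requested when compiling the pattern.
        System.out.println(found("cat", Pattern.CASE_INSENSITIVE, text)); // true
        // Case insensitivity requested inside the pattern itself.
        System.out.println(found("(?i)cat", 0, text));                    // true
    }
}
```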

Groups

Grouping allows regular expressions to find text that is not fully known in advance. For instance, if we expect a phone number in the form 00-370-4567788, where "00" is the international prefix and "370" is the country code, we can write the expression 00-(\d+)-(\d+). It picks the country code as the first group and the rest of the number as the second group, but only if the two zeros come before them. Groups are surrounded by parentheses.
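
The phone number example can be sketched in Java as follows. The pattern is the one from the text; the class name PhoneGroupsDemo, the method parse and the surrounding sample sentence are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhoneGroupsDemo {
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern PHONE = Pattern.compile("00-(\\d+)-(\\d+)");

    // Returns {country code, rest of the number}, or null if the pattern is not found.
    static String[] parse(String text) {
        Matcher m = PHONE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("Call 00-370-4567788 today"))); // [370, 4567788]
        System.out.println(Arrays.toString(parse("Call +370-4567788 today")));   // null: no 00 prefix
    }
}
```

Group 0 (m.group() or m.group(0)) would return the whole matched text, here 00-370-4567788; the numbered groups start from 1.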

When using groups, some additional rules about star and plus become important. Normally, these quantifiers are "greedy", trying to reach as far ahead as possible. For instance, given the text "one white cat and another black cat here", the pattern one(.*)cat extracts " white cat and another black ", stopping at the second occurrence of cat and not the first. When needed, "reluctant" quantifiers can be used instead; they take the smallest possible match. Star and plus can be made reluctant by adding a question mark after them. For instance, one(.*?)cat extracts the single word " white " from the previous sample.
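
The difference between greedy and reluctant matching is easy to check in code. This is a small sketch using the sample sentence "one white cat and another black cat here"; the class name GreedyReluctantDemo and the helper firstGroup are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyReluctantDemo {
    // Hypothetical helper: returns the content of group 1, or null if nothing is found.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "one white cat and another black cat here";
        // Greedy: .* runs as far as it can and stops at the last "cat".
        System.out.println("[" + firstGroup("one(.*)cat", text) + "]");  // [ white cat and another black ]
        // Reluctant: .*? stops at the first "cat".
        System.out.println("[" + firstGroup("one(.*?)cat", text) + "]"); // [ white ]
    }
}
```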

There can be more than one group in a pattern. In this case the pattern counts as found (and the group values are available) only after all groups are detected, exactly in the sequence in which they follow in the pattern (the regions both inside and between the group parentheses must match). The pattern can be found a second time when all its groups are detected again.
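
Finding the same pattern several times usually means calling find() in a loop; each successful call makes the group values of that occurrence available. The sketch below assumes a made-up "name=number" format, and the class name RepeatedGroupsDemo and the method pairs are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupsDemo {
    // Illustrative pattern: a word, an equals sign, then digits.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Collects "name -> number" for every place where the whole pattern is found.
    static List<String> pairs(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            result.add(m.group(1) + " -> " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // "depth=unknown" is skipped: the second group finds no digits there.
        System.out.println(pairs("width=10, height=20, depth=unknown")); // [width -> 10, height -> 20]
    }
}
```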

More advanced matching

  • [xyz] matches a single character that is x, y or z, and no other.
  • [^xyz] matches any single character apart from x, y or z.
  • a{3} matches exactly three consecutive 'a' characters ("aaa").
  • a{3,5} matches at least three but no more than five consecutive 'a' characters ("aaa", "aaaa" and "aaaaa").
  • a{3,} matches three or more consecutive 'a' characters.
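
These constructs can be checked with the static method Pattern.matches, which succeeds only if the entire input matches the pattern. The class name CharClassDemo and the wrapper whole are assumptions of this sketch.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // Hypothetical wrapper: true only if the whole input matches the pattern.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("[xyz]", "y"));        // true
        System.out.println(whole("[xyz]", "a"));        // false
        System.out.println(whole("[^xyz]", "a"));       // true
        System.out.println(whole("[^xyz]", "x"));       // false
        System.out.println(whole("a{3}", "aaa"));       // true
        System.out.println(whole("a{3}", "aa"));        // false: too few
        System.out.println(whole("a{3,5}", "aaaa"));    // true
        System.out.println(whole("a{3,5}", "aaaaaa"));  // false: six is too many
        System.out.println(whole("a{3,}", "aaaaaaa"));  // true
    }
}
```

With find() instead of matches, a{3,5} would still be found inside a run of six 'a' characters, because a part of the run is enough for a match there.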

Alternatives

It is possible to specify several alternatives using the | operator. A|B means either A or B. It is usually used together with parentheses, specifying alternative sequences or even sub-expressions rather than single symbols. For instance, ((white)|(grey)) cat matches either white cat or grey cat but does not match black cat. Do not forget the outer pair of parentheses around the enumerated cases, as in ((aa)|(bb)|(cc)): it limits where the alternatives end, and without it the | operator would apply to everything on either side of it.
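
The effect of the surrounding parentheses can be seen in a short experiment. This sketch uses the pattern from the text plus a deliberately unparenthesized variant; the class name AlternativesDemo, the helper found and the sample strings are made up for the illustration.

```java
import java.util.regex.Pattern;

final class AlternativesDemo {
    // Hypothetical helper: true if the pattern is found anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("((white)|(grey)) cat", "a grey cat"));  // true
        System.out.println(found("((white)|(grey)) cat", "a black cat")); // false
        // Without the outer parentheses the alternatives are "white" and "grey cat",
        // so the pattern is found even though there is no cat in the text.
        System.out.println(found("white|grey cat", "a white dog"));       // true
        System.out.println(found("((white)|(grey)) cat", "a white dog")); // false
    }
}
```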

By default, parentheses also specify that the group inside must be captured for later retrieval of the matched value. If the content inside the parentheses need not be captured, capturing can be suppressed by adding ?: after the opening (. For instance, the above example can be written as (?:(?:white)|(?:grey)) cat. Marking which groups capture values and which are needed only for the internal structure of the expression makes the code easier to understand. When many groups capture needlessly, it can be tricky to tell the number of the group that must actually capture something, and the group content is retrieved by its number.
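
How non-capturing groups affect group numbering can be checked with Matcher.groupCount(), which reports the number of capturing groups in a pattern. In this sketch the class name NonCapturingDemo, both helper methods and the third pattern (with the cat|dog alternative) are assumptions added purely for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingDemo {
    // Number of capturing groups in the pattern (group 0, the whole match, is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Illustrative pattern: the colour is only matched, the animal is captured as group 1.
    static String animal(String text) {
        Matcher m = Pattern.compile("(?:white|grey) (cat|dog)").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("((white)|(grey)) cat"));       // 3
        System.out.println(groupCount("(?:(?:white)|(?:grey)) cat")); // 0
        System.out.println(animal("a grey dog sleeps"));              // dog
    }
}
```

Because the colour group does not capture, the animal is simply group 1; with a capturing colour group it would have been group 2.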

Escaping

If a symbol used to specify the pattern must itself be searched for, a backslash prefix is added. For instance, \* searches for a literal star and \\ searches for a literal backslash. In strings inside Java code the backslash must be doubled once more, because it is also the standard escape character there. Hence, to find a backslash in Java, one must write Pattern.compile("\\\\").
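
The following sketch shows the escaping rules at work in Java source code. It also uses Pattern.quote, a standard method not mentioned above that turns a whole string into a literal pattern; it is included here only as a possible convenience. The class name EscapingDemo, the helper found and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

final class EscapingDemo {
    // Hypothetical helper: true if the pattern is found anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Java literal "\\*" is the regex \* : a literal star.
        System.out.println(found("\\*", "2 * 3"));                       // true
        // Java literal "\\\\" is the regex \\ : a literal backslash.
        System.out.println(found("\\\\", "C:\\temp"));                   // true
        // Unescaped, the plus means "one or more of the digit 1", so this is not found.
        System.out.println(found("1+1=2", "so 1+1=2"));                  // false
        // Pattern.quote treats every character of its argument literally.
        System.out.println(found(Pattern.quote("1+1=2"), "so 1+1=2"));   // true
    }
}
```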

Other matches

There are more "pattern programming" constructs than covered here, including even some binary match capabilities, but these are less frequently used. They are described in the formal specification listed below. The applet supports these advanced patterns as well.

References

  1. Official specification of Java regular expressions