Developer's Documentation for free mobile OCR SDK

Regular Expressions

This section describes the regular expression syntax supported by the ABBYY Real-Time Recognition SDK engine for capturing custom data fields (see Capture a Custom Data Field: iOS and Capture a Custom Data Field: Android).

Note: All matches are greedy (they match as much as possible).

Supported syntax

Pattern

Syntax

Examples and comments

Literal

any character or text, except metacharacters \^$.|?*+()[{

pill matches "pill" in "caterpillar"
a matches the first "a" in "caterpillar" but not the second (see the above note)

Metacharacters are part of the regular expression syntax; to match them literally, escape them with a backslash. For example, to match 1+1, use the expression 1\+1; otherwise "+" is interpreted as a repetition operator.
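As a minimal sketch of the escaping rule above, the hypothetical helper `escape_literal` below puts a backslash in front of every metacharacter from the list. It is written in Python and checked with Python's `re` module, which is only a stand-in for illustration and not the SDK's own engine.

```python
import re

# Metacharacters listed above: \ ^ $ . | ? * + ( ) [ {
METACHARACTERS = set("\\^$.|?*+()[{")

def escape_literal(text: str) -> str:
    """Return a pattern that matches `text` literally (hypothetical helper)."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

pattern = escape_literal("1+1")          # the pattern 1\+1
match = re.search(pattern, "1+1=2")      # finds the literal text "1+1"
unescaped = re.search("1+1", "1+1=2")    # here "+" is a repetition, so nothing matches
```

Escaping user-supplied text this way before embedding it in a larger expression keeps characters such as "." or "(" from being read as syntax.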

Any symbol

. (dot)

s.t matches "sat" and "sit" but not "seat"

Character set

[]

matches a single character, which may be any one character from the set: gr[ae]y matches both "gray" and "grey" but not "graey"

Character range in a set

- (minus)

[0-9] matches a single digit
ranges can be concatenated within one set: [a-zA-Z0-9] matches a single alphanumeric character

Negated character set

[^]

[^0-9] matches anything that is not a digit
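The set, range, and negated-set examples above can be tried out with the short sketch below. It assumes Python's `re` module as a stand-in for the SDK engine (the two agree on this part of the syntax), and `matches_fully` is an illustrative helper rather than an SDK function.

```python
import re

def matches_fully(pattern: str, text: str) -> bool:
    """True if the whole of `text` matches `pattern` (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# A set matches exactly one character, so "graey" is rejected.
set_results = {w: matches_fully(r"gr[ae]y", w) for w in ("gray", "grey", "graey")}

# Several ranges can be listed in one set.
alphanumerics = re.findall(r"[a-zA-Z0-9]", "A-1 b")

# A negated set matches any single character that is not listed.
non_digits = re.findall(r"[^0-9]", "a1b2")
```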

Shorthand character sets

\s — any whitespace
\S — anything that is not a whitespace
\d — any digit
\D — anything that is not a digit
\w — a word character, which includes alphanumerics and punctuation marks
\W — a non-word character
\R — a new line character or the CR LF sequence
\v — a new line character but not the CR LF sequence
\V — a non-new line character
\h — a horizontal white space character
\H — anything except horizontal white space
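A few of these shorthands can be illustrated with Python's `re` module, used here purely as a stand-in for the SDK engine. Only \d, \D, \s and \S are shown, because Python's `re` does not accept \R, \h or \H, and its \w may not coincide with the definition given above.

```python
import re

text = "Total:\t42 items"

digits = re.findall(r"\d", text)                   # every single digit
first_gap = re.search(r"\s", text).group()         # first whitespace character (the tab)
letters = re.search(r"\D\D\D", "12abc3").group()   # three consecutive non-digits
pairs = re.findall(r"\S\S", "ab cd")               # pairs of non-whitespace characters
```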

 

Non-printable characters

\n — line feed LF
\r — carriage return CR
\t — tab character
\f — form feed
\a — bell character \u0007
\e — escape character

 

Unicode character

\uFFFF
\x{FFFF}

\u20AC or \x{20AC} matches the euro currency sign.

Character by its hexadecimal index

\xFF

\xA9 matches the copyright symbol in the Latin-1 character set
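The \uFFFF and \xFF forms can be checked with the sketch below, which again assumes Python's `re` module as a stand-in for the SDK engine. The \x{FFFF} form is left out because Python's `re` does not accept it.

```python
import re

# \u20AC is the euro sign, \xA9 is the copyright sign.
euro = re.search(r"\u20AC", "Price: 15 €")
copyright_sign = re.search(r"\xA9", "© 2016")
```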

Alternation

|

abc|123 matches either "abc" or "123"
|word matches either an empty string "" or "word"
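The two alternation examples above behave as follows in Python's `re` module, which is assumed here as a stand-in for the SDK engine:

```python
import re

# Either branch of the alternation may match.
found = re.findall(r"abc|123", "abc 123 abd")

# An empty first branch makes the whole expression optional.
empty_ok = re.fullmatch(r"|word", "") is not None
word_ok = re.fullmatch(r"|word", "word") is not None
other_ok = re.fullmatch(r"|word", "words") is not None
```

Here `found` holds "abc" and "123", `empty_ok` and `word_ok` are true, and `other_ok` is false.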

Repetitions

+ — matches once or more times
* — matches zero or more times
? — matches zero times or once (optional match)
{n} — matches exactly n times
{n,m} — matches n to m times
{n,} — matches n or more times
{,m} — matches zero or more times up to m

colou?r matches "color" and "colour"
[a-zA-Z0-9]{2,4} matches an alphanumeric code of 2 to 4 characters

Note that all repetitions are greedy (they prefer to match as much as possible): c.*r matches the whole of "caterpillar" rather than stopping at "cater". In such cases a negated character set works better: c[^p]* matches "cater".
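The greedy behavior described above can be observed with a short sketch. It uses Python's `re` module as a stand-in for the SDK engine; Python's default quantifiers are greedy as well, so the results illustrate the same rule.

```python
import re

# Greedy: ".*" runs to the last "r" in the text.
greedy = re.search(r"c.*r", "caterpillar").group()

# A negated set stops the repetition at the first "p".
bounded = re.search(r"c[^p]*", "caterpillar").group()

# "u?" makes the "u" optional, but allows it only once.
spellings = [w for w in ("color", "colour", "colouur") if re.fullmatch(r"colou?r", w)]

# {2,4} takes as many characters as it can, up to four.
code = re.search(r"[a-zA-Z0-9]{2,4}", "#AB12CD").group()
```

Because lazy quantifiers are not available, restricting the repeated character set as in `bounded` is the usual way to keep a match from running too far.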

Grouping

()

(word)+ matches "word", "wordword" and so on
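The effect of grouping on a repetition can be seen by comparing the pattern with and without parentheses. The sketch assumes Python's `re` module as a stand-in for the SDK engine.

```python
import re

# With a group, "+" repeats the whole word.
grouped = re.search(r"(word)+", "wordwordw").group()

# Without a group, "+" repeats only the preceding character "d".
ungrouped_ok = re.fullmatch(r"word+", "worddd") is not None
grouped_ok = re.fullmatch(r"(word)+", "worddd") is not None
```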

Unsupported syntax

The following regular expression syntax features are not yet supported in ABBYY Real-Time Recognition SDK:

  • Anchors: ^ (beginning of a line), $ (end of a line), \b (word boundary) and its negation \B, and others.
  • Lazy quantifiers such as +? or {n,m}? that prefer to match as few times as possible.
  • Concatenation with nested character sets such as [[a-z][0-9]].
  • Advanced features such as lookarounds, backreferences, possessive quantifiers, named groups, non-capturing and atomic groups, evaluation flag settings, and others.
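Since these constructs are easy to carry over from other regex dialects by habit, it may help to check a pattern before handing it to the SDK. The sketch below is a rough heuristic under that assumption: `find_unsupported` is a hypothetical helper, not part of the SDK, and it recognizes only the most common forms from the list above.

```python
from typing import List

def find_unsupported(pattern: str) -> List[str]:
    """Heuristically list constructs from the unsupported-syntax list."""
    issues = []
    in_set = False                      # True while scanning inside [...]
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if ch == "\\":
            if not in_set and nxt in ("b", "B"):
                issues.append(f"word-boundary anchor \\{nxt}")
            elif not in_set and nxt.isdigit() and nxt != "0":
                issues.append(f"backreference \\{nxt}")
            i += 2                      # skip the escaped character
            continue
        if in_set:
            if ch == "[":
                issues.append("nested character set")
            elif ch == "]":
                in_set = False
        elif ch == "[":
            in_set = True
            if nxt == "^":
                i += 1                  # "^" right after "[" is negation, not an anchor
        elif ch in "^$":
            issues.append(f"anchor {ch}")
        elif ch == "(" and nxt == "?":
            issues.append("special group (?...)")
        elif ch in "+*?}" and nxt == "?":
            issues.append(f"lazy quantifier {ch}?")
        elif ch in "+*?}" and nxt == "+":
            issues.append(f"possessive quantifier {ch}+")
        i += 1
    return issues

problems = find_unsupported(r"^\d+?$")  # anchors plus a lazy quantifier
clean = find_unsupported(r"[^0-9]\d{2,4}")
```

An empty result only means that none of these particular forms was found; it does not guarantee that the SDK accepts the pattern.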