RegExChar

Background to Regular Expressions (Skip forwards if you already understand them)

Superficially this topic may appear a shade dull, & whilst there's more than a grain of truth to this point of view, you've already progressed beyond the title, so can be considered to be on a bit of a roll. Using this unprecedented momentum, I'll continue to reveal the weapons-grade dullness that exists tantalisingly beneath the surface.

Colonel Gaddafi.

Let's say, you want to search for the name Muammar Abu Minyar al-Gaddafi in a telephone-directory for Libya, but are uncertain of the precise spelling: recent (2011-10-20) events have overtaken this requirement, but he has many offspring who might equally like a phone-call. If you can access the phone-book in a digital format, you could write a program to iterate through the candidate names in the directory, looking for the first to match any of the possible spellings. This simple solution is tolerable, unless the set of possible spellings with which each candidate name from the directory is compared, is large; in this case there are only about 37 possible spellings. Alternatively one may, after deciding to start some shady business involving cold-calling obscenely rich people, want to extract all double-barrelled names from this same telephone-directory, so, adopting a similar approach, one creates a large list containing the subset of all known double-barrelled names, then iterates though each candidate name from the directory, & compares it with this match-list. This is clearly a lousy solution, but as such illustrates, that the similar solution to the initial problem is also lousy. Though it has a tolerable O(n × m) time-complexity, the real problem is that the value of 'm' is unnecessarily large.

Regular-expressions provides a concise way to define such subsets. As an alternative to exhaustively enumerating the possible spellings of the colonel's name, one could isolate those parts which are common to all acceptable spellings, & join them with shorter sequences describing the permissible alternatives. By this method, the combinatorial explosion in the choices of capitalisation, hyphenation, & optional middle names, can be defined in a single concise expression. The number of possible solutions is still as unpleasantly large as it always was, but the chance of acquiring an unpleasant case of RSI whilst defining them, has undoubtedly been reduced.

Post-box

You might legitimately say that the above example is rather contrived, since you've little interest in either phoning the Colonel or cold-calling anybody, & if that's the best reason I can dredge-up for you to understand regular-expressions, well you may opt to postpone it in favour of completing a few warranty-coupons, ironing your shoe-laces, or extracting some of your navel-lint. To make it more relevant, you may, as did I for the construction of this web-site, have the requirement to validate email-addresses. The standard for this is only marginally less complex than an HMRC self-assessment tax-return, but there are some basic features all email-addresses possess (other than being spam-magnets), like an '@'-character separating the local part from the domain. The local part consists of a '.'-separated list of non-null names, while the domain consists of a '.'-separated list of host-names, ending in one of a set of well-defined TLDs; though ICANN keep rubber-stamping new ones. The maximum permissible length of the address is 255 characters, & though the character-set from which the domain is traditionally composed is quite restricted (unless one takes the bold move to account for internationalised domain-names), the character-set from which the local part can be composed is more liberal, making the total number of valid email-addresses far too great to enumerate. One could consider a valid email-address, to be a huge subset of the infinite set of all strings, just as the permissible spellings of the colonel's name, or those names classified as double-barrelled, are a subset of all names. It is the definition of such subsets, that is the stamping-ground of regular-expressions. One can then use that definition, to ask whether a candidate string is a member of this subset, rather than iterating through the exhaustively enumerated members of the subset, & asking whether the current candidate is an exact match.

This type of problem occurs sufficiently frequently, to warrant the isolation of the solution, into an independent entity called a regex-engine. The regex-engine can be configured for a specific problem, using an encoded string which loosely resembles the strings in the subset you're trying to define, but rather than exhaustively enumerating all the set-members, meta-characters are used to represent the parts of the string where more flexibility is permissible. Once configured, the regex-engine can be passed a candidate data-string, & queried regarding set-membership.

Example

More detailed information is made available when you hover your cursor over specific parts of this example.

/^\s*[-#$^&|!%'`?*+\/={}~\w]+(\.[-#$^&|!%'`?*+\/={}~\w]+)*@([A-Z\d][-A-Z\d]{0,62}\.)+([A-Z]{2,3}|AERO|ARPA|ASIA|COOP|INFO|JOBS|MOBI|MUSEUM|NAME|TRAVEL)\s*$/i

Before complaining about my failure to account for some corner-case, there are a few bizarre formats covered by the standard (e.g. a double-quoted local part, a domain specified using a bracketed IP-address, or use of Unicode), which I've chosen to exclude from the set of valid matches, for clarity … & because I couldn't be bothered.

Traditionally one would employ a simple regex as the introductory example, which this is not. So, not wanting to gain the distinction of being the individual responsible for dissuading you from any subsequent encounter, don't be daunted by the sea of symbols; it's less complicated than might initially appear. What the above example does do, besides actually being useful, is demonstrate the way in which one configures a typical regex-engine, & provide a broad (though incomplete) selection of the possible instructions. The syntax used in this mini language, is largely standardised by POSIX, though it's a rapidly evolving area. I'll let other texts be your tutorial, since I don't believe I've anything to add.

In this instance the regex is delimited by '/'-characters, following which an 'i' is used to specify globally case-insensitivity; the syntax for both of which varies between implementations.

Between this, one can see anchors, which moor the corresponding end of the regex, preventing it from floating freely within the sea of input-data:

Stepping-stones over Crocodile-infested River.
^
Requires that the regex matches from the very start of the input-data.
$
Requires that the regex matches from the very end of the input-data.

Working inwards, one can just see literal parts of the email-address poking through the more liberal meta-characters, like stepping-stones across a crocodile-infested river; from 10,000 meters above, one can see first the local part, then the '@'-separator, & finally the TLD.

The meta-characters used, are defined as:

.

A wild-card, which matches any input-datum.

[]

Delimits a character-class, which, by default, defines the set of input-data, any one of which the regex is permitted to consume; though this can be amended to match the set of input-data which the regex isn't permitted to consume, by prefixing the set with '^'.

Bracket-expressions are a type of character-class, in which the members are either enumerated literally, or defined as a range of values, separated by a '-', or some combination of these.

Common bracket-expressions have been predefined, & can be specified by a shortcut prefixed by a '\' (which is also used as an escape-character, either when specifying the requirement for some types of white-space, or when using a meta-character in a literal sense within a regex). These Perl-style shortcuts can either be used within a regex as a meta-character, or included as a member of another bracket-expression.

Perl Camel.
Perl-style Bracket-expression shortcuts
Negated
Shortcut Equivalent
Bracket-expression
Shortcut Equivalent
Bracket-expression
\d [0-9] \D [^0-9]
\s [ \t\r\n\v\f] \S [^ \t\r\n\v\f]
\w [A-Za-z0-9_] \W [^A-Za-z0-9_]

As a more comprehensive locale-independent alternative, the IEEE Std 1003.1, 2004 Edition, POSIX-standard defines character-classes, for use only as members of a bracket-expression, using a more verbose notation.

POSIX Character-classes
Character
Class
ASCII Equivalent
Bracket-expression Shortcut
[:alnum:] [A-Za-z0-9]
[:alpha:] [A-Za-z]
[:blank:] [ \t]
[:cntrl:] [\x00-\x1F\x7F]
[:digit:] [0-9], \d
[:graph:] [\x21-\x7E]
[:lower:] [a-z]
[:print:] [\x20-\x7E]
[:punct:] [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]
[:space:] [ \t\r\n\v\f] \s
[:upper:] [A-Z]
[:xdigit:] [0-9A-Fa-f]
Non-standard
Additions
[:word:] [A-Za-z0-9_], \w
[:ascii:] [\x00-\x7F]

Since these are only for use within bracket-expressions, they can be negated by prefixing with '^', like the contents of any normal bracket-expression.

Julius Newtree set, defined by z=sqrt(z^4)+c. |
Separates alternative sub-expressions, each of which has the same syntax as the regex of which it forms a part. This makes the definition recursive; i.e. the regex can be defined using other regexen.
()
Delimits a capture-group, which contains either one, or a set of alternative, sub-expressions. When the capture-group is also repeatable, The manner in which the input-data maps into the regex, becomes tree-shaped; each capture-group being a node from which new branches, each representing the sub-expression which matched on that single repetition, emerges. One doesn't normally either know or care about the shape of the match, since the typical requirement is merely to query the regex regarding set-membership.

Meta-characters & capture-groups (as above), can be quantified by postfix operators, which define the number of times they may be invoked to consume matching input-data, which by default is exactly one:

Quantifiers
Greedy Non-greedy
Unary * *? Zero or more input-data items.
+ +? One or more input-data items. Whilst this could be defined as a special case of the above, using a literal followed by zero or more instances of the same literal, it occurs sufficiently frequently to warrant a dedicated meta-character.
? ?? Optional input-datum.
Binary {n,} {n,}? At least n times.
{n} Exactly n times. Note that this has no concept of greediness.
Ternary {n, m} {n, m}? From n to m items of input-data. Whilst this general form could be used to define all the above in a single consistent syntax, the resulting regex would be excessively verbose, so is probably best reserved for those cases not already covered by a dedicated meta-character.

The traditional greedy quantifiers, by preference cause the qualified meta-character or capture-group, to consume the most input-data permissible, before any remaining input-data is forwarded to the remainder of the regex. The corresponding set of non-greedy or lazy quantifiers, as one might guess, by preference consume the least permissible input-data before forwarding the remainder. Whilst strict observation of this preference isn't always possible, since it may lead to an inferior solution or perhaps none at all, it does affect the search-order, & thus potentially the efficiency (though neither approach is more efficient in general), & can affect the choice amongst otherwise equally good solutions. The package "regexdot", & consequently regexchar, supports this feature.

Data-consumption

This is just a selection of the language by which one can instruct a typical regex-engine. The important point (though it doesn't really rank alongside either the explosion of world-population or of Yellowstone Caldera) is that it has no direct connection to either telephone-directories or email-addresses; the solution has been abstracted from the specific details of the problem. One concatenates a sequence of meta-characters, each of which specifies a requirement from the input-data, of either a literal character or a more liberal set of characters, & which may be repeatedly invoked to consume matching input-data, according to quantified consumption-bounds. Terminal anchors can optionally be defined around the concatenation, which force the resulting regex to match the corresponding end of the input-data stream. Several regexen can be defined & expressed as alternative sub-expressions. One or more sub-expressions can be associated in a capture-group, which may also be repeatedly invoked to consume input-data, according to quantified bounds.

Once defined, the regex can be fed sample input-data to determine whether it matches. On receipt of input-data, the regex-engine, will direct it towards the first term in the regex, permitting it to consume matching data within the quantified consumption-bounds, & according to greed, before re-directing the remaining input-data towards the next term in the regex. If a term in regex can't consume data within the quantified consumption-bounds, then alternative solutions are explored, in which previous terms are either required to regurgitate any excess input-data they can spare, or are force-fed supplementary data until bursting-point.

Why ?

This page isn't about learning to use regexen, it's about the implementation of a regex-engine, "regexchar". Since there are already many good regex-engines, one has to question why I've implemented another. The reason is more about the way it was built, than the end-result. This implementation is a specialisation of the underlying polymorphic library "regexdot", to the traditional narrow role of processing character-based input-data. It is the polymorphic core which is the special part of this implementation, since this is applicable to the solution of a far wider class of problems.

Regex-engines are extremely bug-prone, so rigorous testing is vital, but defining a reasonably complete set of tests, & deciding the correct solutions according to a complex standard, is daunting. Not only is it difficult, but ideally should be done by someone else, to avoid incorporating one's own misinterpretation of that standard. The problem of testing is exacerbated by one particular feature of regex-engines; capture-groups. This feature allows one to delimit specific portions of the regex, then after successfully matching some input-data, to retrieve the corresponding captured portion. Under these circumstances a match-operation between a string & a regex, returns more than a simple Bool result, it returns a set of captured input-data. The specification of exactly what should be captured is understandably much more complex than merely stating whether the supplied input-data is a member of the set defined by the regex. I know of no test-suites applicable to a polymorphic regex-engine, indeed I'm neither certain that the task is even possible, nor do I know of any other such regex-engines which may have blazed the trail. I built regexchar, so that I could re-use existing character-specific regex test-suites, on the assumption that since the parametric polymorphism of the underlying library, prohibits any reference to type-specific input-data (Char or otherwise), I'm merely testing the type-agnostic logic in the package "regexdot", which whilst necessary, should also be sufficient.

Features

To be a suitable test-subject, regexchar needs to be reasonably fully-featured, otherwise a significant number of the existing standard tests won't be applicable. Some features of a typical regex-engine, either have no meaning outside a character-specific context, or can't be defined without knowledge of the actual type-parameter, & therefore aren't implemented in the underlying polymorphic regex-engine, nor therefore automatically available to any specialisation, but must be appropriately implemented by each specialisation, where applicable. So regexchar builds on the features inherited from the package "regexdot", to provide traditional character-specific features.

Bracket-expression ranges:

These require the formal type-parameter to implement either of the type-classes "Enum" or "Ord", which, to avoid unnecessarily constraining the range of type-parameters to which it's applicable, the underlying polymorphic regex-engine purposely doesn't mandate.

Knowing the type-classes to which the actual type-parameter "Char" belongs, regexchar implements these.

Perl-style shortcuts:

Though the concept can be applied to input-data of arbitrary type, by defining custom domain-specific shortcuts (& regexdot facilitates such customisation), the traditional Perl-style shortcuts are only meaningful wrt character-based input-data.

regexchar uses the infrastructure provided by regexdot, to implement the traditional shortcuts.

POSIX character-classes:

This feature has no meaning except for character-based input-data, & therefore can't be made available by regexdot.

regexchar implements these.

Terse regex-specification, within a String:

Though the polymorphic regex-engine can read a regex-specification from a String, it expects that String to define a comma-separated list of optionally quantified meta-data, & necessarily excludes any anchor-specifications from this homogeneous list, since their type is independent of the polymorphic type-parameter. The encoding of the traditional character-based regex, requires no separators between meta-characters, because it not unreasonably assumes unescaped characters have a uniform length (of one), & since anchors are also represented by a single character, can be integrated with the resulting String.

regexchar defines a dedicated parser, so that unmodified traditional character-based regex-specifications can be read.

Case-insensitivity:

This feature has no meaning except for character-based input-data, & therefore can't be made available by regexdot.

Though desirable, this feature hasn't yet been implemented in regexchar.

The package "regexdot", & consequently regexchar, has another unusual feature; the ability to extract the complete structure & content of the match. Rather than speculatively providing a number of alternative function-calls to extract the information commonly required from a match-operation, & hoping that the user's requirement coincides with one of these, all the data from the match can be retrieved, permitting the user to slice & dice it as required.

The Package

In the pages to follow, some of the more significant modules in regexchar, will be referenced. To give you some feel for the environment, this is the complete set of files from which it is composed, with those files corresponding to exposed modules, emphasised. A summary of the function of each file is available, & can be accessed (given focus), by hovering your cursor over its name.

$ tar -zxf 'regexchar-0.9.0.13.tar.gz' && ls -pR 'regexchar-0.9.0.13/'	#Unpack & list the package-contents.
regexchar-0.9.0.13/:
changelog.Debian	copyright	debian/	man/		regexchar.cabal	Setup.hs		src-lib/
changelog.markdown	data/		LICENSE	README.markdown	regexchar.spec	src-exe/	src-test/
./data:
Tests_ATTBasic.txt	Tests_ATTInterpretation.txt	Tests.txt

./debian:
DEBIAN/

./debian/DEBIAN:
control

./man:
man1/

./man/man1:
grecce.1

./src-exe:
Grecce/	Main.hs

./src-exe/Grecce:
CommandOptions.hs	Grep.hs	Test/

./src-exe/Grecce/Test:
Assert/	Performance/

./src-exe/Grecce/Test/Assert:
RegExOptsChar.hs

./src-exe/Grecce/Test/Performance:
ExtendedRegEx.hs	ExtendedRegExTest.hs	ExtendedRegExTestsNegative.hs	ExtendedRegExTestsPositive.hs

./src-lib:
RegExChar/

./src-lib/RegExChar:
ExtendedRegExChar.hs	MetaChar.hs	RegExOptsChar.hs

./src-test:
Grecce/	Main.hs

./src-test/Grecce:
Test/

./src-test/Grecce/Test:
QC/

./src-test/Grecce/Test/QC:
ExtendedRegExChar.hs	MetaChar.hs	RepeatableMetaChar.hs

regexchar's Interfaces

API