Regexen are a bit like hammers, they're as easy to use as they are blunt instruments, & having gained some familiarity with them, suddenly one is surrounded by things that beg to be whacked; before you know it, you'll be surrounded by bent nails.
They were lean & mean when I first encountered them in ed, acquired some fat around the gut with the inclusion of capture-groups, back-references, & POSIX character-classes, & now look like Mr Creosote after being fed on a feature-rich diet in Perl-5, which included look-around assertions, shortcuts & ironically, non-greedy quantification.
They've outgrown their origins to such an extent, that they can no longer justify their original name "regular expression" (given its anachronistic association with the subject of "regular algebra", from which they emerged), but are now collectively known as "regexen".
I see several problems with regexen:
I want to whack different types of thing with the hammer, rather than just character-strings. The obvious question at this point is, "Why would you want a Swiss Army hammer" ?! Specifically, I wanted a regex-engine that could match bidding-sequences in the card-game "Contract Bridge", but whilst I have to admit to the lack of another good example, I didn't even know of one good example until I stumbled across it, so I'm rather hoping that this is a solution waiting for a problem.
Traditionally a regex-engine, not only consumes character-strings as input-data, but reads a regex-specification encoded within one, using a terse, & almost standard, notation. This was necessary, to facilitate such use-cases, as passing the regex-specification to /e?grep|sed/ from the UNIX command-line, or typing it into ed or vim, but this paradigm persisted even within the source-code of languages like Perl. This terse encoding has IMHO, become progressively less appropriate as the feature-set has grown, a tacit admission of which is the addition, to Perl-5's regex-specification, of white space & comments. Whilst these additions certainly improved readability, the resulting regex-definition also began to look suspiciously like a DSL.
Whilst it may be convenient to hide the specification of regexen from the compiler of the programming language in which they're embedded, allowing the latter to remain unchanged but without stifling the relatively rapid pace at which new regex-features are invented, syntax-errors within a regex can in consequence persist until run-time.
Debugging regexen can be hard, even if their syntax is checked statically. Some regex-engines permit one to retrieve the input-data captured by specific portions of a regex, which is a useful aid to debugging, but this feature is typically incomplete, only permitting retrieval of the last data that a repeatable capture-group matched, leaving one unclear what happened on previous repetitions.
My Big Idea
regexdot aims to rectify these specific issues:
Whilst traditional regex-engines binge-out on a pure diet of character-strings, this one's omnivorous, caring only that the data-type it consumes, implements both of the type-classes "Eq" & "Show". It helps if the data-type, from which the list of input-data is composed also implements the type-class "Read", allowing both the input-data to be defined, & the regex to be encoded, within character-strings, thus making it directly accessible from the command-line.
The only reference to this concept that I could find was, ironically, as integral feature of a joke-language; "A note on the Haskerl extension to Haskell", Will Partain, 1 April 1993.
NB: while there are already many Haskell regex-engines which are polymorphic in their return-type, i.e. that can return a result appropriate to the requirements of the environment from which they were called, it isn't this concept to which I currently referring; though I do in the discussion about the package "regexchar".
It exports a simple API, exposing the syntax of the regex, to the unforgiving scrutiny of the Haskell-compiler.
It allows one to discover the complete mapping of input-data into the regex.
The inclusion of capture-groups & sub-expressions, introduced a recursive element to the definition of a regex, making the complete mapping of input-data into the regex, the shape of a Rose tree. Querying this tree-shaped result, to determine what input-data was consumed by a specific quantified meta-datum, is difficult since the detailed structure of the result-tree is uncertain until the solution is known, but the whole can still be methodically traversed & therefore printed, & doing so is refreshingly informative compared with the prospect of staring myopically at a dysfunctional regex, vainly hoping for a moment of clarity.
You may wonder why a direct implementation of a polymorphic regex-engine is required. If, for example, one has an enumerated type, with unique instances corresponding to say, the colours of the rainbow, & one wishes to match a lists of these colours, against a regex describing a colour-pattern (exactly why one might want to do this, isn't really the point), couldn't one just define a mapping, from each of the possible enumerated colours, to an arbitrary seven-member subset of characters, then perform the match using a traditional regex-engine ? This scheme, for which Heath Robinson doubtless holds a patent, might work to a limited extent, but closer inspection reveals its inadequacy:
- If we extend the scope of the above example, to an enumerated type including all the 12000 colours in the Dulux Trade-range, then there aren't going to be enough ASCII-characters to which they can be uniquely mapped. Whilst there are enough characters in UTF-16, there's always an example waiting around the corner which exceeds that too, for example the infinite set of integers.
- What happens when the regex-specification includes Perl-style shortcuts. Why should an instance of the enumerated type, just because it was arbitrarily mapped to one of the white-space characters, match /\s/ ?
- What does case-insensitivity now mean ?
- The input-data not only has to be mapped from its original enumerated type to the arbitrarily chosen subset of characters, but captured data has to be mapped back to the original type.
- The readability of the code would be poor, & debugging it, abnormally unpleasant.
This website would be much smaller if I'd decided that the best strategy was not to bother, & though I don't currently know of many instances in which such a beast might be useful, my gut-instinct is that it's worth producing; regrettably, my gut can't even forecast meal-times accurately.
The implementation of a regex-engine falls into three basic categories; recursive, NFA, & DFA. The Thompson NFA algorithm is theoretically most efficient, but; it trades the undesirable exponential time-complexity inherent to other implementations, for (arguably less) undesirable space-complexity; can't easily be queried regarding the nature of any match; & makes the implementation of some newer regex-features difficult. Because I want to; know the precise nature of the match; implement features like lazy quantification; & retain the option to implement back-references; regexdot is implemented using the simple recursive algorithm.
This design-choice results in relatively poor performance on some pathological regexen, but on typical regexen, this performance-penalty is masked anyway, by the poor rate of data-consumption resulting from its inherent polymorphism, which blocks adoption of Char-specific optimizations. On the plus side, the algorithm is capable of evaluating alternative sub-expressions in parallel, & will do so when linked with the threaded runtime & where multiple CPU-cores are available.
No concept outside a character-based regex
† Fully supported by the package "regexchar".
In the pages to follow, some of the more significant modules in regexdot, will be referenced. To give you some feel for the environment, this is the complete set of files from which it is composed, with those files corresponding to significant modules, emphasised. A summary of the function of each file is available, & can be accessed (given focus), by hovering your cursor over its name.
tar -zxf 'regexdot-0.11.1.2.tar.gz' && ls -pR 'regexdot-0.11.1.2/' #Unpack & list the package-contents.regexdot-0.11.1.2/: changelog.markdown copyright LICENSE README.markdown regexdot.cabal Setup.hs src-lib/ regexdot-0.11.1.2/src-lib: RegExDot/ regexdot-0.11.1.2/src-lib/RegExDot: Anchor.hs CompilationOptions.hs ConsumptionProfile.hs DSL.hs Meta.hs Repeatable.hs Span.hs BracketExpression.hs Consumer.hs DataSpan.hs ExecutionOptions.hs RegEx.hs Result.hs Tree.hs BracketExpressionMember.hs ConsumptionBounds.hs DataSpanTree.hs InstanceInt.hs RegExOpts.hs ShowablePredicate.hs
In order to demonstrate the use of the package "regexdot", I need a test data-type, & lacking a better example, I'll use the one for which it was developed, i.e. the type "Bid". This data-type can be inspected using ghci (using package "bridge-0.1.0.12" which, in contrast to earlier versions, exposes a library).
ghci GHCi, version 7.6.3: https://www.haskell.org/ghc/ :? for help Loading package ghc-prim … linking … done. Loading package integer-gmp … linking … done. Loading package base … linking … done. Prelude>
:module +Bridge.Tier2.BidPrelude Bridge.Tier2.Bid>
:info Biddata Bid = Redouble | Double | Pass | MkBid BidLevel.BidLevel Trump.Trump -- Defined at Bridge/Tier2/Bid.hs:103:6-8 instance Enum Bid -- Defined at Bridge/Tier2/Bid.hs:124:10-17 instance Eq Bid -- Defined at Bridge/Tier2/Bid.hs:109:17-18 instance Ord Bid -- Defined at Bridge/Tier2/Bid.hs:115:10-16 instance Read Bid -- Defined at Bridge/Tier2/Bid.hs:141:10-17 instance Show Bid -- Defined at Bridge/Tier2/Bid.hs:135:10-17 Prelude Bridge.Tier2.Bid>
take 38 [Redouble ..]Loading package array-0.4.0.1 … linking … done. Loading package deepseq-220.127.116.11 … linking … done. Loading package bytestring-0.10.0.2 … linking … done. Loading package containers-0.5.0.0 … linking … done. Loading package binary-0.7.6.1 … linking … done. Loading package filepath-18.104.22.168 … linking … done. Loading package old-locale-22.214.171.124 … linking … done. Loading package time-126.96.36.199 … linking … done. Loading package unix-188.8.131.52 … linking … done. Loading package directory-184.108.40.206 … linking … done. Loading package pretty-220.127.116.11 … linking … done. Loading package process-18.104.22.168 … linking … done. Loading package Cabal-22.214.171.124 … linking … done. Loading package parallel-126.96.36.199 … linking … done. Loading package primes-0.2.1.0 … linking … done. Loading package random-188.8.131.52 … linking … done. Loading package QuickCheck-2.6 … linking … done. Loading package toolshed-0.16.0.0 … linking … done. Loading package factory-0.2.1.2 … linking … done. Loading package transformers-0.3.0.0 … linking … done. Loading package mtl-2.1.2 … linking … done. Loading package text-0.11.3.1 … linking … done. Loading package parsec-3.1.3 … linking … done. Loading package regexdot-0.11.1.2 … linking … done. Loading package bridge-0.1.0.12 … linking … done. [xx,x,Pass,1C,1D,1H,1S,1NT,2C,2D,2H,2S,2NT,3C,3D,3H,3S,3NT,4C,4D,4H,4S,4NT,5C,5D,5H,5S,5NT,6C,6D,6H,6S,6NT,7C,7D,7H,7S,7NT]
As required, the data-type "Bridge.Tier2.Bid.Bid" implements the type-classes "Eq" & "Show". ghci used the latter, to print the 38 unique instances of the data-type, which can be built using the four data-constructors.
The first three Bid-instances each have their own nullary data-constructor, but the remainder must be constructed using the single binary data-constructor, so for convenience, the module "Bridge.Tier2.Bid" also exports constants, corresponding to each these.