Composable Regular Expressions and Fields

Martin Fowler wrote a brief article about composing regular expressions in order to make it easier to deal with individual “tokens” and to give them structure.

As a Reddit commenter noted, string constants can be unwieldy to work with and are easy to break. Scheme Shell solves this problem with SRE, an s-expression-based syntax for structured regular expressions. I won’t describe SRE in full here; instead, I’ll convert Fowler’s example to some SREs to show how much more useful structured regular expressions are when you use them in a Lisp-like language.

His structured pattern, written in C#, looks like this:

const string scoreKeyword = @"^score\s+";
const string numberOfPoints = @"(\d+)";
const string forKeyword = @"\s+for\s+";
const string numberOfNights = @"(\d+)";
const string nightsAtKeyword = @"\s+nights?\s+at\s+";
const string hotelName = @"(.*)";
const string pattern =  scoreKeyword + numberOfPoints +
  forKeyword + numberOfNights + nightsAtKeyword + hotelName;

He also has an alternative method that joins the pieces with a whitespace pattern:

private String composePattern(params String[] arg) {
  return "^" + String.Join(@"\s+", arg);
}
const string numberOfPoints = @"(\d+)";
const string numberOfNights = @"(\d+)";
const string hotelName = @"(.*)";
const string pattern =  composePattern("score", numberOfPoints, 
  "for", numberOfNights, "nights?", "at", hotelName);

Here is what that looks like in Scheme Shell:

(define number-of-points (rx (submatch (+ digit))))
(define number-of-nights (rx (submatch (+ digit))))
(define hotel-name (rx (submatch (* any))))
(define s+ (rx (+ whitespace)))
; SRE strings match literally, so the optional "s" is written as (? "s")
(define pattern
  (rx "score" ,s+ ,@number-of-points ,s+ "for" ,s+ ,@number-of-nights
      ,s+ "night" (? "s") ,s+ "at" ,s+ ,@hotel-name))

Now that I’ve written that out, it looks cumbersome. It wasn’t a huge pain to type out, but all of that whitespace matching looks redundant and I can see why Fowler wrote up the composePattern function.
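
For comparison, here is a rough Python sketch of the same whitespace-joining idea, using the standard `re` module. The `compose_pattern` helper and the constant names are my own transliteration of the C# snippet above, not code from Fowler’s article, and the sample line is just an illustration.

```python
import re

def compose_pattern(*parts):
    """Join regex fragments with a whitespace pattern, anchored at the start."""
    return "^" + r"\s+".join(parts)

# Hypothetical names mirroring the C# constants
NUMBER_OF_POINTS = r"(\d+)"
NUMBER_OF_NIGHTS = r"(\d+)"
HOTEL_NAME = r"(.*)"

PATTERN = compose_pattern(
    "score", NUMBER_OF_POINTS, "for", NUMBER_OF_NIGHTS, "nights?", "at", HOTEL_NAME
)

match = re.match(PATTERN, "score 400 for 2 nights at Minas Tirith Airport")
points, nights, hotel = match.groups()  # each group is still a string
```

The captured groups come back as strings, so the two numbers still need an `int()` call before they can be used as numbers.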

Let’s try treating this problem differently by assuming that the important data is separated by whitespace, and let’s call the stuff in between the whitespace a “field”. Once we do that, we no longer need regular expressions at all. We can use AWK, or any language with a string-splitting function, to break each line into fields, then look at each field and convert it to the appropriate data type.

Here is my Python solution:

# fields:    0    1   2  3   4     5   6     7       8
example = "score 400 for 2 nights at Minas Tirith Airport"
fields = example.split() # Python assumes whitespace as the delimiter
numberOfPoints = int(fields[1]) # 400
numberOfNights = int(fields[3]) # 2
hotelName = ' '.join(fields[6:])   # 'Minas Tirith Airport'

There might be a slightly better solution that treats the hotel name differently, but it doesn’t really matter. The point is that this is no longer a problem that requires the use and composition of regular expressions.
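
One way to treat the hotel name differently is to cap the number of splits, so that everything after the sixth gap stays together as a single field. The sketch below assumes the line always has the shape "score N for M night(s) at NAME"; the `parse_score_line` function and its error messages are hypothetical names of mine, not part of any library.

```python
def parse_score_line(line):
    """Parse 'score <points> for <nights> night(s) at <hotel name>'."""
    # maxsplit=6 stops after six splits, so the hotel name is one field
    parts = line.split(maxsplit=6)
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {line!r}")

    keyword, points, for_kw, nights, nights_kw, at_kw, hotel = parts

    # Check the fixed keywords that the regular expression used to enforce
    if (keyword, for_kw, at_kw) != ("score", "for", "at"):
        raise ValueError(f"unexpected keywords in: {line!r}")
    if nights_kw not in ("night", "nights"):
        raise ValueError(f"expected 'night' or 'nights' in: {line!r}")

    # int() raises ValueError by itself if a number field is not numeric
    return int(points), int(nights), hotel.rstrip()

result = parse_score_line("score 400 for 2 nights at Minas Tirith Airport")
```

Unlike the `' '.join(...)` version, this keeps any runs of whitespace inside the hotel name exactly as they were typed, and it rejects lines whose keywords are wrong instead of silently accepting them.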