Finding and Extracting Substrings

The beauty of Prolog as a text processing language is that definite clause grammars (DCG's) are not only part of it, but almost an inevitable part, and may be used for constructing and decomposing pieces of text as well as matching them.

As an example of the power of definite clause grammars, suppose we want to take American-style dates apart. Here is a grammar:

     usa_date(Y, M, D, MDY) :-
             usa_date(Y, M, D, MDY, "").
     usa_date(Y, M, D) -->
             digits(M), "/", digits(D), "/", digits(Y).
     digits([D|Ds]) -->
             [D], {is_digit(D)},
             (   digits(Ds)
             ;   {Ds = []}

With this definition, we can take dates apart:

     | ?- usa_date(Y, M, D, "12/25/86").
     Y = "86",
     M = "12",
     D = "25"
     | ?- usa_date(Y, M, D, "2/1/87").
     Y = "87",
     M = "2",
     D = "1".
     | ?- usa_date(Y, M, D, "1-feb-87").

We can also put dates together:

     | ?- usa_date("86", "12", "25", Date).
     Date = "12/25/86"
     | ?- usa_date("87", "2", "1", Date).
     Date = "2/1/87"

Thanks to the fact that non-terminals in a DCG can take arguments, and with a little care, you can write quite complicated grammars that can be used for composition as well as decomposition.

If you want to do any sort of text processing in Prolog, you should learn how to use grammar rules. A well-written grammar requires less mental decoding than a program using the operations in library(strings). Here are versions of usa_date/4 written using the operations in library(strings). cons_date/4 can only build a date, and dest_date/4 can only take one apart. Both are simplified, and do not check that Y, M, and D are made of digits.

     cons_date(Y, M, D, Date) :-
             concat_atom([Y,/,M,/,D], Date).
     dest_date(Y, M, D, Date) :-
             substring(Date, /, I, 1),       % find left /
             substring(Date, /, J, 1, R),    % find right /
             J > I,
             substring(Date, Y, 0, I),       % extract Y
             substring(Date, D, _, R, 0),    % extract D
             P is I+1, Q is R+1,             % widen fringes
             substring(Date, M, P, _, Q).    % extract M

It is not immediately obvious what this does, whereas the version using grammar rules is considerably clearer.

The argument is sometimes raised that, while grammar rules may be more elegant, string operations are more efficient. However, it is the daily experience of Prolog programmers that "clean" and "efficient" tend to describe the same code.

The following relative times were measured using Quintus Prolog on an MC68020:

     cons_date('86', '12', '25', _)
     ------------------------------ = 1.5
     usa_date( "86", "12", "25", _)
     dest_date(_, _, _, '86/12/25')
     ------------------------------ = 4.5
     usa_date( _, _, _, "86/12/25")

In both cases, processing lists of character codes using grammar rules was more efficient than using "string" operations. If the "string" operations were built in, rather than being part of the library, they could be faster than they are. Even so, using grammar rules would still be the preferred method.