9.12 Pattern Matching

In the previous section we saw that the STRLEFT function could be used to extract a portion of a string. The STRRIGHT and STRMID functions can be used in a similar manner. Each of these functions has the same limitation: the length of the extracted string is fixed by the program, not determined by the value of the string.

To illustrate what we mean, consider this problem: your report is passed an environment variable $FIRSTINV that should end in a string of digits of arbitrary length. Your report needs to produce invoice numbers from the demonstration subscription system (a problem similar to the one from the previous section).

This problem differs from the previous section in that we are not given the two parts of the invoice numbers (prefix and sequence). Instead, we are given a model that we must follow.

The overall structure of the report is identical to the final report from the previous section. The differences are in the techniques used to produce the invoice keys.

Before we can produce invoice keys, we still need the two parts: prefix and sequence. Notice how the starting invoice number was defined:

"...should end in a string of digits..."

The phrase "end in a string of digits" is considered a pattern. That is, all of the starting invoice numbers will end with digits. Also, the word "should" implies that the starting invoice number might not be what we expect and should be checked.

To check for patterns in strings, you use the MATCH Report Writer function. Its syntax is:

MATCH (expression1, expression2)

Both expressions are evaluated and, if necessary, converted to strings. Expression2 is treated as a regular expression, which is a string that describes a pattern. Expression1 is the string that is searched for the pattern defined by expression2. If the pattern is found, MATCH returns $TRUE; otherwise it returns $FALSE.

If things seem a little complicated, an example should help shed some light. The pattern we want to find is a string that ends with digits. If we used a pattern string of "0", that would match a string that contained a zero. Likewise, a pattern string of "1" would match a string that contained a one. Similar patterns can be done for each digit. Unfortunately, this doesn't help much. We want to find any digit.
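The behavior described so far can be tried outside the Report Writer. The following is a minimal sketch in Python, not Report Writer code: the function name match is hypothetical, and Python's re module is only assumed to behave like MATCH for these simple one-character patterns.

```python
import re

def match(text, pattern):
    """Hypothetical stand-in for MATCH: True if pattern occurs anywhere in text."""
    # str() mirrors the conversion of both expressions to strings.
    return re.search(str(pattern), str(text)) is not None

print(match("INV-009", "0"))  # True: the string contains a zero
print(match("INV-009", "1"))  # False: the string contains no one
print(match(2019889, "9"))    # True: the number is converted to "2019889" first
```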

When you are looking for one of many possible characters (e.g., 0, 1, 2, and so on), you define a class of characters, as in:

[0123456789]

The square brackets [ ] mark the start and end points of the class. The characters between the brackets define the class. Rather than having to list every character in a class, you can specify a range:

[0-9]

This class is identical to the one above. You can do this with letters too:

[a-z]

[A-Z]

[A-Za-z]

The first example matches any lower case letter, the second matches any upper case letter, and the third matches any letter (upper or lower case).
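To see which classes accept a given character, a short Python sketch can help. It assumes that Python's re module treats these bracketed classes the same way the Report Writer does; the helper name classes_matching is made up for the illustration.

```python
import re

classes = ["[0-9]", "[a-z]", "[A-Z]", "[A-Za-z]"]

def classes_matching(ch):
    """Return the classes (from the list above) that accept the character ch."""
    return [c for c in classes if re.search(c, ch)]

for ch in ["7", "q", "Q", "-"]:
    print(repr(ch), classes_matching(ch))
# '7' ['[0-9]']
# 'q' ['[a-z]', '[A-Za-z]']
# 'Q' ['[A-Z]', '[A-Za-z]']
# '-' []
```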

Now that we can find a single digit, we need to be able to find a string of digits of arbitrary length; that is, a digit possibly followed by more digits.

When you are looking for a repeated pattern (digits), you put a * character after the pattern you expect to be repeated:

[0-9]*

The * character says that the preceding pattern can be repeated zero or more times. This pattern would match:

8

23423

389872

01348

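It would not match 1a92 as an entire string (only the 1 would match). Because the * character allows zero repetitions, this pattern also succeeds on a string that contains no digits at all. When we do our pattern match, we want to make sure that we actually find at least one digit. This can be done in two ways: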

[0-9][0-9]*

or

[0-9]\{1,\}

The first method looks for one digit ([0-9]) followed by zero or more digits ([0-9]*). The second method matches at least one (\{1,\}) digit ([0-9]).
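The difference between "zero or more" and "at least one" is easy to check with a small Python sketch. This is an illustration under assumptions, not Report Writer code: Python's re module writes the interval without backslashes, so \{1,\} appears here as {1,}.

```python
import re

zero_or_more = re.compile(r"[0-9]*")        # may match nothing at all
one_then_more = re.compile(r"[0-9][0-9]*")  # one digit, then zero or more
at_least_one = re.compile(r"[0-9]{1,}")     # Python spelling of [0-9]\{1,\}

for text in ["8", "23423", "1a92", "abc"]:
    print(text,
          zero_or_more.search(text) is not None,
          one_then_more.search(text) is not None,
          at_least_one.search(text) is not None)
# 8 True True True
# 23423 True True True
# 1a92 True True True
# abc True False False
```

The last line shows why [0-9]* alone is not enough: it reports a match on abc even though the matched text is empty.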

Finally, we want to match these digits at the end of the string. Our previous patterns could match any portion of a string: beginning, middle, or end. By ending our pattern with a $, the pattern matches the end of the string:

[0-9][0-9]*$

This pattern matches any series of digits at the end of a string. It matches any of the following strings:

abcd01234

01abc234

123456

It does not match abc0123def (the string does not end with digits).
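These four cases can be verified with a brief Python sketch. It assumes that Python's re module handles the $ anchor the way MATCH does for strings that carry no trailing newline.

```python
import re

# Assumption: in Python, $ also matches just before a trailing newline,
# so the sample strings below deliberately contain none.
ends_with_digits = re.compile(r"[0-9][0-9]*$")

for text in ["abcd01234", "01abc234", "123456", "abc0123def"]:
    found = ends_with_digits.search(text)
    print(text, "->", found.group(0) if found else "no match")
# abcd01234 -> 01234
# 01abc234 -> 234
# 123456 -> 123456
# abc0123def -> no match
```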

Now that we have the ending series of digits, we still need a way to split the original string into its two components: prefix and sequence. Before we can do that, we must create a regular expression that matches the prefix portion of the string.

A prefix can be thought of as a series of characters ending with a non-digit. To match any character, you use the . character. Thus, the pattern

.

would match any single character:

A

%

-

0

Notice the last example? Our pattern also matches digits! For now, this will be ok. Since a prefix can be more than one character, we need to use the * character in our pattern to match multiple characters:

.*

This pattern can match

A

%

abc00def

INV-009

As you can see in the last example, our pattern thus far matches an entire string. What we need is to stop the match at the last non-digit. To match a group of characters, we used the class pattern ([ ]). To match characters not in a group of characters, put a ^ symbol after the opening bracket, as in:

[^0-9]

This pattern matches anything other than a digit. Adding this pattern to our previous one gives us a pattern that matches all characters up to the last non-digit:

.*[^0-9]
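The effect of the trailing [^0-9] can be seen in a short Python sketch. As before, Python's re module is only a stand-in for MATCH, and the sample strings are made up for the illustration.

```python
import re

prefix_pattern = re.compile(r".*[^0-9]")

for text in ["INV-009", "abc00def", "2024A17", "123456"]:
    found = prefix_pattern.search(text)
    print(text, "->", repr(found.group(0)) if found else "no match")
# INV-009 -> 'INV-'
# abc00def -> 'abc00def'
# 2024A17 -> '2024A'
# 123456 -> no match
```

In each case the match stops at the last non-digit, and a string made up entirely of digits does not match at all.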

We are finally ready to extract the parts from our original string. This is done using a combination of the MATCH function and the RESULT function.

After a successful MATCH function, the RESULT Report Writer function extracts portions of the matched string. Its syntax is

RESULT (n)

RESULT returns the nth matched string after a successful MATCH. RESULT (0) returns the string matched by the last MATCH function. For other values of n, you need to identify substrings in your MATCH pattern.

A substring is identified by the sequence \(..patterns..\). The characters \( and \) mark the starting and ending points of the substring. Anything matched by the patterns between these two symbols becomes a substring. For example, if we used the pattern abc\([0-9][0-9]\)def to match the string abc34def, the matching substring would be 34. RESULT (1) would return 34 after the successful MATCH.

Multiple substrings can be defined in a pattern and returned with different calls to RESULT:

IF MATCH (phone, "^\([0-9]\{3\}\)\([0-9]\{4\}\)$")
THEN
    PRINT "Phone number is: ", RESULT (1),
          "-", RESULT (2), NL;

The first substring pattern, \([0-9]\{3\}\), matches the first 3 digits of a phone number. The second substring pattern, \([0-9]\{4\}\), matches the next 4 digits of a phone number. The characters ^ and $ match the start and end of the phone number, respectively. These ensure that our pattern matches the entire phone number. If phone is 2019889, the output would be:

Phone number is: 201-9889
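For comparison, here is roughly the same phone number example as a Python sketch. It rests on an assumed correspondence: Python writes substrings as ( ) and intervals as { } without backslashes, and group(n) plays the role that RESULT (n) plays in the Report Writer.

```python
import re

phone = "2019889"

# Python spelling of ^\([0-9]\{3\}\)\([0-9]\{4\}\)$
found = re.search(r"^([0-9]{3})([0-9]{4})$", phone)
if found:
    # group(1) and group(2) correspond to RESULT (1) and RESULT (2);
    # group(0) is the whole matched string, like RESULT (0).
    print("Phone number is: " + found.group(1) + "-" + found.group(2))
# Phone number is: 201-9889
```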

By now, it should be clear how to proceed. We have a pattern that matches the desired invoice numbers:

.*[^0-9][0-9][0-9]*$

The pattern .*[^0-9] matches the prefix of the invoice number. The pattern [0-9][0-9]*$ matches the ending digits. All we need to do is put these two patterns into substrings:

\(.*[^0-9]\)\([0-9][0-9]*$\)

If this pattern matches the starting invoice number, RESULT (1) returns the prefix portion and RESULT (2) returns the sequence portion:

IF MATCH ($FIRSTINV, "\(.*[^0-9]\)\([0-9][0-9]*$\)")
THEN BEGIN
    prefix = RESULT (1);
    digits = RESULT (2);
END
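The split into prefix and digits can be tried out with a small Python sketch. The function name split_invoice and the sample invoice numbers are hypothetical, and the pattern is the Python spelling of the one above, with the $ written after the second substring.

```python
import re

def split_invoice(first_inv):
    """Return (prefix, digits), or None if first_inv does not fit the pattern."""
    found = re.search(r"(.*[^0-9])([0-9][0-9]*)$", first_inv)
    if found is None:
        return None
    return found.group(1), found.group(2)

print(split_invoice("INV-009"))  # ('INV-', '009')
print(split_invoice("A1B0042"))  # ('A1B', '0042')
print(split_invoice("INV-"))     # None: no ending digits
print(split_invoice("123456"))   # None: no non-digit to end the prefix
```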

We still need to set the starting sequence number and determine the proper formatting string:

sequence = digits + 0;
fmtString = STRLEFT ("000000000000000",
                     STRLEN (digits));
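The idea behind these two statements can be sketched in Python. This is only an analogy: slicing stands in for STRLEFT, len for STRLEN, and the hypothetical helper pad imitates FORMAT for numbers that fit the template. It does not reproduce what FORMAT does when a number is too wide.

```python
digits = "009"                          # example value of RESULT (2) for INV-009
sequence = int(digits)                  # numeric counter; leading zeros are dropped
fmt_string = "000000000000000"[:len(digits)]  # leftmost len(digits) zeros

def pad(number, template):
    """Zero-pad number to the width of template (a stand-in, not FORMAT itself)."""
    return str(number).zfill(len(template))

print(sequence, fmt_string, pad(sequence, fmt_string))
# 9 000 009
```

Deriving the template from the length of the original digit string is what keeps 009 from turning into 9 when the key is rebuilt.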

The remainder of the report is identical to our final version from the previous section:

/* generate sequence numbers */
FILE mag IS "mag"
FILE script IS "script"
FIELDS IN mag ARE magazine, year_rate;
FIELDS IN script ARE subscriber, magazine;

VARIABLES ARE
    digits,     /* string of digits in $FIRSTINV */
    fmtString,  /* string to format sequence #'s */
    prefix,     /* constant invoice prefix */
    sequence,   /* sequence number counter */
    key,        /* next sequence key */
    amount,     /* amount of one subscription */
    sub_total TOTAL OF amount,
    grand_total TOTAL OF amount;

MAIN
BEGIN
    /* look for prefix and digits */
    IF MATCH ($FIRSTINV, "\(.*[^0-9]\)\([0-9][0-9]*\)$")
    THEN BEGIN
        prefix = RESULT (1);
        digits = RESULT (2);
    END
    /* don't know how to increment this one */
    ELSE BEGIN
        PRINT "Starting invoice '", $FIRSTINV,
              "' does not end with digits.", NL;
        RETURN;
    END

    /* establish formatting string */
    fmtString = STRLEFT ("0000000000000000",
                         STRLEN (digits));
    sequence = digits + 0;
    /* adding zero makes sequence a real */

    CHECK script, AT END OF subscriber
        DO print_invoice;
    SELECT FROM script
        SORTED BY subscriber, magazine;
    FOR EACH script
        DO accumulate_subscription;
    DO last_invoice;
END

/* compute charge for one subscription */
PROCEDURE accumulate_subscription
BEGIN
    FIND IN mag
        WHERE magazine EQ script.magazine;
    IF ERROR (mag) THEN BEGIN
        PRINT "No magazine master for: ",
              script.magazine, NL;
        RETURN;
    END
    /* charge one year for each subscription */
    amount = year_rate;
END

/* print next invoice and subscriber amount */
PROCEDURE print_invoice
BEGIN
    DO next_invoice;
    PRINT "Invoice: ", key,
          ", Amount: ", sub_total, NL;
END

/* print final summary */
PROCEDURE last_invoice
BEGIN
    DO next_invoice;
    PRINT "Next invoice to use is: ",
          key, NL;
END

/* generate next invoice number */
PROCEDURE next_invoice
BEGIN
    key = prefix @ FORMAT (sequence, fmtString);
    sequence = sequence + 1;
END
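To check what keys to expect from next_invoice, the procedure can be imitated in Python. This is a sketch under assumptions, not a translation of the report: the name make_key_generator is hypothetical, string concatenation stands in for the @ operator, and zfill stands in for FORMAT only while the sequence still fits the original number of digits.

```python
def make_key_generator(prefix, digits):
    """Return a function that yields the next invoice key on each call."""
    sequence = int(digits)  # starting sequence number
    width = len(digits)     # number of digits to preserve in each key

    def next_invoice():
        nonlocal sequence
        key = prefix + str(sequence).zfill(width)
        sequence += 1
        return key

    return next_invoice

next_invoice = make_key_generator("INV-", "0098")
print([next_invoice() for _ in range(3)])
# ['INV-0098', 'INV-0099', 'INV-0100']
```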

PROBLEM 6

There are two minor drawbacks to the final report in this section:

1. you can't enter a starting invoice number of all digits;

2. if the sequence is incremented past the current number of digits (9 to 10, 99 to 100, 999 to 1,000, etc.), the FORMAT routine returns an empty string ("") as the result. That is, FORMAT (1000,"000") results in an empty string, not 1000.

Modify the report of this section to handle these two additional cases without losing any of its previous abilities.

Hint 1: One regular expression cannot be written to handle both types of starting invoice numbers (all digits vs. prefix/digits).

Hint 2: When an overflow occurs in FORMAT and it returns an empty string, the resulting key always equals the prefix.

Food for Thought

This section has barely scratched the surface of regular expressions. Regular expressions are a language unto themselves, and a small book could be devoted to exploring them in depth. In Chapter 10, Report Writer Programming Reference, there is a section on the MATCH function that shows the patterns you can use in regular expressions. You have seen some of them used here. The only way to really learn these is to try new expressions and see what results you get.