10.19 Record Selection Techniques

When using the WHERE clause with the FIND statement or using the WHERE or SORT clause with the SELECT statement, the report writer finds the optimum method for retrieving the records.

As the simplest case, the report writer reads the file sequentially, testing each record to see if it meets the conditions stated in the WHERE clause. When there is a SORT statement as part of a SELECT statement, then the sort key fields and the record number are written to a temporary file for sorting. The temporary file is sorted by an external sort program. When the FOR or READ statements are processed, the original data records are re-read using the record number extracted from the sorted temporary file.

The report writer attempts to optimize record selection and sorting by looking at the fields being sorted, by examining the conditions in the WHERE clause, and looking at the type of file being read.

Sorting by Secondary Index

The Report Writer examines the secondary indexes of a file, looking for an index that has the fields contained in the SORT clause. If an index is found with all the fields in the SORT clause, and in the correct order, then the file will be read in index order, rather than producing an external file and sorting the file. When an index is used, no action is taken when the SELECT statement is processed other than selecting an index to use. When the FOR or READ statement is executed, then the file is read in index order.

The index selected may have more fields than necessary to satisfy the SORT clause. For example, if a file has an index on the customer and date fields, then the clause

SORTED by customer

will use this index. The first field in the index must be the first field in the SORT clause. The clause

SORTED by date

could not use the customer and date index for sorting.

Selecting by Secondary Index

The Report Writer uses a secondary index to selectively read some records from the file if there is an index that matches the WHERE clause. If an index can by used for selections, it limits the number of records that are examined, since every record need not be read. If a file named customer has only an index on the state field, the statement below will use the secondary index in this way.

SELECT customer

WHERE state = "MD"

If you use relational operators in the WHERE clause, secondary indexes are still used. The statement

SELECT customer

WHERE state > "MD"

uses the state secondary index to read the customer records.

Multiple Selection Criteria

When a SELECT clause has multiple conditions, the conditions must be joined by the AND logical operator to use secondary indexes. Assume that a file named invoicem has an index on the invoice field, which is the invoice number. The invoicem file has a second index using two fields customer and date. The statement

SELECT invoicem

WHERE customer = "Jones"

AND date > 01/01/84

AND date < 01/01/85;

uses the customer and date index for selection.

The field that has the relational operators in the WHERE clause prevents any additional fields in the secondary index from being used.

Not all conditions in the WHERE clause must be in an index, as long as they are all joined by the AND operator. In the example above, assume instead that there is only an index on the customer field. While processing the FOR statement, the Report Writer reads the invoicem file using the customer index, reading only those records where the customer field is "Jones". It then re-applies all the WHERE conditions to each record read, i.e., also testing the date field. If the record passes the WHERE condition test, it is returned to the program. If the record does not pass the WHERE condition test, the record is skipped. This process uses a secondary index to limit the number of records examined, yet does not require that a secondary index completely satisfies the conditions stated in the WHERE clause.

Both Selecting and Sorting by Secondary Index

When the SELECT statement contains both a SORT clause and a WHERE clause, the Report Writer finds the secondary index that best satisfies both clauses.

There may be an index that can be used only for the WHERE conditions. The Report Writer reads the file using a secondary index to select records, creates the temporary file for sorting, sorts the file, and re-reads the original records using the record number. This is done if the fields named in the WHERE clause are indexed, but the fields in the SORT clause are not.

Or, the Report Writer may read the file in secondary index order and test each record against the conditions in the WHERE clause. This is done if the fields in the SORT clause are in a secondary index, but the fields in the WHERE clause are not. In this case, no external sort is performed; the file is read in secondary index order as the FOR statement is processed.

Finally, the Report Writer may use the same secondary index for both the WHERE clause and the SORT clause. As in the previous case, no external sort is performed; the file is read in secondary index order as the FOR statement is processed. Again using the customer file, this statement will use the secondary index for both sorting and selecting records:

SELECT customer

WHERE state > "G" AND state < "W"

SORTED BY state;

Use of secondary indexes is more complicated when several fields are involved, or when there are several indexes to choose from. First some examples are shown, then the rules for selecting an index are given.

Again assume that a file named invoicem has an index on the invoice field, which is the invoice number. The invoicem file has a second index using two fields customer and date. The statement

SELECT invoicem

WHERE invoice > "ABC0000"

SORTED BY date;

uses the invoice index for selection, and a temporary file is created for sorting.

The statement

SELECT invoicem

WHERE customer = "Jones"

SORTED BY date;

uses the customer and date index for selection and uses a temporary file for sorting.

The statement

SELECT invoicem

WHERE customer = "Jones"

SORTED BY customer, date;

uses the customer and date index for both selection and sorting. At first glance, it seems foolish to sort by both customer and date since there is only one customer selected. But since there is an index on both customer and date, the Report Writer will use this index for reading in sorted order.

Selecting a Secondary Index

Several conditions must by satisfied before a secondary index is used. First, the conditions named in the WHERE clause are examined. The conditions must all be either EQ, GT, GE, LT, or LE. If there is more than one condition, then they must be joined by the AND operator. Using the OR operator usually eliminates any use of secondary indexes.

Second, the SORT fields are checked. If any fields are sorted in descending order, the Report Writer does not use a secondary index.

Third, each index of the file being selected is examined. The number of fields in an index that satisfy the WHERE clause are counted. The fields in the index are examined starting with the first index field. A field in the index that is not in the WHERE clause stops the count for that index. If the field in the WHERE clause is not an EQ condition, that field is counted, but no more fields are examined.

The fields of the index are counted again, this time counting the number of index fields that match the SORT fields. The index that best satisfies the following criteria is selected.

1. All selection keys, best sorting

1. Best sorting, best selection keys

1. Best selection key

1. Linear selection

An index satisfying the criteria all selection keys is an index that has all of its fields named in the WHERE clause and all the fields are specified exactly (using EQ only). The WHERE clause can have more fields than those named in the index.

An index satisfying the criteria best selection keys is the index that has the most fields named in the WHERE clause and all but the last field is specified exactly (using EQ only). The last field can have other than an EQ condition.

An index selected for best sorting has the highest count of sort fields.

If no index can be found to satisfy the first three criteria, the report writer will read sequentially through the entire file to find the records.

File Type Considerations

Reading by a secondary index is performed if the SORT and WHERE statements name fields that are included in a secondary index. Since the primary key of an indexed file is also an index, it also is considered as one of the secondary indexes when performing the index evaluations. The primary key of a hashed file is not an index, so the methods described above do not apply to the primary key of a hashed file.

However, the primary key of a hashed file can be used to retrieve records in certain circumstances. Hashed files are read by primary key when the WHERE condition specifies one value for each keyfield of the file. The syntax for this kind of specification would look like:

fieldname EQ value

or

value EQ fieldname

If the file has more than one keyfield, each keyfield must be specified as above with each specification separated by the logical AND operator.

For example, when reading through an invoice file, the statement

FIND customer

WHERE customer EQ invoice.customer;

will retrieve the customer record for that invoice by using the primary key. It is assumed in this example that the customer field is the primary key in the customer file. If it were not, then other searching techniques would be used. Note that the correct action would be performed in either case; but finding the customer record by using the primary key would be much faster.

If the selection is done by primary key and the records are also to be sorted, then a temporary file is created as described above and sorted.

Pattern Matching

A crude form of pattern matching is provided in conditions through the use of the asterisk. Pattern matching is applied only to string constants and is specified as:

fieldname EQ "pattern"

or

"pattern" EQ fieldname

where pattern has one or more asterisks. Each asterisk will match zero or more characters. When pattern matching is used, upper and lower case letters are considered the same ('a' = 'A').

For example:

"or*"

would match

oregon, Or, OR, Oriental, etc.

NOTE: Pattern matching conditions are not used with indexes.