4.2.2.3 Hashed Files



The hashed file access method scatters records randomly throughout the RMS data file. When you create a hashed RMSfile, you must declare the maximum number of records the file will contain. When a record is added to a hashed RMSfile, the primary key value is transformed into a number between one and the number of records in the file, and RMS attempts to place the record at that location. If a record already exists at that location, a collision has occurred and the record must be placed elsewhere.
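
For illustration, the sketch below (in Python) shows how a primary key might be reduced to a slot number between one and the number of allocated records. The character-sum hash is purely hypothetical; this manual does not describe RMS's actual hashing algorithm.

    # Hypothetical sketch of hashing a primary key to a "home" slot; a simple
    # character-sum hash stands in for RMS's real transformation.
    def home_slot(primary_key, allocated_records):
        h = sum(ord(c) for c in primary_key)     # stand-in for the real hash
        return (h % allocated_records) + 1       # slots are numbered 1..allocated_records

    print(home_slot("SMITH, JOHN", 1000))        # prints some slot between 1 and 1000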

RMS provides two methods for handling collisions. The first is a linear method: RMS inspects the next adjacent record slot to see whether an active record is stored there. If not, the record being added is placed there; if that slot does hold an active record, the slot after it is inspected. This process continues until either an empty slot is found or half the RMS data file has been searched, in which case a "file full" error message is returned. If the end of the RMS data file is reached during the search, the search wraps around to the beginning of the file.
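
As a sketch of the linear method (assuming, for illustration only, that the record slots are held in a Python list in which None marks an empty slot and index 0 is unused so that indices match the 1-based slot numbers):

    def linear_insert(records, home, record):
        # Place the record at its home slot or at the next free slot, wrapping
        # from the end of the file back to slot 1; give up with a "file full"
        # error after half the allocated slots have been searched.
        n = len(records) - 1                    # allocated slots, numbered 1..n
        slot = home
        for _ in range(n // 2):                 # search at most half the file
            if records[slot] is None:           # empty slot found: store the record
                records[slot] = record
                return slot
            slot = slot + 1 if slot < n else 1  # end of file reached: wrap to slot 1
        raise RuntimeError("file full")

    records = [None] * 1001                     # 1000 record slots plus unused index 0
    linear_insert(records, 417, {"name": "SMITH, JOHN"})   # 417 is an example home slot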

Linear duplicate key handling can create long runs of adjacent records. This slows record access significantly, since every record in the chain must be read to find the desired one. A second method of handling duplicate keys is provided that helps reduce the length of these chains and the clustering they produce.

This second method, called quadratic, changes the algorithm that picks the next record slot to inspect. For convenience, we will call the location where the record should be placed the home record. If the home record is occupied, the record one slot past the home record is checked, then the record four slots past it, then nine, and so forth. The progression goes:

1, 4, 9, 16, 25, 36, 49, ...

This progression comes from taking the numbers:

1, 2, 3, 4, 5, 6, 7, ...

and squaring each one to find the offset to the next record slot to inspect. This process also halts with the "file full" error message after half the records in the RMS data file have been inspected. The quadratic method spreads the collisions in a hashed RMSfile throughout the file rather than clustering them in one place.
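
Under the same illustrative assumptions as the linear sketch above (a Python list of slots with None marking an empty slot and index 0 unused), the quadratic method might look like this:

    def quadratic_insert(records, home, record):
        # Try the home slot, then the slots 1, 4, 9, 16, ... positions past it,
        # wrapping around the end of the file; report "file full" once half the
        # allocated slots have been inspected.
        n = len(records) - 1                       # allocated slots, numbered 1..n
        for i in range(n // 2):
            slot = ((home - 1 + i * i) % n) + 1    # home, home+1, home+4, home+9, ...
            if records[slot] is None:
                records[slot] = record
                return slot
        raise RuntimeError("file full")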

Choosing a method of duplicate key handling depends on whether the RMSfile has many duplicate keys. If there are few duplicates, the linear method is sufficient. Files with a large number of duplicate key values should probably use the quadratic method.

The performance of a hashed RMSfile depends on two main factors: the distribution of the primary keys and the load factor of the file. The distribution of primary keys is controlled by three things: the hashing algorithm, the duplicate key method, and the number of allocated records. The load factor of an RMSfile is the percentage of allocated records being used; for example, a load factor of 60 means that approximately 60 percent of the available records are in use. This factor is controlled by the number of records allocated to the RMSfile and the number of records currently stored in the file. Optimum performance is usually achieved by selecting the proper method of duplicate key handling as described above and allocating enough records to keep the load factor of the RMSfile around 60 to 65 percent. For more information on hashed files, see Section 6.4, "Hashing", in Volume 3 (Sorting and Searching) of The Art of Computer Programming [2] by Donald E. Knuth.
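
As a small arithmetic illustration of that guideline, the hypothetical helper below (not an RMS utility) works out how many records to allocate for the number of records you expect to store:

    import math

    def records_to_allocate(expected_records, target_load=0.60):
        # Slots to allocate so that the file's load factor stays near target_load.
        return math.ceil(expected_records / target_load)

    print(records_to_allocate(6000))    # 10000 slots gives a load factor of 60 percent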