Wednesday, September 3, 2008

A History and Evaluation of System R

System R was a relational database system meant to boost user productivity through a high-level user interface and to support many concurrent users and multiple types of database use. It was developed over the course of five years in three distinct phases (Phase Zero through Phase Two): a throwaway prototype, the full-blown system, and its evaluation.

Phase Zero struck me as a bit odd in that the designers knew before they started it that they would completely scrap it afterwards. Granted, the prototype taught some important lessons that were carried into the full-blown system: that performance should be measured in number of I/Os rather than number of tuples fetched, that the optimizer should be geared toward simple queries, and that both joins and subqueries should be supported.

What struck me in particular about the Phase One design were the elaborate recovery and locking subsystems. The designers were clearly taking no chances with any of the three types of failure, even going so far as to use a dual-logging scheme. Parts of the recovery subsystem, such as the shadow pages, were a detriment to performance; while some performance can reasonably be sacrificed for reliable recovery from failure, the alternative technique suggested in the paper itself, simply logging all database updates, might have been a much simpler solution.
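To make that alternative concrete, here is a minimal sketch of pure update logging over a toy key-value store. The names and the in-memory log are my own illustration, not System R's actual recovery code, and a real system would force each log record to stable storage before treating the change as durable.

# Toy update log: every change is recorded before it is applied,
# so the database state can be rebuilt after a crash by replaying the log.
log = []                                     # stand-in for a durable log file

def update(db, key, value):
    log.append((key, db.get(key), value))    # (key, old value, new value)
    db[key] = value                          # then apply the change in place

def recover(log):
    db = {}
    for key, _old, new in log:               # redo pass: replay the new values
        db[key] = new
    return db

db = {}
update(db, "acct_1", 500)
update(db, "acct_2", 300)
update(db, "acct_1", 450)
assert recover(log) == db                    # replaying the log reproduces the state

There are no shadow copies of pages to maintain in this scheme; the cost is one log record per update.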

A lot of thought seemed to go into the locking subsystem, but in the end, Levels 1 and 2 probably never saw much use and consequently could have been left out entirely. Even so, the locking subsystem had to be redesigned to account for the "convoy phenomenon" by allowing processes to re-obtain locks during a given time slice.
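Here is a minimal sketch of the grant-policy choice behind that fix, based on my reading of the convoy problem rather than System R's actual lock manager: the trouble comes from strict FIFO hand-off of a hot lock to waiters that are not currently running.

from collections import deque

class HotLock:
    # Illustrative lock manager for one frequently acquired, briefly held lock.
    # Blocking and wake-up machinery is omitted; acquire() returns False when
    # the caller would have to wait.
    def __init__(self, fifo_handoff):
        self.fifo_handoff = fifo_handoff
        self.owner = None
        self.waiters = deque()

    def acquire(self, pid):
        if self.owner is None:
            self.owner = pid          # free lock: grant to whoever asks now
            return True
        self.waiters.append(pid)
        return False                  # caller must wait

    def release(self, pid):
        assert self.owner == pid
        if self.fifo_handoff and self.waiters:
            # Strict FIFO hand-off: the lock passes to a waiter that may be
            # descheduled, so nearly every grant costs a context switch and a
            # convoy, once formed, tends to persist.
            self.owner = self.waiters.popleft()
        else:
            # Convoy fix: simply mark the lock free, so the still-running
            # process can re-obtain it within its own time slice.
            self.owner = None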

One of the biggest successes of System R was the popularity SQL found with users. Not only was it simple for users to learn, but it was actually extended during the creation of the system to satisfy user demands for increased functionality. Compilation of SQL into machine code also appears to have been quite successful; the code-generation stage takes little enough time that a query only has to be executed a small number of times for compilation to be worthwhile.
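A back-of-the-envelope sketch of that break-even point, with made-up numbers purely for illustration (the paper does not give these figures):

# Assumed, illustrative costs -- not measurements from the paper.
compile_cost_ms = 20.0      # one-time cost to generate machine code for a query
interpreted_ms = 5.0        # cost per execution if the query were interpreted
compiled_ms = 1.0           # cost per execution of the compiled code

# Compilation pays off once the per-execution savings cover the one-time cost.
break_even = compile_cost_ms / (interpreted_ms - compiled_ms)
print(f"compilation pays for itself after ~{break_even:.0f} executions")  # ~5 here

With any reasonable ratio of compilation cost to per-execution savings, the break-even point stays small, which is the sense in which only a few executions are needed for compilation to be worthwhile.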

With all the apparent effort put into the recovery and locking subsystems, I felt a little disappointed by the use of B-trees for access paths. While B-trees are the better choice for retrieving many records and maintaining ordering and connections, hashing and linking are the better choice for retrieving small numbers of records and for "canned transactions." If there had been a clean way of incorporating a mix of these, System R could have handled any kind of transaction well. However, the designers probably had better insight into which queries would actually occur most often.
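The trade-off is easy to see in a small sketch that uses a sorted key list as a stand-in for a B-tree and a dictionary as a stand-in for a hashed access path (illustrative only; System R's storage system obviously did not work on Python dictionaries):

import bisect

rows = {1001: "Smith", 1007: "Jones", 1042: "Lee", 1088: "Chan"}
ordered_keys = sorted(rows)              # stand-in for B-tree leaf order
hash_index = dict(rows)                  # stand-in for a hashed index

# Range query: natural on the ordered index, awkward on the hash index,
# which would have to probe every possible key in the range.
lo, hi = 1005, 1050
start = bisect.bisect_left(ordered_keys, lo)
end = bisect.bisect_right(ordered_keys, hi)
print([rows[k] for k in ordered_keys[start:end]])    # ['Jones', 'Lee']

# Exact-match query, as in a canned transaction: one probe on the hash
# index versus a logarithmic descent through B-tree levels.
print(hash_index[1042])                               # Lee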
