Wednesday, November 12, 2008

Lessons from Giant-Scale Services, Brewer

This paper discusses giant-scale services, their advantages, and how to keep them highly available. The giant-scale services considered here are single-site, single-owner, well-connected clusters that provide ubiquitous infrastructure, allowing users to access services and centralized data from multiple devices. The paper treats load management and high availability as the major factors in the design of giant-scale services.

The availability metrics of uptime, yield, and harvest, along with the DQ principle, are used to decide whether replication or partitioning is the better scheme for increasing availability. The author agrees with the traditionally held viewpoint that replication is the better strategy, although he suggests that some combination of partitioning and replication could give finer recovery control. Good strategies for graceful degradation, disaster tolerance, and online evolution are crucial for high availability. The author mentions that he has developed tools to help design giant-scale systems, but the paper concentrates on metrics and high-level topics rather than on those tools.
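
To make the metrics concrete, here is a minimal Python sketch of uptime, yield, and harvest as the paper defines them, plus a toy comparison of replication and partitioning after a node failure. The function names are my own, and the comparison assumes a saturated cluster of identical nodes (offered load equal to pre-failure capacity), so it is an illustration rather than anything taken from the paper's tooling.

```python
def uptime(mtbf: float, mttr: float) -> float:
    """Fraction of time the site is up: (MTBF - MTTR) / MTBF."""
    return (mtbf - mttr) / mtbf


def query_yield(completed: int, offered: int) -> float:
    """Fraction of offered queries that were completed."""
    return completed / offered


def harvest(available: float, complete: float) -> float:
    """Fraction of the complete data reflected in an answer."""
    return available / complete


def after_failure(nodes: int, failed: int, scheme: str) -> tuple[float, float]:
    """Return (harvest, yield) once `failed` of `nodes` are down.

    Assumption: the cluster was saturated before the failure, so the
    surviving nodes cannot absorb any redirected load.
    """
    surviving = (nodes - failed) / nodes
    if scheme == "replication":
        # Every survivor holds all the data, but capacity shrinks.
        return (1.0, surviving)
    if scheme == "partitioning":
        # Every query is still answered, but part of the data is missing.
        return (surviving, 1.0)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for scheme in ("replication", "partitioning"):
        h, y = after_failure(nodes=4, failed=1, scheme=scheme)
        # Harvest times yield stands in for the remaining share of DQ.
        print(f"{scheme}: harvest={h}, yield={y}, DQ share={h * y}")
```

Under these assumptions both schemes keep the same share of DQ after a failure; what differs is whether the loss shows up as lower yield (replication) or lower harvest (partitioning).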
