Monday, September 29, 2008

Capriccio: Scalable Threads for Internet Services

Capriccio is a user-level threading library for Linux. It attempts to simplify programming for high-concurrency servers while adding scalability and flexibility through the use of user-level threads, an asynchronous I/O interface, linked stacks, and a resource-aware scheduler. The advantages of user-level threads over kernel threads include low-overhead synchronization, no kernel crossings for mutex acquire and release, and efficient memory management, although as of the paper's writing Capriccio only supported single-CPU threading.

Extensive compiler support is necessary for the linked stacks and scheduler; the compiler's call-graph analysis and checkpoint insertion produce non-contiguous, dynamically sized stacks, and the user can tune how much stack space is allowed to go to waste. The resource-aware scheduler draws inspiration from event-based handler systems. Capriccio uses a blocking graph to automate scheduling at run time; it maintains a queue for each blocking-graph node and dynamically decides which nodes' threads should run next based on currently available resources. Apache's bandwidth with 100+ clients increases when it runs on Capriccio, and the thread-based Knot server performs comparably to the event-based Haboob web server.
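To make the blocking-graph idea concrete, here is a minimal sketch (the class and field names are my own, not Capriccio's) of a scheduler that keeps a run queue per blocking-graph node and only dispatches threads from nodes whose estimated resource needs fit within the currently available headroom.

```python
import collections
import random

class BlockingGraphNode:
    """One node in the blocking graph: a program location where threads block."""
    def __init__(self, name, resources_needed):
        self.name = name
        self.resources_needed = resources_needed  # e.g. {"memory": 0.1, "fds": 0.05}
        self.run_queue = collections.deque()      # threads currently ready at this node

class ResourceAwareScheduler:
    """Dispatch from nodes whose resource needs fit the current headroom."""
    def __init__(self, capacity):
        self.capacity = dict(capacity)            # e.g. {"memory": 1.0, "fds": 1.0}
        self.usage = {r: 0.0 for r in capacity}   # fraction of each resource in use
        self.nodes = []

    def add_node(self, node):
        self.nodes.append(node)

    def runnable(self, node):
        return all(self.usage[r] + need <= self.capacity[r]
                   for r, need in node.resources_needed.items())

    def pick_next(self):
        candidates = [n for n in self.nodes if n.run_queue and self.runnable(n)]
        if not candidates:
            return None
        # Capriccio also favors nodes that move threads toward completion or
        # release resources; this toy just picks among admissible nodes.
        node = random.choice(candidates)
        return node, node.run_queue.popleft()
```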

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, Welsh, Culler, & Brewer

SEDA, the staged event-driven architecture, is a software architecture designed to support high-concurrency Internet services and to simplify building them; it is also meant to provide dynamic resource management and let applications adapt to changing load conditions. It achieves this through pipelined event stages and dynamic resource controllers. Each stage wraps an event handler and contains an event queue and a thread pool; the number of threads active in a stage is adjusted dynamically by the thread pool controller, and the batching controller adjusts how many events are handled per invocation of the stage's handler.
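Here is a toy sketch of a single stage under these rules; the class and parameter names are invented for illustration and are not SEDA's actual Sandstorm API, but they show the queue-plus-handler structure and a crude thread pool controller that grows the pool when the queue backs up.

```python
import queue
import threading

class Stage:
    """A SEDA-style stage: an event queue drained by a small, adjustable thread pool."""
    def __init__(self, name, handler, max_threads=4, queue_threshold=100):
        self.name = name
        self.handler = handler              # callable taking a list of events
        self.events = queue.Queue()
        self.max_threads = max_threads
        self.queue_threshold = queue_threshold
        self.threads = []
        self._spawn_worker()

    def enqueue(self, event):
        self.events.put(event)
        self._thread_pool_controller()

    def _thread_pool_controller(self):
        # Add a worker when the queue exceeds its threshold, up to max_threads.
        if (self.events.qsize() > self.queue_threshold
                and len(self.threads) < self.max_threads):
            self._spawn_worker()

    def _spawn_worker(self):
        t = threading.Thread(target=self._worker, daemon=True)
        self.threads.append(t)
        t.start()

    def _worker(self, batch_size=16):
        while True:
            batch = [self.events.get()]     # block until at least one event arrives
            while len(batch) < batch_size:
                try:
                    batch.append(self.events.get_nowait())
                except queue.Empty:
                    break
            # A real batching controller would tune batch_size from observed throughput.
            self.handler(batch)
```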

The paper discusses two applications built on SEDA, the high-performance HTTP server Haboob and a Gnutella packet router. Haboob is shown to perform comparably to Apache and Flash in terms of throughput and fairness, trading a somewhat higher average response time for much lower variance. It was interesting to see the differing views of the Capriccio and SEDA authors on thread- vs. event-based systems; the Capriccio authors complain that event-based systems like SEDA are much less intuitive to program than thread-based ones, while the SEDA authors claim that their system occupies a middle ground between threads and events and hence is ideal.

Wednesday, September 24, 2008

Experience with Processes and Monitors in Mesa, Lampson & Redell

I decided to skip this writeup because I really just couldn't get into the paper. Sorry.

Monday, September 22, 2008

Lightweight Recoverable Virtual Memory

RVM, a recoverable virtual memory implementation, was designed with lessons learned from Camelot in mind. The biggest design principle in its creation was simplicity over generality, which led to RVM's lack of support for nesting, distribution, and concurrency control, along with its being decoupled from the operating system. RVM uses logging and allows no-flush transactions; a memory allocator and a segment loader are layered on top of it. RVM uses external data segments as backing stores for recoverable memory and only transfers committed changes to these external segments, allowing for no-undo/redo logging.

It was impressive to hear that RVM had actually been in use for two years prior to the paper's writing. As the authors point out, however, RVM has a weak point: programmers must explicitly remember to make a set_range call if they want their memory modifications to be recoverable in the event of a failure. This burden could be eased by inserting the calls automatically at compile time or in a post-compilation pass.
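A toy model of that discipline (in Python rather than RVM's C interface, with invented names) shows why the call matters: only ranges declared with set_range inside a transaction can be restored on abort or carried to the log at commit.

```python
class ToyRVM:
    """Toy model of the set_range discipline over a mapped recoverable region."""
    def __init__(self, size):
        self.memory = bytearray(size)   # stands in for the mapped recoverable region
        self.undo = {}                  # offset -> saved old bytes for this transaction
        self.in_txn = False

    def begin_transaction(self):
        self.in_txn = True
        self.undo = {}

    def set_range(self, offset, length):
        # Save the old value so an abort can restore it and commit knows what to log.
        self.undo[offset] = bytes(self.memory[offset:offset + length])

    def write(self, offset, data):
        # NOTE: if set_range was not called for this range, recovery cannot
        # restore it -- exactly the pitfall discussed above.
        self.memory[offset:offset + len(data)] = data

    def abort_transaction(self):
        for offset, old in self.undo.items():
            self.memory[offset:offset + len(old)] = old
        self.in_txn = False

    def end_transaction(self):
        # A real implementation would write the new values of the declared ranges
        # to a redo log here (no-undo/redo logging) and apply them to the external
        # data segment lazily.
        self.undo = {}
        self.in_txn = False
```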

Wednesday, September 17, 2008

Stasis: Flexible Transactional Storage, Sears & Brewer

Stasis is a transactional storage library that bases a number of its policies on ARIES. It uses write-ahead logging to support redoing and undoing transactions during recovery; redo is what makes a no-force policy possible, and undo is what makes steal possible. Stasis also adopts ARIES' notions of LSNs, CLRs, and nested top actions. The authors compared Stasis with a similar system, Berkeley DB, with respect to hash table implementations, handling of concurrent requests, and object-persistence optimizations; Stasis did as well as or better than Berkeley DB in each of these respects.

Consistent with the "Systems View," Stasis is noteworthy in that it is extensible, allowing for the addition of user-defined operations, an LSN-free version of recovery, different data layouts, and a number of other specializations. The authors make a number of assertions about features Stasis "probably" could handle, but unfortunately few of these had actually been implemented at the time of writing. As a measure of how simple or complex Stasis is, the authors included a count of its lines of code, which was an interesting metric to see.
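A hypothetical interface in the spirit of this extensibility (the names below are mine, not Stasis's actual API): a user operation is registered as a redo/undo callback pair, and log records simply name the operation and its arguments.

```python
OPERATIONS = {}   # operation name -> (redo_fn, undo_fn)

def register_operation(name, redo_fn, undo_fn):
    """Register a user-defined operation, in the spirit of Stasis's extensibility."""
    OPERATIONS[name] = (redo_fn, undo_fn)

def redo(log_record, pages):
    name, args = log_record
    OPERATIONS[name][0](pages, *args)

def undo(log_record, pages):
    name, args = log_record
    OPERATIONS[name][1](pages, *args)

# Example: an "increment slot" operation whose logical undo is a decrement.
def incr_redo(pages, page_id, slot, amount):
    pages[page_id][slot] += amount

def incr_undo(pages, page_id, slot, amount):
    pages[page_id][slot] -= amount

register_operation("incr", incr_redo, incr_undo)

pages = {0: [0, 0, 0, 0]}          # toy page store: page id -> list of slots
record = ("incr", (0, 2, 5))       # a log record naming the operation and its arguments
redo(record, pages)                # pages[0][2] == 5
undo(record, pages)                # rolled back to 0
```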

Monday, September 15, 2008

ARIES

This paper discusses ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), a recovery method for databases. The method uses write-ahead logging for transaction rollback, crash recovery, and media recovery. The authors emphasize the point that ARIES repeats history: during recovery it redoes all logged updates, including those of transactions that were in progress at the time of the failure, and only then undoes the incomplete transactions. A strong point of ARIES is that it allows both partial and total rollbacks through the use of savepoints. The paper mentions differences between ARIES and System R; in particular, System R's shadow-page technique was a source of overhead that ARIES avoids by allowing in-place updates.

ARIES uses page-oriented redo to cut down on the tables and indexes that must be consulted to access data pages, but it allows both page-oriented and logical undo to permit more concurrency. Despite the authors' attempts to trim ARIES' costs, such as the number of pages read from non-volatile storage during recovery and the descriptor tables it must maintain, the logging system still has a large amount of data to keep track of; between PrevLSN pointer fields within log records, CLRs, and separate redo and undo actions, ARIES carries a non-trivial amount of bookkeeping data, and it seems like the recovery algorithm would require quite a bit of extra overhead. However, ARIES has been adopted by a number of databases such as DB2 and MS SQL Server, so perhaps the authors are not underplaying the overhead after all.
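A minimal sketch of the repeating-history structure, with an invented in-memory log format rather than ARIES's real record layout: redo everything the pages have not yet seen, including loser updates, then undo the losers in reverse order while emitting CLRs.

```python
def recover(log, pages, committed):
    """Toy ARIES-style recovery over an in-memory log (invented record format).

    log: update records like {"lsn": int, "prev_lsn": int, "txn": str,
                              "page": int, "redo": fn(page), "undo": fn(page)}
    pages: dict page_id -> {"lsn": int, "data": ...}
    committed: set of transaction ids whose commit record reached the log
    """
    # Analysis: any transaction without a commit record is a loser.
    losers = {rec["txn"] for rec in log if rec["txn"] not in committed}

    # Redo: repeat history -- reapply every update the page has not yet seen,
    # including updates made by loser transactions.
    for rec in log:
        page = pages[rec["page"]]
        if rec["lsn"] > page["lsn"]:
            rec["redo"](page)
            page["lsn"] = rec["lsn"]

    # Undo: roll back losers in reverse LSN order, emitting a CLR for each undone
    # update so that if recovery itself crashes, finished undo work is not repeated.
    clrs = []
    for rec in reversed(log):
        if rec["txn"] in losers:
            rec["undo"](pages[rec["page"]])
            clrs.append({"clr_for": rec["lsn"], "undo_next": rec["prev_lsn"]})
    return clrs
```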

Wednesday, September 10, 2008

The HP AutoRAID Hierarchical Storage System

HP AutoRAID is a two-level redundancy storage hierarchy implemented behind the (SCSI) block-level device interface that employs background garbage collection and load balancing. It was created to eliminate the difficult problem of manually configuring a RAID system. The hierarchy consists of mirrored storage for active data and RAID 5 for inactive data, and a given piece of data is kept in only one of these levels at a time. Space on the disks is split into physical extents (PEXes), which are grouped into PEGs spanning different disks so as to keep the disks balanced, and are further broken down into segments, which serve as the units of striping and mirroring. The logical space, on the other hand, is divided into relocation blocks (RBs), which serve as the units of data migration.

Reads and writes to the mirrored storage are fairly simple, as are RAID 5 reads. RAID 5 writes, on the other hand, can use the per-RB scheme, writing one RB (and its stripe's parity) at a time, or they can use batched writes; if a batched write fails, the system falls back to the per-RB scheme. The authors made a good decision in using RBs demoted during idle periods to plug the holes left by previously promoted RBs; skipping hole-plugging might have been faster and easier to implement, but using it nearly eliminated the need for a RAID 5 cleaner in the simulations.
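Here is a toy model of that migration policy (structure and names invented for illustration): writes promote an RB into the mirrored tier, promotions leave holes behind in RAID 5, and idle-time demotions prefer to plug those holes.

```python
import collections

class ToyAutoRAID:
    """Toy model of AutoRAID's two tiers, promotion on write, and hole-plugging."""
    def __init__(self, mirrored_capacity):
        self.mirrored = collections.OrderedDict()  # rb_id -> data, kept in LRU order
        self.mirrored_capacity = mirrored_capacity
        self.raid5 = {}                            # rb_id -> data
        self.raid5_holes = 0                       # slots freed by earlier promotions

    def write(self, rb_id, data):
        # Active data lives in the mirrored tier, so a write promotes its RB.
        if rb_id in self.raid5:
            del self.raid5[rb_id]
            self.raid5_holes += 1                  # promotion leaves a hole in RAID 5
        self.mirrored[rb_id] = data
        self.mirrored.move_to_end(rb_id)           # mark as most recently used

    def idle_demotion(self):
        # During idle time, demote the coldest RBs, preferring to plug existing
        # holes (cheap per-RB writes) over appending brand-new stripes.
        while len(self.mirrored) > self.mirrored_capacity:
            cold_id, data = self.mirrored.popitem(last=False)
            if self.raid5_holes:
                self.raid5_holes -= 1
            # else: a real array would batch demoted RBs into a full new stripe
            self.raid5[cold_id] = data
```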

Monday, September 8, 2008

Analysis and Evolution of Journaling File Systems

This paper introduced semantic block analysis (SBA) and semantic trace playback (STP) as ways of analyzing and modifying file system behavior, respectively. SBA is similar to block-level tracing except that the SBA driver knows the on-disk format of the file system above it. Generic SBA code can be combined with a smaller amount of file-system-specific code to accommodate ext3, ReiserFS, JFS, and NTFS. However, the authors could have made a more concrete, convincing argument for why this generic route is better than building tracing directly into the file system. STP is a good way to test changes to the file system that are not too complex or "radical": no time-consuming modification of the file system or detailed simulation of it is required. The case studies of ext3 and ReiserFS showed how useful SBA and STP can be, especially given the handful of ReiserFS errors that SBA uncovered.

The paper takes a close look at the ext3 file system and uses SBA to examine different aspects of its performance; in particular, it studies how ext3's three journaling modes handle sequential and random workloads. Based on these experiments, the authors recommend several techniques for improving performance. Adaptive journaling is suggested as a flexible journaling mode that efficiently handles a mix of sequential and random writes. Falsely limited parallelism can be combated by issuing journal and fixed-location writes concurrently, and tangled synchrony can be fixed by separating synchronous and asynchronous traffic streams. It was hard to believe that ext3 didn't already employ differential journaling, especially when STP results showed speedups of 6x and 200x in data journaling mode for two database workloads.
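The idea behind differential journaling is easy to sketch (this is just the concept, not the authors' STP implementation): log only the byte ranges of a block that actually changed instead of the whole block.

```python
def diff_ranges(old_block, new_block):
    """Return (offset, new_bytes) pairs covering only the bytes that changed."""
    ranges, start = [], None
    for i, (a, b) in enumerate(zip(old_block, new_block)):
        if a != b and start is None:
            start = i
        elif a == b and start is not None:
            ranges.append((start, bytes(new_block[start:i])))
            start = None
    if start is not None:
        ranges.append((start, bytes(new_block[start:])))
    return ranges

# For a database-style workload that touches a few fields in a 4 KB block,
# the journal record shrinks from 4096 bytes to a handful of bytes plus offsets.
old = bytearray(4096)
new = bytearray(4096)
new[128:136] = b"\x01" * 8
print(diff_ranges(old, new))   # [(128, b'\x01\x01\x01\x01\x01\x01\x01\x01')]
```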

A Fast File System For UNIX

This paper traced the creation of FFS, a newer implementation of the UNIX file system. Layout features of this file system include cylinder groups, a free-block bitmap (replacing the old system's free list), a static number of inodes per cylinder group, and larger block sizes. The benefit of larger blocks is being able to keep big files together on a cylinder and hence needing fewer seeks. However, since UNIX systems typically contain lots of little files, large blocks alone also cause lots of wasted space (up to 45%). This wasted space led to the technique of splitting blocks into fragments.
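A quick back-of-envelope sketch of the waste argument, using 4 KB blocks and 1 KB fragments as illustrative sizes:

```python
import math

def wasted_bytes(file_size, block_size, fragment_size=None):
    """Internal fragmentation for one file: space allocated minus space used."""
    if fragment_size is None:
        allocated = math.ceil(file_size / block_size) * block_size
    else:
        full_blocks = file_size // block_size
        tail = file_size - full_blocks * block_size
        allocated = full_blocks * block_size
        if tail:
            # The last partial block of a file is stored in fragments instead.
            allocated += math.ceil(tail / fragment_size) * fragment_size
    return allocated - file_size

# A 1.5 KB file on 4 KB blocks wastes 2.5 KB without fragments,
# but only 0.5 KB when its tail can live in two 1 KB fragments.
print(wasted_bytes(1536, 4096))        # 2560
print(wasted_bytes(1536, 4096, 1024))  # 512
```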

In terms of layout, the top-level goals are to minimize seek latency and to promote larger data transfers; the authors wanted to increase locality, but they also wanted to spread unrelated data across different cylinder groups. Blocks are allocated with the aim of keeping them "rotationally optimal." In order to keep room for localizing files, a free-space reserve of blocks must be kept unallocated. Because of the new system's larger block size, both reads and writes are faster than in the old system, though writes lag reads because each write must also allocate the larger blocks. The idea of pre-allocating batches of blocks to a file sounds like a good way to improve write performance, and the creators of the DEMOS file system felt the same way, but these authors did not consider it worth their time to optimize something accounting for only about 10% of a write system call.

Wednesday, September 3, 2008

A History and Evaluation of System R

System R was a relational database system meant to boost user productivity through the use of a high-level user interface and support for many concurrent users and multiple types of database use. It was developed over the course of five years in three distinct phases (zero through two): prototype, full-blown version, and evaluation.

Phase Zero struck me as a bit odd in that the designers knew before they started that they would completely scrap it afterwards. Granted, some important lessons learned from the prototype were incorporated into the full-blown system: that performance should be measured in number of I/Os rather than number of tuples fetched, that future optimizers should aim for simple queries, and that both "joins" and "subqueries" should be supported.

What struck me in particular about the Phase One design were the elaborate recovery and locking subsystems. The designers were clearly taking no chances with any of the three types of failure, even going so far as to use a dual-logging scheme. Parts of the recovery subsystem such as the "shadow pages" were a detriment to performance; while a certain amount of performance can be sacrificed for success in recovery from failure, the alternative technique suggested in the paper itself of simply logging all database updates might have been a much simpler solution.

A lot of thought seemed to go into the locking subsystem, but in the end, Levels 1 and 2 probably never saw much use and consequently could have been left out entirely. As it is, the locking subsystem had to be redesigned to account for the "convoy phenomenon" by allowing processes to reobtain locks during a given time slice.

One of the biggest successes of System R was the popularity SQL found with users. Not only was it simple to learn, but it was actually extended during the creation of the system to satisfy user demands for increased functionality. Compilation of SQL into machine code also appears to have been quite successful; the code generation stage takes little enough time that only a small number of query executions are needed for it to be worthwhile.

With all the apparent effort put into the recovery and locking subsystems, I felt a little disappointed by the use of B-trees for access paths. While B-trees are the better choice for obtaining multiple records and maintaining ordering and connections, hashing and linking are the better choice for small numbers of records and "canned transactions." If there had been a nice way of incorporating a mix of these, System R could have handled any transactions well. However, the designers probably had better insight into which queries would actually occur most.
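A rough back-of-envelope comparison (my own numbers, not the paper's) of why each access path wins where it does, counting page reads for an equality lookup versus a range scan:

```python
import math

def btree_levels(num_records, fanout=200, leaf_capacity=100):
    """Approximate B-tree height in pages, including the leaf level."""
    leaves = math.ceil(num_records / leaf_capacity)
    return 1 + math.ceil(math.log(leaves, fanout)) if leaves > 1 else 1

def equality_lookup_cost(num_records):
    # Hashing: roughly one bucket page; B-tree: one page per level.
    return {"hash": 1, "btree": btree_levels(num_records)}

def range_scan_cost(num_records, matches, leaf_capacity=100):
    # B-tree: descend once, then read consecutive leaves; hashing scatters records.
    return {"hash": matches,
            "btree": btree_levels(num_records) + math.ceil(matches / leaf_capacity)}

print(equality_lookup_cost(10_000_000))   # {'hash': 1, 'btree': 4}
print(range_scan_cost(10_000_000, 5_000)) # {'hash': 5000, 'btree': 54}
```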

Monday, September 1, 2008

The Design Philosophy of the DARPA Internet Protocols, Clark

This paper discusses the various architectural decisions that were faced in the creation of the Internet. It discusses different approaches that could have been taken and justifies the reasons behind the ultimate choices. The list of priorities in creating the Internet contained a number of unsurprising items: survivability, flexibility, accountability, cost-effectiveness, and distributed resource management.

However, the ordering of these priorities was a surprise; the utmost goal of the Internet was to connect together existing networks, but survivability in the case of failure was not far behind. I was amazed to see that cost-effectiveness was ranked as the 5th secondary goal out of 7 and can only imagine how different the Internet would be today if it were being created for commercial rather than military purposes.

One important decision made was for the use of the "fate sharing" scheme, that is, placing state information at endpoints to protect against intermediate node failures. This idea comes directly out of the need for survivability but also meshes well with the end-to-end argument. The disadvantages of employing the end-to-end strategy were that retransmitted packets may cross intervening networks multiple times and also that programmers would have to deal with implementing protocols themselves, an idea that perhaps strikes slightly less fear into current-day programmers than it did then.

A decision driven by the need for flexibility was the splitting of TCP and IP to allow for different transport protocols. This drive to admit a variety of networks and provide a variety of services meant that the requirements a network had to meet were minimal; the requirement of being "fairly reliable" in transporting a "reasonably sized" packet, using addressing (usually), was left general to accommodate the many different networks already in existence. However, because the designers did not want to constrain flexibility and variability, they did not tackle the difficult problem of formalizing performance constraints.

An interesting feature of the ARPANET protocol that was changed for TCP is that ARPANET flow control was based on both bytes and packets, whereas TCP regulates only bytes, despite the variability the designers strove for in the networks and services the Internet would incorporate. Dropping one form of regulation because it seemed too complex strikes me as a weak excuse, particularly since the ARPANET already had ways of dealing with both. Then again, the designers had little reason to spend time on anything complex that did not contribute to the top goals of survivability and network variability. Unfortunately, tools for distributed management and accountability also fall on the list of under-developed concepts.