Wednesday, December 3, 2008

Dynamo: Amazon's Highly Available Key-value Store

[This is the first paper in which I've been able to tell that the authors are not native English speakers.]

Dynamo is a highly available, decentralized storage system that sacrifices consistency (and ACID properties in general) for availability. Two important assumptions are that Dynamo stores only relatively small objects and that it runs only in Amazon's non-hostile internal environment, which removes the need for authentication. Clients and services negotiate Service Level Agreements (SLAs) that specify expectations such as a response-time bound at a given request rate.

Dynamo was designed to take advantage of heterogeneity and to scale incrementally, which it does through consistent hashing. It provides high availability by replicating data on multiple hosts and propagating updates among them asynchronously. Failures are handled with hinted handoff, in which data destined for an unreachable node is redirected to another node and handed back to the correct one later. The result was impressive availability: over two years, 99.9995% of requests were served successfully.
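
As a rough illustration of the consistent-hashing placement described above, here is a minimal Python sketch of a ring that returns a key's preference list of N replicas. The hash choice, node names, and the omission of Dynamo's virtual nodes and hinted handoff are my own simplifications, not the actual implementation.

```python
import hashlib
from bisect import bisect_right

def ring_position(name: str) -> int:
    """Hash a key or node name onto a fixed circular identifier space."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the arc that ends at its position on the circle.
        self.nodes = sorted(nodes, key=ring_position)
        self.positions = [ring_position(n) for n in self.nodes]

    def preference_list(self, key, n=3):
        """The n distinct nodes responsible for a key, walking clockwise."""
        start = bisect_right(self.positions, ring_position(key))
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(n)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("cart:42"))  # the key's coordinator plus two replicas
```

With hinted handoff, a write intended for an unreachable node in this list would temporarily go to another node on the ring, tagged with a hint naming the intended owner so it can be delivered later.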

Speculative Execution in a Distributed File System

Oops, ParLab opening + me losing the paper copy in Vegas -> didn't get to read before class/write a summary.

Wednesday, November 26, 2008

Deconstructing Process Isolation

This paper adds to the aforementioned Singularity paper by presenting comparative costs of Singularity's selective use of hardware isolation. In a number of benchmarks, the cycle counts were very close for all setups with SIPs and HIPs running in Ring 0; for the most part, the Ring 3 HIPs also performed comparably, except for the small handful of configurations in which each process ran in an unprivileged, non-kernel domain in Ring 3 and incurred 10x the number of cycles. This latter setup is apparently the one used in most hardware-isolated microkernel systems.

Singularity: Rethinking the Software Stack, Hunt & Larus

The Singularity project was started to re-examine the design decisions behind modern operating systems. The project consists of a microkernel operating system, a C#-based programming language, and verification tools meant to address shortcomings such as security vulnerabilities, lack of robustness, and unexpected interactions and failures between applications. The three major features of Singularity are software-isolated processes (SIPs), contract-based channels for communication, and manifest-based programs (MBPs). SIPs were designed so that they can be garbage-collected easily; a SIP can only hold pointers to its own memory or to memory it owns on the exchange heap.

The authors place enormous importance on safe, verified code, which led them to create Sing#. The run-time checking the language requires accounts for an additional 4-5% of cycles, but the authors deem that an acceptable trade-off. It was interesting to note that the manifest-based configuration syntax seems to be approaching Perl, with its multiple dollar signs and forall keyword. Another interesting point was the distrust of the Bartok compiler's ability to reliably compile MSIL into safe native code.

Monday, November 24, 2008

Two-Phase Commit

This looks like another one of those awesome '80s database papers. Woo, R*! I think I'll pass.

** Update (Nov. 29, 2010): For some reason this blog entry seems to be attracting lots of traffic. Sadly, I don't have a review to offer of this paper, but I can share a quote and some advice for you if you want to learn more about Two-Phase Commit.

"Two-phase commit: isn't that just like a simplified Paxos?" My advice: read the original Two-Phase Commit paper by Mohan, Lindsay, and Obermarck. Then read Lamport's "Paxos Made Simple." Then re-read "Two-Phase Commit" and enjoy how easy the algorithm is.

Paxos Made Simple, Lamport

Paxos is a distributed consensus algorithm for a network of processes in which each process can play the roles of proposer, acceptor, and learner. The algorithm is carried out in two phases. In the first phase, a proposer selects a proposal number and sends a prepare request to some majority of acceptors. An acceptor responds only if it has not already seen a larger proposal number; its response is a promise not to accept lower-numbered proposals, along with the highest-numbered proposal it has already accepted, if any. Before sending this response, the acceptor records it to stable storage so that consensus can still be reached in case of failure.

In the second phase, the proposer awaits responses to its prepare requests. If a majority of the acceptors respond, the proposer sends out an accept request carrying its proposal number and the value of the highest-numbered proposal among the responses (or its own value if none was reported). Each acceptor then accepts unless it has already responded to a prepare request with a larger number. These carefully designed phases enforce the rule that only a single value is chosen, and that it is chosen only from among proposed values. The possibility of a single acceptor failing creates the need for multiple acceptors, which in turn creates the need to reach consensus among them. This paper's abstract was an effective attention-grabber; I can only imagine how painful the original paper was when Paxos was not explained in simple terms.
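
A minimal sketch of the acceptor-side logic described above, assuming simple in-memory state (a real acceptor must write its promises and acceptances to stable storage before replying, as noted above); the message shapes and field names are my own.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest prepare number responded to
        self.accepted_n = -1        # number of the highest proposal accepted so far
        self.accepted_value = None  # value of that proposal

    def on_prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n     # would be persisted before replying
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def on_accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare has been promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value   # also persisted before replying
            return ("accepted", n)
        return ("reject", None)
```

A proposer that collects promises from a majority then issues accept requests carrying the value of the highest-numbered proposal reported in those promises, or its own value if none was reported.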

Wednesday, November 19, 2008

Implementing Declarative Overlays

P2 is a system for the declarative construction of overlay networks. Its language, OverLog, can specify overlays very concisely, requiring only 16 rules to express a Narada-style network and 47 for a Chord overlay. However, the conciseness of OverLog does not necessarily speak to the learnability of the language. Better readability (a la SQL) would help its popularity, particularly since OverLog's declarative nature is already somewhat foreign to the typical imperative-language programmer.

P2's modularity is useful in that, unlike PIER, P2 can run over a variety of underlying networks, building up knowledge of the network from a small set of nearby neighbor addresses. Its architecture is based on Click, though P2's elements are meant to act as database operators rather than packet-routing functions. Performance testing of P2 Chord overlays showed performance roughly on par with that of Chord itself; the hop-count, latency, and churn results were "acceptable," as the authors didn't expect to compete with hand-tuned implementations.

Friday, November 14, 2008

K42: Building a Complete Operating System

This paper discusses K42, a highly customizable, object-oriented operating system. K42 was designed to address several technology predictions made by the authors, including the rise of multiprocessors and 64-bit machines. One goal was the ability to scale up on multiprocessors, achieved by minimizing sharing through techniques like protected procedure calls and clustered objects; this goal also shaped the design of memory management. K42's design places a lot of importance on locality and customizability; the latter is especially interesting in that K42 allows each instance of a resource to be managed by a different set of object instances.

K42 also supports dynamic customization through hot swapping and dynamic upgrade. It reduces system calls and pinned kernel memory by moving thread scheduling into user space, although this user-level scheduling creates problems when using pthreads. Though the authors predicted Windows would dominate the market, they chose to provide Linux compatibility in K42.

Wednesday, November 12, 2008

Lessons from Giant-Scale Services, Brewer

This paper discusses giant-scale services, their advantages, and how to maintain high availability. The giant-scale services considered are single-site, single-owner, well-connected clusters that provide ubiquitous infrastructure, allowing users to reach services and centralized data from many devices. The paper treats load management and high availability as the major factors in the design of giant-scale services.

The availability metrics of uptime, yield, and harvest, as well as the DQ principle, are used to decide whether replication or partitioning is the better scheme for increasing availability. The author agrees with the traditional view that replication is the better strategy, although he suggests that some combination of partitioning and replication could give finer control over recovery. Good strategies for graceful degradation, disaster tolerance, and online evolution are crucial for high availability. The author mentions that he has developed tools to help design giant-scale systems, but the paper delves more into metrics and high-level topics than into said tools.
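
A quick worked example of how the metrics fit together; the numbers are invented purely to illustrate the definitions of yield, harvest, and the DQ bound.

```python
# Harvest, yield, and the DQ principle (illustrative numbers only).
queries_offered   = 1_000_000
queries_completed =   990_000
data_available    = 90       # percent of the full dataset reachable right now
data_complete     = 100

yield_  = queries_completed / queries_offered   # 0.99: fraction of queries answered
harvest = data_available / data_complete        # 0.90: fraction of the data behind each answer

# DQ ~ (data per query) * (queries per second) is roughly constant for a system,
# so losing one of ten replicas can be absorbed either as lower yield (turn away
# ~10% of queries) or as lower harvest (answer every query from ~90% of the data).
dq_before = 1.0 * 1000       # full harvest at 1000 queries/sec
dq_after  = dq_before * 0.9  # capacity remaining with one of ten nodes down
print(yield_, harvest, dq_after)
```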

Monday, November 10, 2008

The Impact of DHT Routing Geometry on Resilience and Proximity

This paper discusses DHTs and compares the performance of different routing designs, focusing on routing geometry rather than algorithmic details. Flexibility, that is, the amount of freedom available in choosing neighbors and next-hop paths, is a direct consequence of the geometry and in turn affects properties such as static resilience, path latency, and local convergence. The tree, butterfly, XOR, and hybrid geometries have little flexibility in route selection along optimal paths, while the hypercube and butterfly geometries have little flexibility in neighbor selection.

Based on the path-latency experiments, the authors suggest that geometries affect path latency only through their ability to support Proximity Neighbor/Route Selection. They tentatively suggest that the comparatively simple ring geometry be used more often, because it is flexible and performs as well as or better than other geometries in a variety of tests. It seems odd that the authors sound so tentative about ring geometries, seeing as Chord had already been implemented successfully prior to this paper's publication.

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

Chord is a scalable lookup protocol that supports a single operation: mapping a given key onto a node. It is fully distributed and load-balances by acting as a distributed hash function. Each node keeps only a small amount of routing state: a finger table pointing at nodes spaced around the circle, plus a short list of its immediate successors. Chord allows the network to be dynamic; it maintains each node's successor correctly and ensures that some node is responsible for every key. It uses stabilization to keep successor pointers correct, but also allows lookups to be retried in the event that stabilization has not run recently enough. The paper contains a good deal of theoretical analysis in addition to simulation and experimental results, and the authors use these analyses to argue that Chord functions correctly even in the face of incorrect or failed nodes.
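
A toy sketch of the lookup idea on a small identifier circle. The node ids are invented, and the helpers build finger tables from global knowledge, which a real Chord node would instead maintain through joins and stabilization.

```python
M = 6                                                     # identifier bits; ids live mod 2**M
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])    # hypothetical node ids

def in_interval(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def successor(ident):
    """First node at or after ident, clockwise (global knowledge: a simplification)."""
    return next((n for n in NODES if n >= ident), NODES[0])

def finger_table(node):
    """finger[i] = successor(node + 2^i mod 2^M), Chord's core routing state."""
    return [successor((node + 2**i) % 2**M) for i in range(M)]

def lookup(start, key):
    """Greedily follow the farthest finger that does not pass the key, counting hops."""
    target, node, hops = successor(key), start, 0
    while node != target:
        fingers = [f for f in finger_table(node) if in_interval(f, node, key)]
        nxt = max(fingers, key=lambda f: (f - node) % 2**M) if fingers \
              else successor((node + 1) % 2**M)
        node, hops = nxt, hops + 1
    return node, hops

print(lookup(1, 54))   # (56, 4): key 54 is stored at its successor, node 56
```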

Wednesday, November 5, 2008

The Click Modular Router

Click is a flexible software architecture for building routers. Routers are assembled from components called elements, which represent small units of packet processing and are implemented as C++ objects. Click supports both push and pull connections, with carefully specified rules about which kinds of output and input ports may be connected; explicit Queue elements bridge the push and pull parts of a configuration.

Click runs as a single thread in the Linux kernel and supports handlers and hot swapping for changing configurations. It uses polling instead of interrupts and achieves a much higher maximum loss-free forwarding rate than Linux with interrupts. Click supports extensions such as different packet schedulers and dropping policies. The Click language, paired with small, modular elements, makes it simple for users to put together quick configurations, but a few additions would help, such as configuration variables to keep users from having to copy information around manually.
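
A toy Python analogue of the element idea, using a push path only; this is not Click's C++ API, just a sketch of how small processing units chained by connections compose into a configuration.

```python
class Element:
    """A unit of packet processing with a single output connection."""
    def __init__(self):
        self.output = None

    def connect(self, downstream):
        self.output = downstream
        return downstream              # allows chaining: a.connect(b).connect(c)

class Counter(Element):
    def __init__(self):
        super().__init__()
        self.count = 0

    def push(self, packet):
        self.count += 1                # do this element's small piece of work...
        self.output.push(packet)       # ...then push the packet downstream

class Discard(Element):
    def push(self, packet):
        pass                           # swallow the packet

# A push path analogous to a tiny Click configuration: source -> Counter -> Discard
counter = Counter()
counter.connect(Discard())
for pkt in ["p1", "p2", "p3"]:
    counter.push(pkt)
print(counter.count)                   # 3
```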

Monday, November 3, 2008

Wednesday, October 29, 2008

TCP Congestion Control with a Misbehaving Receiver

This paper shows how a TCP receiver can misbehave and defeat congestion control through several attacks: ACK division, DupACK spoofing, and optimistic ACKing. An ACK division attack splits the acknowledgment for a single data segment into multiple separate acknowledgments. DupACK spoofing sends multiple duplicate acknowledgments for the last data segment received, and optimistic ACKing preemptively sends acknowledgments for data that has not yet been received.

The first two cause the sender to transmit a quick burst of packets by rapidly expanding the congestion window, while the third causes the sender to send data sooner than it otherwise would. All three attacks were generally successful against the TCP implementations of a variety of operating systems. The authors use a simple HTTP request in their tests and provide graphs that clearly show the exploits working. Disturbingly, the attacks could be implemented with very little code.
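
A small sketch of why ACK division works against a sender that grows its congestion window by one full segment per ACK received during slow start (the common behavior before appropriate byte counting); the numbers are illustrative.

```python
MSS = 1460                      # bytes per segment (a typical Ethernet value)

def slow_start_growth(acks_per_segment, segments_acked):
    """Congestion window growth when each segment triggers this many ACKs.

    A sender that adds one MSS to cwnd per ACK received, regardless of how few
    bytes the ACK covers, is exactly what ACK division exploits.
    """
    cwnd = MSS                                   # start of slow start
    for _ in range(segments_acked * acks_per_segment):
        cwnd += MSS
    return cwnd

print(slow_start_growth(1, 3))    # honest receiver: cwnd after 3 segments acked
print(slow_start_growth(50, 3))   # ACK division x50: cwnd inflated ~50x faster
```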

Congestion Avoidance and Control, Jacobson & Karels

This paper discusses techniques added to TCP for the sake of packet conservation. The three problems addressed are connections never reaching equilibrium, senders injecting new packets before old ones have exited, and equilibrium being unreachable because of limited resources along the path. The first problem is fixed through slow start: for every ack that arrives, two new packets are sent, opening the window exponentially.

The second problem stems from not estimating the variation of the round-trip time, which ultimately causes retransmission of packets whose delivery has only been delayed; the authors provide an estimation method and also suggest exponential backoff of the retransmit timer. This fix, taking variation into account, is one of the paper's most important contributions because it hugely reduces the number of unnecessary retransmissions. For the third problem, the authors note that a timeout is a good indicator of packet loss, which in turn is a good indicator of congestion, hence the suggestion to treat timeouts as a congestion signal. On a timeout the congestion window is halved, and for each ack the congestion window is increased by 1/(congestion window).
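
A sketch of the variance-aware retransmit timer; the gains of 1/8 and 1/4 and the factor of 4 are the values commonly used in implementations of Jacobson's estimator, written here from memory rather than quoted from the paper.

```python
class RttEstimator:
    """Smoothed mean plus mean deviation of the round-trip time."""
    def __init__(self, first_sample):
        self.srtt = first_sample
        self.rttvar = first_sample / 2

    def update(self, sample):
        err = sample - self.srtt
        self.srtt += err / 8                          # smoothed mean, gain 1/8
        self.rttvar += (abs(err) - self.rttvar) / 4   # smoothed deviation, gain 1/4

    def rto(self):
        # The timeout covers the mean plus a generous multiple of the variation,
        # so delayed (but not lost) packets are rarely retransmitted.
        return self.srtt + 4 * self.rttvar

est = RttEstimator(0.100)                             # seconds
for sample in (0.110, 0.095, 0.180, 0.105):
    est.update(sample)
print(round(est.rto(), 3))
```

On an actual timeout the timer is then backed off exponentially while the congestion window is halved, as described above.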

Monday, October 27, 2008

ReVirt: Enabling Intrusion Analysis through Virtual-Machine Logging and Replay

ReVirt is a system logger that moves the operating system into a virtual machine and runs below the OS. It does this to avoid relying on the integrity of the OS being logged and to allow replay of non-deterministic events. ReVirt is implemented as a kernel module in the host OS. Using an OS-on-OS virtualization strategy, the authors argue that although attackers may gain control of a guest OS, their actions on the host OS are severely restricted, as the VMM allows the guest kernel access to fewer than 7% of system calls.

To log enough information to replay non-deterministic events, the authors record asynchronous virtual interrupts and external input. They suggest cooperative logging as a way to reduce log volume. The authors pass off the overhead associated with kernel-intensive workloads (up to 58%) as being an acceptable tradeoff for security, although they only provide one brief case of how they successfully used ReVirt to find and analyze an intrusion.

Live Migration of Virtual Machines

This paper discusses live, in-service OS migration as opposed to migration at the process-level. Reasons for choosing this granularity include avoiding "residual dependencies" that require original host machines to remain accessible to migrated processes, transferring in-memory state consistently and efficiently, and separating the concerns of users and operators.

The memory-transfer approach used is pre-copy migration, a multi-round process combining push and stop-and-copy techniques. The first round is a push phase in which all pages are transferred to the new VM; each subsequent round transfers only the pages dirtied during the previous round. Once the remaining pages are essentially the frequently written writable working set, the current VM is suspended, the remaining pages are transferred, and the new VM is started. This pre-copy approach significantly improved on the downtime of a full stop-and-copy for a number of different migrations.
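
A minimal sketch of the pre-copy loop described above; the VM interface (all_pages, pages_dirtied_since_last_round, and so on), the round limit, and the stop condition are hypothetical simplifications.

```python
def live_migrate(vm, dest, max_rounds=30, small_enough=64):
    """Iterative pre-copy: push pages while the VM runs, then a short stop-and-copy."""
    to_send = set(vm.all_pages())                 # round 1: push everything
    for _ in range(max_rounds):
        for page in to_send:
            dest.receive(page, vm.read(page))
        to_send = vm.pages_dirtied_since_last_round()
        # Stop iterating once the remaining set is roughly the hot writable
        # working set (or give up after max_rounds and just stop-and-copy).
        if len(to_send) <= small_enough:
            break
    vm.suspend()                                  # brief downtime starts here
    for page in to_send | vm.pages_dirtied_since_last_round():
        dest.receive(page, vm.read(page))
    dest.resume()                                 # downtime ends on the new host
```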

The authors sidestep the issue of local resources by assuming that migrations tend to happen within a single LAN in a cluster environment and that data centers tend to use network-attached storage rather than local disks.

Wednesday, October 22, 2008

Are Virtual Machine Monitors Microkernels Done Right?

This paper discusses a few areas in which microkernels and VMMs differ architecturally and in which VMMs seem to have gotten things right. The first is avoiding liability inversion: microkernels depend on user-space components in order to keep executing, whereas VMMs keep the failure of an individual VM isolated.

The next is making IPC performance irrelevant, which VMMs achieve by making communication between VMs look like device interactions rather than IPC- or RPC-based interfaces. Last, and perhaps most important for mainstream adoption, is the treatment of existing operating systems: VMMs were designed to support existing OSes from the start, avoiding the additional effort microkernels required to do so after the fact.

Xen and the Art of Virtualization

Xen is a virtual machine monitor that allows many instances of an operating system to share the same hardware. It employs paravirtualization, but preserves existing application binary interfaces and supports multi-application OSes. Memory is statically partitioned between guest OS domains to provide strong isolation, although a guest may request additional memory pages from Xen or return unused ones. Xen uses I/O rings to handle device requests from the guest OSes; requests placed in a ring may be reordered before Xen handles them and places the corresponding responses back in the ring.
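
A rough Python model of the shared request/response ring idea, with responses reusing the slots of the requests they answer; the real rings live in memory shared between the guest and the driver domain, and the details here are my own simplification.

```python
class IORing:
    """Circular buffer of slots shared between a guest and the backend driver."""
    def __init__(self, size=8):
        self.slots = [None] * size
        self.req_prod = 0   # advanced by the guest when it posts a request
        self.req_cons = 0   # advanced by the backend as it pulls requests
        self.rsp_prod = 0   # advanced by the backend when it posts a response
        self.rsp_cons = 0   # advanced by the guest as it reaps responses

    def post_request(self, req):
        assert self.req_prod - self.rsp_cons < len(self.slots), "ring full"
        self.slots[self.req_prod % len(self.slots)] = req
        self.req_prod += 1

    def take_requests(self):
        """Backend drains outstanding requests; it is free to reorder them."""
        reqs = []
        while self.req_cons < self.req_prod:
            reqs.append(self.slots[self.req_cons % len(self.slots)])
            self.req_cons += 1
        return reqs

    def post_response(self, rsp):
        # Response k overwrites the slot that held request k, which the backend
        # has already consumed.
        self.slots[self.rsp_prod % len(self.slots)] = rsp
        self.rsp_prod += 1

    def reap_responses(self):
        rsps = []
        while self.rsp_cons < self.rsp_prod:
            rsps.append(self.slots[self.rsp_cons % len(self.slots)])
            self.rsp_cons += 1
        return rsps
```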

XenoLinux performs similarly to native Linux on context-switch times, file-system latency, various process benchmarks, and so on. The differences, such as in the execution of fork, exec, and sh, are due to the page-table update validation Xen must perform, since the guest OS runs at a lower privilege level than the Xen hypervisor.

Monday, October 20, 2008

Eddies: Continuously Adaptive Query Processing, Avnur & Hellerstein

An eddy is a query processing mechanism that can reorder the operators applied to tuples at runtime for optimization purposes. Eddies were designed to minimize the overhead of synchronization barriers and to take advantage of moments of symmetry for reordering. The authors use a pre-optimizer to set up the joins and keep this step very simple so that the focus stays on the eddy. An eddy keeps track of which operators each input tuple has been processed by and only outputs tuples that have been seen by every eligible operator.

The "naive eddy" works well for operators with the same selectivity, whereas the "fast eddy" takes into account both the consumption and production rates, and hence both cost and selectivity, via lottery scheduling. A window scheme is added to the lottery scheduling to only take recent history into account, allowing the eddy to adapt over time to changes in performance and data.

Sunday, October 19, 2008

Encapsulation of Parallelism in the Volcano Query Processing System, Graefe

Volcano is a query evaluation system that was developed to be extensible and parallelizable. In particular, it uses the "operator model" in which an exchange operator is responsible for running single-process code in parallel. This operator allows pipelining between processes and sets up a consumer/producer relation between the parent and child.

So-called "bushy" parallelism is supported by the exchange operator and forking; intra-process parallelism is supported with data partitioning and intermediate result queues. A somewhat odd decision made in the design of Volcano is that child processes must wait for the parent processes to complete before the call to close trickles down the tree to it; this order was the result of design decisions involving virtual devices and record pinning.

Wednesday, October 15, 2008

MapReduce: Simplified Data Processing on Large Clusters, Dean & Ghemawat

MapReduce is a programming model that lets users easily write code for processing large amounts of data in parallel. In Google's implementation, the user supplies functions for the map and reduce phases; the input is split across a number of machines, the map function is applied, a set of intermediate key/value pairs is written, and then the intermediate values are read by other machines that apply the reduce function and write out the answer.
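
The canonical word-count example, sketched sequentially in Python to show the shape of the two user-supplied functions; the real system shards the map input across machines and groups intermediate keys with a distributed shuffle rather than an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(document: str):
    """Emit an intermediate (word, 1) pair for each word in the document."""
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    """Combine all intermediate values that share a key."""
    return word, sum(counts)

def mapreduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                        # map phase (parallel across workers)
        for key, value in map_fn(doc):
            intermediate[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

print(mapreduce(["the cat sat", "the cat ran"]))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```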

Google's MapReduce cleverly tackles straggler machines and fault tolerance through duplication and re-execution. Once most tasks have completed, backup executions are scheduled for the remaining in-progress tasks. Similarly, when a worker machine fails, its tasks are simply re-executed, and when the master fails, it is restarted from checkpointed state. MapReduce works well for Google because they have a huge number of machines at their disposal; that's not to say the technique isn't useful with far fewer machines, though the paper doesn't discuss the performance implications of that scenario.

Parallel Database Systems: The Future of High Performance Database Systems, DeWitt & Gray

This paper advocates shared-nothing parallel database systems built from fast, inexpensive microprocessors and gives examples of actual machines that follow this model. Because of the lower hardware complexity and network traffic, the shared-nothing architecture allows a greater number of processors in the system and near-linear speedup and scaleup on OLTP workloads. The paper argues that shared-nothing systems with many processors are ideal for database applications because SQL is easily executed in parallel, unlike most legacy software. It also mentions data partitioning as a good way to improve speedup and scaleup, an idea examined further in the Google MapReduce paper.

Monday, October 13, 2008

Access Path Selection in a Relational Database Management System

This paper describes how System R selects access paths for both simple and complex queries. It briefly covers the four phases of processing a SQL statement (parsing, optimization, code generation, execution) and then goes in depth into how the optimizer works. The optimizer tries to find the access path that minimizes the CPU and I/O resources needed; it chooses between "interesting order" and unordered access paths based on the query's ordering requirements (e.g., GROUP BY and ORDER BY) and uses so-called selectivity factors in calculating the cost.

The optimizer chooses between nested-loop and merge-scan joins for n-way joins, which it treats as a series of 2-way joins. The join order is chosen using a search tree of equivalence classes: the best way of accessing each single relation is found, then the best way to join each remaining relation with those, and so on, building up the cost layer by layer. The authors mention that other path selectors exist but do not present alternatives to the methods they chose, nor do they provide any comparative testing or quantify their method's performance.
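
A hedged sketch of the kind of arithmetic involved, in the spirit of the paper's cost model (estimated page fetches plus a weighted count of tuple-handling calls); the catalog statistics, the weight, and the index page count below are invented.

```python
W = 0.02   # relative weight of a tuple-handling (CPU) call versus a page fetch

def cost(page_fetches, tuple_calls):
    return page_fetches + W * tuple_calls

# Hypothetical catalog statistics for a relation with an equality predicate on an
# indexed column: selectivity is estimated as 1 / (number of distinct keys).
ntuples, npages = 10_000, 500
index_keys = 50
selectivity = 1 / index_keys

full_scan  = cost(npages, ntuples)
index_scan = cost(selectivity * npages + 2,    # matching data pages plus index pages
                  selectivity * ntuples)

print(full_scan, index_scan)                   # the optimizer keeps the cheaper plan
```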

Wednesday, October 8, 2008

Lottery Scheduling: Flexible Proportional-Share Resource Management, Waldspurger & Weihl

Lottery scheduling is a randomized resource-allocation mechanism based on a "ticket" system. Resources are allocated to clients in proportion to the number of tickets they hold, although because the scheduling is randomized, the achieved ratio is not always spot-on. Clients can explicitly transfer tickets back and forth, or individual clients can take advantage of "ticket inflation," that is, creating more tickets for themselves. The latter sounds like a poor idea in the general case, as it only works among mutually trusting clients.
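
A minimal sketch of a single lottery draw; compensation tickets and ticket currencies are left out, and the client names and ticket counts are made up.

```python
import random

clients = {"database": 300, "editor": 100, "batch_job": 100}   # ticket holdings

def hold_lottery(clients):
    """Pick the next client to run with probability proportional to its tickets."""
    winning_ticket = random.randrange(sum(clients.values()))
    for name, tickets in clients.items():
        if winning_ticket < tickets:
            return name
        winning_ticket -= tickets

wins = {name: 0 for name in clients}
for _ in range(100_000):
    wins[hold_lottery(clients)] += 1
print(wins)   # roughly 3:1:1, matching the 300:100:100 ticket ratio
```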

Clients that do not consume their entire allocated time slice are given "compensation tickets," which increase their chances of winning future lotteries and hence keep the allocation ratios accurate. The authors found that the lottery-scheduling kernel ran comparably to the unmodified Mach kernel on the Dhrystone benchmark, but a little faster for a multithreaded database server, possibly due to better caching than under Mach's round-robin scheduling.

Sunday, October 5, 2008

Concurrency Control Performance Modeling: Alternatives and Implications, Agrawal, Carey, & Livny

This paper examines three methods of concurrency control (blocking, an optimistic algorithm, and immediate restart) in an attempt to determine which is ultimately best. A number of experiments were performed, varying factors such as the number of available resources (CPUs and disks), restart delays, external and internal think times, no-lock-upgrades (placing a write lock on a to-be-updated object from the start instead of taking a read lock and upgrading later), and fake restarts (replacing a restarted transaction with a different, independent one). The throughput, utilization, and conflict ratios at different multiprogramming levels were examined and compared for each set of experiments.

Each set of experiments produced a different winner, but the authors were able to draw a few conclusions from this seemingly mixed bag of results. All three techniques saw higher throughput with fake restarts; blocking did slightly better than the other two with one resource, while the optimistic approach did better with infinite resources. The no-lock-upgrade policy increased throughput for the immediate-restart algorithm in both the infinite- and limited-resource cases, while blocking suffered under this policy with infinite resources and gained throughput with one resource. In general, blocking trumps restart-based techniques in a system with medium to high resource utilization, while the optimistic algorithm may suit a system with many resources and a large amount of intra-transaction think time, although the authors admit that such a system is probably undesirable because of its lack of cost-effectiveness, and that in its absence blocking is the best form of concurrency control.

Wednesday, October 1, 2008

On Optimistic Methods for Concurrency Control, Kung & Robinson

A lock-free approach to concurrency control is presented as a solution to certain disadvantages of locking. One of the biggest premises of this "optimistic" approach is that locking is rarely needed in certain applications, the only concrete examples being concurrent B-tree insertions and query-dominant systems. The approach removes all restrictions on reads and instead places restrictions on writes, breaking each transaction into read, validation, and (optional) write phases. Writes are first made to copies of nodes, and the changes are swapped into the global structures only after successful validation.

Upon failing validation, the transaction is simply rolled back to start again; after repeated failed validations, the fallback is essentially to write-lock the entire database until the transaction runs to completion. The given technique of serial validation with transaction numbers only holds for one CPU, and things get messier for multiple processors, involving multiple stages of validation. The paper presents two applications in which the "optimistic" strategy might perform better than locking, but it doesn't offer much new insight into figuring out when an optimistic solution is preferable to a locking one.
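
A sketch of the serial-validation check: a committing transaction passes if no transaction that committed after it started reading wrote anything it read. The data structures are simplified assumptions, not the paper's exact formulation.

```python
class Txn:
    def __init__(self, start_tn, read_set, write_set):
        self.start_tn = start_tn        # highest transaction number committed when reading began
        self.read_set = set(read_set)   # object ids the transaction read
        self.write_set = set(write_set) # deferred writes, applied only after validation

def validate(txn, committed):
    """Serial validation against transactions that committed after txn started."""
    for tn, write_set in committed:
        if tn > txn.start_tn and write_set & txn.read_set:
            return False                # a conflicting write slipped in: restart txn
    return True                         # safe to swap in txn's deferred writes

committed_log = [(5, {"x"}), (6, {"y"})]        # transactions 5 and 6 already committed
t = Txn(start_tn=5, read_set={"y", "z"}, write_set={"z"})
print(validate(t, committed_log))               # False: transaction 6 wrote 'y', which t read
```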

Granularity of Locks and Degrees of Consistency in a Shared Data Base

[Man, this paper is so ghetto. Seriously, you should see it.]

This paper discusses different granularities of locking and degrees of consistency. It presents the tradeoff between concurrency and overhead determined by the lockable unit, namely that a smaller unit allows greater concurrency but requires more overhead. It discusses how these units fit into locking hierarchies and DAGs along with the different access modes (S, X, IX, IS, and SIX), placing a lot of emphasis on the compatibilities of the modes and on how they propagate to ancestors and descendants to achieve the desired locking behavior. The authors do a really good job of providing step-by-step examples of how the locking protocols would be carried out in practice.
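
For reference, the standard compatibility matrix for these modes, written out as a small lookup table; a requested lock is granted only if it is compatible with every mode already held by other transactions.

```python
# Compatibility of granular lock modes (True = compatible).
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def can_grant(requested, held_modes):
    """Grant a lock only if it is compatible with every mode already held."""
    return all(COMPAT[requested][h] for h in held_modes)

print(can_grant("IS", ["SIX"]))   # True: intention-share coexists with SIX
print(can_grant("IX", ["S"]))     # False: intent to write conflicts with a shared lock
```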

Consistency is ranked from degree 0 up to degree 3, with higher degrees trading lower concurrency and higher overhead for stronger guarantees. All transactions abide by degree 0, not overwriting the dirty data of other transactions, and each increasing degree adds one condition: the transaction does not commit any writes before the end of the transaction, the transaction does not read dirty data of other transactions, and other transactions do not dirty data the transaction has read before it completes. The paper compares its locking hierarchies and consistency degrees to those of IMS/VS and DMS 1100; in particular, IMS/VS does not support the notion of S or SIX locks and hence does not need the corresponding intention locks, and DMS does no implicit locking.

Monday, September 29, 2008

Capriccio: Scalable Threads for Internet Services

Capriccio is a user-level threading library for Linux. It attempts to simplify programming for high-concurrency servers while adding scalability and flexibility through user-level threads, an asynchronous I/O interface, linked stacks, and a resource-aware scheduler. The advantages of user-level threads over kernel threads include low-overhead synchronization, no kernel crossings for mutex acquire and release, and efficient memory management, although as of the paper's writing Capriccio only supported single-CPU threading.

Extensive compiler support is needed for the linked stacks and the scheduler; the call-graph and checkpoint scheme yields non-contiguous, dynamically sized stacks whose amount of wasted space can be tuned by the user. The resource-aware scheduler draws inspiration from event-based handler systems: Capriccio builds a blocking graph at run time, maintains a queue per blocking-graph node, and dynamically decides which nodes' threads should run next based on currently available resources. Apache's bandwidth for 100+ clients increases when run with Capriccio, and the thread-based Knot server performs comparably to the event-based Haboob web server.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, Welsh, Culler, & Brewer

SEDA, the staged event-driven architecture, is a software architecture designed to support highly concurrent Internet services and to simplify their construction; it is also meant to provide dynamic resource management and let applications adapt to changing load conditions. It manages all of this through pipelined event-driven stages and dynamic resource controllers. Each stage contains an event queue, an application-supplied event handler, and a thread pool; the number of active threads in a stage is adjusted by the thread pool controller, and the batching controller adjusts the number of events handled per invocation of the handler.
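
A toy sketch of a single stage: an event queue feeding a small thread pool that runs the handler and pushes output events to downstream stages. The fixed batch-size cap and thread count stand in for what SEDA's batching and thread-pool controllers would tune dynamically; names and thresholds are invented.

```python
import queue
import threading
import time

class Stage:
    def __init__(self, name, handler, threads=2):
        self.name = name
        self.events = queue.Queue()          # each stage has its own event queue
        self.handler = handler               # application-supplied event handler
        for _ in range(threads):             # a real controller would resize this pool
            threading.Thread(target=self._worker, daemon=True).start()

    def enqueue(self, event):
        self.events.put(event)

    def _worker(self):
        while True:
            batch = [self.events.get()]      # a batching controller would tune
            while not self.events.empty() and len(batch) < 8:   # this batch size
                batch.append(self.events.get())
            for out_event, next_stage in self.handler(batch):
                next_stage.enqueue(out_event)    # handlers push events downstream

# Example pipeline: a parse stage forwarding events to a respond stage.
def respond(batch):
    for ev in batch:
        print("response for", ev)
    return []

def parse(batch):
    return [(ev.upper(), respond_stage) for ev in batch]

respond_stage = Stage("respond", respond)
parse_stage = Stage("parse", parse)
for req in ["get /a", "get /b"]:
    parse_stage.enqueue(req)
time.sleep(0.5)                              # let the daemon workers drain the queues
```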

The paper presents two applications built on SEDA: the high-performance HTTP server Haboob and a Gnutella packet router. Haboob performs comparably to Apache and Flash in throughput and fairness, with higher average but lower variance in response time. It was interesting to see the differing views of the Capriccio and SEDA authors on thread- versus event-based systems; the Capriccio authors complain that event-based systems like SEDA are much less intuitive to program than thread-based ones, while the SEDA authors claim their system occupies a middle ground between threads and events and hence is ideal.

Wednesday, September 24, 2008

Experience with Processes and Monitors in Mesa, Lampson & Redell

I decided to skip this writeup because I really just couldn't get into the paper. Sorry.

Monday, September 22, 2008

Lightweight Recoverable Virtual Memory

RVM, a lightweight recoverable virtual memory implementation, was designed with lessons learned from Camelot in mind. The biggest design principle was simplicity over generality, which led to RVM's lack of support for nesting, distribution, and concurrency control, and to its being decoupled from the operating system. RVM uses logging, allows no-flush transactions, and has a memory allocator and segment loader layered on top of it. It uses external data segments as backing store for recoverable memory and transfers only committed changes to these segments, allowing no-undo/redo logging.

It was impressive to hear that RVM had been in real use for two years prior to the paper's writing. As the authors point out, however, RVM has a weak point: programmers must remember to make an explicit set_range call if they want their memory modifications to be restored in the event of failure. This could be remedied fairly easily during compilation or in a post-compilation stage.

Wednesday, September 17, 2008

Stasis: Flexible Transactional Storage, Sears & Brewer

Stasis is a transactional storage library that bases a number of its policies on ARIES. It uses write-ahead logging to support redoing and undoing transactions during recovery, both of which are necessary to support no-force and steal policies. Stasis also uses ARIES' notions of LSNs, CLRs, and nested top actions. The authors compared Stasis with a similar system, Berkeley DB, on their hash-table implementations, handling of concurrent requests, and object-persistence optimizations; Stasis did as well as or better than Berkeley DB in each case.

Consistent with the "Systems View," Stasis is noteworthy in that it is extensible, allowing for the addition of user operations, a LSN-free version of recovery, different data layouts, and any other number of specializations. The authors make a number of assertions about these different features Stasis "probably" could handle, but unfortunately few of these had actually been implemented at the time of the writing. As a measure of how simple/complex Stasis is, the authors included a count of its lines of code, which was an interesting metric to see.

Monday, September 15, 2008

ARIES

This paper discusses ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), a recovery method for databases that uses write-ahead logging for rollback and media recovery. The authors emphasize that ARIES repeats history: on restart it redoes all logged updates to restore the database to its pre-crash state and then undoes the transactions that were in progress at the time of failure. A strong point of ARIES is that it allows both partial and total rollbacks through the use of savepoints. The paper notes differences between ARIES and System R; in particular, System R's shadow-page technique was a source of overhead that ARIES avoids by updating pages in place.

ARIES uses page-oriented redo to cut down on the tables and indexes needed to access data pages, but requires both page-oriented and logical undo to allow more concurrency. Despite the authors' efforts to trim what ARIES needs, such as the number of pages read from non-volatile storage during recovery and the descriptor tables required, the logging system keeps track of a large amount of data: between PrevLSN pointers within log records, CLRs, and separate redo and undo actions, ARIES carries a non-trivial amount of bookkeeping, and the recovery algorithm seems like it would require quite a bit of extra overhead. However, ARIES has been adopted by a number of databases such as DB2 and MS SQL Server, so perhaps the authors are not underplaying the overhead after all.
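
A very high-level sketch of the recovery passes (analysis, redo, undo); the log-record shapes are invented, and checkpoints, CLR generation, and on-disk page LSN handling are all glossed over.

```python
def recover(log, pages):
    """ARIES-style recovery: analysis, then redo history, then undo losers."""
    # 1. Analysis: transactions with updates but no commit record are "losers".
    losers = ({r["txn"] for r in log if r["type"] == "update"}
              - {r["txn"] for r in log if r["type"] == "commit"})

    # 2. Redo: repeat history for every logged update whose effect is missing,
    #    including updates made by losers (this is what "repeating history" means).
    for r in log:
        if r["type"] == "update" and pages.get(r["page"], {}).get("lsn", -1) < r["lsn"]:
            pages.setdefault(r["page"], {})["value"] = r["after"]
            pages[r["page"]]["lsn"] = r["lsn"]

    # 3. Undo: roll back losers in reverse LSN order using before-images
    #    (real ARIES also writes compensation log records, CLRs, as it goes).
    for r in reversed(log):
        if r["type"] == "update" and r["txn"] in losers:
            pages[r["page"]]["value"] = r["before"]
    return pages

log = [
    {"type": "update", "lsn": 1, "txn": "T1", "page": "P1", "before": 0, "after": 5},
    {"type": "update", "lsn": 2, "txn": "T2", "page": "P2", "before": 7, "after": 9},
    {"type": "commit", "txn": "T1"},
]
print(recover(log, {}))   # P1 keeps T1's committed update; T2's update on P2 is undone
```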

Wednesday, September 10, 2008

The HP AutoRAID Hierarchical Storage System

HP AutoRAID is a two-level redundancy storage hierarchy implemented behind a (SCSI) block-level device interface; it employs background garbage collection and load balancing. It was created to solve the difficult problem of manually configuring a RAID system. The hierarchy consists of mirrored storage for active data and RAID 5 for inactive data, and a given piece of data is kept at only one of these levels at a time. Disk space is split into physical extents (PEXes), which are grouped into PEGs spanning different disks so as to keep the disks balanced, and are further broken down into segments, which serve as the units of striping and duplication. The logical space, on the other hand, is divided into relocation blocks (RBs), which serve as the units of data migration.

Reads and writes to the mirrored storage are fairly simple, as are RAID 5 reads. RAID 5 writes, on the other hand, can use a per-RB scheme, writing one RB (and its stripe's parity) at a time, or they can be batched; if a batched write fails, the per-RB scheme is used instead. The authors made a good call in using RBs demoted during idle periods to plug the holes left by previously promoted RBs; skipping hole-plugging might have been faster and easier initially, but enabling it nearly eliminated the need for a RAID 5 cleaner in the simulations.

Monday, September 8, 2008

Analysis and Evolution of Journaling File Systems

This paper introduces semantic block analysis (SBA) and semantic trace playback (STP) as ways of analyzing and modifying file-system behavior, respectively. SBA is similar to block-level tracing except that SBA has the added benefit of knowing the on-disk format of the file system. Generic SBA code can be combined with a smaller amount of file-system-specific code to accommodate ext3, ReiserFS, JFS, and NTFS, although the authors could have made a more convincing argument for why this generic route beats building tracing straight into each file system. STP is a good way to test changes to the file system that are not too complex or "radical": no time-consuming modification of the file system or detailed simulation of it is required. Case studies of ext3 and ReiserFS show how useful SBA and STP can be, especially given the handful of errors in ReiserFS that SBA uncovered.

The paper takes a close look at ext3 and uses SBA to examine different aspects of its performance, in particular how its three journaling modes handle sequential and random workloads. Based on these experiments, the authors recommend several techniques for improving performance: adaptive journaling as a flexible mode for efficiently handling a mix of sequential and random writes, issuing journal and fixed-location writes at the same time to combat falsely limited parallelism, and untangling synchronous and asynchronous traffic streams to fix tangled synchrony. It was hard to believe that ext3 didn't already employ differential journaling, especially when STP results showed speedups of 6 and 200 in data-journaling mode for two database workloads.

A Fast File System For UNIX

This paper traces the creation of FFS, a newer implementation of the UNIX file system. Its layout features include cylinder groups, a free-block bitmap (replacing the old system's free list), a static number of inodes per cylinder group, and larger blocks. The benefit of larger blocks is keeping big files together on a cylinder and hence seeking less; however, since UNIX systems typically hold lots of little files, larger blocks also waste a lot of space (up to 45%), which led to the technique of splitting blocks into fragments.

In terms of layout, the top-level goals are to minimize seek latency and to promote larger data transfers; the authors wanted to increase locality, but they also wanted to spread unrelated data across different cylinder groups. Blocks are allocated with the aim of keeping them "rotationally optimal," and a reserve of free blocks must be kept unallocated in order to localize files. Because of the larger block size, both reads and writes are faster than in the old system, though writes remain slower than reads because allocating the larger blocks takes longer. Pre-allocating batches of blocks to a file sounds like a good way to improve write performance, and the creators of the DEMOS file system felt the same way, but these authors did not deem it worth their time to optimize something accounting for only 10% of a write system call.

Wednesday, September 3, 2008

A History and Evaluation of System R

System R was a relational database system meant to boost user productivity through use of a high-level user interface and support for many concurrent users and multiple types of database use. It was developed over the course of 5 years in three distinct phases (zero through two): prototype, full-blown version, and evaluation.

Phase Zero struck me as a bit odd in that the designers knew before they started that they would completely scrap it afterwards. Granted, some important lessons were learned from the prototype and incorporated into the full-blown system: that performance should be measured in number of I/Os rather than number of tuples fetched, that the optimizer should aim to handle simple queries well, and that both joins and subqueries should be supported.

What struck me in particular about the Phase One design were the elaborate recovery and locking subsystems. The designers were clearly taking no chances with any of the three types of failure, even going so far as to use a dual-logging scheme. Parts of the recovery subsystem such as the "shadow pages" were a detriment to performance; while a certain amount of performance can be sacrificed for success in recovery from failure, the alternative technique suggested in the paper itself of simply logging all database updates might have been a much simpler solution.

A lot of thought seemed to go into the locking subsystem, but in the end, Levels 1 and 2 probably never saw much use and consequently could have been left out entirely. As it is, the locking subsystem had to be redesigned to account for the "convoy phenomenon" by allowing processes to reobtain locks during a given time slice.

One of the biggest successes of System R was the popularity SQL found with users. Not only was it simple to learn, but it was actually extended during the system's development to satisfy user demands for more functionality. Compiling SQL into machine code also appears to have been quite successful; code generation takes little enough time that a query need only be run a few times for compilation to pay off.

With all the apparent effort put into the recovery and locking subsystems, I was a little disappointed by the exclusive use of B-trees for access paths. While B-trees are the better choice for fetching multiple records and maintaining ordering and connections, hashing and links are the better choice for fetching small numbers of records in "canned transactions." If there had been a nice way of incorporating a mix of these, System R could have handled any transaction well. Then again, the designers probably had better insight into which queries would actually occur most.

Monday, September 1, 2008

Design Philosophy of DARPA Internet Protocols, Clark

This paper discusses the various architectural decisions that were faced in the creation of the Internet. It discusses different approaches that could have been taken and justifies the reasons behind the ultimate choices. The list of priorities in creating the Internet contained a number of unsurprising items: survivability, flexibility, accountability, cost-effectiveness, and distributed resource management.

However, the ordering of these priorities was a surprise; the utmost goal of the Internet was to connect together existing networks, but survivability in the case of failure was not far behind. I was amazed to see that cost-effectiveness was ranked as the 5th secondary goal out of 7 and can only imagine how different the Internet would be today if it were being created for commercial rather than military purposes.

One important decision made was for the use of the "fate sharing" scheme, that is, placing state information at endpoints to protect against intermediate node failures. This idea comes directly out of the need for survivability but also meshes well with the end-to-end argument. The disadvantages of employing the end-to-end strategy were that retransmitted packets may cross intervening networks multiple times and also that programmers would have to deal with implementing protocols themselves, an idea that perhaps strikes slightly less fear into current-day programmers than it did then.

A decision driven by the need for flexibility was the splitting of TCP and IP to allow for different transport protocols. This drive to admit a variety of networks and provide a variety of services meant that networks had to meet only minimal requirements; the requirement of being "fairly reliable" in transporting a "reasonably sized" packet, using addressing (usually), was left general to accommodate the many different networks already in existence. However, because the designers did not want to constrain flexibility and variability, they never tackled the difficult problem of formalizing performance constraints.

An interesting feature of the ARPANet protocol that was changed for TCP is that ARPANet flow control was based on both bytes and packets; despite the variability the designers strove for in the networks and services incorporated into the Internet, TCP was made to regulate only bytes. Eliminating one form of regulation because it seemed too complex feels like a poor excuse, particularly since the ARPANet already handled both, though the designers really had no reason to spend time on anything complex that did not contribute to the top goals of survivability and network variability. Unfortunately, tools for distributed management and accountability also landed on the list of under-developed concepts.

Sunday, August 31, 2008

End-to-End Arguments, Saltzer, Reed, & Clark

This paper presents the end-to-end argument, namely that functionality should often be implemented in the applications at the endpoints rather than pushed into lower levels of the communication system. It discusses the trade-offs of handling functions in low-level subsystems vs. placing them closer to the application level, and presents various real scenarios, from encryption to duplicate-message suppression to reliable data transmission, that support the end-to-end argument.

One of the main arguments for the end-to-end strategy is that some operations, such as checksumming with file transmission, must be done at the applications level regardless, so adding that functionality to the communication system is redundant and at times costly in terms of delay. If a function is added to the low-level system, even applications that don't require said function will get it anyway, which can cause problems for applications like real-time speech transmission whose success depends on low delay in message delivery.

The issue is, of course, not completely cut-and-dried. Situations in which data reliability rather than minimized delay is important seem to go against the end-to-end argument; for example, in the case of reliable data transmission, without the communication system performing any checks, the transmission time for a file grows exponentially with the length of the file due to retries of sending and checksumming.

Clearly some happy medium must be chosen to accommodate different application types. This paper sets the stage for the TCP/IP model to jump in and save the day; the separate internet and transport layers seem to provide a clean way to address the needs of different applications.

The arguments presented seem to be a common theme in networking and architecture, that is, how much functionality to encapsulate from the applications/users and how much to let the applications handle themselves. The paper does a good job of acknowledging that both end-to-end and lower-level implementations work better for different sets of tasks and that the trick is to find a good balance between the two strategies; in this case, that "good balance" came from finding the right layering split.

Friday, August 29, 2008

Success!

In the very near future, this blog will contain summaries/criticisms of computer networking papers. If I get so ambitious, I might just stick the systems papers' ones up here, too. We'll see.