Wednesday, October 29, 2008

TCP Congestion Control with a Misbehaving Receiver

This paper discusses how a TCP receiver can misbehave by defeating congestion control measures through various attacks. The three attacks described are ACK Division, DupACK spoofing, and Optimistic ACKing. An ACK Division attack divides an acknowledgment for a data segment into multiple separate acknowledgments. DupACK spoofing involves sending back multiple acknowledgments for the last data segment received, and Optimistic ACKing is preemptive sending of acknowledgments for data that has not yet been received.

The first two cause the sender to send packets in a burst by rapidly expanding the congestion window, while the third causes the sender to send data sooner than it otherwise would have. All three attacks were generally successful against the TCP implementations of a variety of OSes. The authors use a simple HTTP request in their tests and provide graphs that clearly show that their exploits worked. The attacks could be implemented with a disturbingly small amount of code.
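
To see why ACK division works, here is a minimal sketch (my own illustration, not code from the paper) of a sender that grows its congestion window by one MSS per ACK received rather than per byte acknowledged:

```python
# Toy model of a sender whose slow start grows cwnd by one MSS per ACK,
# regardless of how many bytes each ACK actually covers.

MSS = 1460

def sender_cwnd_after_acks(cwnd, acked_byte_counts):
    """Naive sender: +1 MSS of congestion window per ACK received."""
    for _ in acked_byte_counts:
        cwnd += MSS
    return cwnd

# Honest receiver: one ACK covering the whole segment.
print(sender_cwnd_after_acks(MSS, [MSS]))        # cwnd grows by 1 MSS

# Misbehaving receiver: the same segment acknowledged in 10 pieces.
print(sender_cwnd_after_acks(MSS, [MSS // 10] * 10))  # cwnd grows by 10 MSS
```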

Congestion Avoidance and Control, Jacobson & Karels

This paper discusses techniques added for the sake of packet conservation in TCP. Three problems addressed are connections not getting to equilibrium, senders injecting new packets before old ones exit, and equilibrium not being attained due to limited resources on the path. The first problem is fixed through slow start. For every ACK that arrives, the window grows by one segment, so each ACK allows two new packets to be sent and the window opens exponentially.
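
A toy sketch of that growth (my own illustration, not the paper's code): one segment of window added per returning ACK means the window roughly doubles every round-trip time.

```python
def slow_start_windows(initial_cwnd=1, rtts=5):
    """Track the congestion window (in segments) over several RTTs."""
    cwnd = initial_cwnd
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd          # one ACK returns per segment sent this RTT
        cwnd += acks         # +1 segment per ACK => doubling per RTT
        history.append(cwnd)
    return history

print(slow_start_windows())  # [1, 2, 4, 8, 16, 32]
```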

The second problem is due to not estimating the variation in round-trip time, which ultimately causes packets whose delivery has only been delayed to be retransmitted; the authors provide an estimation method and also suggest exponential backoff for the retransmit timer. The solution to this second problem, that is, taking variation into account, is one of the most important contributions of the paper because it hugely reduces the number of packets needing to be retransmitted. For the third problem, the authors note that a timeout is a good indicator of packet loss, which is in turn a good indicator of congestion, hence the suggestion to use timeouts as a signal of congestion. Upon a timeout, the congestion window is halved, and for each ACK the congestion window is increased by 1/(congestion window).
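
A hedged sketch of both mechanisms (variable names are my own; the gains of 1/8 and 1/4 and the deviation multiplier of 4 are the values that became standard in BSD-derived TCPs, not necessarily the exact constants in every version of the paper):

```python
def update_rtt(srtt, rttvar, sample):
    """Smoothed RTT plus smoothed mean deviation => retransmit timeout."""
    err = sample - srtt
    srtt = srtt + err / 8                 # low-gain filter on the mean
    rttvar = rttvar + (abs(err) - rttvar) / 4  # filter on the deviation
    rto = srtt + 4 * rttvar               # timeout accounts for variance
    return srtt, rttvar, rto

def on_timeout(cwnd):
    return max(1.0, cwnd / 2)             # multiplicative decrease on congestion

def on_ack(cwnd):
    return cwnd + 1.0 / cwnd              # additive increase: ~1 segment per RTT
```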

Monday, October 27, 2008

ReVirt: Enabling Intrusion Analysis through Virtual-Machine Logging and Replay

ReVirt is a system logger that moves the operating system into a VM and then runs below the OS. It does this to avoid relying on the integrity of the OS being logged and also to allow replaying non-deterministic events. ReVirt is implemented as a kernel module in the host OS; using an OS-on-OS virtualization strategy, the authors argue that although attackers may gain control of a guest OS, their actions on the host OS are severely restricted, as the implemented VMM allows the guest kernel access to fewer than 7% of system calls.

To log enough information to replay non-deterministic events, the authors record asynchronous virtual interrupts and external input. They suggest cooperative logging as a way to reduce log volume. The authors pass off the overhead associated with kernel-intensive workloads (up to 58%) as being an acceptable tradeoff for security, although they only provide one brief case of how they successfully used ReVirt to find and analyze an intrusion.
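
A rough sketch of what a replay log along these lines might look like (my own illustration, not ReVirt's actual format; ReVirt pinpoints where an asynchronous event must be re-delivered using, roughly, a branch count plus the program counter):

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    kind: str        # "interrupt" or "input"
    position: tuple  # (branch_count, program_counter) locating the event
    payload: bytes   # data needed to reproduce the event on replay

log = []

def record_interrupt(branch_count, pc, vector):
    log.append(LogRecord("interrupt", (branch_count, pc), bytes([vector])))

def record_input(branch_count, pc, data):
    log.append(LogRecord("input", (branch_count, pc), data))

# On replay, the guest runs deterministically and each logged event is
# re-delivered when execution reaches the recorded position.
```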

Live Migration of Virtual Machines

This paper discusses live, in-service OS migration as opposed to migration at the process-level. Reasons for choosing this granularity include avoiding "residual dependencies" that require original host machines to remain accessible to migrated processes, transferring in-memory state consistently and efficiently, and separating the concerns of users and operators.

The approach used to move memory is pre-copy migration, a multi-round process combining push and stop-and-copy techniques. The first round is a push phase in which all pages are transferred to the destination, and each subsequent round transfers only the pages that were modified during the previous round. Once the remaining pages are just the frequently-written writable working set, the current VM is suspended, those pages are transferred, and the new VM is started. This pre-copy approach significantly improved on the downtime caused by a full stop-and-copy for a number of different migrations.
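
A rough sketch of that loop (my own illustration with hypothetical helper callables; the real implementation works in terms of shadow page tables and dirty bitmaps):

```python
def precopy_migrate(all_pages, get_dirty_pages, send_pages,
                    suspend_vm, resume_on_destination,
                    max_rounds=30, wws_threshold=50):
    """Push rounds shrink the set of pages to resend until only the hot
    writable working set (WWS) remains, then stop-and-copy the rest."""
    to_send = set(all_pages)
    for _ in range(max_rounds):
        send_pages(to_send)                    # push phase for this round
        to_send = set(get_dirty_pages())       # pages written while copying
        if len(to_send) <= wws_threshold:      # little left but the hot WWS
            break
    suspend_vm()                               # brief stop-and-copy phase
    send_pages(to_send | set(get_dirty_pages()))
    resume_on_destination()                    # downtime ends here
```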

The authors ignore the issue of having to deal with local resources; they make the assumption that migrations tend to happen in a cluster environment within a single LAN and that data centers tend to use network-attached storage in place of local disks.

Wednesday, October 22, 2008

Are Virtual Machine Monitors Microkernels Done Right?

This paper discusses a few areas in which microkernels and VMMs differ architecturally and in which VMMs seem to have gotten things right. The first of these is avoiding liability inversion: a microkernel may depend on its user-space components in order to make progress, whereas a VMM keeps the failure of any individual VM isolated from the rest of the system.

The next area is making IPC performance irrelevant, which VMMs do by making communication between VMs look like device interactions instead of through IPC- and RPC-based interfaces. Lastly, and perhaps most importantly for their mainstream adoption, is the treatment of operating systems. VMMs were designed to support existing OSes, avoiding the additional efforts required by microkernels to do so after the fact.

Xen and the Art of Virtualization

Xen is a virtual machine monitor that allows many instances of an operating system to share the same hardware. It employs paravirtualization while still supporting existing ABIs and multi-application OSes. Memory is statically partitioned among the guest OS domains to provide strong isolation, although a guest OS may request additional memory pages or return unused ones to Xen. Xen uses I/O rings to handle device requests from the guest OSes; requests placed on a ring may be reordered before Xen handles them and places the completions back on the ring.
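
A toy sketch of the descriptor-ring idea (a deliberate simplification of my own; the real rings are fixed-size shared-memory buffers with producer and consumer indices advanced by the guest and by Xen):

```python
from collections import deque

class ToyIORing:
    """Guest posts requests; Xen may service them out of order and posts
    responses back on the same ring for the guest to collect."""

    def __init__(self, size=64):
        self.size = size
        self.requests = deque()      # guest -> Xen descriptors
        self.responses = deque()     # Xen -> guest completions

    def guest_place_request(self, req_id, sector):
        if len(self.requests) >= self.size:
            raise RuntimeError("ring full: guest must wait")
        self.requests.append((req_id, sector))

    def xen_service(self):
        # Reorder outstanding requests (here: by sector, as a disk
        # scheduler might) before posting completions.
        for req_id, sector in sorted(self.requests, key=lambda r: r[1]):
            self.responses.append((req_id, "done"))
        self.requests.clear()

    def guest_collect(self):
        done = list(self.responses)
        self.responses.clear()
        return done
```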

XenoLinux performs similarly to Linux in regards to context switch times, file system latency, various process benchmarks, and so on. The differences, such as in the execution of fork, exec, and sh, are due to verifications Xen must make of page table updates since the guest OS runs at a lower privilege level than the Xen hypervisor.

Monday, October 20, 2008

Eddies: Continuously Adaptive Query Processing, Avnur & Hellerstein

An eddy is a query processing mechanism that can reorder the operators in a query at runtime for optimization purposes. Eddies were designed to minimize synchronization-barrier overhead and to take advantage of moments of symmetry for reordering. The authors use a pre-optimizer to set up the joins and keep this step very simple so that the focus stays on the eddies. An eddy keeps track of which operators each input tuple has been processed by and only outputs tuples that have been seen by every eligible operator.

The "naive eddy" works well for operators with the same selectivity, whereas the "fast eddy" takes into account both the consumption and production rates, and hence both cost and selectivity, via lottery scheduling. A window scheme is added to the lottery scheduling to only take recent history into account, allowing the eddy to adapt over time to changes in performance and data.

Sunday, October 19, 2008

Encapsulation of Parallelism in the Volcano Query Processing System, Graefe

Volcano is a query evaluation system that was developed to be extensible and parallelizable. In particular, it uses the "operator model" in which an exchange operator is responsible for running single-process code in parallel. This operator allows pipelining between processes and sets up a consumer/producer relation between the parent and child.
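
A very rough sketch of the consumer side of such an operator (my own illustration; Volcano uses separate processes and its open/next/close iterator protocol, whereas I use a thread and a queue as stand-ins):

```python
import threading
import queue

_END = object()   # end-of-stream marker

class Exchange:
    """Looks like an ordinary iterator to its parent; a producer thread runs
    the child plan and fills the queue, giving pipelined parallelism."""

    def __init__(self, child_iterable, capacity=128):
        self.child_iterable = child_iterable
        self.q = queue.Queue(maxsize=capacity)

    def open(self):
        def produce():
            for item in self.child_iterable:
                self.q.put(item)
            self.q.put(_END)
        self.thread = threading.Thread(target=produce, daemon=True)
        self.thread.start()

    def next(self):
        item = self.q.get()
        return None if item is _END else item

    def close(self):
        self.thread.join()
```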

So-called "bushy" parallelism is supported by the exchange operator and forking; intra-process parallelism is supported with data partitioning and intermediate result queues. A somewhat odd decision made in the design of Volcano is that child processes must wait for the parent processes to complete before the call to close trickles down the tree to it; this order was the result of design decisions involving virtual devices and record pinning.

Wednesday, October 15, 2008

MapReduce: Simplified Data Processing on Large Clusters, Dean & Ghemawat

MapReduce is a programming model that allows users to easily write code for processing large amounts of data in parallel. In Google's implementation, the user specifies functions for the map and reduce phases; a set of inputs is split across a number of machines, the map function is applied, a set of intermediate values is written, and then the intermediate values are read by more machines that apply the reduce function and write out an answer.
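
A minimal single-machine sketch of the model (my own illustration, not Google's implementation): apply map to each input, group the intermediate pairs by key, then apply reduce to each group.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):        # map phase
            intermediate[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}  # reduce

# Word count, the canonical example:
docs = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(
    docs,
    map_fn=lambda doc: [(w, 1) for w in doc.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts["the"])   # 2
```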

Google's MapReduce cleverly tackles the problems of "straggler" machines and fault tolerance through duplication and re-execution. Once the overall operation is close to completion, so-called backup executions are scheduled for the remaining in-progress tasks. Similarly, when a worker machine fails, the task it was working on is simply re-executed, and when the master fails, it can be restarted from some checkpointed state. MapReduce works well for Google because they have a huge number of machines at their disposal; that's not to say the technique isn't useful with far fewer machines, though the paper does not discuss the performance implications of that scenario.

Parallel Database Systems: The Future of High Performance Database Systems, DeWitt & Gray

This paper advocates the use of shared-nothing parallel database systems made from fast, inexpensive microprocessors. It gives examples of actual multiprocessors that follow this model. Because of the lower hardware complexity and network traffic, the shared-nothing architecture allows for a greater number of processors within the system and near-linear speedups and scaleups for OLTP loads. The paper argues that shared-nothing systems with a large number of processors are ideal for database applications because SQL code is easily executed in parallel, unlike most old software. The paper mentions data partitioning as a good way to improve speedup and scaleup, an idea which is further examined in the Google MapReduce paper.
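
A tiny sketch of the partitioning idea (my own illustration): hash-partition rows across nodes so that each node can scan or join its own fragment in parallel, which is where the near-linear speedup comes from.

```python
def hash_partition(rows, key_fn, num_nodes):
    """Assign each row to a node by hashing its partitioning key."""
    fragments = [[] for _ in range(num_nodes)]
    for row in rows:
        fragments[hash(key_fn(row)) % num_nodes].append(row)
    return fragments

# e.g. spreading account rows across 4 nodes by account id:
# fragments = hash_partition(accounts, key_fn=lambda r: r["acct_id"], num_nodes=4)
```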

Monday, October 13, 2008

Access Path Selection in a Relational Database Management System

This paper discusses how System R selects access paths for queries, both simple and complex. It briefly describes the four phases of processing a SQL statement (parsing, optimization, code gen, execution) but then goes in-depth into how the optimizer works. The optimizer attempts to find the best access path by finding one that minimizes needed CPU and I/O resources; it chooses between "interesting" order and "unordered" access paths based on the given query's ordering requirements (e.g., "group by" and "order by") and a so-called selectivity factor that it uses in calculating the cost.

The optimizer chooses between doing nested-loop and merge scan joins for any n-way joins, which it treats as a series of 2-way joins. The optimal ordering of these joins is created using a search tree of equivalence classes. The best way of accessing single relations is found, and then the best way to join any relation with these relations is found, and the cost is built up layer by layer, so to speak. The authors mention that other path selectors exist but do not present alternative methods to the ones they chose, nor do they provide any kind of comparative testing or quantify their method's performance.
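
A toy bottom-up search in the spirit of that optimizer (my own simplification: left-deep orders only, cost charged as the size of each intermediate result, and selectivity factors supplied directly rather than estimated from catalog statistics):

```python
from itertools import combinations

def best_join_order(card, sel):
    """card: {relation: cardinality}; sel: {frozenset({a, b}): selectivity}."""
    rels = list(card)
    # best[subset] = (result size, accumulated cost, join order)
    best = {frozenset([r]): (card[r], 0.0, [r]) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for r in subset:
                rest = subset - {r}
                size_rest, cost_rest, order = best[rest]
                factor = 1.0
                for other in rest:
                    factor *= sel.get(frozenset({r, other}), 1.0)
                size = size_rest * card[r] * factor
                cost = cost_rest + size       # charge each intermediate result
                if subset not in best or cost < best[subset][1]:
                    best[subset] = (size, cost, order + [r])
    return best[frozenset(rels)][2]

# e.g. best_join_order({"A": 1000, "B": 100, "C": 10},
#                      {frozenset({"A", "B"}): 0.01, frozenset({"B", "C"}): 0.1})
```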

Wednesday, October 8, 2008

Lottery Scheduling: Flexible Proportional-Share Resource Management, Waldspurger & Weihl

Lottery scheduling is a randomized resource allocation mechanism that uses a "ticket" system. It allocates resources to clients in proportion to the number of tickets they have, although due to the randomized nature of the scheduling, this ratio is not always spot-on. Clients can transfer their tickets back and forth explicitly, or individual clients can take advantage of "ticket inflation," that is, making more tickets for themselves. The latter sounds like a poor idea in the general case, as it only works with mutually trusting clients.

Clients that do not consume all of their allocated time slots are given "compensation tickets" which increase their chances of gaining access to resources and hence keep the allocation ratios accurate. The authors determined that the lottery scheduling kernel ran comparably to the unmodified Mach kernel for the Dhrystone benchmark, but that lottery scheduling ran a little faster than the unmodified kernel for a multithreaded database server, possibly due to better caching because the Mach kernel only uses round-robin scheduling.
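
A hedged sketch of the basic draw (my own illustration; ticket transfers, inflation, and compensation tickets are omitted): the winner of a uniformly random draw over all tickets gets the next quantum, so the expected share is proportional to ticket count.

```python
import random

def hold_lottery(clients):
    """clients: {name: ticket_count}; returns the winning client."""
    total = sum(clients.values())
    winning_ticket = random.randrange(total)
    for name, tickets in clients.items():
        if winning_ticket < tickets:
            return name
        winning_ticket -= tickets

shares = {"A": 0, "B": 0}
for _ in range(10000):
    shares[hold_lottery({"A": 3, "B": 1})] += 1
print(shares)   # roughly 3:1 in A's favor
```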

Sunday, October 5, 2008

Concurrency Control Performance Modeling: Alternatives and Implications, Agrawal, Carey, & Livny

This paper examines three methods of concurrency control, blocking, an optimistic algorithm, and immediate-restart, in an attempt to determine which is ultimately the best. A number of experiments were performed, varying factors such as the number of available resources (CPUs and disks), restart delays, external and internal think times, no-lock-upgrades (placing a write lock on a to-be-updated object instead of starting it off with a read lock and upgrading later), and fake restarts (replacing a restarted transaction with a different, independent one). The throughput, utilization, and conflict ratios at different multiprogramming levels were examined and compared for each set of experiments.

Each set of experiments produced a different winner, but the authors were able to draw a few conclusions from this seemingly mixed bag of results. Each of the three techniques increased throughput with fake restarts; blocking did slightly better than the other two with one resource, while the optimistic approach did better with infinite resources. The no-lock-upgrade policy increased throughput for the immediate-restart algorithm in both the infinite and limited resource cases, whereas blocking suffered from this policy with infinite resources and gained throughput with one resource. In general, blocking trumps the restart techniques in a system with medium to high resource utilization, while the optimistic algorithm might be well suited to a system with a large number of resources and a large amount of intratransaction think time; the authors admit, though, that such a system is probably undesirable because of its lack of cost-effectiveness, and that absent such a system, blocking is the best form of concurrency control.

Wednesday, October 1, 2008

On Optimistic Methods for Concurrency Control, Kung & Robinson

A lock-free approach to concurrency control is presented as a solution to certain disadvantages associated with locking. One of the biggest premises of this "optimistic" approach is that locking is rarely needed in certain applications, the only concrete examples being concurrent B-tree insertions and query-dominant systems. The approach removes all restrictions from reads and instead places restrictions on writes. It breaks transactions into read, validation, and (optional) write phases. Changes from writes are first made on copies of nodes, and the changes are swapped into the global nodes only after successful validation.

Upon failure to validate, the transaction is simply rolled back to start again. However, upon numerous failed validations, the fall-back is essentially to write-lock the entire database until the transaction runs to completion. The given technique of serial validation with transaction numbers only holds for one CPU, and things get a little messier for multiple processors, involving multiple stages of validation. This paper presented two applications in which the "optimistic" strategy might perform better than locking but doesn't offer any new insights into figuring out when the optimistic solution is preferable to a locking one.
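
A hedged sketch of serial backward validation (my own simplification of the paper's transaction-number scheme): a validating transaction conflicts if some transaction that committed after it started wrote an object it read.

```python
class Transaction:
    def __init__(self, start_tn):
        self.start_tn = start_tn      # highest transaction number at start
        self.read_set = set()
        self.write_set = set()

committed = []   # list of (transaction_number, write_set) for finished writers

def validate(t, next_tn):
    """Serial validation: t is valid only if no transaction that committed
    after t's read phase began wrote an object that t read."""
    for tn, writes in committed:
        if tn > t.start_tn and writes & t.read_set:
            return None                          # conflict: roll back and retry
    committed.append((next_tn, frozenset(t.write_set)))
    return next_tn                               # t's writes may now be installed
```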

Granularity of Locks and Degrees of Consistency in a Shared Data Base

[Man, this paper is so ghetto. Seriously, you should see it.]

This paper discusses the different granularities of locking and consistency. It presents the tradeoffs between the concurrency and overhead determined by the lockable unit, namely that a smaller unit allows greater concurrency but also requires more overhead. It discusses how these different units fit into locking hierarchies and DAGs along with different access modes (S, X, IX, IS, and SIX). It puts a lot of emphasis on the compatibilities of the access modes and also on how to propagate these modes to ancestors and descendants to achieve desired locking behavior. The authors do a really good job of providing step-by-step examples of how the locking protocols would be carried out in practice.
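
A small sketch of the compatibility check (the matrix follows the paper's table for IS, IX, S, SIX, and X; the code around it is my own):

```python
# Compatibility is symmetric, so one direction of lookup suffices.
COMPATIBLE_WITH = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

def can_grant(requested, held_modes):
    """A request is granted only if it is compatible with every mode held."""
    return all(held in COMPATIBLE_WITH[requested] for held in held_modes)

# To update one record, a transaction takes IX on the database and the file,
# then X on the record; a concurrent scan holding S on the file blocks it.
print(can_grant("IX", ["S"]))    # False
print(can_grant("IS", ["IX"]))   # True
```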

Consistency is ranked from degree 0 up to 3, with higher degrees of consistency coming at the cost of lower concurrency and higher overhead. All transactions abide by degree 0 and do not overwrite the dirty data of other transactions, and each increasing degree respectively adds the following conditions: the transaction doesn't commit writes before the end of the transaction, the transaction doesn't read the dirty data of other transactions, and other transactions do not dirty the data read by the transaction before it completes. The paper compares its locking hierarchies and consistency degrees to those of IMS/VS and DMS 1100; in particular, IMS/VS doesn't support the notion of S or SIX locking and hence doesn't require different intention locks, and DMS 1100 doesn't do any implicit locking.