This paper discusses MACAW, a media access protocol for wireless LANs based off of MACA meant for use at Xerox PARC. Xerox PARC had a setup consisting of "base stations" connected via Ethernet and personal mobile "pad" computing devices. The authors observe that collision occurs at receivers, not senders, and hence, CSMA is inappropriate for the wireless network because it only takes into account senders' transmitting.
A major building block of MACA is a link-layer equivalent of the 3 way handshake: RTS, CTS, and DATA. Nodes hearing these requests hold off from transmitting so as not to interfere. Symmetry between nodes also plays big role: if a node can hear an RTS but not a CTS, it's within range of the sender but not the receiver and is free to transmit as it pleases because the receiver can't hear it as noise. If a node can hear the CTS but not RTS, it needs to hold off on transmitting for the receiver's sake.
MACA employs binary exponential backoff, which the authors deem an unfair backoff strategy because the more a sender backs off, the less likely it'll gain access to the channel, since a non backed-off sender will probabilistically keep having control. The authors propose to remedy this by setting all senders' backoffs to the same value after a successful transmission. This "copy" strategy is accomplished by adding a field to the packet headers that list the backoff; when a station hears a packet, it copies the packet's backoff value as its own, so all stations within range of each other have the same backoff value. Additionally, the authors implement multiplicative (~1.5) increase, linear decrease for the backoff scheme to prevent large, sudden oscillations in the backoff values. Copying seems like it could be a problem in that one value will copy itself across multiple groupings of bases/pads in which drastically different backoff values are called for.
MACAW allocates bandwidth on a per-stream basis rather than per station, using a queue per stream and per-stream backoff. It also concludes its RTS-CTS-DATA sequence with an ACK packet so that senders don't have to waste time waiting for TCP timeouts before retransmitting data. Overhead from this ack seems to be made up for when loss rates are greater than 1 in 1000 packets. Also, when the receiver is transmitting data already, it doesn't matter if the sender can send without interfering elsewhere; plus, if the sender were to attempt to transmit to a busy receiver, it would have to increment backoff, so the sender might as well wait. The way for a sender to know its potential receiver is busy is through the addition of a data-sending packet (DS) that comes between CTS and DATA.
However, there are still more problems to take care of, to which the authors propose yet another control packet (the RRTS), but it's not enough to cover all the open problems including multicast. In comparison to MACA, MACAW gets better throughput when the layout of senders and receivers creates lots of potential collisions, but it almost feels like MACAW employs a lot of hacks to get around corner cases noted by use of MACA and that any solution the authors create just opens the door to more problems needing solved.
Monday, September 28, 2009
Sunday, September 27, 2009
Understanding TCP Incast Throughput Collapse in Datacenter Networks; Chen, Griffith, Liu, Katz, & Joseph
This paper examines TCP Incast in a datacenter environment and attempts to come up with an analytical model to understand it. The authors take an experimental approach to better understand the phenomenon by making a few modifications (such as changing minimum RTO and randomizing backoff) to see if any prevent incast collapse and then moving on from there. They found that the biggest improvements came from minimizing the minimum RTO.
This work is based on the original TCP Incast paper from CMU and used a workload such that each server sends a fixed amount of data; however, the CMU group came out with a new workload in which there is a constant amount of data to send split between some changing number of senders, meaning that the more senders there are, the less each sender actually has to send. However, large numbers of senders with this new workload could hide incast behavior because each sender only ends up sending a few packets. So this paper's authors used the original workload setup of a fixed amount of data per sender.
Some of the results obtained in this study differ from those made by the original CMU group (regarding the importance of high-res timers and turning off delay-acks), and the authors presume that this is due to the underlying network. The authors use both a small scale cluster and an experimental testbed setup, but my complaint is that they do not list how many machines are actually in these systems (though it appears to be at least 20 based on later graphs).
A number of experiments were run with varying RTO minimum values, with a few noteworthy results. There appear to be 3 phases to the incast problem: collapse, goodput recovery increase, and goodput decrease. Large numbers of timeouts occur with this initial phase, but then there are few timeout events as senders ramp up. However, at some point the timing of sending packets surpasses the RTO such that timeouts happen frequently, causing interference between senders retransmitting after a timeout and those waiting for a timeout. Hence, the third phase will happen after a longer time post-collapse for higher RTO minimum values.
Smaller RTO minimum values equate to shorter "goodput recovery" time, that is, time required to hit the maximum goodput. The time at which collapse turns into goodput recovery increase is the same regardless of the minimum RTO. While the CMU paper made much ado about using high resolution timers, it appears that the optimal setting is actually a low resolution RTO of 1ms paired with delayed acks turned on. An important difference to note between the two studies, however, is that the average observed RTT in this paper was 2ms, while RTTs in the CMU study were along the lines of 100s of microseconds. With that in mind, a starting RTO value of 1ms is actually shorter than the expected RTT.
The optimality of delayed acks was surprising since it could in theory slow down goodput, but a suggestion put forth by the authors is that not using delay-acks can over-drive the congestion window. The graphs used to justify this theory were somewhat confusing; I wasn't quite sure what each dot on the graph corresponded to, as the delay-acks graph has a significantly larger number of dots on it.
This work is based on the original TCP Incast paper from CMU and used a workload such that each server sends a fixed amount of data; however, the CMU group came out with a new workload in which there is a constant amount of data to send split between some changing number of senders, meaning that the more senders there are, the less each sender actually has to send. However, large numbers of senders with this new workload could hide incast behavior because each sender only ends up sending a few packets. So this paper's authors used the original workload setup of a fixed amount of data per sender.
Some of the results obtained in this study differ from those made by the original CMU group (regarding the importance of high-res timers and turning off delay-acks), and the authors presume that this is due to the underlying network. The authors use both a small scale cluster and an experimental testbed setup, but my complaint is that they do not list how many machines are actually in these systems (though it appears to be at least 20 based on later graphs).
A number of experiments were run with varying RTO minimum values, with a few noteworthy results. There appear to be 3 phases to the incast problem: collapse, goodput recovery increase, and goodput decrease. Large numbers of timeouts occur with this initial phase, but then there are few timeout events as senders ramp up. However, at some point the timing of sending packets surpasses the RTO such that timeouts happen frequently, causing interference between senders retransmitting after a timeout and those waiting for a timeout. Hence, the third phase will happen after a longer time post-collapse for higher RTO minimum values.
Smaller RTO minimum values equate to shorter "goodput recovery" time, that is, time required to hit the maximum goodput. The time at which collapse turns into goodput recovery increase is the same regardless of the minimum RTO. While the CMU paper made much ado about using high resolution timers, it appears that the optimal setting is actually a low resolution RTO of 1ms paired with delayed acks turned on. An important difference to note between the two studies, however, is that the average observed RTT in this paper was 2ms, while RTTs in the CMU study were along the lines of 100s of microseconds. With that in mind, a starting RTO value of 1ms is actually shorter than the expected RTT.
The optimality of delayed acks was surprising since it could in theory slow down goodput, but a suggestion put forth by the authors is that not using delay-acks can over-drive the congestion window. The graphs used to justify this theory were somewhat confusing; I wasn't quite sure what each dot on the graph corresponded to, as the delay-acks graph has a significantly larger number of dots on it.
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication; Vasudevan et al.
This paper discusses TCP incast collapse in which throughput in datacenter environments drops significantly due to simultaneous sending and backoff attempts of servers. For the problem to occur, multiple senders must communicate with a single receiver in a high bandwidth, low delay network, causing switch buffers to overflow which in turn causes timeouts.
In a datacenter environment where RTTs tend to be on the order of 10s or 100s or microseconds, 100+ millisecond timeouts are 1000s of times larger than the RTT. Incast is particularly a problem for barrier-synchronized workloads because even just one server experiencing these disproportionately long timeouts can delay the entire job. However, as the paper notes this timeout-to-RTT ratio in datacenters is a problem for any latency-sensitive workload even when incast is not occurring.
The authors study the problem in both simulation and in two different sized clusters and examine data for RTTs in the real world, though some of the conclusions and configuration choices the authors make seem questionable or tenuous at best. For one thing, the RTT distribution obtained from Los Alamos has around 20% of RTTs pegged at 100 microseconds or less, which the authors claim shows that "networks... operate in very low-latency environments," though in reality at least 50% of the RTTs fall in between 300 and 600 microseconds, meaning that the minimum obtainable RTO min of 5ms would be perfectly acceptable without worrying about high resolution timers.
In order to scale up to 10Gbps links, the authors set blocksize at 80MB so that any flow has the ability to saturate a link; I'm not sure that this is a reasonable choice to make. In their comparison of throughput for RTO min values of 200ms and 200us, the authors ignore all flows smaller than a certain threshold, and I would definitely be curious to see how those flows differed for the two scenarios.
The authors also believe that delayed acks could cause problems for the network due to the fact that their delay timer is set to 40ms, much larger than the proposed new RTO min value. Another thing to bear in mind is that any changes made to dealing with the RTO such as randomizing it by adding delay could actually harm performance in any non-datacenter environment because flows will not have homogeneous RTTs.
In a datacenter environment where RTTs tend to be on the order of 10s or 100s or microseconds, 100+ millisecond timeouts are 1000s of times larger than the RTT. Incast is particularly a problem for barrier-synchronized workloads because even just one server experiencing these disproportionately long timeouts can delay the entire job. However, as the paper notes this timeout-to-RTT ratio in datacenters is a problem for any latency-sensitive workload even when incast is not occurring.
The authors study the problem in both simulation and in two different sized clusters and examine data for RTTs in the real world, though some of the conclusions and configuration choices the authors make seem questionable or tenuous at best. For one thing, the RTT distribution obtained from Los Alamos has around 20% of RTTs pegged at 100 microseconds or less, which the authors claim shows that "networks... operate in very low-latency environments," though in reality at least 50% of the RTTs fall in between 300 and 600 microseconds, meaning that the minimum obtainable RTO min of 5ms would be perfectly acceptable without worrying about high resolution timers.
In order to scale up to 10Gbps links, the authors set blocksize at 80MB so that any flow has the ability to saturate a link; I'm not sure that this is a reasonable choice to make. In their comparison of throughput for RTO min values of 200ms and 200us, the authors ignore all flows smaller than a certain threshold, and I would definitely be curious to see how those flows differed for the two scenarios.
The authors also believe that delayed acks could cause problems for the network due to the fact that their delay timer is set to 40ms, much larger than the proposed new RTO min value. Another thing to bear in mind is that any changes made to dealing with the RTO such as randomizing it by adding delay could actually harm performance in any non-datacenter environment because flows will not have homogeneous RTTs.
Monday, September 21, 2009
VL2: A Scalable and Flexible Data Center Network; Greenberg et. al.
VL2 is another network architecture striving to tackle the problem of providing Ethernet semantics for increasingly larger and more unwieldy networks. VL2 is set apart from other attempts such as PortLand and SEATTLE by its dedication to achieving agility, "the capacity to assign any server to any service," a goal which the authors can break down as 3 objectives: uniform high traffic flow between any 2 servers, isolation of effects on traffic from services, and Layer-2 properties. Modifications are made to the end hosts, not to the underlying network itself, and consists of currently existing technology (ECMP forwarding, IP any/multicasting, link-state routing).
VL2 takes an approach similar to PortLand that involves separating names from locations, with each switches and servers using IP addresses as location-specific addresses (LA) and applications having their own application-specific addresses (AA) that do not change, allowing continued service in the face of virtual machine migration. A smaller cluster of directory servers store lookup information for AA to LA. Each server is outfitted with a VL2 agent that intercepts its host's packets and either queries the directory for an AA-to-LA mapping or provides one from its mappings cache and send the packet to the TOR with the corresponding LA.
VL2 was created specifically with data centers in mind and uses a Clos network, whose inherent layout provides redundancy to combat component failures. Targeting a specific realm instead of general networks seems to be a good path to take in the networking world where there are always tradeoffs for networks with different needs and purposes, and VL2 seems to be no exception. As data centers are already designed with special optimizations (e.g., large chunk sizes to amortize disk seeks), it makes sense that a network architecture could be developed with their performance needs in mind as well.
Unfortunately, it appears that traffic patterns are often very random with given paths being very short-lived, due to the practice of randomly distributing files and workloads over the data center, so it's more difficult to try to come up with traffic flow enhancements. VL2 pairs Valiant Load Balancing (VLB) with TCP to achieve fairness and high utilization. VLB is realized by randomly selecting the intermediate switch through which to send traffic, and TCP is required to deal with the scenario that different-sized flows aren't distributed well (causing congestion at some links). VLB's randomization also helps provide isolation, so applications that would otherwise flood the network have their loads spread evenly across the network instead. Valiant Load-Balancing also uses randomization to determine paths, which does not lead to best-case performance and would work better for supporting general applications.
The evaluation was run on an actual prototype setup, though as a workload, the authors examined a worst-case scenario: all-to-all shuffle. VL2 does seem to provide fairness, isolation, and high goodput, but perhaps more promising is that for the same cost as an over-subscribed traditional network, this Clos network can be built without over-subscription.
VL2 takes an approach similar to PortLand that involves separating names from locations, with each switches and servers using IP addresses as location-specific addresses (LA) and applications having their own application-specific addresses (AA) that do not change, allowing continued service in the face of virtual machine migration. A smaller cluster of directory servers store lookup information for AA to LA. Each server is outfitted with a VL2 agent that intercepts its host's packets and either queries the directory for an AA-to-LA mapping or provides one from its mappings cache and send the packet to the TOR with the corresponding LA.
VL2 was created specifically with data centers in mind and uses a Clos network, whose inherent layout provides redundancy to combat component failures. Targeting a specific realm instead of general networks seems to be a good path to take in the networking world where there are always tradeoffs for networks with different needs and purposes, and VL2 seems to be no exception. As data centers are already designed with special optimizations (e.g., large chunk sizes to amortize disk seeks), it makes sense that a network architecture could be developed with their performance needs in mind as well.
Unfortunately, it appears that traffic patterns are often very random with given paths being very short-lived, due to the practice of randomly distributing files and workloads over the data center, so it's more difficult to try to come up with traffic flow enhancements. VL2 pairs Valiant Load Balancing (VLB) with TCP to achieve fairness and high utilization. VLB is realized by randomly selecting the intermediate switch through which to send traffic, and TCP is required to deal with the scenario that different-sized flows aren't distributed well (causing congestion at some links). VLB's randomization also helps provide isolation, so applications that would otherwise flood the network have their loads spread evenly across the network instead. Valiant Load-Balancing also uses randomization to determine paths, which does not lead to best-case performance and would work better for supporting general applications.
The evaluation was run on an actual prototype setup, though as a workload, the authors examined a worst-case scenario: all-to-all shuffle. VL2 does seem to provide fairness, isolation, and high goodput, but perhaps more promising is that for the same cost as an over-subscribed traditional network, this Clos network can be built without over-subscription.
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
This paper discusses PortLand, an Ethernet-compatible set of routing, forwarding, and address resolution protocols intended for use in data centers. Data centers are scaling up at enormous rates and hence no longer play well with protocols intended for general LANs. To that end, the authors lay out some goals for this new protocol to meet: mobility, self-configuration, loop-free connectivity, and failure detection.
PortLand is specialized to a fat tree network topology. Having this specific layout in mind allows PortLand to use a hierarchical addressing scheme to encode location information as a pseudo-MAC address (PMAC). PortLand also uses name information in addition to location and stays with the tradition of using flat addressing (based on MACs), adding the notion of a PMAC address on top of that, which encodes the pod grouping of a host.
A centralized fabric manager maintains network configuration information and helps resolve IP-to-PMAC mappings. The danger in having this important function centralized is how FM failure is handled, though presumably it would be replicated to or spread across multiple nodes. Generally this setup prevents mass broadcasts of ARP packets, as ARPs are intercepted by edge switches and instead used to query the fabric manager, which only resorts to broadcast when it doesn't have the IP-to-PMAC mapping cached. Even more important is the amount of computation required of the FM when the number of ARPs sent out per host increases; up to 70 cores worth of required CPU are projected for large numbers of hosts each sending 100 ARPs/sec. As data centers are scaling out very quickly, it makes sense to watch out for this computation overload for the FM and perhaps develop it as a cluster earlier rather than later.
PortLand uses a Location Discovery Protocol to self-configure; switches send out Location Discovery Messages to one another, which allows them to infer information about their location in the tree based on messages they receive in certain ports. A very cool property about this protocol paired with the known fat tree structure is that it can actually pick out certain failure conditions like miswiring.
PortLand is specialized to a fat tree network topology. Having this specific layout in mind allows PortLand to use a hierarchical addressing scheme to encode location information as a pseudo-MAC address (PMAC). PortLand also uses name information in addition to location and stays with the tradition of using flat addressing (based on MACs), adding the notion of a PMAC address on top of that, which encodes the pod grouping of a host.
A centralized fabric manager maintains network configuration information and helps resolve IP-to-PMAC mappings. The danger in having this important function centralized is how FM failure is handled, though presumably it would be replicated to or spread across multiple nodes. Generally this setup prevents mass broadcasts of ARP packets, as ARPs are intercepted by edge switches and instead used to query the fabric manager, which only resorts to broadcast when it doesn't have the IP-to-PMAC mapping cached. Even more important is the amount of computation required of the FM when the number of ARPs sent out per host increases; up to 70 cores worth of required CPU are projected for large numbers of hosts each sending 100 ARPs/sec. As data centers are scaling out very quickly, it makes sense to watch out for this computation overload for the FM and perhaps develop it as a cluster earlier rather than later.
PortLand uses a Location Discovery Protocol to self-configure; switches send out Location Discovery Messages to one another, which allows them to infer information about their location in the tree based on messages they receive in certain ports. A very cool property about this protocol paired with the known fat tree structure is that it can actually pick out certain failure conditions like miswiring.
Thursday, September 17, 2009
Detailed Diagnosis in Enterprise Networks; Kandula et. al
The authors of this paper conducted a study of error logs within a small enterprise network. Based on a sample of these logs, they came to the conclusion that the majority of faults within a network impact applications (instead of whole machines) typically because of some configuration issue. Disturbingly, the causes of 25% of problems were unknown yet solvable by rebooting the system, casting more of a "black magic" veil on networking. Their goal then is to create a system called NetMedic that can diagnose, to the process level, culprits of network problems and probable underlying causes in as much detail as possible without relying on application-specific knowledge.
NetMedic models the network as a dependency graph with edges between dependent components. It employs a set of techniques that allow a process to keep track of its own configuration state, the behavior of peers it's in communication with, and the state of its host machine. Diagnosis of errors requires 3 steps: determining which pieces of the network are abnormal (statistically, based on previous values through time), computing edge weights for the dependency graph, and computing path weights to decide on the likely error source (with largest path weights being likely causes). Interestingly, they don't consider the possibility that a normally behaving component could be the cause of other problems; I guess it's probably the case that problems occur when things (like configuration files) change and not just out of the blue, so this assumption is probably okay for catching most errors.
NetMedic monitors both the Windows software counters as well as the Windows registry and configuration files. However, if half of the causes are due to configuration files, I wonder how important the software counters actually are to diagnosing problems. I thought it was admirable that because they couldn't instrument actual servers that are in use, the authors just started up their own and inject faults at random.
NetMedic models the network as a dependency graph with edges between dependent components. It employs a set of techniques that allow a process to keep track of its own configuration state, the behavior of peers it's in communication with, and the state of its host machine. Diagnosis of errors requires 3 steps: determining which pieces of the network are abnormal (statistically, based on previous values through time), computing edge weights for the dependency graph, and computing path weights to decide on the likely error source (with largest path weights being likely causes). Interestingly, they don't consider the possibility that a normally behaving component could be the cause of other problems; I guess it's probably the case that problems occur when things (like configuration files) change and not just out of the blue, so this assumption is probably okay for catching most errors.
NetMedic monitors both the Windows software counters as well as the Windows registry and configuration files. However, if half of the causes are due to configuration files, I wonder how important the software counters actually are to diagnosing problems. I thought it was admirable that because they couldn't instrument actual servers that are in use, the authors just started up their own and inject faults at random.
Wednesday, September 16, 2009
Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises; Kim, Caesar, & Rexford
SEATTLE is an alternative architecture meant to combat some of the inherent problems associated with Ethernet. Using Ethernet, the control overhead in a network is proportional to the number of hosts in the network and requires the use of spanning tree for paths. Problems address then include large forwarding table size and control overhead, uneven link utilization due to root bridges and route spanning trees, poor throughput because of said spanning tree, and frequent flooding. One possible solution is to use a hybrid approach of connecting LANs via IP, each of which have IP subnets of up to a few hundred hosts. However, initial configuration is time-consuming, reconfiguring is required whenever the network design changes, and the ability of a host to move and still maintain its IP address isn't supported easily. Even using VLANs to allow mobility still has the problems of manual configuration and poor utilization of links due to a spanning tree.
To address all these problems, the authors present SEATTLE. SEATTLE has overhead and forwarding tables that are proportional to the number of subnets, not hosts. It uses a flat addressing scheme a la Ethernet, forwarding packets based on MAC addresses, but is scalable to large networks, as it uses a link-state protocol but only maintains switch information which is subject to less change and hence fewer control messages going out on the network. SEATTLE uses a DHT to lookup host information, which actually lengthens the path for at least the first traversal of a flow's path. Switches have the option of caching the host information themselves for forwarding based on shortest paths at the risk of ending up with a large, unwieldy forwarding table.
In case of network changes, SEATTLE attempts to avoid flooding whenever possible, choosing instead to unicast updates to other switches and even forward misdelivered packets on behalf of those other switches. SEATTLE employs a somewhat lazy approach in correcting these routing table entries.
In evaluating their design, the authors strove to use real data, an endeavor at which they partly succeeded, at least using a mix of real traces with artificially enhanced data. Running each experiment 25 times definitely seems like overkill to me; would the experiments differ much beyond 3 to 5 runs? Granted, the root bridge was randomly selected in these runs, but if network behavior would be that differently affected each time, that seems like a flaw in the design: not being able to predict the network behavior makes simple performance predictions near impossible, but maybe it's not any easier to do that with Ethernet.
There is a possibility of having unduly lengthened paths when a switch evicts a cached entry before its locally-attached hosts are done using the entry, in which case the switch forwards the packets to a resolver switch instead of queueing them itself. This scenario can be prevented by having a long timeout to eviction, with the drawback that the forwarding table could accumulate a very large number of entries.
SEATTLE addresses the biggest problems with Ethernet and hybrids involving it: use of a spanning tree for paths, frequent flooding, and required manual configuration for hybrid and VLAN schemes. However, I'm not convinced of the true need for SEATTLE; the only actual performance graphs included show control overhead, table size, and packet loss in the case of switch failures and moving hosts. The packet loss goes up to 30% as switch failure rate increases to 1 and only up to 5% for 200 hosts moving per second, which seems unreasonably high to me.
To address all these problems, the authors present SEATTLE. SEATTLE has overhead and forwarding tables that are proportional to the number of subnets, not hosts. It uses a flat addressing scheme a la Ethernet, forwarding packets based on MAC addresses, but is scalable to large networks, as it uses a link-state protocol but only maintains switch information which is subject to less change and hence fewer control messages going out on the network. SEATTLE uses a DHT to lookup host information, which actually lengthens the path for at least the first traversal of a flow's path. Switches have the option of caching the host information themselves for forwarding based on shortest paths at the risk of ending up with a large, unwieldy forwarding table.
In case of network changes, SEATTLE attempts to avoid flooding whenever possible, choosing instead to unicast updates to other switches and even forward misdelivered packets on behalf of those other switches. SEATTLE employs a somewhat lazy approach in correcting these routing table entries.
In evaluating their design, the authors strove to use real data, an endeavor at which they partly succeeded, at least using a mix of real traces with artificially enhanced data. Running each experiment 25 times definitely seems like overkill to me; would the experiments differ much beyond 3 to 5 runs? Granted, the root bridge was randomly selected in these runs, but if network behavior would be that differently affected each time, that seems like a flaw in the design: not being able to predict the network behavior makes simple performance predictions near impossible, but maybe it's not any easier to do that with Ethernet.
There is a possibility of having unduly lengthened paths when a switch evicts a cached entry before its locally-attached hosts are done using the entry, in which case the switch forwards the packets to a resolver switch instead of queueing them itself. This scenario can be prevented by having a long timeout to eviction, with the drawback that the forwarding table could accumulate a very large number of entries.
SEATTLE addresses the biggest problems with Ethernet and hybrids involving it: use of a spanning tree for paths, frequent flooding, and required manual configuration for hybrid and VLAN schemes. However, I'm not convinced of the true need for SEATTLE; the only actual performance graphs included show control overhead, table size, and packet loss in the case of switch failures and moving hosts. The packet loss goes up to 30% as switch failure rate increases to 1 and only up to 5% for 200 hosts moving per second, which seems unreasonably high to me.
Monday, September 14, 2009
Congestion Control for High Bandwidth-Delay Product Networks; Katabi, Handley, & Rohrs
This paper sets out to re-create congestion control from scratch. A big point made by the authors is that congestion not binary, so why should feedback to the endhosts be? Every packet has a congestion header which has a feedback field that routers along the path modify to let the sender know how much to change its window size. An ack is sent from the receiver that contains the feedback value.
An interesting decoupling the authors choose is to separate efficiency from fairness. While this may seem somewhat strange in that they both have to do with how a link is utilized, the paper's argument that the number of flows is constantly changing is a valid one. An efficiency controller attempts to allocate all bandwidth on the level of bytes through the use of a feedback function which takes into account spare bandwidth and persistent queue size as well as roundtrip time. The efficiency controller doesn't concern itself with which senders get the feedback just as long as the total result is more efficient bandwidth usage; the fairness controller does the allocation.
Based on the paper's simulations, XCP drops very few packets and attains high utilization of bottleneck resources; the authors chose maintaining a small queue size to be of higher importance than obtaining highest possible utilization. It also handles sudden changes in traffic much better than TCP does. However, XCP does not inherently have protections against misbehaving users, but because of the additional information stored in the congestion headers, if a router is willing to keep some per-connection state, it can check to ensure a user is actually responding to given feedback.
An interesting decoupling the authors choose is to separate efficiency from fairness. While this may seem somewhat strange in that they both have to do with how a link is utilized, the paper's argument that the number of flows is constantly changing is a valid one. An efficiency controller attempts to allocate all bandwidth on the level of bytes through the use of a feedback function which takes into account spare bandwidth and persistent queue size as well as roundtrip time. The efficiency controller doesn't concern itself with which senders get the feedback just as long as the total result is more efficient bandwidth usage; the fairness controller does the allocation.
Based on the paper's simulations, XCP drops very few packets and attains high utilization of bottleneck resources; the authors chose maintaining a small queue size to be of higher importance than obtaining highest possible utilization. It also handles sudden changes in traffic much better than TCP does. However, XCP does not inherently have protections against misbehaving users, but because of the additional information stored in the congestion headers, if a router is willing to keep some per-connection state, it can check to ensure a user is actually responding to given feedback.
Random Early Detection Gateways for Congestion Avoidance; Floyd & Jacobson
This paper describes the Random Early Detection (RED) scheme to detect congestion in gateways. To attain low latency and high throughput, RED attempts to keep the average queue length short while still permitting larger instantaneous queue sizes. It also strives to prevent global synchronization in which multiple connections decrease their window size and resume slow start at one time.
It uses an underlying FIFO and keeps track of the average queue size with a weighted average. It also has minimum and maximum thresholds such that when the average queue size is less than the minimum, packets are free to pass through without interference. When the average queue size is greater than the maximum threshold, every incoming packet is marked (and presumably dropped). In between the minimum and maximum threshold lengths, packets are marked probabilistically based on their associated connection's share of bandwidth. RED can handle bias and burstiness because its average queue length can be weighted higher to give recent traffic more importance or weighted lower to allow bursts of traffic.
RED uses a uniform probability distribution to decide which packets to be marked in the situations when the average queue size is between the min and max. The marking of packets pays no heed to which connections are being marked, though presumably the number of dropped packets per connection should reflect the connection's sent packets; this idea fails to control misbehaving users in the case of RED simply marking and not dropping the packets, as the congestion control is then left up to the ends.
Additionally, RED's reducing traffic by dropping packets relies on the use of TCP, or at least some protocol in which the source realizes a packet was dropped. Furthermore, the size of the packet isn't taken into account, which could potentially lead to a situation in which lots of small packets are dropped but there's still a large number of bytes to be pushed across a link. In general, RED laid out a good, flexible framework that respects different types of traffic and attempts to prevent global synchs and dropping of all packets at a gateway, but it has a number of issues that need to be resolved.
It uses an underlying FIFO and keeps track of the average queue size with a weighted average. It also has minimum and maximum thresholds such that when the average queue size is less than the minimum, packets are free to pass through without interference. When the average queue size is greater than the maximum threshold, every incoming packet is marked (and presumably dropped). In between the minimum and maximum threshold lengths, packets are marked probabilistically based on their associated connection's share of bandwidth. RED can handle bias and burstiness because its average queue length can be weighted higher to give recent traffic more importance or weighted lower to allow bursts of traffic.
RED uses a uniform probability distribution to decide which packets to be marked in the situations when the average queue size is between the min and max. The marking of packets pays no heed to which connections are being marked, though presumably the number of dropped packets per connection should reflect the connection's sent packets; this idea fails to control misbehaving users in the case of RED simply marking and not dropping the packets, as the congestion control is then left up to the ends.
Additionally, RED's reducing traffic by dropping packets relies on the use of TCP, or at least some protocol in which the source realizes a packet was dropped. Furthermore, the size of the packet isn't taken into account, which could potentially lead to a situation in which lots of small packets are dropped but there's still a large number of bytes to be pushed across a link. In general, RED laid out a good, flexible framework that respects different types of traffic and attempts to prevent global synchs and dropping of all packets at a gateway, but it has a number of issues that need to be resolved.
Thursday, September 10, 2009
Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks; Stoica, Shenker, & Zhang
This paper starts out with the conjectures that fair allocation is very important and/or necessary to congestion control and that current fair allocation implementations are too complex to become widespread. As such, they present an algorithm to give approximately fair bandwidth allocation in which edge routers in some contiguous set of "island" routers calculate per-flow rates and encode flow rate into packet headers, information which can then be used by "island" core nodes without the core nodes needing to maintain any state.
They also simplify the fair queueing mechanism by not separating scheduling and buffering out per flow, opting instead for a FIFO and probabilistic packet dropping, in which a flow's drop probability is based on how much more traffic than its fair share it's trying to send. The authors also describe a method of giving different weights to different flows grouped by islands.
They then run a number of simulations in ns-2 that compares CSFQ against a number of queueing algorithms: FIFO and RED (no per-flow separation) and FRED and DRR (which do allocate on a per-flow basis). These simulations include mixes of UDP and TCP flows, some of which send packets at much larger rates than their fair share, layered multicast traffic, and large propagation delays. CSFQ outperforms FIFO and RED, which do not attempt to fairly allocate, and performs on par with FRED but without requiring any special per-flow treatment. It does not perform as well as DRR, but the authors are optimistic about improvements that can be made to it.
Overall this sounds like a good algorithm for achieving approximate fair allocation, with the real savings being in the lack of state required in all the routers. Since the paper was an overview of a proposed architecture, there weren't huge amounts of detail about how large "islands" should be and hence how much complexity would be cut by having stateless core routers. Also, high-speed packet classification is still an actively-researched problem, and one that the authors would benefit from having a good solution for, as their algorithm requires classifying to flows.
They also simplify the fair queueing mechanism by not separating scheduling and buffering out per flow, opting instead for a FIFO and probabilistic packet dropping, in which a flow's drop probability is based on how much more traffic than its fair share it's trying to send. The authors also describe a method of giving different weights to different flows grouped by islands.
They then run a number of simulations in ns-2 that compares CSFQ against a number of queueing algorithms: FIFO and RED (no per-flow separation) and FRED and DRR (which do allocate on a per-flow basis). These simulations include mixes of UDP and TCP flows, some of which send packets at much larger rates than their fair share, layered multicast traffic, and large propagation delays. CSFQ outperforms FIFO and RED, which do not attempt to fairly allocate, and performs on par with FRED but without requiring any special per-flow treatment. It does not perform as well as DRR, but the authors are optimistic about improvements that can be made to it.
Overall this sounds like a good algorithm for achieving approximate fair allocation, with the real savings being in the lack of state required in all the routers. Since the paper was an overview of a proposed architecture, there weren't huge amounts of detail about how large "islands" should be and hence how much complexity would be cut by having stateless core routers. Also, high-speed packet classification is still an actively-researched problem, and one that the authors would benefit from having a good solution for, as their algorithm requires classifying to flows.
Analysis and Simulation of a Fair Queueing Algorithm; Demers, Keshav, & Shenker
This paper sets out to describe a "new" fair queueing algorithm to be implemented at the gateway. (The authors spend copious amounts of time describing other similar work to their algorithm, which makes me question how different their work really is.) Three requirements for this queueing algorithm to fulfill include allocating buffer space and bandwidth fairly amongst a number of users, the ability to control promptness/delay somewhat independently of buffer space/bw, and continuity with the packet-scheduling scheme.
The paper defines fairness based on Jain's max-min fairness, where each user receives the minimum of either their requested amount of a resource or their fair share. It's tough to decide how to actually define a "user;" it could be done on a per-sender, per-receiver, per-sender-receiver pair, or per-process basis, each with possible drawbacks in terms of security and fairness. The authors select sender-receiver pairs (called "conversations") because it doesn't reward bad behavior or penalize a user due to malicious circumstances out of their control.
The algorithm for fair queueing is based on bit-by-bit round-robin. The authors created metrics for start and finish time that can be used to order packet's worth of bits by, and after start and finish times are assigned, we can use those metrics for whole packets without regard to how long the packets actually are. Using these finish times, there is one simple rule the algorithm follows: the next packet to go out has the lowest finish time. A more complicated flavor of this FQ algorithm introduces the concept of bid, a way to give less delay to users who use less than their fair share of BW. When the queue is full and a new packet arrives, simply drop the last packet of the conversation using the most buffer space.
Paired with a fair queueing algorithm, there also needs to be flow control. The progression of flow control over time began with a timeout mechanism and sliding window flow control and progressed to the Jacobson-Karels (JK) which added dynamic window size, and finally to DECbit which allows the gateway to send selective congestion signals to overagressive users. This third approach is an intriguing one, though I wonder if it ever got used (though presumably it did within DEC)?
Simulations combining different combos of flow control and queueing (generic, JK, or DEC flow control with FCFS or FQ): fair queueing needs good flow control to be effective. When ill-behaving users are added to the mix, the paper's FQ does a good job of letting well-behaved hosts' traffic through, minimizing the incentive for a user to misbehave. The simulations used benchmarks, not real workloads, though the benchmarks were meant to show a mix of different scenarios, using as applications Telnet and FTP. A question I have is how easy is FQ to modify to incorporate not just what a user's equal share would be but also how much of a resource something deserves? It seems difficult to come up with good generalized algorithms that can accommodate the different types of users (file servers vs. Telnet, etc.), though FQ seems to generalize well regardless of the type of traffic it sees.
The paper defines fairness based on Jain's max-min fairness, where each user receives the minimum of either their requested amount of a resource or their fair share. It's tough to decide how to actually define a "user;" it could be done on a per-sender, per-receiver, per-sender-receiver pair, or per-process basis, each with possible drawbacks in terms of security and fairness. The authors select sender-receiver pairs (called "conversations") because it doesn't reward bad behavior or penalize a user due to malicious circumstances out of their control.
The algorithm for fair queueing is based on bit-by-bit round-robin. The authors created metrics for start and finish time that can be used to order packet's worth of bits by, and after start and finish times are assigned, we can use those metrics for whole packets without regard to how long the packets actually are. Using these finish times, there is one simple rule the algorithm follows: the next packet to go out has the lowest finish time. A more complicated flavor of this FQ algorithm introduces the concept of bid, a way to give less delay to users who use less than their fair share of BW. When the queue is full and a new packet arrives, simply drop the last packet of the conversation using the most buffer space.
Paired with a fair queueing algorithm, there also needs to be flow control. The progression of flow control over time began with a timeout mechanism and sliding window flow control and progressed to the Jacobson-Karels (JK) which added dynamic window size, and finally to DECbit which allows the gateway to send selective congestion signals to overagressive users. This third approach is an intriguing one, though I wonder if it ever got used (though presumably it did within DEC)?
Simulations combining different combos of flow control and queueing (generic, JK, or DEC flow control with FCFS or FQ): fair queueing needs good flow control to be effective. When ill-behaving users are added to the mix, the paper's FQ does a good job of letting well-behaved hosts' traffic through, minimizing the incentive for a user to misbehave. The simulations used benchmarks, not real workloads, though the benchmarks were meant to show a mix of different scenarios, using as applications Telnet and FTP. A question I have is how easy is FQ to modify to incorporate not just what a user's equal share would be but also how much of a resource something deserves? It seems difficult to come up with good generalized algorithms that can accommodate the different types of users (file servers vs. Telnet, etc.), though FQ seems to generalize well regardless of the type of traffic it sees.
Tuesday, September 8, 2009
Congestion Avoidance and Control; Jacobson & Karels
This paper discusses congestion avoidance/control mechanisms for TCP, including "slow start" and "additive increase, multiplicative decrease," two methods that can be used alone or, as they actually are in practice, together. TCP Slow Start involves the use of "self-clocking," that is, sending out two packets for every ack received by the sender. This strategy performs better over the long run in terms of throughput and prevents multiple retransmissions of packets. An added advantage is that slow start is not a complicated algorithm to implement, and it's also easy to know how big of a buffer is needed at the bottleneck in the network based on window size.
Besides slow start, TCP also uses the congestion avoidance algorithm that is consistent with Jain/Chiu's recommendation: additive increase, multiplicative decrease. I thought this paper did a really nice job of offering the intuition for this combination, more so than the Jain/Chiu paper, which is that the network takes a greater hit from overshooting than from underuse; hence, caution is the best strategy, and the best performance can be had by probing in small increments to find available bandwidth and decreasing window size exponentially to account for all packets that will be queueing up. Using packet loss as a sign of network congestion to trigger the decreases of window size seems pretty reasonable; obviously it would be convenient for the hosts if the underlying network could just send a number back to each host specifying a concrete amount of bandwidth remaining for it to use.
This congestion avoidance is easily confused with slow start; when should the network use which? The appendices show how to combine the two: when a timeout occurs, we must decide which strategy to take. We engage in slow start (starting from a window size of one packet), and while the window size is less than half the size when the timeout occurred, double the window size with slow-start, and then probe incrementally. The intuition for this method is that half the timeout window size was probably a "safe" size within the bounds of our resources, so the network should ramp up to that point and then slowly and carefully probe for the max bandwidth in that range. (Ironically, "slow start" is the faster incrementing method in this case.)
Another change this paper recommends is a modification to the restransmit timer, namely taking into account the round-trip time variance to make a variable timer instead of a hard-coded one. The paper noted that exponential backoffs are the best all-around strategy, but what if we have more specific information about what's going on in the network? There are some other strategies such as "Fast Retransmit" that actually end up ignoring these timers when it becomes apparent that a chunk of packets have been dropped. It's possible in these cases that there are a number of gratuitous retransmits, so in the end there's a tradeoff between unnecessary packets in the network and fast recovery.
This paper is important because it formed the basis for the congestion avoidance used in TCP.
Besides slow start, TCP also uses the congestion avoidance algorithm that is consistent with Jain/Chiu's recommendation: additive increase, multiplicative decrease. I thought this paper did a really nice job of offering the intuition for this combination, more so than the Jain/Chiu paper, which is that the network takes a greater hit from overshooting than from underuse; hence, caution is the best strategy, and the best performance can be had by probing in small increments to find available bandwidth and decreasing window size exponentially to account for all packets that will be queueing up. Using packet loss as a sign of network congestion to trigger the decreases of window size seems pretty reasonable; obviously it would be convenient for the hosts if the underlying network could just send a number back to each host specifying a concrete amount of bandwidth remaining for it to use.
This congestion avoidance is easily confused with slow start; when should the network use which? The appendices show how to combine the two: when a timeout occurs, we must decide which strategy to take. We engage in slow start (starting from a window size of one packet), and while the window size is less than half the size when the timeout occurred, double the window size with slow-start, and then probe incrementally. The intuition for this method is that half the timeout window size was probably a "safe" size within the bounds of our resources, so the network should ramp up to that point and then slowly and carefully probe for the max bandwidth in that range. (Ironically, "slow start" is the faster incrementing method in this case.)
Another change this paper recommends is a modification to the restransmit timer, namely taking into account the round-trip time variance to make a variable timer instead of a hard-coded one. The paper noted that exponential backoffs are the best all-around strategy, but what if we have more specific information about what's going on in the network? There are some other strategies such as "Fast Retransmit" that actually end up ignoring these timers when it becomes apparent that a chunk of packets have been dropped. It's possible in these cases that there are a number of gratuitous retransmits, so in the end there's a tradeoff between unnecessary packets in the network and fast recovery.
This paper is important because it formed the basis for the congestion avoidance used in TCP.
Monday, September 7, 2009
Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks; Chiu & Jain
This paper gave a rigorous analysis of several metrics related to congestion avoidance mechanisms, including efficiency, fairness, distributedness, and convergence. Network efficiency describes how closely the network is operating to the goal "knee" load level where throughput is still increasing slightly but response time is increasing vastly. Fairness is an even sharing of bottleneck resources, distributedness is decentralization of information necessary for hosts to make behavior decisions, and convergence refers more generally to reaching some equilibrium with small oscillations being acceptable.
The authors then examine what happens with these metrics when various combinations of multiplicative and incremental increases and decreases are used for congestion avoidance. Ultimately they find that decreases should be multiplicative and increases should be linearly additive. Interestingly, a host receives binary feedback from the network, and this extra bit of information became pervasively used despite the fact that it feels like a hack to get around the E2E principle.
All different aspects the authors examine lead them to the same "additive increase, multiplicative decrease" conclusion, from quanitifying responsiveness (time to converge) to looking at smoothness (size of oscillations). They show some nice graphs that demonstrate convergence of fairness and efficiency and how one metric might be neglected even as the other converges. The mathematical proofs even allow the authors to predict maximum overshoot of their proposed mechanism. This paper really hammered in the idea that the use of "additive increase, multiplicative decrease" in TCP's congestion avoidance is the correct way to go.
The authors then examine what happens with these metrics when various combinations of multiplicative and incremental increases and decreases are used for congestion avoidance. Ultimately they find that decreases should be multiplicative and increases should be linearly additive. Interestingly, a host receives binary feedback from the network, and this extra bit of information became pervasively used despite the fact that it feels like a hack to get around the E2E principle.
All different aspects the authors examine lead them to the same "additive increase, multiplicative decrease" conclusion, from quanitifying responsiveness (time to converge) to looking at smoothness (size of oscillations). They show some nice graphs that demonstrate convergence of fairness and efficiency and how one metric might be neglected even as the other converges. The mathematical proofs even allow the authors to predict maximum overshoot of their proposed mechanism. This paper really hammered in the idea that the use of "additive increase, multiplicative decrease" in TCP's congestion avoidance is the correct way to go.
Thursday, September 3, 2009
Understanding BGP Misconfiguration; Mahajan, Wetherall, & Anderson
This paper examines the globally visible impacts of small-scale misconfigurations in BGP routers. The types of faults studied are accidental insertion and propagation of routes to global BGP tables. To study the frequency and effects of these faults, the authors looked at route updates in 19 ASes (it would have been interesting to know exactly which ASes these include) and decided to classify a route change as a misconfiguration if it only lasted a brief amount of time.
This idea is slightly flawed in that it doesn't take into account misconfigurations that aren't discovered for a long time, and it's also possible that legitimate route changes can occur that last for less than a day. However, the authors try to deal with the latter problem by contacting ISP operators for information regarding each potential misconfiguration (though ironically enough, 1/3 of these e-mails didn't reach their intended target due to stale data).
The authors describe the origins of new routes as being self-deaggregation, related origin, and foreign origin; the general thought regarding foreign origins was that they were more likely to be an address space hijack than a legitimate update. However, I would suspect that legitimate route updates of foreign origin would be more prevalent today than seven years ago when this paper was written, as tech companies like Google and Facebook with equipment all over the world implement policies to accommodate for failures in other locations by transferring traffic elsewhere.
There were a number of interesting findings from this study. The authors found that for all the BGP misconfigurations that occur, they noticeably increased load but didn't significantly hinder connectivity. In looking at the causes of these misconfigurations, some slightly disturbing facts arose, such as a bug in a certain type of router that leaked entries during reboot and also how much manual configuration is required from operators; misconfigurations can be due to underlying causes one might not have even thought involved in the process. This paper is of particular interest because it actually studied routing changes in the wild and then took steps to uncover the cause of misconfigurations.
This idea is slightly flawed in that it doesn't take into account misconfigurations that aren't discovered for a long time, and it's also possible that legitimate route changes can occur that last for less than a day. However, the authors try to deal with the latter problem by contacting ISP operators for information regarding each potential misconfiguration (though ironically enough, 1/3 of these e-mails didn't reach their intended target due to stale data).
The authors describe the origins of new routes as being self-deaggregation, related origin, and foreign origin; the general thought regarding foreign origins was that they were more likely to be an address space hijack than a legitimate update. However, I would suspect that legitimate route updates of foreign origin would be more prevalent today than seven years ago when this paper was written, as tech companies like Google and Facebook with equipment all over the world implement policies to accommodate for failures in other locations by transferring traffic elsewhere.
There were a number of interesting findings from this study. The authors found that for all the BGP misconfigurations that occur, they noticeably increased load but didn't significantly hinder connectivity. In looking at the causes of these misconfigurations, some slightly disturbing facts arose, such as a bug in a certain type of router that leaked entries during reboot and also how much manual configuration is required from operators; misconfigurations can be due to underlying causes one might not have even thought involved in the process. This paper is of particular interest because it actually studied routing changes in the wild and then took steps to uncover the cause of misconfigurations.
Interdomain Internet Routing; Balakrishnan
This paper describes interdomain routing in the Internet. It first details the decentralized nature of routing, detailing how layered tiers of Internet Service Providers engage in a complicated game of competition and cooperation to best serve customers while maximizing their profits. ISPs engage in both transit and peering relationships with one another; transit is a customer-provider relationship in which a provider makes the guarantee to its customer that any sender on the Internet will be able to reach them.
However, the ISPs will be selective about which routes they share so as to minimize all non-customer routing that goes through them. Choosing which routes to export is a complicated matter of economics and resource tradeoffs, keeping customers happy and not wasting one's own resources. I thought the paper mentioned a very interesting way of looking at a transit relationship: that what ISPs are charging their customers for is, in essence, an entry in a routing table. I was also curious as to how the different tier ISPs came about: did the multiple layers get added on gradually over time as the Internet grew, or was it a clean break from the centralized system of the ARPANET? Furthermore, when did the notion of making money as an ISP really start to come into the picture?
The paper then moves on to discuss autonomous systems and how they exchange routing info via Border Gateway Protocol (BGP) and use IGP to optimize their own internal paths. BGP is a path vector protocol that has internal and external flavors, iBGP and eBGP, which it uses internal to an AS and between ASes, respectively. BGP is used to exchange route information and keep track of numerous attributes of the route, including a vector of ASes the route should follow. BGP is not a complicated protocol, sending a few simple messages, including UPDATE when routes change, NOTIFICATION when routes go down, and KEEPALIVE to ensure that routes are still up. It selects routes based on some of the attributes associated with a route, including whether or not the route originated from a customer and based on shortest AS length.
BGP's working properly requires putting a lot of trust in all ASes; the fact that any AS can originate a route for any prefix seems like a bad idea that has caused some serious (though easily-fixed and short-lived) problems. Some sort of network sentinels that watch for unexpected changes in advertised routes seems like a very worthwhile idea, though apparently logging is not particularly reliable either, meaning that we're left with a lack of accountability in the case of route hijacks. For a highly trafficked site like youtube, it doesn't take long for the world to notice that routing tables world-wide have been poisoned, but how long would it take to find problems with more subtle misconfigurations?
However, the ISPs will be selective about which routes they share so as to minimize all non-customer routing that goes through them. Choosing which routes to export is a complicated matter of economics and resource tradeoffs, keeping customers happy and not wasting one's own resources. I thought the paper mentioned a very interesting way of looking at a transit relationship: that what ISPs are charging their customers for is, in essence, an entry in a routing table. I was also curious as to how the different tier ISPs came about: did the multiple layers get added on gradually over time as the Internet grew, or was it a clean break from the centralized system of the ARPANET? Furthermore, when did the notion of making money as an ISP really start to come into the picture?
The paper then moves on to discuss autonomous systems and how they exchange routing info via Border Gateway Protocol (BGP) and use IGP to optimize their own internal paths. BGP is a path vector protocol that has internal and external flavors, iBGP and eBGP, which it uses internal to an AS and between ASes, respectively. BGP is used to exchange route information and keep track of numerous attributes of the route, including a vector of ASes the route should follow. BGP is not a complicated protocol, sending a few simple messages, including UPDATE when routes change, NOTIFICATION when routes go down, and KEEPALIVE to ensure that routes are still up. It selects routes based on some of the attributes associated with a route, including whether or not the route originated from a customer and based on shortest AS length.
BGP's working properly requires putting a lot of trust in all ASes; the fact that any AS can originate a route for any prefix seems like a bad idea that has caused some serious (though easily-fixed and short-lived) problems. Some sort of network sentinels that watch for unexpected changes in advertised routes seems like a very worthwhile idea, though apparently logging is not particularly reliable either, meaning that we're left with a lack of accountability in the case of route hijacks. For a highly trafficked site like youtube, it doesn't take long for the world to notice that routing tables world-wide have been poisoned, but how long would it take to find problems with more subtle misconfigurations?
Tuesday, September 1, 2009
The Design Philosophy of the DARPA Internet Protocols; Clark
This paper describes some of the basic design principles that went into the DARPA Internet, the foundations of the Internet as we know and love it today. The top goal of the DARPA Internet was to connect together existing networks, such as the ARPANET and ARPA packet radio network. Two additional aspects of the Internet that carried over from the ARPANET are packet switching and store-and-forward of packets by gateway packet switches, though it's really the incorporation of existing networks that really drove the direction of the Internet. It's not clear to me how the Internet would be different today if the designers had chosen to create a new unified, multi-media system instead of incorporating existing networks; would it be something significantly higher performing and survivable?
The top secondary goal of the designers was survivability: that, short of complete network partition, service should continue in some way between sender and receiver. To maintain this state information, the design uses "fate sharing," that is, keeping this state information at an endpoint host. The argument is that losing the state information about a host is okay if the host entity itself is lost at the same time.
Another secondary goal included accommodating different types of service, differing by requirements in speed, latency, and reliability. Providing for each of these different services proved to be too much for one protocol to handle effectively, which ultimately led to the split of TCP and IP as well as the creation of UDP. Additionally, the Internet was meant to allow for a variety of networks, which it does by making a minimal set of assumptions: the network must be able to transport a datagram. On top of that, there are some fuzzier requirements; the packet must be of "reasonable" size with delivered with "reasonable" reliability through a network with "suitable" addressing.
In creating a framework that cobbled together existing networks, the designers of the Internet caused problems for themselves in terms of not being able to accurately simulate or formalize performance constraints for the Internet architecture. Also of interest is that the end-to-end principle made the architecture less cost-effective and (at least initially) more difficult to connect new hosts to. While use of the datagram as a building block served its purpose of providing some of the flexibility-related secondary goals, it did not provide for easy resource management and accounting, which would have been better served by some notion of a "flow" through the network. All these factors, from datagrams to byte-managed TCP to packet-switching, allowed the Internet to achieve its main goal but also played some part in keeping it from being the cleanest, most effective, high-performing implementation possible.
The top secondary goal of the designers was survivability: that, short of complete network partition, service should continue in some way between sender and receiver. To maintain this state information, the design uses "fate sharing," that is, keeping this state information at an endpoint host. The argument is that losing the state information about a host is okay if the host entity itself is lost at the same time.
Another secondary goal included accommodating different types of service, differing by requirements in speed, latency, and reliability. Providing for each of these different services proved to be too much for one protocol to handle effectively, which ultimately led to the split of TCP and IP as well as the creation of UDP. Additionally, the Internet was meant to allow for a variety of networks, which it does by making a minimal set of assumptions: the network must be able to transport a datagram. On top of that, there are some fuzzier requirements; the packet must be of "reasonable" size with delivered with "reasonable" reliability through a network with "suitable" addressing.
In creating a framework that cobbled together existing networks, the designers of the Internet caused problems for themselves in terms of not being able to accurately simulate or formalize performance constraints for the Internet architecture. Also of interest is that the end-to-end principle made the architecture less cost-effective and (at least initially) more difficult to connect new hosts to. While use of the datagram as a building block served its purpose of providing some of the flexibility-related secondary goals, it did not provide for easy resource management and accounting, which would have been better served by some notion of a "flow" through the network. All these factors, from datagrams to byte-managed TCP to packet-switching, allowed the Internet to achieve its main goal but also played some part in keeping it from being the cleanest, most effective, high-performing implementation possible.
End-to-End Arguments in System Design; Saltzer, Reed, & Clark
The authors describe the "end-to-end" principle in this paper as it relates to deciding on the specific functionality built into a communication system: that some functions require some information or help from the application regardless, so building that functionality into the communication system itself is not necessarily beneficial, and indeed can even cause harm at times. Leaving out extra functionality from the low-level system can be preferable because applications that don't need the functions still have to deal with the performance hit in the underlying system and also because some applications will implement the function better or in more detail than the communcation system.
They provide a few examples to further their argument. The reliable file transfer example provides a more concrete notion to this principle, stressing that even if the underlying communication system has some fancy built-in guarantees, the application itself still has to check (and possibly handle) a small variety of errors, so complicating the communication system is not particularly helpful.
The "MIT checksum incident" example almost seems to argue that low-level function implementation is a harmful thing; if the networks hadn't used a packet checksum, perhaps the application programmers would have built in their own verification method and caught packet corruption earlier. I was skeptical of this notion that the communication system should be completely stripped off all duty to ensure correct data transmission, but luckily the authors countered this idea immediately, insisting that proper design involves a tradeoff between complexity costs of the communication system and the performance hits from an application having to deal with many errors in a simple communication system.
I like the notion of the end-to-end principle being the equivalent of RISC, in that both advocate paring down the underlying tools (communication system and instruction set, respectively) to the core building blocks necessary for everything else to get built on top of. The notion of layers in a network stems from this idea, with the underlying IP being simple and higher-level TCP/UDP/etc. layers actually offering the performance guarantees. This paper is important because it is likely one of the initial bases for the layers of the Internet Protocol Suite.
They provide a few examples to further their argument. The reliable file transfer example provides a more concrete notion to this principle, stressing that even if the underlying communication system has some fancy built-in guarantees, the application itself still has to check (and possibly handle) a small variety of errors, so complicating the communication system is not particularly helpful.
The "MIT checksum incident" example almost seems to argue that low-level function implementation is a harmful thing; if the networks hadn't used a packet checksum, perhaps the application programmers would have built in their own verification method and caught packet corruption earlier. I was skeptical of this notion that the communication system should be completely stripped off all duty to ensure correct data transmission, but luckily the authors countered this idea immediately, insisting that proper design involves a tradeoff between complexity costs of the communication system and the performance hits from an application having to deal with many errors in a simple communication system.
I like the notion of the end-to-end principle being the equivalent of RISC, in that both advocate paring down the underlying tools (communication system and instruction set, respectively) to the core building blocks necessary for everything else to get built on top of. The notion of layers in a network stems from this idea, with the underlying IP being simple and higher-level TCP/UDP/etc. layers actually offering the performance guarantees. This paper is important because it is likely one of the initial bases for the layers of the Internet Protocol Suite.
Resurrecting the Blog
... for Randy's networking class. I kept all my summaries for systems papers here for easy access and because I'm special like that, and I thought about doing that for my architecture papers but never got around to it.
Subscribe to:
Posts (Atom)