This paper describes BotGraph, a system to detect Web-account abuse traffic generated by bots. It attempts to determine the botnet on the collective scale, not determining individual nodes, by taking advantage of similarities in configuration among nodes in a botnet. Detecting this abuse consists of detecting aggressive signups and logins.
Detecting botnet signups is based on spikes of signups from a given IP address. Locating all bots in a bot-user group is done through use of a graph with users as vertices and weighted edges between them showing their activity similiarity. This paper uses a DryadLINQ-based system to process large amounts of data in parallel for these user-user graphs. A nice thing about this paper is that if the botnets adaptively limit their signups or e-mails sent per bot, even if the graph doesn't pick them up as being bots, their activities will have been sufficiently scaled back that they don't pose nearly the problem they did previously.
Wednesday, December 9, 2009
Sunday, December 6, 2009
Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks; Gummadi, Balakrishnan, Maniatis, & Ratnasamy
Botnets are responsible for a large portion of all undesirable web traffic, such as spam, DDoS attacks, and click-fraud, which can be reduced with effective schemes to correctly classify traffic as human-generated or bot-generated. However, these schemes must not trust potentially-infected hosts and must be as transparent as possible to users. This paper describes Not-a-Bot (NAB), a human-activity attester that requires both a Trusted Platform Module chip and a verifier module implemented in the Xen hypervisor at any server wishing to use it. NAB attempts to reduce bot-generated traffic while not affecting human-generated traffic.
An interaction is attested to be human-generated based on how closely recent activity from the keyboard and mouse ports matches the potential web traffic. If matching recent activity cannot be found, an alert is issued that the activity could not be attested. Bots could replicate this user behavior in their messages to be sent, but NAB rate-limits this activity by requiring application-specific time between attestations. The TPM on a host then signs and sends the request on, where it can be checked at the destination verifier.
The authors evaluate their scheme on actual traces of several hundred users, as well as malware traces and spam messages. They found that NAB does very well in suppressing spam, DDoS attempts, and click-fraud by around 90%, and it didn't deny any human-generated traffic.
In reading about the scenarios in which NAB is actually applied, I wondered how many average e-mail users would actually want to deploy NAB. The authors then addressed this question by saying that e-mail users would benefit from having all their mail correctly classified. However, I personally don't feel that I'm largely affected by mis-classification of my e-mails, so I still have a hard time accepting that NAB could catch on widely. The DDoS scenario is less effective in justifying use of NAB, particularly since some legitimate web requests could be machine-generated. As for detecting click-fraud, any company getting ad revenue could make use of NAB, but the benefits for users is low; I particularly disliked the idea of having an attester bundled with installed toolbars.
An interaction is attested to be human-generated based on how closely recent activity from the keyboard and mouse ports matches the potential web traffic. If matching recent activity cannot be found, an alert is issued that the activity could not be attested. Bots could replicate this user behavior in their messages to be sent, but NAB rate-limits this activity by requiring application-specific time between attestations. The TPM on a host then signs and sends the request on, where it can be checked at the destination verifier.
The authors evaluate their scheme on actual traces of several hundred users, as well as malware traces and spam messages. They found that NAB does very well in suppressing spam, DDoS attempts, and click-fraud by around 90%, and it didn't deny any human-generated traffic.
In reading about the scenarios in which NAB is actually applied, I wondered how many average e-mail users would actually want to deploy NAB. The authors then addressed this question by saying that e-mail users would benefit from having all their mail correctly classified. However, I personally don't feel that I'm largely affected by mis-classification of my e-mails, so I still have a hard time accepting that NAB could catch on widely. The DDoS scenario is less effective in justifying use of NAB, particularly since some legitimate web requests could be machine-generated. As for detecting click-fraud, any company getting ad revenue could make use of NAB, but the benefits for users is low; I particularly disliked the idea of having an attester bundled with installed toolbars.
Friday, November 20, 2009
Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems; Nedevschi, Chandrashekar, Liu, Nordman, Ratnasamy, & Taft
This paper examines the energy wasted by computers not taking advantage of idle times to go into lower power sleep modes. The authors straight out acknowledge that this problem of when and how to effectively put nodes to sleep without them losing their network presence has been examined, with wake-on-LAN and proxying as potential solutions.
Based on traces from actual users, they justify the need for these solutions by showing that the studied computers actually do spend a lot of time in idle and that in many cases, long gaps of time pass between packet arrivals at the idle machine. Not suprisingly, the machines tend to have more frequently arriving packets in the office setting than in the home setting, with some users having even more available idle time than others.
I like this paper because it breaks down the type of network traffic seen in both the home and office settings; it starts out breaking both the incoming and outing packets into unicast, multicast, and broadcast groupings, and then further examines the breakdown of the multicast and broadcast packets into such things as ARP, IGMP, etc. These breakdowns are important because the power savings possible due to proxying for a node are dependent on which packets a node does not actually need to be woken up for.
Protocols are classified into groupings such as "don't-ignore" (ARP or DHCP requests for the sleeping node), "don't wake", ignorable, and mechanical response that the proxy can respond to. With these classifications in mind, the authors present 4 proxy schemes that offer varying levels of power savings, with the tradeoff that less application state is maintained. They find that their more aggressive schemes offer a lot of power savings, particularly in the office environment, and create a prototype of their proxy architecture in Click.
Based on traces from actual users, they justify the need for these solutions by showing that the studied computers actually do spend a lot of time in idle and that in many cases, long gaps of time pass between packet arrivals at the idle machine. Not suprisingly, the machines tend to have more frequently arriving packets in the office setting than in the home setting, with some users having even more available idle time than others.
I like this paper because it breaks down the type of network traffic seen in both the home and office settings; it starts out breaking both the incoming and outing packets into unicast, multicast, and broadcast groupings, and then further examines the breakdown of the multicast and broadcast packets into such things as ARP, IGMP, etc. These breakdowns are important because the power savings possible due to proxying for a node are dependent on which packets a node does not actually need to be woken up for.
Protocols are classified into groupings such as "don't-ignore" (ARP or DHCP requests for the sleeping node), "don't wake", ignorable, and mechanical response that the proxy can respond to. With these classifications in mind, the authors present 4 proxy schemes that offer varying levels of power savings, with the tradeoff that less application state is maintained. They find that their more aggressive schemes offer a lot of power savings, particularly in the office environment, and create a prototype of their proxy architecture in Click.
Cutting the Electric Bill for Internet-Scale Systems; Qureshi, Weber, Balakrishnan, Guttag, & Maggs
This paper discusses reducing energy costs of operating a data center. This technique relies on temporal and spatial fluctuations in electricity costs as well as use of dynamic request routing across widely distributed machines. The notion of energy elasticity determines how much money can actually be saved by changing the load on a machine; machines with greater energy proportionality can save more money when they're less used than those machines that continue to burn lots of power when idle.
The authors choose to focus on real-time electricity pricing instead of future, predicted markets, though RT makes up about 10% of the market share. Routing to different geographic areas to obtain the best price relies on low correlation between hourly prices in different regions, which appears to be the case between different Regional Transmission Organizations (RTOs). Part of the difference in cost between two locations at once is the fact that different time zones reach their peak cost at non-peak hours in other locations.
An inherent tradeoff in cutting cost of electricity by routing it elsewhere geographically is an increase in bandwidth costs, which the authors accounted for in part of their study. They did find that this cut the amount of energy costs that could be saved, but they still managed to cut total operating costs. Further reductions could be made through enrolling in demand reduction programs, but that would require delay-insensitive workloads within the data center, and it's not obvious whether that tradeoff would end up being monetarily beneficial.
Regardless of the assumptions the authors must make to simulate their ideas, I think the idea of geographic routing based on hourly electricity costs is a great idea (to the extent that performance doesn't suffer greatly), particularly if the work might be moved around anyway due to node failure or data placement.
The authors choose to focus on real-time electricity pricing instead of future, predicted markets, though RT makes up about 10% of the market share. Routing to different geographic areas to obtain the best price relies on low correlation between hourly prices in different regions, which appears to be the case between different Regional Transmission Organizations (RTOs). Part of the difference in cost between two locations at once is the fact that different time zones reach their peak cost at non-peak hours in other locations.
An inherent tradeoff in cutting cost of electricity by routing it elsewhere geographically is an increase in bandwidth costs, which the authors accounted for in part of their study. They did find that this cut the amount of energy costs that could be saved, but they still managed to cut total operating costs. Further reductions could be made through enrolling in demand reduction programs, but that would require delay-insensitive workloads within the data center, and it's not obvious whether that tradeoff would end up being monetarily beneficial.
Regardless of the assumptions the authors must make to simulate their ideas, I think the idea of geographic routing based on hourly electricity costs is a great idea (to the extent that performance doesn't suffer greatly), particularly if the work might be moved around anyway due to node failure or data placement.
Thursday, November 19, 2009
Scalable Application Layer Multicast; Banerjee, Bhattacharjee, & Kommareddy
This paper describes the NICE (NICE is the Internet Cooperative Environment) multicast protocol. Like SRM, NICE is implemented at the application layer so as not to require changes to the underlying network. NICE uses hierarchical layers to group nodes into clusters and in doing so, reduce stretch, or unnecessarily increased path length from the unicast path.
When new hosts attempt to join, they first communicate with a Rendezvous Point (RP) node, which then directs them to high-tiered nodes, which subsequently direct the host to lower-level nodes close to them. Higher-level nodes are cluster leaders, responsible for keeping the size of their associated cluster between a certain threshold and splitting or merging the cluster group when the max or min threshold is passed. Members of a cluster alert each other to their presence throughput periodic heartbeat messages.
The authors simulate NICE and compare their results to Narada, an existing multicast protocol. They find that it performs comparably, but with less control overhead and less stress (total messages sent over a link) on links. The hierarachical structure of NICE seems to be the main reason that it has such low control overhead and hence can be used (at least in simulation) for groups of large sizes, say, greater than 1024 hosts, which Narada cannot do. The paper references another paper about Narada, but I wished they would have given a succint, organized comparison of the two protocols.
This paper went much more into detail about the underlying workings of the multicast groups, while the other paper offered a more concrete example of when this multicast framework would be used and how to detect and deal with losses.
When new hosts attempt to join, they first communicate with a Rendezvous Point (RP) node, which then directs them to high-tiered nodes, which subsequently direct the host to lower-level nodes close to them. Higher-level nodes are cluster leaders, responsible for keeping the size of their associated cluster between a certain threshold and splitting or merging the cluster group when the max or min threshold is passed. Members of a cluster alert each other to their presence throughput periodic heartbeat messages.
The authors simulate NICE and compare their results to Narada, an existing multicast protocol. They find that it performs comparably, but with less control overhead and less stress (total messages sent over a link) on links. The hierarachical structure of NICE seems to be the main reason that it has such low control overhead and hence can be used (at least in simulation) for groups of large sizes, say, greater than 1024 hosts, which Narada cannot do. The paper references another paper about Narada, but I wished they would have given a succint, organized comparison of the two protocols.
This paper went much more into detail about the underlying workings of the multicast groups, while the other paper offered a more concrete example of when this multicast framework would be used and how to detect and deal with losses.
A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing; Floyd, Jacobson, McCanne, Liu, & Zhang
This paper discusses scalable reliable multicast (SRM). It starts off by presenting the problems of blindly scaling unicast to multicast. Reliable delivery becomes a very difficult problem to tackle when there are multiple receivers and a layer of indirection involved. Interestingly, the authors advocate the use of Application Data Units (ADU) which is a method more closely tied to an application for requesting pieces of data.
The authors then go on to describe wb, an application that uses reliable multicast, and the design decisions that accompany it. Particular to wb are that all senders and receivers are part of the same multicast group without regard to roles, and delivery order of packets does not matter to the application. With the latter fact in mind, it seems odd that they should choose this as their example application, since part of the difficulty in multicast comes in with ordering packets correctly. To detect gaps due to losses without causing a response implosion on the sender side, receivers send periodic messages about which sequence numbers they have seen; they work together to send and respond to repair requests after first waiting some random timeout.
The paper gives the expected number of requests and timer delays for a few topologies: chain, star, and bounded-degree trees, and then shows some simulation results for these topologies. One realism missing from the simulations was the possibility of dropping repairs and repairs requests and thus requiring retransmissions of retransmissions. I hated the simulation result graphs. They weren't labeled at all, so I had to jump back and forth to the text a few times to figure out what everything was.
The authors then go on to describe wb, an application that uses reliable multicast, and the design decisions that accompany it. Particular to wb are that all senders and receivers are part of the same multicast group without regard to roles, and delivery order of packets does not matter to the application. With the latter fact in mind, it seems odd that they should choose this as their example application, since part of the difficulty in multicast comes in with ordering packets correctly. To detect gaps due to losses without causing a response implosion on the sender side, receivers send periodic messages about which sequence numbers they have seen; they work together to send and respond to repair requests after first waiting some random timeout.
The paper gives the expected number of requests and timer delays for a few topologies: chain, star, and bounded-degree trees, and then shows some simulation results for these topologies. One realism missing from the simulations was the possibility of dropping repairs and repairs requests and thus requiring retransmissions of retransmissions. I hated the simulation result graphs. They weren't labeled at all, so I had to jump back and forth to the text a few times to figure out what everything was.
Thursday, November 12, 2009
X-Trace: A Pervasive Network Tracing Framework; Fonseca, Porter, Katz, Shenker, & Stoica
This paper discusses X-Trace, a network diagnostic tool that works across a range of domains and applications by adding metadata to all network operations that originate from certain tasks. By keeping track of which tasks spawn new tasks, X-Trace ends up keeping track of task trees, whose steps it can replay to troubleshoot failures. Nodes push IDs to their next hop and establish these task trees through pushDown and pushNext, functions for lower and same layer hops, respectively.
X-Traces design was driven by 3 core ideas: trace requests should be sent in-band, collected traces should be sent out-of-band, and originators and receivers of requests should be decoupled. These principles have good reasons behind them, namely that trace requests need to be able to follow the same path as the original flow, that cted data should avoid the troublesome spots that it will help diagnose, and that administrative domains (ADs) that are unwilling to disclose some of their information need not fear this tool doing just that.
A few scenarios for using X-Trace are described including how traces could pinpoint problems; these scenarios include web site requests from an Apache server, web hosting services for photos, the i3 overlay network, tunneling, and troubleshooting ISP connections. A big open issue for X-Trace is security, for its metadata and for generating reports. I liked how frankly the authors acknowledged problems with X-Trace, that is, how difficult it was to deploy and how easy adoption/retrofitting of it would be.
NetFPGA: A Network Tool for Research and Education; Watson, McKeown, & Casado
This short paper describes the NetFPGA system, an FPGA-based platform originally created as a teaching tool to give students experience with network physical and link layers. It is currently in version 2, the main difference from version 1 being its format, using PCI and requiring no specialized backplanes or rack.
NetFPGA sounds like a great tool for people who want practical experience in developing networking hardware but cannot due to cost and complexity of such activities; networking researchers now needn't rely solely on software routing for their implementations. I don't know what their current status of running software on the actual boards is; apparently the software part of the routers was not initially run on the boards and required some outside computer, which is unfortunate.
NetFPGA sounds like a great tool for people who want practical experience in developing networking hardware but cannot due to cost and complexity of such activities; networking researchers now needn't rely solely on software routing for their implementations. I don't know what their current status of running software on the actual boards is; apparently the software part of the routers was not initially run on the boards and required some outside computer, which is unfortunate.
Tuesday, November 10, 2009
A Policy-aware Switching Layer for Data Centers; Joseph, Tavakoli, & Stoica
This paper describes PLayer, a policy-aware switching layer for data centers. PLayer is meant to help combat the problems posed by middleboxes (e.g. firewalls) and having equivalent logical and physical network topologies. With middleboxes that are hard-wired into the physical network, any rerouting of traffic due to server failures could bypass the middlebox, meaning that behavior is not reliable.
PLayer separates the notions of policy from reachability, meaning that traffic goes through middleboxes as expected by data center policy, not just as network mechanisms allow. Additionally, middleboxes are taken off the physical network data path, with traffic forwarded to them via a policy-aware pswitch. Pswitches contain switch and policy cores; as the authors point out, minimal hardware change is required for PLayer to actually be adopted in data centers.
However, it's not clear from the paper how these switches would be acquired, if DC owners would have to hack their own together. The authors themselves developed a software implementation of the pswitches using Click. They simulated use of the pswitches on a small collection of topologies and found through the use of simple micro-benchmarks that their software prototype incurred an increase in latency and decrease in throughput. Despite this degrading performance, use of pswitches offers guarantees that middleboxes will be traversed as desired even in the presence of network churn.
PLayer separates the notions of policy from reachability, meaning that traffic goes through middleboxes as expected by data center policy, not just as network mechanisms allow. Additionally, middleboxes are taken off the physical network data path, with traffic forwarded to them via a policy-aware pswitch. Pswitches contain switch and policy cores; as the authors point out, minimal hardware change is required for PLayer to actually be adopted in data centers.
However, it's not clear from the paper how these switches would be acquired, if DC owners would have to hack their own together. The authors themselves developed a software implementation of the pswitches using Click. They simulated use of the pswitches on a small collection of topologies and found through the use of simple micro-benchmarks that their software prototype incurred an increase in latency and decrease in throughput. Despite this degrading performance, use of pswitches offers guarantees that middleboxes will be traversed as desired even in the presence of network churn.
Internet Indirection Infrastructure; Stoica, Adkins, Zhuang, Shenker, & Surana
This paper discusses i3 (Internet Indirection Infrastructure), an overlay network that provides indirection through decoupling of sending and receiving to allow flexible support for anycast, mobility, and other services. i3 uses rendezvous-style communication; receiving nodes insert triggers onto specified servers that consist of a packet id of interest and an IP address to which these packets should be forwarded. Thus, receivers and senders may remain blissfully unaware of how many of the other there are.
Mobility is supported by having receiving nodes simply update their triggers when their addresses change. Multicast involves all group members setting up triggers for an i3 id. To allow intermediate steps like HTML->WML mapping, id chains can be created, and large-scale multicast can be accomodated by hierarchical triggers.
i3 is implemented and simulated using Chord. Some measurements of note include the per-packet fowarding and routing overhead; forwarding overhead seems to add between 10-40 us. While i3 allows flexible implementation of a number of services with fairly low overhead, it has some issues with security. One in particular that stands out is that any user can insert or remove a trigger for any receiver from an i3 node, which seems like a big flaw fixable only through more indirection or verification.
Mobility is supported by having receiving nodes simply update their triggers when their addresses change. Multicast involves all group members setting up triggers for an i3 id. To allow intermediate steps like HTML->WML mapping, id chains can be created, and large-scale multicast can be accomodated by hierarchical triggers.
i3 is implemented and simulated using Chord. Some measurements of note include the per-packet fowarding and routing overhead; forwarding overhead seems to add between 10-40 us. While i3 allows flexible implementation of a number of services with fairly low overhead, it has some issues with security. One in particular that stands out is that any user can insert or remove a trigger for any receiver from an i3 node, which seems like a big flaw fixable only through more indirection or verification.
Thursday, November 5, 2009
DNS Performance and the Effectiveness of Caching; Jung, Sit, Balakrishnan, & Morris
This paper presents findings about DNS regarding its performance and effectiveness of caching. Studying traces from MIT and KAIST, the authors show the breakdown of DNS query types; they found that around 60% were name-to-address queries, with another 25-30% being reverse address-to-name queries. However, they say they excluded DNS lookups not associated with TCP connections from these statistics and then revealed that 50% of MIT lookups fall into this category. How much does this class of lookups change their results?
It sounds like a lot of DNS traffic is somewhat gratuitous, for example with systems doing reverse lookups and then verifying that result by doing an A lookup on it. Surprisingly, nearly 1/4th of all DNS lookups go unanswered, with at least another 10% receiving a negative answer. The authors also chart latency of lookups based on number of referrals in the process. Not surprisingly, 2 or more referrals can cause a a tenfold increase in latency. Also not surprising is that caching can bring this latency down significantly, as well as easing the load at the root servers.
The authors further study caching through a trace-based simulation which looks at cached IP address hits. They find that a small number of names are very frequently accessed, allowing cache rates for small size caches to be high, and that increasing TTL increases cache hits only to a certain degree, as web pages with a given name are often accessed in small, successive bursts. I liked this paper because it actually took a look at what all those DNS lookups are running around the net doing.
It sounds like a lot of DNS traffic is somewhat gratuitous, for example with systems doing reverse lookups and then verifying that result by doing an A lookup on it. Surprisingly, nearly 1/4th of all DNS lookups go unanswered, with at least another 10% receiving a negative answer. The authors also chart latency of lookups based on number of referrals in the process. Not surprisingly, 2 or more referrals can cause a a tenfold increase in latency. Also not surprising is that caching can bring this latency down significantly, as well as easing the load at the root servers.
The authors further study caching through a trace-based simulation which looks at cached IP address hits. They find that a small number of names are very frequently accessed, allowing cache rates for small size caches to be high, and that increasing TTL increases cache hits only to a certain degree, as web pages with a given name are often accessed in small, successive bursts. I liked this paper because it actually took a look at what all those DNS lookups are running around the net doing.
Development of the Domain Name System; Mockapetris & Dunlap
This paper describes the initial goals behind the creation of the Domain Name System, or DNS. Originally, mappings of host names to addresses was maintained in one single centrally-located HOSTS.TXT file. Understandably, this approach became unwieldy and needed to be distributed somehow.
The new DNS approach was designed to adhere to the following constraints: provide the same information as HOSTS.TXT, allow distributed management, and flexibly interoperate across environments including the DARPA Internet, all while providing "tolerable" performance and imposing few limits on size of names or other components associated with it.
DNS requires the use of name servers, which store information and answer queries, and resolvers, which interact with client programs to find the right name server to connect to. Resolvers can choose to cache query responses. DNS has an internal name space in the form labels concatenated together from a variable-depth tree, allowing organizations flexibility to organize themselves and expand internally.
I found it amusing to hear that the Berkeley community was somewhat resistant to the change to DNS. As this paper was written over 20 years ago, I wonder if HOSTS.TXT is still used by many systems; I know I personally use a hosts mapping file in Linux to build shortcuts on top of DNS, but to be at all usable, HOSTS.TXT really needs to be an extra layer in addition to DNS to handle special cases at local levels.
The new DNS approach was designed to adhere to the following constraints: provide the same information as HOSTS.TXT, allow distributed management, and flexibly interoperate across environments including the DARPA Internet, all while providing "tolerable" performance and imposing few limits on size of names or other components associated with it.
DNS requires the use of name servers, which store information and answer queries, and resolvers, which interact with client programs to find the right name server to connect to. Resolvers can choose to cache query responses. DNS has an internal name space in the form labels concatenated together from a variable-depth tree, allowing organizations flexibility to organize themselves and expand internally.
I found it amusing to hear that the Berkeley community was somewhat resistant to the change to DNS. As this paper was written over 20 years ago, I wonder if HOSTS.TXT is still used by many systems; I know I personally use a hosts mapping file in Linux to build shortcuts on top of DNS, but to be at all usable, HOSTS.TXT really needs to be an extra layer in addition to DNS to handle special cases at local levels.
Monday, November 2, 2009
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications; Stoica, Morris, Karger, Kaashoek, & Balakrishnan
This paper discusses Chord, a distributed lookup protocol. Chord is a DHT that uses a distributed routing table and requires that each node have knowledge of only a few neighbors, making it scalable. It maintains a balanced load through the use of consistent hashing. Any key k has a successor node successor(k) that it corresponds to, with successor(k) being the first clockwise node from the identifier k. Nodes leaving and joining the network simply have values transferred back and forth between successors and themselves.
Each chord node maintains more than just the next successor; it keeps a finger table of a subset of log n nodes. These finger table entries can be used to find an identifier k by finding the closest predecessor j to k in the table and then sending the request to j and recursing until eventually the correct node has been found. A difficulty is in dealing with simultaneously joining/failing nodes, though apparently given a small "settling" time, the finger tables will eventually converge nicely.
The authors simulate Chord on a large network (thousands of nodes storing tens of thousands of keys) and also try it out on a smaller scale experimental testbed, looking at latency and path length. Problems they have not yet tackled include dealing with a partitioned Chord ring and dealing with malicious users. Additionally, there's a tradeoff between required number of lookups and finger table sizes which could use further exploration. Otherwise, Chord seems like a fairly straightforward, simple distributed lookup scheme.
Each chord node maintains more than just the next successor; it keeps a finger table of a subset of log n nodes. These finger table entries can be used to find an identifier k by finding the closest predecessor j to k in the table and then sending the request to j and recursing until eventually the correct node has been found. A difficulty is in dealing with simultaneously joining/failing nodes, though apparently given a small "settling" time, the finger tables will eventually converge nicely.
The authors simulate Chord on a large network (thousands of nodes storing tens of thousands of keys) and also try it out on a smaller scale experimental testbed, looking at latency and path length. Problems they have not yet tackled include dealing with a partitioned Chord ring and dealing with malicious users. Additionally, there's a tradeoff between required number of lookups and finger table sizes which could use further exploration. Otherwise, Chord seems like a fairly straightforward, simple distributed lookup scheme.
Looking Up Data in P2P Systems; Balakrishnan, Kaashoek, Karger, Morris, & Stoica
This article discusses the lookup problem in P2P networks, that is, scalable and efficient lookup. A few approaches include centralized databases, which have scalability and fault-tolerance issues, and DHTs, which involve passing around of key-value pairs. P2P systems do not employ hierarchical lookups, treating each host equally in the algorithm. Broadcasts between neighbors can be used when a node doesn't have a key, but too many broadcasts can quickly take down a system by using up all available bandwidth.
The authors then go into more depth about the DHTs, describing how each step in a lookup for some key or ID should progress to a node with a closer value to ID, based on some definition of closeness such as numeric difference or number of similar prefix bits. Chord is a DHT that uses a skip-list to allow binary searches for keys, whereas Pastry, Tapestry, Kademlia, and CAN all use tree-based structures.
The article is brief, having been published in communications of the ACM, but it seems to go into too much detail about the wrong things. I would have been more interested in an organized comparison of the different DHT implementations and an analysis of their strengths and weaknesses. Instead, I got a long discussion about P2P systems, skip lists, and hash table lookups.
The authors then go into more depth about the DHTs, describing how each step in a lookup for some key or ID should progress to a node with a closer value to ID, based on some definition of closeness such as numeric difference or number of similar prefix bits. Chord is a DHT that uses a skip-list to allow binary searches for keys, whereas Pastry, Tapestry, Kademlia, and CAN all use tree-based structures.
The article is brief, having been published in communications of the ACM, but it seems to go into too much detail about the wrong things. I would have been more interested in an organized comparison of the different DHT implementations and an analysis of their strengths and weaknesses. Instead, I got a long discussion about P2P systems, skip lists, and hash table lookups.
Thursday, October 29, 2009
Active network vision and reality: lessons from a capsule-based system; Wetherall
This paper describes a framework for Active Networks called ANTS. ANTS allows users to implement network functionality across an unrestricted set of nodes. Applications send special packets called capsules, in essence an extended IP packet which contains the actual service code to run on it, which are passed around via active node extensible routers.
ANTS decouples capsules from service code, caching service code for particular streams of packets and requesting code from a capsule's previously-visited active node if the current node doesn't have the service code cached. Capsules are kept from misbehaving by being greedy with resources or interfering with other payloads in the network by the ANTS sandbox in which their code is run.
The authors show graphs of ANTS latency and throughput versus capsule payload size and compare these results to impelementations of forwarding in Java and C. I had always heard people complain about how slow Java is compared to compiled languages like C, but I had never seen any results to verify that sentiment; these graphs were enough for me, as the C implementation can get about 4x the throughput of ANTS/Java. Hence, if the authors are more interested in ease of portability, they can stick with their Java implementation of ANTS, but if they want better performance, Java is not the answer.
Active Networks definitely seem to be in violation of the E2E principle: trying to implement functionality at the link layer for not much gain. Apparently Active Networks have applications such as multicast that they would be useful for distributing over the wide-area, but short of these few applications, it's difficult to see the real need for them.
ANTS decouples capsules from service code, caching service code for particular streams of packets and requesting code from a capsule's previously-visited active node if the current node doesn't have the service code cached. Capsules are kept from misbehaving by being greedy with resources or interfering with other payloads in the network by the ANTS sandbox in which their code is run.
The authors show graphs of ANTS latency and throughput versus capsule payload size and compare these results to impelementations of forwarding in Java and C. I had always heard people complain about how slow Java is compared to compiled languages like C, but I had never seen any results to verify that sentiment; these graphs were enough for me, as the C implementation can get about 4x the throughput of ANTS/Java. Hence, if the authors are more interested in ease of portability, they can stick with their Java implementation of ANTS, but if they want better performance, Java is not the answer.
Active Networks definitely seem to be in violation of the E2E principle: trying to implement functionality at the link layer for not much gain. Apparently Active Networks have applications such as multicast that they would be useful for distributing over the wide-area, but short of these few applications, it's difficult to see the real need for them.
Resilient Overlay Networks; Andersen, Balakrishnan, Kaashoek, & Morris
This paper discusses Resilient Overlay Networks (RON), which was created as a solution to the connectivity problems created by BGP's taking too much time to recover from faults. RON nodes continually monitor path (or virtual link) quality between themselves and other nodes and exchange this information amongst themselves to determine optimal routes based on some quality metric.
As different applications have different route needs in terms of latency, loss, and throughput, each group of clients (program communicating with node's RON software) can use their own metrics to determine paths. The entry node of a packet into the network identifies the needs of the incoming flow and selects a path from its routing table; it tags the packet with a flow ID and sends it on its way. Subsequent RON nodes can use flow ID tags to speed up the sending process for future packets.
RON uses link-state routing, where each node in the overlay needs to send information to every other node, which seems like a lot of wasteful overhead. However, since the problem being tackled deals with convergence time, RON floods are perhaps more desirable than lack of connectivity in the network. The three metrics a RON nodes keeps track of for each link are latency, loss, and throughput, and nodes frequently probe virtual links to ensure they're still usable.
In evaluating RON, the authors found that in one scenario RON was able to fully overcome outages (defined as packet loss rates above some given threshold for a certain length of time) and up to 60% of outages in the second set, the difference for which the authors don't discuss much, brushing it off as differences in the setup and also claiming that the 40% were due to sites not being reachable by any other RON site. It's not clear to me that the problem of connectivity posed by BGP's slow recovery justifies the use of RON, which will not scale well and also might not play well with NATs.
As different applications have different route needs in terms of latency, loss, and throughput, each group of clients (program communicating with node's RON software) can use their own metrics to determine paths. The entry node of a packet into the network identifies the needs of the incoming flow and selects a path from its routing table; it tags the packet with a flow ID and sends it on its way. Subsequent RON nodes can use flow ID tags to speed up the sending process for future packets.
RON uses link-state routing, where each node in the overlay needs to send information to every other node, which seems like a lot of wasteful overhead. However, since the problem being tackled deals with convergence time, RON floods are perhaps more desirable than lack of connectivity in the network. The three metrics a RON nodes keeps track of for each link are latency, loss, and throughput, and nodes frequently probe virtual links to ensure they're still usable.
In evaluating RON, the authors found that in one scenario RON was able to fully overcome outages (defined as packet loss rates above some given threshold for a certain length of time) and up to 60% of outages in the second set, the difference for which the authors don't discuss much, brushing it off as differences in the setup and also claiming that the 40% were due to sites not being reachable by any other RON site. It's not clear to me that the problem of connectivity posed by BGP's slow recovery justifies the use of RON, which will not scale well and also might not play well with NATs.
Tuesday, October 27, 2009
On Inferring Autonomous System Relationships in the Internet; Gao
This paper discusses ASes and inferring the relationships between them. It formally defines the concepts of peering, provider-customer, and sibling ASes based on the notion of transiting on behalf of other ASes. The author uses the fact that the selective export rule provides valley-free paths to establish 6 possible patterns an AS path can fit (e.g., uphill, downhill, uphill->downhill, uphill->p2p, p2p->downhill, uphill->p2p->downhill).
The author introduces maximal uphill and downhill paths, which are respectively the longest uphill path in an AS path (customer->provider and sibling->sibling edges only) and the remaining path that isn't the maximal uphill path. Relationships between pairs of ASes on a path can be determined if the top (topmost) providers for the uphill and downhill directions are known for these maximal paths.
The degree of an AS can be estimated by its size, allowing us to find the top provider. Prior to the top provider, all relationships are customer->provider or sibling relationships, and all following it are provider->customer or sibling relationships; these can be further refined based on transiting knowledge obtained from the ordering of ASes in a path.
Using heuristic algorithms, the author finds that out of all AS relationships from BGP routing tables taken at 3 different dates, few (1-2%) are siblings, a few more (~8%) are peering, and the remainder are customer-provider. He verifes his inferences with information from AT&T and the WHOIS lookup service, finding that their customer-provider inferences are correct, their peering inferences are generally correct, and sibling inferences are sometimes off.
This paper presented a cool, simple method for inferring AS relationships. However, it all hinges on the notion that a provider is larger in size than its client, which seems like a generally safe assumption to make, though it would be nice if the author had done some verifications of his own.
The author introduces maximal uphill and downhill paths, which are respectively the longest uphill path in an AS path (customer->provider and sibling->sibling edges only) and the remaining path that isn't the maximal uphill path. Relationships between pairs of ASes on a path can be determined if the top (topmost) providers for the uphill and downhill directions are known for these maximal paths.
The degree of an AS can be estimated by its size, allowing us to find the top provider. Prior to the top provider, all relationships are customer->provider or sibling relationships, and all following it are provider->customer or sibling relationships; these can be further refined based on transiting knowledge obtained from the ordering of ASes in a path.
Using heuristic algorithms, the author finds that out of all AS relationships from BGP routing tables taken at 3 different dates, few (1-2%) are siblings, a few more (~8%) are peering, and the remainder are customer-provider. He verifes his inferences with information from AT&T and the WHOIS lookup service, finding that their customer-provider inferences are correct, their peering inferences are generally correct, and sibling inferences are sometimes off.
This paper presented a cool, simple method for inferring AS relationships. However, it all hinges on the notion that a provider is larger in size than its client, which seems like a generally safe assumption to make, though it would be nice if the author had done some verifications of his own.
Monday, October 26, 2009
On Power-Law Relationships of the Internet Topology; Faloutsos, Faloutsos, & Faloutsos
This paper presents empirically-discovered power laws that can shine some light on traits of the Internet topology including neighbors and inter-domain connectivity. These laws are based on data about the number of ASes present between 1997 and 1998.
A few power laws were uncovered. The first involves the rank exponent, which states that the outdegree (number of edges incident to a node) is proportional to the rank of the node (index of node sorted by decreasing outdegree) raised to the constant R (slope of outdegree vs. rank of nodes, which can differ per graph family). Through this power law, the authors derive an estimation for the number of edges in a graph, which in practical terms can give an idea of connections between domains in the Internet.
A second power law states that the frequency of an outdegree (number of nodes with that outdegree) is proportional to the outdegree raised to the constant O (slope of frequency vs. outdegree). Apparently the O value is fairly constant regardless of the graph family. I found this law pretty amazing, as I would have expected much more randomness in outdegree frequency. In particular, the authors make a fairly strong statement about using this law to test realism of a topology, that if the exponent is far off or if the topology doesn't follow the law, "it probably does not represent a real topology."
A less powerful "law" (deemed an approximation because it consists of only 4 data points) deals with neighboring connectivity, stating that the number of pairs of nodes within some number of hops is proportional to the number of hops raised to the power of H (slope of pairs of nodes within some number of hops vs. number of hops). One of the four graphs actually had enough data to show any evidence of this approximation; the others were pretty worthless.
The final power law states that the eigenvalues of a graph are proportional to the order (in decreasing eigenvalue sequence) raised to the power of E (slope of sorted eigenvalues vs. order). This law is much more abstract to me; I'm not quite sure what the practical ramifications of it are.
I didn't think I'd like this paper initially, thinking it would be heavy on the math and light on applicability, but with each power law, the authors presented some ways to use the law, such as for estimating the number of edges, neighbors, and diameter of a graph. Additionally, these laws can be used to verify the realism of proposed topologies. In using these laws to make predictions for the future, however, they are basing estimates on a little over one year's worth of data from the late 90s. My intuition is that their guesses are far too low, but I would be interested in seeing current numbers for the Internet.
A few power laws were uncovered. The first involves the rank exponent, which states that the outdegree (number of edges incident to a node) is proportional to the rank of the node (index of node sorted by decreasing outdegree) raised to the constant R (slope of outdegree vs. rank of nodes, which can differ per graph family). Through this power law, the authors derive an estimation for the number of edges in a graph, which in practical terms can give an idea of connections between domains in the Internet.
A second power law states that the frequency of an outdegree (number of nodes with that outdegree) is proportional to the outdegree raised to the constant O (slope of frequency vs. outdegree). Apparently the O value is fairly constant regardless of the graph family. I found this law pretty amazing, as I would have expected much more randomness in outdegree frequency. In particular, the authors make a fairly strong statement about using this law to test realism of a topology, that if the exponent is far off or if the topology doesn't follow the law, "it probably does not represent a real topology."
A less powerful "law" (deemed an approximation because it consists of only 4 data points) deals with neighboring connectivity, stating that the number of pairs of nodes within some number of hops is proportional to the number of hops raised to the power of H (slope of pairs of nodes within some number of hops vs. number of hops). One of the four graphs actually had enough data to show any evidence of this approximation; the others were pretty worthless.
The final power law states that the eigenvalues of a graph are proportional to the order (in decreasing eigenvalue sequence) raised to the power of E (slope of sorted eigenvalues vs. order). This law is much more abstract to me; I'm not quite sure what the practical ramifications of it are.
I didn't think I'd like this paper initially, thinking it would be heavy on the math and light on applicability, but with each power law, the authors presented some ways to use the law, such as for estimating the number of edges, neighbors, and diameter of a graph. Additionally, these laws can be used to verify the realism of proposed topologies. In using these laws to make predictions for the future, however, they are basing estimates on a little over one year's worth of data from the late 90s. My intuition is that their guesses are far too low, but I would be interested in seeing current numbers for the Internet.
Sunday, October 25, 2009
White Space Networking with Wi-Fi like Connectivity; Bahl, Chandra, Moscibroda, Murty, & Welsh
This paper discusses WhiteFi, a Wi-Fi like protocol meant for use in UHF white spaces. The difficulty in configuring networks in the UHF realm is efficient channel assignment and Access Point (AP) discovery without causing interference from packet transmissions. WhiteFi must deal with spatial/temporal variation (different channels available in different locations/times) and spectrum fragmentation (non-contiguous channels).
WhiteFi uses special hardware called KNOWS which consists of a UHF translator that translates Wi-Fi signal to the proper UHF band, scanner that searches for signals, and processor that performs a FFT on the incoming signals. APs select the channel to operate on by finding the channel that maximizes the metric MCham, their expected share of the channel as calculated from all associated clients; channel selection happens periodically.
Variable channel widths complicate the process of clients finding an AP. Instead of scanning every possible channel at every channel width, clients scan a subset of channels and use the SIFT technique to detect an AP in hardware. Moving this process from software to hardware and reducing the number of scans improved performance significantly in simulation. One important aspect of channel assignment the paper neglects is incumbent detection; the authors bank on future research solving this problem and assume that part is dealt with.
When an AP or client senses an incumbent on the channel, it disconnects. WhiteFi handles these disconnections via a chirping protocol. In the APs normal beacons, it advertises a backup channel. In the case of disconnection, AP and client switch to the backup channel and transmit chirps until a new channel can be decided selected.
This paper had a number of aspects that were mysterious to me. First off, I didn't have much previous knowledge of the UHF spectrum, and the paper's introduction didn't really help that much. (Thanks to Wikipedia for a very helpful summary and description of UHF.) Secondly, why all this focus on wireless microphones? Out of all the electronic equipment out there, are these really the main concern in terms of wireless transmitters that could be impacted by interference from packets? Maybe it's just that in particular they are prevalently used with difficult-to-predict usage. An unfortunate fact of research on the UHF spectrum is that full-scale evaluations cannot be done without FCC approval, so simulations have to suffice.
WhiteFi uses special hardware called KNOWS which consists of a UHF translator that translates Wi-Fi signal to the proper UHF band, scanner that searches for signals, and processor that performs a FFT on the incoming signals. APs select the channel to operate on by finding the channel that maximizes the metric MCham, their expected share of the channel as calculated from all associated clients; channel selection happens periodically.
Variable channel widths complicate the process of clients finding an AP. Instead of scanning every possible channel at every channel width, clients scan a subset of channels and use the SIFT technique to detect an AP in hardware. Moving this process from software to hardware and reducing the number of scans improved performance significantly in simulation. One important aspect of channel assignment the paper neglects is incumbent detection; the authors bank on future research solving this problem and assume that part is dealt with.
When an AP or client senses an incumbent on the channel, it disconnects. WhiteFi handles these disconnections via a chirping protocol. In the APs normal beacons, it advertises a backup channel. In the case of disconnection, AP and client switch to the backup channel and transmit chirps until a new channel can be decided selected.
This paper had a number of aspects that were mysterious to me. First off, I didn't have much previous knowledge of the UHF spectrum, and the paper's introduction didn't really help that much. (Thanks to Wikipedia for a very helpful summary and description of UHF.) Secondly, why all this focus on wireless microphones? Out of all the electronic equipment out there, are these really the main concern in terms of wireless transmitters that could be impacted by interference from packets? Maybe it's just that in particular they are prevalently used with difficult-to-predict usage. An unfortunate fact of research on the UHF spectrum is that full-scale evaluations cannot be done without FCC approval, so simulations have to suffice.
Interactive WiFi Connectivity for Moving Vehicles; Balasubramanian, Mahajan, Venkataramani, Levine, & Zahorjan
This paper describes ViFi, a protocol for minimizing interruptions in connectivity of WiFi in moving vehicles. The authors describe six policies for handoffs between base stations: RSSI (highest signal strength BS), BRR (highest beacon reception ratio), Sticky (stay with BS until connectivity drops too low), History (BS with historically best performance), BestBS (BS that will be best in future), and AllBSes (utilize all BSes in range). AllBSes represents an upper bound to performance; if packets are lost for one BS, there will likely be another that received them, particularly in situations involving bursty loss.
ViFi works by having the vehicle designate a BS as its anchor, acting as the vehicle's gateway to the Internet, and other BSes in range of the vehicle are auxiliaries. The vehicle periodically broadcasts beacons that contain info about the current anchor and auxiliaries as well as the previous anchor. ViFi is actually deployed in VanLAN, and the authors use traces collected from DieselNet for simulations as well.
ViFi employs the techniques of diversity and salvaging to improve session connections; diversity is the use of multiple basestations to deliver packets, and salvaging is the transmitting of unacknowledged data from the Internet held in previous anchors. Some of the difficulties in ViFi are similar to those in the ExOR system, including coordination between nodes.
The algorithm for transmission works as follows: if the intended destination receives a packet, it sends an ACK. Any auxiliaries that overhear the packet and don't hear the ACK within a certain period probabilistically relay the packet. These relaying auxiliaries must take into account other auxiliaries that might be relaying as well as quality of connectivity, all while trying to keep the expected number of relays to a minimum. Packet reception rates are based on the beacons the vehicle sends.
ViFi doesn't do any packet reordering, as the authors claim it's not necessary for good performance, but it might have been nice to see some numbers about how frequent reordering is. The graphs involving session length took me a little longer to understand than I would have liked; I think there may be more intuitive ways of showing how good the connections were.
On the whole, ViFi seems like a pretty good idea and gets good results for short TCP transfers and VoIP sessions. I wonder how representative of a typical vehicle workload these tests are, though. Certainly short TCP transfers as from web surfing makes sense, but how often are people Skyping on a Microsoft shuttle bus? I liked that the paper talked about the limitations of ViFi; unfortunately, it sounds like large-scale deployment would be very difficult, and if ViFi were only implemented in large islands, maintaining connectivity could be complex.
ViFi works by having the vehicle designate a BS as its anchor, acting as the vehicle's gateway to the Internet, and other BSes in range of the vehicle are auxiliaries. The vehicle periodically broadcasts beacons that contain info about the current anchor and auxiliaries as well as the previous anchor. ViFi is actually deployed in VanLAN, and the authors use traces collected from DieselNet for simulations as well.
ViFi employs the techniques of diversity and salvaging to improve session connections; diversity is the use of multiple basestations to deliver packets, and salvaging is the transmitting of unacknowledged data from the Internet held in previous anchors. Some of the difficulties in ViFi are similar to those in the ExOR system, including coordination between nodes.
The algorithm for transmission works as follows: if the intended destination receives a packet, it sends an ACK. Any auxiliaries that overhear the packet and don't hear the ACK within a certain period probabilistically relay the packet. These relaying auxiliaries must take into account other auxiliaries that might be relaying as well as quality of connectivity, all while trying to keep the expected number of relays to a minimum. Packet reception rates are based on the beacons the vehicle sends.
ViFi doesn't do any packet reordering, as the authors claim it's not necessary for good performance, but it might have been nice to see some numbers about how frequent reordering is. The graphs involving session length took me a little longer to understand than I would have liked; I think there may be more intuitive ways of showing how good the connections were.
On the whole, ViFi seems like a pretty good idea and gets good results for short TCP transfers and VoIP sessions. I wonder how representative of a typical vehicle workload these tests are, though. Certainly short TCP transfers as from web surfing makes sense, but how often are people Skyping on a Microsoft shuttle bus? I liked that the paper talked about the limitations of ViFi; unfortunately, it sounds like large-scale deployment would be very difficult, and if ViFi were only implemented in large islands, maintaining connectivity could be complex.
Wednesday, October 21, 2009
Thursday, October 15, 2009
XORs in the Air: Practical Wireless Network Coding; Katti, Rahul, Hu, Katabi, Medard, & Crowcroft
This paper describes COPE, a forwarding architecture that employs the wireless coding technique of XORing multiple packets together at a router to reduce the number of total transmissions. COPE sits between the IP and MAC layers and decides which packets it can encode together to maximize the number of packets sent in one transmission while at the same time ensuring that nexthops have enough information to actually decode the packet intended for it.
Nodes know which packets its neighbors have via reception reports that nodes broadcast frequently; however, this method fails in congestive periods, at which point nodes are left to guess which packets neighbors have based on routing metrics like ETX. The topmost principle used by COPE is to never delay packets waiting around for other packets it can code with that at the head of its queue. Additionally, COPE tries to send packets of the same length, which equates to padding shorter packets with zeroes if needed and queueing short and long packets in two different virtual queues per nexthop.
COPE adds a packet header that includes IDs of the native packets and their nexthops, as well as reception reports and cumulative ACKs. Even with this additional overhead, COPE can provide an increase in goodput in addition to an increase in coding gain. The hidden terminal scenario of wireless hurts COPE's TCP goodput, but when hidden terminals are disallowed (hence removing lots of collisions) COPE can improve TCP's goodput by over 30%.
I was surprised to see that packets were coded between 20-50% of the time based on guessing. Since COPE seems to perform well with guessing, it makes me wonder if reception reports are even really necessary at all. An interesting point of this coding scheme is that because multiple packets destined for the same nexthop can't be coded together, it opens up the fairness realm for packets destined towards a neighbor that might not oterhwise have been able to capture the media for a while. I like the simple notion behind this paper of using XOR operations to reduce the number of packets transmitted, though as the paper admits, this technique will not work so well for a network that is energy-limited.
Nodes know which packets its neighbors have via reception reports that nodes broadcast frequently; however, this method fails in congestive periods, at which point nodes are left to guess which packets neighbors have based on routing metrics like ETX. The topmost principle used by COPE is to never delay packets waiting around for other packets it can code with that at the head of its queue. Additionally, COPE tries to send packets of the same length, which equates to padding shorter packets with zeroes if needed and queueing short and long packets in two different virtual queues per nexthop.
COPE adds a packet header that includes IDs of the native packets and their nexthops, as well as reception reports and cumulative ACKs. Even with this additional overhead, COPE can provide an increase in goodput in addition to an increase in coding gain. The hidden terminal scenario of wireless hurts COPE's TCP goodput, but when hidden terminals are disallowed (hence removing lots of collisions) COPE can improve TCP's goodput by over 30%.
I was surprised to see that packets were coded between 20-50% of the time based on guessing. Since COPE seems to perform well with guessing, it makes me wonder if reception reports are even really necessary at all. An interesting point of this coding scheme is that because multiple packets destined for the same nexthop can't be coded together, it opens up the fairness realm for packets destined towards a neighbor that might not oterhwise have been able to capture the media for a while. I like the simple notion behind this paper of using XOR operations to reduce the number of packets transmitted, though as the paper admits, this technique will not work so well for a network that is energy-limited.
ExOR: Opportunistic Multi-Hop Routing for Wirless Networks; Biswas & Morris
ExOR is a routing technique that broadcasts batches of packets to some set of nodes, out of which the "best" receiver proceeds to send the packets it received, with other receivers taking turns to fill in the packets the "best" one left out. In this way, ExOR can send packets with a smaller number of retransmissions than via traditional routing. Since a packet is forwarded by a single node, ExOR works with existing radio hardware, unlike many cooperative diversity schemes, which is why I assume the paper didn't attempt to compare any cooperative diversity techniques to ExOR.
ExOR uses a forwarder list that contains an ordering of when receivers should send their packets. Receivers "closer" to the destination send their packets sooner than those farther away, and nodes further down in the forwarder list decide when to transmit as-of-yet-unsent packets based on a predicted time of how long it takes higher priority nodes to send. Nodes pass around a batch map with each packet that contains information about which packets have been transmitted by which nodes, so packets won't be retransmitted unnecessarily. This method adds a bit of overhead since it's a map of all packets in a batch, sent in every packet, but this overhead doesn't seem to be all that detrimental to throughput based on an experiment that varied batch size.
ExOR only guarantees 90% packet arrival and cannot be paired with reliable transport such as TCP because its batching scheme might not work with TCP's chosen window size. After 90% of packets have reached the destination, ExOR finishes the batch using traditional routing.
The authors compare ExOR to traditional source routing using the ETX metric in Roofnet. They enforce the "store and forward" model at the level of entire files to prevent multiple nodes from sending at a time. A strength ExOR has over ETX-based source routing is that ExOR can make use of asymmetric links because the batch map doesn't have to flow back over the same path the packets used, though how the path is chosen in that case the authors don't say. ExOR has over 2x the throughput with mostly fewer packet transmissions; its hops can travel longer distances, hence bypassing some intermediate transmissions.
One complaint I have about the paper is an aesthetic one: the figures were laid out out-of-order, which made it difficult to quickly find the figure the text was describing.
ExOR uses a forwarder list that contains an ordering of when receivers should send their packets. Receivers "closer" to the destination send their packets sooner than those farther away, and nodes further down in the forwarder list decide when to transmit as-of-yet-unsent packets based on a predicted time of how long it takes higher priority nodes to send. Nodes pass around a batch map with each packet that contains information about which packets have been transmitted by which nodes, so packets won't be retransmitted unnecessarily. This method adds a bit of overhead since it's a map of all packets in a batch, sent in every packet, but this overhead doesn't seem to be all that detrimental to throughput based on an experiment that varied batch size.
ExOR only guarantees 90% packet arrival and cannot be paired with reliable transport such as TCP because its batching scheme might not work with TCP's chosen window size. After 90% of packets have reached the destination, ExOR finishes the batch using traditional routing.
The authors compare ExOR to traditional source routing using the ETX metric in Roofnet. They enforce the "store and forward" model at the level of entire files to prevent multiple nodes from sending at a time. A strength ExOR has over ETX-based source routing is that ExOR can make use of asymmetric links because the batch map doesn't have to flow back over the same path the packets used, though how the path is chosen in that case the authors don't say. ExOR has over 2x the throughput with mostly fewer packet transmissions; its hops can travel longer distances, hence bypassing some intermediate transmissions.
One complaint I have about the paper is an aesthetic one: the figures were laid out out-of-order, which made it difficult to quickly find the figure the text was describing.
Thursday, October 8, 2009
A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols; Broch, Maltz, Johnson, Hu, & Jetcheva
This paper presents performance comparisons of 4 routing protocols for ad hoc wireless networks: DSDV, TORA, DSR, and AODV. It simulates each in ns-2 for use in a network of 50 moving nodes.
Destination-Sequenced Distance Vector (DSDV) is a distance vector protocol that must frequently broadcast its route updates. It uses sequence numbers for choosing new routes such that if a route comes in with a greater sequence number or equal sequence number and lower metric, it selects the new route. Temporally-Ordered Routing Algorithm (TORA) works to quickly establish a route with little overhead using a "reverse link" algorithm, which can sacrifice route optimality.
Dynamic Source Routing (DSR) places the route a packet should take into the actual packet's header, keeping intermediate nodes from actually needing to maintain any routing info, though nodes can cache routes they've seen or overheard to prevent too many route requests from flooding the network. Ad Hoc On-Demand Distance Vector (AODV) is a cross between DSR and DSDV. It sends out route requests which return with number of hops and sequence numbers, and state is maintained on a hop-by-hop basis, with each node remembering next hop info only.
Simulations were run for different amounts of movement of the 50 nodes (from continuous movement to no movement at all) with 10 different patterns of movement for each of 7 different "pause time" values. The authors also varied the movement speed, sending rates, packet size, and number of CBR sources. Also of note is that TCP was not used in the simulation. Three metrics of comparison were used for the different scenarios: packet delivery ratio, routing overhead (total transmitted packets), and path optimality (difference of length for path taken and known shortest path).
In dealing with mobility, DSDV does poorly with short (< 300 s) of pause time, but DSR and AODV provide high packet delivery ratios. DSR generally has the lowest amount of overhead, with TORA having the most, and DSDV has a constant amount regardless of mobility because it is periodic in its broadcasts. The number of CBR sources only appears to affect TORA in terms of packet delivery, though there is a large amount of variability that is not captured by the graphs.
DSR has the least overhead from increased sources in terms of packets, though since each of its packets contains more bytes to deliver route information, it loses to AODV with regards to least sized overhead. I liked this paper a lot because it really took into account a lot of different scenarios for comparing these different protocols; unfortunately, since it was all in simulation, it's hard to say if these tests are really representative of these protocols in the wild.
Destination-Sequenced Distance Vector (DSDV) is a distance vector protocol that must frequently broadcast its route updates. It uses sequence numbers for choosing new routes such that if a route comes in with a greater sequence number or equal sequence number and lower metric, it selects the new route. Temporally-Ordered Routing Algorithm (TORA) works to quickly establish a route with little overhead using a "reverse link" algorithm, which can sacrifice route optimality.
Dynamic Source Routing (DSR) places the route a packet should take into the actual packet's header, keeping intermediate nodes from actually needing to maintain any routing info, though nodes can cache routes they've seen or overheard to prevent too many route requests from flooding the network. Ad Hoc On-Demand Distance Vector (AODV) is a cross between DSR and DSDV. It sends out route requests which return with number of hops and sequence numbers, and state is maintained on a hop-by-hop basis, with each node remembering next hop info only.
Simulations were run for different amounts of movement of the 50 nodes (from continuous movement to no movement at all) with 10 different patterns of movement for each of 7 different "pause time" values. The authors also varied the movement speed, sending rates, packet size, and number of CBR sources. Also of note is that TCP was not used in the simulation. Three metrics of comparison were used for the different scenarios: packet delivery ratio, routing overhead (total transmitted packets), and path optimality (difference of length for path taken and known shortest path).
In dealing with mobility, DSDV does poorly with short (< 300 s) of pause time, but DSR and AODV provide high packet delivery ratios. DSR generally has the lowest amount of overhead, with TORA having the most, and DSDV has a constant amount regardless of mobility because it is periodic in its broadcasts. The number of CBR sources only appears to affect TORA in terms of packet delivery, though there is a large amount of variability that is not captured by the graphs.
DSR has the least overhead from increased sources in terms of packets, though since each of its packets contains more bytes to deliver route information, it loses to AODV with regards to least sized overhead. I liked this paper a lot because it really took into account a lot of different scenarios for comparing these different protocols; unfortunately, since it was all in simulation, it's hard to say if these tests are really representative of these protocols in the wild.
A High-Throughput Path Metric for Multi-Hop Wireless Routing; De Couto, Aguayo, Bicket, & Morris
This paper presents expected transmission count (ETX) as a metric for routing in wireless networks. ETX attempts to minimize the total number of transmissions (including re-transmits) required to deliver a packet. It also accounts for asymmetric links. The hope is that a minimum number of transmissions equates to maximum possible throughput.
The authors actually test this routing metric on a real setup in an office building. They compare ETX's performance to to that of min-hop count. In fact, they spend a significant amount of time trying to convince the reader that minimum hop count doesn't always do the best job, which I thought was handled succinctly enough by the first example they presented. I assume they were having to fight very hard to get min-hop count enthusiasts to come over to their side.
ETX works by having each node calculate the forward and reverse delivery ratios along each of its links. Nodes broadcast link probes once every set period of time, and nodes at the other end of said link keep count of received probes. There is a danger of probes not being able to be sent along the link, affecting the neighbors' ETX values.
The evaluation of ETX using the DSDV routing protocol shows that ETX does only a little better than min-hop count in the context of 1 mW radio with small packets. When larger packets (1,386 instead of 134 bytes) are used, ETX doesn't outperform min-hop count, apparently because it overestimates link quality because it uses small packets for the probes, and these smaller packets are more likely to be delivered successfully. Transmitting at a higher power also hurts ETX's relative performance compared to min-hop since nodes require fewer hops to reach one another. I wasn't particularly impressed by the performance benefits of ETX over min-hop count, and especially not since we already read about ETT which was developed by essentially the same group of authors after this metric and seemed to do well enough.
The authors actually test this routing metric on a real setup in an office building. They compare ETX's performance to to that of min-hop count. In fact, they spend a significant amount of time trying to convince the reader that minimum hop count doesn't always do the best job, which I thought was handled succinctly enough by the first example they presented. I assume they were having to fight very hard to get min-hop count enthusiasts to come over to their side.
ETX works by having each node calculate the forward and reverse delivery ratios along each of its links. Nodes broadcast link probes once every set period of time, and nodes at the other end of said link keep count of received probes. There is a danger of probes not being able to be sent along the link, affecting the neighbors' ETX values.
The evaluation of ETX using the DSDV routing protocol shows that ETX does only a little better than min-hop count in the context of 1 mW radio with small packets. When larger packets (1,386 instead of 134 bytes) are used, ETX doesn't outperform min-hop count, apparently because it overestimates link quality because it uses small packets for the probes, and these smaller packets are more likely to be delivered successfully. Transmitting at a higher power also hurts ETX's relative performance compared to min-hop since nodes require fewer hops to reach one another. I wasn't particularly impressed by the performance benefits of ETX over min-hop count, and especially not since we already read about ETT which was developed by essentially the same group of authors after this metric and seemed to do well enough.
Tuesday, October 6, 2009
Architecture and Evaluation of an Unplanned 802.11b Mesh Network; Bickets, Aguayo, Biswas, & Morris
There are 2 families of community wireless networks: well-planned ones with good throughput and connectivity, and unplanned ones with minimal setup required but with worse coverage. This paper proposes the architecture for a wireless network that combines the best of both worlds, namely a network that's quick and easy to setup but that still gets good performance; this architecture consists of unconstrained nodes, omni directional antennae, and link-quality-aware multi-hop routing optimized for throughput. It examines the feasibility of this architecture as implemented in Roofnet, consisting of 37 roof-mounted antennae nodes over 1.5 square miles of Cambridge, MA.
Each node has a non-globally routable IP address and assigns IP addresses via DHCP to hosts attached to its Ethernet ports. Hosts can communicate with one another via their nodes due to the NAT between the Ethernet and Roofnet at each node. There are 4 nodes serving as gateways and acting as NATs from Roofnet to the Internet.
Routes are chosen through Roofnet by Srcr, which chooses based on lowest estimated transmission time (ETT), which presumably corresponds to highest throughput. Srcr tends to favor more short fast hops over longer-distance hops. Additionally, SampleRate selects the best bit-rate to send data at over each link.
The authors evaluate the throughput of Roofnet based on all pairs of nodes within the network. The average throughput was around 600 kbits/sec, with throughput being very closely tied to number of hops between the two. Interestingly, 10% of node pairings failed to find routes between them due to link losses that weren't overcomable by Srcr's flooded queries repeated after 5 seconds. Use of omni-directional antennae appears to be useful because most nodes seem to vary their first hop. While Roofnet is fairly robust to loss of links, removing only two of the best-connected nodes cuts throughput in half. However, in evaluating Roofnet, the authors did not affect Internet access of Roofnet users, meaning that non-experiment traffic could have skewed the findings.
Each node has a non-globally routable IP address and assigns IP addresses via DHCP to hosts attached to its Ethernet ports. Hosts can communicate with one another via their nodes due to the NAT between the Ethernet and Roofnet at each node. There are 4 nodes serving as gateways and acting as NATs from Roofnet to the Internet.
Routes are chosen through Roofnet by Srcr, which chooses based on lowest estimated transmission time (ETT), which presumably corresponds to highest throughput. Srcr tends to favor more short fast hops over longer-distance hops. Additionally, SampleRate selects the best bit-rate to send data at over each link.
The authors evaluate the throughput of Roofnet based on all pairs of nodes within the network. The average throughput was around 600 kbits/sec, with throughput being very closely tied to number of hops between the two. Interestingly, 10% of node pairings failed to find routes between them due to link losses that weren't overcomable by Srcr's flooded queries repeated after 5 seconds. Use of omni-directional antennae appears to be useful because most nodes seem to vary their first hop. While Roofnet is fairly robust to loss of links, removing only two of the best-connected nodes cuts throughput in half. However, in evaluating Roofnet, the authors did not affect Internet access of Roofnet users, meaning that non-experiment traffic could have skewed the findings.
Sunday, October 4, 2009
Modeling Wireless Links for Transport Protocols; Gurtov & Floyd
This paper describes models for wireless links. The 3 types of links considered are cellular (low bandwidth, high latency), wireless LANs (low latency), and satellite (high latency, low bandwidth). The topology given consideration is one in which only one endpoint of a path is a wireless link. The authors say that wireless links on both ends is an unlikely scenario except for the "future" possibility of VoIP being used for telephone calls, which is clearly the case 5 years later.
In considering different types of links, it's also important to keep in mind that each link has a different usage pattern. For example, cellular users will likely be involved in smaller transfers of data, and the ratio of downlink to uplink will vary across the 3 link types.
The paper points to four problems that exist in current modelling techniques. The first problem is that many models are simply unrealistic. Use of TCP over high-loss wireless links, deprecated versions of TCP (possibly due to ease of use in simulators), and retransmission constants for TCP are all representative of practices that just aren't carried out in wireless networks.
The second problem is that some models are realistic but only explore a limited parameter space. For instance, the evaluation might be done using a single flow without regard for the possibility of multiple flows, or the model might employ reverse channel allocation for ACKs without taking into account effects on performance if the ACKs are delayed. On a different extreme, some models are overly realistic in that some use bad implementations with known flaws of components for the simulation. While this practice would be fine for evaluating the badness of said component, it does not generalize well.
The last problem listed is the lack of reproducibility of some models. This problem consists of researchers not fully specifying their experimental settings, which is not so much problem of the model as of poor documentation on the researcher's part.
The paper suggests ideas about how to model a number of different tricky scenarios, including error losses/corruption, delay variation, packet reordering, bandwidth changes and assymetry, and mobility/handover implementation. Graphs showing the effects of mobility show the extent to which small changes like buffer size can affect the accuracy of a graph; in fact, none of the 3 simulations really capture what happens in the actual measurements, which almost seems to suggest that all this modeling is a waste of time. It seems like modeling is only a good idea when a researcher is trying to get a general idea of how new changes will affect a network but certainly not when trying to get an accurate picture of performance.
In considering different types of links, it's also important to keep in mind that each link has a different usage pattern. For example, cellular users will likely be involved in smaller transfers of data, and the ratio of downlink to uplink will vary across the 3 link types.
The paper points to four problems that exist in current modelling techniques. The first problem is that many models are simply unrealistic. Use of TCP over high-loss wireless links, deprecated versions of TCP (possibly due to ease of use in simulators), and retransmission constants for TCP are all representative of practices that just aren't carried out in wireless networks.
The second problem is that some models are realistic but only explore a limited parameter space. For instance, the evaluation might be done using a single flow without regard for the possibility of multiple flows, or the model might employ reverse channel allocation for ACKs without taking into account effects on performance if the ACKs are delayed. On a different extreme, some models are overly realistic in that some use bad implementations with known flaws of components for the simulation. While this practice would be fine for evaluating the badness of said component, it does not generalize well.
The last problem listed is the lack of reproducibility of some models. This problem consists of researchers not fully specifying their experimental settings, which is not so much problem of the model as of poor documentation on the researcher's part.
The paper suggests ideas about how to model a number of different tricky scenarios, including error losses/corruption, delay variation, packet reordering, bandwidth changes and assymetry, and mobility/handover implementation. Graphs showing the effects of mobility show the extent to which small changes like buffer size can affect the accuracy of a graph; in fact, none of the 3 simulations really capture what happens in the actual measurements, which almost seems to suggest that all this modeling is a waste of time. It seems like modeling is only a good idea when a researcher is trying to get a general idea of how new changes will affect a network but certainly not when trying to get an accurate picture of performance.
Thursday, October 1, 2009
A Comparison of Mechanisms for Improving TCP Performance over Wireless Links; Balakrishnan, Padmanabhan, Seshan, & Katz
This paper examines three different ways to improve TCP performance for wireless networks. Loss might equate to congestion in wired networks, but with wireless, the story is different. Assuming that loss is due to congestion can hinder performance by causing the connection to revert back to slow start and undertake recovery mechanisms. As such, there are three different techniques to improve wireless performance with TCP: end-to-end in which the sender is forced to handle the losses gracefully, split-connection in which the wireless link is hidden from the sender by connecting it to the base station, and link-layer solutions which also attempt to hide losses from the sender but via local retransmission and error correction.
This paper uses a wireless testbed of PC base stations with laptop mobile hosts. A sort of odd choice made for the tests was for bit-errors to be randomly chosen using an exponential distribution. Apparently this choice was made to allow for multiple packets to be lost in close proximity to one another, allowing the authors to really compare the techniques in this sort of worst-case loss scenario, and they actually state that this distribution was not intended to actually model a network.
Some options that improve end-to-end include partial and selective acks (SACKs) along with explicit loss notification (ELN), which prevents large congestion window fluctuations, with the greatest improvements combining SACKS and ELN. SACKs give a sender more information about which packets have actually made it through and ELN involves marking acks with the note that a loss happened so that retransmission of lost packets can occur without moving into congestion-control mode.
Split-connection protocols use a wired connection between a TCP source and a base station and a wireless connection between the base station and TCP receiver; two TCP connections are actually established, and since the wired connection doesn't experience loss, the base station need be the only one doing any retransmitting. However, since the wireless connection is the bottleneck for the completion of transmission, it still requires that some special mechanisms be used to improve performance.
The authors find that in general, link-layer protocols tend to give the best performance improvements, but particularly so if they actually do fine-grained retransmissions (and hence actually keep the sender from having to retransmit as well) and if they deliver packets in-order after a loss, thus preventing duplicate acks from being sent and causing fast retransmit/recovery. Interestingly, in face of low error rates, all techniques seem to do equally well.
This paper uses a wireless testbed of PC base stations with laptop mobile hosts. A sort of odd choice made for the tests was for bit-errors to be randomly chosen using an exponential distribution. Apparently this choice was made to allow for multiple packets to be lost in close proximity to one another, allowing the authors to really compare the techniques in this sort of worst-case loss scenario, and they actually state that this distribution was not intended to actually model a network.
Some options that improve end-to-end include partial and selective acks (SACKs) along with explicit loss notification (ELN), which prevents large congestion window fluctuations, with the greatest improvements combining SACKS and ELN. SACKs give a sender more information about which packets have actually made it through and ELN involves marking acks with the note that a loss happened so that retransmission of lost packets can occur without moving into congestion-control mode.
Split-connection protocols use a wired connection between a TCP source and a base station and a wireless connection between the base station and TCP receiver; two TCP connections are actually established, and since the wired connection doesn't experience loss, the base station need be the only one doing any retransmitting. However, since the wireless connection is the bottleneck for the completion of transmission, it still requires that some special mechanisms be used to improve performance.
The authors find that in general, link-layer protocols tend to give the best performance improvements, but particularly so if they actually do fine-grained retransmissions (and hence actually keep the sender from having to retransmit as well) and if they deliver packets in-order after a loss, thus preventing duplicate acks from being sent and causing fast retransmit/recovery. Interestingly, in face of low error rates, all techniques seem to do equally well.
Monday, September 28, 2009
MACAW: A Media Access Protocol for Wireless LANs; Bharghavan, Demers, Shenker, & Zhang
This paper discusses MACAW, a media access protocol for wireless LANs based off of MACA meant for use at Xerox PARC. Xerox PARC had a setup consisting of "base stations" connected via Ethernet and personal mobile "pad" computing devices. The authors observe that collision occurs at receivers, not senders, and hence, CSMA is inappropriate for the wireless network because it only takes into account senders' transmitting.
A major building block of MACA is a link-layer equivalent of the 3 way handshake: RTS, CTS, and DATA. Nodes hearing these requests hold off from transmitting so as not to interfere. Symmetry between nodes also plays big role: if a node can hear an RTS but not a CTS, it's within range of the sender but not the receiver and is free to transmit as it pleases because the receiver can't hear it as noise. If a node can hear the CTS but not RTS, it needs to hold off on transmitting for the receiver's sake.
MACA employs binary exponential backoff, which the authors deem an unfair backoff strategy because the more a sender backs off, the less likely it'll gain access to the channel, since a non backed-off sender will probabilistically keep having control. The authors propose to remedy this by setting all senders' backoffs to the same value after a successful transmission. This "copy" strategy is accomplished by adding a field to the packet headers that list the backoff; when a station hears a packet, it copies the packet's backoff value as its own, so all stations within range of each other have the same backoff value. Additionally, the authors implement multiplicative (~1.5) increase, linear decrease for the backoff scheme to prevent large, sudden oscillations in the backoff values. Copying seems like it could be a problem in that one value will copy itself across multiple groupings of bases/pads in which drastically different backoff values are called for.
MACAW allocates bandwidth on a per-stream basis rather than per station, using a queue per stream and per-stream backoff. It also concludes its RTS-CTS-DATA sequence with an ACK packet so that senders don't have to waste time waiting for TCP timeouts before retransmitting data. Overhead from this ack seems to be made up for when loss rates are greater than 1 in 1000 packets. Also, when the receiver is transmitting data already, it doesn't matter if the sender can send without interfering elsewhere; plus, if the sender were to attempt to transmit to a busy receiver, it would have to increment backoff, so the sender might as well wait. The way for a sender to know its potential receiver is busy is through the addition of a data-sending packet (DS) that comes between CTS and DATA.
However, there are still more problems to take care of, to which the authors propose yet another control packet (the RRTS), but it's not enough to cover all the open problems including multicast. In comparison to MACA, MACAW gets better throughput when the layout of senders and receivers creates lots of potential collisions, but it almost feels like MACAW employs a lot of hacks to get around corner cases noted by use of MACA and that any solution the authors create just opens the door to more problems needing solved.
A major building block of MACA is a link-layer equivalent of the 3 way handshake: RTS, CTS, and DATA. Nodes hearing these requests hold off from transmitting so as not to interfere. Symmetry between nodes also plays big role: if a node can hear an RTS but not a CTS, it's within range of the sender but not the receiver and is free to transmit as it pleases because the receiver can't hear it as noise. If a node can hear the CTS but not RTS, it needs to hold off on transmitting for the receiver's sake.
MACA employs binary exponential backoff, which the authors deem an unfair backoff strategy because the more a sender backs off, the less likely it'll gain access to the channel, since a non backed-off sender will probabilistically keep having control. The authors propose to remedy this by setting all senders' backoffs to the same value after a successful transmission. This "copy" strategy is accomplished by adding a field to the packet headers that list the backoff; when a station hears a packet, it copies the packet's backoff value as its own, so all stations within range of each other have the same backoff value. Additionally, the authors implement multiplicative (~1.5) increase, linear decrease for the backoff scheme to prevent large, sudden oscillations in the backoff values. Copying seems like it could be a problem in that one value will copy itself across multiple groupings of bases/pads in which drastically different backoff values are called for.
MACAW allocates bandwidth on a per-stream basis rather than per station, using a queue per stream and per-stream backoff. It also concludes its RTS-CTS-DATA sequence with an ACK packet so that senders don't have to waste time waiting for TCP timeouts before retransmitting data. Overhead from this ack seems to be made up for when loss rates are greater than 1 in 1000 packets. Also, when the receiver is transmitting data already, it doesn't matter if the sender can send without interfering elsewhere; plus, if the sender were to attempt to transmit to a busy receiver, it would have to increment backoff, so the sender might as well wait. The way for a sender to know its potential receiver is busy is through the addition of a data-sending packet (DS) that comes between CTS and DATA.
However, there are still more problems to take care of, to which the authors propose yet another control packet (the RRTS), but it's not enough to cover all the open problems including multicast. In comparison to MACA, MACAW gets better throughput when the layout of senders and receivers creates lots of potential collisions, but it almost feels like MACAW employs a lot of hacks to get around corner cases noted by use of MACA and that any solution the authors create just opens the door to more problems needing solved.
Sunday, September 27, 2009
Understanding TCP Incast Throughput Collapse in Datacenter Networks; Chen, Griffith, Liu, Katz, & Joseph
This paper examines TCP Incast in a datacenter environment and attempts to come up with an analytical model to understand it. The authors take an experimental approach to better understand the phenomenon by making a few modifications (such as changing minimum RTO and randomizing backoff) to see if any prevent incast collapse and then moving on from there. They found that the biggest improvements came from minimizing the minimum RTO.
This work is based on the original TCP Incast paper from CMU and used a workload such that each server sends a fixed amount of data; however, the CMU group came out with a new workload in which there is a constant amount of data to send split between some changing number of senders, meaning that the more senders there are, the less each sender actually has to send. However, large numbers of senders with this new workload could hide incast behavior because each sender only ends up sending a few packets. So this paper's authors used the original workload setup of a fixed amount of data per sender.
Some of the results obtained in this study differ from those made by the original CMU group (regarding the importance of high-res timers and turning off delay-acks), and the authors presume that this is due to the underlying network. The authors use both a small scale cluster and an experimental testbed setup, but my complaint is that they do not list how many machines are actually in these systems (though it appears to be at least 20 based on later graphs).
A number of experiments were run with varying RTO minimum values, with a few noteworthy results. There appear to be 3 phases to the incast problem: collapse, goodput recovery increase, and goodput decrease. Large numbers of timeouts occur with this initial phase, but then there are few timeout events as senders ramp up. However, at some point the timing of sending packets surpasses the RTO such that timeouts happen frequently, causing interference between senders retransmitting after a timeout and those waiting for a timeout. Hence, the third phase will happen after a longer time post-collapse for higher RTO minimum values.
Smaller RTO minimum values equate to shorter "goodput recovery" time, that is, time required to hit the maximum goodput. The time at which collapse turns into goodput recovery increase is the same regardless of the minimum RTO. While the CMU paper made much ado about using high resolution timers, it appears that the optimal setting is actually a low resolution RTO of 1ms paired with delayed acks turned on. An important difference to note between the two studies, however, is that the average observed RTT in this paper was 2ms, while RTTs in the CMU study were along the lines of 100s of microseconds. With that in mind, a starting RTO value of 1ms is actually shorter than the expected RTT.
The optimality of delayed acks was surprising since it could in theory slow down goodput, but a suggestion put forth by the authors is that not using delay-acks can over-drive the congestion window. The graphs used to justify this theory were somewhat confusing; I wasn't quite sure what each dot on the graph corresponded to, as the delay-acks graph has a significantly larger number of dots on it.
This work is based on the original TCP Incast paper from CMU and used a workload such that each server sends a fixed amount of data; however, the CMU group came out with a new workload in which there is a constant amount of data to send split between some changing number of senders, meaning that the more senders there are, the less each sender actually has to send. However, large numbers of senders with this new workload could hide incast behavior because each sender only ends up sending a few packets. So this paper's authors used the original workload setup of a fixed amount of data per sender.
Some of the results obtained in this study differ from those made by the original CMU group (regarding the importance of high-res timers and turning off delay-acks), and the authors presume that this is due to the underlying network. The authors use both a small scale cluster and an experimental testbed setup, but my complaint is that they do not list how many machines are actually in these systems (though it appears to be at least 20 based on later graphs).
A number of experiments were run with varying RTO minimum values, with a few noteworthy results. There appear to be 3 phases to the incast problem: collapse, goodput recovery increase, and goodput decrease. Large numbers of timeouts occur with this initial phase, but then there are few timeout events as senders ramp up. However, at some point the timing of sending packets surpasses the RTO such that timeouts happen frequently, causing interference between senders retransmitting after a timeout and those waiting for a timeout. Hence, the third phase will happen after a longer time post-collapse for higher RTO minimum values.
Smaller RTO minimum values equate to shorter "goodput recovery" time, that is, time required to hit the maximum goodput. The time at which collapse turns into goodput recovery increase is the same regardless of the minimum RTO. While the CMU paper made much ado about using high resolution timers, it appears that the optimal setting is actually a low resolution RTO of 1ms paired with delayed acks turned on. An important difference to note between the two studies, however, is that the average observed RTT in this paper was 2ms, while RTTs in the CMU study were along the lines of 100s of microseconds. With that in mind, a starting RTO value of 1ms is actually shorter than the expected RTT.
The optimality of delayed acks was surprising since it could in theory slow down goodput, but a suggestion put forth by the authors is that not using delay-acks can over-drive the congestion window. The graphs used to justify this theory were somewhat confusing; I wasn't quite sure what each dot on the graph corresponded to, as the delay-acks graph has a significantly larger number of dots on it.
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication; Vasudevan et al.
This paper discusses TCP incast collapse in which throughput in datacenter environments drops significantly due to simultaneous sending and backoff attempts of servers. For the problem to occur, multiple senders must communicate with a single receiver in a high bandwidth, low delay network, causing switch buffers to overflow which in turn causes timeouts.
In a datacenter environment where RTTs tend to be on the order of 10s or 100s or microseconds, 100+ millisecond timeouts are 1000s of times larger than the RTT. Incast is particularly a problem for barrier-synchronized workloads because even just one server experiencing these disproportionately long timeouts can delay the entire job. However, as the paper notes this timeout-to-RTT ratio in datacenters is a problem for any latency-sensitive workload even when incast is not occurring.
The authors study the problem in both simulation and in two different sized clusters and examine data for RTTs in the real world, though some of the conclusions and configuration choices the authors make seem questionable or tenuous at best. For one thing, the RTT distribution obtained from Los Alamos has around 20% of RTTs pegged at 100 microseconds or less, which the authors claim shows that "networks... operate in very low-latency environments," though in reality at least 50% of the RTTs fall in between 300 and 600 microseconds, meaning that the minimum obtainable RTO min of 5ms would be perfectly acceptable without worrying about high resolution timers.
In order to scale up to 10Gbps links, the authors set blocksize at 80MB so that any flow has the ability to saturate a link; I'm not sure that this is a reasonable choice to make. In their comparison of throughput for RTO min values of 200ms and 200us, the authors ignore all flows smaller than a certain threshold, and I would definitely be curious to see how those flows differed for the two scenarios.
The authors also believe that delayed acks could cause problems for the network due to the fact that their delay timer is set to 40ms, much larger than the proposed new RTO min value. Another thing to bear in mind is that any changes made to dealing with the RTO such as randomizing it by adding delay could actually harm performance in any non-datacenter environment because flows will not have homogeneous RTTs.
In a datacenter environment where RTTs tend to be on the order of 10s or 100s or microseconds, 100+ millisecond timeouts are 1000s of times larger than the RTT. Incast is particularly a problem for barrier-synchronized workloads because even just one server experiencing these disproportionately long timeouts can delay the entire job. However, as the paper notes this timeout-to-RTT ratio in datacenters is a problem for any latency-sensitive workload even when incast is not occurring.
The authors study the problem in both simulation and in two different sized clusters and examine data for RTTs in the real world, though some of the conclusions and configuration choices the authors make seem questionable or tenuous at best. For one thing, the RTT distribution obtained from Los Alamos has around 20% of RTTs pegged at 100 microseconds or less, which the authors claim shows that "networks... operate in very low-latency environments," though in reality at least 50% of the RTTs fall in between 300 and 600 microseconds, meaning that the minimum obtainable RTO min of 5ms would be perfectly acceptable without worrying about high resolution timers.
In order to scale up to 10Gbps links, the authors set blocksize at 80MB so that any flow has the ability to saturate a link; I'm not sure that this is a reasonable choice to make. In their comparison of throughput for RTO min values of 200ms and 200us, the authors ignore all flows smaller than a certain threshold, and I would definitely be curious to see how those flows differed for the two scenarios.
The authors also believe that delayed acks could cause problems for the network due to the fact that their delay timer is set to 40ms, much larger than the proposed new RTO min value. Another thing to bear in mind is that any changes made to dealing with the RTO such as randomizing it by adding delay could actually harm performance in any non-datacenter environment because flows will not have homogeneous RTTs.
Monday, September 21, 2009
VL2: A Scalable and Flexible Data Center Network; Greenberg et. al.
VL2 is another network architecture striving to tackle the problem of providing Ethernet semantics for increasingly larger and more unwieldy networks. VL2 is set apart from other attempts such as PortLand and SEATTLE by its dedication to achieving agility, "the capacity to assign any server to any service," a goal which the authors can break down as 3 objectives: uniform high traffic flow between any 2 servers, isolation of effects on traffic from services, and Layer-2 properties. Modifications are made to the end hosts, not to the underlying network itself, and consists of currently existing technology (ECMP forwarding, IP any/multicasting, link-state routing).
VL2 takes an approach similar to PortLand that involves separating names from locations, with each switches and servers using IP addresses as location-specific addresses (LA) and applications having their own application-specific addresses (AA) that do not change, allowing continued service in the face of virtual machine migration. A smaller cluster of directory servers store lookup information for AA to LA. Each server is outfitted with a VL2 agent that intercepts its host's packets and either queries the directory for an AA-to-LA mapping or provides one from its mappings cache and send the packet to the TOR with the corresponding LA.
VL2 was created specifically with data centers in mind and uses a Clos network, whose inherent layout provides redundancy to combat component failures. Targeting a specific realm instead of general networks seems to be a good path to take in the networking world where there are always tradeoffs for networks with different needs and purposes, and VL2 seems to be no exception. As data centers are already designed with special optimizations (e.g., large chunk sizes to amortize disk seeks), it makes sense that a network architecture could be developed with their performance needs in mind as well.
Unfortunately, it appears that traffic patterns are often very random with given paths being very short-lived, due to the practice of randomly distributing files and workloads over the data center, so it's more difficult to try to come up with traffic flow enhancements. VL2 pairs Valiant Load Balancing (VLB) with TCP to achieve fairness and high utilization. VLB is realized by randomly selecting the intermediate switch through which to send traffic, and TCP is required to deal with the scenario that different-sized flows aren't distributed well (causing congestion at some links). VLB's randomization also helps provide isolation, so applications that would otherwise flood the network have their loads spread evenly across the network instead. Valiant Load-Balancing also uses randomization to determine paths, which does not lead to best-case performance and would work better for supporting general applications.
The evaluation was run on an actual prototype setup, though as a workload, the authors examined a worst-case scenario: all-to-all shuffle. VL2 does seem to provide fairness, isolation, and high goodput, but perhaps more promising is that for the same cost as an over-subscribed traditional network, this Clos network can be built without over-subscription.
VL2 takes an approach similar to PortLand that involves separating names from locations, with each switches and servers using IP addresses as location-specific addresses (LA) and applications having their own application-specific addresses (AA) that do not change, allowing continued service in the face of virtual machine migration. A smaller cluster of directory servers store lookup information for AA to LA. Each server is outfitted with a VL2 agent that intercepts its host's packets and either queries the directory for an AA-to-LA mapping or provides one from its mappings cache and send the packet to the TOR with the corresponding LA.
VL2 was created specifically with data centers in mind and uses a Clos network, whose inherent layout provides redundancy to combat component failures. Targeting a specific realm instead of general networks seems to be a good path to take in the networking world where there are always tradeoffs for networks with different needs and purposes, and VL2 seems to be no exception. As data centers are already designed with special optimizations (e.g., large chunk sizes to amortize disk seeks), it makes sense that a network architecture could be developed with their performance needs in mind as well.
Unfortunately, it appears that traffic patterns are often very random with given paths being very short-lived, due to the practice of randomly distributing files and workloads over the data center, so it's more difficult to try to come up with traffic flow enhancements. VL2 pairs Valiant Load Balancing (VLB) with TCP to achieve fairness and high utilization. VLB is realized by randomly selecting the intermediate switch through which to send traffic, and TCP is required to deal with the scenario that different-sized flows aren't distributed well (causing congestion at some links). VLB's randomization also helps provide isolation, so applications that would otherwise flood the network have their loads spread evenly across the network instead. Valiant Load-Balancing also uses randomization to determine paths, which does not lead to best-case performance and would work better for supporting general applications.
The evaluation was run on an actual prototype setup, though as a workload, the authors examined a worst-case scenario: all-to-all shuffle. VL2 does seem to provide fairness, isolation, and high goodput, but perhaps more promising is that for the same cost as an over-subscribed traditional network, this Clos network can be built without over-subscription.
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
This paper discusses PortLand, an Ethernet-compatible set of routing, forwarding, and address resolution protocols intended for use in data centers. Data centers are scaling up at enormous rates and hence no longer play well with protocols intended for general LANs. To that end, the authors lay out some goals for this new protocol to meet: mobility, self-configuration, loop-free connectivity, and failure detection.
PortLand is specialized to a fat tree network topology. Having this specific layout in mind allows PortLand to use a hierarchical addressing scheme to encode location information as a pseudo-MAC address (PMAC). PortLand also uses name information in addition to location and stays with the tradition of using flat addressing (based on MACs), adding the notion of a PMAC address on top of that, which encodes the pod grouping of a host.
A centralized fabric manager maintains network configuration information and helps resolve IP-to-PMAC mappings. The danger in having this important function centralized is how FM failure is handled, though presumably it would be replicated to or spread across multiple nodes. Generally this setup prevents mass broadcasts of ARP packets, as ARPs are intercepted by edge switches and instead used to query the fabric manager, which only resorts to broadcast when it doesn't have the IP-to-PMAC mapping cached. Even more important is the amount of computation required of the FM when the number of ARPs sent out per host increases; up to 70 cores worth of required CPU are projected for large numbers of hosts each sending 100 ARPs/sec. As data centers are scaling out very quickly, it makes sense to watch out for this computation overload for the FM and perhaps develop it as a cluster earlier rather than later.
PortLand uses a Location Discovery Protocol to self-configure; switches send out Location Discovery Messages to one another, which allows them to infer information about their location in the tree based on messages they receive in certain ports. A very cool property about this protocol paired with the known fat tree structure is that it can actually pick out certain failure conditions like miswiring.
PortLand is specialized to a fat tree network topology. Having this specific layout in mind allows PortLand to use a hierarchical addressing scheme to encode location information as a pseudo-MAC address (PMAC). PortLand also uses name information in addition to location and stays with the tradition of using flat addressing (based on MACs), adding the notion of a PMAC address on top of that, which encodes the pod grouping of a host.
A centralized fabric manager maintains network configuration information and helps resolve IP-to-PMAC mappings. The danger in having this important function centralized is how FM failure is handled, though presumably it would be replicated to or spread across multiple nodes. Generally this setup prevents mass broadcasts of ARP packets, as ARPs are intercepted by edge switches and instead used to query the fabric manager, which only resorts to broadcast when it doesn't have the IP-to-PMAC mapping cached. Even more important is the amount of computation required of the FM when the number of ARPs sent out per host increases; up to 70 cores worth of required CPU are projected for large numbers of hosts each sending 100 ARPs/sec. As data centers are scaling out very quickly, it makes sense to watch out for this computation overload for the FM and perhaps develop it as a cluster earlier rather than later.
PortLand uses a Location Discovery Protocol to self-configure; switches send out Location Discovery Messages to one another, which allows them to infer information about their location in the tree based on messages they receive in certain ports. A very cool property about this protocol paired with the known fat tree structure is that it can actually pick out certain failure conditions like miswiring.
Thursday, September 17, 2009
Detailed Diagnosis in Enterprise Networks; Kandula et. al
The authors of this paper conducted a study of error logs within a small enterprise network. Based on a sample of these logs, they came to the conclusion that the majority of faults within a network impact applications (instead of whole machines) typically because of some configuration issue. Disturbingly, the causes of 25% of problems were unknown yet solvable by rebooting the system, casting more of a "black magic" veil on networking. Their goal then is to create a system called NetMedic that can diagnose, to the process level, culprits of network problems and probable underlying causes in as much detail as possible without relying on application-specific knowledge.
NetMedic models the network as a dependency graph with edges between dependent components. It employs a set of techniques that allow a process to keep track of its own configuration state, the behavior of peers it's in communication with, and the state of its host machine. Diagnosis of errors requires 3 steps: determining which pieces of the network are abnormal (statistically, based on previous values through time), computing edge weights for the dependency graph, and computing path weights to decide on the likely error source (with largest path weights being likely causes). Interestingly, they don't consider the possibility that a normally behaving component could be the cause of other problems; I guess it's probably the case that problems occur when things (like configuration files) change and not just out of the blue, so this assumption is probably okay for catching most errors.
NetMedic monitors both the Windows software counters as well as the Windows registry and configuration files. However, if half of the causes are due to configuration files, I wonder how important the software counters actually are to diagnosing problems. I thought it was admirable that because they couldn't instrument actual servers that are in use, the authors just started up their own and inject faults at random.
NetMedic models the network as a dependency graph with edges between dependent components. It employs a set of techniques that allow a process to keep track of its own configuration state, the behavior of peers it's in communication with, and the state of its host machine. Diagnosis of errors requires 3 steps: determining which pieces of the network are abnormal (statistically, based on previous values through time), computing edge weights for the dependency graph, and computing path weights to decide on the likely error source (with largest path weights being likely causes). Interestingly, they don't consider the possibility that a normally behaving component could be the cause of other problems; I guess it's probably the case that problems occur when things (like configuration files) change and not just out of the blue, so this assumption is probably okay for catching most errors.
NetMedic monitors both the Windows software counters as well as the Windows registry and configuration files. However, if half of the causes are due to configuration files, I wonder how important the software counters actually are to diagnosing problems. I thought it was admirable that because they couldn't instrument actual servers that are in use, the authors just started up their own and inject faults at random.
Wednesday, September 16, 2009
Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises; Kim, Caesar, & Rexford
SEATTLE is an alternative architecture meant to combat some of the inherent problems associated with Ethernet. Using Ethernet, the control overhead in a network is proportional to the number of hosts in the network and requires the use of spanning tree for paths. Problems address then include large forwarding table size and control overhead, uneven link utilization due to root bridges and route spanning trees, poor throughput because of said spanning tree, and frequent flooding. One possible solution is to use a hybrid approach of connecting LANs via IP, each of which have IP subnets of up to a few hundred hosts. However, initial configuration is time-consuming, reconfiguring is required whenever the network design changes, and the ability of a host to move and still maintain its IP address isn't supported easily. Even using VLANs to allow mobility still has the problems of manual configuration and poor utilization of links due to a spanning tree.
To address all these problems, the authors present SEATTLE. SEATTLE has overhead and forwarding tables that are proportional to the number of subnets, not hosts. It uses a flat addressing scheme a la Ethernet, forwarding packets based on MAC addresses, but is scalable to large networks, as it uses a link-state protocol but only maintains switch information which is subject to less change and hence fewer control messages going out on the network. SEATTLE uses a DHT to lookup host information, which actually lengthens the path for at least the first traversal of a flow's path. Switches have the option of caching the host information themselves for forwarding based on shortest paths at the risk of ending up with a large, unwieldy forwarding table.
In case of network changes, SEATTLE attempts to avoid flooding whenever possible, choosing instead to unicast updates to other switches and even forward misdelivered packets on behalf of those other switches. SEATTLE employs a somewhat lazy approach in correcting these routing table entries.
In evaluating their design, the authors strove to use real data, an endeavor at which they partly succeeded, at least using a mix of real traces with artificially enhanced data. Running each experiment 25 times definitely seems like overkill to me; would the experiments differ much beyond 3 to 5 runs? Granted, the root bridge was randomly selected in these runs, but if network behavior would be that differently affected each time, that seems like a flaw in the design: not being able to predict the network behavior makes simple performance predictions near impossible, but maybe it's not any easier to do that with Ethernet.
There is a possibility of having unduly lengthened paths when a switch evicts a cached entry before its locally-attached hosts are done using the entry, in which case the switch forwards the packets to a resolver switch instead of queueing them itself. This scenario can be prevented by having a long timeout to eviction, with the drawback that the forwarding table could accumulate a very large number of entries.
SEATTLE addresses the biggest problems with Ethernet and hybrids involving it: use of a spanning tree for paths, frequent flooding, and required manual configuration for hybrid and VLAN schemes. However, I'm not convinced of the true need for SEATTLE; the only actual performance graphs included show control overhead, table size, and packet loss in the case of switch failures and moving hosts. The packet loss goes up to 30% as switch failure rate increases to 1 and only up to 5% for 200 hosts moving per second, which seems unreasonably high to me.
To address all these problems, the authors present SEATTLE. SEATTLE has overhead and forwarding tables that are proportional to the number of subnets, not hosts. It uses a flat addressing scheme a la Ethernet, forwarding packets based on MAC addresses, but is scalable to large networks, as it uses a link-state protocol but only maintains switch information which is subject to less change and hence fewer control messages going out on the network. SEATTLE uses a DHT to lookup host information, which actually lengthens the path for at least the first traversal of a flow's path. Switches have the option of caching the host information themselves for forwarding based on shortest paths at the risk of ending up with a large, unwieldy forwarding table.
In case of network changes, SEATTLE attempts to avoid flooding whenever possible, choosing instead to unicast updates to other switches and even forward misdelivered packets on behalf of those other switches. SEATTLE employs a somewhat lazy approach in correcting these routing table entries.
In evaluating their design, the authors strove to use real data, an endeavor at which they partly succeeded, at least using a mix of real traces with artificially enhanced data. Running each experiment 25 times definitely seems like overkill to me; would the experiments differ much beyond 3 to 5 runs? Granted, the root bridge was randomly selected in these runs, but if network behavior would be that differently affected each time, that seems like a flaw in the design: not being able to predict the network behavior makes simple performance predictions near impossible, but maybe it's not any easier to do that with Ethernet.
There is a possibility of having unduly lengthened paths when a switch evicts a cached entry before its locally-attached hosts are done using the entry, in which case the switch forwards the packets to a resolver switch instead of queueing them itself. This scenario can be prevented by having a long timeout to eviction, with the drawback that the forwarding table could accumulate a very large number of entries.
SEATTLE addresses the biggest problems with Ethernet and hybrids involving it: use of a spanning tree for paths, frequent flooding, and required manual configuration for hybrid and VLAN schemes. However, I'm not convinced of the true need for SEATTLE; the only actual performance graphs included show control overhead, table size, and packet loss in the case of switch failures and moving hosts. The packet loss goes up to 30% as switch failure rate increases to 1 and only up to 5% for 200 hosts moving per second, which seems unreasonably high to me.
Monday, September 14, 2009
Congestion Control for High Bandwidth-Delay Product Networks; Katabi, Handley, & Rohrs
This paper sets out to re-create congestion control from scratch. A big point made by the authors is that congestion not binary, so why should feedback to the endhosts be? Every packet has a congestion header which has a feedback field that routers along the path modify to let the sender know how much to change its window size. An ack is sent from the receiver that contains the feedback value.
An interesting decoupling the authors choose is to separate efficiency from fairness. While this may seem somewhat strange in that they both have to do with how a link is utilized, the paper's argument that the number of flows is constantly changing is a valid one. An efficiency controller attempts to allocate all bandwidth on the level of bytes through the use of a feedback function which takes into account spare bandwidth and persistent queue size as well as roundtrip time. The efficiency controller doesn't concern itself with which senders get the feedback just as long as the total result is more efficient bandwidth usage; the fairness controller does the allocation.
Based on the paper's simulations, XCP drops very few packets and attains high utilization of bottleneck resources; the authors chose maintaining a small queue size to be of higher importance than obtaining highest possible utilization. It also handles sudden changes in traffic much better than TCP does. However, XCP does not inherently have protections against misbehaving users, but because of the additional information stored in the congestion headers, if a router is willing to keep some per-connection state, it can check to ensure a user is actually responding to given feedback.
An interesting decoupling the authors choose is to separate efficiency from fairness. While this may seem somewhat strange in that they both have to do with how a link is utilized, the paper's argument that the number of flows is constantly changing is a valid one. An efficiency controller attempts to allocate all bandwidth on the level of bytes through the use of a feedback function which takes into account spare bandwidth and persistent queue size as well as roundtrip time. The efficiency controller doesn't concern itself with which senders get the feedback just as long as the total result is more efficient bandwidth usage; the fairness controller does the allocation.
Based on the paper's simulations, XCP drops very few packets and attains high utilization of bottleneck resources; the authors chose maintaining a small queue size to be of higher importance than obtaining highest possible utilization. It also handles sudden changes in traffic much better than TCP does. However, XCP does not inherently have protections against misbehaving users, but because of the additional information stored in the congestion headers, if a router is willing to keep some per-connection state, it can check to ensure a user is actually responding to given feedback.
Random Early Detection Gateways for Congestion Avoidance; Floyd & Jacobson
This paper describes the Random Early Detection (RED) scheme to detect congestion in gateways. To attain low latency and high throughput, RED attempts to keep the average queue length short while still permitting larger instantaneous queue sizes. It also strives to prevent global synchronization in which multiple connections decrease their window size and resume slow start at one time.
It uses an underlying FIFO and keeps track of the average queue size with a weighted average. It also has minimum and maximum thresholds such that when the average queue size is less than the minimum, packets are free to pass through without interference. When the average queue size is greater than the maximum threshold, every incoming packet is marked (and presumably dropped). In between the minimum and maximum threshold lengths, packets are marked probabilistically based on their associated connection's share of bandwidth. RED can handle bias and burstiness because its average queue length can be weighted higher to give recent traffic more importance or weighted lower to allow bursts of traffic.
RED uses a uniform probability distribution to decide which packets to be marked in the situations when the average queue size is between the min and max. The marking of packets pays no heed to which connections are being marked, though presumably the number of dropped packets per connection should reflect the connection's sent packets; this idea fails to control misbehaving users in the case of RED simply marking and not dropping the packets, as the congestion control is then left up to the ends.
Additionally, RED's reducing traffic by dropping packets relies on the use of TCP, or at least some protocol in which the source realizes a packet was dropped. Furthermore, the size of the packet isn't taken into account, which could potentially lead to a situation in which lots of small packets are dropped but there's still a large number of bytes to be pushed across a link. In general, RED laid out a good, flexible framework that respects different types of traffic and attempts to prevent global synchs and dropping of all packets at a gateway, but it has a number of issues that need to be resolved.
It uses an underlying FIFO and keeps track of the average queue size with a weighted average. It also has minimum and maximum thresholds such that when the average queue size is less than the minimum, packets are free to pass through without interference. When the average queue size is greater than the maximum threshold, every incoming packet is marked (and presumably dropped). In between the minimum and maximum threshold lengths, packets are marked probabilistically based on their associated connection's share of bandwidth. RED can handle bias and burstiness because its average queue length can be weighted higher to give recent traffic more importance or weighted lower to allow bursts of traffic.
RED uses a uniform probability distribution to decide which packets to be marked in the situations when the average queue size is between the min and max. The marking of packets pays no heed to which connections are being marked, though presumably the number of dropped packets per connection should reflect the connection's sent packets; this idea fails to control misbehaving users in the case of RED simply marking and not dropping the packets, as the congestion control is then left up to the ends.
Additionally, RED's reducing traffic by dropping packets relies on the use of TCP, or at least some protocol in which the source realizes a packet was dropped. Furthermore, the size of the packet isn't taken into account, which could potentially lead to a situation in which lots of small packets are dropped but there's still a large number of bytes to be pushed across a link. In general, RED laid out a good, flexible framework that respects different types of traffic and attempts to prevent global synchs and dropping of all packets at a gateway, but it has a number of issues that need to be resolved.
Thursday, September 10, 2009
Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks; Stoica, Shenker, & Zhang
This paper starts out with the conjectures that fair allocation is very important and/or necessary to congestion control and that current fair allocation implementations are too complex to become widespread. As such, they present an algorithm to give approximately fair bandwidth allocation in which edge routers in some contiguous set of "island" routers calculate per-flow rates and encode flow rate into packet headers, information which can then be used by "island" core nodes without the core nodes needing to maintain any state.
They also simplify the fair queueing mechanism by not separating scheduling and buffering out per flow, opting instead for a FIFO and probabilistic packet dropping, in which a flow's drop probability is based on how much more traffic than its fair share it's trying to send. The authors also describe a method of giving different weights to different flows grouped by islands.
They then run a number of simulations in ns-2 that compares CSFQ against a number of queueing algorithms: FIFO and RED (no per-flow separation) and FRED and DRR (which do allocate on a per-flow basis). These simulations include mixes of UDP and TCP flows, some of which send packets at much larger rates than their fair share, layered multicast traffic, and large propagation delays. CSFQ outperforms FIFO and RED, which do not attempt to fairly allocate, and performs on par with FRED but without requiring any special per-flow treatment. It does not perform as well as DRR, but the authors are optimistic about improvements that can be made to it.
Overall this sounds like a good algorithm for achieving approximate fair allocation, with the real savings being in the lack of state required in all the routers. Since the paper was an overview of a proposed architecture, there weren't huge amounts of detail about how large "islands" should be and hence how much complexity would be cut by having stateless core routers. Also, high-speed packet classification is still an actively-researched problem, and one that the authors would benefit from having a good solution for, as their algorithm requires classifying to flows.
They also simplify the fair queueing mechanism by not separating scheduling and buffering out per flow, opting instead for a FIFO and probabilistic packet dropping, in which a flow's drop probability is based on how much more traffic than its fair share it's trying to send. The authors also describe a method of giving different weights to different flows grouped by islands.
They then run a number of simulations in ns-2 that compares CSFQ against a number of queueing algorithms: FIFO and RED (no per-flow separation) and FRED and DRR (which do allocate on a per-flow basis). These simulations include mixes of UDP and TCP flows, some of which send packets at much larger rates than their fair share, layered multicast traffic, and large propagation delays. CSFQ outperforms FIFO and RED, which do not attempt to fairly allocate, and performs on par with FRED but without requiring any special per-flow treatment. It does not perform as well as DRR, but the authors are optimistic about improvements that can be made to it.
Overall this sounds like a good algorithm for achieving approximate fair allocation, with the real savings being in the lack of state required in all the routers. Since the paper was an overview of a proposed architecture, there weren't huge amounts of detail about how large "islands" should be and hence how much complexity would be cut by having stateless core routers. Also, high-speed packet classification is still an actively-researched problem, and one that the authors would benefit from having a good solution for, as their algorithm requires classifying to flows.
Analysis and Simulation of a Fair Queueing Algorithm; Demers, Keshav, & Shenker
This paper sets out to describe a "new" fair queueing algorithm to be implemented at the gateway. (The authors spend copious amounts of time describing other similar work to their algorithm, which makes me question how different their work really is.) Three requirements for this queueing algorithm to fulfill include allocating buffer space and bandwidth fairly amongst a number of users, the ability to control promptness/delay somewhat independently of buffer space/bw, and continuity with the packet-scheduling scheme.
The paper defines fairness based on Jain's max-min fairness, where each user receives the minimum of either their requested amount of a resource or their fair share. It's tough to decide how to actually define a "user;" it could be done on a per-sender, per-receiver, per-sender-receiver pair, or per-process basis, each with possible drawbacks in terms of security and fairness. The authors select sender-receiver pairs (called "conversations") because it doesn't reward bad behavior or penalize a user due to malicious circumstances out of their control.
The algorithm for fair queueing is based on bit-by-bit round-robin. The authors created metrics for start and finish time that can be used to order packet's worth of bits by, and after start and finish times are assigned, we can use those metrics for whole packets without regard to how long the packets actually are. Using these finish times, there is one simple rule the algorithm follows: the next packet to go out has the lowest finish time. A more complicated flavor of this FQ algorithm introduces the concept of bid, a way to give less delay to users who use less than their fair share of BW. When the queue is full and a new packet arrives, simply drop the last packet of the conversation using the most buffer space.
Paired with a fair queueing algorithm, there also needs to be flow control. The progression of flow control over time began with a timeout mechanism and sliding window flow control and progressed to the Jacobson-Karels (JK) which added dynamic window size, and finally to DECbit which allows the gateway to send selective congestion signals to overagressive users. This third approach is an intriguing one, though I wonder if it ever got used (though presumably it did within DEC)?
Simulations combining different combos of flow control and queueing (generic, JK, or DEC flow control with FCFS or FQ): fair queueing needs good flow control to be effective. When ill-behaving users are added to the mix, the paper's FQ does a good job of letting well-behaved hosts' traffic through, minimizing the incentive for a user to misbehave. The simulations used benchmarks, not real workloads, though the benchmarks were meant to show a mix of different scenarios, using as applications Telnet and FTP. A question I have is how easy is FQ to modify to incorporate not just what a user's equal share would be but also how much of a resource something deserves? It seems difficult to come up with good generalized algorithms that can accommodate the different types of users (file servers vs. Telnet, etc.), though FQ seems to generalize well regardless of the type of traffic it sees.
The paper defines fairness based on Jain's max-min fairness, where each user receives the minimum of either their requested amount of a resource or their fair share. It's tough to decide how to actually define a "user;" it could be done on a per-sender, per-receiver, per-sender-receiver pair, or per-process basis, each with possible drawbacks in terms of security and fairness. The authors select sender-receiver pairs (called "conversations") because it doesn't reward bad behavior or penalize a user due to malicious circumstances out of their control.
The algorithm for fair queueing is based on bit-by-bit round-robin. The authors created metrics for start and finish time that can be used to order packet's worth of bits by, and after start and finish times are assigned, we can use those metrics for whole packets without regard to how long the packets actually are. Using these finish times, there is one simple rule the algorithm follows: the next packet to go out has the lowest finish time. A more complicated flavor of this FQ algorithm introduces the concept of bid, a way to give less delay to users who use less than their fair share of BW. When the queue is full and a new packet arrives, simply drop the last packet of the conversation using the most buffer space.
Paired with a fair queueing algorithm, there also needs to be flow control. The progression of flow control over time began with a timeout mechanism and sliding window flow control and progressed to the Jacobson-Karels (JK) which added dynamic window size, and finally to DECbit which allows the gateway to send selective congestion signals to overagressive users. This third approach is an intriguing one, though I wonder if it ever got used (though presumably it did within DEC)?
Simulations combining different combos of flow control and queueing (generic, JK, or DEC flow control with FCFS or FQ): fair queueing needs good flow control to be effective. When ill-behaving users are added to the mix, the paper's FQ does a good job of letting well-behaved hosts' traffic through, minimizing the incentive for a user to misbehave. The simulations used benchmarks, not real workloads, though the benchmarks were meant to show a mix of different scenarios, using as applications Telnet and FTP. A question I have is how easy is FQ to modify to incorporate not just what a user's equal share would be but also how much of a resource something deserves? It seems difficult to come up with good generalized algorithms that can accommodate the different types of users (file servers vs. Telnet, etc.), though FQ seems to generalize well regardless of the type of traffic it sees.
Tuesday, September 8, 2009
Congestion Avoidance and Control; Jacobson & Karels
This paper discusses congestion avoidance/control mechanisms for TCP, including "slow start" and "additive increase, multiplicative decrease," two methods that can be used alone or, as they actually are in practice, together. TCP Slow Start involves the use of "self-clocking," that is, sending out two packets for every ack received by the sender. This strategy performs better over the long run in terms of throughput and prevents multiple retransmissions of packets. An added advantage is that slow start is not a complicated algorithm to implement, and it's also easy to know how big of a buffer is needed at the bottleneck in the network based on window size.
Besides slow start, TCP also uses the congestion avoidance algorithm that is consistent with Jain/Chiu's recommendation: additive increase, multiplicative decrease. I thought this paper did a really nice job of offering the intuition for this combination, more so than the Jain/Chiu paper, which is that the network takes a greater hit from overshooting than from underuse; hence, caution is the best strategy, and the best performance can be had by probing in small increments to find available bandwidth and decreasing window size exponentially to account for all packets that will be queueing up. Using packet loss as a sign of network congestion to trigger the decreases of window size seems pretty reasonable; obviously it would be convenient for the hosts if the underlying network could just send a number back to each host specifying a concrete amount of bandwidth remaining for it to use.
This congestion avoidance is easily confused with slow start; when should the network use which? The appendices show how to combine the two: when a timeout occurs, we must decide which strategy to take. We engage in slow start (starting from a window size of one packet), and while the window size is less than half the size when the timeout occurred, double the window size with slow-start, and then probe incrementally. The intuition for this method is that half the timeout window size was probably a "safe" size within the bounds of our resources, so the network should ramp up to that point and then slowly and carefully probe for the max bandwidth in that range. (Ironically, "slow start" is the faster incrementing method in this case.)
Another change this paper recommends is a modification to the restransmit timer, namely taking into account the round-trip time variance to make a variable timer instead of a hard-coded one. The paper noted that exponential backoffs are the best all-around strategy, but what if we have more specific information about what's going on in the network? There are some other strategies such as "Fast Retransmit" that actually end up ignoring these timers when it becomes apparent that a chunk of packets have been dropped. It's possible in these cases that there are a number of gratuitous retransmits, so in the end there's a tradeoff between unnecessary packets in the network and fast recovery.
This paper is important because it formed the basis for the congestion avoidance used in TCP.
Besides slow start, TCP also uses the congestion avoidance algorithm that is consistent with Jain/Chiu's recommendation: additive increase, multiplicative decrease. I thought this paper did a really nice job of offering the intuition for this combination, more so than the Jain/Chiu paper, which is that the network takes a greater hit from overshooting than from underuse; hence, caution is the best strategy, and the best performance can be had by probing in small increments to find available bandwidth and decreasing window size exponentially to account for all packets that will be queueing up. Using packet loss as a sign of network congestion to trigger the decreases of window size seems pretty reasonable; obviously it would be convenient for the hosts if the underlying network could just send a number back to each host specifying a concrete amount of bandwidth remaining for it to use.
This congestion avoidance is easily confused with slow start; when should the network use which? The appendices show how to combine the two: when a timeout occurs, we must decide which strategy to take. We engage in slow start (starting from a window size of one packet), and while the window size is less than half the size when the timeout occurred, double the window size with slow-start, and then probe incrementally. The intuition for this method is that half the timeout window size was probably a "safe" size within the bounds of our resources, so the network should ramp up to that point and then slowly and carefully probe for the max bandwidth in that range. (Ironically, "slow start" is the faster incrementing method in this case.)
Another change this paper recommends is a modification to the restransmit timer, namely taking into account the round-trip time variance to make a variable timer instead of a hard-coded one. The paper noted that exponential backoffs are the best all-around strategy, but what if we have more specific information about what's going on in the network? There are some other strategies such as "Fast Retransmit" that actually end up ignoring these timers when it becomes apparent that a chunk of packets have been dropped. It's possible in these cases that there are a number of gratuitous retransmits, so in the end there's a tradeoff between unnecessary packets in the network and fast recovery.
This paper is important because it formed the basis for the congestion avoidance used in TCP.
Monday, September 7, 2009
Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks; Chiu & Jain
This paper gave a rigorous analysis of several metrics related to congestion avoidance mechanisms, including efficiency, fairness, distributedness, and convergence. Network efficiency describes how closely the network is operating to the goal "knee" load level where throughput is still increasing slightly but response time is increasing vastly. Fairness is an even sharing of bottleneck resources, distributedness is decentralization of information necessary for hosts to make behavior decisions, and convergence refers more generally to reaching some equilibrium with small oscillations being acceptable.
The authors then examine what happens with these metrics when various combinations of multiplicative and incremental increases and decreases are used for congestion avoidance. Ultimately they find that decreases should be multiplicative and increases should be linearly additive. Interestingly, a host receives binary feedback from the network, and this extra bit of information became pervasively used despite the fact that it feels like a hack to get around the E2E principle.
All different aspects the authors examine lead them to the same "additive increase, multiplicative decrease" conclusion, from quanitifying responsiveness (time to converge) to looking at smoothness (size of oscillations). They show some nice graphs that demonstrate convergence of fairness and efficiency and how one metric might be neglected even as the other converges. The mathematical proofs even allow the authors to predict maximum overshoot of their proposed mechanism. This paper really hammered in the idea that the use of "additive increase, multiplicative decrease" in TCP's congestion avoidance is the correct way to go.
The authors then examine what happens with these metrics when various combinations of multiplicative and incremental increases and decreases are used for congestion avoidance. Ultimately they find that decreases should be multiplicative and increases should be linearly additive. Interestingly, a host receives binary feedback from the network, and this extra bit of information became pervasively used despite the fact that it feels like a hack to get around the E2E principle.
All different aspects the authors examine lead them to the same "additive increase, multiplicative decrease" conclusion, from quanitifying responsiveness (time to converge) to looking at smoothness (size of oscillations). They show some nice graphs that demonstrate convergence of fairness and efficiency and how one metric might be neglected even as the other converges. The mathematical proofs even allow the authors to predict maximum overshoot of their proposed mechanism. This paper really hammered in the idea that the use of "additive increase, multiplicative decrease" in TCP's congestion avoidance is the correct way to go.
Thursday, September 3, 2009
Understanding BGP Misconfiguration; Mahajan, Wetherall, & Anderson
This paper examines the globally visible impacts of small-scale misconfigurations in BGP routers. The types of faults studied are accidental insertion and propagation of routes to global BGP tables. To study the frequency and effects of these faults, the authors looked at route updates in 19 ASes (it would have been interesting to know exactly which ASes these include) and decided to classify a route change as a misconfiguration if it only lasted a brief amount of time.
This idea is slightly flawed in that it doesn't take into account misconfigurations that aren't discovered for a long time, and it's also possible that legitimate route changes can occur that last for less than a day. However, the authors try to deal with the latter problem by contacting ISP operators for information regarding each potential misconfiguration (though ironically enough, 1/3 of these e-mails didn't reach their intended target due to stale data).
The authors describe the origins of new routes as being self-deaggregation, related origin, and foreign origin; the general thought regarding foreign origins was that they were more likely to be an address space hijack than a legitimate update. However, I would suspect that legitimate route updates of foreign origin would be more prevalent today than seven years ago when this paper was written, as tech companies like Google and Facebook with equipment all over the world implement policies to accommodate for failures in other locations by transferring traffic elsewhere.
There were a number of interesting findings from this study. The authors found that for all the BGP misconfigurations that occur, they noticeably increased load but didn't significantly hinder connectivity. In looking at the causes of these misconfigurations, some slightly disturbing facts arose, such as a bug in a certain type of router that leaked entries during reboot and also how much manual configuration is required from operators; misconfigurations can be due to underlying causes one might not have even thought involved in the process. This paper is of particular interest because it actually studied routing changes in the wild and then took steps to uncover the cause of misconfigurations.
This idea is slightly flawed in that it doesn't take into account misconfigurations that aren't discovered for a long time, and it's also possible that legitimate route changes can occur that last for less than a day. However, the authors try to deal with the latter problem by contacting ISP operators for information regarding each potential misconfiguration (though ironically enough, 1/3 of these e-mails didn't reach their intended target due to stale data).
The authors describe the origins of new routes as being self-deaggregation, related origin, and foreign origin; the general thought regarding foreign origins was that they were more likely to be an address space hijack than a legitimate update. However, I would suspect that legitimate route updates of foreign origin would be more prevalent today than seven years ago when this paper was written, as tech companies like Google and Facebook with equipment all over the world implement policies to accommodate for failures in other locations by transferring traffic elsewhere.
There were a number of interesting findings from this study. The authors found that for all the BGP misconfigurations that occur, they noticeably increased load but didn't significantly hinder connectivity. In looking at the causes of these misconfigurations, some slightly disturbing facts arose, such as a bug in a certain type of router that leaked entries during reboot and also how much manual configuration is required from operators; misconfigurations can be due to underlying causes one might not have even thought involved in the process. This paper is of particular interest because it actually studied routing changes in the wild and then took steps to uncover the cause of misconfigurations.
Interdomain Internet Routing; Balakrishnan
This paper describes interdomain routing in the Internet. It first details the decentralized nature of routing, detailing how layered tiers of Internet Service Providers engage in a complicated game of competition and cooperation to best serve customers while maximizing their profits. ISPs engage in both transit and peering relationships with one another; transit is a customer-provider relationship in which a provider makes the guarantee to its customer that any sender on the Internet will be able to reach them.
However, the ISPs will be selective about which routes they share so as to minimize all non-customer routing that goes through them. Choosing which routes to export is a complicated matter of economics and resource tradeoffs, keeping customers happy and not wasting one's own resources. I thought the paper mentioned a very interesting way of looking at a transit relationship: that what ISPs are charging their customers for is, in essence, an entry in a routing table. I was also curious as to how the different tier ISPs came about: did the multiple layers get added on gradually over time as the Internet grew, or was it a clean break from the centralized system of the ARPANET? Furthermore, when did the notion of making money as an ISP really start to come into the picture?
The paper then moves on to discuss autonomous systems and how they exchange routing info via Border Gateway Protocol (BGP) and use IGP to optimize their own internal paths. BGP is a path vector protocol that has internal and external flavors, iBGP and eBGP, which it uses internal to an AS and between ASes, respectively. BGP is used to exchange route information and keep track of numerous attributes of the route, including a vector of ASes the route should follow. BGP is not a complicated protocol, sending a few simple messages, including UPDATE when routes change, NOTIFICATION when routes go down, and KEEPALIVE to ensure that routes are still up. It selects routes based on some of the attributes associated with a route, including whether or not the route originated from a customer and based on shortest AS length.
BGP's working properly requires putting a lot of trust in all ASes; the fact that any AS can originate a route for any prefix seems like a bad idea that has caused some serious (though easily-fixed and short-lived) problems. Some sort of network sentinels that watch for unexpected changes in advertised routes seems like a very worthwhile idea, though apparently logging is not particularly reliable either, meaning that we're left with a lack of accountability in the case of route hijacks. For a highly trafficked site like youtube, it doesn't take long for the world to notice that routing tables world-wide have been poisoned, but how long would it take to find problems with more subtle misconfigurations?
However, the ISPs will be selective about which routes they share so as to minimize all non-customer routing that goes through them. Choosing which routes to export is a complicated matter of economics and resource tradeoffs, keeping customers happy and not wasting one's own resources. I thought the paper mentioned a very interesting way of looking at a transit relationship: that what ISPs are charging their customers for is, in essence, an entry in a routing table. I was also curious as to how the different tier ISPs came about: did the multiple layers get added on gradually over time as the Internet grew, or was it a clean break from the centralized system of the ARPANET? Furthermore, when did the notion of making money as an ISP really start to come into the picture?
The paper then moves on to discuss autonomous systems and how they exchange routing info via Border Gateway Protocol (BGP) and use IGP to optimize their own internal paths. BGP is a path vector protocol that has internal and external flavors, iBGP and eBGP, which it uses internal to an AS and between ASes, respectively. BGP is used to exchange route information and keep track of numerous attributes of the route, including a vector of ASes the route should follow. BGP is not a complicated protocol, sending a few simple messages, including UPDATE when routes change, NOTIFICATION when routes go down, and KEEPALIVE to ensure that routes are still up. It selects routes based on some of the attributes associated with a route, including whether or not the route originated from a customer and based on shortest AS length.
BGP's working properly requires putting a lot of trust in all ASes; the fact that any AS can originate a route for any prefix seems like a bad idea that has caused some serious (though easily-fixed and short-lived) problems. Some sort of network sentinels that watch for unexpected changes in advertised routes seems like a very worthwhile idea, though apparently logging is not particularly reliable either, meaning that we're left with a lack of accountability in the case of route hijacks. For a highly trafficked site like youtube, it doesn't take long for the world to notice that routing tables world-wide have been poisoned, but how long would it take to find problems with more subtle misconfigurations?
Tuesday, September 1, 2009
The Design Philosophy of the DARPA Internet Protocols; Clark
This paper describes some of the basic design principles that went into the DARPA Internet, the foundations of the Internet as we know and love it today. The top goal of the DARPA Internet was to connect together existing networks, such as the ARPANET and ARPA packet radio network. Two additional aspects of the Internet that carried over from the ARPANET are packet switching and store-and-forward of packets by gateway packet switches, though it's really the incorporation of existing networks that really drove the direction of the Internet. It's not clear to me how the Internet would be different today if the designers had chosen to create a new unified, multi-media system instead of incorporating existing networks; would it be something significantly higher performing and survivable?
The top secondary goal of the designers was survivability: that, short of complete network partition, service should continue in some way between sender and receiver. To maintain this state information, the design uses "fate sharing," that is, keeping this state information at an endpoint host. The argument is that losing the state information about a host is okay if the host entity itself is lost at the same time.
Another secondary goal included accommodating different types of service, differing by requirements in speed, latency, and reliability. Providing for each of these different services proved to be too much for one protocol to handle effectively, which ultimately led to the split of TCP and IP as well as the creation of UDP. Additionally, the Internet was meant to allow for a variety of networks, which it does by making a minimal set of assumptions: the network must be able to transport a datagram. On top of that, there are some fuzzier requirements; the packet must be of "reasonable" size with delivered with "reasonable" reliability through a network with "suitable" addressing.
In creating a framework that cobbled together existing networks, the designers of the Internet caused problems for themselves in terms of not being able to accurately simulate or formalize performance constraints for the Internet architecture. Also of interest is that the end-to-end principle made the architecture less cost-effective and (at least initially) more difficult to connect new hosts to. While use of the datagram as a building block served its purpose of providing some of the flexibility-related secondary goals, it did not provide for easy resource management and accounting, which would have been better served by some notion of a "flow" through the network. All these factors, from datagrams to byte-managed TCP to packet-switching, allowed the Internet to achieve its main goal but also played some part in keeping it from being the cleanest, most effective, high-performing implementation possible.
The top secondary goal of the designers was survivability: that, short of complete network partition, service should continue in some way between sender and receiver. To maintain this state information, the design uses "fate sharing," that is, keeping this state information at an endpoint host. The argument is that losing the state information about a host is okay if the host entity itself is lost at the same time.
Another secondary goal included accommodating different types of service, differing by requirements in speed, latency, and reliability. Providing for each of these different services proved to be too much for one protocol to handle effectively, which ultimately led to the split of TCP and IP as well as the creation of UDP. Additionally, the Internet was meant to allow for a variety of networks, which it does by making a minimal set of assumptions: the network must be able to transport a datagram. On top of that, there are some fuzzier requirements; the packet must be of "reasonable" size with delivered with "reasonable" reliability through a network with "suitable" addressing.
In creating a framework that cobbled together existing networks, the designers of the Internet caused problems for themselves in terms of not being able to accurately simulate or formalize performance constraints for the Internet architecture. Also of interest is that the end-to-end principle made the architecture less cost-effective and (at least initially) more difficult to connect new hosts to. While use of the datagram as a building block served its purpose of providing some of the flexibility-related secondary goals, it did not provide for easy resource management and accounting, which would have been better served by some notion of a "flow" through the network. All these factors, from datagrams to byte-managed TCP to packet-switching, allowed the Internet to achieve its main goal but also played some part in keeping it from being the cleanest, most effective, high-performing implementation possible.
End-to-End Arguments in System Design; Saltzer, Reed, & Clark
The authors describe the "end-to-end" principle in this paper as it relates to deciding on the specific functionality built into a communication system: that some functions require some information or help from the application regardless, so building that functionality into the communication system itself is not necessarily beneficial, and indeed can even cause harm at times. Leaving out extra functionality from the low-level system can be preferable because applications that don't need the functions still have to deal with the performance hit in the underlying system and also because some applications will implement the function better or in more detail than the communcation system.
They provide a few examples to further their argument. The reliable file transfer example provides a more concrete notion to this principle, stressing that even if the underlying communication system has some fancy built-in guarantees, the application itself still has to check (and possibly handle) a small variety of errors, so complicating the communication system is not particularly helpful.
The "MIT checksum incident" example almost seems to argue that low-level function implementation is a harmful thing; if the networks hadn't used a packet checksum, perhaps the application programmers would have built in their own verification method and caught packet corruption earlier. I was skeptical of this notion that the communication system should be completely stripped off all duty to ensure correct data transmission, but luckily the authors countered this idea immediately, insisting that proper design involves a tradeoff between complexity costs of the communication system and the performance hits from an application having to deal with many errors in a simple communication system.
I like the notion of the end-to-end principle being the equivalent of RISC, in that both advocate paring down the underlying tools (communication system and instruction set, respectively) to the core building blocks necessary for everything else to get built on top of. The notion of layers in a network stems from this idea, with the underlying IP being simple and higher-level TCP/UDP/etc. layers actually offering the performance guarantees. This paper is important because it is likely one of the initial bases for the layers of the Internet Protocol Suite.
They provide a few examples to further their argument. The reliable file transfer example provides a more concrete notion to this principle, stressing that even if the underlying communication system has some fancy built-in guarantees, the application itself still has to check (and possibly handle) a small variety of errors, so complicating the communication system is not particularly helpful.
The "MIT checksum incident" example almost seems to argue that low-level function implementation is a harmful thing; if the networks hadn't used a packet checksum, perhaps the application programmers would have built in their own verification method and caught packet corruption earlier. I was skeptical of this notion that the communication system should be completely stripped off all duty to ensure correct data transmission, but luckily the authors countered this idea immediately, insisting that proper design involves a tradeoff between complexity costs of the communication system and the performance hits from an application having to deal with many errors in a simple communication system.
I like the notion of the end-to-end principle being the equivalent of RISC, in that both advocate paring down the underlying tools (communication system and instruction set, respectively) to the core building blocks necessary for everything else to get built on top of. The notion of layers in a network stems from this idea, with the underlying IP being simple and higher-level TCP/UDP/etc. layers actually offering the performance guarantees. This paper is important because it is likely one of the initial bases for the layers of the Internet Protocol Suite.
Resurrecting the Blog
... for Randy's networking class. I kept all my summaries for systems papers here for easy access and because I'm special like that, and I thought about doing that for my architecture papers but never got around to it.
Subscribe to:
Posts (Atom)