This paper examines the energy wasted by computers not taking advantage of idle times to go into lower power sleep modes. The authors straight out acknowledge that this problem of when and how to effectively put nodes to sleep without them losing their network presence has been examined, with wake-on-LAN and proxying as potential solutions.
Based on traces from actual users, they justify the need for these solutions by showing that the studied computers actually do spend a lot of time in idle and that in many cases, long gaps of time pass between packet arrivals at the idle machine. Not suprisingly, the machines tend to have more frequently arriving packets in the office setting than in the home setting, with some users having even more available idle time than others.
I like this paper because it breaks down the type of network traffic seen in both the home and office settings; it starts out breaking both the incoming and outing packets into unicast, multicast, and broadcast groupings, and then further examines the breakdown of the multicast and broadcast packets into such things as ARP, IGMP, etc. These breakdowns are important because the power savings possible due to proxying for a node are dependent on which packets a node does not actually need to be woken up for.
Protocols are classified into groupings such as "don't-ignore" (ARP or DHCP requests for the sleeping node), "don't wake", ignorable, and mechanical response that the proxy can respond to. With these classifications in mind, the authors present 4 proxy schemes that offer varying levels of power savings, with the tradeoff that less application state is maintained. They find that their more aggressive schemes offer a lot of power savings, particularly in the office environment, and create a prototype of their proxy architecture in Click.
Friday, November 20, 2009
Cutting the Electric Bill for Internet-Scale Systems; Qureshi, Weber, Balakrishnan, Guttag, & Maggs
This paper discusses reducing energy costs of operating a data center. This technique relies on temporal and spatial fluctuations in electricity costs as well as use of dynamic request routing across widely distributed machines. The notion of energy elasticity determines how much money can actually be saved by changing the load on a machine; machines with greater energy proportionality can save more money when they're less used than those machines that continue to burn lots of power when idle.
The authors choose to focus on real-time electricity pricing instead of future, predicted markets, though RT makes up about 10% of the market share. Routing to different geographic areas to obtain the best price relies on low correlation between hourly prices in different regions, which appears to be the case between different Regional Transmission Organizations (RTOs). Part of the difference in cost between two locations at once is the fact that different time zones reach their peak cost at non-peak hours in other locations.
An inherent tradeoff in cutting cost of electricity by routing it elsewhere geographically is an increase in bandwidth costs, which the authors accounted for in part of their study. They did find that this cut the amount of energy costs that could be saved, but they still managed to cut total operating costs. Further reductions could be made through enrolling in demand reduction programs, but that would require delay-insensitive workloads within the data center, and it's not obvious whether that tradeoff would end up being monetarily beneficial.
Regardless of the assumptions the authors must make to simulate their ideas, I think the idea of geographic routing based on hourly electricity costs is a great idea (to the extent that performance doesn't suffer greatly), particularly if the work might be moved around anyway due to node failure or data placement.
The authors choose to focus on real-time electricity pricing instead of future, predicted markets, though RT makes up about 10% of the market share. Routing to different geographic areas to obtain the best price relies on low correlation between hourly prices in different regions, which appears to be the case between different Regional Transmission Organizations (RTOs). Part of the difference in cost between two locations at once is the fact that different time zones reach their peak cost at non-peak hours in other locations.
An inherent tradeoff in cutting cost of electricity by routing it elsewhere geographically is an increase in bandwidth costs, which the authors accounted for in part of their study. They did find that this cut the amount of energy costs that could be saved, but they still managed to cut total operating costs. Further reductions could be made through enrolling in demand reduction programs, but that would require delay-insensitive workloads within the data center, and it's not obvious whether that tradeoff would end up being monetarily beneficial.
Regardless of the assumptions the authors must make to simulate their ideas, I think the idea of geographic routing based on hourly electricity costs is a great idea (to the extent that performance doesn't suffer greatly), particularly if the work might be moved around anyway due to node failure or data placement.
Thursday, November 19, 2009
Scalable Application Layer Multicast; Banerjee, Bhattacharjee, & Kommareddy
This paper describes the NICE (NICE is the Internet Cooperative Environment) multicast protocol. Like SRM, NICE is implemented at the application layer so as not to require changes to the underlying network. NICE uses hierarchical layers to group nodes into clusters and in doing so, reduce stretch, or unnecessarily increased path length from the unicast path.
When new hosts attempt to join, they first communicate with a Rendezvous Point (RP) node, which then directs them to high-tiered nodes, which subsequently direct the host to lower-level nodes close to them. Higher-level nodes are cluster leaders, responsible for keeping the size of their associated cluster between a certain threshold and splitting or merging the cluster group when the max or min threshold is passed. Members of a cluster alert each other to their presence throughput periodic heartbeat messages.
The authors simulate NICE and compare their results to Narada, an existing multicast protocol. They find that it performs comparably, but with less control overhead and less stress (total messages sent over a link) on links. The hierarachical structure of NICE seems to be the main reason that it has such low control overhead and hence can be used (at least in simulation) for groups of large sizes, say, greater than 1024 hosts, which Narada cannot do. The paper references another paper about Narada, but I wished they would have given a succint, organized comparison of the two protocols.
This paper went much more into detail about the underlying workings of the multicast groups, while the other paper offered a more concrete example of when this multicast framework would be used and how to detect and deal with losses.
When new hosts attempt to join, they first communicate with a Rendezvous Point (RP) node, which then directs them to high-tiered nodes, which subsequently direct the host to lower-level nodes close to them. Higher-level nodes are cluster leaders, responsible for keeping the size of their associated cluster between a certain threshold and splitting or merging the cluster group when the max or min threshold is passed. Members of a cluster alert each other to their presence throughput periodic heartbeat messages.
The authors simulate NICE and compare their results to Narada, an existing multicast protocol. They find that it performs comparably, but with less control overhead and less stress (total messages sent over a link) on links. The hierarachical structure of NICE seems to be the main reason that it has such low control overhead and hence can be used (at least in simulation) for groups of large sizes, say, greater than 1024 hosts, which Narada cannot do. The paper references another paper about Narada, but I wished they would have given a succint, organized comparison of the two protocols.
This paper went much more into detail about the underlying workings of the multicast groups, while the other paper offered a more concrete example of when this multicast framework would be used and how to detect and deal with losses.
A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing; Floyd, Jacobson, McCanne, Liu, & Zhang
This paper discusses scalable reliable multicast (SRM). It starts off by presenting the problems of blindly scaling unicast to multicast. Reliable delivery becomes a very difficult problem to tackle when there are multiple receivers and a layer of indirection involved. Interestingly, the authors advocate the use of Application Data Units (ADU) which is a method more closely tied to an application for requesting pieces of data.
The authors then go on to describe wb, an application that uses reliable multicast, and the design decisions that accompany it. Particular to wb are that all senders and receivers are part of the same multicast group without regard to roles, and delivery order of packets does not matter to the application. With the latter fact in mind, it seems odd that they should choose this as their example application, since part of the difficulty in multicast comes in with ordering packets correctly. To detect gaps due to losses without causing a response implosion on the sender side, receivers send periodic messages about which sequence numbers they have seen; they work together to send and respond to repair requests after first waiting some random timeout.
The paper gives the expected number of requests and timer delays for a few topologies: chain, star, and bounded-degree trees, and then shows some simulation results for these topologies. One realism missing from the simulations was the possibility of dropping repairs and repairs requests and thus requiring retransmissions of retransmissions. I hated the simulation result graphs. They weren't labeled at all, so I had to jump back and forth to the text a few times to figure out what everything was.
The authors then go on to describe wb, an application that uses reliable multicast, and the design decisions that accompany it. Particular to wb are that all senders and receivers are part of the same multicast group without regard to roles, and delivery order of packets does not matter to the application. With the latter fact in mind, it seems odd that they should choose this as their example application, since part of the difficulty in multicast comes in with ordering packets correctly. To detect gaps due to losses without causing a response implosion on the sender side, receivers send periodic messages about which sequence numbers they have seen; they work together to send and respond to repair requests after first waiting some random timeout.
The paper gives the expected number of requests and timer delays for a few topologies: chain, star, and bounded-degree trees, and then shows some simulation results for these topologies. One realism missing from the simulations was the possibility of dropping repairs and repairs requests and thus requiring retransmissions of retransmissions. I hated the simulation result graphs. They weren't labeled at all, so I had to jump back and forth to the text a few times to figure out what everything was.
Thursday, November 12, 2009
X-Trace: A Pervasive Network Tracing Framework; Fonseca, Porter, Katz, Shenker, & Stoica
This paper discusses X-Trace, a network diagnostic tool that works across a range of domains and applications by adding metadata to all network operations that originate from certain tasks. By keeping track of which tasks spawn new tasks, X-Trace ends up keeping track of task trees, whose steps it can replay to troubleshoot failures. Nodes push IDs to their next hop and establish these task trees through pushDown and pushNext, functions for lower and same layer hops, respectively.
X-Traces design was driven by 3 core ideas: trace requests should be sent in-band, collected traces should be sent out-of-band, and originators and receivers of requests should be decoupled. These principles have good reasons behind them, namely that trace requests need to be able to follow the same path as the original flow, that cted data should avoid the troublesome spots that it will help diagnose, and that administrative domains (ADs) that are unwilling to disclose some of their information need not fear this tool doing just that.
A few scenarios for using X-Trace are described including how traces could pinpoint problems; these scenarios include web site requests from an Apache server, web hosting services for photos, the i3 overlay network, tunneling, and troubleshooting ISP connections. A big open issue for X-Trace is security, for its metadata and for generating reports. I liked how frankly the authors acknowledged problems with X-Trace, that is, how difficult it was to deploy and how easy adoption/retrofitting of it would be.
NetFPGA: A Network Tool for Research and Education; Watson, McKeown, & Casado
This short paper describes the NetFPGA system, an FPGA-based platform originally created as a teaching tool to give students experience with network physical and link layers. It is currently in version 2, the main difference from version 1 being its format, using PCI and requiring no specialized backplanes or rack.
NetFPGA sounds like a great tool for people who want practical experience in developing networking hardware but cannot due to cost and complexity of such activities; networking researchers now needn't rely solely on software routing for their implementations. I don't know what their current status of running software on the actual boards is; apparently the software part of the routers was not initially run on the boards and required some outside computer, which is unfortunate.
NetFPGA sounds like a great tool for people who want practical experience in developing networking hardware but cannot due to cost and complexity of such activities; networking researchers now needn't rely solely on software routing for their implementations. I don't know what their current status of running software on the actual boards is; apparently the software part of the routers was not initially run on the boards and required some outside computer, which is unfortunate.
Tuesday, November 10, 2009
A Policy-aware Switching Layer for Data Centers; Joseph, Tavakoli, & Stoica
This paper describes PLayer, a policy-aware switching layer for data centers. PLayer is meant to help combat the problems posed by middleboxes (e.g. firewalls) and having equivalent logical and physical network topologies. With middleboxes that are hard-wired into the physical network, any rerouting of traffic due to server failures could bypass the middlebox, meaning that behavior is not reliable.
PLayer separates the notions of policy from reachability, meaning that traffic goes through middleboxes as expected by data center policy, not just as network mechanisms allow. Additionally, middleboxes are taken off the physical network data path, with traffic forwarded to them via a policy-aware pswitch. Pswitches contain switch and policy cores; as the authors point out, minimal hardware change is required for PLayer to actually be adopted in data centers.
However, it's not clear from the paper how these switches would be acquired, if DC owners would have to hack their own together. The authors themselves developed a software implementation of the pswitches using Click. They simulated use of the pswitches on a small collection of topologies and found through the use of simple micro-benchmarks that their software prototype incurred an increase in latency and decrease in throughput. Despite this degrading performance, use of pswitches offers guarantees that middleboxes will be traversed as desired even in the presence of network churn.
PLayer separates the notions of policy from reachability, meaning that traffic goes through middleboxes as expected by data center policy, not just as network mechanisms allow. Additionally, middleboxes are taken off the physical network data path, with traffic forwarded to them via a policy-aware pswitch. Pswitches contain switch and policy cores; as the authors point out, minimal hardware change is required for PLayer to actually be adopted in data centers.
However, it's not clear from the paper how these switches would be acquired, if DC owners would have to hack their own together. The authors themselves developed a software implementation of the pswitches using Click. They simulated use of the pswitches on a small collection of topologies and found through the use of simple micro-benchmarks that their software prototype incurred an increase in latency and decrease in throughput. Despite this degrading performance, use of pswitches offers guarantees that middleboxes will be traversed as desired even in the presence of network churn.
Internet Indirection Infrastructure; Stoica, Adkins, Zhuang, Shenker, & Surana
This paper discusses i3 (Internet Indirection Infrastructure), an overlay network that provides indirection through decoupling of sending and receiving to allow flexible support for anycast, mobility, and other services. i3 uses rendezvous-style communication; receiving nodes insert triggers onto specified servers that consist of a packet id of interest and an IP address to which these packets should be forwarded. Thus, receivers and senders may remain blissfully unaware of how many of the other there are.
Mobility is supported by having receiving nodes simply update their triggers when their addresses change. Multicast involves all group members setting up triggers for an i3 id. To allow intermediate steps like HTML->WML mapping, id chains can be created, and large-scale multicast can be accomodated by hierarchical triggers.
i3 is implemented and simulated using Chord. Some measurements of note include the per-packet fowarding and routing overhead; forwarding overhead seems to add between 10-40 us. While i3 allows flexible implementation of a number of services with fairly low overhead, it has some issues with security. One in particular that stands out is that any user can insert or remove a trigger for any receiver from an i3 node, which seems like a big flaw fixable only through more indirection or verification.
Mobility is supported by having receiving nodes simply update their triggers when their addresses change. Multicast involves all group members setting up triggers for an i3 id. To allow intermediate steps like HTML->WML mapping, id chains can be created, and large-scale multicast can be accomodated by hierarchical triggers.
i3 is implemented and simulated using Chord. Some measurements of note include the per-packet fowarding and routing overhead; forwarding overhead seems to add between 10-40 us. While i3 allows flexible implementation of a number of services with fairly low overhead, it has some issues with security. One in particular that stands out is that any user can insert or remove a trigger for any receiver from an i3 node, which seems like a big flaw fixable only through more indirection or verification.
Thursday, November 5, 2009
DNS Performance and the Effectiveness of Caching; Jung, Sit, Balakrishnan, & Morris
This paper presents findings about DNS regarding its performance and effectiveness of caching. Studying traces from MIT and KAIST, the authors show the breakdown of DNS query types; they found that around 60% were name-to-address queries, with another 25-30% being reverse address-to-name queries. However, they say they excluded DNS lookups not associated with TCP connections from these statistics and then revealed that 50% of MIT lookups fall into this category. How much does this class of lookups change their results?
It sounds like a lot of DNS traffic is somewhat gratuitous, for example with systems doing reverse lookups and then verifying that result by doing an A lookup on it. Surprisingly, nearly 1/4th of all DNS lookups go unanswered, with at least another 10% receiving a negative answer. The authors also chart latency of lookups based on number of referrals in the process. Not surprisingly, 2 or more referrals can cause a a tenfold increase in latency. Also not surprising is that caching can bring this latency down significantly, as well as easing the load at the root servers.
The authors further study caching through a trace-based simulation which looks at cached IP address hits. They find that a small number of names are very frequently accessed, allowing cache rates for small size caches to be high, and that increasing TTL increases cache hits only to a certain degree, as web pages with a given name are often accessed in small, successive bursts. I liked this paper because it actually took a look at what all those DNS lookups are running around the net doing.
It sounds like a lot of DNS traffic is somewhat gratuitous, for example with systems doing reverse lookups and then verifying that result by doing an A lookup on it. Surprisingly, nearly 1/4th of all DNS lookups go unanswered, with at least another 10% receiving a negative answer. The authors also chart latency of lookups based on number of referrals in the process. Not surprisingly, 2 or more referrals can cause a a tenfold increase in latency. Also not surprising is that caching can bring this latency down significantly, as well as easing the load at the root servers.
The authors further study caching through a trace-based simulation which looks at cached IP address hits. They find that a small number of names are very frequently accessed, allowing cache rates for small size caches to be high, and that increasing TTL increases cache hits only to a certain degree, as web pages with a given name are often accessed in small, successive bursts. I liked this paper because it actually took a look at what all those DNS lookups are running around the net doing.
Development of the Domain Name System; Mockapetris & Dunlap
This paper describes the initial goals behind the creation of the Domain Name System, or DNS. Originally, mappings of host names to addresses was maintained in one single centrally-located HOSTS.TXT file. Understandably, this approach became unwieldy and needed to be distributed somehow.
The new DNS approach was designed to adhere to the following constraints: provide the same information as HOSTS.TXT, allow distributed management, and flexibly interoperate across environments including the DARPA Internet, all while providing "tolerable" performance and imposing few limits on size of names or other components associated with it.
DNS requires the use of name servers, which store information and answer queries, and resolvers, which interact with client programs to find the right name server to connect to. Resolvers can choose to cache query responses. DNS has an internal name space in the form labels concatenated together from a variable-depth tree, allowing organizations flexibility to organize themselves and expand internally.
I found it amusing to hear that the Berkeley community was somewhat resistant to the change to DNS. As this paper was written over 20 years ago, I wonder if HOSTS.TXT is still used by many systems; I know I personally use a hosts mapping file in Linux to build shortcuts on top of DNS, but to be at all usable, HOSTS.TXT really needs to be an extra layer in addition to DNS to handle special cases at local levels.
The new DNS approach was designed to adhere to the following constraints: provide the same information as HOSTS.TXT, allow distributed management, and flexibly interoperate across environments including the DARPA Internet, all while providing "tolerable" performance and imposing few limits on size of names or other components associated with it.
DNS requires the use of name servers, which store information and answer queries, and resolvers, which interact with client programs to find the right name server to connect to. Resolvers can choose to cache query responses. DNS has an internal name space in the form labels concatenated together from a variable-depth tree, allowing organizations flexibility to organize themselves and expand internally.
I found it amusing to hear that the Berkeley community was somewhat resistant to the change to DNS. As this paper was written over 20 years ago, I wonder if HOSTS.TXT is still used by many systems; I know I personally use a hosts mapping file in Linux to build shortcuts on top of DNS, but to be at all usable, HOSTS.TXT really needs to be an extra layer in addition to DNS to handle special cases at local levels.
Monday, November 2, 2009
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications; Stoica, Morris, Karger, Kaashoek, & Balakrishnan
This paper discusses Chord, a distributed lookup protocol. Chord is a DHT that uses a distributed routing table and requires that each node have knowledge of only a few neighbors, making it scalable. It maintains a balanced load through the use of consistent hashing. Any key k has a successor node successor(k) that it corresponds to, with successor(k) being the first clockwise node from the identifier k. Nodes leaving and joining the network simply have values transferred back and forth between successors and themselves.
Each chord node maintains more than just the next successor; it keeps a finger table of a subset of log n nodes. These finger table entries can be used to find an identifier k by finding the closest predecessor j to k in the table and then sending the request to j and recursing until eventually the correct node has been found. A difficulty is in dealing with simultaneously joining/failing nodes, though apparently given a small "settling" time, the finger tables will eventually converge nicely.
The authors simulate Chord on a large network (thousands of nodes storing tens of thousands of keys) and also try it out on a smaller scale experimental testbed, looking at latency and path length. Problems they have not yet tackled include dealing with a partitioned Chord ring and dealing with malicious users. Additionally, there's a tradeoff between required number of lookups and finger table sizes which could use further exploration. Otherwise, Chord seems like a fairly straightforward, simple distributed lookup scheme.
Each chord node maintains more than just the next successor; it keeps a finger table of a subset of log n nodes. These finger table entries can be used to find an identifier k by finding the closest predecessor j to k in the table and then sending the request to j and recursing until eventually the correct node has been found. A difficulty is in dealing with simultaneously joining/failing nodes, though apparently given a small "settling" time, the finger tables will eventually converge nicely.
The authors simulate Chord on a large network (thousands of nodes storing tens of thousands of keys) and also try it out on a smaller scale experimental testbed, looking at latency and path length. Problems they have not yet tackled include dealing with a partitioned Chord ring and dealing with malicious users. Additionally, there's a tradeoff between required number of lookups and finger table sizes which could use further exploration. Otherwise, Chord seems like a fairly straightforward, simple distributed lookup scheme.
Looking Up Data in P2P Systems; Balakrishnan, Kaashoek, Karger, Morris, & Stoica
This article discusses the lookup problem in P2P networks, that is, scalable and efficient lookup. A few approaches include centralized databases, which have scalability and fault-tolerance issues, and DHTs, which involve passing around of key-value pairs. P2P systems do not employ hierarchical lookups, treating each host equally in the algorithm. Broadcasts between neighbors can be used when a node doesn't have a key, but too many broadcasts can quickly take down a system by using up all available bandwidth.
The authors then go into more depth about the DHTs, describing how each step in a lookup for some key or ID should progress to a node with a closer value to ID, based on some definition of closeness such as numeric difference or number of similar prefix bits. Chord is a DHT that uses a skip-list to allow binary searches for keys, whereas Pastry, Tapestry, Kademlia, and CAN all use tree-based structures.
The article is brief, having been published in communications of the ACM, but it seems to go into too much detail about the wrong things. I would have been more interested in an organized comparison of the different DHT implementations and an analysis of their strengths and weaknesses. Instead, I got a long discussion about P2P systems, skip lists, and hash table lookups.
The authors then go into more depth about the DHTs, describing how each step in a lookup for some key or ID should progress to a node with a closer value to ID, based on some definition of closeness such as numeric difference or number of similar prefix bits. Chord is a DHT that uses a skip-list to allow binary searches for keys, whereas Pastry, Tapestry, Kademlia, and CAN all use tree-based structures.
The article is brief, having been published in communications of the ACM, but it seems to go into too much detail about the wrong things. I would have been more interested in an organized comparison of the different DHT implementations and an analysis of their strengths and weaknesses. Instead, I got a long discussion about P2P systems, skip lists, and hash table lookups.
Subscribe to:
Posts (Atom)