Thursday, September 17, 2009

Detailed Diagnosis in Enterprise Networks; Kandula et al.

The authors of this paper conducted a study of error logs within a small enterprise network. Based on a sample of these logs, they concluded that the majority of faults within a network impact individual applications (rather than whole machines), typically because of some configuration issue. Disturbingly, the causes of 25% of problems were unknown yet solvable by rebooting the system, which only adds to the "black magic" veil over networking. Their goal, then, is to create a system called NetMedic that can diagnose, down to the process level, the culprits of network problems and their probable underlying causes in as much detail as possible without relying on application-specific knowledge.

NetMedic models the network as a dependency graph with edges between dependent components. It employs a set of techniques that allow a process to keep track of its own configuration state, the behavior of the peers it communicates with, and the state of its host machine. Diagnosing an error takes three steps: determining which pieces of the network are abnormal (statistically, based on their previous values over time), computing edge weights for the dependency graph, and computing path weights to decide on the likely error source (the components with the largest path weights being the likely causes). Interestingly, they don't consider the possibility that a normally behaving component could be the cause of other problems. I guess problems usually occur when things (like configuration files) change and not just out of the blue, so this assumption is probably okay for catching most errors.
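To make the three steps concrete for myself, here is a minimal Python sketch of the pipeline as summarized above. It is not NetMedic's actual method: the z-score abnormality test, the "minimum of the two endpoints" edge weight, the product-of-edges path weight, and all component names and numbers are my own simplifying assumptions, chosen only to show how the pieces fit together.

```python
from statistics import mean, pstdev

def abnormality(history, current):
    """Step 1: score in [0, 1] for how far `current` deviates from past values.
    Assumption: a z-score scaled so that 3 sigma or more counts as fully abnormal."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return 0.0 if current == mu else 1.0
    return min(abs(current - mu) / sigma / 3.0, 1.0)

def edge_weights(edges, scores):
    """Step 2: weight each dependency edge (src, dst), meaning src can affect dst.
    Placeholder rule: an edge is only as suspicious as its less abnormal endpoint."""
    return {(s, d): min(scores[s], scores[d]) for s, d in edges}

def best_path_weight(weights, src, dst, seen=frozenset()):
    """Step 3: largest product of edge weights over simple paths from src to dst."""
    if src == dst:
        return 1.0
    visited = seen | {src}
    best = 0.0
    for (s, d), w in weights.items():
        if s == src and d not in visited:
            best = max(best, w * best_path_weight(weights, d, dst, visited))
    return best

def rank_causes(weights, components, victim):
    """Rank every other component by its best path weight to the victim."""
    ranked = [(c, best_path_weight(weights, c, victim))
              for c in components if c != victim]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical example: a config value changed and the app's latency spiked,
# while the database it also depends on looks the same as always.
history = {"config": [1, 1, 1, 1], "app": [10, 11, 9, 10], "db": [5, 5, 6, 4]}
current = {"config": 2, "app": 30, "db": 5}
edges = [("config", "app"), ("db", "app")]

scores = {name: abnormality(history[name], current[name]) for name in history}
weights = edge_weights(edges, scores)
ranking = rank_causes(weights, list(history), victim="app")
print(ranking)
```

In this toy run the changed config ranks above the unchanged database. It also shows the assumption I noted above: a component that looks normal gets a score of zero and so can never be blamed.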

NetMedic monitors both the Windows software counters and the Windows registry and configuration files. However, if half of the causes are due to configuration files, I wonder how important the software counters actually are for diagnosing problems. I thought it was admirable that, because they couldn't instrument actual servers that were in use, the authors just started up their own and injected faults at random.

1 comment:

Randy H. Katz said...

Note that it is not difficult to add observation points in the network. Some of the packet shaping boxes, for example, can also be used to collect lots of protocol statistics. Plus if you include Chukwa, you can collect vast quantities of observations for analysis and model building.