Wednesday, October 15, 2008

MapReduce: Simplified Data Processing on Large Clusters, Dean & Ghemawat

MapReduce is a programming model that allows users to easily write code for processing large amounts of data in parallel. In Google's implementation, the user specifies functions for the map and reduce phases; a set of inputs is split across a number of machines, the map function is applied and a set of intermediate key/value pairs is written, and then the intermediate pairs are read by other machines that apply the reduce function and write out the answer.
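As a rough illustration (not Google's distributed implementation), here is a minimal single-process sketch in Python of the word-count example from the paper: the user supplies the map and reduce functions, and a toy driver applies map to each input, groups the intermediate pairs by key, and applies reduce to each group. The driver and input names here are hypothetical.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """User-supplied map: emit (word, 1) for each word in the input."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """User-supplied reduce: sum all counts emitted for a given word."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input split and collect the
    # intermediate key/value pairs, grouped by key.
    intermediate = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    # Reduce phase: apply reduce_fn to each key and its list of values.
    return [reduce_fn(key, values) for key, values in intermediate.items()]

inputs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(run_mapreduce(inputs, map_fn, reduce_fn))
# e.g. [('the', 2), ('quick', 1), ('brown', 1), ...]
```

In the real system the grouping step is the distributed shuffle: intermediate pairs are partitioned by key across the reduce workers rather than collected in one dictionary.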

Google's MapReduce cleverly tackles the problems of "straggler" machines and fault tolerance through duplication and re-execution. Once the overall computation is close to completion, so-called backup executions are scheduled for the remaining in-progress tasks, and whichever copy finishes first wins. Similarly, when a worker machine fails, the task it was working on is simply re-executed, while a failed master can be restarted from periodically checkpointed state. MapReduce works well for Google because they have a huge number of machines at their disposal; that's not to say the technique isn't useful on a much smaller number of machines, though the paper does not discuss the performance implications of that scenario.
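The backup-execution idea is easy to sketch in miniature. The snippet below (my own toy illustration, not anything from the paper) launches two copies of the same deterministic task and takes whichever finishes first; because the task has no side effects, either result is acceptable, which is exactly why duplication and re-execution are safe.

```python
import concurrent.futures
import random
import time

def deterministic_task(task_id):
    # Simulate a worker that may be a straggler: same result either way,
    # but the running time varies.
    time.sleep(random.uniform(0.01, 0.5))
    return f"result-of-{task_id}"

# Schedule a backup copy of the same task and use the first one to finish.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(deterministic_task, 7) for _ in range(2)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    print(next(iter(done)).result())
```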
