MapReduce

Posted on Tuesday, May 06, 2008 12:02 PM

I recently found the time to read the paper describing Google's MapReduce.

Google collects massive amounts of data. In order to process all this data in a reasonable amount of time, they use clusters of computers working in parallel. MapReduce is designed to take away the complexities of parallel computing and cluster management from the programmers. It is intimately connected with Google's own File System (GFS), making use of the same fragments of files.

The pragmatic parts are what struck me most.

  1. Run jobs on multiple machines and get the result from the machine that finishes first.
  2. When a job fails, if possible detect which record caused the crash, and try again on a different machine.
  3. And especially. Skipping Bad Records. If a job fails on multiple machines just skip the records that cause crashes (Get out results first and worry about problems later!)
n.b. See Hadoop, for an open source implementation of MapReduce.

Post Comment

Title  
Name  
Url
Comment   

ATTENTION: the code you need to copy is CaSe SeNsItIvE and is required to prevent spam.
Enter the code you see: