Tuesday, May 06, 2008

I recently found the time to read the paper describing Google's MapReduce.

Google collects massive amounts of data. In order to process all this data in a reasonable amount of time, they use clusters of computers working in parallel. MapReduce is designed to take away the complexities of parallel computing and cluster management from the programmers. It is intimately connected with Google's own File System (GFS), making use of the same fragments of files.

The pragmatic parts are what struck me most.

  1. Run jobs on multiple machines and get the result from the machine that finishes first.
  2. When a job fails, if possible detect which record caused the crash, and try again on a different machine.
  3. And especially. Skipping Bad Records. If a job fails on multiple machines just skip the records that cause crashes (Get out results first and worry about problems later!)
n.b. See Hadoop, for an open source implementation of MapReduce.

posted @ 12:02 PM | Feedback (0)