07/26/2012 09:21 PM
MapReduce is a popular framework for distributed, parallel computation that is increasingly used in domains quite different from the web applications for which it was designed, including the processing of large structured datasets such as scientific and financial data. Previous work on using MapReduce to process scientific data did not exploit knowledge of this structure in the framework's internal communications. We show that performance gains can be realized by leveraging knowledge of the structure of the data to minimize and localize communication between nodes, guarantee workload balance across processing nodes, ensure that Reduce tasks start as soon as possible, and create balanced, contiguous output. We implemented these improvements in SciHadoop, a version of the open-source Hadoop MapReduce framework designed for structured scientific data. Our results show reductions in total query execution time of up to 29% over SciHadoop, with initial results available after only 6% of the query has completed, and output that is more efficiently organized than Hadoop's.
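To illustrate what balanced, contiguous output can mean in practice, the following is a minimal sketch of structure-aware key routing. It is not the implementation described above. It assumes that intermediate keys are coordinates in a dense n-dimensional array of known shape, and the function names (`linear_index`, `contiguous_partition`, `hash_partition`) are hypothetical. Each reducer is given one contiguous, near-equal slice of the row-major key space, in contrast to a hash-based assignment that ignores the structure.

```python
from collections import Counter


def linear_index(coord, shape):
    """Row-major position of an n-dimensional coordinate within `shape`."""
    index = 0
    for c, extent in zip(coord, shape):
        index = index * extent + c
    return index


def contiguous_partition(coord, shape, num_reducers):
    """Hypothetical structure-aware routing: each reducer owns one
    contiguous slice of the row-major key space, and slice sizes
    differ by at most one key."""
    total = 1
    for extent in shape:
        total *= extent
    return linear_index(coord, shape) * num_reducers // total


def hash_partition(coord, num_reducers):
    """Structure-oblivious baseline: the reducer is chosen from the key's
    hash, so neighbouring keys need not land on the same reducer."""
    return hash(coord) % num_reducers


# Usage: route every key of a 4 x 6 array to one of 4 reducers.
shape = (4, 6)
keys = [(i, j) for i in range(shape[0]) for j in range(shape[1])]
assignment = [contiguous_partition(k, shape, 4) for k in keys]
counts = Counter(assignment)  # number of keys handled by each reducer
```

Because the assignment is monotone in the row-major index, each reducer's output covers one unbroken range of the array, and in this sketch the range each reducer will receive can be computed from the array shape alone.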