mapreduce -- a Python MapReduce engine

mapreduce is a simple Python module for executing programs in the MapReduce paradigm. It offers two distinct classes that one could build a MapReduce program with:

MapReduce

The simplest implementation of the MapReduce pattern. This class executes MapReduce programs in a single process, simply by processing all maps and reduces sequentially. To use this class you must inherit it and implement your own mapper and reducer. Here is an example:

from mapreduce import MapReduce

class WordCounter(MapReduce):
    WORD = re.compile(r"[\w']+")

    def mapper(self, line):
        for word in self.WORD.findall(line):
            yield (word, 1)

    def reducer(self, key, entries):
        return key, sum(entries)

Additionally, you may define setup and cleanup methods in the same class. setup is executed before the mappers run, and cleanup is executed right after the last reducer. See the movie_ratings.py example for a use case of this.

To run your MapReduce job, you need only instantiate the class with the input file path, and use the run method. Here is an example:

if __name__ == "__main__"
    wc = WordCounter("shakespeare.txt")
    wc.run()

MapReduceConcurrent

This class, in contrast to its simpler counterpart, makes use of python's built-in multiprocessing library to execute mappers and reducers in a distributed environment. When instantiating this class, you may also provide a processes parameter. This determines how many processes will be spawned in the worker pool that carries out the map and reduce tasks. Here is an example with our WordCounter class from above:

class WordCounter(MapReduceConcurrent):

    ...

if __name__ == "__main__"
    wc = WordCounter("shakespeare.txt", processess=8)
    wc.run()

MapReduceConcurrent will start the MapReduce job by creating a pool of processes. It reuses this pool throughout the job to submit concurrent map and reduce tasks to different processes, each with an equally divided group of work to do. It uses the built-in Pool.map to divide the work, with a chunksize that is calculated from the amount of records in the Iterable divided by the amount of processes in the pool.

word_counter.py and movie_ratings.py are both MapReduce programs that use the mapreduce module as defined above. They can be used out of the box as follows:

    $ python word_counter.py shakespeare.txt > results.tsv

    $ python movie_ratings.py ratings.txt > results.tsv

Omitting the results.tsv will send the results to stdout. You may also change each example's inheritance from MapReduce to MapReduceConcurrent and vice versa. All functionality will remain the same.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.gitignore		.gitignore
README.md		README.md
if-kipling.txt		if-kipling.txt
mapreduce.py		mapreduce.py
movie_ratings.py		movie_ratings.py
movies.txt		movies.txt
ratings.txt		ratings.txt
shakespeare.txt		shakespeare.txt
word_counter.py		word_counter.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mapreduce -- a Python MapReduce engine

MapReduce

MapReduceConcurrent

About

Releases

Packages

Languages

dhop/mapreduce

Folders and files

Latest commit

History

Repository files navigation

mapreduce -- a Python MapReduce engine

MapReduce

MapReduceConcurrent

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages