Skip to content
/ mapreduce Public

a simple python mapreduce engine. written for the Shopify data engineering challenge.

Notifications You must be signed in to change notification settings

dhop/mapreduce

Repository files navigation

mapreduce -- a Python MapReduce engine

mapreduce is a simple Python module for executing programs in the MapReduce paradigm. It offers two distinct classes that one could build a MapReduce program with:

MapReduce

The simplest implementation of the MapReduce pattern. This class executes MapReduce programs in a single process, simply by processing all maps and reduces sequentially. To use this class you must inherit it and implement your own mapper and reducer. Here is an example:

from mapreduce import MapReduce

class WordCounter(MapReduce):
    WORD = re.compile(r"[\w']+")

    def mapper(self, line):
        for word in self.WORD.findall(line):
            yield (word, 1)

    def reducer(self, key, entries):
        return key, sum(entries)

Additionally, you may define setup and cleanup methods in the same class. setup is executed before the mappers run, and cleanup is executed right after the last reducer. See the movie_ratings.py example for a use case of this.

To run your MapReduce job, you need only instantiate the class with the input file path, and use the run method. Here is an example:

if __name__ == "__main__"
    wc = WordCounter("shakespeare.txt")
    wc.run()

MapReduceConcurrent

This class, in contrast to its simpler counterpart, makes use of python's built-in multiprocessing library to execute mappers and reducers in a distributed environment. When instantiating this class, you may also provide a processes parameter. This determines how many processes will be spawned in the worker pool that carries out the map and reduce tasks. Here is an example with our WordCounter class from above:

class WordCounter(MapReduceConcurrent):

    ...

if __name__ == "__main__"
    wc = WordCounter("shakespeare.txt", processess=8)
    wc.run()

MapReduceConcurrent will start the MapReduce job by creating a pool of processes. It reuses this pool throughout the job to submit concurrent map and reduce tasks to different processes, each with an equally divided group of work to do. It uses the built-in Pool.map to divide the work, with a chunksize that is calculated from the amount of records in the Iterable divided by the amount of processes in the pool.


word_counter.py and movie_ratings.py are both MapReduce programs that use the mapreduce module as defined above. They can be used out of the box as follows:

    $ python word_counter.py shakespeare.txt > results.tsv
    $ python movie_ratings.py ratings.txt > results.tsv

Omitting the results.tsv will send the results to stdout. You may also change each example's inheritance from MapReduce to MapReduceConcurrent and vice versa. All functionality will remain the same.

About

a simple python mapreduce engine. written for the Shopify data engineering challenge.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages