
3.3 The MapreducePipeline Class

markgoldstein edited this page Oct 24, 2014 · 13 revisions

The MapreducePipeline class wires together all the steps needed to perform a specific Mapreduce job. It specifies the mapper, the reducer, the data input reader, the output writer, and the parameters used to carry out the job.

When the job completes, the pipeline returns the filenames produced by the output writer.

Constructor

class MapreducePipeline(job_name, mapper_spec, reducer_spec, input_reader_spec, output_writer_spec=None, mapper_params=None, reducer_params=None, shards=None)

The constructor's arguments fully specify a Mapreduce job:

job_name
The name of the Mapreduce job. This name shows up in the logs and in the UI.
mapper_spec
The name of the mapper used in this Mapreduce job. The mapper processes the line-by-line input from the input reader specified in the input_reader_spec parameter.
reducer_spec
The name of the reducer used in this Mapreduce job. The reducer performs work and yields results, which are stored by the optional output writer specified in the output_writer_spec parameter.
input_reader_spec
The name of the input reader that feeds the mapper for this Mapreduce job. The mapper processes the line-by-line input that this reader supplies.
output_writer_spec
The name of the output writer (if any) used to store results from this Mapreduce job.
mapper_params
Parameters to use in the input reader.
reducer_params
Parameters to use in the output writer.
shards
Number of shards to use for this Mapreduce job.
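
The arguments above can be collected as in the following minimal sketch. It assumes a word-count job: main.word_count_map and main.word_count_reduce are hypothetical dotted paths to functions in your own application, and the blob key is a placeholder. The reader and writer paths are assumptions based on the library's demo application and should be checked against the version you have installed.

```python
def build_word_count_args(blob_key, shards=16):
    """Collect the MapreducePipeline constructor arguments in one dict."""
    return {
        "job_name": "word_count",
        # Hypothetical dotted paths to functions in your own application.
        "mapper_spec": "main.word_count_map",
        "reducer_spec": "main.word_count_reduce",
        # Assumed reader/writer paths; verify them against your library version.
        "input_reader_spec": "mapreduce.input_readers.BlobstoreZipInputReader",
        "output_writer_spec": "mapreduce.output_writers.BlobstoreOutputWriter",
        "mapper_params": {"blob_key": blob_key},
        "reducer_params": {"mime_type": "text/plain"},
        "shards": shards,
    }


def make_word_count_pipeline(blob_key):
    """Build the pipeline; requires the mapreduce library on App Engine."""
    # Imported lazily so this module still loads where the library is absent.
    from mapreduce import mapreduce_pipeline
    return mapreduce_pipeline.MapreducePipeline(
        **build_word_count_args(blob_key))
```

Keeping the arguments in a plain dict makes them easy to inspect or log before the pipeline is created; passing them positionally, in the order of the constructor signature, works equally well.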

Instance Methods

A MapreducePipeline instance has the following methods:

start(self, **kwargs)

Starts the Mapreduce job. (This method is inherited from the Pipeline class.)
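
A typical call sequence might look like the sketch below. It assumes that, as in the Pipeline library, a started pipeline exposes the base_path and pipeline_id attributes, which can be combined into the URL of the status page. FakePipeline is a hypothetical stand-in with placeholder values, included only so the helper can be tried outside App Engine.

```python
def start_and_get_status_url(pipeline):
    """Start the job, then build the URL of its status page."""
    pipeline.start()
    # pipeline_id is only meaningful after start() has been called.
    return pipeline.base_path + "/status?root=" + pipeline.pipeline_id


class FakePipeline(object):
    """Hypothetical stand-in for a MapreducePipeline instance."""
    base_path = "/mapreduce/pipeline"  # placeholder path
    pipeline_id = None

    def start(self, **kwargs):
        # A real pipeline would enqueue tasks here; this only sets an ID.
        self.pipeline_id = "fake-pipeline-id"


if __name__ == "__main__":
    print(start_and_get_status_url(FakePipeline()))
```

In a request handler, the returned URL would usually be the target of a redirect so that the user lands on the job's status page.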