# 3.4 Readers and Writers
- Using Readers and Writers
- Standard Input Readers
- Standard Output Writers
- Customized Readers and Writers
## Using Readers and Writers
You don't instantiate, invoke, or write to input readers or output writers; all of the interaction with the readers and writers is done for you by the [MapreducePipeline](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/pipeline_base.py) object. You simply tell your MapreducePipeline object which reader and which output writer to use, and you provide the MapreducePipeline with the reader and writer parameters.
The snippet below shows a MapreducePipeline constructor specifying a word count job, the corresponding mapper and reducer functions, and the input reader and output writer to be used. Notice the `mapper_params` and `reducer_params`: those parameters are actually passed to the reader and writer, respectively. Notice also how the reader and writer are specified, using their full paths in the Mapreduce library.
```
mapreduce_pipeline.MapreducePipeline(
    "word_count",
    "main.word_count_map",
    "main.word_count_reduce",
    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={
        "blob_key": "blobkey",
    },
    reducer_params={
        "mime_type": "text/plain",
    },
    shards=16)
```
## Standard Input Readers
The standard input readers are designed to read data from specific storage, such as Blobstore or Datastore, and supply it to the mapper function. The summary below describes each reader and its mapreduce.yaml parameter settings.
###### [*BlobstoreLineInputReader*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L1358-L1509)
Reads a newline (\n) delimited text file one line at a time from Blobstore. It calls the mapper function once for each line, passing a tuple consisting of the byte offset in the file of the line's first character and the line itself as a string, not including the trailing newline: (byte_offset, line_value). A sketch of such a mapper follows the parameter list.
*Parameters*
* <code>blob_keys</code> Either a string containing the blob key, or a list containing multiple blob keys, specifying the data to be read by the reader.
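A minimal sketch of a mapper that consumes this reader's tuples; the function name and the word-splitting logic are illustrative, not part of the library.
```
def line_count_map(data):
    """Emit one (word, "") pair per word; data is (byte_offset, line_value)."""
    (byte_offset, line) = data
    for word in line.split():
        yield (word.lower(), "")
```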
###### [*BlobstoreZipInputReader*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L1512-L1673)
Iterates over all of the compressed files within the specified zipfile in Blobstore. It calls the mapper function once for each file, passing it a tuple consisting of the zipfile.ZipInfo entry for the file and a parameterless function that your mapper calls to return the complete body of the file as a string: (zipinfo, file_callable). The following snippet shows how your mapper might extract each file's data in each iteration:
```
def word_count_map(data):
    """Word count map function."""
    (entry, text_fn) = data  # entry is a zipfile.ZipInfo; text_fn returns the file body
    text = text_fn()
```
*Parameters*
* <code>blob_key</code> A string containing the blob key specifying the zip file data to be read by the reader.
###### [*BlobstoreZipLineInputReader*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L1676-L1905)
Iterates over all of the compressed files within the specified zipfile in Blobstore; each file must contain newline (\n) delimited data. It calls the mapper function once for each line in each file, passing a tuple consisting of the byte offset in the file of the line's first character and the line itself as a string, not including the trailing newline: (byte_offset, line_value).
*Parameters*
* <code>blob_keys</code> Either a string containing the blob key, or a list containing multiple blob keys, specifying the zip file data to be read by the reader.
###### [*DatastoreInputReader*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L718-L859)
Iterates over all entities of the specified kind (entity_kind) in the datastore, automatically advancing to the next unread entities. Each iteration returns the number of entities specified by the batch_size parameter. This reader does no filtering: you must do any required filtering in your mapper, as in the sketch after the parameter list.
*Parameters*
* <code>entity_kind</code> The datastore kind to map over.
* <code>namespace</code> The namespace that will be searched for entities of the given kind.
* <code>batch_size</code> The number of entities to read from the datastore with each batch get. Default is 50.
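A minimal sketch of mapper-side filtering for this reader; the UserPhoto kind and its title property are hypothetical, while op.counters.Increment is the library's counter operation.
```
from mapreduce import operation as op

def photo_count_map(entity):
    """Count entities that have a title; DatastoreInputReader passes
    each entity in, and any filtering has to happen here."""
    if entity.title:  # hypothetical property of a hypothetical UserPhoto kind
        yield op.counters.Increment("titled_photos")
```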
###### [*DatastoreKeyInputReader*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L860-L864)
Iterates over the keys of all entities of the specified entity_kind in the datastore, automatically advancing to the next unread keys. Each iteration returns the number of keys specified by the batch_size parameter. This reader does no filtering: you must do any required filtering in your mapper. A sketch of a key-consuming mapper follows the parameter list.
*Parameters*
* <code>entity_kind</code> The datastore kind whose keys are to be returned.
* <code>namespace</code> The namespace that will be searched for entities of the given kind.
* <code>batch_size</code> The number of keys to read from the datastore with each batch get. Default is 50.
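A minimal sketch of a mapper that consumes keys from this reader, using the library's op.db.Delete operation; the cleanup use case is illustrative.
```
from mapreduce import operation as op

def cleanup_map(key):
    """Delete every entity whose key the reader yields."""
    yield op.db.Delete(key)
```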
###### [*NamespaceInputReader*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L2001-L2091)
Iterates over and returns the available namespaces.
*Parameters*
* <code>namespace_range</code> The range of namespaces that will be iterated over.
* <code>batch_size</code> The number of namespaces to return in each iteration. Default is 10.
###### [*RecordsReader*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L2094-L2228)
Reads a list of files obtained via the Files API in records format, yielding each record as a string in each iteration.
*Parameters*
* <code>files</code> Either a string naming the file to be read, or a list of strings naming multiple files to be read.
## Standard Output Writers
The standard output writers write data from the reducer function to specific storage, such as Blobstore or Google Cloud Storage. The summary below describes each writer and its mapreduce.yaml parameter settings.
###### [*BlobstoreOutputWriter*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/output_writers.py#L921-L922)
Writes data from the reducer function to Blobstore, automatically assigning a filename. To retrieve the filename, you must use the completed mapreduce pipeline, as demonstrated by the StoreOutput function in the [Mapreduce Made Easy](/appengine/docs/python/dataprocessing/helloworld) demo; a sketch of the pattern follows the parameter list.
*Parameters*
* <code>mime_type</code> MIME content type of the output blob. For example, <code>"text/plain"</code>.
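A hedged sketch of that pattern: the MapreducePipeline yields the list of files the writer created, which a follow-up pipeline, here a hypothetical StoreOutput modeled on the demo, can consume once the job completes.
```
from mapreduce import base_handler, mapreduce_pipeline

class WordCountPipeline(base_handler.PipelineBase):
    """Run the word count job, then hand the output filenames to StoreOutput."""

    def run(self, blobkey):
        output = yield mapreduce_pipeline.MapreducePipeline(
            "word_count",
            "main.word_count_map",
            "main.word_count_reduce",
            "mapreduce.input_readers.BlobstoreZipInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={"blob_key": blobkey},
            reducer_params={"mime_type": "text/plain"},
            shards=16)
        yield StoreOutput(output)  # hypothetical pipeline that records the filenames
```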
###### [*FileOutputWriter*](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/output_writers.py#L840-L852)
Writes output data to Blobstore or Google Cloud Storage, automatically
assigning a filename. To retrieve the filename, you must use the completed
[MapreducePipeline](/appengine/docs/python/dataprocessing/mapreducepipelineclass), as demonstrated by the StoreOutput function, which can
be found in the file [main.py](https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/demo/main.py) which is part of the [Mapreduce Made Easy](/appengine/docs/python/dataprocessing/helloworld) demo.
*Parameters*
* <code>filesystem</code> The type of output storage: <code>blobstore</code> or <code>gs</code>.
* <code>mime_type</code> The MIME content type of the written data. For example, <code>text/plain</code>.
* <code>gs_bucket_name</code> For a gs filesystem, the bucket name and directory. For example, <code>mybucket/dir1/dir2</code>.
* <code>output_sharding</code> Controls the number of output files. Only <code>input</code> is supported, which means the number of output files equals the number of input shards.
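A minimal sketch of the parameters you would pass as reducer_params when this writer targets Google Cloud Storage; the bucket name and directory are placeholders.
```
reducer_params = {
    "filesystem": "gs",                      # write to Google Cloud Storage
    "gs_bucket_name": "mybucket/wordcount",  # placeholder bucket/directory
    "mime_type": "text/plain",
    "output_sharding": "input",              # one output file per input shard
}
```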
## Customized Readers and Writers
The standard input readers and output writers should suffice for most use cases. If you need a reader that handles a different input source or format, or a writer that targets a different location or output format, contact Google to ask whether they can be added to the standard set.
Alternatively, if you want to write your own reader or writer, take a look at the open source code for the [readers](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py) and [writers](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/output_writers.py) to see how they are implemented. A skeleton of a custom reader is sketched below.
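A hedged skeleton of a custom reader, following the InputReader interface defined in input_readers.py (validate, split_input, to_json/from_json, and iteration); the SingleValueInputReader class and its round-robin sharding are illustrative assumptions, not part of the library.
```
from mapreduce import input_readers

class SingleValueInputReader(input_readers.InputReader):
    """Toy reader that yields the values listed in mapper_params."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # Pop as we go so that to_json() always reflects the unread remainder.
        while self._values:
            yield self._values.pop(0)

    @classmethod
    def validate(cls, mapper_spec):
        if "values" not in mapper_spec.params:
            raise input_readers.BadReaderParamsError("'values' is required")

    @classmethod
    def split_input(cls, mapper_spec):
        # Deal the values round-robin across the requested shard count.
        values = mapper_spec.params["values"]
        shards = mapper_spec.shard_count
        return [cls(values[i::shards]) for i in range(shards)]

    def to_json(self):
        return {"values": self._values}

    @classmethod
    def from_json(cls, json):
        return cls(json["values"])
```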