3.4 Readers and Writers
- Using Readers and Writers
- Standard Input Readers
- Standard Output Writers
- Customized Readers and Writers
You don't instantiate, invoke, or write to input readers or output writers yourself; all interaction with the readers and writers is handled for you by the [MapreducePipeline](https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/pipeline_base.py) object. You simply tell your MapreducePipeline which input reader and output writer to use, and provide it with the reader and writer parameters.
The snippet below shows a MapreducePipeline constructor specifying a word count job, the corresponding mapper and reducer functions, and the input reader and output writer to be used. Notice the mapper_params and reducer_params: those parameters are actually passed to the input reader and the output writer, respectively. Notice also how the reader and writer are specified by their full paths within the MapReduce library.
```python
mapreduce_pipeline.MapreducePipeline(
    "word_count",
    "main.word_count_map",
    "main.word_count_reduce",
    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={
        "blob_key": "blobkey",
    },
    reducer_params={
        "mime_type": "text/plain",
    },
    shards=16)
```
The standard input readers are designed to read data from a specific kind of storage, such as Blobstore or the Datastore, and supply that data to the mapper function. The summary below describes each reader and its mapreduce.yaml parameter settings.
BlobstoreLineInputReader reads a newline (\n) delimited text file one line at a time from Blobstore. It calls the mapper function once with each line, passing it a tuple of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline. For example: (byte_offset, line_value).
Parameters
- blob_keys: Either a string containing the blob key, or a list containing multiple blob keys, specifying the data to be read by the reader.
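For illustration, here is a minimal sketch (not from the original page) of a mapper consuming the (byte_offset, line_value) tuples this reader produces; the function name and the choice to ignore the offset are assumptions.

```python
def count_lines_map(data):
    """Hypothetical mapper for BlobstoreLineInputReader input."""
    (byte_offset, line_value) = data  # offset of the line in the blob, and the line text
    # The byte offset is available if you need it; this sketch only uses the line.
    yield (line_value.strip().lower(), "")
```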
BlobstoreZipInputReader iterates over all of the compressed files within the specified zipfile in Blobstore. It calls the mapper function once for each file, passing it a tuple of the zipfile.ZipInfo entry for the file and a parameterless function that your mapper calls to obtain the complete body of the file as a string. For example, (zipinfo, file_callable). The following snippet shows how your mapper might extract each file's data in each iteration:
```python
def word_count_map(data):
    """Word count map function."""
    (entry, text_fn) = data  # zipfile.ZipInfo entry and a callable returning the file body
    text = text_fn()         # read the complete file contents as a string
    for word in text.split():
        yield (word.lower(), "")
```
Parameters
- blob_key: A string containing the blob key specifying the zip file data to be read by the reader.
BlobstoreZipLineInputReader iterates over all of the compressed files, each of which must contain newline (\n) delimited data, within the specified zipfile in Blobstore. It calls the mapper function once for each line in each file, passing a tuple of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline. For example, (byte_offset, line_value).
Parameters
- blob_keys: Either a string containing the blob key, or a list containing multiple blob keys, specifying the zip file data to be read by the reader.
DatastoreInputReader iterates over and returns all entities of the specified kind (entity_kind) from the datastore, automatically advancing to the next unread entities. Each iteration returns the number of entities specified by the batch_size parameter. This reader does no filtering: you need to do any required filtering in your mapper, as shown in the sketch after the parameter list.
Parameters
- entity_kind: The datastore kind to map over.
- namespace: The namespace that will be searched for entities of the given kind.
- batch_size: The number of entities to read from the datastore with each batch get. Default is 50.
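Because the reader does no filtering, a common pattern is to filter inside the mapper itself. The sketch below is illustrative only: the UserPhoto kind and its uploaded property are assumptions, while op.counters.Increment comes from the MapReduce library's mapreduce.operation module.

```python
from mapreduce import operation as op

def count_uploaded_photos(entity):
    """Hypothetical mapper over UserPhoto entities supplied by DatastoreInputReader."""
    # DatastoreInputReader hands the mapper whole entities; filtering happens here.
    if entity.uploaded:
        yield op.counters.Increment("uploaded_photos")
```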
DatastoreKeyInputReader iterates over and returns the keys of all entities of the specified entity_kind in the datastore, automatically advancing to the next unread keys. Each iteration returns the number of keys specified by the batch_size parameter. This reader does no filtering: you need to do any required filtering in your mapper.
Parameters
- entity_kind: The datastore kind whose keys are to be returned.
- namespace: The namespace that will be searched for entities of the given kind.
- batch_size: The number of keys to read from the datastore with each batch get. Default is 50.
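As a hedged illustration of working with keys rather than entities, the sketch below deletes every entity whose key the reader yields; a bulk-delete job is an assumed use case, and op.db.Delete is the MapReduce library's datastore mutation operation.

```python
from mapreduce import operation as op

def bulk_delete(key):
    """Hypothetical mapper for DatastoreKeyInputReader: deletes each entity by key."""
    # The reader supplies datastore keys only, so no entity fetch is needed.
    yield op.db.Delete(key)
```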
NamespaceInputReader iterates over and returns the available namespaces.
Parameters
- namespace_range: The range of namespaces that will be iterated over.
- batch_size: The number of namespaces to return in each iteration. Default is 10.
RecordsReader reads a list of files obtained via the Files API in records format, yielding each record to the mapper as a string in each iteration.
Parameters
- files: Either a string containing the name of the file to be read, or a list containing the names of multiple files to be read.
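Since each record reaches the mapper as an opaque string, decoding it is up to your code. A minimal sketch, assuming (purely for illustration) that the records were written as JSON objects containing a "key" field:

```python
import json

def process_record(record):
    """Hypothetical mapper for records-format input."""
    # Each record arrives as a raw string; this sketch assumes it holds a JSON object.
    data = json.loads(record)
    yield (data["key"], "")
```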
The standard output writers write data from the reducer function to a Google Cloud Storage location. The summary below describes each writer. Each writer uses the following mapreduce.yaml parameter settings:
Required Parameters for all output writers
- bucket_name: Name of the bucket to use, with no extra delimiters or suffixes such as directories. Directories/prefixes can be specified as part of the naming_format parameter.

Optional Parameters for all output writers
- acl: The ACL to apply to new files; if not set, the bucket default is used.
- naming_format: Prefix format string for the new files. No starting slash is required; expected formats look like directory/basename..., and any starting slash is treated as part of the file name. The format string can use the following substitutions:
  - $name: the name of the job
  - $id: the id assigned to the job
  - $num: the shard number. If there is more than one shard, $num must be used. An arbitrary suffix may be applied by the writer.
- content_type: MIME type to apply to the files. If not provided, Google Cloud Storage applies its default.
- tmp_bucket_name: Name of the bucket used for writing temporary files by consistent GCS output writers. Defaults to bucket_name if not set.
GoogleCloudStorageOutputWriter is known to create inconsistent outputs if the input changes during slice retries. Consider using GoogleCloudStorageConsistentOutputWriter instead.
Parameters
- no_duplicate: If True, slice recovery logic is used to ensure that the output files contain no duplicates. Every shard should have only one final output in the user-specified location, but the writer may produce many smaller files (named "seg") due to slice recovery. These segs live in a tmp directory and should be combined and renamed to the final location; in the current implementation, they are not combined.
GoogleCloudStorageConsistentOutputWriter ensures that the output written to Google Cloud Storage is consistent.
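As a hedged example of wiring up one of these writers, the pipeline below mirrors the word count example above but sends its output to Google Cloud Storage. The bucket name and naming_format value are illustrative assumptions, and the writer parameters are shown nested under an output_writer key inside reducer_params.

```python
mapreduce_pipeline.MapreducePipeline(
    "word_count",
    "main.word_count_map",
    "main.word_count_reduce",
    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.GoogleCloudStorageConsistentOutputWriter",
    mapper_params={
        "blob_key": "blobkey",
    },
    reducer_params={
        "output_writer": {
            "bucket_name": "my-output-bucket",                    # assumed bucket name
            "content_type": "text/plain",
            "naming_format": "wordcount/$name/$id/output-$num",   # assumed prefix
        },
    },
    shards=16)
```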
GoogleCloudStorageRecordOutputWriter writes key/values to Google Cloud Storage files in a lightweight record format based on LevelDB's log file format: https://raw.githubusercontent.com/google/leveldb/master/doc/log_format.txt. The main advantages of this format are:
- Corruption is detectable: every record has a crc32c checksum.
- Corrupted records can be skipped quickly to reach the next valid record.
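To show what the record format means in practice, here is a hedged sketch of reading such a file back outside of MapReduce. It assumes the cloudstorage client library and the MapReduce library's records module are available, and the object path shown in the comment is a made-up placeholder.

```python
import cloudstorage
from mapreduce import records

def read_records(path):
    """Yields every record stored in a LevelDB-format file on GCS (sketch)."""
    with cloudstorage.open(path) as f:  # e.g. "/my-output-bucket/wordcount/output-0"
        reader = records.RecordsReader(f)
        while True:
            try:
                yield reader.read()     # each call returns one record as a string
            except EOFError:
                break
```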
GoogleCloudStorageConsistentRecordOutputWriter is the same as GoogleCloudStorageRecordOutputWriter, but ensures that the output written to Google Cloud Storage is consistent.
The standard input readers and output writers should suffice for most use cases. If you need a reader that handles a different input source or format, or a writer that writes to a different location or output format, contact Google to ask whether it can be added to the standard readers and writers.
Alternatively, if you want to write your own reader or writer, you can study the open source code for the existing readers and writers to see how to do this.
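To give a sense of the shape of a custom reader, here is a rough skeleton (not from the original page) based on the methods the open source mapreduce.input_readers.InputReader base class expects subclasses to override. The list-reading logic is a made-up placeholder; a real reader must checkpoint its position through to_json()/from_json() so a shard can resume after a retry.

```python
from mapreduce import input_readers


class SimpleListInputReader(input_readers.InputReader):
    """Hypothetical reader that feeds the mapper items from a list in mapper params."""

    ITEMS_PARAM = "items"

    def __init__(self, items, position=0):
        self._items = items
        self._position = position

    def next(self):
        # Called repeatedly to produce the next value for the mapper.
        if self._position >= len(self._items):
            raise StopIteration()
        item = self._items[self._position]
        self._position += 1
        return item

    def to_json(self):
        # Serialize the reader state so the shard can be checkpointed.
        return {"items": self._items, "position": self._position}

    @classmethod
    def from_json(cls, state):
        # Restore the reader from a checkpointed state.
        return cls(state["items"], state["position"])

    @classmethod
    def split_input(cls, mapper_spec):
        # Hand each shard an interleaved slice of the list.
        items = mapper_spec.params[cls.ITEMS_PARAM]
        shards = mapper_spec.shard_count
        return [cls(items[i::shards]) for i in range(shards)]

    @classmethod
    def validate(cls, mapper_spec):
        if cls.ITEMS_PARAM not in mapper_spec.params:
            raise input_readers.BadReaderParamsError("Missing 'items' parameter.")
```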