diff --git a/README.rst b/README.rst
index 44f8b94e6..7215aa86e 100644
--- a/README.rst
+++ b/README.rst
@@ -3,13 +3,13 @@
 Disco - Massive data, Minimal code
 ==================================
 
 Disco is an implementation of the `Map-Reduce framework
-`_ for distributed computing. As
+`_ for distributed computing. Like
 the original framework, which was publicized by Google, Disco supports
-parallel computations over large data sets on unreliable cluster of
+parallel computations over large data sets on an unreliable cluster of
 computers. This makes it a perfect tool for analyzing and processing large
 datasets without having to bother about difficult technical questions
 related to distributed computing, such as communication protocols, load
-balancing, locking, job scheduling or fault tolerance, which are taken
+balancing, locking, job scheduling or fault tolerance, all of which are taken
 care by Disco.
 
 See `discoproject.org `_ for more information.
diff --git a/doc/start/ddfs.rst b/doc/start/ddfs.rst
index 02a4d7a56..13b16643c 100644
--- a/doc/start/ddfs.rst
+++ b/doc/start/ddfs.rst
@@ -31,7 +31,7 @@ open-source projects such as `Hadoop Distributed Filesystem (HDFS)
 DDFS is a low-level component in the Disco stack, taking care of data
 *distribution*, *replication*, *persistence*, *addressing* and *access*.
 It does not provide a sophisticated query facility in itself but it is
-**tightly integrated** with Disco jobs and Discodex indexing component,
+**tightly integrated** with Disco jobs and the Discodex indexing component,
 which can be used to build application-specific query interfaces. Disco
 can store results of Map/Reduce jobs to DDFS, providing persistence and
 easy access for processed data.
@@ -158,7 +158,7 @@
 atomicity of metadata operations.
 Each storage node contains a number of disks or volumes (`vol0..volN`),
 assigned to DDFS by mounting them under ``$DDFS_ROOT/vol0`` ...
 ``$DDFS_ROOT/volN``. On each volume, DDFS creates two directories,
-``tag`` and ``blob``, for storing tags anb blobs, respectively. DDFS
+``tag`` and ``blob``, for storing tags and blobs, respectively. DDFS
 monitors available disk space on each volume on regular intervals for
 load balancing. New blobs are stored to the least loaded volumes.
@@ -316,7 +316,7 @@
 comments in the source code. This discussion is mainly interesting to
 developers and advanced users of DDFS and Disco.
 
 As one might gather from the sections above, metadata (tag) operations
-are the hard core of DDFS, mainly due to their transactional nature.
+are the central core of DDFS, mainly due to their transactional nature.
 Another non-trivial part of DDFS is re-replication and garbage collection
 of tags and blobs. These issues are discussed in more detail below.
diff --git a/doc/start/install.rst b/doc/start/install.rst
index 70bdc75b8..d39b946d7 100644
--- a/doc/start/install.rst
+++ b/doc/start/install.rst
@@ -27,7 +27,7 @@
 story short, Disco works as follows:
 
  * Disco users start Disco jobs in Python scripts.
  * Jobs requests are sent over HTTP to the master.
  * Master is an Erlang process that receives requests over HTTP.
- * Master launches another Erlang process, worker supervisor, on each node over
+ * Master launches another Erlang process, the worker supervisor, on each node over
    SSH.
  * Worker supervisors run Disco jobs as Python processes.
@@ -121,7 +121,7 @@
 On the master node, start the Disco master by executing ``disco start``.
 You can easily integrate ``disco`` into your system's startup sequence.
 For instance, you can see how ``debian/disco-master.init`` and
-``debian/disco-node.init`` are implemented in the Disco's ``debian``
+``debian/disco-node.init`` are implemented in Disco's ``debian``
 branch.
 
 If Disco has started up properly, you should see ``beam.smp`` running on your
@@ -240,7 +240,7 @@
 itself. If the machine where you run the script can access the master
 node but not other nodes in the cluster, you need to set the environment
 variable ``DISCO_PROXY=http://master:8989``. The proxy address should be the
-same as the master's above. This makes Disco to fetch results through
+same as the master's above. This makes Disco fetch results through
 the master node, instead of connecting to the nodes directly.
 
 If the script produces some results, congratulations, you have a
diff --git a/lib/disco/func.py b/lib/disco/func.py
index 68077f376..94619f0b4 100644
--- a/lib/disco/func.py
+++ b/lib/disco/func.py
@@ -23,7 +23,7 @@
     The task uses stderr to signal events to the master. You can raise a
     :class:`disco.error.DataError`, to abort the task on this node and try
     again on another node.
-    It is usually a best to let the task fail if any exceptions occur:
+    It is usually best to let the task fail if any exceptions occur:
     do not catch any exceptions from which you can't recover. When
     exceptions occur, the disco worker will catch them and signal an
     appropriate event to the master.
@@ -136,7 +136,7 @@ def reduce(input_stream, output_stream, params):
     :param input_stream: :class:`disco.func.InputStream` object that is used
                          to iterate through input entries.
-    :param output_stream: :class:`disco.func.InputStream` object that is used
+    :param output_stream: :class:`disco.func.OutputStream` object that is used
                          to output results.
     :param params: the :class:`disco.core.Params` object specified by the
                    *params* parameter in :class:`disco.core.JobDict`.
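The install.rst and func.py hunks above describe Disco's model: users write map and reduce functions in Python scripts, the master ships them to worker supervisors, and the reduce step consumes grouped key/value pairs. As a cluster-free sketch of that contract (the function names and the local `run_local` driver below are illustrative only, not part of Disco's API; a real job would hand these functions to the master):

```python
from itertools import groupby

def word_map(line, params):
    # Emit one (word, 1) pair per word, as a Disco map function would.
    for word in line.split():
        yield word, 1

def word_reduce(pairs, params):
    # Sum the counts per word. Disco's shuffle delivers pairs grouped
    # by key; sorted() + groupby simulates that grouping locally.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

def run_local(lines):
    # Hypothetical stand-in for the master/worker-supervisor machinery:
    # map every input line, then reduce the collected pairs.
    pairs = [kv for line in lines for kv in word_map(line, None)]
    return dict(word_reduce(pairs, None))

print(run_local(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In an actual Disco job the map and reduce functions are serialized and executed on worker nodes, which is why the project's examples conventionally put imports inside the function bodies rather than at module level.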
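The ddfs.rst hunk above also notes that DDFS monitors free space per volume and stores new blobs on the least loaded volume. The selection rule can be sketched as follows (DDFS itself implements this in Erlang on the master; the helper name and the free-space figures here are made up for illustration):

```python
def least_loaded(volumes):
    """Pick the volume with the most free space, i.e. the least loaded.

    volumes: hypothetical mapping of volume name (vol0..volN, mounted
    under $DDFS_ROOT) to free bytes, as reported by periodic monitoring.
    """
    return max(volumes, key=volumes.get)

# A new blob would land on the emptiest volume:
print(least_loaded({"vol0": 10_000, "vol1": 250_000, "vol2": 40_000}))
# vol1
```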