This repository has been archived by the owner on Jun 25, 2022. It is now read-only.

Resolve TODOs and add timing explanation
- Add a section for the Approach in the Problem statement.
- Remove a TODO about an experiment that is not relevant enough to put
  in the Report.
- Timing and memory experiments: fix the MapReduce runs (user and system
  time) and add an explanation about the baseline languages.
- Remove TODOs that are no longer relevant.
lhelwerd committed Jun 9, 2015
1 parent 8b80a76 commit 39a1bdb
Showing 3 changed files with 29 additions and 16 deletions.
3 changes: 0 additions & 3 deletions Presentation/Presentation.tex
@@ -184,9 +184,6 @@
\end{frame}

% Regular slide
% TODO: passive aggressive is better than random forest, so why are the
% language plots made with a random forest regressor? Determine which algorithm
% to use for this.
\setbeamertemplate{navigation symbols}{}
\begin{frame}[fragile]{Experiments and results}
\begin{itemize}
22 changes: 10 additions & 12 deletions Report/Report.tex
@@ -143,9 +143,7 @@ \section{Problem statement}\label{sec:problem}
detecting sarcasm. On top of that, sentiment analysis on a single text should be
fast in order to make working with a large dataset feasible.

% TODO: Perhaps split up in subsections?
% Perhaps also mention: "How do we obtain the dataset and preprocess it?"
% And: "How can we distribute the work onto worker nodes?" From presentation.
\subsection{Approach}\label{sec:approach}
We propose a framework that performs sentiment analysis on large datasets in a
distributed manner. We aim to distribute most operations performed by the
framework, such as downloading and preprocessing individual data dumps,
@@ -705,14 +703,16 @@ \subsection{Measuring time and memory usage of the framework}\label{sec:time-and
component from Section~\ref{sec:preprocessor} takes much longer on the DAS-3.
We wish to avoid the slowness issue caused by sequentially reading the dumps
and converting them to shelves containing language data. Therefore, we can
distribute the tasks for the dumps using MPI\@. We use eight nodes, of which one
node is a master node that distributes jobs. This setup causes a three-fold
distribute the tasks for the dumps using MPI\@. We use eight nodes, of which
one node is a master node that distributes jobs. This setup causes a three-fold
speedup for the repositories dumps. One could expect the process to be
seven~times faster due to the number of nodes, but at the end a few nodes are
busy with a large dump, which causes the total time to be higher than expected.
This is also the case for the commit comments task, which has to merge the
language shelves beforehand. Note however that the baseline only processes one
dump while the MPI task uses all 17~dumps of the commit comments.
language shelves beforehand. Note however that the `baseline languages' column
in Table~\ref{tab:component-real-time} does make use of the languages from all
the repositories dumps, but it only processes one commit comments dump.
Meanwhile, the MPI task uses all 17~dumps of the commit comments.
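
The master-worker distribution described above could look roughly like the following mpi4py sketch. This is an illustrative aside, not the framework's actual code: the process_dump function, the dump file names and the message tags are placeholders.

    # Minimal master-worker sketch with mpi4py: rank 0 hands out one dump at a
    # time and the remaining ranks request work until they receive a stop signal.
    from mpi4py import MPI

    def process_dump(dump):
        # Placeholder for reading one dump and writing its language shelf.
        pass

    def main():
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        size = comm.Get_size()

        # Hypothetical file names for the 17 commit comments dumps.
        dumps = ['commit_comments-{0}.json'.format(i) for i in range(17)]

        if rank == 0:
            # Master: wait for a worker to ask for work, then send it the next dump.
            status = MPI.Status()
            next_dump = 0
            active_workers = size - 1
            while active_workers > 0:
                comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
                worker = status.Get_source()
                if next_dump < len(dumps):
                    comm.send(dumps[next_dump], dest=worker, tag=2)
                    next_dump += 1
                else:
                    # No dumps left: tell this worker to stop.
                    comm.send(None, dest=worker, tag=2)
                    active_workers -= 1
        else:
            # Worker: keep requesting dumps until the master sends None.
            while True:
                comm.send(rank, dest=0, tag=1)
                dump = comm.recv(source=0, tag=2)
                if dump is None:
                    break
                process_dump(dump)

    if __name__ == '__main__':
        main()

Handing out dumps on demand keeps the load balanced while work remains, although, as noted above, the last few large dumps still determine the total time.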

For the analyzer and the classifier components, we see that all platforms
perform fairly equally, where we again note that the MPI version processes more
@@ -729,9 +729,9 @@ \subsection{Measuring time and memory usage of the framework}\label{sec:time-and
\quad Repositories & --- & --- & 8h 26m 1s & 9h 40m 30s \\
\quad Commit comments & 1m 10s & --- & 15m 57s & 7h 32m 49s \\
Analyzer & 52s & 1m 23s & 53s & 4m 12s \\
Classify, sort, reduce & 1m 4s & 1m 46s & 1m 5s & 10m 23s \\
\quad Classifier & 1m 3s & 1m 40s & 1m 3s & 10m 9s \\
\quad Reducer & 0.2s & 10s & 0.2s & 2s \\
Classify, sort, reduce & 1m 4s & 1m 37s & 1m 5s & 10m 23s \\
\quad Classifier & 1m 3s & 1m 27s & 1m 3s & 10m 9s \\
\quad Reducer & 0.2s & 9s & 0.2s & 2s \\
\bottomrule
\end{tabular}
\caption{Summed user and system time of components using various distributed
@@ -806,8 +806,6 @@ \subsection{Measuring time and memory usage of the framework}\label{sec:time-and
usage of the classifier to become 282~MB, an increase of 83~MB.

\subsection{Determining the most accurate classifier}\label{sec:most-accurate-classifier}
% TODO: Say something about our analysis/comparisons between the analyzer and
% the classifier (unrecognized.py)? It was a short experiment, but still.
Before we are able to give classifications for commit comments, it is important
to determine the most accurate classifier to use on our dataset. In order to
consider as many combinations of classifier and parameters as possible, we have
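
As a hedged illustration of how such a comparison of classifiers and parameter settings could be set up (the report's own procedure continues past this excerpt), the following scikit-learn sketch cross-validates a few candidate regressors; the feature matrix, scores and parameter values are placeholders, not the project's data.

    # Compare a few classifier/parameter combinations with cross-validation.
    import numpy as np
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import PassiveAggressiveRegressor

    # Placeholder features and sentiment scores; the framework would use the
    # analyzer's output for the training set here.
    X = np.random.rand(200, 20)
    y = np.random.rand(200) * 2 - 1

    candidates = {
        'passive aggressive (C=0.1)': PassiveAggressiveRegressor(C=0.1),
        'passive aggressive (C=1.0)': PassiveAggressiveRegressor(C=1.0),
        'random forest (100 trees)': RandomForestRegressor(n_estimators=100),
    }

    for name, estimator in sorted(candidates.items()):
        scores = cross_val_score(estimator, X, y, cv=5)
        print('{0}: mean CV score {1:.3f}'.format(name, scores.mean()))

The passive aggressive and random forest regressors are the two mentioned in the presentation slides; a fuller sweep would loop over a larger grid of estimators and parameters.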
20 changes: 19 additions & 1 deletion timing.txt
@@ -55,7 +55,6 @@ reducer.py sys 0.01
reducer.py mem 16224/4 kB
We should probably just give the sum of this in order to compare with MapReduce.
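
A rough sketch of how per-script user and system times like the ones above could be collected and summed for the whole pipeline (assuming the components are run as standalone scripts; only classify.py and reducer.py are names that actually appear in these notes):

    # Measure user/sys time of each pipeline script via getrusage of child processes.
    import resource
    import subprocess

    def run_and_measure(command):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        subprocess.call(command)
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        return after.ru_utime - before.ru_utime, after.ru_stime - before.ru_stime

    total_user = total_sys = 0.0
    for script in ['classify.py', 'reducer.py']:
        user, sys_time = run_and_measure(['python', script])
        total_user += user
        total_sys += sys_time
        print('{0} user {1:.2f} sys {2:.2f}'.format(script, user, sys_time))
    print('total user {0:.2f} sys {1:.2f}'.format(total_user, total_sys))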

TODO: Comparison with more recent hardware (laptop, huisuil?) and parallel MPI

preprocess.py for repos without anything on node02:
real 51:59:38
@@ -103,6 +102,25 @@ hadoop user 12.63
hadoop sys 0.77
hadoop mem 559840/4 kB

Total time spent by all maps in occupied slots (ms)=86719
Total time spent by all reduces in occupied slots (ms)=8770
Total time spent by all map tasks (ms)=86719
Total time spent by all reduce tasks (ms)=8770
Total vcore-seconds taken by all map tasks=86719
Total vcore-seconds taken by all reduce tasks=8770
Total megabyte-seconds taken by all map tasks=59142358
Total megabyte-seconds taken by all reduce tasks=5981140
GC time elapsed (ms)=1108
CPU time spent (ms)=80110
Physical memory (bytes) snapshot=1080074240
Virtual memory (bytes) snapshot=7438934016
Total committed heap usage (bytes)=844103680
real 1:44.21
user 12.43
sys 0.73
mem 558288/4 kB


MapReduce classify.py for commit_comments with languages group and reducer.py
Total time spent by all maps in occupied slots (ms)=100490
Total time spent by all reduces in occupied slots (ms)=9561
