This repository has been archived by the owner on Jun 25, 2022. It is now read-only.

Resolve TODOs and add timing explanation
- Add a section for the Approach in the Problem statement.
- Remove a TODO about an experiment that is not relevant enough to put
  in the Report.
- Timing and memory experiments: fix the MapReduce runs (user and system
  time) and add an explanation about the baseline languages.
- Remove TODOs that are no longer relevant.
lhelwerd committed Jun 9, 2015
1 parent 8b80a76 commit 39a1bdb
Showing 3 changed files with 29 additions and 16 deletions.
3 changes: 0 additions & 3 deletions Presentation/Presentation.tex
@@ -184,9 +184,6 @@
\end{frame}

% Regular slide
% TODO: passive aggressive is better than random forest, so why are the
% language plots made with a random forest regressor? Determine which algorithm
% to use for this.
\setbeamertemplate{navigation symbols}{}
\begin{frame}[fragile]{Experiments and results}
\begin{itemize}
22 changes: 10 additions & 12 deletions Report/Report.tex
@@ -143,9 +143,7 @@ \section{Problem statement}\label{sec:problem}
detecting sarcasm. On top of that, sentiment analysis on a single text should be
fast in order to make working with a large dataset feasible.

% TODO: Perhaps split up in subsections?
% Perhaps also mention: "How do we obtain the dataset and preprocess it?"
% And: "How can we distribute the work onto worker nodes?" From presentation.
\subsection{Approach}\label{sec:approach}
We propose a framework that performs sentiment analysis on large datasets in a
distributed manner. We aim to distribute most operations performed by the
framework, such as downloading and preprocessing individual data dumps,
@@ -705,14 +703,16 @@ \subsection{Measuring time and memory usage of the framework}\label{sec:time-and
component from Section~\ref{sec:preprocessor} takes much longer on the DAS-3.
We wish to avoid the slowness issue caused by sequentially reading the dumps
and converting them to shelves containing language data. Therefore, we can
distribute the tasks for the dumps using MPI\@. We use eight nodes, of which one
node is a master node that distributes jobs. This setup causes a three-fold
distribute the tasks for the dumps using MPI\@. We use eight nodes, of which
one node is a master node that distributes jobs. This setup causes a three-fold
speedup for the repositories dumps. One could expect the process to be
seven~times faster due to the number of nodes, but at the end a few nodes are
busy with a large dump, which causes the total time to be higher than expected.
This is also the case for the commit comments task, which has to merge the
language shelves beforehand. Note however that the baseline only processes one
dump while the MPI task uses all 17~dumps of the commit comments.
language shelves beforehand. Note however that the `baseline languages' column
in Table~\ref{tab:component-real-time} does make use of the languages from all
the repositories dumps, but it only processes one commit comments dump.
Meanwhile, the MPI task uses all 17~dumps of the commit comments.
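
The master-worker distribution described above could look roughly like the following mpi4py sketch. This is an illustrative aside, not the framework's actual code: the process_dump function, the dump file names and the message tags are placeholders.

    # Minimal master-worker sketch with mpi4py: rank 0 hands out one dump at a
    # time and the remaining ranks request work until they receive a stop signal.
    from mpi4py import MPI

    def process_dump(dump):
        # Placeholder for reading one dump and writing its language shelf.
        pass

    def main():
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        size = comm.Get_size()

        # Hypothetical file names for the 17 commit comments dumps.
        dumps = ['commit_comments-{0}.json'.format(i) for i in range(17)]

        if rank == 0:
            # Master: wait for a worker to ask for work, then send it the next dump.
            status = MPI.Status()
            next_dump = 0
            active_workers = size - 1
            while active_workers > 0:
                comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
                worker = status.Get_source()
                if next_dump < len(dumps):
                    comm.send(dumps[next_dump], dest=worker, tag=2)
                    next_dump += 1
                else:
                    # No dumps left: tell this worker to stop.
                    comm.send(None, dest=worker, tag=2)
                    active_workers -= 1
        else:
            # Worker: keep requesting dumps until the master sends None.
            while True:
                comm.send(rank, dest=0, tag=1)
                dump = comm.recv(source=0, tag=2)
                if dump is None:
                    break
                process_dump(dump)

    if __name__ == '__main__':
        main()

Handing out dumps on demand keeps the load balanced while work remains, although, as noted above, the last few large dumps still determine the total time.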

For the analyzer and the classifier components, we see that all platforms
perform fairly equally, where we again note that the MPI version processes more
@@ -729,9 +729,9 @@ \subsection{Measuring time and memory usage of the framework}\label{sec:time-and
\quad Repositories & --- & --- & 8h 26m 1s & 9h 40m 30s \\
\quad Commit comments & 1m 10s & --- & 15m 57s & 7h 32m 49s \\
Analyzer & 52s & 1m 23s & 53s & 4m 12s \\
Classify, sort, reduce & 1m 4s & 1m 46s & 1m 5s & 10m 23s \\
\quad Classifier & 1m 3s & 1m 40s & 1m 3s & 10m 9s \\
\quad Reducer & 0.2s & 10s & 0.2s & 2s \\
Classify, sort, reduce & 1m 4s & 1m 37s & 1m 5s & 10m 23s \\
\quad Classifier & 1m 3s & 1m 27s & 1m 3s & 10m 9s \\
\quad Reducer & 0.2s & 9s & 0.2s & 2s \\
\bottomrule
\end{tabular}
\caption{Summed user and system time of components using various distributed
@@ -806,8 +806,6 @@ \subsection{Measuring time and memory usage of the framework}\label{sec:time-and
usage of the classifier to become 282~MB, an increase of 83~MB.

\subsection{Determining the most accurate classifier}\label{sec:most-accurate-classifier}
% TODO: Say something about our analysis/comparisons between the analyzer and
% the classifier (unrecognized.py)? It was a short experiment, but still.
Before we are able to give classifications for commit comments, it is important
to determine the most accurate classifier to use on our dataset. In order to
consider as many combinations of classifier and parameters as possible, we have
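
As a hedged illustration of how such a comparison of classifiers and parameter settings could be set up (the report's own procedure continues past this excerpt), the following scikit-learn sketch cross-validates a few candidate regressors; the feature matrix, scores and parameter values are placeholders, not the project's data.

    # Compare a few classifier/parameter combinations with cross-validation.
    import numpy as np
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import PassiveAggressiveRegressor

    # Placeholder features and sentiment scores; the framework would use the
    # analyzer's output for the training set here.
    X = np.random.rand(200, 20)
    y = np.random.rand(200) * 2 - 1

    candidates = {
        'passive aggressive (C=0.1)': PassiveAggressiveRegressor(C=0.1),
        'passive aggressive (C=1.0)': PassiveAggressiveRegressor(C=1.0),
        'random forest (100 trees)': RandomForestRegressor(n_estimators=100),
    }

    for name, estimator in sorted(candidates.items()):
        scores = cross_val_score(estimator, X, y, cv=5)
        print('{0}: mean CV score {1:.3f}'.format(name, scores.mean()))

The passive aggressive and random forest regressors are the two mentioned in the presentation slides; a fuller sweep would loop over a larger grid of estimators and parameters.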
20 changes: 19 additions & 1 deletion timing.txt
@@ -55,7 +55,6 @@ reducer.py sys 0.01
reducer.py mem 16224/4 kB
We should probably just give the sum of this in order to compare with MapReduce.
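
A rough sketch of how per-script user and system times like the ones above could be collected and summed for the whole pipeline (assuming the components are run as standalone scripts; only classify.py and reducer.py are names that actually appear in these notes):

    # Measure user/sys time of each pipeline script via getrusage of child processes.
    import resource
    import subprocess

    def run_and_measure(command):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        subprocess.call(command)
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        return after.ru_utime - before.ru_utime, after.ru_stime - before.ru_stime

    total_user = total_sys = 0.0
    for script in ['classify.py', 'reducer.py']:
        user, sys_time = run_and_measure(['python', script])
        total_user += user
        total_sys += sys_time
        print('{0} user {1:.2f} sys {2:.2f}'.format(script, user, sys_time))
    print('total user {0:.2f} sys {1:.2f}'.format(total_user, total_sys))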

TODO: Comparison with more recent hardware (laptop, huisuil?) and parallel MPI

preprocess.py for repos without anything on node02:
real 51:59:38
@@ -103,6 +102,25 @@ hadoop user 12.63
hadoop sys 0.77
hadoop mem 559840/4 kB

Total time spent by all maps in occupied slots (ms)=86719
Total time spent by all reduces in occupied slots (ms)=8770
Total time spent by all map tasks (ms)=86719
Total time spent by all reduce tasks (ms)=8770
Total vcore-seconds taken by all map tasks=86719
Total vcore-seconds taken by all reduce tasks=8770
Total megabyte-seconds taken by all map tasks=59142358
Total megabyte-seconds taken by all reduce tasks=5981140
GC time elapsed (ms)=1108
CPU time spent (ms)=80110
Physical memory (bytes) snapshot=1080074240
Virtual memory (bytes) snapshot=7438934016
Total committed heap usage (bytes)=844103680
real 1:44.21
user 12.43
sys 0.73
mem 558288/4 kB


MapReduce classify.py for commit_comments with languages group and reducer.py
Total time spent by all maps in occupied slots (ms)=100490
Total time spent by all reduces in occupied slots (ms)=9561
