-
Notifications
You must be signed in to change notification settings - Fork 0
/
big.tex
86 lines (81 loc) · 4.64 KB
/
big.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
\subsection{Big-Data Streaming}\label{sec:big} % Martin
\begin{figure}[!h]
\begin{lstlisting}[xleftmargin=2mm,morekeywords={stream,float64,rstring,type,int32,window,output,tuple}]
stream<float64 len, rstring caller> Calls = CallsSrc() {}
type Stat = tuple<float64 len, int32 num, rstring who>;
stream<Stat> Stats = Aggregate(Calls) {
window Calls:sliding, time(24.0*60.0*60.0), time(60.0);
output Stats: len = Max(Calls.len),
num = MaxCount(Calls.len),
who = ArgMax(Calls.len, Calls.caller);
}
\end{lstlisting}
\vspace*{-4mm}
\caption{\label{fig:spl}SPL code example.}
\end{figure}
The need to handle diverse data and processing requirements at scale
motivated several recent big-data streaming languages and
systems~\cite{akidau_et_al_2013,akidau_et_al_2015,carbone_et_al_2015,chandramouli_et_al_2014,hirzel_schneider_gedik_2017,kulkarni_et_al_2015,murray_et_al_2013,toshniwal_et_al_2014,zaharia_et_al_2013}.
Each of them makes it easy to integrate operators written in
general-purpose languages and to parallelize them on clusters of
multicore computers. Hirzel et al.\ introduced the \textsf{SPL}
language as part of the \textsf{IBM~Streams} product in
2010~\cite{hirzel_et_al_2009,hirzel_schneider_gedik_2017}.
Figure~\ref{fig:spl} shows an example for a similar
use-case as Figure~\ref{fig:cql}. Line~1 defines a stream
\lstinline{Calls} by invoking an operator \lstinline{CallsSrc}, and
\mbox{Lines 3-8} define a stream \lstinline{Stats} by invoking an
operator \lstinline{Aggregate}. An SPL program explicitly specifies a
directed graph of stream edges and operator nodes. Streams carry
tuples; in the examples, tuple attributes contain primitive values,
but in general, they can also contain compound values such as other
tuples or lists. Operators create and transform streams; operators
are defined by users or libraries, not built into the
language. Operators can be further configured upon invocation, for
example, with windows or output assignments. To facilitate
distribution, SPL's semantics are defined to require minimal
synchronization between operators~\cite{soule_et_al_2016}.
\vspace*{-0.3mm}
Like SPL, the core concept of other languages for big-data streaming
is also that of a directed graph of streams and operators. This graph
is an evolution of the query plan of earlier stream-relational
systems. In fact, one can view \textsf{Aurora}~\cite{abadi_et_al_2003},
\textsf{Borealis}~\cite{abadi_et_al_2005}, and
\textsf{Spade}~\cite{gedik_et_al_2008} as the evolutionary links
between relational and big-data streaming languages. They still focused
on relational operators while already encouraging developers to
explicitly code graphs.
\vspace*{-0.3mm}
Unlike SPL, which is a stand-alone language, later big-data streaming
systems offer languages that are embedded in a general-purpose host
language, typically Java. \textsf{MillWheel} focused on key-based
partitioned parallelism and semi-automatic handling of out-of-order
data~\cite{akidau_et_al_2013}. \textsf{Naiad} focused on supporting
both streaming and iterative batch analytics~\cite{murray_et_al_2013},
using elaborate timestamps and a LINQ-based surface
language~\cite{meijer_beckman_bierman_2006}. \textsf{Spark Streaming}
emulated streaming by repeated computations on immutable in-me\-mo\-ry
data batches~\cite{zaharia_et_al_2013}. \textsf{Storm} offered
at-least-once semantics via buffering and
acknowledgements~\cite{toshniwal_et_al_2014}. \textsf{Trill} used
batching to improve throughput and offered an extensible
aggregation framework~\cite{chandramouli_et_al_2014}. \textsf{Heron}
displaced Storm by adding several improvements, such as a back-pressure
mechanism~\cite{kulkarni_et_al_2015}. \textsf{Beam} picks up where
MillWheel left off, giving programmers ways to reconcile event time
and processing time~\cite{akidau_et_al_2015}. And finally,
\textsf{Flink} focuses on supporting both real-time streaming and batch
analytics~\cite{carbone_et_al_2015}.
\vspace*{-0.3mm}
All of the above-listed big-data streaming systems offer embedded
languages for specifying more-or-less explicit stream graphs. An
embedded language is an advanced library or framework that makes heavy
use of host-language abstractions such as lambdas, generics, and local
variable type inference. For instance, LINQ integrates SQL-inspired
query syntax in a general-purpose
language~\cite{meijer_beckman_bierman_2006}. Embedded languages
offer simple interoperability with their host language, as well
as leveraging host-language tools and skills~\cite{hudak_1998}. On the
downside, since they are not self-contained, they are hard to isolate
clearly from the host language, inhibiting debugging, optimization,
and standardization.