\subsection{Veracity}\label{sec:veracity}
%Challenge: Handling veracity in a simple and well-defined way
%Veracity: English definition: Conformance to the facts. Accuracy.
With the evolution of the internet of things and related technologies,
many end-user applications require stream processing and analytics.
Streaming languages should ensure veracity of the output stream in
terms of accuracy, correctness, and completeness of the results.
At the same time, they should not sacrifice performance, answering
high-throughput input streams with low-latency output streams.
Veracity in a streaming environment depends on the semantics of the
language since the stream is infinite and new results may be added or
computed aggregates may change. It is important that the output stream
for a given input stream be well-defined based on the streaming
language semantics. For example, if the language offers a sliding
time window feature, any aggregate should be computed correctly at
any time point based on all data within the time window.
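To make this concrete, the following Python sketch (the window width,
the reading format, and the sum aggregate are our own illustrative
choices, not features of any particular streaming language) maintains a
sliding time window over timestamped readings so that, at any query
point, the aggregate reflects exactly the readings that fall within the
window:
\begin{verbatim}
from collections import deque

class SlidingWindowSum:
    """Sum of readings whose timestamps lie in (now - width, now].
    Readings are assumed to arrive in timestamp order in this sketch."""
    def __init__(self, width):
        self.width = width
        self.items = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def insert(self, timestamp, value):
        self.items.append((timestamp, value))
        self.total += value

    def query(self, now):
        # Evict readings that have fallen out of the window.
        while self.items and self.items[0][0] <= now - self.width:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total

w = SlidingWindowSum(width=10.0)
w.insert(1.0, 5.0)
w.insert(4.0, 3.0)
print(w.query(5.0))    # 8.0: both readings are inside the window
print(w.query(12.0))   # 3.0: the reading at t=1.0 has expired
\end{verbatim}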
%
Stream veracity problems may occur for different reasons. For example,
in multi-streaming applications, each stream may be produced by
sensors. Errors may occur either in the data itself (e.g., noisy
sensor readings) or during its transfer to the stream processing
system (e.g., delays or data loss). For instance, data may arrive
out-of-order because of communication delays or because of the
inevitable time drift between independent distributed stream sources.
Ideally, the output stream should be accurate, complete, and timely
even if errors occur in the input stream. Unfortunately, this is not
always feasible.
\textbf{Why is this important?}
%
Veracity of the output of streaming applications is important when
high-stakes and irreversible decisions are based on these outputs. In
the big-data era, veracity is one of the most important problems even
for non-streaming data processing, and stream processing makes
veracity even more challenging than in the static case. Streams are
dynamic and usually operate in a distributed environment
with minimal control over the underlying infrastructure. Such a
loosely coupled model can lead to situations where any data source can
join and leave on the fly. Moreover, stream-producing sensors are
constrained in processing power, energy, and memory, which can easily
compromise veracity.
%including data incompleteness, lack of quality and/or concept drifts. Additionally, in a dynamic environment where data is evolving over time and can change its properties and statistics, which demand novel veracity handling algorithms compared to static datastores.
%Ability of the stream management systems to handle veracity will facilitate handling robust application design and ensure correctness of results. With the evolution of internet of things and relevant technologies, a large number of applications are designed for end users which require real-time stream processing and analytics on the fly such as applications designed in the domains of IoT, healthcare and data on the Web.
%
%Particularly for stream mining, the traditional data mining algorithms which focus on batch processing to improve quality of data are not enough to cope with the veracity challenge while meeting the time constraints of stream processing. This is why it is important to have proper streaming methods of data mining and machine learning to process streaming in real-time, since the batch methods cannot be applied in this setting.
%
%Benchmarking is a very important aspect of any data processing systems to ensure correctness, validation, testing and reproducibility. Traditional benchmarking systems focus mainly on performance evaluation, but since stream processing brings a few additional dimensions including veracity of data. Hence, benchmarking systems designed for testing stream processing systems should cover a broad range of parameters including veracity.
\textbf{How can we measure the challenge?}
%
To estimate the robustness of a streaming language implementation to
veracity problems, we define as \emph{ground truth} the output stream
in the absence of veracity problems (for example, data loss or delayed
data). We can then quantify the impact of such problems. Let
\emph{error} be a function that compares the result produced by an
approach with and without veracity problems; an example of an error
function is the number of false positives and false negatives. An
approach is \emph{robust} to veracity problems in streaming data if
the error scales at most linearly with the size and the error rate of
the input stream, while the added latency is bounded and independent
of the input size.
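As an illustration, the following Python sketch computes one such
error function, the number of false positives plus false negatives of
a produced output relative to the ground truth; representing both
output streams as sets of emitted result items is our own simplifying
assumption:
\begin{verbatim}
def error(ground_truth, produced):
    """False positives + false negatives of `produced` with respect to
    the ground-truth output (both given as sets of emitted items)."""
    false_positives = len(produced - ground_truth)   # emitted but wrong
    false_negatives = len(ground_truth - produced)   # missing from output
    return false_positives + false_negatives

# Ground truth: output computed on the input stream without data loss
# or delays; `produced`: output of the same program under faults.
ground_truth = {("sensor-1", 10, 42.0), ("sensor-2", 10, 17.5)}
produced     = {("sensor-1", 10, 42.0), ("sensor-2", 10, 99.9)}
print(error(ground_truth, produced))   # 2: one spurious, one missing
\end{verbatim}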
%
The streaming language veracity challenge can be broken down
into the following measures $\mathbf{C_1}$--$\mathbf{C_3}$:
\begin{itemize}
\item[$\mathbf{C_1}$] \emph{Fault-tolerance.} A program in the
language is robust even if some of its components fail. The
language can define different behaviors, for example,
at-least-once semantics in Storm~\cite{toshniwal_et_al_2014}
or check-pointing in Spark Streaming~\cite{zaharia_et_al_2013}.
\item[$\mathbf{C_2}$] \emph{Out-of-order handling.} This measure has
two facets. First, the streaming language should have clear
semantics about the expected result. Second, it should be
robust to out-of-order data and should ensure that the expected
output stream is produced with limited latency (see the sketch
after this list). Li et al.\ define out-of-order stream semantics
based on low watermarks~\cite{Li:2008:OPN:1453856.1453890}; Spark
Streaming relies on idempotence to handle
stragglers~\cite{zaharia_et_al_2013}; and Beam separates event
time from processing time~\cite{akidau_et_al_2015}.
\item[$\mathbf{C_3}$] \emph{Inaccurate value handling.} A program in
the language is robust even if some of its input data is wrong.
The language can help by supporting statistical quality
measures~\cite{wasserkrug_et_al_2008}.
\end{itemize}
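As a minimal illustration of the low-watermark idea behind
$\mathbf{C_2}$~\cite{Li:2008:OPN:1453856.1453890}, the following
Python sketch buffers events until the watermark (the event time below
which no further events are expected, here derived from a fixed,
assumed out-of-order bound) passes them, so that they are released in
event-time order even when they arrive out of order; the class and its
parameters are our own simplifications, not the API of any cited
system:
\begin{verbatim}
import heapq

class WatermarkBuffer:
    """Reorder events by event time, assuming events arrive at most
    `max_delay` time units late (an illustrative out-of-order bound)."""
    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.heap = []              # min-heap ordered by event time
        self.max_event_time = None

    def insert(self, event_time, payload):
        heapq.heappush(self.heap, (event_time, payload))
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Low watermark: no event older than this is expected anymore.
        watermark = self.max_event_time - self.max_delay
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready   # events that can now be emitted in order

buf = WatermarkBuffer(max_delay=5)
print(buf.insert(10, "a"))   # []: watermark is 5, nothing released yet
print(buf.insert(7, "b"))    # []: out-of-order event, still buffered
print(buf.insert(14, "c"))   # [(7, 'b')]: watermark 9 releases event 7
print(buf.insert(20, "d"))   # [(10, 'a'), (14, 'c')]: watermark is 15
\end{verbatim}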
%%Veracity refers to several data quality dimensions, such as accuracy, conformance, completeness, correctness and timeliness. When shifting from a static or batch setting to a streaming one, these dimensions need a different interpretation. We can classify the aspects of veracity into two broad categories: those related to the data itself and those dependent of the underlying stream processing. The inherent characteristics of streaming data such as the unboundedness, the uncertainty, the incompleteness and the non-deterministic order are definitely more challenging in the case of streams with respect to static data. Where the unboundedness and non-deterministic order are tied to the streaming setting, the uncertainty and the incompleteness are also crucial in the static case. The latter are however exacerbated in the presence of streams. On the other hand, timeliness which refers to the freshness of the data is more guaranteed in a streaming system than in a static environment.
%%One open problem is to come up with algorithms and techniques to check the validity of the streaming data in an online fashion. One obstacle to address is the lack of ground truth (or, gold standard) and the partial availability of the ground truth (if we assume to build the ground truth incrementally as new streaming data come)
%%
%%
%%.
%%Veracity of streaming data is also affected by the results of streaming processing algorithms, such as the execution of online algorithms, approximate computing techniques, stream mining and learning algorithms. These algorithms may produce new streams that are inherently less accurate and trustworthy than the original data. One additional challenge here is to design streaming processing algorithms that are quality-aware. These streaming processing algorithms aiming at addressing the veracity problem should be self-adaptive and configurable to accommodate new defects and anomalies in the upcoming data.
%%
%%
%%
%%Veracity shall be handled at the level of data instances and at the level of stream processing algorithms. Data has inherently different degrees of veracity (out of order, data loss, delay in the data delivery etc.) and checking the validity of streaming data poses new challenges than in the static case. As an example, a snapshot-centered definition of validity on the data is needed.
%%For what concerns the streaming processing algorithms, they bring approximate results that may introduce further veracity problems. How can we ensure that these algorithms do not worsen the veracity degree of the underlying data and lead to minimize the introduced errors and uncertainties?
%%
%%%% \subsubsection{Metrics}
%%%One major issue is if an stream processing approach is robust to veracity. In order to estimate the robustness of an approach to data lost or data that are out of date we define as ground truth the result of the stream processing in the absence of data loss or data that are delayed. Then we can quantify the error produced by the veracity. Let $error$ be a function that compare the produced result of an approach with and without veracity. An approach is robust to veracity of streaming data if $error$ scale linearly with respect to size of the data and the number of errors. The types of errors of concern in a streaming setting include timeliness of delivery of stream elements, fraction of elements that arrive out-of-order, element loss rate, and element integrity verification failure rate. An example of an $error$ function is the number of false positives and false negatives compared to the ground truth.
\textbf{Why is this difficult?}
%
In stream processing, data is typically sent on a best-effort basis.
As a result, data can be lost, incorrect, arrive out of order, or be
approximate. This is exacerbated by the fact that the streaming
setting affords limited opportunity to compensate for these issues.
Furthermore, the performance requirements of streaming systems
encourage the use of approximate computing~\cite{babcock_et_al_2002},
thus increasing the uncertainty of the data. Machine learning, too,
often yields uncertain results because of imperfect generalization.
An important aspect of streaming data is ordering, typically
characterized by time. The correctness of the response to
queries depends on the source of ordering, such as the creation,
processing, or delivery time. Stream processing often requires that
each piece of data be processed within a window, which can be
characterized by a predefined size or by temporal constraints. In stream
settings, sources typically do not receive control feedback.
Consequently, when exceptions occur, recovery must occur at the
destination. This reduces the space of possibilities for handling
transaction rollbacks and fault tolerance.
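To illustrate how the choice of ordering source affects the result,
the following Python sketch (the window width and the data are
hypothetical) assigns the same delayed event to tumbling windows once
by event time and once by arrival time; the two groupings differ,
which is why the language semantics must state which notion of time
its windows use:
\begin{verbatim}
from collections import defaultdict

def tumbling_windows(events, width, key):
    """Group events into tumbling windows of `width`, keyed by `key`
    (either the event timestamp or the arrival timestamp)."""
    windows = defaultdict(list)
    for e in events:
        windows[key(e) // width].append(e["value"])
    return dict(windows)

# The second event is delayed in transit: it is created at time 3 but
# only arrives at time 11.
events = [
    {"event_time": 1,  "arrival_time": 2,  "value": "a"},
    {"event_time": 3,  "arrival_time": 11, "value": "b"},
    {"event_time": 12, "arrival_time": 13, "value": "c"},
]

print(tumbling_windows(events, 10, key=lambda e: e["event_time"]))
# {0: ['a', 'b'], 1: ['c']}: by event time, 'b' belongs to window 0
print(tumbling_windows(events, 10, key=lambda e: e["arrival_time"]))
# {0: ['a'], 1: ['b', 'c']}: by arrival time, 'b' slips into window 1
\end{verbatim}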