\subsection{Data Variety}
Data variety refers to the presence of different data formats, data
types, data semantics, and associated data management solutions in an
information system. The term emerged with the advent of Big Data, but
the problem of taming variety is well known from machine understanding
of unstructured data (such as text, images, and video) and from
syntactic, structural, and semantic interoperability and data
integration for structured and semistructured data. There are
multiple known solutions to data variety for a moderate number of
high-volume data sources. However, data variety remains unsolved when
there are hundreds of data sources to integrate or when the data to
integrate is highly dynamic or streaming (as in this paper).
\textbf{Why is this important?}
%
Increasingly, applications must process heterogeneous data streams in
real time together with large background knowledge bases. Consider the
following two examples from~\cite{DellAglioDataScience2017}, which presents further examples.
In the first example, we want to use sensor readings of the last
10~minutes to find electricity-producing turbines that are in a state
similar (e.g., Pearson correlated by at least 0.75) to any turbine
that subsequently had a critical failure. Here, data variety arises
from having tens of turbines of 3--4 different types equipped with
different sensors deployed over many years, with more sensors to be
deployed in the future. Moreover, in many cases, once an anomaly is
detected, the user also needs to retrieve multimedia maintenance
instructions and annotations to complete the diagnosis process.
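
To make the similarity criterion of this first example concrete, the following is a minimal sketch of the Pearson correlation test. The notation is an assumption introduced here for illustration: $x_1,\dots,x_n$ are the readings of one sensor of a running turbine over the last 10~minutes, $y_1,\dots,y_n$ are the readings of a comparable sensor of a turbine over a window that preceded its critical failure, and both series are assumed to be aligned to the same $n$ time points, with means $\bar{x}$ and $\bar{y}$.
\[
  r_{xy} =
  \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
        \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
  \qquad
  \mbox{with the turbine flagged when } r_{xy} \geq 0.75 .
\]
Under data variety, the alignment assumption is itself part of the problem: turbines of different types carry different sensors, so comparable series $x$ and $y$ must first be identified and brought to a common representation.
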
In the second example, we want to use the latest open traffic
information and social media as well as the weather forecast to
determine whether the users of a mobile mobility app are likely to run into
a traffic jam during their commute tonight and how long it will take
them to get home. Here, data variety arises from using third-party
data sources that are free to evolve in syntax, structure, and
semantics.
%There are many more examples that highlight the opportunity to create
%value by taming variety and velocity simultaneously. In social media
%analytics, it would be valuable to identify the current top
%influencers that are driving the discussion about the top emerging
%topics across all the social networks. In the tourism industry, it
%would be valuable to tell tourist where they can spend their evening
%given the presence of people and what the are doing (predicted
%analyzing the spatio-temporal correlation between privacy-preserving
%aggregates of mobile telecom data and of geo-located social media
%posts). In well-being analytics, it would be worth advising people when
%to go exercise, given their past, possibly sedentary, behavior and
%allergies (accessed in a privacy-preserving manner) as well as current
%weather conditions and pollution/allergen levels.
\textbf{How can we measure the challenge?}
%
For a streaming language, the data variety challenge can be broken down
into the following measures $\mathbf{C_4}$--$\mathbf{C_6}$:
\begin{itemize}
\item[$\mathbf{C_4}$] \emph{Expressive data model.} The data model
used to logically represent information is expressive and allows
encoding multiple data types, data structures, and data
semantics. This is the path investigated by
RSP-QL~\cite{DellAglioDataScience2017,DBLP:conf/debs/ValleDM16}.
\item[$\mathbf{C_5}$] \emph{Multiple representations.} The language
can ingest data in multiple representations, offering the
programmer a unified set of logical operators while implementing
physical operators that work directly on the representations for
performance. An example is the most recent evolution of the
Streaming Linked Data framework~\cite{DBLP:conf/esws/BalduiniV017a}.
\item[$\mathbf{C_6}$] \emph{New sources with new formats.} The
language allows adding new sources where data are represented in a
format unforeseen when the language was
released. This might be accomplished by extending
R2RML\footnote{\url{https://www.w3.org/TR/r2rml}}.
% -- a language
% for expressing customized mappings from relational databases to
% RDF datasets.
\end{itemize}
\textbf{Why is this difficult?}
%
Deriving value is harder for a system that has to tame data variety
than for a system that only has to handle a single well-structured
data source. This is because data analysis solutions require
homogeneous, well-formed input. Under data variety, preparing such
input involves several different data management solutions, each of
which takes time to perform its part of the processing and to
coordinate with the others. This time is particularly relevant in
stream processing, where answers should be generated with low latency.
Although the time available to answer depends on the application
domain (in call centers, routing must be decided in under a second,
while in oil operations, dangerous situations must be detected within
minutes), traditional batch pipelines for feature extraction and
extract-transform-load (ETL) may take so long that the results, when
computed, are no longer useful. For this reason, it is still
challenging to tame variety in stream processing systems.
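
The timing argument above can be summarized as a simple latency budget. This is a sketch under stated assumptions, not a measured model, and all symbols are introduced here for illustration: $k$ data management solutions prepare the data one after another, solution $i$ takes time $t_i$, coordination among them costs $t_{\mathrm{coord}}$, the analysis itself takes $t_{\mathrm{analysis}}$, and the application domain imposes a deadline $D$.
\[
  \sum_{i=1}^{k} t_i \;+\; t_{\mathrm{coord}} \;+\; t_{\mathrm{analysis}} \;\leq\; D .
\]
If the preparation steps can run in parallel, the sum is replaced by $\max_i t_i$. In the examples above, $D$ is below one second for call-center routing and on the order of minutes for oil operations. Each additional format or source tends to add a term to the left-hand side, while $D$ stays fixed.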