Metadata queries
This page collects input that guides the design of the API to query and analyze metadata.
The goal of the metadata query API is to let users analyze and interactively explore the metadata stored in a metadata store. In particular, we want to attain the following goals:
- Ease of use. The API should be rather high-level and offer a rich set of operations. In particular, it should hide data modeling details of the underlying persistence layer, be it SQLite or Cassandra.
- Multi-facet queries. Different types of metadata adhere to different data models beyond the relational model, and some types of metadata even have multiple interpretations. For instance, UCCs and INDs can basically be modeled as sets of sets; INDs, however, can additionally be interpreted as a (hyper-)graph. We want to accommodate as many data models as possible to support a wide set of analytical operations. Besides that, it should be possible to mix the different data models to compose complex, rich analytics.
- Extensibility. Although we plan to offer a rich set of operations, the API should be extensible with custom operators. This is important, as new use cases might call for highly specialized operators.
A metadata query in the MDMS could be expressed as a data flow plan. These plans consist not only of an (extensible) set of operators but also of facet casts, which dictate how the data is to be interpreted by the following operator; different stages of a query can thus operate on different facets. Following this procedure, the input metadata is transformed into a query result. This query result can either be stored for further use or visualized to the user with a visualization method appropriate for the result's facet.
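Such a data flow plan could be sketched as follows. Note that this is only an illustration of the idea: the `QueryPlan` class and all of its method names (`as_facet`, `filter`, `collect`) are hypothetical, not an existing MDMS API.

```python
# Hypothetical sketch of a metadata query plan. All names are
# illustrative; the point is the chaining of operators and facet casts.

class QueryPlan:
    def __init__(self, elements):
        self.elements = list(elements)
        self.facet = "bag"  # default facet: bag of elements

    def as_facet(self, facet):
        """Facet cast: tells subsequent operators how to interpret the data."""
        self.facet = facet
        return self

    def filter(self, predicate):
        """Basic filter operator, as in SQL."""
        self.elements = [e for e in self.elements if predicate(e)]
        return self

    def collect(self):
        """Materialize the query result (for storage or visualization)."""
        return self.elements

# Usage: load INDs, cast them to the graph facet, keep unary INDs only.
inds = [{"lhs": ("A", "x"), "rhs": ("B", "y"), "arity": 1},
        {"lhs": ("A", "x"), "rhs": ("C", "z"), "arity": 2}]
result = QueryPlan(inds).as_facet("graph").filter(lambda e: e["arity"] == 1).collect()
```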
A facet basically describes how to interpret metadata elements. We need to support multiple different facets to serve a wide range of metadata exploration operations.
- Bag of elements. We are faced with a bag of uniform elements without any specific order. In contrast to the relational model, these elements are not tuples but arbitrarily structured objects. However, recursive data structures, as well as collection-like data types, are in general not considered. This facet is particularly amenable to the basic operations, clustering, and classification/regression.
- Transactions. Like the bag-of-elements facet, but the elements themselves are bags of items. This is a natural interpretation of many integrity constraints, such as UCCs, INDs, and FDs. Typical operators for this facet are basic operations, clustering, and association rule mining.
- Graph. Integrity constraints typically express relationships between columns of a dataset. Graphs are thus an eligible representation for various kinds of metadata. A palpable example of such a graph is given by interpreting tables as vertices and foreign key relationships among them as edges.
Other facets might ensue as a consequence of operations. For instance, a hierarchical clustering operation might produce something like a taxonomy.
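To make the facet idea concrete, the same set of INDs can be viewed under two of the facets above. The plain Python representations below are just one possible encoding, chosen for illustration:

```python
# One set of INDs viewed under two facets (example data).
inds = [("orders.customer_id", "customers.id"),
        ("invoices.customer_id", "customers.id")]

# Transactions facet: each IND is a bag (here: frozenset) of its columns.
transactions = [frozenset(ind) for ind in inds]

# Graph facet: columns are vertices, each IND is a directed edge lhs -> rhs.
graph = {}
for lhs, rhs in inds:
    graph.setdefault(lhs, set()).add(rhs)
```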
- Basic operations
- Filter: as in SQL
- Project: as in SQL
- Outer/inner equi-join: connect metadata elements that pertain to the same schema elements (e.g., find INDs whose right-hand side is also a UCC)
- Key-based/global aggregation: group a collection of metadata elements (either by a key or all of them) and aggregate some non-key attributes with an aggregation function, such as `min`, `max`, `count`, `count-distinct`, `avg`, and `collect`
- Sort: as in SQL
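The basic operations can be illustrated on plain Python collections. The data and the join condition below follow the IND/UCC example from the equi-join bullet; the encoding is, again, only a sketch:

```python
# Example metadata: UCCs as column sets, INDs as lhs/rhs column sets.
uccs = [frozenset({"customers.id"}), frozenset({"orders.id"})]
inds = [{"lhs": {"orders.customer_id"}, "rhs": {"customers.id"}},
        {"lhs": {"orders.id"}, "rhs": {"archive.order_id"}}]

# Equi-join: INDs whose right-hand side is also a UCC (i.e., a key),
# which makes them foreign-key candidates.
fk_candidates = [ind for ind in inds if frozenset(ind["rhs"]) in uccs]

# Global aggregation: count-distinct over all columns mentioned in INDs.
columns = {col for ind in inds for col in ind["lhs"] | ind["rhs"]}
num_columns = len(columns)
```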
- Graph operations
- Clustering: cluster based on the graph's topology
- Connected components
- Transitive reduction: remove edges of a graph that could be inferred from transitivity
- PageRank
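As an example of a graph operation, connected components over the schema graph (tables as vertices, foreign-key relationships as edges, per the graph facet above) can be computed with a simple breadth-first search; the data is made up for illustration:

```python
from collections import deque

# Undirected schema graph: tables as vertices, FK relationships as edges.
edges = [("orders", "customers"), ("invoices", "customers"),
         ("logs", "log_levels")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def connected_components(adjacency):
    """Return the connected components as a list of vertex sets."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:
            v = queue.popleft()
            if v in component:
                continue
            component.add(v)
            queue.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components

components = connected_components(adjacency)
```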
- Data mining operations
- Clustering: based on a user-defined distance function
- Frequent item sets / association rule mining
- Classification/regression: supervised learners/classifiers, such as SVMs, kNN, and logistic regression
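A toy version of frequent-item-set mining over UCCs in the transactions facet might look as follows. This is a single counting pass over column pairs; a real miner (e.g., Apriori) would iterate over growing item-set sizes:

```python
from collections import Counter
from itertools import combinations

# UCCs viewed as transactions: each UCC is a set of column items.
uccs = [frozenset({"id"}),
        frozenset({"first_name", "last_name", "dob"}),
        frozenset({"first_name", "last_name", "email"})]

# Count every column pair and keep those meeting a minimum support.
min_support = 2
pair_counts = Counter(pair for ucc in uccs
                      for pair in combinations(sorted(ucc), 2))
frequent_pairs = {p for p, c in pair_counts.items() if c >= min_support}
```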
- Specialized
- PK/FK classification
- Constraint proposal
- Outlier detection: might also be realized using the above operations
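A specialized operator such as PK proposal could start from a naive heuristic like the one below: among the UCCs of a table, prefer the smallest one, breaking ties by total column-name length. This is only a rule-of-thumb sketch; an actual classifier would draw on many more features:

```python
# Naive PK-proposal heuristic over a table's UCCs (illustrative only).
def propose_pk(uccs):
    """Pick the smallest UCC; break ties by shorter column names."""
    return min(uccs, key=lambda ucc: (len(ucc), sum(map(len, ucc))))

uccs = [frozenset({"first_name", "last_name", "dob"}), frozenset({"id"})]
pk = propose_pk(uccs)
```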
The type of visualization to use depends on the facet of the data to be represented.
- Bag of elements
- Table: could support sorting, filtering, and pagination
- Charts: column/bar chart, scatter/bubble chart, heat map, line plot, area chart, pie chart, ...
- Graph
- Graph: plain graph
- Table: list the graph adjacencies
- Charts: heat map of the adjacency matrix
- Transactions
- Sunburst chart: requires an order of the items
- Tree chart: requires an order of the items
- Show the arity distribution of a certain type of integrity constraint.
- Show the distribution of a certain type of integrity constraint among the tables.
- Show some distribution of a combination of metadata (e.g., INDs + #distinct values of the IND columns).
- Show the number of tuples and columns of each table.
- Designate PKs from UCCs or FKs from INDs, respectively.
- Run PageRank over the schema graph.
- Cluster tables by schema similarity.
- Look for association rules within a set of UCCs.
- Train a classifier for FKs given some ground truth.
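The first example query, the arity distribution of a constraint type, maps directly onto the key-based aggregation operator: group the constraints by arity and count each group. With made-up IND arities:

```python
from collections import Counter

# Key-based aggregation: group INDs by arity, aggregate with count.
ind_arities = [1, 1, 2, 1, 3, 2]  # example data
arity_distribution = Counter(ind_arities)  # arity -> number of INDs
```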