In this directory we have useful general tools.
[[TOC]]
util/Spn.hpp
and util/Spn.cpp
implement sum-product networks (SPNs). An SPN is a learned data structure that
captures the joint probability distribution
SPNs were first proposed by Hoifung Poon and Pedro Domingos [PDF].
Our implementation of SPNs is more close to the work
of Hilprecht et al. [PDF] who use SPNs in a database context for
cardinality estimation and call them relational sum-product networks (RSPNs). Still util/Spn.hpp
and util/Spn.cpp
here only represent the
general concept of sum-product networks while the database specific extension making them relational lies in catalog/SpnWrapper.hpp
and
catalog/SpnWrapper.cpp
(more on SpnWrapper can be found here).
Since SPNs are a learned data structure, we first need to train our SPN on some data. For this we can use the method
static Spn learn_spn(
Eigen::MatrixXf &data,
Eigen::MatrixXi &null_matrix,
std::vector<LeafType> &leaf_types
);
Each column in the data
matrix represents a random variable and the rows represent the different values of the random
variable. The null_matrix
represents possible NULL
values in the dataset. LeafType
is an enum representing the
different possible kinds of variables. Possible values are:
DISCRETE
for random variables with discrete values, like integersCONTINUOUS
for random variables with continuous domains, like floatsAUTO
to assignDISCRETE
to integer domains andCONTINUOUS
to floating point domains
The vector leaf_types
should contain the type for each random variable in order.
We can compute likelihoods of predicates and the expectation of attributes with SPNs with the following methods:
float likelihood(const Filter &filter) const;
float upper_bound(const Filter &filter) const;
float lower_bound(const Filter &filter) const;
float expectation(unsigned attribute_id, const Filter &filter) const;
std::size_t estimate_number_distinct_values(unsigned attribute_id) const;
Filter
is a typename to pass SQL predicates to SPNs. It is a
std::unordered_map<unsigned, std::pair<SpnOperator, float>>
where unsigned is the attribute id, SpnOperator
is the
operator of the predicate and the float is the value to be compared to. This also implies that we only support
Sargable clauses as this is a general limitation of SPNs.
likelihood
returns the likelihood of a predicate to be true.
The methods upper_bound
and lower_bound
are only useful on continuous domains. Continuous domains compute the
likelihood with histograms and bins. On a bin we use linear interpolation to compute the likelihood. For instance, we
have a bin on the interval ]0;1] with 30% of the values of a continuous random variable in this bin. If we want to
know the likelihood of this variable being < 0.5, we can say that it is 15%, assuming the values are evenly distributed
across the bin. In general, this works well, but the upper bound of the likelihood can be interesting for e.g.
worst case computations. With the upper_bound
method we would get 30% instead of 15% and with lower_bound
we would
get 0%. On DISCRETE
domains it returns the same value as the method likelihood
.
The method expectation
returns the expectation of an attribute (random variable), optionally with a filter predicate
to get the conditional expectation.
The method estimate_number_distinct_values
does as the method says, it estimates the number of distinct values of an
attribute (random variable). This method only makes sense on DISCRETE
domains. On CONTINUOUS
domains the method
returns the number of rows in the dataset.
We can update an SPN model with the following methods:
void update_row(Eigen::VectorXf &old_row, Eigen::VectorXf &updated_row);
void insert_row(Eigen::VectorXf &row);
void delete_row(Eigen::VectorXf &row);
Each value in the row
vector represents a value of the respective random variable by index.
Disclaimer: Automated updates are not implemented yet. This is a future task.