Documentation

Anja Jentzsch edited this page Jul 2, 2014 · 12 revisions

Basic Idea/Motivation

"Wikidata could be a lot smarter than it is right now e.g. by suggesting fields to fill and probable values.

For example: when an editor edits an item about a person that is still missing the date of birth, this should be suggested as a possible property. Or when the editor is entering the sex of the person, Wikidata should be smart and suggest the ones that are used most for these properties first. Think of it as something very similar to the famous 'people who bought x also bought y' systems." (Quim Gil) - https://bugzilla.wikimedia.org/show_bug.cgi?id=46555

Ideas regarding important Use Cases

There are two primary use cases when it comes to suggesting properties:

1) Suggesting generally helpful properties for a newly created item

Even though an official ontology for the items in Wikidata does not exist, there are still "classifying" properties/classifiers (e.g. InstanceOf, SubclassOf, memberOf etc.). The values of those properties are classes.

Items that belong to the same class often by nature have a lot in common. This makes knowing an item's classifiers combined with their concrete values very useful for suggesting properties.

Therefore a selection of widely used/important classifiers could be suggested as initial properties to a user who wants to create a new item. The idea is that once the property value has been filled in, the user can profit from the derived suggestions during further editing.

Due to the fact that there are no predefined/fixed classifiers the first step is to find a way to identify properties that serve as classifiers - a few thoughts regarding this issue:

  • Only properties with the (value) type 'item' should be considered potential classifiers, since a class that is used as the value for a classifier is most likely going to be represented by an item in Wikidata.
  • Classifiers by definition group together similar objects. So in order to decide whether a certain property is potentially used as a classifier, it is a good idea to compare items that have the same value for this property and look for patterns like common properties or property values. Coming from a different angle, one could also use the fact that the sole information that an item has a certain classifier (not considering its value) is in most cases not very valuable for finding patterns/suggesting properties.
  • Frequently used classifiers are more relevant for the described use case, because they are more likely to be a useful suggestion for a new item.
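The second thought above can be made concrete in many ways. The following is a minimal sketch of one possible measure, not a method prescribed by this page: for a candidate property, items are grouped by the value they have for it, and the remaining property sets within each group are compared using Jaccard similarity. The function names, the choice of Jaccard similarity, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Share of properties two items have in common (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_cohesion(items, property_id):
    """Mean pairwise similarity of items that share a value for property_id.

    items maps an item id to a dict {property id: value}.
    A high result hints that the property groups similar items together.
    """
    groups = defaultdict(list)
    for props in items.values():
        if property_id in props:
            # Compare only the *other* properties of the item.
            groups[props[property_id]].append(set(props) - {property_id})
    sims = [
        jaccard(a, b)
        for members in groups.values()
        for a, b in combinations(members, 2)
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical sample data: P31 separates two kinds of items, P50 does not.
items = {
    "Q1": {"P31": "Q5", "P569": "x", "P21": "y", "P50": "Q9"},
    "Q2": {"P31": "Q5", "P569": "x", "P21": "y"},
    "Q3": {"P31": "Q515", "P17": "z", "P625": "w", "P50": "Q9"},
    "Q4": {"P31": "Q515", "P17": "z", "P625": "w"},
}

cohesion_p31 = group_cohesion(items, "P31")
cohesion_p50 = group_cohesion(items, "P50")
```

In this sample, items sharing a P31 value resemble each other much more than items sharing a P50 value, so P31 would be the stronger classifier candidate under this measure.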

2) Suggesting properties for an item based on its preexisting properties

Two major steps are necessary in order to be able to do 2):

  1. For every property we need to know how often it is used in general and how often it appears in combination with every other property, so that given the properties p1 and p2 we can determine the probability of p1 being used together with p2 (and vice versa).
    This information can be stored in a table-like structure, which makes it easy to look up the probability of a certain property combination by finding the corresponding row and column.
    The table can be precomputed and only needs to be updated from time to time (updates could be done incrementally).
  2. The next step for suggesting new properties for an item based on its preexisting properties is to create a ranking of every property that is not already used to describe the item at hand.
    This can be done using the information that is stored in the table. An intuitive algorithm for ranking a certain property p could look like this:
      • look up the probability for all combinations that involve p being used in connection with one of the preexisting properties
      • compute the arithmetic mean of all the probabilities and use it for the ranking.
    As the last step, the highest-ranking properties should be suggested as additional attributes for the item at hand.
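The two steps above can be sketched as follows. This is a minimal illustration, assuming the table is kept as two plain dictionaries (usage counts per property and counts per ordered property pair) and that the probability of a combination is estimated as pair count divided by the usage count of the preexisting property. The function names and the sample counts are hypothetical.

```python
def probability(usage, pair_counts, given, candidate):
    """Estimated probability that candidate is used when given is used."""
    used = usage.get(given, 0)
    return pair_counts.get((given, candidate), 0) / used if used else 0.0


def rank(usage, pair_counts, existing, top_n=3):
    """Rank all properties not in existing by their mean probability."""
    existing = set(existing)
    scores = {}
    for candidate in usage:
        if candidate in existing:
            continue
        probs = [probability(usage, pair_counts, g, candidate) for g in existing]
        # Arithmetic mean over all preexisting properties.
        scores[candidate] = sum(probs) / len(probs) if probs else 0.0
    # Highest score first; ties are broken by property id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical counts taken from three items.
usage = {"P31": 3, "P569": 2, "P21": 1, "P17": 1}
pair_counts = {
    ("P31", "P569"): 2, ("P569", "P31"): 2,
    ("P31", "P21"): 1, ("P21", "P31"): 1,
    ("P569", "P21"): 1, ("P21", "P569"): 1,
    ("P31", "P17"): 1, ("P17", "P31"): 1,
}

ranking = rank(usage, pair_counts, {"P31", "P569"})
```

For an item that already has P31 and P569, P21 ranks above P17 here because it co-occurs with both preexisting properties, while P17 co-occurs with only one of them.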

Once classifiers as described in 1) are identified, classifier/value combinations can be integrated in the previously mentioned table, so that they - like properties - can be suggested and used to suggest properties or other classifier/value pairs respectively.

Other promising Use Cases

Suggesting *misfit* properties

The approach for suggesting new properties could be slightly modified in order to analyze a given set of properties and find the ones that do not fit in with the rest. An approach for this task could look like the following: given a set of properties P and a property p ∈ P, we generate suggestions for P \ {p} and look up p in the result in order to get its correlation value. We do this for every property in the set and finally report the ones whose correlation value lies below a certain threshold.
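A minimal sketch of this leave-one-out idea, assuming the correlation value of p is the mean probability of p given each of the remaining properties, and that the probabilities are available as a dictionary keyed by ordered pairs. The names and the sample values are hypothetical.

```python
def score(prob, given, candidate):
    """Mean probability of candidate given each property in given."""
    given = list(given)
    if not given:
        return 0.0
    return sum(prob.get((g, candidate), 0.0) for g in given) / len(given)


def misfits(prob, properties, threshold):
    """Return the properties that correlate poorly with the rest of the set."""
    properties = set(properties)
    result = []
    for p in sorted(properties):
        rest = properties - {p}  # the set without the property under test
        if score(prob, rest, p) < threshold:
            result.append(p)
    return result


# Hypothetical probabilities: prob[(a, b)] is the probability of b given a.
prob = {
    ("P31", "P569"): 0.6, ("P569", "P31"): 0.9,
    ("P31", "P625"): 0.05, ("P625", "P31"): 0.8,
    ("P569", "P625"): 0.0, ("P625", "P569"): 0.0,
}

odd_ones = misfits(prob, {"P31", "P569", "P625"}, threshold=0.2)
```

With these sample values only P625 falls below the threshold, since it rarely appears together with either of the other two properties.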

Architecture/Implementation

[Figure: architecture]

Any number of concrete Suggesters can implement the Suggester interface, which provides a function to obtain fitting attributes for an entity or for a given set of attributes (values can optionally be suggested as well). This interface may be used in different places, e.g. for adding attribute suggestions to the edit item page.
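The shape of such an interface can be illustrated as below. The project itself targets PHP, so this Python sketch only mirrors the idea. The method name, the Suggestion type, and the MostUsedSuggester class are assumptions for illustration and are not taken from the actual code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class Suggestion:
    property_id: str
    score: float


class Suggester(Protocol):
    """Anything that can propose properties for a set of existing ones."""

    def suggest_by_properties(
        self, property_ids: Iterable[str], limit: int
    ) -> List[Suggestion]:
        ...


class MostUsedSuggester:
    """Trivial concrete suggester: proposes the most used missing properties."""

    def __init__(self, usage_counts):
        self._usage = dict(usage_counts)

    def suggest_by_properties(self, property_ids, limit):
        existing = set(property_ids)
        total = sum(self._usage.values()) or 1
        ranked = sorted(
            ((p, c) for p, c in self._usage.items() if p not in existing),
            key=lambda pc: (-pc[1], pc[0]),
        )
        return [Suggestion(p, c / total) for p, c in ranked[:limit]]


suggester: Suggester = MostUsedSuggester({"P31": 5, "P569": 3, "P21": 1})
top = suggester.suggest_by_properties(["P31"], limit=1)
```

Callers such as the edit item page would depend only on the interface, so a more elaborate suggester could be swapped in without changing them.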

Dataflow

It is essential to define what data a suggester engine bases its suggestions on, and how that data is generated and accessed. This can vary depending on the suggester that is used. For the SimplePHPSuggester we have the following approach in mind:

|XML or JSON dump| -> (dump parser) -> |feature CSV| -> (data analyzer) -> |suggester data| -> (suggester engine)

XML or JSON dump

|XML or JSON dump| is a dump of the full entity data of all entities, provided as a compressed file. The XML also wraps JSON, but that "internal" JSON should be avoided as soon as "real" JSON dumps are available.

Dump Parser

(dump parser) is a script that reads the dump, extracts the information ("features") needed by the data analyzer, and writes it to a CSV file (which can then be loaded into MySQL, processed with Hadoop, or handled in some other way).

Feature CSV

|feature CSV| is a CSV file containing the features needed by the data analyzer. In the most basic form, the file consists of two columns: |subject| and |feature|. For suggesting properties based on existing properties, that would mean |itemid| |propertyid| (for suggesting products to customers, like on Amazon, it would correspond to |customer| |product|). For some use cases like finding classifiers (see here) or suggesting values, it might also be useful to have additional columns for storing the value and type of a property.
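A minimal sketch of the dump parser step for the two-column case. It assumes a JSON dump with one entity per line, wrapped in a JSON array, where each entity carries an "id" and a "claims" object keyed by property id. That input layout, the function names, and the sample entities are assumptions for illustration.

```python
import csv
import io
import json


def extract_features(lines):
    """Yield (item id, property id) pairs from a line-based JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the array brackets and empty lines
        entity = json.loads(line)
        for property_id in sorted(entity.get("claims", {})):
            yield entity["id"], property_id


def write_feature_csv(lines, out):
    """Write the feature CSV with the columns itemid and propertyid."""
    writer = csv.writer(out)
    writer.writerow(["itemid", "propertyid"])
    writer.writerows(extract_features(lines))


# Hypothetical two-entity dump.
dump = [
    "[",
    '{"id": "Q42", "claims": {"P31": [], "P569": []}},',
    '{"id": "Q64", "claims": {"P31": []}}',
    "]",
]

buffer = io.StringIO()
write_feature_csv(dump, buffer)
csv_lines = buffer.getvalue().splitlines()
```

Additional columns for value and type could be added by reading them from each claim, at the cost of a larger file.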

Data Analyzer

(data analyzer) generates the information used by the suggester (the data could either be written to files that are then loaded into a database, or be stored directly in the database). This includes finding correlations between features by counting co-occurrences of properties and possibly property/value pairs, as well as computing their usage frequencies. Filtering and perhaps clustering could also be integral parts of this process.
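The counting step can be sketched as follows, starting from (itemid, propertyid) rows as found in the feature CSV. The correlation value is assumed here to be the pair count divided by the usage count of the first property; this definition, the function name, and the sample rows are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def analyze(rows):
    """Return (usage counts, pair rows) for rows of (item id, property id).

    Each pair row is (pid1, pid2, count, correlation).
    """
    by_item = defaultdict(set)
    for item_id, property_id in rows:
        by_item[item_id].add(property_id)

    usage = Counter()
    pair_counts = Counter()
    for props in by_item.values():
        usage.update(props)
        # Ordered pairs, so (p1, p2) and (p2, p1) are both recorded.
        pair_counts.update(permutations(sorted(props), 2))

    pair_rows = [
        (p1, p2, n, n / usage[p1])
        for (p1, p2), n in sorted(pair_counts.items())
    ]
    return usage, pair_rows


# Hypothetical feature rows for three items.
rows = [(1, 31), (1, 569), (2, 31), (2, 569), (3, 31)]
usage, pair_rows = analyze(rows)
```

For large dumps the same counts would more likely be computed inside the database or with a distributed job, but the result has the same shape.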

Suggester Data

|suggester data| is the information used by the suggester engine. It resides in a database that can be accessed from the application servers. The goal is to structure and index the data in such a way that the suggester engine can generate suggestions instantly.

For example, the data accessed by the SimplePHPSuggester is stored according to the following schema:

wbs_PropertyPairs(pid1 INT, pid2 INT, count INT, correlation FLOAT, primary key(pid1, pid2))

  • Explanation: one entry for each occurring pair of properties; count is how often the combination occurs, and correlation is how likely it is to find pid1 in combination with pid2.

wbs_Properties(pid INT, count INT, type VARCHAR(20), primary key(pid))

Example Usages

1. Use Case:

Assumptions:

  • we are given an entity E described by a list of Properties L
  • we want the top X property-suggestions for the given set of properties
  • we only wish to suggest properties that correlate well with the properties in L (their rating has to pass a certain threshold T) and that are not already contained in L

In order to achieve all this we could use the following query:

SELECT pid2, sum(correlation) AS cor
FROM wbs_PropertyPairs
WHERE pid1 IN L AND pid2 NOT IN L
GROUP BY pid2
HAVING sum(correlation) > T
ORDER BY cor DESC
LIMIT X

(see here for further information on this use case)
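To try the query above outside of MySQL, the following sketch runs it against an in-memory SQLite database using Python's sqlite3 module. The list L is expanded into one placeholder per property id. The helper name and the sample rows are assumptions for illustration.

```python
import sqlite3


def suggest(conn, existing, threshold, limit):
    """Run the suggestion query for the property ids in existing."""
    if not existing:
        return []
    marks = ",".join("?" for _ in existing)  # one placeholder per id in L
    sql = f"""
        SELECT pid2, SUM(correlation) AS cor
        FROM wbs_PropertyPairs
        WHERE pid1 IN ({marks}) AND pid2 NOT IN ({marks})
        GROUP BY pid2
        HAVING SUM(correlation) > ?
        ORDER BY cor DESC
        LIMIT ?
    """
    params = [*existing, *existing, threshold, limit]
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wbs_PropertyPairs ("
    "pid1 INT, pid2 INT, count INT, correlation FLOAT, "
    "PRIMARY KEY (pid1, pid2))"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO wbs_PropertyPairs VALUES (?, ?, ?, ?)",
    [
        (31, 569, 20, 0.5),
        (21, 569, 18, 0.9),
        (31, 21, 15, 0.4),
        (31, 17, 4, 0.1),
        (21, 17, 1, 0.05),
    ],
)

result = suggest(conn, [31, 21], threshold=0.5, limit=5)
```

Because L is fixed for a single query, ordering by the sum gives the same ranking as ordering by the arithmetic mean with missing pairs counted as zero; only the threshold T has to be scaled accordingly.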

2. Use Case:

A simple algorithm for finding classifying properties could make use of the query below:

SELECT pid, count
FROM wbs_Properties
WHERE type = 'item'
ORDER BY count DESC
LIMIT 5

(see here for further information on the described use case)

Suggester Engine

(suggester engine) generates suggestions based on the provided data set. It has information about how the data is structured, and how it should be interpreted. It's written in PHP and is called directly from (or is part of) a MediaWiki extension.

Detailed Ideas regarding Implementation:

Algorithms

Ideas for evaluating Suggestions

It is important to define a method for evaluating suggestions in order to be able to evaluate and compare different suggester engines. One promising approach is outlined below:

First we need to randomly select a test set of entities. Then for every entity we divide the existing property information (snaks) into two parts. One part is taken as input for the suggester engine that is being tested, while the second part is matched against the output predictions/suggestions. If a suggestion can be matched, it is obviously correct. The case is more complicated if there is no match, because a suggestion could still be good/correct even though the proposed property does not already exist for the entity at hand (open world assumption). In these cases outside information could be taken into account and/or manual evaluation could be done for a sample of the suggestions (unfortunately this would be more or less subjective). In addition, user feedback could be used systematically for evaluating as well as improving suggestions during later stages of the project.
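The automatic part of this procedure can be sketched as below. Each entity's properties are split into an input half and a held-out half, and the share of held-out properties that appear among the top k suggestions is reported. The half-and-half split, the metric, and the function names are assumptions for illustration.

```python
import random


def evaluate(entities, suggest, k=3, seed=0):
    """Share of held-out properties recovered by suggest(given, k).

    entities is a list of property-id sets; suggest returns a list of ids.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    hits = 0
    held_out_total = 0
    for props in entities:
        props = sorted(props)
        if len(props) < 2:
            continue  # nothing to hold out
        rng.shuffle(props)
        half = len(props) // 2
        given, held_out = props[:half], set(props[half:])
        suggested = set(suggest(given, k))
        hits += len(suggested & held_out)
        held_out_total += len(held_out)
    return hits / held_out_total if held_out_total else 0.0


ALL_PROPERTIES = ["P17", "P21", "P31", "P569"]


def everything_suggester(given, k):
    """Hypothetical baseline: suggests every property that is not given."""
    return [p for p in ALL_PROPERTIES if p not in given][:k]


def silent_suggester(given, k):
    """Hypothetical baseline: never suggests anything."""
    return []


entities = [{"P31", "P569", "P21"}, {"P31", "P17"}]
best = evaluate(entities, everything_suggester, k=4)
worst = evaluate(entities, silent_suggester, k=4)
```

Because of the open world assumption, a suggestion outside the held-out part is not necessarily wrong, so this figure is best read as a lower bound on suggestion quality.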

UI Integration

While the input field is still empty, the best property suggestions are displayed based on existing attributes of the current item. When the user starts typing, the input is used to filter the suggested properties: Properties that do not match the current input are replaced by traditional search results. Another approach (which we are not working on currently) is to rank all properties that match the input and display the top n of them.

There are basically two ideas for integrating the suggester results with the "add statement" functionality:

1. Combine suggestions with the former search results on the client side.
2. Combine suggestions with the former search results on the server side.
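Wherever the combination happens, the filtering behaviour described above could look roughly like this sketch: suggestions whose label matches the typed input stay on top, and the remaining slots are filled with the regular search results. Prefix matching on the label, the result limit, and all names are assumptions for illustration.

```python
def merge_results(suggestions, search_results, typed, limit=7):
    """Combine ranked suggestions with ranked search results.

    Both inputs are lists of (property id, label) in ranked order.
    """
    needle = typed.strip().lower()
    # Keep only suggestions that still match what the user typed.
    matching = [s for s in suggestions if s[1].lower().startswith(needle)]
    seen = {pid for pid, _ in matching}
    merged = list(matching)
    # Fill up with search results, skipping properties already listed.
    for pid, label in search_results:
        if pid not in seen:
            merged.append((pid, label))
            seen.add(pid)
    return merged[:limit]


# Hypothetical suggestions and search results.
suggestions = [
    ("P569", "date of birth"),
    ("P21", "sex or gender"),
    ("P19", "place of birth"),
]
search_results = [("P570", "date of death"), ("P569", "date of birth")]

merged = merge_results(suggestions, search_results, "date")
initial = merge_results(suggestions, [], "")
```

With an empty input field every suggestion matches, so the list consists of the best suggestions only.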
