This project contains the results of the FOAF (Friend-of-a-Friend) ReferenceSite created for Stanbol. It also describes the steps needed to configure the foaf-site in the Stanbol entityhub and to use it in the enhancement phase, through an EntityHubLinking engine, to enhance content. The sections below give a step-by-step guide to the FOAF site integration process.
FOAF data is mainly provided by Linked Data projects. The FOAF project wiki [1] lists several data sources, most of them social networking sites that offer their data in FOAF format. However, most of those projects are out of date, so they were not recommended as data sources for my project. The two best options were:
- The Billion Triple Challenge (BTC) 2012 project [2]: a web-crawled dataset that includes data from the dbpedia, freebase, datahub, timbl, and rest data sources. It is large (1436545545 quads), contains FOAF data, and is fairly up to date.
- The WebDataCommons project [3]: a Linked Data project with a dataset (1079175202 quads) created in August 2012. However, the project does not specify the sources of the data.
After a discussion with the Stanbol community and other FOAF-related communities, I selected the btc2012 dataset because its FOAF data is sufficiently up to date. The following sections describe how I developed a ReferenceSite in Stanbol with the selected dataset.
For this purpose I used the generic-rdf indexing tool from the Stanbol project. Some tasks, such as FOAF filtering, required additional configuration files to be copied into the tool from other sources. The guide below explains how to develop a FOAF data site as a custom vocabulary integration in Stanbol.
###Building the indexing tool
The generic-rdf indexing tool can be found in the Stanbol trunk at [4]. Build it from source using `mvn clean install`. This creates the file `org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar` in the target directory. Then initialize the tool with the command below:
java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar init
The initialization command creates the directories that the indexing tool uses during the indexing process. The main directories are:
- indexing/config: the main configuration directory
- indexing/destination: the target directory for the Solr index files and the extracted entity data
- indexing/dist: the results of the indexing process, including a reference-site data file and the Solr index
- indexing/resources: the RDF data sources to be used for the indexing process
For demonstration purposes I have uploaded the pre-built jar file together with the indexing directories created by the init command.
The uploaded files under the generic-rdf/indexing directory are pre-configured with the settings required for FOAF filtering and indexing.
The steps below describe each configuration change made to achieve FOAF filtering on the btc2012 dataset.
###Configuring the tool to filter foaf entities
indexing/config is the main configuration directory of the tool, and the main configuration file is `indexing.properties`.
To give the EntityHub site a unique name, set the 'name' value in indexing.properties to a suitable unique site name (e.g. foaf-site).
FOAF filtering requires editing the entityDataIterable field so that it supports iterating over FOAF entities, as below:
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,config:indexingsource,bnode:true
(Note that the additional bnode:true parameter above enables processing of blank nodes in the dataset.)
The entityDataIterable configuration above requires two additional configuration files: `indexingsource.properties` and `propertyfilter.config`. These files are not included in the generic-rdf indexing tool by default.
You can use the two files from the freebase indexing tool at [5] for filtering.
Copy the two files into indexing/config and add the entry below to propertyfilter.config:
foaf:*
This entry instructs the tool to select entities from the data source that define at least one property in the FOAF namespace.
To index only entities of type `foaf:Person` and `foaf:Organization`, activate 'values' in the `entityTypes.properties` file as below:
values=foaf:Person;foaf:Organization
Check that this entity filtering from entityTypes.properties is enabled as an entityProcessor in indexing.properties by searching for the entry below:
entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;
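The effect of this type filter can be illustrated with a small Python sketch. It is not Stanbol's FieldValueFilter implementation; it only assumes that the filter compares each entity's rdf:type values against the configured list. The dictionary representation of an entity and the `keep_entity` helper are hypothetical.

```python
# Illustrative only: mimics the idea of filtering entities by rdf:type.
# The entity dictionaries and keep_entity() are hypothetical, not Stanbol API.
ALLOWED_TYPES = {"foaf:Person", "foaf:Organization"}  # mirrors values=... above


def keep_entity(entity, allowed_types=ALLOWED_TYPES):
    """Return True if any rdf:type value of the entity is in allowed_types."""
    return any(t in allowed_types for t in entity.get("rdf:type", []))


entities = [
    {"id": "ex:alice", "rdf:type": ["foaf:Person"]},
    {"id": "ex:doc1", "rdf:type": ["foaf:Document"]},
    {"id": "ex:acme", "rdf:type": ["foaf:Organization", "foaf:Agent"]},
    {"id": "ex:untyped"},
]
kept_ids = [e["id"] for e in entities if keep_entity(e)]
```

In this sketch only the person and the organization survive; documents and untyped entities are dropped.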
To match entity mentions in the content and link them to entities in the FOAF dataset, certain FOAF properties must be chosen as the fields used for matching and copied to label fields in the entityhub. For this purpose I used FOAF fields such as foaf:name, foaf:firstName, and foaf:givenName as label fields. Configure these entries in `mappings.txt` as below:
foaf:name > rdfs:label
foaf:nick > rdfs:label
foaf:givenName > rdfs:label
foaf:familyName > rdfs:label
foaf:firstName > rdfs:label
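To make the effect of these entries concrete, the following Python sketch mimics a `source > target` copy mapping on a plain dictionary. It illustrates the idea only and is not the entityhub's mapping code; the entity representation and the `apply_copy_mappings` helper are hypothetical.

```python
# Illustrative only: each pair mirrors one "source > target" line in mappings.txt.
COPY_MAPPINGS = [
    ("foaf:name", "rdfs:label"),
    ("foaf:nick", "rdfs:label"),
    ("foaf:givenName", "rdfs:label"),
    ("foaf:familyName", "rdfs:label"),
    ("foaf:firstName", "rdfs:label"),
]


def apply_copy_mappings(entity, mappings=COPY_MAPPINGS):
    """Return a copy of entity where source values are also listed under target."""
    result = {field: list(values) for field, values in entity.items()}
    for source, target in mappings:
        for value in entity.get(source, []):
            target_values = result.setdefault(target, [])
            if value not in target_values:  # avoid duplicate labels
                target_values.append(value)
    return result


person = {"foaf:name": ["Tim Berners-Lee"], "foaf:nick": ["timbl"]}
indexed = apply_copy_mappings(person)
```

The original fields stay in place; the label field simply collects every value that the linking engine may later match against.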
In the enhancement phase, the Stanbol engine uses the redirect field to traverse between entities. FOAF has two main fields for linking similar or related entities: `rdfs:seeAlso` and `owl:sameAs`. Stanbol allows only one redirect field, so to use both in Stanbol engines they have to be merged. I therefore merge both fields into `fise:redirects`, a field used internally by Stanbol, and use it as the single redirect field in the linking engine configuration explained later.
The following extra entries must be added to mappings.txt in the indexing tool:
rdfs:seeAlso | d=entityhub:ref
owl:sameAs | d=entityhub:ref
rdfs:seeAlso > fise:redirects
owl:sameAs > fise:redirects
All the configuration needed to index and filter a FOAF dataset is now done. Next, place the FOAF dataset files to be indexed in indexing/resources/rdfdata. I used the datahub/data-4.nq [6] and timbl/data-6.nq [7] datasets available from the btc2012 project site. Download the data files from the given links and copy them to the indexing/resources/rdfdata directory before indexing.
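Before starting a long indexing run, it can be useful to confirm that a downloaded file actually contains FOAF statements. The following Python sketch counts predicates in the FOAF namespace in an N-Quads file. It uses a simple regular expression rather than a full N-Quads parser, and the function names are hypothetical, so treat the counts as a rough sanity check only.

```python
import gzip
import re
from collections import Counter

FOAF_NS = "http://xmlns.com/foaf/0.1/"
# Subject (IRI or blank node) followed by the predicate IRI, which is captured.
_PREDICATE = re.compile(r"^\s*(?:<[^>]*>|_:\S+)\s+<([^>]*)>")


def count_foaf_predicates(lines):
    """Count FOAF predicates, keyed by local name, in an iterable of N-Quads lines."""
    counts = Counter()
    for line in lines:
        match = _PREDICATE.match(line)
        if match and match.group(1).startswith(FOAF_NS):
            counts[match.group(1)[len(FOAF_NS):]] += 1
    return counts


def count_in_file(path):
    """Count FOAF predicates in a plain or gzip-compressed N-Quads file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_foaf_predicates(handle)


sample = [
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g> .',
    "_:b1 <http://xmlns.com/foaf/0.1/knows> <http://example.org/a> <http://example.org/g> .",
    "<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://example.org/b> <http://example.org/g> .",
]
sample_counts = count_foaf_predicates(sample)
```

To check a real download, call `count_in_file` with the path of the file you copied into indexing/resources/rdfdata.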
Now you can run the indexing tool with the command below:
java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar index
This runs the entity extraction and indexing process and creates two files in the indexing/dist directory.
Copy the generated `org.apache.stanbol.data.site.foaf-site-1.0.0.jar` to the ${stanbol-server}/fileinstall directory.
Copy the generated `foaf-site.solrindex.zip` to the ${stanbol-server}/datafiles directory.
Launch the Stanbol server using the full launcher and access the foaf-site at: http://localhost:8080/entityhub/site/foaf-site
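Once the site is running, a quick way to check that labels were indexed is to query it by name. The sketch below only builds the URLs. It assumes that the entityhub exposes `find` and `entity` endpoints under the site path, with `name`, `limit`, and `id` parameters, so verify this against the documentation of your Stanbol version. The helper names and the example entity URI are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: same host, port, and site name as used in this guide.
SITE_URL = "http://localhost:8080/entityhub/site/foaf-site"


def find_url(name, limit=5, site_url=SITE_URL):
    """Build a label-search URL (assumed 'find' endpoint with 'name' and 'limit')."""
    query = urlencode({"name": name, "limit": limit})
    return site_url + "/find?" + query


def entity_url(entity_id, site_url=SITE_URL):
    """Build a lookup URL for one entity (assumed 'entity' endpoint with 'id')."""
    query = urlencode({"id": entity_id})
    return site_url + "/entity?" + query


search = find_url("Tim Berners-Lee")
lookup = entity_url("http://example.org/people/alice#me")
```

The resulting URLs can be opened in a browser or fetched with any HTTP client.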
The next step is to create an enhancement engine in Stanbol that uses the FOAF site created above.
The following enhancement engine configurations are required to create a FOAF site linking engine.
- Configure a new entityhub-linking-engine [8] with the configuration changes below:
  - Name: foaf-site-linking
  - Referenced site: foaf-site
  - Redirect field: fise:redirects
  - Case sensitivity: disabled
- Configure a weighted enhancement chain [9] that uses the foaf-site-linking engine created above, with the configuration changes below. I added several of the available engines to the chain to perform language detection and natural language processing before FOAF linking:
  - Name: foaf-site-chain
  - Engines: langdetect, opennlp-sentence, opennlp-token, opennlp-pos, foaf-site-linking
You can now invoke the new foaf-site-chain by going to http://localhost:8080/enhancer/chain/foaf-site-chain and submitting test content such as: "Tim Berners-Lee is the inventor of World Wide Web".
You can also try it without the Stanbol web interface, using a REST client such as curl, as below:
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data "Tim Berners-Lee is the inventor of World Wide Web" http://localhost:8080/enhancer/chain/foaf-site-chain
If the configuration is correct, `Tim Berners-Lee` and `World Wide Web` should be identified as entities from the foaf-site dataset. Please refer to the attached screenshot for the demo results. This foaf-site-linking engine will be used as the basis of the FOAF disambiguation engine to be created in the second phase of the GSoC project.
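The curl request above can also be sketched with Python's standard library. This is a minimal sketch that assumes the same host, port, and chain name as in this guide. `build_enhancer_request` and `enhance` are hypothetical helper names, and only the request construction is exercised here, since sending it requires a running Stanbol server.

```python
import urllib.request


def build_enhancer_request(text, chain="foaf-site-chain",
                           base_url="http://localhost:8080"):
    """Build a POST request that mirrors the curl call to the enhancement chain."""
    return urllib.request.Request(
        url=base_url + "/enhancer/chain/" + chain,
        data=text.encode("utf-8"),
        headers={"Accept": "text/turtle", "Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, **kwargs):
    """Send the text to the chain and return the Turtle response as a string."""
    request = build_enhancer_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


request = build_enhancer_request("Tim Berners-Lee is the inventor of World Wide Web")
```

With the server running, `enhance("...")` should return the enhancement results serialized as Turtle, which can then be inspected for the linked foaf-site entities.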
[1] http://www.w3.org/wiki/FoafSites
[2] http://km.aifb.kit.edu/projects/btc-2012/
[3] http://webdatacommons.org/
[4] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/genericrdf
[5] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase
[6] http://km.aifb.kit.edu/projects/btc-2012/datahub/data-4.nq.gz
[7] http://km.aifb.kit.edu/projects/btc-2012/timbl/data-6.nq.gz
[8] https://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[9] http://stanbol.apache.org/docs/trunk/components/enhancer/chains/weightedchain.html