readme etc. ready for publication
jjarosch committed Dec 18, 2019
1 parent 0f7f0db commit 614657d
Showing 27 changed files with 3,210 additions and 1 deletion.
12 changes: 12 additions & 0 deletions .gitignore
@@ -0,0 +1,12 @@
## KNIME files
.knimeLock
.project
.savedWithData
workflowset.meta

## KNIME directories containing node data
internal/
port_*/
.metadata/
internalTables/
.*.autoSave/
43 changes: 42 additions & 1 deletion README.md
@@ -1,2 +1,43 @@
# DFD-Wikidata-Joiner
KNIME workflow for joining DFD entries with Wikidata items

A [KNIME Analytics Platform](https://www.knime.com/knime-analytics-platform) workflow for conservatively matching [DFD (Digital Dictionary of Surnames in Germany)](http://www.namenforschung.net/en/dfd/dictionary/list/) entries with [Wikidata](https://www.wikidata.org/) family name items. It outputs a list of matches ready for the [QuickStatements](https://tools.wmflabs.org/quickstatements/) bulk editing tool, which then adds the DFD-ID using the corresponding property [P6597](http://www.wikidata.org/entity/P6597) to the Wikidata items.

The workflow fetches a full list of currently published DFD articles and a list of family name items from Wikidata’s SPARQL endpoint. Family name items are first filtered in the SPARQL query to:
* exclude items which already have a “[DFD-ID](http://www.wikidata.org/entity/P6597)” statement
* include only items with the statement “[instance of](http://www.wikidata.org/entity/P31): [family name](http://www.wikidata.org/entity/Q101352)”
* include only items with the statement “[writing system](http://www.wikidata.org/entity/P282): [Latin script](http://www.wikidata.org/entity/Q8229)”
* exclude items with additional values for “[writing system](http://www.wikidata.org/entity/P282)”
* include only items with a value for “[native label](http://www.wikidata.org/entity/P1705)”
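
The filters above can be sketched as a single SPARQL query, wrapped here in a small Python helper. This is an illustration only and is not copied from the workflow: the query text, the endpoint URL `https://query.wikidata.org/sparql`, the `format=json` parameter and the helper name `build_request_url` are assumptions. The variable names `?item` and `?nl` mirror the column names that appear in the Joiner node settings.

```python
from urllib.parse import urlencode

# Assumed endpoint URL; the README only refers to "Wikidata's SPARQL endpoint".
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query, not the one stored in the workflow.
FAMILY_NAME_QUERY = """
SELECT ?item ?nl WHERE {
  ?item wdt:P31 wd:Q101352 ;      # instance of: family name
        wdt:P282 wd:Q8229 ;       # writing system: Latin script
        wdt:P1705 ?nl .           # has a value for native label
  FILTER NOT EXISTS { ?item wdt:P6597 ?dfd . }   # no DFD-ID yet
  FILTER NOT EXISTS {                            # no other writing system
    ?item wdt:P282 ?other .
    FILTER(?other != wd:Q8229)
  }
}
"""


def build_request_url(query: str = FAMILY_NAME_QUERY) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

Items with several native labels come back as several rows; that case is deliberately left to the later filtering step in KNIME.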

Further filtering is done in KNIME, because performing it in the query would be too costly for the Wikidata SPARQL endpoint:
* exclude items whose value of “[native label](http://www.wikidata.org/entity/P1705)” occurs on more than one family name item
* exclude items with more than one value of “[native label](http://www.wikidata.org/entity/P1705)”
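
As a rough sketch of these two exclusions outside KNIME, assuming the query result is available as `(item, native_label)` pairs with one pair per label value: the function name `unambiguous_items` is hypothetical, and applying both rules simultaneously (rather than one after the other) is an assumption that errs on the conservative side.

```python
from collections import defaultdict


def unambiguous_items(rows):
    """Keep only (item, native_label) pairs that allow a 1:1 match.

    rows: iterable of (item, native_label) tuples, one per label value.
    """
    rows = list(rows)
    labels_per_item = defaultdict(set)
    items_per_label = defaultdict(set)
    for item, label in rows:
        labels_per_item[item].add(label)
        items_per_label[label].add(item)
    return [
        (item, label)
        for item, label in rows
        # exactly one native label on the item, and that label on no other item
        if len(labels_per_item[item]) == 1 and len(items_per_label[label]) == 1
    ]
```

With hypothetical data, two items sharing the label `Meyer` are both dropped, as is an item carrying both `Müller` and `Mueller`.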

This results in a list of items where unequivocal 1:1 matches between DFD and Wikidata are possible and a new statement with “[DFD-ID](http://www.wikidata.org/entity/P6597)” is unlikely to be erroneous or problematic.

The name forms from DFD are then joined to the native labels from Wikidata, using exact (and case-sensitive) string matching.
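
The join itself amounts to an inner join on string equality. A minimal sketch, assuming the DFD list is available as `(dfd_id, lemma)` pairs and the filtered Wikidata list as `(item, native_label)` pairs; the function name `join_exact` and the sample values in the usage note are hypothetical.

```python
def join_exact(dfd_rows, wikidata_rows):
    """Inner join on exact, case-sensitive equality of the name form."""
    items_by_label = {}
    for item, label in wikidata_rows:
        items_by_label.setdefault(label, []).append(item)
    return [
        (dfd_id, item)
        for dfd_id, lemma in dfd_rows
        # no normalisation: "weber" does not match "Weber"
        for item in items_by_label.get(lemma, [])
    ]
```

For example, a DFD entry `weber` would not be joined to a Wikidata item labelled `Weber`, which is what keeps the matching conservative.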

The resulting list of matches is transformed to the CSV format required by QuickStatements. The final result can be copied and pasted into [a new batch on QuickStatements](https://tools.wmflabs.org/quickstatements/#/batch/new). QuickStatements will then perform a batch of edits on Wikidata to add the statements.
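
The transformation to CSV might look roughly like the following. The exact layout is an assumption based on the QuickStatements CSV conventions as commonly documented (a `qid` header column, one column per property, string values wrapped in double quotes that CSV escaping turns into triple quotes); it is not taken from the workflow. The function name `to_quickstatements_csv` is hypothetical.

```python
import csv
import io


def to_quickstatements_csv(matches):
    """Render (dfd_id, item) pairs as CSV text for a QuickStatements batch.

    item may be a bare QID or a full entity URI.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["qid", "P6597"])  # assumed header layout
    for dfd_id, item in matches:
        qid = item.rsplit("/", 1)[-1]  # strip the entity URI prefix, if any
        # The value is passed with literal double quotes; the csv module
        # escapes them by doubling and wraps the whole field in quotes.
        writer.writerow([qid, f'"{dfd_id}"'])
    return buffer.getvalue()
```

It is worth pasting a single row into a test batch first to confirm that QuickStatements parses the output as intended.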

## Download, Installation and Requirements

All releases can be downloaded as a .knwf file from the [release page of this repository](https://github.com/digicademy/DFD-Wikidata-Joiner/releases).

The .knwf file can then be imported into [KNIME version 4.1.0 or higher](https://www.knime.com/downloads/download-knime).

The [KNIME Semantic Web](https://hub.knime.com/knime/extensions/org.knime.features.semanticweb) extension is required. When the workflow is imported, KNIME automatically prompts for installation of this extension.

## License

The software is published under the terms of the MIT license.

## Research Software Engineering and Development

Copyright 2019 <a href="https://orcid.org/0000-0001-8483-8123">Julian Jarosch</a>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

3 changes: 3 additions & 0 deletions knwf/.artifacts/workflow-configuration-representation.json
@@ -0,0 +1,3 @@

{
}
3 changes: 3 additions & 0 deletions knwf/.artifacts/workflow-configuration.json
@@ -0,0 +1,3 @@

{
}
62 changes: 62 additions & 0 deletions knwf/CSV Reader (#1)/settings.xml
@@ -0,0 +1,62 @@
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://www.knime.org/2008/09/XMLConfig" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.knime.org/2008/09/XMLConfig http://www.knime.org/XMLConfig_2008_09.xsd" key="settings.xml">
<entry key="node_file" type="xstring" value="settings.xml"/>
<config key="flow_stack"/>
<config key="internal_node_subsettings">
<entry key="memory_policy" type="xstring" value="CacheSmallInMemory"/>
</config>
<config key="model">
<entry key="url" type="xstring" value="http://www.namenforschung.net/alle.csv"/>
<entry key="colDelimiter" type="xstring" value="%%00009"/>
<entry key="rowDelimiter" type="xstring" value="%%00010"/>
<entry key="quote" type="xstring" value="&quot;"/>
<entry key="commentStart" type="xstring" value="#"/>
<entry key="hasRowHeader" type="xboolean" value="false"/>
<entry key="hasColHeader" type="xboolean" value="false"/>
<entry key="supportShortLines" type="xboolean" value="true"/>
<entry key="limitRowsCount" type="xlong" value="-1"/>
<entry key="skipFirstLinesCount" type="xint" value="-1"/>
<entry key="characterSetName" type="xstring" isnull="true" value=""/>
<entry key="limitAnalysisCount" type="xint" value="-1"/>
</config>
<config key="nodeAnnotation">
<entry key="text" type="xstring" value="read all published DFD %%00010entries from online list"/>
<entry key="bgcolor" type="xint" value="16777215"/>
<entry key="x-coordinate" type="xint" value="265"/>
<entry key="y-coordinate" type="xint" value="179"/>
<entry key="width" type="xint" value="151"/>
<entry key="height" type="xint" value="38"/>
<entry key="alignment" type="xstring" value="CENTER"/>
<entry key="borderSize" type="xint" value="0"/>
<entry key="borderColor" type="xint" value="16777215"/>
<entry key="defFontSize" type="xint" value="11"/>
<entry key="annotation-version" type="xint" value="20151123"/>
<config key="styles"/>
</config>
<entry key="customDescription" type="xstring" isnull="true" value=""/>
<entry key="state" type="xstring" value="CONFIGURED"/>
<entry key="factory" type="xstring" value="org.knime.base.node.io.csvreader.CSVReaderNodeFactory"/>
<entry key="node-name" type="xstring" value="CSV Reader"/>
<entry key="node-bundle-name" type="xstring" value="KNIME Base Nodes"/>
<entry key="node-bundle-symbolic-name" type="xstring" value="org.knime.base"/>
<entry key="node-bundle-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-bundle-version" type="xstring" value="4.1.0.v201912041211"/>
<entry key="node-feature-name" type="xstring" value="KNIME Core"/>
<entry key="node-feature-symbolic-name" type="xstring" value="org.knime.features.base.feature.group"/>
<entry key="node-feature-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-feature-version" type="xstring" value="4.1.0.v201912041824"/>
<config key="factory_settings"/>
<entry key="name" type="xstring" value="CSV Reader"/>
<entry key="hasContent" type="xboolean" value="false"/>
<entry key="isInactive" type="xboolean" value="false"/>
<config key="ports">
<config key="port_1">
<entry key="index" type="xint" value="1"/>
<entry key="port_dir_location" type="xstring" isnull="true" value=""/>
</config>
</config>
<config key="filestores">
<entry key="file_store_location" type="xstring" isnull="true" value=""/>
<entry key="file_store_id" type="xstring" isnull="true" value=""/>
</config>
</config>
78 changes: 78 additions & 0 deletions knwf/Column Filter (#12)/settings.xml
@@ -0,0 +1,78 @@
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://www.knime.org/2008/09/XMLConfig" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.knime.org/2008/09/XMLConfig http://www.knime.org/XMLConfig_2008_09.xsd" key="settings.xml">
<entry key="node_file" type="xstring" value="settings.xml"/>
<config key="flow_stack"/>
<config key="internal_node_subsettings">
<entry key="memory_policy" type="xstring" value="CacheSmallInMemory"/>
</config>
<config key="model">
<config key="column-filter">
<entry key="filter-type" type="xstring" value="STANDARD"/>
<config key="included_names">
<entry key="array-size" type="xint" value="2"/>
<entry key="0" type="xstring" value="DFD-ID"/>
<entry key="1" type="xstring" value="DFD-Lemma"/>
</config>
<config key="excluded_names">
<entry key="array-size" type="xint" value="1"/>
<entry key="0" type="xstring" value="Col2"/>
</config>
<entry key="enforce_option" type="xstring" value="EnforceExclusion"/>
<config key="name_pattern">
<entry key="pattern" type="xstring" value=""/>
<entry key="type" type="xstring" value="Wildcard"/>
<entry key="caseSensitive" type="xboolean" value="true"/>
</config>
<config key="datatype">
<config key="typelist">
<entry key="org.knime.core.data.StringValue" type="xboolean" value="false"/>
<entry key="org.knime.core.data.BooleanValue" type="xboolean" value="false"/>
<entry key="org.knime.core.data.IntValue" type="xboolean" value="false"/>
<entry key="org.knime.core.data.DoubleValue" type="xboolean" value="false"/>
<entry key="org.knime.core.data.LongValue" type="xboolean" value="false"/>
<entry key="org.knime.core.data.date.DateAndTimeValue" type="xboolean" value="false"/>
</config>
</config>
</config>
</config>
<config key="nodeAnnotation">
<entry key="text" type="xstring" value="irrelevant%%00010columns"/>
<entry key="bgcolor" type="xint" value="16777215"/>
<entry key="x-coordinate" type="xint" value="513"/>
<entry key="y-coordinate" type="xint" value="179"/>
<entry key="width" type="xint" value="134"/>
<entry key="height" type="xint" value="38"/>
<entry key="alignment" type="xstring" value="CENTER"/>
<entry key="borderSize" type="xint" value="0"/>
<entry key="borderColor" type="xint" value="16777215"/>
<entry key="defFontSize" type="xint" value="11"/>
<entry key="annotation-version" type="xint" value="20151123"/>
<config key="styles"/>
</config>
<entry key="customDescription" type="xstring" isnull="true" value=""/>
<entry key="state" type="xstring" value="IDLE"/>
<entry key="factory" type="xstring" value="org.knime.base.node.preproc.filter.column.DataColumnSpecFilterNodeFactory"/>
<entry key="node-name" type="xstring" value="Column Filter"/>
<entry key="node-bundle-name" type="xstring" value="KNIME Base Nodes"/>
<entry key="node-bundle-symbolic-name" type="xstring" value="org.knime.base"/>
<entry key="node-bundle-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-bundle-version" type="xstring" value="4.1.0.v201912041211"/>
<entry key="node-feature-name" type="xstring" value="KNIME Core"/>
<entry key="node-feature-symbolic-name" type="xstring" value="org.knime.features.base.feature.group"/>
<entry key="node-feature-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-feature-version" type="xstring" value="4.1.0.v201912041824"/>
<config key="factory_settings"/>
<entry key="name" type="xstring" value="Column Filter"/>
<entry key="hasContent" type="xboolean" value="false"/>
<entry key="isInactive" type="xboolean" value="false"/>
<config key="ports">
<config key="port_1">
<entry key="index" type="xint" value="1"/>
<entry key="port_dir_location" type="xstring" isnull="true" value=""/>
</config>
</config>
<config key="filestores">
<entry key="file_store_location" type="xstring" isnull="true" value=""/>
<entry key="file_store_id" type="xstring" isnull="true" value=""/>
</config>
</config>
62 changes: 62 additions & 0 deletions knwf/Column Rename (#2)/settings.xml
@@ -0,0 +1,62 @@
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://www.knime.org/2008/09/XMLConfig" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.knime.org/2008/09/XMLConfig http://www.knime.org/XMLConfig_2008_09.xsd" key="settings.xml">
<entry key="node_file" type="xstring" value="settings.xml"/>
<config key="flow_stack"/>
<config key="internal_node_subsettings">
<entry key="memory_policy" type="xstring" value="CacheSmallInMemory"/>
</config>
<config key="model">
<config key="all_columns">
<config key="0">
<entry key="old_column_name" type="xstring" value="Col0"/>
<entry key="new_column_name" type="xstring" value="DFD-ID"/>
<entry key="new_column_type" type="xint" value="7"/>
</config>
<config key="1">
<entry key="old_column_name" type="xstring" value="Col1"/>
<entry key="new_column_name" type="xstring" value="DFD-Lemma"/>
<entry key="new_column_type" type="xint" value="0"/>
</config>
</config>
</config>
<config key="nodeAnnotation">
<entry key="text" type="xstring" value=""/>
<entry key="bgcolor" type="xint" value="16777215"/>
<entry key="x-coordinate" type="xint" value="393"/>
<entry key="y-coordinate" type="xint" value="179"/>
<entry key="width" type="xint" value="134"/>
<entry key="height" type="xint" value="19"/>
<entry key="alignment" type="xstring" value="CENTER"/>
<entry key="borderSize" type="xint" value="0"/>
<entry key="borderColor" type="xint" value="16777215"/>
<entry key="defFontSize" type="xint" value="11"/>
<entry key="annotation-version" type="xint" value="20151123"/>
<config key="styles"/>
</config>
<entry key="customDescription" type="xstring" isnull="true" value=""/>
<entry key="state" type="xstring" value="IDLE"/>
<entry key="factory" type="xstring" value="org.knime.base.node.preproc.rename.RenameNodeFactory"/>
<entry key="node-name" type="xstring" value="Column Rename"/>
<entry key="node-bundle-name" type="xstring" value="KNIME Base Nodes"/>
<entry key="node-bundle-symbolic-name" type="xstring" value="org.knime.base"/>
<entry key="node-bundle-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-bundle-version" type="xstring" value="4.1.0.v201912041211"/>
<entry key="node-feature-name" type="xstring" value="KNIME Core"/>
<entry key="node-feature-symbolic-name" type="xstring" value="org.knime.features.base.feature.group"/>
<entry key="node-feature-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-feature-version" type="xstring" value="4.1.0.v201912041824"/>
<config key="factory_settings"/>
<entry key="name" type="xstring" value="Column Rename"/>
<entry key="hasContent" type="xboolean" value="false"/>
<entry key="isInactive" type="xboolean" value="false"/>
<config key="ports">
<config key="port_1">
<entry key="index" type="xint" value="1"/>
<entry key="port_dir_location" type="xstring" isnull="true" value=""/>
</config>
</config>
<config key="filestores">
<entry key="file_store_location" type="xstring" isnull="true" value=""/>
<entry key="file_store_id" type="xstring" isnull="true" value=""/>
</config>
</config>
85 changes: 85 additions & 0 deletions knwf/Joiner (#5)/settings.xml
@@ -0,0 +1,85 @@
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://www.knime.org/2008/09/XMLConfig" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.knime.org/2008/09/XMLConfig http://www.knime.org/XMLConfig_2008_09.xsd" key="settings.xml">
<entry key="node_file" type="xstring" value="settings.xml"/>
<config key="flow_stack"/>
<config key="internal_node_subsettings">
<entry key="memory_policy" type="xstring" value="CacheSmallInMemory"/>
</config>
<config key="model">
<entry key="duplicateHandling" type="xstring" value="AppendSuffixAutomatic"/>
<entry key="compositionMode" type="xstring" value="MatchAll"/>
<entry key="joinMode" type="xstring" value="InnerJoin"/>
<config key="leftTableJoinPredicate">
<entry key="array-size" type="xint" value="1"/>
<entry key="0" type="xstring" value="DFD-Lemma"/>
</config>
<config key="rightTableJoinPredicate">
<entry key="array-size" type="xint" value="1"/>
<entry key="0" type="xstring" value="nl"/>
</config>
<entry key="suffix" type="xstring" value="(*)"/>
<config key="leftIncludeCols">
<entry key="array-size" type="xint" value="2"/>
<entry key="0" type="xstring" value="DFD-ID"/>
<entry key="1" type="xstring" value="DFD-Lemma"/>
</config>
<config key="rightIncludeCols">
<entry key="array-size" type="xint" value="2"/>
<entry key="0" type="xstring" value="item"/>
<entry key="1" type="xstring" value="nl"/>
</config>
<entry key="leftIncludeAll" type="xboolean" value="true"/>
<entry key="rightIncludeAll" type="xboolean" value="true"/>
<entry key="rmLeftJoinCols" type="xboolean" value="true"/>
<entry key="rmRightJoinCols" type="xboolean" value="true"/>
<entry key="maxOpenFiles" type="xint" value="200"/>
<entry key="rowKeySeparator" type="xstring" value="_"/>
<entry key="enableHiLite" type="xboolean" value="false"/>
<entry key="numBitsInitial" type="xint" value="6"/>
<entry key="numBitsMaximal" type="xint" value="32"/>
<entry key="usedMemoryThreshold" type="xdouble" value="0.85"/>
<entry key="minAvailableMemory" type="xlong" value="10000000"/>
<entry key="memUseCollectionUsage" type="xboolean" value="true"/>
<entry key="version" type="xstring" value="version_3"/>
</config>
<config key="nodeAnnotation">
<entry key="text" type="xstring" value="join by exactly matching%%00010name forms"/>
<entry key="bgcolor" type="xint" value="16777215"/>
<entry key="x-coordinate" type="xint" value="885"/>
<entry key="y-coordinate" type="xint" value="379"/>
<entry key="width" type="xint" value="151"/>
<entry key="height" type="xint" value="38"/>
<entry key="alignment" type="xstring" value="CENTER"/>
<entry key="borderSize" type="xint" value="0"/>
<entry key="borderColor" type="xint" value="16777215"/>
<entry key="defFontSize" type="xint" value="11"/>
<entry key="annotation-version" type="xint" value="20151123"/>
<config key="styles"/>
</config>
<entry key="customDescription" type="xstring" isnull="true" value=""/>
<entry key="state" type="xstring" value="IDLE"/>
<entry key="factory" type="xstring" value="org.knime.base.node.preproc.joiner.Joiner2NodeFactory"/>
<entry key="node-name" type="xstring" value="Joiner"/>
<entry key="node-bundle-name" type="xstring" value="KNIME Base Nodes"/>
<entry key="node-bundle-symbolic-name" type="xstring" value="org.knime.base"/>
<entry key="node-bundle-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-bundle-version" type="xstring" value="4.1.0.v201912041211"/>
<entry key="node-feature-name" type="xstring" value="KNIME Core"/>
<entry key="node-feature-symbolic-name" type="xstring" value="org.knime.features.base.feature.group"/>
<entry key="node-feature-vendor" type="xstring" value="KNIME AG, Zurich, Switzerland"/>
<entry key="node-feature-version" type="xstring" value="4.1.0.v201912041824"/>
<config key="factory_settings"/>
<entry key="name" type="xstring" value="Joiner"/>
<entry key="hasContent" type="xboolean" value="false"/>
<entry key="isInactive" type="xboolean" value="false"/>
<config key="ports">
<config key="port_1">
<entry key="index" type="xint" value="1"/>
<entry key="port_dir_location" type="xstring" isnull="true" value=""/>
</config>
</config>
<config key="filestores">
<entry key="file_store_location" type="xstring" isnull="true" value=""/>
<entry key="file_store_id" type="xstring" isnull="true" value=""/>
</config>
</config>