Recogito Similarity Scan

This repository contains experimental scripts that explore ways to measure 'similarity' between documents in Recogito. It is part of the Mellon-funded work plan for the Pelagios 7 project, specifically the work item 'enhancing discovery'.

Ruby Scripts

I wrote the first proof-of-concept scripts in Ruby (see the ruby folder). Run bundle install to install the dependencies before running them. All scripts assume that an instance of Recogito is running locally, but they don't write back to the DB.

There's no config file for the Ruby scripts. You may need to modify settings directly in the script code to match your environment.

Python Scripts

Because Python is already set up on the Recogito production server, I ported the scripts to Python and completed them there.

Pre-requisites

The scripts have a few dependencies (and sub-dependencies) for database/index access and text processing: SQLAlchemy, elasticsearch and textdistance.

$ pip install sqlalchemy
$ pip install psycopg2  # used by SQLAlchemy
$ pip install "textdistance[extras]"  # quoted so shells like zsh don't expand the brackets
$ pip install "textdistance[JaroWinkler]"
$ pip install elasticsearch==5.5.3 # 5.x required for Recogito - don't use newer ones!

Create a copy of config.ini.template named config.ini and modify it according to your DB settings.
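As a rough illustration of how such a config.ini could be turned into a database connection string, here is a minimal sketch using Python's standard configparser module. The section name (postgres) and the key names are assumptions made for this example, not taken from config.ini.template, so check the template for the real names.

```python
import configparser

# Hypothetical config.ini contents. The section and key names are
# assumptions for illustration; use the names from config.ini.template.
SAMPLE_CONFIG = """
[postgres]
host = localhost
port = 5432
database = recogito
user = recogito
password = secret
"""


def build_db_url(config_text):
    """Build a SQLAlchemy-style PostgreSQL URL from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    db = parser["postgres"]  # hypothetical section name
    return "postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}".format(
        user=db["user"],
        password=db["password"],
        host=db["host"],
        port=db["port"],
        database=db["database"],
    )


db_url = build_db_url(SAMPLE_CONFIG)
print(db_url)
```

In a real script you would likely call parser.read("config.ini") instead of read_string, and pass the resulting URL to sqlalchemy.create_engine. Passwords containing special characters would need URL-escaping first.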

Exploring the data

Handy SQL query to explore the raw data in the DB:

SELECT
  similarity.*,
  doc_a.title AS title_a,
  doc_b.title AS title_b
FROM similarity
JOIN document doc_a
  ON doc_a.id = similarity.doc_id_a
JOIN document doc_b
  ON doc_b.id = similarity.doc_id_b
WHERE similarity.entity_jaccard > 0;
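The entity_jaccard column name suggests a Jaccard similarity over the sets of entities found in each pair of documents. This README does not say how the scripts compute it, so the following is only a minimal sketch of the standard Jaccard index (size of the intersection divided by size of the union). The function name and the sample entity sets are hypothetical.

```python
def jaccard(entities_a, entities_b):
    """Return the Jaccard index of two collections of entity identifiers."""
    a, b = set(entities_a), set(entities_b)
    union = a | b
    if not union:
        # Two empty sets: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical entity sets for two documents (illustrative values only).
doc_a_entities = {"Athens", "Sparta", "Corinth"}
doc_b_entities = {"Athens", "Corinth", "Thebes", "Delphi"}

# 2 shared entities out of 5 distinct ones
score = jaccard(doc_a_entities, doc_b_entities)
print(score)
```

Under this definition, a score of 0 means the two documents share no entities, which is the case that the WHERE clause in the query above filters out.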
