Saffron is a tool for multi-stage analysis of text corpora using state-of-the-art natural language processing technology. Saffron consists of a set of self-contained, independent modules, each of which provides a distinct analysis of text. These modules are as follows:
- Corpus Indexing: Analyses raw text documents in various formats and indexes them for later components
- Term Extraction: Extracts keyphrases that are the terms of each single document in a collection
- Concept Consolidation: Detects and removes variations from the list of terms of each document
- Author Consolidation: Detects and removes name variations from the list of authors of each document
- DBpedia Lookup: Links terms extracted from a document to URLs on the Semantic Web
- Author Connection: Associates authors with terms from the documents and identifies the importance of the term to each author
- Term Similarity: Measures the relevance of each term to each other term
- Author Similarity: Measures the relevance of each author to each other author
- Taxonomy Extraction: Organizes the terms into a single hierarchical graph that allows for easy browsing of the corpus and deep insights
- RDF Extraction: Creates a knowledge graph
Saffron requires Apache Maven to run. If you use the Web Interface, MongoDB is also needed to store the data. Both must be installed before running Saffron:
- Install Maven
- Install MongoDB (use the default settings)
- Run the following script to obtain the resources on which Saffron depends:
./install.sh
- To build the dependencies Saffron requires, use the following command:
mvn clean install
- Start a MongoDB session by typing 'mongod' in a terminal. MongoDB must be running for Saffron to operate.
- The file saffron-web.sh contains settings such as the database name and the host and port it runs on. To change the database name (saffron_test by default), edit saffron-web.sh and change the line:
export MONGO_DB_NAME=saffron_test
To change the MongoDB host and port, edit the following lines in the same file:
export MONGO_URL=localhost
export MONGO_PORT=27017
By default, all results are stored in the MongoDB database and the JSON files are also generated in /web/data/. To store the results in the MongoDB database only, set the following line to false:
export STORE_LOCAL_COPY=true
- To start the Saffron Web server, choose a directory for Saffron to create the models in and run the following command:
./saffron-web.sh
- Then open the following URL in a browser to access the Web Interface
See the Wiki for more details on how to use the Web Interface
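Since the Web Interface needs a running MongoDB instance, it can help to confirm that the configured host and port accept connections before starting saffron-web.sh. The following is a minimal sketch, not part of Saffron: the function names `mongo_endpoint` and `mongo_is_reachable` are hypothetical, and it assumes the MONGO_URL and MONGO_PORT variables are either exported in the current environment or left at the defaults shown above. It only tests that a TCP connection can be opened; it does not verify that the listening service is actually MongoDB.

```python
import os
import socket


def mongo_endpoint(env=None):
    """Return (host, port) from MONGO_URL / MONGO_PORT, falling back to the defaults."""
    env = os.environ if env is None else env
    host = env.get("MONGO_URL", "localhost")
    port = int(env.get("MONGO_PORT", "27017"))
    return host, port


def mongo_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host could not be resolved.
        return False


if __name__ == "__main__":
    host, port = mongo_endpoint()
    state = "reachable" if mongo_is_reachable(host, port) else "not reachable"
    print(f"MongoDB at {host}:{port} is {state}")
```

If the check reports that the server is not reachable, start 'mongod' first or correct the host and port in saffron-web.sh.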
All steps of Saffron can be executed by running the saffron.sh script, without using the Web Interface. This script takes three arguments:
- The corpus, which may be
  - A folder containing files in TXT, DOC or PDF
  - A zip file containing files in TXT, DOC or PDF
  - A JSON metadata file describing the corpus (see Saffron Formats for more details on the format of the file)
- The output folder to which the results are written
- The configuration file (as described in Saffron Formats)
For example
./saffron.sh corpus.json output/ config.json
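When the corpus is passed as a zip file, the archive can be assembled from a folder of documents. The sketch below is illustrative only and not part of Saffron: `zip_corpus` and `CORPUS_EXTENSIONS` are hypothetical names, and it assumes that placing the files flat at the root of the archive is acceptable (the expected archive layout is not described here, so check Saffron Formats if in doubt). Only files directly inside the folder are included; subfolders are ignored.

```python
import zipfile
from pathlib import Path

# Extensions matching the corpus formats listed above (TXT, DOC, PDF).
CORPUS_EXTENSIONS = {".txt", ".doc", ".pdf"}


def zip_corpus(folder, zip_path):
    """Write the TXT, DOC and PDF files found directly in `folder` to `zip_path`.

    Returns the sorted list of file names added to the archive.
    """
    folder = Path(folder)
    names = sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in CORPUS_EXTENSIONS
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            # Store each file at the archive root under its own name.
            archive.write(folder / name, arcname=name)
    return names
```

For instance, after calling `zip_corpus("papers", "corpus.zip")` (where "papers" is a placeholder folder name), the resulting archive can be given to saffron.sh as the first argument in place of corpus.json.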
If the Web Interface is used with STORE_LOCAL_COPY set to true, or Saffron is run from the command line, the following files are generated and stored in /web/data/ (see Saffron Formats for more details on each file):
- terms.json: The terms with weights
- doc-terms.json: The document term map with weights
- author-terms.json: The connection between authors and terms
- author-sim.json: The author-author similarity graph
- term-sim.json: The term-term similarity graph
- taxonomy.json: The final taxonomy over the corpus
- config.json: The configuration file for the run
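After a run, a quick way to confirm that every stage produced its result is to check that each of the files listed above is present. The following is a small sketch under stated assumptions: `missing_outputs` and `EXPECTED_OUTPUTS` are hypothetical names, not part of Saffron, and the check only tests for file presence, not for valid content.

```python
from pathlib import Path

# File names taken from the list of generated files above.
EXPECTED_OUTPUTS = [
    "terms.json",
    "doc-terms.json",
    "author-terms.json",
    "author-sim.json",
    "term-sim.json",
    "taxonomy.json",
    "config.json",
]


def missing_outputs(output_dir):
    """Return the expected output file names that are absent from `output_dir`."""
    output_dir = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (output_dir / name).is_file()]
```

An empty list from `missing_outputs(...)` means all seven files exist in the given folder; any returned names point to the stage whose result is missing.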
To create a .dot file for the generated taxonomy, you can use the following command:
python taxonomy-to-dot.py taxonomy.json > taxonomy.dot
If you have results from Saffron version 3.3, you will need to do the following to make them compatible with version 3.4:
Before starting Saffron, edit the following file:
upgrade3.3To3.4.sh
and change the following configurations to reflect the database you want to upgrade:
export MONGO_URL=localhost
export MONGO_PORT=27017
export MONGO_DB_NAME=saffron_test
Run the script by executing:
./upgrade3.3To3.4.sh
Full details of the configuration can be found in the JavaDoc.
The JavaDoc is available at https://saffron.pages.insight-centre.org/saffron/
The Wiki gives more details on the approach of Saffron and how to use the Web Interface (https://gitlab.insight-centre.org/saffron/saffron/-/wikis/home)
FORMATS.md gives an exhaustive description of the input and output generated by Saffron
For full API documentation, see Saffron API Documentation