Skip to content

A lightweight, configurable tool for indexing metadata into solr.

License

Notifications You must be signed in to change notification settings

dchandekstark/solrizer

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

solrizer

A lightweight, configurable tool for indexing metadata into solr. Can be triggered from within your application, from the command line, or as a JMS listener.

Solrizer provides the baseline and structures for the process of solrizing. In order to actually read objects from a
datasource and write solr documents into a solr instance, you need to use an implementation specific gem, such as
solrizer-fedora, which provides the mechanics for reading from a fedora repository and writing to a solr instance.

Installation

The gem is hosted on rubygems.org. The best way to manage the gems for your project is to use bundler. Create a Gemfile in the root of your application and include the following:

source "http://rubygems.org"

gem 'solrizer'

Then:

bundle install

Usage

Fire up the console:

The code snippets in the following sections can be cut/pasted into your console, giving you the opportunity to play with Solrizer.

Start up a console and load solrizer:

irb
require "rubygems"
require "solrizer"

Field Mapper

The FieldMapper maps term names and values to Solr fields, based on the term’s data type and any index_as options. Solrizer comes with default mappings (which are defined in the config/solr_mappings.yml) to dynamic field types defined in the Hydra Solr schema.xml file. A copy of that is available :
https://github.com/projecthydra/hydra-head/blob/master/hydra-core/lib/generators/hydra/templates/solr_conf/conf/schema.xml

More information on the conventions followed for the dynamic solr fields is here:
https://github.com/projecthydra/hydra-head/wiki/Solr-Schema

default_mapper = Solrizer::FieldMapper::Default.new

# some of the default mappings in solrizer
default_mapper.solr_name("foo",:string,:searchable)     # returns foo_tesim
default_mapper.solr_name("foo",:date,:searchable)       # returns foo_dtsim
default_mapper.solr_name("foo",:integer,:searchable     # returns foo_isim
default_mapper.solr_name("foo",:string,:facetable)      # returns foo_sim
default_mapper.solr_name("foo",:integer,:facetable)     # returns foo_iim
default_mapper.solr_name("foo",:string,:sortable)       # returns foo_si
default_mapper.solr_name("foo",:string,:displayable)    # returns foo_ssm
  1. Using default indexing strategies
solr_doc = {}
Solrizer.insert_field(solr_doc, 'title', 'whatever', :searchable) 
=> {"title_tesim"=>["whatever"]}

Solrizer.insert_field(solr_doc, 'pub_date', 'Nov 2012', :sortable, :displayable) 
=> {"title_tesim"=>["whatever"], "pub_date_ssi"=>["Nov 2012"], "pub_date_ssm"=>["Nov 2012"]}
  1. You can also index dates
  • as a date
    solr_doc = {}
    Solrizer.insert_field(solr_doc, ‘pub_date’, Date.parse(‘Nov 7th 2012’), :searchable)
    => {"pub_date_dtsi"=>[“2012-11-07T00:00:00Z”]}
    1. or as a string
      solr_doc = {}
      Solrizer.insert_field(solr_doc, ‘pub_date’, Date.parse(‘Nov 7th 2012’), :sortable, :displayable)
      => {"pub_date_ssi"=>[“2012-11-07”], “pub_date_ssm”=>[“2012-11-07”]}
    1. or a string that is stored as a date
      solr_doc = {}
      Solrizer.insert_field(solr_doc, ‘pub_date’, ‘Jan 29th 2013’, :dateable)
      => {"pub_date_dtsi"=>[“2013-01-29T00:00:00Z”]}
    1. Using a custom indexing strategy
      All you have to do is create your own index descriptor:

      solr_doc = {}
      displearchable = Solrizer::Descriptor.new(:integer, :indexed, :stored)
      Solrizer.insert_field(solr_doc, ‘some_count’, 45, displearchable)
      {"some_count_isi"=>[“45”]}
    1. Changing the behavior of a default descriptor

    Simply override the methods within Solrizer::DefaultDescriptors

    1. before
      solr_doc = {}
      Solrizer.insert_field(solr_doc, ‘title’, ‘foobar’, :facetable)
      => {"title_sim"=>[“foobar”]}
    1. redefine facetable:
      module Solrizer
      module DefaultDescriptors
      def self.facetable
      Descriptor.new(:string, :indexed, :stored)
      end
      end
      end
    1. after
      solr_doc = {}
      Solrizer.insert_field(solr_doc, ‘title’, ‘foobar’, :facetable)
      => {"title_ssi"=>[“foobar”]}
    1. Creating your own Indexers

      module MyMappers
      def self.mapper_one
      Solrizer::Descriptor.new(:string, :indexed, :stored)
      end
      end

    solr_doc = {}

    Solrizer::FieldMapper.descriptors = [MyMappers]
    => [MyMappers]

    Solrizer.insert_field(solr_doc, ‘title’, ‘foobar’, :mapper_one)
    => {"title_ssi"=>[“foobar”]}

    1. Using OM
      Same as it ever was:

      t.main_title(:index_as=>[:facetable],:path=>"title", :label=>"title") { … }

    But now you may also pass an Descriptor instance if that works for you:


    indexer = Solrizer::Descriptor.new(:integer, :indexed, :stored)
    t.main_title(:index_as=>[indexer],:path=>"title", :label=>"title") { … }

    Extractor and Extractor Mixins

    Solrizer::Extractor provides utilities for extracting solr fields from objects or inserting solr fields into documents:

    extractor = Solrizer::Extractor.new
    
    extractor.format_node_value(["foo     ","\n      bar"])                     # returns "foo bar"
    
    solr_doc = Hash.new
    extractor.insert_solr_field_value(solr_doc, "foo","bar")         # solr_doc is now {"foo" => ["bar"]}
    extractor.insert_solr_field_value(solr_doc,"foo","baz")    # solr_doc is now {"foo" => ["bar","baz"]}
    extractor.insert_solr_field_value(solr_doc, "boo","hoo")         # solr_doc is now {"foo" => ["bar","baz"], "boo" => ["hoo"]}
    

    Solrizer provides some default mixins:

    • Solrizer::HTML::Extractor -=> provides html_to_solr method
    • Solrizer::XML::Extractor -=> provides xml_to_solr method
    xml = "<fields><foo>bar</foo><bar>baz</bar></fields>"
    
    extractor.xml_to_solr(xml)      # returns {:foo_tesim=>"bar", :bar_tesim=>"baz"}
    

    Solrizer::XML::TerminologyBasedSolrizer

    Another powerful mixin for use with classes that include the OM::XML::Document module is Solrizer::XML::TerminologyBasedSolrizer.
    The methods provided by this module map provides a robust way of mapping terms and solr fields via om terminologies. A notable example
    can be found in ActiveFedora::NokogiriDatatstream.

    JMS Listener for Hydra Rails Applications

    The executables: solrizer and solrizerd

    The solrizer gem provides two executables:

    • solrizer is a stomp consumer which listens for fedora.apim.updates and solrizes (or de-solrizes) objects accordingly.
    • solrizerd is a wrapper script that spawns a daemonized version of solrizer and handles start|stop|restart|status requests.

    Usage

    The usage for solrizerd is as follows:

     solrizerd command --hydra_home PATH [options] 
    

    The commands are as follows:

    • start start an instance of the application
    • stop stop all instances of the application
    • restart stop all instances and restart them afterwards
    • status show status (PID) of application instances

    Required parameters:

    —hydra_home: this is the path to your hydra rails applications’ root directory. Solrizerd needs this in order to load all your models and corresponding terminoligies.

    The options:

    • -p, —port Stomp port 61613
    • -o, —host Host to connect to localhost
    • -u, —user User name for stomp listener
    • -w, —password Password for stomp listener
    • -d, —destination Topic to listen to (default: /topic/fedora.apim.update)
    • -h, —help Display this screen

    Note:

    Since the solrizer script must fire up your hydra rails application, it must have all the gems installed that your hydra instance needs.

    Note on Patches/Pull Requests

    • Fork the project.
    • Make your feature addition or bug fix.
    • Add tests for it. This is important so I don’t break it in a
      future version unintentionally.
    • Commit, do not mess with rake file, version, or history.
      (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
    • Send me a pull request. Bonus points for topic branches.

    Acknowledgements

    Technical Lead: Matt Zumwalt (MediaShelf)

    Thanks to

    Douglas Kim, who created the initial code base for Solrizer.
    Chris Fitzpatrick, who patiently ran the first prototype through its paces for weeks.
    Bess Sadler, who created the JMS integration for Solrizer, generously served as a sounding board for numerous design issues around solr indexing, and pushes the technology forward with the skill of a true engineer.

    Copyright

    Copyright © 2010 Matt Zumwalt. See LICENSE for details.

    About

    A lightweight, configurable tool for indexing metadata into solr.

    Resources

    License

    Stars

    Watchers

    Forks

    Packages

    No packages published

    Languages

    • Ruby 95.8%
    • Perl 4.2%