
Ruby, elasticsearch upgrades, and documentation #211

Merged: 13 commits, Mar 23, 2023
2 changes: 1 addition & 1 deletion .ruby-version
@@ -1 +1 @@
-2.7.1
+3.1.2
28 changes: 27 additions & 1 deletion CHANGELOG.md
@@ -25,21 +25,47 @@ Versioning](https://semver.org/spec/v2.0.0.html).
### Security
-->

-## [Unreleased](https://github.com/CDRH/datura/compare/v0.2.0-beta...dev)
+## [1.0.0](https://github.com/CDRH/datura/compare/v0.2.0-beta...dev)

### Added
- minor test for Datura::Helpers.date_standardize
- documentation for web scraping
- documentation for CsvToEs (transforming CSV files and posting to elasticsearch)
- documentation for adding new ingest formats to Datura
- byebug gem for debugging
- instructions for installing Javascript Runtime files for Saxon
- API schema can either be 1.0 or 2.0 (which includes nested fields); 1.0 will be run by default unless 2.0 is specified. Add the following to `public.yml` or `private.yml` in the data repo:
```yaml
api_version: '2.0'
```
See new schema (2.0) documentation [here](https://github.com/CDRH/datura/docs/schema_v2.md).
- schema validation with API version 2.0; invalidly constructed documents will not post
- authentication with Elasticsearch 8.5; add the following to `public.yml` or `private.yml` in the data repo:
```yaml
es_user: username
es_password: ********
```
- field overrides for new fields in the new API schema
- functionality to transform EAD files and post them to elasticsearch
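
Taken together, a data repo opting into the 2.0 schema and a secured Elasticsearch instance might carry a config fragment like the following. This is a sketch only; the values are placeholders, and whether these keys sit at the top level or under an environment block depends on your repo's existing `public.yml`/`private.yml` layout:

```yaml
# public.yml or private.yml in the data repo (placeholder values)
api_version: '2.0'       # opt into the nested 2.0 schema; 1.0 runs by default
es_user: elastic         # only needed if Elasticsearch security is enabled
es_password: changeme
```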

### Changed
- update ruby to 3.1.2
- date_standardize now relies on strftime instead of manual zero padding for month and day
- minor corrections to documentation
- XPath: "text" is now ingested as an array and will be displayed delimited by spaces
- refactored command line methods into elasticsearch library
- refactored and moved date_standardize and date_display helper methods
- Nokogiri methods `get_text` and `get_list` on TEI now return nil rather than empty strings or arrays if there are no matches
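
Because `get_text` and `get_list` can now return nil, downstream field methods should guard before chaining string or array calls. A minimal sketch of the pattern (the `safe_join` helper is hypothetical, not part of Datura):

```ruby
# Hypothetical guard for the new nil-return contract of get_list:
# normalize nil to an empty array before joining.
def safe_join(list, delimiter = " ")
  (list || []).join(delimiter)
end

safe_join(nil)             # no XPath matches: returns ""
safe_join(["one", "two"])  # returns "one two"
```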

### Migration
- check to make sure "text" xpath is doing desired behavior
- use Elasticsearch 8.5 or higher and add authentication as described above if security is enabled. See [dev docs instructions](https://github.com/CDRH/cdrh_dev_docs/blob/update_elasticsearch_documentation/publishing/2_basic_requirements.md#downloading-elasticsearch).
- upgrade data repos to Ruby 3.1.2
- add api version to config as described above
- make sure fields are consistent with the api schema, many have been renamed or changed in format
- add nil checks with get_text and get_list methods
- add EadToEs overrides if ingesting EAD files
- if overriding the `read_csv` method in `lib/datura/file_type.rb`, the hash must be prefixed with ** (`**{}`).
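
The `**{}` requirement comes from Ruby 3 separating positional and keyword arguments: a bare trailing hash is no longer implicitly converted to keywords. A sketch of what an override might look like (the option values here are illustrative, not Datura's defaults):

```ruby
require "csv"

# Ruby 3 no longer converts a trailing hash to keyword arguments,
# so the options hash passed to CSV.read must be double-splatted.
def read_csv(file_location, options = {})
  CSV.read(file_location, **{
    headers: true,
    return_headers: true
  }.merge(options))
end
```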

## [v0.2.0-beta](https://github.com/CDRH/datura/compare/v0.1.6...v0.2.0-beta) - 2020-08-17 - Altering field and xpath behavior, adds get_elements

9 changes: 7 additions & 2 deletions Gemfile.lock
@@ -19,9 +19,13 @@ GEM
mime-types (3.4.1)
mime-types-data (~> 3.2015)
mime-types-data (3.2022.0105)
mini_portile2 (2.8.0)
minitest (5.16.3)
netrc (0.11.0)
nokogiri (1.13.8-x86_64-darwin)
nokogiri (1.13.9)
mini_portile2 (~> 2.8.0)
racc (~> 1.4)
nokogiri (1.13.9-x86_64-darwin)
racc (~> 1.4)
racc (1.6.0)
rake (13.0.6)
@@ -35,6 +39,7 @@ GEM
unf_ext (0.0.8.2)

PLATFORMS
ruby
x86_64-darwin-20

DEPENDENCIES
@@ -45,4 +50,4 @@ DEPENDENCIES
rake (~> 13.0)

BUNDLED WITH
-   2.2.26
+   2.2.33
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ Looking for information about how to post documents? Check out the

## Install / Set Up Data Repo

-Check that Ruby is installed, preferably 2.7.x or up. If you are using RVM, see the RVM section below.
+Check that Ruby is installed, preferably 3.1.2 or up. If you are using RVM, see the RVM section below.

If your project already has a Gemfile, add the `gem "datura"` line. If not, create a new directory and add a file named `Gemfile` (no extension).

2 changes: 1 addition & 1 deletion datura.gemspec
@@ -53,7 +53,7 @@ Gem::Specification.new do |spec|
]
spec.require_paths = ["lib"]

-spec.required_ruby_version = "~> 2.5"
+spec.required_ruby_version = "~> 3.1"
spec.add_runtime_dependency "colorize", "~> 0.8.1"
spec.add_runtime_dependency "nokogiri", "~> 1.10"
spec.add_runtime_dependency "rest-client", "~> 2.1"
5 changes: 5 additions & 0 deletions docs/1_setup/config.md
@@ -9,7 +9,10 @@
```yaml
default:
  collection:
    es_index
    es_path
    es_user
    es_password
```
(The options `es_user` and `es_password` are needed only if you are using a secured Elasticsearch index.)

If there are any settings which must be different based on the local environment (your computer vs the server), place these in `config/private.yml`.

@@ -118,6 +121,8 @@ Some stuff commonly in `private.yml`:
- `threads: 5` (5 recommended for PC, 50 for powerful servers)
- `es_path: http://localhost:9200`
- `es_index: some_index`
- `es_user: elastic` (if you want to use security on your local elasticsearch instance)
- `es_password: ******`
- `solr_path: http://localhost:8983/solr`
- `solr_core: collection_name`
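
Put together, a local `private.yml` for a secured local Elasticsearch instance might look like the following sketch. This assumes a `development` environment block and uses placeholder values throughout:

```yaml
development:
  threads: 5
  es_path: http://localhost:9200
  es_index: some_index
  es_user: elastic
  es_password: changeme
```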

2 changes: 1 addition & 1 deletion docs/1_setup/prepare_index.md
@@ -13,7 +13,7 @@
You will need to make sure that somewhere, the following are being set in your project:

### Step 2: Prepare Elasticsearch Index

-Make sure elasticsearch is installed and running in the location you wish to push to. If there is already an index you will be using, take note of its name and skip this step. If you want to add an index, run this command with a specified environment:
+Make sure elasticsearch is installed and running in the location you wish to push to. If there is already an index you will be using, take note of its name and skip this step. (Note that each index must be dedicated to data on one version of the API schema.) If you want to add an index, run this command with a specified environment:

```
admin_es_create_index -e development
```
2 changes: 1 addition & 1 deletion docs/4_developers/installation.md
@@ -6,7 +6,7 @@ TODO

### Elasticsearch

-TODO
+See installation instructions [here](https://github.com/CDRH/cdrh_dev_docs/blob/update_elasticsearch_documentation/publishing/2_basic_requirements.md#downloading-elasticsearch).

### Apache Permissions
