This repository has been archived by the owner on Sep 12, 2022. It is now read-only.

Insuraquest

A school project for the course Programming Project 2, Group 2, Professional Bachelor "Applied IT" at Erasmushogeschool Brussel.

Search engine application for insurance documents. Insurance companies store documents on legislation, jurisprudence and legal doctrine in their particular field. The goal is to give employees an easy-to-use search engine application built on the search algorithms of Elasticsearch.

Features

  • Registration and login of users
  • Management of roles and authorisations: guest, user, librarian, admin
  • Upload of documents in .pdf, .png or .jpeg format to a server location
  • At upload, the librarian enters all metadata (tags) for the document
  • Documents are picked up by fsCrawler, converted to json and presented, together with all the manual metadata, to the Elasticsearch stack for indexing
  • Elasticsearch stores all documents on index "insuraquest" in json format
  • All users (except for guests) can perform full text searches on content and add filters based on criteria such as language, issuer, insurance type, etc.
  • Search results are shown in order of relevance (highest scores on top); highlighting renders only the matching fragments of the content
  • Full reading of the document is only a click away.
  • Modification of tags
  • Tables related to users and authorisations as well as metadata options are stored in a SQL database

Components

Component          Version
Linux              Ubuntu
FsCrawler
Elasticsearch      6.8
Elasticsearch-PHP  6.7
PHP                7.4.10 (cli)
Composer           2.0.6
Laravel Installer  4.1.0
MySQL

Note: fsCrawler vXXXX is only compatible with Elasticsearch 6.8. Consequently, require the Elasticsearch-PHP v6.7 package in your Laravel composer.json file.

Documentation

Full documentation can be found here. Docs are stored within the repo under /docs/, so if you see a typo or problem, please submit a PR to fix it!

We also provide a code example generator for PHP in the util/GenerateDocExamples.php script. This command parses the util/alternative_report.spec.json file produced from this JSON specification and generates a PHP example for each digest value. The examples are stored in asciidoc format under the docs/examples folder.

Installation via Composer

The recommended method to install Elasticsearch-PHP is through Composer.

  1. Add elasticsearch/elasticsearch as a dependency in your project's composer.json file (change version to suit your version of Elasticsearch, for instance for ES 7.0):

        {
            "require": {
                "elasticsearch/elasticsearch": "^7.0"
            }
        }
  2. Download and install Composer:

        curl -s https://getcomposer.org/installer | php
  3. Install your dependencies:

        php composer.phar install
  4. Require Composer's autoloader

    Composer also prepares an autoload file that's capable of autoloading all the classes in any of the libraries that it downloads. To use it, just add the following line to your code's bootstrap process:

        <?php
    
        use Elasticsearch\ClientBuilder;
    
        require 'vendor/autoload.php';
    
        $client = ClientBuilder::create()->build();

You can find out more on how to install Composer, configure autoloading, and other best-practices for defining dependencies at getcomposer.org.

PHP Version Requirement

Version 7.0 of this library requires at least PHP version 7.1. In addition, it requires the native JSON extension to be version 1.3.7 or higher.

Elasticsearch-PHP Branch  PHP Version
7.0                       >= 7.1.0
6.0                       >= 7.0.0
5.0                       >= 5.6.6
2.0                       >= 5.4.0
0.4, 1.0                  >= 5.3.9

Authorisation and Authentication

Authentication

Since InsuraQuest uses Laravel Jetstream, it includes login, registration, email verification, two-factor authentication and session management out of the box. Jetstream uses Laravel Fortify, which is a front-end agnostic authentication backend for Laravel.

The config/fortify.php configuration file lets you customize these features and choose which of them to enable in your project.

The logic executed on authentication requests can be found and modified in App\Actions\Fortify.

More info and documentation on Jetstream can be found on the Jetstream website.

Authorisation

InsuraQuest implements authorisation through the attribute 'type', which is included in each user instance. There are four types: guest, user, librarian and admin. Types cascade: each level has the permissions of the level below plus additional permissions.
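
The cascading behaviour can be sketched as an ordered list of types, where a check passes when the user's type ranks at least as high as the required one. This is a minimal illustration, not the project's actual code: the constant `TYPE_ORDER` and the helper `typeAtLeast` are hypothetical names.

```php
<?php

// Types ordered from least to most privileged (order taken from the text above).
const TYPE_ORDER = ['guest', 'user', 'librarian', 'admin'];

/**
 * Hypothetical helper: true when $userType has at least
 * the permissions of $requiredType.
 */
function typeAtLeast(string $userType, string $requiredType): bool
{
    $userRank = array_search($userType, TYPE_ORDER, true);
    $requiredRank = array_search($requiredType, TYPE_ORDER, true);

    // Unknown types never pass the check.
    if ($userRank === false || $requiredRank === false) {
        return false;
    }

    return $userRank >= $requiredRank;
}

var_dump(typeAtLeast('librarian', 'user')); // bool(true)
var_dump(typeAtLeast('guest', 'user'));     // bool(false)
```

A check like this could back a route middleware or a gate, so that each route only names the lowest type allowed to use it.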

Authorisation is enforced on the different routes (web.php). On mixed views, it is also enforced at view level using Laravel's native @can and @cannot directives.

Adjusting types

Types can be adjusted in the database directly, or on the 'user administration' page when you are signed in with an admin account.

Types

A visitor who is not yet signed in is redirected to the login screen. By default, a newly registered user is assigned the type 'guest'. A guest can see the landing page and the documentation, but cannot query any documents. A user can query documents, open them and mail them. A librarian can upload new files, delete files and change the tags on them. An admin can view all the users and their information, and adjust their type.

Back-end - deployment

Description of deployment set-up.

Elasticsearch

  • v. 6.8.13
  • 1 shard
  • 1 replica
  • single node
  • analyzer: fscrawler_path
  • production index: insuraquest
  • custom fields in index mapping:
    "external": {
        "properties": {
            "title": {
                "type": "text"
            },
            "language": {
                "type": "keyword"
            },
            "date_published": {
                "type": "date"
            },
            "issuer": {
                "type": "keyword"
            },
            "category": {
                "type": "keyword"
            },
            "tag": {
                "type": "keyword"
            }
        }
    }
  • insuraquest index created on first run of FSCrawler
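
A search that combines full-text matching with filters on these custom fields could be built as sketched below. It rests on assumptions: the function name `buildSearchParams` is hypothetical, `content` is assumed to be the field in which FSCrawler stores the extracted text, and the filter values are made-up examples. The resulting array is meant to be passed to `$client->search()`.

```php
<?php

/**
 * Hypothetical helper: builds search parameters for the insuraquest index.
 * $filters maps a keyword field under "external" (language, issuer,
 * category, tag) to the exact value it must have.
 */
function buildSearchParams(string $text, array $filters): array
{
    $filterClauses = [];
    foreach ($filters as $field => $value) {
        // Keyword fields are matched exactly with a term query.
        $filterClauses[] = ['term' => ["external.$field" => $value]];
    }

    return [
        'index' => 'insuraquest',
        'body'  => [
            'query' => [
                'bool' => [
                    // Full-text match on the document content (scored).
                    'must'   => [['match' => ['content' => $text]]],
                    // Filters narrow the result set without affecting the score.
                    'filter' => $filterClauses,
                ],
            ],
            // An empty object asks for default highlighting on "content".
            'highlight' => [
                'fields' => ['content' => new stdClass()],
            ],
        ],
    ];
}

$params = buildSearchParams('liability', ['language' => 'nl', 'category' => 'legislation']);
echo json_encode($params['body'], JSON_PRETTY_PRINT), PHP_EOL;
```

Putting the tag criteria in the `filter` clause keeps relevance scoring driven by the text match alone.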

Kibana

  • v. 6.8.13

FSCrawler

  • v. 6-2.6
  • utility has been converted into a systemd unit so it can run as a service -> /etc/systemd/system/fscrawler.service
  • utility run by dedicated user fscrawler
  • analyzer of FSCrawler makes use of Apache Tika to parse and tokenize binary documents, including PDFs
  • fscrawler exposes a REST API running at http://127.0.0.1:8080/fscrawler
  • custom fields added to mapping defined under /home/student/.fscrawler/_default/6/_settings.json

LEMP stack

Considered more robust than Laravel's built-in development server.

Nginx

  • Configuration file -> /etc/nginx/sites-available/default:
server {
    listen 80;
    server_name 10.3.50.7;
    root /var/www/insuraquest_production/insuraquest/public;

    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-XSS-Protection "1; mode=block";
    add_header X-Content-Type-Options "nosniff";

    index index.php;

    charset utf-8;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location = /favicon.ico { access_log off; log_not_found off; }
    location = /robots.txt  { access_log off; log_not_found off; }

    error_page 404 /index.php;

    location ~ \.php$ {
        fastcgi_pass unix:/var/run/php/php7.4-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        include fastcgi_params;
    }

    location ~ /\.(?!well-known).* {
        deny all;
    }
}

MySQL

  • Default configuration
  • Populated after deployment of git repo with:
php artisan migrate:fresh --seed

Git deployment

  • Deployment via bare git repo living under /home/student/insuraquest/bare_project.git
git init --bare /home/student/insuraquest/bare_project.git
  • A post-receive hook checks out pushed changes into the working directory under /var/www/insuraquest_production
#!/bin/bash

# check out the files
git --work-tree=/var/www/insuraquest_production --git-dir=/home/student/insuraquest/bare_project.git checkout -f

  • Make the hook executable
chmod +x /path/to/bare_project.git/hooks/post-receive
  • Configuration of local repo to push to the server
git remote add live '[email protected]:/home/student/insuraquest/bare_project.git'
git push --set-upstream live main

Laravel

  • After project is pushed to /var/www/insuraquest_production, composer update is called to update all dependencies:
composer update 
  • Set the group ownership of /var/www/insuraquest_production to www-data to grant Nginx read and execute permissions.
chgrp -R www-data insuraquest_production
  • Fix broken storage symbolic links
php artisan storage:link

Quickstart

Index a document

In elasticsearch-php, almost everything is configured by associative arrays. The REST endpoint, document and optional parameters - everything is an associative array.

To index a document, we need to specify three pieces of information: index, id and a document body. This is done by constructing an associative array of key:value pairs. The request body is itself an associative array with key:value pairs corresponding to the data in your document:

$params = [
    'index' => 'my_index',
    'id'    => 'my_id',
    'body'  => ['testField' => 'abc']
];

$response = $client->index($params);
print_r($response);

The response that you get back indicates the document was created in the index that you specified. The response is an associative array containing a decoded version of the JSON that Elasticsearch returns:

Array
(
    [_index] => my_index
    [_type] => _doc
    [_id] => my_id
    [_version] => 1
    [result] => created
    [_shards] => Array
        (
            [total] => 1
            [successful] => 1
            [failed] => 0
        )

    [_seq_no] => 0
    [_primary_term] => 1
)

Get a document

Let's get the document that we just indexed. This will simply return the document:

$params = [
    'index' => 'my_index',
    'id'    => 'my_id'
];

$response = $client->get($params);
print_r($response);

The response contains some metadata (index, version, etc.) as well as a _source field, which is the original document that you sent to Elasticsearch.

Array
(
    [_index] => my_index
    [_type] => _doc
    [_id] => my_id
    [_version] => 1
    [_seq_no] => 0
    [_primary_term] => 1
    [found] => 1
    [_source] => Array
        (
            [testField] => abc
        )

)

If you want to retrieve the _source field directly, there is the getSource method:

$params = [
    'index' => 'my_index',
    'id'    => 'my_id'
];

$source = $client->getSource($params);
print_r($source);

The response will be just the _source value:

Array
(
    [testField] => abc
)

Search for a document

Searching is a hallmark of Elasticsearch, so let's perform a search. We are going to use the Match query as a demonstration:

$params = [
    'index' => 'my_index',
    'body'  => [
        'query' => [
            'match' => [
                'testField' => 'abc'
            ]
        ]
    ]
];

$response = $client->search($params);
print_r($response);

The response is a little different from the previous responses. We see some metadata (took, timed_out, etc.) and an array named hits. This represents your search results. Inside of hits is another array named hits, which contains individual search results:

Array
(
    [took] => 33
    [timed_out] =>
    [_shards] => Array
        (
            [total] => 1
            [successful] => 1
            [skipped] => 0
            [failed] => 0
        )

    [hits] => Array
        (
            [total] => Array
                (
                    [value] => 1
                    [relation] => eq
                )

            [max_score] => 0.2876821
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_index] => my_index
                            [_type] => _doc
                            [_id] => my_id
                            [_score] => 0.2876821
                            [_source] => Array
                                (
                                    [testField] => abc
                                )

                        )

                )

        )

)
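
Reading the results out of that structure can be sketched as follows. The `$response` array is a trimmed, hard-coded copy of the output above so the snippet runs without a cluster, and the helper name `extractHits` is hypothetical.

```php
<?php

/**
 * Hypothetical helper: flattens a search response into a list of
 * id / score / source triples. Returns an empty list when there are no hits.
 */
function extractHits(array $response): array
{
    $results = [];
    // The individual results live in the inner "hits" array.
    foreach ($response['hits']['hits'] ?? [] as $hit) {
        $results[] = [
            'id'     => $hit['_id'],
            'score'  => $hit['_score'],
            'source' => $hit['_source'],
        ];
    }

    return $results;
}

// Trimmed copy of the search response shown above.
$response = [
    'hits' => [
        'total'     => ['value' => 1, 'relation' => 'eq'],
        'max_score' => 0.2876821,
        'hits'      => [
            [
                '_index'  => 'my_index',
                '_id'     => 'my_id',
                '_score'  => 0.2876821,
                '_source' => ['testField' => 'abc'],
            ],
        ],
    ],
];

foreach (extractHits($response) as $result) {
    printf("%s (score %.4f): %s\n", $result['id'], $result['score'], json_encode($result['source']));
}
// prints: my_id (score 0.2877): {"testField":"abc"}
```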

Delete a document

Alright, let's go ahead and delete the document that we added previously:

$params = [
    'index' => 'my_index',
    'id'    => 'my_id'
];

$response = $client->delete($params);
print_r($response);

You'll notice this is identical syntax to the get syntax. The only difference is the operation: delete instead of get. The response will confirm the document was deleted:

Array
(
    [_index] => my_index
    [_type] => _doc
    [_id] => my_id
    [_version] => 2
    [result] => deleted
    [_shards] => Array
        (
            [total] => 1
            [successful] => 1
            [failed] => 0
        )

    [_seq_no] => 1
    [_primary_term] => 1
)

Delete an index

Due to the dynamic nature of Elasticsearch, the first document we added automatically built an index with some default settings. Let's delete that index because we want to specify our own settings later:

$deleteParams = [
    'index' => 'my_index'
];
$response = $client->indices()->delete($deleteParams);
print_r($response);

The response:

Array
(
    [acknowledged] => 1
)

Create an index

Now that we are starting fresh (no data or index), let's add a new index with some custom settings:

$params = [
    'index' => 'my_index',
    'body'  => [
        'settings' => [
            'number_of_shards' => 2,
            'number_of_replicas' => 0
        ]
    ]
];

$response = $client->indices()->create($params);
print_r($response);

Elasticsearch will now create that index with your chosen settings, and return an acknowledgement:

Array
(
    [acknowledged] => 1
)

Upload a document

A librarian can upload new files and add tags to the uploaded document. The options for the tags are pulled from a MySQL table and added to the form.

  • Tags: Title, Language, Date Published, Issuer, Category, Keyword.
  • The librarian must enter all of these values to upload a document.
  • A file can be uploaded, which must be a PDF of at most 2048 KB.
  • A document is required for upload.
FileUploadController.php

$this->validate($request, [
    'title' => 'required',
    'language' => 'required',
    'date' => 'required|date',
    'issuer' => 'required',
    'category' => 'required',
    'tag' => 'required',
    'file' => 'required|mimes:pdf|max:2048'
]);

When a document is uploaded, the file and tags are posted to FSCrawler, which parses the document and indexes it into our Elasticsearch node.

FileUploadController.php

$file = $request->file('file');
$pathname = $file->store('public');
$fully_qualified_pathname = storage_path('app/' . $pathname);
$client = new Client();
try {
    // The request options (file and tags) are left out of this excerpt.
    $client->request('POST', 'http://127.0.0.1:8080/fscrawler/_upload',
    );
} catch (GuzzleException $e) {
    echo $e;
}
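
The request options are not shown in the snippet above. One way the multipart payload could be assembled is sketched below, assuming FSCrawler's `_upload` endpoint accepts the document as a `file` part and the extra metadata as a JSON `tags` part, nested under `external` to match the index mapping. The function name `buildUploadOptions` and the sample values are hypothetical. The returned array is meant to be passed as the third argument of Guzzle's `request()`.

```php
<?php

/**
 * Hypothetical helper: builds Guzzle "multipart" options for an upload.
 * $fileContents may be a string or an open stream resource.
 */
function buildUploadOptions($fileContents, string $filename, array $tags): array
{
    return [
        'multipart' => [
            [
                // The document itself.
                'name'     => 'file',
                'contents' => $fileContents,
                'filename' => $filename,
            ],
            [
                // Manual metadata, assumed to be sent as a JSON document
                // whose "external" object matches the custom mapping.
                'name'     => 'tags',
                'contents' => json_encode(['external' => $tags]),
                'filename' => 'tags.json',
            ],
        ],
    ];
}

$options = buildUploadOptions('%PDF-1.4 ...', 'policy.pdf', [
    'title'    => 'Example policy',
    'language' => 'en',
]);
echo $options['multipart'][1]['contents'], PHP_EOL;
// prints: {"external":{"title":"Example policy","language":"en"}}
```

In the controller, the first argument would typically be a stream such as `fopen($fully_qualified_pathname, 'r')` rather than a string.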

A plugin is added for form layout -> tailwind.config.js
https://tailwindcss-custom-forms.netlify.app/

 plugins: [
        require('@tailwindcss/custom-forms'),
      ]

Mail a document

After a search, a user can view more details on any of the results.
From the detail view, the user can edit, delete or mail the PDF shown.

Files modified or created for the mail functionality:

  • MailController.php
  • EmailInsuraquest.php
  • insuraEmail.blade.php
  • web.php

Commands used: the Laravel Markdown Mailable class is used for creating emails.

 php artisan make:mail EmailInsuraquest --markdown=Email.insuraEmail

The mail controller holds the logic for the mail functionality. Run this command to create the controller:

 php artisan make:controller MailController

The email function can be tested at http://localhost:8000/send-email, which sends a mail to Mailtrap (account Bart).

Todo: integrate the mail functionality into the single search result view

Unit testing using a mock Elastic client

use GuzzleHttp\Ring\Client\MockHandler;
use Elasticsearch\ClientBuilder;

// The connection class requires 'body' to be a file stream handle
// Depending on what kind of request you do, you may need to set more values here
$handler = new MockHandler([
  'status' => 200,
  'transfer_stats' => [
     'total_time' => 100
  ],
  'body' => fopen('somefile.json', 'r'),
  'effective_url' => 'localhost'
]);
$builder = ClientBuilder::create();
$builder->setHosts(['somehost']);
$builder->setHandler($handler);
$client = $builder->build();
// Do a request and you'll get back the 'body' response above

Wrap up

That was just a crash-course overview of the client and its syntax. If you are familiar with Elasticsearch, you'll notice that the methods are named just like REST endpoints.

You'll also notice that the client is configured in a manner that facilitates easy discovery via the IDE. All core actions are available under the $client object (indexing, searching, getting, etc.). Index and cluster management are located under the $client->indices() and $client->cluster() objects, respectively.

Check out the rest of the Documentation to see how the entire client works.

License

Please note that this project is for use within the school context. For further development, please contact te
The user may choose which license they wish to use. Since there is no discriminating executable or distribution bundle to differentiate licensing, the user should document their license choice externally, in case the library is re-distributed. If no explicit choice is made, assumption is that redistribution obeys rules of both licenses.