Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev/issue 8895 search api #89

Closed
wants to merge 14 commits into from
Closed

Conversation

mcantelon
Copy link
Member

Added new fields to File model and add logic to index files in stored AIPs using metsrw to parse file metadata from METS files. It's related to issue #8895.

@@ -0,0 +1,13 @@
from __future__ import absolute_import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not specify this as a pip dependency instead of vendoring it? Pip can install from git, and we can start working on getting it into PyPI. See https://github.com/artefactual/automation-tools/blob/dev/dataverse/requirements/base.txt#L4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks... that works slick.

@mistydemeo
Copy link
Contributor

One thing I'm wondering about - I'm not sure what the requirements were here. Is there any requirement to support clauses other than AND? E.g. to be able to specify NOT or OR clauses?

@Hwesta Hwesta added the Request: discussion The path towards resolving the issue is unclear and opinion is sought from other community members. label Nov 18, 2015
@mistydemeo
Copy link
Contributor

Ping - any thoughts re: my last comment?

@qubot qubot force-pushed the dev/issue-8895-search-api branch 4 times, most recently from 9dc7738 to 548ab10 Compare December 1, 2015 04:50
@mcantelon
Copy link
Member Author

@mistydemeo Sorry, didn't see this... The requirements are pretty rudimentary... here are the searches that were anticipated: https://wiki.archivematica.org/Research_data_management#METS_parsing

@qubot qubot force-pushed the dev/issue-8895-search-api branch 3 times, most recently from 6676e60 to 3c5eda6 Compare December 2, 2015 03:45
@sevein sevein added this to the 0.9.0 milestone Apr 6, 2016
@qubot qubot force-pushed the dev/issue-8895-search-api branch 2 times, most recently from a3ccc17 to 4eb3a27 Compare April 14, 2016 18:58
@jhsimpson jhsimpson modified the milestones: 0.10.0, 0.9.0 Jul 26, 2016
@jhsimpson jhsimpson removed this from the 0.11.0 milestone May 10, 2017
@jhsimpson jhsimpson added this to the 0.11.0 milestone Aug 11, 2017
@qubot qubot force-pushed the dev/issue-8895-search-api branch from 4eb3a27 to d59e806 Compare October 12, 2017 03:19
@sevein sevein modified the milestones: 0.11.0, 0.12.0 Oct 16, 2017
@qubot qubot force-pushed the dev/issue-8895-search-api branch from d59e806 to 6a07894 Compare October 25, 2017 09:55

* uuid (file UUID)
* package (package UUID)
* name (enter or partial filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a typo? Should "enter" be "entire"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think there's a typo in the line below: "PRONUM"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed these typos in #262

* min_size (minimum filesize)
* max_size (maximum filesize)
* normalized (boolean: whether or not file was normalized)
* valid (boolean: whether or not file data is valid or malformed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording sounds contradictory: "whether or not file data is valid or malformed"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think valid should be validated here. No? If so, I think the validated attribute of File only records whether the file has been validated, not whether it is valid. (Unless the help text is wrong in models/event.py::File

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should source_package be listed as a search filter here? FileFilter seems to expose it. It's a bit confusing since it only references Transfers I believe. See the pre-existing Package.index_file_data_from_transfer_mets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed validated to valid a nullable boolean field where None indicates not validated, True valid, and False invalid. Migrations and help text updated accordingly in #262.

docs/search.rst Outdated

http://<storage service URL>/api/v2/search/file/?min_size=29965171

Here is an example JSON response:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text above the URL doesn't appear to match what that URL actually searches for.

@qubot qubot force-pushed the dev/issue-8895-search-api branch from 9ae818f to 4f981e4 Compare November 8, 2017 00:14
@qubot qubot force-pushed the dev/issue-8895-search-api branch from ed899d4 to 970fd10 Compare November 8, 2017 00:28
@qubot qubot force-pushed the dev/issue-8895-search-api branch from 8708a3a to e4bd212 Compare November 10, 2017 02:42
Copy link
Contributor

@jrwdunham jrwdunham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments have led to changes in the PR that supersedes this one: see #262


class Meta:
model = models.File
fields = ['uuid', 'name', 'file_type', 'min_size', 'max_size', 'format_name', 'pronom_id', 'pipeline', 'source_package', 'normalized', 'validated', 'ingestion_time']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should source_package in this list be package?

* min_size (minimum filesize)
* max_size (maximum filesize)
* normalized (boolean: whether or not file was normalized)
* valid (boolean: whether or not file data is valid or malformed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think valid should be validated here. No? If so, I think the validated attribute of File only records whether the file has been validated, not whether it is valid. (Unless the help text is wrong in models/event.py::File

filter_backends = (filters.DjangoFilterBackend,)
filter_class = FileFilter

@list_route(methods=['get'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this used? TODO: find out.

class Migration(migrations.Migration):

dependencies = [
('locations', '0015_gpg_encrypted_space'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs another rebase: 0016_mirror_location_aip_replication.py migration has since been introduced.

metadata['filename'] = premis_object.original_name

if len(premis_object.object_identifiers[0]):
if premis_object.object_identifiers[0]['type'] == 'UUID':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that files in a METS file can have other types of ids (e.g., handle, PURL), we need a more robust way of getting the UUID. I think this needs to loop through all identifiers and find the first UUID one, if it exists.


# Indicate whether or not a file has been validated in metadata and if it passed
if premis_event.event_type == 'validation':
metadata['validated'] = premis_event.outcomes[0]['outcome'] == "pass"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems that the validated attribute of a file does actually mean "is valid". I think maybe the name should be change to is_valid. But if that's the case, then I think we want 3 possible values: True, False, and None...

For example, if you wanted to get details about the transfer source location
contained in the space 6d0b6cce-4372-4ef8-bf48-ce642761fd41 you could HTTP get::

http://<storage service URL>/api/v2/search/location/?space=7ec3d5d9-23ec-4fd5-b9fb-df82da8de630&purpose=TS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that if you pass a nonsense param to one of these search endpoints, you get all of the resources. Is that expected/desired? For example http://192.168.168.192:8000/api/v2/search/location/?moonbeam=figaro returns all locations. As does http://192.168.168.192:8000/api/v2/search/location/

* min_size (minimum filesize)
* max_size (maximum filesize)
* normalized (boolean: whether or not file was normalized)
* valid (boolean: whether or not file data is valid or malformed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should source_package be listed as a search filter here? FileFilter seems to expose it. It's a bit confusing since it only references Transfers I believe. See the pre-existing Package.index_file_data_from_transfer_mets


* uuid (file UUID)
* package (package UUID)
* name (enter or partial filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed these typos in #262

* min_size (minimum filesize)
* max_size (maximum filesize)
* normalized (boolean: whether or not file was normalized)
* valid (boolean: whether or not file data is valid or malformed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed validated to valid a nullable boolean field where None indicates not validated, True valid, and False invalid. Migrations and help text updated accordingly in #262.

@jrwdunham
Copy link
Contributor

I am closing this in favour of #262

@jrwdunham jrwdunham closed this Nov 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Request: discussion The path towards resolving the issue is unclear and opinion is sought from other community members.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants