
Align frontend data model with Metadata names of edusharing service. #94

Open
MRuecklCC opened this issue May 19, 2022 · 11 comments
Labels
feature Anything related to the behavior of the feature extractors

Comments

@MRuecklCC
Contributor

Currently, the data model uses its own names for the different extractors. Eventually, we want to align the extractor names with the edusharing naming conventions.

@MRuecklCC
Contributor Author

As part of this it may also make sense to simplify the API input and output models. After some discussion with @RMeissnerCC we decided to:

  • remove the whitelist feature
  • slightly un-nest the input model
  • turn the individual metadata fields into Union[Result, Error], to avoid an "all-or-nothing" response. I.e. if a single extractor fails, the response will contain an error message for that extractor, but the other extractor results will be present.
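
The Union[Result, Error] idea can be sketched roughly as follows; the class and field names are made up for illustration and are not the project's actual models.

```python
# Illustrative sketch of the per-field Union[Result, Error] idea; class and
# field names are invented for this example, not the project's real models.
from dataclasses import dataclass
from typing import Union


@dataclass
class Result:
    value: float
    explanation: str


@dataclass
class Error:
    message: str


# Each metadata field holds either a result or an error, so one failing
# extractor no longer turns the whole response into an error.
response: dict[str, Union[Result, Error]] = {
    "advertisement": Result(value=0.9, explanation="no ad hits"),
    "cookies": Error(message="timeout while loading the page"),
}

for field, outcome in response.items():
    if isinstance(outcome, Error):
        print(f"{field}: failed ({outcome.message})")
    else:
        print(f"{field}: {outcome.value}")
```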

@MRuecklCC
Contributor Author

The simplification of the API data model was done as part of #100.

@MRuecklCC MRuecklCC added the feature Anything related to the behavior of the feature extractors label Jun 1, 2022
@RobertMeissner
Collaborator

> The simplification of the API data model was done as part of #100.

Is there now anything left in this issue or can it be closed?

@MRuecklCC
Contributor Author

The main issue is still unresolved: https://issues.edu-sharing.net/jira/browse/KBMBF-475

@MRuecklCC
Contributor Author

MRuecklCC commented Jun 28, 2022

To make some progress on this front, I spent a while going through the current metadata fields defined by the edusharing service and checking them in Elasticsearch. A couple of those fields are:

Misc attributes

Quality attributes

  • ccm:oeh_quality_personal_law: "Persönlichkeitsrechte"
  • ccm:oeh_quality_protection_of_minors: "Jugendschutz"
    • This is currently a boolean (0/1) field in edusharing.
  • ccm:oeh_quality_copyright_law: "Urheberrecht"
  • ccm:oeh_quality_criminal_law: "Strafrecht"
  • ccm:oeh_quality_login: "Login notwendig"
    • This is currently a boolean (0/1) field in edusharing.
  • ccm:oeh_quality_relevancy_for_education: "geeignet für Bildung (WLO-Suche)"
  • ccm:oeh_quality_transparentness: "Anbieter Renommee"
  • ccm:oeh_quality_didactics: "Didaktik/Methodik"
  • ccm:oeh_quality_medial: "Medial passend"
  • ccm:oeh_quality_language: "Sprachlich"
  • ccm:oeh_quality_neutralness: "Neutralität"
  • ccm:oeh_quality_currentness: "Aktualität"
  • ccm:oeh_quality_data_privacy: "Datenschutz"
    • This is a 0-5 stars field, but does not use a vocabulary / value space.
  • ccm:oeh_quality_correctness: "Sachrichtigkeit"

Available Extractors

On the other side, we have the current extractor implementations:

  • Advertisement
  • EasyPrivacy
  • MaliciousExtensions
  • ExtractFromFiles
  • FanboyAnnoyance
  • FanboyNotification
  • FanboySocialMedia
  • AntiAdBlock
  • EasylistGermany
  • EasylistAdult
  • Paywalls
  • Security
  • IFrameEmbeddable
  • PopUp
  • RegWall
  • LogInOut
  • Cookies
  • GDPR
  • Javascript
  • Accessibility
  • LicenceExtractor

@MRuecklCC
Contributor Author

MRuecklCC commented Jun 28, 2022

Mapping between extractors and metadata fields

As a first step, the following relations come to mind:

  • ccm:oeh_quality_protection_of_minors (Jugendschutz):

    • EasylistAdult would need to be modified to the binary output schema
  • ccm:oeh_quality_login (Login notwendig)

    • LogInOut, RegWall, Paywalls
    • Extractors would need to be combined into a single binary output.
  • ccm:oeh_quality_data_privacy (Datenschutz)

    • EasyPrivacy, GDPR, Cookies
    • Extractors would need to be combined into a single 0-5 star value space.
  • ccm:accessibilitySummary

    • We could map the Accessibility extractor's output score to the A/AA/AAA scale.
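
A naive version of these combinations could look like the following sketch. The extractor and field names come from this thread; the aggregation logic itself is a placeholder, not an implementation decision.

```python
# Naive sketch of collapsing several extractor outputs into single edusharing
# fields. The aggregation logic is a placeholder for discussion only.

def login_required(log_in_out: bool, reg_wall: bool, paywall: bool) -> bool:
    # ccm:oeh_quality_login is a boolean (0/1) field: any login-like barrier
    # reported by LogInOut, RegWall or Paywalls counts.
    return log_in_out or reg_wall or paywall


def privacy_stars(easy_privacy: float, gdpr: float, cookies: float) -> int:
    # ccm:oeh_quality_data_privacy is a 0-5 star field; here the three
    # per-extractor scores (assumed to lie in [0, 1]) are averaged and rescaled.
    mean = (easy_privacy + gdpr + cookies) / 3
    return round(mean * 5)


print(login_required(False, True, False))  # True
print(privacy_stars(0.8, 0.6, 1.0))        # 4
```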

@MRuecklCC
Contributor Author

Given the example of ccm:oeh_quality_protection_of_minors, it also becomes clear that the current response data model may be inadequate.

Consider the following two scenarios, where the service receives a request to extract meta information for a website that contains adult advertisement.

  1. The advertisement is detected by the EasylistAdult extractor, which immediately makes clear that the content is not suited as OER; the service could respond with a 0-star rating for ccm:oeh_quality_protection_of_minors.
  2. The EasylistAdult extractor does not detect the ad (because it is not part of the respective blacklist). If the service responded with a 5-star rating (because it didn't detect anything), that would be bad. A more conservative approach would be to omit the ccm:oeh_quality_protection_of_minors assessment (better safe than sorry).

Similar arguments can be made for other attributes. In those cases, the response data model could either

  • be explicit about it ("hey, I am not entirely sure, but I didn't find anything suspicious, here is my X-star rating")
  • or omit the respective assessment ("hey, I'm not going to tell you, because I'm not entirely sure")

In abstract terms:

  • If the extractor's goal is to guarantee the absence of something that is defined via a blacklist, there will always be the issue that the blacklist may be incomplete.
  • If the extractor's goal is to guarantee the presence of something that is defined via a whitelist, there will always be the issue that the whitelist may be incomplete.

In both cases we could refrain from responding with an assessment, or at least wrap it into a "maybe"/"potentially".
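
The "maybe" wrapper could be sketched like this; all names here are hypothetical. The point is that a blacklist hit is positive evidence and can be reported with certainty, while "nothing found" stays tentative because the list may be incomplete.

```python
# Sketch of a "maybe"-wrapped assessment; names are hypothetical.
from dataclasses import dataclass


@dataclass
class Assessment:
    stars: int
    certainty: str  # "certain" (blacklist hit) or "tentative" (no hit)


def assess_protection_of_minors(adult_ad_hits: int) -> Assessment:
    if adult_ad_hits > 0:
        # EasylistAdult matched something: safe to report 0 stars.
        return Assessment(stars=0, certainty="certain")
    # No hit only means the (possibly incomplete) blacklist matched nothing;
    # hedge the rating instead of claiming a confident 5 stars.
    return Assessment(stars=5, certainty="tentative")


print(assess_protection_of_minors(3))  # Assessment(stars=0, certainty='certain')
```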

@RobertMeissner
Collaborator

Regarding your latest comment: so basically, there is no safe way to use black-/whitelists and make a solid statement. Everything we say relies on the lists being "complete", whatever that means.

@MRuecklCC
Contributor Author

MRuecklCC commented Jul 19, 2022

I read up on accessibility ratings and Lighthouse:

  • The current valuespace is weird: oeh-metadata-vocabs#18 (Accessibility Vocabulary)
  • Lighthouse has some WCAG checks, but no automatic WCAG rating output
  • There is an online tool that does WCAG ratings: https://www.siteimprove.com/toolkit/accessibility-checker
    • Probably uses lighthouse under the hood :-)
    • Rather slow as well
    • Probably rate limited so not really suitable for our case
    • Spits out scores from 0-100 for each WCAG level
  • Lighthouse output does not automatically provide a WCAG rating. In addition, it does not include all the checks needed to fully assess a WCAG rating. This means that even if we inspect Lighthouse's detailed output, we cannot fully automate a WCAG rating from it (only give a suggestion)
  • To deduce a WCAG suggestion from lighthouse we need to analyse the individual checks by hand and correlate them with the WCAG ratings: https://github.com/dequelabs/axe-core/blob/develop/doc/rule-descriptions.md
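
A suggestion-only derivation could look like the sketch below. The check ids and their WCAG levels are invented placeholders; the real mapping would have to be compiled by hand from the axe-core rule descriptions linked above.

```python
# Sketch of deriving a WCAG *suggestion* from individual failed checks.
# The check ids and level assignments below are placeholders, not the real
# axe-core mapping.
CHECK_LEVEL = {
    "image-alt": "A",
    "label": "A",
    "color-contrast": "AA",
}


def wcag_suggestion(failed_checks: list[str]) -> str:
    # Only a suggestion: Lighthouse does not run every WCAG check, so a clean
    # run cannot prove conformance, it can only fail to disprove it.
    failed_levels = {CHECK_LEVEL[c] for c in failed_checks if c in CHECK_LEVEL}
    if "A" in failed_levels:
        return "below A (suggested)"
    if "AA" in failed_levels:
        return "A (suggested)"
    return "AA (suggested, evidence incomplete)"


print(wcag_suggestion(["color-contrast"]))  # A (suggested)
```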

@lummerland

> Given the example of ccm:oeh_quality_protection_of_minors, it also becomes clear that the current response data model may be inadequate.

It looks as if we need to discuss and decide more or less every field and mapping because of special characteristics. I think it would be helpful to have more detailed information on top of the "simple" mapping of fields, e.g.:

  • How much confidence do we have in the result of the specific data?
  • What strategies could be used to put the data into the documents? Meaning: can we put it directly into the metadata without risk? Or should it rather be a proposal that somebody has to accept manually? Or something else?
  • How much risk do we take when the data is wrong, vague, or otherwise off?
  • ...

@MRuecklCC
Contributor Author

MRuecklCC commented Jul 21, 2022

As a first shot, I will provide a new API endpoint that provides the following four attributes:

  • ccm:oeh_quality_protection_of_minors
  • ccm:oeh_quality_login
  • ccm:oeh_quality_data_privacy
  • ccm:accessibilitySummary

The structure will follow what is available on the /extract endpoint; the mapping from extractor to LRMI metadata field will be implemented in the most trivial way from the extractors listed above.

The endpoint will be POST {base-uri}/lrmi-suggestions. It will take a JSON body with the following structure:

```json
{
    "url": "https://some-domain.de/path/to/content.html"
}
```

For now, the endpoint will only provide results for HTML content; responses for non-HTML content are unspecified for now.
The response body will look as follows:

```
{
  "ccm:oeh_quality_protection_of_minors": {
    "stars": 0-5,  # may be missing; in that case there will be an error message
    "explanation": "some human readable string",
    "error": "",   # only present if extraction failed, in which case neither stars, explanation nor extra will be available
    "extra": {
      # attribute-specific extra information; the structure depends on the attribute
    }
  },
  "ccm:oeh_quality_login": {
    # same as above
  },
  "ccm:oeh_quality_data_privacy": {
    # same as above
  },
  "ccm:accessibilitySummary": {
    # same as above
  }
}
```
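
A client call against this endpoint could then look like the following stdlib-only sketch. The localhost base URI is an assumption; the real {base-uri} depends on the deployment.

```python
# Hypothetical client call for the endpoint sketched above, using only the
# standard library. The base URI is an assumption, not the real deployment.
import json
import urllib.request

payload = json.dumps(
    {"url": "https://some-domain.de/path/to/content.html"}
).encode()

req = urllib.request.Request(
    "http://localhost:8000/lrmi-suggestions",  # assumed {base-uri}
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending the request (requires a running service):
# with urllib.request.urlopen(req) as resp:
#     suggestions = json.load(resp)
#     minors = suggestions["ccm:oeh_quality_protection_of_minors"]
#     stars = minors.get("stars")  # missing if "error" is set
```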
