Add AQUA as word alignment engine #495

Open
johnml1135 opened this issue Sep 19, 2024 · 6 comments
@johnml1135
Collaborator

johnml1135 commented Sep 19, 2024

This is to add the AQUA missing words assessment to Serval.

Implementation option 1 (fully modular):

  • Use a new folder: Assessment
  • Move shared code from Machine to ServiceToolkit, especially the ClearML integration
  • Use the same engine/job combo project

Implementation option 2 (start combining):

  • Rename Machine folder to Backend/Calculations/NLP_Engines, etc.
  • Combine Machine Engine and Job into one deployment
  • Add AQUA assessment to that deployment
  • Create a few more projects (as needed) to hold the unique aspects of the various engines (translation and assessment).
@johnml1135 johnml1135 self-assigned this Sep 19, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Sep 19, 2024
@johnml1135 johnml1135 moved this from 🆕 New to 🏗 In progress in Serval Sep 19, 2024
@johnml1135
Collaborator Author

johnml1135 commented Sep 19, 2024

Do the following to the existing files:

  • Keep Serval and the other primary folders as is.
  • Rename Machine as Backend
    • Rename Serval.Machine.Shared as Serval.Backend.Shared
    • Combine Serval.Machine.EngineServer and Serval.Machine.JobServer into Serval.Backend.BackendServer
    • Add Aqua assessments (and future things as well) to that backend server
    • Pull out parts of Serval.Backend.Shared into Serval.Backend.Machine - the parts that are specific to translation such as:
      • TranslationEngine
      • TrainSegmentPair
      • Corpus
      • Anything with Smt, Nmt or Thot
    • Create a new project called Serval.Backend.Aqua and include
      • The same job, build, and ClearML handling as in Machine, carried over as is
      • A new job runner for AQUA word alignment job
      • Two assessments from the data:
        • A formal equivalence assessment, that is, one number per verse/segment
        • A source word assessment, giving a number associated with each source word per verse
      • One engine can create both assessments
      • Other "reference AQUA engines" can be created and then compared against the primary AQUA engine
      • The API layer hands down "use these corpora, here are your reference engines" and the AQUA backend passes back "here are the fully calculated scores to pass to Lynx".
    • Lynx then determines the relevance and meaning of the two AQUA metrics
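
To make the two assessments concrete, here is a minimal Python sketch of what one engine might hand back per verse. Every name here (`SourceWordScore`, `VerseAssessment`, `lowest_scoring_words`) and every value is hypothetical and only illustrates the shape described above: one number per verse/segment plus one number per source word. The real backend would be C#, and nothing here is an existing Serval or AQUA type.

```python
from dataclasses import dataclass, field


@dataclass
class SourceWordScore:
    """Score attached to one source word in a verse (hypothetical shape)."""
    index: int   # position of the word in the tokenized source verse
    word: str
    score: float


@dataclass
class VerseAssessment:
    """Both assessments for one verse/segment, produced by a single engine."""
    ref: str                   # verse/segment reference (format is an assumption)
    formal_equivalence: float  # one number per verse/segment
    source_words: list[SourceWordScore] = field(default_factory=list)


def lowest_scoring_words(assessment: VerseAssessment, threshold: float) -> list[str]:
    """Return the source words whose score falls below `threshold`."""
    return [w.word for w in assessment.source_words if w.score < threshold]


# Made-up values, for illustration only.
verse = VerseAssessment(
    ref="JHN 1:6",
    formal_equivalence=0.82,
    source_words=[SourceWordScore(0, "His", 0.9), SourceWordScore(3, "John", 0.2)],
)
print(lowest_scoring_words(verse, threshold=0.5))
```

Keeping the per-word scores inside the per-verse record reflects the point that one engine can create both assessments. The threshold check is only an example of how a consumer could use them.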

@ddaspit, what do you think?

@johnml1135
Collaborator Author

Ok - we will abandon the assessment API for right now and make a word alignment API.

  • Call it AQUA enhanced word alignment
  • Add Thot word alignment
  • Add the word alignment to the existing EngineServer and JobServer - but keep those docker containers separate.
  • Add a new Serval.AQUA.WordAlignment project under Machine (only needed if we compute the Z-score in Serval).
  • Start with planning out the new API layer and adding the machine.py normal word alignment.
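
For reference, the Z-score mentioned above is the standard score, (x - mean) / standard deviation. The sketch below applies it to a plain list of alignment scores. Which scores AQUA would actually normalize, and whether it uses the population or sample standard deviation, is not settled in this thread, so `z_scores` and its use of the population form are assumptions.

```python
from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standard score of each value: (x - mean) / population standard deviation."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so none deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Made-up alignment scores, for illustration only.
print(z_scores([0.89, 0.75, 0.10]))
```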

@johnml1135
Collaborator Author

@ddaspit, what do you think - the basic refactoring would be:

  • There are base "Engine", "Corpora", "Job" classes
  • Inheritance tree:
    • Engine - Name, Id, Revision, etc.
      • TrainingEngine (Source and Target Corpus)
        • TranslationEngine - no changes
        • WordAlignmentEngine - no changes
      • AssessmentEngine
    • Corpus - a single language and set of files
    • TrainingCorpus - Source and Target sets of files
    • FilteredCorpus - a corpus reference with textIds and ScriptureRef filtering
    • Job - State, etc.
      • TrainingBuildJob - IsPersisted, FilteredCorpus etc.
        • TranslationBuildJob - add pretranslations
        • WordAlignmentBuildJob - add word alignments
      • AssessmentJob - FilteredCorpus
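
The tree above can be sketched as plain classes. This is an illustration in Python only; the real code is C#, and the field names and types (`corpora`, `text_ids`, `scripture_range`, and so on) are guesses from the bullet text, not existing Serval members.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Corpus:
    """A single language and a set of files."""
    language: str
    files: list[str] = field(default_factory=list)


@dataclass
class TrainingCorpus:
    """Source and target sets of files."""
    source: Corpus
    target: Corpus


@dataclass
class FilteredCorpus:
    """A corpus reference with textIds and ScriptureRef filtering."""
    corpus_id: str
    text_ids: list[str] = field(default_factory=list)
    scripture_range: Optional[str] = None


@dataclass
class Engine:
    id: str
    name: str
    revision: int = 0


@dataclass
class TrainingEngine(Engine):
    corpora: list[TrainingCorpus] = field(default_factory=list)


@dataclass
class TranslationEngine(TrainingEngine):
    pass  # no changes


@dataclass
class WordAlignmentEngine(TrainingEngine):
    pass  # no changes


@dataclass
class AssessmentEngine(Engine):
    pass


@dataclass
class Job:
    id: str
    state: str = "Pending"


@dataclass
class TrainingBuildJob(Job):
    is_persisted: bool = False
    corpora: list[FilteredCorpus] = field(default_factory=list)


@dataclass
class TranslationBuildJob(TrainingBuildJob):
    pretranslations: list[str] = field(default_factory=list)


@dataclass
class WordAlignmentBuildJob(TrainingBuildJob):
    word_alignments: list[str] = field(default_factory=list)


@dataclass
class AssessmentJob(Job):
    corpus: Optional[FilteredCorpus] = None
```

The point of the split is that anything written against `TrainingEngine` or `TrainingBuildJob` works for both translation and word alignment, while the assessment types share only the `Engine` and `Job` bases.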

Keep all API and database things the same. This is a pure refactoring with no other changes. Just get ready for WordAlignment; don't add it yet.

@johnml1135 johnml1135 changed the title from "Add AQUA missing words assessment" to "Add AQUA as word alignment engine" Sep 30, 2024
@johnml1135
Collaborator Author

@ddaspit - How should we represent word alignments at the Serval API layer? Here is the interface that I am assuming:

  • When training an engine, you can also have Serval align a portion of the training data, or other data
  • After training, the user can pass a source and target segment (with an optional scripture reference) to be aligned
  • Data back needs to include:
    • The aligned words
    • A single metric signifying the quality of the alignment
    • Indication of the tokenization of the source and target sentences

Options:

  1. Just take word pairs and a score -> John:Juan:0.89
  2. Just take number pairs and a score -> 7:8:0.89
  3. Add both by using a "|" -> 7|John:8|Juan:0.89
  4. Use json to add the tokenization:
{
   "source_tokenization": ["His", "name", "is", "John"],
   "target_tokenization": ... ,
   "alignment": ["1:1:0.89", "2:2:0.7546"]
}
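
To compare the compact string options, here is a small Python sketch that parses option 3 and re-emits options 1 and 2 from it. `AlignedPair`, `parse_option3`, `to_option1`, and `to_option2` are hypothetical helpers written for this comparison, not part of Serval or machine.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    source_index: int
    source_word: str
    target_index: int
    target_word: str
    score: float


def parse_option3(text: str) -> AlignedPair:
    """Parse 'srcIndex|srcWord:trgIndex|trgWord:score' (option 3)."""
    src, trg, score = text.split(":")
    src_index, src_word = src.split("|", 1)
    trg_index, trg_word = trg.split("|", 1)
    return AlignedPair(int(src_index), src_word, int(trg_index), trg_word, float(score))


def to_option1(pair: AlignedPair) -> str:
    """Word pair and score (option 1)."""
    return f"{pair.source_word}:{pair.target_word}:{pair.score}"


def to_option2(pair: AlignedPair) -> str:
    """Index pair and score (option 2)."""
    return f"{pair.source_index}:{pair.target_index}:{pair.score}"


pair = parse_option3("7|John:8|Juan:0.89")
print(to_option1(pair))  # John:Juan:0.89
print(to_option2(pair))  # 7:8:0.89
```

One thing this makes visible: options 1 and 3 break as soon as a token itself contains ":" or "|" (the `split(":")` above would raise on a token such as "3:16"), whereas index-only pairs plus a separate token list avoid that problem.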

@ddaspit
Contributor

ddaspit commented Sep 30, 2024

You should take a look at the TranslationResult model for inspiration. We will probably want a subset of the properties in that model, specifically SourceTokens, TargetTokens, and Alignment.
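
A rough Python sketch of what such a subset could look like, for discussion only. The three top-level properties mirror the ones named above; the class names, the per-pair `score` field, the zero-based indices, and the Spanish target tokens are assumptions, not the actual TranslationResult schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AlignedWordPair:
    source_index: int  # zero-based index into source_tokens (assumption)
    target_index: int  # zero-based index into target_tokens (assumption)
    score: float       # hypothetical per-pair score


@dataclass
class WordAlignmentResult:
    source_tokens: list[str]
    target_tokens: list[str]
    alignment: list[AlignedWordPair]

    def aligned_words(self) -> list[tuple[str, str, float]]:
        """Resolve index pairs back to (source word, target word, score)."""
        return [
            (self.source_tokens[p.source_index], self.target_tokens[p.target_index], p.score)
            for p in self.alignment
        ]

    def to_json(self) -> str:
        """Serialize tokens and alignment together in one JSON object."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Made-up example data.
result = WordAlignmentResult(
    source_tokens=["His", "name", "is", "John"],
    target_tokens=["Su", "nombre", "es", "Juan"],
    alignment=[AlignedWordPair(1, 1, 0.7546), AlignedWordPair(3, 3, 0.89)],
)
print(result.aligned_words())
print(result.to_json())
```

Returning the tokens alongside index pairs covers both the "aligned words" and the "indication of the tokenization" requirements without encoding words into a delimited string.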

@johnml1135
Collaborator Author

johnml1135 commented Dec 12, 2024

TODO:

  • Pass tokenized words up from machine.py to the S3 bucket
  • Make integration and E2E tests work
  • Test E2E both batch word alignment and "new parallel segments" word alignment
  • Update machine.py word alignment to use word_alignment_inputs.json as inference input and word_alignment_outputs.json for outputs
  • Live word alignments need the most work
  • Change GetBestPhraseAlignment to GetBestWordAlignment

Order to proceed:

  • Get Serval API integration tests working
  • Write more Unit tests if deemed necessary
  • Make changes above
  • Get E2E test working - batch alignments
  • Get live alignments working
  • Cut a release
  • Make a new branch and start on AQUA
    • Will need to align the preprocessed training data to vref (for AQUA calculations)
