Add AQUA as word alignment engine #495

johnml1135 · 2024-09-19T17:43:47Z

This is to add the AQUA missing words assessment to Serval.

Implementation option 1 (fully modular):

Use a new folder: Assessment
Move shared code from Machine to ServiceToolkit, especially the ClearML code
Use the same engine/job combo project

Implementation option 2 (start combining):

Rename the Machine folder to something like Backend, Calculations, or NLP_Engines
Combine Machine Engine and Job into one deployment
Add AQUA assessment to that deployment
Create a few more projects (as needed) to hold the unique aspects of the various engines (translation and assessment).

johnml1135 · 2024-09-19T19:57:31Z

Do the following to the existing files:

Keep Serval and the other primary folders as is.
Rename Machine as Backend
- Rename Serval.Machine.Shared as Serval.Backend.Shared
- Combine Serval.Machine.EngineServer and Serval.Machine.JobServer into Serval.Backend.BackendServer
- Add Aqua assessments (and future things as well) to that backend server
- Pull out parts of Serval.Backend.Shared into Serval.Backend.Machine - the parts that are specific to translation such as:
  - TranslationEngine
  - TrainSegmentPair
  - Corpus
  - Anything with Smt, Nmt or Thot
- Create a new project called Serval.Backend.Aqua and include
  - The same job, build, ClearML, etc. aspects from Machine as is
  - A new job runner for the AQUA word alignment job
  - Two assessments from the data:
    - A formal equivalence assessment, that is, one number per verse/segment
    - A source word assessment, giving a number associated with each source word per verse
  - One engine can create both assessments
  - Other "reference AQUA engines" can be created and then compared against the primary AQUA engine
  - The API layer hands down "use these corpora, here are your reference engines" and the AQUA backend passes back "here are the fully calculated scores to pass to Lynx".
- Lynx then determines the relevance and meaning of the two AQUA metrics

@ddaspit, what do you think?
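
The two assessments proposed above (one number per verse/segment, plus one number per source word) could be carried in a single record per verse. The following is only a sketch of that shape in Python; the class names, field names, and verse reference format are hypothetical and do not come from Serval or AQUA.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SourceWordScore:
    """Hypothetical per-word record: one number per source word in a verse."""

    index: int  # position of the word in the tokenized source verse (0-based here)
    word: str
    score: float


@dataclass
class VerseAssessment:
    """Hypothetical per-verse record holding both proposed assessments."""

    verse_ref: str  # reference format is an assumption, e.g. "JHN 1:6"
    formal_equivalence: float  # the single number for the verse/segment
    source_words: list[SourceWordScore] = field(default_factory=list)


def lowest_scoring_words(assessment: VerseAssessment, count: int = 1) -> list[str]:
    """Return the `count` source words with the lowest scores, lowest first."""
    ranked = sorted(assessment.source_words, key=lambda w: w.score)
    return [w.word for w in ranked[:count]]
```

Because one engine can create both assessments, keeping them in one record lets a consumer such as Lynx read either metric without a second lookup.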

johnml1135 · 2024-09-19T22:30:08Z

Ok - we will abandon the assessment API for right now and make a word alignment API.

Call it AQUA enhanced word alignment
Add Thot word alignment
Add the word alignment to the existing EngineServer and JobServer, but keep those Docker containers separate
Add a new Serval.AQUA.WordAlignment project under Machine (only needed if we do the Z-score in Serval)
Start by planning out the new API layer and adding the normal machine.py word alignment
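
The thread does not say what the Z-score is computed over. As a sketch only, assuming it means standardizing each segment's alignment score against the distribution of scores for all segments, the calculation is small enough to live on either side. The function name and the choice of population standard deviation are assumptions.

```python
from __future__ import annotations

from statistics import mean, pstdev


def z_scores(scores: list[float]) -> list[float]:
    """Standardize scores: (score - mean) / population standard deviation.

    Assumption: the scores are per-segment alignment scores from one engine.
    Returns zeros when all scores are identical, and an empty list for no input.
    """
    if not scores:
        return []
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]
```

A strongly negative value would mark a segment whose alignment score is unusually low relative to the rest of the corpus.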

johnml1135 · 2024-09-20T19:25:22Z

@ddaspit, what do you think? The basic refactoring would be:

There are base "Engine", "Corpus", and "Job" classes
Inheritance tree:
- Engine - Name, Id, Revision, etc.
  - TrainingEngine (Source and Target Corpus)
    - TranslationEngine - no changes
    - WordAlignmentEngine - no changes
  - AssessmentEngine
- Corpus - a single language and set of files
- TrainingCorpus - Source and Target sets of files
- FilteredCorpus - a corpus reference with textIds and ScriptureRef filtering
- Job - state, etc.
  - TrainingBuildJob - IsPersisted, FilteredCorpus etc.
    - TranslationBuildJob - add pretranslations
    - WordAlignmentBuildJob - add word alignments
  - AssessmentJob - FilteredCorpus

Keep all API and database things the same. This is refactoring with 0 other changes. Just get ready for WordAlignment, don't add it yet.
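
The inheritance tree above can be sketched as follows. This is written in Python for brevity and is not the Serval code; every field beyond those named in the tree (for example `id`, `language`, `scripture_range`, and the default job state) is a hypothetical placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Corpus:
    """A single language and set of files."""
    id: str
    language: str
    files: list[str] = field(default_factory=list)


@dataclass
class TrainingCorpus:
    """Source and target sets of files."""
    source: Corpus
    target: Corpus


@dataclass
class FilteredCorpus:
    """A corpus reference with textIds and ScriptureRef filtering."""
    corpus_id: str
    text_ids: list[str] = field(default_factory=list)
    scripture_range: str | None = None  # hypothetical representation


@dataclass
class Engine:
    id: str
    name: str
    revision: int = 0


@dataclass
class TrainingEngine(Engine):
    corpora: list[TrainingCorpus] = field(default_factory=list)


class TranslationEngine(TrainingEngine):
    """No changes in this refactoring."""


class WordAlignmentEngine(TrainingEngine):
    """Placeholder only; word alignment is not added yet."""


class AssessmentEngine(Engine):
    """Does not train on source/target pairs."""


@dataclass
class Job:
    id: str
    state: str = "Pending"  # hypothetical default


@dataclass
class TrainingBuildJob(Job):
    is_persisted: bool = False
    corpora: list[FilteredCorpus] = field(default_factory=list)


@dataclass
class TranslationBuildJob(TrainingBuildJob):
    pretranslations: list[str] = field(default_factory=list)


@dataclass
class WordAlignmentBuildJob(TrainingBuildJob):
    word_alignments: list[str] = field(default_factory=list)


@dataclass
class AssessmentJob(Job):
    corpus: FilteredCorpus | None = None
```

The point of the split is that code handling builds in general can depend on `TrainingBuildJob`, while translation-specific and alignment-specific outputs stay in the leaf classes.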

johnml1135 · 2024-09-30T20:08:25Z

@ddaspit - How should we represent word alignments at the Serval API layer? Here is the interface that I am assuming:

When training an engine, you can also have Serval align a portion of the training data, or other data
After training, the user can pass a source and target segment (with an optional scripture reference) to be aligned
The returned data needs to include:
- The aligned words
- A single metric signifying the quality of the alignment
- Indication of the tokenization of the source and target sentences

Options:

Just take word pairs and a score -> John:Juan:0.89
Just take number pairs and a score -> 7:8:0.89
Add both by using a "|" -> 7|John:8|Juan:0.89
Use json to add the tokenization:

{
   "source_tokenization": ["His", "name", "is", "John"],
   "target_tokenization": ... ,
   "alignment": "1:1:0.89, 2:2:0.7546"
}
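
To compare the compact string options, here is a minimal parser sketch that accepts all three forms (`John:Juan:0.89`, `7:8:0.89`, and `7|John:8|Juan:0.89`). The `AlignedPair` class and the function names are hypothetical, and the thread does not say whether indices are 0-based or 1-based, so the parser keeps them as written.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignedPair:
    source_index: Optional[int]
    source_word: Optional[str]
    target_index: Optional[int]
    target_word: Optional[str]
    score: float


def _parse_side(text: str) -> tuple[Optional[int], Optional[str]]:
    """Parse one side of a pair: 'index|word', a bare index, or a bare word."""
    if "|" in text:
        index, word = text.split("|", 1)
        return int(index), word
    if text.isdigit():
        return int(text), None
    return None, text


def parse_pair(text: str) -> AlignedPair:
    """Parse 'source:target:score'. Fails if a word itself contains ':'."""
    source, target, score = text.strip().split(":")
    source_index, source_word = _parse_side(source)
    target_index, target_word = _parse_side(target)
    return AlignedPair(source_index, source_word, target_index, target_word, float(score))


def parse_alignment(text: str) -> list[AlignedPair]:
    """Parse a comma-separated list such as '1:1:0.89, 2:2:0.7546'."""
    return [parse_pair(part) for part in text.split(",") if part.strip()]
```

The sketch also exposes the weak spots of the string forms: a token containing `:` or `,` breaks the split, and a numeric token such as `40` cannot be told apart from an index in the word-only form. A structured JSON payload avoids both problems.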

ddaspit · 2024-09-30T22:12:33Z

You should take a look at the TranslationResult model for inspiration. We will probably want a subset of the properties in that model, specifically SourceTokens, TargetTokens, and Alignment.
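
A result carrying just that subset might look like the sketch below. It is written in Python rather than as the actual Serval DTO; the property names mirror SourceTokens, TargetTokens, and Alignment, while the per-pair `score` field, the 0-based indices, and the `validate` helper are assumptions added for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class AlignedWordPair:
    source_index: int  # 0-based index into source_tokens (assumption)
    target_index: int  # 0-based index into target_tokens (assumption)
    score: float       # hypothetical per-pair confidence


@dataclass
class WordAlignmentResult:
    source_tokens: list[str]
    target_tokens: list[str]
    alignment: list[AlignedWordPair]

    def validate(self) -> None:
        """Raise ValueError if any pair points outside the token lists."""
        for pair in self.alignment:
            if not 0 <= pair.source_index < len(self.source_tokens):
                raise ValueError(f"source index out of range: {pair.source_index}")
            if not 0 <= pair.target_index < len(self.target_tokens):
                raise ValueError(f"target index out of range: {pair.target_index}")

    def to_json(self) -> str:
        """Serialize the result, including nested pairs, to a JSON string."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Returning the token lists alongside index pairs covers the tokenization requirement directly: a client can recover the aligned words from the indices without re-tokenizing.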

johnml1135 · 2024-12-12T17:05:15Z

TODO:

Pass up tokenized words from machine.py to the S3 bucket
Make integration and E2E tests work
Test E2E both batch word alignment and "new parallel segments" word alignment
Update machine.py word alignment to use word_alignment_inputs.json as inference input and word_alignment_outputs.json for outputs
Live word alignments need the most work
Change GetBestPhraseAlignment to GetBestWordAlignment

Order to proceed:

Get Serval API integration tests working
Write more unit tests if deemed necessary
Make changes above
Get E2E test working - batch alignments
Get live alignments working
Cut a release
Make a new branch and start on AQUA
- Will need to align the preprocessed training data to vref (for AQUA calculations)
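
For the TODO item about word_alignment_inputs.json and word_alignment_outputs.json, the inference step could be shaped roughly like this. It is a sketch, not the machine.py implementation: the record keys (`ref`, `source`, `target`), the output layout, whitespace tokenization, and the diagonal placeholder aligner are all assumptions standing in for the real schema and the trained model.

```python
from __future__ import annotations

import json
from pathlib import Path


def tokenize(text: str) -> list[str]:
    """Placeholder tokenizer: split on whitespace."""
    return text.split()


def align_diagonal(source_tokens: list[str], target_tokens: list[str]) -> list[dict]:
    """Placeholder aligner pairing token i with token i; stands in for a trained model."""
    length = min(len(source_tokens), len(target_tokens))
    return [{"source_index": i, "target_index": i} for i in range(length)]


def run_inference(input_path: Path, output_path: Path) -> int:
    """Read input records, align each segment pair, write results; return the count."""
    records = json.loads(input_path.read_text(encoding="utf-8"))
    results = []
    for record in records:
        source_tokens = tokenize(record["source"])
        target_tokens = tokenize(record["target"])
        results.append(
            {
                "ref": record.get("ref"),
                "source_tokens": source_tokens,
                "target_tokens": target_tokens,
                "alignment": align_diagonal(source_tokens, target_tokens),
            }
        )
    output_path.write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(results)
```

Writing the tokens into the output file is what would let the tokenized words be passed up to the S3 bucket along with the alignments.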

johnml1135 self-assigned this Sep 19, 2024

johnml1135 added the AQUA label Sep 19, 2024

johnml1135 added this to Serval Sep 19, 2024

github-project-automation bot moved this to 🆕 New in Serval Sep 19, 2024

johnml1135 moved this from 🆕 New to 🏗 In progress in Serval Sep 19, 2024

johnml1135 changed the title ~~Add AQUA missing words assessment~~ Add AQUA as word alignment engine Sep 30, 2024

johnml1135 assigned Enkidu93 Dec 20, 2024
