
If an item's cocina specifies the language for a media file, pass that language on to Whisper instead of using Whisper's language detection #1427

Open
andrewjbtw opened this issue Nov 22, 2024 · 1 comment · May be fixed by #1428

andrewjbtw commented Nov 22, 2024

The Cocina model has a languageTag field where users can specify the language of a file. Currently we only use that field to drive the caption display interface.

For generating new captions, we want users to be able to specify a language for Whisper to use for transcription. Since a media item could have multiple files in different languages, we need to be able to specify this language on a per-file basis. We propose to use the languageTag on each media file to-be-captioned for this purpose.

Example:

In this QA item, the language has been set to English on the audio file that would be sent to Whisper. The idea is for Whisper to use that language when generating captions.

[Screenshot: QA item showing the audio file's language set to English]
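For context, the file-level cocina for such an item would carry the tag roughly as below. This is a trimmed, illustrative sketch written as a Python dict, not a complete or exact cocina document; the filename is hypothetical, and cocina's `languageTag` holds a BCP 47 tag:

```python
# Trimmed, illustrative sketch of the relevant cocina structural metadata.
# Real cocina documents carry many more fields; only what matters here is shown.
cocina_structural = {
    "contains": [            # fileSets (resources)
        {
            "structural": {
                "contains": [    # files within the fileSet
                    {
                        "filename": "interview.m4a",   # hypothetical filename
                        "hasMimeType": "audio/mp4",
                        "languageTag": "en",           # set by the user
                    }
                ]
            }
        }
    ]
}
```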

Users can already edit the language field using the file_manifest.csv in Preassembly or Argo's structural metadata editing, which uses the same CSV format. So in the near term we do not need to add a UI for language specification.

Logic

  • Using the existing logic to determine which media files to caption, look for the languageTag on those files; any other files can be ignored.
  • Have Whisper try to transcribe in that language.
  • Apply the same language value to the VTT/TXT files that come back from Whisper so that it shows up in the caption display UI.
  • If no language is specified, auto-detect (what we currently do). See the sketch after this list.
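A minimal sketch of that logic, using the open-source `whisper` Python package. The function name, model choice, and BCP 47 handling here are assumptions for illustration, not the project's actual implementation:

```python
import whisper

model = whisper.load_model("large")  # model choice here is illustrative

def transcribe_file(file_cocina: dict, audio_path: str) -> dict:
    """Transcribe one media file, honoring its cocina languageTag if present."""
    language_tag = file_cocina.get("languageTag")  # e.g. "en"; may be absent
    # Whisper expects a bare language code, not a full BCP 47 tag, so a real
    # implementation would have to map e.g. "en-US" to "en" (assumption).
    language = language_tag.split("-")[0] if language_tag else None

    # language=None makes Whisper auto-detect, which is the current behavior.
    result = model.transcribe(audio_path, language=language)

    # Carry the same language value onto the VTT/TXT outputs so it shows up
    # in the caption display UI (writing those files is out of scope here).
    result["languageTag"] = language_tag or result.get("language")
    return result
```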

Additional information

This differs from the OCR approach, where:

  • we are able to select multiple languages for detection
  • ABBYY (and presumably other OCR tools) detects language in smaller chunks within an item, not on a per-file basis
  • we set the language at the whole-item level, not per-file

Whisper can only be given a single language.
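Concretely, with the same `model` as in the sketch above, the transcribe call takes at most one language string:

```python
# Whisper accepts one language (or None for auto-detect) per transcribe call:
model.transcribe("interview.m4a", language="en")  # force English
model.transcribe("interview.m4a", language=None)  # auto-detect (current behavior)

# Something like language=["en", "fr"] is not supported; picking among several
# candidate languages would require multiple passes (assumption).
```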

@peetucket peetucket transferred this issue from sul-dlss/speech-to-text Nov 26, 2024
@peetucket peetucket self-assigned this Nov 26, 2024
peetucket (Member) commented

See sul-dlss/speech-to-text#51 for work that must also occur

@peetucket peetucket linked a pull request Nov 26, 2024 that will close this issue