
If an item's cocina specifies the language for a media file, pass that language on to Whisper instead of using Whisper's language detection #1427

Open
andrewjbtw opened this issue Nov 22, 2024 · 1 comment · May be fixed by #1428

andrewjbtw commented Nov 22, 2024

The Cocina model has a languageTag field where users can specify the language of a file. Currently we only use that field to drive the caption display interface.

For generating new captions, we want users to be able to specify a language for Whisper to use for transcription. Since a media item could have multiple files in different languages, we need to be able to specify this language on a per-file basis. We propose to use the languageTag on each media file to-be-captioned for this purpose.

Example:

In this QA item, the language has been set to English on the audio file that would be sent to Whisper. The idea is for Whisper to use that language when generating captions.

[Screenshot: QA item showing the audio file's language set to English]
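For context, the file-level cocina for such an item would carry the tag roughly as below. This is a trimmed, illustrative sketch written as a Python dict, not a complete or exact cocina document; the filename is hypothetical, and cocina's `languageTag` holds a BCP 47 tag:

```python
# Trimmed, illustrative sketch of the relevant cocina structural metadata.
# Real cocina documents carry many more fields; only what matters here is shown.
cocina_structural = {
    "contains": [            # fileSets (resources)
        {
            "structural": {
                "contains": [    # files within the fileSet
                    {
                        "filename": "interview.m4a",   # hypothetical filename
                        "hasMimeType": "audio/mp4",
                        "languageTag": "en",           # set by the user
                    }
                ]
            }
        }
    ]
}
```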

Users can already edit the language field using the file_manifest.csv in Preassembly or Argo's structural metadata editing, which uses the same CSV format. So in the near term we do not need to add a UI for language specification.

Logic

  • Using the existing logic to determine which media files to caption, look for the languageTag on those files; any other files can be ignored.
  • Have Whisper try to transcribe in that language.
  • Apply the same language value to the VTT/TXT files that come back from Whisper so that it shows up in the caption display UI.
  • If no language is specified, auto-detect (what we currently do). See the sketch after this list.
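A minimal sketch of that logic, using the open-source `whisper` Python package. The function name, model choice, and BCP 47 handling here are assumptions for illustration, not the project's actual implementation:

```python
import whisper

model = whisper.load_model("large")  # model choice here is illustrative

def transcribe_file(file_cocina: dict, audio_path: str) -> dict:
    """Transcribe one media file, honoring its cocina languageTag if present."""
    language_tag = file_cocina.get("languageTag")  # e.g. "en"; may be absent
    # Whisper expects a bare language code, not a full BCP 47 tag, so a real
    # implementation would have to map e.g. "en-US" to "en" (assumption).
    language = language_tag.split("-")[0] if language_tag else None

    # language=None makes Whisper auto-detect, which is the current behavior.
    result = model.transcribe(audio_path, language=language)

    # Carry the same language value onto the VTT/TXT outputs so it shows up
    # in the caption display UI (writing those files is out of scope here).
    result["languageTag"] = language_tag or result.get("language")
    return result
```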

Additional information

This differs from the OCR approach, where:

  • we are able to select multiple languages for detection
  • ABBYY (and presumably other OCR tools) detects language in smaller chunks within an item, not on a per-file basis
  • we set the language at the whole-item level, not per-file

Whisper can only be given a single language.
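Concretely, with the same `model` as in the sketch above, the transcribe call takes at most one language string:

```python
# Whisper accepts one language (or None for auto-detect) per transcribe call:
model.transcribe("interview.m4a", language="en")  # force English
model.transcribe("interview.m4a", language=None)  # auto-detect (current behavior)

# Something like language=["en", "fr"] is not supported; picking among several
# candidate languages would require multiple passes (assumption).
```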

@peetucket peetucket transferred this issue from sul-dlss/speech-to-text Nov 26, 2024
@peetucket peetucket self-assigned this Nov 26, 2024
peetucket (Member) commented

See sul-dlss/speech-to-text#51 for work that must also occur

@peetucket peetucket linked a pull request Nov 26, 2024 that will close this issue