Indexing updates for ISO 639-3 three-letter codes #2069

Closed
3 of 4 tasks
escowles opened this issue Jan 24, 2023 · 7 comments · Fixed by #2124
Labels
DSSG approved Work approved by the DSSG group

Comments

@escowles
Member

escowles commented Jan 24, 2023

CaMS is now including ISO 639-3 three-letter language codes where appropriate (sample record: https://catalog.princeton.edu/catalog/9930372423506421/). DSSG discussed and approved two changes to how these are indexed:

  1. Display the more specific language names from ISO 639-3, e.g., the record above should display "Language: Wu Chinese".
  2. Index both the more specific name and the more general/family name in the Language facet, so both "Chinese" and "Wu Chinese".

Implementation notes

  • Add the specific language name in the 'language_facet' field.
  • Don't add it in the 'language_iana_s' field.
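
A minimal sketch of the notes above, in plain Ruby rather than the actual traject configuration. The lookup tables and the `language_fields` helper are hypothetical, and what 'language_iana_s' really stores is not stated in this issue; the sketch simply assumes it is built only from the general codes.

```ruby
# Hypothetical lookup tables; a real indexer would load full ISO 639 data.
SPECIFIC_NAMES = { 'wuu' => 'Wu Chinese' }.freeze # ISO 639-3 code => name
GENERAL_NAMES  = { 'chi' => 'Chinese' }.freeze    # ISO 639-2B code => name

# Build the language fields for one record.
# general_codes:  ISO 639-2B codes already indexed today
# specific_codes: ISO 639-3 codes newly supplied by CaMS
def language_fields(general_codes:, specific_codes:)
  general  = general_codes.map { |code| GENERAL_NAMES[code] }.compact
  specific = specific_codes.map { |code| SPECIFIC_NAMES[code] }.compact
  {
    # Both the general and the more specific name go in the facet.
    'language_facet' => (general + specific).uniq,
    # Assumption: this field is derived from the general codes only,
    # so the specific ISO 639-3 codes are deliberately left out.
    'language_iana_s' => general_codes.dup
  }
end

fields = language_fields(general_codes: ['chi'], specific_codes: ['wuu'])
fields['language_facet'] # => ["Chinese", "Wu Chinese"]
```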

Checklist

@escowles escowles changed the title Indexing updates for ISO 693-3 three-letter codes Indexing updates for ISO 639-3 three-letter codes Jan 26, 2023
@mzelesky
Member

See https://catalog.princeton.edu/catalog/9930372423506421/staff_view

The 041s that carry ISO 639-3 codes have a $2 of iso639-3.

So traject can handle the 041s with $2 iso639-3 separately from 041s without that $2.
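
One way to picture that split, as a sketch only: the `Field` and `Subfield` structs below are stand-ins for real MARC field objects, `split_041_codes` is a hypothetical helper, and only $a is read for simplicity. The sample data is made up for illustration.

```ruby
# Minimal stand-ins for MARC data fields and subfields (hypothetical).
Subfield = Struct.new(:code, :value)
Field    = Struct.new(:tag, :subfields)

# True when the field carries $2 iso639-3.
def iso639_3_field?(field)
  field.subfields.any? { |sf| sf.code == '2' && sf.value == 'iso639-3' }
end

# Split a record's 041 fields into those with $2 iso639-3 and those
# without, returning the $a language codes from each group.
def split_041_codes(fields)
  f041 = fields.select { |field| field.tag == '041' }
  specific, general = f041.partition { |field| iso639_3_field?(field) }
  codes = lambda do |group|
    group.flat_map(&:subfields).select { |sf| sf.code == 'a' }.map(&:value)
  end
  { iso639_3: codes.call(specific), other: codes.call(general) }
end

sample = [
  Field.new('041', [Subfield.new('a', 'chi')]),
  Field.new('041', [Subfield.new('a', 'wuu'), Subfield.new('2', 'iso639-3')])
]
result = split_041_codes(sample)
result[:iso639_3] # => ["wuu"]
result[:other]    # => ["chi"]
```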

@christinach christinach added the DSSG approved Work approved by the DSSG group label Feb 28, 2023
@christinach
Member

Add the specific language name in the 'language_facet' field.

@sandbergja sandbergja self-assigned this Mar 3, 2023
@sandbergja
Member

sandbergja commented Mar 3, 2023

We aren't always guaranteed to be able to find the more general language name (the "macrolanguage" using ISO 639-3 terminology) given a certain ISO 639-3 code.

The provided example, wuu, is included in the ISO 639-3 macrolanguage table with a macrolanguage of zho (Chinese). In the example MARC record, the 008 contains the ISO 639-2B code chi (Chinese), so we could also generate the requested facet "Chinese" from that.

However, consider another example: nuz (Tlamacazapa Nahuatl) isn't mapped to a macrolanguage of Nahuatl, since no such macrolanguage exists in ISO 639-3 (it does in ISO 639-2 and ISO 639-1). It seems that we would want both "Tlamacazapa Nahuatl" and "Nahuatl" in the facets. Will CaMS catalogers always put both the ISO 639-3 and the ISO 639-2B code in these records?
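
To make the gap concrete, here is a rough sketch of the lookup with a fallback. The tables are tiny hand-written excerpts, not the real ISO 639-3 data, and `facet_names` is a hypothetical helper. It tries the macrolanguage table first and otherwise uses whatever ISO 639-2B code the record itself supplies (for example from the 008).

```ruby
# Hand-written excerpt standing in for the ISO 639-3 macrolanguage table
# (individual language code => macrolanguage code).
MACROLANGUAGES = { 'wuu' => 'zho' }.freeze

# Hand-written excerpt of code => display name.
NAMES = {
  'zho' => 'Chinese',
  'chi' => 'Chinese',
  'wuu' => 'Wu Chinese',
  'nah' => 'Nahuatl',
  'nuz' => 'Tlamacazapa Nahuatl'
}.freeze

# Return the facet names for one ISO 639-3 code. When the code has no
# macrolanguage, fall back to an ISO 639-2B code from the record, if any.
def facet_names(iso639_3_code, fallback_639_2b: nil)
  general_code = MACROLANGUAGES[iso639_3_code] || fallback_639_2b
  [NAMES[general_code], NAMES[iso639_3_code]].compact.uniq
end

facet_names('wuu')                         # => ["Chinese", "Wu Chinese"]
facet_names('nuz')                         # => ["Tlamacazapa Nahuatl"]
facet_names('nuz', fallback_639_2b: 'nah') # => ["Nahuatl", "Tlamacazapa Nahuatl"]
```

Without the fallback code in the record, nuz yields only the specific name, which is why the cataloging question matters.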

@sandbergja
Member

sandbergja commented Mar 3, 2023

Some possible gems:

We could also forgo using an additional gem, and use the Debian iso-codes package or the tables directly from SIL.

@sandbergja
Member

sandbergja commented Mar 3, 2023

Also noting that, while we currently use the language_facet field for both the facet and the show page, we'll have to use two separate fields (and do the appropriate orangelight work) in order to meet these requirements.

Also, for the facet display, do we want them to be presented as a list, or as a "pivot facet" like the call number facet?
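
For reference, Solr requests a pivot facet with the `facet.pivot` parameter, naming the fields in hierarchy order. A small sketch of the query parameters follows; the two field names and the `pivot_facet_params` helper are hypothetical, and this is not how the call number facet is necessarily configured.

```ruby
require 'uri'

# Hypothetical field names for the general and specific language values.
GENERAL_FIELD  = 'language_general_facet'
SPECIFIC_FIELD = 'language_specific_facet'

# Build Solr parameters asking for a pivot facet over the given fields,
# listed from the top of the hierarchy down.
def pivot_facet_params(*fields)
  {
    'q' => '*:*',
    'rows' => 0,
    'facet' => true,
    'facet.pivot' => fields.join(',')
  }
end

params = pivot_facet_params(GENERAL_FIELD, SPECIFIC_FIELD)
query_string = URI.encode_www_form(params)
```

A flat list, by contrast, needs only a single facet field, as in the current 'language_facet' setup.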

@maxkadel
Contributor

maxkadel commented Mar 6, 2023

Trying to understand this issue better. These descriptions of the relationship between the ISO 639-2/ISO 639-5 and ISO 639-3 standards were helpful to me: https://www.loc.gov/standards/iso639-2/iso639jac.html and https://iso639-3.sil.org/about/relationships

It looks like Nahuatl is a "collective", rather than a "macrolanguage" (see https://iso639-3.sil.org/code/nah vs. https://iso639-3.sil.org/code/zho).

From https://www.loc.gov/standards/iso639-2/faq.html#18:

"Macrolanguages are distinguished from language collections in that the individual languages that correspond to a macrolanguage must be very closely related, and there must be some domain in which only a single language identity is recognized."

So, I think that it is intentional that the different languages with "Nahuatl" in their name are not grouped together in the macrolanguages table (presumably because they are not linguistically closely related, although they share a name - I am not a linguist).

I think that in order to do the work of the future ticket of linking together Indigenous languages, we will need to create our own concordance, as we did with the "augment the subject" work for "Indigenous Studies".

I like the idea of making this a pivot facet.

Also, there is only one (legit) app that currently depends on either of the gems listed above, but the https://github.com/bbenno/languages gem supports the Homosaurus vocabulary, which we also use, so it might be a cool project to support.

