name prefixes and suffixes #35

Open
qgroom opened this issue May 31, 2020 · 30 comments

Comments

@qgroom
Member

qgroom commented May 31, 2020

Name prefixes and suffixes are important disambiguating information. These should not be lost when data are converted or transcribed.

@matdillen

How do we define what constitutes a prefix and what constitutes a suffix? Maybe it's more worthwhile to work with different aliases in general rather than separation into different properties. That also addresses the issue of name changes, such as with maiden names.

@qgroom
Member Author

qgroom commented Jun 2, 2020

At least in English there is a well-defined list of prefixes and suffixes, e.g. https://mediawiki.middlebury.edu/LIS/Name_Standards

@qgroom
Member Author

qgroom commented Jun 2, 2020

> How do we define what constitutes a prefix and what constitutes a suffix? Maybe it's more worthwhile to work with different aliases in general rather than separation into different properties. That also addresses the issue of name changes, such as with maiden names.

I like your aliases idea, but perhaps we should flag this issue for discussion

@dshorthouse
Contributor

The more we dig into this, the more we'll realize that human names are one of the harder problems to solve in informatics because there are many, many culture-specific usages with numerous edge cases. The last thing we want to do is perpetuate a Western view by baking unnecessary structure into something that is necessarily fluid. That said, schema.org is clearly biased toward a Western view. Here are some links if you do want to dig into this:

https://en.wikipedia.org/wiki/Personal_name
https://en.wikipedia.org/wiki/Honorific
https://en.wikipedia.org/wiki/Title

My preference is to allow these to be shared as aliases (#20), with the intent to supplement & enhance search, as opposed to bundling them into rigidly defined reconciliation groups.

@qgroom
Member Author

qgroom commented Jun 3, 2020

In that case we should perhaps get rid of the verbatim name concept, because this would just be one alias.

@dshorthouse
Contributor

@qgroom Good point, though we'd want to change the example and tweak the definition of #20 to accommodate what verbatimName was also meant to contain. However, we get stuck chasing our own tails when we do not have the capacity to additionally express in what language/script any one of these #20 represent, nor which of them is uninterpreted, as written on a label.

@qgroom
Member Author

qgroom commented Jun 3, 2020

The point of the extension is not to provide a whole biography of the person so perhaps restricting entries to two names is sufficient

@dshorthouse
Contributor

Agree. I think our task here is to provide guidance on which two names & to seek consistency across all data providers who will commit to do the same. Are you suggesting we keep verbatimName and drop alternateName (or vice versa)?

What about when the verbatim text on a collector label reads "Mrs. and Mr. John Smith" and we want to represent the wife in the duo? Is the verbatimName (if there is such a thing) "Mrs. Smith" in this case, or are we interpreting the content?

@matdillen

We are covering names related to occurrences. So, at most, we would expect two names for a single agent:

  • the verbatim name as it is connected to the occurrence (e.g. written on the label).
  • a general name that makes it clear who this person is (e.g. if the verbatim name is abbreviated).

Agents can have other aliases, but these could be obtained through the identifier. There could be multiple aliases tied to a single occurrence, but I suppose this would mostly be in different roles or tied to different actions?

@matdillen

> What about when the verbatim text on a collector label reads "Mrs. and Mr. John Smith" and we want to represent the wife in the duo? Is the verbatimName (if there is such a thing) "Mrs. Smith" in this case, or are we interpreting the content?

I suppose verbatimName should be "Mrs. and Mr. John Smith" for both agents. name can be "Mrs. John Smith" and "Mr. John Smith" respectively?

@wouteraddink

I guess this is only for attribution of dead people, right? As for living people we should rely on identifiers like ORCID iDs.

@dshorthouse
Contributor

@wouteraddink No, there's no distinction here for living/dead. There is no concept of a version of record (or versioning) for one's name in ORCID & not really modelled as such in Wikidata for that matter. So, while we will of course use ORCID, VIAF, etc. as the glue to stitch together datasets or to share disambiguations, we still do require evidence as expressed at a particular time on a particular occurrence record.

@qgroom
Member Author

qgroom commented Jun 3, 2020

> What about when the verbatim text on a collector label reads "Mrs. and Mr. John Smith" and we want to represent the wife in the duo? Is the verbatimName (if there is such a thing) "Mrs. Smith" in this case, or are we interpreting the content?
>
> I suppose verbatimName should be "Mrs. and Mr. John Smith" for both agents. name can be "Mrs. John Smith" and "Mr. John Smith" respectively?

Agreed, but it will be critical to provide guidance that recommends that people do not strip the prefixes and suffixes from the verbatimName. Ultimately, it is the frequent loss of these elements that started this issue.
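
A minimal sketch of the proposal above, for illustration only: each agent in the duo keeps the untouched label text as its verbatim name and gets its own interpreted name. The `AgentEntry` struct and its field names are hypothetical and are not taken from any agreed schema.

```ruby
# Hypothetical record layout: one entry per agent, both sharing the label text.
LABEL_TEXT = "Mrs. and Mr. John Smith"

AgentEntry = Struct.new(:verbatim_name, :name, keyword_init: true)

agents = [
  AgentEntry.new(verbatim_name: LABEL_TEXT, name: "Mrs. John Smith"),
  AgentEntry.new(verbatim_name: LABEL_TEXT, name: "Mr. John Smith")
]

# The verbatim string is copied as written, so prefixes survive transcription.
agents.each { |a| puts "#{a.name} (as written: #{a.verbatim_name})" }
```

The point of keeping both fields is that the interpretation can be revised later without losing what was actually on the label.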

@wouteraddink

wouteraddink commented Jun 3, 2020 via email

@qgroom
Member Author

qgroom commented Jun 3, 2020

> I guess we need to get all the occurrence recording tools to store ORCID iDs in the records by default :) Starting with iNaturalist..

iNaturalist does already. inaturalist/inaturalist#2444
They started doing so just after Biodiversity Next, though I think it was triggered by a request from WikiProject Biodiversity https://www.wikidata.org/wiki/Wikidata:WikiProject_Biodiversity

@wouteraddink

wouteraddink commented Jun 3, 2020 via email

@timrobertson100
Member

> I guess we need to get all the occurrence recording tools to store ORCID iDs in the records by default :) Starting with iNaturalist..

Yes. Relatedly, multiple values for recordedByID (and identifiedByID) are supported in GBIF.org too. GBIF will support search by any value provided, but recognizes ORCID and Wikidata IDs explicitly. Example web search and API response.

There is work underway to expand this with the agent-action attribution extension for Darwin Core which is about ready to deploy in the sandbox environment for IPT users.

@SiobhanLeachman

I agree that "it will be critical to provide guidance that recommends that people do not strip the prefixes and suffixes from the verbatimName. Ultimately, it is the frequent loss of these elements that started this issue." To me this comment hits the nail on the head, particularly for women collectors. Practically, when trying to disambiguate people, clues such as prefixes and suffixes can be vital for connecting and linking them to their work through identifiers such as Wikidata. Should prefixes and suffixes be stripped from verbatimName their collecting work will be much more likely to be attributed to the incorrect person. This is a current systemic bias that is rife in many herbarium databases. I always come back to this example when thinking about this issue. [https://www.wikidata.org/wiki/Q66487066] Mrs J N Rose is the name used to refer to her in papers and databases. But it's the vital "Mrs" that notifies people she's a different person than J N Rose.

@dshorthouse
Contributor

I pushed this out yesterday as a first pass toward splitting and parsing lists of people names into their component parts: https://bionomia.net/parse. Strikes me that we need a common vocabulary for parts of names such that when we make recommendations on how they be stored (and shared anew & reconstructed), we can say something about what are these component parts. At the moment (cultural preferences notwithstanding), I have terms like title, appellation, given, particle, family, and suffix.

@SiobhanLeachman

I've just tried it out with the examples you've given and really like how it performs. It is clear, concise and seems to deal well with most of the permutations of people's names given in https://mediawiki.middlebury.edu/LIS/Name_Standards. However one issue worth considering is the "family" name where there might be two or even more permutations. I know for Wikidata I currently enter both a woman's maiden name and her married name under the family name property but then qualify it with statements about the nature of that name using the "object has role" property. If she has been married more than once I also add the qualifier property "series ordinal" to indicate the series of married names she has. As regards Spanish names, in Wikidata there is the property "second family name in Spanish name". I'm wondering how the "family" will deal with names like these. Will it be a matter of adding all the family names with just a space between them as given in https://mediawiki.middlebury.edu/LIS/Name_Standards?

@debpaul

debpaul commented Aug 7, 2020

I tried it out and it works great :-) I do note that it did not (cannot?) decide how to parse initials only. For example, Robert K Godfrey often signed his herbarium sheet labels with only RKG (especially for the det). What to do with initials only?

@debpaul

debpaul commented Aug 7, 2020

To @matdillen et al., the question of multiple names interests me. How do we help with the aliases situation? What recommendations do we make? Note a recent thread about one M H Fitch. While hunting for more labels from this person, I had the good (amazing?) fortune to find a herbarium sheet where this person signed their name twice. It took me a bit of sleuthing to decide that's what the two names on the label were intended to represent. So on this label M H FITCH == Mrs D B FITCH and the H = maiden name = HUNTINGTON. We have three names here (and more if you count the ones where we spell out her first names, and the names of her husband). What to do? See the Twitter thread for the whole story, and the label on the herbarium sheet for how the signatures appear. @SiobhanLeachman did the rest of the detective work to bring all the pieces together.

@deepreef

deepreef commented Aug 8, 2020

Cool! I just tried a set of 1,000 records (out of >47K unique text strings) from my Agent "dirty bucket", and like @debpaul the results were AWESOME! I guess I'll need to run it 48 times, though -- but it's so fast that shouldn't take long! My main problem is that I haven't finished parsing out multi-person text strings yet, so I've still got a fair bit of work to do before I can run with it.

As for the Alias situation, after years (decades, actually) of wrestling with this, I see no viable option other than to define two distinct classes: Agent and AgentName. Instances of the former are technically instances of dwc:Organism (including both people and organizations). Instances of the latter are UTF-8 text-string literals. There are some parallels with scientific nomenclature (in terms of homonyms, synonyms [=aliases], etc.), and the relationship with taxon concepts, but it's a bit more free-form because there is no universally adopted Code of Nomenclature for Agents.

Regarding what elements to parse, I use four: Prefix, GivenName, FamilyName, Suffix. All four may include one or more space characters. I handle cases of initials like "RKG" as GivenName="RK", FamilyName = "G"; but that's a bit dangerous because it requires assumptions to parse.

The only aspect of this that I'm fairly convinced of is the distinction between the two classes "Agent" and "AgentName" (including the need for distinct unique identifiers for each). The rest I'm totally flexible on and am very interested in what this group ultimately concludes.
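
As a rough illustration of the two-class idea above, the sketch below keeps an agent and its name strings as separate objects, each with its own identifier. The struct layouts and the use of random UUIDs are assumptions made for the example, and the two name strings are the M H Fitch signatures mentioned earlier in the thread.

```ruby
require "securerandom"

# AgentName: a text-string literal with its own identifier (hypothetical layout).
AgentName = Struct.new(:id, :literal, keyword_init: true)

# Agent: the person or organization, identified separately from any of its names.
Agent = Struct.new(:id, :name_ids, keyword_init: true)

# Two name strings found on one label, both referring to the same person.
names = ["M H Fitch", "Mrs D B Fitch"].map do |literal|
  AgentName.new(id: SecureRandom.uuid, literal: literal)
end

# The agent points at its names by identifier; the literals stay untouched.
agent = Agent.new(id: SecureRandom.uuid, name_ids: names.map(&:id))

names.each { |n| puts "#{n.literal} -> agent #{agent.id}" }
```

Under this arrangement a new alias is just another AgentName linked to the same Agent, and none of the existing strings need to be edited.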

@SiobhanLeachman

I've just come across several herbarium sheets which list the collector as "J. G. Lemmon and wife" see https://www.gbif.org/occurrence/437664829. I can't imagine this will be the only instance of this. Any idea how this might be dealt with? By the way I'm in the process of attributing these specimens via Bionomia to Sara Plummer Lemmon https://www.wikidata.org/wiki/Q4815030.

@dshorthouse
Contributor

@debpaul @deepreef @SiobhanLeachman We got off on a tangent with this ticket, which I believe was meant to capture our thoughts about name prefixes and suffixes, but this is good! It means there are many other nuances to contend with. I hope we don't lose track of all these, but I'm not yet sure what to do about them. Perhaps a working, controlled vocab for parts of names is one way to do it. But this could be a fool's errand unless there are specific implementations that require that names be parsed into their components.

Insofar as Bionomia is concerned, basic parsing is done primarily to help create scores of pairwise structural similarities of names that would otherwise be lost through dumb search. In effect, parsing helps enhance search. For example, an agent namestring like "M.A. Smith" would have a greater structural similarity to "Michael Allen Smith" than to either "Michael P. Smith" (poor, likely different person) or "Michael Smith" (middling score). It's possible that name prefixes and suffixes could help inform such scores & then improve on search, but I've not yet investigated this. Clearly however, there are cultural issues that parsing names must be sensitive to.

Finally, if we deem prefixes and suffixes to be important, then we have to say something about how they should be stored in agents tables in collections management systems & then how they should be reconstructed (or not) prior to sharing alongside occurrence records.
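
To make the pairwise scoring idea above concrete, here is a toy sketch. It is not Bionomia's actual algorithm: the `similarity` function and its scoring rule are invented for illustration, and it assumes the family name is the last token. It only reproduces the ordering described for the "M.A. Smith" example.

```ruby
# Split a name into lowercase tokens, treating "." as a separator.
def tokens(name)
  name.tr(".", " ").split.map(&:downcase)
end

# Two given-name tokens agree if they are equal, or if one is a
# single-letter initial matching the first letter of the other.
def compatible?(a, b)
  short, long = [a, b].sort_by(&:length)
  a == b || (short.length == 1 && long.start_with?(short))
end

# Toy score in 0.0..1.0: family names must match, a conflicting given
# name or initial scores 0.0, and missing given names lower the score.
def similarity(x, y)
  *given_x, family_x = tokens(x)
  *given_y, family_y = tokens(y)
  return 0.0 unless family_x == family_y

  longest = [given_x.length, given_y.length].max
  return 1.0 if longest.zero?

  shared = [given_x.length, given_y.length].min
  pairs = given_x.first(shared).zip(given_y.first(shared))
  return 0.0 unless pairs.all? { |a, b| compatible?(a, b) }

  shared.to_f / longest
end

["Michael Allen Smith", "Michael Smith", "Michael P. Smith"].each do |other|
  puts "M.A. Smith vs #{other}: #{similarity('M.A. Smith', other)}"
end
```

With this rule "Michael Allen Smith" scores 1.0, "Michael Smith" 0.5, and "Michael P. Smith" 0.0, which matches the high, middling, and poor ranking in the example.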

@deepreef

I support capturing/parsing prefixes & suffixes, and treating them as distinct components of an agent name-string from the "proper" name. Suffixes, in particular, can be important for disambiguation. @dshorthouse : can you elaborate on your definitions for title, appellation, particle, and suffix, and also comment on the value of parsing these as distinct properties (as opposed to generic prefix and suffix)?

Regarding my four elements, I have no formal definitions, but generally think of them as:
Prefix: Text characters that are unambiguously associated with and typically precede the fully-formatted proper name.
FamilyName: The latter part(s) of a fully-formatted proper name, beginning with the first character of the string by which the name is generally alphabetized in a bibliographic listing.
GivenName: Part(s) of the proper name that precede the FamilyName and are typically delineated from the FamilyName by a space character.
Suffix: Text characters that are unambiguously associated with and typically follow the fully-formatted proper name.

I realize these are vague definitions, but as my wife once pointed out to me many years ago: "It's better to be vaguely correct than precisely wrong".
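
For illustration, the four elements above could be carried in a small record like the one sketched here. The `ParsedName` struct, the `assumed` flag, and the `parse_initials` helper are hypothetical. The helper applies the initials-only convention mentioned earlier in the thread ("RKG" as GivenName "RK", FamilyName "G") and marks the result as a guess, since that split requires assumptions.

```ruby
# Hypothetical record for the four elements, plus a flag for guessed parses.
ParsedName = Struct.new(:prefix, :given_name, :family_name, :suffix, :assumed,
                        keyword_init: true)

# Treat a run of capitals as initials: the last letter becomes the family
# name and everything before it the given name. Returns nil otherwise.
def parse_initials(text)
  return nil unless text.match?(/\A[A-Z]{2,}\z/)

  ParsedName.new(given_name: text[0..-2], family_name: text[-1], assumed: true)
end

parsed = parse_initials("RKG")
puts "given=#{parsed.given_name} family=#{parsed.family_name} assumed=#{parsed.assumed}"
```

Keeping the flag alongside the parsed parts lets downstream users tell a guessed split apart from one that was read directly off a full name.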

@dshorthouse
Contributor

Moi, definitions?

The parser I use makes heavy use of the Namae gem, itself used to parse citations. It's based on a parsing expression grammar – positional rather than semantic. I merely do a scary amount of cleaning prior to passing "clean(er) strings"™, generally because of the mess of stuff commonly found in recordedBy. The key:value pairs are used in that gem and I merely pass 'em along. That said, here are their definitions written as regex gobbledy-gook:

TITLE = /\s*\b(sir|count(ess)?|colonel|(gen|adm|col|maj|capt|cmdr|lt|sgt|cpl|pvt|prof|dr|md|ph\.?d|rev|mme|abbé|ptre|bro|esq)\.?|docteur|father|cantor|vicar|père|pastor|rabbi|reverend|pere|soeur|sister|professor)(\s+|$)/i

APPELLATION = /\s*\b((mrs?|ms|fr|hr)\.?|miss|herr|frau)(\s+|$)/i

SUFFIX = /\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,})(\.|\b)/

These are by no means tied to formal, linguistic definitions but are merely "good enough"™. The author of the Namae gem did not provision a config for a "particle", but these are generally conjunctions like "von", "de", "di", "d'", "l'", "el", "der", "da" and the like. I have no immediate use for all these other bits except to set them aside while building pairwise comparisons of family names and various renditions of given names as a means to enhance what gets returned through search. However, my use-case through Bionomia leans more to the permissive side because it's reliant on human judgement to a great extent.
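
A small sketch of how such patterns might be used to set the extra bits aside before comparing names. The APPELLATION and SUFFIX patterns are copied from the comment above, while `extract` and `set_aside` are hypothetical helpers written for this example and are not part of the Namae gem.

```ruby
# Patterns as quoted in the comment above.
APPELLATION = /\s*\b((mrs?|ms|fr|hr)\.?|miss|herr|frau)(\s+|$)/i
SUFFIX = /\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,})(\.|\b)/

# Pull the first match of a pattern out of the text.
# Returns [matched_part_or_nil, remaining_text].
def extract(pattern, text)
  m = pattern.match(text)
  return [nil, text] unless m

  [m[1], [m.pre_match, m.post_match].join(" ").strip]
end

# Set the appellation and suffix aside rather than discarding them.
def set_aside(name)
  appellation, rest = extract(APPELLATION, name)
  suffix, rest = extract(SUFFIX, rest)
  { appellation: appellation, suffix: suffix, rest: rest }
end

p set_aside("Mrs J N Rose")
p set_aside("J N Rose")
p set_aside("John Smith Jr.")
```

Both Rose strings reduce to the same remainder, so only the retained appellation distinguishes them, which is the reason raised earlier in the thread for not throwing these parts away.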

@deepreef

OK, this is very helpful! So basically they're defined through enumerations (fair enough). I'm not sure I buy that it's necessary to distinguish title from appellation, but I do like the idea of prefix, particle and suffix -- especially if the particle can be defined by enumeration. I guess I'm always coming at these things from the semantic perspective (for better, or for worse).

@qgroom
Member Author

qgroom commented Aug 11, 2020

Something that has been on my mind since the preconference workshop of Biodiversity Next is to create a disambiguation manual for person names in collections. You all have a massive amount of knowledge and experience, and it would be valuable to document the processes, logic, best practices and resources available for disambiguation. We already have a wealth of material, but it needs structuring, and I think we need to formalize some of the procedures we perhaps take for granted.
Again, this is a digression from this issue, but I will continue to work on this and bring it up at the next Task Group meeting.

@deepreef

@qgroom : absolutely! I'm happy to help in any way I can.
