name prefixes and suffixes #35
Comments
How do we define what constitutes a prefix and what constitutes a suffix? Maybe it's more worthwhile to work with different aliases in general rather than separation into different properties. That also addresses the issue of name changes, such as with maiden names. |
At least in English there is a well-defined list of prefixes and suffixes, e.g. https://mediawiki.middlebury.edu/LIS/Name_Standards |
I like your aliases idea, but perhaps we should flag this issue for discussion |
The more we dig into this, the more we'll realize that human names are one of the harder problems to solve in informatics because there are many, many culture-specific usages with numerous edge cases. The last thing we want to do is perpetuate a Western view by baking unnecessary structure into something that is necessarily fluid. That said, schema.org is clearly biased to a Western view. Here is a link if you do want to dig into this: https://en.wikipedia.org/wiki/Personal_name My preference is to allow these to be shared as aliases, #20, with the intent to supplement & enhance search as opposed to a bundling of rigidly defined reconciliation groups. |
In that case we should perhaps get rid of the verbatim name concept, because this would just be one alias. |
@qgroom Good point, though we'd want to change the example and tweak the definition of #20 to accommodate what verbatimName was also meant to contain. However, we get stuck chasing our own tails when we do not have the capacity to additionally express in what language/script any one of these #20 represent, nor which of them is uninterpreted, as written on a label. |
The point of the extension is not to provide a whole biography of the person so perhaps restricting entries to two names is sufficient |
Agree. I think our task here is to provide guidance on which two names & to seek consistency across all data providers who will commit to do the same. Are you suggesting we keep What about when the verbatim text on a collector label is written, |
We are covering names related to occurrences. So, at most, we would expect two names for a single agent:
Agents can have other aliases, but these could be obtained through the |
I suppose |
I guess this is only for attribution of dead people, right? As for living people we should rely on identifiers like ORCID iDs. |
@wouteraddink No, there's no distinction here for living/dead. There is no concept of a version of record (or versioning) for one's name in ORCID & not really modelled as such in Wikidata for that matter. So, while we will of course use ORCID, VIAF, etc. as the glue to stitch together datasets or to share disambiguations, we still do require evidence as expressed at a particular time on a particular occurrence record. |
Agreed, but it will be critical to provide guidance that recommends that people do not strip the prefixes and suffixes from the |
I guess we need to get all the occurrence recording tools to store ORCID iDs in the records by default :) Starting with iNaturalist..
|
iNaturalist does already: inaturalist/inaturalist#2444. They started just after Biodiversity Next, though I think it was triggered by a request from WikiProject Biodiversity https://www.wikidata.org/wiki/Wikidata:WikiProject_Biodiversity |
Cool, I did not know that. |
Yes. Related, is that multiple values for recordedByID (and identifiedByID) are supported in GBIF.org too. GBIF will support search by any value provided but recognize ORCID and WikiData IDs explicitly. Example web search and API response. There is work underway to expand this with the agent-action attribution extension for Darwin Core which is about ready to deploy in the sandbox environment for IPT users. |
I agree that "it will be critical to provide guidance that recommends that people do not strip the prefixes and suffixes from the verbatimName. Ultimately, it is the frequent loss of these elements that started this issue." To me this comment hits the nail on the head, particularly for women collectors. Practically, when trying to disambiguate people, clues such as prefixes and suffixes can be vital for connecting and linking them to their work through identifiers such as Wikidata. Should prefixes and suffixes be stripped from verbatimName their collecting work will be much more likely to be attributed to the incorrect person. This is a current systemic bias that is rife in many herbarium databases. I always come back to this example when thinking about this issue. [https://www.wikidata.org/wiki/Q66487066] Mrs J N Rose is the name used to refer to her in papers and databases. But it's the vital "Mrs" that notifies people she's a different person than J N Rose. |
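To make the "Mrs J N Rose" point concrete, here is a minimal sketch of how stripping a prefix collapses two distinct name strings into one. The prefix list and the helper names (`PREFIXES`, `strip_prefix`, `group_by_key`) are hypothetical and only for illustration; they are not taken from any collections management system.

```python
# Hypothetical, deliberately short list of English prefixes (an assumption).
PREFIXES = {"mr", "mrs", "miss", "ms", "dr", "prof"}

def strip_prefix(name: str) -> str:
    """Drop a leading prefix token such as 'Mrs' (the lossy step under discussion)."""
    tokens = name.split()
    if tokens and tokens[0].rstrip(".").lower() in PREFIXES:
        tokens = tokens[1:]
    return " ".join(tokens)

def group_by_key(names, key):
    """Group name strings by the value of key(name), preserving input order."""
    groups = {}
    for name in names:
        groups.setdefault(key(name), []).append(name)
    return groups

labels = ["Mrs J N Rose", "J N Rose"]

# Kept verbatim, the two strings stay apart.
verbatim_groups = group_by_key(labels, key=lambda n: n)

# With the prefix stripped, both strings land under the same key.
stripped_groups = group_by_key(labels, key=strip_prefix)
```

With the verbatim strings there are two groups; after stripping there is only one, so any later attribution step can no longer tell the two people apart from the name alone.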
I pushed this out yesterday as a first pass toward splitting and parsing lists of people names into their component parts: https://bionomia.net/parse. Strikes me that we need a common vocabulary for parts of names such that when we make recommendations on how they be stored (and shared anew & reconstructed), we can say something about what are these component parts. At the moment (cultural preferences notwithstanding), I have terms like |
I've just tried it out with the examples you've given and really like how it performs. It is clear, concise and seems to deal well with most of the permutations of people's names given in https://mediawiki.middlebury.edu/LIS/Name_Standards. However one issue worth considering is the "family" name where there might be two or even more permutations. I know for Wikidata I currently enter both a woman's maiden name and her married name under the family name property but then qualify it with statements about the nature of that name using the "object has role" property. If she has been married more than once I also add the qualifier property "series ordinal" to indicate the series of married names she has. As regards Spanish names, in Wikidata there is the property "second family name in Spanish name". I'm wondering how the "family" will deal with names like these. Will it be a matter of adding all the family names with just a space between them as given in https://mediawiki.middlebury.edu/LIS/Name_Standards? |
I tried it out and it works great :-) I do note that it did not (cannot?) decide how to parse initials only. Example, Robert K Godfrey often signed his herbarium sheet labels with only RKG (especially for the det). What to do with initials only? |
To @matdillen et al, the multiple names interests me. How do we help with the aliases situation? What recommendations do we make? Note a recent thread with one M H Fitch. In hunting more labels from this person, I had the good (amazing?) fortune to find a herbarium sheet where this person signed their name twice. It took me a bit of sleuthing to decide that's what the two names on the label were intended to represent. So on this label M H FITCH == Mrs D B FITCH and the H = maiden name = HUNTINGTON. We have three names here (and more if you count the ones where we spell out her first names, and the names of her husband). What to do? See the Twitter thread for the whole story, and the label on the herbarium sheet for how the signatures appear. @SiobhanLeachman did the rest of the detective work to bring all the pieces together. |
Cool! I just tried a set of 1,000 records (out of >47K unique text strings) from my Agent "dirty bucket", and like @debpaul the results were AWESOME! I guess I'll need to run it 48 times, though -- but it's so fast that shouldn't take long! My main problem is that I haven't yet finished parsing out multi-person text strings, so I've still got a fair bit of work to do before I can run with it. As for the Alias situation, after years (decades, actually) of wrestling with this, I see no viable option other than to define two distinct classes: Agent and AgentName. Instances of the former are technically instances of dwc:Organism (including both people and organizations). Instances of the latter are UTF-8 text-string literals. There are some parallels with scientific nomenclature (in terms of homonyms, synonyms [=aliases], etc.), and the relationship with taxon concepts, but it's a bit more free-form because there is no universally adopted Code of Nomenclature for Agents. Regarding what elements to parse, I use four: Prefix, GivenName, FamilyName, Suffix. All four may include one or more space characters. I handle cases of initials like "RKG" as GivenName="RK", FamilyName = "G"; but that's a bit dangerous because it requires assumptions to parse. The only aspect of this that I'm fairly convinced of is the distinction between the two classes "Agent" and "AgentName" (including the need for distinct unique identifiers for each). The rest I'm totally flexible on and am very interested in what this group ultimately concludes. |
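A rough sketch of the Agent / AgentName split described above, assuming Python dataclasses. The class layout, field names, and identifier strings are all hypothetical, as is the way the two FITCH strings are parsed; nothing here is an agreed schema. `parse_initials` follows the "RKG" convention from the comment and carries the same caveat: it assumes the last letter is the family-name initial.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class AgentName:
    """A name string literal with its own identifier and optional parsed parts."""
    id: str
    verbatim: str          # the text as written, never altered
    prefix: str = ""
    given_name: str = ""
    family_name: str = ""
    suffix: str = ""

@dataclass
class Agent:
    """A person or organization; may be known by many AgentName instances."""
    id: str
    names: List[AgentName] = field(default_factory=list)

    def name_strings(self) -> List[str]:
        return [n.verbatim for n in self.names]

def parse_initials(initials: str) -> dict:
    """Assumption: the last letter stands for the family name (e.g. 'RKG')."""
    return {"given_name": initials[:-1], "family_name": initials[-1:]}

# One agent, two aliases, using the M H FITCH example from this thread.
# Identifiers and the parse of each string are illustrative only.
fitch = Agent(
    id="agent:1",
    names=[
        AgentName(id="name:1", verbatim="M H FITCH",
                  given_name="M H", family_name="FITCH"),
        AgentName(id="name:2", verbatim="Mrs D B FITCH",
                  prefix="Mrs", given_name="D B", family_name="FITCH"),
    ],
)
```

Keeping `verbatim` separate from the parsed parts means a questionable parse (such as the initials case) never overwrites the evidence on the label.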
I've just come across several herbarium sheets which list the collector as "J. G. Lemmon and wife" see https://www.gbif.org/occurrence/437664829. I can't imagine this will be the only instance of this. Any idea how this might be dealt with? By the way I'm in the process of attributing these specimens via Bionomia to Sara Plummer Lemmon https://www.wikidata.org/wiki/Q4815030. |
@debpaul @deepreef @SiobhanLeachman We got off on a tangent with this ticket, which I believe was meant to capture our thoughts about name prefixes and suffixes, but this is good! It means there are many other nuances to contend with. I hope we don't lose track of all these, but I'm not yet sure what to do about them. Perhaps a working, controlled vocab for parts of names is one way to do it. But, this could be a fool's errand unless there are specific implementations that require that names be parsed into their components. Insofar as Bionomia is concerned, basic parsing is done primarily to help create scores of pairwise structural similarities of names that would otherwise be lost through dumb search. In effect, parsing helps enhance search. For example, an agent namestring like "M.A. Smith" would have a greater structural similarity to "Michael Allen Smith" than to either "Michael P. Smith" (poor, likely different person) or "Michael Smith" (middling score). It's possible that name prefixes and suffixes could help inform such scores & then improve on search, but I've not yet investigated this. Clearly however, there are cultural issues that parsing names must be sensitive to. Finally, if we deem prefixes and suffixes to be important, then we have to say something about how they should be stored in agents tables in collections management systems & then how they should be reconstructed (or not) prior to sharing alongside occurrence records. |
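The "M.A. Smith" example can be sketched as a toy pairwise score. This is not Bionomia's scoring; the weights (1.0, 0.8, 0.5), the function names, and the rule that the last token is the family name are all assumptions chosen only to reproduce the ordering described above. Prefixes, suffixes, and particles are ignored.

```python
import re

def tokens(name: str):
    """Lowercased alphabetic tokens, so 'M.A. Smith' gives ['m', 'a', 'smith']."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", name)]

def token_score(a: str, b: str):
    """Score one pair of given-name tokens, or None if they conflict."""
    if a == b and len(a) > 1:
        return 1.0                      # identical full names
    if (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return 0.8                      # an initial compatible with the other token
    return None                         # e.g. 'a' vs 'p', or 'allen' vs 'peter'

def similarity(x: str, y: str) -> float:
    """Toy structural similarity in [0, 1]; assumes the last token is the family name."""
    tx, ty = tokens(x), tokens(y)
    if not tx or not ty or tx[-1] != ty[-1]:
        return 0.0
    gx, gy = tx[:-1], ty[:-1]
    longest = max(len(gx), len(gy))
    if longest == 0:
        return 1.0
    total = 0.0
    for a, b in zip(gx, gy):
        score = token_score(a, b)
        if score is None:
            return 0.0                  # a conflicting given name rules the pair out
        total += score
    # Given names present on one side only are weak evidence either way.
    total += 0.5 * (longest - min(len(gx), len(gy)))
    return total / longest

full = similarity("M.A. Smith", "Michael Allen Smith")
short = similarity("M.A. Smith", "Michael Smith")
conflict = similarity("M.A. Smith", "Michael P. Smith")
```

Under these assumed weights `full` exceeds `short`, and `conflict` is zero, matching the good / middling / poor ordering in the comment.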
I support capturing/parsing prefixes & suffixes, and treating them as distinct components of an agent name-string from the "proper" name. Suffixes, in particular, can be important for disambiguation. @dshorthouse : can you elaborate on your definitions for Regarding my four elements, I have no formal definitions, but generally think of them as: I realize these are vague definitions, but as my wife once pointed out to me many years ago: "It's better to be vaguely correct than precisely wrong". |
Moi, definitions? The parser I use makes heavy use of the Namae gem, itself used to parse citations. It's based on a parsing expression grammar – positional rather than semantic. I merely do a scary amount of cleaning prior to passing "clean(er) strings"™, generally because of the mess of stuff commonly found in
These are by no means tied to formal, linguistic definitions but are merely "good enough"™. The author of the Namae gem did not provision a config for a "particle", but these are generally conjunctions like "von", "de", "di", "d'", "l'", "el", "der", "da" and the like. I have no immediate use for all these other bits except to set them aside while building pairwise comparisons of family names and various renditions of given names as a means to enhance what gets returned through search. However, my use-case through Bionomia leans more to the permissive side because it's reliant on human judgement to a great extent. |
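As an illustration of setting a particle aside before comparing family names, here is a small sketch. It does not reproduce the Namae gem's behaviour or API; the function name `split_particle` is hypothetical, and the particle list is just the handful quoted above, so names with other particles pass through unchanged. Only the straight apostrophe is handled.

```python
# Particles quoted in the comment above; not an exhaustive list.
PARTICLES = ("von", "de", "di", "d'", "l'", "el", "der", "da")

def split_particle(family: str):
    """Return (particle, remainder) for a family-name string.

    Leading particles are peeled off one at a time, so 'von der Heide'
    gives ('von der', 'Heide'). Matching is case-insensitive.
    """
    parts = []
    rest = family.strip()
    while True:
        lowered = rest.lower()
        for p in PARTICLES:
            if p.endswith("'"):
                # Elided forms attach directly: d'Orbigny
                hit = lowered.startswith(p) and len(rest) > len(p)
                cut = len(p)
            else:
                # Whole-word forms must be followed by a space: 'de Candolle'
                hit = lowered.startswith(p + " ")
                cut = len(p) + 1
            if hit:
                parts.append(rest[:len(p)])
                rest = rest[cut:].lstrip()
                break
        else:
            break  # no particle matched at the front; stop peeling
    return " ".join(parts), rest
```

For example, `split_particle("d'Orbigny")` yields `("d'", "Orbigny")`, while `split_particle("Delgado")` is left whole because "de" must be followed by a space to count as a particle.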
OK, this is very helpful! So basically they're defined through enumerations (fair enough). I'm not sure I buy that it's necessary to distinguish |
Something that has been on my mind since the preconference workshop of Biodiversity Next is to create a disambiguation manual for person names in collections. You all have a massive amount of knowledge and experience and it would be valuable to document the processes, logic, best practices and resources available for disambiguation. We already have a wealth of material, but it needs structuring, and I think we need to formalize some of the procedures we perhaps take for granted. |
@qgroom : absolutely! I'm happy to help in any way I can. |
Name prefixes and suffixes are important disambiguating information. These should not be lost when data are converted or transcribed.