This Python project calculates genetic distances between languages based on the Glottolog hypothesized tree of languages. The genetic distance from Language 1 (L1) to Language 2 (L2) is defined as the number of steps upward on the tree from L1 until the two languages are unified under a single node, divided by the number of branches between L1 and the root. This metric represents the proportion of L1's descent not shared by L2. The definition is quoted from the description on Prof. David Mortensen's website.
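As a minimal sketch of the metric described above (not this project's actual code), the tree can be stored as a child-to-parent dictionary. The function names, the `toy_parent` dictionary, and the heavily simplified toy tree are assumptions made for illustration; the tree is not the real Glottolog classification. Returning 1.0 when two languages share no node is also an assumption of this sketch.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to its root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def genetic_distance(l1, l2, parent):
    """Steps from l1 up to the first node shared with l2, divided by l1's depth."""
    path1 = path_to_root(l1, parent)
    ancestors2 = set(path_to_root(l2, parent))
    depth1 = len(path1) - 1  # number of branches between l1 and the root
    for steps, node in enumerate(path1):
        if node in ancestors2:
            # Guard against division by zero when l1 is itself a root.
            return steps / depth1 if depth1 else 0.0
    return 1.0  # assumption: no shared node means nothing is shared


# Toy tree for illustration only (hypothetical, heavily simplified).
toy_parent = {
    "English": "West Germanic",
    "German": "High German",
    "High German": "West Germanic",
    "West Germanic": "Germanic",
    "Swedish": "North Germanic",
    "North Germanic": "Germanic",
    "Germanic": "Indo-European",
    "Hindi": "Indo-Aryan",
    "Indo-Aryan": "Indo-European",
}

print(genetic_distance("English", "German", toy_parent))
print(genetic_distance("German", "English", toy_parent))
```

In this toy tree the metric is asymmetric: English is 3 branches from the root and meets German after 1 step (1/3), while German is 4 branches from the root and meets English after 2 steps (2/4).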
The motivation for this project stems from the need to calculate genetic distances between languages, an important task in linguistics and language-related research. Although lang2vec is a popular tool in this field, it is not transparent about how its genetic distances are calculated: the methodology is not explicitly described in its paper, and the results appear to be somewhat erroneous.
- Calculate genetic distances based on the Glottolog hypothesized tree of languages.
- Use an intuitive distance metric that represents the percentage of L1's descent not shared by L2.
- Produce genetic distances that are more intuitive than those obtained from lang2vec.
- Outputs distance metrics in a structured format for further analysis.
To use this project, you can follow these steps:
- Download this CSV file containing language information.
- Download this Newick tree structure and save it as a .txt file.
- Substitute the file paths in the code with the paths to these two files.
- Provide a list of the Glottolog language names that you wish to study. To check whether a language is supported by Glottolog, search in the Name entry here.
- Run the code, then use the `get_distance` function to calculate genetic distances between languages.
- Access the calculated distances for your analysis or research.
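To illustrate what reading a Newick tree involves, here is a minimal sketch of a pure-Python parser that turns a Newick string into a child-to-parent dictionary. It is not the parser used by this project: the function names and the example string are hypothetical, escaped quotes inside labels are not handled, and labels in the real file may carry extra identifiers alongside the name, which this sketch keeps verbatim.

```python
def _read_label(text, i):
    """Read an optional label and skip an optional ':length', starting at index i."""
    if i < len(text) and text[i] == "'":
        j = text.index("'", i + 1)  # closing quote (escaped quotes not handled)
        label, i = text[i + 1:j], j + 1
    else:
        j = i
        while j < len(text) and text[j] not in "(),:;":
            j += 1
        label, i = text[i:j].strip(), j
    if i < len(text) and text[i] == ":":  # skip the branch length
        i += 1
        while i < len(text) and text[i] not in "(),;":
            i += 1
    return label, i


def newick_to_parent_map(newick):
    """Parse a Newick string into a {child_label: parent_label} dictionary."""
    parent = {}
    stack = [[]]  # stack[-1] collects the children of the group currently open
    anonymous = 0
    i = 0
    while i < len(newick):
        ch = newick[i]
        if ch == "(":
            stack.append([])
            i += 1
        elif ch == ")":
            children = stack.pop()
            label, i = _read_label(newick, i + 1)
            if not label:  # unnamed internal node: give it a placeholder name
                anonymous += 1
                label = f"_node{anonymous}"
            for child in children:
                parent[child] = label
            stack[-1].append(label)
        elif ch in ",; \n\t":
            i += 1
        else:  # a leaf label
            label, i = _read_label(newick, i)
            stack[-1].append(label)
    return parent


# Hypothetical example string, not taken from the Glottolog file.
example = "((English:1,'High German':1)'West Germanic':1,Swedish:1)Germanic:1;"
print(newick_to_parent_map(example))
```

A dictionary of this shape is enough to walk from any language up to its root, which is all the distance metric needs.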
This method for calculating genetic distances has yielded dissimilarity matrices that are more intuitive and accurate when compared to the results obtained from lang2vec.
The dissimilarity matrix is a quick tool for sanity-checking and comparing the genetic relationships between languages: languages from the same top-level language family should have lower dissimilarity scores than languages from different families. You can explore and analyze the dissimilarity matrix in the notebook by providing a list of the language families you wish to investigate.
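The sanity check described above can be sketched as follows. This is an illustration rather than the notebook's code: `dissimilarity_matrix`, `family_sanity_check`, and `toy_distance` are hypothetical names, and `toy_distance` is a placeholder with made-up values that stands in for whatever pairwise distance function you use.

```python
from itertools import permutations
from statistics import mean


def dissimilarity_matrix(languages, distance):
    """Build a square matrix with distance(row_language, column_language) in each cell."""
    return [[distance(a, b) for b in languages] for a in languages]


def family_sanity_check(family_of, distance):
    """Return (mean within-family distance, mean between-family distance).

    Ordered pairs are used because the metric need not be symmetric.
    Both groups must be non-empty, otherwise statistics.mean raises an error.
    """
    within, between = [], []
    for a, b in permutations(family_of, 2):
        same_family = family_of[a] == family_of[b]
        (within if same_family else between).append(distance(a, b))
    return mean(within), mean(between)


family_of = {
    "English": "Indo-European",
    "German": "Indo-European",
    "Hindi": "Indo-European",
    "Finnish": "Uralic",
    "Hungarian": "Uralic",
}


def toy_distance(a, b):
    """Placeholder distance with made-up values, for illustration only."""
    if a == b:
        return 0.0
    return 0.5 if family_of[a] == family_of[b] else 1.0


matrix = dissimilarity_matrix(list(family_of), toy_distance)
within, between = family_sanity_check(family_of, toy_distance)
print(within < between)
```

If the within-family mean is not clearly lower than the between-family mean for real data, that is a signal to inspect the distances more closely.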
Here is the Indo-European language dissimilarity matrix calculated according to lang2vec genetic distance:
Here is the Indo-European language dissimilarity matrix calculated according to this project's code:
More comparisons can be seen here:
- Glottolog: The Glottolog database and tree of languages were used as a basis for this project.
- Prof. David Mortensen's Website: The methodology for calculating genetic distances is inspired by the description provided by Prof. David Mortensen.
- lang2vec: lang2vec is a widely used tool for calculating language vectors and related metrics.