Dialect map data

About

This repository contains the static data used by the other Dialect Map components 💬.

Jargon terms are grouped so that terms with the same meaning, but with different names from one science to another, can be compared one-to-one. A range of data-ingestion pipelines later use these groups to generate NLP metrics on the ArXiv papers dataset, so that the terms can be compared within the Dialect Map UI.
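As a rough illustration of the grouping idea, the sketch below indexes groups of equivalent terms so that two terms can be checked for membership in the same group. The in-memory structure, the placeholder terms, and the helper names (`build_term_index`, `same_group`) are assumptions made for this example; they do not reflect the repository's actual JSON layout.

```python
# Hypothetical jargon groups: each inner list holds terms assumed to share
# one meaning. The terms are placeholders, not entries from the real data.
JARGON_GROUPS = [
    ["term_a1", "term_a2"],
    ["term_b1", "term_b2", "term_b3"],
]


def build_term_index(groups):
    """Map each lower-cased term to the position of its group."""
    index = {}
    for group_id, terms in enumerate(groups):
        for term in terms:
            index[term.lower()] = group_id
    return index


def same_group(index, first, second):
    """Return True only if both terms are known and belong to the same group."""
    first_id = index.get(first.lower())
    second_id = index.get(second.lower())
    return first_id is not None and first_id == second_id
```

A downstream pipeline could use such an index to decide which per-term metrics belong side by side in a comparison.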

Environment setup

The project uses AJV-CLI to validate both the JSON schemas and the jargon list. It can be installed by running:

npm install --no-optional

Syntax validation

To validate the JSON-Schema syntax:

make validate

Available data

Categories

The full corpus of ArXiv categories comprises both the categories currently in use and the legacy ones.

Jargons

Initial

The initial set of jargon groups was collected through a Google form set up by Kyle Cranmer on Twitter. Responses from the scientific community were collected from December 1 to December 31, 2020.

⚠️ Disclaimer: no more terms will be collected this way.

New terms

New terms can be added by creating a Pull Request (PR). Each PR is then reviewed by the Dialect Map team to ensure that the resulting JSON is well formatted.
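Before opening a PR, it may help to confirm locally that an edited file still parses as JSON. The following is a minimal sketch using only the Python standard library; the helper names (`json_error`, `check_file`) are hypothetical, and the check covers syntax only, not the schema validation performed by the project's own tooling.

```python
import json
from pathlib import Path


def json_error(text):
    """Return None if text parses as JSON, otherwise a short error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


def check_file(path):
    """Read a file as UTF-8 and report any JSON syntax error it contains."""
    return json_error(Path(path).read_text(encoding="utf-8"))
```

Calling `check_file` on the edited file returns `None` when the syntax is valid, or a message pointing at the first offending position otherwise.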