Commit

Edouard-Legoupil committed Nov 27, 2023
1 parent ec10574 commit 6c40522
Showing 3 changed files with 21 additions and 10 deletions.
19 changes: 11 additions & 8 deletions README.md
@@ -1,11 +1,18 @@
# A tutorial on how to use record linkage to remove duplicates from a registration list



Record linkage, also known as data matching or deduplication or Unique Entity Estimation (UEE), is the process of identifying and linking records within or between datasets that refer to the same entity or individual. The goal of record linkage is to reconcile and merge information from different non-matching sources to create a unified and accurate view of the underlying entities.

In the UNHCR context, this can be the case when merging registration lists from different field partners, for instance when creating a sampling universe to organise a survey. Registration records from each list may vary in terms of data quality, format, and completeness. Record linkage helps to overcome these challenges by identifying and connecting related records, even when they do not have a common unique identifier.


* [Presentation](https://unhcr-americas.github.io/record_linkage/)
* [Full example based on dummy data](https://unhcr-americas.github.io/record_linkage/FastLink.html)
* [Pipeline on real data](https://github.com/unhcr-americas/record_linkage/blob/main/deduplicate.Rmd)



## Process

The process of record linkage typically involves several steps:

* __Data Cleaning__: Before linking records, it is essential to clean and standardize the data to ensure consistency. This may involve tasks such as correcting typos, standardizing formats, and handling missing or incomplete information.
@@ -19,13 +26,9 @@ The process of record linkage typically involves several steps:
* __Linking and Merging__: After determining which records are matches, the linked records are merged or consolidated to create a single, comprehensive record that combines information from the original sources.
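
The cleaning and merging steps listed above can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python, not the {fastLink} method: it assumes a simple string-similarity comparison between the two steps, and the field names, the sample records, and the `0.85` threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def clean(record):
    """Data cleaning: lowercase every field and collapse extra whitespace."""
    return {key: " ".join(str(value).lower().split()) for key, value in record.items()}


def similarity(a, b):
    """Similarity of two strings between 0 and 1 (1 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(rec_a, rec_b, fields, threshold):
    """Assumed decision rule: the mean field similarity reaches the threshold."""
    scores = [similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores) >= threshold


def deduplicate(records, fields, threshold=0.85):
    """Linking and merging: fold each record into the first earlier match."""
    merged = []
    for rec in map(clean, records):
        for kept in merged:
            if is_match(rec, kept, fields, threshold):
                # Merge: fill fields that are empty in the record already kept.
                for key, value in rec.items():
                    if not kept.get(key):
                        kept[key] = value
                break
        else:
            # No earlier record matched, so this one is a new entity.
            merged.append(dict(rec))
    return merged


# Hypothetical registration list with one near-duplicate entry.
registrations = [
    {"name": "Maria Lopez", "birth_year": "1990", "phone": ""},
    {"name": "maria  lopes", "birth_year": "1990", "phone": "555-0101"},
    {"name": "John Smith", "birth_year": "1985", "phone": "555-0202"},
]

unique = deduplicate(registrations, fields=["name", "birth_year"])
for row in unique:
    print(row)
```

A fixed threshold on an average score is the simplest possible rule; probabilistic tools such as {fastLink} estimate match probabilities from the data instead, which is usually more robust on real registration lists.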


## How to?

There are numerous packages for Record Linkage, such as {RecordLinkage} & {fastLink}

In this [presentation](https://unhcr-americas.github.io/record_linkage/), we focus on [fastLink](https://github.com/kosukeimai/fastLink), which was also highlighted in this [presentation from UN Stat Commission](https://www.youtube.com/watch?v=S7boX8X4uXU): a practical example from DANE in Colombia, matching a survey, Gran encuesta integrada de hogares (GEIH), with a registry, Registro Estadístico de Relaciones Laborales (RELAB).
## Reference

You can check a full [demo here](https://unhcr-americas.github.io/record_linkage/FastLink.html)
There are numerous packages for record linkage, such as {RecordLinkage} and {fastLink}. In this repo, we focus on [{fastLink}](https://github.com/kosukeimai/fastLink), which was also highlighted in the [presentation from UN Stat Commission](https://www.youtube.com/watch?v=S7boX8X4uXU) showing a practical example from DANE in Colombia that matches a national household survey (_Gran encuesta integrada de hogares - GEIH_) with a registry (_Registro Estadístico de Relaciones Laborales - RELAB_).

You can also check the [Seminar on record linkage](https://github.com/cleanzr/record-linkage-tutorial)

2 changes: 1 addition & 1 deletion index.Rmd
@@ -357,7 +357,7 @@ Each steps described above to be:

* piped using the tidyverse approach

Revise the[demo script](FastLink.html) and [applied real use case notebook](https://github.com/unhcr-americas/record_linkage/blob/main/deduplicate.Rmd) to use the functions
Revise the [demo script](FastLink.html) and [applied real use case notebook](https://github.com/unhcr-americas/record_linkage/blob/main/deduplicate.Rmd) to use the functions

]
.pull-right[
