% These commented lines make the thesis look good as a standalone document
% \documentclass[a4paper,11pt]{report}
% \usepackage[tmargin=2.5cm,bmargin=2.5cm,lmargin=2.8cm,rmargin=2.8cm,headheight=0cm]{geometry}
% these two lines are to make the thesis printable in a book (with alternate spacing)
\documentclass[epsfig,a4paper,11pt,titlepage,twoside,openany]{book}
\usepackage[paperheight=29.7cm,paperwidth=21cm,outer=1.5cm,inner=2.5cm,top=2cm,bottom=2cm]{geometry}
\usepackage{epsfig}
\usepackage{plain}
\usepackage{setspace}
\usepackage{titlesec}
\usepackage[T2A,T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian, english]{babel}
\usepackage{todonotes}
\usepackage{cite} \graphicspath{ {graphics/} }
\usepackage{csquotes}
\usepackage{amsmath}
\usepackage{algorithm,algorithmic}
\usepackage{array}
\usepackage[colorlinks=true,linkcolor=black,citecolor=blue]{hyperref}
\usepackage{tabularx}
\usepackage{multirow}
\usepackage{float}
\usepackage{makecell}
\usepackage[capitalise]{cleveref}
\usepackage{amssymb}
\usepackage{longtable}
\usepackage{lscape}
\usepackage{subcaption}
\usepackage[inline]{enumitem}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{xspace}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.95}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\definecolor{mgray}{rgb}{0.5,0.5,0.5}
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
aboveskip=5mm,
belowskip=5mm,
showstringspaces=false,
columns=flexible,
basicstyle={\ttfamily},
numberstyle=\tiny\color{mgray},
numbers=left,
numbersep=5pt,
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
breaklines=true,
breakatwhitespace=true,
tabsize=2,
sensitive = true,
escapeinside={(*}{*)},
frame=single,
language=Python,
}
\lstset{style=mystyle}
\newcommand{\footurl}[1]{\footnote{\url{#1}}}
\newcommand{\soweego}[0]{\texttt{soweego}\xspace}
\linespread{1.25}
\begin{document}
\pagenumbering{gobble} \input{cover_page}
\clearpage
% \listoftodos
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Nota
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Sezione Ringraziamenti opzionale
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{acknowledgements}
\clearpage
\pagestyle{plain}
\mainmatter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Nota
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Si ricorda che il numero massimo di facciate e' 30. Nel conteggio delle
%% facciate sono incluse indice sommario capitoli Dal conteggio delle facciate
%% sono escluse frontespizio ringraziamenti allegati
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% indice
\tableofcontents
\clearpage
% gruppo per definizone di successione capitoli senza interruzione di pagina
\begingroup
% Template styles nessuna interruzione di pagina tra capitoli ridefinizione dei
% comandi di clear page
\renewcommand{\cleardoublepage}{} \renewcommand{\clearpage}{}
% redefinizione del formato del titolo del capitolo da formato Capitolo X Titolo
% capitolo a formato X Titolo capitolo
\titleformat{\chapter} {\normalfont\Huge\bfseries}{\thechapter}{1em}{}
\titlespacing*{\chapter}{0pt}{0.59in}{0.02in}
\titlespacing*{\section}{0pt}{0.20in}{0.02in}
\titlespacing*{\subsection}{0pt}{0.10in}{0.02in}
% end template style
% ---------- Below are MY styles
% % % nessuna interruzione di pagina tra capitoli ridefinizione dei comandi di
% % % clear
% % % page
% \renewcommand{\cleardoublepage}{}
% % \renewcommand{\clearpage}{}
% % % redefinizione del formato del titolo del capitolo da formato Capitolo X
% % % Titolo
% % % capitolo a formato X Titolo capitolo
% \titleformat{\chapter} {\normalfont\Huge\bfseries}{\thechapter}{1em}{}
% \titlespacing*{\chapter}{0pt}{0pt}{0.17in}
% \titlespacing*{\section}{0pt}{0.20in}{0.09in}
% \titlespacing*{\subsection}{0pt}{0.10in}{0.08in}
% \setlength\abovedisplayshortskip{5pt} \setlength\belowdisplayshortskip{15pt}
% \setlength\abovedisplayskip{5pt} \setlength\belowdisplayskip{15pt}
% ------------ End my styles
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction}
\label{chap:introduction}
In this era of \textit{big data}, there are scores of datasets of every kind, covering all sorts of human endeavours. Nowadays, most companies strive to build some sort of dataset of their customers and to enlarge it as the years go by, allowing them to better understand their clientele and provide the best service possible. In many domains, however, there is no need to start a dataset from scratch, since much of the information is readily available in public knowledge bases. \textit{Wikidata} is one such knowledge base: it is the workhorse that powers the \textit{Wikipedia} free encyclopedia and all of the Wikimedia projects.
From the perspective of Wikidata, having these external datasets consume its knowledge is normal. On the other hand, an external dataset may grow to contain information that becomes useful to Wikidata, since it allows the enrichment of its own knowledge base. Such information is usually added manually by volunteers, but it may sometimes be too much for them to handle, so some of it gets overlooked. One of Wikidata's main concerns is maintaining the highest possible data quality; however, part of its data lacks the references that are invaluable to verify the accuracy of the information it contains.
The root of the problem may be that references are only added for a subset of the data, such as the most visited or sought-after articles. The \soweego project aims to automate this task by automatically adding external references to Wikidata information.
\soweego is an open source project of the Wikimedia Foundation, which uses techniques of record linkage together with machine learning classifiers to find the external references for Wikidata entities. \soweego does this by finding potential matches of Wikidata entities in external databases and then ranking the likelihood of these potential matches using machine learning. The external databases \soweego considers when performing these matches are the following:
\begin{itemize}
\item \texttt{IMDb}: for information regarding \textit{actors, musicians, directors, producers,} and \textit{writers}.
\item \texttt{Musicbrainz} and \texttt{Discogs}: for information regarding \textit{musicians} and \textit{bands}.
\end{itemize}
This dissertation aims at improving the prediction power of \soweego. Specifically, the current set of machine learning classifiers available in the \soweego project will be improved by finding the best configuration for each of them. Following that, the idea of \textit{ensemble classifiers} will be explored: all of the classifiers available in \soweego will be joined, and the \textit{opinion} of the whole group will be considered when making a prediction. Motivating this choice is the fact that considering the viewpoint of multiple base classifiers may smooth out errors that any individual classifier might produce.
In \autoref{chap:state-of-the-art}, the technique of \textit{record linkage} will be explained. This is commonly used when working with a problem where there is a need to find matching entries across different datasets. It will also be shown how the process of performing record linkage can be thought of as a pipeline with different steps, each of which is in charge of performing a different transformation on the data. Common terms that will be used when talking about record linkage, and in this dissertation in general, can be found in the \hyperref[sec:apx-glossary]{Glossary}.
A more in depth description of the problem that \soweego solves will be presented in \autoref{chap:problem-statement}. This section will also present an overview of the data that will be used, and which areas the current project aims to improve.
\soweego, like record linkage itself, can be viewed as a series of steps closely mirroring the record linkage framework presented in \autoref{chap:state-of-the-art}. In \autoref{chap:approach}, an explanation of every part of \soweego will be given, together with what happens during each one.
The data used in this project will be described in \autoref{chap:data}. Here the external databases will be explored, focusing on their shape, the fields they provide, and the quality of each database. After this, an explanation of how the data is preprocessed before being consumed by \soweego is given. Finally, each feature extracted from the data is explained.
In \autoref{chap:algorithms} an explanation of each of the machine learning classifiers provided by \soweego is given. These will be called the \textit{baseline classifiers}, since they are what the new \textit{ensemble classifiers} will be compared against. An explanation of the ensemble models used will also be given in this chapter.
Following that, \autoref{chap:results} shows how the \textit{configuration} of the baseline classifiers was optimized, and reports the evaluation performance obtained, after said optimization, by both the baseline and the ensemble classifiers. An explanation of the metrics used and a comparison of the results between baselines and ensembles will also be given in this chapter, exploring the differences between the performances obtained.
Finally, \autoref{chap:conclusion} discusses how ensemble classifiers perform better, across all considered metrics, than almost all of the baseline models. According to the results obtained, ensembles are better than all baseline classifiers at predicting when a pair of records is actually different (i.e. only pairs which the classifier is very sure about are classified as matches). As such, ensemble models are a great choice in situations where predicting a negative sample as positive is not desirable. As shown, there are also differences among the evaluated ensemble models: some are more \textit{precise} (they tend to mark as matches pairs which are actually matches), while others are more \textit{inclusive} (they tend to recognize more matches as such, even if some non-matches are also included).
\section{Contributions}
\label{sec:contributions}
As said in the introduction, the present dissertation centers around contributing to the open source project \soweego. The main contributions made to the project as part of this work are the following:
\begin{itemize}
\item The creation of an \textit{importer} functionality which allows \soweego to work with data from IMDb. This functionality will be explained in \autoref{sec:soweego-st-importer}.
\item The implementation of some of the \textit{baseline} algorithms (explained in \autoref{sec:baseline-classifiers}) used by \soweego, specifically: \textit{logistic regression}, \textit{random forest}, and \textit{single layer} and \textit{multi layer} perceptrons.
\item The optimization of the \textit{configuration} of the baseline algorithms by means of a grid search procedure, explained in \autoref{sec:hyperpar-optimization}.
\item The implementation of 4 different methods to \textit{ensemble} the baseline classifiers, described in \autoref{sec:ensemble-models}.
\item The evaluation of the performance of the baseline and ensemble classifiers when applied to the current problem, done in \autoref{sec:baseline-evaluation} and \autoref{sec:ensemble-evaluation} respectively.
\item A comparison of the evaluation performance of \textit{ensemble} and \textit{baseline} classifiers, to see how their behaviour differs and whether using ensembles brings any improvement (\autoref{sec:comparative-evaluation}).
\end{itemize}
\section{A brief overview of Wikidata}
\label{sec:intro-wikidata}
This section will give a brief overview of what Wikidata is and how it works. These concepts will be extensively used throughout the discussion.
\textit{Wikidata} is defined as the \textit{knowledge base} that supports
\textit{Wikipedia} and all of the \textit{Wikimedia Foundation}
projects\footurl{https://wikimediafoundation.org/}. A \textit{knowledge base} is a collection of statements of and about the world. There are many different knowledge bases available on the web, but \textit{Wikidata} is one of the most complete; it also has a very active and extensive community of users and volunteers.
In this specific \textit{knowledge base} there are different \textit{items}, which are interconnected and integrated with each other through a \textit{Wikidata} construct called
\textit{properties}.
Each \textit{item} is identified by a special tag named \texttt{QID} (Q-identifier), so called because each of these unique values starts with the letter \textit{Q}. Information about a specific item (QID) is
recorded through \textit{statements}, which are \textit{key-value} pairs
containing specific information about the item.
The key of each statement is a \textit{property}, and the value can be a reference to another item via its QID, a string, a number, or media
files\footurl{https://www.wikidata.org/wiki/Help:Statements}. Statements can
also be extended by the use of \textit{qualifiers} which are \textit{extra information} about the
statement. An example of this can be a person who studied at some university for
a certain period of time (this duration would be the qualifier).
Properties\footurl{https://www.wikidata.org/wiki/Help:Properties}, like items,
have their own identifiers, known as \texttt{PID}s (Property-IDs). Some properties accept only one value while others accept multiple values, and
some may also be constrained in a specific way: for example, a property's value may
be restricted to a string composed of a single letter.
In Wikidata, both \textit{items} and \textit{properties} are associated with a name and a
brief description; the main difference is how they are used: items represent a specific entity, while properties express information regarding said entity. Another difference is that \textit{items} are displayed on their respective pages on Wikipedia, while \textit{properties} are not.
Consider the QID \textit{Q42}, which identifies the writer
\textit{Douglas Adams}. Below is a table which shows some of the statements about him
using QIDs and PIDs.
\begin{table}[H]
\centering
\begin{tabular}{l|l}
PID & QID \\ \hline
P31 (instance of) & Q5 (human) \\ \hline
P21 (sex) & Q6581097 (male) \\ \hline
P19 (place of birth) & Q350 (Cambridge) \\ \hline
\multirow{3}{*}{P106 (occupation)} & Q28389 (screenwriter) \\ \cline{2-2}
& Q6625963 (novelist) \\ \cline{2-2}
& Q245068 (comedian)
\end{tabular}
\caption{Wikidata statements describing Douglas Adams}
\label{tab:intro-wikidata-douglas42}
\end{table}
Here it is possible to see that this way of expressing information is very
flexible and allows for a quick and rich construction of knowledge.
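For illustration purposes only (this snippet is not part of the \soweego pipeline), the sketch below retrieves the public JSON export of item \textit{Q42} through the \texttt{Special:EntityData} endpoint and prints the values of its \textit{instance of} (P31) statements; the use of the \texttt{requests} library and the exact JSON layout assumed in the comments are assumptions of this example.
\begin{lstlisting}
# Illustrative sketch: inspect the statements of a Wikidata item.
# Assumes the public Special:EntityData JSON export and the
# `requests` library; not part of the soweego code base.
import requests

QID = "Q42"  # Douglas Adams
url = "https://www.wikidata.org/wiki/Special:EntityData/" + QID + ".json"
entity = requests.get(url, timeout=30).json()["entities"][QID]

# Statements are grouped by property (PID); an item-valued statement
# points to another item through its QID.
for claim in entity["claims"].get("P31", []):
    value = claim["mainsnak"]["datavalue"]["value"]
    print("P31 (instance of) ->", value["id"])  # e.g. Q5 (human)
\end{lstlisting}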
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{State of the art}
\label{chap:state-of-the-art}
This chapter will mainly cover \textit{Record Linkage}, a standard method used to link a pair of databases. The next section will give a brief introduction to the record linkage workflow, and mention some of the most common terms used in this document. \autoref{sec:rl-as-a-workflow} will show how record linkage can be thought of as a pipeline, and every step will be explained in depth. Finally, \autoref{sec:rl-related-work} will mention related work.
\section{Introduction to Record Linkage}
\label{sec:rl-intro}
\textit{Record linkage} is commonly used when there is a need for data integration (e.g. merging two datasets). This is achieved by finding pairs of records which represent the same underlying \textit{entity}, and by joining them according to a previously defined set of criteria. It is possible to apply the technique both for connecting a pair of databases (which we call \textit{catalogs} in this use case) and for removing duplicates from one single catalog.
For a more formal description of the problem, suppose there are two datasets ($\Psi$ and $\Omega$). The goal of record linkage is to find all those pairs $(x, y)$ where $x \in \Psi$ and $y \in \Omega$, such that both $x$ and $y$ refer to the same underlying entity. Hence, the algorithm must compare the entities from $\Psi$ and $\Omega$, and decide which of them are a match (i.e., refer to the same underlying entity) and which are not.
Record Linkage can also be naturally expressed or viewed as a problem of clustering, in which the objective is to cluster all the entities representing the same \textit{latent entity} together.
The goal, as defined in \cite{fellegi69_theor_recor_linkag}, is to find two different sets $M$ and $U$, where $M$ is composed of all matching pairs and $U$ contains all non-matching ones:
\begin{align*}
M &= \{(x, y); x = y; x \in \Psi; y \in \Omega\} \\
U &= \{(x, y); x \neq y; x \in \Psi; y \in \Omega\}
\end{align*}
The problem of joining different catalogs is trivial when the catalogs have a
shared, unique identifier. For example, when joining the datasets of
customers from two banks, and each customer is associated with their government
ID, or some other unique identifier. In this case, it is just a matter of
matching on this attribute and merging the resulting pairs.
However, it is commonly the case (as is the case for this project as well), that
catalogs do not have a common unique identifying attribute, thus requiring a less trivial approach.
As shown later in this chapter (\autoref{sec:rl-main-approaches}), there are two main methods of performing record linkage: broadly, methods that match by directly comparing pairs of fields, and methods that measure the similarity between said pairs.
Besides not having a unique identifier, it may also be the case that the datasets that need to be merged contain some mistakes. This is especially noticeable when working with data representing people. These are some of the mistakes that might come up:
\begin{itemize}
\item The names were misspelled.
\item The first and last names are reversed.
\item The format of the date is not standardized.
\item The dates presented are simply not possible, since they are too far in the future or too far in the past.
\end{itemize}
Many of these issues can be mitigated by preprocessing the dataset, but others cannot, and it is necessary to account for them when working with the data.
To give an example of a situation in which \textit{record linkage} would be used,
suppose there are two databases of clients that need to be joined for some company. The tables below represent said databases.
\begin{table}[H]
\centering{}
\begin{tabular}{l|l|l}
First Name & Last Name & Birth Year \\ \hline
Brenda & Williams & 1990 \\
Meng & Yu & 1950 \\
Gary & Crowley & 1956 \\
Bruno & Lima & 1956
\end{tabular}
\caption{Example catalog from company \#1}
\label{tab:ex-catalog-1}
\end{table}
\begin{table}[H]
\centering{}
\begin{tabular}{l|l|l}
First Name & Last Name & Birth Year \\ \hline
Emma & Farr & 1971 \\
Yu & Meng & 1950 \\
Juri & Petrov & 1932 \\
Gary & Crowley & 1956
\end{tabular}
\caption{Example catalog from company \#2}
\label{tab:ex-catalog-2}
\end{table}
The catalogs above do not have a unique identifier that unequivocally links them, so it is necessary to check the available attributes. The fields can be compared in turn, and if they all match then it can be said that the two entities are the same. For example, it is easy to see that \textit{Gary Crowley} is in both datasets, since an entity with that name exists in both and their birth years match. On the other hand, nothing can be confidently said about the other possible matches, since for none of them do all the fields match. There is a pair (\textit{Meng Yu} in the first dataset, and \textit{Yu Meng} in the second) which might be the same person, since the names are similar and they share a birth year. For this last instance, there is no way to be sure whether the first and last names are reversed in one of the datasets or whether they are actually different people. This kind of situation is quite common in real-world datasets.
\subsection{Main Approaches}
\label{sec:rl-main-approaches}
There are two main approaches\footnote{Besides these two approaches, the record linkage literature also commonly uses the term \textit{clerical}, which entails \textit{manual} work, i.e., work done by a clerk.} to consider when talking about record linkage: 1) \textit{deterministic}, and 2) \textit{probabilistic}.
The main difference between these is how the pairs of entities are compared. A more thorough description of each method follows.
\subsubsection{Deterministic}
\label{sec:rl-approach-deterministic}
In deterministic record linkage, all fields are compared to see if they match. If the fields do match, then it can be said that a pair of entities are the same. However, if any of the fields that are being compared only partially match or do not match at all, then the pair is considered as a non-match.
Take for instance the task of joining a catalog that provides information regarding where someone
lives with another catalog that provides their age. Both catalogs only contain the
name of the person as an identifying feature, so a composition of
these two fields (attributes) can be used to identify each entity.
\begin{table}[H]
\centering{}
\begin{tabular}{l|l|l}
First Name & Last Name & Age \\ \hline
Sarah & Miller & 42 \\
John & Piers & 35 \\
\end{tabular}
\caption{Deterministic example, catalog 1}
\label{tab:ex-deterministic-1}
\end{table}
\begin{table}[H]
\centering{}
\begin{tabular}{l|l|l}
First Name & Last Name & Address \\ \hline
Sarah & Miller & New York \\
Johnny & Piers & Berlin \\
\end{tabular}
\caption{Deterministic example, catalog 2}
\label{tab:ex-deterministic-2}
\end{table}
After comparing these catalogs, there is a clear match for \textit{Sarah Miller}, but no confident match between \textit{John Piers} and \textit{Johnny Piers}, since the
names are different and there is no way to be sure whether \textit{Johnny} is just a nickname (making them the same person)
or whether these two entities are different people.
Note also that, in this example, both rows coincide completely on the \textit{Last Name} field. However, when doing deterministic record linkage, partial matches are not considered \cite{dusetzina_m_2014}, because there can be no uncertainty in the match.
It should be noted that this linking procedure is not optimal when the provided data is not very clean. Preprocessing steps can be taken to ensure that the catalogs are as
uniform as possible, but even so, it is never easy to know whether a match has been missed.
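To make the deterministic procedure concrete, the following minimal sketch (the field names and catalog contents are hypothetical, taken from the example above) marks a pair as a match only when every shared field agrees exactly:
\begin{lstlisting}
# Minimal sketch of deterministic record linkage: a pair is a match
# only if all shared fields agree exactly (hypothetical field names).
catalog_1 = [
    {"first_name": "Sarah", "last_name": "Miller", "age": 42},
    {"first_name": "John", "last_name": "Piers", "age": 35},
]
catalog_2 = [
    {"first_name": "Sarah", "last_name": "Miller", "address": "New York"},
    {"first_name": "Johnny", "last_name": "Piers", "address": "Berlin"},
]

# Only the fields present in both catalogs can be compared.
shared_fields = ["first_name", "last_name"]

matches = [
    (a, b)
    for a in catalog_1
    for b in catalog_2
    if all(a[field] == b[field] for field in shared_fields)
]
for a, b in matches:
    # Only Sarah Miller survives: partial matches are discarded.
    print(a["first_name"], a["last_name"], "->", b["address"])
\end{lstlisting}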
\subsubsection{Probabilistic}
\label{sec:rl-approach-probabilistic}
In probabilistic record linkage \cite{fellegi69_theor_recor_linkag,Newcombe:1962_proba_rl,Sayers2015} all fields are compared, taking into account how good each field is at telling whether a pair of entities is a match or not. This is done by assigning a weight $w_i$ to each field.
The value of these weights is derived, for a pair of entities $r$, from the probabilities $u_i$ and $m_i$, where $u_i$ is the probability that field $i$ matches given that $r \in U$ (the pair is not a match), and $m_i$ is the probability that field $i$ matches given that $r \in M$ (the pair is a match).
For example, if the entities in the dataset currently being worked with have a \textit{weekday} field, then this field has 7 possible values, one for each day of the week. Moreover, if every day of the week appears in the datasets approximately the same number of times, then the probability that two unrelated entities agree on the weekday field is $1/7$, and this is the $u_{weekday}$ probability for that field.
On the other hand, assuming the data is of perfect quality, if two entities are the same then it is 100\% certain that their weekday fields match. But if the dataset contains some errors, say the \textit{weekday} field is wrong 4\% of the time, then the probability that a pair $r \in M$ agrees on the \textit{weekday} field is 96\%; this is the corresponding $m_{weekday}$ probability. These values are rarely known beforehand and may be estimated from the dataset \cite{christen12_data}.
The weight $w_i$ for each field is then calculated as:
\begin{equation*}
w_i = \begin{cases}
\log_2(\frac{m_i}{u_i}) & \text{if}\ a_i = b_i \\
\log_2(\frac{1- m_i}{1-u_i}) & \text{if}\ a_i \neq b_i
\end{cases}
\end{equation*}
Where $a_i$ and $b_i$ are the values of field $i$ for the pair of entities $(a,b) = r$. If the fields match then the first case applies, otherwise the second one does. The sum $W = \sum_{i=1}^{N} w_i$ over all fields of the pair can then be computed and compared against a lower threshold $t_l$ and an upper threshold $t_u$ to decide whether the pair is a non-match, a potential match, or a match. These thresholds need to be determined a priori.
\begin{equation*}
\begin{cases}
\text{non-match}, & \text{if}\ W \leq t_l \\
\text{potential-match}, & \text{if}\ t_l < W < t_u \\
\text{match}, & \text{if}\ t_u \leq W
\end{cases}
\end{equation*}
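A minimal sketch of this scoring scheme follows; the $m_i$ and $u_i$ estimates, the field names, and the two thresholds are hypothetical values chosen only for illustration.
\begin{lstlisting}
import math

# Hypothetical per-field probabilities: m[f] = P(f agrees | match),
# u[f] = P(f agrees | non-match), e.g. u = 1/7 for the weekday field.
m = {"first_name": 0.95, "last_name": 0.97, "weekday": 0.96}
u = {"first_name": 0.01, "last_name": 0.02, "weekday": 1.0 / 7}

def field_weight(field, agree):
    """Agreement/disagreement weight w_i for a single field."""
    if agree:
        return math.log2(m[field] / u[field])
    return math.log2((1 - m[field]) / (1 - u[field]))

def classify(a, b, t_lower=0.0, t_upper=6.0):
    """Sum the field weights and compare W against the two thresholds."""
    W = sum(field_weight(f, a[f] == b[f]) for f in m)
    if W <= t_lower:
        return "non-match"
    if W >= t_upper:
        return "match"
    return "potential match"

rec_1 = {"first_name": "sarah", "last_name": "miller", "weekday": "monday"}
rec_2 = {"first_name": "sarah", "last_name": "miller", "weekday": "monday"}
print(classify(rec_1, rec_2))  # all fields agree -> "match"
\end{lstlisting}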
An extension to the probabilistic record linkage method presented above replaces the binary agreement behind each weight $w_i$ with a similarity metric that says how similar the two fields are \cite{porter1997approximate,winkler1990string}. This metric is obtained through a \textit{comparison function}, which takes a pair of fields and tells, as a percentage, how similar they are. One naive example of such a comparison function is the percentage of characters that two texts have in common.
Following the example catalogs defined in the previous section, now it is possible to compare the \textit{First Name} and \textit{Last Name} fields separately. Suppose they are compared with the function mentioned above, which treats each string as a set of characters and returns the percentage of letters both fields have in common.
$$
fc(s_1, s_2) = \frac{|s_1 \cap s_2|}{|s_1 \cup s_2|}
$$
Both last names are exactly the same, as is the first name for \textit{Sarah}. However, the first name \textit{John} would show a similarity of only $0.8$ with \textit{Johnny}. Taking into account the similarities of both the first and the last names, it is possible to make a better informed final decision on whether both records should be marked as a definite match or not. For instance, it may be decided that if the average similarity over all fields is greater than $0.5$ then the pair is a match. Under these assumptions \textit{John Piers} and \textit{Johnny Piers} will be classified as the same person.
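A minimal implementation of this naive character-overlap function (shown here only for illustration; it is simply a Jaccard similarity over the sets of distinct characters) could look as follows:
\begin{lstlisting}
def fc(s1, s2):
    """Naive comparison function: fraction of distinct characters
    that the two strings have in common."""
    a, b = set(s1.lower()), set(s2.lower())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(fc("John", "Johnny"))  # 0.8
print(fc("Piers", "Piers"))  # 1.0
\end{lstlisting}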
The probabilistic record linkage approach is the one that will be implemented in this project.
\section{Record Linkage as a Workflow}
\label{sec:rl-as-a-workflow}
An abstraction that is commonly used when performing record linkage is that of viewing the problem as a workflow \cite{christen12_data}, where each step is in charge of performing a different transformation to the data. In probabilistic record linkage \cite{fellegi69_theor_recor_linkag} the resulting matches are usually segmented into three sets depending on how sure the probabilistic model was that a pair was a match or not. For example, pairs for which the certainty of the model falls in the interval $[0, 0.4]$ are \textit{non-matches}. Pairs whose certainty of a match is in $[0.4, 0.6]$ are \textit{potential matches}. And those whose certainty is in the range $[0.6, 1.0]$ are \textit{matches}.
\autoref{fig:rl-workflow} shows a graphical representation of said steps.
\begin{figure}[H]
\centering \includegraphics[width=\textwidth]{rl-workflow}
\caption{Record linkage workflow.}
\label{fig:rl-workflow}
\end{figure}
This process is separated into five steps and may be applied to $n$ input catalogs; they all need to go through the \textit{Data Preprocessing} step so that they become as uniform and clean as possible.
In the following sections a more in-depth description of each step will be provided.
\subsection{Data Preprocessing}
\label{sec:rl-workflow-data-preprocessing}
This is the first and most significant step in which the data is preprocessed. During this step, the catalog will be taken in and its data will be rearranged and \textit{cleaned up}. The actual meaning of this depends on the kind of data that the specific catalog
contains and it is mostly up to the designer of the system to decide in which way the presented data
should be preprocessed \cite{Rahm00datacleaning}.
However, some of the most common cleaning procedures which are almost always done are:
\begin{enumerate}
\item Normalize the case of the text (e.g. make sure all text is lowercase).
\item Remove invalid characters for the current encoding.
\item Standardize all characters (e.g. transform accented characters into their
\textit{non-accented} equivalent).
\item Ensure all dates comply with a given format, such as \textit{DD/MM/YYYY}.
\item Remove empty values or fill them with a default value.
\item Standardize the \textit{gender} representation (i.e. always use either \textit{m} or
\textit{male}).
\item Normalize names and addresses \cite{Churches2002}.
\end{enumerate}
The goal of this step is for the catalogs and their data to become as similar as possible amongst
themselves, so that there will be no need to deal with the potential differences between them once the matching process has begun.
It has also been shown \cite{clark2004_rl_for_injury} that a well-made data
preprocessing step is necessary for the record linkage algorithm to achieve good
performance results.
For example, consider that the following data is available in some input catalogs. Note that
in this step no distinction has been made on whether the catalog is a source
catalog or not, so in this specific example the
\texttt{Catalog ID}/\texttt{Target ID} fields will be omitted.
\begin{table}[H]
\centering
\begin{tabular}{l|l|l}
First Name & Last Name & Birth Date \\ \hline
\foreignlanguage{russian}{Анника} & \foreignlanguage{russian}{Цлаире} & 12 Sep. 1948 \\
Abélia & Benoît & 14/4/1987 \\
Dave & Smith & 7/28/1976
\end{tabular}
\caption{Data preprocessing example: dirty catalog.}
\label{tab:data-prepr-ex-dirty}
\end{table}
Before proceeding it is necessary to clean this data: the first entity has its first and last names written in Cyrillic letters, and its date format differs from the expected one; the second entity's first and last names use accented letters; and finally, the date of the third entity also does not respect the stated date requirements (the month and day are reversed).
After passing the input catalog through the data preprocessing step the following output will be obtained:
\begin{table}[H]
\centering
\begin{tabular}{l|l|l}
First Name & Last Name & Birth Date \\ \hline
Annika & Claire & 12/9/1948 \\
Abelia & Benoit & 14/4/1987 \\
Dave & Smith & 28/7/1976
\end{tabular}
\caption{Data preprocessing example: cleaned catalog.}
\label{tab:data-prepr-ex-cleaned}
\end{table}
Notice that now all the entities conform to the required format. Once all the
catalogs have been cleaned, they may be sent over to the next step of the workflow
(\textit{blocking}).
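As a minimal sketch of such a cleaning step (using only Python's standard \texttt{unicodedata} and \texttt{datetime} modules; the field names and accepted date formats are assumptions of this example), the second entity of the dirty catalog could be cleaned as follows. Transliteration of non-Latin scripts, as needed for the first entity, would require a dedicated library and is omitted.
\begin{lstlisting}
# Minimal cleaning sketch: lowercase text, strip accents and
# normalize dates to DD/MM/YYYY (hypothetical field names).
import unicodedata
from datetime import datetime

def strip_accents(text):
    """Replace accented characters with their non-accented equivalent."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize_date(raw, formats=("%d/%m/%Y", "%m/%d/%Y", "%d %b. %Y")):
    """Try a few known input formats and emit DD/MM/YYYY."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    return raw  # leave untouched if no known format applies

record = {"first": "Ab\u00e9lia", "last": "Beno\u00eet", "born": "14/4/1987"}
cleaned = {
    "first": strip_accents(record["first"]).lower(),
    "last": strip_accents(record["last"]).lower(),
    "born": normalize_date(record["born"]),
}
print(cleaned)  # {'first': 'abelia', 'last': 'benoit', 'born': '14/04/1987'}
\end{lstlisting}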
\subsection{Blocking}
\label{sec:rl-workflow-blocking}
Once the cleaned entities from the previous step are available, it is necessary to define which
of them should be compared with one another. By default, the pairwise comparison will be performed over all entities from all catalogs. However, this can quickly
become infeasible when catalogs start getting bigger, or when their number starts to increase.
Moreover, if the dataset being utilized contains many different
entities, comparing all of them against each other will reveal that most of the resulting pairs are not matches.
% Suppose for example that we have two datasets with $100 000$ entities each. If
% we were to
It is generally the case when comparing all entities with each other that
the number of comparisons needed increases quadratically with the size of the
catalogs, while the number of true matches increases linearly
\cite{christen12_data}.
Blocking (also known as \textit{indexing}) consists of comparing a specific entity only against entities that might be a match. When matching said entity against a dataset, it is possible to use a simple rule to define a subset of all entities,
composed only of those which the rule suggests are possible matches.
This new set of candidate entities is not guaranteed to contain any match, nor is it
guaranteed that no actual matches were left out of it, but it entails a much smaller number of comparisons.
If a good blocking rule is chosen then the record linkage process becomes much more efficient.
Suppose there are two catalogs, $\Psi$ and $\Omega$, and a specific entity $x \in
\Psi$ is being considered, for which there is a need to know which entities in the catalog $\Omega$ match with it.
Rather than comparing $x$ with the whole $\Omega$ dataset (which would result in
$|\Omega|$ comparisons for $x$ alone, and $|\Psi| \times |\Omega|$ comparisons overall), a subset of possible matches is derived using a
blocking rule $r$, which takes a pair of entities and tells whether they might be a \textit{match} according to the rule or not.
$$
blocked(x) = \{\, y \in \Omega \mid r(x, y) = 1 \,\}
$$
There are many possible blocking rules that can be used
\cite{christen12_survey_index_techn_scalab_recor_linkag_dedup}, ranging from the
very simple \textit{exact string match}, to the more complex phonetic and q-gram
comparison of text. A more complex algorithm may be more precise than a simpler one, but is usually slower. For example, a metric that compares multiple subsets of a text may be resistant to minor spelling errors, while one that compares the string character by character may not. However, the latter is less computationally expensive to perform.
These rules are applied to a pair of fields from two different entities. For
example, it is possible to apply an \textit{exact string match} rule to the \textit{name}
field of two entities and see that they match if their names are exactly the
same. This simple operation makes it easy and fast to segment the dataset into
clusters of possible matches.
The decision regarding the \textit{rule} used for blocking is part of the record
linkage problem, and it needs to be simple enough so that performance does not suffer, but good enough so that the resulting set of potential
matches includes real matches.
It is also possible to freely use multiple blocking
rules to get multiple sets of possible matches and then use the union of these
to do the final comparisons.
The concept of \textit{blocking} has been used in record linkage almost since
the start of the field \cite{fellegi69_theor_recor_linkag}, and it is now an integral part of most record linkage solutions \cite{winkler2006overview}. The
field of blocking is a big area of research in and of itself, with many
different methods proposed over the years
\cite{christen12_survey_index_techn_scalab_recor_linkag_dedup, Baxter2003ACO}.
However, the evaluation of these methods is outside the scope of the current
dissertation, which is limited to one of the most basic blocking rules:
a technique that splits two strings into their corresponding
tokens and, if any of these tokens match, considers the pair a potential match.
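A minimal sketch of this token-based blocking rule follows; the single textual field (\texttt{name}) and the catalog contents are assumptions made only for this example, and the snippet is not \soweego's actual implementation.
\begin{lstlisting}
from collections import defaultdict

def tokens(text):
    """Split a textual field into lowercase tokens."""
    return set(text.lower().split())

def token_blocking(source, target, field="name"):
    """Pair each source entity with every target entity that shares
    at least one token on the given field."""
    index = defaultdict(set)
    for t_id, t in target.items():
        for tok in tokens(t[field]):
            index[tok].add(t_id)

    candidate_pairs = set()
    for s_id, s in source.items():
        for tok in tokens(s[field]):
            for t_id in index.get(tok, ()):
                candidate_pairs.add((s_id, t_id))
    return candidate_pairs

source = {"A1": {"name": "Anna Claire"}, "A2": {"name": "Claudio Bari"}}
target = {"B1": {"name": "Annika Claire"}, "B2": {"name": "Claude Bari"}}
print(token_blocking(source, target))  # {('A1', 'B1'), ('A2', 'B2')}
\end{lstlisting}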
For now, only blocking for fields which are strings has been discussed. But
blocking can be applied to many other fields as well (although strings are the
most common ones). For example, other possible ways in which blocking could be applied are:
\begin{itemize}
\item Gather all entities with a specific \textit{zip code}.
\item Gather all entities which were born in a specific \textit{year}; etc.
\end{itemize}
Blocking is quite flexible, and it is up to the designer how strict to make it.
It is generally agreed
\cite{christen12_survey_index_techn_scalab_recor_linkag_dedup} that it is better
to have a more \textit{lax} blocking technique, which includes entities that are
not a match and filters them out later during the actual \textit{linking} step
(see \autoref{sec:rl-workflow-linking}), than a more precise one
which leaves out potential matches.
In other words, when choosing a blocking rule it is preferable to pick one which tends to include more entities than necessary, rather than one which excludes actual matches in order to admit only a few non-matches. This trade-off needs to be decided on a per-problem basis.
As an example suppose the following two catalogs are available. They both contain the
first and last names and the zip code of each person. The first also contains
whether said person is \textit{male} or \textit{female}, while the second catalog
contains the \textit{academic title} of each person.
\begin{table}[H]
\centering
\begin{tabular}{l|l|l|l|l}
Source ID & First Name & Last Name & Zip Code & Sex \\ \hline
A1 & Anna & Claire & 27321 & f \\
A2 & Claudio & Bari & 34918 & m
\end{tabular}
\caption{Source Catalog: Blocking example.}
\label{tab:blocking-ex-source}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{l|l|l|l|l}
Target ID & First Name & Last Name & Zip Code & Education \\ \hline
B1 & Annika & Claire & 27321 & Ph.D. \\
B2 & Claude & Bari & 34918 & Bachelor \\
B3 & Jane & Stel & 34918 & Bachelor
\end{tabular}
\caption{Target Catalog: Blocking example.}
\label{tab:blocking-ex-target}
\end{table}
Suppose the goal is to perform blocking on the \textit{Zip Code} field, and that an
\textit{exact match} strategy is used to achieve it. In total there are only two zip
codes (\textit{27321} and \textit{34918}), so after blocking there will end up being two sets. The first one (for zip code \textit{27321}) will contain the pair
$[(A1, B1)]$, both of which could arguably be the same person. In the second set, the remaining pairs $[(A2, B2), (A2, B3)]$ can be found.
For the first set it is only necessary to compare the source entity \textit{Anna Claire}
with \textit{one} target entity. In the second case, the
source entity \textit{Claudio Bari} must be compared with two target entities. In both cases, the
total number of comparisons for a source entity is less than what would be needed
without blocking (3 comparisons).
The gain in the example above does not seem like much, but it is easy to imagine that
this procedure would be very effective at reducing the total number of
comparisons once the number of zip codes and, by extension, the number of sets
generated by blocking, starts to increase.
In short, the goal of blocking is to reduce as much as possible the number of
comparisons that each entity will be subject to by using a \textit{blocking
rule} which finds the best potential matches for a given entity in an efficient
way.
\subsection{Feature Extraction}
\label{sec:rl-workflow-feat-extraction}
The feature extraction step is in charge of taking a pair of entities and
extracting from them a set of \textit{features}, which characterize how similar
the entities in said pair are. The pairs of entities for which the features are extracted
are those which the previous \textit{blocking} step marked as good
candidates for a match.
As mentioned before, each entity is composed of a series of fields. By comparing said fields, and extracting a value from each comparison, a new vector can be generated. This vector, called \textit{feature vector}, characterizes the similarity between each field of the pair of entities.
The comparison itself is made by applying a function to a given pair of fields.
$$
feature_i = f(entity_y.field_i,\ entity_z.field_i)
$$
The function $f$ is a \textit{comparison function}, which shows the
similarity, or distance, between two fields of the same kind (for example, the \textit{name} field of both entities). As such, the decision of which functions to select is
part of the record linkage problem, and the designer is free to use as many as needed (i.e. if desired the designer could compare two textual fields from a pair of entities
using multiple string similarity functions).
Some of these comparison functions yield a percentage that describes how
similar two specific fields are. Others only indicate whether two fields are the same or not. The latter are referred to as \textbf{exact match} functions.
When one of the fields being compared has no value then the function treats the
pair as if it was not a match and returns a similarity of zero.
For example, one such comparison function can be based on the \textit{Levenshtein distance} \cite{levenshtein1966binary}, which measures the difference between two strings. In this specific case it will be used to see how similar two strings truly are (this similarity metric will be explained more in depth in \autoref{sec:feature-levenshtein-on-name-tokens}).
The \texttt{Levenshtein} function returns a distance that increases the more two strings are dissimilar, which is not the desired behaviour: what is needed is a result close to $0$ when the strings are dissimilar and $1$ when they are the same. So the distance is adapted by dividing it by the length of the longest string\footnote{which is also, incidentally, the maximum value the \texttt{Levenshtein} function can return} and taking the complement. In other words, the new function is the complement of the distance expressed as a fraction of the maximum value that \texttt{Levenshtein(a, b)} can return.
$$
lev(a,b) = 1 - \frac{Levenshtein(a, b)}{\max(|a|, |b|)}
$$
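The adapted similarity can be sketched as follows; the Levenshtein distance itself is implemented with a straightforward dynamic-programming routine so that the example is self-contained (in practice a library implementation would be used).
\begin{lstlisting}
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def lev(a, b):
    """Similarity in [0, 1]: 1 minus the distance normalized by the
    length of the longest string."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(round(lev("claudio", "claude"), 3))  # 0.714
print(round(lev("claudio", "jane"), 3))    # 0.143
\end{lstlisting}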
To continue with the example from the last section, consider \autoref{tab:blocking-ex-source} as the source catalog and \autoref{tab:blocking-ex-target} as the target catalog. Also suppose that the second
set yielded by the blocking procedure above, $[(A2, B2), (A2, B3)]$, is being used. Applying the
\texttt{lev} similarity defined above to the first and last name fields will
yield the following results:
\begin{table}[H]
\centering
\begin{tabular}{l|l|l|l|l}
Pair IDs & Lev(First Name) & Lev(Last Name) \\ \hline
(A2, B2) & 0.714 & 1.0 \\
(A2, B3) & 0.142 & 0.0
\end{tabular}
\caption{Example features obtained using adapted Levenshtein distance.}
\label{tab:ex-feature-levens}
\end{table}
As expected, the names \textit{Claudio} and \textit{Claude} are quite similar,
and their last name is an exact match. On the other hand \textit{Claudio} and
\textit{Jane} are very dissimilar, with their last name (\textit{Bari} and
\textit{Stel} respectively) being as different as possible.
In this example, only two fields were used and one comparison function was applied; in practice,
however, more fields, and potentially many different comparison
functions for the same fields, can be used.
\subsection{Linking/Classification}
\label{sec:rl-workflow-linking}
In the \textit{linking} (also called \textit{classification}) step, the feature vectors extracted in the previous step are fed to a decision model, which decides whether each pair is a match or not.
There are many decision models that can be used for this step \cite{gu06_decis_model_recor_linkag}.
The \textit{go-to} model for record linkage is the basic probabilistic model defined in
\cite{fellegi69_theor_recor_linkag}. Other possible models \cite{Elfeky_tailor} include using the \textit{k-means} \cite{Hartigan1979} clustering algorithm to cluster the feature vectors into \textit{matched}, \textit{unmatched} and \textit{potential match} clusters, or, if labeled data is available, training a supervised machine learning algorithm to learn when to classify a
feature vector as a match or not.
Regardless of the model used, the output of this step can be either two sets (\textit{matches} and \textit{non-matches}) or three sets (\textit{matches}, \textit{non-matches}, and \textit{potential matches}). Usually, the set of potential matches is reviewed manually to define which are actually a match \cite{dusetzina_m_2014}.
To continue with the previous example, consider as input to the linking step
the feature vectors in \autoref{tab:ex-feature-levens}, and suppose that a
simple rule-based decision model is being used:
\begin{equation*}
decision\_model(f_i) =
\begin{cases}
\text{match}, & \text{if}\ mean(f_i) \geq 0.5 \\
\text{non-match}, & \text{otherwise}
\end{cases}
\end{equation*}
Where $f_i$ is one of the feature vectors. Applying this model to the incoming
feature vectors would yield the following results:
\begin{table}[H]
\centering
\begin{tabular}{l|l|l}
Pair IDs & $mean(f_i)$ & Model's decision \\ \hline
(A2, B2) & 0.857 & match \\
(A2, B3) & 0.071 & non-match
\end{tabular}
\caption{Example of the model's decisions given the input feature vectors in
\autoref{tab:ex-feature-levens}.}
\label{tab:ex-linking}
\end{table}
With these results it is now possible to merge the corresponding entities (in this case $(A2, B2)$). Just to clarify, the procedure to join the data in the pairs predicted as \textit{matches} is not part of the record linkage process.
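The rule-based decision model above is simple enough to write down directly; the feature vectors below are the ones from \autoref{tab:ex-feature-levens}.
\begin{lstlisting}
def decision_model(feature_vector, threshold=0.5):
    """Mark a pair as a match when the mean of its features
    reaches the threshold."""
    mean = sum(feature_vector) / len(feature_vector)
    return "match" if mean >= threshold else "non-match"

# Feature vectors for the candidate pairs (A2, B2) and (A2, B3).
candidate_pairs = {("A2", "B2"): [0.714, 1.0], ("A2", "B3"): [0.142, 0.0]}
for pair, features in candidate_pairs.items():
    print(pair, decision_model(features))
# ('A2', 'B2') match
# ('A2', 'B3') non-match
\end{lstlisting}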
\subsection{Evaluation}
\label{sec:rl-workflow-evaluation}
As the name suggests, the evaluation step measures the performance of the record linkage procedure. Record linkage can be thought of as a supervised machine learning problem; as such, its results are evaluated like those of any other supervised algorithm.
To do this, it is necessary to have some previously labeled data that can be used to calculate the metrics given the predictions of the model. This \textit{data} is nothing more than labeled pairs of entities stating whether each pair is actually a match or not\footnote{the same as the \textit{training set} when referring to supervised machine learning}, and it should be different from the data currently being classified\footnote{the \textit{classification set} in supervised learning}.
The metrics considered when evaluating a record linkage pipeline are the commonly used
\cite{Powers2011_evaluation} \textit{precision}, \textit{recall}, and
\textit{$F_1$-Score}.
Before defining the above metrics more in depth, it is necessary to first define the concepts of \textbf{true/false positives} and \textbf{true/false negatives}, from which they are derived.
\begin{table}[H]
\centering
\begin{tabular}{ll|l|l|}
\cline{3-4}
& & \multicolumn{2}{l|}{True Conditions} \\ \cline{3-4}
& & Real Positives & Real Negatives \\ \hline
\multicolumn{1}{|l|}{\multirow{2}{*}{Predicted Conditions}} & Predicted Positives & True Positives & False Positives \\ \cline{2-4}
\multicolumn{1}{|l|}{} & Predicted Negatives & False Negatives & True Negatives \\ \hline
\end{tabular}
\caption{Confusion Matrix: Definition}
\label{tab:confusion-matrix-definition}
\end{table}
\begin{description}
\item[True Positives (TP)] are the elements predicted as \textit{positive}
and are labeled as \textit{positive} in the evaluation data.
\item[False Positives (FP)] are the elements predicted as \textit{positive}
but are really labeled as \textit{negative} in the evaluation data.
\item[False Negatives (FN)] are the elements predicted as \textit{negatives}
which are really labeled as \textit{positive} in the evaluation data.
\item[True Negatives (TN)] are the elements predicted as \textit{negative}
and are really labeled as \textit{negative} in the evaluation data.
\end{description}
The \textit{counts} of the four categories defined above are used to compute the metrics of \textit{precision}, \textit{recall}, and \textit{$F_1$-Score}. All of the employed metrics yield results in the range $[0,1]$, where $0$ is the worst score and $1$ the best.
\subsubsection{Precision}
\label{sec:evaluation-metric-precision}
Precision is a metric that measures the percentage of \textit{True Positives}
amongst all the elements which were predicted as \textit{positive}. The fewer \textit{negative} examples are misclassified as \textit{positive}, the higher the
\textit{Precision} will be. It is defined as:
$$
Precision = \frac{|TP|}{|TP| + |FP|}
$$
\subsubsection{Recall}
\label{sec:evaluation-metric-recall}
\textit{Recall}, on the other hand, measures the percentage of \textit{positive} values which were actually predicted as \textit{positive}. The fewer \textit{positive} examples are misclassified as \textit{negative}, the higher the
\textit{Recall} will be. It is defined as:
$$
Recall = \frac{|TP|}{|TP| + |FN|}
$$
Recall alone is not an informative metric, since achieving a high recall is
quite easy: just classifying everything as \textit{positive} will yield a
perfect \textit{recall}. For instance, if all samples are predicted as \textit{positive} then $FN = \varnothing$, meaning that $\frac{|TP|}{|TP| + 0} = 1$.
\subsubsection{$F_1$-Score}
\label{sec:evaluation-metric-f1}
The $F_1$ score is the \textit{harmonic mean} of \textit{precision} and
\textit{recall}. It is $1$ when both \textit{precision} and \textit{recall} are
$1$, and it approaches $0$ when either of them approaches $0$. This metric is commonly used to combine
its two component metrics into one. It is defined as:
$$
F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$
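All three metrics can be computed directly from parallel lists of predicted and true labels, as in the following minimal sketch (the toy labels are made up for illustration):
\begin{lstlisting}
def evaluate(predicted, actual):
    """Compute precision, recall and F1 from boolean label lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 4 pairs predicted as matches, of which 3 are true matches.
predicted = [True, True, True, True, False, False]
actual = [True, True, True, False, True, False]
print(evaluate(predicted, actual))  # (0.75, 0.75, 0.75)
\end{lstlisting}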
Even though the $F_1$ metric is commonly used for evaluating machine learning problems, its usage for record linkage has been criticized \cite{hand17_note_using_f_measur_evaluat}, since it presupposes that \textit{precision} and \textit{recall} are of equal importance. In some scenarios, be they record linkage or other machine learning problems, a higher \textit{recall} may be valued more than a higher \textit{precision} (or vice versa). In these cases the $F_1$ metric should be used with care, and analyzed together with \textit{precision} and \textit{recall}.
\section{Related Work}
\label{sec:rl-related-work}
This section shows related work which might be relevant to the present dissertation. Of special interest are any applications of machine learning for record linkage problems, and especially works that have successfully applied \textit{ensemble} classifiers in a record linkage pipeline.
In \cite{Christen:2008_svm}, an SVM is trained as the model which distinguishes if two entities are the same or not given their comparison vectors. In this work, the authors start by identifying \textit{clear match} examples, which are those whose elements in the comparison vectors are near to 1. They then proceed to train the SVM on these \textit{clear matches}. The rest of the data then gets processed, and a prediction for each pair is obtained. Those pairs which are confidently predicted as matches are used, together with the original training data, to train a new SVM, iterating this process multiple times.
In \cite{ektefa2011comparative} different classification models are tested for unsupervised record linkage. Specifically, the comparison considered \textit{naive Bayes}, \textit{support vector machines}, \textit{decision trees}, and \textit{Bayesian networks}; among these, the \textit{SVM} and \textit{naive Bayes} classifiers performed best. The training is done similarly to \cite{Christen:2008_svm}: the set of clear entity pairs is chosen as the initial training set, and the classifier is then trained iteratively, each time including in the training set the top predictions made during the previous iterations.
In \cite{Wilson2011_slp_in_rl} a \textit{single layer perceptron} is used to separate matching pairs from non-matching pairs. The authors report that using this classifier outperforms the basic probabilistic record linkage approach.
Instead of using a simple comparison function, in \cite{gottapu2016entity} a convolutional neural network (CNN) is used to measure how similar two strings of text are. The authors compare this with the \textit{Jaccard} text similarity metric and find that a record linkage pipeline using the CNN-extracted comparisons performs better than one using the Jaccard similarity.
In \cite{Cochinwala2001}, a record linkage pipeline is built that leverages a decision tree classifier to derive a series of rules that classify a pair of entities by looking directly at the fields of said entities.
The TAILOR framework, developed in \cite{Elfeky_tailor}, provides a series of tools to perform record linkage. For unlabelled datasets, the authors propose clustering the unlabelled pairs of entities, based on their comparison vectors, into three clusters (matches, non-matches, and potential matches). The elements of the \textit{matches} and \textit{non-matches} clusters are then used to train a decision tree. The authors found that their method performed better than the standard \textit{probabilistic record linkage} with thresholds.
In \cite{kim2016random_forest_dbscan}, the goal is to find, in a database of patents, all those entities which refer to the same real-world entity. The \textit{DBSCAN} clustering algorithm is used to perform the blocking, and candidate pairs are then fed to a \textit{random forest classifier} to predict whether they are a match or not.
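A rough sketch of DBSCAN-based blocking is shown below; the feature representation, the DBSCAN parameters, and the candidate pair generation are illustrative assumptions rather than the exact procedure of the cited work.
\begin{verbatim}
from itertools import combinations
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_blocking(entity_features, eps=0.5, min_samples=2):
    """Group entity feature vectors into blocks with DBSCAN; only pairs
    falling in the same cluster become candidate pairs (noise points,
    labeled -1, are skipped)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(entity_features)
    candidate_pairs = []
    for cluster_id in set(labels):
        if cluster_id == -1:
            continue
        members = np.where(labels == cluster_id)[0]
        candidate_pairs.extend(combinations(members, 2))
    return candidate_pairs

# A random forest (e.g. sklearn.ensemble.RandomForestClassifier) would then
# classify the comparison vector of each candidate pair.
\end{verbatim}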
An ensemble composed of an \textit{SVM}, \textit{logistic regression}, and a \textit{neural network} is built in \cite{KhaVo2019_metaensemble}. This ensemble is used as the classifier in a record linkage pipeline that needs to find the information of unique entities across multiple databases in the medical domain. The authors optimize the base algorithms by means of a grid search procedure, and report that the ensemble performs better than any of the base classifiers evaluated individually.
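A comparable setup can be sketched with scikit-learn as follows; the hyper-parameter grids and the soft-voting combination are illustrative assumptions, not the exact configuration of the cited work.
\begin{verbatim}
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def build_tuned_ensemble(X_train, y_train):
    """Grid-search each base classifier, then combine them with soft voting."""
    base_models = {
        "svm": (SVC(probability=True), {"C": [0.1, 1, 10]}),
        "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
        "mlp": (MLPClassifier(max_iter=500),
                {"hidden_layer_sizes": [(50,), (100,)]}),
    }
    tuned = []
    for name, (model, grid) in base_models.items():
        search = GridSearchCV(model, grid, cv=5)
        search.fit(X_train, y_train)
        tuned.append((name, search.best_estimator_))

    ensemble = VotingClassifier(estimators=tuned, voting="soft")
    return ensemble.fit(X_train, y_train)
\end{verbatim}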
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Problem Statement}
\label{chap:problem-statement}
Wikidata is a huge repository of knowledge consumed by people from all over the world. However, a large part of the information it contains cannot be verified, since it is missing references to external, trusted sources of information. These references are essential to ensure that the content is of the highest quality.
Currently, the latest available snapshot in the Wikimedia Grafana interface\footnote{Snapshot for \textit{2019-09-05} can be seen at \url{https://grafana.wikimedia.org/dashboard/snapshot/jWeAFAg0QczHXK62h7rXvJPH2Zh4D609}} states that there are a total of \textit{744.4 million} \textbf{statements} in Wikidata. Of these, a full \textit{200.5 million (26.89\%)} are unreferenced, which represents a large chunk of the total. This means that a sure way to improve the quality of Wikidata is to develop a way in which these statements can be populated with their respective references.
\texttt{soweego} is a compelling answer to this problem, since it strives to automatically link Wikidata with external catalogs, making its content more trustworthy. Even though the project has made great contributions by constructing a machine learning based pipeline to automatically link entities, there is still room for improvement. Specifically, \texttt{soweego}'s pipeline allows only one machine learning algorithm to operate at a time, so a single algorithm needs to be chosen when executing the pipeline. This is not an easy task, since not all algorithms perform optimally on all catalogs, meaning that a trade-off must ultimately be made.
\texttt{soweego} may benefit from considering the opinion of more than one algorithm at a time, since doing so allows the final result to be derived from a wider spectrum of information provided by a diverse set of classifiers.
\section{Use Case}
\label{sec:statement-use-case}
As a starting point, \texttt{soweego} currently focuses on the \textit{people} domain. At the time of this writing, the \textit{Wikidata Statistics} page\footurl{https://www.wikidata.org/wiki/Wikidata:Statistics} states that there are a total of 61 million entities, of which 5.2 million (roughly 10\%) are people, a large portion of all entities.
Note that, even though the development is centered around \textit{people}, the idea is to develop a flexible system that can easily be extended to include external catalogs containing different kinds of entities, rather than one exclusively centered on people.
For the purpose of this dissertation, the following catalogs were selected as a starting point\footurl{https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego/Timeline\#Monthly\_updates}. Note that this section is limited to mentioning the catalogs and their entity types; a more in-depth explanation of each is given in \autoref{sec:data-preprocessing}.
\subsection{Internet Movie Database (IMDb)}
\label{sec:catalog-imdb}
\textbf{IMDb}\footurl{https://www.imdb.com/} provides a complete and up-to-date collection of information about movies and the people that appear in them. IMDb provides a dump of its data which is free to use, although somewhat limited in the information it contains.
The \textit{people dump} from IMDb states the three primary professions of each person, which are the three main jobs for which they have appeared in the credits of movies.
This information may be leveraged by separating the incoming data according to profession. These professions become the \textit{types of entities} provided by the external catalog. In IMDb, professions are not mutually exclusive, so a person might be both an \textbf{actor} and a \textbf{musician}. The following list contains all the types of entities extracted from IMDb:
\begin{itemize}
\item Actor
\item Director
\item Musician
\item Producer
\item Writer
\end{itemize}
IMDb's dump does not provide such a clean separation of the different types of entities. This clear segmentation was the result of an analysis and discussion carried out by the whole development group\footurl{https://github.com/Wikidata/soweego/issues/165} with the aim of finding the best matches between the IMDb professions and Wikidata entities.
IMDb's data is mostly provided by contributors. However, information like the \textit{name}, \textit{description}, and \textit{image} (either the profile image of a person or the cover of a movie/TV series) is manually curated by the IMDb team. Automated access to the data itself is, as said above, restricted: IMDb only provides a partial dump of its whole database, and using web crawlers to access more information is not permitted according to its terms of use.
\subsection{MusicBrainz}
\label{sec:catalog-musicbrainz}
\textbf{MusicBrainz}\footurl{https://musicbrainz.org/} is a music encyclopedia, which contains metadata about music, musicians, and bands.
This chapter's introduction mentioned that \texttt{soweego} works mainly with people; however, it also includes \textit{musical bands}. This is because \textit{bands} have fields very similar to those of people, and it is an example of the project's flexibility.
The MusicBrainz database is collectively maintained, with the help of the MusicBrainz team. All of MusicBrainz's data is in the public domain and, as such, it is free to use.
The types of entities utilized for this project from the MusicBrainz database are:
\begin{itemize}
\item Musician
\item Band
\end{itemize}
\subsection{Discogs}
\label{sec:catalog-discogs}
The \textbf{Discogs}\footurl{https://www.discogs.com} database is also maintained in a crowd-sourced way. It contains information about musical records, musicians, and bands. The database is in the public domain, so it can be used freely.
From the Discogs database, the following entity types are retrieved:
\begin{itemize}
\item Musician
\item Band
\end{itemize}
\chapter{Approach}
\label{chap:approach}
This chapter covers the architecture of the \soweego project, describing each of its parts and how they relate to the record linkage \textit{workflow}.
\section{System Architecture}
\label{sec:system-architecture}
% NOTE: these sections should provide a more in depth explanation of how
% everything works, but it should still leave everything in generic terms. For
% instance, the used vectors are explained in the next chapter and the baseline
% models/ ensemble models have their own chapter as well. This is to prevent
% this section from bloating up too much.
\texttt{soweego} presents a pipeline through which data flows, and is composed of a series of steps which closely resemble the record linkage workflow (explained in depth in \autoref{sec:rl-as-a-workflow}). The pipeline is executed once for each \textit{concrete entity type}\footnote{Every different combination of \textit{catalog + entity} is a concrete entity type, since each represents a different entity type with different provenance. As seen later, each is saved as a different \textit{model} in \texttt{soweego}'s internal database}.
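Purely as an illustration of this per-catalog execution, the loop over \textit{catalog + entity} combinations can be pictured as in the sketch below; the function and variable names are hypothetical and do not correspond to \texttt{soweego}'s actual command-line interface or module names. The catalog/entity combinations themselves are the ones listed in \autoref{sec:statement-use-case}.
\begin{verbatim}
# Purely illustrative: hypothetical names, not soweego's real API.
CATALOG_ENTITIES = {
    "imdb": ["actor", "director", "musician", "producer", "writer"],
    "musicbrainz": ["musician", "band"],
    "discogs": ["musician", "band"],
}

def run_all_pipelines(run_pipeline):
    """Execute the linkage pipeline once per concrete entity type."""
    for catalog, entities in CATALOG_ENTITIES.items():
        for entity in entities:
            run_pipeline(catalog=catalog, entity=entity)
\end{verbatim}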
Specifically, the \texttt{soweego} pipeline is structured as follows:
\begin{figure}[H]
\centering \includegraphics[width=\textwidth]{soweego_structure}
\caption{Structure of \texttt{soweego}. \textit{Catalog A} and \textit{Catalog
B} represent external databases which will be imported into
\textit{Wikidata}.}
\label{fig:soweego-structure}
\end{figure}