paper published in Computers and the Humanities (1999) 32, 39-56


Assessment of Systems for Nominal Retrieval and Historical Record Linkage

 

Gerrit Bloothooft

Utrecht Institute of Linguistics (OTS)
Utrecht University
Trans 10, 3512 JK Utrecht, The Netherlands
Keywords: nominal record linkage, nominal retrieval, assessment

 

Abstract

Problems in retrieval of names from large data bases and in nominal record linkage are discussed with respect to computational solutions. The quest for robust methods that can handle the typical variability of historical nominal information is discussed, with some emphasis on probabilistic methods. It is argued that comparison and assessment of different systems used on the same data could enhance our understanding of methodological issues.

 

1. Introduction

In the 1960s, there were interdisciplinary contacts between researchers with a background in computer science, mathematics, medicine, and history, to discuss the development of record linkage systems. Ten years later these contacts had ceased and research into nominal record linkage within a historical context was looking for its own solutions. A major reason for this development was probably that historical record linkage systems tended to be heavily dependent on the specific characteristics of historical data and on the corresponding research aims, which hampered cross-fertilisation with other disciplines. But this argument also proved to hold within the historical context itself. Apart from some broad agreement on the philosophy of linkage, developers went their own way. However, if we want to assess the pros and cons of the various approaches, the obvious way to do this would be to have different systems operate on the same data. Such an enterprise has not yet been undertaken, but would be most revealing with respect to the power of various linking strategies and their dependence on available data (richness of the information and the language, for instance) and research goals.

This paper gives an overview of several problem areas of nominal retrieval and historical record linkage. Challenges and limitations in the use of common data will be discussed for the standardisation problem - how to normalise spelling and variants of names, the learning problem - how to teach a system the variation in names, the retrieval problem - how to find names in huge databases that may be variants of a target name, the parsing problem - how to label first name(s), patronymic, surname(s), profession, place names and other items in word strings, and the linkage problem - how to decide what information refers to the same person. No attempt will be made to demonstrate solutions to these problems in detail, but it is hoped that the information presented and the questions raised will provide inspiration and, as a second theme, will show why computing for the humanities is special and challenging.

 

2. The standardisation problem

The task for a system performing nominal record linkage or retrieval would be considerably more transparent if all names were spelt in a standardised way. Indeed, it could be said that under such circumstances we would be better able to understand fundamental issues of record linkage. Two mainstream approaches can be distinguished in standardising names: the first directly tries to transform a given name into a standard, using a coding scheme or some extensive rule set; the second always compares two names and produces an estimate of the probability that the names are variants of each other, or computes a matching or proximity score. This procedure can also result in a single standard for a given name, by finding the standard with the highest probability or matching score.

Standardisation of names presupposes a one-to-one relation between any given name and a standard. We could question the reality of such a presupposition. Onomastics may give some etymological reasoning to show the development from some basic name form into numerous variants. For first names in particular, many productive mechanisms exist, for instance reduction (Cornelis -> Cor, Kees), diminutive suffixes (Trijn -> Trijntje), suffix variation (Klaasje, Klaaske, Klaassie), sound change (Dirk -> Derk, Durk), children's language (Wim -> Pim; Geertruida -> Geesje), or latinised forms (Willem -> Wilhelmus). This type of variation exists alongside true spelling variation such as Gesina -> Geessiena. If one aims to map names into sets each related to a single standard, there is the problem of overlap between sets (names that could be members of more than one set) and the problem of overgeneralisation (not all names in a set are mutually interchangeable). Overlaps between sets arise for reduced names, such as Dolf, that may be considered as a first name on its own, but also as a reduction of different base names that end in -dolf (Adolf, Rudolf, Ludolf). Sometimes names appear that mix two base names, such as Andriaan, possibly composed from Adriaan and Andreas. This illustrates that the choice and usage of names in real life is seldom based on profound knowledge of onomastics or linguistics, but is the result of human creativity (or errors), which we should welcome, but which also poses enormous analytical problems. On the other hand, it is likely that the variation in the name of the same individual, profession or place, over a limited period of time and in a limited region, is much smaller than may be expected from onomastic dictionaries. For instance, the base name Catharina has the variant Cathalijne and the reduction Trijntje. Whereas Catharina and Cathalijne may have been interchanged because of a confusion of r and l, and Trijntje is a normal diminutive of Catharina, Trijntje and Cathalijne are unlikely to be used for the same person.

Suppose that for some database of names we know the standards. If we encounter a new historical source it is probable that this source includes names not present in the database of names we already have. For these names we will have to make decisions about standards. The size of this task will be illustrated for Dutch first names. As a reference we have an electronic dictionary of Dutch first names, manually collected over the last four decades with the help of numerous informants (vd Schaar, Berns and Gerritzen, 1992). This database contains 19,923 first names from the whole of the Netherlands and from all ages. Each name is linked to one or more base names which could be considered as standards. Next to this reference database we have eight sets of first names taken from eight Dutch sources ranging from the late 16th century to the 1947 census1. Table 1 summarises the sources and gives their sample sizes, which range from 21 thousand up to 584 thousand records.

Table 1. First names in nine different Dutch sources. The column 'problem names' involves initials, corrupted fields, incorrect name type and first names with unidentifiable base name. The count for different first names involves all first names to which a base name could be attached. The same name for males and females was counted twice. The remaining number of names after semi-phonemic conversion and after reduction to the stem is given.

Source                                           sample   problem  different    semi-phonemic    semi-phonemic
                                                 size     names    first names  names            stems

Dutch dictionary of first names                       -        -        19,923    15,620  78%      8,807  44%
Dutch Genealogical Association database               -        -         9,760     7,932  75%      3,943  37%
1947 census                                      86,000      174         4,584     3,555  75%      2,150  47%
19th century civil registration Meierij region   66,110       17         1,410     1,170  82%        823  58%
17th-19th century various sources Brabant       584,316    3,213         5,674     4,119  73%      2,692  47%
1776-1811 parish registers Amsterdam            343,002    1,321        11,826     7,744  64%      4,364  37%
18th century parish registers Goes               49,193       82         2,086     1,418  67%        966  46%
1578-1649 Amsterdam merchants                    23,814       38         2,094     1,481  71%      1,206  58%
1531-1611 citizen register Amsterdam             21,229       66         1,725     1,393  79%      1,069  62%
TOTAL                                                                   39,286    26,056  66%     12,978  33%

 

Table 2. Coverage of the Dutch dictionary of first names for eight Dutch sources at the level of original writing, after semi-phonemic conversion of the names and after reduction of the name to the stem. Per source the coverage is also considered relative to all names in all other sources (dictionary plus seven sources).

 

 

                                                 different    covered by dictionary         covered by leaving-one-out
Source                                           first names  directly  name     stem       directly  name     stem
                                                                        level    level                level    level

Dutch dictionary of first names                  19,923
1947 census                                       4,584       64%       81%      95%        77%       88%      98%
Dutch Genealogical Association database           9,760       53%       71%      92%        67%       80%      94%
19th century civil registration Meierij region    1,410       51%       68%      85%        81%       86%      94%
19th century various sources Brabant province     5,674       39%       64%      83%        64%       78%      90%
1776-1811 parish registers Amsterdam             11,826       19%       59%      82%        33%       68%      86%
18th century parish registers Goes                2,086       33%       69%      87%        68%       83%      94%
1580-1632 Amsterdam merchants                     2,094       32%       65%      80%        64%       82%      88%
1560-1640 citizen register Amsterdam              1,725       38%       66%      84%        67%       83%      92%
TOTAL                                            27,502       28%       63%      86%

 

 

The first issue we have to address in this case of real-life data is their validity. It turns out that most databases contain a number of problematic names. These consist of initials, entries that are not first names (family names or other names), and corrupted or truncated data. Of course the information is not labelled as such; we have to find this out for ourselves. Recognition of initials (J.C.M.) will pose no problems, although their interpretation does. The presence of family names (van Akersloot, referring to a place name, or Kuyper, referring to a profession) in a first-name field requires the availability of reference databases for other name types to recognise their class. Table 1 shows that the larger databases, the contents of which were less well-controlled than the other dedicated databases, had considerable numbers of problem names.

For the present purpose, we manually separated the problem names from names that were considered to be real first names and which will be subjected to further analysis. The number of different first names per source is also presented in Table 1. There is a weak relationship between this number and the sample size, but differences between sources are large. We now consider the question how many of the first names in the eight sources can be found in the Dutch dictionary of first names (see Table 2). This coverage ranges between 19 and 64%. These are astonishing figures: they show that a simple look-up is highly insufficient. The total number of new names in the eight sources (19,824) is about equal to the number of names in the reference dictionary (19,923). And even if we compare each source with the total of names in the dictionary plus all seven other sources (left column under 'covered by leaving-one-out' in Table 2) the coverage still ranges between 33 and 81%. Although from the present data we may conclude that the percentage of unknown spellings of first names is likely to exceed 20%, in the long run this percentage is expected to decrease slowly with an increasing number of names in a reference table. There is no reason to believe that this behaviour will be different for surnames, toponyms, or names for professions.
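The coverage figures in Table 2 amount to simple set arithmetic on distinct names. A minimal sketch in Python, with invented miniature name sets standing in for a source and the reference dictionary:

```python
# Sketch: direct look-up coverage of a source against a reference
# dictionary, as in Table 2. All names below are invented examples.

def coverage(source_names, reference_names):
    """Fraction of distinct source names found verbatim in the reference."""
    source = set(source_names)
    found = source & set(reference_names)
    return len(found) / len(source)

dictionary = {"adriaan", "cornelis", "catharina", "trijntje"}
source = {"adriaan", "aadriaan", "cornelis", "geessiena"}

print(round(coverage(source, dictionary), 2))  # 2 of 4 names found: 0.5
```

The leaving-one-out figures are obtained the same way, with the reference set replaced by the union of the dictionary and the seven other sources.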

Since we do not find all names by direct look-up, we can try to remove part of the spelling variation by adopting rules in the hope that this will lead to a considerable reduction in the number of unseen variants. The effect of rules can be demonstrated again with our example of first names. We have developed rules that convert a name into a so-called semi-phonemic form. This form is related to the pronunciation of a name. For instance, the names Adriaan, Aadriaan, Adryan, Adrieaan will all map to adryan. We now first investigated the coverage of the Dutch dictionary of first names with respect to the eight historical sources mentioned before at this semi-phonemic level. Table 1 shows that such a transformation results in a reduction of the number of variants to between 64 and 82% of the original number of names per source (and to 66% for the total set of names). Table 2 shows an improvement of coverage of the Dutch dictionary of first names to between 64 and 81%. If we follow the leaving-one-out count, the coverage increases to between 68 and 86%.
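The semi-phonemic conversion can be thought of as an ordered list of rewrite rules applied to the spelling. The following sketch uses a handful of invented rules that happen to map the four example spellings to adryan; the actual rule set is far more extensive:

```python
import re

# Sketch of ordered semi-phonemic rewrite rules. The rules are invented
# and illustrative; a real rule set covers many more phenomena.
RULES = [
    (r"aa", "a"),           # collapse doubled vowels
    (r"ee", "e"),
    (r"ie", "y"),           # ie -> y
    (r"i(?=[aeou])", "y"),  # i before another vowel -> y
]

def semi_phonemic(name):
    form = name.lower()
    for pattern, repl in RULES:
        form = re.sub(pattern, repl, form)
    return form

for n in ["Adriaan", "Aadriaan", "Adryan", "Adrieaan"]:
    print(semi_phonemic(n))  # all four map to "adryan"
```

Note that rule order matters: collapsing doubled vowels first ensures that the later vowel rules see a canonical context.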

The final step we can try for first names is to remove the productive variation in suffixes and to concentrate on the stem of the name. For this, we need an algorithm that can make a reliable stem-suffix separation. Since by and large we know the suffixes, this is possible, although ambiguities pose difficulties: Aaltje -> Aal#tje or Aalt#je. Table 1 indicates that the number of stems per source (all at the semi-phonemic level) is between 37 and 62% of the original number of names. For all names together this amounts to 33%, one-third of the total number of names. It is interesting that this percentage for the total set of names is lower than for each individual source. This is related to the fact that the various sources tend to have their specific sets of suffixes with stems, while they differ less in the set of stems itself. If we compute the coverage of the Dutch dictionary at the semi-phonemic stem level we find figures between 80 and 95%. For the leaving-one-out method this is between 86 and 98%.
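A stem-suffix separation along these lines can be sketched as a match against a list of known suffixes, returning every licensed split. The suffix list below is a small invented sample, and the ambiguity of Aaltje shows up as two analyses:

```python
# Sketch: separating a first name into stem and productive suffix.
# The suffix list is a small invented sample; ambiguous names such as
# aaltje yield more than one possible split.
SUFFIXES = ["tje", "je", "ke", "ske", "ie"]

def splits(name):
    """All stem#suffix analyses licensed by the suffix list
    (requiring a stem of at least two characters)."""
    return [(name[:-len(s)], s) for s in SUFFIXES
            if name.endswith(s) and len(name) > len(s) + 1]

print(splits("aaltje"))   # [('aal', 'tje'), ('aalt', 'je')]
print(splits("klaasje"))  # [('klaas', 'je')]
```

A real system would have to decide between the competing analyses, for instance by checking which candidate stem occurs independently as a name.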

We do not yet have data that underpin a comparable behaviour for other name types, but the general tendency is not expected to be different: any new source we encounter will provide new problems in the interpretation of names, even if we accumulate data over all sources investigated. One should not be surprised if several tens of percent of all names give rise to interpretation problems. On the other hand, the new names often have a frequency of occurrence close to one. Their total frequency will then easily be less than 1% of all samples in a database. Whether this is serious or not depends on the research aims.

Although we have seen that rules can reduce the variation in names, the development of rules is hard, because they should be robust. Normally, a rule operates on a focus that is changed if certain conditions in the context of the focus are fulfilled. This requires that the context is robust with respect to spelling variation. Here we are especially hampered by the fact that we will continuously face new spelling variation that could not be anticipated at the moment the rules were developed. This means that sometimes a rule will not work because the context does not match due to some fancy spelling, and sometimes a rule will work while it should not because the context fulfils the conditions by accident.

Another issue related to robustness is multilinguality. In historical and modern sources, names from various language origins may show up. This complicates matters since rules are likely to be language-dependent. In a contemporary database of 40,000 different first names obtained from German Telekom (phone number directories)2 most names were of non-German origin! For a proper identification and standardisation of the names one then needs at least a basic set of names from all the languages involved. This is a real issue in historical databases that are, for instance, related to trade or immigration. Whereas the older databases are likely to be limited to European names, the present information sources - which are the historical sources of the near future - are likely to contain names from all over the world. It would be highly beneficial to nominal retrieval and record linkage, but also to onomastic research in general, to start international cooperation to bring together onomastic databases that at least include the frequency of occurrence per contemporary name per country (or language).

Name standardisation always implies some loss of information. In cases of writing or typing errors (Corenlis -> Cornelis), or spelling variation that does not affect pronunciation (Fransien, Francine), this is hardly regrettable, but when standardisation loses information about the sex or language of the name, this may be a serious drawback. If we standardise Alberta to Albert we lose gender information; if we standardise Alexandros and Alessandro to Alexander we lose a language indication (Greek and Italian). On the other hand, a Frenchman called Alexandre may be mentioned in Dutch archives as Alexander, but conversely, a typing error in the name of a Dutchman called Alexander may result in Alexandre. The e and r are neighbours on a QWERTY keyboard! This implies that we need to store not only the standard (or even more than one standard) of a name but also all possible interpretations of the difference between the actual writing and the standard.

Although the use of name standards can be very useful, it may be clear by now that automatic detection of a standard is not possible for many names we encounter in historical sources. We first showed that new sources are likely to contain a considerable number of name spellings we have not seen before, which rules out a look-up. This remains true after we have used algorithms that convert spelling to some pronunciation form, or algorithms that reduce a first name to its stem. Under multilingual circumstances we lack databases of names from various languages to facilitate interpretation and standardisation. This implies that even under the best possible conditions, considerable parts of standardisation still need to be done manually every time new sources are taken into consideration. Although this may be unavoidable for the time being, we may also reconsider the need for name standardisation within the framework of the systems we concentrate on in this paper: systems for nominal retrieval and nominal record linkage. What we actually want to know is the set of names that have some likelihood of corresponding to some target name. If we utilise standards, this set is given by names that have a common standard, and we only have to determine the most likely standard of the target name. If for a considerable number of names we cannot link standards to names directly, as suggested by the previous sections, we may adopt a more flexible probabilistic approach.

 

3. The learning problem

Instead of investigating whether two names share the same standard, we can investigate whether two names have some probability of being variants of each other. This problem is not very different from the general problem of checking spelling when a word has been typed that is not part of the known vocabulary and suggestions for correctly spelled words are presented. An excellent overview of this area has been given by Kukich (1992). However, where spelling checkers may focus on systematic characteristics of typing errors, take advantage of the knowledge of the keyboard layout, or know about letter confusions made in optical character reading (OCR), this type of variation is just one out of many in the historical spelling of names (where typing errors or OCR errors may occur during the input into an electronic database). Still, checking spelling is an example of a technology developed in the context of computational linguistics that could be helpful to further solutions in the specific area of the historical spelling of names.

Before we proceed, we need to discuss the issue of what a variant actually is. In the present context, we consider as variants all those different writings that denote the same individual, profession, place, or entities of any other type of information. For the development of algorithms that can predict whether two names are variants of each other, we first need many examples of proven variants. As discussed earlier, onomastic sources provide etymologically related variants, but these are not necessarily all variants used for the same individual in real life. The latter information is very difficult to obtain. Only genealogists and historians may have such information on the basis of personal investigations, but collecting spelling variations will rarely have been their first interest. Still, there is a lot to gain from an enquiry among the thousands of amateur genealogists in every country, which may result in a very valuable knowledge base of names used for the same individual. Another option is an iterative procedure where the output of a system for nominal record linkage is used to create and/or update the knowledge base. The system starts with an arbitrary set of parameters for name matching, or uses data derived from onomastic sources by taking all etymologically related names as variants of some name. Once a full record linkage procedure has been performed and validated, the linked names could be transferred to the knowledge base of 'real variants'. Since manual validation of results of record linkage may be very laborious, for the collection of 'real variants' one could decide to use automatically only the links with a very high confidence level. For instance, when a first name is combined with a rare surname (or the reverse), the rare name groups variants of the other name with high probability. The same holds, with even higher likelihood, for relatively rare name combinations of husband and wife. A third option is to use onomastic sources with a selection of variants of some base name which are not too different from each other.

Once we have a database of name pairs that are likely to be variants we can try to generalise properties that describe these variants. Character deletion, insertion, substitution or transposition can be modelled with a probability of occurrence which can be derived automatically. Much of this spelling variation will depend on the context. If the context needed to define the possible occurrence of some variation is limited, the variation can be expected to be present in many names, which may generate abundant training material and a fairly precise estimation of the probability of occurrence of that variation. This is for instance the case with the A in name initial position which, in Dutch, can be replaced by Ae with a high probability (Albert and Aelbert). In other cases the context may be complex and only a few examples of the variation can be found. If there are insufficient data it is not possible to make a reliable estimate of the probability of occurrence of the variation. In such a case, the developer may decide to include the few names that show the variation in a table of exceptions.
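The estimation of such probabilities from proven variants can be illustrated with a minimal sketch. For simplicity, only position-aligned substitutions in same-length pairs are counted; a real system would also align names of unequal length and condition on context. The variant pairs are invented examples:

```python
from collections import Counter

# Sketch: estimating letter-substitution probabilities from a small
# set of proven variant pairs (invented data).
pairs = [("dirk", "derk"), ("dirk", "durk"), ("derk", "durk")]

subst = Counter()
total = Counter()
for a, b in pairs:
    for x, y in zip(a, b):
        total[x] += 1
        if x != y:
            subst[(x, y)] += 1

# P(i -> e): the fraction of i's replaced by e across the pairs
p = subst[("i", "e")] / total["i"]
print(round(p, 2))  # 1 of 2 occurrences of i becomes e: 0.5
```

With few examples per phenomenon, such estimates are unreliable, which is exactly the situation that motivates the exception tables mentioned above.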

Broadly, there are two ways of modelling variation in names. The first method takes a standard name and models all variation that may arise from this standard. This requires a model for every standard name. For any new name the probability can be computed that the name has been generated by any of the models. The model with the highest probability wins. Hidden Markov Models could be used for this. Every standard name is then described in terms of letter positions and letters that can possibly be written at a particular position. Transition probabilities describe which positions may occur one after another in variants, while observation probabilities describe the occurrence of a letter at a position in a variant. This approach requires enough variants per standard name to estimate the probabilities (in an iterative process). Often this is not the case and one needs to generalise probabilities over models. To my knowledge this attempt has not been made yet. A second method gives a procedure to compare any two names and computes the probability that the two names are variants of each other. The technique of dynamic programming provides a way to find the optimal match between the names and the corresponding probability. The technique requires the probabilities of interchange of letters or pairs of letters, which can be obtained in an iterative way from proven variants (Bloothooft, 1994). The advantage of this approach is that phenomena are pooled across all names, which results in better estimation of probabilities. The disadvantage is that some types of variant are found in certain names only, and this property is lost in a generalised approach.
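The second method can be sketched as follows. The dynamic programming recursion maximises the product of per-letter operation probabilities over all alignments of the two names; the probability tables are tiny invented examples, not the trained values of Bloothooft (1994):

```python
# Sketch: dynamic programming match probability of two names, as the
# maximum product of single-letter operation probabilities over all
# alignments. All probability values are invented for illustration.
P_SUB = {("i", "y"): 0.8, ("c", "k"): 0.9}  # substitution probabilities
P_GAP = 0.1                                  # skip or insertion

def p_sub(x, y):
    if x == y:
        return 1.0
    return P_SUB.get((x, y), P_SUB.get((y, x), 0.01))

def match_prob(a, b):
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:  # substitute or copy a letter
                d[i+1][j+1] = max(d[i+1][j+1], d[i][j] * p_sub(a[i], b[j]))
            if i < n:            # skip a letter of a
                d[i+1][j] = max(d[i+1][j], d[i][j] * P_GAP)
            if j < m:            # insert a letter of b
                d[i][j+1] = max(d[i][j+1], d[i][j] * P_GAP)
    return d[n][m]

print(round(match_prob("dirck", "dirk"), 3))      # one skipped c: 0.1
print(round(match_prob("nicolaas", "nykolaas"), 3))  # i->y and c->k: 0.72
```

In practice one works with log probabilities to avoid underflow on long names, and the probabilities themselves are re-estimated iteratively from the variants the system accepts.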

Both methods mentioned only use the directly neighbouring context of a letter and would become very complicated if one wished to extend this context. Extended context would reduce the number of examples of variation in any single context, and would thus reduce the accuracy of the probability estimation. This pinpoints the difficulty of the problem. The number of types of variation in the spelling of names is extremely large and most types occur only rarely. But, because of the huge number of variation types, one is likely to encounter a rare spelling variation in many names!

Even if we had powerful matching algorithms, a table of exceptions would remain necessary. Abbreviations need to be expanded [Pr. -> Pieter], reduced and full name variants [Els <-> Elisabeth] should point to each other, while name variants exist that cannot be associated by an algorithmic method with any acceptable probability [Dirk -> Theodorus; translations such as White, Leblanc, de Wit] and for which the relation should be made explicit in a table. The construction of such a table is a bit complicated if we want to avoid having all spelling variants of names as entries. Ideally, we would just need the selection of names that cannot be related to each other by a sufficient matching probability, but that together can capture all other likely variants of a name. For instance, Dieric has sufficient probability to be a variant of Dirk. Then the exception table tells us to look into variants of Theodorus, on which basis a link to Teodoor may be found. The variants Dieric and Teodoor do not need to be part of the table themselves.
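Such an exception table can be kept small by storing only the base names that cannot be related algorithmically; variants such as Dieric and Teodoor are then reached through probabilistic matching on either side of the table. A sketch, with illustrative links only:

```python
# Sketch: an exception table linking base names that no matching
# algorithm would associate with acceptable probability. The entries
# are illustrative examples from the text; spelling variants of these
# base names are handled by the probabilistic matcher, not the table.
EXCEPTIONS = {
    "dirk": ["theodorus"],
    "theodorus": ["dirk"],
    "els": ["elisabeth"],
    "pr.": ["pieter"],      # abbreviation expansion
}

def related_standards(standard):
    """Standards reachable through the exception table."""
    return [standard] + EXCEPTIONS.get(standard, [])

# Dieric matches Dirk algorithmically; the table then licenses a
# further look among variants of Theodorus (e.g. Teodoor).
print(related_standards("dirk"))  # ['dirk', 'theodorus']
```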

Methods that are based on automatic extraction of information from training data have the advantage that the method itself is not dependent on the data, although the outcome is. This means that the methods can be applied on various data sets. Assessment of a single method applied to various data sets, or various methods applied to the same data set, could help in understanding its weak and strong points.

 

4. The retrieval problem

Nowadays there are big historical databases that may contain hundreds of thousands of records. Suppose that we have a target name and that we would like to retrieve all records from the database that correspond to that name. What process most efficiently results in the required set? Again there are several possible approaches. The first would be to have all database entries standardised. The target name should be given a standard name too and consequently the set we are looking for is the one with the same standard. This would be a good solution if we could standardise properly and fast. As has been argued before, this may be hard for historical databases where a considerable number of names are given in a spelling not encountered before.

If we cannot use standards, an approach with an algorithm that finds the best matching candidates for some target name in a probabilistic way would be needed. And even then, it could be a laborious task if we have to compute a matching probability of the target name for all names. This can be avoided by using a procedure that quickly searches for relevant candidates while ignoring all others. As an example, we will describe here a method that has the attractive property that it uses in its search the probabilities we already know from the learning phase: the probabilities of skips, insertions and substitutions at any point in a name. The method does not start with the database, but focuses on the target name. It simply generates all possible variants of that name that may exist on the basis of the probabilities of skips, insertions and substitutions, and for which the total matching probability remains above some pre-defined threshold. If this generation of variants is implemented from left to right through the target name, the result is a tree-building process, in which branches whose probability drops below the threshold, or for which no variants exist in the database, are pruned. Once the end of the target word is reached, we have generated all variants in the database with a matching probability that is still above the threshold.
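This tree-building search can be sketched as a left-to-right expansion over the set of prefixes occurring in the database. The toy database and all probabilities below are invented, and insertions into the variant are omitted for brevity:

```python
# Sketch of the pruned left-to-right variant search. DB and the
# probability tables are invented miniatures; PREFIXES stands in for
# an index of all initial strings occurring in the database.
DB = {"nicolaas", "nikolaas", "nyklaas", "maria"}
PREFIXES = {name[:i] for name in DB for i in range(len(name) + 1)}
P_SUB = {("n", "m"): 0.4, ("y", "i"): 0.9, ("k", "c"): 0.8}
P_SKIP = 0.3

def p_sub(x, y):
    if x == y:
        return 1.0
    return P_SUB.get((x, y), P_SUB.get((y, x), 0.0))

def search(target, threshold=0.25):
    """Expand variants of target character by character, pruning
    branches whose probability falls below the threshold or whose
    string is not a prefix of any database entry."""
    found = set()
    def expand(i, prefix, prob):
        if prob < threshold or prefix not in PREFIXES:
            return  # prune this branch
        if i == len(target):
            if prefix in DB:
                found.add(prefix)
            return
        # skip the target character, or substitute/copy it
        expand(i + 1, prefix, prob * P_SKIP)
        for c in "abcdefghijklmnopqrstuvwxyz":
            expand(i + 1, prefix + c, prob * p_sub(target[i], c))
    expand(0, "", 1.0)
    return found

print(sorted(search("nykolaas")))  # ['nicolaas', 'nikolaas', 'nyklaas']
```

Note how maria is pruned after two characters: the initial n -> m substitution is still acceptable, but no continuation keeps the probability above the threshold, exactly the behaviour shown in Table 3.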

As an example of a search procedure, we give the search for variants of the first name Nicolaas in a database of 36,485 variants of first names3. The search is done on the semi-phonemic form of the names after a suffix reduction to a stem-like form. The target word becomes nykolaas in this way. The computations are based on the probabilities of diphone interchanges that were obtained on the basis of the variants known from the Dutch dictionary of first names (19,923 names). The technique of dynamic programming is used to compute the maximum matching probability of two names4. The threshold is set at 0.25, which means that all strings for which the matching probability drops below 0.25 are pruned. Table 3 gives the generated initial strings, together with their probability of matching the corresponding initial string of the target word nykolaas, and the number of candidate names in the table that start with such a string.

Table 3. Search of variants of the first name Nicolaas with semi-phonemic form nykolaas. The combinations ae, ai and au are internally represented by a single symbol which is not followed here for reasons of clarity.

 

possible initial string   probability   number of candidates   search result

1-character string
a .46 1505 continue
e .46 906 continue
i .46 328 continue
o .46 240 continue
y .46 146 continue
ae .46 143 continue
n 1.00 378 continue
m .46 1295 continue

2-character string
an .46 410 continue
en .46 98 search closed
in .46 63 search closed
on .46 16 search closed
yn .46 27 search closed
aen .46 10 search closed
ne .41 86 continue
ni .86 28 continue
ny 1.00 76 continue
mi .26 49 continue
my .46 148 continue

3-character string
ann .36 11 continue
nek .54 2 search closed
nik .86 13 continue
nyk 1.00 40 continue
nyu nyukolaas
nyx .91 2 continue
mik .26 9 search closed
myk .46 8 search closed

4-character string
anny .36 11 search closed
nikk .86 6 continue
nikl .32 6 continue
nyk .29 7 search closed
nyke .29 2 continue
nykk 1.00 2 continue
nykl .56 4 continue
nyko 1.00 22 continue
nyxo .91 2 continue

5-character string
nikko nikkolaas
nikla .32 3 continue
nykel .29 2 search closed
nykko nykkolaas
nykla nyklaas
nyklae nyklaes
nykok nykoklaas
nykol 1.00 13 continue
nykoo nykoolaas
nyxol .91 2 continue

6-character string
niklaa niklaas
nykola 1.00 4 continue
nykolae nykolaes
nykolai nykolais
nykolau .51 2 continue
nykole .42 2 search closed
nykoll nykollaas
nyxola nyxolas

7-character string
nykolas 1.00 2 nykolas
nykolaa 1.00 2 nykolaas
nykolaast
nykolaus nykolaus

The first step, with just a single character, shows that on the basis of the probability of substitution and insertion for a name starting with n, only names starting with a, e, i, o, y, ae, m and n should be considered. With two characters, we see that, for candidates starting with a vowel, the next character should be an n; after an initial n the vowels i, y, and e are still acceptable, and after an initial m, i and y. In the next step all vowel starts are ignored because their probability drops below .25, with the exception of ann. Starts with n and m provide seven alternatives, of which nyu already points uniquely to nyukolaas. For this branch the search is closed. In subsequent steps, branches are closed because their probability drops below the threshold of 0.25, while others end at accepted names. Note that the number of alternative strings per step is only 11 at maximum, which creates a fast search. A lower probability threshold would create a considerable increase in the number of options, which would slow down the procedure progressively.

The aforementioned search takes from a few seconds up to tens of seconds to yield a result, depending on the probability threshold, the size of the database and the speed of the machine. Especially in applications where users consult a database on-line, this is rather long. And although we may expect that efficient implementations and still faster computers will reduce the search time, there is always the simple solution of performing the search for variants off-line. In practice, most on-line searches will concern names that are present in the database in the given spelling. We can analyse all available names off-line and store, for every name, the set of likely variants. Only when a user asks for a name whose given spelling is unknown to the database does the full procedure need to be followed in real time.
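The off-line strategy could be organised along the following lines; note that the key function standing in for the probabilistic variant search is a deliberately crude assumption (treating i and y as interchangeable), purely for illustration.

```python
def normalise(name):
    # Toy stand-in for the probabilistic variant search: treat i and y
    # as interchangeable, so spelling variants share one key.
    return name.replace("y", "i")

def build_variant_index(names):
    """Off-line step: group all known names that are variants of each
    other, and map every name to the set of its likely variants."""
    by_key = {}
    for name in names:
        by_key.setdefault(normalise(name), set()).add(name)
    return {name: by_key[normalise(name)] for name in names}

def lookup(query, index):
    """On-line step: names already in the database are answered from the
    precomputed index in constant time; only unknown spellings would
    need the full probabilistic search in real time."""
    if query in index:
        return index[query]
    raise NotImplementedError("fall back to the full on-line search")
```

A call such as build_variant_index(["niklaas", "nyklaas", "nikolaas"]) groups the first two names together, so that a subsequent lookup("nyklaas", index) is answered instantly from the index.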

Assessment of the power of systems for nominal retrieval could be realised by comparing their output on the same resource(s). Dependence on name types, capability of handling names from different languages, portability to other sources, efficiency, the need for exception tables, and possibilities for automatic improvement of rules or probabilities could be among the criteria for comparison.


5. The parsing problem

Ingredients for nominal record linkage such as first name(s), patronymic, surname(s), profession, address and place name are often not tagged as such in the original source. In this respect, Gross (1992) provides an interesting overview for multiple first names and surnames of artists (Dante Gabriel Rossetti versus Gabriel Charles Dante Rossetti; Francisco de Goya versus Francisco de Goya y Lucientes). This means that uncertainties in interpretation can arise which a table can seldom make explicit. Other examples are Klaas Willems, where Willems may be a patronymic or a surname; Petrus Bakker, where Bakker may be the profession (baker) or the family name; and so on. Early choices in interpretation may lead to errors that are difficult to recover from. A solution could be the development of a name parser that sorts out all possible interpretations of a name, and the storage of all these interpretations. The requirements are the availability of lexicons of the various name types and some rules that describe the most likely order of appearance of elements of the name. Again, the appearance of names not yet present in the lexicons is a difficulty here. We first have to find the best matching names for all name classes and subsequently have to decide what interpretation we can give to a name. The name Quireijn, for instance, could have a best .95 correspondence to the first name Quirinus, and only a best .05 correspondence to the surname Kwiers. Such information can be used to find the most likely parses of a name. However, lexicons of first names, surnames, professions, and names of places are often not available (and certainly not on a multilingual scale), but they should be on the list of long-term desiderata.
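A minimal sketch of such a parser, under strong simplifying assumptions: the lexicons, the correspondence scores and the single ordering rule below are invented for illustration, and each name class may occur at most once (real names may of course contain multiple first names or surnames).

```python
from itertools import product

# Hypothetical per-class correspondence scores for each token, in the
# spirit of the Quireijn example (.95 first name, .05 surname).
LEXICONS = {
    "first_name": {"klaas": 1.0, "petrus": 1.0, "willems": 0.1},
    "patronymic": {"willems": 0.8},
    "surname":    {"willems": 0.7, "bakker": 0.6},
    "profession": {"bakker": 0.5},
}

# Simple ordering rule: a first name precedes patronymic, surname
# and profession; patronymic precedes surname, and so on.
ORDER = ["first_name", "patronymic", "surname", "profession"]

def parse_name(tokens):
    """Return all interpretations (one class per token) with their
    scores, keeping the class order consistent with ORDER and scoring
    by the product of lexicon correspondences."""
    parses = []
    for classes in product(ORDER, repeat=len(tokens)):
        # Enforce the ordering rule and forbid duplicate classes.
        if list(classes) != sorted(classes, key=ORDER.index):
            continue
        if len(set(classes)) != len(classes):
            continue
        score = 1.0
        for token, cls in zip(tokens, classes):
            score *= LEXICONS[cls].get(token.lower(), 0.0)
        if score > 0:
            parses.append((classes, score))
    return sorted(parses, key=lambda p: -p[1])
```

For Klaas Willems this yields both interpretations kept alive side by side: first name plus patronymic (score 0.8) and first name plus surname (score 0.7), rather than an early, possibly erroneous, choice between them.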

Once a good name-parsing procedure has been developed, it should be possible to simplify both the input of names into a database and the formulation of a search query. Full names could be given in one field without the necessity of making decisions on the interpretation of parts of the name. This is particularly important in situations we came across in which various archives each used their own structure for the storage of parts of a proper name. It would be inefficient to adapt software for the retrieval of names to each different type of field definition. A single field for the whole name would be more efficient, provided that an effective name parser is available.

Another subject for parsing may be the name itself. The search for name variants would be facilitated if a name could be parsed morphologically. We have tried a standard morphological parser for Dutch on a database of 146,000 surnames from the census of 1947. Results were disastrous, with a very high rate of erroneous parses. Major problems were the many surnames of foreign origin, uncommon compounds, and variations in spelling that were incompatible with modern Dutch. A morphological parser needs a complete set of morphemes, but because morphemes in names are subject to variation in spelling, the set of morphemes will always be incomplete until we have seen all spelling variants of the morphemes. This creates a vicious circle: we hope to solve the problem of variation in spelling by a morphological analysis, but this analysis is impossible with present-day methods because these require an invariant spelling.


6. The linkage problem

Record linkage systems for historical data aim to combine information on the same individual. Proper names are central to most systems, but profession, address, place of origin or birth, dates and relations between persons can also be used as a basis for linkage. Available sources and richness and reliability of information in these sources have a considerable influence on the possibilities for linkage. The system should be able to cope with inaccurate, incomplete, multi-interpretable and loosely structured data. The spelling of names, the given dates, relations between people, interpretation of names of places and professions, should all be treated with great care and should not be taken for granted.

The development of systems for automatic nominal record linkage over the last decades shows that the task is not hopelessly ad hoc and can be accomplished under certain circumstances. An overview is given by Winchester (1992). The basic argument is that the presentation of information on individuals was, in most cases, intended to identify each individual uniquely. There should be some logic behind the use of information, although part of the context for that logic may now have been lost, because of incomplete archiving, the loss of archival materials, and the loss of contextual knowledge that was evident to everybody involved at the time of writing, but not to us. This includes the socio-economic setting, personal histories and relations, and so on. Still, enough information may have been left to perform linkage if we are able to capture and utilise all the available information.

A major distinction should be made between tasks where the available information consists of names, professions, places and a few dates, but little to no relational information on the family, on the one hand, and tasks where records of birth, marriage and death provide information on a web of family relations, on the other. The first task is exemplified by list data, such as census lists, tax lists or lists of tradesmen or craftsmen (Rhodri Davies, 1992), the second by family reconstruction (King, 1992; 1996). For family reconstruction, the average amount of information available per person is often richer than in lists, since data may be available on marriage, births of children, and deaths of children, partner or the person involved.

In comparison with many other historical data, the information needed for nominal record linkage has the advantage that it is relatively well structured. Therefore, we may assume that a large but limited rule set and a limited amount of knowledge of the world are needed to solve the problem. This makes nominal record linkage methodologically attractive, not only for the historian, but also as an intellectual challenge for the computer scientist and the expert in artificial intelligence. But there is no such thing as the solution in nominal record linkage. Numerous examples have been given in the literature of ambiguous data, unlikely but existing family relations, and unpredictable name changes depending on social context (Morton, 1994). But we should not forget that many links can be made with very reasonable accuracy (Bouchard, 1992, 95.4%; Atack et al., 1992, >70%; Harvey et al., 1996, max 62.1%). As Bouchard (1992) has pointed out, research goals and data characteristics will determine the acceptability of a result in terms of completeness and accuracy. In cases where we want to obtain high accuracy, we would prefer to see a system for nominal record linkage as an assistant that needs supervision (Atack, Bateman, and Gregson, 1992). In other cases, where studies on large populations are undertaken, a certain percentage of erroneous links would not necessarily be disastrous for making reliable estimates of demographic or socio-economic developments, for instance, and full output validation could be omitted.

Nominal record linkage is a technology and should be judged by its ability to come to the same results as humans do. This is not to say that we should try to imitate human reasoning in all details. A computer can recognise speech without having ears, a plane can fly without flapping wings and a car can move without legs. But all technologies have learned a lot from looking into processes in nature and into human behaviour. Little is known explicitly about human reasoning strategies for historical record linkage. Still, we assume that humans can do a reasonable job when it comes to validation of automatic linking. A few characteristics may stand out: humans often possess more information than was utilised in the automatic linkage process (in other words, the machine did not and could not utilise all available data), and humans are far more flexible, powerful and creative than machines in the interpretation of data because of their enormous knowledge of the (historical) world behind the data.

Human reasoning has the capacity to keep optional interpretations of the data alive during the whole process of linkage and to adapt hypotheses as soon as this is beneficial to the overall likelihood of the solution. In Bloothooft (1995) I have shown a linkage strategy that can efficiently handle additional data and that can make local adaptations and optimisations without repeating the whole linkage procedure. In my view this is a capacity of human reasoning we should try to incorporate in automatic linking. Another aspect, not yet applied in that paper, is to loosen the standardisation of names. As has been argued before, we could generate a set of likely variants for any name. It should be possible to adapt a linkage strategy to cope with these sets, instead of using standard forms. The same holds for dates. It has been shown that errors in given ages and dates of birth and death are not uncommon, which may lead to erroneous links or missing links. This calls for flexibility in the interpretation of dates, in the same sense as systems should be flexible in the interpretation of nominal data.
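One way such flexibility could be sketched: a candidate link is scored by combining a name score, drawn from a precomputed set of likely variants, with a date score that tolerates errors of a year or two. The weights and tolerances below are illustrative assumptions only, not the strategy of Bloothooft (1995).

```python
from datetime import date

def date_score(d1, d2, tolerance_days=730):
    """1.0 for identical dates, decaying linearly to 0 at the tolerance;
    an error of a year or so in a stated age remains plausible."""
    diff = abs((d1 - d2).days)
    return max(0.0, 1.0 - diff / tolerance_days)

def name_score(name, candidate, variants):
    """1.0 for an exact match, a lower score for a known likely variant
    (the variant sets would come from the probabilistic search)."""
    if name == candidate:
        return 1.0
    return variants.get(name, {}).get(candidate, 0.0)

def link_score(rec_a, rec_b, variants):
    """Combine the evidence multiplicatively; a real system would weight
    the fields and handle missing values."""
    return (name_score(rec_a["name"], rec_b["name"], variants)
            * date_score(rec_a["birth"], rec_b["birth"]))
```

With variants = {"niklaas": {"nyklaas": 0.9}}, a record for niklaas born 1 March 1810 still links, with a reduced score, to one for nyklaas born 1 March 1809, where a system demanding standardised names and exact dates would miss the link entirely.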

The laws of mathematics and physics are considered to be true in the whole universe at any time. Consequently, physical experiments replicated under the same conditions should give the same results everywhere. History is different in that it concerns just one experiment that never will and never can be replicated, and historical truth can never be proved in the way we can prove laws of physics. Unlike classical physical observations, historical observations are imprecise, incomplete, and one-off. Unlike classical mathematics, historical reasoning is of a fuzzy kind, dependent on time and place. Nevertheless, we believe that we are able to make a reasonable guess about historical truth, which implies that, at least locally, historical observations and reasoning should contain some kernel of focus and constancy over time and place. This statement is worth exploring further. There has been a tendency in nominal record linkage for systems to take into account the specific properties of the data under investigation and the specific goals of the research. Although there is scientific wisdom behind such a scheme, it is also associated with the primacy of the historical result over methodology. Now that there are mature, but dedicated, nominal linkage systems, it is worthwhile to have them operate on data for which they have not been specifically designed. This may allow us to distinguish those elements of linkage strategies that have broad validity from those that are highly dependent on data, language and culture. Representative selections of data from various countries and epochs, for which human validation of links is available, would form the richest and most challenging task for present-day nominal linkage systems, from which many new methodological ideas may arise.

 

Notes

1 The Dutch Genealogical Association has collected a database of first names from an unknown variety of sources, probably mainly 18th and 19th century, in the Netherlands (courtesy of D.P. Arnoldussen); the 1947 census data are samples from various, mainly rural communities scattered around the country (courtesy of Doreen Gerritzen); the Meiery data stem from samples from 19th century civil registration in 12 villages in a region around 's-Hertogenbosch (courtesy of Gerard Trienekens); within the framework of the LIAS project, names were collected from 19th century legacy tax registers (504,000 records) and from some parish registers (1620-1812; 19,075 records including more than one name) in the province of Brabant (courtesy of Arie van Vliet and Toine Schijvenaars); the data from parish registers in Amsterdam come from the BABE system in the Amsterdam municipal archive and comprise full data from the period 1801-1811 and partial data from 1776-1801 (courtesy of Hans Ernst); complete 18th century parish registers from Goes, a town in the province of Zeeland, were available (courtesy of Willem de Vries); names from a series of sources related to Amsterdam merchants were available over the period 1578-1649, including registers from the exchange bank, stock holders, chartering, property tax and others (courtesy of Clé Lesger and Oscar Gelderblom); the oldest citizen registers from Amsterdam cover the period 1531-1611 (also courtesy of Clé Lesger and Oscar Gelderblom). The numbers of names given for databases that are also discussed in Bloothooft (1994) may deviate slightly from the numbers presented there, due to critical evaluation of the original sources.

 

2 Courtesy of German Telekom (BERKOM) and Andreas Mengel. The 40,000 names have a frequency higher than one. The complete set, including frequency one data, comprises 252,000 different first names, of which less than 20% is of German origin (including many variants originating from typing errors).

 

3 Database from sources mentioned under note 1, but in an intermediate stage.

 

4 The search strategy given here is very different from the one presented in Bloothooft (1994). The attraction of the present strategy is the analogy between the analysis phase (estimating diphone interchange probabilities on the basis of the best match between proven variants) and the search phase (applying diphone interchange probabilities to predict all possible variants).

 

References

Atack, J., Bateman, F., and Gregson, M.E. "Matchmaker, Matchmaker, Make Me a Match". Historical Methods, 25, 2 (1992), 53-65.

Bloothooft, G. "Corpus-Based Name Standardization". History & Computing, 6, 3 (1994), 153-167.

Bloothooft, G. "Multi-Source Family Reconstruction". History & Computing, 7, 2 (1995), 90-103.

Bouchard, G. "Current Issues and New Prospects for Computerized Record Linkage in the Province of Quebec". Historical Methods, 25, 2 (1992), 67-73.

Gross, A.D. "Personal Name Pattern Matching". Proceedings Ve Congrès "History & Computing", Montpellier, (1992), 19-27.

Harvey, C., Green, E.M., Corfield, P.J. "Record Linkage Theory and Practice: an Experiment in the Application of Multiple Pass Linkage Algorithms", History & Computing, 8, 2 (1996), 78-89.

King, S. "Record Linkage in a Protoindustrial Community". History & Computing, 4, 1 (1992), 27-33.

King, S. "Historical Demography, Life-cycle Reconstruction and Family Reconstruction: New Perspectives". History & Computing, 8, 2 (1996), 62-77.

Kukich, K. "Techniques for Automatically Correcting Words in Text". ACM (Association for Computing Machinery) Computing Surveys, 24, 4 (1992), 377-439.

Morton, G. "Presenting the Self: Record Linkage and Referring to Ordinary Historical Persons". History & Computing, 6, 1 (1994), 12-20.

Rhodri Davies, H. "Automated Record Linkage of Census Enumerators' Books and Registration Data: Obstacles, Challenges and Solutions". History & Computing, 4, 1 (1992), 16-26.

Schaar, J. van der, Gerritzen, D., and Berns, J.B. "Spectrum voornamenboek". Spectrum, Utrecht, 1992.

Winchester, I. "What Every Historian Needs to Know About Record Linkage for the Microcomputer Era". Historical Methods, 25, 2 (1992), 149-165.

 

The author

Gerrit Bloothooft is a member of the Utrecht Institute of Linguistics at Utrecht University, the Netherlands. In 1985 he obtained a PhD in mathematics and physics at the Free University of Amsterdam. His interests are in the fields of speech technology, phonetics, and nominal record linkage.