An Entity Resolution Primer

Author’s Note: Hello Reader! My name is Jonathan Armoza and I am a data science intern at Neustar and a PhD candidate in English Literature at New York University. My work focuses on the development of computational text mining and visualization methods in the emerging field of digital humanities.

Entity Resolution in History

The era of big data has created the need to develop techniques and mechanisms to not only handle large datasets, but to understand them as well. Much of this influx of information is about people, places, and things. Although some of that data is anonymized, there are a number of reasons we might want to understand how to associate those real world “entities” with their data points. This act of linking the information contained in records with real world “entities” is called entity resolution. In the post below I’ll demonstrate an entity resolution process through the lens of historical data, taking note of the varying options at each step from preprocessing through to the analysis of resolved entities.

Entity resolution is even older than the practice of record keeping. From an ice age hunter tracking footprints to a data warehouse administrator merging databases on different continents, each connects past evidence, or “references”, to a real world entity. One example that comes to mind is the ship manifests (see Figure 1) submitted at Ellis Island, NY, where many people immigrated to America. These records contain fields like the name, age, gender, and nationality of each of the immigrants on board a ship. They sometimes contain errors from transcribing foreign-sounding names into English, as well as other discrepancies due to language barriers between the immigrants and the staff of the shipping line recording all of the information required by US authorities. On Ellis Island, the manifests were used by immigration agents to account for all of a ship’s passengers and to match each person with an identity. The ship manifests therefore not only refer to these people; they can also be thought of as the documentation of the journeys those people undertook. Decades later, genealogists would use US census records (see Figure 2) to resolve which manifest entry referred to which citizen, and to resolve differences between immigrants’ initial immigration data and how they identified themselves once living in the United States.

 


FIGURE 1. Example of an Ellis Island Ship Manifest from 1908

 


FIGURE 2. Example page from the 1940 US Census

Essentially, immigration agents and genealogists undertook entity resolution processes, albeit with different goals. And they both used the best tools at their disposal: visual inspection of records and people. The goals and problems of entity resolution in digital settings are not so different, but the large scale of today’s datasets demands well defined heuristics and more complex, computational means of assessing and analyzing them. Let’s explore modern entity resolution in greater detail by following how the ship manifest and census data in our example can be resolved through a few of its processes.

Modern Entity Resolution


FIGURE 3. An entity resolution workflow


The five basic steps of entity resolution are (1) preprocessing records, (2) blocking, (3) matching, (4) verification, and (5) evaluation. It’s important to note that each step in an entity resolution workflow is a point where decisions must be made about data. Metrics that verify whether data is ready to continue through the workflow might fail, for instance, and as Figure 3 indicates, it will be up to the current step in the workflow to either reject those results or send them back to a previous step. This often occurs after matching, but even the seemingly clear-cut step of preprocessing can have a notable effect on the accuracy and speed of the record matching and resolution to follow.

Below I’m going to use fictional data based on Ellis Island ship manifest entries and US census records, run it through these processes, and explain the options and pitfalls at each step. My sample entity resolution problem will demonstrate which of the John O’Reillys in the census records most likely refer to the John O’Reillys in the ship manifest entries. Remember, errors in the records could have been introduced by the ship’s ticketing agent, the census taker, or even digital transcription.


Census (1940)

name age_at_last_birthday sex place_of_birth occupation year_of_record
O’reilly, John 30 M Canada-English clerk 1940
O’Reilly, John 16 M Irish Free State tailor 1940
Oreilly, John 16 M Irish Free State tailor 1940
O’reilly, Jon 32 M England Handyman 1940
Oreilly, Jon 32 M England Handyman 1940
O’Reilly, John 18 M Canada-English clerk 1940


Ship Manifest (1937)

family_name given_name age_yrs sex occupation nationality
O’reily John 42 M clerk England
O’Reilly John 10 M <blank> Ireland
Oreilly John 10 <blank> tlr Ireland
O’ReillY Jon 39 M locksmith <blank>
O’reilly <blank> 28 M locksmith England
O’reilly John 24 M clerk Canada


A. Preprocessing

Preprocessing produces a common set of fields across the data sets and standardizes, or cleans, the data using techniques such as normalizing character case, standardizing category values and their order, and stripping punctuation. Each record is assigned a unique ID field to prevent similar records from being mixed up. And like the genealogists, I’ll use the US census as the authoritative record. Post-normalization, our records now look like this:

Preprocessed Record – Census (1940)

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940
5 oreilly john 19 m canada clerk 1940


Preprocessed Record – Ship Manifest (1937)

id last_name first_name age sex birth_country occupation year_of_record
6 oreily john 42 m england clerk 1937
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937
9 oreilly jon 39 m <blank> locksmith 1937
10 oreilly <blank> 28 m england locksmith 1937
11 oreilly john 24 m canada clerk 1937
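
To make this step concrete, here is a minimal sketch in Python of the kind of normalization described above. The field names mirror the tables in this post, but the specific cleaning rules (case folding, punctuation stripping, the mapping of place-of-birth and nationality labels to a common country value) are illustrative assumptions, and the sketch does not reproduce every adjustment visible in the tables above (the ages, for instance).

```python
import re
from itertools import count

# Assumed mapping of raw place-of-birth / nationality labels to a common value
COUNTRY_MAP = {"canada-english": "canada", "canada": "canada",
               "irish free state": "ireland", "ireland": "ireland",
               "england": "england"}

_ids = count()  # every record gets a unique id so similar records stay distinct

def clean(value):
    # Lowercase and keep letters only: "O’Reilly" -> "oreilly"
    return re.sub(r"[^a-z]", "", (value or "").lower())

def preprocess(record, year_of_record):
    # Census rows carry "name" as "last, first"; manifest rows have separate fields
    if "name" in record:
        last, _, first = record["name"].partition(",")
    else:
        last, first = record.get("family_name", ""), record.get("given_name", "")
    age = str(record.get("age_at_last_birthday") or record.get("age_yrs") or "")
    place = (record.get("place_of_birth") or record.get("nationality") or "").lower()
    return {"id": next(_ids),
            "last_name": clean(last),
            "first_name": clean(first),
            "age": int(age) if age.isdigit() else None,
            "sex": clean(record.get("sex")) or None,
            "birth_country": COUNTRY_MAP.get(place),
            "occupation": clean(record.get("occupation")) or None,
            "year_of_record": year_of_record}

print(preprocess({"name": "O’reilly, John", "age_at_last_birthday": 30, "sex": "M",
                  "place_of_birth": "Canada-English", "occupation": "clerk"}, 1940))
```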


B. Blocking

A brute force comparison of all records would run in quadratic time, or O(n²), which is a prohibitive runtime when comparing millions of records. Thus, the first post-cleaning step is to index the records using a blocking criterion. The “last_name”, “first_name”, and “sex” fields are not useful for dividing this particular data set given their similar values across records. In this case, “birth_country” is the blocking field – though “occupation” would work as well. This decision could be made programmatically by examining criteria like the diversity of field values, as I have done above, and it can be adjusted as more records are brought into the entity resolution system and resolved.
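
As a rough illustration of that programmatic check, one could measure how many distinct values a field takes across the records. The specific measure below (the fraction of distinct non-missing values) is an assumption chosen for simplicity:

```python
def field_diversity(records, field):
    # Fraction of distinct non-missing values a field takes across the records.
    # Low diversity (e.g. "sex", "last_name") makes for a poor blocking key;
    # moderate diversity (e.g. "birth_country", "occupation") splits the data
    # into usefully sized blocks.
    values = [r[field] for r in records if r.get(field) is not None]
    return len(set(values)) / len(values) if values else 0.0

# Usage sketch, with `records` as the preprocessed rows above:
# sorted(("age", "sex", "birth_country", "occupation", "last_name"),
#        key=lambda f: field_diversity(records, f), reverse=True)
```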


FIGURE 4.  A blocking hierarchy

The current blocking hierarchy illustrated in Figure 4 results in four indices containing records for matching, reducing the 66 possible pairwise comparisons of these twelve records down to 15 within-block comparisons. The number of comparisons will increase if I try to address false matches and false rejections later in the resolution process, but blocking produces an advantageously smaller subset of necessary record comparisons (e.g. [ (r0,r5), (r0,r11), (r5,r11), … ]).
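
A minimal sketch of this blocking step, assuming the records have already been preprocessed into dictionaries like those above:

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, field="birth_country"):
    # Group records by the value of the blocking field; records with a missing
    # value all fall under the None key (here, only record 9).
    blocks = defaultdict(list)
    for r in records:
        blocks[r.get(field)].append(r)
    return blocks

def candidate_pairs(blocks):
    # Only records that share a block are ever compared with one another.
    for members in blocks.values():
        yield from combinations(members, 2)

# Usage sketch: `records` is the list of preprocessed census + manifest rows
# blocks = block_records(records)
# pairs = list(candidate_pairs(blocks))   # 15 within-block pairs instead of 66
```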

Block 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
5 oreilly john 19 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937


Block 2

id last_name first_name age sex birth_country occupation year_of_record
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937


Block 3

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940
6 oreily john 42 m england clerk 1937
10 oreilly <blank> 28 m england locksmith 1937


Block 4

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937


This simple blocking method is based on a direct matching of the field values, but there are many others to consider. Famously, the US Census Bureau developed “Soundex”, a blocking method to help find names that sound the same though they might be spelled differently. Soundex belongs to a group of algorithms called “match codes” that create hashes based on the values of fields. Other more advanced blocking methods include sorted neighborhood, q-gram based blocking, canopy clustering, string-map, and suffix-array based blocking (see Christen 2007). Overall, blocking can be thought of as match prospecting, a way of reducing the number of comparison pairs for the next entity resolution step: data matching, a.k.a. entity reference resolution.
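
For illustration, here is a simplified Soundex-style match code in Python; production implementations (and the official Soundex rules) handle additional edge cases, such as letters separated by “h” or “w”:

```python
def soundex(name: str) -> str:
    # Simplified Soundex: the initial letter plus three digits.
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2",
             "q": "2", "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4",
             "m": "5", "n": "5", "r": "6"}
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return "0000"
    digits, prev = [], codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")          # vowels, h, w, y map to ""
        if code and code != prev:
            digits.append(code)
        prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]

# "O’Reilly", "Oreily", and "Oreilly" all hash to the same block key, O640
print(soundex("O’Reilly"), soundex("Oreily"), soundex("Oreilly"))
```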


C. Data Matching

Now that the records are in more manageable groups, I’ll undertake the step most associated with entity resolution: asserting which record references which entity. The most straightforward way of matching is determining equivalence between the fields of two records, but even with our idealized data set there are still differences in spellings, missing fields, and possibly incorrect ages. Matching should result in groupings of records that are more closely associated, and since real data tend to be messy, you can see that exact equivalence methods would fail to produce such associations.

Direct match methods can, however, allow for some variation in the fields of a record. One example, the Levenshtein edit distance, is an approximate string matching algorithm that compares two strings by producing a single value: the number of character edits (insertions, deletions, substitutions) required to transform one string into the other. Other approximate string matching algorithms, for instance q-gram and Jaro-Winkler, express the “distance” between strings as a function of the ordered character sequences they share. In the sample records, the edit distance between the “last_name”, “first_name”, and “sex” fields is small; this is one reason these fields were not chosen as blocking criteria, as they are comparatively constant.
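
As an example, a standard dynamic-programming implementation of the Levenshtein edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: insertions, deletions,
    # and substitutions each cost one operation.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("oreilly", "oreily"))      # 1
print(levenshtein("handyman", "locksmith"))  # 8
```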


FIGURE 5. Matching hierarchy and the ranges of match for each used field

I have to determine which field(s) will act more definitively in resolving the blocked records to distinct entities. Establishing a clear hierarchy of fields, and ranges of match for those fields, will then help inform and shape the resolution results (see Figure 5). I have already used “birth_country” for blocking, and “last_name”, “first_name”, and “sex” were rejected as blocking criteria because their values are not very diverse. Since direct match methods are meant to refine groupings via narrower ranges-of-match, I can’t use any of those as match criteria. However, neither “age” nor “occupation” has been used, and both contain a more diverse set of values. Once the results from these match methods are calculated, a normalized match result value called a match score must be determined for each field. This takes the form of a manually chosen or calculated expected value (e.g. an edit distance of zero) and an acceptable range around that ideal in which valid values can fall. For now, I’ll just say that if a match result falls within the selected ranges of match, it’s an acceptable match (a sketch of one such rule follows the resolved entities below). Using those fields, we arrive at the following entity resolutions:

Entity 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937

Entity 2

id last_name first_name age sex birth_country occupation year_of_record
5 oreilly john 19 m canada clerk 1940

Entity 3

id last_name first_name age sex birth_country occupation year_of_record
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937

Entity 4

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940

Entity 5

id last_name first_name age sex birth_country occupation year_of_record
10 oreilly <blank> 28 m england locksmith 1937

Entity 6

id last_name first_name age sex birth_country occupation year_of_record
6 oreily john 42 m england clerk 1937

Entity 7

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937
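
A minimal sketch of the kind of match rule just described, reusing the levenshtein function from the earlier sketch; the thresholds here (an age range-of-match of +/- 5 years and an occupation edit distance of at most 3) are illustrative assumptions rather than the exact values used to produce the tables above:

```python
AGE_RANGE = 5             # assumed range-of-match for "age", in years
OCCUPATION_MAX_EDITS = 3  # assumed range-of-match for "occupation"

def fields_match(a, b):
    # A candidate pair is accepted only if every compared field falls inside
    # its range-of-match; a missing value is treated as compatible here.
    if a["age"] is not None and b["age"] is not None:
        if abs(a["age"] - b["age"]) > AGE_RANGE:
            return False
    if a["occupation"] and b["occupation"]:
        if levenshtein(a["occupation"], b["occupation"]) > OCCUPATION_MAX_EDITS:
            return False
    return True

# Usage sketch: keep the within-block candidate pairs whose fields match,
# then group the surviving pairs into entity clusters.
# matches = [(a, b) for a, b in candidate_pairs(blocks) if fields_match(a, b)]
```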

What happens with these entities (or record clusters) is now another key decision point in the entity resolution process. One possibility is to take these clusters of records and merge each into a single record based on a rule favoring predominant field values or the most recent record. Another possibility is to keep only the most representative record based on a similar criterion, retaining the rest for later reference or purging them. Let’s evaluate the results above to see what decision might be appropriate.
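
As a small illustration of the second option, a survivorship rule that keeps the most recent record in each cluster as its representative might look like this (the completeness tie-breaker is an added assumption):

```python
def representative(cluster):
    # Keep the record with the latest year_of_record; break ties by preferring
    # the record with the fewest missing fields.
    def completeness(record):
        return sum(value is not None for value in record.values())
    return max(cluster, key=lambda r: (r["year_of_record"], completeness(r)))

# Usage sketch: representative([record_0, record_11]) would keep the 1940
# census entry (record 0) as Entity 1's surviving record.
```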


D. Verification

Entity 1 presents as a correct resolution in that there is a 1:1 record pairing and that it has met the proposed match criteria. Entities 2 and 6 represent another scenario where a lone record from one data set does not correspond to a record in the other data set. This can be deemed either a full or partial entity resolution, depending on the rules of the system. In my case, I’ll consider them full resolutions since I’m interested in determining which records likely correspond to individual immigrants and citizens recorded by the census.

Entity 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937

Entity 2

id last_name first_name age sex birth_country occupation year_of_record
5 oreilly john 19 m canada clerk 1940

Entity 6

id last_name first_name age sex birth_country occupation year_of_record
6 oreily john 42 m england clerk 1937

Entity 3 represents a partially successful resolution in that most of the matching criteria have been met (even with the blank occupation for record 7), but the four records in its grouping will need to be revisited and split into two entities.

Entity 3

id last_name first_name age sex birth_country occupation year_of_record
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937


Notice that record 10 in Entity 5’s group has a different occupation than the John O’Reillys in Entity 4 (a Levenshtein distance of 8 edits), but its fields are otherwise very similar. Once the entities are grouped, there may also be other verification criteria to consider. We want to be able to group records from across data sets in order to resolve them, but the records of Entity 4 come from just one data set, the 1940 census. One of the people potentially represented by Entity 4’s records could have changed occupations over time. This is an example where I can set a rule saying that records with a later “year_of_record” contain the most representative value. To fix this I would have to merge Entities 4 and 5 and then split them. However, given that Entity 4’s records are so similar, the decision to pair either record 3 or 4 with record 10 is unclear.

Entity 4

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940

Entity 5

id last_name first_name age sex birth_country occupation year_of_record
10 oreilly <blank> 28 m england locksmith 1937

Entity 7 failed the age matching criterion, but let’s suppose this John O’Reilly from the ship’s manifest really matches one of the records in Entity 4 or 5. The range of values considered for a match (or “range-of-match”) for “age” treated that field as a range of integers, but ultimately a recorded number is still a string of characters, whether written by hand or typed. “39” is only one edit away from “29”.

Entity 7

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937


The initial resolutions of Entities 4 and 7 above are instances of false positive and false negative matches, respectively. These types of match outcomes are more difficult to detect without further review of the resolved entity references. Another method to decrease the number of unnecessary record comparisons is called asserted equivalence, where an external data source is considered to be the source of truth. During the verification step, this can be a way to validate recent entity resolutions and to adjust the matching criteria to reduce the number of incorrect matches or non-matches. Note that such expert data can be used at any point during entity resolution – blocking, matching, or evaluation.

Entity resolution systems try to strike a balance between unmatched references and the quality of the matches generated. The accuracy of match outcomes is a function of the selected blocking and matching methods. Since 100% correct outcomes are unlikely, one can attempt to increase accuracy by considering relationships between records other than direct field-to-field matches.

E. Return to Matching: More Advanced Match Methods

Extending Data Matching with Weights

In a weighted match method, for instance, each field is given an additional value (from zero to one, inclusive) to indicate its importance when compared with other records. This stands in contrast to the discrete hierarchy of the blocking and matching schemes (a coordinated set of methods) used in my example above, where match results are either positive or negative. Methods like approximate string matching can still be used when comparing fields, but once those assessments are made, each result is weighted to produce the set of entity resolutions. This produces a new, probabilistic measure via the resolution function E:

E(Ri,Rj) = w1*m1(Ri1,Rj1) + w2*m2(Ri2,Rj2) + … + wn*mn(Rin,Rjn)

where mk compares the kth field of records Ri and Rj, and wk is the weight assigned to that field.

With E, the match score becomes fully probabilistic. To take advantage of weighted matching, we must now consider the three sets of outcomes possible from probabilistic matching: matches, non-matches, and potential matches. Looking again at the imperfect results of my first custom blocking and matching methods, you can see that introducing weights may result in fewer false positives and negatives. For instance, giving “age” a weight of 0.25 during matching and “occupation” a weight of 0.75 results in splitting Entity 4’s cluster of records from record 10, which is merged into the Entity 7 cluster instead.

Entity 4

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940

Entity 5

id last_name first_name age sex birth_country occupation year_of_record
10 oreilly <blank> 28 m england locksmith 1937

Entity 7

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937
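
A sketch of this weighted scoring, reusing the levenshtein function from earlier; the per-field similarity functions below, which scale each comparison into [0, 1], are assumptions made for illustration:

```python
def age_similarity(a, b):
    # Identical ages score 1.0, decaying linearly to 0.0 at a ten-year gap
    if a is None or b is None:
        return 0.5                      # neutral score when a value is missing
    return max(0.0, 1.0 - abs(a - b) / 10.0)

def occupation_similarity(a, b):
    # Edit distance scaled by the longer string, inverted into [0, 1]
    if not a or not b:
        return 0.5
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

WEIGHTS = {"age": 0.25, "occupation": 0.75}   # the weights from the example above

def match_score(r_i, r_j):
    # E(Ri, Rj) = w_age * m_age(...) + w_occupation * m_occupation(...)
    return (WEIGHTS["age"] * age_similarity(r_i["age"], r_j["age"])
            + WEIGHTS["occupation"] * occupation_similarity(r_i["occupation"],
                                                            r_j["occupation"]))

# Scores above an upper threshold count as matches, scores below a lower
# threshold as non-matches, and anything in between as a potential match.
```

Under these assumed similarity functions, the pair of records 9 and 10 scores well above the pair of records 3 and 10, which is consistent with record 10 being pulled toward Entity 7 rather than Entity 4.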


This set of weights and matching results forms an agreement pattern. Because this pattern gives us a measure of probability for each comparison made, we can model a distribution of error based on the norms established by the resolution results. Parts of the potential matches can then be folded into the match and non-match sets depending on the allowable ranges of values of E.

 Transitive Equivalence

The next logical step after considering weighted relationships between fields would be to look at the relationships between records themselves and how those relationships can be used as another form of matching criterion. One such method is transitive equivalence, which uses field matches with intermediary records to overcome apparent mismatches between records compared earlier in the sequence. A matching scheme that includes a transitive equivalence measure holds that intermediary records can link previously and successively processed records, making them all potentially equivalent.

Let’s say that record A mismatches with record B, but the next record C matches with B. The transitive equivalence measure would check to see if C also matches with A. If it did, it could be understood that all three records match and therefore resolve to the same entity.


FIGURE 6. Transitive equivalence relations between records

 

Consider, for instance, if the records of Block 1 were processed for matching in this way:

Block 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
5 oreilly john 19 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937


Here, records 0 and 5 match on “last_name” but the pair would be deemed a non-match due to age. Comparing record 5 to record 11, however, shows that age falls within its range-of-match and the occupations match; record 11 matches record 0 in the same way, so we can infer that records 0, 5, and 11 all potentially refer to the same entity. So with transitive equivalence, the order in which records are considered matters. If record 5’s age had fallen outside the +/- 5 year range-of-match relative to record 11, the transitive equivalence relationship would have been lost, resulting in a potential false negative.
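
One common way to implement this kind of transitive grouping is with a union-find (disjoint-set) structure over the matched pairs. A minimal sketch:

```python
def transitive_clusters(records, matched_pairs):
    # Union-find: every record starts in its own cluster, and each matching
    # pair merges two clusters, so matches propagate transitively.
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in matched_pairs:
        union(a["id"], b["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return list(clusters.values())

# Usage sketch: even though records 0 and 5 fail the pairwise age test,
# the matched pairs (0, 11) and (5, 11) place all three ids in one cluster:
# transitive_clusters(block_1, [(rec0, rec11), (rec5, rec11)]) -> [[0, 5, 11]]
```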

F. Evaluation

One matching scheme that prevents record processing order from hindering the establishment of entity reference relationships is association analysis. Using this method, records are thought of as nodes in a network with multiple relationships, each determined by the results of the matching criteria for each of their fields. This enables a more comprehensive probabilistic match that can be considered via network/graph theory measurements of connection and distribution such as homophily, multiplexity, mutuality, and closure.


FIGURE 7. Entity reference relationships in network form

As Figure 7 illustrates, the blocking or matching schemes may split references across entities based on fields like name or address that are deemed distinguishing. Metrics from network/graph theory can be used to re-match those references appropriately, remediating inaccurate entity resolutions. Such metrics can reveal relationships between references and between entities that are broader than the local ones established by record processing order and single-field matching.

Though association analysis can be included in a matching scheme, it can also help evaluate entity resolution outputs. If one can assess relationships between entity references based on network measures, then association analysis naturally extends to the aggregated data of previously resolved entities. This can help resolve larger-scale data relationships that field-by-field matching has trouble with – such as determining the members of a household. If I took a sample of names from the ship manifests and censuses, for instance, this method could aid in resolving the movements of members between households over time and provide other useful statistics on the average composition and dynamics of those households.
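
As a sketch of this network framing, the scored record pairs can be loaded into a graph and examined with standard graph measures; here I use the networkx library, and the choice of threshold and metrics is illustrative:

```python
import networkx as nx

def reference_graph(records, scored_pairs, threshold=0.5):
    # Each entity reference is a node; an edge links two references whose
    # weighted match score E exceeds the threshold.
    g = nx.Graph()
    g.add_nodes_from(r["id"] for r in records)
    for a, b, score in scored_pairs:
        if score >= threshold:
            g.add_edge(a["id"], b["id"], weight=score)
    return g

# Usage sketch:
# scored = [(a, b, match_score(a, b)) for a, b in candidate_pairs(blocks)]
# g = reference_graph(records, scored)
# list(nx.connected_components(g))  # candidate entities as connected references
# nx.clustering(g)                  # closure: how tightly each reference's
#                                   # neighbours are linked to one another
```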

Conclusion

In this post I have outlined a five-step workflow for resolving data to the real world entities it references – (1) Preprocessing, (2) Blocking, (3) Matching, (4) Verification, and (5) Evaluation. I suggested some basic methods for moving data through each step towards those resolutions, and for understanding their quality. There are many more matching and verification techniques available than those covered in this post.

The US Census Bureau’s “72-year rule” limits current access to many of its records, as they contain personally identifiable information too recent for public use. Thus, running Ellis Island manifests through an entity resolution system is mostly an academic exercise, though it illustrates how entity resolution gets interesting. With the employment of entity resolution techniques, a ledger filled with common names like O’Reilly becomes a map of traceable relationships, and access to its information is no longer limited to the tools of visual inspection and prior knowledge.