An Entity Resolution Primer

Author’s Note: Hello Reader! My name is Jonathan Armoza and I am a data science intern at Neustar and a PhD candidate in English Literature at New York University. My work focuses on the development of computational text mining and visualization methods in the emerging field of digital humanities.

Entity Resolution in History

The era of big data has created the need to develop techniques and mechanisms to not only handle large datasets, but to understand them as well. Much of this influx of information is about people, places, and things. Although some of that data is anonymized, there are a number of reasons we might want to understand how to associate those real world “entities” with their data points. This act of linking the information contained in records with real world “entities” is called entity resolution. In the post below I’ll demonstrate an entity resolution process through the lens of historical data, taking note of the varying options at each step from preprocessing through to the analysis of resolved entities.

Entity resolution is even older than the practice of record keeping. From an ice age hunter tracking footprints to a data warehouse administrator merging databases on different continents, each connects past evidence, or “references,” to a real world entity. One example that comes to mind is the ship manifests (see Figure 1) submitted at Ellis Island, NY, where many people immigrated to America. These records contain fields like the name, age, gender, and nationality of each of the immigrants on board. They sometimes contain transcription errors from rendering foreign-sounding names into English, and other discrepancies due to language barriers between the immigrants and the staff of the shipping line recording all of the information required by US authorities. On Ellis Island, the manifests were used by immigration agents to account for all of a ship’s passengers and to match a person with an identity. The ship manifests therefore not only refer to these people; they also document the journeys those people undertook. Decades later, genealogists would use US census records (see Figure 2) to resolve which manifest entry referred to which citizen, and to resolve differences between their initial immigration data and how they identified themselves once living in the United States.

 


FIGURE 1. Example of an Ellis Island Ship Manifest from 1908

 


FIGURE 2. Example page from the 1940 US Census

Essentially, immigration agents and genealogists undertook entity resolution processes, albeit with different goals. And they both used the best tools at their disposal: visual inspection of records and people. The goals and problems of entity resolution in digital settings are not so different, but the large scale of today’s datasets demands well-defined heuristics and more complex, computational means of assessing and analyzing them. Let’s explore modern entity resolution in greater detail by following the example of ship manifests and census data through a few of its processes.

Modern Entity Resolution


FIGURE 3. An entity resolution workflow


The five basic steps of entity resolution are (1) preprocessing records, (2) blocking, (3) matching, (4) verification, and (5) evaluation. It’s important to note that each step in an entity resolution workflow is a point where decisions must be made about data. Metrics that verify whether data is ready to continue through the workflow might fail, for instance, and as Figure 3 indicates, the current step must then either reject those results or send them back to a previous step. This often occurs after matching, but even the clear-cut step of preprocessing can have a notable effect on the accuracy and speed of the record matching and resolution to follow.

Below I’m going to use fictional data based on Ellis Island ship manifest entries and US census records, run them through these processes, and explain the options and pitfalls at each step. My sample entity resolution problem will demonstrate which of the John O’Reillys in the census records most likely refer to the John O’Reillys in the entries of ship manifests. Remember, errors in the records could have been introduced by the ship’s ticketing agent, the census taker, or even digital transcription.


Census (1940)

name age_at_last_birthday sex place_of_birth occupation year_of_record
O’reilly, John 30 M Canada-English clerk 1940
O’Reilly, John 16 M Irish Free State tailor 1940
Oreilly, John 16 M Irish Free State tailor 1940
O’reilly, Jon 32 M England Handyman 1940
Oreilly, Jon 32 M England Handyman 1940
O’Reilly, John 18 M Canada-English clerk 1940


Ship Manifest (1937)

family_name given_name age_yrs sex occupation nationality
O’reily John 42 M clerk England
O’Reilly John 10 M <blank> Ireland
Oreilly John 10 <blank> tlr Ireland
O’ReillY Jon 39 M locksmith <blank>
O’reilly <blank> 28 M locksmith England
O’reilly John 24 M clerk Canada


A. Preprocessing

Preprocessing produces a common set of fields across the data sets and standardizes, or cleans, the data using techniques such as normalizing character case, ordering categories consistently, and stripping punctuation. Each record is assigned a unique ID field to prevent mixing up similar records. And like genealogists, I’ll use the US census as the authoritative record. Post-normalization, our records now look like this (a small normalization sketch follows the tables):

Preprocessed Record – Census (1940)

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940
5 oreilly john 18 m canada clerk 1940


Preprocessed Record – Ship Manifest (1937)

id last_name first_name age sex birth_country occupation year_of_record
6 oreily john 42 m england clerk 1937
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937
9 oreilly jon 39 m <blank> locksmith 1937
10 oreilly <blank> 28 m england locksmith 1937
11 oreilly john 24 m canada clerk 1937
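As a rough illustration of the preprocessing above, here is a minimal normalization sketch in Python. The normalize function, the field mapping, and the sample manifest row are assumptions for demonstration only; a real pipeline would also need lookup tables for values like country names (e.g. mapping “Irish Free State” to “ireland”).

import re

def normalize(raw, field_map, year):
    """Map a raw row onto the common schema and clean its values."""
    clean = {}
    for target, source in field_map.items():
        value = (raw.get(source) or "").strip().lower()
        value = re.sub(r"[^\w\s-]", "", value)   # strip punctuation such as apostrophes
        clean[target] = value or None            # keep blanks as missing values
    clean["year_of_record"] = year
    return clean

# Hypothetical mapping from the 1937 manifest's fields to the common schema.
manifest_map = {"last_name": "family_name", "first_name": "given_name",
                "age": "age_yrs", "sex": "sex",
                "birth_country": "nationality", "occupation": "occupation"}

row = {"family_name": "O'Reilly", "given_name": "John", "age_yrs": "10",
       "sex": "M", "occupation": "", "nationality": "Ireland"}

record = normalize(row, manifest_map, 1937)
record["id"] = 7                                 # each record gets a unique ID
print(record)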


B. Blocking

A brute force comparison of all records would run in quadratic time, or O(n²), which is a prohibitive runtime when comparing millions of records. Thus, the first post-cleaning step is to index the records using a blocking criterion. The “last_name”, “first_name”, and “sex” fields are not useful for dividing this particular data set given their similar values across records. In this case, “birth_country” is the blocking field – though “occupation” would work as well. This decision could be made programmatically by examining criteria like diversity of field values, as I have done above, and can be adjusted as more records are brought into the entity resolution system and resolved.


FIGURE 4.  A blocking hierarchy

The blocking hierarchy illustrated in Figure 4 results in four indices containing records for matching, reducing a potential 25 comparisons down to 13. The number of comparisons will increase if I try to address false matches and false rejections later in the resolution process, but blocking produces an advantageously smaller subset of necessary record comparisons (e.g. [ (r1,r2), (r1,r6), (r1,r7), … (r5,r9) ]).
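A minimal sketch of this kind of single-field blocking, assuming the preprocessed records are held as Python dictionaries (the three sample records below are drawn from the blocks that follow):

from collections import defaultdict

def block_records(records, key="birth_country"):
    """Group preprocessed records into candidate blocks by a single field."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record.get(key) or "<blank>"].append(record)
    return blocks

# A few of the preprocessed records, as Python dictionaries.
records = [
    {"id": 0,  "last_name": "oreilly", "age": 27, "birth_country": "canada",
     "occupation": "clerk", "year_of_record": 1940},
    {"id": 9,  "last_name": "oreilly", "age": 39, "birth_country": None,
     "occupation": "locksmith", "year_of_record": 1937},
    {"id": 11, "last_name": "oreilly", "age": 24, "birth_country": "canada",
     "occupation": "clerk", "year_of_record": 1937},
]

for country, block in block_records(records).items():
    print(country, [r["id"] for r in block])   # canada [0, 11]; <blank> [9]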

Block 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
5 oreilly john 19 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937


Block 2

id last_name first_name age sex birth_country occupation year_of_record
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937


Block 3

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940
6 oreily john 42 m england clerk 1937
10 oreilly <blank> 28 m england locksmith 1937


Block 4

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937


This simple blocking method is based on a direct matching of the field values, but there are many others to consider. Famously, the US Census Bureau developed “Soundex”, a blocking method that helps find names that sound the same though they might be spelled differently. Soundex belongs to a group of algorithms called “match codes” that create hashes based on the values of fields. Other, more advanced blocking methods include sorted neighborhood, q-gram based blocking, canopy clustering, string-map, and suffix-array based blocking (see Christen 2007). Overall, blocking can be thought of as match prospecting, a way of reducing the number of comparison pairs for the next entity resolution step: data matching, a.k.a. entity reference resolution.
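Before moving on, here is a rough sketch of the match-code idea: a simplified Soundex-style function (it ignores some of the official algorithm’s edge cases, such as the special treatment of “h” and “w”). The surname variants come from the sample records.

SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6",
}

def simple_soundex(name: str) -> str:
    """Simplified Soundex-style match code: first letter plus three digits."""
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    digits = [SOUNDEX_CODES.get(c, "") for c in name]
    code = name[0].upper()
    previous = digits[0]
    for d in digits[1:]:
        if d and d != previous:      # skip vowels and collapse repeated codes
            code += d
        previous = d
    return (code + "000")[:4]        # pad or truncate to four characters

for surname in ["O'Reilly", "Oreilly", "O'reily", "Reilly"]:
    print(surname, "->", simple_soundex(surname))   # the O'Reilly variants all map to O640

Records whose names hash to the same match code would then land in the same block, even when the spellings differ.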


C. Data Matching

Now that the records are in more manageable groups, I’ll undertake the step most associated with entity resolution: asserting which record references which entity. The most straightforward way of matching is determining equivalence between the fields of a record, but even with our idealized data set, there are still differences in spellings, missing fields, and possible incorrect ages. Matching records should result in groupings of records that are more closely associated. Since real data tend to be messy though, you can see that exact equivalence methods would fail to produce such associations.

Direct match methods can still allow for some variation in the fields of a record. One example, the Levenshtein edit distance, is an approximate string matching algorithm that compares two strings by producing a single value: the number of character edits (insertions, deletions, substitutions) required to transform one string into the other. Other approximate string matching algorithms, for instance q-gram and Jaro-Winkler, express the “distance” between strings as a function of the number of shared, ordered character sequences. In the sample records, the edit distance between the “last_name”, “first_name”, and “sex” fields is small. This is one reason these fields were not chosen as blocking criteria: they are comparatively constant.
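For concreteness, here is a minimal sketch of the Levenshtein distance as a two-row dynamic program; the example strings come from the sample records.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a                        # ensure b is the shorter string
    previous = list(range(len(b) + 1))     # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("oreilly", "oreily"))      # 1
print(levenshtein("handyman", "locksmith"))  # 8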


FIGURE 5. Matching hierarchy and the ranges of match for each used field

I have to determine which field(s) will act more definitively in resolving the blocked records to distinct entities. Establishing a clear hierarchy of fields and ranges of match for those fields will then help inform and shape the resolution results (see Figure 5). I have already used “birth_country” for blocking, and “last_name”, “first_name”, and “sex” were rejected as blocking criteria because their values are not very diverse. Since direct match methods are meant to refine groupings via narrower ranges of match, I can’t use any of those as match criteria. However, neither “age” nor “occupation” has been used, and both contain a more diverse set of values. Once the results from these match methods are calculated, a normalized match result value called a match score must be determined for each field. This takes the form of a manually chosen or calculated expected value (e.g. an edit distance of zero) and an acceptable range around that ideal in which valid values can fall. For now, in my example, I’ll just say that if a match result falls within the selected ranges of match, it’s an acceptable match. Using those fields we arrive at the following entity resolutions:

Entity 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937

Entity 2

id last_name first_name age sex birth_country occupation year_of_record
5 oreilly john 19 m canada clerk 1940

Entity 3

id last_name first_name age sex birth_country occupation year_of_record
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937

Entity 4

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940

Entity 5

id last_name first_name age sex birth_country occupation year_of_record
10 oreilly <blank> 28 m england locksmith 1937

Entity 6

id last_name first_name age sex birth_country occupation year_of_record
6 oreily john 42 m england clerk 1937

Entity 7

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937

What happens with these entities (or record clusters) is now another key decision point in the entity resolution process. One possibility is to take these clusters of records and merge each into a single record based on a rule favoring predominant field values or the most recent record. Another possibility is to keep the most representative record based on a similar criterion, retaining the rest for later reference or simply purging them. Let’s evaluate the results above to see what decision might be appropriate.
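One possible merge rule, sketched under the assumption that each cluster is a list of dictionaries: prefer the most recent record and borrow values for any fields it is missing. The function name and the rule itself are illustrative choices, not a prescribed method.

def representative_record(cluster):
    """Pick one record to stand in for a resolved entity: prefer the most
    recent year_of_record, and fill any blank fields from the other records."""
    newest = max(cluster, key=lambda r: r["year_of_record"])
    merged = dict(newest)
    for record in cluster:
        for field, value in record.items():
            if merged.get(field) in (None, "") and value not in (None, ""):
                merged[field] = value        # borrow values missing from the survivor
    return merged

entity_1 = [
    {"id": 0,  "last_name": "oreilly", "first_name": "john", "age": 27,
     "birth_country": "canada", "occupation": "clerk", "year_of_record": 1940},
    {"id": 11, "last_name": "oreilly", "first_name": "john", "age": 24,
     "birth_country": "canada", "occupation": "clerk", "year_of_record": 1937},
]
print(representative_record(entity_1))       # the 1940 census record, with gaps filled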


D. Verification

Entity 1 presents as a correct resolution in that there is a 1:1 record pairing and that it has met the proposed match criteria. Entities 2 and 6 represent another scenario where a lone record from one data set does not correspond to a record in the other data set. This can be deemed either a full or partial entity resolution, depending on the rules of the system. In my case, I’ll consider them full resolutions since I’m interested in determining which records likely correspond to individual immigrants and citizens recorded by the census.

Entity 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937

Entity 2

id last_name first_name age sex birth_country occupation year_of_record
5 oreilly john 19 m canada clerk 1940

Entity 6

id last_name first_name age sex birth_country occupation year_of_record
6 oreily john 42 m england clerk 1937

Entity 3 represents a partially successful resolution in that most of the matching criteria have been met (even with the blank occupation for record 7), but the four records in its grouping will need to be revisited and split into two entities.

Entity 3

id last_name first_name age sex birth_country occupation year_of_record
1 oreilly john 13 m ireland tailor 1940
2 oreilly john 13 m ireland tailor 1940
7 oreilly john 10 m ireland <blank> 1937
8 oreilly john 10 <blank> ireland tlr 1937


Notice that record 10 in Entity 5’s group has a different occupation than the John O’Reillys in Entity 4 (a Levenshtein distance of 8 operations), but its fields are otherwise very similar. Once the entities are grouped, there may also be other verification criteria to consider. We want to be able to group records from across data sets to resolve them, but the records of Entity 4 come from just one data set, the 1940 census. One of the people potentially represented by Entity 4’s records could have changed occupations over time. This is an example where I can set a rule saying that records with a later “year_of_record” contain the most representative value. In order to fix this I would have to merge Entities 4 and 5 and then split them. However, given that Entity 4’s records are so similar, the decision to pair either record 3 or record 4 with record 10 is unclear.

Entity 4

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940

Entity 5

id last_name first_name age sex birth_country occupation year_of_record
10 oreilly <blank> 28 m england locksmith 1937

Entity 7 failed the age matching criterion, but let’s suppose this John O’Reilly from the ship’s manifest really matches one of the records in Entity 4 or 5. The range of values considered for a match (or “range of match”) for “age” treated that field as a range of integers, but ultimately a recorded number is still a string of characters, whether written by hand or typed. “39” is only one edit away from “29”.

Entity 7

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937


The initial resolutions of Entity 4 and Entity 7 above are instances of false positive and false negative matches, respectively. These types of match outcomes are more difficult to determine without further review of the resolved entity references. Another method to decrease the number of unnecessary record comparisons is called asserted equivalence, where an external data source is considered to be the source of truth. During the verification step, this can be a way to validate the recent entity resolutions and to adjust the matching criteria to reduce the number of incorrect matches or non-matches. Note that this expert data can be used at any point during entity resolution – blocking, matching, or evaluation.

Entity resolution systems try to strike a balance between unmatched references and the quality of the matches generated. The accuracy of match outcomes is a function of the selected blocking and matching methods. Since 100% correct outcomes are unlikely, one can attempt to increase this accuracy by considering relationships between records other than direct field-to-field matching.

E. Return to Matching: More Advanced Match Methods

Extending Data Matching with Weights

In a weighted match method, for instance, each field is given an additional value (zero through one, inclusive) to indicate its importance when compared with other records. This stands in contrast to the discrete hierarchy of the blocking and matching schemes (a coordinated set of methods) used in my example above, where match results are either positive or negative. Methods like approximate string matching can still be used when comparing fields, but once those assessments are made, each result is weighted to produce the set of entity resolutions. This produces a new, probabilistic measure via the resolution function E:

E(Ri,Rj) = w1*m1(Ri1,Rj1) + w2*m2(Ri2,Rj2) + … + wn*mn(Rin,Rjn)

For E, I’ll now fully consider the values of a probabilistic match score. In order to take advantage of weighted matching, we must consider the three sets of outcomes possible from probabilistic matching: matches, non-matches, and now potential matches. Looking again at the imperfect results of my first custom blocking and matching methods, you can see that introducing weights may result in fewer false positives and false negatives. For instance, giving “age” a weight of 0.25 during matching and “occupation” a weight of 0.75 results in splitting Entity 4’s cluster of records, with record 10 merged into the Entity 7 cluster.

Entity 4

id last_name first_name age sex birth_country occupation year_of_record
3 oreilly jon 29 m england handyman 1940
4 oreilly jon 29 m england handyman 1940

Entity 5

id last_name first_name age sex birth_country occupation year_of_record
10 oreilly <blank> 28 m england locksmith 1937

Entity 7

id last_name first_name age sex birth_country occupation year_of_record
9 oreilly jon 39 m <blank> locksmith 1937


This set of weights and matching results forms an agreement pattern. Because this pattern gives us a measure of probability for each comparison made, we can model a distribution of error based on the norms established by the resolution results. Parts of the potential matches can then be folded into the match and non-match sets depending on the allowable ranges of values of E.
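A small sketch of the weighted resolution function E, assuming per-field match functions that return values between zero and one. The weights and the simple age and occupation matchers below are illustrative choices, not the post’s exact rules.

def weighted_score(r1, r2, weights, matchers):
    """Weighted resolution function E: a weighted sum of per-field match results."""
    return sum(w * matchers[field](r1.get(field), r2.get(field))
               for field, w in weights.items())

def age_match(a, b):
    # 1.0 if the ages are within five years, otherwise 0.0; blanks score 0.
    return 1.0 if a is not None and b is not None and abs(a - b) <= 5 else 0.0

def occupation_match(a, b):
    # Crude exact comparison; an edit-distance score could be substituted here.
    return 1.0 if a and b and a == b else 0.0

weights = {"age": 0.25, "occupation": 0.75}
matchers = {"age": age_match, "occupation": occupation_match}

r3  = {"id": 3,  "age": 29, "occupation": "handyman"}
r9  = {"id": 9,  "age": 39, "occupation": "locksmith"}
r10 = {"id": 10, "age": 28, "occupation": "locksmith"}

print(weighted_score(r3, r10, weights, matchers))   # 0.25: ages agree, occupations differ
print(weighted_score(r9, r10, weights, matchers))   # 0.75: occupations agree, ages differ

With a match threshold between those two scores, record 10 pairs with record 9 rather than with Entity 4’s records, as described above.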

 Transitive Equivalence

The next logical step after considering weighted relationships between fields would be to look at the relationships between records themselves and how those relationships can be used as another form of matching criteria. One such method is transitive equivalence, which uses field matching in non-sequential records to overcome definite mismatches in fields across sequential records. A matching scheme that includes a transitive equivalence measure holds that intermediary records can link previously and successively processed records, making them all potentially equivalent.

Let’s say that record A mismatches with record B, but the next record C matches with B. The transitive equivalence measure would check to see if C also matches with A. If it did, it could be understood that all three records match and therefore resolve to the same entity.


FIGURE 6. Transitive equivalence relations between records

 

Consider for instance, if the records of Block 1 were processed for matching in this way:

Block 1

id last_name first_name age sex birth_country occupation year_of_record
0 oreilly john 27 m canada clerk 1940
5 oreilly john 19 m canada clerk 1940
11 oreilly john 24 m canada clerk 1937


Here, records 0 and 5 match on “last_name” but would be deemed a non-match due to age. Comparing record 5 to record 11 shows that age would have met its range of match, and their occupations match, implying that records 0, 5, and 11 all potentially refer to the same entity. So with transitive equivalence, the order in which records are considered matters. If record 5’s age had been any lower, it would have fallen outside the +/- 5 year range of match with record 11, the transitive equivalence relationship would have been lost, and the result would have been a potential false negative.
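One way to compute these transitive links without depending on processing order is a union-find (disjoint-set) structure over the record IDs, sketched here with the ±5-year age rule from above; the class and sample ages are illustrative.

class UnionFind:
    """Tiny union-find: records that match, directly or transitively, share a root."""
    def __init__(self, ids):
        self.parent = {i: i for i in ids}

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]   # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

ages = {0: 27, 5: 19, 11: 24}
uf = UnionFind(ages)
for i, j in [(0, 5), (5, 11), (0, 11)]:
    if abs(ages[i] - ages[j]) <= 5:          # the age range-of-match used above
        uf.union(i, j)

clusters = {}
for i in ages:
    clusters.setdefault(uf.find(i), []).append(i)
print(list(clusters.values()))               # [[0, 5, 11]]: 0 and 5 are linked through 11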

F. Evaluation

One matching scheme that prevents record processing order from hindering the establishment of entity reference relationships is association analysis. Using this method, records are thought of as being in a network with multiple relationships, each of which is determined by the results from the matching criteria for each of their fields. This enables a more comprehensive probabilistic match that can be considered via network/graph theory measurements of connection and distribution such as homophily, multiplexity, mutuality, and closure.


FIGURE 7. Entity reference relationships in network form

As Figure 7 illustrates, the blocking or matching schemes may split references across entities based on fields like name or address that are deemed distinguishing. Metrics from network/graph theory can be used to re-match those references appropriately, remediating inaccurate entity resolutions. Such metrics can reveal relationships between references and between entities that are broader than the local ones established by record processing order and single-field matching.
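A rough sketch of this network view, assuming the networkx library and some hypothetical pairwise match scores between record IDs. Connected components act as candidate entities, and standard graph measures can flag references worth revisiting.

import networkx as nx

# Hypothetical pairwise match scores between entity references (record IDs).
match_scores = {(3, 4): 0.95, (3, 10): 0.40, (4, 10): 0.40, (9, 10): 0.80}

G = nx.Graph()
for (i, j), score in match_scores.items():
    if score >= 0.5:                       # keep only sufficiently strong links
        G.add_edge(i, j, weight=score)

# Connected components act as candidate entities, e.g. [[3, 4], [9, 10]].
print([sorted(component) for component in nx.connected_components(G)])

# Degree and clustering are examples of the network measures mentioned above.
print(dict(G.degree()))
print(nx.clustering(G))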

Though association analysis can be included in a matching scheme, it can also help evaluate entity resolution outputs. If one can assess relationships between entity references based on network measures, then association analysis naturally extends to the aggregated data of previously resolved entities. This can help resolve larger-scale data relationships that individual fields have trouble capturing – such as determining the members of a household. If I took a sample of names from the ship manifests and censuses, for instance, this method could aid in resolving the movements of members between households over time and provide other useful statistics on the average composition and dynamics of those households.

Conclusion

In this post I have outlined a five-step workflow for resolving data to the real-world entities it references – (1) Preprocessing, (2) Blocking, (3) Matching, (4) Verification, and (5) Evaluation. I suggested some basic methods for moving data through each step towards those resolutions, and for understanding the quality of the results. There are many more matching and verification techniques available than those covered in this post.

The US Census Bureau’s “72-year rule” limits current access to many of its records because they contain personally identifiable information too recent for public use. Thus, running Ellis Island manifests through an entity resolution system is mostly academic, though it illustrates how entity resolution gets interesting. With the employment of entity resolution techniques, a ledger filled with common names like O’Reilly becomes a map of traceable relationships, and access to its information is no longer limited to the tools of visual inspection and prior knowledge.

 

Neustar at the 2015 Grace Hopper Celebration of Women in Computing

Author’s Note: Hello readers! I’m Julie Hollek, and I am a data scientist at Neustar, where I focus on understanding questions around identity. Before joining Neustar, I received my PhD in astronomy from the University of Texas at Austin, where I studied the chemical composition of the oldest stars in our galaxy.

This past October, I presented a Student Opportunity Lab (SOL) at the Anita Borg Institute’s Grace Hopper Celebration of Women in Computing (GHC) about my transition from graduate student in astronomy to data scientist at Neustar.  My talk, “Old Stars to Neustar: Academia Lessons Translated to Data Science”, explored four foundational skills I learned through my academic training that are most helpful to my current role as a data scientist: networking, problem solving, communication, and grit.

It’s interesting that, specifically, working through the downtime in my academic career – difficult collaborations, the research-teaching-study-life balancing act, and more – helped me develop grit. This has directly transferred not only to working in industry, but to data science in particular. Data scientists are often a minority amongst technologists, and being a woman in this field means being a minority within a minority. Thus, grit is all the more important for taking a complex idea from inception to completion in a field that does not necessarily reflect your experiences.

Nearly half the GHC attendees are still in college, and SOLs like this provide them with invaluable access to industry professionals who discuss different aspects of working in technology. For me, presenting this SOL was inspiring because I was able to connect with students from across the globe. Encouraging young people to pursue STEM careers and engaging with highly enthusiastic students is infectious; both were conference highlights.

Since this was my first time attending the Grace Hopper Celebration, I was also excited to learn from women technologists showcasing cutting-edge work. Technical talks at GHC serve the dual purpose of allowing these technologists to promote their work and enabling attendees of all levels to learn about new and interesting subjects. There were several tracks that were especially relevant to the Neustar Research group, including big data, internet of things, and (most pertinent) data science. Data science as a field is exploding, as witnessed by the popular track devoted specifically to the topic at GHC. One especially outstanding talk was Anita Mehrotra’s “Virality at Buzzfeed”.

black? blue? gold? white?

The talk was framed around an article polling readers on which colors they saw in “The Dress“, a meme that originated on Tumblr. The Buzzfeed post quickly went viral and resulted in the largest traffic ever seen on the site. Buzzfeed’s business model relies on views and clicks, but is also very invested in user shares, where readers use social media platforms to distribute articles to their networks. This generates over 75% of their traffic. By using the Susceptible-Infected-Recovered model from biostatistics, data scientists can characterize the “virality” of a post, or essentially the ratio of shared views to seed views. Those metrics are then used to optimize performance. Furthermore, by evaluating viral posts using basic graph theory principles, they can answer questions such as over which platforms did the sharing occur (edge attributes)? Did large numbers of people share the post or did a few individuals share it with a lot of other people (tree width)? 

“The Dress” talk highlights the importance of GHC.  This conference is unique in that it offers a way for women to lead the discourse about emerging technologies. The talk showcased novel data science techniques and tools as told through the context of a popular meme.  GHC provided an abundance of interesting topics for further research and showed new and insightful analytics that can be applied to a number of business questions here at Neustar.  And all this while maintaining diversity and inclusion as GHC’s core mission, something that is also at the heart of what we’re trying to do at Neustar.

 

Neustar at SIAM’s SODA 2015

Author’s Note: Hello readers! I’m Sonya Berg, and this is my first post on Neustar’s research blog. I am a data scientist at Neustar Research, where my focus is on solving big data problems in advertising. Before joining Neustar, I completed a PhD in mathematics from UC Davis, where I developed algorithms for quantum computing.

Matt Curcio, Rafael Solari, and I had the pleasure of starting our new year off by attending SIAM’s Symposium on Discrete Algorithms (SODA) in sunny San Diego. SODA is a computer science theory conference, with most talks given by academics. Below I highlight two examples of high-level topics that we, as practitioners, were excited to see covered by theorists at SODA.

  1. Dataset privacy.  In a variety of industries, publicly releasing datasets, or even just statistics about datasets, has proved to be useful for R&D. However, there is general concern that publishing data about individuals can compromise personal privacy. For example, last summer we highlighted breaches of privacy in a published dataset. At the SODA conference we learned about advancements in privacy research, including how to privately conduct supervised learning, as well as the application of privacy mechanisms to discourage lying.
  2. Distributed computing. As a big data shop, we use various distributed computing systems, such as Hadoop, Hive, and Amazon’s Redshift. We’ve seen the ugly sides of parallelism, such as clusters bottlenecking on intra-cluster bandwidth during repartitioning. We were glad to learn at SODA that academics are responding to this reality by exploring theoretical approaches to minimizing bandwidth complexity.

Learning directly from the top computer science researchers in the world can certainly be a challenge. But, we continue to attend conferences such as SODA because we return home inspired to apply new theoretical concepts to our most interesting R&D problems.


Beautiful San Diego

 

HLL talk at SFPUG

I had the pleasure of speaking at the SF PostgreSQL User Group’s meetup tonight about sketching, the history of HLL, and our implementation of HLL as a PG extension. My slides are embedded below and you can get a PDF copy here. Be sure to click the gear below to show speaker’s notes for context!

If video is made available, I’ll post an update with a link!

Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset

In my previous post, Differential Privacy: The Basics, I provided an introduction to differential privacy by exploring its definition and discussing its relevance in the broader context of public data release. In this post, I shall demonstrate how easily privacy can be breached and then counter this by showing how differential privacy can protect against this attack. I will also present a few other examples of differentially private queries.

The Data

There has been a lot of online comment recently about a dataset released by the New York City Taxi and Limousine Commission. It contains details about every taxi ride (yellow cabs) in New York in 2013, including the pickup and drop off times, locations, fare and tip amounts, as well as anonymized (hashed) versions of the taxi’s license and medallion numbers. It was obtained via a FOIL (Freedom of Information Law) request earlier this year and has been making waves in the hacker community ever since.

The release of this data in this unalloyed format raises several privacy concerns. The most well-documented of these deals with the hash function used to “anonymize” the license and medallion numbers. A bit of lateral thinking from one civic hacker and the data was completely de-anonymized. This data can now be used to calculate, for example, any driver’s annual income. More disquieting, though, in my opinion, is the privacy risk to passengers. With only a small amount of auxiliary knowledge, an attacker could use this dataset to identify where an individual went, how much they paid, their weekly habits, etc. I will demonstrate how easy this is to do in the following section.

Violating Privacy

Let’s consider some of the different ways in which this dataset can be exploited. If I knew an acquaintance or colleague had been in New York last year, I could combine known information about their whereabouts to try and track their movements for my own personal advantage. Maybe they filed a false expense report? How much did they tip? Did they go somewhere naughty? This can be extended to people I don’t know – a savvy paparazzo could track celebrities in this way, for example.

There are other ways to go about this too. Simply focusing the search on an embarrassing night spot, for example, opens the door to all kinds of information about its customers, such as name, address, marital status, etc. Don’t believe me? Keep reading…

Stalking celebrities

First things first. How might I track a person? Well, to zone in on a particular trip, I can use any combination of known characteristics that appear in the dataset, such as the pickup or drop-off coordinates or datetime, the medallion or license number, or even the fare amount from a receipt. Being the avid fanboy that I am (note: sarcasm), I thought it might be interesting to find out something new about some of the celebrities who had been seen in New York in 2013. In particular, where did they go to / come from, and how much did they tip?

In order to do this, I spent some of the most riveting hours of my professional career searching through images of “celebrities in taxis in Manhattan in 2013” to find enough information to identify the correct record in the database. I had some success – combining the below photos of Bradley Cooper and Jessica Alba with some information from celebrity gossip blogs allowed me to find their trips, which are shown in the accompanying maps.

Bradley Cooper

Jessica Alba

In Brad Cooper’s case, we now know that his cab took him to Greenwich Village, possibly to have dinner at Melibea, and that he paid $10.50, with no recorded tip. Ironically, he got in the cab to escape the photographers! We also know that Jessica Alba got into her taxi outside her hotel, the Trump SoHo, and somewhat surprisingly also did not add a tip to her $9 fare. Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain. Considering the speculative drivel that usually accompanies these photos (trust me, I know!), a celebrity journalist would be thrilled to learn this additional information.

A few innocent nights at the gentlemen’s club

But OK, perhaps you’re not convinced. After all, this dataset is (thankfully) not real-time. How about we leave the poor celebrities alone and consider something a little more provocative. Larry Flynt’s Hustler Club is in a fairly isolated location in Hell’s Kitchen, and no doubt experiences significant cab traffic in the early hours of the morning. I ran a query to pull out all pickups that occurred outside the club after midnight and before 6am, and mapped the drop-off coordinates to see if I could pinpoint individuals who frequented the establishment. The map below shows my results – the yellow points correspond to drop-offs that are closely clustered, implying a frequent customer.


The potential consequences of this analysis cannot be overstated. Go ahead, zoom in. You will see that the GPS coordinates are terrifyingly precise. Using this freely-obtainable, easily-created map, one can find out where many of Hustler’s customers live, as there are only a handful of locations possible for each point. Add a little local knowledge, and, well, it’s not rocket science. “I was working late at the office” no longer cuts it: Big Brother is watching.

Even without suspicions or knowledge of the neighborhood, I was able to pinpoint certain individuals with high probability. Somewhat shockingly, just googling an address reveals all kinds of information about its inhabitants. Take the following example:

Examining one of the clusters in the map above revealed that only one of the 5 likely drop-off addresses was inhabited; a search for that address revealed its resident’s name. In addition, by examining other drop-offs at this address, I found that this gentleman also frequented such establishments as “Rick’s Cabaret” and “Flashdancers”. Using websites like Spokeo and Facebook, I was also able to find out his property value, ethnicity, relationship status, court records and even a profile picture!

Of course, I will not publish any of this information here, but this was by no means a unique situation. While the online availability of all this potentially private information is part of a wider discussion, it’s fair to say that this guy has a right to keep his nighttime activities a secret.

To reiterate: the findings in this section were not hard to uncover. Equipped with this dataset, and just a little auxiliary information about you, it would be quite trivial for someone to follow your movements, collecting data on your whereabouts and habits, while you remain blissfully unaware. A stalker could find out where you live and work. Your partner may spy on you. A thief could work out when you’re away from home, based on your habits. There are obvious mitigating factors here, such as the population density of Manhattan and the time delay of the data release, but the point still stands.

Applying Differential Privacy

So, we’re at a point now where we can agree this data should not have been released in its current form. But this data has been collected, and there is a lot of value in it – ask any urban planner. It would be a shame if it was withheld entirely.

Enter differential privacy. Sure, the data could be anonymized, but that obviously didn’t work out too well. More care could be taken, for example hashes could be salted as suggested in this forum. But that still doesn’t really protect against either of the attacks above. If you read my first post, you will know that only differential privacy guarantees the protection of all individuals in the data.

So how should we go about applying differential privacy? Remember that differential privacy works by adding noise to the query. Our three maps above are point queries: they filter the data rather than aggregating it. This means we have to be extra careful. We cannot simply run the query and add noise to the result (as we might do with an aggregation query) since this will release private information! To understand why, suppose we framed our question like this: “How many taxis picked up a passenger outside The Greenwich Hotel at 19:35 on July 8, 2013?” By running our original query, even after adding noise to the output coordinates, we have indirectly answered this question accurately. This constitutes a privacy breach, as information about specific individuals can be learned in this way.

Instead, we have to turn our query into an aggregation. If we place a grid over our map, and count the number of output coordinates that fall into each cell, we end up with a set of counts that are generally independent of each other. Differential privacy is now straightforward to apply, as we are working with aggregate quantities. The privatized versions of these three queries are displayed below. Please refer to the appendix for a more technical discussion of how this was done.

We can appreciate from these privatized maps the importance of ε, the privacy parameter, which I introduced in Differential Privacy: The Basics. When ε is low we can no longer accurately track either celebrity, nor learn how much they spent. In fact, it takes unreasonably high levels of ε to reliably locate them. We could opt for a finer grid, indeed we could all but replicate the point query with a fine enough grid, but differential privacy’s Laplace mechanism is robust enough to effectively ensure that the noise obfuscates any actual difference in the data. The privatization of the Hustler query is more interesting – since it is less targeted, the difference caused by the participation of one individual is less pronounced. As a result, there is more information in the privatized output – for example, the square over Wall Street still stands out, which echoes the actual evidence shown above.

Cooper, Privatized

Alba, Privatized

Hustler Customers, Privatized

What about other queries? After all, did we just turn the data into nonsense? Clearly, our ability to learn what we sought from these queries was severely compromised. However, since they were so targeted, one could argue it is a good thing that the results are hard to interpret. The visualization below shows the results of other queries, and demonstrates that useful insights may still be extracted, particularly when aggregating over all records.


Concluding Remarks

Now that you’re an expert in differential privacy, I urge you to consider: what should have been released by the NYC Taxi and Limousine Commission? How should the public be able to access it? Is this in line with Freedom of Information Law? If not, how should the law be changed to accommodate private data release?

It is only by considering these questions that we can hope to motivate a change in practices. With data privacy and security mishaps cropping up in the news almost daily, the topic has never before received so much attention. This post is yet another reminder of the fallibility of our systems and practices. As we know, differential privacy, with its strong guarantees – borne out of its sturdy fundamentals – offers a solution to many of these concerns.

As data scientists, it is in our best interests to encourage free data flow in our society, and lead the charge to ensure participants in these datasets are protected. The science is clearly there, and will continue to be developed in the coming years. The onus is now on industry and governments to pay attention and embrace differential privacy to protect their stakeholders and citizens.

Appendices

SQL queries used in this analysis

Tracking Bradley Cooper and Jessica Alba:

SELECT D.dropoff_latitude, D.dropoff_longitude, F.total_amount, F.tip_amount
FROM tripData AS D, tripFare AS F
WHERE D.hack_license = F.hack_license AND D.pickup_datetime = F.pickup_datetime
  AND pickup_datetime > "2013-07-08 19:33:00" AND pickup_datetime < "2013-07-08 19:37:00"
  AND pickup_latitude > 40.719 AND pickup_latitude < 40.7204
  AND pickup_longitude > -74.0106 AND pickup_longitude < -74.01;
SELECT D.dropoff_latitude, D.dropoff_longitude, F.total_amount, F.tip_amount
FROM tripData AS D, tripFare AS F
WHERE D.hack_license = F.hack_license AND D.pickup_datetime = F.pickup_datetime
  AND dropoff_datetime > "2013-09-07 12:19:00" AND dropoff_datetime < "2013-09-07 12:25:00"
  AND dropoff_latitude > 40.727 AND dropoff_latitude < 40.728
  AND dropoff_longitude > -73.994 AND dropoff_longitude < -73.993;

Identifying Hustler customers:

SELECT dropoff_latitude, dropoff_longitude
FROM tripData
WHERE pickup_latitude > 40.767249 AND pickup_latitude < 40.768
  AND pickup_longitude > -73.996782 AND pickup_longitude < -73.995538
  AND HOUR(pickup_datetime) < 6
  AND trip_distance > 5;
Privacy calculations

I have chosen to use the Laplace mechanism to privatize the above 3 queries, as it is relatively easy to apply and explain. However, in general, the Laplace mechanism is not appropriate for geospatial data. For example, it does not consider topographical features – inspections of the privatized maps show positive counts in the middle of the East River! Rather, there are more complex methods for adding noise to spatial data – Graham Cormode’s paper, Differentially Private Spatial Decompositions, offers a more thorough mechanism, while this blog covers the topic more generally. However, I will proceed with Laplace, in the hope that, by moving away from discussions of complicated mechanisms, the core tenets of differential privacy may be grasped more easily.

So how should we go about privatizing the above queries using the Laplace mechanism? As mentioned above, these are point queries. We should not apply noise directly to the output, because our sensitivity is essentially undefined. Why? Let’s consider the identity query, Q(D) = D. This is a point query – there is no aggregation. What happens if we remove a record from the dataset? We have Q(D’) = D’. Fine. But now look at what happens to our sensitivity. Recall the formula:

\Delta \mathit{f}=\underset{D,D'}{max}||\mathit{Q(D)}-\mathit{Q(D')}||_{1}

This formula is only defined when Q() returns a vector of fixed length (independent of D), such as an average, or a sum. With Q(D)=D, this is of course not the case, and so we cannot calculate the sensitivity. Intuitively, this result is true for other point queries. So we need to find another way to privatize point queries.

We know that privatizing aggregation queries is a lot more straightforward. Is there a way to turn our point query into an aggregation query? What if we bucketed the data into similar groups and then counted how many records returned by the query fall into each group? If we define our groups tightly enough, we would fully replicate the point query. We end up with a series of independent counts, which is easily privatized – the change from removing one record, any record, is at most 1. So this is our sensitivity. Adding Laplace noise with a sensitivity of 1 to the count in each bucket guarantees differential privacy.
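A minimal sketch of this bucket-and-noise approach, assuming NumPy and a few made-up drop-off coordinates standing in for a query’s output; with sensitivity 1, the Laplace noise scale is 1/ε.

import numpy as np

rng = np.random.default_rng(0)

def privatized_histogram(latitudes, longitudes, bins, epsilon):
    """Bucket drop-off coordinates into a grid and add Laplace noise to each count.
    Removing any single record changes one count by at most 1, so the L1
    sensitivity is 1 and the noise scale is 1/epsilon."""
    counts, lat_edges, lon_edges = np.histogram2d(latitudes, longitudes, bins=bins)
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return noisy, lat_edges, lon_edges

# Hypothetical drop-off coordinates, not taken from the actual dataset.
lats = np.array([40.728, 40.7281, 40.751, 40.762])
lons = np.array([-73.994, -73.9941, -74.002, -73.980])
noisy_counts, _, _ = privatized_histogram(lats, lons, bins=20, epsilon=0.5)
print(noisy_counts.round(1))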

Of course, if we have too many buckets, we will have lots of 0s and 1s. Since adding Laplace noise to this essentially hides this difference (when ε is at a reasonable level of course), we may struggle to learn anything from this data. By allowing each bucket to incorporate a greater range of values, we can obtain more meaningful privatized counts, albeit by sacrificing a degree of accuracy. The optimal way to group the data therefore depends on which features of the output are of interest.

This method is not confined to one-dimensional responses. As can be seen in the privatized maps above, the histogram is 2-dimensional, reflecting the latitude / longitude output from each query. In fact, any mapping of n-dimensional data into independent buckets is sufficient to ensure differential privacy. For the celebrity queries, I did treat the latitude/longitude coordinates and the fare/tip amounts as two separate histogram queries, as these pairs are broadly uncorrelated with each other.

For additional detail on the privacy calculations used throughout this post, the reader is encouraged to examine the code behind the various visualizations. While it is clear that the Laplace mechanism is applied slightly differently each time, it is not hard to imagine how this might be automated so differential privacy could be applied more generally.

Hitting the Books: EADS Summer School on Hashing

Rob, Matt, and I just wrapped up our trip to Copenhagen for the EADS Summer School on Hashing at the University of Copenhagen and it was a blast! The lineup of speakers was, simply put, unbeatable: Rasmus Pagh, Graham Cormode, Michael Mitzenmacher, Mikkel Thorup, Alex Andoni, Haim Kaplan, John Langford, and Suresh Venkatasubramanian. There’s a good chance that any paper you’ve read on hashing, sketching, or streaming has one of them as a co-author or is heavily influenced by their work. The format was three hour-and-a-half lectures for four days, with exercises presented at the end of each lecture. (Slides can be found here. They might also post videos. UPDATE: They’ve posted videos!)

Despite the depth of the material, almost all of it was approachable with some experience in undergraduate math and computer science. We would strongly recommend both of Michael Mitzenmacher’s talks (1, 2) for an excellent overview of Bloom Filters and Cuckoo hashing that are, in my opinion, significantly better and more in depth than any other out there. Specifically, the Bloom Filter talk presents very elegantly the continuum of Bloom Filter to Counting Bloom Filter to Count-Min Sketch (with “conservative update”) to the Stragglers Problem and Invertible Bloom Filters to, finally, extremely recent work called Odd Sketches.

Similarly, Mikkel Thorup’s two talks on hashing (1, 2) do a very thorough job of examining the hows and whys of integer hashing, from the simplest multiply-mod-prime schemes all the way to modern work on tabulation hashing. And if you haven’t heard of tabulation hashing, and specifically twisted tabulation hashing, get on that because (1) it’s amazing that it doesn’t produce trash given how simple it is, (2) it’s unbelievably fast, and (3) it has been proven to provide the guarantees required for almost all of the interesting topics we’ve discussed on the blog in the past: Bloom Filters, Count-Min sketch, HyperLogLog, chaining/linear-probing/cuckoo hash tables, and so on. We really appreciated how much attention Mikkel devoted to practicality of implementation and to empirical performance when discussing hashing algorithms. It warms our heart to hear a leading academic in this field tout the number of nanoseconds it takes to hash an item as vocally as the elegance of the proof behind it!

We love this “Summer School” format because it delivers the accumulated didactic insight of the field’s top researchers and educators to both old techniques and brand new ones. (And we hope by now that everyone reading our blog appreciates how obsessed we are with teaching and clarifying interesting algorithms and data structures!) Usually most of this insight (into origins, creative process, stumbling blocks, intermediate results, inspiration, etc.) only comes out in conversation or lectures, and even worse is withheld or elided at publishing time for the sake of “clarity” or “elegance”, which is a baffling rationale given how useful these “notes in the margin” have been to us. The longer format of the lectures really allowed for useful “digressions” into the history or inspiration for topics or proofs, which is a huge contrast to the 10-minute presentations given at a conference like SODA. (Yes, obviously the objective of SODA is to show a much greater breadth of work, but it really makes it hard to explain or digest the context of new work.)

In much the same way, the length of the program really gave us the opportunity to have great conversations with the speakers and attendees between sessions and over dinner. We can’t emphasize this enough: if your ambition is to implement and understand cutting edge algorithms and data structures, then the best bang for your buck is to get out there and meet the researchers in person. We’re incredibly lucky to call most of the speakers our friends and to regularly trade notes and get pointers to new research. They have helped us time and again when we’ve been baffled by inconsistent terminology or had a hunch that two pieces of research were “basically saying the same thing”. Unsurprisingly, they are also the best group of people to talk to when it comes to understanding how to foster a culture of successful research. For instance, Mikkel has a great article in the March 2013 issue of CACM (pay-wall’d) on how to systematically encourage and reward research. Also worthwhile is his guest post on Bertrand Meyer’s blog.

If Mikkel decides to host another one of these, we cannot recommend attending enough. (Did we mention it was free?!) Thanks again Mikkel, Rasmus, Graham, Alex, Michael, Haim, and John for organizing such a great program and lecturing so eloquently!

Open Source Release: java-hll

We’re happy to announce our newest open-source project, java-hll, a HyperLogLog implementation in Java that is storage-compatible with the previously released postgresql-hll and js-hll implementations. And as the rule of three dictates, we’ve also extracted the storage specification that makes them interoperable into its own repository. Currently, all three implementations support reading storage specification v1.0.0, while only the PostgreSQL and Java implementations fully support writing v1.0.0. We hope to bring the JS implementation up to speed, with respect to serialization, shortly.

AK at re:Invent 2013

I was given the opportunity to speak at AWS re:Invent 2013, on Wednesday, Nov. 13th, about how we extract maximum performance, across our organization, from Redshift. You can find the slides here, and my voice-over, though not identical, is mostly captured by the notes in those slides.

The talk, in brief, was about the technical features of Redshift that allow us to write, run, and maintain many thousands of lines of SQL that run over hundreds of billions of rows at a time. I stayed away from nitty-gritty details in the talk, since I only had 15 minutes, and tried to focus on high-level take-aways:

  • SQL is code, so treat it like the rest of your code. It has to be clean, factored, and documented. Use CTEs to break queries into logical chunks. Use window functions to express complex ideas succinctly.
  • Redshift is an MPP system with fast IO, fast sorting, and lots of storage. Use this to your advantage by storing multiple different sort orders of your data if you have different access patterns. Materialize shared intermediates so many queries can take advantage of them.
  • Redshift has excellent integration with S3. Use the fat pipes to cheaply materialize query intermediates for debugging. Use the one-click snapshot feature to open up experimentation with schema, data layout, and column compression. If it doesn’t work out, you revert to your old snapshot.
  • Use the operational simplicity of Redshift to be frugal. Turn over the responsibility of managing cluster lifecycles to the people that use them. For instance, devs and QA rarely need their clusters when the workday is done. The dashboards are such a no-brainer that it’s barely a burden to have them turn off and start up their own clusters. Allow users to take responsibility for their cluster, and they will become more responsible about using their cluster.
  • Use the operational simplicity of Redshift to be more aware. With just a few clicks, you can launch differently sized clusters and evaluate your reports and queries against all of them. Quantify the cost of your queries in time and money.
  • It’s a managed service: stop worrying about nodes and rows and partitions and compression! Get back to business value:
    • How long does the customer have to wait?
    • How much does this report cost?
    • How do I make my staff more productive?
    • How do I minimize my technical debt?

Slides and Videos are up for AK Data Science Summit

We’re happy to announce that all the talks (slides and video) from our Data Science Summit are available for free, right here on the blog! You can find a full list of the conference’s content here.

Enjoy!

Data Science Summit – Update

I don’t think I’m going out on a limb saying that our conference last week was a success. Thanks to everyone who attended and thanks again to all the speakers. Muthu actually beat us to it and wrote up a nice summary. Very kind words from a great guy. For those of you that couldn’t make it (or those that want to relive the fun) we posted the videos and slides. Thanks again to everyone for making it such a success!


Muthu being Muthu during David Woodruff’s fantastic talk on streaming linear algebra
