Graph Analytics for Identity Resolution—Transforming Billions of Customer Records in One Minute

June 18, 2014 Chunsheng (Victor) Fang, Ph.D.

Joint work performed by Victor Fang + Jarrod Vawdrey.

Getting a single view of customer data is harder than it sounds.

But, data science has changed the game.

Like the printing press, the cotton gin, and the Model T, Pivotal Data Scientists have invented a revolutionary way to make the very difficult much easier. In this case, it had a significant impact on marketing ROI.

Taking only days of consulting and one minute to compute, the Pivotal Data Science team was able to transform billions of disparate customer records spread across multiple data silos and lines of business into a clean, ROI-optimized, customer-centric segmentation model for target marketing at a major insurance provider.

Many have used or heard of machine learning-based analytics to solve these types of problems, and the Pivotal Data Science team used graph-based analytics to solve this problem as they have with other applications. In this blog, we illustrate just one such example of how Pivotal uses graph-based data science to tackle, at scale, the real-world problem of identity resolution for customer segmentation in the insurance industry.

Figure 1, A large scale real world graph visualized by the authors. Different colors indicating different Connected Components.

Background—The Marketing ROI Optimization Problem

Back in 2012, a major insurance firm engaged the Pivotal Data Science team to improve customer segmentation and optimize ROI in target marketing campaigns. To achieve the goal, the team needed to segment customers into clusters of like characteristics using data from multiple data sources.

The problem was that each of the firm’s lines of business (LOB) had separate marketing outreach programs and systems. Identifying individual customer records across LOBs became ambiguous, uncertain, and very difficult to resolve. Without a clear, unified view of customer data, the firm could not derive descriptive features of customer identities for downstream segmentation and ultimately optimize the ROI of marketing campaigns.

The diagram below shows an example of the types of identity data sources that were resolved and linked in this effort.

Figure 2. Example of the types of data sources that can be combined to develop a detailed 360-degree view of consumers.

The Pivotal Data Science team solved the identity resolution and subsequent customer segmentation task by leveraging Pivotal’s Greenplum Database (GPDB), the massively parallel processing (MPP), petabyte-scale analytic data warehouse. GPDB was used to query, cleanse, and join data from across multiple internal and external sources. Each table held billions of records and included data such as membership signups, checkup records, and sales transactions.

Solving the Identity Resolution Problem via Graph Theory

In this customer’s environment, a single, real consumer would appear to have multiple identities across data sources from different LOBs due to differences in collection, process, and storage. For example, a person may have used an email to register over the firm’s web channel to explore firm’s offerings, then later used a phone number over its phone channel to sign up for a new plan. By looking at data collected from different channels, there are clues or pieces of each person’s identity and behavior—often referred to as digital fingerprints. A data science approach on big data can be used to link this kind of data across large, siloed LOB data stores.

To elaborate, consider the following data tables—a highly simplified version of the actual multi-column ones:

  1. Table A contains two columns: “customer_id_1”, which can be thought of as usernames collected through an online signup process, and “join_key” which we have identified as the link to a second table. In our actual engagement with the firm, the join key was ultimately a combination of several fields, but we have simplified this here, for clarity.
  2. Table B also contains two columns: “customer_id_2”, which can also be thought of as usernames, however collected from a different source than those of Table A, and a “join_key” which links records in Table B to records in Table A.

Figure 3. Example tables with customer IDs and join keys

Technically, the problem is figuring out how to group and link these different usernames to a common identity as illustrated in Figure 4. What we have is a graph problem in disguise. More specifically, we need a connected component analysis of an undirected graph, which is a classical graph theory algorithm.

In graph theory, a connected component of an undirected graph G(V, E) is a sub-graph in which any two vertices are connected to each other by paths. For those that are familiar with machine learning, think about it as a clustering algorithm for a graph-topology dataset, rather than for points in Euclidean space.

Figure 4. The connected component example for identity resolution

In graph applications and analytics, the data volume is typically very large, and this makes single-machine execution impossible. In this particular task, there is sensitive, personally identifiable information (PII), and the PII data must not leave the database. Fortunately, we can use GPDB as a big data analytics platform and run Python, Perl, R, or other toolsets in a distributed fashion and within the database itself.

Implementing the Distributed Connected Component Algorithm

There are many ways of finding connected components. We implemented a classical breadth-first search algorithm in a distributed fashion. We approached this problem in our environment using a multi-step process described below.

At a high level, the connected component algorithm in GPDB was approached as follows: (1) Build a data model to accommodate analysis on large graphs that cannot fit into memory on a single node. (2) Implement an in-database version of connected component algorithms that can operate on the table created in step 1. (3) Use the resulting graph to resolve entities and form a single view of identities.

The first task was transforming the multi-column original data table (Figure 3) into a graph table by dropping the join key and removing duplicate records. In this example, we constructed graph table of millions rows (edges) from billions of records. The graph table contains records consisting of two columns representing vertices in the graph that are connected by an edge. For each entry, these vertices represent the cust_ids. Where data exists to support the uniqueness, the cust_ids refer to a single entity, i.e., the same individual captured in two different ways as described earlier. Keep in mind that this is big data, which means the data won’t all fit into one compute node and is distributed as shards across the cluster.

The second task is implementing the algorithm in PostgreSQL and PL/Python in GPDB. Conceptually, the connected component algorithm starts with all vertices in the graph belonging to their own group. These groups are iteratively merged when an edge exists between them, until no edges remain between any two vertices assigned to different groups.

The implementation proceeds as follows:

  1. A table is initialized containing all vertices, each assigned a unique connected component label
  2. At each iteration, we identify the set of all edges that contain vertices with different connected component labels that still need to be grouped. For each such group, we merge all vertices from that particular edge set into a new group , and and assign them a new connected components label which is the representative of that vertex set, e.g. the maximum among all vertices.
  3. Then, we repeat the process until convergence is achieved, and this is defined as an iteration in which no vertices are assigned a new connected components label (therefore no more groups are merged). This monotonic procedure has a theoretical guarantee to reach convergence.

The last task is to map the data in the original table to the resulting connected components labels from the output of the connected components analysis. In this case, the outcome is the unique mappings of the complete set of cust_ids associated with a single, identified customer and a joined set of correlated data.

Achieving One-Minute Performance and Scale

Efficiency was crucial for this task. Our graph dataset had millions of vertices and edges, but our in-database execution took about one minute to compute all the connected components.

Of course, this demonstrates the advantages of analysis, including graph analysis, on an MPP framework like GPDB. There were several key benefits worth pointing out to other data scientists:

  1. We were able to use in-database computing, and the data never left the database so that personally identifiable information was never at risk.
  2. GPDB’s MPP query optimization engine guaranteed scalability. So, we never had to worry about it in our code.
  3. From identification of the problem to full implementation, our work took days, hurdles and all, thanks to the ability to prototype rapidly and directly within the MPP environment.

Figure 5. The MPP architecture of Pivotal Greenplum Database

What’s Next with Identity Resolution and Marketing ROI?

Other variants of connected component implementations are possible. Pivotal Data Scientists are always innovating, and, since the above implementation in 2012, we have created a patent-pending algorithm for optimizing segmentation processes and marketing ROI. As part of the Pivotal Data Science Tools, the new algorithm improves performance by many fold. Such efficiency will prove value in many other applications, and we will soon post another technical blog on this new algorithm.

Learn More:

Jarrod_Headshot About the Author: Jarrod Vawdrey is an innovative data scientist with experience designing, coding, testing and supporting next-generation data mining, modeling, and storage solutions. Specialty interests include distributed computing, machine learning, technologies integration, and information extraction from unstructured data. Jarrod is proficient in java (Hadoop, Mahout), R, SAS (Access, Base, Component Object(s), Enterprise Miner, Graph, Macro, Scoring Accelerator, Share, Stat), SQL (Greenplum, Postgres, PL/Java, PL/pgsql, PL/R)

About the Author


Industry Analyst Insight on How Big Data is Bigger Than Data
Industry Analyst Insight on How Big Data is Bigger Than Data

This week, Gartner analyst Svetlana Sicular named her top 5 big data companies to watch, listing Pivotal as...

Igor Lebovic – Viral Content Coefficient
Igor Lebovic – Viral Content Coefficient

Igor Lebovic from AirPair discusses the “viral content coefficient” and strategies as to how companies can ...

SpringOne 2021

Register Now