The Problem: Many data sources contain free-text fields with misspellings or alternate names for the same entity, which makes them unsuitable for many analytical purposes. These challenges are most often associated with three types of information: personal identities (first, middle, and last name), business names, and address information. For example, in geographic information such as cities, the same city name can appear in many permutations (Aukland, Akland, Auklan, 123 Aukland, Aukland City, etc.).
Matching Algorithms: There are two ways to build a matching algorithm: deterministically, using match codes or hashing so that similar strings get the same or similar codes, or with a probabilistic model. Probabilistic matching is extremely challenging to scale because the number of comparisons grows as O(N²), where N is the number of unique values. Comparing 1,000 unique names with each other using a simple probabilistic model requires 1,000,000 operations.
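To make the quadratic growth concrete, here is a minimal Python sketch of exhaustive pairwise comparison. It is an illustration only: `similarity` uses the standard library's `difflib.SequenceMatcher` as a simple stand-in for a probabilistic model, and the function names are hypothetical. Counting each unordered pair once gives N(N-1)/2 comparisons (499,500 for 1,000 names), which is still quadratic in N.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity score in [0, 1]; a stand-in for a real probabilistic model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_count(n):
    """Number of unordered pairs among n unique values."""
    return n * (n - 1) // 2


def all_pairs_scores(names):
    """Score every unordered pair of names (quadratic in len(names))."""
    return {(a, b): similarity(a, b) for a, b in combinations(names, 2)}


names = ["Aukland", "Akland", "Auklan", "Aukland City"]
scores = all_pairs_scores(names)
for (a, b), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

Four names already need six comparisons; every additional unique value must be compared against all the values before it.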
This scaling challenge is why most legacy data quality vendor solutions use proprietary deterministic algorithms. Deterministic approaches can miss obvious matches, and they throw away valuable information such as how often each spelling occurs. For example, a deterministic algorithm can easily miss that “Otronto” is a mistyping of “Toronto”, even though “Otronto” appears in only 1 record of a large dataset containing thousands of “Toronto”s.
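The "Otronto" case can be reproduced with a small sketch. The `match_code` function below is a simplified Soundex-style key written for illustration; it is an assumption, not any vendor's proprietary algorithm. Because the key keeps the first letter, a transposition at the start of the word produces a different code, while a string similarity score (again `difflib.SequenceMatcher` as a stand-in) still rates the two spellings as close.

```python
from difflib import SequenceMatcher

# Consonant groups used by Soundex-style codes; vowels and h, w, y get no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def match_code(name):
    """Simplified Soundex-style key: first letter plus up to three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        # Append a digit only when it differs from the previous letter's digit.
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]


def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(match_code("Aukland"), match_code("Akland"), match_code("Auklan"))
print(match_code("Toronto"), match_code("Otronto"))
print(round(similarity("Toronto", "Otronto"), 2))
```

The key groups the "Aukland" variants together but separates "Otronto" from "Toronto", which is the kind of miss described above.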
Entity Resolution And Surviving Record Analysis: Once matching entities are identified, a process is needed to select which of the different names within a cluster should be applied to all the matching records. In most data quality solutions, this step is a tedious, unscientific manual process.
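One simple way to automate the selection, sketched here under the assumption that the most frequent spelling in a cluster is the correct one, is to count occurrences and keep the winner. The function name and sample data are hypothetical.

```python
from collections import Counter


def surviving_name(cluster_values):
    """Return the most frequent name among the records of one cluster."""
    counts = Counter(cluster_values)
    # most_common(1) returns [(name, count)] for the top name.
    return counts.most_common(1)[0][0]


# One cluster of matched records: many correct spellings, one typo.
cluster = ["Toronto"] * 5 + ["Otronto"]
winner = surviving_name(cluster)
cleaned = [winner for _ in cluster]
print(winner, len(cleaned))
```

Frequency is only one possible rule; it fails when a misspelling happens to dominate a small cluster.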
The MatchBox Difference
- State-of-the-art optimizations eliminate unnecessary string comparisons. MatchBox optimizations narrow the matching search space to strings that are already somewhat similar, enabling the use of highly sophisticated predictive AI models to perform string comparisons.
- Seamless Apache Spark Integration. MatchBox is the first commercial data quality solution to leverage Spark, the fastest, most scalable analytics technology available. With Spark at the core, users get seamless integration with their Big Data workflows. MatchBox functionality is exposed through a workbench interface, and through PySpark functions.
- Automated Entity Resolution. Most data quality solutions provide users with a list of close matches, but lack automated entity resolution or selection logic. MatchBox leverages the latest advancements in scalable graph capabilities, joining all matches and their connections within a cluster. An algorithm automatically assigns each entity to the most frequently used name, to a name identified in a third-party master list, or to a name chosen using features of the graph itself.
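The ideas in the list above can be sketched end to end in plain Python. This is an illustrative sketch only, not MatchBox's implementation or API: candidate pairs are limited to strings sharing at least two character bigrams (one common way to prune the search space), `difflib.SequenceMatcher` stands in for a predictive model, matches are joined into clusters with a union-find structure, and each cluster is assigned a name from a master list when one is present, otherwise its most frequent name. All function names, thresholds, and data are assumptions.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def candidate_pairs(names, min_shared=2):
    """Pairs of names sharing at least min_shared bigrams (search-space pruning)."""
    index = defaultdict(set)
    for name in names:
        for gram in bigrams(name):
            index[gram].add(name)
    shared = Counter()
    for bucket in index.values():
        for a, b in combinations(sorted(bucket), 2):
            shared[(a, b)] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}


def cluster(names, threshold=0.8):
    """Join matching names into connected components with union-find."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in candidate_pairs(names):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return list(groups.values())


def resolve(records, master_list=(), threshold=0.8):
    """Map every distinct value to the surviving name of its cluster."""
    counts = Counter(records)
    mapping = {}
    for group in cluster(list(counts), threshold):
        in_master = [n for n in group if n in master_list]
        pool = in_master or group  # prefer names found in the master list
        winner = max(pool, key=lambda n: counts[n])
        for n in group:
            mapping[n] = winner
    return mapping


records = ["Toronto"] * 5 + ["Otronto"] + ["Montreal"] * 3 + ["Montrael"]
mapping = resolve(records)
print(mapping)
```

In a Spark setting the same three stages (candidate generation, scoring, connected components) would be distributed across the cluster; the single-machine version here only shows the logic.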