Entity Matching

Find similar entities across datasets to identify duplicates and related records. This approach is inspired by, and partly adapted from, the great open source work by OpenSanctions.

Read more in this OpenSanctions blog post

Blog post for OpenAleph

How it works

Entity matching compares multiple signals:

  1. Names (with normalization and phonetic encoding)
  2. Identifiers (registration numbers, tax IDs, etc.)
  3. Properties (email, phone, address, etc.)

The index stores multiple name representations to catch variations:

  • Normalized keywords (names)
  • More heavily normalized name keys (name_keys)
  • Name symbols (cross-language and cross-alphabet matching) (name_symbols)
  • Phonetic codes (sound-alike matching) (name_phonetics)
  • Name parts (partial matching) (name_parts)

Name matching strategies

1. Normalized keywords

Names are normalized and then matched as exact keywords.

Example:

"John Smith & Associates Ltd." → "john smith associates ltd"

Normalization steps:

  • Lowercase conversion
  • Special character removal
  • Whitespace collapsing
  • Diacritic folding

Exact name matches (with order preserved) receive the highest boost.
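
A minimal sketch of this kind of normalization (not the exact implementation used by openaleph_search, which relies on shared normalization libraries, and simplified to Latin-script input):

import re
import unicodedata

def normalize_name(name: str) -> str:
    # Fold diacritics: decompose, then drop combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, strip special characters, collapse whitespace
    cleaned = re.sub(r"[^a-z0-9\s]", " ", folded.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

print(normalize_name("John Smith & Associates Ltd."))  # "john smith associates ltd"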

2. Name symbols

Cross-language and cross-alphabet matching via symbolic representations. This can be thought of as a synonym search, but more precise and context-specific than a global synonyms file.

This uses rigour.names. The example symbol below is derived from the Wikidata entry for the given name Vladimir.

The extracted symbols are indexed in the name_symbols keyword field.

Example:

"Vladimir Putin" → [NAME:47200243]
"Владимир Путин" → [NAME:47200243]

Same symbol = same entity name (part) across languages.
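
A simplified sketch of why this works at query time. The real symbol extraction lives in rigour.names and is backed by Wikidata data; the lookup table and helper below are hypothetical stand-ins:

# Hypothetical mapping from normalized name parts to symbol IDs.
# In practice this data comes from rigour.names / Wikidata.
SYMBOL_TABLE = {
    "vladimir": "[NAME:47200243]",
    "владимир": "[NAME:47200243]",
}

def name_symbols(name: str) -> set[str]:
    return {SYMBOL_TABLE[part] for part in name.lower().split() if part in SYMBOL_TABLE}

indexed = name_symbols("Владимир Путин")  # what gets stored in name_symbols
queried = name_symbols("Vladimir Putin")  # what the match query searches for
print(indexed & queried)                  # {'[NAME:47200243]'} -> overlap, so the names match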

3. Phonetic encoding

Sound-alike matching using the Double Metaphone algorithm.

The phonetic representations are indexed in the name_phonetics keyword field.

Example:

"Smith" → "SM0"
"Smythe" → "SM0"

Catches alternate spellings and transcription variations.
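
A quick illustration using the third-party metaphone package (one of several Double Metaphone implementations; openaleph_search may use a different one internally):

from metaphone import doublemetaphone

# doublemetaphone returns a (primary, alternate) code pair
print(doublemetaphone("Smith"))   # primary code 'SM0'
print(doublemetaphone("Smythe"))  # primary code 'SM0'
# Both names share the primary code "SM0", so they land in the same phonetic bucket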

4. Name parts

Individual name components for partial matching.

Index field: name_parts (keyword)

Example:

"John Smith & Associates" → ["john", "smith", "associates"]

Matches entities sharing name components.
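
A minimal sketch of how name parts could be produced from the normalized form (the library's own tokenization is schema-aware and more involved):

import re

def name_parts(name: str) -> list[str]:
    # Lowercase, drop punctuation, split on whitespace (see normalization above)
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return cleaned.split()

print(name_parts("John Smith & Associates"))  # ['john', 'smith', 'associates']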

5. Name keys

Sorted token concatenation for order-independent matching.

Index field: name_keys (keyword)

Example:

"John A. Smith Jr." → "jjrsmith"
"Smith John Jr. A." → "jjrsmith"

Matches names containing the same tokens regardless of order.
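
A naive sketch of order-independent keys. The real name_keys field applies heavier normalization before building the key (which is why the documented example above is shorter than this plain token sort would produce):

import re

def name_key(name: str) -> str:
    # Sort normalized tokens, then concatenate, so word order no longer matters
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return "".join(sorted(tokens))

print(name_key("John Smith"))  # "johnsmith"
print(name_key("Smith John"))  # "johnsmith"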

Identifier matching

Exact matching on unique identifiers:

  • Registration numbers
  • Tax IDs
  • Passport numbers
  • License numbers
  • Other unique codes

Identifiers have high matching weight (boost: 3.0).
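
A sketch of the resulting clause for a single identifier value (the property path shown is an example; the concrete property depends on the entity's schema):

identifier_clause = {
    "term": {
        "properties.registrationNumber": {
            "value": "ABC123",
            "boost": 3.0,
        }
    }
}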

Property matching

Additional signals from entity properties:

High-value properties (boost: 2.0)

  • IP addresses
  • URLs
  • Email addresses
  • Phone numbers

General properties

All other properties contribute to similarity score without boosting.

Properties are sorted by specificity: more unique values score higher.
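
For illustration, this is roughly how the property clauses could look once assembled into the query (field names such as emails and phones are examples, not a complete list):

should_clauses = [
    # High-value properties carry a boost of 2.0
    {"term": {"emails": {"value": "john@example.com", "boost": 2.0}}},
    {"term": {"phones": {"value": "+12025550123", "boost": 2.0}}},
    # General properties contribute without a boost
    {"term": {"countries": "us"}},
]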

Schema compatibility

Matching respects entity type compatibility:

  • Person matches Person and LegalEntity
  • Company matches Company, Organization, and LegalEntity
  • Some other entity schemata like Document are not matchable

Only compatible schema types can match each other.
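
A simplified sketch of the compatibility check. The real implementation derives this from the followthemoney schema model rather than a hard-coded table:

# Hypothetical, hard-coded view of matchable schema combinations
MATCHABLE = {
    "Person": {"Person", "LegalEntity"},
    "Company": {"Company", "Organization", "LegalEntity"},
}

def can_match(schema_a: str, schema_b: str) -> bool:
    # Document and other non-matchable schemata never appear in the table
    return schema_b in MATCHABLE.get(schema_a, set())

print(can_match("Person", "LegalEntity"))  # True
print(can_match("Person", "Company"))      # False
print(can_match("Document", "Document"))   # False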

Scoring

Match scores combine multiple factors:

Signal                          Boost  Index field
Names (exact, order preserved)  5.0    names
Name keys (order-independent)   3.0    name_keys
Identifiers                     3.0    properties.* (group type "identifier")
High-value properties           2.0    properties.* (ip, url, email, phone)
Name parts                      1.0    name_parts
Other properties                1.0    properties.*
Phonetic codes                  0.8    name_phonetics
Name symbols                    0.8    name_symbols

Higher boost = more important for matching.

Performance limits

To prevent query explosion:

  • Maximum 500 query clauses
  • Maximum 5 names used per entity
  • Names selected by diversity (Levenshtein distance)

Entities with many aliases use representative names only.

Name selection

See openaleph_search.query.matching:pick_names

For entities with many aliases, the system selects representative names:

  1. Pick centroid name (most representative)
  2. Pick most dissimilar names using Levenshtein distance
  3. Use up to 5 names total

This prevents performance issues while maintaining matching quality.
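
A rough sketch of this selection strategy (not the actual pick_names implementation; the Levenshtein function is a plain dynamic-programming version to keep the example self-contained):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def pick_names(names: list[str], limit: int = 5) -> list[str]:
    if len(names) <= limit:
        return names
    # 1. Centroid: the name closest to all others on average
    centroid = min(names, key=lambda n: sum(levenshtein(n, o) for o in names))
    picked = [centroid]
    # 2. Greedily add the name most dissimilar to everything picked so far
    while len(picked) < limit:
        candidate = max(
            (n for n in names if n not in picked),
            key=lambda n: min(levenshtein(n, p) for p in picked),
        )
        picked.append(candidate)
    return picked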

Query structure

A match query combines multiple strategies:

{
  "bool": {
    "must": [
      {
        "bool": {
          "should": [
            // Name matching clauses (using terms queries for efficiency)
            {"terms": {"names": ["john smith"], "boost": 5.0}},
            {"terms": {"name_keys": ["johnsmith"], "boost": 3.0}},
            {"terms_set": {"name_parts": {"terms": ["john", "smith"], "minimum_should_match_script": {...}}}},
            {"terms_set": {"name_phonetic": {"terms": ["JN", "SM0"], "minimum_should_match_script": {...}}}},
            {"terms_set": {"name_symbols": {"terms": ["[NAME:12345]"], "minimum_should_match_script": {...}}}}
          ],
          "minimum_should_match": 1
        }
      },
      {
        "bool": {
          "should": [
            // Identifier matching
            {"term": {"properties.registrationNumber": "ABC123"}}
          ],
          "minimum_should_match": 0
        }
      }
    ],
    "should": [
      // Property scoring
      {"term": {"emails": "john@example.com"}},
      {"term": {"countries": "us"}}
    ]
  }
}

For name_parts, phonetics, and symbols, terms_set queries require at least 2 matching terms to reduce false positives.
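
As an illustration, such a terms_set clause could look like the following; the exact Painless script used by openaleph_search may differ:

clause = {
    "terms_set": {
        "name_parts": {
            "terms": ["john", "smith", "associates"],
            # Require at least 2 overlapping terms, but never more than were provided
            "minimum_should_match_script": {
                "source": "Math.min(params.num_terms, 2)"
            },
        }
    }
}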

Optimization tips

For better matching

  • Include multiple name variants when available
  • Provide identifiers (registration numbers, tax IDs)
  • Add email, phone, address properties
  • Specify country/jurisdiction

For performance

  • Filter by dataset to reduce search space
  • Filter by schema to search specific entity types
  • Use specific identifiers to narrow results

Name processing pipeline

Names go through multiple processing stages:

  • Unicode normalization (NFC), lowercase (if latinizable)
  • Schema-specific tokenization
  • Token sorting (for name keys)
  • Phonetic encoding (for phonetic field)
  • Symbol generation (for cross-language and cross-alphabet)

Each stage creates different search representations optimized for specific matching scenarios.