Mentions
Find documents that mention a given named entity – or a filtered set of named entities – by phrase-matching name variants against the documents bucket.
Reverse percolation
Where Percolation runs one document against many stored queries ("which of my entities are mentioned in this text?"), the mentions queries run the inverse: one or many entities against many stored documents ("which of my documents mention this entity?"). Same signal – names as phrases in fulltext – applied from the other direction, with no precomputed stored queries involved.
Two flavours, sharing a common mention clause:
MentionsQuery(parser, entity_id)– single entity. Loads the entity, pulls its matchable names, runs the mention clause against the Document hierarchy.MultiMentionsQuery(parser, source_parser)– filtered set. Scrolls a second parser over any entity population (one or several collections, narrowed by property filters), collects every matching entity's names, and runs the same mention clause over the union. Capped at 10k names – see Multi-entity variant below.
Both return entities from the Document hierarchy (Document, PlainText, HyperText, Pages, Page, …) whose indexed text contains one of those names as a phrase.
How it works
At query time, MentionsQuery loads the target entity via get_entity and extracts its matchable names (registry.name, matchable=True – typically name, alias, previousName …). It then builds the same shape of search query a user would write against the documents bucket, with a mention_clause ANDed into the bool body.
The mention clause is a bool.should with minimum_should_match: 1 combining:
- Fulltext phrase match – one
match_phraseclause per matchable name onField.CONTENT, withslop: 2. Slop 2 matches the percolator's ingest clauses, tolerating inserted middle initials ("Jane Doe"matches"Jane A. Doe") and reversed last-name-first variants ("Doe, Jane").Field.TEXT(the per-property text group) is deliberately excluded –contentalready aggregates every text-type value, and matching both fields double-counts the same mention. - Structured-name bonus – a
termsclause overField.NAMES(keyword) across all matchable name variants, boosted to2.0. This catches documents that carry the entity's name as an extracted property value (namesgroup) rather than only as free text. - Optional synonym expansion – when
parser.synonyms=true, two extra clauses are added viaExpandNameSynonymsMixin(shared withEntitiesQuery): atermsclause overField.NAME_SYMBOLSfrom the entity's NAME-category symbols (boost0.5), and atermsclause overField.NAME_KEYSfromindex_name_keysof the entity's names (boost0.3). Mirrors the user-text synonyms path inEntitiesQuery.get_text_query– same fields, same boosts – but derives keys directly from the entity's discrete names rather than n-gramming user query tokens.
The mention clause is ANDed (bool.must) with the rest of the query built by EntitiesQuery – parser.text, filters, negative filters, auth – so parser.text narrows mention hits instead of replacing them.
Default schema scope
Default schemata is ["Document"] (not ["Thing"] as in EntitiesQuery), which covers Document and all its descendants. The index resolver uses the Document-hierarchy buckets (documents, pages, page) accordingly. A caller-supplied filter:schema or filter:schemata overrides it – e.g. filter:schema=Pages to scope to page entities only.
Sort and highlights
- Sort. When no explicit
sortis passed,MentionsQueryforces_scorerather than falling back to_doc. The mention clause carries all the scoring signal even whenparser.textis empty, so the base "empty query → no ranking" heuristic would hide the best hits. An explicitsortfrom the parser still wins. - Highlights. With
highlight=true, thecontenthighlight config has itshighlight_queryreplaced with the same phrase shoulds used by the mention clause (merged with any filter-value clauses the base class added).textandtranslationhighlight configs are left untouched so anyparser.texthighlighting the base helper built there survives – the mention clause only matchescontent, so injecting the phrase shoulds intotext/translationwouldn't produce highlights anyway.
Querying – MentionsQuery
Via Python:
from openaleph_search.parse.parser import SearchQueryParser
from openaleph_search.query.mentions import MentionsQuery
parser = SearchQueryParser([("highlight", "true"),
("limit", "50")])
query = MentionsQuery(parser, entity_id="person-abc")
result = query.search()
for hit in result["hits"]["hits"]:
doc = hit["_source"]
print(doc["caption"], "→", hit.get("highlight", {}).get("content", []))
Or via CLI:
openaleph-search search mentions person-abc \
--args "filter:dataset=news_archive&highlight=true&limit=50"
Standard parser knobs flow through automatically:
filter:*– applied as filters on the document search (filter:dataset,filter:countries,filter:schema=Pages, …).q=…– free-text narrowing that ANDs with the mention requirement (e.g. "documents that mention this person and contain the word 'invoice'").synonyms=true– opt-in symbol / name-key expansion of the entity's names (see above).highlight=true+highlight_count=N– fragment snippets with<em>…</em>markup around matched phrases, same shape asEntitiesQueryhighlights.limit/offset– pagination.sort– overrides the default_scoresort.dehydrate=true– strips bulkypropertiesfrom the response.auth– same auth filters as any other entity query.
Errors
MentionsQuery(parser, entity_id=...) raises ValueError if:
entity_idis falsy.- The entity is not found in the index.
- The entity has no matchable names (
registry.name,matchable=True). A Document or a nameless schema cannot be the subject of a mentions search – there is nothing to phrase-match on.
Multi-entity variant – MultiMentionsQuery
MultiMentionsQuery generalizes the same mention clause from one entity to a filtered set of entities. Instead of a single entity_id, it takes a second SearchQueryParser – the source parser – that narrows the entity population by property filters (country, topic, schema, dataset…). The names from every matching entity are scrolled, deduped, and fed into the same mention-clause shape the single-entity path uses.
Use this when you want to screen one or many collections (a watchlist, a sanctions list, a PEP roster) against a target document set in a single call.
Shape of the call
from openaleph_search.parse.parser import SearchQueryParser
from openaleph_search.query.mentions import MultiMentionsQuery
# Source: any watchlist Person on a sanctions topic, across multiple collections.
source = SearchQueryParser([
("filter:schema", "Person"),
("filter:dataset", "watchlist_a"),
("filter:dataset", "watchlist_b"),
("filter:topics", "sanction"),
])
# Target: news documents, ranked + highlighted.
target = SearchQueryParser([
("filter:dataset", "news_archive"),
("highlight", "true"),
("limit", "50"),
])
query = MultiMentionsQuery(target, source)
result = query.search()
Also exposed via the CLI:
openaleph-search search multi-mentions \
--source-args "filter:schema=Person&filter:dataset=watchlist_a&filter:topics=sanction" \
--args "filter:dataset=news_archive&highlight=true&limit=50"
The source parser can span multiple datasets at once (repeat filter:dataset=...) – it's a plain EntitiesQuery on the source side. Any filter valid on EntitiesQuery works.
What it shares with MentionsQuery
The match surface, ranking, and highlight behaviour are identical:
- Same
match_phraseclauses onField.CONTENT(slop 2). - Same structured-name bonus on
Field.NAMES(boost 2.0). - Same optional
synonyms=trueexpansion (Field.NAME_SYMBOLS+Field.NAME_KEYS). - Same default schema scope (
["Document"]), overridable viafilter:schema. - Same
_scoredefault sort and samecontenthighlight injection.
A target document that mentions more of the filtered entities' names ranks higher (BM25 summing over the matched phrase clauses), so the natural answer to "which news doc is most relevant to my watchlist?" falls out of the default ordering.
What differs
- Name source. Stage 1 scrolls the source parser and extracts values from a small subset of name-typed properties (
SOURCE_NAME_PROPSinquery/mentions.py, currently("name",)– primary names only). Aliases / previousNames are intentionally excluded at the source side to keep recall deliberate; they're available on the matchable-names side viaMentionsQueryon a single entity. - Scale budget.
MAX_SOURCE_NAMES = 10_000. Two guards:- On the first response,
track_total_hits: max_names + 1short-circuits if the source filter matches more entities than the budget can ever accommodate. - During the scroll, a per-hit check on the deduped name set catches the case where multi-valued
namearrays push name count past entity count. Either trip raisesValueErrorwith a message pointing at the source filter.
- On the first response,
- Response shape. The response is a flat ranked list of matching documents. It does not say which of the filtered entities each document mentioned – for that attribution, use the Percolation path with its
percolator_matchsignal on each hit.
Limits specific to the multi variant
- 10k names cap. Sources that exceed the cap raise before running Stage 2. Narrow the source filter.
- No per-document attribution. As noted above.
- Stage 1 cost. One
es.countrequest upfront (shard-level; near-free) followed by one scroll session (with continuations if the filter is large). The count fail-fasts when the source filter is grossly oversized without touching the scroll at all.
Limits and trade-offs
Recall is bounded by the entity's stored names
The mention clause is built from the entity's matchable name properties only. Variants the entity doesn't know about won't match a document: a PDF that says "Müller" against an entity stored only as "Mueller" will not fire unless synonyms=true lifts them into the same name-symbol / name-key bucket.
If recall matters, either enrich the entity with aliases / previousName values, or opt in to synonyms=true.
Phrase matching tolerates small slop
match_phrase clauses use slop: 2, matching the percolator's ingest side. That tolerates inserted middle initials ("Jane Doe" matches "Jane A. Doe") and reversed last-name-first variants ("Doe, Jane" / "Doe Jane"). It does not tolerate large token gaps or out-of-order rearrangements beyond two positions – for those, store explicit alias variants or opt in to synonyms=true.
No identifier signal
The percolator also fires on exact identifier matches (IMO, VAT, registration numbers, …). MentionsQuery does not – it is name-only. If you need identifier-based document discovery, run a plain EntitiesQuery with the identifier value as free text, or build a dedicated query against Field.IDENTIFIERS.
Document hierarchy only
Results are scoped to the Document bucket family by default. Things (Person, Company, …) and Intervals (Ownership, Sanction, …) are not searched – they have no meaningful fulltext for a mention clause to fire against. The subject entity itself, which lives in the things bucket, is naturally absent from results.
Scaling beyond one entity
MentionsQuery takes a single entity_id. To screen a filtered population of entities in one call, use MultiMentionsQuery (up to 10k names). To screen a single text against many stored entities, use Percolation instead.