Significant Terms
Discover statistically unusual terms and phrases by comparing search results against background datasets.
- Read more in the Elasticsearch documentation
- Blog post: discovery in OpenAleph
What is significant analysis?
Significant analysis identifies terms that appear more frequently in search results than expected based on the background dataset. This surfaces important concepts that distinguish your results.
Significant terms
Find keywords over-represented in search results.
Basic usage
```shell
# Find significant mentions in names
openaleph-search search query-string "corruption" --args "facet_significant=names"

# Multiple fields
openaleph-search search query-string "investigation" \
  --args "facet_significant=names&facet_significant=countries"
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `facet_significant` | string | - | Field name |
| `facet_significant_size:FIELD` | int | 20 | Number of terms |
| `facet_significant_total:FIELD` | bool | false | Include total count |
| `facet_significant_values:FIELD` | bool | true | Return values |
| `facet_significant_type:FIELD` | string | - | Aggregation type |
Response structure
Significant terms results are wrapped in a sampler aggregation:
```json
{
  "aggregations": {
    "names.significant_sampled": {
      "names.significant_terms": {
        "buckets": [
          {
            "key": "jane doe",
            "doc_count": 45,
            "score": 0.8745,
            "bg_count": 120
          }
        ]
      }
    }
  }
}
```
- `key`: The significant term
- `doc_count`: Frequency in search results
- `bg_count`: Frequency in the background dataset
- `score`: Statistical significance (0-1; higher = more unusual)
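A minimal sketch of pulling these buckets out of a response in Python (the aggregation key nesting follows the `FIELD.significant_sampled` / `FIELD.significant_terms` pattern shown above):

```python
# Extract significant-term buckets from a response dict.
# The field name "names" is the example field from the response above.
def significant_buckets(response: dict, field: str) -> list[dict]:
    sampled = response["aggregations"][f"{field}.significant_sampled"]
    return sampled[f"{field}.significant_terms"]["buckets"]

response = {
    "aggregations": {
        "names.significant_sampled": {
            "names.significant_terms": {
                "buckets": [
                    {"key": "jane doe", "doc_count": 45,
                     "score": 0.8745, "bg_count": 120}
                ]
            }
        }
    }
}

for bucket in significant_buckets(response, "names"):
    print(bucket["key"], bucket["score"])  # term and its significance
```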
Significant text
Extract meaningful phrases from text content.
Basic usage
```shell
# Analyze document content
openaleph-search search query-string "laundering" \
  --args "facet_significant_text=content"

# Custom configuration
openaleph-search search query-string "investigation" \
  --args "facet_significant_text=content&facet_significant_text_size=10"
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `facet_significant_text` | string | content | Text field to analyze |
| `facet_significant_text_size` | int | 5 | Number of phrases |
| `facet_significant_text_min_doc_count` | int | 5 | Minimum document frequency |
| `facet_significant_text_shard_size` | int | 200 | Documents sampled per shard |
Response structure
```json
{
  "aggregations": {
    "significant_text": {
      "significant_text": {
        "buckets": [
          {
            "key": "money laundering",
            "doc_count": 28,
            "score": 0.9156,
            "bg_count": 145
          }
        ]
      }
    }
  }
}
```
Background filtering
Background datasets define the comparison baseline for significance scoring. The background_filter restricts which documents are used to compute background term frequencies.
Dataset-specific
When filtering by datasets or collections, the background is scoped to those same datasets or collections. Note that OpenAleph currently uses a `collection_id` filter with numeric IDs here instead.
Performance
background_filter requires Elasticsearch to compute background term frequencies on-the-fly by intersecting posting lists for each candidate term with the filtered document set. This is significantly slower than the default behavior (pre-computed index-level term statistics). The shard_min_doc_count setting helps by eliminating rare terms before the expensive background lookups happen.
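To make the cost concrete, here is a hand-written sketch of a sampled significant terms aggregation with a background filter, expressed as the Python dict that would be sent to Elasticsearch. The collection IDs are made up, and this is not necessarily the library's exact query shape:

```python
# Sketch of a sampled significant_terms aggregation scoped to two
# collections (IDs 1 and 2 are illustrative). The background_filter
# makes Elasticsearch compute bg_count against only those collections
# instead of using pre-computed index-level term statistics.
agg = {
    "names.significant_sampled": {
        "sampler": {"shard_size": 10000},  # cap foreground docs per shard
        "aggs": {
            "names.significant_terms": {
                "significant_terms": {
                    "field": "names",
                    "size": 20,
                    "min_doc_count": 3,
                    "shard_min_doc_count": 1,
                    "background_filter": {
                        "terms": {"collection_id": [1, 2]}
                    },
                }
            }
        },
    }
}
```

Because `shard_min_doc_count` is applied before the background lookup, terms seen only once on a shard are still forwarded here, but raising it prunes candidates before the posting-list intersections happen.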
No filter
When no collection or dataset filter is applied, background_filter is omitted entirely. Elasticsearch then uses pre-computed index-level term statistics as background, which is fast.
Sampling
Both significant terms and significant text aggregations are wrapped in a sampler to cap the number of foreground documents processed per shard.
Significant terms sampling
Adaptive random sampling (default):
When enabled, a fast `_count` query is performed first to determine the result set size. The sampling probability is then computed to target a fixed foreground size: `probability = min(1.0, target_size / total_hits)`.
For example, with the default target of 10,000 and a query matching 100,000 documents, the probability is set to 0.1 (10%). For queries matching fewer than 10,000 documents, no sampling is applied (probability = 1.0).
Configurable via OPENALEPH_SEARCH_SIGNIFICANT_TERMS_SAMPLER_SIZE (default: 10000).
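The rule above can be sketched as:

```python
# Sampling probability targeting a fixed foreground size, as described
# above: small result sets are not sampled at all.
def sampling_probability(total_hits: int, target: int = 10_000) -> float:
    if total_hits <= target:
        return 1.0  # fewer matches than the target: keep everything
    return target / total_hits

print(sampling_probability(100_000))  # 0.1
print(sampling_probability(5_000))    # 1.0
```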
Top-N sampling (fallback, set OPENALEPH_SEARCH_SIGNIFICANT_TERMS_RANDOM_SAMPLER=false):
When random sampling is disabled, the aggregation falls back to top-N sampling, which takes the highest-scoring documents per shard. This biases results toward query-relevant documents and may produce different significance scores.
- With a collection/dataset filter: `sampler` with configurable `shard_size`
- Without a filter: `diversified_sampler` across datasets
Significant text sampling
Uses diversified_sampler (no collection/dataset filter) or sampler (with filter) to cap foreground documents for significant text analysis.
Configurable via OPENALEPH_SEARCH_SIGNIFICANT_TEXT_SAMPLER_SIZE (default: 200).
Shard size calculation
The significant terms aggregation also has an internal `shard_size`: the number of candidate terms collected per shard, separate from the sampler's document limit. It is derived from the requested number of terms; larger values improve accuracy at the cost of more per-shard work.
Minimum document counts
Two thresholds control which terms are evaluated:
- `min_doc_count` (default: 3): Minimum total foreground frequency across all shards for a term to appear in final results. Applied after merging.
- `shard_min_doc_count` (default: 1): Minimum foreground frequency on a single shard before the expensive background frequency lookup happens. Should be lower than `min_doc_count`, since a term's documents are distributed across shards.
Configurable via OPENALEPH_SEARCH_SIGNIFICANT_TERMS_MIN_DOC_COUNT and OPENALEPH_SEARCH_SIGNIFICANT_TERMS_SHARD_MIN_DOC_COUNT.
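A toy illustration (not Elasticsearch code) of why the shard threshold must sit below the global one:

```python
# A term appearing once on each of three shards. With the defaults,
# every shard forwards it (count >= shard_min_doc_count = 1) and it
# survives the global threshold after merging (3 >= min_doc_count = 3).
min_doc_count = 3
shard_min_doc_count = 1
per_shard_counts = [1, 1, 1]

forwarded = [c for c in per_shard_counts if c >= shard_min_doc_count]
print(sum(forwarded) >= min_doc_count)  # True

# If shard_min_doc_count were raised to match min_doc_count, no single
# shard would forward the term, and it would be lost before merging.
forwarded_strict = [c for c in per_shard_counts if c >= min_doc_count]
print(sum(forwarded_strict) >= min_doc_count)  # False
```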
Interpretation
Significance scores
Score ranges:
- 0.9+: Highly significant
- 0.7-0.9: Moderately significant
- 0.5-0.7: Somewhat significant
- <0.5: Low significance
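These bands are interpretation heuristics rather than Elasticsearch semantics; as a sketch, labeling a score by the ranges above might look like:

```python
# Map a significance score to the guideline bands listed above.
def significance_label(score: float) -> str:
    if score >= 0.9:
        return "highly significant"
    if score >= 0.7:
        return "moderately significant"
    if score >= 0.5:
        return "somewhat significant"
    return "low significance"

print(significance_label(0.8745))  # moderately significant
```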
Document counts
Consider both absolute and relative frequencies:
- High `doc_count` + high `score`: important, frequent term
- Low `doc_count` + high `score`: rare but highly relevant
- High `doc_count` + low `score`: common but not distinctive
Background comparison
Compare `doc_count` vs `bg_count`:
- `doc_count` >> `bg_count` (relative to the foreground and background sizes): over-represented
- `doc_count` ≈ `bg_count`: normal frequency
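Since the foreground and background sets differ in size, the raw counts need normalizing before comparison. A minimal sketch, with hypothetical foreground and background set sizes:

```python
# Ratio of foreground rate to background rate, written in
# cross-multiplied form of (doc_count/fg_size) / (bg_count/bg_size).
def over_representation(doc_count, fg_size, bg_count, bg_size):
    return (doc_count * bg_size) / (fg_size * bg_count)

# "jane doe": 45 of 1,000 result docs vs 120 of 1,000,000 background
# docs (set sizes are made up for illustration).
ratio = over_representation(45, 1_000, 120, 1_000_000)
print(ratio)  # 375.0: heavily over-represented
```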
Configuration
Minimum document count
```shell
openaleph-search search query-string "report" \
  --args "facet_significant_text=content&facet_significant_text_min_doc_count=10"
```
Higher values focus on more frequent terms.
Shard size tuning
```shell
openaleph-search search query-string "evidence" \
  --args "facet_significant_text=content&facet_significant_text_shard_size=500"
```
Larger sizes improve accuracy but increase query time.
Duplicate filtering
Significant text automatically filters duplicate content to prevent skewing results.
Examples
Investigative journalism
```shell
# Key terms in corruption investigation
openaleph-search search query-string "minister" \
  --args "filter:countries=us&facet_significant=names&facet_significant_text=content"
```
Document classification
```shell
# Discover PDF document themes
openaleph-search search query-string "*" \
  --args "filter:schema=Document&filter:properties.mimeType=application/pdf&facet_significant_text=content&facet_significant_text_size=15"
```
Entity analysis
```shell
# Significant company attributes
openaleph-search search query-string "*" \
  --args "filter:schema=Company&facet_significant=countries&facet_significant=properties.sector"
```
Temporal analysis
```shell
# Significant terms in time period
openaleph-search search query-string "event" \
  --args "filter:gte:dates=2016-01-01&filter:lt:dates=2017-01-01&facet_significant=names&facet_significant_text=content"
```
Performance
Sampling efficiency
All significant aggregations are wrapped in a sampler to limit the foreground document set:
- Significant terms: default 10000 documents per shard
- Significant text: default 200 documents per shard
- Diversified sampling across datasets when no collection/dataset filter is applied
- Configurable via settings
Background filter cost
The background_filter is the main performance bottleneck. For each candidate term in the foreground, Elasticsearch must intersect its posting list with the background filter's document set. Mitigation strategies:
- `shard_min_doc_count`: Eliminates rare terms before background lookups (default: 1)
- Smaller sampler size: Fewer foreground documents means fewer candidate terms
- Filter caching: Elasticsearch caches `collection_id` filters as bitsets; repeated queries with the same collection set benefit from cache hits
Query optimization
Monitor performance through:
- Query `took` times in responses
- Elasticsearch slow query logs
- Resource usage during aggregations
- Filter cache hit rates: `GET /_nodes/stats/indices/query_cache`
Adjust sampler sizes and minimum document counts based on dataset characteristics.
Error handling
No results
Empty analysis returns appropriate structure:
```json
{
  "aggregations": {
    "names.significant_sampled": {
      "names.significant_terms": {
        "buckets": []
      }
    }
  }
}
```
Field validation
- Non-existent fields return empty results
- Text analysis requires analyzable text fields (usually the full text in the `content` field)
- Terms analysis requires keyword fields
Edge cases
- Very small datasets may not provide meaningful significance
- Single-document results cannot generate scores
- Cross-shard coordination ensures consistent sampling