Significant Terms
Discover statistically unusual terms and phrases by comparing search results against background datasets.
- Read more in the Elasticsearch documentation
- Blog post for discovery in OpenAleph
What is significant analysis?
Significant analysis identifies terms that appear more frequently in search results than expected based on the background dataset. This surfaces important concepts that distinguish your results.
Significant terms
Find keywords over-represented in search results.
Basic usage
# Find significant mentions in names
openaleph-search search query-string "corruption" --args "facet_significant=names"
# Multiple fields
openaleph-search search query-string "investigation" \
--args "facet_significant=names&facet_significant=countries"
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `facet_significant` | string | - | Field name |
| `facet_significant_size:FIELD` | int | 20 | Number of terms |
| `facet_significant_total:FIELD` | bool | false | Include total count |
| `facet_significant_values:FIELD` | bool | true | Return values |
| `facet_significant_type:FIELD` | string | - | Aggregation type |
Response structure
{
"aggregations": {
"names.significant_terms": {
"buckets": [
{
"key": "jane doe",
"doc_count": 45,
"score": 0.8745,
"bg_count": 120
}
]
}
}
}
- `score`: Statistical significance (0-1, higher = more unusual)
- `doc_count`: Frequency in the search results
- `bg_count`: Frequency in the background dataset
- `key`: The significant term
Significant text
Extract meaningful phrases from text content.
Basic usage
# Analyze document content
openaleph-search search query-string "laundering" \
--args "facet_significant_text=content"
# Custom configuration
openaleph-search search query-string "investigation" \
--args "facet_significant_text=content&facet_significant_text_size=10"
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `facet_significant_text` | string | content | Text field to analyze |
| `facet_significant_text_size` | int | 5 | Number of phrases |
| `facet_significant_text_min_doc_count` | int | 5 | Minimum document frequency |
| `facet_significant_text_shard_size` | int | 200 | Documents sampled per shard |
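These parameters map roughly onto an Elasticsearch significant_text aggregation. A minimal illustrative sketch using the defaults above (the actual request is wrapped in a sampling aggregation, see Sampling below, and may differ in detail):

{
  "aggs": {
    "significant_text": {
      "significant_text": {
        "field": "content",
        "size": 5,
        "min_doc_count": 5
      }
    }
  }
}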
Response structure
{
"aggregations": {
"significant_text": {
"significant_text": {
"buckets": [
{
"key": "money laundering",
"doc_count": 28,
"score": 0.9156,
"bg_count": 145
}
]
}
}
}
}
Background filtering
Background datasets define the comparison baseline:
Dataset-specific
When the search is filtered to specific datasets, those datasets form the comparison background.
OpenAleph
Currently, OpenAleph uses a collection_id filter with numeric collection IDs here instead.
Full dataset
Without dataset filters, the entire set of accessible datasets is used as the background.
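In Elasticsearch terms, restricting the baseline is done with a background_filter on the significant terms aggregation. An illustrative sketch, assuming a dataset keyword field holding dataset identifiers (dataset_a and dataset_b are placeholders; OpenAleph would filter on collection_id with numeric IDs instead, as noted above):

{
  "aggs": {
    "names.significant_terms": {
      "significant_terms": {
        "field": "names",
        "background_filter": {
          "terms": { "dataset": ["dataset_a", "dataset_b"] }
        }
      }
    }
  }
}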
Sampling
Significant text sampling
Diversified sampling (default): samples across datasets for representative analysis.
Regular sampling (when filtering by dataset): samples the matching documents without diversification. A sketch of both variants follows below.
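A minimal sketch of the two sampling strategies as Elasticsearch aggregations, assuming a dataset field is used for diversification (field names and sizes here are illustrative, not the exact request openaleph-search builds):

{
  "aggs": {
    "significant_text": {
      "diversified_sampler": { "field": "dataset", "shard_size": 200 },
      "aggs": {
        "significant_text": {
          "significant_text": { "field": "content", "size": 5 }
        }
      }
    }
  }
}

When results are already filtered to a single dataset, a plain sampler with the same shard_size takes the place of the diversified_sampler.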
Shard size calculation
For significant terms, a shard size is calculated that ensures sufficient sampling for statistical significance.
Interpretation
Significance scores
Score ranges:
- 0.9+: Highly significant
- 0.7-0.9: Moderately significant
- 0.5-0.7: Somewhat significant
- <0.5: Low significance
Document counts
Consider both absolute and relative frequencies:
- High `doc_count` + high `score` = important, frequent term
- Low `doc_count` + high `score` = rare but highly relevant
- High `doc_count` + low `score` = common but not distinctive
Background comparison
Compare `doc_count` against `bg_count`:
- `doc_count` much higher than `bg_count` (relative to the result set and background sizes) = over-represented
- `doc_count` ≈ `bg_count` (relatively) = normal frequency

For example, a term that appears in 10% of the search results but only 0.1% of the background is strongly over-represented.
Configuration
Minimum document count
openaleph-search search query-string "report" \
--args "facet_significant_text=content&facet_significant_text_min_doc_count=10"
Higher values focus on more frequent terms.
Shard size tuning
openaleph-search search query-string "evidence" \
--args "facet_significant_text=content&facet_significant_text_shard_size=500"
Larger sizes improve accuracy but increase query time.
Duplicate filtering
Significant text automatically filters duplicate content to prevent skewing results.
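This typically corresponds to the filter_duplicate_text option of the Elasticsearch significant_text aggregation; an illustrative fragment:

{
  "significant_text": {
    "field": "content",
    "filter_duplicate_text": true
  }
}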
Examples
Investigative journalism
# Key terms in corruption investigation
openaleph-search search query-string "minister" \
--args "filter:countries=us&facet_significant=names&facet_significant_text=content"
Document classification
# Discover PDF document themes
openaleph-search search query-string "*" \
--args "filter:schema=Document&filter:properties.mimeType=application/pdf&facet_significant_text=content&facet_significant_text_size=15"
Entity analysis
# Significant company attributes
openaleph-search search query-string "*" \
--args "filter:schema=Company&facet_significant=countries&facet_significant=properties.sector"
Temporal analysis
# Significant terms in time period
openaleph-search search query-string "event" \
--args "filter:gte:dates=2016-01-01&filter:lt:dates=2017-01-01&facet_significant=names&facet_significant_text=content"
Performance
Sampling efficiency
Sampling balances performance and accuracy:
- Default 200 documents per shard
- Diversified sampling across datasets
- Configurable via shard_size parameters
Query optimization
Monitor performance through:
- Query `took` times in responses
- Elasticsearch slow query logs
- Resource usage during aggregations
Adjust shard sizes and minimum document counts based on dataset characteristics.
Error handling
No results
When the analysis yields no significant terms, the response still contains the aggregation structure with an empty bucket list.
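An illustrative empty result, matching the response shapes shown above:

{
  "aggregations": {
    "names.significant_terms": {
      "buckets": []
    }
  }
}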
Field validation
- Non-existent fields return empty results
- Text analysis requires analyzable text fields (usually the full text in the `content` field)
- Terms analysis requires keyword fields
Edge cases
- Very small datasets may not provide meaningful significance
- Single-document results cannot generate scores
- Cross-shard coordination ensures consistent sampling