Table of Contents
- Understanding Full-Text Search Basics
- PostgreSQL Full-Text Search Components
- Creating and Indexing Full-Text Search
- Advanced Full-Text Search Techniques
- Performance Optimization
- Real-World Example: Blog Post Search
- Limitations and When to Use Alternatives
- Conclusion
- References
1. Understanding Full-Text Search Basics
Before diving into PostgreSQL-specific features, let’s clarify what full-text search (FTS) is and how it differs from simple pattern matching:
-
Simple Pattern Matching (LIKE/ILIKE): Matches substrings exactly (e.g.,
WHERE body LIKE '%database%'). It does not account for word variations (e.g., “databases” or “databasing”), ignores context, and is slow on large datasets. -
Full-Text Search: Analyzes text to extract meaningful “lexemes” (root words), filters out trivial “stop words” (e.g., “the,” “and”), and allows querying based on these lexemes. It supports relevance ranking, proximity searches, and natural language queries.
For example, a full-text search for “running” would match “run,” “ran,” or “runner” (via stemming) and rank results by how relevant they are to the query.
2. PostgreSQL Full-Text Search Components
PostgreSQL FTS relies on two core data types and several functions to process and query text:
tsvector: The Searchable Vector
A tsvector (text search vector) is a sorted list of distinct lexemes (stemmed words) with their positions in the original text. It is the “searchable” representation of your text data.
Example:
Convert a sentence to a tsvector using to_tsvector:
SELECT to_tsvector('english', 'PostgreSQL is a powerful database. It supports full-text search!');
Output:
'databas':5 'full':9 'power':4 'postgresql':1 'search':10 'support':8 'text':10
Here, 'databas':5 means the lexeme “databas” (stemmed from “database”) appears at position 5.
tsquery: The Search Query
A tsquery (text search query) represents a search pattern using lexemes and operators. It is used to match against tsvector values.
Example:
Convert a query string to a tsquery using to_tsquery:
SELECT to_tsquery('english', 'power & search'); -- Match "power" AND "search"
Output:
'power' & 'search'
Operators supported in tsquery:
&: AND (both lexemes must match)|: OR (either lexeme matches)!: NOT (exclude lexeme)<->: Proximity (lexemes within a certain distance, e.g.,'run' <-> 'fast'for “run fast”)
Text Search Configurations
PostgreSQL uses “text search configurations” to define how text is processed (tokenization, stemming, stop words). The default is english, but configurations exist for 80+ languages (e.g., french, german).
View available configurations:
SELECT cfgname FROM pg_ts_config;
Custom configurations can be created for domain-specific needs (e.g., medical terminology) by modifying dictionaries or stop words.
3. Creating and Indexing Full-Text Search
To use FTS effectively, you’ll need to store tsvector data and index it for fast queries.
Storing tsvector Columns
You can either:
- Compute
tsvectoron the fly during queries (simpler but slower), or - Store a precomputed
tsvectorcolumn (faster queries, requires storage).
Option 2: Precomputed tsvector Column
Use a generated column to auto-update the tsvector when the source text changes:
CREATE TABLE blog_posts (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
body TEXT NOT NULL,
-- Generated tsvector column (updates automatically)
search_vector TSVECTOR GENERATED ALWAYS AS (
to_tsvector('english', title || ' ' || body)
) STORED
);
Indexing with GIN and GIST
To speed up tsvector queries, use PostgreSQL’s specialized indexes:
- GIN (Generalized Inverted Index): Optimized for
tsvector; faster for read-heavy workloads but slower to build/update. - GIST (Generalized Search Tree): Slower for reads but faster to update; better for write-heavy workloads.
Example: Create a GIN Index
CREATE INDEX idx_blog_posts_search ON blog_posts USING GIN (search_vector);
Now queries on search_vector will use the index, drastically improving performance.
4. Advanced Full-Text Search Techniques
Ranking Results with ts_rank
Not all matches are equal. Use ts_rank or ts_rank_cd to score results by relevance:
ts_rank: Basic ranking based on lexeme frequency and position.ts_rank_cd: Considers “cover density” (closeness of lexemes).
Example:
SELECT
id,
title,
ts_rank(search_vector, query) AS score
FROM
blog_posts,
to_tsquery('english', 'postgresql & search') AS query
WHERE
search_vector @@ query -- @@ is the "matches" operator
ORDER BY
score DESC;
Highlighting Matches with ts_headline
Use ts_headline to extract and highlight snippets of text containing matches:
SELECT
id,
title,
ts_headline(
'english',
body,
to_tsquery('english', 'postgresql & search'),
'StartSel=<strong>, StopSel=</strong>' -- Custom highlight tags
) AS highlighted_body
FROM blog_posts
WHERE search_vector @@ to_tsquery('english', 'postgresql & search');
Output:
Highlights matched words in <strong> tags.
Phrase and Proximity Search
Use the <-> (proximity) operator to search for phrases or nearby words:
'postgresql <-> database': Matches “PostgreSQL database”.'postgresql <2> database': Matches “PostgreSQL X database” (X is 1 word).
Example:
SELECT title
FROM blog_posts
WHERE search_vector @@ to_tsquery('english', 'postgresql <-> search');
Boolean Operators and Weighting
-
Boolean Queries: Combine operators for complex logic:
to_tsquery('english', 'power & (search | database)')(power AND (search OR database)). -
Weighting: Assign weights (A, B, C, D) to prioritize columns (e.g., title over body):
search_vector TSVECTOR GENERATED ALWAYS AS ( setweight(to_tsvector('english', title), 'A') || -- Title (highest weight) setweight(to_tsvector('english', body), 'B') -- Body (lower weight) ) STORED
5. Performance Optimization
- Index Selectively: Use
GINfor read-heavy,GISTfor write-heavy. - Partial Indexes: Index only relevant rows (e.g.,
WHERE status = 'published'). - Analyze Queries: Use
EXPLAIN ANALYZEto check if indexes are being used:EXPLAIN ANALYZE SELECT * FROM blog_posts WHERE search_vector @@ to_tsquery('english', 'postgresql'); - Custom Dictionaries: Remove domain-specific stop words (e.g., “API” in tech blogs) to improve lexeme quality.
6. Real-World Example: Blog Post Search
Let’s build a complete example for searching blog posts with ranking and highlighting.
Step 1: Create the Table
CREATE TABLE blog_posts (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
body TEXT NOT NULL,
published_at TIMESTAMP DEFAULT NOW(),
search_vector TSVECTOR GENERATED ALWAYS AS (
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', body), 'B')
) STORED
);
-- Add GIN index
CREATE INDEX idx_blog_posts_search ON blog_posts USING GIN (search_vector);
Step 2: Insert Sample Data
INSERT INTO blog_posts (title, body) VALUES
('Getting Started with PostgreSQL', 'PostgreSQL is an open-source relational database. It supports full-text search out of the box.'),
('Advanced PostgreSQL Features', 'Learn about indexing, stored procedures, and full-text search in PostgreSQL.'),
('Database Performance Tips', 'Optimize your PostgreSQL database with proper indexing and query tuning.');
Step 3: Search with Ranking and Highlighting
SELECT
id,
title,
ts_rank(search_vector, query) AS score,
ts_headline('english', body, query, 'StartSel=<mark>, StopSel=</mark>') AS snippet
FROM
blog_posts,
to_tsquery('english', 'postgresql & search') AS query
WHERE
search_vector @@ query
ORDER BY
score DESC;
Output:
Returns posts ranked by relevance, with “postgresql” and “search” highlighted in the body snippet.
7. Limitations and When to Use Alternatives
PostgreSQL FTS is powerful, but it has limitations:
- No Built-In Faceting: Cannot easily group results by categories (e.g., “most searched tags”).
- Limited Synonym Support: Requires custom dictionaries for synonyms.
- Scalability: Struggles with extremely large datasets (100M+ rows) compared to distributed tools.
Alternatives to Consider:
- Elasticsearch/Solr: Better for faceting, synonyms, and distributed search.
- Apache Lucene: For embedded FTS in applications.
8. Conclusion
PostgreSQL Full-Text Search is a robust, built-in solution for most text search needs. It eliminates the complexity of integrating external tools, supports advanced features like ranking and highlighting, and performs well for medium-sized datasets. By leveraging tsvector, tsquery, and GIN indexes, you can build fast, relevant search functionality directly in your PostgreSQL database.
For small to medium apps, PostgreSQL FTS is often sufficient. For large-scale or feature-heavy search (e.g., e-commerce), consider combining it with Elasticsearch for hybrid search.