Table of Contents
- Understanding Full-Text Search Basics
- How PostgreSQL Full-Text Search Works
- Core Components: tsvector and tsquery
- Basic Full-Text Search Queries
- Advanced Features
- 5.1 Relevance Ranking
- 5.2 Stemming and Dictionaries
- 5.3 Stop Words
- 5.4 Phrase Search and Proximity
- 5.5 Prefix Matching
- Performance Optimization
- Practical Use Cases
- Limitations and When to Consider Alternatives
- Conclusion
- References
1. Understanding Full-Text Search Basics
Before diving into PostgreSQL’s implementation, let’s clarify what full-text search (FTS) is and why it matters.
Full-text search is a technique for searching natural language text in documents (e.g., articles, product descriptions, or user comments) by analyzing the content semantically, rather than just matching exact substrings. Unlike simple LIKE or ILIKE queries— which are slow for large datasets and limited to literal matches—FTS:
- Normalizes text: Converts words to their root form (stemming), e.g., “running” → “run”, “cats” → “cat”.
- Ignores trivial words: Filters out “stop words” like “the”, “and”, or “is” that don’t add meaning.
- Ranks results by relevance: Prioritizes documents with more matches, closer word proximity, or more important terms (e.g., words in titles vs. body text).
- Supports advanced queries: Phrases (“quick brown dog”), boolean logic (“dog OR cat AND NOT bird”), and proximity (“dog NEAR cat”).
These features make FTS far more powerful for user-facing search than traditional string matching.
2. How PostgreSQL Full-Text Search Works
PostgreSQL’s FTS engine processes text in two main stages: document processing (preparing text for search) and query processing (matching user input to documents). Here’s a high-level overview:
Step 1: Document Processing
- Tokenization: The input text (e.g., an article’s content) is split into “tokens” (words or numbers) using a parser. PostgreSQL’s default parser splits text on whitespace and punctuation, but custom parsers are possible.
- Normalization: Tokens are transformed into “lexemes” (normalized word forms) via:
- Stemming: Reducing words to their root (e.g., “jumps” → “jump”, “better” → “good” with some dictionaries).
- Case folding: Converting to lowercase (e.g., “Quick” → “quick”).
- Stop word removal: Filtering out non-essential words (e.g., “the”, “a”, “is”).
- tsvector Creation: The normalized lexemes are stored in a
tsvector(text search vector), a sorted, deduplicated list of lexemes with their positions in the original text (e.g.,'quick:1 brown:2 dog:3').
Step 2: Query Processing
- Query Parsing: The user’s search input (e.g., “quick dog”) is parsed into a
tsquery(text search query), which includes operators for boolean logic, proximity, etc. - Normalization: Like documents, query terms are stemmed and normalized to match lexemes in
tsvector. - Matching: The
tsqueryis compared againsttsvectorusing the@@operator (pronounced “matches”). PostgreSQL returns documents where thetsvectormatches thetsquery.
3. Core Components: tsvector and tsquery
At the heart of PostgreSQL FTS are two data types: tsvector (for documents) and tsquery (for search queries).
tsvector: The Searchable Document
A tsvector is a sorted, compressed list of lexemes (normalized words) with their positions in the original text. Positions help with ranking and phrase searches.
Example:
For the sentence “The quick brown fox jumps over the lazy dog”, the tsvector (using the English dictionary) would be:
'brown:3 dog:8 fox:4 jump:5 lazy:7 quick:2'
- Lexemes are stemmed (“jumps” → “jump”) and lowercased.
- Stop words (“the”, “over”) are removed.
- Positions (e.g.,
brown:3) indicate where the lexeme appeared in the original text.
tsquery: The Search Query
A tsquery represents the user’s search intent, with operators to refine matches. Common operators include:
&: AND (both terms must match).|: OR (either term matches).!: NOT (exclude the term).<->: Proximity (terms must appear in order, adjacent or with up to N words between them; e.g.,<->= adjacent,<2>= within 2 words).:*: Prefix (match lexemes starting with the term).
Examples:
'quick & dog': Match documents with both “quick” and “dog”.'brown <-> dog': Match “brown dog” as a phrase (adjacent terms).'jump:*': Match “jump”, “jumping”, “jumps”, etc.
4. Basic Full-Text Search Queries
Let’s put this into practice with a sample dataset. Suppose we have an articles table:
CREATE TABLE articles (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT,
published_at TIMESTAMP
);
-- Insert sample data
INSERT INTO articles (title, content) VALUES
('The Quick Brown Fox', 'The quick brown fox jumps over the lazy dog. The dog was not amused.'),
('A Day in the Park', 'I saw a dog chasing a cat in the park. The cat was faster than the dog.'),
('Cooking with Herbs', 'Herbs like basil and rosemary add flavor to dishes. Basil is my favorite.');
Basic Match with to_tsvector and to_tsquery
To search the content column, use to_tsvector (converts text to tsvector) and to_tsquery (converts text to tsquery), then compare with @@:
-- Find articles mentioning "dog"
SELECT title FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'dog');
Result:
title
----------------------
The Quick Brown Fox
A Day in the Park
Simplifying with plainto_tsquery
For user input without operators (e.g., a search bar), plainto_tsquery converts spaces to OR by default:
-- Search for "dog cat" (dog OR cat)
SELECT title FROM articles
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'dog cat');
Result:
title
----------------------
The Quick Brown Fox
A Day in the Park
Phrase Search with phraseto_tsquery
To search for exact phrases, use phraseto_tsquery, which converts the input to a proximity query with <-> operators:
-- Search for "brown dog" as a phrase
SELECT title FROM articles
WHERE to_tsvector('english', content) @@ phraseto_tsquery('english', 'brown dog');
Result:
title
----------------------
The Quick Brown Fox
5. Advanced Features
5.1 Relevance Ranking
Not all matches are equal. Use ts_rank or ts_rank_cd to score documents by relevance, then sort results.
ts_rank: Ranks based on lexeme frequency and proximity.ts_rank_cd: Adds “cover density” (rewards compact clusters of matches).
Example:
SELECT
title,
ts_rank(to_tsvector('english', content), to_tsquery('english', 'dog')) AS rank
FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'dog')
ORDER BY rank DESC;
Result:
title | rank
----------------------+-----------
A Day in the Park | 0.0307818
The Quick Brown Fox | 0.0256545
The second article has more “dog” mentions, so it ranks higher.
5.2 Stemming and Dictionaries
PostgreSQL uses dictionaries to control stemming, stop words, and synonym expansion. The default is english, but others (e.g., french, german) are available. You can even create custom dictionaries.
Check stemming for a word:
SELECT to_tsvector('english', 'jumping jumps jumped');
-- Result: 'jump:1,2,3' (all forms stem to "jump")
Use a different dictionary:
-- French stemming
SELECT to_tsvector('french', 'chat chats chattant');
-- Result: 'chat:1,2,3' ("chat" = "cat" in French)
5.3 Stop Words
Stop words are trivial words excluded from search. To view stop words for a dictionary:
SELECT unnest(pg_get_stopwords('english')) AS stop_words;
-- Returns: 'a', 'and', 'the', 'is', ...
To disable stop words (not recommended for most cases), use the simple dictionary (no stemming or stop words):
SELECT to_tsvector('simple', 'the quick brown fox');
-- Result: 'brown:3 fox:4 quick:2 the:1' (stop words retained)
5.4 Phrase Search and Proximity
Use the <-> operator in tsquery to enforce word order and proximity.
a <-> b: Match “a” immediately followed by “b”.a <3> b: Match “a” followed by “b” with up to 2 words in between (total distance ≤3).
Example:
-- Find "dog" followed by "lazy" with up to 1 word between them
SELECT title FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'dog <2> lazy');
Result:
title
----------------------
The Quick Brown Fox
(Original text: “the lazy dog” → “dog” is after “lazy”, so reverse order won’t match. To match “lazy dog”, use lazy <-> dog.)
5.5 Prefix Matching
Use :* to match lexemes starting with a term (e.g., “run” → “running”, “runner”).
Example:
SELECT title FROM articles
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'jump:*');
Result:
title
----------------------
The Quick Brown Fox -- Contains "jumps" → stemmed to "jump"
6. Performance Optimization
For large datasets, FTS queries can be slow without indexes. PostgreSQL offers two main index types for tsvector: GIN and GIST.
6.1 GIN and GIST Indexes
GIN (Generalized Inverted Index)
- Best for: High-cardinality data (many unique lexemes).
- Pros: Fast lookups, smaller than GIST for large datasets.
- Cons: Slower writes (index updates are costly).
Create a GIN index on a tsvector expression:
CREATE INDEX idx_articles_content_fts ON articles
USING gin(to_tsvector('english', content));
Now, queries using to_tsvector('english', content) @@ ... will use this index, drastically speeding up searches.
GIST (Generalized Search Tree)
- Best for: Read-heavy workloads with very large
tsvectors (e.g., books or long documents). - Pros: Smaller index size than GIN, faster writes.
- Cons: Slower read performance than GIN.
Create a GIST index:
CREATE INDEX idx_articles_content_gist ON articles
USING gist(to_tsvector('english', content));
6.2 Storing tsvector for Faster Queries
Computing tsvector on the fly (e.g., to_tsvector('english', content)) can slow down writes. Instead, store the tsvector in a column and update it via triggers.
Example:
-- Add a tsvector column
ALTER TABLE articles ADD COLUMN content_fts tsvector;
-- Populate it
UPDATE articles SET content_fts = to_tsvector('english', content);
-- Create a trigger to update on insert/update
CREATE OR REPLACE FUNCTION update_fts()
RETURNS TRIGGER AS $$
BEGIN
NEW.content_fts = to_tsvector('english', NEW.content);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER articles_fts_trigger
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION update_fts();
-- Index the stored tsvector
CREATE INDEX idx_articles_content_fts_stored ON articles USING gin(content_fts);
-- Query using the stored column
SELECT title FROM articles WHERE content_fts @@ to_tsquery('english', 'dog');
7. Practical Use Cases
7.1 E-Commerce Product Search
Let users search product catalogs by name, description, or tags. Use ranking to prioritize products with matches in titles over descriptions.
SELECT
product_name,
ts_rank(
setweight(to_tsvector('english', product_name), 'A') || -- Title (higher weight)
setweight(to_tsvector('english', description), 'B'), -- Description (lower weight)
to_tsquery('english', 'wireless & headphone')
) AS rank
FROM products
WHERE ...
ORDER BY rank DESC;
setweight assigns importance: ‘A’ (highest), ‘B’, ‘C’, ‘D’ (lowest).
7.2 Content Management Systems (CMS)
Enable blog or wiki search with phrase matching and relevance. For example, a tech blog might let users search for “PostgreSQL full-text search”.
7.3 Support Ticket Systems
Help agents find related tickets by searching user messages. Use proximity search to match phrases like “login error” or “payment failed”.
8. Limitations and When to Consider Alternatives
PostgreSQL FTS is powerful, but it’s not a silver bullet. Consider alternatives like Elasticsearch or Apache Solr if you need:
- Distributed search: Scaling across multiple nodes.
- Real-time indexing: Near-instant updates for high-write workloads.
- Advanced analytics: Faceted search (e.g., “filter by price range” + “color”), aggregations.
- Rich language support: Better handling for languages with complex morphology (e.g., Arabic, Japanese).
- Synonym expansion: Advanced synonym management (PostgreSQL supports synonyms via dictionaries but is less flexible).
9. Conclusion
PostgreSQL’s full-text search is a hidden gem. It provides robust, built-in search functionality with minimal setup, making it ideal for small to medium applications, or as a cost-effective alternative to external search engines. With features like relevance ranking, stemming, and GIN indexing, it can handle most user-facing search needs.
By leveraging PostgreSQL FTS, you avoid the complexity of managing separate search infrastructure, reduce latency, and keep your data in one familiar system. While it has limitations, it’s often the best first step before graduating to specialized tools like Elasticsearch.
So next time you need search functionality, give PostgreSQL FTS a try—you might be surprised by its power.