cyberangles guide

Full-Text Search in PostgreSQL: A Complete Tutorial

In today’s data-driven world, efficient search functionality is a cornerstone of user-friendly applications. Whether you’re building a blog, an e-commerce platform, or a content management system, allowing users to quickly find relevant information is critical. While simple pattern matching with `LIKE` or `ILIKE` works for basic needs, it falls short for complex scenarios involving natural language, stemming, or relevance ranking. Enter **PostgreSQL Full-Text Search (FTS)**—a powerful, built-in feature that transforms raw text into searchable tokens, enabling sophisticated querying with support for stemming, stop words, ranking, and more. Unlike external tools like Elasticsearch or Solr, PostgreSQL FTS eliminates the need for additional infrastructure, making it ideal for applications where simplicity and integration with existing databases are priorities. This tutorial will guide you through everything you need to know about PostgreSQL FTS, from core concepts to advanced techniques, with practical examples to help you implement robust search in your applications.

Table of Contents

  1. Understanding Full-Text Search Basics
  2. PostgreSQL Full-Text Search Components
  3. Creating and Indexing Full-Text Search
  4. Advanced Full-Text Search Techniques
  5. Performance Optimization
  6. Real-World Example: Blog Post Search
  7. Limitations and When to Use Alternatives
  8. Conclusion
  9. References

1. Understanding Full-Text Search Basics

Before diving into PostgreSQL-specific features, let’s clarify what full-text search (FTS) is and how it differs from simple pattern matching:

  • Simple Pattern Matching (LIKE/ILIKE): Matches substrings exactly (e.g., WHERE body LIKE '%database%'). It does not account for word variations (e.g., “databases” or “databasing”), ignores context, and is slow on large datasets.

  • Full-Text Search: Analyzes text to extract meaningful “lexemes” (root words), filters out trivial “stop words” (e.g., “the,” “and”), and allows querying based on these lexemes. It supports relevance ranking, proximity searches, and natural language queries.

For example, a full-text search for “running” would match “run,” “ran,” or “runner” (via stemming) and rank results by how relevant they are to the query.

2. PostgreSQL Full-Text Search Components

PostgreSQL FTS relies on two core data types and several functions to process and query text:

tsvector: The Searchable Vector

A tsvector (text search vector) is a sorted list of distinct lexemes (stemmed words) with their positions in the original text. It is the “searchable” representation of your text data.

Example:
Convert a sentence to a tsvector using to_tsvector:

SELECT to_tsvector('english', 'PostgreSQL is a powerful database. It supports full-text search!');

Output:
'databas':5 'full':9 'power':4 'postgresql':1 'search':10 'support':8 'text':10

Here, 'databas':5 means the lexeme “databas” (stemmed from “database”) appears at position 5.

tsquery: The Search Query

A tsquery (text search query) represents a search pattern using lexemes and operators. It is used to match against tsvector values.

Example:
Convert a query string to a tsquery using to_tsquery:

SELECT to_tsquery('english', 'power & search'); -- Match "power" AND "search"

Output:
'power' & 'search'

Operators supported in tsquery:

  • &: AND (both lexemes must match)
  • |: OR (either lexeme matches)
  • !: NOT (exclude lexeme)
  • <->: Proximity (lexemes within a certain distance, e.g., 'run' <-> 'fast' for “run fast”)

Text Search Configurations

PostgreSQL uses “text search configurations” to define how text is processed (tokenization, stemming, stop words). The default is english, but configurations exist for 80+ languages (e.g., french, german).

View available configurations:

SELECT cfgname FROM pg_ts_config;

Custom configurations can be created for domain-specific needs (e.g., medical terminology) by modifying dictionaries or stop words.

To use FTS effectively, you’ll need to store tsvector data and index it for fast queries.

Storing tsvector Columns

You can either:

  1. Compute tsvector on the fly during queries (simpler but slower), or
  2. Store a precomputed tsvector column (faster queries, requires storage).

Option 2: Precomputed tsvector Column
Use a generated column to auto-update the tsvector when the source text changes:

CREATE TABLE blog_posts (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    -- Generated tsvector column (updates automatically)
    search_vector TSVECTOR GENERATED ALWAYS AS (
        to_tsvector('english', title || ' ' || body)
    ) STORED
);

Indexing with GIN and GIST

To speed up tsvector queries, use PostgreSQL’s specialized indexes:

  • GIN (Generalized Inverted Index): Optimized for tsvector; faster for read-heavy workloads but slower to build/update.
  • GIST (Generalized Search Tree): Slower for reads but faster to update; better for write-heavy workloads.

Example: Create a GIN Index

CREATE INDEX idx_blog_posts_search ON blog_posts USING GIN (search_vector);

Now queries on search_vector will use the index, drastically improving performance.

4. Advanced Full-Text Search Techniques

Ranking Results with ts_rank

Not all matches are equal. Use ts_rank or ts_rank_cd to score results by relevance:

  • ts_rank: Basic ranking based on lexeme frequency and position.
  • ts_rank_cd: Considers “cover density” (closeness of lexemes).

Example:

SELECT 
    id, 
    title, 
    ts_rank(search_vector, query) AS score
FROM 
    blog_posts, 
    to_tsquery('english', 'postgresql & search') AS query
WHERE 
    search_vector @@ query -- @@ is the "matches" operator
ORDER BY 
    score DESC;

Highlighting Matches with ts_headline

Use ts_headline to extract and highlight snippets of text containing matches:

SELECT 
    id, 
    title,
    ts_headline(
        'english', 
        body, 
        to_tsquery('english', 'postgresql & search'),
        'StartSel=<strong>, StopSel=</strong>' -- Custom highlight tags
    ) AS highlighted_body
FROM blog_posts
WHERE search_vector @@ to_tsquery('english', 'postgresql & search');

Output:
Highlights matched words in <strong> tags.

Use the <-> (proximity) operator to search for phrases or nearby words:

  • 'postgresql <-> database': Matches “PostgreSQL database”.
  • 'postgresql <2> database': Matches “PostgreSQL X database” (X is 1 word).

Example:

SELECT title 
FROM blog_posts 
WHERE search_vector @@ to_tsquery('english', 'postgresql <-> search');

Boolean Operators and Weighting

  • Boolean Queries: Combine operators for complex logic:
    to_tsquery('english', 'power & (search | database)') (power AND (search OR database)).

  • Weighting: Assign weights (A, B, C, D) to prioritize columns (e.g., title over body):

    search_vector TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('english', title), 'A') || -- Title (highest weight)
        setweight(to_tsvector('english', body), 'B')    -- Body (lower weight)
    ) STORED

5. Performance Optimization

  • Index Selectively: Use GIN for read-heavy, GIST for write-heavy.
  • Partial Indexes: Index only relevant rows (e.g., WHERE status = 'published').
  • Analyze Queries: Use EXPLAIN ANALYZE to check if indexes are being used:
    EXPLAIN ANALYZE 
    SELECT * FROM blog_posts 
    WHERE search_vector @@ to_tsquery('english', 'postgresql');
  • Custom Dictionaries: Remove domain-specific stop words (e.g., “API” in tech blogs) to improve lexeme quality.

6. Real-World Example: Blog Post Search

Let’s build a complete example for searching blog posts with ranking and highlighting.

Step 1: Create the Table

CREATE TABLE blog_posts (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    published_at TIMESTAMP DEFAULT NOW(),
    search_vector TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('english', title), 'A') ||
        setweight(to_tsvector('english', body), 'B')
    ) STORED
);

-- Add GIN index
CREATE INDEX idx_blog_posts_search ON blog_posts USING GIN (search_vector);

Step 2: Insert Sample Data

INSERT INTO blog_posts (title, body) VALUES 
('Getting Started with PostgreSQL', 'PostgreSQL is an open-source relational database. It supports full-text search out of the box.'),
('Advanced PostgreSQL Features', 'Learn about indexing, stored procedures, and full-text search in PostgreSQL.'),
('Database Performance Tips', 'Optimize your PostgreSQL database with proper indexing and query tuning.');

Step 3: Search with Ranking and Highlighting

SELECT 
    id,
    title,
    ts_rank(search_vector, query) AS score,
    ts_headline('english', body, query, 'StartSel=<mark>, StopSel=</mark>') AS snippet
FROM 
    blog_posts,
    to_tsquery('english', 'postgresql & search') AS query
WHERE 
    search_vector @@ query
ORDER BY 
    score DESC;

Output:
Returns posts ranked by relevance, with “postgresql” and “search” highlighted in the body snippet.

7. Limitations and When to Use Alternatives

PostgreSQL FTS is powerful, but it has limitations:

  • No Built-In Faceting: Cannot easily group results by categories (e.g., “most searched tags”).
  • Limited Synonym Support: Requires custom dictionaries for synonyms.
  • Scalability: Struggles with extremely large datasets (100M+ rows) compared to distributed tools.

Alternatives to Consider:

  • Elasticsearch/Solr: Better for faceting, synonyms, and distributed search.
  • Apache Lucene: For embedded FTS in applications.

8. Conclusion

PostgreSQL Full-Text Search is a robust, built-in solution for most text search needs. It eliminates the complexity of integrating external tools, supports advanced features like ranking and highlighting, and performs well for medium-sized datasets. By leveraging tsvector, tsquery, and GIN indexes, you can build fast, relevant search functionality directly in your PostgreSQL database.

For small to medium apps, PostgreSQL FTS is often sufficient. For large-scale or feature-heavy search (e.g., e-commerce), consider combining it with Elasticsearch for hybrid search.

9. References