
PostgreSQL Query Optimization: Avoiding Common Pitfalls

In the world of data-driven applications, PostgreSQL has emerged as a robust, open-source relational database management system (RDBMS) trusted by developers and enterprises alike. However, even the most well-designed databases can grind to a halt when faced with poorly optimized queries. Slow queries not only degrade user experience but also increase infrastructure costs, strain server resources, and hinder scalability. Query optimization in PostgreSQL is not just about writing "correct" SQL—it’s about writing *efficient* SQL that leverages the database’s internals (e.g., indexes, query planner, statistics) to minimize execution time and resource usage. In this blog, we’ll explore the most common pitfalls developers and database administrators (DBAs) encounter when writing PostgreSQL queries, why they happen, and how to avoid them. Whether you’re a beginner or an experienced developer, these insights will help you unlock PostgreSQL’s full performance potential.

Table of Contents

  1. Pitfall 1: Missing or Inappropriate Indexes
  2. Pitfall 2: Overusing SELECT *
  3. Pitfall 3: Inefficient Joins
  4. Pitfall 4: Not Using EXPLAIN to Analyze Queries
  5. Pitfall 5: Overusing Subqueries (When Joins Are Better)
  6. Pitfall 6: Ignoring Data Type Mismatches
  7. Pitfall 7: Unnecessary ORDER BY Clauses
  8. Pitfall 8: Overusing LIKE with Leading Wildcards
  9. Pitfall 9: Poor Connection and Transaction Management
  10. Conclusion

Pitfall 1: Missing or Inappropriate Indexes

What is the Pitfall?

Indexes are PostgreSQL’s primary tool for speeding up read operations by allowing the database to quickly locate rows without scanning entire tables. However, two common mistakes occur here: missing critical indexes (leading to slow sequential scans) and over-indexing (slowing down writes like INSERT, UPDATE, and DELETE).

Why It’s a Problem

  • Missing indexes: Without indexes on columns used in WHERE, JOIN, ORDER BY, or GROUP BY clauses, PostgreSQL defaults to a sequential scan (scanning every row in the table), which is catastrophic for large datasets (e.g., 1M+ rows).
  • Over-indexing: Each index adds overhead to write operations (PostgreSQL must update indexes when data changes). For tables with high write throughput (e.g., user activity logs), excessive indexes can turn fast INSERTs into bottlenecks.

How to Detect It

  • Use EXPLAIN ANALYZE on slow queries to check for Seq Scan (sequential scan) in the output—this often indicates missing indexes.
  • Monitor index usage with pg_stat_user_indexes:
    SELECT schemaname, relname, indexrelname, idx_scan  
    FROM pg_stat_user_indexes  
WHERE idx_scan = 0; -- Identifies indexes never scanned since statistics were last reset

How to Fix It

  • Add indexes for critical columns: Index columns frequently used in WHERE, JOIN, ORDER BY, or GROUP BY clauses. For example:
    -- Index for a filter on "user_id"  
    CREATE INDEX idx_orders_user_id ON orders(user_id);  
    
    -- Index for sorting on "created_at"  
    CREATE INDEX idx_orders_created_at ON orders(created_at);  
  • Use partial or expression indexes for filtered or computed data. For example, an index on active users only:
    CREATE INDEX idx_active_users ON users (email) WHERE is_active = true;  
  • Drop unused indexes identified via pg_stat_user_indexes to reduce write overhead.
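A single multicolumn index can serve both a filter and a sort. As a sketch (assuming the orders table from the examples above has an id column), with the equality column first and the sort column second:

```sql
-- Hypothetical composite index: serves "WHERE user_id = ?" filters and
-- "WHERE user_id = ? ORDER BY created_at" in one structure.
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

-- This query can now be answered with an index scan and no separate Sort node
-- (B-tree indexes can be scanned backward, so DESC is also covered):
SELECT id, created_at
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC;
```

Column order matters: an index on (created_at, user_id) would not help the query above nearly as much.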

Pitfall 2: Overusing SELECT *

What is the Pitfall?

Using SELECT * retrieves all columns from a table, even when only a subset is needed. This is a common anti-pattern in PostgreSQL (and SQL generally).

Why It’s a Problem

  • Increased I/O: Fetching unused columns forces the database to read more data from disk or memory, increasing I/O and slowing down query execution.
  • Prevents index-only scans: PostgreSQL can perform an “index-only scan” if the index contains all columns needed by the query. SELECT * often includes columns not in the index, forcing the database to access the full table (a “heap scan”).
  • Fragility: If the table schema changes (e.g., new columns added), SELECT * may return unexpected data, breaking downstream applications.

How to Detect It

Audit your codebase for SELECT * statements. On the database side, the pg_stat_statements extension (added to shared_preload_libraries, then enabled with CREATE EXTENSION pg_stat_statements) can help identify frequently run SELECT * queries:

SELECT query, calls, total_exec_time  
FROM pg_stat_statements  
WHERE query LIKE 'SELECT *%'  
ORDER BY total_exec_time DESC; -- the column is named total_time in PostgreSQL 12 and older  

How to Fix It

Explicitly list the columns you need. For example:

❌ Bad:

SELECT * FROM users WHERE email = '[email protected]';  

✅ Good:

SELECT id, name, email FROM users WHERE email = '[email protected]';  
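Explicit column lists also unlock index-only scans. As a sketch (assuming the users table above, and PostgreSQL 11+ for the INCLUDE clause), a covering index can answer the "good" query without touching the table heap at all:

```sql
-- Hypothetical covering index: email is the search key;
-- id and name ride along as non-key payload columns.
CREATE INDEX idx_users_email_covering ON users (email) INCLUDE (id, name);

-- With the table recently vacuumed (so the visibility map is current),
-- EXPLAIN should show "Index Only Scan" for:
EXPLAIN SELECT id, name, email FROM users WHERE email = '[email protected]';
```

The same query written as SELECT * would pull columns the index does not contain and fall back to heap access.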

Pitfall 3: Inefficient Joins

What is the Pitfall?

Joins are essential for combining data from multiple tables, but poorly designed joins (e.g., missing join conditions, inefficient join types, or joining large tables first) can cripple performance.

Why It’s a Problem

  • Cartesian products: Forgetting a join condition (e.g., FROM orders, users instead of FROM orders JOIN users ON orders.user_id = users.id) results in a Cartesian product, where every row in one table is joined with every row in another. This can generate billions of rows and crash the database.
  • Large intermediate results: PostgreSQL’s query planner tries to minimize intermediate result sizes, but joining two large tables without filtering either one first produces huge intermediate results, leading to excessive memory usage (or disk spills) and slowdowns.

How to Detect It

  • Use EXPLAIN ANALYZE to check for Nested Loop, Hash Join, or Merge Join operations. A Hash Join on large tables may spill to disk (and slow down sharply) if the hash table does not fit in work_mem.
  • Look for abnormally high rows values in the query plan, indicating a potential Cartesian product.

How to Fix It

  • Always specify join conditions: Use explicit JOIN syntax (e.g., INNER JOIN, LEFT JOIN) with ON clauses to avoid Cartesian products.
  • Reduce row counts before joining: The planner usually pushes filters down on its own, but making the reduction explicit keeps the intent clear and helps in complex queries. For example, filter orders by date, then join with users:
    -- Filter early to reduce join size  
    SELECT u.name, o.order_total  
    FROM (SELECT user_id, order_total FROM orders WHERE created_at >= '2023-01-01') o  
    JOIN users u ON o.user_id = u.id;  
  • Ensure join columns are indexed: Index columns used in ON clauses (e.g., orders.user_id and users.id) to speed up join lookups.
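On the last point, note that PostgreSQL indexes the primary-key side of a foreign-key relationship automatically (via the PRIMARY KEY constraint) but does not index the referencing side. A sketch for the join above:

```sql
-- users.id already has an index through its PRIMARY KEY constraint.
-- The foreign-key column does NOT get one automatically; add it yourself:
CREATE INDEX idx_orders_user_id ON orders (user_id);
```

Without this index, every join from orders to users (and every cascading delete on users) has to scan orders.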

Pitfall 4: Not Using EXPLAIN to Analyze Queries

What is the Pitfall?

The EXPLAIN command (and its variant EXPLAIN ANALYZE) is PostgreSQL’s built-in tool for understanding how the query planner executes your SQL. Failing to use it is like driving a car without a speedometer—you’re flying blind.

Why It’s a Problem

Without EXPLAIN, you can’t know if PostgreSQL is using indexes, performing sequential scans, or making suboptimal join decisions. You might waste hours optimizing a query that’s already efficient, or ignore a query that’s silently causing sequential scans on a 10M-row table.

How to Use EXPLAIN

  • EXPLAIN: Shows the query plan (estimated costs, row counts, operations) without executing the query.
  • EXPLAIN ANALYZE: Executes the query and shows actual runtime statistics (useful for comparing estimates vs. reality).

Example output for a slow query:

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;  

Sample output (problematic):

Seq Scan on orders  (cost=0.00..10000.00 rows=1000 width=100) (actual time=0.02..500.00 rows=1000 loops=1)  
  Filter: (user_id = 123)  
Planning Time: 0.100 ms  
Execution Time: 500.200 ms  

Here, Seq Scan (sequential scan) indicates no index on user_id. Adding idx_orders_user_id would replace this with an Index Scan, reducing execution time to milliseconds.

Key Metrics to Look For:

  • Scan type: Seq Scan (bad for large tables) vs. Index Scan/Index Only Scan (good).
  • Rows vs. Actual Rows: If estimated rows (rows) differ significantly from actual rows (actual rows), PostgreSQL’s statistics may be outdated—run ANALYZE to update them.
  • Cost: Reported as cost=startup..total; a lower total cost is better. Costs are the planner’s unit-less estimates, not milliseconds.
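When estimates and actuals diverge badly, refreshing statistics is the first remedy. A sketch, using the orders table from the example (the 500 is an illustrative value):

```sql
-- Recompute planner statistics for one table
-- (autovacuum also runs ANALYZE periodically):
ANALYZE orders;

-- For columns with skewed value distributions, a larger sample improves
-- estimates; the default is set by default_statistics_target (normally 100):
ALTER TABLE orders ALTER COLUMN user_id SET STATISTICS 500;
ANALYZE orders;
```

Re-run EXPLAIN ANALYZE afterwards and check whether the estimated rows now track the actual rows.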

Pitfall 5: Overusing Subqueries (When Joins Are Better)

What is the Pitfall?

Subqueries (nested queries) are useful for isolating logic, but correlated subqueries (subqueries that reference columns from the outer query) can execute row-by-row when the planner cannot rewrite them, leading to poor performance. In many cases, joins or CTEs (Common Table Expressions) are more efficient.

Why It’s a Problem

Correlated subqueries can run once per row in the outer query. PostgreSQL’s planner flattens simple EXISTS and IN subqueries into efficient semi-joins, so those often perform well, but it cannot flatten everything: a correlated aggregate, for instance, executes as a SubPlan for every outer row, which can balloon to millions of executions on large datasets.

Example

❌ Slow Correlated Subquery:

SELECT u.name  
FROM users u  
WHERE (  
  SELECT count(*) FROM orders o  
  WHERE o.user_id = u.id AND o.total > 1000  
) > 0;  

The aggregate forces PostgreSQL to run the subquery once per user (look for SubPlan in the EXPLAIN output), which is slow for 100k+ users.

✅ Faster Join Equivalent:

SELECT DISTINCT u.name  
FROM users u  
JOIN orders o ON u.id = o.user_id  
WHERE o.total > 1000;  

Joins are optimized by PostgreSQL’s query planner and often execute in a single pass over the data.
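The same rewrite can be expressed with a CTE, which keeps the "isolate the logic" readability of a subquery while still giving the planner a set-based shape to work with (and in PostgreSQL 12+, a simple CTE like this can be inlined into the main query):

```sql
-- Hypothetical CTE equivalent of the join above
WITH big_spenders AS (
  SELECT DISTINCT user_id
  FROM orders
  WHERE total > 1000
)
SELECT u.name
FROM users u
JOIN big_spenders b ON b.user_id = u.id;
```

This computes the qualifying user_ids once, rather than once per user.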

Pitfall 6: Ignoring Data Type Mismatches

What is the Pitfall?

PostgreSQL is strict about data types, but implicit type casting (e.g., comparing a string to a number) can silently degrade performance by preventing index usage.

Why It’s a Problem

If a query compares values of mismatched types, PostgreSQL may cast the column rather than the value. A quoted literal such as id = '123' is usually harmless (the untyped literal is resolved to integer), but comparing an integer column to a numeric value is not:

SELECT * FROM users WHERE id = 123.0; -- 123.0 is numeric, id is integer  

Here, PostgreSQL casts id to numeric (via (id)::numeric = 123.0), which invalidates any B-tree index on id (the index is on integer, not numeric). The result? A sequential scan instead of an index scan. The same problem appears when application drivers bind parameters with the wrong type, or when a cast is applied to a join column.

How to Detect It

Look for casts applied to the column (a ::type suffix) in the Filter or Join Filter conditions of EXPLAIN output. For example:

Seq Scan on users  (cost=0.00..5000.00 rows=1 width=100)  
  Filter: ((id)::numeric = 123.0)  

How to Fix It

Ensure data types match explicitly. Use the correct type in the query:

SELECT * FROM users WHERE id = 123; -- 123 is an integer (matches id's type)  
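When the value’s type is fixed on the application side (say, a text parameter arriving from an HTTP request), cast the parameter rather than the column, so the index on the column stays usable. A sketch using a prepared statement (find_user is a hypothetical name):

```sql
-- Cast the parameter, not the column: the planner sees "id = <integer>",
-- so an index on id remains usable.
PREPARE find_user (text) AS
  SELECT id, name FROM users WHERE id = $1::integer;

EXECUTE find_user ('123');
```

Casting the column instead (WHERE id::text = $1) would disable the index for every execution.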

Pitfall 7: Unnecessary ORDER BY Clauses

What is the Pitfall?

ORDER BY sorts results, which is computationally expensive (O(n log n) time complexity) for large datasets. Many developers include ORDER BY out of habit, even when results don’t need to be sorted.

Why It’s a Problem

Sorting millions of rows consumes CPU and memory. For example, a query like SELECT * FROM orders ORDER BY created_at on a 1M-row table may take seconds to sort if there is no index on created_at, and sorts that exceed work_mem spill to temporary disk files.

How to Fix It

  • Omit ORDER BY if sorting isn’t required (e.g., internal reports, bulk exports where order doesn’t matter).
  • Use indexes to speed up ORDER BY: If sorting is necessary, add an index on the sorted column(s). For example:
    CREATE INDEX idx_orders_created_at ON orders(created_at);  
    PostgreSQL can then use the index to return pre-sorted results without explicit sorting.
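Pairing such an index with LIMIT is where this pays off most. As a sketch (assuming idx_orders_created_at from above and an id column on orders), fetching the newest rows becomes a short backward index scan instead of a full sort:

```sql
-- B-tree indexes can be scanned in either direction, so an ascending
-- index also serves ORDER BY ... DESC; EXPLAIN should show no Sort node:
SELECT id, created_at
FROM orders
ORDER BY created_at DESC
LIMIT 10;
```

Without the index, PostgreSQL must sort the entire table just to return ten rows.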

Pitfall 8: Overusing LIKE with Leading Wildcards

What is the Pitfall?

The LIKE operator is commonly used for text searches, but patterns with leading wildcards (e.g., LIKE '%john%') cannot use B-tree indexes, leading to sequential scans.

Why It’s a Problem

B-tree indexes (PostgreSQL’s default) are ordered and work best for prefix matches (e.g., LIKE 'john%'). Leading wildcards (%john) or middle wildcards (%john%) force PostgreSQL to scan the entire table to find matches, which is slow for large text columns.

How to Fix It

  • Use prefix searches when possible: LIKE 'john%' can use a B-tree index. Note that on databases with a non-C locale, the index must be created with the text_pattern_ops (or varchar_pattern_ops) operator class for LIKE to use it.
  • Leverage PostgreSQL’s text search: For full-text search, use tsvector and tsquery with to_tsvector and to_tsquery:
    -- Add a GIN index for text search  
    CREATE INDEX idx_users_bio_fts ON users USING gin (to_tsvector('english', bio));  
    
    -- Search for 'john' in bio  
    SELECT * FROM users WHERE to_tsvector('english', bio) @@ to_tsquery('english', 'john');  
  • Use trigram indexes for partial matches: The pg_trgm extension enables indexes for %john%-style patterns:
    -- Enable the extension  
    CREATE EXTENSION IF NOT EXISTS pg_trgm;  
    
    -- Add a GIN index for trigram searches  
    CREATE INDEX idx_users_bio_trgm ON users USING gin (bio gin_trgm_ops);  
    
    -- Now '%john%' uses the index  
    SELECT * FROM users WHERE bio LIKE '%john%';  

Pitfall 9: Poor Connection and Transaction Management

What is the Pitfall?

Even well-optimized queries can fail if connections or transactions are mismanaged. Common issues include unclosed connections, long-running transactions, and underutilized connection pools.

Why It’s a Problem

  • Unclosed connections: Each connection consumes memory. If an application leaks connections (e.g., not closing them after use), the database will hit its max_connections limit, blocking new requests.
  • Long-running transactions: Transactions prevent PostgreSQL’s VACUUM from cleaning up dead tuples (old row versions), leading to “table bloat” (increased disk usage and slower queries).

How to Fix It

  • Use connection pools: Tools like PgBouncer or Pgpool-II manage a pool of reusable connections, preventing leaks and reducing overhead from connection setup/teardown.
  • Keep transactions short: Avoid long-running transactions (e.g., wrapping UI workflows in a transaction). Commit early and often.
  • Monitor connections: Use pg_stat_activity to track idle or long-running transactions:
    -- Find transactions open for more than 5 minutes (including 'idle in transaction')  
    SELECT pid, state, now() - xact_start AS duration, query  
    FROM pg_stat_activity  
    WHERE xact_start IS NOT NULL  
      AND now() - xact_start > interval '5 minutes';  
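PostgreSQL can also enforce these hygiene rules server-side. A sketch of the relevant settings (the values shown are illustrative; set them per session as here, or in postgresql.conf, or per role/database):

```sql
-- Terminate sessions that sit 'idle in transaction' for too long
-- (available since PostgreSQL 9.6):
SET idle_in_transaction_session_timeout = '5min';

-- Abort any single statement that runs longer than 30 seconds:
SET statement_timeout = '30s';
```

Server-side timeouts act as a safety net for connection leaks and runaway queries that slip past application-level discipline.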

Conclusion

PostgreSQL query optimization is a journey, not a one-time task. By avoiding these common pitfalls—from misusing indexes and SELECT * to ignoring EXPLAIN and data type mismatches—you can significantly improve query performance and ensure your database scales with your application.

Remember: Optimization starts with understanding why a query is slow (use EXPLAIN ANALYZE), then systematically addressing bottlenecks (e.g., adding indexes, rewriting subqueries, fixing data types). With practice, you’ll develop an intuition for writing efficient PostgreSQL queries that stand the test of time.
