Table of Contents
- Pitfall 1: Missing or Inappropriate Indexes
- Pitfall 2: Overusing SELECT *
- Pitfall 3: Inefficient Joins
- Pitfall 4: Not Using EXPLAIN to Analyze Queries
- Pitfall 5: Overusing Subqueries (When Joins Are Better)
- Pitfall 6: Ignoring Data Type Mismatches
- Pitfall 7: Unnecessary ORDER BY Clauses
- Pitfall 8: Overusing LIKE with Leading Wildcards
- Pitfall 9: Poor Connection and Transaction Management
- Conclusion
- References
Pitfall 1: Missing or Inappropriate Indexes
What is the Pitfall?
Indexes are PostgreSQL’s primary tool for speeding up read operations by allowing the database to quickly locate rows without scanning entire tables. However, two common mistakes occur here: missing critical indexes (leading to slow sequential scans) and over-indexing (slowing down writes like INSERT, UPDATE, and DELETE).
Why It’s a Problem
- Missing indexes: Without indexes on columns used in WHERE, JOIN, ORDER BY, or GROUP BY clauses, PostgreSQL falls back to a sequential scan (reading every row in the table), which is catastrophic for large datasets (e.g., 1M+ rows).
- Over-indexing: Each index adds overhead to write operations, because PostgreSQL must update every index when data changes. For tables with high write throughput (e.g., user activity logs), excessive indexes can turn fast INSERTs into bottlenecks.
How to Detect It
- Use EXPLAIN ANALYZE on slow queries and check for Seq Scan (sequential scan) in the output; this often indicates a missing index.
- Monitor index usage with pg_stat_user_indexes:

SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0; -- Identifies unused indexes
How to Fix It
- Add indexes for critical columns: Index columns frequently used in WHERE, JOIN, ORDER BY, or GROUP BY clauses. For example:

-- Index for a filter on "user_id"
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Index for sorting on "created_at"
CREATE INDEX idx_orders_created_at ON orders(created_at);

- Use partial or expression indexes for filtered or computed data. For example, an index on active users only:

CREATE INDEX idx_active_users ON users (email) WHERE is_active = true;

- Drop unused indexes identified via pg_stat_user_indexes to reduce write overhead.
Pitfall 2: Overusing SELECT *
What is the Pitfall?
Using SELECT * retrieves all columns from a table, even when only a subset is needed. This is a common anti-pattern in PostgreSQL (and SQL generally).
Why It’s a Problem
- Increased I/O: Fetching unused columns forces the database to read more data from disk or memory, increasing I/O and slowing down query execution.
- Prevents index-only scans: PostgreSQL can perform an "index-only scan" if the index contains all columns needed by the query. SELECT * usually includes columns not in the index, forcing the database to fetch full rows from the table (heap access).
- Fragility: If the table schema changes (e.g., new columns are added), SELECT * may return unexpected data, breaking downstream applications.
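To make the index-only-scan point concrete, here is a sketch against a hypothetical orders table (the INCLUDE syntax requires PostgreSQL 11+): selecting only indexed columns can be answered from the index alone, while SELECT * cannot:

```sql
-- Covering index: user_id for filtering, created_at carried as payload
CREATE INDEX idx_orders_user_created ON orders (user_id) INCLUDE (created_at);

-- Eligible for an Index Only Scan: every referenced column lives in the index
SELECT user_id, created_at FROM orders WHERE user_id = 123;

-- Forces heap access: the index cannot supply every column
SELECT * FROM orders WHERE user_id = 123;
```

EXPLAIN on the first query should show Index Only Scan once the table's visibility map is reasonably up to date (e.g., after a VACUUM).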
How to Detect It
Audit your codebase for SELECT * statements. Tools like pg_stat_statements (enabled via the pg_stat_statements extension) can help identify frequently run SELECT * queries:
SELECT query, calls, total_time -- on PostgreSQL 13+, the column is total_exec_time
FROM pg_stat_statements
WHERE query LIKE 'SELECT *%'
ORDER BY total_time DESC;
How to Fix It
Explicitly list the columns you need. For example:
❌ Bad:
SELECT * FROM users WHERE email = '[email protected]';
✅ Good:
SELECT id, name, email FROM users WHERE email = '[email protected]';
Pitfall 3: Inefficient Joins
What is the Pitfall?
Joins are essential for combining data from multiple tables, but poorly designed joins (e.g., missing join conditions, inefficient join types, or joining large tables first) can cripple performance.
Why It’s a Problem
- Cartesian products: Forgetting a join condition (e.g., FROM orders, users instead of FROM orders JOIN users ON orders.user_id = users.id) results in a Cartesian product, where every row in one table is paired with every row in the other. This can generate billions of rows and crash the database.
- Joining large tables first: PostgreSQL's query planner tries to minimize intermediate result sizes, but if a query forces two large tables to be joined early, it can lead to excessive memory usage and slowdowns.
How to Detect It
- Use EXPLAIN ANALYZE to check for Nested Loop, Hash Join, or Merge Join operations. A Hash Join on large tables may be slow if memory (work_mem) is limited.
- Look for abnormally high rows values in the query plan, indicating a potential Cartesian product.
How to Fix It
- Always specify join conditions: Use explicit JOIN syntax (e.g., INNER JOIN, LEFT JOIN) with ON clauses to avoid Cartesian products.
- Filter early to shrink join inputs: Reduce table sizes before joining. For example, filter orders by date first, then join with users:

-- Filter early to reduce join size
SELECT u.name, o.order_total
FROM (SELECT * FROM orders WHERE created_at >= '2023-01-01') o
JOIN users u ON o.user_id = u.id;

- Ensure join columns are indexed: Index columns used in ON clauses (e.g., orders.user_id and users.id) to speed up join lookups.
Pitfall 4: Not Using EXPLAIN to Analyze Queries
What is the Pitfall?
The EXPLAIN command (and its variant EXPLAIN ANALYZE) is PostgreSQL’s built-in tool for understanding how the query planner executes your SQL. Failing to use it is like driving a car without a speedometer—you’re flying blind.
Why It’s a Problem
Without EXPLAIN, you can’t know if PostgreSQL is using indexes, performing sequential scans, or making suboptimal join decisions. You might waste hours optimizing a query that’s already efficient, or ignore a query that’s silently causing sequential scans on a 10M-row table.
How to Use EXPLAIN
- EXPLAIN: Shows the query plan (estimated costs, row counts, operations) without executing the query.
- EXPLAIN ANALYZE: Executes the query and shows actual runtime statistics (useful for comparing estimates vs. reality).
Example output for a slow query:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
Sample output (problematic):
Seq Scan on orders (cost=0.00..10000.00 rows=1000 width=100) (actual time=0.02..500.00 rows=1000 loops=1)
Filter: (user_id = 123)
Planning Time: 0.100 ms
Execution Time: 500.200 ms
Here, Seq Scan (sequential scan) indicates no index on user_id. Adding idx_orders_user_id would replace this with an Index Scan, reducing execution time to milliseconds.
Key Metrics to Look For:
- Scan type: Seq Scan (bad for large tables) vs. Index Scan / Index Only Scan (good).
- Estimated vs. actual rows: If the estimated row count (rows) differs significantly from the actual count (actual rows), PostgreSQL's statistics may be outdated: run ANALYZE to update them.
- Cost: A lower total cost is better. The two numbers in cost=0.00..10000.00 are the startup cost (work before the first row is returned) and the total cost.
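When estimates and actuals diverge, refreshing statistics is usually the first remedy; a minimal sketch (the table and column names here are illustrative):

```sql
-- Recompute planner statistics for one table (plain ANALYZE covers the whole database)
ANALYZE orders;

-- For a heavily skewed column, raise the sampling target first, then re-analyze
ALTER TABLE orders ALTER COLUMN user_id SET STATISTICS 500;
ANALYZE orders;
```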
Pitfall 5: Overusing Subqueries (When Joins Are Better)
What is the Pitfall?
Subqueries (nested queries) are useful for isolating logic, but correlated subqueries (subqueries that reference columns from the outer query) often execute row-by-row, leading to poor performance. In many cases, joins or CTEs (Common Table Expressions) are more efficient.
Why It’s a Problem
Correlated subqueries run once per row in the outer query. For example, a subquery in a WHERE clause that checks EXISTS for each row can balloon to millions of executions on large datasets.
Example
❌ Slow Correlated Subquery:
SELECT u.name
FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.total > 1000
);
In the worst case, the subquery is evaluated once per user, which is slow for 100k+ users. (PostgreSQL can often rewrite a simple EXISTS like this into a semi-join, but more complex correlated subqueries defeat that optimization.)
✅ Faster Join Equivalent:
SELECT DISTINCT u.name
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.total > 1000;
Joins are optimized by PostgreSQL’s query planner and often execute in a single pass over the data.
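Where the DISTINCT rewrite feels indirect, the same set-at-a-time idea can be expressed with a CTE (a sketch against the same hypothetical tables):

```sql
-- Compute the qualifying user_ids once, then join the small result set
WITH big_spenders AS (
    SELECT DISTINCT user_id
    FROM orders
    WHERE total > 1000
)
SELECT u.name
FROM users u
JOIN big_spenders b ON b.user_id = u.id;
```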
Pitfall 6: Ignoring Data Type Mismatches
What is the Pitfall?
PostgreSQL is strict about data types, but implicit type casting (e.g., comparing a string to a number) can silently degrade performance by preventing index usage.
Why It’s a Problem
If a query compares values of mismatched types, PostgreSQL may cast one side to match the other. A plain quoted literal is usually coerced to the column's type, so WHERE id = '123' can still use an index on id; the damaging case is when the cast lands on the column itself, for example when a client driver binds a parameter as text, or when a join compares an integer column to a varchar column:
SELECT * FROM users u
JOIN orders o ON u.id::text = o.user_id; -- orders.user_id stored as varchar
Here, PostgreSQL compares id as text, which invalidates any index on id (the index is on integer, not text). The result? A sequential scan instead of an index scan.
How to Detect It
Look for casts applied to columns in the Filter or Join Filter lines of EXPLAIN output, which indicate implicit casting. For example:
Seq Scan on users (cost=0.00..5000.00 rows=1 width=100)
Filter: ((id)::text = '123'::text)
How to Fix It
Ensure data types match explicitly. Use the correct type in the query:
SELECT * FROM users WHERE id = 123; -- 123 is an integer (matches id's type)
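When the mismatch lives in the schema itself (e.g., a foreign key stored as text referencing an integer id), fixing the column type is the durable remedy. A sketch, assuming every existing value is a valid integer:

```sql
-- Convert the text key to integer so joins compare native types
ALTER TABLE orders
    ALTER COLUMN user_id TYPE integer USING user_id::integer;
```

The USING clause tells PostgreSQL how to convert existing rows; note that the statement rewrites the table under a lock, so schedule it for a maintenance window on large tables.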
Pitfall 7: Unnecessary ORDER BY Clauses
What is the Pitfall?
ORDER BY sorts results, which is computationally expensive (O(n log n) time complexity) for large datasets. Many developers include ORDER BY out of habit, even when results don’t need to be sorted.
Why It’s a Problem
Sorting millions of rows consumes CPU and memory. For example, a query like SELECT * FROM orders ORDER BY created_at on a 1M-row table may take seconds to sort, even with an index.
How to Fix It
- Omit ORDER BY if sorting isn't required (e.g., internal reports, bulk exports where order doesn't matter).
- Use indexes to speed up ORDER BY: If sorting is necessary, add an index on the sorted column(s). For example:

CREATE INDEX idx_orders_created_at ON orders(created_at);

PostgreSQL can then use the index to return pre-sorted results without an explicit sort step.
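The payoff is largest for top-N queries: with an index like the one above in place, a pattern such as this sketch reads only the first few index entries (via a backward scan) instead of sorting the whole table:

```sql
-- Newest 10 orders: can be served by scanning idx_orders_created_at backward
SELECT id, created_at
FROM orders
ORDER BY created_at DESC
LIMIT 10;
```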
Pitfall 8: Overusing LIKE with Leading Wildcards
What is the Pitfall?
The LIKE operator is commonly used for text searches, but patterns with leading wildcards (e.g., LIKE '%john%') cannot use B-tree indexes, leading to sequential scans.
Why It’s a Problem
B-tree indexes (PostgreSQL's default) are ordered and work best for prefix matches (e.g., LIKE 'john%'; under a non-C collation this also requires an index built with text_pattern_ops). Leading wildcards (%john) or middle wildcards (%john%) force PostgreSQL to scan the entire table to find matches, which is slow for large text columns.
How to Fix It
- Use prefix searches when possible: LIKE 'john%' can use a B-tree index.
- Leverage PostgreSQL's full-text search: use tsvector and tsquery with to_tsvector and to_tsquery:

-- Add a GIN index for text search
CREATE INDEX idx_users_bio_fts ON users USING gin (to_tsvector('english', bio));

-- Search for 'john' in bio
SELECT * FROM users WHERE to_tsvector('english', bio) @@ to_tsquery('english', 'john');

- Use trigram indexes for partial matches: The pg_trgm extension enables index support for %john%-style patterns:

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Add a GIN index for trigram searches
CREATE INDEX idx_users_bio_trgm ON users USING gin (bio gin_trgm_ops);

-- Now '%john%' can use the index
SELECT * FROM users WHERE bio LIKE '%john%';
Pitfall 9: Poor Connection and Transaction Management
What is the Pitfall?
Even well-optimized queries can fail if connections or transactions are mismanaged. Common issues include unclosed connections, long-running transactions, and underutilized connection pools.
Why It’s a Problem
- Unclosed connections: Each connection consumes memory. If an application leaks connections (e.g., never closes them after use), the database eventually hits its max_connections limit, blocking new requests.
- Long-running transactions: Open transactions prevent PostgreSQL's VACUUM from cleaning up dead tuples (old row versions), leading to "table bloat" (increased disk usage and slower queries).
How to Fix It
- Use connection pools: Tools like PgBouncer or Pgpool-II manage a pool of reusable connections, preventing leaks and reducing overhead from connection setup/teardown.
- Keep transactions short: Avoid long-running transactions (e.g., wrapping UI workflows in a transaction). Commit early and often.
- Monitor connections: Use pg_stat_activity to track idle or long-running sessions:

-- Find queries running longer than 5 minutes
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > '5 minutes'::interval;
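If monitoring surfaces a leaked or stuck session, it can be ended from the server side. A sketch (the pid 12345 is illustrative; take the real one from pg_stat_activity):

```sql
-- First try cancelling just the running query
SELECT pg_cancel_backend(12345);

-- If the session must go entirely, terminate the connection
SELECT pg_terminate_backend(12345);
```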
Conclusion
PostgreSQL query optimization is a journey, not a one-time task. By avoiding these common pitfalls—from misusing indexes and SELECT * to ignoring EXPLAIN and data type mismatches—you can significantly improve query performance and ensure your database scales with your application.
Remember: Optimization starts with understanding why a query is slow (use EXPLAIN ANALYZE), then systematically addressing bottlenecks (e.g., adding indexes, rewriting subqueries, fixing data types). With practice, you’ll develop an intuition for writing efficient PostgreSQL queries that stand the test of time.