Database Indexing: From 45-Second Queries to Lightning Speed
It was my fourth month at Amazon, and our user analytics dashboard was becoming unusable. What started as a simple "show me last month's user activity" query was taking 45 seconds and timing out during peak hours. Our database had grown to 50 million user events, and our queries were doing full table scans on massive datasets.
My database architect pulled me into a meeting room and drew a simple analogy: "Maya, imagine trying to find a specific book in a library where all the books are randomly scattered on the floor. That's what your database is doing right now. Indexes are like the card catalog system - they tell you exactly where to find what you're looking for."
That afternoon, we added a few strategic indexes and rewrote some queries. The same analytics query dropped from 45 seconds to 80 milliseconds. Since then, I've been obsessed with understanding how databases think and how to make them fly.
The Full Table Scan Horror Story
Here's what our original, poorly performing query looked like:
-- The nightmare query that brought our system to its knees
SELECT
u.username,
u.email,
COUNT(e.event_id) as event_count,
AVG(e.session_duration) as avg_session_duration,
MAX(e.created_at) as last_activity
FROM users u
LEFT JOIN user_events e ON u.user_id = e.user_id
WHERE e.created_at >= '2024-01-01'
AND e.created_at < '2024-02-01'
AND e.event_type IN ('page_view', 'click', 'purchase')
AND u.status = 'active'
AND u.subscription_tier IN ('premium', 'enterprise')
GROUP BY u.user_id, u.username, u.email
HAVING COUNT(e.event_id) > 10
ORDER BY event_count DESC
LIMIT 100;
-- Execution plan revealed the horror:
-- 1. Full table scan on users (2M rows)
-- 2. Full table scan on user_events (50M rows)
-- 3. Nested loop join (up to 100 trillion row comparisons in the worst case!)
-- 4. No index on created_at, event_type, status, or subscription_tier
-- Total execution time: 45+ seconds
Let me show you the schema and data we were working with:
-- Our original table structure (before optimization)
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
username VARCHAR(50) NOT NULL,
email VARCHAR(255) NOT NULL,
status VARCHAR(20) NOT NULL,
subscription_tier VARCHAR(20) NOT NULL,
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
CREATE TABLE user_events (
event_id BIGINT PRIMARY KEY,
user_id BIGINT NOT NULL,
event_type VARCHAR(50) NOT NULL,
session_duration INTEGER,
created_at TIMESTAMP NOT NULL,
metadata JSONB
);
-- Sample data to understand the scale
INSERT INTO users (user_id, username, email, status, subscription_tier, created_at, updated_at)
VALUES
(1, 'maya_codes', 'maya@coffee.dev', 'active', 'premium', '2023-01-15', '2024-01-20'),
(2, 'alice_dev', 'alice@tech.com', 'active', 'enterprise', '2023-03-10', '2024-01-18'),
(3, 'bob_analytics', 'bob@data.io', 'inactive', 'basic', '2023-06-20', '2024-01-15');
-- ... imagine 2 million more rows
INSERT INTO user_events (event_id, user_id, event_type, session_duration, created_at, metadata)
VALUES
(1, 1, 'page_view', 45, '2024-01-15 10:30:00', '{"page": "/dashboard"}'),
(2, 1, 'click', 5, '2024-01-15 10:31:00', '{"button": "analytics"}'),
(3, 2, 'purchase', 120, '2024-01-16 14:22:00', '{"amount": 99.99}');
-- ... imagine 50 million more rows
-- Performance analysis before indexing
EXPLAIN ANALYZE
SELECT
u.username,
COUNT(e.event_id) as event_count
FROM users u
LEFT JOIN user_events e ON u.user_id = e.user_id
WHERE e.created_at >= '2024-01-01'
AND u.status = 'active'
GROUP BY u.user_id, u.username;
/*
Execution Plan (BEFORE optimization):
-> Nested Loop Left Join (cost=0.00..2847291.50 rows=982341 time=45234.567ms)
-> Seq Scan on users u (cost=0.00..45320.00 rows=325432)
Filter: (status = 'active')
Rows Removed by Filter: 1674568
-> Seq Scan on user_events e (cost=0.00..2456789.67 rows=147)
Filter: ((created_at >= '2024-01-01') AND (user_id = u.user_id))
Rows Removed by Filter: 49852853
Planning Time: 2.456 ms
Execution Time: 45234.567 ms
*/
The execution plan revealed the nightmare: sequential scans on massive tables with nested loops joining billions of rows. Time to fix this.
The Index Transformation
Here's how we systematically optimized the database:
-- Step 1: Analyze query patterns and create strategic indexes
-- Index for user filtering (status and subscription_tier are frequently queried together)
CREATE INDEX idx_users_status_subscription ON users(status, subscription_tier)
WHERE status = 'active'; -- Partial index: smaller, and covers only the rows we query
-- Covering index for user basic info (all columns we select live in the index;
-- the INCLUDE variant shown later is usually the leaner way to do this)
CREATE INDEX idx_users_covering ON users(user_id, username, email, status, subscription_tier);
-- Time-based index for events (most common query pattern)
CREATE INDEX idx_events_created_at ON user_events(created_at);
-- Composite index for event filtering (order matters!)
CREATE INDEX idx_events_composite ON user_events(user_id, created_at, event_type);
-- Covering index for event aggregations
CREATE INDEX idx_events_covering ON user_events(user_id, event_type, created_at)
INCLUDE (session_duration);
-- Step 2: Rewrite the query to be index-friendly
SELECT
u.username,
u.email,
COUNT(e.event_id) as event_count,
AVG(e.session_duration) as avg_session_duration,
MAX(e.created_at) as last_activity
FROM users u
INNER JOIN user_events e ON u.user_id = e.user_id -- The LEFT JOIN was already effectively INNER because of the WHERE filters on e
WHERE u.status = 'active'
AND u.subscription_tier IN ('premium', 'enterprise')
AND e.created_at >= '2024-01-01'
AND e.created_at < '2024-02-01'
AND e.event_type IN ('page_view', 'click', 'purchase')
GROUP BY u.user_id, u.username, u.email
HAVING COUNT(e.event_id) > 10
ORDER BY event_count DESC
LIMIT 100;
-- Performance analysis after indexing
EXPLAIN (ANALYZE, BUFFERS)
SELECT
u.username,
u.email,
COUNT(e.event_id) as event_count
FROM users u
INNER JOIN user_events e ON u.user_id = e.user_id
WHERE u.status = 'active'
AND u.subscription_tier IN ('premium', 'enterprise')
AND e.created_at >= '2024-01-01'
AND e.created_at < '2024-02-01'
AND e.event_type IN ('page_view', 'click', 'purchase')
GROUP BY u.user_id, u.username, u.email
HAVING COUNT(e.event_id) > 10
ORDER BY COUNT(e.event_id) DESC
LIMIT 100;
/*
Execution Plan (AFTER optimization):
-> Limit (cost=245.67..289.34 rows=100 time=78.234ms)
-> Sort (cost=245.67..267.89 rows=8890)
Sort Key: (count(e.event_id)) DESC
-> HashAggregate (cost=198.45..234.56 rows=8890)
Group Key: u.user_id, u.username, u.email
Filter: (count(e.event_id) > 10)
-> Hash Join (cost=45.23..167.89 rows=12234)
Hash Cond: (e.user_id = u.user_id)
-> Index Scan using idx_events_created_at on user_events e
Index Cond: ((created_at >= '2024-01-01') AND (created_at < '2024-02-01'))
Filter: (event_type = ANY ('{page_view,click,purchase}'))
-> Index Scan using idx_users_status_subscription on users u
Index Cond: ((status = 'active') AND (subscription_tier = ANY ('{premium,enterprise}')))
Planning Time: 1.234 ms
Execution Time: 78.234 ms
*/
The results were dramatic:
- Before: 45+ seconds with full table scans
- After: 78 milliseconds with index scans
- Improvement: 577x faster!
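If you want to reproduce numbers like these on your own workload, resetting pg_stat_statements around a test run gives a clean before/after comparison. A minimal sketch, assuming the extension is installed (setup is shown later in this post):
-- Clear accumulated counters, run the query under test a few times, then inspect
SELECT pg_stat_statements_reset();
-- ... run the dashboard query several times here ...
SELECT query, calls, mean_exec_time -- column is mean_time on PostgreSQL 12 and older
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;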
Understanding Index Types and When to Use Them
B-Tree Indexes - The Workhorses
-- B-Tree indexes are perfect for equality and range queries
-- They maintain sorted order and support efficient lookups
-- Single column index (same as idx_events_created_at created earlier)
CREATE INDEX IF NOT EXISTS idx_events_created_at ON user_events(created_at);
-- Composite index (column order is crucial!)
-- Good for queries that filter on user_id, or user_id + created_at, or all three
-- (note: same column set as idx_events_composite above - the redundant-index
-- check later in this post would flag this pair)
CREATE INDEX idx_events_user_time_type ON user_events(user_id, created_at, event_type);
-- This index can efficiently support these queries:
-- 1. WHERE user_id = 123
-- 2. WHERE user_id = 123 AND created_at > '2024-01-01'
-- 3. WHERE user_id = 123 AND created_at > '2024-01-01' AND event_type = 'click'
-- (after the range condition on created_at, event_type is applied as an
-- in-index filter rather than a direct lookup)
-- But NOT efficiently support:
-- WHERE created_at > '2024-01-01' (skips the leading user_id column)
-- WHERE event_type = 'click' (skips the leading columns)
-- Query to see index usage
SELECT
schemaname,
relname AS tablename,
indexrelname AS indexname,
idx_tup_read,
idx_tup_fetch,
idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('users', 'user_events')
ORDER BY idx_scan DESC;
-- Test different query patterns to understand index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM user_events
WHERE user_id = 12345
AND created_at BETWEEN '2024-01-01' AND '2024-01-31'
AND event_type = 'purchase';
-- Uses idx_events_user_time_type efficiently
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM user_events
WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31'
AND event_type = 'purchase';
-- Likely falls back to idx_events_created_at; the composite index can't help
-- here because its leading user_id column isn't constrained
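One operational note before moving on: on a table the size of user_events, a plain CREATE INDEX blocks writes for the entire build. PostgreSQL's CONCURRENTLY option avoids that lock at the cost of a slower build; a minimal sketch (the index name here is illustrative):
-- CONCURRENTLY builds without blocking writes, but cannot run inside a transaction
CREATE INDEX CONCURRENTLY idx_events_type_time ON user_events(event_type, created_at);
-- A failed concurrent build leaves an INVALID index behind; drop it and retry:
-- DROP INDEX CONCURRENTLY idx_events_type_time;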
Partial Indexes - Space-Efficient Targeting
-- Partial indexes only include rows that match a condition
-- Perfect for queries that always filter on specific values
-- Only index active users (saves space and improves performance)
CREATE INDEX idx_users_active_subscription
ON users(subscription_tier, created_at)
WHERE status = 'active';
-- Only index recent events (for time-series data; the fixed cutoff date
-- goes stale over time, so indexes like this need periodic recreation)
CREATE INDEX idx_events_recent
ON user_events(user_id, event_type)
WHERE created_at >= '2024-01-01';
-- Only index important event types
CREATE INDEX idx_events_critical
ON user_events(user_id, created_at)
WHERE event_type IN ('purchase', 'signup', 'cancellation');
-- Example query that benefits from partial index
SELECT COUNT(*)
FROM users
WHERE status = 'active'
AND subscription_tier = 'premium'
AND created_at >= '2023-01-01';
-- Uses idx_users_active_subscription efficiently
-- Monitor partial index effectiveness
SELECT
s.indexrelname AS indexname,
s.idx_scan,
s.idx_tup_read,
s.idx_tup_fetch,
pg_size_pretty(pg_relation_size(s.indexrelid)) as index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE i.indpred IS NOT NULL -- Only partial indexes
ORDER BY s.idx_scan DESC;
Covering Indexes - Include Everything Needed
-- Covering indexes include extra columns in the leaf level
-- Allows index-only scans without touching the table
-- Cover common user queries (INCLUDE requires PostgreSQL 11+)
CREATE INDEX idx_users_status_covering
ON users(status, subscription_tier)
INCLUDE (username, email, created_at);
-- Cover event aggregation queries
-- (careful with including metadata: large jsonb values can exceed the
-- B-tree index row size limit)
CREATE INDEX idx_events_user_covering
ON user_events(user_id, created_at)
INCLUDE (event_type, session_duration, metadata);
-- This query can be satisfied entirely from the index
SELECT username, email, created_at
FROM users
WHERE status = 'active'
AND subscription_tier = 'premium';
-- Check for index-only scans in query plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT username, email, created_at
FROM users
WHERE status = 'active'
AND subscription_tier = 'premium';
/*
Look for "Index Only Scan" in the output:
-> Index Only Scan using idx_users_status_covering
Index Cond: ((status = 'active') AND (subscription_tier = 'premium'))
Heap Fetches: 0 -- This means no table access needed!
*/
Hash Indexes - For Exact Matches Only
-- Hash indexes are perfect for equality comparisons
-- But they don't support range queries or sorting
-- Good for exact lookups on high-cardinality columns
-- (hash indexes are only crash-safe from PostgreSQL 10 onward)
CREATE INDEX idx_users_email_hash ON users USING HASH (email);
CREATE INDEX idx_users_username_hash ON users USING HASH (username);
-- These queries benefit from hash indexes
SELECT * FROM users WHERE email = 'maya@coffee.dev';
SELECT * FROM users WHERE username = 'maya_codes';
-- These queries CANNOT use hash indexes
SELECT * FROM users WHERE email LIKE 'maya%'; -- Pattern matching
SELECT * FROM users WHERE username > 'maya'; -- Range query
Advanced Indexing Strategies
Expression Indexes - For Computed Values
-- Index on expressions for queries that use functions or calculations
-- Index for case-insensitive email searches
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- Index for extracting date parts
CREATE INDEX idx_events_month ON user_events(EXTRACT(MONTH FROM created_at));
-- Index for JSON lookups (a B-tree expression index matches the ->> query
-- below; use GIN on the whole column for containment queries like
-- metadata @> '{"page": "/dashboard"}')
CREATE INDEX idx_events_metadata_page ON user_events((metadata->>'page'));
-- Index for calculated values
CREATE INDEX idx_events_duration_bucket
ON user_events((CASE
WHEN session_duration < 30 THEN 'short'
WHEN session_duration < 300 THEN 'medium'
ELSE 'long'
END));
-- These queries can use the expression indexes
SELECT * FROM users WHERE LOWER(email) = 'maya@coffee.dev';
SELECT COUNT(*) FROM user_events WHERE EXTRACT(MONTH FROM created_at) = 12;
SELECT * FROM user_events WHERE metadata->>'page' = '/dashboard';
Multi-Column Statistics for Better Query Planning
-- PostgreSQL can create extended statistics for correlated columns
-- Helps the query planner make better decisions
-- Create extended statistics for correlated columns
CREATE STATISTICS stat_users_status_tier
ON status, subscription_tier
FROM users;
CREATE STATISTICS stat_events_user_type_time
ON user_id, event_type, created_at
FROM user_events;
-- Analyze tables to populate statistics
ANALYZE users;
ANALYZE user_events;
-- View statistics information
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename IN ('users', 'user_events')
ORDER BY tablename, attname;
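To confirm the extended statistics were actually built and populated, PostgreSQL 12+ exposes them through a dedicated view; a quick check:
-- Inspect extended statistics after ANALYZE (pg_stats_ext is PostgreSQL 12+)
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename IN ('users', 'user_events');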
Query Optimization Techniques
Query Rewriting for Better Performance
-- Original slow query with subquery
SELECT u.username, u.email
FROM users u
WHERE u.user_id IN (
SELECT e.user_id
FROM user_events e
WHERE e.created_at >= '2024-01-01'
AND e.event_type = 'purchase'
GROUP BY e.user_id
HAVING COUNT(*) > 5
);
-- Optimized version with JOIN (the derived table already returns one row
-- per user, so DISTINCT isn't needed)
SELECT u.username, u.email
FROM users u
INNER JOIN (
SELECT e.user_id
FROM user_events e
WHERE e.created_at >= '2024-01-01'
AND e.event_type = 'purchase'
GROUP BY e.user_id
HAVING COUNT(*) > 5
) active_buyers ON u.user_id = active_buyers.user_id;
-- An alternative with window functions (not automatically faster - measure!
-- note the DISTINCT: the inner query returns one row per matching event)
SELECT DISTINCT username, email
FROM (
SELECT
u.username,
u.email,
COUNT(*) OVER (PARTITION BY u.user_id) as purchase_count
FROM users u
INNER JOIN user_events e ON u.user_id = e.user_id
WHERE e.created_at >= '2024-01-01'
AND e.event_type = 'purchase'
) ranked
WHERE purchase_count > 5;
Avoiding Common Performance Pitfalls
-- PITFALL 1: Functions on indexed columns prevent index usage
-- BAD: Function on indexed column
SELECT * FROM user_events
WHERE EXTRACT(YEAR FROM created_at) = 2024;
-- GOOD: Range query that can use index
SELECT * FROM user_events
WHERE created_at >= '2024-01-01'
AND created_at < '2025-01-01';
-- PITFALL 2: Leading wildcards prevent index usage
-- BAD: Leading wildcard
SELECT * FROM users WHERE email LIKE '%@gmail.com';
-- GOOD: Use full-text search or reverse the string
-- (text_pattern_ops enables LIKE prefix matching under non-C locales)
CREATE INDEX idx_users_email_reverse ON users(REVERSE(email) text_pattern_ops);
SELECT * FROM users WHERE REVERSE(email) LIKE REVERSE('%@gmail.com');
-- Reversing both sides turns the leading wildcard into a trailing one
-- PITFALL 3: OR conditions can be inefficient
-- (for equality ORs on the SAME column, PostgreSQL generally plans OR and
-- IN the same way; ORs across DIFFERENT columns are the real trap - see
-- the UNION sketch below)
-- BAD: OR conditions
SELECT * FROM user_events
WHERE event_type = 'purchase' OR event_type = 'signup';
-- GOOD: Use IN operator (clearer, and easier to parameterize)
SELECT * FROM user_events
WHERE event_type IN ('purchase', 'signup');
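-- Sketch of the cross-column rewrite (assumes separate indexes exist on
-- event_type and session_duration; each UNION branch can use its own index)
SELECT user_id FROM user_events WHERE event_type = 'purchase'
UNION
SELECT user_id FROM user_events WHERE session_duration > 3600;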
-- PITFALL 4: Data type mismatches
-- (PostgreSQL coerces the bare quoted literal '12345' and still uses the
-- index; the real danger is a cast applied to the COLUMN itself, or a
-- driver binding parameters with the wrong type)
-- BAD: cast on the indexed column prevents index usage
SELECT * FROM users WHERE user_id::text = '12345';
-- GOOD: Use correct data type
SELECT * FROM users WHERE user_id = 12345;
-- PITFALL 5: SELECT * when you don't need all columns
-- BAD: Selecting unnecessary data
SELECT * FROM user_events
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 10;
-- GOOD: Select only what you need
SELECT event_type, created_at, session_duration
FROM user_events
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 10;
Real-World Performance Monitoring
Setting Up Query Performance Monitoring
-- Enable query logging for slow queries
-- Add to postgresql.conf:
-- log_min_duration_statement = 1000 # Log queries taking > 1 second
-- (log_statement = 'all' and log_duration = on would log EVERY statement;
-- far too noisy for a busy production system, so use them sparingly)
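The monitoring views below lean on the pg_stat_statements extension, which must be preloaded before it can be created. A minimal setup sketch:
-- In postgresql.conf (requires a restart):
-- shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;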
-- Create a view to monitor slow queries
CREATE VIEW slow_queries AS
SELECT
query,
calls,
total_exec_time, -- these columns are total_time, mean_time, etc. on PostgreSQL 12 and older
mean_exec_time,
max_exec_time,
stddev_exec_time,
(total_exec_time / sum(total_exec_time) OVER()) * 100 AS percentage_of_total
FROM pg_stat_statements
WHERE mean_exec_time > 100 -- Queries averaging > 100ms
ORDER BY total_exec_time DESC;
-- Monitor index usage effectiveness
CREATE VIEW index_usage_stats AS
SELECT
schemaname,
relname AS tablename,
indexrelname AS indexname,
idx_tup_read,
idx_tup_fetch,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) as size,
CASE
WHEN idx_scan = 0 THEN 'Never used'
WHEN idx_scan < 10 THEN 'Rarely used'
WHEN idx_scan < 100 THEN 'Sometimes used'
ELSE 'Frequently used'
END as usage_frequency
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Find tables that might need indexes
CREATE VIEW tables_needing_indexes AS
SELECT
schemaname,
relname AS tablename,
seq_scan,
seq_tup_read,
seq_tup_read / GREATEST(seq_scan, 1) AS avg_rows_per_scan,
pg_size_pretty(pg_total_relation_size(relid)) as table_size
FROM pg_stat_user_tables
WHERE seq_scan > 100 -- Tables with many sequential scans
AND seq_tup_read / GREATEST(seq_scan, 1) > 1000 -- Reading many rows per scan
ORDER BY seq_tup_read DESC;
Automated Index Recommendation System
-- Create a function to analyze query patterns and suggest indexes
CREATE OR REPLACE FUNCTION suggest_indexes()
RETURNS TABLE(
table_name TEXT,
suggested_index TEXT,
reason TEXT,
estimated_benefit TEXT
) AS $$
BEGIN
-- Flag tables with heavy sequential scan activity.
-- (A fuller version would also parse pg_stat_statements.query for the
-- actual filter columns; that part is omitted here for clarity.)
RETURN QUERY
WITH slow_table_scans AS (
SELECT
st.relname AS tablename,
st.seq_scan,
st.seq_tup_read,
st.seq_tup_read / GREATEST(st.seq_scan, 1) as avg_scan_size
FROM pg_stat_user_tables st
WHERE st.seq_scan > 50
)
SELECT
sts.tablename::TEXT,
format('CREATE INDEX idx_%s_suggested ON %s (%s)',
sts.tablename, sts.tablename, 'column_name')::TEXT, -- column is a placeholder
'High sequential scan activity'::TEXT,
format('Potential %sx improvement',
ROUND(sts.avg_scan_size / 1000.0, 1))::TEXT
FROM slow_table_scans sts
WHERE sts.avg_scan_size > 1000;
END;
$$ LANGUAGE plpgsql;
-- Run the analysis
SELECT * FROM suggest_indexes();
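Before building anything this function suggests, it's worth validating the idea cheaply. The hypopg extension (if you can install it) registers hypothetical indexes that plain EXPLAIN will consider but that cost no disk or build time; a sketch:
CREATE EXTENSION IF NOT EXISTS hypopg;
-- Register a hypothetical index; returns its OID and generated name
SELECT * FROM hypopg_create_index('CREATE INDEX ON user_events(created_at)');
-- Plain EXPLAIN (without ANALYZE) now considers the hypothetical index
EXPLAIN SELECT * FROM user_events WHERE created_at >= '2024-01-01';
-- Discard all hypothetical indexes when done
SELECT hypopg_reset();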
Performance Testing Framework
-- Create a performance testing framework
CREATE TABLE performance_tests (
test_id SERIAL PRIMARY KEY,
test_name VARCHAR(255) NOT NULL,
query_text TEXT NOT NULL,
expected_max_time_ms INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE performance_results (
result_id SERIAL PRIMARY KEY,
test_id INTEGER REFERENCES performance_tests(test_id),
execution_time_ms NUMERIC,
rows_returned INTEGER,
buffers_hit INTEGER,
buffers_read INTEGER,
executed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Function to run performance tests
CREATE OR REPLACE FUNCTION run_performance_test(test_name_param VARCHAR)
RETURNS VOID AS $$
DECLARE
test_record performance_tests%ROWTYPE;
start_time TIMESTAMP;
end_time TIMESTAMP;
execution_time_ms NUMERIC;
BEGIN
-- Get test details
SELECT * INTO test_record
FROM performance_tests
WHERE test_name = test_name_param;
IF NOT FOUND THEN
RAISE EXCEPTION 'Test % not found', test_name_param;
END IF;
-- Record start time
start_time := clock_timestamp();
-- Execute the query (in a more complex implementation,
-- you'd capture the actual results and explain output)
EXECUTE test_record.query_text;
-- Record end time
end_time := clock_timestamp();
execution_time_ms := EXTRACT(EPOCH FROM (end_time - start_time)) * 1000;
-- Store results
INSERT INTO performance_results (
test_id,
execution_time_ms,
executed_at
) VALUES (
test_record.test_id,
execution_time_ms,
CURRENT_TIMESTAMP
);
-- Check if performance meets expectations
IF execution_time_ms > test_record.expected_max_time_ms THEN
RAISE WARNING 'Performance test % failed: %ms > %ms expected',
test_name_param, execution_time_ms, test_record.expected_max_time_ms;
ELSE
RAISE NOTICE 'Performance test % passed: %ms', test_name_param, execution_time_ms;
END IF;
END;
$$ LANGUAGE plpgsql;
-- Add performance tests
INSERT INTO performance_tests (test_name, query_text, expected_max_time_ms)
VALUES
('user_analytics_dashboard',
'SELECT u.username, COUNT(e.event_id) FROM users u JOIN user_events e ON u.user_id = e.user_id WHERE e.created_at >= CURRENT_DATE - INTERVAL ''30 days'' GROUP BY u.user_id, u.username LIMIT 100',
100),
('recent_purchases',
'SELECT * FROM user_events WHERE event_type = ''purchase'' AND created_at >= CURRENT_DATE - INTERVAL ''7 days'' ORDER BY created_at DESC LIMIT 50',
50);
-- Run tests
SELECT run_performance_test('user_analytics_dashboard');
SELECT run_performance_test('recent_purchases');
-- Analyze test results over time
SELECT
pt.test_name,
AVG(pr.execution_time_ms) as avg_time_ms,
MAX(pr.execution_time_ms) as max_time_ms,
MIN(pr.execution_time_ms) as min_time_ms,
COUNT(*) as test_runs,
pt.expected_max_time_ms
FROM performance_tests pt
JOIN performance_results pr ON pt.test_id = pr.test_id
WHERE pr.executed_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY pt.test_id, pt.test_name, pt.expected_max_time_ms
ORDER BY avg_time_ms DESC;
Index Maintenance and Optimization
Monitoring Index Health
-- Check for unused indexes (candidates for removal; verify first that an
-- index isn't backing a UNIQUE or PRIMARY KEY constraint before dropping)
SELECT
ui.schemaname,
ui.relname AS tablename,
ui.indexrelname AS indexname,
ui.idx_scan,
pg_size_pretty(pg_relation_size(ui.indexrelid)) as index_size,
ut.n_tup_ins + ut.n_tup_upd + ut.n_tup_del as table_writes
FROM pg_stat_user_indexes ui
JOIN pg_stat_user_tables ut USING (relid)
WHERE ui.idx_scan < 10 -- Very rarely used
AND pg_relation_size(ui.indexrelid) > 1024 * 1024 -- Larger than 1MB
ORDER BY pg_relation_size(ui.indexrelid) DESC;
-- Check for duplicate or redundant indexes
WITH index_columns AS (
SELECT
ui.schemaname,
ui.relname AS tablename,
ui.indexrelname AS indexname,
array_agg(a.attname ORDER BY a.attnum) as columns, -- table-column order, an approximation of key order
pg_relation_size(ui.indexrelid) as index_size
FROM pg_stat_user_indexes ui
JOIN pg_index i ON i.indexrelid = ui.indexrelid
JOIN pg_attribute a ON a.attrelid = i.indrelid
AND a.attnum = ANY(i.indkey)
GROUP BY ui.schemaname, ui.relname, ui.indexrelname, ui.indexrelid
)
SELECT
ic1.schemaname,
ic1.tablename,
ic1.indexname as index1,
ic2.indexname as index2,
ic1.columns,
pg_size_pretty(ic1.index_size) as index1_size,
pg_size_pretty(ic2.index_size) as index2_size
FROM index_columns ic1
JOIN index_columns ic2 ON ic1.schemaname = ic2.schemaname
AND ic1.tablename = ic2.tablename
AND ic1.indexname < ic2.indexname -- Avoid listing each pair twice
WHERE ic1.columns = ic2.columns -- Same column set
OR ic1.columns <@ ic2.columns -- ic1's columns are a subset of ic2's
OR ic2.columns <@ ic1.columns; -- ic2's columns are a subset of ic1's
-- Monitor index bloat (reliable estimation needs the pgstattuple extension;
-- the simple size arithmetic often shown for this is unreliable, so this
-- uses pgstatindex, which works on B-tree indexes only)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
ui.schemaname,
ui.relname AS tablename,
ui.indexrelname AS indexname,
pg_size_pretty(pg_relation_size(ui.indexrelid)) as index_size,
ROUND((100 - s.avg_leaf_density)::NUMERIC, 2) as estimated_bloat_percentage
FROM pg_stat_user_indexes ui
JOIN pg_class c ON c.oid = ui.indexrelid
JOIN pg_am am ON am.oid = c.relam AND am.amname = 'btree'
CROSS JOIN LATERAL pgstatindex(ui.indexrelid) s
WHERE pg_relation_size(ui.indexrelid) > 1024 * 1024 -- Larger than 1MB
ORDER BY pg_relation_size(ui.indexrelid) DESC;
Automated Index Maintenance
-- Create maintenance procedures
-- Note: REINDEX ... CONCURRENTLY (PostgreSQL 12+) cannot run inside a
-- function, because functions always execute within a transaction block.
-- So this helper GENERATES the statements for an external job to run:
CREATE OR REPLACE FUNCTION maintenance_reindex_statements()
RETURNS SETOF TEXT AS $$
SELECT format('REINDEX INDEX CONCURRENTLY %I.%I;', schemaname, indexrelname)
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 10 * 1024 * 1024 -- > 10MB
ORDER BY pg_relation_size(indexrelid) DESC;
$$ LANGUAGE sql;
-- Refresh planner statistics separately (ANALYZE is fine to run directly)
ANALYZE;
-- Schedule maintenance (typically via cron or pg_cron)
-- SELECT maintenance_reindex_statements();
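If pg_cron is available, the generator can at least be scheduled in-database; a sketch, assuming pg_cron 1.3+ for the named-job form:
-- Emit the REINDEX statements every Sunday at 03:00; an external runner
-- still has to execute what this job produces
SELECT cron.schedule('weekly-reindex-report', '0 3 * * 0',
$$SELECT maintenance_reindex_statements()$$);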
Final Thoughts: Indexes as Performance Multipliers
That 45-second-to-80-millisecond optimization taught me that database performance isn't magic - it's about understanding how your database thinks and giving it the tools it needs to work efficiently. Every query tells a story, and execution plans are the database's way of showing you exactly what's happening under the hood.
The key insights that transformed my approach to database optimization:
- Measure before optimizing - Use EXPLAIN ANALYZE religiously
- Index strategically - More indexes aren't always better
- Consider query patterns - Design indexes for how you actually query data
- Monitor continuously - Performance degrades over time without maintenance
- Test everything - What works in development might not work in production
Whether you're building a startup's first database or optimizing queries for millions of users, understanding indexing and query optimization will make you a more effective developer. The difference between a sluggish application and a responsive one often comes down to a few well-placed indexes and thoughtfully written queries.
Remember: premature optimization is the root of all evil, but premature pessimization is even worse. Start with good practices, measure everything, and optimize based on real data.
Currently writing this from Victrola Coffee on Capitol Hill, where I'm optimizing a complex analytics query while enjoying my usual cortado. Share your database optimization victories @maya_codes_pnw - we've all had those magical moments when a query goes from minutes to milliseconds! 🚀☕