
#Backend #Database
Master database performance from a developer who optimized queries handling millions of records at Amazon, covering indexing strategies, query optimization, and real-world performance tuning.

Database Indexing: From 45-Second Queries to Lightning Speed

It was my fourth month at Amazon, and our user analytics dashboard was becoming unusable. What started as a simple "show me last month's user activity" query was taking 45 seconds and timing out during peak hours. Our database had grown to 50 million user events, and our queries were doing full table scans on massive datasets.

My database architect pulled me into a meeting room and drew a simple analogy: "Maya, imagine trying to find a specific book in a library where all the books are randomly scattered on the floor. That's what your database is doing right now. Indexes are like the card catalog system - they tell you exactly where to find what you're looking for."

That afternoon, we added a few strategic indexes and rewrote some queries. The same analytics query dropped from 45 seconds to 80 milliseconds. Since then, I've been obsessed with understanding how databases think and how to make them fly.

The Full Table Scan Horror Story

Here's what our original, poorly performing query looked like:

-- The nightmare query that brought our system to its knees
SELECT 
    u.username,
    u.email,
    COUNT(e.event_id) as event_count,
    AVG(e.session_duration) as avg_session_duration,
    MAX(e.created_at) as last_activity
FROM users u
LEFT JOIN user_events e ON u.user_id = e.user_id
WHERE e.created_at >= '2024-01-01'
    AND e.created_at < '2024-02-01'
    AND e.event_type IN ('page_view', 'click', 'purchase')
    AND u.status = 'active'
    AND u.subscription_tier IN ('premium', 'enterprise')
GROUP BY u.user_id, u.username, u.email
HAVING COUNT(e.event_id) > 10
ORDER BY event_count DESC
LIMIT 100;

-- Execution plan revealed the horror:
-- 1. Full table scan on users (2M rows)
-- 2. Full table scan on user_events (50M rows) 
-- 3. Nested loop join (up to 100 trillion comparisons!)
-- 4. No index on created_at, event_type, status, or subscription_tier
-- Total execution time: 45+ seconds

Let me show you the schema and data we were working with:

-- Our original table structure (before optimization)
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    email VARCHAR(255) NOT NULL,
    status VARCHAR(20) NOT NULL,
    subscription_tier VARCHAR(20) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL
);

CREATE TABLE user_events (
    event_id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    session_duration INTEGER,
    created_at TIMESTAMP NOT NULL,
    metadata JSONB
);
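
At this point the only indexes are the ones PostgreSQL creates implicitly for the two PRIMARY KEY constraints; nothing supports the columns we filter and join on. The catalog confirms it:

-- Only the two implicit primary-key indexes exist so far
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('users', 'user_events');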

-- Sample data to understand the scale
INSERT INTO users (user_id, username, email, status, subscription_tier, created_at, updated_at)
VALUES 
    (1, 'maya_codes', 'maya@coffee.dev', 'active', 'premium', '2023-01-15', '2024-01-20'),
    (2, 'alice_dev', 'alice@tech.com', 'active', 'enterprise', '2023-03-10', '2024-01-18'),
    (3, 'bob_analytics', 'bob@data.io', 'inactive', 'basic', '2023-06-20', '2024-01-15');
    -- ... imagine 2 million more rows

INSERT INTO user_events (event_id, user_id, event_type, session_duration, created_at, metadata)
VALUES 
    (1, 1, 'page_view', 45, '2024-01-15 10:30:00', '{"page": "/dashboard"}'),
    (2, 1, 'click', 5, '2024-01-15 10:31:00', '{"button": "analytics"}'),
    (3, 2, 'purchase', 120, '2024-01-16 14:22:00', '{"amount": 99.99}');
    -- ... imagine 50 million more rows

-- Performance analysis before indexing
EXPLAIN ANALYZE
SELECT 
    u.username,
    COUNT(e.event_id) as event_count
FROM users u
LEFT JOIN user_events e ON u.user_id = e.user_id
WHERE e.created_at >= '2024-01-01'
    AND u.status = 'active'
GROUP BY u.user_id, u.username;

/*
Execution Plan (BEFORE optimization):
-> Nested Loop Left Join  (cost=0.00..2847291.50 rows=982341 time=45234.567ms)
   -> Seq Scan on users u  (cost=0.00..45320.00 rows=325432) 
        Filter: (status = 'active')
        Rows Removed by Filter: 1674568
   -> Seq Scan on user_events e  (cost=0.00..2456789.67 rows=147) 
        Filter: ((created_at >= '2024-01-01') AND (user_id = u.user_id))
        Rows Removed by Filter: 49852853

Planning Time: 2.456 ms
Execution Time: 45234.567 ms
*/

The execution plan revealed the nightmare: sequential scans on massive tables with nested loops joining billions of rows. Time to fix this.

The Index Transformation

Here's how we systematically optimized the database:

-- Step 1: Analyze query patterns and create strategic indexes

-- Index for user filtering (status, subscription_tier are frequently queried together)
CREATE INDEX idx_users_status_subscription ON users(status, subscription_tier)
WHERE status = 'active';  -- Partial index for better performance

-- Covering index for user basic info (includes all columns we select)
CREATE INDEX idx_users_covering ON users(user_id, username, email, status, subscription_tier);

-- Time-based index for events (most common query pattern)
CREATE INDEX idx_events_created_at ON user_events(created_at);

-- Composite index for event filtering (order matters!)
CREATE INDEX idx_events_composite ON user_events(user_id, created_at, event_type);

-- Covering index for event aggregations (INCLUDE requires PostgreSQL 11+)
CREATE INDEX idx_events_covering ON user_events(user_id, event_type, created_at) 
INCLUDE (session_duration);

-- Step 2: Rewrite the query to be index-friendly
SELECT 
    u.username,
    u.email,
    COUNT(e.event_id) as event_count,
    AVG(e.session_duration) as avg_session_duration,
    MAX(e.created_at) as last_activity
FROM users u
INNER JOIN user_events e ON u.user_id = e.user_id  -- Changed to INNER JOIN
WHERE u.status = 'active'
    AND u.subscription_tier IN ('premium', 'enterprise')
    AND e.created_at >= '2024-01-01'
    AND e.created_at < '2024-02-01'
    AND e.event_type IN ('page_view', 'click', 'purchase')
GROUP BY u.user_id, u.username, u.email
HAVING COUNT(e.event_id) > 10
ORDER BY event_count DESC
LIMIT 100;

-- Performance analysis after indexing
EXPLAIN (ANALYZE, BUFFERS)
SELECT 
    u.username,
    u.email,
    COUNT(e.event_id) as event_count
FROM users u
INNER JOIN user_events e ON u.user_id = e.user_id
WHERE u.status = 'active'
    AND u.subscription_tier IN ('premium', 'enterprise')
    AND e.created_at >= '2024-01-01'
    AND e.created_at < '2024-02-01'
    AND e.event_type IN ('page_view', 'click', 'purchase')
GROUP BY u.user_id, u.username, u.email
HAVING COUNT(e.event_id) > 10
ORDER BY COUNT(e.event_id) DESC
LIMIT 100;

/*
Execution Plan (AFTER optimization):
-> Limit  (cost=245.67..289.34 rows=100 time=78.234ms)
   -> Sort  (cost=245.67..267.89 rows=8890)
        Sort Key: (count(e.event_id)) DESC
        -> HashAggregate  (cost=198.45..234.56 rows=8890)
             Group Key: u.user_id, u.username, u.email
             Filter: (count(e.event_id) > 10)
             -> Hash Join  (cost=45.23..167.89 rows=12234)
                  Hash Cond: (e.user_id = u.user_id)
                  -> Index Scan using idx_events_created_at on user_events e
                       Index Cond: ((created_at >= '2024-01-01') AND (created_at < '2024-02-01'))
                       Filter: (event_type = ANY ('{page_view,click,purchase}'))
                  -> Index Scan using idx_users_status_subscription on users u
                       Index Cond: ((status = 'active') AND (subscription_tier = ANY ('{premium,enterprise}')))

Planning Time: 1.234 ms
Execution Time: 78.234 ms
*/

The results were dramatic:

  • Before: 45+ seconds with full table scans
  • After: 78 milliseconds with index scans
  • Improvement: 577x faster!

Understanding Index Types and When to Use Them

B-Tree Indexes - The Workhorses

-- B-Tree indexes are perfect for equality and range queries
-- They maintain sorted order and support efficient lookups

-- Single column index
CREATE INDEX idx_events_created_at ON user_events(created_at);

-- Composite index (column order is crucial!)
-- Good for queries that filter on user_id, or user_id + created_at, or all three
CREATE INDEX idx_events_user_time_type ON user_events(user_id, created_at, event_type);

-- This index can efficiently support these queries:
-- 1. WHERE user_id = 123
-- 2. WHERE user_id = 123 AND created_at > '2024-01-01'
-- 3. WHERE user_id = 123 AND created_at > '2024-01-01' AND event_type = 'click'

-- But NOT efficiently support:
-- WHERE created_at > '2024-01-01' (user_id is not specified)
-- WHERE event_type = 'click' (skips the leading columns)

-- Query to see index usage
SELECT 
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    idx_tup_read,
    idx_tup_fetch,
    idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('users', 'user_events')
ORDER BY idx_scan DESC;

-- Test different query patterns to understand index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM user_events 
WHERE user_id = 12345 
    AND created_at BETWEEN '2024-01-01' AND '2024-01-31'
    AND event_type = 'purchase';
-- Uses idx_events_user_time_type efficiently

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM user_events 
WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31'
    AND event_type = 'purchase';
-- May not use the composite index efficiently due to missing user_id
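
When the date-range query without a user_id is itself a hot path, give it an index that leads with the timestamp; the single-column idx_events_created_at above does exactly that, and a composite variant can also narrow by event type:

-- Leads with created_at, so pure date-range scans can use the index;
-- event_type then narrows matches within the scanned range
CREATE INDEX idx_events_time_type ON user_events(created_at, event_type);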

Partial Indexes - Space-Efficient Targeting

-- Partial indexes only include rows that match a condition
-- Perfect for queries that always filter on specific values

-- Only index active users (saves space and improves performance)
CREATE INDEX idx_users_active_subscription 
ON users(subscription_tier, created_at) 
WHERE status = 'active';

-- Only index recent events (for time-series data)
CREATE INDEX idx_events_recent 
ON user_events(user_id, event_type) 
WHERE created_at >= '2024-01-01';

-- Only index important event types
CREATE INDEX idx_events_critical 
ON user_events(user_id, created_at) 
WHERE event_type IN ('purchase', 'signup', 'cancellation');

-- Example query that benefits from partial index
SELECT COUNT(*)
FROM users 
WHERE status = 'active' 
    AND subscription_tier = 'premium' 
    AND created_at >= '2023-01-01';
-- Uses idx_users_active_subscription efficiently

-- Monitor partial index effectiveness
SELECT 
    indexrelname AS indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes 
JOIN pg_index USING (indexrelid)
WHERE indpred IS NOT NULL  -- Only partial indexes
ORDER BY idx_scan DESC;

Covering Indexes - Include Everything Needed

-- Covering indexes include extra columns in the leaf level
-- Allows index-only scans without touching the table

-- Cover common user queries
CREATE INDEX idx_users_status_covering 
ON users(status, subscription_tier) 
INCLUDE (username, email, created_at);

-- Cover event aggregation queries
CREATE INDEX idx_events_user_covering 
ON user_events(user_id, created_at) 
INCLUDE (event_type, session_duration, metadata);

-- This query can be satisfied entirely from the index
SELECT username, email, created_at
FROM users 
WHERE status = 'active' 
    AND subscription_tier = 'premium';

-- Check for index-only scans in query plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT username, email, created_at
FROM users 
WHERE status = 'active' 
    AND subscription_tier = 'premium';

/*
Look for "Index Only Scan" in the output:
-> Index Only Scan using idx_users_status_covering
     Index Cond: ((status = 'active') AND (subscription_tier = 'premium'))
     Heap Fetches: 0  -- This means no table access needed!
*/
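
One caveat: index-only scans rely on the table's visibility map, so Heap Fetches creep back up on heavily updated tables until vacuum runs. Keeping autovacuum healthy (or vacuuming manually after bulk changes) preserves the benefit:

-- Refresh the visibility map so index-only scans can skip heap fetches
VACUUM (ANALYZE) users;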

Hash Indexes - For Exact Matches Only

-- Hash indexes are perfect for equality comparisons
-- But they don't support range queries or sorting

-- Good for exact lookups on high-cardinality columns
CREATE INDEX idx_users_email_hash ON users USING HASH (email);
-- (assuming the events table also carried a session_id column)
CREATE INDEX idx_events_session_hash ON user_events USING HASH (session_id);

-- These queries benefit from hash indexes
SELECT * FROM users WHERE email = 'maya@coffee.dev';
SELECT * FROM user_events WHERE session_id = 'abc123def456';

-- These queries CANNOT use hash indexes
SELECT * FROM users WHERE email LIKE 'maya%';  -- Pattern matching
SELECT * FROM user_events WHERE session_id > 'abc';  -- Range query

Advanced Indexing Strategies

Expression Indexes - For Computed Values

-- Index on expressions for queries that use functions or calculations

-- Index for case-insensitive email searches
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

-- Index for extracting date parts (prefer a plain range on created_at where
-- possible - see Pitfall 1 below - but this helps genuine month-of-year grouping)
CREATE INDEX idx_events_month ON user_events(EXTRACT(MONTH FROM created_at));

-- Index for JSON field lookups: a B-tree on the extracted text value,
-- matching the ->> comparison used below
CREATE INDEX idx_events_metadata_page ON user_events((metadata->>'page'));

-- Index for calculated values
CREATE INDEX idx_events_duration_bucket 
ON user_events((CASE 
    WHEN session_duration < 30 THEN 'short'
    WHEN session_duration < 300 THEN 'medium'
    ELSE 'long'
END));

-- These queries can use the expression indexes
SELECT * FROM users WHERE LOWER(email) = 'maya@coffee.dev';
SELECT COUNT(*) FROM user_events WHERE EXTRACT(MONTH FROM created_at) = 12;
SELECT * FROM user_events WHERE metadata->>'page' = '/dashboard';
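
For JSON predicates richer than simple field equality, the usual tool is a GIN index over the whole column, which serves containment queries:

-- GIN over the full jsonb column supports the @> containment operator
CREATE INDEX idx_events_metadata_gin ON user_events USING GIN (metadata);
SELECT * FROM user_events WHERE metadata @> '{"page": "/dashboard"}';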

Multi-Column Statistics for Better Query Planning

-- PostgreSQL can create extended statistics for correlated columns
-- Helps the query planner make better decisions

-- Create extended statistics for correlated columns
CREATE STATISTICS stat_users_status_tier 
ON status, subscription_tier 
FROM users;

CREATE STATISTICS stat_events_user_type_time 
ON user_id, event_type, created_at 
FROM user_events;

-- Analyze tables to populate statistics
ANALYZE users;
ANALYZE user_events;

-- View statistics information
SELECT 
    schemaname,
    tablename,
    attname,
    n_distinct,
    correlation,
    most_common_vals,
    most_common_freqs
FROM pg_stats 
WHERE tablename IN ('users', 'user_events')
ORDER BY tablename, attname;
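
The extended statistics objects themselves live in the pg_statistic_ext catalog:

-- Confirm the extended statistics objects were created
SELECT stxname, stxkeys FROM pg_statistic_ext;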

Query Optimization Techniques

Query Rewriting for Better Performance

-- Original slow query with subquery
SELECT u.username, u.email
FROM users u
WHERE u.user_id IN (
    SELECT e.user_id 
    FROM user_events e 
    WHERE e.created_at >= '2024-01-01'
        AND e.event_type = 'purchase'
    GROUP BY e.user_id
    HAVING COUNT(*) > 5
);

-- Optimized version with JOIN
SELECT DISTINCT u.username, u.email
FROM users u
INNER JOIN (
    SELECT e.user_id
    FROM user_events e
    WHERE e.created_at >= '2024-01-01'
        AND e.event_type = 'purchase'
    GROUP BY e.user_id
    HAVING COUNT(*) > 5
) active_buyers ON u.user_id = active_buyers.user_id;

-- A window-function variant (compare plans - it is not always faster);
-- DISTINCT is needed because the join yields one row per matching event
SELECT DISTINCT username, email
FROM (
    SELECT 
        u.username,
        u.email,
        COUNT(*) OVER (PARTITION BY u.user_id) as purchase_count
    FROM users u
    INNER JOIN user_events e ON u.user_id = e.user_id
    WHERE e.created_at >= '2024-01-01'
        AND e.event_type = 'purchase'
) ranked
WHERE purchase_count > 5;

Avoiding Common Performance Pitfalls

-- PITFALL 1: Functions on indexed columns prevent index usage
-- BAD: Function on indexed column
SELECT * FROM user_events 
WHERE EXTRACT(YEAR FROM created_at) = 2024;

-- GOOD: Range query that can use index
SELECT * FROM user_events 
WHERE created_at >= '2024-01-01' 
    AND created_at < '2025-01-01';

-- PITFALL 2: Leading wildcards prevent index usage
-- BAD: Leading wildcard
SELECT * FROM users WHERE email LIKE '%@gmail.com';

-- GOOD: Use full-text/trigram search, or index the reversed string
-- (text_pattern_ops lets LIKE prefix patterns use the B-tree)
CREATE INDEX idx_users_email_reverse ON users(REVERSE(email) text_pattern_ops);
SELECT * FROM users WHERE REVERSE(email) LIKE REVERSE('%@gmail.com');
-- REVERSE('%@gmail.com') is 'moc.liamg@%', a prefix pattern the index can serve
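
-- ALTERNATIVE: a trigram index serves arbitrary wildcard patterns directly
-- (a sketch, assuming the pg_trgm extension is available)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_email_trgm ON users USING GIN (email gin_trgm_ops);
SELECT * FROM users WHERE email LIKE '%@gmail.com';  -- can now use the index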

-- PITFALL 3: OR conditions can be inefficient
-- BAD: OR conditions
SELECT * FROM user_events 
WHERE event_type = 'purchase' OR event_type = 'signup';

-- GOOD: Use IN operator
SELECT * FROM user_events 
WHERE event_type IN ('purchase', 'signup');

-- PITFALL 4: Data type mismatches
-- RISKY: PostgreSQL resolves this particular literal to BIGINT, but type
-- mismatches on joined columns or driver-supplied parameters can force
-- casts that defeat indexes
SELECT * FROM users WHERE user_id = '12345';

-- GOOD: Use the correct data type and keep joined column types aligned
SELECT * FROM users WHERE user_id = 12345;

-- PITFALL 5: SELECT * when you don't need all columns
-- BAD: Selecting unnecessary data
SELECT * FROM user_events 
WHERE user_id = 12345 
ORDER BY created_at DESC 
LIMIT 10;

-- GOOD: Select only what you need
SELECT event_type, created_at, session_duration 
FROM user_events 
WHERE user_id = 12345 
ORDER BY created_at DESC 
LIMIT 10;

Real-World Performance Monitoring

Setting Up Query Performance Monitoring

-- Enable query logging for slow queries
-- Add to postgresql.conf:
-- log_min_duration_statement = 1000  # Log queries taking > 1 second
-- log_statement = 'all'              # Logs every statement - useful briefly, floods logs in production
-- log_duration = on                  # Logs each statement's duration; log_min_duration_statement is usually enough
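
-- pg_stat_statements (used below) ships as an extension: preload it with
-- shared_preload_libraries = 'pg_stat_statements' (restart required),
-- then create it once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;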

-- Create a view to monitor slow queries
-- (pg_stat_statements columns below are for PostgreSQL <= 12; on 13+ they
--  are total_exec_time, mean_exec_time, max_exec_time, stddev_exec_time)
CREATE VIEW slow_queries AS
SELECT 
    query,
    calls,
    total_time,
    mean_time,
    max_time,
    stddev_time,
    (total_time / sum(total_time) OVER()) * 100 AS percentage_of_total
FROM pg_stat_statements
WHERE mean_time > 100  -- Queries averaging > 100ms
ORDER BY total_time DESC;

-- Monitor index usage effectiveness
CREATE VIEW index_usage_stats AS
SELECT 
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    idx_tup_read,
    idx_tup_fetch,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) as size,
    CASE 
        WHEN idx_scan = 0 THEN 'Never used'
        WHEN idx_scan < 10 THEN 'Rarely used'
        WHEN idx_scan < 100 THEN 'Sometimes used'
        ELSE 'Frequently used'
    END as usage_frequency
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- Find tables that might need indexes
CREATE VIEW tables_needing_indexes AS
SELECT 
    schemaname,
    relname AS tablename,
    seq_scan,
    seq_tup_read,
    seq_tup_read / GREATEST(seq_scan, 1) AS avg_rows_per_scan,
    pg_size_pretty(pg_total_relation_size(relid)) as table_size
FROM pg_stat_user_tables
WHERE seq_scan > 100  -- Tables with many sequential scans
    AND seq_tup_read / GREATEST(seq_scan, 1) > 1000  -- Reading many rows per scan
ORDER BY seq_tup_read DESC;
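
Querying these views on a schedule turns tuning into routine review rather than crisis response:

-- Weekly review: the biggest sequential-scan offenders first
SELECT * FROM tables_needing_indexes LIMIT 10;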

Automated Index Recommendation System

-- Create a function to analyze query patterns and suggest indexes
CREATE OR REPLACE FUNCTION suggest_indexes()
RETURNS TABLE(
    table_name TEXT,
    suggested_index TEXT,
    reason TEXT,
    estimated_benefit TEXT
) AS $$
BEGIN
    -- Sketch only: the query_analysis CTE shows how you might mine
    -- pg_stat_statements for predicates (it is not yet wired into the
    -- output); the suggestions below come from sequential-scan statistics
    RETURN QUERY
    WITH query_analysis AS (
        SELECT 
            qs.query,
            qs.calls,
            qs.total_time,
            qs.mean_time,
            -- Extract table names and WHERE conditions
            regexp_matches(qs.query, 'FROM\s+(\w+)', 'gi') as table_matches,
            regexp_matches(qs.query, 'WHERE\s+(\w+)', 'gi') as where_matches
        FROM pg_stat_statements qs
        WHERE qs.mean_time > 50  -- Focus on slower queries
    ),
    slow_table_scans AS (
        SELECT 
            st.relname AS tablename,
            st.seq_scan,
            st.seq_tup_read,
            st.seq_tup_read / GREATEST(st.seq_scan, 1) as avg_scan_size
        FROM pg_stat_user_tables st
        WHERE st.seq_scan > 50
    )
    SELECT 
        sts.tablename::TEXT,
        format('CREATE INDEX idx_%s_suggested ON %s (%s)', 
               sts.tablename, sts.tablename, 'column_name')::TEXT,
        'High sequential scan activity'::TEXT,
        format('Potential %sx improvement', 
               ROUND(sts.avg_scan_size / 1000.0, 1))::TEXT
    FROM slow_table_scans sts
    WHERE sts.avg_scan_size > 1000;
END;
$$ LANGUAGE plpgsql;

-- Run the analysis
SELECT * FROM suggest_indexes();

Performance Testing Framework

-- Create a performance testing framework
CREATE TABLE performance_tests (
    test_id SERIAL PRIMARY KEY,
    test_name VARCHAR(255) NOT NULL,
    query_text TEXT NOT NULL,
    expected_max_time_ms INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE performance_results (
    result_id SERIAL PRIMARY KEY,
    test_id INTEGER REFERENCES performance_tests(test_id),
    execution_time_ms NUMERIC,
    rows_returned INTEGER,
    buffers_hit INTEGER,
    buffers_read INTEGER,
    executed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Function to run performance tests
CREATE OR REPLACE FUNCTION run_performance_test(test_name_param VARCHAR)
RETURNS VOID AS $$
DECLARE
    test_record performance_tests%ROWTYPE;
    start_time TIMESTAMP;
    end_time TIMESTAMP;
    execution_time_ms NUMERIC;
    explain_output TEXT;
BEGIN
    -- Get test details
    SELECT * INTO test_record 
    FROM performance_tests 
    WHERE test_name = test_name_param;
    
    IF NOT FOUND THEN
        RAISE EXCEPTION 'Test % not found', test_name_param;
    END IF;
    
    -- Record start time
    start_time := clock_timestamp();
    
    -- Execute the query (in a more complex implementation, 
    -- you'd capture the actual results and explain output)
    EXECUTE test_record.query_text;
    
    -- Record end time
    end_time := clock_timestamp();
    execution_time_ms := EXTRACT(EPOCH FROM (end_time - start_time)) * 1000;
    
    -- Store results
    INSERT INTO performance_results (
        test_id, 
        execution_time_ms, 
        executed_at
    ) VALUES (
        test_record.test_id,
        execution_time_ms,
        CURRENT_TIMESTAMP
    );
    
    -- Check if performance meets expectations
    IF execution_time_ms > test_record.expected_max_time_ms THEN
        RAISE WARNING 'Performance test % failed: %ms > %ms expected', 
                     test_name_param, execution_time_ms, test_record.expected_max_time_ms;
    ELSE
        RAISE NOTICE 'Performance test % passed: %ms', test_name_param, execution_time_ms;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Add performance tests
INSERT INTO performance_tests (test_name, query_text, expected_max_time_ms)
VALUES 
    ('user_analytics_dashboard', 
     'SELECT u.username, COUNT(e.event_id) FROM users u JOIN user_events e ON u.user_id = e.user_id WHERE e.created_at >= CURRENT_DATE - INTERVAL ''30 days'' GROUP BY u.user_id, u.username LIMIT 100',
     100),
    ('recent_purchases',
     'SELECT * FROM user_events WHERE event_type = ''purchase'' AND created_at >= CURRENT_DATE - INTERVAL ''7 days'' ORDER BY created_at DESC LIMIT 50',
     50);

-- Run tests
SELECT run_performance_test('user_analytics_dashboard');
SELECT run_performance_test('recent_purchases');

-- Analyze test results over time
SELECT 
    pt.test_name,
    AVG(pr.execution_time_ms) as avg_time_ms,
    MAX(pr.execution_time_ms) as max_time_ms,
    MIN(pr.execution_time_ms) as min_time_ms,
    COUNT(*) as test_runs,
    pt.expected_max_time_ms
FROM performance_tests pt
JOIN performance_results pr ON pt.test_id = pr.test_id
WHERE pr.executed_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY pt.test_id, pt.test_name, pt.expected_max_time_ms
ORDER BY avg_time_ms DESC;

Index Maintenance and Optimization

Monitoring Index Health

-- Check for unused indexes (candidates for removal)
SELECT 
    i.schemaname,
    i.relname AS tablename,
    i.indexrelname AS indexname,
    i.idx_scan,
    pg_size_pretty(pg_relation_size(i.indexrelid)) as index_size,
    t.n_tup_ins + t.n_tup_upd + t.n_tup_del as table_writes
FROM pg_stat_user_indexes i
JOIN pg_stat_user_tables t USING (relid)
WHERE i.idx_scan < 10  -- Very rarely used
    AND pg_relation_size(i.indexrelid) > 1024 * 1024  -- Larger than 1MB
ORDER BY pg_relation_size(i.indexrelid) DESC;
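
-- Once an index is confirmed dead weight (check every replica - these
-- statistics are per server), drop it without blocking writes. Illustrative
-- example; DROP INDEX CONCURRENTLY cannot run inside a transaction block:
DROP INDEX CONCURRENTLY idx_users_covering;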

-- Check for duplicate or redundant indexes
WITH index_columns AS (
    SELECT
        ui.schemaname,
        ui.relname AS tablename,
        ui.indexrelname AS indexname,
        array_agg(a.attname ORDER BY a.attnum) as columns,
        pg_relation_size(ui.indexrelid) as index_size
    FROM pg_stat_user_indexes ui
    JOIN pg_index i ON ui.indexrelid = i.indexrelid
    JOIN pg_attribute a ON a.attrelid = i.indrelid 
        AND a.attnum = ANY(i.indkey)
    GROUP BY ui.schemaname, ui.relname, ui.indexrelname, ui.indexrelid
)
SELECT 
    ic1.schemaname,
    ic1.tablename,
    ic1.indexname as index1,
    ic2.indexname as index2,
    ic1.columns,
    pg_size_pretty(ic1.index_size) as index1_size,
    pg_size_pretty(ic2.index_size) as index2_size
FROM index_columns ic1
JOIN index_columns ic2 ON ic1.schemaname = ic2.schemaname 
    AND ic1.tablename = ic2.tablename
    AND ic1.indexname < ic2.indexname  -- Avoid listing each pair twice
WHERE ic1.columns = ic2.columns       -- Same column set
    OR ic1.columns <@ ic2.columns     -- ic1's columns all appear in ic2
    OR ic2.columns <@ ic1.columns;    -- ic2's columns all appear in ic1

-- Monitor index bloat using the pgstattuple extension (size arithmetic alone
-- is unreliable; pgstatindex reports actual leaf density per index)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT 
    i.schemaname,
    i.relname AS tablename,
    i.indexrelname AS indexname,
    pg_size_pretty(pg_relation_size(i.indexrelid)) as index_size,
    ROUND((100 - s.avg_leaf_density)::NUMERIC, 2) as estimated_bloat_percentage
FROM pg_stat_user_indexes i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am ON am.oid = c.relam AND am.amname = 'btree',  -- pgstatindex is B-tree only
LATERAL pgstatindex(i.indexrelid) s
WHERE pg_relation_size(i.indexrelid) > 1024 * 1024  -- Larger than 1MB
ORDER BY pg_relation_size(i.indexrelid) DESC;

Automated Index Maintenance

-- Create maintenance procedures
-- NOTE: REINDEX ... CONCURRENTLY refuses to run inside a transaction block,
-- so it cannot be EXECUTEd from within a plpgsql function. This helper
-- generates the commands for an external job to execute instead.
CREATE OR REPLACE FUNCTION maintenance_reindex_commands()
RETURNS TEXT AS $$
DECLARE
    index_record RECORD;
    commands TEXT := '';
BEGIN
    -- Flag large indexes for rebuild (size is a crude proxy here; for a
    -- real bloat check, use the pgstatindex query above)
    FOR index_record IN 
        SELECT schemaname, indexrelname
        FROM pg_stat_user_indexes
        WHERE pg_relation_size(indexrelid) > 10 * 1024 * 1024  -- > 10MB
    LOOP
        commands := commands || format('REINDEX INDEX CONCURRENTLY %I.%I;',
                                       index_record.schemaname,
                                       index_record.indexrelname) || E'\n';
    END LOOP;
    
    -- Refresh planner statistics after the rebuilds
    commands := commands || 'ANALYZE;' || E'\n';
    
    RETURN commands;
END;
$$ LANGUAGE plpgsql;

-- Schedule maintenance (typically via cron or pg_cron)
-- SELECT maintenance_reindex_commands();
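
If pg_cron is available, light maintenance can be scheduled from inside the database; the generated REINDEX commands still need an external driver, since they refuse to run in a transaction (job name and schedule here are illustrative):

-- Nightly statistics refresh via pg_cron
SELECT cron.schedule('nightly-analyze', '0 3 * * *', $$ANALYZE$$);

-- Run the generated REINDEX commands from a shell job:
--   psql -Atc "SELECT maintenance_reindex_commands()" | psql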

Final Thoughts: Indexes as Performance Multipliers

That 45-second-to-80-millisecond optimization taught me that database performance isn't magic - it's about understanding how your database thinks and giving it the tools it needs to work efficiently. Every query tells a story, and execution plans are the database's way of showing you exactly what's happening under the hood.

The key insights that transformed my approach to database optimization:

  1. Measure before optimizing - Use EXPLAIN ANALYZE religiously
  2. Index strategically - More indexes aren't always better
  3. Consider query patterns - Design indexes for how you actually query data
  4. Monitor continuously - Performance degrades over time without maintenance
  5. Test everything - What works in development might not work in production

Whether you're building a startup's first database or optimizing queries for millions of users, understanding indexing and query optimization will make you a more effective developer. The difference between a sluggish application and a responsive one often comes down to a few well-placed indexes and thoughtfully written queries.

Remember: premature optimization is the root of all evil, but premature pessimization is even worse. Start with good practices, measure everything, and optimize based on real data.


Currently writing this from Victrola Coffee on Capitol Hill, where I'm optimizing a complex analytics query while enjoying my usual cortado. Share your database optimization victories @maya_codes_pnw - we've all had those magical moments when a query goes from minutes to milliseconds! 🚀☕
