Optimizing a Min/Max Query in Postgres on a Table with Hundreds of Millions of Rows
As the amount of data stored in databases continues to grow, optimizing queries becomes increasingly important. In this article, we will explore how to optimize a grouped min/max query in Postgres that is backed by an index, on a table with hundreds of millions of rows.
Background
The problem statement involves a query that attempts to find the maximum value of a column after grouping over two other columns:
SELECT address,
token_id,
MAX(input_tx_time) AS last_tx_time
FROM processed.token_utxo
WHERE input_tx_time < date_trunc('day', current_timestamp)
GROUP BY address, token_id
LIMIT X;
The table processed.token_utxo has approximately 330 million rows and a composite B-tree index over all three columns:
CREATE INDEX idx_token_utxo ON processed.token_utxo USING btree (address, token_id, input_tx_time);
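Before reading plans, it is worth confirming how large the heap and the index actually are, since whether they fit in cache is what the later tuning revolves around. A minimal sketch, assuming only the object names from the statements above:
-- Approximate row count and on-disk sizes (names taken from the DDL above).
SELECT pg_size_pretty(pg_total_relation_size('processed.token_utxo')) AS table_plus_indexes,
       pg_size_pretty(pg_relation_size('processed.token_utxo'))       AS heap_only,
       pg_size_pretty(pg_relation_size('processed.idx_token_utxo'))   AS idx_token_utxo,
       (SELECT reltuples::bigint FROM pg_class
         WHERE oid = 'processed.token_utxo'::regclass)                AS estimated_rows;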
Running the query with a limit of 1,000,000 results in a plan that uses the index and completes in approximately 1 minute. However, running the query with a limit of 10,000,000 results in a plan that does not appear to use the index and completes in approximately 30 minutes.
The problem statement also mentions that running the query without a limit errors out with a "temp file reached max size" message. This tells us the slow plan is not merely slow: it sorts or hashes the entire table, spilling to disk until it hits temp_file_limit, instead of reading results incrementally from the index.
Understanding the Query Plan
To understand why the query plan changes when the limit increases, we need to examine the plans in detail. Both plans were shared on explain.depesz.com, which annotates every node with its row counts and timings.
From these plans, we can see that the faster plan is an index-only scan that returns 2,414,121 rows but still performs 642,780 heap fetches; in other words, roughly a quarter of the returned rows force a visit to the heap because their pages are not marked all-visible.
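To reproduce these numbers yourself, run the query under EXPLAIN (ANALYZE, BUFFERS): an index-only scan reports its Heap Fetches directly, and the output can be pasted into explain.depesz.com. A sketch with the smaller limit:
-- ANALYZE executes the query; BUFFERS adds shared and temp block counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT address,
       token_id,
       MAX(input_tx_time) AS last_tx_time
FROM processed.token_utxo
WHERE input_tx_time < date_trunc('day', current_timestamp)
GROUP BY address, token_id
LIMIT 1000000;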
Optimization Techniques
To optimize this query, we need to focus on two main areas:
1. Index Utilization
The issue here is that as the limit grows, Postgres switches from walking the index to scanning the whole table, sorting or hashing it, and computing the per-group maximum from scratch. The planner makes that switch because, once most of the groups are needed, it estimates that the random I/O and heap fetches of the index scan will cost more than a single sequential pass; that estimate depends heavily on the cost parameters and on how much of the index the planner believes is cached.
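A quick way to confirm that the planner, rather than the index, is the limiting factor is to compare both strategies in one session. Disabling sequential scans is a diagnostic trick only, not a production setting; the sketch below assumes the slow plan is a sequential scan feeding a sort or hash aggregate:
-- 1) Let the planner choose, then 2) discourage the sequential scan and compare.
EXPLAIN (ANALYZE, BUFFERS)
SELECT address, token_id, MAX(input_tx_time) AS last_tx_time
FROM processed.token_utxo
WHERE input_tx_time < date_trunc('day', current_timestamp)
GROUP BY address, token_id
LIMIT 10000000;

SET enable_seqscan = off;   -- planner hint for diagnosis only

EXPLAIN (ANALYZE, BUFFERS)
SELECT address, token_id, MAX(input_tx_time) AS last_tx_time
FROM processed.token_utxo
WHERE input_tx_time < date_trunc('day', current_timestamp)
GROUP BY address, token_id
LIMIT 10000000;

RESET enable_seqscan;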
Solution: Vacuuming the Table
One possible solution is to vacuum the table more aggressively. Vacuuming updates the visibility map, which is what allows an index-only scan to skip heap fetches, and it removes dead tuples from both the table and the index. A plain VACUUM (ANALYZE) usually suffices and does not block writes; VACUUM FULL additionally rewrites and shrinks the table on disk, but it holds an exclusive lock for the entire rewrite:
VACUUM (ANALYZE) processed.token_utxo;
-- VACUUM (FULL) processed.token_utxo;  -- only if the exclusive lock and full rewrite are acceptable
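To check whether vacuuming is actually the missing piece, look at how much of the table the visibility map already covers and when the table was last vacuumed. A sketch against the standard catalogs:
-- relallvisible vs. relpages shows visibility-map coverage; pg_stat_user_tables shows vacuum history.
SELECT c.relpages,
       c.relallvisible,
       round(100.0 * c.relallvisible / nullif(c.relpages, 0), 1) AS pct_all_visible,
       s.last_vacuum,
       s.last_autovacuum
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.oid = 'processed.token_utxo'::regclass;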
Additionally, we should make sure the table statistics are fresh and that random_page_cost and work_mem reflect the hardware:
ANALYZE processed.token_utxo;
SET random_page_cost = 1.0;   -- reasonable when the data lives on SSDs or is mostly cached
SET work_mem = '42598MB';     -- per-sort/per-hash budget; a value this large belongs in a single session, never in postgresql.conf
Lowering random_page_cost makes the index scan look cheaper to the planner, and a larger work_mem lets the sort or hash aggregate run in memory instead of spilling to temp files.
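SET changes the value for the rest of the session. For one-off experiments it is safer to scope such aggressive values to a single transaction with SET LOCAL; the sketch below uses a deliberately smaller, illustrative work_mem:
BEGIN;
SET LOCAL random_page_cost = 1.0;
SET LOCAL work_mem = '1GB';   -- illustrative value; size it to memory you can actually spare
EXPLAIN (ANALYZE, BUFFERS)
SELECT address, token_id, MAX(input_tx_time) AS last_tx_time
FROM processed.token_utxo
WHERE input_tx_time < date_trunc('day', current_timestamp)
GROUP BY address, token_id
LIMIT 10000000;
COMMIT;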
2. Estimated Plan Choice
Another issue is that the planner's cost estimates can be inaccurate on very large tables, so it may choose a plan that looks cheap on paper but performs badly in practice.
Solution: Adjusting effective_cache_size and temp_file_limit
Two settings are relevant here. effective_cache_size tells the planner how much memory Postgres and the operating system together have available for caching; a realistic value makes index scans look correspondingly cheaper. temp_file_limit caps how much temporary-file space a session may use; raising it does not change the plan, but it stops the un-limited query from failing with the "temp file reached max size" error.
ALTER SYSTEM SET effective_cache_size = '500MB';  -- illustrative; on a dedicated server this is typically 50-75% of RAM
ALTER SYSTEM SET temp_file_limit = '10GB';        -- bare numbers are interpreted as kB; -1 removes the limit entirely
With a realistic effective_cache_size, the planner's comparison between the index-only scan and the sequential scan becomes more trustworthy, and a higher temp_file_limit keeps the un-limited run from being cancelled while it sorts.
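ALTER SYSTEM only writes the values to postgresql.auto.conf; both parameters take effect after a configuration reload, which you can verify afterwards:
SELECT pg_reload_conf();   -- no restart needed for these two settings
SHOW effective_cache_size;
SHOW temp_file_limit;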
Conclusion
In this article, we explored how to optimize a grouped min/max query in Postgres on a table with hundreds of millions of rows. We discussed why the planner abandons the index as the limit grows, why vacuuming keeps index-only scans cheap, and how random_page_cost, work_mem, effective_cache_size, and temp_file_limit influence both the plan choice and its runtime.
While optimizing queries, it’s essential to consider both query plan details and system parameters to ensure that we are addressing the root cause of the issue.
By applying these techniques, you should be able to optimize your min/max query and achieve better performance on large tables.
Last modified on 2025-02-15