The power of the index

A discussion came up on IRC today: someone was asking why Postgres was executing a query the way it was. The person had run the query through EXPLAIN ANALYZE, and the planner's output showed how the query was going to be executed.

The query was as follows:

SELECT COUNT(*) FROM one, two
WHERE one.id = two.id AND d >= 56 AND d <= 69 AND a > 110;

The planner explained the query plan thus:

Aggregate (cost=1365.77..1365.77 rows=1 width=0) (actual time=5.250..5.250 rows=1 loops=1)
  -> Nested Loop (cost=0.00..1365.12 rows=250 width=0) (actual time=0.034..5.181 rows=230 loops=1)
    -> Seq Scan on two (cost=0.00..180.00 rows=391 width=4) (actual time=0.014..3.737 rows=388 loops=1)
      Filter: (a > 110)
    -> Index Scan using one_id_idx on one (cost=0.00..3.02 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=388)
      Index Cond: (one.id = "outer".id)
      Filter: ((d >= 56) AND (d <= 69))
Total runtime: 5.339 ms

So, what’s happening? The query planner has done one sequential scan over table two and 388 probes of the index on table one, one for each row of two that passed the filter (the loops=388 in the plan).
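
To make the shape of that plan concrete, here is a minimal Python sketch of a nested loop join with an inner index lookup. The in-memory tables and the sorted-list "index" are made-up stand-ins for illustration, not how Postgres actually stores anything:

from bisect import bisect_left

# Hypothetical in-memory stand-ins for the two tables (not Postgres internals)
one = [{"id": i, "d": i % 100} for i in range(6544)]
two = [{"id": i * 3, "a": 100 + (i % 30)} for i in range(2000)]

# The "index" on one.id: a sorted list of keys plus the row positions they map to
one_id_idx = sorted((row["id"], pos) for pos, row in enumerate(one))
keys = [k for k, _ in one_id_idx]

count = 0
for outer in two:                        # Seq Scan on two
    if not outer["a"] > 110:             # Filter: (a > 110)
        continue
    i = bisect_left(keys, outer["id"])   # Index Scan using one_id_idx
    if i < len(keys) and keys[i] == outer["id"]:
        inner = one[one_id_idx[i][1]]
        if 56 <= inner["d"] <= 69:       # Filter: ((d >= 56) AND (d <= 69))
            count += 1                   # this row feeds the COUNT(*) aggregate

print(count)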

The question is, why did the planner choose that as the best way to implement the query?

This is where understanding an index is so important.

First I will explain why the query planner chose this route; then I will explain a bit more about indexes.

In the first place, the query planner had to create a subset of each table: rows from table one had to have field d between 56 and 69, and rows from table two had to have field a greater than 110.

That created two sets with the following row counts:

SELECT COUNT(*) FROM one WHERE d >= 56 AND d <= 69;
 count
----------
   6544
(1 row)

SELECT COUNT(*) FROM two WHERE a > 110;
 count
----------
    388
(1 row)

Next, the join: the part of the query that says

WHERE one.id = two.id

There are two ways to run the same algorithm: every item of one set must be looked up in the other set, and either set can be the one iterated over.

The cost of the query will be

O(n)*O(log2(m))

The reason for the cost: iterating over every item in one set is O(n) (where n is the number of items in that set).

Searching for each of those items in the other set is O(log2(m)) (where m is the number of items in the second set).

A search is O(log2(n)) (where n is the number of rows in the set being searched) because the fastest general way we know to search is a binary search of a sorted list. We look at the midpoint of the sorted list; if our item is greater than the value held at the midpoint, we search the top half of the list, and if it is less, we search the bottom half. We repeat until we either find the item or there is nothing left to search. Because the number of items being searched is halved each time, the total number of comparisons is, in the worst case, log2(n): the logarithm of n to base 2.
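
Written out as code, that halving procedure looks like this (a plain binary search over a sorted Python list, purely as illustration):

def binary_search(items, target):
    # Return the position of target in the sorted list items, or None if absent
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2        # look at the midpoint
        if items[mid] == target:
            return mid              # found the item
        if target > items[mid]:
            lo = mid + 1            # search the top half
        else:
            hi = mid - 1            # search the bottom half
    return None                     # nothing left to search

print(binary_search([2, 3, 5, 7, 11, 13], 7))   # prints 3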

An index is a sorted list (and that’s all an index is), so searching an index is going to cost O(log2(n)) (where n is the number of rows in the set being searched).

The planner has chosen to iterate over every item in the smaller set (the 388 rows from table two) and to search the larger set (the 6544 rows from table one) for each of them.

The reason why can be seen in the following math.

Search the larger set for the keys in the smaller set:

log2(6544) = 12.67596, rounded up to 13 (we cannot do a partial operation)
multiplied by the number of searches:
388 * 13 = 5044

Whereas searching the smaller set for the items in the larger set:

log2(388) = 8.5999, rounded up to 9 (we cannot do a partial operation)
multiplied by the number of searches:
6544 * 9 = 58896

If the planner had iterated over every item in the larger set, searching the smaller set, there would have been (worst case) 58896 operations.

However, by iterating over every item in the smaller set, there are only (worst case) 5044 operations.
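
That arithmetic is easy to check, for example in Python:

import math

# Iterate over the smaller set (388 rows), probing the larger set (6544 rows)
print(math.log2(6544))                    # 12.67596...
print(388 * math.ceil(math.log2(6544)))   # 388 * 13 = 5044

# Iterate over the larger set (6544 rows), probing the smaller set (388 rows)
print(math.log2(388))                     # 8.5999...
print(6544 * math.ceil(math.log2(388)))   # 6544 * 9 = 58896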

Summary

By having an index to search, what might have been a full table scan, O(n), becomes a binary search, O(log2(n)).
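
To put concrete numbers on that, using the 6544-row table from this example (illustrative arithmetic only):

import math

n = 6544
print(n)                         # full table scan: up to 6544 comparisons
print(math.ceil(math.log2(n)))   # binary search of the index: at most 13

Thirteen comparisons instead of six and a half thousand: that is the power of the index.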
