Sunday, April 18, 2010

Materialization in PostgreSQL 9.0

Some plan types, such as nested loops and, to a lesser degree, merge joins, rely on the ability to "rescan" their inner input. Rescanning basically means backing up: the executor asks the target node to produce the same output tuples it has already spit out all over again - either all of them, or just a specified portion. This can be expensive: if the plan that produced those tuples has a high cost associated with it, we'd very much prefer not to do it more than once.

Enter Materialize. When considering a Nested Loop or Merge Join, the planner considers inserting a "Materialize" node atop the inner plan. This node stores a copy of the tuples generated by the inner plan in an in-memory buffer or in a temporary file, making it very cheap to return them over again if needed.
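As a sketch of what this looks like in practice (the table and column names here are made up, and the exact plan you get will depend on your data, statistics, and settings), a nested loop with a materialized inner side shows up in EXPLAIN output something like this:

```sql
-- Hypothetical example: join two tables with no usable index on the
-- join column, so the planner picks a nested loop and materializes
-- the inner scan rather than rescanning bar for every row of foo.
EXPLAIN SELECT * FROM foo JOIN bar ON foo.x = bar.x;

--  Nested Loop
--    Join Filter: (foo.x = bar.x)
--    ->  Seq Scan on foo
--    ->  Materialize
--          ->  Seq Scan on bar
```

The Materialize node fills its tuplestore the first time the inner side is scanned; every subsequent rescan reads from the tuplestore instead of going back to the underlying plan.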

In versions 8.4 and prior, the logic for inserting Materialize nodes when considering Nested Loop plans was thoroughly bogus. The planner had no idea that rescanning a Materialize node would be any cheaper than scanning it the first time - which is pretty horrible when you consider that this is precisely the point of having Materialize nodes. In 9.0, the logic has been extensively rewritten. Based on the limited testing I've done so far, 9.0 will materialize the inner side of a nested loop a great deal more frequently than 8.4, and most of those cases seem to be wins.

It's probably not too surprising that if the node being rescanned is, say, a join, it's faster to reread the tuples from a buffer than it is to redo the whole join. Similarly, if you're rescanning a base table with a filter condition - e.g. find all the rows in table foo where x = 1 - it's faster to save the rows that meet the condition than it is to rescan the whole table and test the condition over again for each row. Amazingly, however, at least in some cases, it seems to be faster to use materialization even when rescanning a base table WITHOUT a filter condition. In other words, if we need to read all the rows in table foo three times, it's at least sometimes faster to read the table once, dump all the tuples into a tuplestore, and then reread the tuplestore twice more than it is to read the base table three times.
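The filtered-rescan case above can be sketched the same way (again, hypothetical tables; the plan shown is representative rather than guaranteed). With a selective filter on the inner side, materializing means the filter is evaluated once rather than once per outer row:

```sql
-- Suppose bar is large but few rows satisfy y = 1. Materializing the
-- filtered result stores only the matching rows; rescans then read
-- that small tuplestore instead of re-filtering all of bar.
EXPLAIN SELECT * FROM foo JOIN bar ON foo.x = bar.x WHERE bar.y = 1;

--  Nested Loop
--    Join Filter: (foo.x = bar.x)
--    ->  Seq Scan on foo
--    ->  Materialize
--          ->  Seq Scan on bar
--                Filter: (y = 1)
```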

How is that possible, you ask? I haven't played with this enough to be sure, but I suspect the win comes from eliminating tuple visibility checks and other overhead.

Even though this isn't the sort of feature that tends to make the headlines, it's a fairly significant adjustment to the planner, so I expect to find some lurking problems (a few have been found and fixed already). It may be that the planner is now too aggressive at materializing, whereas before it seems it wasn't aggressive enough. Or it may turn out that materializing increases query memory usage too much, or just isn't faster in some as-yet-unknown set of circumstances. On the whole, though, the new logic is far more sensible than the old logic, so I'm hopeful that we'll see some modest performance improvements out of this change once the dust settles.
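If you suspect the planner is materializing when it shouldn't (or want to compare the two strategies yourself), 9.0 exposes an enable_material setting that discourages the planner from using Materialize nodes. A quick way to compare plans, using a hypothetical query:

```sql
-- Discourage materialization for this session only, re-plan the
-- query, and compare against the default plan.
SET enable_material = off;
EXPLAIN ANALYZE SELECT * FROM foo JOIN bar ON foo.x = bar.x;
RESET enable_material;
```

Like the other enable_* parameters, this is a debugging aid rather than something to set permanently in production.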

2 comments:

  1. It took me time to read the whole article. The article is great, but the comments bring more brainstorming ideas - thanks.

    - Johnson

  2. "...it's at least sometimes faster to read the table once, dump all the tuples into a tuplestore, and then reread the tuplestore twice more than it is to read the base table three times. How is that possible, you ask?"

    A possible reason could be projection: if only a subset of the attributes are required above the Materialize node, there is no point in materializing the other attributes as well, i.e., the materialized tuples are smaller. Reading the smaller materialized tuplestore is faster than scanning the larger table.
