You are currently offline, waiting for your internet to reconnect

INF: Comparison of Join Techniques

This article was previously published under Q197297
This article has been archived. It is offered "as is" and will no longer be updated.
SUMMARY
Microsoft SQL Server version 7.0 includes some new join techniques, such asthe hash join and sort-merge join, to supplement nested-loop joins. Thisarticle describes these new techniques and compares and contrasts them.
MORE INFORMATION
Microsoft SQL Server version 6.5 and earlier had only one technique forjoining records: the nested loops join. SQL Server 7.0 implements hash andsort-based algorithms. The decision by the query engine on whether to usehash versus sort depends on relative sizes of the inputs, whether they aresorted or not, whether the hash table will fit entirely in memory, and soon. The optimizer will always choose the best join plan to execute yourquery, and that may be any one of the three options. The difference betweenhash joins and merge joins is often small, but there are certain situationswhere one is better than the other. There are considerable differencesbetween these two join types and the nested loops join, as shown below. Theexample in this article is provided only as an illustration. The table atthe end of this article is an illustration of some decisions that mayaffect the join type used. It is generally recommended that you not forcejoin types, overriding the optimizer.

Nested Loop Joins

In a nested loops join, one of the tables is scanned; for every row in thattable, a lookup is performed on the second table and the matching rows areretrieved. We will call these tables outer and inner input. The basicalgorithm is to scan the inner input once for each value of outer input,looking for a match. Depending on the presence of indexes on the innerinput the lookup can run quite efficiently. Indexed nested loops join isobviously more efficient than non-indexed nested loops. Also, if the tableshave a 1-1 relationship, the scan of the inner table can stop as soon asthe row is found that matches the row in the outer table.

Any query can be processed using this methodology including joins oninequality conditions.

Nested loops example:
SELECT * FROM reserves AS r1 JOIN sailors AS s1 ON r1.sid=s1.sid option				
(loop join)
Reserves has 529 pages and 100000 rows
Sailors has 195 pages 40000 rows

Table 'sailors'. Scan count 100000, logical reads 207596, physical reads 3,read-ahead reads 0.
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read-ahead reads 0.
 |--Nested Loops(Inner Join)       |--Sort(ORDER BY:([r1].[sid] ASC))       |    |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))       |--Clustered Index Seek(OBJECT:([Northwind].[dbo].[sailors].[I2] AS[s1]), SEEK:([s1].[sid]=[r1].[sid]) ORDERED)				

Hash Joins

In a hash match, a hash table is created in memory for one of the inputs(build input) using a repeatable randomizing function and this table issearched for matching values from the other input (probe input). The hashtable performs like an index. The data from the build input is stored inmemory. If the build input does not fit in memory it is partitionedrecursively until it fits in memory. The probe input is then hashed andused to search for matches. Unlike with nested loops, the presence orabsence of indexes is not particularly a concern in this case. Hash joinsare CPU-intensive in comparison to nested loops and are affected byavailable memory. Hash joins are better when there is a significantdifference in the sizes of the tables being joined.

In a hash join the build input determines the recursion depth because itcan stop partitioning when the input fits in memory. A single hash join canperform both grouping and join at the same time when the grouping attributeis also the join attribute. The result of this join is not in anyparticular order. Inequality conditions cannot be satisfied by this type ofjoin.
  |--Hash Match(Inner Join, HASH:([s1].[sid])=([r1].[sid]),RESIDUAL:([s1].[sid]=[r1].[sid]))       |--Clustered Index Scan(OBJECT:([Northwind].[dbo].[sailors].[I2] AS [s1]))       |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))				
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read-ahead reads 0.
Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read-ahead reads 0.

Sort-Merge Join

In a sort-merge, sorting the larger input dominates a large portion of thetotal cost. In a merge join, the inputs are sorted on the join column andthe sorted pair is merged into one, eliminating the column values that donot match from the sorted result set.

The algorithm first examines the tuple of one input, scans the secondsorted input until a row with equal or larger join value is found. Match uptuple in R1 with all tuples in R2 with matching join value. Repeat for alltuples in R1.

Merge join requires that both inputs be in memory (as opposed to a hashjoin, where only the smaller input is in memory). The merge join isobviously more efficient if the inputs are already sorted. Also the resultof the join is sorted on the join attribute. Inequality conditions cannotbe satisfied by this type of join.
  |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([r1].[sid])=([s1].[sid]),RESIDUAL:([s1].[sid]=[r1].[sid]))       |--Sort(ORDER BY:([r1].[sid] ASC))       |    |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))       |--Clustered Index Scan(OBJECT:([Northwind].[dbo].[sailors].[I2] AS s1]), ORDERED)				

Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read-ahead reads 0.
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read-ahead reads 0.

By changing the size of the input such that the relative sizes of theinputs vary, (in our example reserves had only 500 rows and sailors 50,000rows) merge join for the same query as above took much longer than hash ascan be seen below:

Merge join:
SQL Server Execution Times:
CPU time = 1081 ms, elapsed time = 2211 ms.

Hash join:
SQL Server Execution Times:
CPU time = 181 ms, elapsed time = 479 ms.

If an index exists on the join key, the data does not need to be sorted andtherefore a merge join will be the best choice.
==========================================================================Property                 |  Nested loop   | Hash or merge=========================|=============================================Only limited memory      | Better         |Worse,                                          |memory is needed for sorting                                          |or building hash tableNeed first row fast      | Better         |Worse,                                          |need to sort or hash                                          |before a row is returnedSelecting only a few     | Better         |Worserows from inner table    |                |Parallelism              |Works best      |Will workIndex available on inner |Works best      |Will workWithout one equality     |Works           |Needs at least 1 equality2 Very large inputs      |Worse           |Not much better,                                          |hash and sort will have                                          |almost equal costOne very small           |Worse           |Hash better than mergeand large input          |                |				
prodsql Comparison optimizer
Properties

Article ID: 197297 - Last Review: 12/05/2015 09:58:14 - Revision: 1.1

Microsoft SQL Server 7.0 Standard Edition

  • kbnosurvey kbarchive kbinfo KB197297
Feedback