Partitioned Parallel Hash Join in Parallel Database / Parallel Hash Join Technique
Partitioned Parallel Hash Join
The Sequential Hash Join technique discussed in the post Hash Join technique in DBMS can be parallelized. .
Please go through the post Hash Join technique in DBMS for better understanding of Parallel Hash-Join.
Assume that we have,
· n processors, P0, P1, …, Pn-1
· two tables, r and s (r and s are already partitioned into disks of n processors),
· s is the smaller table
Parallel Hash-Join Algorithm:
Step 1: Choose a hash function h1. h1 should be able to take the join attribute value of each tuple of relations r and s, and maps them to one of the n processors.
Step 2: Smaller relations s is taken as the Build relation.
Step 3: Each processor Pi reads the tuples of smaller relation s in it’s disk Di, and sends them to different processors based on the hash function h1.
Step 4: On receiving the records of si at every destination processor Pi, the Pi further partitions the tuples using another hash function h2. This hash function is used for joining the tuples locally at every processor. This step is independent for any processor.
Step 5: On completion of the distribution of all the records of smaller relation s, it is now the turn for the larger relation r. Step 3 is done for all the records of relation r.
Step 6: On receiving the records of ri at every destination processor Pi, the Pi further partitions the tuples using another hash function h2. This phase is called the Probe Phase.
Step 7: At last, all the records which can be joined will be in same processors’ same partitions. Each processor Pi executes the build and probe phases of the hash–join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash–join.
Points to remember:
Hash Join at each processor is independent of the other.
It works for equi-join and natural join conditions