Hash Join technique in DBMS

Hash Join Technique in DBMS / Simple Hash Join Technique / Sequential Hash Join in Database


Hash Join

Hash join is one type of joining techniques that are used to process a join query. Hash join is proposed for performing joins that are Natural joins or Equi-joins. There are several variants of hash joins, like Simple Hash Join, Partitioned Hash Join, and Hybrid Hash Join.

Simple Hash Join:

Input: The list of tables to be joined, say r and s and the joining attributes r.A and s.B of tables r and s.
Output: Joined relation.

Steps:

1. Identify a hash function such that the result of the function should point to one of the identified range of values (or buckets, say bucket 0, bucket 1, …, bucket 9 for example).
2. Identify the smaller relation among the relations that are to be joined, say r and s.
3. Partition the smaller relation, say r, into the identified buckets using the join attribute of r. That is, r is partitioned into available buckets as r0, r1, r2, …, r9. This step is called Building Phase (sometimes Partitioning Phase).
4. Partition the relation s into the identified buckets using the join attribute of s. That is, s is partitioned into available buckets as s0, s1, s2, …, s9. This step is called Probing Phase as the records of s probe for the matching records in the appropriate buckets.
5. Finally the result can be collected from different buckets and submitted as result.

How does simple Hash join work?

Example:

Consider the following two table instances for tables Employee(EmpID, EName, Dept) and Department(DeptNo, DName, DLocation);


EmpID
EName
DeptNo
E101
Kumar
2
E103
Wess
1
E107
Terry
1
E102
Gowri
2
E110
Morgan
3
E111
Sachin
3
Table 1 - Employee
DeptNo
DName
DLocation
1
Design
Chennai
2
Production
Chennai
3
Administration
Bangalore
Table 2 – Department
Step 1: Hash function looks like the one given below;
h(Hash_key) = Hash_key_value mod n
where n is the number of partitions (or buckets). Let us choose the DeptNo attribute as hash key and partition the relation into 3 partitions such as 0, 1, and 2. Then, our hash function will look like as follows;
h(DeptNo) = DeptNo mod 3
Step 2: The Department table is the smaller table.
Step 3: Partition Department using the hash function. The first record of Department will go into Bucket 1, second record into Bucket 2, and third record into Bucket 0.
DeptNo
DName
DLocation
1
Design
Chennai
2
Production
Chennai
3
Administration
Bangalore
Figure 1 - Hashing of Department table
 
Step 4: Now partition the larger relation Employee using the same hash function. That is use the following hash function;
h(DeptNo) = DeptNo mod 3
The figure given below shows the partitioning of Employee table into 3 buckets.

EmpID
EName
DeptNo
E101
Kumar
2
E103
Wess
1
E107
Terry
1
E102
Gowri
2
E110
Morgan
3
E111
Sachin
3
Figure 2 - Hashing of Employee table

After successful partitioning, our hash buckets will look the figure given below; in this figure, the first table shows 0th partition of Department table, and the second one shows the 0th partition of Employee table.


Figure 3 - Bucket status after hashing of all records of the tables to be joined

Carefully observe the data stored in every bucket. Bucket 0 stores records of zeroth partitions of both tables where the attribute DeptNo (joining attribute) is having the same value, i.e, 3. It is true for other partitions also. Hence, at this stage the join is trivial. This is why this phase of this joining technique is named Probing phase, where the Probe relation searches for the joining attribute value of other relation and gets joined. At this stage, the joining is trivial.
The major advantage is that, if we join with conventional join technique, then the comparison requires,
6 records X 3 records = 18 comparisons
When we use hash join, we need
(1 record X 2 records in Bucket 0)+ (1 X 2 in Bucket 1)+ (1 X 2 in Bucket 2) = 6 comparisons

Points to note:

1. Only Equi-join or Natural Join can be performed using Hash join
2. Simple hash join assumed one of the relations as small relation, i.e, the whole relation can fit into memory.
3. Smaller relation is chosen in the building phase.
4. Only a single pass through the tables is required. That is one time scanning of relation is required.
5. Hash function should be chosen such that it should not cause skewness.