Please visit, subscribe and share 10 Minutes Lectures in Computer Science

## Comparison of Data Partitioning Strategies with Example:

 ENo EName DOB Designation Salary E101 Kumar 14-JUN-1980 Lecturer 10002 R1 E102 Melvin 12-MAR-1975 AP 23000 R2 E103 Ram 10-MAR-1978 AP 22051 R3 E107 Jacob 28-MAY-1970 Professor 30011 R4 E109 Nisha 10-AUG-1981 Lecturer 9031 R5 E105 Shaik 20-DEC-1971 Professor 29002 R6 E104 Vinay 15-SEP-1982 Attender 4004 R7 E106 Menon 30-AUG-1978 AP 21016 R8
Table 1 - Employee

Let us suppose a table Employee with the schema Employee(ENo, EName, DOB, Designation, Salary). Table 1 shows Employee table with sample data. We assume that Shared Nothing Parallel Architecture is used. For partitioning this table using Hash or Range partitioning we need one or more attributes as partitioning attributes. Let us choose Salary as partitioning attribute.
The partitioning attribute salary is represented in the following figures inside big rectangle with all these ranges in small rectangles. The big rectangle means the whole table (Employee). Consider the salary values differently for different partitioning. That is, treat the values as 100 to MAX for Round-robin and Hash partitioning (not as group of values instead treat as one group), and as different vectors for Range partitioning in the following discussions. The same representation is used for all the figures (Round-robin, Hash, and Range partitioning).
Note: We cannot physically partition a table using more than one partitioning technique. We have taken this example to explain with different partitioning techniques’ advantages and disadvantages.

Round-robin technique:
In Round-robin technique, we use the function i mod n to distribute data into multiple disks. Here, variable i represents the position (physical order) of the record in the actual table. Variable n represents the number of target disks into which the table should be partitioned.
For our table Employee, Figure 1 shows the data distribution using Round-robin partitioning. Here, you would find distribution links from every small rectangle to every disk. That means, all the ranges are available with all the disks. For example, the records of Table 1, according to the function i mod 4, will be sent to the 4 disks as follows;
Disk 0    : R4, R8 (i.e, records with salary values 30011, and 21016 respectively)
Disk 1    : R1, R5 (i.e, records with salary values 10002, and 9031 respectively)
Disk 2    : R2, R6 (i.e, records with salary values 23000, and 29002 respectively)
Disk 3    : R3, R7 (i.e, records with salary values 22051, and 4004 respectively)
By looking at the sample data and the distribution to different disks, it is clear that the distribution is not even distribution.
 Figure 1 - Data distribution in Shared Nothing Parallel Architecture using Round-robin technique

Distribution is equal, so load is equal (all the disks have equal amount of records)
Good for sequential scan.
Hard to answer range queries because of uneven distribution. For this, let us consider the following query;
SELECT * FROM Employee WHERE Salary BETWEEN 10000 AND 20000;
The above query is a Range query. This query demands all the records of Employee table for which the salary value falls between 10002 and 20000. In our case, the distribution is not based on attribute; it was actually based on the record position. Hence, all the salary values falls in the given range may be in all the disks. So, we need complete scan on all the disks for answering this query.
Hard to answer point queries. Consider another query;
SELECT * FROM Employee WHERE EName = ‘Murugan’;
This query is a Point query. Query needs the records of employee ‘Murugan’. Again, for this query too we cannot identify the exact disk where the records with the specified values would be stored. Hence, we need full scan on all the disks to answer this query.

Hash Partitioning:
Hash partitioning uses Hash function. Let us use the following hash function on partitioning attribute Salary.
h (attribute mod n) = h (salary mod 4) = h(4004 mod 4) = 0 à Disk 0
For this function, the records will be distributed as follows;
Disk 0    : R2, R7, R8
Disk 1    :
Disk 2    : R1, R6
Disk 3    : R3, R4, R5
In Hash partitioning, we have uneven distribution. It is shown in Figure 2.
 Figure 2 - Data distribution in Shared Nothing Parallel Architecture using Hash Partitioning technique

Sequential can be done efficiently by exploiting all the processors in parallel. For example, the query,
SELECT * FROM Employee ORDER BY Phone;
can be executed in parallel using all the 4 processors.
Best suited for point queries. Consider the following query; (point query on partitioning attribute Salary)
SELECT * FROM Employee WHERE Salary = 12000;
This query can be executed efficiently by only one of the available processors. The reason is, on applying the hash function h(salary mod 4) which is  h(12000 mod 4), will point to disk 0 where the actual record is stored. During this time we can exploit other processors for executing something else. That is, many numbers of simple transactions can be executed in parallel by various processors, which lead to higher Transaction Throughput.
Hard to answer range queries. Consider the query,
SELECT * FROM Employee WHERE Salary BETWEEN 10000 AND 20000;
For this query, the salary in the range 10000 to 20000 will be distributed to almost all the partitions. For example, 10000 mod 4 will point Disk 0, 10001 mod 4 will point Disk 1, and so on. Hence, this query cannot be executed at a particular processor. Instead, it has to be executed by all the processors in parallel.
Hard to answer point queries on non-partitioning attributes. Consider the following query;
SELECT * FROM Employee WHERE EName = ‘Murugan’;
Our table is partitioned using Salary as the partitioning attribute using Hash partitioning. Hence, EName is the non-partitioning attribute. So, the queries with Salary in the WHERE clause can be efficiently executed at one processor, all the other queries which do not have Salary in WHERE clause must be executed at the all the processors as we do not know in which partition the value ‘Murugan’ would be available.

Range Partitioning:
For Range partitioning, we need a range vector. For this we have chosen the following vector as range vector;
v [5000, 10000, 25000]
These vector represents 4 partitions, viz, 5000 and less, 10000 and less, 25000 and less, and the other salary values more than 25000. For this range vector, our data will be distributed as follows;
Disk 0    : R7 (only record 7 is <=5000)
Disk 1    : R5 (only record 5 is >5000 and <=10000)
Disk 2    : R2, R3, R8 (the range >10000 and <=25000)
Disk 3    : R4, R6 (the range >25000)
Figure 3 shows the data distribution in Range partitioning.
 Figure 3 - Data distribution in Shared Nothing Parallel Architecture using Range Partitioning technique

Good for sequential scan. For example, the query,
SELECT * FROM Employee ORDER BY Phone;
can be executed efficiently by all the processors in parallel.
Point queries can be executed efficiently as in the case of Hash partitioning. For example, the query,
SELECT * FROM Employee WHERE Salary = 12000;
need to be executed by only one of all the available processors. The reason is, the salary value 12000, in our case, belongs to Disk 2. Hence, we need to scan only the records at Disk 2.
Range queries can be executed efficiently compared to other partitioning techniques. For example, the query,
SELECT * FROM Employee WHERE Salary BETWEEN 10001 AND 20000;
This query needs one to few disks to be scanned based on the values specified in the WHERE clause. For our query, we need to scan only Disk 2. Because, the range 10001 and 20000 stored in Disk 2 only.