## Parallel External Sort-Merge

#### Assumptions:

Assume n
processors, P

_{0}, P_{1}, …, P_{n-1}and n disks D_{0}, D_{1}, …, D_{n-1}.
Disk D

_{i}is associated with Processor P_{i}.
Relation R is partitioned into R

_{0}, R_{1}, …, R_{n-1}using Round-robin technique or Hash Partitioning technique or Range Partitioning technique (partitioned on any attribute)#### Objective:

Our objective
is to sort a relation (table) R

_{i}on an attribute A in parallel where R resides on n disks.#### Steps:

**Step 1:**Sort the relation partition R

_{i}which is stored on disk D

_{i}on the sorting attribute of the query.

**Step 2:**Identify a range partition vector v and range partition every R

_{i}into processors, P

_{0}, P

_{1}, …, P

_{n-1}using vector v.

**Step 3:**Each processor P

_{i}performs a merge on the incoming range partitioned data from every other processors (The data are actually transferred in order. That is, all processors send first partition into P0, then all processors sends second partition into P1, and so on).

**Step 4:**Finally, concatenate all the sorted data from different processors to get the final result.

#### Point to note:

Range
partition must be done using a good range-partitioning vector. Otherwise, skew
might be the problem.

### Example:

Let us
explain the above said process with simple example. Consider the following
relation schema Employee;

**Employee (Emp_ID, EName, Salary)**

Assume that
relation

**into 3 disks D***Employee is permanently partitioned using Round-robin technique*_{0}, D_{1}, and D_{2}which are associated with processors P_{0}, P_{1}, and P_{2}. At processors P_{0}, P_{1}, and P_{2}, the relations are named Employee0, Employee1 and Employee2 respectively. This initial state is given in Figure 1.Figure 1 - Partitioned Employee |

Assume that
the following sorting query is initiated.

SELECT
* FROM Employee ORDER BY Salary;

As already said,
the table Employee is not partitioned on the sorting attribute Salary. Then,
the Parallel External Sort-Merge technique works as follows;

#### Step 1:

Sort the data
stored in every partition (every disk) using the ordering attribute Salary. (Sorting
of data in every partition is done temporarily). At this stage every Employee

_{i }contains salary values of range minimum to maximum. The partitions sorted in ascending order is shown below, in Figure 2.Figure 2 - Partitioned Employee in ascending order on Salary attribute |

#### Step 2:

We have to
identify a range vector v on the Salary attribute. The range vector is of the
form

**v[v0, v1, …, vn-2]**. For our example, let us assume the following range vector;
v[14000,
24000]

This range
vector represents 3 ranges, range 0 (14000 and less), range 1 (14001 to 24000)
and range 2 (24001 and more).

Redistribute every
partition (Employee0, Employee1 and Employee2) using these range vectors into 3
disks temporarily. What would be the status of Temp_Employee 0, 1, and 2 after
distributing Employee 0 is given in Figure 3.

Figure 3 - Distribution of Employee 0 according to the range vector |

#### Step 3:

Actually, the
above said distribution is executed at all processors in parallel such that
processors P0, P1, and P2 are sending the first partition of Employee 0, 1, and 2 to disk 0. Upon receiving
the records from various partitions, the receiving processor P0
merges the sorted
data. This is shown in Figure 4.

Figure 4 - Merging data of range <=14000 at D0 from all the disks |

The above
said process is done at all processors for different partitions. The final
version of Temp_Employee 0, 1, and 2 are shown in Figure 5.

Figure 5 - Status of Temp_Employee 0, 1, and 2 after distributing all ranges from all disks |

#### Step 4:

The final
concatenation of sorted data from all the disks is trivial.