Computer Science and Engineering - Tutorials, Notes, MCQs, Questions and Answers: Parallel Database

Showing posts with label Parallel Database. Show all posts

Wednesday, March 12, 2014

Parallel Database - Intraquery Parallelism

Intraquery Parallelism

It is about executing a single query in parallel using multiple processors or disks. This can be done by dividing the data into many smaller units and execute the query on those smaller tables. We have so many queries which are complex and consume more time and resources. For example, consider the query given below;

SELECT * FROM Email ORDER BY Start_Date;

This query will sort the records of Email table in ascending order on the attribute Start_Date. Assume that the Email table has 10000 records. To sort the table we need at least 10000 comparisons of Salary attribute values, if the query is executed in single processor. If the same query is executed in several processors simultaneously, we can finish sorting in lesser time. For example, if 10000 records are distributed equally into 10 processors, 1000 each, then every processor can sort 1000 records in lesser time, later it can be combined/merged using other algorithms to produce final sorted result.

In Intraquery Parallelism, we have two types;

Parallel Database - Intraoperation Parallelism

Intra-operation Parallelism

It is about parallelizing a single relational operation given in a query.

SELECT * FROM Email ORDER BY Start_Date;

In the above query, the relational operation is Sorting. Since a table may have large number of records in it, the operation can be performed on different subsets of the table in multiple processors, which reduces the time required to sort.

Consider another query,

SELECT * FROM Student, CourseRegd WHERE Student.Regno = CourseRegd.Regno;

In this query, the relational operation is Join. The query joins two tables Student, and CourseRegd on common attribute Regno. Parallelism is required here if the size of tables is very large. Usually, order of tuples does not matter in DBMS. Hence, the tables arranged in random order needs every record of one table should be matched with every record of other table to complete the join process. For example, if Student has 10000 records and CourseRegd has 60000 records, then join requires 10000 X 60000 comparisons. If we exploit parallelism in here, we could achieve better performance.

There are many such relational operations which can be executed in parallel using many processors on subsets of the table/tables mentioned in the query. The following list includes the relational operations and various techniques used to implement those operations in parallel.

Parallel Sort

Parallel Join

Selection

DuplicateElimination

Projection

Aggregation

Tuesday, March 11, 2014

Parallel Sort - Range Partitioning Sort

Range Partitioning Sort

Assumptions:

Assume n processors, P₀, P₁, …, P_n-1 and n disks D₀, D₁, …, D_n-1.

Disk D_i is associated with Processor P_i.

Relation R is partitioned into R₀, R₁, …, R_n-1 using Round-robin technique or Hash Partitioning technique or Range Partitioning technique (if range partitioned on some other attribute other than sorting attribute)

Objective:

Our objective is to sort a relation (table) R_i that resides on n disks on an attribute A in parallel.

Steps:

Step 1: Partition the relations R_ion the sorting attribute A at every processor using a range vector v. Send the partitioned records which fall in the i^th range to Processor P_i where they are temporarily stored in D_i.

Step 2: Sort each partition locally at each processor P_i. And, send the sorted results for merging with all the other sorted results which is trivial process.

Point to note:

Range partition must be done using a good range-partitioning vector. Otherwise, skew might be the problem.

Example:

Let us explain the above said process with simple example.Consider the following relation schema Employee;

Employee (Emp_ID, EName, Salary)

Assume that relation Employee is permanently partitioned using Round-robin technique into 3 disks D₀, D₁, and D₂ which are associated with processors P₀, P₁, and P₂. At processors P₀, P₁, and P₂, the relations are named Employee0, Employee1 and Employee2 respectively. This initial state is given in Figure 1.

Figure 1

Assume that the following sorting query is initiated.

SELECT * FROM Employee ORDER BY Salary;

As already said, the table Employee is not partitioned on the sorting attribute Salary. Then, the Range-Partitioning technique works as follows;

Step 1:

At first we have to identify a range vector v on the Salary attribute. The range vector is of the form v[v0, v1, …, vn-2]. For our example, let us assume the following range vector;

v[14000, 24000]

This range vector represents 3 ranges, range 0 (14000 and less), range 1 (14001 to 24000) and range 2 (24001 and more).

Redistribute the relations Employee0, Employee1 and Employee2 using these range vectors into 3 disks temporarily. After this distribution disk 0 will have range 0 records (i.e, records with salary value less than or equal to 14000), disk 1 will have range 1 records (i.e, records with salary value greater than 14000 and less than or equal to 24000), and disk 2 will have range 2 records (i.e, records with salary value greater than 24000).

This redistribution according to range vector v is represented in Figure 2 as links to all the disks from all the relations. Temp_Employee0, Temp_Employee1, and Temp_Employee2, are the relations after successful redistribution. These tables are stored temporarily in disks D₀, D₁, and D₂. (They can also be stored in Main memories (M₀, M₁, M₂) if they fit into RAM).

Figure 2

Step 2:

Now, we got temporary relations at all the disks after redistribution.

At this point, all the processors sort the data assigned to them in ascending order of Salary individually. The process of performing the same operation in parallel on different sets of data is called Data Parallelism.

Final Result:

After the processors completed the sorting, we can simply collect the data from different processors and merge them. This merge process is straightforward as we have data already sorted for every range. Hence, collecting sorted records from partition 0, partition 1 and partition 2 and merging them will give us final sorted output.

TOPICS (Click to Navigate)

Wednesday, March 12, 2014

Parallel Database - Intraquery Parallelism

Intraquery Parallelism

IntraoperationParallelism

InteroperationParallelism

Parallel Database - Intraoperation Parallelism

Intra-operation Parallelism

Parallel Sort

Range-Partitioning Sort

Parallel External Sort-Merge

Parallel Join

Partitioned Join

Fragment and Replicate Join

Partitioned Parallel Hash Join

Parallel Nested-Loop Join

Selection

DuplicateElimination

Projection

Aggregation

Tuesday, March 11, 2014

Parallel Sort - Range Partitioning Sort

Range Partitioning Sort

Assumptions:

Objective:

Steps:

Point to note:

Example:

Featured Content

Multiple choice questions in Natural Language Processing Home

All time most popular contents

Report Abuse