Computer Science and Engineering - Tutorials, Notes, MCQs, Questions and Answers

Saturday, March 15, 2014

Parallel Database - Parallel External Sort-Merge Technique

Parallel External Sort-Merge

Assumptions:

Assume n processors, P₀, P₁, …, P_n-1 and n disks D₀, D₁, …, D_n-1.

Disk D_i is associated with Processor P_i.

Relation R is partitioned into R₀, R₁, …, R_n-1 using Round-robin technique or Hash Partitioning technique or Range Partitioning technique (partitioned on any attribute)

Objective:

Our objective is to sort a relation (table) R_i on an attribute A in parallel where R resides on n disks.

Steps:

Step 1: Sort the relation partition R_i which is stored on disk D_i on the sorting attribute of the query.

Step 2: Identify a range partition vector v and range partition every R_i into processors, P₀, P₁, …, P_n-1 using vector v.

Step 3: Each processor P_i performs a merge on the incoming range partitioned data from every other processors (The data are actually transferred in order. That is, all processors send first partition into P0, then all processors sends second partition into P1, and so on).

Step 4: Finally, concatenate all the sorted data from different processors to get the final result.

Point to note:

Range partition must be done using a good range-partitioning vector. Otherwise, skew might be the problem.

Example:

Let us explain the above said process with simple example. Consider the following relation schema Employee;

Employee (Emp_ID, EName, Salary)

Assume that relation Employee is permanently partitioned using Round-robin technique into 3 disks D₀, D₁, and D₂ which are associated with processors P₀, P₁, and P₂. At processors P₀, P₁, and P₂, the relations are named Employee0, Employee1 and Employee2 respectively. This initial state is given in Figure 1.

Figure 1 - Partitioned Employee

Assume that the following sorting query is initiated.

SELECT * FROM Employee ORDER BY Salary;

As already said, the table Employee is not partitioned on the sorting attribute Salary. Then, the Parallel External Sort-Merge technique works as follows;

Step 1:

Sort the data stored in every partition (every disk) using the ordering attribute Salary. (Sorting of data in every partition is done temporarily). At this stage every Employee_icontains salary values of range minimum to maximum. The partitions sorted in ascending order is shown below, in Figure 2.

Figure 2 - Partitioned Employee in ascending order on Salary attribute

Step 2:

We have to identify a range vector v on the Salary attribute. The range vector is of the form v[v0, v1, …, vn-2]. For our example, let us assume the following range vector;

v[14000, 24000]

This range vector represents 3 ranges, range 0 (14000 and less), range 1 (14001 to 24000) and range 2 (24001 and more).

Redistribute every partition (Employee0, Employee1 and Employee2) using these range vectors into 3 disks temporarily. What would be the status of Temp_Employee 0, 1, and 2 after distributing Employee 0 is given in Figure 3.

Figure 3 - Distribution of Employee 0 according to the range vector

Step 3:

Actually, the above said distribution is executed at all processors in parallel such that processors P0, P1, and P2 are sending the first partition of Employee 0, 1, and 2 to disk 0. Upon receiving the records from various partitions, the receiving processor P0 merges the sorted data. This is shown in Figure 4.

Figure 4 - Merging data of range <=14000 at D0 from all the disks

The above said process is done at all processors for different partitions. The final version of Temp_Employee 0, 1, and 2 are shown in Figure 5.

Figure 5 - Status of Temp_Employee 0, 1, and 2 after distributing all ranges from all disks

Step 4:

The final concatenation of sorted data from all the disks is trivial.

Wednesday, March 12, 2014

Parallel Database - Intraquery Parallelism

Intraquery Parallelism

It is about executing a single query in parallel using multiple processors or disks. This can be done by dividing the data into many smaller units and execute the query on those smaller tables. We have so many queries which are complex and consume more time and resources. For example, consider the query given below;

SELECT * FROM Email ORDER BY Start_Date;

This query will sort the records of Email table in ascending order on the attribute Start_Date. Assume that the Email table has 10000 records. To sort the table we need at least 10000 comparisons of Salary attribute values, if the query is executed in single processor. If the same query is executed in several processors simultaneously, we can finish sorting in lesser time. For example, if 10000 records are distributed equally into 10 processors, 1000 each, then every processor can sort 1000 records in lesser time, later it can be combined/merged using other algorithms to produce final sorted result.

In Intraquery Parallelism, we have two types;

Parallel Database - Intraoperation Parallelism

Intra-operation Parallelism

It is about parallelizing a single relational operation given in a query.

SELECT * FROM Email ORDER BY Start_Date;

In the above query, the relational operation is Sorting. Since a table may have large number of records in it, the operation can be performed on different subsets of the table in multiple processors, which reduces the time required to sort.

Consider another query,

SELECT * FROM Student, CourseRegd WHERE Student.Regno = CourseRegd.Regno;

In this query, the relational operation is Join. The query joins two tables Student, and CourseRegd on common attribute Regno. Parallelism is required here if the size of tables is very large. Usually, order of tuples does not matter in DBMS. Hence, the tables arranged in random order needs every record of one table should be matched with every record of other table to complete the join process. For example, if Student has 10000 records and CourseRegd has 60000 records, then join requires 10000 X 60000 comparisons. If we exploit parallelism in here, we could achieve better performance.

There are many such relational operations which can be executed in parallel using many processors on subsets of the table/tables mentioned in the query. The following list includes the relational operations and various techniques used to implement those operations in parallel.

Parallel Sort

Parallel Join

Selection

DuplicateElimination

Projection

Aggregation

Advanced DBMS Topics

Advanced concepts in DBMS

Advanced Database Topics

(Click on the links to navigate)

Advanced Concepts in DBMS:

This link takes you to the section which broadly discusses about Database Design using ER model and Normalization techniques, various normal forms, Indexing and Tuning, Performance issues in DBMS, basic building blocks of a DBMS software, etc. with appropriate examples (in some posts). This section is especially meant for those who are naive to DBMS. More.....

ADBMS Quizzes:

You can give quizzes on various topics in DBMS basics and advanced concepts. Also, set of solved exercises are included. More.....

Parallel Database Concepts:

In this section, I have discussed about parallel database concepts like, parallel database architectures, basic issues in parallelizing database accesses, data distribution to parallel machines, types of parallel operations, achievability of parallel operations, some keywords used in parallel databases, real time parallel implementation through Oracle, MySQL, etc. More.....

Distributed Database Concepts:

This link is about architecture used in distributed database, data distribution through fragmentation and replication, distributed transactions, distributed concurrency control, some examples on how to fragment and allocate databases, etc. More.....

Object Database Concepts:

This link will take to the Object database concepts like object oriented database, object relational database, concepts of types, inheritance etc. More.....

SQL:

This link lists the syntaxes for creating tables, and executing queries (Handling DDL, DML, TCL queries) mainly for SQL new learners. More.....

Previous Year University Exam Questions:

Set of University exam questions can be found here. University questions comprises of Indian and few foreign Universities. Some of them are solved or links are provided for answers. More.....

Other Computer Science and Engineering Subjects:

Notes, PPTs, PDFs, Question papers etc. for other computer science related subjects are provided here. More.....

Links to useful video lectures in Computer Science and Engineering Subjects:

In this page you can find the links to most popular, useful, simple and easily understandable online video lectures for students in the field of computer science and engineering by many eminent speakers. More.....

******************************

TOPICS (Click to Navigate)

Saturday, March 15, 2014

Parallel Database - Parallel External Sort-Merge Technique

Parallel External Sort-Merge

Assumptions:

Objective:

Steps:

Point to note:

Example:

Step 1:

Step 2:

Step 3:

Step 4:

Wednesday, March 12, 2014

Parallel Database - Intraquery Parallelism

Intraquery Parallelism

IntraoperationParallelism

InteroperationParallelism

Parallel Database - Intraoperation Parallelism

Intra-operation Parallelism

Parallel Sort

Range-Partitioning Sort

Parallel External Sort-Merge

Parallel Join

Partitioned Join

Fragment and Replicate Join

Partitioned Parallel Hash Join

Parallel Nested-Loop Join

Selection

DuplicateElimination

Projection

Aggregation

Advanced DBMS Topics

Featured Content

Multiple choice questions in Natural Language Processing Home

All time most popular contents

Report Abuse