Saturday, March 15, 2014

Parallel Database - Parallel External Sort-Merge Technique



Parallel External Sort-Merge

Assumptions:

Assume n processors, P0, P1, …, Pn-1 and n disks D0, D1, …, Dn-1.
Disk Di is associated with Processor Pi.
Relation R is partitioned into R0, R1, …, Rn-1 using Round-robin technique or Hash Partitioning technique or Range Partitioning technique (partitioned on any attribute)

Objective:

Our objective is to sort a relation (table) Ri on an attribute A in parallel where R resides on n disks.

Steps:

Step 1: Sort the relation partition Ri which is stored on disk Di on the sorting attribute of the query.
Step 2: Identify a range partition vector v and range partition every Ri into processors, P0, P1, …, Pn-1 using vector v.
Step 3: Each processor Pi performs a merge on the incoming range partitioned data from every other processors (The data are actually transferred in order. That is, all processors send first partition into P0, then all processors sends second partition into P1, and so on).
Step 4: Finally, concatenate all the sorted data from different processors to get the final result.

Point to note:

Range partition must be done using a good range-partitioning vector. Otherwise, skew might be the problem.

Example:

Let us explain the above said process with simple example. Consider the following relation schema Employee;
Employee (Emp_ID, EName, Salary)
Assume that relation Employee is permanently partitioned using Round-robin technique into 3 disks D0, D1, and D2 which are associated with processors P0, P1, and P2. At processors P0, P1, and P2, the relations are named Employee0, Employee1 and Employee2 respectively. This initial state is given in Figure 1.
Figure 1 - Partitioned Employee



Assume that the following sorting query is initiated.
SELECT * FROM Employee ORDER BY Salary;
As already said, the table Employee is not partitioned on the sorting attribute Salary. Then, the Parallel External Sort-Merge technique works as follows;

Step 1:

Sort the data stored in every partition (every disk) using the ordering attribute Salary. (Sorting of data in every partition is done temporarily). At this stage every Employeei contains salary values of range minimum to maximum. The partitions sorted in ascending order is shown below, in Figure 2.

Figure 2 - Partitioned Employee in ascending order on Salary attribute


Step 2:

We have to identify a range vector v on the Salary attribute. The range vector is of the form v[v0, v1, …, vn-2]. For our example, let us assume the following range vector;
v[14000, 24000]
This range vector represents 3 ranges, range 0 (14000 and less), range 1 (14001 to 24000) and range 2 (24001 and more).
Redistribute every partition (Employee0, Employee1 and Employee2) using these range vectors into 3 disks temporarily. What would be the status of Temp_Employee 0, 1, and 2 after distributing Employee 0 is given in Figure 3.
Figure 3 - Distribution of Employee 0 according to the range vector


Step 3:

Actually, the above said distribution is executed at all processors in parallel such that processors P0, P1, and P2 are sending the first partition of Employee 0, 1, and 2 to disk 0. Upon receiving the records from various partitions, the receiving processor P0 merges the sorted data. This is shown in Figure 4. 


Figure 4 - Merging data of range <=14000 at D0 from all the disks


The above said process is done at all processors for different partitions. The final version of Temp_Employee 0, 1, and 2 are shown in Figure 5.

Figure 5 - Status of Temp_Employee 0, 1, and 2 after distributing all ranges from all disks



Step 4:

The final concatenation of sorted data from all the disks is trivial.


Wednesday, March 12, 2014

Parallel Database - Intraquery Parallelism



Intraquery Parallelism

It is about executing a single query in parallel using multiple processors or disks. This can be done by dividing the data into many smaller units and execute the query on those smaller tables. We have so many queries which are complex and consume more time and resources. For example, consider the query given below;
SELECT * FROM Email ORDER BY Start_Date;
This query will sort the records of Email table in ascending order on the attribute Start_Date. Assume that the Email table has 10000 records. To sort the table we need at least 10000 comparisons of Salary attribute values, if the query is executed in single processor. If the same query is executed in several processors simultaneously, we can finish sorting in lesser time. For example, if 10000 records are distributed equally into 10 processors, 1000 each, then every processor can sort 1000 records in lesser time, later it can be combined/merged using other algorithms to produce final sorted result.
In Intraquery Parallelism, we have two types;

Parallel Database - Intraoperation Parallelism



Intra-operation Parallelism

It is about parallelizing a single relational operation given in a query.
SELECT * FROM Email ORDER BY Start_Date;
In the above query, the relational operation is Sorting. Since a table may have large number of records in it, the operation can be performed on different subsets of the table in multiple processors, which reduces the time required to sort.

Consider another query,
SELECT * FROM Student, CourseRegd WHERE Student.Regno = CourseRegd.Regno;
In this query, the relational operation is Join. The query joins two tables Student, and CourseRegd on common attribute Regno. Parallelism is required here if the size of tables is very large. Usually, order of tuples does not matter in DBMS. Hence, the tables arranged in random order needs every record of one table should be matched with every record of other table to complete the join process. For example, if Student has 10000 records and CourseRegd has 60000 records, then join requires 10000 X 60000 comparisons. If we exploit parallelism in here, we could achieve better performance.

There are many such relational operations which can be executed in parallel using many processors on subsets of the table/tables mentioned in the query. The following list includes the relational operations and various techniques used to implement those operations in parallel.


Advanced DBMS Topics

Advanced concepts in DBMS

Advanced Database Topics
(Click on the links to navigate)

This link takes you to the section which broadly discusses about Database Design using ER model and Normalization techniques, various normal forms, Indexing and Tuning, Performance issues in DBMS, basic building blocks of a DBMS software, etc. with appropriate examples (in some posts). This section is especially meant for those who are naive to DBMS. More.....

You can give quizzes on various topics in DBMS basics and advanced concepts. Also, set of solved exercises are included. More.....


In this section, I have discussed about parallel database concepts like, parallel database architectures, basic issues in parallelizing database accesses, data distribution to parallel machines, types of parallel operations, achievability of parallel operations, some keywords used in parallel databases, real time parallel implementation through Oracle, MySQL, etc. More.....

This link is about architecture used in distributed database, data distribution through fragmentation and replication, distributed transactions, distributed concurrency control, some examples on how to fragment and allocate databases, etc. More.....

This link will take to the Object database concepts like object oriented database, object relational database, concepts of types, inheritance etc. More.....

This link lists the syntaxes for creating tables, and executing queries (Handling DDL, DML, TCL queries) mainly for SQL new learners. More.....

Set of University exam questions can be found here. University questions comprises of Indian and few foreign Universities. Some of them are solved or links are provided for answers. More.....

Notes, PPTs, PDFs, Question papers etc. for other computer science related subjects are provided here. More.....

In this page you can find the links to most popular, useful, simple and easily understandable online video lectures for students in the field of computer science and engineering by many eminent speakers. More.....




******************************






Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents