Sunday, February 23, 2014

Data Partitioning Strategies in Parallel Database Systems



Data Partitioning Strategies in Parallel Database Systems


Partitioning the tables/databases is very important step in parallelizing the database activities. By partitioning the data equally into many different processors’ workload, we can achieve better performance (better parallelism) of the whole system. We have the following types of fragmentation (partitioning) techniques in Parallel database;

Horizontal Fragmentation – Partitioning the tables using the conditions  specified through WHERE clause of SQL query to distribute bunch of tuples (records). For example, consider the following schema;
STUDENT (Regno, SName, Address, Branch, Phone)
In this table, Branch column is used to store the academic branch in which the student admitted in. suppose, there are various branches like ‘BTech CIVIL’, ‘BTech MECH’, ‘BTech CSE’, and so on, then the following query would be used to partition the table using horizontal partitioning.
SELECT * FROM student WHERE Branch = branch name;
In the result, there won’t be any changes in the schema of the table; i.e the structure of the table is unaltered. Only, ‘BTech CIVIL’ students fragmented into one partition, ‘BTech CSE’ students fragmented into one partition and so on.
Vertical Fragmentation – Partitioning tables using the decomposition rules to break and distribute the tables into multiple partitions vertically (different schemas) is called Vertical fragmentation.
For example, if we would like to break STUDENT into different tables like STUDENT(Regno, SName, Address, Branch) and STU_PHONE(Regno, Phone), i.e into two different schema on columns is Vertical partitioning.

Among the above discussed fragmentation techniques, partitioning any relation with respect to Parallel databases involves Horizontal Partitioning. Horizontal data partition helps us to distribute the data into several processors to execute queries on them simultaneously.

Partitioning Strategies:

There are various partitioning strategies proposed to manage the data distribution into multiple processors evenly.
Let us assume that in our parallel database system we have n processors P0, P1, P2, …, Pn-1 and n disks D0, D1, D2, …, Dn-1 where we partition our data. The value of n is chosen according to the degree of parallelism required. The partitioning strategies are,

Round-Robin Partitioning

 

                It is the simplest form of partitioning strategy. Here, the data are distributed into various disks in the fashion, first record into first disk, second record into second disk, and so on. If the number of available disks n is 10, then first record goes to D1 (1 mod 10 = 1), second record goes to D2 (2 mod 10 =2), and so on and 10th record goes to D0 (10 mod 10 = 0), 11th record goes to D1 (11 mod 10 = 1). This scheme distributes data evenly in all the disks.

Hash Partitioning

 

                This strategy identifies one or more attributes as partitioning attributes (which is not required in Round-Robin partitioning). We need a hash function which is carefully chosen (careful to avoid data skewness) which takes the identified partitioning attributes as input to hash function.
For example, consider the following table;
EMPLOYEE(ENo, EName, DeptNo, Salary, Age)
If we choose DeptNo attribute as the partitioning attribute and if we have 10 disks to distribute the data, then the following would be a hash function;
h(DeptNo) = DeptNo mod 10
If we have 10 departments, then according to the hash function, all the employees of department 1 will go into disk 1, department 2 to disk 2 and so on.

As another example, if we choose the EName of the employees as partitioning attribute, then we could have the following hash function;
h(EName) = (Sum of ASCII value of every character in the name) mod n,
where n is the number of disks/partitions needed.

Range Partitioning

 

In Range Partitioning we identify one or more attributes as partitioning attributes. Then we choose a range partition vector to partition the table into n disks. The vector is the values present in the partitioning attribute.
For example, for the EMPLOYEE relation given above, if the partitioning attribute is Salary, then the vector would be one as follows;
[5000, 15000, 30000],
where every value means the individual range of salaries. That is, 5000 represents the first range (0 – 5000), 15000 represents the range (5001 – 15000), 30000 represents the third range (15001 – 30000), and it includes the final range which is (30001 – rest). Hence, the vector with 3 values represents 4 disks/partitions.

The above discussed partition strategies must be chosen carefully according to the workload of your parallel database system. The workload may involve many components like, which attribute is frequently used in any queries as filtering condition, the number of records in a table, the size of the database, approximate number of incoming queries in a day, etc.

Detailed Example:




Let us start with the following table Emp_table. Emp_table instance has 14 records and every record stores information about the name of the employee; his/her work grade, and the department name. Assume that we have 3 processors namely P0, P1, P2, and 3 Disks associated with those 3 processors namely D0, D1, D2.



Emp_table
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
BLAKE
4
SALES
FORD
4
RESEARCH
KING
5
ACCOUNTING
SCOTT
4
RESEARCH
MILLER
2
ACCOUNTING
TURNER
3
SALES
WARD
2
SALES
MARTIN
2
SALES
ADAMS
1
RESEARCH
JONES
4
RESEARCH
JAMES
1
SALES
CLARK
4
ACCOUNTING
ALLEN
3
SALES
Table 1 – Emp_table

A. Round-Robin Partitioning:

In this strategy we partition records in a round-robin manner using the function i mod n, where i is the record position in the table and n is the number of partitions/disks which is in our case 3. On the application of partitioning technique, first record goes into D1, second record goes into D2, third record goes into D0, fourth record goes into D1, and so on. After distribution of records, we will get the following partitions;



Emp_table_Partition0
ENAME
GRADE
DNAME
FORD
4
RESEARCH
MILLER
2
ACCOUNTING
MARTIN
2
SALES
JAMES
1
SALES
Table 2 – Records 3, 6, 9, 12 mod 3

Emp_table_Partition1
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
KING
5
ACCOUNTING
TURNER
3
SALES
ADAMS
1
RESEARCH
CLARK
4
ACCOUNTING
Table 3 – Records 1, 4, 7, 10, 13 mod 3

Emp_table_Partition2
ENAME
GRADE
DNAME
BLAKE
4
SALES
SCOTT
4
RESEARCH
WARD
2
SALES
JONES
4
RESEARCH
 ALLEN
3
SALES
Table 4 – Records 2, 5, 8, 11, 14 mod 3

B. Hash Partitioning:

Let us take GRADE attribute of the Emp_table to explain Hash partitioning. Let us choose a hash function as follows;
h(GRADE) = (GRADE mod n)
where GRADE is the value of GRADE attribute of a record and n is number of partitions which is 3 in our case. While applying the hash partitioning on GRADE, we will get the following partitions of Emp_table. For example, the GRADE of ‘Smith’ is 1 and while hashing the function shows partition 1 (i.e 1 mod 3 = 1). The GRADE of ‘Blake’ is 4, then (4 mod 3) directs to partition 1. The GRADE of ‘King’ is 5 which directs to partition 2 (5 mod 3 = 2).

Emp_table_Partition0
ENAME
GRADE
DNAME
TURNER
3
SALES
 ALLEN
3
SALES
Table 5 – GRADEs 3 mod 3

Emp_table_Partition1
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
BLAKE
4
SALES
FORD
4
RESEARCH
SCOTT
4
RESEARCH
ADAMS
1
RESEARCH
JONES
4
RESEARCH
JAMES
1
SALES
CLARK
4
ACCOUNTING
Table 6 – GRADEs 1, 4 mod 3

Emp_table_Partition2
ENAME
GRADE
DNAME
KING
5
ACCOUNTING
MILLER
2
ACCOUNTING
WARD
2
SALES
MARTIN
2
SALES
Table 7 – GRADEs 2, 5 mod 3

C. Range Partitioning:

Let us consider GRADE of Emp_table to partition under range partitioning. For applying range partition, we need to first identify partitioning vector, [v0, v1, …, vn-2]. Let us choose the following vector as range partitioning vector for our case;
[2, 4]
According to the vector, the records having the GRADE value 2 and less will go into partition 0, greater than 2 and less than or equal to 4 will go into partition 1, and all the other values (greater than 4) will go into partition 2 as depicted in the following tables.

Emp_table_Partition0
ENAME
GRADE
DNAME
SMITH
1
RESEARCH
MILLER
2
ACCOUNTING
WARD
2
SALES
MARTIN
2
SALES
ADAMS
1
RESEARCH
JAMES
1
SALES
Table 8 – GRADE values 1 and 2

Emp_table_Partition1
ENAME
GRADE
DNAME
BLAKE
4
SALES
FORD
4
RESEARCH
SCOTT
4
RESEARCH
TURNER
3
SALES
JONES
4
RESEARCH
CLARK
4
ACCOUNTING
ALLEN
3
SALES
Table 9 – GRADE values 3 and 4

Emp_table_Partition2
ENAME
GRADE
DNAME
KING
5
ACCOUNTING
Table 10 – GRADE value 5 and above

Saturday, February 22, 2014

Parallel Database Architectures


Parallel Database Architecture


Today everybody interested in storing the information they have got. Even small organizations collect data and maintain mega databases. Though the databases eat space, they really helpful in many ways. For example, they are helpful in taking decisions through a decision support system. To handle such a voluminous data through conventional centralized system is bit complex. It means, even simple queries are time consuming queries. The solution is to handle those databases through Parallel Database Systems, where a table / database is distributed among multiple processors possibly equally to perform the queries in parallel. Such a system which share resources to handle massive data just to increase the performance of the whole system is called Parallel Database Systems.
We need certain architecture to handle the above said. That is, we need architectures which can handle data through data distribution, parallel query execution thereby produce good throughput of queries or Transactions. Figure 1, 2 and 3 shows the different architecture proposed and successfully implemented in the area of Parallel Database systems. In the figures, P represents Processors, M represents Memory, and D represents Disks/Disk setups.

1. Shared Memory Architecture

 

Figure 1 - Shared Memory Architecture


In Shared Memory architecture, single memory is shared among many processors as show in Figure 1. As shown in the figure, several processors are connected through an interconnection network with Main memory and disk setup. Here interconnection network is usually a high speed network (may be Bus, Mesh, or Hypercube) which makes data sharing (transporting) easy among the various components (Processor, Memory, and Disk).

Advantages:


  • Simple implementation
  • Establishes effective communication between processors through single memory addresses space.
  • Above point leads to less communication overhead.

Disadvantages:


  • Higher degree of parallelism (more number of concurrent operations in different processors) cannot be achieved due to the reason that all the processors share the same interconnection network to connect with memory. This causes Bottleneck in interconnection network (Interference), especially in the case of Bus interconnection network.

  • Addition of processor would slow down the existing processors.

  • Cache-coherency should be maintained. That is, if any processor tries to read the data used or modified by other processors, then we need to ensure that the data is of latest version.

  • Degree of Parallelism is limited. More number of parallel processes might degrade the performance.

2. Shared Disk Architecture 

 

Figure 2 - Shared Disk Architecture

In Shared Disk architecture, single disk or single disk setup is shared among all the available processors and also all the processors have their own private memories as shown in Figure 2.

Advantages:


  • Failure of any processors would not stop the entire system (Fault tolerance)
  • Interconnection to the memory is not a bottleneck. (It was bottleneck in Shared Memory architecture)
  • Support larger number of processors (when compared to Shared Memory architecture)

Disadvantages:


  • Interconnection to the disk is bottleneck as all processors share common disk setup.

  • Inter-processor communication is slow. The reason is, all the processors have their own memory. Hence, the communication between processors need reading of data from other processors’ memory which needs additional software support. 

Example Real Time Shared Disk Implementation

  •  DEC clusters (VMScluster)  running Rdb

3. Shared Nothing Architecture


Figure 3 - Shared Nothing Architecture


 In Shared Nothing architecture, every processor has its own memory and disk setup. This setup may be considered as set of individual computers connected through high speed interconnection network using regular network protocols and switches for example to share data between computers. (This architecture is used in the Distributed Database System). In Shared Nothing parallel database system implementation, we insist the use of similar nodes that are Homogenous systems. (In distributed database System we may use Heterogeneous nodes)

Advantages:


  • Number of processors used here is scalable. That is, the design is flexible to add more number of computers.
  • Unlike in other two architectures, only the data request which cannot be answered by local processors need to be forwarded through interconnection network.

Disadvantages:


  • Non-local disk accesses are costly. That is, if one server receives the request. If the required data not available, it must be routed to the server where the data is available. It is slightly complex.
  • Communication cost involved in transporting data among computers.

Example Real Time Shared Nothing Implementation

  • Teradata
  • Tandem
  • Oracle nCUBE

Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents