Thursday, March 6, 2014

Comparison of Data-Partitioning Strategies in Parallel Database



Comparison of different data-partitioning strategies based on the data-access types:

We have already discussed about different data-partitioning techniques, namely, Round-robin, Hash and Range Partitioning in an older post.


The above said data-partitioning strategies are useful in partitioning data across several processors to establish a complete parallel system. Here, the term ‘Complete Parallel System’ means the Parallel Database System which can deliver the expected improvement in the performance of the whole system. Hence, it cannot be achieved directly without considering how the database is going to be accessed by the end user and other related details.
The following list would help us in choosing the right partitioning technique;

  • The workload of the system

a.       What is the size of the database?
b.      How many users are accessing the system?
c.       What type of accesses they are? (reading/writing) and so on.

  • What type of data-accesses? (Scanning the entire relation, Point queries, Range queries).

  • The nature of data stored in the table.

a.       The data types
b.      Cardinality of data of every column (uniqueness of values stored in a column) etc.

The above are not the complete list. Different systems try with different options. In this post, let us discuss about select of partitioning strategies based on the access type on data, i.e, based on the type of query which tries to scan the entire relation, range queries or point queries.
Let us use the following three queries for explaining the concept;

  A.      SELECT * FROM Employee WHERE Emp_ID = ‘E101’; (Assume Emp_ID is Primary key)
  B.      SELECT * FROM Employee WHERE EName = ‘Murugan’; (Assume EName is non-key attribute)
  C.      SELECT * FROM Employee ORDER BY Phone;
  D.      SELECT * FROM Employee WHERE Salary BETWEEN 10000 AND 20000;

 

1.Round-robin Technique

Round-robin partitioning strategy partitions the given relational table into set of disks based on the position of the records. That is, it sends ith record into disk (i mod n). hence, this strategy ensures even distribution of records into available disks.

Advantages:

  • Suitable for queries which requires scanning the entire relation. For example, Query C can be handled efficiently by distributing data evenly among the available disks.
  • Due to even distribution of records, the workload is well balanced among disks.

Disadvantages:

  • Records are not distributed in any order in this strategy. Hence, we cannot handle point queries and range queries efficiently (when compared to other strategies). So, the strategy is not well suited for Point and Range queries. Query A, B and D cannot be answered efficiently (when compared to other strategies. That is, for example, query b can be answered efficiently if Hash partitioning strategy is used).

 

2. HashPartitioning

Hash partitioning strategy works by using a hash function which is designed to distribute the records into many disks based on the input values (partitioning attribute values).

Advantages:

  • Sequential scan can be done efficiently. Query C can be answered efficiently as the data are available in all the disks, sorting can be done in parallel
  • Well balanced data distribution is possible provided that good portioning attributes are chosen and good hash function is chosen.
  • Point queries can be executed very efficiently compare to Round-robin partitioning. The reason is, in Round-robin technique to answer query B, one has to search in all the disks. But in the case of Hash partitioning the similar values could be found in one location only. Hence, the other processors could be used to handle other queries. This leads to higher Transaction throughput.

Disadvantages:

  • It is difficult to answer range queries (when compared to Range partitioning technique). The data distributed unevenly. Hence, to answer Query D, we have to use all the processors.
  • Not good for partitioning on non-partitioning attributes. Suppose, if the table Employee is partitioned using Hash partitioning on Emp_ID attribute, then query B has to scan all the disks.

 

3. RangePartitioning

Range partitioning strategy partitions the data based on the partitioning attributes values. We need to find set of range vectors on which we are about to partition. For example, the records with Salary range 100 to 5000 will be in disk 1, 5001 to 10000 in disk 2, and so on.

Advantages:

  • Sequential scan can be done in parallel using all the processors. Query C can be executed efficiently.
  • Well balanced data distribution is possible, only if good partitioning vector is chosen. One has to choose the partitioning vector which would at least near equally into many disks.
  • Point queries can be handled efficiently just like Hash partitioning (when compare to Round-robin technique). Queries A and B can be handled efficiently.
  • For range queries, according to the range given in the query, we may need to scan one to few disks as whole. Hence, Range queries can be handled efficiently. Query D can be executed efficiently (if not execution skew present).
  • Higher throughput and good response time.

Disadvantages:

  • If the query specifies large range values, i.e, if the result consists of large number of records in very few disks among the available disk, an I/O bottleneck may be created on those disks. This is called Execution skew. If the same query executed in Round-robin or Hash partitioned disks, we could get better performance compared to Range partitioning.

The Table 1 given below, compares the partitioning strategies.

Partitioning Strategy
Advantages
Disadvantages
Round-robin
·         Sequential scan of entire relation
·         Well balanced data distribution
·         Records are distributed in random. No specific order is followed.
·         Hard to answer range queries.
Hash Partitioning
·         Sequential scan of entire relation
·         Well balanced data distribution
·         For point queries, only one disk has to be searched (only if the data partitioned on the query attribute).
·         Hard to answer range queries
·         Not so good for point queries on non-partitioning attribute.

Range Partitioning
·         Sequential scan of entire relation
·         Well balanced data distribution
·         For point queries, only one disk has to be searched (only if the data partitioned on the query attribute).
·         For range queries, only few disks have to be searched according to the range specified in the query (only if the data partitioned on the query attribute).
·         Execution skew might occur.
Table 1 – Comparison of Partitioning Strategies
Figure 1 - Round-robin Partitioning Strategy in Shared Nothing Parallel Architecture

Figure 2 - Hash Partitioning Strategy in Shared Nothing Parallel Architecture

Figure 3 - Range Partitioning Strategy in Shared Nothing Parallel Architecture









Tuesday, March 4, 2014

Types of Data Access in database



Types of Data Access in database (in view of parallel database data access)

In any relational databases the term Data Access involves reading an entire relational table. This is treated as the only kind of access to data stored in a table. For example, a query for reading (SELECT queries) a table for some values needs to read entire table, if we do not have indexes on those attributes mentioned part of WHERE condition. This type of access can further be classified into the following in view of executing parallel queries;

1. Scanning the entire relation

It means searching for a particular value in a table may need to search in all the records of the table.

Eg_1: SELECT * FROM Employee WHERE EName = ‘Raman’;
Eg_2: SELECT * FROM Employee ORDER BY Phone_No;

The query in Eg_1 needs to access entire relation in searching for the EName value ‘Raman’, if EName is a non-key attribute. When it is a non-key attribute, there may be duplicate values stored in the column EName, i.e, there may be more persons with the same name.
If EName is a key attribute, this query need not scan entire table. When EName is a key, then, only one such value (in our case ‘Raman’) presents in the whole table for the column EName. But, there are two possible cases in accessing the data;
Best Case            : Possibilities for the required data available in first few disc blocks accesses.
Worst Case         : Required data might be available in the last few disc blocks of the table.
The query in Eg_2 requires scanning the entire relation. It actually sorts the data on the attribute mentioned in ORDER BY clause. The result is going to be the entire table in ascending order of Phone_No attribute.

2. Point Queries

It involves associating the particular attribute with a particular value.
Eg_3: SELECT * FROM Employee WHERE Phone_No = 9998881234;
In Eg_3, we are looking for the employee details whose phone number is mentioned in the WHERE clause. This type of query is called as Point Queries (Eg_1 is also a Point Query). Based on the availability of Indexes, it may need to scan entire relation (as discussed for Eg_1).

3. Range Queries

The queries used for locating all the records for which the value of the column specified in the WHERE condition falls in a specific range is called Range Queries.
Eg_4: SELECT * FROM Employee WHERE Age BETWEEN 30 and 45;
This query retrieves all the records whichever satisfies the range 30 to 45 for Age attribute. Again, based on the type of attribute this query may scan entire relation (just same as the above discussion for Scanning the entire relation).

Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents