How would we parallelize the SELECTION operation in an SQL query?
- Assume that
the table is partitioned and stored in disks D0, D1, …, Dn-1 with processors
P0, P1, …, Pn-1. For
example, consider the following figure where the table Employee is partitioned using
Round-robin partitioning technique.
Figure 1 - Partitioned table Employee
- Recall the types of queries based on data access. Basically we have three types;
- Point queries – In point queries, the WHERE conditions are specified like “attribute_name = value” format.
- Range queries – In range queries, the WHERE conditions are specified like “attribute_name >= value AND attribute_name <= value”, or “attribute_name BETWEEN value1 and value 2”.
- Scanning the entire relation – In queries where the WHERE condition involves non-key attributes, then the system has to look for all the data in the specified table. This is called the relation scan.
- If the table is partitioned based on an attribute other than the WHERE condition attribute,
- If the table is range partitioned on an attribute other than the range attribute specified in the query,
- If the table is partitioned on one attribute and the query involves one more attribute with and condition, etc.
Point to note:
- According to CASE 3, it means, we can achieve parallelism by exploiting all the processors.
- According to CASE 1, only one processor has to execute and it is known initially. That means, the other processors can effectively be used for executing other queries at that time.
- According to CASE 2, only few processors where the range of values stored need to execute. The other processors can be utilized for executing other queries.