How do we perform Duplicate data elimination, Projection, and Aggregation in parallel in parallel database system?
Assume that the table is partitioned and stored in disks D0, D1, …, Dn-1 with processors P0, P1, …, Pn-1. For example, consider the following figure where the table Employee is partitioned used Round-robin partitioning technique.
|Figure 1 - Partitioning Employee table|
Duplicate elimination is about removing duplicate values or neglecting repeated values that are stored in an attribute for various reasons; For example, in one or all of the following situations we need duplicate elimination;
- When we would like to count the number of unique values present in a table under a particular attribute.
- When we need to retrieve the values and present only the unique values of an attribute.
- When we need to join two tables, etc.
The main approach used for duplicate elimination is by sorting the data on the attribute where in the duplicate values to be removed.
Duplicate elimination can be achieved in the following two ways in parallel database;
1. During parallel sort, if we find any repeated values while partitioning, those can be discarded immediately. (This method is for tables that are not partitioned). For example, in Figure 1, while you partition the data, you send tuples to different partitions based on the partition attribute and conditions. Suppose that we are partitioning on Salary attribute. During the process, if you find any of the repeated values which are already send into a partition, you can discard those values if they repeat. In our example, salary value 5000 occurs in two records. Hence, one can be accepted and the other can be discarded.
2. We can partition the table into many partitions (using range or hash partitioning), and instruct all the processors to sort the data locally and remove the duplicates. (This works only for the data that are hash or range partitioned on the duplicate elimination attribute)
Projection means selection of one or more attributes from a table with the records stored in them. This operation can be parallelized as follows;
1. Projection without duplicate elimination: while you read data into various disks during partitioning, you can project the required columns.
2. Projection with duplicate elimination: any of the techniques suggested in Duplicate elimination section above can be used.
Aggregation operation involves finding the count of records, sum of values stored in an attribute, minimum or maximum value of all the values stored in an attribute, and average value of all the attribute values. This operation basically needs grouping. That is, for example, we can find the sum of salary for all the records of Employee table, or we can find sum of salary with some filter conditions. In the first case, all the records of Employee table come under one group. In the later case, we choose the group based on the conditions included.