Sunday, March 9, 2014

Generic steps involved in execution of an SQL Query

How an SQL Query is executed by a Database Management System?


In this post, I would like to discuss the Query Execution process in simple terms. The detailed diagram of Query Execution process is given in Figure 1. I have mentioned two main components in the figure, namely Component A (Primary Storage) and Component B (Secondary Storage). First of all, let us have an abstracted view on these components on their roles in the process of Query Execution.

A - Primary Storage:

Primary storage (it is your RAM) also called as Primary Memory, Main Memory, Memory, etc. All of us know that the Operating System brings all the files needed to be processed into main memory first. A DBMS too actually maintains database in number of different files. A DBMS have many modules for managing data and executing queries. Among them two modules are given in the Figure. They are Query Processor and Storage Manager. These two in turn have many sub-modules in them (they are not discussed here directly).
After receiving the query, the Query Processor module does the following;

  • It verifies the syntax of the query,
  • It checks for the table names and columns specified in the query against in Data Dictionary,
  • It generates various possible cost based plans for the given query,
  • It chooses one among the generated plans which would be best and,
  • It instructs the Buffer manager to bring the required data into the buffer for execution.
Upon receiving the instruction from the programs that executes query in DBMS, the Buffer Manager (which is part of Storage Manager) does the following;

  • It checks for the availability of the requested block in buffer,
  • If the requested block is not in the buffer, then allocates the space in the buffer,
  • It then locates the data block in the disk, and transfers to the buffer, and
  • It forwards the address of the block in main memory to the program which initiated the request.
As a whole, we can define the major roles of these two components as follows;
Query Processor – identifies the cost effective plan to execute a query.
Buffer Manager – tries to reduce the number disk I/O between disk and main memory.

B - Secondary Storage:

It is your hard disk where all the files are stored. Basically, hard disk storage units are divided into tracks and sectors physically. Operating systems treats the basic data units as Block, which is collection of one or more contiguous sectors. And, when transfers files from disk to main memory, operating systems handle them in units of blocks. The data transfer rate of hard disk drive is several orders of magnitude slower than that of the main memory. This is where buffers and Buffer managers are helpful in handling the data very efficiently when they are needed. Disk subsystem is the system which consists of disk controller and set of disks. It handles disks for disk I/O.

With this introduction, let us discuss the process with an example query in the following section. 

Let us now discuss the control flow for executing the given query,
SELECT * FROM Employee WHERE Emp_ID = ‘E101’;

Step 1: Query Input

The query is given to the Query processor module through any one of the various SQL interfaces (SQL tool, Application programs with embedded SQL, etc.)

Step 2: At Query Processor

Query Processor verifies the syntax of the query.
Then it checks for the correctness of the table name Employee and the attribute Emp_ID in the Data Dictionary. Also, it checks for the user authentication. That is, if the user is privileged to access the requested data.
It generates various possible equivalent queries (relational algebra equivalents) using the meta information (for example, size of the table, keys of the table, integrity constraints, indexes on various fields, and much more). For our query, we could write the following relational algebra expression;
σEmp_ID = ‘E101’(Employee)
As mentioned above, using the indexes we could generate many other equivalent queries.
Finally, it chooses one expression as execution plan for the given query. Let us assume the above relational algebra expression is chosen to execute the given query.
Now the instruction is given to the Buffer Manager to fetch the data needed by the query.

Step 3: At Buffer Manager

It checks for the data needed in the buffer itself. If the requested data not available with buffer, then it initiates a request to disk subsystem to fetch the requested data.
As discussed above, the requested data alone cannot be transferred, instead the whole block where the data is available will be transferred. In our example, the block with 4 records are transferred by the disk subsystem to main memory, which will be linear scanned to locate the actual record.

Selection query is processed as discussed. If the query involves UPDATE operation, then the other steps must be included. For example, let us consider the following query;
UPDATE Employee SET Department = ‘Marketing’ WHERE Emp_ID = ‘E101’;
For this query, fetching the required data from disk to main memory is done as discussed above. But, to ensure the changes made to the data to be reflected in the database permanently, we need to follow the other steps too as discussed below.

Step 4&5: Logging

If the query actually changes the data (i.e, if the operation is UPDATE), logging the changes made to the data is an important step. This ensures the safety of data from abnormal loss of data.
If an entry of changes is written into the Log cache (which is actually one of several caches handled in Buffer), it also written into the Log File which is available in the disk.

Step 6: Actual Updating of Data in disk

Once the data is successfully written into the log file, it is time to make the actual changes and write the updated record into disk. The modified data which is available in Buffer and not yet written to disk is called Dirty pages. When a successful log write is completed, then these dirty pages are written to disk and treated as clean pages.
As discussed above, one of the major goals of a database system is to minimize the number of disk accesses. Hence, it is decided by the Buffer Manager to retain the data in the buffer according to several algorithms.

No comments:

Post a Comment

Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents

data recovery