Wednesday, February 5, 2014

Functional Dependency in Database

Functional Dependency in Database Management System / Role of Functional Dependency in Database design / Armstrong's Axioms / Functional Dependency Examples


Functional Dependency in Database Management System

Introduction

To proceed further with 2NF, 3NF and so on, it is essential to know about constraints, especially keys for a relation (table). A Functional Dependency expresses a strong relationship between the attributes of a table. It is written as follows;
 
X → Y
 
It can be read as: X uniquely determines Y. In other words, Y is functionally dependent on X. In the above expression,
I.        X is a set of one or more attributes and Y is another set of one or more attributes.
II.      For one X value, there is only one possible Y value (Uniqueness).
III.    It ensures that one set of attributes determines the other set of attributes.
 
Consider the table 1 given below;

RegNo | SName   | Gen | PR    | Phone      | Courses
------|---------|-----|-------|------------|----------------
R1    | Sundar  | M   | BTech | 9898786756 | Database
R2    | Ram     | M   | MS    | 9897786776 | Database
R3    | Karthik | M   | MCA   | 8798987867 | Data Structures
R4    | John    | M   | BSc   | 7898886756 | Multimedia
R1    | Sundar  | M   | BTech | 9898786756 | Data Structures
R3    | Karthik | M   | MCA   | 8798987867 | Multimedia

Table 1 - STUDENT

For this table, we can write the Functional Dependencies as follows;
 
RegNo → SName
RegNo → Gen
RegNo → PR
RegNo → Phone
 
Or, it can also be written as,
RegNo → SName Gen PR Phone                                       (1)

According to I above, in equation 1, (RegNo) is X and (SName Gen PR Phone) is Y.

According to II, if we consider the FD RegNo → SName, it means that for any register number there is only one student name. That is, in table 1, for register number ‘R1’, the student name is ‘Sundar’. Note that the register number ‘R1’ appears in two places, and in both places the name is ‘Sundar’. The same applies to all RegNo values: irrespective of the number of occurrences of a register number, we have exactly one name across all the occurrences.

According to III, it is clear that the set of attributes SName, Gen, PR, and Phone depends on the attribute RegNo.
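
To make property II concrete, here is a minimal Python sketch (the function name and the dictionary encoding of rows are our own choices, for illustration) that checks whether a dependency X → Y holds in the rows of a table:

def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds in the given rows.
    rows: list of dicts mapping attribute name to value.
    lhs, rhs: tuples of attribute names.
    The FD holds when every distinct lhs value maps to exactly
    one rhs value (property II, Uniqueness)."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if x in seen and seen[x] != y:
            return False   # same X value, two different Y values
        seen[x] = y
    return True

# Rows of table 1 (only the attributes needed here)
student = [
    {"RegNo": "R1", "SName": "Sundar",  "Courses": "Database"},
    {"RegNo": "R2", "SName": "Ram",     "Courses": "Database"},
    {"RegNo": "R3", "SName": "Karthik", "Courses": "Data Structures"},
    {"RegNo": "R4", "SName": "John",    "Courses": "Multimedia"},
    {"RegNo": "R1", "SName": "Sundar",  "Courses": "Data Structures"},
    {"RegNo": "R3", "SName": "Karthik", "Courses": "Multimedia"},
]

print(fd_holds(student, ("RegNo",), ("SName",)))    # True
print(fd_holds(student, ("RegNo",), ("Courses",)))  # False

Note that such a check only verifies the FD on the sample data; whether the FD holds in general is a statement about the real-world domain, decided at design time.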


In the process of Normalization, every normal form beyond 1NF must satisfy the properties of the previous normal form as its first qualifier. That is, to check whether a given table is in 2NF, one first has to check whether it satisfies the conditions of 1NF.
 
For a table, it is good practice to create a Primary Key. This key can be identified using the set of functional dependencies that hold on the given table. For table 1, RegNo can uniquely identify all the other attributes except Courses. That is, we cannot say RegNo → Courses, because some register numbers point to more than one Courses value. Hence, a composite primary key is the solution: try including more than one attribute on the left hand side of the functional dependency to check the possibility of identifying records uniquely. For table 1, the following is a possible key;
 
RegNo Courses → SName Gen PR Phone
 
And this holds; the combination of the attributes RegNo and Courses together can identify all the other information uniquely. As a result, the combination (RegNo, Courses) is the Primary Key for this relation.
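
As a quick sanity check, a key must not repeat across rows. A minimal sketch, using the (RegNo, Courses) pairs from table 1:

# Data from table 1: the (RegNo, Courses) pairs of all six records
pairs = [
    ("R1", "Database"), ("R2", "Database"), ("R3", "Data Structures"),
    ("R4", "Multimedia"), ("R1", "Data Structures"), ("R3", "Multimedia"),
]
# A primary key must not repeat: the number of distinct pairs
# must equal the number of records.
assert len(set(pairs)) == len(pairs)   # holds, so (RegNo, Courses) works as a key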

Armstrong’s Axioms


It is not enough to consider only the identified FDs in our further analysis for a key. It is also necessary to identify the hidden functional dependencies they imply. This can be done using Armstrong’s axioms (inference rules).
 
Rule 1: Reflexivity rule – If X is a set of attributes and Y is a subset of X, then X → Y holds.
Rule 2: Augmentation rule – If X → Y holds and Z is another set of attributes, then XZ → YZ holds.
Rule 3: Transitivity rule – If X → Y and Y → Z hold, then X → Z holds.

From these rules, we can infer the following additional set of rules;

Rule 4: Union rule – If X → Y and X → Z hold, then X → YZ holds.
Rule 5: Decomposition rule – The reverse of Rule 4. If X → YZ holds, then X → Y and X → Z hold.
Rule 6: Pseudotransitivity rule – If X → Y and ZY → A hold, then XZ → A also holds.
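
In practice, these axioms are applied mechanically by computing the closure X+ of an attribute set X, i.e., all the attributes determined by X. Below is a minimal Python sketch of the standard closure algorithm (the names and data structures are our own choices):

def closure(attrs, fds):
    """attrs: a set of attribute names.
    fds: list of (lhs, rhs) pairs, each a set of attribute names.
    Returns every attribute determined by attrs, applying the FDs
    repeatedly (transitivity) until nothing new can be added."""
    result = set(attrs)          # reflexivity: attrs determines itself
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs    # lhs is available, so rhs follows
                changed = True
    return result

# The FDs of table 1, written as one combined dependency (union rule)
fds = [({"RegNo"}, {"SName", "Gen", "PR", "Phone"})]

print(closure({"RegNo", "Courses"}, fds))
# e.g. {'RegNo', 'Courses', 'SName', 'Gen', 'PR', 'Phone'} (set order may vary)

Since the closure of (RegNo, Courses) contains every attribute of the table, this confirms that (RegNo, Courses) is a key.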
 

Sunday, February 2, 2014

Normalization


What is Normalization? / Why do we need to normalize a table? / What are modification anomalies?

Why do we need to Normalize a table?


Introduction

A database designed using a model (for example, the ER model) is perfect to some extent as far as the design is concerned; that is, it depends on the chosen attributes. But there might be some problems hidden in the design. One has to identify and remove those problems; only then can we say that the design is good. The actual problems are due to the data stored, the dependency of one attribute on another, and so on. If we eliminate those problems, then we can say that the final design is perfect. Let us analyze the possible problems with the following table (designed using the ER model), which is populated with sample data.

The following table (STUDENT) shows information about the students of an engineering college. It stores student information such as Registration Number (RegNo), Name (SName), Gender (Gen), Program Joined (PR), Course Number (CNo), Course Name (CName), Professor Name (PN), and Professor Office Address (POA). Also assume the following;
1. Every course is offered by only one professor.
2. A student can register for many courses (say a maximum of 6).
3. POA gives the office address of the professor, which is also unique for every professor.
4. The information of all students who joined any program is included in this table.

For every table, it is good to have a key to uniquely identify each record. The key for the table STUDENT is (RegNo, CNo), a composite primary key, because no single attribute can identify the records uniquely.

RegNo | SName   | Gen | PR    | CNo  | CName           | PN      | POA
------|---------|-----|-------|------|-----------------|---------|------
R1    | Sundar  | M   | BTech | C101 | Database        | Kumar   | CA101
R2    | Ram     | M   | MS    | C101 | Database        | Kumar   | CA101
R3    | Malini  | F   | MS    | C101 | Database        | Kumar   | CA101
R1    | Sundar  | M   | BTech | C105 | Data Structures | Praveen | CA107
R4    | Karthik | M   | MCA   | C105 | Data Structures | Praveen | CA107
R5    | John    | M   | BSc   | C104 | Multimedia      | Kesavan | CA103

Table 1 - STUDENT

This table looks error free according to its design, but it has many hidden problems related to the data stored. To identify those problems, let us discuss Modification Anomalies.

Modification Anomalies:


These are problems caused by data manipulation in a table. Let us discuss them one by one using the sample table STUDENT above.

1. Insertion Anomaly

An insertion anomaly is one caused while inserting records into a table. In our sample table STUDENT, to insert information about a student, he must have registered for at least one course, and the course must be offered by a professor. That is, if a student's information is inserted without course information, the record will not be accepted, because part of the primary key (CNo) would be NULL, which violates the integrity constraint. The same applies to inserting course data without a professor. This problem is called an insertion anomaly.
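
A minimal sqlite3 sketch of this rejection (the table layout is trimmed, and the row 'R6'/'Priya' is invented purely for illustration):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE student (
        regno TEXT NOT NULL,
        sname TEXT,
        cno   TEXT NOT NULL,
        cname TEXT,
        PRIMARY KEY (regno, cno)    -- composite primary key
    )
""")

# A student ('R6', 'Priya' -- an invented row) who has not registered
# for any course leaves cno NULL, so the insert is rejected.
try:
    con.execute("INSERT INTO student VALUES ('R6', 'Priya', NULL, NULL)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # NOT NULL constraint failed: student.cno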

2. Deletion Anomaly

Deletion of the values of some attributes in a record can lead to the loss of other important information too. In our table STUDENT, if for some reason the course ‘Multimedia’ is canceled, then we lose the information about the student ‘John’ as well as the information about professor ‘Kesavan’. Likewise, withdrawal of the course registered by student ‘John’ will lead to loss of information. Such an inconvenience caused by deletion is a deletion anomaly.

3. Updation Anomaly

An important statement regarding databases is “Data redundancy leads to inconsistency”. That is, data repeated in one or more tables may take you to an inconsistent state when such information is updated. In our table, student ‘Sundar’ has registered for two courses, which duplicates his RegNo, SName, Gen, and PR values. Course ‘C101’ is registered by 3 students. If, for any reason, the course name ‘Database’ is changed to ‘Database Systems’, it must be changed in all 3 records; failing to do so leads to an inconsistent state. That is, if one of the records still holds the value ‘Database’, then we have two names for ‘C101’. Similarly, if Sundar’s data is changed, it must be changed in every record where ‘R1’ appears.
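
A small Python sketch of this inconsistency (rows abridged from the STUDENT table; the partial update is deliberate):

# Three records of course C101, with the course name stored redundantly
rows = [
    {"RegNo": "R1", "CNo": "C101", "CName": "Database"},
    {"RegNo": "R2", "CNo": "C101", "CName": "Database"},
    {"RegNo": "R3", "CNo": "C101", "CName": "Database"},
]

rows[0]["CName"] = "Database Systems"   # the update misses the other two records

names = {r["CName"] for r in rows if r["CNo"] == "C101"}
print(names)   # {'Database', 'Database Systems'} -- two names for C101: inconsistent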

Our table STUDENT has all of the above anomalies. To free the table from these anomalies, we need to normalize it using Normalization techniques. The basic and essential normal forms are discussed in the post Normal Forms.


Friday, January 31, 2014

Distributed databases - Concurrency Control


Concurrency control in distributed database / Various locking protocols in distributed database / Single lock manager and distributed lock manager approaches in handling concurrent transactions



Concurrency Control in Distributed Database

Concurrency control schemes deal with handling data as part of concurrent transactions. Various locking protocols are used for handling concurrent transactions in centralized database systems, and there are no major differences between the schemes in centralized and distributed databases. The only major difference is the way the lock manager deals with replicated data. The following topics discuss several possible schemes applicable to an environment where data can be replicated at several sites. We shall assume the existence of the shared and exclusive lock modes.

Locking protocols
                                a) Primary Copy protocol
                                b) Majority protocol
                                c) Biased protocol
                                d) Quorum Consensus protocol

Some assumptions:
  • Our distributed database system consists of n sites (servers/computers in different locations).
  • Data are replicated at two or more sites.
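
As a taste of what follows, here is a minimal Python sketch of the locking rule behind the majority protocol, under the assumptions above (the function and site names are illustrative; a real implementation involves messaging, timeouts, and deadlock handling). A transaction may lock a replicated item only if it obtains the lock at more than half of the sites holding a replica:

def majority_lock(item, sites, try_lock):
    """try_lock(site, item) -> True if that site's lock manager grants
    the lock. The overall lock is granted only with a strict majority."""
    granted = [s for s in sites if try_lock(s, item)]
    return len(granted) > len(sites) // 2

# Item X replicated at 5 sites; suppose 3 of them grant the lock.
grants = {"s1": True, "s2": True, "s3": False, "s4": True, "s5": False}
print(majority_lock("X", list(grants), lambda site, item: grants[site]))   # True (3 of 5)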

Monday, November 25, 2013

Scaleup and Speedup


Scaleup in Parallel (Systems) Database

Scaleup is the ability to keep the same performance level (response time) when both the workload (transactions) and the resources (CPU, memory) increase proportionally.

Scaleup = (small system, small problem elapsed time [single machine]) / (large system, large problem elapsed time [parallel machine])

For example, if 50 users consume close to 100 percent of the CPU during normal processing, then adding more users would cause the system to slow down due to contention for limited CPU cycles. However, by adding more CPUs, we can support extra users without degrading performance.

Scaleup increases the Throughput, i.e., the number of tasks completed in a given time increases.
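
The ratio itself is straightforward to compute; a minimal Python sketch (the timings are illustrative, taken from the 5-second example discussed below):

def scaleup(small_time, large_time):
    """small_time: elapsed time of the small problem on the small system.
    large_time: elapsed time of the proportionally larger problem
    on the proportionally larger system."""
    return small_time / large_time

print(scaleup(5.0, 5.0))   # 1.0   -> linear scaleup
print(scaleup(5.0, 8.0))   # 0.625 -> sub-linear scaleup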


Linear Scaleup 

Scaleup is linear if the resources increase in proportion to the problem size (this is very rare). According to the equation given above, if the small system small problem elapsed time equals the large system large problem elapsed time, then Scaleup = 1, which is linear.

Sub-linear Scaleup 

Scaleup is sub-linear if the large system large problem elapsed time is greater than the small system small problem elapsed time, i.e., if the scaleup value falls below 1.

Further useful discussions:

  • If scaleup is 1, i.e., linear, then the performance of the system is excellent.
  • If scaleup is sub-linear and the value falls between 0 and 1, then we need extra care in choosing our plan for parallel execution. For example, if the small system small problem elapsed time is 5 seconds and the large system large problem elapsed time is also 5 seconds, the scaleup is 5/5 = 1, which is linear. For lower values of the denominator (not possible beyond a limit), the scaleup exceeds 1 and the system performance is excellent. But for higher values of the denominator, say 6, 7, 8 seconds and so on, the scaleup value falls below 1, which needs attention to better workload redistribution.

Speedup in Parallel (Systems) Database

Speedup is the effect of applying an increasing number of resources to a fixed amount of work to achieve a proportional reduction in execution time:

Speedup = (small system elapsed time [single machine]) / (large system elapsed time [parallel machine])

Speedup frees resources for other tasks. For example, if queries usually take ten minutes to process on one CPU and running them in parallel on more CPUs reduces that time, then additional queries can run without introducing the contention that would occur were they to run concurrently.

Speedup reduces the Response Time, i.e., the time to complete a task is reduced.


Linear Speedup 

Speedup is linear if the speedup is N, where N is the number of resources (say CPUs). That is, the small system elapsed time is N times the large system elapsed time. For example, if a single machine does a job in 10 seconds and a parallel machine (10 single machines) does the same job in 1 second, then the speedup is 10/1 = 10 (refer to the equation above), which equals N, the size of the larger system. The speedup is achieved because the system is 10 times more powerful.
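
A minimal Python sketch of this calculation, using the numbers from the example above:

def speedup(single_time, parallel_time):
    """single_time: elapsed time on one machine.
    parallel_time: elapsed time of the same job on the parallel machine."""
    return single_time / parallel_time

n = 10                        # ten machines
s = speedup(10.0, 1.0)        # job time drops from 10 seconds to 1 second
print(s, s == n)              # 10.0 True -> linear speedup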

Sub-Linear Speedup 

Speedup is sub-linear if the speedup is less than N (which is usual in most parallel systems).

Further useful discussions:

  • If the Speedup is N, i.e., linear, then the expected performance is achieved.
  • If the Speedup is not equal to N, then the following two cases are possible;
    • Case 1: If Speedup > N, the system performs better than it was designed for (super-linear speedup), which is rare.
    • Case 2: If Speedup < N, the speedup is sub-linear. In the worst case, the denominator (large system elapsed time) even exceeds the single machine's elapsed time, so the value falls between 0 and 1; we then need to fix a threshold below which the parallel execution should not be accepted. Such a system needs extra care to redistribute the workload among the processors.

What are the reasons for Sub-linear performance of speedup and scaleup


The following are the major factors which affect efficient parallel operation and can reduce both speedup and scaleup.

Startup costs. 

For every operation that is to be done in parallel, there is an associated cost involved in starting the processes. That is, breaking a given big process into many small processes consumes some amount of time and resources. For example, a query that sorts the data of a table needs to partition the table, and the various parallel processors must be instructed to execute the query, before any parallel operation begins.
Example:
SELECT * FROM Emp ORDER BY Salary;

This query sorts the records of the Emp table on the Salary attribute. Let us assume there are 1 million employees, hence one million records; the query will take some time to execute on a computer with minimal resources. If we would like to execute the query in parallel, we need to distribute (or redistribute) the data over several processors (in the case of a Shared Nothing architecture), and we also need to instruct all the processors to execute the query. These two operations take some amount of time before any parallel execution happens.

Interference. 

Since processes executing in a parallel system often access shared resources, a slowdown may result from the interference of each new process as it competes with existing processes for commonly held resources, such as a system bus, or shared disks, or even locks. Both speedup and scaleup are affected by this phenomenon.

Skew. 

When breaking down a single task into a number of small parallel tasks, it is very hard to make them exactly equal in size. Hence the performance of the system depends on the slowest CPU, the one processing the largest sub-task. This uneven distribution of a job is called skew. For example, if a task of size 100 is divided into 10 parts and the division is skewed, there may be some tasks of size less than 10 and some of size more than 10; if even one task happens to be of size 20, the speedup obtained by running the tasks in parallel is only five, instead of the ten we would have hoped for.
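
The arithmetic of this example, as a minimal Python sketch (the particular split is invented; only the totals come from the text):

serial_work = 100
parts = [20, 9, 9, 9, 9, 9, 9, 9, 9, 8]   # a skewed 10-way split, still 100 in total

# With one processor per part, the elapsed time is set by the largest part.
parallel_time = max(parts)                  # 20
print(serial_work / parallel_time)          # 5.0, not the hoped-for 10.0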
