Computer Science and Engineering - Tutorials, Notes, MCQs, Questions and Answers: 2PC Protocol

Saturday, July 19, 2014

Failure of a Coordinator Site

Handling of the Coordinator Site failure by 2PC / How does 2 Phase Commit Protocol handle the failure of a coordinator site? / Steps involved in handling coordinator failure by 2PC protocol

Handling the Failure of a Coordinator Site

Suppose that the coordinator site fails during the execution of the Two Phase Commit (2PC) protocol for a transaction T. This situation can be handled in two ways:

The other sites that are participating in transaction T may try to decide the fate of the transaction themselves. That is, they may try to decide whether to commit or abort T using the control messages recorded in the logs of the active sites.

The second way is to wait until the coordinator site recovers.

Method 1

Let us see in detail how the final status of transaction T can be decided through the first method.

[a] If an active site has a <commit T> message in its log - The decision to commit is taken only by the coordinator site, and a participating site can write <commit T> into its log file only after the coordinator has sent it the <commit T> message. So the coordinator must have decided to commit before it failed. Hence, the decision is to commit the transaction T.

[b] If an active site has recorded an <abort T> message in its log - This clearly shows that the decision taken by the coordinator site before it failed was to abort the transaction T. Hence, the decision should be to abort T.

[c] If some active sites do not hold a <ready T> message in their log files - As stated in the 2PC protocol, if one or more of the participating sites do not have a <ready T> message in their log files, then those sites cannot have sent <ready T> to the coordinator in response to the <prepare T> message. Hence, the coordinator cannot have decided to commit the transaction T (it may have decided to abort, or failed before deciding). So, rather than wait for the coordinator, we abort T.

Method 2

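If none of the cases [a], [b], and [c] holds, then every active site has a <ready T> message in its log but no <commit T> or <abort T> message, so the active sites cannot tell what the coordinator decided. In this situation we can apply only the second way of handling the failure of the coordinator site. That is, we need to wait until the transaction coordinator recovers. Until then, T continues to hold its resources at the active sites, which is why this is known as the blocking problem of 2PC.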
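The decision rules of Method 1 and Method 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each active site's log is modelled as a set of strings ("ready", "commit", "abort") for transaction T, and the function name decide_without_coordinator is hypothetical, not part of any real system.

```python
from typing import Iterable, Set

# Hypothetical labels for the control records of transaction T.
COMMIT, ABORT, READY = "commit", "abort", "ready"


def decide_without_coordinator(active_logs: Iterable[Set[str]]) -> str:
    """Return 'commit', 'abort', or 'wait' from the logs of the active sites."""
    logs = list(active_logs)
    # Case [a]: some active site logged <commit T>.
    if any(COMMIT in log for log in logs):
        return "commit"
    # Case [b]: some active site logged <abort T>.
    if any(ABORT in log for log in logs):
        return "abort"
    # Case [c]: some active site never logged <ready T>,
    # so the coordinator cannot have decided to commit.
    if any(READY not in log for log in logs):
        return "abort"
    # Otherwise every active site is ready but undecided: wait (Method 2).
    return "wait"
```

For example, decide_without_coordinator([{"ready"}, {"ready"}]) falls through all three cases, so the active sites have to wait for the coordinator to recover.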

Two Phase Commit Protocol in Pictures

Two Phase Commit Protocol in Pictorial Representation / Pictorial Representation of 2PC Protocol / Easy understanding of 2PC through Pictures / What is the 2PC Protocol all about? / Explain the steps involved in Two Phase Commit Protocol

The Two Phase Commit (2PC) protocol commits the transaction if all the participating sites are ready to commit. See steps 1 to 4 below:

Step 1

Step 1 - The Transaction Coordinator (TC) for transaction T sends the <prepare T> message to all the participating sites

Step 2

Step 2 - The Transaction Managers (TMs) of all the participating sites record their stand <ready T> for T in their logs and send <ready T> to the TC

Step 3

Step 3 - If all sites are ready, then the TC writes <commit T> to its log and sends the <commit T> message to all the participating sites

Step 4

Step 4 - The participating sites record and execute the decision of the TC on transaction T

The Two Phase Commit (2PC) protocol aborts the transaction if any of the participating sites is not ready to commit. See steps 1 to 4 given below:

Step 1

Step 1 - The Transaction Coordinator (TC) for transaction T sends the <prepare T> message to all the participating sites

Step 2

Step 2 - One of the sites is not ready to commit. Hence, it writes <no T> to its log and sends <abort T> to the TC

Step 3

Step 3 - As one of the sites is not ready to commit, the TC decides to abort, writes <abort T> to its log, and sends <abort T> to all the TMs

Step 4

Step 4 - The participating sites record and execute the decision of the TC to abort T
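The two walk-throughs above (commit and abort) can be sketched as a small single-process simulation in Python. This is a minimal sketch, not a real distributed implementation: there is no network or failure handling, each site's vote is given as a boolean, and the name run_two_phase_commit is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, List, Tuple


def run_two_phase_commit(
    votes: Dict[str, bool]
) -> Tuple[str, List[str], Dict[str, List[str]]]:
    """Simulate steps 1 to 4 for transaction T.

    votes maps a site name to True (ready to commit) or False (not ready).
    Returns the decision, the TC log, and the log of every participating site.
    """
    tc_log: List[str] = []
    site_logs: Dict[str, List[str]] = {site: [] for site in votes}
    replies: Dict[str, str] = {}

    # Steps 1 and 2: TC sends <prepare T>; each TM logs its stand and replies.
    for site, can_commit in votes.items():
        if can_commit:
            site_logs[site].append("<ready T>")
            replies[site] = "ready T"
        else:
            site_logs[site].append("<no T>")
            replies[site] = "abort T"

    # Step 3: TC commits only if every site replied ready, and logs its decision.
    all_ready = all(reply == "ready T" for reply in replies.values())
    decision = "commit" if all_ready else "abort"
    tc_log.append(f"<{decision} T>")

    # Step 4: every participating site records the decision of the TC.
    for site in votes:
        site_logs[site].append(f"<{decision} T>")

    return decision, tc_log, site_logs
```

Calling run_two_phase_commit({"S1": True, "S2": False}) follows the abort walk-through: S2 logs <no T>, the TC logs <abort T>, and both sites end with <abort T> in their logs.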

Wednesday, July 16, 2014

Failure of a participating site

Failure of a Participating Site / Handling the failure of a participating site by 2PC protocol in distributed database / 2 Phase Commit (2PC) protocol failure handling techniques

Recall the messages used by the TC (Transaction Coordinator) and the TMs (Transaction Managers) to carry out a transaction in a Distributed Database.

<Prepare T> - It is sent by the TC to the TMs of all the participating sites. It instructs the participating sites to get ready to commit the transaction T.

<Ready T> - This message is sent to the TC by the TM of each participating site that is ready to commit the transaction T.

<Abort T> - This message indicates that the system which sends it is not ready to commit or cannot commit T. This message can be sent by the participating sites (TMs can send) or by the initiating site (TC can send).

<Commit T> - Instructs the system to commit, i.e., permanently store the changes in the database.

Handling a Failure of a participating site

Let us assume that the failed site is S_i and the Transaction Coordinator is TC. There are two things we need to look into to handle such a failure:

1. The response of the Transaction Coordinator of transaction T.

If the failed site has not sent a <ready T> message, the TC cannot decide to commit the transaction [Remember, in a distributed database all the participating sites must be ready to commit. Even if one site is not ready, the whole transaction needs to be aborted by the TC]. Hence, the transaction T should be aborted and the other participating sites informed.

If the failed site has sent a <ready T> message before failing, the TC knows that the failed site was ready to commit. Hence, if all the other sites are also ready, the transaction can be committed by the TC and the other sites will be informed to commit. In this case, the site that recovers from the failure has to follow the recovery steps of the 2PC protocol (described below) to bring its local database up to date.

2. The response of the failed site when it recovers.

When it recovers from the failure, the site S_i must identify the fate of the transactions that were in progress when it failed. This can be done by examining the log file entries of site S_i.

The following are the possible cases and the relevant actions:

[a] If the log contains a <commit T> entry - It means that all the sites including S_i have responded with a <ready T> message to the TC, and the TC must have sent <commit T> to all the participants, because the participating sites are not allowed to insert a <commit T> message in the log file without the coordinator's decision. Hence, the recovered site S_i can perform redo(T). That is, the logged updates of T are applied once again locally at S_i.

[b] If the log contains an <abort T> entry - A site has an <abort T> message in its log only if the decision taken for transaction T was to abort. Hence, site S_i executes undo(T).

[c] If the log contains a <ready T> entry - This means that the site S_i failed after recording its own status on transaction T but before learning the final decision. Now, it has to contact the TC or the other sites to find out the fate of the transaction T.

The first choice is to contact the TC of transaction T. If the TC has a <commit T> entry, then according to the above discussion, S_i has to perform redo(T). If the TC has an <abort T> entry, then S_i performs undo(T).

The second choice is to contact the other sites which have participated in transaction T (this choice is chosen only if TC is not available). Then the decision can be taken based on the other sites’ log entries.

[d] If the log contains no control messages, i.e., no <abort T>, <commit T>, or <ready T> - It clearly shows that the site S_i failed before responding to the <prepare T> message. Since the TC never received <ready T> from S_i, it must have aborted the transaction. So, S_i needs to execute undo(T).

This is how 2PC handles the failure of a participating site.
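The recovery cases [a] to [d] for a recovering site S_i can be sketched in Python as follows. This is an illustrative sketch only: the log of S_i is modelled as a set of strings for transaction T, the outcome learned from the TC or from the other sites is passed in as an optional argument, and the name recovery_action is hypothetical.

```python
from typing import Optional, Set


def recovery_action(log: Set[str], outcome_from_others: Optional[str] = None) -> str:
    """Return 'redo', 'undo', or 'wait' for transaction T at a recovering site.

    log holds the control records S_i finds for T ('ready', 'commit', 'abort').
    outcome_from_others is 'commit' or 'abort' if the TC or another site could
    report the decision, or None if nobody could be reached or nobody knows.
    """
    # Case [a]: the decision to commit reached S_i before it failed.
    if "commit" in log:
        return "redo"
    # Case [b]: the decision to abort reached S_i before it failed.
    if "abort" in log:
        return "undo"
    # Case [c]: S_i voted ready but does not know the decision; ask the others.
    if "ready" in log:
        if outcome_from_others == "commit":
            return "redo"
        if outcome_from_others == "abort":
            return "undo"
        return "wait"
    # Case [d]: S_i never answered <prepare T>, so T must have been aborted.
    return "undo"
```

For example, recovery_action({"ready"}) returns "wait" because S_i has voted but has not yet learned the decision, while recovery_action(set()) returns "undo".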

The handling of other types of failures is discussed in the following posts:

Failure of a coordinator

Network partition

Set of possible system failure modes in distributed database

Possible System Failure Modes in Distributed Database / Types of System Failures in Distributed Database / How does 2PC protocol handle failures in distributed database?

Well-known failures such as software errors, hardware failures, hard disk failures, and power failures are common to both centralized and distributed database systems. Apart from these common failures, a distributed database system may suffer from the failures listed below:

Failure of a site - A distributed database consists of two or more servers, which are also called sites. Any of these sites might fail. Even though the cause is an ordinary hardware or software failure, in a distributed system it must be treated differently, because the other sites keep running and must cope with the missing site.

Loss of messages - The messages exchanged between sites might be lost in transit. Transmission-control protocols such as TCP/IP are responsible for handling these losses, typically by retransmitting the lost messages.

Failure of a communication link - A communication link between two sites might fail. In such a case, the distributed database system may try to find an alternate route to send the messages.

Network partition - A distributed database system is said to be partitioned if it has been split into two or more subsystems (partitions) that have no connection between them. A subsystem may consist of one or more sites. For example, consider a distributed database system that manages sites at three different college campuses. Each campus may have several sites internally, but it is connected to the other campuses through a single link. If this link fails, the campus is cut off from the rest of the system, and the sites on either side cannot easily diagnose the actual problem: from their point of view, the failure may look like the failure of a site, a loss of messages, or the failure of a communication link.

Among all the failures discussed above, Failure of a site and Network partition need extra care when handling failures in a distributed database.

Please recall from the post Distributed Transactions the various components of the Transaction System Structure. As shown in the figure below, every site has its own Transaction Coordinator and Transaction Manager. In distributed database systems, the resources are shared among many sites. Hence, the site which initiates a transaction T may be treated as the coordinating site (its Transaction Coordinator [TC] is responsible). The other sites which participate in completing the transaction T may be called participating sites (the Transaction Managers [TM] of those sites are responsible). So, the failure of a site may be treated in two different senses: failure of a participating site and failure of a coordinator.

Figure 1 - Transaction System Structure in a Distributed Database

The links given below will take you to the posts about handling failures in a distributed database.
