
Distributed Database Design

This chapter introduces the basic principles of distributed database design and related concepts. Fragmentation, replication, and data allocation are discussed in detail, and the different types of fragmentation are illustrated with examples. The benefits and objectives of fragmentation, the different allocation strategies, and the allocation of replicated and non-replicated fragments are explained briefly. The different types of distribution transparency are also covered in this chapter.

The outline of this chapter is as follows. Section 5.1 presents the basic concepts of distributed database design. The objectives of data distribution are introduced in Section 5.2. In Section 5.3, data fragmentation, an important issue in distributed database design, is explained briefly with examples. Section 5.4 focuses on the allocation of fragments and on measuring the costs and benefits of fragment allocation. In Section 5.5, the different types of distribution transparency are presented.

Distributed Database Design Concepts
In a distributed system, data are physically distributed among several sites, but the system presents its users with the view of a single logical database. Each node of a distributed database system may follow the three-tier architecture of a centralized database management system (DBMS). Thus, the design of a distributed database system involves the design of a global conceptual schema in addition to the local schemas, which conform to the three-tier architecture of the DBMS at each site. The design of the computer network across the sites of a distributed system adds extra complexity. The crucial design issue is the distribution of data among the sites of the distributed system. The design and implementation of a distributed database system is therefore a complicated task, and it involves the three important factors listed below.

• Fragmentation. A global relation may be divided into several non-overlapping subrelations called fragments, which are then distributed among sites.
• Allocation. Allocation is the assignment of fragments to the sites of a distributed system. Each fragment should be stored at the site where its placement is optimal.
• Replication. The distributed database system may maintain several copies of a fragment at different sites.

The definition and allocation of fragments must be based on how the database is to be used. After the database schemas have been designed, application programs must be designed to access and manipulate the data in the distributed database system. Precise knowledge of application requirements is necessary in the design of a distributed database system, since the database schemas must be able to support the applications efficiently. Thus, the database design should be based on both quantitative and qualitative information, which together represent the application requirements. Quantitative information is used in allocation, while qualitative information is used in fragmentation. The quantitative information of application requirements may include the following:

• The frequency with which a transaction is run, that is, the number of transaction requests per unit time. For general applications that are issued from multiple sites, it is necessary to know the frequency of activation of each transaction at each site.
• The site from which a transaction is run (also called the site of origin of the transaction).
• The performance criteria for transactions.

The qualitative information of application requirements may include the following information about the transactions that are executed:

• The relations, attributes, and tuples accessed by the transactions.
• The type of access (read or write).
• The predicates of read operations.

Characterizing these features is not trivial. Moreover, this information is typically given for the global relations and must be properly translated into terms of all the fragmentation alternatives that are considered during database design.

Alternative Approaches for Distributed Database Design
Distributed database design involves making decisions on the fragmentation of data and the placement of the fragments across the sites of a computer network. Two alternative approaches have been identified for distributed database design, namely, the top-down and the bottom-up design process.

Top-down design process. In this process, the database design starts from the global schema design, proceeds by designing the fragmentation of the database, and then allocates the fragments to the different sites, creating the physical images. The process is completed by performing the physical design of the data allocated to each site. The global schema design involves designing both the global conceptual schema and the global external schemas (view design). In the global conceptual schema design step, the designer specifies the data entities and determines the applications that will run on the database, together with statistical information about these applications. Next, the local conceptual schemas are designed; the objective of this step is to distribute the entities over the sites of the distributed system. Rather than distributing whole relations, it is quite common to partition relations into subrelations, which are then distributed to different sites. Thus, in a top-down approach, the distributed database design involves two phases, namely, fragmentation and allocation.

The fragmentation phase is the process of clustering information into fragments that can be accessed simultaneously by different applications, whereas the allocation phase is the process of distributing the generated fragments among the sites of a distributed database system. The last step of the top-down design process is the physical database design, which maps the local conceptual schemas onto the physical storage devices available at the corresponding sites. The top-down design process is best suited to distributed systems that are developed from scratch.

Bottom-up design process. The bottom-up design process is followed when several existing databases are aggregated to develop a distributed system. It is based on the integration of several existing local schemas into a single global conceptual schema. Using the bottom-up approach, it is also possible to combine several existing heterogeneous systems into a distributed database system. The bottom-up design process requires the following steps:

1. The selection of a common database model for describing the global schema of the database.
2. The translation of each local schema into the common data model.
3. The integration of the local schemas into a common global schema.

Either of the above design strategies may be followed to develop a distributed database system.

Objectives of Data Distribution


In a distributed system, data may be fragmented, and each fragment
can have a number of replicas to increase data availability and
reliability. The following objectives must be considered while designing
the fragmentation and allocation of these fragments to different sites
in a distributed system (i.e., during the design of data distribution).

1. Locality of reference. To maximize the locality of reference, data should be stored close to where it is used whenever possible. If a fragment is used at several sites, it may be beneficial to store copies of the fragment at those sites.
2. Improved availability and reliability of distributed data. Reliability and availability are improved by replication. A higher degree of availability can be achieved by storing multiple copies of the same information at different sites. Reliability is also improved, since after a site failure normal operation can continue by referencing a copy of the same information held at another site.
3. Workload distribution and improved performance. Bad allocation may result in underutilization of resources, and system performance may degrade as a result. Distributing the workload over the sites is an important feature of a distributed database system. The workload should be distributed so as to take advantage of the system resources at each site and to maximize the degree of parallelism in the execution of transactions. Since workload distribution might negatively affect processing locality, the trade-off between the two must be considered during the design of data distribution.
4. Balanced storage capacities and costs. Database distribution must reflect the cost and availability of storage at the different sites, so that cheap mass storage can be used whenever possible. This must be balanced against locality of reference.
5. Minimal communication costs. The cost of processing remote requests must be considered during data distribution. Retrieval cost is minimized when locality of reference is maximized or when each site has its own copy of the data. However, when replicated data are updated, the update has to be performed at all sites holding a replica, thereby increasing the communication costs.

Alternative Strategies for Data Allocation


Four alternative strategies have been identified for data allocation. This
section describes these different data allocation strategies and also
draws a comparison between them.

1. Centralized. In this strategy, the distributed system consists of a single database and DBMS stored at one site, with users distributed across the communication network. Remote users access the centralized data over the network; thus, this strategy is similar to distributed processing.

In this approach, locality of reference is the lowest at all sites except the central site where the data are stored. The communication cost is very high, since all users except those at the central site have to use the network for every data access. Reliability and availability are very low, since the failure of the central site results in the loss of the entire database system.

2. Fragmented (or partitioned). This strategy partitions the entire database into disjoint fragments, and each fragment is assigned to one site. Fragments are not replicated.

If fragments are stored at the site where they are used most frequently, locality of reference is high. As there is no replication of data, storage cost is low. Reliability and availability are also low, but still higher than in the centralized strategy, as the failure of a site results in the loss of local data only. Communication costs are incurred only for global transactions. If the data distribution is designed properly, performance should be good and communication costs low.

3. Complete replication. In this strategy, each site of the system maintains a complete copy of the entire database. Since all the data are available at all sites, locality of reference, availability and reliability, and performance are maximized in this approach.

Storage costs are very high in this case. Because every site holds all the data, no communication costs are incurred for reading data in global transactions. However, communication costs for updating data items are the highest of all strategies. To overcome this problem, snapshots are sometimes used. A snapshot is a copy of the data at a given time. The copies are updated periodically, so they may not always be up to date. Snapshots are also sometimes used to implement views in a distributed database, to reduce the time taken to perform a database operation on a view.

4. Selective replication. This strategy is a combination of the centralized, fragmented, and complete replication strategies. Some of the data items are fragmented and allocated to the sites where they are used frequently, to achieve high locality of reference. Data items or fragments that are used by many sites simultaneously but are not frequently updated are replicated and stored at all of those sites. Data items that are not used frequently are centralized.

The objective of this strategy is to obtain the advantages of all the other strategies while avoiding their disadvantages. This strategy is used most commonly because of its flexibility (see Table 5.1).

Table 5.1. Comparison of Strategies for Data Allocation

| Strategy | Locality of reference | Reliability and availability | Workload distribution and performance | Storage costs | Communication costs |
|---|---|---|---|---|---|
| Centralized | Lowest | Lowest | Poor | Lowest | Highest |
| Fragmented | High | Low for data item, high for system | Satisfactory | Lowest | Low |
| Complete replication | Highest | Highest | Best for reading | Highest | High for updating, low for reading |
| Selective replication | High | Low for data item, high for system | Satisfactory | Average | Low |

Data Fragmentation
In a distributed system, a global relation may be divided into several non-overlapping subrelations, called fragments, which are allocated to different sites. This process is called data fragmentation. The objective of data fragmentation design is to determine non-overlapping fragments, which are the logical units of allocation. Fragments can be designed by grouping a number of tuples or attributes of relations. The tuples or attributes that constitute a fragment all have the same properties.

Benefits of Data Fragmentation


Data fragmentation provides a number of advantages that are listed in
the following.

1. Better usage. In general, applications work with views rather than with entire relations. Therefore, it is beneficial to fragment relations into subrelations and to store these at different sites as the units of distribution.
2. Improved efficiency. Fragmented data can be stored close to where it is most frequently used. In addition, data that are not required by local applications are not stored locally, which may result in faster data access and thus greater efficiency.
3. Improved parallelism or concurrency. With a fragment as the unit of distribution, a transaction can be divided into several subtransactions that operate on different fragments in parallel. This increases the degree of concurrency or parallelism in the system, allowing transactions to execute in parallel in a safe way.
4. Better security. Data that are not required by local applications are not stored locally and, consequently, are not available to unauthorized users of the distributed system, which improves security.

Fragmentation also has several disadvantages. Owing to data fragmentation, performance can degrade, and maintaining integrity may become difficult.

1. Performance degradation. Global applications that require data from several fragments located at different sites may become slower. For example, a global application may need to retrieve data from more than one fragment located at different sites. In this case, performing a union or a join of these fragments is slow and costly.
2. Maintaining integrity becomes difficult. Owing to data fragmentation, data and functional dependencies may be decomposed into different fragments that might be allocated to different sites across the network of a distributed system. In this situation, checking functional dependencies is very complicated and, consequently, maintaining integrity becomes very difficult.

Correctness Rules for Data Fragmentation


To ensure no loss of information and no redundancy of data (i.e., to
ensure the correctness of fragmentation), there are three different
rules that must be considered during fragmentation. These correctness
rules are listed below.

1. Completeness. If a relation instance R is decomposed into fragments R1, R2, ..., Rn, each data item in R must appear in at least one of the fragments Ri. This property is similar to the lossless decomposition property of normalization, and it is necessary to ensure that no data are lost during fragmentation.
2. Reconstruction. If a relation R is decomposed into fragments R1, R2, ..., Rn, it must be possible to define a relational operation that reconstructs the relation R from the fragments R1, R2, ..., Rn. This rule ensures that constraints defined on the data in the form of functional dependencies are preserved during data fragmentation.
3. Disjointness. If a relation instance R is decomposed into fragments R1, R2, ..., Rn, and a data item is found in the fragment Ri, then it must not appear in any other fragment. This rule ensures minimal data redundancy. In vertical fragmentation, the primary key attributes must be repeated in every fragment to allow reconstruction and to preserve functional dependencies. Therefore, in vertical fragmentation, disjointness is defined only on the non-primary-key attributes of a relation.
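
The three rules can be expressed as simple set checks. The following is a minimal sketch for the horizontal case only, assuming that a relation and its fragments are modelled as Python sets of tuples; the function names and the sample relation `R` are illustrative and do not come from the text.

```python
def is_complete(relation, fragments):
    """Completeness: every tuple of the relation appears in at least one fragment."""
    return set(relation) <= set().union(*fragments)


def is_reconstructible(relation, fragments):
    """Reconstruction (horizontal case): the union of the fragments gives back the relation."""
    return set().union(*fragments) == set(relation)


def is_disjoint(fragments):
    """Disjointness: no tuple appears in more than one fragment."""
    seen = set()
    for fragment in fragments:
        if seen & set(fragment):
            return False
        seen |= set(fragment)
    return True


# Hypothetical relation of (id, region) tuples, fragmented by region.
R = {(1, "east"), (2, "north"), (3, "east")}
R1 = {t for t in R if t[1] == "east"}
R2 = {t for t in R if t[1] == "north"}

print(is_complete(R, [R1, R2]))         # True
print(is_reconstructible(R, [R1, R2]))  # True
print(is_disjoint([R1, R2]))            # True
```

For vertical fragments the same ideas apply to attribute sets rather than tuple sets, with the primary key exempt from the disjointness check.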

Different Types of Fragmentation


This section introduces the different types of fragmentation. There are two main types: horizontal and vertical. Horizontal fragments are subsets of tuples, whereas vertical fragments are subsets of attributes. There are also two other types: mixed fragmentation and derived fragmentation, the latter being a type of horizontal fragmentation. These types of fragmentation are described briefly in the following.

HORIZONTAL FRAGMENTATION
Horizontal fragmentation partitions a relation along its tuples, that is,
horizontal fragments are subsets of the tuples of a relation. A
horizontal fragment is produced by specifying a predicate that performs
a restriction on the tuples of a relation. In this fragmentation, the
predicate is defined by using the selection operation of the relational
algebra. For a given relation R, a horizontal fragment is defined as

$\sigma_{p}(R)$

where $p$ is a predicate based on one or more attributes of the relation R.

In some cases, the choice of horizontal fragmentation strategy, that is, of the predicates or search conditions for horizontal fragmentation, is obvious. In other cases, it is very difficult to choose the predicates, and a detailed analysis of the application is required. The predicates may be simple, involving a single attribute, or complex, involving multiple attributes. Further, the predicate on each attribute may be single-valued or multi-valued, and the values may be discrete or may involve a range of values. Rather than using complicated predicates, the fragmentation strategy involves finding minterm predicates that can be used as the basis for the fragmentation schema. A minterm predicate is a conjunction of simple predicates, each taken in its natural or negated form, and the underlying set of simple predicates should be complete and minimal. A set of predicates is complete if and only if any two tuples in the same fragment are referenced with the same probability by any transaction. A predicate is relevant if there is at least one transaction that accesses the resulting fragments differently; a set of predicates is minimal if all of its predicates are relevant.

Horizontal fragments can be generated by using the following method:

• Consider a predicate P1 that partitions the tuples of a relation R into two parts that are referenced differently by at least one application. Let P = {P1}.
• Consider a new simple predicate Pi that partitions at least one fragment of P into two parts that are referenced differently by at least one application, and set P ← P ∪ {Pi}. Non-relevant predicates should be eliminated from P, and this step is repeated until the set of minterm fragments of P is complete.

Example 5.1.

Let us consider the relational schema Project [Chapter 3, Section 3.1.3], where project-type indicates whether a project is an inside project (within the country) or an abroad project (outside the country). Assume that P1 and P2 are two horizontal fragments of the relation Project, obtained by testing whether the value of the project-type attribute is “inside” or “abroad”, as listed in the following:

• $P_1 = \sigma_{\text{project-type} = \text{“inside”}}(\text{Project})$
• $P_2 = \sigma_{\text{project-type} = \text{“abroad”}}(\text{Project})$

The Project relation and its horizontal fragments are illustrated in Figure 5.1.

Project

| Project-id | Project-name | Project-type | Project-leader-id | Branch-no | Amount |
|---|---|---|---|---|---|
| P01 | Inventory | Inside | E001 | B10 | $1000000 |
| P02 | Sales | Inside | E001 | B20 | $300000 |
| P03 | R&D | Abroad | E004 | B70 | $8000000 |
| P04 | Educational | Inside | E003 | B20 | $400000 |
| P05 | Health | Abroad | E005 | B60 | $7000000 |

P1

| Project-id | Project-name | Project-type | Project-leader-id | Branch-no | Amount |
|---|---|---|---|---|---|
| P01 | Inventory | Inside | E001 | B10 | $1000000 |
| P02 | Sales | Inside | E001 | B20 | $300000 |
| P04 | Educational | Inside | E003 | B20 | $400000 |

P2

| Project-id | Project-name | Project-type | Project-leader-id | Branch-no | Amount |
|---|---|---|---|---|---|
| P03 | R&D | Abroad | E004 | B70 | $8000000 |
| P05 | Health | Abroad | E005 | B60 | $7000000 |

Figure 5.1. Horizontal fragmentation of the relation Project

These horizontal fragments satisfy all the correctness rules of fragmentation, as shown below:

• Completeness. Each tuple in the relation Project appears in either fragment P1 or fragment P2. Thus, the completeness rule is satisfied.
• Reconstruction. The Project relation can be reconstructed from the horizontal fragments P1 and P2 by using the union operation of relational algebra, which satisfies the reconstruction rule. Thus, $P_1 \cup P_2 = \text{Project}$.
• Disjointness. The fragments P1 and P2 are disjoint, since there can be no project whose project-type is both “inside” and “abroad”.

In this example, the predicate set {project-type = “inside”, project-type = “abroad”} is complete.
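
Example 5.1 can be reproduced in a few lines of Python. This is only a sketch: the `Project` named tuple, its field names, and the `select` helper are assumptions made for illustration, while the rows are those of the Project relation above.

```python
from collections import namedtuple

# Field names are illustrative renderings of the Project attributes.
Project = namedtuple(
    "Project", "project_id name project_type leader_id branch_no amount"
)

PROJECT = [
    Project("P01", "Inventory", "Inside", "E001", "B10", 1000000),
    Project("P02", "Sales", "Inside", "E001", "B20", 300000),
    Project("P03", "R&D", "Abroad", "E004", "B70", 8000000),
    Project("P04", "Educational", "Inside", "E003", "B20", 400000),
    Project("P05", "Health", "Abroad", "E005", "B60", 7000000),
]


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# The two horizontal fragments of Example 5.1.
P1 = select(PROJECT, lambda r: r.project_type == "Inside")
P2 = select(PROJECT, lambda r: r.project_type == "Abroad")

print([r.project_id for r in P1])            # ['P01', 'P02', 'P04']
print([r.project_id for r in P2])            # ['P03', 'P05']
print(set(P1) | set(P2) == set(PROJECT))     # reconstruction by union: True
print(set(P1) & set(P2) == set())            # disjointness: True
```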
Example 5.2.

Let us consider the distributed database of a manufacturing company that has three sites in the eastern, northern, and southern regions. The company has a total of 20 products, of which the first 10 are produced in the eastern region, the next five in the northern region, and the remaining five in the southern region. The global schema of this distributed database includes several relational schemas such as Branch, Product, Supplier, and Sales. In this example, the horizontal fragmentation of Sales and Product is considered.

Assume that the region attribute of the relational schema Sales (depo-no, depo-name, region) takes two values, “eastern” and “northern”. Let us consider an application that can be issued from any site of the distributed system and involves the following SQL query.

• SELECT depo-name FROM Sales WHERE depo-no = $1

If the query is initiated at site 1, it references Sales tuples whose region is “eastern” with 80 percent probability. Similarly, if the query is initiated at site 2, it references Sales tuples whose region is “northern” with 80 percent probability, whereas if the query is issued at site 3, it references Sales tuples of “eastern” and “northern” with equal probability. It is assumed that the products produced in a region go to the nearest sales depot for sale. Now, the set of predicates is {p1, p2}, where

• p1: region = “eastern” and p2: region = “northern”.

Since the set of predicates {p1, p2} is complete and minimal, the process terminates. Note that these relevant predicates cannot be deduced by analysing the code of the application, since the query selects on depo-no rather than on region. In this case, the minterm predicates are as follows:

• X1: (region = “eastern”) AND (region = “northern”)
• X2: (region = “eastern”) AND NOT (region = “northern”)
• X3: NOT (region = “eastern”) AND (region = “northern”)
• X4: NOT (region = “eastern”) AND NOT (region = “northern”)

Since (region = “eastern”) ⇒ NOT (region = “northern”) and (region = “northern”) ⇒ NOT (region = “eastern”), X1 is contradictory. X4 is also contradictory, because region takes only these two values. X2 and X3 reduce to the predicates p1 and p2, respectively.
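
The enumeration and reduction of the four conjunctions X1 to X4 can be sketched in Python as follows. The sketch assumes, as in the example, that region takes only the values “eastern” and “northern”; the names `REGIONS`, `simple_predicates`, and `minterms` are illustrative.

```python
from itertools import product

# Assumed domain of the region attribute (from Example 5.2).
REGIONS = ["eastern", "northern"]

simple_predicates = {
    "p1": lambda region: region == "eastern",
    "p2": lambda region: region == "northern",
}


def minterms(predicates):
    """Yield (signs, test) for every combination of natural/negated predicates."""
    names = list(predicates)
    for signs in product([True, False], repeat=len(names)):
        # Bind signs as a default argument so each test keeps its own combination.
        def test(value, signs=signs):
            return all(predicates[n](value) == s for n, s in zip(names, signs))

        yield dict(zip(names, signs)), test


# Keep only the conjunctions that some value of the domain can satisfy.
satisfiable = [
    signs
    for signs, test in minterms(simple_predicates)
    if any(test(region) for region in REGIONS)
]

print(satisfiable)
# [{'p1': True, 'p2': False}, {'p1': False, 'p2': True}]
```

The two surviving combinations correspond to X2 and X3; the combinations for X1 and X4 are dropped because no value in the assumed domain satisfies them.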

For the global relation Product (product-id, product-name, price, product-type), the set of predicates is as follows:

• P1: product-id ≤ 10
• P2: 10 < product-id ≤ 15
• P3: product-id > 15
• P4: product-type = “consumable”
• P5: product-type = “non-consumable”

It is assumed that applications are issued at site 1 and site 2 only. It is further assumed that the applications that involve queries about consumable products are issued at site 1, while the applications that involve queries about non-consumable products are issued at site 2. In this case, the fragments after reduction with the minimal set of predicates are as follows:

• F1: product-id ≤ 10
• F2: (10 < product-id ≤ 15) AND (product-type = “consumable”)
• F3: (10 < product-id ≤ 15) AND (product-type = “non-consumable”)
• F4: product-id > 15

The allocation of fragments is introduced in Section 5.4.
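
As a quick sanity check on the fragments F1 to F4 of Example 5.2, the sketch below encodes each fragment predicate and verifies that every possible product falls into exactly one fragment. It assumes that product-id is an integer from 1 to 20 and that product-type takes only the two values shown; the names `FRAGMENTS` and `fragments_for` are illustrative.

```python
# Fragment predicates F1..F4 over (product_id, product_type).
FRAGMENTS = {
    "F1": lambda pid, ptype: pid <= 10,
    "F2": lambda pid, ptype: 10 < pid <= 15 and ptype == "consumable",
    "F3": lambda pid, ptype: 10 < pid <= 15 and ptype == "non-consumable",
    "F4": lambda pid, ptype: pid > 15,
}

PRODUCT_TYPES = ["consumable", "non-consumable"]  # assumed domain


def fragments_for(pid, ptype):
    """Return the names of all fragments whose predicate the product satisfies."""
    return [name for name, pred in FRAGMENTS.items() if pred(pid, ptype)]


# Completeness and disjointness: each product lands in exactly one fragment.
exactly_one = all(
    len(fragments_for(pid, ptype)) == 1
    for pid in range(1, 21)
    for ptype in PRODUCT_TYPES
)

print(exactly_one)                            # True
print(fragments_for(12, "consumable"))        # ['F2']
print(fragments_for(12, "non-consumable"))    # ['F3']
```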

VERTICAL FRAGMENTATION
Vertical fragmentation partitions a relation along its attributes, that is,
vertical fragments are subsets of attributes of a relation. A vertical
fragment is defined by using the projection operation of relational
algebra. For a given relation R, a vertical fragment is defined as

$\Pi_{a_1, a_2, \ldots, a_n}(R)$

where $a_1, a_2, \ldots, a_n$ are attributes of the relation R.

The choice of vertical fragmentation strategy is more complex than that of horizontal fragmentation, since a larger number of alternatives is available. One solution is to consider the affinity of one attribute to another. Two approaches have been identified for attribute partitioning in the vertical fragmentation of global relations:

1. Grouping. Grouping starts by assigning each attribute to its own fragment, and at each step some of the fragments are joined until some criterion is satisfied. Grouping was first suggested by Hammer and Niamir [1979] for centralized databases and was later used for distributed databases [Sacca and Wiederhold, 1985].
2. Splitting. Splitting starts with a relation and decides on a beneficial partitioning based on the access behaviour of applications to the attributes. One way to do this is to create a matrix that shows the number of accesses that refer to each attribute pair. For example, a transaction that accesses attributes a1, a2, a3, and a4 of relation R can be represented by the following matrix:

        a1    a2    a3
  a1          1     0
  a2                0
  a3
  a4

In this process, a matrix is produced for each transaction and an overall


matrix is produced showing the sum of all accesses for each attribute
pair. Pairs with high affinity should appear in the same vertical
fragment and pairs with low affinity may appear in different fragments.
This technique was first proposed for centralized database design
[Hoffer and Severance, 1975] and then it was extended for distributed
environment [Navathe et al., 1984].
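
The per-transaction matrices and their sum can be sketched as follows. This is a minimal illustration that keeps only the upper triangle (one entry per unordered attribute pair); the three transactions in `TRANSACTIONS` are hypothetical and are not taken from the text.

```python
from itertools import combinations


def pair_matrix(attributes, accessed):
    """Upper-triangular matrix for one transaction: 1 if both attributes are accessed."""
    return {
        (a, b): int(a in accessed and b in accessed)
        for a, b in combinations(attributes, 2)
    }


def overall_matrix(attributes, transactions):
    """Sum the per-transaction matrices entry by entry."""
    total = {pair: 0 for pair in combinations(attributes, 2)}
    for accessed in transactions:
        for pair, hit in pair_matrix(attributes, accessed).items():
            total[pair] += hit
    return total


ATTRS = ["a1", "a2", "a3", "a4"]

# Hypothetical transactions, each given as the set of attributes it accesses.
TRANSACTIONS = [{"a1", "a2"}, {"a1", "a2", "a4"}, {"a3", "a4"}]

overall = overall_matrix(ATTRS, TRANSACTIONS)
for pair, count in overall.items():
    print(pair, count)
```

In this made-up workload the pair (a1, a2) has the highest count, so a1 and a2 would be candidates for the same vertical fragment.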

Example 5.3.

In this case, the Project relation is partitioned into two vertical fragments V1 and V2, which are described below (see Figure 5.2):

Figure 5.2. Vertical fragmentation of the relation Project

V1

| Project-id | Project-leader-id | Branch-no |
|---|---|---|
| P01 | E001 | B10 |
| P02 | E001 | B20 |
| P03 | E004 | B70 |
| P04 | E003 | B20 |
| P05 | E005 | B60 |

V2

| Project-id | Project-name | Project-type | Amount |
|---|---|---|---|
| P01 | Inventory | Inside | $1000000 |
| P02 | Sales | Inside | $300000 |
| P03 | R&D | Abroad | $8000000 |
| P04 | Educational | Inside | $400000 |
| P05 | Health | Abroad | $7000000 |

• $V_1 = \Pi_{\text{Project-id, branch-no, project-leader-id}}(\text{Project})$
• $V_2 = \Pi_{\text{Project-id, Project-name, Project-type, amount}}(\text{Project})$

Here, the primary key of the relation Project is Project-id, which is repeated in both vertical fragments V1 and V2 so that the original base relation can be reconstructed from the fragments.

This vertical fragmentation also satisfies the correctness rules for fragmentation:

• Completeness. Each attribute in the relation Project appears in either fragment V1 or fragment V2, which satisfies the completeness rule.
• Reconstruction. The Project relation can be reconstructed from the vertical fragments V1 and V2 by using the natural join operation of relational algebra, which satisfies the reconstruction rule. Thus, $V_1 \bowtie V_2 = \text{Project}$.
• Disjointness. The vertical fragments V1 and V2 are disjoint, except for the primary key Project-id, which is repeated in both fragments because it is necessary for reconstruction.
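
Example 5.3 can be checked with a short Python sketch. Rows are modelled as dictionaries; the helper names `project` and `natural_join` and the column names are illustrative, and the join assumes the key is unique in each fragment, which holds here because Project-id is the primary key.

```python
COLUMNS = ("project_id", "project_name", "project_type",
           "leader_id", "branch_no", "amount")
ROWS = [
    ("P01", "Inventory", "Inside", "E001", "B10", 1000000),
    ("P02", "Sales", "Inside", "E001", "B20", 300000),
    ("P03", "R&D", "Abroad", "E004", "B70", 8000000),
    ("P04", "Educational", "Inside", "E003", "B20", 400000),
    ("P05", "Health", "Abroad", "E005", "B60", 7000000),
]
PROJECT = [dict(zip(COLUMNS, row)) for row in ROWS]


def project(relation, attrs):
    """Relational projection (no duplicates arise here because the key is kept)."""
    return [{a: row[a] for a in attrs} for row in relation]


def natural_join(left, right, key):
    """Join two fragments on a shared key that is unique in `right`."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


V1 = project(PROJECT, ["project_id", "leader_id", "branch_no"])
V2 = project(PROJECT, ["project_id", "project_name", "project_type", "amount"])

rebuilt = natural_join(V1, V2, "project_id")
print(rebuilt == PROJECT)                     # reconstruction: True

# Disjointness apart from the primary key.
print(set(V1[0]) & set(V2[0]))                # {'project_id'}
```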

Bond energy algorithm. The Bond Energy Algorithm (BEA) is well suited to vertical fragmentation [Navathe et al., 1984]. It takes the attribute affinity matrix (AA) as input and produces a clustered affinity matrix (CA) as output by permuting the rows and columns of AA. The generation of CA from AA involves three steps: initialization, iteration, and row ordering.

• Initialization. One column of AA is selected and placed into the first column of CA.
• Iteration. Assuming that i columns have already been placed into CA, each of the remaining n − i columns of AA is taken in turn and placed in whichever of the i + 1 possible positions in CA makes the largest contribution to the global neighbour affinity measure.
• Row ordering. The rows are ordered in the same way as the columns.

The contribution of a column Ak that is placed between Ai and Aj can be represented as follows:

$$\mathrm{cont}(A_i, A_k, A_j) = \mathrm{bond}(A_i, A_k) + \mathrm{bond}(A_k, A_j) - \mathrm{bond}(A_i, A_j)$$

where $\mathrm{bond}(A_x, A_y)$ is the scalar product of the columns $A_x$ and $A_y$ of AA.

For a given set of attributes many orderings are possible; for n attributes, n! orderings are possible. One efficient way of finding a good ordering is to search for clusters. The BEA proceeds by linearly traversing the set of attributes. In each step, one of the remaining attributes is inserted into the current order of attributes in such a way that the maximal contribution is achieved. This is first done for the columns. Once all the columns are placed, the row ordering is adapted to the column ordering, and the resulting affinity matrix exhibits the desired clustering. To compute the contribution to the global affinity value, the loss incurred by separating previously adjacent columns is subtracted from the gain obtained by adding the new column. The contribution of a pair of columns is the scalar product of the columns, which is maximal if the columns exhibit the same value distribution.

Example 5.4.

Consider Q = {Q1, Q2, Q3, Q4} as a set of queries, A = {A1, A2, A3, A4} as a set of attributes of the relation R, and S = {S1, S2, S3} as a set of sites in the distributed system. Assume that A1 is the primary key of the relation R, and that the following matrices represent the attribute usage values of the relation R and the application access frequencies at the different sites:

|    | A1 | A2 | A3 | A4 |
|---|---|---|---|---|
| Q1 | 0 | 1 | 1 | 0 |
| Q2 | 1 | 1 | 1 | 0 |
| Q3 | 1 | 0 | 0 | 1 |
| Q4 | 0 | 0 | 1 | 0 |

|    | S1 | S2 | S3 | Sum |
|---|---|---|---|---|
| Q1 | 10 | 20 | 0 | 30 |
| Q2 | 5 | 0 | 10 | 15 |
| Q3 | 0 | 35 | 5 | 40 |
| Q4 | 0 | 10 | 0 | 10 |

Since A1 is the primary key of the relation R and is repeated in every fragment, the attribute affinity matrix is considered here for the remaining attributes only:

|    | A2 | A3 | A4 |
|---|---|---|---|
| A2 | 45 | 45 | 0 |
| A3 | 45 | 55 | 0 |
| A4 | 0 | 0 | 40 |

Now,

• bond(A2, A3) = 45 × 45 + 45 × 55 + 0 × 0 = 4,500.
• bond(A2, A4) = 45 × 0 + 45 × 0 + 0 × 40 = 0.
• bond(A3, A4) = 45 × 0 + 55 × 0 + 0 × 40 = 0.

The contributions of the column A4, depending on its position, are as follows:

• For A4-A2-A3, cont(_, A4, A2) = bond(_, A4) + bond(A4, A2) − bond(_, A2) = 0.
• For A2-A4-A3, cont(A2, A4, A3) = bond(A2, A4) + bond(A4, A3) − bond(A2, A3) = 0 + 0 − 4,500 = −4,500.
• For A2-A3-A4, cont(A3, A4, _) = bond(A3, A4) + bond(A4, _) − bond(A3, _) = 0.

In this case, the orderings A4-A2-A3 and A2-A3-A4 are equally good.
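
The affinity values, bonds, and contributions of Example 5.4 can be recomputed with the sketch below. It assumes that the affinity of two attributes is the total frequency (the Sum column) of the queries that use both, which reproduces the affinity matrix above, and it uses the contribution formula without any scaling factor, as in the worked figures. The names `USAGE`, `FREQ`, `aff`, `bond`, and `cont` are illustrative; `None` stands for the empty boundary position written as "_" in the text.

```python
# Attributes used by each query (the 1-entries of the usage matrix).
USAGE = {
    "Q1": {"A2", "A3"},
    "Q2": {"A1", "A2", "A3"},
    "Q3": {"A1", "A4"},
    "Q4": {"A3"},
}
# Total access frequency of each query over sites S1..S3.
FREQ = {"Q1": 30, "Q2": 15, "Q3": 40, "Q4": 10}
# The primary key A1 is left out of the affinity matrix, as in the example.
ATTRS = ["A2", "A3", "A4"]


def aff(x, y):
    """Affinity: summed frequency of the queries that use both attributes."""
    return sum(FREQ[q] for q, used in USAGE.items() if x in used and y in used)


def bond(x, y):
    """Scalar product of two columns of the affinity matrix (0 at a boundary)."""
    if x is None or y is None:
        return 0
    return sum(aff(z, x) * aff(z, y) for z in ATTRS)


def cont(left, mid, right):
    """Contribution of placing column `mid` between `left` and `right`."""
    return bond(left, mid) + bond(mid, right) - bond(left, right)


print([[aff(r, c) for c in ATTRS] for r in ATTRS])
# [[45, 45, 0], [45, 55, 0], [0, 0, 40]]
print(bond("A2", "A3"))           # 4500
print(cont(None, "A4", "A2"))     # 0      (ordering A4-A2-A3)
print(cont("A2", "A4", "A3"))     # -4500  (ordering A2-A4-A3)
print(cont("A3", "A4", None))     # 0      (ordering A2-A3-A4)
```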


Now,

A2 A3 A4

A2 45 45 0

A3 45 55 0

A4 0 0 40

and

A1 A2 A3 A4

Q1 0 1 1 0

Q2 1 1 1 0

Q3 1 0 0 1

Q4 0 0 1 0

S1 S2 S3 Sum

Q1 10 20 0 30

Q2 5 0 10 15

Q3 0 35 5 40

Q4 0 10 0 10

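Hence, the split quality sq of each candidate split is the product of
the accesses that touch only one of the two fragments, minus the square
of the accesses that touch both fragments:

• Split {A2} | {A3, A4}
    accesses (fragment 1: {A2}): 0
    accesses (fragment 2: {A3, A4}): 50
    accesses (fragment 1 AND fragment 2): 45
    sq = 0 * 50 − 45 * 45 = −2,025
• Split {A2, A3} | {A4}
    accesses (fragment 1: {A2, A3}): 55
    accesses (fragment 2: {A4}): 40
    accesses (fragment 1 AND fragment 2): 0
    sq = 55 * 40 − 0 * 0 = 2,200
• Split {A2, A4} | {A3}
    accesses (fragment 1: {A2, A4}): 40
    accesses (fragment 2: {A3}): 10
    accesses (fragment 1 AND fragment 2): 45
    sq = 40 * 10 − 45 * 45 = −1,625

The split with the highest quality is {A2, A3} | {A4}. Therefore, the two
partitions are {A1, A4} and {A1, A2, A3}. In the case of vertical
fragmentation, the primary key is repeated in each partition. The same
calculation can be done with all attributes.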
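
The arithmetic of Example 5.4 can be checked with a short script. The following is a minimal sketch, not the author's algorithm: the function names are illustrative, and the split quality is assumed to be "accesses to fragment 1 only, times accesses to fragment 2 only, minus the square of accesses to both", which is the reading that matches the figures 2,200 and −1,625 in the example.

```python
# Attribute usage from Example 5.4: USE[q][a] == 1 if query q uses attribute a.
USE = {
    "Q1": {"A1": 0, "A2": 1, "A3": 1, "A4": 0},
    "Q2": {"A1": 1, "A2": 1, "A3": 1, "A4": 0},
    "Q3": {"A1": 1, "A2": 0, "A3": 0, "A4": 1},
    "Q4": {"A1": 0, "A2": 0, "A3": 1, "A4": 0},
}
# Total access frequency of each query, summed over sites S1..S3.
FREQ = {"Q1": 30, "Q2": 15, "Q3": 40, "Q4": 10}
# The primary key A1 is excluded from the clustering.
ATTRS = ["A2", "A3", "A4"]


def affinity(a, b):
    """Sum the frequencies of all queries that use both a and b."""
    return sum(FREQ[q] for q, row in USE.items() if row[a] and row[b])


def bond(a, b):
    """Scalar product of the affinity-matrix columns of a and b."""
    return sum(affinity(z, a) * affinity(z, b) for z in ATTRS)


def split_quality(frag1, frag2):
    """Assumed measure: only1 * only2 - both ** 2."""
    only1 = only2 = both = 0
    for q, row in USE.items():
        uses1 = any(row[a] for a in frag1)
        uses2 = any(row[a] for a in frag2)
        if uses1 and uses2:
            both += FREQ[q]
        elif uses1:
            only1 += FREQ[q]
        elif uses2:
            only2 += FREQ[q]
    return only1 * only2 - both ** 2


print(bond("A2", "A3"))                       # 4500
print(split_quality(["A2", "A3"], ["A4"]))    # 2200
```

Trying every candidate split with `split_quality` and keeping the maximum reproduces the choice of {A2, A3} and {A4} made in the example.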

MIXED FRAGMENTATION
Mixed fragmentation is a combination of horizontal and vertical
fragmentation. This is also referred to as hybrid or nested
fragmentation. A mixed fragment consists of a horizontal fragment that
is subsequently vertically fragmented, or a vertical fragment that is
then horizontally fragmented. A mixed fragment is defined by using
selection and projection operations of relational algebra. For example,
a mixed fragment for a given relation R can be defined as follows:

σp(∏a1, a2,..., an(R)) or ∏a1, a2,..., an(σp(R))

where p is a predicate based on one or more attributes of the
relation R and a1, a2,..., an are attributes of the relation R.

Example 5.5.
Let us consider the same Project relation used in the previous example.
The mixed fragments of the above Project relation can be defined as
follows:

• P11: σproject-type = “inside”(∏project-id, branch-no, project-leader-id(Project))
• P12: σproject-type = “abroad”(∏project-id, branch-no, project-leader-id(Project))
• P21: σproject-type = “inside”(∏project-id, project-name, project-type, amount(Project))
• P22: σproject-type = “abroad”(∏project-id, project-name, project-type, amount(Project))

where

• P1: ∏project-id, branch-no, project-leader-id(Project) and
• P2: ∏project-id, project-name, project-type, amount(Project).

Hence, first the Project relation is partitioned into two vertical
fragments P1 and P2 and then each of the vertical fragments is
subsequently divided into two horizontal fragments, which are shown
in Figure 5.3.

Figure 5.3. Mixed Fragmentation of the Relation Project

Mixed fragmentation satisfies all the correctness rules for
fragmentation, as explained below.

• Completeness. Each attribute in the Project relation appears in
  one of the fragments P1 or P2, and each (part) tuple that
  appears in the fragment P1 appears in one of the fragments P11
  or P12. Similarly, each (part) tuple that appears in the fragment
  P2 appears in one of the fragments P21 or P22.
• Reconstruction. The Project relation can be reconstructed from
  the fragments P11, P12, P21, and P22 by using the union and natural
  join operations of relational algebra. Thus, Project = P1 ⋈ P2,
  where P1 = P11 ∪ P12 and P2 = P21 ∪ P22.
• Disjointness. The fragments P11 and P12 are disjoint, as are P21
  and P22, since there can be no project whose project type is both
  “inside” and “abroad”. The fragments P1 and P2 are also disjoint,
  except for the necessary duplication of the primary key project-id.
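
The reconstruction rule can be illustrated with a small script. This is a minimal sketch under stated assumptions: the sample rows are invented, the helper names are illustrative, and the selection on project-type is applied before the projection (the second form of the mixed-fragment definition), because the attribute list of P1 does not carry project-type.

```python
# Hypothetical sample rows; attribute names follow the Project relation.
PROJECT = [
    {"project-id": 1, "project-name": "Alpha", "project-type": "inside",
     "project-leader-id": 10, "branch-no": 1, "amount": 500},
    {"project-id": 2, "project-name": "Beta", "project-type": "abroad",
     "project-leader-id": 11, "branch-no": 2, "amount": 900},
    {"project-id": 3, "project-name": "Gamma", "project-type": "inside",
     "project-leader-id": 12, "branch-no": 1, "amount": 300},
]
P1_ATTRS = ("project-id", "branch-no", "project-leader-id")
P2_ATTRS = ("project-id", "project-name", "project-type", "amount")


def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Relational projection: keep only the listed attributes."""
    return [{a: row[a] for a in attrs} for row in rows]


def mixed_fragment(attrs, project_type):
    """Select on project-type first, then project onto attrs."""
    chosen = select(PROJECT, lambda r: r["project-type"] == project_type)
    return project(chosen, attrs)


P11 = mixed_fragment(P1_ATTRS, "inside")
P12 = mixed_fragment(P1_ATTRS, "abroad")
P21 = mixed_fragment(P2_ATTRS, "inside")
P22 = mixed_fragment(P2_ATTRS, "abroad")


def natural_join(left, right, key="project-id"):
    """Join on the only shared attribute, the primary key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def reconstruct():
    # The horizontal fragments are disjoint, so list concatenation is a union.
    p1 = P11 + P12
    p2 = P21 + P22
    return natural_join(p1, p2)
```

Sorting `reconstruct()` by project-id gives back the rows of `PROJECT`, which is the statement Project = (P11 ∪ P12) ⋈ (P21 ∪ P22).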

DERIVED FRAGMENTATION
A derived fragment is a horizontal fragment that is based on the
horizontal fragmentation of a parent relation; it does not depend on
the properties of its own attributes. Derived fragmentation is used to
facilitate the join between fragments. The term child is used to refer to
the relation that contains the foreign key, and the term parent is used
for the relation containing the targeted primary key. Derived
fragmentation is defined by using the semi-join operation of relational
algebra. For a given child relation C and parent relation P, the derived
fragmentation of C can be represented as follows:

Ci = C ⊳ Pi, 1 ≤ i ≤ w

where w is the number of horizontal fragments defined on P and

Pi = σFi(P)

where Fi is the predicate according to which the primary horizontal
fragment Pi is defined.

If a relation contains more than one foreign key, it will be necessary to
select one of the referenced relations as the parent relation. The choice
can be based on one of the following two strategies:
1. The fragmentation that is used in most of the applications.
2. The fragmentation with better join characteristics, that is, the join
that involves smaller fragments or the join that can be performed in
parallel to a greater degree.

Example 5.6.

Let us consider the following Department and Employee relations
together [Chapter 1, Example 1.1].

• Employee (emp-id, ename, designation, salary, deptno, voter-id)
• Department (deptno, dname, location)

Assume that the Department relation is horizontally fragmented
according to the deptno so that data relating to a particular
department is stored locally. For instance,

• P1: σdeptno = 10 (Department)
• P2: σdeptno = 20 (Department)
• P3: σdeptno = 30 (Department)

In the Employee relation, each employee belongs to a particular
department (deptno), which references the deptno field in the
Department relation. Thus, it is beneficial to store the Employee
relation using the same fragmentation strategy as the Department
relation. This can be achieved by a derived fragmentation strategy that
partitions the Employee relation horizontally according to the deptno
as follows:

Ci = Employee ⊳deptno Pi, 1 ≤ i ≤ 3

It can be shown that derived fragmentation also satisfies the
correctness rules of fragmentation.

• Completeness. Since the predicate of the derived fragmentation
  involves two relations, it is more difficult to ensure the
  completeness rule for derived fragmentation. In the above
  example, each tuple of the Department relation must appear in one
  of the fragments P1, P2, or P3. Similarly, each tuple of the
  Employee relation must appear in one of the fragments Ci. This
  requires the referential integrity constraint, which ensures that
  every tuple of the child relation (Employee) references a tuple
  that exists in the parent relation (Department).
• Reconstruction. A global relation can be reconstructed both
  from its horizontal and derived fragments by using join and union
  operations of relational algebra. Thus, for a given relation R with
  fragments R1, R2,..., Rn,

  R = ∪ Ri, 1 ≤ i ≤ n

• Disjointness. Since derived fragmentation involves a semijoin
  operation, it adds extra complexity to ensure the disjointness of
  derived fragments. It is not desirable that a tuple of a
  child relation be joined with two or more tuples of the parent
  relation when these tuples are in different fragments of the
  parent relation. For example, an employee of the Employee
  relation with deptno = 10 should not belong to the fragments for
  deptno = 20 or deptno = 30.
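
The derived fragmentation of Example 5.6 can be sketched as a semijoin over in-memory rows. This is an illustration only: the sample tuples and the helper name `semijoin` are invented, and only a few attributes of each relation are shown.

```python
# Hypothetical sample data for the relations of Example 5.6.
DEPARTMENT = [
    {"deptno": 10, "dname": "Sales", "location": "Kolkata"},
    {"deptno": 20, "dname": "Research", "location": "Delhi"},
    {"deptno": 30, "dname": "Accounts", "location": "Mumbai"},
]
EMPLOYEE = [
    {"emp-id": 1, "ename": "Asha", "deptno": 10},
    {"emp-id": 2, "ename": "Ravi", "deptno": 20},
    {"emp-id": 3, "ename": "Meera", "deptno": 10},
    {"emp-id": 4, "ename": "Arjun", "deptno": 30},
]


def semijoin(child, parent, key):
    """Return the child rows whose key value occurs in the parent rows."""
    parent_keys = {row[key] for row in parent}
    return [row for row in child if row[key] in parent_keys]


# Primary horizontal fragments P1..P3 of Department (deptno = 10, 20, 30).
P = {
    i: [d for d in DEPARTMENT if d["deptno"] == deptno]
    for i, deptno in enumerate((10, 20, 30), start=1)
}

# Derived fragments Ci = Employee semijoin Pi on deptno.
C = {i: semijoin(EMPLOYEE, P[i], "deptno") for i in P}


def is_complete():
    """Every Employee tuple appears in some Ci."""
    covered = {e["emp-id"] for frag in C.values() for e in frag}
    return covered == {e["emp-id"] for e in EMPLOYEE}


def is_disjoint():
    """No Employee tuple appears in more than one Ci."""
    total = sum(len(frag) for frag in C.values())
    return total == len({e["emp-id"] for frag in C.values() for e in frag})
```

With this data, both checks hold because every employee references exactly one existing department, which is the referential integrity condition discussed above.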

NO FRAGMENTATION
A final strategy of fragmentation is not to fragment a relation. If a
relation contains a small number of tuples and is not updated
frequently, then it is better not to fragment the relation. It is more
sensible to leave the relation as a whole and simply replicate the
relation at each site of the distributed system.

The Allocation of Fragments
The allocation of fragments is a critical performance issue in the
context of distributed database design. Before allocating fragments
to the different sites of a distributed system, it is necessary to identify
whether the fragments are replicated or not. The allocation of non-
replicated fragments can be handled easily by using the “best-fit”
approach. In the best-fit approach, the most cost-effective allocation
strategy is selected among several alternative allocation
strategies. Replication of fragments adds extra complexity to the
fragment allocation issue as follows:

• The total number of replicas of fragments may vary.
• It is difficult to model read-only applications, because applications
  can access several alternative replicas of a fragment from
  different sites of the distributed system.

For allocating replicated fragments, one of the following methods can
be used:

• In the first approach, the set of all sites in the distributed system
  is determined where the benefit of allocating one replica of the
  fragment is higher than the cost of allocation. One replica of the
  fragment is allocated to each such beneficial site.
• In the alternative approach, allocation of fragments is first done
  using the best-fit method as if fragments were not replicated, and
  then replicas are progressively introduced starting from the most
  beneficial. This process is terminated when the addition of replicas
  is no longer beneficial.

Both the above approaches have some limitations. In the first
approach, determination of the cost and the benefit of each replica of
the fragment is very complicated, whereas in the latter approach,
each additional replica is less beneficial than the previous one.
Moreover, reliability and availability increase if there are two or
three copies of the fragment, but further copies give a less than
proportional increase.

Measure of Costs and Benefits for Fragment Allocation
To evaluate the cost and benefit of the allocation of fragments, it is
necessary to consider the type of fragments. Moreover, the following
definitions are assumed:

• The fragment index is represented by i
• The site index is represented by j
• The application index is represented by k
• Fkj indicates the frequency of application k at site j
• Rki represents the number of retrieval references (read
  operations) of application k to fragment i
• Uki indicates the number of update references (write operations)
  of application k to fragment i
• Nki = Rki + Uki

HORIZONTAL FRAGMENTS
1. In this case, using the best-fit approach for non-replicated fragments,
the fragment Ri of relation R is allocated at the site j where the
number of references to the fragment Ri is maximum. The
number of local references to Ri at site j is as follows:

B_{ij} = \sum_{k} F_{kj} N_{ki}

where Ri is allocated to the site j* such that Bij* is maximum.


2. In this case, using the first approach for replicated fragments, the
fragment Ri of relation R is allocated at every site j where the cost of
the retrieval references of applications at j is larger than the cost of
the update references to Ri from applications at any other site.
Hence, Bij can be evaluated as follows:

B_{ij} = \sum_{k} F_{kj} R_{ki} - C \sum_{k} \sum_{j' \neq j} F_{kj'} U_{ki}

where C is a constant that measures the ratio between the cost of
an update and a retrieval access. Typically, update accesses are
more expensive and C ≥ 1. Ri is allocated at all sites j* where Bij* is
positive. When all Bij are negative, a single copy of Ri is placed at the
site where Bij is maximum.

3. In this case, the benefit of an additional replica of a fragment is
evaluated in terms of availability and reliability. Let di denote the
degree of replication of Ri, and let Fi denote the benefit of having Ri
fully replicated at each site. The benefit function can be defined as

\beta(d_i) = (1 - 2^{1 - d_i}) F_i

4. Now, the benefit of introducing a new copy of Ri at site j is as
follows:

B_{ij} = \sum_{k} F_{kj} N_{ki} + \beta(d_i)
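
The allocation rules for horizontal fragments can be tried out numerically. The sketch below is an illustration under stated assumptions, not a definitive implementation: the workload figures and site names are invented, the non-replicated benefit is taken to be the sum of Fkj times Nki over all applications k, and the replicated benefit is taken to be the retrieval references issued at site j minus C times the update references issued at all other sites.

```python
# Hypothetical workload for one fragment Ri.
# F[k][j]: frequency of application k at site j.
F = {
    "k1": {"s1": 10, "s2": 0, "s3": 5},
    "k2": {"s1": 0, "s2": 20, "s3": 5},
}
R = {"k1": 4, "k2": 1}   # retrieval references of application k to Ri
U = {"k1": 1, "k2": 2}   # update references of application k to Ri
C = 2.0                  # assumed cost ratio of an update to a retrieval
SITES = ["s1", "s2", "s3"]


def best_fit_benefit(j):
    """Local references to Ri at site j, with Nki = Rki + Uki."""
    return sum(F[k][j] * (R[k] + U[k]) for k in F)


def replica_benefit(j):
    """Reads served locally at j minus weighted updates from other sites."""
    reads = sum(F[k][j] * R[k] for k in F)
    remote_updates = sum(F[k][o] * U[k] for k in F for o in SITES if o != j)
    return reads - C * remote_updates


def allocate_non_replicated():
    """Best fit: the single site with the most local references."""
    return max(SITES, key=best_fit_benefit)


def allocate_replicated():
    """All beneficial sites, or the least bad site if none is beneficial."""
    beneficial = [j for j in SITES if replica_benefit(j) > 0]
    return beneficial or [max(SITES, key=replica_benefit)]


def beta(degree, full_benefit):
    """Availability benefit of keeping `degree` copies of the fragment."""
    return (1 - 2 ** (1 - degree)) * full_benefit


print(allocate_non_replicated())   # s2
print(allocate_replicated())       # ['s2']
```

In this workload the updates dominate, so no site has a positive replica benefit and a single copy is kept at the site with the largest value. The `beta` function shows the diminishing return of further copies: one copy adds nothing, two copies add half of the full benefit, three copies add three quarters.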

VERTICAL FRAGMENTS
In this case, the benefit is calculated for vertically partitioning a
fragment Ri, currently allocated at site r, into two vertical fragments
Rs and Rt allocated at site s and site t, respectively. The effect of this
partition is listed below.

• It is assumed that there are two sets of applications As and At
  issued at site s and site t, respectively, which use only the
  attributes of Rs and Rt, respectively, and therefore become local to
  sites s and t.
• There is a set A1 of applications previously local to r, which use
  only attributes of Rs or only attributes of Rt. One additional remote
  reference is now required for these applications.
• There is a set A2 of applications previously local to r, which
  reference attributes of both Rs and Rt. These applications make
  two additional remote references.
• There is a set A3 of applications at sites different from r, s, or t,
  which reference attributes of both Rs and Rt. These applications
  make one additional remote reference.

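Now the benefit of this partition is as follows:

B_{ist} = \sum_{k \in A_s} F_{ks} N_{ki} + \sum_{k \in A_t} F_{kt} N_{ki} - \sum_{k \in A_1} F_{kr} N_{ki} - \sum_{k \in A_2} 2 F_{kr} N_{ki} - \sum_{k \in A_3} \sum_{j \notin \{r, s, t\}} F_{kj} N_{ki}

To distinguish between retrieval and update accesses, it is sufficient to
use Rki + C Uki instead of Nki. This formula can be used within an
exhaustive splitting algorithm to determine whether the splitting of Ri
at site r into Rs at site s and Rt at site t is convenient or not by trying
all possible combinations of sites s and t. Some care must
be taken in the case of r = s or r = t.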
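
The effect of a vertical split can be estimated with a few lines of code. This is a minimal sketch of one reading of the four effects listed above: the applications in As and At contribute a gain, those in A1 lose one remote reference, those in A2 lose two, and those in A3 lose one at every site other than r, s, and t. All names and workload numbers are hypothetical.

```python
def split_benefit(F, N, A_s, A_t, A_1, A_2, A_3, r, s, t, sites):
    """Benefit of splitting Ri (at site r) into Rs at s and Rt at t.

    F[k] maps a site to the frequency of application k at that site
    (missing sites count as 0); N[k] is the number of references of
    application k to the fragment.
    """
    freq = lambda k, j: F[k].get(j, 0)
    gain = (sum(freq(k, s) * N[k] for k in A_s)
            + sum(freq(k, t) * N[k] for k in A_t))
    loss = (sum(freq(k, r) * N[k] for k in A_1)
            + sum(2 * freq(k, r) * N[k] for k in A_2)
            + sum(freq(k, j) * N[k]
                  for k in A_3 for j in sites if j not in (r, s, t)))
    return gain - loss


# Hypothetical workload: one application in each of the five sets.
F = {"a": {"s": 10}, "b": {"t": 5}, "c": {"r": 3}, "d": {"r": 2}, "e": {"x": 1}}
N = {"a": 2, "b": 4, "c": 1, "d": 1, "e": 3}

benefit = split_benefit(
    F, N,
    A_s=["a"], A_t=["b"], A_1=["c"], A_2=["d"], A_3=["e"],
    r="r", s="s", t="t", sites=["r", "s", "t", "x"],
)
print(benefit)   # 30
```

An exhaustive splitting procedure would call `split_benefit` for each pair of candidate sites s and t and keep the split only if the best value is positive.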

Transparencies in Distributed Database Design
According to the definition of a distributed database, one major objective
is to achieve transparency in the distributed
system. Transparency refers to the separation of the higher-level
semantics of a system from lower-level implementation issues. In a
distributed system, transparency hides the implementation details from
users of the system. In other words, the user believes that he or she is
working with a centralized database system, and all the
complexities of a distributed database are hidden from
the user. A distributed DBMS may have various levels of
transparency. In a distributed DBMS, the following four main categories
of transparency have been identified:

• Distribution transparency
• Transaction transparency
• Performance transparency
• DBMS transparency.

Data Distribution Transparency
Data distribution transparency allows the user to perceive the
database as a single, logical entity. In other words, distribution
transparency refers to the degree or extent to which details of
fragmentation (fragmentation transparency), replication (replication
transparency) and distribution (location transparency) are hidden from
users. If a user sees all fragmentation, allocation, and replication, the
distributed DBMS is said to have no distribution transparency. In that
case, the user needs to refer to specific fragment copies by appending
the site name to the relation. If the user needs to know both that the
data are fragmented and where the fragments are located in order to
retrieve data, then it is called local mapping transparency.

If a system supports a higher degree of distribution transparency, the
user sees a single integrated schema with no details of fragmentation,
allocation, or distribution. The distributed DBMS stores all the details in
the distribution catalog. All these distribution transparencies are
discussed in the following:

1. Fragmentation transparency. Fragmentation
transparency hides the fact from users that the data are
fragmented. This is the highest level of distribution transparency.
If a distributed system has fragmentation transparency, then the
user need not be aware of the fragmentation of data. As a
result, database accesses are based on the global schema and
the user need not specify the particular fragment names or
data locations.

Example 5.7.

Let us consider that the relation Employee (emp-id, emp-name,
designation, salary, emp-branch, project-no) is fragmented and
stored in different sites of a distributed DBMS. Further assume
that the distributed DBMS has fragmentation transparency and,
thus, users have no idea regarding the fragmentation of the
Employee relation. Therefore, to retrieve the names of all
employees of branch number 10 from the Employee relation, the user
will write the following SQL statement:

    Select emp-name from Employee where emp-branch = 10.

This SQL statement is the same as in a centralized DBMS.

2. Location transparency. With location transparency, the user is aware
of how the data are fragmented but still does not have any idea regarding
the location of the fragments. Location transparency is the middle level
of distribution transparency. To retrieve data from a distributed
database with location transparency, the end user or programmer has
to specify the database fragment names but need not specify where
these fragments are located in the distributed system.

Example 5.8.

Let us assume that the tuples of the above Employee relation are
horizontally partitioned into two
fragments EMP1 and EMP2 depending on the selection predicates
“emp-id≤100” and “emp-id>100”. Here, the user is aware that
the Employee relation is horizontally fragmented into two
relations EMP1 and EMP2, but has no idea in which sites
these relations are stored. Thus, the user will write the following
SQL statement for the above query “retrieve the names of all
employees of branch number 10”:

    Select emp-name from EMP1 where emp-branch = 10
    union
    Select emp-name from EMP2 where emp-branch = 10.

3. Replication transparency. Replication transparency means that the
user is unaware of the fact that the fragments of relations are
replicated and stored in different sites of the distributed DBMS.
Replication transparency is closely related to location transparency and
it is implied by location transparency. However, it may be possible for a
distributed system not to have location transparency but to have
replication transparency. In such cases, the user has to mention the
location of the fragments of a relation for data access. Replication
transparency is illustrated in Example 5.9.

Example 5.9.

Let us assume that the horizontal fragments EMP1 and EMP2 of
the Employee relation in Example 5.8 are replicated and stored in
different sites of the distributed system. Further, assume that the
distributed DBMS supports replication transparency. In this case,
the user will write the following SQL statement for the query
“retrieve the names of all employees of branch number 20 whose
salary is greater than Rs. 50,000”:

    Select emp-name from EMP1 where emp-branch = 20 and salary > 50000
    union
    Select emp-name from EMP2 where emp-branch = 20 and salary > 50000.

If the distributed system does not support replication
transparency, then the user will write the following SQL
statement for the above query, considering that there are a number of
replicas of fragments EMP1 and EMP2 of the Employee relation:

    Select emp-name from copy1 of EMP1 where emp-branch = 20 and salary > 50000
    union
    Select emp-name from copy3 of EMP2 where emp-branch = 20 and salary > 50000.

Similarly, the above query can be rewritten as follows for a
distributed DBMS with replication transparency which does not
exhibit location transparency:

    Select emp-name from EMP1 at site 1 where emp-branch = 20 and salary > 50000
    union
    Select emp-name from EMP2 at site 3 where emp-branch = 20 and salary > 50000.

4. Local mapping transparency. Local mapping transparency is the
lowest level of distribution transparency. With local mapping
transparency, users are aware of both the fragment names and the
location of fragments, taking into account that replicas of the fragments
may exist. If a distributed system has local mapping transparency, then
the user has to explicitly mention both the fragment name and the
location for data access.

Example 5.10.

Let us consider that the relation Project (project-id, project-name,
project-type, project-leader-id, branch-no, amount) is
horizontally partitioned into two fragments P1 and P2 depending
on the project-type “inside” and “abroad”. Assume that the
fragmented relations P1 and P2 are replicated and stored in
different sites of the distributed DBMS. With local mapping
transparency, the user will write the following SQL statement for
the query “retrieve project names and branch numbers of all the
projects where the project amount is greater than Rs.
10,000,000”:

    Select project-name, branch-no from copy1 of P1 at site 1
    where amount > 10000000
    union
    Select project-name, branch-no from copy3 of P2 at site 4
    where amount > 10000000

Here, it is assumed that replicas of fragments P1 and P2 of the
Project relation are allocated to different sites of the distributed
system such as site 1, site 3, and site 4.

5. Naming transparency. Naming transparency means that the users are
not aware of the actual name of the database objects in the system. If a
system supports naming transparency, the user will specify the alias
names of database objects for data access. In a distributed database
system, each database object must have a unique name. The
distributed DBMS must ensure that no two sites create a database
object with the same name. To ensure this, one solution is to create
a central name server, which ensures the uniqueness of names of
database objects in the system. However, this approach has several
disadvantages as follows:
1. Loss of some local autonomy, because during creation of
new database objects, each site has to check the uniqueness of
the names of database objects with the central name server.
2. Performance may be degraded, if the central site becomes a
bottleneck.
3. Low availability and reliability, because if the central site
fails, the remaining sites cannot create any new database
object. As availability decreases, reliability also decreases.

A second alternative solution is to prefix a database object with the
identifier of the site that created it. This naming method will also be
able to identify each fragment along with each of its copies. Therefore,
copy 2 of fragment 1 of the Employee relation created at site 3 can be
referred to as S3.Employee.F1.C2. The only problem with this approach is
that it results in loss of distribution transparency.

One solution that can overcome the disadvantages of the above two
approaches is the use of aliases (sometimes called synonyms) for each
database object. It is the responsibility of the distributed database
system to map an alias to the appropriate database object.

The distributed system R* differentiates between an object’s printname
and its system-wide name (global name). The printname is the name
through which the users refer to the database object. The system-wide
name or global name is a globally unique internal identifier for the
database object that is never changed. The system-wide name contains
four components as follows:

• Creator ID. This represents a unique site identifier for the user
  who created the database object.
• Creator site ID. It indicates a globally unique identifier for the
  site from which the database object was created.
• Local name. It represents an unqualified name for the database
  object.
• Birth-site ID. This represents a globally unique identifier for the
  site at which the object was initially stored.

For example, the system-wide name
Project-leader@India.localBranch@kolkata represents an object with local
name localBranch, created by the user Project-leader at the India site
and initially stored at the Kolkata site.
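
The four components can be pulled out of such a name with a short routine. This is a sketch only: the layout creator@creator-site.local-name@birth-site is inferred from the single example above, the components are assumed to contain no "." or "@" characters, and `parse_system_wide_name` is an illustrative helper, not part of R*.

```python
from typing import NamedTuple


class SystemWideName(NamedTuple):
    creator_id: str
    creator_site_id: str
    local_name: str
    birth_site_id: str


def parse_system_wide_name(name: str) -> SystemWideName:
    """Split 'creator@creator-site.local-name@birth-site' into its parts."""
    creator_part, dot, object_part = name.partition(".")
    creator_id, at1, creator_site = creator_part.partition("@")
    local_name, at2, birth_site = object_part.partition("@")
    parts = (creator_id, creator_site, local_name, birth_site)
    # Reject names that lack a separator or have an empty component.
    if not (dot and at1 and at2) or not all(parts):
        raise ValueError(f"not a system-wide name: {name!r}")
    return SystemWideName(*parts)


parsed = parse_system_wide_name("Project-leader@India.localBranch@kolkata")
print(parsed.local_name)      # localBranch
print(parsed.birth_site_id)   # kolkata
```

A printname or alias table would then map the short name a user types to one of these globally unique names.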

Distribution transparency is supported by a distributed data
dictionary or a distributed data catalog. The distributed data catalog
contains the description of the entire database as seen by the database
administrator. The database description is known as the distributed
global schema.
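
The role of the catalog at each level of transparency can be sketched as follows. This is an illustration under stated assumptions: the catalog contents and the function `expand_query` are hypothetical, the generated text follows the informal "fragment at site" notation of the examples in this section rather than real SQL, and the first listed replica is always chosen.

```python
# Hypothetical distribution catalog: relation -> fragment -> replica sites.
CATALOG = {
    "Employee": {
        "EMP1": ["site 1", "site 5"],
        "EMP2": ["site 2", "site 6"],
    }
}


def expand_query(columns, relation, predicate, level):
    """Return the statement a user would have to write at a given level."""
    fragments = CATALOG[relation]
    if level == "fragmentation":
        # The user names only the global relation.
        targets = [relation]
    elif level == "location":
        # The user names the fragments but not their sites.
        targets = list(fragments)
    elif level == "local mapping":
        # The user names both the fragment and a site holding a copy.
        targets = [f"{frag} at {sites[0]}" for frag, sites in fragments.items()]
    else:
        raise ValueError(f"unknown transparency level: {level}")
    statements = [f"Select {columns} from {t} where {predicate}" for t in targets]
    return "\nunion\n".join(statements)


print(expand_query("emp-name", "Employee", "emp-branch = 10", "local mapping"))
```

A distributed DBMS that offers fragmentation transparency performs this expansion itself, using the catalog, so that the user only ever writes the first form.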

Example 5.11.

In this example, data distribution transparencies for an update
application are considered. Assume that the relational schema Employee
(emp-id, emp-name, designation, salary, emp-branch, project-no) is
fragmented as follows:

• Emp1: σemp-branch≤10 (∏emp-id, emp-name, emp-branch, project-no(Employee))
• Emp2: σemp-branch≤10 (∏emp-id, salary, designation(Employee))
• Emp3: σemp-branch>10 (∏emp-id, emp-name, emp-branch, project-no(Employee))
• Emp4: σemp-branch>10 (∏emp-id, salary, designation(Employee))

Consider an update request generated in the distributed system by which
the branch of the employee with emp-id 55 is modified from
emp-branch 10 to emp-branch 20. The queries written by the user are
illustrated in the following for different levels of transparency, namely
fragmentation transparency, location transparency, and local mapping
transparency.

• Fragmentation transparency:
    Update Employee
    set emp-branch = 20
    where emp-id = 55.
• Location transparency:
    Select emp-name, project-no into $emp-name, $project-no from Emp1
    where emp-id = 55
    Select salary, designation into $salary, $designation from Emp2
    where emp-id = 55
    Insert into Emp3 (emp-id, emp-name, emp-branch, project-no)
    values (55, $emp-name, 20, $project-no)
    Insert into Emp4 (emp-id, salary, designation)
    values (55, $salary, $designation)
    Delete from Emp1 where emp-id = 55
    Delete from Emp2 where emp-id = 55
• Local mapping transparency:
    Select emp-name, project-no into $emp-name, $project-no from Emp1 at site 1
    where emp-id = 55
    Select salary, designation into $salary, $designation from Emp2 at site 2
    where emp-id = 55
    Insert into Emp3 at site 3 (emp-id, emp-name, emp-branch, project-no)
    values (55, $emp-name, 20, $project-no)
    Insert into Emp3 at site 7 (emp-id, emp-name, emp-branch, project-no)
    values (55, $emp-name, 20, $project-no)
    Insert into Emp4 at site 4 (emp-id, salary, designation)
    values (55, $salary, $designation)
    Insert into Emp4 at site 8 (emp-id, salary, designation)
    values (55, $salary, $designation)
    Delete from Emp1 at site 1 where emp-id = 55
    Delete from Emp1 at site 5 where emp-id = 55
    Delete from Emp2 at site 2 where emp-id = 55
    Delete from Emp2 at site 6 where emp-id = 55

Here, it is assumed that the fragment Emp1 has two replicas stored at
site 1 and site 5, respectively, the fragment Emp2 has two replicas
stored at site 2 and site 6, respectively, the fragment Emp3 has two
replicas stored at site 3 and site 7, respectively, and the fragment Emp4
has two replicas stored at site 4 and site 8, respectively.

Transaction Transparency
Transaction transparency in a distributed DBMS ensures that all
distributed transactions maintain the integrity and consistency of the
distributed database. A distributed transaction can update data stored at
many different sites connected by a computer network. Each transaction is
divided into several subtransactions (each represented by an agent), one
for each site that has to be accessed. Transaction transparency ensures
that the distributed transaction will be successfully completed only if all
subtransactions executing in different sites associated with the
transaction are completed successfully. Thus, a distributed DBMS
requires a complex mechanism to manage the execution of distributed
transactions and to ensure the database consistency and integrity.
Moreover, transaction transparency becomes more complex due to the
fragmentation, allocation, and replication schemas in a distributed
DBMS. Two further aspects of transaction transparency
are concurrency transparency and failure transparency, which are
discussed in the following:

1. Concurrency transparency. Concurrency transparency in a
distributed DBMS ensures that all concurrent transactions
(distributed and non-distributed) execute independently in the
system and that the results are logically consistent with those
obtained if the transactions are executed one at a time in some
arbitrary serial order. The distributed DBMS requires a complex
mechanism to ensure that local and global transactions do
not interfere with each other. Moreover, the distributed DBMS
must ensure the consistency of all subtransactions involved with a
global transaction.

Replication adds extra complexity to the issue of concurrency in a
distributed DBMS. For example, if a copy of a replicated data item
is updated, then the update must eventually be propagated to all
copies. One obvious solution to update all copies of a data item is
to propagate the changes as part of the original transaction,
making it an atomic operation. However, the update process is
delayed if one of the sites holding a copy is not reachable during
the update due to site or communication link failure. If there are
many copies of the data item, the probability of the transaction
succeeding decreases exponentially.

An alternative solution is to propagate the update only to those sites
that are currently available. The remaining sites must be
updated when they become available. A further
strategy is to update the copies of a data item asynchronously,
sometime after the original update.

2. Failure transparency. Failure transparency in a distributed
DBMS promises that the system will continue its normal execution
in the event of failure and that it will maintain the atomicity of the
global transaction. The atomicity of a global transaction ensures
that the subtransactions of the global transaction are either all
committed or all aborted. Thus, the distributed DBMS must
synchronize the global transaction to ensure that all
subtransactions have completed successfully before recording a
final COMMIT for the global transaction. Failure transparency also
ensures the durability of both local and global transactions. In
addition to all the types of failures in a centralized system, the
following additional types of failures can occur in a distributed
environment:
1. The loss of a message
2. The failure of a communication link
3. The failure of a site
4. Network partitioning.

Functions that are lost due to failures will be picked up by another
network node and continued.
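
The all-or-nothing rule for subtransactions can be illustrated with a toy coordinator. The text above does not name a protocol; the sketch below uses a simplified two-phase commit, which is one common way to obtain this behaviour, and it is an in-memory simulation with invented class and function names rather than a real implementation (no logging, timeouts, or recovery).

```python
class Participant:
    """One site executing a subtransaction of the global transaction."""

    def __init__(self, site, can_commit=True, reachable=True):
        self.site = site
        self.can_commit = can_commit
        self.reachable = reachable
        self.state = "active"

    def prepare(self):
        """Phase 1: vote on whether the local subtransaction can commit."""
        if not self.reachable:
            raise ConnectionError(f"{self.site} is not reachable")
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_global_transaction(participants):
    """Commit everywhere only if every participant votes yes."""
    try:
        votes = [p.prepare() for p in participants]
    except ConnectionError:
        # A lost message or failed site counts as a 'no' vote.
        votes = [False]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: apply the same decision at every site (simulated locally).
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision


sites = [Participant("site 1"), Participant("site 2", reachable=False)]
print(run_global_transaction(sites))   # abort
```

Because the decision is taken once and applied to every participant, no run of this sketch leaves some subtransactions committed and others aborted.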

Performance Transparency
Performance transparency in a distributed DBMS ensures that it
performs its tasks as if it were a centralized DBMS. In other words,
performance transparency in a distributed environment assures that the
system does not suffer from any performance degradation due to the
distributed architecture and that it will choose the most cost-effective
strategy to execute a request. In a distributed environment, the
distributed query processor maps a data request into an ordered sequence
of operations on local databases. In this context, the added complexity of
the fragmentation, allocation, and replication schemas has to be
considered. The distributed query processor has to decide on the
following issues:

• Which fragment to access to perform a data request.
• If the fragment is replicated, which copy of the fragment to use.
• Which data location to use to perform a data request.

The distributed query processor determines an execution strategy that
is optimized with respect to some cost function. Typically, the
costs associated with a distributed data request include the following:

• The access time (I/O) cost involved in accessing the physical data
  on disk.
• The CPU time cost incurred when performing operations on data
  in main memory.
• The communication cost associated with the transmission of data
  across the network.

A number of query processing and query optimization techniques have
been developed for distributed database systems: some of them
minimize the total cost of query execution time [Sacco and Yao, 1982],
and some of them attempt to maximize the parallel execution of
operations [Epstein et al., 1978] to minimize the response time of
queries.
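
The choice among replicas can be sketched with a simple additive cost function. This is a minimal illustration, not the method of the cited papers: the total cost is assumed to be the plain sum of the I/O, CPU, and communication costs listed above, and the replica names and figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    fragment: str
    site: str
    io_cost: float     # cost of reading the data from disk
    cpu_cost: float    # cost of processing the data in main memory
    comm_cost: float   # cost of shipping the result over the network

    @property
    def total_cost(self) -> float:
        return self.io_cost + self.cpu_cost + self.comm_cost


def choose_replicas(replicas):
    """Pick, for each fragment, the copy with the lowest total cost."""
    best = {}
    for replica in replicas:
        current = best.get(replica.fragment)
        if current is None or replica.total_cost < current.total_cost:
            best[replica.fragment] = replica
    return best


# Hypothetical cost estimates for two fragments with two copies each.
candidates = [
    Replica("EMP1", "site 1", io_cost=4.0, cpu_cost=1.0, comm_cost=6.0),
    Replica("EMP1", "site 5", io_cost=5.0, cpu_cost=1.0, comm_cost=2.0),
    Replica("EMP2", "site 2", io_cost=3.0, cpu_cost=1.0, comm_cost=1.0),
    Replica("EMP2", "site 6", io_cost=3.0, cpu_cost=1.0, comm_cost=9.0),
]

plan = choose_replicas(candidates)
print(plan["EMP1"].site)   # site 5
print(plan["EMP2"].site)   # site 2
```

Minimizing the summed cost corresponds to the total-cost objective; an optimizer that targets response time would instead take the maximum over operations that can run in parallel.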

DBMS Transparency
DBMS transparency in a distributed environment hides the knowledge
that the local DBMSs may be different and is, therefore, only applicable
to heterogeneous distributed DBMSs. This is also known
as heterogeneity transparency, which allows the integration of several
different local DBMSs (relational, network, and hierarchical) under a
common global schema. It is the responsibility of distributed DBMS to
translate the data requests from the global schema to local DBMS
schemas to provide DBMS transparency.

Chapter Summary
• Distributed database design involves the following important
  issues: fragmentation, replication, and allocation.
    ◦ Fragmentation. A global relation may be divided into a
      number of subrelations, called fragments, which are then
      distributed among sites. There are two main types of
      fragmentation: horizontal and vertical. Horizontal fragments
      are subsets of tuples and vertical fragments are subsets of
      attributes. The other two types of fragmentation are mixed and
      derived.
    ◦ Allocation. Allocation involves the issue of allocating
      fragments among sites.
    ◦ Replication. The distributed database system may maintain
      a copy of a fragment at several different sites.
• Fragmentation must ensure the correctness rules: completeness,
  reconstruction, and disjointness.
• Alternative data allocation strategies are centralized, partitioned,
  selective replication, and complete replication.
• Transparency hides the implementation details of the distributed
  system from the users. Different transparencies in distributed systems
  are distribution transparency, transaction transparency, performance
  transparency, and DBMS transparency.

Distributed DBMS Architecture


This chapter introduces the architecture of different distributed
systems such as client/server system and peer-to-peer distributed
system. Owing to the diversity of distributed systems, it is very difficult
to generalize the architecture of distributed DBMSs. Different
alternative architectures of the distributed database systems and the
advantages and disadvantages of each system are discussed in detail.
This chapter also introduces the concept of a multi-database system
(MDBS), which is used to manage the heterogeneity of different DBMSs
in a heterogeneous distributed DBMS environment. The classification of
MDBSs and the architecture of such databases are presented in detail.

The outline of this chapter is as follows. Section 6.2 introduces different
alternative architectures of client/server systems and pros and cons of
these systems. In Section 6.3 alternative architectures for peer-to-peer
distributed systems are discussed. Section 6.4 focuses on MDBSs. The
classifications of MDBSs and their corresponding architectures are
illustrated in this section.

Introduction
The architecture of a system reflects the structure of the underlying
system. It defines the different components of the system, the
functions of these components and the overall interactions and
relationships between these components. This concept is true for
general computer systems as well as for software systems. The
software architecture of a program or computing system is the
structure or structures of the system, which comprise software
elements or modules, the externally visible properties of these
elements and the relationships between them. Software architecture
can be thought of as the representation of an engineering system and
the process(es) and discipline(s) for effectively implementing the
design(s) of such a system.

A distributed database system can be considered as a large-scale
software system; thus, the architecture of a distributed system can be
defined in a manner similar to that of software systems. This chapter
introduces the different alternative reference architectures of
distributed database systems such as client/server, peer-to-peer and
MDBSs.

Client/Server System
In the late 1970s and early 1980s, smaller systems (minicomputers) were
developed that required less power and air conditioning. The term
client/server was first used in the 1980s, and it gained acceptance in
referring to personal computers (PCs) on a network. In the late 1970s,
Xerox developed the standards and technology that are familiar today as
Ethernet. This provided a standard means for linking together
computers from different manufacturers and formed the basis for
modern local area networks (LANs) and wide area networks (WANs).
The client/server system was developed to cope with the rapidly
changing business environment. The general forces that drive the move
to client/server systems are as follows:

• A strong business requirement for decentralized computing
horsepower.
• Standard, powerful computers with user-friendly interfaces.
• Mature, shrink-wrapped user applications with widespread
acceptance.
• Inexpensive, modular systems designed with enterprise-class
architecture, such as power and network redundancy and file
archiving, and network protocols to link them together.
• Growing cost/performance advantages of PC-based platforms.

The client/server system is a versatile, message-based and modular
infrastructure that is intended to improve usability, flexibility,
interoperability and scalability as compared to centralized, mainframe,
time-sharing computing. In the simplest sense, the client and the server
can be defined as follows:

• A Client is an individual user’s computer or a user application that
does a certain amount of processing on its own and sends requests
to, and receives replies from, one or more servers for other
processing and/or data.
• A Server consists of one or more computers or an application
program that receives and processes requests from one or more
client machines. A server is typically designed with some
redundancies in power, network, computing, and file storage.

Usually, a client is defined as a requester of services, and a server is
defined as the provider of services. A single machine can be both a
client and a server depending on the software configuration.
Sometimes, the term server or client refers to the software rather than
the machines. Generally, server software runs on powerful computers
dedicated for exclusive use of business applications. On the other hand,
client software runs on common PCs or workstations. The properties of
a server are:

• Passive (slave)
• Waits for requests
• On request, serves clients and sends a reply.

The properties of a client are:

• Active (master)
• Sends requests
• Waits until the reply arrives.

A server can be stateless or stateful. A stateless server does not keep
any information between requests. A stateful server can remember
information between requests.
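
The difference between the two kinds of server can be illustrated with a small sketch. The classes below are hypothetical and only assume a trivial "increment a counter" service: the stateless server needs the full context in every request, whereas the stateful server remembers it per client.

```python
class StatelessCounterServer:
    """Keeps nothing between requests; the client must send the current value."""

    def handle(self, current_value):
        # The reply depends only on the content of this request.
        return current_value + 1


class StatefulCounterServer:
    """Remembers one counter per client between requests."""

    def __init__(self):
        self._counters = {}  # client id -> last value seen

    def handle(self, client_id):
        # The reply depends on what this client asked for earlier.
        self._counters[client_id] = self._counters.get(client_id, 0) + 1
        return self._counters[client_id]


stateless = StatelessCounterServer()
stateful = StatefulCounterServer()
first = stateful.handle("client-1")
second = stateful.handle("client-1")
```

A practical consequence of this design choice is that a stateless server can be restarted or replaced between two requests without the client noticing, while a stateful server must preserve or rebuild its per-client information.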

Advantages and Disadvantages of Client/Server System
A client/server system provides a number of advantages over a
powerful centralized mainframe system. The major advantage is that it
improves usability, flexibility, interoperability and scalability as
compared to centralized, time-sharing, mainframe computing. In
addition, a client/server system has the following advantages:

• A client/server system has the ability to distribute the computing
workload between client workstations and shared servers.
• A client/server system allows the end user to use a
microcomputer’s graphical user interfaces, thereby improving
functionality and simplicity.
• It provides better performance at a reduced cost for hardware
and software than alternative mini or mainframe solutions.

On the other hand, a client/server environment is more difficult to
maintain, for a variety of reasons:

• The client/server architecture creates a more complex
environment in which it is often difficult to manage different
platforms (LANs, operating systems, DBMSs, etc.).
• In a client/server system, the operating system software is
distributed over many machines rather than a single system,
thereby increasing complexity.
• A client/server system may suffer from security problems as the
number of users and processing sites increases.
• The workstations are geographically distributed in a client/server
system, and each of these workstations is administered and
controlled by individual departments, which adds extra
complexity. Furthermore, a communication cost is incurred with
each processing request.
• The maintenance cost of a client/server system is greater than
that of an alternative mini or mainframe solution.

Architecture of Client/Server Distributed Systems
Client/server architecture is a prerequisite to the proper development
of client/server systems. The client/server architecture is based on
hardware and software components that interact to form a distributed
system. In a client/server distributed database system, the entire data
can be viewed as a single logical database while at the physical level
the data may be distributed. From the data organizational view, the
architecture of a client/server distributed database system concentrates
mainly on the software components of the system, which comprise three
main components: clients, servers and communications middleware.

1. A Client is an individual computer or process or user’s application
that requests services from the server. A Client is also known
as a front-end application, as the end user usually interacts with
the client process. The software components required in the
client machine are the client operating system, client DBMS and
client graphical user interface. The client process runs on an
operating system that has at least some multi-tasking capabilities.
The end users interact with the client process via a graphical user
interface. In addition, a client DBMS is required at the client side,
which is responsible for managing the data that is cached in the
client. In some client/server architectures, communication
software is embedded into the client machine, as a substitute for
communication middleware, to interact efficiently with other
machines in the network.
2. A Server consists of one or more computers or is a computer
process or application that provides services to clients. A Server is
also known as back-end application, as the server process
provides the background services for the client processes. A
server provides most of the data management services such as
query processing and optimization, transaction management,
recovery management, storage management and integrity
maintenance services to the clients. In addition, sometimes
communication software is embedded into the server machine,
instead of communication middleware, to manage
communications with clients and other servers in the network.
3. Communication middleware is any process or set of processes through
which clients and servers communicate with each other. The
communication middleware is usually associated with a network
that controls data and information transmission between clients
and servers. Communication middleware software consists of
three main components: application program interface (API),
database translator and network translator. The API is exposed to
client applications, which use it to communicate with the
communication middleware. The middleware API allows the client
process to be database server–independent. The database
translator translates the SQL requests into the specific database
server syntax, thus enabling a DBMS from one vendor to
communicate directly with a DBMS from another vendor, without
using a gateway. The network translator manages the network
communications protocols; thus, it allows clients to be network
protocol–independent. To accomplish the connection between
the client and the server, the communication middleware
software operates at two different levels. The physical level deals
with the communications between the client and the server
computers (computer to computer) whereas the logical level
deals with the communications between the client and the server
processes (interprocess).

The basic client/server architecture is illustrated in figure 6.1.

Figure 6.1. Client/Server Reference Architecture
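
The division of communication middleware into an API, a database translator and a network translator can be sketched as three small functions. This is only an illustration under stated assumptions: the dialect labels ("limit", "top"), the protocol labels ("json", "length-prefixed") and all function names are hypothetical, and real middleware products handle far more than a single row-limited SELECT.

```python
import json


def translate_for_database(table, limit, dialect):
    """Database translator: render one generic request in a server-specific syntax."""
    if dialect == "limit":   # servers that accept a trailing LIMIT clause
        return f"SELECT * FROM {table} LIMIT {limit}"
    if dialect == "top":     # servers that accept SELECT TOP n
        return f"SELECT TOP {limit} * FROM {table}"
    raise ValueError(f"unknown dialect: {dialect}")


def frame_for_network(sql, protocol):
    """Network translator: wrap the statement in a protocol-specific frame."""
    if protocol == "json":
        return json.dumps({"sql": sql}).encode("utf-8")
    if protocol == "length-prefixed":
        payload = sql.encode("utf-8")
        return len(payload).to_bytes(4, "big") + payload
    raise ValueError(f"unknown protocol: {protocol}")


def middleware_api(table, limit, server):
    """API: the only entry point seen by the client application."""
    sql = translate_for_database(table, limit, server["dialect"])
    return frame_for_network(sql, server["protocol"])


# The same client call works against two differently configured servers.
frame_a = middleware_api("Employee", 5, {"dialect": "limit", "protocol": "json"})
frame_b = middleware_api("Employee", 5, {"dialect": "top", "protocol": "length-prefixed"})
```

The client code calls only `middleware_api`, so it stays independent of both the database server syntax and the network protocol, which is the property the text attributes to the middleware.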

The client/server architecture is intended to provide a scalable
architecture, whereby each computer or process on the network is
either a client or a server.

Architectural Alternatives for Client/Server Systems
A client/server system can have several architectural alternatives
known as two-tier, three-tier and multi-tier or n-tier.

Two-tier architecture – A generic client/server architecture has two
types of nodes on the network: clients and servers. As a result, these
generic architectures are sometimes referred to as two-tier
architectures. In two-tier client/server architecture, the user system
interface is usually located in the user’s desktop environment, and the
database management services are usually located on a server that
services many clients. Processing management is split between the user
system interface environment and the database management server
environment. The general two-tier architecture of a Client/Server
system is illustrated in figure 6.2.

Figure 6.2. Two-tier Client/Server Architecture

In a two-tier client/server system, it may occur that multiple clients are
served by a single server, called multiple clients–single server
approach. Another alternative is multiple servers providing services to
multiple clients, which is called multiple clients–multiple servers
approach. In the case of multiple clients–multiple servers approach,
two alternative management strategies are possible: either each client
manages its own connection to the appropriate server or each client
communicates with its home server, which further communicates with
other servers as required. The former approach simplifies server code
but complicates the client code with additional functionalities, which
leads to a heavy (fat) client system. On the other hand, the latter
approach loads the server machine with all data management
responsibilities and thus leads to a light (thin) client system. Depending
on the extent to which the processing is shared between the client and
the server, a server can be described as fat or thin. A fat server carries
the larger proportion of the processing load, whereas a thin server
carries a lesser processing load.
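
The two management strategies for the multiple clients–multiple servers approach can be contrasted in a short sketch. All class names and the key/value data are hypothetical; the sketch only assumes that each server stores some keys and that a home server may contact its peers.

```python
class Server:
    def __init__(self, name, data, peers=None):
        self.name = name
        self.data = data            # key -> value stored at this server
        self.peers = peers or []    # other servers a home server may contact

    def get_local(self, key):
        return self.data.get(key)

    def get(self, key):
        """Home-server strategy: answer locally, otherwise ask the peers."""
        if key in self.data:
            return self.data[key]
        for peer in self.peers:
            value = peer.get_local(key)
            if value is not None:
                return value
        return None


class FatClient:
    """Client-managed strategy: the client knows which server holds which key."""

    def __init__(self, directory):
        self.directory = directory  # key -> Server

    def get(self, key):
        server = self.directory.get(key)
        return server.get_local(key) if server else None


class ThinClient:
    """Home-server strategy: the client talks to one server only."""

    def __init__(self, home):
        self.home = home

    def get(self, key):
        return self.home.get(key)


s1 = Server("s1", {"a": 1})
s2 = Server("s2", {"b": 2})
s1.peers = [s2]
fat = FatClient({"a": s1, "b": s2})
thin = ThinClient(s1)
```

In the first strategy the routing knowledge (the `directory`) lives in the client, which is what makes it heavy; in the second it lives in `Server.get`, which keeps the client light at the price of more work on the server.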

The two-tier client/server architecture is a good solution for distributed
computing when work groups of up to 100 people are interacting on a
local area network simultaneously. It also has a number of limitations.
The major limitation is that performance begins to deteriorate when
the number of users exceeds 100. A second limitation of the two-tier
architecture is that implementation of processing management services
using vendor proprietary database procedures restricts flexibility and
choice of DBMS for applications.

Three-tier architecture – Some networks of client/server architecture
consist of three different kinds of nodes: clients, application servers,
which process data for the clients, and database servers, which store
data for the application servers. This arrangement is called three-tier
architecture. The three-tier architecture (also referred to as multi-tier
architecture) emerged to overcome the limitations of the two-tier
architecture. In the three-tier architecture, a middle tier was added
between the user system interface client environment and the
database management server environment. The middle tier can
perform queuing, application execution, and database staging. There
are various ways of implementing the middle tier, such as transaction
processing monitors, message servers, web servers, or application
servers. The typical three-tier architecture of a client/server system is
depicted in figure 6.3.

Figure 6.3. Three-tier Client/Server Architecture

The most basic type of three-tier architecture has a middle layer
consisting of transaction processing (TP) monitor technology. The TP
monitor technology is a type of message queuing, transaction
scheduling and prioritization service where the client connects to the
TP monitor (middle tier) instead of the database server. The transaction
is accepted by the monitor, which queues it and takes responsibility for
managing it to completion, thus freeing up the client. TP monitor
technology also provides a number of services such as updating
multiple DBMSs in a single transaction, connectivity to a variety of data
sources including flat files, non-relational DBMSs and the mainframe,
the ability to attach priorities to transactions and robust security. When
all these functionalities are provided by third-party middleware
vendors, the TP monitor code becomes more complex; such a system is
referred to as TP Heavy, and it can service thousands of users. On the
other hand, if all these functionalities are embedded in the DBMS, so
that the system can be considered a two-tier architecture, it is referred
to as TP Lite. A limitation of TP monitor technology is that the
implementation code is usually written in a lower-level language, and
the system is not yet widely available in the popular visual toolsets.

Messaging is another way of implementing three-tier architectures.
Messages are prioritized and processed asynchronously. The message
server connects to the relational DBMS and other data sources. The
message server architecture mainly focuses on intelligent messages.
Messaging systems are good solutions for wireless infrastructure.
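
The queuing and prioritization behaviour described for the TP monitor (and, similarly, for prioritized asynchronous messaging) can be sketched with a priority queue. The class below is a hypothetical illustration, not the interface of any TP monitor product; it assumes that a smaller number means a higher priority and that a transaction is just a label handed to an `execute` callback.

```python
import heapq
import itertools


class TPMonitor:
    def __init__(self):
        self._queue = []
        # The running counter keeps first-in, first-out order within one priority.
        self._order = itertools.count()

    def submit(self, priority, transaction):
        """Accept a transaction and return at once, freeing up the client."""
        heapq.heappush(self._queue, (priority, next(self._order), transaction))

    def run_all(self, execute):
        """Manage every queued transaction to completion, highest priority first."""
        results = []
        while self._queue:
            _, _, transaction = heapq.heappop(self._queue)
            results.append(execute(transaction))
        return results


monitor = TPMonitor()
monitor.submit(2, "report")
monitor.submit(1, "payment")
monitor.submit(2, "audit")
completed = monitor.run_all(lambda txn: txn)
```

The client returns from `submit` before the work is done, which is the sense in which the monitor takes responsibility for the transaction.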

The three-tier architecture with a middle layer consisting of an
application server allocates the main body of an application on a shared
host for execution, rather than in the user system interface client
environment. The application server hosts business logic,
computations, and a data retrieval engine. Thus, the major advantages
of an application server are that, with less software on the client side,
there is less security to worry about, applications are more scalable, and
installation costs are lower on a single server than when the application
is maintained on each desktop client.

Currently, developing client/server systems using technologies that
support distributed objects is gaining popularity, as these technologies
support interoperability across languages and platforms, as well as
enhancing maintainability and adaptability of the system. There are
currently two prominent distributed object technologies: one is
Common Object Request Broker Architecture (CORBA) and the other is
COM (Component Object Model)/DCOM.

The major advantage of three-tier client/server architecture is that it
provides better performance for groups with a large number of users
and improves flexibility compared to the two-tier approach. In the case
of three-tier architecture, as data processing is separated across
different servers, it provides more scalability. The disadvantage of
three-tier architecture is that it puts a greater load on the network.
Moreover, in the case of three-tier architecture, it is much more difficult
to develop and test software than in two-tier architecture, because more
devices have to communicate to complete a user’s transaction.

In general, a multi-tier (or n-tier) architecture may deploy any number
of distinct services, including transitive relations between application
servers implementing different functions of business logic, each of
which may or may not employ a distinct or shared database system.

Peer-to-Peer Distributed System
The peer-to-peer architecture is a good way to structure a distributed
system so that it consists of many identical software processes or
modules, each module running on a different computer or node. The
different software modules stored at different sites communicate with
each other to complete the processing required for the execution of
distributed applications. Peer-to-peer architecture provides both client
and server functionalities on each computer. Therefore, each node can
access services from other nodes as well as providing services to other
nodes. In contrast with the client/server architecture, in a peer-to-peer
distributed system each node provides user interaction facilities as well
as processing capabilities.

Considering the complexity associated with discovering, communicating
with and managing the large number of computers involved in a
distributed system, the software module at each node in a peer-to-peer
distributed system is typically structured in a layered manner. Thus, the
software modules of peer-to-peer applications can be divided into
three layers, known as the base overlay layer, the middleware layer,
and the application layer. The base overlay layer deals with the issue of
discovering other participants in the system and creating a mechanism
for all nodes to communicate with each other. This layer ensures that
all participants in the network are aware of other participants. The
middleware layer includes additional software components that can be
potentially reused by many different applications. The functionalities
provided by this layer include the ability to create a distributed index
for information in the system, a publish/subscribe facility and security
services. The functions provided by the middleware layer are not
necessary for all applications, but they are developed to be reused by
more than one application. The application layer provides software
packages intended to be used by users and developed so as to exploit
the distributed nature of the peer-to-peer infrastructure. There is no
standard terminology across different implementations of the peer-to-
peer system, and thus, the term “peer-to-peer” is used for general
descriptions of the functionalities required for building a generic peer-
to-peer system. Most of the peer-to-peer systems are developed as
single-application systems.

As a database management system, each node in a peer-to-peer
distributed system provides all data management services, and it can
execute local queries as well as global queries. Thus, in this system
there is no distinction between client DBMS and server DBMS. As a
single-application program, the DBMS at each node accepts user requests
and manages execution. As in a client/server system, the data in a
peer-to-peer distributed database system is viewed as a single logical
database although the data is distributed at the physical level. In this
context, the identification of the reference architecture for a
distributed database system is necessary.

Reference Architecture of Distributed DBMSs
This section introduces the reference architecture of a distributed
database system. Owing to the diversity of distributed DBMSs, it is
difficult to represent a common architecture that is generally
applicable for all applications. However, it may be useful to represent a
possible reference architecture that addresses data distribution. Data in
a distributed system are usually fragmented and replicated. Considering
this fragmentation and replication issue, the reference architecture of a
distributed DBMS consists of the following schemas (as depicted
in figure 6.4):

Figure 6.4. Reference Architecture of Distributed DBMS

• A set of global external schemas
• A global conceptual schema (GCS)
• A fragmentation schema and allocation schema
• A set of schemas for each local DBMS, conforming to the ANSI–SPARC
three-level architecture.

• Global external schema – In a distributed system, user
applications and user accesses to the distributed database are
represented by a number of global external schemas. This is the
topmost level in the reference architecture of a distributed DBMS.
This level describes the part of the distributed database that is
relevant to different users.
• Global conceptual schema – The GCS represents the logical
description of the entire database as if it were not distributed. This
level corresponds to the conceptual level of the ANSI–SPARC
architecture of a centralized DBMS and contains definitions of all
entities, relationships among entities and security and integrity
information for the whole database stored at all sites in a
distributed system.
• Fragmentation schema and allocation schema – In a distributed
database, the data can be split into a number of non-overlapping
portions, called fragments. There are several different ways to
perform this fragmentation operation. The fragmentation schema
describes how the data is to be logically partitioned in a
distributed database. The GCS consists of a set of global relations,
and the mapping between the global relations and fragments is
defined in the fragmentation schema. This mapping is one-to-
many, that is, a number of fragments correspond to one global
relation but only one global relation corresponds to one fragment.
The allocation schema is a description of where the data
(fragments) are to be located, taking account of any replication.
The type of mapping defined in the allocation schema determines
whether the distributed database is redundant or non-redundant.
In the case of redundant data distribution, the mapping is one-to-
many, whereas in the case of non-redundant data distribution the
mapping is one-to-one.
• Local schemas – Each local DBMS in a distributed system has its
own set of schemas. The local conceptual and local internal
schemas correspond to the equivalent levels of ANSI–SPARC
architecture. In a distributed database system, the physical data
organization at each machine is probably different, and therefore
it requires an individual internal schema definition at each site,
called local internal schema. To handle fragmentation and
replication issues, the logical organization of data at each site is
described by a third layer in the architecture, called local
conceptual schema. The GCS is the union of all local conceptual
schemas; thus, the local conceptual schemas are mappings of the
global schema onto each site. This mapping is done by local
mapping schemas. The local mapping schema maps fragments in
the allocation schema onto external objects in the local database,
and this mapping depends on the type of local DBMS. Therefore,
in a heterogeneous distributed DBMS, there may be different
types of local mappings at different nodes.
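
The chain of mappings described above (global relation to fragments, fragments to sites, and fragments to local objects) can be written down as three small lookup tables. This is a minimal sketch with hypothetical relation, fragment, site and local object names; it only illustrates the one-to-many and one-to-one mappings mentioned in the text.

```python
# Fragmentation schema: global relation -> fragments (one-to-many).
fragmentation_schema = {"Employee": ["Emp_Delhi", "Emp_Mumbai"]}

# Allocation schema: fragment -> sites. More than one site means the
# fragment is replicated, i.e. the distribution is redundant.
allocation_schema = {
    "Emp_Delhi": ["site1"],
    "Emp_Mumbai": ["site2", "site3"],
}

# Local mapping schema: (site, fragment) -> object name in the local DBMS.
local_mapping = {
    ("site1", "Emp_Delhi"): "emp_del",
    ("site2", "Emp_Mumbai"): "EMP_MUM",
    ("site3", "Emp_Mumbai"): "employees_mumbai",
}


def is_redundant(allocation):
    """A distribution is redundant if any fragment is stored at several sites."""
    return any(len(sites) > 1 for sites in allocation.values())


def resolve(relation):
    """For each fragment of a global relation, list the (site, local object) pairs."""
    return {
        fragment: [(site, local_mapping[(site, fragment)])
                   for site in allocation_schema[fragment]]
        for fragment in fragmentation_schema[relation]
    }


placement = resolve("Employee")
```

A user who queries Employee never sees these tables; distribution transparency amounts to the DDBMS performing the `resolve` step on the user's behalf.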

This architecture provides a very general conceptual framework for
understanding distributed databases. Furthermore, such databases are
typically designed in a top-down manner; therefore, all external-view
definitions are made globally.

Component Architecture of Distributed DBMSs
This section introduces the component architecture of a distributed
DBMS that is independent of the reference architecture. The four major
components of a distributed DBMS that have been identified are as
follows:

• Distributed DBMS (DDBMS) component
• Data communications (DC) component
• Global system catalog (GSC)
• Local DBMS (LDBMS) component

Distributed DBMS (DDBMS) component. The DDBMS component is the
controlling unit of the entire system. This component provides the
different levels of transparencies such as data distribution
transparency, transaction transparency, performance transparency and
DBMS transparency (in the case of heterogeneous DDBMS). Ozsu has
identified four major components of DDBMS as listed below:

1. The user interface handler – This component is responsible for
interpreting user commands as they come into the system and
formatting the result data as it is sent to the user.
2. The semantic data controller – This component is responsible for
checking integrity constraints and authorizations that are defined
in the GCS, before processing the user requests.
3. The global query optimizer and decomposer – This component
determines an execution strategy to minimize a cost function and
translates the global queries into local ones using the global and
local conceptual schemas as well as the global system catalog. The
global query optimizer is responsible for generating the best
strategy to execute distributed join operations.
4. The distributed execution monitor – This component coordinates
the distributed execution of the user request. It is also
known as the distributed transaction manager. During the execution
of distributed queries, the execution monitors at various sites may
and usually do communicate with one another.
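
The way a request passes through the four components listed above can be sketched as a chain of functions. Everything in this sketch is an assumption made for illustration: the command format "READ <relation>", the catalog and permission tables, and the in-memory site data are hypothetical, and cost-based optimization is omitted entirely.

```python
# Hypothetical catalog content: relation -> {fragment: site}.
CATALOG = {"Employee": {"Emp_Delhi": "site1", "Emp_Mumbai": "site2"}}
# Hypothetical authorizations: user -> relations the user may read.
PERMISSIONS = {"alice": {"Employee"}}
# Hypothetical data held by the local DBMSs.
SITE_DATA = {
    "site1": {"Emp_Delhi": [("Asha",)]},
    "site2": {"Emp_Mumbai": [("Ben",)]},
}


def interface_handler(command):
    """Interpret a user command of the assumed form 'READ <relation>'."""
    verb, relation = command.split()
    if verb != "READ":
        raise ValueError(f"unsupported command: {verb}")
    return relation


def semantic_controller(user, relation):
    """Check authorization before the request is processed."""
    if relation not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} may not read {relation}")


def decomposer(relation):
    """Translate the global request into one local request per fragment."""
    return [(site, fragment) for fragment, site in CATALOG[relation].items()]


def execution_monitor(local_requests):
    """Run the local requests and combine their results."""
    rows = []
    for site, fragment in local_requests:
        rows.extend(SITE_DATA[site][fragment])
    return rows


def run(user, command):
    relation = interface_handler(command)
    semantic_controller(user, relation)
    return execution_monitor(decomposer(relation))


rows = run("alice", "READ Employee")
```

A real decomposer would also choose among alternative execution strategies by cost; here it simply produces one local request per fragment in catalog order.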

DC component. The DC component is the software that enables all sites
to communicate with each other. The DC component contains all
information about the sites and the links.

Global system catalog (GSC). The GSC provides the same functionality
as the system catalog of a centralized DBMS. In addition to metadata of
the entire database, a GSC contains all fragmentation, replication and
allocation details considering the distributed nature of a DDBMS. It can
itself be managed as a distributed database and thus, it can be
fragmented and distributed, fully replicated or centralized like any
other relation in the system. [The details of GSC management will be
introduced in Chapter 12, Section 12.2].

Local DBMS (LDBMS) component. The LDBMS component is a standard
DBMS, installed at each site that has a database, and is responsible for
controlling local data. Each LDBMS component has its own local system
catalog that contains the information about the data stored at that
particular site. In a homogeneous DDBMS, the LDBMS component is the
same product, replicated at each site, while in a heterogeneous
DDBMS, there must be at least two sites with different DBMS products
and/or platforms. The major components of an LDBMS are as follows:

1. The local query optimizer – This component is used as the access
path selector and is responsible for choosing the best access path to
access any data item for the execution of a query (the query may
be a local query or part of a global query executed at that site).
2. The local recovery manager – The local recovery manager
ensures the consistency of the local database in spite of failures.
3. The run-time support processor – This component physically
accesses the database according to the commands in the schedule
generated by the query optimizer and is responsible for managing
main memory buffers. The run-time support processor is the
interface to the operating system and contains the database
buffer (or cache) manager.

Distributed Data Independence
The reference architecture of a DDBMS is an extension of the ANSI–SPARC
architecture; therefore, data independence is supported by this model.
Distributed data independence means that upper levels are unaffected
by changes to lower levels in the distributed database architecture. As in
a centralized DBMS, both distributed logical data
independence and distributed physical data independence are
supported by this architecture. In a distributed system, the user queries
data irrespective of its location, fragmentation or replication.
Furthermore, any changes made to the GCS do not affect the user
views in the global external schemas. Thus, distributed logical data
independence is provided by global external schemas in distributed
database architecture. Similarly, the GCS provides distributed physical
data independence in the distributed database environment.

Multi-Database System (MDBS)
In recent years, MDBS has been gaining the attention of many
researchers who attempt to logically integrate several different
independent DDBMSs while allowing the local DBMSs to maintain
complete control of their operations. Complete autonomy means that
there can be no software modifications to the local DBMSs in a DDBMS.
Hence, an MDBS, which provides the necessary functionality, is
introduced as an additional software layer on top of the local DBMSs.

An MDBS is software that can be manipulated and accessed through a
single manipulation language with a single common data model (i.e.,
through a single application) in a heterogeneous environment without
interfering with the normal execution of the individual database
systems. The MDBS has developed from a requirement to manage and
retrieve data from multiple databases within a single application while
providing complete autonomy to individual database systems. To
support DBMS transparency, MDBS resides on top of existing databases
and file systems and presents a single database to its users. An MDBS
maintains a global schema against which users issue queries and
updates, and this global schema is constructed by integrating the
schemas of local databases. To execute a global query, the MDBS first
translates it into a number of subqueries, and converts these
subqueries into appropriate local queries for running on local DBMSs.
After completion of execution, local results are merged and the final
global result for the user query is generated. An MDBS controls multiple
gateways and manages local databases through these gateways.
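
The execution of a global query in an MDBS (translation into local queries, local execution, merging of results) can be sketched with two independent in-memory SQLite databases standing in for autonomous local DBMSs. The table and column names, the global relation Employee(name, city) and the `GATEWAYS` list are all hypothetical; in a real MDBS the local systems would typically be different products reached through vendor-specific gateways.

```python
import sqlite3


def make_db(ddl, insert, rows):
    """Create one autonomous local database with its own local schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert, rows)
    return conn


db_a = make_db(
    "CREATE TABLE staff (name TEXT, city TEXT)",
    "INSERT INTO staff VALUES (?, ?)",
    [("Asha", "Delhi")],
)
db_b = make_db(
    "CREATE TABLE employees (full_name TEXT, location TEXT)",
    "INSERT INTO employees VALUES (?, ?)",
    [("Ben", "Mumbai")],
)

# Each gateway records how the global relation Employee(name, city)
# is expressed in the schema of one local database.
GATEWAYS = [
    (db_a, "SELECT name, city FROM staff"),
    (db_b, "SELECT full_name, location FROM employees"),
]


def global_select_employees():
    """Run the local queries and merge the local results into one global result."""
    merged = []
    for conn, local_query in GATEWAYS:
        merged.extend(conn.execute(local_query).fetchall())
    return sorted(merged)


all_employees = global_select_employees()
```

Neither local database is modified or even aware of the other, which mirrors the requirement that the MDBS sits on top of the local systems without interfering with them.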

MDBSs can be classified into two different categories based on the
autonomy of the individual DBMSs. These are non-federated
MDBSs and federated MDBSs. A federated MDBS is again categorized
as loosely coupled federated MDBS and tightly coupled federated
MDBS based on who manages the federation and how the components
are integrated. Further, a tightly coupled federated MDBS can be
classified as single federation tightly coupled federated
MDBS and multiple federations tightly coupled federated MDBS. The
complete taxonomy of MDBSs [Sheth and Larson, 1990] is depicted
in figure 6.5.

Figure 6.5. Taxonomy of Multi-database Systems

Federated MDBS (FMDBS). It is a collection of cooperating database


management systems that are autonomous but participate in a
federation to allow partial and controlled sharing of their data. In a
federated MDBS, all component DBMSs cooperate to allow different
degrees of integration. There is no centralized control in a federated
architecture because the component databases control access to their
data. To allow controlled sharing of data while preserving the
autonomy of component DBMSs and continued execution of existing
applications, a federated MDBS supports two types of operations: local
and global (or federation). Local operations are directly submitted to a
component DBMS, and they involve data only from that component
database. Global operations access data from multiple databases
managed by multiple component DBMSs via the FMDBS. Thus, an
FMDBS is a cross between a DDBMS and a centralized DBMS: it appears as
a distributed system to global users and as a centralized DBMS to local
users. Put simply, an MDBS is said to be an FMDBS if users
interface to the MDBS through some integrated views, and there is no
connection between any two integrated views. The features of an
FMDBS are listed below:

 Integrated schema exists. The FMDBS administrator (MDBA) is
responsible for the creation of integrated schemas in a
heterogeneous environment.
 Component databases are transparent to users. Users are not
aware of the multiple component DBMSs in an FMDBS; thus, the
users only need to understand the integrated schemas to
perform operations on an FMDBS. They cannot change the
integrated components when they operate on the FMDBS.
 A common data model (CDM) is required to implement the
federation. The CDM must be powerful enough to represent all
data models used by the different components. The export schemas
of the component databases are integrated in terms of the CDM.
 Update transactions are a difficult issue in an FMDBS. The
component databases are completely independent and join the
federation through the integrated schema. It is difficult to decide
whether the FMDBS or the local component database systems will
control the transactions.

Two types of FMDBSs have been identified, namely, loosely coupled
FMDBS and tightly coupled FMDBS depending on how multiple
component databases are integrated. An FMDBS is loosely coupled if it
is the user’s responsibility to create and maintain the federation and
there is no control enforced by the federated system and its
administrators. Conversely, an FMDBS is tightly coupled if the federation
and its administrator(s) have the responsibility for creating and
maintaining the integration and they actively control the access to
component databases. A federation is built by a selective and
controlled integration of its components. A tightly coupled FMDBS may
have one or more federated schemas. A tightly coupled FMDBS is said
to have single federation if it allows the creation and management of
only one federated schema. On the other hand, a tightly coupled
FMDBS is said to have multiple federations if it allows the creation and
management of multiple federated schemas. A loosely coupled FMDBS
always supports multiple federated schemas.

Non-federated MDBS. In contrast to an FMDBS, a non-federated MDBS
does not distinguish between local and global users. In a non-federated MDBS,
all component databases are fully integrated to provide a single global
schema known as the unified MDBS (sometimes called enterprise or
corporate). Thus, in a non-federated MDBS, all applications are global
applications (because there is no local user) and data are accessed
through a single global schema. It logically appears to its users like a
distributed database.
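
The taxonomy described above can be summarized as a small decision function. This is a minimal sketch, assuming that three yes/no style properties are enough to place a system in the taxonomy; the names MDBSKind, classify_mdbs and the parameter names are hypothetical and chosen only for illustration.

```python
from enum import Enum

class MDBSKind(Enum):
    NON_FEDERATED = "non-federated MDBS"
    LOOSELY_COUPLED = "loosely coupled FMDBS"
    TIGHT_SINGLE = "tightly coupled FMDBS, single federation"
    TIGHT_MULTIPLE = "tightly coupled FMDBS, multiple federations"

def classify_mdbs(has_local_users, federation_built_by="administrator",
                  multiple_federated_schemas=False):
    """Place a multi-database system in the taxonomy of the text.

    has_local_users: True if the system distinguishes local from global users.
    federation_built_by: "user" or "administrator" (assumed labels).
    multiple_federated_schemas: only relevant for tightly coupled systems.
    """
    if federation_built_by not in ("user", "administrator"):
        raise ValueError("federation_built_by must be 'user' or 'administrator'")
    # Non-federated: no local users, everything goes through one global schema.
    if not has_local_users:
        return MDBSKind.NON_FEDERATED
    # Loosely coupled: users create and maintain the federation themselves.
    if federation_built_by == "user":
        return MDBSKind.LOOSELY_COUPLED
    # Tightly coupled: administrators build and control the federation.
    if multiple_federated_schemas:
        return MDBSKind.TIGHT_MULTIPLE
    return MDBSKind.TIGHT_SINGLE

kind = classify_mdbs(has_local_users=True, federation_built_by="administrator")
```

Note that the multiple_federated_schemas flag is ignored for loosely coupled systems, since the text states that a loosely coupled FMDBS always supports multiple federated schemas.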

Five-Level Schema Architecture of Federated MDBS
The terms federated database system and federated database
architecture were introduced by Heimbigner and McLeod (1985) to
facilitate interactions and sharing among independently designed
databases. Their main purpose was to build up a loosely coupled
federation of different component databases. Sheth and Larson (1990)
have identified a five-level schema architecture for a federated MDBS
to solve the heterogeneity of FMDBSs, which is depicted in figure 6.6.

Figure 6.6. Five-level Schema Architecture of Federated MDBS

1. Local schema. A local schema is used to represent each
component database of a federated MDBS. A local schema is
expressed in the native data model of the component DBMS, and
hence different local schemas can be expressed in different data
models.
2. Component schema. A component schema is generated by
translating a local schema into a data model called the canonical
data model or CDM of the FMDBS. A component schema is used to facilitate
negotiation and integration among divergent local schemas to
execute global tasks.
3. Export schema. An export schema is a subset of a component
schema, and it is used to represent only those portions of a local
database that are authorized by the local DBMS for access by
non-local users. The purpose of defining export schemas is to
control and manage the autonomy of component databases.
4. Federated schema. A federated schema is an integration of
multiple export schemas. It is always connected with a data
dictionary that stores the information about the data distribution
and the definitions of different schemas in the heterogeneous
environment. There may be multiple federated schemas in an
FMDBS, one for each class of federation users.
5. External schema or application schema. An external schema or
application schema is derived from the federated schema and is
tailored to a particular class of users. An application schema can be
a subset of a large, complicated federated schema, or it may be
expressed in a different data model to fit a specific user interface and
meet the requirements of different users. This allows users to put
additional integrity constraints or access-control constraints on
the federated schema.
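
The five levels listed above can be traced on a toy example. The following is a simplified sketch, not the Sheth and Larson construction itself: schemas are reduced to plain dictionaries, the CDM is assumed to be "relation name mapped to a list of attribute names", and all names (translate_relational, translate_document, export_schema, integrate, external_schema, and the sample relations) are hypothetical. Real schema integration also has to resolve naming and semantic conflicts, which this sketch only detects.

```python
# Level 1: local schemas, each in its own (hypothetical) native form.
local_relational = {"EMP": ["eno", "ename", "sal", "ssn"]}
local_document = {"collection": "projects",
                  "fields": ["pid", "title", "budget"]}

# Level 2: component schemas, translated into the assumed CDM
# (a dict mapping relation name -> list of attribute names).
def translate_relational(local):
    return {rel: list(attrs) for rel, attrs in local.items()}

def translate_document(local):
    return {local["collection"]: list(local["fields"])}

component_a = translate_relational(local_relational)
component_b = translate_document(local_document)

# Level 3: export schema, the subset each site authorizes for non-local users.
def export_schema(component, authorized):
    return {
        rel: [a for a in attrs if a in authorized[rel]]
        for rel, attrs in component.items()
        if rel in authorized
    }

export_a = export_schema(component_a, {"EMP": {"eno", "ename"}})
export_b = export_schema(component_b, {"projects": {"pid", "title", "budget"}})

# Level 4: federated schema, an integration of export schemas, plus a
# data dictionary recording which site each relation comes from.
def integrate(exports):
    federated, dictionary = {}, {}
    for site, schema in exports.items():
        for rel, attrs in schema.items():
            if rel in federated:
                raise ValueError(f"name conflict on relation {rel!r}")
            federated[rel] = list(attrs)
            dictionary[rel] = site
    return federated, dictionary

federated, data_dictionary = integrate(
    {"site_a": export_a, "site_b": export_b}
)

# Level 5: external (application) schema, a view for one class of users.
def external_schema(fed, visible):
    return {
        rel: [a for a in attrs if a in visible[rel]]
        for rel, attrs in fed.items()
        if rel in visible
    }

hr_view = external_schema(federated, {"EMP": {"ename"}})
```

In this run the attributes sal and ssn never leave site_a, because they are filtered out at the export level before integration takes place.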

REFERENCE ARCHITECTURE OF TIGHTLY COUPLED FEDERATED MDBS
The architecture of an FMDBS is primarily determined by which
schemas are present, how they are arranged and how they are
constructed. The reference architecture is necessary to understand,
categorize and compare different architectural options for developing
federated database systems. This section describes the reference
architecture of a tightly coupled FMDBS. Usually, an FMDBS is designed
in a bottom-up manner to integrate a number of existing
heterogeneous databases.

In a tightly coupled FMDBS, the federated schema takes the form of
schema integration. For simplicity, a single (logical) federation is
considered for the entire system, and it is represented by a GCS. A
number of export schemas are integrated into the GCS, where the
export schemas are created through negotiation between the local
databases and the GCS. Thus, in an FMDBS, the GCS is a subset of local
conceptual schemas and consists of the data that each local DBMS
agrees to share. The GCS of a tightly coupled FMDBS involves the
integration of either parts of the local conceptual schemas or the local
external schemas. Global external schemas are generated through
negotiation between global users and the GCS. The reference
architecture of a tightly coupled FMDBS is depicted in figure 6.7.

Figure 6.7. Reference Architecture of Tightly Coupled Federated MDBS

REFERENCE ARCHITECTURE OF LOOSELY COUPLED FEDERATED MDBS
In contrast with a tightly coupled FMDBS, schema integration does not
take place in a loosely coupled FMDBS; therefore, a loosely coupled
FMDBS cannot have a GCS. In this case, federated schemas for global
users are defined by importing export schemas using a user interface or
an application program or by defining a multi-database language query
that references export schema objects of local databases. Export
schemas are created based on local component databases. Thus, in a
loosely coupled FMDBS, a global external schema consists of one or
more local conceptual schemas. The reference architecture of a loosely
coupled FMDBS is depicted in figure 6.8.

Figure 6.8. Reference Architecture of Loosely Coupled Federated MDBS
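
The idea of a user query that references export schema objects directly, without any GCS, can be illustrated with a rough analogy. The sketch below uses SQLite's ATTACH DATABASE only as a stand-in for qualified multi-database references: in a real loosely coupled FMDBS the two databases would be independent DBMSs, not schemas inside one SQLite connection. The schema names hr and proj and the tables emp_export and assign_export are hypothetical.

```python
import sqlite3

# One connection plays the role of the user's multi-database session.
conn = sqlite3.connect(":memory:")

# Two separate in-memory databases stand in for two component databases.
conn.execute("ATTACH DATABASE ':memory:' AS hr")
conn.execute("ATTACH DATABASE ':memory:' AS proj")

# Each component database publishes an export schema object.
conn.execute("CREATE TABLE hr.emp_export (eno INTEGER, ename TEXT)")
conn.execute("CREATE TABLE proj.assign_export (eno INTEGER, pname TEXT)")
conn.executemany("INSERT INTO hr.emp_export VALUES (?, ?)",
                 [(1, "Ana"), (2, "Raj")])
conn.executemany("INSERT INTO proj.assign_export VALUES (?, ?)",
                 [(1, "Billing"), (2, "Search"), (1, "Search")])

# The user defines the "federation" inside the query itself by naming the
# export objects with database-qualified names; no global schema exists.
user_query = """
    SELECT e.ename, a.pname
    FROM hr.emp_export AS e
    JOIN proj.assign_export AS a ON a.eno = e.eno
    ORDER BY e.ename, a.pname
"""
rows = conn.execute(user_query).fetchall()
```

The point of the sketch is that responsibility for knowing where the data lives, and how the export objects relate, rests with the user who writes the query rather than with a federation administrator.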

Chapter Summary
This chapter introduces several alternative architectures for a
distributed database system, such as client/server, peer-to-peer and
MDBSs.

 A client/server system is a versatile, message-based and modular
infrastructure that is intended to improve usability, flexibility,
interoperability and scalability as compared to centralized, time-
sharing mainframe computing. In a client/server system, there are
two different kinds of nodes: clients and servers. In the simplest
sense, clients request services from servers, and servers provide
services to the clients.
 In peer-to-peer architecture, each node provides user interaction
facilities as well as processing capabilities. A peer-to-peer
architecture provides both client and server functionalities on
each node.
 An MDBS is a software system that attempts to logically integrate
several different independent DDBMSs while allowing the local
DBMSs to maintain complete control of their operations. MDBSs
can be classified into two different categories: non-federated
MDBS and federated MDBS. Federated MDBSs are again
categorized as loosely coupled federated MDBSs and tightly
coupled federated MDBSs. Further, tightly coupled FMDBSs can
be classified as single federation tightly coupled FMDBSs and
multiple federations tightly coupled FMDBSs.