Odd Semester
Institutional Vision, Mission and Quality Policy
Our Vision
To foster and permeate higher and quality education with value-added engineering and technology programs, providing all facilities in terms of technology and platforms for all-round development with societal awareness, and to nurture the youth with international competencies and an exemplary level of employability even under a highly competitive environment, so that they are innovative, adaptable and capable of handling problems faced by our country and the world at large.
RAIT’s firm belief in a new form of engineering education that lays equal stress on academics and leadership-building extracurricular skills has been a major contributor to the success of RAIT as one of the most reputed institutions of higher learning. The challenges faced by our country and the world in the 21st century need a whole new range of thought and action leaders, which a conventional educational system in the engineering disciplines is ill-equipped to produce. Our reputation for providing good engineering education with additional life skills ensures that high-grade and highly motivated students join us. Our laboratories and practical sessions reflect the latest practices followed in industry. The project work and summer projects make our students adept at handling real-life problems and industry-ready. Our students are well placed in industry, and their performance makes reputed companies visit us with renewed demand and vigour.
Our Mission
The Institution is committed to mobilizing resources and equipping itself with men and materials of excellence, thereby ensuring that the Institution becomes a pivotal center of service to industry, academia and society with the latest technology. RAIT engages different platforms such as technology-enhancing Student Technical Societies, cultural platforms, sports excellence centers, an Entrepreneurial Development Center and a Societal Interaction Cell. It aims to develop the college into an autonomous institution and deemed university at the earliest, with facilities for advanced research and development programs on par with international standards, and to invite reputed international and national institutions and universities to collaborate with our institution on issues of common interest in teaching and learning.
RAIT’s Mission is to produce engineering and technology professionals who are innovative and
inspiring thought leaders, adept at solving problems faced by our nation and world by providing
quality education.
The Institute works closely with all stakeholders, such as industry and academia, to foster knowledge generation, acquisition and dissemination using the best available resources to address the great challenges being faced by our country and the world. RAIT is fully dedicated to providing its students with skills that make them leaders and solution providers, industry-ready when they graduate from the Institution.
We at RAIT assure our main stakeholders, the students, of 100% quality for the programmes we deliver. This quality assurance stems from the teaching and learning processes we have at work at our campus and from the teachers, who are handpicked from reputed institutions such as IIT, NIT and MU, and who inspire the students to be innovative in thinking and practical in approach. We have installed internal procedures to improve the skill sets of instructors by sending them to training courses, workshops, seminars and conferences. We also have a full-fledged course curriculum, with deliveries planned in advance for a structured semester-long programme. We have a well-developed feedback system, drawing on employers, alumni, students and parents, to fine-tune the learning and teaching processes. These tools help us ensure the same quality of teaching independent of any individual instructor. Each classroom is equipped with Internet access and other digital learning resources.
The effective learning process on the campus comprises a clean and stimulating classroom environment and the availability of lecture notes and digital resources prepared by the instructor, accessible from the comfort of home. In addition, students are provided with a good number of assignments to trigger their thinking process. The testing process involves an objective test paper that gauges the students' understanding of concepts. The quality assurance process also ensures that the learning process is effective. Summer internships and project-based training ensure that the learning process includes practical and industry-relevant aspects. Various technical events, seminars and conferences make the students' learning complete.
It is our earnest endeavour to produce high-quality engineering professionals who are innovative and inspiring thought and action leaders, competent to solve problems faced by society, the nation and the world at large, by striving towards very high standards in learning, teaching and training methodologies.
Our Motto: If it is not of quality, it is NOT RAIT!
President, RAES
Mission
The mission of the IT department is to prepare students for overall development, including employability, entrepreneurship and the ability to apply technology to real-life problems, by educating them in the fundamental concepts, technical and programming skills, depth of knowledge and development of understanding in the field of Information Technology.
To develop entrepreneurs, leaders and researchers with an exemplary level of employability, even in highly competitive environments, with high ethical, social and moral values.
Vision
To pervade higher and quality education with value-added engineering and technology programs, to deliver IT graduates with the knowledge, skills, tools and competencies necessary to understand and apply technical knowledge, and to become competent to practice engineering professionally and ethically in tomorrow's global environment.
To contribute to overall development by imparting moral, social and ethical values.
Index
with a query tool (e.g., EverSQL, SQL Server)
Query Evaluation and path expressions
b. Query Evaluation
LO3 Design applications using advanced data models like mobile and spatial databases.
LO4 Implement a distributed database and understand its query processing and
transaction processing mechanisms.
ITL503 OLAP LAB: Term Work 25, Practical and Oral 25, Total 50
Term Work:
1. Term work assessment must be based on the overall performance of the student with
every experiment graded from time to time. The grades should be converted into
marks as per the Credit and Grading System manual and should be added and
averaged.
2. The final certification and acceptance of term work ensures satisfactory performance
of laboratory work and minimum passing marks in term work.
OLAP Laboratory
Experiment No. : 1
Aim: Implementation of Query Optimizer
Students will be able to understand how query optimization is done. This will help them to
write efficient queries.
Theory:
At the core of the SQL Server Database Engine are two major components: the Storage
Engine and the Query Processor, also called the Relational Engine. The Storage Engine is
responsible for reading data between the disk and memory in a manner that optimizes
concurrency while maintaining data integrity. The Query Processor, as the name suggests,
accepts all queries submitted to SQL Server, devises a plan for their optimal execution, and
then executes the plan and delivers the required results.
Queries are submitted to SQL Server using the SQL language (or T-SQL, the Microsoft SQL
Server extension to SQL). Since SQL is a high-level declarative language, it only defines
what data to get from the database, not the steps required to retrieve that data, or any of the
algorithms for processing the request. Thus, for each query it receives, the first job of the
query processor is to devise a plan, as quickly as possible, which describes the best possible
way to execute said query (or, at the very least, an efficient way). Its second job is to execute
the query according to that plan.
Each of these tasks is delegated to a separate component within the query processor;
the Query Optimizer devises the plan and then passes it along to the Execution Engine,
which will actually execute the plan and get the results from the database.
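As an illustration of what the Query Optimizer does, the following Python sketch compares two candidate join strategies by estimated cost and picks the cheaper plan. The cost formulas and row counts are simplified assumptions for teaching purposes, not SQL Server's actual costing model.

```python
# Minimal sketch of cost-based plan selection (hypothetical statistics).
# A real optimizer enumerates far more alternatives and richer statistics.

def cost_nested_loop(outer_rows, inner_rows):
    # Every outer row probes the whole inner table.
    return outer_rows * inner_rows

def cost_hash_join(build_rows, probe_rows):
    # One pass to build the hash table, one pass to probe it.
    return build_rows + probe_rows

def choose_join_plan(r_rows, s_rows):
    plans = {
        "nested_loop": cost_nested_loop(r_rows, s_rows),
        "hash_join": cost_hash_join(r_rows, s_rows),
    }
    return min(plans, key=plans.get)

print(choose_join_plan(1, 100))          # -> nested_loop (tiny outer table)
print(choose_join_plan(10_000, 50_000))  # -> hash_join (large tables)
```

The point of the sketch is that the best plan depends on the statistics: the same query shape gets a different plan for different table sizes.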
B. Understanding Execution Plans
Now that you’ve found some statements that are slow, it’s time for the fun to begin.
1) EXPLAIN
The EXPLAIN command is by far the most valuable tool when it comes to tuning queries. It tells you
what is really going on. To use it, simply prepend your statement with EXPLAIN and run it.
PostgreSQL will show you the execution plan it used.
When using EXPLAIN for tuning, I recommend always using the ANALYZE option (EXPLAIN
ANALYZE) as it gives you more accurate results. The ANALYZE option actually executes the
statement (rather than just estimating it) and then explains it.
Example:
EXPLAIN ANALYZE SELECT * FROM Student WHERE snum = 1;
References:
4. Raghu Ramakrishnan and Johannes Gehrke, “Database Management Systems”, 3rd Edition, McGraw-Hill.
OLAP Laboratory
Experiment No. : 2
Query Evaluation and Path Expression:
b) Query Tree
Experiment No: 2
Aim: Query Evaluation and path expressions
Students will be able to understand how query optimization is done. This will help them
to write efficient queries.
Theory:
Translating an arbitrary SQL query into a logical query plan (i.e., a relational algebra
expression) is a complex task
Consider a general SELECT-FROM-WHERE statement of the form
SELECT Select-list
FROM R1 T1, R2 T2, ...
WHERE Where-condition
When the statement does not use subqueries in its where-condition, we can easily translate it
into relational algebra as follows:
π Select-list (σ Where-condition (R1 × R2 × ...))
Query Tree:
A query tree is a tree data structure representing a relational algebra expression. The tables
of the query are represented as leaf nodes, and the relational algebra operations as internal
nodes. During execution, an internal node is executed whenever its operand tables are
available; the node is then replaced by the result table. This process continues for all
internal nodes until the root node is executed and replaced by the result table.
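The bottom-up execution described above can be sketched in Python: leaves return base tables, and each internal operator node is, in effect, replaced by its result. The table contents and column names below are made up for illustration.

```python
# Sketch: a query tree evaluated bottom-up. Leaves are tables (lists of dicts);
# internal nodes are relational operators applied to their children's results.

def scan(rows):                 # leaf node: a base table
    return rows

def select(rows, pred):         # sigma: filter rows by a predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):        # pi: keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def join(left, right, on):      # join on one shared column
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]

student = [{"snum": 1, "sname": "Ann"}, {"snum": 2, "sname": "Ben"}]
enrolled = [{"snum": 1, "cname": "DB"}]

# Tree for: SELECT sname FROM Student JOIN Enrolled ON snum WHERE cname = 'DB'
result = project(
    select(join(scan(student), scan(enrolled), on="snum"),
           lambda r: r["cname"] == "DB"),
    ["sname"],
)
print(result)  # [{'sname': 'Ann'}]
```

Reading the nested calls from the inside out mirrors walking the query tree from the leaves up to the root.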
Consider the following relations (attributes as used in the queries below):
• Student(snum, sname)
• Enrolled(snum, cname)
• Class(name, room, fid)
• Faculty(fid, fname, deptid)
Translate the following SQL queries into expressions of the relational algebra.
SELECT S.sname
FROM Student S
WHERE S.snum NOT IN (SELECT E.snum FROM Enrolled E)
SELECT C.name
FROM Class C
WHERE C.room = ’R128’ OR C.name IN (SELECT E.cname FROM Enrolled E GROUP BY
E.cname HAVING COUNT(*) >= 5)
SELECT F.fname
FROM Faculty F
WHERE 5 > (SELECT COUNT(E.snum) FROM Class C, Enrolled E WHERE C.name =
E.cname AND C.fid = F.fid)
Conclusion and Discussion: Thus we have learnt to write SQL equivalent Relational
Algebra Expressions and Query Tree.
References:
OLAP Laboratory
Experiment No. : 3
Implementation of concurrency control
Two Phase Locking Protocol
Experiment No: 3
Aim: Implementation of concurrency control: Two-Phase Locking Protocol.
Requirements: Java/Python.
Students will be able to understand how the two-phase locking protocol is implemented using
Java or Python. This will help them understand the implementation of a concurrency
control protocol.
Theory:
Concurrency control is a database management system (DBMS) concept that is used to
address conflicts arising from the simultaneous accessing or altering of data that can occur in a
multi-user system. Concurrency control, when applied to a DBMS, is meant to coordinate
simultaneous transactions while preserving data integrity; it is about controlling
multi-user access to the database.
When more than one transaction is running simultaneously, there are chances of a
conflict occurring, which can leave the database in an inconsistent state. To handle these
conflicts we need concurrency control in the DBMS, which allows transactions to run
simultaneously but handles them in such a way that the integrity of the data remains
intact.
Two-Phase Locking (2PL) is a concurrency control method which divides the execution
of a transaction into two phases: a growing phase, in which locks are acquired and none
are released, and a shrinking phase, in which locks are released and none are acquired.
It ensures conflict-serializable schedules.
If all of a transaction's lock operations precede its first unlock operation, the transaction
is said to follow the Two-Phase Locking Protocol.
3. Conservative Two-Phase Locking Protocol
Conservative Two-Phase Locking Protocol is also called Static Two-Phase
Locking Protocol.
This protocol is almost free from deadlocks, as all required items are listed in
advance.
It requires locking all data items to be accessed before the transaction starts.
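A minimal sketch of the protocol in Python (one of the languages named in the requirements): the transaction enters its shrinking phase at the first unlock, after which any further lock request is rejected. The class, transaction and item names are illustrative.

```python
# Sketch of two-phase locking enforcement for a single transaction:
# all locks must be acquired before the first unlock (growing phase);
# after the first unlock (shrinking phase) no new lock may be taken.

class TwoPhaseLockError(Exception):
    pass

class Transaction:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False   # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise TwoPhaseLockError(
                f"{self.name}: lock({item}) after unlock violates 2PL")
        self.locks.add(item)

    def unlock(self, item):
        self.shrinking = True    # shrinking phase begins
        self.locks.discard(item)

t = Transaction("T1")
t.lock("A")
t.lock("B")      # growing phase: allowed
t.unlock("A")    # shrinking phase begins
try:
    t.lock("C")  # violates 2PL
except TwoPhaseLockError as e:
    print(e)
```

A full implementation would also need a lock manager shared between transactions (for shared/exclusive modes and blocking); the sketch only enforces the two-phase rule itself.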
Conclusion:
Thus, we have learnt two-phase locking in concurrency control and its implementation.
References:
OLAP Laboratory
Experiment No. : 4
Experiment No: 4
Aim: Implementation of Timestamp-Based Protocol.
Requirements: Java/Python.
Students will be able to understand the implementation of the timestamp-based protocol using
Java or Python.
Theory:
Whenever a transaction begins, it receives a timestamp. This timestamp indicates the order in
which the transaction must occur, relative to the other transactions. So, given two transactions
that affect the same object, the operation of the transaction with the earlier timestamp must
execute before the operation of the transaction with the later timestamp. However, if the
operation of the wrong transaction is actually presented first, then it is aborted and the
transaction must be restarted.
Every object in the database has a read timestamp, which is updated whenever the object's
data is read, and a write timestamp, which is updated whenever the object's data is changed.
If a transaction wants to read an object:
• If the transaction started before the object's write timestamp, it means that
something changed the object's data after the transaction started. In this case, the
transaction is cancelled and must be restarted.
• If the transaction started after the object's write timestamp, it means that it
is safe to read the object. In this case, if the transaction timestamp is after the
object's read timestamp, the read timestamp is set to the transaction timestamp.
With each transaction Ti in the system, we associate a unique fixed timestamp, denoted
by TS(Ti). This timestamp is assigned by the database system before the transaction Ti
starts execution. If a transaction Ti has been assigned timestamp TS(Ti), and a new
transaction Tj enters the system, then TS(Ti) < TS(Tj). There are two simple
methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a transaction's
timestamp is equal to the value of the clock when the transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp has been assigned;
that is, a transaction's timestamp is equal to the value of the counter when the
transaction enters the system.
To implement this scheme, we associate with each data item Q two timestamp values:
W-timestamp(Q), the largest timestamp of any transaction that executed write(Q)
successfully, and R-timestamp(Q), the largest timestamp of any transaction that
executed read(Q) successfully.
Algorithm:
1. Read the two transactions.
2. Assign a timestamp value to each transaction (logical counter).
3. Associate a read timestamp and a write timestamp with each data item.
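The steps above can be sketched in Python, using a logical counter for TS(Ti) and the read/write timestamp checks on a data item Q. The rollback conditions follow the basic timestamp-ordering rules described earlier; all names are illustrative.

```python
# Sketch of basic timestamp ordering: a logical counter assigns timestamps,
# and each data item carries R-timestamp / W-timestamp values.

counter = 0
def new_timestamp():
    global counter
    counter += 1
    return counter

class DataItem:
    def __init__(self):
        self.r_ts = 0   # largest TS of a successful read
        self.w_ts = 0   # largest TS of a successful write

def read(item, ts):
    if ts < item.w_ts:              # item was written by a younger transaction
        return "rollback"
    item.r_ts = max(item.r_ts, ts)
    return "ok"

def write(item, ts):
    if ts < item.r_ts or ts < item.w_ts:
        return "rollback"           # a younger transaction already read/wrote it
    item.w_ts = ts
    return "ok"

q = DataItem()
t1, t2 = new_timestamp(), new_timestamp()   # TS(T1)=1 < TS(T2)=2
print(write(q, t2))   # ok: W-timestamp(Q) becomes 2
print(read(q, t1))    # rollback: T1 is older than the last writer of Q
```

A rolled-back transaction would be restarted with a fresh (larger) timestamp, which the sketch leaves to the caller.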
Conclusion:
Thus, we have learnt the timestamp-based protocol.
Quiz/Discussion:
1. What is the timestamp protocol?
2. What are the operations of the timestamp protocol?
References:
OLAP Laboratory
Experiment No. : 5
Experiment No: 5
Aim: Implementation of Log based Recovery mechanism.
Theory:
Let's assume there is a transaction to modify the City of a student. The following logs are
written for this transaction.
The log is a sequence of log records, recording all the update activities in the database. Logs
for each transaction are maintained in stable storage. Any operation which is performed on
the database is recorded in the log. Prior to performing any modification to the database, an
update log record is created to reflect that modification.
An update log record, represented as <Ti, Xj, V1, V2>, has these fields:
1. Transaction identifier: unique identifier of the transaction that performed the write
operation.
2. Data item: unique identifier of the data item written.
3. Old value: value of the data item prior to the write.
4. New value: value of the data item after the write operation.
When the system crashes, it consults the log to find which transactions need to be
undone and which need to be redone:
1. If the log contains both the records <Ti, start> and <Ti, commit>, then
the transaction Ti needs to be redone.
2. If the log contains the record <Ti, start> but contains neither <Ti, commit>
nor <Ti, abort>, then the transaction Ti needs to be undone.
Implementation steps:
2. The program should take a log file as input, with a menu for normal run and recovery.
3. Scan the log file and perform: undo for uncommitted transactions and redo for committed
transactions.
<T0 start>
<T0, A, 950>
<T0, B, 2050>
<T0 commit>
<T1 start>
<T1, C, 600>
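A minimal Python sketch of the recovery scan in step 3, applied to the sample log above: transactions with both a start and a commit record are redone, and transactions with a start but no commit are undone. The tuple encoding of the log records is an assumption made for illustration.

```python
# Sketch of the undo/redo decision during log-based recovery.
# The log mirrors the sample records above: T0 committed, T1 did not.

log = [
    ("T0", "start"), ("T0", "A", 950), ("T0", "B", 2050), ("T0", "commit"),
    ("T1", "start"), ("T1", "C", 600),
]

def recover(log):
    started, committed = set(), set()
    for rec in log:
        if rec[1] == "start":
            started.add(rec[0])
        elif rec[1] == "commit":
            committed.add(rec[0])
    redo = committed               # both <start> and <commit> present
    undo = started - committed     # <start> but no <commit>
    return redo, undo

redo, undo = recover(log)
print("redo:", redo)   # {'T0'}
print("undo:", undo)   # {'T1'}
```

A full implementation would then replay the new values for the redo set and restore the old values for the undo set, scanning the log forwards and backwards respectively.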
Conclusion:
Thus, we have learnt different kinds of failures and the log based recovery mechanism.
Quiz/Discussion:
References:
OLAP Laboratory
Experiment No. : 6
Experiment No 6
Aim: Case study: distributed database for a real-life application and simulation of recovery
methods.
Software Required: Desktop PC with 4 GB RAM, Oracle 9i, MS SQL Server 2000, client/server
architecture, MySQL.
Theory:
In the case of immediate update mode, the recovery manager takes the following actions:
• Transactions which are in the active list and failed list are undone and written onto the
abort list.
In the case of deferred update mode, the recovery manager takes the following actions:
• Transactions which are in the active list and failed list are written onto the abort list.
No undo operations are required, since the changes have not been written to the disk yet.
• The transactions in the commit list and before-commit list are redone and written
onto the commit list in the transaction log.
E. Checkpointing
Checkpoint is a point of time at which a record is written onto the database from the buffers.
As a consequence, in case of a system crash, the recovery manager does not have to redo the
transactions that have been committed before checkpoint. Periodical checkpointing shortens
the recovery process.
There are two types of checkpointing:
• Consistent checkpointing
• Fuzzy checkpointing
1) Consistent Checkpointing
Consistent checkpointing creates a consistent image of the database at the checkpoint. During
recovery, only those transactions which are on the right side of the last checkpoint are
undone or redone. The transactions to the left side of the last consistent checkpoint are
already committed and need not be processed again. The actions taken for checkpointing are:
A “checkpoint” record is written in the transaction log.
If, in step 4, the transaction log is archived as well, then this checkpointing aids in recovery
from both disk failures and power failures; otherwise it aids recovery only from power failures.
2) Fuzzy Checkpointing
In fuzzy checkpointing, at the time of checkpoint, all the active transactions are written in
the log. In case of power failure, the recovery manager processes only those transactions that
were active during checkpoint and later. The transactions that have been committed before
checkpoint are written to the disk and hence need not be redone.
3) Example of Checkpointing
Let us consider that in a system the time of checkpointing is tcheck and the time of system
crash is tfail. Let there be four transactions Ta, Tb, Tc and Td such that:
• Td starts after the checkpoint and was active at the time of the system crash.
F. Distributed One-phase Commit
• After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site.
• The slaves wait for a “Commit” or “Abort” message from the controlling site. This
waiting time is called the window of vulnerability.
• When the controlling site receives a “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. It then sends this
message to all the slaves.
• On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
G. Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of the one-phase commit protocol. The
steps performed in the two phases are as follows:
Phase 1 (Prepare phase):
• After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site. When the controlling site has received a “DONE” message from
all slaves, it sends a “Prepare” message to the slaves.
• The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.
• A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a timeout.
Phase 2 (Commit/Abort phase):
• After the controlling site has received a “Ready” message from all the slaves:
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives a “Commit ACK” message from all the
slaves, it considers the transaction as committed.
• After the controlling site has received the first “Not Ready” message from any slave:
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives an “Abort ACK” message from all the slaves,
it considers the transaction as aborted.
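The decision logic of the controlling site can be sketched in Python: it commits only when every slave votes "Ready" and aborts as soon as any slave votes "Not Ready". Message passing is simulated with plain function calls and dictionaries; the site names and vote strings are illustrative.

```python
# Sketch of the controlling site's decision in distributed two-phase commit.
# votes maps each slave site to its "Ready" / "Not Ready" vote.

def two_phase_commit(votes):
    # Phase 1: the controlling site has sent "Prepare" and collected votes.
    if all(v == "Ready" for v in votes.values()):
        decision = "Commit"          # the commit point
    else:
        decision = "Abort"           # at least one "Not Ready" vote
    # Phase 2: broadcast the decision; each slave applies it and sends an ACK.
    acks = {site: f"{decision} ACK" for site in votes}
    return decision, acks

decision, acks = two_phase_commit({"s1": "Ready", "s2": "Ready"})
print(decision)   # Commit
decision, acks = two_phase_commit({"s1": "Ready", "s2": "Not Ready"})
print(decision)   # Abort
```

The window of vulnerability in a real system lies between the votes and the decision broadcast, which this single-process simulation cannot exhibit.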
H. Distributed Three-phase Commit
• The controlling site issues an “Enter Prepared State” broadcast message.
• The slave sites vote “OK” in response.
The remaining steps are the same as in two-phase commit, except that the “Commit ACK”/“Abort ACK”
messages are not required.
Procedure/ Program:
Steps of Distributed Database Design
• Top-down approach: first the general concepts and the global framework are defined,
and then the details.
• Bottom-up approach: first the detailed modules are defined, and then the global
framework. If the system is built from scratch, the top-down method is more often
used. If the system must match existing systems, or some modules are already
available, the bottom-up method is usually used.
General design steps according to the structure:
Conclusion:
Real-life scenarios for a distributed database (DDB) have been studied and their requirements
documented. A simulation tool is used to demonstrate the recovery mechanism in a distributed
environment.
References:
OLAP Laboratory
Experiment No. : 7
Experiment No: 7
Aim: Advanced database models: case study for temporal, mobile or spatial databases.
A Case Study on Spatio-Temporal Data Mining of Urban Social Management Events Based
on Ontology Semantic Analysis (students may refer to a different case study).
Theory:
The massive urban social management data with geographical coordinates from the
inspectors, volunteers and citizens of the city are a new source of spatio-temporal data,
which can be used for data mining of city management and of the evolution of hot events, to
improve comprehensive urban governance. First, an ontology model for urban social
management events (USMEs) is presented, to accurately extract effective social management
events from non-structured USMEs. Second, an exploratory spatial data analysis method based
on “event-event” and “event-place” relations, from the spatial and temporal aspects, is
presented to mine information from USMEs for comprehensive urban social governance. The
data mining results are visualized as a thermal chart and a scatter diagram for the
optimization of the management resource configuration, which can improve the efficiency of
municipal service management and of municipal departments' decision-making.
The concept system of social comprehensive governance is huge and complex, and there are
various kinds of events. The extraction of interesting hot events and the mining of
spatio-temporal information, which is only one of the many entry points in this field, is
important. There is a broad research space in the information mining of social comprehensive
management events based on space-time management, whether in content or in method. The
smart city platform adds a geographic coordinate tag to the variety of events and log data
generated by the city management process, but these data records come from inspectors,
volunteers in city management, and even citizens, and the events are described in unstructured
natural language. This case study proposes a spatio-temporal data mining approach based on
urban social management events to extract unstructured natural language information, to
find the spatio-temporal distribution pattern of events, and to provide visualized decision
support for the social management and comprehensive control of the city. The technical
framework of the proposed approach is shown in Figure 1.
The purpose of urban management and comprehensive administration is to maintain a good
environment for social development. During the process of urban management, a large number
of work records are produced. Thus, making good use of these work records to excavate the
useful information hidden in this historical data is very important for the decision-making
of further urban social governance. The content of city management is huge, with a
complicated structure for urban governance. This study puts forward a concept system of
urban social management events. An ontology model is proposed for the massive spatio-temporal
data mining of social management and comprehensive control events. It designs the process of
construction of the ontology, builds the ontology using existing tools, and realizes the
extraction of the hot events in city management based on the semantic reasoning of the
ontology with Java-based frameworks, whose comprehensiveness and accuracy are higher than
those of the old ones. The case study also introduces spatio-temporal information mining for
discrete USMEs from three perspectives: geographical statistics, spatial aggregation and
correlation relationships. A spatial-temporal correlation data mining between events and
locations, or between events and events, is proposed to mine the spatial-temporal information
from the discrete and massive comprehensive city-management events.
Conclusion:
Thus the case study for spatial and temporal data has been performed.
References:
OLAP Laboratory
Experiment No. : 8
Experiment No. 8
Aim: Construction of star schema and snowflake schema for a company database.
Students will be able to design a star schema, understand the different keys
and joins used in a star schema, and convert the star schema to a snowflake schema.
Software Required: Desktop PC with 4 GB RAM, Oracle Enterprise Edition, MS SQL Server
2000, client, MySQL, Weka learning tool, SQL*Loader.
Theory: A star schema is the arrangement of a fact table at the core with the dimension tables
surrounding it. Each dimension table has a direct relationship with the fact table in the middle.
When a query is made against the data warehouse, the results of the query are produced by
combining or joining one row of each dimension table with one or more rows of the fact table.
Dimension Table:
• A dimension table represents a business dimension using which the metrics are
analyzed.
• Dimension tables often provide multiple hierarchies; hierarchies are used for
drill-down and roll-up.
• A dimension table should have its own surrogate key as primary key, without
any built-in meaning, along with the operational system key.
Fact Table:
Algorithm:
Step 1: Identify the business dimensions.
Step 2: Identify the measures or facts.
Snowflake Schema:
The original star schema for sales contains only five tables, whereas the normalized version is
now extended to eleven tables. These new tables are linked back to the original dimension
tables through artificial keys.
Forming a Sub-dimension:
The principle behind the snowflake schema is normalization of the dimension tables by removing
low-cardinality attributes and forming separate tables. In a similar manner, some situations
provide opportunities to separate out a set of attributes and form a sub-dimension. This
process is very close to the snowflaking technique. The given figure shows how a demographic
sub-dimension is formed out of the customer dimension.
Although forming sub-dimensions may be considered as snowflaking, it makes a lot of sense
to separate out the demographic attributes, which differ in granularity; if the customer
dimension is very large, running into millions of rows, the saving in storage space could be
substantial. Another valid reason for separating out demographic attributes relates to the
browsing of attributes.
Algorithm:
2. Identify the attributes of any dimension table in the star schema with low cardinality.
3. Remove these attributes from the original dimension table and create a new table.
4. Link the new table with the original dimension table through an artificial key.
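A minimal star schema can be sketched with SQLite from Python: a sales fact table holding surrogate foreign keys into two dimension tables, queried with a star join. The table and column names are illustrative, not prescribed by the manual.

```python
# Sketch of a star schema: fact_sales at the core, two dimension tables around
# it, each linked through a surrogate key with no built-in meaning.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales  (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Desk', 'Furniture');
INSERT INTO dim_store   VALUES (1, 'Mumbai');
INSERT INTO fact_sales  VALUES (1, 1, 10, 50.0), (2, 1, 2, 900.0);
""")

# Star join: dimension rows joined with the fact rows, aggregated by category.
rows = con.execute("""
SELECT p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category
ORDER BY p.category
""").fetchall()
print(rows)  # [('Furniture', 900.0), ('Stationery', 50.0)]
```

Snowflaking this schema would mean moving a low-cardinality attribute such as `category` out of `dim_product` into its own `dim_category` table linked by another artificial key.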
Procedure/ Program:
Conclusion: Thus we have learned the implementation of a star schema using Informatica and MS
SQL Server 2000 tools.
References:
Experiment No. : 9
Experiment No. 9
Aim: OLAP Exercise: a) Construction of cubes b) OLAP operations and OLAP queries (slice,
dice, roll-up, drill-down and pivot).
Data Extraction: For a data warehouse, data is extracted from many disparate sources; you
have to extract data for the one-time initial full load and thereafter on every change.
Data Transformation: Having information that is usable for strategic decision making is the
underlying principle of the DW. Extracted data is raw data and cannot be applied to the DW
directly. Thus all the extracted data must be made usable in the DW. Transformation of source
data involves a wide variety of manipulations to change all the extracted source data into
usable information to be stored in the DW.
Data Loading: It is agreed that transformation functions end as soon as load images are
created. The next major set of functions consists of the ones that take the prepared data,
apply it to the data warehouse, and store it in the database.
Software Required: Desktop PC with 4 GB RAM, Oracle Enterprise Edition, MS SQL Server 2000,
client, MySQL.
Data Extraction Techniques:
• Static data: Static data is the capture of data at a given point in time. Static data
capture is primarily used for the initial load.
• Data revision: Data revision is also known as incremental data capture. Incremental
data capture may be immediate or deferred. Within the group of immediate data capture
there are three distinct options; two separate options are available for deferred data
capture.
Immediate Data Extraction: Here data extraction is real time; it occurs as the transactions
happen at the source databases and files. The options for immediate data extraction are:
• Capture through transaction logs: This option uses the transaction logs of the
DBMS, maintained for recovery from possible failures.
• Capture through database triggers: Create trigger programs for all the events from
which data is to be captured.
• Capture in source application: The source application is made to assist in data
capture for the DW.
Deferred Data Extraction: Techniques under deferred data extraction do not capture changes
in real time. The options are:
• Capture based on date and time stamp: The extraction procedure extracts
all the records having a timestamp greater than the timestamp of the last extraction.
• Capture based on comparing the files: This compares two separate snapshots of the
source data; using this it can find all updates, deletes and inserts.
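Deferred capture by comparing files can be sketched in Python: two snapshots of a keyed source table are compared to find the inserts, updates and deletes since the last extraction. The snapshot format (a dict keyed on the record key) is an assumption made for illustration.

```python
# Sketch of deferred data capture by comparing two snapshots of a source table:
# keys only in the new snapshot are inserts, keys only in the old one are
# deletes, and keys with changed values are updates.

def compare_snapshots(old, new):
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: v for k, v in new.items() if k in old and old[k] != v}
    return inserts, updates, deletes

old = {1: "Ann", 2: "Ben", 3: "Cal"}
new = {1: "Ann", 2: "Bennett", 4: "Dee"}
ins, upd, dele = compare_snapshots(old, new)
print(ins)   # {4: 'Dee'}
print(upd)   # {2: 'Bennett'}
print(dele)  # {3: 'Cal'}
```

Only the detected changes then need to be applied to the warehouse, which is what makes this cheaper than a full reload.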
Algorithm:
1. Selection: This task takes place at the beginning. In this we select either whole
records or parts of several records from the source system. The task of selection
usually forms part of the extraction function itself.
2. Splitting/joining: This task includes the types of data manipulation that need to be
performed on the selected parts of the source records.
3. Conversion: This is an all-inclusive task. It includes a large variety of rudimentary
conversions of single fields, for two primary reasons: one, to standardize among the
data extracted from disparate source systems, and the other, to make the fields usable
and understandable to the users.
Summarization: Sometimes it’s not feasible to keep data at the lowest level of details
in DW so the data transformation function includes summarization of daily
transactions.
Enrichment: This task is the rearrangement and simplification of individual fields to
make them more useful for the DW. We may use one or more fields from the same input
record to create a better view of the data for the DW.
Format revisions: These revisions include changes to data types and lengths of
individual fields.
Decoding of fields: This is also a common type of data transformation when we
deal with multiple source systems. We need to decode all such cryptic codes and
change them into values that make sense to the users.
Calculated and derived values: For example, the data extracted from the sales system
contains sale amount, units sold and operating cost; from these we may calculate derived
values such as profit and store them in the DW.
Splitting of single fields: Composite fields, such as a full name or address stored as a
single field, may be split into their individual components. We may then improve operating
performance by indexing on the individual components.
Merging of information: This includes merging several fields to create a single field of
data.
Character set conversion: This type of data transformation relates to the conversion of
character sets to an agreed standard character set for textual data in the warehouse.
Conversion of units of measurement: Here we have to convert metrics so that the
numbers may all be in one standard unit of measurement.
Date/Time conversion: This relates to the representation of date and time in standard
formats.
Summarization: This includes the creation of summaries to be loaded into the data
warehouse instead of loading the most granular level of data.
Key restructuring: While choosing keys for the data warehouse, avoid keys with built-in
meanings. Transform such keys into generic keys generated by the system itself. This is
called key restructuring.
Deduplication: In this, we keep a single record and link all its duplicates in the source
systems to this single record.
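A few of the transformation tasks above can be illustrated in one small function. The code table, conversion factor and field names are assumptions for the sketch; in practice these mappings come from the warehouse metadata:

```python
# Hypothetical code table and conversion factor; real mappings come from metadata.
GENDER_CODES = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}
LBS_PER_KG = 2.20462

def transform(record, surrogate_key):
    """Apply several transformation tasks to one extracted record."""
    name_parts = record["full_name"].split(" ", 1)  # splitting of a single field
    return {
        "customer_key": surrogate_key,              # key restructuring: generic system key
        "first_name": name_parts[0],
        "last_name": name_parts[1] if len(name_parts) > 1 else "",
        "gender": GENDER_CODES.get(record["gender"], "Unknown"),   # decoding of fields
        "weight_kg": round(record["weight_lbs"] / LBS_PER_KG, 1),  # unit conversion
    }

row = transform({"full_name": "Ada Lovelace", "gender": "2", "weight_lbs": 121.3}, 1001)
print(row["gender"], row["weight_kg"])  # Female 55.0
```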
Algorithm:
LOAD:
If the target table to be loaded already exists and data exists in the table, the load
process wipes out the existing data and applies the data from the incoming file.
APPEND:
If data already exists in the table, the append process unconditionally adds the incoming
data, preserving the existing data in the target table. An incoming record may be
allowed to be added as a duplicate.
DESTRUCTIVE MERGE:
In this mode, you apply the incoming data to the target data. If the primary key of an
incoming record matches the key of an existing record, the matching target record is
updated.
CONSTRUCTIVE MERGE:
This mode is slightly different from destructive merge. If the primary key of an incoming
record matches the key of an existing record, the incoming record is added and the
added record is marked as superseding the old record.
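The four apply modes above can be sketched over plain Python lists of records; `key` stands in for the primary key and the mode names mirror the text:

```python
def apply_rows(target, incoming, mode):
    """Apply incoming records to target (lists of dicts with a 'key' field)."""
    if mode == "LOAD":
        target[:] = [dict(r) for r in incoming]        # wipe existing data first
    elif mode == "APPEND":
        target.extend(dict(r) for r in incoming)       # keep existing; duplicates allowed
    elif mode == "DESTRUCTIVE_MERGE":
        for row in incoming:
            match = next((t for t in target if t["key"] == row["key"]), None)
            if match:
                match.update(row)                      # overwrite the matching record
            else:
                target.append(dict(row))
    elif mode == "CONSTRUCTIVE_MERGE":
        for row in incoming:
            for t in target:
                if t["key"] == row["key"]:
                    t["superseded"] = True             # mark old record as superseded
            target.append(dict(row, superseded=False))
    return target

warehouse = [{"key": 1, "val": "old", "superseded": False}]
apply_rows(warehouse, [{"key": 1, "val": "new"}], "DESTRUCTIVE_MERGE")
print(warehouse)  # [{'key': 1, 'val': 'new', 'superseded': False}]
```

Constructive merge keeps the full history of a record, which is why it suits slowly changing dimensions, while destructive merge keeps only the latest state.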
The modes of applying data to the data warehouse lead to three types of loads.
Initial Load: Here, every load run creates the DB tables from scratch. If you need more
than one run to create a single table, your load runs for that table must be
scheduled over several days.
Incremental Loads: These are the applications of ongoing changes from the source
systems. Changes to the source systems are always tied to specific times, irrespective
of whether they are based on explicit time stamps in the source systems.
Full Refresh: This type of data application involves rewriting the entire DW.
Sometimes you may also do partial refreshes to rewrite only specific tables.
As far as the data application modes are concerned, a full refresh is similar to the initial
load. However, in case of a full refresh, data exists in the target tables before the
incoming data is applied.
Algorithm:
Map the tables from the data staging area to the respective dimension tables and fact
table of the data warehouse (star schema).
Load the data from the tables of the data staging area into the dimension and fact tables
of the data warehouse using an initial load.
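The two steps above can be sketched with SQLite: the staging table is mapped onto one dimension table and one fact table of a minimal star schema, and an initial load populates both. All table and column names are illustrative:

```python
import sqlite3

# Minimal star schema: one dimension table and one fact table (names illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging (prod_name TEXT, sale_date TEXT, amount REAL);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, prod_name TEXT UNIQUE);
CREATE TABLE fact_sales (product_key INTEGER, sale_date TEXT, amount REAL);
""")
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                 [("Pen", "2024-01-01", 5.0), ("Pen", "2024-01-02", 7.5),
                  ("Book", "2024-01-01", 20.0)])

# Initial load: populate the dimension, then map staging rows to surrogate keys.
conn.execute("INSERT INTO dim_product (prod_name) SELECT DISTINCT prod_name FROM staging")
conn.execute("""
INSERT INTO fact_sales
SELECT d.product_key, s.sale_date, s.amount
FROM staging s JOIN dim_product d ON s.prod_name = d.prod_name
""")
n = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(n)  # 3
```

The join on `prod_name` is what "maps" each staging row to its dimension record, so the fact table stores only the generic surrogate key.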
OLAP operations:
Pivot: This operation, also called rotate, rotates the data in order to provide an
alternative presentation.
Slice: This operation performs a selection on one dimension of the given cube resulting in a
sub cube.
Dice: The Dice operation defines a sub cube by performing a selection on two or more
dimensions.
Roll-up: This operation involves computing all the data relationships for one or more
dimensions. To do this, a computational relationship or formula might be defined.
Drill down/up: This is a specific analytical technique whereby the user navigates among the
levels of data, ranging from the most summarized (up) to the most detailed (down).
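Assuming pandas is available, the slice, dice, roll-up and pivot operations can be demonstrated on a toy cube; the dimension names and figures below are invented for illustration:

```python
import pandas as pd

# Toy cube: dimensions region and quarter, measure sales (illustrative data).
cube = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 150, 200, 250],
})

slice_east = cube[cube["region"] == "East"]                          # Slice: one dimension
dice = cube[(cube["region"] == "West") & (cube["quarter"] == "Q1")]  # Dice: two dimensions
rollup = cube.groupby("region")["sales"].sum()                       # Roll-up over quarters
pivot = cube.pivot_table(index="quarter", columns="region",
                         values="sales")                             # Pivot/rotate

print(rollup["East"], rollup["West"])  # 250 450
```

Drill-down is simply the reverse of the roll-up: starting from `rollup`, returning to the quarter-level rows of `cube` recovers the detail.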
Algorithm
Procedure/ Program:
Loading the transformed data into a dimensional database
Building pre-calculated summary values to speed up report generation
Building (or purchasing) a front-end reporting tool
Conclusion: Thus we have created a sample data warehouse using an ETL tool and observed
different OLAP operations.
Experiment No. : 10