Starting with version 9i, and continuing with the latest 10g release, Oracle has gradually introduced features into the database to support real-time and near-real-time data warehousing. These features include:
- Change Data Capture
- External tables, table functions, pipelining, and the MERGE command (sketched below)
- Fast refresh materialized views
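For illustration, a minimal sketch of MERGE-based near-real-time loading; the stg_sales staging table and dw_sales target are assumed names, not from the text:

  -- Upsert freshly captured rows into the warehouse table in one pass:
  -- update rows that already exist, insert the ones that do not.
  MERGE INTO dw_sales d
  USING stg_sales s
  ON (d.sale_id = s.sale_id)
  WHEN MATCHED THEN
    UPDATE SET d.amount = s.amount, d.updated_at = SYSDATE
  WHEN NOT MATCHED THEN
    INSERT (sale_id, amount, updated_at)
    VALUES (s.sale_id, s.amount, SYSDATE);

With Change Data Capture feeding such a staging table, the warehouse can stay close to the source in near real time.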
Real-time data warehousing means combining heterogeneous databases and making the data available, with minimal latency, for query and analysis purposes and for decision-making and reporting purposes.
What is ODS
ODS stands for Operational Data Store, not Online Data Storage.
It is used to maintain and store the current, up-to-date information and the transactions from the source databases taken from the OLTP systems.
It is connected directly to the source database systems rather than to the staging area.
It is further connected to the data warehouse and, moreover, can be treated as part of the data warehouse database.
It is the final integration point in the ETL process before the data is loaded into the data warehouse.
It contains near-real-time data. In a typical data warehouse architecture, the ODS is sometimes used for analytical reporting as well as a source for the data warehouse.
An Operational Data Store is a hybrid structure that has some aspects of a data warehouse and other aspects of an operational system.
It contains integrated data.
It can support DSS processing.
It can also support high-volume transaction processing.
It is placed between the warehouse and the web to support web users.
Operational data stores can be updated, provide rapid and constant response times, and contain only a limited amount of historical data.
An Operational Data Store presents a consistent picture of the current data stored and managed by transaction processing systems. As data is modified in the source system, a copy of the changed data is moved into the ODS. Existing data in the ODS is updated to reflect the current status of the source system.
What is data mining
In its simplest definition, data mining is a way to discover new meaning in data.
Data mining is the concept of deriving or discovering hidden, unexpected information from existing data.
Data mining is a non-trivial process of identifying valid, potentially useful, and ultimately understandable patterns in data.
A data warehouse typically supplies answers to questions like "Who is buying our products?". A data mining approach would seek answers to questions like "Who is NOT buying our products?".
What is an ER Diagram
ER stands for Entity-Relationship diagram. It is the first step in the design of a data model which will later lead to the physical database design of possibly an OLTP or OLAP database.
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views.
Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
Since Chen wrote his paper the model has been extended and today it is
commonly used for database design.
For the database designer, the utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables. It is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software, and it can be used as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.
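As a minimal sketch of how ER constructs transform into relational tables (the Department and Employee entities and the "works in" relationship are invented for the example):

  -- Each entity becomes a table; the one-to-many relationship becomes
  -- a foreign key on the "many" side.
  CREATE TABLE department (
    dept_id   NUMBER PRIMARY KEY,      -- entity: Department
    dept_name VARCHAR2(100)
  );

  CREATE TABLE employee (
    emp_id   NUMBER PRIMARY KEY,       -- entity: Employee
    emp_name VARCHAR2(100),
    dept_id  NUMBER REFERENCES department(dept_id)  -- relationship: works in
  );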
How do you load the time dimension
Functions such as MONTH and YEAR can be used to load the higher levels of the time dimension, and for the lowest level, i.e. the day, there is likewise a function to implement the loading of the time dimension.
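A minimal sketch of such a load at the day grain, assuming a hypothetical time_dim table (Oracle syntax):

  -- Generate one row per day for a year and derive the higher-level
  -- attributes with TO_CHAR and EXTRACT.
  INSERT INTO time_dim (date_key, day_name, month_num, year_num)
  SELECT d,
         TO_CHAR(d, 'Day'),
         EXTRACT(MONTH FROM d),
         EXTRACT(YEAR FROM d)
  FROM (SELECT DATE '2005-01-01' + LEVEL - 1 AS d
        FROM dual
        CONNECT BY LEVEL <= 365);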
What is the difference between OLTP and OLAP
OLAP:
- Current and historical data
- Long database transactions
- Batch update/insert/delete
- Denormalization is promoted
- Low-volume transactions
- Transaction recovery is not necessary
OLTP is nothing but OnLine Transaction Processing; it involves normalized tables and online data with frequent inserts, updates, and deletes.
OLAP (OnLine Analytical Processing), by contrast, contains the history of the OLTP data; it is non-volatile, acts as a decision support system, and is used for creating forecasting reports.
Indexes
OLTP: few
OLAP: many
Joins
OLTP: many
OLAP: few
What is ETL
ETL is short for Extract, Transform and Load. It is a data integration function that involves extracting data from outside sources, transforming it to fit business needs, and ultimately loading it into a data warehouse.
ETL is an abbreviation for "Extract, Transform and Load". This is the process of extracting data from operational data sources or external data sources, transforming the data (which includes cleansing, aggregation, summarization, and integration, as well as basic transformation), and loading the data into some form of data warehouse.
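A compact illustration of the three steps collapsed into one statement; the src_orders and wh_customer_sales names are assumptions:

  -- Extract from the source table, transform (cleanse and aggregate),
  -- and load into the warehouse table.
  INSERT INTO wh_customer_sales (customer_id, total_amount, load_date)
  SELECT o.customer_id,
         SUM(o.amount),              -- aggregation/summarization
         SYSDATE                     -- audit column added during the load
  FROM src_orders o
  WHERE o.amount IS NOT NULL         -- simple cleansing rule
  GROUP BY o.customer_id;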
What is a lookup table
When a table is used to check for the presence of some data prior to loading that data, or the same data, into another table, the table is called a lookup table.
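For instance (table names hypothetical), a load might consult a lookup table of valid products before inserting rows:

  -- Load only the staged rows whose product code is present in the
  -- lookup table.
  INSERT INTO sales_fact (product_code, amount)
  SELECT s.product_code, s.amount
  FROM staging_sales s
  WHERE EXISTS (SELECT 1
                FROM product_lookup p
                WHERE p.product_code = s.product_code);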
What is a general purpose scheduling tool
The general purpose of a scheduling tool may be cleansing and loading data at a specific given time.
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.
What is Normalization, First Normal Form, Second Normal Form, Third Normal Form
Normalization can be defined as segregating a table into two different tables, so as to avoid duplication of values.
Normalization is a step-by-step process of removing redundancies and dependencies of attributes in a data structure.
The condition of the data at the completion of each step is described as a normal form.
The need for normalization: it improves database design;
ensures minimum redundancy of data;
reduces the need to reorganize data when the design is modified or enhanced;
removes anomalies from database activities.
First normal form:
A table is in first normal form when it contains no repeating groups.
The repeating columns or fields in an unnormalized table are removed from the table and put into tables of their own.
Such a table becomes dependent on the parent table from which it is derived.
The key to this table is called a concatenated key, with the key of the parent table forming a part of it.
Second normal form:
A table is in second normal form if all its non-key fields are fully dependent on the whole key.
This means that each field in a table must depend on the entire key.
Those fields that do not depend on the combination key are moved to another table on whose key they depend.
Structures which do not contain combination keys are automatically in second normal form.
Third normal form:
A table is said to be in third normal form if all the non-key fields of the table are independent of all other non-key fields of the same table.
Normalization: the process of decomposing tables to eliminate data redundancy is called normalization.
Further, 4NF and 5NF deal with multi-valued dependencies (essentially to describe many-to-many relations).
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process:
1. Eliminate redundant data.
2. Ensure data dependencies make sense (only storing related data in a table).
First Normal Form sets the very basic rules for an organized database:
1. Eliminate duplicate columns from the same table.
2. Create separate tables for each group of related data and identify each row with a unique column or set of columns.
Second Normal Form further addresses the concept of removing duplicative data:
1. Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
2. Create relationships between these new tables and their predecessors through the use of foreign keys.
Third Normal Form:
1. Remove columns that are not dependent upon the primary key.
Fourth Normal Form:
1. A relation is in 4NF if it has no multi-valued dependencies.
These normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.
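A small worked example of these guidelines, with tables invented for illustration. In the flat design the customer columns depend on customer_id rather than on the key order_id, so normalization moves them to their own table:

  -- Before: customer_name and customer_city are repeated on every order.
  CREATE TABLE order_flat (
    order_id      NUMBER PRIMARY KEY,
    customer_id   NUMBER,
    customer_name VARCHAR2(100),
    customer_city VARCHAR2(100),
    amount        NUMBER
  );

  -- After (3NF): non-key fields no longer depend on other non-key fields.
  CREATE TABLE customer (
    customer_id   NUMBER PRIMARY KEY,
    customer_name VARCHAR2(100),
    customer_city VARCHAR2(100)
  );

  CREATE TABLE orders (
    order_id    NUMBER PRIMARY KEY,
    customer_id NUMBER REFERENCES customer(customer_id),
    amount      NUMBER
  );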
Which columns go to the fact table and which columns go to the dimension table
The aggregation or calculated-value columns go to the fact table, and the detail information goes to the dimension table.
To add on: foreign key elements are stored in the fact table along with business measures, such as sales in $ amount; a date may be a business measure in some cases, and units (qty sold) may be a business measure. It also depends on the granularity at which the data is stored.
It also means that we can have (for example) data aggregated for a year for a given product, while the same data can be drilled down to a monthly, weekly, and daily basis. The lowest level is known as the grain; going down to the details is granularity.
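A minimal sketch of this split, with invented names: foreign keys and measures in the fact table, descriptive detail in the dimension:

  -- Dimension table: detail information.
  CREATE TABLE product_dim (
    product_key  NUMBER PRIMARY KEY,
    product_name VARCHAR2(100),
    category     VARCHAR2(50)
  );

  -- Fact table: foreign keys plus business measures, at the chosen grain.
  CREATE TABLE sales_fact (
    date_key    NUMBER,             -- foreign key to the time dimension
    product_key NUMBER REFERENCES product_dim(product_key),
    sales_amt   NUMBER,             -- business measure in $
    qty_sold    NUMBER              -- business measure (units)
  );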
What are conformed dimensions
They are dimension tables in a star schema data mart that adhere to a common structure, and therefore allow queries to be executed across star schemas. For example, the Calendar dimension is commonly needed in most data marts. By making this Calendar dimension adhere to a single structure, regardless of which data mart it is used in within your organization, you can query by date/time from one data mart to another.
Conformed dimensions are dimensions which are common to multiple cubes (cubes are the schemas containing fact and dimension tables). Consider that Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2, D1, D2, D4 as the facts and dimensions; here D1 and D2 are the conformed dimensions.
If a table is used as a dimension table for more than one fact table, then the dimension table is called a conformed dimension.
Conformed dimensions are ones that share one or more attributes whose values are drawn from the same domains.
A conformed dimension is a single, coherent view of the same piece of data throughout the organization. The same dimension is used in all subsequent star schemas defined. This enables reporting across the complete data warehouse in a simple format.
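To make this concrete (all names invented), two marts that share the same calendar_dim return results that line up month by month:

  -- Sales mart, by month.
  SELECT c.month_name, SUM(s.sales_amt) AS sales
  FROM calendar_dim c
  JOIN sales_fact s ON s.date_key = c.date_key
  GROUP BY c.month_name;

  -- Returns mart, by the very same months, because the dimension conforms.
  SELECT c.month_name, SUM(r.return_amt) AS returns
  FROM calendar_dim c
  JOIN returns_fact r ON r.date_key = c.date_key
  GROUP BY c.month_name;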
What are Semi-additive and factless facts and in which scenario will you use
such kinds of fact tables
Semi-additive: semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. For example, suppose Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add it up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information).
A factless fact table captures the many-to-many relationships between
dimensions, but contains no numeric or textual facts. They are often used to
record events or coverage information. Common examples of factless fact tables
include:
- Identifying product promotion events (to determine promoted products that didn't sell)
- Tracking student attendance or registration events (sketched after this list)
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or
university
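A small sketch of the student-attendance case above (names assumed): the factless fact table holds only dimension keys, and the "facts" are obtained by counting rows:

  -- No numeric measures; each row records that a student attended a
  -- class on a given date.
  CREATE TABLE attendance_fact (
    date_key    NUMBER,
    student_key NUMBER,
    class_key   NUMBER
  );

  -- Attendance per class is simply a count of the event rows.
  SELECT class_key, COUNT(*) AS attendance_count
  FROM attendance_fact
  GROUP BY class_key;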
Why are OLTP database designs not generally a good idea for a Data
Warehouse
An OLTP system cannot store historical information about the organization. It is used for storing the details of daily transactions, while a data warehouse is a huge store of historical information obtained from different data marts for making intelligent decisions about the organization.
Why should you put your data warehouse on a different system than your OLTP
system
OLTP stands for On-Line Transaction Processing.
OLTP systems are used to store only daily transactions, as the changes have to be made in as few places as possible. OLTP systems do not hold the historical data of the organization; the data warehouse contains the historical information about the organization.
The data warehouse is part of OLAP (On-Line Analytical Processing). It is the source from which BI tools fetch data for analytical, reporting, or data mining purposes. It generally contains the data through the whole life cycle of the company/product. A DWH contains historical, integrated, denormalized, subject-oriented data.
On the other hand, the OLTP system contains data that is generally limited to the last couple of months, or a year at most. The nature of data in OLTP is current, volatile, and highly normalized. Since the two systems are different in nature and functionality, we should always keep them on different systems.
A DW is typically used most often for intensive querying. Since the primary responsibility of an OLTP system is to faithfully record ongoing transactions (inserts/updates/deletes), these operations would be considerably slowed down by the heavy querying to which the DW is subjected.
Explain the advantages of RAID 1, 1/0, and 5. On what type of RAID setup would you put your TX logs
Raid 0 - Makes several physical hard drives look like one hard drive. No redundancy but very fast. May be used for temporary space where loss of the files will not result in loss of committed data.
Raid 1 - Mirroring. Each hard drive in the drive array has a twin. Each twin holds an exact copy of the other twin's data, so if one hard drive fails, the other is used to pull the data. Raid 1 is half the speed of Raid 0, and the read and write performance are good.
Raid 1/0 - Striped as Raid 0, then mirrored as Raid 1. Similar to Raid 1, and sometimes faster; it depends on the vendor implementation.
Raid 5 - Great for read-only systems. Write performance is one third that of Raid 1, but read performance is the same as Raid 1. Raid 5 is great for a DW but not good for OLTP.
Hard drives are cheap now, so I always recommend Raid 1.
A fully normalized OLTP design makes for a poorly modelled DWH schema.
What is a surrogate key
A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key.
It is useful because the natural primary key (i.e. Customer Number in Customer
table) can change and this makes updates more difficult.
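For example (illustrative DDL), an Oracle sequence can supply the surrogate values while the natural key is kept as an ordinary attribute:

  -- The sequence generates the surrogate key; the natural key
  -- (customer_number) is stored but not used as the primary key.
  CREATE SEQUENCE customer_dim_seq START WITH 1 INCREMENT BY 1;

  INSERT INTO customer_dim (customer_sid, customer_number, customer_name)
  VALUES (customer_dim_seq.NEXTVAL, 'C-1001', 'Acme Ltd');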
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.
Another benefit you can get from surrogate keys (SID) is :
Tracking the SCD - Slowly Changing Dimension.
Let me give you a simple, classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee dimension). This employee has a turnover allocated to him on the Business Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the old turnover should belong to the Business Unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'.
If you use surrogate keys, you could create on the 2nd of June a new record for
the Employee 'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with the
SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the
SID of the employee 'E1' + 'BU2.'
You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2'. But the difference with the natural key enlargement process is that you might not have all parts of your new key within your fact table, so you might not be able to do the join on the new enlarged key -> so you need another id.
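A minimal sketch of the E1 move as Type-2 rows; the employee_dim layout and the sequence name are assumptions. The old row is closed off and a new row gets a new SID:

  -- Close the current version of E1 as of the 2nd of June 2002...
  UPDATE employee_dim
  SET end_date = DATE '2002-06-02'
  WHERE emp_code = 'E1' AND end_date IS NULL;

  -- ...then insert a new version, with a new surrogate key, in BU2.
  INSERT INTO employee_dim (emp_sid, emp_code, business_unit,
                            start_date, end_date)
  VALUES (employee_dim_seq.NEXTVAL, 'E1', 'BU2', DATE '2002-06-02', NULL);

Facts loaded before the 2nd of June keep the old SID (and thus 'BU1'); facts loaded afterwards take the new SID (and thus 'BU2').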
When creating a dimension table in a data warehouse, we generally create the table with a system-generated key to uniquely identify a row in the dimension. This key is also known as a surrogate key. The surrogate key is used as the primary key in the dimension table. The surrogate key will also be placed in the fact table, and a foreign key will be defined between the two tables. When you ultimately join the data, it will join just as any other join within the database.
A surrogate key is a unique identification key; it is like an artificial or alternative key to the production key, because the production key may be alphanumeric or a composite key, whereas the surrogate key is always a single numeric key.
Assume the production key is an alphanumeric field: if you create an index on this field it will occupy more space, so it is not advisable to join or index on it. Because data warehousing fact tables generally hold historical data and are linked with so many dimension tables, performance is higher if the key is a numeric field.
A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or natural key. Sometimes there can be several natural keys that could be declared as the primary key, and these are all called candidate keys. So a surrogate is a candidate key. A table could actually have more than one surrogate key, although this would be unusual. The most common type of surrogate key is an incrementing integer, such as an auto_increment column in MySQL, a sequence in Oracle, or an identity column in SQL Server.
Use of surrogate keys: every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys. It is up to the data extract logic to systematically look up and replace every incoming natural key with a data warehouse surrogate key each time either a dimension record or a fact record is brought into the data warehouse environment.
What are data validation strategies for data mart validation after the loading process
Data validation is performed to make sure that the loaded data is accurate and meets the business requirements.
Strategies are the different methods followed to meet the validation requirements.
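One common strategy, sketched here with invented names, is reconciling row counts and measure totals between source and mart after the load:

  -- Source and target should agree on row counts and amount totals;
  -- a mismatch flags a validation failure.
  SELECT (SELECT COUNT(*)        FROM src_orders) AS src_rows,
         (SELECT COUNT(*)        FROM sales_fact) AS tgt_rows,
         (SELECT SUM(amount)     FROM src_orders) AS src_amt,
         (SELECT SUM(sales_amt)  FROM sales_fact) AS tgt_amt
  FROM dual;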
Hierarchies
A hierarchy organizes the levels of a dimension from general to specific: each value at a higher level is the parent of the related values at the next lower level, and the values at the lower level are its children. These familial relationships enable analysts to access data quickly.
Levels
A level represents a position in a hierarchy. For example, a time dimension might
have a hierarchy that represents data at the month, quarter, and year levels.
Levels range from general to specific, with the root level as the highest or most
general level. The levels in a dimension are organized into one or more
hierarchies.
Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general
(the root) to most specific information. They define the parent-child relationship
between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex query rewrites. For example, the database can aggregate existing sales revenue on a quarterly basis to a yearly aggregation when the dimensional dependencies between quarter and year are known.
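In Oracle, such dependencies can be declared with CREATE DIMENSION so the optimizer can rewrite a quarterly aggregate to a yearly one; the times table and its columns are assumed for this sketch:

  -- Declare day -> month -> quarter -> year so the database knows the
  -- parent-child relationships used for query rewrite.
  CREATE DIMENSION times_dim
    LEVEL day     IS times.time_id
    LEVEL month   IS times.calendar_month_desc
    LEVEL quarter IS times.calendar_quarter_desc
    LEVEL year    IS times.calendar_year
    HIERARCHY cal_rollup (
      day CHILD OF month CHILD OF quarter CHILD OF year
    );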
What is the definition of normalized and denormalized view and what are the
differences between them
Normalization is the process of removing redundancies.
Denormalization is the process of allowing redundancies.
What is the main difference between Inmon and Kimball philosophies of data
warehousing?
Basically speaking, Inmon professes the Snowflake Schema while Kimball relies
on the Star Schema
Both differ in their concept of building the data warehouse.
According to Kimball:
Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained from the dimensional modeling on a local departmental level.
Inmon believes in creating a data warehouse on a subject-by-subject-area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as the need arises. Point-of-sale (POS) data can be added later if management decides it is necessary.
i.e.,
Kimball: first data marts, combined to form the data warehouse.
Inmon: first the data warehouse, later the data marts.
The main difference between the Kimball and Inmon methodologies is:
Kimball: creating data marts first, then combining them to form a data warehouse.
Inmon: creating the data warehouse first, then the data marts.
Actually, the main difference is that Kimball follows dimensional modelling while Inmon follows ER modelling.
Ralph Kimball follows a bottom-up approach, i.e., first create individual data marts from the existing sources and then create the data warehouse.
Bill Inmon follows a top-down approach, i.e., first create the data warehouse from the existing sources and then create the individual data marts.