
Metadata: Different Types of Metadata

Metadata is "data about data".


Metadata are traditionally found in the card catalogs of libraries. As information has become
increasingly digital, metadata are also used to describe digital data using metadata standards
specific to a particular discipline. By describing the contents and context of data files, the quality
of the original data/files is greatly increased. For example, a webpage may include metadata
specifying what language it is written in, what tools were used to create it, and where to go for
more on the subject, allowing browsers to automatically improve the experience of users.
There are three main types of metadata:
- Descriptive metadata describes a resource for purposes such as discovery
  and identification. It can include elements such as title, abstract, author, and
  keywords.
- Structural metadata indicates how compound objects are put together, for
  example, how pages are ordered to form chapters.
- Administrative metadata provides information to help manage a resource,
  such as when and how it was created, file type and other technical
  information, and who can access it. There are several subsets of
  administrative metadata; two that are sometimes listed as separate metadata
  types are:
  - Rights management metadata, which deals with intellectual property
    rights, and
  - Preservation metadata, which contains information needed to archive and
    preserve a resource.
Metadata serves functions such as:

Resource discovery
o Allowing resources to be found by relevant criteria;
o Identifying resources;
o Bringing similar resources together;
o Distinguishing dissimilar resources;
o Giving location information.
Organizing e-resources
o Organizing links to resources based on audience or topic.
o Building these pages dynamically from metadata stored in
  databases.
Facilitating interoperability
o Using defined metadata schemes, shared transfer
  protocols, and crosswalks between schemes, resources
  across the network can be searched more seamlessly.
  - Cross-system search, e.g., using the Z39.50 protocol;
  - Metadata harvesting, e.g., the OAI protocol.
Digital identification
o Elements for standard numbers, e.g., ISBN
o The location of a digital object may also be given using:
  - a file name
  - a URL
  - some persistent identifiers, e.g., PURL (Persistent
    URL); DOI (Digital Object Identifier)
o Combined metadata to act as a set of identifying data,
  differentiating one object from another for validation
  purposes.
Archiving and preservation
o Challenges:
  - Digital information is fragile and can be corrupted or
    altered;
  - It may become unusable as storage technologies
    change.
o Metadata is key to ensuring that resources will survive and
  continue to be accessible into the future. Archiving and
  preservation require special elements:
  - to track the lineage of a digital object,
  - to detail its physical characteristics, and
  - to document its behavior in order to emulate it in
    future technologies.
Source:
NISO (2004) Understanding Metadata.
Bethesda, MD: NISO Press, pp. 1-2.
Getty's definitions of types of metadata

Administrative
  Definition: Metadata used in managing and administering information resources
  Examples:
  - Acquisition information
  - Rights and reproduction tracking
  - Documentation of legal access requirements
  - Location information
  - Selection criteria for digitization
  - Version control and differentiation between similar information objects
  - Audit trails created by record keeping systems

Descriptive
  Definition: Metadata used to describe or identify information resources
  Examples:
  - Cataloging records
  - Finding aids
  - Specialized indexes
  - Hyperlinked relationships between resources
  - Annotations by users
  - Metadata for record keeping systems generated by records creators

Preservation
  Definition: Metadata related to the preservation management of information resources
  Examples:
  - Documentation of physical condition of resources
  - Documentation of actions taken to preserve physical and digital versions of
    resources, e.g., data refreshing and migration

Technical
  Definition: Metadata related to how a system functions or metadata behave
  Examples:
  - Hardware and software documentation
  - Digitization information, e.g., formats, compression ratios, scaling routines
  - Tracking of system response times
  - Authentication and security data, e.g., encryption keys, passwords

Use
  Definition: Metadata related to the level and type of use of information resources
  Examples:
  - Exhibit records
  - Use and user tracking
  - Content re-use and multi-versioning information
Why Data Warehouses Fail

User Adoption.
The single measure of success for any BI project: are the users using it? If not, it has failed.

Users Don't Know What They Don't Know.
It is utterly pointless paying a Business Analyst to spend weeks asking users what they want from a BI
project. Users don't know - and will NEVER know for absolute certain UNTIL they see something. What
does that mean? It means that waterfall/SDLC as a methodology will never be appropriate for developing
BI solutions. If you are utilising a Gantt chart for managing your BI project right now, you are heading for
failure! It has become more widely known that Agile or Scrum methodologies work best for BI. Incremental,
iterative steps are the way to go.
All BI Solutions Will Require Change.
Whether change comes from external or internal influences, it does not matter. It is inevitable. If your
toolset/method/skills cannot embrace change, you are going to fail. If your ETL processes are like plates
of spaghetti then change is not going to be easy for you. A data warehouse is a journey, not a destination,
and often you will need to change direction.
Everybody Loves Kimball.
And why not? A star-schema or dimensional model, after all, is the single goal of any BI developer. A
single fact table with some nice dimensions around it is Nirvana for an end-user. They're easy to
understand. It's the goal for self-serve reporting. What can go wrong? Everything! The point is you can
rarely go from source to star-schema without having to do SOMETHING to the data along the way.
Especially if your fact table requires information from more than one table or data source, you face a lot
of hard work to get that data into your star-schema. In a poll I conducted on LinkedIn a while back, I
asked BI experts how much of a BI project was spent just getting the data right. The answer came back
as around 75-80%. Over three-quarters of any BI project is spent just getting the data into shape for BI!
So when (not if) you need to make changes, you will have a tougher job on your hands
(especially if you built it using traditional ETL tools).
Everybody Ignores Inmon and Linstedt.
Bill Inmon, the Father of Data Warehousing, has written numerous books on the subject. His Corporate
Information Factory philosophy makes a lot of sense: take your data from all your sources and compose
it into Third Normal Form (3NF) in your Enterprise Data Warehouse (EDW). Why? It makes your data
generic and in its lowest level of granularity. Once you have your data in this state, it makes the perfect
source for your Kimball star-schemas. Dan Linstedt extends the 3NF model by introducing Data Vault,
which provides a means of maintaining historical information about your data and where it came from. A
unique feature of Data Vault is that you can understand what state your data warehouse was in at any
point in time. So why does everybody ignore Inmon and Linstedt? Most likely because they are too
complex to build and maintain using traditional ETL tools. Instead, most developers will manage all the
staging and data transformation in ancillary tables, in a way only they understand, using their favourite ETL
tool. Good luck for when they finally leave your organisation!
ETL Tools Do Not Build Data Warehouses.
ETL tools were designed to move data from one place to another. Over the years, extra bits may have
been cobbled on to perform tasks to ease the job of a data warehouse developer, but they still rely on too
many other things. A decent target data warehouse model, for example. As discussed in the above
points, ETL tools offer no help in providing a fast and effective means for delivering a design like a Third
Normal Form or Data Vault Enterprise Data Warehouse. This means you need a data modelling tool plus
the skills to design such architectures. Thankfully, we live in the 21st century, where true data warehouse
automation tools are emerging. These will help lead data warehousing out of the dark ages, especially
with the advent of Big Data. Inmon and Linstedt have written the rules; now let the data warehouse
automation tools take over!
Summary
To succeed in your data warehouse project, take an approach that embraces rapid change, and surround
yourself with the tools, methods and people that are willing and able to support that.
While you need a star-schema for reporting and analytics, don't try to take shortcuts to get there. You
cannot go from source to star schema without doing something in between. Bill Inmon and Dan
Linstedt have described how that in-between bit should look. Ignore them at your peril! If you have
multiple data sources, then DO look at building a 3NF or Data Vault EDW. To help you do that, look at
getting a true Data Warehouse Automation tool.
If we are to succeed with Big Data, we need to be truly successful in data warehousing.
MOLAP and ROLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP)
and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine
MOLAP and ROLAP.

MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats.
Advantages:
- Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal
  for slicing and dicing operations.
- Can perform complex calculations: All calculations have been pre-generated when
  the cube is created. Hence, complex calculations are not only doable, but they return
  quickly.
Disadvantages:
- Limited in the amount of data it can handle: Because all calculations are performed
  when the cube is built, it is not possible to include a large amount of data in the cube
  itself. This is not to say that the data in the cube cannot be derived from a large
  amount of data. Indeed, this is possible. But in this case, only summary-level
  information will be included in the cube itself.
- Requires additional investment: Cube technology is often proprietary and does not
  already exist in the organization. Therefore, to adopt MOLAP technology, chances are
  additional investments in human and capital resources are needed.

ROLAP
This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action
of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
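As a minimal sketch of that idea, the following Python snippet runs ROLAP-style queries against an in-memory SQLite database. The `sales` table and all of its columns are invented purely for illustration; a "slice" on one dimension is literally a WHERE clause, and "dicing" across dimensions is a GROUP BY over the same relational data.

```python
import sqlite3

# Hypothetical fact data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("North", "Widget", 2023, 100.0),
    ("North", "Gadget", 2023, 150.0),
    ("South", "Widget", 2023, 200.0),
    ("South", "Widget", 2024, 250.0),
])

# A "slice" on region = 'North' is just a WHERE clause:
slice_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'North'"
).fetchone()[0]
print(slice_total)  # 250.0

# "Dicing" by two dimensions is a GROUP BY over the same data:
for row in conn.execute(
    "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year"
):
    print(row)
```

Because nothing is precomputed, each such query is re-evaluated against the underlying rows, which is exactly why ROLAP performance degrades as data volume grows.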
Advantages:
- Can handle large amounts of data: The data size limitation of ROLAP technology is
  the limitation on data size of the underlying relational database. In other words,
  ROLAP itself places no limitation on data amount.
- Can leverage functionalities inherent in the relational database: Often, the relational
  database already comes with a host of functionalities. ROLAP technologies, since
  they sit on top of the relational database, can therefore leverage these
  functionalities.
Disadvantages:
- Performance can be slow: Because each ROLAP report is essentially a SQL query (or
  multiple SQL queries) in the relational database, the query time can be long if the
  underlying data size is large.
- Limited by SQL functionalities: Because ROLAP technology mainly relies on
  generating SQL statements to query the relational database, and SQL statements do
  not fit all needs (for example, it is difficult to perform complex calculations using
  SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP
  vendors have mitigated this risk by building out-of-the-box complex functions into
  the tool, as well as the ability to allow users to define their own functions.

HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance. When
detail information is needed, HOLAP can "drill through" from the cube into the underlying
relational data.
Star Schema & Multi-Dimensional Databases

Introduction to Multi-Dimensional Databases
Evolving from econometric research conducted at MIT in the 1960s, the multi-dimensional
database has matured into the database engine of choice for data analysis applications.
This application category is commonly referred to as OLAP (On-Line Analytical Processing).
The multi-dimensional database has become popular with industry because it allows
high-performance access and analysis of large amounts of related data across several
applications, operating in different parts of the organization. Given that all business
applications operate in a multi-tier environment, and often use different technologies
operating on different platforms, it is important that such widely dispersed data can be
accessed and analysed in a meaningful way.

The multi-dimensional database may also offer a better concept for visualising the
way we already think of data in the real world. For example, most business managers
already think of data in a multi-dimensional way, such as when they think of specific
products in specific markets over certain periods of time. The multi-dimensional
database attempts to present such data to the end user in a useful way.
1.1 Overview of a Multi-Dimensional Database System
Relational databases store data in a two-dimensional format, where tables of data
are presented as rows and columns. Multi-dimensional database systems offer an
extension to this system to provide a multi-dimensional view of the data (Rand).
For example, in multi-dimensional analysis, data entities such as products,
regions, customers, dates etc. may all represent different dimensions. This
intrinsic feature of the database structure will be covered in depth in subsequent
sections of this paper.

Some further advantages of this database model are:
- The ability to analyse large amounts of data with very fast response times
- To "slice and dice" through data, and "drill down or roll up" through various
  dimensions of the defined data structure
- To quickly identify trends or problem areas that would have been otherwise
  overlooked in an industry environment

Multi-dimensional data structures can be implemented with multi-dimensional
databases, or else they can also be implemented in a relational database
management system using such techniques as the "Star Schema" and the
"Snowflake Schema" (Weldon 1993). The Star Schema is a means of aggregating data
based on a set of known database dimensions, attempting to store a multi-dimensional
data structure in a two-dimensional relational database management system (RDBMS).
The Snowflake Schema is an extension of the Star Schema by the principle of applying
additional dimensions to the Star Schema in an RDBMS.
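As an illustrative sketch of a minimal star schema inside an RDBMS (using Python's built-in sqlite3 module; the table and column names `sales_fact`, `dim_product`, and `dim_region` are invented for the example, not drawn from the sources above):

```python
import sqlite3

# One central fact table referencing two surrounding dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, region_name  TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES dim_product(product_key),
    region_key  INTEGER REFERENCES dim_region(region_key),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO dim_region  VALUES (1, 'North');
INSERT INTO sales_fact  VALUES (1, 1, 99.5);
""")

# A typical star-schema query: join the fact table to its dimensions
# and aggregate the measure.
row = conn.execute("""
    SELECT p.product_name, r.region_name, SUM(f.amount)
    FROM sales_fact f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_region  r ON r.region_key  = f.region_key
    GROUP BY p.product_name, r.region_name
""").fetchone()
print(row)  # ('Widget', 'North', 99.5)
```

The "star" shape is visible in the query: every join radiates out from the single fact table to one dimension table.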
Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than
on a regular, time-based schedule. In a Data Warehouse there is a need to track
changes in dimension attributes in order to report historical data. In other words,
implementing one of the SCD types should enable users to assign the proper dimension
attribute value for a given date. Examples of such dimensions could be: customer, geography,
employee.

There are many approaches to dealing with SCD. The most popular are:
- Type 0 - The passive method
- Type 1 - Overwriting the old value
- Type 2 - Creating a new additional record
- Type 3 - Adding a new column
- Type 4 - Using a historical table
- Type 6 - Combining approaches of types 1, 2 and 3 (1+2+3=6)

Type 0 - The passive method. In this method no special action is performed upon
dimensional changes. Some dimension data can remain the same as when it was first
inserted; other data may be overwritten.

Type 1 - Overwriting the old value. In this method no history of dimension changes is kept
in the database. The old dimension value is simply overwritten by the new one. This type is
easy to maintain and is often used for data whose changes are caused by processing
corrections (e.g. removal of special characters, correcting spelling errors).
Before the change:
  Customer_ID  Customer_Name  Customer_Type
  1            Cust_1         Corporate
After the change:
  Customer_ID  Customer_Name  Customer_Type
  1            Cust_1         Retail
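A minimal sketch of the Type 1 change, using Python's sqlite3 with the example table above (the schema is simplified for illustration):

```python
import sqlite3

# SCD Type 1: the old attribute value is simply overwritten in place,
# so no history survives the change.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (Customer_ID INTEGER, Customer_Name TEXT, Customer_Type TEXT)")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Cust_1', 'Corporate')")

# The Type 1 change is a plain UPDATE.
conn.execute("UPDATE customer_dim SET Customer_Type = 'Retail' WHERE Customer_ID = 1")

print(conn.execute("SELECT * FROM customer_dim").fetchall())
# [(1, 'Cust_1', 'Retail')] -- the 'Corporate' value is gone for good
```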
Type 2 - Creating a new additional record. In this methodology all history of dimension
changes is kept in the database. You capture an attribute change by adding a new row with a
new surrogate key to the dimension table. Both the prior and new rows contain as attributes
the natural key (or other durable identifier). 'Effective date' and 'current indicator'
columns are also used in this method. There can be only one record with the current indicator
set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date for the
current record is usually set to the value 9999-12-31. Introducing changes to the dimensional
model in type 2 can be a very expensive database operation, so it is not recommended to use it
in dimensions where a new attribute could be added in the future.
Before the change:
  Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date    Current_Flag
  1            Cust_1         Corporate      22-07-2010  31-12-9999  Y
After the change:
  Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date    Current_Flag
  1            Cust_1         Corporate      22-07-2010  17-03-2012  N
  2            Cust_1         Retail         18-03-2012  31-12-9999  Y
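The same change expressed as a Type 2 sketch in Python/sqlite3; the surrogate key column and the date literals follow the example tables above, with the schema otherwise simplified for illustration:

```python
import sqlite3

# SCD Type 2: expire the current row and insert a new row with a new
# surrogate key; the full history is preserved.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    Surrogate_Key INTEGER PRIMARY KEY,
    Customer_ID INTEGER, Customer_Name TEXT, Customer_Type TEXT,
    Start_Date TEXT, End_Date TEXT, Current_Flag TEXT)""")
conn.execute("INSERT INTO customer_dim VALUES (1, 1, 'Cust_1', 'Corporate', '2010-07-22', '9999-12-31', 'Y')")

# Capture the change: close the old record ...
conn.execute("""UPDATE customer_dim
                SET End_Date = '2012-03-17', Current_Flag = 'N'
                WHERE Customer_ID = 1 AND Current_Flag = 'Y'""")
# ... and add the new version under a fresh surrogate key.
conn.execute("INSERT INTO customer_dim VALUES (2, 1, 'Cust_1', 'Retail', '2012-03-18', '9999-12-31', 'Y')")

for row in conn.execute("SELECT * FROM customer_dim ORDER BY Surrogate_Key"):
    print(row)
```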
Type 3 - Adding a new column. In this type, usually only the current and previous values of
the dimension are kept in the database. The new value is loaded into the 'current/new' column
and the old one into the 'old/previous' column. Generally speaking, the history is limited to the
number of columns created for storing historical data. This is the least commonly needed
technique.
Before the change:
  Customer_ID  Customer_Name  Current_Type  Previous_Type
  1            Cust_1         Corporate     Corporate
After the change:
  Customer_ID  Customer_Name  Current_Type  Previous_Type
  1            Cust_1         Retail        Corporate
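A Type 3 sketch in Python/sqlite3 (schema simplified, matching the example columns above):

```python
import sqlite3

# SCD Type 3: history is held in extra columns, so only a fixed number
# of prior values (here exactly one) can be retained.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    Customer_ID INTEGER, Customer_Name TEXT,
    Current_Type TEXT, Previous_Type TEXT)""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Cust_1', 'Corporate', 'Corporate')")

# Shift the current value into the 'previous' column, then overwrite it.
conn.execute("""UPDATE customer_dim
                SET Previous_Type = Current_Type, Current_Type = 'Retail'
                WHERE Customer_ID = 1""")

print(conn.execute("SELECT * FROM customer_dim").fetchall())
# [(1, 'Cust_1', 'Retail', 'Corporate')]
```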
Type 4 - Using a historical table. In this method a separate historical table is used to track all
of a dimension's attribute changes for each dimension. The 'main' dimension
table keeps only the current data, e.g. customer and customer_history tables.
Current table:
  Customer_ID  Customer_Name  Customer_Type
  1            Cust_1         Corporate
Historical table:
  Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date
  1            Cust_1         Retail         01-01-2010  21-07-2010
  1            Cust_1         Other          22-07-2010  17-03-2012
  1            Cust_1         Corporate      18-03-2012  31-12-9999
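A Type 4 sketch in Python/sqlite3, reproducing the state of the example tables above (the change to 'Corporate' is applied to an earlier 'Other' state; schema simplified for illustration):

```python
import sqlite3

# SCD Type 4: the main dimension table holds only current data;
# every version is also kept in a separate history table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (Customer_ID INTEGER, Customer_Name TEXT, Customer_Type TEXT);
CREATE TABLE customer_history (
    Customer_ID INTEGER, Customer_Name TEXT, Customer_Type TEXT,
    Start_Date TEXT, End_Date TEXT);
INSERT INTO customer VALUES (1, 'Cust_1', 'Other');
INSERT INTO customer_history VALUES (1, 'Cust_1', 'Retail', '2010-01-01', '2010-07-21');
INSERT INTO customer_history VALUES (1, 'Cust_1', 'Other',  '2010-07-22', '9999-12-31');
""")

# A change to 'Corporate': close the open history row, append the new
# version, and overwrite the current table.
conn.executescript("""
UPDATE customer_history SET End_Date = '2012-03-17'
 WHERE Customer_ID = 1 AND End_Date = '9999-12-31';
INSERT INTO customer_history VALUES (1, 'Cust_1', 'Corporate', '2012-03-18', '9999-12-31');
UPDATE customer SET Customer_Type = 'Corporate' WHERE Customer_ID = 1;
""")

print(conn.execute("SELECT Customer_Type FROM customer").fetchone())      # ('Corporate',)
print(conn.execute("SELECT COUNT(*) FROM customer_history").fetchone())   # (3,)
```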
Type 6 - Combining approaches of types 1, 2 and 3 (1+2+3=6). In this type the dimension
table has such additional columns as:
- current_type - for keeping the current value of the attribute. All history records for a given
  item of the attribute have the same current value.
- historical_type - for keeping the historical value of the attribute. All history records for a
  given item of the attribute could have different values.
- start_date - for keeping the start date of the 'effective date' of the attribute's history
- end_date - for keeping the end date of the 'effective date' of the attribute's history
- current_flag - for keeping information about the most recent record

In this method, to capture an attribute change we add a new record as in type 2. The
current_type information is overwritten with the new one as in type 1. We store the history in
a historical column as in type 3.
  Customer_ID  Customer_Name  Current_Type  Historical_Type  Start_Date  End_Date    Current_Flag
  1            Cust_1         Corporate     Retail           01-01-2010  21-07-2010  N
  2            Cust_1         Corporate     Other            22-07-2010  17-03-2012  N
  3            Cust_1         Corporate     Corporate        18-03-2012  31-12-9999  Y
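A Type 6 sketch in Python/sqlite3 combining the three mechanics; the `apply_change` helper is invented for this example, and the schema is simplified:

```python
import sqlite3

# SCD Type 6 (1+2+3): a new row per change as in Type 2, a current_type
# column overwritten on every row as in Type 1, and a historical_type
# column as in Type 3.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    Surrogate_Key INTEGER PRIMARY KEY,
    Customer_Name TEXT, Current_Type TEXT, Historical_Type TEXT,
    Start_Date TEXT, End_Date TEXT, Current_Flag TEXT)""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Cust_1', 'Retail', 'Retail', '2010-01-01', '9999-12-31', 'Y')")

def apply_change(new_type, change_date, next_date):
    # Type 2 part: expire the current row, insert a new version.
    conn.execute("UPDATE customer_dim SET End_Date = ?, Current_Flag = 'N' WHERE Current_Flag = 'Y'",
                 (change_date,))
    conn.execute("INSERT INTO customer_dim (Customer_Name, Current_Type, Historical_Type, Start_Date, End_Date, Current_Flag) "
                 "VALUES ('Cust_1', ?, ?, ?, '9999-12-31', 'Y')",
                 (new_type, new_type, next_date))
    # Type 1 part: push the new value into Current_Type on every row.
    conn.execute("UPDATE customer_dim SET Current_Type = ?", (new_type,))

apply_change('Other',     '2010-07-21', '2010-07-22')
apply_change('Corporate', '2012-03-17', '2012-03-18')

for row in conn.execute("SELECT * FROM customer_dim ORDER BY Surrogate_Key"):
    print(row)
```

After both changes, every row reports 'Corporate' as the current value, while the historical column still records the value in force during each row's date range, matching the table above.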
Snowflake Schema Advantages & Limitations

Advantages of the Snowflake Schema
- The main advantage of the Snowflake Schema is the improvement of query performance due to
  minimized disk storage requirements and joins against smaller lookup tables.
- It is easier to maintain.
- It increases flexibility.

Disadvantages of the Snowflake Schema
- The main disadvantage of the Snowflake Schema is the additional maintenance effort
  needed due to the increased number of lookup tables.
- It makes queries much more difficult to create, because more tables need to be joined.
The Star schema vs Snowflake schema comparison brings four fundamental differences to
the fore:

1. Data optimization:
The Snowflake model uses normalized data, i.e. the data is organized inside the database in order to
eliminate redundancy and thus help reduce the amount of data. The hierarchy of the business
and its dimensions is preserved in the data model through referential integrity.

Figure 1: Snowflake model

The Star model, on the other hand, uses de-normalized data. In the star model, dimensions refer
directly to the fact table and the business hierarchy is not implemented via referential integrity between
dimensions.

Figure 2: Star model
2. Business model:
A primary key is a single unique key (data attribute) that is selected for a particular piece of data. In the
previous "advertiser" example, the Advertiser_ID will be the primary key (business key) of a
dimension table. The foreign key (referential attribute) is just a field in one table that matches the
primary key of another dimension table. In our example, the Advertiser_ID could be a foreign
key in Account_dimension.

In the snowflake model, the business hierarchy of the data model is represented in a primary key -
foreign key relationship between the various dimension tables.

In the star model, all required dimension tables have only foreign keys in the fact tables.
3. Performance:
The third differentiator in this Star schema vs Snowflake schema face-off is the performance of
these models. The Snowflake model has a higher number of joins between dimension tables and
then again to the fact table, and hence the performance is slower. For instance, if you want to know
the Advertiser details, this model will ask for a lot of information such as the Advertiser Name,
ID and address, for which the advertiser and account tables need to be joined with each other and
then joined with the fact table.

The Star model, on the other hand, has fewer joins between the dimension tables and the fact table.
In this model, if you need information on the advertiser, you will just have to join the Advertiser
dimension table with the fact table.
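The difference can be made concrete by contrasting the two join patterns as SQL text. All table and column names here (`fact_sales`, `dim_account`, `dim_advertiser`) are invented for illustration, following the advertiser example above:

```python
# Snowflake: advertiser attributes are split across normalized tables,
# so reaching them means chaining joins through the hierarchy.
snowflake_query = """
SELECT a.advertiser_name, SUM(f.revenue)
FROM fact_sales f
JOIN dim_account ac ON ac.account_key = f.account_key
JOIN dim_advertiser a ON a.advertiser_id = ac.advertiser_id
GROUP BY a.advertiser_name
"""

# Star: the advertiser dimension is de-normalized, so one join suffices.
star_query = """
SELECT a.advertiser_name, SUM(f.revenue)
FROM fact_sales f
JOIN dim_advertiser a ON a.advertiser_key = f.advertiser_key
GROUP BY a.advertiser_name
"""

print(snowflake_query.count("JOIN"), star_query.count("JOIN"))  # 2 1
```

Each extra join in the snowflake variant is work the database must repeat on every query, which is the performance cost described above.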
4. ETL:
The Snowflake model loads the data marts, and hence the ETL job is more complex in design and
cannot be parallelized, as the dependency model restricts it.

The Star model loads dimension tables without dependencies between dimensions, and hence the
ETL job is simpler and can achieve higher parallelism.
This brings us to the end of the Star schema vs Snowflake schema debate. But where exactly do
these approaches make sense?

Where do the two methods fit in?
With the snowflake model, dimension analysis is easier. For example, "how many accounts or
campaigns are online for a given Advertiser?"

The star schema model is useful for metrics analysis, such as: "What is the revenue for a given
customer?"