A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis. This course covers the basic concepts of data warehousing. By the end of the course work we will also know how a data warehouse is built and maintained.
1 The Evolution:
Decision support systems (DSS) were developed after a long and complex evolution of information technology. The early stages of this evolution included punch cards and magnetic tapes. Around the mid-1960s the growth of master files exploded, and this resulted in redundant data. The 1970s saw the advent of disk storage, where data on direct access storage devices (DASD) could be accessed directly, without any sequential access. With DASD came a new type of system software, the database management system (DBMS). The main aim of the DBMS was to make it easy for the programmer to store and access data on DASD. In addition, the DBMS took care of storing the data, indexing, and so on.
requirements into a form specifically suited to satisfy the needs of the department.
The operational world is designed around applications and functions such as loans, savings, bankcard, and trust for a financial institution. The data warehouse world is organized around major subjects such as customer, vendor, product, and activity. The alignment around subject areas affects the design and implementation of the data found in the data warehouse.
Another important way in which application-oriented operational data differs from data warehouse data is in the relationships of data. Operational data maintains an ongoing relationship between two or more tables based on a business rule that is currently in effect. Data warehouse data spans a spectrum of time, so the relationships found in the data warehouse are far more numerous.
2.2 Integration
The most important aspect of the data warehouse environment is data integration. The very essence of the data warehouse environment is that data contained within the boundaries of the warehouse is integrated. The integration shows up in different ways: consistency of naming conventions, consistency in the measurement of variables, consistency in the physical attributes of the data, and so forth. Figure 3 shows the concept of integration in a data warehouse.
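The kinds of consistency listed above can be illustrated with a minimal sketch. It assumes three hypothetical source applications that encode gender and length differently; the field names, codes, and the choice of centimetres as the warehouse unit are invented for illustration and do not come from the paper.

```python
# Hypothetical source encodings; every code and unit here is illustrative only.
GENDER_MAP = {"m": "M", "f": "F", "1": "M", "0": "F", "male": "M", "female": "F"}
TO_CM = {"cm": 1.0, "in": 2.54, "yd": 91.44}


def integrate(record):
    """Return a source record rewritten in the warehouse's single convention."""
    gender = GENDER_MAP[str(record["gender"]).strip().lower()]
    length_cm = record["length"] * TO_CM[record["unit"]]
    return {"gender": gender, "length_cm": round(length_cm, 2)}


# One row from each assumed source application.
rows = [
    {"gender": "m", "length": 10, "unit": "cm"},
    {"gender": 1, "length": 10, "unit": "in"},
    {"gender": "female", "length": 1, "unit": "yd"},
]
integrated = [integrate(r) for r in rows]
```

Whatever the source encoding was, every row in `integrated` uses the same gender codes and the same unit of measurement, which is the property the warehouse relies on.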
environment data is accurate as of the moment of access. In other words, in the operational environment when you access a unit of data, you expect that it will reflect accurate values as of the moment of access. Because data in the data warehouse is accurate as of some moment in time (i.e., not "right now"), data found in the warehouse is said to be "time variant". Figure 4 shows the time variance of data warehouse data.
The time variance of data warehouse data shows up in different ways. The simplest is that the data warehouse holds data for a time horizon of 10 to 15 years, whereas in the operational environment the time span is much shorter. The second way that time variance shows up in the data warehouse is in the key structure. Every key structure in the data warehouse contains, implicitly or explicitly, an element of time, such as day, week, or month. The element of time is almost always at the bottom of the concatenated key found in the data warehouse. The third way that time variance appears is that data warehouse data, once correctly recorded, cannot be updated. Data warehouse data is, for all practical purposes, a long series of snapshots. Of course, if a snapshot has been taken incorrectly, it can be changed. But assuming that snapshots are made properly, they are not altered once made.
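The second and third points above can be sketched as a small append-only store. This is only an illustration under stated assumptions: the `SnapshotStore` class, its method names, and the customer balance data are hypothetical, and the sketch leaves out the correction path for snapshots that were taken incorrectly.

```python
from datetime import date


class SnapshotStore:
    """Append-only snapshots keyed by (customer_id, as_of_date)."""

    def __init__(self):
        self._rows = {}

    def record(self, customer_id, as_of, balance):
        # The element of time is the last part of the concatenated key.
        key = (customer_id, as_of)
        if key in self._rows:
            # A recorded snapshot is not updated in place.
            raise ValueError(f"snapshot {key} already recorded")
        self._rows[key] = balance

    def as_of(self, customer_id, when):
        """Return the latest snapshot taken on or before `when`, or None."""
        dates = [d for (cid, d) in self._rows if cid == customer_id and d <= when]
        return self._rows[(customer_id, max(dates))] if dates else None


store = SnapshotStore()
store.record("C1", date(2008, 1, 31), 100.0)
store.record("C1", date(2008, 2, 29), 250.0)
```

A query such as `store.as_of("C1", date(2008, 2, 15))` answers with the value that was accurate at the end of January, not with the value that is accurate "right now".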
Older detail data is data that is stored on some form of mass storage. It is infrequently accessed and is stored at a level of detail consistent with current detailed data.

Lightly summarized data is data that is distilled from the low level of detail found at the current detailed level. This level of the data warehouse is almost always stored on disk storage.

Highly summarized data is compact and easily accessible. Sometimes the highly summarized data is found in the data warehouse environment, and in other cases it is found outside the immediate walls of the technology that houses the data warehouse.

The final component of the data warehouse is meta data. In many ways meta data sits in a different dimension than other data warehouse data, because meta data contains no data directly taken from the operational environment. Meta data plays a special and very important role in the data warehouse. Meta data is used as:
• a directory to help the DSS analyst locate the contents of the data warehouse,
• a guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment.

Meta data plays a much more important role in the data warehouse environment than it ever did in the classical operational environment.
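As a rough illustration of the two uses of meta data named above, the sketch below keeps a small dictionary that can be searched as a directory and read as a mapping guide. All table names, column names, and rules are invented for the example and do not describe any real system.

```python
# Hypothetical meta data entries: every table and column name is invented.
METADATA = {
    "customer.gender": {"source": "loans.sex_cd", "rule": "m/f mapped to M/F"},
    "customer.balance": {"source": "savings.bal_amt", "rule": "copied as is"},
    "product.length_cm": {"source": "inventory.len_in", "rule": "inches times 2.54"},
}


def locate(term):
    """Directory use: list warehouse fields whose name contains `term`."""
    return sorted(name for name in METADATA if term in name)


def lineage(field):
    """Mapping use: report where a warehouse field came from and how it changed."""
    entry = METADATA[field]
    return f"{field} <- {entry['source']} ({entry['rule']})"
```

Note that the dictionary holds descriptions of data only; none of its values is taken from the operational environment itself.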
Data enters the data warehouse from the operational environment. Upon entering the data warehouse, data goes into the current level of detail, as shown. It resides there and is used there until one of three events occurs:
• it is purged,
• it is summarized, and/or
• it is archived.

The aging process inside a data warehouse moves current detail data to old detail data, based on the age of the data. As the data is summarized, it passes from lightly summarized data to highly summarized data.

Based on the above facts we now realize that the data warehouse is not built all at once. Instead it is designed and populated one step at a time; its development is evolutionary, not revolutionary. Building a data warehouse all at once would be very expensive, and the results would not be very accurate. So it is always recommended that the environment be built using a step-by-step approach.
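The aging process described above, in which current detail moves to old detail based on the age of the data, can be sketched as follows. The 30-day retention window, the `age` function, and the row layout are assumptions chosen only for illustration.

```python
from datetime import date, timedelta


def age(current_detail, old_detail, today, keep_days=30):
    """Move rows older than `keep_days` from current detail to old detail.

    `keep_days` is an arbitrary illustrative window, not a recommended value.
    Returns the rows that remain current and the extended old-detail list.
    """
    cutoff = today - timedelta(days=keep_days)
    still_current = []
    aged = list(old_detail)
    for row in current_detail:
        if row["day"] < cutoff:
            aged.append(row)           # archived to older detail
        else:
            still_current.append(row)  # stays at the current detail level
    return still_current, aged


current = [
    {"day": date(2008, 1, 1), "amount": 5},
    {"day": date(2008, 3, 1), "amount": 7},
]
current, old = age(current, [], today=date(2008, 3, 15))
```

A real aging job would typically also summarize or purge rows at this point; the sketch covers only the move from current detail to older detail.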
2.7 Granularity
The single most important issue in the design of the data warehouse is granularity. Granularity refers to the level of detail or summarization of the units of data in the data warehouse. The more detail there is, the lower the granularity level. The less detail there is, the higher the granularity level. Granularity is a major design issue in the data warehouse because it profoundly affects the volume of data. The figure below shows the issue of granularity in a data warehouse.
Dual levels of Granularity: Sometimes there is a great need both for efficiency in storing and accessing data and for the ability to analyze the data in great detail. When an organization has huge volumes of data, it makes sense to consider two or more levels of granularity in the detailed portion of the data warehouse. The figure below shows two levels of granularity in a data warehouse for a phone company, a design that fits the needs of most shops. There is a huge amount of data at the operational level. Data up to 30 days old is stored in the operational environment. The data then shifts to the lightly and highly summarized zones.
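To make the two levels of granularity concrete, the following sketch rolls call-level detail up into one row per customer per month. The call records, field names, and the `summarize` helper are hypothetical and stand in for the phone company example only loosely.

```python
from collections import defaultdict

# Low granularity level: one row per call (invented sample data).
calls = [
    {"customer": "C1", "month": "2008-01", "minutes": 12},
    {"customer": "C1", "month": "2008-01", "minutes": 3},
    {"customer": "C1", "month": "2008-02", "minutes": 7},
    {"customer": "C2", "month": "2008-01", "minutes": 20},
]


def summarize(detail):
    """Higher granularity level: one row per customer per month."""
    totals = defaultdict(lambda: {"calls": 0, "minutes": 0})
    for row in detail:
        bucket = totals[(row["customer"], row["month"])]
        bucket["calls"] += 1
        bucket["minutes"] += row["minutes"]
    return dict(totals)


monthly = summarize(calls)
```

The summary has fewer rows than the detail and answers most questions faster, but a question such as "how long was each individual call?" can only be answered from the detailed level, which is why both levels are kept.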
Granularity does more than serve the data warehouse and its data marts. It also supports the process of exploration and data mining. Exploration and data mining take masses of detailed historical data and examine them to find previously unknown patterns of business activity.
It is usually said that if both granularity and partitioning are done properly, then almost all other aspects of the data warehouse implementation come easily. Proper partitioning of data allows the data to grow and to be managed.
Partitioning of data: The main purpose of partitioning is to break up the data into small, manageable physical units. The main advantage is that the developer has greater flexibility in managing the physical units of data. The main tasks that benefit from partitioning are as follows:
• Restructuring
• Indexing
• Sequential scanning
• Reorganization
• Recovery
• Monitoring
In short, the main aim of this activity is flexible access to data. Partitioning can be done in many different ways. One of the major issues facing the data warehouse developer is whether partitioning is done at the system level or the application level. Partitioning at the system level is a function of the DBMS and, to some extent, the operating system.
The integration of existing legacy systems is not the only difficulty in the transformation of data to the data warehouse. Another major problem is the efficiency of accessing existing system data. There are three types of loads made into the data warehouse from the operational environment:
• Archival data
• Data currently in the operational environment
• Ongoing changes to the data warehouse environment since the last refresh
Loading archival data into the data warehouse is the first load to be done, as it presents minimal challenges. It also has the advantage of being a one-time event. Loading the current non-archival data from the operational environment into the data warehouse is not a big challenge either, because it is also done only once and the event is minimally disruptive. Loading the ongoing changes from the operational environment into the data warehouse is one of the biggest challenges facing the data architect. These changes happen daily, and tracking and manipulating them is not easy.

Several common techniques are used to limit the amount of operational data scanned during extraction. The first technique is to scan data that has been time-stamped in the operational environment. The second technique is to scan "delta" files, which contain only the changes made in the application since the last run. The third technique is to scan a log file or an audit file created as a by-product of transaction processing. The last technique is to modify the application code. This is not a very popular option, as most of the source code is very old and fragile.
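The first technique, scanning time-stamped data, can be sketched as below. The sketch assumes every operational row carries an `updated_at` timestamp and that the extract job remembers when it last ran; both the field name and the `extract_changes` function are hypothetical.

```python
from datetime import datetime


def extract_changes(rows, last_run):
    """Return rows changed after `last_run` and the new high-water mark."""
    changed = [r for r in rows if r["updated_at"] > last_run]
    # If nothing changed, the high-water mark stays where it was.
    new_mark = max((r["updated_at"] for r in changed), default=last_run)
    return changed, new_mark


# Invented operational rows with time stamps.
operational = [
    {"id": 1, "updated_at": datetime(2008, 3, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2008, 3, 2, 14, 30)},
    {"id": 3, "updated_at": datetime(2008, 3, 3, 8, 15)},
]
changed, mark = extract_changes(operational, last_run=datetime(2008, 3, 2, 0, 0))
```

Only the rows stamped after the previous run are passed on to the warehouse load, and `mark` becomes the `last_run` value for the next extract. A delta file or log file achieves the same reduction without requiring a timestamp on every row.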
data which regularly changes, and then we do a stability analysis to create groups of data that have similar characteristics. The stability analysis is done as shown in the figure.
The primary grouping exists only once for a major subject area. It holds the primary keys of the major subject area. The secondary groupings hold data that can exist multiple times in a major subject area. There may be as many secondary groupings as there are distinct groups of data that can occur multiple times. The connector relates the data from one grouping to another, much like a foreign key in a table. "Type of" data is indicated by a line leading to the right of a grouping of data. The grouping of data to the left is the supertype, and the one to the right is the subtype. Shown below is a DIS
The process which needs to be followed is:
• Building a small subset quickly, based on feedback
• Prototyping
• Looking at what other people have done
• Working with an experienced user
• Looking at what the organization has now
• Having sessions with the simulated output
6 Conclusion
Data warehousing and business intelligence are fundamentally about providing business people with the information and tools they need to make both operational and business decisions. They are especially useful when the decision maker needs historical or integrated data from multiple sources for data analysis. In this paper, we have examined what a data warehouse is, how it works, and finally how it is developed and maintained. A small project was done based on the above concepts using Microsoft SQL Server 2008.