
EXTRACT, TRANSFORM AND LOAD (ETL) PERFORMANCE IMPROVED BY QUERY CACHE

PROJECT REPORT PHASE I

Submitted by

P.UDHAYA KUMAR (4511030004)


In partial fulfillment for the award of the degree of

MASTER OF ENGINEERING
in

COMPUTER SCIENCE AND ENGINEERING

Nov 2012

TABLE OF CONTENTS

BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
1 INTRODUCTION
  1.1 Motivation
  1.2 Problem statement
  1.3 Project overview
2 SYSTEM ARCHITECTURE
3 SYSTEM DESIGN
  3.1 Extract
  3.2 Transform
  3.3 Load
4 SYSTEM FLOW CHART
5 IMPLEMENTATION DETAILS
  5.1 Java
  5.2 JDBC
  5.3 Oracle
  5.4 JBoss

VINAYAKA MISSIONS UNIVERSITY


AARUPADAI VEEDU INSTITUTE OF TECHNOLOGY

BONAFIDE CERTIFICATE

Certified that this project report titled "EXTRACT, TRANSFORM AND LOAD (ETL) PERFORMANCE IMPROVED BY QUERY CACHE" is a bonafide work of P.UDHAYA KUMAR (4511030004), who carried out the project work PHASE I under my supervision.

SIGNATURE
Mr.P.T.SIVASANKAR
HEAD OF THE DEPARTMENT
Department of Computer Science and Engineering
AVIT, Paiyanoor.

SIGNATURE
Mrs. SELVI R
PROJECT GUIDE
Associate Professor
Department of Computer Science and Engineering
AVIT, Paiyanoor.

ACKNOWLEDGEMENT
A project work is a product of experience, and it goes a long way in shaping a person in one's respective profession. It is a confluence of varied thoughts integrated into a resourceful product. With great gratitude, I would like to acknowledge the immense help of all those who contributed significantly with their valuable suggestions and timely assistance, which enabled me to face the challenges encountered in this phase with confidence.

I express my sincere thanks to my Vice Chairman and Pro-Chancellor Dr.A.S.Ganesan of Vinayaka Missions University, for providing the needed facilities and creating the comfortable atmosphere required for this project.

I am obliged to thank Dr.J.SHANMUGAM, my respected Principal, and Mrs.R.Kalavathy, my Vice Principal, along with Mr.P.T.Sivasankar, Head of the Department of Computer Science and Engineering, for their cheerful encouragement and their unending support.

I am indebted to my faculty advisor Mr.K.Karthik, Associate Professor of the CSE Department, and Project Coordinator Dr.M.A.Dorairangaswamy, Professor and Dean of the CSE Department, for their constructive suggestions and motivation.

I would like to thank my project guide Mrs. SELVI R, Associate Professor of the CSE Department, for her immense help and valuable suggestions during this project work.

I would also like to thank all the teaching and non-teaching staff members of Aarupadai Veedu Institute of Technology who have rendered their help during the progress of the project.

ABSTRACT:
Extract, transform and load (ETL) is the core process of data integration and is typically associated with data warehousing. ETL tools extract data from a chosen source, transform it into new formats according to business rules, and then load it into a target data structure. The increasing diversity of data sources and the high volumes of data that ETL must accommodate make management, performance and cost the primary challenges for users. ETL is a key process for bringing all the data together in a standard, homogeneous environment. If data taken from various sources is not cleansed, extracted, transformed and integrated properly, the query process, which is the backbone of the data warehouse, cannot take place. In this paper we propose an approach that increases the speed of extract, transform and load in a data warehouse with the support of a query cache. Because the query process is the backbone of the data warehouse, the query cache reduces response time and improves the performance of the data warehouse.

LIST OF FIGURES
Figure No.   Title
1            Architecture diagram
2            System design
3            Transforming and Loading
4            Flowchart 1
5            Flowchart 2
6            Flowchart 3

CHAPTER 1 INTRODUCTION

1.1 MOTIVATION

We introduce a method to improve the performance and speed of ETL in a data warehouse by significantly reducing response time. The primary goal of this technique is to store queries and their corresponding results. If a similar query is later submitted by any other user, the result is obtained from cache memory.

Extract, Transform and Load are three database functions that are combined into one tool, which automates the process of pulling data out of one database and placing it into another. The functions are described below.

Extract -- the process of reading data from a specified source database and extracting a desired subset of data.

Transform -- the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables, or by combining the data with other data.

Load -- the process of writing the data into the target database.

Query cache -- the query cache keeps a record of recently executed queries. Its major goal is to reduce the response time of queries.
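The caching idea described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the class name QueryCache, the least-recently-used eviction policy, and the rule that two queries count as "similar" when they differ only in whitespace are all hypothetical choices made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache that maps normalized query text to a stored result. */
class QueryCache<R> {
    private final int capacity;
    private final Map<String, R> entries;
    private int hits;
    private int misses;

    QueryCache(int capacity) {
        this.capacity = capacity;
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the capacity is exceeded.
        this.entries = new LinkedHashMap<String, R>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, R> eldest) {
                return size() > QueryCache.this.capacity;
            }
        };
    }

    /**
     * Assumed notion of "similar": queries that differ only in whitespace.
     * Note that this also collapses whitespace inside string literals;
     * a real implementation would need to parse the query.
     */
    static String normalize(String sql) {
        return sql.trim().replaceAll("\\s+", " ");
    }

    /** Returns the cached result, or runs the executor and stores its result. */
    synchronized R get(String sql, Function<String, R> executor) {
        String key = normalize(sql);
        if (entries.containsKey(key)) {
            hits++;
            return entries.get(key);
        }
        misses++;
        R result = executor.apply(sql);
        entries.put(key, result);
        return result;
    }

    /** Drops every entry, e.g. after a load has changed the warehouse. */
    synchronized void invalidateAll() {
        entries.clear();
    }

    synchronized int hits() { return hits; }

    synchronized int misses() { return misses; }
}
```

A caller would wrap query execution as `cache.get(sql, q -> runAgainstWarehouse(q))`, where `runAgainstWarehouse` is a hypothetical placeholder for the real query path. Calling `invalidateAll()` after each ETL load is one simple way to avoid serving stale results, at the cost of a cold cache after every load.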

1.2 PROBLEM STATEMENT

- Technical challenges in moving, integrating, and transforming data from disparate environments
- Short load windows, long load times
- Inconsistent, difficult-to-maintain business rules
- Lack of exposure of business rules to end users
- Source systems missing certain critical data
- Poor query performance

1.3 PROJECT OVERVIEW

Extract, transform and load (ETL) is the core process of data integration and is typically associated with data warehousing. ETL tools extract data from a chosen source, transform it into new formats according to business rules, and then load it into a target data structure. The increasing diversity of data sources and the high volumes of data that ETL must accommodate make management, performance and cost the primary challenges for users. ETL is a key process for bringing all the data together in a standard, homogeneous environment. ETL functions reshape the relevant data from the source systems into useful information to be stored in the data warehouse. Without these functions, there would be no strategic information in the data warehouse. If data taken from various sources is not cleansed, extracted, transformed and integrated properly, the query process, which is the backbone of the data warehouse, cannot take place. In this paper we propose an approach that increases the speed of extract, transform and load in a data warehouse with the support of a query cache. Because the query process is the backbone of the data warehouse, the query cache reduces response time and improves the performance of the data warehouse.

CHAPTER 2

SYSTEM ARCHITECTURE
ARCHITECTURE DIAGRAM

ETL tools are meant to extract, transform and load data into the data warehouse for decision making. Before the evolution of ETL tools, the ETL process was done manually, using SQL code written by programmers. This task was tedious and cumbersome in many cases, since it involved many resources, complex coding and long work hours. On top of that, maintaining the code posed a great challenge for the programmers.

Fig 1: Architecture diagram

These difficulties are eliminated by ETL tools, which are very powerful and, compared with the old method, offer many advantages in all stages of the ETL process: extraction, data cleansing, data profiling, transformation, debugging and loading into the data warehouse.

CHAPTER 3 SYSTEM DESIGN


DESIGN METHODOLOGY

Fig 2: System diagram

Extraction

Extraction is conceptually the simplest step; it aims to identify the subset of source data that should be submitted to the ETL workflow for further processing. In practice, this task is not easy, mainly because there must be minimum interference with the software configuration at the source side. This requirement is imposed by two factors: (a) the source must suffer minimum overhead during extraction, since other administrative activities also take place during that period, and (b) for both technical and political reasons, administrators are quite reluctant to accept major interventions in their systems' configuration.

There are four policies for extracting data from a data source. The naive one processes the whole data source in each execution of the ETL process; however, this policy is usually impractical because of the volumes of data that have to be processed. Another idea is to use triggers at the source side; typically, though, this method is not practical either, because of the above-mentioned requirement of minimum overhead at the source site, the intervention in the source's configuration and, possibly, the non-applicability of this solution when the source is of legacy technology. In practice, the two realistic policies either consider only the newly changed (inserted, deleted or updated) operational records (e.g., by using appropriate timestamps at the source sites) or parse the log files of the system in order to find the modified source records. In any case, this phase is quite heavy and is therefore executed periodically, when the system is idle.
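The timestamp-based policy can be illustrated with a short Java sketch. It is only an approximation under assumptions that are not in the report: the SourceRecord and IncrementalExtractor classes are hypothetical, the source is modeled as an in-memory list instead of a live database, and every source row is assumed to carry a last-modified timestamp. Hard deletes leave no timestamped row behind, so this sketch only sees inserts and updates.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical source row carrying a last-modified timestamp (epoch millis). */
class SourceRecord {
    final long id;
    final String payload;
    final long lastModified;

    SourceRecord(long id, String payload, long lastModified) {
        this.id = id;
        this.payload = payload;
        this.lastModified = lastModified;
    }
}

/** Extracts only the rows modified since the previous run. */
class IncrementalExtractor {
    // Highest lastModified value already handed to the ETL workflow.
    private long watermark;

    IncrementalExtractor(long initialWatermark) {
        this.watermark = initialWatermark;
    }

    List<SourceRecord> extract(List<SourceRecord> source) {
        List<SourceRecord> changed = new ArrayList<>();
        long newWatermark = watermark;
        for (SourceRecord record : source) {
            if (record.lastModified > watermark) {
                changed.add(record);
                newWatermark = Math.max(newWatermark, record.lastModified);
            }
        }
        // Advance the watermark only after the whole pass has completed.
        watermark = newWatermark;
        return changed;
    }

    long watermark() {
        return watermark;
    }
}
```

Against a real source, the same comparison would typically be pushed into the query itself (a WHERE clause on the timestamp column) so that unchanged rows never leave the source system.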

Transformation & Cleaning

After their extraction from the sources, the data are transported into an intermediate storage area, where they are transformed and cleansed. That area is frequently called the Data Staging Area (DSA); physically, it can reside either on a separate machine or on the one used for the data warehouse. The transformation and cleaning tasks constitute the core functionality of an ETL process. Depending on the application, different problems may exist and different kinds of transformations may be needed. The problems can be categorized as follows: (a) schema-level problems: naming and structural conflicts, including granularity differences; (b) record-level problems: duplicated or contradicting records, and consistency problems; and (c) value-level problems: several low-level technical problems, such as different value representations or different interpretations of the values.

To deal with such issues, the integration and transformation tasks involve a wide variety of functions, such as normalizing, denormalizing, reformatting, recalculating, summarizing, merging data from multiple sources, modifying key structures, adding an element of time, identifying default values, supplying decision commands to choose between multiple sources, and so forth. Usually the transformation and cleaning operations are executed in pipelined order. However, it is not always feasible to pipeline the data from one process to another without intermediate stops. On the contrary, several blocking operations may exist, and temporary data stores are frequently present. At the same time, some records may not pass through some operations, either because of data quality problems or because of system failures. In such cases, these data are temporarily quarantined and processed via special-purpose workflows, often involving human intervention.
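The pipelined execution with quarantine described above can be sketched as follows. This is a simplified illustration, not the report's design: the TransformPipeline class is hypothetical, records are modeled as plain strings, and a step is assumed to signal a data quality problem by throwing a RuntimeException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Applies transformation steps in order and quarantines records that fail. */
class TransformPipeline {
    private final List<UnaryOperator<String>> steps = new ArrayList<>();
    private final List<String> quarantined = new ArrayList<>();

    /** Appends a step; steps run in the order in which they were added. */
    TransformPipeline addStep(UnaryOperator<String> step) {
        steps.add(step);
        return this;
    }

    /** Returns the records that passed every step, in their original order. */
    List<String> run(List<String> records) {
        List<String> clean = new ArrayList<>();
        for (String record : records) {
            try {
                String current = record;
                for (UnaryOperator<String> step : steps) {
                    current = step.apply(current);
                }
                clean.add(current);
            } catch (RuntimeException e) {
                // Keep the untouched original for a special-purpose workflow.
                quarantined.add(record);
            }
        }
        return clean;
    }

    /** Records set aside so far, awaiting manual or special handling. */
    List<String> quarantined() {
        return quarantined;
    }
}
```

Keeping the original, untransformed record in the quarantine list is a deliberate choice here: whoever reviews it later sees exactly what arrived from the source, not a half-transformed value.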

Fig 3: Transforming and Loading

Loading

After the appropriate transformations and cleaning operations have been applied, the data are loaded into the respective fact or dimension table of the data warehouse. There are two broad categories of solutions for loading data: bulk loading through a DBMS-specific utility, or inserting data as a sequence of rows. Clear performance reasons strongly suggest the former solution, because of the overheads of parsing the insert statements and of maintaining logs and rollback segments (or the risks of deactivating them in case of failures).

A second issue is the need to efficiently distinguish records that are to be inserted for the first time from records that act as updates to previously loaded data. DBMSs typically support some declarative way to deal with this problem (e.g., the MERGE command). In addition, simple SQL commands are not sufficient, since the open-loop-fetch technique, where records are inserted one by one, is extremely slow for the vast volume of data to be loaded into the warehouse.

A third performance issue that the administration team has to take into consideration is the existence of indexes, materialized views or both, defined over the warehouse relations. Every update to these relations automatically incurs the overhead of maintaining the indexes and the materialized views.
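As an illustration of the insert-versus-update issue, the sketch below sends Oracle-style MERGE statements in JDBC batches. It is not the DBMS-specific bulk loader that the text recommends; JDBC batching is used here only as a middle ground between such a utility and row-by-row inserts. The DimensionLoader class, the customer_dim table and its columns are hypothetical names chosen for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Loads rows into a hypothetical dimension table using batched MERGE. */
class DimensionLoader {
    // Oracle-style MERGE: update the row if the key exists, insert otherwise.
    static final String MERGE_SQL =
        "MERGE INTO customer_dim d "
      + "USING (SELECT ? AS customer_id, ? AS name FROM dual) s "
      + "ON (d.customer_id = s.customer_id) "
      + "WHEN MATCHED THEN UPDATE SET d.name = s.name "
      + "WHEN NOT MATCHED THEN INSERT (customer_id, name) "
      + "VALUES (s.customer_id, s.name)";

    /** Each row is {Long customerId, String name}; returns the rows sent. */
    static int load(Connection conn, List<Object[]> rows, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one transaction for the whole load
        int sent = 0;
        try (PreparedStatement ps = conn.prepareStatement(MERGE_SQL)) {
            for (Object[] row : rows) {
                ps.setLong(1, (Long) row[0]);
                ps.setString(2, (String) row[1]);
                ps.addBatch();
                sent++;
                if (sent % batchSize == 0) {
                    ps.executeBatch(); // flush a full batch to the server
                }
            }
            if (sent % batchSize != 0) {
                ps.executeBatch(); // flush the final partial batch
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return sent;
    }
}
```

The statement is parsed once and reused for every row, which addresses the parsing overhead mentioned above. The overhead of logging and of index and materialized view maintenance remains.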

CHAPTER 4 SYSTEM FLOW CHART

Fig 4.1: Level 1 DFD for ETL

Fig 4.2: Level 2 DFD for ETL

Fig 4.3: Level 3 DFD for ETL

CHAPTER 5 IMPLEMENTATION DETAILS


This project is implemented using Java. I have used two editors, namely JCreator v3.5 and NetBeans v6.0. A brief introduction to Java is given below.

5.1 JAVA

Java is a programming language developed by Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture.

5.2 JDBC

The Java Database Connectivity (JDBC) API is the industry standard for database-independent connectivity between the Java programming language and a wide range of databases: SQL databases and other tabular data sources, such as spreadsheets or flat files. The JDBC API provides a call-level API for SQL-based database access. JDBC technology allows you to use the Java programming language to exploit "Write Once, Run Anywhere" capabilities for applications that require access to enterprise data. With a JDBC technology-enabled driver, you can connect all corporate data, even in a heterogeneous environment.


5.3 ORACLE

The Oracle RDBMS stores data logically in the form of tablespaces and physically in the form of data files ("datafiles"). Tablespaces can contain various types of segments, such as data segments and index segments. Segments in turn comprise one or more extents. Extents comprise groups of contiguous data blocks. Data blocks form the basic units of data storage. Newer versions of the database also include a partitioning feature, which allows tables to be partitioned based on different sets of keys. Specific partitions can then be easily added or dropped to help manage large data sets.

5.4 JBOSS

JavaBeans Open Source Software Application Server (JBoss AS, or simply JBoss) is an application server that implements the Java Platform, Enterprise Edition (Java EE). JBoss is written in Java and as such is cross-platform: it is usable on any operating system that supports Java. JBoss was developed by JBoss, now a division of Red Hat. Licensed under the terms of the GNU Lesser General Public License, JBoss is free and open source software.

