
ABINITIO TRAINING

3/5/2014

DAY ONE

Introduction to Data warehouse
ETL
AbInitio
AbInitio Features
Architecture
GDE
CO>Operating System
EME
Setting up Environment
Data set types and Components
Data types and DML
I/P File, O/P file, Intermediate file and Lookup file
Filter by expression, Replicate, Reformat and Redefine


Introduction to Data warehouse


A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.


ETL

Extract: Reading the source data.
Transform: Applying business, transformation, and technical rules.
Load: Loading the data.
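
The three ETL steps can be illustrated with a minimal Python sketch. It is not AbInitio code: the sample data, the function names and the list standing in for a target table are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical source data; a real job would read a file or a table.
SOURCE = "id,amount\n1, 10.50\n2, 4.25\n"

def extract(text):
    """Read the source data into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Apply simple technical rules: trim whitespace and cast types."""
    return [{"id": int(r["id"]), "amount": float(r["amount"].strip())} for r in records]

def load(records, target):
    """Append the records to the target (a list standing in for a table)."""
    target.extend(records)
    return len(records)

warehouse = []
loaded = load(transform(extract(SOURCE)), warehouse)
```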


AbInitio
AbInitio is Latin for "From the Beginning".
AbInitio software is a general-purpose data processing platform for mission-critical applications such as:
Data warehousing
Batch processing
Click-stream analysis
Data movement
Data transformation


AbInitio Features

Transformation of disparate sources.
Aggregation and other processing.
Referential integrity checking.
Database loading.
Extraction for external processing.
Aggregation and loading of data marts.
Processing just about any form and volume of data.
Parallel sort/merge processing.
Data transformation.
Rehosting of corporate data.
Parallel execution of existing applications.

Architecture
Layers, from top to bottom:
User Application
Development Environment: GDE, Shell
Component Library, User-defined Components, 3rd-party Components
AbInitio Co>Operating System
Native Operating System
EME


GDE


CO>Operating System

Parallel and distributed application execution.
Control.
Data transport.
Transactional semantics at the application level.
Checkpointing.
Monitoring and debugging.
Parallel file management.
Metadata-driven components.


CO>Operating System

The AbInitio Co>Operating System runs on:
Sun Solaris
IBM AIX
Hewlett-Packard HP-UX
Siemens Pyramid Reliant Unix
IBM DYNIX/ptx
Silicon Graphics IRIX
Red Hat Linux
Windows NT 4.0 (x86)
Windows NT 2000 (x86)
Compaq Tru64 UNIX
IBM OS/390
NCR MP-RAS

EME
Repository for version control.
Used for documentation.


Setting up Environment


Data set types and Components


Data Set Components
Flow Components
Transform Components
Partitioning Components


Data types and DML


Types
Base: Void, Number (Integer, Decimal, Real), String, Date, Datetime
Compound: Vector, Record, Union


DML
Used to define the complete record structure.
Can be defined either in grid mode or in text mode.
Can be stored in a file, which can then be referenced multiple times, or can be embedded.
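
As a rough analogy only (this is Python, not DML syntax), a record structure can be thought of as an ordered list of field names and types that tells a reader how to interpret each line of data. The field names, the `|` delimiter and the `parse_record` helper below are assumptions for illustration.

```python
from datetime import date, datetime

# Hypothetical record structure: (field name, converter) pairs, in field order.
RECORD_FORMAT = [
    ("cust_id", int),
    ("name", str),
    ("joined", lambda s: datetime.strptime(s, "%Y-%m-%d").date()),
]

def parse_record(line, fmt=RECORD_FORMAT, delimiter="|"):
    """Split one delimited line and convert each field as the format dictates."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(fmt):
        raise ValueError(f"expected {len(fmt)} fields, got {len(values)}")
    return {name: convert(value) for (name, convert), value in zip(fmt, values)}

record = parse_record("42|Ada|2014-03-05\n")
```

Keeping the format in one shared definition, rather than repeating it in every reader, mirrors the idea of storing a record format in a file that is referenced multiple times.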


I/P File, O/P file, Intermediate file and Lookup file


Input File: Reads data records from a serial file or multi file in the file system.

Output File: Writes data records to a serial file or multi file in the file system.

Intermediate File: Writes data records to a file in the middle of the graph. Helps in debugging and in further processing of the intermediate data.

Lookup File: Represents one or more serial files, or a multi file, of data records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk.

A Lookup File is not connected to other components in the graph.
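
The lookup idea can be sketched in Python as an in-memory index that a transform function consults for each record. The data, the `lookup_index` name and the `add_country` function are hypothetical and only illustrate the principle.

```python
# Hypothetical lookup data: a small reference dataset keyed by country code.
LOOKUP_ROWS = [
    {"code": "DE", "country": "Germany"},
    {"code": "IN", "country": "India"},
]

# Build the in-memory index once, before processing any records.
lookup_index = {row["code"]: row for row in LOOKUP_ROWS}

def add_country(record):
    """Transform function that enriches a record from the in-memory lookup."""
    match = lookup_index.get(record["code"])
    return {**record, "country": match["country"] if match else None}

enriched = [add_country(r) for r in [{"id": 1, "code": "IN"}, {"id": 2, "code": "XX"}]]
```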



Filter By Expression, Replicate, Reformat and Redefine


Filter by Expression: Enables the user to track down a particular record or records, or to put together a sample of records to assist with analysis.

Filters the data based on an expression that identifies only the records you need. Can also be used for data validation.
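
A minimal Python sketch of filtering by an expression, used here for validation. The function name, the sample records and the decision to also return the non-matching records are assumptions for illustration.

```python
def filter_by_expression(records, select_expr):
    """Split records into those that satisfy the expression and those that do not."""
    selected, deselected = [], []
    for record in records:
        (selected if select_expr(record) else deselected).append(record)
    return selected, deselected

records = [{"id": 1, "amount": 120}, {"id": 2, "amount": -5}, {"id": 3, "amount": 40}]
# Validation rule (hypothetical): an amount must be positive.
valid, invalid = filter_by_expression(records, lambda r: r["amount"] > 0)
```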

Replicate: Used when the user wants to make multiple copies of a flow for separate processing.


Filter By Expression, Replicate, Reformat and Redefine

Reformat: Changes the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records. Manipulates one record at a time and does work such as validation and cleansing, e.g. deleting bad values, setting default values, standardizing field formats, or rejecting records with invalid dates.


Filter By Expression, Replicate, Reformat and Redefine

Transformation rules are defined for transform (0). The Reformat component is used to clean input data so that all of the records conform to the same convention.
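
The kind of per-record work described for Reformat can be sketched in Python as follows. The field names, the date convention and the use of `None` to signal a rejected record are assumptions, not AbInitio behaviour.

```python
from datetime import datetime

def reformat(record):
    """Return a cleaned output record, or None to reject the input record."""
    try:
        # Standardize the date format; an invalid date rejects the record.
        dob = datetime.strptime(record["dob"], "%d/%m/%Y").date().isoformat()
    except ValueError:
        return None
    return {
        "name": record["name"].strip().title(),     # standardize field format
        "country": record["country"] or "UNKNOWN",  # set a default value
        "dob": dob,
        # Any other input field (e.g. internal_flag) is dropped.
    }

clean = reformat({"name": " ada ", "country": "", "dob": "05/03/2014", "internal_flag": "x"})
rejected = reformat({"name": "Bob", "country": "DE", "dob": "31/02/2014"})
```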

Redefine: Copies data records from its input to its output without changing the values in the data records. Used to change or rename fields in a record format without changing the values in the records.


DAY TWO

Sort, Sort within Group, Dedup Sort
Rollup and Scan
Reject, Error Handling and Debugging


Sort, Sort within Group, Dedup Sort


Sort: Used to sort a group of records in a specific order by a key. Looks at all the records in the flow before it produces the final output.


Sort, Sort within Group, Dedup Sort


Sort Within Group: Refines the order of a sorted dataset by further sorting it according to a minor key parameter within the order already specified by a major key parameter. Imposes an order on the records within each major-key group according to the minor key.
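
A minimal Python sketch of the idea, assuming the input is already ordered on the major key. The function and field names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def sort_within_group(records, major_key, minor_key):
    """Sort each run of equal major-key records by the minor key.

    Assumes the input is already grouped (sorted) on the major key.
    """
    output = []
    for _, group in groupby(records, key=itemgetter(major_key)):
        output.extend(sorted(group, key=itemgetter(minor_key)))
    return output

rows = [
    {"region": "east", "sales": 30},
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 20},
    {"region": "west", "sales": 5},
]
ordered = sort_within_group(rows, "region", "sales")
```

Only one group has to be held and sorted at a time, which is the practical difference from re-sorting the whole dataset on both keys.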

Sort, Sort within Group, Dedup Sort


Dedup Sort: Used to remove duplicate records (a group of records that share the same key), keeping a single record.

What it does:
1. First sort the data.
2. Set the key for grouping in the dedup component.
3. Finally choose which duplicate to keep.
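
The three steps can be sketched in Python as below. The `keep` option with the values "first" and "last" is an assumption for illustration, as are the function and field names.

```python
from itertools import groupby
from operator import itemgetter

def dedup_sorted(records, key, keep="first"):
    """Sort on the key, group equal keys, and keep one record per group."""
    by_key = itemgetter(key)
    output = []
    # sorted() is stable, so "first" and "last" follow the input order.
    for _, group in groupby(sorted(records, key=by_key), key=by_key):
        group = list(group)
        output.append(group[0] if keep == "first" else group[-1])
    return output

rows = [{"id": 2, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
first_kept = dedup_sorted(rows, "id", keep="first")
last_kept = dedup_sorted(rows, "id", keep="last")
```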


Rollup and Scan


Rollup: Produces a single record from a group of records identified by a common key (or keys). Useful for summarizing groups of records, e.g. totals, averages, max, min etc.
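
A minimal Python sketch of a rollup, producing one summary record per key. The function, field and output names are assumptions for illustration.

```python
from collections import defaultdict

def rollup(records, key, field):
    """Produce one summary record per distinct key value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[field])
    return [
        {key: k, "total": sum(v), "average": sum(v) / len(v), "max": max(v), "min": min(v)}
        for k, v in groups.items()
    ]

sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 30},
    {"region": "west", "amount": 20},
]
summary = rollup(sales, "region", "amount")
```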


Rollup and Scan


Scan: Generates a series of cumulative summary records, such as successive year-to-date totals, for groups of data records. Produces intermediate summary records.
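
In contrast to a rollup, a scan emits one output record per input record, carrying the running summary so far. A Python sketch with hypothetical names:

```python
def scan(records, key, field):
    """Emit every input record together with the running total for its key."""
    running = {}
    output = []
    for record in records:
        running[record[key]] = running.get(record[key], 0) + record[field]
        output.append({**record, "running_total": running[record[key]]})
    return output

monthly = [
    {"acct": "a", "month": 1, "amount": 10},
    {"acct": "a", "month": 2, "amount": 5},
    {"acct": "b", "month": 1, "amount": 7},
]
year_to_date = scan(monthly, "acct", "amount")
```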


Reject, Error Handling and Debugging

Invalid data goes to the reject port.
The reject-threshold parameter is set inside the component.
GDE has a built-in debugger.
Add a Watcher File.
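
The idea of routing bad records to a reject output and stopping once too many have been rejected can be sketched in Python. This does not model AbInitio's actual threshold settings; `process`, `reject_limit` and `RejectThresholdExceeded` are hypothetical names.

```python
class RejectThresholdExceeded(Exception):
    """Raised when more records are rejected than the limit allows."""

def process(records, transform, reject_limit):
    """Apply transform to each record; collect failures as rejects."""
    output, rejects = [], []
    for record in records:
        try:
            output.append(transform(record))
        except (ValueError, KeyError) as error:
            rejects.append((record, str(error)))
            if len(rejects) > reject_limit:
                raise RejectThresholdExceeded(
                    f"{len(rejects)} rejects exceed limit {reject_limit}"
                )
    return output, rejects

good, bad = process(["1", "x", "3"], int, reject_limit=1)
```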


DAY THREE
Join
Multi Files
Parallelism
Partition and De Partition
Layout, Fan-in, Fan-out and All-to-All


Join
Join: Used to combine data from two or more flows of records based on a matching key (or keys).
Join deals with two activities:
1. Transforming data sources with different record formats.
2. Combining data sources with the same record format.


Join

Join types:
Inner Join
Full Outer Join
Explicit Join

Inner Join: Uses only records with matching keys on both inputs.
Full Outer Join: Uses all records from both inputs. If a record from one input does not have a matching record in the other input, a NULL record is used for the missing record.

Join
Explicit Join: Uses all records in one specified input (selected by True/False settings), but records with matching keys in the other inputs are optional. Again, a NULL record is used for the missing records.
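
The three join types can be sketched with a single Python function for two inputs. The parameter names `in0_required` and `in1_required` are hypothetical stand-ins for the True/False settings, and `None` stands in for the NULL record.

```python
from collections import defaultdict

def join(in0, in1, key, in0_required=True, in1_required=True):
    """Join two inputs on key; a False flag makes that input's record optional."""
    index1 = defaultdict(list)
    for r1 in in1:
        index1[r1[key]].append(r1)
    output, matched = [], set()
    for r0 in in0:
        matches = index1.get(r0[key], [])
        if matches:
            matched.add(r0[key])
            output.extend((r0, r1) for r1 in matches)
        elif not in1_required:
            output.append((r0, None))  # in1 record missing: use NULL
    if not in0_required:
        for k, rows in index1.items():
            if k not in matched:
                output.extend((None, r1) for r1 in rows)  # in0 record missing
    return output

customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
orders = [{"id": 2, "amt": 5}, {"id": 3, "amt": 7}]
inner = join(customers, orders, "id")
full_outer = join(customers, orders, "id", in0_required=False, in1_required=False)
explicit = join(customers, orders, "id", in1_required=False)  # keep all customers
```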

Multi Files

Essentially the global view of a set of ordinary files, each of which may be located anywhere the AbInitio Co>Operating System is installed.
Each partition of a multi file is an ordinary file.
Resides in multi directories.
Identified using URL syntax with mfile: as the protocol part.
Has one control file.


Parallelism

Processing of datasets in parallel for better performance.
Types of Parallelism:
1. Component
2. Pipeline
3. Data
Component Parallelism: When more than one component is running at the same time on different data streams. Comes for free with graph programming. Limitation: Scales only to the number of branches in a graph.


Parallelism
Pipeline Parallelism: When two or more connected components run at the same time, each processing records as the upstream component produces them. Limitations: Scales only to the length of the branches in a graph. Some operations, like sorting, do not pipeline.

Data Parallelism: Occurs when multiple copies of a process act on different sets of data at the same time. Processes the whole more quickly by using multiple CPUs at the same time.

Partition and De Partition


Partition: Used to divide data sets into multiple sets for further processing. Types:

The Partition by Expression component partitions data by dividing it according to a DML expression.
The Partition by Key component partitions data by grouping it by a key, like dealing cards into piles according to their suit.


Partition and De Partition

The Partition with Load Balance component partitions data by dynamic load balancing. More data goes to CPUs that are less busy and vice versa, thus maximizing throughput.
The Partition by Percentage component partitions data by distributing it so that the output is proportional to fractions of 100.
The Partition by Range component partitions data by dividing it evenly among nodes, based on a key and a set of partitioning ranges.
The Partition by Round-robin component partitions data by distributing it evenly, in block-size chunks, across the output partitions, like dealing cards.
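
Two of these strategies, partitioning by key and by round-robin, can be sketched in Python. The function names and the choice of CRC32 as the hash are assumptions for illustration and say nothing about how the actual components are implemented.

```python
import zlib

def partition_by_key(records, key, n):
    """Send records with equal key values to the same partition."""
    partitions = [[] for _ in range(n)]
    for record in records:
        # crc32 is a stable hash, so equal keys always land together.
        partitions[zlib.crc32(str(record[key]).encode()) % n].append(record)
    return partitions

def partition_by_round_robin(records, n, block_size=1):
    """Deal records to the partitions in turn, block_size records at a time."""
    partitions = [[] for _ in range(n)]
    for i, record in enumerate(records):
        partitions[(i // block_size) % n].append(record)
    return partitions

dealt = partition_by_round_robin(list(range(6)), 3)
```

Partitioning by key keeps related records together, which matters for key-based processing downstream; round-robin gives an even spread regardless of the data.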


Partition and De Partition


De Partition: Departition components read data from multiple flows or operations and are used to recombine data records from different flows. The opposite of Partition. Types:

The Concatenate component produces a single output flow that contains first all the records from the first input partition, then all the records from the second input partition, and so on.

The Gather component collects inputs from multiple partitions in an arbitrary manner, and produces a single output flow. It does not maintain sort order, but is the most efficient departitioner.

Partition and De Partition


The Interleave component collects records from many sources in round-robin fashion. The effect is like taking a card from each player in turn, forming a deck of cards.

The Merge component collects inputs from multiple sorted partitions and maintains the sort order.
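
The difference between the Concatenate, Interleave and Merge departitioners can be sketched in Python on small lists. The function names are hypothetical; Gather is omitted because its output order is arbitrary.

```python
import heapq
from itertools import chain, zip_longest

def concatenate(partitions):
    """All records of the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))

def interleave(partitions):
    """Take one record from each partition in turn (round-robin)."""
    missing = object()  # marks exhausted partitions
    rounds = zip_longest(*partitions, fillvalue=missing)
    return [r for row in rounds for r in row if r is not missing]

def merge(partitions, key=None):
    """Combine individually sorted partitions, preserving the sort order."""
    return list(heapq.merge(*partitions, key=key))

parts = [[1, 9], [2, 3], [4]]
concatenated = concatenate(parts)
interleaved = interleave(parts)
merged = merge(parts)
```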

Layout, Fan-in, Fan-out and All-to-All


Layout: Determines the location of a resource. Either serial or parallel.
Fan-In: When a departition component collects data from different flows after partitioning, a special symbol appears on the flow.
Fan-Out: When a partition component divides a dataset into multiple sets for further processing, a special symbol appears on the flow.

All-to-All:


DAY FOUR
DBC File, Input Table, Output Table, Join with DB
Sub Graph, Phasing, Check Point, Recovery
Normalize, Denormalize Sorted


DBC File, Input Table, Output Table, Join with DB


DBC File: Required by AbInitio to connect to any database system. By default it has the extension .dbc.
DBC file fields:

The dbms_version field is the version of your database.
The db_home field is the location of your database software (ORACLE_HOME).
The db_name field is the value of the identifier for your database instance. For Oracle, this is the value of the ORACLE_SID environment variable. For SQL*Net, use @db_name.
The db_nodes field is a list of database-accessible nodes with Ab Initio installed. Note: If Oracle is on an SMP machine, you usually use one host name; if you are running Oracle OPS (parallel), you may need a list of all the nodes the database runs on.
The #user and #password comment fields list your name and password. If your database is Oracle and you are identified externally, leave these fields as comments.


DBC File, Input Table, Output Table, Join with DB


Input Table: Unloads data records from a database into an AbInitio graph. Lets you specify as the source either a database table or an SQL statement that selects data records from one or more tables.

DBC File, Input Table, Output Table, Join with DB


Output Table: Loads data records from a graph into a database. Specify the records' destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables. By default, calls the database fast loader to perform the output operation(s).

DBC File, Input Table, Output Table, Join with DB


Join with DB: Joins records from the flow or flows connected to its input port with records read directly from a database, and outputs new records containing data based on, or calculated from, the joined records.


Sub Graph, Phasing, Check Point, Recovery


Sub Graph: A logical subset of a graph. Used for manageability.
Phasing: Breaking an application into separate processing units.
Breaking an application into phases limits the contention for:
Main memory.
Processor(s).
Breaking an application into phases costs:
Disk space.


Sub Graph, Phasing, Check Point, Recovery


Check Point: Any phase break can be a checkpoint

Recovery:


Normalize, Denormalize Sorted


Normalize: Generates multiple data records from each input data record; you can specify the number of output records, or the number of output records can depend on a field or fields in each input data record. Separates a data record with a vector field into several individual records, each containing one element of the vector. Generates a series of output data records for each input data record by calling a transform function repeatedly.
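
A minimal Python sketch of the idea: a length function decides how many output records to produce, and a transform function is called once per index. The function names and record layout are assumptions for illustration.

```python
def normalize(record, length, transform):
    """Call transform once per index, producing one output record each time."""
    return [transform(record, index) for index in range(length(record))]

order = {"order_id": 7, "items": ["pen", "ink", "pad"]}
order_lines = normalize(
    order,
    length=lambda r: len(r["items"]),  # output count depends on the vector field
    transform=lambda r, i: {"order_id": r["order_id"], "item": r["items"][i]},
)
```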

Normalize, Denormalize Sorted


Denormalize Sorted: Consolidates groups of related data records into a single output record with a vector field for each group. Optionally computes summary fields in the output record for each group. Denormalize sorted requires grouped input.
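
The reverse operation can be sketched in Python as below, assuming the input is already grouped on the key. The function name, the output field naming and the `count` summary field are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def denormalize_sorted(records, key, field):
    """Collapse each group of records into one record with a vector field.

    Assumes the input is already grouped on the key.
    """
    output = []
    for k, group in groupby(records, key=itemgetter(key)):
        values = [r[field] for r in group]
        # "count" is an optional summary field computed per group.
        output.append({key: k, field + "s": values, "count": len(values)})
    return output

item_rows = [
    {"order_id": 7, "item": "pen"},
    {"order_id": 7, "item": "ink"},
    {"order_id": 8, "item": "pad"},
]
orders_out = denormalize_sorted(item_rows, "order_id", "item")
```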


DAY FIVE
Memory Management
Dead Lock
Sandbox Setting, Graph and Project Parameter
User defined function and Built-in functions


Memory Management
Memory is required for Sorting, Rollup and Join.
Input must be sorted vs. In-Memory Sort
AI_GRAPH_MAX_CORE_SETTING


Dead Lock
How to avoid Dead Lock:
Use Concatenate and Merge with care.
Use flow buffering (the GDE default for a new graph: Automatic Flow Buffering is enabled).
Insert a phase break before the departitioner.
Don't serialize data unnecessarily; repartition instead of departitioning.


Sandbox Setting, Graph and Project Parameter


Sandbox Setting: The work space is called a sandbox. Setting up a standard working environment helps a development team, or other teams, work together. Allows an application to be designed to be portable.


Sandbox Setting, Graph and Project Parameter


Default sandbox directories:
$AI_RUN: run directory
$AI_DML: record format files
$AI_XFR: transform files
$AI_MP: graphs
$AI_DB: database config files


Sandbox Setting, Graph and Project Parameter


A parameter is simply a name-value pair with a number of additional attributes. Parameters that reside in your sandbox are known as sandbox parameters; they set the context of your sandbox. Those that reside in the repository are called project parameters. Graph parameters apply only to the graph in which they are defined.

User defined function and Built-in functions


User-defined functions work like AbInitio built-in functions.
Global usability across applications.
Like built-in functions, stored as .XFR.
Built-in functions: next_in_sequence(), is_blank(), is_defined(), etc.
