
About * years of experience in IT, including * years of experience in analysis, design and development using Ab Initio.


Expertise in UNIX, Oracle, ETL tools, data mining and analysis tools, data access methods, and data modeling tools.
Good working knowledge of EME.
Extensive experience in data warehousing and in designing Extract, Transform and Load (ETL) environments.
Expert knowledge of UNIX shell scripting.
Handled data extraction from multiple source platforms.
Extensive experience in end-to-end testing of data warehousing ETL routines, including creating test cases, data quality testing and value checking.
Worked in production support and scheduling.
Good knowledge of Oracle.
Substantial knowledge of DB2.
Substantial troubleshooting experience.
Excellent analytical skills and the ability to collect and define business requirements.
Good team player with the ability to understand and adapt to new technologies and environments quickly.
Excellent written, oral and interpersonal communication skills.



TECHNICAL SKILLS:

DATA WAREHOUSING, REPORTING
AND DATA MODELING TOOLS : Ab Initio GDE 1.14.35/1.14.15/1.13.1, Co>Op 2.14.1/2.13.1, SQL*Plus, SQL*Loader, EME, TOAD
OPERATING SYSTEMS : Windows NT/98/95/2000, MS-DOS, UNIX
PROGRAMMING LANGUAGES : C, C++, Visual C++ 5.0/6.0, Java, COBOL
RDBMS : MS Access, Oracle 8.1/9i, DB2
GUI : Visual Basic 5.0/6.0
SCRIPTING LANGUAGES : Shell script





PROFESSIONAL EXPERIENCE:
AIG Insurance, NJ May 2007 - till date
Ab Initio Senior Developer/Analyst
Project: AMS (Adjustment Module System): The Adjustment Module is a complete, automated solution to address the manual data aggregation, data entry, booking and billing functions of the Fusion Units' retro, audit and reinsurance adjustment process. Adjustments are made by applying the actual exposure/losses to post-bind and post-audit transactions to develop an Adjusted Program Cost until the Final Adjustment date is reached.


Responsibilities:
Involved in Design and Development of the Interface module.
Communicated with business users and other teams to gather requirements and develop the module.
Communicated with end users and system administrators to develop a better understanding of production issues and help resolve them.
Developed an interface system that transfers files from one system to another, keeping both systems in sync.
Developed an Ab Initio graph that uses Java code to decompress compressed PDF files and store them in a directory.
Wrote wrapper scripts that call Ab Initio graphs and transfer files to another system.
Wrote shell script to generate XML document.
Performed reject analysis on graphs and made modifications to fix major and minor
bugs.
Performed several data quality checks and found potential issues, designed Ab Initio
graphs to resolve them.
Involved in enhancements of the project.
Developed a couple of one-time graphs for data conversion to suit the ongoing enhancements.
Involved in designing test plans.
Involved in Unit, System, Integration and Regression testing.
Gained experience in using various Ab Initio graph components.
Developed graphs for common utilities.
Tuning of Ab Initio graphs for better performance.
Hands on experience in using Autosys scheduling.
Used Squirrel to execute SQL queries and to verify the results from Ab Initio graphs.


Environment: Ab Initio (Co>Op 2.14, GDE 1.14.35), UNIX, DB2, EME,
Squirrel, Autosys and Windows NT



Sprint-Nextel Corp, KS Jun 2006 - Mar 2007
Ab Initio Developer
Project: MADISON: The project's main aim is to serve advertisements on mobile phones. The customer feed is processed using Ab Initio and sent to the Ad Server, which serves the advertisements to the mobile phones.


Responsibilities:
Involved in the Functional and detailed design of the project.
Gained experience in working with EME and project management.
Used several air utility commands in the Project Development.
Processed and transformed delta feeds of customer data, which arrive daily.
Developed graphs to unload data into tables and validate the new data in comparison
with already existing data.
Wrote several Shell scripts for Project maintenance (to remove old/unused files and
move raw logs to the archives).
Wrote Shell scripts to download unprocessed raw logs and upload processed logs
to the client server.
Developed dynamic graphs to load data from data sources into tables and to parse
records.
Extensive usage of the multifile system, where data is partitioned into four partitions for parallel processing.
Wide usage of lookup files when getting data from multiple sources and the size of the data is limited.
Used Ab Initio for error handling by attaching error and reject files to each transformation, making provision for capturing and analyzing the message and data separately.
Involved in Unit, System, Regression and Integration testing of the project.
Tuning of Ab Initio graphs for better performance
Hands on experience in using PLANIT.
Used Toad to verify the counts and results of the graphs.
Generated HTML reports using Ab Initio for Business Analysts, to make their job
easier in formulating new rules.


Environment: Ab Initio (Co>Op 2.14, GDE 1.14.15), UNIX, Oracle, EME,
TOAD and Windows NT



Allstate Insurance, IL Jun 2004 - May 2006
Ab Initio Developer
Project: AF-ADW: The project's main aim is to process the feeds from all the source systems by transforming them according to the business rules, load them into the ADW data warehouse (Oracle), and consolidate four ETL cycles into one ETL cycle.

Responsibilities:
Involved in the Functional design of the project.
Communicated with end users and system administrators to develop a better understanding of production issues and help resolve them.
Reviewed code for other developers and made performance improvements.
Created sandboxes and projects using air utility commands.
Experienced in using EME (Check-in, Check-out, Version Control).
Worked on System testing for the whole project.
Wrote a generic graph to create deltas for all the Admin systems.
Developed dynamic graphs to load data into Oracle tables and to parse records.
Worked with partition components like Partition-by-Key, Partition-by-Expression and Partition-by-Round-Robin to partition data from serial files using the multifile system.
Worked with Departition Components like Concatenate, Gather, Interleave and
Merge in order to departition and repartition data from Multifiles accordingly.
Performed transformations of source data with Transform Components like Join,
Dedup Sorted, Denormalize, Normalize, Reformat, Filter-by-Expression, Rollup.
Developed graphs to apply transformation on data according to business rules.
Created Summary tables using Rollup, Scan & Aggregate.
Tuning of Ab Initio Graphs for better performance.
Used CA-Unicenter scheduler for job scheduling.
Wrote many user-defined functions, which are used within Allstate.
Developed internal common graphs for unloading and loading data from the mainframe server to UNIX.
Wrote Shell scripts to generate reject reports.


Environment: Ab Initio (Co.Op 2.13, GDE 1.13), Solaris UNIX, CA-UNICENTER, Oracle 9i, EME, SQL *
Loader and Windows NT



The Kroger Co., Cincinnati, OH Mar 2002 - Mar 2004
ETL Developer
Project: Store Manager's Workbench (SMW) is an intranet tool that allows store managers and corporate users to measure the performance of stores and divisions in a variety of ways. SMW allows its users to tune the performance and profitability of stores by gathering, analyzing and displaying a wide variety of information. In short, SMW performs business functions crucial to Kroger's bottom line.

The source data comes from each store, collected during shelf reviews done every day, which gather information related to out-of-stock items, zero items and labor hours. This information is processed, loaded to the EDW and made available to store managers, division managers and others.

Responsibilities:
Involved in all phases of the ETL cycle and prepared a detailed-level design that depicts the transformation logic used in Ab Initio.
Responsible for collecting and documenting business requirements.
Responsible for designing the source-to-target mapping diagrams.
Developed generic graphs to extract data from various heterogeneous sources such as Oracle databases and flat files.
Designed and developed various Graphs with the collection of Multiple Source, Multiple Targets and
Transformations.
Responsible for creating Ab Initio graphs for landing the validated source data received from various divisions in multifiles, and for creating lookups for cross-reference.
Used Ab Initio EME as the repository for graph objects and performed check-in/check-out.
Extensively used database, dataset, partition / departition, transform, sort and partition components
for extracting, transforming and loading.
Developed a number of UNIX Korn shell wrapper scripts to control various Ab Initio processes and complex functionality such as automated FTP, remote shell execution and remote copy.
Wrote UNIX shell scripts to automate some of the data extraction and data loading processes.
Involved in Creation of Sandboxes and managing Enterprise Metadata using EME.

Environment: Ab Initio (Co-op 2.12 GDE 1.12), Oracle, UNIX and Window NT.


Sonali Castings (India) Pvt. Ltd., Jan 2001 - Dec 2001
Oracle Developer
Project: The project's aim was to develop databases, a Resource Management System, and Payroll and HR reports using Oracle and PL/SQL.

Responsibilities:
Involved in design and development of RMS (Resource Management System)
using ASP, as the front end and Oracle as the back end.
Participated in Integration and UAT testing.
Developed Master and Transaction forms.
Functions, Stored Procedures and Triggers were written to implement the business logic and auditing.
Extensively worked on XML and XSLT: getting the configuration settings depending on locale code, getting the user role code, implementing a customized show/hide matrix (used by ASP to display the permitted operations depending on user rights), and implementing a customized property XML file holding all the user details, the RMI port details and the number of library session connections to be opened during startup.
Wrote shell scripts for multiple purposes.
Developed Payroll, HR reports Using Reports 6i tool.

Environment: Oracle, PL/SQL, Win NT, ASP, XML.


ProMillennium Techsys (India) Pvt. Ltd., Apr 2000 - Dec 2000
Database Developer

Responsibilities:
Developed components to retrieve, display and administer the records stored in the Oracle database.
Provided onsite product and network support to clients for administration and maintenance.
Wrote PL/SQL programs, stored procedures, and database triggers.
Integrated and deployed components belonging to various modules in system test, integration test
and production environments.
Created database objects including tables, indexes, clusters, sequences, roles and privileges.
Created and coded Functional and Business constraints, packages and triggers.
Created database tables using various Normalization techniques & Database design rules.
Developed the ER diagrams showing the flow of data and the relations between tables.
Used DBCC to check physical and logical consistency of the database and solved page allocation errors
and Table corruption.
Uploaded data from MS Access and Excel to Oracle.
Received training for becoming an Oracle DBA.

Environment: PL/SQL, ORACLE, UNIX Scripting.

I am open to a Technical Lead position in an organization where I can best apply my knowledge
and experience in Business Intelligence for Technical Consulting, BI Practice and Project
Leadership & Management.

BRIEF OVERVIEW OF EXPERIENCE
Business Intelligence
* Working on BI projects since 2001, with a very good command of BI and Data Warehousing
(DWH) life cycle, tools and technologies
* About 6 years of experience with Ab Initio ETL tool
* Worked with large EDW support, enhancement, development projects - for Fortune 500 US
Credit Card, Banking and Insurance companies
* Lead person for setting up of BI Practice (ETL CoE) at Syntel

Project Leadership & Management
* Over six years of experience covering project leadership, design, development,
implementation, maintenance and operation of projects
* Worked on turnkey projects, using various development methodologies
* Proven experience in delivery management, architecture validation and response to RFPs
* Good experience of application development in domain of Business Intelligence and Data
Warehousing
Specialties
ETL, Ab Initio, Data Warehousing, Data Migration, UNIX Shell Scripting, Teradata, UDB DB2,
Oracle, SQL Server
Punit Doshi's Experience
Asst Vice President
JP Morgan Chase
Public Company; 10,001+ employees; JPM; Financial Services industry
March 2011 - Present (1 year)
Business Manager for GMRT Equities India team
Bank of America
Public Company; 10,001+ employees; BAC; Banking industry
August 2009 - March 2011 (1 year 8 months)
I am responsible for tracking and reporting all the project management deliverables and metrics pertaining to projects in Global Equities. I handle global work order approvals, resource and financial forecasting, cost variance, project profitability, risk management, and periodic collation and preparation of program status reports detailing program overviews, key milestones, issues and key performance indicators. I am also responsible for overseeing and coordinating all projects within the four streams of FLT, GMF&F, HiTouch and TradePlant to help in meeting the challenges of the overall project setup.
Developer / Sr. Specialist III
BCBSNC
Nonprofit; 1001-5000 employees; Insurance industry
January 2008 - July 2009 (1 year 7 months)
CDW Maintenance / EDW Development
Responsibilities:
* Lead for delivery of all CDW maintenance projects.
* Planning and leading Production fixes.
* Involvement in Planning, estimation, staffing, designing, developing and UAT for multiple
projects.
* Developed applications using Ab Initio (ETL), SQL, UNIX shell scripts.
* Designing new Ab Initio Graphs conforming to standards. Created large Ab Initio graphs with
complex transformation rules.
* Performance Tuning of the application.
* Code Review and Source Code Version Control (using EME).
Environment: Ab Initio, Control-M, Teradata
Project Lead / Architect (Contractor)
American Express
Public Company; 10,001+ employees; AXP; Financial Services industry
March 2006 - December 2007 (1 year 10 months)
Merchant Submissions

* Project lead for delivery of all data processing projects to the client handling multiple projects
in parallel.
* Planning and leading Production fixes and Change Controls.
* Involvement in Planning, estimation, staffing, designing, developing and UAT for multiple
projects.
* Analysis and understanding of the business logic of the system.
* Prepared wrapper scripts; designed and created graphs encompassing the business process using Continuous Flows with MQueue.
* Developed applications using Ab Initio (ETL), SQL, UNIX shell scripts.
* Designed and developed parameterized reusable Ab Initio components.
* Designing new Ab Initio Graphs conforming to standards. Created large Ab Initio graphs with
complex transformation rules using Ab Initio GDE.
* Handled Disaster Recovery preparation.
* Performance Tuning of the application.
* Code Review and Source Code Version Control (using EME).

Environment: Ab Initio, Control-M, Actuate, UDB DB2
BI Lead for Center of Excellence
Syntel
Public Company; 10,001+ employees; SYNT; Information Technology and Services industry
September 2004 - February 2006 (1 year 6 months)
Responsibilities:
* Analysis and understanding of the business logic of systems.
* Identification and Development of reusable components for different applications.
* Performance Tuning of the applications.
* Code Review and Source Code Version Control using EME.
* Evaluate technologies in the BI practice
* Pre-sales activities involving estimation, architecting solution, presentation etc.
* Defining processes and best practices
* Conducting interviews and team building
* Cross training consultants within the organization in the BI practice

Environment: BI Suite of Products
ETL Project Lead (Contractor)
Allstate
Public Company; 10,001+ employees; ALL; Insurance industry
April 2005 - June 2005 (3 months)
MaxUSA project

Responsibilities:

* Worked with the client CoE and consultants from Ab Initio Corp. to prepare a proposal for project tuning
* Worked as an onsite Tech Lead for co-ordinating work with offshore team
* Analysis and understanding of the business logic of the system.
* Designing the new application conforming to ETL standards.
* Identification and Development of reusable components for the entire application.
* Development of Ab Initio graphs for the services to be used
* Integration and Unit Testing of the application with all distributed components.
* Packaging and deployment into QA and PROD environments.
* Performance testing and fine tuning.

Environment: Ab Initio, Unix Shell scripting, DB2
Senior Developer
GEICO
Privately Held; 10,001+ employees; Insurance industry
February 2004 - August 2004 (7 months)
Re-Pricing of auto insurance for various States

Responsibilities:
* Developed the Ab Initio ETL process for calculating the Pricing logic for various states.
* Performed Integration/Regression testing of the graphs for their applications.
* Used Ab Initio to create summary tables using rollup and aggregate components.
* Prepared documentation of the ETL process being followed for each incoming data feed.
* Actively involved in Ab Initio Quality Assurance team in order to improve the performance of
different projects involved in EDW team.
* Interviewed prospective Ab Initio candidates for the company and helped management make
decisions regarding hiring suitable ETL candidates.
* Proposed and performed SDLC procedures for building the architecture for the RePricing application.

Environment: Ab Initio (Graphs, GDE, ETL), Unix shell scripts, Windows 2000, DB2EEE
Punit Doshi's Education
Worcester Polytechnic Institute
MS, Comp Sci
2001 - 2002

General Data Warehousing Interview Questions and Answers
What is a Data Warehouse?
Answer1:
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. Another definition: a data warehouse is a logical collection of information gathered from many different operational databases, used to create business intelligence that supports business analysis activities and decision-making tasks; primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP). Generally, data warehousing is not meant for current "live" data, although 'virtual' or 'point-to-point' data warehouses can access operational data. A 'real' data warehouse is generally preferred to a virtual DW because stored data has been validated and is set up to provide reliable results to common types of queries used in a business.

Answer2:
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources.
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the
requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently
than traditional relational databases.

What is ODS?
1. ODS means Operational Data Store.
2. A collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to ensure summarized and derived data is calculated properly. The ODS may further become the enterprise's shared operational database, allowing operational systems that are being reengineered to use the ODS as their operational database.

What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes.

What is a lookup table?
A lookup table is one which is used when updating a warehouse. When the lookup is placed on the target table (fact table/warehouse) based on the primary key of the target, it updates the table by allowing only new records or updated records, based on the lookup condition.
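The "apply only new or changed records" behaviour described above can be sketched with a plain in-memory lookup keyed on the target's primary key. All keys and values here are invented for illustration, not taken from any warehouse mentioned in this document.

```python
# Hypothetical example: target holds existing warehouse rows keyed by
# primary key; incoming is the feed to apply.
target = {101: "Alice", 102: "Bob"}
incoming = [(101, "Alice"), (102, "Robert"), (103, "Carol")]

applied = []
for key, value in incoming:
    if target.get(key) != value:      # new record, or value changed
        target[key] = value
        applied.append((key, value))

print(applied)
# [(102, 'Robert'), (103, 'Carol')]
```

Unchanged rows (key 101) are skipped; only the updated row (102) and the new row (103) touch the target, which is the point of the lookup condition.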

Why should you put your data warehouse on a different system than your OLTP system?
Answer1:
An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional model). That is why we design a separate, subject-oriented OLAP system. Moreover, a complex query fired on an OLTP system causes a heavy overhead on the OLTP server that directly affects the day-to-day business.

Answer2:
The loading of a warehouse will likely consume a lot of machine resources. Additionally, users may create queries or reports that are very resource-intensive because of the potentially large amount of data available. Such loads and resource needs will conflict with the needs of the OLTP systems for resources and will negatively impact those production systems.
What are Aggregate tables?
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. Retrieving the required data from the actual table, which may have millions of records, takes more time and also affects server performance. To avoid this we can aggregate the table to the required level and use it. These tables reduce the load on the database server, increase the performance of the query, and return results very quickly.
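A minimal sketch of building such a summary, using an in-memory SQLite database; the table and column names (sales, sales_by_store) are made up for illustration.

```python
import sqlite3

# Toy detail table standing in for millions of warehouse rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store TEXT, sale_date TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("S1", "2024-01-01", 10.0), ("S1", "2024-01-02", 15.0),
     ("S2", "2024-01-01", 20.0), ("S2", "2024-01-02", 5.0)],
)

# Build the aggregate table once; queries then read the small summary
# instead of scanning the detail rows.
con.execute(
    "CREATE TABLE sales_by_store AS "
    "SELECT store, SUM(amount) AS total_amount, COUNT(*) AS n_sales "
    "FROM sales GROUP BY store"
)

for row in con.execute("SELECT * FROM sales_by_store ORDER BY store"):
    print(row)
# ('S1', 25.0, 2)
# ('S2', 25.0, 2)
```

In a real warehouse the aggregate would be refreshed as part of the ETL cycle, traded off against the storage it consumes.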

What is Dimensional Modelling? Why is it important?
Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.

Why is Data Modeling Important?
Data modeling is probably the most labor intensive and time consuming part of the development process. Why bother
especially if you are pressed for time? A common response by practitioners who write on the subject is that you
should no more build a database without a model than you should build a house without blueprints.

The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.

The data model is also detailed enough to be used by database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

What is data mining?
Data mining is the process of extracting hidden trends within a data warehouse. For example, an insurance data warehouse can be used to mine data for the most high-risk people to insure in a certain geographical area.

What is ETL?
ETL stands for extraction, transformation and loading.

ETL provides developers with an interface for designing source-to-target mappings, transformations and job-control parameters.
Extraction
Take data from an external source and move it to the warehouse pre-processor database.
Transformation
Transform data task allows point-to-point generating, modifying and transforming data.
Loading
Load data task adds records to a database table in a warehouse.
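The three steps above can be sketched end to end in a few lines. The feed layout, the "amount must be non-negative" business rule, and the fact_payment table are all invented for illustration; a real ETL tool such as Ab Initio generalizes each step.

```python
import csv
import io
import sqlite3

# Extract: read records from a (simulated) external source.
raw_feed = "cust_id,amount\n1, 100 \n2, 250 \n3, -5 \n"
records = list(csv.DictReader(io.StringIO(raw_feed)))

# Transform: trim fields, convert types, reject rows failing the rule.
clean, rejects = [], []
for r in records:
    amount = float(r["amount"].strip())
    (clean if amount >= 0 else rejects).append((int(r["cust_id"]), amount))

# Load: add the good records to a warehouse table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_payment (cust_id INTEGER, amount REAL)")
con.executemany("INSERT INTO fact_payment VALUES (?, ?)", clean)

print(con.execute("SELECT COUNT(*) FROM fact_payment").fetchone()[0])  # 2
print(rejects)  # [(3, -5.0)]
```

Keeping the rejects separate, rather than silently dropping them, mirrors the reject/error files that ETL tools attach to each transformation.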

What does level of Granularity of a fact table signify?
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the
lowest level of information that will be stored in the fact table. This constitutes two steps:

Determine which dimensions will be included.
Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.

What is the Difference between OLTP and OLAP?
Main Differences between OLTP and OLAP are:-

1. User and System Orientation

OLTP: customer-oriented, used for data analysis and querying by clerks, clients and IT professionals.

OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).

2. Data Contents

OLTP: manages current data, very detail-oriented.

OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, stores
information at different levels of granularity to support decision making process.

3. Database Design

OLTP: adopts an entity relationship(ER) model and an application-oriented database design.

OLAP: adopts star, snowflake or fact constellation model and a subject-oriented database design.

4. View

OLTP: focuses on the current data within an enterprise or department.

OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.

What is SCD1 , SCD2 , SCD3?
SCD Stands for Slowly changing dimensions.

SCD1: only maintains updated values.

Ex: when a customer address is modified, we update the existing record with the new address.

SCD2: maintaining historical information and current information by using

A) Effective Date
B) Versions
C) Flags

or combination of these

SCD3: by adding new columns to the target table we maintain both historical and current information.
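The SCD2 approach (effective dates plus a current flag) can be sketched in SQL; the dim_customer table, its columns and the dates below are all hypothetical.

```python
import sqlite3

# Type-2 slowly changing dimension: each change inserts a new row
# instead of overwriting, so history is preserved.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    cust_key INTEGER PRIMARY KEY, cust_id INTEGER, address TEXT,
    eff_from TEXT, eff_to TEXT, is_current INTEGER)""")
con.execute("INSERT INTO dim_customer VALUES "
            "(1, 42, 'Old St', '2020-01-01', '9999-12-31', 1)")

# Customer 42 moves: expire the current row, then insert a new version.
con.execute("""UPDATE dim_customer SET eff_to = '2024-06-30', is_current = 0
               WHERE cust_id = 42 AND is_current = 1""")
con.execute("INSERT INTO dim_customer VALUES "
            "(2, 42, 'New Ave', '2024-07-01', '9999-12-31', 1)")

for row in con.execute("SELECT address, is_current FROM dim_customer "
                       "WHERE cust_id = 42 ORDER BY cust_key"):
    print(row)
# ('Old St', 0)
# ('New Ave', 1)
```

An SCD1 load would instead run a single UPDATE of the address in place; SCD3 would add, say, a prior_address column to the same row.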

Why are OLTP database designs not generally a good idea for a Data Warehouse?
Since in OLTP the tables are normalised, query response will be slow for the end user, and OLTP does not contain years of data, so it cannot be analysed.

What is BUS Schema?
A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.
What are the various Reporting tools in the Market?
1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity

What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?
1. Normalization is a process for assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.

2. Normalization is the analysis of functional dependency between attributes/data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields/relations.

1NF: Repeating groups must be eliminated, dependencies can be identified, and all key attributes are defined; there are no repeating groups in the table.

2NF: The table is already in 1NF and includes no partial dependencies (no attribute dependent on a portion of the primary key). It is still possible to exhibit transitive dependency: attributes may be functionally dependent on non-key attributes.

3NF: The table is already in 2NF and contains no transitive dependencies.
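A tiny illustration of the 2NF step above; the table and data are invented. In order_lines the key is (order_id, product_id), but product_name depends on product_id alone (a partial dependency), so it is split into its own relation.

```python
# Toy denormalized order-line rows: (order_id, product_id, product_name, qty).
# product_name repeats for every order containing the product.
order_lines = [
    (1, "P1", "Widget", 2),
    (1, "P2", "Gadget", 1),
    (2, "P1", "Widget", 5),
]

# Split off the partially dependent attribute into its own relation,
# leaving only fully key-dependent attributes on the order items.
products = {(pid, name) for (_, pid, name, _) in order_lines}
order_items = [(oid, pid, qty) for (oid, pid, _, qty) in order_lines]

print(sorted(products))  # [('P1', 'Widget'), ('P2', 'Gadget')]
print(order_items)       # [(1, 'P1', 2), (1, 'P2', 1), (2, 'P1', 5)]
```

The product name now appears once per product instead of once per order line, which is exactly the redundancy (and update anomaly) that 2NF removes.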

What is Fact table?
A fact table contains the measurements or metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process such as "monthly sales number" is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.

What are conformed dimensions?
Answer1:
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. Ex: the Date dimension is connected to all facts such as Sales facts, Inventory facts, etc.

Answer2:
Conformed dimensions are dimensions which are common to the cubes (cubes are the schemas containing fact and dimension tables). Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as the facts and dimensions; here D1 and D2 are the conformed dimensions.

What are the Different methods of loading Dimension tables?
Conventional Load:
Before loading the data, all the Table constraints will be checked against the data.

Direct load (faster loading):
All the constraints will be disabled and the data will be loaded directly. Later the data will be checked against the table constraints, and the bad data won't be indexed.

What is conformed fact?
Conformed facts are facts that carry the same definition and meaning across multiple data marts, so they can be used in combination with multiple fact tables accordingly.
What are Data Marts?
Data Marts are designed to help managers make strategic decisions about their business.
Data Marts are subset of the corporate-wide data that is of value to a specific group of users.

There are two types of Data Marts:

1. Independent data marts source from data captured from OLTP systems, external providers, or from data generated locally within a particular department or geographic area.

2. Dependent data marts source directly from enterprise data warehouses.

What is a level of Granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example: based on the design you can decide to put the sales data at the level of each transaction. Level of granularity then means what detail you are willing to put in for each transactional fact - product sales recorded for each minute, or aggregated up to the minute.

How are the Dimension tables designed?
Most dimension tables are designed using normalization principles up to 2NF. In some instances they are further normalized to 3NF.

Find where data for this dimension are located.

Figure out how to extract this data.

Determine how to maintain changes to this dimension.

What are non-additive facts?
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact
table.
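A quick numeric illustration of why a non-additive fact such as a ratio cannot be summed; the revenue and cost figures are invented.

```python
# Two fact rows: (revenue, cost). Margin = (revenue - cost) / revenue
# is a non-additive fact: summing per-row margins gives nonsense.
rows = [
    (100.0, 50.0),   # margin 0.5
    (300.0, 100.0),  # margin ~0.667
]

margins = [(rev - cost) / rev for rev, cost in rows]
wrong = sum(margins)  # ~1.167, a meaningless "total margin"

# The correct aggregate recomputes the ratio from additive components.
total_rev = sum(r for r, _ in rows)
total_cost = sum(c for _, c in rows)
right = (total_rev - total_cost) / total_rev  # 250 / 400 = 0.625

print(round(wrong, 3), round(right, 3))  # 1.167 0.625
```

This is why warehouses store the additive components (revenue, cost) in the fact table and derive ratios at query time.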

What type of indexing mechanism do we need to use for a typical data warehouse?
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of
clustered/non-clustered, unique/non-unique indexes.

To my knowledge, SQLServer does not support bitmap indexes. Only Oracle supports bitmaps.

What is a Snowflake Schema?
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.

What is real time data-warehousing?
Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time
activity is activity that is happening right now. The activity could be anything such as the sale of widgets. Once the
activity is complete, there is data about it.

Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it
occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into
the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for
deriving information from data as the data becomes available.

What are slowly changing dimensions?
SCD stands for slowly changing dimension. Slowly changing dimensions are of three types:

SCD1: only the updated value is maintained.
Ex: when a customer address is modified, we update the existing record with the new address.

SCD2: historical information and current information are both maintained by using
A) Effective dates
B) Versions
C) Flags
or a combination of these.

SCD3: historical information and current information are both maintained by adding new columns to the target table.
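The three strategies can be sketched on the customer-address example above. This is an illustrative Python sketch; the record layout and field names are assumptions, not from any particular ETL tool:

```python
from datetime import date

def scd1(dim_row, new_address):
    """SCD1: overwrite in place; history is lost."""
    dim_row["address"] = new_address
    return dim_row

def scd2(dim_rows, key, new_address, eff_date):
    """SCD2: expire the current row (flag + end date) and insert a new version."""
    current = [r for r in dim_rows if r["cust_key"] == key and r["current_flag"]][-1]
    current["current_flag"] = False
    current["end_date"] = eff_date
    dim_rows.append({
        "cust_key": key, "address": new_address,
        "version": current["version"] + 1,
        "eff_date": eff_date, "end_date": None, "current_flag": True,
    })
    return dim_rows

def scd3(dim_row, new_address):
    """SCD3: keep limited history in an extra column."""
    dim_row["prev_address"] = dim_row["address"]
    dim_row["address"] = new_address
    return dim_row

rows = [{"cust_key": 1, "address": "12 Oak St", "version": 1,
         "eff_date": date(2001, 1, 1), "end_date": None, "current_flag": True}]
scd2(rows, 1, "99 Elm Ave", date(2002, 6, 2))
print(len(rows), rows[-1]["version"])  # 2 rows, version 2
```

Note how SCD2 is the only variant that grows the table: each change adds a row, which is why SCD2 dimensions need surrogate keys rather than the natural key alone.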

What are Semi-additive and factless facts and in which scenario will you use such kinds of fact tables?
Snapshot facts are semi-additive; we go for semi-additive facts when we maintain aggregated facts.

Ex: average daily balance.

A fact table without numeric fact columns is called a factless fact table. It records only the occurrence of an event (ex: which products were given out as samples), because the table does not contain any measures.

Ex: promotion facts.

Differences between star and snowflake schemas?
Star schema - all dimensions are linked directly to the fact table.
Snowflake schema - dimensions may be interlinked or may have one-to-many relationships with other tables.

What is a Star Schema?
A star schema is a way of organising the tables such that results can be retrieved from the database easily and quickly in the warehouse environment. Usually a star schema consists of one or more dimension tables arranged around a fact table; the layout looks like a star, which is how it got its name.
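A star-schema query can be illustrated in plain Python: a fact table holding foreign keys plus a measure, joined directly to each dimension. All table contents and names here are invented for the example:

```python
# Two dimension tables keyed by their primary keys.
product_dim = {1: {"product_name": "Widget"}, 2: {"product_name": "Gadget"}}
date_dim = {20070501: {"year": 2007, "month": 5}}

# The fact table at the center of the star: foreign keys + a measure.
sales_fact = [
    {"product_key": 1, "date_key": 20070501, "sales_amt": 100.0},
    {"product_key": 2, "date_key": 20070501, "sales_amt": 250.0},
]

# Each fact row joins directly to every dimension -- the "star" shape.
report = [
    (product_dim[f["product_key"]]["product_name"],
     date_dim[f["date_key"]]["year"],
     f["sales_amt"])
    for f in sales_fact
]
print(report)
```

In a snowflake schema, `product_dim` would itself be split (e.g. a separate category table it joins to), adding one more lookup per level of normalization.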

What is a general purpose scheduling tool?
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.
What is ER Diagram?
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views.

Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.

Since Chen wrote his paper the model has been extended, and today it is commonly used for database design. For the database designer, the utility of the ER model is:

It maps well to the relational model: the constructs used in the ER model can easily be transformed into relational tables. It is simple and easy to understand with a minimum of training; therefore, the model can be used by the database designer to communicate the design to the end user.

In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.

Which columns go to the fact table and which columns go to the dimension table?
The primary key columns of the source tables (entities) go to the dimension tables as business keys.
The primary key columns of the dimension tables go to the fact tables as foreign keys.

What are modeling tools available in the Market?
There are a number of data modeling tools available:

Tool Name Company Name
Erwin Computer Associates
Embarcadero Embarcadero Technologies
Rational Rose IBM Corporation
Power Designer Sybase Corporation
Oracle Designer Oracle Corporation

Name some of modeling tools available in the Market?
These tools are used for Data/dimension modeling

1. Oracle Designer
2. ERwin (Entity Relationship for Windows)
3. Informatica (Cubes/Dimensions)
4. Embarcadero
5. Power Designer Sybase

How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. It
is not unusual for 100 years to be represented in a time dimension, with one row per day.
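The loop described above can be sketched in Python: iterate over every calendar day in a range and derive one dimension row per day. The attribute names are illustrative, not a fixed standard:

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Generate one time-dimension row per calendar day in [start, end]."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),   # e.g. 20070101
            "calendar_date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.isoweekday(),           # 1 = Monday
        })
        day += timedelta(days=1)
    return rows

rows = build_time_dimension(date(2007, 1, 1), date(2007, 12, 31))
print(len(rows))  # 365 rows, one per day of 2007
```

Because every attribute is computed from the date itself, the load is deterministic and can be re-run for any range (e.g. a full century of dates) without reference to source systems.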
Explain the advantages of RAID 1, 1/0, and 5. On what type of RAID setup would you put your TX logs?
Transaction logs are written sequentially and rarely need to be read. The ideal is to have each on RAID 1/0 because it has much better write performance than RAID 5.

RAID 1 is also good for TX logs and costs less than 1/0 to implement. It has a tad less reliability, and performance is generally a little worse.

RAID 5 is best for data generally because of cost and the fact it provides great read capability.

What are the various ETL tools in the market?
Various ETL tools used in the market are:

1. Informatica
2. DataStage
3. MS SQL DTS (Integration Services 2005)
4. Ab Initio
5. SQL*Loader
6. Sunopsis
7. Oracle Warehouse Builder
8. Data Junction

What is VLDB?
Answer 1:
VLDB stands for Very Large DataBase.

It is an environment or storage space managed by a relational database management system (RDBMS) consisting of
vast quantities of information.

Answer 2:
VLDB doesn't refer to the size of the database or the vast amount of information stored. It refers to the window of opportunity to back up the database.

The window of opportunity refers to an interval of time, and if the DBA is unable to take the backup within that specified time, the database is considered a VLDB.
What are Data Marts ?
A data mart is a focused subset of a data warehouse that deals with a single area of data (such as one department) and is organized for quick analysis.

What are the steps to build the data warehouse?
Gathering business requirements
Identifying sources
Identifying facts
Defining dimensions
Defining attributes
Redefining dimensions and attributes
Organising the attribute hierarchy and defining relationships
Assigning unique identifiers
Additional conventions: cardinality / adding ratios

What is the difference between E-R modeling and dimensional modeling?
The basic difference is that E-R modeling has both a logical and a physical model, while the dimensional model has only a physical model.

E-R modeling is used for normalizing the OLTP database design.

Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

Why is the fact table in normal form?
Basically the fact table consists of the index keys of the dimension/lookup tables and the measures.

Whenever a table contains only keys and measures, that itself implies that the table is in normal form.

What are the advantages data mining over traditional approaches?
Data mining is used for estimating the future. For example, if we take a company or business organization, by using data mining we can predict the future of the business in terms of revenue, employees, customers, orders, etc.

Traditional approaches use simple algorithms for estimating the future, but they do not give results as accurate as data mining.

What is a CUBE in datawarehousing concept?
Cubes are logical representations of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values.

What is data validation strategies for data mart validation after loading process ?
Data validation is to make sure that the loaded data is accurate and meets the business requirements.

Strategies are the different methods followed to meet the validation requirements.

What is the data type of the surrogate key?
The data type of the surrogate key is either integer or numeric (number).

What is degenerate dimension table?
Degenerate dimensions: if a table contains values which are neither dimensions nor measures, they are called degenerate dimensions. Ex: invoice id, empno.

What is Dimensional Modelling?
Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.

What are the methodologies of Data Warehousing.?
Every company has a methodology of its own, but to name a few, the SDLC and AIM methodologies are commonly used. Other methodologies are AMM, World Class methodology and many more.

What is a linked cube?
A linked cube is one in which a sub-set of the data can be analysed in great detail. The linking ensures that the data in the cubes remains consistent.
What is the main difference between Inmon and Kimball philosophies of data warehousing?
Both differ in their concept of building the data warehouse.

Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained from dimension modeling done at a local, departmental level.

Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store, and other subject areas can be added to the data warehouse as the need arises. Point-of-sale (POS) data can be added later if management decides it is necessary.

i.e.,
Kimball--First DataMarts--Combined way ---Datawarehouse

Inmon---First Datawarehouse--Later----Datamarts

What is a data warehousing hierarchy?
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to
define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to
the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a
family structure.

Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels
aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For
example, in the product dimension, there might be two hierarchies--one for product categories and one for product
suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill
down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business structures. For example, a divisional
multilevel sales organization.

Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher
level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to
access data quickly.

Levels
A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents
data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or
most general level. The levels in a dimension are organized into one or more hierarchies.

Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information.
They define the parent-child relationship between the levels in a hierarchy.

Hierarchies are also essential components in enabling more complex query rewrites. For example, the database can roll an existing quarterly sales revenue aggregation up to a yearly aggregation when the dimensional dependencies between quarter and year are known.

What is the main difference between a schema in an RDBMS and schemas in a data warehouse?
RDBMS schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Complex extraction problems cannot be solved easily
* Poorly suited to analytical modelling


DWH schema
* Used for OLAP systems
* New-generation schema
* De-normalized
* Easy to understand and navigate
* Complex extraction problems can be easily solved
* Well suited to analytical modelling

What is hybrid slowly changing dimension?
Hybrid SCDs are combination of both SCD 1 and SCD 2.

It may happen that in a table, some columns are important and we need to track changes for them i.e capture the
historical data for them whereas in some columns even if the data changes, we don't care.

For such tables we implement hybrid SCDs, wherein some columns are Type 1 and some are Type 2.

What are the different architecture of datawarehouse?
There are two main approaches:

1. Top down - (Bill Inmon)
2. Bottom up - (Ralph Kimball)

1. What is incremental loading?
2. What is batch processing?
3. What is a cross-reference table?
4. What is an aggregate fact table?
Incremental loading means loading only the ongoing changes from the OLTP system.

An aggregate table contains the measure values aggregated/grouped/summed up to some level of the hierarchy.

what is junk dimension? what is the difference between junk dimension and degenerated dimension?
Junk dimension: grouping random flags and text attributes in a dimension and moving them to a separate sub-dimension.

Degenerate dimension: keeping the control information on the fact table. Ex: consider a dimension table with fields like order number and order line number that has a 1:1 relationship with the fact table; in this case the dimension is removed and the order information is stored directly in the fact table, in order to eliminate unnecessary joins while retrieving order information.

What are the possible data marts in Retail sales.?
Product information, sales information.

What is the definition of normalized and denormalized view and what are the differences between them?
Normalization is the process of removing redundancies.

Denormalization is the process of allowing redundancies.
What is meant by metadata in context of a Datawarehouse and how it is important?
Metadata is data about data. A business analyst or data modeler usually captures information about data - the source (where and how the data originated), the nature of the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the data (how it is modified or derived, and its life cycle) in a data dictionary, a.k.a. metadata. Metadata is also present at the data mart level: subsets, facts and dimensions, ODS, etc. For a DW user, metadata provides vital information for analysis / DSS.

Differences between star and snowflake schemas?
Star schema
A single fact table with N number of dimensions.

Snowflake schema
Any dimension with extended dimensions is known as a snowflake schema.

Difference between Snow flake and Star Schema. What are situations where Snow flake Schema is better
than Star Schema to use and when the opposite is true?
A star schema contains the dimension tables mapped around one or more fact tables.
It is a denormalised model.
There is no need to use complicated joins.
Queries return results quickly.
Snowflake schema
It is the normalised form of the star schema.
It contains deeper joins, because the tables are split into many pieces. We can easily make modifications directly in the tables.
We have to use complicated joins, since we have more tables.
There will be some delay in processing the query.

What is VLDB?
The perception of what constitutes a VLDB continues to grow. A one terabyte database would normally be
considered to be a VLDB.

What are the data types present in BO? And what happens if we implement a view in the designer and the report?
There are three different data types: dimensions, measures and details.
A view is nothing but an alias, and it can be used to resolve the loops in the universe.

Can a dimension table contain numeric values?
Yes, but the column data type will be char (the values themselves may be numeric or char).

What is the difference between view and materialized view?
View - stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.

Materialized view - stores the result of the SQL in table form in the database. The SQL statement executes only once, and after that every time you run the query the stored result set is used. Pros include quick query results.
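The behavioral difference can be sketched in Python, with a plain function standing in for the stored SQL and a list standing in for a base table (a toy model, not how a database engine works internally):

```python
base_table = [1, 2, 3]
executions = {"count": 0}

def query():
    """Stands in for the SQL text: runs against the base table each call."""
    executions["count"] += 1
    return sum(base_table)

# A view is effectively the stored query itself -- each access re-executes it.
view = query
print(view(), view())           # executes the "SQL" twice

# A materialized view stores the result set once; later reads reuse it.
materialized = query()          # executes once, result persisted
print(materialized, materialized)

print(executions["count"])      # 3 total executions: 2 via the view, 1 to materialize
```

The trade-off mirrors the real feature: the materialized result is fast to read but goes stale when `base_table` changes, which is why databases pair materialized views with refresh policies.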
What is a surrogate key? Where do we use it? Explain with examples.
A surrogate key is a substitute for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the table. The only
requirement for a surrogate primary key is that it is unique for each row in the table.

Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes
updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.

2. Adapted from a response by Vincent on Thursday, March 13, 2003
Another benefit you can get from surrogate keys (SIDs) is
tracking SCDs - slowly changing dimensions.

Let me give you a simple, classical example:
On the 1st of January 2002, employee 'E1' belongs to business unit 'BU1' (that's what would be in your Employee dimension). This employee has a turnover allocated to him on business unit 'BU1'. But on the 2nd of June the employee 'E1' is moved from business unit 'BU1' to business unit 'BU2'. All the new turnover has to belong to the new business unit 'BU2', but the old turnover should belong to business unit 'BU1'.

If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to business unit 'BU2', even what actually belongs to 'BU1'.

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee
Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All
new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.'

You could consider the slowly changing dimension as an enlargement of your natural key: the natural key of the employee was the employee code 'E1', but for you it becomes

Employee code + business unit - 'E1' + 'BU1' or 'E1' + 'BU2'. The difference with the natural-key enlargement process is that you might not have every part of your new key within your fact table, so you might not be able to do the join on the new enlarged key - so you need another id.
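The E1/BU1/BU2 scenario above can be sketched with surrogate keys in Python. The record layouts and SID values are illustrative assumptions:

```python
from datetime import date

# Employee dimension: one row per version of the employee, each with its own SID.
employee_dim = [
    {"sid": 101, "emp_code": "E1", "business_unit": "BU1",
     "eff_date": date(2002, 1, 1), "current": True},
]

# On the 2nd of June the employee moves: expire the old row, add a new SID.
employee_dim[0]["current"] = False
employee_dim.append({"sid": 102, "emp_code": "E1", "business_unit": "BU2",
                     "eff_date": date(2002, 6, 2), "current": True})

# Fact rows reference the SID in force at transaction time, so old
# turnover stays with BU1 and new turnover goes to BU2.
turnover_fact = [
    {"emp_sid": 101, "amount": 500.0},   # booked before 2 June
    {"emp_sid": 102, "amount": 700.0},   # booked after 2 June
]

by_bu = {}
for f in turnover_fact:
    dim = next(d for d in employee_dim if d["sid"] == f["emp_sid"])
    by_bu[dim["business_unit"]] = by_bu.get(dim["business_unit"], 0.0) + f["amount"]
print(by_bu)  # {'BU1': 500.0, 'BU2': 700.0}
```

Had the facts carried the natural key 'E1' instead, both amounts would have rolled up under whichever business unit the single dimension row currently showed.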

What is aggregate table and aggregate fact table ... any examples of both?
An aggregate table contains summarised data; materialized views are aggregated tables.

For example, suppose sales facts are kept only at the date/transaction level. If we want to create a report like sales by product per year, we aggregate the date values into week_agg, month_agg, quarter_agg and year_agg tables. To retrieve data from these tables we use an aggregate-aware function.
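Building such an aggregate from date-level facts can be sketched in Python; the sales figures and product names are invented for the example:

```python
# Date-level sales facts: (date, product, amount).
daily_sales = [
    ("2006-11-03", "Widget", 40.0),
    ("2006-12-15", "Widget", 60.0),
    ("2007-02-09", "Widget", 25.0),
]

# Build a year_agg table by rolling each date up to its year.
year_agg = {}
for day, product, amount in daily_sales:
    key = (day[:4], product)                       # year level of the date hierarchy
    year_agg[key] = year_agg.get(key, 0.0) + amount

print(year_agg)  # {('2006', 'Widget'): 100.0, ('2007', 'Widget'): 25.0}
```

A yearly report then reads a handful of pre-summed rows instead of scanning every transaction, which is the whole point of maintaining aggregate tables.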

What is active data warehousing?
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day-to-day, even minute-to-minute, decision making in a way that increases the quality of those customer touches, which encourages customer loyalty and thus secures an organization's bottom line. The marketplace is coming of age as we progress from first-generation "passive" decision-support systems to current- and next-generation "active" data warehouse implementations.

Why do we override the execute method in Struts? Please give the details.
As part of the Struts framework we can develop Action servlets, ActionForm servlets (here an Action servlet means a class that extends the Action class, and an ActionForm servlet means a class that extends the ActionForm class) and other servlet classes.

In an ActionForm class we can implement validate(); this method returns an ActionErrors object, and in it we write the validation code. If this method returns null, or ActionErrors with size = 0, the web container will call execute() on the Action class. If it returns size > 0, execute() will not be called; instead the JSP, servlet or HTML file given as the value of the input attribute in struts-config.xml will be executed.

What is the difference between Datawarehousing and BusinessIntelligence?
Data warehousing deals with all aspects of managing the development, implementation and operation of a data
warehouse or data mart including meta data management, data acquisition, data cleansing, data transformation,
storage management, data distribution, data archiving, operational reporting, analytical reporting, security
management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that
enable an organization to analyze measurable aspects of their business such as sales performance, profitability,
operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups,
cost trends, anomalies and exceptions, etc. Typically, the term business intelligence is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business, including the back office. The business needs the back office on which to function, but the back office without a business to support makes no sense.
What is the difference between OLAP and a data warehouse?
A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of analyzing the data: managing aggregations and partitioning information into cubes for in-depth visualization.

What is a factless fact table? Where have you used it in your project?
A factless fact table contains only keys; there are no measures available.

Why Denormalization is promoted in Universe Designing?
In a relational data model, for normalization purposes, some lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table, called a dimension table, for performance and for slicing data. This merging of tables into one large dimension table removes complex intermediate joins: dimension tables are directly joined to fact tables. Though redundancy of data occurs in the dimension table, its size is only about 15% of the fact table's. That is why denormalization is promoted in universe designing.

What is the difference between ODS and OLTP?
ODS: a collection of tables created in the data warehouse that maintains only current data,

whereas OLTP maintains the data only for transactions; OLTP systems are designed for recording the daily operations and transactions of a business.

What is the difference between datawarehouse and BI?
Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of the business. A BI tool analyzes the data of a data warehouse and arrives at business decisions depending on the result of the analysis.
Are OLAP databases called decision support systems? True/false?
True

Explain in detail type 1, type 2 (SCD) and type 3.
Type 1:
Most recent value only.
Type 2 (full history), using:
i) Version number
ii) Flag
iii) Date
Type 3:
Current value and one previous value.

What is snapshot?
You can disconnect the report from the catalog to which it is attached by saving the report with a snapshot of the
data. However, you must reconnect to the catalog if you want to refresh the data.


What are non-additive facts in detail?
A fact may be a measure, a metric or a dollar value. Measures and metrics are non-additive facts.

A dollar value is an additive fact: if we want to find the amount for a particular place over a particular period of time, we can add the dollar amounts and come up with the total amount.

A non-additive fact, e.g. a measure of height for citizens by geographical location: when we roll city-level data up to state-level data we should not add the heights of the citizens; rather we may want to use the fact to derive a count.
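The contrast can be sketched in Python with a city-to-state rollup; all figures are invented for the illustration:

```python
city_rows = [
    {"state": "NJ", "city": "Newark",  "sales_usd": 100.0, "height_cm": 170},
    {"state": "NJ", "city": "Trenton", "sales_usd": 50.0,  "height_cm": 180},
]

# Additive fact: summing dollar amounts across the city dimension is meaningful.
state_sales = sum(r["sales_usd"] for r in city_rows)

# Non-additive fact: summing heights is meaningless; derive a count or
# an average instead.
citizen_count = len(city_rows)
avg_height = sum(r["height_cm"] for r in city_rows) / citizen_count

print(state_sales, citizen_count, avg_height)  # 150.0 2 175.0
```

Semi-additive facts such as account balances sit in between: they sum across some dimensions (account) but not others (time).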

Posted by Jai at 8:56 PM


ETL Architecture Guide
Purpose
The purpose of this document is to present architectural guidelines for developing a common-
sense approach to supply the best possible quality of data attainable for the data mart.

Terms
ETL: an industry-standard abbreviation for the phrase extract, transform and load, which favors the use of applications to perform data transformations outside of a database on a row-by-row basis. Traditionally referred to as the hub-and-spokes practice.
ELT: an industry-standard abbreviation for the phrase extract, load and transform, which favors the use of relational databases as the staging point before performing any transformations of source data into target data.

Topic
When functional information requirements have been defined for the end products of the
data mart, each of these requirements will be traced back to the appropriate source
systems of record.
Best practices will be used to determine the most expedient approach to obtain the
highest quality data in terms of most accurate, complete and valid data from each
source.
Flat (non-hierarchical, non-relational) file extracts may be either pulled by an application such as ftp or another mechanism (tape, disk, etc.) or pushed by a source system extract application to a staging area.
Relational data may also be queried to produce extract files or to load directly into
relational database staging tables.
An ETL metadata reference table will be defined (data_source_type) to uniquely identify
each type of data source (flat file, spreadsheet, hierarchical database, relational
database, multi-valued database, comma-separated variable length, fixed record length,
etc.).
An ETL metadata reference table will be defined (data_source) to uniquely identify all
data sources.
All data sources will be required to have event metadata either embedded with the
source data itself in standardized header/trailer records, or in a separate file or table
which must be accessible by the ETL tools or applications at the same time as the
source data itself. At a minimum this will include file or table name, date of creation,
unique creator/author identifier, checksum, size in bytes, character set, number of
records/rows, data source type (see above) and data source identifier (see above).
Additional items may also include: cyclic redundancy check (CRC), hash totals, public
key encryption token.
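Capturing the minimum event metadata listed above for a flat-file source can be sketched in Python. The function name, field names and demo file are illustrative assumptions, not part of the guideline itself:

```python
import hashlib
import os
import tempfile

def file_event_metadata(path, creator_id, source_type_id, source_id):
    """Collect the minimum event metadata for a flat-file data source:
    name, checksum, byte size, record count, and source identifiers."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "file_name": os.path.basename(path),
        "creator_id": creator_id,
        "checksum_md5": hashlib.md5(data).hexdigest(),
        "size_bytes": len(data),
        "record_count": data.count(b"\n"),       # assumes newline-terminated records
        "data_source_type": source_type_id,      # FK into data_source_type
        "data_source": source_id,                # FK into data_source
    }

# Demo on a small throwaway file with two pipe-delimited records.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".dat")
tmp.write(b"a|1\nb|2\n")
tmp.close()
meta = file_event_metadata(tmp.name, "etl_user", 1, 42)
print(meta["record_count"], meta["size_bytes"])  # 2 8
os.unlink(tmp.name)
```

Comparing the recorded checksum and record count against the same values recomputed at the staging area gives a cheap end-to-end transmission check.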
Where no secure data channel exists, corporate data must be encrypted by a standard
application before transmission to a staging location.
All data mart staging areas must be kept secure with only those users responsible for
operational management and the applications themselves having the necessary
permissions. Corporate data is extremely valuable and must be kept secure at all times
as it is moved from source systems toward the data mart.
Where large quantities of data may need to be moved across a wide area network, the
ability to compress the data at the source prior to transmission and expand the data
upon receipt into the staging area is required.
The ability to archive data with an associated time context naming convention is
frequently useful and often required. Preferably the naming convention for files will
include the following format: YYYYMMDD_HHmmSS, where YYYY = century and year
(i.e. 2007), MM = Gregorian/calendar month with leading zeroes (i.e. 03 for March), DD =
day of month, HH = hour of day (using the zulu, universal or Greenwich meridian timezone;
i.e. 6:30 AM Central Time = 12:30 PM Universal Time), mm = minute of hour, SS = seconds
of minute.
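The naming convention above can be sketched in Python; the prefix and extension are illustrative assumptions:

```python
from datetime import datetime, timezone

def archive_name(prefix, when=None):
    """Build a YYYYMMDD_HHmmSS archive file name in UTC (zulu time),
    per the naming convention in this guideline."""
    when = when or datetime.now(timezone.utc)
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.dat"

# 6:30 AM Central Time (UTC-6 outside daylight saving) = 12:30 PM universal time.
stamp = datetime(2007, 3, 2, 12, 30, 0, tzinfo=timezone.utc)
print(archive_name("sales_extract", stamp))
# sales_extract_20070302_123000.dat
```

Keeping the timestamp in UTC means files staged from sources in different timezones still sort chronologically by name.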
Any and all database objects developed to support the ETL process such as: tables,
views, functions, stored procedures, dictionary table population scripts must have
individual drop and create/replace scripts created for each object to facilitate rapid
deployment of an ETL environment to a new database or server.
Any and all scripts associated with the ETL process must be placed into a standard
version control system to support change management and rollback in case of failures to
assist in root cause determination.
The version control system will be the single source used to produce a continuous
integration environment for unit testing all ETL scripts. Business rule-driven unit tests
will be designed for each ETL component and executed after any change which touches
either the data processed by the component or any component immediately up/down
stream from the component.
Continuous integration testing of the ETL system will require the creation of a separate
ETL data work area within the development environment which can be completely
dropped, rebuilt and unit tests run on static datasets to produce expected results. These
continuous integration tests must produce easily determined pass/fail results and be
self-reporting to the ETL development team.
All required client and server database, tool or other software (including upgrades,
patches, service packs or hot-fixes) must be kept in a separate escrow to be able to
reproduce an entire ETL environment exactly as it was prior to any upgrade to any
component.
All database, ftp and other connectivity information for each environment (development,
testing/quality assurance and production) must be kept in a centralized, single location
accessible by the development team and the ETL applications themselves. Appropriate
access controls must be maintained on this information to only allow authorized
individuals the ability to update this information as needed, and to allow a single point of
maintenance which may be applied to all environments.
Any component upgrade must be tested in the continuous integration environment prior
to deployment to quality assurance and must never be applied to production without
passing the continuous integration and quality assurance testing.
There will be at minimum three (3) separate environments created to support ETL: 1)
Development, 2) test/quality assurance and 3) production. All new development and
unit testing activities will begin in the ETL development environment. Prior to migration
of the ETL component to the test/QA environment, a test plan must be written and
communicated to the quality assurance staff. This test plan must include: all setup
instructions, process execution instructions, expected results and/or success criteria and
any restart instructions.
All ETL processes will make use of parameters to provide the maximum amount of
reusability without re-coding. This also reduces the number of modules to maintain,
while increasing the number of tests required per process / module.
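The parameter-driven style described above can be sketched as a single generic load module that every feed reuses; the feed names, file and table names below are hypothetical:

```shell
#!/bin/sh
# One generic, parameter-driven load module instead of one script
# per feed: feed name, source file and target table are passed in.

etl_load() {
    # $1 = feed name, $2 = source file, $3 = target table
    feed=$1; src=$2; target=$3
    rows=$(wc -l < "$src" | tr -d ' ')
    echo "feed=$feed source=$src target=$target rows=$rows"
    # ... real work: validate, transform, load into $target ...
}

src=$(mktemp)
printf 'r1\nr2\nr3\n' > "$src"
etl_load customers "$src" STG_CUSTOMER
etl_load orders    "$src" STG_ORDER
rm -f "$src"
```

Both calls exercise the same module with different parameters, which is exactly why a change to the module requires re-running the tests of every process that uses it.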
All ETL metadata will reside in a central repository accessible via a generic search
engine by all stakeholders in the process.
ETL metadata must be accessible and updateable by all users (i.e., wiki-style) to keep
the data current and add additional perspectives. The metadata should be periodically
peer reviewed by metadata stewards for content accuracy and currency.
Every relational database table whether source, interim or final will have a minimum of
the following attributes defined: name, classification (process control, process log,
dimensional, master/reference, transaction, fact, associative, reject, transformation);
primary key; business key; foreign keys; surrogate key; estimated row count; anticipated
growth factor per month; average row length and indexing recommendations.
Every file whether source, interim or final will have a minimum of the following attributes
defined: creation date, file creator name, file classification (process control, log,
reject, source, final), file type (fixed length, comma-separated values, or other delimited
format with its delimiter), and structural layout containing: field name, data type (numeric,
date, date/time, alpha, GUID), maximum length, required Y/N, identity, domain of values,
and range of values. This must supply enough information to construct data edit
scripts to assess the validity and general quality of information contained in the file itself.
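To make the idea of a metadata-driven edit script concrete, here is a small sketch for a pipe-delimited file; the three-field layout and its rules are invented for illustration only:

```shell
#!/bin/sh
# Edit check derived from a (hypothetical) file layout:
# field 1 = id (numeric, required), field 2 = name (required),
# field 3 = amount (numeric, optional).

check_file() {
    awk -F'|' '
        $1 !~ /^[0-9]+$/  { bad++; next }   # id must be numeric
        $2 == ""          { bad++; next }   # name is required
        $3 != "" && $3 !~ /^[0-9]+(\.[0-9]+)?$/ { bad++; next }
        { good++ }
        END { printf "good=%d bad=%d\n", good+0, bad+0 }
    ' "$1"
}

f=$(mktemp)
printf '1|alice|10.50\n2||3\nx|bob|\n3|carol|\n' > "$f"
check_file "$f"     # prints: good=2 bad=2
rm -f "$f"
```

With the layout held as metadata, a generator can emit one such check per file rather than each being hand-written.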
Every relational database table column will have the same metadata requirements as
described for files immediately above. In addition, foreign key columns must be
associated with their parent table names where applicable to facilitate further
automation of edit checking. Any additional business rules must be defined where applicable to a given
column. This will support the addition of data-driven validation of business rules. Each
column may have zero or many business rules applied to its validation process. Each
business rule for a given column must have a gate, event (which may be empty) or state
of relevant data required in order to fire the business rule edit test. As an example, if
address line 1, city and state are supplied we may fire the business rule to validate the
ZIP code for addresses in the United States.
Each multi-dimensional cube must have the following items defined in the ETL
metadata. Per dimension: name, expected number of entries, percent of sparse data
anticipated, and an OLAP query to produce meaningful control totals for the dimension.
For the cube itself: one or more OLAP queries to produce a validation of the cube being
updated (these may be run before building the cube and again after, in order to validate
that the new data added had the desired impact).
Any ETL process that must execute a third-party product must log all parameters sent to
the third-party application and the resulting return code received from the product along
with a timestamp for each and a mapping of the return code into a simple pass/fail
status.
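A thin shell wrapper is enough to satisfy this logging requirement for any command-line tool; the log location and format below are illustrative assumptions:

```shell
#!/bin/sh
# Wrapper for third-party tools: logs the parameters sent, the return
# code received, a timestamp for each, and a simple pass/fail mapping.

LOG=${LOG:-/tmp/etl_thirdparty.log}

run_tool() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') CALL $*" >> "$LOG"
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then status=PASS; else status=FAIL; fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') RC=$rc STATUS=$status" >> "$LOG"
    return $rc
}

# Example: wrap any command line exactly as it would normally be run.
LOG=$(mktemp)
run_tool true
run_tool false || true
grep -c STATUS=PASS "$LOG"    # prints 1 (one passing call logged)
rm -f "$LOG"
```

Because the wrapper returns the tool's own exit code, downstream job-control logic keeps working unchanged.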
A centralized dashboard must be developed to monitor and communicate the status of
each cycle of the ETL process. All appropriate ETL processes must accurately update
the appropriate ETL logs upon completion of their cycle to reflect both total data
processed and total data rejected. This dashboard must support establishing
performance thresholds for:
o Elapsed time required to process a given quantity of data;
o Data quality assessment as a percent of total new and as a percent of total
overall for master/reference, transaction, dimension and fact data
The dashboard must also support drilling down from the dashboard to the details of the ETL logs
to provide immediate insight into the root causes of any out of threshold aggregates.
All data mart surrogate keys will be assigned via a programmatic approach (not via database
auto-assignment / identity columns) defined for each table as appropriate, to provide an added
measure of control and to support portability of data between environments.
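One simple programmatic scheme is a key control file owned by the ETL process, with the next value held per table. This is a sketch under that assumption (the control-file layout is invented, and a real implementation would also need locking against concurrent writers):

```shell
#!/bin/sh
# Surrogate key assignment from an ETL-owned control file,
# instead of database identity columns.

next_surrogate_key() {
    # $1 = key control file, $2 = table name
    keyfile=$1; table=$2
    current=$(awk -v t="$table" '$1 == t { print $2 }' "$keyfile")
    current=${current:-0}
    next=$((current + 1))
    # Rewrite the control file with the incremented value.
    { awk -v t="$table" '$1 != t' "$keyfile"; echo "$table $next"; } > "$keyfile.tmp"
    mv "$keyfile.tmp" "$keyfile"
    echo "$next"
}

kf=$(mktemp)
next_surrogate_key "$kf" DIM_CUSTOMER   # prints 1
next_surrogate_key "$kf" DIM_CUSTOMER   # prints 2
next_surrogate_key "$kf" DIM_PRODUCT    # prints 1
rm -f "$kf"
```

Because the keys come from a file the team controls, data can be moved between environments with its keys intact, which identity columns would not allow.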
All data added to the data mart will be identified by a unique data source identifier, which will
have been previously established in the data mart source system control table.
All data loaded into the data mart during the same ETL processing cycle will be identified by a
unique identifier assigned at the beginning of the cycle. This identifier will be generated at the
time of the ETL processing cycle and will have additional metadata about the ETL cycle
associated with it including: description of the reason for the cycle, ETL start date and time, ETL
environment identifier.
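A sketch of generating such a cycle identifier and recording its metadata; the control-file layout and identifier scheme (timestamp plus process id) are illustrative assumptions:

```shell
#!/bin/sh
# Assign a unique identifier at the start of each ETL processing cycle
# and record the cycle metadata (environment, start time, reason).

start_etl_cycle() {
    # $1 = cycle control file, $2 = environment id, $3 = reason text
    ctl=$1; env_id=$2; reason=$3
    cycle_id="$(date '+%Y%m%d%H%M%S')_$$"
    echo "cycle_id=$cycle_id|env=$env_id|start=$(date '+%Y-%m-%d %H:%M:%S')|reason=$reason" >> "$ctl"
    echo "$cycle_id"
}

ctl=$(mktemp)
id=$(start_etl_cycle "$ctl" DEV "monthly refresh")
echo "tag all rows loaded this cycle with $id"
rm -f "$ctl"
```

Every row loaded during the cycle then carries the same identifier, so a whole cycle can be traced or backed out as a unit.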
When an ETL processing cycle successfully completes and the resulting data in the data mart
has passed both automated and any manual quality assurance checks, the successfully loaded
staging data will be associated with the unique ETL processing session identifier and archived
in either a flat file or a database backup.
Prior to the start of any ETL processing cycle, the destination data mart, any associated OLAP
cubes, and any staging areas to be affected during the process must be backed up. This is
necessary to supply a recovery position should a catastrophic failure occur during the process.
If, for performance reasons, the backups are taken at the end of the previous ETL cycle, they
must be verified to have completed successfully and to still be viable.
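The verification step can be sketched as a pre-flight check that the backup files exist, are non-empty and are recent; the paths and the freshness window are illustrative assumptions:

```shell
#!/bin/sh
# Confirm a backup is a viable recovery position before the cycle starts.

verify_backup() {
    # $1 = backup file, $2 = maximum age in minutes
    bkp=$1; max_age=$2
    if [ ! -s "$bkp" ]; then
        echo "FAIL: backup $bkp missing or empty"
        return 1
    fi
    # find -mmin +N matches files older than N minutes
    if [ -n "$(find "$bkp" -mmin +"$max_age")" ]; then
        echo "FAIL: backup $bkp older than $max_age minutes"
        return 1
    fi
    echo "OK: backup $bkp is present and current"
}

bkp=$(mktemp)
echo "backup payload" > "$bkp"
verify_backup "$bkp" 1440 || exit 1   # abort the ETL cycle on failure
rm -f "$bkp"
```

A non-zero return here should stop the cycle before any target data is touched.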

Tips & Tricks

These are general guidelines for the development, maintenance and testing of Ab Initio
projects. They have been gathered from various sources on the Internet, as well as from
experienced Ab Initio developers.
Project Access Control - CheckIn, CheckOut
* Before Checking In a graph, ensure that it has been deployed successfully
* Inform the ETL Admin before Checking In
* To get the latest version of the graph, Check Out from EME Data Store
* Always Check Out from the EME Data Store to your individual sandbox before running a graph
* In case the graph is not present in the EME Data Store, Check In and run it
* Ab Initio Sand Box should be created only by the ETL Admin for all authorized users
* Ensure that User-ID and Password in the EME Settings and the Run Settings are the same,
before creating graphs on the server
* Ensure that a graph is locked before modifying it, to prevent any sharing conflicts
* Ensure that individual Ab Initio graphs are handled by separate users - when a user locks a
graph, other users are prevented from modifying it while it is locked
* Ask the DBA to create any tables needed in the target database - do not create them yourself
* Report all database related activities and issues to the DBA promptly
* Inform the DBA and get their approval before you modify any table in the target database
* Only the ETL Admin has the rights to set or modify environment variables - do not change
environment variables - these are global to all graphs and should not be tampered with
Project Implementation - Recommended Best Practices
* One may encounter errors while running a graph.
Maintain error logs for every error you come across.
* Maintain a consolidated, detailed error sheet with error related information and resolution
information for all users. This can be used for reference when facing similar errors later on.
* In case you have a database error, contact the DBA promptly
* In all your graphs, ensure you are using the relevant dbc file
* Always validate a graph before executing it - deploy the graph only after successful validation
* Run the ab_project_setup.ksh on a regular basis - Contact ETL Admin if you need further
details
* Check whether the test parameters are valid before running an Ab Initio graph
* Save and unlock the graph after implementing the modifications desired
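The consolidated error sheet recommended above can itself be kept as a simple shared file with small helper functions around it; the pipe-delimited layout and file location here are illustrative assumptions:

```shell
#!/bin/sh
# Shared, consolidated error sheet: append each error with its
# resolution, and search past entries when a new error appears.

log_error() {
    # $1 = sheet file, $2 = graph name, $3 = error text, $4 = resolution
    echo "$(date '+%Y-%m-%d')|$2|$3|$4" >> "$1"
}

find_resolution() {
    # $1 = sheet file, $2 = search term
    grep -i "$2" "$1" | awk -F'|' '{ print $2 ": " $4 }'
}

sheet=$(mktemp)
log_error "$sheet" load_customers "too many open files" "raised ulimit -n for the run user"
find_resolution "$sheet" "open files"
rm -f "$sheet"
```

Searching the sheet first is often faster than re-diagnosing an error another developer has already resolved.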
Handling Run Time Errors
* When testing a graph, contact the person who created it or modified it recently for assistance
* Contact the ETL Admin if the error encountered relates to Admin settings
* When you face a problem that you have not encountered and resolved before, look into the
consolidated error sheet to see if it has previously been faced and resolved by another user
* Check with online tech forums for further inputs on the error
Documentation Practices
* Maintain documents regarding all the modifications performed on existing graphs and scripts
* Maintain ETL design documents for all Ab Initio graphs created or modified - when changes
are made to existing graphs, the documents should be suitably updated
* Follow testing rules as per the testing template when testing a graph
* Document all testing activities performed



(c) benerald cannison
