October 25, 2010 – dsrealtime


At this point you should be familiar with Shared Tables, and understand how data lineage works within, and among, multiple
DataStage Jobs. Stage-to-Stage lineage is very useful, and may be all that you require. It's also powerful, though, to go beyond
this and connect your Jobs to the tables that you have imported from various places.

Why?

One of the key reasons for including Shared Tables in lineage is "Business Lineage". This is a high-level summary of lineage
that doesn't illustrate the lower-level transformation details... just the "key" sources and targets (files, tables, reports, etc.) along
the way. Another is to connect your Jobs to external, non-Information Server assets. DataStage doesn't write directly to (for
example) a Cognos report... it writes to a table somewhere, and that table is then read by the business intelligence tool. The
connection through that table is critical for accurate data lineage reporting. Here is how Metadata Workbench makes the connection
between a DataStage Job and a table...

Shared Tables have a four (4) part name: Host/Database/Schema/Tablename.

Relational Stages in DataStage typically use two (2) parts to identify a particular table: a "Server", "Database", or "DSN" name,
and a tablename. The Server/DSN/Database name is usually in a dedicated property within the Stage. The tablename might be in a
dedicated property, or it could be embedded in user-defined SQL. Any of them might be hard coded or established via Job
Parameters.

The first thing Metadata Workbench needs to do is "map" the abstract "Server/Database/DSN" name to a particular "Host" and
"Database" combination in the list of Shared Tables. Like any application, DataStage is just pointing to some abstract "string"
when trying to find a database. An ODBC DSN, for example, might be the name of the database, but it could also just be
"myODBCdatabase", which really points to (in the ODBC definition) a DB2 database called HRMAIN. Even if we use the string
HRMAIN in the "Server" property of the Stage, we still need a way to identify the particular host that we are considering for data
lineage. This is done via the "Database Alias" link in the "Advanced" tab of the Metadata Workbench.
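Conceptually, the Database Alias page is just a lookup table from the abstract string to a host/database pair. Here is a small Python sketch of that idea, using the example values from this post; the function and the PAYROLL.EMPLOYEES table are hypothetical illustrations, not a Workbench API:

```python
# Hypothetical sketch of Database Alias resolution: an abstract
# "Server/Database/DSN" string from a Stage is mapped to a (Host,
# Database) pair, then combined with the Stage's schema.tablename
# to form the four-part Shared Table name.

# Alias assignments as made on the Workbench "Database Alias" page.
DATABASE_ALIASES = {
    "myODBCdatabase": ("QR2H004", "HRMAIN"),
}

def resolve_shared_table(dsn_string, schema, tablename):
    """Return the four-part Host/Database/Schema/Tablename identity."""
    host, database = DATABASE_ALIASES[dsn_string]
    return (host, database, schema, tablename)

# A Stage using DSN "myODBCdatabase" and table PAYROLL.EMPLOYEES
# (an invented example table) resolves to:
print(resolve_shared_table("myODBCdatabase", "PAYROLL", "EMPLOYEES"))
```

The point of the sketch is only that the DSN string by itself is ambiguous; the alias assignment is what pins it to one imported host/database combination.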

Go to the Advanced tab after signing into the Workbench. Perform Automated Services for one of your projects (this could take
a long time if it's the first time you are doing it). When it finishes, click on "Database Alias". Look carefully at the values there.
These are the "strings" that are used in your various Jobs to identify databases. Pick the string that is appropriate for the Stage
type that you are working with and slide your cursor over to the right. The "Add" button will allow you to select the desired
host/database combination that this abstract string should connect to. In the example noted above, I might assign the string (alias)
myODBCdatabase to the host/database combination of QR2H004/HRMAIN. QR2H004 with Database HRMAIN must be
something I have already imported and is viewable in the left navigation pane of the Workbench, or in the Repository
Management tab of the Information Server Web Console.

Save it (button on the lower right).

The alias is now in effect: whenever that Stage type with that particular string (myODBCdatabase) is
found, Metadata Workbench will use QR2H004/HRMAIN, combined with the fully qualified tablename in the Stage, to match to
a particular Shared Table that has been imported previously.

Are you using Job Parameters? For design-based lineage, Metadata Workbench is smart, and will use the "default values" of
those Job Parameters when finding alias "strings" and also when obtaining the fully qualified (schema.tablename) tablenames to
use in the linking.
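To make that concrete: DataStage Jobs reference parameters as #ParamName# inside property values. A minimal sketch of the default-value substitution described above might look like this (the parameter names and table are invented for illustration; this is not the actual Workbench algorithm):

```python
import re

# Hypothetical sketch: for design-based lineage, every #ParamName#
# reference in a Stage property is replaced by that Job Parameter's
# default value before alias and tablename matching is attempted.

def substitute_defaults(value, defaults):
    """Replace each #Param# reference with its default value."""
    return re.sub(r"#(\w+)#", lambda m: defaults[m.group(1)], value)

defaults = {"TargetSchema": "PAYROLL", "TargetTable": "EMPLOYEES"}
print(substitute_defaults("#TargetSchema#.#TargetTable#", defaults))  # PAYROLL.EMPLOYEES
```

This is why sensible default values matter: if the defaults are blank or placeholders, the resolved tablename won't match any imported Shared Table.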

Are you using $PROJDEF? Run the ProcessEnvVars shell or bat file inside /IBM/InformationServer/ASBNode/bin to obtain
the project definitions for use in this algorithm.

Operational Metadata (a whole separate blog entry is needed to discuss OMD) is used to populate the Job Parameter values from
"run time"... otherwise the same rules apply. If you are just starting, get to know lineage without worrying about OMD. That's an
advanced topic. Understand how database alias works by performing Automated Services, reviewing the Database Alias page in
the Workbench, doing the assignment, and running Automated Services again. Once it is complete, go to the actual table using the
navigator frame at the left (open the Host tree and find the table), right-click, and select "data lineage". If it is a source, select
"where does this go to"; if it is a target, select "where did this come from", and validate your results.

Ernie

Posted in Metadata Workbench, data lineage, datastage, meta data, metadata.
Linking Jobs

September 30, 2010 – dsrealtime


Once you have mastered the "navigation" and asset selection options of Data Lineage reporting, it's time to look at how
DataStage Jobs are automatically linked together. By now you should be comfortable with thinking about your "starting position"
for a data lineage report – your initial "perspective" if you will (what object are you standing on when you begin). You should
also be comfortable with thinking about the "direction" for your Data Lineage investigation – are you looking "upstream" for
"Where did this come from?" or downstream for "Where does this go to?"

If you need a refresher on the basics, please see Getting Started with Data Lineage!

A typical production site for DataStage/QualityStage has MANY Jobs – hundreds perhaps... even thousands, all integrated and
working together to transform your data and move it from one place to another. Sometimes they are written by one very
hard-working developer, who might have all the lineage in his or her head, but more often it is a larger-scale endeavor, with lots of
team members, often scattered around the globe, with varied skill sets and possibly working on related albeit independent
solutions. They may know each other, or may not. How are the DataStage Jobs sequenced from a data flow perspective? How
does data flow between a Job developed to process data received via FTP from the mainframe and then ultimately to a datamart
that supports a reporting system? How does one Job connect to another? Sometimes it may be one giant Job, but it's not likely.
Intermediate temporary tables are often created for everything from checkpoints to Operational Data Stores to "parking lots"
where data can be restructured or delivered to another application along the way. Workbench can sort this out and provide you
with lineage through all these Jobs.

Inter-Job Data Lineage (Data Lineage between Jobs) is largely automatic. You simply have to pay attention to a few "good
sense" leading practices and understand the pattern. Note that this has "nothing" to do with Shared Tables or Table Definitions at
all... it's entirely done by merely parsing through your Jobs [this is a key reason why you can get immediate insight on 7.x Jobs that
are imported into the 8.x environment --- even if you haven't compiled a single one or started your formal testing and QA
process!]

Automated Services is the "parsing" step at the Advanced tab... when you say "run" with your Project(s) checked, Workbench
combs through your Jobs, looking for similarities that will link Jobs together end-to-end. Here's what it looks for among Jobs:

a) Common or "like" Stages between the Target of one Job and the Source of another. Two ODBC Stages are in common, but so
are ODBC and, say, Oracle OCI. Or DB2Load and DB2Connector. Or two Sequential Stages.

b) At least one column in common.

c) Same hard coded values (yuck... who does that?) OR... same "default" values for Job Parameters for the critical common
properties. For RDBMS-type Stages, it's ServerName, Schema, and Tablename. For Sequential-type Stages, it's the filename.
The Automated Services will put together multiple Job Parameter default values if needed.
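The three rules above can be sketched in a few lines of Python. Everything here is illustrative: the stage-type groupings, property tuples, and example stages are invented to show the shape of the matching, not the actual stitching algorithm:

```python
# Hypothetical sketch of the "stitching" rules: two Stages link when
# (a) their Stage types are compatible, (b) they share at least one
# column, and (c) their resolved key properties (here server, schema,
# tablename) are identical.

COMPATIBLE_GROUPS = [
    {"ODBC", "OracleOCI", "DB2Load", "DB2Connector"},  # relational stages
    {"Sequential"},                                    # file stages
]

def stages_link(target, source):
    # (a) common or "like" Stage types
    if not any(target["type"] in g and source["type"] in g
               for g in COMPATIBLE_GROUPS):
        return False
    # (b) at least one column in common
    if not set(target["columns"]) & set(source["columns"]):
        return False
    # (c) same resolved values for the critical common properties
    return target["key_props"] == source["key_props"]

writer = {"type": "DB2Load", "columns": ["EMPNO", "NAME"],
          "key_props": ("myODBCdatabase", "PAYROLL", "EMPLOYEES")}
reader = {"type": "ODBC", "columns": ["EMPNO"],
          "key_props": ("myODBCdatabase", "PAYROLL", "EMPLOYEES")}
print(stages_link(writer, reader))  # True
```

Note how rule (c) is the one that depends on your team's discipline: if the writer and reader Jobs resolve to different strings for the same physical table, no link is made.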

If your team follows good practices of having parameter sets or common values for things like an ODBC DSN, a high degree of
lineage will often occur immediately after the first Automated Services run (informally known as "stitching"). Expect that the
"first" time you run Automated Services, it could take a long time and be very intense. Do it during off hours if you have
hundreds or thousands of Jobs. After that first time it will recognize a delta and only parse through the Jobs that are new or have been
changed.

Now when you do your lineage reporting, and you start while "standing" on the target Stage of a down-stream application
Job... when you ask "Where did this come from?", you can expect to see Stages through many Jobs back to the ultimate
source. If it dead-ends surprisingly, it's probably because one of the three rules above didn't apply, or there's an odd Stage that
isn't supported for lineage (Redbrick is one of the only ones I'm aware of at this point).
If that fails, use the "Stage Binding" at the Advanced Services tab. This is one of the options for "manual" binding of metadata
– a sort of "toolbox" of wrenches and bolts for when you need 'em. The Stage Binding is designed to be used, only when
absolutely necessary, to "force" two Stages together for lineage purposes. It is fairly easy to use... it prompts you first for the
name of a Job, and then when it presents you with a list of Stages, slide your cursor over to the right so that you can "add" a
Stage from another Job. I have used it effectively for unsupported Stages, as noted, but also when the rules above don't apply. In
one case I was sending the output of an XML Stage into a Sequential File... and in the next Job, I was reading that with an
"External Source" Stage. There is nothing in common between those Stage types, and the columns were entirely different (the
Sequential File Stage contains a column called "myXML" and the External Source merely carries the output of a unix list
command – a set of filenames). I was able to establish perfect lineage, however, by using a manual "Stage Binding", forcing the
Sequential Stage of the first Job and the External Source Stage of the second Job to be "bolted" together.

Good reporting! Next topic will be connecting your DataStage Jobs to Shared Tables.

Ernie

Posted in Metadata Workbench, data lineage, datastage, meta data, metadata. Tags: etl, datastage, data lineage, metadata.

  

Getting Started with Data Lineage!

September 28, 2010 – dsrealtime


Last night I was reminded about a series of blog entries I've wanted to make concerning the InfoSphere Metadata Workbench
and how to get the most out of its Data Lineage capabilities. The Workbench is very powerful – it illustrates relationships
between processes, business concepts, people, databases, columns, data files, and much, much more. Combined with Business
Glossary, it gives you reporting capabilities for the casual business user as well as the (often) more technical dedicated metadata
researcher.

I've had a variety of entries about Workbench in the past two years (see the table of contents link in the top right, and find the
metadata section), but nothing on "getting started". As Metadata Workbench starts to support more and more objects, knowing
certain skills and techniques becomes that much more important. This is especially true when trying to gain the most from
Metadata Workbench when it is being used to illustrate Business Terms, Stewards, FastTrack Mappings, DataStage Jobs, Tables
and Files, External ETL Tools, scripts and processes, operational metadata and a vast list of other data integration artifacts.

Many of you who start with Metadata Workbench begin with DataStage/QualityStage Jobs.

So I will start there.

Once you have mastered lineage with DataStage, and its combination with other objects, you can then easily move on to other
concepts for non-DataStage metadata, which I will also cover in this series of blog entries. If you are using Metadata Workbench
and are not a DataStage user, stay tuned. As we progress I will take a tour through Extensions, Extension Mappings, Extended
Data Sources and all other such concepts.


   
To get started, find one of your favorite, fairly simple Jobs. Maybe one with a lookup or a Join, a reasonable sequence of
Stages (8 to 10 or so) and preferably a single "major" target. Since you are learning about the Workbench, you should be
familiar, even intimate, with this Job. That will help as you learn the various ways to navigate through the user interface, because
you will know what to expect at each particular dialog, report or screen.

[this first "getting started" assumes that you have NEVER performed Automated Services against your DataStage Project....if you
have, it's ok, but you might not get the same results as I am outlining below -- you may get more metadata than I am describing in
this initial learning step. ...and if you don't know what I'm talking about (yet), that's ok too...]

Log into the Metadata Workbench and notice the "Engine" pull-down at the left. This is the list of your DataStage Servers and
their Projects. Open up the project and its folders, and find your Job. Click directly on it. Scroll up and down in the detailed page
that appears. There is the main page with the picture of the Job (click on it and you will get an expanded view, in a new window, of
what the Job looks like). The metadata you are viewing is up-to-date from the last moment you or a developer saved the Job in
the DS Designer. Also there is a very important listing of the Stage types in the Job, along with their icons. Note below you have
many "expandable" sections for things like Job Operational metadata... investigate the options.

Now click on the "main" target Stage of this Job. This brings you to a similar-looking detail page, this one for the "Stage." Look
around, but don't click anything – when you are ready, select "Data Lineage" at the upper right. As you do so, consider "where
you are standing" (you are on a "Stage") and what sort of lineage you would like to see. As you will discover, knowing "where
you are" when you start your lineage is very important.

The default option at the next dialog is "Where did this come from". Ignore the three checked boxes for now and click "Create
Report". This will comb through ALL the possible resources for "where" the data for the Stage you started on may have come from. Look
through the list. Note also the highlighted line. Move it up and down. This highlight bar lets you select EXACTLY which resource
you'd like to see for your actual report. The "total" collection of lineage resources is in front of you right now – you will select
which one you want for a detailed source-to-target report. This is often a point of confusion because the highlight bar is not
always obvious. Data lineage doesn't show you "ALL" the sources – just the path to/from the ones that you select [we'll contrast
this in a later entry with Business Lineage, which DOES provide a summary of ALL sources or ALL targets from a particular
resource].

Look at the bottom of the page. Find the button labeled "Display Final Assets". Click it. The list of objects above should get
much smaller. Most likely, it should just show the source stage for this Job, or maybe its ultimate source as well as a lookup
source stage or a source for a Join. Pick the primary source stage for the Job and then click "Show Textual Report".

Review the result. The textual report isn't as pretty, but it tends to be more scalable. Scroll up and down, and note what you see
on the left, and the Job details you see on the right. Everything is hyperlinked. Now find the little triangle towards the top left of
this center pane where your report is (it's called Report Selection or similar) and click on it. That should expose the
"assets" page again. Now you can try "Show Graphical". When you get there, play with it. Grab some white space around the diagram
and move the whole thing around... try the zoom bar in the upper left. Click on the various icons in the lineage, then
right-click on one of the stages and find "open details in new window". That will bring you back to a detailed viewing page and the
process starts again.

What happens if you choose the target stage of your original Job (the first stage you selected earlier) and ask for "Data Lineage"
and select "Where does this go to"? If you haven't done Automated Services as I've noted above, you should likely receive "No
assets found" or "No data for the report". This is because it's the "final" target – there isn't anything else. "Where did this come
from" will yield a similar result if you happen to be "sitting" on a source when you start your lineage exercise.

If you practice this, you should become very familiar with the lineage report user interface, and will have a strong base for
moving forward with more complex, and deeper, scenarios.


(link to next post in this series: Linking Jobs )

Ernie

Posted in Metadata Workbench, data lineage, datastage, etl, meta data, metadata. Tags: etl, data lineage, metadata, metadata
workbench.

December 15, 2009 – dsrealtime


Metadata management is becoming a big issue (... again, finally, for the "nth" time over the years), thanks to attention to
initiatives such as data governance, data quality, master data management, and others. This time it finally feels more serious.
That's a good thing. One conceptual issue that vendors and architects are pushing related to metadata management (whether
home-grown or part of a packaged solution) is "data lineage." What does that mean?

Let's imagine that a user is confused about a value in a report... perhaps it is a numeric result labeled as "Quarterly Profit
Variance".

What do they do today to gain further awareness of this amount and trust it for making a decision? In many large enterprises, they
call someone at the "help desk". This leads to additional phone calls and emails, and then one or more analysts or developers
sifting through various systems, reviewing source code, and tracking down "subject matter experts." One large bank I visited recently
said that this can take _days_! ... and that's assuming they ever successfully find the answer. In too many other cases the
executive cannot wait that long and makes a decision without knowing the background, with all the risks that entails.

Carefully managed metadata that supports data lineage can help.

Using the example above, quick access to a corporate glossary of terminology will enable an executive to look up the business
definition for "Quarterly Profit Variance." That may help them understand the business semantics, but may not be enough...
They or their support team may need to drill deeper. "Where did the value come from?" "How is it calculated?"

Data lineage can answer these questions, tracing the data path (its "lineage") upstream from the report. Many sources and
expressions may have contributed to the final value. The lineage path may run through cubes and database views, ETL processes
that load a warehouse or datamart, intermediate staging tables, shell and FTP scripts, and even legacy systems on the mainframe.
This lineage should be presented in a visual format, preferably with options for viewing at a summary level and an option to drill
down for individual column and process details.

Knowing the original source, and understanding "what happens" to the data as it flows to a report, helps boost confidence in the
results and the overall business intelligence infrastructure. Ultimately this leads to better decision making. Happy reporting!

Ernie Ostic

Posted in Business Glossary, data lineage, datastage, general, meta data. Tags: data lineage, metadata.

February 19, 2009 – dsrealtime


copy-of-createbgimportscsv
copyofcreatebusinesstermsandattributesxmldsx3

Here are a few other Jobs for loading new Terms and Categories into Business Glossary. Like the earlier post on Business
Glossary, these DataStage Jobs read a potential source of Terms (just alter the source stage as needed) and then create a target csv
file that is in the correct format for loading into Business Glossary using the new 8.1 csv import/export features available at the
Information Server Web Console... Glossary tab. The Jobs are fairly well annotated and should be self-explanatory. I haven't yet
set them up for Custom Attributes, nor have they been widely tested – but they are already being implemented at a variety of
locations. Please let me know if you find them useful.

Ernie

(The one with "XML" in the name is the same as the prior blog entry. Each is named .doc, but is actually a .dsx file.)

Posted in Business Glossary, bg, datastage, general.


November 30, 2008 – dsrealtime


There are a variety of ways to import new Terms into the InfoSphere Business Glossary. One of these, for initial loads, is to use
an XML import. The XML format is fairly easy to produce (a sample is provided with the Business Glossary and can be found
at your Information Server Web Console).

The attached DataStage Server Job illustrates how to load new Terms and Attributes from some external structure. In this
example I use a simple sequential file, but if you look at the Job you will see that it can easily be adapted to any source; or simply
write another Job to go from your source to a target that reflects the sample terms I've provided below.

This was tested in 8.0, although it has since been modified for use with 8.1. I'm not sure how well it will import into 8.0. The
terms and attributes are very simplistic and use a hockey theme, just to keep things simple and allow for discussion. The code
is an example for instructional purposes only. Please let me know if you have any questions or run into problems.

(I hope you can make use of these. Seems the blog has changed and won't allow me to upload .txt files. I've tried putting all the
content in "notes pages" of .ppt. You'll need to download and then open the .ppt [one of only a few file types allowed here] and
then see if you can cut and paste the sample .txt file, sample xml, and .dsx.)
