
Datastage
Handling COBOL source file

Version: 1.3
Status: Revised Version
File Name: Handling COBOL source file in Datastage
Last Edit Date: 21st September, 2015


Control Sheet

Document Location
This document is located at:

Document Review
Name: Probal Banerjee
Title: Sr Associate
Organization: Cognizant

Document Approval
Name: Aditi Sanyal
Title: ETL Specialist
Organization: Cognizant
Signature: AS
Date: 22nd Sep, 2015

Document Author
Name: Sourav Guha
Title: ETL Developer
Organization: Cognizant
Signature: SG
Date: 30th May, 2015
Version History

Version - Amendment/Reason - Who - Date
1.0 - Draft, initial overview only - Sourav Guha - 30th May, 2015
1.1 - Revised version, review comments incorporated - Sourav Guha - 15th July, 2015
1.2 - Revised version, review comments incorporated - Sourav Guha - 4th September, 2015
1.3 - Revised version, review comments incorporated - Sourav Guha - 21st September, 2015


Table of Contents

1.0 Introduction
2.0 Context
    Target audience
    Assumptions
    Business Scenario
3.0 Handling COBOL source file
    3.1 Overview: COBOL source file
        3.1.1 COBOL File layout
        3.1.2 Mainframe Access Type
        3.1.3 Multiple Record Type
        3.1.4 Master-Detail file structure
    3.2 Using Complex Flat File Stage
        Step 1: Import Metadata from .cfd file
        Step 2: Check Imported Table definitions
        Step 3: Check Data File
        Step 4: Edit Complex Flat File Stage
            3.2.1 Stage Section
            3.2.2 Output Section
        Step 5: Edit Transformer Stage and load target dataset
        Step 6: View target Data
4.0 Appendix


1.0 Introduction
IBM Infosphere Datastage is a powerful data integration tool. It is
capable of integrating data on demand across multiple, high-volume data
sources and target applications using a high-performance parallel
framework. Infosphere Datastage also facilitates extended metadata
management and enterprise connectivity. In simple words, Datastage is an
ETL tool which performs extraction, transformation and load operations in
an EDW framework. The scalable platform provides flexible integration of
all types of data, including big data at rest (Hadoop-based) or in motion
(stream-based), on distributed and mainframe platforms. A source file may
appear in different formats, viz. ASCII text fixed length, EBCDIC text
fixed length, XML hierarchical data, comma-separated (.csv) files etc.
Extracting data from the source side can be tricky at times.
What if the source is a mainframe-generated COBOL source file which needs
to be processed using Datastage? So let's consider a mainframe-generated
multi-record fixed-length file which contains different sorts of
information in separate records of equal length, e.g. employee basic
information in one record and payroll information in another. That
naturally raises the following questions:

- What is a multi-record file?
- How to handle a fixed-length COBOL file in Datastage?
- How to identify different types of records, containing different sorts
  of information, in the same file?

This document focuses on finding answers to the above questions and
covers related topics along the way. With the help of a simple business
scenario, it will show how to handle a COBOL-generated file as a source
and will answer a few related queries in the process. So, apart from
getting familiar with handling a source file generated by a mainframe
system, this document also helps build a basic understanding of COBOL
files and their layout.

2.0 Context
Target audience
The intended audience for this document is Datastage developers looking
for simple development guidelines when they encounter a
mainframe-generated COBOL source file during job development. Apart from
the process flow, they can gather a few important concepts about such
source files from this document.

Assumptions
The following assumptions have been made before proceeding further:
- Basic concepts of ETL (Extract, Transform, and Load) are known to the
  reader.
- Basic stages of IBM Datastage (copy, sort, join, lookup, aggregate
  etc.) and their operations are known to the reader.
- The COBOL metadata definition file (Info.cfd), which holds the metadata
  of the source file, is stored on the local system.

Business Scenario

---Business Scenario Diagram: Mainframe system (source file gets
generated) -> ETL process flow starts (source file received) ->
Extraction (Datastage reads data from the source file) -> Transformation
(necessary transformation is applied in a parallel job) -> Load (data is
stored in a dataset and subsequently in permanent storage) -> Reporting
tool (reporting team handles report generation)---

Let's consider that we have a three-layer architecture in a project, the
first being the high-level architecture layer where the source data is
generated. Here, the source data system is a mainframe system. The
generated data flows to the next layer, which processes the data
according to the requirement. In the above high-level diagram of the
project, the portion highlighted in grey depicts the ETL process flow.
This is the second layer of the project, and the data processed in this
layer is made available to the Reporting layer at the end of the flow.
The Reporting layer is the final layer, which generates the necessary
reports for the end user.

[N.B.] Having understood the business scenario, it is important to note
that the point of emphasis of this document is the ETL process flow of
the above architecture. To be more precise, the subject matter of this
document centers around the Extraction part of the ETL process flow, so
the Transformation and Load parts have been kept as simple as possible in
the business scenario.
ETL process flow starts: Let us consider that a source data file is
received from the mainframe system; it is a fixed-length multi-record
file containing employee basic information and payroll information in
separate records of equal length.
Extraction: Records need to be fetched from the COBOL-generated source
data file using the Complex Flat File stage of Datastage. While reading
records from the file, employee basic information and payroll information
must be identified separately and treated as different record sets.
Transformation: Necessary transformation logic has to be applied; the
employee basic information and payroll information are processed
separately.
Load: Load employee basic information and payroll information into
separate storage areas. For simplicity, let's consider storing the
records in two separate datasets.

3.0 Handling COBOL source file


3.1 Overview: COBOL source file
A mainframe-generated COBOL file can have a hierarchical structure in its
arrangement of columns. It is physically flat (that is, it has no pointers or other
complicated infrastructure), but logically represents parent-child relationships.
To understand the COBOL file better, one should first have a brief idea of the
COBOL file layout.

3.1.1 COBOL File layout

Fields and the PIC clause


The lowest-level data item in a COBOL layout is a field, also called
an elementary item. Several fields can be associated to form a group. All the
fields together form a record.
A COBOL layout consists of a line for each field or group. A COBOL field
definition gives the level (discussed later), the field name, and a "picture",
or PIC clause, which tells the data type or data category of the field and its
size. The three data types most likely to be seen are:

"A" for alpha (A-Z, a-z, and space only).

"9" for a numeric field (numbers 0-9, but no letters).

"X" for any character, (including binary).

Columns, Line Numbers, and Comments


Columns 1-6 in most COBOL layouts are ignored by the compiler, as is
everything after column 72. Line numbers or other comments (such as when a
field was added or changed, or where it originated from) will most likely be
found only in these columns.
COBOL layouts are divided into "areas", and there are many rules about what
data may be found in which area, but one thing to remember is that an
asterisk, *, in column 7 turns the entire line into a comment, which is ignored
by the COBOL compiler. Even if that line contains a field specification, it will
be ignored if there is an * in column 7. Find a glimpse of a simple COBOL layout
below:

Pic 1: COBOL file layout

Levels and Groups


COBOL layouts have levels, from level 01 to level 49. These levels tell the
COBOL compiler how to associate, or group, fields in the record. Level 01 is a
special case and is reserved for the record level; the 01 level is the name of
the record. Levels from 02 to 49 are all "equal" (level 2 is no more significant
than level 3), but there is a hierarchy to the structure. Any field listed at a
lower level (higher number) is subordinate to a field or group at a higher level
(lower number). In the example given below, LASTNAME and FIRSTNAME are parts of,
or belong to, the group NAME, as can be seen from the level numbers (13 for the
fields, 11 for the group).
11  NAME
    13  LASTNAME     PIC X(0005).
    13  FIRSTNAME    PIC X(0005).

Also, it needs to be noticed that NAME does not have a type of its own, which
signifies that it is a group, not a field. A group can contain both sub-groups
and fields. The level numbers identify which group or sub-group a field or a
sub-group belongs to.
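To make the idea of fixed positions concrete, here is a minimal sketch in plain
Python (outside Datastage; the record value is invented for illustration) of how
the PIC widths of the NAME group above translate into fixed byte offsets when a
flat record is sliced:

# Illustration only: LASTNAME is PIC X(0005) (bytes 1-5) and FIRSTNAME is
# PIC X(0005) (bytes 6-10); the sample record value is invented.
record = "SMITHJOHN "          # one 10-byte fixed-length record
lastname = record[0:5]         # "SMITH"
firstname = record[5:10]       # "JOHN "
name = record[0:10]            # the group NAME spans both elementary fields
print(name, "->", lastname.strip(), firstname.strip())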
REDEFINES Clause
The REDEFINES clause allows the same memory area/block to be described by
different data items. If one or more data items are not used simultaneously,
i.e. if we are sure that two or more field values will not be in use at the same
time, then we go for the REDEFINES clause. It is basically done to save and
reuse memory blocks. NAME and OFFICIAL_NAME are two fields that make use of
the REDEFINES clause in the example given below.
11  NAME
    13  LASTNAME     PIC X(0005).
    13  FIRSTNAME    PIC X(0005).
11  OFFICIAL_NAME REDEFINES NAME    PIC X(0010).

But there are some rules for using this clause. Make a note of them:
- Level numbers of items sharing the memory space should be the same.
- The length of the redefining item should be equal to that of the redefined
  item.
- Several data items can redefine the same data item.
- A record item can even redefine another record item.

OCCURS & OCCURS DEPENDING ON Clause


Suppose monthly sales figures for the year need to be stored. Normally we would
define 12 fields, one for each month, like this:
11  MONTHLY_SALES_1     PIC 9(0020).
11  MONTHLY_SALES_2     PIC 9(0020).
...
11  MONTHLY_SALES_12    PIC 9(0020).

But there's an easier way in COBOL. The field can be specified once and declared
to repeat 12 times. This is done with the OCCURS clause, like this:

11  MONTHLY_SALES  OCCURS 12 TIMES    PIC 9(0020).

This specifies 12 fields, all of which have the same PIC, and is called a table
(also called an array). The individual fields are referenced in COBOL by
using subscripts, such as MONTHLY_SALES(1).
OCCURS DEPENDING ON is an OCCURS, like the one above, but the number of times it
occurs in a particular record can vary (between some limits). The number of
times it actually occurs in any particular record is given by the value of
another field in that record. This creates records that vary in size from record
to record. E.g.:

11  VALID_MONTHS                          PIC 9(0005).
11  MONTHLY_SALES  OCCURS 0 TO 12 TIMES
        DEPENDING ON VALID_MONTHS         PIC 9(0020).

Apart from getting familiar with the COBOL layout, there are a few more things
the reader should have at least a brief idea about before diving into the given
business scenario and how to solve it with Datastage.

3.1.2 Mainframe Access Type

An access method (type) defines the technique that is used to store and
retrieve data. Access methods have their own data set structures to organize
data, system-provided programs (or macros) to define data sets, and utility
programs to process data sets.
There are times when an access method identified with one organization can be
used to process a data set organized in a different manner. For example, UNIX
files can be processed using BSAM, QSAM, the basic partitioned access
method (BPAM), or the virtual storage access method (VSAM).
Normally, to read a COBOL source file, the mainframe access type should be set
to QSAM_SEQ_COMPLEX in Datastage. Quite a few other access types are available;
we pick the file access type based on business requirements.
For more information, refer to the Appendix.

3.1.3 Multiple Record Type

COBOL-generated data files can be of either single record type or multiple
record types. For a single record type, all the records have the same structure;
this is what we normally see when we use a sequential file as a source. However,
in real life this may not be the case: there can be multiple record types in a
single file. In our case, although the records for employee basic information
and payroll information are available in a single file, they naturally have
different structures.
To describe multiple record types, more than one record description is used in
the file's FD (file description) entry. As record descriptions begin with a
level 01 entry, a separate level 01 entry is specified for each record type. As
already mentioned, to separate the different record types there should be an
identifier column; in the following data file it is DJY6001. Follow the file
structure below:

Pic 2: Multiple record type layout
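As a minimal illustration of what "identifying the record types" amounts to, the
sketch below uses plain Python rather than Datastage; the file name and the |
delimiter are taken from the business scenario, and it assumes the identifier
column DJY6001 is the first character of each record:

# Illustration only: split the pipe-delimited content into records and
# classify each record by its identifier character (DJY6001).
with open("INFO DATA.txt") as f:
    records = [r.strip() for r in f.read().split("|") if r.strip()]
emp_records = [r for r in records if r[0] == "1"]   # employee basic information
sal_records = [r for r in records if r[0] == "2"]   # payroll information
print(len(emp_records), "EMP records,", len(sal_records), "SAL records")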

3.1.4 Master-Detail file structure

A master-detail file structure is one where a master record is followed by its
detail records. Generally, if data is extracted from a multiple-record-type file
with a master-detail structure, the master record type should be specified
explicitly. In our file, records beginning with "1" are master records with
employee information and records beginning with "2" are detail records with
payroll information for the corresponding employee. There can be more than one
detail record against each master record. Later we'll see how to declare the EMP
record type as the master record type in Datastage while extracting data.

Pic 3: Master-Detail data

The relationship between the master and detail records is inherent only in the
physical record order: payroll records correspond to the employee record they
follow. However, if this is the only means of relating detail records to their
masters, the relationship is lost once each record type is loaded into its
target separately. For an idea of how to keep the relationship maintained,
please refer to the Appendix.
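A hedged sketch of the idea in plain Python (not how Datastage or any particular
loader actually does it; the file name, the | delimiter and the pairing logic
are assumptions made for illustration): one way to preserve the relationship is
to carry the current master record along while the physical order is still
known:

# Illustration only: walk the records in physical order and link each SAL
# (detail) record to the EMP (master) record it follows.
with open("INFO DATA.txt") as f:
    records = [r.strip() for r in f.read().split("|") if r.strip()]
pairs = []
current_master = None
for rec in records:
    if rec.startswith("1"):        # EMP master record
        current_master = rec
    elif rec.startswith("2"):      # SAL detail record belongs to the last master seen
        pairs.append((current_master, rec))
print(len(pairs), "detail records linked to the master they follow")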

3.2 Using Complex Flat File Stage


Now let us see how to make use of Datastage to solve our business scenario.
Well have to use Complex flat file stage in order to read data from a COBOL
generated file. In a step by step process well see how to handle COBOL source
data file with the given business scenario. This will include the following steps:

Import Metadata from .cfd file


Check Imported Table definitions
Check Data File
Edit Complex Flat File Stage
Edit Transformer Stage
View Target Data

Step 1: Import Metadata from .cfd file


Before starting the data extraction, the metadata for the data file has to be
defined. We have already assumed that the metadata definition is stored on the
local system under the name Info.cfd; this file is called a copybook. One can
open this file using Notepad or a similar editor.

Pic 4: Sample Metadata

- In Datastage Designer, click Import > Table Definitions and then select
  COBOL File Definitions.
- Select the path to the Info.cfd file.
- Specify the start position as 15.
- A list of all the tables should appear in the Tables section. If not, the
  copybook might have an error; refer to the FAQs below.
- In the To folder box, the location to store the table definitions has to be
  specified. In our example we will store them in Table Definitions > COBOL FD
  > Info.
- Select all the tables and click Import. In case of single-record metadata,
  only one table definition will be imported.

Pic 5: Import metadata page

FAQ1: What is the start position?

- In the Start position field of the metadata import page, we have to specify
  the column position of level 01 in the copybook. For multiple record types,
  each of the table definitions should have level 01 specified at the same
  position; otherwise the metadata import may remain incomplete or may fail.
  For our example it is 15, but this can differ for other copybooks.
FAQ2: What if there is no level 01 specified in the available metadata?

- As mentioned earlier, level 01 specifies the table name, so the absence of
  level 01 will cause the metadata import to fail. If it is absent, we may need
  to manually edit the copybook while importing the metadata.
  If we try to import metadata from a copybook that contains any kind of fault,
  we'll be asked to edit the copybook through a new pop-up window, with the
  error details shown at the bottom of the same window.

Step 2: Check Imported Table definitions

Once we have finished importing the COBOL file definitions, we need to check
whether the metadata has been imported successfully.
- Navigate to Table Definitions > COBOL FD > Info.
- Open the EMP file definition. In the General tab, we can find basic
  information about the imported file definition, viz. mainframe platform,
  mainframe access type etc. For reference, a short description of the file
  definition can be kept in the same tab.

Pic 6: Table definition properties General tab

- Click on the Columns tab and examine the level numbers.

Pic 7: Table definition properties Columns tab

- Go to the Layout tab, where we can examine the mainframe layout of the table
  definition.

Pic 8: Table definition properties Layout tab

- On the Format tab, we can view the options applicable if the table definition
  is used with a sequential file in a server job. Further information on this is
  out of scope for this document.

Pic 9: Table definition properties Format tab

In a similar way, the SAL file definition can be checked.

FAQ: Can the imported metadata be changed manually?

- Yes. Even after the import is done, the metadata can be altered. In the
  Columns tab, double-click on the left side of the column that needs to be
  changed. This opens the properties window of the column, where we can alter
  the definition accordingly. But normally we should follow what is given in the
  copybook unless told otherwise.
  As an example, let us change the length of the ADDRESS field in the EMP file
  definition.
  1) Double-click on the left-hand side of the ADDRESS field on the Columns tab.
  2) Specify the length to be 100 (earlier it was 20).

Pic 10: Edit column metadata

3) Click Apply, then Close.
4) In the Columns tab, the altered metadata can now be seen.
5) Click OK.

Step 3: Check Data File

Before handling the Complex Flat File stage, we'll have a quick look at the data
file in hand.
- Open the file INFO DATA.txt in Notepad. It is a mainframe-generated
  multi-record source file which contains employee basic information and
  employee salary/payroll information in separate records.
- Have a close look at the different types of records contained in the data
  file. Records are delimited by the | character. The first character of each
  record defines its record type: 1 implies an EMP record, 2 implies a SAL
  record.

Pic 11: Source file

Two or more consecutive SAL records following an EMP record will all be related
to that EMP record.
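Since the stage will later be configured with Record type = Fixed, it can be
worth confirming that the records really are of equal length. A minimal sketch
in plain Python (outside Datastage; file name and delimiter as per the
scenario):

# Illustration only: every record in a fixed-length file should have the same
# length. Trailing spaces are part of a fixed-length record, so only line
# breaks are stripped here.
with open("INFO DATA.txt") as f:
    records = [r.strip("\r\n") for r in f.read().split("|") if r.strip()]
lengths = {len(r) for r in records}
print("record lengths found:", lengths)   # expect a single value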

Step 4: Edit Complex Flat File Stage

The metadata has now been imported to read the COBOL data file in hand. A COBOL
data file might appear in EBCDIC or ASCII format, fixed-length or
character-delimited. In our case we'll consider a fixed-length ASCII file, which
is a little more complex. As the article progresses, the reader will get to know
the other cases too.
- Create a parallel job with the simple structure shown below. The Complex Flat
  File stage reads the data file in hand. The Transformer stage is useful if any
  basic logical transformation of the data is required. Finally, the processed
  records are moved to separate datasets.

Pic 12: Parallel Job

- Open the Complex Flat File stage.

3.2.1 Stage Section

File Options Tab

- Move to the File Options tab. Here the source data file name, along with its
  full path, has to be specified. As shown in the image below, one can browse
  directly to the source file path, or even parameterize it.

Pic 13: File Options Tab

- Record type should be set to Fixed from the drop-down, as we are dealing with
  a fixed-length data file.
- Normally this source stage reads records sequentially, but we have the option
  of making it read records in parallel: either check the Read from multiple
  nodes box or specify a value larger than 1 in the Number of readers per node
  field.

Pic 14: Multiple node reading

- We can also fetch records by applying a filter, or fetch only the first N
  records, in this tab. Both of these fields can be parameterized.

FAQ1: What is the Fast path given in the tab?

- A navigation control called Fast path can be found at the bottom left of the
  tab. It helps navigate through a few of the most essential tabs;
  alternatively, we can choose which tab to work on by clicking directly on the
  tab names.

FAQ2: How to handle a file other than a fixed-length one?

- In the Record type section of the page, choose a different option from the
  drop-down according to your case.

Record Options Tab

- Move to the Record Options tab. Specify the Character set as ASCII, the Data
  format as Text and the Record delimiter as |.
- For EBCDIC formatting, please select accordingly from the drop-downs of each
  option.
- Optionally, a few decimal properties can be specified in this tab. But since
  we are only dealing with character-type values in our file, we'll ignore this.
- Next, specify the default value of Character as (NULL).

Records Tab
- Now we'll load the metadata that we saved earlier, in order to read the data
  file.
- First, uncheck the Single record box (bottom right-hand side). Our file is of
  multiple record types: as we have seen, our copybook contains two record
  types, namely EMP and SAL.
- Right-click on the left pane of the tab and select Add new record.

Pic 15: Records tab- Add new record

- Change the name of the record type from the default name NEWRECORD to EMP.
- Click Load. Select the EMP table definition saved earlier, and all the columns
  from the table.
- Click OK to load the columns.
- Specify the Field width in the Extended attributes for every CHARACTER field.
  Unless you do this, View Data may fail, resulting in the parallel job
  aborting. Specify a field width equal to the length of the field.

Pic 16: Records tab- Extended attributes

- Add another record by right-clicking on the left pane. Alternatively,
  different icons can be found at the bottom of the pane for adding new records.
  Hover the mouse over each icon and a tooltip will appear describing its
  function. The image below shows the Insert new record after current record
  option.

Pic 16: Insert new record after current record

- Rename the new record type to SAL. Repeat the process to define and load its
  table definition.
- Select the EMP record type. Now click the right-most icon at the bottom
  (Toggle Master Flag) to make the EMP record type the master.


Records ID Tab
- Move to the next tab, the Records ID tab.
- Specify record identifiers for both EMP and SAL record types.

Pic 17: Record ID- EMP

- For EMP records the identifier field is DJY6001=1, and for the SAL record type
  it should be DJY6001=2.

Pic 18: Record ID- SAL

Layout Tab
- The next tab is used to view the column layout for each of the record types.
- Select COBOL layout. From the left-hand pane, click on the record type whose
  column layout you want to view.

Pic 19: Layout tab

NLS Map Tab

- There is not much to alter in this tab. It may be necessary to check the
  Allow per-column mapping box if the data file has metadata that needs a
  different NLS map for different native types. In our case we'll keep the
  default settings.

3.2.2 Output Section

Selection Tab
- Move to Fast path 3 of 4, which navigates to the Selection tab of the Output
  section.
- Select all the columns from the left-hand side.
- Click the >> button. This nominates all the columns to be propagated to the
  next stage via the output link.
- If any column needs to be dropped, select that column on the right-hand side
  and click the < button. In our case, we'll skip this step.

Pic 20: Selection tab

- Also, leave the Enable all group column selection box unchecked.

Constraint Tab
- Move to the next tab.
- Define output constraints for each record type by clicking the Default button.
  This ensures that only records of these two types flow into the output link.
  Note that for a single record type, nothing needs to be specified here.

Pic 21: Constraint tab

- The applied constraints can be seen in logical form in the lower portion of
  the tab.

Columns Tab
- In the Columns tab, the columns flowing into the output link can be seen.
- Optionally, the column metadata can be saved for future reference by clicking
  the Save As button.

Pic 22: Columns tab

- Click View Data to see the source data. You may notice that all the columns
  from all record types are populated with data. But it should be kept in mind
  that data in a particular column of a given record type is only valid if the
  record identifier field DJY6001 holds that record type's identifier character
  (1 for EMP, 2 for SAL).

Pic 23: View Data

Step 5: Edit Transformer Stage and load target dataset

- Open the Transformer stage.
- The EMP_records output link should get basic EMP information only.
- The SAL_records output link should get employee payroll information. Map the
  input columns accordingly.

Pic 24: Transformer stage

- The flow of records into their respective output links needs to be controlled.
  For this, specify constraints for each link as shown below.

Pic 25: Transformer constraints

- Open the target dataset Ds_EMP and specify the target dataset name as
  EMP_records.ds with the full pathname. Set Update Policy = Overwrite.
  Optionally, the dataset pathname can be parameterized for multiple references.
- Similarly, configure the Ds_SAL target dataset.

Pic 26: Target dataset properties

Step 6: View target Data

- Compile the job and save it with a meaningful name.
- Run the job.
- On successful completion of the job run, open the target dataset
  EMP_records.ds and click View Data.

Pic 27: EMP dataset

- Open the target dataset SAL_records.ds and click View Data.

Pic 28: SAL dataset

4.0 Appendix

- For more information on mainframe access types, please refer to
  http://www-01.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_150.htm
- For more information on multiple record types, please refer to
  http://coboltuts.blogspot.com/p/multiple-record-types-in-earlier.html
- For further information on master records and the master-detail relationship,
  refer to
  http://docs.oracle.com/cd/B28359_01/owb.111/b31278/concept_etl_performance.htm#i1143526
- For detailed information on the Records tab, please refer to
  http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_8.5.0/com.ibm.swg.im.iis.ds.design.help.doc/topics/CFF_stage_page_records_tab.html
