WebSphere DataStage
Version 8
SC18-9939-00
Note: Before using this information and the product that it supports, read the general information under Notices and trademarks on page 53.
Ascential Software Corporation 2002, 2005. Copyright International Business Machines Corporation 2005, 2006. All rights reserved. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Chapter 1. Using WebSphere DataStage to Run SAS Code . . . 1
Writing SAS Programs . . . 1
Using SAS on Sequential and Parallel Systems . . . 1
Pipeline Parallelism and SAS . . . 2
Required SAS Environment . . . 2
  SAS Licenses . . . 2
  Installation Requirements . . . 2
Configuring Your System to Use the WebSphere DataStage SAS Stage . . . 3
  SAS Executables . . . 4
  Long Name Support . . . 5
The SAS Stage . . . 6
A Data Flow . . . 6
Representing SAS and Non-SAS Data in WebSphere DataStage . . . 8
  WebSphere DataStage Data Set Format . . . 8
  Sequential SAS Data Set Format . . . 8
  Parallel SAS Data Set Format . . . 8
  Transitional SAS Data Set Format . . . 9
Getting Input from a SAS Data Set . . . 9
Getting Input from a Stage or a Transitional SAS Data Set . . . 10
Converting between Data Set Types . . . 11
  Converting WebSphere DataStage Data to SAS Data . . . 11
  Converting SAS Data to WebSphere DataStage Data . . . 12
A WebSphere DataStage Example . . . 13
Making SAS Steps Parallel . . . 13
  Executing DATA Steps in Parallel . . . 13
  Example Applications . . . 14
Executing PROC Steps in Parallel . . . 21
  Example Applications . . . 21
Points to Consider in Parallelizing SAS Code . . . 27
  Rules of Thumb . . . 27
Using SAS with European Languages . . . 29
Using WebSphere DataStage to Do ETL . . . 29
Running in NLS Mode . . . 30
Parallel SAS Data Sets and SAS International . . . 30
  Automatic Step Insertion . . . 31
  Handling SAS CHAR Fields Containing Multi-Byte Unicode Data . . . 31
Specifying an Output Schema . . . 31
Controlling ustring Truncation . . . 32
Inserted Partition and Sort Components . . . 32
Environment Variables . . . 32

Index . . . 57
Sort Limitations
Steps that process an entire data set at once, such as a PROC SORT, can provide no output to downstream steps until the sort is complete. Pipeline parallelism allows individual records to feed the sort as they become available. The sort is therefore occurring at the same time as the upstream steps. However, pipeline parallelism is broken between the sort and the downstream steps, since no records are output by the PROC SORT until all records have been processed. A new pipeline is initiated following the sort.
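For example, a step like the following cannot emit any records downstream until the sort completes. This is an illustrative sketch only; the data set and key names are not from the original document:

```sas
/* Illustrative only: a sorting step that breaks pipeline parallelism. */
/* Records can stream INTO the sort as they arrive, but nothing flows  */
/* OUT until every record has been read and sorted.                    */
proc sort data=liborch.in_trans out=liborch.sorted_trans;
  by acctno;
run;
```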
Installation Requirements
WebSphere DataStage supports SAS versions 6.12, 8.2, 9.1, and 9.1.3. At installation time, WebSphere DataStage determines which version of SAS is installed by examining the values in the $PATH environment variable. An example directory path is
/usr/local/sas/sas612
If WebSphere DataStage is unable to confirm which version is installed, you are prompted to configure the SAS system.

Note: If you are running SAS 9.1 or 9.1.3 on Linux, LD_LIBRARY_PATH must not have /lib in the path. The presence of /lib causes SAS to produce a core dump when called by the SAS engine.

To verify that the SAS installation is correct for your environment and compatible with WebSphere DataStage requirements, run the following command on the WebSphere DataStage server:
osh -f $APT_ORCHHOME/examples/SAS/sanitytest
The log file tells you if your installation is correct. If the SAS stage is running against SAS Basic, WebSphere DataStage always uses a user ID of dsadm. Be sure that user dsadm has read permissions to all SAS data files and libraries as well as write permissions to any SAS data set to be updated.
In a configuration file with multiple nodes, you can use any directory path for each resource disk in order to obtain the SAS parallel data sets. The resource directory paths appear in the example configuration below.
{
  node "elwood1" {
    fastname "elwood"
    pools ""
    resource sas "/usr/local/sas/sas8.2" { }
    resource disk "/u1/dsadm/ibm/WDIIS/DataStage/Datasets/node" {pools "" "sasdataset"}
    resource scratchdisk "/u1/dsadm/ibm/WDIIS/DataStage/Scratch" {pools ""}
  }
  node "solpxqa01-1" {
    fastname "solpxqa01"
    pools ""
    resource sas "/usr/local/sas/sas8.2" { }
    resource disk "/u1/dsadm/ibm/WDIIS/DataStage/Datasets/node" {pools "" "sasdataset"}
    resource scratchdisk "/u1/dsadm/ibm/WDIIS/DataStage/Scratch" {pools ""}
  }
}
Note: In WebSphere DataStage Version 7.5.1 and earlier, the path for each resource disk was required to be unique in order for the SAS parallel data sets to function. For the SAS stage to behave as it did at Release 7.5.1 or earlier, use the environment variable APT_SAS_OLD_UNIQUE_DIRECTORY (see APT_SAS_OLD_UNIQUE_DIRECTORY). The resource names sas and sasworkdisk and the disk pool name sasdataset are reserved words.
When you specify a character setting, the sascs.txt file must be located in the platform-specific directory for your operating system. WebSphere DataStage accesses the setting that is equivalent to your sas_cs specification for your operating system. ICU settings can differ between platforms.

DBCSLANG Settings and ICU Equivalents: For the DBCSLANG settings you use, enter your ICU equivalents.

DBCSLANG Setting    ICU Character Set
JAPANESE            icu_character_set
KATAKANA            icu_character_set
KOREAN              icu_character_set
HANGLE              icu_character_set
CHINESE             icu_character_set
TAIWANESE           icu_character_set
SAS Executables
US SAS and International SAS
There are two versions of the SAS executable: International SAS and US SAS. At installation time, you can install one or the other or both. However, a job runs either International SAS or US SAS. If NLS is enabled, WebSphere DataStage looks for International SAS. If NLS is not enabled, WebSphere DataStage looks for US SAS.
Use basic (US) SAS to perform European-language data transformations, not International SAS, which is designed for languages that require double-byte character transformations. For additional information about using SAS with European languages, see Using SAS with European Languages.
* APT_SASINT_COMMAND absolute_path
Overrides the $PATH directory for SAS with an absolute path to the International SAS executable. An example path is:

/usr/local/sas/sas8.2int/dbcs/sas
When you have invoked SAS in standard mode, your SAS log output has this type of header:

NOTE: SAS (r) Proprietary Software Release 8.2 (TS2MO)
For SAS 6.12, WebSphere DataStage handles SAS file and column names up to 8 characters.
A Data Flow
The following is a visual depiction of the underlying data flow at execution.
This example shows a SAS application reading data in parallel from DB2. The DB2 Enterprise stage streams data from DB2 in parallel and converts it automatically into the standard WebSphere DataStage data set format. The sasin operator of the SAS stage then converts it, by
default, into an internal SAS data set usable by the SAS stage. Finally, the sasout operator of the SAS stage inputs the SAS data set and outputs it as a standard WebSphere DataStage data set. The SAS stage can also input and output data sets in parallel SAS data set format. Descriptions of the data set formats are in the next section.
Representing SAS and Non-SAS Data in WebSphere DataStage

WebSphere DataStage Data Set Format
This is data in the normal WebSphere DataStage data set format. Stages, such as the database read and write stages, process data sets in this format.
Format
DataStage SAS Dataset
UstringFields[icu_mapname];field_name_1;field_name_2; ... field_name_n;
LFile: platform_1:filename_1
LFile: platform_2:filename_2
...
LFile: platform_n:filename_n
Example
DataStage SAS Dataset
UstringFields[ISO-2022-JP];B_FIELD;C_FIELD;
LFile: eartha-gbit 0: /apt/eartha1sand0/jgeyer701/orch_master/apt/pds_files/node0/test_saswu.psds.26521.135470472.1065459831/test_saswu.sas7bdat
Set the APT_SAS_NO_PSDS_USTRING environment variable to output a header file that does not list ustring fields. This format is consistent with previous versions of WebSphere DataStage:
DataStage SAS Dataset
Lfile: HOSTNAME:DIRECTORY/FILENAME
Lfile: HOSTNAME:DIRECTORY/FILENAME
Lfile: HOSTNAME:DIRECTORY/FILENAME
. . .
Note: The UstringFields line does not appear in the header file when any one of these conditions is met:
* You are using a version of WebSphere DataStage prior to version 7.0.1.
* Your data has no ustring schema values.
* You have set the APT_SAS_NO_PSDS_USTRING environment variable.

If one of these conditions is met but your data contains multi-byte Unicode data, you must use the Modify stage to convert the data schema type to ustring. This process is described in Handling SAS CHAR Fields Containing Multi-Byte Unicode Data. A persistent parallel SAS data set is automatically converted to a WebSphere DataStage data set if the next stage in the flow is not an occurrence of the SAS stage.
In this example:
* The code is provided through the Property tab on the Stage page of the SAS stage
* libname temp "/tmp" specifies the library name of the SAS input data
* liborch.p_out specifies the output transitional SAS data set
* temp.p_in is the input SAS data set
The output transitional SAS data set is named using the following syntax:
liborch.osds_name
As part of writing the SAS code executed by the SAS stage, you must associate each input and output data set with an input and output of the SAS code. In the figure above, the SAS stage executes a DATA step with a single input and a single output. Other SAS code to process the data would normally be included. In this example, the output data set is prefixed by liborch, a SAS library name corresponding to the WebSphere DataStage SAS engine, while the input data set comes from another specified library; in this example the data set file would normally be named /tmp/p_in.ssd01.
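Putting these pieces together, the DATA step described above can be sketched as follows. The library and data set names come from the example in the text; the body of the step is a placeholder, since the original processing code is not shown:

```sas
libname temp "/tmp";   /* SAS library holding the input data set           */

data liborch.p_out;    /* output: transitional SAS data set, written via   */
                       /* the WebSphere DataStage SAS engine (liborch)     */
  set temp.p_in;       /* input: the data set file /tmp/p_in.ssd01         */
  /* SAS code to process the data would normally go here */
run;
```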
To define the input data set connected to the DATA step input, you use the Inputs property of the SAS stage.
Data Types
The SAS stage converts WebSphere DataStage data types to corresponding SAS data types. All WebSphere DataStage numeric data types, including integer, float, decimal, date, time, and timestamp, are converted to the SAS numeric data type. WebSphere DataStage raw and string fields are converted to SAS string fields. WebSphere DataStage raw and string fields longer than 200 bytes are truncated to 200 bytes, which is the maximum length of a SAS string. The following table summarizes these data type conversions:
Table 2. Conversions to SAS Data Types

WebSphere DataStage Data Type                                  SAS Data Type
date                                                           SAS numeric date value
decimal[p,s] (p is precision and s is scale)                   SAS numeric
int64 and uint64                                               not supported
int8, int16, int32, uint8, uint16, uint32                      SAS numeric
sfloat, dfloat                                                 SAS numeric
fixed-length raw, in the form raw[n]                           SAS string with length n (the maximum string length is 200 bytes; strings longer than 200 bytes are truncated to 200 bytes)
variable-length raw, in the form raw[max=n]                    SAS string with a length of the actual string length (maximum 200 bytes; longer strings are truncated to 200 bytes)
variable-length raw, in the form raw                           not supported
fixed-length string or ustring, in the form string[n] or ustring[n]                SAS string with length n (maximum 200 bytes; longer strings are truncated to 200 bytes)
variable-length string or ustring, in the form string[max=n] or ustring[max=n]     SAS string with a length of the actual string length (maximum 200 bytes; longer strings are truncated to 200 bytes)
variable-length string or ustring, in the form string or ustring                   not supported
time                                                           SAS numeric time value
timestamp                                                      SAS numeric datetime value
Note: The WebSphere DataStage Parallel Job Developer Guide contains a table that maps SQL data types to the underlying data type.
* A persistent parallel SAS data set is automatically converted to a WebSphere DataStage data set if the next stage in the flow is not a SAS stage.
* When converting a SAS numeric field to a WebSphere DataStage numeric, you can get a numeric overflow or underflow if the destination data type is too small to hold the value of the SAS field. You can avoid this error by specifying all fields as nullable, so that the destination field is simply set to null and processing continues.
* Perform the same action for every record
* No RETAIN, LAGN, SUM, or + statements
* Order does not need to be maintained
Example Applications
Three sample applications follow.
Example 1
Parallelizing a SAS DATA Step
This section contains an example that executes a SAS DATA step in parallel. The step takes a single SAS data set as input and writes its results to a single SAS data set as output. The DATA step recodes the salary field of the input data to replace a dollar amount with a salary-scale value.
This DATA step requires little effort to parallelize because it processes records without regard to record order or relationship to any other record in the input. Also, the step performs the same operation on every record.
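A minimal sketch of such a recoding DATA step follows. The field names and salary cutoffs are illustrative assumptions; the original figures for this example are not reproduced here:

```sas
/* Illustrative only: recode a dollar amount into a salary-scale value. */
data liborch.out_emp;
  set liborch.in_emp;
  if salary < 25000 then salary_scale = 1;
  else if salary < 50000 then salary_scale = 2;
  else salary_scale = 3;
  drop salary;                /* replace the dollar amount with the scale */
run;
```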
Explanation
In this example, you:
* Get the input from a SAS data set using a sequential SAS stage
* Execute the DATA step in a parallel SAS stage
* Output the results as a standard WebSphere DataStage data set (you must provide a schema for this) or as a parallel SAS data set. You may also pass the output to another SAS stage for further processing.

The required schema may be generated by first outputting the data to a parallel SAS data set and then referencing that data set. WebSphere DataStage automatically generates the schema.
The next SAS stage can then do one of three things:
* Output the results as a standard WebSphere DataStage data set using the schema definition
* Output the results as a parallel SAS data set
* Pass the output directly to another SAS stage as a transitional SAS data set

The default output format is transitional SAS data set. When the output goes to a parallel SAS data set, to another SAS stage, or to a standard WebSphere DataStage data set, the liborch statement must be used. Conversion of the output to a standard WebSphere DataStage data set or a parallel SAS data set is discussed in Transitional SAS Data Set Format and Parallel SAS Data Set Format.
Example 2
Using the hash Partitioner
This example reads two INFORMIX tables as input, hash partitions on the workloc field, then uses a SAS DATA step to merge the data and score it before writing it out to a parallel SAS data set.
The SAS stage in this example runs the following DATA step to perform the merge and score:
data liborch.emptabd;
  merge liborch.wltab liborch.emptab;
  by workloc;
  a_13 = (f1-f3)/2;
  a_15 = (f1-f5)/2;
  . . .
run;
Records are hashed based on the workloc field. In order for the merge to work correctly, all records with the same value for workloc must be sent to the same processing node and the records must be ordered. The merge is followed by a parallel SAS stage that scores the data, then writes it out to a parallel SAS data set.
Example 3
Using a SAS SUM Statement with a DATA Step
This section contains an example using the SUM clause with a DATA step. In this example, the DATA step outputs a SAS data set where each record contains two fields: an account number and a total transaction amount. The transaction amount in the output is calculated by summing all the deposits and withdrawals for the account in the input data where each input record contains one transaction. Here is the SAS code for this example, as defined by the SAS Source option:
libname prod "/prod";
proc sort data=prod.trans out=prod.s_trans;
  by acctno;
run;
data prod.up_trans (keep = acctno sum);
  set prod.s_trans;
  by acctno;
  if first.acctno then sum=0;
  if type = "D" then sum + amt;
  if type = "W" then sum + (-amt);
  if last.acctno then output;
run;
The SUM variable is reset at the beginning of each group of records where the record groups are determined by the account number field. Because this DATA step uses the BY clause, you use the Hash stage with this step to make sure all records from the same group are assigned to the same node. Note that DATA steps using SUM without the BY clause view their input as one large group. Therefore, if the step used SUM but not BY, you would execute the step sequentially so that all records would be processed on a single node.
The SAS DATA step executed by the second SAS stage, as defined by the SAS Source option, is:
data liborch.nw_trans (keep = acctno sum);
  set liborch.p_trans;
  by acctno;
  if first.acctno then sum=0;
  if type = "D" then sum + amt;
  if type = "W" then sum + (-amt);
  if last.acctno then output;
run;
Example Applications
Two sample applications follow.
Example 1
Parallelizing PROC Steps
This example parallelizes a SAS application using PROC SORT and PROC MEANS. In this example, you first sort the input to PROC MEANS, then calculate the mean of all records with the same value for the acctno field.
Figure 8. Data Flow Diagram: Using Proc Sort and Proc Means
The BY clause in a SAS step signals that you want to hash partition the input to the step. Hash partitioning guarantees that all records with the same value for acctno are sent to the same processing node. The SAS PROC step executing on each node is thus able to calculate the mean for all records with the same value for acctno.
PROC MEANS pipes its results to standard output, and the SAS stage sends the results from each partition to standard output as well. Thus the list file created by the SAS stage, which you specify using the SAS List File Location Type option, contains the results of the PROC MEANS sorted by processing node. Shown below is the SAS PROC step for this application:
proc means data=liborch.p_dhist;
  by acctno;
run;
Example 2
Parallelizing PROC Steps Using the CLASS Keyword
One type of SAS BY GROUP processing uses the SAS keyword CLASS. CLASS allows a PROC step to perform BY GROUP processing without your having to first sort the input data to the step. Note, however, that the grouping technique used by the SAS CLASS option requires that all your input data fit in memory on the processing node. Note also that as your data size increases, you may want to replace CLASS and NWAY with SORT and BY.
Considerations
Whether you parallelize steps using CLASS depends on the following:
* If the step also uses the NWAY keyword, parallelize it. When the step specifies both CLASS and NWAY, you parallelize it just like a step using the BY keyword, except that the step input doesn't have to be sorted. This means you hash partition the input data based on the fields specified in the CLASS option. See the previous section for an example using the hash partitioning method.
* If the CLASS clause does not use NWAY, execute the step sequentially.
* If the PROC step generates a report, execute it sequentially, unless it has a BY clause.

For example, the following SAS code uses PROC SUMMARY with both the CLASS and NWAY keywords:
libname prod "/prod";
proc summary data=prod.txdlst missing NWAY;
  CLASS acctno lstrxd fpxd;
  var xdamt xdcnt;
  output out=prod.xnlstr(drop=_type_ _freq_) sum=;
run;
In order to parallelize this example, you hash partition the data based on the fields specified in the CLASS option. Note that you do not have to partition the data on all of the fields, only on enough fields that your data is correctly distributed to the processing nodes. For example, you can hash partition on acctno if it ensures that your records are properly grouped. Or you can partition on two of the fields, or on all three. An important consideration with hash partitioning is that you should specify as few fields as necessary to the partitioner, because every additional field requires additional overhead to perform partitioning.
The SAS code (DATA step) for the first SAS stage, as defined by the SAS Source option, is:
libname prod "/prod";
data liborch.p_txdlst;
  set prod.txdlst;
run;
The SAS code for the second SAS stage, as defined by the SAS Source option, is:
proc summary data=liborch.p_txdlst missing NWAY;
  CLASS acctno lstrxd fpxd;
  var xdamt xdcnt;
  output out=liborch.p_xnlstr(drop=_type_ _freq_) sum=;
run;
Rules of Thumb
There are rules of thumb you can use to specify how a SAS program runs in parallel. Once you have identified a program as a potential candidate for use in WebSphere DataStage, you need to determine how to divide the SAS code itself into WebSphere DataStage steps. The SAS stage can be run either in parallel or sequentially. Any converted SAS program that satisfies one of the four criteria outlined above will contain at least one parallel segment. How much of the program should be contained in this segment? Are there portions of the program that need to be implemented in sequential segments? When does a SAS program require multiple parallel segments? Here are some guidelines you can use to answer such questions.

* Identify Slow-Running Steps. Identify the slow portions of the sequential SAS program by inspecting the CPU and real-time values for each of the PROC and DATA steps in your application. Typically, these are steps that manipulate records (CPU-intensive) or that sort or merge (memory-intensive). You should be looking for times that are a relatively large fraction of the total run time of the application and that are measured in units of many minutes to hours, not seconds to minutes. You may need to set the SAS fullstimer option on in your config.sas612 or in your SAS program itself to generate a log of these sequential run times.

* Parallelize Slow-Running Steps First. Start by parallelizing only those slow portions of the application that you have identified in guideline 1. As you include more code within the parallel segment, remember that each parallel copy of your code (referred to as a partition) sees only a fraction of the data. This fraction is determined by the partitioning method you specify on the input or inpipe lines of your SAS stage source code.

* Use Only One Pipe Between SAS Stages. Any two SAS stages should be connected by only one pipe.
This ensures that all pipes in the WebSphere DataStage program are simultaneously flowing for the duration of the execution. If two segments are connected by multiple pipes, each pipe must drain entirely before the next one can start.
* Minimize the Number of SAS Stages. Keep the number of SAS stages to a minimum. There is a performance cost associated with each stage that is included in the data flow. Guideline 3 takes precedence over this guideline: when reducing the number of stages means connecting any two stages with more than one pipe, don't do it.

* Input One Sequential File Per Stage. If you are reading or writing a sequential file such as a flat ASCII text file or a SAS data set, you should include the SAS code that does this in a sequential SAS stage. Use one sequential stage for each such file. You will see better performance inputting one sequential file per stage than if you lump many inputs into one segment followed by multiple pipes to the next segment, in line with guideline 2 above.

* Give Each Partition Equal Amounts of Data. When you choose a partition key or combination of keys for a parallel stage, keep in mind that the best overall performance of the parallel application occurs if each of the partitions is given approximately equal quantities of data. For instance, if you are hash partitioning by the key field year (which has five values in your data) into five parallel segments, you will end up with poor performance if there are big differences in the quantities of data for each of the five years. The application is held up by the partition that has the most data to process. If there is no data at all for one of the years, the application will fail because the SAS process that gets no data will issue an error statement. Furthermore, the failure will occur only after the slowest partition has finished processing, which may be well into your application. The solution may be to partition by multiple keys, for example, year and storeNumber, to use round-robin partitioning where possible, to use a partitioning key that has many more values than there are partitions in your application, or to keep the same key field but reduce the number of partitions.
All of these methods should serve to balance the data distribution over the partitions.

* Use Multiple Parallel Segments with Different Keys. Multiple parallel segments in your WebSphere DataStage application are required when you need to parallelize portions of code that are sorted by different key fields. For instance, if one portion of the application performs a merge of two data sets using the patientID field as the BY key, this PROC MERGE will need to be in a parallel segment that is hash partitioned on the key field patientID. If another portion of the application performs a PROC MEANS of a data set using the procedureCode field as the CLASS key, this PROC MEANS will have to be in a parallel SAS stage that is hash partitioned on the procedureCode key field.

* Sort in Parallel. If you are running a query that includes an ORDER BY clause against a relational database, you should remove it and do the sorting in parallel, either using SAS PROC SORT or a WebSphere DataStage input line order statement. Performing the sort in parallel outside of the database removes the sequential bottleneck of sorting within the RDBMS.

* Resort when Required. A sort that has been performed in a parallel stage will order the data only within that stage. If the data is then streamed into a sequential stage, the sort order will be lost. You will need to sort again within the sequential stage to guarantee order.

* Do Not Use a Custom-Specified SAS Library. Within a parallel SAS stage you may only use SAS work directories for intermediate writes to disk. SAS generates unique names for the work directories of each of the parallel stages. In an SMP environment this is necessary because it prevents the multiple CPUs from writing to the same work file. Do not use a custom-specified SAS library within a parallel stage.

* Use liborch Properly. Do not use a liborch directory within a parallel segment unless it is connected to an inpipe or an outpipe.
A liborch directory may not be both written and read within the same parallel stage.

* Write Contents to Disk When Necessary. A liborch directory can be used only once for an input, inpipe, output or outpipe. If you need to read or write the contents of a liborch directory more than once, you should write the contents to disk via a SAS work directory and copy this as needed.

* Use Pipes to Connect between Stages. Remember that all stages in a step run simultaneously. This means that you cannot write to a custom-specified SAS library as output from one stage and simultaneously read from it in a subsequent stage. Connections between stages must be via WebSphere DataStage pipes, which are virtual data sets normally in transitional SAS data set format.
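The work-directory guideline above can be sketched as follows: to reuse an intermediate result within a parallel stage, stage it in the SAS work library rather than reading a liborch member twice. The data set names here are illustrative assumptions:

```sas
/* Illustrative only: stage an intermediate result in the SAS work     */
/* library, then reuse the staged copy within the same parallel stage. */
data work.staged;
  set liborch.p_in;        /* read the inpipe exactly once             */
run;

data liborch.p_out;        /* write the outpipe exactly once           */
  set work.staged;         /* the work copy can be re-read as needed   */
run;
```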
2. At installation, in the table of your sascs.txt file enter the designations of the ICU character sets you use in the right-hand columns of the table. There can be multiple entries. Use the left-hand column of the table to enter a symbol that describes the character sets. The symbol cannot contain space characters, and both columns must be filled in. The platform-specific sascs.txt files are in:
$APT_ORCHHOME/apt/etc/platform/
The platform directory names are: sun, aix, osf1 (Tru64), hpux, and linux. For example:
$APT_ORCHHOME/etc/aix/sascs.txt
Example table entries:

CANADIAN_FRENCH  fr_CA-iso-8859
ENGLISH          ISO-8859-5

3. Using the NLS tab of the SAS-interface stages, specify an ICU character set to be used by WebSphere DataStage to map between your ustring values and the CHAR data stored in SAS files. For additional information, see Configuring International SAS.
sas -source 'libname curr_dir \".\";
       DATA liborch.out_data;
         SET curr_dir.cl_ship;
       RUN;'
    -output 0 out_data
    -schemaFile cl_ship
    -workingdirectory .;
If you know the fields contained in the SAS file and use the Set Schema From Columns option, performance is improved. The generated code follows:
sas -source 'libname curr_dir \".\";
       DATA liborch.out_data;
         SET curr_dir.cl_ship;
       RUN;'
    -output 0 out_data
    -schema record(SHIP_DATE:nullable string[50];
                   DISTRICT:nullable string[10];
                   DISTANCE:nullable sfloat;
                   EQUIPMENT:nullable string[10];
                   PACKAGING:nullable string[10];
                   LICENSE:nullable string[10];
                   CHARGE:nullable sfloat;)
It is also easy to write a SAS file from a WebSphere DataStage virtual data stream. Use the SAS Source and Input options of the SAS stage to generate the SAS file described above. The following code is generated:
sas -source 'libname curr_dir \".\";
       DATA curr_dir.cl_ship;
         SET liborch.in_data;
       RUN;'
    -input 0 in_data
[seq] 0< DSLink2a.v;
When you have invoked SAS in international mode, your SAS log output has this type of header:
NOTE: SAS (r) Proprietary Software Release 8.2 (TS2MO DBCS2944)
If NLS is off for the SAS stage, SAS is invoked in standard mode. In standard mode, SAS processes your string fields and step source code using single-byte LATIN1 characters. SAS standard mode is also called the basic US SAS product. If NLS is on and your WebSphere DataStage data includes ustring values, the SAS stage uses the current value of icu_char_set to determine which code conversion to use. The SAS stage uses the environment variable settings (see Environment Variables) to determine what to do if a ustring is truncated when converted to a fixed-length SAS character field. It also determines what to do if it encounters a ustring value when NLS is not enabled.
You can then fine-tune the schema and specify it to the Set Schema from Columns option to take advantage of performance improvements. You can see the schema generated by Schema Source SAS Data Set Name by setting Debug Program = Yes and looking at an INFO line in the log.
Specify a value for n_codepoint_units that is 2.5 or 3 times larger than the number of characters in the ustring. This forces the SAS CHAR fixed-length field size to be the value of n_codepoint_units. The maximum value for n_codepoint_units is 200.
Environment Variables
Add or modify environment variables:
* In the Designer, in the Choose Environment Variable dialog box. See the WebSphere DataStage Designer Client Guide for additional information.
* In the Administrator, in the Environment Variables dialog box. See the WebSphere DataStage Administrator Client Guide for additional information.

These are the SAS-specific environment variables:

* APT_SAS_COMMAND absolute_path
Overrides the $PATH directory for SAS and the resource sas entry with an absolute path to the US SAS executable. An example path is:
/usr/local/sas/sas8.2/sas
* APT_SASINT_COMMAND absolute_path
Overrides the $PATH directory for SAS and the resource sasint entry with an absolute path to the International SAS executable. An example path is:
/usr/local/sas/sas8.2int/dbcs/sas
* APT_SAS_CHARSET icu_character_set
When the NLS option is not selected and a SAS-interface stage encounters a ustring, WebSphere DataStage interrogates this variable to determine what character set to use. If this variable is not set but APT_SAS_CHARSET_ABORT is set, the stage aborts; otherwise WebSphere DataStage accesses the APT_IMPEXP_CHARSET environment variable.
v APT_SAS_CHARSET_ABORT
Causes a SAS-interface stage to abort if WebSphere DataStage encounters a ustring in the schema, the NLS option is not selected, and the APT_SAS_CHARSET environment variable is not set.
v APT_SAS_PARAM_ARGUMENT
Allows you to add arguments to the SAS execution line. A sample argument is:
-config filename
For a complete list of all arguments, see your SAS documentation.
v APT_SAS_TRUNCATION ABORT | NULL | TRUNCATE
Because a ustring of n characters does not fit into n bytes of a SAS CHAR value, the ustring value may be truncated before the space pad characters and \0. The SAS stage uses this variable to determine how to truncate a ustring value to fit into a SAS CHAR field. TRUNCATE, the default, causes the ustring to be truncated; ABORT causes the stage to abort; and NULL exports a null field. For NULL and TRUNCATE, the first five occurrences for each column cause an information message to be issued to the log.
v APT_SAS_ACCEPT_ERROR
When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface stage from terminating. The default behavior is for WebSphere DataStage to terminate the stage with an error.
v APT_SAS_DEBUG_LEVEL=[0-2]
Specifies the level of debugging messages to output from the SAS driver. The values 0, 1, and 2 duplicate the output for the Debug Program option of the SAS stage: no, yes, and verbose.
v APT_SAS_DEBUG=1, APT_SAS_DEBUG_IO=1, APT_SAS_DEBUG_VERBOSE=1
Specify various debug messages, which are found in the SAS log.
v APT_HASH_TO_SASHASH
The Hash stage partitioner contains support for hashing SAS data. In addition, WebSphere DataStage provides the sashash partitioner, which uses an alternative non-standard hashing algorithm, for backward compatibility. Setting the APT_HASH_TO_SASHASH environment variable causes all appropriate instances of hash to be replaced by sashash. If the APT_NO_SAS_TRANSFORMS environment variable is set, APT_HASH_TO_SASHASH has no effect.
v APT_NO_SAS_TRANSFORMS
WebSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting a sasout component and substituting sasRoundRobin for eRoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents WebSphere DataStage from making these transformations.
v APT_NO_SASOUT_INSERT
This variable selectively disables the sasout component insertions. It maintains the other SAS-specific transformations.
v APT_SAS_SHOW_INFO
Displays the standard SAS executable log from an import or export transaction (the SAS Parallel Data Set stage). The SAS output is normally deleted, since a transaction is usually successful.
v APT_SAS_SCHEMASOURCE_DUMP
Displays the SAS CONTENTS report used by the Schema Source SAS Data Set Name option. The output also displays the command line given to SAS to produce the report and the pathname of the report. The input file and output file created by SAS are not deleted when this variable is set.
v APT_SAS_S_ARGUMENT number_of_characters
Overrides the value of the SAS s option, which specifies how many characters should be skipped from the beginning of the line when reading input SAS source records. The s option is typically set to 0, indicating that records are read starting with the first character on the line. This environment variable allows you to change that offset. For example, to skip line numbers and the following space character in numbered SAS code, set the value of the APT_SAS_S_ARGUMENT variable to 6.
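For instance, if each SAS source record began with a five-digit line number followed by a space (a hypothetical layout; count the leading characters in your own records), the first six characters of every line would be skipped:

```shell
# Input SAS source records look like this (hypothetical layout):
#   00001 DATA liborch.in_data;
#   00002   SET curr_dir.cl_ship;
# Skip the 5-digit line number plus the trailing space (6 characters).
export APT_SAS_S_ARGUMENT=6
```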
v APT_SAS_NO_PSDS_USTRING
Outputs a header file that does not list ustring fields.
v APT_SAS_OLDDEFAULT_LOGDS
Overrides the message type and makes all -logds messages W(arnings). Use this environment variable to regress to the behavior of the SAS stage in Version 7.5.1 and earlier.
v APT_SAS_METADATA_ONLY
Causes a psds to process only one row (from each SAS parallel data set) and the SAS stage to display the WebSphere DataStage schema for the psds.
v APT_SAS_OLD_UNIQUE_DIRECTORY
Restricts the configuration file to having a unique SAS directory on each node of the configuration file. Use this environment variable to regress to the behavior of the SAS stage in Release 7.5.1 and earlier. (See Configuration File Requirements on page 3.)
In addition to the SAS-specific debugging variables, you can set the APT_DEBUG_SUBPROC environment variable to display debug information about each subprocess component.
Each release of SAS also has an associated environment variable for storing SAS system options specific to the release. The environment variable name includes the version of SAS it applies to; for example, SAS612_OPTIONS and SASV8_OPTIONS. The usage is the same regardless of release. Any SAS option that can be specified in the configuration file or on the command line at startup can also be defined using the version-specific environment variable. SAS looks for the environment variable in the current shell and then applies the defined options to that session. The environment variables are defined as any other shell environment variable. A ksh example is:
export SASV8_OPTIONS=-xwait -news SAS_news_file
An option set using an environment variable overrides the same option set in the configuration file; and an option set in SASx_OPTIONS is overridden by its setting on the SAS startup command line or by the OPTIONS statement in the SAS code (as applicable to where options can be defined).
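Pulling several of these variables together, the ksh sketch below shows the export syntax only; the values chosen (UTF-8, ABORT, level 2, -nonews) are illustrations, not recommendations:

```shell
# ICU character set for ustring conversion when the NLS option is not
# selected (UTF-8 is an example value).
export APT_SAS_CHARSET=UTF-8

# Abort rather than truncate ustring values that do not fit their
# SAS CHAR fields.
export APT_SAS_TRUNCATION=ABORT

# Verbose debugging output from the SAS driver (same effect as
# Debug Program = verbose).
export APT_SAS_DEBUG_LEVEL=2

# Release-specific SAS options: this overrides the configuration file,
# and is itself overridden by the SAS startup command line or an
# OPTIONS statement in the SAS code.
export SASV8_OPTIONS='-xwait -nonews'
```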
Must Dos
WebSphere DataStage has many defaults, which means that it can be very easy to include SAS Data Set stages in a job. This section specifies the minimum steps needed to get a SAS Data Set stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are as you get familiar with the product. The steps required depend on whether you are reading or writing SAS data sets.
Stage Page
The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In sequential mode the entire data set is processed by the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the SAS Data Set stage writes data to a data set. The SAS Data Set stage can have only one input link. The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the data set. The Columns tab specifies the column definitions of the data. The Advanced tab allows you to change the default buffering settings for the input link. Details about SAS Data Set stage properties are given in the following sections. See the WebSphere DataStage Parallel Job Developer Guide for a general description of the other tabs.
Table 3. Input Link Properties and their Attributes (continued)

Category/Property      Values                                     Default                   Mandatory?  Repeats?  Dependent of
Target/Update Policy   Append/Create (Error if exists)/Overwrite  Create (Error if exists)  Y           N         N/A
Options Category
File
The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention the file has the suffix .psds.
Update Policy
Specifies what action will be taken if the data set you are writing to already exists. Choose from:
v Append. Append to the existing data set.
v Create (Error if exists). WebSphere DataStage reports an error if the data set already exists.
v Overwrite. Overwrite any existing data set.
The default is Create (Error if exists).
Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the data set. It also allows you to specify that the data should be sorted before being written.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on the execution modes of the current and preceding stages and how many nodes are specified in the Configuration file. If the SAS Data Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
v Whether the SAS Data Set stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the SAS Data Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the SAS Data Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method.
The following partitioning methods are available:
v (Auto). WebSphere DataStage attempts to work out the best partitioning method depending on the execution modes of the current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the Parallel SAS Data Set stage.
v Entire. Each file written to receives the entire data set.
v Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
v Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
v Random. The records are partitioned randomly, based on the output of a random number generator.
v Round Robin. The records are partitioned on a round robin basis as they enter the stage.
v Same. Preserves the partitioning already in place.
v DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
v Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
v (Auto). This is the default collection method for Parallel SAS Data Set stages. Normally, when you are using Auto mode, WebSphere DataStage will read any row from any input partition as it becomes available.
v Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
v Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
v Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the data set. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning.
If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default Auto methods). Select the check boxes as follows:
v Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
v Stable. Select this if you want to preserve previously sorted data sets. This is the default.
v Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled, an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about how the Parallel SAS Data Set stage reads data from a data set. The Parallel SAS Data Set stage can have only one output link.
The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the output link. Details about Data Set stage properties and formatting are given in the following sections. See the WebSphere DataStage Parallel Job Developer Guide for a general description of the other tabs.
Source Category
File
The name of the control file for the parallel SAS data set. You can browse for the file or enter a job parameter. The file has the suffix .psds.
Example Job
This example job shows SAS stages reading in data, operating on it, then writing it out. The example data is from a freight carrier who charges customers based on distance, equipment, packing, and license requirements. They need a report of distance traveled and charges for the month of July grouped by License type. The following table shows a sample of the data:
Table 5. Sample Freight Carrier Data

Ship Date    District  Distance  Equipment  Packing  License  Charge
...
Jun  2 2000  1         1540      D          M        BUN      1300
Jul 12 2000  1         1320      D          C        SUM      4800
Table 5. Sample Freight Carrier Data (continued)

Ship Date    District  Distance  Equipment  Packing  License  Charge
Aug  2 2000  1         1760      D          C        CUM      1300
Jun 22 2000  2         1540      D          C        CUN      13500
Jul 30 2000  2         1320      D          M        SUM      6000
...
The first stage reads the rows from the freight database where the first three characters of the Ship_date column = Jul. You reference the data being input to this stage from inside the SAS code using the liborch library. Liborch is the SAS engine provided by WebSphere DataStage that you use to reference the data input to and output from your SAS code. This is the SAS source code:
LIBNAME curr_dir '.';
DATA liborch.in_data;
  SET curr_dir.cl_ship;
  IF substr(Ship_Date,1,3)="Jul";
RUN;
The second stage sorts the data that has been extracted by the first stage. This is the SAS Source code:
PROC SORT DATA=liborch.in_data OUT=liborch.out_sort;
  BY License;
RUN;
Finally, the third stage outputs the mean and sum of the distances traveled and charges for the month of July sorted by license type. This is the SAS Source code:
PROC MEANS DATA=liborch.in_data N MEAN SUM;
  BY License;
  VAR Distance Charge;
RUN;
The following shows the SAS output for license type SUM:
17:39, May 26, 2003
...
LICENSE=SUM

Variable   Label      N     Mean       Sum
----------------------------------------------------
DISTANCE   DISTANCE   720   1563.93    1126030.00
CHARGE     CHARGE     720   28371.39   20427400.00
----------------------------------------------------
...
Step execution finished with status = OK.
Must Dos
WebSphere DataStage has many defaults, which means that it can be very easy to include SAS stages in a job. This section specifies the minimum steps needed to get a SAS stage functioning. WebSphere DataStage provides a versatile user interface, and there are many shortcuts to achieving a particular end; this section describes the basic method. You will learn where the shortcuts are as you get familiar with the product.
To use a SAS stage:
v In the Stage page Properties tab, under the SAS Source category, specify the SAS code the stage will execute. You can also choose a Source Method of Source File and specify the name of a file containing the code. You need to use the liborch library in order to connect the SAS code to the data input to and/or output from the stage (see Example Job for guidance on how to do this). Under the Inputs and Outputs categories:
Specify the numbers of the input and output links the stage connects to, and the name of the SAS data set associated with those links. v In the Stage page Link Ordering tab, specify which input and output link numbers correspond to which actual input or output link. v In the Output page Mapping tab, specify how the data being operated on maps onto the output links.
Alternatively, you can include a resource sas or resource sasint entry in your configuration file. For example:
resource sasint "[/usr/sas82/]" { }
(See the WebSphere DataStage Parallel Job Developer Guide for details about configuration files and the SAS resources.) When using NLS with any map, you need to edit the file sascs.txt to identify the maps that you are likely to use with the SAS stage. The file is platform-specific, and is located in $APT_ORCHHOME/apt/etc/platform, where platform is one of sun, aix, osf1 (Tru64), hpux, and linux. The file comprises two columns: the left-hand column gives an identifier, typically the name of the language. The right-hand column gives the name of the map. For example, if you were in Canada, your sascs.txt file might be as follows:
CANADIAN_FRENCH    fr_CA-iso-8859
ENGLISH            ISO-8859-5
Stage Page
The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The NLS Map tab appears if you have NLS enabled on your system; it allows you to specify a character set map for the stage.
Properties Tab
The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.
Table 6. Properties and their Attributes

Category/Property          Values                 Default   Mandatory?  Repeats?  Dependent of
SAS Source/Source Method   Explicit/Source File   Explicit  Y           N         N/A
Table 6. Properties and their Attributes (continued)

Category/Property                          Values                    Default  Mandatory?                           Repeats?  Dependent of
SAS Source/Source                          code                      N/A      Y (if Source Method = Explicit)      N         N/A
SAS Source/Source File                     pathname                  N/A      Y (if Source Method = Source File)   N         N/A
Inputs/Input Link Number                   number                    N/A      N                                    Y         N/A
Inputs/Input SAS Data Set Name             string                    N/A      Y (if input link number specified)   N         Input Link Number
Outputs/Output Link Number                 number                    N/A      N                                    Y         N/A
Outputs/Output SAS Data Set Name           string                    N/A      Y (if output link number specified)  N         Output Link Number
Outputs/Set Schema from Columns            True/False                False    Y (if output link number specified)  N         Output Link Number
Options/Disable Working Directory Warning  True/False                False    Y                                    N         N/A
Options/Convert Local                      True/False                False    Y                                    N         N/A
Options/Debug Program                      No/Yes/Verbose            No       Y                                    N         N/A
Options/SAS List                           File/Job Log/None/Output  Job Log  N                                    N         N/A
Options/SAS Log                            File/Job Log/None/Output  Job Log  N                                    N         N/A
Options/SAS Options                        string                    N/A      N                                    N         N/A
Source
Specify the SAS code to be executed. This can contain both PROC and DATA steps.
Source File
Specify a file containing the SAS code to be executed by the stage.
Inputs Category
Input Link Number
Specifies inputs to the SAS code in terms of input link numbers. Repeat the property to specify multiple links. This has a dependent property: v Input SAS Data Set Name. The name of the SAS data set receiving its input from the specified input link.
Outputs Category
Output Link Number
Specifies an output link to connect to the output of the SAS code. Repeat the property to specify multiple links. This has a dependent property: v Output SAS Data Set Name. The name of the SAS data set sending its output to the specified output link. v Set Schema from Columns. Specify whether or not the columns specified on the Outputs tab are used for generating the output schema. An output schema is not required if the eventual destination stage is another SAS stage (there can be intermediate stages such as Data Set or Copy stages).
Options Category
Disable Working Directory Warning
Disables the warning message generated by the stage when you omit the Working Directory property. By default, if you omit the Working Directory property, the SAS working directory is indeterminate and the stage generates a warning message.
Convert Local
Specify that the conversion phase of the SAS stage (from the input data set format to the stage SAS data set format) should run on the same nodes as the SAS stage. If this option is not set, the conversion runs by default with the previous stage's degree of parallelism and, if possible, on the same nodes as the previous stage.
Debug Program
A setting of Yes causes the stage to ignore errors in the SAS program and continue execution of the application. This allows your application to generate output even if a SAS step has an error. By default, the setting is No, which causes the stage to abort when it detects an error in the SAS program. Setting the property to Verbose is the same as Yes, but in addition it causes the operator to echo the SAS source code executed by the stage.
SAS List

Specifying Job Log causes the list to be written to the WebSphere DataStage job log. Specifying Output causes the list file to be written to an output data set of the stage. The data set from a parallel SAS stage containing the list information will not be sorted. If you specify None, no list is generated.
SAS Options
Specify any options for the SAS code in a quoted string. These are the options that you would specify to a SAS OPTIONS directive.
Working Directory
Name of the working directory on all the processing nodes executing the SAS application. All relative path names in the SAS code are relative to this path name.
Advanced Tab
This tab allows you to specify the following:
v Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In sequential mode the entire data set is processed by the conductor node.
v Combinability mode. This is Auto by default, which allows WebSphere DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
v Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.
v Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.
v Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).
NLS Map
The NLS Map tab allows you to define a character set map for the SAS stage. This overrides the default character set map set for the project or the job. You can specify that the map be supplied as a job parameter if required.
Inputs Page
The Inputs page allows you to specify details about the incoming data sets. There can be multiple inputs to the SAS stage. The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being passed to the SAS code. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link. Details about SAS stage partitioning are given in the following section. See the WebSphere DataStage Parallel Job Developer Guide for a general description of the other tabs.
Partitioning Tab
The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before being passed to the SAS code. It also allows you to specify that the data should be sorted before being operated on.
By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on the execution modes of the current and preceding stages and how many nodes are specified in the Configuration file. If the SAS stage is operating in sequential mode, it will first collect the data using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:
v Whether the SAS stage is set to execute in parallel or sequential mode.
v Whether the preceding stage in the job is set to execute in parallel or sequential mode.
If the SAS stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning. If the SAS stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method. The following partitioning methods are available:
v (Auto). WebSphere DataStage attempts to work out the best partitioning method depending on the execution modes of the current and preceding stages and how many nodes are specified in the Configuration file. This is the default partitioning method for the SAS stage.
v Entire. Each file written to receives the entire data set.
v Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
v Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
v Random. The records are partitioned randomly, based on the output of a random number generator.
v Round Robin. The records are partitioned on a round robin basis as they enter the stage.
v Same. Preserves the partitioning already in place.
v DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
v Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following Collection methods are available:
v (Auto). This is the default collection method for SAS stages. Normally, when you are using Auto mode, WebSphere DataStage will eagerly read any row from any input partition as it becomes available.
v Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
v Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
v Sort Merge. Reads records in an order based on one or more columns of the record.
This requires you to select a collecting key column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being passed to the SAS code. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods). Select the check boxes as follows:
v Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
v Stable. Select this if you want to preserve previously sorted data sets. This is the default.
v Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.
If NLS is enabled, an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.
You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the SAS stage. The SAS stage can have multiple output links. Choose the link whose details you are viewing from the Output Name drop-down list. The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the SAS stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link. Details about SAS stage mapping are given in the following section. See the WebSphere DataStage Parallel Job Developer Guide for a general description of the other tabs.
Mapping Tab
For the SAS stage the Mapping tab allows you to specify how the output columns are derived and how SAS data maps onto them. The left pane shows the data output from the SAS code. These are read only and cannot be modified on this tab. The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.
Contacting IBM
To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378). To learn about available service options, call one of the following numbers:
v In the United States: 1-888-426-4343
v In Canada: 1-800-465-9600
To locate an IBM office in your country or region, see the IBM Directory of Worldwide Contacts on the Web at www.ibm.com/planetwide.
Accessible documentation
Documentation in the information center is provided in XHTML 1.0 format, which is viewable in most Web browsers. XHTML allows you to view documentation according to the display preferences set in your browser. It also allows you to use screen readers and other assistive technologies. Syntax diagrams are provided in dotted decimal format. This format is available only if you are accessing the online documentation using a screen reader.
Your feedback helps IBM to provide quality information. You can use any of the following methods to provide comments:
v Send your comments using the online readers' comment form at www.ibm.com/software/awdtools/rcf/.
v Send your comments by e-mail to comments@us.ibm.com. Include the name of the product, the version number of the product, and the name and part number of the information (if applicable). If you are commenting on specific text, please include the location of the text (for example, a title, a table number, or a page number).
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Corporation J46A/G4 555 Bailey Avenue San Jose, CA 95141-1003 U.S.A. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.
If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
This topic lists IBM trademarks and certain non-IBM trademarks. See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks. The following terms are trademarks or registered trademarks of other companies: Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product or service names might be trademarks or service marks of others.
Index

A
APT_DEBUG_SUBPROC 34
APT_NO_SAS_TRANSFORMS 33
APT_NO_SASINSERT 33
APT_S_ARGUMENT 33
APT_SAS_ACCEPT_ERROR 33
APT_SAS_CHARSET 32
APT_SAS_CHARSET_ABORT 33
APT_SAS_COMMAND 32
APT_SAS_DEBUG 33
APT_SAS_DEBUG_IO 33
APT_SAS_DEBUG_LEVEL 33
APT_SAS_METADATA_ONLY 34
APT_SAS_NO_PSDS_USTRING 31, 34
APT_SAS_OLD_UNIQUE_DIRECTORY 34
APT_SAS_OLDDEFAULT_LOGDS 34
APT_SAS_PARAM_ARGUMENT 33
APT_SAS_SCHEMASOURCE_DUMP 31, 33
APT_SAS_TO_SASHASH 33
APT_SAS_TRUNCATION 32, 33
APT_SASINT_COMMAND 32
arguments in SAS execution line 33

C
candidates for parallelization
  using CLASS 21, 25
  using DATA steps 14
  using PROC steps 21
character sets 30
CLASS 25
comparison of single and multiple executions 1
components
  sas 16
  sasout 16
components, mapped to stage properties 6
configuration file requirements
  description 3
  examples 3
configuring your system 3
considerations for parallelizing SAS code
  CPU usage 27
  number of records 27
  size of records 27
  sorting 27
Copy stage 31

D
data representations 8
data set formats
  parallel 8
  sequential 8
DBCS 4, 31
DBCSLANG 31
DBCSLANG settings and ICU equivalents 4
DBCSTYPE 31
debug option 33
displaying the schema 34
documentation
  ordering 51
  Web site 51
double-byte character set 5

E
environment
  configuring your system 3
  determining SAS version 2
  executables 4
  licenses 2
  long name support 5
  verifying SAS version 3
environment variables
  APT_DEBUG_SUBPROC 34
  APT_HASH_TO_SASHASH 33
  APT_NO_SAS_TRANSFORMS 33
  APT_NO_SASOUT_INSERT 33
  APT_SAS_ACCEPT_ERROR 33
  APT_SAS_CHARSET 32
  APT_SAS_CHARSET_ABORT 33
  APT_SAS_COMMAND 32
  APT_SAS_DEBUG 33
  APT_SAS_DEBUG_IO 33
  APT_SAS_DEBUG_LEVEL 33
  APT_SAS_METADATA_ONLY 34
  APT_SAS_NO_PSDS_USTRING 31, 34
  APT_SAS_OLD_UNIQUE_DIRECTORY 34
  APT_SAS_OLDDEFAULT_LOGDS 34
  APT_SAS_PARAM_ARGUMENT 33
  APT_SAS_S_ARGUMENT 33
  APT_SAS_SCHEMASOURCE_DUMP 31, 33
  APT_SAS_SHOW_INFO 33
  APT_SAS_TRUNCATION 32, 33
  APT_SASINT_COMMAND 32
  DBCS 31
  DBCSLANG 31
  DBCSTYPE 31
ETL 29
European languages 29
executables 4
  multiple 5
  single 5
Extract, Transform, and Load. See ETL 29

H
hash partitioning 14, 17, 19, 20, 25, 33
hotfix 5

I
ICU character set 4, 29
international SAS 4, 30

L
liborch 10, 11, 17, 28, 29
logds messages 34
long name support 5

M
multi-byte Unicode 31
multiple nodes 3, 34
multiple parallel segments
  guidelines for 27
  use with slow-running steps 27
multiple SAS executables 5

N
NLS 30
non-SAS data 8
NWAY 25

P
parallel processing and sorting 2
parallel processing model
  advantages 2
  description 2
parallel SAS data set 3, 8, 9, 13, 17
parallel SAS data set format 8
parallel systems 1
parallelization, good candidates 14, 21, 25
parallelizing PROC steps 21
parallelizing SAS DATA steps 14
Peek stage 29, 31
pipeline parallelism 2
pipes 28
PROC MEANS 21
PROC MERGE 28
PROC SORT 21, 28
properties
  SAS data set stage input 36
  SAS data set stage output 39
psds. See parallel SAS data set 8

R
round-robin partitioning 28

S
SAS code 1
sas component 16
SAS CONTENTS report 31, 33
SAS data 8
SAS DATA
  description 17
  example 14
  in data flow diagram 14
SAS data set stage 35
SAS data set stage input properties 36
SAS data set stage output properties 39
SAS DATA steps, executing in parallel 13
SAS environment. See environment 5
SAS executables. See executables 5
SAS execution line, adding arguments 33
SAS interface operators
  specifying a character set 30
SAS licenses 2
SAS PROC 21
SAS PROC steps, executing in parallel 21
SAS programs 1
SAS stage 41
SAS Stage
  configuring your system 3
  converting between WebSphere DataStage and SAS data types 11
  data representations 8
  executing DATA steps in parallel 13
  executing PROC steps in parallel 21
  getting input from a SAS data set 9
  getting input from a stage or transitional SAS data set 10
  parallel SAS data set format 8
  parallelizing SAS code
    rules of thumb for parallelizing 27
    SAS programs that benefit from parallelization 27
  pipeline parallelism and SAS 2
  sample data flow 6
  sequential SAS data set format 8
  transitional SAS data set format 9
  using SAS on sequential and parallel systems 1
  WebSphere DataStage example 13
  writing SAS programs 1
SAS stage components
  controlling ustring truncation 32
  environment variables 32
  generating a Proc Contents report 32
  specifying a character set 30
  specifying an output schema 31
SAS SUM 19
sas_cs option 33
sascs.txt file 29
sashash 33
sasout component 16
schema definition 17
schema option 12
schema, displaying 34
schemaFile option 12, 29, 31
sequential processing model, description 1
sequential SAS data set format 8
sequential systems 1
single SAS executable 5

T
transitional SAS data set format 9
truncation, ustring 32

U
US SAS 4
using pipes 28
ustring 32
ustring truncation 32
Printed in USA
SC18-9939-00