

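What is Modulus and Splitting in Dynamic Hashed File? In a dynamic hashed file, the size of the file changes as records are added and removed. The modulus is the current number of groups in the file. When the file grows, groups are split and the modulus increases; this is called "splitting". When the file shrinks, groups are merged again and the modulus decreases. A minimum modulus can also be set explicitly, which is usually done with the help of your administrator.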
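The relationship between keys, groups and the modulus can be sketched in a few lines of Python. This is a simplified toy model, not the engine's actual algorithm: the class name `ToyHashedFile`, the CRC32 hash, the group capacity of 4 and the 80% split threshold are all assumptions chosen for illustration, and the sketch redistributes every record on each split, which a real dynamic file avoids.

```python
import zlib


def group_for(key: str, modulus: int) -> int:
    """Map a record key to a group number in the range 0..modulus-1."""
    return zlib.crc32(key.encode("utf-8")) % modulus


class ToyHashedFile:
    """Toy model: a list of groups whose count (the modulus) grows with load."""

    def __init__(self, modulus: int = 2, group_capacity: int = 4,
                 split_load: float = 0.8):
        self.modulus = modulus
        self.group_capacity = group_capacity  # hypothetical records per group
        self.split_load = split_load          # hypothetical split threshold
        self.groups = [dict() for _ in range(modulus)]

    def load(self) -> float:
        """Fraction of the reserved space that is currently in use."""
        records = sum(len(group) for group in self.groups)
        return records / (self.modulus * self.group_capacity)

    def write(self, key: str, value) -> None:
        self.groups[group_for(key, self.modulus)][key] = value
        # Keep adding groups until the load is back under the threshold.
        while self.load() > self.split_load:
            self._grow()

    def _grow(self) -> None:
        # Simplification: add one group and redistribute every record.
        records = [item for group in self.groups for item in group.items()]
        self.modulus += 1
        self.groups = [dict() for _ in range(self.modulus)]
        for key, value in records:
            self.groups[group_for(key, self.modulus)][key] = value

    def read(self, key: str):
        """Look a record up by hashing its key straight to one group."""
        return self.groups[group_for(key, self.modulus)].get(key)
```

Writing twenty records into a `ToyHashedFile()` makes the modulus grow from 2 upwards while every key stays readable through a single group lookup, which is the property that makes hashed files useful as lookup tables.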


Types of views in Datastage Director? Datastage Director offers the following views. a) Job View - dates of jobs compiled. b) Status View - status of the last run of each job. c) Log View - warning messages, event messages, program generated messages. d) Schedule View. e) Detail View.


What are Stage Variables, Derivations and Constraints? Stage Variable - an intermediate processing variable that retains its value while rows are read and is not itself passed to a target column; it acts as a temporary memory area. Derivation - an expression that specifies the value to be passed on to the target column; this is where you apply the business rule. Constraint - a condition that is either true or false and that controls the flow of data down a link. The order of execution is: stage variables, then constraints, then derivations.
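The order of evaluation can be illustrated with a small Python sketch. It is only an analogy for how a Transformer stage treats each row; the function name `run_transformer` and the column names `id` and `amount` are hypothetical.

```python
def run_transformer(rows):
    """Process rows in the order a Transformer stage evaluates its parts."""
    running_total = 0  # stage variable: keeps its value from row to row
    output = []
    for row in rows:
        # 1. Stage variables are evaluated first, for every input row.
        running_total += row["amount"]

        # 2. Constraints decide whether the row goes down the output link.
        if row["amount"] <= 0:
            continue

        # 3. Derivations compute the values of the target columns.
        output.append({
            "id": row["id"],
            "amount_cents": row["amount"] * 100,
            "running_total": running_total,
        })
    return output
```

Because the stage variable is updated before the constraint is checked, a row that the constraint filters out still contributes to `running_total` for the rows that follow it.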


What is the default cache size? How do you change the cache size if needed? The default read cache size is 128 MB (one answer quotes 256 MB). You can increase it in DataStage Administrator by selecting the Tunables tab and specifying the cache size there.

Containers: Usage and Types? A container is a collection of stages grouped for the purpose of reusability. There are 2 types of containers. a) Local Container: job specific. b) Shared Container: can be used in any job within a project. There are two types of shared container: 1. Server shared container. Used in server jobs (can also be used in parallel jobs). 2. Parallel shared container. Used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).



Types of Parallel Processing? At the hardware level, parallel processing is broadly classified into 2 types. a) SMP - Symmetric Multi Processing. b) MPP - Massively Parallel Processing. Within a parallel job there are also two types of parallelism: 1) Pipeline parallelism 2) Partition (data) parallelism. Round robin is one of the partitioning methods used for partition parallelism, not a separate type of parallelism.
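The two kinds of parallelism can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, CRC32 stands in for whatever hash the engine uses, and Python generators model the row-by-row hand-off of a pipeline without running the stages on separate processors.

```python
import zlib


def round_robin_partition(rows, partitions):
    """Deal rows out to partitions in turn: row 0 -> 0, row 1 -> 1, ..."""
    parts = [[] for _ in range(partitions)]
    for index, row in enumerate(rows):
        parts[index % partitions].append(row)
    return parts


def hash_partition(rows, key, partitions):
    """Send rows with the same key value to the same partition."""
    parts = [[] for _ in range(partitions)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % partitions].append(row)
    return parts


def read_stage(rows):
    """Upstream stage: hands rows downstream one at a time."""
    for row in rows:
        yield row


def transform_stage(rows):
    """Downstream stage: starts work as soon as the first row arrives."""
    for row in rows:
        yield {**row, "amount": row["amount"] * 2}


def pipeline(rows):
    """Chain the stages so no stage waits for the previous one to finish."""
    return list(transform_stage(read_stage(rows)))
```

Round robin gives evenly sized partitions, while hash partitioning keeps related rows together, which matters for stages that join, aggregate or remove duplicates on a key.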

7. What does a Config File in parallel extender consist of?

The config file consists of the following. a) Number of processes or nodes. b) Actual disk storage location. The DataStage engine reads the config file before running a job in PX; it describes the configuration of your server, for example the nodes.

7. Functionality of Link Partitioner and Link Collector?

Link Partitioner: It splits data into various partitions or data flows using various partitioning methods. It receives data on a single input link and diverts it to a maximum of 64 output links, all of which carry the same metadata. Link Collector: It collects the data coming from up to 64 input links (partitions), merges it into a single data flow and loads it to the target. Server jobs mainly execute in sequential fashion; the IPC stage, the Link Partitioner and the Link Collector simulate the parallel mode of execution in server jobs, even on a single CPU. Both are active stages, and the design and mode of execution of server jobs has to be decided by the designer.

8. What is SQL tuning? How do you do it?

In the database, using hints. SQL tuning can be done using cost based optimization. These parameters in the pfile are very important: sort_area_size, sort_area_retained_size, db_file_multiblock_read_count, open_cursors, cursor_sharing, optimizer_mode=choose/rule.

9. How do you track performance statistics and enhance it?

Through the Monitor we can view the performance statistics.

10. What is the order of execution done internally in the transformer with the stage editor having input links on the left hand side and output links on the right?

Stage variables, constraints and column derivations or expressions.

11. What are the difficulties faced in using DataStage? or what are the constraints in using DataStage?

1) What if the number of lookups is large? 2) What will happen if the job aborts for some reason while loading the data?

12. Differentiate Database data and Data warehouse data?

Data in a Database is a) Detailed or Transactional b) Both Readable and Writable. c) Current. By Database, one means OLTP (On Line Transaction Processing). This can be the source systems or the ODS (Operational Data Store), which contains the transactional data.

12. Dimension Modeling types along with their significance

Data Modeling is broadly classified into 2 types. a) E-R Diagrams (Entity - Relationships). b) Dimensional Modeling, which is further divided into logical modeling and physical modeling.

13. What is the flow of loading data into fact & dimensional tables?

Fact table - Table with a collection of foreign keys corresponding to the primary keys in the dimension tables. Consists of fields with numeric values. Dimension table - Table with a unique primary key. Load - Data should first be loaded into the dimension tables. Based on the primary key values in the dimension tables, the data should then be loaded into the fact table. Here is the sequence of loading a data warehouse. 1. The source data is first loaded into the staging area, where data cleansing takes place. 2. The data from the staging area is then loaded into dimensions/lookups. 3. Finally the fact tables are loaded from the corresponding source tables in the staging area.

14. What are XML files and how do you read data from XML files and what stage is to be used?

In the palette there are Real Time stages such as xml-input, xml-output and xml-transformer.

15. Why do you use SQL LOADER or OCI STAGE?

Data is transferred very quickly to the data warehouse by using SQL Loader. When the source data is enormous, or for bulk data, we can use OCI or SQL Loader depending upon the source.

16. Suppose there are a million records, did you use OCI? If not, then what stage do you prefer?

Using Orabulk.

17. How do you populate source files?

There are many ways to populate them; writing a SQL statement in Oracle is one way.

18. How do you pass the parameter to the job sequence if the job is running at night?

Two ways. 1. Set the default values of the parameters in the Job Sequencer and map these parameters to the job. 2. Run the job in the sequencer using the dsjob utility, where we can specify the values to be taken for each parameter.

19. What happens if the job fails at night?

The job sequence aborts.

20. Explain the differences between Oracle8i/9i?

Multiprocessing, and more dimensional modeling support in the database.

21. What are Static Hash files and Dynamic Hash files?

The names suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size. Hashed files have a default size established by their modulus and separation when you create them, and this can be static or dynamic. Overflow space is only used when data grows over the reserved size for one of the groups (sectors) within the file. There are as many groups as specified by the modulus.

22. What is Hash file stage and what is it used for?

It is used for lookups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance. We can also use the Hash File stage to avoid or remove duplicate rows by specifying the hash key on a particular field.

23. Did you parameterize the job or hard-code the values in the jobs?

Always parameterize the job. The values come either from Job Properties or from a Parameter Manager, a third-party tool. You should never hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked up.

24. What are Sequencers?

Sequencers are job control programs that execute other jobs with preset job parameters. A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes. ALL mode: all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire. ANY mode: output triggers can be fired if any of the sequencer inputs are TRUE.

25. What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?
1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files, for optimum performance and also for data recovery in case the job aborts.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the DB and reduced the use of DS-Sort for better performance of jobs.
6. Removed the data not used from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories.
11. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
12. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate unnecessary records before joins are made.
13. Tuning should occur on a job-by-job basis.
14. Use the power of the DBMS.
15. Try not to use a sort stage when you can use an ORDER BY clause in the database.
16. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE.
17. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
18. Minimize the usage of the Transformer (use Copy, Modify, Filter or Row Generator instead).
19. Use SQL code while extracting the data.
20. Handle the nulls.
21. Minimize the warnings.
22. Reduce the number of lookups in a job design.
23. Use not more than 20 stages in a job.
24. Use an IPC stage between two passive stages; it reduces processing time.
25. Drop indexes before data loading and recreate them after loading data into tables.
26. Check the write cache of the hash file. If the same hash file is used for lookup as well as target, disable this option.
27. If the hash file is used only for lookup then enable "Preload to memory". This will improve the performance. Also, check the order of execution of the routines.
28. Don't use more than 7 lookups in the same transformer; introduce new transformers if it exceeds 7 lookups.
29. Use the Preload to memory option in the hash file output.
30. Use Write to cache in the hash file input.
31. Write into the error tables only after all the transformer stages.
32. Reduce the width of the input record - remove the columns that you would not use.
33. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
34. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files. This would also minimize overflow on the hash file.
35. If possible, break the input into multiple threads and run multiple instances of the job.

Generally we cannot avoid lookups if the requirements make them compulsory. There is no fixed limit on the number of stages, such as 20 or 30, but we can break the job into smaller jobs and use Data Set stages to store the intermediate data. The IPC stage is provided in server jobs, not in parallel jobs.
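The advice to let the database filter and sort can be illustrated with a small Python sketch using the standard `sqlite3` module. It is only an analogy for what a job does when it extracts rows: the `orders` table, its columns and the function names are hypothetical, and SQLite stands in for whatever database the job reads from.

```python
import sqlite3


def load_sample(conn):
    """Create a small hypothetical source table."""
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(3, 10.0), (1, -5.0), (2, 7.5)],
    )


def filter_in_job(conn):
    """Pull every row, then filter and sort in the job (the slower pattern)."""
    rows = conn.execute("SELECT id, amount FROM orders").fetchall()
    return sorted(row for row in rows if row[1] > 0)


def filter_in_database(conn):
    """Let the database filter and sort, so only needed rows are extracted."""
    query = "SELECT id, amount FROM orders WHERE amount > 0 ORDER BY id"
    return conn.execute(query).fetchall()
```

Both functions return the same rows, but the second moves less data out of the database and leaves the sort to the engine, which can use its indexes.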

26. How did you handle reject data?

Typically a reject link is defined and the rejected data is loaded back into the data warehouse. A reject link has to be defined for every output link from which you wish to collect rejected data. Rejected data is typically bad data, such as duplicate primary keys or null rows where data is expected. We can also handle rejected data by collecting it separately in a sequential file.

27. What are Routines and where/how are they written and have you written any routines before?

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them using the Routine dialog box. The different types of routines are: 1) Transform functions 2) Before/After job subroutines 3) Job control routines. The following program components are classified as routines:

Transform functions. These are functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions which are located in the Routines > Examples > Functions branch of the Repository. You can also define your own transform functions in the Routine dialog box.

Before/After subroutines. When designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, which are located in the Routines > Built-in > Before/After branch in the Repository. You can also define your own before/after subroutines using the Routine dialog box.

Custom UniVerse functions. These are specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository. You specify the category when you create the routine. If NLS is enabled, you should be aware of any mapping requirements when using custom UniVerse functions. If a function uses data in a particular character set, it is your responsibility to map the data to and from Unicode.

ActiveX (OLE) functions. You can use ActiveX (OLE) functions as programming components within DataStage. Such functions are made accessible to DataStage by importing them. This creates a wrapper that enables you to call the functions.
After import, you can view and edit the BASIC wrapper using the Routine dialog box. By default, such functions are located in the Routines > Class name branch in the Repository, but you can specify your own category when importing the functions. When using the Expression Editor, all of these components appear under the DS Routines command on the Suggest Operand menu.

A special case of routine is the job control routine. Such a routine is used to set up a DataStage job that controls other DataStage jobs. Job control routines are specified in the Job control page on the Job Properties dialog box. Job control routines are not stored under the Routines branch in the Repository.

Transforms are stored in the Transforms branch of the DataStage Repository, where you can create, view or edit them using the Transform dialog box. Transforms specify the type of data transformed, the type it is transformed into, and the expression that performs the transformation. DataStage is supplied with a number of built-in transforms (which you cannot edit). You can also define your own custom transforms, which are stored in the Repository and can be used by other DataStage jobs. When using the Expression Editor, the transforms appear under the DS Transform command on the Suggest Operand menu.

Functions take arguments and return a value. The word function is applied to many components in DataStage:

BASIC functions. These are one of the fundamental building blocks of the BASIC language. When using the Expression Editor, you can access the BASIC functions via the Function command on the Suggest Operand menu.

DataStage BASIC functions. These are special BASIC functions that are specific to DataStage. These are mostly used in job control routines. DataStage functions begin with DS to distinguish them from general BASIC functions. When using the Expression Editor, you can access the DataStage BASIC functions via the DS Functions command on the Suggest Operand menu.
The following items, although called functions, are classified as routines and are described under Routines. When using the Expression Editor, they all appear under the DS Routines command on the Suggest Operand menu: Transform functions, Custom UniVerse functions, ActiveX (OLE) functions.

Expressions. An expression is an element of code that defines a value. The word expression is used both as a specific part of BASIC syntax, and to describe portions of code that you can enter when defining a job. Areas of DataStage where you can use such expressions are: defining breakpoints in the debugger; defining column derivations, key expressions and constraints in Transformer stages; defining a custom transform. In each of these cases the DataStage Expression Editor guides you as to what programming elements you can insert into the expression.

28. What are OConv () and Iconv () functions and where are they used?

IConv () - Converts a string to an internal storage format. OConv () - Converts an expression to an output format. Iconv is used to convert a date into the internal format, i.e. a format only DataStage can understand. Example: for a date coming in mm/dd/yyyy format, DataStage will convert this date into some number like 740.

You can then render this 740 in your own format by using OConv. Suppose you want to change mm/dd/yyyy to dd/mm/yyyy: you use IConv and OConv together, in the form OConv(IConv(InputDateString, IconvFormat), OconvFormat), where the two format codes are looked up in the help.

29. Do you know about METASTAGE?

In simple terms, metadata is data about data, and it can describe anything in DataStage (data set, sequential file, etc.). MetaStage is used to handle the metadata, which will be very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling. These data definitions are stored in the repository and can be accessed with the use of MetaStage. MetaStage is a metadata repository in which you can store the metadata (DDLs etc.) and perform analysis on dependencies, change impact etc. MetaStage is DataStage's native reporting tool; it contains lots of functions and reports.

30. Do you know about INTEGRITY/QUALITY stage?

Integrity/Quality Stage is a data integration tool from Ascential which is used to standardize/integrate the data from different sources.

31. What are the command line functions that import and export the DS jobs?

A. dsimport.exe - imports the DataStage components. B. dsexport.exe - exports the DataStage components. Parameters: UserName, Password, Hostname, ProjectName, CurrentDirectory (C:/Ascential/DataStage7.5.1/dsexport.exe), FileName (JobName).

32. What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?

Use the crontab utility along with the dsexecute() function, with the proper parameters passed. "Control_M Scheduling Tool": through Control_M you can automate the job by invoking the shell script written to schedule the DataStage jobs.

33. What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?

A. Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. You can also schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.

34. How can we improve the performance of DataStage jobs?

Performance and tuning of DS jobs: 1. Establish baselines. 2. Avoid the use of only one flow for tuning/performance testing. 3. Work in increments. 4. Evaluate data skew. 5. Isolate and solve. 6. Distribute file systems to eliminate bottlenecks. 7. Do not involve the RDBMS in initial testing. 8. Understand and evaluate the tuning knobs available.

35. What are the Job parameters?

These parameters are used to provide administrative access and to change run time values of the job. Use EDIT>JOBPARAMETERS. In the Parameters tab we can define the name, prompt, type and value.

36. What is the difference between routine and transform and function?

Routines describe the business logic, while transforms specify how data is transformed from one form to another by applying transformation rules.

38. What are all the third party tools used in DataStage?

Autosys, TNG and Event Coordinator are some of them that I know and have worked with.

39. How can we implement Lookup in DataStage Server jobs?

We can use a hash file as a lookup in server jobs. The hash file needs at least one key column to be created. Hashed files store data based on a hashing algorithm and key values. The DB2 stage can also be used for lookups. In the Enterprise Edition, the Lookup stage can be used for doing lookups. On the server canvas we can perform 2 kinds of direct lookups: one by using a hashed file and the other by using a Database/ODBC stage as a lookup.

40. How can we join one Oracle source and Sequential file?

Join and Lookup are used to join an Oracle source and a sequential file.

41. What is iconv and oconv functions?

Iconv( ) - converts a string to an internal storage format. Oconv( ) - converts an expression to an output format.

42. Difference between Hashfile and Sequential File?

A hash file stores the data based on a hash algorithm and on a key value. A sequential file is just a file with no key column. A hash file can be used as a reference for lookup; a sequential file cannot.

43. What is DS Administrator used for - did you use it?

The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.

44. How do you eliminate duplicate rows?
Use the Remove Duplicates stage: it takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set. Alternatively, try to use a unique function.

45. Dimensional modelling is again sub divided into 2 types.

a) Star Schema - simple and much faster. Denormalized form. b) Snowflake Schema - complex, with more granularity. More normalized form.

46. How will you call external function or subroutine from datastage?

There is a DataStage option to call external programs: ExecSH.

46. How do you pass filename as the parameter for a job?

During job development we can create a parameter 'FILE_NAME' and the value can be passed while running the job. 1. Go to DataStage Administrator->Projects->Properties->Environment->UserDefined. Here you can see a grid, where you can enter your parameter name and the corresponding path of the file. 2. Go to the Stage tab of the job, select the NLS tab, click on "Use Job Parameter" and select the parameter name which you have given above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job. Keep the project default in the text box.

47. How to handle Date conversions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?

a) "Iconv" function - internal conversion. b) "Oconv" function - external conversion. The function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(Fieldname,"D/MDY[2,2,4]"),"D-YDM[4,2,2]"). ToChar(%date%, %format%) should also work; in the format, specify which format you want, i.e. 'yyyy-dd-mm'.

47. What's the difference between operational data store (ODS) & data warehouse?

Data that is volatile is ODS data; data that is nonvolatile, historical and time variant is data warehouse data. In simple terms, an ODS holds dynamic data. A data warehouse is a decision support database for organisational needs. It is a subject oriented, non volatile, integrated, time variant collection of data. An ODS (Operational Data Store) is an integrated collection of related information; it typically contains at most 90 days of information. The operational data store is part of the transactional database landscape: it keeps integrated data from different transactional databases and allows common operations across the organisation, e.g. banking transactions.
An operational data store (or "ODS") is a database designed to integrate data from multiple sources to facilitate operations, analysis and reporting. Because the data originates from multiple sources, the integration often involves cleaning, redundancy resolution and business rule enforcement. An ODS is usually designed to contain low level or atomic (indivisible) data such as transactions and prices, as opposed to aggregated or summarized data such as net contributions. Aggregated data is usually stored in the data warehouse.

48. How can we create Containers?

Containers are a special type of job component in DataStage that simplify the job design in either server or parallel jobs. There are two types of containers: 1. Local Container 2. Shared Container. A local container is developed and stored as part of a job and is available for that particular job only, whereas shared containers are stored in the repository and can be used anywhere in the project. Local container: Step 1: Select the stages required. Step 2: Edit>ConstructContainer>Local. Shared container: Step 1: Select the stages required. Step 2: Edit>ConstructContainer>Shared. Shared containers are stored in the SharedContainers branch of the tree structure. Shared containers are of 2 types: 1. Server shared containers 2. Parallel shared containers.

49. Importance of Surrogate Key in Data warehousing?

A surrogate key is a primary key for a dimension table. The main advantage of using it is that it is independent of the underlying database, i.e. a surrogate key is not affected by the changes going on in a database. The concept of a surrogate key comes into play when there is a slowly changing dimension in a table. In such a condition there is a need for a key by which we can identify the changes made in the dimensions. These slowly changing dimensions can be of three types, namely SCD1, SCD2 and SCD3. Surrogate keys are system generated keys. Mainly they are just a sequence of numbers, but they can also be alphanumeric values. They are used to keep track of changes in the primary key. A surrogate key should be a system generated number and it should be a small integer. For each dimension table, depending on the SCD type and the total number of records expected over a 4 year period, you may limit the maximum number. This will improve indexing, performance and query processing. The surrogate key is the primary key in the dimension table and a foreign key in the fact table. It is also used to handle missing data and complex situations in DataStage.

50. How do you merge two files in DS?

Either use the Copy command as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.

51. How do we do the automation of dsjobs?

"dsjobs" can be automated by using shell scripts on a UNIX system. We can call a DataStage batch job from the command prompt using 'dsjob'. We can also pass all the parameters from the command prompt. Then call this shell script from any of the schedulers available on the market. The 2nd option is to schedule these jobs using DataStage Director.

52. What are types of Hashed File?

Hashed File is classified broadly into 2 types. a) Static - sub divided into 17 types based on primary key pattern.
b) Dynamic - sub divided into 2 types i) Generic ii) Specific. Default Hased file is "Dynamic - Type Random 30 D" . Hashed File is classified broadly into 2 types. a) Static - Sub divided into 17 types based on Primary Key Pattern. b) Dynamic - sub divided into 2 types i) Generic ii) Specific. Default Hased file is "Dynamic - Type30. 53. How do you eliminate duplicate rows? delete from from table name where rowid not in(select max/min(rowid)from emp group by column name). Data Stage provides us with a stage Remove Duplicates in Enterprise edition. Using that stage we can eliminate the duplicates based on a key column. The Duplicates can be eliminated by loading the corresponding data in the Hash file. Specify the columns on which u want to eliminate as the keys of hash . removal of duplicates done in two ways: 1. Use "Duplicate Data Removal" stage or 2. use group by on all the columns used in select , duplicates will go away. 54. What about System variables? DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only. @DATE The internal date when the program started. See the Date function. @DAY The day of the month extracted from the value in @DATE. @FALSE The compiler replaces the value with 0. @FM A field mark, Char(254).

@IM An item mark, Char(255).
@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.
@LOGNAME The user login name.
@MONTH The current month extracted from the value in @DATE.
@NULL The null value.
@NULL.STR The internal representation of the null value, Char(128).
@PATH The pathname of the current DataStage project.
@SCHEMA The schema name of the current DataStage project.
@SM A subvalue mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE Status codes returned by system processes or commands.
@TIME The internal time when the program started. See the Time function.
@TM A text mark (a delimiter used in UniVerse files), Char(251).
@TRUE The compiler replaces the value with 1.
@USERNO The user number.
@VM A value mark (a delimiter used in UniVerse files), Char(253).
@WHO The name of the current DataStage project directory.
@YEAR The current year extracted from @DATE.
REJECTED Can be used in the constraint expression of a Transformer stage output link. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

55. Where does a UNIX script for DataStage execute: on the client machine or on the server?

DataStage jobs are executed on the server machine only. Nothing is stored on the client machine.

56. What is the default number of nodes in DataStage Parallel Edition?

The default number of nodes is always one.

Actually, the number of nodes depends on the number of processors in your system. If your system has two processors, you get two nodes by default.

57. What happens if RCP is disabled?

In that case OSH has to perform import and export every time the job runs, and the processing time of the job also increases.
Runtime column propagation (RCP): If RCP is enabled for a job, and specifically for those stages whose output connects to a shared container input, then metadata is propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, OSH has to perform import and export every time the job runs, and the processing time of the job also increases.

58. I want to process 3 files sequentially, one by one. How can I do that? While processing, the job should fetch the files automatically.

If the metadata for all the files is the same, create a job that takes the file name as a parameter, then use the same job in a routine and call it with a different file name each time. Alternatively, you can create a job sequence that runs the job once per file.

59. What is the difference between DataStage and Informatica?

It basically depends on what you are trying to accomplish. What are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day against yesterday's file? If so, ask how each vendor would do that and think about the process they are proposing. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If your data sets are small enough, either tool would probably be OK.

60. What is OCI, and how is it used in ETL tools?

OCI is sometimes confused with Orabulk, which is used when the client has bulk data to load and retrieve.

OCI does not mean Orabulk. The stage uses Oracle's "Oracle Call Interface" to load the data, which is about the lowest-level Oracle interface available for loading data.

OCI means Oracle Call Interface, i.e. it acts as a native tool for loading an Oracle database. You can just drag the Oracle OCI stage from the palette and load.

61. How can I connect my DB2 database on AS400 to DataStage? Do I need to use ODBC first to open the database connectivity and then use an adapter to connect the two?

You need to configure ODBC connectivity for the database (DB2 on AS400) in DataStage. There is also a DB2 stage that you can drag into the job and use to load, but it is better to use ODBC.

61.
How can I extract data from DB2 (on IBM iSeries) to the data warehouse with DataStage as the ETL tool? Do I first need to use ODBC to create connectivity and then use an adapter for the extraction and transformation of data?

With the DB2 stage, we can extract the data in the ETL job.

You need to install ODBC drivers to connect to the DB2 instance. They do not come with the regular drivers; use the CD provided for the DB2 installation, which has the ODBC drivers for DB2. Then try it out.

If your system is a mainframe, you can use the load and unload utilities. Unload extracts the records from the mainframe system, and from there you have to export them to your own system (Windows).

62. What is merge and how is it done? Please explain with a simple example using 2 tables.

Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Consider two tables, Emp and Dept. To join them we use DeptNo, the common key: give that column name as the key, sort DeptNo in ascending order, and join the two tables.

In Server Edition, the Merge stage is used only for flat files.

63. What happens if the output of a hashed file is connected to a Transformer? What error does it throw?

If you connect the output of a hashed file to a Transformer, it acts as a reference link. There are no errors at all. It can be used in implementing SCDs.

If a hashed file output is connected to a Transformer stage, the hashed file is treated as a lookup when there is a primary link into the same Transformer stage; if there is no primary link, it is treated as the primary link itself. You can implement SCDs in a server job by using this lookup functionality. It does not return any error code.

64. What are the repository tables in DataStage?

A data warehouse is a repository (centralized as well as distributed) of data, able to answer ad hoc, analytical, historical, or complex queries. Metadata is data about data.
Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.

In DataStage, the stage definition has Input, Output, and Transfer pages under the Interface tab. There are 4 tabs, and the last one is Build; under it you can find the table name. The DataStage client components

are:
Administrator - Administers DataStage projects and conducts housekeeping on the server.
Designer - Creates DataStage jobs that are compiled into executable programs.
Director - Used to run and monitor DataStage jobs.
Manager - Allows you to view and edit the contents of the repository.

65. How can we pass parameters to a job by using a file?

You can do this by reading the parameters from a UNIX file and then calling the DataStage job. The job has the parameters defined, and the values are passed in from UNIX.

You can create a UNIX shell script that passes the parameters to the job, and you can also create logs for the whole run of the job.

66. What is the meaning of the following? 1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 2) Tuning should occur on a job-by-job basis. 3) Use the power of the DBMS.

The question is not clear, but I will try to answer. If you have SMP machines, you can use the IPC, Link Collector, and Link Partitioner stages for performance tuning. If you have cluster or MPP machines, you can use parallel jobs.

The third point is about tuning job performance. "Use the power of the DBMS" means you can improve the performance of the SQL used in the jobs with database features such as analyzing tables, creating indexes, and creating partitions.

67. What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made"?

It probably means that you can put the selection criteria in the WHERE clause, i.e. whatever data you need to filter out, filter it out in the SQL rather than carrying it forward and filtering it later.

A constraint is nothing but a restriction on the data. Here the restriction is applied at entry itself, so it avoids bringing in unnecessary data.
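As a rough illustration of the advice in question 67, the following Python sketch uses the standard `sqlite3` module as a stand-in for the source database. The `orders` and `customers` tables, their columns, and the sample rows are hypothetical and exist only for this example. It compares extracting every joined row and filtering afterwards with filtering in the WHERE clause of the source query.

```python
import sqlite3

# In-memory database standing in for the source system (hypothetical tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, status TEXT);
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 'OPEN'), (2, 10, 'CLOSED'),
                              (3, 20, 'OPEN'), (4, 30, 'CLOSED');
    INSERT INTO customers VALUES (10, 'Ann'), (20, 'Bob'), (30, 'Cid');
""")

# Approach A: join everything, carry all rows forward, filter later
# (comparable to a constraint placed downstream in the job).
all_rows = conn.execute("""
    SELECT o.order_id, c.name, o.status
    FROM orders o JOIN customers c ON o.cust_id = c.cust_id
""").fetchall()
filtered_late = sorted(
    (order_id, name) for order_id, name, status in all_rows if status == "OPEN"
)

# Approach B: put the restriction in the selection criteria of the source query,
# so unwanted rows never leave the database.
filtered_early = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o JOIN customers c ON o.cust_id = c.cust_id
    WHERE o.status = 'OPEN'
    ORDER BY o.order_id
""").fetchall()

print(len(all_rows), "rows extracted without the WHERE clause")
print(len(filtered_early), "rows extracted with the WHERE clause")
```

Both approaches end with the same rows, but the second moves fewer rows out of the database, which is the point the answers above are making.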
This means: improve performance by avoiding constraints in the job wherever possible and applying the restriction while selecting the data, using a WHERE clause.

68. How can we ETL an Excel file to a data mart?

Take the source file (Excel file) in .csv format and apply the conditions that satisfy the data mart.

Create a DSN in the Control Panel using the Microsoft Excel driver. You can then read the Excel file from an ODBC stage.

Open the ODBC Data Source Administrator found under Control Panel / Administrative Tools. Under the System DSN tab, add the Microsoft Excel driver. You will then be able to access the XLS file from DataStage.

69. What is the difference between server jobs and parallel jobs?

Server jobs: These are available if you have installed DataStage Server. They run on the DataStage server, connecting to other data sources as necessary.
Parallel jobs: These are available only if you have installed Enterprise Edition. They run on DataStage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.

Parallel jobs are also available if you have DataStage 6.0 PX or DataStage 7.0 installed. Parallel jobs are especially useful if you have large amounts of data to process.

Server jobs: These are compiled and run on the DataStage server.
Parallel jobs: These are available only if you have Enterprise Edition installed. They are compiled and run on a DataStage UNIX server, and can be run in parallel on SMP, MPP, and cluster systems.

Server jobs can be run on SMP and MPP machines, but performance is lower. Parallel jobs are designed for SMP, MPP, and cluster machines, and performance is higher.

70. What is merge, and how is it used?

Merge is sometimes described as a filter condition, but the answers below describe it more accurately.

Merge is a stage that is available in both parallel and server jobs. The Merge stage is used to join two tables (server/parallel) or two tables/datasets (parallel). Merge requires that the master table/dataset and the update table/dataset be sorted. Merge is performed on a key field, and the key field is mandatory in both the master and the update dataset/table.

The Merge stage in a parallel job is mainly used to merge two or more data sets. It takes one master data set and n update data sets. The output is one final data set plus as many reject data sets as there are update files.

The Merge stage is mainly used to join two flat or sequential files in server jobs.

Merge is a processing stage. It can have any number of input links and only one output link. It has a master data set and one or more update data sets. The output of the Merge stage is the master data set plus additional columns from each update link. In DataStage Server, by contrast, you merge two flat files by specifying their location and name, and the output is the join of the two files. Related stages do similar work; the main difference lies in how they use memory.

71. How do we use the NLS function in DataStage? What are the advantages of NLS? Where can we use it? Explain briefly.

As per the manuals and documents, there are different levels of interfaces, such as the Teradata interface operators, DB2 interface operators, Oracle interface operators, and SAS interface operators. Can you be more specific?

Orchestrate National Language Support (NLS) makes it possible for you to process data in international languages using Unicode character sets. International Components for Unicode (ICU) libraries support NLS functionality in Orchestrate.
Operators with NLS functionality:
* Teradata interface operators
* switch operator
* filter operator
* DB2 interface operators
* Oracle interface operators
* SAS interface operators
* transform operator
* modify operator
* import and export operators
* generator operator

By using NLS we can do the following:
- Process data in a wide range of languages
- Use local formats for dates, times, and money
- Sort data according to local rules

If NLS is installed, various extra features appear in the product. For server jobs, NLS is implemented in the DataStage server engine. For parallel jobs, NLS is implemented using the ICU library.

72. What is APT_CONFIG in DataStage?

APT_CONFIG_FILE (not just APT_CONFIG) points to the configuration file that defines the nodes (and their scratch and temp areas) for the specific project.

DataStage understands the architecture of the system through this file (APT_CONFIG_FILE). For example, the file contains the node names, disk storage information, and so on.

APT_CONFIG is just an environment variable used to identify the *.apt file. Do not confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.

73. What is NLS in DataStage? How do we use it, and what are its advantages? I did not choose the NLS option at installation time and now want to use it. Do I need to reinstall DataStage, or uninstall it first and install it again?

NLS is basically the local language setting (character set). Once you install DataStage, NLS is present. Just log in to the Administrator client and set the NLS map of your project based on your project requirements. For example, if you know you have a file with some Greek text, set the NLS map for Greek so that DataStage recognizes those special characters while running the job.

74.
What is the difference between DataStage and DataStage TX?

It is a difficult question to answer, but one thing I can tell you is that DataStage TX is not an ETL tool and it is not a new version of DataStage 7.5. TX is used for ODS sources; this much I know.

75. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?

Rows that share a key 2 value may sit in different partitions, so the data has to be repartitioned on key 2 before aggregating; otherwise each partition produces its own partial result for the same group. The repartitioning means the job takes longer to execute.
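The issue in question 75 can be sketched in plain Python. This is only a toy model, not DataStage code: the `partition` and `aggregate` helpers, the sample rows, and the simple character-sum hash are all assumptions made for the illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: key1 is the original partitioning key,
# key2 is the key we want to aggregate on.
rows = [
    {"key1": "A", "key2": "x", "amount": 10},
    {"key1": "B", "key2": "x", "amount": 5},
    {"key1": "C", "key2": "y", "amount": 7},
    {"key1": "D", "key2": "y", "amount": 3},
]

def partition(rows, key, n):
    """Hash-partition rows into n partitions using a deterministic toy hash."""
    parts = [[] for _ in range(n)]
    for row in rows:
        idx = sum(ord(ch) for ch in row[key]) % n
        parts[idx].append(row)
    return parts

def aggregate(part, key):
    """Sum 'amount' per value of key within a single partition."""
    totals = defaultdict(int)
    for row in part:
        totals[row[key]] += row["amount"]
    return dict(totals)

# Partitioned on key1, aggregated on key2: every partition holds only
# part of each key2 group, so the sums are partial.
partial = [aggregate(p, "key2") for p in partition(rows, "key1", 2)]

# Repartitioned on key2 first: each key2 group lives in exactly one partition.
correct = [aggregate(p, "key2") for p in partition(rows, "key2", 2)]

print(partial)
print(correct)
```

In the first result the group "x" appears in both partitions with different partial sums; after repartitioning on key 2 each group appears once with its full total.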

76. If you are running 4 ways parallel and you have 10 stages on the canvas, how many processes does DataStage create?

The answer is 40. You have 10 stages, and each stage can be partitioned and run on 4 nodes, which makes the total number of processes 40. In practice the count can differ, because the engine may combine adjacent operators into one process and also starts its own coordinating processes.

77. How can you do an incremental load in DataStage?

You can create a table where you store the last successful refresh time for each table/dimension. Then, in the source query, take the delta between the last successful refresh time and sysdate; this gives you the incremental load.

Incremental load means daily load. Whenever you select data from the source, select the records that were loaded or updated between the timestamp of the last successful load and the start date and time of today's load. For this you have to pass parameters for those two dates. Store the last run date and time in a file, read it in through job parameters, and supply the current date and time as the second argument.

78. Does Enterprise Edition only add parallel processing for better performance? Are any stages/transformations available in Enterprise Edition only?

DataStage Standard Edition was previously called DataStage and DataStage Server Edition.
DataStage Enterprise Edition was originally called Orchestrate, then renamed Parallel Extender when purchased by Ascential.
DataStage Enterprise: server jobs, sequence jobs, parallel jobs. The Enterprise Edition offers parallel processing features for scalable, high-volume solutions. Designed originally for UNIX, it now supports Windows, Linux, and UNIX System Services on mainframes.
DataStage Enterprise MVS: server jobs, sequence jobs, parallel jobs, MVS jobs. MVS jobs are designed using an alternative set of stages and are generated into COBOL/JCL code. They are developed on a UNIX or Windows server and transferred to the mainframe to be compiled and run.
The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages, and MVS jobs only accept MVS stages. Some stages are common to all types (such as aggregation), but they tend to have different fields and options within the stage.

Row Merger and Row Splitter are only present in parallel jobs.

79. How can you implement complex jobs in DataStage?

A complex design means one with more joins and more lookups; such a job design is called a complex job. We can implement any complex design in DataStage by following a few simple tips that also help performance. There is no limit on the number of stages in a job, but for better performance use at most 20 stages in each job. If the design exceeds 20 stages, split it into another job. Use no more than 7 lookups per Transformer; otherwise add one more Transformer.

80. 1) How can you implement slowly changing dimensions in DataStage? Explain. 2) Can you join a flat file and a database in DataStage? How?

Yes, we can join a flat file and a database in an indirect way. First create a job that populates the data from the database into a sequential file, and name it Seq_First. Then take the flat file you have and use a Merge stage to join the two files. The Merge stage offers various join types, such as pure inner join, left outer join, and right outer join. You can use whichever suits your requirements.

SCDs are of three types:
Type 1 - Modify the change.
Type 2 - Version the modified change.
Type 3 - Historical versioning of the modified change by adding a new column to hold the changed data.
Yes, you can implement SCDs in DataStage.
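The three SCD types listed above can be sketched in plain Python, independent of any DataStage stage. The dimension layout (`sk`, `cust_id`, `city`, `start_date`, `end_date`, `prev_city`) and the function names are hypothetical, chosen only to show how each type treats a changed attribute.

```python
from datetime import date

def scd_type1(dim, cust_id, new_city):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in dim:
        if row["cust_id"] == cust_id:
            row["city"] = new_city
    return dim

def scd_type2(dim, cust_id, new_city, today):
    """Type 2: close the current version and add a new row with a new surrogate key."""
    next_sk = max(row["sk"] for row in dim) + 1
    for row in dim:
        if row["cust_id"] == cust_id and row["end_date"] is None:
            row["end_date"] = today  # expire the old version
    dim.append({"sk": next_sk, "cust_id": cust_id, "city": new_city,
                "start_date": today, "end_date": None})
    return dim

def scd_type3(dim, cust_id, new_city):
    """Type 3: keep the previous value in an extra column on the same row."""
    for row in dim:
        if row["cust_id"] == cust_id:
            row["prev_city"] = row["city"]
            row["city"] = new_city
    return dim

# The same change (customer 100 moves from Pune to Delhi) under each type.
d1 = scd_type1([{"sk": 1, "cust_id": 100, "city": "Pune"}], 100, "Delhi")
d2 = scd_type2([{"sk": 1, "cust_id": 100, "city": "Pune",
                 "start_date": date(2020, 1, 1), "end_date": None}],
               100, "Delhi", date(2024, 6, 1))
d3 = scd_type3([{"sk": 1, "cust_id": 100, "city": "Pune", "prev_city": None}],
               100, "Delhi")
print(d1, d2, d3, sep="\n")
```

Type 2 is the only variant that adds rows, which is why it relies on a surrogate key rather than the business key to identify each version.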
SCD Type 1: just use 'insert rows else update rows' or 'update rows else insert rows' as the update action of the target.
SCD Type 2: use one hashed file to look up the target, take 3 instances of the target, give different conditions depending on the process, give different update actions in the target, and use system values such as sysdate and null.

We can handle SCDs in the following ways:
Type I: Just overwrite.
Type II: We need versioning and dates.
Type III: Add old and new copies of certain important fields.
Hybrid dimensions: A combination of Type II and Type III.

Yes, you can implement Type 1, Type 2, or Type 3. Let me try to explain Type 2 with a timestamp.

Step 1: We create the timestamp via a shared container. It returns the system time and one key. To satisfy the lookup condition, we create a key column by using the Column Generator.

Step 2: Our source is a data set and the lookup table is read with an Oracle OCI stage. Using the Change Capture stage we find the differences. The Change Capture stage returns a value for change_code, and based on that value we determine whether the row is an insert, an edit, or an update. If it is an insert, we stamp it with the current timestamp and keep the old timestamp as history.

81. How do you implement routines in DataStage? If anyone has material on DataStage, please send it to me.

Write the routine in C or C++, create the object file, and place the object in the lib directory. Then open Designer, go to Routines, and configure the path and routine names.

There are 3 kinds of routines in DataStage:
1. Server routines, which are used in server jobs. These routines are written in the BASIC language.

2. Parallel routines, which are used in parallel jobs. These routines are written in C/C++.
3. Mainframe routines, which are used in mainframe jobs.

82. What is the difference between DataStage and Informatica?

The main difference between DataStage and Informatica is scalability: Informatica is more scalable than DataStage.

In my view DataStage is also scalable. The difference lies in the number of built-in functions, which makes DataStage more user friendly.

In my view, DataStage has fewer transformations than Informatica, which makes it harder for the user to work with.

The main difference is the vendor. Each tool has strengths that come from its architecture. DataStage takes a top-down approach. We have to choose products based on the business needs.

The main difference lies in parallelism. DataStage implements parallelism through node configuration, whereas Informatica does not.

I have used both DataStage and Informatica. In my opinion, DataStage is far more powerful and scalable than Informatica. Informatica has more developer-friendly features, but when it comes to scalability and performance, it is much inferior to DataStage. Here are a few areas where Informatica is inferior:
1. Partitioning - DataStage PX provides many more robust partitioning options than Informatica. You can also repartition the data whichever way you want.
2. Parallelism - Informatica does not support full pipeline parallelism (although it claims to).
3. File lookup - Informatica supports flat file lookup, but the caching is horrible. DataStage supports hashed files, lookup file sets, and data sets for much more efficient lookup.
4. Merge/Funnel - DataStage has very rich functionality for merging or funnelling streams. In Informatica the only way is a Union, which by the way is always a Union All.

83. DataStage from Staging to MDW is only running at 1 row per second! What do we do to remedy this?
I am assuming that too many stages are causing the problem, and the solution below is based on that assumption.

In general, if you have too many stages (especially Transformers and hashed file lookups), there is a lot of overhead and performance degrades drastically. I would suggest writing a query instead of doing several lookups. It may seem embarrassing to have a tool and still write a query, but that is best at times.

If many lookups are being done, ensure that you have appropriate indexes for the queries. If you do not want to write the query and prefer to use intermediate stages, ensure that data is eliminated properly between stages so that data volumes do not cause overhead. A reordering of stages may therefore be needed for good performance.

Other things in general that could be looked into:
1) For massive transactions, set the hashing size and buffer size to appropriate values so that as much as possible is done in memory and there is no I/O overhead to disk.
2) Enable row buffering and set an appropriate size for the row buffer.

3) It is important to use appropriate objects between stages for performance.

84. What is the User Variables activity? When and how is it used? Where is it used, with a real example?

With the User Variables activity we can create variables in a job sequence. These variables are available to all the activities in that sequence. Most often this activity is placed at the start of the job sequence.

85. What is the difference between build ops and subroutines?

A build op generates C++ code (OOP concept). A subroutine is a normal program that you can call anywhere in your project.

86. There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages of each type?

The three different stages are: i) Custom ii) Build iii) Wrapped.

87. What is the exact difference between the Join, Merge, and Lookup stages?

The three stages differ mainly in the memory they use. DataStage doesn't know how large your data is, so it cannot make an informed choice whether to combine data using a Join stage or a Lookup stage. Here's how to decide which to use: if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over, the join processing is very fast and never involves paging or other I/O.

Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are update input links.

The concept of merge and join is different in the parallel edition. You will not find a Join component in server jobs; Merge serves that purpose there.

To my knowledge, Join and Merge are both used to join two files of the same structure, whereas Lookup is mainly used to compare the previous data with the current data.

In server jobs, we can join 2 relational tables only by using a hashed file.
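The memory trade-off described in question 87 can be sketched in plain Python. This is an illustration of the two general techniques (hash lookup and sort-merge join), not of how the DataStage stages are implemented internally; the function names, the `deptno` key, and the sample rows are hypothetical, and the sort-merge version assumes the reference keys are unique.

```python
def lookup_join(stream, reference, key):
    """Lookup style: hold the whole reference set in memory as a hash table."""
    table = {ref[key]: ref for ref in reference}
    out = []
    for row in stream:
        ref = table.get(row[key])
        if ref is not None:
            out.append({**row, **ref})
    return out

def sort_merge_join(left, right, key):
    """Join style: sort both inputs on the key, then walk them in step.
    Assumes key values in 'right' are unique."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append({**left[i], **right[j]})
            i += 1  # stay on the same reference row for duplicate left keys
    return out

emp = [
    {"deptno": 20, "name": "B"},
    {"deptno": 10, "name": "A"},
    {"deptno": 20, "name": "C"},
    {"deptno": 40, "name": "D"},
]
dept = [
    {"deptno": 10, "dname": "HR"},
    {"deptno": 20, "dname": "IT"},
    {"deptno": 30, "dname": "OPS"},
]

by_lookup = sorted(lookup_join(emp, dept, "deptno"), key=lambda r: r["name"])
by_merge = sorted(sort_merge_join(emp, dept, "deptno"), key=lambda r: r["name"])
print(by_lookup == by_merge)
```

The lookup version needs the entire reference set in memory at once, while the sort-merge version only needs sorted inputs and a pair of cursors, which is why a join scales better when the reference data is large.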
The Merge stage is only for flat files.

Join takes at most two input datasets to a single output, but Merge can take more than two input datasets to a single output. Also remember that to use the Merge stage, the key field names MUST be the same in both input files (master and updates).

88. Can anyone tell me how to extract data from more than one heterogeneous source, for example a sequential file, Sybase, and Oracle in a single job?

Yes, you can extract data from heterogeneous sources in DataStage using the Transformer stage. You just need to link the sources into the Transformer stage.

You can convert all the heterogeneous sources into sequential files and join them using Merge, or you can write a user-defined query in the source itself to join them.

89. Can we use a shared container as a lookup in DataStage server jobs?

We can use a shared container as a lookup in server jobs. Wherever the same lookup is needed in multiple places, we develop the lookup in a shared container and then use the shared container as the lookup.

90. How can I specify a filter command for processing data while defining sequential file output data?

There are after-job and before-job subroutines, with which we can execute UNIX commands. Here we can use the sort command or a filter command.

91. What validations do you perform after creating jobs in Designer? What different types of errors did you face during loading, and how did you solve them?

Check the parameters, check whether the input files exist, check whether the input tables exist, and also check the usernames, data source names, passwords, and the like.

92. If I add a new environment variable in Windows, how can I access it in DataStage?

You can access it in the Designer window. Under Job Properties you can add a new environment variable or use an existing one.

You can view all the environment variables in Designer. You can check them in Job Properties, and you can add and access environment variables from there.

93. What enhancements were made in DataStage 7.5 compared with 7.0?

Many new stages were introduced compared to DataStage version 7.0. In server jobs we have the Stored Procedure stage and the Command stage, and a generate report option was added under the File tab. In job sequences, activities such as Start Loop, End Loop, Terminate Loop, and User Variables were introduced. In parallel jobs, the Surrogate Key stage and the Stored Procedure stage were introduced.

To my knowledge, the main enhancement is that we can generate reports in 7.5, which we cannot in 7.0. We can also import more plug-in stages in 7.5.

The Complex Flat File and Surrogate Key Generator stages were added in version 7.5.

94. What is a data set, and what is a file set?

I assume you are referring to the lookup file set only. It is used only for Lookup stages.

Data set: DataStage Parallel Extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs.

File set: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file, and you need to distribute files among nodes to prevent overruns.

95. How does a hashed file perform lookups in server jobs? How does it compare the key values?

A hashed file is used for two purposes:
1. Removing duplicate records.
2.
Reference lookups.
A hashed file record contains 3 parts: the hashed key, the key header, and the data portion. Because the record is located by applying the hashing algorithm to the key value, the lookup is fast.

96. What are the differences between DataStage 7.0 and 7.5 in server jobs?

There are a lot of differences. Many new stages are available in DS 7.5, for example the CDC stage and the Stored Procedure stage.

97. Is it possible to run parallel jobs in server jobs?

No. We need a UNIX server to run parallel jobs, but we can create a job on a Windows PC.

No, it is not possible to run parallel jobs in server jobs. But server jobs can be executed in parallel jobs by configuring the config file.

98. How do you handle rejected rows in DataStage?

We can handle them by using constraints and store them in a file or database.

We can handle rejected rows in two ways with the help of constraints in a Transformer: 1) by setting the Reject cell where we write our constraints in the properties of the Transformer, or 2) by using REJECTED in the expression editor of the constraint. Create a hashed file as temporary storage for rejected rows. Create a link and use it as one of the outputs of the Transformer. Apply either of the two steps above on that link. All the rows that are rejected by all the constraints will go to the hashed file.

99. What are the Orabulk and BCP stages?

These are called plug-in stages. Orabulk is used when we have bulk data to load into Oracle; for databases other than Oracle we use the BCP stage.

ORABULK is used to load bulk data into a single table of a target Oracle database. BCP is used to load bulk data into a single table for Microsoft SQL Server and Sybase.

100. How is DataStage 4.0 functionally different from the Enterprise Edition now? What are the exact changes?

There are a lot of changes in DS EE: the CDC stage, the Procedure stage, and so on.

101. How can I convert server jobs into parallel jobs?

You can't convert server jobs to parallel jobs. You have to rebuild the whole job.

There is no mechanism to convert server jobs into parallel jobs. You need to redesign the jobs in the parallel environment using parallel job stages.

I have never tried doing this; however, I have some information that will save you a lot of time. You can convert your server job into a server shared container. The server shared container can then be used in parallel jobs as a shared container.

102. How large is the database in DataStage? What is the difference between in-process and inter-process?

The size of the database varies and depends on the project. As for the second question, one answer is that in-process is where the server transfers only one row at a time to the target, and inter-process means that the server sends groups of rows to the target table. Both options are available on the Tunables tab of the Administrator client.

In-process: You can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row.
Note: You cannot use in-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

Inter-process: Use this if you are running server jobs on an SMP parallel system. This enables the job to run using a separate process for each active stage, and the processes run simultaneously on separate processors.
Note: You cannot use inter-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

103. Is it possible to move data from an Oracle warehouse to an SAP warehouse using DataStage?
We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer data from Oracle to the SAP warehouse. These plug-in packs are available with DataStage version 7.5.

104. How to implement type 2 slowly changing dimensions in DataStage? Explain with an example. We can handle SCDs in the following ways. Type 1: just use "Insert rows else update rows" or "Update rows else insert rows" as the update action of the target. Type 2: use the following steps: a) use one hashed file to look up the target; b) take three instances of the target; c) give different conditions depending on the process; d) give different update actions in the target; e) use system variables like Sysdate and Null. To develop SCD type 2, set the update action in the target to "Insert new rows only". For this you need to maintain the primary key as a composite key in the target table; it is better to use a timestamp column as one of the key columns in the target table.

105. If a DataStage job aborts after, say, 1000 records, how can the job continue from the 1000th record after fixing the error? One answer: if the error is fixed in the job where it failed, the job continues past the failed part; by specifying checkpointing in the job sequence properties, a restarted job skips up to the failed record. This option is available in the 7.5 edition. That answer is wrong: if checkpoint run is selected, the sequence keeps track of the failed job, and when you start it again it skips the jobs that ran without errors and restarts the failed job (not from the record where it stopped).

106. What is OCI? If you mean Oracle Call Interface (OCI), it is a set of low-level APIs used to interact with Oracle databases. It allows operations such as logon, execute, and parse from a C or C++ program. Oracle Call Interface: Oracle offers a proprietary call interface for C and C++ programmers that allows manipulation of data in an Oracle database. Version 9.n of the Oracle Call Interface (OCI) can connect and process SQL statements in the native Oracle environment without needing an external driver or driver manager. To use the Oracle OCI 9i stage, you need only install the Oracle version 9.n client, which uses SQL*Net to access the Oracle server. Oracle OCI 9i works with both Oracle version 7.0 and 8.0 servers, provided you install the appropriate Oracle 9i software. With Oracle OCI 9i, you can:
- Generate your SQL statement. (Fully generated SQL query/Column-generated SQL query)
- Use a file name to contain your SQL statement. (User-defined SQL file)
- Clear a table before loading using a TRUNCATE statement. (Clear table)
- Choose how often to commit rows to the database. (Transaction size)
- Input multiple rows of data in one call to the database. (Array size)
- Read multiple rows of data in one call from the database. (Array size)
- Specify transaction isolation levels for concurrency control and transaction performance tuning. (Transaction Isolation)
- Specify criteria that data must meet before being selected. (WHERE clause)
- Specify criteria to sort, summarize, and aggregate data. (Other clauses)
- Specify the behavior of parameter marks in SQL statements.

107. What is a hashing algorithm? Explain briefly how it works. Hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an algorithm, usually a relative block and an anchor point within the block. How well the algorithms work is closely related to statistical probability. It sounds fancy, but these algorithms are usually quite simple and use division and remainder techniques. Any good book on database systems will have information on these techniques. It is interesting to note that these approaches are called "Monte Carlo techniques" because the behavior of the hashing or randomizing algorithms can be simulated by a roulette wheel where the slots represent the blocks and the balls represent the records (on this roulette wheel there are many balls, not just one).
A hashing algorithm takes a variable-length data message and creates a fixed-size message digest. When a one-way hashing algorithm is used to generate the message digest, the input cannot be determined from the output. In other words, it is a mathematical function coded into an algorithm that takes a variable-length string and changes it into a fixed-length string, or hash value.
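The division-and-remainder idea described above can be sketched in a few lines of Python. This is only an illustration of key-to-address translation, not the algorithm DataStage hashed files actually use; the function names `key_to_group` and `build_groups`, the multiplier 31, and the sample keys are all assumptions made for the example.

```python
def key_to_group(key: str, modulus: int) -> int:
    """Map a record key to a group (block) number in the range 0..modulus-1."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    total = 0
    # Fold the bytes of the key into a single bounded integer.
    for byte in key.encode("utf-8"):
        total = (total * 31 + byte) % (2 ** 32)
    # The remainder after division by the number of groups is the address.
    return total % modulus


def build_groups(keys, modulus):
    """Distribute keys over `modulus` groups, like balls landing in roulette slots."""
    groups = [[] for _ in range(modulus)]
    for key in keys:
        groups[key_to_group(key, modulus)].append(key)
    return groups


if __name__ == "__main__":
    for number, group in enumerate(build_groups(["1001", "1002", "1003", "2001"], 3)):
        print(number, group)
```

The same key always lands in the same group, which is why a lookup only has to read one group instead of scanning the whole file.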

108. Is it possible to call one job from another job in server jobs? One answer: we cannot call one job within another in DataStage; however, we can write a wrapper to run the jobs in a stated sequence, or use a sequencer to sequence a series of jobs. Another answer: I think we can call a job from another job. In fact "calling" is not quite the right word, because you attach the other job through the job properties, and you can attach zero or more jobs. The steps are Edit --> Job Properties --> Job Control; click Add Job and select the desired job.

109. "Will DataStage consider the second constraint in the Transformer if the first constraint is satisfied (if link ordering is given)?" Answer: Yes.

110. What are constraints and derivations? Explain the process of taking a backup in DataStage. What are the different types of lookups available in DataStage? Constraints check a condition and filter the data. Example: if Cust_Id<>0 is set as a constraint, only the records meeting it are processed further. A derivation is a method of deriving fields, for example when you need a SUM, AVG, and so on. In other words, constraints are conditions, and only the records that meet them are processed further (for example, process all records where cust_id<>0), while derivations are derived expressions (for example, a SUM of salary or an interest-rate calculation).

110. What is a project? Specify its various components.

You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:

- DataStage jobs.
- Built-in components. These are predefined components used in a job.
- User-defined components. These are customized components created using the DataStage Manager or DataStage Designer.

111. How does DataStage handle user security? We have to create users in the Administrator and give them the necessary privileges.

112. What is the meaning of "file extender" in DataStage server jobs? Can we run one DataStage job from another? File extender means adding columns or records to an already existing file. In DataStage, we can run one job from another job.

113. What is the difference between the DRS and ODBC stages? Several answers were given. The DRS and ODBC stages are similar, as both use Open Database Connectivity to connect to a database, and performance-wise there is not much difference; we use the DRS stage in parallel jobs. Alternatively: the DRS stage should be faster than the ODBC stage because it uses native database connectivity; you will need to install and configure the required database clients on your DataStage server for it to work. The Dynamic Relational Stage was built for PeopleSoft so that a job can run on any of the supported databases. It supports ODBC connections too; read more in the plug-in documentation. ODBC uses the ODBC driver for a particular database, whereas DRS tries to make switching from one database to another seamless and uses the native connectivity for the chosen target ...

114. How to use rank and update strategy in DataStage? Don't mix Informatica with DataStage. DataStage has no such stages. You can achieve the same result with the ODBC stage by writing appropriate SQL queries.

115. What is the maximum capacity of a hashed file in DataStage? I guess it is a maximum of 2 GB. Take a look at the uvconfig file:

# 64BIT_FILES - This sets the default mode used to
# create static hashed and dynamic files.
# A value of 0 results in the creation of 32-bit
# files. 32-bit files have a maximum file size of
# 2 gigabytes. A value of 1 results in the creation
# of 64-bit files (ONLY valid on 64-bit capable platforms).
# The maximum file size for 64-bit
# files is system dependent. The default behavior
# may be overridden by keywords on certain commands.
64BIT_FILES 0

116. What is the difference between the Merge stage and the Join stage? One answer: Join can have a maximum of two input datasets, while Merge can have more than two. Merge and Join stage differences: 1. Merge has reject links. 2. Merge can take multiple update links. 3. If you use it for comparison, the first matching data will be the output, because it uses the update links to extend the primary details coming from the master link.

Someone was saying that Join does not support more than two inputs, while Merge supports two or more inputs (one master and one or more update links). I would say that is highly incomplete information. The fact is that Join does support two or more input links (left, right, and possibly intermediate links). But yes, if you are talking about a full outer join, then more than two links are not supported. Coming back to the main question of the difference between the Join and Merge stages, the other significant differences that I have noticed are:
1) Number of reject links: Join does not support reject links. Merge has as many reject links as update links (if there are n input links, then 1 will be the master link and n-1 will be update links).
2) Data selection: with Join, there are various ways in which data is selected, because there are different types of joins: inner, outer (left, right, full), cross join, and so on. So you have different selection criteria for dropping or selecting a row. With Merge, data in the master record and the update records is merged only when both have the same value for the merge key columns.
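The two differences listed above (reject links and data selection) can be illustrated with a small Python sketch. It models rows as dictionaries and is not DataStage code; the function names are hypothetical, "first match wins" follows the remark that the first matching data is output, and passing unmatched master rows through unchanged is an assumption made for the example.

```python
def inner_join(left, right, key):
    """Join-style selection: one combined row per left/right pair sharing the key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def merge_with_rejects(master, update_links, key):
    """Merge-style processing: extend master rows and collect rejected updates.

    Returns (merged_rows, rejects); rejects holds one list per update link,
    mirroring the "one reject link per update link" rule.
    """
    master_keys = {row[key] for row in master}
    merged = [dict(row) for row in master]  # unmatched masters pass through (assumption)
    rejects = []
    for updates in update_links:
        first_by_key = {}
        for row in updates:
            first_by_key.setdefault(row[key], row)  # first match wins
        for row in merged:
            match = first_by_key.get(row[key])
            if match is not None:
                row.update(match)
        # Update rows whose key has no master row go to this link's reject list.
        rejects.append([row for row in updates if row[key] not in master_keys])
    return merged, rejects


if __name__ == "__main__":
    master = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    cities = [{"id": 1, "city": "NY"}, {"id": 3, "city": "LA"}]
    print(inner_join(master, cities, "id"))
    print(merge_with_rejects(master, [cities], "id"))
```

The join simply drops the row with id 3, whereas the merge reports it on the reject list of the update link it came from.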

117. How can we create a rank in DataStage, as in Informatica? Suppose ranking means the following:

prop_id  rank
1        1
1        2
1        3
2        1
2        1

You can do it this way. First, use a Sort stage and set the option that creates the KeyChange column to true. It produces data like this:

prop_id  rank  KeyChange()
1        1     1
1        2     0
1        3     0
2        1     1
2        1     0

If the key value changes, the KeyChange column is set to 1; otherwise it is set to 0. After the Sort stage, use a Transformer stage variable to build the rank (restart the counter when KeyChange is 1, otherwise increment it).

118. What is the difference between "validated OK" and "compiled" in DataStage? When you compile a job, the compiler checks basic things, such as whether all the important stage parameters have been set and the mappings are correct, and then it creates an executable job. You validate a compiled job to make sure that all the connections are valid, all the job parameters are set, and a valid output can be expected after running the job. It is like a dry run where you don't actually touch the live data but you are confident that things will work. When we say "validating a job", we are talking about running the job in "check only" mode. The following checks are made:
- Connections are made to the data sources or data warehouse.
- SQL SELECT statements are prepared.
- Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local data source are created, if they do not already exist.

119. What are the environment variables in DataStage? Give some examples. These are variables used at the project or job level. We can use them to configure the job; for example, we can associate the configuration file (without it you cannot run your job) or increase the sequential or dataset read/write buffer. Example: $APT_CONFIG_FILE. There are many environment variables like this. Go to the job properties and click "Add Environment Variable" to see most of them.

120. What is the purpose of using keys, and what is the difference between surrogate keys and natural keys? We use keys to provide relationships between entities (tables). By using primary and foreign key relationships, we can maintain the integrity of the data. The natural key is the one coming from the OLTP system. The surrogate key is the artificial key that we create in the target data warehouse. We can use surrogate keys instead of natural keys. In SCD type 2 scenarios surrogate keys play a major role. Natural key: a sequence number assigned by the source system. Surrogate key: user assigned; you can start with any number, such as 100, 10001, or 2002.

121. How do you do usage analysis in DataStage? 1. If you want to know whether a job is part of a sequence, right-click the job in the Manager and select Usage Analysis. It shows all of the job's dependents. 2. To find how many jobs are using a particular table. 3. To find how many jobs are using a particular routine. In this way you can find all the dependents of a particular object. The view is nested: you can move forward and backward and see all the dependents.

122.
How to remove duplicates in a server job? 1) Use a Hashed File stage, or 2) if you use the sort command in UNIX (in a before-job subroutine), you can reject duplicate records using the -u parameter, or 3) use a Sort stage. It also depends on which stages you are using in the server job: if you are using an ODBC stage, you can write a user-defined query in the source stage.

123. Is it possible for two users to access the same job at the same time in DataStage? No, it is not possible for two users to access the same job at the same time. DataStage will produce the following error: "Job is accessed by other user". The only option is to kill the job process.

124. How to find errors in a job sequence? We can find the errors in a job sequence using DataStage Director.

125. What is job control? How can it be used? Explain with steps. JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or without using loops. Steps: click Edit in the menu bar, select Job Properties, and enter the parameters as follows:

parameter  prompt   type
STEP_ID    STEP_ID  string
Source     SRC      string
DSN        DSN      string
Username   unm      string
Password   pwd      string

After entering the parameters above, click the JCL button, select the jobs from the list box, and run the job.

126. How can we call a routine in a DataStage job? Explain with steps.

Routines are used for implementing business logic. There are two types: 1) before subroutines and 2) after subroutines. Steps: double-click the Transformer stage, right-click any one of the mapping fields, and select the [dsRoutines] option; within the edit window, enter the business logic and select either of the options (before or after subroutine). In other words, in the Transformer stage we have to edit the field and click dsRoutines; it will prompt us to select the routine.

127. What are the different types of lookups in DataStage?
- Lookup File stage, generally used with the Lookup stage.
- Hash lookup.
- You can also implement a "lookup" using the Merge stage.
There are two types of lookups: the Lookup stage and the Lookup File Set. Lookup: refers to another stage or database to get data from it and passes it on to another database. Lookup File Set: allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file is created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs. You can also use the sparse lookup property when the lookup table holds a large amount of data.

128. Where are flat files actually stored? What is the path? Flat files store the data, and the path can be given on the General tab of the Sequential File stage. Normally flat files are stored on FTP servers or in local folders; .CSV, .EXL and .TXT file formats are available for flat files. If your environment is UNIX, the flat files are stored on the UNIX box. You need to specify the path in the properties of the Sequential File stage, and you can parameterize the path.

129. How to find the number of rows in a sequential file? Use the row count system variable.

130.
How to implement a type 2 slowly changing dimension in DataStage? Give an example. Slowly changing dimensions are a common problem in data warehousing. For example: a customer called Lisa of company ABC lives in New York. Later she moves to Florida, and the company must now modify her address. In general there are three ways to solve this problem. Type 1: the new record replaces the original record, with no trace of the old record at all. Type 2: a new record is added to the customer dimension table, so the customer is treated essentially as two different people. Type 3: the original record is modified to reflect the change.

In Type 1 the new record overwrites the existing one, which means no history is maintained: the record of where the person stayed last is lost. It is simple to use.

In Type 2 a new record is added, so both the original and the new record are present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the dimension table grows, so storage and performance can become a concern. Type 2 should only be used if the data warehouse needs to track historical changes.

In Type 3 there are two columns, one for the original value and the other for the current value. For example, a new column is added so that the record shows the original address as New York and the current address as Florida. This keeps some part of the history without increasing the table size. One problem is that when the customer moves from Florida to Texas, the New York information is lost, so Type 3 should only be used if changes will occur only a finite number of times. It is where the data is to be stored in the intermediate files.

131. What is the purpose of the Exception activity in DataStage 7.5? It is used to catch exceptions raised while running the job.
The stages that follow the Exception activity are executed whenever an unknown error occurs while running the job sequence.

132. What is the difference between a sequential file and a dataset? When should the Copy stage be used?

A sequential file stores a small amount of data with any extension, such as .txt, whereas a dataset stores a huge amount of data and opens the file only with the extension .ds. The Copy stage copies a single input dataset to a number of output datasets. Each record of the input dataset is copied to every output dataset. Records can be copied without modification, or you can drop columns or change their order. Another answer: the main difference between a sequential file and a dataset is that a sequential file stores a small amount of data in the normal way, whereas a dataset loads the data in something like ANSI format. Sequential file: acts as a source and as permanent storage; its extension is .txt. Dataset: acts as a temporary storage stage, mainly used before the target stage. While using this stage the input data is partitioned and converted into the internal dataset format, so it is easy to load the data into the target stage. Copy: has a single input and many outputs. If you later want to add a new stage to your job, this makes it very easy; otherwise you have to modify the whole job.

133. Where do we use the Link Partitioner in a DataStage job? Explain with an example. We use the Link Partitioner in DataStage server jobs. The Link Partitioner stage is an active stage that takes one input and allows you to distribute partitioned rows to up to 64 output links.

134. How to kill a job in DataStage? By killing the respective process ID. You should use kill -14 so the job ends nicely; using -9 sometimes leaves things in a bad state. You can also do it from DataStage Director using Cleanup Resources.

135. How to parameterize a field in a sequential file? I am using DataStage as the ETL tool with a sequential file as the source. We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name in the Sequential File stage, for example #FILENAME#.

136.
How to drop the index before loading data into the target, and how to rebuild it, in DataStage? This can be achieved with the "Direct Load" option of the SQL*Loader utility.

137. If the size of a hashed file exceeds 2 GB, what happens? Does it overwrite the current rows? It overwrites the file.

138. Other than round robin, what algorithm is used in the Link Collector? Explain how it works. Other than round robin, the other algorithm is sort/merge. Using the sort/merge method, the stage reads multiple sorted inputs and writes one sorted output.

139. How to improve the performance of a hashed file? You can improve the performance of a hashed file by:
1. Preloading the hashed file into memory. This can be done by enabling the preloading options in the Hashed File output stage.
2. Write caching. Data is written into the cache before being flushed to disk. You can enable this so that hashed file rows are written to the cache in order before being flushed to disk, instead of in the order in which the individual rows are written.
3. Preallocating. Estimate the approximate size of the hashed file so that the file does not need to be split too often after write operations.

140. What is the size of a flat file? The flat file size depends on the amount of data contained in that flat file.

141. What is the DataStage engine? What is its purpose? The DataStage server contains the DataStage engine. The DS server interacts with the client components and the repository. The engine is needed to develop jobs: we can develop jobs only while the engine is running.

142. What is the difference between symmetric multiprocessing and massively parallel processing? Symmetric Multiprocessing (SMP): some hardware resources may be shared by the processors. The processors communicate via shared memory and have a single operating system. Cluster or Massively Parallel Processing (MPP): known as shared nothing, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed. The processors have their own operating systems and communicate via a high-speed network.

Symmetric multiprocessing (SMP) is the processing of programs by multiple processors that share a common operating system and memory. SMP is also called "tightly coupled multiprocessing". A single copy of the operating system is in charge of all the processors running in an SMP. The SMP methodology doesn't exceed 16 processors.
SMP is better than MPP systems for online transaction processing, in which many users access the same database to do a search with a relatively simple set of common transactions. One main advantage of SMP is its ability to dynamically balance the workload among processors (and as a result serve more users at a faster rate).

Massively parallel processing (MPP) is the processing of programs by multiple processors that work on different parts of the program and have their own operating systems and memories. These processors communicate with each other through message interfaces. There are cases in which up to 200 processors run a single application. An interconnect arrangement of data paths allows messages to be sent between the processors that run a single application or product. The setup for MPP is more complicated than for SMP. Setting up an MPP system takes experience, and one should have good in-depth knowledge of how to partition the database among the processors and how to assign work to them. An MPP system can also be called a loosely coupled system. MPP is considered better than SMP for applications that allow a number of databases to be searched in parallel.

143. Give one real-time situation where the Link Partitioner stage is used. If we want to send more data from the source to the targets quickly, we use the Link Partitioner stage in server jobs; we can make a maximum of 64 partitions. It is an active stage. Normally we can't connect two active stages, but this stage is allowed to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, with a maximum of 64 partitions. The Link Collector is also an active stage, so in order to avoid connecting an active stage from the Transformer to the Link Collector we use inter-process communication.
As this is a passive stage, by using it the data can be collected by the Link Collector. But we can use inter-process communication only when the target is a passive stage.

144. What does the separation option in a static hashed file mean? The different hashing algorithms are designed to distribute records evenly among the groups of the file, based on the characters and their positions in the record IDs. When a hashed file is created, separation and modulo respectively specify the group buffer size and the number of buffers allocated for the file. When a static hashed file is created, DataStage creates a file that contains the number of groups specified by the modulo. Size of hashed file = modulus (number of groups) * separation (buffer size).

145. How do you remove duplicates without using the Remove Duplicates stage? In the target, make the column a key column and run the job. Alternatively, using a Sort stage, set the property Allow Duplicates to false. Or just do a hash partition of the input data and check the Sort and Unique options.

146. How do you call procedures in DataStage? Use the Stored Procedure stage.

147. How can we create read-only jobs in DataStage? In the export dialog, click the Options tab; under the Include option you will find Read Only DataStage. Just enable it.

148. How to run a job from the command prompt in UNIX? Use the dsjob command with options: dsjob -run -jobstatus projectname jobname
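When the dsjob call from question 148 has to be issued from a script rather than typed by hand, it can be wrapped as in the Python sketch below. Only the `-run` and `-jobstatus` options quoted above are used; the helper names are hypothetical, and the sketch assumes that `dsjob` is on the PATH of the user running it. How to interpret the returned exit code is not covered here.

```python
import subprocess


def build_dsjob_command(project, job):
    """Return the argument list for: dsjob -run -jobstatus <project> <job>."""
    if not project or not job:
        raise ValueError("project and job names are required")
    return ["dsjob", "-run", "-jobstatus", project, job]


def run_job(project, job):
    """Run the job and return the exit code reported by dsjob.

    Assumption: the dsjob executable is installed and on the PATH.
    """
    completed = subprocess.run(build_dsjob_command(project, job))
    return completed.returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere.
    print(" ".join(build_dsjob_command("projectname", "jobname")))
```

Passing the arguments as a list avoids shell quoting problems when project or job names contain unusual characters.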

What is the difference between a Transform and a Routine in DataStage? A Transform transforms data from one form to another, whereas a Routine describes business logic.

149. How do you clean the DataStage repository? Remove log files periodically, for example with CLEAR.FILE &PH&

150. What are the transaction size and array size in the OCI stage? How can they be used? Transaction Size: this field exists for backward compatibility, but it is ignored for release 3.0 and later of the plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab of the Input page. Rows per transaction: the number of rows written before a commit is executed for the transaction. The default value is 0, that is, all the rows are written before being committed to the data table. Array Size: the number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement.

151. How to know the number of records in a sequential file before running a server job? If your environment is UNIX, you can check with the wc -l filename command.

152. My requirement is like this. Here is the suggested naming scheme:
SALE_HEADER_XXXXX_YYYYMMDD.PSV
SALE_LINE_XXXXX_YYYYMMDD.PSV
XXXXX = LVM sequence to ensure uniqueness and continuity of file exchanges. Caution: there will be an increment to implement.
YYYYMMDD = LVM date of file creation.
Compression and delivery to: SALE_HEADER_XXXXX_YYYYMMDD.ZIP and SALE_LINE_XXXXX_YYYYMMDD.ZIP

If we run the job, the target file names are sale_header_1_20060206 and sale_line_1_20060206. If we run it again, the target files should be sale_header_2_20060206 and sale_line_2_20060206. If we run the same job the next day, the target files should be sale_header_3_20060306 and sale_line_3_20060306. That is, whenever we run the job, the target file names should change automatically to filename_(previousnumber + 1)_currentdate.
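The naming rule in this requirement, filename_(previousnumber + 1)_currentdate, can be sketched in Python as follows. This is one possible illustration rather than the script the requirement calls for: the function name `next_target_name`, the idea of scanning a directory for earlier files, and the `.psv` suffix default are assumptions made for the example.

```python
import re
from datetime import date
from pathlib import Path


def next_target_name(directory, prefix, today=None, suffix=".psv"):
    """Return '<prefix>_<previous number + 1>_<YYYYMMDD><suffix>'.

    The previous number is the highest sequence found among files already
    in `directory` that follow the same naming pattern (0 if there are none).
    """
    today = today or date.today()
    pattern = re.compile(
        rf"^{re.escape(prefix)}_(\d+)_\d{{8}}{re.escape(suffix)}$"
    )
    numbers = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    sequence = max(numbers, default=0) + 1
    return f"{prefix}_{sequence}_{today:%Y%m%d}{suffix}"


if __name__ == "__main__":
    print(next_target_name(".", "sale_header"))
```

Because the sequence is read back from the files that already exist, the counter keeps increasing across days, as in the sale_header_1, sale_header_2, sale_header_3 example.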

This can be done by using a UNIX script:
1. Keep the target filename as a constant name, xxx.psv.
2. Once the job has completed, invoke the UNIX script through the after-job routine ExecSH.
3. The script should get the number used in the previous file and increment it by 1, then move the file from xxx.psv to filename_(previousnumber + 1)_currentdate.psv, and then delete the xxx.psv file.
This is the easiest way to implement it.

154. How to distinguish the surrogate keys in different dimension tables? How can we generate them for different dimension tables? Use a database sequence to make it easier for your job to generate the surrogate key.

155. How to find the process ID? Explain with steps. You can find it in UNIX by using the ps -ef command, which displays all the processes currently running on the system along with their process IDs. From DataStage Director, follow the path Job > Cleanup Resources. There you can also see the PID, and it displays all the currently running processes. Depending on your environment, you may have lots of process IDs. From one of the DataStage docs, you can try this on any given node:
$ ps -ef | grep dsuser
where dsuser is the account for DataStage. If the ps command above doesn't make sense, you'll need some background theory about how processes work in UNIX (or the MKS environment when running on Windows). Also from the DataStage docs (I haven't tried this one yet, but it looks interesting): APT_PM_SHOW_PIDS. If this variable is set, players will output an informational message upon startup, displaying their process ID. You can also use DataStage Administrator: click the project, choose to execute a command, and follow the menu choices to get the job name and PID. Then kill the process in UNIX; for this you will need the DataStage user name under which the process is locked.

155. What are QualityStage and ProfileStage? QualityStage is used for cleansing; ProfileStage is used for profiling.

156. How can I schedule the cleaning of the file &PH& by dsjob?
Create a job with a dummy Transformer and a Sequential File stage. In the before-job subroutine, use ExecTCL to execute the following command: CLEAR.FILE &PH&

157. If we are using two sources with the same metadata, how can we check whether the data in the two sources is the same? And if the data is not the same, how can we abort the job? Use a Change Capture stage and output it into a Transformer. Write a routine that aborts the job and trigger it when @INROWNUM = 1. If the data does not match, rows are passed into the Transformer and the job is aborted.

158. What is the difference between ETL and ELT? ETL usually scrubs the data and then loads it into the data mart or data warehouse, whereas ELT loads the data and then uses the RDBMS to scrub it and reload it into the data mart or data warehouse. ETL = Extract >>> Transform >>> Load. ELT = Extract >>> Load >>> Transform. In ETL the transformation takes place in the staging area; in ELT the transformation takes place on either the source side or the target side.

158. For what purpose are .dsx files used in DataStage? .dsx is the standard file extension for exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in .dsx format. A standard usage is to develop the job in the test environment and, after testing, export it and save it as x.dsx. This can be done using DataStage Manager. You can also export DataStage jobs in .xml format.

159. What are environment variables? What is their use? An environment variable is a predefined variable that we can use while creating a DS job. We can set it at either the project level or the job level. Once we set a specific variable, it is available in the project or job. We can also define new environment variables; for that, go to DS Administrator. For further details refer to the DS Administrator guide.

159. How to write and execute routines for PX jobs in C++? You define and store the routines in the DataStage repository (for example, in the Routines folder), and these routines are compiled with a C++ compiler. You have to write the routine in C++ (g++ on UNIX), then create an object file, and provide the path of this object file in your routine definition.

160. I have a few questions. 1. What are the various processes that start when the DataStage engine starts? 2. What changes need to be made on the database side if I have to use the DB2 stage? 3. Is the DataStage engine responsible for compilation, execution, or both? Three processes start when the DataStage engine starts: 1. DSRPC 2. DataStage Engine Resources 3. DataStage Telnet Services.

161. What is the difference between a reference link and a straight link? A straight link is one where data is passed directly to the next stage, and a reference link is one that shows it has a reference (reference key) to the main table.
for example in oracle EMP table has reference with DEPT table. In DATASTAGE, 2 table stage as source (one is straight link and other is reference link) to 1 transformer stage as process.If 2 source as file stage(one is straight link and other is reference link to Hash file as reference) and 1 transformer stage. 162. What is Runtime Column Propagation and how to use it? If your job has more columns which are not defined in metadata if runtime propagation is enabled it will propagate those extra columns to the rest of the job 163. Can both Source system(Oracle,SQLServer,...etc) and Target Data warehouse(may be oracle,SQLServer..etc) can be on windows environment or one of the system should be in UNIX/Linux environment. Your Source System can be (Oracle, SQL, DB2, Flat File... etc) But your Target system for complete Data Warehouse should be one (Oracle or SQL or DB2 or..) . In server edition you can have both in Windows. But in PX target should be in UNIX. 164. Is there any difference b/n Ascential DataStage and DataStage. There is no difference between Ascential Datastage and Datastage ,Now its IBM websphere Datastage earlier it was Ascential Datastage and IBM has bought it and named it as above. 165. how can we test the jobs?

Testing of jobs can be performed at many different levels: Unit testing, SIT and UAT phases. Testing basically involves functionality and performance tests. Firstly data for the job needs to be created to test the functionality. By changing the data we will see whether the requirements are met by the existing code. Every iteration of code change should be accompanied by a testing iteration. Performance tests basically involve load tests and see how well the exisiting code performance in a finite period of time. Performance tuning can be performed on sql or the job design or the basic/osh code for faster processing times. Inaddition all job designs should include a error correction and fail over support so that the code is robust. 166. What is the use of Hash file??insted of hash file why can we use sequential file itself? hash file is used to eliminate the duplicate rows based on hash key,and also used for stage not allowed to use sequential file as lookup. Actually the primary use of the hash file is to do a look up. You can use a sequential file for look up but you need to write your own routine to match the columns. Coding time and execution time will be more expensive. But when you generate a hash file the hash file indexes the key by an inbuilt hashing algorithm. so when a look up is made is much much faster. Also it eliminates the duplicate rows. these files are stored in the memory hence faster performance than from a sequential file 167. what is a routine? Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of routines: 1) Transform functions 2) Before-after job subroutines 3) Job Control routines. Routine is user defined functions that can be reusable with in the project. 168. what is the difference between static hash files n dynamic hash files? Static hash file don't chane their number of groups(modulas) except through manual resizing. 
Dynamic hash file automatically change their no of groups(modulas)in response to the amount of data stored ina file. 169. how can we create environment variables in datasatage? We can create environment variables by using DataStage Administrator. This mostely will comes under Administrator part.As a Designer only we can add directly byDesigner-view-jobprops-parameters-addenvironment variable-under userdefined-then add. 170. how can we load source into ODS? What is ur source?. Depending on type of source, you have to use respective stage. like oracle enterprise: u can use this for oracle source and target. similarly for other sources. 171. how to eleminate duplicate rows in data stage? TO remove duplicate rows you can achieve by more than one way . 1.In DS there is one stage called "Remove Duplicate" is exist where you can specify the key. 2.Other way you can specify the key while using the stage i mean stage itself remove the duplicate rows based on key while processing time. By using Hash File Stage in DS Server we can elliminate the Duplicates in DS. Using a sort stage,set property: ALLOW DUPLICATES :false OR You can use any Stage in input tab choose hash partition And Specify the key and Check the unique checkbox. if u r doing with server Jobs, V can use hashfile to eliminate duplicate rows. There are two methods for eleminating duplicate rows in datastage 1. Using hash file stage (Specify the keys and check the unique checkbox, Unique Key is not allowed duplicate values)

2. Using Sort stage by link remove duplicate stage 172. what is pivot stage?why are u using?what purpose that stage will be used? Pivot stage is used to make the horizontal rows into vertical and viceversa. Pivot stage supports only horizontal pivoting columns into rows. Pivot stage doesnt supports vertical pivoting rows into columns/ Example: In the below source table there are two cols about quarterly sales of a product but biz req. as target should contain single col. to represent quarter sales, we can achieve this problem using pivot stage, i.e. horizontal pivoting. Source Table ProdID 1010 Target Table ProdID 1010 1010 Quarter_Sales 123450 234550 Quarter Q1 Q2 Q1_Sales 123450 Q2_Sales 234550
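The horizontal pivot in the quarterly sales example can be sketched in a few lines of Python. This is only an illustration of the row-shaping logic, not of how the Pivot stage is implemented; the function name `horizontal_pivot` and its parameters are hypothetical.

```python
def horizontal_pivot(rows, key, pivot_columns):
    """Turn each wide row into one output row per pivoted column.

    rows          -- list of dicts, one per source row
    key           -- name of the column copied unchanged to every output row
    pivot_columns -- mapping of output label (e.g. "Q1") to source column name
    """
    output = []
    for row in rows:
        for quarter, column in pivot_columns.items():
            output.append(
                {key: row[key], "Quarter_Sales": row[column], "Quarter": quarter}
            )
    return output


# The source row from the example: one product with two quarterly columns.
source = [{"ProdID": 1010, "Q1_Sales": 123450, "Q2_Sales": 234550}]
target = horizontal_pivot(source, "ProdID", {"Q1": "Q1_Sales", "Q2": "Q2_Sales"})
```

Each source row yields as many target rows as there are pivoted columns, which is why one wide row becomes two narrow rows here.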

173. What is the Complex Flat File stage, and in which situations is it used? The CFF stage is used to read files in EBCDIC format, mainly mainframe files with REDEFINES. A complex flat file can be used to read the data at the initial level. Using CFF, we can read ASCII or EBCDIC data. We can select the required columns and omit the rest. We can collect the rejects (badly formatted records) by setting the rejects property to "save" (other options: continue, fail). We can flatten the arrays (COBOL files).

174. What are the main differences between server jobs and parallel jobs in DataStage? In server jobs we have few stages, the work is mainly logic intensive, we use the Transformer for most things, and MPP systems are not used. In parallel jobs we have many stages, the work is stage intensive, there are built-in stages for particular tasks, and MPP systems are used. In server jobs there is no option to process the data on multiple nodes as there is in parallel jobs. In parallel jobs we have the advantage of processing the data in pipelines and by partitioning, whereas server jobs have no such concept. There are many differences in how the same stages are used in server and parallel jobs. For example, in parallel a sequential file (or any other file) can have either an input link or an output link, but in server it can have both (and more than one of each). Server jobs can compile and run within the DataStage server, but parallel jobs compile and run within the DataStage UNIX server. In server jobs, all rows are extracted from the source into the next stage before that stage becomes active and passes the rows on to the target, which is time-consuming. Parallel jobs offer two types of parallelism:
1. Pipeline parallelism: based on performance statistics, some rows are extracted from the source into the next stage while that stage is already active and passing rows on to the target. It maintains only one node between source and target.
2. Partition parallelism: maintains more than one node between source and target.

175. Differentiate between pipeline and partition parallelism. Pipeline parallelism: consider three CPUs connected in series. When data is fed into the first one, it starts processing, and the data is simultaneously transferred to the second CPU, and so on. You can compare this with three sections of pipe: as water enters the pipe, it starts moving through all the sections. Partition parallelism: consider three CPUs connected in parallel and fed with data at the same time, which reduces the load on each and improves efficiency. You can compare this with a single big pipe containing three built-in pipes: as water is fed to them, a large quantity is consumed in less time.

176. How do you read data from Excel files? My data file has some commas in the data, but the delimiter we are using is |. How do you read the data? Explain with steps. 1. Create a DSN for your Excel file by picking the Microsoft Excel Driver. 2. Take ODBC as the source stage. 3. Configure ODBC with the DSN details. 4. While importing metadata for the Excel sheet, make sure you select the System Tables check box. Note: in the Excel sheet, the first line should contain the column names. If the only problem is commas in the Excel file data, we can open it in Access and save the file with a pipe (|) separator. It can then be used as a simple sequential file, with the delimiter changed to | on the Format tab.

177. Disadvantages of a staging area? The disadvantage of a staging area is disk space, as we have to dump the data into a local area. As far as I know, there are no other disadvantages of a staging area.

178. What is meant by performance tuning techniques? Give examples. Performance tuning means taking action to increase the performance of a slowly running job, for example: 1) Use Link Partitioner and Link Collector to speed up performance. 2) Use sorted data for aggregation. 3) Use a sort at the source side and aggregation at the target side. 4) Tune the OCI stage 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects. 5) Do not use an IPC stage at the target side. These points apply only to server jobs, because in Parallel Extender these things are taken care of by the stages.

179. How do you distinguish the surrogate key in different dimension tables? The surrogate key will be the key field in the dimensions.

180. How do you read data from Excel files? Explain with steps. Method 1: Save the file as .csv (comma-separated values). Use a flat file stage on the DataStage job canvas. Double-click the flat file stage and assign the .csv file you saved as the input file. Import the metadata for the file (once you have imported or typed the metadata, click View Data to check the data values). Then do the rest of the transformation as needed. Method 2: Create a new DSN for the Excel driver and choose the workbook from which you want the data. Select the ODBC stage and access the Excel file through it, i.e. import the Excel sheet using the new DSN created for Excel.
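The reason for re-saving the spreadsheet with a pipe separator (question 176) is easy to see in a small sketch outside DataStage: when the delimiter is |, commas inside a field are ordinary data. The example below uses Python's standard csv module on an in-memory string; the column names and sample values are made up for illustration.

```python
import csv
import io


def read_pipe_delimited(text):
    """Parse pipe-delimited text whose first line holds the column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return list(reader)


# Hypothetical export: the Description values contain commas.
sample = (
    "ProdID|Description|Amount\n"
    "1010|Widget, large|123450\n"
    "1011|Gadget, small|234550\n"
)
rows = read_pipe_delimited(sample)
```

The same data saved as a plain comma-separated file would need quoting around the Description values, which is the problem the pipe delimiter avoids.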

181. How can we generate a surrogate key in server/parallel jobs? In parallel jobs we can use the Surrogate Key Generator stage. In server jobs we can use a built-in routine called KeyMgtGetNextValue. You can also generate the surrogate key in the database using a sequence generator.

182. What is an environment variable? An environment variable is a predefined variable that we can use while creating a DataStage job. It can be set at either project level or job level; once set, the variable is available to the project or job. We can also define new environment variables in DataStage Administrator. These variables are used at project or job level to configure the job: for example, we can associate the configuration file (without which you cannot run your job) or increase the sequential or data set read/write buffer. Example: $APT_CONFIG_FILE. There are many environment variables like this. Go to Job Properties, click the Parameters tab and then click "Add Environment Variable" to see most of them.

183. What are the different types of file formats? Comma-delimited .csv files, tab-delimited text files, and .dsx files (the standard DataStage export extension).

184. For what purpose is a stage variable mainly used? A stage variable is a temporary storage variable; if we perform the same calculation repeatedly, we can store the result in a stage variable. Stage variables can be used where you want to store a value from the previous record and compare it with the current record's value using if-then-else conditional statements. For example, if you want to show a comma-separated product list for each manufacturer from the following rows, you can use stage variables.

Input rows:
Manufacturer   Product
GM             Chevvy
GM             GeoMetro
Ford           Focus
Ford           Explorer
Chrysler       Jeep
Chrysler       Pacifica

Output rows:
Manufacturer   Products
GM             Chevvy,GeoMetro
Ford           Focus,Explorer
Chrysler       Jeep,Pacifica
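The previous-record comparison described for stage variables can be sketched in Python as follows. The two local variables play the role of stage variables that keep their value from one row to the next; the function name is hypothetical, and the input is assumed to arrive already grouped by manufacturer. The final flush after the loop is a convenience of this sketch: a real Transformer would need its own way of recognising the last row of a group.

```python
def concatenate_products(rows):
    """Build one comma-separated product list per manufacturer.

    rows -- iterable of (manufacturer, product) pairs, grouped by manufacturer
    """
    output = []
    prev_manufacturer = None  # "stage variable": key of the previous row
    product_list = ""         # "stage variable": products accumulated so far

    for manufacturer, product in rows:
        if manufacturer == prev_manufacturer:
            # Same key as the previous row: extend the running list.
            product_list = product_list + "," + product
        else:
            # Key changed: emit the finished group, then start a new list.
            if prev_manufacturer is not None:
                output.append((prev_manufacturer, product_list))
            product_list = product
        prev_manufacturer = manufacturer

    # Emit the last group once the input is exhausted.
    if prev_manufacturer is not None:
        output.append((prev_manufacturer, product_list))
    return output


input_rows = [
    ("GM", "Chevvy"), ("GM", "GeoMetro"),
    ("Ford", "Focus"), ("Ford", "Explorer"),
    ("Chrysler", "Jeep"), ("Chrysler", "Pacifica"),
]
result = concatenate_products(input_rows)
```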

185. Where can you output data using the Peek stage? In DataStage Director. Look at the DataStage Director log.

186. How can we improve job performance? In many ways. One simple method is to insert an IPC stage between two active stages or two passive stages. There are many techniques for performance tuning; as for the one you asked about, an IPC stage should be inserted between two active stages. Some tips for improving performance in DataStage parallel jobs:
1. Do the right partitioning at the right points in the job, and avoid re-partitioning data as much as possible.
2. Sort the data before aggregating or removing duplicates.
3. Use Transformer and Pivot stages sparingly.
4. Try to develop small, simple jobs rather than huge, complex ones.
5. Study and decide in which circumstances a join or merge should be used and in which a lookup should be used.
Instead of having one job with a fork-join, create two jobs; these two jobs will perform better than the single job in most cases. If you still want to use the fork-join style, use proper partitioning and sorting for the stages. We can use a hash file in a lookup; this indexes the input data on the key column we define and thus improves performance. The Array Size can also be increased in the final table stage.

187. What are the two types of hash files? The two types of hash files are static and dynamic. The most commonly used hash file is the type 32 dynamic hash file, and we use hash files in server jobs. The dynamic hash file is further subdivided into Generic and Specific.

188. What is a job sequence used for? What are batches? What is the difference between job sequences and batches? A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job. A batch is a collection of jobs grouped together to perform a specific task, i.e. it is a special type of job created using DataStage Director that can be scheduled to run at a specific time. Difference between sequences and batches: unlike in sequences, in batches we cannot provide control information.

189. What is integration testing and unit testing in DataStage? Unit testing: in a DataStage scenario, unit testing is the technique of testing individual DataStage jobs for their functionality. Integration testing: when two or more jobs are tested collectively for their functionality, that is called integration testing.

190. How can we improve performance in the Aggregator stage? To improve performance when you use the Aggregator stage, sort the data before you pass it to the Aggregator stage. Select the most appropriate partitioning method based on analysis of the data. Hash partitioning performs well in most cases.

191. Why is a hash file faster than a sequential file or an ODBC stage? A hash file is indexed and works with a hashing algorithm. That is why the search is faster in a hash file.

192. Is it possible to query a hash file? Justify your answer. No, it is not possible to query a hash file. The reason is that it is a back-end file and not a database that can be queried.
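The point made in question 191, that a keyed, hashed lookup is faster than scanning a sequential file, can be sketched with a plain Python dict standing in for the hashed file. This is an analogy only and says nothing about DataStage's actual file format; the function names are hypothetical. The sketch also shows why loading rows by key removes duplicates: in this dict-based stand-in, a later row with the same key simply overwrites the earlier one.

```python
def sequential_lookup(rows, key):
    """Scan every row in order until the key is found (cost grows with file size)."""
    for row_key, value in rows:
        if row_key == key:
            return value
    return None


def build_hashed_lookup(rows):
    """Index the rows by key once; each later lookup is a single hash probe."""
    lookup = {}
    for row_key, value in rows:
        # A repeated key overwrites the earlier entry, so only one row per key survives.
        lookup[row_key] = value
    return lookup


reference_rows = [(10, "Accounting"), (20, "Research"), (10, "Accounting (revised)")]
hashed = build_hashed_lookup(reference_rows)
```

Note that the two approaches disagree on duplicate keys in this sketch: the sequential scan returns the first match, while the dict keeps the last row written.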

193. What does # indicate in environment variables? It is used to identify the parameter.

194. What is the User Variables activity in DataStage? The User Variables activity stage defines variables that are used later in the sequence.

195. What is the alternative way to do job control? Job control is possible through scripting. How it is done depends on the requirements and needs of the job.

196. What is the use of job control? Job control is used for scripting. With the help of scripting, we can set parameters for a called job, execute it, do error handling and similar tasks.

197. What are the different types of star schema? The multi-star schema, or galaxy schema, is one type of star schema.

198. What is the Sequencer stage? Let's say there are two jobs (J1 and J2) on the input links and one job (J3) on the output link of a Sequencer stage in a DataStage job sequence. The Sequencer can be set to "ALL" or "ANY". If it is set to "ALL", the Sequencer triggers the third job (J3) only after both input jobs (J1 and J2) complete. If it is set to "ANY", it waits for either job (J1 or J2) to complete and then triggers the third one.

199. What is the use of Tunables? Tunables is a tab in DataStage Administrator where one can increase or decrease the cache size. It is a project property in DataStage Administrator, where the cache size can be changed to a value between 0 and 999 MB.

200. Which partitioning should be used for the Aggregator stage in parallel jobs? By default this stage uses the Auto mode of partitioning. The best partitioning depends on the operating mode of this stage and of the preceding stage. If the Aggregator is operating in sequential mode, it will first collect the data using the default Auto collection method before writing it to the file. If the Aggregator is in parallel mode, we can choose any type of partitioning from the drop-down list on the Partitioning tab. Generally Auto or Hash can be used.

201. What is fact loading, and how is it done? First you have to run the hash jobs, second the dimension jobs, and lastly the fact jobs.

202. What is the difference between the Merge stage and the Lookup stage? Merge stage: the parallel job stage that combines data sets. Lookup stage: the mainframe processing and parallel active stage that performs table lookups.
Lookup stage: 1. Used to perform lookups. 2. Multiple reference links, a single input (primary) link, a single output link and a single reject link. 3. Large memory usage, because paging is required. 4. Data on input links or reference links need not be sorted.
Merge stage: 1. Combines the sorted master data set with the update data sets. 2. Several reject links and multiple output links can exist. 3. Less memory usage. 4. Data needs to be sorted.

203. How do you run a job using the command line?

dsjob -run -jobstatus projectname jobname

204. Suppose you have a table "sample" with three columns:

Cola   Colb   Colc
1      10     100
2      20     200
3      30     300

Assume cola is the primary key. How will you fetch the record with the maximum cola value into the target system using DataStage? As per the question, the source data is in a table. You can use an OCI stage to read the source. In the OCI stage, write a user-defined SQL query such as "Select Max(cola) from sample", which fetches the maximum value available in the table, then load the data into the target table. Note that this query returns only the maximum key value; to fetch the whole record, a query such as "Select * from sample where cola = (Select Max(cola) from sample)" can be used.

205. To run a job through the command line. The syntax for running DataStage jobs through the command line is given below.

Command syntax:

dsjob [-file <file> <server> | [-server <server>][-user <user>][-password <password>]] <primary command> [<arguments>]

Valid primary command options are: -run -stop -lprojects -ljobs -linvocations -lstages -llinks -projectinfo -jobinfo -stageinfo -linkinfo -lparams -paraminfo -log -logsum -logdetail -lognewest -report -jobid -import

Status code = -9999 DSJE_DSJOB_ERROR

dsjob -run [-mode <NORMAL | RESET | VALIDATE>] [-param <name>=<value>] [-warn <n>] [-rows <n>] [-wait] [-opmetadata <TRUE | FALSE>] [-disableprjhandler] [-disablejobhandler] [-jobstatus] [-userstatus] [-local] [-useid] <project> <job|jobid>

Status code = -9999 DSJE_DSJOB_ERROR
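When dsjob is called from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical Python helper that only builds the list; it does not run anything. It uses a subset of the -run options shown in the usage text above (-mode, -param, -warn, -wait, -jobstatus) and assumes dsjob is on the PATH of the machine where the list is eventually executed. The function name and its parameters are illustrative, not part of DataStage.

```python
def build_dsjob_run_args(project, job, params=None, mode=None,
                         warn_limit=None, wait=False, jobstatus=False):
    """Return the argument list for a 'dsjob -run' invocation (nothing is executed)."""
    args = ["dsjob", "-run"]
    if mode is not None:
        # Modes as listed in the usage text: NORMAL, RESET or VALIDATE.
        if mode not in ("NORMAL", "RESET", "VALIDATE"):
            raise ValueError("mode must be NORMAL, RESET or VALIDATE")
        args += ["-mode", mode]
    # One -param name=value pair per job parameter.
    for name, value in (params or {}).items():
        args += ["-param", f"{name}={value}"]
    if warn_limit is not None:
        args += ["-warn", str(warn_limit)]
    if wait:
        args.append("-wait")
    if jobstatus:
        args.append("-jobstatus")
    # Project and job names always come last.
    args += [project, job]
    return args


# Reproduces the command given in the answer to question 203.
simple = build_dsjob_run_args("projectname", "jobname", jobstatus=True)
```

The resulting list can be handed to `subprocess.run(simple)` on a host where the DataStage client tools are installed; passing a list rather than a single string avoids shell quoting problems with parameter values.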