Top Answers to DataStage Interview Questions

1. Mention DataStage characteristics.


Criteria                       Characteristics
Support for Big Data Hadoop    Access to Big Data on a distributed file system, JSON support, and JDBC integrator
Ease of use                    Improved speed, flexibility, and efficacy for data integration
Deployment                     On-premises or cloud, as the need dictates

2. What is IBM DataStage?

DataStage is an extract, transform, and load (ETL) tool that is part of the IBM
InfoSphere suite. It is used for working with large data warehouses and data marts,
and for creating and maintaining a data repository.

3. How is a DataStage source file filled?


We can fill a source file in DataStage either by developing a SQL query or by using
the Row Generator stage.

4. How is merging done in DataStage?


In DataStage, merging is done when two or more tables need to be combined based on
their primary key column.
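
The idea of combining tables on a primary key can be sketched in plain Python. This is only an illustration of the concept, not DataStage code: the tables, the column names, and the `merge_on_key` helper are hypothetical, and the sketch keeps only rows whose key appears in both tables.

```python
# Hypothetical sample tables; each row is a dict keyed by column name.
customers = [
    {"cust_id": 1, "name": "Asha"},
    {"cust_id": 2, "name": "Ben"},
]
orders = [
    {"cust_id": 1, "total": 250},
    {"cust_id": 3, "total": 90},
]


def merge_on_key(master, update, key):
    """Combine rows from two tables that share the same key value."""
    # Index the second table by its key so each match is a single dict access.
    update_by_key = {row[key]: row for row in update}
    merged = []
    for row in master:
        match = update_by_key.get(row[key])
        if match is not None:
            # Columns from both rows end up in one output row.
            merged.append({**row, **match})
    return merged


merged_rows = merge_on_key(customers, orders, "cust_id")
print(merged_rows)
```

Only customer 1 appears in the result, because it is the only key present in both tables.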

5. What are data and descriptor files?


The two files serve different purposes in DataStage. A descriptor file contains the
descriptive information, while a data file contains only the data.

6. How is DataStage different from Informatica?


DataStage and Informatica are both powerful ETL tools, but there are a few
differences between the two. DataStage has parallelism and partition concepts for
node configuration, whereas Informatica has no support for parallelism in node
configuration. DataStage is also simpler to use than Informatica.

7. What is a routine in DataStage?


A routine is a collection of functions defined through DataStage Manager. There are
three types of routines in DataStage: the job control routine, the before/after
subroutine, and the transform function.

8. What is the process for removing duplicates in DataStage?


Duplicates in DataStage can be removed using the sort function. When running the
sort, we set the option that allows duplicates to false.
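
As a rough illustration of sorting with duplicates disallowed, here is a minimal Python sketch. It is not DataStage code: the sample rows, the `sort_without_duplicates` helper, and its `allow_duplicates` parameter are hypothetical names chosen to mirror the option described above.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input rows containing repeated keys.
rows = [
    {"id": 2, "city": "Pune"},
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Pune"},
    {"id": 1, "city": "Oslo"},
]


def sort_without_duplicates(records, key, allow_duplicates=False):
    """Sort records on a key; optionally keep one row per key value."""
    ordered = sorted(records, key=itemgetter(key))
    if allow_duplicates:
        return ordered
    # After sorting, equal keys are adjacent, so keeping the first row of
    # each group removes the duplicates.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


unique_rows = sort_without_duplicates(rows, "id")
print(unique_rows)
```

Sorting first matters here: duplicates are only detected when rows with the same key sit next to each other.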

9. What is the difference between join, merge, and lookup stages?


The fundamental difference between these three stages is the amount of memory they
use. They also differ in their input requirements and in how they treat the various
records. In terms of memory usage, the lookup stage holds its reference data in
memory, so it uses a large amount of memory when the reference data is large. The
join and merge stages use much less memory.
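
The memory trade-off can be illustrated with two generic techniques: an in-memory hash lookup and a join over pre-sorted inputs. This is a conceptual Python sketch, not how DataStage implements its stages. The function names and sample data are hypothetical, and the sorted join assumes both inputs are sorted on the key with no repeated keys.

```python
def lookup_join(stream, reference, key):
    """Lookup style: the whole reference table is loaded into memory first."""
    ref_by_key = {row[key]: row for row in reference}
    for row in stream:
        match = ref_by_key.get(row[key])
        if match is not None:
            yield {**row, **match}


def sorted_merge_join(left, right, key):
    """Join/merge style: both inputs are pre-sorted and read in step,
    so only the current row of each input is held at any time."""
    left_iter, right_iter = iter(left), iter(right)
    l_row, r_row = next(left_iter, None), next(right_iter, None)
    while l_row is not None and r_row is not None:
        if l_row[key] == r_row[key]:
            yield {**l_row, **r_row}
            l_row, r_row = next(left_iter, None), next(right_iter, None)
        elif l_row[key] < r_row[key]:
            l_row = next(left_iter, None)
        else:
            r_row = next(right_iter, None)


# Hypothetical inputs, both already sorted on "sku".
sales = [{"sku": "A", "qty": 3}, {"sku": "B", "qty": 1}, {"sku": "D", "qty": 7}]
products = [{"sku": "A", "name": "bolt"}, {"sku": "C", "name": "nut"},
            {"sku": "D", "name": "gear"}]

via_lookup = list(lookup_join(sales, products, "sku"))
via_join = list(sorted_merge_join(sales, products, "sku"))
print(via_lookup == via_join)
```

Both approaches produce the same matched rows. The difference is what they need up front: memory for the reference table in the first case, sorted inputs in the second.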

10. What is QualityStage in DataStage?

QualityStage is used for cleansing data alongside the DataStage tool. It is a
client-server software tool that is provided as part of the IBM Information Server.

11. What is job control in DataStage?


Job control is used for controlling a job or executing multiple jobs in parallel.
It is implemented using the Job Control Language within the IBM DataStage tool.

12. How do we tune the performance of DataStage jobs?


First, we select the right configuration file. Then, we select the right partition
and buffer memory. We also need to handle the sorting of data and null values
carefully. Where possible, we use the modify, copy, or filter stage instead of the
transformer, and we reduce the propagation of unnecessary metadata between stages.

13. What is a repository table in DataStage?


The term "repository" is another name for a data warehouse. It can be centralized
or distributed. The repository table is used for answering ad hoc, historical,
analytical, or complex queries.

14. Compare massively parallel processing with symmetric multiprocessing.


In massively parallel processing (MPP), many computers are present in the same
chassis, whereas in symmetric multiprocessing (SMP), many processors share the same
hardware resources. MPP is called "shared nothing" because nothing is shared
between the computers. It is also faster than symmetric multiprocessing.

15. How can we kill a DataStage job?


To kill a DataStage job, we first need to kill its individual process ID. This
ensures that the DataStage job is killed.

16. How do we compare the Validated OK with the Compiled Process in DataStage?

The Compiled Process ensures that the important stage parameters are mapped
correctly, so that an executable job is created. With Validated OK, we make sure
that the connections are valid.

17. Explain the feature of data type conversion in DataStage.


To convert data types in DataStage, we can use the data conversion function. For
the conversion to execute successfully, the input to and the output from the
operator must be the same, and the record schema must be compatible with the
operator.

18. What is the significance of the exception activity in DataStage?


Whenever an unexpected error occurs while the job sequencer is executing, all the
stages after the exception activity are run. This is what makes the exception
activity so important in DataStage.

19. What are the various types of lookups in DataStage?


There are different types of lookups in DataStage. These include normal, sparse,
range, and caseless lookups.
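
The four lookup styles can be contrasted with a small Python sketch. This only illustrates the general ideas and is not DataStage code: all names and data are hypothetical, and the sparse variant assumes that "sparse" means querying the database once per input row instead of loading the reference data into memory. An in-memory SQLite database stands in for the reference database.

```python
import bisect
import sqlite3

# Hypothetical reference data, keyed by upper-case country code.
countries = {"IN": "India", "NO": "Norway"}


def normal_lookup(code):
    """Exact match against reference data held in memory."""
    return countries.get(code)


def caseless_lookup(code):
    """Same as normal_lookup, but ignores the case of the input key."""
    return countries.get(code.upper())


# Range lookup: each band starts at the given lower bound (sorted ascending).
band_starts = [0, 100, 1000]
band_names = ["small", "medium", "large"]


def range_lookup(amount):
    """Find the band whose lower bound is the largest one <= amount."""
    index = bisect.bisect_right(band_starts, amount) - 1
    return band_names[index] if index >= 0 else None


# Sparse lookup: the reference data stays in the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO country VALUES (?, ?)", countries.items())


def sparse_lookup(code):
    """Issue one query per input row instead of caching the table."""
    row = conn.execute(
        "SELECT name FROM country WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else None


print(normal_lookup("in"), caseless_lookup("in"),
      range_lookup(250), sparse_lookup("NO"))
```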

20. When do we use a parallel job and a server job?

Choosing between a parallel job and a server job depends on the processing need,
functionality, time to implement, and cost. The server job usually runs on a single
node; it executes on a DataStage Server Engine and handles small volumes of data.
The parallel job runs on multiple nodes; it executes on a DataStage Parallel Engine
and handles large volumes of data.
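
To give a feel for how work can be spread across multiple nodes, here is a minimal Python sketch of hash partitioning, one common way to split rows for parallel processing. It is a generic illustration and not DataStage's implementation: the `hash_partition` helper, the sample rows, and the use of a CRC32 checksum as the hash are all assumptions.

```python
import zlib


def hash_partition(rows, key, node_count):
    """Assign each row to a partition based on a hash of its key value."""
    partitions = [[] for _ in range(node_count)]
    for row in rows:
        # CRC32 is used only because it is deterministic across runs.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % node_count].append(row)
    return partitions


# Hypothetical input: eight rows spread over four customer IDs.
rows = [{"cust_id": n % 4, "amount": n * 10} for n in range(8)]
partitions = hash_partition(rows, "cust_id", node_count=2)

for node, part in enumerate(partitions):
    print(node, sorted({row["cust_id"] for row in part}))
```

Rows with the same key always land in the same partition, so each node can process its share of the keys independently.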

21. What is Usage Analysis in DataStage?


To check whether a certain job is part of a sequence, we right-click the job in the
Manager and then choose Usage Analysis.

22. How to find the number of rows in a sequential file?


To count the number of rows in a sequential file, we can use the @INROWNUM system
variable.
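
The idea behind a running input-row counter can be sketched in Python. This is only an analogy, not DataStage code: the in-memory file, the `count_rows` helper, and the assumption that the file has a header line are all hypothetical.

```python
import io

# Hypothetical stand-in for a sequential file with a header line.
sequential_file = io.StringIO("id,name\n1,Asha\n2,Ben\n3,Chen\n")


def count_rows(handle, has_header=True):
    """Number each data row as it is read; the last number is the row count."""
    if has_header:
        next(handle, None)  # skip the header line if present
    in_row_num = 0  # stays 0 when the file has no data rows
    for in_row_num, _line in enumerate(handle, start=1):
        pass
    return in_row_num


row_count = count_rows(sequential_file)
print(row_count)
```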

23. What is the difference between a sequential file and a hash file?

The hash file is based on a hash algorithm, and it can be used with a key value.
The sequential file, on the other hand, does not have any key-value column. The
hash file can be used as a reference for a lookup, while a sequential file cannot
be used for a lookup. Due to the presence of the hash key, the hash file is easier
to search than a sequential file.

24. How do we clean a DataStage repository?


To clean a DataStage repository, we go to DataStage Manager > Job in the menu bar >
Clean Up Resources.

To also remove the logs, we go to the respective jobs and clean up their log files.

25. How do we call a routine in DataStage?


Routines are stored in the Routine branch of the DataStage repository, where we can
create, view, or edit them. A routine in DataStage can be one of the following: a
job control routine, a before/after subroutine, or a transform function.

26. What is the difference between an Operational Data Store and a Data Warehouse?

An Operational Data Store (ODS) can be considered a staging area for real-time
analysis and user processing; thus, it is a temporary repository. The data
warehouse, by contrast, is used for long-term data storage and holds the complete
data of the entire business.

27. What does NLS mean in DataStage?


NLS means National Language Support. This means we can use the IBM DataStage tool
in various languages, including multi-byte character languages such as Chinese and
Japanese. We can read and write data in any language and process it as required.
