Reformat
Reformat changes the record format of data records by dropping fields, or by using
DML expressions to add fields, combine fields, or transform the data in the records.
By default, Reformat has one output port, but additional output ports can be added
by increasing the value of the count parameter. A separate transform function must
then be written for each output port.
If records need to be selected from the input, the select parameter can be used
instead of placing a Filter by Expression component before the Reformat.
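As a rough illustration (plain Python, not DML; the field names and the derived-field rule are invented for the example), a Reformat with a select expression behaves like a filter followed by a per-record transform:

```python
# Illustrative Python analogue of a Reformat with a select expression:
# records failing the select test are skipped, the rest are transformed.

def reformat(records, select, transform):
    """Apply `select` as a filter, then `transform` to each surviving record."""
    return [transform(rec) for rec in records if select(rec)]

records = [{"id": 1, "cost": 26}, {"id": 2, "cost": 10}]

# select: keep records with cost > 20; transform: add a derived field
out = reformat(
    records,
    select=lambda r: r["cost"] > 20,
    transform=lambda r: {**r, "cost_with_tax": round(r["cost"] * 1.1, 2)},
)
```

A Reformat with count > 1 would simply run one such (select, transform) pair per output port over the same input.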
Rollup
Rollup generates data records that summarize groups of data records on the basis
of a specified key.
Parts of Rollup
Input select (optional)
Initialize
Temporary variable declaration
Rollup (Computation)
Finalize
Output select (optional)
Input_select: if defined, it filters the input records.
Initialize: Rollup passes the first record of each group to the initialize transform
function.
Temporary variable declaration: the initialize transform function creates a temporary
record for the group, with record type temporary_type.
Rollup (Computation): Rollup calls the rollup transform function for each record in a
group, using that record and the group's temporary record as arguments; the rollup
transform function updates the temporary record.
Finalize: Rollup passes the final temporary record and the last input record of the
group to the finalize transform function, which produces the output record for the
group. For example:
out:: finalize(temp,in) =
begin
out.id :: in.id;
out.code :: in.code;
out.name :: in.name;
out.cost :: temp.total_cost;
end;
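The phases above can be sketched in Python (an illustrative analogue of Rollup on key-sorted input, with made-up field names; it is not the component's actual implementation):

```python
def rollup(records, key_field, cost_field):
    """Sketch of Rollup's phases: initialize a temporary record per group,
    update it for each record (rollup), and emit a result (finalize)."""
    out = []
    tmp = None                  # the group's temporary record (the running sum)
    current_key = object()      # sentinel so the first record starts a new group
    last = None
    for rec in records:         # input is assumed sorted on the key
        if rec[key_field] != current_key:
            if tmp is not None:
                # finalize: emit one output record for the finished group
                out.append({key_field: last[key_field], "cost_sum": tmp})
            tmp = 0             # initialize: fresh temporary record
            current_key = rec[key_field]
        tmp += rec[cost_field]  # rollup (computation): update the temporary record
        last = rec
    if tmp is not None:
        out.append({key_field: last[key_field], "cost_sum": tmp})  # finalize last group
    return out
```

A real Rollup also supports input_select and output_select filters around these phases; they are omitted here for brevity.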
Kiran's approach is working fine.
Let the input data be:
1,11,vvv,26
1,12,hhhhh,25
2,10,jjjj,10
2,17,ssasa,27
2,18,jjjjjjj,29
3,21,sdsds,34
3,23,kkkk,45
1,15,fgfdg,67
Approach, with code:
*Step 1: rollup*
Code for the Rollup component (summing the cost of records with the same ID):
type temporary_type = record
decimal("\r\n") sum_tmp_0; /**GENERATED*'sum(in.cost)'*/
end;
out::initialize(in) =
begin
out.sum_tmp_0 :: 0; /**GENERATED*'sum(in.cost)'*/
end;
out::rollup(tmp, in) =
begin
out.sum_tmp_0 :: tmp.sum_tmp_0 + (in.cost); /**GENERATED*'sum(in.cost)'*/
end;
/**/
out::finalize(tmp, in) =
begin
out.id :: in.id; /**GENERATED*'in.id'* */
out.cost_sum :: (tmp.sum_tmp_0); /**GENERATED*'sum(in.cost)'* */
end;
*Step 2: sort*
The Rollup output is sorted on id.
==================================
*Step 3: inner join*
Code for the Join component (inner join), joining the original input (in0) and the
Rollup output (in1) on id:
out::join(in0, in1) =
begin
out.id :: in0.id;
out.code :: in0.code;
out.name :: in0.name;
out.cost :: in0.cost;
out.cost_sum :: in1.cost_sum;
end;
The output of the inner join gives the desired result.
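As a cross-check, the whole flow above (Rollup summing cost per id, then an inner join back to the input) can be simulated on the sample data; this is an illustrative Python sketch, not Ab Initio code:

```python
records = [  # (id, code, name, cost) -- the sample input above
    (1, 11, "vvv", 26), (1, 12, "hhhhh", 25), (2, 10, "jjjj", 10),
    (2, 17, "ssasa", 27), (2, 18, "jjjjjjj", 29), (3, 21, "sdsds", 34),
    (3, 23, "kkkk", 45), (1, 15, "fgfdg", 67),
]

# Rollup: sum of cost per id
cost_sum = {}
for rid, _code, _name, cost in records:
    cost_sum[rid] = cost_sum.get(rid, 0) + cost

# Inner join on id: every input record picks up its group's cost_sum
joined = [(rid, code, name, cost, cost_sum[rid])
          for rid, code, name, cost in records]
```

Each input record comes out with one extra field, cost_sum, holding the total cost for its id.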
Dynamic Script Generation is the latest buzz in the Ab Initio world, and one of its finest
features. It comes with many advantages that were not available in earlier versions of the
Ab Initio Co>Operating System. It is available in Co>Operating System version 2.14.46 and above.
This feature typically enables the use of Ab Initio PDL (Parameter Definition Language) and Component
Folding.
Now if we enable this feature by changing the script generation method to Dynamic in the
Run Settings, we can run a graph from the command prompt without a deployed script. The .mp
file itself works as an executable, so you no longer need to check a ksh script into the EME
run directory. On the production server, when we run the .mp file with the air sandbox run
command, it generates a reduced script on the fly that contains the commands to set up the
host environment.
Unlike earlier .mp files, a Dynamic Script Generation (DSG) enabled graph (.mp) file is a
text file; you can open and view its content in an editor.
Component Folding: a feature of the Co>Operating System that combines a group of components
into a single process. The graph has to be DSG-enabled, and the components must be foldable:
they must be in the same phase and layout, and they must be connected by a straight flow.
Now the question: does this improve performance? Yes, in most cases it brings a significant
performance boost over the traditional approach to execution.
How it works (Advantages):
1. When this feature is enabled by checking the folding option in the Run Settings, the
Co>Operating System runtime folds all foldable components into a single process. As a result,
fewer processes run when a graph executes. On any system, every process carries overheads:
forking, scheduling, memory consumption, and so on. These overheads vary from OS to OS; on
some systems, such as Mainframe MVS, creating and maintaining processes is very costly
compared to the various flavors of UNIX.
2. Another major benefit of component folding is the reduction of DML interpretation time
between processes, because the graph ends up with folded multitool processes communicating
with other multitool or unitool processes.
3. Apart from that, an increase in the number of processes results in more interprocess
communication. Data movement between two or more processes consumes not only time but memory
too. In a CFG (Continuous Flow Graph), interprocess communication is always very high, so it
is worth enabling Component Folding in a CFG.
Disadvantages of Component Folding:
1. Pipeline Parallelism: Because component folding combines different components into a
single process, it hinders Ab Initio's pipeline parallelism. Suppose the flow of our graph is
Input File -> Filter by Expression -> Reformat -> Output File. In the traditional method, the
Filter by Expression and the Reformat execute concurrently with the help of pipeline
parallelism; once these two components are folded together, there is no chance of parallel
execution.
2. Address Space: On a 32-bit OS, the maximum address space for a process is 4 GB. So if we
fold 4 different components into a single process, the OS allows only 4 GB of address space
for all 4, instead of the 4 x 4 = 16 GB (maximum) available to 4 independent processes. We
should therefore avoid folding components whose memory use is very high, as with an in-memory
Rollup, Join, or Reformat with a lookup. Some components, such as Sort and in-memory Join,
buffer data internally; combining them in a single process can force writes to disk
(higher I/O).
Set the AB_MULTITOOL_MAXCORE variable to limit the maximum memory allowed for a folded
component group.
Excluding a component from Component Folding:
Sometimes you may wish to prevent components from being folded, to allow pipeline parallelism
or to give a component more address space. To exclude components from folding, set the
AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES configuration variable to the space-separated mpnames of
the components, in your $HOME/.abinitiorc or the system-wide $AB_HOME/config/abinitiorc
file, e.g.
export AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES="hash-rollup reformat-transform"
The other way to prevent two components from being folded is to right-click the flow between
them and uncheck the Allow Component Folding option.
Everything has its cost, so it is always worth benchmarking before making a decision. Prevent
and allow component folding for the components of your graph, and tune it for the highest
performance.
CPU tracking report of folded components in a graph:
To report the execution details of a folded graph on the console, override the AB_REPORT
variable with the show-folding option, e.g.
AB_REPORT="show-folding flows times interval=180 scroll=true spillage totals file-percentages"
The folded components are displayed as a multitool process in the CPU tracking information.
The CPU time for a folded component is shown twice: once for the component itself and again
as part of the multitool process.
PDL is used to put logic for inline computation into parameter values. It provides high
flexibility in terms of interpretation, and it supports both $ and ${} substitution. To use
it, set the parameter interpretation to PDL and write the DML expression within $[ ]. The
execution time of PDL is typically shorter than that of traditional shell-interpreted
parameters. For all new development, PDL is highly recommended to replace crude shell
scripting on parameters as much as possible.
The major drawback of shell-interpreted parameters is the lack of support for EME dependency
analysis, whereas the EME understands PDL, since PDL is native to Ab Initio. There is also no
overhead of invoking a shell for every parameter evaluation, an overhead that can
significantly increase graph pre-processing time. PDL also comes with numerous
metaprogramming functions, such as add_field, make_transform, and add_rule, which help
handle metadata and transformation rules at run time. The definitions and uses of these
functions are well covered in the online help. We can use the majority of the Ab Initio DML
functions as well. I would recommend that starters look at the metaprogramming section first,
then play with the parameters editor.
Some examples of PDL:
Suppose a graph has a conditional component that runs based on the existence of a file called
emp.dat. The FILE_NAME parameter is defined as /home/xyz/emp.dat, and a conditional parameter
called EXIST is defined as
$[if (file_information($FILE_NAME).found) 1 else 0]
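For illustration only, the EXIST parameter's logic corresponds to a file-existence test like the following Python sketch (file_information is the Ab Initio DML function; os.path.exists is just an analogue):

```python
import os

def exist_flag(file_name):
    """Return 1 if the file exists, else 0 -- the analogue of the EXIST parameter."""
    return 1 if os.path.exists(file_name) else 0
```

The conditional component then runs only when the resolved value is 1.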
We can define a parameter with a type and a transform function with the help of the
AB_DML_DEFS parameter. For example, suppose AB_DML_DEFS is defined as
out :: sqrt(in) = begin out :: math_sqrt(in); end;
Now a parameter called SQRT is defined as $[sqrt(16)].
The resolved value of this parameter will be 4.
Finally, ensure that dynamic script generation is checked in your host run settings, and read
the 2.14 patchset notes for further details.