
The Rollup component allows users to group records on certain field values.

- It is a multi-stage function and contains three mandatory transform functions: Initialize, Rollup, and Finalize.
- To get the count of a particular group, Rollup needs a temporary variable.
- The initialize function is invoked first for each group.
- Rollup is called for each of the records in the group.
- The finalize function is called only once, at the end of the last rollup call.
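As a minimal sketch (assuming records keyed on a field named id, with the group size reported in a hypothetical count output field), the three mandatory functions could look like:

```
type temporary_type =
record
  decimal(8) count;
end;

temp :: initialize(in) =
begin
  temp.count :: 0;    /* start each group's counter at zero */
end;

out :: rollup(temp, in) =
begin
  out.count :: temp.count + 1;    /* increment once per record in the group */
end;

out :: finalize(temp, in) =
begin
  out.id :: in.id;                /* key of the group */
  out.count :: temp.count;        /* final count for the group */
end;
```

The group key itself (id here) is set as the component's key parameter; the transform functions only carry the temporary record from call to call.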

Reformat
Reformat changes the record format of data records by dropping fields, or by using
DML expressions to add fields, combine fields, or transform the data in the records.
By default, Reformat has one output port, but more can be added by incrementing the
value of the count parameter. A separate transform function then has to be written
for each output port.
If any selection from the input port is required, the select parameter can be used
instead of placing a Filter by Expression component before the Reformat.
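As an illustrative sketch (the field names here are assumptions, not from the source), a Reformat transform that combines two fields, transforms another, and drops the originals might look like:

```
out :: reformat(in) =
begin
  out.id   :: in.id;
  out.name :: string_concat(in.first_name, " ", in.last_name);  /* combine fields */
  out.cost :: in.cost * 1.1;                                    /* transform a value */
  /* first_name and last_name are not assigned to out, so they are dropped */
end;
```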
Rollup
Rollup generates data records that summarize groups of data records on the basis
of key specified.
Parts of Rollup
Input select (optional)
Initialize
Temporary variable declaration
Rollup (Computation)
Finalize
Output select (optional)
Input_select: If it is defined, it filters the input records.
Initialize: Rollup passes the first record in each group to the initialize transform
function.
Temporary variable declaration: The initialize transform function creates a temporary
record for the group, with record type temporary_type.
Rollup (Computation): Rollup calls the rollup transform function for each record in a
group, using that record and the temporary record for the group as arguments. The

rollup transform function returns a new temporary record.


Finalize:
If you leave the sorted-input parameter at its default value, "Input must be sorted or grouped":
Rollup calls the finalize transform function after it processes all the input records
in a group.
Rollup passes the temporary record for the group and the last input record in the
group to the finalize transform function.
The finalize transform function produces an output record for the group.
Rollup repeats this procedure with each group.
Output select: If you have defined the output_select transform function, it filters the
output records.
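As a hedged sketch (the cost and cost_sum fields are assumptions for illustration), both selection functions are simple boolean expressions over a record:

```
out :: input_select(in) =
begin
  out :: in.cost > 0;              /* only feed records with a positive cost into the rollup */
end;

out :: output_select(out_rec) =
begin
  out :: out_rec.cost_sum >= 100;  /* keep only groups whose summed cost reaches 100 */
end;
```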
Aggregates
Aggregate generates data records that summarize groups of data records (similar
to Rollup), but it offers less control over the data.
Scan
Scan generates a series of cumulative summary records for groups of data records.
For the same input records as above, the scan transform function generates one output
record per input record (if the input_select and output_select parameters are not specified).
Scan can also be used for multiple functionalities, the same as Rollup.
The main difference between Scan and Rollup is that Scan generates intermediate
(cumulative) results, while Rollup summarizes each group into a single record.
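To make the difference concrete, here is a small worked example (hypothetical data, keyed on id, summing cost):

```
input (id, cost):    Rollup output:    Scan output:
1  10                1  30             1  10
1  20                2   5             1  30
2   5                                  2   5
```

Rollup emits one summary record per group; Scan emits a running total for every input record.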

Sum of two records - in Ab Initio


chandra asked Feb 25, 2010 | Replies (18)

I have a file as follows with 4 records


id code name cost
1 4 xxx 24.25
1 5 yyy 20.00
2 8 aaa 10.00
2 9 bbb 20.00

The output should be


1 4 xxxx 44.25
1 5 yyyy 44.25
2 8 aaaa 30.00
2 9 bbbb 30.00
i.e., only the final column should be summed up for each id, and both the input and
output should be 4 records.
What are all the components I can use to get this result?
Hi,
You can use the Scan component for cumulative addition.
You will need to use Rollup with the key identifier as the "id" column to get the output you want. Scan
will give an aggregated value but only 1 record per key. So your output will become as follows with
Scan:
1 5 yyyy 44.25
2 9 bbbb 30.00
By the way, your input and output files have another difference. The number of characters
increases by 1 letter in the output for the column named "name". Hope that was unintentional :)
I guess Scan will not give only 1 record per group; it provides the running total.

this might help:


1. Sort the data based on the first field (id).
2. Replicate the data.
a. Flow 1: connect to a Rollup component with ID as the key field and sum up the cost using the sum(cost)
function.
Connect the out port of Rollup to a Join component (in0).

b. Flow 2: connect Flow 2 to the Join component (in1).

3. Join the above two flows using ID as the key.
1. input file to ROLLUP
2. input file to JOIN
3. out port of ROLLUP to JOIN
Solution 1:
Rollup: Build a vector as below under the rollup function, using the vector_append function.
[ 1 4 xxx 24.25
1 5 yyy 20.00 ]
[ 2 8 aaa 10.00
2 9 bbb 20.00 ]
In the finalize, just do the sum and write the vector to out again. So your output vector would be as below.
[1 4 xxxx 44.25
1 5 yyyy 44.25]
[ 2 8 aaaa 30.00
2 9 bbbb 30.00]
INPUTFILE--> SORT(KEY=ID)--- > SCAN(KEY = ID)-- > OUTPUTFILE
In SCAN COMPONENT put below code:
type temporary_type =
record
decimal(8.2) total_cost;
end;
temp:: initialize(in) =
begin
temp.total_cost :: 0;
end;
out:: scan(temp,in) =
begin
out.total_cost ::
temp.total_cost+in.cost;
end;

out:: finalize(temp,in) =
begin
out.id :: in.id;
out.code :: in.code;
out.name :: in.name;
out.cost :: temp.total_cost; /* the running total lives in the temporary record, not in the input record */
end;
Kiran's approach is working fine.
Let the input data be:
1,11,vvv,26
1,12,hhhhh,25
2,10,jjjj,10
2,17,ssasa,27
2,18,jjjjjjj,29
3,21,sdsds,34
3,23,kkkk,45
1,15,fgfdg,67
Approach with the coding part:
*Step 1: Rollup*
Code for the Rollup component (for summing the cost of the same ID):
type temporary_type = record
decimal("\r\n") sum_tmp_0; /**GENERATED*'sum(in.cost)'*/
end;

out::initialize(in) =
begin
out.sum_tmp_0 :: 0; /**GENERATED*'sum(in.cost)'*/
end;

out::rollup(tmp, in) =
begin
out.sum_tmp_0 :: tmp.sum_tmp_0 + (in.cost); /**GENERATED*'sum(in.cost)'*/
end;

/**/

out::finalize(tmp, in) =
begin
out.id :: in.id; /**GENERATED*'in.id'* */
out.cost_sum :: (tmp.sum_tmp_0); /**GENERATED*'sum(in.cost)'* */
end;
*Sort*
The sort of the Rollup output is done on id.
==================================
*Inner join*
Code for the Join (inner join), joining the output of the Rollup and the input on
id:
out::join(in0, in1) =
begin
out.id :: in0.id;
out.code :: in0.code;
out.name :: in0.name;
out.cost :: in0.cost;
out.cost_sum :: in1.cost_sum;
end;
The output of the inner join will give you the desired output.
Dynamic Script Generation is the latest buzz in the Ab Initio world and one of its finest features. It comes
with lots of other advantages which were not there in earlier versions of the Ab Initio Co>Operating
System. It is available in Co>Operating System version 2.14.46 and above.
This feature enables the use of Ab Initio PDL (Parameter Definition Language) and Component
Folding.

Now if we enable this feature by changing the script generation method to Dynamic in Run Settings, we
will be able to run a graph from the command prompt without a deployed script. Here the .mp file itself works as
an executable file, so you don't need to check a ksh into the EME run directory anymore. On the production
server, once we run the .mp file using the air sandbox run command, it generates a reduced script on the fly,
which contains the commands to set up the host environment.
Unlike the earlier .mp file, a Dynamic Script Generation (DSG) enabled graph (.mp) file is a text file. You can
open and view its content in an editor.
Component Folding: It is a feature of the Co>Operating System that combines a group of components
and runs them as a single process.

Prerequisites of Component Folding:

The graph has to be DSG enabled. The components must be foldable. They must be in the same
phase and layout. The components must be connected by a straight flow.
Now the question - does this improve performance? Yes, in most cases it will bring a significant
performance boost over the traditional approach of execution.
How it works (advantages):
1. When this feature is enabled by checking the folding option in Run Settings, the Co>Operating System
runtime folds all the foldable components into a single process. As a result, the number of
processes is reduced when a graph executes. On any system, every process has overheads: forking the
new process, scheduling, memory consumption, etc. These overheads vary from OS to OS. In some
systems, like Mainframe MVS, creation and maintenance of processes are very costly compared to
the different flavors of UNIX.
2. Another major benefit of component folding is the reduction of DML interpretation time between
processes, because the graph ends up with multitool folded processes communicating with other multitool
or unitool processes.
3. Apart from that, an increase in the number of processes results in higher interprocess communication. Data
movement between two or more processes consumes not only time but memory too. In a CFG
(Continuous Flow Graph) interprocess communication is always very high, so it is worth enabling
component folding in a CFG.
Disadvantages of Component Folding:
1. Pipeline parallelism: As component folding folds different components into a single process, it hinders
the pipeline parallelism of Ab Initio. Suppose the flow of our graph is Input File -> Filter by Expression ->
Reformat -> Output File. In the traditional method, with the help of pipeline parallelism, the FBE and Reformat
execute concurrently. But when these two components are folded together, there is no chance of parallel
execution.
2. Address space: In a 32-bit OS, the maximum address space for a process is 4 GB. So if we combine
4 different components into a single process by folding, the OS will allow only 4 GB of address space for all 4,
instead of a total of 4x4 = 16 GB (maximum) for 4 independent processes. So we should avoid
folding components where memory use is very high, as in the case of in-memory Rollup, Join, and
Reformat with lookup. Some components, like Sort and in-memory Join, cause internal buffering of data.
Combining them in a single process will result in writing to disk (higher I/O).

Set the AB_MULTITOOL_MAXCORE variable to limit the maximum allowable memory for the folded component
group.
Excluding a component from Component Folding:
Sometimes you may wish to prevent components from being folded, to allow pipeline parallelism or to
give a component more address space. Then you need to exclude those components from being folded.
Set the AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES configuration variable to the space-separated mpnames of
the components, in your $HOME/.abinitiorc or the system-wide $AB_HOME/config/abinitiorc file, e.g.
export AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES="hash-rollup reformat-transform"
The other way to prevent two components from being folded is to right-click on the flow between them
and uncheck the Allow Component Folding option.
Everything has its cost, so it is always worth benchmarking before taking a decision. Prevent or allow
component folding for the components of your graph, and tune it for the highest performance.
CPU tracking report of folded components in a graph:
To report the execution details of a folded graph on the console, we need to override the AB_REPORT variable
with the show-folding option, e.g. AB_REPORT="show-folding flows times interval=180 scroll=true spillage
totals file-percentages".
The folded components are displayed as a multitool process in the CPU tracking information. The CPU time for
a folded component is shown twice: once for the component itself and again as part of the multitool process.

Parameter Definition Language (PDL):

PDL is used to embed logic for inline computation in a parameter value. It provides high flexibility in terms of
interpretation. It supports both $ and ${} substitution. For this you need to set the parameter interpretation
to PDL and write the DML expression within $[ ]. The execution time of PDL is generally shorter than that of
traditional shell-interpreted parameters. For all new development, PDL is highly recommended to replace
crude shell scripting on parameters as much as possible.
The major drawback of shell-interpreted parameters is the lack of support in EME dependency analysis. But
the EME understands PDL, since it is the native language for Ab Initio. Also there is no overhead of
invoking a shell for every parameter evaluation, which can significantly increase graph pre-processing time.
PDL also comes with numerous metaprogramming functions, like add_field, make_transform, and add_rule,
which help in handling metadata and transformation rules at run time. The definition and usage of these
functions are well documented in the online help. We can use the majority of the Ab Initio DML
functions as well. I would recommend that starters look at the metaprogramming section, then play with
the parameters editor.

Some examples of PDL: Suppose in a graph we have a conditional component which runs based on the
existence of a file called emp.dat.

Now the FILE_NAME parameter is defined as /home/xyz/emp.dat, and a conditional parameter called EXIST
is defined as
$[if (file_information($FILE_NAME).found) 1 else 0]
We can define types and transform functions for use in PDL with the help of the parameter AB_DML_DEFS.
e.g. Suppose AB_DML_DEFS is defined as
out :: sqrt(in) = begin out :: math_sqrt(in); end;
Now a parameter called SQRT is defined as $[sqrt(16)]
The resolved value of this parameter will be 4.
Ensure dynamic script generation is checked in your host run settings, and read the 2.14 patchset
notes for a description of any hints.
