
This user guide explains how to use Talend Open Studio, Talend Integration,
Talend Profiling and Talend MDM. It includes examples for the components it
describes.

Talend
Presentation on Talend MDM

Bhushan Maindarkar.

Talend MDM User Guide.


Fidel Technologies Pvt Ltd

Table of Contents
1. General Information
    1.1. What is ETL
    1.2. What is Talend
    1.3. What is Talend Open Studio
2. Installation
    2.1. Hardware requirement
    2.2. Software requirement
    2.3. Download
    2.4. Configure the memory settings
    2.5. Launch the Studio
3. Talend Integration
    3.1. Create New Project
    3.2. Delete Project
    3.3. Getting started with a basic Job: creating a Job
    3.4. Workspace window
    3.5. Add components to the Job
    3.6. List of components
    3.7. Connect the components together
    3.8. Connect components using drag and drop method
    3.9. Configuring the components
    3.10. Execute Job
    3.11. Custom code components
        3.11.1. tJava component
        3.11.2. tJavaRow component
        3.11.3. tJavaFlex component
        3.11.4. tLibraryLoad component
        3.11.5. tSetGlobalVar component
    3.12. Connection components
        3.12.1. tMysqlInput component
        3.12.2. tMysqlOutput component
        3.12.3. tMysqlConnection component
    3.13. Data Quality components
        3.13.1. tAddCRCRow component
        3.13.2. tReplaceList component
        3.13.3. tReplace component
        3.13.4. tUniqRow component
    3.14. Processing components
        3.14.1. tAggregateRow component
        3.14.2. tFilterRow component
        3.14.3. tSortRow component
        3.14.4. tAggregateSortedRow component
        3.14.5. tMap component
        3.14.6. tSampleRow component
        3.14.7. tXMLMap component
    3.15. Internet components
        3.15.1. tHttpRequest component
        3.15.2. tRest component
        3.15.3. tExtractJSONField component
        3.15.4. tUnite component
        3.15.5. tReplicate component
4. Talend Profiling
    4.1. Create a connection into Profiling
    4.2. Database Analysis
    4.3. Column Analysis
        4.3.1. Add patterns to the analyzed columns
    4.4. Duplication Analysis
5. Talend MDM - Master Data Management
    5.1. Functional architecture of Talend MDM
    5.2. Creating a data model
        5.2.1. Creating business entities in the data model
        5.2.2. Modeling
    5.3. Add server Location
    5.3. Data Container
    5.4. Create a view
    5.5. Deploy a model
    5.6. Web GUI


1. General Information

1.1. What is ETL


ETL, which stands for "extract, transform and load," is the set of functions
combined into one tool or solution that enables companies to "extract" data from
numerous databases, applications and systems, "transform" it as appropriate, and
"load" it into another database, a data mart or a data warehouse for analysis, or
send it along to another operational system to support a business process.
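
The three ETL stages can be illustrated with a minimal Java sketch. It is not
Talend code: the class name EtlSketch, the comma-separated input lines and the
"id|NAME" output format are assumptions chosen only for illustration, and an
in-memory list stands in for the target database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of extract, transform and load; not part of Talend.
class EtlSketch {

    // Runs the three stages over comma-separated "id, name" lines.
    static List<String> run(List<String> sourceLines) {
        List<String> target = new ArrayList<>();   // stands in for the target database
        for (String line : sourceLines) {
            // Extract: read the raw fields from the source line.
            String[] fields = line.split(",");
            String id = fields[0].trim();
            // Transform: normalise the name to upper case.
            String name = fields[1].trim().toUpperCase(Locale.ROOT);
            // Load: write the transformed record to the target.
            target.add(id + "|" + name);
        }
        return target;
    }

    public static void main(String[] args) {
        // Prints [1|ALICE, 2|BOB]
        System.out.println(run(Arrays.asList("1, alice", "2, bob")));
    }
}
```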

1.2. What is Talend


Talend offers an enterprise-class integration solution that allows users to natively
connect databases, flat files, and cloud-based applications. The company
provides graphical drag-and-drop tools, test creation, and code generation in
numerous languages.
Features of Talend

Business modeling
Graphical development
Metadata-driven design and execution
Real-time debugging
Robust execution


1.3. What is Talend Open Studio


Talend Open Studio for Data Integration is an open source data integration product
developed by Talend and designed to combine, convert and update data in various
locations across a business.

2. Installation
Before installing your Talend product, make sure the machines you are using meet
the following hardware requirements recommended by Talend.
Memory usage heavily depends on the size and nature of your Talend projects.
As a rule of thumb, if your Jobs include many transformation components, you
should consider increasing the total amount of memory allocated to your servers,
based on the following recommendations.

2.1. Hardware requirement

Memory usage:

Product   Client/Server   Recommended allocated memory
Studio    Client          3 GB minimum, 4 GB recommended

Disk usage:

Product   Client/Server   Required disk space for installation   Required disk space for use
Studio    Client          3 GB                                   3+ GB

2.2. Software requirement


Setting up JAVA_HOME
In order for your Talend product to use the Java environment installed on your
machine, you must set the JAVA_HOME environment variable.
To do so, proceed as follows:
1. Find the folder where Java is installed, usually C:\Program
Files\Java\JREx.x.x.
2. Open the Start menu and type Environment variable in the search bar to
open the Environment variable properties.
3. Click Environment Variables....
4. Under System Variables, click New... to create a variable. Name the
variable JAVA_HOME, enter the path to your Java JRE, and click OK.
5. Under System Variables, select the Path variable, click Edit... and add the
following at the end of the Path variable value: ;%JAVA_HOME%\bin
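
After following the steps above, one way to confirm that the variable is visible
to Java programs is a small check like the sketch below. The class name
JavaHomeCheck is hypothetical and not part of Talend; it only reads the
JAVA_HOME environment variable and looks for a bin folder under it.

```java
import java.io.File;

// Hypothetical helper (not part of Talend) that reports how JAVA_HOME is set.
class JavaHomeCheck {

    // Returns a human-readable status for the given JAVA_HOME value.
    static String describe(String javaHome) {
        if (javaHome == null || javaHome.trim().isEmpty()) {
            return "JAVA_HOME is not set";
        }
        // A Java installation normally contains a bin folder.
        File bin = new File(javaHome, "bin");
        return bin.isDirectory()
                ? "JAVA_HOME points to " + javaHome + " (bin folder found)"
                : "JAVA_HOME points to " + javaHome + " (no bin folder found)";
    }

    public static void main(String[] args) {
        System.out.println(describe(System.getenv("JAVA_HOME")));
        System.out.println("Running Java version: " + System.getProperty("java.version"));
    }
}
```

Open a new command prompt before running the check, because environment
variable changes are not picked up by windows that were already open.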

2.3. Download
Download the product from the Talend website.
Note that the .zip file contains binaries for all platforms (Linux/Unix,
Windows and macOS).
Once the download is complete, extract the archive file to your hard drive.

2.4. Configure the memory settings


If you want to tune the memory allocation for your JVM, you only need to
edit the TOS_DQ-win-x86_64.ini file.
The default values are: -vmargs -Xms40m -Xmx500m -XX:MaxMetaspaceSize=128m
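
To get a feel for what the -Xms and -Xmx options control, the following sketch
prints the heap limits of the JVM it runs in. It is an illustration only: the
class name HeapSettingsCheck is hypothetical, and it reports on its own JVM,
not on the Studio process.

```java
// Hypothetical helper (not part of Talend) that prints the JVM heap limits.
class HeapSettingsCheck {

    // Converts a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the upper heap limit (-Xmx);
        // totalMemory() is the heap currently reserved by the JVM.
        System.out.println("Max heap (MB): " + toMegabytes(rt.maxMemory()));
        System.out.println("Current heap (MB): " + toMegabytes(rt.totalMemory()));
    }
}
```

Running it as java -Xms40m -Xmx500m HeapSettingsCheck should report a maximum
heap close to 500 MB; the exact figure can differ slightly between JVMs.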

2.5. Launch the Studio


Double-click the TOS_DQ-win-x86_64.exe executable file to launch your
Talend Studio.

3. Talend Integration
A fast and cost-effective way to connect data
Maximize the value of data to your business with Talend Data Integration software,
a modern data platform based on an open and scalable architecture. Graphical
tools and wizards help you develop and deploy data integration jobs 10 times
faster than hand coding, at 1/5th the cost of competitors. Increase your productivity
today with a free trial of our commercial edition.

Develop and deploy 10 times faster


Synchronize metadata across database platforms
Let anyone access and cleanse data while governing its use

3.1. Create New Project

1. Launch Talend Studio.
2. On the login window, select the Create a new project option and enter a
project name in the field.
3. Click Finish to create the project and open it in the Studio.

3.2. Delete Project

1. On the login screen, click Manage Connections, then on the dialog box that
opens click Delete Existing Project(s) to open the [Select Project] dialog box.


3.3. Getting started with a basic Job: creating a Job


3.4. Workspace window


3.5. Add components to the Job

To drop a component from the Palette, proceed as follows:

1. Enter the search keyword(s) in the search field of the Palette and press
Enter to validate your search.
2. Select the component you want to use and click on the design workspace
where you want to drop the component.

Note that you can also drop a note to your Job the same way you drop
components. Each newly added component is shown in a blue box to indicate
that it is an individual subjob.

3.6. List of Talend components

ID  Name of component         Description

1   tAddCRCRow                Adds a CRC column to every row of the flow.
2   tChangeFileEncoding       Changes the encoding of a file.
3   tReplaceList              Replaces strings using a dynamic replacement list.
4   tUniqRow                  Makes a data flow unique based on the schema.
5   tReplace                  Replaces an expression with another one.
6   tJava                     Lets you enter custom Java code to integrate it into the
                              Talend program. The code is executed only once.
7   tJavaRow                  Allows Java logic to be performed for every record
                              within a flow.
8   tJavaFlex                 Lets you enter custom Java code to integrate it into the
                              Talend program. The code is entered in three parts
                              (start, main and end) that together form a component
                              dedicated to a desired operation.
9   tLibraryLoad              Adds or loads third-party libraries into a Talend
                              project.
10  tSetGlobalVar             Lets you define and set global variables in the GUI.
11  tMysqlInput               Reads a MySQL table and extracts fields based on a
                              MySQL query.
12  tMysqlOutput              Inserts or updates rows in a MySQL database.
13  tMysqlConnection          Creates a connection to a MySQL database.
14  tAggregateRow             Receives an input flow and aggregates it based on one
                              or more columns.
15  tAggregateSortedRow       Receives a sorted flow and aggregates it based on one
                              or more columns. For each output line, it provides the
                              aggregation key and the relevant result of set
                              operations (min, max, sum).
16  tExternalSortRow          Sorts input data based on one or several columns, using
                              an external sort application.
17  tFilterRow                Filters input rows by setting conditions on the
                              selected columns.
18  tMap                      Allows joins, column or row filtering, transformations
                              and multiple outputs.
19  tSampleRow                Filters rows according to their line numbers.
20  tSortRow                  Sorts input data based on one or several columns, by
                              sort type and order.
21  tXMLMap                   Allows joins, column or row filtering, transformations
                              and multiple outputs for XML data.
22  tFileInputDelimited       Reads a given file row by row with simple separated
                              fields.
23  tFileInputExcel           Reads an Excel file (.xls or .xlsx) and extracts data
                              line by line.
24  tFileInputFullRow         Opens a file, reads it row by row and sends complete
                              rows, as defined in the schema, to the next Job
                              component via a Row link.
25  tFileInputLDIF            Reads an LDIF file row by row and passes the rows to
                              the next component.
26  tFileInputMail            Reads the header and content parts of a defined email
                              file.
27  tFileInputMSDelimited     Reads a complex multi-structured delimited file.
28  tFileInputMSPositional    Reads a complex multi-structured positional file.
29  tFileInputXML             Reads an XML structured file and extracts data row by
                              row.
30  tFileInputRegex           A powerful component that can replace a number of other
                              components of the File family. Requires some advanced
                              knowledge of regular expression syntax.
31  tFileOutputDelimited      Writes to a file row by row with simple separated
                              fields.
32  tFileOutputExcel          Writes an MS Excel file with separated data values
                              according to a defined schema.
33  tFileOutputRow            Writes data into a file.
34  tFileOutputLDIF           Writes or modifies an LDIF file with data separated
                              into respective entries based on the defined schema, or
                              else deletes content from an LDIF file.
35  tFileOutputMSDelimited    Writes into a file based on the schema.
36  tFileOutputMSPositional   Writes into a file based on the position of fields in a
                              string.
37  tFileOutputXML            Writes an XML file with separated data values according
                              to a defined schema.
38  tHttpRequest              Part of the Internet family of components; makes both
                              POST and GET requests to the
39  tREST                     Serves as a REST Web service client that sends HTTP
                              requests to a REST Web service provider and gets the
                              responses.
40  tExtractJSONFields        Extracts data from JSON fields stored in a file, a
                              database table, etc., based on the XPath query.
41  tMsgBox                   Displays a message box.
42  tUnite                    Merges data from various sources, based on a common
                              schema.
43  tReplicate                Duplicates the incoming schema into two identical
                              output flows.


3.7 Connect the components together

Now that the components have been added to the workspace, they have to be
connected together. Components connected together form a subjob. Jobs are
composed of one or several subjobs carrying out various processes. In this
example, as the tLogRow and tFileOutputDelimited components are already
connected, you only need to connect the tFileInputDelimited to the tLogRow
component. To connect the components together, use either of the following
methods.

Right-click and click again method

1. Right-click the source component, tFileInputDelimited in this example.

2. In the contextual menu that opens, select the type of connection you want to
use to link the components, Row > Main in this example.

3. Click the target component to create the link, tLogRow in this example.

3.8. Drag and drop method


1. Click the input component, tFileInputDelimited in this example.
2. When the O icon appears, click it and drag the cursor to the destination
component, tLogRow in this example. A Row > Main connection is automatically
created between the two components.


3.9. Configuring the components

Example: Configuring the tLogRow component

1. Double-click the tLogRow component to open its Basic settings view.
2. In the Mode area, select Table (print values in cells of a table).

Configuring the tFileOutputDelimited component

1. Double-click the tFileOutputDelimited component to open its Basic settings
view.
2. Browse your system or enter the path to the output file, customers.csv in this
example.
3. Select the Include Header check box.
4. If needed, click the Sync columns button to retrieve the schema from the
input component.


3.10. Execute Job


Now that components are configured, the Job can be executed.
To do so, proceed as follows:
1. Press Ctrl+S to save the Job.
2. Go to the Run tab and click Run to execute the Job.
   Alternatively, press F6 to execute the Job.


3.11. Custom code components

ID  Name of component   Description

1   tJava               Lets you enter custom Java code to integrate it into the
                        Talend program. The code is executed only once.
2   tJavaRow            Allows Java logic to be performed for every record within
                        a flow.
3   tJavaFlex           Lets you enter custom Java code to integrate it into the
                        Talend program. The code is entered in three parts (start,
                        main and end) that together form a component dedicated to
                        a desired operation.
4   tLibraryLoad        Adds or loads third-party libraries into a Talend project.
5   tSetGlobalVar       Lets you define and set global variables in the GUI.

3.11.1. tJava

Batch design:

tRowGenerator_1


tRowGenerator_1 Schema setting

tJava Code:

// tJava code runs once, when the subjob it belongs to starts.
String abc; // declared but not used in this example
System.out.println("Hello");

Output:

3.11.2. tJavaRow

tRowGenerator_1


tRowGenerator_1 Schema setting

tJavaRow Code:
// Code generated according to input schema and output schema
System.out.println("tJavaRow");
// Upper-case the first name; pass the other columns through unchanged.
output_row.First_Name = StringHandling.UPCASE(input_row.First_Name);
output_row.Last_Name = input_row.Last_Name;
output_row.City = input_row.City;

Output:

3.11.3. tJavaFlex
Batch Design:


tJavaFlex Code

Schema of tJavaFlex :

Output :


Starting job tjava at 16:12 18/05/2017.

[statistics] connecting to socket on port 3519


[statistics] connected
tJavaFlex_1: Start code
tJavaFlex_1: Main code: i=1
tJavaFlex_1: Main code: i=2
tJavaFlex_1: Main code: i=3
tJavaFlex_1: End code
.---------.
|tLogRow_1|
|=-------=|
|newColumn|
|=-------=|
|row 1 |
|row 2 |
|row 3 |
'---------'

[statistics] disconnected
Job tjava ended at 16:12 18/05/2017. [exit code=0]
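
The way the three code parts of tJavaFlex relate to each other can be imitated
in plain Java. The sketch below is not the code used in the Job above, which is
not reproduced in this guide; the class name JavaFlexSketch and the row count
of 3 are assumptions chosen to produce log lines like those in the output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone imitation of a tJavaFlex component; not the original Job code.
class JavaFlexSketch {

    // Builds the log lines that the three code parts would print for the given row count.
    static List<String> run(int rowCount) {
        List<String> log = new ArrayList<>();

        // Start code: runs once, before the first row.
        log.add("tJavaFlex_1: Start code");

        for (int i = 1; i <= rowCount; i++) {
            // Main code: runs once for every row.
            log.add("tJavaFlex_1: Main code: i=" + i);
        }

        // End code: runs once, after the last row.
        log.add("tJavaFlex_1: End code");
        return log;
    }

    public static void main(String[] args) {
        run(3).forEach(System.out::println);
    }
}
```

In a real tJavaFlex, a loop like this one is usually opened in the Start code
and closed in the End code, with the Main code forming the loop body.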

3.11.4 tLibraryLoad


3.11.5 tSetGlobalVar
Batch Design :

Component setting for tSetGlobalVar

tJava Code :


Output :
Starting job tjava at 18:04 18/05/2017.

[statistics] connecting to socket on port 4013


[statistics] connected
myString=Hello World!
[statistics] disconnected
Job tjava ended at 18:04 18/05/2017. [exit code=0]
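
The idea behind tSetGlobalVar can be sketched in plain Java: a key and value
are stored in a shared map, and later code reads the value back by key. The
sketch is an assumption-based imitation, not the code from the Job above; the
class name GlobalVarSketch is hypothetical, and a local HashMap stands in for
the globalMap object available to code inside a Job.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone imitation of storing and reading a global variable.
class GlobalVarSketch {

    // Stand-in for the globalMap object that Job code can read from.
    static final Map<String, Object> globalMap = new HashMap<>();

    // Reads a String variable back by key, the way a tJava component would.
    static String readString(String key) {
        return (String) globalMap.get(key);
    }

    public static void main(String[] args) {
        // tSetGlobalVar step (assumed settings): Key = "myString", Value = "Hello World!"
        globalMap.put("myString", "Hello World!");

        // tJava step: read the variable and print it.
        // Prints myString=Hello World!
        System.out.println("myString=" + readString("myString"));
    }
}
```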

3.12. Connection components

ID  Name of component   Description

1   tMysqlInput         Reads a MySQL table and extracts fields based on a MySQL
                        query.
2   tMysqlOutput        Inserts or updates rows in a MySQL database.
3   tMysqlConnection    Creates a connection to a MySQL database.


3.12.1. tMysqlInput :

Batch Design :

tMysqlInput Schema :

Output :
Starting job tjava at 18:52 18/05/2017.

[statistics] connecting to socket on port 3823


[statistics] connected
.--------------+----------------+-------------+-----------------.
| tLogRow_1 |
|=-------------+----------------+-------------+----------------=|
|customerNumber|contactFirstName|ZenkakuString|MappedPhoneNumber|
|=-------------+----------------+-------------+----------------=|
|447 |Bhushan |null |null |
|448 |East |null |null |
|450 |higashi |null |null |
|452 |Bushan |null |null |
|455 |wast |null |null |
|456 |Mouth |null |null |
|458 |Nilesh |null |null |
|459 |higashi |null |null |
|462 |Bushan |null |null |
|465 |Bhushan |null |null |
|471 |East |null |null |
|473 |Bhushan |null |null |
|475 |kigashi |null |null |
|477 |tanaka |null |null |
|480 |Mouth |null |null |
|481 |matama |null |null |
|484 |Bushan |null |null |
|486 |tanaka |null |null |
|487 |East |null |null |
|489 |Mukesh |null |null |
|495 |Nitesh |null |null |
|496 |South |null |null |
'--------------+----------------+-------------+-----------------'

[statistics] disconnected
Job tjava ended at 18:52 18/05/2017. [exit code=0]

3.12.2. tMysqlOutput
Batch Design :

Output :
Starting job tjava at 18:52 18/05/2017.

[statistics] connecting to socket on port 3823


[statistics] connected
.--------------+----------------+-------------+-----------------.
| tLogRow_1 |
|=-------------+----------------+-------------+----------------=|
|customerNumber|contactFirstName|ZenkakuString|MappedPhoneNumber|
|=-------------+----------------+-------------+----------------=|
|447 |Bhushan |null |null |
|448 |East |null |null |
|450 |higashi |null |null |
|452 |Bushan |null |null |
|455 |wast |null |null |
|456 |Mouth |null |null |
|458 |Nilesh |null |null |
|459 |higashi |null |null |
|462 |Bushan |null |null |
|465 |Bhushan |null |null |
|471 |East |null |null |
|473 |Bhushan |null |null |
|475 |kigashi |null |null |
|477 |tanaka |null |null |
|480 |Mouth |null |null |
|481 |matama |null |null |
|484 |Bushan |null |null |
|486 |tanaka |null |null |
|487 |East |null |null |
|489 |Mukesh |null |null |
|495 |Nitesh |null |null |
|496 |South |null |null |
'--------------+----------------+-------------+-----------------'

[statistics] disconnected
Job tjava ended at 18:52 18/05/2017. [exit code=0]

3.12.3. tMysqlConnection :

Batch Design :


tMysqlConnection component setting :

Output :

3.13. Data Quality Components:


ID  Name of component     Description

1   tAddCRCRow            Adds a CRC column to every row of the flow.
2   tChangeFileEncoding   Changes the encoding of a file.
3   tReplaceList          Replaces strings using a dynamic replacement list.
4   tUniqRow              Makes a data flow unique based on the schema.
5   tReplace              Replaces an expression with another one.


3.13.1. tAddCRCRow

Batch Design :

tMap component setting :

tAddCRCRow component setting :


Output:
Starting job dataquality at 11:34 19/05/2017.

[statistics] connecting to socket on port 3883


[statistics] connected
For input string: "411028 "
.--+--------------+--------+-------+--------+----------+----------.
| tLogRow_1 |
|=-+--------------+--------+-------+--------+----------+---------=|
|Id|Street |Town |Country|Postcode|var1 |CRC |
|=-+--------------+--------+-------+--------+----------+---------=|
|1 |north est road|pune |india |413411 |19-05-2017|2848775222|
|3 |manhaton rd |New york|US |284745 |19-05-2017|2774761735|
'--+--------------+--------+-------+--------+----------+----------'

[statistics] disconnected
Job dataquality ended at 11:34 19/05/2017. [exit code=0]
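
As a rough illustration of what a per-row CRC is, the sketch below computes a
CRC-32 checksum over the column values of one row using the standard
java.util.zip.CRC32 class. This is an assumption-based sketch: the class name
CrcRowSketch is hypothetical, and it does not claim to reproduce the CRC values
shown in the output above, since the columns selected in the component and the
way it combines them are not shown in this guide.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration of a row checksum; not the tAddCRCRow implementation.
class CrcRowSketch {

    // Computes one CRC-32 value over the given column values, in order.
    static long crcOf(String... columns) {
        CRC32 crc = new CRC32();
        for (String column : columns) {
            crc.update(column.getBytes(StandardCharsets.UTF_8));
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Two rows that differ in any column get different checksums in practice,
        // which makes the CRC column useful for spotting changed rows.
        System.out.println(crcOf("north est road", "pune", "india"));
        System.out.println(crcOf("manhaton rd", "New york", "US"));
    }
}
```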

Batch Design :

tRowGenerator1:

tRowGenerator2:

3.13.2. tReplaceList

tReplaceList component setting :

Output :
Starting job chgfileEncode at 12:17 19/05/2017.

[statistics] connecting to socket on port 3931


[statistics] connected
.----------+----------+--------------.
| tLogRow_1 |
|=---------+----------+-------------=|
|Last_Name |First_Name|city |
|=---------+----------+-------------=|
|Garfield |Millard |Little Rock |
|Arthur |Lyndon |Topeka |
|Eisenhower|Richard |Topeka |
|Harding |Andrew |Concord |
|Fillmore |Calvin |Baton Rouge |
|Monroe |William |Raleigh |
|Eisenhower|Woodrow |Baton Rouge |
|Van Buren |Dwight |Denver |
|Truman |Martin |Olympia |
|Carter |Chester |Honolulu |
|Coolidge |Harry |Carson City |
|Coolidge |James |Columbus |
|Pierce |Grover |Frankfort |
|Ford |Ulysses |Charleston |
'----------+----------+--------------'
[statistics] disconnected
Job chgfileEncode ended at 12:17 19/05/2017. [exit code=0]

3.13.3. tReplace :

Batch Design :

tReplace Component:

Output :
Starting job chgfileEncode at 12:41 19/05/2017.

[statistics] connecting to socket on port 3532


[statistics] connected
.---------+----------+------.
| tLogRow_1 |
|=--------+----------+-----=|
|Last_Name|First_Name|city |
|=--------+----------+-----=|
|1 |Bhushan |Samuel|
|2 |Ortega |lee |
|3 |Zant |Thi |
|4 |Cohen |John |
|5 |Park |Umar |
|6 |Knipp |Troy |
|7 |Lunberg |Greg |
|8 |Brown |Sami |
|9 |Barnhill |Pascal|
|10 |Rose |Aaron |
|11 | | |
|12 | | |
'---------+----------+------'
[statistics] disconnected
Job chgfileEncode ended at 12:41 19/05/2017. [exit code=0]

3.13.4. tUniqRow :

Batch Design :


Output:
Starting job tuniqRow at 18:21 22/05/2017.

[statistics] connecting to socket on port 3921


[statistics] connected
.-------------------+---.
| Unique |
|=------------------+--=|
|ABC |PQR|
|=------------------+--=|
|-6.333333333333332 |A |
|18.499999999999996 |B |
|21.055555555555557 |C |
|-1.2222222222222219|X |
|32.666666666666664 |Q |
'-------------------+---'

.-------------------+---.
| Duplicate |
|=------------------+--=|
|ABC |PQR|
|=------------------+--=|
|2.000000000000001 |A |
|-1.8333333333333337|A |
'-------------------+---'

[statistics] disconnected
Job tuniqRow ended at 18:21 22/05/2017. [exit code=0]
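
The split into a Unique flow and a Duplicate flow can be sketched in plain
Java: the first row seen for each key goes to the unique list, and every later
row with the same key goes to the duplicate list. The class name UniqRowSketch
and the input order are assumptions; only the key values (the PQR column) are
taken from the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of unique/duplicate routing; not the tUniqRow implementation.
class UniqRowSketch {

    // Returns two lists: index 0 holds first occurrences, index 1 holds repeats.
    static List<List<String>> split(List<String> keys) {
        Set<String> seen = new HashSet<>();
        List<String> uniques = new ArrayList<>();
        List<String> duplicates = new ArrayList<>();
        for (String key : keys) {
            // Set.add returns false when the key was already present.
            if (seen.add(key)) {
                uniques.add(key);
            } else {
                duplicates.add(key);
            }
        }
        return Arrays.asList(uniques, duplicates);
    }

    public static void main(String[] args) {
        // Assumed input order for the PQR column.
        List<List<String>> result = split(Arrays.asList("A", "B", "A", "C", "X", "A", "Q"));
        System.out.println("Unique: " + result.get(0));     // Unique: [A, B, C, X, Q]
        System.out.println("Duplicate: " + result.get(1));  // Duplicate: [A, A]
    }
}
```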

3.14. Processing Components:


ID  Name of component     Description

01  tAggregateRow         Receives an input flow and aggregates it based on one or
                          more columns.
02  tAggregateSortedRow   Receives a sorted flow and aggregates it based on one or
                          more columns. For each output line, it provides the
                          aggregation key and the relevant result of set
                          operations (min, max, sum).
04  tFilterRow            Filters input rows by setting conditions on the selected
                          columns.
05  tMap                  Allows joins, column or row filtering, transformations
                          and multiple outputs.
06  tSampleRow            Filters rows according to their line numbers.
07  tSortRow              Sorts input data based on one or several columns, by sort
                          type and order.
08  tXMLMap               Allows joins, column or row filtering, transformations
                          and multiple outputs for XML data.


3.14.1. tAggregateRow:

Batch Design :

tAggregateRow component


tMap component

Starting job tAggregateRow at 15:09 19/05/2017.

[statistics] connecting to socket on port 3531


[statistics] connected
.--------+----------+-------------------.
| tLogRow_1 |
|=-------+----------+------------------=|
|Order_ID|Shipper_ID|Shipper_Name |
|=-------+----------+------------------=|
|6 |1 |Shiny Shipping |
|4 |2 |Rose Marry Ship Pvt|
|2 |3 |Nick Ltd |
|3 |4 |Michle Ltd |
'--------+----------+-------------------'

[statistics] disconnected
Job tAggregateRow ended at 15:09 19/05/2017. [exit code=0]
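
Aggregating on a column amounts to grouping rows by that column and computing a
value per group. The sketch below counts rows per shipper ID in plain Java. It
is an illustration under assumptions: the class name AggregateRowSketch and the
sample input are hypothetical, and it is only a guess that the Order_ID column
in the output above holds a count of orders per shipper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a group-and-count aggregation; not the tAggregateRow implementation.
class AggregateRowSketch {

    // Groups the rows by shipper ID and counts the rows in each group.
    static Map<Integer, Integer> countByShipper(List<Integer> shipperIds) {
        Map<Integer, Integer> counts = new TreeMap<>();  // TreeMap keeps keys sorted
        for (Integer id : shipperIds) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // One list entry per order; the value is the order's shipper ID (assumed data).
        // Prints {1=3, 2=2, 3=1}
        System.out.println(countByShipper(Arrays.asList(1, 2, 1, 3, 2, 1)));
    }
}
```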


3.14.2. tFilterRow component :

Batch Design :

tFilterRow Component Setting :


Output :

3.14.3. tSortRow:

Batch Design :

tSortRow Component :


Output :

3.14.4. tAggregateSortedRow:

Batch Design :


tAggregateSortedRow :

Output :
Starting job tAggregateSorted at 16:00 19/05/2017.

[statistics] connecting to socket on port 4022


[statistics] connected
.----------+-------.
| tLogRow_1 |
|=---------+------=|
|City |Country|
|=---------+------=|
|London |UK |
|California|USA |
|Texas |USA |
|Tokyo |Japan |
|Tokyo |Japan |
|Madrid |Spain |
|Saitama |Japan |
|Texas |USA |
|Birmingham|UK |
'----------+-------'

[statistics] disconnected
Job tAggregateSorted ended at 16:00 19/05/2017. [exit code=0]


3.14.5. tSampleRow:

Batch Design :

tSampleRow Component :

Output :


3.14.6. tXMLMap

Job Design :

tXMLMap :


tFileOutputDelimited :

Output :


3.15. Internet Component:


ID   Name of Component    Description
01   tHttpRequest         Part of the Internet family of components. Makes both POST and
                          GET requests to the
02   tREST                Serves as a REST Web service client that sends HTTP requests to
                          a REST Web service provider and gets the responses.
04   tExtractJSONFields   Extracts the data from JSON fields stored in a file, a database
                          table, etc., based on the XPath query.
05   tMsgBox              Displays a message box.
06   tUnite               Merges data from various sources, based on a common schema.
07   tReplicate           Duplicates the incoming schema into two identical output flows.
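
The behavior described for tUnite and tReplicate can be pictured outside the Studio. The Python sketch below is only an analogy and not Talend code: the helper names `unite` and `replicate` are hypothetical, rows are modeled as dictionaries, and the "common schema" is assumed to mean identical column names in the same order.

```python
from copy import deepcopy


def unite(*sources):
    """Concatenate rows from several sources that share one schema (tUnite-like)."""
    merged = []
    schema = None
    for rows in sources:
        for row in rows:
            keys = tuple(row.keys())
            if schema is None:
                schema = keys  # the first row fixes the common schema
            elif keys != schema:
                raise ValueError(f"schema mismatch: {keys} != {schema}")
            merged.append(row)
    return merged


def replicate(rows, copies=2):
    """Return independent, identical copies of the incoming flow (tReplicate-like)."""
    return [deepcopy(rows) for _ in range(copies)]


# Illustrative sample rows (assumption: two sources with the same two columns).
shippers_a = [{"Shipper_ID": 1, "Shipper_Name": "Shiny Shipping"}]
shippers_b = [{"Shipper_ID": 3, "Shipper_Name": "Nick Ltd"}]

united = unite(shippers_a, shippers_b)
flow_1, flow_2 = replicate(united)
print(united)
```

Each replicated flow can then be filtered or processed separately without affecting the other, which is the reason for copying the rows rather than sharing them.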

3.15.1. tHttpRequest:

tHttpRequest Component :

Output:

3.15.2. tREST :

Job Design :

tREST Component :

3.15.3. tExtractJSONFields:

Output:

3.15.4. tUnite :

Batch Design :

Schema

Output

3.15.5. tReplicate :

Batch Design :

Schema of tReplicate:

tFilterRow 1:

tFilterRow2:

Output :

4.0. Data Profiling :


Data profiling is the process of examining the data available in different data
sources and collecting statistics and information about this data. It helps to
assess the quality of the data against defined goals.
If data is of poor quality, or managed in structures that cannot be integrated to
meet the needs of the enterprise, business processes and decision-making suffer.
Compared to manual analysis techniques, data profiling technology improves the
enterprise's ability to manage data quality and to address the data quality
challenges faced during data migrations and data integrations.

4.1. Create a connection

In the DQ Repository tree view, expand Metadata, right-click DB Connections and
select Create DB Connection.

4.2. Database Analysis:


A first step in evaluating data is to get a high-level overview of its structure and
content. Talend offers structural analysis jobs at the level of the entire database,
schema, catalog, and tables and views. Drilling down into the results shows row
counts, schema counts, table counts, rows per table, view counts, rows per view,
keys, and indexes.
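
To make these structural indicators concrete, here is a minimal Python sketch that computes table counts, view counts and rows per table or view. It is not how Talend implements the analysis: it assumes an in-memory SQLite database, and the function name `structural_overview` and the sample tables are hypothetical.

```python
import sqlite3


def structural_overview(conn):
    """Return table/view counts and rows per table/view for an SQLite database."""
    objects = conn.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    overview = {"table_count": 0, "view_count": 0, "rows_per_object": {}}
    for name, kind in objects:
        overview[f"{kind}_count"] += 1
        # Identifiers cannot be bound as parameters, so quote the name explicitly.
        quoted = '"' + name.replace('"', '""') + '"'
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
        overview["rows_per_object"][name] = count
    # Total row count over base tables only (views would double-count rows).
    overview["row_count"] = sum(
        overview["rows_per_object"][name] for name, kind in objects if kind == "table"
    )
    return overview


# Hypothetical sample schema, used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT);
    CREATE VIEW person_names AS SELECT firstname, lastname FROM person;
    INSERT INTO person (firstname, lastname) VALUES ('Ann', 'Lee'), ('Bob', 'Ray');
    INSERT INTO city (name) VALUES ('London');
""")
overview = structural_overview(conn)
print(overview)
```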

4.3. Column Analysis:


For tables and views of interest, a column analysis can be executed to show, by
default, counts of rows, nulls, distinct values, unique values, duplicates, and
blanks. Additionally, more complex indicators can be added per column if needed.
Here are the results for the person.person.firstname column:
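
Separately from the Studio results, the default indicators can be illustrated with a short Python sketch. The definitions used here are assumptions rather than Talend's documented formulas: a distinct value is any different non-null value, a unique value occurs exactly once, a duplicate value occurs more than once, and a blank is a non-null string that is empty or contains only whitespace. The function name `column_indicators` and the sample values are hypothetical.

```python
from collections import Counter


def column_indicators(values):
    """Compute simple profiling indicators for one column (None models SQL NULL)."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(freq),
        # Values that occur exactly once.
        "unique_count": sum(1 for n in freq.values() if n == 1),
        # Values that occur more than once (counted once per value).
        "duplicate_count": sum(1 for n in freq.values() if n > 1),
        "blank_count": sum(
            1 for v in non_null if isinstance(v, str) and v.strip() == ""
        ),
    }


# Hypothetical sample column, not data from the guide.
firstnames = ["Ann", "Bob", "Ann", None, "  ", "Cid", "Bob", ""]
indicators = column_indicators(firstnames)
print(indicators)
```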

4.3.1. Add patterns to the analyzed columns:


You can add patterns to one or more of the analyzed columns to validate the full
record (all columns) against all the patterns, rather than validating each column
against a specific pattern, as is the case with the column analysis.
The results chart is a single bar chart covering all the patterns used. It shows
the number of rows that match all the patterns.
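
The idea of matching a full record against all patterns can be sketched in Python as follows. This is an illustration under stated assumptions, not Talend's implementation: each patterned column is assumed to carry one regular expression, a null value is assumed to count as a non-match, and the function name, sample rows and patterns are hypothetical.

```python
import re


def count_full_matches(rows, patterns):
    """Count rows in which every patterned column matches its regular expression."""
    compiled = {col: re.compile(rx) for col, rx in patterns.items()}
    matching = 0
    for row in rows:
        # A row counts only if ALL patterned columns match (None never matches).
        if all(
            row.get(col) is not None and rx.fullmatch(row[col])
            for col, rx in compiled.items()
        ):
            matching += 1
    return matching, len(rows) - matching


# Hypothetical sample records and patterns.
rows = [
    {"firstname": "Ann", "email": "ann@example.com"},
    {"firstname": "bob", "email": "bob@example.com"},
    {"firstname": "Cid", "email": "not-an-email"},
    {"firstname": "Dee", "email": None},
]
patterns = {
    "firstname": r"[A-Z][a-z]+",
    "email": r"[^@\s]+@[^@\s]+\.[a-z]+",
}
matched, unmatched = count_full_matches(rows, patterns)
print(matched, unmatched)
```

The two returned numbers correspond to the two values a single bar chart would need: rows matching all patterns and rows failing at least one.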

4.4. Duplication Analysis


Having analyzed the completeness, timeliness, validity, and accuracy of our data, we
can now perform a duplication analysis. Using Talend's Match Analysis job on the
person table, here is the job configuration for detecting duplicates on first and
last name. Two separate match algorithms are available, with configurable
confidence and match thresholds.
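
The role of a match threshold can be shown with a small Python sketch. It does not reproduce either of Talend's match algorithms: it substitutes the standard library's `difflib.SequenceMatcher` as the similarity measure, averages the first-name and last-name scores with equal weight, and models only a match threshold (no confidence weights). The function names, the threshold value and the sample people are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(people, match_threshold=0.85):
    """Return pairs of record ids whose averaged name similarity meets the threshold."""
    pairs = []
    for (id_a, a), (id_b, b) in combinations(people.items(), 2):
        score = (
            similarity(a["first"], b["first"]) + similarity(a["last"], b["last"])
        ) / 2
        if score >= match_threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs


# Hypothetical sample records keyed by id.
people = {
    1: {"first": "John", "last": "Smith"},
    2: {"first": "Jon", "last": "Smith"},
    3: {"first": "Mary", "last": "Jones"},
}
duplicates = find_duplicates(people)
print(duplicates)
```

Comparing every pair is quadratic in the number of records, so this approach only suits small samples; raising the threshold trades missed duplicates for fewer false matches.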

5. MDM-Master Data Management


Talend Open Studio for MDM provides unified development and management tools to
integrate and process all of your data with an easy-to-use visual designer.
Talend Studio provides the key capabilities of Talend MDM for data governance and
data stewardship, which enable users to build data models employing the necessary
business and data rules to create a single copy of the master data that is
propagated back to the source and target systems.

5.1. Functional architecture of Talend MDM:

Modeling:
Before we get started, let's first establish a common understanding of the most
important MDM terms:
Term                 Description
(business) element   Also referred to as business attribute. The actual name of the
                     data point.
(business) entity    Describes the actual data (the elements), its nature, its structure
                     and its relationships. An entity can have one or more business
                     elements. The Talend MDM jargon for this concept is data model
                     entity.
data model type      This is an element or collection of elements which is globally
                     defined and can be used across various entities. This makes
                     maintenance of common elements easier.
data model           Defines the attributes (elements), user access rights and
                     relationships of entities mastered by the MDM Hub. The data model
                     is the central component of Talend MDM. A data model maps to one
                     or more (business) entities that can be explicitly defined. Any
                     concept can be defined by a data model. A data model can have
                     multiple entities.
(business) domain    A collection of data models that define a particular concept. For
                     instance, the customer domain may be defined by the organization,
                     account, contact and opportunity data models. A product domain may
                     be defined by a product, product family and price list.
                     Ultimately, the domain is the collection of all data models that
                     relate to a concept. Talend MDM can model any and many domains
                     within a single hub. It is a generic multi-domain MDM solution.
data container       Holds data of one or several business entities. Data containers
                     are typically used to separate master data domains.

The Talend MDM architecture can be broken down into functional blocks that enable
interaction between users and the MDM Hub and their corresponding IT needs. Here
are the main building blocks of Talend MDM:

The Clients block includes one or more Talend Studio instances and Web browsers,
which can be on the same or on different machines.
From the Studio, you can set up and operate a centralized repository. You can
build data models that employ the necessary business and data rules to create a
single copy of the master data. This master data is propagated back to target and
source systems.
From the Web browser, you can search, display or edit master data with tasks
defined in the Studio.

The Server block includes an MDM server, where the master data is governed and
monitored.

The Database block includes the MDM database, where the master data and the
system data are stored.

5.2. Creating a data model:


The first step in any MDM project is to set up a data model and create business
entities in this data model. In this example, a Movie data model is created.

In the Studio workspace, an editor opens where you can define the details of your new
data model. The new data model and data container are listed in the MDM Repository
tree view.

5.2.1. Creating business entities in the data model:


The following procedure shows how to populate the Movie data model created in
"Creating a data model" with some business entities.
1. In the editor, right-click anywhere in the Data Model Entities panel, and then
click New Entity.
2. In the [New Entity] dialog box that opens, enter a name for your new entity in the
Name field, Movie in this example.
3. Select the Complex Type option.
Use the Simple Type option if you want to define a single element type such as a
phone number or an email address, and the Complex Type option if you want to define
a more complete structure, such as an address or, in this case, the different
attributes that describe a movie.

Let's have a look at the available types:


Simple type: Used for single, self-contained elements like email addresses.
Complex type: Used for structures like an address, which consists of multiple
elements. A complex type can also inherit elements from another complex type.
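
The distinction between the two kinds of type, including inheritance between complex types, can be mirrored in Python. This is only an analogy: Talend MDM defines these types in its data model editor, and every class and field name below (`Email`, `Address`, `BillingAddress`, `vat_number`) is hypothetical, as is the deliberately naive email check.

```python
from dataclasses import dataclass, fields


# Simple type: a single, self-contained value with its own validation rule.
class Email(str):
    def __new__(cls, value):
        # Naive check for illustration only: exactly one "@", not at either end.
        if value.count("@") != 1 or value.startswith("@") or value.endswith("@"):
            raise ValueError(f"not an email address: {value!r}")
        return super().__new__(cls, value)


# Complex type: a structure made of several elements.
@dataclass
class Address:
    street: str
    city: str
    country: str


# A complex type can inherit elements from another complex type.
@dataclass
class BillingAddress(Address):
    vat_number: str


billing = BillingAddress("1 Main St", "London", "UK", "GB123")
element_names = [f.name for f in fields(billing)]
contact = Email("ann@example.com")
print(element_names, contact)
```

`BillingAddress` ends up with the three inherited elements followed by its own, which is the effect inheritance between complex types is meant to achieve.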

5.3. Add Server Location :


1. In Server Explorer, right-click and select Add Server Location.
2. Type the name of the server.
3. Type the server location.
4. Type the username and password.
5. Check the connection.

5.4. Data Container:


All the master data is stored in a data container. A data container can hold the data
of various business entities. Note that a business entity stored in one data container
is not visible from another data container.
To create a data container, simply right-click Data Container in the repository
tree and choose New.

5.5. Create a view :


A view is what an end user can see via the web interface, which includes the form
and search functionality. There are various views that you can create. We will
only look at the simplest view here, which allows end users to create the business
entity online and search for values within certain attributes.

5.6. Deploy a model:


Once you have finalized your data model, you can deploy it to the MDM server.
Right-click the data model in the repository and choose Deploy to.

Web GUI:
On successful installation, http://localhost:8080/talendmdm will show:

The open source version is restricted to the following two user accounts:

standard user
user: user
password: user

admin
user: administrator
password: administrator
