This tutorial helps you learn the fundamentals of the Talend tool for data integration and
big data, with examples.
Audience
This tutorial is for beginners who aspire to become ETL experts. It is also ideal for
Big Data professionals who are looking to use an ETL tool with the Big Data ecosystem.
Prerequisites
Before proceeding with this tutorial, you should be familiar with basic data warehousing
concepts as well as the fundamentals of ETL (Extract, Transform, Load). If you are a beginner
to any of these concepts, we suggest going through tutorials on them
first to gain a solid understanding of Talend.
All the content and graphics published in this e-book are the property of Tutorials Point (I)
Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing
or republishing any contents or a part of the contents of this e-book in any manner without
the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as
possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our
website or its contents, including this tutorial. If you discover any errors on our website or
in this tutorial, please notify us at contact@tutorialspoint.com.
Talend
1. Talend – Introduction
Talend is a software integration platform which provides solutions for Data Integration,
Data Quality, Data Management, Data Preparation and Big Data. The demand for ETL
professionals with knowledge of Talend is high. It is also one of the few ETL tools that
ships with plugins to integrate easily with the Big Data ecosystem.
According to Gartner, Talend falls in the Leaders quadrant of the Magic Quadrant for Data
Integration tools.
Talend also offers Open Studio, an open source, free tool used widely for Data
Integration and Big Data.
2. Talend – System Requirements
The following are the system requirements to download and work on Talend Open Studio:
Microsoft Windows 10
Ubuntu 16.04 LTS
Apple macOS 10.13/High Sierra
Memory Requirement
Besides, you also need an up-and-running Hadoop cluster (preferably Cloudera).
3. Talend – Installation
To download Talend Open Studio for Big Data and Data Integration, please follow the steps
given below:
Step 1: Download the Talend Open Studio zip file from the Talend website.
Step 2: After the download finishes, extract the contents of the zip file. This will create a
folder with all the Talend files in it.
Step 3: Open the Talend folder and double click the executable file TOS_BD-win-x86_64.exe.
Accept the User License Agreement.
Step 5: Click Allow Access in case you get a Windows Security Alert.
4. Talend — Talend Open Studio
Talend Open Studio is a free, open source ETL tool for Data Integration and Big Data. It is
an Eclipse-based developer tool and job designer. You just need to drag and drop
components and connect them to create and run ETL or ELT jobs. The tool creates the
Java code for the job automatically, and you need not write a single line of code.
There are multiple options to connect with data sources such as RDBMS, Excel, SaaS and the
Big Data ecosystem, as well as apps and technologies like SAP, CRM, Dropbox and many more.
Some important benefits which Talend Open Studio offers are as follows:
Provides all features needed for data integration and synchronization, with 900
components, built-in connectors, automatic conversion of jobs to Java code and
much more.
The tool is completely free, hence there are big cost savings.
In the last 12 years, multiple giant organizations have adopted TOS for data
integration, which shows a very high trust factor in this tool.
Talend keeps on adding features to these tools, and the documentation is well
structured and very easy to follow.
5. Talend – Data Integration
Most organizations get data from multiple places and store it separately. Now, if the
organization has to make decisions, it has to take data from the different sources, put it
into a unified view and then analyze it to get a result. This process is called Data
Integration.
Benefits
Data Integration offers many benefits as described below:
Saves time and eases data analysis, as the data is integrated effectively.
Automated data integration process synchronizes the data and eases real time and
periodic reporting, which otherwise is time consuming if done manually.
Data which is integrated from several sources matures and improves over time,
which eventually helps in better data quality.
Creating a Project
Double click on the TOS Big Data executable file. The window shown below will open.
Select Create a new project option, mention the name of the project and click on Create.
Importing a Project
Double click on the TOS Big Data executable file. You can see the window as shown below.
Select Import a demo project option and click Select.
You can choose from the options shown below. Here we are choosing Data Integration
Demos. Now, click Finish.
You can see your imported project under existing projects list.
Importing an Existing Project
Give the Project Name and select the “Select root directory” option.
Browse your existing Talend project home directory and click Finish.
Opening a Project
Select a project from the existing projects list and click Finish. This will open that Talend project.
Deleting a Project
To delete a project, click Manage Connections.
Click OK again.
Exporting a Project
Click Export project option.
Select the project you want to export and give a path to where it should be exported. Click
on Finish.
6. Talend — Business Model Basics
Talend Open Studio offers the following shapes and connector options for creating a
business model:
Document: This shape is used for inserting a document object, which can be used
for the input/output of the data processed.
Input: This shape is used for inserting an input object, through which the user can
pass data manually.
List: This shape contains the extracted data, and it can be defined to hold only a
certain kind of data in the list.
Database: This shape is used for holding the input/output data.
Actor: This shape symbolizes the individuals involved in decision making and
technical processes.
Gear: This shape shows the manual programs that have to be replaced by Talend
jobs.
7. Talend — Components for Data Integration
All the operations in Talend are performed by connectors and components. Talend offers
800+ connectors and components to perform several operations. These components are
present in the palette, and there are 21 main categories to which the components belong.
You can choose a connector and just drag and drop it in the designer pane; Talend will
create the Java code automatically, and it will be compiled when you save the job.
The following is the list of widely used connectors and components for data integration in
Talend Open Studio:
tMysqlInput: Runs a database query to read a database and extract fields (tables,
views etc.) depending on the query.
tFileInputDelimited: Reads a delimited file row by row, divides each row into
separate fields and passes them to the next component.
tFileInputExcel: Reads an Excel file row by row, divides each row into separate
fields and passes them to the next component.
tFileList: Gets all the files and directories matching a given filemask pattern.
tMsgBox: Returns a dialog box with the message specified and an OK button.
tLogRow: Monitors the data getting processed. It displays data/output in the run
console.
tPreJob: Defines the subjobs that will run before your actual job starts.
tMap: Acts as a plugin in Talend studio. It takes data from one or more sources,
transforms it, and then sends the transformed data to one or more destinations.
tJoin: Joins 2 tables by performing inner and outer joins between the main flow
and the lookup flow.
tJava: Enables you to use custom Java code in the Talend program.
tRunJob: Manages complex job systems by running one Talend job after another.
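To make the behavior of components like tFileInputDelimited concrete, here is a rough plain-Java sketch of the row-splitting step such a component performs. The class and method names below are our own illustrations, not Talend's generated code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch (not Talend's generated code) of the row-splitting
// step a delimited-input component performs on each row it reads.
public class DelimitedRow {

    // Split one delimited row into its fields, keeping trailing empty fields
    // (limit -1), the way a delimited reader would.
    public static List<String> split(String row, String fieldSeparator) {
        return Arrays.asList(row.split(Pattern.quote(fieldSeparator), -1));
    }
}
```

For example, `split("a;b;c", ";")` yields the three fields a, b and c, which a component would then pass on to the next component in the flow.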
8. Talend — Job Design
Creating a Job
In the repository window, right click the Job Design and click Create Job.
Provide the name, purpose and description of the job and click Finish.
You can see your job has been created under Job Design.
Now, let us use this job to add components, connect and configure them. Here, we will
take an Excel file as input and produce an Excel file as output with the same data.
Since we are taking an Excel file as input here, we will drag and drop the tFileInputExcel
component from the palette to the Designer window.
Now, if you click anywhere on the designer window, a search box will appear. Find tLogRow
and select it to bring it into the designer window.
Finally, select the tFileOutputExcel component from the palette and drag and drop it in the
designer window.
Similarly, right click tLogRow and draw a Main line to tFileOutputExcel. Now, your
components are connected.
If the 1st row in your Excel file has the column names, put 1 in the Header option.
Click Edit schema and add the columns and their types according to your input Excel file.
Click OK after adding the schema.
Click Yes.
In the tLogRow component, click Sync columns and select the mode in which you want to
generate the rows from your input. Here we have selected Basic mode with "," as the field
separator.
Finally, in the tFileOutputExcel component, give the path of the file where you want to store
your output Excel file, along with the sheet name. Click Sync columns.
You will see the output in Basic mode with the "," separator.
You can also see that your output is saved as an Excel file at the output path you mentioned.
9. Talend — Metadata
Metadata basically means data about data: it tells the what, when, why, who, where,
which and how of data. In Talend, metadata holds the entire information about the data
present in Talend Studio. The Metadata option is present inside the Repository
pane of Talend Open Studio.
Various sources like DB connections, different kinds of files, LDAP, Azure, Salesforce, Web
Services, FTP, Hadoop clusters and many more options are present under Talend Metadata.
The main use of metadata in Talend Open Studio is that you can use these data sources
in several jobs just by a simple drag and drop from the Metadata node in the Repository panel.
10. Talend — Context Variables
Context variables are the variables which can have different values in different
environments. You can create a context group which can hold multiple context variables.
You need not add each context variable one by one to a job, you can simply add the
context group to the job.
These variables are used to make the code production-ready. By using context variables,
you can move the code between development, test and production environments, and it
will run in all of them.
In any job, you can go to Contexts tab as shown below and add context variables.
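The idea behind context groups can be sketched in plain Java as a per-environment lookup of variable values. All names below are illustrative; this is not Talend's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Talend's actual API): a context group maps each
// environment to its own set of variable values, so the same variable name
// resolves differently in dev, test and production.
public class ContextGroups {
    private final Map<String, Map<String, String>> groups = new HashMap<>();

    // Register the value a variable takes in a given environment.
    public void put(String environment, String variable, String value) {
        groups.computeIfAbsent(environment, e -> new HashMap<>()).put(variable, value);
    }

    // Resolve a variable in the chosen environment (null if undefined).
    public String resolve(String environment, String variable) {
        return groups.getOrDefault(environment, Map.of()).get(variable);
    }
}
```

With a hypothetical dbHost variable, resolving it in "dev" could give localhost while "prod" gives the production host, which is exactly why the same job runs unchanged in every environment.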
11. Talend — Managing Jobs
In this chapter, let us look into managing jobs and the corresponding functionalities
included in Talend.
Activating/Deactivating a Component
Activating/deactivating a component is very simple. You just need to select the
component, right click on it, and choose the Deactivate or Activate option for that component.
Enter the path where you want to export the item and click Finish.
To import items from a job, right click on the job in Job Designs and click Import
items.
Browse the root directory from where you want to import the items.
12. Talend — Handling Job Execution
To build a job, right click the job and select Build Job option.
Mention the path where you want to archive the job, select job version and build type,
then click Finish.
Then, select and right click on the component, and click the Add Breakpoint option. Observe
that here we have added breakpoints to the tFileInputExcel and tLogRow components. Then,
go to Debug Run, and click the Java Debug button.
You can observe from the following screenshot that the job will now execute in debug
mode and according to the breakpoints that we have mentioned.
Advanced Settings
In Advanced settings, you can select from Statistics, Exec Time, Save Job before Execution,
Clear before Run and JVM settings. Each of these options has the functionality explained
here:
Save Job before Execution: Automatically saves the job before the execution
begins.
13. Talend — Big Data
The tag line of Open Studio with Big Data is "Simplify ETL and ELT with the leading free
open source ETL tool for big data." In this chapter, let us look into the usage of Talend as
a tool for processing data in a big data environment.
Introduction
Talend Open Studio – Big Data is a free and open source tool for processing your data
very easily in a big data environment. You have plenty of big data components available
in Talend Open Studio that let you create and run Hadoop jobs just by a simple drag and
drop of a few Hadoop components.
Besides, we do not need to write long MapReduce code; Talend Open Studio Big
Data helps you do this with the components present in it. It automatically generates
MapReduce code for you; you just need to drag and drop the components and configure a
few parameters.
It also gives you the option to connect with several Big Data distributions like Cloudera,
HortonWorks, MapR, Amazon EMR and even Apache.
The list of Big Data connectors and components in Talend Open Studio is shown below:
tHDFSInput: Reads the data from the given HDFS path, puts it into the Talend schema
and then passes it to the next component in the job.
tHDFSList: Retrieves all the files and folders in the given HDFS path.
tHDFSPut: Copies a file/folder from the local file system (user-defined) to HDFS at the
given path.
tHDFSGet: Copies a file/folder from HDFS to the local file system (user-defined) at the
given path.
tPigMap: Used for transforming and routing the data in a Pig process.
tPigCoGroup: Groups and aggregates the data coming from multiple inputs.
tPigSort: Sorts the given data based on one or more defined sort keys.
tPigStoreResult: Stores the result of a Pig operation at a defined storage space.
tPigFilterRow: Filters the specified columns in order to split the data based on the
given condition.
14. Talend — Hadoop Distributed File System
In this chapter, let us learn in detail about how Talend works with Hadoop distributed file
system.
Here we are running the Cloudera QuickStart 5.10 VM on VirtualBox. A Host-Only Network
must be used in this VM.
You must have the same host entry configured in Cloudera Manager as well.
Similarly, on your Cloudera QuickStart VM, edit your /etc/hosts file as shown below.
Click Next.
Select the distribution as Cloudera and choose the version which you are using. Select the
Retrieve configuration option and click Next.
Enter the manager credentials (URI with port, username, password) as shown below and
click Connect. If the details are correct, you will get Cloudera QuickStart under the discovered
clusters.
Click Fetch. This will fetch all the connections and configurations for HDFS, YARN, HBase
and Hive.
Note that all the connection parameters will be auto-filled. Mention cloudera as the
username and click Finish.
Connecting to HDFS
In this job, we will list all the directories and files which are present on HDFS.
Firstly, we will create a job and then add HDFS components to it. Right click on Job
Design and create a new job – hadoopjob.
Now add 2 components from the palette – tHDFSConnection and tHDFSList. Right click
tHDFSConnection and connect these 2 components using the 'OnSubjobOk' trigger.
In tHDFSConnection, choose Repository as the Property Type and select the Hadoop
cloudera cluster which you created earlier. It will auto-fill all the necessary details required
for this component.
In tHDFSList, select “Use an existing connection” and in the component list choose the
tHDFSConnection which you configured.
Give the home path of HDFS in HDFS Directory option and click the browse button on the
right.
If you have established the connection properly with the above-mentioned configurations,
you will see a window as shown below. It will list all the directories and files present on
HDFS home.
Drag and drop 3 components – tHDFSConnection, tHDFSInput and tLogRow – from the
palette to the designer window.
Note that tHDFSConnection will have the similar configuration as earlier. In tHDFSInput,
select “Use an existing connection” and from the component list, choose tHDFSConnection.
In the File Name, give the HDFS path of the file you want to read. Here we are reading a
simple text file, so our File Type is Text File. Similarly, depending on your input, fill the
row separator, field separator and header details as mentioned below. Finally, click the
Edit schema button.
Since our file just has plain text, we are adding only one column of type String. Now,
click OK.
Note: When your input has multiple columns of different types, you need to mention
the schema here accordingly.
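Conceptually, this tHDFSInput to tLogRow subjob reads the file line by line as single-column rows and prints each one. A rough local equivalent in plain Java follows; it is our own sketch, not the code Talend generates:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative local sketch (not Talend's generated code) of the
// tHDFSInput -> tLogRow subjob: read the source line by line as
// single-column ("line": String) rows and collect them for logging.
public class LineInput {

    public static List<String> readRows(Reader source) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line); // tLogRow would print each row here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```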
Once you have successfully read an HDFS file, you can see the following output.
Writing a File to HDFS
Now, in tFileInputDelimited, give the path of the input file in the File name/Stream option.
Here we are using a CSV file as input, hence the field separator is ",".
Select the header, footer and limit according to your input file. Note that here our header
is 1 because the first row contains the column names, and the limit is 3 because we are
writing only the first 3 rows to HDFS.
Now, as per our input file, define the schema. Our input file has 3 columns as mentioned
below.
Note that the file type will be Text File, the Action will be "create", the Row separator will
be "\n" and the field separator ";".
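The transformation this job performs (skip the header row, keep the first 3 rows, and rewrite them with the new field separator) can be sketched in plain Java as follows. This is an illustrative sketch, not Talend's generated code, and the sample column names in the usage note are our own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch (not Talend's generated code) of this job's logic:
// skip `header` leading rows, keep at most `limit` data rows, and rewrite
// each kept row with a different field separator.
public class DelimitedRewrite {

    public static List<String> rewrite(List<String> rows, int header, int limit,
                                       String inSep, String outSep) {
        List<String> out = new ArrayList<>();
        for (int i = header; i < rows.size() && out.size() < limit; i++) {
            String[] fields = rows.get(i).split(Pattern.quote(inSep), -1);
            out.add(String.join(outSep, fields));
        }
        return out;
    }
}
```

For example, with header 1 and limit 3, a 5-row CSV input produces exactly 3 semicolon-separated output rows, mirroring what lands on HDFS.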
Finally, click Run to execute your job. Once the job has executed successfully, check if
your file is there on HDFS. Run an HDFS command such as hdfs dfs -cat with the output
path you had mentioned in your job.
You will see the following output if you are successful in writing on HDFS.
15. Talend — Map Reduce
In the previous chapter, we have seen how Talend works with Big Data. In this chapter,
let us understand how to use MapReduce with Talend.
For this purpose, right click Job Design and create a new job – MapreduceJob. Mention the
details of the job and click Finish.
Right click tNormalize and create a Main link to tAggregateRow. Then, right click
tAggregateRow and create a Main link to tMap. Now, right click tMap and create a Main
link to tHDFSOutput.
Now, select the file type, row separator, field separator and header according to your input
file.
Click edit schema and add the field “line” as string type.
In tNormalize, the column to normalize will be line, and the item separator will be whitespace:
" ". Now, click Edit schema. tNormalize will have the line column, and tAggregateRow will
have 2 columns, word and wordcount, as shown below.
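The normalize-then-aggregate steps above amount to a classic word count. A minimal plain-Java sketch of that logic follows; it is illustrative only, not the MapReduce code Talend generates:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the MapReduce code Talend generates) of the
// tNormalize -> tAggregateRow steps: split each line on whitespace
// (normalize), then count how many times each word occurs (aggregate).
public class WordCount {

    public static Map<String, Integer> count(String line) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // aggregate step
            }
        }
        return counts;
    }
}
```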
Now double click the tMap component to enter the map editor and map the input to the
required output. In this example, word is mapped to word and wordcount to wordcount.
In the Expression column, click on [...] to enter the expression builder.
Now, select StringHandling from the category list and the UPCASE function. Edit the
expression to "StringHandling.UPCASE(row3.word)" and click OK. Keep row3.wordcount
in the expression column corresponding to wordcount as shown below.
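The UPCASE routine used above behaves like Java's toUpperCase. The tMap step can be sketched as the following plain-Java transform; the class is our own illustration, not Talend's generated code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not Talend's generated code) of the tMap step:
// copy each (word, wordcount) row to the output, applying an UPCASE-style
// expression, like StringHandling.UPCASE(row3.word), to the word column.
public class UpcaseMap {

    public static Map<String, Integer> apply(Map<String, Integer> rows) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> row : rows.entrySet()) {
            out.merge(row.getKey().toUpperCase(), row.getValue(), Integer::sum);
        }
        return out;
    }
}
```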
Go to your HDFS path and check the output. Note that all the words will be in uppercase
with their wordcount.
16. Talend — Working with Pig
In this chapter, let us learn how to work with a Pig job in Talend.
For this, right click Job Design and create a new job – pigjob. Mention the details of the
job and click Finish.
Then, right click tPigLoad and create a Pig Combine line to tPigFilterRow. Next, right click
tPigFilterRow and create a Pig Combine line to tPigAggregate. Right click tPigAggregate and
create a Pig Combine line to tPigStoreResult.
In the Input file URI, give the path of your NYSE input file for the Pig job. Note that this
input file should be present on HDFS.
Click Edit schema and add the columns and their types as shown below.
In tPigFilterRow, select the "Use advanced filter" option and put "stock_symbol == 'IBM'"
in the Filter option.
In tPigStoreResult, give the output path in Result Folder URI where you want to store the
result of the Pig job. Select the store function as PigStorage and the field separator (not
mandatory) as "\t".
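The logic of this Pig job (filter rows on stock_symbol equal to 'IBM', then aggregate the average stock volume) can be sketched in plain Java as follows. This is an illustrative sketch assuming a simple two-column schema, not the code Talend runs:

```java
import java.util.List;

// Illustrative sketch (not the code Talend runs) of this Pig job's logic:
// keep only the rows whose stock_symbol matches the requested one, then
// average their stock_volume values.
public class StockFilter {

    public static class Row {
        final String stockSymbol;
        final long stockVolume;

        public Row(String stockSymbol, long stockVolume) {
            this.stockSymbol = stockSymbol;
            this.stockVolume = stockVolume;
        }
    }

    public static double averageVolume(List<Row> rows, String symbol) {
        return rows.stream()
                   .filter(r -> r.stockSymbol.equals(symbol)) // tPigFilterRow
                   .mapToLong(r -> r.stockVolume)
                   .average()                                 // tPigAggregate
                   .orElse(0.0);
    }
}
```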
Once the job finishes, go and check your output at the HDFS path you mentioned for
storing the Pig job result. The average stock volume of IBM is 500.
17. Talend — Hive
In this chapter, let us understand how to work with Hive job on Talend.
Host : “quickstart.cloudera”
Port: “10000”
Database: “default”
Username: “hive”
Note that the password will be auto-filled; you need not edit it. Also, the other Hadoop
properties will be preset and set by default.
In tHiveLoad, select "Use an existing connection" and put tHiveConnection in the component
list. Select LOAD in Load action. In File Path, give the HDFS path of your NYSE input file.
Mention the table in which you want to load the input in Table Name. Keep the other
parameters as shown below.
Put the query you want to run on the Hive table in the Query option. Here we are
printing all the columns of the first 10 rows of the test Hive table, for example with a
query like "select * from test limit 10".
In tLogRow, click Sync columns and select Table mode to show the output.