Topic I
Big Data Introduction
What is Big Data?
The term ‘Big Data’ describes data sets so large and complex that they are
difficult to capture, process, store, search, and analyze with conventional
database management tools and traditional database management systems.
2 STA 9760 Big Data Tech | Spring 2018 | Junyi Zhang 2/3/2018
Source of Big Data
Where does Big Data come from?
Sensors gathering street information for autonomous cars; posts on social
media sites; digital pictures and videos; webpage data; market order book and
transaction data; and so on.
Facts about Big Data
Big Data includes both structured data, such as data tables, and unstructured
data, such as webpage data.
Big Data deals with data at petabyte and exabyte scale.
Big Data is extremely difficult to work with in most relational database
management systems and desktop statistical software, because it requires
massively parallel software running on thousands of servers.
Big Data is not only about data volume, but also about the variety and
velocity of the data stream.
Value of Big Data
Analytical use
1. Big Data reveals hidden insights.
2. Being able to process every item of data in reasonable time removes the
troublesome need for sampling and promotes an investigative approach to data.
3. Processing a Big Data stream is dynamic, and thus more meaningful than
static business analysis.
Enabling new products
1. Web data becomes more and more important.
2. Cyber security and online fraud detection become more and more important.
3. Personal data privacy becomes more and more important.
Big Data Storage
Big Data and Hadoop Distributed File System
1. As Big Data overwhelms traditional databases and storage systems,
companies are looking to new tools such as Hadoop.
2. MapReduce = Scalability
3. Hadoop is a great accomplishment: a successful implementation of
MapReduce technology.
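The idea behind “MapReduce = Scalability” can be sketched in a few lines. The following is a toy single-machine word count in Python, not Hadoop’s actual Java API: the map step emits (key, value) pairs, and the reduce step groups and aggregates them. Because the pieces of work are independent, the same logic can be spread across thousands of servers.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce step: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big insights", "data streams"]
result = reduce_phase(map_phase(docs))
# result == {'big': 2, 'data': 2, 'insights': 1, 'streams': 1}
```

In real Hadoop, the map and reduce tasks run distributed across the cluster, and the framework handles the shuffle between them.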
Apache Hadoop Project
Definition: Apache Hadoop is an open-source project managed by the Apache
Software Foundation. It has proven very successful at storing and managing
huge amounts of data efficiently.
Hadoop is an open-source Java framework. It allows us to store vast amounts
of data on clusters of low-cost commodity hardware, and it provides the
capability to process distributed files of a variety of types with simple but
scalable programming models.
Hadoop is not a database. It is a file system. Anything that can be stored as a file
can be placed in a Hadoop repository.
Origin of Hadoop
Hadoop originated from Google’s white paper series on BigTable, MapReduce,
and GFS. Google published white papers on these topics but never released its
code or implementation.
Later on, Yahoo and many other contributors implemented the ideas from
Google’s white papers, and Hadoop came to life.
Doug Cutting, Hadoop’s creator, named the framework after his child’s stuffed
toy elephant.
Facts about Hadoop
Hadoop is one of the most popular environments for working with Big Data and
solving Big Data problems.
Hadoop’s cores are the Hadoop Distributed File System (HDFS) and MapReduce.
The Hadoop framework itself is mostly written in the Java programming
language, with some native code in C and command-line utilities written as
shell scripts.
When you have a very large dataset that cannot fit on a single machine, you
should consider Hadoop and its cloud deployment.
When you have very large data to be processed within a very short amount of
time, you should also consider Hadoop and its cloud deployment.
Why Use Hadoop?
Economical (Cost Effective):
No license fees: it is an open-source framework.
The base distributions of Hadoop from vendors such as Cloudera, Hortonworks, and MapR are also free.
Flexible:
Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources.
Data from multiple sources can be joined, enabling deeper analyses than any one system can provide.
Scalable:
New nodes can be added without needing to change data formats, how data is loaded, how jobs are written, or the applications
on top.
You can add as many nodes as you want, depending on the data size and business requirements.
Can process really large data (petabytes)
Solves the Big Data 3Vs problem:
Refer to chapter one for the Big Data 3Vs problem (volume, variety, velocity)
Reliable and Fault Tolerant:
When you lose a node, the system redirects work to another location of the data and continues processing without missing a
beat.
Handles replication
Smart
Optimized Compression
Query Expression
Hadoop Distributed File System
HDFS is a block-structured file system in which
1. Individual files are split into blocks of a fixed size.
2. These blocks are stored across a cluster of one or more machines, each
with a fixed data storage capacity.
3. Individual machines in the cluster are called DataNodes.
4. HDFS distributes data/files across the DataNodes.
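The splitting and distribution described above can be sketched as a toy Python model. The 64 MB default block size matches Hadoop 1.x; the node names and round-robin placement are simplifications of HDFS’s real placement policy:

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Split a file into fixed-size blocks, HDFS-style; the last block may be smaller."""
    blocks = []
    offset = 0
    while offset < file_size_mb:
        blocks.append(min(block_size_mb, file_size_mb - offset))
        offset += block_size_mb
    return blocks

def assign_to_datanodes(blocks, datanodes):
    """Distribute blocks across DataNodes round-robin (a simplified placement policy)."""
    placement = {node: [] for node in datanodes}
    for i, size in enumerate(blocks):
        placement[datanodes[i % len(datanodes)]].append((i, size))
    return placement

blocks = split_into_blocks(200)  # a 200 MB file -> blocks of [64, 64, 64, 8] MB
layout = assign_to_datanodes(blocks, ["dn1", "dn2", "dn3"])
```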
HDFS Cont’d
HDFS is a block-structured file system in which
1. DataNodes are tracked by the NameNode.
2. DataNodes serve read and write requests from clients.
3. The NameNode is the admin/master of HDFS.
4. The NameNode stores metadata and runs on high-quality hardware.
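A minimal sketch of this division of labor: the NameNode holds only metadata about where each block of each file lives, while the DataNodes hold the actual bytes. The file path, block IDs, and node names below are hypothetical:

```python
# Toy in-memory model of NameNode metadata: file path -> block id -> replica locations.
namenode_meta = {
    "/logs/day1.txt": {
        0: ["dn1", "dn2", "dn3"],  # block 0 replicated on three DataNodes
        1: ["dn2", "dn3", "dn4"],  # block 1 on a different set of nodes
    },
}

def locate_block(path, block_id):
    """A client asks the NameNode where a block lives, then reads it from a DataNode."""
    return namenode_meta[path][block_id]

locations = locate_block("/logs/day1.txt", 1)
# locations == ['dn2', 'dn3', 'dn4']
```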
Checklist of installing Hadoop 2.0
On Linux OS: follow the instructions here.
On Windows OS:
1. Cygwin: a powerful Unix-style command-line environment. Cygwin is a large
collection of GNU and open-source tools that provide functionality
similar to a Linux distribution on Windows.
2. Java: the Oracle Java Development Kit (JDK), not the Java Runtime
Environment (JRE). Oracle JDK 1.7 is needed.
Install JDK on Windows OS
Download Oracle Java Development Kit (JDK) version 1.7 (1.7 is the stable
version for Hadoop so far) from the Oracle archive.
Download a compatible build of JDK 1.7, for example jdk-7u80-windows-x64.exe.
Install the JDK by executing the exe file and choose C:\ as the installation
directory, so that the Java environment path will be
C:\Java\jdk1.7.0_80
Add C:\Java\jdk1.7.0_80 to the System Path as JAVA_HOME
Add path of JDK on Windows OS
Add C:\Java\jdk1.7.0_80 to System Path as JAVA_HOME
1. Step 1: Press Windows Key + Pause/Break Key
2. Step 2: Choose Advanced System Settings
3. Step 3: Choose Environment Variables
4. Step 4: Create New System Variables by clicking New tab
5. Step 5: Fill in Variable name as `JAVA_HOME`
6. Step 6: Fill in Variable value as `C:\Java\jdk1.7.0_80`
Verify the environment variable with `echo %JAVA_HOME%` at the command
prompt; the output should be `C:\Java\jdk1.7.0_80`
Install Cygwin as Linux terminal
Download Cygwin and install it with default setup.
Select packages openssh, openssl, dos2unix, git, and vim.
Configure SSH on Cygwin
Run Cygwin as administrator.
ssh-host-config
Yes to Privilege Separation, Yes to install sshd as a service
Enter ntsec as the value of CYGWIN for the daemon.
net start sshd
ssh-user-config
Yes to create an SSH2 RSA identity file.
Yes to use this identity to log in to this machine.
No to create an SSH2 DSA identity file.
ssh -v localhost
Install Hadoop 1.0
Download Hadoop 1.0 from here.
Run Cygwin as administrator and extract Hadoop 1.2.1 with
`tar -xzf hadoop-1.2.1.tar.gz`
Create a folder named “hadoop-dir”, and inside the “hadoop-dir” folder create
two folders named “datadir” and “namedir”.
In Cygwin, execute the chmod command to change the folder permissions so that
the folders can be accessed by Hadoop.
$ chmod 755 hadoop-dir
$ cd hadoop-dir
$ chmod 755 datadir
$ chmod 755 namedir
Use vim to edit four important configuration files in Hadoop-1.2.1
Install Hadoop 1.0
Edit the hadoop-env.sh file to set the Java home, as we did earlier in the
environment variable setup.
Uncomment the line that contains “export JAVA_HOME” and provide our own
Java path.
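For example, assuming the JDK was installed to C:\Java\jdk1.7.0_80 as above, the uncommented line in hadoop-env.sh might read as follows (note Cygwin’s /cygdrive path style; your path may differ):

```shell
# Set the Java installation root for Hadoop (example path)
export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_80
```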
Install Hadoop 1.0
Edit the core-site.xml file as follows to set the network port used for the host connection.
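A typical single-node core-site.xml for Hadoop 1.x looks like the following; hdfs://localhost:9000 is the conventional choice, though the exact port on the original slide may differ:

```xml
<configuration>
  <!-- Default file system URI: the NameNode's host and port -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```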
Install Hadoop 1.0
Edit the hdfs-site.xml file as follows to set the physical storage locations
of the NameNode and DataNode.
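A typical hdfs-site.xml for this single-node setup might look like the following, pointing dfs.name.dir and dfs.data.dir at the namedir and datadir folders created earlier. The exact paths are assumptions; adjust them to wherever you created hadoop-dir:

```xml
<configuration>
  <!-- Single machine, so one copy of each block is enough -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Where the NameNode keeps its metadata (example path) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/cygdrive/c/hadoop-dir/namedir</value>
  </property>
  <!-- Where the DataNode keeps the actual blocks (example path) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/cygdrive/c/hadoop-dir/datadir</value>
  </property>
</configuration>
```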
Install Hadoop 1.0
Edit the mapred-site.xml file as follows to set the host and port for the JobTracker.
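For Hadoop 1.x, mapred-site.xml conventionally points the JobTracker at localhost:9001; the port on the original slide may differ:

```xml
<configuration>
  <!-- Host and port where the JobTracker listens -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```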
Make Use of Hadoop 1.0
We first need to format the NameNode to create a Hadoop Distributed File
System (HDFS).
Open a Cygwin terminal (Run as Administrator) and execute the following
commands
$ cd hadoop-1.2.1
$ bin/hadoop namenode -format
This command will run for a while. We should see the message “Storage
Directory has been successfully formatted”
Make Use of Hadoop 1.0
Start Hadoop Daemons:
Once the file system has been created, the next step is to check and start the Hadoop cluster
daemons: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.
Restart the Cygwin terminal and execute the command below to start all daemons on the Hadoop
cluster.
$ bin/start-all.sh
This command will start all the services in the cluster, and you now have your Hadoop cluster
running.
Stop Hadoop Daemons:
To stop all the daemons, we can execute the command
$ bin/stop-all.sh
Next we browse the web interfaces of the NameNode (http://localhost:50070 by default in
Hadoop 1.x) and the JobTracker (http://localhost:50030) to see the details.
Visualize NameNode and DataNodes
for Hadoop 1.0
Track HDFS for Hadoop 1.0
Compare Hadoop 1.0 with 2.0
Install Apache Maven
Download Maven-3.5.2 binaries from here
Unzip the gz package in Cygwin with `tar xzf apache-maven-3.5.2-bin.tar.gz`
Copy and paste the unzipped folder to C:\
Add the Maven folder directory to the `path` system variable
Install Apache Hadoop 2.0
Download Hadoop-2.7.5-src source file from here.
Unzip the gz package in Cygwin
Copy and paste the unzipped files to C:\hdfs
Build the Hadoop executables by using the Maven command
mvn package -Pdist -DskipTests -Dtar
Details will be given in the part II slides. Try to install Hadoop 1.0 and
then 2.0 as homework for this week.
Examples of using Hadoop 2.0 for file analysis will be in part II as well.
Install Python via Anaconda
Anaconda is a free and open-source distribution of Python and R that also
includes an Integrated Development Environment (IDE). You can run it on your
desktop with an OS such as Windows, Mac, or Linux.