
Hadoop Distributed File System

Topic I
Big Data Introduction
 What is Big Data?
 The term ‘Big Data’ describes collections of data sets so large and complex that they are difficult to capture, process, store, search, and analyze using conventional database management tools and traditional database management systems.

2 STA 9760 Big Data Tech | Spring 2018 | Junyi Zhang 2/3/2018
Source of Big Data
 Where does Big Data come from?
 Sensors gathering street information for an autonomous driving car; posts on social media sites; digital pictures and videos; webpage data; market order book and transaction data; and so on.

Facts about Big Data
 Big Data includes both structured data, such as database tables, and unstructured data, such as webpage data.
 Big Data deals with data at the petabyte and exabyte scale.
 Big Data is extremely difficult to work with in most relational database management systems and desktop statistical software, because it requires massively parallel software running on thousands of servers.
 Big Data is not only about data volume, but also about the variety and velocity of the data stream.

Value of Big Data
 Analytical use
1. Big Data reveals hidden insights.
2. Being able to process every item of data in a reasonable time removes the troublesome need for sampling and promotes an investigative approach to data.
3. Processing a Big Data stream is dynamic, and thus more meaningful than static business analysis.
 Enabling new products
1. Web data is becoming more and more important.
2. Cyber security and online fraud detection are becoming more and more important.
3. Personal data privacy is becoming more and more important.

Big Data Storage
 Big Data and Hadoop Distributed File System
1. As Big Data overwhelms traditional databases, storage, and more, companies are looking to exploit new tools such as Hadoop.
2. MapReduce = Scalability.
3. Hadoop is a highly successful implementation of the MapReduce model.

Apache Hadoop Project
 Definition: Apache Hadoop is an open-source project managed by the Apache Software Foundation. It has proven very successful at storing and managing huge amounts of data efficiently.
 Hadoop is an open-source Java framework. It allows us to store vast amounts of data on clusters of low-cost commodity hardware, and gives us the capability to process distributed files of a variety of types with simple but scalable programming models.
 Hadoop is not a database; it is a file system. Anything that can be stored as a file can be placed in a Hadoop repository.

Origin of Hadoop
 Hadoop originated from Google's white paper series on BigTable, MapReduce, and GFS. Google published white papers on these topics but never released the code or implementation.
 Later, Yahoo and many other contributors implemented the ideas from Google's white papers, and Hadoop came to life.
 Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.

Facts about Hadoop
 Hadoop is one of the most popular environments for working with Big Data and solving Big Data problems.
 Hadoop's cores are the Hadoop Distributed File System (HDFS) and MapReduce.
 The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
 When you have a very large dataset that cannot fit on a single machine, you should consider Hadoop and its cloud deployment.
 When you have very large data to be processed within a very short amount of time, you should consider Hadoop and its cloud deployment.
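A back-of-the-envelope calculation (all numbers hypothetical, for illustration only) shows why a dataset at this scale forces a cluster rather than a single machine:

```python
import math

# Hypothetical sizing sketch: how many commodity nodes are needed just to
# store a petabyte of raw data, assuming 8 TB of usable disk per node and
# the HDFS default replication factor of 3.
data_tb = 1000            # 1 PB of raw data (assumption)
disk_tb_per_node = 8      # usable disk per commodity node (assumption)
replication = 3           # HDFS default replication factor

nodes_needed = math.ceil(data_tb * replication / disk_tb_per_node)
print(nodes_needed)       # 375
```

Even before any processing is considered, storage alone requires hundreds of machines under these assumptions.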

Use
 Economical (cost-effective):
 No license fees.
 It is an open-source framework.
 If you get it from Cloudera, Hortonworks, or MapR, which offer different distributions of Hadoop, it is also free.
 Flexible:
 Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources.
 Data from multiple sources can be joined, enabling deeper analyses than any one system can provide.
 Scalable:
 New nodes can be added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
 You can add as many nodes as you want, depending on the data size and business requirements.
 Can process really large data (petabytes).
 Solves the Big Data 3Vs problem:
 Refer to the Big Data introduction (Topic I) for the 3Vs problem.
 Reliable and fault-tolerant:
 When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
 Handles replication.
 Smart:
 Optimized compression.
 Query expression.
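The "reliable and fault-tolerant" point can be illustrated with a toy Python sketch (node names, block names, and the 5-node cluster are hypothetical; real HDFS also re-replicates lost copies automatically):

```python
import random

REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4", "node5"]

# Place each block's replicas on 3 distinct DataNodes.
blocks = {f"block{i}": random.sample(nodes, REPLICATION) for i in range(4)}

# Simulate losing one node: every block still has at least two live
# copies, so reads continue while the cluster re-replicates in the
# background.
failed = "node3"
live = {b: [n for n in locs if n != failed] for b, locs in blocks.items()}
assert all(len(locs) >= REPLICATION - 1 for locs in live.values())
```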

Hadoop Distributed File System
 HDFS is a block-structured file system where
1. Individual files are split into blocks of a fixed size.
2. These blocks are stored across a cluster of one or more machines, each with a fixed data storage capacity.
3. The individual machines in the cluster are called DataNodes.
4. HDFS distributes data/files across the DataNodes.
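The splitting in step 1 can be sketched in Python (64 MB was the default block size in Hadoop 1.x; the 200 MB file is a made-up example):

```python
# Sketch of how HDFS splits a file into fixed-size blocks.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB in bytes (Hadoop 1.x default)

def split_into_blocks(file_size):
    """Return the sizes of the HDFS blocks a file of file_size bytes fills."""
    full, last = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([last] if last else [])

blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))   # 4  (three 64 MB blocks plus one 8 MB block)
```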

HDFS Cont’d
 HDFS is a block-structured file system where
1. DataNodes are tracked by the NameNode.
2. DataNodes serve read and write requests from clients.
3. The NameNode is the admin/master of HDFS.
4. The NameNode stores metadata and runs on high-quality hardware.
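The division of labor between NameNode and DataNodes can be sketched as a toy Python structure (all paths, block IDs, and node names here are hypothetical):

```python
# Toy sketch of the NameNode's role: it stores only metadata -- which
# blocks make up each file and which DataNodes hold each block -- never
# the file contents themselves.
namenode_metadata = {
    "/logs/web.log": {
        "blk_001": ["datanode1", "datanode3", "datanode4"],
        "blk_002": ["datanode2", "datanode3", "datanode5"],
    }
}

def locate(path):
    """A client asks the NameNode where the blocks live, then reads the
    actual bytes directly from the DataNodes."""
    return namenode_metadata[path]

print(sorted(locate("/logs/web.log")))   # ['blk_001', 'blk_002']
```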

Checklist for installing Hadoop 2.0
 On Linux OS: follow the instructions here.
 On Windows OS:
1. Cygwin: a powerful Unix command-line environment. Cygwin is a large collection of GNU and open-source tools which provide functionality similar to a Linux distribution on Windows.
2. Java: the Oracle Java Development Kit (JDK), not the Java Runtime Environment (JRE). Oracle JDK 1.7 is required.

Install JDK on Windows OS
 Download the Oracle Java Development Kit (JDK) version 1.7 (currently the stable version for Hadoop) from the Oracle archive
 Download a compatible build of JDK 1.7, for example jdk-7u80-windows-x64.exe
 Install the JDK by executing the .exe file and choose C:\ as the installation directory, so that the Java environment path will be C:\Java\jdk1.7.0_80
 Add C:\Java\jdk1.7.0_80 to the System Path as JAVA_HOME

Add the JDK path on Windows OS
 Add C:\Java\jdk1.7.0_80 to the System Path as JAVA_HOME
1. Step 1: Press Windows Key + Pause/Break
2. Step 2: Choose Advanced System Settings
3. Step 3: Choose Environment Variables
4. Step 4: Create a new system variable by clicking New
5. Step 5: Fill in the variable name as `JAVA_HOME`
6. Step 6: Fill in the variable value as `C:\Java\jdk1.7.0_80`
 Verify the environment variable with `echo %JAVA_HOME%` at the command prompt; the output should be `C:\Java\jdk1.7.0_80`

Install Cygwin as a Linux terminal
 Download Cygwin and install it with default setup.
 Select packages openssh, openssl, dos2unix, git, and vim.

Configure SSH on Cygwin
 Run Cygwin as administrator.
 ssh-host-config
 Yes to Privilege Separation, Yes to install sshd as a service
 Enter ntsec as the value of CYGWIN for the daemon.
 net start sshd
 ssh-user-config
 Yes to create a SSH2 RSA identity file.
 Yes to use this identity to login to this machine.
 No to create an SSH2 DSA id file.
 ssh -v localhost

 Lastly, add ‘C:\cygwin64\bin’ to the system Path environment variable

Install Hadoop 1.0
 Download Hadoop 1.0 from here.
 Run Cygwin as administrator and unpack Hadoop 1.0 with
`tar -xzf hadoop-1.2.1.tar.gz`
 Create a folder named “hadoop-dir”, and inside the “hadoop-dir” folder create two folders named “datadir” and “namedir”.
 In Cygwin, execute the chmod command to change the folder permissions so that they can be accessed by Hadoop:
$ chmod 755 hadoop-dir
$ cd hadoop-dir
$ chmod 755 datadir
$ chmod 755 namedir
Use vim to edit the four important configuration files in hadoop-1.2.1

Install Hadoop 1.0
 Edit the hadoop-env.sh file to set the Java home, as we did before in the environment variable setup
 Uncomment the line which contains “export JAVA_HOME” and provide our own Java path.
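With the JDK installed at C:\Java\jdk1.7.0_80 as above, the uncommented line in hadoop-env.sh would look like the following (the Cygwin-style /cygdrive path is an assumption; adjust it to your setup):

```shell
# Point Hadoop at the JDK install; /cygdrive/c is Cygwin's view of C:\
export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_80
```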

Install Hadoop 1.0
 Edit the core-site.xml file as follows to set the network port used for the host connection.
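The slide's screenshot is not reproduced here, but a typical Hadoop 1.x core-site.xml for a single-node setup looks like the following sketch (the port 9100 is an assumption; any free port works):

```xml
<configuration>
  <!-- fs.default.name: the NameNode's URI (Hadoop 1.x property name) -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>
```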

Install Hadoop 1.0
 Edit the hdfs-site.xml file as follows to set the physical locations of the NameNode and DataNode storage.
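A typical Hadoop 1.x hdfs-site.xml pointing at the namedir and datadir folders created earlier might look like this sketch (the paths are assumptions; replication 1 suits a single-node install):

```xml
<configuration>
  <!-- Single node, so keep one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Where the NameNode keeps its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/user/hadoop-dir/namedir</value>
  </property>
  <!-- Where the DataNode stores the actual blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/user/hadoop-dir/datadir</value>
  </property>
</configuration>
```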

Install Hadoop 1.0
 Edit the mapred-site.xml file as follows to set the host and port for the JobTracker.
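A typical Hadoop 1.x mapred-site.xml looks like the following sketch (the port 9101 is an assumption; any free port works):

```xml
<configuration>
  <!-- mapred.job.tracker: the JobTracker's host and port (Hadoop 1.x name) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9101</value>
  </property>
</configuration>
```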

Make Use of Hadoop 1.0
 We first need to format the NameNode to create a Hadoop Distributed File System (HDFS).
 Open a Cygwin terminal (Run as Administrator) and execute the following commands:
$ cd hadoop-1.2.1
$ bin/hadoop namenode -format
 This command will run for a while. We should see the message “Storage directory has been successfully formatted”.

Make Use of Hadoop 1.0
 Start the Hadoop daemons:
Once the file system has been created, the next step is to check and start the Hadoop cluster daemons: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.
Restart the Cygwin terminal and execute the command below to start all daemons on the Hadoop cluster.
$ bin/start-all.sh

This command starts all the services in the cluster, and now you have your Hadoop cluster running.
 Stop the Hadoop daemons: to stop all the daemons, we can execute the command

$ bin/stop-all.sh

Next we browse the web interfaces of the NameNode and JobTracker to see the details.

Visualize NameNode and DataNodes
for Hadoop 1.0

Track HDFS for Hadoop 1.0

Compare Hadoop 1.0 with 2.0

Install Apache Maven
 Download the Maven-3.5.2 binaries from here
 Unpack the gz package in Cygwin with `tar xzf apache-maven-3.5.2-bin.tar.gz`
 Copy and paste the unpacked folder to C:\
 Add the Maven folder directory to the `Path` system variable

 Verify the Maven installation with `mvn -v`

Install Apache Hadoop 2.0
 Download the Hadoop-2.7.5-src source file from here.
 Unpack the gz package in Cygwin
 Copy and paste the unpacked files to C:\hdfs
 Build the Hadoop executables with the Maven command
mvn package -Pdist -DskipTests -Dtar

 Details will be given in the Part II slides. Try to install Hadoop 1.0 and then 2.0 as homework for this week.
 Examples of using Hadoop 2.0 for file analysis will be in Part II as well.

Install Python on ANACONDA
 ANACONDA is a free and open-source distribution of Python and R that includes an Integrated Development Environment (IDE). You can run it on your desktop with an OS such as Windows, Mac, or Linux.

30 STA 9713 Financial Statistics | Fall 2017 | Junyi Zhang 2/3/2018


ANACONDA screenshot walkthrough:
1. The ANACONDA interface
2. ANACONDA environments
3. ANACONDA: create your own env
4. ANACONDA: the Python terminal
5. ANACONDA: activate an env
6. ANACONDA: install new packages
7. ANACONDA: install Jupyter Notebook
8. ANACONDA: install TensorFlow
9. ANACONDA: open Python in the terminal
10. ANACONDA: install pandas
11. ANACONDA: install matplotlib
12. ANACONDA: open Jupyter Notebook
13. ANACONDA: Jupyter in the browser
