Topic I
Big Data Introduction
What is Big Data?
The term ‘Big Data’ describes data sets so large and complex that they are
difficult to capture, process, store, search, and analyze with conventional
database management tools and traditional database management systems.
2 STA 9760 Big Data Tech | Spring 2018 | Junyi Zhang 2/3/2018
Source of Big Data
Where does Big Data come from?
Sensors gathering street information for autonomous cars; posts on social
media sites; digital pictures and videos; webpage data; market order book and
transaction data; and so on.
Facts about Big Data
Big Data includes both structured data, such as data tables, and unstructured
data, such as webpage data.
Big Data deals with data at petabyte and exabyte scale.
Big Data is extremely difficult to work with in most relational database
management systems and desktop statistical software, because it requires
massively parallel software running on thousands of servers.
Big Data is not only about data volume, but also about the variety and
velocity of the data stream.
Value of Big Data
Analytical use
1. Big Data reveals hidden insights.
2. Being able to process every item of data in reasonable time removes the
troublesome need for sampling and promotes an investigative approach to data.
3. Processing a Big Data stream is dynamic, and thus more meaningful than
static business analysis.
Enabling new products
1. Web data becomes more and more important.
2. Cyber security and online fraud detection become more and more important.
3. Personal data privacy becomes more and more important.
Big Data Storage
Big Data and Hadoop Distributed File System
1. As Big Data overwhelms traditional databases and storage systems,
companies are looking to new tools such as Hadoop.
2. MapReduce = Scalability
3. Hadoop is a great accomplishment: a successful implementation of
MapReduce technology.
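The idea behind “MapReduce = Scalability” can be sketched in a few lines. The following is a toy single-machine word count in Python, not Hadoop’s actual Java API: the map step emits (key, value) pairs, and the reduce step groups and aggregates them. Because the pieces of work are independent, the same logic can be spread across thousands of servers.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce step: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big insights", "data streams"]
result = reduce_phase(map_phase(docs))
# result == {'big': 2, 'data': 2, 'insights': 1, 'streams': 1}
```

In real Hadoop, the map and reduce tasks run distributed across the cluster, and the framework handles the shuffle between them.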
Apache Hadoop Project
Definition: Apache Hadoop is an open-source project managed by the Apache
Software Foundation. It has proven very successful at storing and managing
huge amounts of data efficiently.
Hadoop is an open-source Java framework. It allows us to store vast amounts
of data on clusters of low-cost commodity hardware, and it provides the
capability to process distributed files of a variety of types with simple but
scalable programming models.
Hadoop is not a database. It is a file system. Anything that can be stored as a file
can be placed in a Hadoop repository.
Origin of Hadoop
Hadoop originated from Google’s white paper series on BigTable, MapReduce,
and GFS. Google published white papers on these topics but never released its
code or implementation.
Later on, Yahoo and many other contributors implemented the ideas from
Google’s white papers, and Hadoop came to life.
Doug Cutting, Hadoop’s creator, named the framework after his child’s stuffed
toy elephant.
Facts about Hadoop
Hadoop is one of the most popular environments for working with Big Data and
solving Big Data problems.
Hadoop’s cores are the Hadoop Distributed File System (HDFS) and MapReduce.
The Hadoop framework itself is mostly written in the Java programming
language, with some native code in C and command-line utilities written as
shell scripts.
When you have a very large dataset that cannot fit on a single machine, you
should consider Hadoop and its cloud deployment.
When you have very large data to be processed within a very short amount of
time, you should also consider Hadoop and its cloud deployment.
Why Use Hadoop?
Economical (Cost Effective):
No license fees: it is an open-source framework.
The base distributions of Hadoop from vendors such as Cloudera, Hortonworks, and MapR are also free.
Flexible:
Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources.
Data from multiple sources can be joined, enabling deeper analyses than any one system can provide.
Scalable:
New nodes can be added without needing to change data formats, how data is loaded, how jobs are written, or the applications
on top.
You can add as many nodes as you want, depending on the data size and business requirements.
Can process really large data (petabytes)
Solves the Big Data 3Vs problem:
Refer to chapter one for the Big Data 3Vs problem (volume, variety, velocity)
Reliable and Fault Tolerant:
When you lose a node, the system redirects work to another location of the data and continues processing without missing a
beat.
Handles replication
Smart
Optimized Compression
Query Expression
Hadoop Distributed File System
HDFS is a block-structured file system in which
1. Individual files are split into blocks of a fixed size.
2. These blocks are stored across a cluster of one or more machines, each
with a fixed data storage capacity.
3. Individual machines in the cluster are called DataNodes.
4. HDFS distributes data/files across the DataNodes.
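The splitting and distribution described above can be sketched as a toy Python model. The 64 MB default block size matches Hadoop 1.x; the node names and round-robin placement are simplifications of HDFS’s real placement policy:

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Split a file into fixed-size blocks, HDFS-style; the last block may be smaller."""
    blocks = []
    offset = 0
    while offset < file_size_mb:
        blocks.append(min(block_size_mb, file_size_mb - offset))
        offset += block_size_mb
    return blocks

def assign_to_datanodes(blocks, datanodes):
    """Distribute blocks across DataNodes round-robin (a simplified placement policy)."""
    placement = {node: [] for node in datanodes}
    for i, size in enumerate(blocks):
        placement[datanodes[i % len(datanodes)]].append((i, size))
    return placement

blocks = split_into_blocks(200)  # a 200 MB file -> blocks of [64, 64, 64, 8] MB
layout = assign_to_datanodes(blocks, ["dn1", "dn2", "dn3"])
```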
HDFS Cont’d
HDFS is a block-structured file system in which
1. DataNodes are tracked by the NameNode.
2. DataNodes serve read and write requests from clients.
3. The NameNode is the admin/master of HDFS.
4. The NameNode stores metadata and runs on high-quality hardware.
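A minimal sketch of this division of labor: the NameNode holds only metadata about where each block of each file lives, while the DataNodes hold the actual bytes. The file path, block IDs, and node names below are hypothetical:

```python
# Toy in-memory model of NameNode metadata: file path -> block id -> replica locations.
namenode_meta = {
    "/logs/day1.txt": {
        0: ["dn1", "dn2", "dn3"],  # block 0 replicated on three DataNodes
        1: ["dn2", "dn3", "dn4"],  # block 1 on a different set of nodes
    },
}

def locate_block(path, block_id):
    """A client asks the NameNode where a block lives, then reads it from a DataNode."""
    return namenode_meta[path][block_id]

locations = locate_block("/logs/day1.txt", 1)
# locations == ['dn2', 'dn3', 'dn4']
```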
Checklist of installing Hadoop 2.0
On Linux OS: follow the instructions here.
On Windows OS:
1. Cygwin: a powerful Unix-style command-line environment. Cygwin is a large
collection of GNU and open-source tools that provide functionality
similar to a Linux distribution on Windows.
2. Java: the Oracle Java Development Kit (JDK), not the Java Runtime
Environment (JRE). Oracle JDK 1.7 is needed.
Install JDK on Windows OS
Download Oracle Java Development Kit (JDK) version 1.7 (1.7 is the stable
version for Hadoop so far) from the Oracle archive.
Download a compatible build of JDK 1.7, for example jdk-7u80-windows-x64.exe.
Install the JDK by executing the exe file and choose C:\ as the installation
directory, so that the Java environment path will be
C:\Java\jdk1.7.0_80
Add C:\Java\jdk1.7.0_80 to the System Path as JAVA_HOME
Add path of JDK on Windows OS
Add C:\Java\jdk1.7.0_80 to System Path as JAVA_HOME
1. Step 1: Press Windows Key + Pause/Break Key
2. Step 2: Choose Advanced System Settings
3. Step 3: Choose Environment Variables
4. Step 4: Create New System Variables by clicking New tab
5. Step 5: Fill in Variable name as `JAVA_HOME`
6. Step 6: Fill in Variable value as `C:\Java\jdk1.7.0_80`
Verify the environment variable with `echo %JAVA_HOME%` at the command
prompt; the output should be `C:\Java\jdk1.7.0_80`
Install Cygwin as Linux terminal
Download Cygwin and install it with default setup.
Select packages openssh, openssl, dos2unix, git, and vim.
Configure SSH on Cygwin
Run Cygwin as administrator.
ssh-host-config
Yes to Privilege Separation, Yes to install sshd as a service
Enter ntsec as the value of CYGWIN for the daemon.
net start sshd
ssh-user-config
Yes to create an SSH2 RSA identity file.
Yes to use this identity to log in to this machine.
No to create an SSH2 DSA identity file.
ssh -v localhost
Install Hadoop 1.0
Download Hadoop 1.0 from here.
Run Cygwin as administrator and extract Hadoop 1.2.1 with
`tar -xzf hadoop-1.2.1.tar.gz`
Create a folder named “hadoop-dir”, and inside the “hadoop-dir” folder create
two folders named “datadir” and “namedir”.
In Cygwin, execute the chmod command to change the folder permissions so that
the folders can be accessed by Hadoop.
$ chmod 755 hadoop-dir
$ cd hadoop-dir
$ chmod 755 datadir
$ chmod 755 namedir
Use vim to edit four important configuration files in Hadoop-1.2.1
Install Hadoop 1.0
Edit the hadoop-env.sh file to set the Java home, as we did earlier in the
environment variable setup.
Uncomment the line that contains “export JAVA_HOME” and provide our own
Java path.
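For example, assuming the JDK was installed to C:\Java\jdk1.7.0_80 as above, the uncommented line in hadoop-env.sh might read as follows (note Cygwin’s /cygdrive path style; your path may differ):

```shell
# Set the Java installation root for Hadoop (example path)
export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_80
```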
Install Hadoop 1.0
Edit the core-site.xml file as follows to set the network port used for the host connection.
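A typical single-node core-site.xml for Hadoop 1.x looks like the following; hdfs://localhost:9000 is the conventional choice, though the exact port on the original slide may differ:

```xml
<configuration>
  <!-- Default file system URI: the NameNode's host and port -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```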
Install Hadoop 1.0
Edit the hdfs-site.xml file as follows to set the physical storage locations
of the NameNode and DataNode.
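A typical hdfs-site.xml for this single-node setup might look like the following, pointing dfs.name.dir and dfs.data.dir at the namedir and datadir folders created earlier. The exact paths are assumptions; adjust them to wherever you created hadoop-dir:

```xml
<configuration>
  <!-- Single machine, so one copy of each block is enough -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Where the NameNode keeps its metadata (example path) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/cygdrive/c/hadoop-dir/namedir</value>
  </property>
  <!-- Where the DataNode keeps the actual blocks (example path) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/cygdrive/c/hadoop-dir/datadir</value>
  </property>
</configuration>
```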
Install Hadoop 1.0
Edit the mapred-site.xml file as follows to set the host and port for the JobTracker.
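For Hadoop 1.x, mapred-site.xml conventionally points the JobTracker at localhost:9001; the port on the original slide may differ:

```xml
<configuration>
  <!-- Host and port where the JobTracker listens -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```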
Make Use of Hadoop 1.0
We first need to format the NameNode to create a Hadoop Distributed File
System (HDFS).
Open a Cygwin terminal (Run as Administrator) and execute the following
commands
$ cd hadoop-1.2.1
$ bin/hadoop namenode -format
This command will run for a while. We should see the message “Storage
Directory has been successfully formatted”
Make Use of Hadoop 1.0
Start Hadoop Daemons:
Once the file system has been created, the next step is to check and start the Hadoop cluster
daemons: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.
Restart the Cygwin terminal and execute the command below to start all daemons on the Hadoop
cluster.
$ bin/start-all.sh
This command will start all the services in the cluster, and you now have your Hadoop cluster
running.
Stop Hadoop Daemons:
To stop all the daemons, we can execute the command
$ bin/stop-all.sh
Next we browse the web interfaces of the NameNode (http://localhost:50070 by default in
Hadoop 1.x) and the JobTracker (http://localhost:50030) to see the details.
Visualize NameNode and DataNodes
for Hadoop 1.0
Track HDFS for Hadoop 1.0
Compare Hadoop 1.0 with 2.0
Install Apache Maven
Download Maven-3.5.2 binaries from here
Unzip the gz package in Cygwin with `tar xzf apache-maven-3.5.2-bin.tar.gz`
Copy and paste the unzipped folder to C:\
Add the Maven folder directory to the `path` system variable
Install Apache Hadoop 2.0
Download Hadoop-2.7.5-src source file from here.
Unzip the gz package in Cygwin
Copy and paste the unzipped files to C:\hdfs
Build the Hadoop executables by using the Maven command
mvn package -Pdist -DskipTests -Dtar
Details will be given in the part II slides. Try to install Hadoop 1.0 and
then 2.0 as homework for this week.
Examples of using Hadoop 2.0 for file analysis will be in part II as well.
Install Python via Anaconda
Anaconda is a free and open-source distribution of Python and R that also
includes an Integrated Development Environment (IDE). You can run it on your
desktop with an OS such as Windows, Mac, or Linux.