
Experiment Number : Date :

Aim: To implement the following data structures using Python:


a) Linked Lists b) Stacks c) Queues d) Set e) Map
Description:
Single Linked Lists:
A linked list is a sequence of data structures, which are connected together via links.
Linked List is a sequence of links which contains items. Each link contains a connection to
another link. Linked list is the second most-used data structure after array. Following are the
important terms to understand the concept of Linked List.
o Link - Each link of a linked list can store a data item called an element.
o Next - Each link of a linked list contains a link to the next link, called Next.
o LinkedList - A linked list contains the connection link to the first link, called First.
Doubly Linked List:
Doubly Linked List is a variation of the linked list in which navigation is possible in both
directions, forward and backward, which a singly linked list does not allow. Following are the
important terms to understand the concept of a doubly linked list.
o Link - Each link of a linked list can store a data item called an element.
o Next - Each link of a linked list contains a link to the next link, called Next.
o Prev - Each link of a linked list contains a link to the previous link, called Prev.
o LinkedList - A linked list contains the connection link to the first link, called First, and to
the last link, called Last.
Stacks:
A stack is an Abstract Data Type (ADT), commonly used in most programming languages. It
is named stack as it behaves like a real-world stack, for example a deck of cards or a pile of
plates, etc.
A real-world stack allows operations at one end only. For example, we can place or remove a
card or plate from the top of the stack only. Likewise, Stack ADT allows all data operations at
one end only. At any given time, we can only access the top element of a stack.
This feature makes it a LIFO data structure. LIFO stands for Last-In-First-Out. Here, the
element which is placed (inserted or added) last is accessed first. In stack terminology, the
insertion operation is called PUSH and the removal operation is called POP.
Queue:
Queue is an abstract data structure, somewhat similar to Stacks. Unlike stacks, a queue is
open at both its ends. One end is always used to insert data (enqueue) and the other is used to
remove data (dequeue). Queue follows First-In-First-Out methodology, i.e., the data item
stored first will be accessed first.
A real-world example of a queue is a single-lane one-way road, where the vehicle that enters
first exits first. More real-world examples can be seen as queues at ticket windows and
bus stops.
Set:
The data type "set", which is a collection type, has been part of Python since version 2.4. A
set contains an unordered collection of unique and immutable objects. The set data type is, as
the name implies, a Python implementation of the sets known from mathematics. This explains
why sets, unlike lists or tuples, can't have multiple occurrences of the same element.
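
For example, duplicates are silently dropped when a set is built (a quick illustration):

print set([1, 2, 2, 3, 3, 3])   # prints set([1, 2, 3])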


Map:
Hash Table is a data structure which stores data in an associative manner. In a hash table,
data is stored in an array format, where each data value has its own unique index value.
Access of data becomes very fast if we know the index of the desired data.
Thus, it becomes a data structure in which insertion and search operations are very fast
irrespective of the size of the data. A hash table uses an array as its storage medium and uses a
hash technique to generate the index at which an element is to be inserted or located.
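
As a rough illustration of the idea (a minimal sketch only; Python's built-in dict is the practical
implementation and works differently internally), a hash table can be modelled as an array of
buckets indexed by hash(key) % capacity:

class HashTable(object):
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]   # chaining handles collisions

    def put(self, key, value):
        index = hash(key) % self.capacity              # hash technique -> array index
        bucket = self.buckets[index]
        for pair in bucket:
            if pair[0] == key:
                pair[1] = value                        # update an existing key
                return
        bucket.append([key, value])

    def get(self, key):
        index = hash(key) % self.capacity
        for k, v in self.buckets[index]:
            if k == key:
                return v
        return None

table = HashTable()
table.put('a', 1)
table.put('b', 2)
print table.get('a'), table.get('b'), table.get('missing')   # 1 2 None
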
Programs:
a) Single Linked List:
class Node(object):
    def __init__(self, data, next):
        self.data = data
        self.next = next

class SingleList(object):
    head = None
    tail = None

    def show(self):
        print "Showing list data:"
        current_node = self.head
        while current_node is not None:
            print current_node.data, " -> ",
            current_node = current_node.next
        print None

    def append(self, data):
        node = Node(data, None)
        if self.head is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node

    def remove(self, node_value):
        current_node = self.head
        previous_node = None
        while current_node is not None:
            if current_node.data == node_value:
                if previous_node is not None:
                    # unlink a node in the middle or at the end
                    previous_node.next = current_node.next
                else:
                    # this is the first node (head)
                    self.head = current_node.next
            # needed for the next iteration
            previous_node = current_node
            current_node = current_node.next

s = SingleList()
s.append(31)
s.append(2)
s.append(3)
s.append(4)
s.show()

s.remove(31)
s.remove(3)
s.remove(2)
s.show()
Output:

b) Doubly Linked List:


class Node(object):
    def __init__(self, data, prev, next):
        self.data = data
        self.prev = prev
        self.next = next

class DoubleList(object):
    head = None
    tail = None

    def append(self, data):
        new_node = Node(data, None, None)
        if self.head is None:
            self.head = self.tail = new_node
        else:
            new_node.prev = self.tail
            new_node.next = None
            self.tail.next = new_node
            self.tail = new_node

    def remove(self, node_value):
        current_node = self.head
        while current_node is not None:
            if current_node.data == node_value:
                # if it's not the first element
                if current_node.prev is not None:
                    current_node.prev.next = current_node.next
                    if current_node.next is not None:
                        current_node.next.prev = current_node.prev
                    else:
                        # removing the last element, so move the tail back
                        self.tail = current_node.prev
                else:
                    # otherwise it is the head: the next node becomes the head
                    # and its prev becomes None
                    self.head = current_node.next
                    if current_node.next is not None:
                        current_node.next.prev = None
            current_node = current_node.next

    def show(self):
        print "Show list data:"
        current_node = self.head
        while current_node is not None:
            print current_node.prev.data if hasattr(current_node.prev, "data") else None,
            print current_node.data,
            print current_node.next.data if hasattr(current_node.next, "data") else None
            current_node = current_node.next
        print "*" * 50

d = DoubleList()
d.append(5)
d.append(6)
d.append(50)
d.append(30)
d.show()
d.remove(50)
d.remove(5)
d.show()

Output:


c) (i) Stack using arrays:


class Stack:
    def __init__(self):
        self.items = []

    def isEmpty(self):
        return self.items == []

    def push(self, item):
        self.items.append(item)

    def pop(self):
        return self.items.pop()

    def peek(self):
        return self.items[len(self.items) - 1]

    def size(self):
        return len(self.items)

s = Stack()
s.push(5)
s.push(10)
print "size=", s.size()
print "popped element=", s.pop()
print "size=", s.size()
print "top element=", s.peek()
print "is empty=", s.isEmpty()
print "popped element=", s.pop()
print "is empty=", s.isEmpty()

Output:


(ii) Stack using Linked List:


class node:
    def __init__(self):
        self.data = None
        self.next = None

class stack:
    def __init__(self):
        self.cur_node = None

    def add_node(self, data):
        new_node = node()
        new_node.data = data
        new_node.next = self.cur_node
        self.cur_node = new_node        # the new node becomes the top of the stack

    def list_print(self):
        node = self.cur_node            # start from the top of the stack
        while node:
            print node.data
            node = node.next

ll = stack()
ll.add_node(1)
ll.add_node(2)
ll.add_node(3)
ll.list_print()

Output:

d) Queue:
class Queue:
    def __init__(self):
        self.items = []

    def isEmpty(self):
        return self.items == []

    def enqueue(self, item):
        self.items.insert(0, item)

    def dequeue(self):
        return self.items.pop()

    def size(self):
        return len(self.items)

q = Queue()
q.enqueue(1)
q.enqueue(2)
print "size=", q.size()
print "dequeued element=", q.dequeue()
print "size=", q.size()
print "isempty=", q.isEmpty()
print "dequeued element=", q.dequeue()
print "isempty=", q.isEmpty()

Output:

e) Set:

set1 = set()
set2 = set()
for i in range(1, 6):
    set1.add(i)
for i in range(4, 9):
    set2.add(i)
print "set1:", set1
print "set2:", set2
print "union:", set1 | set2
print "intersection:", set1 & set2
print "symmetric difference:", set1 ^ set2


Output:

f) Map:

mydict = {'a': 1, 'b': 2, 'c': 3}
print "keys:", mydict.keys()
print "values:", mydict.values()
mydict['a'] = 'one'
mydict['b'] = 4
print mydict
mydict.clear()
print mydict
print mydict.has_key('a')

Output:



Experiment Number : Date :

LINUX COMMANDS
1. CP:
Description: The cp command is used to make copies of files and directories.
Syntax: cp [OPTION] SOURCE DEST
Example: Make a copy of a file in the same directory:
$ cp originalfile newfile

2. MV:
Description: mv renames SOURCE to DEST, or moves the SOURCE file (or files) to
DIRECTORY.
Syntax: mv [OPTION] SOURCE DEST
Example: Renaming a file whose name contains spaces (the backslashes escape the spaces):

$ mv computer\ hope.txt computer\ hope\ 2.txt

3. CHMOD:
Description: chmod is used to change the permissions of files or directories.
Syntax: chmod [OPTION] permissions filename
Example:
$ chmod 754 myfile.txt

Here the digits 7, 5, and 4 each individually represent the permissions for the user, group,
and others, in that order. Each digit is a sum of the numbers 4, 2, 1, and 0:

4 stands for "read",
2 stands for "write",
1 stands for "execute", and
0 stands for "no permission."

So 754 gives the user read, write and execute (4+2+1=7), the group read and execute (4+1=5),
and others read only (4).

4. USERADD:
Description: useradd is a low-level utility for adding users to a system. In general, the friendlier
adduser should be used instead.
Syntax: useradd [OPTION] username
Example:
$ useradd newperson

Creates newperson as a new user. Once the new user has been added, you would need to use the
passwd command to assign a password to the account. Once a user has been created, you can
modify any of the user settings, such as the user's home directory, using the usermod command.

groupadd Add a group to the system.


passwd Change a user's password.
userdel Remove a user from the system.
usermod Modify a user's account.

Add an existing user to a group:


Syntax: usermod -a -G examplegroup exampleusername

Example:
For example, to add the user geek to the group sudo, use the following command:
usermod -a -G sudo geek

5. CHOWN:

Description:
chown changes the user or group ownership of each given file. If only an owner (a user
name or numeric user ID) is given, that user is made the owner of each given file, and the files'
group is not changed. If the owner is followed by a colon and a group name (or numeric group ID),
with no spaces between them, the group ownership of the files is changed as well. If a colon but
no group name follows the user name, that user is made the owner of the files and the group of
the files is changed to that user's login group. If the colon and group are given, but the owner is
omitted, only the group of the files is changed; in this case, chown performs the same function as
chgrp. If only a colon is given, or if the entire operand is empty, neither the owner nor the group
is changed.

Example:
chown -R hope /files/work

Recursively grant ownership of the directory /files/work, and all files and subdirectories,
to user hope.



Experiment Number : Date :

Aim : To find the count of each word in the file given using hadoop and python script.
Procedure :
1. Start hadoop cluster and insert the input files into the hadoop distributed file
system(hdfs)
2. Run the mapper and reducer using the hadoop streaming jar and capture the output.

Program :
Inserting the files from the local file system to HDFS.

Mapper.py :

#!/usr/bin/env python
import sys

# read lines from standard input and emit "word<TAB>1" for every word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)

Reducer.py :
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# sum the counts for each word; the streaming framework sorts the mapper
# output, so all occurrences of a word arrive consecutively
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

input file:

Give executable permissions to the mapper and reducer files using the following
command:

$ chmod +x <path to mapper or reducer file>

Eg: $ chmod +x /home/hduser/Documents/hadoop/codes/mapper.py


$ chmod +x /home/hduser/Documents/hadoop/codes/reducer.py
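
Before submitting the job, the word count logic can also be checked locally with plain Python
(a rough sanity check only; wcinput.txt is an assumed local copy of the input file):

from collections import Counter

# count the words in memory, exactly as the mapper/reducer pipeline would
counts = Counter()
with open('wcinput.txt') as f:
    for line in f:
        counts.update(line.split())

for word, count in sorted(counts.items()):
    print '%s\t%s' % (word, count)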

Download the streaming jar from the following link :

https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-streaming/2.7.1


Executing the word count job on Hadoop, the command is :

root@kb:/home/kb# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar \
  -files /home/kb/Desktop/mapper.py,/home/kb/Desktop/reducer.py \
  -mapper /home/kb/Desktop/mapper.py -reducer /home/kb/Desktop/reducer.py \
  -input /pcode/wcinput.txt \
  -output /output

Output :



Experiment Number : Date :

Aim : To find the maximum temperature from the available weather data (semi-structured and
record-oriented).
Procedure :
1. Start the Hadoop cluster and insert the input files into the Hadoop Distributed File
System (HDFS).
2. Run the mapper and reducer using the Hadoop streaming jar and capture the output.

Program :
Inserting the files from the local file system to HDFS.

Sample Input:

0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999

The field 00781 near the end is the temperature reading: 0078 is the temperature in tenths of
a degree Celsius (7.8 degrees) and the trailing 1 is a quality code.

Mapper.py:

#!/usr/bin/env python

import sys
import os

# the last four characters of the input file name give the year (e.g. 1901)
filename = os.environ["map_input_file"][-4:]
for line in sys.stdin:
    line = line.strip()                 # remove leading and trailing whitespace
    last = line.split('-')[-1]          # text after the last '-' starts with the temperature
    temp = last.split('+')[0][:-1]      # drop the trailing quality digit
    temp = float(temp) / 10
    print '%s\t%s\t%s' % (temp, 1, filename)
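
Traced on the sample record shown above, the extraction proceeds as follows (an illustrative
sketch, with the record string abbreviated):

line = "...N0000001N9-00781+99999102001ADDGF108991999999999999999999"
last = line.split('-')[-1]        # "00781+99999102001ADDGF..."
temp = last.split('+')[0][:-1]    # "0078" (quality digit dropped)
print float(temp) / 10            # 7.8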


Reducer.py:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from operator import itemgetter

import sys

current_temp = None
total = 0.0
max_temp_1901 = 0.0
min_temp_1901 = 0.0
max_temp_1902 = 0.0
min_temp_1902 = 0.0
total_1901 = 0.0
total_1902 = 0.0
records_1901 = 0
records_1902 = 0
records = 0
max_temp = 0.0
min_temp = 0.0

for line in sys.stdin:
    line = line.strip()
    (current_temp, count, year) = line.split('\t')
    current_temp = float(current_temp)
    if year == "1901":
        total_1901 += current_temp
        records_1901 += 1
        if current_temp > max_temp_1901:
            max_temp_1901 = current_temp
        if current_temp < min_temp_1901:
            min_temp_1901 = current_temp

    if year == "1902":
        records_1902 += 1
        total_1902 += current_temp
        if current_temp > max_temp_1902:
            max_temp_1902 = current_temp
        if current_temp < min_temp_1902:
            min_temp_1902 = current_temp


print '--------------1901 Details-------------\n'

print "The maximum Temperature in 1901 is:", max_temp_1901
print "The minimum temperature in 1901 is:", min_temp_1901
print "The average temperature in 1901 is:", (total_1901 / records_1901)

print '\n--------------1902 Details-------------\n'

print "The maximum Temperature in 1902 is:", max_temp_1902
print "The minimum temperature in 1902 is:", min_temp_1902
print "The average temperature in 1902 is:", (total_1902 / records_1902)

if max_temp_1901 > max_temp_1902:
    max_temp = max_temp_1901
else:
    max_temp = max_temp_1902

if min_temp_1901 > min_temp_1902:
    min_temp = min_temp_1902
else:
    min_temp = min_temp_1901

total = total_1901 + total_1902
records = records_1901 + records_1902

print '\n---------------Consolidated Report---------------\n'

print "The maximum temperature is:", max_temp
print "The minimum temperature is:", min_temp
print "The average temperature is:", (total / records)

Execution:
Give executable permissions to the mapper and reducer.

Download the streaming jar from the following link :


https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-streaming/2.7.1

Executing the maximum temperature job on Hadoop, the command is :


$HADOOP_HOME/bin/hadoop jar ~/Documents/hadoop/hadoop-streaming-2.7.1.jar \
  -file ~/Documents/hadoop/codes/weather_mapper.py -mapper ~/Documents/hadoop/codes/weather_mapper.py \
  -file ~/Documents/hadoop/codes/weather_reducer.py -reducer ~/Documents/hadoop/codes/weather_reducer.py \
  -input /weather_input/* -output /weather_output

Output :



Experiment Number : Date :

Aim : To implement Matrix Multiplication with Hadoop Map Reduce.


Procedure :
1. Start the Hadoop cluster and insert the input files into the Hadoop Distributed File
System (HDFS).
2. Run the mapper and reducer using the Hadoop streaming jar and capture the output.

Program :
Inserting the files from the local file system to HDFS.

Sample Input:

0 0 1 1

The first zero is the row index of the element and the second zero is its column index in the
matrix.
The third value indicates whether the element belongs to the first matrix or the second matrix.
The fourth value is the element stored at that index of the given matrix.

Mapper.py:

#!/usr/bin/python

import sys
import re

# the mapper uses argv[1] as the number of rows of the first matrix (n)
# and argv[2] as the number of columns of the second matrix (l)
n = int(sys.argv[1])
l = int(sys.argv[2])

for line in sys.stdin:
    (i, j, m, v) = re.split("[ \t]+", line.strip())
    if m == '1':
        # element of the first matrix: emit it once for every output column c
        for c in range(0, l):
            print "%s %s %d\t%s L" % (i, j, c, v)
    else:
        # element of the second matrix: emit it once for every output row c
        for c in range(0, n):
            print "%d %s %s\t%s R" % (c, i, j, v)
Reducer.py:

#!/usr/bin/python

import sys

pKey = None     # key of the previous line
d = {}          # (row, column) -> list of partial products
pd = 1          # product of the values seen for the current key

for line in sys.stdin:
    (key, value) = line.strip().split("\t")
    arr = key.split(" ")
    (v, side) = value.split(" ")
    if pKey != None and pKey != arr:
        # a new key starts: store the product accumulated for the previous key
        try:
            d[(int(pKey[0]), int(pKey[2]))].append(pd)
        except KeyError:
            d[(int(pKey[0]), int(pKey[2]))] = [pd]
        pd = 1
    pKey = arr
    pd *= int(v)

# flush the product accumulated for the last key
if pKey != None:
    try:
        d[(int(pKey[0]), int(pKey[2]))].append(pd)
    except KeyError:
        d[(int(pKey[0]), int(pKey[2]))] = [pd]

# each cell of the result is the sum of its partial products
for key in d:
    print key, "", sum(d[key])
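
To see what the mapper and reducer are doing, here is a purely illustrative in-memory trace of
the same map/reduce logic for two hypothetical 2x2 matrices; sorting the key/value pairs stands
in for Hadoop's shuffle:

# element format: (row, col, matrix-id, value)
entries = [
    (0, 0, 1, 1), (0, 1, 1, 2), (1, 0, 1, 3), (1, 1, 1, 4),   # matrix 1
    (0, 0, 2, 5), (0, 1, 2, 6), (1, 0, 2, 7), (1, 1, 2, 8),   # matrix 2
]
n = l = 2   # rows of matrix 1, columns of matrix 2

# map step: replicate each element once per output cell it contributes to
pairs = []
for (i, j, m, v) in entries:
    if m == 1:
        for c in range(l):
            pairs.append(((i, j, c), v))
    else:
        for c in range(n):
            pairs.append(((c, i, j), v))

# shuffle + reduce step: multiply the two values sharing a key, then sum per cell
pairs.sort()
result = {}
for k in range(0, len(pairs), 2):
    (i, _, c), a = pairs[k]
    _, b = pairs[k + 1]
    result[(i, c)] = result.get((i, c), 0) + a * b

print result   # expected: {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}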

Execution:

Give executable permissions to the mapper and reducer.


Download the streaming jar from the following link :

https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-streaming/2.7.1

Executing the matrix multiplication job on Hadoop, the command is :


The command line arguments given to the mapper represent:
1. Number of rows in the matrix 1.
2. Number of columns in matrix 1 or Number of rows in matrix 2.
3. Number of columns in matrix 2.
$HADOOP_HOME/bin/hadoop jar ~/Documents/hadoop/hadoop-streaming-2.7.1.jar \
  -file ~/Documents/hadoop/codes/matrix_mapper.py \
  -mapper "/home/hduser/Documents/hadoop/codes/matrix_mapper.py 3 3 3" \
  -file ~/Documents/hadoop/codes/matrix_reducer.py \
  -reducer ~/Documents/hadoop/codes/matrix_reducer.py \
  -input /matrix_input/* -output /matrix_output


Output:

The output format of the resultant matrix is as follows :

(<row index>, <column index>)  <the value at that index>



Experiment Number : Date :

Aim: To install and run hive environment.


Description:
Apache Hive is a data warehouse infrastructure that facilitates querying and managing
large data sets which reside in a distributed storage system. It is built on top of Hadoop and was
developed by Facebook. Hive provides a way to query the data using a SQL-like query language
called HiveQL (Hive Query Language).
Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then
submitted to the Hadoop framework for execution.

Difference between Hive and SQL:

Hive looks very similar to a traditional database with SQL access. However, because
Hive is based on Hadoop and MapReduce operations, there are several key differences:
As Hadoop is intended for long sequential scans and Hive is based on Hadoop, queries
have a very high latency. This means that Hive is not appropriate for applications that need
very fast response times, as you would expect from a traditional RDBMS.
Finally, Hive is read-based and therefore not appropriate for transaction processing that
typically involves a high percentage of write operations.
Installation Process:
To install hive on ubuntu, follow the below steps to install Apache Hive on Ubuntu:

1) Download the Hive tar from
http://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz

2) Extract the downloaded tar file in the Downloads folder, using the commands:
Command: tar -xzf apache-hive-2.1.0-bin.tar.gz
Command: ls

3) Make a folder with name hive in /etc using the command :


$ sudo mkdir -p /etc/hive
4) Move the extracted tar from downloads into the folder created above, using the command:
$ sudo mv apache-hive-2.1.0-bin /etc/hive


5) Edit the .bashrc file to update the environment variables for user.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set HIVE_HOME
export HIVE_HOME=/etc/hive/apache-hive-2.1.0-bin
export PATH=$PATH:/etc/hive/apache-hive-2.1.0-bin/bin
Also, make sure that hadoop path is also set.

Run below command to make the changes work in same terminal.


Command: source .bashrc
6) Check the hive version, by typing hive in the terminal.

7) Create Hive directories within HDFS. The directory warehouse is the location to store the
table or data related to hive.
Command:
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp

8) Set read/write permissions for table.


Command:
In this command, we are giving write permission to the group:
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp

9) Set Hadoop path in hive-env.sh


Command: cd apache-hive-2.1.0-bin/
Command: gedit conf/hive-env.sh

10) Set the parameters as shown in the below snapshot.


11) Edit hive-site.xml


Command: gedit conf/hive-site.xml
Place the following data in the xml file created.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software


distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/etc/hive/apache-hive-2.1.0-
bin/metastore_db;create=true</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the
connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value/>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to
remote metastore.</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>


<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.PersistenceManagerFactoryClass</name>
<value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
<description>class implementing the jdo persistence</description>
</property>
</configuration>

12) By default, Hive uses Derby database. Initialize Derby database.


Command: bin/schematool -initSchema -dbType derby

13) Launch Hive.


Command: hive

Observation:
A hive console has opened up which suggests that hive has been successfully installed.

References:
1. https://www.edureka.co/blog/apache-hive-installation-on-ubuntu



Experiment Number : Date :

Aim: To create and drop a database in hive environment.


Description:
Create Database is a statement used to create a database in Hive. A database in Hive is a
namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause, which suppresses the error if a database with
the same name already exists. We can use SCHEMA in place of DATABASE in this command.

Drop Database is a statement that drops all the tables and deletes the database. Its
syntax is as follows:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Procedure:
1) To create a database with name userdb, use the following command:
hive > create database userdb;
Use the following command to list the databases in the hive warehouse and cross check the
name the database just created:
hive > show databases;

From the above picture we see that the database userdb has been successfully created.
2) To drop the database with the name userdb, use the following command:
hive > drop database userdb;
Use the following command to check whether the database that is dropped is not there in the
list:
hive > show databases;

References:
1. https://www.tutorialspoint.com/hive/hive_create_database.htm
2. https://www.tutorialspoint.com/hive/hive_drop_database.htm



Experiment Number : Date :

Aim: To create, alter and drop a table in hive.


Description:
Create Table is a statement used to create a table in Hive. The syntax and example are as
follows:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

To alter a table the statement takes any of the following syntaxes based on what attributes we
wish to modify in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

To remove a table from hive database the syntax is as follows:


DROP TABLE [IF EXISTS] table_name;
Procedure:
1. Let us assume you need to create a table named employee using CREATE TABLE statement.
The following table lists the fields and their data types in employee table:
Sr.No Field Name Data Type

1 Eid int

2 Name String

3 Salary Float

4 Designation string

The following data is a Comment, Row formatted fields such as Field terminator, Lines
terminator, and Stored File type.

COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE

The following query creates a table named employee using the above data.


hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already
exists.

On successful creation of table, you get to see the following response:

OK
Time taken: 5.905 seconds
hive>

To check the table creation, go to hdfs and check the name of the table created there:

As we can see, in the /user/hive/warehouse folder there is a directory with the name employee,
which denotes the table that we have created.
2. To alter a view that shows the employees list with salary above 30000 to a view that shows all
the entries in the table we can use the following command:
hive > alter view emp_30000 as select * from employee;
This view will now show all the list of employees from the table employee.


3. The following table contains the fields of employee table and it shows the fields to be
changed (in bold).

Field Name Convert from Data Type Change Field Name Convert to Data Type

eid int eid int

name String ename String

salary Float salary Double

designation String designation String

The following queries rename the column name and column data type using the above data:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;

To see the changes that happened, use the following command:

hive > describe employee;
It shows the schema of the table, as follows:

4. The following query drops a table named employee:


hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>
Use the following command to show the list of tables in the selected database. In the
pictures below we can see that the table that has been dropped is removed from the show
tables list:


hive > use default;


hive > drop table employee;
hive > show tables;

References:
1. https://www.tutorialspoint.com/hive/hive_drop_table.htm
2. https://www.tutorialspoint.com/hive/hive_create_table.htm
3. https://www.tutorialspoint.com/hive/hive_alter_table.htm



Experiment Number : Date :

Aim: To create and drop views for a table in hive.


Description:
Generally, after creating a table in SQL, we can insert data using the Insert statement. But in
Hive, we can insert data using the LOAD DATA statement.

While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are
two ways to load data: one is from local file system and second is from Hadoop file system.
The syntax for load data is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename


[PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL is identifier to specify the local path. It is optional.
OVERWRITE is optional to overwrite the data in the table.
PARTITION is optional.

You can create a view at the time of executing a SELECT statement. The syntax is as follows:

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

Use the following syntax to drop a view:


DROP VIEW view_name
Program:
1. Let us place the data of employees into the employee table created in the previous
experiment.

1201 Gopal 45000 Technical manager


1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin

The following query loads the given text into the table.

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'


OVERWRITE INTO TABLE employee;
On successful load, you get to see the following response:

OK
Time taken: 15.905 seconds
hive>


2. Let us take an example for view. Assume employee table as given below, with the fields Id,
Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who
earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
The following query retrieves the employee details using the above scenario:

hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary>30000;

3. To get the data that is in the view, we use the following statement:
hive > select * from emp_30000;
The output will be as follows:

It retrieves the data of the employees with salary>30000.


4. The following query drops a view named as emp_30000:
hive> DROP VIEW emp_30000;

References:
1. https://www.tutorialspoint.com/hive/hive_views_and_indexes.htm
2. https://www.tutorialspoint.com/hive/hive_create_table.htm



Experiment Number : Date :

Aim: To create and drop an index for a column of a table in hive.


Description:
An Index is nothing but a pointer on a particular column of a table. Creating an index means
creating a pointer on a particular column of a table. Its syntax is as follows:

CREATE INDEX index_name


ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Procedure:
1. Let us take an example for index. Use the same employee table that we have used earlier with
the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the
salary column of the employee table.

The following query creates an index:

hive> CREATE INDEX index_salary ON TABLE employee(salary) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;

It is a pointer to the salary column. If the column is modified, the changes are stored using an
index value.

To verify the creation of index, we can use the following command:


hive > show formatted index on employee;
This will list the indexes that are stored on the table employee.


2. The following query drops an index named index_salary:

hive > DROP INDEX index_salary ON employee;

We can verify the dropping of the index by again using the command:
hive > show formatted index on employee;
As the index on the table was dropped it will show an empty list.

References:
1. https://www.tutorialspoint.com/hive/hive_views_and_indexes.htm



Experiment Number : Date :

Aim: To illustrate functions in hive.


Description:

Return Type | Signature | Description
BIGINT | round(double a) | It returns the rounded BIGINT value of the double.
BIGINT | floor(double a) | It returns the maximum BIGINT value that is equal to or less than the double.
BIGINT | ceil(double a) | It returns the minimum BIGINT value that is equal to or greater than the double.
double | rand(), rand(int seed) | It returns a random number that changes from row to row.
string | concat(string A, string B, ...) | It returns the string resulting from concatenating B after A.
string | substr(string A, int start) | It returns the substring of A starting from the start position till the end of string A.
string | substr(string A, int start, int length) | It returns the substring of A starting from the start position with the given length.
string | upper(string A) | It returns the string resulting from converting all characters of A to upper case.
string | ucase(string A) | Same as above.
string | lower(string A) | It returns the string resulting from converting all characters of A to lower case.
string | lcase(string A) | Same as above.
string | trim(string A) | It returns the string resulting from trimming spaces from both ends of A.
string | ltrim(string A) | It returns the string resulting from trimming spaces from the beginning (left hand side) of A.
string | rtrim(string A) | It returns the string resulting from trimming spaces from the end (right hand side) of A.
string | regexp_replace(string A, string B, string C) | It returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
int | size(Map<K.V>) | It returns the number of elements in the map type.
int | size(Array<T>) | It returns the number of elements in the array type.
value of <type> | cast(<expr> as <type>) | It converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
string | from_unixtime(int unixtime) | Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string | to_date(string timestamp) | It returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int | year(string date) | It returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int | month(string date) | It returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int | day(string date) | It returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string | get_json_object(string json_string, string path) | It extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It returns NULL if the input json string is invalid.

Aggregate Functions:

Return Type | Signature | Description
BIGINT | count(*), count(expr) | count(*) returns the total number of retrieved rows.
DOUBLE | sum(col), sum(DISTINCT col) | It returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE | avg(col), avg(DISTINCT col) | It returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE | min(col) | It returns the minimum value of the column in the group.
DOUBLE | max(col) | It returns the maximum value of the column in the group.
A CASE expression returns a value from the THEN portion of the clause.

The ORDER BY clause is used to retrieve the details based on one column and sort the result set by
ascending or descending order.

Given below is the syntax of the ORDER BY clause:


SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

Procedure:
1. To find ceil of the value the command is:
hive > select ceil(7.7) from employee;
Output : 8

2. To find floor of the value the command is:


hive > select floor(7.7) from employee;
Output: 7

3. To print a random number the command is:


hive > select rand() from employee;
Output: 0.512997094481562


4. To concatenate two strings the command is:

hive > select concat('hadoop and ', 'hive') from employee;
Output: hadoop and hive

5. To convert a string to upper case use the following command:

hive > select ucase('hadoop') from employee;
Output: HADOOP

6. To get a substring of a string the command is:

hive > select substr('hadoop', 2, 3) from employee;
Output: ado

7. To convert a string to a date the command is:

hive > select to_date('1996-10-11 00:00:00') from employee;
Output: 1996-10-11


8. Use the case statement as follows to segregate the salaries into low, middle and high as <40000,
=40000, >40000 respectively by using the following statements:

select name,salary, case when salary <40000 then 'low' when salary=40000 then 'middle' when
salary>40000 then 'high' else 'very high' end as salary_bracket from employee;
Output:

9. To count the number of rows in the table use the following command:
hive > select count(*) from employee;
Output:

10. To sort the entries in the employee table in descending order based on salary using the order by
clause, use as follows:
hive > select * from employee e order by e.salary desc;
Output:

References:
1. https://www.tutorialspoint.com/hive/hive_built_in_functions.htm
2. https://www.guru99.com/hive-user-defined-functions.html
3. http://hadooptutorial.info/hive-functions-examples/



Experiment Number : Date :

Aim: To install and run pig.

Description:
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger
sets of data representing them as data flows. Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Pig.

Procedure:
1. Create a directory named pig in /etc.
2. Download pig-0.17.0.tar.gz from the Pig release index. (Download link:
http://www-eu.apache.org/dist/pig/pig-0.17.0/)

3. Extract the tar file using the following command:


$ tar -xvzf pig-0.17.0.tar.gz

4. Move the extracted directory to the directory created in etc in step 1 using the following command:
$ sudo mv pig-0.17.0/* /etc/pig/
5. Go to the home directory using the command:
$ cd ~
6. Open the bashrc file using any editor:
$ gedit ~/.bashrc


7. Add the following environment variables to the file:


export PIG_HOME=/etc/pig
export PIG_CONF_DIR=$PIG_HOME/conf
export PIG_CLASS_PATH=$PIG_CONF_DIR
export PATH=$PIG_HOME/bin:$PATH

8. Reload the .bashrc file using the command:


$ source ~/.bashrc
9. Run the pig environment using the command:
$ pig -x local
Now, we can see the grunt console opened.

References:
1. https://www.tutorialspoint.com/apache_pig/apache_pig_installation.htm
2. http://www-eu.apache.org/dist/pig/pig-0.17.0/



Experiment Number : Date :

Aim: To sort the given data using the pig of Hadoop ecosystem.

Description:
The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more
fields.
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);

Procedure:
1. Take an input file which consists of some structured data (let the name be student_details.txt).

2. Create a directory in HDFS and place the file in it using the following commands:
$HADOOP_HOME/bin/hadoop fs -mkdir /pig_data
$HADOOP_HOME/bin/hadoop fs -put ~/Documents/hadoop/codes/Pig/student_details.txt /pig_data/

3. Load the file into pig with the relation name student_details as shown below:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);

4. Let us now sort the relation in a descending order based on the age of the student and store it into another
relation named order_by_data using the ORDER BY operator as shown below.
grunt> order_by_data = ORDER student_details BY age DESC;


Output:
Verify the relation order_by_data using the DUMP operator as shown below:

grunt> Dump order_by_data;



Experiment Number : Date :

Aim: To group the given data using an attribute in pig of Hadoop ecosystem.

Description:
The GROUP operator is used to group the data in one or more relations. It collects the data having the same
key.
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;

Procedure:
1. Take the student_details file used in the previous experiment and load it into pig using the following com-
mand:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
2. Group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;

Output:
Verify the relation group_data using the DUMP operator as shown below:

grunt> Dump group_data;



Experiment Number : Date :

Aim: To project the data in different formats using the operators in pig of Hadoop ecosystem.

Description:
The FOREACH operator is used to generate specified data transformations based on the column data.
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

The LIMIT operator is used to get a limited number of tuples from a relation.
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required number of tuples;
Procedure:
1. Take the student_details file used in the previous experiment and load it into pig using the following com-
mand:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
2. Using the FOREACH operator, project only the id, age and city fields of each record using the
following command:
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

3. Dump the foreach data using the following command:


grunt>dump foreach_data;

4. Using the LIMIT operator, project the first four records of the student_details relation using the
following command:
grunt> limit_data = LIMIT student_details 4;
5. Dump the limit_data using the command:
grunt>dump limit_data;



Experiment Number : Date :

Aim: To filter the data using the pig of Hadoop ecosystem.

Description:
The FILTER operator is used to select the required tuples from a relation based on a condition.
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);

Procedure:
1. Take the student_details file used in the previous experiment and load it into pig using the following com-
mand:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
2. Filter the file to get students from Chennai using the following command:
grunt> filter_data = FILTER student_details BY city == 'Chennai';

Output:
To get the output, dump the filter_data using the command:
grunt>dump filter_data;



Experiment Number : Date :

Aim: To use joins like self, inner and outer joins in pig of Hadoop ecosystem.

Description:
Self Join:
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming
at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under
different aliases (names).
Given below is the syntax of performing self-join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;


Inner Join:
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows
when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B) based upon the
join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy
the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows
of A and B are combined into a result row.
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Outer Join:
Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation
is carried out in three ways:
Left outer join
Right outer join
Full outer join
Left Outer Join:
The left outer Join operation returns all rows from the left table, even if there are no matches in
the right relation.
Given below is the syntax of performing left outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Right Outer Join
The right outer join operation returns all rows from the right table, even if there are no matches in
the left table.
Given below is the syntax of performing right outer join operation using the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Full Outer Join:
The full outer join operation returns rows when there is a match in one of the relations.
Given below is the syntax of performing full outer join using the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Procedure:
1. Create the customer and orders files as follows:
customers.txt:
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00


3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

2. Place the files created into HDFS, using the following commands:
$HADOOP_HOME/bin/hadoop fs -put ~/Documents/Hadoop/codes/Pig/customers.txt /pig_data/
$HADOOP_HOME/bin/hadoop fs -put ~/Documents/Hadoop/codes/Pig/orders.txt /pig_data/

3. Load the files into pig with relations customers and orders as shown below:
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);

Self Join:
1. Self-join needs the same data under two aliases, so first load customers.txt again into two
relations, customers1 and customers2 (each with the same schema as customers above), and then
perform the self-join operation by joining them on id as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;


2. Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;

It will produce the following output, displaying the contents of the relation customers.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join:
1. Perform inner join operation on the two relations customers and orders as shown below.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

2. Verify the relation customer_orders using the DUMP operator as shown below.
grunt> Dump customer_orders;
We will get the following output, showing the contents of the relation named customer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Left Outer Join:


1. Perform left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;


2. Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
It will produce the following output, displaying the contents of the relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join:


1. Perform right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

2. Verify the relation outer_right using the DUMP operator as shown below.
grunt> Dump outer_right;

It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join:

1. Perform full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;


2. Verify the relation outer_full using the DUMP operator as shown below.
grunt> Dump outer_full;

It will produce the following output, displaying the contents of the relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

References:
1. https://www.tutorialspoint.com/apache_pig/apache_pig_join_operator.htm
2. https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#JOIN+%28inner%29
