Map:
A hash table is a data structure which stores data in an associative manner. In a hash table,
data is stored in an array format, where each data value has its own unique index value.
Access to data becomes very fast if we know the index of the desired data.
Thus, it is a data structure in which insertion and search operations are very fast
irrespective of the size of the data. A hash table uses an array as the storage medium and
applies a hashing technique to generate the index at which an element is to be inserted or
located.
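In Python, the built-in dict type is a hash table. The short example below is illustrative only (it is not one of the lab programs) and shows how a key is hashed to locate its slot in roughly constant time:
phone_book = {}
phone_book["alice"] = 4567       # the key "alice" is hashed to an array slot
phone_book["bob"] = 1234
print phone_book["alice"]        # lookup hashes "alice" again and reads the slot directly
print "alice" in phone_book      # membership test is also O(1) on average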
Programs:
a) Single Linked List:
class Node(object):
    def __init__(self, data, next):
        self.data = data
        self.next = next

class SingleList(object):
    head = None
    tail = None

    def show(self):
        print "Showing list data:"
        current_node = self.head
        while current_node is not None:
            print current_node.data, " -> ",
            current_node = current_node.next
        print None
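The append and remove methods called by the driver below are not included in this listing. A minimal sketch of what they could look like, assuming the usual singly linked list behaviour (indented as methods of SingleList):
    def append(self, data):
        # link the new node after the current tail
        node = Node(data, None)
        if self.head is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node

    def remove(self, data):
        # unlink the first node that holds the given data
        previous, current = None, self.head
        while current is not None:
            if current.data == data:
                if previous is None:
                    self.head = current.next
                else:
                    previous.next = current.next
                if current is self.tail:
                    self.tail = previous
                return
            previous, current = current, current.next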
s = SingleList()
s.append(31)
s.append(2)
s.append(3)
s.append(4)
s.show()
s.remove(31)
s.remove(3)
s.remove(2)
s.show()
Output:
b) Double Linked List:
class DoubleList(object):
    head = None
    tail = None

    def show(self):
        print "Show list data:"
        current_node = self.head
        while current_node is not None:
            # print the previous, current and next data values for each node
            print current_node.prev.data if hasattr(current_node.prev, "data") else None,
            print current_node.data,
            print current_node.next.data if hasattr(current_node.next, "data") else None
            current_node = current_node.next
        print "*" * 50
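As with the singly linked list, the append and remove methods used by the driver below are not part of this extract. A minimal sketch, assuming a hypothetical node type DNode that carries data, prev and next attributes (the two methods are indented as methods of DoubleList):
class DNode(object):
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

    def append(self, data):
        # attach the new node after the current tail
        node = DNode(data)
        if self.head is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node

    def remove(self, data):
        # unlink the first node holding the given data and repair both links
        current = self.head
        while current is not None:
            if current.data == data:
                if current.prev is None:
                    self.head = current.next
                else:
                    current.prev.next = current.next
                if current.next is None:
                    self.tail = current.prev
                else:
                    current.next.prev = current.prev
                return
            current = current.next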
d = DoubleList()
d.append(5)
d.append(6)
d.append(50)
d.append(30)
d.show()
d.remove(50)
d.remove(5)
d.show()
Output:
c) Stack:
class Stack:
    def __init__(self):
        self.items = []
    def push(self, item):
        self.items.append(item)
    def isEmpty(self):
        return self.items == []
    def pop(self):
        return self.items.pop()
    def peek(self):
        return self.items[len(self.items)-1]
    def size(self):
        return len(self.items)
s = Stack()
s.push(5)
s.push(10)
print "size=",s.size()
print "popped element=",s.pop()
print "size=",s.size()
print "top element=",s.peek()
print "is empty=",s.isEmpty()
print "popped element=",s.pop()
print "is empty=",s.isEmpty()
Output:
class stack:
    def __init__(self):
        self.cur_node = None
    def add_node(self, data):
        # push: the new node points at the previous top of the stack
        new_node = Node(data, self.cur_node)   # reuses the Node class from program (a)
        self.cur_node = new_node
    def list_print(self):
        node = self.cur_node   # traversal starts from the top of the stack
        while node:
            print node.data
            node = node.next
ll = stack()
ll.add_node(1)
ll.add_node(2)
ll.add_node(3)
ll.list_print()
Output:
d) Queue:
class Queue:
    def __init__(self):
        self.items = []
    def isEmpty(self):
        return self.items == []
    def enqueue(self, item):
        # items enter at the front of the list
        self.items.insert(0, item)
    def dequeue(self):
        # items leave from the rear, giving first-in first-out order
        return self.items.pop()
    def size(self):
        return len(self.items)
q = Queue()
q.enqueue(1)
q.enqueue(2)
print "size=",q.size()
print "dequeued element=",q.dequeue()
print "size=",q.size()
print "isempty=",q.isEmpty()
print "dequeued element=",q.dequeue()
print "isempty=",q.isEmpty()
Output:
e) Set:
set1 = set()
set2 = set()
for i in range(1, 6):
    set1.add(i)
for i in range(4, 9):
    set2.add(i)
print "set1:", set1
print "set2:", set2
print "union:", set1 | set2
print "intersection:", set1 & set2
print "symmetric difference:", set1 ^ set2
Output:
f) Dictionary:
mydict = {'a':1, 'b':2, 'c':3}
print "keys:",mydict.keys()
print "values:",mydict.values()
mydict['a'] = 'one'
mydict['b'] = 4
print mydict
mydict.clear()
print mydict
print mydict.has_key('a')
Output:
LINUX COMMANDS
1. CP:
Description: The cp command is used to make copies of files and directories.
Syntax: cp [OPTION] SOURCE DEST
Example: Make a copy of a file into the same directory:
$ cp originalfile copyfile
2. MV:
Description: mv renames file SOURCE to DEST, or moves the SOURCE file (or files) to
DIRECTORY.
Syntax: mv [OPTION] SOURCE DEST
Example: Moving a file from one directory to another directory:
$ mv computer/hope.txt computer/hope/2.txt
3. CHMOD:
Description: chmod is used to change the permissions of files or directories.
Syntax: chmod [OPTION] permissions filename
Example:
$ chmod 754 myfile.txt
Here the digits 7, 5, and 4 each individually represent the permissions for the user, group,
and others, in that order. Each digit is the sum of the values 4 (read), 2 (write) and 1 (execute),
with 0 meaning no permission: 7 (4+2+1) gives the user read, write and execute, 5 (4+1) gives
the group read and execute, and 4 gives others read-only access.
4. USERADD:
Description: useradd is a low-level utility for adding users to a system. In general, the friendlier
adduser should be used instead.
Syntax: useradd [OPTION] username
Example:
$ useradd newperson
Creates newperson as a new user. Once the new user has been added, you would need to use the
passwd command to assign a password to the account. Once a user has been created, you can
modify any of the user settings, such as the user's home directory, using the usermod command.
5. CHOWN:
Description:
chown changes the user or group ownership of each given file. If only an owner (a user
name or numeric user ID) is given, that user is made the owner of each given file, and the files'
group is not changed. If the owner is followed by a colon and a group name (or numeric group ID),
with no spaces between them, the group ownership of the files is changed as well. If a colon but
no group name follows the user name, that user is made the owner of the files and the group of
the files is changed to that user's login group. If the colon and group are given, but the owner is
omitted, only the group of the files is changed; in this case, chown performs the same function as
chgrp. If only a colon is given, or if the entire operand is empty, neither the owner nor the group
is changed.
Example:
$ chown -R hope /files/work
Recursively grant ownership of the directory /files/work, and all files and subdirectories,
to user hope.
Aim : To find the count of each word in the given file using Hadoop and a Python script.
Procedure :
1. Start the Hadoop cluster and insert the input files into the Hadoop distributed file
system (HDFS).
2. Run the mapper and reducer using the hadoop streaming jar and capture the output.
Program :
Inserting the files from the local file system to HDFS.
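For example (the local and HDFS paths below are only placeholders; adjust them to your setup):
$ hdfs dfs -mkdir -p /wordcount_input
$ hdfs dfs -put ~/wordcount/input.txt /wordcount_input/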
Mapper.py :
#!/usr/bin/env python
import sys

# read lines from standard input and emit a (word, 1) pair for every word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
Reducer.py :
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Hadoop streaming delivers the mapper output sorted by word
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore the line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # a new word begins: emit the total for the previous word
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
input file:
Give executable permissions to the mapper and reducer files using the following command:
$ chmod +x Mapper.py Reducer.py
The Hadoop streaming jar (version 2.7.1) can be downloaded from:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-streaming/2.7.1
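A typical invocation of the streaming job then looks like the following; the jar location and the input and output directories are illustrative and depend on your installation:
$ hadoop jar hadoop-streaming-2.7.1.jar \
    -file Mapper.py -mapper Mapper.py \
    -file Reducer.py -reducer Reducer.py \
    -input /wordcount_input -output /wordcount_output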
Output :
Aim : To find the maximum temperature from the weather data available. (Semi
Structured and record oriented).
Procedure :
1. Start the Hadoop cluster and insert the input files into the Hadoop distributed file
system (HDFS).
2. Run the mapper and reducer using the hadoop streaming jar and capture the output.
Program :
Inserting the files from the local file system to HDFS.
Sample Input:
0029029070999991901010106004+64333+023450FM-
12+000599999V0202701N015919999999N0000001N9-
00781+99999102001ADDGF108991999999999999999999
The highlighted value 00781 is the temperature field: 0078 tenths of a degree (7.8 degrees Celsius) followed by a trailing quality-code digit, which the mapper strips off.
Mapper.py:
#!/usr/bin/env python
import sys
import os

# the last four characters of the input file name give the year (e.g. 1901 or 1902)
filename = os.environ["map_input_file"][-4:]
for line in sys.stdin:
    line = line.strip()                 # remove leading and trailing whitespace
    last = line.split('-')[-1]          # text after the last '-' starts with the temperature
    temp = last.split('+')[0][:-1]      # drop the trailing quality-code digit
    temp = float(temp) / 10             # the reading is in tenths of a degree Celsius
    print '%s\t%s\t%s' % (temp, 1, filename)
Reducer.py:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

# per-year statistics for the 1901 and 1902 input files
max_temp_1901 = None
min_temp_1901 = None
total_1901 = 0.0
records_1901 = 0
max_temp_1902 = None
min_temp_1902 = None
total_1902 = 0.0
records_1902 = 0

# each mapper line has the form: temperature <tab> 1 <tab> year
for line in sys.stdin:
    temp, count, year = line.strip().split('\t')
    current_temp = float(temp)
    if year == "1901":
        records_1901 += 1
        total_1901 += current_temp
        if max_temp_1901 is None or current_temp > max_temp_1901:
            max_temp_1901 = current_temp
        if min_temp_1901 is None or current_temp < min_temp_1901:
            min_temp_1901 = current_temp
    if year == "1902":
        records_1902 += 1
        total_1902 += current_temp
        if max_temp_1902 is None or current_temp > max_temp_1902:
            max_temp_1902 = current_temp
        if min_temp_1902 is None or current_temp < min_temp_1902:
            min_temp_1902 = current_temp

# report the maximum, minimum and average temperature seen for each year
avg_1901 = total_1901 / records_1901 if records_1901 else 0.0
avg_1902 = total_1902 / records_1902 if records_1902 else 0.0
print "1901\tmax=%s\tmin=%s\tavg=%.2f" % (max_temp_1901, min_temp_1901, avg_1901)
print "1902\tmax=%s\tmin=%s\tavg=%.2f" % (max_temp_1902, min_temp_1902, avg_1902)
Execution:
Give executable permissions to the mapper and reducer, then run the streaming job as in the
previous experiment, writing the results to an output directory such as /weather_output.
Output :
Program :
Inserting the files from the local file system to HDFS.
Sample Input:
0 0 1 1
The first field is the row index of the element and the second field is the column index
within the matrix.
The third field indicates whether the element belongs to the first matrix or the second matrix.
The fourth field is the value stored at that index in the given matrix.
Mapper.py:
#!/usr/bin/python
import sys
import re

n = int(sys.argv[1])   # number of rows of the left matrix (rows of the result)
l = int(sys.argv[2])   # number of columns of the right matrix (columns of the result)

for line in sys.stdin:
    # each input record is: row  column  matrix-id  value
    (i, j, m, v) = re.split("[ \t]+", line.strip())
    if m == '1':
        # left matrix element A[i][j]: needed for every result column c
        for c in range(0, l):
            print "%s %s %d\t%s L" % (i, j, c, v)
    else:
        # right matrix element B[i][j]: needed for every result row c
        for c in range(0, n):
            print "%d %s %s\t%s R" % (c, i, j, v)
Reducer.py:
#!/usr/bin/python
import sys

pKey = None   # key of the group currently being accumulated
d = {}        # (row, column) -> list of partial products
pd = 1        # running product of the values for the current key

for line in sys.stdin:
    (key, value) = line.strip().split("\t")
    arr = key.split(" ")
    (v, side) = value.split(" ")
    if pKey is not None and pKey != arr:
        # the key changed: store the finished product A[row][k] * B[k][col]
        try:
            d[(int(pKey[0]), int(pKey[2]))].append(pd)
        except KeyError:
            d[(int(pKey[0]), int(pKey[2]))] = [pd]
        pd = 1
    pKey = arr
    pd *= int(v)

# flush the product of the last key group
if pKey is not None:
    try:
        d[(int(pKey[0]), int(pKey[2]))].append(pd)
    except KeyError:
        d[(int(pKey[0]), int(pKey[2]))] = [pd]

# each result cell is the sum of its partial products
for key in d:
    print key, "", sum(d[key])
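Before running on the cluster, the two scripts can be tested with a local shell pipeline in which sort stands in for Hadoop's shuffle phase; the input file name and the matrix dimensions below are only placeholders:
$ cat matrix_input.txt | python Mapper.py 2 2 | sort | python Reducer.py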
Execution:
Give executable permissions to the mapper and reducer, then run the job with the Hadoop
streaming jar, available at:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-streaming/2.7.1
Output:
2) Extract the tar file downloaded into the Downloads folder, using the commands:
Command: tar -xzf apache-hive-2.1.0-bin.tar.gz
Command: ls
5) Edit the .bashrc file to update the environment variables for user.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set HIVE_HOME
export HIVE_HOME=/etc/hive/apache-hive-2.1.0-bin
export PATH=$PATH:/etc/hive/apache-hive-2.1.0-bin/bin
Also, make sure that the Hadoop path is set.
7) Create Hive directories within HDFS. The directory warehouse is the location to store the
table or data related to hive.
Command:
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp
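Hive also expects these directories to be group-writable; a commonly used pair of commands for this (not shown in this extract) is:
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp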
Configure the Hive metastore by editing the hive-site.xml file in the conf directory of the
Hive installation and setting the following properties (the embedded Derby database is used
as the metastore here):
<configuration>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.PersistenceManagerFactoryClass</name>
<value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
<description>class implementing the jdo persistence</description>
</property>
</configuration>
Observation:
A Hive console opens up, which indicates that Hive has been installed successfully.
References:
1. https://www.edureka.co/blog/apache-hive-installation-on-ubuntu
Drop Database is a statement that drops all the tables and deletes the database. Its
syntax is as follows:
DROP DATABASE [IF EXISTS] database_name [RESTRICT | CASCADE];
Procedure:
1) To create the database with the name userdb, use the following command:
hive > create database userdb;
The output of "hive > show databases;" confirms that the database userdb has been successfully created.
2) To drop the database with the name userdb, use the following command:
hive > drop database userdb;
Use the following command to check whether the database that is dropped is not there in the
list:
hive > show databases;
References:
1. https://www.tutorialspoint.com/hive/hive_create_database.htm
2. https://www.tutorialspoint.com/hive/hive_drop_database.htm
To alter a table the statement takes any of the following syntaxes based on what attributes we
wish to modify in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Sr.No   Field Name     Data Type
1       Eid            int
2       Name           String
3       Salary         Float
4       Designation    string
In addition to the fields, the table definition specifies a comment, row-format details such as
the field terminator and the line terminator, and the stored file type.
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already
exists.
OK
Time taken: 5.905 seconds
hive>
To check the table creation, go to HDFS and check the name of the table created there:
As we can see, the /user/hive/warehouse folder contains a directory named employee,
which corresponds to the table we have created.
2. To alter a view that shows the employees list with salary above 30000 to a view that shows all
the entries in the table we can use the following command:
hive > alter view emp_30000 as select * from employee;
This view will now show all the list of employees from the table employee.
3. The following table contains the fields of the employee table and shows the fields to be
changed (in bold).
Field Name   Convert from Data Type   Change Field Name   Convert to Data Type
name         String                   ename               String
salary       Float                    salary              Double
The following queries rename the column name and column data type using the above data:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
References:
1. https://www.tutorialspoint.com/hive/hive_drop_table.htm
2. https://www.tutorialspoint.com/hive/hive_create_table.htm
3. https://www.tutorialspoint.com/hive/hive_alter_table.htm
While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are
two ways to load data: one is from the local file system and the other is from the Hadoop file
system. The syntax for LOAD DATA is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)]
You can create a view at the time of executing a SELECT statement. The syntax is as follows:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...
The following query loads the given text file (here assumed to be a local file named sample.txt
containing the employee records) into the table:
hive > load data local inpath '/home/user/sample.txt' overwrite into table employee;
OK
Time taken: 15.905 seconds
hive>
2. Let us take an example for view. Assume employee table as given below, with the fields Id,
Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who
earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
The following query creates the view for the above scenario:
hive > create view emp_30000 as select * from employee where salary > 30000;
3. To get the data that is in the view, we use the following statement:
hive > select * from emp_30000;
The output will be as follows:
References:
1. https://www.tutorialspoint.com/hive/hive_views_and_indexes.htm
2. https://www.tutorialspoint.com/hive/hive_create_table.htm
Procedure:
1. Let us take an example for index. Use the same employee table that we have used earlier with
the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the
salary column of the employee table, using the following command:
hive > create index index_salary on table employee(salary) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column is modified, the changes are stored using the
index value.
2. To drop the index, use the command:
hive > drop index index_salary on employee;
We can verify the dropping of the index by again using the command:
hive > show formatted index on employee;
As the index on the table was dropped, it will show an empty list.
References:
1. https://www.tutorialspoint.com/hive/hive_views_and_indexes.htm
Return Type      |  Signature                                        |  Description
double           |  rand(), rand(int seed)                           |  Returns a random number that changes from row to row.
string           |  substr(string A, int start)                      |  Returns the substring of A starting from the start position till the end of string A.
string           |  substr(string A, int start, int length)          |  Returns the substring of A starting from the start position with the given length.
string           |  upper(string A)                                  |  Returns the string resulting from converting all characters of A to upper case.
string           |  lower(string A)                                  |  Returns the string resulting from converting all characters of A to lower case.
string           |  regexp_replace(string A, string B, string C)     |  Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
value of <type>  |  cast(<expr> as <type>)                           |  Converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
string           |  from_unixtime(int unixtime)                      |  Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
int              |  day(string date)                                 |  Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string           |  get_json_object(string json_string, string path) |  Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.
Aggregate Functions:
Return Type      |  Signature                                        |  Description
DOUBLE           |  sum(col), sum(DISTINCT col)                      |  Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE           |  avg(col), avg(DISTINCT col)                      |  Returns the average of the elements in the group or the average of the distinct values of the column in the group.
A CASE expression returns a value from the THEN portion of the clause.
The ORDER BY clause is used to retrieve the details based on one column and sort the result set in
ascending or descending order.
Procedure:
1. To find ceil of the value the command is:
hive > select ceil(7.7) from employee;
Output : 8
8. Use the case statement to segregate the salaries into low, middle and high for <40000, =40000
and >40000 respectively, using the following statement:
select name,salary, case when salary <40000 then 'low' when salary=40000 then 'middle' when
salary>40000 then 'high' else 'very high' end as salary_bracket from employee;
Output:
9. To count the number of rows in the table use the following command:
hive > select count(*) from employee;
Output:
10. To sort the entries in the employee table in descending order based on salary using the order by
clause, use as follows:
hive > select * from employee e order by e.salary desc;
Output:
References:
1. https://www.tutorialspoint.com/hive/hive_built_in_functions.htm
2. https://www.guru99.com/hive-user-defined-functions.html
3. http://hadooptutorial.info/hive-functions-examples/
Description:
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger
sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all the
data manipulation operations in Hadoop using Pig.
Procedure:
1. Create a directory named pig under /etc:
$ sudo mkdir /etc/pig
2. Download the pig-0.17.0.tar.gz from the pig index. (Download Link : http://www-eu.apache.org/dist/pig/pig-
0.17.0/)
4. Move the extracted directory to the directory created in etc in step 1 using the following command:
$ sudo mv pig-0.17.0/* /etc/pig/
5. Go to the home directory using the command:
$ cd ~
6. Open the bashrc file using any editor:
$ gedit ~/.bashrc
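Add lines similar to the following at the end of the file, mirroring the Hive setup earlier (the path assumes Pig was moved to /etc/pig in step 4):
# Set PIG_HOME
export PIG_HOME=/etc/pig
export PATH=$PATH:/etc/pig/bin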
References:
1. https://www.tutorialspoint.com/apache_pig/apache_pig_installation.htm
2. http://www-eu.apache.org/dist/pig/pig-0.17.0/
Aim: To sort the given data using the pig of Hadoop ecosystem.
Description:
The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more
fields.
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
Procedure:
1. Take an input file which consists of some structured data (let the name be student_details.txt).
2. Create a directory in HDFS and place the file in it using the following commands:
$HADOOP_HOME/bin/hadoop fs -mkdir /pig_data
$HADOOP_HOME/bin/hadoop fs -put ~/Documents/hadoop/codes/Pig/student_details.txt /pig_data/
3. Load the file into pig with the relation name student_details as shown below:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);
4. Let us now sort the relation in a descending order based on the age of the student and store it into another
relation named order_by_data using the ORDER BY operator as shown below.
grunt> order_by_data = ORDER student_details BY age DESC;
Output:
Verify the relation order_by_data using the DUMP operator as shown below:
grunt> Dump order_by_data;
Aim: To group the given data using an attribute in pig of Hadoop ecosystem.
Description:
The GROUP operator is used to group the data in one or more relations. It collects the data having the same
key.
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;
Procedure:
1. Take the student_details file used in the previous experiment and load it into pig using the following command:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
2. Group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
Output:
Verify the relation group_data using the DUMP operator as shown below:
grunt> Dump group_data;
Aim: To project the data in different formats using the operators in pig of Hadoop ecosystem.
Description:
The FOREACH operator is used to generate specified data transformations based on the column data.
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
The LIMIT operator is used to get a limited number of tuples from a relation.
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required number of tuples;
Procedure:
1. Take the student_details file used in the previous experiment and load it into pig using the following command:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
2. Using the FOREACH operator, project only the id, age and city fields of each record using the following command:
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
3. Dump the foreach_data using the command:
grunt> dump foreach_data;
4. Using the LIMIT operator, project the first four records from the foreach_data using the following command:
grunt> limit_data = LIMIT foreach_data 4;
5. Dump the limit_data using the command:
grunt>dump limit_data;
Aim: To filter the given data based on a condition using pig of Hadoop ecosystem.
Description:
The FILTER operator is used to select the required tuples from a relation based on a condition.
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Procedure:
1. Take the student_details file used in the previous experiment and load it into pig using the following command:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
2. Filter the file to get students from Chennai using the following command:
grunt> filter_data = FILTER student_details BY city == 'Chennai';
Output:
To get the output, dump the filter_data using the command:
grunt>dump filter_data;
Aim: To use joins like self, inner and outer joins in pig of Hadoop ecosystem.
Description:
Self Join:
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming
at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under
different aliases (names).
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
Procedure:
1. Create two input files customers.txt and orders.txt with the following content:
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
2. Place the files created into HDFS, using the following commands:
$HADOOP_HOME/bin/hadoop fs -put ~/Documents/Hadoop/codes/Pig/customers.txt /pig_data/
$HADOOP_HOME/bin/hadoop fs -put ~/Documents/Hadoop/codes/Pig/orders.txt /pig_data/
3. Load the files into pig with relations customers and orders as shown below:
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);
Self Join:
1. Load the customers.txt file twice, under the aliases customers1 and customers2, and then join
the two relations on id as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
2. Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;
It will produce the following output, displaying the contents of the relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join:
1. Perform inner join operation on the two relations customers and orders as shown below.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
2. Verify the relation customer_orders using the DUMP operator as shown below.
grunt> Dump customer_orders;
We will get the following output, which shows the contents of the relation customer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Left Outer Join:
1. Perform the left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
2. Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Right Outer Join:
1. Perform the right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
2. Verify the relation outer_right using the DUMP operator as shown below.
grunt> Dump outer_right;
It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Full Outer Join:
1. Perform the full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
2. Verify the relation outer_full using the DUMP operator as shown below.
grunt> Dump outer_full;
It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
References:
1. https://www.tutorialspoint.com/apache_pig/apache_pig_join_operator.htm
2. https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#JOIN+%28inner%29