Big Data Analysis

PIG ASSIGNMENT
Name: Parikh Oberoi

SID: 16103045
Problem 1: For the SalesJan2009.csv dataset answer the following queries:

Q1. How to load file in HDFS.
To load the file, we first copy it into HDFS with the hdfs dfs -copyFromLocal command, and then load it in the grunt shell with the Pig Latin LOAD command.
grunt> salesFile = LOAD '/user/hduser/input/pig/SalesJan2009.csv' USING PigStorage(',') AS
(Transaction_date:chararray, Product:chararray, Price:chararray, Payment_Type:chararray, Name:chararray,
City:chararray, State:chararray, Country:chararray, Account_Created:chararray, Last_Login:chararray,
Latitude:chararray, Longitude:chararray);
Q2. To group data by field Country
To group the data by the Country field, we use the Pig Latin GROUP command.
grunt> group_country = GROUP salesFile by Country;
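As a rough illustration of what the GROUP step produces, here is a minimal plain-Python sketch (not Pig). The sample rows and the helper name group_by are made up for the example; the idea is only that each distinct Country value ends up with a bag of the rows that share it.

```python
from collections import defaultdict

# Hypothetical sample rows: (Product, Price, City, Country)
sales = [
    ("Product1", "1200", "Basildon", "United Kingdom"),
    ("Product1", "1200", "Parkville", "United States"),
    ("Product2", "3600", "London", "United Kingdom"),
]

def group_by(rows, key_index):
    """Mimic GROUP: one entry per distinct key, holding the matching rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)

# Group on the Country column (index 3 in the sample rows).
group_country = group_by(sales, 3)
for country, bag in group_country.items():
    print(country, len(bag))
```

In Pig the result has the same shape: a group field holding the key and a bag holding the full tuples for that key.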
Q3. To dump the data for city == New York

To dump the records where the city is New York, we use the FILTER command to store the matching records in a temporary relation and then dump it.
grunt> temp = FILTER salesFile by City=='New York ';

grunt> dump temp;
Q4. To store the result in final output folder.
We use the Pig Latin STORE command to store the result in an output folder, and the fs -cat command to view the output.
grunt> STORE temp INTO 'pig_NYC_sales' using PigStorage('\t');

grunt> fs -ls pig_NYC_sales
grunt> fs -cat pig_NYC_sales/part-m-00000
Problem 2. For the movie dataset answer the following queries:
Q1. Find the movie with avg rating >4 from u.data dataset.
First, we load the u.data file using the LOAD command and group the records by movieID using the GROUP command.
Next, we compute the average rating for each movieID using the FOREACH command.
Finally, we use the FILTER command to keep only the movies with an average rating above 4 and dump them.
grunt> ratings = LOAD '/user/hduser/input/pig/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);
grunt> group_ratings = GROUP ratings BY movieID;
grunt> ratings_avg = FOREACH group_ratings GENERATE group AS movieID, AVG(ratings.rating) AS avgRating;
grunt> five_star_movies = FILTER ratings_avg BY avgRating > 4.0;
grunt> dump five_star_movies;
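The group, average, and filter steps can be sketched in plain Python as follows. This is only an illustration of the logic, not Pig itself; the rating rows and the helper name average_ratings are hypothetical.

```python
from collections import defaultdict

# Hypothetical (userID, movieID, rating, ratingTime) rows
ratings = [
    (1, 10, 5, 881250949),
    (2, 10, 4, 881250950),
    (3, 20, 3, 881250951),
    (4, 20, 4, 881250952),
    (5, 30, 4, 881250953),
]

def average_ratings(rows):
    """Group ratings by movieID and return {movieID: average rating}."""
    by_movie = defaultdict(list)
    for _user_id, movie_id, rating, _rating_time in rows:
        by_movie[movie_id].append(rating)
    return {movie: sum(vals) / len(vals) for movie, vals in by_movie.items()}

ratings_avg = average_ratings(ratings)

# Same condition as the FILTER: strictly greater than 4.0.
five_star_movies = {m: avg for m, avg in ratings_avg.items() if avg > 4.0}
print(five_star_movies)
```

Because the comparison is strict, a movie whose average is exactly 4.0 is excluded, just as in the FILTER statement.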
Q2. Find the oldest 5-star movies from u.data and u.item datasets.
To solve this problem, we first load the u.item dataset and convert each release date into Unix time (seconds since the epoch, 00:00:00 UTC on Jan 1, 1970).
After that, we use the JOIN command to join the five-star movies computed from u.data with the u.item data on the movieID field.
Next, we use the ORDER command to sort the rows in increasing order of release time, so that the oldest five-star movies come first.
grunt> movie_data = LOAD '/user/hduser/input/pig/u.item' USING PigStorage('|') AS (movieID:int,
movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);
grunt> date_converted = FOREACH movie_data GENERATE movieID, movieTitle,
ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
grunt> fiveStarsWithData = JOIN five_star_movies by movieID, date_converted by movieID;
grunt> oldestFiveStarMovies = order fiveStarsWithData by date_converted::releaseTime;
grunt> dump oldestFiveStarMovies;
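The date conversion, join, and sort can be sketched in plain Python as below. The movie rows, the averages, and the helper name to_unix_time are hypothetical, and the sketch assumes dates are interpreted as UTC midnight and that month abbreviations are parsed in an English locale.

```python
import calendar
from datetime import datetime

def to_unix_time(release_date):
    """Parse a 'dd-MMM-yyyy' date such as '01-Jan-1995' as UTC midnight."""
    parsed = datetime.strptime(release_date, "%d-%b-%Y")
    return calendar.timegm(parsed.timetuple())

# Hypothetical inputs: movieID -> avgRating, and (movieID, title, releaseDate)
five_star_movies = {1: 4.5, 3: 4.2}
movie_data = [
    (1, "Movie A", "01-Jan-1995"),
    (2, "Movie B", "01-Jan-1990"),
    (3, "Movie C", "01-Jan-1969"),
]

# Inner join on movieID, keeping only movies present in five_star_movies.
joined = [
    (movie_id, five_star_movies[movie_id], title, to_unix_time(date))
    for movie_id, title, date in movie_data
    if movie_id in five_star_movies
]

# Ascending release time puts the oldest movie first.
oldest_first = sorted(joined, key=lambda row: row[3])
print(oldest_first[0][2])
```

Movies released before 1970 get a negative Unix time, so they still sort ahead of later releases.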
Q3. Find the oldest 3-star movies from u.data and u.item datasets.
To find the oldest 3-star movies, we repeat the steps used for the oldest 5-star movies, but this time we use the FILTER command to keep the movies with an average rating of at least 3 and less than 4.
After that, we use the JOIN command to join these movies with the u.item data on the movieID field.
Next, we use the ORDER command to sort the rows in increasing order of release time, so that the oldest three-star movies come first.
grunt> three_star_movies = FILTER ratings_avg BY avgRating >= 3.0 AND avgRating < 4.0;
grunt> threeStarsWithData = JOIN three_star_movies by movieID, date_converted by movieID;
grunt> oldestThreeStarMovies = order threeStarsWithData by date_converted::releaseTime;
grunt> dump oldestThreeStarMovies;

Q4. Display name of all movies in uppercase.
To display the names of all movies in uppercase, we use the UPPER() function, which takes a field and returns its value in uppercase.
grunt> Case_upper = foreach movie_data generate UPPER(movieTitle);
grunt> dump Case_upper;
Problem 3. For the student data analysis dataset answer the following queries:
CSE.csv, Dept.csv
Q1. To group the tuples of students based on city.
First, we load the CSE.csv file using the LOAD command. Next, we use the GROUP command to group the tuples of students by the City field.
grunt> students = LOAD '/user/hduser/input/pig/CSE.csv' using PigStorage(',') as (Name:chararray,
RollNo:int, Marks:int, City:chararray);
grunt> GroupByCity = GROUP students BY City;
grunt> dump GroupByCity;
Q2. To display the name of all students in upper case.

To display the names of all students in uppercase, we use the UPPER() function, which takes a field and returns its value in uppercase.
grunt> upper_names = foreach students generate UPPER(Name);

grunt> dump upper_names;
Q3. To find the tuples of those students where the marks is greater than or equal to 70.
To filter the students according to marks, we use the Pig Latin FILTER command.
grunt> marksMoreThan70 = filter students BY Marks >= 70;
grunt> dump marksMoreThan70;
Q4. To get details of students who belong to New Delhi only.

To get the details of the students who belong to New Delhi, we use the Pig Latin FILTER command and keep the rows where City equals New Delhi.
grunt> NewDelhiStudents = filter students BY City == 'New Delhi';
grunt> dump NewDelhiStudents;
Q5. To display the results where the city name Chandigarh is replaced with
shorter form CHD and New Delhi with NDLS.
To display the results with New Delhi and Chandigarh reduced to their short forms, we use the REPLACE() function, which takes the field, the pattern to find (a regular expression), and the replacement string, and returns the field with every match replaced.
grunt> shortName = foreach students generate Name, RollNo, Marks, REPLACE(City, 'Chandigarh', 'CHD')
AS City;
grunt> shortName = foreach shortName generate Name, RollNo, Marks, REPLACE(City, 'New Delhi',
'NDLS') AS City;
grunt> dump shortName;
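The two chained replacements can be sketched in plain Python as follows. The helper name shorten_city is hypothetical, and re.sub stands in for REPLACE on the assumption that both treat the search argument as a regular expression.

```python
import re

def shorten_city(city):
    """Apply the two replacements one after the other, as the two FOREACH steps do."""
    city = re.sub("Chandigarh", "CHD", city)
    city = re.sub("New Delhi", "NDLS", city)
    return city

for name in ["Chandigarh", "New Delhi", "Mumbai"]:
    print(name, "->", shorten_city(name))
```

Cities that match neither pattern pass through unchanged, which is why the second step can safely run on the output of the first.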
Q6. To calculate average marks for each student.

To calculate the average marks for each student, we first use the GROUP command to group the student data
according to their roll numbers. Then, we use the AVG() function to calculate the average marks.
grunt> GroupStudents = GROUP students BY RollNo;
grunt> AvgMarks = foreach GroupStudents generate group as RollNo, AVG(students.Marks) as avgmarks;
grunt> dump AvgMarks;
Q7. To merge contents of two relations CSE.csv and Dept.csv.

First, we load the Dept.csv file using the LOAD command and then use the JOIN command to merge the contents of the CSE.csv and Dept.csv relations on the RollNo field.
grunt> department = LOAD '/user/hduser/input/pig/Dept.csv' using PigStorage(',') as (RollNo: int,
Department: chararray, City: chararray, College: chararray);

grunt> merge = JOIN students by RollNo, department by RollNo;
grunt> dump merge;
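To make the effect of the JOIN concrete, here is a minimal plain-Python sketch of an inner join on RollNo. The student and department rows and the helper name inner_join are made up for the example.

```python
# Hypothetical rows: (Name, RollNo, Marks, City) and (RollNo, Department, City, College)
students = [
    ("Asha", 1, 82, "Chandigarh"),
    ("Ravi", 2, 67, "New Delhi"),
    ("Meena", 3, 74, "Mumbai"),
]
department = [
    (1, "Computer Science", "Chandigarh", "PEC"),
    (2, "Electrical", "Roorkee", "IIT Roorkee"),
]

def inner_join(left, left_key, right, right_key):
    """Concatenate each left row with every right row that has the same key."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]

merged = inner_join(students, 1, department, 0)
for row in merged:
    print(row)
```

Each joined row carries the fields of both inputs, so duplicated names such as City and RollNo need a prefix to be told apart; this is why the later filters refer to department::Department and department::College. Rows without a match on the other side (the third student here) are dropped.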
Q8. To display the details of students who belong to Computer Science only.
We use the FILTER command to display the details of students who belong to Computer Science (CSE).
grunt> CSE_students = filter merge by department::Department == 'Computer Science';
grunt> dump CSE_students;
Q9. To display the details of students who do not belong to Computer Science.
We use the FILTER command to display the details of students who do not belong to Computer Science (CSE).
grunt> Not_CSE_students = filter merge by department::Department != 'Computer Science';

grunt> dump Not_CSE_students;
Q10. To display the details of students who belong to PEC only.

We use the FILTER command to display the details of students who belong to PEC only.
grunt> PEC_students = filter merge by department::College == 'PEC';
grunt> dump PEC_students;
Q11. To display the details of students who do not belong to IIT Roorkee.
We use the FILTER command to display the details of students who do not belong to IIT Roorkee.
grunt> Not_IITRoorkee = filter merge by department::College != 'IIT Roorkee';
grunt> dump Not_IITRoorkee;
