Q1. Find the movies with an average rating greater than 4 in the u.data dataset.
First, we load the u.data file using the LOAD command and then group the records by movieID using the
GROUP command.
Next, we compute the average rating for each movieID using the FOREACH command.
The last step is to use the FILTER command to keep only the movies whose average rating is above 4 and
dump them.
grunt> ratings = LOAD '/user/hduser/input/pig/u.data' AS (userID:int, movieID:int, rating:int,
ratingTime:int);
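The remaining steps (GROUP, FOREACH, FILTER, DUMP) could be sketched as follows. This is a minimal sketch that assumes the ratings relation loaded above. The names ratings_by_movie and top_movies are illustrative. ratings_avg and avgRating follow the names used by the three-star filter in Q3.
-- group all rating records that share a movieID
grunt> ratings_by_movie = GROUP ratings BY movieID;
-- one output row per movie: its ID and the mean of its ratings
grunt> ratings_avg = FOREACH ratings_by_movie GENERATE group AS movieID, AVG(ratings.rating) AS avgRating;
-- keep only movies whose average rating is strictly above 4
grunt> top_movies = FILTER ratings_avg BY avgRating > 4.0;
grunt> DUMP top_movies;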
Q2. Find the oldest 5-star movies from the u.data and u.item datasets.
To solve this problem, we first load the u.item dataset and convert each release date to Unix time
(the number of seconds since the epoch, 00:00:00 UTC on Jan 1, 1970).
After that, we use the JOIN command to join the u.data and u.item datasets on the movieID field.
Next, we use the ORDER command to sort the rows in increasing order of release time, so that the oldest
five-star movies come first.
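One way these steps might look in the grunt shell is sketched below. Several things here are assumptions rather than part of the original answer: the u.item path (taken to sit next to u.data), the '|' delimiter and 'dd-MMM-yyyy' date format of the MovieLens 100k u.item file, the relation names, and the reading of "5-star" as an average rating of 5.0. The sketch also assumes ratings_avg holds (movieID, avgRating) as computed for Q1.
-- assumed path; only the leading u.item columns are named, the rest are dropped
grunt> items_raw = LOAD '/user/hduser/input/pig/u.item' USING PigStorage('|') AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);
-- skip rows that have no release date
grunt> items_dated = FILTER items_raw BY releaseDate IS NOT NULL;
-- convert the release date string to seconds since the epoch
grunt> items = FOREACH items_dated GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
-- assumption: "5-star" means an average rating of 5.0
grunt> five_star = FILTER ratings_avg BY avgRating >= 5.0;
grunt> five_star_dated = JOIN five_star BY movieID, items BY movieID;
-- ascending release time puts the oldest movies first
grunt> oldest_five_star = ORDER five_star_dated BY items::releaseTime;
grunt> DUMP oldest_five_star;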
Q3. Find the oldest 3-star movies from the u.data and u.item datasets.
To find the oldest 3-star movies, we repeat the steps used for the oldest 5-star movies, but this time we use the
FILTER command to keep the movies with an average rating of at least 3 and less than 4.
After that, we use the JOIN command to join the u.data and u.item datasets on the movieID field.
Next, we use the ORDER command to sort the rows in increasing order of release time, so that the oldest
three-star movies come first.
grunt> three_star_movies = FILTER ratings_avg BY avgRating >= 3.0 AND avgRating < 4.0;
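The JOIN and ORDER steps for this filter could be sketched as below. The sketch assumes a relation named items with fields (movieID, movieTitle, releaseTime) has already been built from u.item as described for Q2. The names items, three_star_dated and oldest_three_star are illustrative.
-- attach release times to the three-star movies
grunt> three_star_dated = JOIN three_star_movies BY movieID, items BY movieID;
-- ascending release time puts the oldest movies first
grunt> oldest_three_star = ORDER three_star_dated BY items::releaseTime;
grunt> DUMP oldest_three_star;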
Problem 3. For the student data analysis dataset answer the following queries:
CSE.csv, Dept.csv
Q1. To group the tuples of students based on city.
First, we load the CSE.csv file using the LOAD command. Next, we use the GROUP command to group
the tuples of students by the City field.
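A minimal sketch of these two steps follows. The relation name students and the field names Name, RollNo, Marks and City match the later queries in this section. The file path, the comma delimiter, the column order, the column types and the name students_by_city are assumptions, and CSE.csv may contain further columns not listed here.
-- assumed path, delimiter and column order
grunt> students = LOAD 'CSE.csv' USING PigStorage(',') AS (Name:chararray, RollNo:int, Marks:int, City:chararray);
-- one output tuple per city, holding a bag of that city's students
grunt> students_by_city = GROUP students BY City;
grunt> DUMP students_by_city;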
Q3. To find the tuples of those students whose marks are greater than or equal to 70.
To filter the students according to marks, we use the FILTER command in Pig Latin.
grunt> marksMoreThan70 = filter students BY Marks >= 70;
grunt> dump marksMoreThan70;
Q5. To display the results where the city name Chandigarh is replaced with the
shorter form CHD and New Delhi with NDLS.
To display the results with Chandigarh and New Delhi reduced to their short forms, we use the
REPLACE() function, which takes the source string, the text to match, and the replacement text, and returns
the converted string.
grunt> shortName = foreach students generate Name, RollNo, Marks, REPLACE(City, 'Chandigarh', 'CHD')
AS City;
grunt> shortName = foreach shortName generate Name, RollNo, Marks, REPLACE(City, 'New Delhi',
'NDLS') AS City;
Q8. To display the details of students who belong to Computer Science only.
We use the FILTER command to display the details of students who belong to CSE.
grunt> CSE_students = filter merge by department::Department == 'Computer Science';
grunt> dump CSE_students;
Q9. To display the details of students who do not belong to Computer Science.
We use the FILTER command to display the details of students who do not belong to CSE.
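A sketch of this filter, mirroring the Q8 statement with the comparison inverted, is given below. It assumes the same merge relation and department::Department field used in Q8. The name non_CSE_students is illustrative.
-- keep every row whose department is something other than Computer Science
grunt> non_CSE_students = filter merge by department::Department != 'Computer Science';
grunt> dump non_CSE_students;
Rows whose Department is null are dropped by this filter as well, because a comparison with null does not evaluate to true in Pig.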