user - userid, name, state

tweets - tweetid, tweet, userid

-----------------------------------------------------------

Solution -

user = LOAD '/user/cloudera/data/test_raw_data/users.csv' USING PigStorage(',')
AS (userid:chararray,name:chararray,state:chararray);

OR

REGISTER /home/cloudera/Module-3/PigAdvanced/piggybank.jar;
user = LOAD '/user/cloudera/data/test_raw_data/users.csv' USING
org.apache.pig.piggybank.storage.CSVLoader()
AS(userid:chararray,name:chararray,state:chararray);

tweets = LOAD '/user/cloudera/data/test_raw_data/tweets.csv' USING
org.apache.pig.piggybank.storage.CSVLoader()
AS (tweetid:chararray,tweet:chararray,userid:chararray);

A = LIMIT tweets 10;

1. Write a Pig Latin query that outputs the login of all users in NY state

q1 = FILTER user by state == 'NY';

OR

q1 = FILTER user by state matches '.*NY.*';
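
The filter returns whole user tuples, while the question asks only for the login. A minimal sketch of the projection step, assuming that "login" corresponds to the userid column (the schema has no field named login) and that the user relation is loaded as shown at the top; q1_login is a name introduced here for illustration:

```pig
-- Assumption: "login" means the userid column; use name instead if that is the intended field.
q1 = FILTER user BY state == 'NY';
-- Keep only the login column of the matching users.
q1_login = FOREACH q1 GENERATE userid;
DUMP q1_login;
```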

-----------------------------------------------------------

2. Write a Pig Latin query that returns all the tweets that include the word
'favorite', ordered by tweet id

q2 = FILTER tweets by tweet matches '.*favorite.*';


q3 = ORDER q2 by tweetid;
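
Because tweetid is loaded as chararray, ORDER compares the ids as strings, so '10' sorts before '9'. If the ids are numeric (an assumption; the source does not say), a cast gives numeric order. A sketch building on q2; q2_num, q3_num and tweetid_num are illustrative names:

```pig
-- Assumption: every tweetid is a numeric string; values that cannot be cast become null.
q2_num = FOREACH q2 GENERATE (long)tweetid AS tweetid_num, tweet, userid;
-- Order by the numeric id rather than by its string form.
q3_num = ORDER q2_num BY tweetid_num;
```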

-----------------------------------------------------------

3. Write a Pig Latin query that returns the number of tweets for each userid

q4 = group tweets by userid;

describe q4;
q4: {group: chararray,tweets: {tweetid: chararray,tweet: chararray,userid:
chararray}}

-- group holds the userid key; tweets.userid would be a bag with one entry per tweet
q5 = FOREACH q4 GENERATE group as userid, COUNT(tweets) as tweets_count;

-----------------------------------------------------------

4. Write a Pig Latin query that returns the number of tweets for each userid
ordered from most active to least active users

q6 = Order q5 by tweets_count desc;


-----------------------------------------------------------

5. Write a Pig Latin query that returns the name of users that posted at least two
tweets

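-- user goes on the left so that users without tweets are kept
q8 = join user by userid left outer, tweets by userid;

describe q8;
q8: {user::userid: chararray,user::name: chararray,user::state:
chararray,tweets::tweetid: chararray,tweets::tweet: chararray,tweets::userid:
chararray}

-- group by id and name so that two users sharing a name are not merged
q9 = group q8 by (user::userid, user::name);
describe q9;

q9: {group: (user::userid: chararray,user::name: chararray),q8: {user::userid:
chararray,user::name: chararray,user::state: chararray,tweets::tweetid:
chararray,tweets::tweet: chararray,tweets::userid: chararray}}

-- COUNT skips the null tweetid of users without tweets, so they get tweet_count 0
q10 = FOREACH q9 GENERATE group.$1 as name, COUNT(q8.(tweets::tweetid)) as
tweet_count;

q11 = filter q10 by tweet_count >= 2;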

-----------------------------------------------------------

6. Write a Pig Latin query that returns the name of users that posted no tweets

-- q10 only contains zero counts when user is on the left of the LEFT OUTER join in q8
q12 = filter q10 by tweet_count == 0;
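
Question 6 can also be answered directly, without going through the counts. A sketch, assuming user and tweets are loaded as shown at the top; no_tweets_join, no_tweets and q12_names are names introduced here for illustration:

```pig
-- Keep every user; the tweet fields are null for users without a matching tweet.
no_tweets_join = JOIN user BY userid LEFT OUTER, tweets BY userid;
-- A null key on the tweets side means the user posted nothing.
no_tweets = FILTER no_tweets_join BY tweets::userid IS NULL;
q12_names = FOREACH no_tweets GENERATE user::name AS name;
```

This version does not depend on how q10 was built, which may make it easier to check in isolation.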

-----------------------------------------------------------

7. Write a Pig Latin query that returns the number of tweets for each user name
(not userid).

dump q10;

-----------------------------------------------------------

8. Write a Pig Latin query that returns the number of tweets for each user name
(not userid), ordered from most active to least active users

q13 = order q10 by tweet_count desc;

q14 = LIMIT q13 10;
