
Project 1

State-Wise Development Analysis In India


1. Find the districts that achieved 100 percent of the objective for BPL cards
Queries:
Flume:
flume-ng agent --conf-file /usr/local/flume/conf/flumeproject.conf --name agent
-Dflume.root.logger=INFO,console
Inside .conf file:
agent1.sources = mysrc
agent1.sinks = hdfsdest
agent1.channels = mychannel
agent1.sources.mysrc.type = exec
agent1.sources.mysrc.command = hadoop fs -put /home/gv/Desktop/Execute/flume/pro/State /user/flume/mainproject1
agent1.sinks.hdfsdest.type = hdfs
agent1.sinks.hdfsdest.hdfs.path = hdfs://localhost:9000/user/flume/mainproject1
agent1.channels.mychannel.type = memory
agent1.sources.mysrc.channels = mychannel
agent1.sinks.hdfsdest.channel = mychannel
PIG:
A = load '/user/flume/mainproject1/State' using
org.apache.pig.piggybank.storage.XMLLoader('row') as (state:chararray);
Parsing the XML file:
B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(state, '<row>\\s*<State_Name>(.*)</State_Name>\\s*<District_Name>(.*)</District_Name>\\s*<Project_Objectives_IHHL_BPL>(.*)</Project_Objectives_IHHL_BPL>\\s*<Project_Objectives_IHHL_APL>(.*)</Project_Objectives_IHHL_APL>\\s*<Project_Objectives_IHHL_TOTAL>(.*)</Project_Objectives_IHHL_TOTAL>\\s*<Project_Objectives_SCW>(.*)</Project_Objectives_SCW>\\s*<Project_Objectives_School_Toilets>(.*)</Project_Objectives_School_Toilets>\\s*<Project_Objectives_Anganwadi_Toilets>(.*)</Project_Objectives_Anganwadi_Toilets>\\s*<Project_Objectives_RSM>(.*)</Project_Objectives_RSM>\\s*<Project_Objectives_PC>(.*)</Project_Objectives_PC>\\s*<Project_Performance-IHHL_BPL>(.*)</Project_Performance-IHHL_BPL>\\s*<Project_Performance-IHHL_APL>(.*)</Project_Performance-IHHL_APL>\\s*<Project_Performance-IHHL_TOTAL>(.*)</Project_Performance-IHHL_TOTAL>\\s*<Project_Performance-SCW>(.*)</Project_Performance-SCW>\\s*<Project_Performance-School_Toilets>(.*)</Project_Performance-School_Toilets>\\s*<Project_Performance-Anganwadi_Toilets>(.*)</Project_Performance-Anganwadi_Toilets>\\s*<Project_Performance-RSM>(.*)</Project_Performance-RSM>\\s*<Project_Performance-PC>(.*)</Project_Performance-PC>\\s*</row>'));
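The capture-group mechanics of REGEX_EXTRACT_ALL above can be illustrated with a minimal Python sketch; the pattern below is a trimmed two-field version of the full 18-field regex, and the sample row is hypothetical:

```python
import re

# Trimmed two-field version of the REGEX_EXTRACT_ALL pattern used above.
# The real pattern captures all 18 fields; this toy <row> holds only
# State_Name and District_Name to show how the capture groups line up.
pattern = (r'<row>\s*'
           r'<State_Name>(.*)</State_Name>\s*'
           r'<District_Name>(.*)</District_Name>\s*'
           r'</row>')

sample = ('<row> <State_Name>KERALA</State_Name> '
          '<District_Name>KOLLAM</District_Name> </row>')

match = re.search(pattern, sample)
fields = match.groups()  # one tuple entry per capture group, like FLATTEN(...)
print(fields)  # ('KERALA', 'KOLLAM')
```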
Storing B:
STORE B INTO '/user/hdusr/mainproject' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
Loading the data into Pig:
D = load '/user/hdusr/mainproject' using PigStorage(',') as (
    State_Name:chararray, District_Name:chararray,
    Project_Objectives_IHHL_BPL:int, Project_Objectives_IHHL_APL:int,
    Project_Objectives_IHHL_TOTAL:int, Project_Objectives_SCW:int,
    Project_Objectives_School_Toilets:int, Project_Objectives_Anganwadi_Toilets:int,
    Project_Objectives_RSM:int, Project_Objectives_PC:int,
    Project_Performance_IHHL_BPL:int, Project_Performance_IHHL_APL:int,
    Project_Performance_IHHL_TOTAL:int, Project_Performance_SCW:int,
    Project_Performance_School_Toilets:int, Project_Performance_Anganwadi_Toilets:int,
    Project_Performance_RSM:int, Project_Performance_PC:int);
Filtering:
E = filter D by Project_Objectives_IHHL_BPL == Project_Performance_IHHL_BPL;
E already holds the matching records; the grouping below is only needed if you want the district names on their own.
Grouping:
F = group E by District_Name;
G = foreach F generate group; (displays only the district names)
Storing:
store E into '/user/hdusr/mainproject/results2' using PigStorage(',');
Creating table in SQL:
create table mainproject(
    State_Name varchar(50), District_Name varchar(50),
    Project_Objectives_IHHL_BPL int(20), Project_Objectives_IHHL_APL int(20),
    Project_Objectives_IHHL_TOTAL int(20), Project_Objectives_SCW int(20),
    Project_Objectives_School_Toilets int(20), Project_Objectives_Anganwadi_Toilets int(20),
    Project_Objectives_RSM int(20), Project_Objectives_PC int(20),
    Project_Performance_IHHL_BPL int(20), Project_Performance_IHHL_APL int(20),
    Project_Performance_IHHL_TOTAL int(20), Project_Performance_SCW int(20),
    Project_Performance_School_Toilets int(20), Project_Performance_Anganwadi_Toilets int(20),
    Project_Performance_RSM int(20), Project_Performance_PC int(20));

Exporting to SQL using sqoop:


sqoop export --connect jdbc:mysql://localhost/gova --table mainproject --export-dir /user/hdusr/mainproject/results2/part-m-00000 --fields-terminated-by ',' --username root -P
(the field delimiter must be ',' to match the PigStorage(',') output stored above)
checking output in SQL:
select * from mainproject; (the query returns 70 records in total, too many to screenshot in full)

2. Write a Pig UDF to filter the districts that have reached 80% of the BPL-card objective.
Export the results to MySQL using Sqoop.

Queries:
UDF program:
package pig_UDF;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class Fil extends FilterFunc {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            int completed = (int) input.get(0);
            int total = (int) input.get(1);
            // Multiply before dividing: plain int division (completed / total)
            // truncates to 0 for every district below 100 percent.
            float percentage = 100.0f * completed / total;
            // "reached 80%" means 80 or more
            return percentage >= 80.0f;
        } catch (Exception e) {
            throw new IOException("caught exception in Fil UDF", e);
        }
    }
}
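The division in the UDF is the easy part to get wrong: in pure int arithmetic, completed/total truncates to 0 for any district below 100 percent, so every percentage comes out as 0. A minimal Python sketch of the intended check, with hypothetical numbers:

```python
def reaches_80_percent(completed, total):
    """Mirror of the Fil UDF's check: True when performance reaches
    80% of the objective. Dividing in floating point avoids the
    truncation that integer division (completed // total) causes."""
    if total == 0:
        return False  # no objective set; nothing to compare against
    percentage = 100.0 * completed / total
    return percentage >= 80.0

print(reaches_80_percent(85, 100))  # True
print(reaches_80_percent(79, 100))  # False
```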

Registering in pig:
register /home/hdusr/Desktop/Exefile/filpertrywithelse.jar
Filtering:
H = filter D by pig_UDF.Fil(Project_Performance_IHHL_BPL,Project_Objectives_IHHL_BPL);
Foreach:
I = foreach H generate
(State_Name,District_Name,Project_Objectives_IHHL_BPL,Project_Performance_IHHL_BPL);
dump I; (the parentheses in the generate above wrap each record in a tuple, so the stored values carry "( )"; FLATTEN below removes them)
Flatten:
J = foreach I generate FLATTEN($0);
Storing:
store J into '/user/hdusr/mainproject/udf_result' using PigStorage(',');
Creating table in SQL:
create table mainproject_udf(State_Name varchar(40), District_Name varchar(40),
    Project_Objectives_IHHL_BPL int(20), Project_Performance_IHHL_BPL int(20));
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainproject_udf --export-dir /user/hdusr/mainproject/udf_result/part-m-00000 --fields-terminated-by ',' --username root -P
checking output in SQL:
select * from mainproject_udf; (the query returns 176 records in total, too many to screenshot in full)

Project 2
USA Consumer Forum Data Analysis
Flume:
flume-ng agent --conf-file /usr/local/flume/conf/flumeproject.conf --name agent
-Dflume.root.logger=INFO,console
Inside .conf file:
agent1.sources = mysrc
agent1.sinks = hdfsdest
agent1.channels = mychannel
agent1.sources.mysrc.type = exec
agent1.sources.mysrc.command = hadoop fs -put
/home/gv/Desktop/Execute/flume/pro/Consumer_Complaints /user/flume/mainprojectQ2
agent1.sinks.hdfsdest.type = hdfs
agent1.sinks.hdfsdest.hdfs.path = hdfs://localhost:9000/user/flume/mainprojectQ2
agent1.channels.mychannel.type = memory
agent1.sources.mysrc.channels = mychannel
agent1.sinks.hdfsdest.channel = mychannel
First, clean the raw data using MapReduce:
hadoop jar /home/hdusr/Desktop/Exefile/clean.jar cleaning.Cleandata /user/flume/mainprojectQ2
cleandata
1. Write a Pig script to find the number of complaints that got a timely response
Loading the data into Pig:
A = load '/user/hdusr/cleandata' using PigStorage(',') as (
    Date_received:chararray, Product:chararray, Sub_product:chararray,
    Issue:chararray, Sub_issue:chararray, Consumer_complaint_narrative:chararray,
    Company_public_response:chararray, Company:chararray, State:chararray,
    ZIP_code:int, Submitted_via:chararray, Date_sent_to_company:chararray,
    Company_response_to_consumer:chararray, Timely_response:chararray,
    Consumer_disputed:chararray, Complaint_ID:int);
Filtering:
B = filter A by Timely_response == 'Yes';
Grouping:

C = group B ALL;
Foreach:
D = foreach C generate COUNT(B); (after group ALL, $0 is the literal group key 'all'; the bag B holds the records to count)
dump D;
Storing:
store D into '/user/hdusr/mainprojectQ2/time' using PigStorage(',');
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_time --export-dir /user/hdusr/mainprojectQ2/time --fields-terminated-by ',' --username root -P
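The filter-then-count pipeline above can be sketched in Python; the rows are hypothetical stand-ins holding only the Timely_response and Complaint_ID columns:

```python
# Toy stand-in for relation A: (Timely_response, Complaint_ID) per row.
rows = [("Yes", 100001), ("No", 100002), ("Yes", 100003), ("Yes", 100004)]

# B = filter A by Timely_response == 'Yes';
# C = group B ALL;  D = foreach C generate COUNT(B);
timely = [r for r in rows if r[0] == "Yes"]
count = len(timely)
print(count)  # 3
```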
___________________________________________
2. Write a Pig script to find the number of complaints that the consumer forum forwarded to the respective company on the same day it received them
Filtering:
E = filter A by Date_received == Date_sent_to_company;
Grouping:
F = group E by Company;
Foreach:
G = foreach F generate group, COUNT(E.$0);
Storing:
store G into '/user/hdusr/mainprojectQ2/rese' using PigStorage(',');
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_rese --export-dir /user/hdusr/mainprojectQ2/rese --fields-terminated-by ',' --username root -P
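The same-day filter plus per-company grouping can be sketched in Python with hypothetical rows:

```python
from collections import defaultdict

# Toy rows: (Date_received, Date_sent_to_company, Company) -- hypothetical values.
rows = [
    ("03/19/2015", "03/19/2015", "Acme Bank"),
    ("03/20/2015", "03/22/2015", "Acme Bank"),
    ("04/01/2015", "04/01/2015", "Zenith Corp"),
    ("04/01/2015", "04/01/2015", "Acme Bank"),
]

# E = filter A by Date_received == Date_sent_to_company;
# F = group E by Company;  G = foreach F generate group, COUNT(E.$0);
counts = defaultdict(int)
for received, sent, company in rows:
    if received == sent:
        counts[company] += 1

print(dict(counts))  # {'Acme Bank': 2, 'Zenith Corp': 1}
```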
___________________________________________
3. Write a Pig script to find the companies topping the complaint chart (the companies with the maximum number of complaints)
Filtering:

H = filter A by Consumer_complaint_narrative != '';


Grouping:
I = group H by Company;
Foreach:
J = foreach I generate group, COUNT(H.$0);
OrderBy:
K = order J by $1 DESC;
Limit:
L = limit K 1;
Storing:
store L into '/user/hdusr/mainprojectQ2/null' using PigStorage(',');
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_null --export-dir /user/hdusr/mainprojectQ2/null --fields-terminated-by ',' --username root -P
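The group/count/order/limit chain above amounts to a most-common lookup; a Python sketch with hypothetical company values:

```python
from collections import Counter

# Toy Company column -- hypothetical values standing in for relation H.
companies = ["Acme Bank", "Zenith Corp", "Acme Bank", "Acme Bank", "Zenith Corp"]

# I = group H by Company;  J = foreach I generate group, COUNT(H.$0);
# K = order J by $1 DESC;  L = limit K 1;
top_company, complaint_count = Counter(companies).most_common(1)[0]
print(top_company, complaint_count)  # Acme Bank 3
```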
_______________________________________________
4. Write a Pig script to find the number of complaints filed with product type "Debt collection" for the year 2015
Filtering:
M = filter A by Product == 'Debt collection';
Filtering:
N = FILTER M BY (Date_received matches '.*2015.*');
Grouping:
O = group N ALL;
Foreach:
P = foreach O generate COUNT(N.$0);
Storing:
store P into '/user/hdusr/mainprojectQ2/2015' using PigStorage(',');

Exporting to SQL using sqoop:


sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_2015 --export-dir /user/hdusr/mainprojectQ2/2015 --fields-terminated-by ',' --username root -P
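The two-stage filter above (product equality, then a substring match on the year) can be sketched in Python with hypothetical rows:

```python
import re

# Toy rows: (Date_received, Product) -- hypothetical values.
rows = [
    ("03/19/2015", "Debt collection"),
    ("07/02/2014", "Debt collection"),
    ("03/19/2015", "Mortgage"),
    ("11/30/2015", "Debt collection"),
]

# M = filter A by Product == 'Debt collection';
# N = filter M by Date_received matches '.*2015.*';
# then group ALL + COUNT, i.e. just a length here
matches = [r for r in rows
           if r[1] == "Debt collection" and re.search(r"2015", r[0])]
print(len(matches))  # 2
```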
