
Project 1

State-Wise Development Analysis In India


1. Find the districts that achieved 100 percent of the objective for BPL cards
Queries:
Flume:
flume-ng agent --conf-file /usr/local/flume/conf/flumeproject.conf --name agent
-Dflume.root.logger=INFO,console
Inside .conf file:
agent1.sources = mysrc
agent1.sinks = hdfsdest
agent1.channels = mychannel
agent1.sources.mysrc.type = exec
agent1.sources.mysrc.command = hadoop fs -put /home/gv/Desktop/Execute/flume/pro/State /user/flume/mainproject1
agent1.sinks.hdfsdest.type = hdfs
agent1.sinks.hdfsdest.hdfs.path = hdfs://localhost:9000/user/flume/mainproject1
agent1.channels.mychannel.type = memory
agent1.sources.mysrc.channels = mychannel
agent1.sinks.hdfsdest.channel = mychannel
PIG:
A = load '/user/flume/mainproject1/State' using
org.apache.pig.piggybank.storage.XMLLoader('row') as (state:chararray);
Parsing the XML file:
B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(state, '<row>\\s*<State_Name>(.*)</State_Name>\\s*<District_Name>(.*)</District_Name>\\s*<Project_Objectives_IHHL_BPL>(.*)</Project_Objectives_IHHL_BPL>\\s*<Project_Objectives_IHHL_APL>(.*)</Project_Objectives_IHHL_APL>\\s*<Project_Objectives_IHHL_TOTAL>(.*)</Project_Objectives_IHHL_TOTAL>\\s*<Project_Objectives_SCW>(.*)</Project_Objectives_SCW>\\s*<Project_Objectives_School_Toilets>(.*)</Project_Objectives_School_Toilets>\\s*<Project_Objectives_Anganwadi_Toilets>(.*)</Project_Objectives_Anganwadi_Toilets>\\s*<Project_Objectives_RSM>(.*)</Project_Objectives_RSM>\\s*<Project_Objectives_PC>(.*)</Project_Objectives_PC>\\s*<Project_Performance-IHHL_BPL>(.*)</Project_Performance-IHHL_BPL>\\s*<Project_Performance-IHHL_APL>(.*)</Project_Performance-IHHL_APL>\\s*<Project_Performance-IHHL_TOTAL>(.*)</Project_Performance-IHHL_TOTAL>\\s*<Project_Performance-SCW>(.*)</Project_Performance-SCW>\\s*<Project_Performance-School_Toilets>(.*)</Project_Performance-School_Toilets>\\s*<Project_Performance-Anganwadi_Toilets>(.*)</Project_Performance-Anganwadi_Toilets>\\s*<Project_Performance-RSM>(.*)</Project_Performance-RSM>\\s*<Project_Performance-PC>(.*)</Project_Performance-PC>\\s*</row>'));
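The capture-group mechanics of REGEX_EXTRACT_ALL above can be illustrated with a minimal Python sketch; the pattern below is a trimmed two-field version of the full 18-field regex, and the sample row is hypothetical:

```python
import re

# Trimmed two-field version of the REGEX_EXTRACT_ALL pattern used above.
# The real pattern captures all 18 fields; this toy <row> holds only
# State_Name and District_Name to show how the capture groups line up.
pattern = (r'<row>\s*'
           r'<State_Name>(.*)</State_Name>\s*'
           r'<District_Name>(.*)</District_Name>\s*'
           r'</row>')

sample = ('<row> <State_Name>KERALA</State_Name> '
          '<District_Name>KOLLAM</District_Name> </row>')

match = re.search(pattern, sample)
fields = match.groups()  # one tuple entry per capture group, like FLATTEN(...)
print(fields)  # ('KERALA', 'KOLLAM')
```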
Storing B:
STORE B INTO '/user/hdusr/mainproject' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
Loading the data into Pig:
D = load '/user/hdusr/mainproject' using PigStorage(',') as (
    State_Name:chararray, District_Name:chararray,
    Project_Objectives_IHHL_BPL:int, Project_Objectives_IHHL_APL:int,
    Project_Objectives_IHHL_TOTAL:int, Project_Objectives_SCW:int,
    Project_Objectives_School_Toilets:int, Project_Objectives_Anganwadi_Toilets:int,
    Project_Objectives_RSM:int, Project_Objectives_PC:int,
    Project_Performance_IHHL_BPL:int, Project_Performance_IHHL_APL:int,
    Project_Performance_IHHL_TOTAL:int, Project_Performance_SCW:int,
    Project_Performance_School_Toilets:int, Project_Performance_Anganwadi_Toilets:int,
    Project_Performance_RSM:int, Project_Performance_PC:int);
Filtering:
E = filter D by Project_Objectives_IHHL_BPL == Project_Performance_IHHL_BPL;
E already holds the matching records; the grouping below is only needed if you want the district names on their own.
Grouping:
F = group E by District_Name;
G = foreach F generate group; (displays only the district names)
Storing:
store E into '/user/hdusr/mainproject/results2' using PigStorage(',');
Creating table in SQL:
create table mainproject(
    State_Name varchar(50), District_Name varchar(50),
    Project_Objectives_IHHL_BPL int(20), Project_Objectives_IHHL_APL int(20),
    Project_Objectives_IHHL_TOTAL int(20), Project_Objectives_SCW int(20),
    Project_Objectives_School_Toilets int(20), Project_Objectives_Anganwadi_Toilets int(20),
    Project_Objectives_RSM int(20), Project_Objectives_PC int(20),
    Project_Performance_IHHL_BPL int(20), Project_Performance_IHHL_APL int(20),
    Project_Performance_IHHL_TOTAL int(20), Project_Performance_SCW int(20),
    Project_Performance_School_Toilets int(20), Project_Performance_Anganwadi_Toilets int(20),
    Project_Performance_RSM int(20), Project_Performance_PC int(20));

Exporting to SQL using sqoop:


sqoop export --connect jdbc:mysql://localhost/gova --table mainproject --export-dir /user/hdusr/mainproject/results2/part-m-00000 --fields-terminated-by ',' --username root -P
(the field delimiter must be ',' to match the PigStorage(',') output stored above)
checking output in SQL:
select * from mainproject; (the query returns 70 records in total, too many to screenshot in full)

2. Write a Pig UDF to filter the districts that have reached 80% of the BPL-card objective.
Export the results to MySQL using Sqoop.

Queries:
UDF program:
package pig_UDF;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class Fil extends FilterFunc {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            int completed = (int) input.get(0);
            int total = (int) input.get(1);
            // Multiply before dividing: plain int division (completed / total)
            // truncates to 0 for every district below 100 percent.
            float percentage = 100.0f * completed / total;
            // "reached 80%" means 80 or more
            return percentage >= 80.0f;
        } catch (Exception e) {
            throw new IOException("caught exception in Fil UDF", e);
        }
    }
}
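The division in the UDF is the easy part to get wrong: in pure int arithmetic, completed/total truncates to 0 for any district below 100 percent, so every percentage comes out as 0. A minimal Python sketch of the intended check, with hypothetical numbers:

```python
def reaches_80_percent(completed, total):
    """Mirror of the Fil UDF's check: True when performance reaches
    80% of the objective. Dividing in floating point avoids the
    truncation that integer division (completed // total) causes."""
    if total == 0:
        return False  # no objective set; nothing to compare against
    percentage = 100.0 * completed / total
    return percentage >= 80.0

print(reaches_80_percent(85, 100))  # True
print(reaches_80_percent(79, 100))  # False
```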

Registering in pig:
register /home/hdusr/Desktop/Exefile/filpertrywithelse.jar
Filtering:
H = filter D by pig_UDF.Fil(Project_Performance_IHHL_BPL,Project_Objectives_IHHL_BPL);
Foreach:
I = foreach H generate
(State_Name,District_Name,Project_Objectives_IHHL_BPL,Project_Performance_IHHL_BPL);
dump I; (the parentheses in the generate above wrap each record in a tuple, so the stored values carry "( )"; FLATTEN below removes them)
Flatten:
J = foreach I generate FLATTEN($0);
Storing:
store J into '/user/hdusr/mainproject/udf_result' using PigStorage(',');
Creating table in SQL:
create table mainproject_udf(State_Name varchar(40), District_Name varchar(40),
    Project_Objectives_IHHL_BPL int(20), Project_Performance_IHHL_BPL int(20));
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainproject_udf --export-dir /user/hdusr/mainproject/udf_result/part-m-00000 --fields-terminated-by ',' --username root -P
checking output in SQL:
select * from mainproject_udf; (the query returns 176 records in total, too many to screenshot in full)

Project 2
USA Consumer Forum Data Analysis
Flume:
flume-ng agent --conf-file /usr/local/flume/conf/flumeproject.conf --name agent
-Dflume.root.logger=INFO,console
Inside .conf file:
agent1.sources = mysrc
agent1.sinks = hdfsdest
agent1.channels = mychannel
agent1.sources.mysrc.type = exec
agent1.sources.mysrc.command = hadoop fs -put
/home/gv/Desktop/Execute/flume/pro/Consumer_Complaints /user/flume/mainprojectQ2
agent1.sinks.hdfsdest.type = hdfs
agent1.sinks.hdfsdest.hdfs.path = hdfs://localhost:9000/user/flume/mainprojectQ2
agent1.channels.mychannel.type = memory
agent1.sources.mysrc.channels = mychannel
agent1.sinks.hdfsdest.channel = mychannel
First, clean the raw data using MapReduce:
hadoop jar /home/hdusr/Desktop/Exefile/clean.jar cleaning.Cleandata /user/flume/mainprojectQ2
cleandata
1. Write a Pig script to find the number of complaints that got a timely response
Loading the data into Pig:
A = load '/user/hdusr/cleandata' using PigStorage(',') as (
    Date_received:chararray, Product:chararray, Sub_product:chararray,
    Issue:chararray, Sub_issue:chararray, Consumer_complaint_narrative:chararray,
    Company_public_response:chararray, Company:chararray, State:chararray,
    ZIP_code:int, Submitted_via:chararray, Date_sent_to_company:chararray,
    Company_response_to_consumer:chararray, Timely_response:chararray,
    Consumer_disputed:chararray, Complaint_ID:int);
Filtering:
B = filter A by Timely_response == 'Yes';
Grouping:

C = group B ALL;
Foreach:
D = foreach C generate COUNT(B); (after group ALL, $0 is the literal group key 'all'; the bag B holds the records to count)
dump D;
Storing:
store D into '/user/hdusr/mainprojectQ2/time' using PigStorage(',');
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_time --export-dir /user/hdusr/mainprojectQ2/time --fields-terminated-by ',' --username root -P
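The filter-then-count pipeline above can be sketched in Python; the rows are hypothetical stand-ins holding only the Timely_response and Complaint_ID columns:

```python
# Toy stand-in for relation A: (Timely_response, Complaint_ID) per row.
rows = [("Yes", 100001), ("No", 100002), ("Yes", 100003), ("Yes", 100004)]

# B = filter A by Timely_response == 'Yes';
# C = group B ALL;  D = foreach C generate COUNT(B);
timely = [r for r in rows if r[0] == "Yes"]
count = len(timely)
print(count)  # 3
```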
___________________________________________
2. Write a Pig script to find the number of complaints that the consumer forum forwarded to the respective company on the same day it received them
Filtering:
E = filter A by Date_received == Date_sent_to_company;
Grouping:
F = group E by Company;
Foreach:
G = foreach F generate group, COUNT(E.$0);
Storing:
store G into '/user/hdusr/mainprojectQ2/rese' using PigStorage(',');
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_rese --export-dir /user/hdusr/mainprojectQ2/rese --fields-terminated-by ',' --username root -P
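The same-day filter plus per-company grouping can be sketched in Python with hypothetical rows:

```python
from collections import defaultdict

# Toy rows: (Date_received, Date_sent_to_company, Company) -- hypothetical values.
rows = [
    ("03/19/2015", "03/19/2015", "Acme Bank"),
    ("03/20/2015", "03/22/2015", "Acme Bank"),
    ("04/01/2015", "04/01/2015", "Zenith Corp"),
    ("04/01/2015", "04/01/2015", "Acme Bank"),
]

# E = filter A by Date_received == Date_sent_to_company;
# F = group E by Company;  G = foreach F generate group, COUNT(E.$0);
counts = defaultdict(int)
for received, sent, company in rows:
    if received == sent:
        counts[company] += 1

print(dict(counts))  # {'Acme Bank': 2, 'Zenith Corp': 1}
```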
___________________________________________
3. Write a Pig script to find the companies topping the complaint chart (the companies with the maximum number of complaints)
Filtering:

H = filter A by Consumer_complaint_narrative != '';


Grouping:
I = group H by Company;
Foreach:
J = foreach I generate group, COUNT(H.$0);
OrderBy:
K = order J by $1 DESC;
Limit:
L = limit K 1;
Storing:
store L into '/user/hdusr/mainprojectQ2/null' using PigStorage(',');
Exporting to SQL using sqoop:
sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_null --export-dir /user/hdusr/mainprojectQ2/null --fields-terminated-by ',' --username root -P
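The group/count/order/limit chain above amounts to a most-common lookup; a Python sketch with hypothetical company values:

```python
from collections import Counter

# Toy Company column -- hypothetical values standing in for relation H.
companies = ["Acme Bank", "Zenith Corp", "Acme Bank", "Acme Bank", "Zenith Corp"]

# I = group H by Company;  J = foreach I generate group, COUNT(H.$0);
# K = order J by $1 DESC;  L = limit K 1;
top_company, complaint_count = Counter(companies).most_common(1)[0]
print(top_company, complaint_count)  # Acme Bank 3
```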
_______________________________________________
4. Write a Pig script to find the number of complaints filed with product type "Debt collection" for the year 2015
Filtering:
M = filter A by Product == 'Debt collection';
Filtering:
N = FILTER M BY (Date_received matches '.*2015.*');
Grouping:
O = group N ALL;
Foreach:
P = foreach O generate COUNT(N.$0);
Storing:
store P into '/user/hdusr/mainprojectQ2/2015' using PigStorage(',');

Exporting to SQL using sqoop:


sqoop export --connect jdbc:mysql://localhost/gova --table mainprojectQ2_2015 --export-dir /user/hdusr/mainprojectQ2/2015 --fields-terminated-by ',' --username root -P
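The two-stage filter above (product equality, then a substring match on the year) can be sketched in Python with hypothetical rows:

```python
import re

# Toy rows: (Date_received, Product) -- hypothetical values.
rows = [
    ("03/19/2015", "Debt collection"),
    ("07/02/2014", "Debt collection"),
    ("03/19/2015", "Mortgage"),
    ("11/30/2015", "Debt collection"),
]

# M = filter A by Product == 'Debt collection';
# N = filter M by Date_received matches '.*2015.*';
# then group ALL + COUNT, i.e. just a length here
matches = [r for r in rows
           if r[1] == "Debt collection" and re.search(r"2015", r[0])]
print(len(matches))  # 2
```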
