
Configuration

1. Cloudera:
- Basically for testing purposes.
- VMware Player: Free download from
http://www.vmware.com/products/player/
- Cloudera QuickStart VM: Free download from
https://www.cloudera.com/content/support/en/downloads.html (my
Cloudera version is 4.2.0, with Pig version 0.10.0)
- Run the Cloudera VM in VMware Player.
- Install the Cloudera Development Kit to enable full screen and shared
folders between the host machine and the VM. Also available from
https://www.cloudera.com/content/support/en/downloads.html
2. AWS account:
Ask Ripul for the account credentials (access key, secret key) and the
keypair.pem file.
3. SSH tools:
- PuTTY: configure with the AWS credentials and the pem file
- Cygwin terminal: ssh from the folder containing the pem file
- Problem: the connection might keep dropping; switching to another tool may help.
4. S3 Browser:
Download from http://s3browser.com/ and enter the AWS credentials to use it.
Development
1. Cloudera environment
- Compile the Java code in a folder called udf containing NcycloLoader.java;
the classpath needs to include the following jar files:
>>javac -cp /mnt/hgfs/project/lib/pig.jar:/home/cloudera/eclipse/plugins/org.apache.commons.logging_1.0.4.v201101211617.jar:/mnt/hgfs/project/lib/commons-io-2.4.jar:/mnt/hgfs/project/lib/hadoop-core-0.20.2.jar NcycloLoader.java
Then pack the udf folder into a jar:
>>cd ..
>>jar cf udf.jar udf
- There are two modes in Pig: hadoop mode and local mode. Hadoop mode runs
against data in HDFS; local mode runs against local folders, which is better
for testing.
(1) Run Pig in interactive mode:
>>pig -x local
(2) Run Pig with a Pig file:
>>pig -x local file.pig
>>quit
2. Code

Java udf code:
- Use package udf to match the udf folder.
- getNext() reads from the input file line by line, does the format
transformation, and returns a tuple.
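The actual NcycloLoader implementation is not reproduced in these notes; the following is only a minimal LoadFunc sketch of the structure described above, for Pig 0.10 (the comma-split parsing is an assumed placeholder for the real format transformation):

package udf;

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class NcycloLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() throws IOException {
        // read the input as plain text lines
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                        // end of input
            }
            Text line = (Text) reader.getCurrentValue();
            // placeholder transformation: split each line on commas into tuple fields
            String[] parts = line.toString().split(",");
            ArrayList<Object> fields = new ArrayList<Object>();
            for (String part : parts) {
                fields.add(part.trim());
            }
            return tupleFactory.newTupleNoCopy(fields);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}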

Pig code (a short sketch follows these notes):
Register udf.jar
Use proper schemas
Make sure the output folder path does not already exist
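A minimal Pig script sketch along these lines (the paths, field names, and the filter are placeholders, not the real job):

REGISTER udf.jar;

-- load with the custom loader; this schema is only an assumed example
raw = LOAD 's3://yourbucket/folder/input' USING udf.NcycloLoader()
      AS (server:chararray, field:chararray, value:chararray);

-- trivial example transformation
clean = FILTER raw BY value IS NOT NULL;

-- the output folder must not exist yet
STORE clean INTO 's3://yourbucket/output';

In local mode (pig -x local) the same script would point at local folders instead of s3:// paths.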
Useful references:
http://chimera.labs.oreilly.com/books/1234000001811/index.html
http://pig.apache.org/docs/r0.10.0/basic.html

3. AWS environment
- Create a Hadoop cluster
In the EMR console, create a new job flow:
choose name & type -> select mode / upload script -> select number of
nodes & type / spot price -> keypair / log / keep alive or not -> finish
(a command-line sketch follows)
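If cluster creation is later automated from the command line (see Next Steps), a hedged sketch with the current AWS CLI could look like the following; every value is a placeholder, the flags should be checked against the installed CLI version, and the release label here is newer than the Pig 0.10 setup used above:
>>aws emr create-cluster --name "pig-cluster" --release-label emr-4.7.0 \
    --applications Name=Pig --instance-type m3.xlarge --instance-count 3 \
    --ec2-attributes KeyName=keypair --log-uri s3://yourbucket/logs \
    --use-default-roles --auto-terminate \
    --steps Type=PIG,Name="pig-step",ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://yourbucket/folder/file.pig]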
- SSH commands
(If using Cygwin, cd to the folder containing the pem file:
>>cd /cygdrive/c/Users/qli/Documents
SSH to the micro jumpbox:
>>ssh -i keypair.pem ec2-user@10.1.62.181)
SSH from the jumpbox to the Hadoop cluster:
>>ssh -i keypair.pem hadoop@ec2-54-224-126-81.compute-1.amazonaws.com

- Install s3cmd on the jumpbox or other instances (to access S3 from the
command line)
Download s3cmd:
>>wget http://sourceforge.net/projects/s3tools/files/s3cmd/1.5.0-alpha1/s3cmd-1.5.0-alpha1.tar.gz
Untar it:
>>tar -xvf s3cmd-1.5.0-alpha1.tar.gz
cd into it and install:
>>sudo python setup.py install
Then configure it with your keys:
>>s3cmd --configure
Commands to upload to and download from S3:
>>s3cmd put file s3://yourbucket/folder
>>s3cmd get s3://yourbucket/folder/file
Reference: http://s3tools.org/s3cmd

For large data set transfers, use distcp:
>>hadoop distcp s3://yourbucket/folder hdfs:///yourfolder
Reference: http://hadoop.apache.org/docs/stable/distcp.html
If the pem file is moved through S3, its permissions may need to be fixed
before ssh will accept it:
>>chmod 600 keypair.pem
- CloudWatch (to monitor the cluster)
Search all metrics by job flow id.
Useful metrics include: HDFS read/write, running map/reduce jobs, S3
read/write.

- S3 console: another way to access S3

Workflow
[Beforehand] Upload udf.jar, serverlist.csv, and fieldlist.csv to the S3 bucket
1. S3 browser: upload by dragging files/folders
2. In the AWS EMR console: create the cluster
In CloudWatch: monitor cluster performance
3. Run the script (see the commands sketched after this list)
4. Terminate the cluster
5. Download the output from S3
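One way to run step 3 from the cluster's master node, assuming file.pig and udf.jar were uploaded to the bucket and s3cmd is installed there (paths are placeholders):
>>s3cmd get s3://yourbucket/folder/file.pig
>>s3cmd get s3://yourbucket/folder/udf.jar
>>pig file.pig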
Next Steps
1. Automation through the command line
2. Update fieldlist and serverlist at run time
3. Optimize the code
