Setup
1. Cloudera:
- Mainly for testing purposes.
- VMware Player: Free download from
http://www.vmware.com/products/player/
- Cloudera QuickStart VM: Free download from
https://www.cloudera.com/content/support/en/downloads.html (My
cloudera version is 4.2.0, with pig version 0.10.0)
- Run cloudera VM in VMware Player
- Need to install Cloudera Development Kit to enable full screen and sharing
folders between host machine and VM. Also from
https://www.cloudera.com/content/support/en/downloads.html
2. AWS account:
Ask Ripul for account credentials (access key, secret key) and the
keypair.pem file
3. SSH tools:
- PuTTY: configure with the AWS credentials and the pem file (PuTTY needs the key converted to .ppk with PuTTYgen)
- Cygwin terminal: ssh from the folder containing the pem file
- Problem: the connection may keep dropping. Switching tools can help.
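Dropped connections are often just idle timeouts, and OpenSSH keepalive options usually fix them without switching tools. A sketch (dry run: the command is only printed; the host and key are the jumpbox examples from these notes):

```shell
# Send a keepalive every 60 s and give up after 3 missed replies, so idle
# sessions are not dropped by intermediate firewalls.
# Dry run: the echo prints the command; remove it to connect for real.
KEY=keypair.pem
HOST=ec2-user@10.1.62.181
CMD="ssh -i $KEY -o ServerAliveInterval=60 -o ServerAliveCountMax=3 $HOST"
echo "$CMD"
```

The same two options can also be set once in `~/.ssh/config` instead of on every command line.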
4. S3 browser:
Download from http://s3browser.com/ and enter the AWS credentials to use it.
Development
1. Cloudera environment
- Compile the Java code in a folder called udf containing NcycloLoader.java.
The classpath must include the following jar files:
>>javac -cp /mnt/hgfs/project/lib/pig.jar:/home/cloudera/eclipse/plugins/org.apache.commons.logging_1.0.4.v201101211617.jar:/mnt/hgfs/project/lib/commons-io-2.4.jar:/mnt/hgfs/project/lib/hadoop-core-0.20.2.jar NcycloLoader.java
Then pack the udf folder into a jar:
>>cd ..
>>jar cf udf.jar udf
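The compile-and-package steps above can be collected into one small build script. A sketch (dry run: the commands are echoed rather than executed; the jar paths are the ones assumed earlier and may differ on your machine):

```shell
#!/bin/sh
# Build the classpath from the jars listed above, then print the compile
# and package commands. Remove the echo prefixes to actually run them.
LIB=/mnt/hgfs/project/lib
PLUGINS=/home/cloudera/eclipse/plugins
CP="$LIB/pig.jar:$PLUGINS/org.apache.commons.logging_1.0.4.v201101211617.jar:$LIB/commons-io-2.4.jar:$LIB/hadoop-core-0.20.2.jar"
echo "javac -cp $CP udf/NcycloLoader.java"
echo "jar cf udf.jar udf"
```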
- There are two execution modes in Pig: mapreduce (Hadoop) and local.
Mapreduce mode runs against data in HDFS; local mode runs against local
folders, which is better for testing.
(1) Run Pig in interactive mode:
>>pig -x local
(2) Run Pig with a Pig script:
>>pig -x local file.pig
>>quit
2. Code
Pig code:
- Register udf.jar
- Use proper schemas
- Make sure the output folder does not already exist (STORE fails if it does)
Useful references:
http://chimera.labs.oreilly.com/books/1234000001811/index.html
http://pig.apache.org/docs/r0.10.0/basic.html
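A minimal script skeleton that follows the checklist above can be written out like this (the script name, input/output paths, aliases, and schema are made up for illustration; NcycloLoader is the UDF compiled earlier):

```shell
# Write a skeleton Pig script matching the checklist above.
# All paths and alias names are hypothetical placeholders.
cat > example.pig <<'EOF'
REGISTER udf.jar;
-- load through the custom loader with an explicit schema
raw = LOAD 'input/data.txt' USING udf.NcycloLoader()
      AS (host:chararray, field:chararray, value:double);
-- ... transformations go here ...
-- the output folder must not exist yet, or STORE fails
STORE raw INTO 'output/result';
EOF
echo "wrote example.pig"
```

Run it with `pig -x local example.pig` as shown above.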
3. AWS environment
- Create hadoop cluster
In the EMR console, create a new job flow:
choose name & type; select mode and upload the script; select number of
nodes & instance type / spot price; set the keypair, log path, and whether
or not to keep the cluster alive; then finish.
- SSH commands
(If using Cygwin, cd to the folder containing the pem file:
>>cd /cygdrive/c/Users/qli/Documents
SSH to the micro jumpbox:
>>ssh -i keypair.pem ec2-user@10.1.62.181)
SSH from the jumpbox to the hadoop cluster:
>>ssh -i keypair.pem hadoop@ec2-54-224-126-81.compute-1.amazonaws.com
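With OpenSSH 7.3+ the two hops can be done in one command using ProxyJump (`-J`), which also authenticates both hops from the local machine so the pem file never needs to be copied to the jumpbox. A sketch using the hosts above (dry run: the command is only printed):

```shell
# One command through the jumpbox with ProxyJump (-J), OpenSSH 7.3+.
# Dry run: the command is echoed; drop the echo to connect for real.
KEY=keypair.pem
JUMP=ec2-user@10.1.62.181
CLUSTER=hadoop@ec2-54-224-126-81.compute-1.amazonaws.com
CMD="ssh -i $KEY -J $JUMP $CLUSTER"
echo "$CMD"
```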
Workflow
[Beforehand] Upload udf.jar, serverlist.csv, and fieldlist.csv to the S3 bucket
1. S3 browser: upload by dragging files/folders
2. In the AWS EMR console: create the cluster
In CloudWatch: monitor cluster performance
3. Run the script
4. Terminate the cluster
5. Download the output from S3
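Steps 1 and 5 can also be scripted with the AWS CLI instead of S3 Browser, assuming `aws` is installed and configured with the same credentials. A sketch (dry run: the commands are echoed; the bucket name is a placeholder):

```shell
# Upload the inputs and fetch the results with the AWS CLI.
# Dry run: commands are printed, not executed. BUCKET is a placeholder;
# replace it with the real bucket name.
BUCKET=s3://my-bucket
for f in udf.jar serverlist.csv fieldlist.csv; do
  echo "aws s3 cp $f $BUCKET/input/$f"
done
echo "aws s3 cp $BUCKET/output/ ./output/ --recursive"
```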
Next Steps
1. Automate through the command line
2. Update fieldlist and serverlist at run time
3. Optimize the code