
HADOOP BIG DATA COURSE CONTENT

Big Data Course Highlights


The Big Data course will start with the basics of Linux that are required to get started with Big Data, and then progress from the fundamentals of Hadoop/Big Data (like a Word Count program) to advanced concepts around Hadoop/Big Data (like writing trigger- and stored-procedure-style coprocessors for HBase).
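
For orientation, below is a minimal sketch of the Word Count mapper and reducer mentioned above, written against the newer org.apache.hadoop.mapreduce Java API. The class names and the whitespace tokenization are illustrative choices, not the exact code used in the course.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the 1s emitted for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }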

The main focus of the course will be on how to use the Big Data tools, but we will also cover how to install and configure some of the Big Data frameworks on on-premises and cloud-based infrastructure.

Because of the hype, Hadoop is in the news all the time. But there are also many frameworks that build on Hadoop (like Pig and Hive) and many frameworks that are alternatives to Hadoop (like Twitter's Storm and LinkedIn's Samza) and address the limitations of the MapReduce model. Some of these frameworks will also be discussed during the course to give a big picture of what Big Data is about.

Time will also be spent on NoSQL databases, starting with why one would pick NoSQL over an RDBMS and moving on to advanced concepts like importing data in bulk from an RDBMS into a NoSQL database. Different NoSQL databases will be compared, and HBase will be discussed in much more detail.

A VM (Virtual Machine) will be provided to all participants, with the Big Data frameworks (Hadoop etc.) installed and configured on CentOS, along with data sets and code to process them. The VM makes the Big Data learning curve less steep.

The training will help participants get through the Cloudera Certified Developer for Apache Hadoop (CCDH) certification with minimal additional effort.


Pre-Requisites:
Knowledge of Java is a definite plus for getting started with Big Data, but it is not mandatory. Hadoop provides Streaming, which allows MapReduce programs to be written in non-Java languages like Perl and Python, and there are also higher-level abstractions like Hive and Pig, which provide SQL-like and procedural interfaces.

Similarly, knowledge of Linux is a definite plus, but only the basics of Linux, just enough to get started with the different Big Data frameworks, are needed.

A laptop/desktop with a minimum of 3 GB RAM, 10 GB of free hard disk space and a decent processor is required. These specifications are enough to run the Big Data VM and the frameworks smoothly.

Who should plan on joining this program

Audience:
This course is designed for:

Any developer with skills in other technologies who is interested in getting into the emerging Big Data field.

Any data analyst who would like to enhance their existing knowledge or transfer it to the Big Data space.

Any architect who would like to design applications in conjunction with Big Data, or Big Data applications themselves.


Anyone involved in Software Quality Assurance (Testing): knowledge of Hadoop will help them test applications better and can also help them move into the development cycle.

Topics covered in the training


Understanding Big Data
- 3V (Volume, Variety, Velocity) characteristics
- Structured and unstructured data
- Applications and use cases of Big Data
- Limitations of traditional large-scale systems
- How a distributed way of computing is superior (cost and scale)
- Opportunities and challenges with Big Data

HDFS (The Hadoop Distributed File System)


HDFS Overview and Architecture
Deployment Architecture
Name Node, Data Node and Checkpoint Node (aka Secondary Name Node)
Safe mode
Configuration files
HDFS Data Flows (Read vs Write)
How HDFS addresses fault tolerance
CRC checksums
Data replication

Rack awareness and Block placement policy


Small files problem


HDFS Interfaces
Web Interface
Command Line Interface
File System
Administrative
(Programmatic access via the Java FileSystem API is sketched right after this list.)
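
To give a flavour of programmatic access to HDFS (alongside the web and command line interfaces above), here is a small sketch that reads a file through the Java FileSystem API; the NameNode URI and the file path are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // fs.defaultFS is the standard setting for the default file system;
            // the hostname and port below are placeholders.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(fs.open(new Path("/user/demo/sample.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }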

Advanced HDFS features


Load Balancer
DistCP
HDFS High Availability
Hadoop Cache
MapReduce - 1
MapReduce Overview
Functional Programming paradigms
How to think in a MapReduce way
MapReduce Architecture
Legacy MR vs Next Generation MapReduce (YARN)
Slots vs Containers
Schedulers
Shuffling, Sorting
Hadoop Data types
Input and Output formats

Input Splits


Partitioning (Hash Partitioner vs Custom Partitioner; see the sketch after this list)


Counters
Configuration files
Distributed Cache
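
As a sketch of the custom Partitioner topic in the list above, the class below routes keys beginning with 'a' to 'm' to one partition and the rest to another; the routing rule and the class name are made up for illustration.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Toy custom Partitioner: splits the key space alphabetically.
    public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions == 1 || key.getLength() == 0) {
                return 0;
            }
            char first = Character.toLowerCase(key.toString().charAt(0));
            return (first <= 'm' ? 0 : 1) % numPartitions;
        }
    }
    // Plugged into a job with: job.setPartitionerClass(AlphabetPartitioner.class);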
MapReduce - 2
Developing, debugging and deploying MR programs
Standalone mode (Eclipse)
Pseudo-distributed mode (as in the Big Data VM)
Fully distributed mode
MR API
Old and new MR API
Java Client API
Overview of MRUnit
Hadoop data types and custom Writables/WritableComparables (see the sketch after this list)
Different input and output formats
Saving Binary Data using Sequence Files and Avro Files
Optimizing techniques
Speculative execution
Combiners
Compression
MR algorithms
Sorting (Max Temperature)
Different ways of joining data
Inverted Index


Word co-occurrence
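
The custom Writables item in the list above boils down to implementing two serialization methods. A minimal sketch, with field names echoing the Max Temperature example, might look like this:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Minimal custom Writable carrying a year and a temperature reading.
    public class TemperatureWritable implements Writable {
        private int year;
        private int temperature;

        public TemperatureWritable() {}  // Hadoop requires a no-arg constructor

        public TemperatureWritable(int year, int temperature) {
            this.year = year;
            this.temperature = temperature;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(year);
            out.writeInt(temperature);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            year = in.readInt();
            temperature = in.readInt();
        }
    }

To be usable as a key, the class would implement WritableComparable instead and add a compareTo method.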
Pig
Introduction to Pig
Why Pig and not MapReduce
Pig Components
Pig Execution Modes
Pig Shell - Grunt
Pig Latin, writing Pig Latin scripts
Pig Data Types
Storage Types
Diagnosing Pig commands
Macros
UDFs and External Scripts (a Java UDF is sketched after this list)
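
A Pig UDF is typically a small Java class extending EvalFunc. The sketch below, with an illustrative class name, upper-cases its first argument; it would be registered in a script with REGISTER and then invoked like a built-in function.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Simple Pig UDF: returns its first argument upper-cased.
    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return input.get(0).toString().toUpperCase();
        }
    }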
Hive
Introduction and Architecture
Different modes of executing Hive queries
Metastore implementations
HiveQL (DDL & DML operations; a JDBC sketch follows this list)
External vs Internal Tables
Views
Partitions & Buckets
UDF
Comparison of Pig and Hive
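
One common way to execute Hive queries from code is over JDBC against HiveServer2. The sketch below assumes a local HiveServer2 and a hypothetical word_counts table; the host, port, credentials and table/column names are all placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; the connection details are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT word, cnt FROM word_counts LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }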
Flume
Overview of Flume


Where Flume is used - collecting and importing unstructured data


Flume Architecture
Using Flume to load data into HDFS
Sqoop
Overview of Sqoop
Where Sqoop is used - importing/exporting structured data
Using Sqoop to import data from an RDBMS into HDFS
Using Sqoop to import data from an RDBMS into HBase
Using Sqoop to export data from HDFS into an RDBMS
Sqoop connectors
Impala
Overview of Impala
Architecture of Impala

NoSQL Databases
Introduction to NoSQL databases
Brewer's CAP theorem
ACID vs BASE
Advantages of NoSQL vs. traditional RDBMS
Types of NoSQL databases and their features
- Key-value
- Columnar
- Document
- Graph


HBase
Introduction to HBase
Why use HBase
HBase Architecture - read and write paths
HBase vs. RDBMS
Installation and configuration
Schema design in HBase - column families, hot spotting
Accessing data with HBase Shell
Accessing data with the HBase API - reading, adding and updating data from the shell and the Java API (see the sketch after this list)
HBase Coprocessors (Endpoints, Observers)
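
To give a flavour of the HBase Java API listed above, here is a small put/get sketch using the client API as it looks from HBase 1.x onwards (older releases used HTable instead of Connection/Table); the table, column family and qualifier names are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell into the (placeholder) "info" column family.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);
                // Read the same cell back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }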

POC (Proof of Concept)
Clickstream analysis
Analyzing Twitter data with Hive
