Sams Teach Yourself Apache Spark in 24 Hours
Table of Contents

Preface    xii
About the Author    xv

HOUR 10: Working with the Spark API (Transformations and Actions)    197
    RDDs and Data Sampling    197
    Spark Transformations    199
    Spark Actions    206
    Key Value Pair Operations    211
    Join Functions    219
    Numerical RDD Operations    229
    Summary    232
    Q&A    232
    Workshop    233

Index    543
Preface
This book assumes nothing, unlike many big data (Spark and Hadoop) books before it,
which are often shrouded in complexity and assume years of prior experience. I don't
assume that you are a seasoned software engineer with years of experience in Java,
I don't assume that you are an experienced big data practitioner with extensive experience
in Hadoop and other related open source software projects, and I don't assume that you are
an experienced data scientist.
By the same token, you will not find this book patronizing or an insult to your intelligence
either. The only prerequisite to this book is that you are comfortable with Python. Spark
includes several application programming interfaces (APIs). The Python API was selected as
the basis for this book as it is an intuitive, interpreted language that is widely known and
easily learned by those who haven't used it.
This book could have easily been titled Sams Teach Yourself Big Data Using Spark because
this is what I attempt to do, taking it from the beginning. I will introduce you to Hadoop,
MapReduce, cloud computing, SQL, NoSQL, real-time stream processing, machine learning,
and more, covering all topics in the context of how they pertain to Spark. I focus on core
Spark concepts such as the Resilient Distributed Dataset (RDD), interacting with Spark using
the shell, implementing common processing patterns, practical data engineering/analysis
approaches using Spark, and much more.
I was first introduced to Spark in early 2013, which seems like a short time ago but is
a lifetime ago in the context of the Hadoop ecosystem. Prior to this, I had been a Hadoop
consultant and instructor for several years. Before writing this book, I had implemented and
used Spark in several projects ranging in scale from small to medium business to enterprise
implementations. Even having substantial exposure to Spark, researching and writing this
book was a learning journey for myself, taking me further into areas of Spark that I had not
yet appreciated. I would like to take you on this journey as well as you read this book.
Spark and Hadoop are subject areas I have dedicated myself to and that I am passionate
about. The making of this book has been hard work but has truly been a labor of love.
I hope this book launches your career as a big data practitioner and inspires you to do
amazing things with Spark.
Part II, Programming with Apache Spark, is focused on programming with Spark, which
includes an introduction to functional programming with both Python and Scala as well as
a detailed introduction to the Spark core API.
Part III, Extensions to Spark, covers extensions to Spark, which include Spark SQL, Spark
Streaming, machine learning, and graph processing with Spark. Other areas such as NoSQL
systems (such as Cassandra and HBase) and messaging systems (such as Kafka) are covered
here as well.
Key topics being introduced for the first time are typically italicized by convention. Most
hours also include programming examples in numbered code listings. Where functions,
commands, classes, or objects are referred to in text, they appear in monospace type.
NOTE
Content not integral to the subject matter but worth noting or being aware of.
TIP
TIP Subtitle
A hint or tip relating to the current topic that could be useful.
CAUTION
Caution Subtitle
Something related to the current topic that could lead to issues if not addressed.
TRY IT YOURSELF
Exercise Title
An exercise related to the current topic including a step-by-step guide and descriptions of
expected outputs.
About the Author
Jeffrey Aven is a big data consultant and instructor based in Melbourne, Australia. Jeff has
an extensive background in data management and several years of experience consulting and
teaching in the areas of Hadoop, HBase, Spark, and other big data ecosystem technologies.
Jeff has won accolades as a big data instructor and is also an accomplished consultant who
has been involved in several high-profile, enterprise-scale big data implementations across
different industries in the region.
Dedication
This book is dedicated to my wife and three children. I have been burning the
candle at both ends during the writing of this book and I appreciate
your patience and understanding.
Acknowledgments
Special thanks to Cody Koeninger and Chris Zahn for their input and feedback as editors.
Also thanks to Trina McDonald and all of the team at Pearson for keeping me in line during
the writing of this book!
We Want to Hear from You
As the reader of this book, you are our most important critic and commentator. We value
your opinion and want to know what we're doing right, what we could do better, what areas
you'd like to see us publish in, and any other words of wisdom you're willing to pass our way.
We welcome your comments. You can email or write to let us know what you did or didn't
like about this book, as well as what we can do to make our books better.
Please note that we cannot help you with technical problems related to the topic of this book.
When you write, please be sure to include this book's title and author as well as your name
and email address. We will carefully review your comments and share them with the author
and editors who worked on the book.
E-mail: feedback@samspublishing.com
Reader Services
Visit our website and register this book at informit.com/register for convenient access to
any updates, downloads, or errata that might be available for this book.
HOUR 3
Installing Spark
Now that you've gotten through the heavy stuff in the last two hours, you can dive headfirst into
Spark and get your hands dirty, so to speak.
This hour covers the basics about how Spark is deployed and how to install Spark. I will also
cover how to deploy Spark on Hadoop using the Hadoop scheduler, YARN, discussed in Hour 2.
By the end of this hour, you'll be up and running with an installation of Spark that you will use
in subsequent hours.
The following are the deployment modes for Spark:

- Spark Standalone
- Spark on YARN
- Spark on Mesos
Spark Standalone refers to the built-in or standalone scheduler. The term can be confusing
because you can have a single machine or a multinode fully distributed cluster both running
in Spark Standalone mode. The term standalone simply means it does not need an external
scheduler.
With Spark Standalone, you can get up and running quickly with few dependencies or
environmental considerations. Spark Standalone includes everything you need to get started.
Spark on YARN and Spark on Mesos are deployment modes that use the resource schedulers
YARN and Mesos respectively. In each case, you would need to establish a working YARN or
Mesos cluster prior to installing and configuring Spark. In the case of Spark on YARN, this
typically involves deploying Spark to an existing Hadoop cluster.
I will cover Spark Standalone and Spark on YARN installation examples in this hour because
these are the most common deployment modes in use today.
Spark can be installed on the following platforms:

- Linux
- Windows
- Mac OS X
Although there are no specific hardware requirements, general Spark instance hardware
recommendations are:

- 8 GB or more memory
- Four or more disks in JBOD configuration (JBOD stands for Just a Bunch of Disks,
  referring to independent hard disks not in a RAID, or Redundant Array of Independent
  Disks, configuration)
Spark is written in Scala with programming interfaces in Python (PySpark) and Scala. The
following are software prerequisites for installing and running Spark:

- Java

If you wish to use Spark with R (as I will discuss in Hour 15, "Getting Started with Spark
and R"), you will need to install R as well. Git, Maven, or SBT may be useful as well if you
intend to build Spark from source or compile Spark programs.
If you are deploying Spark on YARN or Mesos, of course, you need to have a functioning YARN
or Mesos cluster before deploying and configuring Spark to work with these platforms.
I will cover installing Spark in Standalone mode on a single machine on each type of platform,
including satisfying all of the dependencies and prerequisites.
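Before proceeding, you can confirm that the Java and Python prerequisites are satisfied from a terminal; this is only an illustrative check, and the versions reported on your system will differ:

$ java -version      # any Java 1.7 or higher runtime will do
java version "1.7.0_79"
$ python --version   # Python 2.x is expected by PySpark in this release
Python 2.7.5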
Getting Spark
In the installation steps for Linux and Mac OS X, I will use pre-built releases of Spark. You could
also download the source code for Spark and build it yourself for your target platform using the
build instructions provided on the official Spark website. I will use the latest Spark binary release
in my examples. In either case, your first step, regardless of the intended installation platform, is
to download either the release or source from: http://spark.apache.org/downloads.html
This page will allow you to download the latest release of Spark. In this example, the latest
release is 1.5.2; your release will likely be later than this (for example, 1.6.x or 2.x.x).
FIGURE 3.1
The Apache Spark downloads page.
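For example, to pull the pre-built package directly to a Linux or Mac OS X host, you could use wget; the mirror hostname below is a placeholder, so substitute the mirror link presented on the downloads page:

# download the pre-built Spark 1.5.2 package from your chosen mirror
wget http://<your-apache-mirror>/spark-1.5.2-bin-hadoop2.6.tgz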
NOTE
The Spark releases do not actually include Hadoop as the names may imply. They simply include
libraries to integrate with the Hadoop clusters and distributions listed. Many of the Hadoop
classes are required regardless of whether you are using Hadoop. I will use the
spark-1.5.2-bin-hadoop2.6.tgz package for this installation.
CAUTION
Using the "Without Hadoop" Builds
You may be tempted to download the "without Hadoop" or spark-x.x.x-bin-without-hadoop.tgz
options if you are installing in Standalone mode and not using Hadoop.
The nomenclature can be confusing, but this build expects many of the required classes
that are implemented in Hadoop to be present on the system. Select this option only if you have
Hadoop installed on the system already. Otherwise, as I have done in my case, use one of the
spark-x.x.x-bin-hadoopx.x builds.
TRY IT YOURSELF
Install Spark on Red Hat/CentOS
In this example, I'm installing Spark on a Red Hat Enterprise Linux 7.1 instance. However, the
same installation steps would apply to CentOS distributions as well.
1. As shown in Figure 3.1, download the spark-1.5.2-bin-hadoop2.6.tgz package from
your local mirror into your home directory using wget or curl.
2. If Java 1.7 or higher is not installed, install the Java 1.7 runtime and development
environments using the OpenJDK yum packages (alternatively, you could use the Oracle JDK
instead):
sudo yum install java-1.7.0-openjdk java-1.7.0-openjdk-devel
The SPARK_HOME environment variable could also be set using the .bashrc file or similar
user or system profile scripts. You need to do this if you wish to persist the SPARK_HOME
variable beyond the current session.
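As a minimal sketch of what those profile entries might look like (the /opt/spark path below is an assumption; substitute the directory into which you extracted Spark):

# example ~/.bashrc additions to persist SPARK_HOME and the Spark bin directory
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH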
5. Open the PySpark shell by running the pyspark command from any directory (as you've
added the Spark bin directory to the PATH). If Spark has been successfully installed, you
should see the following output (with informational logging messages omitted for brevity):
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.5 (default, Feb 11 2014 07:46:25)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
6. You should see a similar result by running the spark-shell command from any directory.
7. Run the included Pi Estimator example by executing the following command:
spark-submit --class org.apache.spark.examples.SparkPi \
--master local \
$SPARK_HOME/lib/spark-examples*.jar 10
8. If the installation was successful, you should see something similar to the following result
(omitting the informational log messages). Note, this is an estimator program, so the actual
result may vary:
Pi is roughly 3.140576
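As an additional sanity check, you can run a trivial job from the PySpark shell opened in step 5; the two statements below use only the SparkContext (sc) provided by the shell, and the output shown is illustrative:

>>> sc.version
u'1.5.2'
>>> sc.parallelize(range(1000)).count()
1000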
NOTE
Most of the popular Linux distributions include Python 2.x with the python binary in the system
path, so you normally don't need to explicitly install Python; in fact, the yum program itself is
implemented in Python.
You may also have wondered why you did not have to install Scala as a prerequisite. The Scala
binaries are included in the assembly when you build or download a pre-built release of Spark.
TRY IT YOURSELF
Install Spark on Ubuntu/Debian Linux
In this example, I'm installing Spark on an Ubuntu 14.04 LTS Linux distribution.
As with the Red Hat example, Python 2.7 is already installed with the operating system, so we do
not need to install Python.
1. As shown in Figure 3.1, download the spark-1.5.2-bin-hadoop2.6.tgz package from
your local mirror into your home directory using wget or curl.
2. If Java 1.7 or higher is not installed, install the Java 1.7 runtime and development
environments using Ubuntu's APT (Advanced Packaging Tool). Alternatively, you could use
the Oracle JDK instead:
sudo apt-get update
sudo apt-get install openjdk-7-jre
sudo apt-get install openjdk-7-jdk
The SPARK_HOME environment variable could also be set using the .bashrc file or
similar user or system profile scripts. You will need to do this if you wish to persist the
SPARK_HOME variable beyond the current session.
5. Open the PySpark shell by running the pyspark command from any directory. If Spark has
been successfully installed, you should see the following output:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
6. You should see a similar result by running the spark-shell command from any directory.
7. Run the included Pi Estimator example by executing the following command:
spark-submit --class org.apache.spark.examples.SparkPi \
--master local \
$SPARK_HOME/lib/spark-examples*.jar 10
8. If the installation was successful, you should see something similar to the following
result (omitting the informational log messages). Note, this is an estimator program,
so the actual result may vary:
Pi is roughly 3.140576
TRY IT YOURSELF

Install Spark on Mac OS X
3. Open the PySpark shell by running the pyspark command in the Terminal from any
directory. If Spark has been successfully installed, you should see the following output:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.5 (default, Feb 11 2014 07:46:25)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
The SPARK_HOME environment variable could also be set using the .profile file or similar
user or system profile scripts.
4. You should see a similar result by running the spark-shell command in the terminal from
any directory.
5. Run the included Pi Estimator example by executing the following command:
spark-submit --class org.apache.spark.examples.SparkPi \
--master local \
$SPARK_HOME/lib/spark-examples*.jar 10
6. If the installation was successful, you should see something similar to the following result
(omitting the informational log messages). Note, this is an estimator program, so the actual
result may vary:
Pi is roughly 3.140576
TRY IT YOURSELF
Install Spark on Microsoft Windows
Installing Spark on Windows can be more involved than installing it on Linux or Mac OS X because
many of the dependencies (such as Python and Java) need to be addressed first.
This example uses Windows Server 2012, the server version of Windows 8.
1. You will need a decompression utility capable of extracting .tar.gz and .gz archives
because Windows does not have native support for these archives. 7-zip is a suitable
program for this. You can obtain it from http://7-zip.org/download.html.
2. As shown in Figure 3.1, download the spark-1.5.2-bin-hadoop2.6.tgz package
from your local mirror and extract the contents of this archive to a new directory called
C:\Spark.
3. Install Java using the Oracle JDK Version 1.7, which you can obtain from the Oracle website.
In this example, I download and install the jdk-7u79-windows-x64.exe package.
4. Disable IPv6 for Java applications by running the following command as an administrator
from the Windows command prompt:
setx /M _JAVA_OPTIONS "-Djava.net.preferIPv4Stack=true"
5. Python is not included with Windows, so you will need to download and install it. You can
obtain a Windows installer for Python from https://www.python.org/getit/. I use Python
2.7.10 in this example. Install Python into C:\Python27.
6. Download the Hadoop common binaries necessary to run Spark compiled for Windows x64
from hadoop-common-bin. Extract these files to a new directory called C:\Hadoop.
7. Set an environment variable at the machine level for HADOOP_HOME by running the
following command as an administrator from the Windows command prompt:
setx /M HADOOP_HOME C:\Hadoop
8. Update the system path by running the following command as an administrator from the
Windows command prompt:
setx /M path "%path%;C:\Python27;%PROGRAMFILES%\Java\jdk1.7.0_79\bin;C:\Hadoop"
10. Test the Spark interactive shell in Python by running the following command:
C:\Spark\bin\pyspark
FIGURE 3.2
The PySpark shell in Windows.
11. You should get a similar result by running the following command to open an interactive
Scala shell:
C:\Spark\bin\spark-shell
12. Run the included Pi Estimator example by executing the following command:
C:\Spark\bin\spark-submit --class org.apache.spark.examples.SparkPi --master
local C:\Spark\lib\spark-examples*.jar 10
13. If the installation was successful, you should see something similar to the following result
shown in Figure 3.3. Note, this is an estimator program, so the actual result may vary:
FIGURE 3.3
The results of the SparkPi example program in Windows.
The high-level steps to deploy a multi-node Spark Standalone cluster are as follows:

- Ensure all cluster nodes can resolve hostnames of other cluster members and are routable
  to one another (typically, nodes are on the same private subnet).
- Enable passwordless SSH (Secure Shell) for the Spark master to the Spark slaves (this step is
  only required to enable remote login for the slave daemon startup and shutdown actions).
- Configure the spark-defaults.conf file on all nodes with the URL of the Spark
  master node.
- Configure the spark-env.sh file on all nodes with the hostname or IP address of the
  Spark master node.
- Run the start-master.sh script from the sbin directory on the Spark master node.
- Run the start-slave.sh script from the sbin directory on all of the Spark slave nodes.
- Check the Spark master UI. You should see each slave node in the Workers section.
TRY IT YOURSELF

Install a Multi-Node Spark Standalone Cluster
2. Ensure the firewall rules or network ACLs will allow traffic on multiple ports between cluster
instances because cluster nodes will communicate using various TCP ports (normally not a
concern if all cluster nodes are on the same subnet).
3. Create and configure the spark-defaults.conf file on all nodes. Run the following
commands on the sparkmaster and sparkslave hosts:
cd $SPARK_HOME/conf
sudo cp spark-defaults.conf.template spark-defaults.conf
sudo sed -i "\$aspark.master\tspark://sparkmaster:7077" spark-defaults.conf
4. Create and configure the spark-env.sh file on all nodes. Complete the following tasks on
the sparkmaster and sparkslave hosts:
cd $SPARK_HOME/conf
sudo cp spark-env.sh.template spark-env.sh
sudo sed -i "\$aSPARK_MASTER_IP=sparkmaster" spark-env.sh
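To start the cluster daemons referenced in the bulleted steps earlier, a minimal sketch (assuming the sparkmaster and sparkslave hostnames used above) is:

# on the sparkmaster host
sudo $SPARK_HOME/sbin/start-master.sh

# on each sparkslave host, registering the worker with the master
sudo $SPARK_HOME/sbin/start-slave.sh spark://sparkmaster:7077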
10. If the application completes successfully, you should see something like the following
(omitting informational log messages). Note, this is an estimator program, so the actual result
may vary:
Pi is roughly 3.140576
CAUTION
Spark Master Is a Single Point of Failure in Standalone Mode
Without implementing High Availability (HA), the Spark Master node is a single point of failure (SPOF)
for the Spark cluster. This means that if the Spark Master node goes down, the Spark cluster would
stop functioning, all currently submitted or running applications would fail, and no new applications
could be submitted.
High Availability can be configured using Apache Zookeeper, a highly reliable distributed coordination
service. You can also configure HA using the filesystem instead of Zookeeper; however, this is not
recommended for production systems.
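As a sketch of what ZooKeeper-based recovery configuration looks like, the following properties would be added to spark-env.sh on each Spark master node; the ZooKeeper hostnames shown are assumptions, so substitute the addresses of your own ZooKeeper quorum:

# spark-env.sh on each Spark master node
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
 -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
 -Dspark.deploy.zookeeper.dir=/spark"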
If you followed the instructions in the previous section, Installing Spark in Standalone Mode,
you should be able to browse the contents of $SPARK_HOME.
Directory   Description

ec2         Contains scripts to deploy Spark nodes and clusters on Amazon Web Services
            (AWS) Elastic Compute Cloud (EC2). I will cover deploying Spark in EC2 in
            Hour 5, "Deploying Spark in the Cloud."
lib         Contains the main assemblies for Spark, including the main library
            (spark-assembly-x.x.x-hadoopx.x.x.jar) and the included example programs
            (spark-examples-x.x.x-hadoopx.x.x.jar), of which we have already run one,
            SparkPi, to verify the installation in the previous section.
licenses    Includes license files covering other included projects such as Scala and JQuery.
            These files are for legal compliance purposes only and are not required to
            run Spark.
python      Contains all of the Python libraries required to run PySpark. You will generally
            not need to access these files directly.
sbin        Contains administrative scripts to start and stop master and slave services
            (locally or remotely) as well as start processes related to YARN and Mesos.
            I used the start-master.sh and start-slave.sh scripts when I covered how
            to install a multi-node cluster in the previous section.
data        Contains sample data sets used for testing MLlib (which we will discuss in more
            detail in Hour 16, "Machine Learning with Spark").
examples    Contains the source code for all of the examples included in
            lib/spark-examples-x.x.x-hadoopx.x.x.jar. Example programs are
            included in Java, Python, R, and Scala. You can also find the latest code for the
            included examples at https://github.com/apache/spark/tree/master/examples.
R           Contains the SparkR package and associated libraries and documentation.
            I will discuss SparkR in Hour 15, "Getting Started with Spark and R."
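For example, listing the lib directory shows the assembly and examples JARs described above; the file names below assume the spark-1.5.2-bin-hadoop2.6 package, and additional JARs will also be present:

$ ls $SPARK_HOME/lib
spark-assembly-1.5.2-hadoop2.6.0.jar
spark-examples-1.5.2-hadoop2.6.0.jar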
If you are using the management facilities of a commercial distribution, the version of Spark
deployed may lag the latest stable Apache release because Hadoop vendors typically update
their software stacks with their respective major and minor release schedules.
Installing Manually
Installing Spark on a YARN cluster manually (that is, not using a management interface such as
Cloudera Manager or Ambari) is quite straightforward to do.
TRY IT YOURSELF
Installing Spark on Hadoop Manually
1. Follow the steps outlined for your target platform (for example, Red Hat Linux, Windows,
and so on) in the earlier section, "Installing Spark in Standalone Mode."
2. Ensure that the system you are installing on is a Hadoop client with configuration files
pointing to a Hadoop cluster. You can do this as shown:
hadoop fs -ls
This lists the contents of your user directory in HDFS. You could instead use the path in
HDFS where your input data resides, such as
hadoop fs -ls /path/to/my/data
If you see an error such as hadoop: command not found, you need to make sure a
correctly configured Hadoop client is installed on the system before continuing.
3. Set either the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable as shown:
export HADOOP_CONF_DIR=/etc/hadoop/conf
# or
export YARN_CONF_DIR=/etc/hadoop/conf
As with SPARK_HOME, these variables could be set using the .bashrc or similar profile
script sourced automatically.
4. Execute the following command to test Spark on YARN:
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
$SPARK_HOME/lib/spark-examples*.jar 10
5. If you have access to the YARN Resource Manager UI, you can see the Spark job running
in YARN as shown in Figure 3.4:
FIGURE 3.4
The YARN ResourceManager UI showing the Spark application running.
6. Clicking the ApplicationsMaster link in the ResourceManager UI will redirect you to the Spark
UI for the application:
FIGURE 3.5
The Spark UI.
Submitting Spark applications using YARN can be done in two submission modes:
yarn-cluster or yarn-client.
Using the yarn-cluster option, the Spark Driver and SparkContext, ApplicationsMaster, and
all executors run on YARN NodeManagers. These are all concepts we will explore in detail in
Hour 4, "Understanding the Spark Runtime Architecture." The yarn-cluster submission
mode is intended for production or non-interactive/batch Spark applications. You cannot use
yarn-cluster as an option for any of the interactive Spark shells. For instance, running
the following command will result in an error:
spark-shell --master yarn-cluster
Using the yarn-client option, the Spark Driver runs on the client (the host where you ran the
Spark application). All of the tasks and the ApplicationsMaster run on the YARN NodeManagers;
however, unlike yarn-cluster mode, the Driver does not run on the ApplicationsMaster.
The yarn-client submission mode is intended to run interactive applications such as
pyspark or spark-shell.
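To make the distinction concrete, the following commands contrast the two submission modes, reusing the SparkPi example run earlier in this hour:

# interactive shell using the yarn-client submission mode
pyspark --master yarn-client

# batch application using the yarn-cluster submission mode
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  $SPARK_HOME/lib/spark-examples*.jar 10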
CAUTION
Running Incompatible Workloads Alongside Spark May Cause Issues
Spark is a memory-intensive processing engine. Using Spark on YARN will allocate containers,
associated CPU, and memory resources to applications such as Spark as required. If you have other
memory-intensive workloads, such as Impala, Presto, or HAWQ running on the cluster, you need
to ensure that these workloads can coexist with Spark and that neither compromises the other.
Generally, this can be accomplished through application, YARN cluster, scheduler, or application
queue configuration and, in extreme cases, operating system cgroups (on Linux, for instance).
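One simple way to bound a Spark application's footprint on a shared YARN cluster is to set its executor resources explicitly at submission time; the values below are illustrative only and should be sized to your cluster:

# request a fixed number of modestly sized executors rather than the defaults
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2G \
  --executor-cores 2 \
  $SPARK_HOME/lib/spark-examples*.jar 10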
Summary
In this hour, I have covered the different deployment modes for Spark: Spark Standalone, Spark
on Mesos, and Spark on YARN.
Spark Standalone refers to the built-in process scheduler it uses, as opposed to a preexisting
external scheduler such as Mesos or YARN. A Spark Standalone cluster could have any
number of nodes, so the term Standalone could be a misnomer if taken out of context. I have
shown you both how to install Spark in Standalone mode (as a single node or multi-node
cluster) and how to install Spark on an existing YARN (Hadoop) cluster.
I have also explored the components included with Spark, many of which you will have used by
the end of this book.
You're now up and running with Spark. You can use your Spark installation for most of the
exercises throughout this book.
Q&A
Q. What are the factors involved in selecting a specific deployment mode for Spark?
A. The choice of deployment mode for Spark is primarily dependent upon the environment
you are running in and the availability of external scheduling frameworks such as YARN or
Mesos. For instance, if you are using Spark with Hadoop and you have an existing YARN
infrastructure, Spark on YARN is a logical deployment choice. However, if you are running
Spark independent of Hadoop (for instance sourcing data from S3 or a local filesystem),
Spark Standalone may be a better deployment method.
Workshop
The workshop contains quiz questions and exercises to help you solidify your understanding of
the material covered. Try to answer all questions before looking at the Answers section that
follows.
Quiz
1. True or false: A Spark Standalone cluster consists of a single node.
2. Which component is not a prerequisite for installing Spark?
A. Scala
B. Python
C. Java
3. Which of the following subdirectories contained in the Spark installation contains scripts to
start and stop master and slave node Spark services?
A. bin
B. sbin
C. lib
4. Which of the following environment variables are required to run Spark on Hadoop/YARN?
A. HADOOP_CONF_DIR
B. YARN_CONF_DIR
C. Either HADOOP_CONF_DIR or YARN_CONF_DIR will work.
Answers
1. False. Standalone refers to the independent process scheduler for Spark, which could be
deployed on a cluster of one to many nodes.
2. A. The Scala assembly is included with Spark; however, Java and Python must exist on the
system prior to installation.
3. B. sbin contains administrative scripts to start and stop Spark services.
4. C. Either the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable must be set for
Spark to use YARN.
Exercises
1. Using your Spark Standalone installation, execute pyspark to open a PySpark interactive
shell.
2. Open a browser and navigate to the Spark UI at http://localhost:4040.
3. Click the Environment top menu link or navigate to the Environment page directly using the URL
http://localhost:4040/environment/.
4. Note some of the various environment settings and configuration parameters set. I will
explain many of these in greater detail throughout the book.
Index
Zookeeper 573