20773A
Analyzing Big Data with Microsoft R
MCT USE ONLY. STUDENT USE PROHIBITED
ii Analyzing Big Data with Microsoft R
Information in this document, including URL and other Internet Web site references, is subject to change
without notice. Unless otherwise noted, the example companies, organizations, products, domain names,
e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with
any real company, organization, product, domain name, e-mail address, logo, person, place or event is
intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the
user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in
or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
The names of manufacturers, products, or URLs are provided for informational purposes only and
Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding
these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a
manufacturer or product does not imply endorsement by Microsoft of the manufacturer or product. Links
may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not
responsible for the contents of any linked site or any link contained in a linked site, or any changes or
updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission
received from any linked site. Microsoft is providing these links to you only as a convenience, and the
inclusion of any link does not imply endorsement by Microsoft of the site or the products contained
therein.
© 2017 Microsoft Corporation. All rights reserved.
Released: 05/2017
MICROSOFT LICENSE TERMS
MICROSOFT INSTRUCTOR-LED COURSEWARE
These license terms are an agreement between Microsoft Corporation (or based on where you live, one of its
affiliates) and you. Please read them. They apply to your use of the content accompanying this agreement which
includes the media on which you received it, if any. These license terms also apply to Trainer Content and any
updates and supplements for the Licensed Content unless other terms accompany those items. If so, those terms
apply.
BY ACCESSING, DOWNLOADING OR USING THE LICENSED CONTENT, YOU ACCEPT THESE TERMS.
IF YOU DO NOT ACCEPT THEM, DO NOT ACCESS, DOWNLOAD OR USE THE LICENSED CONTENT.
If you comply with these license terms, you have the rights below for each license you acquire.
1. DEFINITIONS.
a. “Authorized Learning Center” means a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, or such other entity as Microsoft may designate from time to time.
b. “Authorized Training Session” means the instructor-led training class using Microsoft Instructor-Led
Courseware conducted by a Trainer at or through an Authorized Learning Center.
c. “Classroom Device” means one (1) dedicated, secure computer that an Authorized Learning Center owns
or controls that is located at an Authorized Learning Center’s training facilities that meets or exceeds the
hardware level specified for the particular Microsoft Instructor-Led Courseware.
d. “End User” means an individual who is (i) duly enrolled in and attending an Authorized Training Session
or Private Training Session, (ii) an employee of an MPN Member, or (iii) a Microsoft full-time employee.
e. “Licensed Content” means the content accompanying this agreement which may include the Microsoft
Instructor-Led Courseware or Trainer Content.
f. “Microsoft Certified Trainer” or “MCT” means an individual who is (i) engaged to teach a training session
to End Users on behalf of an Authorized Learning Center or MPN Member, and (ii) currently certified as a
Microsoft Certified Trainer under the Microsoft Certification Program.
g. “Microsoft Instructor-Led Courseware” means the Microsoft-branded instructor-led training course that
educates IT professionals and developers on Microsoft technologies. A Microsoft Instructor-Led
Courseware title may be branded as MOC, Microsoft Dynamics or Microsoft Business Group courseware.
h. “Microsoft IT Academy Program Member” means an active member of the Microsoft IT Academy
Program.
i. “Microsoft Learning Competency Member” means an active member of the Microsoft Partner Network
program in good standing that currently holds the Learning Competency status.
j. “MOC” means the “Official Microsoft Learning Product” instructor-led courseware known as Microsoft
Official Course that educates IT professionals and developers on Microsoft technologies.
k. “MPN Member” means an active Microsoft Partner Network program member in good standing.
l. “Personal Device” means one (1) personal computer, device, workstation or other digital electronic device
that you personally own or control that meets or exceeds the hardware level specified for the particular
Microsoft Instructor-Led Courseware.
m. “Private Training Session” means the instructor-led training classes provided by MPN Members for
corporate customers to teach a predefined learning objective using Microsoft Instructor-Led Courseware.
These classes are not advertised or promoted to the general public and class attendance is restricted to
individuals employed by or contracted by the corporate customer.
n. “Trainer” means (i) an academically accredited educator engaged by a Microsoft IT Academy Program
Member to teach an Authorized Training Session, and/or (ii) an MCT.
o. “Trainer Content” means the trainer version of the Microsoft Instructor-Led Courseware and additional
supplemental content designated solely for Trainers’ use to teach a training session using the Microsoft
Instructor-Led Courseware. Trainer Content may include Microsoft PowerPoint presentations, trainer
preparation guide, train the trainer materials, Microsoft OneNote packs, classroom setup guide and Pre-
release course feedback form. To clarify, Trainer Content does not include any software, virtual hard
disks or virtual machines.
2. USE RIGHTS. The Licensed Content is licensed, not sold. The Licensed Content is licensed on a one copy
per user basis, such that you must acquire a license for each individual that accesses or uses the Licensed
Content.
2.1 Below are five separate sets of use rights. Only one set of rights applies to you.
2.2 Separation of Components. The Licensed Content is licensed as a single unit and you may not
separate its components and install them on different devices.
2.3 Redistribution of Licensed Content. Except as expressly provided in the use rights above, you may
not distribute any Licensed Content or any portion thereof (including any permitted modifications) to any
third parties without the express written permission of Microsoft.
2.4 Third Party Notices. The Licensed Content may include third party code that Microsoft, not the
third party, licenses to you under this agreement. Notices, if any, for the third party code are included
for your information only.
2.5 Additional Terms. Some Licensed Content may contain components with additional terms,
conditions, and licenses regarding its use. Any non-conflicting terms in those conditions and licenses also
apply to your use of that respective component and supplement the terms described in this agreement.
a. Pre-Release Licensed Content. This Licensed Content subject matter is on the Pre-release version of
the Microsoft technology. The technology may not work the way a final version of the technology will
and we may change the technology for the final version. We also may not release a final version.
Licensed Content based on the final version of the technology may not contain the same information as
the Licensed Content based on the Pre-release version. Microsoft is under no obligation to provide you
with any further content, including any Licensed Content based on the final version of the technology.
b. Feedback. If you agree to give feedback about the Licensed Content to Microsoft, either directly or
through its third party designee, you give to Microsoft without charge, the right to use, share and
commercialize your feedback in any way and for any purpose. You also give to third parties, without
charge, any patent rights needed for their products, technologies and services to use or interface with
any specific parts of a Microsoft technology, Microsoft product, or service that includes the feedback.
You will not give feedback that is subject to a license that requires Microsoft to license its technology,
technologies, or products to third parties because we include your feedback in them. These rights
survive this agreement.
c. Pre-release Term. If you are a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, MPN Member or Trainer, you will cease using all copies of the Licensed Content on
the Pre-release technology upon (i) the date which Microsoft informs you is the end date for using the
Licensed Content on the Pre-release technology, or (ii) sixty (60) days after the commercial release of the
technology that is the subject of the Licensed Content, whichever is earliest (“Pre-release term”).
Upon expiration or termination of the Pre-release term, you will irretrievably delete and destroy all copies
of the Licensed Content in your possession or under your control.
4. SCOPE OF LICENSE. The Licensed Content is licensed, not sold. This agreement only gives you some
rights to use the Licensed Content. Microsoft reserves all other rights. Unless applicable law gives you more
rights despite this limitation, you may use the Licensed Content only as expressly permitted in this
agreement. In doing so, you must comply with any technical limitations in the Licensed Content that only
allow you to use it in certain ways. Except as expressly permitted in this agreement, you may not:
• access or allow any individual to access the Licensed Content if they have not acquired a valid license
for the Licensed Content,
• alter, remove or obscure any copyright or other protective notices (including watermarks), branding
or identifications contained in the Licensed Content,
• modify or create a derivative work of any Licensed Content,
• publicly display, or make the Licensed Content available for others to access or use,
• copy, print, install, sell, publish, transmit, lend, adapt, reuse, link to or post, make available or
distribute the Licensed Content to any third party,
• work around any technical limitations in the Licensed Content, or
• reverse engineer, decompile, remove or otherwise thwart any protections or disassemble the
Licensed Content except and only to the extent that applicable law expressly permits, despite this
limitation.
5. RESERVATION OF RIGHTS AND OWNERSHIP. Microsoft reserves all rights not expressly granted to
you in this agreement. The Licensed Content is protected by copyright and other intellectual property laws
and treaties. Microsoft or its suppliers own the title, copyright, and other intellectual property rights in the
Licensed Content.
6. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regulations.
You must comply with all domestic and international export laws and regulations that apply to the Licensed
Content. These laws include restrictions on destinations, end users and end use. For additional information,
see www.microsoft.com/exporting.
7. SUPPORT SERVICES. Because the Licensed Content is “as is”, we may not provide support services for it.
8. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you fail
to comply with the terms and conditions of this agreement. Upon termination of this agreement for any
reason, you will immediately stop all use of and delete and destroy all copies of the Licensed Content in
your possession or under your control.
9. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed
Content. The third party sites are not under the control of Microsoft, and Microsoft is not responsible for
the contents of any third party sites, any links contained in third party sites, or any changes or updates to
third party sites. Microsoft is not responsible for webcasting or any other form of transmission received
from any third party sites. Microsoft is providing these links to third party sites to you only as a
convenience, and the inclusion of any link does not imply an endorsement by Microsoft of the third party
site.
10. ENTIRE AGREEMENT. This agreement, and any additional terms for the Trainer Content, updates and
supplements are the entire agreement for the Licensed Content, updates and supplements.
12. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the laws
of your country. You may also have rights with respect to the party from whom you acquired the Licensed
Content. This agreement does not change your rights under the laws of your country if the laws of your
country do not permit it to do so.
13. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS" AND "AS
AVAILABLE." YOU BEAR THE RISK OF USING IT. MICROSOFT AND ITS RESPECTIVE
AFFILIATES GIVE NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS. YOU MAY
HAVE ADDITIONAL CONSUMER RIGHTS UNDER YOUR LOCAL LAWS WHICH THIS AGREEMENT
CANNOT CHANGE. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS, MICROSOFT AND
ITS RESPECTIVE AFFILIATES EXCLUDE ANY IMPLIED WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
14. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. YOU CAN RECOVER FROM
MICROSOFT, ITS RESPECTIVE AFFILIATES AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP
TO US$5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL,
LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.
This limitation also applies even if Microsoft knew or should have known about the possibility of the damages. The
above limitation or exclusion may not apply to you because your country may not allow the exclusion or
limitation of incidental, consequential or other damages.
Please note: As this Licensed Content is distributed in Quebec, Canada, some of the clauses in this
agreement are provided below in French.
Remarque : Ce contenu sous licence étant distribué au Québec, Canada, certaines des clauses
dans ce contrat sont fournies ci-dessous en français.
EXONÉRATION DE GARANTIE. Le contenu sous licence visé par une licence est offert « tel quel ». Toute
utilisation de ce contenu sous licence est à vos seuls risques et périls. Microsoft n’accorde aucune autre garantie
expresse. Vous pouvez bénéficier de droits additionnels en vertu du droit local sur la protection des
consommateurs, que ce contrat ne peut modifier. Là où elles sont permises par le droit local, les garanties
implicites de qualité marchande, d’adéquation à un usage particulier et d’absence de contrefaçon sont exclues.
EFFET JURIDIQUE. Le présent contrat décrit certains droits juridiques. Vous pourriez avoir d’autres droits
prévus par les lois de votre pays. Le présent contrat ne modifie pas les droits que vous confèrent les lois de votre
pays si celles-ci ne le permettent pas.
Acknowledgements
Microsoft Learning would like to acknowledge and thank the following for their contribution towards
developing this title. Their efforts at various stages in the development have ensured that you have a good
classroom experience.
He now runs Data Jujitsu Ltd., a data science consultancy company based in Manchester, UK
Contents
Module 1: Microsoft R Server and Microsoft R Client
Module Overview 1-1
Course Description
This three-day instructor-led course describes how to use Microsoft R Server to create and run an analysis
on a large dataset, and how to use it in big data environments, such as a Hadoop or Spark cluster, or a SQL
Server database.
Audience
The primary audience for this course is data scientists who are already familiar with R and who have a
high-level understanding of data platforms such as the Hadoop ecosystem, SQL Server, and core T-SQL
capabilities.
The course will likely be attended by developers who need to integrate R analyses into their solutions.
Student Prerequisites
In addition to their professional experience, students who attend this course should have:
Basic knowledge of the Microsoft Windows operating system and its core functionality.
Course Objectives
After completing this course, students will be able to:
Create, score, and deploy partitioning models generated from big data.
Use R in SQL Server and Hadoop environments.
Course Outline
The course outline is as follows:
Module 1: ‘Microsoft R Server and R Client’ provides an introduction to Microsoft R Server and R
Client, and an overview of the ScaleR functions
Module 2: ‘Exploring Big Data’ describes how to use ScaleR data sources to read and summarize big
data in different compute contexts.
Module 3: ‘Visualizing Big Data’ shows how to use the ggplot2 package, and the ScaleR functions, to
generate plots and graphs of big data.
Module 4: ‘Processing Big Data’ describes how to transform and clean big data sets.
Module 5: ‘Parallelizing Analysis Operations’ shows how to use ScaleR and PEMA classes to split
analysis jobs into parallel tasks.
Module 6: ‘Generating and Evaluating Regression Models’ describes how to build and evaluate
regression models generated from big data sets.
Module 7: 'Creating and Evaluating Partitioning Models' shows how to create and score partitioning
models generated from big data sets.
Module 8: 'Processing Big Data in SQL Server and Hadoop' describes how to use R in SQL Server and
Hadoop environments.
Course Materials
The following materials are included with your kit:
Course Handbook: a succinct classroom learning guide that provides the critical technical
information in a crisp, tightly-focused format, which is essential for an effective in-class learning
experience.
o Lessons: guide you through the learning objectives and provide the key points that are critical to
the success of the in-class learning experience.
o Labs: provide a real-world, hands-on platform for you to apply the knowledge and skills learned
in the module.
o Module Reviews and Takeaways: provide on-the-job reference material to boost knowledge
and skills retention.
Modules: include companion content, such as questions and answers, detailed demo steps and
additional reading links, for each lesson. Additionally, they include Lab Review questions and answers
and Module Reviews and Takeaways sections, which contain the review questions and answers, best
practices, common issues and troubleshooting tips with answers, and real-world issues and scenarios
with answers.
Resources: include well-categorized additional resources that give you immediate access to the most
current premium content on TechNet, MSDN®, or Microsoft® Press®.
Course evaluation: at the end of the course, you will have the opportunity to complete an online
evaluation to provide feedback on the course, training facility, and instructor.
The following table shows the role of each virtual machine that is used in this course:
This course also uses a separate VM, LON-HADOOP, running in Azure. This VM is a Hadoop server that is
shared by all students.
Software Configuration
The following software is installed on the virtual machines:
Microsoft R Server - 9.0.1, 64 bit
Course Files
The files associated with the labs in this course are located in the E:\Labfiles folder on the 20773A-LON-
DEV virtual machine.
Classroom Setup
Each classroom computer will have the same virtual machines configured in the same
way. Additionally, the instructor must perform the following tasks before starting the
course:
You should have been provided with the details of the Azure account.
4. In the navigation blade on the left side of the portal, click Resource groups.
7. In the LON-HADOOP blade, click Start, and then wait for the VM to start running.
8. Navigate to http://fqdn:8080, where fqdn is the fully qualified domain name of the LON-HADOOP
VM. For example, http://lon-hadoop-01.ukwest.cloudapp.azure.com:8080 (Note: you can find the
fqdn of the VM in Azure, on the same page that you used to start the VM—it is reported in the Public
IP address/DNS name label field.)
9. On the Sign in page, type admin for the Username and Password, and then click Sign in. Note that
none of the Hadoop services are currently running.
10. In the left pane, click the Actions drop-down, and then click Start All.
All students and the instructor must perform the following tasks prior to commencing
module 1:
Start the VMs
1. In Hyper-V Manager, under Virtual Machines, right-click MT17B-WS2016-NAT, and then click Start.
2. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-DC, and then click Start.
3. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-DEV, and then click Start.
4. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-RSVR, and then click Start.
5. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-SQLR, and then click Start.
Note: If a Hyper-V Manager dialog box appears saying there is not enough memory in
the system to start the virtual machines, restart the PC, and then start the VMs again.
4. In the RStudio Setup wizard, on the Welcome to the RStudio Setup Wizard page, click Next.
5. On the Choose Install Location page, click Next.
8. Click the Windows Start button, and in the Recently added list click RStudio.
9. In the RStudio Console, type the following command and then press Enter.
10. In the Remote Server dialog box, in the User name box type admin, in the Password box type
Pa55w.rd, and then click OK.
11. In the RStudio Console, verify that a remote R session starts successfully.
3. In the PuTTY release 0.68 (64-bit) Setup wizard, on the Welcome to the PuTTY release 0.68 (64-bit)
Setup Wizard page, click Next.
4. On the Destination Folder page, click Next.
11. In the Environment Variables dialog box, click Path, and then click Edit.
12. In the Edit User Variable dialog box, append the path C:\Program Files\PuTTY to the Variable
Value, and then click OK.
17. In the command prompt window, type the following command and then press Enter:
putty
18. Verify that the PuTTY Configuration window appears, and then click Cancel.
Note: For the instructor. In the following procedure, use the instructor account for
connecting to Hadoop, where specified by the text <your user name>. There are separate
accounts on the Hadoop VM for each student, named student01, student02, and so on, as
described in the document Creating the Hadoop VMs on Azure. Allocate one of these accounts
to each student, who should use it as <your user name> in this procedure. You should also
provide students with the fully qualified domain name (fqdn), such as lon-hadoop-
01.ukwest.cloudapp.azure.com, of the Hadoop VM.
2. In the command prompt window, run the putty command. The putty utility should start and the
PuTTY Configuration window should appear.
3. In the PuTTY Configuration window, in the Host Name box, enter the fully qualified domain name
of the LON-HADOOP VM. For example, lon-hadoop-01.ukwest.cloudapp.azure.com.
4. In the Saved Sessions box, type LON-HADOOP, click Save, and then click Open.
7. Run the following command to create SSH keys for performing password-less authentication:
ssh-keygen
8. At the prompt Enter file in which to save the key (/home/<your user name>/.ssh/id_rsa), press
Enter.
16. Run the following command to copy the key file for your account on the Hadoop VM to the LON-
DEV VM. Replace fqdn with the fully qualified domain name of the LON-HADOOP VM, for example,
lon-hadoop-01.ukwest.cloudapp.azure.com:
17. At the Password prompt, type Pa55w.rd, and then press Enter.
18. In the command prompt window, run the puttygen command. The PuTTY Key Generator window
should appear.
20. In the Load private key dialog box, move to the E:\ folder, in the file selector drop-down list box,
click All Files(*.*), click id_rsa, and then click Open.
21. In the PuTTYgen Notice dialog box, verify that the key was imported successfully, and then click OK.
22. In the PuTTY Key Generator window, click Save private key.
24. In the Save private key as dialog box, in the File name box, type HadoopVM, and then click Save.
28. In the Category pane of the PuTTY Configuration window, under Connection, expand SSH, and
then click Auth.
29. In the Options controlling SSH authentication pane, next to the Private key file for
authentication box, click Browse.
30. In the Select private key file dialog box, move to the E:\ folder, click HadoopVM, and then click
Open.
31. In the Category pane of the PuTTY Configuration window, under Connection, click Data.
32. In the Data to send to the server pane, in the Auto-login username box, type <your user name>.
33. In the Category pane of the PuTTY Configuration window, click Session.
34. Click Save, and then close the PuTTY Configuration window.
35. Close the Command Prompt window.
4. In File Explorer, right-click the C:\Data folder, point to Share with, and then click Specific people.
5. In the File Sharing dialog box, click the drop-down list, click Everyone, and then click Add.
6. In the lower pane, click the Everyone row, and set the Permission Level to Read/Write, and then
click Share.
7. In the File Sharing dialog box, verify that the file share is named \\LON-RSVR\Data, and then click
Done.
At the end of the course, the instructor must perform the following tasks to shut down
the Hadoop VM:
3. In the navigation blade on the left side of the portal, click Resource groups, and then click the LON-
HADOOP-SERVER resource group.
Processor:
AMD:
o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (NX Bit)
Intel:
o Supports Second Level Address Translation (SLAT) – Extended Page Table (EPT)
o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (XD bit)
Network adapter
Be connected to a projection display device that supports SVGA resolution of 1024 x 768 pixels and 16-bit color.
Module 1
Microsoft R Server and Microsoft R Client
Contents:
Module Overview 1-1
Module Overview
In this module, you will learn the basics of how to use the Microsoft® R Server and the Microsoft R Client.
This will be the principal environment you will use to interact with R when dealing with data at scale. The
module will give an overview of what the Microsoft R Server actually is, in addition to key concepts (such
as compute contexts) to which you will return throughout this course. You’ll then learn how to connect to the
R Server from the R Client, the various environments for interacting with R, how to connect to remote
servers, and how to transfer data between local and remote sessions. Finally, you will learn how to use the
ScaleR™ functions to handle distributed computations transparently. You will find that many of these
functions are analogous to well-known nonparallel functions in base R.
Objectives
In this module, you will:
Lesson 1
Introduction to Microsoft R Server
Open Source R is an excellent tool for the modeling and prototyping of data science algorithms. However,
it often lacks the speed and stability to be effective in either a production environment or when working
with datasets too large to fit in memory. Microsoft R Server is a server for hosting and managing parallel
and distributed R processes on servers (both Linux and Windows®), clusters running either Hadoop or
Apache Spark, or database systems such as Teradata and SQL Server®. It gives you an infrastructure for
distributing a workload across multiple nodes (chunking) so that you can run R jobs in parallel, and then
reassemble the results for downstream analysis and/or visualization.
Lesson Objectives
In this lesson, you will learn about:
ScaleR. This is the engine that can partition massively large datasets into chunks and distribute them
across multiple compute nodes or database platforms such as Teradata or SQL Server. ScaleR
functions are provided by the RevoScaleR package. You use these functions to remove a lot of the
complexity of dealing with parallel and distributed computation—they are often analogous to many
well-known data manipulation functions in base R. You also use ScaleR functions to, for example, test
out models on small local datasets, and then run them on huge datasets on a cluster with very few
changes in your code.
RevoScaleR Functions
https://aka.ms/uihq8b
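As an illustrative sketch of this analogy (the file name and the ArrDelay column are hypothetical; RevoScaleR is installed with Microsoft R Server and R Client), the same summary can be expressed with base R and with a ScaleR data source:

```r
# Load the RevoScaleR package (installed with Microsoft R Server/Client).
library(RevoScaleR)

# Base R: the entire data set must fit in memory.
flights <- read.csv("flights.csv")       # hypothetical sample file
summary(flights$ArrDelay)

# ScaleR analog: rxSummary reads the data in chunks, so the same
# call works unchanged on files far larger than available memory.
flightsCsv <- RxTextData("flights.csv")
rxSummary(~ ArrDelay, data = flightsCsv)
```

The formula interface is the key difference: ScaleR summary and modeling functions take a formula and a data source object, which lets the same call run in any compute context.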
Machine learning algorithms. Functions that employ machine learning algorithms such as
regression, clustering and partitioning methods on ScaleR distributed data are available in the
MicrosoftML package. As with ScaleR functions, the complexity of dealing with chunked data is
abstracted so you can concentrate on tuning your models.
Microsoft ML: State-of-the-Art Machine Learning R Algorithms from Microsoft Corporation
https://aka.ms/nbexpl
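As a brief sketch (the XDF file, the binary Late outcome, and the predictor columns are hypothetical), a MicrosoftML model is trained and scored with the same formula-plus-data-source style as the ScaleR functions:

```r
library(RevoScaleR)
library(MicrosoftML)

# Hypothetical XDF data source with a binary Late column.
flightsXdf <- RxXdfData("flights.xdf")

# Train a logistic regression; chunking of the underlying data
# is handled for you by the package.
model <- rxLogisticRegression(Late ~ ArrDelay + DayOfWeek,
                              data = flightsXdf)

# Score the trained model with rxPredict, as with ScaleR models.
scores <- rxPredict(model, data = flightsXdf)
```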
Platform-specific components that enable you to write essentially the same code, irrespective of
whether you are running on a Linux server, a Hadoop or Spark cluster, or a Teradata or SQL Server
database.
The operationalization engine for connecting to a remote R server and for deploying R code to
production as a RESTful web service.
Introducing Microsoft R Server
https://aka.ms/vwuke0
R Server environments
You can run R Server on a variety of different
platforms depending on your requirements:
Windows/Linux
You can install R Server on a standalone Windows
or Linux machine. In this case, R Server runs as a
background process that can be accessed from a
variety of front ends including the standard R
Client, RStudio and Visual Studio® (connection via
these tools is covered in Lesson 2). You might also
have a large remote server on which you can
install R Server, and to which you can connect
from your laptop or desktop (see the topic Using a
remote R server later in the module). On large servers, you can use the ScaleR functions to parallelize your
workload over multiple processor cores. You can also install R Server on a network of locally connected
servers.
Hadoop MapReduce/Spark
For very large datasets, having multiple cores on a large server might still not be enough for your needs.
In this case, you might need to farm out your computations over a cluster of multiple compute nodes. R
Server is available on both Hadoop and Spark clusters in the cloud. Here you can make use of the HDFS
distributed file system and MapReduce or Spark frameworks for distributed computation.
Teradata/SQL Server
Both of these platforms provide high performance, high capacity data storage capabilities that are well
suited to working with the high performance analytics available in R Server. You use ScaleR functions to
get data in and out of the database quickly, and undertake high performance in-database analytics.
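A minimal sketch of in-database analytics, assuming a SQL Server R Services instance such as the course's LON-SQLR VM (the connection string, database, and table names here are hypothetical):

```r
library(RevoScaleR)

# Hypothetical connection string and table.
connStr <- "Driver=SQL Server;Server=LON-SQLR;Database=FlightData;Trusted_Connection=Yes"

# Point ScaleR at a table, then push the computation into the database.
flightsTable <- RxSqlServerData(table = "dbo.Flights", connectionString = connStr)
rxSetComputeContext(RxInSqlServer(connectionString = connStr))

rxSummary(~ ArrDelay, data = flightsTable)   # runs in-database

rxSetComputeContext(RxLocalSeq())            # return to local execution
```

Because the analysis runs where the data lives, only the summarized results cross the network, not the raw rows.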
mrsdeploy functions
https://aka.ms/xhll16
Reconcile any differences between the local and remote environments and generate a report of those differences. This can help you to ensure that a remote environment includes the necessary packages to run your scripts.
You use the mrsdeploy functions for remote execution. However, you must first set up operationalization
—this is described in the topic Deploying Your Code later in this lesson.
Compute contexts
To execute parallel or distributed computations on
an R server or cluster, you must first define a
compute context. This is the entry point to the
distributed computation engine and the heart of
any ScaleR application. The compute context sets
up internal processes such as logging, and
provides the information that the server needs to
execute code remotely, including the back-end
engine (for example, Hadoop or Spark), database
connection information, and specifications for how
to handle output.
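A minimal sketch of defining and activating a compute context (the context type used here is illustrative; the constructors for specific platforms are covered in later modules):

```r
library(RevoScaleR)

# Create a compute context that parallelizes work across local cores
ctx <- RxLocalParallel()

# Make it the active context; subsequent rx* calls run under it
rxSetComputeContext(ctx)

# Confirm which context is currently active
rxGetComputeContext()
```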
After the compute context is established, you can use it to create XDF data files, broadcast variables
(relatively small variables or data to be shared across all nodes) and run jobs.
Note that each compute context has its own environment and limitations. This means that you will often
need to configure the environment to load specific packages in different compute contexts. Not all
compute contexts support all ScaleR functionality. For example, the Teradata compute context can only
interact with the Teradata database, not with CSV or XDF data.
For more detail on compute contexts, see Module 5: Parallelizing Analysis Operations and Module 8:
Processing Big Data in SQL Server and Hadoop.
After installation, you need to configure operationalization to deploy R code as web services or to connect
to a remote R server.
The architecture of R Server supports a clustered configuration. All configurations have at least a single
web node and a single compute node:
A web node acts as an HTTP REST endpoint with which you can interact directly using API calls. The
web node accesses data in the database, and sends jobs to the compute node.
A compute node executes R code as a session or service. Each compute node has its own pool of R
shells.
When you work with remote R servers or web services, your local R Client or app sends instructions to the
web node. The web node then passes instructions to a database service, and/or one or more compute
nodes, to perform the analytic heavy lifting.
Note that, if your code requires access to resources such as data files, you must ensure that these items
are accessible to all nodes. For example, you should place any text files on a fast shared device that can be
accessed by using the same path name from all nodes.
You can set up a single web node and compute node on a single machine, as a one-box configuration.
You can also install multiple components on multiple machines.
Which of the following is not an advantage of using R Server over open source R?
R Server can run advanced machine learning algorithms implemented by the MicrosoftML
package.
Lesson 2
Using Microsoft R Client
Microsoft R Client is a freely available data science tool for advanced analytics. It is built on the Microsoft R Open distribution, so you have access to the full power of R and can install any R packages—for example, from CRAN (https://cran.r-project.org/). Like R Server, R Client also has access to the powerful ScaleR functionality for parallel and distributed computation.
Lesson Objectives
In this lesson, you will learn:
How to transfer objects between an R Client session and a remote R server session.
With these features and limitations, R Client is an effective replacement for standard open-source R. However, to benefit from disk scalability, performance and speed, you can push the computation to R Server running on remote servers, clusters or database services. In this case, R Server is the back-end process that performs the heavy lifting in terms of executing code, running models, fetching from database services, and so on. However, you don't interact with R Server directly—you work in R Client, which sends your code out to the remote R servers.
Using R Client
After you have installed R Client, you can interact
with it using the built-in R GUI (Rgui.exe). On
Linux, you can use the R terminal-based
application. You can also run standalone scripts
using Rscript. You can find these applications
under the installation folder—this is typically
C:\Program Files\Microsoft\R Client\R_SERVER\bin
on Windows, or /usr/bin on Linux.
RTVS is an add-in for Microsoft Visual Studio (including the free Visual Studio Community edition). It is
the IDE recommended by Microsoft to use with R Client. To use it, you will first need to install Visual
Studio and then add the R Tools package.
If you already have another version of R installed, you will need to reconfigure RTVS to interact with R
Client. You do this as follows:
1. Launch RTVS.
RStudio is a free IDE for R by the developers of popular third-party R packages like dplyr, Shiny and
sparklyr. It includes an R console, a syntax-highlighting editor that supports direct code execution, and
tools for plotting, viewing your command history, debugging, and managing your workspace. If you
already have another version of R installed you will need to reconfigure RStudio:
1. Launch RStudio.
2. From the Tools menu, choose Global Options.
When the remote R server is operationalized, and you have successfully logged in to the remote server
using the remote execution functions from the mrsdeploy package, you can start a remote R session.
Logging in
The mrsdeploy package provides the following functions for authenticating against a remote R server
and starting a new remote R session, for switching back and forth between remote and local sessions, and
for closing a connection to a remote server:
remoteLogin(). Use this function to connect with a server on your own local network. By default, it
prompts for a username and password (you can also pass these details as parameters to the function),
creates a remote R session, and gives access to the R command line. You can also configure Active
Directory® authentication.
remoteLoginAAD(). Use this function to connect to a remote R server in the cloud. It authenticates
the user through the Azure® Active Directory, then creates a new session in the same way as
remoteLogin.
pause(). This command temporarily drops you out of the remote session, so that you can work
locally, but keeps the connection to the remote server open.
resume(). This command reconnects you to a remote session that you have paused.
remoteExecute(). Use this function to execute a block of code or an R script on a remote server.
After you authenticate on either a local network or the cloud, the web node returns an access token. The
R console passes this token in the request header of every subsequent mrsdeploy request.
Under the default parameters noted above, you will see the default prompt REMOTE> in your R console
window. The code you enter here will run on the remote server. You can terminate the remote R session
by either typing exit at the REMOTE> prompt or using the remoteLogout() function.
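A typical remote session might be sketched as follows; the server URL is illustrative, and remoteLogin() prompts for credentials by default:

```r
library(mrsdeploy)

# Authenticate and start a remote R session
remoteLogin("http://myserver.contoso.com:12800", session = TRUE)

# Code typed at the REMOTE> prompt now runs on the server

pause()         # drop back to the local session; the connection stays open
resume()        # return to the paused remote session
remoteLogout()  # close the remote session and log out
```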
putLocalObject(). Transfers an R object from your local workspace to the server workspace.
getRemoteWorkspace(). Transfers all objects in the remote session into the local R session.
putLocalWorkspace(). Transfers all objects in the local R session to the remote session.
getRemoteFile(). Downloads a binary or text file from the working directory of the remote R session
into the working directory of the local R session.
putLocalFile(). Uploads a file from the local machine and writes it to the working directory of the
remote R session. This function is often used if a data or parameter file needs to be accessed by a
script running on the remote R session. This function copies the file to the working directory of the
remote server. If you need to transfer large data files, create a manual out-of-band copy to a shared
storage location that is accessible to all servers in the cluster using the same path.
listRemoteFiles(). Returns a list of all the files that are in the working directory of the remote session.
deleteRemoteFile(). Deletes a file from the working directory of the remote R session.
Note: You should never try to transfer large datasets or files from the remote server to your local machine. Your local machine is unlikely to be able to cope with the size of the data. The file transfer functions in the mrsdeploy package are principally designed for handling small configuration and parameter files that can be efficiently transmitted across nodes.
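The transfer functions might be used along these lines; the object and file names are purely illustrative, and the sketch assumes you have already logged in with remoteLogin() and paused to the local session:

```r
library(mrsdeploy)

# Copy a small local object into the remote workspace
threshold <- 30
putLocalObject(c("threshold"))

# Upload a small parameter file to the remote working directory
putLocalFile("params.csv")

# List and tidy up files in the remote working directory
listRemoteFiles()
deleteRemoteFile("params.csv")
```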
Demonstration Steps
2. In Visual Studio 2015, on the R Tools menu, click Data Science Settings.
3. In the Microsoft Visual Studio message box, click Yes. Visual Studio arranges the IDE with four
panes in a layout that resembles that of RStudio.
4. In the R Interactive pane, type mtcars, and then press Enter. The contents of the mtcars sample data
frame should appear.
5. Type mpg <- mtcars[["mpg"]], and then press Enter. Notice that the variable mpg appears in the
Variable Explorer window.
6. Click the R Tools menu. This menu contains most of the commands that you use for working in the
IDE. These include:
The Session menu, which contains commands that enable you to stop a session and interrupt a
long-running R function.
The Plots menu, which enables you to open new plot windows, and save plots as image files and
PDF documents.
The Data menu, which enables you to import datasets, and manage data sources and variables.
The Working Directory menu, which enables you to set the current working directory.
The Install Microsoft R Client command, which downloads and installs R Client if it has not
already been configured.
8. In the New File dialog box, in the Installed pane, click R. The center pane displays the different types
of R file you can create, including documentation and markdown.
9. In the center pane, click R Script, and then click Open. The source editor window appears in the top
left pane.
10. In the source editor window, type print(mpg), and then press Ctrl+Enter. The statement is executed,
and the results appear in the R Interactive window. You use the source editor window to create
scripts of R commands. You can execute a single statement by using Ctrl+Enter, or you can highlight
a batch of statements using the mouse and press Ctrl+Enter to run them.
11. Click the File menu. This menu contains commands that you can use to save the R script, and load an
R script that you had saved previously.
3. In the Options dialog box, verify that the R version is set to C:\Program Files\Microsoft\R
Client\R_SERVER. This is the location of Microsoft R Client. Note that you must download and install
R Client separately, before running RStudio.
4. Click Change. The Choose R Installation dialog box enables you to switch between different
versions of R. If you have just installed R Client, use the Choose a specific version of R option to select it.
Click Cancel.
6. On the File menu, point to New File, and then click R Script. The script editor pane appears in the
top right window. Note the following points:
As with RTVS, you can use this window to create scripts, and run commands by using Ctrl+Enter.
You can also run commands directly from the Console window.
The Environment window displays the variables for the current session.
The Session menu provides commands that you can use to interrupt R, terminate a session,
create a new session (which starts a new instance of RStudio), and change the working directory.
Note: You can perform these tasks either in RTVS or RStudio, according to your preference
of IDE.
1. In the script editor window, enter and run (Ctrl+Enter) the following command:
2. In the Remote Server dialog box, enter admin for the user name, Pa55w.rd for the password, and
then click OK.
3. Verify that the login is successful; the interactive window should display a list of packages that are
installed on the client machine but not on the server (you might need to install these if your script
uses them), followed by the REMOTE> prompt.
4. In the script editor window, enter and run the following command:
mtcars
This command displays the mtcars data frame again, but this time it is being run remotely.
5. In the script editor window, enter and run the following command:
This command creates the firstCar variable in the remote session. Note that it doesn't appear in the
Variable Explorer/Environment window.
6. In the script editor window, enter and run the following command:
pause()
Notice that the interactive window no longer displays the REMOTE> prompt; you are now running in
the local session.
7. In the script editor window, enter and run the following command:
print(firstCar)
This command should fail with the error message object 'firstCar' not found. This occurs because
the firstCar variable is part of the remote session, not the local one.
8. In the script editor window, enter and run the following command:
getRemoteObject(c("firstCar"))
9. Verify that the interactive window displays the response TRUE to indicate that the command
succeeded. Also note that the firstCar variable now appears in the Variable Explorer/Environment
window.
10. In the script editor window, enter and run the following command:
print(firstCar)
11. This command should now succeed, and display the data for the first observation from the mtcars
data frame.
12. In the script editor window, enter and run the following command:
resume()
The REMOTE> prompt should reappear in the interactive window. You are now connected to the
remote session again.
13. In the script editor window, enter and run the following command:
exit
This command closes the remote session and returns you to the local session.
Question
How can you run interactive code remotely in R Server from R Client?
Use the remoteExecute function and specify the code to run on the remote R Server.
Specify the name of the server on which to run the code as the remoteServer parameter to the ScaleR functions.
Use the remoteLogin function to connect to the remote R Server and start an interactive
session on that server.
You can't. You must log in to the remote server manually and start an interactive R session
there.
Lesson 3
The ScaleR functions
You use the ScaleR functions to manipulate and analyze big data in Microsoft R in a consistent, efficient and scalable way. It's consistent in terms of a common API to the functions, efficient because it makes good use of whatever hardware configuration you have, and scalable because you can run algorithms on anything from a local laptop right up to a massively distributed cluster.
Lesson Objectives
After completing this lesson, you will be able to:
Describe best practices for using the ScaleR functions when working with big data.
The ScaleR functions are implemented in the RevoScaleR package that is installed with R Client and R
Server. All ScaleR function names start with either rx, for data manipulation or analysis functions that
operate independently of data source, or Rx, for functions that are class constructors for specific data
sources or compute contexts.
Statistical modeling. Functions for performing statistical analyses efficiently on large, chunked data.
You can build, and predict from, linear, logistic, and generalized linear models; tree-based partitioning models; and K-means clustering models.
Basic graphing. Functions to efficiently produce on-the-fly histograms and line graphs over large
datasets.
Compute contexts. Class constructors for the different compute context objects (Rx functions).
Data Sources. Class constructors for data source objects (Rx functions).
High performance and distributed computing. Lower level functions for high performance and
distributed computing, such as enabling you to run arbitrary code on a cluster.
Utilities. Miscellaneous functions. For example, functions that check the state of a cluster.
RevoScaleR Functions
https://aka.ms/qyh9pj
Many ScaleR functions are analogous to common data manipulation, summarizing or modeling functions in base R. For example, rxSort() is similar to sort() in base R, and rxLinMod() is like lm(). The difference is that the rx functions are designed and optimized to operate in a distributed environment and process XDF data. Other functions provide a degree of interoperability with base R functions. For example, rxResultsDF() extracts summary results from an rxCrossTabs(), rxCube(), or rxSummary() call as a data frame, which you can then pass to many base R functions for processing.
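For example, a summary produced by rxSummary() can be converted to a data frame and passed on to base R; the built-in iris data frame is used here purely for illustration:

```r
library(RevoScaleR)

# Summarize two variables with a ScaleR function
sumOut <- rxSummary(~ Sepal.Length + Sepal.Width, data = iris)

# Extract the results as an ordinary data frame...
sumDF <- rxResultsDF(sumOut)

# ...which any base R function can then process
str(sumDF)
```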
Understanding which ScaleR functions map to base R equivalents helps you decide where and when you might consider using ScaleR functions in place of base R. Most analyses you produce are likely to involve a combination of base R and ScaleR functions.
Data Sources
https://aka.ms/xpxb3c
You use the compute context arguments to ScaleR functions to rapidly prototype your algorithms, and
then deploy them in an efficient way. First, you test your algorithm locally on a sample of your data, using
the default local session compute context. When you are happy that the algorithm is doing what you
expect of it, running at scale is often simply a question of changing the compute context to a server,
cluster or database.
Depending on the file system, there might also be differences in availability within a single data source type. For example, if you create XDF files in your local session in R Client, these files are not chunked, unlike XDF files created on the Hadoop distributed file system (HDFS). Consequently, you can't use chunking locally. You might also need to split and distribute your data across the available nodes of your cluster. For a more detailed description of this process, see Module 5: Parallelizing Analysis Operations.
With big data analytics, data is too large to fit in memory, so you need to deal with transferring data to
the cores, processors or nodes.
This relies on efficient disk I/O, data locality, threading, and data management in RAM.
The RevoScaleR package provides you with tools to address speed and capacity issues involved in big data
analytics:
Data management and analysis functionality that scales from small, in-memory datasets to huge
datasets stored on disk.
Analysis functions that are threaded to use multiple cores and computations that can be distributed
across multiple computers (nodes) on a cluster or in the cloud.
While much of the complexity underlying big data analytics has been abstracted in ScaleR analytic
functions, there are still best practices you need to be aware of. If you are not, you will likely incur costs, in
terms of time, money, capacity, and frustration. The following list summarizes best practices for working
with big data:
1. Upgrade your hardware. This is often the easiest way to deal with bigger datasets. Getting a more powerful laptop or server might mean you can perform your analytics in-memory without having to worry about distributed computing. If you can get away with it, try to work in memory.
2. Minimize copies of data. Making multiple copies of your data can severely slow down computation. Bear in mind, too, that many base R analytic functions like lm() and glm() make several copies of the data as they run. The ScaleR statistical analysis functions, such as rxLinMod(), minimize this as much as possible.
3. Transfer data to XDF format. This enables the data analysis and manipulation functions to operate on chunks that can be farmed out across whatever processors or nodes you have available. Also, the computation time of many analysis functions increases faster than linearly with data size, so total compute time per node is also reduced with chunking. Another advantage is that, when you run a model on an XDF file, only the variables identified in the model formula are read.
4. Consider data types. Numerical computations with integers can be many times faster than computations with floats. Try to convert as many of your numeric data types to integer as you can without loss of information—for example, you can multiply 17.4 by 10 to get 174, and then store it as an integer.
Note: RevoScaleR provides several tools for handling integers. For instance, within a formula statement (for example, in rxLinMod(), rxGlm(), rxCube() or rxCrossTabs()), you can wrap a variable in the F() function to coerce numeric variables into factors, with the levels represented by integers (note that you cannot use F() outside of a formula). Also, you can use rxCube() to quickly tabulate factors and their interactions for arbitrarily large datasets.
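For example, assuming a dataset with a numeric DayOfWeek column coded 1 to 7 (the data source name here is illustrative), F() lets you tabulate by day without creating a separate factor column:

```r
library(RevoScaleR)

# F(DayOfWeek) treats the integer codes as factor levels
# within the formula only
delayCube <- rxCube(ArrDelay ~ F(DayOfWeek), data = flightData)
```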
5. Use vectorized functions. This is a general tip for efficiency in R. Loops are slow in R, so try to make
use of vectorized code as much as possible. If there is not a vectorized version of the function you
want to use, you could consider rewriting the function in C++ using the Rcpp package. This package
makes it simple to integrate C and C++ code into R (https://cran.r-project.org/package=Rcpp). All of
the ScaleR functions are written in optimized C++ for maximum speed and efficiency.
6. Be careful when sorting big data. Sorting big data is a time-intensive operation. When you have to
sort huge datasets, use rxSort() on XDF—and then only when you really need to. Use rxQuantile() to
efficiently produce quantiles or medians. If you need to summarize your data by group, use
rxSummary(), rxCube(), or rxCrossTabs(); they make a single pass through the original data and
accumulate the desired statistics by group on the fly.
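For example, rather than sorting and grouping manually, a single pass can produce grouped summaries and quantiles; the data source and column names here are illustrative:

```r
library(RevoScaleR)

# Mean arrival delay for each carrier, computed in one pass
rxSummary(ArrDelay ~ UniqueCarrier, data = flightData)

# Quantiles of arrival delay without fully sorting the data
rxQuantile("ArrDelay", data = flightData)
```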
7. Use row-oriented data transformations where possible. Try to ensure that any data
transformations on a row are not dependent on values in other rows. Your transformation expression
should give the same result, even if only some of the rows of data are in memory at one time. You
can perform data manipulations with lags or leads but these require special handling.
8. Process data transformations in batches. If you need to run multiple transformations on your
dataset, use rxDataStep() to do this in a single pass through the data, processing the data a chunk at
a time. Making multiple passes through your data can be very time consuming.
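For example, several transformations can be applied in a single chunked pass (the file and column names are illustrative):

```r
library(RevoScaleR)

# One pass over the data, processed a chunk at a time
rxDataStep(inData = "flights.xdf",
           outFile = "flightsClean.xdf",
           transforms = list(
             DelayHours = ArrDelay / 60,   # new derived column
             LongDelay  = ArrDelay > 60    # logical flag
           ),
           overwrite = TRUE)
```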
9. Be careful with categorical variables. If you have a factor with many levels, you might find there
are chunks that don’t contain all the levels. This means that you must explicitly specify the levels or
you might end up with incompatible factor levels from chunk to chunk. rxImport() and rxFactors()
provide functionality for creating factor variables in big datasets.
10. Get more nodes. If the above tips still don’t help, you might need to invest in more nodes to
distribute across. Data analysis algorithms tend to be I/O bound when data cannot fit into memory,
so multiple hard drives can be even more important than multiple cores.
You have been given a small subset of the flight delay data for the year 2000 as a local CSV file. You wish
to quickly examine its contents and schema to gain familiarity with the structure of the data and the tools
that you are going to use.
Objectives
In this lab, you will:
Use R Client in Visual Studio (RTVS), or RStudio.
Lab Setup
Estimated Time: 45 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
2. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
2. Read the 2000.csv file into a data frame and look at the first 10 rows. Use the base R read.csv
function and the head function, rather than the ScaleR functions.
3. The data frame has a column named Month that contains the month number. Add a factor column
to the data frame called MonthName that contains the corresponding month name. You can find
month names in the month.name list. You should use the base R factor and lapply functions.
4. Generate a summary of the data frame using the summary function. Time how long it takes to
perform this operation using the system.time function.
Find the minimum value in the ArrDelay column. This column records the flight arrival delay
time in minutes.
Find the maximum flight arrival delay.
6. Use the xtabs function to generate a cross tabulation showing the number of flights cancelled and not cancelled for each month. The Cancelled column contains a value indicating whether a flight was cancelled (1 – cancelled, 0 – not cancelled).
Note: Strictly speaking, xtabs is not a base R function, but it is commonly used by data scientists
performing cross tabulations.
Note: Record the console output, as this will be referenced in a later exercise.
Results: At the end of this exercise, you will have used either RTVS or RStudio to examine a subset of the
flight delay data for the year 2000.
Question: How many columns are there in the data frame, including the new MonthName
column?
Question: What are the minimum and maximum arrival delay times recorded in the data
frame?
2. Run the rxGetInfo function over the data frame. This function retrieves the number of variables and
observations in the data frame. Verify that these values are the same as those you obtained in
exercise 1.
3. Run the rxGetVarInfo function over the data frame. This function retrieves more detailed information
about the variables in the data frame, such as their names and type. Verify that the MonthName
variable is a factor with 12 levels, each named after a month.
4. Run the rxQuantile function over the data frame. Specify the ArrDelay column as the first argument
and the data frame as the second. This function reports the quantiles for the data. The 0% quantile
should be the same as the minimum value reported earlier, and the 100% quantile should be the
maximum value.
5. Use the rxCrossTabs function to generate a cross tabulation of the number of cancelled/not
cancelled flights by month name. Use the formula ~MonthName:as.factor(Cancelled == 1).
Note that rxCrossTabs uses the ":" character to separate independent variables, rather than the "+"
of xtabs.
6. Use the rxCube function to generate a data cube over the same data. Verify that the values displayed
are the same as the cross tabulation.
Note: Record the console output, as this will be compared with the earlier results in a later exercise.
Results: At the end of this exercise, you will have used the ScaleR functions to examine the flight delay data for the year 2000, and compared the results against those generated by using the base R functions.
deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800
session: TRUE
diff: TRUE
commandLine: TRUE
username: admin
password: Pa55w.rd
3. Copy the file 2000.csv to the remote session. Use the putLocalFile function.
Note: For large files, you should perform this step as an out-of-band task and manually place the data on a network share that all nodes in the R Server cluster can access.
2. Add the MonthName column to the data frame as you did in exercise 1.
3. Display the first 10 rows of the data using the head function.
2. Copy the variables that you created in the remote session to the local session using the
getRemoteObject function.
3. Print the variables you have just retrieved. Their contents should match the results you obtained in
exercise 2.
5. Save the script as Lab1Script.R in the E:\Labfiles\Lab01 folder, and close your R development
environment.
Results: At the end of this exercise, you will have used a remote R session to examine the flight delay data for the year 2000, and compared the results against those you generated locally.
Module 2
Exploring Big Data
Contents:
Module Overview 2-1
Module Overview
ScaleR™ uses the eXternal Data Frame (XDF) format to store data. This file format was developed by Revolution Analytics (now part of Microsoft) for managing very large datasets. It is extremely efficient and compact, and is ideally suited to performing big data analysis. The ScaleR functions use this format for managing data, and it enables these functions to access data in a piecemeal manner. This means that you can use the ScaleR functions to process datasets that are arbitrarily large, and that are much too big to fit into the available memory of a computer.
Objectives
In this module, you will learn how to:
Lesson 1
Understanding ScaleR data sources
In this lesson, you will learn about the data sources that you can use in ScaleR. This lesson explains the
limitations on which sources you can employ in a distributed environment. This lesson also describes how
to access data held in SQL Server and Hadoop.
Lesson Objectives
After completing this lesson, you will be able to:
SAS data files: RxSasData
SPSS data files: RxSpssData
It is important to understand that, although each data source is available when you use the R client on a
local computer, some distributed contexts, specifically Hadoop and Teradata, limit which data sources you
can use. This is due to the fundamental architectures of these contexts. The following table summarizes
which data sources you can use in which compute contexts:
Data source (constructor)           Available compute contexts
Delimited text (RxTextData)         Local, Hadoop MapReduce, Spark
Fixed-format text (RxTextData)      Local
ODBC data (RxOdbcData)              Local
Teradata database (RxTeradata)      Local, Teradata
SQL Server (RxSqlServerData)        Local, SQL Server
Hive (RxHiveData)                   Spark
Parquet (RxParquetData)             Spark
The following example shows how to create an RxTextData data source referencing a comma-delimited
text (CSV) file:
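A minimal sketch (the file name is illustrative):

```r
library(RevoScaleR)

# Create a data source over a comma-delimited text file
csvSource <- RxTextData(file = "2000.csv", delimiter = ",")

# The data source can now be passed to any rx* function
rxGetInfo(csvSource, getVarInfo = TRUE)
```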
rxIsOpen(src, mode = "r"). Use this function to check whether a data source is open and can be accessed in the specified mode.
rxReadNext(src). When a data source is open, use this function to extract a chunk of data from it. A
chunk is a section or block of data rather than the full content of the file. The data can be returned as
a data frame or a list, depending on the value of the returnDataFrame property of the underlying
data source.
rxWriteNext(from, to, …). Use this function to write a chunk of data to a data source.
The following code shows how to use these functions to read data from an RxXdfData data source. The
example reads the data a chunk at a time:
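A sketch of the chunked read loop, assuming an illustrative XDF file and that the source returns each chunk as a data frame:

```r
library(RevoScaleR)

xdfSource <- RxXdfData("2000.xdf")
rxOpen(xdfSource, mode = "r")

# Read and process the file one chunk at a time
chunk <- rxReadNext(xdfSource)
while (is.data.frame(chunk) && nrow(chunk) > 0) {
  # process the current chunk here
  chunk <- rxReadNext(xdfSource)
}

rxClose(xdfSource)
```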
rxSqlServerDropTable. This function removes a table (and its contents) from a SQL Server database.
rxSqlServerTableExists. Use this function to test whether a specified table exists in a SQL Server
database.
rxExecuteSQLDDL. This function performs Data Definition Language (DDL) operations, enabling you
to perform operations such as creating new tables.
An important parameter to the RxSqlServerData function is rowsPerRead. This parameter controls how many rows the data source reads as a chunk. If you make this parameter too large, you might encounter poor performance because you don't have sufficient memory to hold this volume of data. However, setting rowsPerRead to too small a value can also result in poor performance due to excessive I/O and inefficient use of the network. You should experiment with this setting to find the optimal value for your own configuration and dataset.
The following example shows how to connect directly to a SQL Server table from ScaleR. In this example,
the Airport table contains the details of US airports:
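A sketch of the connection; the server and database names come from this course's lab environment, but the driver name in the connection string is an assumption:

```r
conString <- "Driver=SQL Server;Server=LON-SQLR;Database=AirlineData;Trusted_Connection=TRUE"
airportData <- RxSqlServerData(connectionString = conString, table = "Airport",
                               rowsPerRead = 1000)
head(airportData)
```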
# Results:
iata airport city state country lat long
1 00M Thigpen Bay Springs MS USA 31.95376 -89.23450
2 01G Perry-Warsaw Perry NY USA 42.74135 -78.05208
3 06D Rolla Municipal Rolla ND USA 48.88434 -99.62088
4 06M Eupora Municipal Eupora MS USA 33.53457 -89.31257
5 06N Randall Middletown NY USA 41.43157 -74.39192
...
The following example shows how to check whether a table exists, and then remove it:
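A minimal sketch, reusing a connection string for the AirlineData database (an assumption):

```r
if (rxSqlServerTableExists("Airport", connectionString = conString)) {
    rxSqlServerDropTable("Airport", connectionString = conString)
}
```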
Note that you can only use the RxHdfsFileSystem file system in a supported compute context, such as
RxHadoopMR. You can also use the RxHdfsFileSystem file system from a local session running on a
Hadoop cluster. The RxNativeFileSystem file system is available in all compute contexts.
The following example shows how to connect a data source to a text file stored in HDFS. The parameters
to the RxHdfsFileSystem constructor provide the information for connecting to the HDFS cluster:
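A sketch of the connection; the host name, port, and file path are assumptions:

```r
hdfsFS <- RxHdfsFileSystem(hostName = "LON-HADOOP", port = 8020)
flightData <- RxTextData("/user/instructor/flights.csv", fileSystem = hdfsFS)
```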
rxHadoopRemove. This function removes a file from the HDFS file system.
rxHadoopFileExists. This function tests whether a specified file exists in an HDFS directory.
Behind the scenes, these functions perform the corresponding Hadoop fs commands in an SSH shell
created by the RxHadoopMR compute context.
You can also use the rxHadoopCommand function to run an arbitrary Hadoop command. You can
perform any Hadoop operation, including managing the HDFS file system, and submitting Hadoop jobs.
This command uses an explicit SSH connection to communicate with the Hadoop cluster.
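For example, the following sketch lists the contents of an HDFS folder (the folder path is an assumption):

```r
rxHadoopCommand("fs -ls /user/instructor")
```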
3. Highlight and run the code under the comment # Connect to SQL Server. These statements create
an RxSqlServerData data source that references the Airports data in the AirlineData database on
the LON-SQLR server.
4. Highlight and run the code under the comment # Use R functions to examine the data in the
Airports table. These statements display the first few rows of the table, and then use the
rxGetVarInfo and rxSummary functions to display the columns in the table and calculate summary
information.
2. Highlight and run the code under the comment # Create a Hadoop compute context. These
statements open a connection to the Hadoop cluster and set this connection as the compute context.
3. Highlight and run the code under the comment # List the contents of the /user/instructor folder
in HDFS. This command uses the Hadoop fs -ls command to display the file names from HDFS.
4. Highlight and run the code under the comment # Connect directly to HDFS on the Hadoop VM.
These statements create an RxHdfsFileSystem object and make it the default file system for the
session.
5. Highlight and run the code under the comment # Create a data source for the CensusWorkers.xdf
file. This statement creates an XDF data source that references the CensusWorkers.xdf file in HDFS.
6. Highlight and run the code under the comment # Perform functions that read from the
CensusWorkers.xdf file. These statements display the first few lines from the CensusWorkers.xdf file,
and generate a summary of the file contents.
Note that the rxSummary command might take a couple of minutes. This is because Hadoop is
optimized to handle very large data files. With a small file such as the one used in this demonstration,
the overhead of using Hadoop exceeds the time spent performing any processing.
7. Close your R development environment of choice (RStudio or Visual Studio) without saving any
changes.
Verify the correctness of the statement by placing a mark in the column to the right.
Statement Answer
You want to use the rxImport function to transfer data from a local CSV
file into a SQL Server table. You use an RxTextData data source to read
the data from the CSV file and an RxSqlServerData data source to write
to the SQL Server database. You should perform this operation in the
RxInSqlServer compute context. True or False?
Lesson 2
Reading and writing XDF data
In this lesson, you will learn how to read data into an XDF object from a variety of sources. You will also
learn how to control the format in which XDF files are created, transform variables as the data is
imported, and perform union and one-to-one merges.
Lesson Objectives
After completing this lesson, you will be able to:
Explain how using XDF overcomes many limitations of traditional R data formats.
XDF files are subdivided into “chunks”. The ScaleR functions only need to read a single chunk into
memory at a time, process it, and then write it back to disk. This leads to improved scalability because the
size of a dataset that you can process is not limited by the memory available on the computer. The
parallel algorithms used by the ScaleR functions enable different processors to work on these data chunks,
and then combine the chunked results together.
Occasionally, it might be necessary for the ScaleR functions to make multiple passes over one or more
chunks, but this processing is transparent and is controlled by the functions themselves.
However, you should consider some tradeoffs when determining whether to import data into the XDF
format or leave it in a more traditional format. There might be times when using an in-memory data
frame is more appropriate for performing a specific task. Additionally, many existing R packages use the
data frame interface, and are not designed to work with XDF files. For more information on these
tradeoffs, see:
Trade-offs to consider when reading a large dataset into R using the RevoScaleR package
https://aka.ms/let4so
By default, the result returned by rxImport is an in-memory data frame. If the data file is large, the data
frame can exhaust the available memory. However, you can instead use rxImport to create an XDF file by
specifying the outFile parameter to rxImport.
The following example shows how to create an XDF file from the claims.txt file used by the previous
example:
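A sketch of the import (the file names are assumptions):

```r
claimsXdf <- rxImport(inData = "claims.txt", outFile = "claims.xdf", overwrite = TRUE)
```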
If you specify the outFile parameter, the value returned by rxImport is an RxXdfData data source object
that references the new XDF file, rather than a data frame. The data is not read into memory, but is
instead written to the XDF file. You can use the RxXdfData data source to read and process the data in
this file, chunk by chunk.
The following example shows how to use the RxXdfData data source object created in the previous
example to generate a summary of the XDF data:
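A sketch, assuming claimsXdf is the RxXdfData object returned by the previous import:

```r
rxSummary(~., data = claimsXdf)   # "~." summarizes all variables
```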
Filtering data
You can control the rxImport process by using the numRows and rowSelection arguments. The
numRows argument causes the import to stop after reading the specified number of rows. You use the
rowSelection argument to filter the data as it is imported, so that only rows that match specified criteria
make it to the XDF file.
The following example shows how to limit the number of rows being imported, and how to filter the data
as it is being imported. In this example, cost is a field containing numeric data in the file being imported:
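A sketch of both techniques; the file names and the filter threshold are assumptions:

```r
rxImport(inData = "claims.txt", outFile = "claims.xdf",
         numRows = 1000,              # stop after reading 1,000 rows
         rowSelection = cost > 100,   # keep only rows where cost exceeds 100
         overwrite = TRUE)
```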
You can use the varsToKeep and varsToDrop arguments to specify which columns from input you wish
to retain or discard from the XDF file. These are character vectors containing the names of the columns.
You can specify either argument, but not both. You cannot use these arguments if you are importing data
from an ODBC data source or a fixed format text file.
The following example shows how to use the colClasses argument to set the data type of columns:
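A sketch using the claims data; the column names and types are assumptions:

```r
rxImport(inData = "claims.txt", outFile = "claims.xdf",
         colClasses = c(type = "factor", cost = "numeric", number = "integer"),
         overwrite = TRUE)
```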
If you omit any columns in colClasses, the rxImport function will infer the type of those columns from
the source metadata.
You can also specify the stringsAsFactors argument to rxImport if you want the import process to
convert all character strings to factors. In this case, you can override specific columns by using
colClasses to indicate that these columns should be retained as character data.
The following example reads information about car insurance claims from a fixed format text file:
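A sketch of reading a fixed-format file; the file name, and the start positions and widths in colInfo, are assumptions:

```r
claimsFixed <- RxTextData("claims.dat",
    colInfo = list(
        age  = list(type = "factor",  start = 1,  width = 3),
        type = list(type = "factor",  start = 4,  width = 2),
        cost = list(type = "numeric", start = 6,  width = 8)
    ))
rxImport(inData = claimsFixed, outFile = "claimsFixed.xdf", overwrite = TRUE)
```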
.rxStartRow. The row number of the first row in the current chunk being processed.
transformObjects. Transformations in rxImport operate inside their own closed scope, and variables
external to the rxImport operation are inaccessible. If a transformation requires access to an external
variable, you must specify it using this argument. This argument is a list. Note that if you reference
external variables in the rowSelection argument, you must also include them in this list.
transformFunc. You can implement complex transformations as a function rather than specifying
them in the transforms list. You provide the name of the function in this argument.
transformVars. A character vector specifying the variables from the dataset that transformFunc requires; these variables are passed to the function for each chunk.
transformPackages. If a transformation references external packages, you must specify them in this
vector.
The following example uses a transform to add a variable named logcost, containing the log of the cost
column in the current row, to an XDF file:
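A sketch of the transform (the file names are assumptions):

```r
rxImport(inData = "claims.txt", outFile = "claims.xdf",
         transforms = list(logcost = log(cost)),   # add a derived variable
         overwrite = TRUE)
```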
Best Practice: Test transformations over a small subset of the data first. This gives you a
way to quickly verify that the transformation is correct. When you are satisfied that the
transformation is working correctly, you can perform it over the entire dataset.
Best Practice: The transforms argument contains a list of transformations. Use this list to
batch together transformations over different fields, rather than performing separate runs of
rxImport over the same data, each with its own single transformation.
Module 4: Processing Big Data describes how to use the transformFunc, transformVars, and
transformPackages arguments in detail.
Refactoring variables
You can convert nonfactor variables in an XDF file into factors by using the rxFactors function. This
function takes an XDF file and a specification of how to categorize data in the factorInfo argument.
Like the rxImport function, rxFactors can write the results to a new XDF file or return them as a data
frame. You can also specify whether to retain or drop columns by using the varsToKeep and
varsToDrop arguments. However, it doesn't support filtering by row selection or transformations.
You can also use rxFactors to refactor a variable that is already a factor.
In the next example, the DayOfWeek variable is a factor that contains the values "1" through "7" to
represent the days of the week. The following code changes the variable to use the levels "Mon" through
"Sun" instead:
You can also output the data as a composite set of files rather than a single file. To do this, create an
RxXdfData data source that references the destination folder, and specify the createCompositeSet
argument to the rxImport function, as follows:
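A sketch; in this sketch the createCompositeSet flag is specified on the RxXdfData data source that acts as the destination (the folder and file names are assumptions):

```r
compositeOut <- RxXdfData("airlineComposite", createCompositeSet = TRUE)
rxImport(inData = "2000.csv", outFile = compositeOut, overwrite = TRUE)
```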
In this case, the resulting folder contains two subfolders: a metadata folder holding an .xdfm file that
contains the metadata describing the contents of the XDF files, and a data folder that contains the data in
the form of a set of .xdfd files. This is the structure that HDFS uses for storing XDF data; the various .xdfd
files might be physically stored on separate machines in the HDFS cluster.
When you need to process the data, create an RxXdfData data source using the folder holding the data
and metadata folders.
The following example splits census data held in an XDF file by using the year variable. The year is a
factor variable:
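A sketch of the split; censusData is assumed to be an RxXdfData source containing a year factor, and the exact naming of the output files can vary:

```r
rxSplit(inData = censusData, outFilesBase = "YearData",
        splitByFactor = "year", overwrite = TRUE)
```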
The previous example creates files named YearData1.xdf, YearData2.xdf, and so on.
You can also split an XDF file into a number of approximately uniform-sized pieces rather than splitting by
factor. To do this, specify the numOut argument, as follows:
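A sketch that splits the same data into five roughly equal pieces:

```r
rxSplit(inData = censusData, outFilesBase = "CensusPiece",
        numOut = 5, overwrite = TRUE)
```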
Note that, in common with many of the ScaleR functions concerned with importing or modifying data,
you can also use arguments such as varsToKeep, varsToDrop, rowSelection, and transforms when
splitting the data.
OneToOne. This operation performs a column-wise append, concatenating a row in the second file
onto the end of the corresponding row in the first. For more information about one-to-one merging,
see:
One to one Merge
https://aka.ms/ihwfxm
Inner Merge and Outer Merge. These are analogous to the inner join and outer join operations
between tables in a relational database. These options are covered in more detail in Module 4:
Processing Big Data.
The following example illustrates how to perform a simple union merge between two XDF files. Notice
that the type argument specifies the type of merge to perform:
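A sketch of the union merge; the input and output file names are assumptions:

```r
rxMerge(inData1 = "flights2000.xdf", inData2 = "flights2001.xdf",
        outFile = "flightsCombined.xdf", type = "union", overwrite = TRUE)
```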
rxOpen(xdfSource)
data <- rxReadNext(xdfSource)
rxClose(xdfSource)
Note that this approach is not too common. If you need to process data in chunks, a better method is to
use the rxDataStep function described in Module 4: Processing Big Data.
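Output like the following comes from running rxGetInfo over an XDF data source; a minimal sketch, assuming xdfSource references the AirlineDelayData.xdf file:

```r
rxGetInfo(xdfSource)
```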
# Typical output
# File name: E:\AirlineDelayData.xdf
# Number of observations: 227044
# Number of variables: 30
# Number of blocks: 1
# Compression type: zlib
You can also specify the getVarInfo argument to obtain detailed information about the individual
variables in the dataset, as shown in the following example:
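A sketch, again assuming xdfSource references the XDF file:

```r
rxGetInfo(xdfSource, getVarInfo = TRUE)
```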
# Typical output
# File name: E:\AirlineDelayData.xdf
# Number of observations: 227044
# Number of variables: 30
# Number of blocks: 1
# Compression type: zlib
Var 1: .rxRowNames, Type: character
Var 2: Year
1 factor levels: 2000
Var 3: Month
12 factor levels: 1 2 3 4 5 ... 8 9 10 11 12
Var 4: DayofMonth
31 factor levels: 1 3 9 17 19 ... 18 12 8 21 5
Var 5: DayOfWeek
7 factor levels: 6 1 7 3 4 5 2
Var 6: DepTime, Type: character
…
You can limit the variables displayed by using the varsToKeep and varsToDrop arguments.
The rxGetVarInfo function is similar, except that it only returns variable information from a data source.
You use the rxSetVarInfo function to modify variable metadata in a data source.
The following example uses rxGetVarInfo and rxSetVarInfo to change the levels for a factor in an XDF
file. In this case, the Cancelled factor originally used the levels "0" and "1" to indicate whether the flight
had been cancelled. This code changes the levels to "No" and "Yes":
There is also a corresponding rxSetInfo function that you can use to set the metadata of the XDF file
rather than the individual variables in the file.
Note: The rxGetVarInfo function only accesses the variable metadata at the start of an
XDF file. The rxSetVarInfo function does not modify the actual data for each row, only the
metadata. Consequently, both functions can run very quickly, even over large XDF objects.
Which argument to the rxImport function does not enable you to filter data?
rowSelection
varsToDrop
varsToKeep
colClasses
numRows
Lesson 3
Summarizing data in an XDF object
In this lesson, you will learn how to generate statistical summaries of the data in an XDF object. You can
use many regular R functions to perform these tasks, but ScaleR also provides several functions of its own
specifically designed to efficiently process big data held in the XDF format.
Lesson Objectives
After completing this lesson, you will be able to:
The names function operates in a slightly different manner. It is also generic, but the technique it uses
internally differs. This detail is not actually important, but the key point is that names eventually invokes
the rxGetVarNames function. Similarly, dim invokes the rxGetInfo function, as do the nrow and ncol
functions.
The key fact to take from this discussion is that you can continue to use these common Base R functions
over XDF data. However, if you are writing new scripts from scratch, you might find that using the
equivalent ScaleR functions directly is more efficient.
rxSummary(~ArrDelay, xdfSource)
# Typical output
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.026 seconds
# Computation time: 0.041 seconds.
# Call:
# rxSummary(formula = ~ArrDelay, data = xdfSource)
If you need to summarize more than one variable, specify them in the formula separated by the +
operator, like this:
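For example, with xdfSource as before:

```r
rxSummary(~ArrDelay + DepDelay, xdfSource)
```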
You can use the ":" notation to include dependencies in a formula. The following example summarizes
departure delay times as a function of the airport of origin for a flight:
rxSummary(~DepDelay:Origin, xdfSource)
# Typical output
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.031 seconds
# Computation time: 0.039 seconds.
# Call:
# rxSummary(formula = ~DepDelay:Origin, data = xdfSource)
#
# Summary Statistics Results for: ~DepDelay:Origin
# Data: xdfSource (RxXdfData Data Source)
# File name: SplitData1.xdf
# Number of valid observations: 227044
#
# Name Mean StdDev Min Max ValidObs MissingObs
# DepDelay:Origin 9.374303 31.25425 -45 1435 218064 8980
#
# Statistics by category (205 categories):
#
# Category Origin Means StdDev Min Max ValidObs
# DepDelay for Origin=ATL ATL 10.04044364 29.627925 -17 466 10459
# DepDelay for Origin=AUS AUS 6.69516971 25.208972 -17 259 1532
# DepDelay for Origin=BHM BHM 5.51197982 24.143043 -26 282 793
# DepDelay for Origin=BNA BNA 7.59791004 27.797410 -11 480 2201
# DepDelay for Origin=BOS BOS 10.45093105 33.601146 -15 521 3974
# DepDelay for Origin=BUR BUR 13.16031196 30.553457 -8 316 1154
# DepDelay for Origin=BWI BWI 10.23133191 30.173772 -10 348 3281
# DepDelay for Origin=CLE CLE 6.65045455 26.232894 -10 348 2200
…
You can also use the special formula "~." to summarize all variables. For more information about using
formulas with rxSummary, see:
Data Summaries
https://aka.ms/obvzrd
You can refine the summaries generated by using the following arguments to the rxSummary function:
byGroupOutFile. If the formula uses factor variables, you can save the summary output to a set of
files each containing the data for a different factor value. This helps you to process the results
separately for each factor value later. Use the byGroupOutFile argument to specify a base file name for
the data to be saved. The rxSummary function will generate a set of files using this name with a
numeric suffix.
summaryStats. By default, rxSummary generates statistics for the Mean, StdDev, Min, Max, ValidObs
(number of valid observations), MissingObs (number of missing observations), and Sum. If you only
require a subset of these statistics, use summaryStats to specify which ones, as a character vector.
byTerm. This is a logical argument. If TRUE, missing values are removed term by term, so each
term's statistics use all of the valid observations for that term; if FALSE, observations with a missing
value in any term are excluded from all of the computations.
removeZeroCounts. This is a logical argument. If TRUE, rows with no observations will not be
included in the output for counts of categorical data.
fweights and pweights. These are character strings that specify the name of a variable to use as the
frequency weight or probability weight for the observations being summarized.
The following example shows how to retrieve the data frame that contains the summary for the
categorical variables from an rxSummary object:
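A sketch; the rxSummary result exposes the numeric summaries through its sDataFrame component and the factor counts through its categorical component (the Delay variable is assumed to have been added by a transform):

```r
delaySummary <- rxSummary(~Delay, xdfSource,
                          transforms = list(Delay = ArrDelay + DepDelay))
delaySummary$sDataFrame    # numeric summaries as a data frame
delaySummary$categorical   # list of data frames for factor variables
```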
# Typical results
# Name Mean StdDev Min Max ValidObs MissingObs
# 1 Delay 17.40734 60.90534 -1273 2121 217503 9541
You can also discount rows from the summary by using the rowSelection argument.
Note that these transformations are transient; they are used while the rxSummary function is running,
and the changes are not stored permanently. If you expect the same transformation to be required
elsewhere, consider using the rxDataStep function to save the transformed data. To find out how to do
this, see Module 4: Processing Big Data.
head(results)
The functions in this package use ScaleR functions behind the scenes to chunk the data, and they create
temporary XDF files while they are running. These temporary XDF files are automatically removed when a
function finishes. If you need to retain the data generated at any stage in a dplyrXdf pipeline, use the
persist function to write the data to an XDF file.
Many of the dplyrXdf functions enable you to pass additional arguments to the underlying ScaleR
functions by using the .rxArgs argument. For more information, see:
Note: Currently, the dplyrXdf package is only available through Git, and you must
download and build it locally. You can perform this task from within R by using the devtools
package. You must also install the dplyr package because some of the functionality of dplyrXdf
references functions in this package. Use the following code to do this:
# Install dplyrXdf
install.packages("dplyr")
install.packages("devtools")
devtools::install_github("RevolutionAnalytics/dplyrXdf")
library(dplyr)
library(dplyrXdf)
Generating CrossTabs
The following example calculates the mean
departure delay for airline flights as a function of
origin airport and month. Usually, the rxCrossTabs function generates counts, but this example calculates
means. You do this by setting the means argument to TRUE:
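A sketch of the call, with xdfSource as before; the formula matches the call echoed in the typical results:

```r
delayCrossTabs <- rxCrossTabs(DepDelay ~ Origin:Month, data = xdfSource,
                              means = TRUE)
delayCrossTabs
```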
# Typical results
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.050 seconds
# Computation time: 0.063 seconds.
# Call:
# rxCrossTabs(formula = DepDelay ~ Origin:Month, data = xdfSource,
# means = TRUE)
# DepDelay (means):
# Month
# Origin 1 2 3 …
# ATL 10.6929763 9.49295775 9.78934741
# AUS 4.7921875 7.38102410 10.03947368
# BHM 5.2704403 4.54227405 8.61363636
# BNA 7.2029478 7.53543307 8.53720930
# BOS 13.1485643 8.68876611 8.49798116
# BUR 8.7056075 17.95685279 13.21084337
# BWI 10.8261905 11.43887623 6.90767045
# CLE 7.0021906 6.55760870 6.00817439
# CLT 7.9783641 6.42066806 5.24920128
# CMH 7.3936348 8.33013436 5.49802372
# COS 2.0535714 7.51351351 3.27118644
# CVG 11.9991823 12.60124334 11.07074830
# DEN 7.2984729 8.61186114 11.36315789
…
Note: If you are already familiar with xtabs, notice that the formula syntax is different with
rxCrossTabs. Use the ":" character to separate cross-classifying variables rather than the "+"
character that xtabs uses.
You can convert the rxCrossTabs object returned by the rxCrossTabs function into an xtabs
object by using the as.xtabs function.
Ideally, independent variables should be categorical, but you can make the rxCrossTabs function treat
a character variable as a categorical variable by using the F function. If you have numeric data, use the
as.factor function instead, although you should be cautious with this approach because it can
generate a very large number of categories if the data has many distinct values.
Another important argument is na.rm, which causes NA values to be excluded from the calculations. Also,
as with many of the other ScaleR functions, you can perform transformations and row selection on the fly
by using the transforms and rowSelection arguments.
The rxCrossTabs function returns an rxCrossTabs object, which contains lists of the cross-tabulation
sums, counts, and means, together with a list of chi-squared test results.
The following example extracts counts from the results object generated by the previous example:
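A sketch, assuming delayCrossTabs holds the rxCrossTabs object from the previous example; the counts are one of the lists the object exposes through the $ operator:

```r
delayCrossTabs$counts
```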
# Typical results
# $DepDelay
# Month
#Origin 1 2 3 …
# ATL 4257 4118 2084
# AUS 640 664 228
# BHM 318 343 132
# BNA 882 889 430
# BOS 1602 1629 743
# BUR 428 394 332
# BWI 1260 1317 704
# CLE 913 920 367
# CLT 1895 1916 939
# CMH 597 521 253
…
Generating cubes
The rxCube function generates a data cube. The calculations are the same as those performed by
rxCrossTabs, but the data is presented in a more matrix-like manner, showing the individual statistics for
each combination of independent variables.
As an example, the following code creates a cube of arrival delay statistics grouped by origin airport and
month:
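A sketch of the call, with xdfSource as before; the formula matches the call echoed in the typical results:

```r
delayCube <- rxCube(DepDelay ~ Origin:Month, data = xdfSource)
delayCube
```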
# Typical results
# Call:
# rxCube(formula = DepDelay ~ Origin:Month, data = xdfSource)
#
# Cube Results for: DepDelay ~ Origin:Month
# File name: SplitData1.xdf
# Dependent variable(s): DepDelay
# Number of valid observations: 218064
# Number of missing observations: 8980
# Statistic: DepDelay means
#
# Origin Month DepDelay Counts
# 1 ATL 1 10.69297627 4257
# 2 AUS 1 4.79218750 640
# 3 BHM 1 5.27044025 318
# 4 BNA 1 7.20294785 882
# 5 BOS 1 13.14856429 1602
# 6 BUR 1 8.70560748 428
# 7 BWI 1 10.82619048 1260
# 8 CLE 1 7.00219058 913
# 9 CLT 1 7.97836412 1895
…
The resulting rxCube object contains a variable for each column of output—in this case, Origin, Month,
DepDelay, and Counts. You can access these variables by using the $ operator.
rxChiSquaredTest
rxFisherTest
rxKendallCore
The following example tests for independence between the origin airport and month for flight departure
delays. Notice that the data for the cube is prefiltered to remove any negative values of the DepDelay
variable; this is required by the rxChiSquaredTest function:
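A sketch; the cross-tabulated counts are generated first, filtering out negative delay values with rowSelection, and the result is then passed to the test function:

```r
delayTabs <- rxCrossTabs(~Origin:Month, data = xdfSource,
                         rowSelection = DepDelay >= 0)
rxChiSquaredTest(delayTabs)
```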
# Typical results
# Chi-squared test of independence between Origin and Month
# df
# 2244
Note: The rxFisherTest function can exhaust workspace resources if you use it on a large
set of results.
Cross-tabulated data
https://aka.ms/qgqtbq
The following example calculates the quantiles for departure delay in the airline delay data:
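A sketch, with xdfSource as before; by default, rxQuantile computes the quartiles:

```r
rxQuantile("DepDelay", xdfSource)
```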
# Results
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.016 seconds
# Computation time: 0.022 seconds.
# 0% 25% 50% 75% 100%
# -45 -3 0 7 1435
The results in this example show that 25 percent of aircraft depart between three and 45 minutes early,
half of all aircraft leave early or on time (the delay is 0), and 75 percent of aircraft have a departure delay
of no more than seven minutes. However, there are some extreme cases with long delays—the 75–100%
range represents a very long tail of values.
To analyze this tail in more detail, you can customize the number of bins and probabilities to use for the
calculations. The following example creates additional bins for the 75–100% range, using the probs
argument to rxQuantile. This argument takes a numeric vector containing the probability values (in the
range 0 through 1) for each bin:
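A sketch matching the probability values shown in the results:

```r
rxQuantile("DepDelay", xdfSource,
           probs = c(0, 0.25, 0.5, 0.7, 0.8, 0.9, 1))
```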
# Results
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.014 seconds
# Computation time: 0.018 seconds.
# 0% 25% 50% 70% 80% 90% 100%
# -45 -3 0 5 12 32 1435
From these results, you can see that 90 percent of all departures are no more than 32 minutes late.
Demonstration Steps
2. Highlight and run the code under the comment # Summarize the delay fields. This statement
generates summaries for the ArrDelay, DepDelay, and Delay fields only. Notice the syntax in the
formula.
3. Highlight and run the code under the comment # Examine Delay broken down by origin airport.
These statements factorize the Origin and Dest columns using the rxFactors function, and generate
the summary.
2. Highlight and run the code under the comment # Generate a cube of the same data. This
statement outputs the delay information, but also displays the counts. Note that the cube includes
routes that don't exist; they have a delay of NaN, and a count of 0.
3. Highlight and run the code under the comment # Omit the routes that don't exist. This statement
sets the removeZeroCounts argument so that routes that have a count of zero are no longer
included.
4. Close your R development environment of choice (RStudio or Visual Studio) without saving any
changes.
Verify the correctness of the statement by placing a mark in the column to the right.
Statement Answer
Objectives
In this lab, you will:
Import data held in a CSV file into an XDF object and compare the performance of operations using
these formats.
Combine multiple CSV files into a single XDF object and transform data as it is imported.
Combine data retrieved from SQL Server into an XDF file.
Lab Setup
Estimated Time: 60 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running, and then complete the steps below:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
2. Click Start, type Microsoft SQL Server Management Studio, and then press Enter.
3. In the Connect to Server dialog box, log in to LON-SQLR using Windows authentication.
5. In the New Database dialog box, in the Database name box, type AirlineData, and then click
OK.
6. Close SQL Server Management Studio.
2. Start your R development environment of choice (Visual Studio Tools, or RStudio), and create a new R
file.
3. The data is located in the file 2000.csv, in the folder E:\Labfiles\Lab02. Set your current working
directory to this folder.
4. Import the first 10 rows of the data file into a data frame and examine its structure. Note the type of
each column.
5. Create a vector named flightDataColumns that you can use as the colClasses argument to
rxImport. This vector should specify that the following columns are factors:
Year
DayofMonth
DayOfWeek
UniqueCarrier
Origin
Dest
Cancelled
Diverted
6. Import the data into an XDF file named 2000.xdf. If this file already exists, overwrite it. Use the
flightDataColumns vector to ensure that the specified columns are imported as factors.
7. When the data has been imported, examine the first 10 rows of the new file and check the structure
of this new file.
8. Use File Explorer to compare the relative sizes of the CSV and the XDF files.
2. Use the rxSummary function to generate a summary of all the numeric fields in the 2000.csv file,
and then repeat using the 2000.xdf file. Compare the timings by using the system.time function.
3. Use the rxCrossTabs function to generate a cross-tabulation of the data in the CSV file showing the
number of flights that were cancelled and not cancelled each month. Note that the Month and the
Cancelled columns are both numeric, but the cross-classifying variables referenced by a formula in
rxCrossTabs must be factors. Use the as.factor function to cast these variables to factors. Display
the cancellation values as TRUE/FALSE values. Note how long the process takes.
4. Generate the same cross-tabulation for the XDF file, and compare the timing with that for the CSV
file.
5. Repeat the previous two steps, but generate cubes over the CSV and XDF data using the rxCube
function rather than crosstabs. Compare the timings.
Results: At the end of this exercise, you will have created a new XDF file containing the airline delay
data for the year 2000, and you will have performed some operations to test its performance.
2000.csv
2001.csv
2002.csv
2003.csv
2004.csv
2005.csv
2006.csv
2007.csv
2008.csv
2. Start a remote session on the LON-RSVR VM. When prompted, specify the username admin, and the
password Pa55w.rd.
3. At the REMOTE> prompt, temporarily pause the remote session and return to the local session
running on the LON-DEV VM.
4. Use the putLocalObject function to copy the local object, flightDataColumns, to the remote
session.
Create a Delay column that sums the values in the ArrDelay, DepDelay, CarrierDelay,
WeatherDelay, NASDelay, SecurityDelay, and LateAircraftDelay columns. Note that, apart
from the ArrDelay and DepDelay columns, this data can contain NA values that you should
convert to 0 first.
Add a column named MonthName that holds the month name derived from the month
number. This column must be a factor.
Filter out all cancelled flights (flights where the Cancelled column contains 1).
Remove the variables FlightNum, TailNum, and CancellationCode from the dataset.
2. Examine the first few rows of the XDF file to verify the results.
4. Import all the CSV files in the \\LON-RSVR\Data share into an XDF file called FlightDelayData.xdf.
Perform the same transformations from step 1, and save the file to the \\LON-RSVR\Data share.
Note that this process can take some time—you should enable progress reports for the rxImport
operation so you have confirmation that it is working. You might also find that setting the number of
rows per read can impact performance. Try setting this value to 500,000, which will create chunk sizes
of approximately half a million rows.
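As a sketch only (the share and file names follow the lab; the column specification and other arguments are illustrative assumptions), the import with progress reporting and a larger chunk size might look like this:

```r
# Hedged sketch: import all nine CSV files into one cumulative XDF file.
# reportProgress = 2 prints rows processed and timings for each chunk;
# rowsPerRead = 500000 reads the data in chunks of ~half a million rows.
library(RevoScaleR)   # provided by Microsoft R Client/Server

csvFiles <- file.path("\\\\LON-RSVR\\Data", paste0(2000:2008, ".csv"))
outFile  <- "\\\\LON-RSVR\\Data\\FlightDelayData.xdf"

for (f in csvFiles) {
  rxImport(inData = f, outFile = outFile,
           rowsPerRead = 500000,
           reportProgress = 2,
           append = file.exists(outFile),  # append after the first file
           overwrite = TRUE)
}
```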
Results: At the end of this exercise, you will have created a new XDF file containing the cumulative
airline delay data for the years 2000 through 2008, and you will have performed some transformations
on this data.
2. View the first few rows of data to establish the columns that it contains.
3. Import the airport data into a small data frame, and then convert all string data to factors.
4. View the first few rows of the data frame file to ensure that they contain the same data as the original
SQL Server table.
2. Temporarily pause the remote session and copy the data frame containing the airport information to
the remote session. Use the putLocalObject function.
4. Use the rxImport function to read the FlightDelayData.xdf file on the \\LON-RSVR\Data share and
add the following variables:
OriginState. This should contain the state from the data frame you created in the previous task
where the IATA code in that data frame matches the Origin variable in the XDF file.
DestState. This is very similar, except it should match the IATA code in the data frame against
the Dest variable in the XDF file.
Note that you must use the transformObjects argument of rxImport to make the data frame
accessible to the transformation.
5. View the first few rows of the XDF file to ensure that they contain the OriginState and DestState
variables.
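The lookup transformation in step 4 might be sketched as follows. The data frame name (airportData) and its column names (iata, state) are assumptions for illustration; substitute the names from your own session:

```r
# Hedged sketch: add OriginState and DestState by looking up IATA codes
# in a data frame made visible to the transformation via transformObjects.
flightDataWithStates <- rxImport(
  inData  = RxXdfData("\\\\LON-RSVR\\Data\\FlightDelayData.xdf"),
  outFile = "\\\\LON-RSVR\\Data\\FlightDelayDataWithStates.xdf",
  transforms = list(
    OriginState = stateLookup$state[match(Origin, stateLookup$iata)],
    DestState   = stateLookup$state[match(Dest,   stateLookup$iata)]
  ),
  # 'stateLookup' is the name the transforms see; 'airportData' is the
  # hypothetical data frame holding the airport information
  transformObjects = list(stateLookup = airportData),
  overwrite = TRUE
)
```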
Results: At the end of this exercise, you will have augmented the flight delay data with the state in which
the origin and destination airports are located.
No delay
Up to 30 minutes
30 minutes to 1 hour
1 to 2 hours
2 to 3 hours
In the remote session, generate four cross-tabulations (using rxCrossTabs) that report the delay
intervals by origin airport, destination airport, origin state, and destination state. Use a transformation
with each cross-tabulation to factorize the delays as described.
3. Close the remote session, and return to the local session on the LON-DEV VM.
2. Create an RxXdfDataSource that references the following variables in the XDF file. These are the only
variables that the analysis will require:
Delay
Origin
Dest
OriginState
DestState
3. Use the data accessible through the data source to calculate the mean delay for each origin airport,
sort the results, and then display them. Use a dplyrXdf pipeline. You will need to persist the final
results, otherwise they will be deleted automatically by the pipeline.
4. Repeat the previous step to calculate and display the mean delay for each destination airport.
5. Repeat step 3 to calculate and display the mean delay by origin state.
6. Repeat step 3 to calculate and display the mean delay by destination state.
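Steps 3 through 6 follow the same pattern. A sketch of one such pipeline, assuming the data source from step 2 is named flightDataSource (a hypothetical name) and that the dplyrXdf package is installed:

```r
# Hedged sketch: mean delay by origin airport using a dplyrXdf pipeline.
# persist() writes the result to a permanent XDF file so the pipeline's
# temporary output is not deleted automatically.
library(dplyrXdf)

meanDelayByOrigin <- flightDataSource %>%
  group_by(Origin) %>%
  summarise(meanDelay = mean(Delay, na.rm = TRUE)) %>%
  arrange(desc(meanDelay)) %>%
  persist("meanDelayByOrigin.xdf")

head(meanDelayByOrigin)
```

Repeating the pipeline with Dest, OriginState, and DestState in the group_by step produces the remaining summaries.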
7. Save the script as Lab2Script.R in the E:\Labfiles\Lab02 folder, and close your R development
environment.
Results: At the end of this exercise, you will have examined flight delays by origin and destination
airport and state.
Question: From the summaries that you developed, were you able to perceive any
relationship between airport or state and flight delay times?
Question: Given your answer to the first question, is the effort performing these tasks
justified so far?
Import and transform data into the XDF format used by ScaleR.
Summarize data held in an XDF file.
Module 3
Visualizing Big Data
Contents:
Module Overview 3-1
Module Overview
Visualization is an essential part of the data manipulation and modeling process. In addition to presenting
results in a report, you should visualize your data or results often and in a variety of ways before, during
and after your analysis. Building plots is an important tool for experimental design and can help you to
identify issues in your data that need to be addressed. Plotting your model coefficients can also help you
interpret them much more easily.
The common first step in working with big data is to build a subset of your big dataset that you can
process locally, in-memory. You typically develop models and algorithms iteratively on this subset before
deploying to a server, cluster or database service to run on the full big dataset. After the analysis has run,
you will commonly bring model objects and results back onto your local machine to produce plots for
presentation. You need to be able to visualize data both locally and at scale—there are two different tools
to use.
With in-memory data, you have access to a broader and more flexible range of visualization tools.
However, these tools would soon break when applied to large, chunked data too big to fit in memory.
ScaleR™ provides functions for the quick and efficient building of simple plots to visualize very large
datasets.
Objectives
In this module, you will learn how to:
Lesson 1
Visualizing in-memory data
This lesson describes how to use ggplot2. This is a very popular and flexible third-party R package that
provides an incredibly rich suite of functions for visualizing in-memory data—it can be used for
generating a wide range of graphs. Although it has the limitations associated with using in-memory data,
the ggplot2 package is useful when working with big data. You might break down big data into smaller
subsets, or use clustering algorithms to coalesce similar data together, and then use ggplot2 functions to
help spot trends in this data. You can then use the big data to develop predictive models that substantiate
or disprove these trends.
Lesson Objectives
After completing this lesson, you will be able to:
Because the data is independent from the other components, you can reuse the same plot with
different datasets.
You can chain plot objects (such as annotations, legends, trend lines, and so on) together without
having to worry about the order in which they are combined.
Sensible defaults mean that you can produce professional looking plots very quickly.
Elements of a graph
A ggplot2 graph consists of the following components:
Data. This is the raw data that is passed to the plot object—this must be a data frame.
Aesthetic mappings. These translate the raw data to data that is readable on a graph. For example, a
continuous variable might map to points along an x axis, and a factor might map to a point size or a
color.
Geometric objects (geoms). The actual plot type you will be using—for example, line, point, or
histogram. Each plot must have at least one geom.
Coordinate system. This defines how the data maps onto space in the plot. A single coordinate
system applies to a single plot object. Options include Cartesian coordinates, log and polar
coordinates.
Statistical transformations (stats). Use these to transform or summarize your data—for example, to
generate regression lines, smoothers, bins for histograms, and boxplots. Each geom has its own
default stat.
Facets. These describe how to split data into different panels, according to the values of the supplied
variables.
Note: The functions in the ggplot2 package are accessible through the ggplot2 library. You
should bring this library into scope:
library(ggplot2)
Alternatively, you can install the tidyverse package and bring the tidyverse library into scope.
This package and library include ggplot2 and other useful packages that are often used with ggplot2,
such as dplyr:
install.packages("tidyverse")
library(tidyverse)
The following example shows a simple scatter plot based on the sample mtcars dataset provided with R.
The graph shows the relationship between engine capacity (displacement) and the miles per gallon
achieved:
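Based on the explanation that follows, the plot is produced with two chained calls:

```r
library(ggplot2)

# Create the base plot object from the data, then chain on a point geom,
# mapping engine displacement to x and miles per gallon to y
ggplot(mtcars) +
  geom_point(mapping = aes(x = disp, y = mpg))
```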
The first line defines the basic plot object by calling ggplot with the dataset as the first argument. You
can then chain geoms, stats, facets, coordinate types, and so on, to this using the “+” operator. The
second line adds the point geom to make an XY plot. All geom_* functions accept a “mapping” argument
to define the mapping from the data to plot space. This is an aes() function call, here mapping “disp”
(engine displacement in cc) to the x axis and “mpg” (miles per gallon) to the y axis. The default is always
Cartesian coordinates and no facets, so you don’t need to define these. Also, the default stat for
geom_point is “identity”, which means no transformation takes place—so you don’t need to set this either.
FIGURE 3.1: SCATTER PLOT OF ENGINE DISPLACEMENT VERSUS MILES PER GALLON
Note that any of the objects in the ggplot chain are just standard R objects, so they can be assigned to
variables. The following example gives the same result as the previous code:
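A sketch of this approach, building the same scatter plot incrementally through a variable:

```r
library(ggplot2)

# The plot object is an ordinary R object, so it can be built up in steps
p <- ggplot(mtcars)
p <- p + geom_point(mapping = aes(x = disp, y = mpg))
p   # printing the object renders the plot
```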
This example produces a bar plot counting the number of cars with each number of gears in the mtcars data. The
default statistic is count, which counts the number of cases at each x position. Use this geometry to
describe a single discrete variable:
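The bar plot described here can be reconstructed as:

```r
library(ggplot2)

# geom_bar's default stat is "count", so no y mapping is needed:
# it counts the cars at each value of gear
ggplot(mtcars) +
  geom_bar(mapping = aes(x = gear))
```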
FIGURE 3.2: BAR PLOT SHOWING THE NUMBER OF CARS WITH DIFFERENT GEARS
The next example creates a histogram using the geom_histogram geometry. Use this geometry to
categorize a single continuous variable with binned values—in this case, the frequencies of times taken by
cars in the mtcars dataset to travel a quarter of a mile. You can adjust the width of the bins with the
binwidth argument, and the number of bins with the bins argument.
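A reconstruction of this example; the binwidth value shown is an illustrative assumption, not necessarily the one used to produce the figure:

```r
library(ggplot2)

# Histogram of quarter-mile times (qsec); binwidth controls bin size
ggplot(mtcars) +
  geom_histogram(mapping = aes(x = qsec), binwidth = 0.5)
```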
FIGURE 3.3: HISTOGRAM SHOWING THE TIMES TAKEN FOR CARS TO TRAVEL A QUARTER OF A
MILE
Use the geom_line geometry to show two variables with values joined by lines. The following example
presents the same data shown earlier as a line plot, but this time with data points connected by lines. This
is particularly useful for presenting time series or repeated measures data where there is a relationship
from point to point. You can replace geom_line with geom_step to build a stairstep plot to highlight
exactly where change happens—or replace with geom_path to join points in the order they appear in the
data.
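The line plot described above can be reconstructed as:

```r
library(ggplot2)

# Same disp/mpg data as the earlier scatter plot, but with points
# joined by lines; swap geom_line for geom_step or geom_path as noted
ggplot(mtcars) +
  geom_line(mapping = aes(x = disp, y = mpg))
```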
FIGURE 3.4: LINE PLOT SHOWING THE RELATIONSHIP BETWEEN ENGINE DISPLACEMENT AND
MILES PER GALLON
You can refine a line plot by smoothing the line to create a regression curve, using the geom_smooth
geometry.
The default smoothing method is local regression (LOESS).
The graph that this generates also includes an indication of the confidence intervals, shown shaded as a
gray area around the smoothed line:
Rather than using LOESS, you can perform a standard linear regression by specifying the value "lm" to the
method argument of the geometry. You can also use nonlinear regression terms such as polynomials
using the formula argument.
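A sketch combining these options; the polynomial degree is an illustrative assumption:

```r
library(ggplot2)

# Scatter plot with a linear-model smoother; the formula argument adds
# a polynomial regression term instead of a straight line
ggplot(mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))
```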
This code generates the following results over the mtcars dataset:
Combining plots
You can combine multiple geometries together using the
+ operator to create overlay plots.
Notice this example has moved the mapping argument to the initial call to ggplot. This means that the
mappings are then available to all the chained geoms.
The following example uses the same scatter plot from earlier in this module, but it groups the cars by
their transmission type (manual or automatic). The mapping uses the aes function to set the color of the
points based on the type of transmission. In this example, the scale_color_manual function specifies the
colors to use for each category. Note that, rather than feed the raw variable which has the values 0 (for
automatic) and 1 (for manual), the code converts it to a factor with more intuitive labels. This code also
uses the labs function to set the labels displayed for each variable, and color them in the same way as the
points on the graph.
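A reconstruction of this example; the specific colors and label text are assumptions for illustration:

```r
library(ggplot2)

# Convert the 0/1 'am' variable to a factor with readable labels, map it
# to point color, then pick the colors and legend title explicitly
ggplot(mtcars,
       mapping = aes(x = disp, y = mpg,
                     color = factor(am, labels = c("Automatic", "Manual")))) +
  geom_point() +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "Displacement", y = "Miles per gallon", color = "Transmission")
```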
FIGURE 3.8: SCATTER PLOT SHOWING POINTS AND LABELS COLORED BY CATEGORY
The following example shows another use of the aes function with a tile plot. This example illustrates the
relationship between three variables in mtcars—the number of gears (“gear”), the number of cylinders
(“cyl”), and the brake horsepower (“hp”). Notice that “hp” has been assigned to fill in the mapping
function. The geom fits an appropriate scale to the color variation by default.
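A minimal sketch of the tile plot; treating gear and cyl as factors keeps the axes discrete (where several cars share a gear/cyl combination, tiles overplot, so a summarized data frame may give cleaner results):

```r
library(ggplot2)

# Three variables: gear on x, cyl on y, and hp mapped to the tile fill;
# ggplot2 fits a continuous color scale to hp by default
ggplot(mtcars) +
  geom_tile(mapping = aes(x = factor(gear), y = factor(cyl), fill = hp))
```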
FIGURE 3.9: TILE PLOT USING COLOR TO SHOW THE RELATIONSHIP BETWEEN THREE VARIABLES
Creating facets
You use facets to split your data into different plot panels according to the categorical variables you
choose to facet by. These panels are then laid out in a grid. Any statistical transformations occur within
each panel—for example, if you use geom_smooth, a new regression line will be fitted to the data in
each panel. There are two faceting functions you can chain into your ggplot2 plot objects—facet_grid
and facet_wrap.
Using facet_grid
You use the facet_grid function to split your data into
rows and/or columns of plotting panels. The first
argument takes a formula expression, with the left-hand
side representing row facets and the right-hand side
representing column facets. A dot on either side of the
“~” means that you don’t want to facet in this dimension.
The grid contains two columns (0 for automatic, 1 for manual). Note that for simplicity, this example has
used the original factor values rather than creating labels for each column.
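The column-only faceting described above can be reconstructed as:

```r
library(ggplot2)

# A dot on the row side of the formula means "no row facets";
# the data is split into columns by transmission type (am)
ggplot(mtcars) +
  geom_point(mapping = aes(x = disp, y = mpg)) +
  facet_grid(. ~ am)
```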
You can also facet by rows and columns. The following example splits the scatter plot into columns by the
“am” transmission type variable and rows by the “gear” variable, producing a 2 x 3 grid of plot panels.
While the graph is interesting, it does demonstrate that sometimes "less is more". In this case, breaking
the graphs down to this level of detail starts to obscure the previously clear relationship between
displacement and miles per gallon.
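The rows-and-columns example can be reconstructed as:

```r
library(ggplot2)

# Rows faceted by gear (3 values), columns by am (2 values),
# giving a 2 x 3 grid of panels
ggplot(mtcars) +
  geom_point(mapping = aes(x = disp, y = mpg)) +
  facet_grid(gear ~ am)
```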
FIGURE 3.11: DISPLAYING ROWS AND COLUMNS USING THE FACET_GRID FUNCTION
Using facet_wrap
You use the facet_wrap function when you need to facet by a single variable, but that variable has more
levels than can fit in a single row or column. The panels will “wrap” around to the next line. You can set
the number of columns using the ncol argument, as shown in the next example. In this case, the data is
grouped by the number of carburetors, which has a value between 1 and 8 in the sample data:
facet_wrap function
ggplot(mtcars) +
geom_point(mapping = aes(x = disp, y = mpg)) +
facet_wrap( ~ carb, ncol = 2)
The result is the following set of panels. Note that because no cars have 5 or 7 carburetors, there are no
panels for these numbers:
If you want to have the bars for the different groups next to each other, set position = "dodge". You can
set position = "fill" to have the bars stacked and stretched to a constant height to show relative
proportions.
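A sketch illustrating the position argument (the grouping variables chosen here are assumptions for illustration):

```r
library(ggplot2)

# Count cars by number of gears, grouped by transmission type;
# position = "dodge" places the group bars side by side
# (use position = "fill" for stacked, constant-height proportions)
ggplot(mtcars) +
  geom_bar(mapping = aes(x = factor(gear), fill = factor(am)),
           position = "dodge")
```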
The following example shows the mpg for each car in the mtcars dataset. The code creates a data frame
containing the row names and the mpg variable. The graph calls the geom_bar function with stat =
"identity".
Because of their length, the names would have been unreadable on a standard x axis, so you chain in a
call to coord_flip() to flip the x and y axes, making the names readable. This is an example of a
coordinate system function and will be discussed in a later topic.
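The example described above can be reconstructed as:

```r
library(ggplot2)

# Build a data frame of car names and mpg, plot the raw values
# (stat = "identity" rather than the default count), and flip the
# axes so the long car names are readable
carMpg <- data.frame(car = rownames(mtcars), mpg = mtcars$mpg)
ggplot(carMpg) +
  geom_bar(mapping = aes(x = car, y = mpg), stat = "identity") +
  coord_flip()
```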
FIGURE 3.14: BAR PLOT DISPLAYING VALUES FOR INDIVIDUAL INSTANCES OF VARIABLES
coord_fixed. These are Cartesian coordinates that have a fixed aspect ratio between the x and y axes.
coord_polar. Polar coordinates can be used to create pie, donut and radar charts.
coord_map. Use this to create map projections using ggplot2. For more details, see:
http://docs.ggplot2.org/current/coord_map.html.
Arranging plots
You might want to place two plots next to each other
to compare the results. Base R has the par() function
to do this for you, but the ggplot2 system is
different—you can’t use this function. Instead, you
can use the gridExtra package to arrange multiple
plots in a variety of ways.
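A minimal sketch of this approach, assuming the gridExtra package is installed:

```r
library(ggplot2)
library(gridExtra)

# Two independent ggplot objects...
p1 <- ggplot(mtcars) + geom_point(aes(x = disp, y = mpg))
p2 <- ggplot(mtcars) + geom_bar(aes(x = gear))

# ...arranged side by side in a single display
grid.arrange(p1, p2, ncol = 2)
```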
Demonstration Steps
3. Highlight and run the code under the comment # Install packages. This code loads the tidyverse
package (including ggplot2 and dplyr), and brings the tidyverse library into scope.
4. Highlight and run the code under the comment # Create a data frame containing 2% of the flight
delay data. This code populates a data frame with a small subset of the flight delay data. The
rowSelection argument uses the rbinom function to select a random sample of observations.
5. Highlight and run the code under the comment # Generate a plot of Departure Delay time versus
Arrival Delay time. These statements use ggplot to create a scatter plot. Note that the geom_point
method sets an alpha level of 1/50. This helps to highlight the density of the data (the more dense
the data points, the darker the plot area). Even with a small subset of the data, it still takes a couple of
minutes to create this plot.
Add an overlay
Highlight and run the code under the comment # Fit a regression line to this data. This code uses
the geom_smooth function to fit a line to the data. Note that the mapping argument has moved to
the ggplot function so that it is available to geom_point and geom_smooth.
Verify the correctness of the statement by placing a mark in the column to the right.
Statement Answer
The data source for a ggplot2 graph must be a data frame. True or False.
Lesson 2
Visualizing big data
The ggplot2 package provides a wide range of powerful data visualization tools. However, when you
work with big datasets, you will often not have the luxury of being able to read all your data into
memory—so ggplot2 might not always be appropriate. The big data plotting functions in the RevoScaleR
package, while involving a little more complexity, have some very important advantages:
When working with XDF files, only the selected variables are read in, making them highly efficient.
XDF files are processed in chunks, the algorithms summarizing as they go.
You use these features to visualize arbitrarily large datasets on the fly whereas, with traditional plotting
tools, you would have to first take subsets or samples of the data small enough to fit into memory.
Note that, if you are using these functions locally, R Client does not support chunking. This means that, in
this case, all the data needs to be read into memory—in a large dataset, this can exhaust your memory
quickly. To get around this, you push the compute context to a Microsoft R Server instance.
For more information about visualizing big data with Microsoft R, see:
Visualizing Huge Data Sets: An Example from the US Census
https://aka.ms/qd95hg
Ostensibly, the ScaleR big data plotting functions appear to have quite limited functionality. However,
these functions act as wrappers for the more comprehensive features of the lattice graphics package—
you can use many lattice features to customize the layout and appearance of graphs.
For a comprehensive introduction to lattice graphics in R, see: http://lattice.r-forge.r-
project.org/Vignettes/src/lattice-intro/lattice-intro.pdf.
Lesson Objectives
After completing this lesson, you will be able to:
Create and customize scatter and line plots using the rxLinePlot function.
The following example uses rxLinePlot to create a scatter plot showing the relationship between engine
displacement and miles per gallon using the mtcars data frame:
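A reconstruction of this example; the axis label text is an illustrative assumption:

```r
library(RevoScaleR)

# Scatter plot (type = "p") of mpg against displacement, with
# custom axis labels via xlab and ylab
rxLinePlot(mpg ~ disp, data = mtcars, type = "p",
           xlab = "Engine displacement (cu. in.)",
           ylab = "Miles per gallon")
```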
In the example, the first argument is a formula expression, with the dependent variable (the “y” variable)
on the left-hand side and the independent variable (the “x” variable) on the right-hand side. Note that the
example also sets the labels for the x and y axes using the xlab and ylab arguments.
The rxLinePlot function is a wrapper around the xyplot() function in the lattice package that comes with
base R. This means that, alongside using the plot customization arguments in rxLinePlot, you use the
ellipsis (“…”) arguments to customize your plots further by passing arguments to xyplot. For example, you
can customize the scales for the x and y axes by using the scales argument.
You determine the type of plot by passing a character vector to the type argument. The options are:
l: this gives a line plot, where the lines are connected in the order they appear in the data. This is
analogous to geom_path in ggplot2.
You can combine multiple types. For example, specifying type = c("p", "r") generates a scatter plot with a
regression line overlay.
There is no separate function to do this—you must specify the facets in the formula input to the first
argument. You use a “|” (vertical bar) on the right-hand side of the formula to indicate that you are
conditioning on a variable. The next example shows the miles per gallon versus displacement plot
conditioned by gear, and the number of gears in the transmission of the different cars:
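A sketch of the conditioned plot (lattice converts the numeric gear variable into conditioning groups):

```r
library(RevoScaleR)

# One panel per number of gears; add "+ am" on the right of the "|"
# to condition on transmission type as well
rxLinePlot(mpg ~ disp | gear, data = mtcars, type = "p")
```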
If required, you can further refine this approach to drill deeper into the data. For example, you can specify
multiple conditioning variables by using the + operator. This example includes the car transmission (the
am variable).
This code generates a lattice containing a pane for each combination of conditioning variable:
blocksPerRead. This is the number of blocks to read for each chunk of data read from the data
source. You can vary this to create plots more efficiently from very large datasets. This argument is
ignored when you run locally in R Client.
reportProgress. You can use this argument to generate feedback while generating a plot over a big
dataset. You can specify the following values:
o 0: no progress is reported.
Another common technique is to summarize a large dataset into a more manageable size by using the
rxCube function. You can then convert the rxCube object into a data frame by using the rxResultsDF
function and pass this object to rxLinePlot.
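A hedged sketch of this technique, assuming a flight delay data source named flightData with ArrDelay and CRSDepTime variables (names illustrative):

```r
library(RevoScaleR)

# Summarize the big dataset: mean arrival delay per departure hour
delayCube <- rxCube(ArrDelay ~ F(CRSDepTime), data = flightData)

# Convert the cube to a data frame, then plot the small summary
delayDF <- rxResultsDF(delayCube)
rxLinePlot(ArrDelay ~ CRSDepTime, data = delayDF, type = "l")
```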
Creating histograms
You can create histogram plots from both data
frames and XDF files using the rxHistogram
function. This function uses the rxCube function to process
the data at scale and to create the bins. These bins
are then passed to graphics functions from the
lattice package. Many of the arguments are the
same as for rxLinePlot—you can pass arguments
to lattice using the ellipsis (“…”) arguments.
Behind the scenes, rxHistogram uses the lattice
barchart function.
In the previous example, the CRSDepTime variable was treated as a continuous numeric variable by the
graph, which makes the bins rather an odd size. You can use the F function to refactorize data on the fly,
and split it into bins according to the integer range in which it lies. The following example uses this
approach to show the data hour by hour:
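A sketch of this refactorization, assuming a data source named flightData (the low and high bounds cover the 24 hours of the day):

```r
library(RevoScaleR)

# F() factorizes CRSDepTime on the fly, one bin per integer hour
rxHistogram(~ F(CRSDepTime, low = 0, high = 23), data = flightData)
```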
The histogram produced by this code is a little easier to interpret. The busiest time of the day for
departures is between 6:00 AM and 8:00 AM. Unsurprisingly, very few aircraft depart in the small hours of
the morning.
You can modify the number of bins for the histogram by using the numBreaks argument.
You can also specify lower and upper limits for numeric data using the startVal and endVal arguments.
Like rxLinePlot, you can split up your histogram plots, conditioned on other variables. Again, you use the
“|” (bar) notation to select the conditioning variables. In this example, the scheduled airline departure
times are grouped by the day of the week:
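A sketch of the conditioned histogram, assuming the same flightData source:

```r
library(RevoScaleR)

# One histogram panel of departure times for each day of the week
rxHistogram(~ CRSDepTime | DayOfWeek, data = flightData)
```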
Note: The default statistic that rxHistogram uses is a count of the number of observations
that fall in each bin. You can change this to report the percentage of observations in each bin by
setting the histType argument to "Percent".
Saving plots
You can save your rxLinePlot and rxHistogram plots in the
same way you can with base R plots. First, open a file handle
for a graphical file, then produce your plot, and then close
the file handle:
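A sketch of this pattern using base R graphics devices (the file name and plot are illustrative):

```r
library(RevoScaleR)

jpeg("delayHistogram.jpg")                  # open a jpeg file device
rxHistogram(~ ArrDelay, data = flightData)  # produce the plot
dev.off()                                   # close the device, writing the file
```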
You can save to various file types, including jpeg, tiff, pdf, and postscript.
Note that, when performing transformations on big datasets, you should carry out multiple
transformations in a single pass rather than performing several passes of single transformations. Also note
that, if you expect to use the transformed variables more than once, it might be beneficial to transform
your data before you plot it, using rxDataStep. The rxDataStep function is discussed in more detail in
Module 4: Processing Big Data.
Demonstration Steps
Creating a histogram
1. Open your R development environment of choice (RStudio or Visual Studio).
3. Highlight and run the code under the comment # Use the flight delay data. This code creates an
RxXdfData data source for the FlightDelayData.xdf file. This file contains 11.6 million records.
4. Highlight and run the code under the comment # Create a histogram showing the number of
flights departing from each state. This code uses the rxHistogram function to display a count of
flights for each value of the OriginState variable. The scales argument is a part of the lattice
functionality that underpins rxHistogram; your code uses this argument to change the orientation
and size of the labels on the x axis.
5. Highlight and run the code under the comment # Filter the data to only count late flights. This
code uses the rowSelection argument to only include observations where the ArrDelay variable is
greater than zero.
2. Highlight and run the code under the comment # Late flights by Carrier. This chart shows a
histogram of the number of delayed flights for each airline. The code uses the yAxisMinMax
argument to set the limits of the vertical scale to the same as that of the previous chart. Additionally,
the plotAreaColor argument makes the background transparent; this chart will be used as an
overlay.
3. Highlight and run the code under the comment # Display both histograms in adjacent panels. This
code installs the latticeExtra package. This package provides functionality that you can use to
customize the layout of the panels that display graphs and charts. The code then displays both charts;
they appear in adjacent panels. Both charts use the same vertical scale, enabling you to compare the
number of flights against the number of delayed flights for each airline.
4. Highlight and run the code under the comment # Overlay the histograms. This statement uses the
+ operator to overlay the second chart (with the transparent background) on top of the first, making
it even easier to see the proportion of late flights for each airline.
How can you control the number of bins used by the rxHistogram function if the
data being plotted is continuous rather than categorical?
Use the transforms argument of the rxHistogram function to round the data
up or down to a set of discrete values.
Objectives
In this lab, you will:
Use the ggplot2 package to generate plots of flight delay data, to visualize any relationship between
delay and distance.
Use the rxLinePlot function to examine data by departure state and day of the week.
Use the rxHistogram function to examine the relative rates of the different causes of delay.
Lab Setup
Estimated Time: 60 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
2. Start your R development environment of choice (Visual Studio or RStudio), and create a new R file.
3. You will find the data for this exercise in the file FlightDelayData.xdf, located in the folder
E:\Labfiles\Lab03. Set your current working directory to this folder.
4. The ggplot2 package requires a data frame containing the data. The sample XDF file is too big to fit
into memory (it contains more than 11.6 million rows), so you need to generate a random sample
containing approximately 2% of this data. To reduce the size of the data frame further, you are only
interested in the Distance, Delay, Origin, and OriginState variables. Finally, you also want to remove
any anomalous observations; there are some rows that have a negative or zero distance which you
should discard.
Use the rxImport function to create a data frame that matches this specification from the
FlightDelayData.xdf file. You can use the rbinom base R function with a probability of 0.02 to
generate a random sample of 2% of the data as part of the rowSelection filter.
2. Create a scatter plot that shows the flight distance on the x axis versus the delay time on the y axis.
Give the axes appropriate labels.
3. Overlay the data with a line plot to help establish whether there is any pattern to the data presented
in the graph. Reduce the intensity of the points using the alpha function. This will help to show the
frequency of delay times (more common delay times will appear darker). Also, filter outliers, such as
negative delays and delays greater than 1,000 minutes (you can use the dplyr filter function to
perform this task).
4. Facet the graph by OriginState to determine whether all states show the same trend. Use the
facet_wrap function (there are more than 50 states and US territories included in the data).
Note: You might receive some warning messages due to insufficient data for regressing the data for
some states. You can ignore these warnings, but you will see some states that don't include the
overlay.
Results: At the end of this exercise, you will have used the ggplot2 package to generate line plots that
depict flight delay times as a function of distance traveled and departure state.
Question: Does the regression indicate that there is any relationship between flight distance
and flight delay times?
You also want to see how flight delays vary according to the day of the week, to establish whether this
could be a factor.
The main tasks for this exercise are as follows:
Only include the Distance, ActualElapsedTime, Delay, Origin, Dest, OriginState, DestState,
ArrDelay, DepDelay, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and
LateAircraftDelay variables.
Add a transformation that creates a variable called DelayPercent that contains the flight delay as
a percentage of the actual elapsed time of the flight.
Save the data in an XDF file named FlightDelayDataWithProportions.xdf. Overwrite this file if
it already exists.
2. Create a cube that summarizes the data of interest (DelayPercent as a function of Distance and
OriginState). Use the following formula:
DelayPercent ~ F(Distance):OriginState
Note that you must factorize the Distance variable to use it on the right-hand side of a formula.
Filter all observations where the DelayPercent variable is more than 100%.
Summarizing the data in this way makes it much quicker to run rxLinePlot as it now only has to
process a focused subset of the data.
3. Change the name of the first column in the cube from F_Distance (the name generated by the
F(Distance) expression in the formula) to Distance. You can use the base R names function to do
this.
4. Create a data frame from the data in the cube. This is necessary because the rxLinePlot function can
only process XDF format data or data frames, not rxCube data. Use the rxResultsDF function to
perform this conversion.
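A minimal sketch of steps 3 and 4, assuming the cube from the previous step is held in a variable named delayCube:

```r
# Rename the generated F_Distance column, then convert the cube to a data frame
names(delayCube)[1] <- "Distance"
delayPlotData <- rxResultsDF(delayCube)
```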
5. Generate a scatter plot of DelayPercent versus Distance. Experiment with the symbolStyle,
symbolSize, and symbolColor arguments to the rxLinePlot function to see their effects.
Note: The labels on the x axis can become unreadable. You can remove them by setting the
scales argument to list(x = list(draw = FALSE)). The scales argument is passed to the underlying
xyplot function.
Results: At the end of this exercise, you will have used the rxLinePlot function to generate line plots that
depict flight delay times as a function of flight time and day of the week.
Question: What do you observe about the graphs showing flight delay as a proportion of
travel time against distance? Does this bear out your theory that there is little relationship
between these two variables? If not, how do you account for any discrepancy between your
theory and the observed data?
o Remove all observations where the delay is negative or greater than 180 minutes (three hours).
o Add a transformation that converts the CRSDepTime variable into a factor with 48 half-hour
intervals. Use the base R cut function to do this.
o Save the data in an XDF file named FlightDelayWithDay.xdf. Overwrite this file if it already
exists.
2. The DayOfWeek variable comprises numeric codes (1 = Monday, 2 = Tuesday, and so on). Recode
the DayOfWeek variable in the XDF file to a meaningful set of text abbreviations suitable for display
as values on a graph. Use the rxFactors function to do this.
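One possible sketch of this recoding; the file names and the level mapping syntax shown are assumptions:

```r
# Recode the numeric day codes to abbreviated day names suitable for display
rxFactors(inData = "FlightDelayWithDay.xdf",
          outFile = "FlightDelayWithDayNames.xdf", overwrite = TRUE,
          factorInfo = list(DayOfWeek = list(newLevels = c(Mon = "1", Tue = "2",
                                                           Wed = "3", Thu = "4",
                                                           Fri = "5", Sat = "6",
                                                           Sun = "7"))))
```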
3. Create a cube that summarizes Delay as a function of CRSDepTime and DayOfWeek. Use the
following formula:
Delay ~ CRSDepTime:DayOfWeek
5. Generate a scatter plot overlaid with a smooth line plot of delay as a function of departure time. Use
the following scales argument to display and orient the labels for the x and y axes:
scales = list(y = list(labels = c("0", "20", "40", "60", "80", "100", "120", "140",
                                  "160", "180")),
              x = list(rot = 90,
                       labels = c("Midnight", "", "", "", "02:00", "", "", "",
                                  "04:00", "", "", "", "06:00", "", "", "",
                                  "08:00", "", "", "", "10:00", "", "", "",
                                  "Midday", "", "", "", "14:00", "", "", "",
                                  "16:00", "", "", "", "18:00", "", "", "",
                                  "20:00", "", "", "", "22:00", "", "", "")))
Question: Using the graph showing delay times against departure time for each day of the
week, which time of day generally suffers the worst flight delays, and which day of the week
has the longest delays in this period?
o Only include the OriginState, Delay, ArrDelay, WeatherDelay and MonthName variables.
o Remove all observations where the arrival or departure delay is negative, or the total delay is
negative or greater than 1,000 minutes.
o Save the data in an XDF file named FlightDelayReasonData.xdf. Overwrite this file if it already
exists.
2. Create a histogram showing the frequency of the different arrival delays. Set the histType argument
to "Counts" to show the number of items in each bin.
3. Modify the histogram to show the percentage of items in each bin (set histType to "Percent").
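Steps 2 and 3 might look like the following sketch; the data source name is an assumption:

```r
# Frequency of arrival delays as raw counts, then as the percentage of
# items falling into each bin
rxHistogram(~ ArrDelay, data = "FlightDelayReasonData.xdf", histType = "Counts")
rxHistogram(~ ArrDelay, data = "FlightDelayReasonData.xdf", histType = "Percent")
```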
7. Save the script as Lab3Script.R in the E:\Labfiles\Lab03 folder, and close your R development
environment.
Results: At the end of this exercise, you will have used the rxHistogram function to create histograms
that show the relative rates of arrival delay by state, and weather delay by month.
Question: What is the most common arrival delay (in minutes), and how frequently does this
delay occur?
Question: Which month has the most delays caused by poor weather? Which months have
the least delays caused by poor weather?
Use the ScaleR rxLinePlot and rxHistogram functions to generate graphs based on big data.
Tools
The ggplot2 package is huge and has a bewildering array of different options, plot types and
transformations. See these resources for further information:
Whatever ggplot2 problem you have, it is very likely someone else has experienced something similar and
has found a fix on stackoverflow.com.
Module 4
Processing Big Data
Contents:
Module Overview 4-1
Module Overview
Many data scientists are familiar with the notion of data wrangling—the way in which you need to
manipulate and arrange data to get it into a shape where you can perform your various analyses. Data
wrangling can include operations such as generating information derived from the raw data, filtering the
data, performing additional computations such as normalizing the data, sorting the data, and possibly
splitting or merging datasets to form a coherent source of data.
It’s relatively easy to perform these tasks when you are using small datasets that will easily fit into memory
and do not take long to process. However, when you are working with much larger datasets, you need to
plan these operations more carefully; you need to ensure that you don't run out of resources partway
through a lengthy task—and you will also want to optimize jobs to minimize their duration. The ScaleR™
functions are designed to help you.
Objectives
In this module, you will learn how to:
Perform transformations over big data in an efficient manner.
Lesson 1
Transforming big data
This lesson focuses on the transformation framework implemented by ScaleR functions. Many ScaleR
functions provide the transforms argument that you can use to manipulate data as you read it in,
summarize it, or perform operations such as generating plots. The key aspect of the transformation
framework is that it is designed to work at scale; it uses the chunking capabilities of the XDF format to
process data in blocks. This lesson describes how to use this framework in detail to perform scalable
transformations.
Lesson Objectives
After completing this lesson, you will be able to:
Explain when to transform data on the fly, and when to make a transformation permanent.
However, remember that in the world of big data, a dataset might be many thousands of gigabytes in
size. Writing the transformed data to disk can generate a lot of additional I/O, and might incur excessive
storage charges, depending on where you are saving the data.
Therefore, saving the results each time you transform data—just in case you might need it again later—is
not always feasible. You need to strike a balance. In particular, you should consider:
How likely are you to require the transformed data again?
Does the cost of performing the transformation when you need the transformed data exceed the
costs associated with saving transformed data to storage?
How big is the transformed data compared to the effort required to generate it? If it is small and used
occasionally, but takes considerable effort to construct, then you should save it; the associated I/O
and storage costs will be minimal.
How volatile is the underlying data? If you are performing analyses on live data (rather than historic
records), it might not be appropriate to save the transformed information because it could quickly
become outdated.
You can view the block size and the number of blocks in an XDF file using the rxGetInfo function; set the
getBlockSizes argument to TRUE.
rxGetInfo function
rxGetInfo("FlightDelayData.xdf", getBlockSizes = TRUE)
# Typical output
File name: FlightDelayData.xdf
Number of observations: 1135221
Number of variables: 29
Number of blocks: 23
Rows per block (first 10): 50000 50000 50000 50000 50000 50000 50000 50000 50000 50000
Compression type: zlib
When you process the file using a function such as rxDataStep, you can either specify the number of
blocks to read as a chunk using the blocksPerRead argument, or the number of rows using the
rowsPerRead argument. So, if the XDF file has a block size of 50000 rows and you specify a value of 2 for
the blocksPerRead argument, each chunk will contain 100000 rows (2 * 50000).
When the data is written back, it will use the new chunk size of 100000 rows for each block. Alternatively,
if you set rowsPerRead to 25000, rxDataStep will only read half a block at a time as a chunk into
memory, and when the data is written out, it will have a block size of 25000 rows.
Note: Chunking only operates in an R Server environment. If you are using R Client, the
ScaleR functions do not perform chunking, and the entire dataset must fit into memory.
rxDataStep function
rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataMetric.xdf",
           rowsPerRead = 50000,
           transforms = list(Distance = Distance * 1.6))
Best Practice: Remember that the transforms argument is a list that can contain any
number of transformations. Add all the transformations that you require to this list to process the
data in a single pass. Do not perform multiple runs of rxDataStep, each implementing a single
transformation.
Be aware of the chunk-oriented nature of transformations. You have immediate access to all the data in
the current chunk. You can read data held in other blocks (this is discussed in the topic Using Custom
Transformation Functions), but this will incur additional I/O—try to avoid repeatedly reading the same
blocks over and over.
Best Practice: Avoid defining transformations that require access to all observations in the
dataset simultaneously, such as the poly and solve matrix operations. These operations can be
expensive because they can involve repeatedly reading the dataset, and they will be performed
for every row in the dataset.
Also, when sampling data in a transform, remember that the sampling algorithm only has access
to the current chunk unless you reread the entire dataset.
Creating factors
You can add categorical variables to a dataset, but you must remember that only one chunk of data is
accessible at a time. This means that you should not write code that attempts to define factor levels and
labels automatically. Doing this could cause inconsistencies in the variable across the dataset (different
chunks might omit some factor levels and labels if there is no matching data in that chunk). Instead, you
should specify levels and labels explicitly. In the following example, IsCancelled is a categorical variable
that maps the values of the Cancelled variable (0, 1) to TRUE and FALSE.
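The example itself was not reproduced here; a sketch of what it might look like follows (the file names are assumptions):

```r
# Create IsCancelled with explicitly specified levels and labels, so that
# every chunk produces a consistent factor even if a chunk contains no
# cancelled flights
rxDataStep(inData = "FlightDelayData.xdf",
           outFile = "FlightDelayDataWithFactor.xdf", overwrite = TRUE,
           transforms = list(IsCancelled = factor(Cancelled,
                                                  levels = c(0, 1),
                                                  labels = c("FALSE", "TRUE"))))
```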
The following example will not actually filter any data, and the result will contain the entire dataset:
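A sketch of the kind of expression that exhibits this behavior, taking a random 10 percent sample; the details are assumptions:

```r
# The rowSelection expression references no variable in the dataset, so
# ScaleR does not evaluate it against each row and no data is filtered
rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelaySample.xdf",
           overwrite = TRUE,
           rowSelection = as.logical(rbinom(n = 1, size = 1, prob = 0.1)))
```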
To get around this issue, you amend the rowSelection expression to involve at least one of the variables
in the dataset, as shown in the following example:
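For example, a random sampling expression can reference the .rxNumRows special variable; the file names are assumptions:

```r
# Referencing .rxNumRows ties the expression to the chunk being processed,
# producing one TRUE/FALSE value per row in the chunk
rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelaySample.xdf",
           overwrite = TRUE,
           rowSelection = as.logical(rbinom(n = .rxNumRows, size = 1, prob = 0.1)))
```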
The .rxNumRows variable used in this example is one of the special variables created by the
transformation process; it contains the number of rows in the current chunk. Later in this lesson, the topic
Using Custom Transformation Functions describes the special variables in more detail.
You can also use the startRow, and numRows arguments to limit the size of the transformed dataset. The
startRow argument specifies the starting offset at which to begin the process, and numRows indicates
the number of rows to transform. The resulting dataset will only contain rows that fall into this range. The
startBlock and numBlocks arguments are similar, except that you specify blocks rather than rows.
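A brief sketch of limiting a transformation by row range (file names assumed):

```r
# Transform only 1,000 rows, starting at row 5,000
rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelaySubset.xdf",
           overwrite = TRUE,
           startRow = 5000, numRows = 1000)
```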
Note: If you repeatedly reference the same packages in all transformations in a session, you
can use the rxOptions function to set the transformPackages option globally.
The following example uses functions in the lubridate package to add the flight departure date to the
flight delay dataset. The departure date is created as a POSIXct date/time variable, based on the values in
the Year, Month, DayofMonth, and CRSDepTime variables. Note that the CRSDepTime variable is a
character string containing up to four digits that represent the departure time in the format "hhmm". For
times before 10:00 AM, the format is "hmm".
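The example itself is not reproduced here; a possible sketch follows. The output variable name DepartureDate, the file names, and the use of lubridate's make_datetime are assumptions:

```r
# Build a POSIXct departure date from the component variables. CRSDepTime
# ("hmm" or "hhmm") is zero-padded to four digits before being split into
# hours and minutes.
rxDataStep(inData = "FlightDelayData.xdf",
           outFile = "FlightDelayDataWithDate.xdf", overwrite = TRUE,
           transformPackages = c("lubridate"),
           transforms = list(
             DepartureDate = make_datetime(
               year  = as.integer(as.character(Year)),
               month = as.integer(as.character(Month)),
               day   = as.integer(as.character(DayofMonth)),
               hour  = as.integer(substr(sprintf("%04d", as.integer(CRSDepTime)), 1, 2)),
               min   = as.integer(substr(sprintf("%04d", as.integer(CRSDepTime)), 3, 4)))))
```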
You can modify the data in these vectors, or construct a new list of vectors containing the transformed
data. You must return a list containing these vectors at the end of the function. All vectors in this list must
have the same length. This data is merged back into the chunk, overwriting the appropriate variables with
the transformed values, before being written back to disk. Note that, if you remove a vector from the list,
the corresponding variable will be removed from the resulting dataset.
Inside the transformation function, you have access to a set of special variables managed by rxDataStep.
These include:
.rxReadFileName. The name of the XDF file from which the data was read. If you need to access data
in other blocks in the file, you can open an RxXdfData data source using this variable and navigate
to the appropriate location.
.rxIsTestChunk. TRUE if this is a test pass of the transformation function; otherwise it is FALSE.
The .rxIsTestChunk variable is important. Some transformations perform an initial test pass over the
first block of data. If this test pass is successful, the same block is then processed again, followed by the
remaining blocks. You should always check the .rxIsTestChunk variable to avoid generating duplicated
results for the first block of data.
You can manipulate the objects in the transformation environment by using the .rxGet(objName) and
.rxSet(objName, objValue) functions. These functions give you a way to pass information from one
chunk to the next. The following example uses these functions to add a running total of the total flight
distance for all flights recorded in the flight delay dataset.
addRunningTotal transformation function
addRunningTotal <- function(dataList) {
# Don't process the data during the test pass over the first chunk
if (.rxIsTestChunk) {
return(dataList)
}
# Add a new vector to hold the running totals and note its position in the list
dataList[["RunningTotal"]] <- rep(0, length(dataList[[1]]))
runningTotalVarIndex <- length(dataList)
# Retrieve the current running total for the distance from the environment
runningTotal <- as.double(.rxGet("runningTotal"))
# Iterate through the values for the Distance variable and accumulate them
idx <- 1
for (distance in dataList[[1]]) {
runningTotal <- runningTotal + distance
dataList[[runningTotalVarIndex]][idx] <- runningTotal
idx <- idx + 1
}
# Save the running total back to the environment, ready for the next chunk
.rxSet("runningTotal", as.double(runningTotal))
return(dataList)
}
The addRunningTotal function uses the variable runningTotal as an accumulator. The value of this
variable is initialized to 0 in the transformObjects argument of the rxDataStep object. At this point, the
runningTotal variable becomes part of the environment used by the addRunningTotal function. The
function retrieves the current value of runningTotal by using .rxGet, and saves it at the end of the
function by using .rxSet.
The logic in the body of the function adds a new vector to the dataList list and names it RunningTotal.
This vector is populated with the accumulated total of the Distance variable read from each row in the
first vector of dataList. This first vector is filled in by the rxDataStep function as specified by the
transformVars argument.
Note: If your transformation function uses functions in external packages, you must
reference these packages in the transformPackages argument of rxDataStep.
Needless to say, if your code is only slightly less efficient and takes 1.5 milliseconds per iteration,
this will add another 14 hours to the processing time.
This is also a situation where you should consider the size of the platform on which you are
running your code. Add as much memory and processing power to your computing environment
as possible. It might even be worth creating a temporary cluster of large VMs in Azure®
especially to perform the task. You can remove these VMs once you have finished.
# Results:
# File name: FlightDelayData.xdf.xdf
# Number of observations: 1135221
# Number of variables: 29
# Number of blocks: 23
# Rows per block (first 10): 50000 50000 50000 50000 50000 50000 50000 50000 50000 50000
# Compression type: zlib
# Results:
# File name: FlightDelayDataSample.xdf
# Number of observations: 11433
# Number of variables: 31
# Number of blocks: 12
# Rows per block (first 10): 1008 969 942 952 1088 1021 1034 1009 1016 1022
# Compression type: zlib
You can reblock a file by using rxDataStep. Read the file and write it out again, and specify the number
of rows to include in each block using the rowsPerRead argument. The result should be a defragmented
file consisting of fewer blocks:
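A sketch of such a reblocking step, using the fragmented sample file from the earlier output (file names assumed):

```r
# Rewrite the file as a single 11,433-row block, then confirm the new layout
rxDataStep(inData = "FlightDelayDataSample.xdf",
           outFile = "DefragmentedSample.xdf", overwrite = TRUE,
           rowsPerRead = 11433)
rxGetInfo("DefragmentedSample.xdf", getBlockSizes = TRUE)
```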
# Results:
# File name: DefragmentedSample.xdf
# Number of observations: 11433
# Number of variables: 31
# Number of blocks: 1
# Rows per block: 11433
# Compression type: zlib
Best Practice: If you change the name of a variable in an XDF file, you might need to
reblock the file afterwards to ensure that the metadata recording the new variable name is
updated in every block.
Best Practice: To reduce the chances of fragmentation, avoid transformations that change
the length of a variable.
Demonstration Steps
3. Highlight and run the code under the comment # Connect to R Server. This code connects to R
Server running on the LON-RSVR VM.
4. Highlight and run the code under the comment # Examine the dataset. This code shows the
structure of the sample data and displays the first 10 rows. Note that the data contains 29 variables.
5. Highlight and run the entire block of code that creates the addRunningTotal function, under the
comment # Create the transformation function. This function performs the following tasks:
a. It checks to see whether this is a test pass over a chunk, and if so it returns immediately.
b. It retrieves the value of the runningTotal variable from the environment (this variable will be
initialized to 0 by the rxDataStep function).
c. It adds a new vector to the dataList list. This vector will add the data for the new column
containing the running total. The column is named RunningTotal.
d. It iterates through the values in the vector for the Distance variable, generates the running total,
and adds the total for each row to the RunningTotal vector.
e. After completing its work, the function saves the current value of the runningTotal variable to
the environment.
f. It returns the updated dataList list that now includes the RunningTotal vector. This vector will
be added as a variable to the dataset.
The transformFunc argument is set to addRunningTotal. This is the name of the transformation
function.
The transformObjects argument creates and initializes the runningTotal environment variable.
The numRows argument limits the operation to the first 2 million rows. It is always best to test
transformations on a subset of your data first.
The transforms list also adds a variable named ObservationNum to the data. This variable holds
the row number. It is generated by using a range based on the .rxStartRow and .rxNumRows
special variables.
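The rxDataStep call that the steps above describe might look like this sketch; the file names are assumptions:

```r
# Run the addRunningTotal transformation over the first 2 million rows,
# initializing the runningTotal accumulator in the transform environment
rxDataStep(inData = "FlightDelayData.xdf",
           outFile = "FlightDelayDataWithTotals.xdf", overwrite = TRUE,
           transformFunc = addRunningTotal,
           transformObjects = list(runningTotal = 0),
           transformVars = c("Distance"),
           transforms = list(ObservationNum =
                               .rxStartRow:(.rxStartRow + .rxNumRows - 1)),
           numRows = 2000000)
```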
2. As the function runs, note the progress messages that are reported, listing the time taken to process
each chunk.
Lesson 2
Managing big datasets
Sorting and merging are common tasks when wrangling data—you frequently require data to be in a
specific sequence, or include information retrieved from disparate datasets. When operating with datasets
that will easily fit into memory, these tasks are trivial. However, when you are handling big data, they
become much more significant. Sorting and merging many hundreds of gigabytes of data is not a task
that you should undertake lightly, due to the resources required—specifically, memory, processor power,
disk space, and time. This lesson examines these issues in more
detail, and describes how you can use ScaleR functions to address them.
Lesson Objectives
In this lesson, you will learn:
3. How to combine data from different datasets using the rxMerge function.
If you need to sort by a noninteger numeric independent variable, consider scaling the data as you
cross-tabulate it. For example, if you have numeric values falling in the range between 0 and 1, scale
them by 1,000. The integer conversion performed by using the F function will then sort the data to within
1/500th of the original values. If you require greater or less accuracy, you can iterate to refine this process.
If you are sorting to calculate aggregates by groups of data, consider using a transformation function
that creates a running total for each group, as described in the previous lesson. This calculation can
be performed by taking a single pass through the data a block at a time and can be very efficient.
Remember that you can use custom transformation functions with ScaleR functions such as
rxSummary, rxCrossTabs, and rxCube.
Be aware that many ScaleR functions, such as rxQuantiles, rxLorenz and rxRoc, are specifically
intended for operating on big data and do not require data to be presorted. If possible, use these
functions in preference to other, more traditional R packages.
The following example sorts the flight delay data by descending order of Origin airport, and then by
ascending order of Dest airport within each Origin group:
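The example itself is not shown here; it might resemble the following sketch (file names assumed):

```r
# Sort by Origin (descending), then by Dest (ascending) within each Origin
rxSort(inData = "FlightDelayData.xdf", outFile = "SortedFlightDelayData.xdf",
       sortByVars = c("Origin", "Dest"),
       decreasing = c(TRUE, FALSE),
       overwrite = TRUE)
```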
Note: If you sort on a factor variable, the data is sorted by factor level and not by name.
This can cause confusion if the levels are not in any specific order.
You can reduce the resources required to sort data by only selecting the variables that you really need in
the result—by using the varsToKeep and varsToDrop arguments. If you can cut the number of variables to a
minimum, rxSort might be able to sort data in memory rather than by performing a merge sort, and
consequently run much more quickly.
Note: The rxSort function does not support filtering through rowSelection, or
transformations. Additionally, you cannot use numRows to limit the number of rows in the
source dataset.
Use the removeDupKeys argument to rxSort to remove rows from the result that have duplicated sort
key values. You can track the number of rows removed by using the dupFreqVar argument. This
argument specifies the name of a column to add to the sorted result containing this number. The
following example uses this technique to show the popularity of each airline route, based on the origin
and destination airports. (Remember that Origin and Dest are both factors in this dataset, so the data is
sorted by level rather than alphabetically).
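The example might be sketched as follows, returning a data frame named sortedFlightDelayData; the details are assumptions:

```r
# One row per Origin-Dest route, with RoutesFrequency counting how many
# duplicate rows were removed for each route
sortedFlightDelayData <- rxSort(inData = "FlightDelayData.xdf",
                                sortByVars = c("Origin", "Dest"),
                                varsToKeep = c("Origin", "Dest", "Distance"),
                                removeDupKeys = TRUE,
                                dupFreqVar = "RoutesFrequency")
```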
# Sample results
head(sortedFlightDelayData, 10)
These operations are described in more detail in Module 2: Exploring Big Data.
The rxMerge function also supports relational-style joins, meaning you can perform inner and outer
joins across datasets.
Inner joins
An inner join merges datasets across columns that share a common key value. For example, the flight
delay dataset specifies airport information by using codes:
The details of each airport, such as its name, city, state, and location, could be held in a separate dataset:
You can merge these two datasets using the airport codes, but you should first change the name of the
join column to be the same in both datasets. In this example, the name of the "iata" variable in the airport
data (sortedAirportData) is changed to "Origin" to match the flight delay data.
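A sketch of the rename, assuming the airport data is held in an XDF file named airportData.xdf:

```r
# Rename the iata variable to Origin so that the join columns match
rxSetVarInfo(varInfo = list(iata = list(newName = "Origin")),
             data = "airportData.xdf")
```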
You specify the type of join to perform as "inner", and you use the matchVars argument to specify the
variables to use for performing the join. Note that before merging, both datasets must be sorted by these
variables, in the same order. You can specify the autoSort argument to rxMerge to do this, or you can
presort the data manually:
Merging the flight delay data and airport datasets using airport codes
rxMerge(inData1 = sortedFlightDelayData, inData2 = sortedAirportData,
        outFile = mergedData,
        matchVars = c("Origin"), type = "inner")
Outer joins
If a row in the first dataset has no corresponding row in the second, then the first row will not appear in
the result. If you need to retain all rows in the first dataset, you can set the type of join to "left". In this
case, the rxMerge function performs a left outer join operation, and uses NAs for the values of all
variables from the second dataset.
You can also perform a right outer join by setting the type to "right"; all rows from the second dataset
will appear in the result, joined with rows from the first containing NA values if necessary. Finally, you can
carry out a combination of left and right outer join operations by setting type to "full".
Demonstration Steps
Tuning a sort
1. Open your R development environment of choice (RStudio or Visual Studio).
4. Highlight and run the code under the comment # Examine the data. This code shows the first few
rows from the flight delay data.
5. Highlight and run the code under the comment # Sort it by Origin. This code sorts the data by the
Origin variable, in decreasing order. This process can take up to 90 seconds.
6. Highlight and run the code under the comment # Note the factor levels for Origin. This code
displays the factor levels for the Origin variable. This is the sequence in which the data should be
sorted. Note that the final three levels are MKG, LMT, and OTH.
7. Highlight and run the code under the comment # View the data. It should be sorted in
descending order of Origin. This statement displays the first 200 rows of the data. Note that the
data for OTH appears first, followed by LMT, and then MKG.
8. Highlight and run the code under the comment # Sort the data again. This statement sorts a dataset
containing a much smaller number of variables. The sort should be much quicker. This shows the
importance of being selective when sorting a dataset.
9. Highlight and run the code under the comment # View the data. The data should still be sorted by
Origin.
2. Highlight and run the code under the comment # View the data. The data is sorted by Origin, but will
only contain the Origin, Dest, and Distance variables, together with RoutesFrequency, indicating how
many times each route was found in the original data.
Which option for the rxMerge function enables you to combine data horizontally from two
different datasets that have a different number of rows?
oneToOne
union
combine
lookup
inner
Objectives
In this lab, you will:
Merge data from a second dataset into the flight delay data.
Write a transformation function to add variables that record the departure and arrival times as UTC
times.
Create a transformation function to generate the cumulative departure and arrival delays for each
route.
Lab Setup
Estimated Time: 60 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
2. Copy the following files from the E:\Labfiles\Lab04 folder to the \\LON-RSVR\Data share:
airportData.xdf
FlightDelayData.xdf
2. Create a remote session on the LON-RSVR server. This is another VM running R Server. Use the
following parameters to the remoteLogin function:
deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800
session: TRUE
diff: TRUE
commandLine: TRUE
username: admin
password: Pa55w.rd
3. Examine the factor levels for the iata field in the airportData XDF file, and the factor levels for the
Origin and Dest variables in the FlightDelayData XDF file. Notice that the two files use different
factor levels for this data. There is even some minor variation between the Origin and Dest variables
in the flight delay data, as shown in the following output:
Var 1: iata
3376 factor levels: 00M 00R 00V 01G 01J ... ZEF ZER ZPH ZUN ZZV
Var 1: Origin
329 factor levels: ATL AUS BHM BNA BOS ... GCC RKS MKG LMT OTH
Var 1: Dest
331 factor levels: PHX PIA PIT PNS PSC ... GCC RKS MKG OTH LMT
4. Combine the levels in the iata, Origin, and Dest variables into a new set of factor levels. Remove any
duplicates.
5. Use the rxFactors function to refactor the iata field in the airportData XDF file with this new set of
factor levels.
6. Use the rxFactors function to refactor the Origin and Dest variables in the FlightDelayData XDF file
with this set of factor levels.
7. Verify that the factor levels in both XDF files are now the same, with 3377 levels in the same order.
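Steps 4 and 5 might be sketched like this; the object and file names are assumptions:

```r
# Combine the factor levels from both files into one set, removing duplicates
airportInfo <- rxGetVarInfo("airportData.xdf")
flightInfo <- rxGetVarInfo("FlightDelayData.xdf")
refactorLevels <- unique(c(airportInfo$iata$levels,
                           flightInfo$Origin$levels,
                           flightInfo$Dest$levels))

# Refactor the iata variable using the combined set of levels
rxFactors(inData = "airportData.xdf",
          outFile = "refactoredAirportData.xdf", overwrite = TRUE,
          factorInfo = list(iata = list(newLevels = refactorLevels)))
```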
2. Reblock the airport data file using the rxDataStep function. This task ensures that the metadata for
the renamed field is updated in every block.
3. Use the rxMerge function to combine the data in the two refactored files, as follows:
Keep all fields from the flight delay data file, but only retain the timezone and Origin fields from
the airport data file.
4. Verify that the flight delay data now contains the OriginTimeZone variable, containing time zone
information as shown in the following example:
rxGetVarInfo(mergedFlightDelayData)
Var 1: Year
9 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008
...
Var 29: OriginTimeZone, Type: character
Results: At the end of this exercise, you will have created a new dataset that combines information from
the flight delay data and airport information datasets.
Performing timezone conversions is a complex task that lubridate makes appear very simple. However,
this involves a considerable amount of processing. Therefore, you decide to test the transformation on a
small subset of the flight delay data, comprising approximately 20,000 rows.
2. Verify the number of rows in the sample by using the rxGetInfo function. The result should look like
this (the number of rows and block sizes in your output might vary slightly):
2. Create a transformation function named standardizeTimes. In this function, perform the following
tasks:
Create a vector for the departure time; name the variable StandardizedDepartureTime. Create
another vector for the arrival time; name the variable StandardizedArrivalTime.
For each row in the chunk:
i. Retrieve the departure year, month, day of month, time, and timezone.
ii. Construct a string containing the date and time in POSIXct format: "yyyy-mm-dd hh:mm".
iii. Use the base R as.POSIXct function to convert this string into a date. Include the local
timezone.
iv. Use the format function to generate a string representation of the date converted to UTC
format.
v. Save the departure time as a string in the StandardizedDepartureTime field of the dataset.
vi. Retrieve the elapsed flight time. This is an integer value representing a number of minutes.
vii. Add the elapsed time to the standardized departure time. You can use the minutes function
to convert an integer into a number of minutes, and then use the + operator.
viii. Save the arrival time as a string in the StandardizedArrivalTime field of the dataset.
3. Use the rxDataStep function to perform the transformation over the sample subset of the flight
delay data. You will need to include the following arguments:
transformFunc = standardizeTimes
transformVars = c("Year", "Month", "DayofMonth", "DepTime", "ActualElapsedTime",
"OriginTimeZone")
transformPackages = c("lubridate")
4. Examine the data in the transformed file and verify that the StandardizedDepartureTime and
StandardizedArrivalTime variables have been added successfully.
Results: At the end of this exercise, you will have implemented a transformation function that adds
variables containing the standardized departure and arrival times to the flight delay dataset.
You will calculate the cumulative average delay, based on the cumulative number of flights for
the route and the cumulative total delay for the route. You need to record and save this
information so it can be accessed as the function processes each row. To do this, you will use two
lists named cumulativeDelays and cumulativeRouteOccurrences when you run the
rxDataStep function. Use the .rxGet function to retrieve these two lists.
Iterate through the rows in the block and perform the following actions:
i. Retrieve the Origin and Dest variables, and concatenate their values together. You will use
this string as the key for the cumulativeDelays and cumulativeRouteOccurrences lists.
iii. Find the current cumulative delay for the route in the cumulativeDelays list, add the value
of the Delay variable, and store the result back in the cumulativeDelays list.
iv. Find the current cumulative count for occurrences of the route in the
cumulativeRouteOccurrences list, increment this value, and store it back in the list.
v. Calculate the cumulative average delay for the route by dividing the cumulative delay by the
cumulative number of occurrences for the route, and write the result to the
CumulativeAverageDelayForRoute variable.
Use the .rxSet function to save the cumulativeDelays and cumulativeRouteOccurrences lists
so that they can be accessed when processing the next block.
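The steps above can be sketched as follows. This is one possible outline, not the lab solution; it assumes the chunk contains Origin, Dest, and Delay variables, and uses the list and variable names given in the text:

```r
calculateCumulativeAverageDelays <- function(dataList) {
  # Skip the test chunk that rxDataStep uses to validate the transformation
  if (.rxIsTestChunk) return(dataList)

  # Retrieve the lists saved by the previous chunk (NULL on the first chunk)
  cumulativeDelays <- .rxGet("cumulativeDelays")
  cumulativeRouteOccurrences <- .rxGet("cumulativeRouteOccurrences")
  if (is.null(cumulativeDelays)) cumulativeDelays <- list()
  if (is.null(cumulativeRouteOccurrences)) cumulativeRouteOccurrences <- list()

  averages <- numeric(length(dataList$Origin))
  for (i in seq_along(dataList$Origin)) {
    # The route key is the concatenated origin and destination
    route <- paste0(dataList$Origin[[i]], dataList$Dest[[i]])

    delaySoFar <- cumulativeDelays[[route]]
    if (is.null(delaySoFar)) delaySoFar <- 0
    countSoFar <- cumulativeRouteOccurrences[[route]]
    if (is.null(countSoFar)) countSoFar <- 0

    # Update the running totals and compute the cumulative average
    cumulativeDelays[[route]] <- delaySoFar + dataList$Delay[[i]]
    cumulativeRouteOccurrences[[route]] <- countSoFar + 1
    averages[[i]] <- cumulativeDelays[[route]] / cumulativeRouteOccurrences[[route]]
  }
  dataList$CumulativeAverageDelayForRoute <- averages

  # Save the lists so that the next chunk can continue the running totals
  .rxSet("cumulativeDelays", cumulativeDelays)
  .rxSet("cumulativeRouteOccurrences", cumulativeRouteOccurrences)
  return(dataList)
}
```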
2. Use the rxDataStep function to run the transformation. You will need to include the following
arguments:
transformFunc = calculateCumulativeAverageDelays
2. Use the rxLinePlot function to generate a scatter plot and regression line of average flight delays for
the following routes:
ATL to PHX
SFO to LAX
LAX to SFO
DEN to SLC
LGA to ORD
CumulativeAverageDelayForRoute ~ as.POSIXct(StandardizedDepartureTime)
Use the rowSelection argument to specify the origin and destination airport for each route.
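For a single route, the call might look like this sketch (the dataset variable name is illustrative; type = c("p", "r") requests points plus a regression line):

```r
# Scatter plot and regression line of cumulative average delay over time, ATL to PHX
rxLinePlot(CumulativeAverageDelayForRoute ~ as.POSIXct(StandardizedDepartureTime),
           data = transformedFlightData,
           type = c("p", "r"),
           rowSelection = (Origin == "ATL") & (Dest == "PHX"))
```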
3. Save the script as Lab4Script.R in the E:\Labfiles\Lab04 folder, and close your R development
environment.
Results: At the end of this exercise, you will have sorted data, and created and tested another
transformation function.
Module 5
Parallelizing Analysis Operations
Contents:
Module Overview 5-1
Module Overview
The ScaleR™ functions in the RevoScaleR package that you have seen to this point are excellent tools for High
Performance Analytics (HPA), where jobs are typically data-limited and the problem for the software is
effectively distributing that data to the different nodes in the cluster or server. In the HPA model, you run
a task using the various rx* functions on an R Server cluster. One of the computing resources in the
cluster takes on the role of managing the task and becomes the master node for that task. The master
node splits the computation out in subtasks, which it distributes across all nodes in the cluster. All nodes
in the cluster have access to the data, and the master node determines which parts of the data each node
should process. When the nodes have completed their work, the master node collects the results, and
then accumulates an overall result, which it returns.
Another set of problems are more accurately classed as High Performance Computing (HPC). These jobs
are typically CPU-limited; you have less data but you require a lot of CPU power. These problems are
often known as “embarrassingly parallel”—little or no effort is required to split up the processing into
tasks that can be run in parallel. In other words, the tasks are not dependent on each other, so they can be
easily separated to run on different nodes. Base R has a number of packages and functions to assist with
this, such as the parallel package that implements parallel versions of the lapply function, and the
foreach package that you can use for parallelizing loops.
The RevoScaleR package includes functions that enable you to perform HPC and embarrassingly parallel
computations, in addition to HPA operations. Just like the rx* functions, these functions make use of the
“write once, deploy anywhere” model, where you can write your code and check that it works locally
before deploying it to a more powerful remote server by simply changing the compute context in which it
runs.
Objectives
In this module, you will learn how to:
Use the rxExec function with the RxLocalParallel compute context to run arbitrary code and
embarrassingly parallel jobs on specified nodes or cores, or in your compute context.
Use the RevoPemaR package to write customized scalable and distributable analytics.
Lesson 1
Using the RxLocalParallel compute context with rxExec
Use the rxExec function to perform traditional HPC tasks by executing a function in parallel across the
nodes of a cluster or the cores of a remote server. It offers great flexibility regarding how arguments are
passed—you can specify that all nodes receive the same arguments, or provide different arguments to
each node. However, unlike the HPA ScaleR functions, you need to control how the computational tasks
are distributed and you are responsible for any aggregation and final processing of results.
Lesson Objectives
After completing this lesson, you will be able to:
Describe when to use the RxLocalParallel compute context to perform parallel jobs.
You should use the combination of rxExec and the RxLocalParallel compute context only in situations
when tasks are well suited to parallel execution, such as:
Embarrassingly parallel tasks where individual subtasks are not dependent upon each other. These
include mathematical simulations, bootstrap replicates of relatively small models, image processing,
brute force searches, growing trees in a random forest algorithm, and almost any situation where you
would use the lapply function on a single core computer.
The default compute context is RxLocalSeq, which allows only sequential processing when you use
rxExec. To switch to RxLocalParallel, you first create an RxLocalParallel object to use with rxExec, and
then set it as the current compute context:
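A minimal example:

```r
# Create an RxLocalParallel compute context object
parallelContext <- RxLocalParallel()

# Make it the current compute context
rxSetComputeContext(parallelContext)
```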
Subsequent calls to rxExec will now make use of the parallel compute context.
You should also check the documentation in R Client for rxExec and RxLocalParallel.
There are two primary use cases for running rxExec:
1. As a way of running a function a specified number of times in parallel, using the timesToRun argument.
2. As a parallel lapply-type function that operates on each element of a list, vector, or similarly iterable
object.
The following examples illustrate instances of these use cases. Note that these examples do not necessarily
demonstrate efficient uses of parallel computing.
This example passes a simulation function, which takes no arguments and returns a numeric, in the FUN
argument to rxExec. The number of times the simulation function g is executed is determined by the
number passed to the timesToRun argument. The value returned by rxExec is a list containing the results
for each iteration:
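A sketch of such a call (the body of g is illustrative; only the rxExec arguments follow the text):

```r
# A simulation function that takes no arguments and returns a numeric
g <- function() {
  mean(rnorm(100)) + runif(1)
}

# Run the simulation 10 times; rxExec returns a list of results
y2 <- rxExec(FUN = g, timesToRun = 10)
```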
y2
# Typical results
# [[1]]
# [1] 1.59938
#
# [[2]]
# [1] 2.322001
# …
# [[10]]
# [1] 0.3149986
Note that, if you are running on a cluster, you can influence the distribution of tasks to nodes in the
cluster by using the taskChunkSize argument to rxExec. This argument specifies the number of tasks that
should be allocated to each node. For example, if you set timesToRun to 5000 and you have a five-node
cluster, you can set the taskChunkSize to 1000 to force each node to perform 1,000 iterations of the task
rather than letting the master node decide.
In the next example, you have a list of numbers, xs, which you want to use as inputs to a simulation
function, f. The rxExec function applies the function f (in the FUN argument) to each element of xs (in
the elemArgs argument) to produce the list ys. Here, you determine the number of times the function is
called, not by the timesToRun argument, but by the length of the object passed to elemArgs. Because
the function operation on each element is independent of the others, the different operations can be
farmed out across as many nodes or cores as appropriate for the size of your problem and your available
compute resource.
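The definitions of f and xs might look like this sketch (the body of f is illustrative):

```r
# A simulation function that takes a single argument
f <- function(x) {
  abs(x) + runif(1)
}

# A list of 10 input values, one per invocation of f
xs <- as.list(rnorm(10))
```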
# Invoke the function 10 times using rxExec and gather the results
ys <- rxExec(FUN = f, elemArgs = xs)
ys
# Typical results
# [[1]]
# [1] 1.681905
#
# [[2]]
# [1] 0.5026906
# …
# [[10]]
# [1] 1.145017
This final example shows how to supply multiple arguments to the FUN function by providing a nested list
to elemArgs. The simulation function h takes two arguments, x1 and x2. The code in the second line
builds a list of lists, xx, each element of which contains two numeric values, also named x1 and x2. In the
call to rxExec, the length of the list xx determines the number of times h is called, and the values in each
element are passed on to h for that iteration. The results are returned by the rxExec function as a list.
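A sketch of this example (the body of h and the construction of xx are illustrative):

```r
# A simulation function that takes two arguments
h <- function(x1, x2) {
  sqrt(x1^2 + x2^2)
}

# Build a list of lists; each element supplies both arguments for one call to h
xx <- lapply(1:10, function(i) list(x1 = rnorm(1), x2 = rnorm(1)))

# rxExec calls h once per element of xx and returns the results as a list
zs <- rxExec(FUN = h, elemArgs = xx)
```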
Note that, in all three examples, vectorized versions of base R functions could be more efficient than the
code shown. However, the power of parallelized code is apparent with more complex simulation functions
running for vastly more iterations.
Note: The rxExec function operates in the same closed environment as other ScaleR
functions. If you invoke a function that references another package, you must specify the
package name using the packagesToLoad argument. Similarly, if you reference other R objects
from your environment inside the function, you must provide a list of the object names using the
execObjects argument.
To use doRSR, you first need to load the package into the namespace, and then register the back end.
After this, any code using the %dopar% function will run using rxExec and your current compute context.
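Loading and registering the back end takes two statements:

```r
# Load the doRSR package and register rxExec as the foreach back end
library(doRSR)
registerDoRSR()
```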
The following code shows a simple example:
# Run a foreach loop containing the %dopar% function to calculate square roots in
parallel
foreach(i=1:3) %dopar% sqrt(i)
# Results:
# [[1]]
# [1] 1
#
# [[2]]
# [1] 1.414214
#
# [[3]]
# [1] 1.732051
Note: The ScaleR package defines a compute context, RxForeachDoPar, that is specifically
optimized to handle parallel foreach operations. This compute context creates a parallel
environment, and registers the back end automatically. You can use this compute context in
place of RxLocalParallel if you are only using rxExec to run loops.
If you roll a 4, 5, 6, 8, 9, or 10, then that number becomes your point. You continue rolling until you
either roll your point again (you win), or roll a 7 (you lose).
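The playDice function used in this demonstration is not listed here; a minimal sketch of such a simulation might be:

```r
# Simulate a single game of dice and return "Win" or "Loss"
playDice <- function() {
  roll <- sum(sample(1:6, 2, replace = TRUE))  # first roll of two dice
  if (roll %in% c(7, 11)) return("Win")        # 7 or 11 wins immediately
  if (roll %in% c(2, 3, 12)) return("Loss")    # 2, 3, or 12 loses immediately
  point <- roll                                # otherwise the roll becomes your point
  repeat {
    roll <- sum(sample(1:6, 2, replace = TRUE))
    if (roll == point) return("Win")           # rolling the point again wins
    if (roll == 7) return("Loss")              # rolling a 7 loses
  }
}
```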
Demonstration Steps
Running a simulation sequentially
3. Highlight and run the code under the comment # Connect to R Server. This code connects to R
Server running on the LON-RSVR VM.
4. Highlight and run the code that creates the playDice function, under the comment # Create dice
game simulation function.
5. Highlight and run the statement that runs the playDice function, under the comment # Test the
function. This code should display the message Win or Loss, depending on the output of the
simulation.
6. Highlight and run the code under the comment # Play the game 100000 times sequentially. This
code uses the replicate function to run the playDice function. The results are captured and
tabulated, showing the percentage of wins and losses in the 100,000 runs of the games. Note the
user and system statistics reported by the system.time function.
7. Repeat step 6 several times, to get an average of the user and system timings.
Running a simulation using parallel tasks
1. Highlight and run the code under the comment # Play the game 100000 times using rxExec. Note
that the code is currently running using the RxLocalSeq compute context, so tasks are still being
performed sequentially. However, the user and system timings should be much quicker than before.
2. Highlight and run the code under the comment # Switch to RxLocalParallel. The time spent running
in user mode should be lower still, although the overall elapsed time is likely to be higher in this
example.
The reason for this is that this is a simple simulation using a very small amount of data on a modest
server. The overhead of splitting the job into tasks and running them in parallel actually exceeds
any performance benefit gained. However, if the job were much more compute intensive and
involved vast amounts of data, these overheads would become a much less significant part of the
processing.
Note: the remote session might be interrupted with the message:
Canceling execution...
Error in remoteExecute(line, script = FALSE, displayPlots = displayPlots, :
object 'r_outputs' not found
Another use for non-waiting compute contexts is for massively parallel jobs involving multiple clusters.
You can define a non-waiting compute context on each cluster, launch all your jobs, and then aggregate
the results. The job scheduler on the cluster can control the timing of these jobs.
You can set a compute context object to be non-waiting by setting the wait argument to FALSE when
you call the context constructor function. Calls to rxExec will then return control back to the local session.
Note that you cannot define a local compute context (RxLocalSeq or RxLocalParallel) as non-waiting.
To find the status of a running non-waiting job, you can call rxGetJobStatus with the object name of the
job (defined in the call to rxExec) as the argument. If you forget to assign a name to the job in the rxExec
call, you can use the function rxLastPendingJob to retrieve it and assign a name to it. To cancel a non-
waiting job, use the rxCancelJob function with the job name as the argument.
To retrieve the results of a finished non-waiting job, you can call rxGetJobResults with the name of the
job as the argument.
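Put together, a non-waiting job might be managed like this sketch (the compute context constructor arguments are omitted, and the analysis function name is illustrative):

```r
# Create a non-waiting compute context (connection arguments omitted)
hadoopContext <- RxHadoopMR(wait = FALSE)
rxSetComputeContext(hadoopContext)

# Launch the job; control returns immediately with a job object
job <- rxExec(FUN = myAnalysisFunction)

# Poll the job status, and collect the results when it has finished
rxGetJobStatus(job)
results <- rxGetJobResults(job)

# A job that is no longer needed can be canceled
# rxCancelJob(job)
```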
For more information about creating and managing non-waiting jobs, see:
Demonstration Steps
Uploading data to Hadoop
1. Open your R development environment of choice (RStudio or Visual Studio).
2. Open the R script Demo2 - nonwaiting.R in the E:\Demofiles\Mod05 folder.
3. Highlight and run the code under the comment # Create a Hadoop compute context. These
statements establish a waiting compute context running on the Hadoop VM.
4. Highlight and run the code under the comment # Upload the Flight Delay Data. This statement
removes any existing version of the flight delay data, and then copies the latest flight delay data XDF
file to HDFS on the Hadoop VM. This will take two or three minutes to run.
Note: the rxHadoopRemove function will return FALSE if the file doesn't already exist. You can
ignore this status.
The rxHadoopCopyFromClient function should display the message TRUE if the file is uploaded
successfully.
2. Highlight and run the code under the comment # Try and sort the data in the Hadoop context.
This code uses the rxSort function to sort the results of the summary. However, the way in which
rxSort works renders it unsuitable for running in a distributed environment such as Hadoop, so the
code fails with an error message.
3. Highlight and run the code under the comment # Use rxExec to run rxSort in a distrubuted
context. This statement uses the rxExec function to invoke rxSort. You should note that this is more
of a workaround to enable you to run this function (and others like it that are not inherently
distributable), because, in this case, rxExec runs the function as a single task.
When the sort has completed, the head function displays the first few hundred routes, with the most
popular ones at the top of the list.
2. Highlight and run the code under the comment # Perform the analysis again. This statement runs
the same rxSummary task as before, but this time it executes as a non-waiting job. The value
returned is a job object.
3. Highlight and run the code under the comment # Check the status of the job. This statement
checks the status of the job. If it is still in progress, it returns the message running. If the job has
completed, it returns the message finished.
4. Keep running the code under the comment # Check the status of the job until it reports the status
message finished.
5. Highlight and run the code under the comment # When the job has finished, get the results. This
code uses the rxGetJobResults function to retrieve the results of the rxSummary function, and then
prints the result (it has not been sorted).
6. Highlight and run the code under the comment # Run the job again.
7. Highlight and run the code under the comment # Check the status of the job. Verify that the job is
running.
8. Highlight and run the code under the comment # Cancel the job. This statement uses the
rxCancelJob function to stop the job and tidy up any resources it was using.
Note: you can speed up cancellation by setting the autoCleanup flag to FALSE when you create the
Hadoop compute context. However, this will leave any temporary artifacts and partial results in place
on the Hadoop server. You will eventually need to remove these items to avoid filling up server
storage.
9. Highlight and run the code under the comment # Check the status of the job. The job should now
be reported as missing.
10. Highlight and run the code under the comment # Return to the local compute context. This
statement resets your compute context back to the local client VM.
Verify the correctness of the statement by placing a mark in the column to the right.
Statement Answer
Lesson 2
Using the RevoPemaR package
In this lesson, you will learn about the RevoPemaR package—a framework to build your own HPA
functions that are scalable and distributable across your servers, clusters or database services. These
functions are known as Parallel External Memory Algorithms (PEMAs). They can work with chunked data
too big to fit in memory, and these chunks can be processed in parallel. The results are then combined
and processed, either at the end of each iteration or at the very end of computation.
As with the RevoScaleR HPA functions you have learned about in previous modules, PEMA functions can
be tested on a local session with a sample dataset, and then deployed to your high performance compute
resources.
Lesson Objectives
After completing this lesson, you will be able to:
Describe the structure of a PEMA class.
Note that the PEMA framework relies upon Reference classes, an Object-Oriented Programming (OOP)
paradigm that was introduced into the R language in R version 2.12. Reference (Ref) classes:
Have methods that belong to objects rather than to generic functions, unlike the standard S3 and S4 classes in R.
These features make them useful for computations that need to be updated repeatedly from different
chunks running on different servers or nodes on a cluster.
processData. This method controls what the algorithm does to process the data in each chunk. It
produces intermediate results that are stored as fields.
updateResults. This is the method used by the master node in the cluster or server to collect the
results from the other nodes together in one place. This method makes use of the mutable state of
Ref classes to update the same fields in multiple nodes.
processResults. This method takes the combined intermediate results from the updateResults
method and performs whichever computations are needed to calculate the final result.
getVarsToUse. This method specifies the names of the variables in the dataset to use.
The following code snippet shows the essential structure of a PEMA class generator. In this example,
PemaMean is a generator for a PEMA object that calculates the mean values for a specified variable in a
dataset. Note that you must load and reference the RevoPemaR package:
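The essential structure looks something like the following sketch (method bodies are omitted here; the generator and method names follow the text):

```r
library(RevoPemaR)

# Generator for a PEMA class that calculates the mean of a variable
PemaMean <- setPemaClass(
  Class = "PemaMean",
  contains = "PemaBaseClass",
  fields = list(
    # field definitions go here
  ),
  methods = list(
    initialize = function(varName = "", ...) {
      # set up and reset the fields
    },
    processData = function(dataList) {
      # accumulate intermediate results for one chunk
    },
    updateResults = function(pemaMeanObj) {
      # combine the results from another node's object
    },
    processResults = function() {
      # compute the final result from the combined fields
    },
    getVarsToUse = function() {
      # name the variables the analysis needs
    }
  )
)
```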
Specifying fields
You define the fields required by the PEMA object in the fields list. Each field has a name and a type, as
shown in the next example. The fields required by PemaMean are:
The name of the variable for which the mean is being calculated (varName).
The final result of the algorithm; the mean value being calculated (mean).
The code below shows the list of fields for the PemaMean class:
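A sketch consistent with the description (the field names follow this lesson; the types are assumptions):

```r
fields = list(
  varName = "character",      # name of the variable to average
  sum = "numeric",            # running sum of the values seen so far
  totalObs = "numeric",       # total number of observations processed
  totalValidObs = "numeric",  # number of non-missing observations
  mean = "numeric"            # the final result of the algorithm
)
```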
Provide a doc string that documents the method. This is general good practice.
Initialize the parent class from which the PEMA class inherits. Use the callSuper method to do this.
Set up the various functions and infrastructure that the RevoPemaR framework uses to parallelize
instances of this class. You do this by calling usingMethods(.pemaMethods).
The following code shows the initialize method of the PemaMean class. The varName field is populated
with the parameter passed to initialize; the remaining fields are set to 0:
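A sketch of such an initialize method, following the tasks just described:

```r
initialize = function(varName = "", ...) {
  'Initializes the fields of a PemaMean object'  # doc string
  callSuper(...)                 # initialize the parent class
  usingMethods(.pemaMethods)     # set up the RevoPemaR parallelization infrastructure
  varName <<- varName            # populate varName from the parameter
  sum <<- 0                      # reset the remaining fields to 0
  totalObs <<- 0
  totalValidObs <<- 0
  mean <<- 0
}
```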
Note: Important: if you fail to initialize a field, it can retain its value from a previous use of
a PEMA object constructed using this class generator. This can cause confusing results so you
should make sure that you initialize everything.
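A sketch of a processData method that accumulates into the fields described earlier (one possible implementation, not the only one):

```r
processData = function(dataList) {
  'Accumulates the sum and observation counts for one chunk of data'
  values <- as.numeric(dataList[[varName]])
  # Add to, rather than overwrite, the running totals
  sum <<- sum + sum(values, na.rm = TRUE)
  totalObs <<- totalObs + length(values)
  totalValidObs <<- totalValidObs + sum(!is.na(values))
  invisible(NULL)
}
```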
Note: Important: the processData method is called once for each chunk of data it is
assigned by the master node. It is therefore likely that the same node will be used many times.
Make sure that you accumulate (add to) the results for each run of the processData method in
the fields of the object, as shown in the previous example.
This method takes a reference to a PEMA object. You use the updateResults method to retrieve the data
in the fields of this PEMA object and add this data to the values represented by the local set of fields in
this node (the master). The PemaMean class uses the updateResults function to accumulate the values in
the sum, totalObs, and totalValidObs fields of each PemaMean object into the corresponding fields on
the master node. Like the processData method, you should not return a value from the updateResults
method.
The following code shows the updateResults method for the PemaMean class.
invisible(NULL)
}
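Only the closing lines of that listing appear above; a fuller sketch of the method, assuming the fields described earlier, is:

```r
updateResults = function(pemaMeanObj) {
  'Adds the results from another PemaMean object to this one'
  sum <<- sum + pemaMeanObj$sum
  totalObs <<- totalObs + pemaMeanObj$totalObs
  totalValidObs <<- totalValidObs + pemaMeanObj$totalValidObs
  invisible(NULL)
}
```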
Note: Important: Don't assume that the updateResults method will always run. In a
single-node environment, or if the master node does not distribute the work, then the
aggregated results should be available in the only running instance of the PEMA object—so there
is no need for the master node to call updateResults.
The PemaMean class implements the getVarsToUse method very simply, as follows:
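A sketch of that method:

```r
getVarsToUse = function() {
  'Returns the name of the variable used in this analysis'
  varName
}
```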
You can then use the meanPemaObj object to perform an analysis and calculate the mean of a variable
in a dataset. To do this, you use the pemaCompute function. This function takes a PEMA object, a
dataset, and any parameters that you defined for the initialize method of the PEMA class.
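First, instantiate the object from the class generator:

```r
# Create a PemaMean object to pass to pemaCompute
meanPemaObj <- PemaMean()
```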
The following example creates a sample data frame containing 1,000 random numbers in a field named x,
and uses this data frame as the dataset for the PemaMean object, which calculates the mean of the
values in field x.
Using the PemaMean object to calculate the mean value of a variable in a dataset
set.seed(12345)
pemaCompute(pemaObj = meanPemaObj,
data = data.frame(x = rnorm(1000)), varName = "x")
The value returned by pemaCompute is the result of the computation, in this case the mean of x.
The fields in the meanPemaObj object remain populated after the pemaCompute function has finished,
so you can examine the contents of any field by using the $ accessor. This code fragment retrieves the
values of the mean field, which should be the same as that returned by the pemaCompute function, and
the totalValidObs field, which contains the number of valid observations in the dataset; this should be
1000:
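A sketch of the fragment:

```r
meanPemaObj$mean           # the same value returned by pemaCompute
meanPemaObj$totalValidObs  # the number of valid observations: 1000
```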
You can run the pemaCompute function over the same object again. By default, the initialize function
will be used to reset the fields in the object. However, if you specify the argument initPema = FALSE, the
initialize function will not be invoked, enabling the analysis to continue using the previously stored values
for the fields.
The following example uses this feature to run a further analysis over a new dataset, but generates a result
based on the aggregation of the data in this dataset and the previous results. If you display the value of
meanPemaObj$totalValidObs after running this statement, it should contain the value 2000.
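A sketch of such a call:

```r
# Run a further analysis, accumulating into the previously stored fields
pemaCompute(pemaObj = meanPemaObj,
            data = data.frame(x = rnorm(1000)),
            varName = "x", initPema = FALSE)

# The observation count now covers both datasets
meanPemaObj$totalValidObs   # should be 2000
```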
You can set the trace level in the initialize method, as highlighted in the following code sample. Note
that the traceLevel field is part of the PemaBaseClass parent class:
In the following code, the processResults method uses the outputTrace method to display the message
"outputting mean" when the traceLevel field is set to a value greater than or equal to 1.
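A sketch of such a processResults method (the outputTrace message and threshold follow the text; the exact outputTrace signature is an assumption):

```r
processResults = function() {
  'Calculates and returns the mean from the accumulated totals'
  # Displayed only when the traceLevel field is 1 or higher
  outputTrace("outputting mean\n", traceLevel = 1)
  if (totalValidObs > 0) {
    mean <<- sum / totalValidObs
  }
  return(mean)
}
```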
Demonstration Steps
Creating a PEMA class generator
1. Open your R development environment of choice (RStudio or Visual Studio).
2. Highlight and run the code under the comment # Connect to R Server. This creates a remote
connection and sets up the environment for RevoPemaR.
3. Highlight and run the code under the comment # Copy the PemaMean object to the R server
environment for testing. This copies the PemaMean object from the local session to the remote
session running on R server.
4. Highlight and run the code under the comment # Create some test data. This code creates a data
frame with a single variable named x. The variable contains 1,000 random values. Note that the
random number generator is seeded with a specific value so that the test is repeatable.
5. Highlight and run the code under the comment # Run the analysis. This code uses the
pemaCompute function to deploy the meanPemaObj object to the distributed environment and
start it running. The parameters specify the dataset, and the variable for which the mean should be
calculated. The value returned is displayed. It should be 0.04619816.
6. Highlight and run the code under the comment # Examine the internal fields of the PemaMean
object. This code retrieves the data stored in the sum, mean, and totalValidObs fields of the object.
Note that the number of valid observations is 1000.
7. Highlight and run the code under the comment # Create some more test data. This code creates
another dataset of 1000 seeded random numbers.
8. Highlight and run the code under the comment # Run the analysis again, but include the previous
results. This repeats the analysis, but sets the initPema argument of the pemaCompute function to
FALSE. This prevents the PEMA framework from invoking the initialize function in the PEMA object,
so the fields are not reset. The result should be -0.006803199.
9. Highlight and run the code under the comment # Examine the internal fields of the PemaMean
object again. This time, note that the number of valid observations is 2000.
You have written a PEMA class that performs a complex analysis in parallel. You
decide to test the class on a cluster with a single compute node. The data is divided
into 50 chunks. How many times does the updateResults method of the PEMA
object run?
50
It varies, depending on how the master node decides to distribute the work,
but it could be anywhere between 1 and 50.
This is a complex question that depends on a number of variables, so you decide to break it down into
smaller elements. Initially, you focus on the phrase "If I fly from A to B by airline C … ?" To help solve this
part of the equation, you want to find the number of times these flights are delayed, and by how long.
You realize that you can perform the processing for this problem using parallel nodes, so you decide to
write a PEMA class to help you.
Objectives
In this lab, you will create a PEMA object that you can use to examine flight delay data.
Lab Setup
Estimated Time: 90 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
The total number of flights made by the selected airline from the first airport to the second.
3. Create a network share for the C:\Data folder. This share should provide read and write access for
Everyone. Name the share \\LON-RSVR\Data.
2. Copy the file FlightDelayData.xdf from the E:\Labfiles\Lab05 folder to the \\LON-RSVR\Data
share.
callSuper(...)
usingMethods(.pemaMethods)
This method should take a single parameter named dataList. This parameter will either contain a
data frame or a list of vectors, depending on whether the PEMA object is run against a data frame or
an XDF object.
Coerce the dataList parameter into a data frame if it is not one already.
If the origin field contains an empty string, the user didn't provide this information as a
parameter when running the object. In this case, populate the origin with the first value found in
the Origin variable in the data frame.
Likewise, populate the dest and airline fields using the first values found in the Dest and
UniqueCarrier variables of the data frame.
Filter the data frame to find all flights that match the combination specified by the values in the
origin, dest, and airline fields.
Count the number of flights matched and add this figure to the totalFlights field.
From this dataset, find all flights with a delay time of more than zero minutes and append these
delay times to the delayTimes vector.
Count the number of delayed flights and add this figure to the totalDelays field.
8. Define the updateResults method.
This method should take a single parameter named pemaFlightDelaysObj. This parameter contains
a reference to another instance of the PEMA object that has been running on another node. In this
method, perform the following tasks:
This method should not take any parameters. In this method, perform the following tasks:
Construct the results list. This list should have three elements named NumberOfFlights,
NumberOfDelays, and DelayTimes. Store the value from the totalFlights field in the
NumberOfFlights element. Store the value from the totalDelays field in the NumberOfDelays
element. Copy the delayTimes vector to the DelayTimes element.
2. Start a remote session on the LON-RSVR VM. When prompted, specify the username admin, and the
password Pa55w.rd.
3. At the REMOTE> prompt, temporarily pause the remote session and return to the local session
running on the LON-DEV VM.
4. Copy the local object, pemaFlightDelaysObj, to the remote session. Use the putLocalObject
function to do this.
5. Resume the remote session.
6. In the remote session, install the dplyr package, and bring the dplyr and RevoPemaR libraries into
scope.
7. Create a data frame containing the first 50,000 observations from the FlightDelayData.xdf file in the
\\LON-RSVR\Data share.
8. Use the pemaCompute function to run an analysis of flight delays using the pemaFlightDelaysObj
object. Specify an origin of "ABE", a destination of "PIT", and the airline "US".
9. Display the results. They should indicate that there were 755 matching flights, of which 188 were
delayed. The delay times for each delayed flight should also appear.
10. Display the contents of the internal fields of the pemaFlightDelaysObj object. They should contain
the following values:
delayTimes: a list of 188 integers showing the delay times for each delayed flight.
totalDelays: 188
totalFlights: 755
origin: ABE
dest: PIT
airline: US
3. Perform an analysis of flights from LAX to JFK made by airline DL using the entire dataset in the
FlightDelayData.xdf file.
4. Save the script as Lab5Script.R in the E:\Labfiles\Lab05 folder, and close your R development
environment.
Results: At the end of this exercise, you will have created and run a PEMA class that finds the number of
times flights that match a specified origin, destination, and airline are delayed—and how long each delay
was.
Question: How many flights were made from LAX to JFK by DL, and how many were
delayed? What was the longest delay?
Question: How could you verify that the results produced by the PemaFlightDelays object
are correct?
Use the rxExec function with the RxLocalParallel compute context to run arbitrary code and
embarrassingly parallel jobs on specified nodes or cores in your compute context.
Use the RevoPemaR package to write customized scalable and distributable analytics.
Module 6
Creating and Evaluating Regression Models
Contents:
Module Overview 6-1
Module Overview
When you have refined your data and can process it effectively, you probably want to start extracting
insights from it. RevoScaleR has an extensive range of modeling tools and algorithms that allow you to
investigate almost any kind of data. The purpose of these models is to help you generate predictions for
future observations given the information held in the current dataset. The algorithms for building these
models can be broadly grouped into two categories:
1. Supervised learning: This kind of algorithm requires that every case in the data has a valid class label
(or value, in the case of regression analysis) for the response variable. The model then iteratively
changes the parameter values to improve the fit of the model to the data. Supervised learning can be
used for predictive modeling and for estimating the effects of the different predictors on the
response variable.
2. Unsupervised learning: Here, there are no class labels and no response variable. The algorithm
attempts to split the data into “natural” groups (dependent on the algorithm used to determine these
groups). This is the only method available to you if you don’t have labeled data. It is also useful as an
exploratory step in your data analysis.
In this module, you will learn about the most common unsupervised and supervised analysis types:
clustering and linear regression. The RevoScaleR package has highly optimized versions of these
algorithms that can be very efficiently deployed on a cluster or server to analyze huge datasets.
Objectives
In this module, you will learn how to:
Use clustering to reduce the size of a big dataset and perform further exploratory analysis.
Fit data to linear and logit regression models, and use these models to make predictions.
Lesson 1
Clustering big data
Clustering is an unsupervised learning algorithm that finds structure in a dataset by placing cases into
groups or “clusters” according to a distance metric based on the set of variables you choose to use.
Lesson Objectives
After completing this lesson, you will be able to:
In clustering, it’s important to understand that there is no concept of accuracy, because there is no
baseline of labeled data against which to judge the result. Instead, you must judge the usefulness of the
outcome in the context of what you are trying to achieve with a cluster analysis.
There are several common reasons to perform a cluster analysis:
To find natural groups in your data. For example, in market segmentation analysis, where you might
want to find out the rough groups of people in a population according to demographic, attitudinal,
and purchasing data. Summary statistics are then calculated for each individual cluster.
To reduce datasets into subsets of similar data. Here you are effectively converting a big data
problem into a set of smaller data problems. You might want to perform clustering to determine the
cluster that you are most interested in, and then run regression models on the data in that cluster.
To reduce the dimensionality of your data. Here you are using the supplied clusters to summarize the
predictor variables. An example of this is color quantization, where you want to reduce the colors in
an image to a fixed number of colors.
For nonrandom sampling. You can use k-means to select k groups and then sample randomly from
each group. Cluster sampling is often used in large-scale survey designs.
Clustering
https://aka.ms/ce58yc
Note: The examples in this lesson use the diamonds dataset from the ggplot2 package.
This is the same package that contains the plot functions used in Module 3: Visualizing Big Data.
The diamonds dataset contains data on nearly 54,000 diamonds, including price, cut, color,
clarity, size, and other attributes.
The following code clusters the diamonds in the dataset into five groups, based on the values of the carat
(weight), depth, and price properties. The intent is that all diamonds that have similar values aggregated
across these properties will be in the same cluster:
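The code block itself does not survive here, but the Call: line in the typical results below shows the exact arguments used; a sketch consistent with it (the object name clust is an assumption):

```r
library(RevoScaleR)  # rxKmeans is part of the RevoScaleR package (Microsoft R)

# Cluster the diamonds into 5 groups on carat, depth, and price.
# The formula is one-sided because there is no response variable.
clust <- rxKmeans(formula = ~ carat + depth + price,
                  data = ggplot2::diamonds,
                  numClusters = 5,
                  seed = 1979)
clust
```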
You specify the model with a one-sided formula because there is no response variable in unsupervised
learning tasks. You specify a value for k with the numClusters argument, and you can set a seed value
that is used for the initial random classification of data (this aids reproducibility; reusing the same
seed will repeat the same initial classification).
Note that the analysis in this example is performed on an in-memory data frame, but you can also use the
rxKmeans function over an XDF file if you have a big dataset.
# Typical results
Call:
rxKmeans(formula = ~carat + depth + price, data = ggplot2::diamonds,
numClusters = 5, seed = 1979)
Data: ggplot2::diamonds
Number of valid observations: 53940
Number of missing observations: 0
Clustering algorithm:
K-means clustering with 5 clusters of sizes 7686, 2731, 26506, 4358, 12659
Cluster means:
carat depth price
1 1.1683737 61.78579 6377.410
2 1.9030172 61.65529 15675.241
3 0.4239082 61.71354 1103.074
4 1.4816843 61.65860 10444.441
5 0.8824015 61.85397 3598.578
Clustering vector:
…
Available components:
[1] "centers" "size" "withinss" "valid.obs" "missing.obs"
[6] "numIterations" "tot.withinss" "totss" "betweenss" "cluster"
[11] "params" "formula" "call"
This gives you information on the number of cases and the means of the variables in each cluster. You can
see it makes intuitive sense, grouping the cases at different levels of carat and price. Note that depth
looks to be far less influential.
The “available components” section shows the different elements you can extract from the model object
using the $ operator. For example, clust$numIterations will tell you how many iterations the algorithm
performed.
Evaluating clusters
Although there is no real measure of accuracy in a cluster
analysis, you can get an idea of the percentage of
variation explained by the model by calculating the ratio
of the “between cluster sum of squares” (betweenss —a
measure of the variation between clusters) and total sum
of squares (totss—the total variation among the variables
in the model):
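Assuming the fitted model object from the earlier rxKmeans call was saved as clust (an assumed name), the ratio can be computed directly from the components of the returned object:

```r
# Proportion of the total variation that is explained by
# differences between the cluster means
clust$betweenss / clust$totss
```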
# Typical results:
0.9562775
In this analysis, you can see that 95.6 percent of the variation in the dataset is explained by the differences
between the cluster means. It seems that the clusters here are very informative!
Standardizing data
Clustering uses a distance metric to determine
membership of clusters. If you have two variables,
one of which is on a scale of 0 to 1 and the other
is on a scale of 0 to 1 million, the variation in the
second variable could swamp that of the first
variable. This means that the clustering algorithm
will be biased to take the variable with the higher
variance more into account—this often occurs
when variables with different units are used.
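The standard deviations shown in the typical results below could be produced with base R's sd function on the in-memory data frame (for chunked XDF data, rxSummary reports standard deviations as well):

```r
# Standard deviation of each of the three clustering variables
sapply(ggplot2::diamonds[, c("carat", "depth", "price")], sd)
```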
# Typical results
carat depth price
0.4740112 1.4326213 3989.4397381
The standard deviation of price measured in dollars is more than 8,000 times that of the weight in carats,
although this difference is not necessarily informative because the variables have different units.
One way to handle this problem is to standardize your variables before running the clustering algorithm.
The usual transformation is known as the z-transformation where you subtract the mean and divide by
the standard deviation. This gives a unitless measure with a mean of 0 and a standard deviation of 1, also
known as a z-score. Clustering on these transformed variables will give a truer reflection of the variation in
the data.
The following example uses a transform with the rxKmeans function to implement this technique:
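A sketch of such a call, consistent with the z-scored variable names in the output that follows (the clustZ object name is an assumption; because this example uses an in-memory data frame, the whole dataset is a single chunk, so computing the mean and standard deviation inside the transform is safe — with a chunked XDF file you would precompute them, for example with rxSummary, and reference the precomputed values):

```r
# Standardize each variable to a z-score in the transforms argument,
# then cluster on the standardized values
clustZ <- rxKmeans(formula = ~ carat_z + depth_z + price_z,
                   data = ggplot2::diamonds,
                   transforms = list(
                     carat_z = (carat - mean(carat)) / sd(carat),
                     depth_z = (depth - mean(depth)) / sd(depth),
                     price_z = (price - mean(price)) / sd(price)),
                   numClusters = 5,
                   seed = 1979)

# Proportion of variation explained by the standardized clustering
clustZ$betweenss / clustZ$totss
```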
# Typical results:
0.7356599
You can see that the percentage variance explained by the model has decreased to 74 percent. This is not
surprising because it reflects the increased relative contribution of the variables on a smaller scale. The
percentage was artificially inflated in the untransformed analysis because the variables were in different
units.
You can also examine how the influence of the different variables has changed in determining the
clusters:
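Assuming the standardized model from the previous example was saved as clustZ (an assumed name), the cluster centroids are held in the centers component:

```r
# Centroids of each cluster on the standardized (z-score) scale
clustZ$centers
```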
# Typical results
carat_z depth_z price_z
1 0.69267532 0.07945899 0.5541154
2 0.05006483 1.54316431 -0.1950409
3 2.07916635 -0.08380051 2.3937402
4 -0.20652536 -1.59163979 -0.2969734
5 -0.74473941 0.07436982 -0.6617408
Notice how there is obvious variation in the transformed depth variable, unlike with the clusters
generated by using the untransformed variables.
At a certain point, adding more clusters will only add a marginal amount of explanatory power to your
model. With standard k-means, it is common to run a series of models over a range of values of k and
look at, or plot, the “between cluster sum of squares” versus “total sums of squares” ratios for each value.
At some point, there will be an elbow on the plot where the increase in explanatory power tails off as you
add more clusters. This is a good indicator of the number of clusters to choose for k. With large data on a
cluster, however, this can be a very costly and time consuming way to find the optimum value of k.
With very large data, it is best to take a representative sample of the full dataset and run multiple models
for different values of k on this. If the sample is representative, you should see the same patterns as for
the full data.
After you have decided on a value of k from your sample, you can also take the centroids from this
analysis and pass them to the centers argument when you run the analysis on your full dataset. This will
greatly speed up your analysis because the algorithm will have a starting point that is likely to be already
very close to the optimum. Without setting the centers argument, the algorithm will choose random
starting points and will waste time getting to a point close to the optimum.
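The elbow search described above can be sketched as follows; this is an illustration only, and the sampleDF data frame (a representative sample of the full data) and the ks range are assumptions:

```r
# Fit a k-means model for a range of values of k and record the
# explained-variance ratio (betweenss / totss) for each
ks <- seq(2, 24, by = 2)
ratios <- sapply(ks, function(k) {
  fit <- rxKmeans(formula = ~ carat + depth + price,
                  data = sampleDF, numClusters = k, seed = 1979)
  fit$betweenss / fit$totss
})

# Plot the ratio against k and look for the elbow where the
# gain in explanatory power tails off
plot(ks, ratios, type = "b",
     xlab = "Number of clusters (k)", ylab = "betweenss / totss")
```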
Demonstration Steps
3. Highlight and run the code under the comment # Cluster the data in the diamonds dataset into 5
partitions by carat, depth, and price. This code runs the rxKmeans function to generate the
clusters.
4. Highlight and run the code under the comment # Examine the cluster. These statements display the
distribution of values in the clusters, and the number of iterations that were performed to generate
the clusters. Note that the values of the depth variables used in each cluster are all very similar
compared to the variation in the carat and price variables. It took 51 iterations of the algorithm to
create these clusters.
5. Highlight and run the code under the comment # Assess the variation between clusters. This code
calculates the ratio of the sums of squares between clusters and the total sums of squares for all
clusters. The result shows that 95.6% (0.9562775) of the differences are accounted for between
clusters, so each cluster appears to be relatively homogenous.
6. Highlight and run the code under the comment # Examine the standard deviations between the
cluster variables. These statements highlight the different scales of each variable, and why the price
and carat variables are more influential than depth: the deviation in these variables is relatively large
compared to the deviation of values in depth.
2. Highlight and run the code under the comment # Examine the cluster sums of squares. This time,
the ratio of the sums of squares between clusters and the total sums of squares is only 73.6%
(0.7356599). There is a bigger variation in each cluster, due to the increased influence of the depth
and carat variables compared to price.
3. Highlight and run the code under the comment # Examine the influence of each variable. The
code shows the centroids for each cluster. This time, depth has a much greater variation than before,
and all variables have the same order of magnitude.
2. Highlight and run the code under the comment # Plot the results. This code creates a data frame of
values of k and the sums of squares ratios, and then plots this data on a line graph. You can see the
decreasing effects of increasing k. When k reaches 17, the line becomes more or less flat. This
suggests that it would not be worth specifying a k value of more than 17. In fact, values from 13
upwards add little variation to the clusters, so 13 might be an optimal value to use.
3. Close your R development environment of choice (RStudio or Visual Studio).
You have used the rxKmeans function to cluster data. The ratio of the "between
cluster sum of squares" and the "total sum of squares" across the clusters is very
high (99.8%). What does this indicate?
Clustering has been ineffective as each cluster contains vastly differing data.
You cannot draw any conclusions about the effectiveness of the clustering
using this measure.
All the data values in the entire dataset are nearly identical.
Lesson 2
Generating regression models and making predictions
Linear regression is perhaps the most commonly used tool in the data scientist’s toolkit. In regression
analysis, you try to fit a straight line to the relationship between a continuous response variable and one
or more continuous predictors. It is part of the more general group of linear models that also includes
ANOVA (continuous response variable and categorical predictors), ANCOVA (continuous response
variable and a combination of categorical and continuous predictors) and MANOVA (multiple response
variables). There are also generalized linear models that apply a transformation to the linear model—for
example, logistic regression for modeling a binary response variable.
Linear regression is a supervised learning technique, in that you need to have labeled response variables
to train it on. After you have run the model on your training data, you can then make predictions from
the model, given a set of predictor variables. A big advantage over more complex machine learning
algorithms, such as decision forests, boosting and neural networks (covered in Module 7), is that the
model is easily interpretable: you get effect levels for the predictor variables that tell you the effect on the
response for each increase in that predictor. The more complicated algorithms often operate as black
boxes—these produce good predictions but are less easily interpretable.
The problem with linear regression lies in its assumptions: that the relationship between the response and
the predictors is linear, that the errors are independent and normally distributed, and that the variance of
the errors is constant.
Often, you might begin your analysis with linear regression, and then move on to something more
complex if it does not fit your data well.
Lesson Objectives
After completing this lesson, you will be able to:
Explain how linear regression works.
The following example uses the rxLinMod function to determine the effect of diamond weight (carat) on
the price. This model helps to determine whether there is a linear relationship between the two variables:
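A call consistent with the Call: line shown in the results below (the linMod object name is an assumption; covCoef = TRUE is included because the text refers to it):

```r
library(RevoScaleR)  # rxLinMod is part of the RevoScaleR package (Microsoft R)

# Regress price on carat; covCoef = TRUE also computes the
# covariance matrix of the coefficients
linMod <- rxLinMod(formula = price ~ carat,
                   data = ggplot2::diamonds,
                   covCoef = TRUE)
linMod
```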
Note that the response variable is on the left-hand side of the formula and the predictor variables are on
the right-hand side. The covCoef = TRUE argument ensures that the covariance matrix for the
coefficients is calculated as part of the regression. You can examine the results directly:
# Typical results
Call:
rxLinMod(formula = price ~ carat, data = ggplot2::diamonds)
Coefficients:
price
(Intercept) -2256.361
carat 7756.426
This output should be familiar if you have used the lm function in base R. You can see that the effect
estimate for the carat predictor variable is 7756.43. This is the effect on price for an increase of 1 carat, so a single
carat diamond would be predicted to cost (-2256.36 + 7756.43) = $5500.07. The intercept reflects the
theoretical cost of a zero carat diamond.
Note: This example shows the possible dangers of blindly over-interpreting the results.
According to this model, diamonds become worthless at 0.29 carats, and you would have to pay
someone $2,256.36 to take a zero carat diamond away. Clearly the model is nonsense at this
extreme but, as the weight of diamonds increases, it can become more realistic.
The R-squared value shows that the carat variable accounts for approximately 85 percent of the variation
in price.
You can use the F() function within the formula to convert a continuous variable to a factor with integer
levels as shown below:
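A sketch of such a call, consistent with the coefficients table that follows (the linModF object name is an assumption; dropFirst = TRUE and covCoef = TRUE are the arguments the text describes):

```r
# F(carat) converts the continuous carat variable into a factor with
# one level per integer value; dropFirst = TRUE uses the first level
# as the baseline, matching the behavior of lm in base R
linModF <- rxLinMod(formula = price ~ F(carat),
                    data = ggplot2::diamonds,
                    dropFirst = TRUE, covCoef = TRUE)
linModF
```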
It’s important to note that rxLinMod operates somewhat differently to the lm function in base R when
dealing with factors. By default, rxLinMod uses the last factor level as the baseline for comparisons while
lm uses the first. The example specifies the arguments dropFirst = TRUE and covCoef = TRUE. These
argument values will return a similar output to an lm fit.
Coefficients:
price
(Intercept) 1632.641
F_carat=0 Dropped
F_carat=1 5655.627
F_carat=2 13214.308
F_carat=3 12676.065
F_carat=4 14825.359
F_carat=5 16385.359
The coefficients table shows the estimated price for each factorized value of the weight. In this output,
you can see that, according to this model, a 1-carat diamond should cost $5,655.63, a 2-carat diamond
should cost $13,214.31, and so on. Again, you should beware of extrapolating these results to their
extremes.
Transforming variables
You can transform model variables within the rxLinMod function in the same way as in many of the other
rx* functions, by using the transforms argument.
The next example uses the log of the weight as the predictor variable:
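A sketch of this model, assuming the transformed variable is named logCarat (an assumed name):

```r
# The transforms argument creates the log-transformed predictor
# chunk by chunk as the data is read
logModel <- rxLinMod(formula = price ~ logCarat,
                     data = ggplot2::diamonds,
                     transforms = list(logCarat = log(carat)))
logModel
```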
The advantage of using transforms rather than transforming the variable directly is that the
transformation is performed on chunks of data, so it is more efficient on large datasets.
Note: The rxLinMod function is not restricted to single predictor variables. In the same
way as with lm, you can include multiple variables, interactions between variables, and nested
terms.
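The detailed output below, with standard errors and p-values, is what the summary method prints; assuming the first model object was saved as linMod (an assumed name):

```r
# Print coefficient estimates together with standard errors,
# t values, p-values, and the R-squared statistic
summary(linMod)
```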
# Typical results
Call:
rxLinMod(formula = price ~ carat, data = ggplot2::diamonds)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2256.36 13.06 -172.8 2.22e-16 ***
carat 7756.43 14.07 551.4 2.22e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now you can see not only the parameter estimates, but also the standard errors and p-values for the
coefficients, residual standard error, and R-squared statistics. Note that, with very large datasets, you are
likely to get significant p-values, so it is sensible to pay more attention to the effect levels and standard
error—and to run some visualizations to view the results of your model.
The RevoScaleR package provides several other functions that you might find useful for examining your
variables and your rxLinMod models. These functions can all be run on chunked data on a cluster or
server, and can transform data in the same way as the rxLinMod function:
rxSSCP: calculates the sum of squares or cross-product matrix for a set of variables.
rxCovCoef: returns the covariance matrix for the regression coefficients in a model object.
rxCorCoef: calculates the correlation matrix for the regression coefficients in a model object.
rxCovData: calculates the covariance matrix for the predictor variables in a model object.
rxCorData: calculates the correlation matrix for the predictor variables in a model object.
Next, generate predictions using the rxPredict function. This function produces a result set containing
row-by-row predictions for the response variable. You can include the original variables from the dataset
by specifying writeModelVars = TRUE as an argument:
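A sketch of the prediction step, assuming the model object is named linMod and the result is stored in predictions (both names are assumptions):

```r
# Generate row-by-row predictions; writeModelVars = TRUE copies the
# original model variables into the output alongside price_Pred
predictions <- rxPredict(modelObject = linMod,
                         data = ggplot2::diamonds,
                         writeModelVars = TRUE)
head(predictions, 50)
```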
The predictions dataset contains a price_Pred variable containing the predicted value for each price. You
can assess these predictions against the real prices.
# Typical results
price_Pred price carat
1 -239.689919 337 0.26
2 -472.382688 338 0.23
3 -472.382688 353 0.23
4 148.131362 402 0.31
5 225.695618 403 0.32
6 225.695618 403 0.32
7 225.695618 403 0.32
8 -6.997151 404 0.29
9 70.567105 554 0.30
10 70.567105 554 0.30
…
45 3560.958633 2806 0.75
46 3948.779914 2808 0.80
47 70.567105 554 0.30
48 70.567105 554 0.30
49 70.567105 554 0.30
50 70.567105 554 0.30
In this example, there is clearly some discrepancy between the real and predicted prices. In such cases,
you might find it useful to calculate confidence intervals around your predictions:
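A sketch using the computeStdErr and interval arguments (these are the same arguments the lab at the end of this module asks you to set; the predictions_se name matches the head call below, and the model object name linMod is an assumption; computing standard errors requires a model fitted with covCoef = TRUE):

```r
# computeStdErr = TRUE returns the standard error of each prediction;
# interval = "confidence" adds lower and upper confidence bounds
predictions_se <- rxPredict(modelObject = linMod,
                            data = ggplot2::diamonds,
                            computeStdErr = TRUE,
                            interval = "confidence",
                            writeModelVars = TRUE)
```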
head(predictions_se, 50)
# Typical results
price_Pred price_StdErr price_Lower price_Upper price carat
1 -239.689919 10.085469 -259.45752 -219.92232 337 0.26
2 -472.382688 10.405828 -492.77819 -451.98718 338 0.23
3 -472.382688 10.405828 -492.77819 -451.98718 353 0.23
4 148.131362 9.569076 129.37590 166.88683 402 0.31
5 225.695618 9.468687 207.13692 244.25432 403 0.32
6 225.695618 9.468687 207.13692 244.25432 403 0.32
7 225.695618 9.468687 207.13692 244.25432 403 0.32
8 -6.997151 9.772834 -26.15198 12.15768 404 0.29
9 70.567105 9.670468 51.61291 89.52130 554 0.30
10 70.567105 9.670468 51.61291 89.52130 554 0.30
…
45 3560.958633 6.701669 3547.82331 3574.09396 2806 0.75
46 3948.779914 6.667718 3935.71113 3961.84869 2808 0.80
47 70.567105 9.670468 51.61291 89.52130 554 0.30
48 70.567105 9.670468 51.61291 89.52130 554 0.30
49 70.567105 9.670468 51.61291 89.52130 554 0.30
50 70.567105 9.670468 51.61291 89.52130 554 0.30
Demonstration Steps
Fit a linear model
1. Open your R development environment of choice (RStudio or Visual Studio).
3. Highlight and run the code under the comment # Fit a linear model showing how price of a
diamond varies with weight. This code runs the rxLinMod function to generate a linear regression
model on the diamond data, using the weight of diamonds to derive their prices.
5. Highlight and run the code under the comment # Examine the results. These statements display
information about the model created by the rxLinMod function. The coefficients suggest that each
carat in weight adds $7,756.43 to the price.
6. Highlight and run the code under the comment # Use a categorical predictor variable. This code
performs another regression using discrete values for the weight. The results show the prices for each
weight. It is clear from these results that the relationship between price and weight is not particularly
linear.
2. Highlight and run the code under the comment # Make price predictions against this dataset. This
code uses the rxPredict function to run predictions using the linear model against the sample
dataset.
3. Highlight and run the code under the comment # View the predictions to compare the predicted
prices against the real prices. This code displays the first 50 rows of the results. Compare the
price_Pred values against the price variable. The discrepancy shows how the linear model is not
particularly accurate for this data. You should note that the lower the weight, the greater the
discrepancy.
4. Highlight and run the code under the comment # Calculate confidence intervals around each
prediction. This code calculates and displays the upper and lower range for each prediction, and the
degree of confidence in terms of the standard deviation.
o The data plot is very funnel shaped, indicating a wide variation in price as the weight increases.
This indicates that price is not solely dependent on weight, and that other factors might be
important.
o The plot tapers up away from the origin for small values of the weight. This makes sense, because
even very small diamonds have some value.
2. Close the script Demo2 - modelling.R, but leave your R development environment of choice open
for the next demonstration.
The following example creates a logit model to determine whether a person is likely to default on a
mortgage loan, based on a dataset that includes the credit score, number of years in employment, and
amount of credit card debt. You can see that the parameters are much the same as for linear models.
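A sketch of the first model, using the same object and variable names that appear in the prediction code later in this topic (logitOut1, mortXdf, and blocksPerRead = 5):

```r
library(RevoScaleR)  # rxLogit is part of the RevoScaleR package (Microsoft R)

# Logistic regression: default is a binary response predicted from
# credit score, years in employment, and credit card debt
logitOut1 <- rxLogit(default ~ creditScore + yearsEmploy + ccDebt,
                     data = mortXdf, blocksPerRead = 5)
summary(logitOut1)
```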
A ROC curve plots the true positive rate (the number of correctly predicted TRUE responses divided by the
actual number of TRUE responses) against the false positive rate (the number of incorrectly predicted
TRUE responses divided by the actual number of FALSE responses), at various thresholds. The area under
the ROC curve (AUC) indicates the predictive power of the model. Because the AUC
is scaled to 1, an AUC of 1 represents perfect prediction, an AUC of 0 represents predictions that are
always wrong, and an AUC of 0.5 is what is expected from random guessing.
The following example shows how to investigate what difference credit card debt makes to predictions of
mortgage defaults. First, you run the predictions for the first model (including credit card debt data):
Using the logit model to make predictions with an initial set of variables
predFile <- "mortPred.xdf"
predOutXdf <- rxPredict(modelObject = logitOut1, data = mortXdf,
                        writeModelVars = TRUE, predVarNames = "Model1",
                        outData = predFile)
You can then build a second model with the ccDebt variable removed, and add the predictions for this
model to the original predictions data file:
Using the logit model to make predictions with an amended set of variables
logitOut2 <- rxLogit(default ~ creditScore + yearsEmploy,
                     data = predOutXdf, blocksPerRead = 5)
predOutXdf <- rxPredict(modelObject = logitOut2, data = predOutXdf,
                        predVarNames = "Model2")
Finally, you can run the rxRoc function to generate the ROC curves for both models.
You can visualize the ROC curves with the plot function:
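A sketch of both steps, assuming the actual outcomes are in the default variable of the predictions file and the result object is named rocOut (an assumed name):

```r
# Compute true positive and false positive rates at a range of
# thresholds for both sets of predictions
rocOut <- rxRoc(actualVarName = "default",
                predVarNames = c("Model1", "Model2"),
                data = predOutXdf)

# Plot both ROC curves on one chart
plot(rocOut)
```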
You can see that the area under the ROC curve for the first model (including credit card debt) is far
greater than the second model (with credit card debt removed). In fact, the predictive power of the
second model is barely greater than that of random guessing (this is the faint white diagonal line). This
shows that credit card debt is an important predictor of mortgage defaults.
For more information on logistic regressions in the RevoScaleR package, see:
Demonstration Steps
2. Highlight and run the code under the comment # Create a logit regression model. This code runs
the rxLogit function to generate a logistic model on the mortgage data, using a customer's credit
score, number of years in employment, and amount of credit card debt, to assess whether the
customer is likely to default on their mortgage.
3. Highlight and run the code under the comment # Generate a model and predictions that exclude
credit card debt. This statement creates another logit model that excludes credit card debt and
runs predictions using this model. The results are appended to the previous results.
4. Highlight and run the code under the comment # Display the results. This statement shows the first
50 predictions again. Examine the Model1 and Model2 columns. The values in these columns show
the actual likelihoods of default with and without taking credit card debt into account.
2. Highlight and run the code under the comment # Visualize the ROC curve. The graph shows the
true positive rate versus the false positive rate for both models. Note that the area under the curve for
the first model is far greater than that of the second model, indicating that it has much more
predictive power. The faint white line shows how making random guesses would fare (a straight
diagonal line from the origin), and the second model is actually not much better than this. Therefore,
it would seem that taking credit card debt into account is an important factor in this model.
3. Close your R development environment of choice.
2. You can specify a distribution for the response variable and the errors. This is referred to as the
“family”.
The rxGlm function in the RevoScaleR package provides the ability to estimate generalized linear models
on large datasets.
All the link/family combinations available to glm in base R are also available to rxGlm, but the following
combinations have been optimized for high performance on large datasets:
binomial/logit: this is the logistic regression for binary classification and has been covered in the
previous topic.
gamma/log: use this combination when:
o Data is non-negative.
o Data has a positive skew.
Tweedie: use this to produce a generalized linear model family object with any power variance
function, and any power link.
The rxGlm function works in the same way as rxLinMod and rxLogit. In fact, rxLinMod and rxLogit are
just convenience functions wrapped around rxGlm. You specify the glm type using the family argument.
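As an illustration, a Gamma family with a log link suits a non-negative, positively skewed response such as diamond price; a sketch (the gammaModel name is an assumption):

```r
# Gamma family with a log link: suitable for a non-negative,
# positively skewed response variable
gammaModel <- rxGlm(formula = price ~ carat,
                    family = Gamma(link = "log"),
                    data = ggplot2::diamonds)
summary(gammaModel)
```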
For more information on GLMs, see:
For more information on the use of cubes for performing regression, see:
https://aka.ms/qaxttm
Verify the correctness of the statement by placing a mark in the column to the right.
Statement Answer
You have generated a linear model using the rxLinMod function over a
large dataset. You have tested the model by making predictions using this
model and comparing them to a set of known results. You have plotted a
ROC curve comparing the known results to the predicted
values. The ROC curve shows a straight diagonal line from the point (0,0)
to the point (1,1). This indicates that the model is making very accurate
predictions. True or False?
Objectives
In this lab, you will:
Fit another linear model against the entire dataset and compare predictions.
Lab Setup
Estimated Time: 90 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
2. Copy the FlightDelayData.xdf file from the E:\Labfiles\Lab06 folder to the \\LON-RSVR\Data shared
folder.
Task 2: Examine the relationship between flight delays and departure times
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
2. Create a remote session on the LON-RSVR server. This is another VM running R Server. Use the
following parameters to the remoteLogin function:
deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800
session: TRUE
diff: TRUE
commandLine: TRUE
username: admin
password: Pa55w.rd
3. Create a test file containing a random 10 percent sample of the flight delay data. Save this sample in
the file \\LON-RSVR\Data\flightDelaySample.xdf.
4. Create a scatter plot that shows the flight departure time on the X-axis and the delay time on the Y-
axis. Use the local departure time in the DepTime variable, and not the departure time recorded as
UTC. Add a regression line to the plot to help you spot any trends.
5. Create a histogram that shows the number of flights that depart during each hour of the day. Note
that you will have to factorize the departure time to do this; create a factor for each hour.
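As a rough sketch of how these two plots might be produced with ScaleR functions (the object name is hypothetical, and the assumption that DepTime is expressed in decimal hours may not match the lab data):

```r
# Sketch only: names and the DepTime encoding are assumptions, not the lab solution
sampleFile <- "\\\\LON-RSVR\\Data\\flightDelaySample.xdf"

# Scatter plot of delay against local departure time, with a regression line
rxLinePlot(Delay ~ DepTime, data = sampleFile, type = c("p", "r"),
           xTitle = "Departure time", yTitle = "Delay (minutes)")

# Histogram of flights per hour; the departure time is factorized into hours
rxHistogram(~depHour, data = sampleFile,
            transforms = list(depHour = factor(trunc(DepTime))))
```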
Results: At the end of this exercise, you will have determined the optimal number of clusters to create,
and built the appropriate cluster model.
Question: What do the graphs you created in this exercise tell you about flights made from
6:01 PM onwards?
2. Calculate the ratio of the between-clusters sums of squares and the total sums of squares for this
model. How much of the difference between values is accounted for between the clusters?
Examine the cluster centers to see how the clusters have partitioned the data values.
3. You don't yet know whether this is the best cluster model to use. Generate models with 2, 4, 6, 8, 10,
12, 14, 16, 18, 20, 22, and 24 clusters.
Register the RevoScaleR parallel back end with the foreach package (run the registerDoRSR
function).
Use the %dopar% operator with a foreach loop that creates different instances of the model in
parallel.
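A sketch of this parallel model generation (the formula, data source, and object names are assumptions, not the lab solution):

```r
# Sketch: generate one rxKmeans model per cluster count, in parallel
library(doRSR)
registerDoRSR()   # register the RevoScaleR back end for foreach

numClusters <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24)
kModels <- foreach(k = numClusters) %dopar%
    rxKmeans(formula = ~ DepTime + Delay, data = flightDataFile,
             numClusters = k)
```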
Note: At the time of writing, there was still some instability in R Server running on
Windows. Placing it under a high parallel load can cause it to close the remote session. If this
happens, resume the remote session and switch back to the RxLocalSeq compute context.
4. Calculate the ratio of the between clusters sums of squares and the total sums of squares for each
model.
5. Generate a scatter plot that shows the number of clusters on the X-axis and the sums of squares ratio
on the Y-axis. Which value for the number of clusters does this graph suggest you should use?
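Assuming a list of rxKmeans models like the one built in the previous step (the object names here are hypothetical), the ratios and the plot might be produced as:

```r
# Sketch: between-clusters SS / total SS for each model, plotted against k
ssRatios <- sapply(kModels, function(m) m$betweenss / m$totss)
plot(numClusters, ssRatios, type = "b",
     xlab = "Number of clusters", ylab = "Between SS / total SS")
```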
2. Run the rxPredict function to predict the delays for the test dataset. Record the standard error and
confidence level for each prediction (set the computeStdErr argument to TRUE, and set the interval
argument to "confidence").
3. Examine the first few predictions made. Compare the Delay and Pred_Delay values. Pay attention to
the confidence level of each prediction.
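These two steps might look like this (the model and dataset names are assumptions):

```r
# Sketch: predictions with standard errors and confidence intervals
predictions <- rxPredict(linModel, data = testData,
                         computeStdErr = TRUE, interval = "confidence",
                         writeModelVars = TRUE)
head(predictions)   # compare the actual and predicted delays, and the intervals
```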
Note: The graph showing the actual departure times and delays is based on a much bigger
dataset. To get a fair comparison between the two graphs, regenerate the earlier graph showing
the data for the entire dataset and set the alpha level of the points to 1/50. Both graphs should
look very similar.
2. Create a scatter plot that shows the difference between the actual and predicted delays for each
observation. Again, what do you notice about this graph?
Results: At the end of this exercise, you will have created a linear regression model using the clustered
data, and tested predictions made by this model.
Question: What conclusions can you draw about the predictions made by the linear model
using the clustered data?
2. Make predictions about the delay times. Use the same test dataset that you used to make predictions
for the cluster model.
2. Create a scatter plot that shows the difference between the actual and predicted delays for each
observation. How does this graph compare to the one based on the results of the previous regression
model?
3. Save the script as Lab6Script.R in the E:\Labfiles\Lab06 folder, and close your R development
environment.
Results: At the end of this exercise, you will have created a linear regression model using the entire flight
delay dataset, and tested predictions made by this model.
Question: This lab analyzes the flight delay data to try to predict the answer to the
question, "How long will my flight be delayed if it leaves at ‘N’ o'clock?" The linear regression
analysis shows that, although it is nearly impossible to answer this question accurately for a
specific flight (the departure time is clearly not the only predictor variable involved in
determining delays), it is possible to generalize across all flights. What might be a better
question to ask about flight delays, and how could you model this to determine a possible
answer?
Use clustering to reduce the size of a big dataset and perform further exploratory analysis.
Fit data to linear and logit regression models, and use these models to make predictions.
Module 7
Creating and Evaluating Partitioning Models
Contents:
Module Overview 7-1
Module Overview
Partitioning models, or tree-based models, are a type of supervised learning algorithm that can be used
for either regression or classification. Partitioning models with a response variable that is a factor are
known as classification trees; partitioning models with a continuous response variable are known as
regression trees.
2. All your predictor variables are sorted and examined to find a split (partition) that best separates out
the classes. In a binary classification tree, each split produces two nodes.
4. If the algorithm determines that a node should not be partitioned further, this becomes a terminal
node and all the observations in it are classified as belonging to the same group.
Despite their simplicity, partitioning models have some important advantages over linear model-based
approaches that make them preferable in some situations:
A tree model is easily interpretable—you just need to look at the variable splits for each node.
They don’t rely on a predefined scheme—instead, they use induction to generalize from the data to
build a knowledge model. This makes them more suited than linear models to predicting the results
from stochastic or semi-stochastic processes, such as stock market prices. They aim to learn what the
output of a process might be, given a set of inputs—rather than extrapolating results from a derived
set of mathematical equations.
They are nonparametric and do not require assumptions of statistical normality or linearity of the
data. The variables do not require statistical transformations to be used effectively. Also, different data
types can be easily combined in the same model.
Variables can be reused in different parts of the tree. This means that complex interactions between
predictor variables can be learned from the data—they do not need to be prespecified as they do in
linear models.
Tree models are robust to outliers, which can be isolated in their own node so they do not affect the
remainder of the analysis.
However, partitioning models are not suited to every situation and have a number of disadvantages:
If an individual tree is too small (if it has too few partitions), it can have a low prediction accuracy.
If a tree is too large and complex, it might predict well on the training set but be prone to
overfitting—and so lose generality.
Because they rely on sorting the data, which is a time-consuming process, they can be problematic
with very large datasets.
Objectives
In this module, you will learn how to:
Use the three main partitioning models in the ScaleR™ package, and tune these models to
reduce bias and variance.
Use the MicrosoftML package to apply advanced machine learning algorithms that create predictive
models.
Lesson 1
Creating partitioning models based on decision trees
The ScaleR package includes several functions for building and examining partitioning models. This lesson
describes how to use these functions, and how to tune your models to improve accuracy and speed.
Lesson Objectives
After completing this lesson, you will be able to:
Create a partitioning model using decision trees, decision forests, and gradient boosted decision
trees.
Convert a decision tree model built using the ScaleR functions to an rpart model.
When you build a partitioning model, you need to strike a balance between minimizing complexity and
maximizing predictive power. You can do this by specifying the maximum number of bins in the
histograms:
Using a larger number of bins enables a more accurate description of the data and reduces bias, at
the expense of the increased likelihood of overfitting.
Using fewer bins reduces time complexity and memory usage, at the expense of predictive power.
Ensemble algorithms
The ScaleR package provides two more tree-based
algorithms that mean you can take advantage of
the benefits of tree models while reducing the
downsides associated with overfitting and
overgeneralizing. They are both ensemble methods,
which means that they effectively generate many
individual tree models and combine the results.
However, the two methods take very different
approaches to combining individual trees.
The individual trees in a decision forest have many nodes, a low bias, a high chance of overfitting, and
there is a large variance between trees. The trees each overfit the data in a different way and the bagging
process reduces error by decreasing the variance, effectively averaging out these differences.
Because the trees in a decision forest are bootstrap replicates that are not dependent on each other, the
tree growing phase of the algorithm becomes an embarrassingly parallel problem. However, individual
trees take a long time to process because they are large, although the histogram-based sorting method
can run across nodes.
Individual trees are added sequentially, so boosting is not an embarrassingly parallel problem. However,
the time to run each individual tree model is generally less because the trees are shallower.
Gradient boosted trees have more hyperparameters (parameters controlling the fit of the model) that
need tuning than forest models—and are also more likely to overfit the training data. However, with
careful tuning, they can give better predictions.
To prepare your data for analysis, use the rxDataStep function to:
1. Randomly assign the data to a training and test set. In this example, approximately 5 percent of the
data will be assigned to the test set and the remainder will be used to train the models.
2. Create a binary factor named value, indicating whether a diamond is high value (>=$4,000) or low
value (<$4,000).
3. Retain the columns required by the analysis (cut, clarity, carat, color), and drop the others.
You can then split the data into the test and training subsets, based on the set variable.
This example creates two data frames, because the dataset is relatively small; for a large dataset, you
should generate XDF files.
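A sketch of this preparation step (the column names follow the familiar diamonds dataset, but the exact transforms and object names are assumptions):

```r
# Sketch: label each diamond as high/low value, assign roughly 95% of the rows
# to training, and keep only the variables needed by the analysis
diamondData <- rxDataStep(inData = diamonds,
    transforms = list(
        value = factor(ifelse(price >= 4000, "high", "low")),
        set   = factor(ifelse(runif(.rxNumRows) <= 0.95, "train", "test"))),
    varsToKeep = c("price", "cut", "clarity", "carat", "color"))

# Split into a list of two data frames, one per level of the set factor
diamondDataList <- rxSplit(diamondData, splitByFactor = "set")
```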
You can see exactly how many diamonds are in each set as follows:
# Typical output
# $diamondData.set.test
# [1] 2678
#
# $diamondData.set.train
# [1] 51262
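Assuming the split produced a list of data frames, the counts above can be obtained with a call such as:

```r
lapply(diamondDataList, nrow)
```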
The formula input to the model is the same as for a standard linear model in R. The maxDepth argument
determines how deep you allow the tree to grow.
This example limits maxDepth to 4, which means that the tree can grow to a maximum depth of four
levels of splits. Note that the rxDTree function automatically fits a classification tree because the
dependent variable, value, is a factor. If the dependent variable were continuous, the rxDTree function
would fit a regression tree model instead.
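Matching the call echoed in the typical results that follow, the fit can be produced as:

```r
dTreeModel <- rxDTree(formula = value ~ cut + carat + color + clarity,
                      data = diamondDataList$diamondData.set.train,
                      maxDepth = 4)
```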
You can view the results of the rxDTree function like this:
# Typical results
# Call:
# rxDTree(formula = value ~ cut + carat + color + clarity, data =
diamondDataList$diamondData.set.train,
# maxDepth = 4)
# Data: diamondDataList$diamondData.set.train
# Number of valid observations: 51262
# Number of missing observations: 0
# Tree representation:
# n= 51262
The first few lines show the details of the model formula and data. You can then see the representation of
the tree itself and the way in which the branches correspond to the various decision points:
Diamonds of more than 0.97 carats are high value, except for the lowest clarity diamonds—these
need to be greater than 1.32 carats to be high value.
Diamonds between 0.895 and 0.97 carats are high value if they are not low clarity or low color
quality—otherwise they are low value.
This example sets the maxDepth to the same value as for the decision tree model in the previous
example. The mTry argument is the most important hyperparameter in a forest model; it determines the
number of variables you want to consider for each split.
The default is the square root of the number of predictor variables. The nTree argument specifies the number of
trees in the forest. The importance argument determines if importance values for the predictor variables
should be calculated. These are useful for visualization—you will learn about this in the next lesson.
Note that building a decision forest model consumes significantly more resources than a decision tree
because it is effectively creating many trees behind the scenes.
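The call echoed in the results below corresponds to a fit such as:

```r
dForestModel <- rxDForest(formula = value ~ cut + carat + color + clarity,
                          data = diamondDataList$diamondData.set.train,
                          maxDepth = 4, nTree = 50, mTry = 2,
                          importance = TRUE)
```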
# Typical results
# Call:
# rxDForest(formula = value ~ cut + carat + color + clarity, data =
diamondDataList$diamondData.set.train,
# maxDepth = 4, nTree = 50, mTry = 2, importance = TRUE)
You can see the out-of-bag (OOB) error rate is 3.62 percent. This means that 3.62 percent of cases were
wrongly classified in the training data. The confusion matrix below this figure shows where the errors were
made.
You specify the lossFunction according to the response variable. For a binary factor response variable,
you should use “bernoulli”. The learningRate (or “shrinkage”) is a variable that helps reduce overfitting.
Lower values are less likely to overfit but require more trees in the model to provide a good prediction.
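The call echoed in the results below corresponds to a fit such as:

```r
bTreesModel <- rxBTrees(formula = value ~ cut + carat + color + clarity,
                        data = diamondDataList$diamondData.set.train,
                        maxDepth = 3, nTree = 50, mTry = 2,
                        lossFunction = "bernoulli", learningRate = 0.4)
```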
# Typical results:
# Call:
# rxBTrees(formula = value ~ cut + carat + color + clarity, data =
diamondDataList$diamondData.set.train,
# maxDepth = 3, nTree = 50, mTry = 2, lossFunction = "bernoulli",
# learningRate = 0.4)
For these parameters, the model has an error rate of more than 11.5 percent.
2. Reduce overfitting
3. Reduce computation time
xVal: this controls the number of folds used to perform cross-validation. The default is two folds.
maxDepth: this sets the maximum depth of any node of the tree. This is the most intuitive way to
control the size of your trees. Computation becomes more expensive very quickly as the depth
increases. A deeper tree will have a reduced bias but the variance between trees will be greater.
Deeper trees are also more likely to overfit the training data.
maxNumBins: this controls the maximum number of bins used for sorting each variable. The default
is to use whatever is the larger—101 or the square root of the number of observations. For very large
datasets, you might need to set this higher.
cp: this is a complexity parameter and sets a limit for how much a split must reduce the complexity
before being accepted. You can use this to control the tree size instead of maxDepth.
minSplit: this determines how many observations must be in a node before a split is attempted.
minBucket: this determines how many observations must remain in a terminal node.
In practice, you can often leave a lot of these parameters at their default values. For relatively small
datasets, you could visualize the scree plot of the complexity change with tree size, and then prune the
tree to the point at the elbow of the plot.
The following example illustrates this technique using the plotcp function (this function plots the
complexity parameter table for a model fit). The plot shows the line flattening at a cp value of
approximately 0.0025, so the tree is pruned at this point. After this point, the splits have little importance
because they do not add much to the model, and retaining them just makes the model more complex for
little reward:
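A sketch of this pruning workflow (rxAddInheritance makes the rxDTree object usable with rpart functions; the cp value here is read off the plot and is only illustrative):

```r
# Plot the complexity parameter table; look for the elbow in the curve
library(rpart)
plotcp(rxAddInheritance(dTreeModel))

# Prune the tree at the cp value where the plot flattens out
prunedModel <- prune.rxDTree(dTreeModel, cp = 0.0025)
```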
mTry: this is the main hyperparameter for forest models and controls how many variables contribute
to each split in a tree.
nTree: this controls the number of trees in the forest. Theoretically, the higher the better, but the
improvements reduce after a certain number and it’s computationally costly to run more trees.
learningRate: this is important for controlling overfitting in boosting models. A value around 0.1 will
do this but more trees will be required to predict successfully.
Weighting: If your dataset is unbalanced (that is, one response level is much more common than
another) you can set higher weights to the less common level to “oversample” that level, relative to
the others. You can do this by either creating a weight variable and setting the pweights argument
to the name of the weighting variable, or passing a loss matrix to the parms argument. For more
information, see the documentation page in R.
You can also use the pmml function in the pmml package to convert models into a sharable, XML-based
PMML (Predictive Model Markup Language) format.
Demonstration Steps
3. Highlight and run the code under the comment # Examine the diamond data. This code displays
the first 20 rows in the diamonds dataset. Notice that each diamond has variables that include the
number of carats, the cut, the color, and the clarity, together with other attributes that concern the
geometry of a diamond.
4. Highlight and run the code under the comment # Generate a dataset containing just the columns
required. This code uses the rxDataStep function to:
Add a factor variable named value with the values high and low. Diamonds are categorized
according to their price.
Add a factor variable named set with the values train and test. Note that 95 percent of the data
is selected at random and placed into the train category; the remainder is placed in the test
category.
Remove variables not required by the model, leaving only cut, clarity, carat, and color.
5. Highlight and run the code under the comment # Divide the dataset into training and test data.
This code uses the rxSplit function to separate out the observations in the dataset according to the
value of the set category. The result is two datasets named Diamonds.set.test.xdf and
Diamonds.set.train.xdf.
2. Highlight and run the code under the comment # Show the results. This code displays the model fit.
The top level node summarizes the split into high and low value diamonds. The subsequent nodes
show how the decision to classify diamonds was made, based on the other variables.
3. Highlight and run the code under the comment # For comparision, fit a DTreeForest model. This
code uses the rxDForest function to fit the model to the training data. The forest generates 50
decision trees and takes significantly longer to run in consequence. The results display how the data
in the value field of each diamond compares to that generated by the model.
4. Highlight and run the code under the comment # … and a BTree model. This code uses the
rxBTrees function to fit the model to the training data. Again, this model generates 50 trees and
takes a while to run. The results don't include specific details of the categorization, but rather display
the error rate for the recorded value of a diamond compared to its assessed value, in terms of the
deviance.
2. Highlight and run the code under the comment # Prune the tree to remove unnecessary
complexity from the model. This code removes the least important splits from the tree, based on
the value selected from the scree plot.
3. Leave your R development environment open.
When might you consider constructing a partitioning model rather than a linear model, to make
predictions?
If the relationship between the predictor variables and the dependent variable is non-linear.
Lesson 2
Evaluating models
Once you have trained your model and tuned the parameters to optimize the fit, you will want to run it
on your test dataset. This will give a truer picture of the predictive power of the model and whether it is
overfitting. You want to have a model that is general enough to perform well on your test set, but has
enough predictive power to be useful.
Lesson Objectives
In this lesson, you will learn how to:
Running predictions
Running predictions from partitioning models
works in an almost identical way to running
predictions from linear models. For more details,
see Module 6: Creating and Evaluating Regression
Models. You:
1. Construct a data frame containing the
different combinations of predictor variables
you want to predict the response for. You will
often get this from the test dataset. Note that
this data must include all the predictor
variables in the original model. Use the
rxDataStep function to select the variables in
a large dataset.
2. Use the function rxPredict to generate your predictions. This function takes the dataset you want to
predict for and the original model object. Note that predictions on rxDForest models can take
considerably longer than predictions on rxDTree models for large datasets.
First, construct the data frame from the test data, then run the predictions. The following code shows both
steps:
Note that you can supply “class” to the type argument to specify that you want the actual class
predictions, rather than the probability of class membership.
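These two steps might be sketched like this (object names are assumptions):

```r
# Step 1: a copy of the test data without the response variables
testData <- rxDataStep(inData = diamondDataList$diamondData.set.test,
                       varsToDrop = c("value", "set"))

# Step 2: predict the class of each diamond with the fitted tree model
pDTree <- rxPredict(dTreeModel, data = testData, type = "class")
```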
For the full list of arguments to these functions, see the R documentation for rxPredict.rxDTree and
rxPredict.rxDForest.
You use the results generated by the rxPredict function to test the accuracy of the predictions made by
the model against the training data.
Testing predictions
After you have constructed your predictions, you
can investigate how well your model has
performed on the test data.
# Typical results
# Call:
# rxSummary(formula = ~value, data = diamondDataList$diamondData.set.test)
…
# Category Counts for value
# Number of categories: 2
# Number of valid observations: 2678
# Number of missing observations: 0
# value Counts
# high 1003
# low 1675
# Typical results:
# Call:
# rxSummary(formula = ~value_Pred, data = pDTree)
…
# Category Counts for value_Pred
# Number of categories: 2
# Number of valid observations: 2678
# Number of missing observations: 0
# value_Pred Counts
# high 1059
# low 1619
In this example, the data in the test dataset revealed that 1,003 diamonds were high value and 1,675 were
low value. The decision tree predicted that 1,059 of the diamonds in this dataset would be high value and
1,619 would be low value.
You can also compare the totals grouped by the levels in the predictor variables:
rxSummary(formula = ~value:(color + clarity + F(carat)), data =
    diamondDataList$diamondData.set.test)

# Typical results:
# Rows Read: 2678, Total Rows Processed: 2678, Total Chunk Time: 0.009 seconds
# Computation time: 0.014 seconds.
# Call:
# rxSummary(formula = ~value_Pred:(color + clarity + F(carat)),
#     data = predictData1)
# ...
# high VVS1  27
# low  VVS1 158
# high IF     7
# low  IF    58
You should calculate the percentage prediction accuracy. This is the percentage of test cases that were
accurately predicted by the model:
# Typical results
[1] 0.961165
The result from the last line shows that the model has a 96.1 percent prediction accuracy on the test data.
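Assuming the predictions have been merged with the actual values into a single data frame (the variable names here are hypothetical), the accuracy figure can be computed as:

```r
# Proportion of test cases where the predicted class matches the actual class
mean(as.character(testResults$value) == as.character(testResults$value_Pred))
```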
You might also want to calculate the error rate, based on the training set, and compare this to the error
rate on the test set. A large difference between the two would indicate overfitting against the training
data.
Visualizing trees
An intuitive way to understand a partitioning
model is to visualize the tree itself. The
RevoTreeView package can be used to plot
rxDTree decision or regression trees as an
interactive HTML page that you can view in a
browser. You can also share the HTML page with
other people or display it on different machines
using the zipTreeView function.
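A minimal invocation might look like this (assuming the decision tree model fitted in the earlier example):

```r
library(RevoTreeView)
plot(createTreeView(dTreeModel))   # renders the tree as an interactive HTML page
```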
This command will produce an HTML representation of the decision tree model from the last lesson,
and then open it in the browser. You can interact with the plot by clicking on the nodes to expand the
next nodes out.
If you hover your mouse pointer over a node, you will see the number of cases at that split and the rule
that defines the split. The terminal nodes are labeled with the predicted classes.
The results for the example look like this. You can see that the carat variable provides the most weight to
the decisions made by the trees in the model:
You might find these other utility functions useful for exploring rxDForest models:
rxLeafSize: returns the size of the terminal nodes for the trees in the decision forest.
rxTreeSize: returns the number of nodes of all trees in the decision forest.
rxVarUsed: returns how many times each predictor variable is used in the decision forest.
rxGetTree: extracts a single decision tree from the forest. You can then examine this tree graphically
by using the createTreeView function in a plot, as described earlier in this topic.
Demonstration Steps
2. Highlight and run the code under the comment # Create a copy of the test set without the value
and set variables. This code creates an in-memory data frame containing the data from the test set
but with the value and set variables removed.
3. Highlight and run the code under the comment # Predict the value of each diamond in the test
set using the DTree model. This code uses the DTree model to predict the value of each diamond in
the test set. The results are stored in an rxPredict object with a variable, named Value_Pred, for each
row in the test set. This variable will contain the value high or low. Note that, if you want to see the
probability of each decision, you can omit the type argument from the rxPredict function. You can
also compute residuals if necessary.
4. Highlight and run the code under the comment # Assess the results against the values recorded
in the test set. This code generates a summary of the split between high and low values in the test
data, and in the predicted results, for comparison. The predicted number of high and low values
should be within a few percent of the actuals.
5. Highlight and run the code under the comment # Repeat using the DForest model. This code uses
the DForest model to generate predictions using the test dataset, and assesses the accuracy of the
results. Ideally, these results should be closer to the actual values than those predicted by the DTree.
6. Highlight and run the code under the comment # Add the predicted value of each diamond to
the in-memory data frame. This code merges the test data used to generate the predictions with
the predicted values for each diamond.
7. Highlight and run the code under the comment # Compare the predicted results against the
actual values by variable. This code generates summaries of the original test data and the data
frame with the predicted results. You can browse this data to determine which factors lead to
discrepancies.
2. Highlight and run the code under the comment # Generate a line plot of the DTree model. This
code uses the rpart library to display a line plot of the tree. The structure of the plot should mirror
that shown by using RevoTreeView.
3. Highlight and run the code under the comment # Show an importance plot from the DForest
model. This code generates a dotchart showing the importance of each variable in classifying
diamonds as used by the DForest model. The carat variable is clearly the most significant.
Verify the correctness of the statement by placing a mark in the column to the right.
Statement Answer
Lesson 3
Using the MicrosoftML package
The MicrosoftML package extends the machine learning capacity already provided in RevoScaleR. It adds
new algorithms and a set of data transform functions to increase the speed, performance and scalability
of your data science pipelines.
Lesson Objectives
After completing this lesson, you will be able to:
The MicrosoftML package is also available in both R Client and R Server, and the functions are designed
to complement the RevoScaleR package.
Providing faster classification and regression algorithms for very large datasets.
Providing a set of common and useful data transformations that can be run on chunked data.
Introduction to MicrosoftML
https://aka.ms/r89gc8
categorical: this function converts a categorical value into an indicator array using a dictionary. It is
useful when the number of categories is small or fixed, and hashing is not required.
selectFeatures: this transformation function selects features from the specified variables using one of
two modes: count or mutual information.
o The count feature selection mode selects a feature if at least the specified number of examples
have nondefault values for that feature. This mode is useful when applied together with a
categorical hash transform.
o The mutual information feature selection mode selects the features based on the mutual
information. It keeps the top numFeaturesToKeep features with the largest mutual information
with the label. Mutual information is similar to a correlation and specifies how much information
you can get about one variable from another.
featurizeText: This is a feature selection function that provides a wide range of tools to process text
data for analysis. It provides:
o Language detection (English, French, German, Dutch, Italian, Spanish and Japanese available as
default).
o Punctuation removal.
o Feature generation.
MicrosoftML algorithms
The MicrosoftML package provides a range of
machine learning algorithms, each suited to a
particular use case.
rxFastLinear
This algorithm is based on the stochastic dual
coordinate ascent (SDCA) method, a state-of-the-
art optimization technique for convex objective
functions. It is designed for both binary
classification (two classes—for example, spam
filtering) and linear regression analysis (for
example, predicting mortgage defaults) and
combines the advantages of both logistic
regression and SVM algorithms. The algorithm scales well on large, out-of-memory datasets and supports
multithreading. You can also use the SDCA algorithm to analyze potentially billions of rows and columns.
Because the algorithm involves a stochastic (random) element, you will not always get consistent results
from one run to the next, although the error between runs should be very small.
rxOneClassSvm
This algorithm is most useful for anomaly detection—that is, to identify outliers that don’t belong to a
target class. It’s a support vector machine (SVM) algorithm that attempts to find the boundary that
separates classes by as wide a margin as possible. rxOneClassSvm is a “one-class” algorithm because the
training set contains only examples from the target class. It infers the “normal” properties for the objects
in the target class and, from these properties, predicts whether cases are like the normal examples. This
works because typically there are very few anomalies. Anomaly detection is often used for network
intrusion, fraud, or to identify problems in long-running system processes. The data in anomaly detection
problems is not usually very large, and this algorithm is designed to run single threaded on in-memory
data.
rxFastTrees
This is a boosting algorithm, like the rxBTrees algorithm in the RevoScaleR package. The algorithm uses
an advanced sorting method that makes it faster, but it is limited to working with in-memory data.
However, it can be multithreaded to make use of multiple processors. It is suitable for datasets of up to
around 50,000 columns. It can run both binary classification trees and regression trees.
rxFastForest
Like rxFastTrees, the rxFastForest function is also optimized for in-memory modeling. It has similar
limitations and advantages. It implements the Random Forest algorithm and Quantile Regression
Forests.
rxNeuralNet
This function implements feed-forward neural networks for regression modeling and for binary and
multinomial (multiple classes) classification. Neural nets are inspired by the neural network in the brain,
which is composed of many interconnected, but independent neurons. The neurons in a neural net model
are arranged in layers, where neurons in one layer are connected by a weighted edge to neurons in the
next layer. The values of the neurons are determined by calculating the weighted sum of the values of the
neurons in the previous layer and applying an activation function to that weighted sum. A model is
defined by the number of layers, the number of neurons in each layer, the choice of activation function,
and the weights on the graph edges. The algorithm tries to learn the optimal weights on the edges based
on the training data.
Neural nets perform well when data structures are not well understood and for problems where standard
regression based models fail—such as check signature recognition and optical character recognition
(OCR). The rxNeuralNet algorithm is capable of “deep learning” over potentially millions of columns
and an effectively infinite number of rows of data. It can also run on multiple cores or even GPUs. The
trade-off for this power is that neural nets have many control parameters and can take a long time to
train.
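A minimal sketch of a multinomial classification call follows (the data and parameter values are illustrative; numHiddenNodes and numIterations are just two of the many control parameters mentioned above):

```r
library(MicrosoftML)

# Hypothetical multiclass example: recognize digits 0-9 from pixel features
model <- rxNeuralNet(Digit ~ Pixel1 + Pixel2 + Pixel3,
                     data = digitTrain,
                     type = "multiClass",
                     numHiddenNodes = 100,
                     numIterations = 50)

scores <- rxPredict(model, data = digitTest)
```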
rxLogisticRegression
This is a highly optimized logistic regression algorithm for binary and multinomial classification over large
datasets. The algorithm can be run either single-threaded, in which case it can make use of out-of-memory data, or multithreaded, in which case all of the data is loaded into memory at once. It performs well for datasets up to approximately 100 million columns of data. It uses linear class boundaries, so your data should, at least approximately, meet the criteria of statistical normality.
https://aka.ms/syosce
Objectives
In this lab, you will:
Create a DTree partitioning model using the departure time, arrival time, month, and day of the week
as predictor variables, and use this model to predict delay times.
Create a DForest model to see how this affects the quality of the predictions made.
Create a further DTree model that combines the variables from the previous models to judge the
effects on the accuracy of the predictions.
Lab Setup
Estimated Time: 90 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before you start this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
2. Copy the FlightDelayData.xdf file from the E:\Labfiles\Lab07 folder to the \\LON-RSVR\Data shared
folder.
2. Create a remote session on the LON-RSVR server. This is another VM running R Server. Use the
following parameters to the remoteLogin function:
deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800
session: TRUE
diff: TRUE
commandLine: TRUE
username: admin
password: Pa55w.rd
4. Split the data in the PartitionedFlightDelayData.xdf file into separate test and training files using
the DataSet field.
5. Verify the number of observations in each file. The training file should contain 19 times more
observations than the test file.
2. Examine the model and note the complex interplay between the variables that drive the decisions
made. If time allows, copy the DTree object to the local session running on R Client and use the
RevoTreeView package to step through the model. Return to the R Server session when you have
finished.
3. Use the plotcp function to generate a scree plot of the tree. Notice how the large number of
branches adds complexity, but does not necessarily improve the decision making process; this is an
overfit model.
4. Examine the cptable field in the model to ascertain the point of overfit—it will probably be around
seven levels of branching.
5. Prune the tree back to seven levels.
2. Run predictions against the variables in the data frame using the DTree model.
3. Summarize the results of the Delay_Pred variable in the predictions. Compare the statistics of the
Delay variable in the test dataset with these values. How close are the mean values of both datasets?
4. Merge the predicted values for Delay_Pred into a copy of the test dataset (not the data frame).
Perform a oneToOne merge.
Note: You might find it useful to write the code that performs this analysis as a function
because you will be repeating it several times over different sets of predictions in subsequent
exercises.
Results: At the end of this exercise, you will have constructed a DTree model, made predictions using this
model, and evaluated the accuracy of these predictions.
Question: How many predicted delays were within 10 minutes of the actual reported delays?
What proportion of the observations is this?
Question: How many predicted delays were within 5 percent of the actual delays?
Question: How many predicted delays were within 10 percent of the actual delays?
Question: How many predicted delays were within 50 percent of the actual delays?
2. Merge the predicted values into a copy of the test data frame. Perform a oneToOne merge.
Results: At the end of this exercise, you will have constructed a DTree model, made predictions using this
model, and evaluated the accuracy of these predictions.
Question: How many predicted delays were within 10 minutes of the actual reported delays?
What proportion of the observations is this? How does this compare to the predictions made
using the DTree model?
Question: How many predicted delays were within 5 percent of the actual delays? How does
this compare to the predictions made using the DTree model?
Question: How many predicted delays were within 10 percent of the actual delays? How
does this compare to the predictions made using the DTree model?
Question: How many predicted delays were within 50 percent of the actual delays? How
does this compare to the predictions made using the DTree model?
Question: Was the DForest model more accurate at predicting delays than the DTree
model? What conclusions can you draw?
Question: Was the DForest model with a reduced depth more or less accurate than the
previous model? What conclusion do you reach?
2. Generate predictions for the delay times against the new DTree model. Use the data frame containing
the test data that you just constructed.
3. Merge the predicted values into a copy of the test data frame. Perform a oneToOne merge.
Results: At the end of this exercise, you will have constructed a DTree model using a different set of
variables, made predictions using this model, and compared these predictions to those made using the
earlier DTree model.
Question: How do the predictions made using the new set of predictor variables compare to
the previous set? What are your conclusions?
2. Generate predictions for the delay times against the DTree model. Use the data frame containing the
test data that you just constructed.
3. Merge the predicted values into a copy of the test data frame. Perform a oneToOne merge.
Results: At the end of this exercise, you will have constructed a DTree model combining the variables
used in the two earlier DTree models, and made predictions using this model.
Question: What do the results of this model show about the accuracy of the predictions?
Use the three main partitioning models in the ScaleR package, and tune the models to reduce bias
and variance.
Use the MicrosoftML package for using advanced machine learning algorithms to create predictive
models.
Module 8
Processing Big Data in SQL Server and Hadoop
Contents:
Module Overview
Module Overview
In this module, you will learn how to process big data by using R Server with Microsoft® SQL Server®,
and Hadoop. You will see how R Server is incorporated into SQL Server to enable you to analyze data held
in a database efficiently, making use of SQL Server resources. You will also learn how R Server can be used
to handle big datasets stored in HDFS by using Hadoop Map/Reduce and Spark functionality.
Objectives
In this module, you will learn how to:
Use R in conjunction with SQL Server to analyze data held in a database.
Incorporate Hadoop Map/Reduce functionality, together with Pig and Hive, into the ScaleR workflow.
Lesson 1
Integrating R with SQL Server
A key principle of using R is to move the processing close to the data that is being processed. In this way,
you reduce the overhead and memory requirements of relocating large datasets across networks and
hardware. A SQL Server database can act as a source for R data, holding massive amounts of data.
Microsoft has integrated R into SQL Server to enable you to perform R processing directly from the
database server. Using SQL Server R Services, you can create stored procedures that run R functions,
including ScaleR operations. These functions can have access to the data held in your databases. SQL
Server R Services takes advantage of the parallelism available with SQL Server to help maximize
throughput.
Lesson Objectives
In this lesson, you will learn:
The features of SQL Server R Services.
The SQL Server Trusted Launchpad manages security and communications. R tasks execute outside the
SQL Server process, to provide security and greater manageability. An additional service named BxlServer
(Binary Exchange Language Server) enables SQL Server to communicate efficiently with external processes
running R, and provides access to data in a SQL Server database to these external processes. Each R
request from the database is handled as a separate Windows® job that runs in its own R session.
The RxInSqlServer compute context also uses the BxlServer to enable you to run R code that is not
stored inside a SQL Server database, but that executes within the context of the SQL Server engine (and
access data in a SQL Server database) from environments such as R Client and remote R Server sessions.
For more information about the new components in SQL Server that support R services, see:
You can run R code in SQL Server either by setting the RxInSqlServer compute context from an R client
session, or directly from SQL Server by using stored procedures. In all cases, the SQL Server Trusted
Launchpad verifies that the process running the R code has the appropriate privileges and access rights to
the data that it uses. Depending on how SQL Server is configured, you can connect by using SQL Server
authentication (a SQL Server login and password), or you can utilize Windows authentication. Additionally,
the user or account must be granted the right to execute external stored procedures. For more
information about how SQL Server Trusted Launchpad manages security, see:
RxOdbcData. This is a more generic data source for accessing data through the ODBC interface by
using the RODBC package. It is less optimal than the rxSqlServerData data source for retrieving SQL
Server data, but is currently the only option available if you need to query data held in Azure SQL
Database.
In both cases, you must provide the details of the SQL Server connection by using a connection string that
specifies the address of the database server, the database to connect to, and logon or security information
that identifies the SQL Server account to use.
Having established a connection to SQL Server through a data source, you can then read and write data
using R functions. Note that not all R functions are supported; for example, you can use head to display
the first few rows from a table or query, but the tail function is not available.
The following example shows how to create a connection to the flightdelaydata table in a SQL Server
database named FlightDelays, and then use this connection to display the first few rows from the table.
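The code for this example is not reproduced in this extract; a minimal sketch consistent with the output shown below (the LON-SQLR server name is assumed from the rest of the module) might be:

```r
# Connection string for the FlightDelays database, using Windows authentication
sqlConnString <- "Driver=SQL Server;Server=LON-SQLR;Database=FlightDelays;trusted_connection=true"

# Data source over the flightdelaydata table
flightDataSource <- RxSqlServerData(connectionString = sqlConnString,
                                    table = "flightdelaydata")

# Display the first few rows, as with any ScaleR data source
head(flightDataSource)
```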
# Results
Origin Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime …
1 ORD 2000 2 10 4 557 600 843 854
2 ORD 2000 2 20 7 1718 1700 2009 1958
3 ORD 2000 2 15 2 1646 1650 1929 1934
4 ORD 2000 2 23 3 905 906 1153 1146
5 ORD 2000 2 28 1 817 820 1032 1043
6 ORD 2000 2 3 4 556 600 845 854
Remember that you can also use the following functions that operate directly on tables within a SQL
Server database:
rxSqlServerDropTable. This function removes a table (and its contents) from a SQL Server database.
rxSqlServerTableExists. Use this function to test whether a specified table exists in a SQL Server
database.
rxExecuteSQLDDL. This function performs Data Definition Language (DDL) operations, enabling you
to perform operations such as creating new tables.
Note: The RxInSqlServer compute context only supports the RxSqlServerData data
source. You cannot access text files, XDF files, or files held in an HDFS file system in this compute
context.
The following example shows how to connect to SQL Server R Services using an RxInSqlServer compute
context object:
sqlConnString <- "Driver=SQL Server;Server=LON-SQLR;Database=FlightDelays;trusted_connection=true"
sqlCompute <- RxInSqlServer(connectionString = sqlConnString)
rxSetComputeContext(sqlCompute)
In this case, the code connects to a database named FlightDelays hosted by a SQL Server instance
running on a server named LON-SQLR. The connection utilizes Windows authentication. The
rxSetComputeContext function in this example switches your session to run within the context of the
SQL Server database. At this point, your R code runs using SQL Server R Services on the database server.
The code that you run is performed as a series of Windows jobs, managed by SQL Server. This approach
helps to prevent SQL Server from being overwhelmed by a sudden influx of work. You can monitor these
jobs using the sp_help_jobactivity stored procedure in SQL Server. The wait parameter indicates whether
your code is blocked when the job is created and waits until the job has finished and returned any results,
or whether each request is simply queued and your session is allowed to continue. This is known as
waiting or nonwaiting.
If the wait parameter is false, each request is given a unique identifier, and it is your responsibility to
check the status of the job and retrieve the results. To find the status of a running nonwaiting job, you can
call rxGetJobStatus with the job identifier. You can obtain the identifier for the most recent job from the
rxgLastPendingJob variable. To retrieve the results of a finished nonwaiting job, you can call
rxGetJobResults with the job identifier as the argument.
To cancel a nonwaiting job, use the rxCancelJob function with the job name as the argument.
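Putting these functions together, a nonwaiting workflow might look like this (a sketch; the connection string and the analysis being run are placeholders):

```r
# Create a nonwaiting compute context by setting wait = FALSE
sqlCompute <- RxInSqlServer(connectionString = sqlConnString, wait = FALSE)
rxSetComputeContext(sqlCompute)

# This call returns immediately; the work is queued as a job in SQL Server
rxSummary(~ ArrDelay, data = flightDataSource)
job <- rxgLastPendingJob                  # identifier of the most recent job

# Poll for completion, then collect the results
if (rxGetJobStatus(job) == "finished") {
    results <- rxGetJobResults(job)
}

# To abandon the job instead:
# rxCancelJob(job)
```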
The following table shows how common SQL Server types map to R types:

SQL Server type    R type
int                integer
smallint           integer
float              numeric
real               numeric
bigint             numeric
money              numeric
datetime           POSIXct
date               POSIXct
bit                logical
Note: The mappings for binary and character data shown in the table apply for conversions
from SQL Server to R. R character data is converted to varchar(max) on output to SQL Server,
and R raw data is converted to varbinary(max).
You should note that SQL Server has some data types that are not available in R, including image, xml,
table, timestamp, and all spatial types. In these situations, you must write your own code to extract the
information held in this data and reformat it as types that are compatible with R. You can do this using
the CAST and CONVERT Transact-SQL functions. Additionally, some types might be converted in an
unexpected manner, so you might need to check the results carefully.
Note: If you want to remove columns that have incompatible types from a dataset, note
that the RxSqlServerData data source does not support the varsToKeep and varsToDrop
options of the rxDataStep function.
For detailed information on the type mappings performed by SQL Server R Services, see:
The following code sample shows how to use the sp_execute_external_script stored procedure.
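The sample itself is described in detail below; a sketch consistent with that description (the table and column names are assumed) might look like this:

```sql
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Summarize delays for the specified origin state, then print the results
        print(rxSummary(~ Delay, data = InputDataSet[InputDataSet$OriginState == origin, ]))
        # Generate a subset of the observations as the returned dataset
        OutputDataSet <- rxDataStep(InputDataSet, rowSelection = Delay > 0)',
    @input_data_1 = N'SELECT Delay, OriginState FROM flightdelaydata',
    @params = N'@origin varchar(2)',
    @origin = 'MA'
WITH RESULT SETS ((Delay int, OriginState varchar(2)));
```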
In this example, the script uses the rxSummary function to summarize a dataset. The results are printed,
before a new dataset is generated containing a subset of the observations from the original dataset by
using the rxDataStep function. The new dataset is returned by the stored procedure. The key points to
notice are:
The @language parameter. You should always set this to 'R' (this is currently the only external
language supported by the sp_execute_external_script stored procedure).
The @input_data_1 parameter. This parameter specifies a dataset to be passed in to the stored
procedure. It appears in the stored procedure as the variable InputDataSet (although you can
change this name by specifying the @input_data_1_name parameter). In the example, the dataset is
the result of executing a SQL query.
The @params parameter. This parameter specifies a comma-separated list of variables that are
passed in to the stored procedure and that can be referenced by the R code in the stored procedure.
You must specify the type as a recognized SQL Server type for each variable. SQL Server R Services
will convert the variables into the equivalent R data types, subject to the rules specified in the
previous topic. In the example, the origin variable is referenced by the rxSummary function.
The @origin variable. Each input variable mentioned in the @params list should be specified with a
value to be assigned to the variable. In the example, the @origin variable is assigned the text "MA",
and this value is used by the rxSummary function in the R code to limit the observations being
summarized to those where the OriginState field matches "MA".
The WITH RESULT SETS clause. This clause specifies the fields to include if the R code returns a
results set, such as a data frame (currently, a data frame is the only supported type of result set). In
the R code, you reference the result set using the OutputDataSet variable, and the contents of this
variable are returned from the stored procedure. You must ensure that the fields in the
OutputDataSet variable match those specified by the WITH RESULT SETS clause.
The sp_execute_external_script stored procedure also enables you to specify processing hints, such as
whether to parallelize the processing for the R code, and whether to use result-set streaming for handling
datasets too big to fit into memory. For more information, see:
sp_execute_external_script (Transact-SQL)
https://aka.ms/d0e0uz
Note: You must enable external scripts before using the sp_execute_external_script
stored procedure. An administrator can perform this task by using the following commands while
connected to SQL Server.
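The commands in question are the standard configuration statements for enabling external scripts:

```sql
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
-- Restart the SQL Server service for the change to take effect
```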
# Save the stored procedure to SQL Server. The sqlConnString variable holds a connection string
registerStoredProcedure(testProcSP, sqlConnString)
The stored procedure created by the preceding code looks like this. You can see how the R code is
embedded in a call to the sp_execute_external_script stored procedure:
The id column is a character-based primary key that you use to identify objects. The object itself is
serialized and stored in a binary format in the value column.
After you have created the table, you can save objects to it. The following example creates a histogram
which it stores in the database:
# Save the chart to the charts table, and give it a unique name to identify it later
rxWriteObject(dest = chartsTable, key = "chart1", value = chart)
Later, you can retrieve the histogram using chart1 as the key. You can then display the histogram:
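A sketch of the retrieval step (chartsTable is assumed to be the same RxOdbcData source that was used when the chart was saved):

```r
# Read the serialized object back from the charts table and deserialize it
chart <- rxReadObject(src = chartsTable, key = "chart1")

# Display the reinstated histogram
print(chart)
```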
Note that the rxWriteObject and rxReadObject functions expect an ODBC data source and do not work
in the SQL Server compute context. If you need to perform similar operations inside a SQL Server stored
procedure, you must:
2. Serialize the object manually. You can use the serialize and paste base R functions to perform these
tasks.
3. Execute a SQL Server INSERT operation to save the serialized object to the table. It is recommended
that you wrap this operation in a stored procedure that you call from your R code. In this way, you
can reduce dependencies between the structure of the table and your R code.
1. Execute a SQL Server SELECT statement that fetches the serialized object from the database.
2. Deserialize the object back into its original form. You can use the as.raw and unserialize functions to
perform this task.
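A sketch of the serialize/deserialize round trip using base R (the INSERT and SELECT statements themselves are assumed to be wrapped in stored procedures, as recommended above):

```r
# 'model' can be any R object; serialize it to a raw vector, then to a hex
# string that can be passed as a parameter to an INSERT stored procedure
hexString <- paste(serialize(model, connection = NULL), collapse = "")

# ... INSERT hexString into the table, and later SELECT it back ...

# Split the hex string into two-character pairs, convert back to raw bytes,
# and deserialize to recover the original object
pairs <- substring(hexString, seq(1, nchar(hexString), 2),
                   seq(2, nchar(hexString), 2))
model2 <- unserialize(as.raw(strtoi(pairs, base = 16L)))
```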
Note: You will use this second approach in the lab at the end of this lesson.
Demonstration Steps
3. Highlight and run the code under the comment # Create a SQL Server compute context. This code
creates a compute context that connects to the AirlineData database.
4. Highlight and run the code under the comment # Create a data source that retrieves airport
information.
5. Highlight and run the code under the comment # Create and display a histogram of airports by
state. This code uses the rxHistogram function to generate the histogram. Notice that the histogram
object itself is stored in the chart variable.
Note: Make sure that you display the Plots window in the lower right pane.
6. In the toolbar above the Plots window, click Clear all Plots, and then click Yes to confirm.
7. On the Windows desktop, click Start, type Microsoft SQL Server Management Studio, and then
press Enter.
8. In the Connect to Server dialog box, log in to LON-SQLR using Windows authentication.
10. Move to the E:\Demofiles\Mod08 folder, click the Demo1 - persisting R objects SQL Server Query
File, and then click Open.
11. In the Query window, notice that this script creates a table named charts with two columns; id and
value. This table will be used to hold R objects. The id column is the primary key, and the value
column will hold a serialized version of the object.
14. Highlight and run the code under the comment # Create an ODBC data source that connects to
the charts table. This code switches back to the local compute context and creates an ODBC data
source.
15. Highlight and run the code under the comment # Save the chart to the charts table, and give it a
unique name to identify it later. This code uses the rxWriteObject function to store the histogram
object with the key chart1.
17. In Object Explorer, expand LON-SQLR, expand Databases, expand AirlineData, right-click Tables,
and then click Refresh.
18. Expand Tables, right-click dbo.charts, and then click Select Top 1000 Rows.
You should see a single row. The value is a hexadecimal string that is a binary representation of the
histogram object.
2. Highlight and run the code under the comment # Retrieve the persisted chart. This code uses the
rxReadObject to read the data for the chart1 object from the database and reinstate it as an R
object in memory.
3. Highlight and run the code under the comment # Display the chart. This code prints the object. The
histogram should appear in the Plots window.
Objectives
In this lab, you will:
Upload the flight delay data to SQL Server and examine it.
Lab Setup
Estimated Time: 60 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
2. Start SQL Server Management Studio. Log on to the LON-SQLR server using Windows
authentication.
4. Stop and restart SQL Server, and then create a new database named FlightDelays. Use the default
options for this database.
2. Start your R development environment of choice (Visual Studio or RStudio), and create a new R file.
3. Ensure that you are running in the local compute context; this is because you cannot read or write
XDF data in the SQL Server compute context.
4. Create an RxSqlServerData connection string and data source that connects to a table named
flightdelaydata in the FlightDelays database. The database is located on the LON-SQLR server. You
should use a trusted connection.
Create an additional column named DelayedByWeather. This column should be a logical factor
that is true if the WeatherDelay value in an observation is non-zero. For this exercise, treat NA
values as zero.
Create another column called Dataset. You will use this column to divide the data into training
and test datasets for the DForest model. The column should contain the text "train" or "test",
selected according to a random uniform distribution. Five percent of the data should be marked
as "test" with the remainder labelled as "train".
Note that the data file contains 1158143 (1.158 million) rows.
The RxSqlServerData data source should convert the following columns to factors:
Month
OriginState
DestState
DelayedByWeather
WeatherDelayCategory
MonthName
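A sketch of the import described by these steps (the file, table, and variable names are taken from the lab text, but the exact code is left for you to write):

```r
# Columns to treat as factors in the SQL Server data source
factorCols <- c("Month", "OriginState", "DestState", "DelayedByWeather",
                "WeatherDelayCategory", "MonthName")

flightDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                 table = "flightdelaydata",
                                 colInfo = sapply(factorCols,
                                                  function(col) list(type = "factor"),
                                                  simplify = FALSE))

# Import the XDF data, adding the two derived columns as it is copied
rxDataStep(inData = "FlightDelayData.xdf", outFile = flightDataSQL,
           overwrite = TRUE,
           transforms = list(
               DelayedByWeather = factor(ifelse(is.na(WeatherDelay), FALSE,
                                                WeatherDelay > 0)),
               Dataset = ifelse(runif(.rxNumRows) <= 0.05, "test", "train")))
```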
3. Run the rxGetVarInfo function over the data source and verify that it contains the correct variables.
Note that the factors may be reported as having zero factor levels. This is fine as you have not yet
retrieved any data, so the data source does not know what the factor levels are.
4. Summarize the data by using the rxSummary function. This might take a while as this is the point at
which the data source reads the data from the database.
5. Create and display a histogram that shows the number of delays in each value of
WeatherDelayCategory conditioned by MonthName.
6. Create and display another histogram that shows the number of delays in each value of
WeatherDelayCategory conditioned by OriginState.
Results: At the end of this exercise, you will have imported the flight delay data to SQL Server and used
ScaleR functions to examine this data.
Question: Using the second histogram, which states appear to have the most delays as a
proportion of the flights that depart from airports in those states? Proportionally, which state
has the fewest delays?
Only use the observations where the Dataset variable contains the value "train" to fit the model.
Note: Make sure you are still using the SQL Server compute context. It should take
approximately five minutes to construct the model. However, if you are running in the local
compute context, it can take more than 30 minutes to perform the same task. This is one
advantage of keeping the computation close to the data, and exploiting the parallelism available
with R Server rather than R Client.
Also note that, when the process completes, you will receive a warning message stating
"Number of observations not available for this data source". You can ignore this warning.
2. Inspect the model and examine the trees that it contains. Notice the forecast accuracy of the model
based on the training data.
3. Use the rxVarImpUsed function to see the influence that each predictor variable has on the decisions
made by the model.
Results: At the end of this exercise, you will have created a decision tree forest using the weather data
held in the SQL Server database, scored it, and stored the results back in the database.
Question: What is the Out-Of-Bag (OOB) error rate for the DForest model?
Question: Are there any discrepancies between flights being forecast as delayed versus
those being forecast as on-time? If so, how could you adjust for this?
Question: Which predictor variable had the most influence on the decisions made by the
model?
Note: The rxDForest function has a rowSelection argument that you can use to limit the
rows retrieved to those labeled as "training". The rxPredict function that you will use to score
does not have this capability, so you need to restrict the data by modifying the data source
instead.
2. Create another RxSqlServerData data source. This data source should connect to a table named
scoredresults in the FlightDelays database (this table doesn't exist yet).
3. Temporarily switch back to the local compute context and run the rxPredict function to make
predictions about weather delays using the new dataset in SQL Server. Save the results in the
scoredresults table. Include the model variables in the scored results. Specify a prediction type of
prob to generate the probabilities of a match/nomatch for each case, and use the predVarNames
argument to record these probabilities in columns named PredictDelay and PredictNoDelay in the
scoredresults table. The rxPredict function also generates a TRUE/FALSE value that indicates, based
on these probabilities, whether the flight will be delayed. Save this data in a column named
PredictedDelayedByWeather.
When the rxPredict function has finished, return to the SQL Server compute context.
Note: The rxPredict function does not currently work as expected in the SQL Server
compute context, which is why you need to switch back to the local compute context.
4. Run the following code to test the accuracy of the weather delay predictions in the scoredresults
table against the real data:
install.packages('ROCR')
library(ROCR)
# Transform the prediction data into a standardized form
results <- rxImport(weatherDelayScoredResults)
weatherDelayPredictions <- prediction(results$PredictedDelay,
results$DelayedByWeather)
# Plot the ROC curve of the predictions
rocCurve <- performance(weatherDelayPredictions, measure = "tpr", x.measure = "fpr")
plot(rocCurve)
It uses the prediction function to compare the probability of a weather delay recorded in the
PredictedDelay column of the scored results with the flag (TRUE=1, FALSE = 0) that indicates
whether the flight was actually delayed by weather for each observation.
It runs the performance function to measure the ratio of true positive results against false
positive results.
It plots the results as a ROC curve.
Question: What does the ROC curve tell you about the possible accuracy of weather delay
predictions? Is this what you expected?
2. Using SQL Server Management Studio, create the following table in the FlightDelays database.
This table will hold the serialized model:
3. Add the following stored procedure. You can run this stored procedure from R to save the DTree
model to the database:
4. In your R development environment, create an ODBC connection to the database and use the
sqlQuery ODBC function to run the PersistModel stored procedure. Specify the serialized version of
the DTree object as the parameter to the stored procedure.
Note that the sqlQuery function is part of the RODBC library, and you must use the
odbcDriverConnect function to create the ODBC connection. You can reuse the same connection
string as before.
Task 2: Create a stored procedure that runs the model to make predictions
1. In SQL Server Management Studio, create the following stored procedure:
This stored procedure takes three input parameters: Month, OriginState, and DestState. These
parameters equate to the predictor variables used by the DTree model.
The body of the stored procedure creates a variable named @weatherDelayModel that will be
used to retrieve the model from the delaymodels table.
The remainder of the stored procedure uses the sp_execute_external_script stored procedure to
run a chunk of R code. This R code takes four parameters: @model, which is used to reference the
model to run, and @month, @originState, and @destState, which specify the predictor values
passed in. The assignments below the @params definition show how these parameters are
populated using the @weatherDelayModel, @Month, @OriginState, and @DestState variables
respectively.
The R code uses these variables to construct a data frame containing predictor values and also to
retrieve the model from the database. The rxPredict function in the R code generates a prediction
from this data indicating whether a flight from the specified origin to destination in the given month
is likely to be delayed by weather. The results are output as another data frame (containing a single
row). The WITH RESULT SETS clause specifies the fields in this data frame.
2. Return to your R development environment and run the following code to test the stored procedure:
This code asks about the probability of a flight from Michigan to New York in October being delayed
due to weather.
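A sketch of the test call is shown below. The stored procedure name `PredictWeatherDelay` and the connection string are hypothetical placeholders; the parameter names and values come from the description above:

```r
# Sketch only: the stored procedure name and connection string are assumptions.
library(RODBC)

connStr <- "Driver=SQL Server;Server=LON-SQLR;Database=FlightDelays;Trusted_Connection=Yes"
conn <- odbcDriverConnect(connStr)

# Michigan (MI) to New York (NY) in October (month 10)
result <- sqlQuery(conn,
  "EXEC PredictWeatherDelay @Month = 10, @OriginState = 'MI', @DestState = 'NY'")
print(result)
odbcClose(conn)
```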
3. Save the script as Lab8_1Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.
Results: At the end of this exercise, you will have saved the DTree model to SQL Server, and created a
stored procedure that you can use to make weather delay predictions using this model.
Question: According to the DTree model, what is the probability of a flight from Georgia
(GA) to New York (NY) in November being delayed by weather? What about a flight in June?
Lesson 2
Using ScaleR functions with Hadoop on a Map/Reduce
cluster
The RevoScaleR package provides the RxHadoopMR compute context to enable you to perform data
analysis operations using a Hadoop cluster. The operations performed by many of the ScaleR functions in
this compute context have been adapted to take advantage of the Hadoop Map/Reduce mechanism to
analyze and refine big datasets. The RevoScaleR package also includes a set of command line helper
functions that you can use to interact directly with Hadoop and HDFS data.
The RevoScaleR package provides the same ScaleR functions for the RxHadoopMR compute context as it
does for other nonclustered contexts. This approach enables you to develop and test your R code locally
on small datasets before deploying it on a Hadoop cluster against massive volumes of data.
Lesson Objectives
After completing this lesson, you will be able to:
The structure of XDF data, and the way in which the ScaleR functions process it, make it a natural fit for
working with Hadoop. The implementation of the ScaleR functions for Hadoop breaks operations down into
parallel pieces, each of which can be run independently by separate Hadoop processes; a ScaleR function
initiates a Map/Reduce job, and the separate processes run as tasks within that job. You can monitor and
trace the progress of these jobs using the standard Hadoop job utilities.
Additionally, the ScaleR functions are optimized to operate on chunked data. XDF files stored in HDFS are
actually structured as composite files, as described in Module 2: Exploring Big Data. Each element of the
composite file can be allocated to a process in isolation, and there are no issues with locking or
contention.
HDFS is a shared file system, and the RevoScaleR package depends on you to maintain the security of the
files in HDFS. You can do this by using the Hadoop HDFS utilities (such as the hadoop fs command in
Linux). Although handling security is out of the scope of the ScaleR functions, the RevoScaleR package
does depend on a particular directory structure in HDFS. Specifically, as part of the ScaleR installation
process, you must create the following folders in HDFS:
/user/RevoShare
/user/RevoShare/username (where username is the name of each account that can run ScaleR
functions)
These directories must be assigned the appropriate read/write permissions for each user. You can do this
using the hadoop fs -chmod command in Linux.
Additionally, you must also create the /var/RevoShare directory in the Linux file system, together with
/var/RevoShare/username subdirectories (again where username is the name of each account that can run
ScaleR functions).
For more information about using the ScaleR functions with Hadoop, see:
If you are currently located outside of the cluster, you must provide additional information that specifies:
The cluster name node or host name of an edge node in the cluster.
The port number for incoming connections on the Hadoop cluster (if a nonstandard port is used).
Any additional SSH switches required. This includes the location of the private key file required to
authenticate your login.
The following code shows how to connect to a Hadoop cluster located at LON-HADOOP-
01.ukwest.cloudapp.azure.com from a non-Hadoop client computer. The user is logged in as
student01, and the hadoop.ppk file contains the private key that authenticates the users for the SSH
session. The default port is used to connect to the cluster.
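The connection just described might be coded as follows. The key file location is an assumption; the host name and user name come from the text:

```r
# Sketch of the connection described above; the key file path is an assumption.
context <- RxHadoopMR(
  sshUsername = "student01",
  sshHostname = "LON-HADOOP-01.ukwest.cloudapp.azure.com",
  sshSwitches = "-i /home/student01/hadoop.ppk"   # private key for the SSH session
)
rxSetComputeContext(context)
```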
If you are using PuTTY, you can wrap this information up and save it in a PuTTY session configuration file
instead. In this case, you can omit the port and sshSwitches parameters, and replace the sshHostname
parameter with the name of the session configuration file.
Note: The configuration used by the labs in this module follows this approach. See the
About This Course document for information on how this is set up.
Hadoop jobs can be long running. You can configure the Hadoop compute context to be nonwaiting by
setting the wait parameter to FALSE. As with SQL Server, all requests will return immediately, but they will
report a job id that you can use to monitor the job from the Hadoop Jobtracker console. You can test the
status of a job with the rxGetJobStatus function, and when it has finished you can retrieve the results
with the rxGetJobResults function. You can halt a job by using the rxCancelJob function, or block while
a job completes with the rxWaitForJob function. For more information about waiting and nonwaiting
jobs, refer to Module 5: Parallelizing Analysis Operations.
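The nonwaiting workflow can be sketched as follows. The host name and the `flightData` data source are assumptions:

```r
# Sketch: a nonwaiting Hadoop compute context (argument values are assumptions).
context <- RxHadoopMR(sshUsername = "student01",
                      sshHostname = "LON-HADOOP",
                      wait = FALSE)
rxSetComputeContext(context)

job <- rxSummary(~ArrDelay, data = flightData)  # returns a job object immediately
rxGetJobStatus(job)                             # poll the status of the job
results <- rxGetJobResults(job)                 # retrieve results when the job has finished
# rxCancelJob(job)                              # or halt the job
# rxWaitForJob(job)                             # or block until the job completes
```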
One further parameter that can be instructive is consoleOutput. If you set this to TRUE, you can see how
the ScaleR operations are split into Map/Reduce tasks. The information displayed also includes the
Hadoop job and task ids, and you can use the Hadoop Jobtracker console to monitor these items.
There’s a raft of other optional arguments that you can specify. These include switches that enable you to
control how Hadoop processes your operations. For more information, see:
Code running in the RxHadoopMR compute context can utilize the full range of R packages for Hadoop, including rmr2 (R
Map/Reduce) which enables you to implement custom map/reduce functionality in R. Additionally,
remember that you can invoke arbitrary functions from ScaleR functions such as rxDataStep by using a
custom transformation with the transformFunc argument.
Retrieving it using a different compute context, and then transferring the data to HDFS.
Using a technology such as Apache Sqoop to import the data into HDFS on the cluster (Sqoop is
specifically designed to transfer data efficiently between Hadoop and structured stores such as
relational databases).
You can store and access data in the local Linux file system rather than HDFS. To do this, you can use the
rxSetFileSystem function and specify the RxNativeFileSystem file system. You can switch back to HDFS
by specifying the RxHdfsFileSystem file system.
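Switching between the two file systems can be sketched as:

```r
# Work with files in the local Linux file system rather than HDFS
rxSetFileSystem(RxNativeFileSystem())
# ... operations on local files ...

# Switch back to HDFS
rxSetFileSystem(RxHdfsFileSystem())
```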
Note: Remember that you can access data held in HDFS from outside of the RxHadoopMR
compute context by connecting directly to the file system with the rxHdfsConnect function.
The RxHadoopMR compute context provides a collection of helper functions for interacting with the
HDFS file system. Module 2 describes these functions, which are also summarized below:
rxHadoopCopyFromClient. Use this function to copy a file from a remote client to the HDFS file
system in the Hadoop cluster.
rxHadoopCopyFromLocal. Use this function to copy a file from the native file system to the HDFS
file system in the Hadoop cluster.
rxHadoopMove. Use this function to move a file around the HDFS file system.
rxHadoopRemove. This function removes a file from the HDFS file system.
rxHadoopFileExists. This function tests whether a specified file exists in an HDFS directory.
Another important function is rxHadoopCommand. You can use this function to perform any Hadoop
operation from within R, including submitting Hadoop Map/Reduce jobs.
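The helper functions listed above might be used like this. The file and directory names are assumptions:

```r
# Sketch only: paths are assumptions.
rxHadoopCopyFromLocal("/tmp/FlightDelayData.xdf", "/user/RevoShare/student01")

rxHadoopFileExists("/user/RevoShare/student01/FlightDelayData.xdf")

# rxHadoopCommand can run an arbitrary Hadoop operation from R
rxHadoopCommand("fs -ls /user/RevoShare/student01")
```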
For more information, see the section Use a Local Compute Context at:
1. You can use the system function to issue a Hive job on your cluster, saving your data as a text file on
HDFS. You can run further analyses on this data using either the RxHadoopMR or RxSpark compute
contexts.
2. If the result of your Hive job is relatively small, it might be better to use the RxOdbcData data source
to connect a remote client to Hive through ODBC. You can then either stream the results to the
remote client or download them as XDF in the local file system.
The following example creates a Hive table and loads data into it from a text file in HDFS. Note that you
need to supply the schema when working with data in Hive. When the table has been created, the
example returns the first 100 lines.
system(hive_query)
This code will dump the text file on to HDFS on the cluster. You can then switch back to your client and
connect to the cluster using the RxHadoopMR compute context and RxTextData data source.
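The `hive_query` string run above might be constructed along these lines. The table schema, column names, and HDFS paths are illustrative assumptions, not the course's exact code:

```r
# Sketch only: table name, schema, and paths are assumptions.
hive_query <- paste0(
  "hive -e \"",
  "CREATE TABLE IF NOT EXISTS census (age INT, sex STRING, income FLOAT) ",
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; ",
  "LOAD DATA INPATH '/user/RevoShare/student01/census.csv' INTO TABLE census; ",
  "INSERT OVERWRITE DIRECTORY '/user/RevoShare/student01/censusOut' ",
  "SELECT * FROM census LIMIT 100;\"")

system(hive_query)
```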
The next example shows how to read the census data from the previous example into a table, and then
return the top 100 rows of the table as a local XDF file.
For more information, see the section Using data from Hive for Your Analyses at:
The following Pig script loads a file and performs some filtering and transformations before saving the
results to HDFS:
In the R session, while logged into the cluster, you can run the Pig script like this:
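A minimal sketch of that command, assuming the script has been copied to the user's home directory on the cluster:

```r
# Sketch: the script path is an assumption.
result <- system("pig /home/student01/carrierDelays.pig", intern = TRUE)
```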
The results file is then accessible to R through the RxTextData data source. You can either work with this
on the cluster, using the RxHadoopMR compute context, or, if the data is not too large, pull the data
onto a remote client (see “Accessing HDFS locally for performing smaller computations”).
Note: You can also run Hive scripts in the same way, using the “-f” flag.
Demonstration Steps
3. Highlight and run the code under the comment # Create a Hadoop compute context. These
statements connect to the LON-HADOOP-01 server as instructor. Note that the host name is actually
the name of a PuTTY configuration file (LON-HADOOP) and not the name of the remote server. The
PuTTY configuration file contains the details and keys required to establish an SSH session on the
server.
4. Highlight and run the code under the comment # Copy FlightDelayData.xdf to HDFS
(/user/RevoShare/instructor). These statements remove the FlightDelayData.xdf file from HDFS (if
it exists), and then copy a new version of this file from the E:\Demofiles\Mod08 folder on the client
VM.
Note: If the rxHadoopRemove command reports No such file or directory, then the file
didn't exist. You can ignore this message.
5. Highlight and run the code under the comment # Verify that the file has been uploaded. This
statement uses the rxHadoopCommand function to display the contents of the
/user/RevoShare/instructor folder in HDFS. You should see the FlightDelayData.xdf file included in
the output.
2. Highlight and run the code under the comment # Create a Hadoop compute context on this
server. Note that you don't have to specify the host name this time because Hadoop is running on
the same machine as your session.
3. Highlight and run the code under the comment # Examine the structure of the data. This code
runs the rxGetVarInfo function to view the variables in the data file. Notice that you have to switch
to the HDFS file system; the default file system for the Hadoop compute context is the native Linux
file system. Additionally, you will see a few messages reported by the compute context before the
results are displayed. This is because the consoleOutput flag in the compute context is set to TRUE.
In a production environment, you would typically disable this feature, but it is useful for debugging in
a test and development environment.
Note: The output will also include the messages WARN util.NativeCodeLoader: Unable
to load native-hadoop library for your platform... using builtin-java classes where
applicable, and WARN shortcircuit.DomainSocketFactory: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
You can ignore these messages.
4. Highlight and run the code under the comment # Read a subset of the data into a data frame.
This code runs the rxImport function to fetch a sample of 10 percent of the data into memory.
5. Highlight and run the code under the comment # Summarize the flight delay data sample. This
code runs the rxSummary function over the data frame. Notice that Hadoop displays the message
Warning: Computations on data sets in memory cannot be distributed. Computations are
being done on a single node. The task is performed as a single-threaded operation rather than a
Map/Reduce job.
6. Highlight and run the code under the comment # Break the data file down into composite pieces.
The ScaleR functions in the Hadoop Map/Reduce compute context are optimized to work with
composite files. This code stores the composite version of the data file in the
/user/RevoShare/instructor/DelayData directory in HDFS. This directory must exist before creating
the file, so this code creates it using the rxHadoopMakeDir function.
7. Highlight and run the code under the comment # Examine the composite XDF file. This block of
code uses the rxHadoopListFiles function to display the contents of the DelayData folder and data
subfolder.
Note: The rxHadoopListFiles function does not work properly with R Server 9.0.1 or earlier,
which is why a previous step in this demonstration used the rxHadoopCommand function to run the
hadoop fs -ls command. However, the Hadoop server is running R Server 9.1.0, which has fixed
this issue.
8. Highlight and run the code under the comment # Perform a more complex analysis - compute a
crosstab. This code generates a crosstab of airlines and the airports that they serve, counting the
number of flights that have departed from each airport for that airline.
3. In the ID column, click the link for the most recent job. This should be the job that ran when you
created the crosstab. The details for the job should appear on a new page.
4. In the Logs column, click the Logs link. This page shows the trace for the job. This information is
useful for debugging purposes, if a ScaleR function fails for some reason.
5. Click the Back button in the toolbar to return to the previous page, and then click the Back button
again to return to the Job Tracking page listing all the recent jobs.
Lesson 3
Using ScaleR functions with Spark
Spark is an open source big data processing framework on Hadoop, similar to the Map/Reduce
framework. It differs from Map/Reduce in that Spark copies most of the data into RAM on individual
nodes and carries out operations in memory, whereas Map/Reduce writes all the data to disk on the
nodes after every operation. This difference makes Spark considerably faster than
Map/Reduce. Spark can also access data held in diverse sources including HDFS, Cassandra, HBase, and
S3.
You can work with Spark interactively using different programming languages: Scala, Java, Python and R.
One of the main advantages of Spark for data analytics is that it has a comprehensive data frames API that
is modelled on R data frames. For the R user, working with Spark data frames should be immediately
intuitive. Spark also has an extensive machine learning library, MLlib, which can deal with batch or
streaming applications and has a very active community.
R has been tightly integrated with Spark development from early on, and the ScaleR functions improve on
this still further. The Spark compute context enables you to conduct high performance analytics on Spark,
making use of Hadoop HDFS, with code that is very close to the code you would write for in-memory
data.
The ScaleR functions are also complemented by sparklyr, an open source R package developed by
RStudio. This package enables users to apply the popular dplyr methods of data manipulation directly to
Spark data frames. ScaleR and sparklyr can be used in tandem in the same R server session.
Finally, the Spark compute context enables you to access data directly in Hive and Parquet format using
data sources that are not available in the RxHadoopMR compute context.
Lesson Objectives
After completing this lesson, you will be able to:
Create and use the RxSpark compute context.
If you are logged into an R session running on an edge node of your Hadoop cluster, you can connect to
Spark by creating the compute context with the default values:
If you are connecting from R client, you can set up a compute context that will run distributed Spark jobs
remotely on your cluster. In this case, the compute context creates a remote SSH session on the cluster.
You will need to supply additional arguments to RxSpark to create the compute context. Specifically, you
must specify your user name, the file-sharing directory where you have read and write access, the
publicly-facing host name or IP address of your Hadoop cluster’s name node or an edge node that will
run the master processes, and any additional switches to pass to the SSH session (such as the -i flag if you
are using a pem or ppk file for authentication).
For example:
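A sketch of such a connection is shown below; the user name, directories, host name, and key file are assumptions:

```r
# Sketch only: argument values are assumptions.
sparkContext <- RxSpark(
  sshUsername  = "student01",
  shareDir     = "/var/RevoShare/student01",       # local file-sharing directory
  hdfsShareDir = "/user/RevoShare/student01",      # HDFS file-sharing directory
  sshHostname  = "LON-HADOOP-01.ukwest.cloudapp.azure.com",
  sshSwitches  = "-i /home/student01/hadoop.pem"   # key file for authentication
)
rxSetComputeContext(sparkContext)
```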
Note that, as with the RxHadoopMR compute context, you can save many of the security parameters for
the SSH session as a PuTTY or Cygwin session configuration file, and then reference this configuration in
the sshHostname argument of RxSpark. For examples showing various R client Spark configuration
setups, see:
All the startup parameters for a Spark job are available through the RxSpark compute context. Amongst
many others, the following tuning options are available as arguments to RxSpark. They enable you to
control how memory and processors in the cluster are allocated to your Spark job:
numExecutors. The number of individual processes to be set up for the job. Typically this will be one
executor for each node in the cluster, although you might want to reduce this if you are working on a
shared cluster since the default behavior is to launch as many executors as possible, which might use
up all resources and prevent other users from sharing the cluster.
executorMem. A character string specifying the amount of memory to assign to each executor (for
example, 1000M, 3G).
driverMem. A character string specifying the amount of memory to assign to the driver, or edge,
node.
See the R help file for RxSpark for a full list of the options available.
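The tuning arguments described above might be set like this; the values are illustrative only:

```r
# Sketch: values are illustrative, not recommendations.
sparkContext <- RxSpark(
  numExecutors = 4,     # for example, one executor per node on a four-node cluster
  executorMem  = "2G",  # memory per executor
  driverMem    = "1G"   # memory for the driver (edge) node
)
```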
Note that the RxHiveData and RxParquetData data sources can only be used within an RxSpark
compute context, and not with RxHadoopMR.
Until this is addressed in an upcoming version of R Server, the workaround is to maintain different
compute contexts for SparkR and for ScaleR. You can then exchange data through intermediate files. For
example, you might start a Spark context in SparkR to select some data stored in a data lake as Parquet,
perform some filtering, transformation and merging to construct a useable dataset, and then save this to
a CSV file on HDFS. You could then open an RxSpark compute context, read the CSV file into an XDF
object, split into a training and a test set, and build a logistic regression or boosting model on the
extracted data using the ScaleR modeling functions.
For a comprehensive example of a workflow involving sharing data between SparkR and ScaleR, see:
Note: Future releases of R Server are expected to enable SparkR and ScaleR to share data
within the same compute context.
Note that, although you can use the same compute context for your ScaleR and sparklyr code, the ScaleR
analysis functions can still only work with data in one of the four data sources listed previously (text files,
XDF, Hive and Parquet) and not directly with Spark data frames. However, it is simple to use the ScaleR
data source functions to convert to one of these data sources.
To integrate an RxSpark compute context with a sparklyr session, you specify the interop argument. You
then register a new sparklyr session within that compute context:
con <- rxSparkConnect(interop = "sparklyr")
sc <- rxGetSparklyrConnection(con)
The following example uses the diamonds dataset from the ggplot2 package. The code uses dplyr
functions to copy the local diamonds dataset onto the cluster as a Spark data frame and then to perform
some data manipulation and partition into test and training sets. When the manipulation code is run, you
need to “register” the data frames. This forces the execution of the data manipulation code, which is
needed because Spark operates “lazily” and doesn’t perform any computations until it is explicitly
instructed to do so. Without doing this, ScaleR would not be able to read the data frames into a data
source.
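The manipulation and partitioning steps that precede the registration calls might look like the following sketch, assuming a sparklyr connection `sc` already exists; the filter and partition values are illustrative:

```r
# Sketch only: sc is assumed to be an existing sparklyr connection.
library(dplyr)
library(sparklyr)

# Copy the local diamonds dataset to the cluster as a Spark data frame
diamonds_tbl <- copy_to(sc, ggplot2::diamonds, "diamonds")

# Example manipulation, then partition into training and test sets
trainTest <- diamonds_tbl %>%
  filter(carat < 3) %>%
  sdf_partition(training = 0.7, test = 0.3)
```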
sdf_register(trainTest$training, "diamonds_train")
sdf_register(trainTest$test, "diamonds_test")
When the data is in the correct form for modeling, you can use the ScaleR functions to upload the data
into a persistent data source (such as Hive), and then use the rxLinMod function to run a linear regression
model over that data in Hive. Note that, when you are creating Hive tables, you need to specify the types
of the variables you are reading in:
For more information about using ScaleR and sparklyr together, see:
https://aka.ms/i9r9t7
Which data source can you use to connect to a Hive database when using the
RxHadoopMR compute context?
RxOdbcData
RxHiveData
RxHadoopData
You don't need to use a specific data source. You can start a sparklyr session to
read the Hive data.
RxSpark
Objectives
In this lab, you will:
Use R code in the Hadoop Map/Reduce compute context to run a Pig script that generates data and
use ScaleR functions to examine the results.
Run sparklyr code in the Spark compute context to retrieve data from the Hive database, and filter
this data.
Lab Setup
Estimated Time: 30 minutes
Username: Adatum\AdatumAdmin
Password: Pa55w.rd
Before starting this lab, ensure that the following VMs are all running:
MT17B-WS2016-NAT
20773A-LON-DC
20773A-LON-DEV
20773A-LON-RSVR
20773A-LON-SQLR
This file is the Pig script that you will run. This script performs the following tasks:
It reads a CSV file named carriers.csv held in the /user/RevoShare/loginName directory in HDFS
(where loginName is your login name). This file contains airline codes and their names.
It reads another file called FlightDelayDataSample.csv which contains flight delay information.
This file contains a subset of the information about flight delays that you have been using
throughout the course.
It filters the flight delay information to find all delays that have a positive value in the
LateAircraftDelay or CarrierDelay fields. These fields indicate delays that are caused by the
airline.
It joins this data with the airline name in the carriers.csv file.
It writes a dataset containing the origin airport, destination, airline code, airline name, carrier
delay, and late aircraft delay fields to a file in Pig storage (located in the
/user/RevoShare/loginName/results directory in HDFS).
3. Edit the first line of the script, and change the text {specify login name} to your login name, as
shown in the following example:
Task 2: Upload the Pig script and flight delay data to HDFS
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
2. Create an RxHadoopMR compute context. Specify a login name for the sshUsername argument, set
the sshHostname to "LON-HADOOP", and set the consoleOutput argument to TRUE (this is so you
can view the messages that Hadoop displays).
Note: You might also want to remove any existing file with the same name in the HDFS
folder first. You can use the rxHadoopRemove function to do this.
5. Close the RxHadoopMR compute context, and use the remoteLogin function to create a remote
connection to R Server running on the Hadoop VM. The deployr_endpoint argument should be
http://fqdn:12800, where fqdn is the fully qualified domain name of the Hadoop VM in Azure (for
example, LON-HADOOP-01.ukwest.cloudapp.azure.com). The username is admin, and the password is Pa55w.rd.
7. Pause the remote session, use the putLocalFile function to copy the carrierDelays.pig file to the
remote session, and then resume the remote session.
If the command fails, you can examine the reason code for the failure in the result variable.
2. The Pig script saves the output to a directory named results in HDFS. Use the following code to verify
that this directory has been created. Replace studentnn with your login name:
3. Verify that the results directory contains two files: _SUCCESS and part-r-00000. The data is actually in
the part-r-00000 file. The _SUCCESS file should be empty; it is simply a flag created by the Pig script
to indicate that the data was successfully saved.
4. Use the rxHadoopRemove function to delete the _SUCCESS file from the results folder.
5. Use the rxGetVarInfo function to examine the structure of the data in the results directory. To do
this, switch to the HDFS file system, and create an RxTextData data source that references the
results directory (not the part-r-00000 file).
Note that the data does not include any useful field names in the schema information.
Task 4: Convert the results to XDF format and add field names
1. Create a colInfo list that can be used by the rxImport function to add schema information to a file as
it is imported. Map the fields in the existing results file as follows:
2. Use the rxImport function to create an XDF file from the results data using the column information
mapping you just defined. Save the XDF file as the CarrierData composite XDF file in your directory
under /user/RevoShare in HDFS.
3. While in the remote session, create a new RxHadoopMR compute context. Note that you do not
need to specify a host name as the context will run on the same computer as the remote session.
4. Run the rxGetVarInfo and rxSummary functions to verify the new mappings in the CarrierData XDF
file. Note that the rxSummary function is run as a Map/Reduce job.
2. Generate another histogram that shows the number of delayed flights by airline code. Use the XDF
file rather than the data frame for this graph.
Note that you might need to increase the resolution of the plot window to display the codes for all
airlines. You can do this using the png function. However, you must execute the png function and
the rxHistogram function together and not as separate commands in a remote session.
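Running the two functions together in the remote session might look like this; the file name, dimensions, and data source are assumptions:

```r
# Sketch: execute png and rxHistogram as one block in the remote session.
png("delaysByAirline.png", width = 1200, height = 600)
rxHistogram(~Airline, data = carrierData)
dev.off()
```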
3. Generate a bar chart showing the total delay time across all flights for each airline. Display the airline
name rather than the airline code.
Note that this graph requires you to use ggplot with the geom_bar function rather than
rxHistogram.
Results: At the end of this exercise, you will have run a Pig script from R, and analyzed the data that the
script produces.
Question: According to the histogram that displays delays against frequency, what is the
most common delay period across all airlines?
Question: Using the second histogram, which airline has had the most delayed flights?
2. Use the rxSort function to sort the cube in descending order of the Counts and AverageDelay
variables in the cube.
Note that the rxSort function is not inherently distributable, so you cannot use it directly in the
RxHadoopMR environment. However, you can use the rxExec function to run it if you set the
timesToRun argument of rxExec to 1.
3. Display the top 50 (the worst routes for delays), and the bottom 50 (the best routes).
Note that the value returned by rxExec is a list of results named rxElem1, rxElem2, and so on; one
item for each task performed in parallel. There was only one task (timesToRun was set to 1), so you
can find all the data in the rxElem1 field.
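A sketch of this pattern follows; the cube variable and column names are assumptions based on the step descriptions:

```r
# Sketch only: variable and column names are assumptions.
results <- rxExec(rxSort,
                  inData = delayCube,
                  sortByVars = c("Counts", "AverageDelay"),
                  decreasing = c(TRUE, TRUE),
                  timesToRun = 1)

sorted <- results$rxElem1   # only one task ran, so all the data is in rxElem1
head(sorted, 50)            # the worst routes for delays
tail(sorted, 50)            # the best routes
```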
4. Save the sorted data cube to the CSV file SortedDelayData.csv in your directory under the
/user/RevoShare directory in HDFS. Perform this operation using the local compute context rather
than Hadoop.
Question: Which route has the most frequent airline delays? How long is the average airline
delay on this route?
numExecutors: 10
executorCores: 2
executorMem: "1g"
driverMem: "1g"
Note that it is important to set the resource parameters of the RxSpark session appropriately,
otherwise you risk grabbing all the resources available and starving other concurrent users.
2. You will upload the data to a table named studentnnRouteDelays in Hive, where studentnn is your
login name. Create an RxHiveData data source that you can use to reference this table.
3. Use the RxDataStep function to upload the data. The inData argument should reference the CSV file
containing the sorted data cube, and the outFile argument should specify the Hive data source.
4. Use the rxSummary function over the Hive data source to verify that the data was uploaded
successfully.
2. In the PuTTY Configuration window, select the LON-HADOOP session, click Load, and then click
Open. You should be logged in to the Hadoop VM.
hive
This command lists each airline together with the number of delayed flights for that airline.
exit;
7. Close the PuTTY terminal window, and return to your R development environment.
2. Close the current RxSpark compute context and create a new one using the rxSparkConnect
function. Use the same parameters as before, but in addition set the interop argument to "sparklyr".
4. Use the sparklyr src_tbls function to list the tables available in Hive. Note that you should see not
only your table, but also the tables of the other students.
5. Cache your own routedelays table in the Spark session (using the tbl_cache function), retrieve the
contents of this table (using the tbl function), and display the first few rows from this table (using the
head function).
6. Construct a dplyr pipeline that filters the data in the table that you previously retrieved using the tbl
function to find all rows for American Airlines (code AA) that departed from New York JFK airport
(code JFK). Only include the Dest and AverageDelay columns in the results, which you should save in
a tibble (using collect).
7. Display the tibble, and use the rxSummary function to summarize the data in the tibble.
8. Use the rxSparkDisconnect function to terminate the sparklyr session and close the RxSpark
compute context.
9. Save the script as Lab8_2Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.
Results: At the end of this exercise, you will have used R code running in an RxSpark compute context to
upload data to Hive, and then analyzed the data by using a sparklyr session running in the RxSpark
compute context.
Incorporate Hadoop Map/Reduce functionality, together with Pig and Hive, into the ScaleR workflow.
Utilize Hadoop Spark features in a ScaleR workflow.
Course Evaluation
Your evaluation of this course will help Microsoft understand the quality of your learning experience.
Please work with your training provider to access the course evaluation form.
Microsoft will keep your answers to this survey private and confidential and will use your responses to
improve your future learning experience. Your open and honest feedback is valuable and appreciated.
If you are using Visual Studio 2015:
a. Click the Windows Start button, type Visual Studio 2015, and then click Visual Studio 2015.
b. In Visual Studio 2015, on the R Tools menu, click Data Science Settings.
If you are using RStudio:
a. Click the Windows Start button, click the RStudio program group, and then click RStudio.
b. On the File menu, point to New File, and then click R Script.
setwd("E:\\Labfiles\\Lab01")
2. Add the following code to the R file and run it. These statements create a data frame from the
2000.csv file and display the first 10 rows:
3. Add the following code to the R file and run it. The mName function returns the month name given
the month number. The code uses the lapply function to generate the month name for each row in
the data frame. The factor function converts this data into a factor. The result is added to the data
frame as the MonthName column:
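The transformation this step describes can be sketched in base R on a tiny stand-in vector (the lab's flightDataSampleDF and its mName helper are not reproduced here):

```r
# Map month numbers to month names, mirroring the lab's mName/lapply step
monthNumbers <- c(1, 3, 12, 3)

# month.name is a built-in vector of English month names
monthNames <- sapply(monthNumbers, function(m) month.name[m])

# Using explicit levels keeps the factor in calendar order
# rather than alphabetical order
monthFactor <- factor(monthNames, levels = month.name)
print(levels(monthFactor))
print(table(monthFactor)["March"])
```

Setting the levels explicitly matters later: cross tabulations and plots will then order January before February, rather than sorting the names alphabetically.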
4. Add the following code to the R file and run it. These statements summarize the data frame and time
how long the operation takes before displaying the results:
5. Add the following code to the R file and run it. These statements display the name of each column in
the data frame and the number of rows. The code then finds the minimum and maximum flight
arrival delay times:
print(names(flightDataSampleDF))
print(nrow(flightDataSampleDF))
print(min(flightDataSampleDF$ArrDelay, na.rm = TRUE))
print(max(flightDataSampleDF$ArrDelay, na.rm = TRUE))
6. Add the following code to the R file and run it. This code cross tabulates the month name against the
number of flights cancelled and not cancelled:
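In base R, a cross tabulation like this is typically produced with the table function; here is a minimal sketch on made-up data (the MonthName and Cancelled column names match the lab, the values do not):

```r
# Invented sample: month of flight and whether it was cancelled (1) or not (0)
flights <- data.frame(
    MonthName = factor(c("January", "January", "February", "February", "February"),
                       levels = month.name[1:2]),
    Cancelled = c(0, 1, 0, 0, 1)
)

# One row per month, one column per cancellation status
crossTab <- table(flights$MonthName, flights$Cancelled)
print(crossTab)
```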
Note: Record the console output, as this will be referenced in a later exercise.
Results: At the end of this exercise, you will have used either RTVS or RStudio to examine a subset of the
flight delay data for the year 2000.
2. Add the following statement to the R file and run it. This statement retrieves the number of variables
and observations from the data frame:
print(rxGetInfo(flightDataSampleDF))
3. Add the following statement to the R file and run it. This statement retrieves the details for each
variable in the data frame:
print(rxGetVarInfo(flightDataSampleDF))
4. Add the following code to the R file and run it. This statement calculates the quantiles for ArrDelay
variable in the data frame. The 0% quantile is the minimum value, and the 100% quantile is the maximum
value:
print(rxQuantile("ArrDelay", flightDataSampleDF))
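For comparison, base R's quantile function computes the same five default quantiles (0%, 25%, 50%, 75%, 100%) on an in-memory vector; unlike rxQuantile, it requires all of the data to fit in memory. The delay values below are invented:

```r
# Hypothetical arrival delays in minutes, including a missing value
arrDelay <- c(-5, 0, 3, 12, 47, 180, NA)

# na.rm = TRUE mirrors how missing arrival delays must be excluded
q <- quantile(arrDelay, na.rm = TRUE)
print(q)
```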
5. Add the following code to the R file and run it. This statement generates a cross tabulation on month
name against flight cancellations.
6. Add the following code to the R file and run it. This statement generates a cube of month name
against flight cancellations. The data should be the same as that for the cross tabulation. The
difference is the format in which it is returned:
print(rxCube(~MonthName:as.factor(Cancelled), flightDataSampleDF))
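The formatting difference this step describes can be seen with base R alone: table returns a wide contingency table, while converting it with as.data.frame yields the long, one-row-per-combination layout that rxCube uses (illustrative data only):

```r
# Invented sample values for the two factors being tabulated
cancelled <- factor(c(0, 1, 0, 0), levels = c(0, 1))
month <- factor(c("January", "January", "February", "February"),
                levels = c("January", "February"))

wide <- table(month, cancelled)   # one row per month, one column per status
long <- as.data.frame(wide)       # one row per (month, status) combination
print(long)
```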
7. Add the following code to the R file and run it. This statement removes the data frame from session
memory:
rm(flightDataSampleDF)
Note: Record the console output, as it will be compared against these results in a later exercise.
Results: At the end of this exercise, you will have used the ScaleR functions to examine the flight delay
data for the year 2000, and compared the results against those generated by using the base R functions.
2. Add the following code to the R file and run it. This statement pauses the remote session and returns
you to the local session:
pause()
3. Add the following statement to the R file and run it. This statement copies the file 2000.csv to the
remote server:
putLocalFile(c("2000.csv"))
4. Add the following statement to the R file and run it. This statement returns you to the remote session:
resume()
Add the following code to the R file and run it. Verify that the MonthName column appears in the
output:
head(flightDataSampleDF, 10)
pause()
2. Add the following statement to the R file and run it. This statement copies the remote variables back
to the local session:
3. Add the following statements to the R file and run them. This code displays the contents of the
variables you have just copied. They should match the values from exercise 2:
print(rxRemoteDelaySummary)
print(rxRemoteInfo)
print(rxRemoteVarInfo)
print(rxRemoteQuantileInfo)
print(rxRemoteCrossTabInfo)
print(rxRemoteCubeInfo)
4. Add the following statement to the R file and run it. This statement logs out of the remote session:
remoteLogout()
5. Save the script as Lab1Script.R in the E:\Labfiles\Lab01 folder, and close your R development
environment.
Results: At the end of this exercise, you will have used the ScaleR functions in a remote session to examine
the flight delay data for the year 2000, and compared the results against those generated locally.
2. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
3. Add the following statement to the R file and run it to set the working directory:
setwd("E:\\Labfiles\\Lab02")
4. Add and run the following statements to import the first 10 rows of the 2000.csv file into a data
frame:
5. Add and run the following statement to view the structure of the data frame:
rxGetVarInfo(flightDataSample)
6. Add and run the following statement to create the flightDataColumns vector:
7. Add and run the following statements to import the CSV data into the 2000.xdf file:
8. Add and run the following statement to view the structure of the data frame:
rxGetVarInfo(flightDataXdf)
9. Open File Explorer and move to the E:\Labfiles\Lab02 folder. Verify that the 2000.csv file is
approximately 155 MB in size, whereas the 2000.xdf file is just under 28 MB.
2. Add and run the following statement to generate a summary across all numeric fields in the CSV file.
Make a note of the timings reported by the system.time function:
3. Add and run the following statement to generate the same summary for the XDF file. Compare the
timings reported by the system.time function against those for the CSV file:
4. The timings for the XDF file should be significantly quicker than those of the CSV file. If you need to
satisfy yourself that both statements are performing the same task, add print statements that display
the values of the csvDelaySummary and xdfDelaySummary variables, as follows:
print(csvDelaySummary)
print(xdfDelaySummary)
5. Add and run the following statement to generate a cross-tabulation that summarizes cancellations by
month in the CSV file. Make a note of the timings:
6. Add and run the following statement to generate the same cross-tabulation for the XDF file. Compare
the timings against those for the CSV file:
7. Add and run the following statement to generate a cube that summarizes cancellations by month in
the CSV file. Make a note of the timings:
8. Add and run the following statement to generate the same cube for the XDF file. Compare the
timings against those for the CSV file:
Results: At the end of this exercise, you will have created a new XDF file containing the airline delay data
for the year 2000, and you will have performed some operations to test its performance.
2. Add the following statement to your R script, and run it. This statement creates a remote connection
to the LON-RSVR VM. When prompted, specify the username admin with the password Pa55w.rd:
3. At the REMOTE> prompt, add and run the following command to temporarily pause the remote
session:
pause()
4. Add and run the following statement. This statement copies the local variable flightDataColumns to
the remote session:
putLocalObject(c("flightDataColumns"))
5. Add and run the following statement to resume the remote session:
resume()
6. Add and run the following statement. This statement lists the variables in the remote session. Verify
that the flightDataColumns variable is listed:
ls()
2. Add and run the following statement to examine the first few rows in the XDF file:
head(flightDataSampleXDF, 100)
3. Using File Explorer, delete the file Sample.xdf from the \\LON-RSVR\Data share.
4. In your R environment, add and run the following statements. This code imports and transforms all of
the files in the \\LON-RSVR\Data share:
rxOptions(reportProgress = 1)
delayXdf <- "\\\\LON-RSVR\\Data\\FlightDelayData.xdf"
flightDataCsvFolder <- "\\\\LON-RSVR\\Data\\"
flightDataXDF <- rxImport(inData = flightDataCsvFolder, outFile = delayXdf,
    overwrite = TRUE,
    append = ifelse(file.exists(delayXdf), "rows", "none"),
    colClasses = flightDataColumns,
    transforms = list(
        Delay = ArrDelay + DepDelay +
            ifelse(is.na(CarrierDelay), 0, CarrierDelay) +
            ifelse(is.na(WeatherDelay), 0, WeatherDelay) +
            ifelse(is.na(NASDelay), 0, NASDelay) +
            ifelse(is.na(SecurityDelay), 0, SecurityDelay) +
            ifelse(is.na(LateAircraftDelay), 0, LateAircraftDelay),
        MonthName = factor(month.name[as.numeric(Month)], levels = month.name)),
    rowSelection = (Cancelled == 0),
    varsToDrop = c("FlightNum", "TailNum", "CancellationCode"),
    rowsPerRead = 500000
)
5. Add and run the following statement to close the remote session:
exit
Results: At the end of this exercise, you will have created a new XDF file containing the cumulative
airline delay data for the years 2000 through 2008, and you will have performed some transformations
on this data.
2. Add and run the following statement. This statement displays the first six rows from the Airports
table:
head(airportData)
3. Add and run the following statements. This code imports the data from the SQL Server database into
a data frame, and converts all string data to factors:
4. Add and run the following statement that displays the first six rows of the data frame. Verify that they
are the same as the original SQL Server data:
head(airportInfo)
2. At the REMOTE> prompt, add and run the following command to temporarily pause the remote
session:
pause()
3. Add and run the following statement that copies the local airportInfo data frame to the remote
session:
putLocalObject(c("airportInfo"))
4. Add and run the following statement to resume the remote session:
resume()
5. Add and run the following statements that import the flight delay data and combine the state
information from the airport data:
6. Add and run the following statement that displays the first six rows of XDF file. Verify that they
include the OriginState and DestState variables:
head(enhancedXdf)
Results: At the end of this exercise, you will have augmented the flight delay data with the state in
which the origin and destination airports are located.
delayFactor <- expression(list(Delay = cut(Delay,
    breaks = c(0, 1, 30, 60, 120, 180, 181),
    labels = c("No delay", "Up to 30 mins", "30 mins - 1 hour",
               "1 hour to 2 hours", "2 hours to 3 hours", "More than 3 hours"))))
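The behavior of this cut call can be checked in isolation on a plain numeric vector; each break interval (left-open, right-closed by default) maps to one label:

```r
# One sample delay for each band defined by the breaks above
delays <- c(0.5, 25, 45, 90, 150, 181)

bands <- cut(delays,
             breaks = c(0, 1, 30, 60, 120, 180, 181),
             labels = c("No delay", "Up to 30 mins", "30 mins - 1 hour",
                        "1 hour to 2 hours", "2 hours to 3 hours",
                        "More than 3 hours"))
print(bands)
```

Because the intervals are right-closed, a delay of exactly 1 minute falls into "No delay" and exactly 30 minutes into "Up to 30 mins".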
2. Add and run the following statements. The first statement generates a cross-tabulation that
summarizes the delay by origin airport. It uses the delayFactor expression to transform the Delay
variable. The second statement displays the results:
3. Add and run the following statements to generate and display the cross-tabulation of delays by
destination airport:
4. Add and run the following statements to generate and display the cross-tabulation of delays by
origin state:
5. Add and run the following statements to generate and display the cross-tabulation of delays by
destination state:
6. Add and run the following command to close the remote session:
exit
install.packages("dplyr")
install.packages("devtools")
devtools::install_github("RevolutionAnalytics/dplyrXdf")
library(dplyr)
library(dplyrXdf)
2. Add and run the following code to create a data source that retrieves the required columns from the
XDF data:
3. Add and run the following code. This code is a dplyrXdf pipeline that calculates the mean delay by
origin airport and sorts them in descending order. The airport with the longest delays will be at the
top:
4. Add and run the following code that calculates the mean delay by destination airport and sorts them
in descending order:
5. Add and run the following code that calculates the mean delay by origin state and sorts them in
descending order:
6. Add and run the following code that calculates the mean delay by destination state and sorts them in
descending order:
7. Save the script as Lab2Script.R in the E:\Labfiles\Lab02 folder, and close your R development
environment.
Results: At the end of this exercise, you will have examined flight delays by origin and destination
airport and state.
2. Start your R development environment of choice (Visual Studio or RStudio), and create a new R file.
3. Add the following statement to the R file and run it to set the working directory:
setwd("E:\\Labfiles\\Lab03")
4. Add the following statements to the R file and run them to create an XDF data source that references
the data file:
5. Add the following statements to the R file and run them to import the sample data into a data frame:
rxOptions(reportProgress = 1)
delayPlotData <- rxImport(flightDelayData, rowsPerRead = 1000000,
    varsToKeep = c("Distance", "Delay", "Origin", "OriginState"),
    rowSelection = (Distance > 0) &
        as.logical(rbinom(n = .rxNumRows, size = 1, prob = 0.02))
)
install.packages("tidyverse")
library(tidyverse)
2. Add and run the following statement. This code creates a scatter plot of flight distance on the x axis
against delay time on the y axis:
ggplot(data = delayPlotData) +
geom_point(mapping = aes(x = Distance, y = Delay)) +
xlab("Distance (miles)") +
ylab("Delay (minutes)")
3. Add and run the following statement. This code creates a line plot of flight distance on the x axis
against delay time on the y axis:
delayPlotData %>%
filter(!is.na(Delay) & (Delay >= 0) & (Delay <= 1000)) %>%
ggplot(mapping = aes(x = Distance, y = Delay)) +
xlab("Distance (miles)") +
ylab("Delay (minutes)") +
geom_point(alpha = 1/50) +
geom_smooth(color = "red")
4. Add and run the following statement. This code creates a faceted plot organized by departure state:
delayPlotData %>%
filter(!is.na(Delay) & (Delay >= 0) & (Delay <= 1000)) %>%
ggplot(mapping = aes(x = Distance, y = Delay)) +
xlab("Distance (miles)") +
ylab("Delay (minutes)") +
geom_point(alpha = 1/50) +
geom_smooth(color = "red") +
theme(axis.text = element_text(size = 6)) +
facet_wrap( ~ OriginState, nrow = 8)
Results: At the end of this exercise, you will have used the ggplot2 package to generate line plots that
depict flight delay times as a function of distance traveled and departure state.
2. Add and run the following code. This code creates a cube summarizing the data of interest from the
XDF file:
3. Add and run the following statement. This code changes the name of the first column in the cube to
Distance (it was F_Distance):
4. Add and run the following statement. This code creates a data frame from the cube:
5. Add and run the following statement. This code uses the rxLinePlot function to generate a scatter
plot of DelayPercent versus Distance:
2. Add and run the following statement. This code facets the plot by OriginState:
2. Add and run the following code. This statement refactors the DayOfWeek variable in the XDF data:
3. Add and run the following code. This statement creates a cube that summarizes the data:
4. Add and run the following code. This statement creates a data frame from the cube:
5. Add and run the following statement that generates a line plot of delay against the day of the week.
Note that you may have to move the Plots window to view the graph clearly:
Results: At the end of this exercise, you will have used the rxLinePlot function to generate line plots that
depict flight delay times as a function of flight time and day of the week.
2. Add and run the following code. This code creates a histogram that counts the frequency of arrival
delays:
3. Add and run the following code. This code creates a histogram that shows the frequency of arrival
delays as a percentage:
4. Add and run the following code. This code creates a histogram that shows the frequency of arrival
delays as a percentage, organized by state:
5. Add and run the following code. This code creates a histogram that shows the frequency of weather
delays:
6. Add and run the following code. This code creates a histogram that shows the frequency of weather
delays organized by month:
7. Save the script as Lab3Script.R in the E:\Labfiles\Lab03 folder, and close your R development
environment.
Results: At the end of this exercise, you will have used the rxHistogram function to create histograms
that show the relative rates of arrival delay by state, and weather delay by month.
3. Add the following code to the R file and run it. This code retrieves the information for the iata
variable in the airportData XDF file and displays it. The code then performs this same operation for
the Origin and Dest variables in the FlightDelayData XDF file:
airportData = RxXdfData("\\\\LON-RSVR\\Data\\airportData.xdf")
flightDelayData = RxXdfData("\\\\LON-RSVR\\Data\\flightDelayData.xdf")
iataFactor <- rxGetVarInfo(airportData, varsToKeep = c("iata"))
print(iataFactor)
originFactor <- rxGetVarInfo(flightDelayData, varsToKeep = c("Origin"))
print(originFactor)
destFactor <- rxGetVarInfo(flightDelayData, varsToKeep = c("Dest"))
print(destFactor)
4. Add the following code to the R file and run it. This code creates a new set of factor levels using the
levels in the iata, Origin, and Dest variables:
5. Add the following code to the R file and run it. This code refactors the iata variable in the
airportData XDF file with the new factor levels:
rxOptions(reportProgress = 2)
refactoredAirportDataFile <- "\\\\LON-RSVR\\Data\\RefactoredAirportData.xdf"
refactoredAirportData <- rxFactors(inData = airportData,
    outFile = refactoredAirportDataFile, overwrite = TRUE,
    factorInfo = list(iata = list(newLevels = refactorLevels))
)
6. Add the following code to the R file and run it. This code refactors the Origin and Dest variables in
the FlightDelayData XDF file with the new factor levels:
7. Add the following code to the R file and run it. This code displays the new factor levels for the iata,
Origin, and Dest variables. They should all be the same now:
2. Add the following code to the R file and run it. This code reblocks the airport data XDF file:
3. Add the following code to the R file and run it. This code uses the rxMerge function to merge the
two XDF files, performing an inner join over the Origin field:
4. Add the following code to the R file and run it. This code examines the structure of the new flight
delay data file that should now include the OriginTimeZone variable, and displays the first and last
few rows:
rxGetVarInfo(mergedFlightDelayData)
head(mergedFlightDelayData)
tail(mergedFlightDelayData)
Results: At the end of this exercise, you will have created a new dataset that combines information from
the flight delay data and airport information datasets.
rxOptions(reportProgress = 1)
flightDelayDataSubsetFile <- "\\\\LON-RSVR\\Data\\flightDelayDataSubset.xdf"
flightDelayDataSubset <- rxDataStep(inData = mergedFlightDelayData,
    outFile = flightDelayDataSubsetFile, overwrite = TRUE,
    rowSelection = rbinom(.rxNumRows, size = 1, prob = 0.005)
)
2. Add the following code to the R file and run it. This code displays the metadata for the XDF file
containing the sample data:
install.packages("lubridate")
2. Add the following code to the R file and run it. This code implements the standardizeTimes
transformation function:
standardizeTimes <- function(dataList) {
    # Positions of the existing variables in the incoming data list
    departureYearVarIndex <- 1
    departureMonthVarIndex <- 2
    departureDayVarIndex <- 3
    departureTimeStringVarIndex <- 4
    elapsedTimeVarIndex <- 5
    departureTimezoneVarIndex <- 6
    # Positions of the new variables appended to the data list
    departureTimeVarIndex <- 7
    arrivalTimeVarIndex <- 8
# Iterate through the rows and add the standardized arrival and departure times
for (i in 1:.rxNumRows) {
# Get the local departure time details
departureYear <- dataList[[departureYearVarIndex]][i]
departureMonth <- dataList[[departureMonthVarIndex]][i]
departureDay <- dataList[[departureDayVarIndex]][i]
departureHour <- trunc(as.numeric(dataList[[departureTimeStringVarIndex]][i]) / 100)
departureMinute <- as.numeric(dataList[[departureTimeStringVarIndex]][i]) %% 100
departureTimeZone <- dataList[[departureTimezoneVarIndex]][i]
# Construct the departure date and time, including timezone
departureDateTimeString <- paste(departureYear, "-", departureMonth, "-",
departureDay, " ", departureHour, ":", departureMinute, sep="")
departureDateTime <- as.POSIXct(departureDateTimeString, tz = departureTimeZone)
# Convert to UTC and store it
standardizedDepartureDateTime <- format(departureDateTime, tz="UTC")
dataList[[departureTimeVarIndex]][i] <- standardizedDepartureDateTime
# Calculate the arrival date and time
# Do this by adding the elapsed time to the departure time
# The elapsed time is stored as the number of minutes (an integer)
elapsedTime <- dataList[[elapsedTimeVarIndex]][i]
standardizedArrivalDateTime <- format(as.POSIXct(standardizedDepartureDateTime) +
minutes(elapsedTime))
# Store it
dataList[[arrivalTimeVarIndex]][i] <- standardizedArrivalDateTime
}
# Return the data including the new variables
return(dataList)
}
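The core conversion inside the loop can be exercised on a single hypothetical flight: split an HHMM departure time into hour and minute, build a local POSIXct value, and render it in UTC (the date and time zone below are invented for illustration):

```r
depTimeString <- "1430"                      # HHMM, as in the flight data

# Integer division and modulo split HHMM into its components
depHour <- trunc(as.numeric(depTimeString) / 100)
depMinute <- as.numeric(depTimeString) %% 100

# Build the local departure instant, including its time zone
localDeparture <- as.POSIXct(paste0("2000-1-15 ", depHour, ":", depMinute),
                             tz = "America/New_York")

# format with tz = "UTC" renders the same instant in UTC
utcDeparture <- format(localDeparture, tz = "UTC")
print(utcDeparture)
```

Because New York is five hours behind UTC in January, a 14:30 local departure standardizes to 19:30 UTC.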
3. Add the following code to the R file and run it. This code uses the rxDataStep function to perform
the transformation:
4. Add the following code to the R file and run it. This code examines the transformed data file:
rxGetVarInfo(flightDelayDataTimeZones)
head(flightDelayDataTimeZones)
tail(flightDelayDataTimeZones)
Results: At the end of this exercise, you will have implemented a transformation function that adds
variables containing the standardized departure and arrival times to the flight delay dataset.
2. Add the following code to the R file and run it. This code displays the first and last few lines of the
sorted file. Examine the data in the StandardizedDepartureTime variable:
head(sortedFlightDelayData)
tail(sortedFlightDelayData)
2. Add the following code to the R file and run it. This code uses rxDataStep to run the
calculateCumulativeAverageDelays transformation function:
rxGetVarInfo(flightDelayDataWithAverages)
head(flightDelayDataWithAverages)
tail(flightDelayDataWithAverages)
2. Add the following code to the R file and run it. This creates a scatter and regression plot showing the
cumulative average delay for flights from ATL to PHX (Atlanta to Phoenix):
rxLinePlot(CumulativeAverageDelayForRoute ~ as.POSIXct(StandardizedDepartureTime),
type = c("p", "r"),
flightDelayDataWithAverages,
rowSelection = (Origin == "ATL") & (Dest == "PHX"),
yTitle = "Cumulative Average Delay for Route",
xTitle = "Date"
)
3. Repeat step 2 and replace the rowSelection argument with each of the following values in turn:
4. Save the script as Lab4Script.R in the E:\Labfiles\Lab04 folder, and close your R development
environment.
Results: At the end of this exercise, you will have sorted data, and created and tested another
transformation function.
3. In File Explorer, right-click the C:\Data folder, click Share with, and then click Specific people.
4. In the File Sharing dialog box, click the drop-down list, click Everyone, and then click Add.
5. In the lower pane, click the Everyone row, and set the Permission Level to Read/Write.
6. Click Share.
7. In the File Sharing dialog box, verify that the file share is named \\LON-RSVR\Data, and then click
Done.
5. Verify that the files are copied successfully, and then close the command prompt window.
2. Add the following statement to your R script, and run it. This statement installs the dplyr package:
install.packages("dplyr")
3. Add the following code to your R script, and run it. These statements bring the dplyr and
RevoPemaR libraries into scope:
library(dplyr)
library(RevoPemaR)
4. Add the following code to your R script, but do not run it yet. This code defines the PemaFlightDelays
class generator:
5. Add the following code to your R script, after the contains line, but before the closing brace. Do not
run it yet. This code adds the fields to the class:
fields = list(
totalFlights = "numeric",
totalDelays = "numeric",
origin = "character",
dest = "character",
airline = "character",
delayTimes = "vector",
results = "list"
),
6. Add the following code to your R script, after the fields list, but before the closing brace. Do not run
it yet. This code defines the initialize method:
methods = list(
initialize = function(originCode = "", destinationCode = "",
airlineCode = "", ...) {
'initialize fields'
callSuper(...)
usingMethods(.pemaMethods)
totalFlights <<- 0
totalDelays <<- 0
delayTimes <<- vector(mode="numeric", length=0)
origin <<- originCode
dest <<- destinationCode
airline <<- airlineCode
},
7. Add the following code to your R script, after the initialize method, but before the closing brace. Do
not run it yet. This code defines the processData method:
processData = function(dataList) {
'Generates a vector of delay times for specified variables in the current chunk of data.'
data <- as.data.frame(dataList)
# If no origin was specified, default to the first value in the dataset
if (origin == "") {
origin <<- as.character(as.character(data$Origin[1]))
}
# If no destination was specified, default to the first value in the dataset
if (dest == "") {
dest <<- as.character(as.character(data$Dest[1]))
}
# If no airline was specified, default to the first value in the dataset
if (airline == "") {
airline <<- as.character(as.character(data$UniqueCarrier[1]))
}
# Use dplyr to filter by origin, dest, and airline,
# update the number of flights, replace missing delay values with zero,
# and only include delayed flights in the results
matched <- data %>%
    filter(Origin == origin, Dest == dest, UniqueCarrier == airline)
totalFlights <<- totalFlights + nrow(matched)
temp <- matched %>%
    mutate(Delay = ifelse(is.na(Delay), 0, Delay)) %>%
    filter(Delay > 0) %>%
    select(Delay)
# Store the result in the delayTimes vector
delayTimes <<- c(delayTimes, as.vector(temp[,1]))
totalDelays <<- length(delayTimes)
invisible(NULL)
},
8. Add the following code to your R script, after the processData method, but before the closing brace.
Do not run it yet. This code defines the updateResults method:
updateResults = function(pemaFlightDelaysObj) {
'Updates total observations and the delayTimes vector from another PemaFlightDelays object.'
# Update the totalFlights and totalDelays fields
totalFlights <<- totalFlights + pemaFlightDelaysObj$totalFlights
totalDelays <<- totalDelays + pemaFlightDelaysObj$totalDelays
# Append the delay data to the delayTimes vector
delayTimes <<- c(delayTimes, pemaFlightDelaysObj$delayTimes)
invisible(NULL)
},
9. Add the following code to your R script, after the updateResults method, but before the closing
brace. Do not run it yet. This code defines the processResults method:
processResults = function() {
'Generates a list containing the results:'
' The first element is the number of flights made by the airline'
' The second element is the number of delayed flights'
' The third element is the list of delay times'
results <<- list("NumberOfFlights" = totalFlights,
"NumberOfDelays" = totalDelays,
"DelayTimes" = delayTimes)
return(results)
}
)
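The PEMA pattern itself (process chunks independently, then merge partial results) does not depend on RevoPemaR. A stripped-down sketch using base R reference classes shows the same processData/updateResults flow; the ChunkCounter class and its fields are invented for illustration:

```r
library(methods)

ChunkCounter <- setRefClass("ChunkCounter",
    fields = list(
        totalFlights = "numeric",
        delayTimes = "numeric"
    ),
    methods = list(
        initialize = function(...) {
            callSuper(...)
            totalFlights <<- 0
            delayTimes <<- numeric(0)
        },
        # Process one chunk: count its rows, keep only positive delays
        processData = function(chunk) {
            totalFlights <<- totalFlights + nrow(chunk)
            delayTimes <<- c(delayTimes, chunk$Delay[chunk$Delay > 0])
        },
        # Merge partial results accumulated by another instance
        updateResults = function(other) {
            totalFlights <<- totalFlights + other$totalFlights
            delayTimes <<- c(delayTimes, other$delayTimes)
        }
    )
)

a <- ChunkCounter$new()
a$processData(data.frame(Delay = c(10, -2, 0)))
b <- ChunkCounter$new()
b$processData(data.frame(Delay = c(5, 7)))
a$updateResults(b)
print(a$totalFlights)
print(a$delayTimes)
```

RevoPemaR adds the chunked data feeding, parallel dispatch, and the final processResults call on top of exactly this shape.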
10. Highlight and run the code that you entered in this task, starting at step 4. Verify that no errors
are reported.
2. Add the following statement to your R script, and run it. This statement creates a remote connection
to the LON-RSVR VM. Specify the username admin with the password Pa55w.rd when prompted:
3. At the REMOTE> prompt, add and run the following command to temporarily pause the remote
session:
pause()
4. Add and run the following statement. This statement copies the local variable pemaFlightDelaysObj
to the remote session:
putLocalObject("pemaFlightDelaysObj")
5. Add and run the following statement to resume the remote session:
resume()
install.packages("dplyr")
library(dplyr)
library(RevoPemaR)
7. Add the following statements to your R script, and run them. These statements create a data frame
comprising the first 50,000 observations from the FlightDelayData.xdf file in the \\LON-RSVR\Data
share.
8. Add the following statements to your R script, and run them. This code uses the pemaCompute
function to run the pemaFlightDelaysObj object to perform an analysis of flights from "ABE" to "PIT"
made by airline "US":
$NumberOfFlights
[1] 755
$NumberOfDelays
[1] 188
$DelayTimes
[1] 3 10 3 16 54 2 61 65 54 12 18 23 92 16 153 7 18 2
[19] 21 61 2 1 1 4 40 67 1 82 6 3 112 298 39 21 13 2
[37] 1 12 2 131 474 85 27 352 9 2 49 24 18 60 43 28 126 109
[55] 40 39 53 34 120 3 274 73 57 3 83 27 58 53 15 8 58 61
[73] 1 117 34 32 9 19 66 44 2 82 17 21 9 103 2 45 4 64
[91] 3 48 52 17 5 11 7 1 18 23 43 29 7 46 22 71 16 18
[109] 9 62 27 120 10 12 11 6 10 4 50 4 1 6 1 129 3 9
[127] 185 5 11 17 19 171 2 81 3 17 1 33 21 2 45 8 27 29
[145] 42 25 40 5 1 15 1 59 4 10 6 81 13 45 37 6 9 1
[163] 7 1 2 2 2 5 109 3 15 7 25 58 17 45 289 5 7 7
[181] 38 89 3 34 12 15 129 19
9. Add the following statements to your R script, and run them. This code displays the values of the
internal fields in the pemaFlightDelaysObj object:
print(pemaFlightDelaysObj$delayTimes)
print(pemaFlightDelaysObj$totalDelays)
print(pemaFlightDelaysObj$totalFlights)
print(pemaFlightDelaysObj$origin)
print(pemaFlightDelaysObj$dest)
print(pemaFlightDelaysObj$airline)
2. Add the following statements to your R script, and run them. This code performs the same analysis as
before, but using the XDF file:
3. Verify that the results are the same as before (755 flights, with 188 delayed).
4. Add the following statements to your R script, and run them. This code performs a further analysis
using the XDF file:
5. Examine the results. Note the number of flights made, the number of flights that were delayed, and
the length of the longest delay.
6. Save the script as Lab5Script.R in the E:\Labfiles\Lab05 folder, and close your R development
environment.
Results: At the end of this exercise, you will have created and run a PEMA class that finds the number of
times flights that match a specified origin, destination, and airline are delayed—and how long each delay
was.
Task 2: Examine the relationship between flight delays and departure times
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
2. Add the following code to the R file and run it. This statement creates a remote R session on the LON-
RSVR server:
3. Add the following code to the R file and run it. This code creates a data file containing a random
sample of 10 percent of the flight delay data:
rxOptions(reportProgress = 1)
flightDelayData <- RxXdfData("\\\\LON-RSVR\\Data\\flightDelayData.xdf")
sampleDataFile <- "\\\\LON-RSVR\\Data\\flightDelayDatasample.xdf"
flightDelayDataSample <- rxDataStep(inData = flightDelayData,
                                    outFile = sampleDataFile, overwrite = TRUE,
                                    rowSelection = rbinom(.rxNumRows, size = 1, prob = 0.10))
4. Add the following code to the R file and run it. This code displays a scatter plot with a regression line
showing how flight delays vary with departure time throughout the day:
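The exact call is not shown in this step; a sketch using rxLinePlot, with the Delay and DepTime variable names taken from later steps in this lab:

```r
# Scatter plot of delay against departure time, with a regression line
# ("p" plots points, "r" adds a regression line)
rxLinePlot(Delay ~ DepTime, data = flightDelayDataSample,
           type = c("p", "r"),
           title = "Flight delays by departure time")
```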
5. Add the following code to the R file and run it. This code creates an expression that you can use to
factorize the departure times by hour:
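A sketch of such an expression, assuming DepTime is recorded as a decimal hour of the day; the variable name depTimeFactor is an assumption:

```r
# Expression that buckets departure times into one-hour factor levels;
# assumes DepTime is numeric in the range 0-24
depTimeFactor <- expression(list(DepTime = cut(as.numeric(DepTime),
                                               breaks = seq(0, 24, by = 1))))
```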
6. Add the following code to the R file and run it. This code generates a histogram showing the number
of departures for each hour:
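A sketch of the histogram, assuming the factorizing expression created in the previous step is named depTimeFactor:

```r
# Histogram of departures per hour, applying the hour-bucket transform
rxHistogram(~DepTime, data = flightDelayDataSample,
            transforms = depTimeFactor,
            xTitle = "Departure hour")
```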
2. Add the following code to the R file and run it. This code calculates the ratio of the between-cluster sum of squares to the total sum of squares for this model. The value returned should be in the high 80 percent to low 90 percent range:
delayCluster$betweenss / delayCluster$totss
3. Add the following code to the R file and run it. This code displays the cluster centers. There should be
12 rows, showing the value of DepTime and Delay used as the centroid values for each cluster:
delayCluster$centers
4. Add the following code to the R file and run it. These statements create a parallel compute context
and register the RevoScaleR parallel back end with the foreach package:
library(doRSR)
registerDoRSR()
# Maximize parallelism
rxSetComputeContext(RxLocalParallel())
5. Add the following code to the R file and run it. This block of code runs a foreach loop to generate
the cluster models and calculate the sums of squares ratio for each model:
numClusters <- 12
testClusters <- vector("list", numClusters)
# Create the cluster models
foreach (k = 1:numClusters) %dopar% {
testClusters[[k]] <<- rxKmeans(formula = ~DepTime + Delay,
data = flightDelayDataSample,
transforms = list(DepTime = as.numeric(DepTime)),
numClusters = k * 2
)
}
Note: At the time of writing, there was still some instability in R Server running on
Windows. Placing it under a high parallel load can cause it to close the remote session. If this
step fails and returns to the local session on R Client, run the following code, and then repeat
this step:
resume()
rxSetComputeContext(RxLocalSeq())
6. Add the following code to the R file and run it. This block of code calculates the sums of squares ratio
for each model:
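The code is not reproduced here; a sketch of one way to compute the ratios, assuming testClusters holds the models built in the previous step:

```r
# Ratio of between-cluster to total sum of squares for each model
ssRatios <- sapply(testClusters, function(model) model$betweenss / model$totss)
ssRatios
```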
7. Add the following code to the R file and run it. This code generates a scatter plot that shows the
number of clusters on the X-axis and the sums of squares ratio on the Y-axis. The graph should
suggest that the optimal number of clusters is 18. This is the point at which additional clusters add
little value to the model:
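A sketch of the plot, assuming the ratios from the previous step were stored in a vector such as ssRatios (each model k in the loop used k * 2 clusters):

```r
# Elbow plot: number of clusters vs sum-of-squares ratio
plotData <- data.frame(numberOfClusters = (1:numClusters) * 2, ratio = ssRatios)
rxLinePlot(ratio ~ numberOfClusters, data = plotData, type = "p")
```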
Results: At the end of this exercise, you will have determined the optimal number of clusters to create,
and built the appropriate cluster model.
k <- 9
clusterModel <- rxLinMod(Delay ~ DepTime,
                         data = as.data.frame(testClusters[[k]]$centers),
                         covCoef = TRUE)
2. Add the following code to the R file and run it. This code makes predictions about delays in the test
data using the linear model:
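The call is not shown at this point; a sketch using rxPredict, in which the test data-source name (testData) and the result variable name are assumptions:

```r
# Predict delays with the linear model, computing confidence intervals
# so the predictions can be compared against the broad confidence levels
# mentioned in the next step
delayPredictions <- rxPredict(clusterModel, data = testData,
                              computeStdErrors = TRUE, interval = "confidence",
                              writeModelVars = TRUE)
```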
3. Add the following code to the R file and run it. This code displays the first 10 predictions from the
results. The Pred_Delay variables contain the predicted delay times, while the Delay variables show
the actual delays. Note that the Pred_Delay values are not close to the actual Delay values, but are
within the very broad confidence level for each prediction:
head(delayPredictions)
2. Add the following code to the R file and run it. This code creates a scatter plot that shows the
differences between the actual and predicted delays. This graph emphasizes the bias of values in the
predictions:
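A sketch of such a plot, using the Pred_Delay and Delay variable names mentioned in the previous step:

```r
# Predicted vs actual delays; points far from the diagonal indicate bias
rxLinePlot(Pred_Delay ~ Delay, data = delayPredictions, type = "p",
           title = "Predicted vs actual delays")
```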
Results: At the end of this exercise, you will have created a linear regression model using the clustered
data, and tested predictions made by this model.
2. Add the following code to the R file and run it. This statement uses the rxPredict function to make
predictions using the test dataset:
3. Add the following code to the R file and run it. This code displays the first 10 predictions. Note that
the individual predictions are more accurate than before and that they have a tighter confidence
level. However, in some cases the confidence might be misplaced because the real delay frequently
falls outside this range:
head(delayPredictionsFull)
2. Add the following code to the R file and run it. This code creates a scatter plot that shows the
differences between the actual and predicted delays. The graph still shows a bias, but it is much less
exaggerated than that created from the previous model:
3. Save the script as Lab6Script.R in the E:\Labfiles\Lab06 folder, and close your R development
environment.
Results: At the end of this exercise, you will have created a linear regression model using the entire flight
delay dataset, and tested predictions made by this model.
2. Add the following code to the R file and run it. This statement creates a remote R session on the LON-
RSVR server:
3. Add the following code to the R file and run it. This code transforms the flight delay data with a new
variable named DataSet that indicates whether an observation should be used for testing or training
models. The code also removes the variables not required by the models:
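The transformation itself is not shown here. A sketch: the 95/5 split matches the "approximately 19 times more rows" check in step 5, the input/output data-source names are assumptions, and the varsToDrop list is a placeholder:

```r
# Tag ~95% of observations as training data and ~5% as test data
flightData <- rxDataStep(inData = flightDelayData, outFile = flightDataFile,
                         overwrite = TRUE,
                         transforms = list(
                             DataSet = factor(ifelse(runif(.rxNumRows) >= 0.05,
                                                     "train", "test"))),
                         varsToDrop = c("FlightNum", "TailNum"))  # placeholder list
```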
4. Add the following code to the R file and run it. This code splits the data into two files based on the
value of the DataSet variable:
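A sketch using rxSplit, assuming the transformed data source from the previous step is named flightData; rxSplit returns a list of data sources, which is what the nrow check in the next step expects:

```r
# Split into train/test files by the DataSet factor
flightDataSets <- rxSplit(inData = flightData,
                          outFilesBase = "\\\\LON-RSVR\\Data\\FlightDelayData",
                          splitByFactor = "DataSet", overwrite = TRUE)
```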
5. Add the following code to the R file and run it. This statement shows the number of observations in
each file. There should be approximately 19 times more rows in the train file than the test file:
lapply(flightDataSets, nrow)
2. Add the following code to the R file and run it. This statement displays the structure of the decision
tree:
delayDTree
3. Add the following code to the R file and run it. This code switches back to the local R Client session
and copies the decision tree. The code then uses the createTreeView function of the RevoTreeView
package to visualize the DTree. The DTree should be displayed using Microsoft Edge:
pause()
getRemoteObject("delayDTree")
library(RevoTreeView)
plot(createTreeView(delayDTree))
Close Microsoft Edge, add the following code to the R file, and run it. This statement switches back to
the session on the R Server:
resume()
4. Add the following code to the R file and run it. This code generates a scree plot of the DTree, and
shows the complexity parameters table. You can see that the DTree has a lot of levels (more than
320), but only the first few make any significant decisions. The remaining levels are primarily
concerned with making the model fit the data at a detailed level. This is classic overfit:
plotcp(rxAddInheritance(delayDTree))
delayDTree$cptable
5. Add the following code to the R file and run it. This code prunes the DTree, and displays the
amended complexity parameters table, which should now be much reduced:
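The pruning code is not reproduced; a sketch using RevoScaleR's prune method for rxDTree objects, where the cp threshold is illustrative rather than the value used in the lab:

```r
# Prune the tree at a complexity-parameter threshold (value is illustrative),
# then display the reduced complexity parameters table
delayDTreePruned <- prune.rxDTree(delayDTree, cp = 1e-4)
delayDTreePruned$cptable
```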
2. Add the following code to the R file and run it. This statement runs predictions against the data frame
using the DTree:
3. Add the following code to the R file and run it. This code summarizes the statistics for the predicted
delays and the actual delays for comparison purposes. The mean values should be close to each
other, although the other statistics are likely to vary more widely:
4. Add the following code to the R file and run it. This code merges the predicted delays into a copy of
the test dataset:
5. Add the following code to the R file and run it. This code defines a function that you will use for
analyzing the results of the predictions against the real data value:
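The function body is not shown in this extract. A minimal sketch of what such a function might do; the column names (Pred_Delay, Delay) and the meaning of the second argument (a tolerance in minutes) are assumptions:

```r
# Hypothetical sketch: report how many predictions fall within `tolerance`
# minutes of the actual delay. Column names are assumptions.
processResults <- function(mergedData, tolerance) {
    differences <- abs(mergedData$Pred_Delay - mergedData$Delay)
    accurate <- sum(differences <= tolerance, na.rm = TRUE)
    cat(sprintf("%d of %d predictions were within %d minutes of the actual delay\n",
                accurate, nrow(mergedData), tolerance))
}
```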
6. Add the following code to the R file and run it. This statement calls the processResults function to
analyze the predictions and display the results:
processResults(mergedDelayData, 10)
Results: At the end of this exercise, you will have constructed a DTree model, made predictions using this
model, and evaluated the accuracy of these predictions.
2. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:
3. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:
processResults(mergedDelayData, 10)
2. Add the following code to the R file and run it. This code generates predictions for the test data using
the new DForest model:
3. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:
4. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:
processResults(mergedDelayData, 10)
Results: At the end of this exercise, you will have constructed a DTree model, made predictions using this
model, and evaluated the accuracy of these predictions.
2. Add the following code to the R file and run it. This code generates predictions for the test data using
the new DTree model:
3. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:
4. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:
processResults(mergedDelayData, 10)
Results: At the end of this exercise, you will have constructed a DTree model using a different set of
variables, made predictions using this model, and compared these predictions to those made using the
earlier DTree model.
2. Add the following code to the R file and run it. This code generates predictions for the test data using
the new DTree model:
3. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:
4. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:
processResults(mergedDelayData, 10)
Results: At the end of this exercise, you will have constructed a DTree model combining the variables
used in the two earlier DTree models, and made predictions using this model.
2. On the Windows desktop, click Start, type Microsoft SQL Server Management Studio, and then
press Enter.
3. In the Connect to Server dialog box, log on to LON-SQLR using Windows authentication.
8. In the Microsoft SQL Server Management Studio message box, click Yes.
9. In the second Microsoft SQL Server Management Studio message box, click Yes.
11. In Object Explorer, expand LON-SQLR, right-click Databases, and then click New Database.
12. In the New Database dialog box, in the Database name text box, type FlightDelays, and then click
OK.
3. Verify that the file is copied successfully, and then close the command prompt.
4. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
5. In the script editor, add the following statement to the R file and run it. This code ensures that you
are running in the local compute context:
rxSetComputeContext(RxLocalSeq())
6. In the script editor, add the following statement to the R file and run it. This code creates a
connection string for the SQL Server FlightDelays database, and an RxSqlServerData data source for
the flightdelaydata table in the database:
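The statement itself is not reproduced here; a sketch, with the connection-string details inferred from the server and database names used in this lab (the exact driver and authentication settings are assumptions):

```r
# Connection string and table data source; names follow the lab setup
connStr <- "Driver=SQL Server;Server=LON-SQLR;Database=FlightDelays;Trusted_Connection=Yes"
flightDelayDataTable <- RxSqlServerData(connectionString = connStr,
                                        table = "flightdelaydata")
```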
7. Add the following code to the R file and run it. These statements import the data from the FlightDelayDataSample.xdf file and add the DelayedByWeather logical factor and Dataset column to each observation:
rxOptions("reportProgress" = 2)
flightDelayDataFile <- "\\\\LON-RSVR\\Data\\FlightDelayDataSample.xdf"
flightDelayData <- rxDataStep(inData = flightDelayDataFile,
                              outFile = flightDelayDataTable, overwrite = TRUE,
                              transforms = list(
                                  DelayedByWeather = factor(ifelse(is.na(WeatherDelay), 0, WeatherDelay) > 0,
                                                            levels = c(FALSE, TRUE)),
                                  Dataset = factor(ifelse(runif(.rxNumRows) >= 0.05, "train", "test"))))
2. Add the following code to the R file and run it. These statements create an RxSqlServerData data
source that reads the flight delay data from the SQL Server database and refactors it:
3. Add the following statement to the R file and run it. This statement retrieves the details for each
variable in the data source:
rxGetVarInfo(delayDataSource)
There should be seven variables, named Month, MonthName, OriginState, DestState, Dataset,
DelayedByWeather, and WeatherDelayCategory.
4. Add the following statement to the R file and run it. This statement retrieves the data from the data
source and summarizes it:
rxSummary(~., delayDataSource)
5. Add the following code to the R file and run it. This statement creates a histogram that shows the
categorized delays by month:
rxHistogram(~WeatherDelayCategory | MonthName,
data = delayDataSource,
xTitle = "Weather Delay",
scales = (list(
x = list(rot = 90)
))
)
6. Add the following code to the R file and run it. This statement creates a histogram that shows the
categorized delays by origin state:
rxHistogram(~WeatherDelayCategory | OriginState,
data = delayDataSource,
xTitle = "Weather Delay",
scales = (list(
x = list(rot = 90, cex = 0.5)
))
)
Results: At the end of this exercise, you will have imported the flight delay data to SQL Server and used
ScaleR functions to examine this data.
2. Add the following statement to the R file and run it. This statement summarizes the forecast accuracy
of the model:
print(weatherDelayModel)
3. Add the following statement to the R file and run it. This statement shows the structure of the
decision trees in the DForest model:
head(weatherDelayModel)
4. Add the following statement to the R file and run it. This statement shows the relative importance of
each predictor variable to the decisions made by the model:
rxVarUsed(weatherDelayModel)
2. Add the following code to the R file and run it. This statement creates a data source that will be used
to store scored results in the database:
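A sketch, assuming the scoredresults table name mentioned in the next step and a connection string named connStr created earlier in the lab:

```r
# Data source for writing scored results back to SQL Server
weatherDelayScoredResults <- RxSqlServerData(connectionString = connStr,
                                             table = "scoredresults")
```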
3. Add the following code to the R file and run it. These statements switch to the local compute context,
generate weather delay predictions using the new data set and save the scored results in the
scoredresults table in the database, and then return to the SQL Server compute context:
rxSetComputeContext(RxLocalSeq())
rxPredict(modelObject = weatherDelayModel,
data = delayDataSource,
outData = weatherDelayScoredResults, overwrite = TRUE,
writeModelVars = TRUE,
predVarNames = c("PredictedDelay", "PredictedNoDelay",
"PredictedDelayedByWeather"),
type = "prob")
rxSetComputeContext(sqlContext)
4. Add the following code to the R file and run it. This code tests the scored results against the real data
and plots the accuracy of the predictions:
install.packages('ROCR')
library(ROCR)
# Transform the prediction data into a standardized form
results <- rxImport(weatherDelayScoredResults)
weatherDelayPredictions <- prediction(results$PredictedDelay,
results$DelayedByWeather)
# Plot the ROC curve of the predictions
rocCurve <- performance(weatherDelayPredictions, measure = "tpr", x.measure = "fpr")
plot(rocCurve)
Results: At the end of this exercise, you will have created a decision tree forest using the weather data
held in the SQL Server database, scored it, and stored the results back in the database.
3. In the toolbar, click New Query. In the Query window, type the following code:
USE FlightDelays;
CREATE TABLE [dbo].[delaymodels]
(
modelId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
model VARBINARY(MAX) NOT NULL
);
4. In the toolbar, click Execute. Verify that the code runs without any errors.
5. Overwrite the code in the Query window with the following block of Transact-SQL:
6. In the toolbar, click Execute. Verify that the code runs without any errors.
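The step that creates the serializedModelString variable used below is not shown in this extract. One common pattern for turning a model into a string that can be stored in a VARBINARY column; the assumption that the model object is weatherDelayModel comes from the surrounding exercise:

```r
# Serialize the model to a raw vector, then collapse it to a hex string
# that can be passed to the PersistModel stored procedure
serializedModel <- serialize(weatherDelayModel, NULL)
serializedModelString <- paste(serializedModel, collapse = "")
```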
8. Add the following statements to the R file and run it. This code uses an ODBC connection to run the
PersistModel stored procedure and save your DTree model to the database:
install.packages('RODBC')
library(RODBC)
connection <- odbcDriverConnect(connStr)
cmd <- paste("EXEC PersistModel @m='", serializedModelString, "'", sep = "")
sqlQuery(connection, cmd)
Task 2: Create a stored procedure that runs the model to make predictions
1. Switch back to SQL Server Management Studio.
2. Overwrite the code in the Query window with the following block of Transact-SQL:
3. In the toolbar, click Execute. Verify that the code runs without any errors.
5. Add the following statements to the R file and run it. This code tests the stored procedure:
6. Save the script as Lab8_1Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.
Results: At the end of this exercise, you will have saved the DForest model to SQL Server, and created a
stored procedure that you can use to make weather delay predictions using this model.
Task 2: Upload the Pig script and flight delay data to HDFS
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
2. In the script editor, add the following statements to the R file. Change studentnn to your login name, and then run the code. These statements establish a new RxHadoopMR compute context:
loginName = "studentnn"
context <- RxHadoopMR(sshUsername = loginName,
sshHostname = "LON-HADOOP",
consoleOutput = TRUE)
rxSetComputeContext(context, wait = TRUE)
3. Add the following statements to the R file and run them. This code removes the
FlightDelayDataSample.csv file from your directory in HDFS (if it exists; you can ignore the error
message if it is not found), and copies the latest data from the E:\Labfiles\Lab08 folder:
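A sketch of these operations using the ScaleR Hadoop helper functions; the HDFS paths follow the /user/RevoShare/loginName pattern used later in this lab:

```r
# Remove any previous copy of the file from HDFS (ignore a "not found" error),
# then upload the latest data from the local lab folder
rxHadoopRemove(paste("/user/RevoShare/", loginName, "/FlightDelayDataSample.csv", sep = ""))
rxHadoopCopyFromLocal("E:\\Labfiles\\Lab08\\FlightDelayDataSample.csv",
                      paste("/user/RevoShare/", loginName, sep = ""))
```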
4. Add the following statements to the R file and run them. This code uploads the carriers.csv file to
your directory in HDFS:
5. Add the following statements to the R file and run them. Replace the text fqdn with the URL of the
Hadoop VM in Azure (for example, LON-HADOOP-01.ukwest.cloudapp.azure.com). This code closes
the RxHadoopMR compute context, creates a remote R session on the Hadoop VM, and loads the
RevoScaleR library in that session:
rxSetComputeContext(RxLocalSeq())
remoteLogin(deployr_endpoint = "http://fqdn:12800", session = TRUE, diff = TRUE,
commandline = TRUE, username = "admin", password = "Pa55w.rd")
library(RevoScaleR)
6. Add the following statements to the R file and run them. This code copies the Pig script (and the
loginName variable) to the remote session:
pause()
putLocalFile("E:\\Labfiles\\Lab08\\carrierDelays.pig")
putLocalObject(c("loginName"))
resume()
2. Add the following code to the R file and run it. This code lists the contents of the
/user/RevoShare/studentnn directory in HDFS. This directory should include a subdirectory named
results:
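A sketch using the ScaleR HDFS listing helper, assuming the loginName variable from earlier in the lab:

```r
# List the contents of the user's HDFS directory;
# expect to see a "results" subdirectory
rxHadoopListFiles(paste("/user/RevoShare/", loginName, sep = ""))
```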
3. Verify that the results directory contains two files: _SUCCESS and part-r-00000. The data is actually in
the part-r-00000 file.
4. Add the following statement to the R file and run it. This code deletes the _SUCCESS file from the
results directory:
5. Add the following statements to the R file and run them. These statements display the structure of the
results generated by the Pig script:
rxOptions(reportProgress = 1)
rxSetFileSystem(RxHdfsFileSystem())
resultsFile <- paste("/user/RevoShare/", loginName, "/results", sep = "")
resultsData <- RxTextData(resultsFile)
rxGetVarInfo(resultsData)
Task 4: Convert the results to XDF format and add field names
1. Add the following code to the R file and run it:
2. Add the following code to the R file and run it. This code creates the CarrierData composite XDF file
in HDFS:
3. Add the following code to the R file and run it. This code creates a new RxHadoopMR compute
context in the remote session:
4. Add the following code to the R file and run it. This code displays the structure of the XDF file which
should now include the new mappings:
rxGetVarInfo(carrierData)
5. Add the following code to the R file and run it. This code summarizes the contents of the XDF file:
rxSummary(~., carrierData)
2. Add the following code to the R file and run it. This code changes the resolution of the plot window
to 1024 by 768 pixels, and then generates a histogram showing the number of delayed flights for
each airline:
3. Add the following code to the R file and run it. This code creates a bar chart showing the total delay
time for all flights made by each airline:
library(ggplot2)
ggplot(data = rxImport(carrierData,
                       transforms = list(TotalDelay = CarrierDelay + LateAircraftDelay))) +
    geom_bar(mapping = aes(x = AirlineName, y = TotalDelay), stat = "identity") +
    labs(x = "Airline", y = "Total Carrier + Late Aircraft Delay (minutes)") +
    scale_x_discrete(labels = function(x) {
        lapply(strwrap(x, width = 25, simplify = FALSE), paste, collapse = "\n")
    }) +
    theme(axis.text.x = element_text(angle = 90, size = 8))
2. Add the following code to the R script and run it. This code uses the rxExec function to run rxSort to
sort the data in the cube:
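The call is not reproduced here; a sketch of the pattern. Running rxSort through rxExec returns a list whose first element (rxElem1) holds the sorted data, which matches how the results are read in the following steps. The cube variable name and sort keys are assumptions:

```r
# Sort the delay cube; input data source and sort keys are hypothetical
sortedDelayData <- rxExec(rxSort, inData = delayData,
                          sortByVars = c("Counts"), decreasing = TRUE)
```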
3. Add the following code to the R script and run it. This code shows the top 50 rows in the sorted cube (the routes with the most frequent and longest delays):
head(sortedDelayData$rxElem1, 50)
4. Add the following code to the R script and run it. This code shows the bottom 50 rows in the sorted
cube (the routes with the least frequent and shortest delays):
tail(sortedDelayData$rxElem1, 50)
5. Add the following code to the R script and run it. This code switches to the local compute context
and saves the data cube to the file SortedDelayData.csv in HDFS:
rxSetComputeContext(RxLocalSeq())
sortedDelayDataFile <- paste("/user/RevoShare/", loginName, "/SortedDelayData.csv",
sep = "")
sortedDelayDataCsv <- RxTextData(sortedDelayDataFile)
sortedDelayDataSet <- rxDataStep(inData = sortedDelayData$rxElem1,
outFile = sortedDelayDataCsv, overwrite = TRUE
)
Results: At the end of this exercise, you will have run a Pig script from R, and analyzed the data that the
script produces.
2. Add the following code to the R script and run it. This code creates an RxHiveData data source:
3. Add the following code to the R script and run it. This code uploads the data to Hive:
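Sketches of both steps: the table name routedelays is taken from a later step in this lab, and the input is assumed to be the sortedDelayDataCsv data source created in the previous exercise:

```r
# Hive data source (table name taken from a later step in this lab)
hiveDataSource <- RxHiveData(table = "routedelays")
# Copy the CSV data into the Hive table
rxDataStep(inData = sortedDelayDataCsv, outFile = hiveDataSource, overwrite = TRUE)
```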
4. Add the following code to the R script and run it. This code runs the rxSummary function over the
Hive data:
rxSummary(~., hiveDataSource)
2. In the PuTTY Configuration window, select the LON-HADOOP session, click Load, and then click
Open. You should be logged in to the Hadoop VM.
3. In the PuTTY terminal window, run the following command:
hive
This command lists each airline together with the number of delayed flights for that airline.
exit;
7. Close the PuTTY terminal window, and return to your R development environment.
library(sparklyr)
library(dplyr)
2. Add the following code to the R script and run it. This code closes the current RxSpark compute
context and creates a new one that supports sparklyr interop:
rxSparkDisconnect(sparkContext)
connection = rxSparkConnect(sshUsername = loginName,
consoleOutput = TRUE,
numExecutors = 10,
executorCores = 2,
executorMem = "1g",
driverMem = "1g",
interop = "sparklyr")
3. Add the following code to the R script and run it. This code creates a new sparklyr session:
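The statement is not shown here. When the compute context was created with interop = "sparklyr", RevoScaleR can return the underlying sparklyr connection; a sketch, assuming the connection object from step 2:

```r
# Obtain the sparklyr session associated with the RxSpark compute context
sparklyrSession <- rxGetSparklyrConnection(connection)
```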
4. Add the following code to the R script and run it. This code lists the tables available in Hive:
src_tbls(sparklyrSession)
5. Add the following code to the R script and run it. This code caches your routedelays table and
fetches the data in this table:
tbl_cache(sparklyrSession, dbTable)
routeDelaysTable <- tbl(sparklyrSession, dbTable)
head(routeDelaysTable)
6. Add the following code to the R script and run it. This code constructs a dplyr pipeline that finds the
delays for all flights for American Airlines that departed from New York JFK, and saves the results in a
tibble:
routeDelaysTable %>%
filter(AirlineCode == "AA" & Origin == "JFK") %>%
select(Dest, AverageDelay) %>%
collect ->
aajfkData
7. Add the following code to the R script and run it. This code displays the data in the tibble and
summarizes it:
print(aajfkData)
rxSummary(~., aajfkData)
8. Add the following code to the R script and run it. This code closes the sparklyr session and
disconnects from the RxSpark compute context:
rxSparkDisconnect(connection)
9. Save the script as Lab8_2Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.
Results: At the end of this exercise, you will have used R code running in an RxSpark compute context to
upload data to Hive, and then analyzed the data by using a sparklyr session running in the RxSpark
compute context.