Sas 1 PDF

Introduction to Data Curation for SAS® Data Scientists
Course Notes
Introduction to Data Curation for SAS ® Data Scientists Course Notes was developed by Anna
Yarbrough. Additional contributions were made by Nicole Ball, Mark Craver, David Ghan, Robert
Ligtenberg, Kari Richardson, Johnny Starling, Erin Winters, and Ari Zitin. Instructional design,
editing, and production support was provided by the Learning Design and Development team.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Introduction to Data Curation for SAS ® Data Scientists Course Notes
Copyright © 2020 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States
of America. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise,
without the prior written permission of the publisher, SAS Institute Inc.
Book code E71513, course code LWDISDS1/DISDS1, prepared date 02Jan2020. LWDISDS1_001
ISBN 978-1-64295-451-7
Table of Contents
Lesson 1 Introduction to Data Curation
1.1 Discovering Data
1.2 The Role of a Data Scientist and the Importance of Data Curation
1.3 Using the Power of SAS
Lesson 2 An Overview of the Computing Environment
2.1 An Introduction to Computer Architecture
2.2 The Many Types of Data Storage
2.3 Parallel Processing and Grid Computing
2.4 Cloud Computing
2.5 The SAS Platform and SAS Viya
Lesson 3 The Role of Data Science and Data Scientists
3.1 Exploring the Discipline of Data Science
3.2 Understanding the Data Curation Life Cycle
3.3 The Emergence of Artificial Intelligence and Machine Learning
Lesson 4 The Roadmap to SAS® Data Curation
4.1 SAS Data Management Tools and Applications
4.2 SAS and Hadoop
4.3 Additional Data Management Tools and Applications

For Your Information
To learn more…
For information about other courses in the curriculum, contact the
SAS Education Division at 1-800-333-7660, or send e-mail to
training@sas.com. You can also find this information on the web at
http://support.sas.com/training/ as well as in the Training Course
Catalog.
For a list of SAS books (including e-books) that relate to the topics
covered in these course notes, visit https://www.sas.com/sas/books.html or
call 1-800-727-0025. US customers receive free shipping to US
addresses.
Lesson 1 Introduction to Data Curation
1.1 Discovering Data
1.2 The Role of a Data Scientist and the Importance of Data Curation
1.3 Using the Power of SAS

1.1 Discovering Data
In today’s world, we are constantly generating, collecting, and retrieving data. Almost every action
we take leads to an associated data point. Buying a cup of coffee from Starbucks on the way to work
creates a transactional piece of data associated with your credit card and bank accounts. Scrolling
through Yelp generates data that the Yelp search engine can use to refine the restaurants and stores
that it returns to you. Clicking an advertisement on Facebook, posting a picture to Instagram, sharing
a Tweet on Twitter, reviewing purchases on Amazon, and rating television shows and movies on
Netflix all contribute to your personal data collection.
Although this alone may seem like a lot of data, it doesn't even take into consideration the data
generated from smart devices such as smart appliances, connected cars and sensor networks, or
operational data from organizations such as airlines, the stock market, university systems, research
institutions, and other companies. Beyond the sheer amount of data, the speed at which data is
produced and collected has drastically increased in the past 10 years. As we accumulate more and
more data, it is essential that organizations learn how to leverage their data to make smart
operational decisions. Data can be used to spark marketing platforms, determine sales cycles,
create new medicine, research socioeconomic disparities, retain customers, and improve processes.
Organizations are dealing with more data than ever before. As mentioned, this data can be
generated from websites, social media, sensors, surveys, and a variety of other sources. Often,
businesses want to pull information from these varying data sources to answer business questions.
For example, a company might track the reviews of their products on Google, Amazon, and social
media threads. The company might also collect data about how people are exploring the company’s
website. What pages do they often go to? Where do shoppers spend their time? How long do they
spend learning about the product and the company before they proceed to check out?
The company might want to couple that usage information with their customer database to answer
questions like these:
• Who are our customers?
• What products are selling?
• What do our customers like about our products?
• What leaves customers dissatisfied?
• How can we retain customers?
• How can we attract new customers?
Data Collection
Before an organization can start using data to answer questions, data scientists need to do a lot of
work on the data. The data needs to be collected from various data sources such as social media
applications and sensor devices. It can be collected as text or raw files, structured files, streaming
data, or a combination of those.
Data Cleansing and Transformation
The data then needs to be explored and investigated. Often, the data that we are working with is
semi-structured or unstructured. This means that there is no associated metadata or readily
available data model. Semi-structured or unstructured data must be cleansed and standardized so
that it can be used in business applications and evaluated with analytics and statistical software.
Transformations might need to be applied as well. Maybe we need to round, sum, or average
columns to standardize or aggregate data. We also might need to bring together data from a variety
of sources.
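The cleansing, transformation, and aggregation steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the review records, field names, and helper functions are hypothetical and are not taken from any SAS tool.

```python
from collections import defaultdict

# Hypothetical raw review records gathered from two sources; all values are made up.
raw_reviews = [
    {"product": " Widget ", "rating": "4.456", "source": "web"},
    {"product": "widget", "rating": "3.9", "source": "social"},
    {"product": "Gadget", "rating": "n/a", "source": "web"},
    {"product": "GADGET", "rating": "5", "source": "social"},
]

def cleanse(record):
    """Standardize the product name and convert the rating to a rounded number.

    Returns None when the rating cannot be parsed, so the row can be dropped.
    """
    try:
        rating = float(record["rating"])
    except ValueError:
        return None
    return {"product": record["product"].strip().lower(), "rating": round(rating, 1)}

def average_rating_by_product(records):
    """Aggregate cleansed records into one average rating per product."""
    totals = defaultdict(lambda: [0.0, 0])  # product -> [sum of ratings, count]
    for rec in records:
        totals[rec["product"]][0] += rec["rating"]
        totals[rec["product"]][1] += 1
    return {product: round(total / count, 2) for product, (total, count) in totals.items()}

# Cleanse every record, drop the ones that could not be repaired, then aggregate.
cleansed = [rec for rec in map(cleanse, raw_reviews) if rec is not None]
averages = average_rating_by_product(cleansed)
```

Here the inconsistent product names are standardized before averaging, so records from both sources fall into the same group, and the row with an unusable rating is excluded rather than guessed at.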
Analysis and Model Building
After the data has been properly curated, it is ready to be used in model building, predictive
analytics, and statistical modeling to answer the business questions presented. However, it still must
be updated and maintained during the analytical modeling process and eventually archived as new
data is generated. Then the process starts all over again.
Data Curation
The process of preparing data for analytics can be referred to as data curation. You will learn about
the data curation life cycle, the role of the data scientist in data curation, and the SAS tools and
methods available for data curation.
1.2 The Role of a Data Scientist and the Importance of Data Curation
What Is Data Science?

The desire to curate and leverage an organization's data has driven an increased demand for data
scientists. According to mastersindatascience.org, data scientists “take an enormous mass of messy
data points (unstructured and structured) and use their formidable skills in math, statistics and
programming to clean, manage and organize them. Then they apply their analytic powers – industry
knowledge, contextual understanding, skepticism of existing assumptions – to uncover hidden
solutions to business challenges.”
Although some data scientists might have all of the aforementioned skills, the implementation of data
science at an organization can often be a team effort. One data scientist might have strong data
curation skills, whereas another might be analytically savvy. Together, they can gain insight from an
organization's data.
Source:
“How to Become a Data Scientist in 2019.” Master’s in Data Science. Available
https://www.mastersindatascience.org/careers/data-scientist/. Accessed October 30, 2019.
Data science can be thought of as a multidisciplinary field that combines skills in computer science
and statistics with domain experience. This combination of skills and experience is used to support
the end-to-end analysis of large and diverse data sets, ultimately uncovering value for an
organization and then communicating that value to stakeholders as actionable results.
Essentially, data scientists are data wizards with the power to gather, manage, and transform messy
data as well as the ability to apply analytical methods and investigate results. After applying these
analytical methods and investigating the results, only then can data scientists interpret their findings
and apply them to business decisions. Notice that the need to gather, manage, and transform messy
data is a precursor to everything else.
Gathering, managing, and transforming data are some of the many components of the Data Curation
Life Cycle. Before data scientists, as well as business analysts and programmers, can delve into the
data and start answering business questions, they must understand and be able to implement the
entire Data Curation Life Cycle.
Let’s break down the description of a data scientist. As we have already discussed, the role is
multidisciplinary. Data science involves mathematics, statistics, econometrics, data engineering,
computer science, and reporting.
Data science also requires domain experience. Data scientists often practice within an industry. For
example, you might be a data scientist working in finance, retail, or clinical trials. Data scientists
must uncover value in the vast amounts of data available within an organization, and then they must
be able to communicate that value. If data scientists cannot present their findings to stakeholders
within an organization effectively, they have failed to inform and shape the future of the organization.
Data scientists rely on data curation methods. Data curation refers to the process of finding,
exploring, structuring, cleansing, updating, and eventually archiving data. This process can be
looked at as the Data Curation Life Cycle. It’s crucial that data curation methods are used efficiently
and effectively for businesses to be able to gather insight and value from the available data.
1.3 Using the Power of SAS
We are going to look at the power of SAS to implement data curation. As described on LinkedIn,
“SAS is the leader in business analytics software and services, and the largest independent vendor
in the business intelligence market. Through innovative solutions, SAS helps customers at more
than 70,000 sites improve performance and deliver value by making better decisions faster.”
Source:
SAS Institute Inc. LinkedIn. https://www.linkedin.com/company/sas/about/
The SAS Platform consists of a comprehensive set of tools that enables a variety of users with
different roles within an organization to do their part and work together to effectively and efficiently
administer, curate, and analyze data within a shared IT infrastructure.
Figure: SAS servers (workspace server, metadata server), SAS client applications (SAS Studio, DataFlux Data Management Studio), and data sources (raw files, Hadoop, SAS tables)
On the SAS Platform, users with different roles each use specialized client applications designed to
accomplish specific types of tasks. With these client applications, users access application servers
and data sources in order to execute processes. A key component in the server environment is a
metadata server that stores and provides information to client applications to connect to required
application servers and data sources.
Users with an administrative role use client applications to define the application servers and data
source connections. The administrators also define user and group identities, logins, and
permissions in the metadata to control access to application servers and data sources.
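The idea of a metadata server that holds connection details and permissions can be pictured with a toy in-memory model. This sketch is purely illustrative: every name, address, and function below is hypothetical, and it does not represent how the SAS metadata server is implemented or accessed.

```python
# Toy in-memory "metadata" store. Every name here is hypothetical and does not
# correspond to a real SAS object, server, or API.
metadata = {
    "servers": {"workspace": "workspace.example.com"},
    "data_sources": {"sales_tables": "/data/sales", "clickstream": "hdfs://cluster/clicks"},
    "groups": {"curators": {"ana"}, "analysts": {"ben"}},
    # Permissions granted by an administrator: resource -> groups allowed to use it.
    "permissions": {
        "workspace": {"curators", "analysts"},
        "sales_tables": {"curators", "analysts"},
        "clickstream": {"curators"},
    },
}

def lookup(user, resource):
    """Return connection information for a resource if one of the user's groups permits it."""
    user_groups = {group for group, members in metadata["groups"].items() if user in members}
    if not user_groups & metadata["permissions"].get(resource, set()):
        raise PermissionError(f"{user} may not access {resource}")
    locations = {**metadata["servers"], **metadata["data_sources"]}
    return locations[resource]
```

In this model, a client application would call lookup before connecting, so access rules defined once by an administrator apply to every application that relies on the shared metadata.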
Data curators and analysts then use other specialized applications to manage and analyze data. The
processes and the results that they generate can be stored as metadata objects so that other users
can access and leverage them in the shared environment.
Lesson 2 An Overview of the Computing Environment
2.1 An Introduction to Computer Architecture
2.2 The Many Types of Data Storage
2.3 Parallel Processing and Grid Computing
2.4 Cloud Computing
2.5 The SAS Platform and SAS Viya

2.1 An Introduction to Computer Architecture
Understanding the Computing Environment
Although data scientists are not computer engineers or software developers, it is important for them
to have a basic understanding of computer architecture. Understanding the components of
computing infrastructure, including where data is stored, where it is processed, and how it moves
through the network, equips data scientists to design efficient and effective programs.
There are several technology components that work together to form the computing environment.
These include the processors (also referred to as central processing units, or CPUs), memory,
storage, and network.
Central Processing Unit
• Performs processing
• Serves as "the brain"
• Executes instructions from programs and applications
As the name implies, the CPU is the place where all the work or processing takes place on the
computer. According to Digital Trends, the CPU can be thought of as the brain of the computer. It
executes instructions supplied by programs and applications. Initially, CPUs were created with a
single processing core. This processing core executed the instructions delivered from programs and
applications, and these instructions had to be executed sequentially. Modern computer
advancements have led to the creation of CPUs with multiple cores. When CPUs have multiple
cores, multiple steps in the instructions can be processed concurrently. In addition, modern
computers can be built with multiple CPUs. Having more than one CPU means that more processing
can be done simultaneously. However, a multi-core CPU is generally more efficient than multiple
CPUs. Along with this increase in cores per CPU and number of CPUs, there has been a decrease
in the size of CPUs. Smart phones, laptops, and tablets all rely on CPUs, although their design and
implementation differ from those of large computer systems.
Sources:
Martindale, Jon. “What is a CPU? The CPU is your PC’s most important component,
but what does it really do?” Digital Trends. March 8, 2018.
Available https://www.digitaltrends.com/computing/what-is-a-cpu/
Papiewski, John. "Multiple CPU Vs. Multi-Core" Small Business Chron.
Available https://smallbusiness.chron.com/multiple-cpu-vs-multicore-33195.html
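As a small aside, the number of cores that a machine exposes can be checked from a program. The sketch below uses Python's standard os module; the variable name is an arbitrary choice for illustration.

```python
import os

# os.cpu_count() reports the number of logical cores; it can return None when
# the count cannot be determined, so fall back to 1 in that case.
logical_cores = os.cpu_count() or 1
print(f"This machine reports {logical_cores} logical core(s).")
```

The reported value counts logical cores, which on some processors is higher than the number of physical cores.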
Memory
• Stores data for immediate use
• Intermediary between physical storage and the processing done by the CPU
• Data in memory is lost if the computer loses power
Memory, commonly referred to as random-access memory (RAM), is the component that
stores data for immediate use in CPU processing. Memory serves as the intermediary between data
stored physically on disk and the processing of that data. New technology enables huge volumes of
data to be loaded into memory for processing, at a much lower price than in years past. However,
RAM is volatile: if your computer is turned off or loses power, the data stored in memory is lost.
Storage
• Disk is a permanent data storage location
• Examples of storage devices are internal and external hard drives and USB flash drives
• Solid-state drives are faster and more resilient than traditional hard drives
Data is brought into memory to be processed, but it is stored permanently on disk. Some examples
of devices that provide disk space are hard drives, USB flash drives, and solid-state drives (SSDs).
How does disk space play into performance in the computing environment? A hard drive consists of
platters, which are actual disks coated in a magnetic film that allows the encoding of the 1s and 0s that
make up the data. The spindles that turn the vertically stacked platters are a critical part of rating
hard drives because the spindles determine how fast the platters can spin and thus how fast the
data can be read and written, also referred to as input and output, or I/O for short. The units of
storage on a hard drive are commonly referred to as bytes. Measurements are typically reported in
various quantities of bytes (for example, kilobytes, megabytes, and gigabytes). Solid-state drives
have no moving platters and are often much faster than traditional hard drives. They are also more
resistant to physical shock.
Sources:
“Solid-state drive.” Wikipedia. Available https://en.wikipedia.org/wiki/Solid-state_drive. Accessed on
October 30, 2019.
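The relationship between bytes, kilobytes, megabytes, and gigabytes can be made concrete with a short conversion helper. This is a sketch under one stated assumption: it uses binary multiples (1 kilobyte = 1,024 bytes), whereas some vendors report sizes in multiples of 1,000. The function name is hypothetical.

```python
def human_readable(num_bytes):
    """Express a byte count in the largest listed unit that keeps the value below 1,024.

    Assumes binary multiples (1 kilobyte = 1,024 bytes); a decimal convention
    would divide by 1,000 instead.
    """
    units = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value is small enough, or at the last unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024
```

For example, under this convention a file of 1,536 bytes is reported as 1.5 kilobytes.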
Network
The network is the only hardware component that is always external to the computer. It is the
mechanism for computers to connect with one another and exchange information. The many
protocols and standards for network communication are not discussed here.
With the evolution of distributed computing environments, where different computers exchange
information, network speed is a factor to consider when designing computer systems and processes.
With the ability to distribute data and processes across a network of computers, you have the option
of moving the process to where the data is, instead of moving huge volumes of data to where the
process is running.
2.2 The Many Types of Data Storage
Changes in Data Storage
Remember that data is stored on disk within your computing environment. When you save a file to
your desktop, you are saving that file to your computer’s hard drive. This works for small files, but
when dealing with massive amounts of data (for example, streaming data from sensors or
transactional data for a worldwide company), it isn't feasible to store that data on a single laptop or
desktop.
Different Data Storage Methods
Let’s talk about four other data storage methods: relational database management systems,
Hadoop, data lakes, and cloud storage.
Relational Database Management Systems
• Structured data
• Predefined schemas
• SQL programming language
The first data storage tool to consider is the relational database management system, referred to as
the RDBMS, or databases. Databases have been available since the 1970s and have been widely
used tools for storing data. Traditional RDBMSs are designed to support databases that are much
larger than the memory or storage available on a personal computer. They are designed to work with
predefined schemas and structured data. Structured data refers to data that has clearly defined
columns and data types. Rows of data are stored in logical records where the fields or entries in
each record pertain to a specific entity. Some examples of databases include Oracle, Teradata,
Microsoft SQL Server, and Postgres. Structured Query Language, or SQL, was widely adopted as a
standard programming language to manage, query, and retrieve data stored in relational databases.
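A predefined schema and a SQL query can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch: SQLite stands in for a full RDBMS, and the table, columns, and values are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for a full RDBMS.
conn = sqlite3.connect(":memory:")

# The schema is defined before any data is stored: each column has a name and a type.
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        country     TEXT NOT NULL,
        total_spend REAL NOT NULL
    )
    """
)

# Each row is one logical record describing a single customer (made-up values).
conn.executemany(
    "INSERT INTO customers (customer_id, name, country, total_spend) VALUES (?, ?, ?, ?)",
    [(1, "Ana", "US", 120.50), (2, "Ben", "NL", 75.00), (3, "Chen", "US", 210.25)],
)

# SQL queries the structured data by column name and aggregates it per country.
rows = conn.execute(
    "SELECT country, COUNT(*), SUM(total_spend) FROM customers GROUP BY country ORDER BY country"
).fetchall()
conn.close()
```

Because the column names and types are declared up front, the query can group and sum by column without first inspecting or parsing the data.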
Relational databases are great for storing structured data, but they are not designed to store
unstructured data unless that data is first processed to add the structure imposed by the database
design. Unstructured data is data that does not have a defined data model or schema. The column
names, data types, and lengths are not defined and stored with the data. Examples of unstructured
data include social media data, and audio and video files. Raw data, or data that has not been
processed yet, is also unstructured. An example of raw data would be operational or streaming data.
Sources:
Taylor, Christine. “Structured vs. Unstructured Data.” Datamation, March 28, 2018. Available
https://www.datamation.com/big-data/structured-vs-unstructured-data.html.
Hadoop
• Open source software
• Computer cluster
• Distributed storage
• Parallel processing
One of the first non-traditional data storage methods was Hadoop. Hadoop became very popular
very quickly because it was free and worked on existing hardware. Hadoop is an open-source
software framework that uses a cluster of computers for distributed storage and parallel
processing of data. Let’s break down that sentence.
Computer Cluster
First, what is a computer cluster? A computer cluster is a grouping of multiple computers, connected
by a local area network. The computers in the cluster are often referred to as nodes. The clustering
of these nodes enables them to function as a unit. This means that instead of relying on the storage
space of one computer, Hadoop can use the storage space and other resources of all the nodes in
the cluster. This allows for distributed storage.
Distributed Storage
Figure: HDFS splits a data file into data blocks (1 to 3) that are distributed across the Hadoop DataNodes (nodes 1 to 5)
So what is distributed storage? Distributed storage of data means that the data is stored in pieces
across your computer cluster. Instead of having to fit an entire file on one disk on one computer, the
file is broken into pieces and distributed across the nodes. The Hadoop Distributed File System, or
HDFS, is used for distributed storage. Data within Hadoop is also replicated across nodes. So a
block of data is located on one node, and copies of that block of data are located on other nodes in
the cluster. This means that if a node goes down for any reason, the data on that node is not lost.
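Splitting a file into blocks and keeping copies on several nodes can be modeled in a few lines. The sketch below is a toy model under stated assumptions: the block size, the replication factor, and the round-robin placement rule are chosen for illustration and are not the policy that HDFS actually uses.

```python
def split_into_blocks(data, block_size):
    """Split a bytes object into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes using a simple round-robin.

    This placement rule is an assumption for illustration; it is not the policy HDFS uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

def readable_blocks(placement, failed_node):
    """Return the block IDs that still have at least one copy when one node is down."""
    return [b for b, holders in placement.items() if any(n != failed_node for n in holders)]

nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(b"x" * 1000, block_size=400)  # three blocks: 400, 400, and 200 bytes
placement = place_blocks(len(blocks), nodes, replication=3)
```

With three copies of every block on different nodes, readable_blocks shows that all blocks remain available when any single node fails.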
Parallel Processing
• Parallel processing enables processing to occur on the data nodes in the Hadoop cluster simultaneously.
Lastly, Hadoop is powerful because of parallel processing. Parallel processing within Hadoop
involves processing the data stored in the individual blocks simultaneously. Because work on
different pieces of data can be done at the same time, processing time is shorter. You will learn more
about parallel processing later.
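The pattern of processing each block on its own and then combining the partial results can be sketched with Python's standard library. This is an illustration of the idea only: worker threads on one machine stand in for the nodes of a cluster, the text blocks are made up, and in standard Python threads do not necessarily speed up CPU-bound work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical blocks of a larger text file, as if each were stored on a different node.
blocks = [
    "coffee tea coffee",
    "tea water",
    "coffee water water",
]

def count_words(block):
    """Process one block on its own and return a partial word count."""
    return Counter(block.split())

# Each block is handed to a separate worker; the partial results are then combined.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    partial_counts = list(pool.map(count_words, blocks))

total_counts = sum(partial_counts, Counter())
```

No worker needs to see the whole file: each one works only on its own block, and the final step merges the partial counts into one result.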
Data Lakes
• Unstructured and structured data
• Large variety and volume of data
Speaking of non-traditional data storage, you might have heard of the term data lake. Data lakes are
useful for storing structured and unstructured data. They do not require your data to fit a certain
structure or schema, and they enable you to store a large variety and volume of data together.
For example, you might have various types of data associated with employees such as pictures,
resumes, health benefit information, and more.
Hadoop, as well as other software, can be used to implement a data lake. Although some data is more
suitable for data lake storage, data warehouses and databases are still valuable storage methods.
Remember that with data warehouses or databases, the data fits a schema structure. This makes the
data easy to query, but there is a lot of work on the front end to make sure that the data fits the
schema structure and is curated before it is stored. With data lakes, the data can be dumped into
storage as is and curated later in the process.
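The contrast between curating before storage and curating later can be sketched as follows. This is a simplified illustration: a Python list of JSON strings stands in for a data lake, and the record types, field names, and function are hypothetical.

```python
import json

# Hypothetical employee-related items of different shapes, stored "as is".
lake = [
    json.dumps({"type": "resume", "employee": "ana", "text": "Ten years of analytics experience"}),
    json.dumps({"type": "benefits", "employee": "ana", "plan": "standard", "dependents": 2}),
    json.dumps({"type": "picture", "employee": "ben", "file": "ben.jpg"}),
]

def curate_benefits(raw_items):
    """Curate at read time: keep only benefits records and impose a fixed set of columns."""
    rows = []
    for raw in raw_items:
        item = json.loads(raw)
        if item.get("type") == "benefits":
            rows.append((item["employee"], item["plan"], int(item["dependents"])))
    return rows

benefits_table = curate_benefits(lake)
```

Nothing was rejected when the items were stored, even though they share no common set of fields; the structure is imposed only when a particular question requires it.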
Sources:
Kim, Dale. “What’s the Difference between Hadoop and a Data Lake.” Arcadia Data, July 10, 2018.
Available https://www.arcadiadata.com/blog/whats-the-difference-between-hadoop-and-a-data-lake/.
Cloud Storage
• Method for storing data off-site
• Allows for scalability depending on the amount of storage a company needs
• Data is often stored across machines in the cloud
The last trend we will discuss in data storage is cloud storage. Often, as companies acquire more
and more data, they run out of storage on-premises. Companies might find it more cost-effective
to store their data off-site. This type of data storage is referred to as cloud storage. Cloud storage
enables you to store your data in a location that you cannot physically access, but you can still
access easily through the internet. Your data isn’t sitting on a server in the basement of your office or
on the hard drive of your desktop computer, but instead it is stored on your cloud provider’s servers.
Cloud storage allows for scalability depending on the amount of storage a company needs. If your
company needs more storage space, it can upgrade its plan with the cloud provider. Data is often
stored across machines and duplicated so that it can be accessed by users even if a server loses
power or goes down for maintenance. Cloud providers have many servers in one location referred to
as a data center. Some examples of cloud storage include Amazon S3, Google BigQuery, and
Google Drive.
Sources:
Strickland, Jonathan. “How Cloud Storage Works.” HowStuffWorks, Available
https://computer.howstuffworks.com/cloud-computing/cloud-storage2.htm.
2.3 Parallel Processing and Grid Computing
Remember those essentials of the computing environment that we discussed earlier? In case you
don’t, we’ve included a refresher. Your computing environment consists of many different
components, but four major resources are the CPU, memory, storage, and network. The CPU does
the processing of tasks. The memory temporarily holds data relevant to the processing that is
occurring. The storage (often disk storage) stores files and data permanently. And the network allows
communication between computers.
Parallel Processing
• CPU with one core: limited to sequential processing
• CPU with two cores: ability to process tasks in parallel
Your CPU contains one or more cores. Cores are also referred to as processing units. If your
computer has a CPU with one core, then the CPU has one place to perform processing. Many
computers today are designed with multi-core processors. If your laptop is built with a dual-core
processor, the CPU inside your laptop has two cores that can be used for executing tasks. This
means that work can be done faster and there is more processing space. This also means that
different jobs can be executing simultaneously, or one job can be broken into tasks that run in
parallel.
The concept of breaking jobs into tasks that run simultaneously is referred to as parallel processing.
Parallel processing, or parallel computing, allows for jobs to execute faster and processing to
happen simultaneously on smaller tasks. If we have a single computer with a multi-core processor,
we can execute tasks in parallel on that one computer. This can still be restrictive though, as a single
computer can only do so much. Parallel processing can also happen across multiple servers
connected via a network. Don’t forget: You learned about parallel processing earlier, along with
Hadoop.
Sources:
Hoffman, Chris. “CPU Basics: Multiple CPUs, Cores, and Hyper-Threading Explained.” How-To
Geek. October 12, 2018. Available
https://www.howtogeek.com/194756/cpu-basics-multiple-cpus-cores-and-hyper-threading-explained/.
“Multi-core processor.” Wikipedia. Available https://en.wikipedia.org/wiki/Multi-core_processor.
Accessed on October 30, 2019.
“Parallel computing.” Wikipedia. Available https://en.wikipedia.org/wiki/Parallel_computing. Accessed
on October 30, 2019.
Grid Computing
• More resources
• More processing power
Grid computing enables us to expand the resources that are available for processing and jobs
beyond a single computer. Computer grids are created by connecting multiple computers together
via a network in order to take advantage of all the processing power and resources available on
those computers. Each computer in the grid has its own CPU that consists of multiple cores. The
computers in the grid are often referred to as nodes. The nodes execute different jobs or tasks
independently of each other.
Working on a grid provides programmers access to shared resources. Jonathan Strickland, in How
Grid Computing Works, gives an excellent definition. “Grid computing systems link computer
resources together in a way that lets someone use one computer to access and leverage the
collected power of all the computers in the system. To the individual user, it's as if the user's
computer has transformed into a supercomputer.”
Sources:
“Grid computing.” Wikipedia. Available https://en.wikipedia.org/wiki/Grid_computing. Accessed on
October 30, 2019.
Strickland, Jonathan. “How Grid Computing Works.” HowStuffWorks. Available
https://computer.howstuffworks.com/grid-computing1.htm.
Grid Computing
[Slide: Files A, B, and C are accessible to every node (N) in the grid]
In a grid environment, all the servers need to have access to the data. This data might be in a
database that the servers can access or in files found in a common location accessible to all servers.
In the scenario shown here, each of the files, A, B, and C, needs to be accessible by each of the four
servers.
SAS Grid Manager
A grid can be used by many SAS solutions. SAS Grid Manager balances user and application
requests across a computer cluster. Computing resources can be added when needed, so processing
is no longer restricted to a fixed set of resources. SAS programmers can submit their SAS
programs simultaneously, and SAS Grid Manager places these jobs into queues. After resources
become available, SAS Grid Manager distributes these jobs to the different nodes in the cluster.
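The queue-then-dispatch behavior can be pictured with a small Python sketch. This is a toy model only, not the SAS Grid Manager interface: the function name, the node names, and the idea of a per-node count of free slots are all assumptions made for illustration.

```python
from collections import deque


def dispatch_jobs(queue, free_slots):
    """Move jobs from the queue to nodes that have free slots.

    queue: deque of job names in submission order (modified in place).
    free_slots: dict of node name -> number of jobs the node can accept now.
    Returns a dict of node name -> list of jobs dispatched to it.
    """
    assignments = {node: [] for node in free_slots}
    for node, slots in free_slots.items():
        for _ in range(slots):
            if not queue:
                return assignments
            assignments[node].append(queue.popleft())
    return assignments


# Four hypothetical programs are submitted at the same time and queued.
queue = deque(["job1.sas", "job2.sas", "job3.sas", "job4.sas"])
first_round = dispatch_jobs(queue, {"node1": 1, "node2": 2})
# job4.sas stays queued until a node reports free capacity again.
second_round = dispatch_jobs(queue, {"node1": 1, "node2": 0})
```

A real grid manager also applies priorities and policies set by an administrator. This sketch uses plain first-in, first-out order to keep the idea visible.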
Sources:
Iverson, J et al. SAS Programming on the Grid Course Notes. Cary, NC: SAS Institute Inc.: 2018.
Book code E71074, course code LWSPGRD4/SPGRID, ISBN 978-1-63526-298-6. For details about
the course notes, contact the SAS Education Division.
2.4 Cloud Computing
Cloud Computing
Cloud computing is a broad term that refers to the immediate access to computing resources hosted
over the internet. These resources can include software, data storage, processing power, and more.
Amazon Web Services defines cloud computing as follows: “Cloud computing is the on-demand
delivery of computer power, database, storage, applications, and other IT resources via the internet
with pay-as-you-go pricing.”
Companies can interact with cloud providers to get access to the resources that they need on
demand, and with flexible scalability. If a company realizes that they need more resources, they can
purchase more cloud space or more computational power to scale out their operations. Cloud
storage, as we mentioned earlier, is a type of cloud computing. Cloud storage involves using third-
party storage providers to store data instead of storing the data on-site.
Sources:
“What is Cloud Computing?” Amazon. Available https://aws.amazon.com/what-is-cloud-computing/.
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
Within cloud computing, the three broad service types are Infrastructure as a Service (IaaS),
Platform as a Service (PaaS), and Software as a Service (SaaS).
Infrastructure as a Service (IaaS)
[Slide: Infrastructure layer (servers, virtual machines, data storage)]
Providers of Infrastructure as a Service supply the infrastructure, which includes the basic computing
resources and storage, and the users then build everything else that they need. When companies
rely on IaaS providers, it can be thought of as renting servers, and their users can install operating
systems and programs on the servers. The advantage in doing this through a cloud provider instead
of buying and installing new servers is that users can request different server configurations to meet
their needs. Users can also stop paying for resources that they no longer use or quickly expand their
processing power when needed.
Sources:
“Platform as a Service (PaaS).” TechTarget, Available
https://searchcloudcomputing.techtarget.com/definition/Platform-as-a-Service-PaaS. Accessed on
October 30, 2019.
Platform as a Service (PaaS)
[Slide: PaaS adds an operating system and middleware (hardware and software) on top of the infrastructure layer of virtual machines]
Platform as a Service provides platforms for application development to customers. For example, a
company might use a PaaS provider to provide the infrastructure and services necessary to develop
the software that they sell. As techtarget.com puts it, “With PaaS, a provider offers more of the
application stack than IaaS providers, adding operating systems, middleware (such as databases)
and other runtimes into the cloud environment.” Users can develop applications without worrying
about installing the operating system or dealing with maintenance or updates. PaaS providers allow
easy scaling as the application gains more users, with little added work for developers.
Sources:
“The Advantages of PaaS: Leveraging a Platform Service.” Liquid State, March 18, 2019. Available
https://liquid-state.com/advantages-paas-platform-as-a-service/.
“PaaS Advantages, Disadvantages and Best Practices.” Cloudhelix, Available
https://cloudhelix.io/blog/post/paas-advantages-disadvantages. Accessed on October 30, 2019.
“Platform as a Service (PaaS).” TechTarget, Available
https://searchcloudcomputing.techtarget.com/definition/Platform-as-a-Service-PaaS. Accessed on
October 30, 2019.
Software as a Service (SaaS)
[Slide: SaaS adds end-user applications on top of the operating system and middleware (hardware and software) and the infrastructure layer of virtual machines]
With Software as a Service, cloud providers host software applications. These applications are
available to customers via the internet. SAS offers some SaaS products, including SAS Visual
Analytics for SAS Cloud, SAS Visual Statistics for SAS Cloud, SAS Visual Data Mining and Machine
Learning for SAS Cloud, and more. SaaS users can log on and start working without having to build
or install any components of their environment.
Sources:
“cloud computing.” TechTarget. Available
https://searchcloudcomputing.techtarget.com/definition/cloud-computing. Accessed on October 30,
2019.
“SAS Software as a Service (SaaS).” SAS Institute Inc. Available
https://www.sas.com/en_us/solutions/cloud-analytics/saas.html. Accessed on October 30, 2019.
2.5 The SAS Platform and SAS Viya
The SAS®9 Platform

[Slide: four tiers. Data Sources (SAS tables, DBMS); SAS Servers (Metadata Server, Workspace Server); Middle Tier (Middle Tier Server); Client Applications (SAS Data Integration Studio, SAS Studio)]
The traditional SAS Platform, referred to as the SAS Intelligence Platform, includes components for
data management, business intelligence, and advanced analytics. In a typical SAS 9.4 deployment,
the architecture consists of four tiers: data sources, SAS servers, the middle tier, and client
applications.
These tiers are not necessarily on separate computers or groups of computers, but rather they
represent the groupings of software that perform similar tasks.
Sources:
SAS Institute Inc. 2019. “SAS 9.4 Intelligence Platform: Overview, Second Edition.” Cary, NC: SAS
Institute Inc. Available https://go.documentation.sas.com/api/docsets/biov/9.4/content/biov.pdf.
Data Sources
SAS tables
Streaming Data
DBMS
Raw Data
Hadoop
The data sources available to your SAS platform are vast. Whether you have data in a database,
SAS tables, Hadoop, or other data sources, SAS provides engines to access your data.
SAS Servers
[Slide: Metadata Server and Workspace Server; the metadata repository holds reports, tables, SAS libraries, servers, and users and groups]
The SAS servers are the software components of your SAS deployment that receive requests from
client applications and perform requested operations. The SAS Metadata Server controls access to a
central repository of metadata that is shared by all SAS applications in the deployment. The
metadata repository includes information about the following:
• libraries and tables that are accessed by your SAS applications
• content created and used by SAS applications, including reports and queries
• SAS and third-party servers that participate in the system
• users and groups and associated permissions
When you log on to SAS applications that are part of the SAS Platform, you first authenticate to the
SAS Metadata Server.
[Slide: Metadata Server and Workspace Server; the Workspace Server is used to execute SAS code, register tables, and import data]
When users of client applications submit SAS code, it is executed by a SAS Workspace Server
session. The workspace server supports registering tables in metadata and importing data, tasks
that you learn about later. When you submit SAS code, the metadata server starts a workspace
server session that executes the code. SAS deployments can have multiple users submitting SAS
code from client sessions, and each user is provided his or her own workspace server session. In
addition, SAS deployments can be implemented with multiple workspace servers.
Client Applications
• SAS Data Integration Studio
• DataFlux Data Management Studio
• SAS Event Stream Processing Studio
• SAS Studio
The client applications provide users with various programming or point-and-click interfaces to the
SAS Platform. Some of the client applications are SAS Data Integration Studio, DataFlux Data
Management Studio, SAS Event Stream Processing Studio, and SAS Studio.
SAS Middle Tier
[Slide: the middle tier hosts SAS Studio, SAS Event Stream Processing Studio, and other web applications]
The middle tier contains software components that enable SAS users to work with web applications,
such as SAS Studio. These web applications are hosted on the middle tier and send data to and
from users who interact with these hosted applications via a web browser.
Sources:
SAS Institute Inc. 2019. “SAS 9.4 Intelligence Platform: Overview, Second Edition.” Cary, NC: SAS
Institute Inc. Available https://go.documentation.sas.com/api/docsets/biov/9.4/content/biov.pdf.
SAS Middle Tier
[Slide: web applications hosted on the middle tier (SAS Studio, SAS Event Stream Processing Studio, other web applications) interact with the Metadata Server and the Workspace Server]
Although users of web applications rely on the middle tier to host these applications, the applications
still interact with the SAS Metadata Server, and code is still submitted to and executes in a SAS
Workspace Server session.
[Slide: client applications, Metadata Server, and Grid Controller]
If an organization is working with a SAS grid computing environment, the SAS compute tasks are
distributed across the grid. The SAS Grid Control Server controls the distribution of jobs to the grid. If
a client application wants to submit code to the grid, the request is sent to the Grid Control Server,
and the request is queued and dispatched based on policies set by the grid administrator. SAS Grid
nodes perform the work for the grid and return the results and the log to the requesting client
application.
Sources:
Iverson, J et al. SAS Programming on the Grid Course Notes. Cary, NC: SAS Institute Inc.: 2018.
Book code E71074, course code LWSPGRD4/SPGRID, ISBN 978-1-63526-298-6. For details about
the course notes, contact the SAS Education Division.
SAS®9 Platform
SAS Viya
• an open, cloud-enabled analytics engine
SAS Cloud Analytic Services (CAS)
• in-memory engine
• fast processing
• data of any size
In recent years, the SAS Platform has extended beyond SAS 9.4 to include SAS Viya. SAS Viya
is a cloud-enabled, in-memory analytics engine that uses SAS Cloud Analytic Services, or CAS.
CAS is a server that provides the run-time environment for data management and analytics with
SAS, enabling you to tackle your analytics problems and gain insights from data. If an organization
is working with a SAS Viya implementation or a SAS®9 implementation with SAS LASR, a single
in-memory table can be loaded into memory across distributed computing nodes. Existing SAS 9.4
platform implementations can be used to load data into memory for use in SAS Viya or SAS LASR.
This distributed in-memory data can be processed in parallel, which allows for complex analytical
processing to be accomplished very quickly on large volumes of data.
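To make the idea of one table spread across nodes more concrete, here is a small Python sketch. It is not CAS or SAS LASR code: the function names and the round-robin distribution are assumptions chosen for illustration. Each "node" holds a slice of the rows and computes a local sum and count, and the partial results are then combined into one mean.

```python
def partition_rows(rows, node_count):
    """Distribute rows across nodes round-robin (an assumed scheme)."""
    return [rows[i::node_count] for i in range(node_count)]


def partial_stats(rows):
    """Work done independently on one node: a local sum and count."""
    return sum(rows), len(rows)


def distributed_mean(rows, node_count=3):
    """Combine per-node partial results into one overall mean."""
    partitions = partition_rows(rows, node_count)
    # Each call below is independent, so the nodes could run them in parallel.
    partials = [partial_stats(p) for p in partitions]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


mean_value = distributed_mean([10, 20, 30, 40, 50, 60])
```

Because each node needs only its own slice of the data, adding nodes lets larger tables fit in memory and lets more of the work happen at the same time.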
Sources:
SAS Institute Inc. 2019. SAS Viya Solution Overview. Cary, NC: SAS Institute Inc. Available
https://www.sas.com/content/dam/SAS/en_us/doc/overviewbrochure/sas-viya-108233.pdf.
Styliadis, P. et al. SAS SQL 1: Essentials Course Notes. Cary, NC: SAS Institute Inc.: 2019. Book
code E71409, course code LWSSQ1M6/SQ194/, ISBN 978-1-64295-094-6. For details about the
course notes, contact the SAS Education Division.
Lesson 3 The Role of Data Science and Data Scientists
3.1 Exploring the Discipline of Data Science
3.2 Understanding the Data Curation Life Cycle
3.3 The Emergence of Artificial Intelligence and Machine Learning
3.1 Exploring the Discipline of Data Science
Discipline of Data Science
[Slide: Venn diagram. Data science sits at the intersection of technical skills, math and statistics, and industry domain]
Data science is a multidisciplinary field. The technical skills required for working with the data include
computer programming, data management, data integration, data quality, and data transformation.
Knowledge of math and statistics is required to discover and explore the data and analyze it to find
value.
It is important to have the technical and statistical skills to mine the data for value, but it is equally
important to thoroughly understand the industry domain. This includes the industry drivers, the
products, the customers, and so on. To be a successful data scientist, you need technical skills,
knowledge of math and statistics, and knowledge of the industry domain. Don’t worry if you wouldn’t
consider yourself an expert in all of these fields. Teams of data scientists often work together to
solve problems.

Technical Skills
• Programming
• Database Administration
• Data Curation
• Hadoop
• Grid Computing
• Cloud Computing
• Computing Resources
The technical skills needed for data science require knowledge of any combination of the following:
• programming languages (SAS, SQL, Python, Pig Latin, R, HiveQL, ...)
• database administration
• data curation
• Hadoop, parallel processing, and grid computing
• cloud computing
• computing resources
Math and Statistics
• Traditional Statistics
• Machine Learning
The math and statistics skills of data science require knowledge of traditional statistical methods
such as regression, ANOVA, and hypothesis testing, as well as newer methods such as deep
learning and machine learning algorithms that rely more on computational power.
A growing discipline that requires knowledge of computers, mathematics, and statistics is machine
learning. Machine learning involves the creation of computer programs and algorithms that learn
from the data itself and adjust accordingly as new data becomes available. This means that as the
model is given more data to work with, it can adjust and predict based on this new data. Data
scientists today need the technical skills to successfully execute machine learning models and
algorithms, but they also need the math and statistics skills to select the appropriate machine
learning models and algorithms.
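The notion of a model that adjusts as new data becomes available can be sketched in a few lines of Python. The class below is an illustrative assumption, not a method taken from this course: it fits a least-squares line and updates its running totals one observation at a time, so each new data point changes the fitted coefficients.

```python
class OnlineLinearModel:
    """Least-squares line y = slope * x + intercept, updated one point at a time."""

    def __init__(self):
        self.n = 0
        self.sum_x = self.sum_y = self.sum_xx = self.sum_xy = 0.0

    def update(self, x, y):
        """Fold a new observation into the running totals."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def coefficients(self):
        """Return (slope, intercept); needs at least two distinct x values."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept

    def predict(self, x):
        slope, intercept = self.coefficients()
        return slope * x + intercept


model = OnlineLinearModel()
for x, y in [(0, 1), (1, 3), (2, 5)]:  # hypothetical training data
    model.update(x, y)
before = model.coefficients()

model.update(3, 10)  # new data arrives and shifts the fitted line
after = model.coefficients()
```

Choosing whether a straight line is even an appropriate model for the data is where the math and statistics skills come in. The code only carries out the choice.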
Overlap of Industry Domain and Math and Statistics
• Business Analytics
• Business Intelligence
In the world of business, industry domain and math and statistics skills overlap in a discipline
commonly referred to as business analytics. Business analytics is playing a larger and larger role in
decision making in companies. Business analysts look at past data from part of the business. Then,
using predictive modeling and statistical methods, they make predictions about the future.
Another related term is business intelligence. People in business intelligence roles are also using
statistical knowledge to work with data, and their role is very similar to the role of business analysts.
There are varying opinions on what the distinction is between business analytics and business
intelligence. Many define business analytics as being focused on business improvement and
prediction (that is, very future focused), whereas business intelligence is more explanatory and
focused on using data to view and understand current metrics. Both roles tend to require specific
business knowledge or experience.
For more information about the similarities and differences between business intelligence and
business analytics, see the article by Bergen Adair titled Business Intelligence vs. Business
Analytics: A Comprehensive Comparison of the Difference Between Them.
Sources:
Adair, Bergen. “Business Intelligence vs. Business Analytics: A Comprehensive Comparison of the
Difference Between Them.” SelectHub. Available
https://selecthub.com/business-intelligence/business-intelligence-vs-business-analytics/.
“Business analytics.” Wikipedia. Available https://en.wikipedia.org/wiki/Business_analytics. Accessed
on October 30, 2019.
Data scientists can…

• Access, manage, and manipulate data
• Understand statistical methods and analytical procedures
• Code using one or more programming languages
• Present findings clearly
Data science is at the intersection of it all. Data scientists must know how to access, manage, and
manipulate data. They must have a solid understanding of statistical methods and the analytical
procedures needed to produce meaning from the data. They must be able to program, and they
must be able to present findings clearly to business partners and industry stakeholders. With the
long list of skills required for data scientists, the desire to continue to learn and improve is important,
especially as technology continues to improve and the need for data scientists continues to grow.
3.2 Understanding the Data Curation Life Cycle
How Data Scientists Spend Their Time

• Cleaning and organizing data: 60%
• Collecting data: 19%
• Mining data for patterns: 9%
• Refining algorithms: 4%
• Building training data sets: 3%
• Other: 5%
Source: CrowdFlower, 2016.
According to CrowdFlower findings, data scientists spend 60% of their time cleaning and organizing
data and another 19% of their time collecting data sets. This means that almost 80% (79%) of a data
scientist’s time is spent preparing data before getting to model selection and analytical evaluation.
Because so much time is spent on data curation, a strong data curation plan should be outlined and
implemented.
Sources:
“Data Science Report.” CrowdFlower. 2016. Available
https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf.
Finding Data
1. Do we have data we can use?
2. Do we need to collect data?
After the questions and goals have been defined, the data required to provide the insights to these
questions and goals can be identified. Finding data can be as simple as looking at the data your
company or organization has already collected. Remember that data is constantly being generated,
especially with the creation of so many connected appliances, smart technology, and sensors.
It is also possible that in determining the questions of interest, you realize that new data has to be
collected. This data might come from customer surveys or questionnaires. Returning to the bank
example, the Marketing Department might send out a survey to customers to see what makes an
investment opportunity seem attractive. These responses could be helpful in designing a targeted
advertisement. Directly interviewing customers could generate information as well. This takes longer
than gathering data through a survey because it involves individual, in-person data collection.
However, depending on the business goals identified, the type of data collected through interviews
might be very beneficial.
Finding Data
Volume Variety Velocity Veracity
When you find and collect data, it is important to categorize that data based on volume, variety, and
velocity. When considering volume, ask yourself whether you have enough data to train a machine
learning model or make a statistically significant claim. When looking at variety, ask yourself whether
you have sampled a diverse and representative population. Returning to the drug development
example, have you made sure to collect data that adequately represents your target patient
demographic? When considering velocity, think about how quickly new data is being generated. If
data is constantly being generated, do you have a way to effectively process it?
An equally important category to consider is veracity, or the accuracy of the data. Does the identified
data contribute to answering the identified question? Is the data precise, trusted, and reliable? If not,
you might consider finding different data. Other tasks of veracity include removing inconsistencies,
null values, duplicated values, and abnormalities. These methods of improving the accuracy of the
data are addressed next in the data curation life cycle.
Exploring Data
• Visualize and Plot the Data
• Identify Anomalies and Inconsistencies
• Calculate Descriptive Statistics
After the appropriate data has been identified, but before data scientists can jump right into
answering questions, it is important to explore the data. Data scientists must be able to truly
understand the data available to them. This is much easier when working with structured data. If you
are working with structured data, a good first exploratory step is to plot the data. Plotting the data
gives a visual overview of both categorical and continuous variables. You can use bar charts,
histograms, and box plots to explore your data.
When exploring your data, identify anomalies and inconsistencies. Inconsistencies can lead to
issues in your statistical evaluations later, so it is important to identify differences in data entry,
casing, spelling, missing values, and so on, and come up with a plan of action. We discuss this in
more depth on the next slide.
In addition, calculate some basic statistics. For example, with numerical data, you can look at the
range, minimum, maximum, and frequency of values. You might also want to look at measures of
central tendency such as the mean, median, and mode. This helps you identify inconsistencies and
extreme values, and it gives you a better understanding of how variable your data values are.
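As a short illustration of these calculations, the Python sketch below uses the standard library to compute the statistics named above. The `describe` function and the `ages` column are hypothetical and exist only for this example.

```python
import statistics
from collections import Counter


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "frequency": Counter(values),
    }


# Hypothetical column of ages; 290 looks like a data entry error.
ages = [34, 29, 41, 29, 38, 290]
summary = describe(ages)
```

Here the mean sits far above the median, which is a quick signal that an extreme value such as 290 deserves a closer look before any modeling begins.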
If you are working with unstructured data, it might be helpful to do some structuring and cleansing
before exploring the data. Remember, the data curation life cycle is not a rigid system. Often, you
need to jump ahead or return to previous points in the process.
Exploring Data
Name               State           Date
John Smith         NC              March 3, 2019
Laney Booth        N.C.            .
Jonathan E. Smith  North Carolina  October 16, 2019
Kelly Crawford     N. Carolina     December 12, 2019
Issues highlighted: missing values (Date), possible spelling discrepancies (Name), inconsistent data entry (State)
As data scientists explore their data, many questions arise, and it is important that data scientists
keep inquisitive and curious mindsets. Does the data need to be cleansed and standardized? For
example, if you have a state column in your data set, how has that data been entered? If the entry
method has been inconsistent, North Carolina could be represented any of the following ways:
• NC
• N. Carolina
• nc
• North Carolina
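One simple way to standardize entries like these is a lookup table of known variants. The Python sketch below is an assumption-laden illustration: the alias table, the function name, and the choice of "NC" as the standard value are all made up for this example.

```python
# Known variants, keyed in lowercase with single spaces (assumed list).
STATE_ALIASES = {
    "nc": "NC",
    "n.c.": "NC",
    "n. carolina": "NC",
    "north carolina": "NC",
}


def standardize_state(raw):
    """Map a raw state entry to one standard code, or None if unrecognized."""
    if raw is None:
        return None
    # Normalize casing and collapse repeated whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return STATE_ALIASES.get(key)


entries = ["NC", "N. Carolina", "nc", "North Carolina"]
cleaned = [standardize_state(e) for e in entries]
```

Returning None for unrecognized entries, instead of guessing, leaves those rows visible so that someone can decide how to handle them.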
Are there spelling, casing, or pattern discrepancies in your data? For example, if you are working
with electronic medical records, how have patient names been entered? Did John Smith come in for
chest pain in March and a different Jonathan E. Smith come in for chest pain in October, or is this
the same patient?
Are there missing values? If so, are there a lot of missing data points and how is missing data going
to be addressed? It is important to establish a standard approach to what will be done with missing
values, extreme data points, and inconsistent entries before diving into the analysis of the data.
Structuring and Cleansing Data
[Slide: hourly CO2 ppm readings (378, 380, …, 379, 381) are aggregated to an average daily CO2 ppm of 379.5; State entries (NC, N.C., North Carolina, N. Carolina) are standardized to N.C.]
After a data scientist has sufficiently explored the data, the process of structuring, transforming, and
cleansing the data can begin. All the issues identified in data exploration (casing, spelling, missing
values, extreme values, and more) need to be handled in an appropriate way. In addition, there
might be a need to aggregate values or compute or create new columns. For example, if air quality
sensors collected air quality measurements down to the minute or hour, you might want to look at
these values aggregated for the day and study the change in air quality over the year. Or you might
want to take that state data and create a column for region so that you can study customers within
North Carolina and compare them to customers within the southeast. It is important that structuring,
transforming, and cleansing the data takes place because data can be used for modeling and
predictive analytics only after it has been properly prepared.
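Both kinds of restructuring, aggregating values and deriving a new column, can be sketched in Python. The readings, the date, the function names, and the state-to-region mapping below are hypothetical values chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# (date, hour, CO2 ppm) readings; all values are hypothetical.
hourly = [
    ("2019-03-03", 0, 378),
    ("2019-03-03", 1, 380),
    ("2019-03-03", 2, 379),
    ("2019-03-03", 3, 381),
]


def daily_average(readings):
    """Aggregate hourly readings into one average value per day."""
    by_day = defaultdict(list)
    for day, _hour, ppm in readings:
        by_day[day].append(ppm)
    return {day: mean(values) for day, values in by_day.items()}


# Assumed mapping from state code to region.
REGION_BY_STATE = {"NC": "Southeast", "SC": "Southeast", "NY": "Northeast"}


def add_region(rows):
    """Return copies of the rows with a derived 'region' column."""
    return [
        {**row, "region": REGION_BY_STATE.get(row["state"], "Unknown")}
        for row in rows
    ]


daily = daily_average(hourly)
customers = add_region([{"name": "Kelly Crawford", "state": "NC"}])
```

The region lookup works reliably only after the state column has been standardized, which is one reason cleansing usually comes before deriving new columns.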
Updating Data
2000 Customer         2020 Customer
Head-to-toe denim     Orders oat milk lattes
Portable CD player    Bluetooth headphones
Drives a Hummer       Environmentally conscious
Rocked frosted tips   Drinks kombucha
The next step in the data curation life cycle refers to updating data. Think about trying to predict
consumer trends. If we were to work with data from the year 2000 for example, this data would not
be accurate, useful, or relevant to describe today's customer or predict next year's customer. The
choices our customers are making today and tomorrow are more important than the choices our
customers made five or ten years ago. In fact, we might be dealing with a totally different customer
demographic depending on how the business has adapted throughout the years.
Updating Data
Relational Database Schema
• Patient Info: Patient_ID, FirstName, LastName, Address, …
• Appointments: Patient_ID, Employee_ID, Date, CheckIn, …
• Doctor Info: Employee_ID, Name, StartDate, Salary, …
Historically, data has been very structured and somewhat static in nature. Data used in financial,
health-care, and retail solutions typically included fields such as name, address, city, and state, as
well as a field for uniquely identifying the rows of data, such as customer ID or Social Security
number. The data was structured in relational databases as star and snowflake schemas, with
defined relationships between fact and dimension tables. These structures, although very fast and
efficient, are considered very rigid. Today’s data management model needs to be designed to
accommodate easier updates to data. These updates could be adding a new field to a table or
adding new social media or IoT data to a campaign analysis project. With the abundance of data
sources, flexibility when updating data is crucial.
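To show what a defined relationship and a later structural update look like in practice, the Python sketch below builds a tiny relational schema with the standard `sqlite3` module and then adds a new field to one table. The table names, column names, and sample rows are assumptions loosely modeled on the patient, appointment, and doctor example. SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patient_info (
        patient_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT
    );
    CREATE TABLE doctor_info (employee_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE appointments (
        patient_id INTEGER REFERENCES patient_info(patient_id),
        employee_id INTEGER REFERENCES doctor_info(employee_id),
        appt_date TEXT
    );
    INSERT INTO patient_info VALUES (1, 'John', 'Smith');
    INSERT INTO doctor_info VALUES (10, 'Dr. Lee');
    INSERT INTO appointments VALUES (1, 10, '2019-03-03');
    """
)

# Updating the structure later: add a new field to an existing table.
conn.execute("ALTER TABLE patient_info ADD COLUMN email TEXT")
conn.execute(
    "UPDATE patient_info SET email = 'jsmith@example.com' WHERE patient_id = 1"
)

# The key columns let the three tables be joined back together.
rows = conn.execute(
    """
    SELECT p.last_name, d.name, a.appt_date, p.email
    FROM appointments AS a
    JOIN patient_info AS p ON p.patient_id = a.patient_id
    JOIN doctor_info AS d ON d.employee_id = a.employee_id
    """
).fetchall()
```

Adding one column is easy here, but every existing row starts with a missing value for the new field, and downstream queries and reports must be revisited. That ripple effect is part of why rigid schemas are slow to change.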
Updating Data
Version 7.1 Version 7.2 Version 8.0
Sometimes, data updates occur unexpectedly. This can be referred to as data drift. When
unpredictable and unexpected changes occur to data characteristics, organizations need to know
how to address these changes. For example, think of a smart phone. Your phone is constantly
generating data based on how you use it, when you use it, what applications you use most often,
and more. Whenever updates are pushed to your phone, characteristics of the data and how this
data is generated and stored can change. This same idea applies to sensor data. For example, if
sensors are used to monitor a car’s tires, whenever software modifications are made to these
sensors, the amount, type, and characteristics of data collected could change. Data scientists need
to be able to curate the underlying data and the models and algorithms that they are using to
account for data drift.
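A basic drift check can be sketched in Python as a comparison between an earlier batch of records and a newer one. Everything here is an assumption made for illustration: the function name, the 10% tolerance, and the tire sensor readings, in which a hypothetical software update switches the pressure unit and adds a temperature field.

```python
from statistics import mean


def detect_drift(baseline, current, tolerance=0.10):
    """Compare two batches of records (lists of dicts) and report drift.

    Flags fields that appeared or disappeared, and numeric fields whose
    mean moved by more than `tolerance` (a fraction of the baseline mean).
    """
    base_fields = set().union(*(r.keys() for r in baseline))
    curr_fields = set().union(*(r.keys() for r in current))
    report = {
        "added_fields": sorted(curr_fields - base_fields),
        "removed_fields": sorted(base_fields - curr_fields),
        "shifted_fields": [],
    }
    for field in sorted(base_fields & curr_fields):
        base_vals = [r[field] for r in baseline
                     if isinstance(r.get(field), (int, float))]
        curr_vals = [r[field] for r in current
                     if isinstance(r.get(field), (int, float))]
        if not base_vals or not curr_vals:
            continue
        base_mean = mean(base_vals)
        shift = abs(mean(curr_vals) - base_mean)
        if base_mean and shift / abs(base_mean) > tolerance:
            report["shifted_fields"].append(field)
    return report


# Hypothetical readings before and after a sensor software update.
before_update = [{"pressure": 32.0}, {"pressure": 33.0}]
after_update = [
    {"pressure": 220.0, "temp_c": 21.5},
    {"pressure": 228.0, "temp_c": 22.0},
]
report = detect_drift(before_update, after_update)
```

A check like this only raises a flag. Deciding whether to convert units, retrain a model, or contact the data provider still falls to the data scientist.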
Sources:
Pancha, Girish. “Big Data’s Hidden Scourge: Data Drift.” CMSWire. Last modified April 8, 2016.
Available https://www.cmswire.com/big-data/big-datas-hidden-scourge-data-drift/. Accessed on
November 25, 2019.
Archiving Data
[Slide: old data moves to an archive location]
The last part of the data curation life cycle is archiving the data. At some point, the data that you are
working with might no longer be useful for answering your organization’s current questions. The data
might be too old, or the questions might have changed, and new data needs to be collected.
Whatever the reason, data needs to be archived at some point.
Archiving data or data retention is not the same thing as deleting or destroying data. When data is
archived, it is moved from its current primary location to a location specific for archived data, but the
data is still retained. According to TechTarget, the location for archived data might be less
expensive commodity hardware. Moving data that is not accessed as frequently to this lower-cost
hardware frees up primary storage space for newer data.
Archived data might need to be retained in order to follow regulatory compliance or for auditing
purposes. A company might think that although the data is not currently being used, it could be
important to study later. Some organizations might store this data in such a way that it is read-only
so that it serves as a true record of the data that cannot be updated.
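As a minimal sketch of archiving as opposed to deleting, the Python example below moves a file from its primary location to an archive folder and then marks it read-only. The function name, file name, and folder layout are hypothetical, and a temporary directory stands in for real primary and archive storage.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def archive_file(source, archive_dir):
    """Move a file to the archive location and mark it read-only."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / Path(source).name
    shutil.move(str(source), str(target))
    # Clear write permission so the archived copy stays as it was.
    os.chmod(target, stat.S_IREAD)
    return target


# A temporary folder stands in for primary storage in this sketch.
workspace = Path(tempfile.mkdtemp())
old_data = workspace / "sales_2000.csv"
old_data.write_text("customer_id,amount\n1,19.99\n")
archived = archive_file(old_data, workspace / "archive")
```

The data is still there and still readable after the move. Only its location and its permissions have changed.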
Sources:
Rouse, Margaret et al. “Data Archiving.” TechTarget. Available
https://searchdatabackup.techtarget.com/definition/data-archiving.
Data Governance
availability usability integrity security
Data governance is also important throughout the data curation life cycle. Data governance is a
commitment by an entire organization to develop strategies and policies for their corporate data
assets. According to TechTarget, these strategies focus on “the overall management of the
availability, usability, integrity and security of data used in an enterprise.” In order to develop a
strong data governance plan, data assets must be treated as crucial to the company. To the data
scientist, this data drives the operations of the entire business, so a good data governance strategy
is paramount to your data curation efforts.
Sources:
“Data Governance.” TechTarget. Available
https://searchdatamanagement.techtarget.com/definition/data-governance. Accessed on October 30,
2019.
[Slide: Data Curation Life Cycle: finding, exploring, structuring, cleansing, updating, archiving]
It is important to remember that the data curation process does not always play out in a step-by-step
manner. Often, data is collected, explored, structured, and cleansed, and then the data scientists
might determine that additional data needs to be collected and integrated with the original data. Data
might be updated multiple times throughout a data science project, depending on how long the
project lasts. Archived data might become important to a current data science project. It is important
that the data scientists come up with a game plan and define the steps that they plan to take but
also remain willing to adapt to change and handle unforeseen challenges.
SAS Tools and Applications for Data Curation

• SAS/ACCESS Technology
• SAS Data Loader for Hadoop
• SAS Data Integration Studio
• SAS Federation Server
• DataFlux Data Management Studio
• SAS Event Stream Processing Studio
SAS has an abundance of tools that can help data scientists build and implement data curation
plans, including SAS/ACCESS technology, SAS Data Integration Studio, DataFlux Data
Management Studio, SAS Data Loader for Hadoop, SAS Federation Server, and SAS Event Stream
Processing Studio.
SAS/ACCESS Technology
• Query and manage data stored in databases
• Use database native SQL or SAS programming language elements to work with data
• Push processing to the database and bring results back to SAS
SAS/ACCESS technology enables users to query and manage data stored in databases and other
data sources. Users can manage, update, and query data using SQL that is native to the database
or using SAS language. Processing can be pushed to the database, depending on the methods
used. Results can also be brought back to SAS and saved to SAS tables for further processing and
analysis.
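The difference between pulling data out of a database and pushing processing to it can be shown with a small Python analogy. This is not SAS/ACCESS code: an in-memory SQLite database stands in for an external database, and the `orders` table and its values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for an external database
conn.execute("CREATE TABLE orders (state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("NC", 10.0), ("NC", 15.0), ("SC", 7.5)],
)

# Option 1: pull every row to the client, then aggregate locally.
totals_local = {}
for state, amount in conn.execute("SELECT state, amount FROM orders"):
    totals_local[state] = totals_local.get(state, 0.0) + amount

# Option 2: push the aggregation to the database; only results come back.
totals_pushed = dict(
    conn.execute("SELECT state, SUM(amount) FROM orders GROUP BY state")
)
```

Both options give the same totals, but the second moves only one row per state across the connection. On large tables, that difference is the main reason to push processing to the database when the method used allows it.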
SAS Data Integration Studio
• Manage data and data warehouses
• Create jobs that generate SAS code
• Access, manipulate, and integrate data
• Store data across a wide variety of data formats
SAS Data Integration Studio is a SAS platform application interface that enables users to manage
their data integration processes across an organization. Users can create jobs using a drag-and-drop
interface. These jobs generate SAS code to access, manipulate, integrate, and store their data
across a wide variety of data formats.
DataFlux Data Management Studio
• Data integration and advanced data quality
• Perform standardization, entity resolution, and address verification
• Profile data and build business rules
• Identify and remedy data quality issues
DataFlux Data Management Studio is a platform application interface designed for data integration
and advanced data quality. To perform a wide variety of data quality operations, users leverage an
extensive library of data quality rules and algorithms, referred to as the Quality Knowledge Base, as
well as third-party reference data packs. These operations include standardization, entity resolution,
address verification, and more. DataFlux Data Management Studio also has built-in tools to profile
data and build business rules, enabling data quality stewards to identify and remedy issues in their
data. Users can design automated processes to assess data for specific data quality issues and
generate alerts when such issues arise.
SAS Data Loader for Hadoop
• Web-based, non-programmatic
• Move data in and out of Hadoop
• Interrogate and profile data for quality issues
• Transform, transpose, and join data
SAS Data Loader for Hadoop is a web-based, non-programmatic way for users to interact with data
in Hadoop. It can be used to move data in and out of Hadoop; interrogate and profile data for quality
issues; transform, transpose, and join data; and more.
SAS Federation Server
• Access secure data through a virtual layer
• Maintain, configure, and monitor data from a web browser interface
• Improve data access performance
• Apply data quality functions
SAS Federation Server is a platform application interface that makes it easier for business users to
access secure data for reporting and analysis. It enables data administrators to define SQL-based
views, making the data available to users without physically moving the data. SAS Federation Server
can be used to maintain, configure, and monitor data access from a single point of administration in
a web browser interface, improve data access performance, and apply data quality functions such as
standardization and parsing. If necessary, administrators can control business user permissions all
the way to the row and column level.
SAS Event Stream Processing Studio
• Graphical and code-based interface
• Ingest, filter, join, and aggregate event streams
• Execute external routines against event streams
• Detect patterns in event streams
SAS Event Stream Processing Studio provides a graphical interface as well as a code-based
interface that enable users to build event stream processing applications. An event stream is the
continuous flow of data points from a sensor or other application. SAS Event Stream Processing
Studio can be used to ingest, filter, join, and aggregate event streams, as well as to execute external
routines against event streams and to detect patterns in event streams.
SAS Quality Knowledge Base
[Diagram: the SAS QKB alongside SAS Data Loader for Hadoop, SAS Data Integration Studio, SAS Federation Server, DataFlux Data Management Studio, and SAS Event Stream Processing Studio]
Although this is not an exhaustive list of the SAS tools and applications used for data curation, many
of these tools play a large part in data curation and are discussed in more detail later. One additional
component that plays a large part in data curation is the SAS Quality Knowledge Base, or QKB. The
SAS QKB is a collection of files and algorithms that store data and logic for defining data
management operations such as data cleansing and standardization. The SAS QKB is used in SAS
Data Integration Studio, DataFlux Data Management Studio, SAS Data Loader for Hadoop, SAS
Federation Server, SAS Event Stream Processing Studio, and more.
3.3 The Emergence of Artificial Intelligence and Machine Learning
Artificial Intelligence and Machine Learning
[Diagram: Artificial Intelligence, with the examples Decision Making, Voice Recognition, and Recommendation Engines]
One of the major topics of conversation in the data science world—and, frankly, outside of data
science as well—is artificial intelligence, or AI. In simple terms, AI can be thought of as the ability of
your computer to mimic human intelligence.
Although AI has been around since the 1950s, the emergence of massive amounts of data and
improvements in computing power and storage have expanded the application and use of artificial
intelligence. Examples of artificial intelligence include decision making (such as in strategic games),
voice recognition (such as Siri or Alexa), and recommendation engines (such as the
recommendations given on streaming service applications like Hulu or Netflix).
Sources:
“Artificial Intelligence.” Wikipedia. Available https://en.wikipedia.org/wiki/Artificial_intelligence.
“Artificial Intelligence, What it is and why it matters” SAS Institute Inc. Available
https://www.sas.com/en_us/insights/analytics/what-is-artificial-intelligence.html. Accessed on
October 30, 2019.
Machine Learning
• Learn from data
• Identify patterns
• Make decisions
Machine learning, or ML, is considered an application of AI that can be used to automate the building
of analytical models. As SAS describes ML, “It is a branch of artificial intelligence based on the idea
that systems can learn from data, identify patterns and make decisions with minimal human
intervention” (SAS Institute Inc. 2019).
Machine learning models vary in complexity depending on the type of data that you are working with
and the questions that you hope to answer. When applying ML methods, you have to determine what
type of ML algorithm to use. Machine learning algorithms fall into four categories: supervised,
semi-supervised, unsupervised, and reinforcement. Logistic regression and linear models as well as
decision trees, random forests, and neural networks can be used as supervised machine learning
algorithms. An example of an unsupervised machine learning algorithm is k-means clustering. Data
curation is an essential precursor to building machine learning models. For additional material about
SAS Machine Learning, see the Extended Learning Page.
Sources:
Li, Hui. “Which machine learning algorithm should I use?” SAS Institute Inc. Available
https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use.
Last modified April 12, 2017. Accessed on November 25, 2019.
“Machine Learning, What it is and why it matters” SAS Institute Inc. Available
https://www.sas.com/en_us/insights/analytics/machine-learning.html. Accessed on October 30,
2019.
[Diagram: training data, algorithm (neural network), model, target]
In supervised machine learning, the selected algorithm and data are used by machine learning tools
to build a model. The data used to initially build the model from the machine learning algorithm
selected is considered the training data. Often, companies can use historical data such as
transaction data for a retailer or financial data for a bank. An appropriate answer, known as a target,
must be present in the training data. If financial data is being used to determine whether a customer
missed a credit card payment, the answer yes (they did miss a payment) or no (they did not miss a
payment) would need to be present in the training data. The learning algorithm is used to identify
patterns in the data that ultimately lead to the target, and the machine learning model learns to
identify these patterns in future data sets. As the model is given more and more training data to work
with, it can learn to make better predictions.
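To make the roles of training data, target, algorithm, and model concrete, here is a minimal sketch in plain Python. It assumes a single hypothetical input (a credit utilization ratio) and fits a one-feature logistic regression by gradient descent. The data values, learning rate, and number of passes are invented for illustration and do not come from any SAS tool.

```python
import math

# Hypothetical training data: (credit utilization ratio, target),
# where the target is 1 if the customer missed a payment and 0 if not.
training_data = [(0.10, 0), (0.20, 0), (0.30, 0), (0.70, 1), (0.80, 1), (0.90, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, learning_rate=0.5, epochs=5000):
    """Fit weight w and bias b of a one-feature logistic regression by gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(model, x):
    """Apply the learned pattern to a new observation: 1 = likely to miss a payment."""
    w, b = model
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

model = train(training_data)
print(predict(model, 0.15), predict(model, 0.85))
```

The learning algorithm (train) sees both the input and the target. The resulting model (the pair w, b) is then applied to records where the target is not yet known.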
Sources:
Wakefield, Katrina. “A guide to machine learning algorithms and their applications.” SAS Institute Inc.
Available https://www.sas.com/en_us/insights/articles/analytics/machine-learning-algorithms-
guide.html. Accessed on October 30, 2019.
[Diagram: training data, algorithm (neural network), model, target]
Data curation must occur before the training data can be used to build a model.
For the training data to be used in machine learning algorithms, data curation is a fundamental
precursory step. This data might need to be pulled from various sources. Major anomalies in the
data or sections of missing data should be identified and addressed before data is used to train a
model. In addition, any standardization needs to occur beforehand as well. Think about the missed
credit card payment data. Imagine if there were multiple months for some individuals where
confirmation of payment just wasn’t recorded. Those records of data would not present accurate
information for our learning algorithms because we would not know whether the customer had
missed a payment. Or suppose that the confirmation of payment data was entered any one of the
following ways: Y, Yes, YES, No, Missed, On Time, and so on. These possible answers would need
to be standardized to two binary options (for example, Y or N, or Missed or On Time) before the
machine learning algorithm could attempt to classify the data. If data is not curated, the machine
learning algorithm cannot consume and work with the data.
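The standardization step described above can be sketched in a few lines of Python. The mapping below is an assumption made for illustration (Y means that the payment was made, N means that it was missed), and unrecorded entries are kept as missing rather than guessed.

```python
# Hypothetical raw entries for a payment-confirmation field; None means nothing was recorded.
raw_values = ["Y", "Yes", "YES", "No", "Missed", "On Time", None, " yes "]

# Assumed mapping from trimmed, lowercased entries to one of two codes.
STANDARD = {
    "y": "Y", "yes": "Y", "on time": "Y",
    "n": "N", "no": "N", "missed": "N",
}

def standardize(value):
    """Return 'Y', 'N', or None when the entry is missing or unrecognized."""
    if value is None:
        return None
    return STANDARD.get(value.strip().lower())

cleaned = [standardize(v) for v in raw_values]
usable = [v for v in cleaned if v is not None]
print(cleaned)
print(f"{len(usable)} of {len(raw_values)} records have a usable target")
```

Records that come back as None are the ones to investigate or exclude before training, because the target for those customers is unknown.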
Lesson 4 The Roadmap to SAS® Data Curation
4.1 SAS Data Management Tools and Applications
4.2 SAS and Hadoop
4.3 Additional Data Management Tools and Applications
4.1 SAS Data Management Tools and Applications
SAS/ACCESS Technology
[Diagram: SAS Client, SAS/ACCESS Interface Engine, DBMS]
The SAS/ACCESS interface engine is a tool that enables you to transfer data between database
management systems and SAS. Using a SAS/ACCESS interface engine, programmers, analysts,
and data scientists can access data in a multitude of databases from SAS. They can then
manipulate and query this data using the SAS language as well as the native SQL for the database.
It is important to understand where processing occurs when working with data in databases.
SAS Data Integration Studio
• Understanding Metadata
• Importing Metadata
• Setting Global Options
SAS Data Integration Studio is an application that provides a visual interface to build processes
referred to as jobs that read data from source tables, transform that data, and load the results into
target tables. SAS Data Integration Studio is a part of the SAS Platform. Thus, in order to understand
SAS Data Integration Studio, we must discuss some of the key concepts of the SAS Platform.
In the SAS Platform, metadata is stored information about the characteristics of another object, such
as source data, target data, jobs, users, user permissions, and more. Metadata is shared between
SAS Platform tools, making it easy to track where objects have been used, and how they were used.
You can import SAS metadata into SAS Data Integration Studio.
SAS Data Integration Studio
• Access metadata for source data
• Register metadata for source data
• Define metadata for new tables
You can access and register metadata for many types of source data, including SAS tables,
database tables, data accessed through ODBC, and external files. This enables you to use all these
data sources as sources in SAS Data Integration Studio jobs. You can also create metadata for new
tables. If a table doesn’t already exist, you can define metadata to provide a “blueprint” for what that
table will look like. You define column names, lengths, and types, as well as indexes, keys, and more
for your target table metadata objects. The underlying physical tables are created to these
specifications when you write to this target table with a job.
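The idea of target-table metadata as a blueprint can be illustrated outside of SAS. The sketch below is only an analogy and not how SAS Data Integration Studio stores metadata: the Column and TableMetadata classes and the customers_clean table are hypothetical, and sqlite3 stands in for the physical storage.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    sql_type: str

@dataclass
class TableMetadata:
    """Hypothetical blueprint for a target table that does not exist yet."""
    name: str
    columns: list
    primary_key: str

    def create_statement(self):
        cols = ", ".join(
            f"{c.name} {c.sql_type}" + (" PRIMARY KEY" if c.name == self.primary_key else "")
            for c in self.columns
        )
        return f"CREATE TABLE IF NOT EXISTS {self.name} ({cols})"

def write_rows(conn, meta, rows):
    """Create the physical table from its metadata on first write, then load the rows."""
    conn.execute(meta.create_statement())
    placeholders = ", ".join("?" for _ in meta.columns)
    conn.executemany(f"INSERT INTO {meta.name} VALUES ({placeholders})", rows)

target = TableMetadata(
    name="customers_clean",
    columns=[Column("customer_id", "INTEGER"), Column("name", "TEXT")],
    primary_key="customer_id",
)
conn = sqlite3.connect(":memory:")
write_rows(conn, target, [(1, "Ana"), (2, "Ben")])
print(conn.execute("SELECT COUNT(*) FROM customers_clean").fetchone()[0])
```

Nothing physical exists until write_rows runs. Before that point, the table is only a description of column names, types, and keys.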
[Diagram: a transformation in SAS Data Integration Studio generates SAS code]
After you create source metadata and target metadata, you can use a visual design interface to
create jobs, which generate SAS code. You first need to become familiar with the basics of building
a job. Then you can use transformations to generate different types of SAS code when used in a job.
For example, you can use the Join transformation, which generates PROC SQL code for several
types of joins. You can also use your own SAS code to create new, reusable transformations.
DataFlux Data Management Studio
• Quality Knowledge Base (SAS QKB)
• Architecture and Configuration
• Metadata Repositories
With DataFlux Data Management Studio, you can use a variety of algorithms from the Quality
Knowledge Base, or QKB, to ensure the accuracy, completeness, and reliability of your data. Data
quality is obviously important to the data scientist, and SAS methodology exists for ensuring the
quality of your data. It is important to know the basic architecture of Data Management Studio and
the necessary configuration for working with the application. You can then set up metadata
repositories to store the objects that you create in Data Management Studio.
DataFlux Data Management Studio
Build Data Jobs
• Identification Analysis
• Data Parsing
• Gender Analysis
• Address Verification and Enrichment
• Entity Resolution
Next, you profile or explore data from a variety of sources to identify issues that might exist in the
data and need to be addressed before you use it in analysis and reporting. This proactive approach
to identifying data quality issues is of critical importance in your role as a data scientist. You then can
use a visual design interface to build processes called data jobs to address any issues that you
identified in the profile report. In the data jobs, you see how to apply a variety of predefined
algorithms stored in the Quality Knowledge Base to address data quality issues. These algorithms
include identification analysis, data parsing, gender analysis, address verification and enrichment,
entity resolution, and more.
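As a rough illustration of what a profile reveals, the following Python sketch counts missing and distinct values per column. It is not the DataFlux profiling engine: the records and the profile function are hypothetical, and a real profile report covers many more metrics.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"name": "Ana Diaz", "state": "NC", "gender": "F"},
    {"name": "Ben Lee", "state": "nc", "gender": None},
    {"name": None, "state": "North Carolina", "gender": "M"},
]

def profile(rows):
    """Return per-column counts of missing values, distinct values, and value frequencies."""
    report = {}
    columns = sorted({key for row in rows for key in row})
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "frequencies": Counter(present),
        }
    return report

report = profile(records)
for column, stats in report.items():
    print(column, stats["missing"], stats["distinct"])
```

Here the state column has three distinct spellings for what is probably one value, which is the kind of issue that a standardization step in a data job would then address.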
DataFlux Data Management Studio and the SAS QKB
• Understand QKB definitions
• Change QKB definitions
• Configure SAS to work with QKB definitions in SAS code
You can then delve deeper into understanding the QKB definitions that are used when building data
jobs. Having a better understanding of how the definitions work enables you to make changes to
them for certain situations that might arise when using the definitions to process data. SAS can be
configured to access the QKB, and you can use the definitions in SAS code, including in the SAS
DATA step, the SQL procedure, and other SAS procedures.
4.2 SAS and Hadoop
Hadoop is a platform that enables you to work with distributed data and execute tasks across
multiple machines, using parallel processing.
Hadoop Environment
[Diagram: (1) the Hadoop environment and ecosystem, showing HDFS with a NameNode and DataNodes; (2) the Hadoop higher-level programming languages HiveQL and Pig Latin]
You first need to learn about the Hadoop environment and ecosystem. Hadoop is deployed across
multiple servers, and it contains components that are used for job scheduling, metadata tracking,
and coordination of parallel processing across the distributed environment. As you use Hadoop, you
become familiar with terms such as NameNode and DataNode, and you learn the components of
MapReduce.
Hive and Pig are applications in Hadoop, and they each have a language: HiveQL and Pig Latin,
respectively. These higher-level languages enable you to manage and process the data stored in the
Hadoop Distributed File System, also called HDFS.
SAS and Hadoop
[Diagram: a SAS client working with Hadoop through Base SAS tools, HiveQL pass-through queries, and the SAS/ACCESS LIBNAME engine]
There are SAS programming methods that enable you to interact with data in Hadoop. If you have
data stored in HDFS that you want to access from SAS, you have multiple ways to do this. There are
Base SAS procedures and DATA step methods for working with data in HDFS, and you can also use
Base SAS to invoke an existing Pig program.
You can leverage the HiveQL that you learned previously to write SAS/ACCESS SQL pass-through
queries. With SQL pass-through queries, you can write the HiveQL native code in your SAS
environment and push it to your Hadoop system to process.
You can use the SAS/ACCESS LIBNAME engine to connect to your Hadoop data source and
access your data in HDFS as if the data sets were SAS tables. Your programming syntax is similar
to SAS syntax that you might have used before when working with SAS data sets.
SAS and the DS2 Language
Base SAS DATA step:
• Restricted to serial execution
• Cannot process in Hadoop
DS2 program:
• Can be threaded to process in parallel
• Can execute in the Hadoop cluster using the SAS In-Database Code Accelerator
When you want to take the capabilities of SAS DATA step processing further, you can use the SAS
proprietary language DS2. A DS2 DATA program looks similar to a Base SAS DATA step, but the
functionality of the DS2 language when working with data in HDFS far exceeds that of the DATA
step. The DATA step does only serial execution, but DS2 programs can be threaded to execute in
parallel. Furthermore, leveraging the SAS In-Database Code Accelerator for Hadoop, DS2 programs
can be sent to execute in parallel on each DataNode where the data resides in the Hadoop cluster.
This means that when working with data in Hadoop, you can leverage the Hadoop cluster’s
distributed, parallel processing instead of bringing large volumes of data in Hadoop back to SAS for
processing.
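The contrast between bringing all of the data to one place and sending the work to where each block of data resides can be sketched in Python. This is an analogy only, not DS2 and not the SAS In-Database Code Accelerator: the partitions list stands in for blocks on separate DataNodes, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data already split into blocks, one list per "DataNode".
partitions = [
    [12.0, 7.5, 3.0],
    [8.0, 1.5],
    [20.0, 4.0, 6.0, 2.0],
]

def summarize(block):
    """Work that runs where the block lives: return a small (sum, count) summary."""
    return sum(block), len(block)

# Serial approach: bring every value to one place, then process.
all_values = [v for block in partitions for v in block]
serial_mean = sum(all_values) / len(all_values)

# Partitioned approach: each block is summarized independently, and only
# the small summaries are combined at the end.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    summaries = list(pool.map(summarize, partitions))
total = sum(s for s, _ in summaries)
count = sum(n for _, n in summaries)
parallel_mean = total / count

print(serial_mean, parallel_mean)
```

The point of the pattern is that only one small summary per block has to travel, instead of every row.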
SAS Data Loader for Hadoop (web-based interface)
• Profile Data
• Cleanse Data in Hadoop
• Copy Data to Hadoop
• Transform Data in Hadoop
SAS Data Loader for Hadoop helps you access and manage data on Hadoop through an intuitive
user interface. With SAS Data Loader for Hadoop, business analysts, data scientists, and less
technically inclined users can profile, cleanse, move, and transform data in a Hadoop environment,
through a web-based interface without writing code.
4.3 Additional Data Management Tools and Applications
SAS Federation Server
• Data Federation
• Data Virtualization
• Data Disclosure Control
The SAS Federation Server is another data management tool that SAS offers. To understand the
SAS Federation Server and its capabilities, you must understand three main concepts that the SAS
Federation Server addresses.
• Data Federation is the ability to use data across multiple source systems without physically
having to move the data. The access to the data is provided via SQL views, and these views
populate data only when the view is accessed.
• Data Virtualization is the process of accessing and manipulating data from disparate systems
through a common data-access approach that hides the complexity of data access from the end
user. This includes how the data is formatted, where it is located, database security, database
schemas or table names, and so on, as well as how data across multiple sources fits together.
• Data Disclosure Control is modifying data so that no sensitive information remains. The
challenge of Data Disclosure Control is in the ability to share information with users, while at the
same time, protecting personally identifiable information (for example, account numbers,
addresses, phone numbers, and taxpayer IDs) from the end user. It requires that those data
elements be masked from the end user in some way.
SAS Federation Server
• Central Location for Data Connections
• Multi-user, Concurrent Data Access
• Virtualized Data Views
The SAS Federation Server is the central location for setup and maintenance of these data
connections and views. It supports multi-user, concurrent data access. This means that multiple
users can access the same data at the same time. Using SAS Federation SQL, also called FedSQL,
you can create virtualized data views from varying data sources without moving the source data.
The SAS Federation Server provides a central location in which administrators define FedSQL views
that business users can access. These views provide a consistent data model with access control,
data masking, and security to the end user.
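The view-based approach can be illustrated with ordinary SQL. The sketch below uses Python's standard-library sqlite3 module rather than FedSQL or SAS Federation Server, and the two tables, which stand in for separate source systems, are hypothetical. It shows a view that is populated only when it is queried and that masks a sensitive column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical source tables standing in for separate source systems.
conn.execute("CREATE TABLE crm_customers (customer_id INTEGER, name TEXT, phone TEXT)")
conn.execute("CREATE TABLE bank_accounts (customer_id INTEGER, account_number TEXT, balance REAL)")
conn.execute("INSERT INTO crm_customers VALUES (1, 'Ana Diaz', '919-555-0101')")
conn.execute("INSERT INTO bank_accounts VALUES (1, '4000123412341234', 250.0)")

# The view stores only the query, not the data; rows are produced when it is queried.
# Account numbers are masked to their last four digits, and phone numbers are left out.
conn.execute("""
    CREATE VIEW customer_balances AS
    SELECT c.name,
           '************' || substr(a.account_number, -4) AS account_masked,
           a.balance
    FROM crm_customers AS c
    JOIN bank_accounts AS a ON a.customer_id = c.customer_id
""")

# A later change to the source shows up in the view without any reload.
conn.execute("UPDATE bank_accounts SET balance = 300.0 WHERE customer_id = 1")
row = conn.execute("SELECT * FROM customer_balances").fetchone()
print(row)
```

Users who query the view see a consistent, joined result with the sensitive elements hidden, while the source tables stay where they are.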
SAS Event Stream Processing
• Provides connectivity to many source formats
• Provides connectivity to messaging systems and data-flow systems
• Enables users to build applications to ingest and process data in real time
SAS Event Stream Processing enables you to ingest many sources of streaming data, including
sensor data from manufacturing processes, social media activity, and financial transactions. In
addition, there is connectivity to messaging systems like MQTT and RabbitMQ and data-flow
systems like Apache Camel and NiFi.
In some cases, SAS Event Stream Processing can be used to capture and store this data in a
repository. The data is then available for subsequent in-depth analysis and reporting. In other cases,
SAS Event Stream Processing can be used to monitor streaming data and trigger specific actions
when certain conditions are met. Sometimes, immediate responses are crucial for safe and efficient
work. For example, sensors might be placed on heavy machinery to collect processing data. Sensor
readings can indicate imminent failure of a piece of machinery. When this data is read into an Event
Stream Processing model, a notification could be generated to alert the business that a timely
response is needed to prevent damage. We will learn how SAS Event Stream Processing is used to
build applications that ingest streaming data, process that data in real time, and provide immediate
responses when configured patterns or anomalies are detected.
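The monitoring scenario can be sketched as a small Python generator pipeline. This is not SAS Event Stream Processing code: the sensor events, the threshold, and the rule of two consecutive high readings are invented for illustration.

```python
def sensor_stream():
    """Hypothetical stream of (machine_id, temperature) events; a real stream would be unbounded."""
    yield from [("press-1", 71.0), ("press-1", 74.5), ("press-2", 69.0),
                ("press-1", 93.2), ("press-2", 70.1), ("press-1", 95.8)]

def monitor(events, threshold=90.0, consecutive=2):
    """Yield an alert when a machine exceeds the threshold on `consecutive` of its readings in a row."""
    streak = {}
    for machine_id, temperature in events:
        if temperature > threshold:
            streak[machine_id] = streak.get(machine_id, 0) + 1
        else:
            streak[machine_id] = 0
        if streak[machine_id] == consecutive:
            yield f"ALERT: {machine_id} above {threshold} for {consecutive} readings"

alerts = list(monitor(sensor_stream()))
print(alerts)
```

Each event is examined as it arrives and only a small amount of state (the current streak per machine) is kept, which is what allows a response while the data is still in motion.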
SAS Event Stream Processing
• Use Built-in Functions
• Leverage Advanced Analytics
• Integrate Custom Code
SAS Event Stream Processing has a variety of built-in functions. You use these functions to
transform data streams and detect anomalies in data. In addition, there is support for integrating
custom code written in C++, Python, and SAS DS2. SAS Event Stream Processing also provides
advanced analytical capabilities, including algorithms for natural language text processing, image
recognition, video image tracking, and other machine learning algorithms. Custom analytical models
(for example, models built in SAS Model Studio) are also supported.
SAS Event Stream Processing
• XML and C++
• SAS Event Stream Processing Studio
• Python and Jupyter Notebook
You can work with streaming data programmatically and develop SAS Event Stream Processing
applications directly in C++ and XML.
Also available is SAS Event Stream Processing Studio, a user-friendly, browser-based graphical
development environment with XML code generation. And you can use a Python interface that is
integrated with Jupyter Notebooks to develop SAS Event Stream Processing applications and
visualizations of Event Stream Processing outputs.
SAS Data Governance
• Business Data Glossary: SAS Business Data Network
• Metadata Lineage: SAS Lineage
• Data Monitoring: DataFlux Data Management Studio
Data governance methods are used to centrally create, standardize, and view reference data.
Groups within an organization or business come to agreement on the terms that support and
document business initiatives as well as data curation processes. A good data governance project
requires the use of standard definitions for people, processes, and technology. It is part of your role
as a data curator to keep this in mind as you implement the data curation process.
SAS Data Governance enables you to address the scope of data governance, including the
business data glossary, metadata lineage, and data monitoring. As you learn about SAS Data
Governance, you are exposed to SAS Business Data Network, SAS Lineage, and DataFlux Data
Management Studio.
SAS Data Governance
• Business Data Glossary
• SAS Lineage
A business data glossary consists of a hierarchy of terms. You assign tags to a term, associate terms
with one another, assign a contact for a term, and add any number of additional attributes to the
term.
In SAS Lineage, you can associate a term with table and column metadata, libraries and data
connections, processes, and more. You can also create visual relationship diagrams to investigate
and document relationships among terms and metadata objects.
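A glossary of this kind is essentially a tree of terms with extra properties attached. The Python sketch below is a hypothetical data structure, not the SAS Business Data Network model: the Term class, its field names, and the sample terms are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    """Hypothetical glossary term; the field names are illustrative only."""
    name: str
    contact: str = ""
    tags: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)
    related: list = field(default_factory=list)   # names of associated terms
    children: list = field(default_factory=list)  # terms below this one in the hierarchy

    def add_child(self, term):
        self.children.append(term)
        return term

    def paths(self, prefix=""):
        """Yield the full hierarchical path of this term and every descendant."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path
        for child in self.children:
            yield from child.paths(path)

customer = Term("Customer", contact="data.steward@example.com", tags={"core"})
account = customer.add_child(Term("Account", tags={"finance"}, related=["Payment"]))
account.add_child(Term("Account Number", attributes={"sensitive": True}))
glossary_paths = list(customer.paths())
print(glossary_paths)
```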
Let’s keep learning!
So that's a lot of material to cover. Keep learning and you will be a SAS data curation champ – and a
stronger data scientist. Good luck!
