
V9.1.17
Fundamentals of Data Science
Fall 2017

Daniel Egger

Outline of Lecture No. 1


August 31, 2017

Lecture 1, Part One:


Course Overview

Incomplete Information

It is often necessary to make decisions under conditions of incomplete information.


Incomplete information means uncertainty about outcomes.

The Role of the Data Scientist

(1) To identify relevant questions;
(2) To gather and analyze the available data, reducing uncertainty as much as possible given actual constraints of time and resources;
(3) To communicate findings effectively for informed decision-making, while quantifying and monitoring residual uncertainty and remaining receptive to the further benefit of new information.

Business Goals

From the point of view of business, better decisions are those that:
(1) Increase Revenues;
(2) Improve Profitability (by reducing costs of delivering goods or services,
or otherwise increasing efficiency); and
(3) Reduce Risk.

Models as Probability Distributions

Models are simplified mathematical pictures of real-life situations that can be


quantified, measured, and compared.

Data Scientists construct, adjust, and study Models.

In Bayesian Logical Data Analysis (our approach in this course) models are
generally represented as Probability Distributions.

When a model is a probability distribution, updating the distribution based on new
data is done using Bayes' Theorem.
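As an illustration of that updating step, here is a minimal Python sketch of Bayes' Theorem applied to a two-hypothesis distribution. The coin models and their likelihoods are invented for illustration, not taken from the lecture.

```python
# Minimal sketch (illustrative values): updating a discrete probability
# distribution with Bayes' Theorem after observing one piece of data.

# Prior: two hypothetical models of a coin, initially equally probable.
prior = {"fair": 0.5, "biased": 0.5}

# Likelihood of observing "heads" under each model (assumed values).
likelihood_heads = {"fair": 0.5, "biased": 0.8}

def bayes_update(prior, likelihood):
    """posterior(model) = prior(model) * likelihood(model) / evidence."""
    unnormalized = {m: prior[m] * likelihood[m] for m in prior}
    evidence = sum(unnormalized.values())
    return {m: p / evidence for m, p in unnormalized.items()}

posterior = bayes_update(prior, likelihood_heads)
print(posterior)  # the biased model gains probability after seeing heads
```

Repeating the update with each new observation is exactly the "update the model with the outcome" loop described below in the pipeline steps.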

Data Science Best Practices: Projects and Pipelines

Well-engineered data science systems should include the following steps:


(1) Gather available data
(2) Scrub Data for analysis [The process is known as Extract, Transform,
Load (ETL)]
(3) Define business questions these data can answer; Analyze Data
(4) Generate a Model of the relevant situation
(5) Communicate your recommendations for action
(6) Make Decision
(7) Take action
(8) Record both the action taken and the resulting outcome
Return to step (4) and update the model with the outcome. Repeat (4)-(8)
continually.

A completely automated system, where new data inputs continually update outputs,
for example, what offers an online customer sees, or what buy and sell orders an
algorithmic trading system issues, is often called a data pipeline. It is an
engineered system.

Data Scientists do Projects manually, by executing steps (1)-(5) above; they also
design and build Pipelines, in which all of steps (1)-(8) occur automatically.

Generally in business, successful projects lead to pipelines.

Most Common Types of Models/Probability Distributions

Binary (only two possible outcomes)


Example: coin heads or tails

Discrete (categorical or numerical, with 2 or more outcomes)


Numerical example: die {1,2,3,4,5, or 6}.
Categorical example: expected votes {Clinton, Trump, Johnson, Stein}

Continuous (all outcomes in a range)


Example: a regression model that forecasts the amount of time a customer
support technician will spend resolving one customer's issue, as a
function of experience (the total time that technician has spent at the
job).

Linear Regression Models are Probabilistic

The linear correlation metric R (or coefficient of determination, R-squared)
corresponds to a known probability distribution of residuals - the error when using
the model for forecasting - and a known reduction in uncertainty.

Parametric Linear Regression models include probability distributions. The
regression function point forecast, y(i) = beta*x(i) + alpha, is the SIGNAL, and an
error function N(0, sigma^2) is the NOISE; the error takes the form of a probability
distribution. This will be explained in detail later in the course.
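As a preview, the signal-plus-noise decomposition can be simulated in a few lines. This is a hedged sketch: the values of alpha, beta, and sigma are invented for illustration.

```python
# Illustrative sketch of the "signal plus noise" view of linear regression.
# The parameter values below are made up, not from the lecture.
import random

alpha, beta, sigma = 2.0, 0.5, 1.0   # intercept, slope, noise std. dev.

random.seed(0)
xs = [float(i) for i in range(50)]
# Each observation is SIGNAL (beta*x + alpha) plus NOISE drawn from N(0, sigma^2).
ys = [beta * x + alpha + random.gauss(0.0, sigma) for x in xs]

# The point forecast at any x is just the signal; the residual is the noise.
forecasts = [beta * x + alpha for x in xs]
residuals = [y - f for y, f in zip(ys, forecasts)]
print(sum(residuals) / len(residuals))  # mean residual, near 0
```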

This course will include the following:

A. Review of basic probability theory concepts, up to and including how to use new
data and Bayes' Theorem to update a probability distribution - the model for all
machine learning.

B. Information Measures - Shannon entropy and how it can be used both as a metric
to quantify the reduction in uncertainty (information gain) provided by a model, and
to compare alternative models to determine which is most effective.

C. Binary Classification Models - their technical vocabulary and performance
metrics.

D. Gaussian Noise and Linear Regression - understanding linear regression models
in terms of forecasting, probability distributions, and information gain. This is a
signal-processing approach.

E. Insights into study design - including the most common errors that cause
research not to be reproducible, and how to avoid them.

F. A realistic practice project evaluating Credit Card applications that uses both:
Binary Classification, to select applicants based on forecasting default or no
default; and Linear Regression, to select applicants based on forecasting
future profitability.

G. Python Week - programming for complete beginners (pass-fail module).

H. Projects with real-world data (Cisco test-network).

Lecture 1, Part Two:

Basic Probability Part 1

Probability
Degree of belief in the truth of a statement.

[Definition from Bayesian Logical Data Analysis - see Cox Axioms, McKay p. 26]

A statement that is true with certainty is assigned probability 1.


Notation: p(x) = 1

A statement that is false with certainty is assigned probability 0.


Notation: p(x) = 0

All other statements have some degree of uncertainty and are assigned probabilities
that are real numbers greater than 0 and less than 1.
Notation: 0 < p(x) < 1.

Negation

If x is a statement, ~x is its Negation.


Read "Not x" or "It is not the case that x."
Double Negation: p(~~x) = p(x)
Excluded middle: the probabilities of a statement and its negation sum to one: p(x) + p(~x) = 1
Written another way, p(x OR ~x) = 1
The events are exclusive: p(x AND ~x) = 0

Note also the following logical negation rules:


~(a OR b) = (~a AND ~b)
~(a AND b) = (~a OR ~b)
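These negation rules can be verified exhaustively over truth values; a small illustrative Python check:

```python
# Illustrative check that the logical negation (De Morgan) rules above
# hold for every combination of truth values of a and b.
from itertools import product

for a, b in product([True, False], repeat=2):
    assert (not (a or b)) == ((not a) and (not b))   # ~(a OR b) = (~a AND ~b)
    assert (not (a and b)) == ((not a) or (not b))   # ~(a AND b) = (~a OR ~b)

print("Both negation rules hold for all four cases")
```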

Probability Distribution

A collection (set) of exclusive and exhaustive statements.

Probabilities p(x) for a set of statements {x1, x2, x3, ..., xn} that form a probability
distribution must sum to 1.

Exclusive means no more than one statement can be true with certainty, given
complete information.
Exhaustive means at least one statement must be true with certainty, given
complete information.

Given complete information, exactly one statement in a probability distribution is
true with certainty and the others are false with certainty.

Notation: Complete probability distributions are generally represented by capital
letters, while individual events/outcomes are generally represented by lower-case
letters:

X = {x1, x2, x3, ..., xn}    Y = {y1, y2, y3, ..., ym}

Note: Since most statements x assert a particular outcome or an event (a group of
outcomes), we also refer to x1, x2, x3, etc. informally as outcomes or events rather
than as statements about events.

Principle of Indifference

Given a collection of n events that contribute to outcomes in a probability


distribution, in the absence of any distinguishing information, the probability
assigned to each event is 1/n.

It follows from the principle of indifference that, in the absence of any distinguishing
information, the probability of a particular outcome =
(The number of events that meet the relevant definition for that outcome) / (the
total number of possible events).

Number of Relevant Events Divided by Total Number of Events

A Universe, (also called a Sample Space) is the set of all possible outcomes.

Probabilities are assigned to outcomes.

The probability of an outcome must be greater than or equal to zero.

A probability distribution consists of a set of outcomes that include each outcome


in the universe exactly once, and for which the assigned probabilities sum to 1.

It follows from the definitions of probability and the Principle of Indifference that in
the absence of additional information, the probability of any outcome is the number
of events it contains, divided by the total number of events in the universe.

For example: when tossing a fair, six-sided die, the outcome "the result is even"
contains three different events: 2, 4, and 6. Because by the principle of indifference
each event must have probability 1/6, the probability of the outcome "the result is
even" is 3/6 = 1/2.
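The counting argument above can be written as a tiny, illustrative Python sketch: list the equally probable events, count the relevant ones, and divide.

```python
# Illustrative sketch: probability as (relevant events) / (total events),
# under the principle of indifference, for the "result is even" outcome.
from fractions import Fraction

universe = [1, 2, 3, 4, 5, 6]               # all equally probable events
relevant = [e for e in universe if e % 2 == 0]

p_even = Fraction(len(relevant), len(universe))
print(p_even)  # 1/2
```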

Joint Probabilities

A joint probability p(x1, y1) is the probability that both outcome x1 from probability
distribution X AND outcome y1 from probability distribution Y are true.

The joint probability distribution, written P(X,Y), is the collection of all possible joint
probabilities of outcomes from X and outcomes from Y. Note that if X has n
outcomes and Y has m outcomes, then the joint distribution P(X,Y) is a new
probability distribution with n*m outcomes.

For example: if I toss a six-sided die once, and flip a coin (Heads/Tails) once, the
joint distribution would contain 12 outcomes, to which probabilities would be
assigned:

1 and Heads 2 and Heads 3 and Heads 4 and Heads 5 and Heads 6 and Heads

1 and Tails 2 and Tails 3 and Tails 4 and Tails 5 and Tails 6 and Tails
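The 12-outcome joint distribution above can be enumerated directly; a minimal Python sketch (the dictionaries and names are chosen for illustration):

```python
# Illustrative sketch: the joint distribution of a fair die and a fair coin,
# built by enumerating all n*m outcome pairs.
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {side: Fraction(1, 2) for side in ("Heads", "Tails")}

# Die and coin are independent, so each joint probability is the
# product of the two marginal probabilities.
joint = {(f, s): die[f] * coin[s] for f, s in product(die, coin)}

print(len(joint))           # 12 outcomes
print(sum(joint.values()))  # probabilities sum to 1
```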

Order Doesn't Matter for joint distributions and joint probabilities

The probability of a joint outcome, one from X and one from Y, does not depend on
order:

p(x1, y1) = p(y1, x1) for all combinations of outcomes, so that

P(X,Y) = P(Y,X).

Independence

If two probability distributions X and Y are independent, any joint probability
p(xi, yj) is equal to the product of the two individual probabilities p(xi)p(yj).
This is summarized by saying "the joint distribution equals the product
distribution."

If a joint distribution of X and Y, P(X,Y), does not equal the product distribution
P(X)P(Y), then X and Y are dependent.

For example: suppose that over many years the expected number of hot (100 °F /
37.78 °C) days in Durham is 40 per year, or 10.96%, and the expected number of
rainy days is 75, or 20.55%. If hot days and rainy days are independent, the expected
probability of hot-and-rainy days is (10.96%)(20.55%), or 2.25% - so the average
number of hot-and-rainy days per year would be 8.2.
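The arithmetic in this example can be reproduced directly; a minimal sketch using the counts from the text (40 hot days and 75 rainy days out of 365):

```python
# Reproducing the hot-and-rainy figures under the independence assumption,
# using the counts given in the example above.
p_hot = 40 / 365
p_rainy = 75 / 365

p_hot_and_rainy = p_hot * p_rainy     # product rule for independent events
expected_days = p_hot_and_rainy * 365

print(round(p_hot_and_rainy * 100, 2))  # ~2.25 (percent)
print(round(expected_days, 1))          # ~8.2 days per year
```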

Venn Diagrams Are Sometimes Used to Represent Probabilities

The area of the rectangle represents the Universe of all possible outcomes.
It has area 1.
The areas of the circles represent p(A) and p(B).

Intersection

P(A AND B)

The joint probability p(x1, y1) can be represented as the intersection of the two
probabilities p(x1) and p(y1).

Example (for independent events)


Probability that a fair die will come up 3 AND a fair coin will come up heads
= (1/6)(1/2)
= 1/12.

Union

P(A OR B)

This is the SUM RULE of probability

The outcomes included in p(A OR B) are all outcomes in either A or B.

p(A OR B) = p(A) + p(B) - p(A,B)

Example: Probability that a fair six-sided die will come up 3 OR a fair coin will
come up heads [or both] = (1/6) + (1/2) - ((1/6)(1/2))
= 2/12 + 6/12 - 1/12
= 7/12.

Example: The probability that a fair coin will come up heads at least once in two
tosses is (1/2) + (1/2) - (1/4) = 3/4.
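The 7/12 result can be verified by brute-force enumeration of the 12 equally likely (die, coin) outcomes; an illustrative Python sketch:

```python
# Illustrative check of the sum-rule example: enumerate the 12 equally
# likely (die, coin) outcomes and count those with a 3 OR heads (or both).
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), ("Heads", "Tails")))
hits = [o for o in outcomes if o[0] == 3 or o[1] == "Heads"]

p = Fraction(len(hits), len(outcomes))
print(p)  # 7/12
```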

Complement

P(~A)

P(~A) = 1 - p(A).

Permutations versus Combinations

Permutations are distinct outcomes when order matters.

Example: Two Coin Tosses can have 4 Permutations


First Toss: Second Toss:
1) Heads Heads
2) Heads Tails
3) Tails Heads
4) Tails Tails

Combinations are the outcomes when order does not matter.

Example: Two Coin Tosses have 3 Combinations

1) 2 Heads
2) 1 Head, 1 Tail (2 separate events in this Outcome)
3) 2 Tails

Drawing With Replacement versus Without Replacement

Examples

When tossing a coin multiple times, we do it with replacement - we don't "use up"
heads after we get one head, or "use up" tails after we get one tail.

When choosing one of 1000 three-character strings between 000 and 999 at
random, the digits {0, 1, 2, ..., 9} are each used with replacement - they can occur
more than once.

When drawing cards from a deck to make a five-card poker hand, the cards are
drawn without replacement - if you draw the Ace of Clubs on the first card, there is
no chance of drawing the Ace of Clubs on subsequent draws, because it has already
been removed from the deck. So, when calculating the probability of drawing an Ace
on the second draw after drawing an Ace on the first draw, the probability is 3/51:
(number of remaining Aces in the deck) / (number of remaining cards in the deck).

Urns

Urns are often used in school probability problems. They are imaginary containers
where the contents consist of black and white marbles that cannot be seen and may
be drawn out one at a time with equal probability of drawing any that remain in the
container.

Assume an urn contains exactly three marbles, two white and one black. The
probability of drawing two white marbles in a row with replacement is (2/3)(2/3) =
4/9.

The probability of drawing two white marbles in a row without replacement is not
the same. The two draws are now dependent, because the outcome of the first
changes the probabilities of white and black on the second.

On the first draw the probabilities are the same: p(white) = 2/3, p(black) = 1/3.

However, if you draw white first, the changed probabilities on the second draw are:
p(white) = 1/2, p(black) = 1/2.

So the probability of drawing white on the first draw and white on the second is (2/3)*(1/2) = 1/3.
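Both urn probabilities can be checked by enumerating the equally likely ordered draws; an illustrative Python sketch:

```python
# Illustrative check of the urn example: two white marbles and one black,
# drawing twice, comparing with- and without-replacement probabilities.
from fractions import Fraction
from itertools import product, permutations

urn = ["white", "white", "black"]

# With replacement: all 3*3 ordered pairs of draws are equally likely.
with_repl = list(product(urn, repeat=2))
p_ww_with = Fraction(sum(pair == ("white", "white") for pair in with_repl),
                     len(with_repl))

# Without replacement: all ordered pairs of distinct marbles (3*2 = 6).
without_repl = list(permutations(urn, 2))
p_ww_without = Fraction(sum(pair == ("white", "white") for pair in without_repl),
                        len(without_repl))

print(p_ww_with)     # 4/9
print(p_ww_without)  # 1/3
```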

Excel for Hypergeometric Distributions

The probability distribution of obtaining exactly s successes - here, black marbles -
in a sample of n draws without replacement, when an urn contains M black
marbles out of a total population of N marbles, is called a Hypergeometric
distribution. The Excel function is HYPGEOM.DIST(s, n, M, N, cumulative = FALSE).

For example, to calculate the probability of drawing exactly 2 black marbles in 20
draws, from a population of 100 marbles containing 10 black marbles using Excel,
use the function HYPGEOM.DIST(2, 20, 10, 100, false) = 31.8%.
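For readers without Excel, the same value can be computed from binomial coefficients. This is a hedged sketch, not the lecture's method: the function name is invented, and math.comb requires Python 3.8+.

```python
# Illustrative Python equivalent of Excel's HYPGEOM.DIST(s, n, M, N, FALSE),
# built from binomial coefficients. The function name is hypothetical.
from math import comb

def hypergeom_pmf(s, n, M, N):
    """P(exactly s successes in n draws without replacement,
    from a population of N containing M successes)."""
    return comb(M, s) * comb(N - M, n - s) / comb(N, n)

p = hypergeom_pmf(2, 20, 10, 100)
print(round(p, 3))  # ~0.318, matching the Excel result above
```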

Factorial and "M choose N" Notation

Selecting an item from a set of m possibilities n times with replacement, when the
order matters (permutation), can happen in m^n ways.

For example:
Drawing from the set {Heads, Tails} two times has 2^2 = 4 permutations.
Drawing from the set {0, 1, 2, ..., 9} three times has 10^3 = 1000 permutations.
Drawing from a 52-card deck five times with replacement has 52^5 =
380,204,032 permutations.

To calculate how many ways we can draw from a set of m possibilities n times
without replacement when the order matters (permutation), we use factorial
notation.

Problems of Selection Without Replacement Generally Use Factorial Notation

Permutations

Selecting an item from a set of m possibilities n times without replacement results
in m!/(m-n)! distinct orderings (permutations).

[What does this mean? Factorial notation writes 5*4*3*2*1 = 120 as 5!, read "five
factorial", 6*5*4*3*2*1 = 720 as 6!, read "six factorial", and so on. To write a
product such as 7*6 we can write 7!/5!, because this is equal to
(7*6*5*4*3*2*1)/(5*4*3*2*1) and all terms but 7*6 cancel out.
In addition, by convention, we set 0! = 1.]

Some examples:

To draw n = 2 times from the urn containing {black, white} without
replacement when the order matters, there are 2!/0! = 2 permutations
({black, white} and {white, black}).

To draw n = 3 times from {0, 1, 2, ..., 9} without replacement when the order
matters, there are 10!/(10-3)! = 720 permutations.

To draw 5 cards from a 52-card deck without replacement when the order
matters there are 52!/(52-5)! = 311,875,200 permutations.
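These permutation counts can be checked with Python's math.perm (available in Python 3.8+), which computes m!/(m-n)! directly; an illustrative sketch:

```python
# Checking the permutation counts above with math.perm, which
# computes m!/(m-n)! for draws without replacement where order matters.
from math import perm

print(perm(2, 2))    # 2
print(perm(10, 3))   # 720
print(perm(52, 5))   # 311875200
```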

Combinations

Often we don't care about the order, but only about the resulting group, or
combination. The number of unique combinations when drawing from a set of m
things n times without replacement is

m!/((m-n)! n!).

This formula is used so frequently in probability that it has its own name, "m
choose n", and its own special notation, written here as C(m, n).

We use Choose for Combinations

The number of unique sets when drawing from the urn (or set) {black, white} twice
without replacement is C(2, 2) = 2!/(0!2!) = 1 combination.

The number of unique groups of numbers when drawing from the set {0, 1, 2, ..., 9}
3 times without replacement is C(10, 3) = 10!/(7!3!) = 120 combinations.

The number of unique five-card poker hands when drawing from a 52-card deck
without replacement is C(52, 5) = 52!/(47!5!) = 2,598,960 combinations. Note that this
is the number of permutations, 311,875,200, divided by the number of ways five
cards can be ordered, which is 5! = 120.
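The combination counts, and the relation "combinations = permutations / n!", can be checked with math.comb (Python 3.8+); a short illustrative sketch:

```python
# Checking the combination counts above with math.comb, and verifying
# that combinations equal permutations divided by n!.
from math import comb, factorial, perm

print(comb(2, 2))    # 1
print(comb(10, 3))   # 120
print(comb(52, 5))   # 2598960
print(perm(52, 5) // factorial(5))  # 2598960 again
```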

Probability Equals the Number of Relevant Outcomes Divided by the Total Number of Possible Outcomes

Decks of cards contain four suits (hearts, diamonds, clubs, spades) of 13 cards each.
Suppose we want to know the probability of being dealt a flush (5 cards of the same
suit). We don't care about the order in which the cards are dealt, so this is a combination
problem.

The probability will be the ratio of combinations that are flushes to total
combinations possible.

The total number of combinations that are flushes is the number of ways you can
draw 5 cards from the 13 cards of one suit without replacement, multiplied by 4.
This is 4*C(13, 5), or 5,148.

The total number of combinations possible is C(52, 5), or 2,598,960.

The probability is 5,148/2,598,960 = 0.001981.
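The flush calculation can be reproduced in a few lines of Python (an illustrative sketch; math.comb requires Python 3.8+):

```python
# Re-deriving the flush probability from the counts in the text.
from math import comb

flushes = 4 * comb(13, 5)   # 5 cards from one suit's 13, times 4 suits
hands = comb(52, 5)         # all possible five-card hands

p_flush = flushes / hands
print(flushes, hands)       # 5148 2598960
print(round(p_flush, 6))    # 0.001981
```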

Revisiting the Hypergeometric Distribution

Consider again the Hypergeometric distribution example, where s = 2, n = 20, M =
10, and N = 100.

The total number of distinct ways that 2 black marbles can be chosen from the 10
in the population is (M choose s), or C(10, 2).

The total number of ways that the remaining 18 available slots (18 because 20 were
drawn and 2 are occupied by black marbles) can be filled by the 90 white marbles in
the population is ((N-M) choose (n-s)), or C(90, 18).

The numerator is the total number of relevant events - the ways that the 18 white
and 2 black can occur together: the product C(10, 2) * C(90, 18)
= 45 * 3.78965*10^18 = 1.70534*10^20.

The denominator is the total number of events - the ways 20 draws can be taken
from the population of 100 marbles without regard to color: C(N, n) = C(100, 20) =
5.35983*10^20.

The probability of achieving this particular result is:


(1.70534*10^20)/(5.35983*10^20) = 0.318.

Coursera Videos and Quizzes to Supplement this Week

https://www.coursera.org/learn/datasciencemathskills

Section 5, Introduction to Probability Theory


Probability Definitions and Notation (8 minutes)
Joint Probabilities (6 minutes)
Practice Quiz on Definitions
Permutations and Combinations (7 minutes)
Using Factorial and M Choose N (7 minutes)
The Sum Rule, Conditional Probability, and the Product Rule [first
part]
Practice Quiz on Problem Solving

End of Week One

