2018-2019
BIEF – Class 22
INSTRUCTORS
• Emilio Gregori (emilio.gregori@unibocconi.it)
Office hours: Monday, 4.30 p.m. to 5.30 p.m.
Room: 3-D1-12 (Roentgen building)
Lectures: 13+13
International site > Programs > Current students > Courses offered in Academic Program 2018-2019 >
Code “30001” search > 30001 > click on class 22 or scroll down to BIEF/BIEM
TEXTBOOKS
• P. Newbold, W.L. Carlson, B. Thorne (2013). Statistics for Business and Economics, and Student CD, Pearson (eighth edition, Global edition).
• Additional Materials Document ("AMD"),
posted on the web learning page of the
course.
30001 STATISTICS - Material common to all classes > Materials > 2. ADDITIONAL
MATERIAL > Additional Material Document (AMD) on Descriptive Statistics
TEXTBOOKS
(forthcoming)
TOPICS
• Descriptive statistics
• Statistical inference:
– Random variables and introduction to statistical
inference
– Confidence intervals
– Hypothesis testing
– Correlation and linear regression
EXAMINATION RULES
WRITTEN PARTIAL EXAMS
• OCT, 19th: First paper exam (max. 27 points)
• DEC, 10th: On line exam (max. 4 points)
• JAN, 10th*, JAN, 29th: Second paper exam (max. 27 points)
• FINAL GRADE (max. 31 points)
GENERAL EXAM
• JAN, 10th*
• JAN, 29th
• JUL, 1st
• SEP, 2nd
* DEC, 11th for exchange students only
EXAMINATION RULES: paper partial ex.
• Exercises, theoretical questions.
• First: lect. 1 to 13; Second: mainly lect. 14 to 26 (plus lect. 1 to 13!)
• Maximum grade: 27 (both).
• Each test is passed if the score is at least 15.
• Final grade: arithmetic mean of the two.
• Exam successfully passed if the final grade is at least 18.
EXAMINATION RULES: on line partial ex.
• Using R and Radiant, with student’s own laptop
(in a flat room).
• Exercises about analysis on a dataset.
• Topics: from lecture 1 to lecture 26.
• Max score: +4 points
• Students who have passed the first paper exam are automatically signed up for the on line exam.
• Not attending the on line exam => 0 points out of 4.
• Scores apply only to the paper partial exams of the corresponding academic year (not to any general exam).
EXAMINATION RULES: partial ex.
• Final grade: (rounded) arithmetic mean of the
two paper partial exams + points of the on line
exam.
• The score 31/30 indicates 30/30 cum laude
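As a worked illustration of the rule above, the following Python sketch computes the final grade from the two paper partial exams and the on line exam. The function names are hypothetical, and two points are assumptions not stated on the slides: that a mean ending in .5 rounds up, and that the 18-point threshold is checked after the on line points are added.

```python
import math


def partial_exam_final_grade(first_paper, second_paper, online_points=0):
    """Return the final grade (31 stands for cum laude), or None if not passed."""
    # Each paper exam is passed with a score of at least 15 (max. 27).
    if first_paper < 15 or second_paper < 15:
        return None
    mean = (first_paper + second_paper) / 2
    # Assumption: a mean ending in .5 is rounded up.
    rounded_mean = math.floor(mean + 0.5)
    # The on line exam adds between 0 and 4 points.
    final = rounded_mean + online_points
    # Assumption: the 18-point threshold is checked after adding on line points.
    return final if final >= 18 else None


def format_grade(final):
    """Render a final grade, mapping 31 to 30/30 cum laude."""
    if final is None:
        return "not passed"
    return "30/30 cum laude" if final == 31 else f"{final}/30"


# Mean of 20 and 23 is 21.5, rounded to 22, plus 2 on line points.
print(format_grade(partial_exam_final_grade(20, 23, 2)))
print(format_grade(partial_exam_final_grade(27, 27, 4)))
```

The official rounding convention may differ; check the course rules before relying on borderline cases.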
EXAMINATION RULES: general exam
• Exercises, theoretical questions, even about R
and Radiant.
• Maximum grade: 31/30.
• Exam successfully passed if the grade is at least 18.
MORE ABOUT EXAMS
• Signing up for the exams: STUDENTS WHO
HAVE NOT REGULARLY SIGNED UP WILL NOT
BE ALLOWED TO TAKE ANY OF THE EXAMS.
• If the exam is successfully passed the grade
will be automatically added to the student’s
record.
• Viewing your exam: paper exams can be viewed at a meeting held on a single date.
BLACKBOARD E-LEARNING PLATFORM
30001 STATISTICS - Material common to all classes
2018/2019 - 30001 STATISTICS cl. 22
Statistics
A.Y. 2018-2019
Index
• Introduction
• Hypothesis testing
Introduction: background
The film industry is one of the biggest contributors to the entertainment industry, but it is often characterized by strong uncertainty about whether a product will be successful.
With the advent of Big Data and Analytics, the possibilities for analyzing data (which can be collected from different sources) have grown rapidly. In a fast-growing industry such as the motion picture industry, data analytics has opened up important new opportunities to analyze past data, make creative marketing decisions, and accurately predict the fate of forthcoming movie releases.
Movie studios can use data analytics in many different ways, but the main goal is to give each production its best chance of success. “Success” is not a straightforward concept: it can mean ticket sales, reviews, social chatter, franchise options, critical awards, …
Introduction: case set-up and data collection
Suppose we are an independent entertainment company interested in identifying the factors that most affect a movie's revenue and, more generally, the success of a film.
Through a data collection activity across several sources, we built a dataset containing 2868 films produced from 1980 onwards. The data sources used to create the dataset are as follows:
Introduction: main research questions
Introduction: the dataset
Here below you can find the list of the variables in the dataset, with description and notes:
N | Variable name | Description | Notes
1 | id_imdb | Movie ID | ID
2 | movie_title | Title |
3 | year | Year of production | Numerical discrete
4 | year_bins | Year of production (Classes) | Categorical ordinal - 3 categories
5 | studio_distrib | Film distributor | Categorical nominal
6 | major_distrib | Film distributor - Major studio | Dummy (1="Yes", 0="No")
7 | main_genre | Main Genre | Categorical nominal
8 | country | Main country of production | Categorical nominal
9 | country_group | Main country of production (Grouped) | Categorical nominal
10 | runtime_minutes | Runtime minutes | Numerical discrete
11 | mpaa_content_rating | Motion Picture Association of America rating (film's suitability for certain audiences based on its content) | Categorical ordinal
12 | production_budget | Production Budget (Million $) | Numerical continuous
13 | box_office_domestic | Box Office ($) - Total Revenues US (Million $) | Numerical continuous
14 | box_office_worldwide | Box Office ($) - Total Revenues Worldwide (Million $) | Numerical continuous
15 | opening | First 3-day opening revenues (US) (Million $) | Numerical continuous
16 | production_budget_adj | Production Budget ($) - Inflation Adjusted (Million $) | Numerical continuous
17 | box_office_domestic_adj | Box Office ($) - Total Revenues US - Inflation Adjusted (Million $) | Numerical continuous
18 | box_office_worldwide_adj | Box Office ($) - Total Revenues Worldwide - Inflation Adjusted (Million $) | Numerical continuous
19 | opening_adj | First 3-day opening revenues (US) - Inflation Adjusted (Million $) | Numerical continuous
20 | theaters | Number of theaters in which the movie was initially released | Numerical discrete
21 | box_office_budget_ratio | Revenues/Budget Ratio | Numerical continuous
22 | opening_boxoffice_ratio | Opening Revenues/Total revenues (US) | Numerical continuous
Introduction: the dataset
N | Variable name | Description | Notes
23 | plot_topic_power | The plot of the movie contains the topic "Power" | Dummy (1="Yes", 0="No")
24 | plot_topic_love | The plot of the movie contains the topic "Love" | Dummy (1="Yes", 0="No")
25 | plot_topic_money | The plot of the movie contains the topic "Money" | Dummy (1="Yes", 0="No")
26 | plot_topic_friends | The plot of the movie contains the topic "Friends" | Dummy (1="Yes", 0="No")
27 | plot_topic_world | The plot of the movie contains the topic "World" | Dummy (1="Yes", 0="No")
28 | plot_topic_school | The plot of the movie contains the topic "School" | Dummy (1="Yes", 0="No")
29 | plot_topic_family | The plot of the movie contains the topic "Family" | Dummy (1="Yes", 0="No")
30 | plot_topic_murder | The plot of the movie contains the topic "Murder" | Dummy (1="Yes", 0="No")
31 | morgan_freeman | Morgan Freeman acted in the movie | Dummy (1="Yes", 0="No")
32 | brad_pitt | Brad Pitt acted in the movie | Dummy (1="Yes", 0="No")
33 | robert_deniro | Robert De Niro acted in the movie | Dummy (1="Yes", 0="No")
34 | angelina_jolie | Angelina Jolie acted in the movie | Dummy (1="Yes", 0="No")
35 | meryl_streep | Meryl Streep acted in the movie | Dummy (1="Yes", 0="No")
36 | julia_roberts | Julia Roberts acted in the movie | Dummy (1="Yes", 0="No")
37 | imdb_num_votes | Internet Movie Database - Number of votes for the movie | Numerical discrete
38 | imdb_average_rating | Internet Movie Database - Rating | Numerical continuous
39 | metascore_rating | Metacritic - Metascore | Numerical continuous
40 | rotting_tomatoes_rating | Rotten Tomatoes - Tomatometer score | Numerical continuous
41 | n_tot_awards | Total number of awards (including nominations) received from different organizations (Academy, BAFTA, MTV, etc.) | Numerical discrete
42 | oscar_nomination_n | Academy Awards (Oscar) - Number of nominations | Numerical discrete
43 | oscar_won_n | Academy Awards (Oscar) - Number of statuettes won | Numerical discrete
44 | oscar_best_picture | Academy Awards (Oscar) - Statuette for Best Picture | Dummy (1="Yes", 0="No")
45 | oscar_directing | Academy Awards (Oscar) - Statuette for Best Director | Dummy (1="Yes", 0="No")
46 | oscar_best_actor_actress | Academy Awards (Oscar) - At least one statuette for acting (lead or supporting) | Dummy (1="Yes", 0="No")
R manual
Install R, RStudio & Radiant!
• On your laptop (Windows, Mac)
• A.S.A.P.
• Needed for Lecture 3, next Tuesday!
SUGGESTIONS FOR PREPARING EXAMS
Ch. 1 § 1.1
Objectives of this lecture
• Statistical thinking and
decision making under uncertainty
• Basic concepts:
Variables & Data
Population & Sample
Parameter & Statistic
Descriptive statistics & Inferential statistics
Why statistics?
https://www.youtube.com/watch?v=wV0Ks7aS7YI&feature=youtu.be
Decision Making in an Uncertain Environment
Everyday decisions are based on incomplete information
Examples:
• Will the job market be strong when I graduate?
• Will the price of Yahoo stock be higher in six months than it is now?
• Will interest rates remain low for the rest of the year if the federal budget deficit is as high as predicted?
[Diagram labels: Theory, Experience, Literature; Design of research; Statistical procedures (sources of data, presentation); Knowledge]
Variable
A specific characteristic about a
set of individuals or objects
Examples:
• Private clients of a bank: gender, age, amount of deposits, level of satisfaction.
Keywords
Data
Any observations that have been
collected about a characteristic
Examples:
Gender | Age | Deposits | Satisfaction
Male | 45 | 23 k€ | Poor
Female | 24 | 7.6 k€ | High
Female | 32 | 84 k€ | Moderate
Population
The complete set of all items
that interest an investigator
Examples:
• Names of all registered voters in the United States;
• Incomes of all families living in Daytona Beach;
• Annual returns of all stocks traded on the New York Stock Exchange;
• Grade point averages of all the students in your university;
• For a bank: amount of deposits of all clients;
Sample
An observed subset of the population
Parameter
A quantitative measure that describes a specific characteristic of a population
Examples:
• Proportion of private clients of the bank that are females;
• Average amount of deposits of private clients of the bank;
• Median age of private clients of the bank;
• Pct. of clients of the bank between 25 and 39 years old;
• Proportion of clients of the bank at least moderately satisfied;
Statistic
A quantitative measure that describes a specific characteristic of a sample
Examples:
• Proportion of clients in the sample that are females;
• Average amount of deposits of clients in the sample;
• Median age of clients in the sample;
• Pct. of clients in the sample between 25 and 39 years old;
• Proportion of clients in the sample at least moderately satisfied;
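To make the distinction concrete, the sketch below computes the five statistics listed above on the three bank clients shown in the Data example, treated here as a (very small) sample. It is a minimal Python illustration: the variable names are hypothetical, and the ordering Poor < Moderate < High for the satisfaction levels is an assumption.

```python
from statistics import mean, median

# The three clients from the Data example (deposits in thousands of euros).
sample = [
    {"gender": "Male", "age": 45, "deposits": 23.0, "satisfaction": "Poor"},
    {"gender": "Female", "age": 24, "deposits": 7.6, "satisfaction": "High"},
    {"gender": "Female", "age": 32, "deposits": 84.0, "satisfaction": "Moderate"},
]
# Assumed ordering of the satisfaction levels.
SATISFACTION_RANK = {"Poor": 0, "Moderate": 1, "High": 2}

n = len(sample)  # sample size
prop_female = sum(c["gender"] == "Female" for c in sample) / n
avg_deposits = mean(c["deposits"] for c in sample)
median_age = median(c["age"] for c in sample)
pct_25_39 = 100 * sum(25 <= c["age"] <= 39 for c in sample) / n
prop_satisfied = sum(
    SATISFACTION_RANK[c["satisfaction"]] >= SATISFACTION_RANK["Moderate"]
    for c in sample
) / n

print(f"Proportion of females: {prop_female:.2f}")
print(f"Average deposits (k€): {avg_deposits:.1f}")
print(f"Median age: {median_age}")
print(f"Pct. aged 25 to 39: {pct_25_39:.1f}")
print(f"Proportion at least moderately satisfied: {prop_satisfied:.2f}")
```

The same quantities computed on all private clients of the bank, rather than on a sample, would be parameters.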
The process of inference
Population (N = size: # of items)
Sample (n = size: # of observations)
•…manage errors
Random sampling
Simple random sampling is a procedure in which
• each member of the population is chosen
strictly by chance,
• each member of the population is equally
likely to be chosen,
• every possible sample of n objects is equally
likely to be chosen
The resulting sample is called a random sample
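A minimal Python sketch of simple random sampling, assuming a hypothetical population of N = 72 numbered items and a sample size of n = 9 (both chosen only for illustration). It relies on the standard library's `random.sample`, which draws without replacement so that every subset of size n is equally likely. The function name is hypothetical and the seed is fixed only to make the run reproducible.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # the seed is optional; it only fixes the draw
    return rng.sample(population, n)


population = list(range(1, 73))  # hypothetical items labelled 1..72
sample = simple_random_sample(population, n=9, seed=2018)
print(sorted(sample))
```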
Systematic sampling
j = N/n = 72 / 9 = 8
(items 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, …, 19, …, 27, …, 35, …, 43, …, 51, …, 59, …, 67, …, 72)
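The systematic sampling example above can be sketched in Python as follows. The function name is hypothetical, and drawing the starting item at random among the first j items is the usual convention rather than something stated on the slide. A start of 3 is used in the example because it is consistent with the items listed (11, 19, 27, …, 67).

```python
import random


def systematic_sample(N, n, start=None, seed=None):
    """Select every j-th item from items 1..N, with j = N // n."""
    j = N // n  # sampling interval; 72 // 9 = 8 in the example
    if start is None:
        # Assumption: the first item is drawn at random among the first j items.
        start = random.Random(seed).randint(1, j)
    if not 1 <= start <= j:
        raise ValueError("start must lie between 1 and j")
    return [start + k * j for k in range(n)]


print(systematic_sample(N=72, n=9, start=3))
```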
The two branches of statistics
(to be resumed next lecture)
What is statistics?