
Data and Variables

Before we jump into exploratory data analysis and really appreciate
its importance in the process of statistical analysis, let's step back for
a minute and ask:

What do we really mean by data?

Data are pieces of information about individuals organized into
variables. By an individual, we mean a particular person or object.
By a variable, we mean a particular characteristic of the individual.

A dataset is a set of data identified with particular circumstances.
Datasets are typically displayed in tables, in which rows represent
individuals and columns represent variables.

EXAMPLE: MEDICAL RECORDS


The following dataset shows medical records from a particular survey:

In this example, the individuals are patients, and the variables are
Gender, Age, Weight, Height, Smoking, and Race. Each row, then,
gives us all the information about a particular individual (in this case,
patient), and each column gives us information about a particular
characteristic of all the patients.

Variables can be classified into one of two types: categorical or
quantitative.

• Categorical variables take category or label values and place an
individual into one of several groups. Each observation can be placed
in only one category, and the categories are mutually exclusive.

In our example of medical records, Smoking is a categorical variable
with two groups, since each participant can be categorized only as
either a nonsmoker or a smoker. Gender and Race are the two other
categorical variables in our medical records example. (Notice that the
values of the categorical variable Smoking have been coded as the
numbers 1 or 2. It is common to code the values of a categorical
variable as numbers, but you should remember that these are just
codes. They have no arithmetic meaning; i.e., it does not make sense
to add, subtract, multiply, divide, or compare the magnitude of such
values.)

• Quantitative variables take numerical values and represent some
kind of measurement.

In our medical example, Age is an example of a quantitative variable
because it can take on multiple numerical values. It also makes sense
to think about it in numerical form; that is, a person can be 18 years
old or 80 years old. Weight and Height are also examples of
quantitative variables.
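
To make the distinction concrete, here is a minimal Python sketch. The three patient records are invented for illustration (they are not rows from the survey table), and the assignment of code 1 to "nonsmoker" and code 2 to "smoker" is an assumption, since the text says only that the codes 1 and 2 are used.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: each dict is one individual (a row),
# each key is one variable (a column). All values are made up.
patients = [
    {"Age": 18, "Weight": 150, "Smoking": 1},
    {"Age": 45, "Weight": 172, "Smoking": 2},
    {"Age": 80, "Weight": 140, "Smoking": 1},
]

# Assumed meaning of the codes; the numbers are only labels.
SMOKING_LABELS = {1: "nonsmoker", 2: "smoker"}

# Quantitative variable: arithmetic such as averaging is meaningful.
average_age = mean(p["Age"] for p in patients)

# Categorical variable: decode the labels and count group membership
# instead of doing arithmetic on the codes.
smoking_counts = Counter(SMOKING_LABELS[p["Smoking"]] for p in patients)

print(average_age)
print(dict(smoking_counts))
```

Averaging the Smoking codes would run without error, but the result would not describe anything about the patients, which is why the codes are decoded and counted instead.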

NOTE...
Categorical variables are sometimes called qualitative variables, but
in this course we use the term categorical.

SCENARIO: U.S. CENSUS


We took a random sample from the 2000 U.S. Census. Here is part of
the dataset:

Learn By Doing
Who are the individuals described by this data?

Options: States; People living in the United States in the year 2000;
People with families in the year 2000

Answer: People living in the United States in the year 2000. The U.S.
Census is completed by people living in the United States.

Learn By Doing
What type of variable is Zipcode?

Options: Categorical; Quantitative

Answer: Categorical. Zipcode is a categorical variable because it
categorizes individuals by geographic location.

Learn By Doing
What type of variable is Family Size?

Options: Categorical; Quantitative

Answer: Quantitative. Family size is a variable with numerical values
that can be averaged.

Learn By Doing
What type of variable is Annual Income?

Options: Categorical; Quantitative

Answer: Quantitative. Annual income is a variable with numerical values
that can be averaged.

CLINICAL DEPRESSION AND DRUG TREATMENT


Background
Clinical depression is the most common mental illness in the United
States, affecting 19 million adults each year (Source: NIMH, 1999).
Nearly 50% of individuals who experience a major episode will have a
recurrence within 2 to 3 years. Researchers are interested in
comparing therapeutic solutions that could delay or reduce the
incidence of recurrence.

In a study conducted by the National Institutes of Health, 109
clinically depressed patients were separated into three groups, and
each group was given one of two active drugs (imipramine or lithium)
or no drug at all. For each patient, the dataset contains the treatment
used, the outcome of the treatment, and several other interesting
characteristics.

Here is a summary of the variables in our dataset:

• Hospt: The patient's hospital, represented by a code for each of the
5 hospitals (1, 2, 3, 5, or 6)

• Treat: The treatment received by the patient (Lithium, Imipramine,
or Placebo)

• Outcome: Whether or not a recurrence occurred during the patient's
treatment (Recurrence or No Recurrence)

• Time: Either the time in days until the first recurrence or, if a
recurrence did not occur, the length in days of the patient's
participation in the study

• AcuteT: The time in days that the patient was depressed prior to the
study

• Age: The age of the patient in years when the patient entered the
study

• Gender: The patient's gender (1 = Female, 2 = Male)

Here's a snapshot of the first 50 patients in the dataset with gender
recoded to display Female or Male:

Did I Get This
Who are the individuals described by this data?

Options: 19 million adults who experience depression each year;
Hospitals; 109 clinically depressed people

Answer: 109 clinically depressed people. The dataset contains
information on 109 clinically depressed people who were part of the NIH
study.

Did I Get This
Which of the following variables is categorical? Check all that apply.

Options: Treat; AcuteT; Outcome

Did I Get This
Which of the following variables is quantitative? Check all that apply.

Options: Hospt; Time; Age; Gender

Answer: Time and Age. These are quantitative variables, since they can
take on multiple numerical values, which have arithmetic meaning (i.e.,
it makes sense to add, subtract, multiply, divide, or compare the
magnitude of such values).

Statistics Package Exercise: Exploring Variables in a Dataset

Learning Objective: Classify a data analysis
situation (involving two variables) according to the
"role-type classification," and state the appropriate
display and/or numerical measures that should be
used in order to summarize the data.

Let's Explore a Dataset
In this activity we

• Learn how to open and examine a dataset.

• Practice classifying variables by their type: quantitative or
categorical.

• Learn how to handle categorical variables whose values are
numerically coded.


What are the categorical variables in this dataset?

Answer:
The categorical variables are 1) Hospt because the numbers
represent codes, which are used to identify individual hospitals and
place patients into categories. As such, the numbers used for the codes
(1, 2, 3, 5, and 6) have no arithmetic meaning; 2) Treat because the
treatment received by the patients is in the form of categories
(Lithium, Imipramine, or Placebo); 3) Outcome since recurrence is in
the form of two categories (Recurrence or No Recurrence); and
4) Gender because the numbers represent two distinct categories:
Female and Male. Thus, the numbers used to represent gender (1 =
Female; 2 = Male) have no arithmetic meaning.

What are the quantitative variables in this dataset?

Answer:
The quantitative variables are 1) Time since it can take on multiple
numerical values, which have arithmetic meaning (i.e., it makes
sense to add, subtract, multiply, divide, or compare the magnitude of
such values); 2) Age since it can take on multiple numerical values,
which represent a characteristic of the patient; and
3) AcuteT because it can take on multiple numerical values to
represent a characteristic of the patient.
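
One way to handle numerically coded categorical variables is to record each variable's type explicitly and to replace codes with labels before analysis. The following Python sketch assumes that approach. The helper `recode_gender` and the single patient record are hypothetical (the values are invented, not taken from the study); the variable types and the Gender codes come from the summary and answers above.

```python
# Variable types as classified in the answers above.
VARIABLE_TYPES = {
    "Hospt": "categorical",
    "Treat": "categorical",
    "Outcome": "categorical",
    "Time": "quantitative",
    "AcuteT": "quantitative",
    "Age": "quantitative",
    "Gender": "categorical",
}

# Gender codes as given in the variable summary.
GENDER_LABELS = {1: "Female", 2: "Male"}


def recode_gender(record):
    """Return a copy of the record with the Gender code replaced by its label."""
    recoded = dict(record)
    recoded["Gender"] = GENDER_LABELS[record["Gender"]]
    return recoded


# Hypothetical patient; these values are invented for illustration.
patient = {"Hospt": 3, "Treat": "Lithium", "Outcome": "Recurrence",
           "Time": 36, "AcuteT": 211, "Age": 33, "Gender": 1}

recoded = recode_gender(patient)
categorical = [name for name, kind in VARIABLE_TYPES.items()
               if kind == "categorical"]

print(recoded["Gender"])
print(categorical)
```

Keeping the type next to the variable name makes it harder to average a column such as Hospt by accident just because its values happen to be numbers.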

In the previous section, a simple distinction was made between quantitative and
categorical variables. However, there is a more precise method of categorizing
variables. It is called scale of measurement. The scales of measurement become
increasingly precise with each level while retaining the characteristics of the
level below it.

For instance, ordinal data builds on the characteristics of nominal data and is
therefore more precise than nominal data. Now we're going to take a closer look
at the characteristics of the four levels of measurement: nominal, ordinal,
interval, and ratio.

The first level of measurement is nominal. It is the least precise measure of
data, as it only indicates differences. Nominal level data uses discrete
categories to describe qualitative differences. An example of a nominal variable
is types of pets. For instance, types of pets can include birds, fish, cats, and
dogs. None of these categories (in this case, types of pets) is implicitly
better than the others. Rather, the categories reflect the different types of
pets. Some other examples of nominal variables include gender, eye color, type
of house, and type of resident. The categories for each of these nominal
variables reflect qualitative differences.

Now let's take a closer look at ordinal variables. The second level of
measurement is ordinal. Ordinal level data is more precise than nominal data, as
the differences can now be rank ordered. However, it does not indicate that the
differences between two numbers are fixed or equal. An example of an ordinal
variable is the order of finishes in a race. For instance, we can say that
Tanesha finished first and Alexis finished second. However, we do not know the
degree of difference. It could be that Tanesha won by 2 seconds or by 20
minutes. Other examples of ordinal variables include education level and some
survey question responses. The categories for each of these ordinal variables
show order, but not the magnitude of difference between two adjacent points.

The third level of measurement is interval. It builds upon the characteristics
of ordinal data through the addition of meaningful differences between two
numbers; that is, the distance between pairs of consecutive numbers is assumed
to be equal. However, interval variables do not have a meaningful zero point.
Thus, a zero does not mean the absence of an attribute, but rather is a
particular, but arbitrary, point on the scale.


For instance, when temperature is measured in Celsius, the one-degree difference
between, say, 35 degrees and 36 degrees is assumed to be the same as the
one-degree difference between 25 degrees and 26 degrees. However, zero degrees
Celsius does not mean the absence of heat, since there are below-freezing
temperatures such as twenty below. The same characteristics also hold true for
temperatures measured in Fahrenheit. Some other examples of interval variables
include IQ scores and SAT scores. For each of these interval variables, the
distance between pairs of consecutive numbers is assumed to be equal, but they
do not have a meaningful zero point.

The fourth and final level of measurement is ratio. It has all the
characteristics of interval data plus a meaningful zero point. A good example of
ratio level data is age. For instance, we know that someone who is forty years
old is twice as old as someone who is twenty years old. There is a meaningful
zero point: an age of zero represents the absence of age. Some other examples of
ratio variables include height, weight, and the cost of a car in dollars. For
each of these ratio variables, the distance between pairs of consecutive numbers
is assumed to be equal and there is a meaningful zero point.

It is also important to note that more precise data can always be scaled down to
less precise data. For instance, a ratio level variable like age can be scaled
into an ordinal variable of age groups, which could include toddler, adolescent,
young adult, and middle aged. Less precise data, however, cannot be made into
more precise data; that is, an ordinal level variable like age groups cannot be
changed into a ratio level variable such as age in years.

So why is it important to understand scale of measurement? Statistical methods
are ways of summarizing, analyzing, and interpreting data, and are designed for
specific types of data. Researchers need to know the level of measurement of
their data when selecting a statistical method, since using an incorrect method
for analyzing data can affect the reliability and accuracy of the results. Be
sure to keep these levels of measurement in mind whenever you are planning to
analyze data.
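
The idea of scaling ratio data down to ordinal data can be sketched in Python as follows. The cut-off ages are illustrative assumptions, not values given in the text, and the "child" and "older adult" groups are added here only so that every age falls into some group.

```python
from bisect import bisect_right

# Lower bound (in years) of each age group, in increasing order.
# These cut-offs are assumptions made for this sketch.
LOWER_BOUNDS = [0, 4, 13, 20, 40, 65]
LABELS = ["toddler", "child", "adolescent",
          "young adult", "middle aged", "older adult"]


def age_group(age_years):
    """Map a ratio-level age to an ordinal age group (information is lost)."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    return LABELS[bisect_right(LOWER_BOUNDS, age_years) - 1]


ages = [2, 15, 20, 40]
groups = [age_group(a) for a in ages]

# Ratio level: ratios are meaningful (forty is twice twenty).
ratio = ages[3] / ages[2]

# Ordinal level: only the order of the groups can be compared.
rank = {label: i for i, label in enumerate(LABELS)}
is_older_group = rank[groups[3]] > rank[groups[2]]

print(groups)
print(ratio, is_older_group)
```

Ages 21 and 39 both map to "young adult", so the exact age cannot be recovered from the group. This is why the conversion works only in one direction.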

https://youtu.be/wIBu7J18Fpw

Nominal Scale of Measurement


The nominal scale of measurement is a qualitative measure that
uses discrete categories to describe a characteristic of the research
participants. For each participant, the researcher determines the
presence, absence, and type of the attribute. Nominal scales of
measurement may have two categories, such as citizen status
(citizen/non-citizen), or they can have more than two categories, like
religious affiliation (e.g., Agnostic, Buddhist, Jewish, Muslim) or
marital status (e.g., divorced, married, single). Often, as described
here, the categories have names; however, researchers code them
with numbers for use in statistical analyses. These categories are not
ordered or ranked in any way.

Learn By Doing
Which of the following is a nominal scale of measurement?

Options: The number of minutes it takes participants to run one mile;
Assigning participants rank numbers (i.e., 1st place, 2nd place), based
on the time it takes each of them to run one mile; Identifying
participants as runners or non-runners

Answer: Identifying participants as runners or non-runners. This
measure, like all nominal scales of measurement, assigns subjects to
discrete categories; thus, participants are either runners or
non-runners.

Ordinal Scale of Measurement


An ordinal scale of measurement rank-orders participants on
some scale or attribute, but the difference between numbers does not
convey fixed or equal differences. Thus, with ordinal data, we know
that a one-unit increase on an ordinal scale represents "more," but
we don't know how much more. For example, a group of participants
can be rank-ordered from least to most politically active. We know
that a person who is ranked as 5 is more politically active than a
person who is ranked as 4, but not how much more politically active.
The value of the variable is used to order participants according to
the strength/presence of the attribute and not to calculate differences
between participants.

Learn By Doing
Which of the following is an ordinal scale of measurement?

Options: Temperature in Fahrenheit; Number of pets; Car Condition
(Excellent, Good, Fair, Poor)

Answer: Car Condition (Excellent, Good, Fair, Poor). This measure, like
all ordinal scales of measurement, rank-orders participants on some
scale or attribute. Thus, the condition of a car is ranked, but the
distance between the ranks is unknown.

Interval Scale of Measurement


The interval scale of measurement takes numerical form, and the
distance between pairs of consecutive numbers is assumed to be
equal. However, interval variables do not have a meaningful zero
point; thus, a zero does not mean the absence of the attribute, but
rather it is a particular (but arbitrary) point on the scale. A good
example of an interval measure is temperature in the Fahrenheit
scale: a temperature of zero degrees Fahrenheit is still a
temperature, not the absence of temperature. In education,
measures like achievement, motivation, and self-concept are
considered interval measures; a zero on a measure of such variables
does not mean the absence of the characteristic in the participant.

Learn By Doing
Which of the following is an interval scale of measurement?

Options: Political affiliation (i.e., Democrat, Republican,
Independent); Intelligence (IQ) Scores; Amount of monthly mortgage
payment

Answer: Intelligence (IQ) Scores. Intelligence scores are an interval
level of measurement because they take numerical form and the distance
between pairs of scores is assumed to be equal, but there is no
meaningful zero point; that is, there cannot be a complete absence of
intelligence.

Ratio Scale of Measurement


The ratio scale of measurement is similar to the interval scale. As
with the interval scale, a number is assigned to a subject that
represents the amount of the attribute that the subject has and the
difference between consecutive numbers is assumed to be equal. The
main difference between interval and ratio measurements has to do
with how we interpret a value of zero. For ratio measures, the zero is
meaningful and tells us that the attribute is not present in the
participant. Examples of ratio measures include a participant’s
number of children, number of AP courses taken, or cumulative
college credits: for each of these variables, a score of zero represents
that the participant has none of the attribute.

Learn By Doing
Which of the following is a ratio scale of measurement?

Options: Social Security numbers; Clothing sizes (e.g., Small, Medium,
Large); Length of room in inches

Answer: Length of room in inches. Length of room in inches is a ratio
variable because it uses numbers to represent the amount of a
characteristic and it has a meaningful zero.

The next activity will help you to see whether you understand the
different scales of measurement.

Did I Get This
A researcher classifies subjects' level of anxiety as high, medium, or
low. What scale of measurement is this measure?

Options: Nominal; Interval; Ordinal; Ratio

Answer: Ordinal. Ordinal scales of measurement use rank ordering. In
this case, the measure ranks individuals into high, medium, and low
groups based on each person's level of anxiety.

Did I Get This
A researcher measures political affiliation, and records a value of 1
for a Republican, 2 for a Democrat, 3 for an Independent, and 4 for
other affiliations. What scale of measurement is this measure?

Options: Nominal; Interval; Ordinal; Ratio

Answer: Nominal. This measure of political affiliation is nominal. It
assigns values to discrete categories and attributes these values to
the research subjects.

Did I Get This
A researcher observes Teacher A's classroom of 30 students for a
45-minute class. The researcher records the percentage of time students
spend working in groups during the class. What scale of
measurement is this measure?

Options: Nominal; Interval; Ordinal; Ratio

Answer: Ratio. Ratio variables use numbers to represent the amount of a
characteristic, where zero means the absence of the characteristic. In
this case, the measure of time represents the proportion of the class
that is group work, where zero means they do not do any group work in
this 45-minute period.

Did I Get This
Scores on the SAT Math Test (note: the scores on the SAT Math Test
range from 200 to 800). What scale of measurement is this measure?

Options: Nominal; Interval; Ordinal; Ratio

Answer: Interval. Interval scales of measurement use numbers to
represent the amount of a characteristic that a subject has, but do not
have a meaningful zero point. Here, a higher SAT Math Test score
indicates greater levels of understanding of mathematical concepts.

Examining Distributions
As indicated in the introduction, we will begin the EDA part of the
course by exploring (or looking at) one variable at a time.

As we saw in Data and Variables, the data for each variable are a
long list of values (whether numerical or not), and are not very
informative in that form. In order to convert these raw data into
useful information we need to summarize and then examine
the distribution of the variable. By distribution of a variable, we
mean:

• what values the variable takes, and

• how often the variable takes those values.

This module has two sections. We will first learn how to summarize
and examine the distribution of a single categorical variable, and then
do the same for a quantitative variable.

Frequency Distributions

Learning Objective: Summarize and describe the
distribution of a categorical variable in context.

What is your perception of your own body? Do you feel that you are
overweight, underweight, or about right?

A random sample of 1,200 U.S. college students was asked this
question as part of a larger survey. The following table shows part of
the responses:

Body Image

Student       Body Image
student 25    overweight
student 26    about right
student 27    underweight
student 28    about right
student 29    about right

Here is some information that would be interesting to get from these
data:

• What percentage of the sampled students fall into each category?

• How are students divided across the three body image categories?
Are they equally divided? If not, do the percentages follow some
other kind of pattern?

There is no way that we can answer these questions by looking at the
raw data, which are in the form of a long list of 1,200 responses, and
thus not very useful. However, both these questions will be easily
answered once we summarize and look at the distribution of the
variable Body Image (i.e., once we summarize how often each of the
categories occurs).

In order to summarize the distribution of a categorical variable, we
first create a table of the different values (categories) the variable
takes, how many times each value occurs (count) and, more
importantly, how often each value occurs (by converting the counts
to percentages); this table is called a frequency distribution. Here is
the frequency distribution for our example:

Body Image Distribution

Category       Count       Percent
About right    855         (855 / 1200) * 100 = 71.3%
Overweight     235         (235 / 1200) * 100 = 19.6%
Underweight    110         (110 / 1200) * 100 = 9.2%
Total          n = 1200    100%
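
The percentages in the table can be reproduced with a short Python sketch. The function name `frequency_distribution` is hypothetical. The sketch uses `decimal` with half-up rounding because 855/1200 is exactly 71.25%, and Python's built-in `round` would turn that tie into 71.2 rather than the 71.3 shown in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# Category counts from the Body Image table above.
counts = {"About right": 855, "Overweight": 235, "Underweight": 110}


def frequency_distribution(counts):
    """Return {category: (count, percent)}, percents rounded half-up to 0.1."""
    total = sum(counts.values())
    table = {}
    for category, count in counts.items():
        percent = (Decimal(count) * 100 / Decimal(total)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        table[category] = (count, percent)
    return table


table = frequency_distribution(counts)
for category, (count, percent) in table.items():
    print(f"{category:<12}{count:>6}{str(percent) + '%':>8}")
print(f"{'Total':<12}{sum(counts.values()):>6}")
```

Because each percentage is rounded separately, the rounded values add up to 100.1 rather than exactly 100; the table's 100% total refers to the unrounded percentages.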


Pie and Bar Charts

Learning Objective: Summarize and describe the
distribution of a categorical variable in context.

In order to visualize the numerical summaries we've obtained, we
need a graphical display. There are two simple graphical displays for
visualizing the distribution of categorical data:

1. The Pie Chart
[Figure: pie chart of the Body Image distribution]

2. The Bar Chart
[Figure: two bar charts of the Body Image distribution, one showing
counts and the other showing percentages]

Learn By Doing
What is the difference between the two bar charts?

Options: There is no difference; The two bar charts represent the
distributions of two different variables; The first bar chart
represents the count of respondents that chose each category, while the
second bar chart represents the percentage of respondents that chose
each category; The two bar charts represent the distribution of "Body
Image" obtained from two different samples

Answer: The first bar chart represents the count of respondents that
chose each category, while the second bar chart represents the
percentage of respondents that chose each category. The two bar charts
are different because counts and percentages have different scales on
the vertical axis. Counts have a scale from 0 to the total number of
subjects, while percentages always have a scale from 0 to 100.
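
The difference between a count axis and a percentage axis can be illustrated with a plain-text bar chart in Python. This is only a sketch: the helper `bar_chart` and the chart width of 50 characters are arbitrary choices made here, and the counts are those of the Body Image frequency distribution.

```python
counts = {"About right": 855, "Overweight": 235, "Underweight": 110}
total = sum(counts.values())


def bar_chart(values, scale_max, width=50):
    """Render one text bar per category; a value of scale_max fills `width`."""
    lines = []
    for category, value in values.items():
        bar = "#" * round(value / scale_max * width)
        lines.append(f"{category:<12}|{bar}")
    return lines


percents = {c: n / total * 100 for c, n in counts.items()}

# Same data, two axes: counts run from 0 to n, percentages from 0 to 100.
count_chart = bar_chart(counts, scale_max=total)
percent_chart = bar_chart(percents, scale_max=100)

print(f"Counts (axis 0 to {total})")
print("\n".join(count_chart))
print("Percentages (axis 0 to 100)")
print("\n".join(percent_chart))
```

The bars come out the same length in both charts; only the scale they are read against changes.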

Now that we have summarized the distribution of values in the Body Image
variable, let's go back and interpret the results in the context of the questions
that we posed:

Learn By Doing
What do the results suggest about how the students are divided
across the three body image categories?

Options: Students are equally divided across the three categories;
Students are not equally divided across the three categories

Answer: Students are not equally divided across the three categories.
The pieces of the pie and the lengths of the three bars representing
the three body image categories are not all the same. Thus, the
students' responses are not equally divided among the categories.

Learn By Doing
How do the vast majority of students (71.3%) feel about their
weight?

Options: About right; Overweight; Underweight

Answer: About right. 71.3% is well over half, or "the vast majority" of
the respondents. We are looking for the category that has the largest
piece of the pie and the longest bar in the bar chart: the category
"about right." Also, both charts note that the percentage for this
category is 71.3%.

Learn By Doing
How does the middle group of students (19.6%) feel about their
weight?

Options: About right; Overweight; Underweight

Answer: Overweight. The category "overweight" represents the body image
of 19.6% of the students.

Learn By Doing
What was the body perception that occurred the least often?

Options: About right; Overweight; Underweight

Answer: Underweight. The category "underweight" represents 9.2% of the
students.

Now that we've interpreted the results, there are some other
interesting questions that arise:

• Can we reliably generalize our results to the entire population of
interest and conclude that a similar distribution across body image
categories exists among all U.S. college students? In particular, can
we make such a generalization even though our sample consisted of
only 1,200 students, which is a very small fraction of the entire
population?

• If we had separated our sample by gender and looked at males and
females separately, would we have found a similar distribution across
body image categories?

These are the types of questions that we will deal with in future
sections of the course.

Let's Summarize

• The distribution of a categorical variable is summarized using:

    • Graphical display: pie chart or bar chart, supplemented by

    • Numerical summaries: category counts and percentages.

• A variation on pie charts and bar charts is the pictogram.

• Pictograms can be misleading, so make sure to use a critical
approach when interpreting the information the pictogram is trying to
convey.

Statistics Package Exercise: How to Record Data and Create Pie Charts

Learning Objective: Summarize and describe the
distribution of a categorical variable in context.

The same survey that asked 1,200 U.S. college students about their
body perception also asked the following question:

With whom do you find it easiest to make friends? (Opposite sex, same
sex, or no difference.)

In this activity we will use the collected data to:

• Learn how to tally our data into a table of counts and percentages.

• Learn how to produce a pie chart.

• R

• StatCrunch

• TI Calculator

• Minitab

• Excel

Instrucciones Excel
Para abrir el conjunto de datos, haga clic aquí para descargar el
archivo en su computadora. A continuación, busque el archivo
descargado y haga clic en él para abrirlo en Excel. Cuando se abre
Excel puede que tenga que habilitar la edición.

To ask Excel for a table of counts and percentages:

 Click the cell with the variable label Friends.

 Click PivotTable in the Tables group of the Insert tab and choose
PivotTable from the drop-down menu.

 Click OK.

 Excel will create a new worksheet with a blank table. We have to tell
Excel how to build the table:

 You should see a window on the right side of the screen titled
PivotTable Fields with a single item, our variable "Friends".

 Check the box next to Friends.

 Now drag the item Friends to the table where it says Drop Value
Fields Here.

You should now have a table showing the count of each type of data
entry ("No difference", "Opposite sex", and "Same sex") and a grand
total. To see the percentage for each entry, we can ask Excel to show
percentages instead of counts:

 Right-click any of the cells containing the counts and choose Value
Field Settings from the pop-up menu.

 Click the tab that says Show Values As.

 Now use the drop-down menu labeled Show values as: to select % of
Grand Total and click OK.
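The same tally can be sketched outside Excel. The following Python snippet is only an illustration and is not part of the course's instructions. The list `friends` holds a handful of made-up responses, not the survey's actual data, and the variable names are our own.

```python
from collections import Counter

# Hypothetical responses, invented for illustration only;
# they are NOT the survey's actual data.
friends = [
    "No difference", "Opposite sex", "No difference", "Same sex",
    "Opposite sex", "No difference", "Opposite sex", "No difference",
]

counts = Counter(friends)        # category -> count
total = sum(counts.values())     # grand total

# Percent of the grand total for each category.
percents = {cat: 100 * n / total for cat, n in counts.items()}

for cat in sorted(counts):
    print(f"{cat:<15}{counts[cat]:>5}{percents[cat]:>8.1f}%")
print(f"{'Grand Total':<15}{total:>5}{100.0:>8.1f}%")
```

The two columns printed correspond to the two views of the pivot table: counts, and each count shown as a percent of the grand total.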
To produce a pie chart of the data using Excel:

 First, if your table is showing percentages, you must change it back
to counts:

 Right-click any of the cells containing the percentages and choose
Value Field Settings from the pop-up menu.

 Click the tab that says Show Values As.

 Now use the drop-down menu labeled Show values as: to select No
Calculation and click OK.

 Now click one of the cells in your table, and then click the
PivotChart button in the Tools group of the Analyze tab (PivotTable
Tools).

 In the Insert Chart window that appears, choose the first Pie option
and click OK.

 To label the slices of the pie, right-click the pie and choose Format
Data Labels (2013 version) or Add Data Labels (2016 version) from the
pop-up menu.

 To see percentages as well as absolute values, right-click the pie
and choose Format Data Labels from the pop-up menu. In the menu that
appears on the right, check the box next to Percentage.

Comment
Note that the pie chart visually provides all the information found
in the table.

LEARN BY DOING

Describe the distribution of the variable "friends" in context:

Our Answer:

The students are NOT split evenly among the three categories. About
50% of the students find it just as easy to make friends with the
opposite sex as with the same sex. Among the remaining 50% of the
students, the majority (36.2%) find it easier to make friends with
people of the opposite sex, and the rest (13.7%) find it easier to
make friends with people of their own sex.

One Quantitative Variable


Introduction
In the previous section, we explored the distribution of a categorical
variable using graphs (pie chart, bar chart) supplemented by
numerical measures (percent of observations in each category). In
this section, we will explore the data collected from
a quantitative variable, and learn how to describe and summarize
the important features of its distribution. We will first learn how to
display the distribution using graphs and then move on to discuss
numerical measures.

To display data from one quantitative variable graphically, we can use
either the histogram or the stemplot. (Another graph, the boxplot,
will be covered in another section).

Histogram: Intervals
Learning Objective: Generate and interpret several
different graphical displays of the distribution of a
quantitative variable (histogram, stemplot,
boxplot).
Idea
Break the range of values into intervals and count how many
observations fall into each interval.

EXAMPLE: EXAM GRADES


Here are the exam grades of 15 students:

88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73

We first need to break the range of values into intervals (also called
"bins" or "classes"). In this case, since our dataset consists of exam
scores, it will make sense to choose intervals that typically
correspond to the range of a letter grade, 10 points wide: 40-50,
50-60, ... 90-100. By counting how many of the 15 observations fall in
each of the intervals, we get the following table:

Exam Grades

Score       Count
[40-50)     1
[50-60)     2
[60-70)     4
[70-80)     5
[80-90)     2
[90-100]    1

Note: The observation 60 was counted in the 60-70 interval. See
comment 1 below.

To construct the histogram from this table, we plot the intervals on
the X-axis and show the number of observations in each interval (the
frequency of the interval) on the Y-axis. Each frequency is
represented by the height of a rectangle located above the interval:
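The counting step can be sketched in a few lines of Python. This is only an illustration, not the statistical package used in the course: the helper `interval_index` is a name we made up, and the snippet assumes the same 10-point intervals as the table, with 100 placed in the last interval.

```python
grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]

# Left endpoints of the 10-point intervals [40-50), ..., [90-100].
lefts = [40, 50, 60, 70, 80, 90]

def interval_index(score):
    # Each interval includes its left endpoint and excludes its right
    # endpoint, except the last one, which also includes 100.
    return min((score - 40) // 10, len(lefts) - 1)

counts = [0] * len(lefts)
for g in grades:
    counts[interval_index(g)] += 1

# Text version of the histogram: one '#' per observation.
for left, n in zip(lefts, counts):
    closing = "]" if left == 90 else ")"
    label = f"[{left}-{left + 10}{closing}"
    print(f"{label:<10}{'#' * n} ({n})")
```

Each row of `#` marks plays the role of one rectangle, so the longest row belongs to the interval with the highest frequency.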

The table above can also be turned into a relative frequency table
using the following steps:

1. Add a row on the bottom and include the total number of
observations in the dataset that are represented in the table.

2. Add a column, at the end of the table, and calculate the relative
frequency for each interval, by dividing the number of observations in
each row by the total number of observations.

These two steps are illustrated in red in the following frequency
distribution table:
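As a rough illustration of the two steps, the following Python sketch adds the total and the relative frequency column to the counts from the exam grades table. The variable names are our own, and the relative frequencies are rounded to two decimal places by assumption.

```python
intervals = ["[40-50)", "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100]"]
counts = [1, 2, 4, 5, 2, 1]

# Step 1: total number of observations represented in the table.
total = sum(counts)

# Step 2: relative frequency = count in the row / total.
rel_freqs = [round(n / total, 2) for n in counts]

print(f"{'Score':<10}{'Count':>6}{'Relative frequency':>20}")
for label, n, rf in zip(intervals, counts, rel_freqs):
    print(f"{label:<10}{n:>6}{rf:>20.2f}")
print(f"{'Total':<10}{total:>6}{total / total:>20.2f}")
```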

It is also possible to determine the number of scores for an interval
if you have the total number of observations and the relative
frequency for that interval. For instance, if we know that we have 15
scores (or observations) and the relative frequency is 0.13, we can
determine the number of scores by multiplying the total number of
observations by the relative frequency and rounding to the nearest
whole number: 15 * 0.13 = 1.95, which rounds to 2 observations.
(Rounding is needed because 0.13 is itself a rounded value of 2/15.)
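A quick check of this idea in Python, assuming the product is rounded to the nearest whole number and using the two-decimal relative frequencies of the exam grades table. Note that rounding to the nearest whole number matters here: 15 * 0.07 = 1.05 corresponds to a count of 1, not 2.

```python
total = 15
rel_freqs = [0.07, 0.13, 0.27, 0.33, 0.13, 0.07]

# Multiply each relative frequency by the total and round to the
# nearest whole number to recover the count for that interval.
recovered = [round(total * rf) for rf in rel_freqs]

print(recovered)
```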
https://lagunita.stanford.edu/courses/course-v1:OLI+ProbStat+Open_Jan2017/courseware/eda_ed/one_quant_var_graphs/

A relative frequency table, like the one above, can be used to
determine the frequency of scores occurring at or across intervals.
Here are some examples, using the above frequency table:

1. What is the percentage of exam scores that were 70 and up to, but
not including, 80? To determine the answer, we look at the relative
frequency associated with the [70-80) interval. The relative frequency
is 0.33; to convert it to a percentage, multiply by 100 (0.33 * 100 =
33), which gives 33%.

2. What is the percentage of exam scores that are at least 70? To
determine the answer, we need to:

 Add together the relative frequencies for the intervals that contain
scores of 70 or above. Thus, we need to add together the relative
frequencies from [70-80), [80-90), and [90-100]: 0.33 + 0.13 + 0.07 =
0.53.

 To get the percentage, we need to multiply the calculated relative
frequency by 100. In this case, it would be 0.53 * 100 = 53, or 53%.
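The second example can be double-checked directly from the 15 raw scores. The Python sketch below is only an illustration with our own variable names; it counts the scores of at least 70 rather than adding rounded relative frequencies.

```python
grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]

# Keep only the scores that are 70 or above.
at_least_70 = [g for g in grades if g >= 70]

# Percentage = (number of qualifying scores / total) * 100.
pct = 100 * len(at_least_70) / len(grades)

print(f"{len(at_least_70)} of {len(grades)} scores are at least 70: {pct:.0f}%")
```

Counting from the raw scores gives 8/15, which agrees with the 53% obtained from the relative frequency table once it is rounded to a whole percent.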

Learn By Doing
Recall the table from the exam grades example above:

Exam Grades

Score       Count
[40-50)     1
[50-60)     2
[60-70)     4
[70-80)     5
[80-90)     2
[90-100]    1

What percentage of students earned less than a grade of 70 on the
exam?

Answer choices: 9%, 15%, 47%, 53%, 93%

Correct answer: 47%
The data display information about a total of 1 + 2 + 4 + 5 + 2 + 1 =
15 students. Out of these 15 students, 1 + 2 + 4 = 7 earned less than
a grade of 70 on the exam. To calculate the percentage, divide 7 (the
number of students who earned less than a grade of 70) by 15 (the
total number of students), and then multiply by 100 to change the
decimal to a percentage: 7/15 is about 0.47, and 0.47 * 100 = 47% of
the students.

Comments
1. It is very important that each observation be counted only in one
interval. For the most part, it is clear which interval an observation
falls in. However, in our example, we needed to decide whether to
include 60 in the interval 50-60, or the interval 60-70, and we chose
to count it in the latter. In fact, this decision is captured by the way
we wrote the intervals. If you scroll up and look at the table, you'll
see that we wrote the intervals in a peculiar way: [40-50), [50-60),
[60-70) etc. The square bracket means "including" and the
parenthesis means "not including". For example, [50-60) is the
interval from 50 to 60, including 50 and not including 60; [60-70) is
the interval from 60 to 70, including 60, and not including 70, etc. It
really does not matter how you decide to set up your intervals, as
long as you're consistent.

2. When data are displayed in a histogram, some information is lost.
Note that by looking at the histogram we can answer: "How many
students scored 70 or above?" (5+2+1=8) But we cannot answer:
"What was the lowest score?" All we can say is that the lowest score
is somewhere between 40 and 50, and therefore we can approximate
that it is around 45.

3. Obviously, we could have chosen to break the data into intervals
differently (for example: 45-50, 50-55, 55-60 etc.). To see how our
choice of bins or intervals affects the histogram, you can use the
interactive simulation that lets you change the intervals dynamically.
Try changing the bin width by dragging the slider underneath the bin
width scale.
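Without the interactive simulation, the effect of the bin width can still be explored with a short Python sketch. The function `bin_counts` is a hypothetical helper of our own; it assumes intervals of the form [left, left + width) starting at multiples of the width, and it lists only the intervals that contain at least one observation.

```python
from collections import Counter

grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]

def bin_counts(values, width):
    # Map each value to the left endpoint of its interval
    # [left, left + width), then count values per interval.
    tally = Counter((v // width) * width for v in values)
    return dict(sorted(tally.items()))

for width in (10, 5):
    print(f"bin width {width}:")
    for left, n in bin_counts(grades, width).items():
        print(f"  [{left}-{left + width})  {'#' * n}")
```

With a width of 10 the display matches the table above; with a width of 5 the same 15 scores spread over more, shorter bars, and some intervals are left empty.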

Many Students Wonder ...


Question: How do I know what interval width to choose?
Answer: There are no right or wrong choices of interval widths. In
this course, we will rely on a statistical package to produce the
histogram for us, and we will focus instead on describing and
summarizing the distribution as it appears from the histogram.

Did I Get This

An instructor asked her students how much time (to the nearest
hour) they spent studying for the midterm. The data are displayed in
the following histogram:
What do the numbers on the horizontal axis represent?

Answer choices:
 The values of the number of hours studied.
 The count of students falling in each of the intervals.

Correct answer: The values of the number of hours studied.
The horizontal axis represents the number of hours studied.

Did I Get This

An instructor asked her students how much time (to the nearest
hour) they spent studying for the midterm. The data are displayed in
the following histogram:
What do the numbers on the vertical axis represent?

Answer choices:
 The values of the number of hours studied.
 The count of students falling in each of the intervals.

Correct answer: The count of students falling in each of the intervals.
The vertical axis represents the count of students falling in each of
the intervals.

Did I Get This

An instructor asked her students how much time (to the nearest
hour) they spent studying for the midterm. The data are displayed in
the following histogram:
What percentage of students study 6 or more hours for the midterm?

Answer choices: 24%, 48%, 52%, 72%, 88%

Correct answer: 52%
The histogram displays information about 3 + 9 + 6 + 3 + 2 + 1 + 1 =
25 students. Out of these 25 students, 6 + 3 + 2 + 1 + 1 = 13 studied
6 or more hours for the exam, so 13/25 = 0.52, and 0.52 * 100 = 52% of
the students. Note that it might have been easier to count the bars
representing less than 6 hours. Thus, the number of students who
studied 6 or more hours is 25 - (3 + 9) = 25 - 12 = 13 students.

Extra Problems
These extra questions are here to give you more practice if you feel
you need it. No new concepts are introduced on this page. If you've
"got it", go ahead and move on to the next page. If you'd like a little
more practice, work through the questions below.

Question
Thirty-two students were asked the number of servings of fruits and
vegetables they eat daily. The results are displayed in the histogram
below.

How many of the students surveyed eat at least 4 servings of fruits
and vegetables daily?

Answer choices: 4, 8, 12, 20, 28

Correct answer: 12
"At least 4" means "4 or more", which is 8 + 3 + 1 = 12.

Question
Thirty-two students were asked the number of servings of fruits and
vegetables they eat daily. The results are displayed in the histogram
below.
What percentage of the students surveyed eat no more than 3
servings of fruits and vegetables daily?

Answer choices: 20, 31.2%, 37.5%, 62.5%, 68.8%

Correct answer: 62.5%
There are 20 students who eat no more than 3 servings daily and a
total of 32 students. So, 20/32 = 0.625 or 62.5%.

Question
Thirty-two students were asked the number of servings of fruits and
vegetables they eat daily. The results are displayed in the histogram
below.
What proportion of the students surveyed eat exactly 5 servings of
fruits and vegetables daily?

Answer choices: 0.063, 0.094, 0.156, 2, 3

Correct answer: 0.094
There are 3 students who eat exactly 5 servings of fruits and
vegetables daily and a total of 32 students. So, 3/32 = 0.09375, which
can be rounded to 0.094.

Question
A survey was conducted to see how many phone calls people made
daily. The results are displayed in the table below:

Number of calls made    Frequency
1 - 4                   16
5 - 8                   11
9 - 12                  5
13 - 16                 3
17 - 20                 1

How many of the people surveyed make less than 9 phone calls daily?

Answer choices: 3, 8, 9, 27, 32

Correct answer: 27
The frequencies between 1 - 8 calls add up to 27: 16 + 11 = 27.

Question
A survey was conducted to see how many phone calls people made
daily. The results are displayed in the table below:

Number of calls made    Frequency
1 - 4                   16
5 - 8                   11
9 - 12                  5
13 - 16                 3
17 - 20                 1

How many people were surveyed?

Answer choices: 7.2, 20, 36, 226

Correct answer: 36
The total number of people surveyed is found by adding up all the
frequencies: 16 + 11 + 5 + 3 + 1 = 36 people.
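Both phone-call questions come down to adding rows of the frequency table. As a small Python sketch (the tuple layout and variable names are our own choice, not something given in the course):

```python
# (lowest, highest, frequency) for each row of the phone-call table.
table = [(1, 4, 16), (5, 8, 11), (9, 12, 5), (13, 16, 3), (17, 20, 1)]

# Total surveyed: add up all the frequencies.
total = sum(freq for _, _, freq in table)

# Fewer than 9 calls: add the rows whose highest value is below 9.
fewer_than_9 = sum(freq for _, high, freq in table if high < 9)

print(total, fewer_than_9)
```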

Histogram: Shape
Learning Objective: Generate and interpret several
different graphical displays of the distribution of a
quantitative variable (histogram, stemplot,
boxplot).
Learning Objective: Summarize and describe the
distribution of a quantitative variable in context: a)
describe the overall pattern, b) describe striking
deviations from the pattern.
Interpreting the Histogram
Once the distribution has been displayed graphically, we can describe
the overall pattern of the distribution and mention any striking
deviations from that pattern. More specifically, we should consider
the following features of the distribution:

We will get a sense of the overall pattern of the data from the
histogram's center, spread and shape, while outliers will highlight
deviations from that pattern.
Shape
When describing the shape of a distribution, we should consider:

1. Symmetry/skewness of the distribution.

2. Peakedness (modality): the number of peaks (modes) the
distribution has.

We distinguish between:

Symmetric Distributions
Note that all three distributions are symmetric, but are different in
their modality (peakedness). The first distribution is unimodal—it
has one mode (roughly at 10) around which the observations are
concentrated. The second distribution is bimodal—it has two modes
(roughly at 10 and 20) around which the observations are
concentrated. The third distribution is kind of flat, or uniform. The
distribution has no modes, or no value around which the observations
are concentrated. Rather, we see that the observations are roughly
uniformly distributed among the different values.

Skewed Right Distributions

A distribution is called skewed right if, as in the histogram above,
the right tail (larger values) is much longer than the left tail
(smaller values). Note that in a skewed right distribution, the bulk of
the observations are small/medium, with a few observations that are
much larger than the rest. An example of a real-life variable that has
a skewed right distribution is salary. Most people earn in the
low/medium range of salaries, with a few exceptions (CEOs,
professional athletes etc.) that are distributed along a large range
(long "tail") of higher values.

Skewed Left Distributions

A distribution is called skewed left if, as in the histogram above, the
left tail (smaller values) is much longer than the right tail (larger
values). Note that in a skewed left distribution, the bulk of the
observations are medium/large, with a few observations that are
much smaller than the rest. An example of a real-life variable that
has a skewed left distribution is age of death from natural causes
(heart disease, cancer etc.). Most such deaths happen at older ages,
with fewer cases happening at younger ages.

Comments:

1. Note that skewed distributions can also be bimodal. Here is an
example. A medium-sized neighborhood 24-hour convenience store
collected data from 537 customers on the amount of money spent in
a single visit to the store. The following histogram displays the data.
Note that the overall shape of the distribution is skewed to the right
with a clear mode around $25. In addition, it has another (smaller)
"peak" (mode) around $50-55. The majority of the customers spend
around $25, but there is a cluster of customers who enter the store
and spend around $50-55.

2. If a distribution has more than two modes, we say that the
distribution is multimodal.

Recall our grades example below. As you can see from the histogram,
the grades distribution is roughly symmetric.
Histogram: Center, Spread, & Outliers
Learning Objective: Generate and interpret several
different graphical displays of the distribution of a
quantitative variable (histogram, stemplot,
boxplot).
Learning Objective: Summarize and describe the
distribution of a quantitative variable in context: a)
describe the overall pattern, b) describe striking
deviations from the pattern.
Center
The center of the distribution is its midpoint, the value that divides
the distribution so that approximately half the observations take
smaller values and approximately half the observations take larger
values. Note that from looking at the histogram we can get only a
rough estimate for the center of the distribution. (More exact ways
of finding measures of center will be discussed in the next section.)

Recall our grades example:

As you can see from the histogram, the center of the grades
distribution is roughly 70 (7 students scored below 70, and 8 students
scored above 70).

Spread
The spread (also called variability) of the distribution can be
described by the approximate range covered by the data. From looking
at the histogram, we can approximate the smallest observation (min)
and the largest observation (max), and thus approximate the range.
(More exact ways of finding measures of spread will be discussed in
the next section.)

In our example:

 Approximate min: 45 (the middle of the lowest interval of scores)

 Approximate max: 95 (the middle of the highest interval of scores)

 Approximate range: 95 - 45 = 50
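These rough estimates can be reproduced from the exam grades table alone. The Python sketch below is only an illustration under our own assumptions: it takes the midpoint of the lowest and highest occupied intervals as the approximate min and max, and it splits the counts at 70 to check the center.

```python
# (left, right, count) for each interval in the exam grades table.
bins = [(40, 50, 1), (50, 60, 2), (60, 70, 4),
        (70, 80, 5), (80, 90, 2), (90, 100, 1)]

# Intervals that contain at least one observation.
occupied = [(left, right) for left, right, n in bins if n > 0]

approx_min = sum(occupied[0]) / 2     # middle of the lowest interval
approx_max = sum(occupied[-1]) / 2    # middle of the highest interval
approx_range = approx_max - approx_min

# How the observations split around a proposed center of 70.
below_70 = sum(n for left, right, n in bins if right <= 70)
at_or_above_70 = sum(n for left, right, n in bins if left >= 70)

print(approx_min, approx_max, approx_range)
print(below_70, at_or_above_70)
```

A split of 7 and 8 out of 15 is as even as possible, which is why 70 is a reasonable estimate of the center.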

Outliers
Outliers are observations that fall outside the overall pattern. For
example, the following histogram represents a distribution that has a
probable high outlier:
Go back and check the histogram of exam grades at the top of this
page. As you can see, there are no outliers.

EXAMPLE: BEST ACTRESS OSCAR WINNERS

To give an example of a histogram applied to real data, we will look
at the ages of Best Actress Oscar winners from 1970 to 2013 (to see
the full dataset, click here).

The histogram of the data is shown below.

We will now summarize the main features of the distribution of ages
as it appears from the histogram:

Shape: The distribution of ages is skewed right. We have a
concentration of data among the younger ages and a long tail to the
right. The vast majority of the "best actress" awards are given to
young actresses, with very few awards given to older actresses.

Center: The data seem to be centered around 34 or 35 years old. Note
that this implies that roughly half the awards are given to actresses
who are less than 34 years old.

Spread: The data range from about 20 to about 80, so the approximate
range equals 80 - 20 = 60.

Outliers: There seem to be two possible outliers at the far right
and, possibly, three around 62 years old.

You can see how informative it is to know "what to look at" in a
histogram. If there is one conclusion we can draw here, it is that
Hollywood likes its actresses young.

We will use the Best Actor Oscar winners (1970-2013) dataset to
practice what you have learned about describing a histogram. Below is
the histogram of the Best Actor Oscar winners from 1970 to 2013,
grouped by age.

Learn By Doing

What is the shape of this histogram?

Answer choices: Symmetric-Uniform, Skewed Left, Skewed Right,
Symmetric-Unimodal, Symmetric-Bimodal

Correct answer: Skewed Right
Most of the Best Actor Oscar winners are from the younger age groups.

Did I Get This

What is the most likely shape of the distribution of age at death
from trauma (accident, murder, suicide, drug overdose, etc.) when
displayed in a histogram?

Recall that we talked earlier about the shape of the distribution of
age at death from natural causes (heart disease, cancer, etc.). Use a
similar type of reasoning for age at death from trauma.

Answer choices: Symmetric-Uniform, Skewed Left, Skewed Right,
Symmetric-Unimodal, Symmetric-Bimodal

Correct answer: Skewed Right
Most deaths from trauma (accidents, suicides, drug overdoses, etc.)
occur at younger ages, with fewer at older ages. Therefore, we expect
the distribution of age at death from trauma to be skewed right.

Did I Get This
What is the most likely shape of the distribution of the age at which
a child takes his or her first steps?

Answer choices: Symmetric-Uniform, Skewed Left, Skewed Right,
Symmetric-Unimodal, Symmetric-Bimodal

Correct answer: Symmetric-Unimodal
Most children begin to walk at about the same age, so the
distribution is centered at about 18 months. It has some variability
but is unlikely to have outliers. Therefore, it is symmetric and
unimodal.

Let's Summarize

 The histogram is a graphical display of the distribution of a
quantitative variable. It plots the number (count) of observations
that fall in intervals of values.

 When examining the distribution of a quantitative variable, one
should describe the overall pattern of the data (shape, center,
spread) and any deviations from the pattern (outliers).

 When describing the shape of a distribution, one should consider:

 Symmetry/skewness of the distribution

 Peakedness (modality): the number of peaks (modes) the distribution
has.

Not all distributions have a simple, recognizable shape.

 Outliers are data points that fall outside the overall pattern of
the distribution and need further investigation before continuing the
analysis.

 It is always important to interpret what the features of the
distribution (as they appear in the histogram) mean in the context of
the data.
