In this example, the individuals are patients, and the variables are
Gender, Age, Weight, Height, Smoking, and Race. Each row, then,
gives us all the information about a particular individual (in this case,
patient), and each column gives us information about a particular
characteristic of all the patients.
Variables can be classified into one of two types: categorical or
quantitative.
NOTE...
Categorical variables are sometimes called qualitative variables, but
in this course we use the term categorical.
People living in the United States in the year 2000 (correct)
People with families in the year 2000
Correct:
The U.S. Census is completed by people living in the United States.
Learn By Doing
What type of variable is Zipcode?
Learn By Doing
What type of variable is Family Size?
Learn By Doing
What type of variable is Annual Income?
AcuteT: The time in days that the patient was depressed prior to the
study.
Age: The age of the patient in years, when the patient entered the
study.
Background to Dataset
Clinical depression is the most common mental illness in the United
States, affecting 19 million adults each year (Source: NIMH, 1999).
Nearly 50% of individuals who experience a major episode will have a
recurrence within 2 to 3 years. Researchers are interested in
comparing therapeutic solutions that could delay or reduce the
incidence of recurrence.
Our Answer:
The categorical variables are 1) Hostp because the numbers
represent codes, which are used to identify individual hospitals and
place them into categories. As such, the numbers used for the codes
(1, 2, 3, 5, and 6) have no arithmetic meaning; 2) Treat because the
treatment received by the patients is in the form of categories
(Lithium, Imipramine, or Placebo); 3) Outcome since recurrence is in
the form of two categories (Recurrence or No Recurrence) and
4) Gender because the numbers represent two distinct categories:
Female and Male. Thus, the numbers used to represent gender (1 =
Female; 2 = Male) have no arithmetic meaning.
Our Answer:
The quantitative variables are 1) Time since it can take on multiple
numerical values, which have arithmetic meaning (i.e., it makes
sense to add, subtract, multiply, divide, or compare the magnitude of
such values); 2) Age since it can take on multiple numerical values,
which represent a characteristic of the patient; and
3) AcuteT because it can take on multiple numerical values to
represent a characteristic of the patient.
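The distinction drawn in the two answers above can be sketched in code. The snippet below is only an illustration: the three patient records are invented, and the `summarize` helper is a hypothetical name, not part of the course materials. It counts categories for a categorical variable and averages a quantitative one, since arithmetic on codes such as 1 = Female, 2 = Male has no meaning.

```python
from collections import Counter
from statistics import mean

# Variable names follow the answers above; the classification is taken from the text.
CATEGORICAL = {"Hostp", "Treat", "Outcome", "Gender"}
QUANTITATIVE = {"Time", "Age", "AcuteT"}

# Hypothetical records, invented for illustration only.
patients = [
    {"Gender": 1, "Treat": "Lithium", "Age": 30},
    {"Gender": 2, "Treat": "Placebo", "Age": 40},
    {"Gender": 2, "Treat": "Imipramine", "Age": 50},
]

def summarize(records, variable):
    """Count categories for categorical variables, average quantitative ones."""
    values = [record[variable] for record in records]
    if variable in CATEGORICAL:
        # Codes are labels, so the only sensible summary is how often each occurs.
        return dict(Counter(values))
    # Quantitative values have arithmetic meaning, so a mean is meaningful.
    return mean(values)

gender_counts = summarize(patients, "Gender")
average_age = summarize(patients, "Age")
print(gender_counts, average_age)
```

Averaging the Gender codes would run without error, which is exactly why the type of each variable has to be decided by the analyst rather than by the software.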
[...] measurement become increasingly precise with each level while retaining the
characteristics of the level below it. For instance, ordinal data builds on the
characteristics of nominal data and is therefore more precise than nominal data.
Now we're going to take a closer look at the [...]

For instance, types of pets can include birds, fish, cats, and dogs. None of these
categories (in this case, types of pets) is implicitly better than the others.
Rather, the categories reflect the different types of pets. Some other examples of
nominal variables include gender, eye color, type of house, and type of resident.
The categories for each of these nominal variables reflect qualitative differences.

Now let's take a closer look at ordinal variables. The second level of
measurement is ordinal. Ordinal-level data is more precise than nominal data because
the differences can now be rank ordered. However, it does not indicate
that the differences between two numbers are fixed or equal. An example of an [...]
say that Tanesha finished first and Alexis finished second. However, we do not
know the degree of difference. It could be that Tanesha won by 2 seconds or 20 minutes.
Other examples of ordinal variables include education level and some survey
question responses. The categories for each of these ordinal variables show order, [...]

[...] between two numbers; that is, the distance between pairs of consecutive numbers is
assumed to be equal. However, interval variables do not have a meaningful zero point. Thus, a
zero does not mean the absence [...]
difference, say, between 35 degrees and 36 degrees is assumed to be the same as one [...]
Celsius does not mean the absence of heat, since there are below-freezing
temperatures such as twenty below. The same characteristics also hold true for [...]
include IQ scores and SAT scores. For each of these interval variables, the [...]
do not have a meaningful zero point.

The fourth and final level of [...]
meaningful zero point. A good example of ratio-level data is age. For instance, we know that
someone who is forty years old is twice as old as someone who is twenty years old.
There is a meaningful zero point; that is, it is possible to have the absence of age. [...]
weight, cost of a car in dollars. For each of these ratio variables the [...]
a meaningful zero point.

It is also important to note that more precise data
can always be scaled down to less precise data. For instance, a ratio-level
variable like age can be scaled into an ordinal variable of age groups, which
could include toddler, adolescent, young adult, and middle aged. Less precise data, [...]
precise data; that is, an ordinal-level variable like age groups cannot be
changed into a ratio-level variable such as age in years.

So why is it important to [...]
summarizing, analyzing, and interpreting data, and are designed for specific types
of data. Researchers need to know the level of measurement of their data when [...]
these levels of measurement in mind whenever you are planning to analyze data.
https://youtu.be/wIBu7J18Fpw
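As a rough sketch of scaling a ratio-level variable down to an ordinal one, the snippet below maps age in years to the age groups named in the transcript. The group boundaries are assumptions made for this example (the transcript names the groups but not their cutoffs), and `age_group` is a hypothetical helper.

```python
# Age groups named in the transcript, in rank order.
# The [low, high) boundaries in years are assumed for illustration only.
GROUPS = [
    ("toddler", 1, 4),
    ("adolescent", 13, 20),
    ("young adult", 20, 40),
    ("middle aged", 40, 65),
]

def age_group(age_years):
    """Return the ordinal age group for an age, or None if no group applies."""
    for label, low, high in GROUPS:
        if low <= age_years < high:
            return label
    return None

# Two different ratio-level values collapse into the same ordinal category,
# so the original ages cannot be recovered from the group alone.
first = age_group(22)
second = age_group(38)
print(first, second)
```

The mapping only goes one way: knowing that someone is a "young adult" does not tell us whether that person is 22 or 38, which is why ordinal data cannot be scaled back up to ratio data.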
Learn By Doing
The number of minutes it takes participants to run one mile.
Assigning participants rank numbers (i.e., 1st place, 2nd place), based on the time
it takes each of them to run one mile.
Identifying participants as runners or non-runners. (correct)
Correct:
This measure, like all nominal scales of measurement, assigns subjects to discrete
categories; thus, participants are either runners or non-runners.
Learn By Doing
Correct:
This measure, like all ordinal scales of measurement, rank-orders participants on some
scale or attribute. Thus, the condition of a car is ranked, but the distance between the ranks
is unknown.
Learn By Doing
Correct:
Intelligence scores are at the interval level of measurement because they take numerical form
and the distance between pairs of scores is assumed to be equal, but there is no
meaningful zero point; that is, there cannot be a complete absence of intelligence.
Learn By Doing
Social Security numbers
Clothing sizes (e.g., Small, Medium, Large)
Length of room in inches (correct)
Correct:
Length of room in inches is a ratio variable because it uses numbers to represent the
amount of a characteristic and it has a meaningful zero.
The next activity will help you to see whether you understand the
different scales of measurement.
Correct:
Ordinal scales of measurement use rank ordering. In this case, the measure ranks
individuals into high, medium, and low groups based on each person's level of anxiety.
Correct:
This measure of political affiliation is nominal. It assigns values to discrete categories and
attributes these values to the research subjects.
Ratio variables use numbers to represent the amount of a characteristic, where zero means
the absence of the characteristic. In this case, the measure of time represents the
proportion of the class that is group work, where zero means they do not do any group
work in this 45-minute period.
Scores on the SAT Math Test (note: the scores on the SAT Math Test
range from 200 to 800). What scale of measurement is this measure?
Examining Distributions
As indicated in the introduction, we will begin the EDA part of the
course by exploring (or looking at) one variable at a time.
As we saw in Data and Variables, the data for each variable are a
long list of values (whether numerical or not), and are not very
informative in that form. In order to convert these raw data into
useful information we need to summarize and then examine
the distribution of the variable. By distribution of a variable, we
mean:
This module has two sections. We will first learn how to summarize
and examine the distribution of a single categorical variable, and then
do the same for a quantitative variable.
Frequency Distributions
Learning Objective: Summarize and describe the
distribution of a categorical variable in context.
What is your perception of your own body? Do you feel that you are
overweight, underweight, or about right?
Body Image
student 25 overweight
student 27 underweight
Learn By Doing
What is the difference between the two bar charts?
There is no difference.
The two bar charts represent the distributions of two different variables.
The first bar chart represents the count of respondents that chose each category,
while the second bar chart represents the percentage of respondents that chose each
category. (correct)
The two bar charts represent the distribution of "Body Image" obtained from two
different samples.
Correct:
The two bar charts are different because counts and percentages have different scales on
the vertical axis. Counts have a scale from 0 to the total number of subjects, while
percentages always have a scale from 0 to 100.
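The relationship between the count scale and the percentage scale can be sketched as follows. The ten responses are invented for illustration and are not the survey data; only the three category names come from the text.

```python
from collections import Counter

# Hypothetical responses, invented for illustration only.
responses = ["about right"] * 7 + ["overweight"] * 2 + ["underweight"] * 1

# Counts: the vertical axis of the first kind of bar chart (0 to total).
counts = Counter(responses)
total = sum(counts.values())

# Percentages: the vertical axis of the second kind of bar chart (0 to 100).
percentages = {category: 100 * n / total for category, n in counts.items()}

for category in counts:
    print(f"{category:12s} {counts[category]:3d} {percentages[category]:6.1f}%")
```

Both summaries describe the same distribution; dividing every count by the same total changes the axis scale but not the relative heights of the bars.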
Now that we have summarized the distribution of values in the Body Image
variable, let's go back and interpret the results in the context of the questions
that we posed:
Learn By Doing
What do the results suggest about how the students are divided
across the three body image categories?
Students are equally divided across the three categories.
Students are not equally divided across the three categories. (correct)
Correct:
You correctly saw that the pieces of the pie and the lengths of the three bars representing
the three body image categories are not all the same. Thus, the students' responses are
not equally divided among the categories.
Learn By Doing
How do the vast majority of students (71.3%) feel about their
weight?
About right (correct)
Overweight
Underweight
Correct:
71.3% is well over half, or "the vast majority" of the respondents. We are looking for the
category that has the largest piece of the pie and the longest bar in the bar chart: the
category "about right." Also, both charts note that the percentage for this category is 71.3%.
Learn By Doing
How does the middle group of students (19.6%) feel about their
weight?
Learn By Doing
What was the body perception that occurred the least often?
Now that we've interpreted the results, there are some other
interesting questions that arise:
Let's Summarize
- With whom do you find it easier to make friends? (Opposite sex, same
sex, or no difference).
Excel Instructions
To open the dataset, click here to download the file to your
computer. Then find the downloaded file and click it to open it in
Excel. When Excel opens, you may have to enable editing.
Excel will create a new worksheet with a blank table. We need to
tell Excel how to build the table:
You should now have a table showing the count of each type of
data entry: "No difference", "Opposite sex", and "Same sex",
plus a grand total. This is the count of each of these
entries. To see the percentage of each entry, we can ask
Excel to show percentages instead of counts:
Right-click any of the cells containing the counts and choose
Value Field Settings from the pop-up menu.
Right-click any of the cells containing the percentages and choose
Value Field Settings from the pop-up menu.
Now click one of the cells in your table, then click the
PivotChart button in the Tools group of the Analyze tab
(PivotTable Tools).
To label the slices of the pie, right-click the pie and choose
Format Data Labels (2013 version) or Add Data Labels (2016
version) from the pop-up menu.
To see percentages as well as absolute values, right-click the
pie and choose Format Data Labels from the pop-up menu. In the
menu that appears on the right, check the box next to Percentage.
Comment
Note that the pie chart visually conveys all the information
found in the table.
Learn By Doing
Our Answer:
Histogram: Intervals
Learning Objective: Generate and interpret several
different graphical displays of the distribution of a
quantitative variable (histogram, stemplot,
boxplot).
Idea
Break the range of values into intervals and count how many
observations fall into each interval.
88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73
We first need to break the range of values into intervals (also called
"bins" or "classes"). In this case, since our dataset consists of exam
scores, it makes sense to choose intervals that correspond to the
typical range of a letter grade, 10 points wide: 40-50, 50-60, ...,
90-100. By counting how many of the 15 observations fall in
each of the intervals, we get the following table:
Exam Grades
Score Count
[40-50) 1
[50-60) 2
[60-70) 4
[70-80) 5
[80-90) 2
[90-100] 1
The table above can also be turned into a relative frequency table
using the following steps:
2. Add a column, at the end of the table, and calculate the relative
frequency for each interval, by dividing the number of observations in
each row by the total number of observations.
1. What is the percentage of exam scores that were 70 and up to, but
not including, 80? To determine the answer, we look at the relative
frequency associated with the [70-80) interval. The relative frequency
is 0.33; to convert to a percentage, multiply by 100 (0.33 * 100 = 33),
which gives 33%.
Add together the relative frequencies for the intervals that have
scores of at least 70. Thus, we would need to add together the
relative frequencies from [70-80), [80-90), and [90-100]: 0.33 +
0.13 + 0.07 = 0.53.
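The counts and relative frequencies above can be reproduced with a short script. This is a minimal sketch using the 15 exam scores listed earlier; the constant and function names (`LOW`, `WIDTH`, `N_BINS`, `bin_index`) are chosen for this example only.

```python
# The 15 exam scores from the example.
scores = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]

LOW, WIDTH, N_BINS = 40, 10, 6  # intervals [40-50), ..., [80-90), [90-100]

def bin_index(score):
    """Index of the interval containing score; the last interval includes 100."""
    return min((score - LOW) // WIDTH, N_BINS - 1)

# Count how many observations fall into each interval.
counts = [0] * N_BINS
for score in scores:
    counts[bin_index(score)] += 1

# Relative frequency = interval count divided by the total number of observations.
relative = [count / len(scores) for count in counts]

for i, (count, rel) in enumerate(zip(counts, relative)):
    left = LOW + i * WIDTH
    closing = "]" if i == N_BINS - 1 else ")"
    print(f"[{left}-{left + WIDTH}{closing}  {count}  {rel:.2f}")

# Proportion of scores of at least 70: the last three intervals.
at_least_70 = sum(relative[3:])
print(f"{at_least_70:.2f}")
```

The script sums the unrounded relative frequencies, so it is not affected by the small rounding error that can creep in when adding values already rounded to two decimals.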
Learn By Doing
Recall the table from the exam grades example above:
Exam Grades
Score Count
[40-50) 1
[50-60) 2
[60-70) 4
[70-80) 5
[80-90) 2
[90-100] 1
Comments
1. It is very important that each observation be counted only in one
interval. For the most part, it is clear which interval an observation
falls in. However, in our example, we needed to decide whether to
include 60 in the interval 50-60, or the interval 60-70, and we chose
to count it in the latter. In fact, this decision is captured by the way
we wrote the intervals. If you scroll up and look at the table, you'll
see that we wrote the intervals in a peculiar way: [40-50), [50-60),
[60-70), etc. The square bracket means "including" and the
parenthesis means "not including". For example, [50-60) is the
interval from 50 to 60, including 50 and not including 60; [60-70) is
the interval from 60 to 70, including 60 and not including 70, etc. It
really does not matter how you decide to set up your intervals, as
long as you're consistent.
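One way to make the "including / not including" convention explicit is to look up each score against the interval edges. The sketch below uses Python's standard `bisect` module; `EDGES` and `interval_for` are hypothetical names chosen for this example.

```python
from bisect import bisect_right

EDGES = [40, 50, 60, 70, 80, 90, 100]

def interval_for(score):
    """Return the label of the single interval that contains score."""
    if not EDGES[0] <= score <= EDGES[-1]:
        raise ValueError(f"score {score} is outside the range of the table")
    # bisect_right places a boundary value such as 60 in the interval that
    # starts at 60, which matches the [left, right) convention.
    i = min(bisect_right(EDGES, score) - 1, len(EDGES) - 2)
    closing = "]" if i == len(EDGES) - 2 else ")"
    return f"[{EDGES[i]}-{EDGES[i + 1]}{closing}"

print(interval_for(59), interval_for(60), interval_for(100))
```

Because every score maps to exactly one label, no observation can be counted twice, whichever boundary convention is chosen.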
The values of the number of hours studied. (correct)
The count of students falling in each of the intervals.
Correct:
The horizontal axis represents the number of hours studied.
The values of the number of hours studied.
The count of students falling in each of the intervals. (correct)
Correct:
The vertical axis represents the count of students falling in each of the intervals.
Extra Problems
These extra questions are here to give you more practice if you feel
you need it. No new concepts are introduced on this page. If you've
"got it", go ahead and move on to the next page. If you'd like a little
more practice, work through the questions below.
Question
Thirty-two students were asked the number of servings of fruits and
vegetables they eat daily. The results are displayed in the histogram
below.
Question
20
31.2%
37.5%
62.5% (correct)
68.8%
Correct:
There are 20 students who eat no more than 3 servings daily and a total of 32 students. So,
20/32 = 0.625 or 62.5%.
Question
0.063
0.094 (correct)
0.156
2
3
Correct:
There are 3 students who ate exactly 5 servings of fruits and vegetables daily and a total of
32 students. So, 3/32 = 0.09375, which can be rounded to 0.094.
Question
A survey was conducted to see how many phone calls people made
daily. The results are displayed in the table below:
1-4 16
5-8 11
9 - 12 5
13 - 16 3
17 - 20 1
How many of the people surveyed make fewer than 9 phone calls daily?
Question
A survey was conducted to see how many phone calls people made
daily. The results are displayed in the table below:
1-4 16
5-8 11
9 - 12 5
13 - 16 3
17 - 20 1
Correct:
The total number of people surveyed is found by adding up all the frequencies: 16 + 11 + 5
+ 3 + 1 = 36 people.
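The same arithmetic can be checked with a short script. This is a sketch only: the dictionary layout and the variable names are choices made for this example, while the frequencies are the ones in the phone-call table.

```python
# Frequency table from the survey: (fewest calls, most calls) -> number of people.
calls = {
    (1, 4): 16,
    (5, 8): 11,
    (9, 12): 5,
    (13, 16): 3,
    (17, 20): 1,
}

# Total number of people surveyed: the sum of all frequencies.
total = sum(calls.values())

# People making fewer than 9 calls: every interval whose upper end is below 9.
fewer_than_9 = sum(n for (low, high), n in calls.items() if high < 9)

print(total, fewer_than_9)
```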
Histogram: Shape
Learning Objective: Generate and interpret several
different graphical displays of the distribution of a
quantitative variable (histogram, stemplot,
boxplot).
Learning Objective: Summarize and describe the
distribution of a quantitative variable in context: a)
describe the overall pattern, b) describe striking
deviations from the pattern.
Interpreting the Histogram
Once the distribution has been displayed graphically, we can describe
the overall pattern of the distribution and mention any striking
deviations from that pattern. More specifically, we should consider
the following features of the distribution:
We will get a sense of the overall pattern of the data from the
histogram's center, spread and shape, while outliers will highlight
deviations from that pattern.
Shape
When describing the shape of a distribution, we should consider:
We distinguish between:
Symmetric Distributions
Note that all three distributions are symmetric, but are different in
their modality (peakedness). The first distribution is unimodal—it
has one mode (roughly at 10) around which the observations are
concentrated. The second distribution is bimodal—it has two modes
(roughly at 10 and 20) around which the observations are
concentrated. The third distribution is kind of flat, or uniform. The
distribution has no modes, or no value around which the observations
are concentrated. Rather, we see that the observations are roughly
uniformly distributed among the different values.
Comments:
Recall our grades example below. As you can see from the histogram,
the grades distribution is roughly symmetric.
Histogram: Center, Spread, &
Outliers
Learning Objective: Generate and interpret several
different graphical displays of the distribution of a
quantitative variable (histogram, stemplot,
boxplot).
Learning Objective: Summarize and describe the
distribution of a quantitative variable in context: a)
describe the overall pattern, b) describe striking
deviations from the pattern.
Center
The center of the distribution is its midpoint, the value that divides
the distribution so that approximately half the observations take
smaller values and approximately half the observations take larger
values. Note that from looking at the histogram we can get only a
rough estimate for the center of the distribution. (More exact ways
of finding measures of center will be discussed in the next
section.)
Spread
The spread (also called variability) of the distribution can be
described by the approximate range covered by the data. From
looking at the histogram, we can approximate the smallest
observation (min) and the largest observation (max), and thus
approximate the range. (More exact ways of finding measures of
spread will be discussed in the next section.)
In our example:
Approximate min: 45 (the middle of the lowest interval of scores)
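The midpoint approximation can be sketched as follows, using the intervals and counts from the exam grades table. The text gives only the approximate min; applying the same midpoint rule to the highest interval is an assumption made here, and the variable names are chosen for this example.

```python
# Intervals and counts from the exam grades table.
bins = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
counts = [1, 2, 4, 5, 2, 1]

# Only intervals that actually contain observations matter for min and max.
occupied = [interval for interval, count in zip(bins, counts) if count > 0]

# Approximate min: middle of the lowest occupied interval.
approx_min = sum(occupied[0]) / 2
# Approximate max: middle of the highest occupied interval (assumed by analogy).
approx_max = sum(occupied[-1]) / 2
approx_range = approx_max - approx_min

print(approx_min, approx_max, approx_range)
```

For comparison, the smallest and largest of the 15 actual scores are 48 and 97, so the histogram-based estimates are close but not exact.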
Outliers
Outliers are observations that fall outside the overall pattern. For
example, the following histogram represents a distribution that has
a likely high outlier:
Go back and check the histogram of scores at the top of this
page. As you can see, there are no outliers.
Learn By Doing
Correct:
Most of the Best Actor Oscar winners are from the younger age
groups.
Did I Get This?
What is the most likely shape of the distribution of the age at
which a child takes his or her first steps?
Let's Summarize
Outliers are data points that fall outside the overall pattern of the
distribution and need further investigation before continuing the
analysis.