
Organizing Data

If we conduct an experiment of some kind, and collect data, we may face the problem of having too much of a good thing.... The difficulty comes when we try to answer questions like:

How do we deal with large amounts of data? or, What have we learned from examining the data?

Statistics is a discipline that can help us make sense of too much data. But while statistical tools can bring insight and clarity to large amounts of information, used inappropriately they can also lead to confusing, even misleading, interpretations.

To begin, look at the data in the applet to the right. The data represent scoring that is typical of how students in this class perform on the first exam. Move the vertical scrollbar up and down to examine the scores. From this simple observation, are you able to form an impression of how values in the array are concentrated? As you can see, the 100 scores are unsorted. You may also have noticed that they range in value from 38 to 98. But extremes rarely tell you much about trends in the data, and even if you take time to examine the scores, the lack of order and the sheer number of values make it difficult to make sense of what you are viewing.

As a member of this class, a number of questions about the exam might concern you. In general, you might want to know how students taking this class "typically" score on the first exam. This could help you to set realistic performance and learning goals, to allocate study time for the class, and to form study groups. Questions of interest to you might include the following:

- How many students scored in the 80s?
- How many scored above 70?
- What was the average score?
- What was the most frequent score?
- Where in the distribution is the data concentrated?
- Where was your score on the first exam relative to these scores?
  o Top 50%?
  o Top 25%?

Questions like these are often addressed in the early stages of data analysis and are answered using descriptive statistics. Using the exam score data, this first activity will explore the area of descriptive statistics.
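Several of these questions can be answered with a few lines of code. The sketch below uses the score frequencies listed in Table 1 further down (which reproduce the full 100-score data set) to compute some of these descriptive summaries; the variable names are illustrative, not part of the applet.

```python
from statistics import multimode

# Score frequencies copied from Table 1; expand them into the 100 scores.
table1 = {
    38: 1, 42: 1, 45: 2, 48: 1, 49: 1, 51: 1, 52: 1, 53: 2, 55: 1, 56: 1,
    59: 1, 60: 3, 61: 3, 62: 3, 63: 3, 64: 2, 66: 3, 67: 2, 68: 5, 70: 2,
    71: 2, 72: 5, 73: 3, 74: 2, 75: 2, 76: 4, 77: 2, 78: 3, 79: 2, 80: 5,
    81: 5, 82: 4, 84: 1, 85: 1, 86: 1, 87: 3, 88: 1, 89: 1, 90: 3, 91: 3,
    92: 2, 96: 1, 97: 4, 98: 1,
}
scores = [value for value, count in table1.items() for _ in range(count)]

in_80s = sum(1 for s in scores if 80 <= s <= 89)    # scores in the 80s
above_70 = sum(1 for s in scores if s > 70)         # strictly above 70
mean = sum(scores) / len(scores)                    # average score
most_frequent = multimode(scores)                   # several scores tie at f = 5

print(in_80s, above_70, round(mean, 2), most_frequent)
# → 22 61 73.28 [68, 72, 80, 81]
```

Note that "most frequent score" has no single answer here: four values each occur five times, so the distribution is multimodal.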

Simple Frequency Distribution


Many people, when faced with the task of summarizing a large amount of data, report that they are overwhelmed by the extent of the information. In essence, they are having difficulty describing un-grouped data. To better understand how students performed on our sample first exam, we need to put the information into a form that is more interpretable. Thus, we begin our analysis by organizing the data into a simple frequency distribution. To accomplish this task, the first job is to sort the exam scores from lowest to highest. Click the sort button in the applet. We're now ready to build a simple frequency distribution.

Constructing a Simple Frequency Distribution

A simple frequency distribution involves ordering scores from lowest to highest (or highest to lowest) and counting the number of times each score occurs. Fundamentally, there are two columns in the table: one to list the score or value, and the other to indicate its frequency of occurrence. Notice that in the data set, the range of scores goes from 38 to 98, a span of 61 integers. In a simple frequency distribution, only those scores that actually occur in the data are listed. Thus, a score like 38 would be listed in the table, but 39, 40, and 41 would not, because they don't appear in the actual data set. An example of a simple frequency distribution table can be found in Table 1.

Table 1. Simple Frequency Distribution
Value  f     Value  f     Value  f     Value  f
38     1     60     3     73     3     85     1
42     1     61     3     74     2     86     1
45     2     62     3     75     2     87     3
48     1     63     3     76     4     88     1
49     1     64     2     77     2     89     1
51     1     66     3     78     3     90     3
52     1     67     2     79     2     91     3
53     2     68     5     80     5     92     2
55     1     70     2     81     5     96     1
56     1     71     2     82     4     97     4
59     1     72     5     84     1     98     1

Comparing the table summary to the sorted data, you can see how the economy of the table brings some clarity to the general question of how students performed on this test. For example, there were not many scores at the lower end of the distribution. Only 13 students (13%) scored below 60, a passing grade for this test. The majority of students scored in the 60s, 70s, and 80s with 14 students scoring 90 or above.

The simple frequency distribution is an easy way to investigate scoring trends but as you can see from the table, with so many values and with most having low frequencies (< 3), it is still hard to see how scores are concentrated in the array. This type of distribution, then, is a good first step toward organizing and simplifying data but additional simplification would be desirable.
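Tallying scores this way is exactly what a counting routine does. As a minimal sketch in Python, using a small made-up batch of scores rather than the full data set:

```python
from collections import Counter

# A small illustrative batch of scores (hypothetical; the full 100-score
# data set lives in the applet and in Table 1).
batch = [72, 68, 80, 72, 55, 68, 97, 80, 68, 90]

# Count how many times each value occurs; values that never occur are
# simply absent, just as in a simple frequency distribution table.
freq = Counter(batch)
for value in sorted(freq):
    print(value, freq[value])
```

Sorting the distinct values before printing mirrors the lowest-to-highest ordering of Table 1.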

Grouped Frequency Distributions


How can the data be further simplified? To accomplish this task, divide the range of scores into intervals and then list the intervals in the frequency distribution table. The result is called a grouped frequency distribution table because groups of scores are presented rather than individual values. An example of a grouped frequency distribution can be found in Table 2.

Table 2. Grouped Frequency Distribution
Class Interval    Exact Limits      Midpoint    f     cf
96 - 102          95.5 - 102.5      99          6     100
89 - 95           88.5 - 95.5       92          9     94
82 - 88           81.5 - 88.5       85          11    85
75 - 81           74.5 - 81.5       78          23    74
68 - 74           67.5 - 74.5       71          19    51
61 - 67           60.5 - 67.5       64          16    32
54 - 60           53.5 - 60.5       57          6     16
47 - 53           46.5 - 53.5       50          6     10
40 - 46           39.5 - 46.5       43          3     4
33 - 39           32.5 - 39.5       36          1     1

What distinguishes the grouped frequency distribution from the simple frequency distribution is the size of the intervals. In the grouped frequency distribution, score values are grouped together. When score values are squeezed together like this, the distribution becomes more compact, and patterns often emerge more clearly.

Rules to Guide the Construction of a Grouped Frequency Distribution Table

Rule 1 - The table should have about 10 class intervals.
  o With more than 10 intervals, the table becomes cumbersome.
  o With too few intervals, you lose information about the distribution of scores.
Rule 2 - The width of the interval should be a relatively simple number. For example, 2, 5, 10, or 20 would be good choices for the interval width.
  o It is easy to count by 5s or 10s.
  o Easy to understand.
Rule 3 - An interval is usually started with the lowest score divisible by the interval size.
  o If you are using a width of 10, for example, the intervals should start with 10, 20, 30, 40, etc.
  o Makes it easier to understand the table.
Rule 4 - All intervals should be the same width.
  o The intervals should cover the range of scores completely, with no gaps and no overlaps. This means that there may be intervals within the distribution that have frequency counts of 0; this is not uncommon in distributions with extreme scores.
  o Each score should belong to only one interval.

Let's go back to the data and see how the grouped frequency table was constructed.

The first task is to determine the range of scores in the data set. For integer data, the range is computed by subtracting the lowest score from the highest score and adding 1.

You add 1 so that the score you start from is included in your range calculation. For example, if you read from page 1 to page 5 in a book, you've read 5 pages: 1, 2, 3, 4, and 5. But subtracting 1 from 5 yields 4, so to determine the true range you need to add 1. For our data set, the range is (98 - 38) + 1, or 61.
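In code, the inclusive range computation is a one-liner:

```python
# Inclusive range for integer data: (highest - lowest) + 1,
# using the extremes of the exam data set (38 and 98).
low, high = 38, 98
score_range = (high - low) + 1
print(score_range)  # 61
```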

Given the range, the next task is to determine the class interval width. Applying the first rule, the range of 61 is divided by 10. The result is an interval width of 6.1. This would be a difficult interval to work with, so the next step is to refine the width into something more manageable.

Applying the second rule, an interval width of 5, 6, or 7 would make sense. Choosing a width of 5 or 6, however, results in more than 10 intervals. Although 100 data points could comfortably support more than 10 intervals, for the purpose of this demonstration the class interval width will be set at 7, resulting in 10 intervals. Seven is a good choice; any odd-numbered width makes it easy to determine interval midpoints, something that needs to be done to fill out the table. Here, we depart somewhat from the rules. Typically, with an interval width of 7, the bottom interval would start at the first number that is divisible by 7 and less than 38. That would start the first class interval at 35 and end it at 41. But, like many things in statistics, the rules are guidelines to get started, not absolutes. In this case, the grouped frequency table will be transformed into a graph called a histogram, and to center the graph in the applet, the first interval will start at 33 and extend to 39.
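The width-selection arithmetic above can be sketched as follows (the variable names are illustrative, not part of the applet):

```python
import math

score_range = 61              # (98 - 38) + 1, from the previous step
raw_width = score_range / 10  # Rule 1: aim for about 10 intervals -> 6.1
width = 7                     # refined to a simple odd number (Rule 2)
start = 33                    # chosen here to center the histogram in the applet
highest = 98

# How many width-7 intervals are needed to cover 33 through 98 inclusive.
num_intervals = math.ceil((highest - start + 1) / width)
print(raw_width, width, num_intervals)  # 6.1 7 10
```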

To complete the class interval column, start the second interval at 40 and extend it to 46. Continue to construct intervals in a similar fashion; the last interval is the one that includes the highest value in the distribution, in this instance 98. The next step is to set exact interval limits. Exact limits are necessary because measurement variables are continuous, meaning that there should be no gaps between intervals. Look carefully at the class intervals column in Table 2. The problem with class intervals is the space between them. For example, the first interval ends at 39 and the next starts at 40, yet on a continuous measurement scale the distance between 39 and 40 contains an infinite number of equal divisions. To solve this problem, exact interval limits are used. Exact limits remove the space between intervals by dividing it in half: one half is added to the upper limit of one interval, and the other half is subtracted from the lower limit of the next interval. Figure 1 shows the relationship between class intervals, exact limits, and midpoints.
Figure 1. Illustration of class intervals, exact limits, and midpoints for the exam score data

Using exact limits ensures that each interval transitions seamlessly into the next and reflects the continuous nature of the underlying scale of measurement. As you can see in Table 2, the exact upper limit of one interval becomes the exact lower limit of the next interval. In other words, the intervals are now continuous, not separated by any space. Filling out the rest of the table is straightforward. Midpoints are the values at the center of their respective intervals. They are used as labels for the intervals in graphic displays of the distribution, and to represent interval values when computing the mean directly from a grouped frequency distribution table. The last part of the process is to count the number of scores that fall into each interval and to record those values in the corresponding cells of the frequency column. The cumulative frequency column does what its name says: it shows how the frequencies accumulate across the intervals. The frequency and cumulative frequency are always the same in the bottom interval, but the cumulative frequency grows as each ensuing cell frequency is added to the running total. Another way to think about cumulative frequency is that it is a running total. Flip between the table and the score distribution to verify the cell frequency and cumulative frequency counts.
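The whole construction can be checked in code. The sketch below rebuilds the 100 scores from the Table 1 frequencies, then computes every column of Table 2 (class interval, exact limits, midpoint, f, and cf), accumulating cf from the bottom interval up:

```python
# Score frequencies copied from Table 1; expand them into the 100 scores.
table1 = {
    38: 1, 42: 1, 45: 2, 48: 1, 49: 1, 51: 1, 52: 1, 53: 2, 55: 1, 56: 1,
    59: 1, 60: 3, 61: 3, 62: 3, 63: 3, 64: 2, 66: 3, 67: 2, 68: 5, 70: 2,
    71: 2, 72: 5, 73: 3, 74: 2, 75: 2, 76: 4, 77: 2, 78: 3, 79: 2, 80: 5,
    81: 5, 82: 4, 84: 1, 85: 1, 86: 1, 87: 3, 88: 1, 89: 1, 90: 3, 91: 3,
    92: 2, 96: 1, 97: 4, 98: 1,
}
scores = [value for value, count in table1.items() for _ in range(count)]

width, start = 7, 33           # interval choices made in the text
rows, cf = [], 0
lower = start
while lower <= max(scores):
    upper = lower + width - 1
    f = sum(1 for s in scores if lower <= s <= upper)
    cf += f                    # running total, accumulated bottom-up
    rows.append((f"{lower}-{upper}",           # class interval
                 (lower - 0.5, upper + 0.5),   # exact limits
                 (lower + upper) / 2,          # midpoint
                 f, cf))
    lower += width

for row in reversed(rows):     # print the top interval first, as in Table 2
    print(row)
```

Running this reproduces Table 2 row for row, with the top interval's cf equal to 100, the total number of scores.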
