
Statistical Analysis Paper
Anci Cao
Professor Grgurovic

INTRODUCTION

In this statistical analysis report, I will illustrate and interpret the descriptive statistics for two placement tests. The analysis of the test scores is essential for teachers and administrators. For the teachers, it is crucial because they need information on how well the tests work so they can revise them and improve their quality. The administrators want to place the students into the appropriate language levels. It is important to know how many students have been placed in each level, and whether such placement is reasonable.

The test scores under analysis are derived from two different placement tests. One is reading comprehension and the other is listening comprehension. The purpose of the tests is to place the students in the appropriate language proficiency level by assessing their prior English knowledge. Both tests include 30 multiple-choice questions to be answered based on students' understanding of the reading or listening passages. These items are, thus, passage-based and selected-response types of items (Carr, 2011, pp. 26-27). Each question is worth one point. The reading test consists of five different academic articles, and the main tested construct is the reading skill. Each passage is followed by several multiple-choice questions, which mainly assess the students' ability to:

summarize the main ideas; comprehend the meaning of a specific detail in the context of the passage; make basic inferences based on the reading; understand an unfamiliar word or phrase from context; identify the author's perspective.

In the listening test, the assessed construct is the listening skill. Students listen to three audio lectures and one video lecture. Afterwards, they answer comprehension questions that target the main ideas and specific details, making inferences, and understanding vocabulary from context. Even though the main tested constructs are different, both tests share the same structure and task types. Therefore, we still expect a somewhat positive relationship between the test scores. This implies that if the test takers do well on one test, they will likely score well on the other one as well.

RESULTS

Central Tendency

The statistics for each test are shown in Table 1. First of all, the measures of central tendency can be analyzed by examining the mean, median, and mode of the test scores. The measures of central tendency describe where most of the scores are in the distribution. The mean is the average score: the total of all the scores divided by the number of scores (Carr, 2011, p. 228). It tells us how difficult the test was for the students. As we can see, the mean for the listening test is 19.24, while the mean for the reading test is 16.18. This indicates that the reading test was more difficult for this group of students than the listening test, since both tests have the same maximum number of points (30), but the average score on the reading test is lower than the one on the listening test. Two additional measures of central tendency are the median and the mode. The median is the middle score, which corresponds to the 50th percentile: half the scores lie above this point, and half are below it (Carr, 2011, p. 228). Since 83 participants took both of these tests, an odd number, the median for each test is the middle-ranked score: 19 for the listening test and 16 for the reading test. Both median scores are very close to the mean scores of the tests. The last measure is the mode. The mode is the most common score. The modes for the listening and the reading test are 18 and 20, respectively. They are also relatively close to the mean and the median. The relative closeness of the mean, median, and mode suggests that the scores on both tests are approximately normally distributed (Carr, 2011, p. 231).
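As a minimal sketch of how these measures could be computed from the raw score lists, the following Python snippet uses only the standard library; the variable name and example values are placeholders, not the actual 83 scores.

```python
# Minimal sketch of the central tendency measures, assuming the raw
# scores are available as a plain list. The values below are placeholder
# data, not the actual test scores.
from statistics import mean, median, mode

listening_scores = [19, 22, 18, 15, 27, 18, 11, 20]  # hypothetical scores

print("Mean:  ", round(mean(listening_scores), 2))   # average score
print("Median:", median(listening_scores))           # middle-ranked score
print("Mode:  ", mode(listening_scores))             # most frequent score
```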

Table 1. Descriptive statistics for the two tests

Statistics          Listening   Reading
N                   83          83
Total items (k)     30          30
Mean                19.24       16.18
Mode                18          20
Median              19          16
Low-High            8-29        6-26
Range               21          20
St. Deviation       4.89        4.91
Skewness            -0.04       -0.10
Kurtosis            -0.64       -0.69

Dispersion

In addition, the measures of dispersion indicate how spread out the scores are (Carr, 2011, p. 229). The standard deviation is the best measure of dispersion; it equals the average difference between individual scores and the mean (Carr, 2011, p. 229). The larger the standard deviation is, the more the test scores are spread out. Since the standard deviation for both tests (4.89 and 4.91, respectively) is relatively large compared to the total score (30), this indicates that the scores on both tests are widely spread. Furthermore, the range of the scores on both tests is very large too (21 and 20, respectively), which also implies a wide spread of the scores. The purpose of these two placement tests is to place the students on different levels according to their scores. In my opinion, this wide spread of the test scores suggests that the tests served their purpose well.
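A similar sketch works for the dispersion measures, again on placeholder data; the report does not say whether the population or sample standard deviation was used, so the population version is shown here as an assumption.

```python
# Sketch of the dispersion measures on placeholder data.
from statistics import pstdev  # population SD; stdev() would give the sample SD

scores = [19, 22, 18, 15, 27, 18, 11, 20]  # hypothetical scores

score_range = max(scores) - min(scores)    # highest minus lowest score
std_dev = pstdev(scores)                   # average distance of scores from the mean

print("Range:", score_range)
print("Standard deviation:", round(std_dev, 2))
```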

Distribution

Furthermore, in order to better observe the distribution of the tests, I created a histogram for each test. As we can see in Table 1, the skewness and kurtosis values of both tests fall within -2 to +2, which means that the score distributions are somewhat skewed and kurtotic but still reasonably normal (Bachman, 2004, p. 74). Additionally, as defined in Carr (2011), an outlier is a score more than three standard deviations above or below the mean. Based on my calculation, there are no outliers on either test. Accordingly, we can also see the bell-curve-shaped distributions for both tests in Figure 1 and Figure 2.
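The skewness, kurtosis, and outlier checks could be reproduced along the following lines; scipy's skew() and kurtosis() are one common implementation, though the exact formulas behind Table 1 are not stated in the report, and the data below are placeholders.

```python
# Sketch of the normality and outlier checks on placeholder data.
import numpy as np
from scipy.stats import skew, kurtosis

scores = np.array([19, 22, 18, 15, 27, 18, 11, 20])  # hypothetical scores

print("Skewness:", round(float(skew(scores)), 2))
print("Kurtosis:", round(float(kurtosis(scores)), 2))  # excess kurtosis; 0 for a normal curve

# Outlier check: flag any score more than 3 SD above or below the mean (Carr, 2011)
mean, sd = scores.mean(), scores.std()
outliers = scores[np.abs(scores - mean) > 3 * sd]
print("Outliers:", outliers)
```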

Figure 1. Histogram of the Listening Test Scores (frequency of each score)

Figure 2. Histogram of the Reading Test Scores (frequency of each score)

Correlation

In order to better represent the relationship between the two tests, I created a scatterplot with a trendline (see Figure 3). The tests have a relatively strong positive correlation for the following reasons: 1) The points are clustered around the trendline, forming a linear pattern. The trendline also slopes upward towards the higher scores, which means that the higher a student scores on the listening test, the higher s/he tends to score on the reading test. 2) The correlation coefficient is .79, which is close to the perfect positive correlation of 1. Since the two tests are approximately normally distributed, and having in mind that test scores are by nature interval variables (for a definition of an interval variable, see Carr, 2011, p. 261), I used Pearson's r to calculate the correlation coefficient.
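As a sketch of how the coefficient could be obtained, scipy's pearsonr() is one standard implementation; the paired score lists below are placeholders, not the actual data.

```python
# Sketch of the Pearson correlation on placeholder paired scores.
from scipy.stats import pearsonr

listening = [19, 22, 18, 15, 27, 18, 11, 20]  # hypothetical paired scores
reading   = [16, 20, 15, 12, 24, 17,  9, 18]

r, p_value = pearsonr(listening, reading)
print("Pearson r:", round(r, 2))
print("Coefficient of determination (r squared):", round(r ** 2, 2))
```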
Figure 3. Scatterplot of the Reading Test Scores against the Listening Test Scores, showing a relatively strong positive correlation between the two tests.

Reliability

Reliability means the consistency of scoring (Carr, 2011, p. 107). There are several approaches to estimating reliability: parallel forms, test-retest, and internal consistency reliability. In this case, I can only calculate the internal consistency reliability, because the other two approaches require a test to be administered twice (Carr, 2011, p. 109). For the internal consistency reliability, because item-level information is not available, I chose the KR-21 method. The reliability coefficients for the listening and reading test are .74 and .71, respectively. This means that the listening test scores are 74% reliable, with 26% measurement error; likewise, for the reading test, 71% of each person's score comes from his or her true score, and 29% is measurement error. Carr (2011) suggested that a reliability lower than .8 but higher than .7 is low but still acceptable, especially for a low-stakes test. Moreover, we can be fairly sure that the reliability is not lower than these values, because KR-21 is the most conservative estimate and tends to underestimate reliability. Nevertheless, both tests have relatively low reliability.
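The KR-21 values can be reproduced from the summary statistics in Table 1 alone, which is exactly why the method works without item-level data; the following sketch applies the standard KR-21 formula to the reported means and standard deviations.

```python
# Sketch of the KR-21 reliability estimate from the Table 1 summary statistics.
# KR-21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd**2))

def kr21(k, mean, sd):
    """KR-21 internal consistency estimate from test length, mean, and SD."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))

print("Listening KR-21:", round(kr21(30, 19.24, 4.89), 2))  # approx. 0.74
print("Reading KR-21:  ", round(kr21(30, 16.18, 4.91), 2))  # approx. 0.71
```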

CONCLUSION

On the basis of my data analysis, it is reasonable to conclude that the tests measure the same thing to a large extent. According to Figure 3 and the correlation coefficient I obtained, the statistics strongly support a positive correlation between the tests. Besides that, since I used Pearson's r to calculate the correlation coefficient, I was able to square the correlation to calculate the coefficient of determination. The result is .63, meaning that the two tests are measuring the same thing to the extent of 63%. Conversely, 37% of each test is measuring something unique.

From my perspective, these tests are high-stakes tests, since the decisions made on the basis of placement tests are crucial to students. For high-stakes tests, the reliability should be higher than .8 (Carr, 2011, p. 121). Unfortunately, based on the resulting values (.74 and .71), the internal consistency reliability is not sufficient for either of the tests. However, there are some methods to improve the reliability of the tests. First of all, we can replace some problematic items, such as the ones that are not measuring what they claim to measure, and the ones that are too difficult and far beyond the students' proficiency levels. Secondly, as Carr (2011) mentioned, the longer the test, the higher the reliability can be. If possible, we can add more items to the tests, as the sketch below illustrates.
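The report does not name a formula for estimating the effect of lengthening a test, but the Spearman-Brown prophecy formula is a standard tool for this, so the following sketch is offered only as an illustration under that assumption, using the KR-21 values above.

```python
# Spearman-Brown prophecy formula (not mentioned in the report; shown as an
# illustration of how much lengthening might be needed to reach .80 reliability).

def lengthening_factor(current_rel, target_rel):
    """Factor by which the number of items must grow to reach target_rel."""
    return (target_rel * (1 - current_rel)) / (current_rel * (1 - target_rel))

for name, rel in [("Listening", 0.74), ("Reading", 0.71)]:
    factor = lengthening_factor(rel, 0.80)
    print(f"{name}: lengthen by a factor of {factor:.2f} "
          f"(roughly {round(30 * factor)} items) to reach .80")
```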

Based on my analysis, I can safely conclude that the tests are very useful: they are placement tests, and they spread students widely across the score scale. The tests are very practical as well. They are short and contain multiple-choice questions, so they do not require much time or personnel to administer and score. In addition, they are also valid. From my analysis, we can see that the tests are measuring something in common to the extent of 63%, and the correlation between the tests is .79, which indicates a relatively strong positive correlation. Thus, the criterion-related validity is relatively high. Despite all the good qualities above, the tests have relatively low reliability. However, it can be improved by replacing problematic items and adding new items to the tests. Overall, the tests are efficient and good to use.

References

Bachman, L. F. (2004). Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press.

Carr, N. T. (2011). Designing and Analyzing Language Tests. Oxford: Oxford University Press.
