
Validity and Reliability

by Dirk Cornelis Lourens

Testing and Evaluation in TEFL

Validity and Reliability

When discussing English language tests, there are factors which are considered clear-cut and which are agreed upon by the majority of stakeholders, researchers, and authors. Carter and Nunan (2007) state that there are two key requirements for the assessment of English language learners: validity and reliability. The former means that a test assesses the abilities it claims to assess, and the latter that it assesses consistently. Although this is a rather rudimentary definition, it serves the purpose of this general introduction. Hughes (2003) likewise states that any test should be consistent and should accurately measure what it claims to measure; in other words, the author stresses the key elements of validity and reliability. The author also points out that tests should have a beneficial effect on teaching and should be cost effective. In a prescribed article for this course, Thaine (2004) mainly discusses issues of reliability in the assessment of teacher trainees, but also calls for more reliability and validity in general. Having thus ascertained the relevance of reliability and validity to English language tests, the writer will now discuss them in more detail.

Reliability

According to Hughes (2003), reliability means that tests should measure consistently. If a person is given the same test on different occasions, his or her abilities have not changed significantly, and the circumstances remain the same, the test is said to be reliable if the person's scores are very similar. The more similar the scores, the more reliable the test. As a secondary source, the website http://fcit.usf.edu/assessment/basic/basicc.html draws an interesting analogy: imagine weighing the same potatoes on a kitchen scale at different times. If the scale consistently reads five pounds, the scale is reliable.

Reliability also means that test items should be consistent: if a student correctly answers one item, he or she should also be able to answer similar items correctly. The reliability coefficient is a way of comparing the reliability of different tests, according to Hughes (2003). It is represented by a number ranging from 0 to 1, with 1 being the ideal reliability coefficient; the latter would mean that a test written on different days by the same candidates gives exactly the same results. How does one arrive at the reliability coefficient? Two sets of test scores can be compared. These scores could be obtained by the test-retest method, in which the same candidates write the same test twice. Another way is the split-half method, in which the same test is split into two halves, producing two sets of scores. As a secondary source, the website http://fcit.usf.edu mentions that, for standardized tests, a coefficient of .80 generally means that the test has very good reliability, while .50 or below means that the test is not very reliable.
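To make the test-retest and split-half procedures concrete, the writer offers a minimal sketch in Python. The scores and the helper function are invented for illustration and are not drawn from the cited sources; the sketch correlates two sittings of the same test, and correlates the two halves of a single test with the Spearman-Brown correction applied, since each half is only half the length of the full test.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical scores for the same ten candidates on two occasions.
first_sitting  = [62, 75, 80, 55, 90, 68, 72, 85, 58, 77]
second_sitting = [60, 78, 79, 57, 88, 70, 71, 83, 61, 75]

# Test-retest reliability: correlate the two sittings directly.
test_retest = pearson_r(first_sitting, second_sitting)

# Split-half reliability: correlate the two halves of one test, then
# apply the Spearman-Brown correction, since each half is shorter
# (and therefore less reliable) than the full test.
odd_items  = [30, 38, 41, 27, 45, 33, 36, 42, 29, 39]
even_items = [32, 37, 39, 28, 45, 35, 36, 43, 29, 38]
half_r = pearson_r(odd_items, even_items)
split_half = (2 * half_r) / (1 + half_r)

# Both figures for these invented scores come out near 1,
# i.e. well above the .80 benchmark mentioned above.
print(f"test-retest r = {test_retest:.2f}")
print(f"split-half r (Spearman-Brown) = {split_half:.2f}")
```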

When choosing an English test, why is it important to know the reliability coefficient? In the writer's opinion, candidates write English tests for a variety of reasons. If, for example, they write an English proficiency test in order to apply for a job or for studies at a university, the stakes are high. If they choose a test with a low reliability coefficient, it might not reflect their real proficiency, and their application may fail as a result. It is therefore important for candidates to do some research and find a test that suits their needs and has very good reliability. Candidates' futures could depend on it.

Hughes (2003) states that the reliability coefficient can also be used to determine the standard error of measurement (SEM). The SEM indicates the difference between what a candidate scored and what he or she might have scored, the true score, if the same test had been written on different occasions, and there are formulas for calculating it. The relevance of the SEM comes to the fore when decisions have to be made about candidates' scores that are just over or under a set cut-point representing a pass or fail mark. A practical example is mentioned by Brown (1999): imagine, for argument's sake, that the same test could be taken by the same candidates infinitely many times. The average of each candidate's scores across all those sittings would be the best estimate of his or her true ability, or true score.

Another important aspect of reliability is scorer reliability. It is important to distinguish between objective and subjective scoring of English language tests (Hughes, 2003). The scoring of multiple choice questions involves a set of definite answers, and scorers would have the same scorer reliability coefficients; only negligence would cause a deviation from that score. When answers involve judgment on the part of scorers, for example in compositional writing or speaking tests, the scorer reliability coefficient will not be consistent or perfect.

How, then, can English language tests and their scoring be made more reliable? Hughes (2003) gives a list of suggestions for improving and achieving reliability in scoring and performance. The author states that enough samples of behavior should be taken, that items which do not discriminate well between weaker and stronger students should be excluded, and that candidates should not be allowed too much freedom of choice. It is also important to write unambiguous test items, to provide instructions that are clear and explicit, and to ensure well laid out tests that are perfectly legible. Candidates should be familiarized with the test format and testing techniques, and the administration of the test should be uniform and non-distracting. Objective scoring should be a priority, and a detailed scoring key is essential. It is also important that scorers receive ample training, that candidates are identified only by number, with no names or pictures, and that tests are scored by at least two independent scorers.
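Returning to the standard error of measurement: the formula is commonly given as SEM = SD × √(1 − r), where SD is the standard deviation of the test scores and r is the reliability coefficient. The short sketch below, with invented numbers, shows how the SEM might flag a borderline result at a cut-point, given that an observed score falls within roughly two SEMs of the true score about 95% of the time.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test statistics.
sd, reliability = 10.0, 0.91
sem = standard_error_of_measurement(sd, reliability)  # 3.0 here

# A candidate's true score probably lies within about +/- 2 SEM
# of the observed score, so a fail just under the cut-point
# deserves a second look.
cut_point, observed = 60, 58
low, high = observed - 2 * sem, observed + 2 * sem
if low <= cut_point <= high:
    print(f"Observed score {observed} is within {2 * sem:.1f} points "
          f"of the cut-point {cut_point}: treat this result as borderline.")
```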

The standard error of measurement for the TOEFL proficiency examination differs across the reading, listening, speaking, and writing sections; the total is given as 5.64 in a research paper published in 2011 (http://www.ets.org/s/toefl/pdf/toefl_ibt_research_s1v3.pdf). The standard errors of measurement for the IELTS listening and academic reading sections in 2010 were 0.389 and 0.382 respectively. The website http://www.ielts.org states that the writing and speaking modules cannot be interpreted in the same manner as listening and reading. The writer will now discuss the second key requirement for the assessment of English language learners.

Validity

As stated earlier in this discussion, Carter and Nunan (2007) point out that validity means that tests should assess only the abilities which they claim to assess. The authors also draw a distinction between three types of validity: construct, content, and criterion-related validity. Hughes (2003) states that empirical evidence is needed in order to claim that a test has validity, and content validity is one element that can provide such evidence. Carter and Nunan (2007) define content validity as whether a test represents an adequate sample of ability. Hughes (2003) states that it is not enough for a grammar test simply to test grammar; the test must include a proper sample of the relevant structures in order to be considered as having content validity. To ensure that this is reflected in the test, detailed specifications must be written during the planning stage, which can later be checked against the actual test content. The advantages of these procedures are that the test will measure more accurately and that harmful backwash can be avoided. They will also help to prevent the test content from being determined simply by what is easiest to test. The secondary source http://fcit.usf.edu/assessment/basic/basicc.html used above gives a very practical example: a semester test that includes only the work covered in the last six weeks. If a test covers only part of a course's overall objectives or specifications, it will have very low content validity.
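Checking specifications against actual test content, as described above, can be done quite mechanically. The fragment below is a simple illustration with an invented specification and invented item labels; it merely reports any specified structure that no item on the drafted test covers.

```python
# Hypothetical specification: structures the grammar test is meant to sample.
specification = {"past simple", "present perfect", "passive voice",
                 "conditionals", "reported speech"}

# What each item on the drafted test actually targets (invented labels).
test_items = {1: "past simple", 2: "past simple", 3: "present perfect",
              4: "passive voice", 5: "passive voice"}

# Any specification point with no matching item signals a gap
# in content validity.
missing = specification - set(test_items.values())
if missing:
    print("Specification points with no test item:", sorted(missing))
```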

Criterion-related validity refers to the extent to which test scores agree with an external criterion, such as the results of independent assessments. Another website, http://www.llas.ac.uk/resources/gpg/1398, gives the example of students' test scores being correlated with other measures such as teachers' rankings of the students, or scores on a similar test. This is referred to as concurrent validity by Hughes (2003). The author states that there are two kinds of criterion-related validity, the other being predictive validity. The latter means, for example, that if candidates' level of English is measured in order to assess whether they are able to teach students at a certain level, the test could be validated by means of classroom observation of candidates who have already passed the test and are teaching. Predictive validity thus assesses how candidates will perform or fare in the future.
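A concurrent validation of the kind just described often uses a rank correlation, since teachers' rankings are ordinal. The following sketch, with invented rankings, computes Spearman's rho between students' ranks on a new test and a teacher's independent ranking of the same students; a high positive coefficient would support concurrent validity.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation (assumes no tied ranks, for simplicity)."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Hypothetical data: each student's rank on the new test versus the
# teacher's independent ranking (1 = strongest student).
test_rank    = [1, 2, 3, 4, 5, 6, 7, 8]
teacher_rank = [2, 1, 3, 5, 4, 6, 8, 7]

print(f"rho = {spearman_rho(test_rank, teacher_rank):.2f}")  # 0.93 here
```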

Construct validity is seen by some as covering all aspects of validity itself. One form of construct validity concerns the underlying ability embodied in a theory of language ability (Hughes, 2003). The author uses the example of being able to guess the meaning of words from the context of a given text: research must confirm that such an ability is indeed being tested before the test can be said to have construct validity. The website http://fcit.usf.edu/ describes this as the extent to which the assessment corresponds to the predictions made by a theory. Research using both quantitative and qualitative methods can be used to establish evidence for construct validity. Another way of gathering such evidence is to examine what candidates actually do when they take a test (Hughes, 2003). Two methods are used for this: the think-aloud method and the retrospection method. In the former, the test takers say what they are doing as they do it; in the latter, the test takers recollect why they did something after they have done it. Their thoughts are collected by means of tape recorders or questionnaires.

In addition to the above forms, there is also validity in scoring, which basically means that the system for scoring must itself be valid. Hughes (2003) includes a practical example to illustrate this point: if a test claims to measure one skill but awards scores for additional skills, the measurement becomes less valid and accurate.

Face validity is another important part of testing. Hughes (2003) points out that it means a test looks as if it is measuring what it is supposed to measure. It also relates to whether parents, students, teachers, or employers accept the test and consider it appropriate; non-professionals will either accept or reject tests based on their face value.

How, then, can tests be made more valid if they are lacking in validity according to what has been discussed above? Hughes (2003) distinguishes between high stakes tests and teacher-made tests. For the former, it is imperative to carry out a full and extensive validation exercise before the test is used formally. For the latter, the author's recommendations include writing explicit specifications, using direct testing, attending to scoring validity, and doing everything possible to ensure reliability.

Conclusion

The above discussion makes it very clear that both reliability and validity are key factors in English language testing, although other factors such as backwash and cost effectiveness are also important. Furthermore, it is evident that a reliable test does not ensure validity, whereas a test which is valid will in most cases ensure reliability. Care must therefore be taken not to overemphasize reliability to the detriment of validity. Both are important, and even more so when candidates depend on tests to give them better opportunities for further studies and employment.

Reliability, in short, is concerned with consistency of assessment, while validity is concerned with accuracy of assessment.

References

Brown, J. D. (1999). Standard error vs. standard error of measurement. Shiken: JALT Testing and Evaluation SIG Newsletter, 3(1), 20-25. Retrieved from http://jalt.org/test/bro_4.htm

Carter, R., & Nunan, D. (2007). The Cambridge guide to TESOL. Cambridge: Cambridge University Press.

Classroom assessment basics. (2012). Florida Center for Instructional Technology. Retrieved May 22, 2012, from http://fcit.usf.edu/assessment/basic/basicc.html

Hughes, A. (2003). Testing for language teachers (2nd ed.). New York, NY: Cambridge University Press.

IELTS test. (2012). International English Language Testing System. Retrieved May 22, 2012, from http://www.ielts.org/

Principles of assessment. (2012). LLAS Centre for Languages. Retrieved May 22, 2012, from http://www.llas.ac.uk/resources/gpg/1398

Thaine, C. (2004). The assessment of second language teaching. ELT Journal, 58(4), 336-345. Retrieved from http://www.nova.edu/library/

TOEFL test. (2012). Educational Testing Service. Retrieved May 22, 2012, from http://www.ets.org/toefl
