
Presented to: Dr. Muhammad Saeed
Presented by: Shumaila Hameed (MP/2012-07), Semester 1, M.Phil, University of Punjab


1. Overall Plan
2. Content Definition
3. Test Specification
4. Item Development
5. Test Design & Assembly
6. Test Production
7. Test Administration
8. Scoring Test Response
9. Establishing Passing Scores
10. Reporting Test Result
11. Item Banking
12. Test Technical Report
(Downing & Haladyna, 2006)

Overall plan: systematic guidelines for all test development activities.
- What is the purpose of the test?
- What construct is to be measured?
- What score interpretations are required?
- What test format will be used?
- What mode of test will be used?
- What type of test interpretation will be used?
- What style of reporting will be used?


Content definition:
- What content is to be tested?
- The content-defining method depends upon the purpose of the test.
- Methods range from simple to multilayered, as for credentialing tests and certification exams.

(Methods and procedures may be developed and used to minimize bias and increase the objectivity of the judgments, but in the end professional judgments by content experts shape the content domain and its definition. Such inherent subjectivity can lead to controversy and disagreement, based on politics rather than content.)

Test specifications guide detailed test development activities. They cover:
1. The type of testing format to be used
2. The total number of test items
3. The type or format of test items
4. The cognitive classification system to be used
5. Whether or not the test items or performance prompts will contain visual stimuli
6. The expected item scoring rules

Item development:
1. The creation of effective test questions
2. Designed to measure important content
3. At an appropriate cognitive level



Creating effective test items may be more art than science, although there is a solid scientific basis for many of the well-established principles of item writing (Haladyna, Downing, & Rodriguez, 2002). However, there are some general guidelines for developing each type of item.



True-false items:
- Statements should be relatively short and simple.
- True statements should be about the same length as false statements.
- The use of "can" in a true-false statement should be avoided.
- Ambiguous or vague terms, such as "large," "long time," "regularly," "some," and "usually," should be avoided in the interest of clarity.
- Be sure to include directions that tell students how and where to mark their responses.


Matching items:
1. Use homogeneous material in each list of a matching exercise.
2. Include directions that clearly state the basis for the matching.
3. Put the problems or the stems in a numbered column at the left and the response choices in a lettered column at the right.
4. Always include more responses than questions.
5. Correct answers should not be obvious to those who do not know the content being taught.
6. There should not be keywords appearing in both a premise and a response.
7. Make sure that there is only one correct choice for each stem or numbered question.
8. All of the responses and premises for a matching item should appear on the same page.





Multiple-choice items:
- The stem should be meaningful and present a definite problem.
- The stem should be stated in a positive manner; if that is not possible, the negative words should be highlighted in some way.
- The alternatives should fit the stem grammatically (for example, match the use of a/an in the stem).
- The alternatives should be presented in a logical order.


Short-answer items:
1. Express the item in such a way that only a single, brief answer is possible.
2. The question should be stated in direct form.
3. Avoid using negative words in the statement.
4. Express the item in clear, simple language.
5. Try to avoid unwittingly providing clues to the required answer.
6. Where a numerical answer has to be supplied, indicate the units in which you want it to be expressed.



Essay items:
1. Avoid using essay questions for intended learning outcomes that are better assessed with other kinds of assessment.
2. The question should be written in a clear form.
3. Indicate the approximate time and length expected for the answer.
4. Students should not be given a choice among essay questions.



Test design and assembly: assembling a collection of test items into a test or test form.
- Single test (paper-pencil): assembled manually by a skilled person with the help of computer software
- Multiple forms of a test (paper-pencil): assembled manually by a skilled person with the help of advanced computer software
- Computer-based test: assembled by advanced computer software (automatic test assembly software)


Test production: printing and publication. Main concerns: SECURITY ISSUES and PRINTING MISTAKES.


Test administration:
- Mode of administration (for example, computer based)
- A major source of threats to validity
- More issues arise with high-stakes tests


Test scoring is the process of applying a scoring key to examinee responses to the test stimuli. Examinee responses to test item stimuli or performance prompts provide the opportunity to create measurement. The responses are not the measurement; rather, an application of some scoring rules, algorithms, or rubrics to the responses results in measurement (Thissen & Wainer, 2001).



The method of scoring depends upon the nature of the item:
1. Objective scoring (when there is a single correct answer)
2. Subjective scoring (constructed-response items with more than one right answer)

Item analysis is done after scoring.
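As a minimal sketch of objective scoring, assuming a simple right/wrong (1/0) scoring rule, applying a scoring key to one examinee's responses might look like the following. The answer key, the responses, and the name `score_responses` are hypothetical and only serve as illustration.

```python
def score_responses(key, responses):
    """Apply a scoring key: 1 if the response matches the key, otherwise 0."""
    if len(key) != len(responses):
        raise ValueError("key and responses must have the same length")
    return [1 if given == correct else 0
            for given, correct in zip(responses, key)]


# Hypothetical four-item answer key and one examinee's responses.
# None stands for an omitted item and is scored 0 under this rule.
key = ["B", "D", "A", "C"]
responses = ["B", "A", "A", None]

item_scores = score_responses(key, responses)
total_score = sum(item_scores)
print(item_scores, total_score)
```

The raw responses ("B", "A", ...) only become a measurement once the rule is applied; a rubric for subjective scoring would replace the equality check with a judged score.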



A passing score or cut point is related to the level of skill and knowledge that a student has to acquire in order to pass the test. Not all types of tests require establishing passing scores.
TWO WAYS FOR ESTABLISHING THE PASSING SCORE
1. Relative standard-setting methods
2. Absolute standard-setting methods
All methods of establishing passing scores require human judgment and are therefore somewhat arbitrary (Norcini & Shea, 1997).


Examinees have a right to an accurate, timely, meaningful, and useful report of their test performance.
- The reported metric should be clearly defined.
- Score reports must be written in language that is understandable to recipients.
- A clear rationale for the type of score report and the reported score scale is essential.
- The score report summarizes, in many important ways, the entire test development program, especially for the examinee and any other legitimate users of test scores (Downing & Haladyna, 2006).


Item banking means preserving an item for future use. After effective questions are written, edited, reviewed, pretested, and administered to examinees, the items with the most effective item characteristics and the best content should always be preserved for potential reuse on another version or form of the test. It is far too difficult and costly to create effective test items that are used only once.


Test technical report:
- The process of test development is systematically documented and summarized in a technical report.
- Examination of the technical report is also useful in independent evaluations of testing programs.
- The report contains recommendations.
- Writing it is a time-consuming process and is often ignored.


The whole test is not analyzed at once; rather, the test is analyzed in terms of its individual items. Test analysis has two parts:
1. Analysis of the items which make up the test. Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole.
2. Statistics summarizing the performance of the test as a whole.


The average of students' responses to an item is called the item mean.
Item mean = (total correct responses on the item) / (total number of students in the class)
The standard deviation (S.D.) is a measure of the dispersion of student scores on the item for which the mean score is calculated; it indicates the spread of the responses. The item standard deviation is meaningful for essay-type items and when scale scoring is used. For this reason, the S.D. is not typically used to evaluate classroom tests.
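A short sketch of the item mean and item standard deviation, using Python's standard `statistics` module. The right/wrong responses are those of one ten-student item; the essay scores (out of 5) are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form is intended.

```python
from statistics import mean, pstdev


def item_mean_and_sd(scores):
    """Return the item mean and the population standard deviation of the scores."""
    return mean(scores), pstdev(scores)


# Right/wrong item answered by 10 students (1 = correct, 0 = wrong).
dichotomous_item = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

# Hypothetical essay item scored on a 0-5 scale for the same 10 students.
essay_item = [2, 3, 5, 4, 1, 3, 4, 2, 3, 3]

dich_mean, dich_sd = item_mean_and_sd(dichotomous_item)
essay_mean, essay_sd = item_mean_and_sd(essay_item)

print(f"right/wrong item: mean={dich_mean:.2f}, sd={dich_sd:.2f}")
print(f"essay item:       mean={essay_mean:.2f}, sd={essay_sd:.2f}")
```

For the right/wrong item the mean is just the share of correct answers, which is why the standard deviation adds little there; for the scale-scored essay item it describes how widely the scores are spread.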

Item Difficulty: Item difficulty indicates how difficult an item was for the examinees. For items which have one correct answer, the item difficulty is simply the proportion of students who answer the item correctly, so a higher value means an easier item.
Item difficulty = (total correct responses on the item) / (total number of students who attempted the item)
In this case, it is also equal to the item mean. When there is more than one correct alternative for one question, the item difficulty is the average score on that item divided by the highest number of points for any one alternative.

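Example: The results of 10 examinees on 8 items are given, and the difficulty of each item is calculated.

| Item | A | B | C | D | E | F | G | H | I | J | Total correct | Item difficulty |
|------|---|---|---|---|---|---|---|---|---|---|---------------|-----------------|
| Q1   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10            | 1.0             |
| Q2   | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4             | 0.4             |
| Q3   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1             | 0.1             |
| Q4   | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3             | 0.3             |
| Q5   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 9             | 0.9             |
| Q6   | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 6             | 0.6             |
| Q7   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0             | 0.0             |
| Q8   | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 8             | 0.8             |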
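The item difficulty values in the example can be reproduced with a few lines of Python. This is only a sketch: it assumes every examinee attempted every item, so the number of students who attempted the item is simply the number of examinees, and the function name `item_difficulty` is illustrative.

```python
# Responses from the example: one list per item, one entry per examinee A to J
# (1 = correct answer, 0 = wrong answer).
responses = {
    "Q1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Q2": [1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    "Q3": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "Q4": [1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    "Q5": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "Q6": [0, 1, 0, 1, 0, 1, 1, 1, 0, 1],
    "Q7": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Q8": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
}


def item_difficulty(item_scores):
    """Proportion of examinees who answered the item correctly."""
    return sum(item_scores) / len(item_scores)


difficulty = {item: item_difficulty(scores) for item, scores in responses.items()}

for item, p in difficulty.items():
    print(f"{item}: total correct = {sum(responses[item])}, difficulty = {p:.1f}")
```

Q1 (difficulty 1.0) was answered correctly by everyone and Q7 (difficulty 0.0) by no one, so neither item separates stronger from weaker examinees in this group.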


Item discrimination refers to the ability of an item to differentiate among students on the basis of how well they know the material being tested.
Discrimination index = (number of passing students who answered the item correctly / total number who passed) - (number of failing students who answered the item correctly / total number who failed)

Item discrimination is rated:
"Good" if the index is above 0.3
"Fair" if it is between 0.1 and 0.3
"Poor" if it is below 0.1


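In this example, 1 represents a correct answer to the item and 0 represents a wrong answer. Total represents the total test score of the examinee.

| Examinee | I | A | C | F | G | H | E | B | D | J |
|----------|---|---|---|---|---|---|---|---|---|---|
| Item     | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Total    | 1 | 3 | 3 | 4 | 4 | 4 | 5 | 6 | 6 | 6 |

Passing score or mastery level = 5. Four examinees (E, B, D, J) reached the passing score and three of them answered the item correctly; six examinees failed and three of them answered the item correctly.
Discrimination index = 3/4 - 3/6 = 0.25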
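The same calculation can be sketched in Python. The function names are illustrative, and two points are assumptions made for the sketch: an examinee whose total equals the passing score counts as having passed (which is what the 3/4 in the example implies), and an index of exactly 0.1 or 0.3 is rated "Fair".

```python
def discrimination_index(item_scores, total_scores, passing_score):
    """Proportion correct among passers minus proportion correct among failers."""
    passed = [i for i, t in zip(item_scores, total_scores) if t >= passing_score]
    failed = [i for i, t in zip(item_scores, total_scores) if t < passing_score]
    if not passed or not failed:
        raise ValueError("both the passing and the failing group must be non-empty")
    return sum(passed) / len(passed) - sum(failed) / len(failed)


def rate_discrimination(index):
    """Rate an index as Good (above 0.3), Fair (0.1 to 0.3) or Poor (below 0.1)."""
    if index > 0.3:
        return "Good"
    if index >= 0.1:
        return "Fair"
    return "Poor"


# Data from the example, in the order I, A, C, F, G, H, E, B, D, J.
item = [0, 0, 0, 1, 1, 1, 0, 1, 1, 1]
total = [1, 3, 3, 4, 4, 4, 5, 6, 6, 6]

d = discrimination_index(item, total, passing_score=5)
print(f"discrimination index = {d:.2f} ({rate_discrimination(d)})")
```

With a passing score of 5 the item comes out as "Fair"; choosing a different cut point changes both groups and therefore the index.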


Differential item functioning (D.I.F.) occurs when students from different groups have different likelihoods of answering an item correctly. In simple words, D.I.F. occurs when two equally able students who belong to different groups have a different probability of answering an item correctly. D.I.F. is the result of a systematic error (Athanasou & Lamprianou, 2002). It is directly related to the test's reliability.

The reliability of a test refers to the extent to which the test is likely to produce consistent scores. High reliability means that students who answered a given question correctly were more likely to answer other questions correctly. If a parallel test were developed by using similar items, the relative scores of students would show little change. Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly.

Reliability depends on three factors:
1. The inter-correlations among the items: the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability.
2. The length of the test: a test with more items will have a higher reliability, all other things being equal.
3. The content of the test: generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.

Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect reliability). In practice, their approximate range is from .50 to .90 for about 95% of classroom tests.
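The text does not say which reliability coefficient is meant. As an illustration only, the sketch below computes the Kuder-Richardson formula 20 (KR-20), one common internal-consistency coefficient for right/wrong items, on the ten-examinee, eight-item data from the item difficulty example. The choice of KR-20 and the use of the population variance of total scores are assumptions, not something stated in the source.

```python
from statistics import pvariance


def kr20(score_matrix):
    """KR-20 for a matrix with one row per examinee and one 1/0 entry per item."""
    n_examinees = len(score_matrix)
    n_items = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    total_variance = pvariance(totals)  # population variance of total scores
    if total_variance == 0:
        raise ValueError("total scores have no variance; KR-20 is undefined")
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in score_matrix) / n_examinees  # item difficulty
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Rows are examinees A to J, columns are items Q1 to Q8.
scores = [
    [1, 1, 0, 1, 1, 0, 0, 0],  # A
    [1, 1, 0, 0, 1, 1, 0, 1],  # B
    [1, 0, 0, 1, 1, 0, 0, 1],  # C
    [1, 0, 0, 0, 1, 1, 0, 1],  # D
    [1, 1, 0, 0, 1, 0, 0, 1],  # E
    [1, 1, 0, 0, 1, 1, 0, 1],  # F
    [1, 0, 0, 0, 1, 1, 0, 1],  # G
    [1, 0, 0, 0, 1, 1, 0, 1],  # H
    [1, 0, 0, 0, 0, 0, 0, 0],  # I
    [1, 0, 1, 1, 1, 1, 0, 1],  # J
]

reliability = kr20(scores)
print(f"KR-20 = {reliability:.2f}")
```

On this small data set the coefficient comes out at about 0.35. Items that everyone answers the same way (Q1 and Q7 here) contribute nothing to the item variances, and a test this short with so few examinees gives only a rough estimate.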

- Item analysis data are not synonymous with item validity.
- Item analyses reflect the internal consistency of items.
- The discrimination index is not always a measure of item quality:
  (a) Extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives.
  (b) An item may show low discrimination if the test measures many different content areas and cognitive skills.
- Item analysis data are tentative.
- Item analysis statistics must always be interpreted in the context of the type of test given and the individuals being tested.