

Chapter 5 Psychometrics: Constructing Tests and Performance Assessments

In a standards-based approach to education and training, informed by Constructivist theory, assessment-informed instruction is the expectation, as is continuous improvement. One of the most widely used tools in assessment and evaluation is the traditional or classic classroom achievement test, whether the classroom is on- or offline. These measures are often fraught with reliability and validity problems because the process for constructing such tests is often not followed or is misunderstood, thereby introducing significant measurement error into the measurement process. Poor measurement frequently leads to inaccurate data-based inferences, which in turn lead to bad decision-making. In this chapter, we will first examine the Teaching/Assessment Cycle, types of knowledge, and the relationship between knowledge and learning. Second, test construction (e.g., test planning, building learning targets, test blueprinting, test item type and item format selection), writing select response items (multiple-choice, true/false, matching, completion, and short-answer) and supply response items (brief and extended response) is discussed. Third, statistical strategies for improving test items are presented, followed by performance assessment. Strategies for preparing examinees for a testing session are presented in Appendix 5.7.
I. Introduction
A. The Teaching/Assessment Cycle
1. Assessment-informed instruction requires the educator (teacher, trainer, planner, instructional designer, or administrator) to plan, deliver, and adjust instruction based on students' or trainees' evolving mastery of learning and skill standards until the desired mastery is achieved.
a. We avoid the phrase "test-driven instruction," which suggests that instruction is based only or primarily on test results.
b. The authors prefer "data-informed instruction," which acknowledges that instruction involves more (e.g., values, preferences, developmental levels, etc.) than just test data.
2. The Teaching/Assessment Cycle is outlined in Figure 5.1.
a. Based on learning standards, teaching is conducted.
b. Once teaching is launched, continuous formative assessment is engaged, as is re-teaching based on assessment results.
c. The assessment/re-teaching cycle is repeated until suitable mastery is demonstrated via summative assessment. Then a new teaching/assessment cycle begins.
d. The teaching/assessment cycle assumes that instruction and assessment are planned and executed in conformance to articulated learning and performance standards.
e. The Teaching/Assessment Cycle is based on the Shewhart or PDCA Cycle (Deming, 1986, p. 88), as seen in Figure 1.2.
3. Next, we will examine the nature of knowledge and its relationship to learning.


A. Definition of Knowledge
1. Alexander (1996, p. 89) writes that knowledge is "a scaffold that supports the construction of all future learning." Greeno, Collins, and Resnick (1996, p. 16) argue that the cognitive view of knowledge emphasizes "understanding of concepts and theories in different subject matter domains [e.g., reading or science] and general cognitive abilities, such as reasoning, planning, solving problems, and comprehending language." This suggests that there exist general divisions of knowledge:
a. Domain-specific knowledge: knowledge required to complete a specific task (e.g., using the telephone) or subject (e.g., the history of the Spanish-American War).
b. General knowledge: knowledge that may be applied across differing situations (e.g., problem-solving skills).

Figure 5.1 Teaching/Assessment Cycle. The cycle runs: Teaching (instruction conducted); Formative Assessment (continuously performed); Re-teaching (the re-teaching/assessment cycle repeats until suitable mastery); Summative Assessment (suitable mastery demonstrated); New Learning (a new teaching/assessment cycle begins).

2. Knowledge Classifications
a. Knowledge can also be broadly categorized according to use as declarative, procedural, or conditional (Paris & Cunningham, 1996; Paris, Lipson, & Wixson, 1983).
(1) Farnham-Diggory (1994, p. 468) defined declarative knowledge as "knowledge that can be declared, usually in words, through lectures, books, writing, verbal exchange, Braille, sign language, mathematical notation, and so on." Declarative knowledge can be simple facts, generalities, rules, personal preferences, etc.


(2) Woolfolk (2001, p. 242) defines procedural knowledge as "knowing how to do something," such as divide fractions or clean a carburetor. Procedural knowledge must be demonstrated. Other examples of procedural knowledge include translating languages, classifying shapes, reading, or writing. Bloom's (1956) and Gagné's (1985) intellectual skills, i.e., the levels beyond knowledge, are procedural knowledge.
(3) Woolfolk (2001, p. 243) defines conditional knowledge as "knowing when and why to apply declarative and procedural knowledge." Conditional knowledge involves judgment. Examples of conditional knowledge include knowing how to solve various math problems, when to skim or read for detail, when to change strategies when confronted with a perplexing problem, etc.
3. When designing instruction (e.g., writing learning targets/outcomes, selecting teaching strategies, etc.) and/or assessment instruments (e.g., tests and direct performance assessments), we must ensure we design to teach or to measure the correct form of knowledge.
B. Learning
1. Kimble (1961, p. 6) defined learning as "a relatively permanent change in behavioral potentiality that occurs as a result of reinforced practice" [based on knowledge]. Hergenhahn and Olson (1997, p. 2) have pointed out that
a. Learning must be exhibited in behavior.
b. The change in behavior is fairly permanent.
c. An immediate change in behavior need not occur, but the behavior potential must be resident.
d. Learning is a consequence of experience (e.g., life, schooling, training, practice, observation, etc.).
e. Only reinforced (either positively or negatively) experience, practice, etc. is learned. Reward is only one type of reinforcement.
2. Hergenhahn and Olson (1997, pp. 3-6) raise important questions.
a. Must learning give rise to changes in behavior?
(1) B. F. Skinner held that the change in behavior is itself learning and that behavior is not merely a basis for inferring learning.
(2) Most learning researchers and theorists argue that it is from observed behavior or measures of behavior potentiality that we infer whether or not learning has occurred, and that learning is an intervening variable: experience + learning = behavior change.
(3) So the answer to the above question is that almost certainly learning must result in measurable or observable behavior change.
b. How permanent is "fairly permanent"?
(1) Illness, drugs, maturation, tiredness, etc. will change or modify behavior. Behavior changes caused by these are not due to learning.


(2) Learning takes some time and is retained either until forgotten or until replaced by new learning.
c. How is learning related to performance?
(1) While learning is usually inferred from behavior, learning may not be exhibited immediately; in these instances, learning can and does affect behavior potentiality. It is behavior potentiality that is measured.
(2) Performance is the translation of that potentiality into behavior.
d. Why does learning require experience or practice?
(1) Reflexive behavior is a function of genetics and is not learned, e.g., the fight or flight response, moving away from fire, coughing, etc. Reflexive behavior is usually simple behavior.
(2) Complex behavior, or instinct, has historically been attributed to genetics. However, research on imprinting, e.g., ducklings attaching themselves to humans or objects as their mother, suggests that some complex behavior may be due to learning.
(3) Thus, any change in behavior attributed to learning must be fairly permanent and due to experience.
3. In light of these questions, Hergenhahn and Olson (1997, pp. 5-6) offer a modified definition of learning: "learning is a relatively permanent change in behavior or in behavioral potentiality that results from experience and cannot be attributed to temporary body states such as those induced by illness, fatigue, or drugs."
a. What is learned is called knowledge.
b. Knowledge is then classified into intellectual skills, which form the foundation for specific thinking skills. For an examination of the most utilized Taxonomy of Intellectual Skills, see Appendix 5.1.
4. Next, we turn our attention to planning the achievement test.
II. Planning the Achievement Test
A. General Planning Considerations
1. When planning the assessment, consider examinees' age, stage of development, ability level, culture, etc. These factors will influence construction of learning targets or outcomes, the types of item formats selected, how items are actually written, and test length.
2. For classroom achievement tests, content reliability and validity are essential.
a. Critical content (i.e., performance objectives or standards) must be given appropriate weight in the table of specifications or test blueprint.
b. There must be sufficient items to properly support the content weighting scheme.
c. Each test should have internal consistency reliability, especially if it is to be administered once to an examinee. Appropriate approaches include the split-half method, the Kuder-Richardson formulas (KR-20 and KR-21), and Cronbach's alpha. Target reliability coefficients should be 0.80 or above.
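As a rough illustration of the internal consistency check recommended above, the Python sketch below computes KR-20 (for dichotomously scored items) and Cronbach's alpha from a small, hypothetical response matrix; the data and the 0.80 comparison are illustrative assumptions, not values from the text.

    # Illustrative sketch: internal consistency reliability for a classroom test.
    # The 0/1 response matrix is hypothetical; the 0.80 target follows the
    # guideline stated above.

    def variance(xs):
        # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def cronbach_alpha(scores):
        # scores: one row per examinee, one column per item
        k = len(scores[0])
        item_vars = sum(variance([row[i] for row in scores]) for i in range(k))
        total_var = variance([sum(row) for row in scores])
        return (k / (k - 1)) * (1 - item_vars / total_var)

    def kr20(scores):
        # KR-20 for 0/1 items; algebraically a special case of Cronbach's alpha
        k = len(scores[0])
        n = len(scores)
        p = [sum(row[i] for row in scores) / n for i in range(k)]  # item difficulties
        total_var = variance([sum(row) for row in scores])
        return (k / (k - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / total_var)

    # Hypothetical 0/1 responses: 6 examinees x 5 items.
    responses = [
        [1, 1, 1, 0, 1],
        [1, 0, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 0],
    ]
    alpha = cronbach_alpha(responses)
    print(f"KR-20 = {kr20(responses):.2f}, alpha = {alpha:.2f}, "
          f"meets 0.80 target: {alpha >= 0.80}")

For a single administration, either coefficient may be reported; low values usually point back to ambiguous or off-target items.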


B. The Test Construction Process
1. First, it is necessary that the content (intellectual skills), psychomotor skills, and/or attitudes to be assessed are fully and clearly identified. This is usually done through learning outcomes, or standards. A testing strategy is selected.
2. Second, the test item formats are considered and selected. Item specifications are written. The degree of specificity is dependent upon the use of the test or performance assessment. Items intended for a state-, provincial-, or nationwide degree-sanctioning test will have more elaborate specifications than will a common departmental or classroom examination.
3. Third, a table of specifications (or test blueprint) is developed which integrates the content (intellectual skills), psychomotor skills, and/or attitudes to be assessed and the selected item formats, including the number of items per content or skill domain.
4. Fourth, once steps one through three are met, they should be reviewed by knowledgeable colleagues to assess accuracy, clarity, continuity, fit, and bias. This is an iterative process which should be repeated until there is general agreement.
5. Fifth, once a consensus has been reached, items are crafted, compared to item specifications, and revised until there is agreement that each item meets its item specification, follows item writing guidelines, and fits the test blueprint. Items are typically pilot tested and evaluated statistically. The statistical evaluation provides guidance for revision of poorly functioning items.
C. Selecting a Testing Strategy
1. Once the intellectual and/or thinking skills (see Appendix 5.1) to be assessed have been identified, a testing strategy is selected.
2. A test is any device (written, observational, or oral) utilized to gather data for assessment and evaluation purposes. The chief assessment device in education and training is the test. Lyman (1998, pp. 21-26) offers a classification taxonomy of test types:
a. Maximum performance tests (MPT): with these, we assume that all examinees will perform their best, as all examinees are equally, highly motivated. Examinee performance is influenced by the effects of heredity, schooling or training, and his or her personal environment.
(1) Intelligence tests: These are tests of general aptitude. IQ (Intelligence Quotient) scores will most likely vary according to the intelligence test taken, as the tests do tend to differ. IQ is influenced by previous academic achievement.
(2) Aptitude (or ability) tests: These tests measure performance potential. Aptitude tests imply prediction and are sometimes substituted for intelligence tests, and used for classification (e.g., ability grouping).


Ability test scores are influenced by prior achievement (e.g., reading and math knowledge and skills).
(3) Achievement tests: used to measure examinees' current knowledge and/or intellectual skills.
(4) Speeded tests: maximum performance tests where the speed at which the test is completed is a vital element in scoring (e.g., a typing test, rifle assembly, a foot race, etc.). If a test has a time limit such that virtually all students finish the test, then it is not considered a speeded test and is called a power test. Most achievement tests should be power tests.
(5) Performance tests: designed to require examinees to demonstrate competence by constructing a response. In traditional testing, this takes the form of a constructed response, such as those routinely used in science and mathematics courses, or brief and extended essay responses. A performance assessment may also require a competence demonstration via the construction of a work product; in this instance, detailed product specifications, scoring criteria, and rating or scoring sheets are used.
b. Typical performance tests (TPT): These tests include personality, interest, preference, and values tests. They are called instruments, scales, inventories, and indexes.
(1) With these types of tests, we assume that examinees will perform typically, or as usual.
(2) There is less agreement about what is being measured and what the score means. That is why it is essential that any theory upon which a typical performance test is based be explicitly defined and explained.
(3) There is an assumption that an examinee answers test items truthfully.
(4) Typical performance tests should be applied cautiously in educational settings.
c. Standardized tests: These are aptitude or achievement tests which are administered under standard conditions and whose scores are interpreted under standardized rules, typically employing norms and standard scores, except on criterion-referenced tests. On most standardized tests, the breadth of content coverage is broad, and test items are written to maximize score variability. Thus, items which are too hard or too easy are not used in constructing the test.
d. Informal tests: Such tests are typically intended for a localized purpose. The classroom achievement test is a common example.
D. Learning Standards, Targets, or Outcomes: The Assessment Drivers
1. In education and training, learning outcomes, also called standards, are framed to specify the content, skill, or attitude to be measured. Learning outcomes or standards have also been referred to as content standards, performance standards, or behavioral objectives. Here the terms learning outcomes and standards are used interchangeably as applied to both intellectual and


psychomotor skills. Learning outcomes or standards are written for content, skills, or attitudes to be measured.
2. Intellectual skill taxonomies such as Bloom's (1956), Appendix 5.1; Canelos (2000); Gagné's (1985); or Quellmalz's (1987) provide guidance in crafting learning outcomes or standards. There are several models for framing these intended outcomes or standards.
3. Models for Framing Learning Outcomes or Standards
a. Mitchell (1996) offers the following taxonomy and definitions:
(1) Content (or academic) standards: These standards identify what knowledge and skills are expected of learners at specified phases in their educational progression.
(2) Performance standards: Performance standards have levels, e.g., 4, 3, 2, 1; exceeds expectations, meets expectations, or does not meet expectations; unsatisfactory, progressing, proficient, or exemplary, which are intended to show degree of content mastery.
(3) Opportunity-to-learn standards: There are instances where enabling learning and/or performance conditions are identified so as to ensure that learners have a fair chance to meet content and performance standards.
b. The State of Florida (1996, pp. 28-30) developed the following taxonomy:
(1) Strand: a label (word or phrase) for a category of knowledge...the most general type of information...strands are organizing categories essential to each discipline.
(2) Standard: a general statement of expected learner achievement...a description of general expectations regarding knowledge and skill development within a strand...standards provide more specific guidance as to what students should know and be able to do.
(3) Benchmark: learner expectations (what a student should know and be able to do) at the end of the developmental levels, e.g., PreK-2, 3-5, 6-8, 9-12 [grades]...benchmarks translate [specific] standards into expectations at different levels of student development...benchmarks describe expected achievement as students exit the developmental level...accompanying the benchmarks are sample performance descriptions.
c. In crafting learning outcomes, Oosterhof (1994, pp. 43-47) suggests the writer consider:
(1) Capability: Identify the intellectual capability being assessed. Performance taxonomies such as Bloom's or Gagné's are helpful here.
(2) Behavior: Indicate the specific behavior which is to be evidence that the targeted learning has occurred...the behavior should be directly observable and not require any other inference.
(3) Situation: Often it is helpful to specify the conditions under which the behavior is to be demonstrated...the key is to do what best specifies the situation in which the behavior is to be demonstrated.


(4) Special Conditions: Depending on the circumstances, one may need to place conditions on the behavior, e.g., name a letter or word correctly 80% of the time in order to conclude that the targeted learning has occurred.
4. Integrating the Various Models
a. Determine which type of standard needs to be constructed.
(1) Attitude standards state explicitly what attitudes, based on defined values, the faculty expect students to hold and articulate as a function of program matriculation.
(2) Content standards express explicitly what content students are expected to know as a function of program matriculation.
(3) Skill standards state very clearly the specific skills students are expected to have mastered as a function of matriculation.
b. It is often easy to confuse content and skill standards.
(1) More specifically, content standards may be defined to include mathematical rules, statistical formulas, important historical facts, grammar rules, or steps in conducting a biology experiment, etc.
(2) Skill standards may involve conducting statistical or mathematical operations based on formulas or mathematical rules; reporting historical fact or analyzing historical data; correcting a passage of text for poor grammar; or conducting a biology experiment. The key difference between content and skill standards is that with content standards, students are required to possess specific knowledge; skill standards require students to apply that knowledge in some expected fashion at an expected level of performance. The expected performance level (e.g., critical thinking) is usually defined by a taxonomy such as Bloom's (1956), Gagné's (1985), or Quellmalz's (1987).
c. A standard is typically composed of five elements.
(1) The first element states who is to do something, usually a student.
(2) The second element is an action-oriented verb (e.g., articulate, describe, identify, explain, analyze, etc.); it is at the verb level that standard classification taxonomies usually exert their influence. For example, Quellmalz outlined five cognitive functioning levels, which correspond to Bloom's levels as follows:

Quellmalz (1987)    Bloom (1956)
Recall              Knowledge & Comprehension
Analysis            Analysis
Comparison          Synthesis
Inference*          Application & Synthesis
Evaluation          Synthesis & Evaluation
*(deductive & inductive reasoning)


For instance, according to Stiggins, Griswold, and Wikelund (1989), verbs associated with inference include generalize, hypothesize, or predict. Within the comparison level, one could use the words compare and contrast. Consulting Bloom's extensive list of relevant verbs for specific Quellmalz levels might be helpful.
(3) The third element describes under what conditions the student is to demonstrate something (e.g., fully, briefly, clearly, concisely, correctly, accurately, etc.).
(4) The fourth element specifies what the student is to demonstrate (e.g., algebra calculation, leadership theories, decision-making models, etc.).
(5) The fifth element (optional) describes the medium in which the demonstration is to take place, e.g., in written or oral form, via examination, case report, or portfolio, etc. We don't recommend using the fifth element, as assessment options may become limited. Two sample standards are:
(a) The student will accurately compute algebraic equations.
(b) The student will accurately and concisely describe modern leadership theories.
d. For each attitude, content, and/or skill standard, construct an operational definition, called a benchmark.
(1) Standard operational definitions are constructed through benchmark statements. A benchmark, in plain English, is a specific action-oriented statement which requires the examinee to do something. When a student has achieved each relevant benchmark, the standard is met. Benchmarks further define the relevant attitude, content, and skill domains. Illustrative benchmarks are:
(a) The student will accurately compute algebraic equations.
[1] The student will correctly compute linear equations.
[2] The student will correctly compute quadratic equations.
[3] The student will correctly compute logarithms.
(b) The student will accurately and concisely describe modern leadership theories.
[1] Describe content and process motivation theories, citing education leadership examples.
[2] Describe trait theories of leadership, respecting assumptions, elements, research, and education application.
[3] Describe leadership styles and situational models in terms of assumptions, characteristics, and education application.
(2) Benchmarks are very useful in framing examinations, assignments, etc. As with standards, content experts should agree on the appropriateness of each benchmark and its association with its specific standard. Further, it is not necessary to stipulate a performance level


as in percent correct; the school's or organization's grading scale should suffice.
e. Once a sufficient number of standards, in depth and breadth, have been drafted, units, courses, degree programs, etc. may then be constructed around relevant bundles of standards, as such bundles of relevant attitude, content, and skill standards define domains. Attitude standards can be assessed by surveys. Content standard mastery can be studied through examinations which require the student to demonstrate his or her knowledge. Skills can be assessed via examinations using application-oriented item formats, projects, portfolios, or cases, etc.
f. The five-part model presented here can also be used to evaluate standards written by others. Regardless of the approach employed, each standard should meet at least the first four components, be developmentally appropriate, and ensure that students have had the opportunity to learn the content, develop the desired attitude, and/or acquire the specified skill(s).
E. Constructing a Test Blueprint
1. A table of specifications sorts the performance standards, i.e., content and/or skills, by the intellectual skills to be performed. The number of test items is inserted in the appropriate table cell. See Table 5.3.
2. The more important the content, the greater the number of items. To arrive at these numbers, the test developer will typically
a. Determine the total number of items to be included. This number is materially influenced by the testing time available, the test environment, the developmental status of the examinees, and the size of the content and/or skill domain to be tested.
b. Allocate test items to each performance standard. More critical standards are allotted more items.
c. Sort items across each performance standard by intellectual skill (a brief allocation sketch is given after the application discussion below).
3. Application
a. Presented in Table 5.3 is a test blueprint for an end-of-term test in grade 5 language arts. The performance standards are benchmarks drawn from standard two of the Florida Sunshine State Standards for Language Arts. The benchmarks are presented in the far-left column, and the intellectual skills are presented in the columns to the right. The numbers in the table cells are the numbers of items per intellectual skill.
b. Upon reviewing the table, the reader sees that this instructor selected (LA.A.1.2.2) as more critical than the others.
c. For (LA.A.1.2.2), there are 8 knowledge items, which require simple recall of information. Knowledge of phonetic sounds and word structure is critical if a student is to be able to apply this knowledge to construct meaning from various texts. The types of knowledge to be demonstrated drive which item formats are selected. Possible item formats are simple multiple choice, matching, fill-in-the-blank, or true/false.
d. Next, the examinee should be able to demonstrate comprehension of what is read in the form of retelling and self-questioning. Comprehension involves translation (2.10), interpretation (2.20), and/or extrapolation (2.30). In the present instance, we would focus on interpretation and/or extrapolation. Possible item formats are multiple choice, short response, or extended response items.
e. The test developer elected to test at the application level. To test application, the test developer can craft one table and one graph. Several multiple choice questions concerning meaning can be carefully constructed. An alternative is to have the student write several sentences as to the meaning of the table and graph.
f. Prior to being able to construct meaning, one should be able to analyze various texts or illustrations. For the higher order items, a student could be asked to analyze two brief texts with tables and graphs. Next, he or she would be asked to construct a statement explaining the texts and predicting a potential outcome from that meaning. Answers could be selected, as in multiple choice items, and/or supplied, as in a written paragraph or two.
g. Since there are only two hours for the examination and more time is required to demonstrate higher order intellectual skills (i.e., analysis, synthesis, and evaluation), there are fewer higher order items included on the test.
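The allocation step sketched in 2.a-c above can be made mechanical once weights are assigned. The short Python sketch below apportions a fixed item budget in proportion to hypothetical benchmark weights using a largest-remainder rule; the weights and the resulting counts are illustrative and do not reproduce the instructor's judgment reflected in Table 5.3.

    # Illustrative sketch: allocating a fixed item budget across performance
    # standards in proportion to their importance (weights are hypothetical).

    def allocate_items(weights, total_items):
        # largest-remainder apportionment of total_items across weighted standards
        total_w = sum(weights.values())
        raw = {k: total_items * w / total_w for k, w in weights.items()}
        counts = {k: int(r) for k, r in raw.items()}          # floor of each share
        leftover = total_items - sum(counts.values())
        # hand remaining items to the largest fractional remainders
        for k in sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)[:leftover]:
            counts[k] += 1
        return counts

    weights = {"LA.A.1.2.1": 2, "LA.A.1.2.2": 4, "LA.A.1.2.3": 2, "LA.A.1.2.4": 2}
    print(allocate_items(weights, total_items=41))
    # {'LA.A.1.2.1': 8, 'LA.A.1.2.2': 17, 'LA.A.1.2.3': 8, 'LA.A.1.2.4': 8}

In practice the computed counts are only a starting point; professional judgment still adjusts them for testing time and item format.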

F. Selecting Test Items
1. Test items are proxy measures of an unobservable psychological construct or trait, an intellectual or thinking skill (see Appendix 5.1), a psychomotor skill, or an oral communication skill. Measuring psychological attributes, such as ability or achievement, is often not directly possible or is at least extremely difficult.
a. Test items consist of a stimulus which is intended to prompt a prescribed or expected answer. A correct answer or endorsement of a test item is seen as an indicator that an examinee possesses that element or part of the attribute, construct, or skill under investigation.
b. Test item formats include multiple choice, true/false, completion, fill-in-the-blank, short answer, checklist, and essay. Each item type has its strengths and weaknesses and should be selected carefully.
2. There are three types of test items:
a. Mastery-Type Items
(1) Measure essential minimums that all examinees should know.
(2) Lower level (e.g., Bloom's Taxonomy) items requiring memorization of facts or computations.
(3) Common in licensing or certification tests.
b. Power-Type Items
(1) Designed to measure what typical examinees are expected to know.


(2) May range in difficulty from very easy to very hard, depending on the test purpose and the content domain being tested.
(3) Power items are commonly found in achievement tests.
c. Speed-Type Items
(1) Designed to assess higher level (e.g., Bloom's Taxonomy) concepts and skills expected of the most able examinees.
(2) Commonly found on tests of mental abilities (e.g., IQ tests).
(3) Don't confuse these item types with speeded tests, where the strictly observed time limit is part of the measurement process (e.g., decision-making accuracy in an emergency situation, such as for a paramedic, firefighter, or police officer).
(4) The process for constructing a mastery, power, or speeded test is the same; the key difference is the purpose of the test.
3. Test Items Are Unidimensional and Logically Independent
a. Items are unidimensional.
(1) Items measure one psychological attribute, e.g., knowledge, an intellectual skill, an attitude, or a psychomotor skill, etc.
(2) If it were possible, in theory, to write every possible test item for an attribute (e.g., an intellectual skill), the items would fully describe every single aspect of that attribute.
(3) The unidimensionality assumption is critical for two reasons: (a) it simplifies the interpretation of test items and (b) it allows the explanation that a single trait (e.g., achievement) accounts for an examinee responding correctly to an item.
b. Items are logically independent.
(1) An examinee's response to any test item is independent of his or her response to any other test item. Responses to all test items are unrelated, i.e., statistically independent.
(2) Practical implications for writing test items are: (a) write items so that one item doesn't give clues to the correct answer to another item, and (b) if several items are related, e.g., to a graphic, they should be written so as not to betray one another but to test different aspects of the graphic. See the Interpretive Exercise in the discussion on writing multiple choice items and Appendix 5.2.
G. Directions and Arranging Test Items
1. Directions
a. Directions should be clear, concise, and to the point.
b. Directions should include
(1) How to record answers
(2) Time available
(3) How answers are scored and the points associated with specific subtests or items
(4) What to do when the test is completed


(5) The permissible use of scratch paper to show computations, if allowed
c. Keep the directions and associated items on the same page, even if the directions need to be repeated on subsequent page(s).
2. Arranging Items on the Test Instrument
a. Include an "ice breaker" item which virtually all examinees will answer correctly, to build confidence. Since the purpose of test items is to identify those who can answer each item correctly, not all examinees should be able to answer every item on the test.
b. Group similar content together on the testing instrument. Use items designed to measure important content, as testing time is almost always limited.
c. Use items of appropriate, known difficulty levels when possible.
d. Don't break items across a page.
e. Keep charts and figures pertaining to an item on the same or the preceding page.
f. If using a computer printer, use a consistent font and font size strategy.
g. If students are required to supply an answer, provide enough room for the answer.
H. Timing and Testing
1. For speeded tests, an examinee's score is materially influenced by the number of items answered correctly within a specified time limit. Thus, item order and format are important. Care must be taken to ensure that item order and format are consistent with the purpose and cognitive skill level(s) to be measured. Before time limits are set, there should be significant pilot testing to ensure that the time limit is sufficient to accomplish the purpose of the test.
2. For power tests, an examinee's score is influenced by item format and difficulty as well as the cognitive skill(s) to be assessed. If used, the time limit should be sufficient so that virtually all examinees will complete the test.
3. Testing time limits are useful because:
a. Examinees complete the test more rapidly.
b. Examinees learn to pace themselves.
c. Examinees may be more motivated.
d. A time limit can be selected so that most examinees will complete the test.
4. Recommended Time Limits by Item Format
a. Simple true/false or matching: 15-20 seconds each.
b. Complex true/false or matching and one- or two-word completion: 20-30 seconds each.
c. Simple multiple choice: 30 seconds each.
d. Complex multiple choice: 45-60 seconds each.
e. Brief response essay: 10 minutes.


f. Remember, the time required for examinees to answer test items is influenced by reading skills, item format, item length, and the cognitive skill(s) being tested. Items which require computations will take longer. The above time limits are intended only to serve as a guide.
I. Interpreting Test Scores
A. When individual test items are grouped, an examiner (i.e., trainer, teacher, or evaluator, etc.) is able to infer the amount of knowledge or skill an examinee possesses; a performance estimate (e.g., a test score) is inferred. There must be a logical correspondence (content validity) between the test items and the knowledge or skill being assessed. The creation of an examinee performance estimate is a professional judgment which is informed by experience and prevailing professional practices.
1. Test items are either dichotomously or polychotomously scored.
a. Dichotomously scored: Test items are scored as correct or incorrect, right or wrong, etc. This scoring applies to multiple-choice, true/false, matching, completion, and fill-in-the-blank items.
b. Polychotomously scored: More than one response alternative is correct for a test item, or various point awards are possible. This is typically applied to restricted or extended response essays.
2. Once the item formats are selected, points are allocated.
a. The most critical knowledge or skills are weighted with the most points, followed by less critical knowledge and skills.
(1) Select response items (i.e., multiple choice, true/false, completion, matching), which measure lower order intellectual skills, are scored dichotomously and are usually worth a few points each.
(2) Supply response items (brief and extended responses), which measure higher order intellectual skills, are polychotomously scored and are typically weighted with more points than select response items.
b. These point totals are combined into the maximum points possible. See Table 5.1.
(1) The two performance tasks are weighted with the most points, as they require the demonstration of higher order intellectual skills.
(2) Discussion prompt answers are a maximum of 500 words each, require references, and must evidence at least one higher order intellectual skill. Answers are graded with an analytical rubric.
(3) Students are required to comment on other students' discussion prompt answers or comments. Comments are scored holistically.
c. Each assignment has a task description and scoring rubric so students clearly understand what is expected of them. This graduate course does not employ testing using select response items; see Appendix 5.8.


Table 5.1: Assignment Point Distribution
Assignment                                          Points
Traditional Classroom Test Construction Task           224
Direct Performance Assessment Construction Task        160
Discussion Prompt Answers (8 x 12 x 2)                 192
Discussion Answer Comments (16 x 4 x 1.5)               96
Total Points                                           672

d. Points possible are then usually segmented, based on point levels or percentages, into performance levels, often with qualitative labels. See Table 5.2.
Table 5.2: Grading Scale
Grade    Percentage    Meaning
A        95-100%       Exceptional
A-       90-94%        Excellent
B+       87-89%        Very Good
B        83-86%        Good
B-       80-82%        Fair
C        75-79%        Marginal
F        Below 75%     Failure
Source: Saint Leo University (2010). Graduate academic catalog 2010-2011. St. Leo, FL: Author.
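A minimal sketch of how earned points map to a percentage and then to a letter grade under the Table 5.2 scale; the round-half-up rule shown is one possible convention and must be stated and applied consistently, as the discussion that follows stresses.

    # Illustrative sketch: converting earned points to a letter grade using the
    # Table 5.2 scale. The round-half-up rule is an assumption; state and apply
    # whatever rounding rule is adopted, consistently.
    import math

    SCALE = [(95, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"), (75, "C")]

    def letter_grade(points_earned, points_possible):
        pct = 100 * points_earned / points_possible
        rounded = math.floor(pct + 0.5)           # round half up to a whole percent
        for cutoff, grade in SCALE:
            if rounded >= cutoff:
                return pct, rounded, grade
        return pct, rounded, "F"

    pct, rounded, grade = letter_grade(602, 672)
    print(f"{pct:.2f}% -> {rounded}% -> {grade}")  # 89.58% -> 90% -> A-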

e. Combining the information from Tables 5.1 and 5.2, a student earning 602 points is assumed to have mastered 90% (actually 89.58%) of the content (learning outcomes or targets), for a performance indicator of an A-, or an Excellent performance.
(1) Be clear and consistent on the rounding rules; hence the 90% versus the 89.58%. Round up or down according to the rule.
(2) A performance indicator assumes a clearly understood task description and consistent application of the scoring rubric for each assignment.
(3) An assignment score should not be based on anything other than the examinee's performance against the task description and scoring rubric.
(4) For tests using select and/or supply response items, the performance indicator should be based on that particular test's scoring plan, which should be communicated to examinees before the examination. For an example of a test scoring plan, see Appendix 5.8.
B. Decision-Making Perspectives
1. Once a performance inference is drawn, a decision must be made; this is where decision-making enters the equation.
a. There are four decision perspectives:
(1) Norm-Referenced Theory (NRT)


(2) Criterion-Referenced Theory (CRT)
(3) Ability-Referenced
(4) Growth-Referenced
b. A test score is interpreted within one of these decision perspectives; we must ensure that any test or direct performance assessment designed produces the information needed to make the proper decision.
2. The Norm-Referenced (NRT) Perspective
a. An examinee's performance score is compared to others'; the score is usually from a commercially available standardized test.
b. The purpose of this approach is to determine an examinee's performance standing against the norm group. One commonly uses a percentile rank table or chart (a brief computational sketch follows the CRT discussion below). Such comparisons are relative in their interpretation, i.e., a change in the norm group will most likely change the examinee's performance standing, e.g., a drop or increase from the 67th percentile to or from the 78th.
c. To be a valid comparison, the norm group must be well defined and representative of the population from which the examinee is drawn.
d. Norm-referenced scores are not very useful for curricular or instructional decisions, as test content is often at variance with the school or training curriculum and what was taught.
e. The NRT perspective is most associated with standardized testing; see Appendix A, Standardized Testing: Introduction.
3. The Criterion-Referenced (CRT) Perspective
a. An examinee's performance is compared to a defined and ordered content and/or skill domain, usually through specified learning objectives or performance standards.
b. Because criterion-referenced tests are based upon specified learning objectives or performance standards, content sampling (i.e., the number and depth of test items) is deep, as opposed to the broad, shallow content sampling (a few, general test items) used in the NRT approach.
c. Score variability is usually minimal, as the meaning of the score is derived from the learning objectives and/or performance standards upon which the test is based. It is assumed that a high score means that the examinee knows more of the material and/or can perform more of the skills expected than an examinee earning a lower score.
d. The CRT approach is most useful for instructional and/or curricular decision-making, as test items are selected for their ability to describe the assessment domain of interest. It is not uncommon for examinees to score 80% of items correct. CRT applications include minimum competency or licensure testing.
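To make the norm-referenced comparison concrete, here is a small Python sketch of a percentile rank computation; the two norm groups are fabricated, and publishers define percentile rank in slightly different ways, so treat this only as an illustration of why a change in norm group shifts an examinee's standing.

    # Illustrative sketch: percentile rank of a raw score within a norm group.
    # Norm-group scores are fabricated for illustration.

    def percentile_rank(score, norm_scores):
        # percent of the norm group scoring below, plus half of those tied
        below = sum(1 for s in norm_scores if s < score)
        tied = sum(1 for s in norm_scores if s == score)
        return 100 * (below + 0.5 * tied) / len(norm_scores)

    norm_group_a = [48, 52, 55, 55, 58, 60, 61, 63, 66, 70]
    norm_group_b = [40, 44, 47, 50, 52, 55, 57, 58, 60, 62]
    raw = 58
    print(percentile_rank(raw, norm_group_a))   # lower standing in the stronger group
    print(percentile_rank(raw, norm_group_b))   # higher standing in the weaker group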


Table 5.3 Test Blueprint for an End-of-Term Test on Language Arts
Strand A: Reading; Standard 2: The student uses the reading process effectively (Grades 3-5)

Benchmark: Uses a table of contents, index, headings, captions, illustrations, and major words to anticipate or predict content and purpose of a reading selection. (LA.A.1.2.1)
  Knowledge 3, Comprehension 3, Application 1; Total 7
Benchmark: Selects from a variety of simple strategies, including the use of phonetics, word structure, context clues, self-questioning, confirming simple predictions, retelling, and using visual cues to identify words and construct meaning from various texts, illustrations, graphics, and charts. (LA.A.1.2.2)
  Knowledge 8, Comprehension 4, Application 2, Analysis 2; Total 16
Benchmark: Uses simple strategies to determine meaning and increase vocabulary for reading, including the use of prefixes, suffixes, root words, multiple meanings, antonyms, synonyms, and word relationships. (LA.A.1.2.3)
  Knowledge 3, Comprehension 3, Application 1, Analysis 1, Synthesis 1; Total 9
Benchmark: Clarifies understanding by rereading, self-correction, summarizing, checking other sources, and class or group discussions. (LA.A.1.2.4)
  Knowledge 5, Comprehension 3, Application 1; Total 9
Item Totals: Knowledge 19, Comprehension 13, Application 5, Analysis 3, Synthesis 1; Total 41
Source: Florida Curriculum Framework: Language Arts PreK-12 Sunshine State Standards and Instructional Practice (Florida Department of Education, 1996, pp. 36-37).


4. The Ability- and Growth-Referenced Perspectives
a. Ability-Referenced: This is used when there is a need to estimate what an examinee can achieve. Thus, current performance is compared to an estimate of an examinee's maximum performance. Standardized ability tests are routinely used. The ability-referenced approach relies additionally on prior experience with the examinee, professional judgment, records, and/or conferences. This approach to interpretation is essentially subjective. While useful, we should be very conservative in its application.
b. Growth-Referenced: Current performance is compared to prior performance. While a natural approach in education (and training), growth is typically measured with gain scores, which tend to be unreliable. This approach is also called value added.
III. Constructing Select Response Items to Assess Knowledge, Comprehension, and Application
A. General Item Writing Guidelines
1. Make the difficulty and complexity of the item appropriate to the examinees' level. Ensure that the item reading level is appropriate for examinees.
2. Define the item as clearly as possible.
3. Write simple, straightforward items. Use the most precise words to communicate the intent of the item.
4. Know the mental processes students are to use and frame the item accordingly.
5. Vary item complexity and difficulty.
6. Make items as independent of each other as possible, except when used in a series.
7. Avoid negatively phrased items; avoid double negatives.
8. Use correct grammar, syntax, spelling, etc.
9. When writing specific item formats, review the unique item construction guidelines for each format to ensure conformity.
10. Ensure that items do not contain language which is racially, ethnically, religiously, or economically biased.
11. Avoid clues that Airasian (1997, p. 177) calls "specific determiners." These are words which tend to give away the correct answer. For example, words such as always, all, or none tend to be associated with false true/false items.
B. Writing Multiple-Choice Items
1. Each multiple-choice item has a stem, which presents the situation, problem, definition, etc. Then there are answer choices (also called options), which include the correct answer and plausible wrong answers, called distractors or foils. There are several multiple choice item variations:


a. Correct Answer: Only one answer is correct.


(1) Correct Answer
___ In times of war in the early Roman Republic, the two consuls stepped aside in favor of one person who would be responsible for making decisions quickly. That person was a
a. General        c. Dictator
b. Czar           d. Tyrant
(2) Correct Answer
___ What is 68 x 22 = _____
a. 149            c. 1,496
b. 1,469          d. 4,196
(3) Answers: c & c

b. Best Answer: Examinees select the best response option that fits the situation presented in the item stem.
(1) Best Answer
___ This culture developed an accurate calendar. They built steep temple pyramids and used advanced agricultural techniques. They developed a system of mathematics that included the concept of zero. They were located mostly in the Yucatán Peninsula. They ruled large cities based in southern and southeastern Mexico, as well as in the Central American highlands. This passage best describes the ancient
a. Aztecs         c. Mayas
b. Incas          d. Olmecs
(2) Answer: c

c. Rearrangement: This item can be used to assess examinee procedural knowledge or comprehension.
(1) Rearrangement
There are several preconditions which must be satisfied before a test is constructed. Read each statement listed below and place each in the correct order of operation, using the codes presented below.
a. First Step       c. Third Step
b. Second Step      d. Fourth Step
____ 1. The intellectual skills, attitudes, and/or psychomotor skills are identified.
____ 2. The test item formats are considered and selected.
____ 3. A table of specifications (or test blueprint) is developed.
____ 4. Items are crafted.
(2) Answer: 1. a; 2. b; 3. c; 4. d

d. Substitution Item: In this variation, the item stem contains a blank or blanks. The examinee then selects from the response options either the correct or the best response for the stem; the purpose is usually to make the stem a correct or true statement. The item is widely used to assess examinee comprehension.
(1) Directions. Write the letter which represents the word which correctly completes the statement in the corresponding blank.

Reliability and Validity Relationship
a. Concurrent          d. Test-retest
b. Construct           e. Internal Consistency
c. Content             f. Alternate Forms

At a minimum, all tests must have (1) _________ validity and (2) _________ reliability.
(2) Answers: 1. c; 2. e

e. Analogy
(1) A multiple choice item can be written in the form of an analogy.
(2) Hieroglyphics is to Egypt as cuneiform is to
a. Phoenicia        c. Persia
b. Sumer            d. Crete
(3) Answer: b

2. Strengths and limitations of multiple choice items are:
a. Strengths
(1) Simple and/or complex learning outcomes can be measured.
(2) Highly structured and clear tasks are provided.
(3) These items are highly efficient in assessing substantial content domains quickly.
(4) All responses, even incorrect endorsements, provide useful item revision and examinee performance information.
(5) Items are less susceptible to guessing than matching or true/false items.
(6) Scoring is simple, quick, and objective.
b. Limitations
(1) Constructing high quality multiple choice items is labor and time intensive.
(2) Writing plausible foils is often difficult.
(3) Writing items to measure higher order intellectual skills is very difficult and, when successfully done, provides only a proxy measure of such skills.
(4) Like most test items, the multiple choice format can be influenced by reading skills.
3. A review of the test item design literature (Oosterhof, 1994, pp. 33-147; Gronlund, 1998, pp. 60-74; Popham, 2000, pp. 242-250) and the authors' experience suggests the following item writing guidelines.
a. Learning outcomes should drive all item construction. There should be at least two items per key learning outcome.
b. In the item stem, present a clear stimulus or problem, using language that an intended examinee would understand. Each item should address only one central issue.
c. Generally, write the item stem in positive language, but ensure, to the extent possible, that the bulk of the item wording is in the stem.
d. Underline or bold negative words whenever they are used in an item stem.


e. Ensure that the intended correct answer is correct and that distractors or foils are plausible to examinees who have not mastered the content.
f. Ensure that items are grammatically correct and that answer options are grammatically parallel with the stem and with each other.
g. With respect to correct answers, vary their length and position in the answer option array. Ensure that there is only one correct answer per item.
h. Don't use "all of the above," and use "none of the above" only when there is no other option.
i. Vary item difficulty by writing stems of varying levels of complexity or by adjusting the attractiveness of distractors.
j. Ensure that each item stands on its own, unless it is part of a scenario where several items are grouped. Don't use this item format to measure opinions.
k. Ensure that the arrangement of response options is not illogical or confusing and that the options are not interdependent or overlapping.
l. Avoid clues that enable examinees to eliminate incorrect alternatives or select the correct answer without really knowing the content. Common clues include:
(1) Restating the text or lecture note answer verbatim or nearly verbatim.
(2) Using implausible distractors.
(3) Writing the correct answer so that it is longer than the distractors.
(4) Writing distractors which are so inclusive that an examinee is able to exclude other distractors, or distractors which have the same meaning.
(5) Absolutes such as never, always, etc. are associated with false statements. Their use is a clue to examinees that the alternative is most likely incorrect.
4. Specific Applications
a. Best Answer Items: In this scenario, the examinee is required to select the best response from the options presented. These items are intended to test evaluation skills (i.e., making relative judgments given the scenario presented in the stem).
b. To assess complex processes, use pictorials that the examinee must explain.
c. Use analogies to measure relationships (analysis or synthesis).
d. Present a scenario and ask examinees to identify assumptions (analysis).
e. Present an evaluative situation and ask examinees to analyze the applied criteria (evaluation).
f. Construct a scenario where examinees select examples of principles or concepts. These items may measure either analysis or synthesis, depending on the scenario.
5. The Interpretive Exercise (IE)
a. A preferred item format for assessing application, analysis, synthesis, and evaluation is the Interpretive Exercise. See Appendix 5.2.


(1) In an IE, based on introductory information, several related items are used to assess mastery of a particular intellectual skill.
(2) This assessment strategy is also used to assess other complex skills such as reading ability, comprehension, mathematical thinking, problem solving, writing skills, graph and table interpretation, etc.
b. An IE can be used to assess whether or not examinees can
(1) Recognize and state inferences;
(2) Distinguish between warranted and unwarranted generalizations;
(3) Identify and formulate tenable hypotheses;
(4) Identify and interpret relationships;
(5) Recognize assumptions underlying conclusions;
(6) Formulate valid conclusions and recognize invalid ones;
(7) Recognize the relevance of information; and
(8) Apply and/or order principles.
Note: Multiple choice, true/false, and matching test items are only proxy measures of any intellectual skill. Direct measurement relies on restricted-response essays, extended response essays, and performance based methods. Short-answer items can provide a direct measure, depending on the construction of the IE and the particular short-answer item.
c. Advantages
(1) Interpretive skills are critical in life.
(2) An IE can measure more complex learning than any single item.
(3) As a related series of items, an IE can tap greater skill depth and breadth.
(4) The introductory material, display, or scenario can provide necessary background information.
(5) An IE measures specific mental processes and can be scored objectively.
d. Limitations
(1) An IE is labor and time intensive as well as difficult to construct.
(2) The introductory material which forms the basis for the exercise is difficult to locate, and when it is located, reworking for clarity, precision, and brevity is often required.
(3) Solid reading skills are required. Examinees who lack sufficient reading skills may perform poorly for that reason alone.
(4) An IE is a widely used proxy measure of higher order intellectual skills. For example, an IE can be used to assess the elements of problem solving skills, but not the extent to which the discrete skills are integrated.
e. Construction Guidelines
(1) Begin with written, verbal, tabular, or graphic (e.g., charts, graphs, maps, or pictures) introductory material which serves as the basis for the exercise. When writing, selecting, or revising introductory material, keep the material:
(a) Relevant to the learning objective(s) and the intellectual skill being assessed;
(b) Appropriate for the examinees' development, knowledge and skill level, and academic or professional experience;


(c) At a simple reading level, avoiding complex words or sentence structures, etc.;
(d) Brief, as brief introductory material minimizes the influence of reading ability on testing;
(e) Complete, containing all the information needed to answer the items; and
(f) Clear, concise, and focused on the IE's purpose.
(2) Ensure that the items, usually multiple choice:
(a) Require application of the relevant intellectual skill (e.g., application, analysis, synthesis, and evaluation);
(b) Don't require answers readily available from the introductory material or answers that can be given correctly without the introductory material;
(c) Are of sufficient number to be either proportional to or longer than the introductory material; and
(d) Comply with item writing guidelines.
(e) Revise the introductory content, as necessary, when developing items.
(3) When an examinee correctly answers an item, it should be because he or she has mastered the intellectual skill needed to answer correctly, not because he or she has memorized background information. For example, in statistics, give the examinee the formulas so that concepts are tested and not formula memorization.
(4) Other item formats are often included in the interpretive exercise.
C. Writing True/False Items
1. True/false items are commonly used to assess whether or not an examinee can identify a correct or incorrect statement of fact. Tests employing true/false items should contain enough items so that the relevant domain is adequately sampled and the examiner can draw sound conclusions about examinee knowledge. True and false statements should be equally represented.
2. Strengths and limitations of true/false items (Gronlund, 1998, p. 79; Oosterhof, 1994, pp. 155-158; Popham, 2000, pp. 232-233) are:
a. Strengths
(1) These items enable adequate sampling of the content domain.
(2) They are relatively easy to construct.
(3) They can be efficiently and objectively scored.
(4) Reading ability exerts less of an influence than in multiple choice items.
b. Limitations
(1) These items are susceptible to guessing. There is a 50% chance of an examinee guessing the correct answer to a true/false item (a brief probability sketch appears at the end of this true/false section).
(2) Such items are only able to measure higher order skills indirectly.
(3) These items can only be employed when dichotomous alternatives sufficiently represent reasonable responses.


(4) Use of true/false items tends to encourage examinees to memorize facts, etc. as opposed to learning the content. If the item is not carefully constructed, student memorization pays off.
(5) True/false items provide none of the diagnostic information that multiple choice items yield through the particular wrong answer selected.
3. True/false item writing guidelines (Gronlund, 1998, pp. 80-84; Oosterhof, 1994, pp. 156-164; Popham, 2000, pp. 235-238) are:
a. For each statement, incorporate only one theme. When preparing an item, write it as both a correct and an incorrect statement. Select the better of the two options.
b. Write crisp, clear, and grammatically correct statements. Brevity is recommended. However, use terms that would suggest an incorrect response to an examinee who reads the item only superficially.
c. Select precise wording for the item so that it is not ambiguous. The item should be either absolutely true or absolutely false; avoid statements that are partly true and partly false. If an adverb or adjective is key to marking a statement correctly, underline or otherwise stress that word.
d. Avoid double negatives. Negatively worded statements tend to confuse examinees, so use them rarely.
e. Attribute opinion statements to their source. However, if testing examinees' ability to distinguish fact from opinion, don't attribute.
f. If measuring cause and effect, use true statements.
g. Avoid absolute terms (e.g., always or never), usually adjectives or adverbs, which tend to clue the examinees to the correct response.
4. True/False Item Variations
a. True/False Items Requiring Corrections
(1) This variation requires the examinee either to correct a false item or to identify the false portion of an item.
(2) This is actually a blend of true/false and short answer.
(3) Directions. Read the statement carefully. Determine whether each underlined number is true or false. If you think the number is true, circle T; if you think it is false, circle F and write the correct number on the line provided.

Acceptable reliability standards for measures exist. For instruments where groups are concerned, 0.80 (a) or higher is adequate. For decisions about individuals, 0.85 (b) is the bare minimum; 0.95 (c) is the desired standard.

a. T  F  ______
b. T  F  ______
c. T  F  ______

Answer: a. T; b. F (0.90); c. T

b. Embedded Items
(1) A paragraph includes underlined words or word groupings. Examinees are asked to determine whether or not the underlined content possesses a specified quality, e.g., being true, correct, etc.


(2) Embedded items are useful for assessing declarative or procedural knowledge. (3) Directions. Indicate whether the underlined word is correct within the context of
threats to a measure's reliability. Circle 1 for correct or 2 for incorrect.

When a test is given to a very similar, homogeneous (a) (1 or 2) group, the resulting scores are closely clustered and the reliability coefficient will be high (b) (1 or 2).

Answer: a. 1 and b. 2.

c. Multiple True-False Items (1) This variation is a blend of the true/false and multiple choice item. Multiple statements, whose truth or falsity is being tested, share a common stem. Each statement is numbered as a unique test item. (2) True/false item writing rules apply, but note that statements sharing a common stem are usually narrower in focus than conventional true/false items. (3) Directions. Read each option and if you think the option is true, circle T or if you
think the statement is false, circle F.

The standard error of measurement is useful for

T F 1. Reporting an individual's scores within a band of the score range
T F 2. Converting individual raw scores to percentile ranks
T F 3. Reporting a group's average score
T F 4. Comparing group differences

Answers: 1. T; 2. F; 3. F; 4. F

d. Focused Alternative-Choice Items (1) While conceptually similar to true/false items, this format requires examinees to select between two words or values, one of which correctly completes a statement. The words or values must have opposite or reciprocal meanings. (2) When compared to conventional true/false items, focused alternative-choice items typically produce more reliable test scores. (3) Directions. Read each statement carefully. Circle the letter that represents the word
or phrase which makes the statement correct.

A B 1. A measure's reliability coefficient will likely (a. increase or b. decrease) as the number of items increases.
A B 2. An examinee's answer to a restricted response essay should typically not exceed (a. 150 or b. 250) words.

Answers: 1. a; 2.a

e. Standard Format (1) Traditionally, a true/false item has been written as a simple declarative sentence which was either correct or incorrect. (2) Directions. Read the statement carefully. If you think the statement is true, circle
T; if you think the statement is false, circle F.

T F A measure can be reliable without being valid.

Answer: T


D. Writing Matching Items 1. The matching item is a variation of the multiple choice item format. Examinees are directed to match (i.e., associate) items from two lists. Traditionally, the premises (the portion of the item for which a match is sought) are listed on the left-hand side of the page. On the right-hand side of the page are listed the responses (the portion of the item which supplies the associated elements). a. The key to using this item format is the logical relationship between the elements to be associated. b. Elements in each column are homogeneous, i.e., related to the same topic or content domain. c. There are usually more responses (right-hand column) than premises. This reduces the biasing impact of guessing on the test score. d. Example
Directions. Match the terms presented in the right column to their definitions, which are presented in the left column. Write the letter which represents your choice of term in the blank provided.

Definition
_____ 1. Measures essential minimums that all examinees should know
_____ 2. Designed to assess higher level concepts & skills
_____ 3. Designed to measure what typical examinees are expected to know
_____ 4. Designed to measure achievement where the content is evolving

Term
a. Mastery Items
b. Power Items
c. Speed Items
d. Independent Variable Items
e. Dependent Items

Answers: 1. a; 2. b; 3. b; 4. d

2. Strengths and limitations of matching items (Gronlund, 1998, pp. 85-86; Kubiszyn & Borich, 1996, p. 99; Oosterhof, 1994, p. 148; & Popham, 2000, pp. 238-240) are: a. Strengths (1) An efficient and effective format for testing examinees' knowledge of basic factual associations within a content domain. (2) Fairly easy to construct. Scoring is fast, objective, and reliable. b. Limitations (1) This format should not be used to assess cognitive skills beyond mere factual associations, which emphasize memorization. (2) Sets of either premises or responses are susceptible to clues which help under-prepared examinees guess correctly, giving rise to false inferences about examinee content mastery. 3. Item writing guidelines (Gronlund, 1998, pp. 86-87; & Popham, 2000, pp. 240-241) are: (a) Employ only homogeneous content for a set of premises and responses.


(b) Keep each list reasonably short, but ensure that each is complete, given the table of specifications. Seven premises and 10-12 options should be the maximum for each set of matching items. (c) To reduce the impact of guessing, list more responses than premises and let a response be used more than once. This helps prevent examinees from gaining points through the process of elimination. (d) Put responses in alphabetical or numerical order. (e) Don't break matching items across pages. (f) Give full directions which include the logical basis for matching and indicate how often a response may be used. (g) Follow applicable grammar and syntax rules. (h) Keep responses as short as possible by using key, precise words.
E. Writing Completion and Short Answer Items 1. Completion and short answer items require the examinee to supply a word or short phrase response. For example:
Directions. Read each statement carefully. Write in the word or words which complete and make the statement correct.

1. The two most essential attributes of a measure are (a) ________ and (b) ________.
2. If a test is to be re-administered to the same examinees, the researcher is concerned about _________ reliability.
3. If the reliability of a test is zero, its predictive validity will be _______.

Answers: 1a. validity; 1b. reliability; 2. test-retest; 3. zero

2. Strengths and limitations (Gronlund, 1998, pp. 96-97; Kubiszyn & Borich, 1996, pp. 99-100; Oosterhof, 1994, pp. 98-100; Popham, 2000, pp. 264-266) of completion and short answer items are: a. Strengths (1) These item formats are efficient in that, due to ease of writing and answering, more items can be constructed and used. Hence, much content can be assessed, improving content validity. (2) The effect of guessing is reduced as the answer must be supplied. (3) These item formats are ideal for assessing mastery of content where computations are required as well as other knowledge outcomes. b. Limitations (1) Phrasing statements or questions which are sufficiently focused to elicit a single correct response is difficult. There is often more than one correct answer, depending on the degree of item specificity. (2) Scoring may be influenced by an examinee's writing and spelling ability. Scoring can be time consuming and repetitious, thereby introducing scoring errors. (3) This item format is best used with lower level cognitive skills, e.g., knowledge or comprehension.


(4) The level of technological support for scoring completion and short answer items is very limited as compared to selected response items (i.e., matching, multiple choice, and true/false). 3. Item writing guidelines (Gronlund, 1998, pp. 87-100; Oosterhof, 1994, pp. 101-104; & Popham, 2000, pp. 264-266) are: a. Focus the statement or question so that there is only one concise answer of one or two words or phrases. Use precise statement or question wording to avoid ambiguous items. b. A complete question is recommended over an incomplete statement. You should use one, unbroken blank of sufficient length for the answer. c. If an incomplete statement is used, some authors recommend a separate blank of equal size for each missing word; others consider such an action to be a clue to examinees as to elements of the correct response. Avoid using broken lines which correspond to the number of letters in each word of the desired answer. d. Place the blank at the end of the statement or question. e. For incomplete statements, select a key word or words as the missing element(s) of the statement. Use this item variation sparingly. f. Avoid extraneous clues arising from grammatical structure, e.g., the articles a or an. g. Ensure that each item format allows for efficient and reliable scoring.
IV. Constructing Restricted and Extended Response Items to Assess Analysis, Synthesis, and Evaluation A. Introduction 1. A restricted response essay poses a stimulus in the form of a problem, question, scenario, etc. where the examinee is required to recall, organize, and present specific information and usually construct a defensible answer or conclusion in a prescribed manner. a. Restricted responses are best suited for achievement tests and the assessment of lower order intellectual (knowledge, comprehension, and application) skills. b. The restricted response essay provides the examinee with guidance for constructing a response. c. It also provides information on how the item is scored. d. Oosterhof (1994, p. 113) suggests that it should take examinees 10 or fewer minutes to answer a restricted response essay item. Limit the number of words in a response to a maximum of 100 or 150. e. Examples (1) Compare and contrast each of the four types of reliability. (Restricted Response)
(2) Define each type of validity. (Restricted Response)

2. An extended response essay is one where the examinee determines the complexity and response length. a. An extended response essay is an effective device for assessing higher order cognitive (e.g., Bloom's analysis, synthesis, or evaluation) skills.


b. This item format is significantly influenced by writing ability, which may mask achievement deficits. c. Oosterhof (1994, p. 113) recommends that extended response essays not be used on tests given the limitations cited below. He does suggest that extended response items are useful as assignments. d. Examples (1) Illustrate how test validity and reliability would be established, using an original example involving at least one type of validity and two types of reliability. (Extended Response) (2) Using an original example, explain the process for conducting an item analysis study, including (a) a brief scenario description, (b) at least three sample items in different formats, and (c) the computation, interpretation, and application of relevant statistical indices. (Extended Response)

3. Strengths and limitations of essay items (Gronlund, 1998, p. 103; Kubiszyn & Borich, 1996, pp. 109-110; Oosterhof, 1994, pp. 110-112; Popham, 2000, pp. 266-268) are: a. Strengths (1) These formats are effective devices for assessing lower and higher order cognitive skills. The intellectual skill(s) must be identified and require the construction of a product that is an expected outcome of the exercise of the specified intellectual skill or skills. (2) Item writing time is reduced as compared to select response items, but care must be taken to construct highly focused items which assess the examinee's mastery of relevant performance standards. (3) In addition to assessing achievement, essay formats also assess an examinee's writing, grammatical, and vocabulary skills. (4) It is almost impossible for an examinee to guess a correct response. b. Limitations (1) Since answering essay format items consumes a significant amount of examinee time, other relevant performance objectives or standards may not be assessed. (2) Scores may be influenced by writing skills, bluffing, poor penmanship, inadequate grammatical skills, misspelling, etc. (3) Scoring takes a great amount of time and tends to be unreliable given the subjective nature of scoring.
B. Writing and Scoring Restricted and Extended Response Essays 1. Item writing guidelines (Gronlund, 1998, pp. 87-100; Kubiszyn & Borich, 1996, pp. 107-109; Oosterhof, 1994, pp. 101-104; Popham, 2000, pp. 264-266) are: a. Use extended essay responses to assess mastery of higher order cognitive skills. Use restricted responses for lower order cognitive skills. Use Bloom's Taxonomy or a similar classification system of cognitive intellectual skills to identify what cognitive skills are to be assessed. b. Use the appropriate verbs in preparing the directions, statement, or question to which examinees are to respond.


c. Ensure that examinees have the necessary enabling content and skills to construct a correct response to the prompt. d. Frame the statement or question so that the examinee's task(s) is explicitly explained. e. Frame the statement or question so that it requires a response to a performance objective or standard. f. Avoid giving examinees a choice of essays to answer. They will most likely pick the ones they know the answer to, thus adversely affecting content validity. g. Allow sufficient time to answer each essay item, if used on a test, and identify the point value. It's best to use restricted response essays for examinations. h. Verify an item's quality by writing a model answer. This should identify any ambiguity. 2. Item scoring guidelines (Gronlund, 1998, pp. 107-110; Kubiszyn & Borich, 1996, pp. 110-115; Oosterhof, 1994, pp. 113-118; Popham, 2000, pp. 270-271) are: a. Regardless of essay type, evaluate an examinee's response in relation to the performance objective or standard. (1) Ensure that the item's reading level is appropriate to examinees. (2) Evaluate all examinees' responses to the same item before moving on to the next. If possible, evaluate essay responses anonymously, i.e., try not to know whose essay response you are evaluating. (3) For high stakes examinations (i.e., tests with degree sanctions), have at least two trained readers evaluate the essay response. b. For restricted response essays, ensure that the model answer is reasonable to knowledgeable students and content experts, and use a point method based on a model response. Restricted essays are not recommended for assessing higher order intellectual skills. c. For an extended response essay, use a holistic or analytical scoring rubric with previously defined criteria and point weights. When evaluating higher order cognitive skills, generally evaluate extended essay responses in terms of (1) Content: Is the content of the response logically related to the performance objective or standard being assessed? Knowledge, comprehension, or application skills can be assessed depending on how the essay item was constructed. (2) Organization: Does the response have an introduction, body, and conclusion? (3) Process: Is the recommendation, solution, or decision in the essay response adequately supported by reasoned argument and/or supporting evidence? This is the portion of the response where analysis, synthesis, or evaluation skills are demonstrated. Again, be very specific as to which skill is being assessed and that the essay


item, itself, requires the examinee to construct a response consistent with the intellectual skill being assessed. d. When evaluating essays where a solution or conclusion is required, consider the following in constructing your scoring plan: (1) Accuracy or reasonableness: Will the posited solution or conclusion work? (2) Completeness or internal consistency: To what extent does the solution or conclusion relate to the problem or stimulus? (3) Originality or creativity: Is an original or creative solution advanced? While an original or creative solution might not be expected, when one is posited, credit should be given.
V. Statistical Strategies for Improving Select Response Test Items (Item Analysis) A. Item Analysis: Purpose 1. The purpose of analyzing item behavior is to select items which are best suited for the purpose of the test. 2. The purpose of testing is to differentiate between those who know the content and those who don't. This can be done by: a. Identifying those items answered correctly by knowledgeable examinees. b. Identifying those items answered incorrectly by less knowledgeable examinees. 3. There are two common item analysis (IA) indices: item difficulty (p-value) and the index of discrimination (D). B. Item Difficulty (p-value) 1. Item difficulty is determined by the proportion of examinees correctly endorsing (i.e., answering) a dichotomously scored item. This index has also been called the item mean. 2. The p-value is expressed as a proportion, with a range from 0.00 to 1.00. 3. Item format affects p-values, particularly where the format allows examinees to guess a response. For example: a. 37 x 3 = ________ (This format presents little opportunity to correctly guess--no choices.) b. 37 x 3 = ______ (This format presents more opportunity to correctly guess.) (1) 11.1 (2) 111 (3) 1111 (4) 11.11 4. Target p-values a. From the test designer's perspective, to maximize total test variance (needed for high reliability coefficients), each item p-value should be at or close to 0.50. Some recommend a range between 0.40 and 0.60 (Crocker and Algina, 1986, pp. 311-313). b. Oosterhof (1994, p. 182) recommends differing p-value targets based on item format. These are: (1) True-false and 2 option multiple choice, 0.85.


(2) Three option multiple choice, 0.77. (3) Four option multiple choice, 0.74. (4) Five option multiple choice, 0.69. (5) Short-answer and completion, 0.50. c. p-values are rarely the primary criterion for selecting items into a test; for tests designed to differentiate between those who know the content and those who don't, p-values should be of consistent, moderate difficulty. C. Item Discrimination Index (D) 1. Items with high discrimination ability are prized as they contribute to sorting examinee performance. Determining an item's discrimination power requires computing D, the item discrimination index. D is an important index when selecting items for a test whose purpose is to sort examinees based on knowledge. 2. Formula: a. D = Pu - Pl b. Term Meanings (1) D = discrimination index (2) Pu = proportion of upper (higher scoring) examinees who answered the item correctly (3) Pl = proportion of lower (lower scoring) examinees who answered the item correctly c. Example: For Item 1, Pu = 0.80 & Pl = 0.30, so 0.80 - 0.30 = 0.50, D = 0.50. 3. Determining Pu & Pl a. To determine Pu and/or Pl, select the upper 25-27% of examinees and the lower 25-27% of examinees. b. If the examinee group is large (n = 60), an alternative is to use the median as the breaking point, so that the upper and lower groups each contain 50% of the examinees. 4. Properties of D a. Ranges from -1.0 to 1.0. b. Positive values indicate the item favors the upper scoring group. c. Negative values indicate the item favors the lower scoring group. d. The closer an item's p-value is to 0.00 or 1.0, the less likely the item is to have a high positive D value. 5. Interpretive Guidelines (Crocker and Algina, 1986, pp. 314-315) a. D ≥ 0.40, well functioning item; keep as is. b. 0.30 ≤ D ≤ 0.39, little or no item revision needed. c. 0.20 ≤ D ≤ 0.29, marginal item, needs revision. d. D ≤ 0.19, eliminate or completely revise item. A brief computational sketch of these indices appears below.
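The following is a minimal computational sketch, written in Python, of how p-values and D might be calculated from dichotomously scored responses using upper and lower scoring groups of roughly 27% each. The response data and the names used (scores, item_analysis, group_fraction) are invented for illustration; they are not taken from the chapter or from any cited source.

# Minimal sketch of classical item analysis: the p-value (item difficulty)
# and the discrimination index D = Pu - Pl. Illustrative data and names only.

def item_analysis(scores, group_fraction=0.27):
    """scores: list of per-examinee lists of 0/1 item scores."""
    n_examinees = len(scores)
    n_items = len(scores[0])
    totals = [sum(examinee) for examinee in scores]

    # p-value: proportion of all examinees answering each item correctly
    p_values = [sum(ex[i] for ex in scores) / n_examinees for i in range(n_items)]

    # Upper and lower scoring groups (roughly 27% of examinees each)
    ranked = sorted(range(n_examinees), key=lambda e: totals[e], reverse=True)
    k = max(1, round(group_fraction * n_examinees))
    upper, lower = ranked[:k], ranked[-k:]

    # D = Pu - Pl for each item
    d_values = []
    for i in range(n_items):
        pu = sum(scores[e][i] for e in upper) / len(upper)
        pl = sum(scores[e][i] for e in lower) / len(lower)
        d_values.append(pu - pl)
    return p_values, d_values

if __name__ == "__main__":
    # Ten examinees by three items (1 = correct, 0 = incorrect)
    scores = [
        [1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 0, 1], [1, 1, 0],
        [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0],
    ]
    for i, (p, d) in enumerate(zip(*item_analysis(scores)), start=1):
        flag = "keep as is" if d >= 0.40 else "review"  # Crocker & Algina guideline
        print(f"Item {i}: p = {p:.2f}, D = {d:+.2f} ({flag})")

Run against the small made-up data set, the sketch simply lists each item's p-value and D so that the target p-values and interpretive guidelines above can be applied to them.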


D. Distractor Analysis (Special IA indices for multiple choice items) 1. The purpose of distractor analysis is to assess how well each multiple choice item response option (also called foils or distractors) attracts examinee endorsement. It is hoped that all foils attract some endorsement. 2. To determine an acceptable level of distractor functioning, one considers: a. How effective is the foil in attracting an incorrect endorsement? b. A distractor which does not attract any endorsement is poorly functioning. c. A distractor which attracts an incorrect endorsement from a high scoring student is fine (most likely an error), provided such endorsements are not excessive. d. It is desirable for the most likely incorrect response to have the 2nd highest endorsement level. If all incorrect foils are equally likely, then lower scoring examinee endorsements should be fairly consistent across the incorrect foils. e. High and low scoring students selecting the same incorrect foils is not desirable. The foil should be revised. 3. Many commercial item analysis software programs will only provide an all-examinees distractor analysis. However, understanding how distractors function is best achieved by reviewing high and low scoring student foil endorsements along with the all-examinees option. 4. Interpreting an Item Analysis Report a. Item 6 (option endorsement percentages by examinee group)

          High   Lower   All
A           7     13     12
B*         86     60     70
C           0      7      5
D           7     13     10
E           0      7      3
Omit        0      0      0

p-value = 0.70

D = 0.26

* = Correct foil

(1) Most distractors appear to function adequately. (2) For a five distractor item, the p-value is fine. (3) The D value suggests some revision. One might make foils C and D more attractive to lower scoring students. b. Item 13

          High   Lower   All
A           0      0      0
B           0      0      0
C*        100    100    100
D           0      0      0
E           0      0      0
Omit        0      0      0

p-value = 1.0

D = 0.00

* = Correct foil


(1) Unless the item is an ice breaker, it is a very poorly functioning item. It doesn't meet any targeted p-value and has no discrimination ability. The item will not contribute to test score variance. (2) The item should be discarded or completely revised, unless it is an ice breaker. c. Item 29

          High   Lower   All
A          36     41     24
B*         36     34     70
C           0      0      0
D          28     25      6
Omit        0      0      0

p-value = 0.70

D = 0.11

* = Correct foil

(1) The good p-value for a 4-foil multiple choice item suggests that the item has promise. However, the 0.11 D value indicates that the item has very little discriminating power. (2) Distractors A, B, and D attracted similar endorsements from both the higher and lower scoring groups. This suggests item ambiguity. (3) A revision strategy would be to make foil C more attractive and revise foils A, B, and D so that they better differentiate between those examinees who know and don't know the content. (A brief sketch for tabulating distractor reports of this kind appears at the end of this section.) 5. It should be clear that improving test items, regardless of format, requires a balance between targeted p-values and acceptable D values. Revision requires a thorough understanding of the content domain being tested and the mental skills required to correctly endorse an item. The item analysis study model advanced by Crocker and Algina (1986, pp. 321-328) is instructive in guiding test developers to efficiently and effectively revise test items. The elements of the model are: a. Decide whether you want a norm-referenced or criterion-referenced score interpretation. (1) Do you want a norm-referenced (NRT) or criterion-referenced test (CRT) score interpretation? (a) If the primary purpose of the test is to sort examinees based on what they know and don't, then the IA indices under consideration are appropriate. (b) If the primary purpose of testing is to make absolute judgments about an examinee's domain mastery, then other IA indices should be used. (2) Given how content validity is established, it can be argued that test results (e.g., scores) can be used to assess instructional efficacy and as a basis for initiating remedial instruction, as NRT and CRT tests are constructed in much the same manner.


b. Select relevant item performance indices. (1) For most IA studies, the parameters of interest are the p-value, D, and distractor analysis (for multiple choice items). (2) Comparing higher and lower scoring examinees is standard practice. c. Pilot the test items with a representative sample of the examinees for whom the test is intended. (1) Pilot test group size: (a) 200 examinees for local studies, or 5-10 times as many examinees as items. (b) Several thousand examinees are needed for large scale studies. (2) The above recommendations are for school, district-wide, regional, or state-wide testing programs. It is not expected that classroom teachers developing unit examinations would need a pilot test with large samples, but having a knowledgeable colleague review your test is recommended. d. Compute each item performance index. e. Select well performing items and/or revise poorly functioning ones. (1) Use target p-values and the item discrimination index criteria. (2) Crocker and Algina (p. 323) have described and resolved two common situations: (a) When there are more items than can be administered, the task is to select those items which make the most significant contribution to the desired level of test reliability and validity. Items with p-values between 0.40 and 0.60 are recommended as such items have been shown to reliably discriminate across virtually all ability groups. (b) If the item pool is not overly large and the test developer wants to keep every possible item, revise those items with low D values. (3) If necessary, repeat the pilot test with revised or newly written items. f. Select the final set of items. (1) Use the D value and p-value criteria to guide decisions about revising flawed items. Eliminate only as a last resort. (2) Look for unexpected or abnormal response patterns, particularly in distractor analysis. g. Conduct a cross-validation study to determine whether or not the desired results have been attained. (1) This is not usually done in classroom achievement testing. However, for reasonably high stakes tests, such as a common departmental course final, one should consider conducting a validation study. Such a study might involve using the same or similar items several times and then taking the best functioning items for a final version of the examination. (2) At this stage, the test developer's interest is to see how well items function a second time. Some items will likely be removed. However, the examination is largely constructed.
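As a companion to the item analysis reports above, here is a minimal sketch, again in Python, of how High, Lower, and All endorsement percentages for one multiple choice item might be tabulated. The response records and the names used (distractor_report, responses, totals, keyed) are invented for illustration and are not taken from the chapter or from any item analysis package.

# Minimal sketch: tabulating a distractor analysis report for one
# multiple choice item. Illustrative data and names only.
from collections import Counter

def distractor_report(responses, totals, keyed, group_fraction=0.27):
    """responses: option chosen ('A'-'E' or 'Omit') by each examinee.
    totals: total test score for each examinee (same order).
    keyed: the letter of the correct option."""
    n = len(responses)
    ranked = sorted(range(n), key=lambda e: totals[e], reverse=True)
    k = max(1, round(group_fraction * n))
    groups = {"High": ranked[:k], "Lower": ranked[-k:], "All": list(range(n))}

    options = ["A", "B", "C", "D", "E", "Omit"]
    labels = [opt + ("*" if opt == keyed else "") for opt in options]
    print("        " + "".join(f"{label:>7}" for label in labels))
    for name, members in groups.items():
        counts = Counter(responses[e] for e in members)
        row = "".join(f"{round(100 * counts[opt] / len(members)):>7}" for opt in options)
        print(f"{name:<8}{row}")

if __name__ == "__main__":
    # Ten examinees: the option each chose on this item and each one's total score
    responses = ["B", "B", "B", "A", "B", "D", "B", "A", "C", "B"]
    totals    = [ 48,  45,  44,  40,  39,  30,  29,  25,  20,  18]
    distractor_report(responses, totals, keyed="B")

The printed table mirrors the High/Lower/All layout of the reports above, so the same questions about foil functioning can be asked of it.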


VI. Constructing Direct Performance Assessments A. Introduction 1. In direct performance assessment (DPA), examinees are presented a task, problem, simulated or real world stimulus, or process where they are to construct either a verbal, written, or product response. The constructed response is then rated by a judge(s) (i.e., teacher, professor, etc.). Performance assessment is typically employed when a. Specific processes, skills (i.e., behaviors), outcomes (e.g., products), affective dispositions (e.g., attitudes, opinions, intentions), or social skills (e.g., self-direction, capability to work with others, manners, etc.) are to be assessed; b. Traditional forms of testing will not enable inferences based on direct performance observation. Recall that traditional testing is most efficient, valid, and reliable for testing lower order intellectual skills (i.e., knowledge, comprehension, and elementary applications); c. Complex intellectual skills are to be assessed, (i.e., involved applications, synthesis, and evaluation). d. Common assessment devices include checklists, rating scales, scoring rubrics, and portfolios. 2. Direct performance assessment inferences are highly context dependent. a. Assessment devices for products include charts, tables, maps, drawings, essays, experiments, projects, etc. b. Assessing higher order intellectual skills would include tasks which require the acquisition, organization, and use or application of information to real or simulated problems, scenarios, or opportunities. c. The assessment of psychomotor skills or the integration of intellectual and psychomotor skills include tasks which require the examinee to render an observable performance, which might take form as a speech, written communication, typing, dance, gymnastic routine, sales presentation, mock interview, or correctly executing a recipe. d. Disposition assessment is most often indirect and employs rating scales (also called indexes or scales). An examinee can individually respond to a rating scale to self-report motivation, attitude, opinion, persistence, or intention. Behaviorally oriented assessments, made by observation, can be used to assess examinee motivation, persistence, cooperation, social interaction skills, etc. 3. Popham (2000, pp. 280-281) has identified three characteristics of genuine performance assessment: a. Multiple evaluative criteria are used to assess performance. b. Pre-specified quality standards are used to classify performance. c. Humans (content experts) judge the acceptability of examinee performance.


B. Direct Performance Assessment Advantages and Disadvantages 1. DPA Advantages a. Allows for performance diagnoses or inferences which are impossible or at least very difficult to make with traditional testing strategies. b. The higher order intellectual skills, typically assessed through performance assessment strategies, are built on lower order intellectual skills. Competently designed and implemented tasks can be correctly solved in a variety of ways, thus allowing evaluative inferences. c. Very clear connections to teaching or training quality can be made. Concrete performance examples are produced, which allows for performance charting over time. d. Both the process of constructing the response and the response (i.e., product) itself can be assessed and evaluated. e. Dispositional benefits include an increased examinee sense of control over the learning, assessment, and evaluation processes; motivation also tends to be high when examinees are required to perform specified tasks. 2. DPA Limitations a. Performance assessments are very labor and time intensive to design, prepare, and organize, and to maintain records for. b. Performance has to be scored immediately if a process is being assessed. c. Scoring is susceptible to rater error. It is essential that raters be highly trained on the rating form and criteria so that they rate similar performance consistently. A plan for breaking ties is needed as well. Typically, two raters will rate performance where there is wide variance; a third rater will serve as the tie breaker. d. Complex intellectual or psychomotor skills are composed of several different but complementary enabling skills, which might not be recognized or assessed. Examinees will likely perform some enabling skills more proficiently than others. Critical enabling skills should be identified and specifically observed and rated. e. Other limitations affecting the use of performance assessment include time, cost, and the availability of equipment and colleagues. Due to these issues, performance assessment must be used to assess very highly relevant skills, which are teachable. C. Constructing a DPA 1. Before a performance assessment device is constructed: a. Senior examiners should determine what process, skill, product, or disposition is to be assessed and salient indicators which define realistic performance levels as well as the plan for scoring examinee performance. For example, in educational applications, learning outcomes are the place to start. b. Raters should be trained and practice until they are proficient. Only key indicators or primary performance traits should be assessed so that the raters don't become overwhelmed and ineffectual.


c. Examinees should be told what is expected of them and be given advance copies of the checklist, rating scale, or scoring rubric to be employed as well as variable definitions (or specifications) and descriptions of performance categories or levels. d. Ensure and document that examinees have mastered prerequisite knowledge and skills prior to launching a performance assessment. 2. DPA Construction a. Stiggins (1997) outlined a DPA construction process (1) What decisions (mastery, rank order, or a combination) are to be made and who is or are to make the decisions? (2) What performance is to be assessed in terms of content and skill focus, processes to be employed, and the work product? What are the performance criteria? (3) Design the exercise. Is the exercise structured or natural? Is the assessment unobtrusive? What is the amount of evidence needed? (4) Design the rating plan. Is the rubric holistic or analytical? Who rates performance? What is the scoring mechanism? b. Application: Appendix 5.3 (1) What decisions (mastery, rank order, or a combination) are to be made and who is or are to make the decisions. Application: Graduate students in an assessment course must successfully construct a direct performance assessment task using performance criteria prepared by the professor. (2) What performance is to be assessed in terms of content and skill focus, processes to be employed, and the work product? What are the performance criteria? Application: Ability to construct a direct performance assessment task is assessed across the dimensions of task description clarity, performance criteria, scoring system and authenticity. Students will work in teams to prepare a highly structured work product. Performance criteria are listed in the task specifications which are provided to each student prior to starting the project. (3) Design the exercise. Is the exercise structured or natural? Is the assessment unobtrusive? What is the amount of evidence needed? Application: The exercise is structured and is not unobtrusive. Each work team DPA will be reworked until performance criteria indicate successful mastery. (4) Design the rating plan. Is the rubric holistic or analytical? Who rates performance? What is the scoring mechanism? Application: An analytical rubric based on project specifications was developed. The project is assessed by the professor using a rating scale of four levels per item with each performance level defined.


E. Checklists & Rating Scales 1. Checklists a. A checklist is a list of specific, discrete actions to be executed by examinees and observed by a rater. Often, the actions are listed in the expected order of performance. Checklists typically employ a binary choice with a did-not-observe option. Checklists are well suited to score procedures, are time consuming to construct, fairly efficient to score, and highly reliable and defensible. Checklists provide examinees with quality feedback on performance. b. If you have ever taken a first aid or CPR class, at least a portion of your performance was most likely assessed with a checklist. If you've not taken a first aid or CPR class, you should, if you can. 2. Rating Scales a. Rating scales are efficient in assessing dispositions, products, and social skills. Rating scales are more difficult to construct than checklists, but tend to be both reliable and defensible. Rating scales are efficient to score and provide quality feedback to examinees. b. Rating scales are very similar to checklists but use multiple response options, e.g., four, five, or six points. The use of multiple point rating options adds information to the examinee regarding his or her performance and increases the discriminating power of the rating scale. c. It is essential that each response option be fully defined and that the definition be logically related to the purpose of the rating scale and be progressive (i.e., represent plausible examinee performance levels). Like checklists, rating scales tend to be unidimensional in that each assesses one characteristic of performance. However, related unidimensional rating scales are often combined into multidimensional rating scales. (1) See Appendix 5.6, Part A for an example of a unidimensional rating scale. When Parts A, B, and C are combined, a multidimensional rating scale is produced. Performance options are defined by key words. (2) Presented in Appendix 3.1 is a unidimensional rating scale with five point response options. d. Rating scales tend to be analytical in that they are more descriptive and diagnostic than checklists. For example, problem solving is thought of as a series of sequential but related steps and sub-steps. A rating scale will assess examinee performance across each key sub-step and step in the process. As with the checklist, the rating scale becomes the point of comparison and not other examinees. 3. Constructing and Scoring Checklists & Rating Scales a. Construction Guidelines for Checklists and Rating Scales (1) Ensure that the process, skill, product, disposition, or combination thereof, along with key indicators have been clearly and fully


specified. Use only the number of indicators necessary to allow for competent inferences to be made. Rate only the key indicators. (2) Make sure that there are sufficient resources (e.g., time, money, equipment, and supportive colleagues) to successfully implement the performance assessment. Decide a priori whether or not examinees are permitted reference materials, whether there is sufficient enabling knowledge to support the performance, and what the scoring criteria are. (3) Estimate the amount of time for virtually all examinees to complete the task. If sufficient time is not available, then reframe the task. (4) Use the fewest, most accurate words possible to define the points on the checklist or rating scale continuum. Avoid using redundant phrases. Telegraphic phrases, centered on key nouns and verbs, which are understood by each rater are recommended. (5) Ensure that the form, itself, is efficient to score. Rater responses should be recorded by circling a number or using an x or a check mark. For either checklists or rating scales, provide a convenient space to mark or circle. (6) When rating a process, individual items must be grouped in the correct sequence. For product assessments, items of similar content or stage of production must be grouped. (7) Keep item polarity consistent. For rating scales, placing the least desired characteristic or lowest rating on the left end of the continuum and the most desired quality or highest rating on the right end of the continuum is recommended. For checklists, use the most desired process or product quality. (8) Provide a space for rater comments beside each item or at the end of the checklist or rating scale. (9) Always train raters and have them practice so that they are proficient in using the checklist or rating scale. Also, after each practice session, debrief the raters so that all can understand each other's logic in rating an examinee's performance, and where necessary arrive at a rating consensus. Stress to raters the importance of using the full range of rating options on rating scales; many raters use only the upper tier of the rating continuum, which is fine provided all examinees actually perform at those levels. b. Scoring Checklists and Rating Scales (1) The binary choice (e.g., yes or no) can be weighted 2 or 1, respectively, and then summed to arrive at a total score and, if items are grouped, subtest scores as well. Avoid weighting the not observed option. For most checklist applications, the not observed option should never be endorsed. (2) The multiple point rating options, used in rating scales, can be weighted where 1 represents the lowest rating and 5 the highest if a five point scale is used. Avoid not observed options. After point weighting, total and/or subtest scores can be computed. (3) A common strategy is to place examinees into performance categories based on their ratings. For an example, see Appendix 3.1.
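The weighting and summing described in (1) and (2) is simple arithmetic; the brief sketch below, in Python, shows one way total and subtest scores might be computed. The item names, groupings, weights, and sample responses are invented for illustration and are not taken from the chapter's appendices.

# Minimal sketch: scoring a checklist (yes = 2, no = 1) and a five-point
# rating scale with subtest totals. Illustrative data and names only.

CHECKLIST_WEIGHTS = {"yes": 2, "no": 1}  # "not observed" is deliberately unweighted

def score_checklist(responses):
    """responses: dict mapping item id -> 'yes' or 'no'."""
    return sum(CHECKLIST_WEIGHTS[r] for r in responses.values())

def score_rating_scale(ratings, subtests):
    """ratings: dict mapping item id -> integer 1..5.
    subtests: dict mapping subtest name -> list of item ids."""
    subtest_scores = {name: sum(ratings[i] for i in items)
                      for name, items in subtests.items()}
    return sum(subtest_scores.values()), subtest_scores

if __name__ == "__main__":
    checklist = {"step 1": "yes", "step 2": "yes", "step 3": "no"}
    print("Checklist total:", score_checklist(checklist))  # 5

    ratings = {"content": 4, "organization": 3, "delivery": 5, "visuals": 4}
    subtests = {"substance": ["content", "organization"],
                "presentation": ["delivery", "visuals"]}
    total, by_subtest = score_rating_scale(ratings, subtests)
    print("Rating scale total:", total, by_subtest)  # 16, with subtest scores of 7 and 9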


F. Scoring Rubrics 1. Scoring rubrics are similar to rating scales but are more complex. A scoring rubric is (a) composed of multiple rating scales, which can be viewed as subtests that, when assembled together, represent the complete process, skill, or product to be rated (Appendix 5.5), or (b) a series of progressive performance descriptions which address multiple qualities simultaneously within a common progressive scale (Appendix 5.6). 2. Definition of Terms a. Rubric: The instrument for rating the constructed examinee response is called a rubric, which may be either holistic or analytical. b. Evaluative Criterion: Evaluative criteria or specifications are identified and are used to distinguish acceptable from unacceptable performance. c. Quality Definitions: Various performance levels are specified and defined so that differences in examinee performance are able to be determined. Each performance level needs its own definition. 3. Scoring Strategy is either holistic or analytic. a. Holistic: All evaluative criteria are considered but summarized into a single, overall quality judgment. (1) While fairly easy to adapt or write, its reliability tends to be weak and generalizability is thus limited. (2) This scale is subject to bias in scoring and rater inconsistencies. (3) Unless holistic rubrics are well defined and detailed, raters are likely to substitute their personal opinions or definitions. Each performance description must represent plausible examinee performance levels. These definitions should be written by senior examiners who are very familiar with the process, skill, or product being rated as well as examinee performance. See Appendix 5.8, Table 3. (4) The holistic rubric is fine if an overall appraisal of a specific ability can be achieved and makes sense, but it should be used on less critical processes, skills, or products and not used when high stakes decisions are to be made about an examinee's performance. b. Analytic: Each evaluative criterion is scored and contributes directly to an examinee's overall score. (1) Analytic rubrics tend to be richer in detail and description, thereby offering greater diagnostic value. This provides examinees with a detailed assessment of strengths and weaknesses. (2) If the focus of the rating exercise is to assist individual examinees to improve process or skill performance, then analytic rating is recommended, despite its time intensive nature and high cost. (3) Analytic rubrics should be based on a thorough analysis of the process, skill, or task to be completed. Performance definitions should then be


developed to clearly differentiate between performance levels. Checklists and rating scales are commonly used in analytic ratings. 5. Types of Scoring Rubrics a. Task-Specific Rubric (Appendix 5.3): This type of scoring rubric focuses on a particular task rather than the skill associated with that task. Task specific rubrics contribute to improving inter-rater reliability. Popham (2000, p. 290) recommends against their use. However, Haladyna (1997, p. 134) has observed: (1) Task-specific scoring rubrics are expensive to create, but if well constructed, they can be used repeatedly. (2) Task-specific rubrics are both valid and reliable and provide examinees with accurate performance information. b. Skill-Focused Rubric (Appendix 5.6): These scoring rubrics consider the skill being assessed. Haladyna's comments on task-specific rubrics apply to skill-focused rubrics. Popham (2000, p. 290) recommends the use of skill-focused rubrics. He argues that skill mastery is more instructionally relevant than task mastery. Popham provides guidelines for constructing a skill-focused rubric: (1) The rubric should consist of 3-5 evaluative criteria. (2) A teachable attribute or sub-skill should form the basis of the evaluative criterion. (3) The evaluative criteria should apply to any task designed to assess student skill mastery. (4) The format of the rubric should be well organized. c. Generic Rubric (Appendix 5.5, Parts A, B, or C): A generic rubric is used to assess the general aspect of a skill of interest. While generic rubrics improve the generalizability of results, Haladyna (1997, p. 134) has identified two problems with generic rubrics: (1) a single rating scale seldom leads to a reliable test result and (2) no ability [skill or task] is so simple that a single rating scale can cover it. 6. General Guidelines for Constructing Scoring Rubrics a. Determine what process, skill, or product is to be assessed. b. Define clearly the evaluative criteria and realistic performance levels. The performance categories must represent realistic levels of examinee attainment. See Appendices 5.3 and 5.5 for examples of performance categories or levels. c. Decide whether a holistic or analytic scoring strategy will be used as well as the type of scoring rubric. d. Develop a draft of the scoring rubric. Be sure to edit and proof it. e. Have qualified colleagues review, comment on, edit, and proof your scoring rubric. Train raters using role-playing and debriefing. f. Pilot-test the rubric. Look carefully for administration difficulties, unclear items (i.e., indicators), incomplete decision-making information, accuracy


of performance descriptions, etc. Repeat pilot testing until you are satisfied with the rubric's performance. g. Apply the rubric to assess examinee performance. h. Compute and interpret appropriate reliability indices. i. Evaluate your results and make changes in items (evaluative criteria) for future applications.

References

Alexander, P. A. (1996). The past, present, and future of knowledge research: A reexamination of the role of knowledge in learning and instruction. Educational Psychologist, 31, 89-92.

Airasian, P. W. (1997). Classroom assessment (3rd ed.). New York: McGraw-Hill.

Bloom, B. S., Engelhart, M. D., Furst, E. J., & Krathwohl, D. (1956). Taxonomy of educational objectives. Book 1: Cognitive domain. New York: Longman.

Crocker, L. & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, & Winston.

Deming, W. E. (1986). Out of the crisis. Boston: MIT Press.

Farnham-Diggory, S. (1994). Paradigms of knowledge and instruction. Review of Educational Research, 64, 463-477.

Gagne, R. M. (1985). The conditions of learning and theory of instruction (4th ed.). Chicago: Holt, Rinehart, & Winston, Inc.

Greeno, J. G., Collins, A. M., & Resnick, L. B. (1996). Cognition and learning. In D. Berliner & R. Calfee (Eds.), Handbook of educational psychology (pp. 15-46). New York: Macmillan.

Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Needham Heights, MA: Allyn & Bacon.

Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. Needham Heights, MA: Allyn & Bacon.

Hergenhahn, B. R. & Olson, M. H. (1997). An introduction to theories of learning (5th ed.). Upper Saddle River, NJ: Prentice-Hall, Inc.

Kimble, G. A. (1961). Hilgard and Marquis' conditioning and learning (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Kubiszyn, T. & Borich, G. (1996). Educational testing and measurement. New York: Harper Collins College Publishers.


Lyman, H. B. (1998). Test scores and what they mean (6th ed.). Needham Heights, MA: Allyn & Bacon.

Mitchell, R. (1996). Front-end alignment: Using standards to steer educational practice. Washington, DC: The Education Trust.

Oosterhof, A. (1994). Classroom applications of educational measurement (2nd ed.). New York: Merrill.

Paris, S. G. & Cunningham, A. E. (1996). Children becoming students. In D. Berliner & R. Calfee (Eds.), Handbook of educational psychology (pp. 117-146). New York: Macmillan.

Paris, S. G., Lipson, M. Y., & Wixson, K. K. (1993). Becoming a strategic reader. Contemporary Educational Psychology, 8, 293-316.

Popham, W. J. (2000). Modern educational measurement (3rd ed.). Needham Heights, MA: Allyn & Bacon.

Quellmalz, E. S. (1987). Developing reasoning skills. In J. Baron & R. Sternberg (Eds.), Teaching thinking skills: Theory and practice (pp. 86-105). New York: Freeman.

State of Florida (1996). Florida curriculum framework: Language arts PreK-12 Sunshine State standards and instructional practice. Tallahassee, FL: Florida Department of Education.

Stiggins, R. J., Griswold, M. M., & Wikelund, K. R. (1989). Measuring thinking skills through classroom assessment. Journal of Educational Measurement, 26(3), 233-246.

Woolfolk, A. (2001). Educational psychology (8th ed.). Boston: Allyn & Bacon.


Appendices

Appendix 5.1 outlines Bloom's Taxonomy of Intellectual Skills and selected specific thinking skills. Knowing what intellectual skill is required for examinees to correctly answer an item is critical in constructing that item. Higher-order intellectual skills (analysis, synthesis, and evaluation) require the use of specific thinking skills, some of which are described. The intellectual skill of application may also require the use of specific thinking skills.

Appendix 5.2 is an interpretive exercise example. Multiple choice items can be constructed such that they assess higher-order thinking skills. An interpretive exercise has a core prompt or scenario around which several multiple-choice items are constructed.

Appendix 5.3 is a task specific, analytical rubric which includes a detailed description of the task to be accomplished and the performance scoring rubric.

Appendix 5.4 is a simple rating scale that was used by team members to rate each other's contribution to the construction of a class project.

Appendix 5.5 is a skill focused scoring rubric used to rate student critiques of an article from a juried journal. It has a very brief task description, but most of the performance criteria are presented in the rubric. The score for each criterion is summed to reach a final performance rating on the assignment.

Appendix 5.6 is a three-dimensional (three subtest) skill focused, analytical scoring rubric measuring presentation performance.

Appendix 5.7 describes ethical test preparation strategies for examinees as well as guidance on test taking skills.

Appendix 5.8 presents a technically competent classroom achievement test proposal addressing all important criteria and processes presented in this chapter. It's an excellent summary case. The task description and rubric for the project follow the test proposal.


Appendix 5.1 Bloom's Intellectual Skill Taxonomy & Specific Thinking Skills A. Bloom's Taxonomy of Intellectual Skills 1. Bloom (1956, p. 201) defines knowledge to include the recall of facts, methods, processes, patterns, structures, settings, etc. Knowledge is stored in the brain; the purpose of measurement is to present a prompt which will cue the examinee to recall the stored knowledge. a. Kubiszyn and Borich (1996, p. 60) define knowledge to be what students must remember. b. Writing performance standards at the knowledge level is the most common practice in education and is done perhaps too much. Knowledge that is memorized tends to be forgotten rather quickly. It is essential for students to have this level of declarative knowledge as it is the basis for higher order intellectual skills. 2. Comprehension is the lowest of the higher order intellectual skills in the taxonomy. Students use the knowledge largely within the context in which it was taught or learned. Students are expected to translate knowledge from one form to another without losing its essential meaning; interpret knowledge so as to identify its central elements or ideas, and then make inferences, generalizations, or summaries but within the original context or application; or based on the knowledge learned, to extrapolate trends, implications, consequences, etc. but again within a defined context. 3. Application is the use of the newly learned information in either an extension of the learning situation or a new but related context. Neither the test item nor the context should clue the examinee or student as to what prior learning is to be applied. Procedural rules, technical principles, theories, etc. are examples of what must be remembered and applied. a. Examples include predicting a probable change in a dependent variable given a change in the independent variable, or diagnosing an automobile starter problem given prior experience with the same problem but with a different vehicle. b. The key distinction between application and comprehension is that examinees or students are required to perform what is or was comprehended in a new environment. They may apply abstract procedural knowledge to new or marginally related prompts, problems, or other stimuli. 4. Analysis is the breakdown, deconstruction, or backwards engineering of a communication, theory, process, or other whole into its constituent elements so that relationships and any hierarchical ordering are made explicit. Such an analysis reveals internal organization, assumptions, biases, etc. a. Performance standards at the analysis level contribute to the development or refinement of a student's or examinee's critical thinking skills. Tasks or questions built to the analysis level will take time and perhaps even


have more than one plausible answer or skill demonstration. To respond to analysis level questions or simulations, the student or examinee must (1) Deconstruct an argument, recognize unstated assumptions, separate fact from conjecture, identify motives, separate a conclusion from its supporting evidence, and identify logical contradictions or inferences; (2) Once the constituent parts of a communication, e.g., argument, evidence, or simulation, have been identified, relationships between those parts must be assessed. It may be necessary to revise or delete elements which are less critical or less related to the intent of the communication; or (3) The student or examinee may need to analyze how the communication was structured or organized, i.e., identify the organizing principles and techniques (e.g., form, pattern, etc.). b. Care should be taken so as not to confuse analysis with comprehension or evaluation. Comprehension centers on the content of the communication, regardless of form; analysis considers both. Evaluation involves a judgment as to merit or worth given content and form, as measured against either internal or external criteria. 5. Synthesis is embodied in the production of a unique communication, a plan or proposed set of operations, theory, etc. In effect, different elements are combined into a new whole. a. Synthesis differs from comprehension, application, and analysis in that synthesis tends to be more substantial and thorough, respecting the task. There is greater emphasis on creativity (uniqueness and originality) than in the other levels. Comprehension, application, and analysis tend to focus on a whole for better understanding whereas synthesis requires the student or examinee to assemble many different elements from many different sources so as to construct a whole that was not there before. Operations at the synthesis level rarely have only one correct answer. Assignments or tasks that require the student or examinee to function at the synthesis level enhance creativity, but to be effective a thorough knowledge of the content or skill domain is required. b. Producing a unique communication: This type of synthesis requires an original communication to inform an audience or reader about the author's ideas, feelings, experiences, etc. Influencing factors for such communications are the desired effects, nature of the audience, medium of communication, conventions and forms of the medium selected to convey the communication, and the student or examinee himself or herself. The student is fairly free to craft whatever content he or she wishes subject to the above influencing factors; this makes the product unique. c. Production of a plan or proposed set of operations: Students or examinees are required to construct a plan or order of operations (i.e., a procedure for doing or accomplishing something). The plan or procedure is the product, which must satisfy the requirements of the task, usually specifications or data which become the basis for the plan or procedure. What is produced


must meet the specifications and/or be consistent with the data. Typically, there is room for the student or examinee to include a personal touch, so that the product is unique. d. Derivation of a set of abstract relations: The student or examinee must construct a set of abstract relationships. There are two tasks usually associated. First, the student starts with concrete data or phenomena and must explain or classify what he or she started with. Examples include the periodic table, biological phyla, developing taxonomy of intellectual skills, positing a theory or hypotheses. Secondly, the student or examinee starts with basic propositions, hypotheses, or symbolic representations (as in math) and then deduces other propositions or relationships. The student must reason within a fixed framework. Examples include theory formulation, positing hypotheses based on data or other knowledge, modifying theory or hypotheses based on qualitative or quantitative data. The difference between the first and second tasks is that in the first, the task starts with concrete, typically quantitative, data and in the second, qualitative. 6. Evaluation involves the application of criteria and/or standards to methods, ideas, people, products, works, solutions, etc. for the purpose of making a judgment about merit or worth. These judgments are predicated upon internal and external criteria. a. Judgments from internal evidence: The evaluation focuses on the accuracy of the work, regardless of form (e.g., idea, solution, methods, etc.). Attention is given to internal (i.e., within the work) logic, consistency, and lack of internal flaws. Indicators include consistent use of terminology, flow, relationship of conclusions or hypotheses to the material presented, precision and exactness of words and phrases, reference citations, etc. Considered together, the indicators influence perceptions of accuracy and quality. b. Judgments from external criteria: The work must be evaluated in light of criteria drawn from its discipline, trade, or other appropriate source. A nursing work must be evaluated in terms of nursing criteria; art or literature in terms of the genre and governing conventions; or an assignment in light of its scoring rubric. For example, a rural development project might be evaluated on whether or not its means to its ends were effective (considering alternatives), economical, efficient, and socially acceptable. B. Specific (Selected) Thinking Skills 1. One can also think of the verbs such as describe, explain, compare and contrast as specific thinking skills. a. The course designer and/or teacher must recognize that to design and deliver effective instruction, the specific skills required by students or trainees must be built into design and delivery.

209

b. Specific thinking skills rely on the presence of intellectual skills in order for them to be developed and successfully executed. 2. Reasoning a. After reviewing the literature on frameworks for conceptualizing reasoning, Quellmalz and Hoskyn (1997) concluded that each presented four reasoning skills: analysis, comparison, inference and interpretation, and evaluation. (1) Analysis is much the same as described by Bloom (1956). When a whole is divided into its component elements, relationships among and between those parts and their whole emerge. McMillan (2004, p. 172) points out that examinees, who are able to analyze, can break down, differentiate, categorize, sort and subdivide. (2) Comparison entails the identification of differences and similarities. The learner compares, contrasts, or relates between and among explanations, data, arguments, assertions, or other objects of interest. (3) Inductive and deductive thinking gives rise to inference making (e.g., hypothesizing, generalizing, concluding, and predicting) and interpretation. Interpretation is based on the inferences drawn. (4) Evaluation according to Quellmalz and Hoskyn (1997) is very similar to critical thinking. b. See Elder and Paul (2005) for an easy to digest, practical discussion. 3. Critical Thinking a. Ennis (1987, p. 10) defined critical thinking as reasonable reflective thinking that is focused on deciding what to believe or do. Critical thinking is the ability to evaluate information, evidence, action, or belief in order to make a considered judgment as to its truth, value, and relevance. To assess critical thinking skills, interactive multiple choice exercises, extended response essays, and performance assessments are most suitable. b. An adaptation of Ennis (1987) framework is: (1) Clarify the problem, issue, or opportunity. Formulate an inquiry (e.g., proposition or question) within a relevant context. Ask questions or collect information which helps to clarify the problem, issue, or opportunity. (2) Collect more information. Assess the veracity of facts and claims made by information sources, including the sources themselves. Distinguish between relevant and irrelevant information, arguments, or assertions. Detect bias in explanations, facts presented, arguments, or assertions made by information sources. (3) Apply inductive and deductive reasoning to the information collected. Identify logical inconsistencies and leaps in deductive and inductive reasoning from, within, between, and among the explanations, facts presented, arguments, or assertions made by information sources. (4) Analyze and synthesize the collected information. Search for implied or unstated assumptions; vague or irrational explanations, arguments

210

or assertions; stereotypes; or name calling. Determine the nature of potentially critical relationships (e.g., coincidental, cause and effect, or spurious). (5) Make a judgment. Formulate alternative answers, solutions, or choices. Within the most suitable mix of costs, values, beliefs, laws, regulations, rules, and customs consider each alternative and its anticipated consequence. Make a judgment, but be prepared to justify, explain, and argue for it. c. See Elder and Paul (2004) for an easy to digest and practical discussion. 4. Decision Making a. A decision-making process may be diagramed as found in Figure 5.1.1.
Issue or Problem Identified Issue or Problem Researched Desired Outcome Identified Alternative Strategies Identified Alternative Strategies Evaluated

Strategy Selected

Strategy Implemented

Strategy Implementation Monitored & Adjusted

Issue or Problem Resolved or Process Repeats

b. Decision-making Process (1) The first two stages of the decision-making process require the decisionmaker to identify the existence of an issue or problem, and then research its causes, reasons for persistence, and impact. (2) Next, the decision-maker identifies his or her desired outcome. (3) Several strategies for attaining it are identified and evaluated as to its likelihood of success in producing the desired outcome. (4) Once alternative strategies are evaluated, one may be selected and implemented. If it is determined that no feasible corrective solution strategy exits, the decision-maker may stop the process. (5) Assuming a feasible corrective strategy is found, it is implemented, monitored and adjusted, as necessary. (6) After some predetermined time, cost, or other criteria, the issue or problem is declared resolved. If not resolved, the decision-making process repeats. References Bloom. B. S., Engelhart, M. D., Frost, E. J., & Krathwohl, D. (1956). Taxonomy of educational objectives. Book 1 Cognitative Domain. New York: Longman. Kubiszyn, T. & Borich, G. (1996). Educational testing and measurement. New York: Harper Collins College Publishers.

211

Elder, L. & Paul, R. (2005). Analytic thinking. (2nd ed.). Dillon Beach, CA: The Foundation for Critical Thinking. Ennis, R. (1987). A taxonomy of critical thinking dispositions and abilities. In J. Barton & R. Sternberg (Eds.), Teaching thinking skills (pp. 9-26). New York: W. H. Freeman and Company. McMillan, J. H. (2004). Classroom assessment: Principles and practice for effective instruction (3rd. ed.). Boston: Pearson. Quellmalz, E. S. & Hoskyn, J. (1997). Classroom assessment of reasoning strategies. In. G. D. Phye (Ed.), Handbook of classroom assessment (pp. 103-130). San Diego, CA: Academic Press.

212

Appendix 5.2 Interpretative Exercise Example The following essay was written by a student and is in rough-draft form. Read Mt. Washington Climb. Then answer multiple-choice questions 15 through 18.

Mt. Washington Climb


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 I had gotten up around four-thirty that cool summer morning, anxious to get on the road. My parents were still asleep, however, my mother had my lunch packed and waiting on the kitchen table. I double-checked my gear, grabbed my boots, backpack, and lunch, and left the house. Walking to my car, the first pale hint of red was entering the sky as the sun rose higher and higher. I was right on schedule and greatly anticipating a good day. It was a hot, dry summer morning when I reached the parking lot designated for hikers. The cars in the parking lot were mostly from other states. I opened the trunk of my car and grabbed my hiking boots. I had definitely chosen the right day for my first climb of Mt. Washington. I tied my boots, put on my pack, made sure the car was locked up, and walked over to the map displayed at the head of the trail. I studied the map for a few minutes and then started my six-mile journey to the summit. For the first two miles I walked slowly, enjoying the scenery. It was very beautiful out there. The birds were out, the air was crisp and clean, and a soft breeze tickled my ears. After an hour and a half had passed, I gradually picked up the pace. I reached an intersection in the trail at about eleven oclock. The sun was almost directly overhead, and I judged the temperature as about ninety degrees. I had three miles to go and I felt great. I drank a bottle of water before continuing. As I was about to. get up, a huge deer walked right out in front of me; I never even heard it. It was by far the most magnificent-looking animal I had ever seen. The deers fur was light brown, almost tan, and its antlers had strips of golden velvet that were torn and dirty. Just as soon as the deer was there, he was gone, and I was back on my way to the summit I walked cautiously among the trees for another hour. As I was walking, I noticed the sky got brighter and brighter. Soon I broke through the treeline, and I could see the summit. The sun glistened off the tower. Hundreds of people climbing toward the summit. I hesitated for a moment, awed by the view, and then scrambled over the rocks toward the summit. Beads of sweat ran down my face as I pushed toward the top. The summit was half a mile away, yet it seemed like two feet. My legs burned but nothing could have stopped menot a rockslide, an earthquake, an avalanchenothing. Determination filled my body and gave me phenomenal energy. What seemed like two minutes was forty-five and before I knew it I was on my knees at the summit. I had made it.

1. What is the correct punctuation edit for the sentence in lines 2 and 3 (My . . . table.)? A. My parents were still asleep; however, my mother had my lunch packed and waiting on the kitchen table. B. My parents were still asleep however; my mother had my lunch packed and waiting the kitchen table. C. My parents were still asleep, however, my mother had my lunch; packed and waiting on the kitchen table. D. My parents were still asleep, however, my mother had my lunch packed and waiting; on the kitchen table.
ID:27480 Mt. Washington A

2. Which is the correct revision of the sentence in lines 4 and 5 (Walking . . . higher.)? A. The first pale hint of red was entering the sky as the sun rose higher and higher, walking to my car.

213

B. As the sun rose higher and higher, walking to my car, the first pale hint of red was entering the sky. C. Walking to my car, I noticed the first pale hint of red entering the sky as the sun rose higher and higher. D. Walking to my car, the sun rose higher and higher as the first pale hint of red was entering the sky.
ID:88949 Mt. Washington C

3. Which is a sentence fragment? A. I opened the trunk of my car and grabbed my hiking boots. (lines 78) B. It was very beautiful out there. (lines 1213) C. Hundreds of people climbing toward the summit. (line 24) D. I had made it. (line 30)
ID:27490 Mt. Washington C

4. Which sentence does not add to the development of the essays main idea? A. The cars in the parking lot were mostly from other states. (line 7) B. I had definitely chosen the right day for my first climb of Mt. Washington. (lines 89) C. After an hour and a half had passed, I gradually picked up the pace. (lines 1314) D. Beads of sweat ran down my face as I pushed toward the top. (line 26)
ID:27487 Mt. Washington A

Source: New Hampshire Educational Assessment and Improvement Program, New Hampshire Department of Education. End-of-Grade 10 COMMON ITEMS, Released Items 2000-2001. Retrieved June 1, 2002 from: http://www.ed.state.nh.us/Assessment/2000-2001/2001Common10.pdf

Appendix 5.3 Task Specific, Analytical Scoring Rubric


Direct Performance Assessment Task Description Student work teams will construct a direct performance task which meets the standards in the attached Performance Task Quality Assessment Index (PTQAI). Each work team will develop a unit or mid-term performance task with a scoring rubric suitable for the 6 th 12th grade classroom. For this task, your work team will rely on the text, Internet research, team member experience, external consultants (e.g., senior teachers, etc.) and research and/or professional journals. The final work product is to be submitted electronically using Microsoft Word 2007 or a later edition. The professor will give one free read, using the Performance Task Quality Assessment Index (PTQAI) before grading. The work product will be assessed on four dimensions or traits: quality of the task description, clarity and relevance of the performance criteria, scoring rubric functionality, and task authenticity, i.e., how realistic the intended task is for examinees. The rubric is analytical during the formative phase of the task and holistic at grading, as a total score is awarded. Maximum points are 132. See the PTQAI. The task will be divided into two parts: task description and rubric. Students will submit a clear set of directions (PTQAI items 1-3), appropriate task description (PTQAI items 4-10), relevant performance criteria (PTQAI items 11-15), a suitable scoring rubric (PTQAI items 16-23), and be authentic (PTQAI items 24-33). The performance criteria may be embedded within the scoring rubric. The team must describe the classroom performance assessment context, using the attached Direct Performance Assessment Task Scenario.

214

Direct Performance Assessment Task Scenario


Before constructing the direct performance task, complete the scenario description to set the context within which it is set. Provide the following information about examinees and their school setting by either filling in the blank with the requested information or checking a blank as requested. Ensure that intended learning target content or skills are fully described; do not refer the professor to a web site or other document containing this information. Otherwise, the task will be returned until the learning target information is supplied as required. Examinee Characteristics 1) Ages: _______ (Average Age) 2) Grade Level: _________ (Specify)

3) Ethnic Mix (name ethnicity, percent of class): ______________________________________________________________________________________ __________________________________________________________ 4) Free/reduced Lunch: ________% of class 5) LEP: ______% of class

6) Exceptionalities present in class: ___________________________________________ 7) Classroom: Regular Ed: ________ 8) School Setting: Urban____ 9) School Ownership: Public: _____ Inclusive: _______ (Check one) Suburban_____ Rural_____ (Check one) Private: ______ Charter: _____ (Check one) High: ___ (Check one)

10) School Type: Elementary: ____ Middle: ____ 11) School Grade Range: ________

12) School Size: _______

13) Other Relevant Context Setting Information: ______________________________________________________________________________________ ______________________________________________________________________________________ Learning Target(s): Clearly state the intended learning target content and/or skills which are to be assessed in your extended performance task. Classroom Assessment Plan: Describe in a few paragraphs your general classroom assessment plan. Next, describe how your teams extended performance task would fit into that assessment plan and how you would use the information to improve classroom teaching and learning.

215

Performance Task Quality Assessment Index (PTQAI) Each direct performance task should meet the 33 standards presented in this index. Four dimensions of performance assessment are examined: the task description, performance criteria, the scoring system, and authenticity. For each standard, one of five performance levels or scores is possible: 0= Missing, 1= Does not meet standard, 2=Marginally meets standard, 3= Meets standard, or 4= Exceeds standard. Task Description Score 1. Directions are explicitly stated. 2. Directions are likely to be understood by examines. 3. Directions are likely to produce intended examinee behavior. 4. Critical learning target content/skills are integrated into the task. 5. The task allows for multiple plausible solutions. 6. Expected examinee activities (e.g., individual or group effort, internet research, interviews, etc.) required to complete the task are explicitly stated. 7. Resources needed to complete the task are explicitly stated. 8 Expected performance or work products are so described to enable examinee understanding. 9. The teachers role in relation to the task is fully described. 10. Scoring procedures are so described to enable examinee understanding. Performance Criteria Score 11. Performance criteria are explicitly stated (i.e., unambiguous). 12. Performance criteria are likely to be understood by examinees. 13. Performance criteria focus on important task dimensions. 14. Performance criteria are logically related to the task. 15. Performance criteria are directly observable, little or no inference making. Scoring System (Rubric) Score 16. The type of rating scale (holistic or analytical) is appropriate. 17. The scoring procedure is feasible, i.e., workable. 18. The scoring procedure minimizes scoring error. 19. The scoring procedure is likely to be understood by examinees. 20. Rating scale increments (e.g., numerical or qualitative) are suitable. 21. Performance level descriptions are logically related to the task. 22. Performance level descriptions are suitable given the task. 23. Performance level descriptions are likely to be understood by examinees. Authenticity Score 24. The task is feasible, but reasonably challenging for examinees. 25. The task requires examinees to employ suitable intellectual skills (e.g., analysis, synthesis). 26. The task requires examinees to use apt thinking skills (e.g., creativity, problem solving, etc.). 27. Examinees must apply learning target content and/or skills. 28. The task replicates or simulates an academic, personal, civic, or vocational event, experience, etc. examinees are likely to encounter. 29. Examinees are required to efficiently and effectively apply learning target content and/or skills to a suitably complex task. 30. Examinees must rehearse or practice required learning target content and/or skills to successfully complete the task. 31. Examinees must consult resources (e.g., texts, internet, libraries, people, etc.) to complete the task. 32. Examinees are provided feedback before graded final performance or work product submission. 33. Examinees must complete the task under suitable constraints (e.g., time, prior knowledge, resources, etc.). Total Score (Maximum = 132) Comments:

216

Appendix 5.4 Group Contribution Rating Scale


Read each statement carefully. Next, circle either yes or no if the indicator (behavior) was demonstrated by the group member or not. The ? indicates that the indicator was not observed. 1. The group members participation was focused on the task at hand. The group member usually exhibited a respectful demeanor. The group member contributed an acceptable quantity of data, e.g., research articles, URLs, books, etc., given the teams task. The quality of the group members data (e.g., research articles, URLs, books, etc.) contribution was high, given the task. The group members contribution of data (e.g., research articles, URLs, books, etc.) was relevant to the teams task. The group member acceptably met the teams deadlines. When required, the member exhibited appropriate mediating skills. The member followed team directions in an acceptable manner. The group member exhibited appropriate listening skills which assisted the team in accomplishing its task.

Yes Yes

No No

? ?

2. 3.

Yes

No

4.

Yes

No

5.

Yes Yes Yes Yes

No No No No

? ? ? ?

6. 7. 8. 9.

Yes

No

10. The team member was sufficiently flexible so as to enable the work group to complete the task at hand. 11. The team member demonstrated writing skills, which helped the work group meet its objective. 12. By providing constructive feedback to team mates, the member contributed towards accomplishing the teams task.

Yes

No

Yes

No

Yes

No

217

Appendix 5.5 Skill Focused Scoring Rubric Article Review Scoring Rubric
Names ______________________ Date Completed_____________ Total Score_____ of 100

Students will work in groups to review and critique a short article from a peer-reviewed electronic journal which reports on a single quantitative, qualitative, or mixed-method study. Do not select a literature review or a meta-analysis article. The group will critique the work based on evidence of reliability, validity, design suitability, and practical usefulness of the information. When asked to discuss the suitability and sufficiency of any supporting data, briefly summarize the data and then critique it using prevailing best practices, professional standards, your text, and other research. Cite references in an APA style reference list. The task is a three page maximum, excluding title and reference pages. Left margin headings: Intent, Type, Reliability/Authority, Validity/Verisimilitude, and Conclusion. The articles URL must be provided. Rating: Exceptional corresponds to an A (95-100%). Performance is outstanding; significantly above the usual expectations. Proficient corresponds to a grade of B to A- (83-94%). Skills and standards are at the level of expectation. Basic corresponds to a C to B- (75-82%). Skills and standards are acceptable but improvements are needed to meet expectations well. Novice corresponds to an F (< 74%). Performance is weak; the skills or standards are not sufficiently demonstrated at this time. 0 This criterion is missing or not in evidence.
Ratings Basic Proficient 7.5 8.2 8.25 9.4 7.5 8.2 18.5 19.9 8.25 9.4

Criteria Intent. The intent of the research is summarized succinctly and thoroughly in a style appropriate to the research design. State the purpose for which the research was conducted. Type. State whether the study was primarily or exclusively quantitative, qualitative, or mixed method. Briefly describe the study's methodology to prove your designation. Reliability/Authority. For an article which reports on a quantitative study, describe the data collection devices internal consistency, stability, equivalence, or inter-rater reliability; give coefficients if available. For an article reporting on a qualitative study, describe the authors qualifications and experience, the sponsoring organization, number of times the article has been cited in other research, cite other researchers opinions of the article, how consistent the reported findings are with other studies, etc.) Discuss the suitability and sufficiency of any supporting data. Validity/Verisimilitude. For an article which reports on a quantitative study, describe the data collection devices content, criterion-related, or construct validity; also comment on the control or prevention of internal research design threats to internal design validity. For an article reporting on a qualitative study, show the articles verisimilitude (i.e., appearance of truth). Do this by commenting on the logical analysis used by the authors; describe how consistent their findings and recommendations are with other researchers; comment on the consistency of the study's design and execution with other research on the topic, etc. When assessing logical analysis, consider logic leaps, the internal consistency of arguments, deductive and inductive reasoning, etc. Discuss the suitability and sufficiency of any supporting data. Conclusion. Provide an overall assessment of research reported in the article. Did the study meet prevailing best practices for its research design and data collection strategies? Why or why not? Describe the practical application of findings to professional practice. Writing and grammar skills are appropriate to the graduate level (including APA citations and references).

Novice 1.0 - 7.4

Exceptional 9.5 - 10

1.0 - 7.4 1.0 18.4

9.5 - 10

20 - 23.4

23.5 - 25

1.0 18.4

18.5 19.9

20 - 23.4

23.5 - 25

1.0 11.4

11.5 12.4

12.5 14.4

14.5 - 15

1 11.1

11.2-12.3

12.4-14.1

14.2-15

Total Earned Points: ____________

218

Appendix 5.6 Presentation Rating Rubric The presentation scoring rubric is presented below and is composed of three subtests or sections: Organization, Presence, and Technology. Odd number ratings reflect the mid-point between two even numbered scores. A. Execution: Score ____ (out of 48) 1. Introduction: Score ____ (out of 6) 0 Poor to nonexistent introduction, no attention grabber 2 Vague objectives or simply reads the problem statement 4 -- Good attention-grabber, weak objective foundation 4 -- Weak attention-grabber, clear objective foundation 6 Good attention-grabber, lays clear foundation of objectives Content: Score ____ (out of 8) 0 Wrong data, wrong problem, great leaps in logic, poor question anticipation 4 Some understanding of presentation content demonstrated, some incorrect terminology, a few small leaps in logic, fair question anticipation 6 -- Proficient understanding of presentation content demonstrated, little incorrect terminology, no leaps in logic, good question anticipation 8-- Exemplary understanding of presentation content demonstrated, no incorrect terminology, no leaps in logic, excellent question anticipation Questioning: Score ____ (out of 6) 0 Failed to answer questions, engaged in no discussion 2 -- Answered questions awkwardly, partial explanations, ineffective discussion 4 -- Answered questions somewhat effectively, with accurate explanations, poorly lead discussion(s) 6 -- Answered questions effectively with accurate explanations & effectively lead discussion(s) Communication Strategy: Score ____ (out of 6) 0 -- Inappropriate strategy, ineffectually executed 2 -- Appropriate strategy, but ineffectually executed 4 -- Appropriate strategy, proficiently executed 6 -- Appropriate strategy, expertly executed Use of Notes: Score ____ (out of 6) 0 -- Relies heavily on notes or prepared text, lost place several times 2 -- Relies moderately on notes or prepared text, lost place some times 4 -- Relies little on notes or prepared text, lost place once or twice 6 -- Doesnt rely on notes or prepared text, lost place only once Conclusion and Wrap-Up: Score ____ (out of 8) 0 Poor to nonexistent conclusion 2 Simple re-hash of main point(s), and not tied together 4 Main point(s) clarified, implications weakly presented, & tied together 6 --Main point(s) clarified, implications reasonably well presented & tied together 8 --Main point(s) clarified, implications well presented & tied together Professional Impression: Score ____ (out of 8) 0 -- Slip-shod appearance 2-- Amateurish in appearance 4 -- Less amateurish in appearance 6 -- Proficient in appearance 8 -- Expert in appearance

2.

3.

4.

5.

6.

7.

B.

Presence: Score ____ (out of 24) 1. Eye Contact: Score ____ (out of 4) 0 Makes little eye contact, 2 Makes moderate eye contact, focuses on one group or side of the room 4 Makes and holds eye contact with people all over the room,

219

2.

Use of Hands and Body Movement: Score ____ (out of 4) 0 Distracts or annoys audience or gives perception of being nervous 2 Somewhat comfortable but movements interrupt flow of presentation 4 Completely comfortable, appropriate hand-gestures, and non-awkward movements Voice & Inflection: Score ____ (out of 6) 0 Too hard to hear, sounds disinterested, did not project voice 2 Speaks in a monotone, sounds disinterested, or many ums and likes, inconsistent voice projection 4 -- Varies voice and inflection appropriately, conveys some enthusiasm, fairly consistent voice projection 6 Varies voice and inflection expertly, conveys enthusiasm, appropriate voice projection Articulation & Pace: Score ____ (out of 4) 0 -- Poorly articulated words and/or sentences, very distracting speaking pace, many awkward pauses 2 -- Mispronounces some words and/or mangles sentences, inconsistent speaking pace, few awkward pauses 4 -- Articulates words and sentences clearly, speaking pace appropriate, no awkward pauses Professional Appearance: Score ____ (out of 6) 0 -- Very inappropriately attired for presentation subject, audience, and/or environment 2 -- Somewhat inappropriately attired for presentation subject, audience, and/or environment 4 -- Appropriately attired for presentation subject, audience, and/or environment 6 -- Very appropriately attired for presentation subject, audience, and/or environment

3.

4.

5.

C.

Technology: Score ____ (out of 28) 1. Slides, Graphics, Figures, etc. Layout: Score ____ (out of 4) 0 -- Visual aids are poorly designed, cluttered, & many have missing labels 2 -- Visual aids are sometimes difficult to read; some slides are cluttered, and labeling was inconsistent. 4 -- Visual aids easy to read, uncluttered, and fully labeled Slides, Graphics, Figures Color: Score ____ (out of 4) 0 Visual aid colors distracting and confusing for presentation 2 -- Visual aid color appropriateness inconsistent for presentation 4 -- Visual aid colors appropriate for presentation Slide, etc. /Presenter Alignment: Score ____ (out of 4) 0Visual aids and presenter were frequently out-of-alignment 2Visual aids and presenter were occasionally out-of-alignment 4Visual aids and presenter were rarely out-of-alignment Slide, etc. Reading: Score ____ (out of 4) 0Presenter frequently read slides to audience 2Presenter occasionally read slides to audience 4Presenter rarely read slides to audience Slide Presentation Support: Score: _____ (out of 4) 0Visual aids ineffective & hard to follow 2 Visual aids moderately effective & easy to follow 4 Visual aids effective & easy to follow General Usage: Score ____ (out of 4) 0 -- Visual aids lacking or poorly utilized, very distracting 2 -- Visual aid effectiveness inconsistent and distracting, at points, during presentation 4 -- Visual aids utilized effectively and not distracting Usage Impact: Score ____ (out of 4) 0 -- Learning not enhanced 2 -- Learning enhancement inconsistent 4 -- Learning enhanced Total Score: ________

2.

3.

4.

5.

6.

7.

220

Appendix 5.7: Preparing Examinees for a Testing Session A. Preparing the Testing Environment 1. Each time an examination is administered, the testing environment should be similar to that of a nationally standardized test, unless a performance assessment is administered; modify these guidelines as appropriate. a. The room should be clean, uncluttered, and ordered. b. The temperature should be comfortable, adequately lighted, and well ventilated. c. Test materials should be ready and in sufficient number. d. Distractions such as noise or unnecessary movement should be avoided. e. Discipline should be maintained. f. Answer individual questions about items carefully. This can be substantially reduced if language and item construction are developmentally appropriate. 2. Examinees are going to be anxious and some will be frustrated; reassure as best as is possible. Breathing exercises may be helpful as will the progressive relaxation of muscles. If an examinees text anxiety appears to be chronically acute, refer the examinee for learning disability evaluation. 3. Special Learning Assessment Applications a. Open book tests are helpful if the testing emphasis is on application, not memorization. b. Unannounced tests are not recommended as examinees tend to underperform due to anxiety and inadequate preparation time. Study time + test preparation = high scores. c. For summative evaluation, selection, and placement, single test administrations are sufficient. For diagnostic and formative evaluation, frequent testing is recommended. Frequent testing motivates examinees and improves scoring. B. Ethical Test Preparation 1. Test-preparation training tends to produce higher test scores. To improve testtaking skills, a. Examinees need to understand the mechanics of test-taking, such as the need to carefully follow instructions, checking their work, and so forth. b. Examinees should use appropriate test-taking strategies, including ways in which test items should be addressed and how to make educated guesses. c. Examinees need to practice test-taking skills. 2. Successful examinees tend to understand the purpose for testing, know how the results will be used, comprehend the tests importance and its relevance to teaching and learning, expect to score well, and have confidence in their abilities.

221

3. Test preparation is effective. A critical question is, When does test preparation become teaching to the test? a. Mehrens and Kaminski (1989) suggest that providing general instruction on content and performance objectives without reference to those tested by an intended achievement (often, a standardized) test, and the teaching of test taking skills, are ethical. b. They argue that: (1) Teaching content and performance standards which are common to many achievement tests (standardized or locally developed); (2) Teaching content and performance standards which specifically match those on the test to be administered; and (3) Teaching to specifically matched content and performance standards and where practice follows the same item formats as found on the test to be administered is a matter of professional opinion. c. However, provision of practice or instruction on a parallel form of the test to be administered or on the actual test itself is unethical (Mehrens & Kaminski, 1989). d. Clearly, the cut lies between b(1), b(2), or b(3). Two guiding principles should be employed. (1) The justification for teaching test taking skills is that such instruction reduces the presence of random error in an examinees test score so that a more accurate measure of learning is taken. It seems that the teaching of specific content known to be included on a test to be administered and which is not part of the students regular instructional program is teaching to the test and thus artificially inflating the examinees test score. There should be very tight alignment between any standardized test and the regular instructional program or the standardized test should not be used to measure achievement. (2) The second guideline is the type of test score inference to be drawn. Mehrens & Kaminski (1989) write, the only reasonable, direct inference you can make from a test score is the degree to which a student knows the content that the test samples. Any inference about why the student knows that content...is clearly a weaker inference. (a) Teaching to the test involves teaching specific content which weakens the direct inference about what the examinee knows and can do. Testing is done to generalize to a fairly broad domain of knowledge and skills, not the specific content and item format presented on a specific test. (b) When one seeks to infer from a test score why an examinee knows what he or she knows, an indirect inference is being made. Indirect inferences are dangerous and often lead to interpretation errors. e. Applying these two guidelines leads one to conclude that (1) Providing general instruction on content and performance objectives without reference to those tested by an intended achievement (often, a standardized) test;

222

(2) Teaching test taking skills; and/or (3) Teaching content and performance standards which are common to many achievement tests (standardized or locally developed) are ethical. C. Test Preparation Principles Performed by Teachers for Students 1. Specific Classroom Recommendations are: a. If practice tests are used in the examinees classroom, make it a learning experience. Explain why the correct answer is correct and why the incorrect answers are incorrect. Use brainstorming and other strategies that promote a diversity of responses. Examinees should develop questions for class discussions and practice tests. Have the class (as in small groups) identify content from the text, notes, supplemental learning materials etc. which points towards the correct answer. b. Incorporate all intellectual skills into daily activities, assignments, class discussion, homework, and tests. Teach examinees the different intellectual skills; help them learn to classify course or class activities and questions by intellectual skill so that the various categories of thinking (recall, analysis, comparison, inference, and evaluation) are learned and practiced. c. Encourage examinees to explain their thinking, i.e., how they arrived at their answer, conclusion, or opinion. Practice discerning fact from opinion and the relevant from the irrelevant. Practice looking for relationships among ideas by identifying common threads. Have examinees solve verbal analogies, logic puzzles, and other classification problems. d. Apply learned information to new and different situations or issues. Encourage application of information by asking examinees to relate what has been learned to their own experiences. e. Ask open-ended questions which ensure that examinees do not assume that there is one correct answer. f. Assign time limits to classroom work and structure assignments, quizzes, or tests in formats similar to those found on standardized tests. 2. Improving Examinee Motivation Recommendations a. Expect good results and model a positive attitude. b. Use appropriate motivational activities and provide appropriate incentives. c. Discuss the tests purpose and relevance and how scores are to be used. d. Discuss with examinees the testing dates, times, content, time available to complete the test(s), test item format, and length of reading passages if any. e. Know and use correct test administration procedures while ensuring a quiet, orderly testing environment. f. Ensure that all skills to be tested have been taught and practiced to proficiency. Ensure that examinees know they have adequately prepared for the testing experience. Test-preparation is not a substitute for thorough, diligent preparation.

223

D. Test Preparation Principles Performed by Students 1. List the major topics for the entire course (if a comprehensive midterm or final) or for chapter(s) which contribute test content. Review all of relevant source material: textbook(s), notes, handouts, outside readings, returned quizzes, etc. 2. For each topic, in an outline, summarize key or critical information. It will also be helpful to construct a map of the major concepts so that you will note connections between key concepts. 3. Plan study time. Study material with which you have the most difficulty. Move on to more familiar or easier material after mastering the more difficult content. However, balance time allocations so that there is sufficient time to review all critical or key material before the test. 4. Study material in a sequenced fashion over time. Allocate three or four hours each day up to the night before the test. Spaced review is more effective because you repeatedly review content which increases retention and builds content connections. Dont cram. Develop a study schedule and stick to it. Rereading difficult content will help a student learn. Tutoring can be a valuable study asset. 5. Frame questions which are likely to be on the test in the item format likely to be found. If an instructor has almost always used multiple choice items on prior tests or quizzes, it is likely that he or she will continue to use that item format. Anticipate questions an examiner might ask. For example: a. For introductory courses, there will likely to be many terms to memorize. These terms form the disciplinary vocabulary. In addition to being able to define such terms, the examinee might need to apply or identify the terms. b. It is likely the examinee will need to apply, compare and contrast terms concepts or ideas, apply formulae, solve problems, construct procedures, evaluate a position or opinion, etc. Most likely, the examinee will need to show analytical and synthesis skills. 6. Form Study Groups a. Study groups tend to be effective in preparing examinees provided the group is properly managed. Time spent studying does not necessarily translate into leaning, and hence improved test scores. b. Study group benefits include shared resources, multiple perspectives, mutual support and encouragement, and a sharing of class learning aides (e.g., notes, chapter outlines, summaries, etc.). c. To form a study group, seek out dedicated students (i.e., those who ask questions in class and take notes), meet a few times to ascertain whether or not the group will work well together, and keep the group limited to five or six members.

224

d. Once started, meet regularly and focus on one subject or project at a time, have an agenda for each study session, ensure that logistics remain worked out and fair, follow an agreed upon meeting format, include a time-limited open discussion of the topic, and brainstorm possible test questions. 7. Other Recommendations a. For machine-graded multiple-choice tests, ensure the selected answer corresponds to the question the examinee intends to answer. b. If using an answer sheet and test booklet, check the test booklet against the answer sheet whenever starting a new page, column, or section. c. Read test directions very carefully. d. If efficient use of time is critical, quickly review the test and organize a mental test completion schedule. Check to ensure that when one-quarter of the available time is used, the examinee is one-quarter through the test. e. Don't waste time reflecting on difficult-to-answer questions. Guess if there is no correction for guessing; but it is better to mark the item and return to it later if time allows, as other test items might cue you to the correct answer. f. Don't read more complexity into test items than is presented. Simple test items almost always require simple answers. g. Ask the examiner to clarify an item if needed, unless explicitly forbidden. h. If the test is completed and time remains, review answers, especially those which were guessed at or not known. i. Changing answers may produce higher test scores, if the examinee is reasonably sure that the revised answer is correct. j. Write down formula equations, critical facts, etc. in the margin of the test before answering items. E. Strategies for Taking Multiple Choice Tests 1. Read the question. Advise examinees to think of the answer first as he or she reads the stem. By thinking of the answer first, he or she is less likely to be fooled by an incorrect answer. However, read all answer options before selecting an answer. 2. Do not spend too much time on any one question, as the item may actually have two correct answers, instead of one or no correct answer. a. In these cases select an answer at random, provided there is no penalty for guessing and wrong answers are not counted. b. Advise the examinee to circle the question number so he or she can go later if there is time. Go on to the next item. 3. If an examinee doesnt know the answer, then he or she should mark out options which are known to be incorrect. This increases the chances of a correct guess. Many multiple choice items, have only two plausible answer options, regardless of the number presented. 4. Do not keep changing answers. If the item seems to have two correct answers, select the best option and move along.

225

a. Based on testing research, the first answer is probably correct. An examinee is most likely to change a correct answer to an incorrect one. b. Only change an answer if absolutely sure a mistake was made. 5. After finishing the test, go back to circled items. a. Dont leave a testing session early, unless absolutely sure each item is correctly answered. Invest what time is available, to answer unanswered items. b. If an item is still not able to be answered, guess. You have a 25% chance of selecting the correct answer, if four-options are presented. The chances of a correct guess are higher if other answer options can be eliminated. 6. If the test is not properly constructed, the following tricks might raise scores: a. Where the examinee must complete a sentence, select the one option that fits better grammatically. b. Answer options which repeat key words from the stem are likely correct. c. Longer answer options tend to be correct more often than incorrect. d. If there is only one correct answer to an item, that answer is likely to be different. Thus if two or three answer options mean the same, they must be incorrect. e. If guessing and a typing error is noted in an answer option, select another option. 7. If two answer options have the same meaning, neither are likely to be correct. 8. If two answer options have opposite meanings, one is usually is correct. 9. The one answer which is more general than the others is usually the right answer. 10. For an occasional test item whose answer options end with all the above; select all of the above. If one answer option is incorrect, then all of the above cannot be a correct option. 11. The answer option which is more inclusive, (i.e., contains information which is also presented in other answer options), is likely to be the correct. 12. When you have studied and dont know the answer, select C if there is no guessing penalty. 13. If you do not lose points for incorrect answers, consider these guidelines for making an educated guess: a. If two answers are similar, save for a couple of words, select one of them. b. If a sentence completion stem, eliminate any possible answer which would not form a grammatically correct sentence. c. If numerical answer options cover a wide range, choose a number in the middle. 14. Answer options containing always, never, necessarily, only, must, completely, totally, etc., tend to be incorrect. 15. Answer options which present carefully crafted statements incorporating such qualifiers as often, sometimes, perhaps, may and generally, tend to be correct.

226

F. Strategies for Answering Other Test Formats 1. True-False Tests a. If any part of a statement is false, the answer is false. b. Items containing absolute qualifiers, as key words, such as always or never, often are false. 2. Open Book Tests a. Write down any formulas you will need on a separate sheet. b. Place tabs on critical book pages. c. If using notes, number each page and make a table of contents. d. Prepare thoroughly; these types of examinations are often very difficult. 3. Short Answer/Fill-in-the-Blank a. These test items, ask examinees to provide definitions (e.g., a few words) or short descriptions in a sentence or two. b. Use flashcards with important terms and phrases, missing or highlighted when studying. Key words and facts will be familiar and easy to recall. 4. Essay Tests a. Decide precisely what the question is asking. If a question asks you to contrast, do not interpret. b. If an examinee doesnt know an answer or if testing time is short, usually it's a good idea to answer those items that the examinee knows the answers to first. Then sort based on a combination of best guess and available points. Attempt to answer those items worth the most points and for which the examinee has the greatest amount of knowledge. c. Verbs used in essays include: analyze, compare, contrast, criticize, define, describe, discuss, evaluate, explain, interpret, list, outline, prove, summarize. Look up any unfamiliar words in a dictionary. d. Before writing, make a brief, but quick outline. (1) Thoughts will be more organized and the examinee is less likely to omit key facts and/or thoughts. (2) The examinee will write faster. (3) The examinee may earn some points with the outline, if he or she runs out of time. (4) Points are often lost as the examiner has little understanding of an examinees response due to its poor organization. Use headings or numbers to guide the reader. e. Leave plenty of space between answers. The extra space is needed to add information if time is available. f. When you write, get to the point. Start off by including part of the question in your answer. (1) This helps focus your answer. (2) Build upon your answer with supporting ideas and facts. (3) Review your answers for proper grammar, clarity and legibility. g. Dont pad answers; this tends to irritate examiners. Be clear, concise, and to the point. Use technical language appropriately.

227

References Mehrens, W. A. & Kaminski, J. (1989). Methods for improving standardized test scores: Fruitful, fruitless, or fraudulent? Educational Measurement: Issues and Practice, 8, (1), 14-22.

228

Appendix 5.8 A Traditional Classroom Achievement Test Proposal

Used with Permission

March 2011

229

ASSESSMENT SITUATION A determination needs is to be conducted on middle school students at KTW Academy, a public school located in Tampa, Florida, in regards to measuring writing skills. An achievement test will be issued at the end of the school year to all students at the 6th grade level who have successfully passed the grade level benchmark requirements as set forth by KTW Academy in order to determine writing placement for entering the 7th grade level. The specifications as identified in the paragraphs following will be used for the criterion-referenced interpretation of the test results, to make sure that the description of the test measured assessment domain is sufficiently accurate so that precise interpretations can be drawn from a students test performance (Popham, 2000, p. 99). This interpretation will be utilized to identify students for recommendation to the 7th grade Advanced Placement (AP) writing class. Participants The test will be administered to a group of 18-20 students per class with a total of approximately 150 students; 58% female and 42% male students. The student population is comprised of a community of 60% Caucasian, 20% Hispanic, 10% Asian, 5% African-American, and 5% other; with 25% highachievers, 65% average achievers, and 10% below-average achievers. The economic-status of this community is approximately 15% upper-middle class, 82% middle class, and 3% below-middle class (Ellington, 2011, p. 25). Mastery Determination The focus of the achievement test is to determine whether students will be placed in either an AP writing class or a traditional writing class for the 7 th grade. Students who successfully pass the test in the upper twenty-five percentile will be considered as having mastered writing and have ability to understand writing rules and concepts at a more in-depth knowledge and pace, and thus recommended for the AP writing class. The recommendation for this AP writing class will be based upon measures from this assessment by school officials to parents, who will then make the final determination by accepting or rejecting the recommendation. Students scoring below the upper twenty-five percentile will be automatically placed in the traditional writing class. The achievement test will include information pertaining to sentence organizational skills (noun, verb, and adjective placement), sentence fluency, word choice, and composition using a central idea or

230

theme. The test will be comprised of a select response of 15 test items, worth a 100 points total, consisting of the following test item formats: true/false, matching, multiple-choice, brief and extended response. A scoring rubric has been provided for assessment guidelines. The test will be administered to students in a classroom setting. The achievement test is an important instrument for determining the number of students that have achieved mastery skills in writing vs. those students who will need remedial instruction. The assessment assists instructors in analyzing a students knowledge and skills in regards to current instruction, focuses on student understanding of writing rules and concepts, and identifies areas of strengths and weaknesses to focus on in the upcoming year. Students who have been identified from this assessment as needing remediation will be placed in the traditional writing classroom, from which basic writing instruction and skills will be addressed. It is not necessarily the goal to expect that all 6 th grade students would be recommended for the AP writing class, but that the achievement test will aid KTW Academy in deciding what curricula might have been overlooked or not given as much focus as should have been in the 6 th grade. KTW Academys overall plan for middle school writing excellence is to implement a strategy which includes a sequence of immediate objectives and learning activities, leading to the achievement of the instructional goals. KTW Academy recognizes that students must move beyond the memorization process and utilize an in-depth thinking process. Making predictions, drawing conclusions and making inferences are examples of strategies that typically elicit higher-order thinking (Veeravagu, Muthusamy, Marimuthu, & Michael, 2010, p. 205). The overall strategy of the school is to emphasize the importance of writing in life situations. This achievement test identifies students who understand and can correctly apply writing rules and concepts, and assists instructors in preparing students to become successful and effective writers in the future. TEST CONTENT Students at KTW Academy have been instructed throughout the 6th grade year on the rules and concepts of writing a successful composition. The importance of effective writing is a principle instructional goal at KTW Academy as instructors have identified that many students, upon entering the 6th grade, utilize a simplistic writing style. Students have knowledge of sentence structure and reading comprehension. Over the course of the academic year, students have been instructed in composition

231

development using standardized writing guidelines. Students are expected to comprehend the process of writing, using the standardized guidelines. The purpose of a writing achievement test is to identify students who are advanced in their comprehension of the writing process. KTW Academy has identified these learning standards as the key components to measuring the ability of their students to write effectively. Benchmarks have been identified to serve as requirements which the student must accomplish in order to show mastery of the learning standard. Learning Standards The learning standards, as referenced in Table 1, identify the knowledge and skill development that is expected of students. Table 1: LEARNING STANDARDS 1) The student will correctly identify the types of sentences: a. Accurately identify four types of sentences: declarative, exclamatory, imperative, and interrogative; b. Accurately identify the two types of opening sentences: declarative and rhetorical; c. Compose an effective opening sentence; and d. Accurately identify the characteristics of a successful opening sentence. 2) The student will accurately demonstrate sentence fluency, word choice, and sentence agreement: a. Demonstrate sentence fluency through correct selection of word choice and grammatical organization; b. Identify characteristics of sentence agreement and fluency; c. Identify theme of an essay; d. Accurately identify correct parts of speech; and e. Accurately compose a sentence using descriptive adjectives, adverbs, action verbs. 3) The student will correctly identify characteristics of a writing checklist: a. Precisely demonstrate characteristics of successful composition from checklist; b. Create essay, identifying key characteristics of successful writing Learning Standard 1 defines the types of sentence formats which students must comprehend in order to construct effective essays. Students must understand the basic types of sentences, including opening sentences, to develop interesting and engaging compositions that entice readers. In order to be a successful writer, students must have knowledge of the characteristics of good sentence structure and creativity, in order to demonstrate understanding and knowledge of subject and project this information to the reader. Students mastery of this benchmark proves understanding of this learning standard. Learning Standard 2 defines the importance of sentence fluency, word choice, and sentence agreement by students in order to construct skillful essays. By demonstrating correct word choice and organizing sentence structure in the proper format, students convey to readers their higher level of understanding of the English language. In addition, students must be able to understand how to focus their

232

writings on a central theme or subject in order to demonstrate a clear message to the reader. Proper word choice of descriptive adjectives, adverbs, and verbs is important for articulating to readers the message of the story. Students mastery of this benchmark proves understanding of this learning standard. Learning Standard 3 defines the importance of the utilization of a checklist during the writing process. Student recollection and application of checklist items during the composition phase of the writing process aids in developing more specific and concrete examples of successful writing. A checklist is a guideline for improving writing skills while identifying mistakes; reworking of the composition is critical for effective writing. Students mastery of this benchmark proves understanding of this learning standard. Test Blueprint The Test Blueprint, as referenced in Appendix A, Table 1, utilizes Blooms Intellectual Skill Taxonomy of cognitive domain of learning to define knowledge and intellectual skills iwithin educational activities. These activities recall or recognize the specific facts, procedural patterns, and concepts that serve in the development of intellectual abilities and skills (Hale & Astolfi, 2000, p. 4). The test blueprint serves as an important indicator for measurement by sorting the learning standards into intellectual skills. This serves to identify the influence of test content as it relates to the number of test items. Test items are allocated for each learning standard, with the more critical standards having more allocated test items (Hale & Astolfi, 2000, p. 63). When studying the Test Blueprint, as referenced in Appendix A, Table 1, Items 1 and 2, which relate to sentence structure it correlates to Blooms Taxonomy Knowledge Intellectual Domain as these items pertain to the student having to recall information. The first and second learning standards have a greater number of test items, and therefore, the content relating to those learning standards has a greater weight of importance due to the student having to comprehend different parts of the process in order to prove mastery of the subject. Item 3 of the Test Blueprint requires application and creativity skills by requiring the student to demonstrate their understanding of writing process as a whole by constructing a paragraph.


Thinking and Intellectual Skill Alignment

Thinking and intellectual skills are identified in Appendix B, Table 1. The intellectual skills of Bloom's Taxonomy (knowledge, comprehension, application, analysis, and synthesis) are associated with specific thinking skills (problem-solving, decision-making, and creativity), and the table identifies the test items that unite the two types of skills. The level of test items designed according to Bloom's Taxonomy influences students' performance in answering comprehension items, such that analytical items require higher-level problem-solving skills to identify the correct response. Findings conclude that there is a relationship between the level of thinking processes needed and the students' ability to answer these test items correctly (Veeravagu, Muthusamy, Marimuthu & Michael, 2010, p. 205). As identified in Appendix B, Table 1, items pertaining to the intellectual skills of application, analysis, and synthesis require problem-solving, decision-making, and creativity skills in order to solve the problem. Students are required to identify types of opening statements, define terms, demonstrate sentence structure, and distinguish incorrect items through decision-making, problem-solving, and creative thinking skills. In addition, students are required to use problem-solving and creative skills when responding to brief and extended response items, which assess the intellectual capability of the student. Test items that involve problem-solving, decision-making, and/or creativity require a higher order of learning to process and retain the information, as opposed to memorization (knowledge), in which information is quickly forgotten (Hale & Astolfi, 2000, p. 64). It is important to consider the linguistic process of selecting words and constructing sentences, and the cognitive process of planning and translating ideas into text, when developing assessments for the writing process (McMaster, Du, Yeo, Deno, Parker & Ellis, 2011, p. 186). Intellectual and thinking skills have been appropriately matched to measure student knowledge and skill in determining placement in the writing program.

Learning Target Attainment Indicators by Test Items

Appendix C, Table 1 provides a detailed breakdown of intellectual and thinking skills as they relate to each learning standard and associated benchmark. Each benchmark is identified by the intellectual skill and, where applicable, the thinking skill used to assess student performance. For example, the Learning Standard 1 benchmark "Accurately identify four types of sentences: declarative, exclamatory, imperative, and interrogative" requires the student to recall the definition of each of these terms, along with examples associated with each definition, in order to determine which definition corresponds with a given sentence. For this learning standard, it was important to provide multiple test items that relate to specific benchmarks in order to measure student understanding, as students can easily become confused by the similarity between the sentence types. The Learning Standard 2 benchmark "Identify characteristics of sentence agreement and fluency" requires the student to analyze the test item using intellectual skills and to determine through decision-making which characteristics are not representative of sentences that are in agreement and fluent. Once again, the student must apply different skills, analytical and decision-making, to answer the item correctly. In the final analysis, the achievement test contains a variety of intellectual skills matched with thinking skills, providing a comprehensive examination of higher-order problem-solving skills, as illustrated in Appendix A, Table 1 and Appendix C, Table 1. The achievement test is composed of a sufficient number of test items associated with each learning standard to identify students who display advanced intellectual and thinking skills. Within each learning standard are a minimum of two benchmarks which highlight the knowledge and skills necessary to be placed in an advanced writing class. Learning Standard 1 comprises seven test items which require the student to demonstrate mastery in identifying sentence types, opening sentences, sentence structure, and composition structure. Learning Standard 2 comprises six test items which require the student to demonstrate mastery in sentence fluency, word choice, and sentence agreement. Lastly, Learning Standard 3 addresses the characteristics of good writing and requires students to demonstrate mastery of writing skills through response to an extended test item. The test items are reflective of each learning standard and the overall instructional goal.

TEST ORGANIZATION AND SCORING

The select-response test items were created from instructional material driven by the three learning targets identified in Table 1. The test items vary in complexity and difficulty: lower-level items require memorization of facts, while higher-level items require understanding of concepts and skills (Gronlund, 1998, pp. 60-74). Test items are designed with a logical correspondence between each test item and the skill being assessed. The test, which contains a minimum of two test items per learning standard, includes the following item formats:


True/False. The achievement test contains four true/false test items. These test items determine students' knowledge of principal features of writing, namely types of sentences and central theme. Students who understand the process of writing can distinguish between the different types of sentences and how to apply these types of sentences in writing. True/false test items are common for student assessment due to their ease of construction, efficient and objective scoring, and adequate sampling of the content domain; the main downside is that true/false test items are susceptible to guessing (Hale & Astolfi, 2000, p. 73). This item format was selected to identify students' comprehension of statements of fact.

Matching. The achievement test contains two matching test items. Matching test items evaluate students' understanding of the relationships between elements (Hale & Astolfi, 2007, p. 75). These test items provide more concrete evidence to instructors of student knowledge and understanding of the instructional material; in regard to writing, they show whether a student is able to correctly identify the different parts of speech and types of sentences. This item format allows the instructor to understand more fully the student's level of understanding of the instructional material as it relates to the writing process. This item format was selected to demonstrate student comprehension skills with regard to word choice and sentence selection.

Multiple-Choice. The achievement test contains six multiple-choice test items. The items are clear and structured for accurate measurement of student knowledge as well as intellectual and thinking skills. Multiple-choice items measure simple and/or complex learning standards, are highly structured, and provide examinee performance information (Hale & Astolfi, 2000, p. 70). The main limitation of multiple-choice test items is the amount of labor and time required to construct them. One such example from the achievement test is as follows:

1. Identify the four types of sentences.
   a. Declarative, Rhetorical, Exclamatory, Interrogative
   b. Declarative, Question, Imperative, Interrogative
   c. Declarative, Exclamatory, Imperative, Interrogative
   d. Statement, Command, Question
   e. Declarative, Exclamatory, Rhetorical, Imperative

This example test item addresses a complex issue, yet its structure is simple enough for students to comprehend and determine the correct response. Scoring is simple, quick, and efficient, and the instructor can easily measure student performance with this type of test item.
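A rough sketch of how this kind of objective scoring might be carried out is shown below. The answer letters and five-point values mirror the answer key in Table 2, but the function and the sample response set are only illustrations and are not part of the test materials; matching components could be scored the same way at 1.25 points apiece.

# Illustrative sketch: scoring select-response items against an answer key.
ANSWER_KEY = {  # item number -> (correct answer, point value)
    1: ("A", 5), 2: ("B", 5), 3: ("B", 5), 4: ("B", 5),
    7: ("B", 5), 8: ("C", 5), 9: ("B", 5), 10: ("C", 5),
    11: ("C", 5), 12: ("D", 5),
}

def score_select_response(responses):
    """Return points earned for a dict of item number -> circled letter."""
    earned = 0
    for item, (correct, points) in ANSWER_KEY.items():
        if responses.get(item, "").strip().upper() == correct:
            earned += points
    return earned

# A hypothetical student who misses items 8 and 12 earns 40 of 50 points.
sample = {1: "A", 2: "B", 3: "B", 4: "B", 7: "B",
          8: "D", 9: "B", 10: "C", 11: "C", 12: "A"}
print(score_select_response(sample))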


Brief and Extended Response. Two types of supply-response items are required of the student: brief response and extended response. These item types were selected to assess students' in-depth knowledge of writing and creativity. A student must use the intellectual skills of knowledge, comprehension, application, analysis, and synthesis to write effectively. In addition, students must use the thinking skills of problem-solving, decision-making, and creativity to demonstrate understanding of the subject. For example, a student must take a topic and develop a sentence which entices the reader with vivid imagery and word choice. In order to perform this task, students must apply their knowledge and creative skills to develop an inspiring sentence with the aid of a thesaurus. Essay formats assess a student's writing, grammatical, and vocabulary skills, which are important for determining lower and higher cognitive skills (Hale & Astolfi, 2007, p. 79). Students must also apply mental checklists during the writing process (check tense agreement, use action verbs, reduce "to be" verbs, etc.) in order to develop a rich composition.

The assessment test, as referenced in Appendix D, was developed with a consistent font size and an organized layout. The test was designed to match the appropriate level of difficulty based on the learning standards as they relate to intellectual and thinking skills. Students will have 30 minutes to complete the test. The test has been divided into four subtests: true/false, matching, multiple-choice, and brief and extended response. Directions are outlined for the student in a clear, concise format. Students are required to follow the instructions by circling the correct response or providing a complete response in written format. Test item point values are indicated within each subtest. Test items with similar content are grouped together and contained on the same page (Hale & Astolfi, 2007, p. 64). True/false and multiple-choice test items require the student to circle the correct response. Matching test items require the student to match the elements by placing the correct letter in the space provided. Brief and extended response items require the student to compose a sentence or essay as the response. An answer key, as referenced in Table 2, has been provided with sample responses and explanatory information. A grading rubric, as referenced in Table 3, has been provided as a guideline for measuring student performance.


The test will be administered to the students in a classroom setting, with the test initially distributed face down. The students will be instructed to start their test, and the instructor will monitor the time allowance. Upon completion of the time allowance, all papers will be collected. Students who complete their test early will be allowed to submit their test to the instructor and will then be asked to wait quietly while other students finish their test.

Table 2: ANSWER KEY (Writing Test)

Test Item     Point Value     Correct Answer
1             5               A
2             5               B
3             5               B
4             5               B
5.1-5.4       1.25 each       C, B, D, A (four matching components worth 1.25 points each, for a total of 5 points)
6.1-6.4       1.25 each       C, D, A, B (four matching components worth 1.25 points each, for a total of 5 points)
7             5               B
8             5               C
9             5               B
10            5               C
11            5               C
12            5               D

Test Item 13 (10 points). The student must modify the sentence "The boy saw a toy on the shelf." by incorporating adjectives, adverbs, and action verbs to describe the subject. Example: "The young, curious boy spied a small toy on the shelf in the bookstore." See the rubric following this answer key for grading guidelines regarding partial/full credit.

Test Item 14 (10 points). The student must create an opening sentence that is either declarative, which makes a statement, such as "Florida is famous for its white, sandy beaches," or rhetorical, which asks a question the reader might not be able to answer, such as "Why was the first battle of the Civil War so important to the Union army?" See the rubric following this answer key for grading guidelines regarding partial/full credit.

Test Item 15 (20 points). The student is to compose an essay, with a minimum of five sentences, describing the characteristics of a good checklist. Characteristics to be included: enticing opening sentence (declarative or rhetorical), clearly written essay, piques curiosity, verb tense agreement, active voice, concise wording, reduction of weak/slang/repetitive words, descriptive sentence structure, and central idea. See the rubric following this answer key for grading guidelines regarding partial/full credit.

Table 3: GRADING RUBRIC (Test Items 13-15, Brief & Extended Response)
Point values are indicated as follows: Brief Response (B), Extended Response (E).

Points            Label                         Label Definition
B = 2.5, E = 5    Too Vague                     Student response is inaccurate; lacking in description and creativity; little change to the original writing prompt; lacking direction in the writing process.
B = 5, E = 10     Correct Specific Content      Student response incorporates little creativity; word choice and sentence structure are not suitable; needs improvement and direction in the writing process.
B = 7.5, E = 15   Needs Slight Improvement      Student integrates personal description and creativity; needs refinement of word choice and sentence structure; needs slight improvement and direction in the writing process.
B = 10, E = 20    Accurate Writing Structure    Student accurately incorporates personal descriptions and creativity; no need for improvement in the writing process.
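One way the subtest points and rubric scores might be combined into a placement decision is sketched below. The possible-point totals follow from Table 2 (20 + 10 + 30 + 20 + 20 = 100 points), but the 85 percent cutoff is only an assumed example; the report itself does not state a numeric placement cutoff.

# Illustrative sketch: combining subtest points into a total and a decision.
POSSIBLE_POINTS = {
    "true_false": 20,          # 4 items x 5 points
    "matching": 10,            # 8 components x 1.25 points
    "multiple_choice": 30,     # 6 items x 5 points
    "brief_response": 20,      # items 13 and 14, 10 points each (rubric scored)
    "extended_response": 20,   # item 15 (rubric scored)
}

def placement_decision(earned, cutoff=0.85):
    """earned: dict of subtest -> points awarded; cutoff is an assumed example."""
    total_possible = sum(POSSIBLE_POINTS.values())   # 100 points
    percent = sum(earned.values()) / total_possible
    label = "advanced placement" if percent >= cutoff else "standard placement"
    return percent, label

print(placement_decision({"true_false": 20, "matching": 8.75,
                          "multiple_choice": 25, "brief_response": 17.5,
                          "extended_response": 15}))  # (0.8625, 'advanced placement')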

TEST RELIABILITY AND VALIDITY

Statistics have indicated that many students are not reaching proficient writing levels. Early identification and intervention require assessment tools that provide reliable and valid indicators of students' current performance levels (Coker & Ritchey, 2010, p. 175).

Reliability

Reliability is an indicator of test consistency over a period of time. There are four different types of reliability. Internal consistency reliability (ICR) is the indicator of reliability used when a measure is administered to a group only once. With ICR, test item responses are correlated with the total test score (Hale & Astolfi, 2007, p. 34). KTW Academy has elected to use the ICR method because this test is administered one time, at the end of the school year, and conclusions will be based on student performance on the subtests as they relate to the overall content.
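In symbols, the single-administration internal consistency coefficient discussed below, Cronbach's alpha, is conventionally written as shown here; the formula is the standard one, and the notation is ours rather than the report's:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right)

where k is the number of items, \sigma_{i}^{2} is the variance of item i, and \sigma_{X}^{2} is the variance of examinees' total test scores.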


For this assessment, the achievement test will be administered one time, and reliability will be estimated with Cronbach's alpha. Cronbach's alpha is a measure of reliability for measuring students' knowledge, skill, or aptitude (Chen, 2010, p. 192). A coefficient of at least .70 will be the criterion for determining whether the instrument is sufficiently reliable. Cronbach's alpha reflects a lower-bound estimate of the true reliability coefficient because not all items are parallel (Hale & Astolfi, 2007, p. 42). The consistency in reliability coefficients across studies supports the use of sentence writing tasks for obtaining reliable indices of writing performance (McMaster, Du, Yeo, Deno, Parker & Ellis, 2011, p. 191).

Factors that could threaten test reliability include group homogeneity, time allowance, test length, examinee guessing, boredom, etc. (Hale & Astolfi, 2007, p. 37). If, for example, the test is too short, the reliability coefficient will be low. It is important to provide students with sufficient time to complete the test so that the results accurately reflect the content domain and the skill set; students are also instructed to be well rested and prepared for the test. The writing achievement test allows a sufficient amount of time for students to complete the exam, along with an adequate test length and variety in test items to reduce boredom and promote thoughtful, reflective decision-making and problem-solving. Tests are reviewed and verified by trained staff to ensure accuracy in scoring.
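A minimal computational sketch of this reliability check is given below. The examinee-by-item score matrix is hypothetical, and the function simply applies the standard alpha formula shown earlier, with the stated .70 value as the decision criterion.

# Illustrative sketch: Cronbach's alpha from an examinee-by-item score matrix.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array-like, rows = examinees, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses from five examinees on four dichotomously scored items.
data = [[1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
alpha = cronbach_alpha(data)
print(round(alpha, 2), "meets the .70 criterion" if alpha >= 0.70 else "falls below .70")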


Validity

Validity is the accuracy of the measurement. A measure must be reliable before it can be valid, and a measure can be reliable but not valid (Hale & Astolfi, 2007, p. 43). Content validity is based on the extent to which a measurement reflects the specific intended domain of content; the test must aim to provide a true measure of the content domain it is intended to measure (Hale & Astolfi, 2007, p. 44). To the extent that the test measures external knowledge and other skills at the same time, it will not be a valid test (Chen, 2010, p. 194). The criterion-related validation process determines whether there is evidence of a relationship between the test score and criterion performance (Hale & Astolfi, 2007, p. 46). The process involves selecting the appropriate standard, administering the test, recording the scores, obtaining measurements from each student, and determining the relationship between the standard and the test scores. The achievement test measures writing skills and demonstrates validity by incorporating skills relating to writing, including sentence type, sentence structure, sentence fluency, and composition. A plan and strategy were developed prior to test construction to ensure validity and to ensure that the test was representative of the domain of content. The process begins with learning objectives, which outline the specific knowledge, skills, or attitudes students should be able to demonstrate upon completion of instruction. A test blueprint was developed to represent these learning standards, as referenced in Appendix A, Table 1. The Test Blueprint is further validated by Appendix B, Table 1 and Appendix C, Table 1, which show the relationships between the skills as they relate to the learning standards. In order for content validity to be evident, the criterion relationships which compose the construct must be present, along with an instrument that measures them (Hale & Astolfi, 2007, p. 48). In summary, the test items demonstrate content validity because they align with the content of instruction to measure student knowledge and skills. In order to demonstrate construct validity, there must be evidence that a relationship exists and that the instrument measures it (Hale & Astolfi, 2000, p. 48). Construct validity is the extent to which the domain of knowledge, skill, or affect that the test is supposed to represent is, in fact, adequately represented (Popham, 2000, p. 114). These factors indicate, overall, that this achievement test is reliable and valid for assessing student skills in writing.

References

Chen, C. (2010). On reading test and its validity. Asian Social Science, 6(12), 192-194.

Coker, D., & Ritchey, K. (2010). Curriculum-based measurement of writing in kindergarten and first grade: An investigation of production and qualitative scores. Exceptional Children, 76(2), 175-193.

Ellington, K. (2011). Accountability, research, and measurement. Retrieved from http://www.fldoe.org/arm/

Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Needham Heights, MA: Allyn & Bacon.

Hale, C. D., & Astolfi, D. (2007). Measuring learning and performance: A primer. Retrieved from http://www.CharlesDennisHale.com

McMaster, K., Du, X., Yeo, S., Deno, S., Parker, D., & Ellis, T. (2011). Curriculum-based measures of beginning writing: Technical features of the slope. Exceptional Children, 75(2), 185-206.

Popham, W. J. (2000). Modern educational measurement (3rd ed.). Boston: Allyn & Bacon.

Veeravagu, J., Muthusamy, C., Marimuthu, R., & Michael, A. (2010). Using Bloom's taxonomy to gauge students' reading comprehension performance. Canadian Social Science, 6(3), 205-212.


Appendix A
Table 1: TEST BLUEPRINT (test item numbers by learning standard and intellectual skill)

Learning Standard 1 (the student will identify the types of sentences; characteristics of good sentence structure; compose an opening sentence):
   Knowledge: 1, 10; Comprehension: 3, 4, 5; Application: 14; Analysis: 8; Synthesis: none

Learning Standard 2 (demonstrate sentence fluency, word choice, theme, parts of speech, and sentence agreement):
   Knowledge: 6, 7; Comprehension: 9, 11; Application: 13; Analysis: 12; Synthesis: none

Learning Standard 3 (identify characteristics of a writing checklist):
   Knowledge: none; Comprehension: none; Application: none; Analysis: none; Synthesis: 15

Appendix B
Table 1: THINKING & INTELLECTUAL SKILL ALIGNMENT (test item numbers by intellectual skill and specific thinking skill)

Intellectual Skill      Problem-Solving      Decision-Making      Creative Thinking
Knowledge
Comprehension
Application             7                                         14
Analysis                                     2, 8, 9, 11, 12
Synthesis                                                         13, 15


Appendix C
Table 1: LEARNING TARGET ATTAINMENT INDICATORS BY TEST ITEM

Learning Target 1: The student will correctly identify the types of sentences.
   1.a Accurately identify four types of sentences: declarative, exclamatory, imperative, and interrogative. Intellectual skill: Knowledge, Comprehension. Test items: 3, 4, 6, 10.
   1.b Accurately identify the two types of opening sentences: declarative and rhetorical. Intellectual skill: Knowledge, Comprehension. Test item: 1.
   1.c Compose an effective opening sentence. Intellectual skill: Application. Thinking skill: Creativity. Test item: 14.
   1.d Accurately identify the characteristics of a successful opening sentence. Intellectual skill: Analysis. Test item: 8.

Learning Target 2: The student will accurately demonstrate sentence fluency, word choice, and sentence agreement.
   2.a Demonstrate sentence fluency through correct selection of word choice and grammatical organization. Intellectual skill: Application. Thinking skill: Problem-Solving. Test item: 9.
   2.b Identify characteristics of sentence agreement and fluency. Intellectual skill: Analysis. Test item: 2.
   2.c Identify the theme of an essay. Intellectual skill: Analysis. Thinking skills: Decision-Making, Problem-Solving. Test item: 11.
   2.d Accurately identify correct parts of speech. Intellectual skill: Comprehension. Test item: 5.
   2.e Accurately compose a sentence using descriptive adjectives, adverbs, and action verbs. Intellectual skill: Synthesis. Thinking skill: Creativity. Test item: 13.

Learning Target 3: The student will correctly identify characteristics of a writing checklist.
   3.a Precisely demonstrate characteristics of successful composition from a checklist. Intellectual skill: Analysis. Thinking skill: Problem-Solving. Test item: 12.
   3.b Create an essay identifying key characteristics of successful writing. Intellectual skill: Synthesis. Thinking skill: Creativity. Test item: 15.


APPENDIX D
KTW ACADEMY MIDDLE SCHOOL WRITING TEST
GRADE LEVEL 6

Name: __________________________________________    Date: ________________
Instructor: ________________________________________
Time Allowance: 30 Minutes

DIRECTIONS. Read all instructions carefully.

You will have 30 minutes to complete the test. Use your best judgment and do not guess. The point value for each question is indicated at the beginning of each section. Submit your test to your instructor when completed. Blank paper and a thesaurus have been provided for your use. Good Luck!

Part I: True/False. Indicate whether the following statements are true or false. Circle the best answer for each item. (5 points each)

1. The two types of opening sentences are Declarative and Rhetorical.
   a. True   b. False

2. A theme of an essay is defined as a descriptive idea.
   a. True   b. False

3. The sentence "You are a great writer!" is called an imperative sentence.
   a. True   b. False

4. The sentence "Go get that book on the shelf." is called a declarative sentence.
   a. True   b. False


Part II: Matching. On the line to the left of each item in Column A, write the letter of the matching item in Column B. Each item in Column B may be used no more than once. (1.25 points each)

5. Parts of Speech

   Column A                            Column B
   _____ 1. Adjective                  a. or
   _____ 2. Adverb                     b. yesterday
   _____ 3. Prepositional Phrase       c. wild
   _____ 4. Conjunction                d. over the fence

6. Types of Sentences

   Column A                            Column B
   _____ 1. Declarative                a. Command
   _____ 2. Exclamatory                b. Asks a question
   _____ 3. Imperative                 c. Tells the reader something
   _____ 4. Interrogative              d. Shows emotion/excitement

Part III: Multiple-Choice. Read the instructions for each test item carefully. Circle the best answer. (5 points each)

7. Identify the underlined word choice in the sample sentence.

   A crazy, red car speeds _quietly_ through the crowded city streets.

   a. Descriptive adjective
   b. Descriptive adverb
   c. Descriptive predicate
   d. Descriptive noun

8. Read the opening sentence of J.R.R. Tolkien's novel, Lord of the Rings, and answer the following question. Circle the correct response.

   When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.

   What type of quality is NOT part of this opening sentence?

   a. Clearly written
   b. Descriptive word choice
   c. Confusing and wordy
   d. Piques curiosity
   e. Active voice

9. A "before" sentence is provided below in italics. Circle the response that identifies a sentence demonstrating sentence fluency, effective word choice, and interest to the reader.

   My bear sits on a shelf and my bear is a people-watcher.

   a. My bear sits on a shelf but my bear is not a people-watcher.
   b. My old, musty Paddington bear sits on a shelf watching people stroll by.
   c. My bear sits and my bear is a people-watcher.
   d. The old bear sits on a shelf.

10. Identify the four types of sentences. Circle the correct response.

   a. Declarative, Rhetorical, Exclamatory, Interrogative
   b. Declarative, Question, Imperative, Interrogative
   c. Declarative, Exclamatory, Imperative, Interrogative
   d. Statement, Command, Question
   e. Declarative, Exclamatory, Rhetorical, Imperative

11. Read the sample essay and determine the theme of the essay.
A famous saint who started the Society of Jesus (Jesuits) was Ignatius of Loyola. He was from a rich Spanish family and was one of twelve children. As a boy, he was sent to be a page at the royal court of King Ferdinand and Queen Isabella. He heard from fellow citizens that many people wanted to live in the country. He wished someday to become a great soldier and marry a beautiful lady.

   a. Ignatius of Loyola was from a rich family.
   b. King Ferdinand and Queen Isabella thought highly of Ignatius of Loyola.
   c. Ignatius wanted to be famous and marry a beautiful lady, but instead became a great person by starting the Society of Jesus (Jesuits).
   d. Ignatius was from the town of Loyola.
   e. Ignatius wanted to be a great soldier but decided to move out to the country with many other people.

12. Distinguish which characteristics from the list are demonstrative of a successful essay.

   I. Tense agreement
   II. Concrete/descriptive words
   III. Active voice (action verbs)
   IV. Passive voice (to be verbs)
   V. Weak/slang/vague words
   VI. Concise wording

   a. I, II, IV
   b. I, II, III, IV
   c. III, V, VI
   d. I, II, III, VI
   e. I, II, III, IV

Part IV: Brief Response. Read the instructions for each test item carefully. Provide a complete answer. (10 points each)

13. Rewrite the sample sentence in the space provided using adjectives, adverbs and action verbs to paint a vivid picture to the reader. You may use your thesaurus for reference.
The boy saw a toy on the shelf. ______________________________________________________________________________________ ______________________________________________________________________________________

14. In the space provided, compose an opening sentence in the proper format based on the subject specified. You may use your thesaurus for reference.
Subject: My Favorite Pet
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

Part V: Extended Response. Read the instructions for each test item carefully. Provide a complete answer. (20 points)

15. Compose an essay (minimum of 5 sentences) that tells of the five key elements to be included in a checklist for successful writing. Be sure to write a descriptive essay that entices the reader and uses correct sentence structure, word choice, and fluency. This essay should demonstrate your writing knowledge and skills. You may use your thesaurus for reference.

_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________

This completes the test. Turn your test in to the examiner and work quietly at your seat.


Traditional Classroom Test Construction Task (TCTCT)

Task Description
In the task description below, directions are provided for you to successfully meet each standard. The scoring rubric presents standards that are used to rate the quality of the test developed. For this exercise, your work team will construct a traditional classroom achievement test and show how it meets quality standards. The test must have between 12 and 15 select response items, employing at least 3 different item formats (i.e., true/false, multiple choice, matching, fill-in-the-blank or completion). In addition, there must be one brief response item (this is not a completion item) and one extended response item, along with a scoring plan for each of the two supply response items. The test will comprise at least 2 subtests (usually every learning target, standard, or outcome has its own subtest or set of test items which measure examinee knowledge and skill associated with it). You may NOT use an existing test; you must create an original test. The final work product is to be submitted electronically using Microsoft Word 2003 or a later edition. The professor may grant one optional free read before grading. The result will likely be a 14-16 page report excluding the cover page, appendices, and reference list. You will have centered (bold) main headings along with required left-margin headings. Submissions which fail to conform to the task description may receive a zero with no opportunity for revision and resubmission, at the professor's discretion.

1. Assessment Situation
Describe who is to be tested; why they are to be tested; what decisions are to be made (e.g., reteaching, mastery determination, etc.) to improve learning; and what information is needed to make those decisions. Typically, the only decision to be made is content mastery or non-mastery. If the examinee has mastered the content, he or she passes. If the content is not mastered, he or she fails or is remediated. (See CTQI Standards 1-2.)

2. Test Content
Describe the content (knowledge and/or skills) the test is intended to measure. Relevant course learning targets, outcomes, or standards (the terms are used interchangeably; see Hale & Astolfi, 2007a, pp. 59-62) would be cited here. First, construct a small table where each learning target is listed along with associated attainment indicators, and place this in the report narrative (Standard 3). Second, construct a test blueprint, Appendix A, Table 1 (Hale & Astolfi, 2007a, p. 67); a test blueprint presents learning targets horizontally (rows) and intellectual skills, using Bloom's Taxonomy of Intellectual Skills (Hale & Astolfi, 2007a, pp. 4-7), vertically (columns). In each table cell, the specific relevant item numbers are recorded (Standard 4). Third, construct a table (Appendix B, Table 1) with assessed intellectual skills (Hale & Astolfi, 2007b, pp. 6-8) horizontally (rows) and specific thinking skills vertically (columns); in each table cell, identify the specific item numbers assessing each specific thinking skill and its associated intellectual skill (Standard 5). Fourth, describe how each test item is matched to the learning outcomes, intellectual skills, and specific thinking skills in a table for the reader (Standard 6). Fifth, in a paragraph, show that the number and type of items are sufficient to accurately measure student mastery; this paragraph acts as the ending point of this section. (See CTQI Standards 3-7.)
3. Test Organization & Scoring
First, after writing the test items (Standards 8-9), prepare directions which clearly explain how the examinee is to approach the test, record answers, indicate his or her completion of the examination, and return testing materials (Standard 10). The directions should explain how the examinee records answers for select response items (Standard 11) and supply response items (Standard 12). Second, explain the rationale for formatting the test as presented in Appendix D (Standard 13). Third, describe how the test is scored and the rationale for the scoring plan; explain how the scores are transformed into information for decision-making (Standard 14). (See CTQI Standards 8-14.)

4. Test Reliability and Validity
Every test should be reliable and valid. Identify relevant types of reliability and validity, based on the assessment situation (Hale & Astolfi, 2007a, pp. 35-49); explain how each is established, citing textbooks, professional standards, and/or research evidence. Explain only the types of reliability and validity relevant to the assessment situation. (See CTQI Standards 15-17.)

Report Sections & Content


Assessment Situation
In a few sentences, (a) describe why the test is important, how many times it is administered, how the test is administered, the time allocated to complete the test, and how the test fits into any applicable testing strategy; (b) describe what decision(s) (e.g., mastery or non-mastery) is/are to be made based on test data; and (c) for each decision, describe exactly how the information will be used to improve learning. For example, indicate what happens to examinees that have mastered or failed to master a test on the causes of the Spanish-American War. This is the response to Standard 1. In describing who is to be tested, briefly describe participants demographically, participant numbers, and any other relevant information needed to set a full context. Cite a few references, identifying where the information came from. This is the response to Standard 2.

Test Content
First, identify each learning target (outcome or standard) (Hale & Astolfi, 2007a, pp. 59-62); explain each learning target's relevance to the examinees' learning. Refer to Table 1 by number; don't put the table in an appendix. Each learning outcome or target has benchmarks, also called attainment indicators (Hale & Astolfi, 2007a, p. 62), which show what mastering that target or outcome, or meeting that standard, actually means in concrete terms. Individual test items are written based on these benchmarks or attainment indicators: thus, if you have 14 benchmarks, there are at least 14 test items. Thus, one conventional strategy for constructing an achievement test is to identify the learning outcome or standard and then specify any number of benchmarks or attainment indicators which, when accomplished, signal learning outcome accomplishment. Table 1 should be placed in your narrative response to Standard 3, and not in an appendix.

Table 1: Learning Targets & Benchmarks

Learning Target 1 (Write it)                Learning Target 2 (Write it)
   Benchmark 1 (Write it)                      Benchmark 1 (Write it)
   Benchmark 2 (Write it)                      Benchmark 2 (Write it)
   Benchmark 3 (Write it)                      Benchmark 3 (Write it)
   Benchmark 4 (Write it)                      Benchmark 4 (Write it)
   Benchmark 5 (Write it)                      Attainment Indicator 5 (Write it)

Limit your achievement test to two subtests (i.e., two learning targets), each consisting of a set of items to measure student attainment of each learning target's benchmarks. Since you need 12-15 items, you might have 12-15 benchmarks, or you may have more than 1 item measuring a benchmark. Your test should incorporate no more than two learning targets, but you can add a third if needed. This is the response to Standard 3.


Second, introduce the test blueprint (Hale & Astolfi, 2007b, pp. 3-8) and refer to it by location, Appendix A, Table 1. Describe for the reader what the test blueprint is and why it is important. Give an example or two from the table, so that the reader will know how to use the table to align test items with relevant intellectual skills. This is the response to Standard 4.

Third, introduce the reader to Appendix B, Table 1. Explain what specific intellectual skills are being tested and any associated specific thinking skills, e.g., problem-solving, creative thinking, decision-making, etc. (Hale & Astolfi, 2007b, pp. 3-8) (the test must have at least two). Explain how each specific thinking skill is related to the test's purpose by logically linking each to a specific intellectual skill; a specific thinking skill is based on an intellectual skill; however, not all test items will measure a specific thinking skill. Cite references. This is the response to Standard 5.

Fourth, the learning target benchmarks, intellectual skills, and/or specific thinking skills being measured must be matched to suitable test items (Hale & Astolfi, 2007a, pp. 68-80). Appendix C, Table 1 combines Appendix A, Table 1 and Appendix B, Table 1 into one table. In the narrative, to explain Appendix C, Table 1, take one or two illustrative benchmarks and show how each incorporates an intellectual skill and/or a specific thinking skill. Show that the associated test item or items is logically related to the benchmark and requires the examinee to use the applicable intellectual skill and/or thinking skill to correctly answer the item. Insert specific item numbers into Appendix C, Table 1 after you have written the test items. Present the actual traditional classroom achievement test as Appendix D. Ideally, there should be at least one item per learning target benchmark; if an item is used to assess more than one benchmark, then an explanation must be provided as to why one item assesses more than one benchmark. For example, if there are 4 items which assess multiple benchmarks, then 4 separate explanations must be provided in the narrative; write each explanation using the item number as the starting point. This is the response to Standard 6.

Finally, in the last paragraph of the section, you need to show that there are an appropriate number of test items to accurately assess each of the learning targets you stated. You must explain your argument for having enough items and cite relevant references. This is the response to Standard 7.

Test Organization & Scoring

Test Organization
First, write 12-15 select response test items (Hale & Astolfi, 2007a, pp. 68-80) which conform to item writing guidelines. You must use at least 3 different select response item formats (i.e., true/false, multiple choice, matching, fill-in-the-blank or completion). Explain why each format was selected and how each format permits a student to demonstrate his or her knowledge or skill regarding a specific intellectual and/or specific thinking skill. Use one (1) representative item for each format and show how it conforms to specific item writing guidelines, citing references; do not make simple affirmative statements that items conform to guidelines; show that they do. Since this may be your first attempt to write test items, keep the items simple and follow the examples given. Be sure each select response test item is logically related to its corresponding benchmark, and that the relevance of each intellectual and/or specific thinking skill is explained. Select response items are presented in Appendix D. This is the response to Standard 8.

Second, write one brief response item and one extended response item. Be sure each supply response test item is logically related to its corresponding benchmark and requires use of the identified intellectual and/or specific thinking skill. The brief response item may not be a fill-in-the-blank or completion item. Supply response items are presented in Appendix D. This is the response to Standard 9.


Third, show how the directions are well written and clear (Hale & Astolfi, 2007a, pp. 64-65); each section of the test must have its own set of directions. Cite references. This is your response to Standard 10.

Test Scoring
Fourth, explain in detail how select and supply response item answers are recorded. Do not simply refer the reader to the test. Cite at least one reference supporting your conclusion. This is the response to Standard 11.

Fifth, prepare a model answer for each supply response item and allocate points for each key portion of the model answer; an examinee should be able to earn partial credit, or some of the points possible, for each supply response item. This is the response to Standard 12.

Sixth, explain the rationale for organizing the test, i.e., formatting the test as presented in Appendix D. Do this in some detail. This is the response to Standard 13.

Finally, describe the test's scoring plan and how the raw data (points) are transformed (i.e., changed) into information for decision-making. For example, Subtest A's earned points are divided by possible points; those examinees with 85% of possible points on Subtest A are considered masters and have achieved Learning Target A. Generally, a learning target should have 3-5 items assessing it, so there should be a sufficient number of points to allow for a reasonable mastery determination. This is the response to Standard 14.

Test Reliability and Validity

Reliability
First, indicate how internal consistency reliability (ICR) is established (Hale & Astolfi, 2007a, pp. 40-43); each test must possess at least this type of reliability. Ensure the discussion explains what ICR is, how it is affected by test design and administration, how reliability threats are managed, an acceptable reliability coefficient value, and how the coefficient is interpreted. Present similar information for other applicable types of reliability, which should be included depending on the assessment situation. This is the response to Standard 15.

Validity
Second, show whether or not the classroom test has at least content validity (Hale & Astolfi, 2007a, pp. 44-45). Content validity is dependent on the degree to which test items align with the content taught (hence the reasons for Standards 3-6). The discussion should define content validity and explain how it was built into the test; draw on Appendix A, Table 1, Appendix B, Table 1, and Appendix C, Table 1 to frame the argument for content validity; blend this discussion into a description of the content validation process. A second factor to ensure content validity is to write test items based on item writing guidelines; remind the reader about how items were well written. This is the response to Standard 16.

Third, reliability and validity discussions must be richly (substantially) documented; cite several references. Use a significant number of references. This is the response to Standard 17.

Appendix A: Test Blueprint

Appendix A, Table 1: Test Blueprint

Learning Target      Knowledge      Comprehension      Application      Analysis      Synthesis
LT 1 (Write it)      1, 3           8, 9               13, 14           19            22
LT 2 (Write it)      4, 5           9, 10              15, 16           20            23
LT 3 (Write it)      6, 7           11, 12             17, 18           21            24


Each learning target is presented along with the test item numbers designed to measure student mastery of that learning target. Ensure you include all test items.

Appendix B: Thinking & Intellectual Skill Alignment

Appendix B, Table 1: Thinking & Intellectual Skill Alignment

Intellectual Skill      Problem-Solving      Decision-Making      Creative Thinking
Knowledge
Comprehension
Application             13, 14               15, 16               17, 18
Analysis                19                   20                   21
Synthesis               22                   23                   24

Remember, at least two specific thinking skills are required. Supply response items are typically associated with specific thinking skills. Don't insert select response items into Appendix B, Table 1.

Appendix C: Learning Outcome, Attainment Indicator, & Test Item Alignment

Appendix C, Table 1: Learning Target Attainment Indicators by Test Item

Learning Target 1 (Write it)      Intellectual Skill           Thinking Skill               Item #
   Benchmark 1 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 2 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 3 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 4 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 5 (Write it)         Insert applicable skill      Insert applicable skill

Learning Target 2 (Write it)      Intellectual Skill           Thinking Skill               Item #
   Benchmark 1 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 2 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 3 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 4 (Write it)         Insert applicable skill      Insert applicable skill
   Benchmark 5 (Write it)         Insert applicable skill      Insert applicable skill

Ideally, there should be at least one item per learning target benchmark; if an item is used to assess more than one benchmark, then an explanation must be provided as to why one item assesses more than one benchmark. For example, if there are 4 items which assess multiple benchmarks, then 4 separate explanations must be provided in the report narrative; write each explanation using the item number as the starting point.

Appendix D: Traditional Classroom Achievement Test
Integrate the traditional classroom achievement test into the body of the report as Appendix D. Do not submit the test as a separate document.

References
Ensure that all references conform to the APA 6th edition style manual.

Hale, C. D., & Astolfi, D. (2007a). Measuring learning and performance: A primer. Retrieved from http://charlesdennishale.com

Hale, C. D., & Astolfi, D. (2007b). Active teaching and learning: A primer. Retrieved from http://charlesdennishale.com


Classroom Test Quality Index (CTQI)

Each classroom achievement test should meet the 17 standards presented in this rubric. For each standard, one of five performance levels or scores is possible: 0 = Not in Evidence or Missing, 1 = Does Not Meet Standard, 2 = Marginally Meets Standard, 3 = Mostly Meets Standard, or 4 = Meets Standard. A mid-point score may also be awarded, e.g., 3.5.

Assessment Situation (Score)
1. The (a) importance of the test is described, along with its administration and its role in the testing strategy; (b) decisions to be made are clearly stated with linkage to purpose thoroughly examined; and (c) the decisions to be made are linked to instructional improvements.
2. Examinees are completely and thoroughly described.

Test Content: Learning Targets or Objectives (Score)
3. The content (e.g., learning targets or standards and benchmarks) is identified and relevance explained. (Score: _____ x 2)
4. Intellectual skills (e.g., application, analysis) are identified and relevance explained. (Score: _____ x 2)
5. Thinking skills (e.g., creativity, problem-solving) are identified and relevance explained. (Score: _____ x 2)
6. Content, intellectual skills, and thinking skills are appropriately and logically matched to test items. (Score: _____ x 2)
7. The number of test items per learning target is sufficient and is appropriately documented.

Test Organization & Scoring (Score)
8. Select response items conform to corresponding item writing guidelines. (Score: _____ x 3)
9. Supply response items conform to corresponding item writing guidelines. (Score: _____ x 2)
10. Test directions are clear, complete, and easily understood.
11. It is convenient for examinees to record answers for supply response items.
12. The supply response items' scoring plan is complete and easily applied.
13. The rationale for organizing the test as presented is logical.
14. The test's scoring plan is consistent with its stated purpose. (Score: _____ x 2)

Reliability and Validity (Score)
15. Applicable reliability types are identified and relevance explained. (Score: _____ x 2)
16. Applicable validity types are identified and relevance explained. (Score: _____ x 2)
17. Reliability and validity explanation is thoroughly documented. (Score: _____ x 2)

Comments: (Up to 10 points may be lost for failure to comply with the APA style manual.)

Total Score: _______/112 = _____% or ______ (Grade)
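The 112-point maximum follows directly from the weights listed in the rubric: each standard is scored 0-4 and then multiplied by its weight. A quick arithmetic check, written as a sketch:

# Sketch: maximum CTQI score = sum over the 17 standards of (weight x 4 points).
weights = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 3, 9: 2,
           10: 1, 11: 1, 12: 1, 13: 1, 14: 2, 15: 2, 16: 2, 17: 2}
max_points = 4 * sum(weights.values())
print(max_points)  # 112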
