
Test Construction

Step #1: Define the Test Universe, Test Takers, and Purpose

Test Universe: Prepare a working definition of the construct the test will measure. Be sure to check the psychological literature to help with this.

Test Takers: List the characteristics of the people you expect to take the test, especially those that will affect how the test taker responds. Some examples include: reading level; disability; motivation to answer honestly; language.

Purpose of the Test: Include what the test will measure (e.g., temptation) and how the outcomes will be used. Will the scores be used to compare test takers (normative) or be compared to some achievement level (criterion)? Will the scores be used to predict some performance or make a diagnosis? Will the scores be used cumulatively to help prove or disprove a theory, or individually to provide information about the test taker?

Why is it important to include information about the test universe, test takers, and purpose of the test? This type of information helps the user determine whether the test is appropriate for things such as group administration, or paper-and-pencil versus oral administration. It helps the user make a more informed decision about the test's usefulness.

Step 2: Develop a Test Plan

Construct Definition: Now you want to provide your construct with a concise definition. The definition should include: 1) an operationalized statement regarding observable and measurable behaviours; 2) the boundaries of the test domain, that is, the content you are testing. Include content that is not appropriate, and include an estimate (percentage) of how many questions are needed to sample the test domain.

Test Format: Choose the test format (e.g., objective or subjective) and the type of questions the test will contain (e.g., multiple choice, true/false, short answer, verbal questions and responses). Most tests use a consistent format. However, if you use different formats, be sure to provide detailed instructions about each type of question.

Specify Administration and Scoring Methods: Specify how the test should be administered and scored. Answer questions like: How will the test be administered? How long do test takers have to complete the test? Should the test be given in a group setting or to individuals? Does the test need to be scored by the publisher or by the administrator? Is there a particular weighting for each question? What type of data will the test yield?

The most common method for scoring is termed the cumulative model. Under this model, the more the test taker responds in a given fashion, the greater the exhibition of the attribute being measured. The most common approach gives one point for each response that reflects the attribute, and the total accumulation of these points becomes the raw score. Such tests typically yield interval-level data (a short scoring sketch appears after the multiple-choice introduction below). Other scoring methods include: A) the categorical model, used to place test takers into a given group, which generally yields nominal data; B) the ipsative model, in which the test taker's scores on various scales within the test are compared to each other to yield a profile of the individual. Note that all three models can be combined in any fashion on a given test.

Step 3: Develop Test Items

Objective Items: The most commonly used formats are multiple choice, true/false, forced choice, and Likert scales. A multiple-choice (MC) item consists of a sentence (or part of a sentence), called the stem, followed by a number of responses (usually 3 to 5), of which only one is correct. Incorrect responses are termed distractors.
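
To make the cumulative scoring model from Step 2 concrete, here is a minimal sketch in Python. The items, answer key, and function name are hypothetical and used only for illustration; they are not part of any published test.

    # Minimal sketch of the cumulative scoring model (hypothetical items and keys).
    # Each response matching the keyed answer earns one point; the sum of points
    # is the raw score, which is treated as interval-level data.

    answer_key = {"q1": "b", "q2": "a", "q3": "d", "q4": "c"}   # keyed responses (assumption)
    responses  = {"q1": "b", "q2": "c", "q3": "d", "q4": "c"}   # one test taker's answers (assumption)

    def cumulative_score(responses, key):
        """Return the raw score: one point per response that matches the key."""
        return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)

    print(cumulative_score(responses, answer_key))  # prints 3

An ipsative profile, by contrast, would compare this person's scale scores to one another rather than summing everything into a single raw score.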

When composing an MC question you should try to clearly differentiate the correct response from the distractors. Distractors that are clearly wrong, or almost right, often detract from your test's accuracy, unless you take this into account with IRT. See the handout on MC question guidelines.

The stem of a true/false item is usually of the form "Which of the following is true?" Some researchers like to convert true/false items into MC items to decrease the guessing advantage.

Forced-choice items typically have a format similar to that of MC. For example, a question may read, "Place an X in the space to the left of the word that best describes your personality":

    _____ cheery
    _____ loyal
    _____ friendly
    _____ outgoing

Forced choice often makes it difficult to guess or fake, since the paired words often appear to have nothing in common; thus it is difficult to guess what the correct response should be. Can you think of a shortfall of the forced-choice method?

Likert scales are generally used for expressing positive or negative attitudes towards a specific object or event. A large number of item statements are presented, and test takers are asked to indicate, on a 4- to 7-point scale (5-point scales are most common), how they feel about each statement. Each point on the scale is assigned a value, and the test taker's score is calculated from those values. Since Likert scales are assumed to be equal-interval, most statistical procedures can be applied. We'll look at an example in class. Can you think of any pros and cons?

Subjective Items: Examples include essay questions, interview questions, projective techniques, and sentence completion. Essay questions are popular in educational settings, as I'm sure you are aware; they allow freedom of response and generally require higher cognitive functioning (analysis, synthesis, and evaluation) to answer. What are some pros and cons of this method? Interview questions also tend to be general in scope, and judging the quality of the answers is up to the interviewer. Interview tests should be planned (not just "sit down and chat") and focus on the knowledge, skills, abilities, and other characteristics needed for the job (KSAOs). Clinical interviews should also be highly structured. Why? Projective techniques are often used in clinical settings and use highly ambiguous stimuli to elicit unstructured responses from test takers. Interpreting the results is often difficult and requires significant study and special skills. Sentence completion is another subjective technique, used especially in personality testing. It presents a partial sentence that the test taker is asked to complete, for example, "I feel betrayed when ________." Scoring is often accomplished by comparing responses with those provided by the test developer; if a match occurs, a point is awarded for a particular trait.

Writing Good Items: As you can see, a lot of creativity, originality, and knowledge goes into writing an item. Here are some rules of thumb that may help: 1) Identify item topics by consulting the test plan. 2) Be sure each item is based on an important learning objective or topic. 3) Write items that assess information or skills drawn only from your testing universe. 4) Write each item as clearly and directly as possible. 5) Use appropriate language. 6) Try to make all items independent. 7) Ask someone to review the items.

Some Things To Be Aware Of With Both Objective and Subjective Items

Response Bias: Some test takers have response sets, or styles, for choosing answers on tests. These response sets often lead to false or misleading information; for example, test takers who are unsure of a response may always say "yes" or "no" or pick "c". There are methods that attempt to detect or minimize response bias errors; one simple check is sketched below, and we'll discuss others later in the course.
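
As one illustration of such a check, here is a minimal sketch of an infrequency-style scale for spotting random responding (the idea is described under Random Responding below). The items, keys, and cut-off are invented for this example, not taken from any published scale.

    # Hypothetical infrequency check: items almost everyone answers the same way.
    # Missing several of them suggests random or careless responding.

    infrequency_key = {"inf1": True, "inf2": True, "inf3": False}   # expected answers (assumption)

    def flag_random_responding(responses, key, cutoff=2):
        """Flag a respondent who misses `cutoff` or more infrequency items."""
        misses = sum(1 for item, expected in key.items() if responses.get(item) != expected)
        return misses >= cutoff

    print(flag_random_responding({"inf1": True, "inf2": False, "inf3": True}, infrequency_key))  # prints True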

Some types of response biases include:

Social Desirability: One of the most problematic response biases is social desirability. Test takers often have a tendency to choose answers that make them look good or are socially acceptable. It is important that you take this into account when developing your test items. Why? Try to balance the social desirability of the distractors, or use an ipsative format. How would you do this? Some researchers believe that socially desirable responding is a part of personality and should not be removed statistically, since doing so will detract from the validity of your results.

Random Responding: Sometimes test takers cannot respond accurately to a test item, so they respond in a random fashion. How can you check whether a person is responding randomly? Add a scale that attempts to identify this, for example, questions that almost everyone in the population answers correctly (as in the sketch above).

Faking: Faking refers to the inclination of test takers to answer items in a way that will produce a desirable outcome or diagnosis. You can "fake good" (answer items in a way that makes you appear to have more of a desirable trait) or "fake bad" (answer items in a way that makes you appear to have more of an undesirable trait). What types of situations generate faking? How can you prevent faking or cheating? You can use what are referred to as subtle questions, which have no real relation to the test purpose and thus are hard to fake (e.g., a personality test item like "Birds fly south for the winter"), or you can build in catch scales (see pp. 224-225 of your text). Research on faking is mixed. Some studies show that even if faking is detected, there is no way of estimating the person's true score; thus, what do you do with their data? Other research suggests that even if persons fake, this may not affect the validity of predicting future behaviours. What does this suggest?

Acquiescence: This response style refers to the tendency to agree ("yes" people) with any ideas or behaviours that are presented to the test taker. An example could be someone who responds "true" to every true/false item. Because of acquiescence it is a good idea to balance items for which the correct response would be positive with an equal number of items for which the correct response would be negative. How will this affect your scoring? We'll consider an example in class.

Writing Test Administration Instructions

It is important to remember that even though the test items make up the bulk of any new test, they are meaningless without good, specific instructions on how the test is to be administered. As the test developer you need to develop two sets of instructions: one for the persons who will administer the test, and another for the test takers.

Administrator Instructions: The testing environment, the circumstances under which the test is administered, can affect how test takers respond. Standardized testing environments decrease scoring error, that is, variation that cannot be attributed to the attribute being measured. Specific and concise instructions should address the following: A) group or individual administration; B) specific requirements for the location and equipment, including things like privacy, quiet, chairs, tables, desks, and required equipment (e.g., pencil, computer with CD-ROM drive); C) time limitations, or approximate completion times if there is no time limit;

D) Prepare a script for the administrator to read to the test takers. The script should include answers to specific questions that test takers are likely to ask.

Instructions for the Test Taker: Test taker instructions are usually, but not always, delivered orally by test administrators, who read your prepared script. Can you see a problem with this? Instructions also appear in writing at the beginning of the test or in the test booklet. Instructions should include things like: A) where the test taker should respond; B) how the test taker should respond (an example is always helpful); C) encouragement to answer accurately and honestly; D) where test takers need to think of a specific context or environment (e.g., home, work, school), a statement such as "Think of your current work situation when responding to the following questions"; E) nothing overly complicated: if the instructions are too complicated you are very likely to confuse some test takers and thus increase the probability of response errors. If you find that your instructions are too complicated, revise them, or alternatively revise your test! I'll bring an example to class.

Revising the Test

Revision is a major part of the test development process. Usually test developers write more items than are needed and then use quantitative and qualitative analyses of the test to choose the items that together provide the most information about the construct being measured (see handout).

Choosing the Final Items: To choose the test's final items, you must weigh each of the following for each item: A) content validity; B) item difficulty and discrimination; C) inter-item correlation (a reliability measure); D) all biases; E) you should also take into account test length and face validity. Often a matrix highlighting all of the above characteristics is created to help in the selection process; it organizes all the information, in clear view, for you to consider. We'll go through one in class (a short item-analysis sketch also appears after the validation discussion below). Can you recall what you'll be looking for as evidence of a good item? Now you should be able to see why you need to start with so many items. Don't forget about the test instructions: they too should be revised. You will undoubtedly discover instructions that you forgot, or that were unclear or difficult to follow; these should be changed.

Validation and Cross-Validation

Now that you are satisfied with your final test items, your test needs to be evaluated to ensure that it is reliable and valid. How should you go about this? That depends on the type of test and what it will be used for, but here are some general guidelines: 1) The first thing usually done in validation is establishing content validity. If you followed all the steps required in defining the construct initially, this is already taken care of. 2) To establish other types of validity (construct and criterion-related) you'll need to run another round of data collection. These additional rounds are similar to the pilot round except that they may match the actual testing protocol better than your pilot study; that is, they may use a sample of people who closely resemble the target audience. Your additional samples should also be large enough for you to comfortably run the analyses in question; power analysis will help you here. 3) You should also collect data on the demographic characteristics of the test takers. Things like sex, race, age, SES, etc. will help when detecting bias.
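
For illustration, here is a minimal item-analysis sketch of the kind of statistics weighed when choosing final items. The response matrix is invented, and the formulas used (proportion answering in the keyed direction for difficulty, and the corrected item-total correlation for discrimination) are one common way of computing these indices, not the only one.

    import numpy as np

    # Hypothetical scored responses: 1 = keyed response, 0 = other.
    # Rows are test takers, columns are items.
    scores = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ])

    # Item difficulty: proportion of test takers answering in the keyed direction.
    difficulty = scores.mean(axis=0)

    # Item discrimination: correlate each item with the total of the remaining items.
    total = scores.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
        for j in range(scores.shape[1])
    ])

    print("difficulty:", np.round(difficulty, 2))
    print("discrimination:", np.round(discrimination, 2))

Items with very high or very low difficulty, or with low (or negative) discrimination, are the usual candidates for revision or removal.
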
Developing Norms and Cut Scores

Norms and cut scores are decision points for dividing test scores into groups (e.g., pass/fail, depressed/not depressed). They help the test user interpret the test data. Not all tests will have norms. Norms and cut scores depend on two things: A) the purpose of the test;

B) How widely used the test is.

How do you develop norms? Norms are based on the distribution of test scores and provide a reference point, or structure, for understanding one's own score. Here are the steps required for developing norms: 1) Obtain a random sample of the target audience; as you know, larger is better. This is difficult to do. Why? What is the next best thing? Can you use some of the pilot data or the data obtained during validation? 2) Once the sample is large enough, the norms are calculated. As the database grows, the norms should be adjusted as well. Why? 3) Larger databases also allow for the development of subgroup norms: statistics that describe a specific portion of the target audience (e.g., males only). 4) The norms should be published in the test manual.

Identifying Cut Scores

Cut scores are scores at which a decision changes. Setting cut scores is not an easy process and has all kinds of legal, professional, and psychometric implications. Can you think of any? There are two main approaches to setting cut scores: 1) With employment tests, a panel of expert judges provides an opinion or rating of the number of test items that a barely qualified person is likely to get right; this becomes the cut score. 2) A more empirical approach can also be used: the correlations obtained between the test and an outside criterion are used to predict the test score that a minimally acceptable candidate is likely to achieve. A regression equation is used to predict the score that a person rated minimally acceptable is likely to make, and that score becomes the cut score.

A major problem with setting cut scores is error. Recall that the SEM (standard error of measurement) is an indicator of how much error exists in someone's test score. It is very likely that a person who scores only a point or two below the cut score will score above the cut score if asked to take the test again, and the increase can be due solely to test error, not ability. Because of this, Anastasi and Urbina (1997) suggest that cut scores be a band of scores rather than a single score: instead of a cut score of 60, you use the SEM to compute a band of scores, say 58 to 62 (a short sketch at the end of these notes shows the computation).

Developing the Test Manual

We have noted previously that the test manual is an important component of any test. The manual includes things such as: A) the rationale for test construction; B) a history of the developmental process; C) results of validation studies; D) the target audience; E) instructions for administration and scoring; F) norms; G) information on interpreting individual scores; H) limitations of use and measurement accuracy. Writing the manual is not left until the end; it is an ongoing process that begins with your conception of the test and continues throughout the developmental phases. If you diligently record things in your test manual, it will serve as a source of documentation and reference for each part of the developmental process. We'll look at the Wisconsin Card Sorting Test manual in class.
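
To make the banded cut score concrete, here is a minimal sketch. The standard deviation, reliability, and cut score are invented numbers chosen to reproduce the 58 to 62 example above; the SEM formula (the standard deviation times the square root of one minus the reliability) is the standard one.

    import math

    # Hypothetical test statistics, for illustration only.
    sd = 10.0            # standard deviation of test scores (assumption)
    reliability = 0.96   # reliability coefficient (assumption)
    cut_score = 60.0     # nominal cut score (assumption)

    sem = sd * math.sqrt(1 - reliability)      # standard error of measurement
    band = (cut_score - sem, cut_score + sem)  # band of one SEM on either side

    print(round(sem, 1), band)                 # prints 2.0 (58.0, 62.0)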
