
Development of General Physics 1 Achievement Test (GP1-AT) in Force, Energy and Motion

for Grade 12 STEM Students

Edwin C. Barba, Jr.*


*Post-graduate student, College of Education, University of the Philippines; Chairperson, STEM-HUMMS Area,
Senior High School Division, Central Colleges of the Philippines, Quezon City

ABSTRACT
The purpose of this study is to develop an achievement test that will assess grade 12 STEM students'
knowledge and understanding of topics in General Physics 1, particularly force, energy and motion. For
this purpose, a 100-item multiple choice test was created and pilot tested with 300 grade 12 STEM
students. Item and distractor analyses were carried out to evaluate the quality of the items and distractors.
Internal consistency reliability was also investigated to check the consistency and reliability of
the items and the test. Results showed a mean difficulty index of .51, which suggests that the test
has an overall desirable difficulty. It was also shown that 7% of the items are easy, 63% are desirable and
26% are difficult. The mean discrimination index of the test is .35, which implies that the test has good
discriminating power; 47% of the items have very good discriminating power and 15% have
good discriminating power. Distractor analysis of the test showed that 82% of the items have 100%
distractor efficiency and only 7.7% of the total 300 distractors were found to be non-functional.
Reliability analysis shows that the Cronbach's alpha of the test is .904, which means that the initial form
of the test is highly reliable and has excellent internal consistency.

Keywords: Achievement Test, General Physics 1, Multiple Choice Questions, Item Analysis, Difficulty
Index, Discrimination Index, Distractor Efficiency

INTRODUCTION

The implementation of the Senior High School program under the new K-12 Curriculum has brought a lot of changes in the educational landscape of the Philippines. A new program entails new courses, innovative teaching strategies and, most especially, new ways of assessing and evaluating students' learning. Today, student assessment and evaluation in the K-12 program is divided into three components: written works, performance tasks and quarterly examinations. Each component's weight varies across different subjects, but the methods and principles remain the same.
Though there are many ways to measure students' learning under the new curriculum, pen-and-paper tests remain one of the most common ways teachers assess and evaluate students. Perhaps the multiple choice test is the most extensively used format in the assessment of students today. National standardized tests such as the National Achievement Test (NAT) and the National Career Assessment Examination (NCAE), as well as licensure examinations in the Philippines, are still administered in a multiple choice format. This shows that multiple choice remains relevant in today's educational landscape.

The challenge to teachers now, particularly those teaching new courses in the Senior High School Program, is to develop a pool of valid and reliable multiple choice questions that can be used in classroom assessment as well as in large-scale testing. The need for any test to be valid and reliable stems from the fact that decisions are usually based on its results. Validity is the extent or degree to which a test can measure the qualities, skills, abilities, traits, or information that it was designed to measure, while reliability is the extent to which it can consistently make these measurements (Nwadinigwe and Naibi, 2013). Multiple choice tests are a commonly preferred format because they have been found to give valid and reliable results.

The purpose of this study is to develop a reliable and valid achievement test in General Physics 1, particularly in topics involving force, energy and motion, for grade 12 STEM students. General Physics 1 is one of the new courses implemented in the STEM track of the Senior High School Program, and the development of an achievement test will surely help teachers and eventually add to a pool of quality multiple choice items that may be used not only for classroom assessment but for large-scale testing as well.

METHODS

Research Design

This study utilized a descriptive research design, which was used to describe the quality of the items and the test as a whole through different statistical analyses such as item analysis, distractor analysis, and reliability analysis using Cronbach's alpha.

Participants

This study used purposive sampling, in which participants were chosen based on the criteria that they must be grade 12 students, enrolled in the STEM track of the Senior High School Program of their school, and have taken General Physics 1 or at least the first half of the said course.

The test was pilot tested with 300 grade 12 STEM students from two private institutions in Quezon City. The test was administered for a total of two hours and thirty minutes under the supervision of the author and a teacher. The students were asked to write their answers on a separate sheet of paper. After the test administration, the test questionnaires and the answer sheets were collected from the students.

Test Description

The General Physics 1 Achievement Test in Force, Energy and Motion (GP1-AT) is a 100-item multiple choice test developed to measure the knowledge and understanding of students in General Physics 1, particularly in the chapters that deal with motion, force and energy. The researcher decided to utilize questions with four options (1 key, 3 distractors), believing that using four options would minimize the chance of students guessing the correct answer. The questions were developed based on the content of the Senior High School Curriculum Guide in General Physics 1 issued by the Department of Education in December 2013.

Test Development Procedure

The GP1-AT was developed with the intention of measuring the knowledge and understanding of grade 12 STEM students in physics topics concerning force, energy and motion. The researcher identified the content of the test based on the topics in the Senior High School Curriculum Guide for General Physics 1. Table 1 below shows a simplified version of the table of specifications of the test. (A detailed table of specifications may be found in the Appendix.)

Table 1. GP1-AT Table of specifications

                                             Thinking Skills
Topics                    Remembering  Understanding  Applying  Analyzing & Evaluating  Total
Concept of Force               4             4            5                3              16
Newton's Laws of Motion        5             8           10                5              28
Kinematics of Motion           6             6           14                3              29
Work, Power and Energy         6             6           10                5              27
Total                         21            24           39               16             100

The initial GP1-AT covered the topics Concept of Force, Newton's Laws of Motion, Kinematics of Motion, and Work, Power and Energy. There are 16 items (16%) for Concept of Force, 28 items (28%) for Newton's Laws of Motion, 29 items (29%) for Kinematics of Motion and 27 items (27%) for Work, Power and Energy. The questions in the GP1-AT are designed to measure students' thinking abilities, specifically Remembering, Understanding, Applying, and Analyzing and Evaluating. There are 21 items (21%) for Remembering, 24 items (24%) for Understanding, 39 items (39%) for Applying and 16 items (16%) for Analyzing and Evaluating.

It is worth noting that the initial form of the GP1-AT was not validated by experts in physics education due to time constraints. The researcher relied on his own knowledge and experience in writing and creating the multiple choice items.

Test Statistics

Item analysis and distractor analysis were performed to determine how well the test and the individual items contributed to the scores of the participants and to further improve the quality of the individual items and the test as a whole. The internal consistency reliability of the test was also investigated to see how well the items on the test measure the same construct or idea.

Item analysis is the systematic evaluation of the effectiveness of the individual items on a test (Brown, 1996). In this study, two statistical indices – the difficulty index, p, and the discrimination index, DI – were used.

The difficulty index or p value is an inverse index – the lower the value, the more difficult the item. The formula for the p value used in this study is

p = (UG + LG) / 2

where UG is the proportion of students who got the correct answer in the upper group and LG is the proportion of students who got the correct answer in the lower group. Table 2 below shows how the difficulty index or p value can be interpreted. An inspection of the item difficulty levels can reveal problems with the test and even with the instruction.

Table 2. Interpretation of difficulty index or p value


Difficulty index or p value Interpretation
.86 – 1.00 Very Easy item
.71 – .85 Easy item
.40 – .70 Desirable item
.15 – .39 Difficult item
.01 – .14 Very difficult item
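
As a minimal illustration of how the difficulty index and its interpretation can be computed, the following Python sketch (not part of the original study; the function names, the two-decimal rounding, and the per-group size of 80 are assumptions) applies the formula above to the raw counts reported later in Table 5 for Item 1:

```python
# Minimal sketch (assumed helper names): difficulty index p = (UG + LG) / 2,
# where UG and LG are the proportions of correct answers in the upper and
# lower groups. Proportions are rounded to two decimals first, matching the
# rounding used in Table 5.

def difficulty_index(upper_correct, lower_correct, group_size=80):
    ug = round(upper_correct / group_size, 2)  # proportion correct, upper group
    lg = round(lower_correct / group_size, 2)  # proportion correct, lower group
    return round((ug + lg) / 2, 2)

def interpret_p(p):
    """Map a p value to the categories in Table 2."""
    if p >= 0.86:
        return "Very Easy"
    if p >= 0.71:
        return "Easy"
    if p >= 0.40:
        return "Desirable"
    if p >= 0.15:
        return "Difficult"
    return "Very Difficult"

# Item 1 of the pilot test: 70 of 80 correct in the upper group, 22 of 80 in the lower group
p = difficulty_index(70, 22)
print(p, interpret_p(p))  # 0.58 Desirable, as reported in Table 5
```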

The discrimination index or DI indicates the degree to which an item separates the students who performed well from those who performed poorly (Brown, 1996). The formula for the discrimination index or DI used in this study is

DI = UG - LG

where UG is the proportion of students who got the correct answer in the upper group and LG is the proportion of students who got the correct answer in the lower group. A negative DI value could mean problems with the item or with the instruction. Table 3 below shows how the discrimination index or DI is interpreted.

Table 3. Interpretation of discrimination index or DI


Discrimination index or DI Interpretation
.40 and above Very Good item
.30 – .39 Good item
.20 – .29 Needs improvement item
.19 and below Poor item
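
A similar sketch (again an illustration with assumed names, not the software used in the study) computes the discrimination index and maps it to the categories in Table 3, reusing Item 1's counts from Table 5 as the worked example:

```python
# Minimal sketch (assumed helper names): discrimination index DI = UG - LG,
# using the same two-decimal rounding of the group proportions as in Table 5.

def discrimination_index(upper_correct, lower_correct, group_size=80):
    ug = round(upper_correct / group_size, 2)  # proportion correct, upper group
    lg = round(lower_correct / group_size, 2)  # proportion correct, lower group
    return round(ug - lg, 2)

def interpret_di(di):
    """Map a DI value to the categories in Table 3."""
    if di >= 0.40:
        return "Very Good"
    if di >= 0.30:
        return "Good"
    if di >= 0.20:
        return "Needs Improvement"
    return "Poor"  # includes negative values, which flag a possibly flawed item or key

# Item 1 of the pilot test: 70 of 80 correct in the upper group, 22 of 80 in the lower group
di = discrimination_index(70, 22)
print(di, interpret_di(di))  # 0.6 Very Good, as reported in Table 5
```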

Distractor analysis, on the other hand, was used to determine the degree to which the distractors attract students who do not know the correct answer (Brown, 1996). In this research, distractor efficiency (DE) was used to look at the performance of the distractors of each item in the GP1-AT. Non-functional distractors (NFDs) are options that are selected by less than 5% of the participants, while functional distractors are options selected by 5% or more of the participants (Mukherjee and Lahiri, 2015). Distractor efficiency is determined for each item on the basis of the number of NFDs and ranges from 0 to 100%. If an item has three, two, one or no NFDs, then its DE will be 0%, 33.3%, 66.6% and 100%, respectively. NFDs must be revised, removed or replaced with more plausible distractors.
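
To make the counting rule concrete, the following Python sketch (illustrative only; the option counts below are hypothetical, not taken from the pilot data) flags non-functional distractors and computes the distractor efficiency of a four-option item:

```python
# Minimal sketch (hypothetical data): an option other than the key chosen by
# fewer than 5% of examinees is a non-functional distractor (NFD);
# DE = functional distractors / total distractors, expressed as a percentage.

def distractor_efficiency(option_counts, key, n_examinees, threshold=0.05):
    """Return (number of NFDs, DE in percent) for one multiple choice item."""
    nfds = 0
    for option, count in option_counts.items():
        if option == key:
            continue  # only distractors are evaluated
        if count / n_examinees < threshold:
            nfds += 1
    n_distractors = len(option_counts) - 1
    de = (n_distractors - nfds) / n_distractors * 100
    return nfds, de

# Hypothetical item: 300 examinees, key is option B; option D attracts only 9 students (3%)
counts = {"A": 60, "B": 186, "C": 45, "D": 9}
print(distractor_efficiency(counts, "B", 300))  # (1, 66.66...) -> one NFD, 66.6% DE
```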

The internal consistency of the test was measured using Cronbach's alpha, which gauges the internal consistency of a group of items by measuring their homogeneity (BrckaLorenz, Chiang, & Nelson Laird, 2013). Cronbach's alpha ranges from 0 to 1.00, with values close to 1.00 indicating high consistency. High-stakes standardized tests should have a Cronbach's alpha of at least .90, while low-stakes standardized tests should have an alpha of at least .80 or .85. For classroom exams, a Cronbach's alpha of .70 or higher is desirable (Wells & Wollack, 2003). To measure the internal consistency of the GP1-AT, Cronbach's alpha was calculated using the IBM SPSS 20 software package.
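
Although the study computed Cronbach's alpha with IBM SPSS 20, the coefficient can also be obtained directly from the scored response matrix. The sketch below (an illustration with made-up data, not the actual pilot-test responses) implements the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores):

```python
# Minimal sketch (illustrative data): Cronbach's alpha from a 0/1 scored
# response matrix, alpha = k/(k-1) * (1 - sum(item variances) / var(total scores)).

import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array-like, rows = examinees, columns = items (1 = correct, 0 = wrong)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Tiny made-up matrix (5 examinees, 4 items) just to exercise the function
demo = [[1, 1, 1, 0],
        [1, 0, 1, 0],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
print(round(cronbach_alpha(demo), 3))  # 0.8 for this toy matrix
```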

RESULTS

The initial form of the GP1-AT was pilot tested with 300 grade 12 STEM students from two private institutions in Quezon City. The 100-item test was administered for a total of two hours and thirty minutes. Answer sheets were collected after the test and evaluated by the researcher. Table 4 below shows the results of the pilot test.

Table 4. Results of the GP1-AT pilot test


Characteristics Value
Mean raw score 48.95
Std. Error of Mean .825
Std. Deviation 14.296
Skewness .505
Std. Error of Skewness .141
Kurtosis -.331
Std. Error of Kurtosis .281

The highest and lowest scores in the pilot test are 91 and 23, respectively. The mean score of the participants in the pilot test is 48.95, with a standard deviation of 14.296. The score distribution of the 300 participants is positively skewed and platykurtic, which indicates that most of the participants scored low on the test.

Item Analysis

The raw scores of the participants were sorted from highest to lowest to identify the upper and lower groups. The top 80 participants (27%) were designated as the upper group (UG) and the bottom 80 participants (27%) as the lower group (LG). The responses of the upper and lower groups were encoded as 1 for a correct response and 0 for a wrong response for each item of the test. The proportion of students who got the correct answer in each group was computed by dividing the number of correct responses in that group by the number of participants in the group. The difficulty index and discrimination index of each item were then computed and interpreted, as sketched below. Table 5 shows the results of the item analysis.
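
As a minimal sketch of this grouping-and-proportion step (the data structures and function names are assumptions for illustration; the actual scoring was done from the collected answer sheets), the procedure can be expressed as:

```python
# Minimal sketch (assumed data layout): examinees are ranked by raw score, the
# top and bottom groups are taken (80 examinees each, about 27% of 300), and the
# proportion of correct answers per item is computed within each group.

def upper_lower_groups(scored_responses, group_size=80):
    """scored_responses: list of per-examinee lists of 0/1 item scores."""
    ranked = sorted(scored_responses, key=sum, reverse=True)  # highest raw score first
    return ranked[:group_size], ranked[-group_size:]          # upper group, lower group

def correct_proportion(group, item_index):
    """Proportion of one group answering the given item correctly (UG or LG for that item)."""
    return sum(examinee[item_index] for examinee in group) / len(group)

# Usage sketch: these proportions feed the p and DI formulas defined earlier
# upper, lower = upper_lower_groups(all_scored_responses)
# ug, lg = correct_proportion(upper, 0), correct_proportion(lower, 0)
```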

Table 5. Results of the item analysis


Item     Correct (UG)  Correct (LG)  Proportion (UG)  Proportion (LG)  p      Interpretation   DI     Interpretation
Item1 70 22 0.88 0.28 0.58 Desirable 0.6 Very Good
Item2 79 60 0.99 0.75 0.87 Very Easy 0.24 Needs Improvement
Item3 76 38 0.95 0.48 0.72 Easy 0.47 Very Good
Item4 59 24 0.74 0.3 0.52 Desirable 0.44 Very Good
Item5 42 6 0.53 0.08 0.31 Difficult 0.45 Very Good
Item6 21 4 0.26 0.05 0.16 Difficult 0.21 Needs Improvement
Item7 16 12 0.2 0.15 0.18 Difficult 0.05 Poor
Item8 68 41 0.85 0.51 0.68 Desirable 0.34 Good
Item9 71 14 0.89 0.18 0.54 Desirable 0.71 Very Good
Item10 28 12 0.35 0.15 0.25 Difficult 0.2 Needs Improvement
Item11 31 26 0.39 0.33 0.36 Difficult 0.06 Poor
Item12 76 26 0.95 0.33 0.64 Desirable 0.62 Very Good
Item13 70 26 0.88 0.33 0.61 Desirable 0.55 Very Good
Item14 34 22 0.43 0.28 0.36 Difficult 0.15 Poor
Item15 47 23 0.59 0.29 0.44 Desirable 0.3 Good
Item16 16 19 0.2 0.24 0.22 Difficult -0.04 Poor
Item17 72 37 0.9 0.46 0.68 Desirable 0.44 Very Good
Item18 58 34 0.73 0.43 0.58 Desirable 0.3 Good
Item19 38 15 0.48 0.19 0.34 Difficult 0.29 Needs Improvement
Item20 33 21 0.41 0.26 0.34 Difficult 0.15 Poor
Item21 60 41 0.75 0.51 0.63 Desirable 0.24 Needs Improvement
Item22 29 24 0.36 0.3 0.33 Difficult 0.06 Poor
Item23 25 15 0.31 0.19 0.25 Difficult 0.12 Poor
Item24 66 35 0.83 0.44 0.64 Desirable 0.39 Good

Item25 60 21 0.75 0.26 0.51 Desirable 0.49 Very Good
Item26 66 37 0.83 0.46 0.65 Desirable 0.37 Good
Item27 73 50 0.91 0.63 0.77 Easy 0.28 Needs Improvement
Item28 34 19 0.43 0.24 0.34 Difficult 0.19 Poor
Item29 62 26 0.78 0.33 0.56 Desirable 0.45 Very Good
Item30 75 27 0.94 0.34 0.64 Desirable 0.6 Very Good
Item31 64 35 0.8 0.44 0.62 Desirable 0.36 Good
Item32 65 25 0.81 0.31 0.56 Desirable 0.5 Very Good
Item33 62 21 0.78 0.26 0.52 Desirable 0.52 Very Good
Item34 48 16 0.6 0.2 0.4 Desirable 0.4 Very Good
Item35 76 25 0.95 0.31 0.63 Desirable 0.64 Very Good
Item36 52 22 0.65 0.28 0.47 Desirable 0.37 Good
Item37 60 28 0.75 0.35 0.55 Desirable 0.4 Very Good
Item38 57 22 0.71 0.28 0.5 Desirable 0.43 Very Good
Item39 32 22 0.4 0.28 0.34 Difficult 0.12 Poor
Item40 23 15 0.29 0.19 0.24 Difficult 0.1 Poor
Item41 58 17 0.73 0.21 0.47 Desirable 0.52 Very Good
Item42 74 42 0.93 0.53 0.73 Easy 0.4 Very Good
Item43 14 28 0.18 0.35 0.27 Difficult -0.17 Poor
Item44 63 21 0.79 0.26 0.53 Desirable 0.53 Very Good
Item45 73 35 0.91 0.44 0.68 Desirable 0.47 Very Good
Item46 51 20 0.64 0.25 0.45 Desirable 0.39 Good
Item47 70 17 0.88 0.21 0.55 Desirable 0.67 Very Good
Item48 48 9 0.6 0.11 0.36 Difficult 0.49 Very Good
Item49 32 27 0.4 0.34 0.37 Difficult 0.06 Poor
Item50 58 25 0.73 0.31 0.52 Desirable 0.42 Very Good
Item51 61 22 0.76 0.28 0.52 Desirable 0.48 Very Good
Item52 50 14 0.63 0.18 0.41 Desirable 0.45 Very Good
Item53 60 22 0.75 0.28 0.52 Desirable 0.47 Very Good
Item54 66 34 0.83 0.43 0.63 Desirable 0.4 Very Good
Item55 34 7 0.43 0.09 0.26 Difficult 0.34 Good
Item56 27 6 0.34 0.08 0.21 Difficult 0.26 Needs Improvement
Item57 77 17 0.96 0.21 0.59 Desirable 0.75 Very Good
Item58 68 27 0.85 0.34 0.6 Desirable 0.51 Very Good
Item59 70 46 0.88 0.58 0.73 Easy 0.3 Good
Item60 78 28 0.98 0.35 0.67 Desirable 0.63 Very Good
Item61 59 28 0.74 0.35 0.55 Desirable 0.39 Good
Item62 36 34 0.45 0.43 0.44 Desirable 0.02 Poor
Item63 44 29 0.55 0.36 0.46 Desirable 0.19 Poor
Item64 78 25 0.98 0.31 0.65 Desirable 0.67 Very Good
Item65 51 34 0.64 0.43 0.54 Desirable 0.21 Needs Improvement
Item66 51 17 0.64 0.21 0.43 Desirable 0.43 Very Good
Item67 30 9 0.38 0.11 0.25 Difficult 0.27 Needs Improvement
Item68 33 11 0.41 0.14 0.28 Difficult 0.27 Needs Improvement
Item69 59 19 0.74 0.24 0.49 Desirable 0.5 Very Good

Item70 54 23 0.68 0.29 0.49 Desirable 0.39 Good
Item71 61 40 0.76 0.5 0.63 Desirable 0.26 Needs Improvement
Item72 40 46 0.5 0.58 0.54 Desirable -0.08 Poor
Item73 66 47 0.83 0.59 0.71 Easy 0.24 Needs Improvement
Item74 75 26 0.94 0.33 0.64 Desirable 0.61 Very Good
Item75 42 12 0.53 0.15 0.34 Difficult 0.38 Good
Item76 79 69 0.99 0.86 0.93 Very Easy 0.13 Poor
Item77 4 5 0.05 0.06 0.06 Very Difficult -0.01 Poor
Item78 58 32 0.73 0.4 0.57 Desirable 0.33 Good
Item79 66 33 0.83 0.41 0.62 Desirable 0.42 Very Good
Item80 20 24 0.25 0.3 0.28 Difficult -0.05 Poor
Item81 62 41 0.78 0.51 0.65 Desirable 0.27 Needs Improvement
Item82 66 35 0.83 0.44 0.64 Desirable 0.39 Good
Item83 26 16 0.33 0.2 0.27 Difficult 0.13 Poor
Item84 80 67 1 0.84 0.92 Very Easy 0.16 Poor
Item85 67 22 0.84 0.28 0.56 Desirable 0.56 Very Good
Item86 70 21 0.88 0.26 0.57 Desirable 0.62 Very Good
Item87 56 12 0.7 0.15 0.43 Desirable 0.55 Very Good
Item88 64 32 0.8 0.4 0.6 Desirable 0.4 Very Good
Item89 36 15 0.45 0.19 0.32 Difficult 0.26 Needs Improvement
Item90 75 37 0.94 0.46 0.7 Desirable 0.48 Very Good
Item91 44 22 0.55 0.28 0.42 Desirable 0.27 Needs Improvement
Item92 72 31 0.9 0.39 0.65 Desirable 0.51 Very Good
Item93 77 36 0.96 0.45 0.71 Easy 0.51 Very Good
Item94 79 37 0.99 0.46 0.73 Easy 0.53 Very Good
Item95 66 45 0.83 0.56 0.7 Desirable 0.27 Needs Improvement
Item96 75 24 0.94 0.3 0.62 Desirable 0.64 Very Good
Item97 61 22 0.76 0.28 0.52 Desirable 0.48 Very Good
Item98 46 29 0.58 0.36 0.47 Desirable 0.22 Needs Improvement
Item99 60 28 0.75 0.35 0.55 Desirable 0.4 Very Good
Item100 24 12 0.3 0.15 0.23 Difficult 0.15 Poor

The mean difficulty index of the test is .51 and the mean discrimination index is .35. In terms of difficulty index, 3 items (3%) are very easy, 7 items (7%) are easy, 63 items (63%) are desirable, 26 items (26%) are difficult and 1 item (1%) is very difficult. In terms of discrimination index, 47 items (47%) have very good discriminating power, 15 items (15%) are good, 17 items (17%) need improvement and 21 items (21%) are poor in discriminating among students. Table 6 below shows the distribution of items relative to their difficulty and discrimination indices.

Table 6. Distribution of items relative to difficulty and discrimination indices

                                                    p value
DI                        Very Easy     Easy         Desirable    Difficult    Very Difficult   Total
                          (.86 – 1.00)  (.71 – .85)  (.40 – .70)  (.15 – .39)  (.01 – .14)
Very Good (.40 and above)      –             4           40            2             –            46
Good (.30 – .39)               –             1           13            2             –            16
NI* (.20 – .29)                1             2            7            4             –            14
Poor (.19 and below)           2             –            3           18             1            24
Total                          3             7           63           26             1           100
*NI – Needs improvement

Table 6 shows that there are 62 items (62%) with a difficulty index between .15 and .85 (difficult to easy) and a discrimination index of .30 and above (good or very good). These items performed well in the pilot test and are usually considered good enough to be used in classroom tests. The remaining 38 items (38%) fall outside these ranges and may be changed, revised or rejected from the test.

Distractor analysis was performed by identifying the non-functional distractors (NFDs) – distractors chosen by less than 5% of the participants – and computing the distractor efficiency of each item. Table 7 shows the distribution of items based on distractor efficiency (DE).

Table 7. Distribution of items based on distractor efficiency


Distractor Efficiency (DE)
0 NFDs 1 NFD 2 NFDs 3 NFDs
(100% DE) (66.6% DE) (33.3% DE) (0% DE)
Number of Items 82 14 3 1
Percentage 82% 14% 3% 1%

Table 7 shows that 82 items (82%) have a distractor efficiency of 100%, and only 1 item (1%) has a distractor efficiency of 0%. Out of 300 distractors, 23 distractors (7.7%) are non-functional. Items with zero non-functional distractors have a mean difficulty index of 0.49 and a mean discrimination index of 0.36, while the item with three non-functional distractors has a mean difficulty index of 0.93 and a mean discrimination index of 0.13.
The researcher observed that as the number of non-functional distractors in an item increases (that is, as distractor efficiency decreases), the mean difficulty index increases (the items become easier) while the mean discrimination index decreases. This relationship is shown in Table 8 below.

Table 8. Non-functioning distractors (NFD’s) and mean difficulty and discrimination indices
Distractor Efficiency (DE)
0 NFDs 1 NFD 2 NFDs 3 NFDs
(100% DE) (66.6% DE) (33.3% DE) (0% DE)
Number of Items 82 14 3 1
Percentage 82% 14% 3% 1%
Mean difficulty index .49 .53 .81 .93
Mean discrimination index .36 .35 .21 .13

Test Reliability

The internal consistency of the initial form of the GP1-AT was investigated using Cronbach's alpha, computed with the IBM SPSS 20 software package. The initial Cronbach's alpha of the test is .904, which is considered excellent. Table 9 shows the results of the initial reliability analysis.

Table 9. Reliability analysis


Cronbach's Alpha Based on
Cronbach's Alpha Standardized Items N of Items
.904 .901 100

DISCUSSION

The single-response multiple choice test is a usual choice for classroom assessment and standardized testing because of its objectivity and ease of scoring. However, constructing valid and reliable multiple choice tests may prove to be a difficult task. To ensure the efficiency of multiple choice questions, various statistical analyses may be utilized. The difficulty and discrimination indices are among the tools used to check whether multiple choice questions are well constructed, while distractor efficiency may be used to further analyze the quality of the distractors (Mukherjee and Lahiri, 2015). Reliability analysis, such as internal consistency reliability, may be used to investigate whether the test is measuring the intended construct or content.
In the present study, the researcher aimed to develop a multiple choice test to evaluate the achievement of grade 12 STEM students in selected topics in General Physics 1. The initial form of the test has 100 multiple choice items with four options each. The test was pilot tested with 300 grade 12 STEM students from two private institutions in Quezon City.

The mean difficulty index or p value of the test was found to be .51, which means that most of the items are within the desirable difficulty range (Wiersma & Jurs, 1990; Scannel & Tracy, 1975). The majority of the items (63%) were found to have a desirable difficulty, 26 items (26%) were difficult and 7 items (7%) were easy.

The mean discrimination index or DI was computed to be .35, which means that the initial form of the test has good discriminating power (Wiersma & Jurs, 1990; Scannel & Tracy, 1975). Notably, 47 items (47%) were found to have a discrimination index of .40 and above, which means that these items have very good discriminating power (Wiersma & Jurs, 1990; Scannel & Tracy, 1975). There are 38 items (38%) whose discrimination is poor or needs improvement. Three items (3%) have a negative discrimination index, which means that more students in the lower group than in the upper group answered the items correctly. A negative discrimination index could be caused by ambiguous questions, a wrong key or poor general preparation of the students (Mukherjee & Lahiri, 2015). Items with a discrimination index of .20 and below must be replaced, revised or rejected.

Distractor analysis showed that out of 300 distractors, 23 (7.7%) are non-functional. The majority of the items (82%) have 100% distractor efficiency, which means that all of their distractors are functioning, while only one item (1%) has three non-functioning distractors. Non-functioning distractors must be changed or revised to improve the item's efficiency in discriminating lower-group students from upper-group students.

Items with 100% DE have a mean p value of .49 and a mean DI of .36, while items with 0% DE have a mean p value of .93 and a mean DI of .13. A relationship between DE, the p value and DI can therefore be inferred from the results. As the distractor efficiency or DE decreases, the mean difficulty index or p value increases and the mean discrimination index or DI decreases; items with high distractor efficiency are shown to have better discrimination indices. The present study also shows that as the number of non-functional distractors in an item increases, students are more likely to get the answer correct, and thus the item's discriminating power decreases. This shows that the quality of the distractors can affect the difficulty and discrimination indices of the items.

The internal consistency reliability of the test was investigated using Cronbach’s alpha. The

Cronbach’s alpha of the initial form of the GP1-AT after the pilot test was found to be .904 which suggests

high reliability and excellent internal consistency. Though not much data is available, one rule of thumb

states that values equal to or greater than .7 are acceptable.

RECOMMENDATIONS

The initial form of the GP1-AT in Force, Energy and Motion is shown to have a desirable mean difficulty index and a mean discrimination index that is considered good, with high reliability and excellent internal consistency. Though many items performed well in the pilot test, the author highly recommends that the test be validated by experts in physics education. Due to time constraints, the items did not undergo face and content validation. This will surely improve the quality of the distractors as well as of the items.

Distractor analysis of the test shows that there are non-functioning distractors. The present study

has shown that distractor efficiency has an effect on the difficulty and discrimination indices of the items.

These distractors must be changed or revised in order to improve the discriminating power of the items.

The initial form of the GP1-AT has items with four options usually placed after the stem. Though the arrangement of the options and the case of the option letters (upper case or lower case) have been shown to have no effect on students' performance (Bendulo et al., 2017), the number of options should be considered. Traditionally, four or five options are used to increase the reliability of the test (Thorndike & Thorndike-Christ, 2010; Hopkins, 1998; Mehrens & Lehman, 1991); however, a growing number of studies endorse the use of three options (Haladyna & Downing, 1993; Haladyna et al., 2002; Costin, 1970; Nwadinigwe & Naibi, 2013). The researcher recommends that future researchers look into the feasibility of using three options for the GP1-AT.

REFERENCES

Bendulo, H. O., Tibus, E. D., Bande, R. A., Oyzon, V. Q., Milla, N. E., & Macalinao, M. L. (2017).
Format of options in multiple choice test vis-a-vis test performance. International Journal of
Evaluation and Research in Education, 157-163.
BrckaLorenz, A., Chiang, Y., & Nelson Laird, T. (2013). Internal consistency. Retrieved from FSSE
Psychometric Portfolio: fsse.indiana.edu.
Brown, J. (1996). Testing in language programs. Upper Saddle River, New Jersey: Prentice Hall
Regents.
Costin, F. (1970). The optimal number of alternatives in multiple-choice achievement tests: Some
empirical evidence of a mathematical proof. Educational and Psychological Measurement, 353-
358.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-
writing guidelines for classroom assessment. Applied Measurement in Education, 309-334.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice item?
Educational and Psychological Measurement, 999-1010.
Hopkins, K. D. (1998). Educational and Psychological Measurement and Evaluation (8th ed.).
Needham Heights, MA: Allyn and Bacon.
Mehrens, W. A., & Lehman, I. J. (1991). Measurement and Evaluation in Education and Psychology
(4th ed.). Fort Worth, TX: Harcourt Brace Jovanovich.
Mukherjee, P., & Lahiri, S. (2015). Analysis of multiple choice questions (MCQs): Item and test
statistics from an assessment in a medical college of Kolkata, West Bengal. IOSR Journal of
Dental and Medical Sciences, 47-52.
Nwadinigwe, P. I., & Naibi, L. (2013). The number of options in a multiple-choice test item and the
psychometric properties. Journal of Education and Practice, 189-196.
Scannel, D. P., & Tracy, D. B. (1975). Testing and measurement in the classroom. Boston: Houghton
Mifflin Co.
Thorndike, R. M., & Thorndike-Christ, T. (2010). Measurement and Evaluation in Psychology and
Education (8th ed.). Upper Saddle River, NJ: Pearson/Merrill Prentice Hall.
Wells, C. A., & Wollack, J. A. (2003, November). An instructor's guide to understanding test reliability.
University of Wisconsin, Testing and Evaluation Services, Madison, Wisconsin.
Wiersma, W., & Jurs, S. G. (1990). Educational Measurement and Testing. USA: Allyn and Bacon.

