Strand 11

Evaluation and assessment of student learning and development

CONTENTS
Chapter  Title

1   Introduction
    Robin Millar, Jens Dolin
2   Performance assessment of practical skills in science in teacher training programs
    useful in school
    Ann Mutvei Berrez, Jan-Eric Mattsson
3   Development of an instrument to measure children's systems thinking
    Kyriake Constantinide, Michalis Michaelides, Costantinos P. Constantinou
4   Development of a two-tier test-instrument for geometrical optics
    Claudia Haagen, Martin Hopf
5   Strengthening assessment in high school inquiry classrooms
    Chris Harrison
6   Analysis of student concept knowledge in kinematics
    Andreas Lichtenberger, Andreas Vaterlaous, Clemens Wagner
7   Measuring experimental skills in large-scale assessments: Developing a
    simulation-based test instrument
    Martin Dickmann, Bodo Eickhorst, Heike Theyssen, Knut Neumann, Horst Schecker,
    Nico Schreiber
8   The notion of authenticity according to PISA: An empirical analysis
    Laura Weiss, Andreas Mueller
9   Examining whether secondary school students make changes suggested by expert or
    peer assessors in the science webportfolio
    Olia Tsivitanidou, Zacharias Zacharia, Tasos Hovardas
10  Sources of difficulties in PISA science items
    Florence Le Hebel, Andree Tiberghien, Pascale Montpied
11  In-context items in a nation-wide examination: Which knowledge and skills are
    actually assessed?
    Nelio Bizzo, Ana Maria Santos Gouw, Paulo Sergio Garcia, Paulo Henrique Nico
    Monteiro, Luiz Caldeira Brant de Tolentino-Neto
12  Predicting success of freshmen in chemistry using moderated multiple linear
    regression analysis
    Katja Freyer, Matthias Epple, Elke Sumfleth
13  Testing student conceptual understanding of electric circuits as a system
    Hildegard Urban-Woldron
14  Process-oriented and product-oriented assessment of experimental skills in physics:
    A comparison
    Nico Schreiber, Heike Theyssen, Horst Schecker
15  Modelling and assessing experimental competence: An interdisciplinary progress
    model for hands-on assessments
    Susanne Metzger, Christoph Gut, Pitt Hild, Josiane Tardent
16  Effects of self-evaluation on students' achievements in chemistry education
    Inga Kallweit, Insa Melle


INTRODUCTION
Strand 11 focuses on the evaluation and assessment of student learning and
development. Many studies presented in other conference strands, of course, involve
the assessment of student learning or of affective characteristics and outcomes such as
students' attitudes or interests, and use existing instruments or new ones developed
for the study in hand. In such studies, assessment instruments are tools to be used to
try to explore and answer other questions of interest. In strand 11, the emphasis is on
the development, validation and use of assessment instruments; the focus is on the
instrument itself. These can include standardized tests, achievement tests, high stakes
tests, and instruments for measuring attitudes, interests, beliefs, self-efficacy, science
process skills, conceptual understandings, and so on. They may be developed with a
view to making assessment more authentic in some sense, to facilitate formative
assessment, or to improve summative assessment of student learning.
Fifteen papers presented in this strand are included in this book of e-proceedings.
Four of them discuss the development of new or modified instruments to assess
students' conceptual understanding of a science topic. Two use the two-tier
multiple-choice format that many researchers have found valuable for probing understanding,
to explore the topics of electric circuits and geometrical optics. Another explores the
factors that may underlie the observed patterns in students' responses, trying to tease
out the relative importance of mathematical and physical ideas in determining
performance on questions about kinematics. A fourth paper begins the exploration of
a relatively new science domain, systems thinking. Here assessment items
have a particularly significant role to play in helping to define the domain in
operational terms, and facilitating discussion within the science education research
community.
Four papers explore issues concerning the assessment of practical competence and
skills. One looks at the general issue of developing a model to describe progress in
carrying out hands-on activities; another focuses more specifically on experimental
skills in physics; and a third considers performance assessment in the context of initial
teacher education. The fourth paper looks at the potential use of simulations as
surrogates for bench practical activities. Work in this domain is important, as science
educators seek to come to a better understanding of the factors that lead to variation in
students' responses to practical tasks.
Three papers look in different ways at the influence of contexts on students' answers
and responses to tasks. Two take the PISA studies as their starting point, looking in
detail at the thinking of students as they respond to PISA tasks and questioning the
extent to which the PISA interpretation of authenticity enhances student interest and
engagement with assessment tasks. Both point to the value of listening to students
talking about their thinking as they answer questions, and suggest that this may be
quite different from what we would expect, and perhaps hope. A third paper with an
interest in the effects of contextualisation presents data from a study in Brazil
comparing students' answers to sets of parallel questions with fuller and more
abridged contextual information. The findings have implications for item design, and
suggest that reading demands should be kept carefully in check if we aim to assess
science learning.

Three papers in this section explore the formative use of assessment. One has a focus
on the assessment of learning that results from inquiry-based science teaching.
Another looks at the ways in which students respond to formative feedback on their
work. The context for this study is web portfolios, but the research question is one
with wider applicability to other forms of feedback, and across science contents more
generally. The third uses an experimental design to explore the impact on student
learning in a topic on chemical reactions of a self-evaluation instrument that asks
students to try to monitor their own learning and to take action to address areas in
which they judge themselves to be weak.
All of the papers described above collect data from students of secondary school age
or prospective teachers. The final paper in this strand looks at the potential use of an
attitude assessment instrument to predict undergraduate students' success in chemistry
learning.
The set of papers highlights the key role of assessment items and instruments as
operational definitions of intended learning outcomes, bringing greater clarity to the
constructs used and to our understanding of learning in the domains that they study.

Jens Dolin and Robin Millar


PERFORMANCE ASSESSMENT OF PRACTICAL
SKILLS IN SCIENCE IN TEACHER TRAINING
PROGRAMS USEFUL IN SCHOOL
Ann Mutvei and Jan-Eric Mattsson
School of Natural Sciences, Technology and Environmental Studies, Södertörn
University, Sweden.
Abstract: There is a general trend towards understanding knowledge not as a matter of
remembering facts but as the skill of using what is learnt under different
circumstances. On this view, knowledge should also be useful on different occasions
outside school. This trend can also be identified in the development of new tests
designed to assess knowledge.
In courses in biology, chemistry and physics focused on didactics we have developed
performance assessments aimed at assessing the understanding of general scientific
principles through simple practical investigations. Although these were designed to
assess whether specific goals are attained, we discovered that small alterations of the
performance assessments promoted the development of didactic skills. Performance
assessments may thus act as tools for the academic teacher and the school teacher, and
may enhance student understanding of the theory.
This workshop focused on performance assessments of the ability to present
skills and to develop new ideas. We presented, discussed and explained a practical
approach to performance assessments in science education, and familiarized the
participants with it. The emphasis was on demonstrating this assessment tool and
giving experience of it.
We performed elaborative tasks as they may be used by teachers working at different
levels, assessed the performances and evaluated the learning outcome of the activity.
Different assessment rubrics were presented and tested at the workshop. Learning
by doing filled the major part of the workshop, but there were also opportunities for
discussion, sharing ideas and suggestions for further development.
The activities performed may be seen as models that can be developed further into
new assessments.
Keywords: assessment, rubric, practical skills, knowledge requirement
INTRODUCTION
During the last ten to fifteen years there has been a general trend towards
understanding knowledge not as a matter of remembering facts but as the skill of
using what is learnt under different, more or less practical, circumstances.
On this view, knowledge should also be useful on different occasions outside
school. Traditional textbooks often arranged facts in a linear and hierarchical
order. More recent books focus on developing the thoughts and ideas of
the student by presenting general principles underpinned by good examples,
diagnoses, questions to discuss, reflective tasks without any presentation of a correct
answer, etc. (cf. Audesirk et al. 2008, Hewitt et al. 2008, Reece et al. 2011, Trefil &
Hazen 2010). A similar development can be found in teacher training programs,
where lectures and traditional text seminars have to some extent been replaced by
more interactive forms of teaching. We have also seen this development in examinations
at our own university, where tests assessing knowledge of literature
content have been replaced by tests in which students have to show their capacity to use
their knowledge.
Practical performance assessments are important when assessing the abilities or skills
of students in teacher training programs. In science courses in biology, chemistry and
physics focused on didactics we have for several years developed performance
assessments focused on the understanding of general scientific principles, but based on
simple practical investigations or studies. Although these were designed to assess
whether students reached the goals of a specific course, we have often discovered that
small alterations of these performance assessments promoted the development of the
didactic skills of the student. Thus, they may act as assessment tools for the academic
teacher, as models for assessments in school, and as a means of enhancing the students'
theoretical understanding of the subject. The assessments may be made on
oral or written reports, during guided excursions, museum visits or practical
experiments, on traditional or aesthetic diaries, and on self-diagnoses or diagnoses made
by other students based on certain criteria.
We have worked for several years with teacher training programs focused on work
in primary and secondary schools, with further education for teachers, and with
university students studying biology and chemistry. This wide range of courses and
students has given us experience of how to work with different contents adapted
to students of different ages at school. From this we have found both shared and
differing basic problems and needs of understanding, depending on the subject. These
experiences also give us the opportunity to contribute to national seminars and
conferences.
CURRICULUM AT SWEDISH SCHOOL
The new curriculum in Sweden for the primary and lower secondary schools
(Skolverket 2010) as well as the new one for the upper secondary school put the
emphasis on the students' skills rather than knowledge (facts). It is the ability to use
the knowledge that is to be assessed. This development is a global trend; see e.g.
Eurasian Journal of Mathematics, Science & Technology Education 8(1). This is a
great change compared to earlier curricula, especially when compared to the common
interpretation and implementation of these at the local level. A similar development
has occurred in the universities in Sweden. Today the intended learning outcomes
should be described in the syllabi as abilities the student can show after finishing the
course and how this should be done.
Many teachers have problems with this view as they are used to assessing the students'
ability to reproduce facts. These teachers find it hard to understand how to work with
performance assessments instead of tests targeting the knowledge of facts. They often
ask for clear directions and expect strict answers instead of guidelines on how to improve
their own ability to work with performance assessments.


Teaching according to these new curricula starts with the design of performance
assessments suitable for assessing a specific skill and the creation of a rubric for the
assessment. Thereafter the teacher plans exercises beneficial for student
development, and finally decides the time needed and plans the activities
accordingly.

Figure 1. How to plan learning situations.


As an example of how teachers may work with this method we designed a performance
assessment of practical skills and presented it as a workshop at ESERA 2013.
HOW TO DESIGN A PERFORMANCE ASSESSMENT OF PRACTICAL
SCIENCE SKILLS
In order to design a workshop on performance assessments of practical skills we tried
to work as teachers are supposed to work at school. The emphasis was on demonstrating
this assessment tool and giving participants experience of it under realistic conditions.
Thus, these performance assessments are constructed in accordance with the Swedish
curriculum from 2011 (Skolverket 2010), but they are probably useful for anyone who
wants to assess abilities or skills rather than the recall of facts or texts. We tried to
present and explain a practical approach to performance assessments in science
education at school and to familiarize the participants with it.
The skill of assessment has to be learned. When teachers are used to assessing skills,
these are normally of a more or less theoretical kind: they are used to assessing the
quality of the language used or the correctness of a mathematical calculation.
Assessment of practical skills does not have to be more complicated, but it has to be
practised. According to the Swedish curriculum, 150–200 assessments of each student
and in each of the about 15 school subjects should be made at the end of years 6 and 9,
and many of these refer to practical skills. In order to simplify this enormous task it is
possible, and necessary, to assess several skills in more than one subject on one occasion.
We had prepared four similar activities, all with the same material (candle, wick, and
matchbox) but with different purposes. They were designed to represent studies of
mass transfer, energy transformation, technical design, and phase changes. The last of
these is presented here in detail.

General principles of performance assessments


In the preparations we followed the directions of the Swedish curriculum for the
compulsory school (Skolverket 2010). We selected the core content and the
knowledge requirements relevant for phase transitions as the foundation for
development of the performance assessment. Usually teachers start with the
knowledge requirements, interpret these and design tests for assessing the students'
skills according to the requirements, design suitable learning situations or practical
training of the skills, and finally decide what parts of the core content should be used
(Figure 1). Here we started with the core content, as there were some specific areas of
knowledge we wanted to study. When the core content had been selected, the assessment
rubric was developed by interpreting and dissecting the knowledge requirements.

Core content
The teaching in science studies should, in this case, according to the curriculum of
primary and secondary school (Skolverket 2010), deal with the core content presented
in Table 1.
Table 1
Core content in Swedish compulsory school curriculum relevant for phase transitions
and scientific studies.

In years 1–3
Various forms of water: solids, liquids and gases. Transition between the forms:
evaporation, boiling, condensation, melting and solidification.
Simple scientific studies.

In years 4–6
Simple particle model to describe and explain the structure, recycling and
indestructibility of matter. Movements of particles as an explanation for transitions
between solids, liquids and gases.
Simple systematic studies. Planning, execution and evaluation.

In years 7–9
Particle models to describe and explain the properties of phases, phase transitions and
distribution processes for matter in air, water and the ground.
Systematic studies. Formulating simple questions, planning, execution and evaluation.
The relationship between chemical experiments and the development of concepts,
models and theories.

Knowledge requirements
The knowledge requirements are related to the age of the students and show a clear
progression through school. At the end of the third, sixth and ninth year there are
clearly defined knowledge requirements (Table 2). Grades are introduced in the sixth
year and levels for grades E (lowest), C, and A (highest) are described in the
curriculum. Grades D and B are also used: grade D means that the knowledge
requirements for grade E and most of those for grade C are satisfied, and grade B that
those for grade C and most of those for grade A are satisfied.
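The rule for the intermediate grades can be made concrete with a small sketch. It is only an illustration: the curriculum text quoted here gives no numeric meaning for "most of", so the threshold `most` (more than half), the function name `overall_grade` and the label "below E" are all assumptions introduced for this example.

```python
from typing import Dict

# Assumption: "most of" is modelled as a fraction strictly above 0.5.
# The text gives no numeric threshold, so this value is illustrative only.
MOST = 0.5


def overall_grade(fraction_met: Dict[str, float], most: float = MOST) -> str:
    """Map the fraction of requirements met at levels E, C and A to a grade.

    fraction_met holds, for each of "E", "C" and "A", the share (0.0 to 1.0)
    of that level's knowledge requirements the pupil satisfies.
    """
    e, c, a = (fraction_met[level] for level in ("E", "C", "A"))
    if e < 1.0:
        return "below E"  # hypothetical label: requirements for E not all met
    if c < 1.0:
        # All of E met; D if most of C is also met, otherwise E.
        return "D" if c > most else "E"
    if a < 1.0:
        # All of E and C met; B if most of A is also met, otherwise C.
        return "B" if a > most else "C"
    return "A"
```

For instance, a pupil who meets every requirement for E and three quarters of those for C would receive D under these assumptions.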


Table 2
Knowledge requirements for different years and grades

Year 3
Based on clear instructions, pupils can carry out [...] simple studies dealing with
nature and people, power and motion, and also water and air.

Year 6, Grade E
Pupils can talk about and discuss simple questions concerning energy. Pupils can carry
out simple studies based on given plans and also contribute to formulating simple
questions and planning which can be systematically developed. In their work, pupils use
equipment in a safe and basically functional way. Pupils can [...] contribute to making
proposals that can improve the study.

Year 6, Grade C
Pupils can talk about and discuss simple questions concerning energy. Pupils can carry
out simple studies based on given plans and also formulate simple questions and
planning which after some reworking can be systematically developed. In their work,
pupils use equipment in a safe and appropriate way. Pupils can [...] make proposals
which after some reworking can improve the study.

Year 6, Grade A
Pupils can talk about and discuss simple questions concerning energy. Pupils can carry
out simple studies based on given plans and also formulate simple questions and
planning which after some reworking can be systematically developed. In their work,
pupils use equipment in a safe, appropriate and effective way. Pupils can [...] make
proposals which can improve the study.

Year 9, Grade E
Pupils can talk about and discuss questions concerning energy. Pupils can carry out
studies based on given plans and also contribute to formulating simple questions and
planning which can be systematically developed. In their studies, pupils use equipment
in a safe and basically functional way. Pupils apply simple reasoning about the
plausibility of their results and contribute to making proposals on how the studies can
be improved. Pupils have basic knowledge of energy, matter, [...] and show this by
giving examples and describing these with some use of the concepts, models and
theories.

Year 9, Grade C
Pupils can talk about and discuss questions concerning energy. Pupils can carry out
studies based on given plans and also formulate simple questions and planning which
after some reworking can be systematically developed. In their studies, pupils use
equipment in a safe and appropriate way. Pupils apply developed reasoning about the
plausibility of their results and make proposals on how the studies can be improved.
Pupils have good knowledge of energy, matter, [...] and show this by explaining and
showing relationships with relatively good use of the concepts, models and theories.

Year 9, Grade A
Pupils can talk about and discuss questions concerning energy. Pupils can carry out
studies based on given plans and also formulate simple questions and planning that can
be systematically developed. In their investigations, pupils use equipment in a safe,
appropriate and effective way. Pupils apply well developed reasoning concerning the
plausibility of their results in relation to possible sources of error and make proposals
on how the studies can be improved and identify new questions for further study. Pupils
have very good knowledge of energy, matter, [...] and show this by explaining and
showing relationships between them and some general characteristics with good use of
the concepts, models and theories.


Assessments of knowledge requirements


The knowledge requirements were interpreted and dissected into smaller units in order
to construct an assessment rubric adapted to the inquiry. Five main skills were
selected from the knowledge requirements: Use of theory, Improvement of the
experiment, Explanations, Relate, and Discuss. In order to make the assessment rubric
more general we decided not to use the grades of the curriculum but to distinguish
three levels of skill, Sufficient, Good, and Better, corresponding to the grades E, C
and A respectively. In all cases we also gave examples of relevant student answers.
This is more or less necessary in order to make sure that the performer,
assessor or teacher really understands what is meant by a specific requirement (Arter
& McTighe 2001, Jönsson 2011).
As an example of this we can look at the knowledge requirement "Pupils can carry
out studies based on given plans and also contribute to formulating simple questions
and planning which can be systematically developed. In their studies, pupils use
equipment in a safe and basically functional way. Pupils apply simple reasoning
about the plausibility of their results and contribute to making proposals on how the
studies can be improved." (Year 9, grade E). This requirement contains information
that may be dissected into several units.
First it is necessary to look at the five skills of the students that are going to be
assessed and at the suitable requirements for each skill. The students are
supposed to "carry out studies based on given plans". In this case the experiment is
very simple (light and observe a burning candle) and hardly useful for assessing this
specific skill. They shall also "contribute to formulating simple questions and
planning which can be systematically developed". This requirement can be further
developed to suit the five skills.
In order to show this skill it is necessary to have some knowledge about the theory
and to use it in a suitable way. The skill Use of theory is a necessary condition for this
and may be formulated as "The student draws simple conclusions partly related to
chemical models and theories". This criterion is also in concordance with the requirement
"simple reasoning about the plausibility of their results and contribute to making
proposals on how the studies can be improved". This may be formulated as "The
student discusses the observations and contributes with suggestions of improvements"
in the rubric for assessment of the skill Improvement of the experiment.
In a similar way the assessment of the remaining three skills may be developed into more
specific criteria adapted to this experiment (Table 3).
In order to make it possible for the student to understand what is expected it is
necessary to clarify the requirement criteria and give realistic examples of these
requirements. The meaning of words differs between disciplines not only in the
academic world but also in school (cf. Chanock 2000). This has consequences when
students get feedback as they often do not understand the academic discourse with its
specific concepts and fail to use the feedback later (Lea & Street 2006). Criteria
combined with explicit examples are necessary to solve this problem (Sadler 1987).
This is also important when designing assessment rubrics (Busching 1998, Arter &
McTighe 2001). Thus, for every criterion at least one example has to be given. In
Table 3 this is done for every combination of skill and level.


Table 3
Assessment rubric for assessing skills in an experiment of phase changes

Use of theory
Sufficient: The student draws simple conclusions partly related to chemical models and
theories. (I can see stearic acid in solid, liquid and gas phase.)
Good: The student draws conclusions based on chemical models and theories. (The heat
of the candle causes the phase transfer between the phases.)
Better: The student draws well founded conclusions out of chemical models and
theories. (Stearic acid must be in gas phase and mix with oxygen to burn.)

Improvement of the experiment
Sufficient: The student discusses the observations and contributes with suggestions of
improvements. (Observe more burning candles.)
Good: The student discusses different interpretations of the observations and suggests
improvements. (Remove the wick and relight the candle.)
Better: The student discusses well founded interpretations of the observations, if they
are reasonable, and suggests, based on these, improvements which allow enquiries into
new questions. (Heat a small amount of stearic acid and try to light the gas phase
above.)

Explanations
Sufficient: The student gives simple and relatively well founded explanations. (The
stearic acid melts by heat produced by the flame.)
Good: The student gives developed and well founded explanations. (Also the change
from liquid phase to gaseous phase depends on the heat from the flame.)
Better: The student presents theoretically developed and well founded explanations.
(All phase changes from solid to liquid or liquid to gaseous need energy.)

Relate
Sufficient: The student gives examples of similar processes as in the experiment related
to questions about energy, environment, health and society. (The warmth of the sun
melts the ice on the lake at the end of the winter.)
Good: The student generalizes and describes the occurrence of similar phenomena as in
the experiment related to questions about energy, environment, health and society. (In
the frying pan it is hot enough for butter to melt and in the sauna water vaporizes.)
Better: The student discusses the occurrence of the phenomena observed in everyday
life and the use of it and its impact on environment, health and society. (The phase
change from liquid to gaseous phase cools you down when you are sweating.)

Discuss
Sufficient: The student contributes to a discussion of the occurrence of the phenomena
studied in society and makes statements partly based on facts and describes some
possible consequences. (Gases are often transported in a liquid phase which has a lower
volume.)
Good: The student describes and discusses the occurrence of the phenomena studied in
society and makes statements based on facts and fairly complicated physical relations
and theories. (The bottle of a gas stove has fuel mainly in liquid phase but it is
transported in the hose and burnt in gaseous phase.)
Better: The student uses the experiment as a model and discusses the occurrence of the
phenomena studied in society and makes statements and consequences based on facts and
complicated physical relations and theories. (The phase change from liquid to gaseous
phase cools you down when you are sweating.)
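
For teachers who keep their rubrics electronically, the structure of Table 3 (skill by level, each cell holding a criterion and an example answer) can be sketched as a nested mapping. This is a minimal illustration, not a tool used in the workshop: the names `RUBRIC`, `LEVEL_TO_GRADE`, `feedback` and `levels_to_grades` are hypothetical, and only the Explanations row is copied in to keep the sketch short.

```python
from typing import Dict, Tuple

# Levels of Table 3 and the curriculum grades they correspond to (E, C, A).
LEVEL_TO_GRADE: Dict[str, str] = {"Sufficient": "E", "Good": "C", "Better": "A"}

# Each cell is (criterion, example answer). Only one skill is shown here;
# the other four rows of Table 3 would be added in the same way.
RUBRIC: Dict[str, Dict[str, Tuple[str, str]]] = {
    "Explanations": {
        "Sufficient": (
            "The student gives simple and relatively well founded explanations.",
            "The stearic acid melts by heat produced by the flame.",
        ),
        "Good": (
            "The student gives developed and well founded explanations.",
            "Also the change from liquid phase to gaseous phase depends on "
            "the heat from the flame.",
        ),
        "Better": (
            "The student presents theoretically developed and well founded "
            "explanations.",
            "All phase changes from solid to liquid or liquid to gaseous "
            "need energy.",
        ),
    },
}


def feedback(skill: str, level: str) -> str:
    """Return the criterion and its example for one assessed skill and level."""
    criterion, example = RUBRIC[skill][level]
    grade = LEVEL_TO_GRADE[level]
    return f"{skill} [{level}, grade {grade}]: {criterion} Example: {example}"


def levels_to_grades(assessment: Dict[str, str]) -> Dict[str, str]:
    """Translate assessed levels per skill into the corresponding grades."""
    return {skill: LEVEL_TO_GRADE[level] for skill, level in assessment.items()}
```

Keeping the example answer next to each criterion follows the point made above that every criterion needs at least one explicit example.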


WORKSHOP
We had prepared four similar activities, all with the same material (candle, wick, and
matchbox) but with different purposes. The activities represented studies of mass
transfer, energy transformation, technical design, and phase changes. At the workshop
three groups were formed, omitting the study of technical design. The three groups
were not informed about the differences between the aims of their experiments. The
groups were composed to include people with as varied backgrounds as possible.
Thus, participants from the same country or from similar fields such as chemistry or
physics were allocated to different groups. They performed elaborative tasks similar to
those used by teachers working at different levels, assessed the performance and
evaluated the learning outcome of the activity. Within each group one person was
selected to assess the activities carried out by the others. The person assessing the work
was asked to focus not only on the results of the discussions within the group but also to
try to evaluate the process, as the aim was to assess the skills of the participants rather
than the content of their knowledge.
Discussion
The aim was to demonstrate how peer reviewing within the group may be used to
produce information of several kinds beneficial for performance assessment in
science education at school. Discussions arose among the participants about how an
integrated approach, especially in relation to other subjects in school, improved the
usefulness of the methods. Learning by doing followed by discussions became the
major part of the workshop, with sharing of ideas and suggestions for further
development.
Most of the participants had little knowledge of assessments of practical skills; they
expressed their astonishment at the positive result of the workshop and showed
interest in using the method. Some of the participants also showed didactic skills when
explaining to the others the aspects of the experiment they had mastered, a good
example of the importance of variation in the skills of group members.
The persons who made the assessments expressed the need for further practice. They
realized the complexity of assessing different skills at the same time as assessing the
grade. They also expressed a will to develop this ability as they realized the strength
of assessing several skills on one occasion. Further, the participants noted the
importance of questions like the last one in the instructions (Appendix) for
assessing the quality of the relation between theory and practice.
Conclusion
Although based on a simple experiment with a burning candle, the workshop gave an
opportunity to discuss and understand theories regarded as difficult to
understand from the viewpoint of the student or difficult to teach from the teacher's
view. The experiments, although similar, were of different character, thus reflecting a
wide spectrum of possibilities.
Thus, the activities performed may be seen as models or examples that can be developed
further into new assessments according to the content of the subject.


REFERENCES
Arter, J. A. & McTighe, J. (2001). Scoring rubrics in the classroom, Corwin
Audesirk, T., Audesirk, G., & Byers, G B. (2008). Life on earth, 5 ed., San Francisco,
Pearson Education.
Busching, B. (1998). Grading Inquiry Projects. New Directions for Teaching and
Learning 74: 89-96.
Chanock, K. (2000). Comments on Essays: do students understand what tutors write?
Teaching in Higher Education 5 (1): 95-105.
Hewitt, P. G., Suchocki, J. & Hewitt, L. A. (2008). Conceptual physical science, 4 ed.
San Francisco, Pearson Education.
Jönsson, A. (2011). Lärande bedömning. Gleerups.
Lea, M.R. & Street B.V. (2006). The Academic Literacies Model: Theory and
Applications. Theory into Practice, 45(4): 368-377.
Reece, J.B., Urry, L.A., Cain, M.L., Wasserman, S.A., Minorsky, P.V. & Jackson, R.
B. (2011). Campbell Biology Global Edition, Pearson.
Sadler, D.R. (1987). Specifying and Promulgating Achievement Standards. Oxford
Review of Education 13(2): 191-209.
Skolverket (Swedish National Agency for Education). (2010) Curriculum for the
compulsory school, preschool class and the recreation centre 2011. Skolverket.
Trefil, J. & Hazen R.M. (2010). Sciences an integrated approach. Wiley Eurasian
Journal of Mathematics, Science & Technology Education 8(1).


APPENDIX
INQUIRY OF A BURNING CANDLE

This is an experiment of phase changes

1. Light the candle and observe the change of phases.
2. Which changes of phase can you observe?
3. Where do they occur?
4. Why do they occur?
5. What happens in the different phases?
6. How may you improve the experiment?
7. Give examples of phase changes in daily life and the society.

INQUIRY OF A BURNING CANDLE

This is an experiment of energy transformation

1. Light the candle and observe the energy transformations.
2. Which changes of energy forms can you observe?
3. Where do they occur?
4. Why do they occur?
5. What happens during the different energy transformations?
6. How may you improve the experiment?
7. Give examples of energy transformations in daily life and the society.

INQUIRY OF A BURNING CANDLE

This is an experiment of mass transfer

1. Light the candle and observe mass transfer.
2. Which types of mass transfer can you observe?
3. Where do they occur?
4. Why do they occur?
5. What happens to the candle due to this mass transfer?
6. How may you improve the experiment?
7. Give examples of mass transfer in daily life and the society.

INQUIRY OF A BURNING CANDLE

This is an experiment of candle design

1. Light the candle and discuss the design of the candle.
2. Which different parts can you observe in the candle?
3. Where are they and how are they united?
4. What function do the different parts have?
5. Why is the candle created in that way?
6. How may you improve the experiment?
7. Give examples of similar designs in daily life and the society.


DEVELOPMENT OF AN INSTRUMENT TO MEASURE CHILDREN'S SYSTEMS THINKING
Kyriake Constantinide, Michalis Michaelides and Costas P. Constantinou
University of Cyprus
Abstract: Systems thinking is a higher order thinking skill required to meet the demands
of social, environmental, technological and scientific advancements. Science abounds in
systems and makes system function a core object of investigation and analysis. As a
consequence, teaching in science can be a valuable framework for developing systems
thinking. In order to approach this methodically, it becomes important to specify the
aspects that constitute the systems thinking construct, design curriculum materials to
help students develop these aspects, and develop instruments for evaluating students'
competence and monitoring the learning process. The present study aims at the
development of an instrument for standardized assessment of systems thinking. It draws
on a methodology that follows a cyclic procedure for instrument development and
validation, where literature, experts, students and educators contribute to the procedure.
Currently, the assessment instrument is in the second cycle of field testing; data
collected from about 900 students have been used to develop a first version of
a validated test and a scale for measuring 10-14-year-old children's systems thinking.
The test consists of multiple-choice scenario items that draw their content from everyday
life. We present the methodology we are following, providing some examples of
multiple-choice items to demonstrate their development and transformation throughout
the process.
Keywords: systems thinking, assessment, test development

BACKGROUND
The rate of advancements in scientific knowledge and technology and the widespread
demands on young people to participate actively in solving problems in almost every
aspect of our lives have reoriented the role of education in general and science teaching
in particular. Nowadays, science teaching aims at developing scientifically literate people
with flexible thinking skills and an ability to participate critically in meaningful
discourse. More specifically, it aims at helping students acquire positive attitudes towards
learning and science, a variety of experiences, conceptual understanding, epistemological
awareness, practical and scientific skills and creative thinking skills (Constantinide,
Kalyfommatou & Constantinou, 2001).
The definitions of systems thinking described in the literature (e.g., Senge, 1990; Thier &
Knott, 1992; Booth Sweeney, 2001; Ben-Zvi Assaraf & Orion, 2005) include thinking
about a system, meaning a number of interacting items that produce a result over a period
of time. According to the Benchmarks for Science Literacy (AAAS, 1993), systems
thinking is an essential component of higher order thinking, whereas Kali, Orion and
Eylon (2003) refer to systems thinking as a high-order thinking skill required in
scientific, technological, and everyday domains. Senge (1990) claims that systemic
thinkers are able to change their own mental models and to control their way of thinking and
their problem-solving process. Therefore, defining, promoting through curricula, and
measuring systems thinking should be an essential priority for education. Science
teaching and learning can be a valuable framework for developing such skills, since science
abounds in systems and makes system function a core object of investigation and
analysis.
Several structured teaching attempts to promote systems thinking are reported in the
literature, making the development of instruments for measuring systems thinking and for
evaluating the effectiveness of such curricula a necessity. The most common means of
evaluating systems thinking that have been reported thus far include tests (e.g. Riess &
Mischo, 2009), interviews (e.g. Hmelo-Silver & Green Pheffer, 2004) and computer
simulations and logs (e.g. Sheehy, Wylie, McGuinness & Orchard, 2000). Some
researchers used a combination of data sources in order to triangulate their data
(e.g. Ben-Zvi Assaraf & Orion, 2005). Almost all means include tasks where a problem is
introduced and the subjects have to propose solutions or predict the behavior of the
system and its elements. Nevertheless, to date there is no validated instrument and prior
research has not provided a scale for measuring systems thinking of children aged 10-14
years old. The purpose of this paper is to describe the on-going development process of
the Systems Thinking Assessment (STA), a test designed to assess systems thinking.

RESEARCH METHODOLOGY
Systems Thinking Assessment (STA): purpose and specifications
The STA will be used to measure the quality of thinking about systems by children aged
10-14 and the effectiveness of curricula designed to promote systems thinking. It consists
of multiple-choice items in the context of everyday phenomena, familiar to the children
of the specific age range. The stems of the items include a scenario and children are
asked to choose the best possible answer, amongst four alternatives.
Multiple-choice items have advantages and disadvantages. Provided that every other criterion
has been taken into account, grading a multiple-choice test is objective, since any grader would
mark an item in the same way as anybody else. Besides, only a short amount of time is needed
to administer many items, so that the content domain under study can be covered sufficiently.
Multiple-choice items are also more reliable than other forms of questions since, in a possible
readministration of a test, it is more likely that a subject will produce the same answers if
the questions are multiple choice than if they are open-ended. A basic disadvantage of
multiple-choice questions is that they do not provide much information on the subjects'
thinking processes, namely the reasons for which they answer each item the way they do.
Nevertheless, the procedure of the test's development and the Rasch analysis minimize the
effect of this disadvantage on the results.


In order to be able to make generalizations, there was an intentional effort to include items
that utilize various systems: physical-biological systems (such as the water cycle, a forest, a
dam or food webs), mechanical-electrical systems (such as a bicycle or a car) and
socioeconomic systems (such as a family, a village or a store). Moreover, where possible,
a picture or a diagram was added to the item's wording, so as to make the item clearer and
the test more visually appealing.
We have adopted the following operational definition of systems thinking, which relies
on four strands:
(a) "System definition" includes identifying the essential elements of a system, its
temporal boundaries and its emergent phenomena. (b) "System interactions" includes
reasoning about causes and effects when interactions are taking place within the system.
(c) "System balance" refers to the ability to recognize the relation between
interactions and the system's balance. (d) "Flows" refers to reasoning about the relation
of inflows and outflows in a system and recognizing cyclic flows of matter or energy.

STA's cyclic development procedure
Figure 1 presents the cyclic nature of the STA development. The definition of Systems
Thinking in the center of the cycle concerns both the abilities that constitute it and
the items that measure it. The involved parties (experts, educators, students and existing
literature) provide feedback on the definition of Systems Thinking through data that determine the
test's validity and reliability.

[Figure 1: cycle diagram. Center: Systems Thinking (abilities and items). Contributing
parties: Literature (content validity); Experts (content validity); Educators (face validity);
Students (test administration and interview data; construct, criterion and face validity,
reliability).]

Figure 1. Development procedure for STA


The STA has already undergone its first cycle of development. Reviewing the literature
led to 13 abilities that seemed to define Systems Thinking. The original items were
developed and administered to a small number of 10-year-old students. Qualitative and
quantitative data led to modifications (content and wording changes) and the
development of new items. Two experts gave feedback on the test's content validity.
Further improvements were carried out, and two educators with experience of teaching children
aged 10-14 examined the face validity of the test. The revised version was once
again administered to a small number of 10-year-old students and, after the necessary
modifications, the final form of the test with 52 multiple-choice items was administered
to about 900 students. Rasch modeling led to a scale showing item difficulty and student
ability.
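For readers less familiar with Rasch modeling, the idea of placing item difficulty and student ability on one scale can be illustrated with the dichotomous Rasch model, in which the probability of a correct answer depends only on the difference between ability and difficulty (both in logits). The following is a minimal sketch and not the analysis code used in this study; the function name `rasch_probability` and the example values are assumptions made for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are expressed in logits on the same scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical ability/difficulty pairs, chosen only for illustration.
for ability, difficulty in [(0.0, 0.0), (1.0, 0.0), (-1.0, 1.0)]:
    p = rasch_probability(ability, difficulty)
    print(f"ability={ability:+.1f}, difficulty={difficulty:+.1f} -> P(correct)={p:.2f}")
```

Under this model, a student whose ability equals the difficulty of an item has a 50% chance of answering it correctly; the chance rises as ability exceeds difficulty and falls otherwise.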
Based on a broader literature review and the development of separate examples regarding
each ability, the second development cycle began with a revision of the 13-ability schema,
reducing the abilities to 10 and the items to 41. The revised test was given to
approximately 90 students aged 10-14. Test and item difficulty indices, item
discrimination indices and frequencies were calculated, and items were either modified or
replaced. Afterwards, 16 students participated in interviews, answering the items while
following a think-aloud protocol (Ericsson & Simon, 1998). Ineffective items were
replaced or modified.
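The classical item statistics mentioned above can be sketched as follows. The paper does not state which formulas were used, so this sketch makes two assumptions: the difficulty index is taken as the proportion of correct answers, and the discrimination index as the difference in that proportion between the top and bottom 27% of students ranked by total score. The function names and the data are hypothetical.

```python
from typing import Sequence


def difficulty_index(item_scores: Sequence[int]) -> float:
    """Proportion of students who answered the item correctly (1 = correct)."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(item_scores: Sequence[int],
                         total_scores: Sequence[float],
                         fraction: float = 0.27) -> float:
    """Upper-lower discrimination index D = p_upper - p_lower.

    Students are ranked by total test score; the top and bottom
    `fraction` of the sample form the two comparison groups
    (the 27% cut-off is an assumed convention).
    """
    n = len(item_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = [item_scores[i] for i in order[:k]]
    upper = [item_scores[i] for i in order[-k:]]
    return sum(upper) / k - sum(lower) / k


# Hypothetical data: 10 students, their scores on one item and on the whole test.
item = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
totals = [35, 12, 30, 28, 15, 20, 33, 10, 25, 18]
print("difficulty index:", difficulty_index(item))
print("discrimination index:", discrimination_index(item, totals))
```

A negative discrimination index means that weaker students answered the item correctly more often than stronger students, which is usually a reason to revise or replace the item.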
The latest version of the items is under evaluation by independent experts. Graduate/PhD
students in Learning in Science, academics specialized in Science Teaching or
Psychology and international researchers with experience in Systems Thinking
measurement will provide feedback on the test by first solving it and then judging its
efficiency based on a structured protocol. Finally, an expert panel will be convened,
in which any problems will be discussed until the panel reaches consensus. The
revised test will be given to four educators to evaluate its face validity. The test will then
be administered to 100 students aged 10-14 to statistically assess its clarity and its
developmental validity. The improved test will finally be administered to 500 students
and the data will be analyzed using Rasch modeling. Confirmatory Factor Analysis will
be carried out in order to assess the 10-ability structure of the construct.

RESULTS
At the final stage of the first cycle of the STA development, the test was administered to
about 900 students. The Rasch statistical model provided a scale for the 52 items of the STA,
on which both the subjects' scores and the items' degrees of difficulty are presented (Figure 2).
The 52 items of the test fit the model well. Both the students' scores and
the items' degrees of difficulty are distributed uniformly on the scale. Students' scores vary
between -2.16 and 2.37 logits, whereas the items' degrees of difficulty vary between -2.41 and 2.53 logits.


Figure 2. Scale of STA (at the end of first cycle)
* Each symbol in the person distribution represents 4 students



Table 1
Statistical values for the 52 STA items for the whole sample and the four groups

Statistical indices                  Total     5th grade  6th grade  1st grade  2nd grade
                                     sample    Primary    Primary    Second.    Second.
                                     (n=848)   (n=219)    (n=249)    (n=137)    (n=243)
Mean (items*)                         0.00      0.00       0.00       0.00       0.00
Mean (persons)                       -0.01     -0.30      -0.05       0.14       0.21
Standard deviation (items)            0.97      0.96       0.97       1.08       1.03
Standard deviation (persons)          0.72      0.66       0.73       0.73       0.68
Separability** (items)                0.99      0.97       0.98       0.96       0.98
Separability** (persons)              0.81      0.77       0.81       0.81       0.78
Mean Infit mean square (items)        1.00      1.00       1.00       1.00       1.00
Mean Infit mean square (persons)      1.00      1.00       1.00       1.00       1.00
Mean Outfit mean square (items)       1.01      1.02       1.02       1.01       1.01
Mean Outfit mean square (persons)     1.01      1.02       1.02       1.01       1.01
Infit t (items)                      -0.12     -0.13      -0.03       0.00       0.04
Infit t (persons)                    -0.04     -0.07      -0.03      -0.02      -0.01
Outfit t (items)                      0.09      0.05       0.09       0.05       0.08
Outfit t (persons)                    0.02      0.04       0.03      -0.01       0.02

* L=52 items
** Separability: value=1 shows great reliability, whereas value=0 very little reliability

Table 1 shows the statistical values of the Rasch statistical model for the whole sample and
the four subgroups (5th and 6th primary grades and 1st and 2nd secondary grades)
separately. For the whole sample and the subgroups, the items' reliability
values are over .95, whereas the subjects' reliability values are over .76. Although the
generally accepted values for such a scale are over .90 (Wright, 1985), the subjects'
reliability may be accepted. Furthermore, the mean infit mean square for both items and
subjects equals 1 for the whole sample and the subgroups, while the mean outfit mean
square is either 1.01 or 1.02. Infit t and outfit t range from -0.13 to 0.09. The subjects'
standard deviation is rather small (SD=0.72), indicating uniformity in the sample's
behavior; namely, students aged 10-14 respond to the STA as a homogeneous group. Besides,
the subjects' mean score increases with age, suggesting developmental validity of the
test. Rasch analysis also showed that the items have infit values from .87 to 1.18,
which fall within the generally accepted range of .77-1.30 (Adams & Khoo, 1993). Three of the
items have an outfit value over 1.30, but since the difference between infit and outfit
values for these items is small, they remain in the test.
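The infit and outfit statistics discussed above can be illustrated with a short sketch. It uses the commonly cited definitions for dichotomous Rasch data: outfit is the mean of the squared standardized residuals, and infit is the sum of squared residuals divided by the sum of the model variances. The paper does not give its formulas, so these definitions, the function names and the data are assumptions for illustration; dedicated Rasch software also estimates abilities and difficulties from the data rather than taking them as given.

```python
import math
from typing import Sequence, Tuple


def rasch_p(ability: float, difficulty: float) -> float:
    """Expected probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses: Sequence[int],
             abilities: Sequence[float],
             difficulty: float) -> Tuple[float, float]:
    """Return (infit, outfit) mean squares for one item.

    responses[n] is 0 or 1 for student n, abilities[n] is that student's
    ability estimate in logits, and difficulty is the item's estimate.
    """
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, abilities):
        p = rasch_p(theta, difficulty)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted mean square.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical case: the weakest student succeeds and the strongest fails.
infit, outfit = item_fit([1, 1, 0, 0], [-3.0, 0.0, 0.0, 3.0], difficulty=0.0)
print(f"infit={infit:.2f}, outfit={outfit:.2f}")
```

Because outfit is not weighted by the model variance, it reacts more strongly to unexpected answers from students far from the item's difficulty, which is consistent with keeping items whose outfit is slightly high while their infit stays within range.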


This is an on-going study and, at the moment, the test is in its second cycle of
development. Test administration, interviews with students, and feedback from experts
and educators provide data to validate the items. The way data from each stage were
analyzed is indicated in Tables 2 and 3, which are presented in the next subsection. At
the end of the second cycle, Rasch analysis as well as confirmatory factor analysis will
be conducted and the results will be published.

Two examples of item development
The development of two items through the STA construction cycles can be seen in Tables
2 and 3. The bicycle item presented in Table 2 refers to the strand "System definition"
and more specifically to the ability of identifying the essential elements of a system;
it was revised during the procedure. The apple tree item presented in Table 3
refers to the strand "System balance" and more specifically to the ability of identifying
reinforcing and balancing loops. It has been replaced by a different item because of
problematic item statistics during the pre-pilot phase of the second development cycle.
Table 2
The development of the bicycle item

1st cycle: Pre-pilot
Item (translation in English):
Which are the least elements that a bicycle that can roll should have?
A. frame, two wheels, pedals, chain
B. frame, two wheels, gears, handle bar
C. frame, two wheels, pedals, seat
D. frame, two wheels
Comments: Students did not understand the wording of the stem.
Frequencies per alternative: A 0.41, B 0.09, C 0.32, D 0.18
Action: Change wording of main body and alternatives

1st cycle: To experts
Item (translation in English):
Which are the elements that a bicycle SHOULD have in order to roll, when someone is pushing it?
A. frame, two wheels, chain, pedals, handle bar
B. frame, two wheels, chain, pedals
C. frame, two wheels, chain
D. frame, two wheels
Comments: Experts relate the item to two initially separate abilities (the abilities 1.1 and 1.2 were afterwards unified)
Action: Keep as is

1st cycle: To educators
Item (translation in English):
Which are the elements that a bicycle SHOULD have in order to roll, when someone is pushing it?
A. frame, two wheels, chain, pedals, handle bar
B. frame, two wheels, chain, pedals
C. frame, two wheels, chain
D. frame, two wheels
Comments: OK
Action: Keep as is

1st cycle: Pilot
Item (translation in English):
Which are the elements that a bicycle SHOULD have in order to roll, when someone is pushing it?
A. frame, two wheels, chain, pedals, handle bar
B. frame, two wheels, chain, pedals
C. frame, two wheels, chain
D. frame, two wheels
Comments: Frequencies per alternative: A 0.56, B 0.19, C 0.00, D 0.25
Action: Revise distractor

1st cycle: Final administration
Item (translation in English):
Which are the elements that a bicycle SHOULD have in order to roll, when someone is pushing it?
A. frame, two wheels, chain, pedals, handle bar
B. frame, two wheels, chain, pedals
C. frame, two wheels, chain, handle bar
D. frame, two wheels
Comments: Frequencies per alternative: A 0.58, B 0.07, C 0.16, D 0.18
Action: Change wording of the stem

2nd cycle: Pre-pilot
Item (translation in English):
Which are the elements that a bicycle SHOULD NECESSARILY have in order to roll, when someone is pushing it?
A. frame, two wheels, chain, pedals, handle bar
B. frame, two wheels, chain, pedals
C. frame, two wheels, chain, handle bar
D. frame, two wheels
Comments: Difficulty index (0.21). Discrimination index (0.3). Alternatives ok.
Frequencies per alternative: A 0.40, B 0.15, C 0.24, D 0.21
Action: Keep as is

2nd cycle: Interviews (first set)
Item (translation in English):
Which are the elements that a bicycle SHOULD NECESSARILY have in order to roll, when someone is pushing it?
A. frame, two wheels, chain, pedals, handle bar
B. frame, two wheels, chain, pedals
C. frame, two wheels, chain, handle bar
D. frame, two wheels
Comments: Correct answer with CORRECT reasoning (4/11). Wrong answer (7/11).
Suggestion of other alternatives (2/11) (wheels, pedals, handle bar).
Alternative (B) not chosen by anyone.
Action: Change alternative content

2nd cycle: Interviews (second set)
Item (translation in English):
Which are the elements that a bicycle SHOULD NECESSARILY have in order to roll, when someone is pushing it?
A. frame, two wheels, chain, pedals, handle bar
B. frame, two wheels, pedals, handle bar
C. frame, two wheels, chain, handle bar
D. frame, two wheels
Comments: Correct answer with CORRECT reasoning (1/5). Wrong answer (4/5).
Action: Keep as is

Table 3
The development of the apple tree item

1st cycle: Pre-pilot
Item (translation in English): -
Comments: -
Action: -

1st cycle: To experts / To educators
Item (translation in English):
Mr George planted a small apple tree 10 years ago. Now the apple tree is quite big.
As the apple tree grows,
A. it needs more water.
B. it needs less water.
C. the tree's need for water does not change.
D. it does not need extra water, since it has already grown.
Action: Keep as is

1st cycle: Pilot
Item (translation in English):
Mr George planted a small apple tree 10 years ago. Now the apple tree is quite big.
As the apple tree grows,
A. it needs more water.
B. it needs less water.
C. the tree's need for water does not change.
D. it does not need extra water, since it has already grown.
Comments: Frequencies per alternative: A 0.38, B 0.31, C 0.13, D 0.19
Action: Keep as is

1st cycle: Final administration
Item (translation in English):
Mr George planted a small apple tree 10 years ago. Now the apple tree is quite big.
As the apple tree grows,
A. it needs more water.
B. it needs less water.
C. the tree's need for water does not change.
D. it does not need extra water, since it has already grown.
Comments: Frequencies per alternative: A 0.48, B 0.18, C 0.25, D 0.07
Action: Keep as is

2nd cycle: Pre-pilot
Item (translation in English):
Mr George planted a small apple tree 10 years ago. Now the apple tree is quite big.
As the apple tree grows,
A. it needs more water.
B. it needs less water.
C. the tree's need for water does not change.
D. it does not need extra water, since it has already grown.
Comments: Difficulty index (0.43) OK. Discrimination index (-0.3).
Frequencies per alternative: A 0.43, B 0.21, C 0.28, D 0.07
Action: Item replaced
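The "frequencies per alternative" reported in Tables 2 and 3 are the proportions of students choosing each answer option. A minimal sketch of how such a distractor analysis might be computed is given below. The function names, the 5% cut-off for flagging a distractor and the response data are assumptions made for illustration and are not taken from the study.

```python
from collections import Counter
from typing import Dict, List, Sequence


def alternative_frequencies(responses: Sequence[str],
                            alternatives: Sequence[str] = ("A", "B", "C", "D")
                            ) -> Dict[str, float]:
    """Proportion of students choosing each alternative of one item."""
    counts = Counter(responses)
    n = len(responses)
    return {alt: counts.get(alt, 0) / n for alt in alternatives}


def weak_distractors(freqs: Dict[str, float], key: str,
                     threshold: float = 0.05) -> List[str]:
    """Distractors chosen by fewer than `threshold` of the students.

    The threshold is an assumed rule of thumb, not a value from the paper.
    """
    return [alt for alt, f in freqs.items() if alt != key and f < threshold]


# Hypothetical responses of 16 students to one item whose correct answer is D.
responses = ["A"] * 9 + ["B"] * 3 + ["D"] * 4
freqs = alternative_frequencies(responses)
print(freqs)
print("distractors to revise:", weak_distractors(freqs, key="D"))
```

A distractor that nobody chooses contributes no information about students' thinking, which is the situation that led to the "Revise distractor" and "Change alternative content" actions in Table 2.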


CONCLUSION
Systems thinking is a higher order skill, important in dealing with everyday phenomena
and in solving problems. At the same time, science is a field with plenty of systems to
analyze and model. Despite the widespread research on curriculum development for
systems thinking, no validated tests have been developed to evaluate the effectiveness of such curricula.
The STA is being developed following a cyclic and iterative procedure. It aspires to be a useful
instrument in assessing a curriculum designed to promote systems thinking in upper-primary
and lower-secondary school students.

REFERENCES
Adams, R. J. & Khoo, S. T. (1993). Quest: The Interactive Test Analysis System.
Camberwell, Victoria: ACER.
American Association for the Advancement of Science (1993). Benchmarks for science
literacy. New York: Oxford University Press: Author.
Constantinide, K., Kalyfommatou, N. & Constantinou, C. P. (2001). The development of
modeling skills through computer based simulation of an ant colony. In
Proceedings of the Fifth International Conference on Computer Based Learning
in Science, July 7th - July 12th 2001, Masaryk University, Faculty of Education,
Brno, Czech Republic.
Ben-Zvi Assaraf, O. & Orion, N. (2005). Development of System Thinking Skills in the
Context of Earth System Education. Journal of Research in Science Teaching, 42
(5), 518-560.
Booth Sweeney, L. B. (2001). When a butterfly sneezes. Pegasus Communications, Inc,
Waltham.
Ericsson, K. A. and Simon, H. A. (1998). How to Study Thinking in Everyday Life:
Contrasting Think-Aloud Protocols With Descriptions and Explanations of
Thinking. Mind, Culture and Activity, 5, 178-186.
Hmelo-Silver, C. E. and Green Pheffer, M. (2004). Comparing expert and novice
understanding of a complex system from the perspective of structures, behaviors,
and functions. Cognitive Science, 28, 127-138.
Kali, Y., Orion, N., & Eylon, B. (2003). The effect of knowledge integration activities on
students' perception of the earth's crust as a cyclic system. Journal of Research in
Science Teaching, 40, 545-565.
Riess, W., & Mischo, C. (2009). Promoting Systems Thinking through Biology Lessons.
International Journal of Science Education, 1-21.
Senge, P. (1990). The Fifth Discipline: The Art and Practice of the Learning
Organization. New York: Doubleday.
Sheehy, N., Wylie, J., McGuinness, C. & Orchard, G. (2000). How Children Solve
Environmental Problems: using computer simulations to investigate systems
thinking. Environmental Education Research, 6, 2, 109-126.
Thier, H. D. & Knott, R. C. (1992). Subsystems and Variables. Teacher's guide, Level 3,
Science Curriculum Improvement Study. Delta Education, Inc., Hudson.


DEVELOPMENT OF A TWO-TIER TEST-INSTRUMENT FOR GEOMETRICAL OPTICS
Claudia Haagen and Martin Hopf
University of Vienna, AECCP, Vienna, Austria
Abstract: Light is part of our everyday life. Nevertheless, students face enormous
difficulties in explaining everyday optical phenomena with the help of scientific concepts.
Usually they rely on alternative concepts deduced from everyday experience, which are
often in conflict with scientific views. The identification of such alternative conceptions is
one of the most important prerequisites for promoting conceptual change (Duit and
Treagust 2003). Investigating students' concepts with interviews is quite time consuming
and difficult to handle in school settings. Multiple-choice tests, on the other hand, frequently
depict the conceptual knowledge base in a superficial way. The main aim of our
project is to develop a two-tier multiple-choice test which reliably and validly diagnoses
year-8 students' understanding of geometrical optics. So far, we have developed and
empirically tested a first (N=643) and a second test version (N=367), partly based on items
from the literature. Though the overall results are promising, the quality of the items differs a
lot: there are a number of items which do not have appropriate distractors for the second
tier. In addition, students' and teachers' feedback on the test indicates that some items pose
problems due to their wording or the kind of representation chosen. For a closer analysis of
these problematic items, the qualitative method of student interviews was chosen.
Semi-structured, problem-based interviews were conducted with 29 year-8 students after their formal
instruction in optics. Based on the results of these interviews, test items were revised and
extended.
Keywords: geometrical optics, two-tier multiple choice test, test development

INTRODUCTION
Despite everyday experience with light, understanding geometrical optics turns out to be
difficult for students. Physics education research shows that students hold numerous
conceptions about optics which differ from scientifically adequate concepts (Duit 2009).
Alternative conceptions are very stable. Research shows that formal instruction is
frequently not able to transform them into scientifically accepted ideas (Andersson and
Kärrqvist 1983; Fetherstonhaugh and Treagust 1992; Galili 1996; Langley et al. 1997).
Teachers' knowledge about their students' learning difficulties is one important
prerequisite for the design of successful instruction. Exploring students' conceptual
knowledge base can provide important feedback: it can support students in their individual
learning process and can serve as a basis for further teaching decisions.
In general, there are two main methods used for examining students' conceptual
knowledge: interviews and open-ended questionnaires. The most effective methods, like
interviews, are very time consuming and difficult for teachers to handle in classroom
situations. In search of a way out of this dilemma, we encountered the method of
two-tier tests as used by e.g. Treagust 2006; Law & Treagust 2008. Two-tiered test items
are items that require an explanation or defence for the answer [...] (see Wiggins and
McTighe 1998, p. 14) (Treagust 2006). Each item consists of two parts, called tiers. The
first part of the item is a multiple-choice question whose distractors include
known student alternative conceptions. In the second part of each item, students have to
justify the choice made in the first part by choosing among several given reasons (Treagust
2006).
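The two-tier structure described above can be made concrete with a small sketch of how a response to such an item might be represented and scored. The scoring rule used here, credit only when both the answer and the justification are correct, is an assumption; the paper does not state its scoring rule. The class name `TwoTierItem`, the function `score_response` and the answer keys are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTierItem:
    """A two-tier multiple-choice item (illustrative structure only)."""
    item_id: str
    answer_key: str  # correct option in the first tier (content question)
    reason_key: str  # correct option in the second tier (justification)


def score_response(item: TwoTierItem, answer: str, reason: str) -> dict:
    """Score one student's response to a two-tier item."""
    tier1 = answer == item.answer_key
    tier2 = reason == item.reason_key
    return {
        "tier1_correct": tier1,
        "tier2_correct": tier2,
        # Assumed rule: credit only when answer and justification are both correct.
        "score": int(tier1 and tier2),
    }


# Hypothetical item and responses.
item = TwoTierItem(item_id="Q1", answer_key="B", reason_key="3")
print(score_response(item, answer="B", reason="3"))
print(score_response(item, answer="B", reason="1"))
```

Keeping the two tiers separate in the result makes it possible to spot students who pick the right answer for a wrong reason, which is the diagnostic information a single-tier item cannot provide.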
Research on alternative conceptions in optics has mainly used the methods of interviews or
questionnaires with open answers (Andersson and Kärrqvist 1983; Driver et al. 1985;
Guesne 1985; Viennot 2003). In addition, multiple-choice tests were developed (Bardar et
al. 2006; Chen et al. 2002; Chu et al. 2009; Fetherstonhaugh and Treagust 1992). These
tests focus on various age groups and on different content areas within geometrical optics.
We have, however, not found a psychometrically valid test instrument designed to portray
basic conceptions in geometrical optics of students at the lower secondary level.
Our main research objective is the development of a multiple-choice test instrument for
year-8 students which is able to portray the students' conceptions in geometrical optics.

DEVELOPMENT OF THE TEST INSTRUMENT


The test instrument has so far been developed in two phases. In the first phase of the test
development, the content area of the test was identified based on the Austrian curriculum for
year 8. Then students' conceptions related to the key ideas of the content area were
investigated through intensive literature research. Finally, items for the test were selected from
existing assessment tools for geometrical optics and adapted to the two-tier
structure where possible. Where a second tier was added to existing items,
the distractors for this second tier were taken from research on students' conceptions.
Additionally, some items were newly developed. The final version of the test was tried out
with N=643 year-8 students.
The results of this first test phase were used to revise the first test version. The second test
version was tested with N=367 year-8 students after their conventional instruction in
geometrical optics. This version consisted of 20 two-tier items and 6 items with
only one tier, which were partly taken from the literature (Fetherstonhaugh and Treagust 1992;
Kutluay 2005; Bardar et al. 2006; Chu et al. 2009). The results of the statistical analysis
with SPSS and the students' and teachers' feedback on the test indicated a potential for
improvement. Some items did not have appropriate distractors for the second tier, while
others seemed to pose problems due to their wording or the kind of representations (Colin
et al. 2002) chosen.
Consequently, semi-structured, problem-based interviews were conducted with year-8
students after their instruction in geometrical optics. These interviews were carried out for
the following reasons: Firstly, we wanted to make sure that the distractors which had been
taken from the literature were exhaustive. Secondly, the interviews were meant to investigate the
response space of the newly developed items. Finally, the language and the graphical
representations used in the items were to be validated by students.

Participants and Setting


We interviewed 29 students (17 female, 12 male) after their instruction in geometrical
optics. The students attended year 8 in 5 different schools. They came from 8 different
classes and thus had 8 different physics teachers. The schools our sample attended
covered all the different types of schools available in Austria at year-8 level.
The interviews were conducted in the school setting. Each student was interviewed
individually. The average duration of the interviews was 19.5 minutes.

METHOD
We carried out semi-structured, problem-based interviews (Lamnek, 2002; Mayring, 2002;
Witzel, 1985). The interviews were based on seven selected items of the second test
version. The students were given only the item task, without any distractors. The interview
followed a four-step structure for each item. The students had to:

1. paraphrase the task of the item
2. describe the graphical representation used in the item
3. answer the item
4. account for the answer given

Figure 1. Flow chart of the structure of the interviews

Data analysis
The interviews were recorded and transcribed. Afterwards they were analysed with
MAXQDA following the method of qualitative content analysis by Mayring (2010) and
Gropengießer (2008).
The data were analysed with respect to three main categories: language issues, the forms of
visual representation used, and students' conceptions related to the content of the items. As
far as language issues are concerned, we were interested in how students interpreted the task
of the item on the basis of the text given. Additionally, we tried to identify unfamiliar words
and expressions as well as overly long or complicated sentences.
For the visual representations, our main aim was to find out whether the students were able to
grasp the content or the situation represented in visual form.
The final category, students' conceptions, was intended to map the response space
for the problems posed and thus to give a good overview of students' conceptions
related to them.


FINDINGS
The findings presented here are results of the empirical testing of the second test version
(N=376). The reliability of the test was established by a Cronbach's alpha coefficient of
α = 0.77. An overview of the test and item statistics concerning the 20 two-tier items is
given in figure 2.

Figure 2. Test and item statistics of the second test version
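
The reliability coefficient reported above can be illustrated with a minimal sketch of how Cronbach's alpha is commonly computed from a students-by-items score matrix. The function name `cronbach_alpha` and the small data set are hypothetical and serve only as an illustration; they are not taken from the study, whose analysis was carried out in SPSS.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student rows of item scores.

    scores[s][i] is the score of student s on item i (for example 0 or 1).
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two students")
    # Sample variance of each item across students.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the students' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 students x 4 dichotomously scored items (not from the study).
example_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
alpha = cronbach_alpha(example_scores)
print(f"alpha = {alpha:.2f}")
```

Alpha grows when the items vary together across students, which is why it is read as a measure of the internal consistency of the test.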


On average, two-tier items were answered correctly in only 37.2% of cases. In contrast,
one-tier items were solved on average in 47.41% of cases. The solution frequencies of
one-tier items (8.5% - 88.0%) were higher than those of two-tier items, which varied between
3.0% and 57.2%. This effect is well known from research using two-tier items. Among
other factors, it is mainly caused by the fact that the probability of guessing correctly is
reduced by the need to account in the second tier for the choice made in tier one (cf. e.g. Tan &
Treagust 2002).
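
The guessing argument can be made concrete with a short sketch. It assumes blind, independent guessing on each tier and a scoring rule in which a two-tier item counts as solved only if both tiers are correct. The option counts (four answers, four reasons) and the function names are hypothetical and are not taken from the test.

```python
def guessing_probability(options_per_tier):
    """Chance of getting every tier right by blind, independent guessing."""
    probability = 1.0
    for n_options in options_per_tier:
        probability /= n_options
    return probability


def score_two_tier(tier1_correct, tier2_correct):
    """Assumed scoring rule: solved only if answer and reason are both correct."""
    return 1 if (tier1_correct and tier2_correct) else 0


# Hypothetical option counts: 4 answer options and 4 reasons per item.
p_one_tier = guessing_probability([4])
p_two_tier = guessing_probability([4, 4])
print(p_one_tier, p_two_tier)
```

Under these assumptions, the chance of solving an item by guessing drops from one in four to one in sixteen once a second tier is added.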
This is also supposed to be one way of distinguishing students who possess only
superficial factual knowledge of phenomena from students who have deeper conceptual
knowledge: the latter are not only able to give a correct answer in the first tier
of a multiple-choice item but are also able to give a correct reason for their choice. As
reported elsewhere (cf. Haagen & Hopf 2012), a more detailed analysis of the items
indicated that most of the two-tier items used had a higher potential for portraying students'
conceptions in detail than one-tier items.
The second part of the findings section concentrates on the findings of the
interviews. As mentioned above, the interviews were used to find appropriate
distractors for items that did not have a second tier. This paper focuses on this issue;
in the following, one example of adding a second tier with the help of the interview results is
reported.
For the topic of continuous propagation of light, the item shown in figure 3
was used.


Figure 3. One-tier item of test version two concerning the key idea of continuous
propagation of light

For those students who indicated in the first tier that they assumed a different distance of
propagation of light from the campfire during day and night, we obtained 6 different categories
of reasons, as shown in figure 4.

Figure 4. Reasons for a different propagation distance of light from a campfire during day
and night
Each of these categories was translated back into students' language, either by taking a student
statement directly from the interviews or by modifying a student statement slightly in order to
fulfil psychometric guidelines for distractor construction. This procedure led to the second
tier for this item as presented below in figure 5.


Figure 5. Two-tier item of test version two concerning the key idea of continuous
propagation of light

CONCLUSION
In conclusion, the analysis of the second test version showed that the two-tier items of the test
are well able to portray several types of students' conceptions known from the literature. On
the other hand, the results indicated that some items still needed revision and improvement.
The results obtained from the interviews were integrated into the third test version,
which still needs to be tested.

REFERENCES
Andersson, B.; Kärrqvist, C. (1983): How Swedish pupils, aged 12-15 years, understand
light and its properties. In: IJSE 5 (4), pp. 387-402.
Bardar, E. M.; Prather, E. E.; Brecher, K.; Slater, T. F. (2006): Development and validation of
the light and spectroscopy concept inventory. In: Astronomy Education Review 5, p.
103.
Chu, H. E.; Treagust, D.; Chandrasegaran, A. L. (2009): A stratified study of students'
understanding of basic optics concepts in different contexts using two-tier
multiple-choice items. In: RSTE 27, pp. 253-265.
Colin, P.; Chauvet, F.; Viennot, L. (2002): Reading images in optics: students' difficulties
and teachers' views. In: IJSE 24 (3), pp. 313-332.
Driver, R.; Guesne, E.; Tiberghien, A. (Eds.) (1985): Children's ideas in science.
Buckingham: Open University Press.
Duit, R. (2009): Bibliography STCSE: Students' and teachers' conceptions and science
education. Retrieved October 20, 2009.
Duit, R.; Treagust, D. F. (2003): Conceptual change: a powerful framework for improving
science teaching and learning. In: IJSE 25 (6), pp. 671-688.
Fetherstonhaugh, T.; Treagust, D. F. (1992): Students' understanding of light and its
properties: Teaching to engender conceptual change. In: SE 76 (6), pp. 653-672.
Galili, I. (1996): Students' conceptual change in geometrical optics. In: IJSE 18 (7), pp.
847-868.
Guesne, E. (1985): Light. In: R. Driver, E. Guesne and A. Tiberghien (Eds.): Children's
ideas in science. 1993 ed. Buckingham: Open University Press, pp. 10-32.
Langley, D.; Ronen, M.; Eylon, B. S. (1997): Light propagation and visual patterns:
Preinstruction learners' conceptions. In: JRST 34 (4), pp. 399-424.
Law, J. F.; Treagust, D. F. (2008): Diagnosis of student understanding of content specific
science areas using on-line two-tier diagnostic tests. Curtin University of
Technology.
Mayring, P. (2010): Qualitative Inhaltsanalyse. Weinheim: Beltz.
Treagust, D. F. (2006): Diagnostic assessment in science as a means to improving
teaching, learning and retention. In: UniServe Science - Symposium Proceedings:
Assessment in science teaching and learning. Sydney, 2006. UniServe Science.
Treagust, D. F.; Glynn, S. M.; Duit, R. (1995): Diagnostic assessment of students' science
knowledge. In: Learning science in the schools: Research reforming practice 1, pp.
327-436.
Viennot, L. (2003): Teaching physics. Supported by: U. Besso, F. Chauvet, P. Colin, F.
Hirn-Chaine, W. Kaminski and S. Rainson: Springer Netherlands.


STRENGTHENING ASSESSMENT IN HIGH SCHOOL
INQUIRY CLASSROOMS
Chris Harrison
King's College London
Abstract: Inquiry provides both the impetus and the experience that help students
acquire problem-solving and lifelong learning skills. Teachers on the Strategies for
Assessment of Inquiry Learning in Science Project (SAILS) strengthened their
inquiry pedagogy by focusing on seeking assessment evidence for formative
action. Observing learners in the classroom as they carry out investigations, listening
to learners piece together evidence in a group discussion, reading through answers to
homework questions and watching learners respond to what is being offered as
possible solutions to problems all provide plentiful and rich assessment data for
teachers.
Keywords: Inquiry, Assessment, Teacher change

BACKGROUND
The European Parliament and Council (2006) identified and defined the key
competencies necessary for personal fulfillment, active citizenship, social inclusion
and employability in our modern day society. These included communication skills
both in mother tongue and foreign languages, mathematical, scientific, digital and
technological competencies, social and civic competencies, cultural awareness and
expression, entrepreneurship and learning to learn. These key competencies formed
the foundation for the approach that our European Framework 7 (EUFP7) project,
Strategies for Assessment of Inquiry Learning in Science (SAILS), took to
developing, researching and understanding how teachers might strengthen their
teaching of inquiry-based science education.
Since the Rocard Report (2007) recommended that school science teaching should
move from a deductive to an inquiry approach to science learning, there have been
several EUFP7 projects, such as S-TEAM, ESTABLISH, Fibonacci, PRIMAS and
Pathway, whose remit has been to support groups of teachers across Europe in
bringing about this radical change in practice. These projects have been successful in
highlighting the importance of IBSE across Europe. They have also enabled us to
determine the range of understanding of what the term inquiry means to teachers
across Europe, and to establish to what extent the skills and competencies that are
developed through inquiry practices have been identified. The term inquiry has
figured prominently in science education, yet it refers to at least three distinct
categories of activities: what scientists do (e.g., conducting investigations using
scientific methods), how students learn (e.g., actively inquiring through thinking
and doing into a phenomenon or problem, often mirroring the processes used by
scientists), and a pedagogical approach that teachers employ (e.g., designing or
using curricula that allow for extended investigations) (Minner et al., 2010).
Inquiry-based science education (IBSE) has proved its efficacy at both primary and
secondary levels in increasing children's and students' interest and attainment levels
(Minner et al., 2010; Osborne & Dillon, 2008) while at the same time stimulating teacher
motivation (Wilson et al., 2010). One area that has remained problematic for teachers,
and that is cited as one of the factors limiting the development of IBSE within schools, is
assessment (Wellcome Trust, 2011). The EUFP7 project Strategies for Assessment of
Inquiry Learning in Science (SAILS) aims to prepare science teachers not only to be
able to teach science through inquiry, but also to be confident and competent in the
assessment of their students' learning through inquiry. The literature on teacher
change suggests that such change is a slow (and often difficult) process, and none
more so than when the initiative requires teachers to review and change their
assessment practices (Harrison, 2012).
Part of the reason for this slow implementation of IBSE in science classrooms is the
time lag between the introduction of ideas and the training of teachers at both
in-service and pre-service level. While this situation should improve over the next few
years, there is a fundamental problem with an IBSE approach, and this lies with
assessment. While the many EU IBSE projects have produced teaching materials,
they have not produced support materials to help teachers with the assessment of this
approach. Linked to this is the low level of IBSE-type items in national and
international assessments, which gives teachers the message that IBSE is not
considered important in terms of skills in science education. It is clear that there is a
need to produce an assessment model and support materials to help teachers assess
IBSE learning in their classrooms if this approach is to be further developed and
sustained in classrooms across Europe.

Inquiry Skills
Inquiry skills are what learners use to make sense of the world around them. These
skills are important both to create citizens who can make sense of the science in the
world they live in, so that they make informed decisions, and to develop scientific
reasoning in those undertaking future scientific careers or careers that require the
logical approach that science encourages. An inquiry approach not only helps
youngsters develop a set of skills, such as critical thinking, that they may find useful in
a variety of contexts; it can also help them develop their conceptual understanding of
science, and it encourages students' motivation and engagement with science.
As noted above, the term inquiry refers to at least three distinct categories of
activities: what scientists do, how students learn, and a pedagogical approach that
teachers employ (Minner et al., 2010). However, whether it is the scientist, student, or
teacher who is doing or supporting inquiry, the act itself has some core components.
Inquiry-based science education is an approach to teaching and learning science that is
conducted through the process of raising questions and seeking answers (Wenning,
2005, 2007). An inquiry approach fits within a constructivist paradigm in that it
requires the learner to take note of new ideas and contexts and question how these fit
with their existing understanding. It is not about the teacher delivering a curriculum
of knowledge to the learner but rather about the learner building an understanding
through guidance and challenge from their teacher and from their peers.


Some of the key characteristics of inquiry-based learning are:

• Students are engaged with a difficult problem or situation that is open-ended
  to such a degree that a variety of solutions or responses are conceivable.
• Students have control over the direction of the inquiry and the methods or
  approaches that are taken.
• Students draw upon their existing knowledge and they identify what their
  learning needs are.
• The different tasks stimulate curiosity in the students, which encourages them
  to continue to search for new data or evidence.
• The students are responsible for the analysis of the evidence and also for
  presenting evidence in an appropriate manner which defends their solution to
  the initial problem (Kahn & O'Rourke, 2005).

In our view, these inquiry skills are developed and experienced through working
collaboratively with others and so communication, teamwork, and peer support are
vital components of inquiry classrooms.
Within an inquiry culture there is also a clear belief that student learning outcomes are
especially valued. One characteristic of inquiry learning is that students are fully
involved in the active learning process. Students who are making observations,
collecting data, analyzing data, synthesizing information, and drawing conclusions are
developing problem-solving skills. These skills fully incorporate the basic and
integrated science process skills necessary in scientific inquiry. In England, there has
been a move to support more practical work in science classrooms through the Getting
Practical project (Abrahams et al., 2011). This project worked through the
Association for Science Education and the National Science Learning Centre and
supported primary and secondary teachers in 30 schools in developing their practice
through practical work, resulting in observable changes in the emphasis given to
practical work in schools and also in improvements in the learning of science
concepts. The findings of this project also included an important caveat: what was
required was more than being aware of the Getting Practical message.
The project found that teachers needed to plan scaffolding (Wood et al., 1976) in order for
their learners to be guided towards viewing scientific phenomena in a way similar to
how their teachers perceive them (Ogborn et al., 1996; Lunetta, 1998). Such an approach
requires teachers to take note of what their learners struggle with and then to plan
and implement teaching that helps their pupils improve. In other words, the approach
that teachers need to take is formative.
A second characteristic of inquiry learning is that students develop the lifelong skills
critical to thinking creatively, as they learn how to solve problems using logic and
reasoning. These skills are essential for drawing sound conclusions from experimental
findings. While many projects have focused on the evaluation of the conceptual
understanding of science principles that students develop, there is a clear need to evaluate
other key learning outcomes, such as process and other self-directed learning skills, with
the aim of fostering the development of interest, social competencies and openness to
inquiry so as to prepare students for lifelong learning. This has been the aim of many
of the EUFP7 projects so far, and central to this approach are teamwork and
collaborative behavior. So the move to implement more IBSE-type learning across
Europe has been successful in terms of raising awareness of the importance of this
approach, but the introduction of these ideas into mainstream teaching and learning
has been less readily taken up.
In many schools, science practicals are generally presented as recipes to
follow so that students experience scientific phenomena. This approach means that
the raising of questions about phenomena lies with the teacher rather than the student.
So, in most science practicals, the student's role is limited to simply collecting and
presenting data that are then made sense of by the teacher. This approach to practical
work is unlikely to aid conceptual understanding and development of inquiry skills
beyond practice of a limited number of skills.

Assessment Approach
The Strategies for Assessment of Inquiry Learning in Science Project (SAILS) consists
of 14 partners from across Europe and is currently in its second year of development.
The prime aim of this project is to produce and trial assessment models and materials
that will help teachers assess inquiry skills in the classroom. At the centre of this work
is Assessment for Learning. Two of the lead members of the King's College London
team, Chris Harrison and Paul Black, have been working with a pilot group of 16
expert science teachers to develop the first round of materials for the project. The
materials produced are then being trialled in 13 different countries to see how the
approach fits within different cultural contexts. Three topics have been selected for
the first set of materials: Food, Rates of Reaction, and Speed and Acceleration.
Since the formative use of the assessment data is essential to drive the pedagogy most
likely to bring about conceptual change in the learners, our approach has been first to
strengthen the formative assessment that occurs within inquiry teaching. SAILS
teachers therefore need to recognize and collect the assessment data that arise directly from
inquiry lessons. To do this they need to think carefully about the variety of ways in
which learners might respond to the new ideas, new contexts or challenging
questions being offered. By listening carefully to classroom discussions during inquiry,
to solutions to problems that have arisen during the inquiry, or to group reflections
on an inquiry, teachers can gather evidence of their learners' emerging understanding.
Teachers can note misconceptions, distinguish partly answered questions from full
answers, and recognize errors and possible reasons why such errors are occurring.
Such data are rich in inquiry lessons because the very nature of the approach means
that the lesson is challenging and so understanding is interrogated. The teacher can
then use this assessment data to scaffold the next stage in learning for their students.
This type of assessment has high validity. It satisfies one of the conditions for validity
in having high reliability, in that the learner is assessed on several different occasions,
thereby compensating for variations in a learner's performance from day to day, and
in several ways, thereby sampling the full range of learning aims. The fact that the
learner has been assessed in contexts which have been interspersed with the learning
secures both coverage and authenticity, particularly because the teacher is able to test
and re-test her interpretations of what the data mean in relation to each individual's
developing understanding. Such data place teachers in a good position to sum up the
progress and to have a realistic awareness of each learner's understanding by the end
of the learning sequence of activities. This is radically different from assessing the
learner in the artificial context of the formal test, and it is far more valid, i.e. the
teacher can be far more confident in reporting to a parent, or to the next teacher of
the learner, or to any others who might want to have and use assessment results,
about the learner's potential both to use and to extend her learning in the future.

FINDINGS
The SAILS pilot so far looks promising. Teachers have reported that they feel that
they gain far more evidence of student performance by collecting evidence during the
inquiry activities than from marking reports of the inquiry. They have realized that
only a limited number of skills can be assessed if the evidence is sourced only from
the written report. Many of the interchanges they witnessed, as students discussed
which inquiry questions were likely candidates to form the inquiry and then how to identify,
select, control and manipulate variables, were much richer in reality than in the
written reports of the investigation. This is because, by the time the students have
produced a final report of the inquiry, the ideas have been through so many iterative
interchanges that the data have been reduced to stark statements that do not capture
their developing inquiry skills and capabilities. While the written reports indicated
whether the students could or could not identify relevant variables, the ease with
which they could do this and their competence in justifying one variable as testable
and rejecting another were far better portrayed during the inquiry than in their written
reports.
Teachers also recognized that, as well as getting a better feel for their students'
capabilities, there were some areas that were better assessed during the inquiry than
could be done by other assessment methods, and they discussed the limitations of the
previous system of assessing inquiry by coursework and also those of the current
system for assessing inquiry by controlled assessments. This meant that a far wider
range of inquiry skills was assessed than the teachers had previously attempted
when the assessment was focused on the written report of the investigation. The
teachers were especially interested in assessing students' capability to raise
investigable questions, their cooperation and teamwork behaviour, and their resilience
in learning from their mistakes.
However, engaging in more inquiry in their classrooms and assessing in this different
way also caused concerns and dilemmas for the SAILS pilot teachers. These were:

• Teachers are unable to collect data on every student during each inquiry activity.
• Teachers are working formatively and so are unsure what they should report:
  students' first attempt, last attempt or average attempt.
• Students are working collaboratively, and this may affect individual performance.

These concerns are shadowed by continual concerns by many of the SAILS pilot
teachers on public and government confidence in teacher assessment and how the
teachers might communicate to parents and others why and how a more formative
approach can be as robust as the assessment judgments that are made through
examinations at the end of courses.
The teachers were also able to feed evidence back into their teaching and so
respond formatively to both the needs and progress of learners. The teachers also
reported that they had begun to see the inquiry capabilities of their learners more
positively than they had done when previously doing practical work with these
youngsters. The teachers were surprised by how well the learners managed to raise
inquiry questions, how innovative the learners could be when not limited to following
a particular path to solving an inquiry problem, and how willing learners were to learn
from their mistakes while still remaining motivated.
The SAILS pilot teachers reported that they gave far more curriculum time to inquiry
than they had anticipated was possible at the start of the project. After each meeting
teachers were asked to try, as a minimum, one inquiry project of around an hour. All
16 teachers did considerably more than this with several teachers doing extended
inquiry projects over several weeks and the majority trying 3-6 inquiries with classes
between January and June. As the teachers gained more confidence with the IBSE
approach, the inquiry activities became more open in their structure and direction and
several of the teachers reported that this more open approach not only further
motivated learners, it allowed the teachers to assess the learners on a wider range of
inquiry skills. Certainly, in the first few inquiry activities teachers focused on aspects
of planning or of data collection, whereas in the more open, inquiry-style activities teachers
felt more confident to also assess broad-reaching skills such as teamwork and
communication.

CONCLUSION
Work so far on the SAILS project has indicated that teachers are willing to strengthen
their commitment to IBSE through taking a formative assessment approach to
inquiry. The SAILS pilot teachers have demonstrated that they can assess as the
inquiry learning is taking place and then use this assessment data to inform later
stages in the IBSE learning. The formative approach to assessment of inquiry in
science classrooms has encouraged teachers to allow students to do more IBSE-type
work than previously and to take a more open approach to inquiry, and this has
enabled the students to be more innovative in their inquiry approach. In turn, because
the students are expressing a broader range of skills than the science teachers
normally observe in general practical work, the teachers have reported that they have
been surprised and pleased by students' inquiry capabilities and willingness to learn
from making mistakes.
Issues relating to public confidence in teacher assessment remain problematic, and
we hope to address them in the coming year both by looking at how science
teachers in our partner countries across Europe work with these ideas and by
helping the SAILS pilot teachers in England address how they might build an
assessment portfolio of their learners' work in inquiry over the course of the school
year.
For more information on the SAILS project see www.sails-project.eu

REFERENCES
Abrahams, I., Sharpe, R. & Reiss, M. (2011) Getting Practical: Improving practical
work in science. Hatfield: Association for Science Education.
Abrahams, I. & Millar, R. (2008) Does practical work really work? A study of the
effectiveness of practical work as a teaching and learning method in school
science. International Journal of Science Education, 30(14), 1945-1969.
European Commission (2007) Science education now: A renewed pedagogy for the
future of Europe. Brussels: European Commission. Retrieved from
http://ec.europa.eu/research/science-society/document_library/pdf_06/report-rocard-on-science-education_en.pdf
Kahn, P. & O'Rourke, K. (2005) Understanding Enquiry-based learning. In
T. Barrett, I. Mac Labhrainn & H. Fallon (Eds.), Handbook of Enquiry & Problem
Based Learning. Available online at: http://www.nuigalway.ie/celt/pblbook
Minner, D. D. et al. (2010) Inquiry-based science instruction: what is it and does it
matter? Results from a research synthesis years 1984 to 2002. Journal of
Research in Science Teaching, 47(4), 474-496.
Osborne, J. & Dillon, J. (2008) Science Education in Europe: Critical Reflections.
London: Nuffield Foundation.
Ogborn, J., Kress, G., Martins, I. & McGillicuddy, K. (1996) Explaining Science in
the classroom. Buckingham: OUP.
Lunetta, V. N. (1998) The school science laboratory: Historical perspectives and
centers for contemporary teaching. In B. J. Fraser & K. G. Tobin (Eds.),
International handbook of science education. Dordrecht: Kluwer.
Rocard, M. et al. (2007) Science Education Now: A Renewed Pedagogy for the Future
of Europe. Brussels: EC Directorate for Research (Science, Economy and
Society).
Wellcome Trust (2011) Perspectives on Education: Inquiry-based Learning. London:
Wellcome.
Wenning, C. J. (2007) Assessing inquiry skills as a component of scientific
literacy. Journal of Physics Teacher Education Online, 4(2), 21-24.
Wenning, C. J. (2005) Levels of inquiry: Hierarchies of pedagogical practices
and inquiry processes. Journal of Physics Teacher Education Online,
2(3), 3-11.
Wilson, C., Taylor, J., Kowalski, S. & Carlson, J. (2010) The Relative Effects
and Equity of Inquiry-based and Commonplace Science Teaching on
Students' Knowledge, Reasoning and Argumentation. Journal of
Research in Science Teaching, 47(3), 276-301.
Wood, D., Bruner, J. S. & Ross, G. (1976) The role of tutoring in problem
solving. Journal of Child Psychology & Psychiatry & Allied Disciplines,
17(2), 89-100.


ANALYSIS OF STUDENT CONCEPT KNOWLEDGE IN
KINEMATICS
Andreas Lichtenberger, Andreas Vaterlaus and Clemens Wagner
ETH, Department of Physics, Zurich, Switzerland
Abstract: We have developed a diagnostic test in kinematics to investigate the student
concept knowledge at the high school level. The 56 multiple-choice test items are
based on seven basic kinematics concepts we have identified. We perform an
exploratory factor analysis on a data set collected from 56 students at two Swiss high
schools addressing the following issues: What factors do the data reveal? What are the
consequences of the factor analysis on the teaching of kinematics? How can this test
be included in a kinematics course?
We show that there are two basic mathematical concepts that are crucial for the
understanding of kinematics: the concept of rate and the concept of vector (including
the direction and the addition of vectors). Furthermore, the investigation of items with
different representations of motion (i.e. stroboscopic pictures, tables of values and
diagrams) reveals that the students use different concepts for the different
representations. In particular, there seems to be no direct transfer between the
picture/table representation and the diagram representation.
Finally, we show how the test can be used as a diagnostic tool in a formative way
providing useful feedback for the students and for the teacher. By means of a latent
class analysis we identify four classes of students with different kinematics concepts
profiles. Such a classification may be helpful for teachers in order to prepare adjusted
learning material.
Keywords: Kinematics, Concept Knowledge, Diagnostic Test, Exploratory Factor
Analysis

INTRODUCTION
In the last two decades, investigations of physics teaching at the high school and
undergraduate level have shown that a majority of science students have difficulty
understanding physics concepts (Hake, 1998; Halloun & Hestenes, 1985). Students often
enter classes with firmly held misconceptions, and conventional physics instruction
produces only small changes in their conceptual knowledge. The students may know
how to use formulas and solve certain numerical problems, but they still fail to
comprehend the physics concepts. The studies cited above indicate that instruction can
only be effective if it takes the students' preconceptions into account. The proper
concepts have to be learned, but the misconceptions also have to be unlearned
(Wagner & Vaterlaus, 2011). This requires the diagnosis of student concepts and
misconceptions.
We have designed a diagnostic test with the purpose of identifying student
concepts and misconceptions in kinematics at the high school level. The test is based
on the following list of kinematics concepts:

C1: Velocity as rate
C2: Velocity as vector in one dimension (i.e. direction of the velocity)
C3: Vector addition of velocities in two dimensions
C4: Displacement as area under the curve in a velocity-time-graph
C5: Acceleration as rate
C6: Acceleration as vector in one dimension (i.e. direction of the acceleration)
C7: Change in velocity as area under the curve in an acceleration-time-graph

The list of concepts has been verified by experts and is in good agreement with the
concepts identified in other studies (e.g. Hestenes, Wells & Swackhamer, 1992).
The development of a new kinematics test was necessary because, so far, no test
exists that allows measuring student concept knowledge for each concept
separately. The FCI (Hestenes, Wells & Swackhamer, 1992) and the MBT (Hestenes
& Wells, 1992) are mainly used to evaluate the overall concept knowledge in
dynamics. Both contain items that correspond to the concepts
mentioned above. However, the number of these items is too small to analyze each
concept separately. The Motion Conceptual Evaluation (Thornton & Sokoloff, 1998)
and the Test of Understanding Graphs in Kinematics (Beichner, 1993), on the other
hand, are based on task-related objectives rather than on concepts. Their items can
therefore not be linked clearly to the concepts listed above.
We have analyzed student responses to our kinematics test addressing the following
questions: Is the test a valid instrument to determine student concept knowledge about
kinematics? Do the students answer coherently referring to the suggested concepts?
What are the consequences of the test results on teaching?
In order to treat the first two issues we carry out an exploratory factor analysis similar
to the factor analysis of the FCI data done by Scott and Schumayer (2012). Factor
analysis is a standard technique in the statistical analysis of educational data sets and
is described in detail in the literature (e.g. Merrifield, 1974; Bühner,
2011). The goal of a factor analysis is to explain the correlations among the items in
terms of only a few fundamental entities called factors or latent traits. A latent trait is
interpreted as a characteristic property of the students that becomes visible when they
attempt to answer the items. The degree to which a student possesses a particular
trait determines the likelihood of answering a particular item correctly. Thus the items are
the manifest indicators of the latent factors. Scott and Schumayer (2012) point out
that it is important to distinguish between "factors" and "concepts". In our context the
concepts are constructs defined by experts while the factors represent the coherence of
the student thinking. The interesting issue is whether the association of items seen by
an expert agrees with the association of items seen by students.
Referring to the third issue we suggest how the test can be applied at school in a
formative way. We show how by means of a latent class analysis different groups of
students with a similar profile of concept knowledge can be found and how a
characterization of these groups can help the teacher to prepare individual material for
every group.
The following section describes the methods including the test instrument, the
collection of data and the exploratory factor analysis. Thereafter the results of the
factor analysis are presented and interpreted. The last two sections are devoted to the
application of the instrument at school and to a final discussion of the results.

METHODS
Test Instrument
The kinematics diagnostic test is designed for high school students at level K-10. The
test items are based on the list of concepts presented in the previous section. For every
concept there is also a set of corresponding misconceptions. The misconceptions have
been verified by asking the students open questions and by analyzing their answers.
Furthermore, they have been confirmed by experts.
The test consists of 56 multiple-choice items on kinematics, each item containing one
right answer and three to four distractors. Every distractor has been chosen in such a way
that it can be assigned to a single misconception. This is different from the other
kinematics tests mentioned before. Thus the test uncovers not only student concepts
but also student misconceptions. The items can furthermore be divided into three
levels of abstraction:

Level A: Items with images (e.g. stroboscopic pictures)
Level B: Items with tables of values
Level C: Items with diagrams

For all levels of abstraction a representative test item is presented in the appendix.
Prior to the exploratory factor analysis we empirically verified the data set. In a first
step we sorted out all items with a difficulty above 0.85 or below 0.15 because items
with such high or low difficulties do not serve as good discriminators. In a second
step we determined the internal consistency by calculating Cronbach's alpha for each
concept separately. Every item that did not contribute to the internal
consistency was reviewed with respect to its content. Items that were considered open to
misunderstanding or that referred to multiple concepts were dropped.
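The two screening steps described above can be illustrated with a minimal sketch. It is not the procedure actually run in the study; it assumes that responses are coded 0/1 in a students-by-items array, that "difficulty" means the fraction of correct answers, and the function names are hypothetical.

```python
import numpy as np


def item_difficulty(responses):
    """Fraction of students who solved each item (rows: students, columns: items)."""
    return np.asarray(responses, dtype=float).mean(axis=0)


def keep_discriminating_items(responses, low=0.15, high=0.85):
    """Column indices of items whose difficulty lies within [low, high]."""
    p = item_difficulty(responses)
    return np.where((p >= low) & (p <= high))[0]


def cronbach_alpha(responses):
    """Cronbach's alpha for the items (columns) assigned to one concept."""
    x = np.asarray(responses, dtype=float)
    k = x.shape[1]                               # number of items
    item_var_sum = x.var(axis=0, ddof=1).sum()   # sum of the item variances
    total_var = x.sum(axis=1).var(ddof=1)        # variance of the sum scores
    return k / (k - 1) * (1.0 - item_var_sum / total_var)
```

In such a workflow, `cronbach_alpha` would be called once per concept, each time on the columns belonging to that concept only.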
Table 1
Distribution of the items. The stars mark items that refer to two concepts. Concepts 4
and 7 are completely excluded from further analysis.

Items        Level A    Level B    Level C             Total
Concept 1    2 4 5      51 53      35 36 40 46* 47*    10
Concept 2    6 7                   28 42 46* 47*       6
Concept 3    8 9 10                                    3
[Concept 4]                        [16 29 37 43]       [4]
Concept 5    11 12      54         38 39 48* 49*       7
Concept 6    13 14                 44 48* 49*          5
[Concept 7]                        [18 45]             [2]
Total        12         3          12                  27

Through the whole verification process the data set was reduced from 56 to 27
items. Table 1 shows the distribution of the remaining test items according to the
concepts and levels of abstraction. The numbers indicate the item numbers in the
test. The stars mark items that refer to two concepts. As the Cronbach's alpha
coefficients for concepts 4 and 7 are below 0.30, these concepts are excluded from
further analysis. The Cronbach's alphas for the other concepts are between 0.60 and
0.80; the mean inter-item correlations are between 0.21 and 0.41.

Collection of data
We collected the data from 56 students in the classes of two teachers at two Swiss high
schools in autumn 2012. The average age of the participants was 16 years with a
standard deviation of 1 year and a range from 14 to 18 years. Thirty participants were
female and 26 were male. About half of the students were majoring in economics, the
others in science and languages. Independent of their major subject, all of the students
attended a similar basic kinematics course lasting about six weeks. The test was
presented online at the end of the instruction. The order of items was the same for all
students. They were required to complete the test, and no item could be skipped.
The time taken to answer each item as well as the time to complete the test was recorded
individually. The average overall time for completing the 56 items was (46 ± 8) min.

Exploratory Factor Analysis


For the analysis we used the program SPSS (2010). The first step of an exploratory
factor analysis is the construction of the correlation matrix for the set of items
under investigation. We used the standard Pearson correlation function to calculate
the matrix. As a next step we chose the maximum-likelihood method for the reduction
of the correlation matrix. It is one of the standard methods and usually provides
results similar to those of the principal axis analysis. Moreover, it is the method most
often used in confirmatory factor analysis (Bühner, 2011). Before reducing the matrix, we
determined the optimal number of factors. This may be achieved in several ways. We
used the scree plot technique (Cattell, 1966), where the eigenvalues of the correlation
matrix are plotted in decreasing order. Scanning the graph from left to right, one
looks for a knee. All the factors to the left of the knee are counted, while those
factors that fall in the "scree" of the graph are discarded (Bortz, 1999). An
example is given in Figure 1. The graph shows the eigenvalue of each successive
factor in the explanation of the data set. The full lines are guides to the eye. The
graph suggests a three-factor model. Of course this method is somewhat subjective,
as the "knee" is not precisely defined. Therefore it is important to check whether the
number of factors can also be derived from a theoretical model (Bühner, 2011).
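The numbers behind a scree plot can be obtained with a few lines of code. The following is only a sketch of the eigenvalue step, not the SPSS routine used in the study; it assumes 0/1 coded responses in a students-by-items array without constant columns, and the function name is hypothetical.

```python
import numpy as np


def scree_eigenvalues(responses):
    """Eigenvalues of the Pearson correlation matrix of the items, largest first."""
    x = np.asarray(responses, dtype=float)
    corr = np.corrcoef(x, rowvar=False)   # items are the columns
    eigenvalues = np.linalg.eigvalsh(corr)  # ascending order, symmetric matrix
    return eigenvalues[::-1]
```

Plotting the returned values against their rank (1, 2, 3, ...) with any plotting tool gives the scree plot; locating the knee remains a visual judgment.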
The last step in the factor analysis was to perform a rotation of the factor axes to see if
there was another set of eigenvectors that is more amenable to interpretation. There
are two possibilities for rotating the axes. If we can assume that the factors are
uncorrelated, we require the resulting eigenvectors to be orthogonal. Alternatively, if
we do not know whether the factors are correlated, we allow the rotation to produce a set of
non-orthogonal eigenvectors. The latter option provides us with information about the
relationship between the factors. We used this option, choosing the prevalent Promax method.
In order to carry out an exploratory factor analysis it is a standard rule of thumb to use
at least 10 times as many respondents as there are items in the test (Everitt, 1975). As
our sample size (N = 56) is relatively small compared with the number of items (n = 27),
we decided to carry out the analysis step by step. We first conducted the analysis for the
data of the different abstraction levels A, B and C separately. Moreover, we left out
items 46-49, which refer to two concepts. This way the number of items was reduced
to 12, 3 and 8 for levels A, B and C, respectively. Afterwards we checked whether the results
for the different levels were compatible. In order to check whether the set of items was
suitable for an exploratory factor analysis we calculated the Kaiser-Meyer-Olkin
coefficients (Cureton & D'Agostino, 1983). The standard rule is that the KMO
coefficient should be at least 0.60 and, for good results, above 0.80. Our values
ranged from 0.65 to 0.77.
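For readers who want to reproduce such a check, the overall KMO coefficient can be computed directly from a correlation matrix. The sketch below uses the common definition (squared correlations compared with squared partial correlations); it assumes an invertible correlation matrix, and the function name is hypothetical.

```python
import numpy as np


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin coefficient of a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)        # partial correlations (off-diagonal)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off_diagonal] ** 2).sum()   # squared correlations
    p2 = (partial[off_diagonal] ** 2).sum()  # squared partial correlations
    return r2 / (r2 + p2)
```

Values close to 1 mean that the partial correlations are small compared with the correlations, which is the situation in which a factor model is plausible.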

Figure 1. Scree Plot. The eigenvalues of the Pearson correlation matrix are depicted
in decreasing order. The knee is between the factors three and four. This suggests a
three-factor model.

RESULTS AND INTERPRETATION


Level A items
For the level A items the scree plot (see Fig. 1) suggests that three factors determine
student responses. These three factors account for 47 % of the variance in the data.
The data analysis shows that all items (except item 12) can be clearly assigned to one
of the underlying factors. The loadings of these items onto the factors are between 0.33
and 1.00. We note that items 2, 4, 5 and 11, which are grouped into factor 1, refer
to the rate concepts C1 and C5. Furthermore, items 6, 7, 13 and 14, which are grouped
into factor 2, refer to the vector concepts C2 and C6. Finally, items 8, 9 and 10,
corresponding to concept C3, load on a separate factor 3. In the present non-orthogonal
three-factor model the correlation coefficients between the factors are 0.19
(between factors 1 and 2), 0.31 (between 1 and 3) and 0.38 (between 2 and 3).
The factor structure shows that there is a tendency for a student to give a correct
answer to one of the "rate items" (2, 4, 5, 11) given that this student has answered
another rate item correctly. The same holds for the "vector items" (6, 7, 13, 14) and
for the "vector addition items" (8, 9, 10). We may draw the conclusion that the
association of items seen by the students is in accordance with the association of
items seen by experts. Moreover, the actual contents, velocity and acceleration,
seem to play only a limited role. Much more relevant for answering the items
correctly is the understanding of the mathematical concepts of rate and vector
(including direction and addition). It is therefore tempting to interpret the underlying
factors as "rate concept", "direction concept" and "vector addition concept". The three
factors are only weakly correlated, meaning that we have three almost independent
factors. The fact that the correlation is strongest between factors 2 and 3 is in
line with our interpretation. These factors both refer to a vector concept whereas
factor 1 refers to a rate concept.

Level B items
Level B (tables of values) contains only three items. A factor analysis indicates that a
single factor may be taken as underlying student responses. The factor explains 50 %
of the variance in the data. The loadings of items 51, 53 and 54 on the factor are
0.85, 0.69 and 0.56. We note that all the items have high loadings on the factor.
However, the loading of item 54 is the lowest. Items 51, 53 and 54 are related to
the rate concepts C1 and C5 (see Tab. 1).
Again there seems to be an underlying "rate concept" which can explain a notable part
of the correlation between items 51, 53 and 54. The fact that item 54 has a lower loading
may be due to its different content. While items 51 and 53 are about velocity, item
54 probes student understanding of acceleration.

Level C items
Considering the scree plot, we used a two-factor model for the data from the level C
items. The two factors account for 47 % of the variance in the data. All items can be
clearly assigned to one of the underlying factors. The factor loadings range from 0.29
to 0.96. The correlation coefficient between the factors is 0.301.
We find again that the items corresponding to the rate concepts C1 and C5 group into
one factor whereas the items linked to the direction concepts C2 and C6 group into
another one. As with the items with stroboscopic pictures (level A), solving the
diagram items (level C) seems to involve two underlying factors that may
be interpreted as a rate concept on one side and a direction concept on the other
side. Of course it is not clear whether the factors found at the two different levels A and C
are actually the same. But again the understanding of the two basic mathematical
concepts of rate and direction seems to be crucial for the interpretation of diagrams in
kinematics. The correlation coefficient between the factors is again small, indicating
that the two factors are mostly independent of each other.

Overall result
The interesting issue is whether the "rate factors" and the "direction factors" found at
different abstraction levels are correlated: Are these two factors universal for solving
problems in kinematics? In order to investigate this issue we carried out a factor
analysis including all items that loaded on these two factors at levels A, B and C.
The result of this analysis is shown in Table 2. Four factors were detected, explaining
50.0 % of the total variance in the data set. It is common practice to accept loadings
above 0.3 as indicating a relevant correlation between a particular item and the
underlying factor (Kline, 1994). Therefore, and for better clarity, absolute values
below 0.3 are either hidden or, if they are important for interpretation, put in brackets.
The first factor groups together the items from levels A and B corresponding to the rate
concepts C1 and C5. With the exception of item 12, which also loads on factor 3, the
loadings are all between 0.59 and 0.99, meaning that these items have a high correlation
with the underlying factor. The second factor mainly groups the items from levels A
and B that refer to the direction concepts C2 and C6. However, item 7 loads on all
the factors and cannot be assigned clearly to one factor. Factors 3 and 4 group the
items of level C. Again there is a tendency for the items corresponding to the rate
concepts to contribute to one factor whereas the items referring to the direction concepts
load on the other factor. The highest factor correlation is between factors 2 and 3,
with a value of 0.42. The other correlations are below 0.3.
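Table 2
Factor loadings for all factor 1 and factor 2 items of the levels A-C.

Level  Item  Factor 1  Factor 2  Factor 3  Factor 4  Corresponding concept
A      2     .78                                     C1: Velocity as rate
A      4     .68                                     C1: Velocity as rate
A      5     .99                                     C1: Velocity as rate
A      11    .60                                     C5: Acceleration as rate
A      12    [.27]                                   C5: Acceleration as rate
B      51    .70                                     C1: Velocity as rate
B      53    .71                                     C1: Velocity as rate
B      54    .59                                     C5: Acceleration as rate
A      6                                             C2: Velocity as vector
A      7                                             C2: Velocity as vector
A      13              .69                           C6: Acceleration as vector
A      14              .69                           C6: Acceleration as vector
C      35                                            C1: Velocity as rate
C      36                                            C1: Velocity as rate
C      40                                            C1: Velocity as rate
C      38                        .72                 C5: Acceleration as rate
C      39                        .92                 C5: Acceleration as rate
C      28                                  .30       C2: Velocity as vector
C      42                                  .98       C2: Velocity as vector
C      44                                  .47       C6: Acceleration as vector

Further loadings of the table whose cell positions are not recoverable: [-.27], .35,
[-.21], .37, [.08], [.15], [.13], [.18], .35, .33, .40, [.24], .67, .37, [-.03], [.28],
[-.16], [.13].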

The main observation is that we have different factors for level A/B and level C items.
Evidently, from the students' point of view the interpretation of diagrams differs from
the interpretation of stroboscopic pictures and tables. There is no direct transfer
between these two representations of motion. Therefore, instead of having two
universal rate and direction factors, we have to distinguish between the levels of
abstraction or, in other words, between the different representations. Overall there
seem to be five different underlying factors that determine whether the items are
answered correctly. We suggest interpreting the factors as follows:

Factor 1: Rate concept for representations with images and tables
Factor 2: Direction concept for representations with images and tables
Factor 3: Rate concept for representations with diagrams
Factor 4: Direction concept for representations with diagrams
Factor 5: Vector addition concept for representations with images

There are some details in the results that need to be discussed. First, item 12 does not
load mainly on factor 1. There is no indication that the item differs from the other
factor 1 items as regards form and content. A possible reason is its high difficulty of
0.80. As discussed before, high difficulties usually lead to smaller correlations, in
particular when the sample size is rather small. Item 7 does not fit well into our
suggested 5-factor-model either. Apparently the integration of the level C items into the
factor analysis slightly changes the factor axes such that the loading of item 7 on
factor 2 is lowered. There is no obvious reason why item 7 loads on the factors linked
to the diagrams. We have to recall that the sample is actually too small for the number
of items included in the present factor analysis, so that the values have to be
interpreted with caution. Finally, on level C we have items 35 and 36, which no longer
load only on the rate factor but also on the direction factor. This
is actually due to item 40. After removing that item from the analysis we observed an
increase of the loadings of items 35 and 36 on the rate factor. This shows again that
the factor analysis is very sensitive to small changes when the number of items is large
compared to the sample size. The loadings of items 35, 36 and 40 on both
factors 2 and 3 are also the cause of the noted correlation between factors 2 and 3.
There is no obvious reason for this correlation from a theoretical point of view.
Finally, we investigated how items 46-49, which can be linked to both the rate
concept and the direction concept, fit into our 5-factor-model. All of these items
contain a given kinematics graph (e.g. a velocity-time diagram). The student then has
to select another corresponding diagram (e.g. a position-time diagram). We integrated
the items one by one to check which factor they load on while the factor axes are not
changed too much. We found that all these items load on both factors 3 and 4 with
values above 0.3. This is an important finding as it shows that the answers to
items that refer to more than one concept can also be explained within our
5-factor-model. There is no indication that new factors emerge for more complex problems.

APPLICATION
We suggest integrating the present test into the basic kinematics course in a formative
way. The test provides detailed feedback for the students as well as for the teacher.
For every student, two diagrams can be prepared, one illustrating the percentages of
items solved correctly for each of the seven concepts and the other showing which
misconceptions are still present. The teacher gets feedback about the overall
performance of the class. Furthermore, by means of a latent class analysis (LCA) the
teacher can find groups of students with similar concept profiles (Collins & Lanza,
2010). This allows the teacher to prepare customized materials for the groups so
that the students can work on their individual deficits and have the chance to catch up.
For better illustration we performed an LCA with the help of the program MPlus (2011).
We included the data of the 27 items shown in Table 1. In order to determine the
optimal number of classes we used a technique similar to the one used for the factors.
Instead of plotting the eigenvalues, we plotted the log-likelihood against the number of
classes. By locating the knee in the graph we found four different classes that can be
assigned to four groups of students. The characteristics of the four groups are shown
in Figure 2. The mean score is defined as the group average of the fraction of
correctly solved items corresponding to the particular concept. Although we did not
include the items referring to concepts C4 and C7 in the LCA, we plotted their
mean scores for completeness. The four groups can be characterized as follows:

Class 1: All-rounders. These students understand all concepts sufficiently.
Class 2: 1D-students. These students solve problems in one dimension often
correctly, but they fail at the two-dimensional addition of velocity vectors.
Class 3: Non-Vectorians. These students seem to have a good understanding
of the rate concepts but they have difficulties with the vector concepts
(direction and addition).
Class 4: Conceptless. These students are not able to apply a concept in
different situations properly.
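The mean scores that characterize such classes follow directly from the definition given above. The sketch below assumes that the class membership of every student is already known (for example exported from an LCA program) and that responses are coded 0/1; the function name, the integer class labels and the concept-to-column mapping are hypothetical.

```python
import numpy as np


def concept_profiles(responses, concept_items, class_labels):
    """Mean score per concept for every class.

    responses:     students x items array of 0/1 answers
    concept_items: dict mapping a concept name to its item column indices
    class_labels:  one integer class label per student
    """
    x = np.asarray(responses, dtype=float)
    labels = np.asarray(class_labels)
    profiles = {}
    for cls in np.unique(labels):
        members = x[labels == cls]   # rows of the students in this class
        profiles[int(cls)] = {
            concept: float(members[:, columns].mean())
            for concept, columns in concept_items.items()
        }
    return profiles
```

Because every student answers the same items, averaging over all answers of a class equals the group average of the individual fractions of correctly solved items.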

Figure 2. Characteristics of the classes found with LCA.


We suggest that the diagnostic test be followed by a reflective lesson in which the
students are given time to work on their deficits. The classification simplifies the
preparation of individualized learning material. The teacher directly has an overview
of the characteristic groups of students and can provide adjusted learning material.
For example, the teacher can prepare material about the addition of vectors for all
students who belong to class 2.
Of course, these groups are only exemplary. Further studies are needed to investigate
whether these characteristics are typically found in kinematics classes at Swiss high schools.

DISCUSSION
We have found that there are two basic mathematical concepts that are crucial for the
understanding of kinematics: the concept of rate and the concept of vector (including
direction and addition). The context and the content seem to play only a minor role. A
student who understands the concept of rate is able to answer questions
about velocity and acceleration correctly in different contexts. The same holds for the vector
concept. This result has direct implications for instruction. It suggests that
kinematics courses should first focus on the learning of the mathematical
concepts. Transferring the mathematical concepts to physical contents and applying
them in different contexts appears to be easier for students than learning physical
concepts without a mathematical foundation. These findings are broadly in line
with the results of Christensen and Thompson (2012), who investigated the graphical
representations of slope and derivative among third-semester students. In their
conclusion they stated that "some of their demonstrated difficulties [in physics] seem
to have origins in the understanding of the math concepts themselves". Moreover,
Bassok and Holyoak (1989) found similar results when analyzing the interdomain transfer
between isomorphic topics in physics and algebra. Students who had learned
arithmetic progressions were very likely to spontaneously recognize the application of
the algebraic methods in kinematics. In contrast, students who had learned the physics
topic first almost never exhibited any detectable transfer to the isomorphic algebra
problems. Finally, it has to be mentioned that even if the understanding of the
mathematical concepts seems to be a requirement for understanding kinematics, it
does not guarantee success (Planinic, Ivanjek and Susac, 2013).
Another interesting finding is that the expert associations of items corresponding to
concepts C4 and C7 could not be found in the student answers. These items
involve the evaluation of areas under the curve. Apparently most of the students did not
have proper area concepts. Instead, interviews showed that students often
argued with a concept of average. For example, when they were asked to interpret the
velocity-time diagram of an object with regard to the distance covered, they often did
not consider the area under the curve but tried to estimate the mean velocity. From a
mathematical point of view, finding the mean value is equivalent to determining the
area under the curve and dividing by the interval size. Still, the interviews indicated
that the use of an average concept is accompanied by different misconceptions than
the use of the area concept. All in all, the items corresponding to concept C7 were the
most difficult of the test. This can be seen in Figure 2. These results are in line with
the findings of Planinic, Ivanjek and Susac (2013). They also found that the slope
concept (which we call the rate concept) could be easily transferred from
mathematical to physical contexts. However, this is not the case for the concept of the
area under the graph. The transfer of this concept from mathematics to physics was found to
be much more difficult for the students. A possible reason could be the fact that
during the teaching of kinematics the interpretation of the slope is usually emphasized
much more than the interpretation of the area under the graph.
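The stated equivalence between the mean value and the area under the curve divided by the interval size can be checked numerically. The following sketch uses the trapezoidal rule on sampled values; the function names and the sample velocity profile are illustrative assumptions and are not taken from the test.

```python
def area_under_curve(times, values):
    """Trapezoidal estimate of the area under a sampled curve."""
    area = 0.0
    for i in range(1, len(times)):
        # area of the trapezoid between two neighbouring samples
        area += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
    return area


def mean_value(times, values):
    """Mean value of the curve: area under the curve divided by the interval size."""
    return area_under_curve(times, values) / (times[-1] - times[0])


# Hypothetical velocity-time data: speeding up for 2 s, then constant for 2 s.
times = [0.0, 2.0, 4.0]        # s
velocities = [0.0, 4.0, 4.0]   # m/s
displacement = area_under_curve(times, velocities)   # in m
mean_velocity = mean_value(times, velocities)        # in m/s
```

For a velocity-time graph the area is the displacement (concept C4), so the mean velocity times the duration gives the same displacement by construction.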
As the kinematics test used in this study contains 27 items, a minimum number of 270
students is needed to produce a reliable result by means of a factor analysis. As we do
not meet this requirement (N = 56), the present results are preliminary. Still, the fact
that the association of items given by the assignment to the concepts by experts could
be clearly found in the student answers is very promising. Furthermore most of the
results in this study confirm results from other studies. This gives rise to hope that the
results will be corroborated in a following study with a bigger sample size.

REFERENCES
Bassok, M. & Holyoak, K. J. (1989). Interdomain Transfer Between Isomorphic
Topics in Algebra and Physics. Journal of Experimental Psychology: Learning,
Memory, and Cognition 15 (1): 153-166.
Beichner, R. J. (1993). Testing student interpretation of kinematics graphs. Am. J.
Phys. 62 (8): 750-762.
Bortz, J. (1999). Statistik für Sozialwissenschaftler (5. Aufl.). Berlin: Springer.
Bühner, M. (2011). Einführung in die Test- und Fragebogenkonstruktion (3. Aufl.).
München: Pearson.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behav.
Res. 1, 245.
Christensen, W. M. & Thompson, J. R. (2012). Investigating graphical representations
of slope and derivative without a physics context. Phys. Rev. ST Phys. Educ.
Res. 8, 023101.
Collins, L. M. & Lanza, S. T. (2010). Latent Class and Latent Transition Analysis.
Wiley Series in Probability and Statistics, edited by Balding, D. J., Cressie,
N. A. C., Fitzmaurice, G. M., Johnstone, I. M., Molenberghs, G., Scott, D. W.,
Smith, A. F. M., Tsay, R. S., Weisberg, S. Hoboken, NJ: John Wiley &
Sons, Inc.
Cureton, E. E. & D'Agostino, R. B. (1983). Factor analysis: an applied approach.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Everitt, B. S. (1975). Multivariate analysis: The need for data, and other problems. Br.
J. Psychiatry 126, 237.
Hake, R. R. (1998). Interactive-engagement versus traditional methods: A
six-thousand-student survey of mechanics test data for introductory physics
courses. Am. J. Phys. 66 (1): 64-74.
Halloun, I. A. & Hestenes, D. (1985). The initial knowledge state of college physics
students. Am. J. Phys. 53 (11): 1043-1055.
Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The
Physics Teacher, 30, 141-158.
Hestenes, D. & Wells, M. (1992). A mechanics baseline test. Phys. Teach. 30, 159.
Kline, P. (1994). An Easy Guide to Factor Analysis. London: Routledge.
Merrifield, P. R. (1974). Factor analysis in educational research. Rev. Res. Educ. 2,
393.
MPlus (2011). MPlus Version 6.11, Muthén & Muthén. Available from
www.statmodel.com (Dec 16, 2013).
Planinic, M., Ivanjek, L. & Susac, A. (2013). Comparison of university students'
understanding of graphs in different contexts. Phys. Rev. ST Phys. Educ. Res.
9, 020103.
Scott, T. F. & Schumayer, D. (2012). Exploratory factor analysis of a Force Concept
Inventory data set. Phys. Rev. ST Phys. Educ. Res. 8, 020105.
SPSS (2010). IBM SPSS Statistics, Version 19.0.0 for Mac, SPSS Inc.
Thornton, R. & Sokoloff, D. (1998). Assessing student learning of Newton's laws:
The Force and Motion Conceptual Evaluation. Am. J. Phys. 66 (4): 338-352.
Wagner, C. & Vaterlaus, A. (2011). A Model of Concept Learning in Physics.
Conference Proceedings, Frontiers of Fundamental Physics FFP12. Retrieved
Dec 16 2013 from: http://www.fisica.uniud.it/~ffp12/proceedings.html.

APPENDIX
Example 1: Item 14 (Level A, concept C6: acceleration as vector)
A helicopter is approaching for a landing. It moves vertically downwards and
reduces its velocity.
Which of the following statements describes the acceleration of the helicopter best?
1. The acceleration is zero.
2. The acceleration points downwards.
3. The acceleration points upwards.
4. The direction of the acceleration is not defined.
5. The acceleration has no direction.

Example 2: Item 51 (Level B, concept C1: velocity as rate)
Two bodies are moving on a straight line. The positions of the bodies at successive
0.2-second time intervals are represented in the table below.

Time in s    Body 1: Position in m    Body 2: Position in m
0.0          0.2                      0.0
0.2          0.4                      0.4
0.4          0.7                      0.8
0.6          1.1                      1.2
0.8          1.6                      1.6
1.0          2.2                      2.0
1.2          2.9                      2.4
1.4          3.7                      2.8

Do the bodies ever have the same speed?
1. No.
2. Yes, at the instants t = 0.2 s and t = 0.8 s.
3. Yes, at the instant t = 0.2 s.
4. Yes, at the instant t = 0.8 s.
5. Yes, at some time between t = 0.4 s and t = 0.6 s.

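The reasoning that Example 2 asks for can be checked numerically. The following is a minimal sketch, not part of the original test instrument: it computes the average speed of each body in every 0.2 s interval of the table and lists the intervals in which the two average speeds coincide. The function name and the rounding tolerance are illustrative assumptions.

```python
# Positions taken from the table of Example 2 (time step 0.2 s).
times = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]
body1 = [0.2, 0.4, 0.7, 1.1, 1.6, 2.2, 2.9, 3.7]
body2 = [0.0, 0.4, 0.8, 1.2, 1.6, 2.0, 2.4, 2.8]


def interval_speeds(t, x):
    """Average speed (m/s) between successive table rows, rounded to
    suppress floating-point noise."""
    return [
        round((x[i + 1] - x[i]) / (t[i + 1] - t[i]), 6)
        for i in range(len(t) - 1)
    ]


v1 = interval_speeds(times, body1)  # body 1 speeds up from interval to interval
v2 = interval_speeds(times, body2)  # body 2 moves at constant speed

# Intervals in which both bodies have the same average speed.
matches = [
    (times[i], times[i + 1])
    for i in range(len(v1))
    if v1[i] == v2[i]
]

print(v1)
print(v2)
print(matches)
```

The two bodies share a position at t = 0.2 s and t = 0.8 s, but their average speeds agree only in the interval from 0.4 s to 0.6 s, which is the distinction between position and rate that the item appears to target.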
Example 3: Item 28 (Level C, concept C2: velocity as vector)
The following represents a position-time graph (x-t-diagram) for an object.

[Graph not reproduced: position x plotted against time t]

Which of these describes the motion best?
1. The object always moves forward.
2. The object always moves backward.
3. The object moves forward at first. Then it moves backward.
4. The object moves down an inclined plane.

MEASURING EXPERIMENTAL SKILLS IN LARGE-SCALE ASSESSMENTS:
DEVELOPING A SIMULATION-BASED TEST INSTRUMENT
Martin Dickmann1, Bodo Eickhorst2, Heike Theyßen1, Knut Neumann3, Horst
Schecker2 and Nico Schreiber1
1 University of Duisburg-Essen
2 University of Bremen
3 Leibniz Institute for Science and Mathematics Education
Abstract: Fostering students' experimental skills is widely regarded as a key
objective of science education (e.g. KMK, 2005; NRC, 2012). Hence,
assessment tools for measuring experimental skills are required. In most large-scale
assessments experimental skills are measured by paper and pencil tests. This
is due to the efficiency of administering and scoring such tests. However, several
studies show only low correlations between students' achievements in paper and
pencil tests and their achievements in hands-on experiments (e.g. Shavelson,
Ruiz-Primo, & Wiley, 1999; Stecher et al., 2000; Hammann et al., 2008; Emden, 2011;
Schreiber, 2012). This calls the validity of the paper and pencil approach into
question. On the other hand, hands-on testing with real experiments is costly
and time-consuming in terms of logistics, data acquisition and data analysis,
especially in large-scale assessments. More manageable tools for measuring
experimental skills in large-scale assessments seem to be computer-based
experiments with interactive simulations (cf. Schreiber, 2012). Our goal is a
simulation-based test instrument that measures experimental skills validly and
reliably. It should cover all three phases of experimenting: preparation,
performance and evaluation. Schreiber (2012) found that comprehensive
experimental tasks posed in an open format cause a high dropout rate during the
test. We thus develop consecutive tasks that divide the complex experimental
demands into a sequence of smaller items. Each item operationalizes a specific
experimental skill. To avoid follow-up errors, each item contains a sample
solution of the preceding item. In order to secure test quality, extensive validation
studies are carried out.
Keywords: large-scale assessment, computer-based testing, assessment of
competence, performance assessment, scientific experimentation

THEORETICAL FRAMEWORK
Science education standards emphasize the importance of experimental skills for
scientific literacy (e.g., KMK, 2005; NRC, 2012). Students' abilities to plan and
carry out experimental investigations are included in evaluations of national
standards as well as in international student assessments (OECD, 2007). Theories
of the experimental process typically distinguish between three phases of
experimenting: preparation (e.g. planning experimental procedures), performance
(e.g. setting up the apparatus) and evaluation (e.g. interpreting results) (cf. Emden,
2011). A test instrument measuring experimental skills should cover all
three phases.
Testing experimental skills has to address several problems, especially in large-scale
assessments. A process-based assessment, analyzing students' actions during
hands-on experiments, is resource-consuming. Supplying standardized apparatus
for hands-on tests poses problems of logistics. Paper and pencil tests can hardly
cover experimental skills of the performance phase. Thus, paper-pencil tests are
often narrowed to the preparation of experiments and the evaluation of data (e.g.,
Glug, 2009).
Previous studies on the exchangeability of test formats for experimental skills show
only low correlations between students' achievements in paper and pencil tests and
their achievements in hands-on experiments (e.g., Shavelson, Ruiz-Primo, &
Wiley, 1999; Stecher et al., 2000; Hammann et al., 2008; Emden, 2011;
Schreiber, 2012). On the other hand, studies indicate that computer simulations
might be valid substitutes for hands-on experiments in tests (cf. Shavelson et al.,
1999; Schreiber, 2012).
Schreiber (2012) found no significant difference between the distributions of
achievement scores gained from computer-based testing with mouse-on
experiments and hands-on testing, whereas the distributions differed significantly
between a paper and pencil test and a hands-on test. Schreiber (2012) also found
that broad experimental tasks posed in an open format cause a high dropout rate
during the test. This leads to floor effects and missing data.

RATIONALE AND METHODS


Aims of the study
Our overall aim is to develop a test instrument that can reliably and validly
measure experimental skills and that is suitable for large-scale assessments. We
assume that the problems and restrictions discussed above can be reduced by
computer-based testing. This approach allows us to include the performance
phase of experimental investigations. In contrast to hands-on experiments, the
logistics of computer-based testing are easier for large-scale studies. Students'
actions can be recorded automatically and evaluated on the basis of log files. In
order to secure test quality, extensive validation studies are carried out.

Test instrument
The test instrument refers to typical experimental tasks in secondary school
physics instruction. The target group are students at the end of lower secondary
education (aged 14 to 16). The test instrument consists of several units. Each unit
deals with a specific experimental task. The students have to perform a complete
experimental investigation, i.e. plan the experiment, prepare the setup, perform
the measurements, analyze experimental data and draw conclusions. To
minimize drop-out caused by comprehensive experimental tasks (Schreiber,
2012), each unit is split up into a sequence of items each referring to one
experimental skill (e.g. plan the experiment or perform the measurements).
Furthermore, each item starts with a sample solution of the preceding item. Thus,
students' experimental skills can be assessed across the full range of the phases of
an experimental investigation. For instance, students who do not succeed in
assembling an appropriate experimental set-up can still proceed with the
measurement item, because it provides them with a functional set-up. The
intermediate solutions are presented by two fictitious students (Alina and Bodo)
who are said to have worked on the same experimental task.

[Figure 1 not reproduced. Phases of experimentation: preparation, performance,
evaluation. Components: describe basic idea; specify procedure (select from a
given set of apparatus, sketch the set-up, describe the course of action);
prepare measurement report; assemble and test an experimental setup; perform
and document measurements; plan the evaluation of data; process data; draw
conclusions.]
Figure 1: Model of experimental tasks (grey: phases of experimentation; light:
components)

Altogether, twelve units are being developed and tested. They cover content areas
in electric circuits, mechanics and geometrical optics.
Task development is based on the model shown in Figure 1. The model integrates
previously developed models for experimental skills (cf. Schreiber et al., 2012;
Nawrath, Maiseyenka & Schecker, 2011). The model describes eight experimental
skills (light boxes in Fig. 1), grouped into the three phases of experimentation. As the
test is intended to focus on the actual performance of an experimental investigation,
we do not consider more general components of scientific knowledge generation
like "develop questions" and "phrase hypotheses".
A unit consists of six items (out of eight), with two items for each phase of
experimentation. The two components of the performance phase are included in each
unit.
Figure 2 shows a sample item of the unit "Elongation of a rubber band" from the
content area mechanics. In this unit Alina and Bodo want to test the hypothesis:
"The expansion of a rubber band is proportional to the attached weight." The
students have to choose the right material, describe the measuring procedure,
assemble the setup etc. In the particular item shown in Figure 2 the students get
a functional setup to perform their measurements and a properly prepared table to
document the data.
For the choice of suitable material (preparation phase), for the assembling and
testing of the experimental setup and for the performance of the measurement
(performance phase), simulations are provided that enable the students to interact
with the material, to observe the effects and to measure data. In the simulation
shown in Figure 2 students can, for example, attach masses to the rubber
band, adjust the scale, and observe and measure the elongation of the rubber
band.

Figure 2: Sample item "perform and document measurements". Pressing the green
button "Your task" will show the hypothesis which has to be tested. Explanations for
technical terms can be found using the yellow button "Support".

Validation Studies
The development of a new test instrument and a new test format has to be
underpinned by extensive validation studies. Our validation studies include
content analysis, an analysis of students' individual solution strategies, the analysis
of the relationship with external variables and the analysis of the internal test
structure (Wilhelm & Kunina, 2009). Table 1 shows an overview of the research
questions and the methods used for each validation aspect.
In this paper we focus on the content and the cognitive aspects of validation.

Table 1
Validation aspects, corresponding research questions and studies to answer the
research questions

Validation aspect: Content
  Research questions:
    Do the tasks represent experiments that students are likely to have seen or
    worked on?
    Are the tasks consistent with typical demands posed in classroom practices of
    experimenting?
  Studies: Analyses of syllabi and schoolbooks; expert ratings

Validation aspect: Individual strategies (cognitive processes)
  Research questions:
    Do experimental considerations dominate in students' thinking while working
    on the tasks?
    Do the tasks offer adequate support to compensate for deficits in physics
    content knowledge?
  Studies: Think aloud (intro- and retrospective)

Validation aspect: Relationship with external variables
  Research questions:
    Is the cognitive load of mouse-on experimenting comparable to a hands-on test
    format?
    Is the mouse-on test performance a good predictor for performance in hands-on
    tests?
  Studies: Comparative studies in the science education lab (mouse-on vs.
  hands-on)
  Research questions:
    Do experimental skills differ sufficiently from physics content knowledge and
    cognitive abilities?
  Studies: Large-scale (400 students per unit)

Validation aspect: Internal test structure
  Research questions:
    Do the items form a reliable scale?
    Are the three phases of experimentation empirically separable in students'
    performances?
  Studies: Large-scale (400 students per unit)

CONTENT ANALYSIS
Methods
Syllabi and schoolbooks were initially analyzed to identify physics content areas
and experimental challenges that are in accordance with aims and practices of
physics instruction (content validity). The analysis was done in two steps. In
the first step, key terms were identified in an inductive process by going
through curricula and schoolbooks. To ensure comparability across the syllabi of
the 16 German federal states, similar terms were clustered in 35 term groups.
The term group "electrical resistance", for example, includes the terms electrical
resistance, specific resistance, I-U characteristics, Ohm's law, and electrical
conductivity. The quality of this method was verified for the content areas
mechanics, optics, electricity, and thermodynamics. The inter-rater reliability
(Cohen's kappa) of assigning terms to term groups is at least satisfactory (.78 <
κ < .95). In the second step, a criteria-based investigation of the 16 syllabi and of
selected schoolbooks was carried out. The curricula and schoolbooks were
searched for the term groups, differentiating between general occurrence and
occurrence in (explicit) conjunction with an experimental action (preparation,
performance, evaluation). In addition, the syllabi were analyzed with regard to
the grades in which a term group occurs and whether it is obligatory or optional
content. In the schoolbooks all the experimental tasks referring to a term group were
identified.
Based on these analyses 22 suggestions for typical experimental tasks were
generated. In an online survey, 53 experts (experienced teachers) rated the extent
to which these tasks comply with typical demands posed in classroom practices of
experimenting (four-level Likert scales). For example, we asked the experts
how likely it is that students would have had appropriate learning opportunities
enabling them to solve the task. We also asked the experts how likely it is that
students could plan, perform or evaluate this or a very similar experiment at the
end of lower secondary education.
Evaluating the online survey, we ranked the experimental tasks for each content
area separately. Our main criterion was that, according to the experts'
estimations, it is likely or very likely that the students can perform the experimental
task.

Results
Our syllabus analysis confirmed the central role of experiments in physics teaching
(cf. Tesch, 2005). The analysis yielded a high consistency of the obligatory
content (measured by the occurrences of the term groups) to be dealt with during
lower secondary physics instruction across the 16 federal states of Germany.
Minor differences were found with regard to the grade in which the content is
taught. Comparing the experiments presented in the schoolbooks we were able to
identify a consistent set of widely used topics for student experiments. For 12 of the
22 tasks our main criterion (M ≥ 3.0 on a scale from 1 to 4) was fulfilled. Especially
in the domains electric circuits and optics our tasks comply with typical
demands posed in classroom practices of experimenting. In mechanics the ratings
of two out of four tasks were not satisfactory. We thus developed two more tasks
for this domain.
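The selection step described above can be illustrated with a short sketch. It assumes that the main criterion is a mean expert rating of at least 3.0 on the four-level scale; the task names and ratings are hypothetical and are not data from the study.

```python
from statistics import mean

# Hypothetical expert ratings (1 = very unlikely ... 4 = very likely).
# These values are illustrative only, not results of the survey.
ratings = {
    "task_A": [4, 3, 3, 4],
    "task_B": [2, 3, 2, 3],
    "task_C": [3, 3, 3, 3],
}

THRESHOLD = 3.0  # assumed reading of the main criterion


def rank_tasks(task_ratings, threshold=THRESHOLD):
    """Rank tasks by mean rating and return those meeting the threshold."""
    means = {task: mean(values) for task, values in task_ratings.items()}
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    accepted = [task for task, m in ranked if m >= threshold]
    return ranked, accepted


ranked, accepted = rank_tasks(ratings)
print(ranked)
print(accepted)
```

In the study the ranking was done separately for each content area, so a function like this would be applied once per area.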
As the result of our content analysis we can build on a set of twelve experimental
tasks with high content validity for the physics domains electric circuits,
geometrical optics and mechanics. The tasks provide a solid basis for the
investigation of further validation aspects, in particular cognitive validity. We have
designed twelve complete units around these tasks (together with the interactive
simulations).

COGNITIVE VALIDATION
Methods
For the aspect of cognitive validation we focus on the students' cognitive
processes while working on the units. The key issue is whether experimental
considerations dominate in students' thinking while they try to solve the items:
Are their actions driven by reflections on the experiment to be conducted or by other
aspects, like operating the simulation software? To answer this research question,
four out of twelve units from the three content areas are analyzed with think-aloud
techniques (intro- and retrospective). About 40 students worked on each unit.
The verbalizations of the students are rated in a deductive mode of qualitative
content analysis. We use indicators to distinguish between students' considerations
that are related to the process of experimentation (e.g. safety issues,
measurement accuracy etc.) and considerations that are based on non-experimental
arguments (e.g. plausibility considerations). Table 2 shows examples
for the item "assemble and test the experimental setup" of the unit "Elongation of a
rubber band".
Table 2
Examples of experimental and non-experimental considerations in the unit
"Elongation of a rubber band"

Experimental considerations:
  "In order to measure accurately, the ruler has to be very close to the rubber
  band."
  "Ah, the elongation changes more than the weight I attach. This can't be
  proportional."

Non-experimental considerations:
  "For assembling the setup I am looking at which parts match together."
  "The last device doesn't fit in anywhere. The setup should be working now."

The inter-rater reliability (Cohen's kappa) of differentiating between experimental
and non-experimental considerations is at least satisfactory for the item "assemble
and test the experimental setup" (assembling experimental setup: 88 % agreement,
κ = .714; test experimental setup: 100 % agreement). The analysis of further units
and items is in progress.
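For readers unfamiliar with the statistic, the following minimal sketch shows how Cohen's kappa relates to raw percentage agreement for two raters who code verbalizations as experimental ("E") or non-experimental ("N"). The codings are hypothetical and do not reproduce the study's data; the function name is an illustrative assumption.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if p_e == 1:
        return math.nan  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codings of ten verbalizations by two raters.
rater_a = ["E"] * 6 + ["N"] * 4
rater_b = ["E"] * 5 + ["N"] * 5

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In this constructed example the raters agree on 9 of 10 items, yet kappa is lower than the raw agreement because half of that agreement would be expected by chance. The same correction explains why 88 % agreement can correspond to a kappa of .714.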

Results
The analysis of the cognitive processes for the item "assemble and test the
experimental setup" of the rubber band unit shows that most students predominantly
express experimental considerations (see Figure 3) while working on this item.
Further analyses will show whether this result can be confirmed for other items and
units.
[Figure 3 not reproduced: two pie charts, "assemble experimental setup" and
"test experimental setup", each divided into experimental and non-experimental
considerations; extracted values: 22%, 42%, 58%, 78%]
Figure 3: Percentage of students with experimental and non-experimental
considerations for the item "assemble and test experimental setup"

SUMMARY AND OUTLOOK


In our project the development of the test instrument and validation studies are
closely intertwined. Content analysis and expert panels have led to a set of
experimental tasks that are adequate challenges for secondary students. Around
these tasks, twelve test units with items for specific experimental skills have been
developed. The units are realized in an online test environment with embedded
simulations (mouse-on test). First empirical studies indicate that the test is
cognitively valid.
Besides further analyses of cognitive validity with consequences for test
improvement we will, as a next step, put a focus on studies of convergent
validity. Paper and pencil tests with hands-on experiments serve as benchmarks.
Structural validity will be researched on the basis of a large-scale data sampling in
2014.

REFERENCES
Emden, M. (2011). Prozessorientierte Leistungsmessung des
naturwissenschaftlich-experimentellen Arbeitens. Berlin: Logos.
Glug, I. (2009). Entwicklung und Validierung eines Multiple-Choice-Tests zur
Erfassung prozessbezogener naturwissenschaftlicher Grundbildung.
Christian-Albrechts-Universität zu Kiel.
Hammann, M., Phan, T. T. H., Ehmer, M. & Grimm, T. (2008). Assessing pupils'
skills in experimentation. Journal of Biological Education 42 (2), 66-72.
KMK. Sekretariat der Ständigen Konferenz der Kultusminister der Länder in der
Bundesrepublik Deutschland. (2005). Bildungsstandards im Fach Physik für
den Mittleren Schulabschluss. München: Luchterhand.
Nawrath, D., Maiseyenka, V., & Schecker, H. (2011). Experimentelle Kompetenz -
Ein Modell für die Unterrichtspraxis. Praxis der Naturwissenschaften - Physik in
der Schule, 60(6), 42-49.
NRC. National Research Council. (2012). A Framework for K-12 Science
Education: Practices, Crosscutting Concepts, and Core Ideas. Washington, DC:
The National Academies Press.
OECD (ed.) (2007). PISA 2006 - Schulleistungen im internationalen Vergleich:
Naturwissenschaftliche Kompetenzen für die Welt von morgen. Bielefeld:
Bertelsmann.
Schreiber, N. (2012). Diagnostik experimenteller Kompetenz. Validierung
technologiegestützter Testverfahren im Rahmen eines
Kompetenzstrukturmodells. Berlin: Logos.
Schreiber, N., Theyßen, H., & Schecker, H. (2012). Experimental Competencies in
science: a comparison of assessment tools. In C. Bruguière, A. Tiberghien, & P.
Clément (Eds.), E-Book Proceedings of the ESERA 2011 Conference: Science
learning and Citizenship. Part 10 (co-ed. R. Millar), 66-72. Lyon: European
Science Education Research Association.
Shavelson, R. J., Ruiz-Primo, M. A., & Wiley, E. W. (1999). Note on Sources of
Sampling Variability in Science Performance Assessments. Journal of
Educational Measurement, 36(1), 61-71.
Stecher, B. M., Klein, S. P., Solano-Flores, G., McCaffrey, D., Robyn, A., &
Shavelson, R. J. (2000). The effects of Content, Format and Inquiry Level on
Science Performance Assessment Scores. Applied Measurement in Education,
13(2), 139-160.
Tesch, M. (2005). Das Experiment im Physikunterricht - Didaktische Konzepte
und Ergebnisse einer Videostudie. Berlin: Logos.
Wilhelm, O. & Kunina, O. (2009). Pädagogisch-psychologische Diagnostik. In
Wild, E., & Möller, J. (Eds.), Pädagogische Psychologie. Heidelberg: Springer.

THE NOTION OF AUTHENTICITY ACCORDING TO
PISA: AN EMPIRICAL ANALYSIS
Laura Weiss and Andreas Müller
Université de Genève
Abstract: The notion of authenticity in the sense of "ensuring relevance to students'
interests and lives" (OECD, 2007, p. 36) is essential to PISA's understanding of
scientific literacy and its assessment. The same holds true for many classroom
interventions and projects in the broad framework of context-based science education
(CBSE; Bennett, Lubben & Hogarth, 2007). While there is no doubt about the factual
authenticity of PISA items (i.e. the existence of real-life links), the assumption that
they are relevant to pupils and perceived by them as authentic is more debatable. In
view of the large role this perception is supposed to play both for motivation and
cognitive activation of learners, it seems necessary to inquire about pupils' perception
of the authenticity of the PISA units.
This contribution reports on a study of this kind, complementing the question of
authenticity with other motivational variables important in PISA's framework
(science-related interests and self-beliefs). Teachers' perceptions were studied as well,
and compared to those of pupils, as possible differences between these perceptions are
important for classroom practice.
The motivational variables in question were studied on the basis of well-established
instruments, within a sample of 150 pupils of secondary level I (14-15 years) and
20 physics teaching experts (mostly physics teachers). Several covariates (such as
gender and general educational level) were taken into account, and results were
analyzed with ANOVA. Results show that pupils perceived authenticity and interest as
low, contrary to the basic assumption of PISA, and that there is a large gap (often by
factors > 1 on the questionnaire used) to the perceptions of teachers. Some possible
influences (item subjects, subject covariates) will be discussed, as well as some
implications of these findings for both practice and research.
Keywords: authenticity, context-based science education, motivation, PISA science
units, pupils, survey.

INTRODUCTION
On a regular three-year cycle, the PISA surveys, launched by the OECD in 2000, aim to
monitor the outcomes of education systems in terms of 15-year-old pupils'
achievements. The PISA project aims to implement educational goals in order to
prepare the young generation for adult life as responsible citizens. With this purpose,
the designers of PISA consider that their specific concept of literacy, namely "the
capacity of students to extrapolate from what they have learned and to apply their
knowledge in novel settings" (OECD, 2007, p. 3), is relevant not only for the basic
competences of reading and math, but also for a scientific literacy which is a necessity
in our scientific and technological society. This notion of scientific literacy is defined
as:
An individual's scientific knowledge and use of that knowledge to identify
questions, to acquire new knowledge, to explain scientific phenomena, and to
draw evidence based conclusions about science-related issues, understanding of
the characteristic features of science as a form of human knowledge and enquiry,
awareness of how science and technology shape our material, intellectual, and
cultural environments, and willingness to engage in science-related issues, and
with the ideas of science, as a reflective citizen (OECD, 2006, p. 8).

Scientific literacy should also contribute to developing and strengthening interest in
science and technology, in particular to counteract the widespread disaffection towards
these areas, a current problem in Western countries (Rocard et al., 2007). As this
problem is particularly pronounced for the physical sciences (Bøe, Henriksen, Lyons,
& Schreiner, 2011; Murphy & Whitelegg, 2006; Jenkins, 2006; Zwick & Renn,
2000), we focus on this domain in the following.
PISA's choice of units to evaluate the scientific knowledge of 15-year-old pupils had
to take account of the fact that science curricula can differ considerably between
countries, both in the school disciplines (integrated science,
biology, physics, chemistry, geology, astronomy, etc.) and in the topics covered (for
instance electricity, optics or motion in physical sciences). Furthermore, PISA wants
to focus less on the scientific knowledge of pupils than on their competences to
understand and solve scientific problems. Against this background, PISA opted for
questions chosen in areas of application of science such as Health, Environment, or
Technology, which give rise to debates in society and/or are connected with
recent technological progress, the consequences of which for society have to be
discussed.
A central issue for PISA is that these questions are authentic and motivating for young
people and it is this point of view that we analyze in this article, based on empirical
data of a survey with secondary level I pupils and teachers in Geneva. The
contribution is thus an extension of our preceding studies aimed to better understand
and qualify what PISA actually evaluates: a first paper on the comparison between the
PISA science units and the science curricula of French-speaking Switzerland
(Weiss, 2010) and a second one where the compatibility between Inquiry Based
Learning (IBL) and the PISA science survey was discussed (Weiss, submitted).
The paper is organized as follows: after giving a short theoretical background about
the notion of authenticity, we describe the PISA choices for its units and items to be
authentic. We then proceed with a description of the three released PISA units chosen
for our survey, of the sample, and the instruments of the study. Results from the
pupils' and teachers' samples about their perception of these PISA units will be
discussed and compared with each other and with another study on authentic
learning (Kuhn, 2010). Finally, several conclusions about classroom implications and
future research will be discussed.

THE NOTION OF AUTHENTICITY


A quite widespread, basic understanding of authentic learning (starting with the word
origin: Gr. authentikós "true"; Lat. authenticus "reliable") is that it should be related to
actual, real(istic), genuine contexts and experiences learners are supposed to
encounter. This point of view is also strongly advocated by PISA (OECD, 2006,
2007), as underlined e.g. by the in-depth analysis of Fensham (2009): "real world
contexts have [...] been a central feature of the OECD's PISA project for the
assessment of scientific literacy among young people". Moreover, this is also the
basic understanding of the variety of approaches addressed as context-based science
education (CBSE; Bennett et al., 2007).
PISA states two important points about the understanding of authentic contexts
(OECD, 2006): First, such problems, to be encountered in real-world settings
(factual authenticity), are usually not stated in the disciplinary terms to be learned
or applied. Thus, a work of translation with terminological and conceptual
reframing has to be carried out, representing a first step of cognitive activation.
Second, the disciplinary content involved is genuinely directed to solving the
problem, i.e. learners can perceive that there is a real-world problem for the solution
of which some content of science or math is necessary (problem authenticity),
instead of the problem being just an invented, artificial occasion to practice this
content. Moreover, the combination of these two features of authenticity is also
supposed to be closely linked to the science-related self-concept, as it should be
supported by the experience of actually being able to solve real-world problems using
the knowledge and competences one has acquired (OECD, 2006; Hattie, 2009).
Summing up, "conceptual translation" and "genuine content-problem link" in this
sense can justly be considered important components of scientific literacy, as PISA
does.
Moreover, beyond cognitive features, authentic contexts are supposed to foster
attitudinal and affective aspects, in particular interest in science. Fensham (2009)
states: "Real world contexts from the students' lives outside of school have the
potential to generate personal intrinsic interest, and their social or global significance
can add to this potential an extrinsic quality to this interest." CBSE in general
(Bennett et al., 2007) makes the same claim about the potential of linking science
education to pupils' lives. A quite comprehensive conceptualization was given by
Shaffer and Resnick (1999). They distinguish the following four aspects:
1. learning related to the real world outside school: factual authenticity,
2. learning personally meaningful to the learner: personal authenticity,
3. learning providing an opportunity to think in the modes of a particular
discipline: cognitive authenticity (which can be understood as genuine usage of
disciplinary content to solve a real-life question, i.e. the problem authenticity
already mentioned, or as entering or participating in the community of practice
of a certain discipline),
4. assessment in line with the learning process: assessment authenticity.
There is a considerable body of international literature on this subject, which is
beyond the scope of this contribution; this conceptual framework is nevertheless
useful for understanding PISA's conceptualization.

RATIONALE AND RESEARCH QUESTIONS


PISA designers did take the question of authenticity into account when
elaborating the items: PISA units were aligned with five broad areas of personal,
social and global settings in real life (OECD, 2006), with essential applications
of science such as health, environment or hazard (see OECD, 2007, p. 36), and not
with the more traditional division into the science disciplines taught in school
(biology, chemistry, physics). As a consequence, the PISA units are not limited to a
single specific discipline or topic, as in a traditional school assessment; rather, the
items of a given unit appeal to concepts connected to different sciences. Moreover, many
items do not even require specific science knowledge, because the information is given
in the introductory text of the unit; they require a scientific way of thinking, or
sometimes good reading skills. Thus, PISA units are not school exercises aiming at rote
learning or drill and practice, but try to address pupils' (scientific) thinking.
For PISA, this degree of authenticity is essential on both the motivational and the
cognitive level. But as the issue is about the motivation and cognition of learners, it is
their perception of authenticity which is the essential variable, and not that of
researchers. The questions this contribution deals with are thus: Do pupils
consider PISA science units to be linked to real life (reality connection, authenticity; RA)?
Do pupils find the answers to PISA science units interesting from a personal point of
view (intrinsic interest; IE)? Do they consider themselves as performing well in
science when they know the answers to these units (self-concept; SC)? And do lower
secondary teachers share the same perception of authenticity and interest of the PISA
units? In the following, we deal with the PISA units related to physical science topics,
where pupils' interest is notoriously hard to achieve (see e.g. Zwick & Renn, 2000).

MATERIALS AND METHODS


Sample and Procedure
For an empirical answer to the above questions, we have analyzed three publicly
available PISA units related to the physical sciences (OECD, 2007): Sunscreens,
Greenhouse and Clothes.

Pupil sample
Perception of these units by pupils was tested in a sample of fourteen 8th and 9th
grade classes in lower secondary schools in Geneva in June 2011. The collected data
concern 151 pupils (70 girls and 76 boys, 5 with gender not stated) from ten 9th grade
classes (118 pupils) and four 8th grade classes (33 pupils; see footnote 1), distributed
over ten classes of a high educational level (A level, 129 pupils) and four lower-achieving
classes (B level, 22 pupils). These classes belong to four lower secondary schools and are
taught by six physics teachers.

Teacher sample
A panel of 20 persons involved in secondary school teaching and/or in pre-service teacher
training was surveyed with the same questions as the pupils (see below). These
persons were two university teachers, six teacher trainers who themselves teach in
secondary school, and twelve young teachers at the end of their pre-service training
(already having their own classes). In the following, we refer to them as teachers.

Instruments
Motivational variables were assessed with an instrument well established in the
literature on science motivation (adapted from Hoffmann et al., 1997; total
Cronbach's α = 0.93) with the following subscales: intrinsic interest (IE: α = 0.89),
reality connection/authenticity (RA: α = 0.95) and self-efficacy/self-concept (SC:
α = 0.89); for details see Kuhn (2010). The instrument was translated into French and
adapted to the particular situation of a survey without actual teaching with the
PISA units. Pupils had to evaluate the authenticity and the interest of three PISA units
by reading them, without having to answer the items (although some pupils did
so). The questions were about the connection of the PISA units to out-of-school
life, the utility of solving them for our society, the pupils' desire to learn more and to
speak with friends about the question, and the pupils' perception that they would be
effective in learning physics through these questions. In this questionnaire, RA and IE
were each assessed through 7 items, SC through 10 items.
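For readers less familiar with the reliability coefficient quoted above, the following is a minimal sketch of how Cronbach's alpha can be computed from item-level answers. The function name and the small data set are hypothetical illustrations, not the analysis code or data of this study.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one list per respondent, holding one score per item."""
    n_items = len(responses[0])
    # Variance of each item across respondents
    item_vars = [pvariance([r[i] for r in responses]) for i in range(n_items)]
    # Variance of the respondents' total scores
    total_var = pvariance([sum(r) for r in responses])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical answers of four respondents to three items (not study data)
demo = [[1, 2, 1], [2, 3, 2], [3, 4, 3], [4, 5, 4]]
alpha = cronbach_alpha(demo)
```

In this constructed example the three items rank the respondents identically, so alpha reaches its maximum; real questionnaire data give lower values.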
A similar but shorter questionnaire, without SC questions, was prepared for teachers.
The teachers answered about five PISA units: the same three as the pupils and two
more, to verify whether the results could be generalized to other PISA units.
For the results given below, motivation test scores on each sub-dimension are given as
a percentage of the maximal possible value.
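The normalization to a percentage of the maximal possible value can be sketched as follows. The response scale is not stated in this contribution, so the 0-to-5 scale and the answers below are assumptions chosen only for illustration.

```python
def percent_of_max(item_scores, max_per_item):
    """Express a subscale sum as a percentage of its maximal possible value."""
    return 100.0 * sum(item_scores) / (len(item_scores) * max_per_item)

# Hypothetical: 7 RA items answered on an assumed 0-to-5 scale (not study data)
ra_percent = percent_of_max([2, 3, 1, 2, 4, 2, 3], max_per_item=5)
```

If the scale started at 1 instead of 0, the lowest attainable percentage would be above 0%, which should be kept in mind when reading such scores.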

RESULTS
Pupils' perceptions
The data show that pupils do not perceive the PISA units as realistic (RA < 50%) and
find them even less interesting (IE ≈ 40%); see Figure 1.

[Figure 1: bar chart "Pupils' perceptions" showing scores (0% to 100%) on RA, IE, SC
and Mot for Sunscreens, Greenhouse, Clothes and PISA (3 units).]
Figure 1. Pupils' perceptions about three PISA units. RA measures reality
connection/authenticity, IE the intrinsic interest and SC the self-concept. Mot is the
sum of the 3 dimensions.
These findings depend little on age and general educational level. However, the
perceptions of the girls are considerably lower than those of the boys, both for RA
(44% vs 53%) and IE (31% vs 47%). This result is consistent with the Swiss PISA
report (Zahner, Rossier & Holzer, 2007), which shows that in Switzerland
science competences are higher among boys. It is well known that competences in a
field are correlated first of all with self-concept and with interest in the field.

Teachers' perceptions
The teachers' perception of the motivational features of PISA items lies considerably
above that of pupils, as shown in Figure 2.

[Figure 2: bar chart "Teachers' perceptions" showing scores (0% to 100%) on RA, IE and
Mot (2 dim) for Sunscreens, Greenhouse, Clothes, PISA (3 units), Grand Canyon, Acid rain
and PISA (5 units).]
Figure 2. Teachers' perceptions about five PISA units (the three evaluated by
pupils plus Grand Canyon and Acid rain). RA measures reality
connection/authenticity, IE the intrinsic interest and Mot (2 dim) is the sum of the 2
dimensions.
The differences between teachers' and pupils' perceptions of the PISA units are all
statistically significant at the level p < 0.001 (apart from Clothes, where no significant
differences were found), and the effect sizes are large, as shown in Table 1.

Table 1
Effect sizes (Cohen's d) of the differences between teachers' and pupils' perceptions.
The significance level of all differences is p < .001.

Unit              RA      IE
Sunscreens        1.24    1.72
Greenhouse        1.47    1.87
PISA (3 units)    1.15    1.43
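As a reminder of what the effect sizes in Table 1 express, the following minimal sketch computes Cohen's d as the difference of two group means divided by the pooled standard deviation. Which variant of d was used for Table 1 is not stated here, so the pooled-variance form is an assumption, and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical percentage scores (not data from the study)
teachers = [70, 80, 90]
pupils = [40, 50, 60]
d = cohens_d(teachers, pupils)
```

By the usual convention, values of d above 0.8 are regarded as large, which is the sense in which the effect sizes in Table 1 are described as high.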

Looking more closely at these findings, teachers, like pupils, consider Greenhouse
the most authentic unit, but are much more critical than pupils about Clothes, although the
topic of this unit is nowadays an important field of investigation for helping
people with disabilities and one item of this unit is directly linked with the physics
curriculum (note that teachers' motivation for Clothes is less than 65% of their
motivation for Greenhouse, whereas for pupils this proportion is 95%). The
evaluation of the five units gives a score similar to the average of the first three,
meaning that testing three units gives a good evaluation of the PISA units linked with
physics. A further study will investigate teachers' assumptions about pupils'
perceptions, to better understand whether teachers are aware of this discrepancy between
their perceptions and those of pupils.

CONCLUSIONS AND IMPLICATIONS


The perception of interest and authenticity (in the sense stated above) of the physical
science items of PISA by pupils of secondary level I (the PISA target group) turns out
to be generally low, lower than would be expected given the large role that the
motivational and cognitive aspects of science problems are supposed to play within the
PISA framework (OECD, 2006, 2007; Fensham, 2009) and the great care put into the
development of its test items. In particular, pupils' perception of interest and
authenticity is considerably lower than the perception of teachers (of the same items),
and it is much lower than that which can be attained in actual CBSE teaching
approaches (Kuhn, 2010).
PISA's concern with integrating learning and motivation issues is widely shared, and its
findings on learning are an essential building block of current knowledge on
science education. However, the present study casts some doubt on how well PISA
actually succeeded in implementing its understanding of interest and authenticity in
its assessment items. Researchers interested in CBSE, and in particular those hypothesizing
benefits of some form of authentic tasks and learning, should thus not rely on their
own perception of authenticity, even if it is widespread, but assess the actual perception
of their target group.
A similar statement is true for classroom practice: teachers should be aware of their
tendency to overestimate pupils' perceptions of interest and authenticity, and if they
are, for good reasons, interested in developing or using some teaching approach
based on authenticity, they should assess the actual perception of their pupils. This
requires, of course, that they have a classroom-proof (i.e. short and reliable)
test at their disposal, and the present work offers (for the given understanding of
authenticity) such an instrument.

REFERENCES
Bennett, J., Lubben, F., & Hogarth, S. (2007). Bringing Science to Life: A Synthesis of
the Research Evidence on the Effects of Context-Based and STS Approaches to
Science Teaching. Science Education, 91(3), 347-370.
Bøe, M. V., Henriksen, E. K., Lyons, T., & Schreiner, C. (2011). Participation in
Science and Technology: Young people's achievement-related choices in late
modern societies. Studies in Science Education, 47(1), 37-72.
Fensham, P. J. (2009). Real world contexts in PISA science: Implications for
context-based science education. Journal of Research in Science Teaching, 46, 884-896.
Hattie, J. A. C. (2009). Visible Learning. A synthesis of over 800 meta-analyses relating
to achievement. London, New York: Routledge.
Hoffmann, L., Häußler, P., & Peters-Haft, S. (1997). An den Interessen von Mädchen
und Jungen orientierter Physikunterricht. Ergebnisse eines BLK-Modellversuches. Kiel: IPN.
Jenkins, E. W. (2006). The Student Voice and School Science Education. Studies in
Science Education, 42, 49-88.
Kuhn, J. (2010). Authentische Aufgaben im theoretischen Rahmen von Instruktions- und
Lehr-Lern-Forschung: Effektivität und Optimierung von Ankermedien für
eine neue Aufgabenkultur im Physikunterricht. Wiesbaden: Vieweg + Teubner.
Murphy, P., & Whitelegg, E. (2006). Girls in the physics classroom: a review of the
research on the participation of girls in physics. Institute of Physics, London,
UK. http://oro.open.ac.uk/6499/, accessed 20/05/2013.
OECD. (2006). Assessing scientific, reading and mathematical literacy: A framework
for PISA 2006. Paris: OECD.
OECD. (2007). PISA 2006: Science competencies for tomorrow's world. Volume 1:
Analysis. Paris: OECD.
Rocard, M., Csermely, P., Jorde, D., Lenzen, D., Walberg-Henriksson, H., & Hemmo,
V. (2007). Science education now: A Renewed Pedagogy for the Future of
Europe. Bruxelles: EU. Directorate-General for Research, Science, Economy and
Society.
Shaffer, D. W., & Resnick, M. (1999). "Thick" Authenticity: New Media and Authentic
Learning. Journal of Interactive Learning Research, 10(2), 195-215.
Weiss, L. (submitted). PISA-sciences est-il IBL-compatible? Recherches en
didactique.
Weiss, L. (2010). L'enseignement des sciences au secondaire obligatoire en Suisse
romande, au regard des enquêtes internationales sur la culture scientifique des
jeunes. Revue Suisse des Sciences de l'Education, 32(3), 393-420.
Zwick, M., & Renn, O. (2000). Die Attraktivität von technischen und
ingenieurwissenschaftlichen Fächern bei der Studien- und Berufswahl junger
Frauen und Männer. Stuttgart: Akademie für Technikfolgenabschätzung.
1 In Geneva secondary schools, physics is optional in 8th grade.

EXAMINING WHETHER SECONDARY SCHOOL
STUDENTS MAKE CHANGES SUGGESTED BY
EXPERT OR PEER ASSESSORS IN THE SCIENCE
WEB-PORTFOLIOS
Olia Tsivitanidou, Zacharias Zacharia and Tasos Hovardas
University of Cyprus, Department of Educational Sciences, Nicosia, Cyprus.
Abstract: In this study we aimed to examine whether either peer or expert feedback
led secondary school students to revise their science web-portfolios in any way. The
study was implemented in the context of reciprocal online peer assessment of
web-portfolios in a secondary school science course. Reciprocal peer assessment requires a
student to take on the roles of both assessor and assessee. The participants (28 seventh
graders) anonymously assessed each other's science web-portfolios on designing a
CO2-friendly house. Peer assessors and an expert assessor used the same pre-specified
assessment criteria to provide peer and expert feedback, respectively. Peer assessees
received feedback from two peers and the expert, and also had access to the feedback
they had produced for another peer when acting as peer assessors themselves. All
feedback produced focused on three different aspects of the web-portfolio, namely
science content, students' inquiry skills and the appearance/organization of the
web-portfolio. Peer assessees made revisions, including revisions that concerned the
science content, as they saw fit after reviewing the feedback received. Screen capture
and video data were analyzed qualitatively and quantitatively. The findings showed
that our assessee groups appear to have employed decision-making strategies to
screen and process peer and expert feedback, which involved triangulating peer and
expert feedback and adopting the suggested changes that overlapped, or triangulating
peer and expert feedback along with the feedback they produced themselves when
acting as peer assessors, and adopting the suggested changes that overlapped all three
types of feedback.
Keywords: reciprocal peer assessment, peer feedback, expert feedback, web-portfolios,
secondary school science

BACKGROUND AND FRAMEWORK


Recent developments in the field of assessment stress the importance of formative
approaches. The use of formative assessment practices in a classroom could
potentially enhance students' learning achievements. According to Black and Wiliam
(2009), five main types of activity comprise formative practices, and one of these is
peer assessment. An interesting innovation in this direction is therefore the active
involvement of students in assessing a peer's work (Van Gennip, Segers, &
Tillema, 2010). In peer assessment, students assess their fellow students' performance
by providing feedback, which could be quantitative and/or qualitative in nature
(Topping, 1998). Studies covering several subject domains have documented a
number of benefits that peer assessment could offer to a learner (Topping, 1998), but
despite these benefits, several researchers have emphasized that the enactment
of peer assessment is a rather complex undertaking (Sluijsmans, 2002), since it
requires students to use their assessment (Sluijsmans, 2002), cognitive and
metacognitive skills in order to review, clarify, and correct others' work (Ballantyne,
Hughes, & Mylonas, 2002). Additionally, few science education studies have focused
on peer assessment (Crane & Winterbottom, 2008), especially at the primary (Harlen,
2007) and secondary school levels (Tsivitanidou, Zacharia, & Hovardas, 2011).
Consequently, we do not have a thorough picture of what primary and secondary
school students can do in a peer assessment context, especially in terms of the
heuristics that secondary school students use when revising their science
web-portfolios based on the peer and expert feedback received. Such evidence is essential,
since peer assessment is gaining ground in participative inquiry-based science
learning environments, especially computer-supported inquiry learning environments
(e.g., de Jong et al., 2010; 2012).

Rationale and Purpose


In this study, secondary school students were involved in the two distinct phases of
reciprocal peer assessment (the peer assessor phase and the peer assessee phase). Peer
assessment in this study involved the use of pre-specified assessment criteria to rate
science web-portfolios and asked participants for written comments justifying their
ratings and suggesting possible changes for revision. Peer assessees were offered
feedback not only from their peer assessors but also from an expert assessor. The idea
was to investigate whether students tend to adopt changes suggested by an expert,
such as a science teacher, rather than those suggested by their peers. Hence, the purpose
of the study was to examine whether either peer or expert feedback led secondary school
students to revise their science web-portfolios in any way and, if so, to examine whether
any behavioral patterns existed among peer assessees.

METHOD
Participants
Participants were 28 seventh graders (14-year-olds) from two different
classes (Nclass1 = 14 and Nclass2 = 14) of a public school (Gymnasium) in Nicosia,
Cyprus. Participants were guaranteed anonymity and assured that the study would not
contribute to their final grade. All students had prior experience with reciprocal
peer assessment, since they had participated in reciprocal peer assessments of
web-portfolios whose content came from an environmental science context similar to (but
not the same as) the one involved in this study.

Material
Students studied web-based materials that were developed for the purposes of the SCY
(Science Created by You) project (de Jong et al., 2010) and concerned the construction
of CO2-friendly houses, namely, houses made with specific modifications during the
building and operation phases in order to produce lower CO2 emissions than
conventional houses. This learning material required students to create a number
of learner artifacts (e.g., concept maps, tables, text), which were included in the
students' web-portfolios. For peer-assessment purposes, students worked with
Stochasmos, a web-based learning platform that supports collaborative
learning in an inquiry-based environment (Kyza, Michael, & Constantinou, 2007). We
chose Stochasmos because the participants were already familiar with it and it had the
features (e.g., web-portfolio, synchronous communication through a chat tool)
necessary for the purpose of this study.

Procedure
Each student group used a computer and the web-based platform to access the
curriculum material, follow the activity sequence and complete the accompanying
tasks. Each of these tasks corresponded to the development of a learner product which
was included in the students' web-portfolio. Each home group created nine artifacts.
We chose a reciprocal peer assessment approach and employed an online and
anonymous peer assessment format. Participants worked in groups of two (home
groups) while developing learner artifacts (see Figure 1). However, they carried out the
role of peer assessor on an individual basis.
After all students had completed all tasks, peer assessors could access the web-portfolio
of the peer group they were to assess, which was randomly assigned to them. Each
web-portfolio (all the learner artifacts included in a science web-portfolio) was assessed by
two peers from the same home group, who worked on different computers (see Figure
1). Each web-portfolio was also assessed by an expert.

Figure 1. Peer and expert assessment procedure. Rhombuses represent learning
products; folders stand for web-portfolios. Students start the mission in pairs and do
all learning activities with their partner. During the peer review and assessment phase
they work alone, while during the revision phase they work in the same pairs as
before. The web-portfolios are assessed by two peers and an expert. The students in
home groups a and b take on the roles of both peer assessor and peer assessee.
Peer assessors and the expert used a rubric with pre-specified assessment criteria
regarding the (a) science content of the web-portfolio, (b) students' inquiry skills and (c)
appearance and organization of the web-portfolio. Assessors also rated student
performance on all criteria on a 3-point Likert scale (i.e., (1) unsatisfactory;
(2) moderately satisfactory; (3) (fully) satisfactory). Along with ratings, assessors
were instructed to provide written feedback (for each criterion separately) to assessee
groups, in which they were to explain the reasoning behind their ratings and provide
judgments and suggestions for revisions. On average, it took each peer assessor about
an hour to complete the assessment. After all peer and expert feedback was submitted,
the system sent it to the corresponding assessee group. Assessee groups were then
allowed time to review peer and expert feedback and make revisions, which in most cases
did not take more than an hour.

Data collection and analysis


The data collection process involved screen-captured and videotaped data. The data
analysis involved both qualitative and quantitative methods. First, we isolated the
episodes in which students enacted the assessee role and coded students' actions
throughout the assessee phases using open coding. Inter-rater reliability data were
collected for this coding process (Cohen's kappa = 0.90).
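For illustration, Cohen's kappa for two raters who assign one code to each episode can be computed as in the following minimal sketch. The function name and the two code sequences are hypothetical and do not reproduce the coding data of this study.

```python
from collections import Counter

def cohens_kappa(rater_1, rater_2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_1)
    # Proportion of episodes on which both raters chose the same code
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Agreement expected by chance from each rater's code frequencies
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two raters to ten episodes (not study data)
codes_1 = [1, 1, 2, 2, 7, 7, 11, 11, 6, 6]
codes_2 = [1, 1, 2, 2, 7, 7, 11, 11, 6, 2]
kappa = cohens_kappa(codes_1, codes_2)
```

Here the raters agree on nine of ten episodes; kappa is somewhat lower than the raw agreement of 0.9 because part of that agreement would be expected by chance.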
Second, we took the resulting codes and displayed them in time-line graphs (plotting
the time vs. the clusters of actions/practices), following the approach of Schoenfeld
(1989). The x axis of the graph showed time and the y axis showed students'
actions/responses when acting as either peer assessor or peer assessee. One graph was
produced for each peer assessor and one graph for each peer assessee group. We
compared the resulting graphs for each role separately, to identify similarities and
differences in the combinations of codes over time. The goal was to reveal
interrelationships of the codings over time and identify different patterns that might
indicate different profiles for peer assessors or peer assessees.
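A coded time-line of this kind can also be summarized numerically, for example as the total time spent on each action code. The following is a minimal sketch under the assumption that each coded episode is stored as a (start, end, code) triple in seconds; this representation and the sample episodes are hypothetical, not the format used in the study.

```python
from collections import defaultdict

def seconds_per_code(episodes):
    """episodes: list of (start_s, end_s, code) tuples from one coded recording."""
    totals = defaultdict(int)
    for start, end, code in episodes:
        totals[code] += end - start
    return dict(totals)

# Hypothetical coded episodes for one assessee group (not data from the study)
episodes = [(0, 120, 1), (120, 200, 2), (200, 260, 4), (260, 400, 7), (400, 430, 4)]
totals = seconds_per_code(episodes)
```

Such per-code totals are one possible basis for comparing groups, in addition to the visual comparison of the graphs.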

RESULTS
Analysis of the screen-captured data allowed for the identification of student behavioral
patterns (responses/actions of peer assessees) during the feedback review and revision
of web-portfolios phase. By comparing the time-line graphs of peer assessees' actions
we identified four different patterns/profiles (see Figure 2).
The first example presents peer assessees who studied both the expert and peer
feedback, but used the feedback that they had produced as peer assessors to filter it,
before making any change. The second example presents peer assessees who read
both the expert and peer feedback and considered the expert feedback to be more
valuable, but in the end did not make any changes. The third example presents peer
assessees who made changes while taking into consideration both the expert and peer
feedback. The fourth example presents peer assessees who quickly scanned both the
expert and peer feedback and then concentrated only on the expert feedback, which
they partially filtered through the use of the feedback they produced for others as peer
assessors before they proceeded with making changes to their learning products.

Figure 2. Four representative graphs of peer assessee actions over time. Graph 1 is for peer assessee group 3 (Konstantinos and Yannis),
Graph 2 is for group 7 (Tonia and Marcos), Graph 3 is for group 10 (Despina and Maria), and Graph 4 is for group 2 (Sotia and Nikos).
The y-axis represents peer assessee actions/responses and the x-axis gives the time in seconds. The codes for the y-axis are as follows: (1)
Reading expert feedback; (2) Reading peer feedback A; (3) Reading peer feedback B; (4) Reading own feedback A; (5) Reading own
feedback B; (6) Revisiting own learner products; (7) Revising own learner products; (8) Opening Stochasmos learning environment; (9)
Opening own web-portfolio; (10) Using software other than Stochasmos (e.g., MS Office); (11) Discussion between group members; (12)
Revisiting primary information resources from Stochasmos platform; (13) Browsing for information on the web.

Despite the differences among these four examples, there are similarities that lead to
important conclusions about the enactment of peer assessment:
(a) All assessee groups read both the peer and expert feedback, focusing on
negative/critical judgments. Peer assessees tended to skip the positive judgments or to
spend less time on reading them and focused primarily on the negative judgments.
(b) Many assessee groups tended to access the feedback they themselves had
produced as peer assessors after they went through expert and peer feedback and used
it as a point of reference. In most cases where assessee groups accessed their own
feedback, the changes that were actually adopted were related to the content of
science portfolios. In fact, half of the assessee groups accessed the feedback they
themselves had produced as peer assessors after they went through the expert and peer
feedback (e.g., graphs 1 and 4).
The analysis showed that the more time was devoted to reviewing own feedback, the
greater the number of changes eventually adopted by assessee groups (Kendall's tau-b
= 0.47; p < 0.05). This result reveals the crucial role of a hidden factor, namely
own feedback, whose effect was not known to us at the beginning of this study
because it had not been reported elsewhere in the literature of the domain. Surprisingly,
many students (50% of the peer assessee groups) were found to adopt changes,
including science content related changes, which overlapped all three types of
feedback. Also, comparison of the peer-suggested changes that were made with those
suggested by the expert revealed that students compared the peer and expert
feedback and made actual changes only when the suggested changes overlapped.
Finally, the number of changes eventually adopted by assessee groups amounted to
slightly more than one-fifth of those recommended (42 out of 186, 22.58%). Changes
adopted referred to either the science content (23 changes) or the
appearance/organization (19 changes) of web-portfolios.
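The rank correlation reported above can be illustrated with a minimal sketch of Kendall's tau-b, which counts concordant and discordant pairs and corrects for ties. The function name and the two short series (time on own feedback, number of changes adopted) are hypothetical and are not the data of this study.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: enters neither term
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

# Hypothetical values for five groups (not data from the study)
seconds_on_own_feedback = [0, 30, 60, 90, 120]
changes_adopted = [1, 0, 2, 3, 3]
tau = kendall_tau_b(seconds_on_own_feedback, changes_adopted)
```

A rank-based coefficient of this kind is a natural choice for small samples with tied counts, such as numbers of adopted changes per group.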

DISCUSSION AND IMPLICATIONS


Assessee groups seem to have employed two decision-making strategies to screen and
process peer and expert feedback. The first strategy involved the cross-checking of
peer and expert feedback, where overlap of negative/critical judgments between
expert and peer feedback was most likely to lead to changes. This strategy resembles
the triangulation strategy researchers follow for validation purposes, and it is
surprising that secondary school students could devise such a strategy on their own. Of
course, a reasonable question is why peer assessees felt the need for triangulation,
rather than fully adopting expert feedback. This is an issue that needs to be
further investigated.
The second strategy involved the use of the feedback that peer assessees themselves
had produced while acting as peer assessors (i.e., own feedback). Half of the peer
assessees used their own feedback to filter expert and peer feedback before adopting
any changes. Again, this resembles the triangulation strategy. It should be noted that
own feedback was accessed by peer assessees without their being instructed to do so.
Moreover, it should be noted that this is a strategy that was not identified in prior
research. Therefore, future research should further examine the effect of own
feedback, in particular why peer assessees are so confident about the scientific
accuracy of their feedback that they consider it to be a point of reference.

Finally, the use of own feedback has to be acknowledged as a distinctive
characteristic of reciprocal peer assessment because it interrelates the two distinct
roles students take on in reciprocal peer assessment (i.e., the roles of peer assessor and
peer assessee). Moreover, a question that is raised concerning the enactment of the
role of peer assessee by secondary school students is why they do not adopt all the
changes suggested by peers and experts; hence, there is a need to understand the
circumstances under which a peer assessee is reluctant to adopt suggested changes,
especially those coming from an expert.
A number of implications for practice and policy emerge from these findings.
Future research efforts should focus on understanding the mechanisms students
employ for filtering and using the peer or expert feedback received. For example, the
strategy of triangulation, which was already employed by some students on their own,
could be introduced to all students, while explicitly explaining to them its value for
evaluating the credibility of a source of feedback.
Needless to say, this is another domain in which further research must be encouraged.
Overall, there is a need to determine the factors that affect a peer assessee's decision to
accept or not to use information from the peer feedback received. The ultimate goal
should be to create conditions that enable students to learn through enacting
the roles of both peer assessor and peer assessee.

REFERENCES
Ballantyne, R., Hughes, K. & Mylonas, A. (2002). Developing procedures for
implementing peer assessment in large classes using an action research process.
Assessment & Evaluation in Higher Education, 27(5), 427-441.
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment.
Educational Assessment, Evaluation and Accountability (formerly: Journal of
Personnel Evaluation in Education), 21(1), 5-31.
Crane, L., & Winterbottom, M. (2008). Plants and photosynthesis: peer assessment to
help students learn. Journal of Biological Education, 42, 150-156.
De Jong, T., Van Joolingen, W., Giemza, A., Girault, I., Hoppe, U., Kindermann, J.,
Kluge, A., Lazonder, A., Vold, V., Weinberger, A., Weinbrenner, S.,
Wichmann, A., Anjewierden, A., Bodin, M., Bollen, L., dHam, C., Dolonen,
J., Engler, J., Geraedts, C., Grosskreutz, H., Hovardas, T., Julien, R., Lechner,
J., Ludvigsen, S., Matteman, Y., Meistadt5, ., Nss, B., Ney, M., Pedaste,
M., Perritano, A., Rinket, M., von Schlanbusch, H., Sarapuu, T., FSchulz, F.,
Sikken1, J., Slotta, J., Toussaint., J., Verkade, A., Wajeman, C., Wasson, B.,
Zacharia, Z., van der Zanden, M. (2010). Learning by creating and
exchanging objects: The SCY experience. British Journal of Educational
Technology, 41 (6), 909-921.
De Jong, T., Weinberger, A., Van Joolingen, W., Giemza, A., Girault, I., Hoppe, U.,
Kindermann, J., Kluge, A., Lazonder, A., Vold, V., Weinberger, A.,
Weinbrenner, S., Wichmann, A., Anjewierden, A., Bodin, M., Bollen, L.,
d'Ham, C., Dolonen, J., Engler, J., Geraedts, C., Grosskreutz, H., Hovardas,
T., Julien, R., Lechner, J., Ludvigsen, S., Matteman, Y., Meistadt, Ø., Næss,
B., Ney, M., Pedaste, M., Perritano, A., Rinket, M., von Schlanbusch, H.,
Sarapuu, T., Schulz, F., Sikken, J., Slotta, J., Toussaint, J., Verkade, A.,
Wajeman, C., Wasson, B., Zacharia, Z., van der Zanden, M. (2012). Using
scenarios to design complex technology-enhanced learning environments.
Education, Technology, Research & Development, 60, 883-901.
Gielen, S., Peeters, E., Dochy, F., Onghena, P., & Struyven, K. (2010). Improving the
effectiveness of peer feedback for learning. Learning and Instruction, 20, 304-315.
Harlen, W. (2007). Holding up a mirror to classroom practice. Primary Science
Review, 100, 29-31.
Kyza, E., Michael, G., & Constantinou, C. (2007). The rationale, design, and
implementation of a web-based inquiry learning environment. In C.
Constantinou, Z. C. Zacharia, & M. Papaevripidou (Eds.), Contemporary
perspectives on new technologies in science and education, Proceedings of the
Eighth International Conference on Computer Based Learning in Science (pp.
531-539). Crete, Greece: E-media.
Sluijsmans, D. M. A. (2002). Student involvement in assessment, the training of
peer-assessment skills. Interuniversity Centre for Educational Research.
Topping, K. (1998). Peer assessment between students in colleges and universities.
Review of Educational Research, 68, 249-276.
Tsivitanidou, O., Zacharia, Z. C., & Hovardas, A. (2011). High school students'
unmediated potential to assess peers: unstructured and reciprocal peer
assessment of web-portfolios in a science course. Learning and Instruction, 21,
506-519.
Van Gennip, N. A. E., Segers, M. S. R., & Tillema, H. H. (2010). Peer assessment as
a collaborative learning activity: the role of interpersonal variables and
conceptions. Learning and Instruction, 20, 280-290.

SOURCES OF DIFFICULTIES IN PISA SCIENCE ITEMS


Florence Le Hebel, Andrée Tiberghien and Pascale Montpied
University of Lyon 1, UMR ICAR, CNRS, Lyon 2 and ENS Lyon, France.
CNRS, UMR ICAR, CNRS, Lyon 2 and ENS Lyon, France.
Abstract: In France, PISA science 2006 showed an overall score around the average and a
rather high proportion (compared to the OECD average) of students at or below level 1,
meaning that these students are not able to use scientific knowledge to understand and do
PISA's easiest tasks. The aim of our present work is to understand the level of difficulty
indicated by students' scores for the different PISA science 2006 items. Our hypothesis is
that different factors, in particular the cognitive demand required to provide a right
answer, the familiarity or not of the item context, the vocabulary difficulties and the
possible answering strategies, influence the difficulty or easiness of an item. We
collected audiotaped and/or videotaped data from 15-year-old students from grade 9 in
middle secondary schools and grade 10 in an upper secondary school in order to
analyse students' oral and written productions when they answered PISA questions which
we had previously selected. Our analysis shows the variety of students' possible
difficulties in answering questions and demonstrates that drawing a simple relationship
between a wrong answer and a lack of competency creates a risk of misinterpreting
students' behaviors and/or knowledge at the individual level and at the large-scale level.
Keywords: scientific literacy, PISA, evaluation, sources of difficulty, complexity.

BACKGROUND, FRAMEWORK AND PURPOSE


The Program for International Student Assessment (PISA; http://www.pisa.oecd.org) was
launched by the Organization for Economic Cooperation and Development (OECD) in
1997 to assess to what degree 15-year-old students near the end of compulsory education
have acquired essential knowledge and skills for full participation in society. In all PISA
cycles, the domains of reading, mathematical and scientific literacy are covered in terms
of the knowledge and skills needed in adult life. Five assessments have so far been
carried out. PISA results have led to the development of research projects on secondary
analysis of PISA results (see Olsen and Lie, 2006; Bybee et al., 2009 for an overview). As
scientific literacy was the major domain in 2006, our study is based on PISA science
2006 data. In France, PISA science 2006 showed overall average scores and a rather
high proportion (compared to the OECD average) of students in difficulty at or below
level 1 (meaning that these students are not able to use scientific knowledge to understand
and do the easiest PISA tasks) (OECD, 2007). The French Ministry of Education, in
agreement with the OECD, has allowed us to use PISA 2006 results. The aim of our
present work is to investigate what makes PISA science items difficult or easy in order to
interpret the different scores obtained by the students.

RATIONALE
To know more precisely what is effectively assessed, we need to understand what makes
a test question difficult. Indeed, if we are not able to understand why some questions are
more difficult than others, it means that we do not really know what we are measuring.
Little research shows a concern for construct validity in examination questions, i.e. that a
question measures what it claims to measure (e.g. Ahmed and Pollitt, 1999). Pollitt et al.
(1985) identified sources of difficulty (SODs) and sources of easiness (SOEs) through
empirical work on Scottish examinations in five subjects (Mathematics, Geography,
Science, English and French). Scripts from examinations were analyzed statistically in
order to identify the most difficult questions. The students' answers to these questions
were then analyzed with the aim of discovering the most common errors made when
answering these questions. From these errors, the authors hypothesized that there were
certain sources of difficulty (SODs) and of easiness (SOEs) in the questions. They
proposed three different categories of SODs: the concept difficulty, which is the intrinsic
difficulty of the concept itself; the process difficulty, meaning the difficulty of cognitive
operations and demands made on a candidate's cognitive resources; and the question
difficulty, which may be rooted in the language of questions, the presentation of
questions, etc.
In order to verify whether the hypothesized SODs affect students' performances,
questions from a mathematics examination (GCSE, General Certificate of Secondary
Education) were manipulated and rewritten in order to remove some specific SODs (e.g.
Fisher-Hoch and Hughes, 1996; Fisher-Hoch et al., 1997). Results show that
performance can be influenced quite significantly by small variations in the
questions that remove or add a source of difficulty (context of question, ambiguous
resources, etc.). The authors propose an analysis of the sources of difficulty in exam questions
that would enable us to develop questions of higher construct validity and effectively
target different levels of difficulty. Ahmed and Pollitt (1999) investigated the issue of
what makes questions demanding and developed a scale of cognitive demands in four
dimensions: complexity of the question (number of operations that have to be carried
out), abstraction (to what extent the student has to deal with ideas rather than concrete
objects or events), resources (text, diagram, picture, etc.) and strategy. They show that
questions having the higher cognitive demand in terms of the four dimensions on the
scale tend to have more SODs occurring at different stages of the answering process.
Webb (1997) developed the Depth of Knowledge (DOK) model to analyze the cognitive
expectation demanded by assessment tasks. Webb's DOK levels for Science (Webb,
1997; Hess, 2005) form a scale of cognitive demand that reflects the cognitive complexity of
the question. The DOK level assigned should reflect the complexity of the cognitive
processes demanded by the task outlined by the objective. Ultimately, the DOK level
describes the kind of thinking required by a task, not necessarily whether or not the task
is difficult. The DOK, or the cognitive demand of what students are expected to be able
to do, is related to the number and strength of the connections within and between ideas.
The DOK required by an assessment is related to the number of connections of concepts
or ideas a student needs to make in order to produce a response, the level of reasoning,
and the use of other self-monitoring processes. It should reflect the level of work students
are most commonly required to perform in order for the response to be deemed
acceptable. DOK levels name four different ways students interact with content. Each
level is dependent upon how deeply students understand the content in order to respond.
As mentioned above, the Webb levels do not necessarily indicate degree of difficulty, in
that DOK Level 1 can ask students to recall or restate either a simple or a much more
complex concept or procedure. Recall of a well-known concept will correspond to a low
degree of difficulty (high student scores), whereas recall of a concept that is not known
by a majority of students will lead to a high degree of difficulty of the item (low student
scores). However, in both cases, the cognitive demand or DOK Level is 1. Conversely,
understanding a concept in depth is required to be able to explain how/why a concept
works (Level 2), apply it to real-world phenomena with justification/supporting evidence
(Level 3), or to integrate one concept with other concepts or perspectives (Level 4).
Pollitt and Ahmed (2000) investigate how context affects students' processes of
answering an examination science question. The analysis of contextualized science
questions shows how school students can be misled and biased by the improper use of
context. The concept of difficulty used in the studies presented above, and in ours, is
embedded in the interactions between the student and the item itself. Therefore it is not
possible to attribute the source of difficulty either to the item or to the student. Indeed, in
our a priori analysis of the item, we engage the representation of the students interacting
with the items in order to anticipate the students' difficulties. Consequently, in this paper,
the students' difficulties and the items' difficulties will not be systematically differentiated.
On the other hand, the level of difficulty defined by PISA, based on statistical analysis of
students' scores, is clearly different and will always be called the PISA level of difficulty.
Based on these studies, our hypothesis is that different factors influence the difficulty or
easiness of an item. The different factors that we select in our analysis are:

the cognitive demand (or cognitive complexity) based on Webb DOK levels,
required to produce a right answer for a PISA item that we will have previously
determined in an a priori analysis;

the difficulties of the vocabulary found in the text of the unit, determined in an a
priori analysis and confirmed by the observation of students in situation. Note
that the PISA main assessment consists of a series of units; each unit has an
introduction with a text and possibly photos, drawings, or diagrams presenting a
situation, followed by a series of questions called items (an example of an
introduction and one item is given in figure 1).

the context of the item, that is the distance between the context of the item from
which the student should extract and apply information and the context in which
they have probably already made use of this information. We will determine it in
an a priori analysis and confirm by the observation of students in situation.

the question format (open answer, multiple choice, complex multiple choice)
determined in the a priori analysis. Moreover, in a previous study (authors, 2012)
concerning the effective competencies involved in PISA items, we observed that
some PISA items may allow students to use answering
strategies that do not require understanding the item but that can, in some cases, lead
to the right answer. This creates a risk of misinterpreting the PISA level
of difficulty. Therefore, we take into account the answering strategies that can be
used by students to answer PISA items.


Consequently, our study is based on two types of analyses: an a priori analysis of PISA
items and the analysis of students' actions in a situation where they answer PISA items.
Therefore we chose a case study methodology as presented below. This analysis should
lead us to better interpret the students' scores according to the types of difficulty of each
question.

METHODS
For this study, we proceed in two steps: first, an a priori analysis, and second, an
analysis of students' processes while answering PISA items.

The a priori analyses of PISA units


The data used for these analyses are based on all PISA science 2006 items. The French
Ministry of Education (Department of Evaluation DEPP) in agreement with the OECD
has allowed us to use PISA 2006 results in France and in other OECD countries. We
carried out four a priori analyses of the PISA units; they involve:

analyzing each unit according to several criteria to select a set of units and items
to test;

characterizing the cognitive demand required to produce a right answer for each
selected item;

characterizing the potential familiarity of the item context;

characterizing the difficulties due to the vocabulary in the text for each selected
PISA item.

To select relevant units for studying French students' processes, we base our a priori
analysis of all the units on the following criteria:

the potential familiarity of the item context;

the diversity of scientific knowledge required for the item (knowledge of science
or about science included in the school curriculum or not, daily knowledge, etc.)

the different content areas tested in the item (Physical systems, Living systems,
Earth and space systems, Technology systems, Scientific Inquiry, etc.)

the usefulness or not of a unit introduction to answer the items, the item format
(multiple choice, open, etc.), and the competency evaluated by PISA for the item

the scores obtained for each item in France compared to OECD average scores.

To characterize the cognitive demand, we use the four levels of cognitive complexity
proposed by Webb (1997). Level 1 (Recall and Reproduction) requires recall of
information, such as a fact, definition, term, or performance of a simple process or
procedure. Level 2 (Skills and Concepts) includes the engagement of some mental
processing beyond recalling or reproducing a response. This level includes the items
requiring students to make some decisions as how to approach the question or problem.
These actions imply more than one cognitive process. Level 3 (Strategic Thinking)
requires deep understanding as exhibited through planning, using evidence, and more
demanding cognitive reasoning. The cognitive demands at level 3 are complex and
abstract. An assessment item that has more than one possible answer and requires
students to justify the response they give would most likely be at level 3. At level 4
(Extended Thinking) students are expected to make connections, relate ideas within the
content or among content areas, and have to select or devise one approach among many
alternatives on how the situation can be solved.
To characterize the vocabulary difficulties in the text of PISA items, in the a priori
analysis we first note all the words with which students may have difficulties; second,
while analyzing students answering the PISA items, we check whether they actually
encounter difficulties with these words.
The second step consists of analyzing the answering processes from students' oral and
written data when they construct their answer and/or during the following interview.
The data used for these analyses consist of the video and audio recordings of 21 students
(9 pairs of students and 3 answering individually), 15 years old (age of PISA evaluation),
who differ in their academic level (8 high achievers and 13 low achievers).
During the interview, we asked the students to make explicit the thinking process by
which they answered the questions. We used the explicitation interview (Vermersch &
Maurel, 1997). To analyze the videotapes, we used the software Transana
(http://www.transana.org).
All 21 students had the same 23 items from 10 PISA units in their questionnaire.
First, the a priori analyses of questions and the case studies on students allow us to
confirm or not the sources of difficulty or easiness that we proposed. Then we use these
confirmed sources to interpret the students' scores in France and the differences observed
between French and OECD countries' scores.

RESULTS
After comparing the sources of difficulty or easiness observed when students answered
the 30 selected PISA science 2006 items with the factors we hypothesized, we propose a
classification of these PISA items according to the main factors that can play a
role in students' answers.
Our observations confirm the sources of difficulty or easiness in PISA items that we
proposed: cognitive complexity, familiarity/unfamiliarity of the item context, item
format, vocabulary. Thus, these SODs, some of which are described in the literature on
student assessment related to curricula, were also observed for PISA items, even if the
importance of the different SODs might differ between the two types of evaluation. The
role and the importance of the answering strategy already found in our previous study
(authors, 2012) are confirmed, and moreover the role of students' knowledge is highlighted.
Table 1 shows the results obtained for the 30 selected PISA items. For each PISA item,
we report the Webb's DOK level for science determined in our a priori analysis. In the
second column, we indicate with two levels (familiar/unfamiliar) our appreciation of the
distance between the context of the item from which the student should extract and apply
information and the context in which they have probably already made use of this
information. The appreciation of the context is highly culture-dependent and can differ from
one country to another. However, whether a topic is taught in the curriculum does not
determine whether the item context is familiar to the student. For instance, a PISA unit is about
genetically modified crops, which are not in the curriculum in France for a 15-year-old
student. Nevertheless, genetically modified crops are a controversial and highly publicized
topic in France, and thus the context of the situation on which this item is based is
familiar for French students. Conversely, the fact that a topic is taught in the French
curriculum does not necessarily imply that the context of an item connected to this topic
will be familiar for the student. For instance, an item is about evaporation, which is
taught in grade 7 in France. But the item situation (how to transform salt water into
drinkable water) is unusual for French students and makes the item context unfamiliar for
them.
The third column indicates the question format, and the fourth column describes the
familiarity of the item's vocabulary. Then, for each selected item, we report the item's PISA
level of difficulty according to the OECD, the OECD and French scores, and finally the
number of right, wrong and no answers obtained in our sample. We classify the items into 6
groups according to the factors that we consider the most relevant in explaining
the items' different PISA levels of difficulty given by students' scores: the level of
complexity, the answering strategy, the item context unfamiliarity or familiarity.

Items with high level of complexity and high PISA level of difficulty for students:

The PISA level of difficulty is reflected by the students' low scores. The level of
complexity of the items is enough to explain the difficulty.

Items with high or medium level of complexity which can be solved with
answering strategies (matching, association) without understanding the text.

The level of complexity can be as high as in the first group, but the scores are better.
Indeed, we observed that, for these items, students can use answering strategies leading
to the right answer, even if they do not understand the text and the aim of the item or the
experiment. Two major answering strategies appear, in particular with low achievers:
- matching the words: students search for some wording consistency between the item
words and words in the leading text of the unit or in the introduction of the item.
- association of action verbs: for instance, for one item, students associate the words
"cooling" and "reduce" (as opposed to "increase") and finally choose the right answer.

Items with low or medium level of complexity (requiring only an element of
knowledge or a simple procedure)
Item contexts are rather familiar and students have the knowledge. For
these items, students' scores are good. The level of complexity is very
low, requiring a simple procedure or knowledge in a familiar context, even
if the vocabulary may be difficult.

Item contexts are familiar but students do not have the knowledge.
The level of complexity is low, but the students show a lack of
knowledge.
Item contexts are unfamiliar, students do have the knowledge. The level of
complexity is low, but even if the knowledge is common, students are not able
to mobilize it in a context other than the one they are used to.
Items permit answering strategies (such as matching or association) used by
students that can lead to the wrong answers.
The level of complexity is low, but students, in particular low achievers, use answering
strategies such as matching words or associating action verbs, leading to the wrong
answers.
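The grouping above can be read as a decision rule. The following Python sketch is only an illustration under stated assumptions: the field names, the DOK thresholds standing in for "low", "medium" and "high" complexity, and the order in which the rules are checked are hypothetical choices made for this example and are not specified in the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodedItem:
    """Hypothetical record of the codes assigned to one PISA item."""
    dok_level: int                 # Webb DOK level (1 to 4) from the a priori analysis
    familiar_context: bool         # item context judged familiar to the students
    students_have_knowledge: bool  # observed students hold the required knowledge
    strategy_outcome: Optional[str] = None  # None, "right" or "wrong"


def classify(item: CodedItem) -> Optional[str]:
    """Return one of the six groups, or None if no group applies.

    Assumption: DOK 1-2 counts as low/medium complexity, DOK 3-4 as high.
    """
    if item.strategy_outcome == "right" and item.dok_level >= 2:
        return "high/medium complexity, solvable by answering strategy"
    if item.strategy_outcome == "wrong" and item.dok_level <= 2:
        return "low complexity, answering strategy leads to wrong answer"
    if item.dok_level >= 3:
        return "high complexity, high PISA level of difficulty"
    if item.familiar_context and item.students_have_knowledge:
        return "low complexity, familiar context, knowledge available"
    if item.familiar_context:
        return "low complexity, familiar context, knowledge lacking"
    if item.students_have_knowledge:
        return "low complexity, unfamiliar context, knowledge available"
    return None  # combination not covered by the six groups


# Example: a low-DOK item in an unfamiliar context where students have the knowledge.
print(classify(CodedItem(dok_level=1, familiar_context=False,
                         students_have_knowledge=True)))
```

Making the rule explicit in this way mainly shows where the groups overlap (for instance on "medium" complexity), which is where a coder's judgment is still required.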
First, we observe that higher complexity items correspond to items with the highest PISA
levels of difficulty (from 1 to 6, given by PISA according to students' scores). Nevertheless,
our results reveal that factors other than complexity can influence the difficulty or
easiness of the item. Indeed, the item format (open answer, multiple choice, complex
multiple choice) appears to have an impact as well on the difficulty of an item. For
instance, in table 1, we can observe that all items requiring an open answer display the
highest PISA levels of difficulty (between 3 and 5). We made the same observation at the
scale of the whole PISA science 2006 evaluation. The context of the item from which the
student should extract and apply information appears to influence the item difficulty as
well. This is particularly noticeable for items coded S408Q03, S304Q03a, S304Q03b,
S268Q06, and S447Q04, which have a low DOK level (from 1 to 2) but show a high
PISA level of difficulty. The reason for this difficulty is very likely the unfamiliar
contexts. Our analysis shows that students have the knowledge required in these items but
that the knowledge is difficult to mobilize in this context. Our analysis also reveals that the
answering strategies available to students can influence the PISA level of
difficulty. These answering strategies, such as matching words between the item
introduction and the different propositions in the case of a multiple choice item, or
associating action verbs (for instance "cooling" and "reducing" in an item), are mostly used by
low achievers without understanding and representing the aim of the question (Authors,
2012). These answering strategies can lead to the right answer, even if the item displays a
high complexity (for instance S476Q03 or S268Q01), but they can also lead to the wrong
answers (for instance S213Q01). Consequently, they should be taken into account in the a
priori analysis of the PISA level of difficulty of a question, in particular common
answering strategies such as matching or associating action verbs.
Moreover, from our observations while students are solving PISA items, it appears that
the familiarity of vocabulary influences the understanding and the solving of
the item. But we cannot draw clear conclusions from our classification about a supposed
correlation between unfamiliarity of vocabulary and the items' PISA levels of difficulty.
Our results show clearly that, compared to the OECD average, French students have the
same level of difficulty in solving high cognitive complexity items. In contrast, they
obtain lower scores for some rather easy items requiring knowledge that they apparently
do not have.

Furthermore, we observe a link between high complexity PISA items and a high level of
difficulty, whereas low complexity PISA items display a large range of levels of
difficulty. We conclude that in PISA evaluation, it appears possible to anticipate a high
level of difficulty for an item displaying a high complexity. In the case of a low DOK
level item, the complexity level is not a sufficient factor to predict item difficulty.
Factors other than complexity have to be taken into account while creating items, in order to
predict the difficulties they may present and to specify what is actually assessed. In particular, if
we want to evaluate low achievers and understand their actual level, some new PISA
items with a low level of difficulty need to be added.
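One way to inspect the pattern described above (a narrow range of PISA difficulty levels for high complexity items, a wide range for low complexity items) is to tabulate the minimum and maximum PISA level of difficulty per DOK level. The Python sketch below is a hypothetical illustration: the function name is invented and the (DOK level, PISA level) pairs are made-up values, not data from Table 1.

```python
from collections import defaultdict


def difficulty_range_by_dok(items):
    """Map each DOK level to the (min, max) PISA difficulty level observed for it."""
    levels = defaultdict(list)
    for dok_level, pisa_level in items:
        levels[dok_level].append(pisa_level)
    return {dok: (min(vals), max(vals)) for dok, vals in sorted(levels.items())}


# Made-up (DOK level, PISA level of difficulty) pairs, for illustration only.
example = [(1, 1), (1, 4), (2, 2), (2, 5), (3, 4), (3, 5)]
print(difficulty_range_by_dok(example))
# prints {1: (1, 4), 2: (2, 5), 3: (4, 5)}
```

With real item codes, a wide range at DOK levels 1 and 2 would point to the other factors discussed here (context, format, vocabulary, answering strategies).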

IMPLICATIONS
Our study shows that factors other than the complexity of a PISA item can influence the
PISA level of difficulty or easiness and thus the students' scores. This analysis shows the
variety of students' possible difficulties in answering questions: item context, question
format, and vocabulary. It shows as well that possible answering strategies can influence
the PISA level of difficulty. Our study shows that drawing a simple relationship between a
wrong answer and a lack of competency creates the risk of misinterpreting students'
competency. It appears that French students show lower scores compared to the OECD
average for some items requiring knowledge, whereas no score difference is observed for
high complexity items. Furthermore, we observe a link between high complexity PISA
items and a high PISA level of difficulty, whereas low complexity PISA items display a
large range of PISA levels of difficulty. We conclude that in PISA evaluation, it appears
possible to anticipate a high PISA level of difficulty for an item displaying a high
complexity. In the case of a low DOK level item, the complexity level is not a
sufficient factor to predict item difficulty. Factors other than complexity have to be taken
into account while creating items, in order to predict the difficulties they may present and to
specify what is actually assessed. In particular, if we want to evaluate low achievers and
understand their actual level, some new PISA items with a low level of difficulty need to
be added. Our results could be of interest for policy-makers, large-scale assessment
developers and also particularly for teachers reflecting on science
evaluation carried out in class. They could alert teachers to the variety of unsuspected
difficulties that their students might have in their own assessments.

REFERENCES
Ahmed, A. & Pollitt, A. (1999). Curriculum demands and question difficulty. Paper
presented at the Annual Conference of the International Association for Educational
Assessment. Slovenia. http://www.iaea.info/documents/paper_1162a1d9f3.pdf
Authors (2014). Which effective competencies do students use in PISA assessment of
scientific literacy? In C. Bruguière, A. Tiberghien & P. Clément (Eds.), ESERA
2011 Selected Contributions. Topics and trends in current science education.
Springer.
Bybee, R., Fensham, P., & Laurie, R. (2009). Scientific Literacy and Contexts in PISA
2006 Science. Journal of Research in Science Teaching, 46, 862-864.
Fisher-Hoch, H. & Hughes, S. (1996). What makes mathematics exam questions difficult?
Paper presented at the British Educational Research Association Annual
Conference, University of Lancaster.
http://www.leeds.ac.uk/educol/documents/000000050.htm
Fisher-Hoch, H., Hughes, S. & Bramley, T. (1997). What makes GCSE examination
questions difficult? Outcomes of manipulating difficulty of GCSE questions. Paper
presented at the British Educational Research Association Annual Conference,
University of York. http://www.leeds.ac.uk/educol/documents/000000338.htm
Hess, K. (2005). Applying Webb's Depth-of-Knowledge (DOK) Levels in Science.
http://www.nciea.org
OECD (2007). PISA 2006: Technical Report. Paris: OECD Publications.
Olsen, R. & Lie, S. (2010). Profiles of Students' Interest in Science Issues around the
World: Analysis of data from PISA 2006. International Journal of Science
Education, 33, 97-120.
Pollitt, A. & Ahmed, A. (2000). Comprehension Failures in Educational Assessment.
Paper presented at the European Conference on Educational Research, Edinburgh.
http://www.cambridgeassessment.org.uk/ca/digitalAssets/113787_Comprehension_Failures_in_Educational_Assesment.pdf
Vermersch, P. & Maurel, M. (1997). Pratique de l'entretien d'explicitation. Paris:
Editions ES.
Webb, N. (1997). Criteria for Alignment of Expectations and Assessments on
Mathematics and Science Education. Research Monograph No. 6. Washington,
D.C.: CSSO.

IN-CONTEXT ITEMS IN A NATION WIDE EXAMINATION:
WHICH KNOWLEDGE AND SKILLS ARE ACTUALLY ASSESSED?

Nelio Bizzo (1), Ana Maria Santos Gouw (1), Paulo Sérgio Garcia (1), Paulo Henrique Nico Monteiro (1) and
Luiz Caldeira Brant de Tolentino Neto (2)
(1) Faculty of Education and EDEVO Research Nucleus, São Paulo University, Brazil
(2) Santa Maria Federal University, Brazil
Abstract: Over seven million Brazilian youngsters enrolled for a National Test (ENEM 2013) aimed
at those who wish to obtain one of the 170,000 places available in free public universities (Jan
2014). ENEM presents 180 multiple choice questions, each set in a context with supposedly relevant
information which should be applied, relying on little or no previous knowledge. It was originally
presented as a new possibility for poor students to find a path leading to good quality universities.
We prepared two instruments based on real ENEM questions focusing on the same subject matter
(biology), which were presented to two randomized groups of high school students. One group
received full length questions (n=1,631), and the other received the same questions (n=1,400) with
abridged context information, but with the same stem and options. Performance analysis not only
showed no statistically significant advantage for students who answered full length questions, but
also showed that students' performance was significantly higher in three questions with abridged
information. We conclude that students' performance may rely more heavily on reading and
time management skills than on previous knowledge or mental skills. Democratization of
university access, if any, may be due to the novelty of the test.
Keywords: students' assessment; intelligence assessment; ENEM; time management skills; reading
skills.

INTRODUCTION
Since 1998, the Brazilian Ministry of Education (MEC) has organized a national test for students at
the end of high school (ENEM). In 2009 major changes were introduced, which attracted a great
number of candidates, not only those actually in the last year of high school. Everyone who aspires
to a university degree seems to have been encouraged to take the test, given the reward introduced
for a good score: a place in a free public university. Students are challenged to achieve the highest
possible mark, which enables them to apply for a place through a computerized system (SISU)
provided by MEC, which compares students' ENEM scores and assigns seats in public universities
all over the country. In 2013 over seven million students enrolled in ENEM, competing for places in
free public universities. In addition to SISU, students can compete for over 170,000 scholarships in
private universities (PROUNI), which can cover up to 100% of the tuition fees, subject to conditions
related to students' socioeconomic status. According to official MEC information, about 110,000
students enrolled in the first edition of the national test in 1998; no one could have believed that
seven million people would enroll in the same test fifteen years later (2013 exam), competing for
about 170,000 places in public universities throughout Brazil (January 2014).
ENEM is known for avoiding traditional questions, which rely heavily on the recollection of factual
knowledge. Since its launch, it has been presented as a new strategy to assess students'
competencies directly, defined by an official document as "structural modalities of intelligence"
(Franco and Bonamino, 1999:29). The new test was warmly welcomed by the Brazilian press and
broadly marketed in grey literature, which is difficult to cite. Apparently it was taken as a strategy not
only for a new assessment-based educational reform, but also for social reform, as it would help
poor students to pursue a path to higher education and, in addition, was aimed explicitly at reaching
the job market. MEC presented the test as an opportunity for youngsters to plan their futures with
a clear idea of their personal and professional potential, as the test would allow them to assess that
potential in order to plan future choices (Zákia and Oliveira, 2003: 884). Even today, the "Novo
ENEM" (New ENEM) is officially presented by MEC as a tool for democratizing access to
public institutions of higher education, which are free, for promoting academic mobility and for
inducing changes in high school curricula (MEC, 2013).
ENEM was originally based on five competencies and 21 abilities, aiming at an
interdisciplinary approach, with no mention of specific school disciplines or subjects. The major
reform of 2009 created the "Novo ENEM" (New ENEM): the number of competencies and abilities
under assessment increased considerably, references to conceptual disciplinary knowledge were
introduced, and the total number of questions rose dramatically. The original 63 multiple-choice
questions (plus an optional written composition), answered in one afternoon, became 180 questions
(plus a compulsory written composition) spread over two days, with a tight time schedule that
allows three minutes per item. Items are treated as unidimensional, as Item Response Theory (IRT)
is now applied to establish final scores. However, the major features of item construction seem to
be essentially the same: some visual and written context is given, followed by a stem and five
options. Recollection of facts and concepts should rarely be necessary, at least in the form of
conceptual definitions; the essential information for finding the right option is supposedly part of
the context given.
Previous research carried out with PISA items, which are also based on a stimulus which tells a
story to which the test items relate more or less directly, categorized items according to their level
of contextualization (Nentwig et al., 2009). Items with a high level of contextualization had
stimulus content which was essential for information extraction and processing, whereas items with
a low level of contextualization carried a stimulus which was not essential for answering the
question. In that study both stimulus Content and Relevance were taken into consideration on a
threefold scale: items could have substantial information which was relevant for item solution
(score 2), could have some text or information that was not relevant for the solution (score 1), or
could carry little or no information as stimulus (score 0).
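The threefold scale can be written down as a small decision rule. The sketch below is only an illustration: the function name and the two boolean judgements are hypothetical, and in the cited study the ratings were made by judges, not by code.

```python
def contextualization_score(has_substantial_stimulus: bool,
                            stimulus_is_relevant: bool) -> int:
    """Return 0, 1 or 2 following the threefold scale described above.

    Both arguments are hypothetical stand-ins for a judge's rating of an item.
    """
    if not has_substantial_stimulus:
        return 0  # little or no information given as stimulus
    if stimulus_is_relevant:
        return 2  # substantial information that is relevant for the solution
    return 1      # text is present, but it is not relevant for the solution


# An item with a long stimulus that is not needed to answer it is coded 1.
print(contextualization_score(True, False))
```

Items coded 1 by this rule are the "low contextualization" items that the present study selects for its experiment.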
The authors provided examples of PISA 2006 items in which the question can be answered, and
exclusively so, by recollecting factual knowledge not related to the stimulus; these items were
coded 1. Their objective was to carry out further performance studies of selected questions,
comparing students of different countries, in order to understand how well German students could
extract and process information, rather than find the right answer by recollecting factual knowledge.
Data are presented here testing the hypothesis that the stimulus in a group of selected ENEM
questions was actually relevant for student performance in biology. Instead of simply having judges
rate questions on the basis of stimulus Content and Relevance, as done in the cited article, an
additional step was added. Low-contextualization questions, corresponding to score 1 of Nentwig
et al. (2009), were selected and presented to students in two forms: a full-length version, with the
original stimulus, and an abridged version, in which the stimulus was removed, leaving just the stem
and options. Scores of the two groups of students are presented, and we discuss methods for
identifying possibly flawed multiple-choice items.

METHODS
Seven questions with a low level of contextualization, clearly related to biology, were selected
from the 2009 and 2010 tests (Novo ENEM) and presented in 2011 to two randomized groups of
high school students. One group (n0=233) was asked to answer the original questions (Full ENEM)
in a six-page questionnaire; another group of similar students (n1=200) was asked to answer the
same questions with the written stimulus entirely removed, leaving the stem and the very same
options, in a three-page questionnaire (Abridged ENEM). Another three questions (standard
questions) were included in both sets of questionnaires, focusing on biology subjects, with exactly
the same brief stem and five options, for comparison purposes. As survey participants were not
selected by randomised procedures, these questions would test the general biology knowledge of
the two groups, ascertaining that their proficiency in the field (biology) was equivalent, so that the
sample could be considered reliable for the sole purpose of comparing items. According to quota
sampling techniques, the choice of quota controls would challenge the quota sampler's ingenuity,
as quota variables should be strongly related to the survey variables (...), thereby becoming
substantially homogeneous. As Leslie Kish states, quota sampling is not a standardized scientific
method; rather, each one seems an artistic production (Kish, 1965: 563), and an overview is
provided below.
Each research assistant received one set of questionnaires, either short or long, and was responsible
for administering it to students of one public high school in the city of São Paulo (SP, Brazil).
Fourteen schools were chosen according to the assistants' convenience, as access to schools is quite
difficult, and the test was taken by students within a specific week in mid-September. Research
assistants were not aware of the differences between the two sets of questionnaires. The invitation
letter required by the Ethics Commission of our institution (FEUSP) was part of every
questionnaire, and stated that students were invited to collaborate in a study on assessment, that
they would not be identified in the answer sheet, and that the participating schools would not be
identified or ranked.
School validation relied on a two-level process. First, reports of how the questionnaire was
presented to students and answered were analyzed prior to processing the answers. Any reported
situation that departed from the ideal conditions led to the exclusion of the school. For instance,
when different research assistants went to the same school, it was excluded from the sample, as
students could have noticed the different lengths of the questions. We could validate fourteen
schools at this level. Second, as part of the statistical analysis, school results were studied and a
search for outliers was carried out (see below). One case was found in the group of schools where
abridged questions were presented, and the report of that specific school was reconsidered.
The school has a long record of good performance in large-scale evaluations, but its students now
had very low scores compared with the average of the other schools. The score on the standard
questions was 11%, which is surprisingly low for items with five options. The conclusion was that
this school is close to the university campus and its students were not motivated to take the test, as
they are quite used to similar university experiments. As they could not recognize the items as
ENEM questions, the task was probably seen as a waste of time. Therefore, that school was
considered an outlier, and the number of students answering abridged questions was corrected to
n=127 (889 items analyzed). The number of students who answered full ENEM questions was
n=233 (1,631 items analyzed), giving a total sample of 360 students from 13 schools, with 2,520
ENEM items and 861 standard items analyzed.
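The ENEM item counts above follow directly from the group sizes and the seven experimental questions per questionnaire. The short check below is a sketch with hypothetical variable names; the size of the excluded school (n=73) is the figure reported in the Results section, and the standard-item count is not recomputed here.

```python
ENEM_ITEMS_PER_STUDENT = 7   # experimental ENEM questions in each questionnaire

full_students = 233          # n0, Full ENEM group
abridged_enrolled = 200      # n1 before school validation
excluded_school = 73         # outlier school, as reported in the Results

abridged_students = abridged_enrolled - excluded_school
full_items = full_students * ENEM_ITEMS_PER_STUDENT
abridged_items = abridged_students * ENEM_ITEMS_PER_STUDENT

print(abridged_students)                  # 127
print(full_items, abridged_items)         # 1631 889
print(full_students + abridged_students)  # 360
print(full_items + abridged_items)        # 2520
```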

ITEM EXAMPLES
The following examples show the two forms of presentation of selected items. In the full
version, items were reproduced from the beginning, where the question number appears for the first
time. In the abridged version, the stimulus was removed, and the version presented to students
began where the question number appears for the second time in the examples below. Colors will be
discussed below.
7 (full) - The biogeochemical carbon cycle comprises various compartments, including Earth,
the atmosphere and the oceans, and various processes allowing the transfer of compounds
between these reservoirs. Carbon stocks stored in the form of non-renewable resources, such as
oil, are limited, so it is of great importance to recognize the need to replace fossil fuels with
renewable fuels.
7 (abridged) - The use of fossil fuels affects the carbon cycle, as it causes:
a) increase in the percentage of carbon on earth.
b) reduction in the rate of photosynthesis of higher plants.
c) increased production of carbohydrates produced by plants.
d) increase in the amount of atmospheric carbon.
e) reduction of the overall amount of carbon stored in the oceans.
8 (full) - A new method for producing artificial insulin using recombinant DNA technology was
developed by researchers at the Department of Cell Biology, University of Brasilia (UNB) in
partnership with the private sector. Researchers have genetically modified Escherichia coli
bacteria, which became able to synthesize the hormone. The process allowed the manufacture of
insulin in larger quantities and in only 30 days, one third of the time required to obtain it by the
traditional method, which consists in extracting the hormone from the pancreas of slaughtered
animals.
Ciência Hoje, 24 April 2001. Available at: http://cienciahoje.uol.com.br (adapted).
8 (abridged) - The production of insulin by the recombinant DNA technique has, as a consequence:
a) improvement of the process of extracting insulin from porcine pancreas.
b) the selection of antibiotic-resistant microorganisms.
c) progress in the technique of chemical synthesis of hormones.
d) favorable impact on the health of diabetics.
e) creation of transgenic animals.

Distractors' keywords appear in color, associated with related terms in the stimulus. As Thiessen
et al. (1989) argued, distractors play an important role in item planning, and keywords improve the
options' plausibility. As we will argue later, a long but irrelevant stimulus may improve the
effectiveness of distractors to the point of flawing the whole item.

RESULTS
The total number of answered questions taken from the national test was 2,520 (Table 1); another
861 standard questions were included in order to test sample homogeneity (Table 2), giving a total
of 3,381 questions answered and processed.
Statistical analysis included parametric tests and a search for outliers. One school (EEI1FB, n=73)
fell into this category, as previously mentioned, and was excluded from the sample. Fisher's Exact
Test for ENEM questions (Table 1) showed no statistically significant differences between the
groups on four questions (Q1, p-value = 0.906; Q5, p-value = 0.077; Q9, p-value = 0.901;
Q10, p-value = 0.152), and statistically significant differences on three questions, in favor of
abridged questions (Q3, p-value = 0.006; Q7, p-value < 0.001 and Q8, p-value < 0.001). Results of
the same statistical analyses for the three standard questions (Table 2) confirmed the homogeneity
of the two groups (Q2, p-value = 0.787; Q4, p-value = 0.116 and Q6, p-value = 0.140).
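For readers who want to see how such a comparison works, the following is a minimal sketch of a two-sided Fisher exact test on a 2x2 table of right and wrong answers per group, written with the Python standard library only. It assumes the test is applied to the pooled group totals of Table 1; the paper does not state how the test was configured, so the values obtained this way need not reproduce the published p-values exactly. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (full, abridged); columns are right and wrong answers.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x right answers in row 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_observed * (1 + 1e-7))
    return min(1.0, total)


# Q8 totals from Table 1: 77 of 233 right (full), 69 of 127 right (abridged).
p_q8 = fisher_exact_two_sided(77, 233 - 77, 69, 127 - 69)
# Q1 totals from Table 1: 108 of 233 right (full), 62 of 127 right (abridged).
p_q1 = fisher_exact_two_sided(108, 233 - 108, 62, 127 - 62)
print(p_q8, p_q1)
```

Under these assumptions the Q8 difference is clearly significant and the Q1 difference is not, which agrees in direction with the results reported above.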

Table 1
Right answers to the 2,520 ENEM questions (F.E.T. = Fisher's Exact Test)

Full ENEM Questions
N   School       n0      Q1      Q3      Q5      Q7      Q8      Q9     Q10
1   EEA0HD       15       1      12      13       4       2       3       3
2   EEB0NL       52      18      30      32      11      18      10      12
3   EEC0SB       32      15      22      20       6      16      10      21
4   EED0XS       13       7       9      12       9       6       7       8
5   EME0EA       32      12       8      17       7       4      17       7
6   EEF0PM       34      21      27      28      23      19       3      17
7   EEG0GC       13       7      10      11      11       3       9      11
8   EEH0BT       42      27      32      34      26       9      21      20
    Total       233     108     150     167      97      77      80      99
    %                   46%     64%     72%     42%     30%     34%     42%

Abridged ENEM Questions
N   School       n1      Q1      Q3      Q5      Q7      Q8      Q9     Q10
09  EEK1BM       43      14      35      36      25      29      13      15
10  ETL1HV       25      10      19      20      20       7      11      18
11  EEM1BC       19      14      17      13       9       5      14       6
12  EEN1MS       18       4      10      12       9      10       3      11
13  EEO1HF       22      20      19      21      16      18       4      14
    Total       127      62     100     102      79      69      45      64
    %                   49%     79%     80%     62%     54%     35%     50%
F.E.T. p-value        0.906   0.006   0.077  <0.001  <0.001   0.901   0.152

Table 2
Results of the 861 standard questions (F.E.T. = Fisher's Exact Test)

Full ENEM Questions
N   School       n0      Q2      Q4      Q6
1   EEA0HD       15       4      10       2
2   EEB0NL       52       8      17       4
3   EEC0SB       32       8      13       7
4   EED0XS       13       4       7       5
5   EME0EA       32       3      11       7
6   EEF0PM       34       4      11       7
7   EEG0GC       13       3      11       4
8   EEH0BT       42      15       6      10
    Total       233      49      86      46
    %                   21%     37%     20%

Abridged ENEM Questions
N   School       n1      Q2      Q4      Q6
09  EEK1BM       43       4       9       4
10  ETL1HV       25      10      16      10
11  EEM1BC       19       6      16       6
12  EEN1MS       18       1       1       1
13  EEO1HF       22       4      16      12
    Total       127      25      58      33
    %                   20%     46%     26%
F.E.T. p-value        0.787   0.116   0.140

Table 1 presents the results of the two groups for the experimental questions. For this group of
low-contextualization items, the hypothesis that the stimulus is relevant to student performance
found no support, confirming the previous categorization. An even more surprising result emerged:
comparing the answers of the two groups, questions Q3, Q7 and Q8 showed statistically
significantly higher student performance when they carried no stimulus, a phenomenon we named
"reverse induced performance" (rip). In other words, in this group of questions, skipping the
stimulus gave students either the same or an even better probability of a good performance.
A further analysis was performed with linguistic tools, looking for causal explanations of these
surprising results. The students who answered items with no stimulus went directly to the stem
and were not influenced by the text presented to the other group. These texts had keywords, such as
oil and insulin, which were also inadvertently referred to by their superordinate terms (fossil fuels
and hormones), demanding previous knowledge for full understanding.
In the item examples given, question 7 carries a text with poor information on the carbon cycle,
which lacks cohesion and also touches on global warming issues. The item stem explores students'
previous knowledge of a specific topic (the effect of fossil fuels on the atmosphere). Without
previous knowledge, students under pressure from the tight time schedule would read the options
directly, looking for similarities between keywords found there and in the text. Three carbon
reservoirs are mentioned in the text, and they appear in three different options. The stimulus
would drive students' attention to these three options, whereas without it students would face a
different situation, in which these options become weak distractors. "Fuel" is a keyword in the
stem, which easily connects to the idea of combustion and smoke. The closest keyword is
"atmosphere", which is found in the right answer. Therefore, the lack of cohesion of the text could
lead students to skip the stimulus and concentrate on the stem, raising the probability of success,
possibly for reasons other than those originally intended. This trajectory could explain the
observed "rip".
The other example is even clearer. Question 8 was presented above with keywords colored in the
stimulus, together with their related terms in the options and item stem. Apparently, students have
to apply information given in the text, as the stem is full of keywords such as insulin. The stimulus
carries keywords which appear (or have correlated ideas) in four distractors. The only option which
has no connection with the stimulus, as it mentions diabetics, is the right one. Students should
recollect facts about hormones and insulin, and know something about the related diseases, as the
stimulus evidently lacks cohesion with the context of the right answer, which relates to the
treatment of diabetics. Students who read the stimulus would be bound to focus their attention on
the four distractors. There is a clear lack of cohesion, as the stimulus does not mention any disease;
this strategy of diverting students' attention by changing the subject, making the stimulus
irrelevant for the answer, we called "bafflement", and it tends to increase "rip". In fact, this was
the question with the greatest difference between the two groups (Table 1). In a real test, students
could skip the stimulus and would not be misled into concentrating their attention on the wrong
options; with previous knowledge about insulin and the related disease, the answer would be easily
found. Therefore, it is possible to understand the observed "rip" as a consequence of this
"bafflement" strategy.
The four items in which no statistical difference between the abridged and full questions was found
also deserve analysis, as students who skipped the stimulus and went directly to the options were
as successful as those who did all the reading. However, as they have only three minutes for each
question, students who skip the stimulus would save precious time for other questions, raising the
probability of a higher final score in a real test.
These results show that low-contextualization ENEM questions focusing on biology do rely on
students' previous knowledge, and are not objective indicators of the alleged "structural modalities
of intelligence". Moreover, the items actually favor students with better reading and time
management skills over those with a balanced amount of biological knowledge and thinking skills.

DISCUSSION
Results show that low-contextualization items (Nentwig et al., 2009) deserve more attention
in future research. If presented with a stimulus which demands considerable time to be read and
understood (score 1 in the cited article), they are actually "context deficient" (cont-def). These
items allow at least two different paths to the right answer, which poses a considerable problem for
determining their degree of difficulty, with serious implications for Item Response Theory. Unlike
direct items with no context (score 0 in the cited article), or items with actually relevant
information in the stimulus (score 2 in the cited article), cont-def questions not only allow a similar
probability of success with or without the stimulus, as seen in questions 1, 5, 9 and 10 (Table 1),
but may also make the question even more difficult. As seen in questions 3, 7 and 8 (Table 1),
scores of students who received the stimulus were significantly lower, showing a new
phenomenon, which we called "reverse induced performance" (rip).
This new phenomenon should be examined carefully in item pre-testing, as it has a profound
effect on the determination of the degree of difficulty. The validity of Item Response Theory
requires unidimensional items; therefore items must be rip-free. This study offers a practical
approach to such testing, with two randomized groups of students, one of them receiving abridged
versions of the items suspected of being cont-def.
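The pre-testing approach can be sketched as a simple screening routine that flags an item as showing rip when the abridged group answers it significantly better than the full group. The sketch below deliberately swaps in a pooled two-proportion z-test instead of the Fisher's Exact Test used in the study, because it needs only the standard library; the function names and the 0.05 threshold are assumptions made for illustration.

```python
from math import erfc, sqrt

# Right answers per item taken from Table 1: (full group, abridged group).
N_FULL, N_ABRIDGED = 233, 127
RIGHT_ANSWERS = {
    "Q1": (108, 62), "Q3": (150, 100), "Q5": (167, 102), "Q7": (97, 79),
    "Q8": (77, 69), "Q9": (80, 45), "Q10": (99, 64),
}


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / std_error
    return erfc(abs(z) / sqrt(2))


def flag_rip_items(alpha=0.05):
    """Return the items answered significantly better without the stimulus."""
    flagged = []
    for item, (full, abridged) in RIGHT_ANSWERS.items():
        abridged_is_better = abridged / N_ABRIDGED > full / N_FULL
        p_value = two_proportion_p_value(full, N_FULL, abridged, N_ABRIDGED)
        if abridged_is_better and p_value < alpha:
            flagged.append(item)
    return flagged


print(flag_rip_items())
```

With the Table 1 totals this screening flags Q3, Q7 and Q8, the same three items for which the study reports significantly higher performance on the abridged version.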
This research sheds new light on a long-known fact related to the commercially successful,
privately owned ENEM training courses, which have been active at least since 2003 (Zákia and
Oliveira, 2003: 885). There was a suspicion that they were useless, as students would receive all, or
almost all, of the information needed in the items' context stimulus, so that training would be of no
help in raising students' performance in ENEM. All seven experimental questions produced results
that can explain the need for specific student training to get higher scores in that exam. Within a
very tight time schedule, students may be trained not only to extract and process information given
in the stimulus, but also, and mainly, to select and discard information which is not relevant for
choosing the right option, or even to lower the distractors' efficiency.
The 2009 reform turned Novo ENEM not only into an instrument for selecting students for public
universities, but also into one aimed at monitoring education quality on a nationwide basis. The
democratization of access to higher education provided by ENEM (if any) may be due to sudden
changes and would tend to disappear, as time management skills are acquired differently by
students across the socioeconomic spectrum.
The proposal of turning ENEM into a compulsory State Exam is currently under discussion. Our
results suggest that assessment-based educational reform and education quality monitoring based on
this instrument should be considered with caution. Further research is necessary, encompassing
other content areas, in order to have a clearer idea of the real impact of low contextualization items
in large scale exams such as ENEM.

ACKNOWLEDGEMENTS
The authors want to express their gratitude to the following persons: Alessandra Lupi, Alessandra
Stranieri, Alessandra Ramin, Andria Vieira, Ariana Carmona, Bianca Dazzani, Bruna Lourenço,
Bruno Vieira, Carolina Bueno, Cristiano J. da Silva, Débora Brandt, Fernando Sábio, Giselle
Armando, Guilherme Antar, Guilherme Stagni, Helenadja Mota, Henrique Neves, João Ferreira,
Karina Tisovec, Laisa Lorenti, Mariana Rosim, Marina Medeiros, Natacha Lodo, Pedro Machado,
Priscylla Arruda, Rafael Ogawa, Renato Rego, Rodrigo Gonçalves, Rodrigo Dioz, Samara Moreira,
Talita Oliveira, Thaís de Melo, Thales Hurtado, Thiago Madrigrano, Vitor Lee. The following
institutions provided funds for the several parts of the research: CNPq, FAPERGS, FAPESP, FEUSP
and Pró-Reitoria de Pesquisa da USP.

REFERENCES
Franco, C., A. Bonamino (1999). O ENEM no contexto das políticas para o ensino médio.
Química Nova na Escola 10, 26-31.
Kish, L. (1965). Survey Sampling. New York: Wiley & Sons, Inc.
Ministério da Educação (1999). Exame Nacional do Ensino Médio - ENEM: documento básico
2000. MEC/INEP.
___________(2013). Sobre o ENEM. Available at http://portal.inep.gov.br/web/enem/sobre-o-enem
[accessed Dec 15 2013].
Nentwig, P., Roennebeck, S., Schoeps, K., Rumann, S. and Carstensen, C. (2009). Performance
and levels of contextualization in a selection of OECD countries in PISA 2006. Journal of
Research in Science Teaching, 46 (8), 897-908.
Orlandi, E. P. (2012). Discurso e leitura. São Paulo: Cortez.
Thiessen, D., L. Steinbeck and A. R. Fitzpatrick (1989). Multiple-choice items: the distractors are
also part of the item. Journal of Educational Measurement 26 (2), 161-176.
Zákia, S.; R. P. Oliveira (2003). Políticas de avaliação da educação e quase mercado no Brasil.
Educação e Sociedade 24 (84), 873-895.

PREDICTING SUCCESS OF FRESHMEN IN CHEMISTRY USING MODERATED MULTIPLE LINEAR
REGRESSION ANALYSIS
Katja Freyer, Matthias Epple and Elke Sumfleth
University of Duisburg-Essen, Germany
Abstract: It is generally known that student success in chemistry is rather low. Dropout rates as
high as 43 % exemplify this clearly. Dropout rates are highest during the first semesters of study,
which can be attributed to the large number of challenges freshmen have to face. There
have been several studies on factors that influence student success, but very few studies have
taken a closer look at the interaction between the predictors of student success in chemistry. For
this purpose, a questionnaire was developed to determine students' attitudes towards their studies,
their cognitive abilities as well as their chemistry knowledge and their subject-specific
qualifications. At the end of the first semester, chemistry knowledge was measured again, and
results from the final exam could also be obtained. Using moderated multiple linear regression
analyses, student success, which was defined as the score in the final exam on the one hand and
as the score in the chemistry knowledge test on the other hand, was predicted. Three variables,
the grade from the secondary school graduation certificate (Abitur), prior knowledge in chemistry
and desired subject ("I would rather study a different subject." Yes or No), were found to be
important predictors for both definitions of student success. For the score in the chemistry
knowledge test, study conditions also showed predictive validity. In this way, it could be shown
that study conditions play an additional important role in chemistry achievement. Moderation
analyses also confirm this finding, since only for the score in the chemistry knowledge test was it
possible to identify interactions between the predictors that are independent of the students'
course of study and university.
Keywords: Chemistry, higher education, freshmen, prediction of success, moderated multiple
linear regression analysis

INTRODUCTION
The dropout rate of German bachelor students in chemistry amounts to 43 % (Heublein, Richter,
Schmelzer & Sommer, 2012). The main reasons for leaving university are difficulties in
performance and lack of motivation (Heublein, Hutzsch, Schreiber, Sommer & Besuch, 2010),
the latter caused in particular by false expectations of the studies (Heublein, Spangenberg &
Sommer, 2003). The highest dropout rates in natural sciences studies occur during the first
semesters at university (Heublein et al., 2003), which can be attributed to the large number of
challenges freshmen are confronted with (Gerdes & Mallinckrodt, 1994).
Other European countries also have excessively high dropout rates. Ulriksen, Møller Madsen and
Holmegaard (2010) state that around one third of students end their studies before the
scheduled time. Low student success is an important issue at American universities as well, as
only approximately 70 % of freshmen in chemistry pass the final exam in general chemistry at
the end of the first semester (Legg, Legg & Greenbowe, 2001; McFate & Olmsted, 1999).
In order to gain a better understanding of student success, it is necessary to find factors that are
able to predict success as soon as studies begin. Many attempts have been made so far to
determine the factors leading to student success, dating back to 1921 (Powers, 1921). Several
variables predicting success have been identified, but only a few studies deal with the prediction
of student success in chemistry at German universities. This project aims at filling this gap.
Whereas many studies have tested a great variety of cognitive and non-cognitive variables, only a
few have also taken the interactions between the predictors into account. The great advantage of
doing so is a deeper insight into the connections between the variables that predict student success
and therefore a better understanding of how success can be achieved. Identifying the variables
leading to success and the way those variables interact creates a starting point for increasing
student success and decreasing dropout.
Here, student success is defined in two ways: as the score in the final chemistry examination and
as the score in a chemistry knowledge test, both of which took place at the end of the first semester
(the examination two to three weeks after the test). Whereas the chemistry exam is a criterion that
is highly relevant for the students, the chemistry test had no relevance at all for them, but it was
the same for all participating students and is therefore an objective measure of student success.
For the prediction, a regression model was built on the basis of Schiefele, Krapp and
Winteler (1992), who state that usually three components are used in predicting academic
achievement: general cognitive factors, general motivational factors, and interest.
The cognitive factors have been split into two parts. As the first predictor, prior knowledge was
included in the model because prior knowledge is the precondition for accumulating further
knowledge (Schneider, Körkel & Weinert, 1990). Therefore it is also the basis for the learning
growth which is finally measured at the end of the first semester. Prior knowledge is the
domain-specific part of cognitive factors, and its predictive strength increases with rising
consistency of test and study content (Heine, Briedis, Didi, Haase & Trost, 2006). Then, as the
rather domain-unspecific part of cognitive factors, two variables are added to the regression model:
the grade from the secondary school graduation certificate (Abitur) and ability in deductive
thinking. The highest predictive strength as a single predictor has been attributed to the grade from
the secondary school graduation certificate (Abitur). Abilities in deductive thinking are seen as a
central factor in measuring cognitive abilities, which show medium or satisfactory predictive
strength (Heine et al., 2006). As the motivational factor, the predictor desired subject is used. This
variable consists of one item that asks the students whether they would rather study a different
subject than the chosen one. It could be shown that students who are satisfied with their subject
achieve better results in an exam than students who are unsatisfied with their subject (Ohlsen,
1985). The positive effect of student satisfaction can be attributed to the fact that satisfied students
are more likely to initiate learning activities. Furthermore, they learn more successfully and can
maintain self-regulated learning, which is of great importance for (self-)study (Voss, 2007). The
next part of the model is subject interest. Here, findings are ambiguous. Some studies show no
predictive strength (Gold & Souvignier, 2005), whereas others find it meaningful for student
success (Fellenberg & Hannover, 2006; Giesen, Gold, Hummer & Jansen, 1986). Possibly, the
positive effect of interest on student success evolves only after a longer time of studying (Krapp,
1997). At this point, the model according to Schiefele, Krapp and Winteler (1992) is complete,
but since students from different universities and different courses of study were surveyed, study
conditions are also included in the model as a further variable. It is well known that different
study structures and requirements have an influence on student success as well (Krempkow,
2008; Preuss-Lausitz & Sommerkorn, 1968).

STUDY DESIGN
In the winter semester 2011/12, freshmen in chemistry from different German universities were
surveyed at the very beginning of the semester (pre-test) and at the end of the semester
(post-test) on the knowledge, abilities, and attitudes they bring to university.

Design & Participants


A total of 236 students from the two courses of study, chemistry and chemistry education, could
be included in the regression analyses (see Table 1). Three universities took part in this survey:
Humboldt-Universität zu Berlin (HU Berlin), Universität Duisburg-Essen (Uni DuE) and
Ludwig-Maximilians-Universität München (LMU Munich). All of these students took the
chemistry examination as well as the chemistry knowledge test.
Table 1
Participants divided by course of study and university.

                      HU Berlin   Uni DuE   LMU Munich   Total
Chemistry students           61        36           80     177
Education students           21        26           12      59
Total                        82        62           92     236

Female students (40 %) are somewhat less represented in the sample than male students (60 %).
There are also some differences concerning the year in which the students received their
secondary school graduation certificate (Abitur). Sixty-eight percent took their final exam just
before starting their studies in 2011, 20 % one year earlier in 2010, and 11 % earlier than that.

Material & Methods


The predictors are surveyed by several questionnaires, tests, and single items. Prior
knowledge is assessed by a Chemistry Knowledge Test (Cronbach's α = .778) which consists of 23
multiple-choice items. For every item four possible answers are given, exactly one of which is
correct. The test was constructed by the authors on the basis of the lecture on general
chemistry offered for first-semester students in chemistry at the University of Duisburg-Essen.
The content of the general chemistry lecture is very similar across Germany, so the test can be
considered applicable to all participating universities. The Test for Measuring Deductive
Thinking is taken from Wilhelm, Schroeders and Schipolowski (2009). Only the visual part has
been used, which requires the students to complete rows of figures. In this test, three to four
possible answers are given. For assessing subject interest, a questionnaire on study interest
(Schiefele, Krapp, Wild & Winteler, 1993) has been adapted. The test (Cronbach's α = .775)
contains 11 chemistry-related items, such as: "I'm studying this subject because engagement in
chemical topics and matters is important to me." Responses to the questionnaire on subject
interest are given on a four-point Likert scale specifying one's level of agreement. For the
variable desired subject the following item has been used: "I would rather study a different
subject." Students could answer with yes or no. Grade from the secondary school graduation
certificate (Abitur), course of study and university have each been asked for with one item. For
the operationalization of the study conditions, six dummy variables were created: chemistry
students from Berlin, education students from Berlin, chemistry students from Essen, education
students from Essen, chemistry students from Munich and education students from Munich. The
value 1 was given to the respective student cohort, whereas 0 stands for all the other students.
Chemistry students from Munich were taken as the reference group in the regression analyses
since they represent the largest group. Validity and reliability of all questionnaires and tests
were established in a pilot study conducted one year earlier, in winter semester 2010/11.
Additionally, scores in the final exam in chemistry at the end of the first semester have been
gathered. Since the students from the different courses of study and universities wrote different
exams, scores were z-standardized for the regression analyses. The predictors were added to the
regression model blockwise using the enter method. Moderation analyses were used to
identify interactions between the above-mentioned predictors. To keep the variables equally
weighted in the interaction terms, all variables were centered, meaning that the mean was set to
zero while the variance was kept (Cohen, Cohen, West & Aiken, 2003).
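
The preprocessing steps described above (dummy coding against a reference group, z-standardizing
exam scores, and centering predictors before forming interaction terms) can be sketched as
follows. This is a minimal illustration in plain Python, not the procedure actually run by the
authors; the cohort labels, the function names, and the toy numbers are assumptions made for the
example, and the z-scores here use the sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical cohort labels (university x course of study); names are illustrative.
COHORTS = ["HU-Chem", "HU-Educ", "DuE-Chem", "DuE-Educ", "LMU-Chem", "LMU-Educ"]
REFERENCE = "LMU-Chem"  # reference group: its members get 0 on every dummy


def dummy_code(cohort):
    """Return a 0/1 indicator for each non-reference cohort."""
    return {c: int(c == cohort) for c in COHORTS if c != REFERENCE}


def z_standardize(scores):
    """Rescale one exam's scores to mean 0 and sample standard deviation 1."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


def center(values):
    """Subtract the mean; the variance is left unchanged."""
    m = mean(values)
    return [x - m for x in values]


def interaction_term(x, z):
    """Product of two centered predictors, used as the moderation term."""
    return [a * b for a, b in zip(center(x), center(z))]


if __name__ == "__main__":
    print(dummy_code("HU-Educ"))
    print(z_standardize([10, 20, 30]))          # toy exam scores
    print(interaction_term([1, 2, 3], [2, 4, 6]))  # toy predictor values
```

Standardizing each exam separately puts scores from different exams on a common scale, and
centering before multiplication keeps the product term from being dominated by whichever
predictor has the larger mean.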

RESULTS AND DISCUSSION


The results of the regression analyses for the prediction of the exam score and the score in the
knowledge test are shown in Table 2. The values shown are the regression coefficients β of the
significant predictors only (p < .05). β is a measure of the predictive strength of a predictor. It
describes the size of the effect of each predictor on student success.
Table 2
Results of regression analyses for predicting achievement in the knowledge test and the final
exam (only significant predictors (p < .05) are shown).

                                   Knowledge test               Final exam
Predictor                          β        t        p          β        t        p
Prior knowledge                    .288     4.713    <.001      .169     2.633    .009
Cogn. A.
  Grade                           -.250    -4.170    <.001     -.398    -6.307    <.001
  Deduct. Think.                   ---      ---      ---        .148     2.385    .018
Desired subject                   -.111    -1.980    .049      -.173    -2.942    .004
Subject interest                   ---      ---      ---        ---      ---      ---
Study conditions¹
  HU Berlin Chemistry             -.138    -2.087    .038       ---      ---      ---
  HU Berlin Chem. Educ.           -.136    -2.730    .007       ---      ---      ---
  Uni DuE Chemistry                ---      ---      ---        ---      ---      ---
  Uni DuE Chem. Educ.             -.300    -4.985    <.001      ---      ---      ---
  LMU München Chemistry           -.188    -3.272    .001       ---      ---      ---
R²/% [incl. study conditions]      25.6 % [34.9 %]              26.4 % [27.8 %]

¹ Dummy-coded


It can be seen that prior knowledge has a stronger effect on the score in the knowledge test
(β = .288) than on the outcome in the exam (β = .169). This can be attributed to the fact that the
chemistry knowledge test was identical in pre- and post-test, whereas the exam covered a broader
field of knowledge. On the other hand, cognitive abilities show a stronger effect on the
exam score (grade: β(exam) = -.398 vs. β(test) = -.250; deductive thinking: β(exam) = .148 vs.
β(test) = n. s.). This finding could be due to the fact that the tasks in the exam were more
complex than the tasks in the knowledge test and therefore make higher cognitive demands on the
students. Furthermore, students were possibly more engaged in solving the exam than in solving
the test since the exam has a much higher relevance for them. This could also be a reason why the
variable desired subject shows a stronger influence on the exam (β = -.173) than on the
knowledge test (β = -.111). An experimental study confirmed that the relevance of a
test score and the students' motivation when working on the test correlate positively with the
students' test score (Liu, Bridgeman & Adler, 2012). Subject interest shows an effect neither
on the score in the test nor on the score in the exam.
With the aforementioned predictors, each of the two regression models explains approximately
one fourth of the variance of the score in the exam and the test, respectively. When study
conditions are added, the explained variance rises to 35 % for the chemistry test and stays
approximately the same for the exam. Study conditions thus show a strong effect only on
achievement in the knowledge test, whereas there is no significant effect at all on the
exam score. This result indicates that study conditions do have an influence on the students'
outcome. The absence of an effect of study conditions on success in the final exam suggests that
the exams written at the three universities differ from each other and presumably already
reflect the study conditions. Additional analyses (data not shown here) indicate that it is
not the content that makes the exams different; rather, they differ in difficulty. Which
variables cause the different levels of difficulty is not clear at the moment.
Concerning study conditions, similar results are obtained from the moderation analyses. Whereas
for the exam score no interactions between the predictors can be found that are
independent of course of study and university, several interactions can be found when
predicting achievement in the knowledge test (see Figure 1). This finding also suggests that the
exams are too different to yield interactions that are valid for all students, independent of the
students' affiliation to university and course of study.
The left part of Figure 1 shows that the influence of prior knowledge on achievement is
significantly higher for students with a poor Abitur grade. This means that the better the grade,
the less important prior knowledge is for student success in the chemistry knowledge test. The
right part of the figure shows that the positive influence of prior knowledge on achievement is
significantly higher for students with a high ability in deductive thinking. This could be due to
the fact that students with a high ability in deductive thinking are better able to apply their
prior knowledge.


Figure 1. Results from moderation analyses: interactions between predictors of achievement in
the knowledge test. Left: grade × prior knowledge (p = .077); right: prior knowledge × ability in
deductive thinking (p = .009).

SUMMARY
In this study, a theory-based regression model for the prediction of student success
in chemistry was applied. The results show that prior knowledge, cognitive abilities, and the
desired subject are significant predictors of performance at the end of the first semester in
chemistry. Subject interest plays a minor role in the prediction. Student success was measured in
two ways: once through the score in the chemistry exam (a criterion that is highly relevant for
the students) and once through the score in an objective chemistry test which, in contrast to
the exam, was the same for all participating students.
All in all, the regression model explains about one fourth of the variance of students'
performance at the end of the first semester. Additionally, study conditions, operationalized
through the students' affiliation to their course of study and their university, show a clear
influence on the test outcome and raise the proportion of explained variance to 35 %, whereas
this effect is missing for the exam. This result indicates that the exam is part of the study
conditions and that the exams at the different universities and in the different courses of study
must differ, too. Further analyses show that it is not the content but the differing difficulty of
the exams that contributes to this finding.

REFERENCES
Cohen, J., Cohen, P., West, S. G. & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Fellenberg, F. & Hannover, B. (2006). Kaum begonnen, schon zerronnen? Psychologische
Ursachenfaktoren für die Neigung von Studienanfängern, das Studium abzubrechen oder das
Fach zu wechseln [Easy come, easy go? Psychological causes of students' drop out of
university or changing the subject at the beginning of their study]. Zeitschrift zu Theorie und
Praxis erziehungswissenschaftlicher Forschung, 20, 381-399.


Gerdes, H. & Mallinckrodt, B. (1994). Emotional, social and academic adjustment of college
students: a longitudinal study of retention. Journal of Counseling & Development, 72, 281-288.
Giesen, H., Gold, A., Hummer, A. & Jansen, R. (1986). Prognose des Studienerfolgs. Ergebnisse
aus Längsschnittuntersuchungen [Prognosis of student success: Results from longitudinal
studies]. Frankfurt a. M.: Institut für Pädagogische Psychologie.
Gold, A. & Souvignier, E. (2005). Prognose der Studierfähigkeit: Ergebnisse aus
Längsschnittanalysen [Prognosis of college outcomes: Results from longitudinal studies].
Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 37, 214-222.
Heine, C., Briedis, K., Didi, H.-J., Haase, K. & Trost, G. (2006). Auswahl- und
Eignungsfeststellungsverfahren beim Hochschulzugang in Deutschland und ausgewählten
Ländern. Eine Bestandsaufnahme [Placement tests in Germany and other selected countries.
A review]. Hannover: HIS.
Heublein, U., Hutzsch, C., Schreiber, J., Sommer, D. & Besuch, G. (2010). Ursachen des
Studienabbruchs in Bachelor- und herkömmlichen Studiengängen [Reasons for dropout in
Bachelor and conventional courses of study]. Hannover: HIS.
Heublein, U., Richter, J., Schmelzer, R. & Sommer, D. (2012). Die Entwicklung der Schwund-
und Studienabbruchquoten an den deutschen Hochschulen [The development of dropout rates
at German universities]. Hannover: HIS.
Heublein, U., Spangenberg, H. & Sommer, D. (2003). Ursachen des Studienabbruchs [Reasons
for dropout]. Hannover: HIS.
Krapp, A. (1997). Interesse und Studium [Interest and studies]. In: H. Gruber & A. Renkl (eds.), Wege
zum Können: Determinanten des Kompetenzerwerbs (pp. 45-58). Bern: Verlag Hans Huber.
Krempkow, R. (2008). Studienerfolg, Studienqualität und Studierfähigkeit: Eine Analyse zu
Determinanten des Studienerfolgs in 150 sächsischen Studiengängen [Student success, study
quality and study ability: An analysis on determinants of student success in 150 courses of
study in Saxony]. Die Hochschule, 91-107.
Legg, M. J., Legg, J. C. & Greenbowe, T. J. (2001). Analysis of success in general chemistry
based on diagnostic testing using logistic regression. Journal of Chemical Education, 78,
1117-1121.
Liu, O. L., Bridgeman, B. & Adler, R. M. (2012). Measuring learning outcomes in higher
education: motivation matters. Educational Researcher, 41, 352-362.
McFate, C. & Olmsted, J. (1999). Assessing student preparation through placement tests. Journal
of Chemical Education, 76, 562-565.
Ohlsen, U. (1985). Eine empirische Untersuchung der Einflußgrößen des Examenserfolgs für
Absolventen wirtschaftswissenschaftlicher Studiengänge der Universität Münster [An
empirical survey of determining factors on examination success for graduates of economic
courses of study at Münster University]. Frankfurt a. M.: Lang.
Powers, S. R. (1921). The achievement of high school and freshmen college students in
chemistry. School Science and Mathematics, 21, 366-377.


Preuss-Lausitz, U. & Sommerkorn, I. N. (1968). Zur Situation von Studienanfängern [On the
situation of freshmen]. Zeitschrift für Erziehung und Gesellschaft, 8, 434-453.
Schiefele, U., Krapp, A., Wild, K. & Winteler, A. (1993). Der Fragebogen zum
Studieninteresse (FSI) [The Questionnaire on study interest (FSI)]. Diagnostica, 39, 335-351.
Schiefele, U., Krapp, A. & Winteler, A. (1992). Interest as a Predictor of Academic
Achievement: A Meta-Analysis of Research. In: K. A. Renninger, S. Hidi & A. Krapp (eds.),
The Role of Interest in Learning and Development (pp. 183-212). Hillsdale: LEA.
Schneider, W., Körkel, J. & Weinert, F. E. (1990). Expert Knowledge, General Abilities, and
Text Processing. In: W. Schneider & F. E. Weinert (eds.), Interactions among aptitudes,
strategies, and knowledge in cognitive performance (pp. 235-251). New York: Springer.
Ulriksen, L., Møller Madsen, L. & Holmegaard, H. (2010). What do we know about the
explanations for drop out/opt out among young people from STM higher education
programmes? Studies in Science Education, 46, 209-244.
Voss, R. (2007). Studienzufriedenheit: Analyse der Erwartungen von Studierenden [Student
satisfaction: analysis of student expectations]. Reihe: Wissenschafts- und
Hochschulmanagement, Band 9. Lohmar: Eul Verlag.
Wilhelm, O., Schroeders, U. & Schipolowski, S. (2009). BEFKI. Berliner Test zur Erfassung
fluider und kristalliner Intelligenz [Berlin test of fluid and crystallized intelligence].
Unpublished.


TESTING STUDENT CONCEPTUAL UNDERSTANDING
OF ELECTRIC CIRCUITS AS A SYSTEM
Hildegard Urban-Woldron
University College for Teacher Education, Lower Austria
Abstract: Often, profound misconceptions prevent students from getting a firm grasp of basic
concepts in electricity. Identifying and removing these stumbling blocks is the key to
progress in physics education. Straightforward knowledge tests are generally not suitable
for identifying these misconceptions, as students might arrive at the right answer in a test
based on wrong assumptions. Therefore, there is a need for conceptual understanding tests
which are useful in diagnosing the nature of students' misconceptions related to simple
electric circuits and which, in consequence, can serve as a valid and reliable measure of
students' qualitative understanding of simple electric circuits. As ordinary one-tier
multiple-choice tests were strongly criticized for overestimating students' correct as well as
wrong answers, two- and three-tier tests were developed by researchers. Although there
is much research on students' conceptions in basic electricity, there is a lack of
instruments for testing electricity concepts of students at grade 7, especially instruments
addressing an electric circuit as a system. To address this gap, the present study extends an
existing instrument developed by the author for testing students' electricity concepts,
focusing in depth on two specific aspects: first, developing three-tier items for uncovering
sequential reasoning, and second, distinguishing between misconceptions and lack of
knowledge. The participants of the study were 339 secondary school students from grade 7 to 12
after instruction about electricity. Surprisingly, students' misconceptions depend neither on
gender nor on age. In conclusion, the
findings of the study suggest that four items for uncovering students' difficulties in viewing
an electric circuit as a system can serve as a valid and reliable measure of students'
qualitative understanding of the systemic character of an electric circuit.
Keywords: three-tier concept test, electric circuit as a system, uncovering students'
conceptual understanding

THEORETICAL BACKGROUND
Research findings suggest that there are three categories of student difficulties in basic
electricity: inability to apply formal concepts to electric circuits, inability to use and
interpret formal representations of an electric circuit, and inability to argue qualitatively
about the behavior of an electric circuit (McDermott & Shaffer, 1992). Misconceptions
are strongly held and stable cognitive structures which differ from expert conceptions and
affect how students understand scientific explanations (Hammer, 1996). Students
may have various, often preconceived misconceptions about electricity, which stand in
the way of learning. The two most resistant obstacles seem to be the conception of a
battery as a source of constant current and the failure to consider a circuit as a system
(Dupin & Johsua, 1987). Closset (1983) introduced the term sequential reasoning, which appears
to be widespread among students (Shipstone, 1984). There is some evidence that
sequential reasoning is at least partially developed at school (Shipstone, 1988) and
reinforced by the teacher (Sebastia, 1993). Using the metaphor of a fluid in motion
(Pesman & Eryilmaz, 2010) and highlighting that electricity leaves the battery at one
terminal and goes on to turn on the different components in the circuit successively does not
help students to view a circuit as a system (Brna, 1988). On the contrary, this
linear and temporal processing prevents students from making functional connections
between the elements of a circuit and from viewing the circuit structure as a unified
system (Heller & Finley, 1992). Surprisingly, research findings do not indicate that
sequential reasoning develops differently according to age and teaching level (Riley et al.,
1981). Similar conceptions are also held by adults and some teachers (Bilal & Erol,
2009).
Therefore, there is a need for diagnostic instruments that provide information about students'
preconceptions and that can also be used for evaluating physics teaching. Different approaches
have been taken to identify and measure students' misconceptions about electricity.
In contrast to interviews, diagnostic multiple-choice tests can be scored immediately and
administered to a large number of subjects. Pesman and Eryilmaz (2010) used the three-tier test
methodology for developing the SECDT (Simple Electric Circuits Diagnostic Test). As
ordinary one-tier multiple-choice tests were strongly criticized for overestimating
students' right as well as wrong answers, two- and three-tier tests were developed by
researchers. Starting from an ordinary multiple-choice question in the first tier, students
are asked about their reasoning in the second tier and estimate their confidence in their
answers in the third tier. In view of the lack of instruments for testing electricity concepts
of students at grade 7 that are suitable for the Austrian physics curriculum, the author had
already developed a diagnostic instrument with some two-tier items for assessing students'
conceptual understanding, with potential use in evaluating curricula and innovative approaches
in physics education (Urban-Woldron & Hopf, 2012).

AIM AND RESEARCH QUESTIONS


As mentioned above, many students seem to be unable to consider a circuit as a
whole system, where a change in any of the elements affects the whole
circuit. In consequence, they often demonstrate local reasoning, focusing their
attention on one specific point in the circuit and ignoring what is happening elsewhere
in the circuit. Additionally, students show sequential reasoning: they believe
that when a change takes place in a circuit, only elements coming after the
point of change are affected. The focus of this study is on revealing students' ideas
and explanations of how electricity moves in a simple circuit including lamps and resistors
connected in series, and the way in which they justify their answers.
To gain an accurate picture of student understanding, it is crucial to find out what
students actually do not know and what kinds of alternative conceptions they have.
For the researcher, therefore, the wrong answers and the associated explanations of
the students are much more interesting and useful than the correct answers.
Consequently, this study extends an already existing instrument for testing electricity
concepts of students at grade 7 in two specific respects: first, to develop items for
uncovering sequential reasoning, and second, to distinguish between misconceptions and
lack of knowledge. The following two broad research questions were addressed:
1. Do misconceptions related to understanding an electric circuit as a system depend on
gender, level of education or age, and/or the teacher?
2. Can a three-tier multiple-choice test be developed that is reliable and valid and that
uncovers students' misconceptions related to grasping an electric circuit as a system?

METHOD
In order to develop a reliable tool for identifying students' misconceptions related to
understanding an electric circuit as a system, the author first conducted interviews based
on a literature review, some more structured and some with open-ended questions. In an
initial stage a questionnaire with 10 two-tier items was developed (a two-tier item is a
question plus a follow-up question; an example is provided in figure 2).
In a first round of evaluation with 10 teachers and 113 students (grade 8, 58 f(emale), 15
m(ale)), the questionnaire was reduced to 7 items, which were extended with a third tier asking
for students' confidence when answering each question. After a test run with 339 students
from grade 7 to grade 12 from secondary schools across Austria after formal instruction
(183 f, 156 m, mean age 14.7 y(ears), S(tandard) D(eviation) 1.7 y), results were
evaluated with the software programs SPSS and AMOS. In a polishing round, additional
interviews were used to optimize the test items. To score a two-tier item, a
value of 1 was assigned when both responses were correct. Furthermore, by examining
specific combinations of answers, other relevant variables were calculated to address
students' misconceptions.
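
The scoring rule just described (1 point only when both tiers are correct) can be sketched in a
few lines of Python. This is an illustrative sketch rather than the author's actual scoring
procedure: the function names and the dictionary layout are assumptions, and the only answer key
filled in is the one the paper states for item A (answer a1 with reason b3).

```python
# Answer key: item label -> (correct first-tier option, correct second-tier option).
# Only item A's key is given in the paper; other items would be added the same way.
KEY = {"A": ("a1", "b3")}


def score_two_tier(item, answer, reason, key=KEY):
    """Return 1 if both the answer and the reason are correct, otherwise 0."""
    correct_answer, correct_reason = key[item]
    return int(answer == correct_answer and reason == correct_reason)


def is_false_positive(item, answer, reason, key=KEY):
    """True if the first tier is correct but the chosen reason is wrong."""
    correct_answer, correct_reason = key[item]
    return answer == correct_answer and reason != correct_reason


if __name__ == "__main__":
    print(score_two_tier("A", "a1", "b3"))     # both tiers correct
    print(score_two_tier("A", "a1", "b2"))     # right answer, wrong reason
    print(is_false_positive("A", "a1", "b2"))
```

Keeping the false-positive check separate from the score makes it easy to count how often a
correct first-tier answer rests on a wrong reason.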


Participants and Setting


The participants of the study were 339 secondary school students from grade 7 to 12
(183 female; mean age = 14.7 years, SD = 1.7; 18 forms, 7 schools) after instruction
about electricity. Nine teachers were randomly asked to administer a paper-and-pencil test
with 7 three-tier items related to sequential reasoning to their students. Figure 1 shows
the distribution of the students across grades.

Figure 1: Distribution of students and grades

Data Sources
Firstly, based on preliminary results gained from interviews with open-ended questions, a
questionnaire with ten two-tier items was developed and piloted with 113 students,
accompanied by clarifying interviews to gain deeper insights into the students' cognitive
structures and reasoning. Subsequently, four of those ten items constituted the
test instrument used in the present study, assessing students' understanding of the
systemic character of a simple electric circuit with three-tier items. In the following, the
author presents a three-tier item (see figure 2) asking questions related to very simple
electric circuits; as we will see, there is ample room for misconceptions despite their
simplicity.
It should be added that the answer options provided were not invented by the
researcher but are based both on the literature review and on actual experiences with students.
The first tier is a conventional multiple-choice question, and the second tier presents possible
reasons for the answer given on the first tier. Additionally, to distinguish between
misconceptions and lack of knowledge, a third tier is implemented to examine how
confident students are about their answers.


A lamp and two resistors are connected to a battery.
a) What will happen to the brightness of the lamp if R1 is
increased and R2 remains constant?
   The brightness of the lamp decreases.
   The brightness of the lamp remains constant.
   The brightness of the lamp increases.
b) How would you explain your reasoning?
   It is the same battery. Therefore, the same current is delivered.
   A change of the resistor only influences the brightness of the lamp if the lamp is behind
   the resistor.
   Any change of the resistor influences the brightness of the lamp independently of its
   position in the circuit.
c) Are you sure about your answer to the previous two questions?
   highly certain / rather certain / rather uncertain / highly uncertain

Figure 2: Sample Item A

Data Analysis
Descriptive analyses, analyses of variance, confirmatory factor analyses, and
regression analyses were conducted using the software SPSS and AMOS.

RESULTS
The correct answer for item A (see figure 2) is a1 and b3. 108 students (33.4%)
provide a correct answer to the first two tiers of item A. A closer look at
the numbers in table 1 shows that 167 students (51.7%) answered the first tier
correctly, but 59 of these 167 students (35.3%) provided a wrong reason.
Consequently, more than one third of the students responding correctly on the first tier
can be counted as so-called false positives. On the other hand, 153 students chose the
correct explanation, whereas only 70.6% of these students also gave a correct
answer on the first tier. Therefore, we critically overestimate students' knowledge if we
only look at one tier. Overall, 30 students are highly certain, 105 are rather certain, 88
are rather uncertain, and 100 are highly uncertain about their answers. 37% of the highly
certain students and 26% of the rather certain ones give the correct answer for the first
and the second tier, whereas only 8% of the highly uncertain students answer this item
correctly. Table 1 gives an overview of the three answer options a1, a2, and a3 and the
three associated alternatives b1, b2, and b3 for the reasoning.
Table 1
Distribution of answers and reasons for item A
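
The percentages reported for item A can be re-derived from the counts given in the text. The
short Python sketch below does this arithmetic; the variable names are illustrative, and taking
the number of valid responses to be the sum of the four certainty counts is an assumption made
for this example, not something stated explicitly in the paper.

```python
# Counts reported in the text for item A.
first_tier_correct = 167   # students choosing a1
both_tiers_correct = 108   # students choosing a1 and b3
reason_correct = 153       # students choosing b3
certainty_counts = {
    "highly certain": 30,
    "rather certain": 105,
    "rather uncertain": 88,
    "highly uncertain": 100,
}


def percent(part, whole):
    """Share of part in whole, in percent, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Assumption: every valid response to item A also has a certainty rating.
n_responses = sum(certainty_counts.values())

# False positives: right first-tier answer combined with a wrong reason.
false_positives = first_tier_correct - both_tiers_correct

if __name__ == "__main__":
    print(percent(both_tiers_correct, n_responses))       # both tiers correct
    print(percent(first_tier_correct, n_responses))       # first tier correct
    print(percent(false_positives, first_tier_correct))   # false positives
    print(percent(both_tiers_correct, reason_correct))    # right reason and right answer
```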

Next, three of the misconceptions identified are illustrated:
Misconception #1 (Answers a1, b2)
In this misconception the student chooses the right answer, but based on the erroneous
assumption that the lamp is behind the resistor when electricity travels around the
circuit from the positive to the negative terminal. 167 students give the right answer a1
for the first tier, but only 108 of them also present the right reason b3 for their
answer, whereas 55 students who respond correctly on the first tier choose
answer b2 on the second tier. This is a prime example showing that a correct test answer does
not yet prove that the student has really understood the underlying concept.
Misconception #2 and #3 (Answers a2, b2 or b1)
Here, the student probably thinks that a constant amount of current leaves the battery at
the negative end and reaches the lamp before it arrives at the increased resistor. 36 out of
the 92 students choosing a2 for the first tier pick answer b2, indicating that they think
sequentially. 49 of those 92 students explain their reasoning by selecting b1,
that is, by viewing the battery as a source of constant current without considering any
influence of the resistance on the current.
Of the students who choose answer a3, 10.9% are rather certain, 56.3% are rather
uncertain and 32.8% are highly uncertain. 6 of the 7 students who were rather certain
choose answer b3. This reasoning may indicate that those students do not
have difficulties in viewing an electric circuit as a system but do have a wrong
concept of a resistor in general. Furthermore, the answers of the other students, who
are rather or highly uncertain about their answers, suggest that they
simply guessed.
Table 2
Possible combinations axby

Table 2 shows all nine possible combinations of answers a1, a2 or a3 with
explanations b1, b2 or b3 within the red-framed rectangle. Four of these nine
combinations can be attributed to specific physics concepts, which can be described in the
following way: a1b3 indicates correct answer and correct reasoning, a1b2 stands for
correct answer and sequential reasoning, a2b2 hints at sequential reasoning, and a2b1
shows that the battery is viewed as a source of constant current.
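
The mapping from answer combinations to concepts, together with the third tier, can be sketched
as a small lookup in Python. The four named combinations come from the text; the decision rule
that treats a certain but wrong response as a misconception and an uncertain response as lack of
knowledge or guessing is the usual reading of three-tier items and is an assumption here, as are
the function and label names.

```python
# The four combinations the paper attributes to specific concepts.
CONCEPTS = {
    ("a1", "b3"): "correct answer, correct reasoning",
    ("a1", "b2"): "correct answer, sequential reasoning",
    ("a2", "b2"): "sequential reasoning",
    ("a2", "b1"): "battery as source of constant current",
}
CERTAIN = {"highly certain", "rather certain"}


def classify(answer, reason, certainty):
    """Return (concept, status) for one three-tier response to item A."""
    concept = CONCEPTS.get((answer, reason), "other")
    if certainty not in CERTAIN:
        # Assumed rule: uncertain responses are not counted as misconceptions.
        status = "lack of knowledge or guessing"
    elif (answer, reason) == ("a1", "b3"):
        status = "understanding"
    elif concept == "other":
        status = "unclassified"
    else:
        status = "misconception"
    return concept, status


if __name__ == "__main__":
    print(classify("a1", "b2", "highly certain"))
    print(classify("a2", "b1", "rather uncertain"))
    print(classify("a3", "b1", "rather certain"))
```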

Figure 3: Combination of answers in relation to students' certainty

In addition to the four student concepts mentioned above, figure 3 illustrates that all
students who are highly certain choose one of these four combinations. Whereas 70.6%
of the highly certain students give the correct answer and also the correct explanation for
item A (see figure 2), 17.7% think sequentially and 11.8% think that the battery is a
source of constant current and that the resistor therefore has no influence on the amount of
current in the circuit. All other possible combinations are subsumed under "other",
depicted in yellow in figure 3. The yellow bars show that
students tend to choose a combination which cannot easily be attributed to a specific
concept when they are not certain about their thinking and probably simply guess.

Figure 4: Dependence of correct answers and explanations on the teacher


Furthermore, analyses of variance indicate that the number of correct answers
depends significantly on the individual teacher (see figure 4). The test statistics are
F(7, 338) = 2.75, p = .009 for item A; F(7, 338) = 2.63, p = .012 for item B;
F(7, 338) = 5.20, p < .001 for item C; and F(7, 338) = 2.12, p = .013 for item D.
For example, fewer than 20% of the students of teacher T1 give correct answers and
explanations for items A, B, C and D, whereas more than 37% of the students of
teacher T5 do so for all four items.

Figure 5: Latent variable "sequential reasoning"


Construct validity was evaluated through factor analysis. Confirmatory factor analysis
with AMOS, using the maximum-likelihood method and including specific combinations
of answers to the first and second tier of four different test items, resulted in a
chi-square value of 5.805, which was not significant (p = .221). Therefore, a latent
variable "sequential reasoning" could be established (see figure 5). This variable
explains up to 52% of the variance of the individual items.
A resistor and two lamps are connected to a battery.
a) What will happen to the brightness of the lamps if R is increased?
L1 remains constant, L2 decreases.
L1 decreases, L2 remains constant.
The brightness of both lamps increases.
The brightness of both lamps decreases.
The brightness of both lamps remains constant.
b) How would you explain your reasoning?
A change of the resistor only influences the brightness of the lamp if the lamp is behind the resistor.
Any change of the resistor influences the brightness of both lamps.
It is the same battery. Therefore, the same current is delivered.
Both lamps have a direct connection to the battery. Therefore, the resistor has no effect on the lamps.
c) Are you sure about your answer to the previous two questions?
highly certain / rather certain / rather uncertain / highly uncertain

Figure 6: Item D

Furthermore, findings from ANOVA reveal a main effect of the particular school, and
likewise of the particular teacher, on correct answers for all four items A to D.
Surprisingly, students' conceptions depend neither on gender nor on age.
Finally, a regression analysis in which items A to C were used to predict sequential
reasoning for item D (see figure 6) suggests that the three predictors together explain
31% of the variance for item D (F(3, 338) = 49.89, p < .0001) and that each is a
significant individual predictor of students' sequential reasoning on item D.
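
As a rough plausibility check (a sketch added for illustration, not part of the original analysis), the reported F value and the share of explained variance can be related through the standard identity F = (R^2 / k) / ((1 - R^2) / df_res) for a regression with k predictors and df_res residual degrees of freedom. The function names are illustrative.

```python
def f_from_r2(r2: float, k: int, df_resid: int) -> float:
    """Overall F statistic implied by R^2 for k predictors and df_resid residual df."""
    if not 0 <= r2 < 1:
        raise ValueError("r2 must lie in [0, 1)")
    return (r2 / k) / ((1 - r2) / df_resid)


def r2_from_f(f: float, k: int, df_resid: int) -> float:
    """Share of explained variance implied by an overall F statistic."""
    return f * k / (f * k + df_resid)


# Values reported in the text: F(3, 338) = 49.89 and 31% explained variance.
implied_r2 = r2_from_f(49.89, 3, 338)
implied_f = f_from_r2(0.31, 3, 338)
```

The R^2 implied by the reported F value rounds to .31, so the two reported figures are consistent with each other up to rounding.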

CONCLUSIONS AND IMPLICATIONS


In conclusion, the findings of the study suggest that the four items for uncovering students'
difficulties in viewing an electric circuit as a system can serve as a valid and reliable
measure of students' qualitative understanding of the systemic character of an electric
circuit. If researchers or teachers use only one tier in a multiple-choice
instrument, they overestimate correct answers and consequently gain a
wrong picture of student understanding. The present instrument can be used as a tool both
for teachers and researchers to gain a correct picture of student understanding. It can be
easily administered to a large number of students and could be used as a research tool for
assessing new curriculum materials or teaching strategies. Although there is some
evidence that the conceptual test is reliable, valid and objective, a few
improvements are needed. Additional interviews highlighted that the wording of the
third tier may not be fully comprehensible for students. A student may well be very
confident about his or her answer on the first tier but not about the explanation
given on the second tier.
Furthermore, the interviews revealed that some of the teachers tend to introduce the
direction of the current from the positive to the negative terminal of the battery, whereas
others use the direction of the negative charges from the negative to the positive pole.
Therefore, further improvements of the conceptual test instrument will take these
limitations of the present study into consideration. Finally, ways have to be found
to inform teacher education and continuing teacher professional development about
research findings on students' conceptions, in order to close the research-practice gap and
consequently make progress in physics education.

REFERENCES
Bilal, E. and Erol, M. (2009). Investigating students' conceptions of some electricity
concepts. Lat. Am. J. Phys. Educ., 3(2), 193-201.
Brna, P. (1988). Confronting misconceptions in the domain of simple electric circuits.
Instructional Science, 17, 29-55.
Closset, J.-L. (1983). Sequential reasoning in electricity. In: Proceedings of the
International Workshop on Research in Physics Education, La Londe les Maures,
Paris: Editions du CNRS.
Dupin, J.-J. and Johsua, S. (1987). Conceptions of French pupils concerning electric
circuits: Structure and evolution. Journal of Research in Science Teaching, 24(9),
791-806.
Hammer, D. (1996). More than misconceptions: Multiple perspectives on student
knowledge and reasoning, and an appropriate role for educational research. American
Journal of Physics, 64(10), 1316-1325.
Heller, P. M. and Finley, F. N. (1992). Variable uses of alternative conceptions: A case
study in current electricity. Journal of Research in Science Teaching, 29(3), 259-275.
McDermott, L. C. & Shaffer, P. S. (1992). Research as a guide for curriculum
development: An example from introductory electricity. Part I: Investigation of
student understanding. American Journal of Physics, 60(11), 994-1003.
Pesman, H. and Eryilmaz, A. (2010). Development of a three-tier test to assess
misconceptions about simple electric circuits. Journal of Educational Research,
103(3), 208-222.
Riley, M. S., Bee, N. V. and Mokwa, J. J. (1981). Representations in early learning: The
acquisition of problem-solving strategies in basic electricity and electronics. In:
Proc. Int. Workshop on Problems Concerning Students' Representations of Physics
and Chemistry Knowledge, Ludwigsburg (Ludwigsburg, West Germany:
Paedagogische Hochschule) 107-173.
Rosencwajg, P. (1992). Analysis of problem solving strategies on electricity problems in
12 to 13 year olds. European Journal of Psychology of Education, VII,1, 5-22.
Sebastia, J. M. (1993). Cognitive mediators and interpretations of electric circuits. In:
The Proceedings of the Third International Seminar on Misconceptions and
Educational Strategies in Science and Mathematics, Misconceptions Trust: Ithaca,
NY (1993).
Shipstone, D. M. (1988). Pupils' understanding of simple electric circuits: Some
implications for instruction. Phys. Educ. 23, 92-96.
Shipstone, D. M. (1984). A study of children's understanding of electricity in simple DC
circuits. Eur. J. Sci. Educ. 6, 185-198.
Urban-Woldron, H. & Hopf, M. (2012). Developing a multiple choice test for
understanding basic electricity. ZfDN, 18, 201-227.

PROCESS-ORIENTED AND PRODUCT-ORIENTED ASSESSMENT OF EXPERIMENTAL SKILLS IN PHYSICS:
A COMPARISON
Nico Schreiber¹, Heike Theyßen¹ and Horst Schecker²
¹ University of Duisburg-Essen
² University of Bremen
Abstract: The acquisition of experimental skills is widely regarded as an important part of science education. Models describing experimental skills usually distinguish between three dimensions of an experiment: preparation, performance and data evaluation. Valid assessment procedures for experimental skills have to consider all three dimensions. Hands-on tests can especially account for the performance dimension. However, in large-scale assessments the analysis
of students' performances is usually based only on the products of the experiments. But does this
test format sufficiently account for a student's ability to carry out experiments? A process-oriented analysis that considers the quality of students' actions, e.g. while setting up an experiment or measuring, provides a broader basis for assessments. At the same time it is more time-consuming. In our study we compared a process-oriented and a product-oriented analysis of
hands-on tests. Results show medium correlations between both analysis methods in the performance dimension and rather high correlations in the dimensions preparation and data evaluation.
Keywords: experimental skills, assessment, process-oriented analysis, product-oriented analysis,
science education

BACKGROUND AND FRAMEWORK


The acquisition of experimental skills is widely regarded as an important part of science education (e.g. AAAS 1993, NRC 2012). Thus, there is a demand for assessment tools that allow for a
valid measurement of experimental skills. In our study we compared a process-oriented and a
product-oriented analysis of students' performances in hands-on tests. The test instrument refers
to a specific model of experimental skills.

Modelling experimental skills


In the literature there is a broad consensus concerning typical experimental skills, like "create an
experimental design", "set up an experiment", "observe / measure" and "interpret the results"
(e.g. NRC 2012, DEE 1999). These skills can be assigned to three dimensions of an experimental
investigation: preparation, performance and data evaluation. Most models of experimental
skills are structured along these three dimensions with different accentuations (Klahr & Dunbar
1988, Hammann 2004, Mayer 2007, Emden & Sumfleth 2012).
Our model uses the three dimensions, too. In contrast to other models, it accentuates the performance dimension (Figure 1, adapted from Schreiber, Theyßen & Schecker 2009).

Figure 1: Model of experimental skills: 3 dimensions, 6 components (adapted from Schreiber, Theyßen & Schecker 2009).

At school, an experimental question is usually given to the students and not developed by them.
So students have to interpret and clarify the given task. In non-cookbook types of situations, students have to create the experimental design themselves. During the performance, students set up
the experiment, they measure and document data. During data evaluation, students process data,
give a solution and interpret the results. This description might suggest a linear order of steps.
However, this is not intended. The steps can occur in different orders and loops.

Measuring experimental skills


Written tests are established instruments to assess experimental skills. But especially with regard
to the performance dimension, their validity is in question (e.g. Shavelson, Ruiz-Primo, & Wiley
1999). Other approaches for the assessment of performance skills seem to be necessary (Ruiz-Primo & Shavelson 1996, Stebler, Reusser & Ramseier 1998, Garden 1999). Here, hands-on tests
show their potential. However, in large-scale assessments the analysis of students' performance
in experimental tasks is usually based only on the products of experimenting, mostly documented
in lab sheets (e.g. Stebler, Reusser, & Ramseier 1998, Ramseier, Labudde & Adamina 2011, Gut
2012). The processes of experimentation, like setting up the apparatus and making measurements, are considered only indirectly, insofar as they affect the products.
On the one hand, it is in question whether a product-oriented analysis which neglects process aspects of experimenting yields adequate ratings compared to a process-oriented analysis. On the
other hand, a process-oriented analysis that considers the quality of students' actions in the performance dimension (e.g. Neumann 2004) is very resource-consuming. In order to justify the additional effort for a process-based analysis, it has to be shown that ratings from a product-based
analysis are insufficient predictors for ratings from a process-based analysis of the same hands-on
test, at least for the performance phase.

RATIONALE AND METHODS


Hypotheses
In our study we investigate correlations between ratings from a product-based and a process-based analysis of students' performances in hands-on tests. In the performance dimension students
get direct feedback from the experimental setup. A non-functional electric circuit, for example, may result
in a series of changes until the setup finally works. The lab sheet will only show the final result.
Similar processes may occur during measurement. A product-based analysis only evaluates the
documented (i.e. usually the final) results, while a process-based approach looks at the chain of
students' actions. Our first hypothesis is:
H1: Concerning the performance dimension, ratings from a product-oriented analysis are not
highly correlated with scores from a process-oriented analysis.
Students prepare the experiment and evaluate their data mostly in written form, without handling
experimental devices. We assume a close relationship between what they do and what they
document. Thus, our second hypothesis is:
H2: Concerning the preparation and evaluation of the experiment, ratings from a product-oriented analysis are highly correlated with scores from a process-oriented analysis.
As a high correlation we define a correlation above 0.7 (Kendall Tau-b).
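
For readers unfamiliar with the measure, Kendall's tau-b compares every pair of students and corrects for ties, which matter here because the ratings lie on a short ordinal scale. The following is a minimal pure-Python sketch of the standard tau-b formula, not the software used in the study; the function names and the example ratings are hypothetical.

```python
from math import sqrt

HIGH_CORRELATION = 0.7  # threshold for a "high" correlation as defined in the text


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences of ordinal ratings."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both ratings: not counted in any term
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one rating is constant")
    return (concordant - discordant) / denominator


def is_high(tau: float) -> bool:
    """Apply the 0.7 criterion used in the hypotheses."""
    return tau > HIGH_CORRELATION


# Hypothetical ratings of five students (product-based vs. process-based).
product_scores = [1, 2, 5, 5, 1]
process_scores = [1, 3, 4, 5, 2]
tau = kendall_tau_b(product_scores, process_scores)
```

The pairwise loop is quadratic in the number of students, which is unproblematic for samples of the size reported here.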

Methods
Tasks
For the comparison of the product- and process-based analysis we developed two experimental
tasks for the domain of electric circuits in secondary school curricula (Schreiber et al., 2012). The
first task is "Here are three bulbs. Find out the one with the highest power at 6 V." In the second
task the students get a set of wires and have to find the best conductor from three metals. Students
have a set of apparatus and a pre-structured lab sheet at their disposal. The lab sheet is structured
along our model of experimental skills, asking students to plan a suitable experiment, assemble the
experimental setup, perform measurements, evaluate the data and draw conclusions. Both tasks
are open-ended and the students have to structure their paths towards the solutions on their own.
They are only assisted by written information on the necessary physics content knowledge.

Design
Table 1 shows the design of the study. It was embedded in a more extensive study concerning the
comparison of different assessment tools for experimental skills (Schreiber 2012).

Table 1
Design of the study
pre-test         cognitive skills, content knowledge, self-concept     45 min
training         introducing the hands-on test                         20 min
hands-on tests   group 1: task 1 (highest power)                       30 min
                 group 2: task 2 (best conductor)

138 upper secondary students, aged about 16 to 17, took part in this study. In a pre-test we measured personal variables that are supposed to have an influence on students' test performances:
cognitive skills, self-concept concerning physics and experimenting in physics, and content
knowledge in the field of electricity. Established tests and questionnaires were adapted for this
pre-test (Heller & Perleth 2000, Engelhardt & Beichner 2004, von Rhöneck 1988, Brell 2008). In
the hands-on test the students worked on one of the two tasks described above. The use of two
different tasks was due to the design of the more extensive project into which this study was embedded. The students were assigned to the two groups based on their pre-test results in such a
way that a sufficient and similar variance of the personal variables was realized in both groups.
In a training session the students were introduced to the hands-on test (structures of the tasks and
handling of the devices). The training task was also taken from the domain of electric circuits
(measuring the current-voltage characteristic of a bulb).
In the hands-on test, students worked with a set of electric devices and a pre-structured lab sheet
(Figure 2, task 1). In the situation shown in Figure 2, the student documents his (inadequate)
setup with two multimeters, a battery and a bulb in the lab sheet. The pre-structured lab sheet
asks students to clarify the question, to document the setup, to perform measurements and to interpret
the results. Students can choose when and in which order they fill in the sheet. The lab sheet does
not specify a particular solution or approach.
Students' actions were videotaped and the lab sheets were collected.

Figure 2: A student performing task 1 in the hands-on test.

Process-oriented analysis
The videos and the lab sheets were analysed according to the components of experimenting
shown in Figure 1. The process-oriented analysis leads to a quality index for each student in each
of these assessment categories. In a first step, students' actions in the videotape are assigned to
one of the six components. A second step of analysis codes the quality of intermediate stages
(e.g. whether an experimental setup is correct, imperfect or wrong) and the development (e.g. whether an imperfect setup is detected and
improved) (cf. Schreiber, Theyßen & Schecker 2012, Theyßen et al. 2013). The flow chart in
Figure 3 illustrates an example of how the rating decisions are made. The result is a quality index
on an ordinal scale with five levels. To secure validity and reliability of this analysis, several
studies with high-inference expert ratings and interviews were conducted (details: Dickmann 2009,
Holländer 2009, Fichtner 2011, Dickmann, Schreiber & Theyßen 2012, Schreiber 2012). The
evaluation of double coding yields a high objectivity of the ratings (Cohen's Kappa .67).

Figure 3: Formal analysis scheme of the sequence analysis specified for setup skills.

Product-oriented analysis
For the product-oriented analysis only the students' documentation in the lab sheets was analysed with regard to the same six model components (skill in Figure 1). Each entry in the lab
sheet is directly associated with an assessment category. The single criterion is the correctness of
the entry. A development cannot be assessed since in most cases only one result is documented in
the sheets. Thus, using the formal analysis scheme (Figure 3), in the product-oriented analysis
only the levels 1, 2, and 5 can be scored. Again the objectivity in each assessment category is satisfactory (Cohen's Kappa > .62).
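
Cohen's Kappa, used above as the measure of objectivity, corrects the raw agreement between two coders for the agreement expected by chance. A minimal sketch of the standard unweighted formula follows; it is not the authors' evaluation script, and the function name and example codings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters coding the same cases."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must code the same non-empty set of cases")
    # Observed agreement: share of cases with identical codes.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per code.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single code")
    return (observed - expected) / (1 - expected)


# Hypothetical double coding of six lab-sheet entries on the levels 1, 2 and 5.
coder_1 = [1, 2, 5, 5, 1, 2]
coder_2 = [1, 2, 5, 2, 1, 2]
kappa = cohens_kappa(coder_1, coder_2)
```

Because the quality index is ordinal, a weighted kappa would also be a defensible choice; the text does not say which variant was used.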

RESULTS
To test the hypotheses, rank correlations (Kendall-Tau b, τ) between the quality parameters from
the product-oriented and the process-oriented analysis were calculated for each category (Tab. 2).
In all four assessment categories that can be assigned to the preparation and the evaluation
dimensions, the correlations are high (τ > .7). For components of the performance dimension,
we found only medium or low correlations (τ < .5). Thus, both hypotheses can be confirmed.
The high correlations in the planning and data evaluation dimensions can be explained by the
data basis: In these dimensions the process-oriented analysis also refers mainly to the documentations in the lab sheets. Only in a few cases the videos provided further information concerning
developments. Thus, regardless of the method of analysis, in the dimensions of planning and data
evaluation the scores 1, 2 and 5 dominate. In contrast, in both assessment categories belonging to
the performance dimension the process-oriented analysis largely profits from the videos. The videos
provide relevant information about the developments while the students work on the setup and make
measurements. The low correlations between the process-oriented and the product-oriented analysis
are obviously caused by an information gap between the documented setups and measurements on the
one hand and the actual setup and measurement processes in the video on the other hand.

Table 2
Correlations (Kendall-Tau b, τ) between a product-oriented and a process-oriented analysis. The
correlations are highly significant (**) or significant (*). The assessment categories are assigned
to the three experimental dimensions: preparation, performance and data evaluation. n: sample size.
dimension         assessment category                    τ        n
preparation       apprehend the task                     .877**   138
                  create the experimental design         .728**   138
performance       set up the experiment                  .499**   130
                  perform & document measurements        .221**   122
data evaluation   process the data & give a solution     .960**   122
                  interpret the results                  .775**   117

A further result can be derived from Table 2: the sample size per category decreases over the
course of the experiment. Whereas 138 students clarified the questions and created an experimental design in the beginning, only 117 students interpreted the results in the end. This is a noticeable dropout of about 15%. The reason is the use of an open task format (Fig. 2). The students had to structure the approach on their own and without any assistance. Students who, for example,
did not complete the setup were subsequently unable to measure and document data.
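
The dropout figure follows directly from the sample sizes in Table 2. As a small worked example (illustrative only; the names SAMPLE_SIZES and dropout_rate are not from the study):

```python
# Sample sizes per assessment category, in the order listed in Table 2.
SAMPLE_SIZES = [138, 138, 130, 122, 122, 117]


def dropout_rate(n_start: int, n_end: int) -> float:
    """Share of the initial sample that is missing at a later stage."""
    if n_start <= 0:
        raise ValueError("n_start must be positive")
    return (n_start - n_end) / n_start


# Overall dropout from the first to the last category: 21 of 138 students.
overall = dropout_rate(SAMPLE_SIZES[0], SAMPLE_SIZES[-1])

# Cumulative dropout at each category, relative to the initial sample.
cumulative = [dropout_rate(SAMPLE_SIZES[0], n) for n in SAMPLE_SIZES]
```

The cumulative values show that the first losses occur at the setup stage, which fits the explanation given above.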

CONCLUSIONS
We draw two conclusions from our results:
1. Comparison of a process-oriented and a product-oriented analysis
A product-oriented analysis seems to be sufficient to analyse students' skills of preparing an experiment and evaluating data. But in order to account for performance skills adequately, hands-on
tests with a process-oriented analysis of students' actions seem to be necessary. These findings
should be considered for the development of more valid assessment procedures.
2. Open task format and sample size
The use of an open task format in testing experimental skills ("Find out ...") causes a noticeable
dropout of students during the test. For assessing the full range of experimental skills, we suggest
a guided test with non-interdependent sub-tasks. Each sub-task should refer to a specific experimental skill. To allow for a non-interdependent assessment, each item should present a sample solution of the preceding step, e.g. a measurement item should provide a complete experimental
setup. We have started to work on such a test format.

REFERENCES
American Association for the Advancement of Science (AAAS) (Ed.) (1993). Benchmarks for Science Literacy. New York: Oxford University Press.
Brell, C. (2008). Lernmedien und Lernerfolg - reale und virtuelle Materialien im Physikunterricht. Empirische Untersuchungen in achten Klassen an Gymnasien zum Laboreinsatz mit Simulationen und IBE. In H. Niedderer, H. Fischler & E. Sumfleth (Eds.), Studien zum Physik- und Chemielernen, Vol. 74. Berlin: Logos.
Department for Education and Employment (DEE) (Ed.) (1999). Science - The National Curriculum for England. London: Department for Education and Employment.
Dickmann, M. (2009). Validierung eines computergestützten Experimentaltests zur Diagnostik experimenteller Kompetenz (unpublished bachelor thesis). Dortmund: Technische Universität Dortmund.
Dickmann, M., Schreiber, N. & Theyßen, H. (2012). Vergleich prozessorientierter Auswertungsverfahren für Experimentaltests. In S. Bernholt (Ed.), Konzepte fachdidaktischer Strukturierung für den Unterricht (pp. 449-451). Münster: LIT.
Emden, M. & Sumfleth, E. (2012). Prozessorientierte Leistungsbewertung des experimentellen Arbeitens. Zur Eignung einer Protokollmethode zur Bewertung von Experimentierprozessen. Der mathematische und naturwissenschaftliche Unterricht (MNU), 65 (2), 68-75.
Engelhardt, P. V. & Beichner, R. J. (2004). Students' understanding of direct current resistive electrical circuits. American Journal of Physics 72 (1), 98-115.
Fichtner, A. (2011). Validierung eines schriftlichen Tests zur Experimentierfähigkeit von Schülern (unpublished master thesis). Bremen: Universität Bremen.
Garden, R. (1999). Development of TIMSS Performance Assessment Tasks. Studies in Educational Evaluation 25(3), 217-241.
Gut, C. (2012). Modellierung und Messung experimenteller Kompetenz. Analyse eines large-scale Experimentiertests. In H. Niedderer, H. Fischler & E. Sumfleth (Eds.), Studien zum Physik- und Chemielernen, Vol. 134. Berlin: Logos.
Hammann, M. (2004). Kompetenzentwicklungsmodelle: Merkmale und ihre Bedeutung - dargestellt anhand von Kompetenzen beim Experimentieren. Der mathematische und naturwissenschaftliche Unterricht 57(4), 196-203.
Heller, K. A. & Perleth, C. (2000). Kognitiver Fähigkeitstest für 4.-12. Klassen, Revision (KFT 4-12+ R). Göttingen: Hogrefe.
Holländer, L. K. (2009). Validierung eines Experimentaltests mit Realexperimenten zur Diagnostik experimenteller Kompetenz (unpublished bachelor thesis). Dortmund: Technische Universität Dortmund.
Klahr, D. & Dunbar, K. (1988). Dual Space Search During Scientific Reasoning. Cognitive Science 12, 1-48.
Mayer, J. (2007). Erkenntnisgewinnung als wissenschaftliches Problemlösen. In D. Krüger & H. Vogt (Eds.), Theorien in der biologiedidaktischen Forschung (pp. 177-186). Berlin, Heidelberg: Springer.
National Research Council (NRC) (Ed.) (2012). A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. Washington, DC: The National Academies Press.
Neumann, K. (2004). Didaktische Rekonstruktion eines physikalischen Praktikums für Physiker. In H. Niedderer, H. Fischler & E. Sumfleth (Eds.), Studien zum Physik- und Chemielernen, Vol. 38. Berlin: Logos.
Ramseier, E., Labudde, P. & Adamina, M. (2011). Validierung des Kompetenzmodells HarmoS Naturwissenschaften: Fazite und Defizite. Zeitschrift für Didaktik der Naturwissenschaften, 17, 7-33.
Ruiz-Primo, M. A. & Shavelson, R. J. (1996). Rhetoric and Reality in Science Performance Assessments: An Update. Journal of Research in Science Teaching 33 (10), 1045-1063.
Schreiber, N. (2012). Diagnostik experimenteller Kompetenz - Validierung technologiegestützter Testverfahren im Rahmen eines Kompetenzstrukturmodells. In H. Niedderer, H. Fischler & E. Sumfleth (Eds.), Studien zum Physik- und Chemielernen, Vol. 139. Berlin: Logos.
Schreiber, N., Theyßen, H. & Schecker, H. (2009). Experimentelle Kompetenz messen?! Physik und Didaktik in Schule und Hochschule, 8 (3), 92-101.
Schreiber, N., Theyßen, H. & Schecker, H. (2012). Experimental Competencies In Science: A Comparison Of Assessment Tools. In C. Bruguière, A. Tiberghien & P. Clément (Eds.), E-Book Proceedings of the ESERA 2011 Conference, Lyon France. Retrieved from: http://www.esera.org/media/ebook/strand10/ebook-esera2011_SCHREIBER-10.pdf (29.11.2013).
Shavelson, R. J., Ruiz-Primo, M. A. & Wiley, E. W. (1999). Note on Sources of Sampling Variability in Science Performance Assessments. Journal of Educational Measurement, 36 (1), 61-71.
Stebler, R., Reusser, K. & Ramseier, E. (1998). Praktische Anwendungsaufgaben zur integrierten Förderung formaler und materialer Kompetenzen - Erträge aus dem TIMSS-Experimentiertest. Bildungsforschung und Bildungspraxis 20 (1), 28-54.
Theyßen, H., Schecker, H., Gut, C., Hopf, M., Kuhn, J., Labudde, P., Müller, A., Schreiber, N. & Vogt, P. (2013). Modelling and Assessing Experimental Competencies in Physics. In C. Bruguière, A. Tiberghien & P. Clément (Eds.), 9th ESERA Conference Contributions: Topics and trends in current science education - Contributions from Science Education Research (pp. 321-337). Dordrecht: Springer.
von Rhöneck, C. (1988). Aufgaben zum Spannungsbegriff. Naturwissenschaften im Unterricht Physik/Chemie, 36(31), 38-41.

MODELLING AND ASSESSING EXPERIMENTAL COMPETENCE:
AN INTERDISCIPLINARY PROGRESS MODEL FOR
HANDS-ON ASSESSMENTS
Susanne Metzger, Christoph Gut, Pitt Hild, and Josiane Tardent
Zurich University of Teacher Education
Abstract: On the lower secondary level in Swiss schools, biology, chemistry, and physics
usually are taught as one subject. Accordingly, the national education standards do not
differentiate between the three sciences. In order to assess standards of experimental
competence, we developed a model that spans all three sciences. In this model,
experimental competence is structured by sub-dimensions referring to experimental
problem types such as "categories conducted observation", "measurement with a given
scale", "scientific investigation", "experimental comparison", or "constructive problem
solving". The progression of competence is modelled for each problem type separately,
differentiating three to five levels in terms of quality standards. In a first attempt to validate
the progress model, six tasks for "categories conducted observation" and "measurement
with a given scale" have been developed. A pilot test was administered to 250 students
(grades 7, 8, and 9) of different levels. The results of the pilot test affirm that the
progression model can be applied reasonably to all three sciences. The tasks can be
standardised well with respect to the test sheet structure, question formats, and textual
demands. Both the structure of our model and the hierarchy of quality standards seem to
be applicable.
Keywords: science, experimental competence, hands-on assessment, lower secondary
school, Switzerland

INTRODUCTION
In Swiss schools, science usually is taught as one subject on the lower secondary level.
Accordingly, the national education standards do not differentiate between biology,
chemistry, and physics. Within the Swiss project HarmoS (Labudde et al., 2012) an
interdisciplinary structure model of scientific competence was developed. The validation
of the model by a large-scale hands-on test showed that the progression of experimental
competence cannot be explained post hoc (Gut, 2012). Based on these results, we
developed a new interdisciplinary normative progression model for experimental
competence for practical assessments. In this paper, the model and first results of the
validation by pilot assessments are presented.

RATIONALE
According to Schecker & Parchmann (2006) there are different purposes of normative
competence models: Such models should help to define competence by determining
structure and progression. They should be practical in formulating adequate standards for
experimenting in school science and useful for teacher education. They should also provide
an appropriate basis for the development of valid and reliable assessments of students'
performance (e.g. Kauertz et al., 2012; Lunetta et al., 2007). In order to attain these goals,
four kinds of a priori decisions have to be made when experimental competence is
modelled. First, one has to define which scientific problems are to be standardised and
assessed. Second, one has to decide how competence may be decomposed into sub-dimensions. In the often-used process approach, problem solving is conceived as a linear
chain of processes (Murphy & Gott, 1984), such as formulating a hypothesis, planning and
carrying out experiments, and analysing data (Emden & Sumfleth, 2011). Alternatively,
one can differentiate between types of problems. The solution of each type demands
specific knowledge and skills (Millar et al., 1996). Therefore, each solution is scored by
typical scoring schemes and criteria (Ruiz-Primo & Shavelson, 1996). Third, it has to be
decided whether the progression of competence is modelled in terms of task complexity
(e.g. Wellnitz et al., 2012), or in terms of quality criteria that standardised problems are
solved with (Millar et al., 1996), or in terms of both simultaneously. The fourth decision
concerns the kind of assessment (hands-on, simulation, or paper and pencil test) by which
the competence is to be measured. Of course, these four decisions are not independent of
each other. The process approach leads to a restricted view of experimental activities,
excluding engineering tasks for instance. Also, considering the variety of problem types,
the complexity cannot be empirically explained based upon one single progression model
(Gut, 2012).

MODEL OF EXPERIMENTAL COMPETENCE


In order to attain the goals above, we developed a competence model on the basis of the
following principles: In our conception, experimental competence refers only to problems
with an authentic hands-on interaction, involving scientific questions as well as
engineering tasks. Experimental competence is structured by sub-dimensions referring to
various problem types such as "categories conducted observation", "measurement with a
given scale", "scientific investigation", "experimental comparison", or "constructive
problem solving". The progression of competence is
modelled for each problem type separately, differentiating three to five levels in terms of
quality standards for the solution of a standardised problem. These quality standards
correspond to different subtasks. For example, the basic problem of "measurement with a
given scale" is to measure a property of an object with given instruments as precisely as
possible. The students have the options to repeat measurements and to select an instrument.
These options correspond to different dimensions of the openness of a problem (Priemer,
2011). Using these options appropriately, students can reach higher progression levels, as
shown in figure 1.

[Figure 1: diagram. Axes: openness of the problem (complexity); quality of the solution (quality standards). Levels: I to IV. Labels: basic problem, measurement repetition, instrument selection (choice of instruments); single measurement, documentation of results, sources of measurement error, measurement repetition.]

Figure 1: Progression model for the problem type "measurement with a given scale"; the
hierarchy of quality standards is set a priori.

METHOD
For the first internal validation six tasks have been developed: three tasks for "categories
conducted observation" and three tasks for "measurement with a given scale" within
selected topics in biology, chemistry, and physics. An example of the problem type
"measurement with a given scale" in physics is the task "thread", in which students have to
find out at what force a thread breaks. They get a thread, scissors, two different spring
scales (A and B) and a calculator. Students are asked to think about how they could answer
the question, which instrument is best suited, and how many measurements
would be necessary. After finding the result they have to draw or write down how they did
the measurement. In addition, they are explicitly asked to state which instrument they
used. Afterwards they are asked to estimate how exact the measurement is
and how they could improve it. At the end, there are some control questions, such as which
spring scale they used or whether they calculated a mean and how they did it.
The pilot assessment has been administered to 250 students of different grades (7, 8, and 9)
and levels of the lower secondary school, especially to low achievers. In the following,
"low level" indicates the lowest achievement level, whereas "high level" indicates the
highest level of the so-called "Sekundarschule" (in Switzerland, the "Gymnasium" is even
higher). The students had to solve three tasks, each in 20 minutes. They worked on their
own with printed test sheets on which they were asked to write down their answers and a
brief report. Each task was coded by at least two persons and as one single item, i.e. all
answers were evaluated as a whole: Several dichotomous quality criteria could be
achieved or not. These quality criteria were clustered into the quality standards (see figure 1),
each of which could be achieved or not. To achieve a quality standard, 50% or more of its quality
criteria had to be achieved.
The item score can be computed in two ways: On the one hand, one can count all achieved
quality standards (unconditional level score: uLev). On the other hand, one can use the
hierarchy of quality standards and set the item score to the highest level up to which all lower
quality standards are achieved (conditional level score: cLev). For example, if a student achieves
quality standards 1, 2, 3 and 5 in a task, the unconditional level score is uLev = 4 and the
conditional level score is cLev = 3. The advantage of conditional levelling is that it retains
information that is lost by summing up scores.
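
The two scoring rules can be illustrated with a minimal Python sketch. It is not the coding procedure used in the study; the function names and the representation of a task as a mapping from hierarchy position to "achieved or not" are assumptions made for illustration only.

```python
from typing import Dict, List


def standard_achieved(criteria: List[bool]) -> bool:
    """A quality standard counts as achieved if at least 50% of its criteria are met."""
    return bool(criteria) and sum(criteria) / len(criteria) >= 0.5


def level_scores(standards: Dict[int, bool]) -> Dict[str, int]:
    """Return uLev and cLev for standards keyed by hierarchy position 1..n.

    Assumes the keys are the consecutive positions 1, 2, ..., n.
    """
    # Unconditional score: simply count the achieved standards.
    u_lev = sum(1 for achieved in standards.values() if achieved)
    # Conditional score: climb the hierarchy until the first missed standard.
    c_lev = 0
    for position in sorted(standards):
        if not standards[position]:
            break
        c_lev = position
    return {"uLev": u_lev, "cLev": c_lev}


# Worked example from the text: standards 1, 2, 3 and 5 achieved, standard 4 missed.
example = {1: True, 2: True, 3: True, 4: False, 5: True}
scores = level_scores(example)
print(scores)  # {'uLev': 4, 'cLev': 3}
```

The sketch makes the difference visible: uLev ignores where in the hierarchy a standard was missed, whereas cLev stops at the first gap.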

RESULTS
The results of the first pilot assessment affirm that the progression model for the two
problem types "categories conducted observation" and "measurement with a given scale"
can be applied reasonably to all three science subjects. One-dimensional Rasch analyses (with
the program Winsteps) for each problem type show good item fits. For the conditional level
score we found .96 < infit < 1.13 and .82 < outfit < 1.24 for the three tasks of "categories
conducted observation", and .81 < infit < 1.22 and .76 < outfit < 1.32 for the three tasks of
"measurement with a given scale". At least for "measurement with a given scale", a
sufficiently high reliability (> .6) is achieved.

Structure of the model


The low correlations between the two latent variables for "categories conducted
observation" and for "measurement with a given scale" (unconditional levels: .404**;
conditional levels: .368**) indicate that the structure of the model (i.e. the differentiation of
problem types) seems to be reasonable.
For the problem type "categories conducted observation" three quality standards could be
distinguished:
1. single observation: correctness and completeness,
2. identification of differences,
3. identification of similarities.
For "measurement with a given scale" five levels as shown in figure 1 could be found,
but the a priori hierarchy of quality standards had to be changed: identifying sources of
measurement errors seems to pose greater difficulties to students than documentation of
results, measurement repetition, and instrument selection. Therefore, the new hierarchy of
the progression model is:
1. single measurement,
2. documentation of results,
3. measurement repetition,
4. instrument selection,
5. sources of measurement errors.

To validate these progressions we compared the frequencies of achieved quality standards
for each task. As an example, figure 2 shows the frequencies of achieved quality standards
for the task "thread".

[Figure 2: chart. Vertical axis: frequencies of achieved quality standards (qs), 0.00 to 1.00. Horizontal axis: quality standards (single measurement, documentation of results, multiple measurement, instrument selection, sources of measurement error). Groups: grade 7, low level; grade 7, high level; grade 9, low level; grade 9, high level.]

Figure 2: Frequencies of achieved quality standards for the task "thread", differentiated
by four groups (low and high levels of grades 7 and 9 of the "Sekundarschule").

Figure 2 shows some misfits of the model. First, within every group the frequencies should
decrease the higher a quality standard is (i.e. the further to the right on the x-axis). Second,
for every quality standard the frequencies of the high level groups should exceed those of the
low level groups of the same grade, and those of the higher grade should exceed those of the
lower grade at the same level. Not every task shows the same
misfits as in figure 2, but generally the differentiation and the hierarchy of level 3
(multiple measurement) and level 4 (instrument selection) are critical. More tests have to show
whether these two levels have to be changed or merged.
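
The first expectation (frequencies falling along the hierarchy) can be checked mechanically. The following Python sketch is only an illustration: the frequencies are hypothetical values chosen to reproduce the kind of misfit described for levels 3 and 4, not the data behind figure 2, and the function name is an assumption.

```python
from typing import Dict, List, Tuple

STANDARDS = [
    "single measurement",
    "documentation of results",
    "multiple measurement",
    "instrument selection",
    "sources of measurement error",
]

# Hypothetical frequencies of achieved quality standards, NOT the values of figure 2.
frequencies: Dict[str, List[float]] = {
    "grade 7, low level": [0.80, 0.55, 0.20, 0.30, 0.05],
    "grade 9, high level": [0.95, 0.80, 0.50, 0.60, 0.25],
}


def hierarchy_misfits(freqs: List[float]) -> List[Tuple[str, str]]:
    """Return adjacent pairs of standards where the higher one is achieved more often."""
    misfits = []
    for i in range(len(freqs) - 1):
        if freqs[i + 1] > freqs[i]:
            misfits.append((STANDARDS[i], STANDARDS[i + 1]))
    return misfits


for group, freqs in frequencies.items():
    print(group, hierarchy_misfits(freqs))
```

With these invented numbers, both groups are flagged for the pair (multiple measurement, instrument selection), which is the pattern the text identifies as critical.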

Validation of tasks
The tasks can be standardised well with respect to the structure of the test sheet, the
question formats, and textual demands. All tasks were coded by at least two persons. At
the level of quality criteria as well as of quality standards a high interrater correlation (> .8)
can be achieved. Nevertheless, for "categories conducted observation" tasks more rater
training is needed than for "measurement with a given scale" tasks.
Separate one-dimensional Rasch analyses with unconditional and with conditional levels
show high correlations between the two scoring alternatives ("categories conducted
observation": .940**; "measurement with a given scale": .847**). Therefore, in order to
gain more information it seems to be reasonable to work with the unconditional levels
instead of the conditional ones.

Students' performance
To make the results comparable, we transformed the item parameters to skill points. Following
PISA and other large-scale assessments, we set the mean of all students of grade 9 to
500 with a standard deviation of 100. Table 1 presents the results for the tasks of the
problem types "categories conducted observation" and "measurement with a given scale".
For "categories conducted observation", the increases from grade 7 to 9 (high level) and from
low to high level in grade 9 are significant, amounting to almost one standard deviation. For
"measurement with a given scale", all increases from grade 7 to 9 (both levels) and from low
to high level (in grades 7 and 9) are significant, again amounting to almost one standard deviation.
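
The rescaling to skill points is a linear transformation anchored on the grade 9 students. A minimal Python sketch follows; the Rasch measures are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import List


def to_skill_points(measures: List[float], reference: List[float]) -> List[float]:
    """Rescale measures so that the reference group has mean 500 and SD 100."""
    ref_mean = mean(reference)
    ref_sd = pstdev(reference)  # assumption: population SD of the reference group
    return [500 + 100 * (m - ref_mean) / ref_sd for m in measures]


# Hypothetical Rasch measures (logits), for illustration only.
grade9 = [-1.2, -0.4, 0.1, 0.6, 1.5]
grade7 = [-1.8, -0.9, -0.3, 0.2]

grade9_points = to_skill_points(grade9, grade9)
grade7_points = to_skill_points(grade7, grade9)
print(round(mean(grade9_points)), round(pstdev(grade9_points)))
print(round(mean(grade7_points)))
```

Because grade 9 serves as the reference group, its scores have mean 500 and standard deviation 100 by construction, and the other groups are expressed on the same scale.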
Table 1
Students' performance for tasks of the problem types "categories conducted observation" (left)
and "measurement with a given scale" (right). In each case the mean is given with the standard
deviation in parentheses. U stands for the Mann-Whitney U test, t for the t-test and n.s. for not
significant.
categories
conducted
observation

low level

high level

measurement with a
given scale

low level

high level

grade 7

437 (79)

n.s.

452 (93)

grade 7

365 (101)

U: .000

450 (107)

n.s.

U: .006

U: .000

grade 9

459 (86)

537 (99)

grade 9

458 (83)

t: .011

U: .008

t: .009

538 (101)

With these results we can assign the students to the levels shown in table 2.
Table 2
Allocation of students' performance to quality standard levels for tasks of the problem types
"categories conducted observation" (left) and "measurement with a given scale" (right).

            categories conducted observation    measurement with a given scale
            low level      high level           low level      high level
grade 7     I              I                    I              II
grade 9     I              II                   II             IV

For "categories conducted observation", students in grade 7 and in the low level of grade 9
achieve level I, i.e. they observe a phenomenon correctly and completely. Only high level
students of grade 9 achieve level II, i.e. they also identify differences between two
observations. For "measurement with a given scale", low level students in grade 7 achieve
level I, i.e. they measure correctly. High level students of grade 7 and low level students of
grade 9 achieve level II, i.e. they also document their results well. Finally, high
level students of grade 9 attain level IV, i.e. in addition they repeat measurements and
select the right instruments.

CONCLUSION AND OUTLOOK


The results of the first pilot assessment suggest that the model works: On the one hand,
"categories conducted observation" and "measurement with a given scale" can be
discerned as two different problem types, so the structure of our model (i.e. the differentiation
of problem types) seems to be applicable. On the other hand, we can show a hierarchy of
quality standards for "measurement with a given scale" as well as for "categories
conducted observation", so the progression of our model also seems to be applicable.
Nevertheless, it has to be mentioned that these results are based on a sample of 250 Swiss
students. Currently, further internal validations are being undertaken for other problem types
such as "scientific investigation" and "experimental comparison". These results are expected in
spring 2014.
Because of the low number of test items a quantitative analysis of the pilot test is
problematic. For a valid assessment, it is planned to develop more tasks and to assess these
with larger student samples.

REFERENCES
Emden, M. & Sumfleth, E. (2012). Prozessorientierte Leistungsbewertung. Zur Eignung
einer Protokollmethode für die Bewertung von Experimentierprozesse. Der
mathematisch-naturwissenschaftliche Unterricht, 65 (2), 68-75.
Gut, C. (2012). Modellierung und Messung experimenteller Kompetenz. Analyse eines
large-scale Experimentiertests. Berlin: Logos.
Kauertz, A., Neumann, K. & Härtig, H. (2012). Competence in Science Education. In: B. J.
Fraser, K. Tobin & C. J. McRobbie (Eds.). Second international handbook of science
education (pp. 711-721). Berlin: Springer.
Labudde, P., Nidegger, C., Adamina, M. & Gingins, F. (2012). The Development,
Validation, and Implementation of Standards in Science Education: Chances and
Difficulties in the Swiss Project HarmoS. In: S. Bernholt, K. Neumann & P. Nentwig
(Eds.). Making it tangible: Learning outcomes in science education (pp. 235-259).
Münster: Waxmann.
Lunetta, V. N., Hofstein, A. & Clough, M. P. (2007). Learning and Teaching in the School
Science Laboratory: An Analysis of Research, Theory, and Practice. In: S. K. Abell
& N. G. Lederman (Eds.). Handbook of research on science education (pp. 393-441).
Mahwah: Erlbaum.
Millar, R. & Driver, R. (1987). Beyond processes. Studies in Science Education, 14, 33-62.
Millar, R., Gott, R., Duggan, S. & Lubben, F. (1996). Children's performance of
investigative tasks in science: A framework for considering progression. In: M.
Hughes (Ed.). Progression in learning (pp. 82-108). Clevedon: Multilingual
Matters.
Murphy, P. & Gott, R. (1984). Science: Assessment framework age 13 and 15 (APU).
Science report for teachers 2. Letchworth: The Garden City Press.
Priemer, B. (2011). Was ist offen beim offenen Experimentieren? Zeitschrift für Didaktik
der Naturwissenschaften, 17, 339-355.
Ruiz-Primo, M. A. & Shavelson, R. J. (1996). Rhetoric and reality in science performance
assessments: An update. Journal of Research in Science Teaching, 33 (10), 1045-1063.
Schecker, H. & Parchmann, I. (2006). Modellierung naturwissenschaftlicher Kompetenz.
Zeitschrift für Didaktik der Naturwissenschaften, 12, 45-66.
Wellnitz, N., Fischer, H. E., Kauertz, A., Mayer, J., Neumann, I., Pant, H. A., Sumfleth, E.
& Walpuski, M. (2012). Evaluation der Bildungsstandards - eine fächerübergreifende
Testkonzeption für den Kompetenzbereich Erkenntnisgewinnung. Zeitschrift
für Didaktik der Naturwissenschaften, 18, 261-291.

EFFECTS OF SELF-EVALUATION ON STUDENTS'
ACHIEVEMENTS IN CHEMISTRY EDUCATION
Inga Kallweit and Insa Melle
TU Dortmund University, Germany
Abstract: The purpose of this paper is to investigate the effects of a self-evaluation
sheet on students' learning outcomes when they work in an individualised learning
unit. To this end, the study followed a pre/post/follow-up design with the use of the
self-evaluation sheet as the experimental factor. Chemistry performance
concerning the topic "chemical reaction" was assessed by a multiple-choice test.
Additionally, the feelings towards the unit were measured by a feedback
questionnaire. The study was carried out with secondary education students (Grade 7,
N = 234), who were assigned to two groups. The control group worked in the 90-minute
individualised teaching unit without the self-evaluation sheet, whereas the self-evaluation
group worked with the self-evaluation sheet in the lesson. The results
showed that students of the self-evaluation group achieved higher learning outcomes
than students of the control group, both immediately after the learning unit and after
four weeks. Furthermore, the results indicated more positive
feelings towards the learning unit when students work with the self-evaluation sheet.
Keywords: self-evaluation, self-regulated learning, individualised teaching

INTRODUCTION
Individualised teaching has been featured in most education acts of Germany since
2005. However, implementing individualised teaching seems to be problematic: firstly,
there is no consensus about the definition of individualised teaching in the literature;
secondly, there is insufficient learning material, especially in the field of science
education. Further problems are caused by the general framework of the German
school system: teachers do not have time to create new learning material, let alone
material individualised for each student. Beyond that, research on individualised
teaching suggests that such methods increase the learning outcomes of high achieving
students only. To sum up, more research is needed on the effectiveness of
individualised teaching methods and on ways to support the learning process of every
student. In addition, learning material for individualised teaching has to be
developed and evaluated. This study had two aims: firstly, to develop
and evaluate a self-evaluation sheet as a diagnostic instrument and a newly
constructed learning unit for individualised chemistry education, in which the work
with self-evaluation is embedded; secondly, to explore the effects of this diagnostic
tool on learning achievements. The second aspect is dealt with in more depth here.

RATIONALE
At the beginning of the 20th century, individualised teaching was called for by
representatives of progressive education (e. g. Isaacs, 2010). After the discussion
fell silent for some time, it was reopened by the implementation of individualised
teaching in the education acts of Germany in 2005. This has initiated the debate about
teaching methods and effects on academic achievement in the context of
individualised teaching.
There are various definitions of individualised teaching in the literature. This paper uses
the following definition according to Kunze (2009) and Trautmann and Wischer
(2008): Individualised teaching consists of two aspects. The first aspect is the
diagnosis of students' needs and knowledge concerning a specific topic. The second
aspect is the implementation of differentiating learning environments based on the
results of the diagnosis. The aim of individualised teaching is to support each student's
skills specifically. This can range from closing knowledge gaps to promoting
individual strengths.
Research on individualised teaching methods and differentiating learning
environments seems to demonstrate that only high achievers profit from these
methods and that there is no influence on low and average achieving students (e. g.
Bode, 1996; Baumert, Roeder, Sang, & Schmitz, 1986; Helmke, 1988). In addition,
research indicates that individualised instruction has only a small effect on students'
achievements (Hattie, 2012). Some researchers argue that the implementation of
individualised teaching is highly time-consuming (e. g. Gruehn, 2000) and that, as a
consequence, there is less time for work on task. Furthermore, the concept of
developing individualised learning environments for each student in class is not
realistic. In conclusion, more research concerning the effectiveness of individualised
teaching and ways of implementing it is needed.
One aim of this study was to develop an individualised learning unit that can be
used efficiently in chemistry education. To this end, the theory of
individualised teaching and the theory of self-regulated learning were combined.
According to Zimmerman (2002), self-regulated learners are active participants in
their own learning process. They are aware of their strengths as well as of gaps in their
knowledge, and they are self-motivated to achieve desired goals (Zimmerman, 1990).
Zimmerman's model, which relies on the social-cognitive perspective, defines
self-regulated learning as a cyclical process consisting of three important phases:
the forethought, the performance and the reflection phase. These phases are divided into
different categories. In the forethought phase, students plan their learning process, set
the goals they want to achieve, and motivate themselves. This procedure is
influenced by students' self-efficacy beliefs and experiences. The performance phase
is characterised by self-monitoring processes in order to regulate the actual learning
behaviour. After that, the learning process is evaluated and reflected on in the reflection
phase. Students judge their achievements and find causal attributions. Based on
this reflection, self-regulated learners set new goals and proceed with the forethought
phase (Zimmerman, 2002). To sum up, self-regulated learning is context-specific and
depends on students' metacognitive, motivational and cognitive abilities
(Zimmerman, 2005). There is evidence that self-regulated learning can lead to higher
learning outcomes (Perels, Dignath, & Schmitz, 2009; Zimmerman & Martinez-Pons,
1988). From the social-cognitive perspective, self-regulated learning is not a static
trait. Research indicates that it can be trained and needs to be adapted to many
different situations (Perels, Gürtler, & Schmitz, 2005). Some empirical studies focus
their training in self-regulated learning on selected categories of the three phases,
for example self-monitoring. In this context, results reveal that consistent
self-monitoring leads to higher self-efficacy beliefs and better academic performance
(Schunk, 1982/1983). Besides, the accuracy of self-evaluation and self-assessment
seems to be a helpful predictor for learning success (Chen, 2003; Kostons, van Gog,
& Paas, 2012).
This study used a self-evaluation sheet to combine the theory of individualised
teaching with the theory of self-regulated learning. The work with the self-evaluation
sheet was embedded in an individualised learning unit and focused on the following
aspects of the theories: Firstly, students evaluate their own performance and use the
self-evaluation sheet as a diagnostic tool. Secondly, it enables students to plan,
monitor and document their own learning process. Thirdly, learners autonomously
record their learning behaviour on the sheet, which is important for a well-founded
self-reflection.

RESEARCH QUESTIONS
Based on the theoretical background, this study addressed the following research
questions:
Question 1: Does self-evaluation have an effect on learning outcomes?
Question 2: Does self-evaluation have a long-term effect on learning outcomes?
Question 3: Does self-evaluation have an effect on students' feelings towards the
individualised teaching unit?

METHODS
Materials
The materials used in the learning unit were constructed for the topic
"chemical reaction". They consist of the self-evaluation sheet, problem sheets,
information texts, information cards and model answers. The self-evaluation sheet is
structured in tabular form. Nine statements about the students' abilities are listed,
written in the first person singular. Each statement covers one subtopic of the topic
"chemical reaction". Students assess themselves on a four-point Likert scale ranging from
"I am very confident" to "I am not confident at all". On the self-evaluation sheet, there
are direct links to the problem sheets and information texts that deal with the
particular subtopic. Besides, there is space to document what material has been used.
Students are encouraged to record their learning progress by evaluating themselves
again after a while, this time using a pen in a different colour. For those students who
are very confident in every subtopic, challenging exercises are provided. Students
work autonomously with the learning material. They decide in what order they use it.
Feedback on the correctness of their answers can be obtained through model answers.
A multiple-choice test was developed to assess students' chemistry performance
regarding the topic "chemical reaction" (35 items with five alternatives, Cronbach's
alpha = .85). Each statement on the self-evaluation sheet was covered by three or four
items of the multiple-choice test. Pre-, post- and follow-up-tests consisted of identical
items but differed in the order of the items. Students' feedback on the learning
unit was assessed with a feedback questionnaire (17 items with a five-point
Likert scale ranging from 1 ("completely agree") to 5 ("completely disagree"),
Cronbach's alpha = .89). Some of these items have been adapted from the
"Questionnaire to assess current motivation in learning situations" (QCM)
(Rheinberg, Vollmeyer, & Burns, 2001).
Intellectual performance was measured as a control variable using one scale of the
CFT 20 (Weiß, 1998). Additionally, the academic self-concept and the differential
academic self-concept concerning mathematics and physics were assessed (Rost,
Sparfeldt, & Schilling, 2007; Schöne, Dickhäuser, Spinath, & Stiensmeier-Pelster,
2002) since self-concept and self-evaluation are interrelated (Rost et al., 2007).

Participants
Students attending the seventh grade of upper secondary schools in Germany
participated in this study (N = 234). Data sets of 218 students could be used in the
pre/post data analyses. Because of additional drop-out, 207 data sets were
included in the pre/follow-up analyses. The mean age of the sample was 13.23 years
(SD = 0.43), and the percentage of male students was 57.3 %.

Design
To answer the research questions, two experimental groups were created: The
self-evaluation group worked with the self-evaluation sheet and the learning material. The
control group did not work with the self-evaluation sheet, but with the same learning
material. In order to understand the structure of the material, this group got a short
overview of the subtopics the material deals with.
One week before the unit, chemistry performance, intellectual performance and
academic self-concept were assessed (ca. 60 minutes). Based on matched pairs,
students were assigned to the two learning conditions. 108 students worked in the
control group and 110 in the self-evaluation group. Both groups worked
simultaneously with the learning material in two different rooms. Each group was
guided by one experimenter. The learning unit took 90 minutes and started with a
10-minute introduction. In order to analyse the learning behaviour, 72 randomly
chosen students were videotaped during the lessons, and the learning material of all
students who worked in the unit was scanned (n = 230). One week after the unit,
chemistry performance was measured again. Additionally, differential self-concept
and feedback on the unit were assessed (ca. 45 minutes). Four weeks after the
post-test, data on chemistry performance were collected for the third time
(follow-up-test, ca. 30 minutes).
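
One common way to form matched pairs is sketched below in Python. The text does not describe the matching procedure in detail, so everything here is an assumption for illustration: students are ranked on a single hypothetical pre-test score, neighbouring students form a pair, and the two members of each pair are randomly split between the conditions.

```python
import random
from typing import Dict, List, Tuple


def assign_matched_pairs(pretest: Dict[str, float], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Pair students with neighbouring pre-test scores and split each pair at random."""
    rng = random.Random(seed)  # seeded so that the assignment is reproducible
    ranked = sorted(pretest, key=lambda student: pretest[student])
    control: List[str] = []
    self_evaluation: List[str] = []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        control.append(pair[0])
        self_evaluation.append(pair[1])
    if len(ranked) % 2 == 1:
        # With an odd number of students, the last one is assigned at random.
        rng.choice([control, self_evaluation]).append(ranked[-1])
    return control, self_evaluation


# Hypothetical student IDs and pre-test scores.
pretest_scores = {"s01": 12, "s02": 20, "s03": 13, "s04": 19, "s05": 7, "s06": 8}
control, self_evaluation = assign_matched_pairs(pretest_scores)
print(sorted(control), sorted(self_evaluation))
```

Splitting each pair keeps the two groups comparable on the matching variable while leaving the assignment within a pair to chance.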

RESULTS
For the group comparisons, a residual analysis was carried out. With regard to the results of
the pilot study, it was expected that the work with the self-evaluation sheet would lead
to higher learning outcomes directly after the unit and one month later. The analysis
of the residuals of pre- and post-test revealed a significant difference between the
groups, t_pre-post(216) = 2.53, p = .012, d = 0.35. The self-evaluation group scored
significantly higher (M = 0.17, SD = 0.91) than the control group (M = -0.17, SD =
1.06). Exploring a long-term effect, the results of the residual analysis of pre- and
follow-up-test indicated similar findings, t_pre-follow-up(205) = 2.14, p = .033, d = 0.29,
with better learning outcomes of the self-evaluation group (M = 0.14, SD = 1.00) than
the control group (M = -0.15, SD = 0.98). Besides, it was expected that students'
feelings towards the unit would be more positive in the self-evaluation group. The results of
the analysis of the feedback questionnaire demonstrated that there is a significant
difference, t_feedback(214) = 3.13, p = .002, d = 0.50. Students of the self-evaluation
group had more positive feelings towards the lesson (M = 1.75, SD = 0.56) than
students of the control group (M = 2.00, SD = 0.65).
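
As a plausibility check, effect size and t value can be recomputed from the reported summary statistics. The Python sketch below uses the pooled-standard-deviation form of Cohen's d; that the authors used exactly this variant is an assumption, and small deviations from the reported values are to be expected because the published means and standard deviations are rounded.

```python
from math import sqrt


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(m1: float, sd1: float, n1: int, m2: float, sd2: float, n2: int) -> float:
    """Cohen's d based on the pooled standard deviation."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def t_from_d(d: float, n1: int, n2: int) -> float:
    """t statistic of the independent-samples t-test (equal variances) implied by d."""
    return d / sqrt(1 / n1 + 1 / n2)


# Pre/post residuals as reported: self-evaluation group (n = 110), control group (n = 108).
d = cohens_d(0.17, 0.91, 110, -0.17, 1.06, 108)
t = t_from_d(d, 110, 108)
print(round(d, 2), round(t, 2))
```

The recomputed values lie close to the reported d = 0.35 and t(216) = 2.53.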
With regard to the findings of previous empirical research, interaction effects between group
and cognitive performance level were analysed. The results of the one-way ANOVA
on pre-post residuals revealed a significant main effect of the cognitive performance
level, F(2, 215) = 4.52, p = .012, with higher learning outcomes of the high achieving
students (M = 0.29, SD = 0.77) than of the average (M = -0.15, SD = 0.98) and low
achievers (M = -0.16, SD = 1.14). As the t-test already indicated, the main effect of
the learning condition is also significant, F(1, 216) = 5.23, p = .023, with students of
the self-evaluation group scoring higher than students of the control group. The
interaction of group and cognitive performance level was not significant, F(2, 215) =
0.37, p = .693.

DISCUSSION AND CONCLUSION


The present study aimed at exploring the effects of a self-evaluation sheet on
students' achievements in chemistry education. To this end, it followed an
experimental pre/post/follow-up design with two experimental groups: firstly, the
self-evaluation group, in which students worked with the constructed learning
material and a self-evaluation sheet; and secondly, the control group, in which students
worked with the same material but without the self-evaluation instrument. Using
matched pairs, students were assigned to one of the groups so that the groups are
comparable.
The results of the residual data analyses showed that the self-evaluation group
achieved higher learning outcomes directly after the individualised unit than the
control group. This difference remained significant in the long run, i.e. in the follow-up test.
Additionally, students of the self-evaluation group had more positive feelings towards
the work in the unit than the control group.
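
The idea of a residual analysis can be sketched as follows: the post-test score is regressed on the pre-test score for the whole sample, and the groups are then compared on the residuals, i.e. on how much better or worse students did than predicted from their pre-test. The Python sketch below uses hypothetical data; standardising the residuals is an assumption suggested by the reported standard deviations close to 1, not a documented step of the study.

```python
from statistics import mean, pstdev
from typing import List


def standardised_residuals(pre: List[float], post: List[float]) -> List[float]:
    """Regress post on pre (ordinary least squares) and z-standardise the residuals."""
    pre_mean, post_mean = mean(pre), mean(post)
    slope = (sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post))
             / sum((x - pre_mean) ** 2 for x in pre))
    intercept = post_mean - slope * pre_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(pre, post)]
    sd = pstdev(residuals)
    return [r / sd for r in residuals]


# Hypothetical scores: the first four students belong to the self-evaluation group,
# the last four to the control group.
pre = [10, 12, 15, 18, 20, 11, 14, 16]
post = [14, 15, 20, 22, 27, 12, 15, 18]
z = standardised_residuals(pre, post)

self_evaluation_mean = mean(z[:4])
control_mean = mean(z[4:])
print(round(self_evaluation_mean, 2), round(control_mean, 2))
```

Comparing residuals rather than raw post-test scores removes the part of the post-test performance that is already explained by prior knowledge.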
The findings concerning the comparisons of chemistry performance showed small
effect sizes. This can be due to the fact that both groups worked in a very similar way:
they worked with the same learning material and in the same period of time. The only
difference between the groups was the use of the self-evaluation sheet. Further
analysis of the video data and the scanned material has to be done to explore whether
the learning behaviour differs between the groups.
In contrast to findings of empirical research, the results of this study did not indicate
an interaction of the learning condition and cognitive performance level. High
achievers as well as average and low achieving students profit from the work with the
self-evaluation sheet.
More specific analyses of the accuracy and quality of the self-evaluations are
needed to determine whether students used this diagnostic instrument as
intended. In addition, the influence of academic self-concept on
self-evaluation and learning behaviour has to be explored. It might also be
important to investigate the effects of self-evaluation on students in higher
grades.

Future research might address how to assess self-regulated learning in this
context and whether working with self-evaluation sheets can empower students to
become more self-regulated.
It can be concluded that working with the self-evaluation sheet could be an
effective way to implement individualised teaching methods in chemistry
education. A focus on students' responsibility for their own learning might be
one way to develop such methods. The results indicate that the self-evaluation
sheet supports students regardless of their cognitive performance level.

REFERENCES
Baumert, J., Roeder, P. M., Sang, F., & Schmitz, B. (1986). Leistungsentwicklung
und Ausgleich von Leistungsunterschieden in Gymnasialklassen [Development of
achievement and compensation of differences in achievements of upper secondary
students]. Zeitschrift für Pädagogik, 32(5), 639-660.
Bode, R. K. (1996). Is it ability grouping or the tailoring of instruction that makes a
difference in student achievement? Paper presented at the annual meeting of the
American Educational Research Association, New York.
Gruehn, S. (2000). Unterricht und schulisches Lernen: Schüler als Quellen der
Unterrichtsbeschreibung [Education and academic achievement: Students as
sources to describe education in school]. Münster u. a.: Waxmann Verlag.
Hattie, J. (2012). Visible learning for teachers: Maximizing impact on learning.
London and New York: Routledge.
Helmke, A. (1988). Leistungssteigerung und Ausgleich von Leistungsunterschieden
in Schulklassen: unvereinbare Ziele? [Increase of achievement and compensation
of differences in achievements: contradictory aims?] Zeitschrift für
Entwicklungspsychologie und Pädagogische Psychologie, 20(1), 45-76.
Isaacs, B. (2010). Bringing the Montessori Approach to your Early Years Practice.
London and New York: Routledge.
Kunze, I. (2009). Begründungen und Problembereiche individueller Förderung in der
Schule - Vorüberlegungen zu einer empirischen Untersuchung [Reasons and
problems of individualised teaching - Preliminary considerations of an empirical
study]. In I. Kunze & C. Solzbacher (Eds.), Individuelle Förderung in der
Sekundarstufe I und II [Individualised teaching in secondary schools] (pp. 13-26).
Baltmannsweiler: Schneider-Verlag Hohengehren.
Perels, F., Dignath, C., & Schmitz, B. (2009). Is it possible to improve mathematical
achievement by means of self-regulation strategies? Evaluation of an intervention
in regular math classes. European Journal of Psychology of Education, 24(1),
17-31.
Perels, F., Gürtler, T., & Schmitz, B. (2005). Training of self-regulatory and
problem-solving competence. Learning and Instruction, 15(2), 123-139.
Rheinberg, F., Vollmeyer, R., & Burns, B. D. (2001). QCM: A questionnaire to assess
current motivation in learning situations. Retrieved from
http://www.psych.uni-potsdam.de/people/rheinberg/messverfahren/FAMLangfassung.pdf
Rost, D. H., Sparfeldt, J. R., & Schilling, S. R. (2007). DISK-Gitter mit SKSLF-8.
Differentielles Schulisches Selbstkonzept-Gitter mit Skala zur Erfassung des
Selbstkonzepts schulischer Leistungen und Fähigkeiten [Differential academic
self-concept test with one scale for assessing self-concept regarding academic
achievement and academic skills]. Göttingen: Hogrefe.
Schöne, C., Dickhäuser, O., Spinath, B., & Stiensmeier-Pelster, J. (2002). SESSKO.
Skalen zur Erfassung des schulischen Selbstkonzepts [Scales for assessing
academic self-concept]. Göttingen: Hogrefe.
Schunk, D. H. (1982/1983). Progress Self-Monitoring: Effects on Children's
Self-Efficacy and Achievement. Journal of Experimental Education, 51(2), 89-93.
Trautmann, M., & Wischer, B. (2008). Das Konzept der Inneren Differenzierung -
eine vergleichende Analyse der Diskussion der 1970er Jahre mit dem
aktuellen Heterogenitätsdiskurs [The conception of internal differentiation - a
comparative analysis of the discussion in the 1970s and the current discourse
on heterogeneity]. In M. A. Meyer, M. Prenzel, & S. Hellekamps (Eds.), Zeitschrift
für Erziehungswissenschaft: Sonderheft 9. Perspektiven der Didaktik (pp.
159-172). Wiesbaden: VS Verlag für Sozialwissenschaften.
Weiß, R. H. (1998). Grundintelligenztest Skala 1 (CFT 20) [Standard intelligence test
scale 1 (CFT 20)]. Göttingen: Hogrefe.
Zimmerman, B. J. (1990). Self-Regulated Learning and Academic Achievement: An
Overview. Educational Psychologist, 25(1), 3-17.
Zimmerman, B. J. (2002). Becoming a Self-Regulated Learner: An Overview. Theory
into Practice, 41(2), 64-70.
Zimmerman, B. J. (2005). Attaining Self-Regulation: A social cognitive perspective.
In M. Boekaerts, P. R. Pintrich, & M. Zeidner (Eds.), Handbook of self-regulation
(pp. 13-39). Burlington, San Diego, London: Elsevier Academic Press.
Zimmerman, B. J., & Martinez-Pons, M. (1988). Construct validation of a strategy
model of student self-regulated learning. Journal of Educational Psychology, 80(3),
284-290.
