




Topic 1 introduces the meanings of test, measurement, evaluation and
assessment, outlines the historical development of language assessment,
and describes the changing trends of language assessment in the Malaysian
context.

By the end of this topic, you will be able to:

o define and explain the important terms of test, measurement,
  evaluation, and assessment;
o examine the historical development in Language Assessment;
o describe the changing trends in Language Assessment in the
  Malaysian context and discuss the contributing factors.






SESSION ONE (3 hours)


Assessment and examinations are viewed as highly important in most Asian
countries such as Malaysia. Language tests and assessment have also
become a prevalent part of our education system. Often, public examination
results are taken as important national measures of school accountability.
While schools are ranked and classified according to their students'
performance in major public examinations, scores from language tests are
used to infer individuals' language ability and to inform decisions we make
about those individuals.

In this topic, let us discuss the concept of measurement and its
numerous definitions. We will also look into the historical development of
language assessment and the changing trends of language assessment in
our country.


DEFINITION OF TERMS: test, measurement, evaluation, and assessment


1.4.1 Test
The four terms above are frequently used interchangeably in academic
discussions. A test is a subset of assessment intended to measure
a test-taker's language proficiency, knowledge, performance or skills. Testing
is one type of assessment technique. It is a systematically prepared procedure
that happens at a point in time when a test-taker gathers all his abilities to
achieve ultimate performance because he knows that his responses are being
evaluated and measured. A test is first a method of measuring a test-taker's
ability, knowledge or performance in a given area; and second it must
Bachman (1990), who was also quoted by Brown, defined a test as a
process of quantifying a test-taker's performance according to explicit
procedures or rules.

1.4.2 Assessment
Assessment is often a misunderstood term. Assessment is a
comprehensive process of planning, collecting, analysing, reporting, and
using information on students over time (Gottlieb, 2006, p. 86). Mousavi
(2009) is of the opinion that assessment is appraising or estimating the level
or magnitude of some attribute of a person. Assessment is an important
aspect of the fields of language testing and educational measurement and,
perhaps, the most challenging part of it. It is an ongoing process in
educational practice, which involves a multitude of methodological techniques.
It can consist of tests, projects, portfolios, anecdotal information and student
self-reflection. A test may be assessed formally or informally, subconsciously
or consciously, and incidentally or intentionally by an appraiser.

1.4.3 Evaluation
Evaluation is another confusing term. Many people confuse
evaluation with testing. Evaluation does not necessarily entail testing. In
reality, evaluation is involved when the results of a test (or other assessment
procedure) are used for decision-making (Bachman, 1990, pp. 22-23).
Evaluation involves the interpretation of information. If a teacher simply
records numbers or makes check marks on a chart, it does not constitute
evaluation. When a tester or marker evaluates, s/he values the results in
such a way that the worth of the performance is conveyed to the test-taker.
This is usually done with some reference to the consequences, either good or
bad, of the performance. This is commonly practised in applied linguistics
research, where the focus is often on describing processes, individuals, and
groups, and the relationships among language use, the language use
situation, and language ability.

Test scores are an example of measurement, and conveying the
meaning of those scores is evaluation. However, evaluation can occur
without measurement. For example, if a teacher appraises a student's correct
oral response with words like "Excellent insight, Lilly!", it is evaluation.

1.4.4 Measurement
Measurement is the assigning of numbers to certain attributes of
objects, events, or people according to a rule-governed system. For our
purposes of language testing, we will limit the discussion to unobservable
abilities or attributes, sometimes referred to as traits, such as grammatical
knowledge, strategic competence or language aptitude. Similar to other types
of assessment, measurement must be conducted according to explicit rules
and procedures as spelled out in test specifications, criteria, and procedures
for scoring. Measurement could be interpreted as the process of quantifying
the observed performance of classroom learners. Bachman (1990) cautioned
us to distinguish between quantitative and qualitative descriptions. Simply
put, the former involves assigning numbers (including rankings and letter
grades) to observed performance, while the latter consists of written
descriptions, oral feedback, and non-quantifiable reports.

The relationships among test, measurement, assessment, and their
uses are illustrated in Figure 1.

Figure 1: The relationship between tests, measurement and assessment.

(Source: Bachman, 1990)

Historical development in language assessment

From the mid-1960s through the 1970s, language testing practice, as
reflected in large-scale institutional language testing and in most language
testing textbooks of the time, was informed essentially by a theoretical view of
language ability as consisting of skills (listening, speaking, reading and
writing) and components (e.g. grammar, vocabulary, pronunciation) and
an approach to test design that focused on testing isolated discrete
points of language, while the primary concern was with psychometric
reliability (e.g. Lado, 1961; Carroll, 1968). Language testing research was
dominated largely by the hypothesis that language proficiency consisted of a
single unitary trait, and a quantitative, statistical research methodology (Oller,
The 1980s saw other areas of expansion in language testing, most
importantly, perhaps, in the influence of second language
acquisition (SLA) research, which spurred language testers to investigate
not only the effects of a wide variety of factors such as field
independence/dependence (e.g. Stansfield and Hansen, 1983; Hansen, 1984;
Chapelle, 1988), academic discipline and background knowledge (e.g.
Erickson and Molly, 1983; Alderson and Urquhart, 1985; Hale, 1988) and
discourse domains (Douglas and Selinker, 1985) on language test
performance, but also the strategies involved in the process of test-taking
itself (e.g. Grotjahn, 1986; Cohen, 1987).

If the 1980s saw a broadening of the issues and concerns of language
testing into other areas of applied linguistics, the 1990s saw a continuation of
this trend. In this decade the field also witnessed expansions in a number of
areas:

o research methodology;
o practical advances;
o factors that affect performance on language tests;
o authentic, or performance, assessments; and
o concerns with the ethics of language testing and professionalising
  the field.

The beginning of the new millennium is another exciting time for
anyone interested in language testing and assessment research. Current
developments in the fields of applied linguistics, language learning and
pedagogy, technological innovation, and educational measurement have
opened up some rich new research avenues.

Changing trends in Language Assessment: Malaysian context

History has clearly shown that teaching and assessment should be
intertwined in education. Assessment and examinations are viewed as highly
important in Malaysia. One does not need to look very far to see how
important testing and assessment have become in our education system.
Often, public examination results are taken as important national measures of
school accountability. Schools are ranked and classified according to their
students' performance in major public examinations. Just as assessment
impacts student learning and motivation, it also influences the nature of
instruction in the classroom. There has been considerable recent literature
that has promoted assessment as something that is integrated with instruction,
and not an activity that merely audits learning (Shepard, 2000). When
assessment is integrated with instruction, it informs teachers about what
activities and assignments will be most useful, what level of teaching is most
appropriate, and how summative assessments provide diagnostic information.
With this in mind, we have to look at the changing trends in
assessment, particularly language assessment, in this country, which has been
carried out mainly through the examination system until recent years. Starting
from the year 1845, written tests in schools were introduced for a number of
subjects. This trend in assessment continued with the intent to gauge the
effectiveness of the teaching-learning process. In Malaysia, the development
of formal evaluation and testing in education began after Independence.
Public examinations have long been the only measurement of students'
achievement. Figure 1 shows the four stages/phases of development of the
examination system in our country. The stages are as follows:

o Razak Report
o Rahman Talib Report
o Cabinet Report
o Malaysia Education Blueprint (2013-2025)

On 3rd May 1956, the Examination Unit (later known as the Examination
Syndicate) in the Ministry of Education (MOE) was formed on the
recommendation of the Razak Report (1956). The main objective of the
Malaysia Examination Syndicate (MES) was to fulfil one of the Razak Report's
recommendations, which was to establish a common examination system for
all the schools in the country.

In line with the ongoing transformation of the national educational
system, the current scenario is gradually changing. A new evaluation system
known as School-Based Assessment (SBA) was introduced in 2002 as a
move away from traditional teaching, to keep abreast of changing trends in
assessment and to gauge the competence of students by taking into
consideration both academic and extracurricular achievements.

According to the Malaysian Ministry of Education (MOE), the new
assessment system aims to promote a combination of centralised and
school-based assessment. The Malaysian Teacher Education Division (TED) is
entrusted by the Ministry of Education to formulate policies and guidelines to
prepare teachers for the new implementation of assessment. As emphasised in
the innovation of the student assessment, continuous school-based assessment
is administered at all grades and all levels. Additionally, students sit for
common public examinations at the end of each level. The role of teachers in
the new assessment system is also vital. Teachers will be empowered to
assess their students.

The Malaysia Education Blueprint was launched in September this
year, and with it, a three-wave initiative to revamp the education system over
the next 12 years. One of its main focuses is to overhaul the national
curriculum and examination system, widely seen as heavily content-based
and un-holistic. It is a timely move, given our poor results in the 2009
Programme for International Student Assessment (PISA) tests. Based on the
2009 assessment, Malaysia lags far behind regional peers like Singapore,
Japan, South Korea, and Hong Kong in every category.

Poor performance in PISA is normally linked to students not being able
to demonstrate higher-order thinking skills. To remedy this, the Ministry of
Education has started to implement numerous changes to the examination
system. Two out of the three nationwide examinations that we currently
administer to primary and secondary students have gradually seen major
changes. Generally, the policies are ideal and impressive, but there are still a
few questions on feasibility that have been raised by concerned parties.

Figure 2 below shows the development of educational evaluation in Malaysia
from pre-independence until today.


Pre-Independence
    Examinations were conducted according to the needs of school or based
    on overseas examinations such as the Overseas School

Implementation of the Razak Report (1956)
    The Razak Report gave birth to the National Education Policy and the
    creation of the Examination Syndicate (LP). LP conducted examinations
    such as the Cambridge and Malayan Secondary School Entrance
    Examination (MSSEE), and Lower Certificate of Education (LCE)
    Examination.

Implementation of the Rahman Talib Report (1960)
    The Rahman Talib Report recommended the following actions:
    1. Extend schooling age to 15 years old.
    2. Automatic promotion to higher classes.
    3. Multi-stream education (Aneka Jurusan).
    The following changes in examination were made:
    - The entry of elective subjects in LCE and
    - The introduction of the Standard 5 Evaluation Examination.
    - The introduction of Malaysia's Vocational Education Examination.
    - The introduction of the Standard 3 Diagnostic

Implementation of the Cabinet Report (1979)
    The implementation of the Cabinet Report resulted in the evolution of
    the education system to its present state, especially with KBSR and
    KBSM. Adjustments were made in examination to fulfil the new
    curriculum's needs and to ensure it is in line with the National
    Education Philosophy.

Implementation of the Malaysia Education Blueprint (2013-2025)
    The emphasis is on School-Based Assessment (SBA). It was first
    introduced in 2002. It is a new system of assessment and is one of the
    new areas where teachers are directly involved. The national
    examination and school-based assessments are being revamped in stages,
    whereby by 2016 at least 40% of questions in Ujian Penilaian Sekolah
    Rendah (UPSR) and 50% in Sijil Pelajaran Malaysia (SPM) are
    higher-order thinking skills questions.

Figure 2: The development of educational evaluation in Malaysia

Source: Malaysia Examination Board (MES)

By and large, the role of MES is to complement and complete the
implementation of the national education policy. Among its achievements are:

o Implementation of the Open
o of Malay Language as the Language (1960)
o of Malaysia Recognition of
o Putting in place an examination system to meet national
o Pioneering the use of computer in the country
o Taking over the work of the
Figure 3: The achievements of Malaysia Examination Syndicate (MES)

Source: Malaysia Examination Board (MES)

Describe the stages involved in the development of
educational evaluation in Malaysia.

Tutorial question
Examine the contributing factors to the changing trends of
language assessment.
Create and present findings using graphic organisers.






Topic 2 gives you an insight into the reasons/purposes of assessment. It
also looks at the different types of assessments and the classifications of tests
according to their purpose.


By the end of this topic, you will be able to:

o explain the reasons/purposes of assessment;
o distinguish the differences between assessment of learning and
  assessment for learning;
o name and differentiate the different test types.


Role and Purposes of Assessment in Teaching and Learning
o Reasons / Purposes of Assessment
o Assessment of Learning / Assessment for Learning
o Types of Tests: Diagnostic, Aptitude, and Placement Tests

SESSION TWO (3 hours)

Reasons/Purpose of Assessment

Critical to educators is the use of assessment to both inform and guide
instruction. Using a wide variety of assessment tools allows a teacher to
determine which instructional strategies are effective and which need to be
modified. In this way, assessment can be used to improve classroom practice,
plan curriculum, and research one's own teaching practice. Of course,
assessment will always be used to provide information to children, parents,
and administrators. In the past, this information was primarily expressed by a
"grade". Increasingly, this information is being seen as a vehicle to empower
students to be self-reflective learners who monitor and evaluate their own
progress as they develop the capacity to be self-directed learners. In addition
to informing instruction and developing learners with the ability to guide their
own instruction, assessment data can be used by a school district to measure
student achievement, examine the opportunity for children to learn, and
provide the basis for the evaluation of the district's language programmes.

Assessment instruments, whether formal tests or informal
assessments, serve multiple purposes. Commercially designed and
administered tests may be used for measuring proficiency, placing students
into one of several levels of a course, or diagnosing students' strengths and
weaknesses according to specific linguistic categories, among other
purposes. Classroom-based teacher-made tests might be used to diagnose
difficulty or measure achievement in a given unit of a course. Specifying the
purpose of an assessment instrument and stating its objectives is an
essential first step in choosing, designing, revising, or adapting the procedure
an educator will finally use.
We need to rethink the role of assessment in effective schools, where
"effective" means maximising learning for the most students. What uses of
assessment are most likely to maximise student learning and well-being? How
best can we use assessment in the service of student learning and well-being?
We have a traditional answer to these questions. Our traditional answer says
that to maximise student learning we need to develop rigorous standardised
tests given once a year to all students at approximately the same time. Then,
the results are used for accountability, identifying schools for additional
assistance, and certifying the extent to which individual students are meeting

Let us take a closer look at the two assessments below, i.e.
Assessment of Learning and Assessment for Learning.


Assessment of Learning
Assessment of learning is the use of a task or an activity to measure,
record, and report on a student's level of achievement with regard to specific
learning expectations.

This traditional way of using assessment in the service of student
learning is assessment of learning: assessments that take place at a point in
time for the purpose of summarising the current status of student
achievement. This type of assessment is also known as summative
assessment.

This summative assessment, the logic goes, will provide the focus to
improve student achievement, give everyone the information they need to
improve student achievement, and apply the pressure needed to motivate
teachers to work harder to teach and learn.

Assessment for learning

Now compare this to assessment for learning. Assessment for
learning is roughly equivalent to formative assessment: assessment
intended to promote further improvement of student learning during the
learning process.

Assessment for learning is more commonly known as formative and
diagnostic assessment. Assessment for learning is the use of a task or an
activity for the purpose of determining student progress during a unit or block
of instruction. Teachers are thus afforded the chance to adjust classroom
instruction based upon the needs of the students. Similarly, students are
provided with valuable feedback on their own learning.
Formative assessment is not a new idea to us as educators. However,
during the past several years there has been an explosion of
applications linked to sound research. In this evolving conception, formative
assessment is more than testing frequently, although frequent information is
important. Formative assessment also involves actually adjusting teaching to
take account of these frequent assessment results. Nonetheless, formative
assessment is even more than using information to plan next
steps. Formative assessment seems to be most effective when students are
involved in their own assessment and goal setting.

Types of tests
The most common use of language tests is to identify strengths and
weaknesses in students' abilities. For example, through testing we can
discover that a student has excellent oral abilities but a relatively low level of
reading comprehension. Information gleaned from tests also assists us in
deciding who should be allowed to participate in a particular course or
programme area. Another common use of tests is to provide information
about the effectiveness of programmes of instruction.

Henning (1987) identifies six kinds of information that tests provide about
students. They are:
o Diagnosis and feedback
o Screening and selection
o Placement
o Program evaluation
o Providing research criteria
o Assessment of attitudes and socio-psychological differences

Alderson, Clapham and Wall (1995) have a different classification
scheme. They sort tests into these broad categories: proficiency,
achievement, diagnostic, progress, and placement. Brown (2010), however,
categorised tests according to their purpose, namely achievement tests,
diagnostic tests, placement tests, proficiency tests, and aptitude tests.
Proficiency Tests

Proficiency tests are not based on a particular curriculum or language
programme. They are designed to assess the overall language ability of
students at varying levels. They may also tell us how capable a
person is in a particular language skill area. Their purpose is to describe what
students are capable of doing in a language.

Proficiency tests are usually developed by external bodies such as
examination boards like Educational Testing Service (ETS) or Cambridge
ESOL. Some proficiency tests have been standardised for international use,
such as the American TOEFL test, which is used to measure the English
language proficiency of foreign college students who wish to study in North
American universities, or the British-Australian IELTS test, designed for those
who wish to study in the United Kingdom or Australia (Davies et al., 1999).
Achievement Tests

Achievement tests are similar to progress tests in that their purpose is
to see what a student has learned with regard to stated course outcomes.
However, they are usually administered at the mid- and end-point of the semester
or academic year. The content of achievement tests is generally based on
the specific course content or on the course objectives. Achievement tests
are often cumulative, covering material drawn from an entire course or
Diagnostic Tests

Diagnostic tests seek to identify those language areas in which a
student needs further help. Harris and McCann (1994, p. 29) point out that
where other types of tests are based on success, diagnostic tests are based
on failure. The information gained from diagnostic tests is crucial for further
course activities and providing students with remediation. Because diagnostic
tests are difficult to write, placement tests often serve a dual function of both
placement and diagnosis (Harris & McCann, 1994; Davies et al., 1999).
Aptitude Tests

This type of test no longer enjoys the widespread use it once had. An
aptitude test is designed to measure general ability or capacity to learn a
foreign language a priori (before taking a course) and to predict ultimate
success in that undertaking. Language aptitude tests were seemingly
designed to apply to the classroom learning of any language. In the United
States, two common standardised English Language tests once used were
the Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1958) and the
Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966). Since there is
no research to show unequivocally that these kinds of tasks predict
communicative success in a language, apart from untutored language
acquisition, standardised aptitude tests are seldom used today, with the
exception of identifying foreign language disability (Stansfield & Reed, 2004).
Progress Tests

These tests measure the progress that students are making towards
defined course or programme goals. They are administered at various stages
throughout a language course to see what the students have learned,
perhaps after certain segments of instruction have been completed. Progress
tests are generally teacher-produced and are narrower in focus than
achievement tests because they cover a smaller amount of material and
assess fewer objectives.

Placement Tests

These tests, on the other hand, are designed to assess students' level
of language ability for placement in an appropriate course or class. This type
of test indicates the level at which a student will learn most effectively. The
main aim is to create groups that are homogeneous in level. In designing a
placement test, the test developer may choose to base the test content either
on a theory of general language proficiency or on learning objectives of the
curriculum. In the former, institutions may choose to use a well-established
proficiency test such as the TOEFL or IELTS exam and link it to curricular
benchmarks. In the latter, tests are based on aspects of the syllabus taught
at the institution concerned.

In some contexts, students are placed according to their overall rank in
the test results. At other institutions, students are placed according to their
level in each individual skill area. Elsewhere, placement test scores are used
to determine if a student needs any further instruction in the language or could
matriculate directly into an academic programme.
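
As a rough illustration of placement by overall score, the Python sketch below assigns each student to a course level using cut scores. The level names, the cut scores and the student scores are all hypothetical and chosen only for illustration; an institution would set its own against its curricular benchmarks.

```python
from bisect import bisect_right

# Hypothetical cut scores: the minimum overall score (out of 100)
# needed to enter each level above the lowest one.
CUT_SCORES = [40, 60, 80]
LEVELS = ["Beginner", "Elementary", "Intermediate", "Advanced"]


def place(score):
    """Return the course level for an overall placement-test score."""
    # bisect_right counts how many cut scores the student has reached.
    return LEVELS[bisect_right(CUT_SCORES, score)]


# Hypothetical students and their overall scores.
scores = {"Aina": 35, "Ben": 60, "Chong": 79, "Devi": 92}
placements = {name: place(score) for name, score in scores.items()}

for name, level in placements.items():
    print(f"{name}: {level}")
```

Placing students by individual skill area would follow the same pattern, with one score and one set of cut scores per skill.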

Discuss and present the various types of tests and assessment
tasks that students have experienced.
Discuss the extent to which tests or assessment tasks serve their purpose.

The end of the topic. Happy reading!




Topic 3 provides input on basic testing terminology. It looks at the definitions,
purposes and differences of various tests.

By the end of this topic, you will be able to:

o explain the meaning and purpose of different types of language tests;
o compare Norm-Referenced Test and Criterion-Referenced Test,
  Formative and Summative Tests, and Objective and Subjective Tests.


Types of Tests
o Norm-Referenced and Criterion-Referenced
o Formative and Summative
o Objective and Subjective


Norm-Referenced Test (NRT)

According to Brown (2010), in NRTs an individual test-taker's score is
interpreted in relation to a mean (average score), median (middle score),
standard deviation (extent of variance in scores), and/or percentile rank. The
purpose of such tests is to place test-takers along a mathematical continuum
in rank order. In a test, scores are commonly reported back to the test-taker
in the form of a numerical score, for example 250 out of 300, and a percentile
rank, for instance 78 percent, which denotes that the test-taker's score was
higher than 78 percent of the total number of test-takers but lower than 22
percent in the administration. In other words, an NRT is administered to compare
an individual's performance with that of his peers and/or to compare a group
with other groups. In School-Based Evaluation, NRT is used for summative
evaluation, such as in the end-of-year examination for the streaming and
selection of students.
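
The statistics named above can be sketched in a few lines of Python. The scores below are hypothetical, and percentile rank is taken here, following the description above, as the percentage of test-takers who scored lower than a given score; other definitions exist, so this is one reasonable reading rather than the only one.

```python
from statistics import mean, median, pstdev


def percentile_rank(score, all_scores):
    """Percentage of test-takers who scored strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical scores out of 300 for one administration of a test.
scores = [180, 195, 210, 225, 240, 250, 260, 270, 285, 290]

print("Mean:", mean(scores))                  # average score
print("Median:", median(scores))              # middle score
print("Standard deviation:", round(pstdev(scores), 1))
print("Percentile rank of 250:", percentile_rank(250, scores))
```

Note that the same raw score of 250 would receive a different percentile rank in a stronger or weaker group, which is exactly what makes the interpretation norm-referenced.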

Criterion-Referenced Test (CRT)

Gottlieb (2006), on the other hand, refers to criterion-referenced tests as
the collection of information about student progress or achievement in relation
to a specified criterion. In a standards-based assessment model, the
standards serve as the criteria or yardstick for measurement. Following
Glaser (1973), the word criterion means the use of score values that can be
accepted as the index of attainment to a test-taker. Thus, CRTs are designed
to provide feedback to test-takers, mostly in the form of grades, on specific
course or lesson objectives. The Curriculum Development Centre (2001) defines
CRT as an approach that provides information on students' mastery based on
the criteria determined by the teacher. These criteria are based on learning
outcomes or objectives as specified in the syllabus. The main advantage of
CRTs is that they allow testers to make inferences about how much
language proficiency (in the case of language proficiency tests) or knowledge
and skills (in the case of academic achievement tests) test-takers/students
originally have, and about their successive gains over time. As
opposed to NRTs, CRTs focus on students' mastery of a subject matter
(represented in the standards) along a continuum instead of ranking students
on a bell curve. Table 3 below shows the differences between
Norm-Referenced Test (NRT) and Criterion-Referenced Test (CRT).
NRT: A test that measures students' achievement as compared to other
     students in the group
CRT: An approach that provides information on students' mastery based
     on a criterion specified by the teacher

NRT: Determine performance difference among individual and groups
CRT: Determine learning mastery based on specified criterion and

Test Item
NRT: From easy to difficult level and able to discriminate examinees'
     ability
CRT: Guided by minimum achievement in the related objectives

NRT: Continuous assessment
CRT: Continuous assessment in the classroom

NRT: Summative evaluation
CRT: Formative evaluation

NRT: Public exams: UPSR,
CRT: Mastery test: monthly test, coursework, project, exercises in the

Table 3: The differences between Norm-Referenced Test (NRT) and
Criterion-Referenced Test (CRT)
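
To make the contrast with norm-referencing concrete, the Python sketch below judges one student against a fixed criterion for each objective, without reference to how other students performed. The 80% mastery threshold, the objectives and the item counts are hypothetical; in practice the teacher sets the criterion from the syllabus.

```python
# Hypothetical criterion: at least 80% of items correct on an objective.
MASTERY_THRESHOLD = 0.8


def mastery_report(results):
    """Map each objective to True if the mastery criterion is met.

    `results` maps an objective to (items_correct, items_total).
    """
    return {
        objective: correct / total >= MASTERY_THRESHOLD
        for objective, (correct, total) in results.items()
    }


# Hypothetical results for one student on a monthly test.
student = {
    "simple past tense": (9, 10),
    "reading for main idea": (6, 10),
    "writing topic sentences": (8, 10),
}

report = mastery_report(student)
for objective, mastered in report.items():
    print(f"{objective}: {'mastered' if mastered else 'not yet mastered'}")
```

The report would be the same whatever the rest of the class scored, since only the criterion matters.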

Formative Test
Formative test or assessment, as the name implies, is a kind of
feedback teachers give students while the course is progressing. Formative
assessment can be seen as assessment for learning. It is part of the
instructional process. We can think of formative assessment as practice.
With continual feedback the teachers may assist students to improve their
performance. The teachers point out what the students have done wrong
and help them to get it right. This can take place when teachers examine the
results of achievement and progress tests. Based on the results of formative
test or assessment, the teachers can suggest changes to the focus of the
curriculum or emphasis on some specific lesson elements. On the other hand,
students may also need to change and improve. Due to the demanding
nature of this formative test, numerous teachers prefer not to adopt it,
although giving back any assessed homework or achievement test presents
both teachers and students with healthy and ultimate learning opportunities.

Summative Test

Summative test or assessment, on the other hand, refers to the kind of
measurement that summarises what the student has learnt or gives a one-off
measurement. In other words, summative assessment is assessment of
student learning. Students are more likely to experience assessment carried
out individually where they are expected to reproduce discrete language items
from memory. The results are then used to yield a school report and to
determine what students know and do not know. It does not necessarily
provide a clear picture of an individual's overall progress or even his/her full
potential, especially if s/he is hindered by the fear factor of physically sitting for
a test, but may provide straightforward and invaluable results for teachers to
analyse. It is given at a point in time to measure student achievement in
relation to a clearly defined set of standards, but it does not necessarily show
the way to future progress. It is given after learning is supposed to occur.
End-of-year tests in a course and other general proficiency or public exams are
some examples of summative tests or assessment. Table 3.1 shows
formative and summative assessments that are common in schools.
Formative Assessment
Anecdotal records
Quizzes and essays

Summative Assessment
Final exams
National exams (UPSR, PMR, SPM,
Diagnostic tests
Entrance exams
Table 3.1: Common formative and summative assessments in schools


Objective Test
According to BBC Teaching English, an objective test is a test that

consists of right or wrong answers or responses and thus it can be marked

objectively. Objective tests are popular because they are easy to prepare and
take, quick to mark, and provide a quantifiable and concrete result. They tend
to focus more on specific facts than on general ideas and concepts.

The types of objective tests include the following:


Multiple-choice items/questions;

Matching items/questions; and

Fill-in-the-blanks items/questions.

In this topic, let us focus on multiple-choice questions, which may look
easy to construct but are, in reality, very difficult to build correctly. This is
congruent with the viewpoint of Hughes (2003, pp. 76-78), who warns against
many weaknesses of multiple-choice questions. The weaknesses include:

It may limit beneficial washback;

It may enable cheating among test-takers;

It is very challenging to write successful items;

This technique strictly limits what can be tested;

This technique tests only recognition knowledge;

It may encourage guessing, which may have a considerable effect on
test scores.

Let's look at some important terminology used when designing
multiple-choice questions. This type of objective test item involves five
terms, namely:

1. Receptive or selective response
Items in which the test-taker chooses from a set of responses rather than
creating a response (creating a response is commonly called a supply type
of response).

2. Stem
Every multiple-choice item consists of a stem (the body of the item
that presents a stimulus). The stem is the question or assignment in an item.
It may be a complete or open, positive or negative sentence. The stem must
be short, simple, compact and clear. However, it must not easily give away
the right answer.

3. Options or alternatives
These are the list of possible responses to a test item.
There are usually between three and five options/alternatives to
choose from.

4. Key
This is the correct response. The response can either be the correct one
or the best one. Usually, for a good item, the correct answer is not
obvious when compared to the distractors.

5. Distractors
A distractor is a "disturber" that is included to distract students from
selecting the correct answer. An excellent distractor is almost the same as
the correct answer, but it is not.

When building multiple-choice items for both classroom-based and
large-scale standardised tests, consider the four guidelines below:

Design each item to measure a single objective;

State both stem and options as simply and directly as possible;

Make certain that the intended answer is clearly the one correct answer;

(Optional) Use item indices to accept, discard or revise items.


Subjective Test
Contrary to an objective test, a subjective test is evaluated by giving an
opinion, usually based on agreed criteria. Subjective tests include essay,
short-answer, vocabulary, and take-home tests. Some students become very
anxious about these tests because they feel their writing skills are not up to
par. In reality, a subjective test provides more opportunity for test-takers to
demonstrate their understanding and/or in-depth knowledge and skills in
the subject matter. In this case, test-takers might provide some acceptable,
alternative responses that the tester, teacher or test developer did not
predict. Generally, subjective tests will test the higher skills of analysis,
synthesis, and evaluation. In short, subjective tests enable students to be
more creative and critical. Table 3.2 shows various types of objective and
subjective assessments.
Objective Assessments       Subjective Assessments
True/False Items            Extended-response Items
Multiple-choice Items       Restricted-response Items
Multiple-response Items
Matching Items

Table 3.2: Various types of objective and subjective assessments

Some have argued that the distinction between objective and
subjective assessments is neither useful nor accurate because, in reality,
there is no such thing as objective assessment. In fact, all assessments are
created with inherent biases built into decisions about relevant subject matter
and content, as well as cultural (class, ethnic, and gender) biases.

Objective test items are items that have only one correct answer or
response. Describe in depth the multiple-choice test item.

Subjective test items allow for subjectivity in the responses given by
the test-takers. Explain in detail the various types of subjective test items.

1. Identify at least three differences between formative and summative
assessments.
2. What are the strengths of multiple-choice items compared to essay
items?
3. Informal assessments are often unreliable, yet they are still
important in classrooms. Explain why this is the case, and defend
your explanation with examples.
4. Compare and contrast Norm-Referenced Test with Criterion-Referenced Test.





Topic 4 defines the basic principles of assessment (reliability, validity,
practicality, washback, and authenticity) and the essential sub-categories
within reliability and validity.

By the end of this topic, you will be able to:



define the basic principles of assessment (reliability, validity,
practicality, washback, and authenticity) and the essential sub-categories
within reliability and validity;

explain the differences between validity and reliability;

distinguish the different types of validity and reliability in tests
and other instruments in language assessment.






SESSION FOUR (3 hours)

Assessment is a complex, iterative process requiring skills,
understanding, and knowledge in the exercise of professional judgment. In
this process, there are five important criteria that testers ought to look into
when testing a test: reliability, validity, practicality, washback and
authenticity. Since these five principles are context dependent, there is no
priority order implied in the order of presentation.

Reliability means the degree to which an assessment tool produces
stable and consistent results. It is a concept which is easily
misunderstood (Feldt & Brennan, 1989).
Reliability essentially denotes consistency, stability, dependability,
and accuracy of assessment results (McMillan, 2001a, p. 65 in Brown, G. et
al., 2008). Since there is tremendous variability from teacher to teacher or
tester to tester that affects student performance, reliability in planning,
implementing, and scoring student performances gives rise to valid
assessment.
Fundamentally, a reliable test is consistent and dependable. If a
tester administers the same test to the same test-taker or matched test-takers
on two occasions, the test should give the same results. In a validity
chain, test administrators need to be sure that scoring is carried out
properly. If the scores used by the tester do not accurately reflect what the
test-taker actually did, would not be awarded by another marker, or would not
be received on a similar assessment, then these scores lack reliability.
Errors occur in scoring in many ways: for example, giving Level 2 when
another rater would give Level 4, adding up marks wrongly, transcribing
scores from test paper to database inaccurately, students performing really
well on the first half of the assessment and poorly on the second half due to
fatigue, and so on. Thus, lack of reliability in the scores students receive is
a threat to validity.
According to Brown (2010), a reliable test can be described as follows:

It is consistent in its conditions across two or more administrations;
It gives clear directions for scoring/evaluation;
It has uniform rubrics for scoring/evaluation;
It lends itself to consistent application of those rubrics by the scorer;
It contains items/tasks that are unambiguous to the test-taker.

4.4.1 Rater Reliability

When humans are involved in the measurement procedure,
there is a tendency towards error, bias and subjectivity in determining
the scores of similar tests. There are two kinds of rater reliability, namely
inter-rater reliability and intra-rater reliability.
Inter-rater reliability refers to the degree of similarity between
different testers or raters: can two or more testers/raters, without
influencing one another, give the same marks to the same set of scripts?
(Contrast this with intra-rater reliability.)

One way to test inter-rater reliability is to have each rater assign
each test item a score. For example, each rater might score items on a
scale from 1 to 10. Next, you would calculate the correlation between the
two ratings to determine the level of inter-rater reliability. Another means
of testing inter-rater reliability is to have raters determine which category
each observation falls into and then calculate the percentage of agreement
between the raters. So, if the raters agree 8 out of 10 times, the test has
an 80% inter-rater reliability rate. Rater reliability is assessed by having
two or more independent judges score the test. The scores are then
compared to determine the consistency of the raters' estimates.
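
The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores and category judgements are invented, the function names `pearson` and `percent_agreement` are hypothetical, and Pearson's coefficient is assumed as the correlation measure, since the text does not name one.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def percent_agreement(cats_a, cats_b):
    """Percentage of observations both raters placed in the same category."""
    matches = sum(1 for a, b in zip(cats_a, cats_b) if a == b)
    return 100 * matches / len(cats_a)

# Hypothetical scores (scale 1 to 10) given by two raters to the same six scripts.
rater_a = [7, 5, 9, 4, 6, 8]
rater_b = [6, 5, 9, 3, 7, 8]
r = pearson(rater_a, rater_b)

# Hypothetical category judgements for ten observations; the raters agree on 8.
cats_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
cats_b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]
agreement = percent_agreement(cats_a, cats_b)

print(f"correlation between raters: {r:.2f}")
print(f"percentage agreement: {agreement:.0f}%")
```

With these invented data the raters agree on 8 of 10 observations, which matches the 80% example in the text. Correlation suits numerical scores, while percentage agreement suits category judgements.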

Intra-rater reliability is an internal factor. Its main concern is
consistency within the rater. For example, if a rater (teacher) has many
examination papers to mark and does not have enough time to mark them,
s/he might take much more care with the first, say, ten papers, than the
rest. This inconsistency will affect the students' scores; the first ten might
get higher scores. In other words, while inter-rater reliability involves two
or more raters, intra-rater reliability is the consistency of grading by a
single rater. Scores on a test are rated by a single rater/judge at different
times. When we grade tests at different times, we may become
inconsistent in our grading for various reasons. Some papers that are
graded during the day may get our full and careful attention, while
others that are graded towards the end of the day are very quickly
glossed over. As such, intra-rater reliability determines the
consistency of our grading.

Both inter- and intra-rater reliability deserve close attention in
that test scores are likely to vary from rater to rater or even within the
same rater (Clark, 1979).

4.4.2 Test Administration Reliability

There are a number of factors which influence test
administration reliability. Unreliability occurs due to outside
interference like noise, variations in photocopying, temperature
variations, the amount of light in various parts of the room, and even
the condition of desks and chairs. Brown (2010) stated that he once
witnessed the administration of a test of aural comprehension in which
an audio player was used to deliver items for comprehension, but due
to street noise outside the building, test-takers sitting next to open
windows could not hear the stimuli clearly. According to him, that was
a clear case of unreliability caused by the conditions of the test
administration.

4.4.3 Factors influencing Reliability

Figure 4.4.3 Factors that affect the reliability of a test (labels surviving
from the figure: Test Factor; Teacher and Student Factor; Marking Factor)

The outcome of a test is influenced by many factors.
Assuming that the factors are constant and not subject to
change, a test is considered to be reliable if the scores
are consistent and not different from other equivalent and
reliable test scores. However, tests are not free from
errors. Factors that affect the reliability of a test include
test length factors, teacher and student factors,
environment factors, test administration factors, and
marking factors.
a. Test length factors
In general, longer tests produce higher reliabilities. Because scores on a
short test depend more on chance and guessing, the scores will be more
accurate if the test is longer. An objective test has
higher consistency because it is not exposed to a variety of
interpretations. A valid test is said to be reliable but a reliable test need
not be valid. A consistent score does not necessarily measure what it is
intended to measure. In addition, test items are only samples of
the subject being tested, and variation in the samples between
two equivalent tests can be one of the causes of unreliable test
outcomes.
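
The text does not give a formula for the link between test length and reliability. One standard way to estimate it, assumed here purely for illustration, is the Spearman-Brown prophecy formula, which predicts the reliability of a lengthened or shortened test on the assumption that the added items are comparable to the existing ones. The starting reliability of 0.60 and the function name are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added (or removed) items are comparable to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

original = 0.60                          # hypothetical reliability of the current test
doubled = spearman_brown(original, 2)    # test made twice as long
halved = spearman_brown(original, 0.5)   # test cut to half its length

print(f"original: {original:.2f}, doubled: {doubled:.2f}, halved: {halved:.2f}")
```

Under these assumptions, doubling the test raises the predicted reliability and halving it lowers it, which is consistent with the general claim above.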


Teacher-Student factors
In most tests, it is normal for teachers to construct and
administer tests for students. Thus, a good teacher-student
relationship would help increase the consistency of the results. Other
factors that contribute positively to the reliability of a test
include the teacher's encouragement, positive mental and physical
condition, familiarity with the test formats, and perseverance and

Environment factors
An examination environment certainly influences test-takers and
their scores. Any favourable environment with comfortable chairs and
desks, good ventilation, sufficient light and space will improve the
reliability of the test. On the contrary, a non-conducive environment will
affect test-takers' performance and test reliability.


Test administration factors

Because students' grades are dependent on the way tests are
administered, test administrators should strive to provide clear and
accurate instructions, sufficient time and careful monitoring of tests to
improve the reliability of their tests. A test-retest technique can be
used to determine test reliability.

Marking factors

Unfortunately, we human judges have many opportunities to introduce
error in our scoring of essays (Linn & Gronlund, 2000; Weigle, 2002). It
is possible that our scoring invalidates many of the interpretations we
would like to make based on this type of assessment. Brennan (1996)
has reported that in large-scale, high-stakes marking panels that are
tightly trained and monitored, marker effects are small. Hence, it can
be concluded that in low-stakes, small-scale marking, there is
potentially a large error introduced by individual markers. It is also
common for different markers to award different marks for the same
answer even with a prepared mark scheme. A marker's assessment
may vary from time to time and with different situations. In contrast, this
does not happen with objective tests since the responses are
fixed. Thus, objectivity is a condition for reliability.


Validity refers to the evidence base that can be provided about the
appropriateness of the inferences, uses, and consequences that come from
assessment (McMillan, 2001a). Appropriateness has to do with the
soundness, trustworthiness, or legitimacy of the claims or inferences that
testers would like to make on the basis of obtained scores. Clearly, we have
to evaluate the whole assessment process and its constituent parts by how
soundly we can defend the consequences that arise from the inferences and
decisions we make. Validity, in other words, is not a characteristic of a test
or assessment, but a judgment, which can have varying degrees of strength.

So, the second characteristic of good tests is validity, which refers to
whether the test is actually measuring what it claims to measure. This is
important for us as we do not want to make claims concerning what a student
can or cannot do based on a test when the test is actually measuring
something else. Validity is usually determined logically, although several
types of validity may use correlation coefficients.

According to Brown (2010), a valid test of reading ability actually
measures reading ability and not 20/20 vision, or previous knowledge of a
subject, or some other variable of questionable relevance. To measure
writing ability, one might ask students to write as many words as they can in
15 minutes, then simply count the words for the final score. Such a test is
practical (easy to administer) and the scoring quite dependable (reliable).
However, it would not constitute a valid test of writing ability without taking
into account comprehensibility, rhetorical discourse elements, and the
organisation of ideas.
The following are the different types of validity:

Face validity: Do the assessment items appear to be appropriate?

Content validity: Does the assessment content cover what you want to
assess? Have satisfactory samples of language and language skills been
selected for testing?

Construct validity: Are you measuring what you think you're measuring? Is
the test based on the best available theory of language and language use?

Concurrent validity: Can you use the current test score to estimate scores
of other criteria? Does the test correlate with other existing measures?

Predictive validity: Is it accurate for you to use your existing students'
scores to predict future students' scores? Does the test successfully predict
future outcomes?

It is fairly obvious that a valid assessment should have a good coverage of
the criteria (concepts, skills and knowledge) relevant to the purpose of the
examination. The important notion here is the purpose.

Figure 4.5: Types of Validity (a. Face Validity; b. Content Validity;
c. Construct Validity; d. Concurrent Validity; e. Predictive Validity)

4.5.1 Face validity
Face validity is validity which is determined impressionistically,
for example by asking students whether the examination was
appropriate to their expectations (Henning, 1987). Mousavi (2009)
refers to face validity as the degree to which a test looks right, and
appears to measure the knowledge or abilities it claims to measure,
based on the subjective judgement of the examinees who take it, the
administrative personnel who decide on its use, and other
psychometrically unsophisticated observers.
It is pertinent that a test looks like a test even at first impression.
If students taking a test do not feel that the questions given to them are
a test or part of a test, then the test may not be valid, as the
students may not take the questions seriously. The test,
hence, will not be able to measure what it claims to measure.

4.5.2 Content validity

Content validity is concerned with whether or not the content of
the test is sufficiently representative and comprehensive for the test to
be a valid measure of what it is supposed to measure (Henning,
1987). The most important step in making sure of content validity is to
make sure all content domains are represented in the test. Another
method to verify validity is through the use of a Table of Test
Specification that can give detailed information on each content area, level
of skills, level of difficulty, number of items, and item representation
for rating in each content area, skill or topic.
We can quite easily imagine taking a test after going through an
entire language course. How would you feel if at the end of the course,
your final examination consists of only one question that covers one
element of language from the many that were introduced in the
course? If the language course was a conversational course focusing
on the different social situations that one may encounter, how valid is a
final examination that requires you to demonstrate your ability to place
an order at a posh restaurant in a five-star hotel?

4.5.3 Construct validity

A construct is a psychological concept used in measurement.
Construct validity is the most obvious reflection of whether a test
measures what it is supposed to measure as it directly addresses the
issue of what it is that is being measured. In other words, construct
validity refers to whether the underlying theoretical constructs that the
test measures are themselves valid. Proficiency, communicative
competence, and fluency are examples of linguistic constructs;
self-esteem and motivation are psychological constructs.
Fundamentally, every issue in language learning and teaching
involves theoretical constructs. Consider, for instance, assessing a
student's oral proficiency. To possess construct validity, the test
should consist of various components of fluency: speed, rhythm,
juncture, (lack of) hesitations, and other elements within the construct
of fluency. Tests are, in a manner of speaking, operational definitions
of constructs in that their test tasks are the building blocks of the entity
that is being measured (see Davidson, Hudson, & Lynch, 1985; T.
McNamara, 2000).

4.5.4 Concurrent validity

Concurrent validity is the use of another more reputable and
recognised test to validate one's own test. For example, suppose you
come up with your own new test and would like to determine the
validity of your test. If you choose to use concurrent validity, you would
look for a reputable test and compare your students' performance on
your test with their performance on the reputable and acknowledged
test. In concurrent validity, a correlation coefficient is obtained and
used to generate an actual numerical value. A high positive correlation
of 0.7 to 1 indicates that the learners' scores are relatively similar for the
two tests or measures.
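
As a rough sketch of this procedure, the snippet below correlates scores on a new test with scores on an established test and checks the result against the 0.7 to 1 band mentioned above. The eight pairs of scores are invented, Pearson's coefficient is an assumption (the text says only "a correlation coefficient"), and the names `pearson` and `HIGH_THRESHOLD` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / spread

HIGH_THRESHOLD = 0.7  # lower bound of the "high positive" band named in the text

# Hypothetical scores of the same eight students on both tests.
new_test = [55, 62, 70, 48, 81, 66, 74, 59]
established_test = [58, 60, 75, 50, 85, 63, 70, 61]

r = pearson(new_test, established_test)
is_high = HIGH_THRESHOLD <= r <= 1.0

print(f"correlation: {r:.2f}, within the 0.7 to 1 band: {is_high}")
```

A coefficient inside the band suggests that students are ranked similarly by both tests. It does not show that either test is valid for a different purpose.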

For example, in a course unit whose objective is for students to
be able to orally produce voiced and unvoiced stops in all possible
phonetic environments, the results of one teacher's unit test might be
compared with an independent assessment such as a commercially
produced test of similar phonemic proficiency. Since criterion-related
evidence usually falls into one of two categories, concurrent and
predictive validity, a classroom test designed to assess mastery of a
point of grammar in communicative use will have criterion validity if
test scores are verified either by observed subsequent behaviour or by
other communicative measures of the grammar point in question.

4.5.5 Predictive validity

Predictive validity is closely related to concurrent validity in that
it too generates a numerical value. For example, the predictive validity
of a university language placement test can be determined several
semesters later by correlating the scores on the test with the GPA of the
students who took the test. Therefore, a test with high predictive validity
is a test whose scores predict results on a later measure. A
simple example of tests that may be concerned with predictive validity
is the trial national examinations conducted at schools in Malaysia, as they
are intended to predict the students' performance on the actual SPM
national examinations (Norleha Ibrahim, 2009).

As mentioned earlier, validity is a complex concept, yet it is
crucial to the teacher's understanding of what makes a good test. It is
good to heed Messick's (1989, p. 36) caution that validity is not an
all-or-none proposition and that various forms of validity may need to be
applied to a test in order to be satisfied with its overall effectiveness.

What are reliability and validity? What determines the reliability of a
test?

What are the different types of validity? Describe any three types and
cite examples.

4.5.6 Practicality
Although practicality is an important characteristic of tests, it is
above all a limiting factor in testing. There will be situations in which,
after we have already determined what we consider to be the most valid
test, we need to reconsider the format purely because of practicality
issues. A valid test of spoken interaction, for example, would require
that the examinees be relaxed, interact with peers and speak on topics
that they are familiar and comfortable with. This sounds like the kind of
conversation that people have with their friends while sipping
afternoon tea at roadside stalls. Of course such a situation would
be a highly valid measure of spoken interaction if we could set it up.
Imagine if we even tried to do so. It would require hidden cameras as
well as a lot of telephone calls and money.

Therefore, a more practical form of the test, especially if it is to
be administered at the national level as a standardised test, is to have
a short interview session of about fifteen minutes using perhaps a
picture or reading stimulus that the examinees would describe or
discuss. Practicality issues, although limiting in a sense,
cannot be dismissed if we are to come up with a useful assessment of
language ability. Practicality issues can involve economics or costs,
administration considerations such as time and scoring procedures, as
well as the ease of interpretation. Tests are only as good as how well
they are interpreted. Therefore, tests that cannot be easily interpreted
will definitely cause many problems.

4.5.7 Objectivity
The objectivity of a test relates to the teachers/examiners who mark
the answer scripts. Objectivity refers to the extent to which an examiner
awards consistent scores to the same answer script. The test is said to
have high objectivity when the examiner is able to give the same score
to similar answers guided by the mark scheme. An objective test is a
test that has the highest level of objectivity because the scoring is not
influenced by the examiner's skills and emotions. Meanwhile, a
subjective test is said to have the lowest objectivity. Based on various
studies, different examiners tend to award different scores to an essay
test. It is also possible that the same examiner would give different
scores to the same essay if s/he were to re-check it at different times.
4.5.8 Washback effect
The term 'washback' or 'backwash' (Hughes, 2003, p. 1)
refers to the impact that tests have on teaching and learning. Such
impact is usually seen as being negative: tests are said to force
teachers to do things they do not necessarily wish to do. However, some
have argued that tests are potentially also 'levers for change' in
language education, the argument being that if a bad test has negative
impact, a good test should or could have positive washback (Alderson,
1986b; Pearson, 1988).

Cheng, Watanabe, and Curtis (2004) devoted an entire anthology
to the issue of washback, while Spratt (2005) challenged teachers to
become agents of beneficial washback in their language classrooms.
Brown (2010) discusses the factors that provide beneficial washback
in a test. He mentions that such a test can positively influence what and
how teachers teach and students learn; offers learners a chance to
adequately prepare; gives learners feedback that enhances their
language development; is more formative in nature than summative;
and provides conditions for peak performance by the learners.

In large-scale assessment, washback often refers to the effects
that tests have on instruction in terms of how students prepare for the
test. In classroom-based assessment, washback can have a number
of positive manifestations, ranging from the benefit of preparing and
reviewing for a test to the learning that accrues from feedback on one's
performance. Teachers can provide information that washes back to
students in the form of useful diagnoses of strengths and weaknesses.

The challenge to teachers is to create classroom tests that serve
as learning devices through which washback is achieved. Students'
incorrect responses can become a platform for further improvements.
On the other hand, their correct responses need to be complimented,
especially when they represent accomplishments in a student's
developing competence. Teachers can have various strategies in
providing guidance or coaching. Washback enhances a number of
basic principles of language acquisition, namely intrinsic motivation,
autonomy, self-confidence, language ego, interlanguage, and strategic
investment, among others.
Washback is generally said to be either positive or negative.
Unfortunately, students and teachers tend to think of the negative
effects of testing such as test-driven curricula and only studying and
learning what they need to know for the test. Positive washback, or
what we prefer to call 'guided washback', can benefit teachers,
students and administrators. Positive washback assumes that testing
and curriculum design are both based on clear course outcomes, which
are known to both students and teachers/testers. If students perceive
that tests are markers of their progress towards achieving these
outcomes, they have a sense of accomplishment. In short, tests must
be part of learning experiences for all involved. Positive washback
occurs when a test encourages good teaching practice.

Washback is particularly obvious when the tests or examinations
in question are regarded as being very vital and having a definite
impact on the student's or test-taker's future. We would expect, for
example, that national standardised examinations would have strong
washback effects compared to a school-based or classroom-based
test.

4.5.9 Authenticity
Another major principle of language testing is authenticity. It is a
concept that is difficult to define, particularly within the art and science
of evaluating and designing tests. Bachman and Palmer (1996), cited in
Brown (2010), defined authenticity as the degree of correspondence of the
characteristics of a given language test task to the features of a target
language task (p. 23) and then suggested an agenda for identifying
those target language tasks and for transforming them into valid test
items.
Language learners are motivated to perform when they are
faced with tasks that reflect real-world situations and contexts. Good
testing or assessment strives to use formats and tasks that reflect the
types of situation in which students would authentically use the target
language. Whenever possible, teachers should attempt to use
authentic materials in testing language skills.

4.6.0 Interpretability
Test interpretation encompasses all the ways that meaning is
assigned to the scores. Proper interpretation requires knowledge
about the test, which can be obtained by studying its manual and other
materials along with current research literature with respect to its
use; no one should undertake the interpretation of scores on any test
without such study. In any test interpretation, the following
considerations should be taken into account.
A. Consider Reliability: Reliability is important because it is a
prerequisite to validity and because the degree to which a score may
vary due to measurement error is an important factor in its
interpretation.
B. Consider Validity: Proper test interpretation requires knowledge of
the validity evidence available for the intended use of the test. Its
validity for other uses is not relevant. Indeed, use of a measurement
for a purpose for which it was not designed may constitute misuse.
The nature of the validity evidence required for a test depends upon its
use.
C. Scores, Norms, and Related Technical Features: The result of
scoring a test or subtest is usually a number called a raw score, which
by itself is not interpretable. Additional steps are needed to translate
the number directly into either a verbal description (e.g., pass or
fail) or into a derived score (e.g., a standard score). Less than full
understanding of these procedures is likely to produce errors in
interpretation and ultimately in counselling or other uses.
D. Administration and Scoring Variation: Stated criteria for score
interpretation assume standard procedures for administering and
scoring the test. Departures from standard conditions and procedures
modify and often invalidate these criteria.
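
Point C above can be illustrated with a small sketch that turns a raw score into a verbal description and into one kind of derived score. Everything specific here is an assumption made for illustration: the norm-group scores, the cut score of 50, the function names, and the choice of a z-score (raw score minus the group mean, divided by the group standard deviation) as the standard score. A published test would supply its own norms and conversion procedure in its manual.

```python
from statistics import mean, pstdev

def z_score(raw, group_scores):
    """Standard score: distance of a raw score from the group mean in SD units."""
    return (raw - mean(group_scores)) / pstdev(group_scores)

def verbal_description(raw, cut_score):
    """Translate a raw score into a simple pass/fail label."""
    return "pass" if raw >= cut_score else "fail"

# Hypothetical raw scores of a norm group and one student's raw score.
norm_group = [40, 50, 60, 70, 80]
raw = 70

z = z_score(raw, norm_group)                 # group mean 60, SD about 14.14
label = verbal_description(raw, cut_score=50)

print(f"raw score {raw}: z = {z:.2f}, result = {label}")
```

The same raw score of 70 would yield a different z-score against a different norm group, which is why a raw score by itself is described as not interpretable.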

Study some commercially produced tests and evaluate the
authenticity of these tests/test items.
Discuss the importance of authenticity in testing.
Based on samples of formative and summative assessments, discuss
aspects of reliability/validity that must be considered in these
assessments.
Discuss measures that a teacher can take to ensure high validity of
language assessment for the primary classroom.






Topic 5 introduces you to the stages of test construction, the preparation
of a test blueprint/test specifications, the elements in a Test Specifications
Guidelines, and the importance of following the guidelines for constructing
test items. Then we look at the various test formats that are appropriate for
language assessment.


By the end of this topic, you will be able to:

identify the different stages of test construction;

describe the features of a test specification;

draw up a test specification that reflects both the purpose and the
objectives of the test;

compare and contrast Bloom's taxonomy and SOLO taxonomy;

categorise test items according to Bloom's taxonomy;

discuss the elements of test items of high quality, reliability and
validity;

identify the elements in a Test Specifications Guidelines;

demonstrate an understanding of the importance of following the
guidelines for constructing test items;

illustrate test formats that are appropriate and meet the
requirements of the learning outcomes.



Stages of Test Construction

Preparing Test Blueprint / Test Specifications

Guidelines for Constructing Test Items

Bloom's and SOLO Taxonomies

Test Format

SESSION FIVE (3 hours)

Stages of Test Construction

Constructing a test is not an easy task; it requires a variety of skills
along with deep knowledge in the area for which the test is to be
constructed. The steps include:




5.3.1 Determining
The essential first step in testing is to be perfectly
clear about what one wants to know and for what purpose. When
we start to construct a test, the following questions have to be
answered:

Who are the examinees?

What kind of test is to be made?

What is the precise purpose?

What abilities are to be tested?

How detailed and how accurate must the results be?

How important is the backwash effect?

What constraints are set by the unavailability of expertise, facilities,
and time (for construction, administration, and scoring)?

What is the scope of the test?

5.3.2 Planning
The first form that the solution takes is a set of specifications for
the test. This will include information on content, format and timing,
criteria, levels of performance, and scoring procedures.
In this stage, the test constructor has to determine the content by
carrying out the following tasks:
- Describing the purpose of the test;
- Describing the characteristics of the test-takers, that is, the nature of the
population of examinees for whom the test is being designed;
- Defining the nature of the ability we want to measure;
- Developing a plan for evaluating the qualities of test usefulness, that is,
the degree to which a test is useful for teachers and students. Usefulness
includes six qualities: reliability, validity, authenticity, practicality, interactiveness, and impact;
- Identifying resources and developing a plan for their allocation and
management;
- Determining the format and timing of the test;
- Determining levels of performance;
- Determining scoring procedures.

5.3.3 Writing
Although writing items is time-consuming, writing good items is an art.
No one can expect to be able consistently to produce perfect items.
Some items will have to be rejected, others reworked. The best way to
identify items that have to be improved or abandoned is through
teamwork. Colleagues must really try to find fault; and despite the
seemingly inevitable emotional attachment that item writers develop to

items that they have created, they must be open to, and ready to
accept, the criticisms that are offered to them. Good personal relations
are a desirable quality in any test writing team.

Test item writers should possess the following characteristics:


They have to be experienced in test construction.

They have to be quite knowledgeable about the content of the test.

They should have the capacity to use language clearly.

They have to be ready to sacrifice time and energy.

Another basic aspect in writing the items of the test is sampling.

Sampling means that test constructors choose widely from the whole
area of the course content. It is most unlikely that everything found
under the heading of 'Content' in the specifications can be included in
any one version of the test. Choices have to be made for content
validity and for beneficial backwash. One should not concentrate solely
on elements known to be easy to test. Rather, the content of the test
should be a representative sample of the course material.

5.3.4 Preparing
One has to understand the major principles and techniques of
preparing test items, and have experience in applying them. Not every teacher can make a
good tester. To construct different kinds of tests, the tester should
observe some principles. In production-type tests, we have to bear
in mind that no comments are necessary. Test writers should also try to
avoid test items that can be answered through test-wiseness.
Test-wiseness refers to the capacity of the examinees to utilise the
characteristics and formats of the test to guess the correct answer.

5.3.5 Reviewing
Principles for reviewing test items:
- The test should not be reviewed immediately after its construction,
but after some considerable time.
- Other teachers or testers should review it. In a language test, it is
preferable if native speakers are available to review the test.

5.3.6 Pre-testing
After reviewing the test, it should be submitted to pre-testing.
- The tester should administer the newly developed test to a group of
examinees similar to the target group; the purpose is to analyse
every individual item as well as the whole test.
- Numerical data (test results) should be collected to check the
efficiency of each item; these should include item facility and
discrimination.

5.3.7 Validating
Item Facility (IF) shows to what extent an item is easy or difficult.
Items should be neither too easy nor too difficult. To measure the
facility or easiness of an item, the following formula is used:
IF = number of correct responses (c) / total number of candidates (N)
To measure item difficulty, the number of wrong responses (w) is used instead:
Item difficulty = number of wrong responses (w) / total number of candidates (N)
The results of both equations range from 0 to 1. An item with a
facility index of 0 is too difficult, and one with an index of 1 is too easy. The ideal item is
one with a value of 0.5, and the acceptable range for item facility is
0.37 to 0.63, i.e. an item below 0.37 is difficult, and an item above 0.63 is
easy.
Thus, tests which are too easy or too difficult for a given sample
population often show low reliability. As noted in Topic 4, reliability is
one of the complementary aspects of measurement.
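
The item facility calculation can be tried out with a short Python sketch. This is only an illustration under stated assumptions: the function names, the item labels and the response counts are hypothetical and do not come from any real test; the 0.37 and 0.63 cut-offs are the acceptability limits given in this section.

```python
def _proportion(count: int, total: int) -> float:
    """Return count / total after checking that the figures make sense."""
    if total <= 0:
        raise ValueError("total number of candidates (N) must be positive")
    if not 0 <= count <= total:
        raise ValueError("number of responses must lie between 0 and N")
    return count / total


def item_facility(correct: int, total: int) -> float:
    """IF = c / N: the proportion of candidates who answered correctly."""
    return _proportion(correct, total)


def item_difficulty(wrong: int, total: int) -> float:
    """w / N: the proportion of candidates who answered wrongly."""
    return _proportion(wrong, total)


def classify_item(facility: float, low: float = 0.37, high: float = 0.63) -> str:
    """Label an item using the acceptability range quoted in this section."""
    if facility < low:
        return "difficult"
    if facility > high:
        return "easy"
    return "acceptable"


if __name__ == "__main__":
    # Hypothetical pre-test data: 30 candidates, number of correct answers per item.
    candidates = 30
    correct_counts = {"Q1": 28, "Q2": 15, "Q3": 6}
    for item, correct in correct_counts.items():
        value = item_facility(correct, candidates)
        print(f"{item}: IF = {value:.2f} ({classify_item(value)})")
```

If every candidate answers either correctly or wrongly, the two indices add up to 1, so an item answered correctly by half of the candidates has both a facility and a difficulty of 0.5.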

Preparing Test Blueprint / Test Specifications

Test specifications (specs) for classroom use can be an outline of your
test, that is, what it will look like (Brown, 2010). Consider your test
specs as a blueprint of the test that includes the following:

a description of its content

item types (methods, such as multiple-choice, cloze, etc.)

tasks (e.g. written essay, reading a short passage, etc.)

skills to be included

how the test will be scored

how it will be reported to students

For classroom purposes (Davidson & Lynch, 2002), the specs
are your guiding plan for designing an instrument that effectively fulfils
your desired principles, especially validity.
It is vital to note that for large-scale standardised tests such as the Test
of English as a Foreign Language (TOEFL), the International
English Language Testing System (IELTS), the Michigan English
Language Assessment Battery (MELAB), and the like, which are intended
to be widely distributed and thus are broadly generalised, test
specifications are much more formal and detailed (Spaan, 2006). They
are also usually confidential so that the institution that is designing the
test can ensure the validity of subsequent forms of a test.
Many language teachers claim that it is difficult to construct an item. In
reality, it is rather easy to develop an item if we are committed to
planning the measuring instruments used to evaluate students.
However, what exactly is an item for a test? An item is a tool,
an instrument, instruction or question used to get feedback from
test-takers, which is evidence of something that is being measured. This
feedback is useful information to consider when measuring or
assessing a construct. Items can be classified as recall items and thinking items. A
recall item is an item that requires one to recall in order to answer, and
a thinking item is an item that requires test-takers to use their
thinking skills to attempt it.
For instance, consider a grammar unit test that will be administered at
the end of a three-week grammar course for high-beginning adult
learners (Level 2). The test covers verb tenses. The students also attend
two integrated-skills classes (listening/speaking and reading/writing),
and the grammar class they attend serves to reinforce the grammatical
forms that they have learnt in those two classes.
Based on the scenario above, the test specs that you design
might consist of four sequential steps:
1. a broad outline of how the test will be organised
2. which of the eight sub-skills you will test
3. what the various tasks and item types will be
4. how results will be scored, reported to students, and used in future
classes (washback)
Besides knowing the purpose of the test you are creating, you
are required to know as precisely as possible what it is you want to
test. Do not conduct a test hastily. Instead, you need to examine the
objectives for the unit you are testing carefully.
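
One way to keep the elements of a classroom test blueprint in view is to record them in a simple data structure. The Python sketch below is a hypothetical illustration, not a prescribed format: the class name `Blueprint`, its field names and the sample values are assumptions, chosen to mirror the six elements listed in this section (content, item types, tasks, skills, scoring and reporting).

```python
from dataclasses import dataclass


@dataclass
class Blueprint:
    """A minimal record of the elements a classroom test spec should cover."""
    content: str            # description of the test content
    item_types: list[str]   # methods, e.g. multiple-choice, cloze
    tasks: list[str]        # e.g. written essay, reading a short passage
    skills: list[str]       # skills to be included
    scoring: str            # how the test will be scored
    reporting: str          # how results will be reported to students

    def missing_elements(self) -> list[str]:
        """Return the names of elements that have been left empty."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    # Hypothetical draft spec for the grammar unit test scenario; two
    # elements are deliberately left empty to show the check at work.
    draft = Blueprint(
        content="verb tenses covered in the three-week grammar course",
        item_types=["multiple-choice", "cloze"],
        tasks=[],
        skills=["grammar"],
        scoring="one mark per correct item",
        reporting="",
    )
    print("Still to be decided:", draft.missing_elements())
```

A check of this kind is only a reminder that every element has been considered; it does not judge whether the choices made are valid for the learners concerned.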

Bloom's and SOLO Taxonomies

5.5.1 Bloom's Taxonomy (Revised)
Bloom's Taxonomy is a systematic way of describing how a
learner's performance develops from simple to complex levels in the
affective, psychomotor and cognitive domains of learning. The original
Taxonomy provided carefully developed definitions for each of the six
major categories in the cognitive domain. The categories were
Knowledge, Comprehension, Application, Analysis, Synthesis, and
Evaluation. With the exception of Application, each of these was
broken into subcategories. The complete structure of the original
Taxonomy is shown in Figure 5.1.

Figure 5.1: Original Terms of Bloom's Taxonomy

Retrieved from: http://www.

The categories were ordered from simple to complex and from
concrete to abstract. Further, it was assumed that the original
Taxonomy represented a cumulative hierarchy; that is, mastery of each
simpler category was prerequisite to mastery of the next more complex
one. Unfortunately, traditional education tends to base
student learning mainly on the cognitive domain. In the original Taxonomy, the
Knowledge category embodied both noun and verb aspects. The noun
or subject matter aspect was specified in Knowledge's extensive
subcategories. The verb aspect was included in the definition given to
Knowledge in that the student was expected to be able to recall or
recognise knowledge. This brought uni-dimensionality to the framework
at the cost of a Knowledge category that was dual in nature and thus
different from the other Taxonomic categories. In the 1990s, Anderson
(a former student of Bloom) eliminated this inconsistency in the revised
Taxonomy by allowing these two aspects, the noun and verb, to form
separate dimensions, the noun providing the basis for the Knowledge
dimension and the verb forming the basis for the Cognitive Process
dimension, as shown in Figure 5.2.

Figure 5.2: Bloom's Revised Taxonomy

Retrieved from: http://www.

In the revised Bloom's Taxonomy, the names of the six major
categories were changed from noun to verb forms. As the taxonomy
reflects different forms of thinking and thinking is an active process,
verbs were used instead of nouns.
Besides, the subcategories of the six major categories were
also replaced by verbs and some subcategories were re-organised.
The knowledge category was renamed. Knowledge is an outcome or
product of thinking, not a form of thinking per se. Consequently, the
word "knowledge" was inappropriate to describe a category of thinking
and was replaced with the word "remembering" instead. Comprehension
and synthesis were retitled "understanding" and "creating" respectively,
in order to better reflect the nature of the thinking defined in each
category. Table 3 below provides a summary of the above.
Table 3: The Cognitive Process Dimension

Each level gives the category and its definition, followed by its cognitive
processes (alternative names in brackets) and their definitions.

Level 1 (C1) Remember: Retrieve relevant knowledge from long-term memory.
   Recognising: Locating knowledge in long-term memory that is consistent
   with presented material.
   Recalling: Retrieving relevant knowledge from long-term memory.

Level 2 (C2) Understand: Construct meaning from instructional messages,
including oral, written, and graphic communication.
   Interpreting: Changing from one form of representation to another.
   Exemplifying: Finding a specific example or illustration of a concept
   or principle.
   Classifying: Determining that something belongs to a category.
   Summarising: Abstracting a general theme or major point(s).
   Inferring: Drawing a logical conclusion from presented information.
   Comparing: Detecting correspondences between two ideas, objects, and
   the like.
   Explaining (Constructing models): Constructing a cause-and-effect model
   of a system.

Level 3 (C3) Apply: Carry out or use a procedure in a given situation.
   Executing (Carrying out): Applying a procedure to a familiar task.
   Implementing: Applying a procedure to an unfamiliar task.

Level 4 (C4) Analyse: Break material into its constituent parts and
determine how the parts relate to one another and to an overall structure
or purpose.
   Differentiating: Distinguishing relevant from irrelevant parts or
   important from unimportant parts of presented material.
   Organising (Finding coherence): Determining how elements fit or function
   within a structure.
   Attributing: Determining a point of view, bias, values, or intent
   underlying presented material.

Level 5 (C5) Evaluate: Make judgments based on criteria and standards.
   Checking: Detecting inconsistencies or fallacies within a process or
   product; determining whether a process or product has internal
   consistency; detecting the effectiveness of a procedure as it is being
   implemented.
   Critiquing: Detecting inconsistencies between a product and external
   criteria; determining whether a product has external consistency;
   detecting the appropriateness of a procedure for a given problem.

Level 6 (C6) Create: Put elements together to form a coherent or functional
whole; reorganise elements into a new pattern or structure.
   Generating: Coming up with alternative hypotheses based on criteria.
   Planning: Devising a procedure for accomplishing some task.
   Producing: Inventing a product.
The Knowledge Dimension

Categories and Definitions

Factual Knowledge: The basic elements students must know to be
acquainted with a discipline or solve problems in it.
Conceptual Knowledge: The interrelationships among the basic elements
within a larger structure that enable them to function together.
Procedural Knowledge: How to do something, methods of inquiry, and
criteria for using skills, algorithms, techniques, and methods.
Metacognitive Knowledge: Knowledge of cognition in general as well as
awareness and knowledge of one's own cognition.
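
Because the revised taxonomy has two dimensions, a set of test items can be tallied on a grid of knowledge type against cognitive process. The Python sketch below illustrates the idea under stated assumptions: the function names and the sample items are hypothetical, the six process names follow the revised taxonomy (Remember to Create, coded C1 to C6), and the fourth knowledge type is labelled Metacognitive.

```python
from collections import Counter

# The two dimensions of the revised taxonomy, in order of increasing complexity.
COGNITIVE_PROCESSES = ["Remember", "Understand", "Apply", "Analyse", "Evaluate", "Create"]
KNOWLEDGE_TYPES = ["Factual", "Conceptual", "Procedural", "Metacognitive"]


def level_code(process: str) -> str:
    """Return the level code (C1 to C6) for a cognitive process."""
    return f"C{COGNITIVE_PROCESSES.index(process) + 1}"


def tally_items(items: list[tuple[str, str]]) -> Counter:
    """Count items per (knowledge type, cognitive process) cell."""
    counts: Counter = Counter()
    for knowledge, process in items:
        if knowledge not in KNOWLEDGE_TYPES:
            raise ValueError(f"unknown knowledge type: {knowledge}")
        if process not in COGNITIVE_PROCESSES:
            raise ValueError(f"unknown cognitive process: {process}")
        counts[(knowledge, process)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical classification of five items from a draft test.
    draft_items = [
        ("Factual", "Remember"),
        ("Factual", "Remember"),
        ("Conceptual", "Understand"),
        ("Procedural", "Apply"),
        ("Conceptual", "Analyse"),
    ]
    for (knowledge, process), n in sorted(tally_items(draft_items).items()):
        print(f"{knowledge:<14} {level_code(process)} {process:<10} {n}")
```

A tally of this kind makes it easy to see whether a draft test leans too heavily on recall items at C1 and leaves the higher levels untested.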

5.5.2 SOLO Taxonomy

The SOLO taxonomy, which stands for the Structure of the
Observed Learning Outcome, is likewise a systematic way of
describing how a learner's performance develops from simple to
complex levels in their learning. Biggs & Collis first introduced it in their
1982 study. There are five stages: Prestructural, Unistructural and
Multistructural, which form a quantitative phase, and Relational and
Extended Abstract, which form a qualitative phase.

Students find learning more complex as it advances. SOLO is
a means of classifying learning outcomes in terms of their complexity,
enabling teachers to assess students' work in terms of its quality, not of
how many bits of this and of that they got right. At first we pick up only
one or a few aspects of the task (unistructural), then several aspects that
are unrelated (multistructural), then we learn how to integrate
them into a whole (relational), and finally, we are able to generalise
that whole to as yet untaught applications (extended abstract). The
diagram below lists verbs typical of each level.

Figure 5.3: SOLO Taxonomy

The SOLO taxonomy maps the complexity of a student's work by
linking it to one of five phases: little or no understanding (Prestructural),
through a simple and then more developed grasp of the topic (Unistructural
and Multistructural), to the ability to link the ideas and elements of a task
together (Relational) and finally (Extended Abstract) to understand the topic
for themselves, possibly going beyond the initial scope of the task (Biggs &
Collis, 1982; Hattie & Brown, 2004). In their later research into multimodal
learning, Biggs & Collis noted that there was an increase in the structural
complexity of their (the students') responses (1991:64).

It may be useful to view the SOLO taxonomy as an integrated strategy,
to be used in lesson design, in task guidance and in formative and summative
assessment (Smith & Colby, 2007; Black & Wiliam, 2009; Hattie, 2009;
Smith, 2011). The structure of the taxonomy encourages viewing learning as
an on-going process, moving from simple recall of facts towards a deeper
understanding; that learning is a series of interconnected webs that can be
built upon and extended. Nückles et al. (2009:261) elaborate:
Cognitive strategies such as organization and elaboration are at
the heart of meaningful learning because they enable the learner
to organize learning into a coherent structure and integrate new
information with existing knowledge, thereby enabling deep
understanding and long-term retention.
This would help to develop Smith's (2011:92) self-regulating, self-evaluating
learners who were well motivated by learning.

A range of SOLO-based techniques exists to assist teachers and
students. Use of constructive alignment (Biggs & Tang, 2009) encourages
teachers to be more explicit when creating learning objectives, focusing on
what the student should be able to do and at which level. This is essential for
a student to make progress and allows for the creation of rubrics, for use in
class (Black & Wiliam, 2009; Nückles et al., 2009; Huang, 2012), to make the
process explicit to the student. HOTS (Higher Order Thinking Skills)
maps (Hook & Mills, 2011) can be used in English to scaffold in-depth
discussion, encouraging students to:
Develop interpretations, use research and critical thinking
effectively to develop their own answers, and write essays that
engage with the critical conversation of the field (Linkon, 2005:247,
cited in Allen, 2011).

It may also be helpful in providing a range of techniques for differentiated
learning (Anderson, 2007; Hook & Mills, 2012).

The SOLO taxonomy has a number of proponents. Hook & Mills
(2011:5) refer to it as a model of learning outcomes that helps schools
develop a common understanding. Moseley et al. (2005:306) advocate its
use as a framework for developing the quality of assessment, citing that it is
easily communicable to students. Hattie (2012:54), in his wide-ranging
investigation into effective teaching and visible learning, outlines three levels
of understanding: surface, deep and conceptual. He indicates that:
The most powerful model for understanding these three levels and
integrating them into learning intentions and success criteria is the
SOLO model.

However, the taxonomy is not without critics; Chick (1998:20) believes
that there is potential to misjudge the level of functioning, and Chan et al.
(2002:512) criticise its conceptual ambiguity, stating that the categorisation
is unstable. In these two studies, the SOLO taxonomy was used primarily for
assessing completed work, so use throughout the teaching process may
alleviate these issues.

An additional criticism, in particular when the taxonomy is compared
with that of Bloom (1956), concerns the SOLO taxonomy's structure. Biggs & Collis
(1991) refer to the structure as a hierarchy, as do Moseley et al. (2005);
naturally, there are concerns when complex processes, such as human
thought, are categorised in this manner. However, Campbell et al. (1992)
explained the structure of the SOLO taxonomy as consisting of a series of
cycles (especially between the Unistructural, Multistructural and Relational
levels), which would allow for a development of breadth of knowledge as well
as depth.

Nevertheless, the SOLO taxonomy can be used not only in designing the
curriculum in terms of the learning outcomes intended, but also in
assessment. It can be effectively used for students to deconstruct exam
questions to understand marks awarded and as a vehicle for self-assessment
and peer-assessment.


Guidelines for constructing test items

Tests do not work without well-written test items. Test-takers

appreciate clearly written questions that do not attempt to trick or confuse
them into incorrect responses. The following presents the major
characteristics of well-written test items.

5.6.1 Aim of the test

Test item development is a critical step in building a test that properly
meets certain standards. A good test is only as good as the quality of the test
items. If the individual test items are not appropriate and do not perform well,
how can the test scores be meaningful? The topic to be evaluated (construct)
and where the evaluation is done (title/context) must be part of the
curriculum. If it is evaluated outside the curriculum, the curricular validity of
the item can be disputed. Therefore, test items must be developed to
precisely measure the objectives prescribed by the blueprint and meet quality
standards.

5.6.2 Range of the topics to be tested

A test must measure the test-takers' ability or proficiency in applying
the knowledge and principles on the topics that they have learnt. Ample
opportunity must be given to students to learn the topics that are to be
evaluated. This opportunity would include the availability of language
teachers, well-equipped facilities, and the expertise of the language teachers
in conducting the lessons and providing the skills and knowledge that would
be evaluated to the test-takers or students.

5.6.3 Range of skills to be tested

Test item writers should always attempt to write test items that
measure higher levels of cognitive processing. This is not an easy task. It
should be a goal of writers to ensure that their items have cognitive
characteristics exemplifying understanding, problem-solving, critical
thinking, analysis, synthesis, evaluation and interpreting rather than just
declarative knowledge. There are many theories that provide frameworks on
levels of thinking, and Bloom's taxonomy is often cited as a tool to use in item
writing. Always stick to writing important questions that represent and can
predict that a test-taker is proficient at high levels of cognitive processing.

5.6.4 Test format

Test items should always follow a consistent design so that the
questioning process in itself does not add unnecessary difficulty to answering
questions. A logical and consistent stimulus format for writing test
items can help expedite the laborious process of writing test items as well as
supply a format for asking basic questions. A format that provides an initial
starting structure to use in writing questions can be valuable for item writers.
When these formats are used, test-takers can quickly read and understand
the questions, since the format is expected. For example, to measure
understanding of knowledge or facts, questions can begin with the following:
What best defines ...?
What is not a characteristic of ...?
What is an example of ...?
5.6.5 Level of difficulty

A test has a planned number of questions at a level of difficulty and
discrimination that best determines mastery and non-mastery performance
states. Test-takers should clearly understand what is needed in education and
language assessment to prepare for the examination and how much
experience performing certain activities would help in preparation. This should
be the road map that helps item writers create test items and helps test-takers
understand what will be required of them to pass an examination. In any test
item construction, we must ensure that weak students can answer easy
items, students of intermediate language proficiency can answer easy and
moderate items, and students of high language proficiency can answer
easy, moderate and advanced test items. A reliable and valid test instrument
should encompass all three levels of difficulty.

5.6.6 International and Cultural Considerations (bias)

In standardised tests, when exams are distributed internationally, either
in a single language or translated into other languages, always refrain from
the use of slang, geographic references, historical references or dates
(holidays) that may not be understood by an international examinee. Tests
need to be adapted to other societies so that meaning is translated
correctly and no advantage is given to a particular group of test-takers. Steps
should be taken to avoid item content that may be biased against gender, race or other
cultural groups.
What are the good characteristics of a test item?
Explain each characteristic of a test item in a graphic organiser.


Test format

What is the difference between test format and test type? For example,
when you want to introduce a new kind of test, say a reading test that
is organised a little differently from the existing test items, what do you call it:
test format or test type? Test format refers to the layout of questions on a
test. For example, the format of a test could be two essay questions, 50
multiple-choice questions, etc. For the sake of brevity, I will provide only
the outlines of some large-scale standardised tests.

The Primary School Evaluation Test, also known as Ujian Penilaian
Sekolah Rendah (commonly abbreviated as UPSR; Malay), is a national
examination taken by all pupils in our country at the end of their sixth year
in primary school before they leave for secondary school. It is prepared and
examined by the Malaysian Examinations Syndicate. This test consists of two
papers, namely Paper 1 and Paper 2.
Multiple-choice questions are tested using a standardised optical
answer sheet that uses optical mark recognition for detecting answers for
Paper 1. Paper 2 comprises three sections, namely Sections A, B, and C.

TOEFL (Test of English as a Foreign Language)

The TOEFL test is administered in two ways: as an Internet-based test
(TOEFL iBT) and as a paper-based test (TOEFL PBT). Most of the 4,500+
test sites in the world use the TOEFL iBT. The TOEFL iBT test is given in
English and administered via the Internet. There are four sections (listening,
reading, speaking and writing), which take a total of about four and a half
hours to complete.

IELTS Test Format

IELTS is a test of all four language skills: Listening, Reading, Writing
and Speaking. Test-takers will take the Listening, Reading and Writing tests all
on the same day, one after the other, with no breaks in between. Depending
on the examinee's test centre, the Speaking test may be on the same day
as the other three tests, or up to seven days before or after that. The total test
time is under three hours. The test format is illustrated below.

Figure 6: IELTS Test Format





Topic 6 focuses on ways to assess language skills and language
content. It defines the types of test items used to assess language
skills and language content. It also provides teachers with suggestions
on ways a teacher can assess the listening, speaking, reading and
writing skills in a classroom. It also discusses concepts of and
differences between discrete point test, integrative test and
communicative test.


At the end of Topic 6, teachers will be able to:

Identify and carry out the different types of assessment to assess

language skills and language content

Understand and differentiate between objective and subjective


Understand and differentiate between discrete point test,

integrative test and communicative test in assessing language.












SESSION SIX (6 hours)

Types of test items to assess language skills



Basically there are two kinds of listening tests: tests that test specific aspects
of listening, like sound discrimination; and task-based tests which test skills in
accomplishing different types of listening tasks considered important for the
students being tested. In addition to this, Brown (2010) identified four types of
listening performance from which assessment could be considered.
i. Intensive: listening for perception of the components (phonemes, words,
intonation, discourse markers, etc.) of a larger stretch of language.
ii. Responsive: listening to a relatively short stretch of language (a
greeting, question, command, comprehension check, etc.) in order to
make an equally short response.
iii. Selective: processing stretches of discourse such as short monologues
for several minutes in order to scan for certain information. The
purpose of such performance is not necessarily to look for global or
general meaning but to be able to comprehend designated information
in a context of longer stretches of spoken language (such as classroom
directions from a teacher, TV or radio news items, or stories).
Assessment tasks in selective listening could ask students, for example,
to listen for names, numbers, grammatical category, directions (in a
map exercise), or certain facts and events.
iv. Extensive: listening to develop a top-down, global
understanding of spoken language. Extensive performance
ranges from listening to lengthy lectures to listening to a
conversation and deriving a comprehensive message or
purpose. Listening for the gist or the main idea and making
inferences are all part of extensive listening.


In the assessment of oral production, both discrete feature
objective tests and integrative task-based tests are used. The first
type tests such skills as pronunciation, knowledge of what
language is appropriate in different situations, language required
in doing different things like describing, giving directions, giving
instructions, etc. The second type involves finding out if pupils
can perform different tasks using spoken language that is
appropriate for the purpose and the context. Task-based activities
involve describing scenes shown in a picture, participating in a
discussion about a given topic, narrating a story, etc. As in the
listening performance assessment tasks, Brown (2010) cited four
categories for oral assessment.


1. Imitative. At one end of a continuum of types of speaking
performance is the ability to imitate a word or phrase or possibly
a sentence. Although this is a purely phonetic level of oral
production, a number of prosodic (intonation, rhythm, etc.),
lexical, and grammatical properties of language may be
included in the performance criteria. We are interested only in
what is traditionally labelled pronunciation; no inferences are
made about the test-taker's ability to understand or convey
meaning or to participate in an interactive conversation. The
only role of listening here is in the short-term storage of a
prompt, just long enough to allow the speaker to retain the short
stretch of language that must be imitated.


2. Intensive. The production of short stretches of oral language
designed to demonstrate competence in a narrow band of
grammatical, phrasal, lexical, or phonological relationships.
Examples of intensive assessment tasks include directed
response tasks (requests for specific production of speech),
reading aloud, sentence and dialogue completion, limited
picture-cued tasks including simple sentences, and translation
up to the simple sentence level.

3. Responsive. Responsive assessment tasks include interaction
and test comprehension, but at the somewhat limited level of very
short conversations, standard greetings and small talk, simple
requests and comments, etc. The stimulus is almost always a
spoken prompt (to preserve authenticity) with one or two
follow-up questions or retorts:


A. Liza : Excuse me, do you have the time?

Don : Yeah. Six-fifteen.


B. Jo : What is the most urgent social problem today?

Sue : I would say bullying.


C. Lan : Hey, Shan, how's it going?

Shan: Not bad, and yourself?
Lan : I'm good.
Shan: Cool. Okay gotta go.

4. Interactive. The difference between responsive and interactive
speaking is in the length and complexity of the interaction, which
sometimes includes multiple exchanges and/or multiple
participants. Interaction can be broken down into two types: (a)
transactional language, which has the purpose of exchanging
specific information, and (b) interpersonal exchanges, which have
the purpose of maintaining social relationships. (In the three
dialogues cited above, A and B are transactional, and C is
interpersonal.)
5. Extensive (monologue). Extensive oral production tasks include
speeches, oral presentations, and storytelling, during which the
opportunity for oral interaction from listeners is either highly
limited (perhaps to nonverbal responses) or ruled out altogether.
Language style is more deliberative (planning is involved) and
formal for extensive tasks. It can also include informal monologue such
as casually delivered speech (e.g., recalling a vacation in the
mountains, conveying recipes, or recounting the plot of a novel or
movie).



Cohen (1994) discusses various types of reading and the kinds of meaning
assessed. He describes skimming and scanning as two different types
of reading. In both, a respondent is given a lengthy passage and is
required either to inspect it rapidly (skim) or to read to locate specific
information (scan) within a short period of time. He also discusses
receptive reading or intensive reading which refers to a form of
reading aimed at discovering exactly what the author seeks to
convey (p. 218). This is the most common form of reading especially
in test or assessment conditions. Another type of reading is to read
responsively where respondents are expected to respond to some
point in a reading text through writing or by answering questions.

A reading text can also convey various kinds of meaning and reading
involves the interpretation or comprehension of these meanings. First,
grammatical meanings are meanings that are expressed through
linguistic structures such as complex and simple sentences and the
correct interpretation of those structures. A second meaning is
informational meaning which refers largely to the concept or
messages contained in the text. Respondents may be required to
comprehend merely the information or content of the passage and this
may be assessed through various means such as summary and
précis writing. Compared to grammatical or syntactic meaning,
informational meaning requires a more general understanding of a text
rather than having to pay close attention to the linguistic structure of
sentences. A third meaning contained in many texts is discourse
meaning. This refers to the perception of rhetorical functions conveyed
by the text. One typical function is discourse marking, which adds
cohesiveness to a text. Discourse markers such as unless, however, thus
and therefore are crucial to the correct interpretation of a text, and
students may be assessed on their ability to understand the discoursal
meaning that these markers bring to the passage. Finally, a fourth meaning

which may also be an object of assessment in a reading test is the
meaning conveyed by the writer's tone. The writer's tone, whether it
is cynical, sarcastic, sad, etc., is important in reading
comprehension but may be quite difficult to identify, especially by less
proficient learners. Nevertheless, there can be many situations where
the reader is completely wrong in comprehending a text simply
because he has failed to perceive the correct tone of the author.
d. Writing
Brown (2004) identifies three different genres of writing, namely
academic writing, job-related writing and personal writing, each of
which can be expanded to include many different examples. Fiction,
for example, may be considered as personal writing according to
Brown's taxonomy. Brown (2010) identified four categories of written
performance that capture the range of written production which can
be used to assess writing skill.


Imitative. To produce written language, the learner must attain

the skills in the fundamental, basic tasks of writing letters, words,
punctuation, and brief sentences. This category includes the
ability to spell correctly and to perceive phoneme-grapheme
correspondences in the English spelling system. At this stage
the learners are trying to master the mechanics of writing. Form
is the primary focus while context and meaning are of secondary
importance.


Intensive (controlled). Beyond the fundamentals of imitative

writing are skills in producing appropriate vocabulary within a
context, collocation and idioms, and correct grammatical features
up to the length of a sentence. Meaning and context are
important in determining correctness and appropriateness but
most assessment tasks are more concerned with a focus on form
and are rather strictly controlled by the test design.


Responsive. Assessment tasks require learners to perform at a

limited discourse level, connecting sentences into a paragraph

and creating a logically connected sequence of two or three

paragraphs. Tasks relate to pedagogical directives, lists of
criteria, outlines, and other guidelines. Genres of writing include
brief narratives and descriptions, short reports, lab reports,
summaries, brief responses to reading, and interpretations of
charts and graphs. Form-focused attention is mostly at the
discourse level, with a strong emphasis on context and meaning.

Extensive. Extensive writing implies successful management of

all the processes and strategies of writing for all purposes, up to
the length of an essay, a term paper, a major research project
report, or even a thesis. Focus is on achieving a purpose,
organizing and developing ideas logically, using details to
support or illustrate ideas, demonstrating syntactic and lexical
variety, and in many cases, engaging in the process of multiple
drafts to achieve a final product. Focus on grammatical form is
limited to occasional editing and proofreading of a draft.

6.2.2 Objective and Subjective test

Tests have been categorized in many different ways. The most
familiar terms regarding tests are objective and subjective tests.
We normally associate objective tests with multiple choice
question type tests and subjective tests with essays. However, to
be more accurate we will consider how the test is graded.
Objective tests are tests that are graded objectively while
subjective tests are thought to involve subjectivity in grading.

There are many examples of each type of test. Objective type tests
include the multiple choice test, true-false items and matching
items because each of these is graded objectively. In these
examples of objective tests, there is only one correct response
and the grader does not need to subjectively assess the response.

Examples of the subjective test include essays and short answer
questions. However, some other common types of test, such as
the dictation test, fill-in-the-blank tests, interviews and role
plays, can be considered both subjective and objective. They fall on
a continuum where some tests are more objective than others, so each
of these tests lies closer to one end of the continuum or the other.

Two other terms, select type tests and supply type tests are related
terms when we think of objective and subjective tests. In most
cases, objective tests are similar to select type tests where
students are expected to select or choose the answer from a list of
options. Just as a multiple choice question test is an objective type
test, it can also be considered a select type test. Similarly, tests
involving essay type questions are supply type as the students are
expected to supply the answer through their essay. How then
would you classify a fill in the blank type test? Definitely for this
type of test, the students need to supply the answer, but what is
supplied is merely a single word or a short phrase which differs
tremendously from an essay. It may therefore be helpful to once
again consider a continuum with supply type and select type items
at each end of the continuum respectively.

It is possible to now combine both continua as shown in Figure 6.1

with the two different test formats placed within the two continua:

Figure 6.1: Continua for different types of test formats

It is not by accident that we find there are few, if any, test formats that are
either supply type and objective or select type and subjective. Select type
tests tend to be objective while supply type tests tend to be subjective.
In addition to the above, Brown and Hudson (1998) have also suggested
three broad categories to differentiate tests according to how students are
expected to respond. These categories are the selected response tests, the
constructed response tests, and the personal response tests. Examples of
each of these types of tests are given in Table 6.1.

Table 6.1: Types of Tests According to Students' Expected Response

Selected response      Constructed response      Personal response
True-false             Short answer              Self and peer
Multiple choice        Performance test


Selected response assessments, according to Brown and Hudson
(1998), are assessment procedures in which students typically do not
create any language but rather select the answer from a given list (p.
658). Constructed response assessment procedures require students to
produce language by writing, speaking, or doing something else (p.
660). Personal response assessments, on the other hand, require
students to produce language but also allow each student's response to
be different from the others and allow students to communicate what
they want to communicate (p. 663). These three types of tests,
categorised according to how students respond, are useful when we
wish to determine what students need to do when they attempt to
answer test questions.

Types of test items to assess language content


Discrete Point Test and Integrative Test

Language tests may also be categorised as either discrete point or
integrative. Discrete point tests examine one element at a time.

Integrative tests, on the other hand, require the candidate to

combine many language elements in the completion of a
task (Hughes, 1989: 16). It is a simultaneous measure of
knowledge and ability of a variety of language features, modes, or
skills.
A multiple choice type test is usually cited as an example of a
discrete point test while essays are commonly regarded as the
epitome of integrative tests. However, both the discrete point test
and the integrative test are a matter of degree. A test may be more
discrete point than another and similarly a test may be more
integrative than another. Perhaps the more important aspect is to
be aware of the discrete point or integrative nature of a test as we
must be careful of what we believe the test measures.

This brings us to the question of how discrete point a multiple
choice question type item really is. While it is definitely more discrete point
than an essay, it may still require more than just one skill or ability
in order to complete. Let's say you are interested in testing a
student's knowledge of the relative pronoun and decide to do so by
using a multiple choice test item. If he fails to answer this test item
correctly, would you conclude that the student has problems with
the relative pronoun? The answer may not be as straightforward as
it seems. The test is presented in textual form and therefore
requires the student to read. As such, even the multiple choice test
item involves some integration of language skills as this example
shows, where in addition to the grammatical knowledge of relative
pronouns, the student must also be able to read and understand the
item.

Perhaps a clearer way of viewing the distinction between the

discrete point and the integrative test is to examine the perspective
each takes toward language. In the discrete point test, language is
seen to be made up of smaller units and it may be possible to test

language by testing each unit at a time. Testing knowledge of the

relative pronoun, for example, is certainly assessing the students on
a particular unit of language and not on the language as a whole. In
an integrative test, on the other hand, the perspective of language
is that of an integrated whole which cannot be broken up into
smaller units or elements. Hence, the testing of language should
maintain the integrity or wholeness of the language.

Communicative Test
As language teaching has emphasised the importance of
communication through the communicative approach, it is not
surprising that communicative tests have also been given prominence.
A communicative emphasis in testing involves many aspects, two of
which revolve around communicative elements in tests and meaningful
content. Both these aspects are briefly addressed in the following
sub-sections.

Integrating Communicative Elements into Examinations

Alderson and Banerjee (2002) report on various studies that seem to
point to the difficulty in achieving authenticity in tests. They cite
Spence-Brown (2001) who posits that the very act of assessment
changes the nature of a potentially authentic task and compromises
authenticity and that authenticity must be related to the
implementation of an activity, not to its design (p. 99). In her study,
students were required to interview native speakers outside the
classroom and submit a tape-recording of the interview. While this
activity seems quite authentic, the students were observed to prepare
for the interview by rehearsing the interview, editing the results, and
engaging in spontaneous, but flawed discourse (Alderson & Banerjee,
2002: 99), all of which are inauthentic when viewed in terms of real life
situations. Alderson himself argues that because candidates in
language tests are interested not in communicating but in displaying their
language abilities, the test situation is a communicative event in itself
and therefore cannot be used to replicate any real world event (p. 98).

Chalhoub-Deville (2003) argues for tests that take context into

consideration. She believes that there should be a shift in focus of our
measurement from traditional examinations of the construct in terms of
response consistency, to investigations that systematically explore
inconsistent (which does not mean random) performances across
contexts (p. 378). In the future, besides context, tests will also need to
integrate elements of communication such as topic initiation, topic
maintenance, and topic change in order for the test to become more
authentic and realistic. Due to issues of practicality, involving especially
the amount of time and extent of organisation to allow for such
communicative elements to emerge, it will not be an easy task to

The idea of bringing communicative elements into the language test is

not a new one. In his review of communicative tests, Fulcher (2000)
notes the descriptors of a communicative test as suggested by several
theorists. The three principles of communicative tests that he highlights
are that communicative tests:
involve performance;
are authentic; and
are scored on real-life outcomes.

In short, the kinds of tests that we should expect more of in the future
will be communicative tests in which candidates actually have to
produce the language in an interactive setting involving some degree of
unpredictability which is typical of any language interaction situation.
These tests would also take the communicative purpose of the
interaction into consideration and require the student to interact with
language that is actual and unsimplified for the learner. Fulcher finally
points out that in a communicative test, the only real criterion of
success is the behavioural outcome, or whether the learner was
able to achieve the intended communicative effect (p. 493). It is
obvious from this description that the communicative test may not be

so easily developed and implemented. Practical reasons may hinder

some of the demands listed. Nevertheless, a solution to this problem
has to be found in the near future in order to have valid language tests that
are purposeful and can stimulate positive washback in teaching and
learning.
Exercise 1

In your opinion and based on your teaching

experience, how would you conduct the testing of
reading, writing and speaking skills of your own
students? What are the methods that you employ?
Share this with your classmates and exchange


Describe three different types of writing
performance as suggested by Brown (2004)
and explain how they relate to academic writing,
job-related writing and personal writing.





Topic 7 focuses on the scoring, grading and assessment criteria. It
provides teachers with brief descriptions on the different approaches to
scoring, namely objective, holistic and analytic.



By the end of Topic 7, teachers will be able to:

Identify and differentiate the different approaches used in scoring

Use the different approaches used in scoring in assessing language



Approaches to





Objective approach
One type of scoring approach is the objective scoring approach. This scoring
approach relies on quantified methods of evaluating students' writing. A
sample of how objective scoring is conducted is given by Bailey (1998) as
follows:
Establish standardization by limiting the length of the assessment: Count

the first 250 words of the essay.
Identify the elements to be assessed: Go through the essay up to the
250th word underlining every mistake from spelling and mechanics
through verb tenses, morphology, vocabulary, etc. Include every error that
a literate reader might note.
Operationalise the assessment: Assign a weight score to each error, from
3 to 1. A score of 3 is a severe distortion of readability or flow of ideas; 2 is
a moderate distortion; and 1 is a minor error that does not affect readability
in any significant way.
Quantify the assessment: Calculate the essay Correctness Score by using
250 words as the numerator of a fraction and the sum of all the error
scores as the denominator:
Correctness Score = 250 / (sum of error scores)
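The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `correctness_score` and the sample error weights are assumptions made for the example, not part of Bailey's description.

```python
def correctness_score(error_weights, word_count=250):
    """Return word_count divided by the sum of the weighted error scores.

    error_weights holds one weight per marked error:
    3 = severe distortion, 2 = moderate distortion, 1 = minor error.
    """
    for weight in error_weights:
        if weight not in (1, 2, 3):
            raise ValueError("each error weight must be 1, 2 or 3")
    total = sum(error_weights)
    if total == 0:
        # With no marked errors the fraction has a zero denominator.
        raise ValueError("no errors marked; the score is undefined")
    return word_count / total


# Hypothetical marking of one essay: two severe, three moderate
# and five minor errors in the first 250 words.
weights = [3, 3, 2, 2, 2, 1, 1, 1, 1, 1]
print(correctness_score(weights))  # 250 / 17
```

Because the error total is the denominator, an essay with fewer or lighter errors receives a higher Correctness Score.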
7.2.2 Holistic approach
In holistic scoring, the reader reacts to the student's composition as a
whole and a single score is awarded to the writing. Normally this score is
on a scale of 1 to 4, or 1 to 6, or even 1 to 10 (Bailey, 1998: 187). Each
score on the scale will be accompanied with general descriptors of ability.
The following is an example of a holistic scoring scheme based on a 6
point scale.

Table 7.1: Holistic Scoring Scheme

Source: S.S. Moya, Evaluation Assistance Center (EAC)-East, Georgetown
University, Washington



Rating 6
Vocabulary is precise, varied, and vivid.
Organization is appropriate to writing assignment
and contains clear introduction, development of
ideas, and conclusion.
Transition from one idea to another is smooth
and provides reader with clear understanding that
topic is changing.
Meaning is conveyed effectively.
A few mechanical errors may be present but do
not disrupt communication.
Shows a clear understanding of writing and topic
development.

Rating 5
Vocabulary is adequate for grade level. Events
are organized logically, but some part of the
sample may not be fully developed.
Some transition of ideas is evident.
Meaning is conveyed but breaks down at times.
Mechanical errors are present but do not disrupt
communication.
Shows a good understanding of writing and topic
development.

Rating 4
Vocabulary is simple. Organization may be
extremely simple or there may be evidence of
disorganization.
There are a few transitional markers or
repetitive transitional markers.
Meaning is frequently not clear.
Mechanical errors affect communication.
Shows some understanding of writing and
topic development.

Rating 3
Vocabulary is limited and repetitious. Sample
is comprised of only a few disjointed
sentences.
No transitional markers.
Meaning is unclear.
Mechanical errors cause serious disruption in
communication.
Shows little evidence of discourse
understanding.

Rating 2
Responds with a few isolated words. No
complete sentences are written.
No evidence of concepts of writing.

Rating 1
No response.

The 6 point scale above includes broad descriptors of what a student's essay
reflects for each band. It is quite apparent that graders using this scale are
expected to pay attention to vocabulary, meaning, organisation, topic

development and communication. Mechanics such as punctuation are

secondary to communication.
Bailey also describes another type of scoring related to the holistic approach
which she refers to as primary trait scoring. In primary trait scoring, a
particular functional focus is selected which is based on the purpose of the
writing and grading is based on how well the student is able to express that
function. For example, if the function is to persuade, scoring would be on how
well the author has been able to persuade the grader rather than how well
organised the ideas were, or how grammatical the structures in the essay
were. This approach to grading emphasises functional and communicative
ability rather than discrete linguistic ability and accuracy.
7.2.3 Analytic approach
Analytical scoring is a familiar approach to many teachers. In analytical
scoring, raters assess students' performance on a variety of categories
which are hypothesised to make up the skill of writing. Content, for example,
is often seen as an important aspect of writing, i.e. is there substance to
what is written? Is the essay meaningful? Similarly, we may also want to
consider the organisation of the essay. Does the writer begin the essay with
an appropriate topic sentence? Are there good transitions between
paragraphs? Other categories that we
may want to also consider include vocabulary, language use and
mechanics. The following are some possible components used in
assessing writing ability using an analytical scoring approach and the
suggested weightage assigned to each:

Content          30 points
Organisation     20 points
Vocabulary       20 points
Language use     25 points
Mechanics         5 points

The points assigned to each component reflect the importance of

each of the components.
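A minimal sketch of how an analytic total could be computed is given below. It assumes that the five point values (30, 20, 20, 25, 5) belong to content, organisation, vocabulary, language use and mechanics in the order in which the text names them; the dictionary `MAX_POINTS`, the function `analytic_score` and the sample essay marks are all hypothetical names and figures used only for illustration.

```python
# Assumed maximum points per component (order taken from the text).
MAX_POINTS = {
    "content": 30,
    "organisation": 20,
    "vocabulary": 20,
    "language_use": 25,
    "mechanics": 5,
}


def analytic_score(awarded):
    """Sum the points awarded per component after checking each maximum."""
    if set(awarded) != set(MAX_POINTS):
        raise ValueError("every component must be rated exactly once")
    for component, points in awarded.items():
        if not 0 <= points <= MAX_POINTS[component]:
            raise ValueError(f"{component} must be between 0 and {MAX_POINTS[component]}")
    return sum(awarded.values())


# Hypothetical marks for one essay.
essay = {
    "content": 24,
    "organisation": 15,
    "vocabulary": 14,
    "language_use": 18,
    "mechanics": 4,
}
print(analytic_score(essay))  # 75 out of a possible 100
```

Keeping the component marks separate, rather than storing only the total, preserves the diagnostic information that distinguishes analytic scoring from a single holistic score.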

Comparing the Three Approaches

Each of the three scoring approaches has its own advantages
and disadvantages. These are summarised in Table 7.2.
Table 7.2: Comparison of the Advantages and Disadvantages of the
Three Approaches to Scoring Essays




Quickly graded
Provide a public standard that is
understood by the teachers and
students alike
Relatively higher degree of rater
Applicable to the assessment of
many different topics
Emphasise the students'
strengths rather than their
It provides clear guidelines in
grading in the form of the various
Allows the graders to consciously
address important aspects of
Emphasises the students'
strengths rather than their

The single score may actually mask differences
across individual compositions.
Does not provide a lot of diagnostic feedback

Writing ability is unnaturally split up into


Still some degree of subjectivity involved.

Accentuates negative aspects of the learner's
writing without giving credit for what they can
do well.


Based on your understanding, draw a mind map to indicate the

advantages and disadvantages of the three approaches to
scoring essays.




Topic 8 focuses on item analysis and interpretation. It provides teachers with
brief descriptions of basic statistical terms such as mode, median,
mean, standard deviation and standard score, and of the interpretation of data. It will also
look at item analysis, which deals with item difficulty and item discrimination.
Teachers will also be introduced to distractor analysis in language assessment.


By the end of Topic 8, teachers will be able to:


Identify and differentiate some basic statistical terms;

Determine how well items discriminate using item discrimination; and

Analyse how well a distractor in a test item performs.















8.2.1 Basic Statistics

Let us assume that you have just graded the test papers for your class. You
now have a set of scores. If a person were to ask you about the performance
of the students in your class, it would be very difficult to give all the scores in
the class. Instead, you may prefer to cite only one score.
Or perhaps you would like to report on the performance by giving some
values that would help provide a good indication of how the students in your
class performed. What values would you give? In this section, we will look at
two kinds of measures, namely measures of central tendency and measures
of dispersion. Both these types of measures are useful in score reporting.
Measures of central tendency indicate the value around which a set of scores
gathers. There are three major measures of central tendency: the
mode, the median and the mean.



The mode is the most frequently occurring raw score in a set of
scores.
The following is a set of scores:
15, 13, 12, 12, 13, 16, 13, 17, 14, 18
What is the mode for this set of scores? If you said 13, then
you are correct, as it occurs more often than any other score. It is
possible to have more than one mode in a set of scores. If there are
two modes, then the set of scores is referred to as being
bimodal.
The median refers to the score that is in the middle of the
set of scores when the scores are arranged in ascending or
descending order. Consider the following set of seven scores:
47, 65, 45, 54, 50, 52, 51
If we arrange them in order of value, we get
45, 47, 50, 51, 52, 54, 65. In this set of scores, the
median is 51 as it is the middle score. There are three
scores lower than it and an equal number of scores higher
than it.
What happens when there is an even number of scores?
Let's take the following set of scores as an example:
45, 47, 50, 51, 52, 53, 54, 65
As there is no one score that is in the middle, we need to
take the two in the middle, add them up and divide by two.
As such, the median is 51.5, as (51 + 52)/2 = 103/2 = 51.5.
Always remember, however, that when we wish to find the
median, we have to first arrange the scores in either
ascending or descending order of value.
The mean of a set of test scores is the arithmetic mean or
average and is calculated as ΣX/N, where Σ (sigma) refers
to the sum of, X refers to the raw or observed scores, and N
is the number of observed scores. Look at the following set
of scores:
47, 65, 45, 54, 50, 52, 51
The mean for this set of scores is 364/7 = 52.
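The three measures of central tendency can be checked with Python's standard `statistics` module. The sketch below simply reuses the score sets from the text; the variable names are chosen for the example.

```python
from statistics import mean, median, mode, multimode

mode_scores = [15, 13, 12, 12, 13, 16, 13, 17, 14, 18]
seven_scores = [47, 65, 45, 54, 50, 52, 51]
eight_scores = [45, 47, 50, 51, 52, 53, 54, 65]

# Mode: the most frequently occurring score.
print(mode(mode_scores))        # 13

# Median: median() sorts the scores itself before taking the middle value.
print(median(seven_scores))     # 51
print(median(eight_scores))     # 51.5, the average of 51 and 52

# Mean: sum of the scores divided by the number of scores.
print(sum(seven_scores))        # 364
print(mean(seven_scores))       # 52

# multimode() lists every mode, which is useful for a bimodal set.
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```

Note that `multimode` requires Python 3.8 or later.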


Standard deviation
Standard deviation refers to how much the scores deviate from the mean.
There are two methods of calculating standard deviation, the
deviation method and the raw score method, which are illustrated below.

We will use the scores 20, 25 and 30. Using the deviation method,
we come up with the following table:
Table 8.1: Calculating the Standard Deviation Using the Deviation Method

Using the raw score method, we can come up with the following:

Table 8.2 : Calculating the Standard Deviation Using the Raw Score Method

Both methods result in the same final value of 5. If you are calculating
standard deviation with a calculator, it is suggested that the deviation
method be used when there are only a few scores and the raw score
method be used when there are many scores. This is because when
there are many scores, it will be tedious to calculate the square of the
deviations and their sum.
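Both methods can be sketched in Python for the scores 20, 25 and 30. The sketch assumes that the sum of squares is divided by N - 1 (the sample standard deviation), because that is the version that yields the stated result of 5; the function names are illustrative only.

```python
import math


def sd_deviation_method(scores):
    """Square each deviation from the mean, sum, divide by N - 1, take the root."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squared_deviations = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(sum_of_squared_deviations / (n - 1))


def sd_raw_score_method(scores):
    """Use only the sum of scores and the sum of squared scores."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x * x for x in scores)
    return math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))


scores = [20, 25, 30]
print(sd_deviation_method(scores))  # 5.0
print(sd_raw_score_method(scores))  # 5.0
```

The raw score method needs only two running totals, which is why it is more convenient when there are many scores.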

8.2.3 Standard score

Standardised scores are necessary when we want to make
comparisons across tests and measurements. Z scores and T scores
are the more common forms of standardised scores although you
may come up with your own standardised score. A standardised score
can be computed for every raw score in a set of scores for a test.

i. The Z score
The Z score is the basic standardised score. It is referred to as the
basic form as other computations of standardised scores must first
calculate the Z score. The formula used to calculate the Z score is as
follows:
Z = (X - mean) / standard deviation
where X is the raw score.
Table 8.3: Calculating the Z Score for a Set of Scores

Z score values are very small and usually range only from -2 to 2.
Such small values make them inappropriate for score reporting, especially
for those unaccustomed to the concept. Imagine what a parent may
say if his child comes home with a report card with a Z score of 0.47
in English Language! Fortunately, there is another form of
standardised score, the T score, with values that are more
palatable to the relevant parties.

ii. The T score
The T score is a standardised score which can be computed using the
formula T = 10(Z) + 50. As such, the T scores for students A, B, C, and D in
Table 8.3 are 10(-1.28) + 50; 10(-0.23) + 50; 10(0.47) + 50; and
10(1.04) + 50, or 37.2, 47.7, 54.7, and 60.4 respectively. These values
seem perfectly appropriate compared to the Z score. The T score
average or mean is always 50 (corresponding to a Z score of 0), which
connotes an average ability and the mid point of a 100 point scale.
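The conversion from Z scores to T scores can be verified with a short Python sketch. It assumes the usual definition of the Z score, Z = (X - mean) / standard deviation; the function names `z_score` and `t_score` are illustrative.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - mean) / sd


def t_score(z):
    """Rescale a Z score so that the mean is 50 and one SD is 10 points."""
    return 10 * z + 50


# Z scores quoted in the text for students A, B, C and D.
for student, z in [("A", -1.28), ("B", -0.23), ("C", 0.47), ("D", 1.04)]:
    print(student, round(t_score(z), 1))
# A 37.2
# B 47.7
# C 54.7
# D 60.4
```

A raw score exactly at the mean gives Z = 0 and therefore T = 50.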

Interpretation of data
The standardised score is actually a very important score if we want to
compare performance across tests and between students. Let us take the
following scenario as an example:

How can En. Abu solve this problem? He would have to have
standardised scores in order to decide. This would require the
following information:
Test 1 : mean = 42, standard deviation = 7
Test 2 : mean = 47, standard deviation = 8
Using the information above, En. Abu can find the Z score for each
raw score reported as follows:
Table 8.4: Z Score for Form 2A

Based on Table 8.4, both Ali and Chong have a negative Z score as
their total score for both tests. However, Chong has a higher Z score
total (i.e. -1.07 compared to -1.34) and therefore performed better
when we take the performance of all the other students into
account.
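The comparison can be sketched in Python using the two test means and standard deviations given above. The raw scores below are hypothetical and belong to two invented students, P and Q; they are not the scores of Ali and Chong, which are not reproduced in the text.

```python
# Mean and standard deviation of each test, as given in the text.
TESTS = {"test1": (42, 7), "test2": (47, 8)}


def total_z(raw_scores):
    """Add up a student's Z scores across both tests."""
    total = 0.0
    for test, raw in raw_scores.items():
        mean, sd = TESTS[test]
        total += (raw - mean) / sd
    return total


# Hypothetical raw scores: both students have the same raw total of 82.
student_p = {"test1": 35, "test2": 47}
student_q = {"test1": 42, "test2": 40}

print(total_z(student_p))  # -1.0
print(total_z(student_q))  # -0.875
```

Although the two raw totals are identical, student Q has the higher Z score total, because a shortfall of 7 marks counts for less on the test with the larger standard deviation.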


The normal curve is a hypothetical curve that is assumed to represent many
naturally occurring phenomena. It is assumed that if we were to sample a
particular characteristic such as the height of Malaysian men, then we will
find that while most will have an average height of perhaps 5 feet 4 inches,
there will be a few who will be relatively shorter and an equal number who
are relatively taller. By plotting the heights of all Malaysian men according to
frequency of occurrence, it is expected that we would obtain something
similar to a normal distribution curve. Similarly, test scores that measure any
characteristic such as intelligence, language proficiency or writing ability of a
specific population are also expected to provide us with a normal curve.
The following is a diagram illustrating what the normal curve looks like.

Figure 8.1: The normal distribution or Bell curve

The normal curve in Figure 8.1 is partitioned according to standard
deviations (i.e. -4s, -3s, ..., +3s, +4s) which are indicated on the
horizontal axis. The area of the curve between standard deviations is
indicated in percentage on the diagram. For example, the area between
the mean (0 standard deviation) and +1 standard deviation is 34.13%.
Similarly, the area between the mean and -1 standard deviation is also
34.13%. As such, the area between -1 and +1 standard deviations is
68.26%.
In using the normal curve, it is important to make a distinction between
standard deviation values and standard deviation scores. A standard
deviation value is a constant and is shown on the horizontal axis of the
diagram above. The standard deviation score, on the other hand, is the
obtained score when we use the standard deviation formula provided
earlier. So, if we find the standard deviation to be 5 as in the earlier example, then the
score for the standard deviation value of 1 is 5, for the value of 2 is 5
x 2 = 10, for the value of 3 is 15, and so on. Standard deviation
values of -1, -2, and -3 will have corresponding negative scores of -5, -10, and -15.
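The areas under the normal curve and the link between standard deviation values and scores can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper names below are illustrative, and the standard deviation of 5 is the one from the earlier example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_between(z_low, z_high):
    """Proportion of the curve lying between two standard deviation values."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


def sd_value_to_score(sd_value, sd_score=5):
    """Convert a standard deviation value (e.g. -2) into score units."""
    return sd_value * sd_score


print(round(area_between(0, 1), 4))    # 0.3413
print(round(area_between(-1, 0), 4))   # 0.3413
print(round(area_between(-1, 1), 4))   # 0.6827

print([sd_value_to_score(v) for v in (-3, -2, -1, 1, 2, 3)])
# [-15, -10, -5, 5, 10, 15]
```

The exact area between -1 and +1 standard deviations rounds to 68.27%; doubling the rounded figure of 34.13% gives the slightly smaller 68.26% often quoted in textbooks.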

Item analysis

Item difficulty
Item difficulty refers to how easy or difficult an item is. The formula
used to measure item difficulty is quite straightforward. It involves
finding out how many students answered an item correctly and
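dividing it by the number of students who took this test. The formula
is therefore:
Item difficulty = number of students who answered the item correctly / number of students who took the test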

For example, if twenty students took a test and 15 of them correctly

answered item 1, then the item difficulty for item 1 is 15/20 or 0.75.
Item difficulty is always reported as a decimal and can range from
0 to 1. An item difficulty of 0 refers to an extremely difficult item with
no students getting the item correct and an item difficulty of 1 refers to
an easy item which all students answered correctly.
The appropriate difficulty level will depend on the purpose of the test.
According to Anastasi & Urbina (1997), if the test is to assess
mastery, then items with a difficulty level of 0.8 can be accepted.

However, they go on to describe that if the purpose of the test is for

selection, then we should utilise items whose difficulty values come
closest to the desired selection ratio. For example, if we want to select
20%, then we should choose items with a difficulty index of 0.20.
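The item difficulty calculation described above can be sketched as follows. The function name `item_difficulty` is an illustrative choice.

```python
def item_difficulty(num_correct, num_test_takers):
    """Proportion of test-takers who answered the item correctly (0 to 1)."""
    if num_test_takers <= 0:
        raise ValueError("at least one student must have taken the test")
    if not 0 <= num_correct <= num_test_takers:
        raise ValueError("num_correct must lie between 0 and num_test_takers")
    return num_correct / num_test_takers


print(item_difficulty(15, 20))  # 0.75, the worked example for item 1
print(item_difficulty(0, 20))   # 0.0, nobody answered correctly
print(item_difficulty(20, 20))  # 1.0, everybody answered correctly
```

Note that a higher value means an easier item, so the index is, strictly speaking, a measure of item facility.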
b. Item discrimination
Item discrimination is used to determine how well an item is able to
discriminate between good and poor students. Item discrimination
values range from -1 to 1. A value of -1 means that the item
discriminates perfectly, but in the wrong direction. This value would tell
us that the weaker students performed better on an item than the better
students. This is hardly what we want from an item and if we obtain
such a value, it may indicate that there is something not quite right with
the item. It is strongly recommended that we examine the item to see
whether it is ambiguous or poorly written. A discrimination value of 1
shows positive discrimination with the better students performing much
better than the weaker ones as is to be expected.

Let's use the following instance as an example. Suppose you have just
conducted a twenty-item test and obtained the following results:

Table 8.5: Item Discrimination

As there are twelve students in the class, 33% of this total would be 4
students. Therefore, the upper group and the lower group will each consist
of 4 students. Based on their total scores, the upper group would
consist of students L, A, E, and G while the lower group would consist
of students J, H, D and I.
We now need to look at the performance of these students for each
item in order to find the item discrimination index of each item.
For item 1, all four students in the upper group (L, A, E, and G)
answered correctly while only student H in the lower group answered
correctly. Using the formula described earlier, we can plug in the
numbers as follows:

Item discrimination for item 1 = (4 - 1) / 4 = 0.75

Two points should be noted. First, item discrimination is especially
important in norm-referenced testing and interpretation, as in such
instances there is a need to discriminate between good students who
do well in the measure and weaker students who perform poorly. In
criterion-referenced tests, item discrimination does not have as
important a role. Secondly, the use of 33.3% of the total number of
students who took the test in the formula is not inflexible, as it is
possible to use any percentage between 27.5% and 35% as the value.
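
The discrimination calculation for item 1 can be sketched as follows. The sketch assumes the common upper-lower group form of the index, that is, the number of correct answers in the upper group minus the number in the lower group, divided by the size of one group; this form is consistent with the stated range of -1 to 1, but the function names are illustrative assumptions.

```python
def group_size_for(num_students: int, percent: float = 33) -> int:
    """Size of the upper (and lower) group for a given percentage of the class."""
    return round(num_students * percent / 100)


def discrimination_index(upper_correct: int, lower_correct: int, group_size: int) -> float:
    """Assumed form: (correct in upper group - correct in lower group) / group size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (upper_correct - lower_correct) / group_size


# Example from the text: 12 students, so each group has 4 students.
size = group_size_for(12)
print(size)  # 4

# Item 1: all four upper-group students and one lower-group student were correct.
print(discrimination_index(upper_correct=4, lower_correct=1, group_size=size))  # 0.75
```

Changing the `percent` argument to any value between 27.5 and 35 reflects the flexibility noted in the second point above.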

Distractor analysis
Distractor analysis is an extension of item analysis, using techniques
that are similar to item difficulty and item discrimination. In distractor
analysis, however, we are no longer interested in how test takers select
the correct answer, but how the distractors were able to function
effectively by drawing the test takers away from the correct answer.
The number of times each distractor is selected is noted in order to
determine the effectiveness of the distractor. We would expect that the
distractor is selected by enough candidates for it to be a viable distractor.
What exactly is an acceptable value? This depends to a large extent on
the difficulty of the item itself and what we consider to be an acceptable
item difficulty value for test items. If we are to assume that 0.7 is an
appropriate item difficulty value, then we should expect that the
remaining 0.3 be about evenly distributed among the distractors.

Let us take the following test item as an example:

In the story, he was unhappy because_____________________________
A. it rained all day
B. he was scolded
C. he hurt himself
D. the weather was hot

Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors? If all 30 selected D, then distractors B and C are useless in
their role as distractors. Similarly, if 15 students selected D and another 15
selected B, then C is not an effective distractor and should be replaced.

Therefore, the ideal situation would be for each of the three distractors to
be selected by an equal number of the students who did not get the answer
correct, i.e. in this case 10 students. The effectiveness of each
distractor can then be quantified as 10/100 or 0.1, where 10 is the number of
students who selected the distractor and 100 is the total number of students
who took the test. This technique is similar to a difficulty index although the
result does not indicate the difficulty of each item, but rather the
effectiveness of the distractor. In the first situation described
above, options A, B, C and D would have a difficulty index of 0.7, 0, 0,
and 0.3 respectively. If the distractors worked equally well, then the indices
would be 0.7, 0.1, 0.1, and 0.1. Unlike in determining the difficulty of an
item, the value of the difficulty index formula for the distractors must be
interpreted in relation to the indices for the other distractors.
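
The three situations described above can be tabulated with a short Python sketch. The function names are illustrative assumptions; the response counts are the ones given in the text (100 test takers, key A, item difficulty 0.7).

```python
from collections import Counter


def option_proportions(responses, options=("A", "B", "C", "D")):
    """Proportion of test takers who selected each option."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    return {option: counts[option] / len(responses) for option in options}


def unused_distractors(proportions, key):
    """Distractors that no test taker selected."""
    return [option for option, p in proportions.items() if option != key and p == 0]


# Situation 1: all 30 incorrect answers went to D.
first = option_proportions(["A"] * 70 + ["D"] * 30)
print(first)                           # {'A': 0.7, 'B': 0.0, 'C': 0.0, 'D': 0.3}
print(unused_distractors(first, "A"))  # ['B', 'C']

# Situation 2: 15 selected D and 15 selected B.
second = option_proportions(["A"] * 70 + ["D"] * 15 + ["B"] * 15)
print(unused_distractors(second, "A"))  # ['C']

# Ideal situation: the 30 incorrect answers are spread evenly.
ideal = option_proportions(["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10)
print(ideal)                           # {'A': 0.7, 'B': 0.1, 'C': 0.1, 'D': 0.1}
```

Listing only the distractors that nobody chose is the simplest possible check; deciding how small a proportion is still acceptable remains a judgement made in relation to the other distractors.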
From a different perspective, the item discrimination formula can also be
used in distractor analysis. The concept of upper groups and lower groups
would still remain, but the analysis and expectation would differ slightly
from the regular item discrimination that we have looked at earlier. Instead
of expecting a positive value, we should logically expect a negative value
as more students from the lower group should select distractors. Each
distractor can have its own item discrimination value in order to analyse
how the distractors work and ultimately refine the effectiveness of the test
item itself.
Table 8.6: Selection of Distractors
(Columns: Distractor A, Distractor B, Distractor C, Distractor D. Rows: Item 1 to Item 5.)
* indicates key

For Item 1, the discrimination index for each distractor can be calculated
using the discrimination index formula. From Table 8.5, we know that all the
students in the upper group answered this item correctly and only one
student from the lower group did so. If we assume that the three remaining
students from the lower group all selected distractor B, then the
discrimination index for item 1, distractor B will be:

(0 - 3) / 4 = -0.75

This negative value indicates that more students from the lower group
selected the distractor compared to students from the upper group. This
result is to be expected of a distractor and a value of -1 to 0 is preferred.
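
The same idea can be applied to every option of an item at once. The sketch below assumes the upper-lower group form of the index used for item discrimination and, purely for illustration, assumes that option A is the key for Item 1; the function name and the choice lists are hypothetical and only mirror the situation described above.

```python
def option_discrimination(upper_choices, lower_choices, options=("A", "B", "C", "D")):
    """Discrimination value for each option: (upper count - lower count) / group size."""
    if not upper_choices or len(upper_choices) != len(lower_choices):
        raise ValueError("groups must be non-empty and of equal size")
    group_size = len(upper_choices)
    return {
        option: (upper_choices.count(option) - lower_choices.count(option)) / group_size
        for option in options
    }


# All four upper-group students chose the (assumed) key A.
upper = ["A", "A", "A", "A"]
# One lower-group student chose the key; the other three chose distractor B.
lower = ["A", "B", "B", "B"]

print(option_discrimination(upper, lower))
# {'A': 0.75, 'B': -0.75, 'C': 0.0, 'D': 0.0}
```

A positive value is expected for the key and a negative value for a working distractor; a value of 0 here shows that distractors C and D attracted no one in either group.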
1. Calculate the mean, mode, median and range of the following set of
scores: 23, 24, 25, 23, 24, 23, 23, 26, 27, 22, 28.

2. What is a normal curve and what does this show? Does the final
result always show a normal curve and how does this relate to
standardised tests?



Topic 9 focuses on reporting assessment data. It provides teachers with brief
descriptions on the purposes of reporting and the reporting methods.
By the end of Topic 9, teachers will be able to:

Understand the purposes of reporting of assessment data

Understand and use the different reporting methods in language assessment



SESSION NINE (3 hours)


9.2.1 Purposes of reporting

We can say that the main purpose of tests is to obtain information
concerning a particular behaviour or characteristic. Based on information
obtained from tests, several different types of decisions can be made.
Kubiszyn & Borich (2000) mention eight different types of decisions
made on the basis of information obtained from tests. These educational
decisions are shown in Figure 9.1.

Figure 9.1: Eight Types of Decisions Made

Instructional decisions are made based on test results when, for
example, teachers decide to change or maintain their instructional
approach. If a teacher finds out that most of his class have failed his
test, there are many possible reactions he can have. The teacher
could evaluate the effectiveness of his own teaching or instructional
approach and implement the necessary changes. Tests yield scores
and teachers will have to make decisions in terms of the kind of
grades to give students. As grades are indicators of student
performance, teachers need to decide whether a student deserves a
high grade, perhaps an A, on the basis of some form of assessment.
Traditionally, and perhaps for a long time to come, this assessment will
be in the form of tests. Sometimes, we give tests to find out the
strengths and weaknesses of our students.
Decisions related to selection, placement, counselling and guidance,
programme or curriculum, and administrative policy are all made at
levels higher than the classroom.
Administrators, educational agencies and institutions may be involved in
these decisions.
Selection and placement decisions are somewhat similar. However, a
selection decision relates to whether or not a student is selected for a
programme or for admission into an institution based on a test score.
Tests such as TOEFL and IELTS are often used by universities to
decide whether a candidate is suitable, and hence selected for
admission.
A placement decision, however, deals with where a candidate should
be placed based on performance on the test. A clear example is the
language placement examination for newly admitted students
commonly administered by many local and foreign universities.
Based on their performance on such a test, students are placed into
different language classes that are arranged according to proficiency
level.
Counselling and guidance decisions are also made by relevant parties
such as counsellors and administrators on the basis of exam results.
Counsellors often give advice in terms of appropriate vocations for
some of their students. This advice is likely to be given on the basis of
the students' own test scores. Programme or curriculum decisions
reflect the kinds of changes made to the educational programme or
curriculum based on examination results. Finally, there are also
administrative policy decisions that need to be made which are also
greatly influenced by test scores.


Reporting methods
Student achievement and progress can be reported by using:
i. Norm-referenced assessment and reporting
Assessing and reporting a student's achievement and progress in
comparison to other students.
ii. Criterion-referenced assessment and reporting
Assessing and reporting a student's achievement and progress in
comparison to predetermined criteria.
iii. An outcomes approach
An outcomes approach to assessment will provide information about
student achievement to enable reporting against a standards
framework. It acknowledges that students, regardless of their class or
grade, can be working towards syllabus outcomes anywhere along the
learning continuum.

Principles of effective and informative assessment and reporting

Effective and informative assessment and reporting practice:

Has clear, direct links with outcomes

The assessment strategies employed by the teacher in the
classroom need to be directly linked to and reflect the syllabus
outcomes. Syllabus outcomes in stages will describe the standard
against which student achievement is assessed and reported.

Is integral to teaching and learning

Effective and informative assessment practice involves selecting
strategies that are naturally derived from well structured teaching
and learning activities. These strategies should provide information
concerning student progress and achievement that helps inform
ongoing teaching and learning as well as the diagnosis of areas of
strength and need.

Is balanced, comprehensive and varied

Effective and informative assessment practice involves teachers
using a variety of assessment strategies that give students multiple
opportunities, in varying contexts, to demonstrate what they know,
understand and can do in relation to the syllabus outcomes.
Effective and informative reporting of student achievement takes a
number of forms including traditional reporting, student profiles,
Basic Skills Tests, parent and student interviews, annotations on
student work, comments in workbooks, portfolios, certificates and
awards.

Is valid
Assessment strategies should accurately and appropriately assess
clearly defined aspects of student achievement. If a strategy does
not accurately assess what it is designed to assess, then its use is
misleading.
Valid assessment strategies are those that reflect the actual
intention of teaching and learning activities, based on syllabus
outcomes.
Where values and attitudes are expressed in syllabus outcomes,
these too should be assessed as part of student learning.
Is fair
Effective and informative assessment strategies are designed to
ensure equal opportunity for success regardless of students' age,
gender, physical or other disability, culture, background language,
socio-economic status or geographic location.
Engages the learner
Effective and informative assessment practice is student centred.
Ideally there is a cooperative interaction between teacher and
students, and among the students themselves.
The syllabus outcomes and the assessment processes to be used
should be made explicit to students. Students should participate in
the negotiation of learning tasks and actively monitor and reflect
upon their achievements and progress.
Values teacher judgement
Good assessment practice involves teachers making judgements,
on the weight of assessment evidence, about student progress
towards the achievement of outcomes.
Teachers can be confident a student has achieved an outcome
when the student has successfully demonstrated that outcome a
number of times, and in varying contexts.
The reliability of teacher judgement is enhanced when teachers
cooperatively develop a shared understanding of what constitutes
achievement of an outcome. This is developed through
cooperative programming and discussing samples of student work
and achievements within and between schools. Teacher
judgement based on well defined standards is a valuable and rich
form of student assessment.
Is time efficient and manageable
Effective and informative assessment practice is time efficient and
supports teaching and learning by providing constructive feedback to
the teacher and student that will guide further learning.
Teachers need to plan carefully the timing, frequency and nature of
their assessment strategies. Good planning ensures that assessment
and reporting is manageable and maximises the usefulness of the
strategies selected (for example, by addressing several outcomes in
one assessment task).
Recognises individual achievement and progress
Effective and informative assessment practice acknowledges that
students are individuals who develop differently. All students must be
given appropriate opportunities to demonstrate achievement.
Effective and informative assessment and reporting practice is
sensitive to the self esteem and general well-being of students,
providing honest and constructive feedback.
Values and attitudes outcomes are an important part of learning that
should be assessed and reported. They are distinct from knowledge,
understanding and skill outcomes.
Involves a whole school approach
An effective and informative assessment and reporting policy is
developed through a planned and coordinated whole school approach.
Decisions about assessment and reporting cannot be taken
independently of issues relating to curriculum, class groupings,
timetabling, programming and resource allocation.
Actively involves parents
Schools and their communities are responsible for jointly developing
assessment and reporting practices and policies according to their local
needs and expectations.

Schools should ensure full and informed participation by parents in the
continuing development and review of the school policy on reporting.
Conveys meaningful and useful information
Reporting of student achievement serves a number of purposes, for a
variety of audiences. Students, parents, teachers, other schools and
employers are potential audiences. Schools can use student
achievement information at a number of levels including individual,
class, grade or school. This information helps identify students for
targeted intervention and can inform school improvement programs.
The form of the report must clearly serve its intended purpose and
audience.
Effective and informative reporting acknowledges that students can be
demonstrating progress and achievement of syllabus outcomes across
stages, not just within stages.
Good reporting practice takes into account the expectations of the
school community and system requirements, particularly the need for
information about standards that will enable parents to know how their
children are progressing.
Student achievement and progress can be reported by comparing
students' work against a standards framework of syllabus outcomes,
comparing their prior and current learning achievements, or comparing
their achievements to those of other students. Reporting can involve a
combination of these methods. It is important for schools and parents to
explore which methods of reporting will provide the most meaningful
and useful information.




Topic 10 focuses on the issues and concerns related to assessment in the
Malaysian primary schools. It will look at how assessment is viewed and used
in Malaysia.
By the end of Topic 10, teachers will be able to:

Understand some issues and concerns regarding assessment in
Malaysian primary schools
Understand Chapter 4 of the Malaysia Education Blueprint 2013-2025
Use the different types of assessment in assessing language in school
(cognitive-level, school-based and alternative assessment)





SESSION TEN (3 hours)


Exam-oriented System

The educational administration in Malaysia is highly centralised with four
hierarchical levels; that is, federal, state, district and the lowest level, school.
Major decision- and policy-making takes place at the federal level, represented
by the Ministry of Education (MoE), which consists of the Curriculum
Development Centre, the school division, and the Malaysian Examination
Syndicate (MES).
The current education system in Malaysia is too examination-oriented and
over-emphasizes rote-learning with institutions of higher learning fast
becoming mere diploma mills. Like most Asian countries (e.g., Gang 1996;
Lim and Tan 1999; Choi 1999), Malaysia so far has focused on public
examination results as important determinants of students' progression to
higher levels of education or occupational opportunities (Chiam 1984).
The Malaysian education system requires all students to sit for public
examinations at the end of each level of schooling. There are four public
examinations from primary to postsecondary education. These are the
Primary School Achievement Test (UPSR) at the end of six years of primary
education, the Lower Secondary Examination (PMR) at the end of another
three years of schooling, the Malaysian Certificate of Education (SPM) at the
end of 11 years of schooling, and the Malaysian Higher School Certificate
Examination (STPM) or the Higher Malaysian Certificate for Religious
Education (STAM) at the end of 13 years of schooling (MoE 2004).

Malaysia Education Blueprint 2013-2025

In October 2011, the Ministry of Education launched a comprehensive
review of the education system in Malaysia in order to develop a new
National Education Blueprint. This decision was made in the context of
rising international education standards, the Government's aspiration of
better preparing Malaysia's children for the needs of the 21st century,
and increased public and parental expectations of education policy. Over
the course of 11 months, the Ministry drew on many sources of input, from
education experts at UNESCO, World Bank, OECD, and six local
universities, to principals, teachers, parents, and students from every
state in Malaysia. The result is a preliminary Blueprint that evaluates the
performance of Malaysia's education system against historical starting
points and international benchmarks. The Blueprint also offers a vision of
the education system and students that Malaysia both needs and
deserves, and suggests 11 strategic and operational shifts that would be
required to achieve that vision. The Ministry hopes that this effort will
inform the national discussion on how to fundamentally transform
Malaysia's education system, and will seek feedback from across the
community on this preliminary effort before finalising the Blueprint in
December 2012.
The Examined Curriculum
In public debate, the issue of teaching to the test has often translated
into debates over whether the UPSR, PMR, and SPM examinations
should be abolished. Summative national examinations should not in
themselves have any negative impact on students. The challenge is that
these examinations do not currently test the full range of skills that the
education system aspires to produce. An external review by Pearson
Education Group of the English examination papers at UPSR
and SPM level noted that these assessments would benefit from
the inclusion of more questions testing higher-order thinking skills,
such as application, analysis, synthesis and evaluation. For example,
their analysis of the 2010 and 2011 English Language UPSR papers
showed that approximately 70% of the questions tested basic skills of
knowledge and comprehension.
LP has started a series of reforms to ensure that, as per policy,
assessments are evaluating students holistically. In 2011, in parallel
with the KSSR, the LP rolled out the new PBS format that is intended
to be more holistic, robust, and aligned to the new standard-referenced
curriculum. There are four components to the new PBS:
School assessment refers to written tests that assess subject
learning. The test questions and marking schemes are developed,
administered, scored, and reported by school teachers based on
guidance from LP;
Central assessment refers to written tests, project work, or
oral tests (for languages) that assess subject learning. LP develops
the test questions and marking schemes. The tests are, however,
administered and marked by school teachers;
Psychometric assessment refers to aptitude tests and a
personality inventory to assess students' skills, interests, aptitude,
attitude and personality. Aptitude tests are used to assess students'
innate and acquired abilities, for example in thinking and problem
solving. The personality inventory is used to identify key traits and
characteristics that make up the students' personality. LP develops
these instruments and provides guidelines for use. Schools are,
however, not required to comply with these guidelines; and
Physical, sports, and co-curricular activities assessment
refers to assessments of student performance and participation
in physical and health education, sports, uniformed bodies, clubs,
and other non-school sponsored activities. Schools are given the
flexibility to determine how this component will be assessed.

The new format enables students to be assessed on a broader range of
output over a longer period of time. It also provides teachers with more
regular information to take the appropriate remedial actions for their
students. It is hoped that these changes will reduce the overall emphasis on
teaching to the test, so that teachers can focus more time on delivering
meaningful learning as stipulated in the curriculum.
In 2014, the PMR national examinations will be replaced with school
and centralised assessment. In 2016, a student's UPSR grade will no longer
be derived from a national examination alone, but from a combination of PBS
and the national examination. The format of the SPM remains the same, with
most subjects assessed through the national examination, and some subjects
through a combination of examinations and centralised assessments.


Cognitive Levels of Assessment

Bloom's Taxonomy of Cognitive Levels


Knowledge
Recalling memorized information. May involve remembering a wide range of
material from specific facts to complete theories, but all that is required is the
bringing to mind of the appropriate information. Represents the lowest level
of learning outcomes in the cognitive domain.
Learning objectives at this level: know common terms, know specific facts,
know methods and procedures, know basic concepts, know principles.
Question verbs: Define, list, state, identify, label, name, who? when? where?
Comprehension
The ability to grasp the meaning of material. Translating material from one
form to another (words to numbers), interpreting material (explaining or
summarizing), estimating future trends (predicting consequences or effects).
Goes one step beyond the simple remembering of material, and represents
the lowest level of understanding.
Learning objectives at this level: understand facts and principles, interpret
verbal material, interpret charts and graphs, translate verbal material to
mathematical formulae, estimate the future consequences implied in data,
justify methods and procedures.
Question verbs: Explain, predict, interpret, infer, summarize, convert,
translate, give example, account for, paraphrase x?
Application
The ability to use learned material in new and concrete situations. Applying
rules, methods, concepts, principles, laws, and theories. Learning outcomes
in this area require a higher level of understanding than those under
comprehension.
Learning objectives at this level: apply concepts and principles to new
situations, apply laws and theories to practical situations, solve mathematical
problems, construct graphs and charts, demonstrate the correct usage of a
method or procedure.
Question verbs: How could x be used to y? How would you show, make use
of, modify, demonstrate, solve, or apply x to conditions y?

Analysis
The ability to break down material into its component parts. Identifying parts,
analysis of relationships between parts, recognition of the organizational
principles involved. Learning outcomes here represent a higher intellectual
level than comprehension and application because they require an
understanding of both the content and the structural form of the material.
Learning objectives at this level: recognize unstated assumptions, recognize
logical fallacies in reasoning, distinguish between facts and inferences,
evaluate the relevancy of data, analyze the organizational structure of a work
(art, music, writing).
Question verbs: Differentiate, compare / contrast, distinguish x from y, how
does x affect or relate to y? why? how? What piece of x is missing / needed?

Synthesis
(By definition, synthesis cannot be assessed with multiple-choice questions.
It appears here to complete Bloom's taxonomy.)
The ability to put parts together to form a new whole. This may involve the
production of a unique communication (theme or speech), a plan of
operations (research proposal), or a set of abstract relations (scheme for
classifying information). Learning outcomes in this area stress creative
behaviors, with major emphasis on the formulation of new patterns or
structures.
Learning objectives at this level: write a well organized paper, give a well
organized speech, write a creative short story (or poem or music), propose a
plan for an experiment, integrate learning from different areas into a plan for
solving a problem, formulate a new scheme for classifying objects (or events,
or ideas).
Question verbs: Design, construct, develop, formulate, imagine, create,
change, write a short story and label the following elements:

Evaluation
The ability to judge the value of material (statement, novel, poem, research
report) for a given purpose. The judgments are to be based on definite
criteria, which may be internal (organization) or external (relevance to the
purpose). The student may determine the criteria or be given them. Learning
outcomes in this area are highest in the cognitive hierarchy because they
contain elements of all the other categories, plus conscious value judgments
based on clearly defined criteria.
Learning objectives at this level: judge the logical consistency of written
material, judge the adequacy with which conclusions are supported by data,
judge the value of a work (art, music, writing) by the use of internal criteria,
judge the value of a work (art, music, writing) by use of external standards of
excellence.
Question verbs: Justify, appraise, evaluate, judge x according to given
criteria. Which option would be better/preferable to party y?

School-based Assessment
The traditional system of assessment no longer satisfies the educational
and social needs of the third millennium. In the past few decades, many
countries have made profound reforms in their assessment systems.
Several educational systems have in turn introduced school-based
assessment as part of or instead of external assessment in their
certification. While examination bodies acknowledge the immense
potential of school-based assessment in terms of validity and flexibility,
at the same time they have to guard against or deal with difficulties
related to reliability, quality control and quality assurance. In the debate
on school-based assessment, the issue of "why" has been widely written
about and there is general agreement on the principles of validity of
this form of assessment.
Izard (2001) as well as Raivoce and Pongi (2001) explain that school-based
assessment (SBA) is often perceived as the process put in place
to collect evidence of what students have achieved, especially in
important learning outcomes that do not easily lend themselves to
pen and paper tests. Daugherty (1994) clarifies that this type of
assessment has been recommended because of the gains in
validity which can be expected when students' performance on
assessed tasks can be judged in a greater range of contexts and more
frequently than is possible within the constraints of time-limited, written
examinations. However, as Raivoce and Pongi (2001) suggest, the
validity of SBA depends to a large extent on the various assessment
tasks students are required to perform.
Burton (1992) provides the following five rules of thumb that may be
applied in the planning stage of school-based assessment:
1. The assessment should be appropriate to what is being assessed.
2. The assessment should enable the learner to demonstrate positive
achievement and reflect the learner's strengths.
3. The criteria for successful performance should be clear to all
concerned.
4. The assessment should be appropriate to all persons being assessed.
5. The style of assessment should blend with the learning pattern so it
contributes to it.
In the Malaysian SBA context, assessment for and of learning
Standard-referenced Assessment
Components of SBA/PBS

School Assessment (using Performance Standards)
Centralised Assessment
Physical Activities, Sports and Co-curricular Assessment (Pentaksiran Aktiviti Jasmani, Sukan dan Kokurikulum - PAJSK)
Psychometric/Psychological Tests

Centralised Assessment
Conducted and administered by teachers in schools using instruments,
rubrics, guidelines, timelines and procedures prepared by LP.
Monitoring and moderation are conducted by the PBS Committee at School,
District and State Education Department levels, and by LP.
School Assessment
The emphasis is on collecting first-hand information about pupils' learning
based on curriculum standards.
Teachers plan the assessment, prepare the instrument and administer the
assessment during the teaching and learning process.
Teachers mark pupils' responses and report their progress continuously.

Alternative Assessment

Alternative assessments are assessment procedures that differ from
the traditional notions and practice of tests with respect to format,
performance, or implementation. It is likely that alternative assessment
found its roots in writing assessment because of the need to provide
continuous assessment rather than a single impromptu evaluation
(Alderson & Banerjee, 2001).

As the term indicates, alternative assessments are assessment
proposals that present alternatives to the more traditional
examination formats. They have become more popular of late
because of some doubts raised regarding the ability of traditional
assessment to elicit a fair and accurate measure of a student's
performance. Alternative assessment brings with it a
complete set of perspectives that contrast with traditional tests and
assessments. Table 10.1 illustrates some of the major differences
between traditional and alternative assessments.

Table 10.1: Contrasting Traditional and Alternative Assessment

Source: Adapted from Bailey (1998: 207) and Puhl (1997: 5)

Traditional Assessment             Alternative Assessment
One-shot tests                     Continuous, longitudinal assessment
Indirect tests                     Direct tests
Inauthentic tests                  Authentic assessment
Individual projects                Group projects
No feedback to learners            Feedback provided to learners
Speeded exams                      Power exams
Decontextualised test tasks        Contextualised test tasks
Norm-referenced score reporting    Criterion-referenced score reporting
Standardised tests                 Classroom-based tests
Product of instruction             Process of instruction
Teacher proof                      Teacher mediated

In discussing alternative assessments, Herman et al. (1992: 6) list several of
their common characteristics. They describe alternative assessments as
performing the following:

Ask the students to perform, create, produce, or do something.
Tap higher-level thinking and problem-solving skills.
Use tasks that represent meaningful instructional activities.
Invoke real-world applications.
Have people, not machines, do the scoring, using human judgment.
Require new instructional and assessment roles for teachers.

Alternative assessments are suggested largely due to a growing concern
that traditional assessments are not able to accurately measure the ability
we are interested in. They are also seen to be more student-centred as they
cater for different learning styles, cultural and educational backgrounds as
well as language proficiencies.

Tannenbaum (1996) comments that alternative assessments focus on
documenting individual strengths and development, which assists in
the teaching and learning process.
Nevertheless, although alternative assessments are compatible with the
contemporary emphases on the process as well as product of learning
(Croker, 1999), several shortcomings of alternative assessments have been
noted. Perhaps one of the major limitations of alternative assessments is that
accounts of the benefits of alternative assessment tend to be descriptive
and persuasive, rather than research-based (Alderson & Banerjee, 2001:
229). Alternative assessments are also said to be limited to the classroom
and have not become part of mainstream assessment. Brown and Hudson, in
advocating alternative assessment, seem to have taken a safer approach by
suggesting the term "alternatives in assessment". They believe that
educators should be familiar with all possible formats of assessment and
decide on the format that best measures the ability or construct that they are
interested in. Hence, these alternatives would include all possible
assessment formats, both traditional and informal.
Despite these limitations, alternative assessments present a viable and
exciting option in eliciting and assessing students' actual abilities. There
are a number of test formats that are considered alternative assessment:

Physical demonstration
Pictorial products
Reading response logs
K-W-L (what I know/what I want to know/what I've learned) charts
Dialogue journals
Teacher-pupil conferences
Performance tasks
Self assessment
Peer assessment

A well-known and commonly used alternative assessment is the portfolio
assessment. The contents of the portfolio become evidence of abilities,
much as we would use a test to measure the abilities of our
students. Bailey (1998: 218) describes a portfolio as containing four primary
sections:

First, it should have an introduction to the portfolio itself,
which provides an overview of the content of the portfolio.
Bailey even suggests that this section include a reflective essay
by the student in order to help express the student's thoughts
and feelings about the portfolio, perhaps explaining strengths
and possible weaknesses as well as why certain pieces
are included in the portfolio.

Secondly, she argues that portfolios should have what
she refers to as an academic works section. This section is
meant to demonstrate the student's improvement or
achievement in the major skill areas (p. 218).

The third section is described as a personal section in
which students may wish to include their journals, score reports
of tests that they have sat for, as well as photographs and other
items that illustrate their experiences with, as well as
achievements in, the English language.

Finally, an assessment section may contain evaluations
made by peers and teachers as well as self evaluations.
Table 10.2: Contents of a Portfolio
Source: Adapted from Bailey (1998: 218)

Introductory Section      Reflective essay
Academic Works Section    Samples of best work
                          Samples of work demonstrating
Personal Section          Score reports
                          Personal items
Assessment Section        Evaluation by peers

The portfolio can be said to be a student's personal documentation that
helps demonstrate his or her ability and successes in the language. It
may even require students to consciously select items that can
document their own progress as learners. The actual compilation of the
content of the portfolio is in itself a learning experience. Some suggest
that students should attach a short reflection on each piece or item
placed in the portfolio. Portfolio assessment, therefore, is both a learning
and an assessment experience. This dual function can be considered
one of the benefits of portfolio assessment.
Brown and Hudson (1998) summarise several other advantages of
using portfolios in assessment. They discuss these advantages
according to how the portfolio strengthens students' learning, enhances
the teacher's role and improves the testing process. With respect to
testing, the advantages of using the portfolio as an assessment instrument
are listed as follows (pp. 664-665):

enhance student and teacher involvement in assessment;
provide opportunities for teachers to observe students using meaningful language to accomplish various authentic tasks in a variety of contexts and situations;
permit the assessment of the multiple dimensions of language learning;
provide opportunities for both students and teachers to work together and reflect on what it means to assess students' language growth;
increase the variety of information collected on students; and
make teachers' ways of assessing student work more systematic.


Self Assessment and Peer Assessment

Two other common forms of alternative assessment are the
self-assessment and peer-assessment procedures. Both these
forms of assessment are strongly advocated by Puhl (1997), as
she believes that they are essential to continuous assessment, a
cornerstone of alternative assessment. The benefits of self and
peer assessment are especially found in the formative stages of
assessment, in which the development of the students' abilities
is emphasised.

Self appraisals are also thought to be quite accurate and are
said to increase student motivation. Puhl (1997) describes a
case study in which she believes self-assessment forced the
students to reread their essays and thereby make the necessary edits and
corrections before they handed them in.
Nevertheless, in order for self assessment to be useful and not a
futile exercise, the learners need to be trained and initially
guided in performing their self assessment. This training
involves providing students with the rationale for self
assessment and explaining how it is intended to work and how it
can help them.

In language teaching and learning, self assessment is relevant in
assessing all the language skills. An example of the self
assessment of the listening skill, especially in the comprehension
of questions asked, is suggested by Cohen (1994) as follows:
Comprehension of questions asked:

I can always understand the questions with no difficulties and without having to ask for repetition
I can usually understand questions, but I might occasionally ask for repetition
I have difficulty with some questions, but I generally get the meaning
I have difficulty understanding most questions even after repetition
I don't understand questions well at all

These questions are useful in the formative stages of
assessment as they help students identify their own strengths
and weaknesses and respond accordingly. Through asking
these types of self assessment questions, the students are
expected to become more sensitive to their own learning and
ultimately perform better in the final summative evaluation at
the end of the instructional programme.

Peer assessment differs from self assessment in that it involves
the social and emotional dimensions to a much greater extent.
Peer assessment can be defined as a response in some form
to other learners' work (Puhl, 1997). It can be given by a group
or an individual and it can take any of a variety of coding
systems: the spoken word, the written word, checklists,
questionnaires, nonverbal symbols, numbers along a scale,
colours, etc. (p. 8). Peer assessment requires that a student
take up the role of a "critical friend" to another student in order
to support, challenge, and extend each other's learning
(Brooks, 2002: 73). Peer assessment is reported to:

remind learners they are not working in isolation;
help create a community of learners;
improve the product ("Two heads are better than one");
improve the process; motivate, even inspire;
help learners be reflective; and
stimulate meta-cognition.

In your opinion, what are the advantages of using portfolios as
a form of alternative assessment?

Allen, I. J. (2011). Repriviledging reading: The negotiation of
Pedagogy: Critical Approaches to Teaching
Literature, Language Composition, and Culture, 12 (1) pp. 97120.
Available at: 26, 2013)
Alderson, J. C. (1986b). Innovations in language testing? In M.
(Ed.), Innovations in language testing. pp. 93-105.
Windsor: NFER/Nelson.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test
and evaluation. Cambridge: Cambridge University
Anderson, L.W. (Ed.), Krathwohl, D.R. (Ed.), Airasian,P.W.,
Cruikshank, K.A.,
Mayer, R.E., Pintrich, P.R.,Raths, J., &
Wittrock, M.C. (2001). A
taxonomy for learning, teaching, and
assessing: A revision of Bloom's
Taxonomy of Educational
Objectives (Complete edition). New York: Longman.
Anderson, K. M., (2007). Differentiating instruction to include all
students. Preventing School Failure, 51 (3) pp. 49-54.
Bachman, L. F. (2004). Statistical Analyses for Language
Assessment. pp.
22-23. Cambridge, UK: Cambridge
University Press.
Biggs, J. B. and Collis, K. F. (1982).Evaluating the Quality of
Learning: the
SOLO taxonomy. New York, NY: Academic Press.
Biggs, J. B., & Collis, K .F. (1991) Multimodal learning and the
quality of intelligent behaviour. In: H. Rowe (Ed.) Intelligence:
Reconceptualization and measurement. Hillsdale, NJ: Lawrence
Erlbaum. pp. 57-75.
Biggs, J.B.& Tang, C. (2009). Applying constructive alignment to
outcomes- based teaching and learning. Training Material. Quality
Teaching for
Learning in Higher Education Workshop for
Master Trainers. Ministry
of Higher Education. Kuala Lumpur.
Black, P. & Wiliam, D. (2009). Developing the theory of formative
J. Gardiner, ed. Educational Assessment
Evaluation and Accountability, 1 (1), pp. 531.
Available at: (Retrieved 23 August
Bloom, B. S. (Ed.). Engelhart, M.D., Furst, E.J., Hill,W.H., &

Krathwohl, D.R. (1956). Taxonomy of educational objectives: The

classification of educational goals. Handbook 1: Cognitive
domain.New York: David
Bloom, B. S. (1956). Taxonomy of Educational Objectives,
Handbook I: The Cognitive Domain. New York: David McKay Co
Brennan, R. L. (1996). Generalizability of performance assessments.
In G.
W. Phillips (Ed.), Technical issues in large-scale
assessment (NCES 96-802) (pp. 19-58).
Washington, DC: National
Center for Education Statistics.
Brown, H. D., & Abeywickrama, P. (2010). Language Assessment:
Principles and Classroom Practices.New York, NY: Pearson
Brown, G., & Yule, G. (1983). Teaching the spoken language.
Cambridge: Cambridge

University Press.

Brown, H.D. (1994). Teaching by principles: An interactive

approach to language pedagogy. Englewood Cliffs, NJ: Prentice
Hall Regents.
Campbell, K. J., Watson, J. M., & Collis, K. F. (1992).Volume
measurement and intellectual development. Journal of Structural
Learning. 11, pp.
Carroll, J. B., & Sapon, S. M. (1958). Modern Language Aptitude
Test. New
York, NY: The Psychological Corporation.
Cheng, L. Watanabe, Y., & Curtis, A. (Eds.). (2004). Washback in
testing: Research contexts and methods. Mahwah,
NJ: Lawrence Erlbaum Associates.
Chick, H. (1998).Cognition in the Formal Modes: Research
mathematics and the SOLO taxonomy. Mathematics Education
Research Journal. 10 (2)
pp. 4-26.
Clark, J. (1979). Direct vs. semi-direct tests of speaking ability. In
Briere & F. Hinofotis (Eds.), Concepts in language
testing: Some recent studies (pp. 35-49). Washington,
Davidson, F., Hudson, T. & Lynch, B. (1985). Language testing:
Operationalization in classroom measurement and L2 research.

In M.
Celce-Murcia (Ed.). Beyond basics: Issues and research
in TESOL pp. 137-152. Rowley, MA: Newbury House.
Davidson, F., & Lynch, B. (2002). Testcraft: A teachers guide to
writing and using language test specifications. New Haven, CT:
Yale University Press.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T. and
McNamara, T. (1999). Dictionary of language testing.
Cambridge: University ofCambridge Local Examinations
Syndicate and Cambridge University Press.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn
(ed.). Educational Measurement. (3rd. ed.) pp.105-146. New
York, NY: Macmillan.
Gottlieb, M. (2006). Assessing English Language Learners:
Bridges from Language Proficiency to Academic Achievement.
USA: Corwin Press.
Grotjahn, R. (1986).Test validation and cognitive psychology:
Some methodological considerations.Language Testing
Hattie, J. (2009).Visible Learning. New York: Routledge.
Hattie, J. (2012) Visible Learning for Teachers: Maximizing Impact
Learning. Abingdon: Routledge
Hattie, J. & Brown, G. (2004) Cognitive processes in asTTle: The
SOLO taxonomy. University of Auckland/Ministry of Education.
asTTle Technical Report 43
Hook, P. & Mills, J. (2011) SOLO Taxonomy: A Guide for Schools
Book 1: A
common language of learning. Laughton, UK:
Essential Resources Educational Publishers.
Huang, S.C. (2012).English Teaching: Practice and Critique 11 (4),
Hughes, A. (2003). Testing for language teachers (2nd. Ed.).
MA: Cambridge University Press.
Gavin, B. et al. (2008). An introduction to educational assessment,
measurement and evaluation. (2nd ed.). Australia: Pearson
Education New Zealand.

McNamara, T. (2000). Language testing. Oxford, UK: Oxford

University Press.
Linn, R. L., & Gronlund, N. E. (2000). Measurement and
assessment in teaching. (8th ed.). Upper Saddle River, NJ:
Merrill/Prentice Hall.
Malaysia Education Blueprint 2013-2025.
McMillan, J. H. (2001a.). Classroom assessment: Principles and
practice for
effective instruction.(2nd ed.). Boston: MA: Allyn &
Messick, S. (1989). Validity. In R. Linn (Ed.) Educational
measurement. Pp.
13-103. New York, NY:: MacMillan.

Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S.,
Miller, J., &
Newton, D. (2005).Frameworks for Thinking: A
handbook for teaching
and learning. Cambridge: Cambridge
University Press.
Mousavi, S. A. (2009). An encyclopedic dictionary of language
testing (4th ed.)
Tehran: Rahnama Publications.
Norleha Ibrahim. (2009). Management of measurement and
Module. Selongor: Open University Malaysia.
Nckles, M., Hbner, S. & Renkl, A. (2009). Enhancing selfregulated learning
by writing learning protocols. Learning and
Instruction, 19(3), pp. 259 271. Available
(Retrieved March 26, 2013).
Oller, J. W. (1979). Language tests at school: A pragmatic
approach. London: Longman.
Pearson, I. (1988).Tests as levers for change. In D. Chamberlain
& R. Baumgardner (Eds.), ESP in the classroom: Practice and
evaluation (Vol. 128, 98-107). London: Modern
Pimsleur, P. (1966). Pimsleur Language Aptitude Battery. New
York, NY:
Harcourt, Brace & World.

Shepard, L. A. (2000). The role of assessment in a learning

culture. Paper
presented at the Annual Meeting of the
American Educational
Research Association.
(Retrieved 10.8.2013)
Smith, A. (2011) High Performers: The Secrets of Successful
Camarthen: Crown House Publishing.
Smith, T.W. & Colby, S.A. (2007). Teaching for Deep Learning.
The Clearing House. 80 (5) pp. 205211.
Spaan, M. (2006). Test and item specifications
development.Language Assessment Quarterly, 3, pp. 71-79.
Spratt, M. (2005). Washback and the classroom: The implications
for teaching and learning of studies of washback from exams.
Language Teaching Research, 19, 5-29.
Stansfield, C., & Reed, D. (2004). The story behind the Modern
Aptitude Test: An interview with John B. Carrol
(1916-2003). Language
Assessment Quarterly, 1, pp.43-56.
(Retrieved 9.8.2013) - (Retrieved 10.8.2013) - (Retrieved 12.8.2013)





M.A. TESL, University of North Texas, USA
B.A. (Hons) English, North Texas State University, USA
Graduate Teacher Training Certificate (Sijil Latihan Perguruan Guru Siswazah), Kementerian Pelajaran Malaysia
4 years as a secondary school teacher
21 years as a lecturer at IPG

M.Ed. TESL, Universiti Teknologi Malaysia
B.Ed. (Hons.) Agri. Science/TESL, Universiti Pertanian
23 years as a secondary school teacher
7 years as a lecturer at IPG